Audio Arena: Multi-Turn Speech-to-Speech Evaluation Framework

Audio Arena is a framework for evaluating speech-to-speech voice AI models through spoken audio. It includes 6 benchmarks spanning 221 total turns across different domains, testing knowledge retrieval, tool use, error recovery, adversarial attacks, long-range memory, state tracking, and numerical reasoning.

Built by Arcada Labs.

Leaderboard & results: audioarena.ai/leaderboard
Datasets on Hugging Face: arcada-labs
Source code: github.com/Design-Arena/audio-arena-bench

What makes this different from text benchmarks

Audio input: Each turn is a .wav file (TTS-generated or human-recorded), not text. Models must process speech, not read.
Continuous conversation: Turns form a single continuous conversation per benchmark. Later turns reference earlier ones. The model must track state, corrections, and prior answers across the full session.
Tool use over speech: Models have domain-specific functions they can call and must decide when and how to call them based on spoken instructions.
Adversarial and edge-case turns: Prompt injection, sycophancy traps, false presuppositions, false memory traps, distractor injection, and implicit corrections — all delivered via voice.

Benchmarks

Benchmark	Turns	Scenario	HF Dataset
`conversation_bench`	75	Conference assistant for AI Engineer World's Fair — session registration, schedule queries, 9 tool functions, ~946-line knowledge base	arcada-labs/conversation-bench
`appointment_bench`	25	Dental office appointment scheduling — two patients (Daniel/Danielle Nolan), two doctors, phone number swap+revert, 4 false memory traps, slot-taken error recovery	arcada-labs/appointment-bench
`assistant_bench`	31	Personal assistant — flight booking, email, calendar, reminders, dual requests in single turns, topic switching, late references, correction-chain recall	arcada-labs/assistant-bench
`event_bench`	29	Event planning — venue+catering+guest count changes, mid-sentence self-corrections, vague pronoun resolution, multi-request reversals, retroactive date changes	arcada-labs/event-bench
`grocery_bench`	30	Grocery ordering — chained corrections, homophone collisions, conditional additions/removals, order reconciliation, quantity math, swap operations	arcada-labs/grocery-bench
`product_bench`	31	Laptop comparison shopping — multi-intent turns, retroactive corrections, conditional arithmetic chains, discount stacking edge cases, 3-step order modification chains	arcada-labs/product-bench

Each benchmark is a self-contained Python package under benchmarks/ with:

config.py — Benchmark configuration (turns, tools, system instruction, HF repo)
turns.py — Turn definitions with golden answers
tools.py — Tool/function schema definitions
system.py — System prompt with knowledge base
data/knowledge_base.txt — Knowledge base content
audio/ — TTS audio, downloaded automatically from Hugging Face on first run (gitignored)
real_audio/ — Human-recorded audio per speaker, downloaded from HF when --real-audio is used (gitignored)

Quick Start

# Install dependencies
uv sync

# List available benchmarks
uv run audio-arena list-benchmarks

# Run a benchmark (audio files download automatically from HF on first run)
uv run audio-arena run appointment_bench --model claude-sonnet-4-5 --service anthropic

# Judge the results
uv run audio-arena judge runs/appointment_bench/<timestamp>_claude-sonnet-4-5

Installation

Requires Python 3.12+ and uv.

git clone https://github.com/Design-Arena/audio-arena-bench.git
cd audio-arena
uv sync

Audio files are hosted on Hugging Face and downloaded automatically when you first run a benchmark. To pre-download manually:

huggingface-cli download arcada-labs/appointment-bench --local-dir benchmarks/appointment_bench --include "audio/*.wav"

Environment Variables

Set the API keys for the services you want to use. You only need the keys for the models you plan to test.

# Judging key (Claude is the default judge for all benchmarks)
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...         # Required for OpenAI judge override and OpenAI models

# Model provider keys (set whichever you need)
export GOOGLE_API_KEY=...             # Google (Gemini text and Gemini Live)
export ULTRAVOX_API_KEY=...           # Ultravox
export XAI_API_KEY=...                # xAI (Grok Realtime)

# Zhipu AI (GLM Realtime)
export ZHIPU_API_KEY=...

# AWS Nova models (text and speech-to-speech)
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...          # Optional: for temporary credentials
export AWS_REGION=us-east-1           # Optional: defaults to us-east-1

You can also create a .env file in the project root with these variables.

CLI Commands

Running Benchmarks

# Basic usage with text model
uv run audio-arena run <benchmark> --model <model> --service <service>

# Examples:
uv run audio-arena run conversation_bench --model claude-sonnet-4-5 --service anthropic
uv run audio-arena run appointment_bench --model gpt-4o --service openai
uv run audio-arena run grocery_bench --model gemini-2.5-flash --service google

# Realtime audio models 
uv run audio-arena run conversation_bench --model gpt-realtime --service openai-realtime
uv run audio-arena run event_bench --model gemini-2.5-flash-native-audio-preview-12-2025 --service gemini-live
uv run audio-arena run product_bench --model ultravox-v0.7 --service ultravox-realtime

# Nova Sonic (no --service needed, pipeline creates its own LLM)
uv run audio-arena run conversation_bench --model amazon.nova-2-sonic-v1:0 --pipeline nova-sonic

# Grok (xAI) Realtime
uv run audio-arena run conversation_bench --model grok-realtime

# GLM Realtime (Zhipu AI)
uv run audio-arena run conversation_bench --model glm-realtime-flash
uv run audio-arena run appointment_bench --model glm-realtime-air

# Debug with limited turns
uv run audio-arena run appointment_bench --model gpt-4o --service openai --only-turns 0,1,2

# Rehydrated single-turn replay against prior golden context
uv run audio-arena run event_bench --model gpt-realtime-1.5 --service openai-realtime --rehydrate

# Rehydrated OpenAI Realtime with manual turn commits (VAD disabled)
uv run audio-arena run event_bench --model gpt-realtime-1.5 --service openai-realtime --rehydrate --disable-vad

# Verbose logging
uv run audio-arena run conversation_bench --model gpt-4o --service openai --verbose

--rehydrate runs each target turn in a fresh session with golden prior context. For OpenAI Realtime, prior turns are seeded into the live session with conversation.item.create; the benchmark does not use response.create.input to replay history. With --disable-vad, the target turn still uses live audio plus manual input_audio_buffer.commit and response.create events to close the user turn.

Judging Runs

After a benchmark run completes, judge the results:

# Judge a specific run (defaults to Claude judge)
uv run audio-arena judge runs/appointment_bench/20251213T123456_gpt-4o

# Judge with specific turns
uv run audio-arena judge runs/conversation_bench/20251213T123456_claude-sonnet-4-5 --only-turns 0,1,2

# Use a different judge model
uv run audio-arena judge runs/conversation_bench/20251213T123456_claude-sonnet-4-5 --judge-model claude-sonnet-4-5

# Use the OpenAI judge backend instead
uv run audio-arena judge runs/event_bench/20260310T011417_gpt-realtime-1.5_52ea9df5 --judge openai

All benchmarks default to the Claude judge (claude-opus-4-5). Use --judge openai to switch to the OpenAI judge (gpt-5.2).

The judge evaluates each turn on up to 5 dimensions:

Dimension	Scored on	Description
`tool_use_correct`	Turns whose spec defines a `required_function_call`	Did the model call the expected function with correct arguments?
`instruction_following`	Every turn	Did the model answer the question or advance the task?
`kb_grounding`	Every turn	Is the response factually consistent with the knowledge base?
`state_tracking`	Turns tagged with long-range memory, cancellation flow, or implicit correction	Does the model correctly track information from earlier turns?
`ambiguity_handling`	Turns tagged with ambiguous entity or compound ambiguity	Does the model correctly disambiguate entities and constraints?

tool_use_correct reports passes / applicable where the denominator is the number of turns the benchmark expects a tool call on. Non-tool turns are marked not-applicable in <judge>_judged.jsonl and excluded from both numerator and denominator; spurious tool calls on non-tool turns are surfaced through instruction_following / ambiguity_handling per the penalty-absorption rules below.

In plain language, these grader metrics mean:

tool_use_correct: The assistant used the right tool at the right time, with the right arguments, and did not skip a required tool call.
instruction_following: The assistant actually completed the user's request for that turn. This is about answering the question or advancing the task, not just sounding plausible.
kb_grounding: The assistant's factual claims are supported by the benchmark knowledge base or by tool results returned during the conversation.
state_tracking: The assistant stayed consistent with earlier turns, including prior registrations, cancellations, corrections, and other conversation state.
ambiguity_handling: The assistant handled under-specified or ambiguous requests correctly, such as asking for clarification when needed or resolving the ambiguity from available context.

For speech-to-speech runs, a 6th dimension is added automatically when a conversation.wav file is present:

Dimension	Description
`turn_taking`	Audio timing correctness — detects missing responses, negative TTFB, empty responses (control tokens only), alignment drift, and audio overlap

When turn-taking failures occur, the judge is more lenient on instruction_following since garbled audio may cause transcription issues.

# Judge a speech-to-speech run (turn-taking analysis runs automatically)
uv run audio-arena judge runs/conversation_bench/20260111T123456_gpt-realtime_abc123

# Skip turn-taking analysis
uv run audio-arena judge runs/conversation_bench/20260111T123456_gpt-realtime_abc123 --skip-turn-taking

Judge outputs (saved to the run directory):

<judge>_summary.json — score metrics (includes turn_taking_failures for S2S runs)
<judge>_analysis.md — human-readable report with failures
<judge>_judged.jsonl — per-turn judgments with reasoning

For example, Claude judging writes claude_summary.json, claude_analysis.md, and claude_judged.jsonl.

See the Methodology section for details on two-phase evaluation, penalty absorption, and category-aware scoring.

Real Audio (Human-Recorded)

Benchmarks support real (human-recorded) audio alongside the default TTS-generated audio. Each speaker's recordings live in a subdirectory under real_audio/:

benchmarks/appointment_bench/real_audio/
├── person1/
│   ├── turn_000.wav
│   ├── turn_001.wav
│   └── ...
└── person2/
    └── ...

Real audio is hosted on the same HF dataset repos and downloaded automatically when needed.

# Run with a specific speaker's real audio
uv run audio-arena run appointment_bench --model gpt-realtime --real-audio person1

# Run all available speakers sequentially
uv run audio-arena run appointment_bench --model gpt-realtime --real-audio all

# List available speakers for a benchmark
uv run audio-arena list-speakers appointment_bench

To pre-download real audio manually:

huggingface-cli download arcada-labs/appointment-bench --local-dir benchmarks/appointment_bench --include "real_audio/**/*.wav"

Listing Options

# List available benchmarks
uv run audio-arena list-benchmarks

# List available speakers for a benchmark
uv run audio-arena list-speakers appointment_bench

# List available pipelines
uv run audio-arena list-pipelines

# List service aliases
uv run audio-arena list-aliases

Service Aliases

For convenience, common service classes have short aliases:

Alias	Provider
`openai`	OpenAI (text models)
`openai-realtime`	OpenAI Realtime (speech-to-speech)
`anthropic`	Anthropic (Claude models)
`google`	Google (Gemini text models)
`gemini-live`	Google Gemini Live (speech-to-speech)
`bedrock`	AWS Bedrock (Nova text models)
`ultravox-realtime`	Ultravox (speech-to-speech)

Additional providers (OpenRouter, Groq, Cerebras) are also supported — run uv run audio-arena list-aliases to see all options.

You can also use fully-qualified class names:

uv run audio-arena run conversation_bench \
    --model gpt-4o \
    --service pipecat.services.openai.llm.OpenAILLMService

Pipelines

Pipeline	Use Case	Auto-Detection Pattern
`text`	Synchronous text LLMs	Default for all models
`realtime`	OpenAI Realtime, Gemini Live, Ultravox	`realtime`, `native-audio`, `live`, `ultravox`
`grok-realtime`	xAI Grok Realtime	`grokrealtime`
`glm-realtime`	Zhipu AI GLM Realtime	`glmrealtime`
`nova-sonic`	AWS Nova Sonic	`nova-sonic`, `nova_sonic`

Output Structure

Runs are saved to runs/<benchmark>/<timestamp>_<model>/ (gitignored):

runs/
└── appointment_bench/
    └── 20251213T123456_gpt-4o/
        ├── transcript.jsonl        # Turn-by-turn results
        ├── runtime.json            # Run metadata and metrics
        ├── run.log                 # Debug logs
        ├── claude_summary.json     # Judge summary (after judging)
        ├── claude_judged.jsonl     # Per-turn judgments (after judging)
        └── claude_analysis.md      # Human-readable analysis (after judging)

Project Structure

audio-arena/
├── src/audio_arena/               # Main package
│   ├── cli.py                     # CLI entry point
│   ├── pipelines/                 # Pipeline implementations
│   │   ├── base.py                # Abstract base pipeline
│   │   ├── text.py                # Text pipeline
│   │   ├── realtime.py            # Shared realtime pipeline orchestration
│   │   ├── openai_realtime.py     # OpenAI Realtime explicit-tool-result service
│   │   ├── grok_realtime.py       # Grok Realtime pipeline
│   │   ├── glm_realtime.py        # Zhipu AI GLM Realtime pipeline
│   │   └── nova_sonic.py          # Nova Sonic pipeline
│   ├── processors/                # Frame processors
│   ├── transports/                # Input/output transports
│   ├── recording/                 # Transcript recording
│   └── judging/                   # Judge implementations
│       ├── llm_judge.py           # Claude judge
│       └── openai_judge.py        # OpenAI judge
│
├── benchmarks/                    # Benchmark packages
│   ├── conversation_bench/        # 75-turn conference assistant
│   ├── appointment_bench/         # 25-turn dental appointment scheduling
│   ├── assistant_bench/           # 31-turn personal assistant
│   ├── event_bench/               # 29-turn event planning
│   ├── grocery_bench/             # 30-turn grocery ordering
│   └── product_bench/             # 31-turn laptop comparison
│       ├── config.py              # HF repo, description, turn count
│       ├── turns.py               # Turn definitions with golden answers
│       ├── tools.py               # Tool/function schemas
│       ├── system.py              # System prompt + knowledge base
│       ├── data/knowledge_base.txt
│       ├── audio/                 # TTS audio (gitignored, downloaded from HF)
│       └── real_audio/            # Human-recorded audio (gitignored, downloaded from HF)
│
├── scripts/
│   ├── run_all_benchmarks.py      # Run all S2S models on all benchmarks
│   ├── analyze_turn_metrics.py    # Per-turn timing analysis
│   ├── compare_model_runs.py      # Multi-model comparison with CSV + plots
│   ├── build_experiment_review.py # Self-contained HTML review for a judged run
│   ├── generate_audio.py          # TTS WAV generation for benchmark turns
│   └── ...                        # Additional analysis and batch scripts
│
├── runs/                          # Output directory (gitignored)
├── LICENSE
└── pyproject.toml

Audio files are stored on Hugging Face, not in this repo. They are downloaded automatically on first run.

Comprehensive Turn Metrics Analysis

For detailed per-turn timing analysis of speech-to-speech models, use the comprehensive metrics script:

# Analyze a run with summary statistics
uv run python scripts/analyze_turn_metrics.py runs/conversation_bench/<timestamp>_<model>

# Show per-turn breakdown table
uv run python scripts/analyze_turn_metrics.py runs/conversation_bench/<timestamp>_<model> -v

# Output as JSON (for programmatic use)
uv run python scripts/analyze_turn_metrics.py runs/conversation_bench/<timestamp>_<model> --json

Metrics Explained

The script consolidates timing data from multiple sources and calculates the following metrics:

Metric	Description	Calculation
Server TTFB	Time from request to first byte from model	Read from `transcript.jsonl` (reported by Pipecat)
Pipeline TTFB	Time from user speech end to bot audio tag	`bot_tag_log_ms - user_end_ms` (Silero VAD)
WAV V2V	Voice-to-voice latency measured from audio	`bot_silero_start_ms - user_end_ms` (Silero VAD)
Silent Pad (RMS)	Silent padding before speech (RMS detection)	`bot_rms_onset_ms - bot_tag_log_ms`
Silent Pad (VAD)	Silent padding before speech (Silero VAD)	`bot_silero_start_ms - bot_tag_wav_ms`
Tag Alignment	Drift between log position and WAV detection	`bot_tag_log_ms - bot_tag_wav_ms`

Key metric relationships:

WAV V2V = Pipeline TTFB + Silent Pad (VAD) - The total voice-to-voice latency includes both the time waiting for audio to arrive and any initial silence in the audio stream
Pipeline TTFB measures when audio starts arriving at the pipeline
Silent Pad measures how much silence is at the beginning of the audio (most models send 40-120ms of silence before speech)

Alignment Sanity Check

The script verifies that log-based timestamps match actual audio positions by detecting audio tags (2kHz tones) embedded in the WAV file:

Bot tags: Inserted when bot audio arrives at the pipeline
Alignment OK: Log and WAV positions match within ±20ms tolerance
Issues detected: Missing tags, extra tags, or drift outside tolerance

Output Files

When run with --json, the script outputs structured data that can be saved:

# Save metrics to JSON file
uv run python scripts/analyze_turn_metrics.py runs/conversation_bench/<timestamp>_<model> --json > turn_metrics.json

Methodology

The methodology below describes the scoring rubric used across all benchmarks. The detailed history section is specific to conversation_bench, which was the first and largest benchmark in the suite.

Scoring Rubric

Category-aware dimensions. Core dimensions (tool use, instruction following, KB grounding) are scored on every turn. state_tracking and ambiguity_handling are scored only on turns tagged with the relevant categories, so models are never penalized on out-of-scope dimensions.

Two-phase evaluation. An initial turn-by-turn pass is followed by a realignment pass that detects early or late function calls and cascading effects. If a required call was made a turn early, later turns are not penalized for the "missing" call; if made late, the turn where it actually happened gets credit.

Penalty absorption. When a missed tool call has a more specific root cause, the penalty lands on that dimension instead of tool_use_correct — e.g., unnecessary clarification penalizes ambiguity_handling, forgotten state penalizes state_tracking. This avoids double-penalizing while ensuring every failure is counted exactly once.

Strict dimension separation. Failing to call a tool is scored only under tool_use_correct (or absorbed by a more specific dimension). instruction_following fails only when the assistant's words and actions contradict each other in a non-tool sense.

Turn-taking leniency. For speech-to-speech runs, a turn_taking dimension captures audio timing issues (overlaps, interruptions, missing responses). When turn-taking fails, the judge is more lenient on instruction_following to account for transcription artifacts.

Each benchmark is static: the same user inputs (and corresponding audio) are used for every run, with golden expectations defined in each benchmark's turns.py, so results are comparable across models and runs.

ConversationBench history

ConversationBench builds on the original 30-turn multi-turn evaluation created by Kwindla Hultman Kramer at Daily (blog post). That benchmark tested both text and speech-to-speech models on tool use, instruction following, and knowledge base grounding in an AI Engineer World's Fair conference assistant scenario. It used a Pipecat-based evaluation pipeline to drive multi-turn conversations against models from OpenAI, Google, Anthropic, and others, with Claude as an automated judge.

The original 30-turn benchmark was an important proof of concept — it demonstrated that multi-turn conversation evaluation over audio was both feasible and revealing. However, during development of ConversationBench we found that 30 turns were not sufficiently challenging: most frontier models scored above 90% on nearly every category, making it difficult to differentiate between models or identify meaningful failure modes.

We replaced the majority of the original turns and rebuilt the benchmark from scratch as a 75-turn static hard benchmark. Only a small number of basic QA and tool-use turns from the original were retained, and even those were revised. The remaining 5 benchmarks (appointment_bench, assistant_bench, event_bench, grocery_bench, product_bench) were built from scratch to test different domains and failure modes.

Acknowledgments

Audio Arena is built on Pipecat, the open-source framework for voice and multimodal AI. The original 30-turn evaluation was created by Kwindla Hultman Kramer at Daily — see the original blog post and repo.

Judging is powered by Claude via the Claude Agent SDK. Voice activity detection uses Silero VAD.

Citation

If you use these benchmarks, please cite:

@misc{audioarena2026,
  title={Audio Arena: Multi-Turn Speech-to-Speech Evaluation Benchmarks},
  author={Arcada Labs},
  year={2026},
  url={https://audioarena.ai}
}

License

MIT — see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
.agents/skills		.agents/skills
benchmarks		benchmarks
scripts		scripts
src/audio_arena		src/audio_arena
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Audio Arena: Multi-Turn Speech-to-Speech Evaluation Framework

What makes this different from text benchmarks

Benchmarks

Quick Start

Installation

Environment Variables

CLI Commands

Running Benchmarks

Judging Runs

Real Audio (Human-Recorded)

Listing Options

Service Aliases

Pipelines

Output Structure

Project Structure

Comprehensive Turn Metrics Analysis

Metrics Explained

Alignment Sanity Check

Output Files

Methodology

Scoring Rubric

ConversationBench history

Acknowledgments

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Audio Arena: Multi-Turn Speech-to-Speech Evaluation Framework

What makes this different from text benchmarks

Benchmarks

Quick Start

Installation

Environment Variables

CLI Commands

Running Benchmarks

Judging Runs

Real Audio (Human-Recorded)

Listing Options

Service Aliases

Pipelines

Output Structure

Project Structure

Comprehensive Turn Metrics Analysis

Metrics Explained

Alignment Sanity Check

Output Files

Methodology

Scoring Rubric

ConversationBench history

Acknowledgments

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages