# TTS Model Evaluation - Stakeholder Demo

This notebook provides an interactive way to compare TTS models for Kira.

**Models Evaluated:**
- Azure TTS (Current baseline)
- ElevenLabs (Premium quality)
- MiniMax (Best value for Chinese)
- Qwen3-TTS (Open-source via DashScope)
- LuxTTS (Local CPU, English only)

In [None]:
# Setup - Run this cell first
import json
from pathlib import Path
from IPython.display import Audio, display, HTML
import pandas as pd

# Load results
results_path = Path("../outputs/metrics/benchmark_results.json")
if results_path.exists():
    with open(results_path, encoding="utf-8") as f:
        results = json.load(f)
    print("Results loaded successfully!")
else:
    print("No results found. Please run the evaluation first:")
    print("  python scripts/run_evaluation.py")

## 1. Quick Comparison

Overview of all providers across key metrics:

In [None]:
# Cost comparison
cost_data = {
    "Provider": ["Azure TTS", "ElevenLabs", "MiniMax", "Qwen3-TTS", "LuxTTS"],
    "100K chars/mo": ["$1.60", "$16.50", "$5.00", "$1.00", "$0.00"],
    "500K chars/mo": ["$8.00", "$82.50", "$25.00", "$5.00", "$0.00"],
    "1M chars/mo": ["$16.00", "$165.00", "$50.00", "$10.00", "$0.00"],
    "Notes": [
        "Current baseline, 140+ languages",
        "Highest quality, premium cost",
        "Excellent for Chinese, good value",
        "Open-source, DashScope API",
        "Local CPU only, English only"
    ]
}

df_cost = pd.DataFrame(cost_data)
display(HTML("<h3>Monthly Cost Comparison</h3>"))
display(df_cost.style.set_properties(**{'text-align': 'left'}))
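
The figures in the cost table scale linearly with volume (each provider's 1M price is ten times its 100K price), so they can be reproduced from a per-million-character rate. The sketch below assumes flat per-character pricing with no tiers, free quotas, or subscription minimums, which may not match the providers' actual billing. The names `RATE_PER_MILLION_USD`, `monthly_cost`, and `cost_ratio` are illustrative and not part of the evaluation code.

```python
# Per-million-character rates taken from the "1M chars/mo" column above.
# Assumption: pricing is flat per character (no tiers or free quotas).
RATE_PER_MILLION_USD = {
    "Azure TTS": 16.00,
    "ElevenLabs": 165.00,
    "MiniMax": 50.00,
    "Qwen3-TTS": 10.00,
    "LuxTTS": 0.00,
}


def monthly_cost(provider: str, chars_per_month: int) -> float:
    """Estimate the monthly cost in USD for a given character volume."""
    rate = RATE_PER_MILLION_USD[provider]
    return round(rate * chars_per_month / 1_000_000, 2)


def cost_ratio(provider: str, baseline: str = "Azure TTS") -> float:
    """Return how many times more expensive `provider` is than `baseline`."""
    return RATE_PER_MILLION_USD[provider] / RATE_PER_MILLION_USD[baseline]


# Estimate a volume that is not in the table.
for name in RATE_PER_MILLION_USD:
    print(f"{name}: ${monthly_cost(name, 250_000):.2f} at 250K chars/mo")
```

Comparing against more than one baseline, for example `cost_ratio("ElevenLabs", "MiniMax")` as well as `cost_ratio("ElevenLabs")`, makes it clear which reference point a multiplier such as "3x" refers to.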

In [None]:
# Performance metrics from actual benchmark
# Skip this cell if the setup cell did not load any results
if globals().get("results"):
    perf_data = []
    for name, data in results.get("providers", {}).items():
        perf_data.append({
            "Provider": name,
            "Avg Latency (ms)": f"{data.get('total_latency_mean_ms', 0):.0f}",
            "Realtime Factor": f"{data.get('avg_realtime_factor', 0):.1f}x",
            "Languages Tested": ", ".join(data.get('languages_tested', []))
        })
    
    df_perf = pd.DataFrame(perf_data)
    display(HTML("<h3>Performance Metrics</h3>"))
    display(df_perf)
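
The notebook does not define "Realtime Factor". One common convention, assumed here, is audio duration divided by synthesis time, so a value above 1.0x means speech is generated faster than it plays back. The benchmark script may use a different definition (some tools report the inverse), so treat this as a sketch for interpreting the column rather than a description of how `avg_realtime_factor` is actually computed. The helper names are hypothetical.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Return the playback duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def realtime_factor(audio_seconds: float, latency_ms: float) -> float:
    """Audio duration divided by synthesis time (assumed definition)."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return audio_seconds / (latency_ms / 1000.0)


# Example: 3 seconds of audio synthesized in 1500 ms
print(f"{realtime_factor(3.0, 1500.0):.1f}x")
```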

## 2. Audio Comparison

Listen to the same text spoken by different TTS providers:

In [None]:
def play_sample(sample_id: str):
    """Play audio samples from all providers for a given sample ID."""
    audio_base = Path("../outputs/audio")
    
    providers = {
        "Azure TTS": "azure_tts",
        "ElevenLabs": "elevenlabs",
        "MiniMax": "minimax",
        "Qwen3-TTS": "qwen3tts",
        "LuxTTS": "luxtts",
    }
    
    print(f"\n{'='*50}")
    print(f"Sample: {sample_id}")
    print(f"{'='*50}\n")
    
    for display_name, folder_name in providers.items():
        audio_path = audio_base / folder_name / f"{sample_id}.wav"
        if audio_path.exists():
            print(f"\n{display_name}:")
            # Pass the path as filename= so it is never mistaken for raw audio data
            display(Audio(filename=str(audio_path)))
        else:
            print(f"\n{display_name}: No audio file found")
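
`play_sample` needs a sample ID that exists on disk. To find out which IDs are available, a small helper can scan the same `../outputs/audio/<provider_folder>/<sample_id>.wav` layout that `play_sample` reads. This is a minimal sketch; `list_available_samples` is a hypothetical name, not part of the project.

```python
from collections import defaultdict
from pathlib import Path


def list_available_samples(audio_base: Path) -> dict[str, list[str]]:
    """Map each sample ID to the provider folders that contain a WAV for it."""
    if not audio_base.is_dir():
        return {}
    found = defaultdict(list)
    # Sorting the paths keeps the provider folders in alphabetical order.
    for wav_path in sorted(audio_base.glob("*/*.wav")):
        found[wav_path.stem].append(wav_path.parent.name)
    return dict(found)


for sample_id, folders in list_available_samples(Path("../outputs/audio")).items():
    print(f"{sample_id}: {', '.join(folders)}")
```

Any ID printed here can be passed to `play_sample`; providers missing from a sample's list will show "No audio file found".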

In [None]:
# English conversational sample
play_sample("en_conversational_01")

In [None]:
# Chinese conversational sample
play_sample("zh_conversational_01")

In [None]:
# English with numbers/dates
play_sample("en_technical_01")

## 3. Recommendations

Based on our evaluation:

In [None]:
recommendations = """
<div style="background: #e3f2fd; padding: 20px; border-radius: 10px; margin: 10px 0;">
<h3>Key Findings</h3>
<ul>
<li><strong>Best Overall Value:</strong> MiniMax - excellent quality, especially for Chinese, at under a third of ElevenLabs' cost</li>
<li><strong>Highest Quality:</strong> ElevenLabs - most natural sounding, but about 3.3x the cost of MiniMax and roughly 10x the Azure baseline</li>
<li><strong>Best for Chinese:</strong> MiniMax and Qwen3-TTS both excel at Mandarin</li>
<li><strong>Lowest Cost:</strong> LuxTTS for local/offline use (English only)</li>
<li><strong>Current Baseline:</strong> Azure TTS remains solid for broad language coverage</li>
</ul>
</div>

<div style="background: #fff3e0; padding: 20px; border-radius: 10px; margin: 10px 0;">
<h3>Suggested Strategy</h3>
<p>Consider a <strong>hybrid approach</strong>:</p>
<ul>
<li>Use MiniMax or Qwen3-TTS for standard interactions (cost-effective)</li>
<li>Use ElevenLabs for critical customer-facing moments (premium quality)</li>
<li>Use LuxTTS for English-only offline/edge scenarios where latency is critical</li>
</ul>
</div>
"""

display(HTML(recommendations))
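
The hybrid approach in the Suggested Strategy box could be expressed as a small routing function. The sketch below is illustrative only: `choose_provider` and its parameters are hypothetical, and it arbitrarily picks MiniMax for standard traffic even though the strategy lists Qwen3-TTS as an equally valid option.

```python
def choose_provider(language: str, critical: bool = False, offline: bool = False) -> str:
    """Pick a provider following the hybrid strategy (illustrative rules only)."""
    if offline:
        # LuxTTS is the only local option in this evaluation and is English only.
        if language != "en":
            raise ValueError(f"No offline provider available for language {language!r}")
        return "LuxTTS"
    if critical:
        # Premium quality for critical customer-facing moments.
        return "ElevenLabs"
    # Standard interactions: MiniMax chosen here; Qwen3-TTS is the alternative.
    return "MiniMax"


print(choose_provider("zh"))
print(choose_provider("en", critical=True))
print(choose_provider("en", offline=True))
```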

## 4. Interactive Testing

Try generating speech with any provider (requires API keys):

In [None]:
# Optional: Interactive testing (uncomment to use)
# from dotenv import load_dotenv
# import sys
# sys.path.insert(0, "..")
# load_dotenv("../.env")

# from src.providers import AzureTTSProvider

# provider = AzureTTSProvider()
# provider.initialize()

# result = provider.generate("Hello! This is a test of the Azure TTS provider.")
# display(Audio(result.audio_data, rate=result.sample_rate))
# print(f"Latency: {result.latency_ms:.0f}ms")