# 03 — Zero-Shot Whisper on Dothraki

This notebook explores how OpenAI's Whisper model behaves when given speech in
a language it has **never seen**: Dothraki, a constructed language from Game of Thrones.

Whisper was trained on roughly 680,000 hours of labeled audio, most of it English, and supports 99 languages.
Dothraki is not one of them. What happens when we feed it Dothraki speech?

## Contents
1. [Language Detection](#1-language-detection): What language does Whisper *think* it's hearing?
2. [Model Size Comparison](#2-model-size-comparison): How do the tiny and small models differ?
3. [Transcription Analysis](#3-transcription-analysis): What text does Whisper output?
4. [Real vs Synthetic Audio](#4-real-vs-synthetic-audio): How does the audio source affect results?
5. [Key Findings](#5-key-findings)

In [None]:
import json
import sys
from pathlib import Path
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np

PROJECT_ROOT = Path.cwd().parent
sys.path.insert(0, str(PROJECT_ROOT))
RESULTS_DIR = PROJECT_ROOT / 'data' / 'results'

plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = (12, 5)
plt.rcParams['font.size'] = 12

# Load evaluation results
eval_tiny = json.loads((RESULTS_DIR / 'batch_eval_tiny.json').read_text())
eval_small = json.loads((RESULTS_DIR / 'batch_eval_small.json').read_text())

print(f"Loaded: whisper-tiny ({eval_tiny['num_clips']} clips), whisper-small ({eval_small['num_clips']} clips)")

---
## 1. Language Detection

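Whisper's first step is language identification. Since Dothraki isn't in its training data,
it has to assign the clip to one of the languages it does know. The result gives a rough
indication of which real languages Dothraki resembles acoustically, although it also reflects
how heavily each language is represented in Whisper's training data.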
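
To make the forced choice concrete, here is a minimal sketch of a classifier that, like Whisper's language detector, picks from a fixed label set and has no "unknown" option. The language codes and scores below are invented for illustration and are not taken from the evaluation results; `softmax` and `detect_language` are hypothetical helper names, not Whisper's API.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(score - peak) for lang, score in logits.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}

def detect_language(logits):
    """Return the most probable known language and its probability."""
    probs = softmax(logits)
    best = max(probs, key=probs.get)
    return best, probs[best]

# Hypothetical scores for one clip; the numbers are made up for illustration.
example_logits = {"en": 2.0, "cy": 1.5, "ko": 1.2, "th": 1.0}
lang, prob = detect_language(example_logits)
print(lang, round(prob, 2))
```

Even when no candidate scores well, the highest-scoring label still wins, which is why a low-probability "English" and a high-probability "English" look identical in a plain count of detected languages.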

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

for ax, data, title in [
    (ax1, eval_tiny, 'whisper-tiny'),
    (ax2, eval_small, 'whisper-small'),
]:
    lang_dist = data['stats']['language_distribution']
    # Take the 8 most frequent languages (sorted by count), group the rest as 'other'
    top_langs = dict(Counter(lang_dist).most_common(8))
    other = sum(v for k, v in lang_dist.items() if k not in top_langs)
    if other > 0:
        top_langs['other'] = other

    labels = list(top_langs.keys())
    values = list(top_langs.values())
    total = sum(values)

    colors = plt.cm.Set2(np.linspace(0, 1, len(labels)))
    wedges, texts, autotexts = ax.pie(
        values, labels=labels, autopct=lambda p: f'{p:.0f}%' if p > 3 else '',
        colors=colors, textprops={'fontsize': 11}
    )
    ax.set_title(f'{title}\n({total} clips)', fontsize=14)

fig.suptitle('What Language Does Whisper Think Dothraki Is?', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Side-by-side comparison table
all_langs = set()
for data in [eval_tiny, eval_small]:
    all_langs.update(data['stats']['language_distribution'].keys())

print(f"{'Language':<10} {'tiny':>8} {'small':>8}")
print('-' * 30)
for lang in sorted(all_langs):
    tiny_count = eval_tiny['stats']['language_distribution'].get(lang, 0)
    small_count = eval_small['stats']['language_distribution'].get(lang, 0)
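    # Skip languages detected only once across both models to keep the table short
    if tiny_count + small_count > 1:
        print(f"{lang:<10} {tiny_count:>8} {small_count:>8}")

# Report the English share from the loaded stats rather than hard-coding it
print("\nShare of clips detected as English:")
for name, data in [('tiny', eval_tiny), ('small', eval_small)]:
    stats = data['stats']
    en_pct = stats['language_distribution'].get('en', 0) / stats['successful'] * 100
    print(f"  whisper-{name}: {en_pct:.0f}%")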

---
## 2. Model Size Comparison

This section compares how two Whisper model sizes, tiny and small, handle zero-shot Dothraki.
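
The metrics in this section are read from a precomputed `stats` block. As a rough sketch of how such a block could be derived from per-clip records, the function below recomputes a few of them. The field names `whisper_text`, `whisper_lang`, and `error` come from the results used in this notebook; the `summarize` helper, the toy records, and the choice to divide by successful clips are assumptions, and the actual evaluation script may define these rates differently.

```python
from collections import Counter

def summarize(results):
    """Recompute basic stats from per-clip records (illustrative only)."""
    ok = [r for r in results if "error" not in r]
    texts = [(r.get("whisper_text") or "").strip() for r in ok]
    langs = Counter(r.get("whisper_lang", "unknown") for r in ok)
    n = len(ok)
    return {
        "successful": n,
        # Assumption: rates are relative to successful clips, not all clips
        "empty_rate": sum(1 for t in texts if not t) / n if n else 0.0,
        "avg_whisper_output_length": sum(len(t) for t in texts) / n if n else 0.0,
        "language_distribution": dict(langs.most_common()),
    }

# Toy records, invented for illustration
toy_results = [
    {"whisper_text": "hello there", "whisper_lang": "en"},
    {"whisper_text": "  ", "whisper_lang": "en"},
    {"whisper_text": "sawadee", "whisper_lang": "th"},
    {"error": "decode failed"},
]
toy_stats = summarize(toy_results)
print(toy_stats)
```

Recomputing the stats this way is also a cheap consistency check against the values stored in the JSON files.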

In [None]:
# Comparison metrics
models = ['tiny', 'small']
evals = [eval_tiny, eval_small]

metrics = {
    'Empty Rate (%)': [],
    'Avg Output Length': [],
    'Translation Coverage (%)': [],
    'Top Language %': [],
}

for data in evals:
    s = data['stats']
    metrics['Empty Rate (%)'].append(s['empty_rate'] * 100)
    metrics['Avg Output Length'].append(s['avg_whisper_output_length'])
    metrics['Translation Coverage (%)'].append(s['translation_coverage_rate'] * 100)
    top_lang_pct = max(s['language_distribution'].values()) / s['successful'] * 100
    metrics['Top Language %'].append(top_lang_pct)

fig, axes = plt.subplots(1, 4, figsize=(18, 5))
colors = ['#4ecdc4', '#ff6b6b']

for ax, (metric_name, values) in zip(axes, metrics.items()):
    bars = ax.bar(models, values, color=colors, edgecolor='#1a1a2e', width=0.5)
    ax.set_title(metric_name, fontsize=12)
    ax.set_ylabel(metric_name.split('(')[0].strip())
    for bar, val in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                f'{val:.1f}', ha='center', fontsize=11)

fig.suptitle('Whisper Model Size Comparison on Dothraki', fontsize=15, y=1.02)
plt.tight_layout()
plt.show()

---
## 3. Transcription Analysis

Let's examine what Whisper actually outputs when it hears Dothraki.

In [None]:
# Show interesting examples from the small model
results = eval_small['results']
# Keep successful clips with non-blank output; `or ''` guards against a null whisper_text
nonempty = [r for r in results if 'error' not in r and (r.get('whisper_text') or '').strip()]

print(f'Non-empty transcriptions: {len(nonempty)}/{len(results)}')
print(f'\n{"="*80}')
print('SAMPLE TRANSCRIPTIONS (whisper-small)')
print(f'{"="*80}\n')

for r in nonempty[:10]:
    print(f'Ground Truth (Dothraki): {r["gt_dothraki"]}')
    print(f'Whisper Output ({r["whisper_lang"]}):  {r["whisper_text"][:100]}')
    print(f'Ground Truth (English):  {r["gt_english"]}')
    # The translation may be missing for clips the pipeline could not match
    print(f'Pipeline Translation:    {(r.get("pipeline_translation") or "")[:100]}')
    print()

In [None]:
# Whisper output length vs ground truth length
whisper_lens = [len(r.get('whisper_text', '')) for r in results if 'error' not in r]
gt_lens = [len(r['gt_dothraki']) for r in results if 'error' not in r]

fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(gt_lens, whisper_lens, alpha=0.4, c='#4ecdc4', s=30, edgecolors='none')
ax.plot([0, max(gt_lens)], [0, max(gt_lens)], 'r--', alpha=0.5, label='y = x (perfect match)')
ax.set_xlabel('Ground Truth Length (chars)')
ax.set_ylabel('Whisper Output Length (chars)')
ax.set_title('Whisper Output Length vs Ground Truth Length')
ax.legend()
plt.tight_layout()
plt.show()

print(f'Correlation between GT and Whisper output lengths: {np.corrcoef(gt_lens, whisper_lens)[0,1]:.3f}')

In [None]:
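# What words does Whisper hallucinate most often?
# Strip surrounding punctuation so that "the" and "the," are counted together.
all_whisper_words = []
for r in nonempty:
    words = [w.strip('.,!?;:"\'()') for w in r['whisper_text'].lower().split()]
    all_whisper_words.extend(w for w in words if w)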

whisper_freq = Counter(all_whisper_words).most_common(20)

fig, ax = plt.subplots(figsize=(14, 5))
words, freqs = zip(*whisper_freq)
ax.bar(range(len(words)), freqs, color='#ff6b6b', edgecolor='#1a1a2e', alpha=0.8)
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words, rotation=45, ha='right')
ax.set_ylabel('Frequency')
ax.set_title('Most Common Words in Whisper\'s Output (Hallucinated from Dothraki)')
plt.tight_layout()
plt.show()

---
## 4. Real vs Synthetic Audio

This section compares pipeline behavior on espeak-ng synthesized audio with real Game of Thrones audio.

In [None]:
# Load real audio results if available
real_results_path = RESULTS_DIR / 'drogo_speech_clean_transcription.json'
if not real_results_path.exists():
    real_results_path = RESULTS_DIR / 'drogo_rhaego_speech_transcription.json'

if real_results_path.exists():
    real_result = json.loads(real_results_path.read_text())
    print(f'Real audio result ({real_results_path.name}):')
    print(f'  Detected language: {real_result.get("language", "N/A")}')
    print(f'  Segments: {len(real_result.get("segments", []))}')
    print(f'  Full text preview: {real_result.get("text", "")[:200]}...')
    
    print(f'\nComparison:')
    print(f'  Synthetic clips → Whisper detects English {eval_small["stats"]["language_distribution"].get("en", 0)/eval_small["stats"]["successful"]*100:.0f}% of the time')
    print(f'  Real audio → Whisper detects: {real_result.get("language", "N/A")}')
    print(f'\n  Synthetic: clean IPA-synthesized speech, no background noise')
    print(f'  Real: actor performance with music, SFX, crowd noise')
else:
    print('No real audio results found. Run the pipeline on data/raw/ clips first.')

---
## 5. Key Findings

### Language Detection
- Whisper overwhelmingly classifies Dothraki as **English** (roughly 85-86% with whisper-small)
- The smaller model (tiny) spreads its guesses across more languages, including Thai, Korean, and Japanese
- This may mean Dothraki's phonology sits closest to English from Whisper's perspective.
  The batch clips are espeak-ng synthetic speech, so actor accents cannot be the cause here;
  Whisper's English-heavy training data is a plausible alternative explanation

### Transcription Quality
- Roughly 22-25% of clips produce empty transcriptions, meaning Whisper emits no text at all
- When it does produce text, the output is phonetically influenced but semantically meaningless
- Longer Dothraki phrases tend to produce longer Whisper outputs (see the length correlation in section 3)

### Model Size Effect
- The larger model (small) concentrates its language guesses on fewer languages than tiny
- The larger model produces more text, but not necessarily more useful text
- The empty transcription rate is similar for the two sizes tested

### Implications for Matching
- The pipeline must work with phonetically garbled English-like output
- Phoneme-level matching (IPA edit distance) is more appropriate than word-level matching
- Synthetic audio gives a controlled baseline; real audio adds noise, reverb, and emotional variation
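
As an illustration of the phoneme-level matching idea, the sketch below computes a plain Levenshtein edit distance over phoneme sequences and normalizes it to a similarity score. It is not the pipeline's actual matcher: `edit_distance` and `normalized_similarity` are hypothetical helpers, every substitution costs 1 regardless of how similar two phonemes sound, and the phoneme lists are invented for the example rather than real IPA transcriptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of phonemes)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def normalized_similarity(a, b):
    """Map the distance to [0, 1], where 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

# Invented phoneme sequences: what was "heard" vs a candidate lexicon entry
heard = ["k", "a", "l", "i", "s", "i"]
candidate = ["x", "a", "l", "e", "s", "i"]
score = normalized_similarity(heard, candidate)
print(edit_distance(heard, candidate), round(score, 3))
```

A natural refinement would be to weight substitutions by phonetic feature distance, so that confusing two similar consonants costs less than swapping a vowel for a consonant.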