# 06 — Pipeline Evaluation

Comprehensive evaluation of the full Dothraki ASR pipeline:
Audio → Whisper → Phonemize → Match → Translate.

## Contents
1. [Overall Metrics](#1-overall-metrics) — Pipeline success rates
2. [Translation Quality](#2-translation-quality) — How good are the translations?
3. [Error Breakdown](#3-error-breakdown) — Where does the pipeline fail?
4. [Model Comparison](#4-model-comparison) — Tiny vs Small across all metrics
5. [Conclusions](#5-conclusions) — What we learned

In [None]:
import json
import sys
from pathlib import Path
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt

PROJECT_ROOT = Path.cwd().parent
sys.path.insert(0, str(PROJECT_ROOT))

RESULTS_DIR = PROJECT_ROOT / 'data' / 'results'

plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 12

eval_tiny = json.loads((RESULTS_DIR / 'batch_eval_tiny.json').read_text())
eval_small = json.loads((RESULTS_DIR / 'batch_eval_small.json').read_text())

print('Evaluations loaded:')
print(f'  whisper-tiny:  {eval_tiny["num_clips"]} clips')
print(f'  whisper-small: {eval_small["num_clips"]} clips')

---
## 1. Overall Metrics

High-level pipeline performance across all evaluation clips.
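
The cells in this notebook read a `stats` block and a `results` list from each evaluation file. As a minimal sketch of how those summary fields could relate to the per-clip records, the hypothetical helper `summarize_results` below recomputes a few of them from invented toy data. The field names mirror the keys this notebook reads; how the batch evaluation script actually derives them (for example, the denominator of `empty_rate`) is an assumption.

```python
def summarize_results(results):
    """Recompute a few summary counts from per-clip records (hypothetical helper)."""
    valid = [r for r in results if 'error' not in r]
    nonempty = [r for r in valid if r.get('whisper_text', '').strip()]
    translated = [r for r in valid if r.get('pipeline_translation', '').strip()]
    successful = len(valid)
    empty = successful - len(nonempty)
    return {
        'total_clips': len(results),
        'successful': successful,
        'errors': len(results) - successful,
        'nonempty_transcriptions': len(nonempty),
        'empty_transcriptions': empty,
        # Assumption: the empty rate is measured over successful clips.
        'empty_rate': empty / successful if successful else 0.0,
        'translation_coverage': len(translated),
    }


# Invented records, for illustration only.
toy_results = [
    {'id': 'a', 'whisper_text': 'me nem nesa', 'pipeline_translation': 'it is known'},
    {'id': 'b', 'whisper_text': '   ', 'pipeline_translation': ''},
    {'id': 'c', 'error': 'decode failed'},
    {'id': 'd', 'whisper_text': 'hash yer', 'pipeline_translation': ''},
]
toy_stats = summarize_results(toy_results)
print(toy_stats)
```

Comparing a recomputation like this against the stored `stats` block is a cheap consistency check before trusting the tables and plots that follow.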

In [None]:
# Build a comprehensive metrics table for each model
def compute_extended_stats(data):
    """Merge the precomputed stats with counts derived from per-clip results."""
    s = data['stats']
    results = data['results']
    valid = [r for r in results if 'error' not in r]
    nonempty = [r for r in valid if r.get('whisper_text', '').strip()]
    
    # Translation stats
    has_translation = [r for r in valid if r.get('pipeline_translation', '').strip()]
    
    # Word count stats
    whisper_word_counts = [len(r['whisper_text'].split()) for r in nonempty]
    gt_word_counts = [len(r['gt_dothraki'].split()) for r in nonempty]
    
    return {
        'Total clips': s['total_clips'],
        'Successful': s['successful'],
        'Errors': s['errors'],
        'Empty rate': f"{s['empty_rate']*100:.1f}%",
        'Non-empty': s['nonempty_transcriptions'],
        'Has translation': len(has_translation),
        'Translation rate': f"{len(has_translation)/s['successful']*100:.1f}%" if s['successful'] else 'N/A',
        'Avg Whisper words': f"{np.mean(whisper_word_counts):.1f}" if whisper_word_counts else 'N/A',
        'Avg GT words': f"{np.mean(gt_word_counts):.1f}" if gt_word_counts else 'N/A',
        'Top detected lang': max(s['language_distribution'], key=s['language_distribution'].get),
    }

tiny_stats = compute_extended_stats(eval_tiny)
small_stats = compute_extended_stats(eval_small)

print(f'{"Metric":25s} {"whisper-tiny":>15s} {"whisper-small":>15s}')
print('=' * 58)
for key in tiny_stats:
    print(f'{key:25s} {str(tiny_stats[key]):>15s} {str(small_stats[key]):>15s}')

In [None]:
# Visual pipeline funnel: how many clips survive each stage?
stages = ['Input Clips', 'Whisper Success', 'Non-Empty Output', 'Has Translation']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

for ax, data, title in [(ax1, eval_tiny, 'whisper-tiny'), (ax2, eval_small, 'whisper-small')]:
    s = data['stats']
    counts = [
        s['total_clips'],
        s['successful'],
        s['nonempty_transcriptions'],
        s['translation_coverage'],
    ]
    
    colors = ['#4ecdc4', '#45b7aa', '#3da190', '#ffd93d']
    bars = ax.barh(stages[::-1], counts[::-1], color=colors[::-1], edgecolor='#1a1a2e')
    
    for bar, count in zip(bars, counts[::-1]):
        pct = count / s['total_clips'] * 100
        ax.text(bar.get_width() + 2, bar.get_y() + bar.get_height()/2,
                f'{count} ({pct:.0f}%)', va='center', fontsize=11)
    
    ax.set_title(f'{title} — Pipeline Funnel', fontsize=13)
    ax.set_xlim(0, s['total_clips'] * 1.25)

plt.tight_layout()
plt.show()

---
## 2. Translation Quality

Comparing pipeline translations against ground truth English translations.
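
The word-overlap analysis in this section is a recall-style measure: the share of unique ground-truth words that also appear in the pipeline output. The sketch below works through it on one invented sentence pair. The precision and F1 values are additions for illustration and are not computed elsewhere in this notebook; `overlap_scores` is a hypothetical helper.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r'\w+', text.lower()))


def overlap_scores(gt_text, hyp_text):
    """Return (recall, precision, f1) over unique word types."""
    gt, hyp = tokenize(gt_text), tokenize(hyp_text)
    shared = len(gt & hyp)
    recall = shared / len(gt) if gt else 0.0
    precision = shared / len(hyp) if hyp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return recall, precision, f1


# Invented example pair: 2 of the 4 ground-truth words are recovered.
recall, precision, f1 = overlap_scores('The horse is strong', 'horse strong man')
print(f'recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}')
```

Because both sides are reduced to sets, word order and repetition are ignored, so this kind of score says nothing about whether the translation is grammatical.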

In [None]:
# Show translation examples: GT English vs Pipeline Translation
results = eval_small['results']
has_both = [r for r in results 
            if 'error' not in r 
            and r.get('pipeline_translation', '').strip()
            and r.get('gt_english', '').strip()]

print(f'Clips with both GT and pipeline translations: {len(has_both)}\n')
print(f'{"Ground Truth English":40s} | {"Pipeline Translation"}')
print('=' * 90)

for r in has_both[:15]:
    gt = r['gt_english'][:38]
    pt = r['pipeline_translation'][:45]
    print(f'{gt:40s} | {pt}')

In [None]:
# Word overlap analysis: what fraction of unique GT English words appear in the pipeline translation?
import re

def tokenize(text):
    return set(re.findall(r'\w+', text.lower()))

overlaps = []
for r in has_both:
    gt_words = tokenize(r['gt_english'])
    pt_words = tokenize(r['pipeline_translation'])
    if gt_words:
        overlap = len(gt_words & pt_words) / len(gt_words)
        overlaps.append(overlap)

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(overlaps, bins=20, color='#4ecdc4', edgecolor='#1a1a2e', alpha=0.8)
ax.set_xlabel('Word Overlap Ratio (|GT ∩ Pipeline| / |GT|)')
ax.set_ylabel('Count')
ax.set_title('Translation Word Overlap with Ground Truth')
ax.axvline(np.mean(overlaps), color='#ff6b6b', linestyle='--',
           label=f'Mean: {np.mean(overlaps):.3f}')
ax.legend()

plt.tight_layout()
plt.show()

print(f'Mean word overlap: {np.mean(overlaps):.3f}')
print(f'Clips with >0 overlap: {sum(1 for o in overlaps if o > 0)}/{len(overlaps)}')
print(f'Clips with >50% overlap: {sum(1 for o in overlaps if o > 0.5)}/{len(overlaps)}')
print('\nNote: Low overlap is expected. The pipeline translates word by word via')
print('dictionary lookup, so grammatical structure and idioms are lost.')

---
## 3. Error Breakdown

Categorizing failure modes in the pipeline.
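
The outcome categories in this section can also be assigned clip by clip rather than read from the aggregate `stats` block. The sketch below shows one way to do that, assuming the per-clip keys used elsewhere in this notebook (`error`, `whisper_text`, `pipeline_translation`). `classify_outcome` is a hypothetical helper and the records are invented.

```python
from collections import Counter


def classify_outcome(record):
    """Assign one per-clip record to exactly one outcome category."""
    if 'error' in record:
        return 'Whisper errors'
    if not record.get('whisper_text', '').strip():
        return 'Whisper empty'
    if not record.get('pipeline_translation', '').strip():
        return 'No translation'
    return 'Has translation'


# Invented records, for illustration only.
toy_records = [
    {'id': 'a', 'whisper_text': 'anha', 'pipeline_translation': 'I'},
    {'id': 'b', 'whisper_text': '', 'pipeline_translation': ''},
    {'id': 'c', 'error': 'timeout'},
    {'id': 'd', 'whisper_text': 'shekh', 'pipeline_translation': ''},
    {'id': 'e', 'whisper_text': 'vezh', 'pipeline_translation': 'stallion'},
]
outcome_counts = Counter(classify_outcome(r) for r in toy_records)
print(outcome_counts)
```

Checking the conditions in a fixed order makes the categories mutually exclusive, so the counts always sum to the number of clips.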

In [None]:
# Categorize errors for whisper-small
s = eval_small['stats']
results = eval_small['results']

categories = {
    'Whisper empty': s['empty_transcriptions'],
    'Whisper errors': s['errors'],
    'No translation': s['nonempty_transcriptions'] - s['translation_coverage'],
    'Has translation': s['translation_coverage'],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Pie chart of outcome categories
colors_cat = ['#ff6b6b', '#ff4757', '#ffd93d', '#4ecdc4']
# Convert the dict views to lists so matplotlib receives plain sequences
ax1.pie(list(categories.values()), labels=list(categories.keys()), autopct='%1.1f%%',
        colors=colors_cat, textprops={'fontsize': 11})
ax1.set_title('Pipeline Outcome Distribution (whisper-small)')

# Empty-transcription rate for short vs long ground-truth transcripts
gt_short = [r for r in results if 'error' not in r and len(r['gt_dothraki']) < 20]
gt_long = [r for r in results if 'error' not in r and len(r['gt_dothraki']) >= 20]

short_empty = sum(1 for r in gt_short if not r.get('whisper_text', '').strip())
long_empty = sum(1 for r in gt_long if not r.get('whisper_text', '').strip())

x = ['Short GT (<20 chars)', 'Long GT (≥20 chars)']
empty_rates = [
    short_empty / len(gt_short) * 100 if gt_short else 0,
    long_empty / len(gt_long) * 100 if gt_long else 0,
]
bars = ax2.bar(x, empty_rates, color=['#ff6b6b', '#4ecdc4'], edgecolor='#1a1a2e', alpha=0.8)
for bar, rate in zip(bars, empty_rates):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
            f'{rate:.1f}%', ha='center', fontsize=12)
ax2.set_ylabel('Empty Transcription Rate (%)')
ax2.set_title('Empty Rate by Ground Truth Length')

plt.tight_layout()
plt.show()

# Reuse the guarded rates computed above to avoid dividing by zero
print(f'Short clips (<20 chars): {len(gt_short)}, empty rate: {empty_rates[0]:.1f}%')
print(f'Long clips (≥20 chars):  {len(gt_long)}, empty rate: {empty_rates[1]:.1f}%')
if empty_rates[0] > empty_rates[1]:
    print('\nShorter clips are harder: less audio context for Whisper to work with.')

---
## 4. Model Comparison

Head-to-head comparison of whisper-tiny vs whisper-small across all metrics.
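
Aggregate rates can hide which clips each model handles. As a complementary sketch, the hypothetical helper `paired_output_table` below counts, for clips present in both runs, whether each model produced non-empty text. It assumes the per-clip `id`, `error`, and `whisper_text` keys used in this notebook; the records are invented.

```python
from collections import Counter


def paired_output_table(results_a, results_b):
    """Count shared clips by (model A has text, model B has text)."""
    def has_text(record):
        return 'error' not in record and bool(record.get('whisper_text', '').strip())

    a = {r['id']: has_text(r) for r in results_a}
    b = {r['id']: has_text(r) for r in results_b}
    table = Counter()
    # Only clips evaluated by both models are comparable.
    for clip_id in a.keys() & b.keys():
        table[(a[clip_id], b[clip_id])] += 1
    return table


# Invented records, for illustration only.
tiny_toy = [
    {'id': 'c1', 'whisper_text': 'hello'},
    {'id': 'c2', 'whisper_text': ''},
    {'id': 'c3', 'whisper_text': 'ok'},
    {'id': 'c4', 'error': 'failed'},
]
small_toy = [
    {'id': 'c1', 'whisper_text': 'hallo'},
    {'id': 'c2', 'whisper_text': 'yes'},
    {'id': 'c3', 'whisper_text': ' '},
    {'id': 'c5', 'whisper_text': 'extra'},
]
toy_table = paired_output_table(tiny_toy, small_toy)
print(dict(toy_table))
```

The off-diagonal cells, `(True, False)` and `(False, True)`, are the clips where the two models disagree, which is where a manual listen is most informative.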

In [None]:
# Side-by-side comparison: one bar-chart panel per metric
metrics_compare = {
    'Non-Empty\nRate (%)': [],
    'Translation\nCoverage (%)': [],
    'Top Lang\nConfidence (%)': [],
    'Avg Output\nLength': [],
    'Unique\nLanguages': [],
}

for data in [eval_tiny, eval_small]:
    s = data['stats']
    metrics_compare['Non-Empty\nRate (%)'].append((1 - s['empty_rate']) * 100)
    metrics_compare['Translation\nCoverage (%)'].append(s['translation_coverage_rate'] * 100)
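    # Share of successful clips assigned to the most frequent detected language,
    # used here as a proxy for how concentrated (confident) the detector is
    top_lang_pct = max(s['language_distribution'].values()) / s['successful'] * 100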
    metrics_compare['Top Lang\nConfidence (%)'].append(top_lang_pct)
    metrics_compare['Avg Output\nLength'].append(s['avg_whisper_output_length'])
    metrics_compare['Unique\nLanguages'].append(len(s['language_distribution']))

fig, axes = plt.subplots(1, 5, figsize=(20, 5))
colors = ['#4ecdc4', '#ff6b6b']
model_names = ['tiny', 'small']

for ax, (metric, values) in zip(axes, metrics_compare.items()):
    bars = ax.bar(model_names, values, color=colors, edgecolor='#1a1a2e', width=0.5)
    ax.set_title(metric, fontsize=11)
    for bar, val in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                f'{val:.1f}', ha='center', fontsize=10)

fig.suptitle('Whisper Model Size Comparison — All Metrics', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Per-clip agreement: do both models detect the same language?
tiny_results = {r['id']: r for r in eval_tiny['results'] if 'error' not in r}
small_results = {r['id']: r for r in eval_small['results'] if 'error' not in r}

common_ids = set(tiny_results.keys()) & set(small_results.keys())
agree = sum(1 for cid in common_ids
            if tiny_results[cid]['whisper_lang'] == small_results[cid]['whisper_lang'])

print('Language detection agreement (tiny vs small):')
print(f'  Common clips: {len(common_ids)}')
if common_ids:  # skip the percentages when the two runs share no clips
    differ = len(common_ids) - agree
    print(f'  Same language: {agree} ({agree/len(common_ids)*100:.1f}%)')
    print(f'  Different: {differ} ({differ/len(common_ids)*100:.1f}%)')

# Show disagreements
disagreements = [(cid, tiny_results[cid]['whisper_lang'], small_results[cid]['whisper_lang'])
                 for cid in sorted(common_ids)
                 if tiny_results[cid]['whisper_lang'] != small_results[cid]['whisper_lang']]

if disagreements:
    print(f'\nSample disagreements:')
    for cid, t_lang, s_lang in disagreements[:10]:
        print(f'  {cid}: tiny={t_lang}, small={s_lang}')

---
## 5. Conclusions

### What Works
- **Pipeline execution**: End-to-end processing completes on 100% of clips without crashes
- **Language detection**: Whisper consistently identifies *some* language, giving the phonemizer a starting point
- **Phoneme extraction**: gruut/espeak-ng successfully convert Whisper text to IPA
- **Translation coverage**: ~75-78% of clips produce some translation output

### What Doesn't Work (Yet)
- **Semantic accuracy**: Translations are phonetically plausible but semantically poor
- **Empty transcriptions**: ~22-25% of clips produce no output (Whisper gives up)
- **Word-by-word matching**: Loses all grammatical structure and context
- **Uniform edit distance**: Scoring /p/→/b/ (a voicing change only) the same as /p/→/z/ (place, manner, and voicing all change) ignores articulatory similarity
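
To make the edit-distance limitation concrete, here is a minimal sketch of a feature-weighted alternative. It is not the matcher used in this pipeline. The tiny `FEATURES` table, the cost rule (fraction of differing features), and the function names are all assumptions chosen for illustration.

```python
# Hypothetical feature table: (place, manner, voiced) for a few consonants.
FEATURES = {
    'p': ('bilabial', 'stop', False),
    'b': ('bilabial', 'stop', True),
    't': ('alveolar', 'stop', False),
    'd': ('alveolar', 'stop', True),
    's': ('alveolar', 'fricative', False),
    'z': ('alveolar', 'fricative', True),
}


def sub_cost(a, b):
    """Substitution cost: fraction of articulatory features that differ."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # unknown phoneme: fall back to the uniform cost
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def weighted_edit_distance(src, tgt, indel_cost=1.0):
    """Levenshtein distance over phoneme lists with feature-based substitutions."""
    prev = [j * indel_cost for j in range(len(tgt) + 1)]
    for i, a in enumerate(src, start=1):
        curr = [i * indel_cost]
        for j, b in enumerate(tgt, start=1):
            curr.append(min(
                prev[j] + indel_cost,         # delete a
                curr[j - 1] + indel_cost,     # insert b
                prev[j - 1] + sub_cost(a, b)  # substitute a -> b
            ))
        prev = curr
    return prev[-1]


near = weighted_edit_distance(['p', 'a', 't'], ['b', 'a', 't'])
far = weighted_edit_distance(['p', 'a', 't'], ['z', 'a', 't'])
print(f'p->b: {near:.2f}, p->z: {far:.2f}')
```

With these toy features, /p/→/b/ costs a third of /p/→/z/, so phonetically close candidates rank ahead of distant ones. A real feature set would need to cover vowels and the full phoneme inventory.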

### Key Insights
1. Zero-shot ASR on a truly unseen language is a **degenerate case** — but an informative one
2. Whisper's language detector hints at **phonological similarity**: its guesses show which real languages Dothraki most resembles to the model
3. The pipeline demonstrates the full **ASR → NLP → Translation** stack even when accuracy is low
4. Larger Whisper models are more **confident** but not necessarily more **accurate** for unseen languages

### Future Directions
- Weighted edit distance (articulatory features)
- N-gram phoneme matching for longer sequences
- Fine-tuning on synthetic Dothraki audio
- Beam search over multiple Whisper hypotheses
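
As one possible reading of the n-gram matching idea in the list above, the sketch below scores two phoneme sequences by the Jaccard similarity of their padded bigram sets. The padding symbol, the choice of Jaccard, and the function names are assumptions; nothing like this is implemented in the current pipeline.

```python
def phoneme_ngrams(phonemes, n=2, pad='#'):
    """Return the set of n-grams over a boundary-padded phoneme list."""
    padded = [pad] * (n - 1) + list(phonemes) + [pad] * (n - 1)
    return {tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard similarity between the n-gram sets of two phoneme lists."""
    grams_a, grams_b = phoneme_ngrams(a, n), phoneme_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 1.0  # both sequences yield no n-grams
    return len(grams_a & grams_b) / len(union)


# Invented phoneme sequences sharing their first two segments.
sim_close = ngram_similarity(['k', 'a', 'l'], ['k', 'a', 's'])
sim_same = ngram_similarity(['k', 'a', 'l'], ['k', 'a', 'l'])
print(f'close: {sim_close:.2f}, identical: {sim_same:.2f}')
```

Unlike edit distance, a set-based score does not depend on aligning the two sequences end to end, which may help when Whisper merges or splits words. It also discards position and repetition, so it is better treated as a candidate filter than as a final ranking.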

In [None]:
# Final summary table
print('='*60)
print('PIPELINE EVALUATION SUMMARY')
print('='*60)
for model_name, data in [('tiny', eval_tiny), ('small', eval_small)]:
    s = data['stats']
    print(f'\nwhisper-{model_name}:')
    print(f'  Clips evaluated:     {s["total_clips"]}')
    print(f'  Non-empty rate:      {(1-s["empty_rate"])*100:.1f}%')
    print(f'  Translation rate:    {s["translation_coverage_rate"]*100:.1f}%')
    print(f'  Primary language:    {max(s["language_distribution"], key=s["language_distribution"].get)} '
          f'({max(s["language_distribution"].values())/s["successful"]*100:.0f}%)')
    print(f'  Languages detected:  {len(s["language_distribution"])}')
print('\n' + '='*60)