
Choosing the optimal Whisper model: an analysis of speech recognition accuracy and speed

Goal:

Compare the performance of two Whisper model sizes (Tiny vs Base) on a dataset of audio recordings using two key metrics:
- Accuracy (Word Error Rate, WER)
- Speed (transcription time)
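
For reference, WER counts the word-level substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. The sketch below is a minimal standard-library illustration of that definition. `word_error_rate` is a hypothetical helper written for this note, not the `jiwer` implementation the notebook relies on, and the sentence pair is copied from the sample predictions shown later in the notebook.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "nor is mister quilter's manner less interesting than his matter"
hypothesis = "nor is mr. quilters' manner less interesting than his matter."
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")  # 3 of 10 words differ -> 0.30
```

Note that two of the three errors in this pair ("mr." and the trailing period) are formatting differences rather than recognition mistakes, which is worth keeping in mind when reading raw WER values.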

### Dataset

https://huggingface.co/datasets/hf-internal-testing/librispeech_asr_dummy

### Install libs

In [1]:
!pip install openai-whisper jiwer datasets torchcodec librosa

Collecting openai-whisper
  Downloading openai_whisper-20250625.tar.gz (803 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting jiwer
  Downloading jiwer-4.0.0-py3-none-any.whl.metadata (3.3 kB)
Collecting torchcodec
  Downloading torchcodec-0.8.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (9.7 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Download

### Import required libs

In [2]:
import whisper
import time
import pandas as pd
import datasets
import librosa
import io
import soundfile as sf
import numpy as np

from jiwer import wer
from datasets import load_dataset, Audio
from tqdm import tqdm

### Import required models

In [3]:
print("Loading Whisper models...")
model_tiny = whisper.load_model("tiny")
model_base = whisper.load_model("base")
print("Models loaded successfully!\n")

Loading Whisper models...


100%|██████████████████████████████████████| 72.1M/72.1M [00:00<00:00, 110MiB/s]
100%|████████████████████████████████████████| 139M/139M [00:01<00:00, 117MiB/s]


Models loaded successfully!



### Load a small audio dataset (LibriSpeech dummy set, `clean` validation split)

In [4]:
print("Loading dataset...")
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# Use at most the first 100 samples (this dummy split contains only 73)
dataset = dataset.select(range(min(100, len(dataset))))

# Disable automatic (torchcodec) decoding: the dataset then keeps the audio as raw bytes,
# which are decoded and resampled manually in load_audio()
dataset = dataset.cast_column("audio", Audio(decode=False))

print(f"Loaded {len(dataset)} audio samples\n")

Loading dataset...



Loaded 73 audio samples



In [5]:
# Decode raw audio bytes into a waveform suitable for Whisper:
# mono, 16 kHz (the rate Whisper expects), float32 NumPy array

def load_audio(sample):
    audio_bytes = sample["audio"]["bytes"]
    data, sr = sf.read(io.BytesIO(audio_bytes), dtype='float32')

    # soundfile returns shape (frames, channels) for multi-channel audio; downmix to mono
    if data.ndim > 1:
        data = data.mean(axis=1)

    if sr != 16000:
        data = librosa.resample(data, orig_sr=sr, target_sr=16000).astype(np.float32)

    return data

### Function to transcribe and calculate metrics

In [6]:
def evaluate_model(model, model_name, dataset):
    """Transcribe every sample with `model` and return the WER and average inference time."""
    predictions = []
    references = []
    times = []

    print(f"\nEvaluating {model_name}...")
    print("-" * 50)

    for i, sample in tqdm(enumerate(dataset), total=len(dataset)):

        reference = sample["text"].lower().strip()

        # Load audio from raw bytes (decoding is excluded from the timing below)
        audio = load_audio(sample)

        # Time only the transcription call; perf_counter is a monotonic, high-resolution clock
        start_time = time.perf_counter()
        result = model.transcribe(audio, fp16=False)
        inference_time = time.perf_counter() - start_time

        prediction = result["text"].lower().strip()

        predictions.append(prediction)
        references.append(reference)
        times.append(inference_time)

        print(f"Sample {i+1}/{len(dataset)} - Time: {inference_time:.2f}s")

    # Calculate WER over all samples; apart from lowercasing, no text normalization
    # is applied, so punctuation differences count as errors
    wer_score = wer(references, predictions)
    avg_time = sum(times) / len(times)

    return {
        "model": model_name,
        "wer": wer_score,
        "avg_inference_time": avg_time,
        "predictions": predictions,
        "references": references
    }
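
Average inference time depends on clip length, so it can help to look at the median and the worst case as well, and to relate processing time to audio duration. The following is a rough sketch under stated assumptions: `summarize` and `real_time_factor` are hypothetical helpers that are not part of the notebook, and the ten timings are the first ten Whisper Tiny values from the log of this run.

```python
import statistics

SAMPLE_RATE = 16_000  # sampling rate of the waveforms passed to Whisper in this notebook


def real_time_factor(inference_seconds, num_audio_samples, sample_rate=SAMPLE_RATE):
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    audio_seconds = num_audio_samples / sample_rate
    return inference_seconds / audio_seconds


def summarize(times):
    """Return the mean, median, and maximum of a list of per-sample timings (seconds)."""
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "max": max(times),
    }


# First ten per-sample times (seconds) reported for Whisper Tiny in this run
tiny_times = [2.18, 1.44, 1.84, 1.94, 3.39, 1.54, 1.45, 1.57, 1.44, 2.44]
summary = summarize(tiny_times)
print({name: round(value, 3) for name, value in summary.items()})

# A 4-second clip (64,000 samples at 16 kHz) transcribed in 2 seconds
print(real_time_factor(2.0, 64_000))
```

Inside `evaluate_model`, the number of audio samples would be `len(audio)`, so a per-sample real-time factor could be collected alongside `inference_time` if desired.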

### Evaluate both models

In [7]:
results_tiny = evaluate_model(model_tiny, "Whisper Tiny", dataset)
results_base = evaluate_model(model_base, "Whisper Base", dataset)


Evaluating Whisper Tiny...
--------------------------------------------------
Sample 1/73 - Time: 2.18s
Sample 2/73 - Time: 1.44s
Sample 3/73 - Time: 1.84s
Sample 4/73 - Time: 1.94s
Sample 5/73 - Time: 3.39s
Sample 6/73 - Time: 1.54s
Sample 7/73 - Time: 1.45s
Sample 8/73 - Time: 1.57s
Sample 9/73 - Time: 1.44s
Sample 10/73 - Time: 2.44s
Sample 11/73 - Time: 1.46s
Sample 12/73 - Time: 1.93s
Sample 13/73 - Time: 1.40s
Sample 14/73 - Time: 1.58s
Sample 15/73 - Time: 1.31s
Sample 16/73 - Time: 1.83s
Sample 17/73 - Time: 1.94s
Sample 18/73 - Time: 1.84s
Sample 19/73 - Time: 1.50s
Sample 20/73 - Time: 1.43s
Sample 21/73 - Time: 1.52s
Sample 22/73 - Time: 1.46s
Sample 23/73 - Time: 1.43s
Sample 24/73 - Time: 1.68s
Sample 25/73 - Time: 1.29s
Sample 26/73 - Time: 1.72s
Sample 27/73 - Time: 1.36s
Sample 28/73 - Time: 1.30s
Sample 29/73 - Time: 1.36s
Sample 30/73 - Time: 1.27s
Sample 31/73 - Time: 1.70s
Sample 32/73 - Time: 1.46s
Sample 33/73 - Time: 1.51s
Sample 34/73 - Time: 1.39s
Sample 35/73 - Time: 1.43s
Sample 36/73 - Time: 1.47s
Sample 37/73 - Time: 1.38s
Sample 38/73 - Time: 1.31s
Sample 39/73 - Time: 1.83s
Sample 40/73 - Time: 2.24s
Sample 41/73 - Time: 1.41s
Sample 42/73 - Time: 1.62s
Sample 43/73 - Time: 1.95s
Sample 44/73 - Time: 1.49s
Sample 45/73 - Time: 1.41s
Sample 46/73 - Time: 1.84s
Sample 47/73 - Time: 1.38s
Sample 48/73 - Time: 1.44s
Sample 49/73 - Time: 1.42s
Sample 50/73 - Time: 1.68s
Sample 51/73 - Time: 1.70s
Sample 52/73 - Time: 1.42s
Sample 53/73 - Time: 1.82s
Sample 54/73 - Time: 1.99s
Sample 55/73 - Time: 1.71s
Sample 56/73 - Time: 1.45s
Sample 57/73 - Time: 1.26s
Sample 58/73 - Time: 1.42s
Sample 59/73 - Time: 1.55s
Sample 60/73 - Time: 1.59s
Sample 61/73 - Time: 1.48s
Sample 62/73 - Time: 1.41s
Sample 63/73 - Time: 1.60s
Sample 64/73 - Time: 1.68s
Sample 65/73 - Time: 1.32s
Sample 66/73 - Time: 1.58s
Sample 67/73 - Time: 1.49s
Sample 68/73 - Time: 1.83s
Sample 69/73 - Time: 1.46s
Sample 70/73 - Time: 1.54s
Sample 71/73 - Time: 1.56s
Sample 72/73 - Time: 1.63s
Sample 73/73 - Time: 1.45s
100%|██████████| 73/73 [01:57<00:00,  1.61s/it]

Evaluating Whisper Base...
--------------------------------------------------
Sample 1/73 - Time: 3.67s
Sample 2/73 - Time: 3.17s
Sample 3/73 - Time: 3.72s
Sample 4/73 - Time: 3.84s
Sample 5/73 - Time: 7.04s
Sample 6/73 - Time: 3.38s
Sample 7/73 - Time: 3.34s
Sample 8/73 - Time: 3.41s
Sample 9/73 - Time: 3.09s
Sample 10/73 - Time: 4.50s
Sample 11/73 - Time: 3.14s
Sample 12/73 - Time: 5.48s
Sample 13/73 - Time: 3.10s
Sample 14/73 - Time: 3.36s
Sample 15/73 - Time: 2.90s
Sample 16/73 - Time: 3.93s
Sample 17/73 - Time: 3.62s
Sample 18/73 - Time: 3.69s
Sample 19/73 - Time: 3.45s
Sample 20/73 - Time: 3.06s
Sample 21/73 - Time: 3.30s
Sample 22/73 - Time: 3.21s
Sample 23/73 - Time: 3.42s
Sample 24/73 - Time: 2.99s
Sample 25/73 - Time: 3.08s
Sample 26/73 - Time: 3.97s
Sample 27/73 - Time: 2.98s
Sample 28/73 - Time: 2.97s
Sample 29/73 - Time: 3.11s
Sample 30/73 - Time: 3.05s
Sample 31/73 - Time: 3.37s
Sample 32/73 - Time: 3.01s
Sample 33/73 - Time: 3.58s
Sample 34/73 - Time: 2.99s
Sample 35/73 - Time: 3.01s
Sample 36/73 - Time: 3.26s
Sample 37/73 - Time: 3.16s
Sample 38/73 - Time: 2.89s
Sample 39/73 - Time: 3.47s
Sample 40/73 - Time: 4.46s
Sample 41/73 - Time: 3.06s
Sample 42/73 - Time: 4.90s
Sample 43/73 - Time: 4.26s
Sample 44/73 - Time: 3.15s
Sample 45/73 - Time: 3.13s
Sample 46/73 - Time: 3.50s
Sample 47/73 - Time: 3.06s
Sample 48/73 - Time: 3.09s
Sample 49/73 - Time: 3.10s
Sample 50/73 - Time: 3.67s
Sample 51/73 - Time: 3.54s
Sample 52/73 - Time: 3.03s
Sample 53/73 - Time: 3.72s
Sample 54/73 - Time: 2.88s
Sample 55/73 - Time: 3.45s
Sample 56/73 - Time: 3.39s
Sample 57/73 - Time: 2.77s
Sample 58/73 - Time: 3.04s
Sample 59/73 - Time: 3.19s
Sample 60/73 - Time: 3.41s
Sample 61/73 - Time: 3.01s
Sample 62/73 - Time: 3.05s
Sample 63/73 - Time: 5.07s
Sample 64/73 - Time: 3.51s
Sample 65/73 - Time: 2.89s
Sample 66/73 - Time: 3.53s
Sample 67/73 - Time: 3.21s
Sample 68/73 - Time: 3.19s
Sample 69/73 - Time: 3.14s
Sample 70/73 - Time: 3.63s
Sample 71/73 - Time: 3.29s
Sample 72/73 - Time: 3.44s
Sample 73/73 - Time: 3.52s
100%|██████████| 73/73 [04:12<00:00,  3.46s/it]





### Display comparison results

In [8]:
print("\n" + "="*60)
print("COMPARISON RESULTS")
print("="*60)

comparison_df = pd.DataFrame([
    {
        "Model": "Whisper Tiny",
        "WER": f"{results_tiny['wer']:.4f}",
        "Avg Inference Time (s)": f"{results_tiny['avg_inference_time']:.2f}"
    },
    {
        "Model": "Whisper Base",
        "WER": f"{results_base['wer']:.4f}",
        "Avg Inference Time (s)": f"{results_base['avg_inference_time']:.2f}"
    }
])
print(comparison_df.to_string(index=False))
print("\n")


COMPARISON RESULTS
       Model    WER Avg Inference Time (s)
Whisper Tiny 0.2400                   1.60
Whisper Base 0.2235                   3.45
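
To put the two rows in proportion, the small calculation below derives the absolute and relative WER change and the slowdown factor from the values in the table above. It is only arithmetic on the reported numbers; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Values copied from the comparison table of this run
wer_tiny, wer_base = 0.2400, 0.2235
time_tiny, time_base = 1.60, 3.45

# Absolute WER improvement of Base over Tiny, in percentage points
abs_gain_pp = (wer_tiny - wer_base) * 100

# Relative WER reduction: share of Tiny's errors that Base removes
rel_reduction = (wer_tiny - wer_base) / wer_tiny

# How many times longer Base takes per sample on average
slowdown = time_base / time_tiny

print(f"Absolute WER gain: {abs_gain_pp:.2f} percentage points")
print(f"Relative WER reduction: {rel_reduction:.1%}")
print(f"Base / Tiny time ratio: {slowdown:.2f}x")
```

On these numbers, Base removes roughly 7% of Tiny's word errors while taking a little more than twice as long per sample.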




### Show sample predictions

In [9]:
print("="*60)
print("SAMPLE PREDICTIONS")
print("="*60)

for i in range(min(3, len(dataset))):
    print(f"\nSample {i+1}:")
    print(f"Reference:    {results_tiny['references'][i]}")
    print(f"Tiny Model:   {results_tiny['predictions'][i]}")
    print(f"Base Model:   {results_base['predictions'][i]}")

SAMPLE PREDICTIONS

Sample 1:
Reference:    mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
Tiny Model:   mr. quilter is the apostle of the middle classes and we are glad to welcome his gospel.
Base Model:   mr. quilter is the apostle of the middle classes, and we are glad to welcome his gospel.

Sample 2:
Reference:    nor is mister quilter's manner less interesting than his matter
Tiny Model:   nor is mr. quilters' manner less interesting than his matter.
Base Model:   nor is mr. quilter's manner less interesting than his matter.

Sample 3:
Reference:    he tells us that at this festive season of the year with christmas and roast beef looming before us similes drawn from eating and its results occur most readily to the mind
Tiny Model:   he tells us that at this festive season of the year, with christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.
Base Model:   he tells us that
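
In the samples above, several mismatches come from formatting rather than recognition: the models emit punctuation and "mr." while the references are unpunctuated and spell out "mister". Because the WER in this notebook is computed on lowercased but otherwise raw text, such differences count as errors. A minimal sketch of normalizing both sides before scoring is shown below; `normalize` and the small `ABBREVIATIONS` map are assumptions made for illustration and are not part of the notebook. The openai-whisper package also provides an `EnglishTextNormalizer` in `whisper.normalizers`, which is likely a more thorough option.

```python
import re

# Hypothetical abbreviation map for illustration; extend it for real data
ABBREVIATIONS = {"mr": "mister", "mrs": "missus", "dr": "doctor"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation (keeping apostrophes), and expand a few abbreviations."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # replace punctuation with spaces
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


reference = "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel"
tiny_pred = "mr. quilter is the apostle of the middle classes and we are glad to welcome his gospel."
base_pred = "mr. quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

print(normalize(tiny_pred) == reference)
print(normalize(base_pred) == reference)
```

With this normalization both predictions for Sample 1 match the reference exactly, whereas the Tiny output "quilters'" in Sample 2 would still count as a genuine error. Applying the same function to references and predictions before calling `wer` would change the reported scores, so any normalized results should be reported separately from the raw ones.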

### Find samples where models differ and at least one is wrong

In [10]:
interesting_samples = []
for i in range(len(dataset)):
    ref = results_tiny['references'][i]
    tiny_pred = results_tiny['predictions'][i]
    base_pred = results_base['predictions'][i]

    # Check if models differ from each other AND at least one differs from the reference
    models_differ = tiny_pred != base_pred
    at_least_one_wrong = (tiny_pred != ref) or (base_pred != ref)

    if models_differ and at_least_one_wrong:
        interesting_samples.append(i)

if interesting_samples:
    print(f"\nFound {len(interesting_samples)} samples where models disagree and differ from reference:\n")
    for idx in interesting_samples[:5]:  # Show up to 5 examples
        print(f"Sample {idx+1}:")
        print(f"Reference:    {results_tiny['references'][idx]}")
        print(f"Tiny Model:   {results_tiny['predictions'][idx]}")
        print(f"Base Model:   {results_base['predictions'][idx]}")
        print()
else:
    print("\nNo samples found where models disagree and differ from reference.")
    print("Showing first 3 samples instead:\n")
    for i in range(min(3, len(dataset))):
        print(f"Sample {i+1}:")
        print(f"Reference:    {results_tiny['references'][i]}")
        print(f"Tiny Model:   {results_tiny['predictions'][i]}")
        print(f"Base Model:   {results_base['predictions'][i]}")
        print()


Found 50 samples where models disagree and differ from reference:

Sample 1:
Reference:    mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
Tiny Model:   mr. quilter is the apostle of the middle classes and we are glad to welcome his gospel.
Base Model:   mr. quilter is the apostle of the middle classes, and we are glad to welcome his gospel.

Sample 2:
Reference:    nor is mister quilter's manner less interesting than his matter
Tiny Model:   nor is mr. quilters' manner less interesting than his matter.
Base Model:   nor is mr. quilter's manner less interesting than his matter.

Sample 4:
Reference:    he has grave doubts whether sir frederick leighton's work is really greek after all and can discover in it but little of rocky ithaca
Tiny Model:   he has grave doubts whether sir frederick layton's work is really greek after all and can discover in it but little of rocky ithaca.
Base Model:   he has graved doubts whether sir frederick layton's 
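
When many samples are flagged, scanning full sentences for the differing words is tedious. One possible aid is a word-level diff built on the standard library's `difflib.SequenceMatcher`, sketched below. `word_diff` is a hypothetical helper, and the alignment it produces is a convenient approximation that is not guaranteed to match the minimum-edit alignment used for WER.

```python
from difflib import SequenceMatcher


def word_diff(reference: str, hypothesis: str):
    """Return (operation, reference words, hypothesis words) for every non-matching span."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    return [
        (tag, " ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


reference = "nor is mister quilter's manner less interesting than his matter"
tiny_pred = "nor is mr. quilters' manner less interesting than his matter."

for operation, ref_span, hyp_span in word_diff(reference, tiny_pred):
    print(f"{operation:8} reference={ref_span!r} prediction={hyp_span!r}")
```

For this pair the diff isolates two spans: "mister quilter's" versus "mr. quilters'" and "matter" versus "matter.", which makes it easy to separate formatting differences from actual misrecognitions.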

### Summary

In [11]:
print("\n" + "="*60)
print("SUMMARY")
print("="*60)
wer_diff = (results_base['wer'] - results_tiny['wer']) * 100
time_diff = results_base['avg_inference_time'] - results_tiny['avg_inference_time']

print(f"WER Base: {results_base['wer']*100:.2f}")
print(f"WER Tiny: {results_tiny['wer']*100:.2f}")
print(f"WER Difference: {wer_diff:.2f} percentage points (Base vs Tiny)")
print(f"Time Difference: {time_diff:.2f} seconds (Base vs Tiny)")
print(f"\nBase model is {'better' if results_base['wer'] < results_tiny['wer'] else 'worse'} in accuracy")
print(f"Base model is {'slower' if time_diff > 0 else 'faster'} in inference")


SUMMARY
WER Base: 22.35
WER Tiny: 24.00
WER Difference: -1.65 percentage points (Base vs Tiny)
Time Difference: 1.85 seconds (Base vs Tiny)

Base model is better in accuracy
Base model is slower in inference


Key takeaways:
1. The Base model is more accurate than Tiny: its WER is 22.35% versus 24.00%, a difference of 1.65 percentage points.
2. The Tiny model is considerably faster: on average 1.60 s per sample versus 3.45 s for Base, roughly 2.2 times faster.
3. The accuracy gain comes at the cost of longer processing time.