# OCSC Whisper Evaluation

## Overview
This notebook evaluates baseline Whisper-Medium against our fine-tuned checkpoint on held-out OCSC data (the dev split as configured below; the test split can be swapped in via `df_eval`). We compute WER/CER overall and stratified by age group and task type, then perform error analysis to understand where fine-tuning helps or hurts.

**Models compared:**
- Baseline: `openai/whisper-medium.en` (zero-shot)
- Fine-tuned: Checkpoint from W&B artifact (~0.25 epochs on OCSC)

**Metrics:** Word Error Rate (WER), Character Error Rate (CER)
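
For reference, both metrics are edit-distance ratios: WER divides the number of word-level substitutions, deletions, and insertions by the number of reference words, and CER does the same over characters. The sketch below is a minimal pure-Python illustration for a single utterance pair. The helper names (`edit_distance`, `word_error_rate`, `char_error_rate`) are illustrative only; the notebook itself relies on the `evaluate` implementations.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution (0 if tokens match)
            ))
        prev = curr
    return prev[-1]


def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level edits divided by the number of reference words."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("WER is undefined for an empty reference")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def char_error_rate(ref: str, hyp: str) -> float:
    """Character-level edits divided by the number of reference characters."""
    if not ref:
        raise ValueError("CER is undefined for an empty reference")
    return edit_distance(list(ref), list(hyp)) / len(ref)


# One substituted word out of two; one substituted character out of five.
print(word_error_rate("a kid", "a kit"))
print(char_error_rate("a kid", "a kit"))
```

Note that the corpus-level scores reported later pool errors and reference lengths over all utterances, so they are not the mean of per-utterance rates like the ones computed here.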

In [None]:
!pip install "numpy<2.0.0" \
            "transformers==4.45.2" \
            "datasets==2.20.0" \
            "evaluate==0.4.2" \
            "huggingface_hub==0.26.2" \
            "soundfile==0.12.1" \
            "wandb<0.18"

Collecting numpy<2.0.0
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting transformers==4.45.2
  Downloading transformers-4.45.2-py3-none-any.whl.metadata (44 kB)
Collecting datasets==2.20.0
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate==0.4.2
  Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
Collecting huggingface_hub==0.26.2
  Downloading huggingface_hub-0.26.2-py3-none-any.whl.metadata (13 kB)
Collecting soundfile==0.12.1
  Downloading soundfile-0.12.1-py2.py3-none-manylinux_2_31_x86_64.whl.metadata (14 kB)
Collecting wandb<0.18
  Downloading wandb-0.17.9-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.

In [None]:
import numpy as np

print("NumPy version:", np.__version__)

NumPy version: 1.26.4


### GPU Verification
Confirm CUDA availability for inference acceleration.

### Additional Dependencies
Install evaluation metrics and remove peft to avoid import conflicts.

### Environment Setup
Import libraries, mount Google Drive for data access, and configure device.

In [None]:
!nvidia-smi

# Pin versions to avoid the trainer/peft/transformers mismatch hell
# !pip -q install transformers==4.45.2 datasets==2.20.0 evaluate==0.4.2 \
#                huggingface_hub==0.26.2 soundfile==0.12.1 wandb==0.17.0

!pip -q install evaluate==0.4.2 soundfile==0.12.1
# We don't need PEFT here, but uninstall it so nothing tries to import it.
!pip -q uninstall -y peft

import os, re, io, subprocess, shutil, random
from pathlib import Path

import numpy as np
import pandas as pd
import soundfile as sf
import torch
from tqdm import tqdm
import evaluate

from google.colab import drive
drive.mount("/content/drive")

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

Wed Dec 10 22:34:35 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P0             44W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

### Configure Data Paths
Set up paths to:
- Preprocessed manifests from Google Drive
- OCSC audio files (downloaded from HuggingFace)

### Load Preprocessed Manifests
Copy the latest manifest CSVs from Drive and load train/dev/test splits into DataFrames.

In [None]:
# Base paths
BASE_DRIVE = Path("/content/drive/MyDrive")
CONV_ROOT  = BASE_DRIVE / "ocsc_converted"   # from Notebook A
DATA       = Path("/content/data")
RAW        = DATA / "ocsc_raw"
AUDIO_ROOT = RAW / "Eng-NA" / "OCSC"

DATA.mkdir(parents=True, exist_ok=True)
RAW.mkdir(parents=True, exist_ok=True)

print("CONV_ROOT:", CONV_ROOT)
print("AUDIO_ROOT:", AUDIO_ROOT)

# Pick latest converted manifest folder
stamps = [p for p in CONV_ROOT.iterdir() if p.is_dir()]
assert stamps, f"No converted folders found under {CONV_ROOT}"
STAMP_DIR = sorted(stamps, key=lambda p: p.name)[-1]
print("Using manifests from:", STAMP_DIR)

MANI = DATA / "manifests"
MANI.mkdir(parents=True, exist_ok=True)
for name in ["ocsc_train.csv", "ocsc_dev.csv", "ocsc_test.csv", "ocsc_manifest_utterances.csv"]:
    src = STAMP_DIR / name
    if src.exists():
        shutil.copy2(src, MANI / name)

df_train = pd.read_csv(MANI / "ocsc_train.csv")
df_dev   = pd.read_csv(MANI / "ocsc_dev.csv")
df_test  = pd.read_csv(MANI / "ocsc_test.csv")

print("Train rows:", len(df_train))
print("Dev rows:",   len(df_dev))
print("Test rows:",  len(df_test))
print("Dev columns:", df_dev.columns.tolist())
df_dev.head()

CONV_ROOT: /content/drive/MyDrive/ocsc_converted
AUDIO_ROOT: /content/data/ocsc_raw/Eng-NA/OCSC
Using manifests from: /content/drive/MyDrive/ocsc_converted/20251201-213109
Train rows: 96908
Dev rows: 11230
Test rows: 25408
Dev columns: ['session_id', 'age_folder', 'audio_path', 'cha_path', 'speaker_id', 'age_years', 'age_bucket', 'task', 'start_s', 'end_s', 'text', 'norm_text', 'dur_s']


Unnamed: 0,session_id,age_folder,audio_path,cha_path,speaker_id,age_years,age_bucket,task,start_s,end_s,text,norm_text,dur_s
0,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,IntroRobot,9.835,10.478,hi . 9835_10478,hi,0.643
1,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,IntroRobot,31.617,32.736,Margaret . 31617_32736,margaret,1.119
2,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,IntroRobot,46.921,48.127,yellow . 46921_48127,yellow,1.206
3,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,Alphabet,66.37,67.501,apple . 66370_67501,apple,1.131
4,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,Alphabet,68.048,71.279,a kid . 68048_71279,a kid,3.231


In [None]:
from huggingface_hub import snapshot_download

if not AUDIO_ROOT.exists() or not any(AUDIO_ROOT.rglob("*.wav")):
    print("Downloading OCSC audio tree from NolanChai/childes-ocsc...")
    snapshot_download(
        repo_id="NolanChai/childes-ocsc",
        repo_type="dataset",
        local_dir=str(RAW),
        # Deprecated and ignored by recent huggingface_hub releases; it only
        # triggers the deprecation notice seen in the output below.
        local_dir_use_symlinks=False,
    )

print("Audio root exists:", AUDIO_ROOT.exists())

Downloading OCSC audio tree from NolanChai/childes-ocsc...


For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Fetching 305 files:   0%|          | 0/305 [00:00<?, ?it/s]

Audio root exists: True


### Audio Loading Utilities

**`resolve_audio_path`**: Robustly resolve manifest paths to actual audio files, handling path mismatches between preprocessing and evaluation environments.

**`load_clip_ffmpeg`**: Extract audio segments using ffmpeg:
- Slice by start/end timestamps
- Resample to 16kHz mono
- Return as PyTorch tensor

In [None]:
AUDIO_EXTS = [".wav", ".mp3"]

def resolve_audio_path(p_str: str) -> str:
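    """
    Resolve a manifest audio path to an existing file.

    Tries, in order: the path as given, the same relative path under
    AUDIO_ROOT (including alternate extensions), alternate extensions next
    to the original path, and finally a recursive search of AUDIO_ROOT by
    file stem. Raises FileNotFoundError if nothing matches.
    """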
    p = Path(p_str)
    if p.exists():
        return str(p)

    m = re.search(r"(Eng-NA/OCSC/.+)$", str(p))
    if m:
        rel = Path(m.group(1)).relative_to("Eng-NA/OCSC")
        cand = AUDIO_ROOT / rel
        if cand.exists():
            return str(cand)
        for ext in AUDIO_EXTS:
            alt = cand.with_suffix(ext)
            if alt.exists():
                return str(alt)

    for ext in AUDIO_EXTS:
        alt = p.with_suffix(ext)
        if alt.exists():
            return str(alt)

    stem = p.stem
    hits = list(AUDIO_ROOT.rglob(f"{stem}.*"))
    for h in hits:
        if h.suffix.lower() in AUDIO_EXTS:
            return str(h)

    raise FileNotFoundError(f"Audio not found for: {p_str}")

def load_clip_ffmpeg(path: str, start_s: float, end_s: float, target_sr: int = 16000):
    """
    Slice [start_s, end_s) with ffmpeg into memory and return (waveform_tensor, sr).
    Mono float32 @ target_sr.
    """
    path = resolve_audio_path(path)
    dur = max(0.01, float(end_s) - float(start_s))

    cmd = [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-ss", f"{float(start_s):.3f}",
        "-i", path,
        "-t", f"{dur:.3f}",
        "-ac", "1", "-ar", str(target_sr),
        "-f", "wav", "pipe:1",
    ]
    out = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
    data, sr = sf.read(io.BytesIO(out.stdout), dtype="float32", always_2d=False)
    if data.ndim > 1:
        data = data.mean(axis=1)
    return torch.from_numpy(data), sr

## Load Models

### Baseline Model (Zero-Shot)
Load pretrained Whisper-Medium English without any fine-tuning.

### Fine-Tuned Model
Load our fine-tuned checkpoint from W&B artifacts. This model was trained for ~0.25 epochs on OCSC.

In [None]:
import wandb

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor

BASE_MODEL_ID = "openai/whisper-medium.en"

# ----- Baseline Whisper -----
baseline_model = WhisperForConditionalGeneration.from_pretrained(BASE_MODEL_ID)
baseline_proc  = WhisperProcessor.from_pretrained(
    BASE_MODEL_ID,
    language="en",
    task="transcribe",
)
baseline_model.to(device)
baseline_model.eval()

print("Baseline model loaded:", BASE_MODEL_ID)

# ----- Fine-tuned from W&B artifact -----
import wandb
wandb.login()

api = wandb.Api()

# TODO: set this to the artifact that corresponds to your best checkpoint
FT_ARTIFACT_ID = "noulan/ocsc-whisper/model-whisper-medium-ocsc-ft-20251203-154530-phase:v2"

ft_ckpt_root = Path("/content/wandb_checkpoints")
ft_ckpt_root.mkdir(exist_ok=True)

ft_artifact = api.artifact(FT_ARTIFACT_ID, type="model")
ft_dir = Path(ft_artifact.download(root=str(ft_ckpt_root)))

print("Fine-tuned checkpoint dir:", ft_dir)
!ls -R "$ft_dir"

ft_model = WhisperForConditionalGeneration.from_pretrained(str(ft_dir))
ft_proc  = WhisperProcessor.from_pretrained(
    BASE_MODEL_ID,
    language="en",
    task="transcribe",
)
ft_model.to(device)
ft_model.eval()

# Remove any forced prompts / suppressions if you disabled them in training
ft_model.config.forced_decoder_ids = None
ft_model.config.suppress_tokens = []

print("Fine-tuned model loaded from artifact:", FT_ARTIFACT_ID)

Baseline model loaded: openai/whisper-medium.en


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Downloading large artifact model-whisper-medium-ocsc-ft-20251203-154530-phase:v2, 8732.59MB. 14 files...
wandb:   14 of 14 files downloaded.
Done. 0:5:47.2


Fine-tuned checkpoint dir: /content/wandb_checkpoints
/content/wandb_checkpoints:
added_tokens.json	normalizer.json		 tokenizer_config.json
config.json		optimizer.pt		 trainer_state.json
generation_config.json	rng_state.pth		 training_args.bin
merges.txt		scheduler.pt		 vocab.json
model.safetensors	special_tokens_map.json
Fine-tuned model loaded from artifact: noulan/ocsc-whisper/model-whisper-medium-ocsc-ft-20251203-154530-phase:v2


### Initialize Evaluation Metrics
Load WER and CER metrics from HuggingFace evaluate library.

### Configure Column Names
Map manifest column names for audio paths, timestamps, transcriptions, and metadata.

### Evaluation Function

**`run_eval_on_df`**: Run inference on a DataFrame subset and compute WER/CER.
- Batched inference for efficiency
- Returns predictions, references, and aggregate metrics

In [None]:
!pip install jiwer

Collecting jiwer
  Downloading jiwer-4.0.0-py3-none-any.whl.metadata (3.3 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.14.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Downloading jiwer-4.0.0-py3-none-any.whl (23 kB)
Downloading rapidfuzz-3.14.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (3.2 MB)
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-4.0.0 rapidfuzz-3.14.3


In [None]:
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Column names based on your manifests from Notebook A
AUDIO_COL = "audio_path"
START_COL = "start_s"
END_COL   = "end_s"
TEXT_COL  = "norm_text"

AGE_COL   = "age_years" if "age_years" in df_dev.columns else None
TASK_COL  = "task" if "task" in df_dev.columns else (
    "task_tag" if "task_tag" in df_dev.columns else None
)
print("Using columns:",
      "AUDIO_COL =", AUDIO_COL,
      "START_COL =", START_COL,
      "END_COL   =", END_COL,
      "TEXT_COL  =", TEXT_COL,
      "AGE_COL   =", AGE_COL,
      "TASK_COL  =", TASK_COL)

def run_eval_on_df(model, processor, df, max_items=None, batch_size=4, desc="Eval"):
    """
    Evaluate WER/CER for a given model+processor on a subset of df.
    Assumes df has: AUDIO_COL, START_COL, END_COL, TEXT_COL.
    Returns: dict with wer, cer, preds, refs.
    """
    model.eval()
    indices = list(range(len(df)))
    if max_items is not None:
        indices = indices[:max_items]

    preds, refs = [], []

    for i in tqdm(range(0, len(indices), batch_size), desc=desc):
        idxs = indices[i:i + batch_size]
        waves = []
        texts = []

        for j in idxs:
            row = df.iloc[j]
            y, sr = load_clip_ffmpeg(
                row[AUDIO_COL],
                float(row[START_COL]),
                float(row[END_COL]),
            )
            waves.append(y.numpy())
            texts.append(row[TEXT_COL])

        # Prepare input features
        inputs = processor.feature_extractor(
            waves,
            sampling_rate=sr,
            return_tensors="pt",
        ).to(device)

        with torch.no_grad():
            gen_ids = model.generate(
                inputs.input_features,
                max_length=225,
            )

        pred_str = processor.tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
        preds.extend(pred_str)
        refs.extend(texts)

    wer = wer_metric.compute(predictions=preds, references=refs)
    cer = cer_metric.compute(predictions=preds, references=refs)
    return {"wer": wer, "cer": cer, "preds": preds, "refs": refs}


Downloading builder script: 0.00B [00:00, ?B/s]

Using columns: AUDIO_COL = audio_path START_COL = start_s END_COL   = end_s TEXT_COL  = norm_text AGE_COL   = age_years TASK_COL  = task


## Run Main Evaluation

Compare baseline vs fine-tuned model on the dev set. Report overall WER and CER.

In [None]:
# Choose evaluation split
df_eval = df_dev   # or df_test

# Limit utterances if needed (None = use full set)
MAX_ITEMS = None   # e.g. 2000 for a quick-ish run

print("=== Baseline (openai/whisper-medium.en) on dev ===")
base_res = run_eval_on_df(
    baseline_model,
    baseline_proc,
    df_eval,
    max_items=MAX_ITEMS,
    batch_size=4,
    desc="Baseline eval",
)
print(f"Baseline WER: {base_res['wer']:.4f} | CER: {base_res['cer']:.4f}")


=== Baseline (openai/whisper-medium.en) on dev ===


Baseline eval:   0%|          | 0/2808 [00:00<?, ?it/s]
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Baseline eval: 100%|██████████| 2808/2808 [1:05:24<00:00,  1.40s/it]


Baseline WER: 0.6576 | CER: 0.4146

=== Fine-tuned (OCSC) on dev ===


Fine-tuned eval:   2%|▏         | 61/2808 [05:53<4:25:26,  5.80s/it]


KeyboardInterrupt: 

In [None]:
print("\n=== Fine-tuned (OCSC) on dev ===")
ft_res = run_eval_on_df(
    ft_model,
    ft_proc,
    df_eval,
    max_items=4000,
    batch_size=4,
    desc="Fine-tuned eval",
)
print(f"Fine-tuned WER: {ft_res['wer']:.4f} | CER: {ft_res['cer']:.4f}")


=== Fine-tuned (OCSC) on dev ===


Fine-tuned eval: 100%|██████████| 1000/1000 [1:42:06<00:00,  6.13s/it]


Fine-tuned WER: 30.0023 | CER: 29.8831


## Stratified Analysis

### WER by Age Group
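Break down the fine-tuned model's WER/CER by child age bucket (4–5, 6–7, 8–9) to see which age groups it handles best. Only the fine-tuned model is evaluated here, and each bucket is capped at 1000 utterances.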
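As a rough sketch, the same bucketing can be done in one vectorized call. The example below uses `pd.cut` on a made-up `ages` series (illustrative values, not OCSC data) with left-closed bins that mirror the `< 6` and `< 8` thresholds of `bucket_age`; missing ages come out as `NaN` rather than `None`.

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only (not taken from the OCSC manifests).
ages = pd.Series([4.0, 5.0, 6.0, 7.0, 8.0, 9.0, np.nan])

# Left-closed bins: [-inf, 6) -> "4–5", [6, 8) -> "6–7", [8, inf) -> "8–9".
age_bucket = pd.cut(
    ages,
    bins=[-np.inf, 6, 8, np.inf],
    right=False,
    labels=["4–5", "6–7", "8–9"],
)

print(age_bucket.tolist())
```

Note that, like `bucket_age`, this assigns any age of 8 or above to the "8–9" bucket.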

In [None]:
def bucket_age(age):
  if pd.isna(age):
    return None
  if age < 6:
    return "4–5"
  elif age < 8:
    return "6–7"
  else:
    return "8–9"

if AGE_COL is not None:
  df_eval_age = df_eval.copy()
  df_eval_age["age_bucket"] = df_eval_age[AGE_COL].apply(bucket_age)

  buckets = [b for b in df_eval_age["age_bucket"].unique() if b is not None]
  print("\n=== WER by age bucket (fine-tuned model) ===")
  per_age = []

  for b in sorted(buckets):
    sub = df_eval_age[df_eval_age["age_bucket"] == b]
    if len(sub) < 50:
      continue
    res_b = run_eval_on_df(
      ft_model,
      ft_proc,
      sub,
      max_items=1000,
      batch_size=4,
      desc=f"Age {b}",
    )
    # n is the full bucket size; at most max_items=1000 rows of it are scored.
    print(f"Age {b}: WER={res_b['wer']:.4f}, CER={res_b['cer']:.4f}, n={len(sub)}")
    per_age.append((b, res_b["wer"], res_b["cer"], len(sub)))
else:
  print("No age column found; skipping age-bucket analysis.")



=== WER by age bucket (fine-tuned model) ===


Age 4–5: 100%|██████████| 250/250 [24:44<00:00,  5.94s/it]


Age 4–5: WER=36.4588, CER=36.4724, n=1581


Age 6–7: 100%|██████████| 250/250 [25:15<00:00,  6.06s/it]


Age 6–7: WER=27.1088, CER=26.7280, n=6142


Age 8–9: 100%|██████████| 250/250 [24:44<00:00,  5.94s/it]

Age 8–9: WER=28.7736, CER=28.8638, n=3507





### WER by Task Type
Analyze the fine-tuned model's performance across elicitation tasks (e.g., Alphabet, Numbers, Reading). Tasks with fewer than 50 utterances are skipped.

In [None]:
if TASK_COL is not None:
  df_eval_task = df_eval.copy()
  tasks = sorted(df_eval_task[TASK_COL].dropna().unique())
  print(f"\n=== WER by task ({TASK_COL}) – fine-tuned model ===")
  per_task = []

  for tname in tasks:
    sub = df_eval_task[df_eval_task[TASK_COL] == tname]
    if len(sub) < 50:  # avoid tiny samples
      continue
    res_t = run_eval_on_df(
      ft_model,
      ft_proc,
      sub,
      max_items=1000,
      batch_size=4,
      desc=f"Task {tname}",
    )
    # n is the full task size; at most max_items=1000 rows of it are scored.
    print(f"Task {tname:20s} WER={res_t['wer']:.4f}, CER={res_t['cer']:.4f}, n={len(sub)}")
    per_task.append((tname, res_t["wer"], res_t["cer"], len(sub)))
else:
  print("No task column found; skipping task-level WER.")



=== WER by task (task) – fine-tuned model ===


Task Alphabet: 100%|██████████| 250/250 [23:53<00:00,  5.74s/it]


Task Alphabet             WER=51.1472, CER=57.3703, n=1262


Task DescriptivePictures: 100%|██████████| 159/159 [16:01<00:00,  6.05s/it]


Task DescriptivePictures  WER=20.1381, CER=19.3251, n=636


Task ExpPictures: 100%|██████████| 250/250 [24:56<00:00,  5.99s/it]


Task ExpPictures          WER=24.0168, CER=23.9991, n=2856


Task HowTo: 100%|██████████| 250/250 [25:39<00:00,  6.16s/it]


Task HowTo                WER=21.4687, CER=20.7095, n=2871


Task IntroRobot: 100%|██████████| 14/14 [01:18<00:00,  5.63s/it]


Task IntroRobot           WER=48.5962, CER=44.9955, n=55


Task Numbers: 100%|██████████| 250/250 [24:49<00:00,  5.96s/it]


Task Numbers              WER=64.9941, CER=59.5793, n=1385


Task Reading: 100%|██████████| 250/250 [24:03<00:00,  5.77s/it]


Task Reading              WER=10.8035, CER=10.2649, n=1003


Task Wug: 100%|██████████| 250/250 [25:23<00:00,  6.10s/it]

Task Wug                  WER=57.5899, CER=63.5034, n=1137





## Save Predictions
Export detailed predictions for offline error analysis.

In [None]:
n_ft = len(ft_res["preds"])
print(n_ft)

4000


In [None]:
dev_preds_df = df_eval.iloc[:n_ft].copy()

In [None]:
dev_preds_df["baseline_pred"]  = base_res["preds"][:n_ft]
dev_preds_df["ft_pred"]        = ft_res["preds"]
dev_preds_df["reference_text"] = ft_res["refs"]

In [None]:
out_csv = OUT_ANALYSIS_DIR / "ocsc_dev_baseline_vs_ft_subset.csv"
dev_preds_df.to_csv(out_csv, index=False)
print("Saved detailed predictions to:", out_csv)

Saved detailed predictions to: /content/drive/MyDrive/ocsc_eval_outputs/ocsc_dev_baseline_vs_ft_subset.csv


---
# Error Analysis

Analyze prediction errors to understand where fine-tuning helps or hurts.

### Load Predictions and Normalize Text
Apply Whisper's `BasicTextNormalizer` to references and predictions so that casing and punctuation differences do not count as errors.

In [None]:
import pandas as pd
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

In [None]:
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")
normalizer = BasicTextNormalizer()

In [None]:
def norm(s: str) -> str:
  if not isinstance(s, str):
    return ""
  return normalizer(s).strip()

In [None]:
df = pd.read_csv("/content/drive/MyDrive/ocsc_eval_outputs/ocsc_dev_baseline_vs_ft_subset.csv")

In [None]:
# normalize text
df["ref_norm"] = df["reference_text"].astype(str).map(norm)
df["baseline_norm"] = df["baseline_pred"].astype(str).map(norm)
df["ft_norm"] = df["ft_pred"].astype(str).map(norm)

### Compute Per-Utterance WER
Calculate WER for each utterance to identify improvements and regressions.
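To make the per-utterance numbers easier to read, here is a minimal pure-Python sketch of word-level WER: edit distance between word sequences divided by the number of reference words. The helper names `word_errors` and `simple_wer` are hypothetical and not part of this notebook, which uses `evaluate` and `jiwer`. The sketch illustrates why WER can exceed 1.0 when the hypothesis contains many inserted words.

```python
def word_errors(ref: str, hyp: str) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances against an empty reference prefix
    for i, rw in enumerate(r, start=1):
        curr = [i] + [0] * len(h)
        for j, hw in enumerate(h, start=1):
            curr[j] = min(
                prev[j] + 1,               # delete a reference word
                curr[j - 1] + 1,           # insert a hypothesis word
                prev[j - 1] + (rw != hw),  # match or substitution
            )
        prev = curr
    return prev[-1]


def simple_wer(ref: str, hyp: str) -> float:
    """Errors divided by reference length; undefined for an empty reference."""
    return word_errors(ref, hyp) / len(ref.split())


# Nine inserted words against a seven-word reference give a WER above 1.0.
ref = "but we don t have it anymore"
hyp = "we used to have a garden in our backyard but we don t have it anymore"
print(f"{simple_wer(ref, hyp):.2f}")
```

Because the denominator is the reference length, a one-word reference paired with a long repeated-token hypothesis can produce a WER in the tens or hundreds.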

In [None]:
def per_utt_wer(ref, hyp):
  return wer_metric.compute(predictions=[hyp], references=[ref])

In [None]:
def per_utt_cer(ref, hyp):
  return cer_metric.compute(predictions=[hyp], references=[ref])

In [None]:
df["wer_baseline"] = [
  per_utt_wer(r, h) for r, h in zip(df["ref_norm"], df["baseline_norm"])
]

In [None]:
df["wer_ft"] = [
  per_utt_wer(r, h) for r, h in zip(df["ref_norm"], df["ft_norm"])
]

In [None]:
df["delta_wer"] = df["wer_ft"] - df["wer_baseline"]

### Categorize Changes
Split utterances into improved, worsened, and unchanged categories, treating an absolute WER change of 0.05 or less as unchanged.

In [None]:
improved   = df[df["delta_wer"] < -0.05]
worsened   = df[df["delta_wer"] >  0.05]
unchanged  = df[(df["delta_wer"] >= -0.05) & (df["delta_wer"] <= 0.05)]

In [None]:
print("Improved n:", len(improved))
print("Worsened n:", len(worsened))
print("Unchanged n:", len(unchanged))

Improved n: 208
Worsened n: 3213
Unchanged n: 579


### Example Comparisons
Show examples where fine-tuning improved or worsened transcription quality.

In [None]:
def show_examples(subset, n=10, title=""):
  print(f"\n=== {title} (n={len(subset)}) ===")
  for _, row in subset.sample(min(n, len(subset)), random_state=0).iterrows():
    print("\nREF:", row["ref_norm"])
    print("BASE:", row["baseline_norm"])
    print("FT:  ", row["ft_norm"])
    print(f"WER_base={row['wer_baseline']:.2f} | WER_ft={row['wer_ft']:.2f}")

In [None]:
show_examples(improved, 5, "Examples where FT improved")
show_examples(worsened, 5, "Examples where FT got worse")


=== Examples where FT improved (n=208) ===

REF: but we don t have it anymore
BASE: we used to have a garden in our backyard but we don t have it anymore
FT:   but we don t have it anymore
WER_base=1.29 | WER_ft=0.00

REF: because it would eat me
BASE: because they would eat me
FT:   because it would eat me
WER_base=0.20 | WER_ft=0.00

REF: and a lot of asteroids
BASE: and i love asteroids
FT:   and a lot of asteroids
WER_base=0.60 | WER_ft=0.00

REF: light
BASE: night
FT:   light
WER_base=1.00 | WER_ft=0.00

REF: hotdog flavor
BASE: the hot dog flavor
FT:   yock dog flavor
WER_base=1.50 | WER_ft=1.00

=== Examples where FT got worse (n=3213) ===

REF: what is that
BASE: what is that
FT:   what is that xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx

### WER by Age and Task
Aggregate error analysis by metadata dimensions.
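When reading these tables, keep in mind that the mean of per-utterance WER is not the same quantity as the corpus-level WER reported earlier: the mean gives every utterance equal weight, while corpus-level WER pools errors over all reference words. The sketch below uses made-up `(errors, reference_length)` pairs to show how a single short utterance with many insertions can dominate the mean.

```python
# (word errors, reference length in words) per utterance; hypothetical numbers.
utts = [(3, 1), (0, 10), (1, 9)]

# Mean of per-utterance WER: each utterance counts equally.
mean_per_utt = sum(e / n for e, n in utts) / len(utts)

# Pooled (corpus-level) WER: total errors over total reference words.
pooled = sum(e for e, _ in utts) / sum(n for _, n in utts)

print(f"mean per-utterance WER = {mean_per_utt:.3f}")
print(f"pooled WER             = {pooled:.3f}")
```

This is one likely reason the grouped means here can sit well above the overall figures from the main evaluation.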

In [None]:
age_col  = "age_years"  # or age_group if that’s cleaner
task_col = "task"

In [None]:
if age_col in df.columns:
  print("\n=== Mean WER by age ===")
  print(df.groupby(age_col)[["wer_baseline", "wer_ft"]].mean())


=== Mean WER by age ===
           wer_baseline     wer_ft
age_years                         
4.0            0.655339  67.655972
5.0            0.410947  49.403816
6.0            0.602384  52.652957


In [None]:
if task_col in df.columns:
  print("\n=== Mean WER by task ===")
  print(df.groupby(task_col)[["wer_baseline", "wer_ft"]].mean())


=== Mean WER by task ===
                     wer_baseline     wer_ft
task                                        
Alphabet                 0.641432  66.911200
DescriptivePictures      0.470710  36.060445
EndTasks                 0.966667  91.444444
ExpPictures              0.536201  44.401509
HowTo                    0.341898  35.963730
IntroRobot               0.795414  69.709436
Numbers                  0.764501  89.040757
Reading                  0.595852  29.912987
Wug                      0.762762  71.199196


### WER by Utterance Length
Analyze whether shorter or longer utterances are handled better after fine-tuning.

In [None]:
# By utterance length (in words)
df["ref_len_words"] = df["ref_norm"].str.split().apply(len)

print("\n=== Mean WER by length bin ===")
df["len_bin"] = pd.cut(
  df["ref_len_words"],
  bins=[0, 3, 7, 15, 999],
  labels=["1–3", "4–7", "8–15", "16+"],
)

print(df.groupby("len_bin", observed=False)[["wer_baseline", "wer_ft"]].mean())


=== Mean WER by length bin ===
         wer_baseline     wer_ft
len_bin                         
1–3          0.706938  76.067097
4–7          0.326212  20.627742
8–15         0.284245   8.467023
16+          0.199831   4.205973




---
# Clustering Analysis

Use K-means clustering to identify patterns in utterances where fine-tuning changed behavior. This helps discover systematic improvements or regressions.

### Prepare Features for Clustering
Extract numeric features: utterance length, WER values, age.
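As a minimal sketch of the approach, the toy example below standardizes two features and runs K-means with two clusters. The feature values are invented for illustration and only loosely resemble the pattern in this notebook (short references with very high WER versus longer references with lower WER). Standardizing first keeps the large-valued WER column from dominating the distance computation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy feature rows: [ref_len_words, wer_ft]; values are made up.
X_toy = np.array([
    [1, 120.0], [2, 110.0], [1, 130.0],   # short refs, very high WER
    [12, 5.0], [15, 4.0], [14, 6.0],      # long refs, lower WER
])

# Zero-mean, unit-variance scaling per column.
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Fixed seed so the cluster assignment is reproducible.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X_toy_scaled)
print(labels)
```

Cluster ids are arbitrary, so only the grouping matters, not which group is labeled 0 or 1.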

In [None]:
import random
from pathlib import Path

import pandas as pd
from jiwer import wer as jiwer_wer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

In [None]:
BASE_DRIVE = Path("/content/drive/MyDrive")
CSV_PATH = BASE_DRIVE / "ocsc_eval_outputs" / "ocsc_dev_baseline_vs_ft_subset.csv"  # change if needed

df = pd.read_csv(CSV_PATH)
print("Loaded shape:", df.shape)
print("Columns:", df.columns.tolist())
df.head()

Loaded shape: (4000, 16)
Columns: ['session_id', 'age_folder', 'audio_path', 'cha_path', 'speaker_id', 'age_years', 'age_bucket', 'task', 'start_s', 'end_s', 'text', 'norm_text', 'dur_s', 'baseline_pred', 'ft_pred', 'reference_text']


Unnamed: 0,session_id,age_folder,audio_path,cha_path,speaker_id,age_years,age_bucket,task,start_s,end_s,text,norm_text,dur_s,baseline_pred,ft_pred,reference_text
0,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,IntroRobot,9.835,10.478,hi . 9835_10478,hi,0.643,Bye.,bye,hi
1,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,IntroRobot,31.617,32.736,Margaret . 31617_32736,margaret,1.119,Margaret.,smog lit tt tt tt tt tt tt tt tt tt tt tt tt t...,margaret
2,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,IntroRobot,46.921,48.127,yellow . 46921_48127,yellow,1.206,yellow,yellow oo oo oo oo oo oo oo oo oo oo oo oo oo ...,yellow
3,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,Alphabet,66.37,67.501,apple . 66370_67501,apple,1.131,Apple.,apple apple xxx xxx xxx xxx xxx xxx xxx xxx xx...,apple
4,4024,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4024.wav,/content/4024.cha,CHI_4024,4.0,4-5,Alphabet,68.048,71.279,a kid . 68048_71279,a kid,3.231,Okay.,a cake,a kid


### Compute Per-Utterance WER and Filter Changed Utterances

In [None]:
normalizer = BasicTextNormalizer()

In [None]:
def norm_text(s):
  if not isinstance(s, str):
    return ""
  return normalizer(s).strip()

In [None]:
# normalize reference + predictions
df["ref_norm"] = df["reference_text"].astype(str).map(norm_text)
df["baseline_norm"] = df["baseline_pred"].astype(str).map(norm_text)
df["ft_norm"] = df["ft_pred"].astype(str).map(norm_text)

In [None]:
df["ref_len_words"] = df["ref_norm"].str.split().apply(len)
df["baseline_len_words"] = df["baseline_norm"].str.split().apply(len)
df["ft_len_words"] = df["ft_norm"].str.split().apply(len)

In [None]:
print("Example normalized rows:")
df[["reference_text", "ref_norm", "baseline_norm", "ft_norm"]].head()

Example normalized rows:


Unnamed: 0,reference_text,ref_norm,baseline_norm,ft_norm
0,hi,hi,bye,bye
1,margaret,margaret,margaret,smog lit tt tt tt tt tt tt tt tt tt tt tt tt t...
2,yellow,yellow,yellow,yellow oo oo oo oo oo oo oo oo oo oo oo oo oo ...
3,apple,apple,apple,apple apple xxx xxx xxx xxx xxx xxx xxx xxx xx...
4,a kid,a kid,okay,a cake


In [None]:
def per_utt_wer(ref, hyp):
  # jiwer.wer returns WER as a ratio, which can exceed 1.0 when the hypothesis has many insertions
  return jiwer_wer(ref, hyp)

In [None]:
N_ROWS = 500
sub_df = df if N_ROWS is None else df.iloc[:N_ROWS].copy()
print("Computing per-utterance WER (baseline / ft)...")

Computing per-utterance WER (baseline / ft)...


In [None]:
baseline_wers = []
ft_wers = []

In [None]:
for r, b, f in zip(sub_df["ref_norm"], sub_df["baseline_norm"], sub_df["ft_norm"]):
  baseline_wers.append(per_utt_wer(r, b))
  ft_wers.append(per_utt_wer(r, f))

In [None]:
sub_df["wer_baseline"] = baseline_wers
sub_df["wer_ft"] = ft_wers
sub_df["delta_wer"] = sub_df["wer_ft"] - sub_df["wer_baseline"]

In [None]:
print("Overall (mean) WERs on this subset:")
print("Baseline WER:", sub_df["wer_baseline"].mean())
print("FT WER: ", sub_df["wer_ft"].mean())
print("Delta WER: ", sub_df["delta_wer"].mean())

Overall (mean) WERs on this subset:
Baseline WER: 0.654096103896104
FT WER:  67.89072467532468
Delta WER:  67.23662857142858


In [None]:
THRESH = 0.05  # 5 percentage points
changed = sub_df[sub_df["delta_wer"].abs() > THRESH].copy()

print("Total utterances in subset:", len(sub_df))
print("Utterances with |delta_wer| >", THRESH, ":", len(changed))

if len(changed) < 10:
    print("WARNING: very few changed utterances; clustering may be uninformative.")

Total utterances in subset: 500
Utterances with |delta_wer| > 0.05 : 458


### Run K-Means Clustering

In [None]:
# features for clustering
num_features = [
    "ref_len_words",
    "baseline_len_words",
    "ft_len_words",
    "wer_baseline",
    "wer_ft",
    "delta_wer",
]

# add age as a numeric feature when available
if "age_years" in changed.columns:
    num_features.append("age_years")

X = changed[num_features].fillna(0.0).to_numpy()

# standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Feature matrix shape:", X_scaled.shape)

Feature matrix shape: (458, 7)


In [None]:
from sklearn.cluster import KMeans

N_CLUSTERS = 4

if len(changed) < N_CLUSTERS:
  print("Too few examples for", N_CLUSTERS, "clusters; reducing k.")
  N_CLUSTERS = max(1, len(changed))

kmeans = KMeans(n_clusters=N_CLUSTERS, random_state=0, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)

changed["cluster"] = cluster_labels
print("Cluster counts:")
print(changed["cluster"].value_counts().sort_index())

Cluster counts:
cluster
0     93
1    206
2    155
3      4
Name: count, dtype: int64


### Analyze Cluster Characteristics

In [None]:
group_cols = ["wer_baseline", "wer_ft", "delta_wer", "ref_len_words", "baseline_len_words", "ft_len_words"]
if "age_years" in changed.columns:
  group_cols.append("age_years")

print("\n=== Cluster summary (means) ===")
summary = changed.groupby("cluster")[group_cols].mean().round(3)
print(summary)


=== Cluster summary (means) ===
         wer_baseline   wer_ft  delta_wer  ref_len_words  baseline_len_words  \
cluster                                                                        
0               0.391   24.217     23.826          5.527               5.516   
1               0.703  123.942    123.239          1.320               1.403   
2               0.793   37.111     36.318          2.297               2.316   
3               0.500   97.000     96.500          1.500               1.500   

         ft_len_words  age_years  
cluster                           
0             122.989        4.0  
1             154.777        4.0  
2              86.239        4.0  
3             140.000        5.0  


In [None]:
print(f"\n=== Task distribution per cluster ({TASK_COL}) ===")
task_summary = changed.pivot_table(
    index="cluster",
    columns=TASK_COL,
    values="ref_norm",
    aggfunc="count",
    fill_value=0,
)
print(task_summary)


=== Task distribution per cluster (task) ===
task     Alphabet  DescriptivePictures  EndTasks  ExpPictures  HowTo  \
cluster                                                                
0               1                   16         0           24     36   
1              46                    5         1           60     20   
2              36                    8         0           33     19   
3               2                    0         0            0      0   

task     IntroRobot  Numbers  Reading  Wug  
cluster                                     
0                 0        2        1   13  
1                 2       44        1   27  
2                 0       18        0   41  
3                 2        0        0    0  


### Inspect Cluster Examples
View sample utterances from each cluster to understand patterns.
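Many fine-tuned outputs in the data above end in long runs of one repeated token (for example `oo oo oo ...` or `xxx xxx xxx ...`). One possible way to flag such rows while browsing examples is to measure the longest run of identical consecutive tokens. The helper `longest_repeat_run` below is a hypothetical illustration, not part of the original analysis.

```python
from itertools import groupby


def longest_repeat_run(text: str) -> int:
    """Length of the longest run of one identical whitespace-separated token."""
    tokens = text.split()
    # groupby collapses consecutive equal tokens into one group each.
    return max((len(list(g)) for _, g in groupby(tokens)), default=0)


print(longest_repeat_run("yellow oo oo oo oo"))
print(longest_repeat_run("a cake"))
```

A row whose fine-tuned prediction has a run much longer than anything in its reference is a candidate repetition loop.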

In [None]:
random.seed(0)

In [None]:
def show_cluster_examples(cluster_id, n=5):
  sub = changed[changed["cluster"] == cluster_id]
  if sub.empty:
    print(f"\n=== Cluster {cluster_id}: EMPTY ===")
    return

  print(f"\n=== Cluster {cluster_id}: n={len(sub)} ===")
  print(summary.loc[cluster_id])
  if TASK_COL is not None:
    print("Top tasks:")
    print(sub[TASK_COL].value_counts().head(5))

  samples = sub.sample(min(n, len(sub)), random_state=0)

  for idx, row in samples.iterrows():
    print("\n--- Example ---")
    print("REF:", row["ref_norm"])
    print("BASE:", row["baseline_norm"])
    print("FT:  ", row["ft_norm"])
    print(f"WER_base={row['wer_baseline']:.2f} | WER_ft={row['wer_ft']:.2f} | Δ={row['delta_wer']:.2f}")
    if "age_years" in row:
      print("Age:", row["age_years"])
    if TASK_COL is not None:
      print("Task:", row[TASK_COL])

for c in sorted(changed["cluster"].unique()):
  show_cluster_examples(c, n=5)


=== Cluster 0: n=93 ===
wer_baseline            0.391
wer_ft                 24.217
delta_wer              23.826
ref_len_words           5.527
baseline_len_words      5.516
ft_len_words          122.989
age_years               4.000
Name: 0, dtype: float64
Top tasks:
task
HowTo                  36
ExpPictures            24
DescriptivePictures    16
Wug                    13
Numbers                 2
Name: count, dtype: int64

--- Example ---
REF: um one
BASE: you re really good at this this is a slide
FT:   and like oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo o
WER_base=5.00 | WER_ft=56.50 | Δ=51.50
Age: 4.0
Task: Wug

--- Example ---
REF: he s acting like a flyer pilot
BASE: he s acting like a flyer pi