# Cross‑Domain ASR on Kaggle P100 (Whisper + LoRA)  
**Datasets:** LibriSpeech (train-clean-360/dev-clean/test-clean) + Mozilla Common Voice v24 en‑AU (local folder)  
**Goal:** improve **WER/CER** under Kaggle P100 limits using a high‑ROI recipe:

1) LoRA + 8‑bit optimizer  
2) Quality‑aware filtering / curriculum  
3) Speaker‑balanced sampling  
4) SpecAugment  
5) Two‑stage training (LibriSpeech → Common Voice adaptation)  
6) Confidence‑based transcript denoising  
7) Decode‑time tuning (beam/penalties/temperature)

> Run top‑to‑bottom; use `CFG` switches for ablations.


## 0) Imports + Reproducibility + GPU sanity

In [1]:
# Cell 1 — P100-safe environment (fixes: torchvision mismatch, bitsandbytes/triton, and sm_60 kernels)

# IMPORTANT:
# - Tesla P100 = compute capability sm_60
# - PyTorch CUDA 12.8/12.9 wheels may DROP sm_60 support → "no kernel image" errors.
# - Use CUDA 12.6 (cu126) wheels for torch/torchaudio on Kaggle P100.

# Clean potentially conflicting packages
!pip -q uninstall -y torch torchvision torchaudio triton bitsandbytes transformers tokenizers accelerate peft evaluate datasets

# Install PyTorch CUDA 12.6 wheels (P100-compatible)
# NOTE: After this cell finishes, restart the Kaggle session once (Runtime → Restart session),
# then continue from Cell 2.
!pip -q install --no-cache-dir --index-url https://download.pytorch.org/whl/cu126 \
  torch==2.8.0+cu126 torchaudio==2.8.0+cu126

# Install ASR stack (no bitsandbytes; torchvision not required)
!pip -q install --no-cache-dir \
  transformers==4.52.1 \
  datasets==2.20.0 \
  accelerate==0.34.2 \
  evaluate==0.4.2 \
  peft==0.11.1 \
  jiwer librosa soundfile

print("✅ Install complete. NOW restart the session once, then run Cell 2.")


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m821.8/821.8 MB[0m [31m302.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.5/3.5 MB[0m [31m31.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m155.6/155.6 MB[0m [31m252.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
easyocr 1.7.2 requires torchvision>=0.5, which is not installed.
kaggle-environments 1.18.0 requires transformers>=4.33.1, which is not installed.
sentence-transformers 5.1.1 requires transformers<5.0.0,>=4.41.0, which is not installed.
timm 1.0.20 requires torchvision, which is not installed.
fastai 2.8.4 requires torchvision>=0.11, which is not installed.
fastai 2.8.4 requires fastcore<1.9,>=1.8.

In [2]:
# Cell X — P100 runtime sanity + cache locations (run before importing transformers/datasets)

import os, torch

# Avoid accidental torchvision imports inside transformers
os.environ["TRANSFORMERS_NO_TORCHVISION"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Keep ALL caches inside /kaggle/working (counts toward 19.5GB limit, but avoids filling /root)
HF_BASE = "/kaggle/working/hf_cache"
os.environ["HF_HOME"] = HF_BASE
os.environ["HF_DATASETS_CACHE"] = os.path.join(HF_BASE, "datasets")
os.environ["TRANSFORMERS_CACHE"] = os.path.join(HF_BASE, "transformers")
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"

# Optional: better allocator behavior (helps stability on long runs)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# P100-safe attention backend:
# Disable Flash + mem-efficient SDPA kernels; force math SDPA
if torch.cuda.is_available():
    try:
        torch.backends.cuda.enable_flash_sdp(False)
        torch.backends.cuda.enable_mem_efficient_sdp(False)
        torch.backends.cuda.enable_math_sdp(True)
    except Exception as e:
        print("SDPA backend tweak skipped:", e)

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0), "| capability:", torch.cuda.get_device_capability(0))
    print("torch arch list:", torch.cuda.get_arch_list())
    # quick kernel sanity test
    _ = torch.zeros((2,3), device="cuda")
    print("✅ CUDA kernel sanity test passed")
else:
    print("⚠️ CUDA not available")


torch: 2.8.0+cu126 | cuda: 12.6
GPU: Tesla P100-PCIE-16GB | capability: (6, 0)
torch arch list: ['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
✅ CUDA kernel sanity test passed


In [3]:
# Cell 2 — Imports and basic setup

import os, re, math, json, random
from pathlib import Path
import numpy as np
import pandas as pd

import torch
import torchaudio

from datasets import Dataset, DatasetDict
from transformers import (
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    set_seed
)

from jiwer import wer, cer

print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

SEED = 42
set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)


2026-02-19 23:36:59.868964: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1771544220.072213     149 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1771544220.127550     149 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1771544220.632756     149 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1771544220.632842     149 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1771544220.632847     149 computation_placer.cc:177] computation placer alr

Torch: 2.8.0+cu126
CUDA available: True
GPU: Tesla P100-PCIE-16GB


<torch._C.Generator at 0x7be130fa4510>

## 1) Configuration (P100‑friendly defaults)

In [4]:
# Cell 3 — Config switches and limits (edit these first)

from pathlib import Path

CFG = {
    # Explicit dataset roots (NO auto-detection)
    "LIBRISPEECH_ROOT": "/kaggle/input/datasets/pypiahmad/librispeech-asr-corpus",
    "CV_ROOT": "/kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice",

    # Common Voice structure inside CV_ROOT
    "CV_LANG_DIR": "commonvoice-v24_en-AU",
    "CV_AUDIO_DIR": "audio_files",
    "CV_MAIN_CSV": "commonvoice-v24_en-AU.csv",
    "CV_SPLIT_CSV": "commonvoice-v24_en-AU-split.csv",

    # Model (newer)
    "MODEL_NAME": "openai/whisper-small.en",

    # Caps for iteration speed (set None for full data)
    "MAX_LS_TRAIN_ROWS": 60000,
    "MAX_LS_DEV_ROWS": 5000,
    "MAX_CV_TRAIN_ROWS": 25000,
    "MAX_CV_DEV_ROWS": 3000,
    "MAX_CV_TEST_ROWS": 3000,

    # Constraints
    "MIN_AUDIO_SEC": 1.0,
    "MAX_AUDIO_SEC": 20.0,
    "MAX_LABEL_CHARS": 220,

    # ROI feature toggles
    "USE_QUALITY_FILTERING": True,
    "USE_CURRICULUM": True,
    "USE_SPEAKER_BALANCED_SAMPLING": True,
    "USE_SPECAUG": True,
    "USE_LORA": True,
    "USE_8BIT_OPTIM": True,
    "USE_TRANSCRIPT_DENOISE": True,
    "USE_DECODE_TUNING": True,

    "QUALITY_METRIC_MAX_SAMPLES": 12000,
    "DENOISE_SCORE_MAX_SAMPLES": 8000,
    "DENOISE_DROP_FRACTION": 0.12,

    # Stage A (LibriSpeech)
    "STAGE_A_MAX_STEPS": 1200,
    "STAGE_A_LR": 1e-4,

    # Stage B (Common Voice)
    "STAGE_B_MAX_STEPS": 900,
    "STAGE_B_LR": 7e-5,

    # Batch sizing (P100-safe; you may need BS=2 for large-v3-turbo)
    "PER_DEVICE_TRAIN_BS": 2,
    "PER_DEVICE_EVAL_BS": 2,
    "GRAD_ACCUM_STEPS": 4,

    # Generation defaults
    "GEN_BEAMS": 5,
    "GEN_MAX_NEW_TOKENS": 128,
}

CFG

{'LIBRISPEECH_ROOT': '/kaggle/input/datasets/pypiahmad/librispeech-asr-corpus',
 'CV_ROOT': '/kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice',
 'CV_LANG_DIR': 'commonvoice-v24_en-AU',
 'CV_AUDIO_DIR': 'audio_files',
 'CV_MAIN_CSV': 'commonvoice-v24_en-AU.csv',
 'CV_SPLIT_CSV': 'commonvoice-v24_en-AU-split.csv',
 'MODEL_NAME': 'openai/whisper-small.en',
 'MAX_LS_TRAIN_ROWS': 60000,
 'MAX_LS_DEV_ROWS': 5000,
 'MAX_CV_TRAIN_ROWS': 25000,
 'MAX_CV_DEV_ROWS': 3000,
 'MAX_CV_TEST_ROWS': 3000,
 'MIN_AUDIO_SEC': 1.0,
 'MAX_AUDIO_SEC': 20.0,
 'MAX_LABEL_CHARS': 220,
 'USE_QUALITY_FILTERING': True,
 'USE_CURRICULUM': True,
 'USE_SPEAKER_BALANCED_SAMPLING': True,
 'USE_SPECAUG': True,
 'USE_LORA': True,
 'USE_8BIT_OPTIM': True,
 'USE_TRANSCRIPT_DENOISE': True,
 'USE_DECODE_TUNING': True,
 'QUALITY_METRIC_MAX_SAMPLES': 12000,
 'DENOISE_SCORE_MAX_SAMPLES': 8000,
 'DENOISE_DROP_FRACTION': 0.12,
 'STAGE_A_MAX_STEPS': 1200,
 'STAGE_A_LR': 0.0001,
 'STAGE_B_MAX_STEPS': 900,
 'STAGE_B_LR': 7e-

## 2) Auto-detect dataset roots based on your folder structure

In [5]:
# Cell 4 — Resolve dataset roots + auto-find Common Voice CSVs robustly

from pathlib import Path
import os

LIBRISPEECH_ROOT = Path(CFG["LIBRISPEECH_ROOT"])
CV_ROOT = Path(CFG["CV_ROOT"])

print("LIBRISPEECH_ROOT:", LIBRISPEECH_ROOT)
print("CV_ROOT (given):", CV_ROOT)

# ---- LibriSpeech checks ----
assert (LIBRISPEECH_ROOT / "train-clean-360").exists(), "Missing train-clean-360"
assert (LIBRISPEECH_ROOT / "dev-clean").exists(), "Missing dev-clean"
assert (LIBRISPEECH_ROOT / "test-clean").exists(), "Missing test-clean"
print("✅ LibriSpeech structure OK")

# ---- Common Voice: sometimes there is an extra nested directory ----
candidate_cv_roots = [
    CV_ROOT,
    CV_ROOT / "mozilla-commonvoice",
    CV_ROOT / "mozilla_commonvoice",
    CV_ROOT / "commonvoice",
]

REAL_CV_ROOT = None
for r in candidate_cv_roots:
    if (r / CFG["CV_LANG_DIR"]).exists():
        REAL_CV_ROOT = r
        break

assert REAL_CV_ROOT is not None, (
    f"Could not find {CFG['CV_LANG_DIR']} under any of these roots:\n" +
    "\n".join([str(x) for x in candidate_cv_roots])
)

print("✅ REAL_CV_ROOT:", REAL_CV_ROOT)

CV_LANG_PATH = REAL_CV_ROOT / CFG["CV_LANG_DIR"]
CV_AUDIO_ROOT = CV_LANG_PATH / CFG["CV_AUDIO_DIR"]
print("CV_LANG_PATH:", CV_LANG_PATH)
print("CV_AUDIO_ROOT:", CV_AUDIO_ROOT)
assert CV_AUDIO_ROOT.exists(), f"Missing audio folder: {CV_AUDIO_ROOT}"

# ---- Find CSVs (exact name if possible, else search) ----
def find_csv(preferred_name: str, must_contain: list[str], search_roots: list[Path]) -> Path:
    # 1) try exact
    for root in search_roots:
        p = root / preferred_name
        if p.exists():
            return p

    # 2) search all csv files under roots
    matches = []
    for root in search_roots:
        for p in root.rglob("*.csv"):
            name = p.name.lower()
            ok = all(s.lower() in name for s in must_contain)
            if ok:
                matches.append(p)

    if not matches:
        # Print helpful debug listing
        print("\n--- CSV files found under REAL_CV_ROOT (top 50) ---")
        all_csv = list(REAL_CV_ROOT.rglob("*.csv"))[:50]
        for p in all_csv:
            print(p)
        raise FileNotFoundError(
            f"Could not locate a CSV for '{preferred_name}' with must_contain={must_contain}"
        )

    # Prefer shortest path (usually closest to root), stable sort
    matches = sorted(matches, key=lambda x: (len(str(x)), str(x)))
    return matches[0]

search_roots = [REAL_CV_ROOT, CV_LANG_PATH]

# Main CSV (non-split): must include en-au and not necessarily "split"
main_csv_path = find_csv(
    CFG["CV_MAIN_CSV"],
    must_contain=["en-au"],     # adjust if needed
    search_roots=search_roots
)

# Split CSV: must include split + en-au
split_csv_path = find_csv(
    CFG["CV_SPLIT_CSV"],
    must_contain=["split", "en-au"],
    search_roots=search_roots
)

CFG["REAL_CV_ROOT"] = str(REAL_CV_ROOT)
CFG["CV_MAIN_CSV_PATH"] = str(main_csv_path)
CFG["CV_SPLIT_CSV_PATH"] = str(split_csv_path)
CFG["CV_AUDIO_ROOT"] = str(CV_AUDIO_ROOT)

print("✅ CV_MAIN_CSV_PATH:", CFG["CV_MAIN_CSV_PATH"])
print("✅ CV_SPLIT_CSV_PATH:", CFG["CV_SPLIT_CSV_PATH"])
print("✅ CV_AUDIO_ROOT:", CFG["CV_AUDIO_ROOT"])

LIBRISPEECH_ROOT: /kaggle/input/datasets/pypiahmad/librispeech-asr-corpus
CV_ROOT (given): /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice
✅ LibriSpeech structure OK
✅ REAL_CV_ROOT: /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice
CV_LANG_PATH: /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice/commonvoice-v24_en-AU
CV_AUDIO_ROOT: /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice/commonvoice-v24_en-AU/audio_files
✅ CV_MAIN_CSV_PATH: /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice/commonvoice-v24_en-AU/commonvoice-v24_en-AU.csv
✅ CV_SPLIT_CSV_PATH: /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice/commonvoice-v24_en-AU/commonvoice-v24_en-AU-split.csv
✅ CV_AUDIO_ROOT: /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice/commonvoice-v24_en-AU/audio_files


## 3) LibriSpeech parsing → dataframe (speaker_id, text, audio_path, split)

In [6]:
# Cell 5 — Parse LibriSpeech transcripts (.trans.txt) + audio paths

def parse_librispeech_split(split_dir: Path, split_name: str) -> pd.DataFrame:
    rows = []
    trans_files = list(split_dir.rglob("*.trans.txt"))
    for tf in trans_files:
        try:
            with open(tf, "r", encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    parts = line.split(" ", 1)
                    if len(parts) != 2:
                        continue
                    utt_id, text = parts
                    audio_path = tf.parent / f"{utt_id}.flac"
                    if not audio_path.exists():
                        continue
                    speaker_id = tf.parent.parent.name
                    chapter_id = tf.parent.name
                    rows.append({
                        "dataset": "librispeech",
                        "split": split_name,
                        "speaker_id": str(speaker_id),
                        "chapter_id": str(chapter_id),
                        "utt_id": utt_id,
                        "text": text,
                        "audio_path": str(audio_path),
                    })
        except Exception as e:
            print("Failed reading", tf, e)
    return pd.DataFrame(rows)

ls_train = parse_librispeech_split(Path(LIBRISPEECH_ROOT) / "train-clean-360", "train")
ls_dev   = parse_librispeech_split(Path(LIBRISPEECH_ROOT) / "dev-clean", "dev")
ls_test  = parse_librispeech_split(Path(LIBRISPEECH_ROOT) / "test-clean", "test")

print(ls_train.shape, ls_dev.shape, ls_test.shape)

# Caps for speed
if CFG["MAX_LS_TRAIN_ROWS"]:
    ls_train = ls_train.sample(n=min(CFG["MAX_LS_TRAIN_ROWS"], len(ls_train)), random_state=SEED)
if CFG["MAX_LS_DEV_ROWS"]:
    ls_dev = ls_dev.sample(n=min(CFG["MAX_LS_DEV_ROWS"], len(ls_dev)), random_state=SEED)

ls_df = pd.concat([ls_train, ls_dev, ls_test], ignore_index=True)
ls_df.head()


(104014, 7) (2703, 7) (2620, 7)


Unnamed: 0,dataset,split,speaker_id,chapter_id,utt_id,text,audio_path
0,librispeech,train,6160,44912,6160-44912-0071,WHO COULD SPEAK FRENCH AND WHO HAD LEARNED GER...,/kaggle/input/datasets/pypiahmad/librispeech-a...
1,librispeech,train,7276,284424,7276-284424-0036,BUTTON BRIGHT SHOOK HIS HEAD A BOAT CAN'T LAND...,/kaggle/input/datasets/pypiahmad/librispeech-a...
2,librispeech,train,359,133630,359-133630-0007,AND WANTED TO GAIN THE EXPERIENCE AND NOW THE ...,/kaggle/input/datasets/pypiahmad/librispeech-a...
3,librispeech,train,5935,43322,5935-43322-0009,HERE IS THE ORDER OF WORSHIP FOR THE FEAST OF ...,/kaggle/input/datasets/pypiahmad/librispeech-a...
4,librispeech,train,3790,140725,3790-140725-0031,BUT ONE OF THE HELPERS GOT THE KEYS FROM MISSU...,/kaggle/input/datasets/pypiahmad/librispeech-a...


## 4) Common Voice parsing → dataframe (uses split CSV + audio_files)

In [7]:
# Cell 6 — Load Common Voice CSVs (robust) + build paths + ensure speaker_id + create split

from pathlib import Path
import pandas as pd
import numpy as np

# Resolve paths (uses CFG["CV_MAIN_CSV_PATH"] etc. if already set)
def resolve_cv_paths(CFG):
    CV_ROOT = Path(CFG["CV_ROOT"])
    lang_dir = CFG["CV_LANG_DIR"]

    candidates = [CV_ROOT, CV_ROOT / "mozilla-commonvoice", CV_ROOT / "mozilla_commonvoice", CV_ROOT / "commonvoice", CV_ROOT / "data"]
    real_root = None
    for r in candidates:
        if (r / lang_dir).exists():
            real_root = r
            break
    if real_root is None:
        raise FileNotFoundError(f"Could not find '{lang_dir}' under: {candidates}")

    lang_path = real_root / lang_dir
    audio_root = lang_path / CFG["CV_AUDIO_DIR"]
    if not audio_root.exists():
        raise FileNotFoundError(f"Missing audio folder: {audio_root}")

    main_csv = CFG.get("CV_MAIN_CSV_PATH")
    split_csv = CFG.get("CV_SPLIT_CSV_PATH")
    if main_csv is None or split_csv is None:
        # If not already resolved, fall back to exact names under lang folder
        main_csv = str(lang_path / CFG["CV_MAIN_CSV"])
        split_csv = str(lang_path / CFG["CV_SPLIT_CSV"])
        if not Path(main_csv).exists() or not Path(split_csv).exists():
            raise FileNotFoundError("CSV paths not found; please ensure CFG paths are correct.")

    CFG["REAL_CV_ROOT"] = str(real_root)
    CFG["CV_AUDIO_ROOT"] = str(audio_root)
    CFG["CV_MAIN_CSV_PATH"] = str(main_csv)
    CFG["CV_SPLIT_CSV_PATH"] = str(split_csv)

    return Path(main_csv), Path(split_csv), Path(audio_root)

cv_main_path, cv_split_path, cv_audio_root = resolve_cv_paths(CFG)
print("CV_MAIN_CSV_PATH:", cv_main_path)
print("CV_SPLIT_CSV_PATH:", cv_split_path)
print("CV_AUDIO_ROOT:", cv_audio_root)

cv_main = pd.read_csv(cv_main_path)
cv_split = pd.read_csv(cv_split_path)

# Merge keys
merge_keys = [k for k in ["path", "client_id", "sentence"] if k in cv_main.columns and k in cv_split.columns]
if not merge_keys:
    if "path" in cv_split.columns and "path" in cv_main.columns:
        merge_keys = ["path"]
    else:
        overlap = [c for c in cv_split.columns if c in cv_main.columns]
        if not overlap:
            raise ValueError("No overlapping columns found to merge CV main and split CSVs.")
        merge_keys = overlap[:1]
        print("⚠️ Using fallback merge key:", merge_keys)

cv = cv_split.merge(cv_main, on=merge_keys, how="left", suffixes=("_split", ""))
cv["dataset"] = "commonvoice"

# Audio filename column
if "path" in cv.columns:
    rel_audio = cv["path"].astype(str)
elif "filename" in cv.columns:
    rel_audio = cv["filename"].astype(str)
else:
    alt = None
    for cand in ["clip", "file", "audio", "wav_filename"]:
        if cand in cv.columns:
            alt = cand
            break
    if alt is None:
        raise ValueError("No audio filename column found (expected 'path' or 'filename').")
    rel_audio = cv[alt].astype(str)

cv["audio_path"] = rel_audio.apply(lambda p: str(Path(cv_audio_root) / p))

# Transcript column
text_col = None
for cand in ["sentence", "text", "transcript"]:
    if cand in cv.columns:
        text_col = cand
        break
assert text_col is not None, "No transcript column found (expected 'sentence' or 'text' or 'transcript')"
cv["text"] = cv[text_col].astype(str)

# ✅ IMPORTANT: create speaker_id BEFORE any split fallback
if "client_id" in cv.columns:
    cv["speaker_id"] = cv["client_id"].astype(str)
elif "speaker_id" in cv.columns:
    cv["speaker_id"] = cv["speaker_id"].astype(str)
else:
    # fallback speaker_id using path prefix (not perfect but prevents crash)
    # Common Voice paths often like: <something>.mp3, so this becomes "unknown"
    cv["speaker_id"] = "unknown"

# Split handling
split_col = None
for cand in ["split", "set", "subset", "partition", "data_split", "group", "fold"]:
    if cand in cv.columns:
        split_col = cand
        break

if split_col is not None:
    cv["split"] = cv[split_col].astype(str).str.lower()
else:
    cols_lower = {c.lower(): c for c in cv.columns}

    def has_cols(*names):
        return all(n in cols_lower for n in names)

    if has_cols("train", "dev", "test"):
        c_train, c_dev, c_test = cols_lower["train"], cols_lower["dev"], cols_lower["test"]
        idx = cv[[c_train, c_dev, c_test]].astype(float).values.argmax(axis=1)
        cv["split"] = np.where(idx == 0, "train", np.where(idx == 1, "dev", "test"))

    elif has_cols("train", "validation", "test"):
        c_train, c_val, c_test = cols_lower["train"], cols_lower["validation"], cols_lower["test"]
        idx = cv[[c_train, c_val, c_test]].astype(float).values.argmax(axis=1)
        cv["split"] = np.where(idx == 0, "train", np.where(idx == 1, "dev", "test"))

    elif has_cols("train", "valid", "test"):
        c_train, c_val, c_test = cols_lower["train"], cols_lower["valid"], cols_lower["test"]
        idx = cv[[c_train, c_val, c_test]].astype(float).values.argmax(axis=1)
        cv["split"] = np.where(idx == 0, "train", np.where(idx == 1, "dev", "test"))

    else:
        # Speaker-safe split (90/5/5) — works now because speaker_id exists
        print("⚠️ No split info found in CSVs. Creating speaker-safe split (90/5/5).")

        speakers = pd.Series(cv["speaker_id"].astype(str).unique())
        speakers = speakers.sample(frac=1.0, random_state=SEED).reset_index(drop=True)

        n = len(speakers)
        n_train = int(0.90 * n)
        n_dev = int(0.05 * n)

        train_spk = set(speakers.iloc[:n_train])
        dev_spk = set(speakers.iloc[n_train:n_train + n_dev])

        def assign_split(spk):
            if spk in train_spk: return "train"
            if spk in dev_spk:   return "dev"
            return "test"

        cv["split"] = cv["speaker_id"].astype(str).map(assign_split)

cv["split"] = cv["split"].replace({"val": "dev", "valid": "dev", "validation": "dev"}).astype(str).str.lower()
print("Common Voice split counts:", cv["split"].value_counts().to_dict())

# Keep cols
keep_cols = ["dataset", "split", "speaker_id", "text", "audio_path"]
for dcol in ["gender", "age", "accent", "locale"]:
    if dcol in cv.columns:
        keep_cols.append(dcol)

cv_df = cv[keep_cols].copy()

# Cap splits
def cap_split(df, split, n):
    if n is None:
        return df
    sub = df[df["split"] == split]
    if len(sub) <= n:
        return df
    sub = sub.sample(n=n, random_state=SEED)
    return pd.concat([df[df["split"] != split], sub], ignore_index=True)

cv_df = cap_split(cv_df, "train", CFG["MAX_CV_TRAIN_ROWS"])
cv_df = cap_split(cv_df, "dev",   CFG["MAX_CV_DEV_ROWS"])
cv_df = cap_split(cv_df, "test",  CFG["MAX_CV_TEST_ROWS"])

cv_df.head(), cv_df["split"].value_counts()

CV_MAIN_CSV_PATH: /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice/commonvoice-v24_en-AU/commonvoice-v24_en-AU.csv
CV_SPLIT_CSV_PATH: /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice/commonvoice-v24_en-AU/commonvoice-v24_en-AU-split.csv
CV_AUDIO_ROOT: /kaggle/input/datasets/eddiehoogewerf/mozilla-commonvoice/commonvoice-v24_en-AU/audio_files
⚠️ No split info found in CSVs. Creating speaker-safe split (90/5/5).
Common Voice split counts: {'train': 52188, 'dev': 2336, 'test': 1149}


(       dataset split                                         speaker_id  \
 0  commonvoice   dev  18bea6bb076cd9638518d93b4af353c3c329d059789e11...   
 1  commonvoice   dev  46979dc7ff629110c30910e33e07360e8a6b1164d886b8...   
 2  commonvoice  test  5164c1f810f1164f347010cf76b6193eda6fcc8ccbe746...   
 3  commonvoice  test  571c11f069b81ffb3e5548a57b0c215c12f4c1d8d64baa...   
 4  commonvoice   dev  85ddb3498519d2676393ceab1451c21bc87b5ff4e3b23a...   
 
                                                 text  \
 0     He has also served in the Chamber of Deputies.   
 1  Severe environmental issues include deforestat...   
 2    Sean has deteriorated to the point of dementia.   
 3  Let us face the door, and welcome in our proph...   
 4  Elephant bells had a marked flare below the kn...   
 
                                           audio_path gender  age locale  
 0  /kaggle/input/datasets/eddiehoogewerf/mozilla-...    NaN  NaN     en  
 1  /kaggle/input/datasets/eddiehoogewerf/mozill

## 5) Standard text normalization (training + evaluation)

In [8]:
# Cell 7 — Normalization utilities (helps CER)

_basic_punct = re.compile(r"[\.,!?;:\"\(\)\[\]\{\}<>\\/\|@#\$%\^&\*_=\+~`]+")

def normalize_text(s: str) -> str:
    s = str(s).strip().lower()
    s = s.replace("’", "'")
    s = _basic_punct.sub(" ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

ls_df["text_norm"] = ls_df["text"].map(normalize_text)
cv_df["text_norm"] = cv_df["text"].map(normalize_text)

ls_df[["text","text_norm"]].head()


Unnamed: 0,text,text_norm
0,WHO COULD SPEAK FRENCH AND WHO HAD LEARNED GER...,who could speak french and who had learned ger...
1,BUTTON BRIGHT SHOOK HIS HEAD A BOAT CAN'T LAND...,button bright shook his head a boat can't land...
2,AND WANTED TO GAIN THE EXPERIENCE AND NOW THE ...,and wanted to gain the experience and now the ...
3,HERE IS THE ORDER OF WORSHIP FOR THE FEAST OF ...,here is the order of worship for the feast of ...
4,BUT ONE OF THE HELPERS GOT THE KEYS FROM MISSU...,but one of the helpers got the keys from missu...


## 6) Duration + quality metrics (silence ratio, clipping, RMS loudness) for filtering/curriculum

In [9]:
# Cell 8 — Compute quality metrics on a sample for speed (FIXED: always creates metric columns)

import numpy as np
import pandas as pd
import librosa
import torchaudio

QUALITY_COLS = ["duration_sec", "silence_ratio", "clipping_rate", "rms_mean_db"]

def load_audio_16k(path: str):
    """Try torchaudio first; fallback to librosa if torchaudio can't decode (e.g., mp3)."""
    try:
        wav, sr = torchaudio.load(path)
        wav = wav.mean(dim=0).numpy()
    except Exception:
        wav, sr = librosa.load(path, sr=None, mono=True)

    if sr != 16000:
        wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
        sr = 16000
    return wav, sr

def compute_quality_metrics(path: str, frame_ms=20, hop_ms=10, silence_db=-40.0):
    try:
        y, sr = load_audio_16k(path)
        if y is None or len(y) < 160:
            return None

        dur = len(y) / sr
        frame = max(int(sr * frame_ms / 1000), 16)
        hop = max(int(sr * hop_ms / 1000), 8)

        rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
        rms_db = 20*np.log10(np.maximum(rms, 1e-10))
        silence_ratio = float(np.mean(rms_db < silence_db))

        clipping_rate = float(np.mean(np.abs(y) >= 0.999))

        rms_mean = float(np.mean(rms))
        rms_mean_db = float(20*np.log10(max(rms_mean, 1e-10)))

        return {
            "duration_sec": float(dur),
            "silence_ratio": silence_ratio,
            "clipping_rate": clipping_rate,
            "rms_mean_db": rms_mean_db
        }
    except Exception:
        return None

def add_quality_columns(df: pd.DataFrame, max_samples: int, seed: int = 42) -> pd.DataFrame:
    """
    Returns a copy of df with QUALITY_COLS filled for a sampled subset.
    IMPORTANT: Always creates QUALITY_COLS even if nothing was computed.
    """
    df = df.copy()

    # ✅ Ensure these columns exist no matter what
    for c in QUALITY_COLS:
        if c not in df.columns:
            df[c] = np.nan

    if len(df) == 0:
        return df

    sample_idx = df.sample(n=min(max_samples, len(df)), random_state=seed).index

    metrics = {}
    for i, idx in enumerate(sample_idx):
        m = compute_quality_metrics(df.loc[idx, "audio_path"])
        if m is not None:
            metrics[idx] = m
        if (i + 1) % 500 == 0:
            print(f"  processed {i+1}/{len(sample_idx)}")

    # ✅ Build mdf with guaranteed columns
    mdf = pd.DataFrame.from_dict(metrics, orient="index")
    for c in QUALITY_COLS:
        if c not in mdf.columns:
            mdf[c] = np.nan

    # assign back
    for c in QUALITY_COLS:
        if len(mdf) > 0:
            df.loc[mdf.index, c] = mdf[c].values

    return df

# Recompute
ls_q = add_quality_columns(ls_df[ls_df["split"].isin(["train","dev"])], CFG["QUALITY_METRIC_MAX_SAMPLES"], SEED)
cv_q = add_quality_columns(cv_df[cv_df["split"].isin(["train","dev"])], CFG["QUALITY_METRIC_MAX_SAMPLES"], SEED)

print("ls_q cols:", ls_q.columns.tolist())
print("cv_q cols:", cv_q.columns.tolist())

# ✅ Your original merge lines now work safely
ls_df = ls_df.merge(ls_q[["audio_path"] + QUALITY_COLS], on="audio_path", how="left")
cv_df = cv_df.merge(cv_q[["audio_path"] + QUALITY_COLS], on="audio_path", how="left")

ls_df[["split"] + QUALITY_COLS].head()

  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)


  processed 500/12000
  processed 1000/12000
  processed 1500/12000
  processed 2000/12000
  processed 2500/12000
  processed 3000/12000
  processed 3500/12000
  processed 4000/12000
  processed 4500/12000
  processed 5000/12000
  processed 5500/12000
  processed 6000/12000
  processed 6500/12000
  processed 7000/12000
  processed 7500/12000
  processed 8000/12000
  processed 8500/12000
  processed 9000/12000
  processed 9500/12000
  processed 10000/12000
  processed 10500/12000
  processed 11000/12000
  processed 11500/12000
  processed 12000/12000


  s = torchaudio.io.StreamReader(src, format, None, buffer_size)


  processed 500/12000
  processed 1000/12000
  processed 1500/12000
  processed 2000/12000
  processed 2500/12000
  processed 3000/12000
  processed 3500/12000
  processed 4000/12000
  processed 4500/12000
  processed 5000/12000
  processed 5500/12000
  processed 6000/12000
  processed 6500/12000
  processed 7000/12000
  processed 7500/12000
  processed 8000/12000
  processed 8500/12000
  processed 9000/12000
  processed 9500/12000
  processed 10000/12000
  processed 10500/12000
  processed 11000/12000
  processed 11500/12000
  processed 12000/12000
ls_q cols: ['dataset', 'split', 'speaker_id', 'chapter_id', 'utt_id', 'text', 'audio_path', 'text_norm', 'duration_sec', 'silence_ratio', 'clipping_rate', 'rms_mean_db']
cv_q cols: ['dataset', 'split', 'speaker_id', 'text', 'audio_path', 'gender', 'age', 'locale', 'text_norm', 'duration_sec', 'silence_ratio', 'clipping_rate', 'rms_mean_db']


Unnamed: 0,split,duration_sec,silence_ratio,clipping_rate,rms_mean_db
0,train,,,,
1,train,,,,
2,train,,,,
3,train,,,,
4,train,11.52,0.208153,0.0,-27.794453


## 7) Quality-aware filtering + curriculum bins

In [10]:
# Cell 9 — Core filters + optional quality filtering + curriculum bins

def apply_core_filters(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    if "duration_sec" in df.columns:
        df = df[(df["duration_sec"].isna()) | ((df["duration_sec"] >= CFG["MIN_AUDIO_SEC"]) & (df["duration_sec"] <= CFG["MAX_AUDIO_SEC"]))]
    df = df[df["text_norm"].str.len() > 0]
    df = df[df["text_norm"].str.len() <= CFG["MAX_LABEL_CHARS"]]
    return df.reset_index(drop=True)

def quality_score_row(r):
    sil = r.get("silence_ratio", np.nan)
    clip = r.get("clipping_rate", np.nan)
    loud = r.get("rms_mean_db", np.nan)

    sil_pen = 0.0 if np.isnan(sil) else -2.0 * sil
    clip_pen = 0.0 if np.isnan(clip) else -8.0 * clip
    loud_pen = 0.0
    if not np.isnan(loud):
        loud_pen = -abs(np.clip(loud, -60, 0) - (-22.0)) / 18.0
    return float(sil_pen + clip_pen + loud_pen)

def apply_quality_filter_and_curriculum(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["quality_score"] = df.apply(quality_score_row, axis=1)

    if CFG["USE_QUALITY_FILTERING"]:
        q = df["quality_score"].dropna()
        if len(q) > 500:
            low_cut = q.quantile(0.05)
            df = df[(df["quality_score"].isna()) | (df["quality_score"] >= low_cut)].reset_index(drop=True)

    if CFG["USE_CURRICULUM"]:
        q = df["quality_score"].dropna()
        if len(q) > 500:
            q1, q2 = q.quantile(0.33), q.quantile(0.66)
            def bin_fn(x):
                if np.isnan(x): return 1
                if x < q1: return 2
                if x < q2: return 1
                return 0
            df["curriculum_bin"] = df["quality_score"].map(bin_fn).astype(int)
        else:
            df["curriculum_bin"] = 1
    else:
        df["curriculum_bin"] = 0

    return df

ls_df_f = apply_quality_filter_and_curriculum(apply_core_filters(ls_df))
cv_df_f = apply_quality_filter_and_curriculum(apply_core_filters(cv_df))

print("LibriSpeech after filters:", ls_df_f.shape, ls_df_f["split"].value_counts().to_dict())
print("Common Voice after filters:", cv_df_f.shape, cv_df_f["split"].value_counts().to_dict())


LibriSpeech after filters: (45049, 14) {'train': 40356, 'test': 2382, 'dev': 2311}
Common Voice after filters: (27058, 15) {'train': 23658, 'dev': 2251, 'test': 1149}


## 8) Build Hugging Face datasets + audio casting

In [11]:
# Cell 10 — Convert dataframes to HF datasets (keep audio_path as STRING; decode in collator)
# Why: Common Voice clips are often .mp3; datasets.Audio decoding can be fragile on Kaggle.
# We'll decode with torchaudio/librosa inside the collator.

from datasets import Dataset, DatasetDict

def to_hf_dataset(df: pd.DataFrame) -> DatasetDict:
    dsd = {}
    for split in sorted(df["split"].unique()):
        sdf = df[df["split"] == split].copy()
        keep = ["audio_path","text","text_norm","speaker_id"]
        for extra in ["duration_sec","quality_score","curriculum_bin"]:
            if extra in sdf.columns:
                keep.append(extra)
        dsd[split] = Dataset.from_pandas(sdf[keep], preserve_index=False)
    return DatasetDict(dsd)

ls_ds = to_hf_dataset(ls_df_f[ls_df_f["split"].isin(["train","dev","test"])])
cv_ds = to_hf_dataset(cv_df_f[cv_df_f["split"].isin(["train","dev","test"])])

print(ls_ds)
print(cv_ds)


DatasetDict({
    dev: Dataset({
        features: ['audio_path', 'text', 'text_norm', 'speaker_id', 'duration_sec', 'quality_score', 'curriculum_bin'],
        num_rows: 2311
    })
    test: Dataset({
        features: ['audio_path', 'text', 'text_norm', 'speaker_id', 'duration_sec', 'quality_score', 'curriculum_bin'],
        num_rows: 2382
    })
    train: Dataset({
        features: ['audio_path', 'text', 'text_norm', 'speaker_id', 'duration_sec', 'quality_score', 'curriculum_bin'],
        num_rows: 40356
    })
})
DatasetDict({
    dev: Dataset({
        features: ['audio_path', 'text', 'text_norm', 'speaker_id', 'duration_sec', 'quality_score', 'curriculum_bin'],
        num_rows: 2251
    })
    test: Dataset({
        features: ['audio_path', 'text', 'text_norm', 'speaker_id', 'duration_sec', 'quality_score', 'curriculum_bin'],
        num_rows: 1149
    })
    train: Dataset({
        features: ['audio_path', 'text', 'text_norm', 'speaker_id', 'duration_sec', 'quality_score

## 9) Processor, tokenization, and SpecAugment (optional)

In [12]:
# Cell 11 — Whisper processor (stable prompt IDs) + tokenizer settings

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(CFG["MODEL_NAME"])
feature_extractor = processor.feature_extractor
tokenizer = processor.tokenizer

# Build stable forced decoder prompt IDs (language/task tokens)
FORCED_DECODER_IDS = None
try:
    FORCED_DECODER_IDS = processor.get_decoder_prompt_ids(language="en", task="transcribe")
    print("✅ forced_decoder_ids ready:", FORCED_DECODER_IDS[:3], "...")
except Exception as e:
    print("⚠️ Could not set forced_decoder_ids automatically:", e)

print("Tokenizer vocab:", tokenizer.vocab_size)


preprocessor_config.json: 0.00B [00:00, ?B/s]

tokenizer_config.json:   0%|          | 0.00/805 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

normalizer.json: 0.00B [00:00, ?B/s]

added_tokens.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

✅ forced_decoder_ids ready: [(1, 50258), (2, 50358), (3, 50362)] ...
Tokenizer vocab: 50257


## 10) Data collator (with optional SpecAugment during training)

In [15]:
# Cell — Collator (P100/Kaggle safe, self-contained)
# Handles:
# 1) list[dict]
# 2) dict[list]
# audio_path as either:
#  - string file path
#  - HF Audio dict {"array":..., "sampling_rate":...}
# Applies SpecAugment ONLY when model is in training mode.

import numpy as np
import torch
import torchaudio
import librosa
from dataclasses import dataclass
from typing import Any, Dict, List, Union

# --- SpecAugment transforms (defined here to avoid NameError) ---
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=30)
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=10)

def load_audio_16k(item):
    # item can be Audio dict or file path string
    if isinstance(item, dict) and "array" in item and "sampling_rate" in item:
        y = item["array"]
        sr = int(item["sampling_rate"])
        if sr != 16000:
            y = librosa.resample(y, orig_sr=sr, target_sr=16000)
        return y.astype(np.float32), 16000

    if isinstance(item, str):
        path = item
        try:
            wav, sr = torchaudio.load(path)
            y = wav.mean(dim=0).numpy()
        except Exception:
            y, sr = librosa.load(path, sr=None, mono=True)

        if sr != 16000:
            y = librosa.resample(y, orig_sr=sr, target_sr=16000)
        return y.astype(np.float32), 16000

    raise TypeError(f"Unsupported audio item type: {type(item)}")

def _ensure_list_of_dicts(features: Union[List[Dict[str, Any]], Dict[str, List[Any]]]):
    # If HF dataset returns dict-of-lists, convert to list-of-dicts
    if isinstance(features, dict):
        keys = list(features.keys())
        n = len(features[keys[0]]) if keys else 0
        return [{k: features[k][i] for k in keys} for i in range(n)]
    return features  # already list-of-dicts

@dataclass
class WhisperCollatorOnTheFly:
    processor: Any
    model_ref: Any = None
    use_specaug: bool = False
    time_mask: Any = None
    freq_mask: Any = None

    def __call__(self, features):
        features = _ensure_list_of_dicts(features)

        waves = []
        for f in features:
            y, _ = load_audio_16k(f["audio_path"])
            waves.append(y)

        inputs = self.processor.feature_extractor(
            waves, sampling_rate=16000, return_tensors="pt"
        )
        input_features = inputs["input_features"]

        # SpecAugment only during training (Trainer uses ONE collator for train+eval)
        is_training = (self.model_ref is not None and self.model_ref.training)
        if self.use_specaug and is_training and self.time_mask is not None and self.freq_mask is not None:
            x = input_features
            x = self.time_mask(x)
            x = self.freq_mask(x)
            input_features = x

        texts = [f.get("text_norm", f["text"]) for f in features]
        tok = self.processor.tokenizer(texts, padding=True, return_tensors="pt")
        labels = tok["input_ids"].masked_fill(tok["attention_mask"].ne(1), -100)

        return {"input_features": input_features, "labels": labels}

# ✅ Single collator used by Trainer for both train and eval
train_collator = WhisperCollatorOnTheFly(
    processor=processor,
    model_ref=None,  # attach after trainer is created
    use_specaug=bool(CFG.get("USE_SPECAUG", False)),
    time_mask=time_mask if CFG.get("USE_SPECAUG", False) else None,
    freq_mask=freq_mask if CFG.get("USE_SPECAUG", False) else None,
)

print("✅ Collator ready (Trainer will reuse this for eval too).")

✅ Collator ready (Trainer will reuse this for eval too).


## 11) Speaker‑balanced sampling (WeightedRandomSampler) via custom Trainer

In [16]:
# Cell 13 — Custom trainer: speaker-balanced sampling + separate eval collator (no SpecAugment)

from torch.utils.data import DataLoader, WeightedRandomSampler

class SpeakerBalancedSeq2SeqTrainer(Seq2SeqTrainer):
    def __init__(self, *args, speaker_balanced=False, eval_collator=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.speaker_balanced = speaker_balanced
        self.eval_collator = eval_collator

    def get_train_dataloader(self):
        if not self.speaker_balanced or not CFG["USE_SPEAKER_BALANCED_SAMPLING"]:
            return super().get_train_dataloader()

        train_dataset = self.train_dataset
        spk = [str(s) for s in train_dataset["speaker_id"]]
        freq = {}
        for s in spk:
            freq[s] = freq.get(s, 0) + 1
        weights = torch.DoubleTensor([1.0 / freq[s] for s in spk])
        sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

        return DataLoader(
            train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            sampler=sampler,
            collate_fn=self.data_collator,
            num_workers=0,
            pin_memory=False,
        )

    def get_eval_dataloader(self, eval_dataset=None):
        dl = super().get_eval_dataloader(eval_dataset)
        if self.eval_collator is None:
            return dl
        # Rebuild DataLoader with eval_collator (keeps everything else identical)
        return DataLoader(
            dl.dataset,
            batch_size=dl.batch_size,
            sampler=dl.sampler,
            collate_fn=self.eval_collator,
            num_workers=0,
            pin_memory=False,
        )


## 12) Metrics (WER/CER) with decode params + normalization

In [17]:
# Cell 14 — Compute metrics (WER/CER)
import evaluate
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def postprocess_text(preds, labels):
    preds = [normalize_text(p) for p in preds]
    labels = [normalize_text(l) for l in labels]
    return preds, labels

def make_compute_metrics():
    def compute_metrics(pred):
        pred_ids = pred.predictions
        label_ids = pred.label_ids
        label_ids[label_ids == -100] = tokenizer.pad_token_id

        pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
        label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
        pred_str, label_str = postprocess_text(pred_str, label_str)

        return {
            "wer": wer_metric.compute(predictions=pred_str, references=label_str),
            "cer": cer_metric.compute(predictions=pred_str, references=label_str),
        }
    return compute_metrics


Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

In [36]:
# Disable bitsandbytes completely to avoid triton.ops issues
!pip -q uninstall -y bitsandbytes

import os
os.environ["BITSANDBYTES_NOWELCOME"] = "1"

[0m

In [18]:
CFG["USE_8BIT_OPTIM"] = False

In [33]:
import inspect
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Build a set of allowed kwargs for Whisper forward
_WHISPER_FORWARD_KEYS = set(inspect.signature(WhisperForConditionalGeneration.forward).parameters.keys())

class WhisperForConditionalGenerationCompat(WhisperForConditionalGeneration):
    """
    Makes Whisper compatible with Trainer/PEFT extras:
      - PEFT may call forward(input_ids=...)
      - Trainer may pass num_items_in_batch
      - Filters unknown kwargs safely
    """
    def forward(self, *args, **kwargs):
        # 1) Drop Trainer bookkeeping kwarg
        kwargs.pop("num_items_in_batch", None)

        # 2) PEFT sometimes uses input_ids for encoder inputs (Whisper expects input_features)
        if "input_features" not in kwargs and "input_ids" in kwargs:
            kwargs["input_features"] = kwargs.pop("input_ids")

        # 3) Whisper doesn't use inputs_embeds; remove if present
        kwargs.pop("inputs_embeds", None)

        # 4) Filter any other unexpected kwargs defensively
        kwargs = {k: v for k, v in kwargs.items() if k in _WHISPER_FORWARD_KEYS}

        return super().forward(*args, **kwargs)

def build_model():
    # Keep your P100-safe attention if you were using it
    try:
        model = WhisperForConditionalGenerationCompat.from_pretrained(
            CFG["MODEL_NAME"],
            attn_implementation="eager",
        )
    except TypeError:
        model = WhisperForConditionalGenerationCompat.from_pretrained(CFG["MODEL_NAME"])
        if hasattr(model.config, "attn_implementation"):
            model.config.attn_implementation = "eager"

    model.config.forced_decoder_ids = None
    model.config.suppress_tokens = []
    model.generation_config.max_new_tokens = CFG["GEN_MAX_NEW_TOKENS"]
    model.generation_config.num_beams = CFG["GEN_BEAMS"]

    model.gradient_checkpointing_enable()
    model.config.use_cache = False

    if CFG["USE_LORA"]:
        lora_cfg = LoraConfig(
            r=16,
            lora_alpha=32,
            lora_dropout=0.05,
            bias="none",
            task_type="SEQ_2_SEQ_LM",
            target_modules=["q_proj", "v_proj"],
        )
        model = get_peft_model(model, lora_cfg)
        model.print_trainable_parameters()

    return model

model = build_model()
print("✅ Model rebuilt with WhisperForConditionalGenerationCompat")

trainable params: 1,769,472 || all params: 243,503,616 || trainable%: 0.7267
✅ Model rebuilt with WhisperForConditionalGenerationCompat


## 13) Model init (LoRA + 8‑bit optimizer) — Stage A: LibriSpeech

In [34]:
# Cell 15 — Load Whisper + attach LoRA (force eager attention for P100)

from transformers import WhisperForConditionalGeneration

class WhisperForConditionalGenerationCompat(WhisperForConditionalGeneration):
    """
    Makes Whisper compatible with:
      - PEFT passing input_ids
      - Trainer passing num_items_in_batch (and other extra kwargs)
    """
    def forward(
        self,
        input_features=None,
        input_ids=None,
        attention_mask=None,
        decoder_input_ids=None,
        decoder_attention_mask=None,
        encoder_outputs=None,
        past_key_values=None,
        decoder_inputs_embeds=None,
        labels=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        inputs_embeds=None,   # sometimes passed by wrappers; Whisper doesn't use it
        **kwargs
    ):
        # ✅ Trainer sometimes passes this (Whisper doesn't accept it)
        kwargs.pop("num_items_in_batch", None)

        # ✅ Some stacks pass other harmless extras; ignore them safely
        kwargs.pop("output_router_logits", None)
        kwargs.pop("return_loss", None)

        # ✅ PEFT wrapper often calls base_model(input_ids=...)
        if input_features is None and input_ids is not None:
            input_features = input_ids

        return super().forward(
            input_features=input_features,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
            encoder_outputs=encoder_outputs,
            past_key_values=past_key_values,
            decoder_inputs_embeds=decoder_inputs_embeds,
            labels=labels,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

## 14) Training arguments (P100‑safe) + Stage A trainer

In [35]:
# Cell 16 — Stage A args + Trainer (self-healing if model not defined)

import os
from transformers import Seq2SeqTrainingArguments

# ---- 1) Ensure we have a model ----
if "model" not in globals() or model is None:
    print("⚠️ model not found in memory — rebuilding from base checkpoint...")

    # You should already have build_model() defined in your model cell.
    # If not, re-run the model cell first.
    model = build_model()

# ---- 2) Use P100-safe optimizer ----
optim_name = "adamw_torch"

stage_a_args = Seq2SeqTrainingArguments(
    output_dir="/kaggle/working/asr_runs/stageA_librispeech",
    per_device_train_batch_size=CFG["PER_DEVICE_TRAIN_BS"],
    per_device_eval_batch_size=CFG["PER_DEVICE_EVAL_BS"],
    gradient_accumulation_steps=CFG["GRAD_ACCUM_STEPS"],
    learning_rate=CFG["STAGE_A_LR"],
    warmup_steps=100,
    max_steps=CFG["STAGE_A_MAX_STEPS"],
    fp16=True,
    bf16=False,
    optim=optim_name,
    dataloader_num_workers=0,
    logging_steps=50,
    eval_strategy="steps",   # <-- correct arg name
    eval_steps=200,
    save_steps=200,
    save_total_limit=1,
    predict_with_generate=True,
    generation_num_beams=CFG["GEN_BEAMS"],
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    remove_unused_columns=False,
    label_names=["labels"],
)

# ---- 3) Make sure we use the LIGHT datasets (audio_path/text columns) ----
# If you use ls_ds_light, use it. Otherwise fallback to ls_ds.
train_ds = ls_ds_light["train"] if "ls_ds_light" in globals() else ls_ds["train"]
dev_ds   = ls_ds_light["dev"]   if "ls_ds_light" in globals() else ls_ds["dev"]

trainer_a = SpeakerBalancedSeq2SeqTrainer(
    model=model,
    args=stage_a_args,
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    data_collator=train_collator,
    compute_metrics=make_compute_metrics(),
    tokenizer=processor.feature_extractor,
    speaker_balanced=CFG["USE_SPEAKER_BALANCED_SAMPLING"],
)

# Attach model to collator so SpecAugment toggles correctly
try:
    train_collator.model_ref = trainer_a.model
except Exception:
    pass

print("Stage A ready ✅")

  super().__init__(*args, **kwargs)
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Stage A ready ✅


In [36]:
import torch
print("torch:", torch.__version__)
print("cuda runtime:", torch.version.cuda)
print("gpu:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))
print("arch list in this torch build:", torch.cuda.get_arch_list())

# quick kernel test (this is what currently fails for you)
x = torch.zeros((2,3), device="cuda")
print("cuda kernel OK ✅", x.shape)

torch: 2.8.0+cu126
cuda runtime: 12.6
gpu: Tesla P100-PCIE-16GB
capability: (6, 0)
arch list in this torch build: ['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
cuda kernel OK ✅ torch.Size([2, 3])


## 15) Run Stage A training (LibriSpeech)

In [37]:
# One-batch GPU debug
batch = next(iter(trainer_a.get_train_dataloader()))
batch = {k: (v.cuda() if hasattr(v, "cuda") else v) for k, v in batch.items()}

model = trainer_a.model.cuda()
model.train()

out = model(**batch)
print("Forward OK ✅", out.loss.item())

out.loss.backward()
print("Backward OK ✅")

  s = torchaudio.io.StreamReader(src, format, None, buffer_size)


Forward OK ✅ 1.5956324338912964
Backward OK ✅


In [38]:
import inspect
from accelerate import Accelerator

_sig = inspect.signature(Accelerator.unwrap_model)
if "keep_torch_compile" not in _sig.parameters:
    _orig_unwrap_model = Accelerator.unwrap_model

    def _unwrap_model_compat(self, model, *args, **kwargs):
        # transformers may pass these on newer versions
        kwargs.pop("keep_torch_compile", None)
        kwargs.pop("recursive", None)

        # Call the original unwrap_model with only supported kwargs
        orig_sig = inspect.signature(_orig_unwrap_model)
        filtered = {k: v for k, v in kwargs.items() if k in orig_sig.parameters}

        return _orig_unwrap_model(self, model, *args, **filtered)

    Accelerator.unwrap_model = _unwrap_model_compat
    print("✅ Patched accelerate.Accelerator.unwrap_model for keep_torch_compile compatibility.")
else:
    print("✅ accelerate already supports keep_torch_compile.")

✅ Patched accelerate.Accelerator.unwrap_model for keep_torch_compile compatibility.


In [40]:
import inspect
from accelerate import Accelerator
from accelerate.utils import extract_model_from_parallel

def unwrap_model_no_recursion(self, model, *args, **kwargs):
    # transformers may pass these on newer versions
    kwargs.pop("keep_torch_compile", None)
    kwargs.pop("recursive", None)

    # forward only the kwargs that extract_model_from_parallel supports
    sig = inspect.signature(extract_model_from_parallel)
    filtered = {k: v for k, v in kwargs.items() if k in sig.parameters}

    return extract_model_from_parallel(model, **filtered)

# Apply patch safely (idempotent)
if not getattr(Accelerator.unwrap_model, "_no_recursion_patch", False):
    Accelerator.unwrap_model = unwrap_model_no_recursion
    Accelerator.unwrap_model._no_recursion_patch = True
    print("✅ Patched Accelerator.unwrap_model (no recursion, keep_torch_compile ignored).")
else:
    print("✅ unwrap_model already patched (no recursion).")

✅ Patched Accelerator.unwrap_model (no recursion, keep_torch_compile ignored).


In [43]:
import types

def patch_generate_drop_labels(model):
    """
    Patch model.generate to ignore labels so Seq2SeqTrainer evaluation won't crash.
    Stores patch state on the model object (safe on Kaggle).
    """
    if getattr(model, "_drop_labels_patched", False):
        return

    orig_generate = model.generate

    def generate_no_labels(self, *args, **kwargs):
        kwargs.pop("labels", None)
        kwargs.pop("label", None)
        # don't forward decoder_input_ids for whisper generation
        kwargs.pop("decoder_input_ids", None)
        return orig_generate(*args, **kwargs)

    model.generate = types.MethodType(generate_no_labels, model)
    model._drop_labels_patched = True

# Patch trainer model
patch_generate_drop_labels(trainer_a.model)

# If PEFT-wrapped, patch its base model too (different wrappers use different attrs)
for attr in ["base_model", "model"]:
    if hasattr(trainer_a.model, attr):
        try:
            patch_generate_drop_labels(getattr(trainer_a.model, attr))
        except Exception:
            pass

print("✅ Patched generate() to drop labels during decoding.")

✅ Patched generate() to drop labels during decoding.


In [44]:
# Cell 17 — Train Stage A
train_result_a = trainer_a.train()
metrics_a = trainer_a.evaluate()

print("Stage A eval:", metrics_a)
trainer_a.save_model()
pd.DataFrame([metrics_a]).to_csv("/kaggle/working/stageA_eval_metrics.csv", index=False)

Step,Training Loss,Validation Loss,Wer,Cer
200,2.0615,0.749182,0.122979,0.113462
400,2.1257,0.741068,0.082969,0.063727
600,2.0478,0.739126,0.07712,0.06471
800,2.0628,0.738254,0.068543,0.053975
1000,2.0351,0.737125,0.068764,0.052112
1200,2.1106,0.736806,0.070116,0.054948


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)


Stage A eval: {'eval_loss': 0.7382535934448242, 'eval_wer': 0.06854263946915704, 'eval_cer': 0.05397487973344814, 'eval_runtime': 1833.2373, 'eval_samples_per_second': 1.261, 'eval_steps_per_second': 0.631, 'epoch': 0.23788284269997026}


## 16) Confidence-based transcript denoise (teacher-forced loss on CV train)

In [46]:
eval_collator = WhisperCollatorOnTheFly(
    processor=processor,
    use_specaug=False,
    time_mask=None,
    freq_mask=None
)

In [47]:
# Cell 18 — Score CV samples by teacher-forced loss (subset) and drop worst fraction
@torch.no_grad()
def score_dataset_by_loss(trainer: Seq2SeqTrainer, ds: Dataset, max_samples: int):
    model = trainer.model
    model.eval()
    ds_small = ds.select(range(min(max_samples, len(ds))))
    losses = []
    for i in range(0, len(ds_small), CFG["PER_DEVICE_EVAL_BS"]):
        batch = [ds_small[j] for j in range(i, min(i+CFG["PER_DEVICE_EVAL_BS"], len(ds_small)))]
        batch_t = eval_collator(batch)
        batch_t = {k: v.to(model.device) for k, v in batch_t.items()}
        out = model(**batch_t)
        losses.append(out.loss.detach().float().cpu().item())

    per_sample = []
    for b, loss in enumerate(losses):
        start = b * CFG["PER_DEVICE_EVAL_BS"]
        end = min(start + CFG["PER_DEVICE_EVAL_BS"], len(ds_small))
        per_sample.extend([loss] * (end - start))
    return np.array(per_sample)

cv_train = cv_ds["train"]

if CFG["USE_TRANSCRIPT_DENOISE"]:
    print("Scoring CV train for denoise (subset)...")
    loss_scores = score_dataset_by_loss(trainer_a, cv_train, CFG["DENOISE_SCORE_MAX_SAMPLES"])
    cutoff = np.quantile(loss_scores, 1.0 - CFG["DENOISE_DROP_FRACTION"])
    keep_mask = loss_scores <= cutoff
    kept_indices = list(np.where(keep_mask)[0])
    if len(cv_train) > len(loss_scores):
        kept_indices += list(range(len(loss_scores), len(cv_train)))
    cv_train_denoised = cv_train.select(kept_indices)
    print("CV train before:", len(cv_train), "after denoise:", len(cv_train_denoised))
else:
    cv_train_denoised = cv_train


Scoring CV train for denoise (subset)...
CV train before: 23658 after denoise: 22698


## 17) Stage B: Adapt on Common Voice

In [48]:
# Cell 19 — Stage B trainer (Common Voice adaptation)
stage_b_args = Seq2SeqTrainingArguments(
    output_dir="/kaggle/working/asr_runs/stageB_commonvoice",
    per_device_train_batch_size=CFG["PER_DEVICE_TRAIN_BS"],
    per_device_eval_batch_size=CFG["PER_DEVICE_EVAL_BS"],
    gradient_accumulation_steps=CFG["GRAD_ACCUM_STEPS"],
    learning_rate=CFG["STAGE_B_LR"],
    warmup_steps=80,
    max_steps=CFG["STAGE_B_MAX_STEPS"],
    fp16=True,
    optim=optim_name,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=200,
    save_steps=200,
    save_total_limit=1,
    predict_with_generate=True,
    generation_num_beams=CFG["GEN_BEAMS"],
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    remove_unused_columns=False,
    label_names=["labels"],
    dataloader_num_workers=0,

)

trainer_b = SpeakerBalancedSeq2SeqTrainer(
    model=model,
    args=stage_b_args,
    train_dataset=cv_ds["train"],
    eval_dataset=cv_ds["dev"],
    data_collator=train_collator,
    compute_metrics=make_compute_metrics(),
    tokenizer=processor.feature_extractor,
    speaker_balanced=CFG["USE_SPEAKER_BALANCED_SAMPLING"],
)

train_collator.model_ref = trainer_b.model


print("Stage B ready.")


Stage B ready.


  super().__init__(*args, **kwargs)
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


## 18) Run Stage B training + cross-domain evaluation

In [49]:
# Cell 20 — Train Stage B + evaluate on CV test and LS test (use *_ds_light)

train_result_b = trainer_b.train()

metrics_b_dev     = trainer_b.evaluate(eval_dataset=cv_ds["dev"])
metrics_b_cv_test = trainer_b.evaluate(eval_dataset=cv_ds["test"])
metrics_b_ls_test = trainer_b.evaluate(eval_dataset=ls_ds["test"])

print("Stage B dev:", metrics_b_dev)
print("Stage B CV test:", metrics_b_cv_test)
print("Stage B LS test:", metrics_b_ls_test)

trainer_b.save_model()
pd.DataFrame([metrics_b_dev]).to_csv("/kaggle/working/stageB_dev_metrics.csv", index=False)
pd.DataFrame([metrics_b_cv_test]).to_csv("/kaggle/working/stageB_cv_test_metrics.csv", index=False)
pd.DataFrame([metrics_b_ls_test]).to_csv("/kaggle/working/stageB_ls_test_metrics.csv", index=False)

  s = torchaudio.io.StreamReader(src, format, None, buffer_size)


Step,Training Loss,Validation Loss,Wer,Cer
200,4.6272,1.088703,0.127312,0.072924
400,4.6164,1.084137,0.120381,0.070271
600,4.6483,1.082327,0.115703,0.06724
800,4.3486,1.081868,0.112237,0.063975


  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  s = torchaudio.io.StreamReader(src, format, None, buffer_size)


Stage B dev: {'eval_loss': 1.0818679332733154, 'eval_wer': 0.11223738358241282, 'eval_cer': 0.06397522317361996, 'eval_runtime': 1224.3671, 'eval_samples_per_second': 1.839, 'eval_steps_per_second': 0.92, 'epoch': 0.3043367993913264}
Stage B CV test: {'eval_loss': 1.1138039827346802, 'eval_wer': 0.13712219543853144, 'eval_cer': 0.06211715005280689, 'eval_runtime': 638.4521, 'eval_samples_per_second': 1.8, 'eval_steps_per_second': 0.901, 'epoch': 0.3043367993913264}
Stage B LS test: {'eval_loss': 0.7851576805114746, 'eval_wer': 0.11142461142461142, 'eval_cer': 0.05782289048936061, 'eval_runtime': 1833.0052, 'eval_samples_per_second': 1.3, 'eval_steps_per_second': 0.65, 'epoch': 0.3043367993913264}


## 19) Decode-time tuning (cheap WER/CER improvements)

In [None]:
# Cell 21 — Decode tuning grid on CV dev, then report best on CV test
from itertools import product

def eval_with_gen_kwargs(trainer: Seq2SeqTrainer, ds: Dataset, gen_kwargs: dict):
    old_beams = trainer.args.generation_num_beams
    trainer.args.generation_num_beams = gen_kwargs.get("num_beams", CFG["GEN_BEAMS"])
    trainer.model.generation_config.num_beams = gen_kwargs.get("num_beams", CFG["GEN_BEAMS"])

    if "length_penalty" in gen_kwargs:
        trainer.model.generation_config.length_penalty = gen_kwargs["length_penalty"]
    if "repetition_penalty" in gen_kwargs:
        trainer.model.generation_config.repetition_penalty = gen_kwargs["repetition_penalty"]
    if "temperature" in gen_kwargs:
        trainer.model.generation_config.temperature = gen_kwargs["temperature"]

    out = trainer.evaluate(eval_dataset=ds)

    trainer.args.generation_num_beams = old_beams
    trainer.model.generation_config.num_beams = CFG["GEN_BEAMS"]
    return out

tuning_results = []
if CFG["USE_DECODE_TUNING"]:
    beams = [1, 3, 5, 8]
    length_pen = [0.8, 1.0, 1.2]
    rep_pen = [1.0, 1.1]
    temps = [0.0, 0.2]

    for b, lp, rp, t in product(beams, length_pen, rep_pen, temps):
        gen_kwargs = {"num_beams": b, "length_penalty": lp, "repetition_penalty": rp, "temperature": t}
        m = eval_with_gen_kwargs(trainer_b, cv_ds["dev"], gen_kwargs)
        tuning_results.append({**gen_kwargs, "wer": m["eval_wer"], "cer": m["eval_cer"]})
        print(gen_kwargs, "->", m["eval_wer"], m["eval_cer"])

    tune_df = pd.DataFrame(tuning_results).sort_values(["wer","cer"]).reset_index(drop=True)
    tune_df.to_csv("/kaggle/working/decode_tuning_cv_dev.csv", index=False)
    display(tune_df.head(10))

    best = tune_df.iloc[0].to_dict()
    best_kwargs = {k: best[k] for k in ["num_beams","length_penalty","repetition_penalty","temperature"]}
    print("Best decode params:", best_kwargs)

    best_cv_test = eval_with_gen_kwargs(trainer_b, cv_ds["test"], best_kwargs)
    pd.DataFrame([{**best_kwargs, "wer": best_cv_test["eval_wer"], "cer": best_cv_test["eval_cer"]}]).to_csv(
        "/kaggle/working/best_decode_cv_test.csv", index=False
    )
    print("Best CV test:", best_cv_test)
else:
    print("Decode tuning disabled.")


## 20) Error analysis export (top worst utterances) for supervisor updates

In [None]:
# Cell 22 — Per-sample error table on a small subset of CV test

def per_sample_errors(trainer: Seq2SeqTrainer, ds: Dataset, n=400):
    ds_small = ds.select(range(min(n, len(ds))))
    preds = trainer.predict(ds_small)
    pred_ids = preds.predictions
    label_ids = preds.label_ids
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    rows = []
    for i, (p, r) in enumerate(zip(pred_str, label_str)):
        pn, rn = normalize_text(p), normalize_text(r)
        rows.append({
            "idx": i,
            "ref": rn,
            "hyp": pn,
            "wer": wer(rn, pn),
            "cer": cer(rn, pn),
            "speaker_id": ds_small[i]["speaker_id"],
        })
    return pd.DataFrame(rows).sort_values("wer", ascending=False)

err_cv = per_sample_errors(trainer_b, cv_ds["test"], n=500)
err_cv.to_csv("/kaggle/working/error_analysis_cv_test_top.csv", index=False)
display(err_cv.head(25))


## 21) Weekly update checklist (copy/paste)
- Stage A (LibriSpeech) dev WER/CER and key settings used (LoRA/8-bit/balanced/SpecAug/filters).  
- Stage B (Common Voice adaptation) dev + test WER/CER and LS test WER/CER (cross-domain).  
- Denoise: number of CV samples removed (loss-ranked) and effect on WER/CER.  
- Decode tuning: best parameters on CV dev and resulting CV test WER/CER.  
- Error analysis: top failure patterns (numbers, accent, noise, short clips).  
