# OCSC Data Preprocessing Pipeline

## Overview
This notebook preprocesses the Ohio Child Speech Corpus (OCSC) for fine-tuning Whisper models on child speech recognition. It extracts child utterances from CHAT-format transcriptions, aligns them with audio timestamps, and creates train/dev/test splits.

### Setup and Dependencies
Install required packages for HuggingFace data loading and audio processing.

In [1]:
!pip -q install huggingface_hub==0.26.2 soundfile==0.12.1 tqdm==4.66.4

### Directory Configuration
Set up the folder structure for raw data, processed segments, and output manifests. All paths are absolute and rooted at `/content`, the Colab working directory.

In [2]:
from pathlib import Path
import pandas as pd, numpy as np, re, os, shutil, random, json
from tqdm import tqdm

BASE = Path("/content")
DATA = BASE/"data"; DATA.mkdir(parents=True, exist_ok=True)
RAW  = DATA/"ocsc_raw"                   # raw HF snapshot
SEG  = DATA/"ocsc_segments"              # optional utterance WAVs (16 kHz)
FLAC = DATA/"ocsc_segments_flac"         # optional FLAC mirror
MANI = DATA/"manifests"; MANI.mkdir(parents=True, exist_ok=True)
CHA_DIR = BASE/"cha"; CHA_DIR.mkdir(parents=True, exist_ok=True)  # drop .cha files here

print("Folders:")
for p in [BASE, DATA, RAW, SEG, FLAC, MANI, CHA_DIR]: print(" -", p)

Folders:
 - /content
 - /content/data
 - /content/data/ocsc_raw
 - /content/data/ocsc_segments
 - /content/data/ocsc_segments_flac
 - /content/data/manifests
 - /content/cha


### Download OCSC from HuggingFace
Pull the OCSC dataset from the HuggingFace Hub. This contains audio files organized by age group (4-9 years) and corresponding CHAT transcription files.

In [None]:
from huggingface_hub import snapshot_download

repo_id = "NolanChai/childes-ocsc"
snapshot_download(
  repo_id=repo_id,
  repo_type="dataset",
  local_dir=str(RAW),  # local_dir_use_symlinks is deprecated in recent huggingface_hub releases, so it is omitted
)

AUDIO_ROOT = RAW/"Eng-NA"/"OCSC"
assert AUDIO_ROOT.exists(), f"Expected {AUDIO_ROOT}"
age_dirs = sorted([p for p in AUDIO_ROOT.iterdir() if p.is_dir() and p.name.isdigit()],
                  key=lambda p: int(p.name))
print("Age folders:", [p.name for p in age_dirs])

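### Index Audio Files and Pair Them with CHAT Transcriptions
Scan each age folder (4 through 9 years) for audio files (.mp3, .wav), preferring .wav when a session has both, and pair each recording with the `.cha` transcript that shares its session ID. Sessions with audio but no transcript are collected in `missing_cha`.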

In [None]:
AUDIO_EXTS = {".mp3", ".wav"}

def find_audio(age_dir: Path, sid: str):
  cands = [p for p in age_dir.glob(f"{sid}.*") if p.suffix.lower() in AUDIO_EXTS]
  # prefer .wav if both
  cands = sorted(cands, key=lambda p: (0 if p.suffix.lower()==".wav" else 1, p.suffix.lower()))
  return cands[0] if cands else None

# pick up CHA from /content/cha and (for safety) /content root
cha_map = {}
for p in list(CHA_DIR.glob("*.cha")) + list(BASE.glob("*.cha")):
  cha_map[p.stem] = p

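pairs, missing_cha, missing_audio = [], [], []
seen = set()  # (age folder, session ID) keys already indexed
for age_dir in age_dirs:
  for a in age_dir.iterdir():
    if a.suffix.lower() in AUDIO_EXTS and a.stem.isdigit():
      sid = a.stem
      # a session with both .wav and .mp3 would otherwise be indexed twice
      if (age_dir.name, sid) in seen:
        continue
      seen.add((age_dir.name, sid))
      audio_path = find_audio(age_dir, sid) or a
      cha_path = cha_map.get(sid)
      if cha_path is None:
        missing_cha.append((sid, audio_path))
      else:
        pairs.append((sid, age_dir.name, audio_path, cha_path))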
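
The `.wav` preference in `find_audio` can be checked in isolation. The following is a minimal sketch that repeats the helper so it runs standalone; the session IDs `9001`, `9002`, and `9003` and the temporary folder are made up for illustration and are not part of OCSC.

```python
import tempfile
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav"}

def find_audio(age_dir: Path, sid: str):
    # Same logic as the notebook helper, repeated so the sketch is self-contained.
    cands = [p for p in age_dir.glob(f"{sid}.*") if p.suffix.lower() in AUDIO_EXTS]
    # Sort key puts .wav first, then any other accepted extension.
    cands = sorted(cands, key=lambda p: (0 if p.suffix.lower() == ".wav" else 1, p.suffix.lower()))
    return cands[0] if cands else None

with tempfile.TemporaryDirectory() as tmp:
    age_dir = Path(tmp) / "4"          # hypothetical age folder
    age_dir.mkdir()
    for name in ["9001.mp3", "9001.wav", "9002.mp3", "9001.txt"]:
        (age_dir / name).touch()       # empty placeholder files
    picked_both = find_audio(age_dir, "9001").name   # both formats present
    picked_mp3 = find_audio(age_dir, "9002").name    # only .mp3 present
    picked_none = find_audio(age_dir, "9003")        # no audio at all

print(picked_both, picked_mp3, picked_none)
```

Non-audio files such as `9001.txt` are ignored because their suffix is not in `AUDIO_EXTS`.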

### Check for Unpaired Files
Flag any `.cha` transcript that has no matching audio in any age folder, then report the paired and missing counts for quality control.

In [None]:
# CHA with no audio anywhere
for sid, cp in cha_map.items():
    present = any((AUDIO_ROOT/d.name/f"{sid}.mp3").exists() or (AUDIO_ROOT/d.name/f"{sid}.wav").exists()
                  for d in age_dirs)
    if not present and not any(sid==s for s,_,_,_ in pairs):
        missing_audio.append((sid, cp))

print(f"Paired sessions: {len(pairs)} | Missing CHA: {len(missing_cha)} | Missing audio: {len(missing_audio)}")
if missing_cha[:5]: print("Examples missing CHA:", missing_cha[:5])
if missing_audio[:5]: print("Examples missing audio:", missing_audio[:5])

Paired sessions: 304 | Missing CHA: 0 | Missing audio: 0


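### Parser Imports and Constants
Import the audio libraries and define the constants used by the CHAT parser: the task labels that follow `@G:` markers, the default task for the robot intro that precedes the first marker, and the regular expression that matches timestamp bullets.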

In [None]:
import librosa, soundfile as sf

# OCSC task labels per @G:; the pre-@G region is the robot intro
TASK_ALIASES = {
    "Alphabet":"Alphabet","Numbers":"Numbers","Wug":"Wug","ExpPictures":"ExpPictures",
    "Reading":"Reading","Howto":"Howto","DescriptivePictures":"DescriptivePictures","EndTasks":"EndTasks"
}
DEFAULT_TASK = "IntroRobot"
TIME_RE = re.compile(r"\x15\s*(\d+)[_:-](\d+)\s*\x15")  # bullet such as 0_1320 (start_end in ms), wrapped in \x15 control characters

## CHAT Format Parser

Parse CHILDES CHAT-format transcription files to extract:
- **Child utterances** (`*CHI:` lines only, excluding experimenter speech)
- **Timestamps** (bullet markers like `0_1320` giving start and end offsets in milliseconds)
- **Task labels** (`@G:` markers for different elicitation tasks)
- **Age metadata** (from folder structure)
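
As a minimal sketch of how a timestamp bullet is read, the snippet below applies the same regular expression to a single hand-written `*CHI:` line. The sample line is an assumption modeled on the CHAT convention of wrapping the bullet in `\x15` control characters; it is not copied from a corpus file.

```python
import re

# Same pattern as TIME_RE in the notebook: start and end in milliseconds,
# wrapped in \x15 control characters.
TIME_RE = re.compile(r"\x15\s*(\d+)[_:-](\d+)\s*\x15")

# Hypothetical child utterance line with a trailing timestamp bullet.
sample = "*CHI:\thello . \x1527352_28891\x15\n"

m = TIME_RE.search(sample)
start_s = int(m.group(1)) / 1000.0   # milliseconds -> seconds
end_s = int(m.group(2)) / 1000.0

# Drop the speaker tag and the bullet to recover the utterance text.
text = TIME_RE.sub("", sample.split(":", 1)[1]).strip()

print(start_s, end_s, repr(text))
```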

### Text Normalization
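Lowercase each transcription, replace every character other than letters, digits, and apostrophes with a space, and collapse runs of whitespace for consistent ASR evaluation. This strips punctuation and symbolic CHAT codes such as `&-`, but alphabetic codes such as `xxx` (unintelligible speech) pass through unchanged.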

In [None]:
def age_bucket(age):
    if age is None: return "unknown"
    return "4-5" if age < 6 else "6-7" if age < 8 else "8-9"

def normalize_text(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[^a-z0-9'\s]", " ", s)
    return re.sub(r"\s+", " ", s).strip()
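
To illustrate what these two helpers return, here is a small sketch that repeats them so it runs on its own and applies them to a few hand-picked inputs. The example strings follow the style of OCSC utterances but are chosen only for illustration.

```python
import re

def age_bucket(age):
    # Ages below 6 -> "4-5", below 8 -> "6-7", otherwise "8-9".
    if age is None:
        return "unknown"
    return "4-5" if age < 6 else "6-7" if age < 8 else "8-9"

def normalize_text(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[^a-z0-9'\s]", " ", s)   # keep letters, digits, apostrophes
    return re.sub(r"\s+", " ", s).strip()  # collapse whitespace

examples = ["my name is Teigan .", "mine's rainbow .", "&-um .", "xxx ."]
normalized = [normalize_text(e) for e in examples]
buckets = [age_bucket(a) for a in (4.0, 6.0, 9.0, None)]

for raw, norm in zip(examples, normalized):
    print(f"{raw!r:24} -> {norm!r}")
print(buckets)
```

Note that `xxx` survives normalization, so filtering out unintelligible utterances would need a separate step if that is desired.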

### Generate Utterance Manifest
Process all paired audio/transcript files to create a comprehensive manifest with one row per child utterance, including timing information, normalized text, and metadata.

In [None]:
def parse_cha_simple(cha_path: Path, age_folder: str, audio_path: Path):
  rows = []
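  try: age_years = float(age_folder)
  except ValueError: age_years = None  # non-numeric folder name
  a_bucket = age_bucket(age_years)

  # CHAT files are UTF-8; undecodable bytes are dropped rather than raising
  with open(cha_path, "r", encoding="utf-8", errors="ignore") as f:
      lines = f.readlines()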

  # collect @G markers in order
  gems = []
  for i, ln in enumerate(lines):
    g = re.search(r"@G:\s*([A-Za-z]+)", ln)
    if g: gems.append((i, TASK_ALIASES.get(g.group(1), g.group(1))))

  current_task = DEFAULT_TASK
  next_g = 0

  for i, ln in enumerate(lines):
    while next_g < len(gems) and i >= gems[next_g][0]:
        current_task = gems[next_g][1]
        next_g += 1

    if not ln.startswith("*CHI:"):
        continue

    raw = ln.split(":", 1)[1].strip()
    clean = TIME_RE.sub("", raw).strip()

    # time marks on this or nearby dependent lines
    m = TIME_RE.search(ln)
    if not m:
      j = 1
      while j <= 3 and (i + j) < len(lines) and not m:
          m = TIME_RE.search(lines[i + j]); j += 1

    start_s = end_s = None
    if m:
      start_s, end_s = int(m.group(1))/1000.0, int(m.group(2))/1000.0

    if clean:
      rows.append({
          "session_id": cha_path.stem,
          "age_folder": int(age_folder) if age_folder.isdigit() else age_folder,
          "audio_path": str(audio_path),
          "cha_path": str(cha_path),
          "speaker_id": f"CHI_{cha_path.stem}",
          "age_years": age_years,
          "age_bucket": a_bucket,
          "task": current_task,
          "start_s": start_s,
          "end_s": end_s,
          "text": raw,
          "norm_text": normalize_text(clean),
      })
  return rows

In [None]:
def build_manifest(pairs):
    all_rows = []
    for sid, age_folder, audio_path, cha_path in tqdm(pairs, desc="Parse CHA", unit="file"):
        all_rows.extend(parse_cha_simple(Path(cha_path), str(age_folder), Path(audio_path)))
    df = pd.DataFrame(all_rows)
    print("Utterances:", len(df))
    return df

df = build_manifest(pairs)
df.to_csv(MANI/"ocsc_manifest_utterances.csv", index=False)
df.head()

Parse CHA: 100%|██████████| 304/304 [00:02<00:00, 127.30file/s]


Utterances: 139956


Unnamed: 0,session_id,age_folder,audio_path,cha_path,speaker_id,age_years,age_bucket,task,start_s,end_s,text,norm_text
0,4022,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4022.wav,/content/4022.cha,CHI_4022,4.0,4-5,IntroRobot,27.352,28.891,hello . 27352_28891,hello
1,4022,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4022.wav,/content/4022.cha,CHI_4022,4.0,4-5,IntroRobot,32.158,33.708,my name is Teigan . 32158_33708,my name is teigan
2,4022,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4022.wav,/content/4022.cha,CHI_4022,4.0,4-5,IntroRobot,48.155,49.945,mine's rainbow . 48155_49945,mine's rainbow
3,4022,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4022.wav,/content/4022.cha,CHI_4022,4.0,4-5,IntroRobot,64.699,65.845,xxx . 64699_65845,xxx
4,4022,4,/content/data/ocsc_raw/Eng-NA/OCSC/4/4022.wav,/content/4022.cha,CHI_4022,4.0,4-5,IntroRobot,65.845,66.642,&-um . 65845_66642,um


### Quality Filtering

Apply filters to ensure training data quality:
- **Timestamp requirement**: Only utterances with valid start/end times
- **Duration bounds**: 0.4 s minimum (drops clips too short to carry reliable speech), 14 s maximum (well within Whisper's 30-second input window)
- **Text length**: Minimum 2 characters after normalization

Report total hours of usable audio after filtering.

In [None]:
df = df[df["start_s"].notna() & df["end_s"].notna()].copy()

MIN_DUR, MAX_DUR = 0.40, 14.0
df["dur_s"] = df["end_s"] - df["start_s"]
df = df[(df["dur_s"] >= MIN_DUR) & (df["dur_s"] <= MAX_DUR)]

df = df[df["norm_text"].str.len() >= 2].reset_index(drop=True)
print("Utterances (post-filter):", len(df), "| hours ~", round(df["dur_s"].sum()/3600, 2))
df.to_csv(MANI/"ocsc_manifest_utterances.csv", index=False)
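
The three filters can be traced row by row. The sketch below is a plain-Python stand-in for the pandas filter, applied to a handful of invented rows; the `keep` helper and the sample rows are assumptions for illustration, not part of the notebook.

```python
MIN_DUR, MAX_DUR = 0.40, 14.0

def keep(row):
    """Return True if a manifest row passes all three quality filters."""
    # 1. Timestamp requirement
    if row["start_s"] is None or row["end_s"] is None:
        return False
    # 2. Duration bounds
    dur = row["end_s"] - row["start_s"]
    if not (MIN_DUR <= dur <= MAX_DUR):
        return False
    # 3. Minimum normalized text length
    return len(row["norm_text"]) >= 2

rows = [
    {"start_s": 27.352, "end_s": 28.891, "norm_text": "hello"},        # passes
    {"start_s": None,   "end_s": None,   "norm_text": "no bullet"},    # no timestamps
    {"start_s": 10.0,   "end_s": 10.2,   "norm_text": "uh huh"},       # shorter than 0.4 s
    {"start_s": 0.0,    "end_s": 20.0,   "norm_text": "a long story"}, # longer than 14 s
    {"start_s": 5.0,    "end_s": 6.0,    "norm_text": "a"},            # text under 2 chars
]

kept = [r for r in rows if keep(r)]
hours = sum(r["end_s"] - r["start_s"] for r in kept) / 3600
print(len(kept), "kept |", round(hours, 6), "hours")
```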

### Speaker-Disjoint Data Splits

Split data by **speaker** (not utterance) to prevent data leakage:
- **Train**: 72% of speakers
- **Dev**: 8% of speakers  
- **Test**: 20% of speakers

This ensures the model is evaluated on completely unseen children, not just unseen utterances from children it trained on.

In [None]:
from sklearn.model_selection import train_test_split

speakers = df["speaker_id"].dropna().unique().tolist()
train_spk, test_spk = train_test_split(speakers, test_size=0.20, random_state=42)
train_spk, dev_spk  = train_test_split(train_spk, test_size=0.10, random_state=42)

def by_spk(d, spk): return d[d["speaker_id"].isin(spk)].reset_index(drop=True)
df_train, df_dev, df_test = by_spk(df, train_spk), by_spk(df, dev_spk), by_spk(df, test_spk)

for name, d in [("train", df_train), ("dev", df_dev), ("test", df_test)]:
    d.to_csv(MANI/f"ocsc_{name}.csv", index=False); print(name, len(d))

train 96908
dev 11230
test 25408
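
The percentages follow from the two-stage split: 20% of speakers go to test, and 10% of the remaining 80% (8% overall) go to dev, leaving 72% for train. A quick sanity check that no speaker leaks across splits might look like the sketch below. The `check_speaker_disjoint` helper and the toy speaker IDs are hypothetical; in the notebook it could be called with the `speaker_id` columns of `df_train`, `df_dev`, and `df_test`.

```python
def check_speaker_disjoint(splits):
    """Raise if any speaker ID occurs in more than one split.

    `splits` maps a split name to an iterable of speaker IDs.
    Returns each split's share of the unique speakers.
    """
    owner = {}
    for name, speakers in splits.items():
        for spk in set(speakers):
            if spk in owner:
                raise ValueError(f"{spk} appears in both {owner[spk]} and {name}")
            owner[spk] = name
    total = len(owner)
    return {name: len(set(spk)) / total for name, spk in splits.items()}

# Toy example with 100 made-up speakers split 72/8/20.
speakers = [f"CHI_{i}" for i in range(100)]
fractions = check_speaker_disjoint({
    "train": speakers[:72],
    "dev": speakers[72:80],
    "test": speakers[80:],
})
print(fractions)

# A deliberately leaky split is rejected.
try:
    check_speaker_disjoint({"train": ["CHI_1", "CHI_2"], "test": ["CHI_2"]})
    leak_detected = False
except ValueError:
    leak_detected = True
```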


### Export Manifests to Google Drive
Save all CSV manifests to Drive for persistence across Colab sessions. Manifests contain paths and timestamps for streaming audio during training (no pre-extracted segments needed).

In [None]:
from google.colab import drive; from datetime import datetime
drive.mount('/content/drive')
STAMP = datetime.now().strftime("%Y%m%d-%H%M%S")
OUT = Path(f"/content/drive/MyDrive/ocsc_converted/{STAMP}"); OUT.mkdir(parents=True, exist_ok=True)
for p in MANI.glob("ocsc_*.csv"): shutil.copy2(p, OUT/p.name)
(OUT/"README.txt").write_text(
  f"OCSC manifests {STAMP}\nManifests only (no audio). Uses start_s/end_s for streamed decoding."
)
print("Saved manifests to Drive:", OUT)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Saved manifests to Drive: /content/drive/MyDrive/ocsc_converted/20251201-213109
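
Because the manifests store only `audio_path`, `start_s`, and `end_s`, a training loader has to cut each utterance out of the session recording on the fly. One possible approach, sketched below, converts the timestamps to frame offsets and reads just that range with `soundfile`. The helper names `segment_frames` and `load_segment` are hypothetical, and the sketch assumes the installed `soundfile` build can decode the file format in question.

```python
def segment_frames(start_s, end_s, sample_rate):
    """Convert start/end times in seconds to integer frame offsets."""
    start = int(round(start_s * sample_rate))
    stop = int(round(end_s * sample_rate))
    return start, stop

def load_segment(audio_path, start_s, end_s):
    """Read one utterance from a session recording without loading the whole file."""
    import soundfile as sf  # imported lazily so segment_frames works without it

    sr = sf.info(audio_path).samplerate
    start, stop = segment_frames(start_s, end_s, sr)
    audio, sr = sf.read(audio_path, start=start, stop=stop, dtype="float32")
    return audio, sr

# Frame range for the first manifest row (27.352 s to 28.891 s) at 16 kHz.
frames = segment_frames(27.352, 28.891, 16000)
print(frames)
```

Whisper feature extractors expect 16 kHz input, so a segment read at another sample rate would still need resampling before use.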
