# Audio Preprocessing Tutorial (TORGO → Whisper Fine-tuning)

This notebook walks through the script **`audio_preprocessing.py`** step by step.

The script’s goal is to:
1. **Load** a Hugging Face dataset saved on disk (created earlier by `data_loader.py`)
2. **Preprocess** each audio clip (resample → normalize → trim silence → filter by duration)
3. **Optionally augment** training audio (speed / noise / pitch)
4. **Write** the resulting `.wav` files into an output folder structure
5. **Save stats** about how many samples were processed / filtered / augmented

## 0) Imports and dependencies

The original script relies on:

- `librosa` for audio DSP operations (resample, trim, time-stretch, pitch-shift)
- `numpy` for numeric array operations
- `soundfile` for writing `.wav` files
- `datasets` for loading a dataset saved with Hugging Face Datasets and casting audio sampling rate

If you run this notebook in an environment missing these libraries, install them (example):
- `pip install librosa soundfile datasets`

(Installation commands are not executed automatically here.)


In [2]:
import json
from pathlib import Path

import numpy as np

# Third-party dependencies (install with: pip install librosa soundfile datasets)
import librosa
import soundfile as sf
from datasets import load_from_disk, Audio




## 1) Core preprocessing functions

The preprocessing pipeline is built from small, single-purpose functions.
Each one can be tested independently, and `preprocess_sample` (section 3) composes them in a fixed order.

### 1.1 `resample(audio, orig_sr, target_sr=16000)`
- **Why?** Whisper models typically expect **16 kHz** audio.
- If the audio is already at the target rate, return it unchanged.
- Otherwise, use `librosa.resample` to convert sample rate.

### 1.2 `normalize_amplitude(audio)`
- **Why?** Different recordings have different volumes.
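- This is peak normalization: the audio is divided by its maximum absolute value, so the peak becomes `1.0` and all samples fall in `[-1, 1]`. An all-zero clip is returned unchanged to avoid dividing by zero.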

### 1.3 `trim_silence(audio, sr, top_db=40)`
- **Why?** Many datasets have leading/trailing silence.
- `librosa.effects.trim` removes leading and trailing segments that are more than `top_db` decibels below the clip's peak level; silence in the middle of the clip is kept.

### 1.4 `is_valid_duration(audio, sr, min_sec=0.5, max_sec=30.0)`
- **Why?** Clips shorter than 0.5 s often contain little speech, and Whisper operates on 30-second windows, so longer clips would be truncated and are slow to train on.
- Returns `True` only if duration is in the desired range.


In [2]:
def resample(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample audio to target sample rate."""
    if orig_sr == target_sr:
        return audio
    return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)


def normalize_amplitude(audio: np.ndarray) -> np.ndarray:
    """Normalize audio amplitude to [-1, 1]."""
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    return audio


def trim_silence(audio: np.ndarray, sr: int, top_db: int = 40) -> np.ndarray:
    """Trim leading and trailing silence (`sr` is accepted but not used by the trim call)."""
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    return trimmed


def is_valid_duration(audio: np.ndarray, sr: int, min_sec: float = 0.5, max_sec: float = 30.0) -> bool:
    """Check if audio duration is within acceptable range."""
    duration = len(audio) / sr
    return min_sec <= duration <= max_sec
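
As a quick sanity check, the two pure-NumPy helpers can be exercised on synthetic signals. The sketch below repeats their definitions in compact form so that it runs standalone; the test signals (a quiet 440 Hz tone, an all-zero clip, a 0.25 s clip) are illustrative assumptions, not TORGO data.

```python
import numpy as np


def normalize_amplitude(audio: np.ndarray) -> np.ndarray:
    """Peak-normalize to [-1, 1]; silent clips are returned unchanged."""
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio


def is_valid_duration(audio: np.ndarray, sr: int,
                      min_sec: float = 0.5, max_sec: float = 30.0) -> bool:
    """True if the clip length in seconds lies in [min_sec, max_sec]."""
    return min_sec <= len(audio) / sr <= max_sec


sr = 16000
t = np.arange(sr) / sr                           # 1 second of timestamps
quiet_tone = 0.05 * np.sin(2 * np.pi * 440 * t)  # synthetic low-volume tone
silence = np.zeros(sr)                           # all-zero clip
short_clip = quiet_tone[: sr // 4]               # 0.25 s, below min_sec

normalized = normalize_amplitude(quiet_tone)
peak_after = float(np.max(np.abs(normalized)))   # peak is scaled to 1.0

silence_unchanged = np.array_equal(normalize_amplitude(silence), silence)
keeps_one_second = is_valid_duration(quiet_tone, sr)
keeps_quarter_second = is_valid_duration(short_clip, sr)

print(peak_after, silence_unchanged, keeps_one_second, keeps_quarter_second)
```

The all-zero case is the one worth checking explicitly: without the `peak > 0` guard, the division would produce `nan` values.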


## 2) Augmentation functions

Augmentation increases training diversity by creating plausible variations of the same utterance.

> In the original script, augmentation is applied **only to the training split**.

### 2.1 Speed perturbation: `augment_speed(audio, sr, rate)`
- Uses `librosa.effects.time_stretch`
- `rate < 1` → slower (longer)
- `rate > 1` → faster (shorter)

### 2.2 Pitch shift: `augment_pitch(audio, sr, n_steps)`
- Shifts pitch up/down by a number of semitones
- Useful for speaker variability

### 2.3 Add noise: `augment_noise(audio, snr_db)`
- Adds Gaussian noise to achieve a target **SNR** (signal-to-noise ratio)
- Lower SNR → noisier audio


In [3]:
def augment_speed(audio: np.ndarray, sr: int, rate: float) -> np.ndarray:
    """Apply speed perturbation (time-stretch without pitch change)."""
    return librosa.effects.time_stretch(audio, rate=rate)


def augment_pitch(audio: np.ndarray, sr: int, n_steps: float) -> np.ndarray:
    """Shift pitch by n_steps semitones."""
    return librosa.effects.pitch_shift(audio, sr=sr, n_steps=n_steps)


def augment_noise(audio: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add Gaussian noise at a given SNR."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0, np.sqrt(noise_power), len(audio))
    return audio + noise
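
The relation used above is `noise_power = signal_power / 10^(snr_db / 10)`, so at 20 dB the noise carries 1/100 of the signal power. The following sketch checks that empirically on a synthetic tone. It uses a seedable `np.random.Generator` instead of the global `np.random` state; that swap, the helper names `add_noise_at_snr` and `measured_snr_db`, and the test signal are assumptions made for the illustration and are not part of the original script.

```python
import numpy as np


def add_noise_at_snr(audio: np.ndarray, snr_db: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Same formula as augment_noise, but with an explicit, seedable RNG."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise


def measured_snr_db(clean: np.ndarray, noisy: np.ndarray) -> float:
    """SNR in dB computed from the clean signal and the noise that was added."""
    noise = noisy - clean
    return float(10 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2)))


rng = np.random.default_rng(0)
sr = 16000
t = np.arange(10 * sr) / sr                  # 10 s of timestamps
clean = 0.5 * np.sin(2 * np.pi * 220 * t)    # synthetic tone, not TORGO audio

snr_20 = measured_snr_db(clean, add_noise_at_snr(clean, 20.0, rng))
snr_15 = measured_snr_db(clean, add_noise_at_snr(clean, 15.0, rng))
print(f"target 20 dB, measured {snr_20:.2f} dB")
print(f"target 15 dB, measured {snr_15:.2f} dB")
```

Note that adding noise after peak normalization can push individual samples slightly outside `[-1, 1]`; whether that matters depends on how the `.wav` files are written and read downstream.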


## 3) `preprocess_sample`: the full preprocessing pipeline

This function ties together the earlier helpers:

1. Resample to `target_sr` (default **16 kHz**)
2. Normalize amplitude
3. Trim leading/trailing silence
4. Filter samples outside a duration range  
   - If invalid, return `None` so the caller can **skip** the sample

Returning `None` is a clean pattern for *filtering* in pipelines.


In [4]:
def preprocess_sample(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray | None:
    """Apply full preprocessing pipeline to a single audio sample.

    Returns None if the sample should be filtered out.
    """
    audio = resample(audio, sr, target_sr)
    audio = normalize_amplitude(audio)
    audio = trim_silence(audio, target_sr)

    if not is_valid_duration(audio, target_sr):
        return None

    return audio
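
The caller side of this pattern can be sketched as follows. The function `keep_if_long_enough` is a hypothetical stand-in for `preprocess_sample` that only applies a minimum-duration check on plain Python lists, so the example runs without audio libraries.

```python
from typing import Optional


def keep_if_long_enough(samples: list[float], sr: int,
                        min_sec: float = 0.5) -> Optional[list[float]]:
    """Hypothetical stand-in for preprocess_sample: None means 'skip this clip'."""
    if len(samples) / sr < min_sec:
        return None
    return samples


sr = 16000
# Three dummy clips of 1.0 s, 0.25 s and 0.5625 s
clips = [[0.0] * 16000, [0.0] * 4000, [0.0] * 9000]

stats = {"processed": 0, "filtered": 0}
kept = []
for clip in clips:
    result = keep_if_long_enough(clip, sr)
    if result is None:            # identity check, not truthiness
        stats["filtered"] += 1
        continue
    kept.append(result)
    stats["processed"] += 1

print(stats)
```

Testing with `is None` rather than `if not result` matters once the return value is a NumPy array: evaluating the truth value of a multi-element array raises a `ValueError`.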


## 4) `augment_sample`: generating augmented variants

This function returns a **list of (suffix, audio_array)** pairs so the caller can name files like:

- `sample_00010_speed09.wav`
- `sample_00010_noise15.wav`
- `sample_00010_pitch_up2.wav`

This is convenient because the dataset loop doesn't need to know augmentation details — it only:
- calls `augment_sample(...)`
- writes each returned audio to disk


In [5]:
def augment_sample(audio: np.ndarray, sr: int) -> list[tuple[str, np.ndarray]]:
    """Generate augmented versions of a single audio sample.

    Returns list of (suffix, augmented_audio) tuples.
    """
    augmented = []

    # Speed perturbation
    augmented.append(("speed09", augment_speed(audio, sr, rate=0.9)))
    augmented.append(("speed11", augment_speed(audio, sr, rate=1.1)))

    # Background noise
    augmented.append(("noise20", augment_noise(audio, snr_db=20.0)))
    augmented.append(("noise15", augment_noise(audio, snr_db=15.0)))

    # Pitch shifting
    augmented.append(("pitch_down2", augment_pitch(audio, sr, n_steps=-2)))
    augmented.append(("pitch_up2", augment_pitch(audio, sr, n_steps=2)))

    return augmented
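
The naming scheme implied by these suffixes can be sketched in isolation. The helpers `output_names` and `parse_name` below are hypothetical (they do not appear in the script); they only illustrate how a sample index and a suffix map to a file name and back.

```python
from pathlib import Path

AUGMENT_SUFFIXES = ["speed09", "speed11", "noise20", "noise15",
                    "pitch_down2", "pitch_up2"]


def output_names(index: int, suffixes: list[str]) -> list[str]:
    """File names written for one kept sample: the base clip plus its variants."""
    base = f"sample_{index:05d}"
    return [f"{base}.wav"] + [f"{base}_{s}.wav" for s in suffixes]


def parse_name(filename: str) -> tuple[int, str]:
    """Recover (sample index, suffix) from a file name; suffix is '' for the base clip."""
    stem = Path(filename).stem            # e.g. "sample_00010_pitch_up2"
    _, index, *rest = stem.split("_")     # suffixes may themselves contain "_"
    return int(index), "_".join(rest)


names = output_names(10, AUGMENT_SUFFIXES)
parsed_variant = parse_name("sample_00010_pitch_up2.wav")
parsed_base = parse_name("sample_00010.wav")
print(names)
print(parsed_variant, parsed_base)
```

Because the index is zero-padded to five digits, a plain lexicographic sort of the directory listing keeps each base clip next to its variants for up to 100,000 samples per split.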


## 5) `process_dataset`: the main workflow

This is the “engine” of the script.

### Expected input layout
The function expects:
- `input_dir / "torgo_dataset"`: a Hugging Face dataset saved with `save_to_disk`
- Optional `input_dir / "splits.json"`: informational stats (printed only)

### What it does
For each split (e.g., `train`, `validation`, `test`):
1. Create output folder `output_dir / split_name`
2. Iterate samples:
   - read audio array + sampling rate
   - run `preprocess_sample`
   - if `None` → count as filtered, skip
   - else write processed `.wav`
   - if split is **train** and augmentation enabled → write augmented `.wav`s
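3. Print a progress line whenever the sample count reaches a multiple of 100 (the line is skipped if that particular sample was filtered out)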
4. Save `preprocessing_stats.json` into the output directory


In [6]:
def process_dataset(input_dir: Path, output_dir: Path, target_sr: int = 16000, do_augment: bool = True):
    """Process the full TORGO dataset: preprocess and optionally augment.

    Expects the dataset saved via data_loader.py at input_dir/torgo_dataset.
    """
    dataset_path = input_dir / "torgo_dataset"
    if not dataset_path.exists():
        raise FileNotFoundError(
            f"Dataset not found at {dataset_path}. "
            "Run data_loader.py first to download the dataset."
        )

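    dataset = load_from_disk(str(dataset_path))
    # Decode audio at target_sr on access, so the resample() call inside
    # preprocess_sample() becomes a no-op safeguard.
    dataset = dataset.cast_column("audio", Audio(sampling_rate=target_sr))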

    output_dir.mkdir(parents=True, exist_ok=True)

    # Load split info (for reference)
    splits_path = input_dir / "splits.json"
    if splits_path.exists():
        with open(splits_path) as f:
            split_data = json.load(f)
            print(f"Dataset stats: {split_data.get('stats', {})}")

    stats = {"processed": 0, "filtered": 0, "augmented": 0}

    for split_name in dataset:
        print(f"\nProcessing split: {split_name}")
        split_dir = output_dir / split_name
        split_dir.mkdir(exist_ok=True)

        for i, sample in enumerate(dataset[split_name]):
            audio_data = sample["audio"]["array"]
            sr = sample["audio"]["sampling_rate"]

            processed = preprocess_sample(audio_data, sr, target_sr)
            if processed is None:
                stats["filtered"] += 1
                continue

            # Save processed audio
            out_path = split_dir / f"sample_{i:05d}.wav"
            sf.write(str(out_path), processed, target_sr)
            stats["processed"] += 1

            # Augment only training data
            if do_augment and split_name == "train":
                for suffix, aug_audio in augment_sample(processed, target_sr):
                    aug_path = split_dir / f"sample_{i:05d}_{suffix}.wav"
                    sf.write(str(aug_path), aug_audio, target_sr)
                    stats["augmented"] += 1

            if (i + 1) % 100 == 0:
                print(f"  Processed {i + 1} samples...")

    print(
        f"\nDone. Processed: {stats['processed']}, "
        f"Filtered: {stats['filtered']}, Augmented: {stats['augmented']}"
    )

    # Save stats
    with open(output_dir / "preprocessing_stats.json", "w") as f:
        json.dump(stats, f, indent=2)


## 6) `main()`: command-line interface

The `main()` function:

1. Defines command-line arguments:
   - `--input`: folder containing `torgo_dataset/`
   - `--output`: where to write processed audio
   - `--sr`: target sample rate (default 16000)
   - `--no-augment`: disables augmentation
2. Parses args
3. Calls `process_dataset(...)`

The final:

```python
if __name__ == "__main__":
    main()
```

means: *only run `main()` when this file is executed as a script*, not when imported as a module.
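
Since the body of `main()` is not shown in this notebook, here is a minimal sketch of what such a command-line interface could look like with `argparse`. The four flags and the `--sr` default come from the list above; the defaults for `--input` and `--output`, the help strings, and the helper names `build_parser` and `args_to_kwargs` are assumptions.

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Define the four command-line flags described above."""
    parser = argparse.ArgumentParser(
        description="Preprocess TORGO audio for Whisper fine-tuning."
    )
    # Assumed defaults: they mirror the direct call used later in this notebook.
    parser.add_argument("--input", type=Path, default=Path("torgo"),
                        help="folder containing torgo_dataset/")
    parser.add_argument("--output", type=Path, default=Path("torgo/processed"),
                        help="where to write processed audio")
    parser.add_argument("--sr", type=int, default=16000,
                        help="target sample rate in Hz")
    parser.add_argument("--no-augment", action="store_true",
                        help="disable augmentation of the training split")
    return parser


def args_to_kwargs(argv: Optional[list[str]] = None) -> dict:
    """Parse argv and translate it into keyword arguments for process_dataset."""
    args = build_parser().parse_args(argv)
    return {
        "input_dir": args.input,
        "output_dir": args.output,
        "target_sr": args.sr,
        "do_augment": not args.no_augment,   # the flag is negative, the kwarg positive
    }


kwargs = args_to_kwargs(["--input", "data", "--output", "out", "--no-augment"])
print(kwargs)
# In the script, main() would then call: process_dataset(**kwargs)
```

Using `action="store_true"` on a negative flag keeps augmentation enabled by default, which matches `do_augment=True` in the signature of `process_dataset`.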


In [8]:
process_dataset(
    input_dir=Path("torgo"),
    output_dir=Path("torgo/processed"),
    target_sr=16000,
    do_augment=True,
)

Dataset stats: {'train': {'total': 13240, 'by_status': {'1': 8782, '0': 4458}}, 'validation': {'total': 1656, 'by_status': {'0': 558, '1': 1098}}, 'test': {'total': 1656, 'by_status': {'1': 1098, '0': 558}}}

Processing split: train
  Processed 100 samples...
  Processed 200 samples...
  Processed 300 samples...
  Processed 400 samples...
  Processed 500 samples...
  Processed 600 samples...
  Processed 700 samples...
  Processed 800 samples...
  Processed 900 samples...
  Processed 1000 samples...
  Processed 1100 samples...
  Processed 1200 samples...
  Processed 1300 samples...
  Processed 1400 samples...
  Processed 1500 samples...
  Processed 1600 samples...
  Processed 1700 samples...
  Processed 1800 samples...
  Processed 1900 samples...
  Processed 2000 samples...
  Processed 2100 samples...
  Processed 2200 samples...
  Processed 2300 samples...
  Processed 2400 samples...
  Processed 2500 samples...
  Processed 2600 samples...
  Processed 2700 samples...
  Processed 2800 samples...
  [output truncated]

## 7) Generate metadata JSON for processed samples

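The preprocessing step above writes only `.wav` files; no per-sample metadata is preserved.
This cell reconstructs metadata from the original dataset by index (`sample_{i:05d}.wav` corresponds to the `i`-th example of its split) and writes `metadata.json` into the
processed output directory. Samples whose `.wav` file does not exist, because they were filtered out, are skipped.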

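For each processed `.wav` (including augmented variants), the JSON records:
- `transcription`: the ground-truth text
- `speech_status`: `"dysarthria"` or `"healthy"`
- `gender`: speaker gender
- `duration`: duration in seconds as stored in the original dataset; augmented variants reuse the base sample's entry, so this value does not reflect silence trimming or speed perturbation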

The file is keyed by split → filename, making it easy to look up any sample's metadata.

In [3]:
dataset_path = Path("torgo/torgo_dataset")
output_dir = Path("torgo/processed")
target_sr = 16000

dataset = load_from_disk(str(dataset_path))
status_labels = dataset["train"].features["speech_status"].names  # ["dysarthria", "healthy"]

augment_suffixes = ["speed09", "speed11", "noise20", "noise15", "pitch_down2", "pitch_up2"]

metadata = {}

for split_name in dataset:
    split_dir = output_dir / split_name
    split_metadata = {}

    for i, sample in enumerate(dataset[split_name]):
        base_name = f"sample_{i:05d}"
        wav_path = split_dir / f"{base_name}.wav"

        if not wav_path.exists():
            continue

        entry = {
            "transcription": sample["transcription"],
            "speech_status": status_labels[sample["speech_status"]],
            "gender": sample["gender"],
            "duration": sample["duration"],
        }

        split_metadata[f"{base_name}.wav"] = entry

        if split_name == "train":
            for suffix in augment_suffixes:
                aug_path = split_dir / f"{base_name}_{suffix}.wav"
                if aug_path.exists():
                    split_metadata[f"{base_name}_{suffix}.wav"] = entry

    metadata[split_name] = split_metadata
    dysarthria_count = sum(1 for v in split_metadata.values() if v["speech_status"] == "dysarthria")
    healthy_count = sum(1 for v in split_metadata.values() if v["speech_status"] == "healthy")
    print(f"{split_name}: {len(split_metadata)} entries ({dysarthria_count} dysarthria, {healthy_count} healthy)")

metadata_path = output_dir / "metadata.json"
with open(metadata_path, "w") as f:
    json.dump(metadata, f, indent=2)

print(f"\nMetadata saved to {metadata_path}")

train: 92617 entries (31171 dysarthria, 61446 healthy)
validation: 1653 entries (556 dysarthria, 1097 healthy)
test: 1655 entries (558 dysarthria, 1097 healthy)

Metadata saved to torgo/processed/metadata.json
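
These counts can be cross-checked against the split totals printed earlier. The sketch below assumes that every kept training sample has all six augmented variants on disk, so each one contributes 7 metadata entries; the even division by 7 is consistent with that assumption but does not prove it.

```python
VARIANTS_PER_TRAIN_SAMPLE = 1 + 6   # base clip + six augmented variants

# Split sizes from the "Dataset stats" line printed by process_dataset
totals = {"train": 13240, "validation": 1656, "test": 1656}
# Entry counts from the metadata cell output above
entries = {"train": 92617, "validation": 1653, "test": 1655}

train_divides_evenly = entries["train"] % VARIANTS_PER_TRAIN_SAMPLE == 0

kept = {
    "train": entries["train"] // VARIANTS_PER_TRAIN_SAMPLE,
    "validation": entries["validation"],   # no augmentation outside train
    "test": entries["test"],
}
filtered = {split: totals[split] - kept[split] for split in totals}

print(f"kept: {kept}")
print(f"filtered: {filtered}")
```

Under that assumption, the duration filter removed only a handful of clips per split, and the sum of the `filtered` values should match the `filtered` field in `preprocessing_stats.json`.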


## 8) Summary

You can think of the script as:

- **Data ingestion** (load dataset from disk)
- **Canonicalization** (resample to 16 kHz, normalize)
- **Cleanup** (trim silence, filter duration)
- **Training diversification** (augment *train* only)
- **Export** (write `.wav` files + stats JSON)
