# Whisper Fine-Tuning Workflow

Use this notebook after exporting reviewed transcriptions with `python -m wavecap_backend.tools.export_transcriptions`. It builds a dataset from the export and fine-tunes a Whisper model on it with Hugging Face Transformers.

## Prerequisites

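1. Install the Python dependencies used below: `pip install "datasets[audio]" transformers accelerate soundfile torch`. The `audio` extra pulls in the decoding and resampling libraries that the `Audio` feature relies on, and `torch` is required by the trainer.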
2. Run the export script from the repository root (or from `backend/` with `PYTHONPATH=src`) to generate a dataset directory:
   ```bash
   python -m wavecap_backend.tools.export_transcriptions --output-dir state/exports/whisper-dataset
   ```
3. Confirm that the output directory contains `transcriptions.jsonl`, `metadata.json`, and an `audio/` sub-directory when audio was copied.
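
The check in step 3 can also be scripted. The sketch below is only an illustration: `missing_export_files` is a hypothetical helper, not part of the WaveCap tooling, and it assumes the file names listed above.

```python
from pathlib import Path


def missing_export_files(dataset_dir, expect_audio=True):
    """Return the expected export entries that are absent from dataset_dir."""
    root = Path(dataset_dir)
    # File names taken from step 3 of the prerequisites.
    missing = [
        name
        for name in ("transcriptions.jsonl", "metadata.json")
        if not (root / name).is_file()
    ]
    # The audio/ directory only exists when the export copied audio clips.
    if expect_audio and not (root / "audio").is_dir():
        missing.append("audio/")
    return missing


# An empty list means the export directory looks complete.
print(missing_export_files("state/exports/whisper-dataset"))
```

Pass `expect_audio=False` if the export was run without copying audio.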

## Load the reviewed transcriptions

The export script produces a JSONL file where each record contains the cleaned text, metadata about the review, and an `audio_filepath` pointing to the WAV clip.
The code below loads the JSONL file with 🤗 `datasets`, drops entries without audio, and casts the audio column so that each clip is decoded and resampled to 16 kHz when it is accessed. It then splits the result 90/10 into a `DatasetDict` with `train` and `test` splits.

In [None]:
from pathlib import Path

from datasets import Audio, load_dataset

DATASET_DIR = Path("state/exports/whisper-dataset")
JSONL_PATH = DATASET_DIR / "transcriptions.jsonl"

if not JSONL_PATH.exists():
    raise FileNotFoundError(f"Could not find {JSONL_PATH}. Run the export script first.")

dataset = load_dataset("json", data_files=str(JSONL_PATH), split="train")
dataset = dataset.filter(lambda example: example.get("audio_filepath") is not None)
dataset = dataset.cast_column("audio_filepath", Audio(sampling_rate=16000))
dataset = dataset.rename_column("audio_filepath", "audio")
dataset = dataset.train_test_split(test_size=0.1, shuffle=True, seed=42)
dataset

Inspect a single example to verify the metadata and waveform are loaded as expected.

In [None]:
sample = dataset["train"][0]
sample
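
Whisper's feature extractor pads or truncates every clip to a 30-second window, so longer recordings would lose audio that their transcript still covers. The sketch below shows one way to check clip length. The helper names are hypothetical, and it assumes the decoded audio exposes `array` and `sampling_rate` keys, as `sample["audio"]` does above.

```python
MAX_CLIP_SECONDS = 30.0  # Whisper's feature extractor works on 30-second windows


def clip_duration_seconds(example):
    """Length of the decoded clip in seconds."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def is_within_window(example, max_seconds=MAX_CLIP_SECONDS):
    """True for non-empty clips that fit into a single Whisper window."""
    return 0.0 < clip_duration_seconds(example) <= max_seconds
```

For example, `dataset = dataset.filter(is_within_window)` would drop empty and over-long clips from both splits. This decodes every clip once, so it can be slow on large exports.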

## Prepare inputs for Whisper

Load the Whisper processor and model checkpoint, then convert each dataset example into the input features and token labels expected by the model. Adjust the `language` and `task` arguments to match your recordings.

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_NAME = "openai/whisper-small"
TARGET_LANGUAGE = "en"
processor = WhisperProcessor.from_pretrained(MODEL_NAME, language=TARGET_LANGUAGE, task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.pad_token_id = processor.tokenizer.eos_token_id


In [None]:
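def prepare_example(example):
    """Convert one decoded clip into log-mel input features and token labels."""
    audio = example["audio"]
    # The feature extractor returns a batch of size one; keep its single entry.
    example["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Call the tokenizer directly; `as_target_processor()` is deprecated in recent transformers releases.
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example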

processed_dataset = dataset.map(
    prepare_example,
    remove_columns=dataset["train"].column_names,
    num_proc=None,
)
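
The released Whisper checkpoints have a decoder context of 448 tokens (`max_target_positions` in the model config), so longer label sequences cannot be used for training. A minimal sketch of a length filter follows; `fits_decoder` and `MAX_LABEL_TOKENS` are hypothetical names introduced for illustration.

```python
MAX_LABEL_TOKENS = 448  # decoder context length of the released Whisper checkpoints


def fits_decoder(labels, max_tokens=MAX_LABEL_TOKENS):
    """True when the tokenized transcript fits into the decoder context."""
    return len(labels) <= max_tokens
```

It could be applied with `processed_dataset = processed_dataset.filter(fits_decoder, input_columns=["labels"])`, which passes only the `labels` column to the function and avoids loading the much larger `input_features`.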


## Fine-tune Whisper

Configure standard sequence-to-sequence training arguments and launch the trainer. Depending on dataset size and GPU memory, you may need to adjust the batch size, gradient accumulation steps, or warmup steps, or enable mixed precision (`fp16`).
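
To judge whether these settings suit a given export, it helps to estimate how many optimizer steps a run will take. The sketch below is an approximation rather than the trainer's exact bookkeeping, and `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                          num_epochs, num_devices=1):
    """Approximate number of optimizer updates for a full training run."""
    # Examples consumed per optimizer update across all devices.
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


# 900 training clips with batch size 4, no accumulation, 5 epochs.
print(total_optimizer_steps(900, 4, 1, 5))
```

If the estimate comes out below the configured `warmup_steps`, the learning rate never reaches its target value, so small datasets usually call for a shorter warmup.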

In [None]:
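from dataclasses import dataclass
from typing import Any

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-finetune",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    evaluation_strategy="steps",  # renamed to `eval_strategy` in newer transformers releases
    logging_steps=50,
    eval_steps=200,
    save_steps=200,
    num_train_epochs=5,
    learning_rate=1e-5,
    warmup_steps=500,
    fp16=False,
    gradient_checkpointing=True,
    generation_max_length=225,
)


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Pad audio features and labels separately; `DataCollatorForSeq2Seq` expects `input_ids` and cannot pad `input_features`."""

    processor: Any
    decoder_start_token_id: int

    def __call__(self, features):
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Mask padding positions with -100 so the loss ignores them.
        labels = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].ne(1), -100)
        # Drop a leading start token; the model prepends it again when it shifts the labels.
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor, model.config.decoder_start_token_id)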

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["test"],
    data_collator=data_collator,
    tokenizer=processor,  # newer transformers releases name this argument `processing_class`
)

# Uncomment the next line to begin fine-tuning once you are satisfied with the configuration.
# trainer.train()


After training completes, you can evaluate the model, push it to the Hugging Face Hub, or convert it back into a format suitable for the WaveCap backend.
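
The notebook does not prescribe an evaluation metric. Word error rate (WER) is the usual choice for speech recognition, and libraries such as `jiwer` or 🤗 `evaluate` are normally used to compute it. The dependency-free sketch below only illustrates the calculation; `word_error_rate` is a hypothetical helper, and it applies no text normalization beyond whitespace splitting.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Comparing this score for the base checkpoint and the fine-tuned model on the `test` split gives a rough indication of whether fine-tuning helped on your recordings.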