# Hands-on exercise

In this unit, we explored the challenges of fine-tuning ASR models, acknowledging the time and resources required to
fine-tune a model like Whisper (even a small checkpoint) on a new language. To provide a hands-on experience, we have
designed an exercise that allows you to navigate the process of fine-tuning an ASR model while using a smaller dataset.
The main goal of this exercise is to familiarize you with the process rather than expecting production-level results.
We have intentionally set a lenient target metric so that you can reach it even with limited resources.

Here are the instructions:
* Fine-tune the `"openai/whisper-tiny"` model using the American English ("en-US") subset of the `"PolyAI/minds14"` dataset.
* Use the first **450 examples for training**, and the rest for evaluation. Ensure you set `num_proc=1` when pre-processing the dataset using the `.map` method (this will ensure your model is submitted correctly for assessment).
* To evaluate the model, use the `wer` and `wer_ortho` metrics as described in this Unit. However, *do not* convert the metric into a percentage by multiplying by 100 (e.g. if the WER is 42%, we’ll expect to see the value 0.42 in this exercise).
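
To make the fraction-versus-percentage convention concrete, here is a minimal sketch of word error rate for a single reference/prediction pair. The `word_error_rate` helper and the example hypothesis are illustrative assumptions and not part of the exercise; in the exercise itself the metric comes from the `evaluate` library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes the reference contains at least one word.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "i would like to set up a joint account with my partner"
hypothesis = "i would like to set up joint account with a partner"  # made-up prediction

wer_fraction = word_error_rate(reference, hypothesis)
print(f"WER as a fraction:   {wer_fraction:.2f}")
print(f"WER as a percentage: {100 * wer_fraction:.0f}%")
```

The first printed value (a fraction between 0 and 1 for this pair) is the form this exercise expects, not the percentage.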

Once you have fine-tuned a model, make sure to upload it to the 🤗 Hub with the following `kwargs`:
```
kwargs = {
    "dataset_tags": "PolyAI/minds14",
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
}
```
You will pass this assignment if your model’s normalised WER (`wer`) is lower than **0.37**.

Feel free to build a demo of your model, and share it on Discord! If you have questions, post them in the #audio-study-group channel.

In [2]:
%pip install "transformers[torch]" "accelerate>=0.26.0" datasets gradio evaluate tensorboardX

Note: you may need to restart the kernel to use updated packages.


In [3]:
from datasets import load_dataset, DatasetDict, Audio
import evaluate
from functools import partial
from huggingface_hub import notebook_login
from transformers import WhisperProcessor, WhisperForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments
from transformers.models.whisper.english_normalizer import BasicTextNormalizer



In [4]:
model_name = "openai/whisper-tiny"
dataset_name = "PolyAI/minds14"
dataset_language = "en-US"
processor_language = "english"
max_input_length = 30.0
eval_metric = "wer"
task = "transcribe"

In [5]:
notebook_login()


## Load and preprocess the dataset and model



In [6]:
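# shuffle=False keeps the original order, so the first 450 examples become the
# training split and the remaining 113 the test split
dataset: DatasetDict = load_dataset(
    dataset_name, dataset_language, split="train"
).train_test_split(seed=42, shuffle=False, test_size=0.2)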


print(dataset)

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})


In [7]:
processor = WhisperProcessor.from_pretrained(
    model_name,
    language=processor_language,
    task=task,
)

In [8]:
dataset["train"].features

{'path': Value(dtype='string', id=None),
 'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None),
 'transcription': Value(dtype='string', id=None),
 'english_transcription': Value(dtype='string', id=None),
 'intent_class': ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None),
 'lang_id': ClassLabel(names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None)}

The dataset audio is stored at 8 kHz, whereas the Whisper feature extractor expects 16 kHz. We’ll set the audio inputs to the correct sampling rate using the dataset’s `cast_column` method. This operation does not change the audio in place; instead, it tells 🤗 Datasets to resample audio samples on the fly when they are loaded:

In [9]:
sampling_rate = processor.feature_extractor.sampling_rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
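
As a rough illustration of what resampling does to a waveform, the sketch below upsamples a tiny made-up signal from 8 kHz to 16 kHz with naive linear interpolation. The `resample_linear` helper is hypothetical and written only for this example; 🤗 Datasets uses a proper audio resampler, so treat this as a picture of the idea (same duration, twice as many samples), not of the library's implementation.

```python
def resample_linear(samples: list[float], source_rate: int, target_rate: int) -> list[float]:
    """Naive linear-interpolation resampler, for illustration only."""
    duration = len(samples) / source_rate
    target_length = round(duration * target_rate)
    resampled = []
    for n in range(target_length):
        # position of the n-th output sample on the source sample axis
        position = n * source_rate / target_rate
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        resampled.append((1 - fraction) * samples[left] + fraction * samples[right])
    return resampled


source = [0.0, 1.0, 0.0, -1.0]  # four made-up samples at 8 kHz
upsampled = resample_linear(source, source_rate=8000, target_rate=16000)

print(len(source), "samples at 8 kHz ->", len(upsampled), "samples at 16 kHz")
print("duration before:", len(source) / 8000, "s, after:", len(upsampled) / 16000, "s")
```

The number of samples doubles while the duration stays the same, which is why the loaded example reports `'sampling_rate': 16000` after the cast.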

In [10]:
example = dataset["train"][0]
print(example)

{'path': '/home/svernys/.cache/huggingface/datasets/downloads/extracted/006687c72f62d73aea93cc14b4904e5e42a68ac457f745dc2b866ae093470839/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', 'audio': {'path': '/home/svernys/.cache/huggingface/datasets/downloads/extracted/006687c72f62d73aea93cc14b4904e5e42a68ac457f745dc2b866ae093470839/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', 'array': array([ 1.70562416e-05,  2.18727451e-04,  2.28099874e-04, ...,
        3.43842403e-05, -5.96364771e-06, -1.76846661e-05], shape=(173398,)), 'sampling_rate': 16000}, 'transcription': 'I would like to set up a joint account with my partner', 'english_transcription': 'I would like to set up a joint account with my partner', 'intent_class': 11, 'lang_id': 4}


In [11]:
def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        audio=audio["array"],
        sampling_rate=audio["sampling_rate"],
        text=example["transcription"],
    )

    # compute input length of audio sample in seconds
    example["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    return example

In [12]:
dataset = dataset.map(
    prepare_dataset, remove_columns=dataset.column_names["train"], num_proc=1
)

In [13]:
def is_audio_in_length_range(length):
    return length < max_input_length

In [14]:
dataset["train"] = dataset["train"].filter(
    is_audio_in_length_range,
    input_columns=["input_length"],
)
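
The length filter above relies on the `input_length` value computed in `prepare_dataset` (number of samples divided by the sampling rate). A small standalone sketch of that arithmetic follows; the clip sizes other than the first are made up for illustration.

```python
max_input_length = 30.0  # seconds, the same threshold used in this notebook


def input_length_seconds(num_samples: int, sampling_rate: int) -> float:
    """Duration of a clip in seconds."""
    return num_samples / sampling_rate


# (num_samples, sampling_rate) pairs; the last two are hypothetical clips
clips = [(173_398, 16_000), (480_000, 16_000), (512_000, 16_000)]

durations = [input_length_seconds(n, sr) for n, sr in clips]
kept = [d for d in durations if d < max_input_length]

print("durations:", durations)
print("kept:", kept)
```

Because the comparison is strict (`<`), a clip of exactly 30 seconds is dropped along with anything longer.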

In [15]:
# Inspect the dataset
dataset["train"]

Dataset({
    features: ['input_features', 'labels', 'input_length'],
    num_rows: 445
})

## Data collator
Data collators are objects that form a batch from a list of dataset elements. These elements have the same type as the elements of `train_dataset` or `eval_dataset`.

To build batches, data collators may apply some processing, such as padding. Some of them, like `DataCollatorForLanguageModeling`, also apply random data augmentation (such as random masking) to the formed batch.
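
The padding step can be sketched in plain Python, without any library. The helpers `pad_labels` and `mask_padding` and the token ids below are made up for illustration; the collator in this notebook delegates the same work to the processor's tokenizer and to tensor operations. The value `-100` is the label index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_labels(label_sequences, pad_token_id):
    """Right-pad every sequence to the longest one and build an attention mask."""
    max_length = max(len(seq) for seq in label_sequences)
    padded, attention_mask = [], []
    for seq in label_sequences:
        padding = max_length - len(seq)
        padded.append(seq + [pad_token_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return padded, attention_mask


def mask_padding(padded, attention_mask):
    """Replace padded positions with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [token if keep else IGNORE_INDEX for token, keep in zip(row, mask_row)]
        for row, mask_row in zip(padded, attention_mask)
    ]


labels = [[11, 12, 13, 14], [21, 22]]  # made-up token ids of different lengths
padded, mask = pad_labels(labels, pad_token_id=0)
masked = mask_padding(padded, mask)

print(padded)
print(masked)
```

Masking with the attention mask, as opposed to searching for the pad token id, keeps genuine tokens intact even if one of them happens to share the pad id.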

In [16]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if the bos token was added in the previous tokenization step,
        # cut it here because it is appended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [17]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

## Evaluation metrics


In [18]:

metric = evaluate.load(eval_metric)

In [19]:

normalizer = BasicTextNormalizer()


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # compute orthographic WER (kept as a fraction, not a percentage)
    eval_ortho = metric.compute(predictions=pred_str, references=label_str)

    # compute normalised WER
    pred_str_norm = [normalizer(pred) for pred in pred_str]
    label_str_norm = [normalizer(label) for label in label_str]
    # filtering step to only evaluate the samples that correspond to non-empty references:
    pred_str_norm = [
        pred_str_norm[i] for i in range(len(pred_str_norm)) if len(label_str_norm[i]) > 0
    ]
    label_str_norm = [
        label_str_norm[i]
        for i in range(len(label_str_norm))
        if len(label_str_norm[i]) > 0
    ]

    # keep the raw fraction (e.g. 0.42) so it can be compared with the 0.37 threshold
    norm_eval = metric.compute(predictions=pred_str_norm, references=label_str_norm)

    return {f"{eval_metric}_ortho": eval_ortho, eval_metric: norm_eval}
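
The normalisation and filtering steps in `compute_metrics` can be illustrated with a toy normaliser. The `simple_normalize` function below is a hypothetical stand-in that only lowercases and strips punctuation; `BasicTextNormalizer` applies its own, more thorough rules. The example strings are made up.

```python
import re


def simple_normalize(text: str) -> str:
    """Toy normaliser: lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


predictions = ["Hello, world!", "uh"]
references = ["hello world", "..."]  # the second reference is empty once normalised

pred_norm = [simple_normalize(p) for p in predictions]
ref_norm = [simple_normalize(r) for r in references]

# keep only the pairs whose normalised reference is non-empty
pairs = [(p, r) for p, r in zip(pred_norm, ref_norm) if len(r) > 0]

print(pred_norm)
print(ref_norm)
print(pairs)
```

Dropping pairs with an empty reference matters because WER divides by the number of reference words, so such pairs carry no usable signal. Filtering predictions and references together keeps the two lists aligned.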

In [20]:
model = WhisperForConditionalGeneration.from_pretrained(model_name)

## Training arguments

In [21]:

# disable cache during training since it's incompatible with gradient checkpointing
model.config.use_cache = False

# set language and task for generation and re-enable cache
model.generate = partial(
    model.generate, language=processor_language, task=task, use_cache=True
)

In [22]:

training_args = Seq2SeqTrainingArguments(
    output_dir=f"./{model_name}-{processor_language}",  # name on the HF Hub
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=50,
    max_steps=500,  # increase to 4000 if you have your own GPU or a Colab paid plan
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    eval_strategy="steps",
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model=eval_metric,
    greater_is_better=False,
    push_to_hub=True,
)
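
To get a feel for these settings, the sketch below works out the effective batch size, the approximate number of passes over the training data, and the learning rate at a few steps. It assumes a single device and the 445 training rows left after filtering. The `constant_with_warmup_lr` helper is written for this example and mirrors the usual linear-warmup-then-constant rule; it is not taken from the library.

```python
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
num_devices = 1  # assumption: training on a single GPU
max_steps = 500
num_train_examples = 445  # training rows left after the length filter

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
approx_epochs = max_steps * effective_batch_size / num_train_examples


def constant_with_warmup_lr(step: int, base_lr: float = 1e-5, warmup_steps: int = 50) -> float:
    """Learning rate that ramps up linearly during warmup, then stays constant."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


print("effective batch size:", effective_batch_size)
print(f"approximate epochs: {approx_epochs:.1f}")
for step in (0, 25, 50, 500):
    print(step, constant_with_warmup_lr(step))
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` leaves `effective_batch_size` unchanged, which is the trade-off the inline comment in the training arguments refers to.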

In [23]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    processing_class=processor,  # replaces the deprecated `tokenizer` argument in recent transformers releases
)



In [None]:
trainer.train()

In [None]:
kwargs = {
    "dataset_tags": dataset_name,
    "dataset": dataset_name,
    "language": dataset_language,
    "model_name": "Whisper Tiny En - Sverre Nystad",
    "finetuned_from": model_name,
    "tasks": "automatic-speech-recognition",
}

In [None]:
trainer.push_to_hub(**kwargs)


In [None]:
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="SverreNystad/whisper-small-en")