### Training with Pretrained Models

In this tutorial, we will explore ASR training with pretrained models. Pretrained models are a shortcut to training high-accuracy models, even with limited data or compute resources.

These models have been trained on large-scale audio-text datasets—often consisting of thousands of hours of transcribed speech—by major research groups or organizations. As a result, they have already learned rich acoustic and linguistic representations. By starting from a pretrained model, we can fine-tune it on a smaller, domain-specific dataset (such as customer service calls, lecture recordings, or podcasts), allowing it to adapt to the new data much faster and with better performance than training from scratch.

This approach is a form of transfer learning, where the pretrained model transfers general knowledge (e.g., how speech sounds map to words) to a specific downstream task or domain.

### Dependencies

In [None]:
!pip install datasets transformers accelerate librosa evaluate jiwer speechbrain soundfile

### Import libraries

In [47]:

import torch
import torchaudio
import librosa
from datasets import load_dataset, Audio, DatasetDict
import evaluate
import numpy as np
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Trainer, TrainingArguments
import base64
import io
import soundfile as sf  # needed by array_to_audio_html below

In [None]:
# Helper that renders an audio array as a playable HTML <audio> element

def array_to_audio_html(array, rate=16000):
    # Save array directly to an in-memory buffer
    buffer = io.BytesIO()
    sf.write(buffer, array, rate, format='wav')
    buffer.seek(0)

    # Encode to base64
    b64_audio = base64.b64encode(buffer.read()).decode('utf-8')

    # Create HTML audio tag
    return f'<audio controls src="data:audio/wav;base64,{b64_audio}"></audio>'

## Fine-tuning

### Setup and Load Data

We load `atco2-asr`, an air traffic control dataset hosted on Hugging Face.

For teaching purposes, we omit the following steps here; some of them are covered in future tutorials:

1. Removing noise: The ATCO data is very noisy and contains a lot of artifacts, which you should remove.
1. Further preprocessing and augmentation: If possible, see how libraries like `audiomentations` or `speechbrain` can help preprocess and augment your existing data. This matters most for small datasets like `atco2-asr`.
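
To give a feel for what augmentation means in practice, here is a minimal sketch of one common technique: adding white Gaussian noise at a chosen signal-to-noise ratio. It is written in plain Python for illustration and is not the `audiomentations` or `speechbrain` API; the function name `add_gaussian_noise` and the toy sine-wave input are assumptions made for this example.

```python
import math
import random


def add_gaussian_noise(samples, snr_db, seed=0):
    """Return a copy of `samples` with white Gaussian noise added at `snr_db` dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in samples]

    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = sum(n * n for n in noise) / len(noise)

    # Scale the noise so that signal_power / scaled_noise_power == 10 ** (snr_db / 10).
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(samples, noise)]


# Toy input: 0.1 s of a 440 Hz tone at 16 kHz (a stand-in for real speech).
sample_rate = 16_000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1600)]
noisy = add_gaussian_noise(tone, snr_db=10)
```

Lower `snr_db` values produce noisier copies. Since the ATCO recordings are already noisy, augmentation strength would need to be chosen with care.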

In [None]:
# 1. Load ATCO2‑ASR from Hugging Face
dataset = load_dataset("jlvdoorn/atco2-asr")

### Exploring the data:

We briefly explore the dataset, showing which columns it contains and playing an audio sample.

In [None]:
# Explore the dataset

import IPython.display as ipd

# Check the available splits
print(dataset)

# Peek at the first training example
print("\n First training example:")
print(dataset["train"][0])

# List available columns
print("\n Columns in each example:", dataset["train"].column_names)

# Check the type of audio column
print("\n Audio column type:", type(dataset["train"][0]["audio"]))

# Check the contents of the audio column
print("\n Keys in each example['audio']:", dataset["train"][0]["audio"].keys())

# Show audio metadata
audio_sample = dataset["train"][0]["audio"]
print("\n Audio metadata:")
print("  Sampling rate:", audio_sample["sampling_rate"])
print("  Num samples  :", len(audio_sample["array"]))
print("  Duration (s) :", len(audio_sample["array"]) / audio_sample["sampling_rate"])

# Play the audio (Jupyter / Colab only)
ipd.Audio(audio_sample["array"], rate=audio_sample["sampling_rate"])

### But can we do better?

The previous output was rather messy, and showed a lot of unrelated data that we might not need. Let's use the outputs from the previous cell to summarize everything into a nice table.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML
import soundfile as sf

def show_random_elements(dataset, num_examples = 10):
  assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset"
  picks = random.sample(range(len(dataset)), num_examples)  # distinct random indices

  df = pd.DataFrame(dataset[picks]['audio'])
  df2 = pd.DataFrame(dataset[picks]['text'], columns=['text'])
  df = pd.concat([df, df2], axis=1)
  # Convert each audio array into a playable HTML <audio> element
  df['audio'] = df.apply(lambda row: array_to_audio_html(row['array'], row['sampling_rate']), axis=1)
  html_table = df[['audio','text', 'sampling_rate']].to_html(escape=False)

  display(HTML(html_table))
show_random_elements(dataset["train"])

This is our safety net: even though we have seen that most of the data is sampled at 16,000 Hz, it is good practice to cast everything to 16 kHz again.

In [51]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))  # ensure consistent sampling

### Load models

In [None]:
# 2. Load pre-trained Wav2Vec2 and its processor
model_name = "facebook/wav2vec2-base"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model     = Wav2Vec2ForCTC.from_pretrained(model_name)

### Preprocessing
This step removes special characters from the transcripts, as the Wav2Vec2 tokenizer does not model punctuation such as commas and full stops.

This makes sense, as you cannot reliably tell where punctuation should go from speech alone. Recent ASR models can also predict punctuation, but we omit that for now.

In [None]:
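import re
# Raw string so that the backslash is passed to the regex engine unchanged
chars_to_ignore_regex = r'[,?.!\-;:"]'

def remove_special_characters(batch):
    # NOTE: check that this casing matches the tokenizer vocabulary
    # (processor.tokenizer.get_vocab()); characters missing from the
    # vocabulary are mapped to the unknown token.
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"]).lower()
    return batch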

dataset = dataset.map(remove_special_characters) # This applies the function to every row
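
To make the effect of this step concrete, here is a small standalone sketch that applies the same character class to a single string. The example transcript is made up for illustration and is not taken from the dataset, and `normalize_transcript` is a hypothetical helper name.

```python
import re

# Same character class as in the cell above, written as a raw string.
CHARS_TO_IGNORE = r'[,?.!\-;:"]'


def normalize_transcript(text):
    """Strip the listed punctuation marks and lowercase the transcript."""
    return re.sub(CHARS_TO_IGNORE, "", text).lower()


# Hypothetical ATC-style transcript.
example = 'Lufthansa 123, descend to flight level 80!'
normalized = normalize_transcript(example)
print(normalized)  # lufthansa 123 descend to flight level 80
```

Note that removing hyphens joins the two halves of a hyphenated word, which may or may not be what you want for your data.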


### Dataset preparation

We do two things here:
1. Extract audio features from the audio array, using Wav2Vec2's feature extractor
1. Tokenize the labels for loss computation


In [53]:
def prepare_dataset(batch):
    '''
    Ideally we would call the processor once for both audio and text. Doing so
    triggers an error in Hugging Face about assigning multiple values to a
    single key, so the audio and the text are processed separately.
    '''

    audio = batch["audio"]

    # Process audio inputs with the feature extractor only
    features = processor.feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,  # set to True if you want the attention mask
        do_normalize = True
    )
    batch["input_values"] = features.input_values[0]

    # If you need the attention mask as well
    if hasattr(features, "attention_mask"):
        batch["attention_mask"] = features.attention_mask[0]

    # Process text targets separately
    # with processor.as_target_processor():
    #     batch["labels"] = processor.tokenizer(batch["text"]).input_ids

    batch["labels"] = processor(text=batch["text"]).input_ids

    return batch

In [None]:
dataset = dataset.map(prepare_dataset)
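
Because the labels come from the tokenizer, any character that is not in its vocabulary ends up as the unknown token. This is worth checking here, since the transcripts were lowercased earlier. The sketch below counts out-of-vocabulary characters; the toy vocabulary and transcripts are hypothetical, and `out_of_vocab_chars` is a helper name invented for this example. In the notebook you would pass `processor.tokenizer.get_vocab()` and `dataset["train"]["text"]` instead.

```python
from collections import Counter


def out_of_vocab_chars(transcripts, vocab):
    """Count characters in `transcripts` that are absent from `vocab`."""
    missing = Counter()
    for text in transcripts:
        for ch in text:
            # Wav2Vec2-style vocabularies typically use "|" as the word
            # delimiter instead of a literal space.
            token = "|" if ch == " " else ch
            if token not in vocab:
                missing[ch] += 1
    return missing


# Toy vocabulary and transcripts (hypothetical values for illustration).
toy_vocab = {"<pad>", "<unk>", "|", "A", "B", "C"}
print(out_of_vocab_chars(["ab c", "AB C"], toy_vocab))
```

If the counter is not empty for your real vocabulary, either adjust the text normalisation (for example the casing) or build a vocabulary that covers your transcripts.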

### Data Collator

Before training, there is one very important thing we must do: collate the data.

Our data currently contains samples of different lengths, which means training will fail: we cannot form proper tensors from such jagged data. To work around this, we need to pad our data and mask all padding tokens during training (we do not want the model to learn to predict padding).

To do this, we create a data collator that uses the processor's built-in padding capabilities.

In [None]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # Pad the labels with the tokenizer directly; `as_target_processor()` is
        # deprecated in recent versions of transformers.
        labels_batch = self.processor.tokenizer.pad(
            label_features,
            padding=self.padding,
            max_length=self.max_length_labels,
            pad_to_multiple_of=self.pad_to_multiple_of_labels,
            return_tensors="pt",
        )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch


In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
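
To see what the collator does to the labels, here is a plain-Python sketch of the same idea: pad every label sequence to the longest one in the batch, then overwrite the padded positions with `-100` so the loss ignores them. The token ids and the `pad_labels` helper are hypothetical and only illustrate the mechanism.

```python
IGNORE_INDEX = -100  # label value ignored by the loss, as in the collator above


def pad_labels(label_seqs, pad_id=0):
    """Pad label sequences to the longest one, then mask the padding."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, attention = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        attention.append([1] * len(seq) + [0] * n_pad)

    # Replace every padded position with IGNORE_INDEX, mirroring
    # masked_fill(attention_mask.ne(1), -100) in the collator.
    return [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(row, mask)]
        for row, mask in zip(padded, attention)
    ]


# Hypothetical token ids for three transcripts of different lengths.
batch = pad_labels([[5, 6, 7], [8], [9, 10]])
print(batch)  # [[5, 6, 7], [8, -100, -100], [9, 10, -100]]
```

Masking by attention mask rather than by token id matters: it leaves real tokens untouched even if one of them happens to share the padding id.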


### Word error rate

We calculate the word error rate during training. This metric is discussed in more detail in the next notebook; here it illustrates how to train models with custom metrics.

In [48]:
wer_metric = evaluate.load("wer")
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
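
For intuition about what the metric measures, here is a minimal from-scratch sketch of WER for a single sentence pair: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is an illustration only and not the `evaluate` implementation; it assumes a non-empty reference, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("for") and one substitution ("two" -> "to") over 6 reference words.
wer = word_error_rate(
    "cleared for takeoff runway two seven",
    "cleared takeoff runway to seven",
)
print(round(wer, 3))  # 0.333
```

Because errors are counted per word, a prediction that gets most characters of a word right still counts as a full error for that word.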


### Freezing layers

Freezing means preventing the model from updating a set of layers during training. This is useful when we do not want training to disturb parts of the model that we know are already well trained.

For example, Wav2Vec2 was pretrained on a large amount of speech audio, so we can be reasonably confident that its feature extractor already picks up important acoustic features.

We should instead focus on tuning the remaining layers to adapt the model to our dataset.

In [None]:
# freeze_feature_extractor() is deprecated in recent versions of transformers
model.freeze_feature_encoder()


### Training arguments:
These arguments define our training process and include hyperparameters that you might want to tune to improve the accuracy of your models.




In [54]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="wav2vec2-atco2-asr",  # Arbitrary directory name for checkpoints; change as needed
  group_by_length=True,  # Group examples of similar length together to reduce padding and improve efficiency
  per_device_train_batch_size=16,  # Batch size per device
  gradient_accumulation_steps=4,  # Accumulate gradients over 4 batches: effective batch size = 16 * 4 = 64
  num_train_epochs=10,  # Increase this if you want to push accuracy further on air traffic conversations
  fp16=True,  # Mixed-precision training (requires a GPU)
  gradient_checkpointing=True,  # Reduce memory usage at the cost of speed
  save_steps=200,
  logging_strategy="epoch",
  eval_strategy="epoch",
  learning_rate=1e-4,
  weight_decay=0.005,
  warmup_steps=1000,  # Note: on a small dataset this may exceed the total number of optimizer steps
  save_total_limit=2,
  report_to="none",
)
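
It helps to sanity-check how these numbers interact. The sketch below estimates the effective batch size and the rough number of optimizer updates for a run. The figure of 500 training examples is made up for illustration (use `len(dataset["train"])` in practice), `approx_optimizer_steps` is a hypothetical helper, and the count the `Trainer` reports can differ slightly depending on how the last partial batch is handled.

```python
import math


def approx_optimizer_steps(num_examples, per_device_batch_size,
                           grad_accum_steps, num_epochs, num_devices=1):
    """Rough count of optimizer updates for a full training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = max(1, math.ceil(num_examples / effective_batch))
    return effective_batch, steps_per_epoch * num_epochs


# 500 examples is an assumed dataset size, not a measured one.
effective_batch, total_steps = approx_optimizer_steps(
    num_examples=500,
    per_device_batch_size=16,
    grad_accum_steps=4,
    num_epochs=10,
)
print(effective_batch, total_steps)  # 64 80
```

If the total number of steps is smaller than `warmup_steps`, then under the default linear schedule with warmup the learning rate never reaches the configured `learning_rate` before training ends. In that case, a smaller `warmup_steps` is worth trying.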


In [39]:
from transformers import Trainer

#Add everything we defined previously into the trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    processing_class=processor.feature_extractor,
)


In [None]:
trainer.train() #Just a single line, how useful is that!


The training loss and the validation loss decreased as the number of epochs increased. This indicates that the model is learning to better predict the training and validation transcriptions.

However, the word error rate (WER) stayed roughly the same. One reason is that we did not tune hyperparameters to help the model generalise better. We also trained for only 10 epochs. Additionally, small changes in the model's predictions might not affect the WER, since a word counts as an error until every character in it is correct.

### Inference

Always inspect what your model outputs after training: you want to make sure it is not overfitting or gaming the loss.

In [None]:
# Inference with the model

def transcribe_samples(model, dataset, num_samples=5):
    # Named differently from the `evaluate` library imported above to avoid shadowing it
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()
    predictions = []
    references = []
    for i in range(num_samples):
        audio = dataset[i]['audio']['array']
        input_values = processor(audio, return_tensors="pt", padding=True, sampling_rate=16000).input_values
        with torch.no_grad():
            logits = model(input_values.to(device)).logits
        pred_ids = torch.argmax(logits, dim=-1)
        # batch_decode takes a batch of id sequences and returns one string per sequence
        pred_str = processor.batch_decode(pred_ids)[0]
        predictions.append(pred_str)
        references.append(dataset[i]['text'])
    return predictions, references

predictions, references = transcribe_samples(model, dataset["validation"])

#Print dataframe
df = pd.DataFrame({'predictions': predictions, 'references': references})
df
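
The argmax-then-decode step above is greedy CTC decoding: take the most likely token per audio frame, collapse consecutive repeats, then drop the blank token. The sketch below reproduces that logic in plain Python. The vocabulary and frame ids are hypothetical, with id 0 standing in for the blank (for Wav2Vec2, the padding token plays this role), and `ctc_greedy_decode` is a name chosen for this example.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blank tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Toy vocabulary (hypothetical); id 0 plays the role of the blank token.
id_to_char = {0: "", 1: "c", 2: "a", 3: "t", 4: "l"}

# Per-frame argmax ids, reading as "c c _ a a a _ l _ l".
frames = [1, 1, 0, 2, 2, 2, 0, 4, 0, 4]
decoded = ctc_greedy_decode(frames, id_to_char)
print(decoded)  # call
```

The blank between the two `l` frames is what lets the model emit a double letter. This is also why `compute_metrics` decodes the reference labels with `group_tokens=False`: labels are not frame-level predictions, so their repeated characters must not be collapsed.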

Notice that the outputs are still quite poor.

- This dataset is extremely noisy, which highlights the need to denoise our inputs beforehand.

- We are also constrained by Google Colab here:
  - Google Colab can time out after 30 minutes to 1 hour
  - This is especially so for GPU instances

Ideally, we would increase the number of epochs until the loss plateaus, but that requires a dedicated compute resource.