# Introduction
Automatic speech recognition (ASR) converts a speech signal to text.

--> Audio as input --> Text as output

Examples: Siri and Alexa

This demo shows how to fine-tune Wav2Vec2 on the MInDS-14 dataset to transcribe audio to text, and then how to use the fine-tuned model for inference.

Task page on HF: https://huggingface.co/tasks/automatic-speech-recognition



In [None]:
# Install all the libraries
!pip install transformers datasets evaluate jiwer huggingface_hub -q

In [None]:
# Log in to the HF Hub
from huggingface_hub import notebook_login

notebook_login()

## Some info related to dataset

MInDS-14 is a training and evaluation resource for the intent detection task with spoken data.

It covers 14 intents extracted from a commercial system in the e-banking domain, with spoken examples in 14 diverse language varieties.

In [None]:
# To get rid of the "NotImplementedError" raised during dataset loading
!pip install datasets==3.6.0

In [None]:
# Load MInDS-14 Dataset
from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")

In [None]:
# Split the dataset's train split into a train and test set with the Dataset.train_test_split method
minds = minds.train_test_split(test_size=0.2)
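
As a rough illustration of what the split above does, here is a toy version in plain Python. `shuffle_split` is a hypothetical helper, not the 🤗 Datasets implementation; it only shows that the examples are shuffled and that a `test_size` fraction is held out, which gives 80 training and 20 test examples here.

```python
import random

def shuffle_split(items, test_size, seed=None):
    """Shuffle a copy of items, then hold out a test_size fraction as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

examples = list(range(100))  # stand-ins for the 100 loaded examples
split = shuffle_split(examples, test_size=0.2, seed=42)
print(len(split["train"]), len(split["test"]))
```

Because the split is random, the examples that land in each part change between runs; `train_test_split` also accepts a `seed` argument if you want a repeatable split.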

In [None]:
# Take a look at the dataset
minds

## "remove_columns" method docs:
 https://huggingface.co/docs/datasets/v3.6.0/en/package_reference/main_classes#datasets.Dataset.remove_columns

In [None]:
# This demo focuses on audio and transcription. Remove the other columns with the remove_columns method:
minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])  # column names to be removed

In [None]:
# Take a look at the data again.
minds["train"][0]

## There are two fields in the previous output that we need to understand
* audio: contains a 1-dimensional array of the speech signal; the audio file is loaded and resampled when this column is accessed
* transcription: the target text

# Data Preprocessing

In [None]:
# Load the Wav2Vec2 processor to process the audio signal:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")

The MInDS-14 dataset has a sampling rate of 8000 Hz, so you need to resample it to 16000 Hz to use the pretrained Wav2Vec2 model:

In [None]:
# Resample the dataset from 8000 Hz to 16000 Hz
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
# Check the data again
minds["train"][0]
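
For intuition about what resampling from 8000 Hz to 16000 Hz means, here is a toy sketch in plain Python. It uses linear interpolation, which is an assumption made for readability; the `Audio` feature relies on a proper resampling backend, and `upsample_linear` is a hypothetical helper that is not part of 🤗 Datasets.

```python
def upsample_linear(samples, src_rate, dst_rate):
    """Toy resampler: linearly interpolate samples from src_rate to dst_rate."""
    if not samples:
        return []
    duration = len(samples) / src_rate       # clip length in seconds
    n_out = int(round(duration * dst_rate))  # number of output samples
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional index into the source
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

one_ms_at_8k = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]  # 8 samples = 1 ms at 8 kHz
one_ms_at_16k = upsample_linear(one_ms_at_8k, 8_000, 16_000)
print(len(one_ms_at_8k), "->", len(one_ms_at_16k))
```

The clip keeps its duration while the number of samples doubles, which is why the `array` in the output above is longer after the cast.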

NOTE: The transcription above contains a mix of uppercase and lowercase characters. However, the Wav2Vec2 tokenizer is only trained on uppercase characters, so you'll need to make sure the text matches the tokenizer's vocabulary.

In [None]:
def uppercase(example):
  return {"transcription": example["transcription"].upper()}

minds = minds.map(uppercase)
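
To see why the uppercasing matters, the sketch below checks a transcription against a character vocabulary. `ASSUMED_VOCAB` is a simplified, assumed vocabulary (uppercase letters, apostrophe, space) and `out_of_vocab` is a hypothetical helper; the real vocabulary lives in `processor.tokenizer` and may differ.

```python
import string

# Assumed, simplified character vocabulary: uppercase letters, apostrophe, space.
# The real vocabulary lives in processor.tokenizer and may differ.
ASSUMED_VOCAB = set(string.ascii_uppercase) | {"'", " "}

def out_of_vocab(text, vocab=ASSUMED_VOCAB):
    """Return the sorted list of characters in text that the vocabulary lacks."""
    return sorted(set(text) - vocab)

raw = "I would like to pay my bill"
print(out_of_vocab(raw))          # lowercase letters are reported as missing
print(out_of_vocab(raw.upper()))  # nothing is missing after uppercasing
```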

Now we need to create a preprocessing function that:
1. Accesses the audio column to load and resample the audio file.
2. Extracts the input_values from the audio file and tokenizes the transcription column with the processor.

In [None]:
def prepare_dataset(batch):
  audio = batch["audio"]
  batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
  batch["input_length"] = len(batch["input_values"][0])
  return batch

In [None]:
# To apply the preprocessing function (prepare_dataset), we use the map function from datasets
# num_proc is used to speed up the mapping process
# remove_columns drops the original columns, which are no longer needed after preprocessing
encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)

NOTE: Transformers does not have a data collator for ASR, so you will need to adapt DataCollatorWithPadding to create a batch of examples.

It will also dynamically pad the inputs and labels to the length of the longest element in their batch (instead of the entire dataset) so they are a uniform length.
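
A minimal sketch of dynamic padding in plain Python, assuming toy lists instead of tensors. `pad_to_longest` is a hypothetical helper; labels are filled with -100, the value the collator in this notebook uses so that the loss ignores padded positions.

```python
def pad_to_longest(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real element, 0 marks padding
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return padded, mask

# Two toy examples with different lengths (made-up numbers)
input_values = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]]
labels = [[7, 8, 9], [7]]

padded_inputs, attention_mask = pad_to_longest(input_values, pad_value=0.0)
padded_labels, _ = pad_to_longest(labels, pad_value=-100)

print(padded_inputs)
print(attention_mask)
print(padded_labels)
```

Padding per batch rather than per dataset means a batch of short clips is not padded out to the length of the longest clip in the whole dataset.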

Unlike other collators, this specific data collator needs to apply a different padding method to input_values and labels:

In [None]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
  processor: AutoProcessor
  padding: Union[bool, str] = "longest"

  def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
    # split inputs and labels since they have to be of different lengths and need
    # different padding methods
    input_features = [{"input_values": feature["input_values"][0]} for feature in features]
    label_features = [{"input_ids": feature["labels"]} for feature in features]

    batch = self.processor.pad(input_features, padding = self.padding, return_tensors = "pt")

    labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

    # replace padding with -100 to ignore loss correctly
    labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

    batch["labels"] = labels

    return batch

In [None]:
# Now instantiate the DataCollatorCTCWithPadding:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest")

# Evaluate
Including a metric during training is often helpful for evaluating your model's performance.

You can quickly load an evaluation method with the 🤗 Evaluate library (https://huggingface.co/docs/evaluate/index).

For this task, we use the word error rate (WER) metric (refer to the 🤗 Evaluate quick tour to learn more about loading and computing metrics): https://huggingface.co/docs/evaluate/a_quick_tour
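
For intuition, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. The sketch below is a hand-rolled illustration, not the implementation used by 🤗 Evaluate; `word_error_rate` is a hypothetical helper and assumes a non-empty reference.

```python
def word_error_rate(reference, prediction):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)  # assumes the reference has at least one word

reference = "I WOULD LIKE TO PAY MY BILL"
prediction = "I WOULD LIKE PAY MY BILLS"
print(word_error_rate(reference, prediction))
```

Here one word is dropped ("TO") and one is substituted ("BILL" becomes "BILLS"), so the score is 2 errors over 7 reference words. Lower is better, and a WER above 1 is possible when the prediction contains many insertions.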

In [None]:
import evaluate

wer = evaluate.load("wer")

In [None]:
# Create a function that passes your predictions and labels to compute() to calculate the WER:
import numpy as np

def compute_metrics(pred):
  pred_logits = pred.predictions
  pred_ids = np.argmax(pred_logits, axis=-1)

  pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

  pred_str = processor.batch_decode(pred_ids)
  label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

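  # Use a separate name: assigning to `wer` here would shadow the metric object
  # loaded above and raise UnboundLocalError.
  wer_score = wer.compute(predictions=pred_str, references=label_str)

  return {"wer": wer_score}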

Note: the compute_metrics function is ready to go; it will be used in the training process.
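
The function above takes the argmax over the logits for every audio frame and then lets the processor decode the ids. A rough sketch of that greedy CTC decoding step follows, using a hypothetical toy vocabulary in which id 0 plays the role of the blank token; `ctc_greedy_decode` is an illustrative helper, not the tokenizer's implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank token.
ID_TO_CHAR = {0: "", 1: "H", 2: "E", 3: "L", 4: "O"}
BLANK_ID = 0

def ctc_greedy_decode(frame_ids, group_tokens=True):
    """Collapse consecutive repeats (optional), then drop blanks and map ids to characters."""
    if group_tokens:
        frame_ids = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(ID_TO_CHAR[i] for i in frame_ids if i != BLANK_ID)

# Per-frame argmax ids: a blank between the two L frames keeps the double letter.
predicted = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
print(ctc_greedy_decode(predicted))

# Label ids have one id per character, so grouping would wrongly merge "LL".
label = [1, 2, 3, 3, 4]
print(ctc_greedy_decode(label, group_tokens=False))
print(ctc_greedy_decode(label, group_tokens=True))
```

This is why the labels are decoded with `group_tokens=False`: they are already one id per character, so collapsing repeats would remove legitimate double letters.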

# Train

Basic Tutorial on Fine-tuning a model with the PyTorch Trainer: https://huggingface.co/docs/transformers/en/training#train-with-pytorch-trainer

In [None]:
# Fine-tuning steps
# 1. Load Wav2Vec2 with AutoModelForCTC.
# 2. Specify the reduction to apply with the ctc_loss_reduction parameter.
from transformers import AutoModelForCTC, TrainingArguments, Trainer

model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction = "mean",
    pad_token_id = processor.tokenizer.pad_token_id,
)

In [None]:
# Define the training hyperparameters
# more info: https://huggingface.co/docs/transformers/v4.52.3/en/main_classes/trainer#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir = "my_asr_mind_model", # output directory
    per_device_train_batch_size = 8, # batch size per accelerator core/CPU in training
    gradient_accumulation_steps = 2, # Number of update steps to accumulate the gradients for before performing a backward/update pass.
    learning_rate=1e-5, #The initial learning rate for AdamW optimizer.
    warmup_steps= 500, # Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.
    max_steps=2000, # If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached
    gradient_checkpointing=True, #If True, use gradient checkpointing to save memory at the expense of slower backward pass.
    fp16=True, # Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
    group_by_length=True, # Whether or not to group together samples of roughly the same length in the training dataset (to minimize padding applied and be more efficient). Only useful if applying dynamic padding.
    eval_strategy="steps", # The evaluation strategy to adopt during training. Possible values are: "no": No evaluation is done during training. "steps": Evaluation is done (and logged) every eval_steps. "epoch": Evaluation is done at the end of each epoch.
    per_device_eval_batch_size=8, # batch size per accelerator core/CPU in evaluation
    save_steps = 1000, # Number of update steps between two checkpoint saves if save_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
    eval_steps = 1000, # Number of update steps between two evaluations if eval_strategy="steps". Will default to the same value as logging_steps if not set. Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
    logging_steps = 25, #Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
    load_best_model_at_end = True, # Whether or not to load the best model found during training at the end of training. When this option is enabled, the best checkpoint will always be saved. See save_total_limit for more.
    metric_for_best_model = "wer", # Use in conjunction with load_best_model_at_end to specify the metric to use to compare two different models. Must be the name of a metric returned by the evaluation with or without the prefix "eval_"
    greater_is_better = False, # Use in conjunction with load_best_model_at_end and metric_for_best_model to specify if better models should have a greater metric or not. Will default to: True if metric_for_best_model is set to a value that doesn’t end in "loss". False if metric_for_best_model is not set, or set to a value that ends in "loss".
    push_to_hub = True, # Whether or not to push the model to the Hub every time the model is saved. If this is activated, output_dir will begin a git directory synced with the repo (determined by hub_model_id) and the content will be pushed each time a save is triggered (depending on your save_strategy).
)
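
To get a feel for what these numbers imply, the arithmetic can be sketched in plain Python. The sketch assumes a single training device, the 80 training examples produced by the split earlier, and the default linear learning-rate scheduler; `linear_schedule_lr` is a hypothetical helper that mirrors a linear warmup followed by a linear decay.

```python
import math

# Values taken from this notebook; a single training device is assumed.
num_train_examples = 80  # 100 examples minus the 20% test split
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
max_steps, warmup_steps, peak_lr = 2000, 500, 1e-5

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
updates_per_epoch = math.ceil(num_train_examples / effective_batch_size)
passes_over_data = max_steps / updates_per_epoch

def linear_schedule_lr(step):
    """Linear warmup to peak_lr, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

print(effective_batch_size, updates_per_epoch, passes_over_data)
for step in (0, 250, 500, 1250, 2000):
    print(step, linear_schedule_lr(step))
```

With only 80 training examples, 2000 update steps correspond to roughly 400 passes over the data, so on a subset this small you may want to lower `max_steps` (and `warmup_steps` with it) when experimenting.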

In [None]:
trainer = Trainer(
    model = model, #the model to train, evaluate or use for predictions. If not provided, a model_init must be passed.
    args = training_args, # The arguments to tweak for training. Will default to a basic instance of TrainingArguments with the output_dir set to a directory named tmp_trainer in the current directory if not provided.
    train_dataset=encoded_minds["train"], # The dataset to use for training. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed.
    eval_dataset = encoded_minds["test"], # The dataset to use for evaluation. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. If it is a dictionary, it will evaluate on each dataset prepending the dictionary key to the metric name.
    processing_class=processor, # Processing class used to process the data. If provided, will be used to automatically process the inputs for the model, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model. This supersedes the tokenizer argument, which is now deprecated.
    data_collator=data_collator, # The function to use to form a batch from a list of elements of train_dataset or eval_dataset. Will default to default_data_collator() if no processing_class is provided, an instance of DataCollatorWithPadding otherwise if the processing_class is a feature extractor or tokenizer.
    compute_metrics = compute_metrics, # The function that will be used to compute metrics at evaluation.
)

trainer.train()

In [None]:
# Once training is completed, share your model to the Hub with the push_to_hub() method so it can be accessible to everyone:

trainer.push_to_hub()

# Inference

We can now use the fine-tuned model for inference

In [None]:
# Load an audio file to run inference on
# Resample it to the sampling rate the model expects

from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
sampling_rate = dataset.features["audio"].sampling_rate
audio_file = dataset[0]["audio"]["path"]

The simplest way to try out the fine-tuned model for inference is to use it in a pipeline():

In [None]:
# Instantiate a pipeline for ASR with your model
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="CanerCo/my_asr_mind_model")
transcriber(audio_file)

## PyTorch Inference

In [None]:
# Load a processor to preprocess the audio file
# and return the input as PyTorch tensors:
import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("CanerCo/my_asr_mind_model")
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

# Pass your inputs to the model
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("CanerCo/my_asr_mind_model")
with torch.no_grad():
  logits = model(**inputs).logits

In [None]:
# Get the predicted input_ids with the highest probability, and use the processor to decode the predicted input_ids back into text:

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
transcription