# Automatic speech recognition tutorial

- This notebook studies the differences between speech-to-text models.

List of the advanced models:
- Whisper
- Wav2Vec2.0
- DeepSpeech
- Jasper
- QuartzNet

## Evaluation metrics
- Word Error Rate (WER)
- Latency
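
Both metrics can be illustrated without any ASR library. The sketch below is a minimal, assumption-laden version: `word_error_rate` computes a word-level edit distance divided by the reference length (no text normalization is applied), and `timed` is a hypothetical helper that measures the wall-clock time of a single call. It is not the implementation used by the `evaluate` package.

```python
import time

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def timed(fn, *args):
    """Return (result, elapsed_seconds) for one call of fn (hypothetical helper)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# One deleted word ("to") and one substituted word ("bill" -> "bills"): 2 errors / 7 words
score, seconds = timed(word_error_rate,
                       "i would like to pay my bill",
                       "i would like pay my bills")
print(f"WER={score:.3f}, latency={seconds:.6f}s")
```

For a real comparison, latency would be measured around the model's transcription call rather than around the metric itself.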


# Fine-tuning on the MInDS-14 dataset
https://huggingface.co/docs/transformers/main/tasks/asr
- MInDS-14 is a training and evaluation resource for intent detection from spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, with spoken examples in 14 diverse language varieties.
- While the dataset contains a lot of useful information, such as lang_id and english_transcription, this guide focuses on the audio and transcription columns.
- We will compare the performance of two models: Wav2Vec2.0 and Whisper.

## TODO
- Load data
- Preprocess data
- Load the pretrained models:
    - Wav2Vec2.0
    - Whisper
- Fine-tune the models
- Evaluate the models
- Compare the performance of the models
- Save the models
- Load the models
- Transcribe the audio files
- Compare the transcriptions
- Conclusion


In [1]:
from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")  # first 100 en-US examples
# minds.save_to_disk("minds14")

Found cached dataset minds14 (/home/tslab/phusaeng/.cache/huggingface/datasets/PolyAI___minds14/en-US/1.0.0/65c7e0f3be79e18a6ffaf879a083daf706312d421ac90d25718459cbf3c42696)


In [2]:
# split train and test set
minds = minds.train_test_split(test_size=0.2)
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 80
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 20
    })
})

In [3]:
from IPython.display import Audio as IPAudio

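# index the row first so only one audio file is decoded, and read the rate from the data
sample = minds["train"][0]
print(f'transcription: {sample["transcription"]}')
IPAudio(data=sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])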

transcription: hello I was just calling to see if I can make it a joint account with my wife thank you


In [4]:
minds["train"][0]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~JOINT_ACCOUNT/602b9a59bb1e6d0fbce91f51.wav',
 'audio': {'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~JOINT_ACCOUNT/602b9a59bb1e6d0fbce91f51.wav',
  'array': array([ 0.        ,  0.        ,  0.        , ...,  0.00048828,
          0.00073242, -0.00073242]),
  'sampling_rate': 8000},
 'transcription': 'hello I was just calling to see if I can make it a joint account with my wife thank you',
 'english_transcription': 'hello I was just calling to see if I can make it a joint account with my wife thank you',
 'intent_class': 11,
 'lang_id': 4}

In [5]:
# How are the intent classes distributed in each split?
from collections import Counter

# The split above is random, so these counts change between runs.
train_classes = minds["train"]["intent_class"]
test_classes = minds["test"]["intent_class"]

print(f'train: {Counter(train_classes)}')
print(f'test: {Counter(test_classes)}')

train: Counter({4: 37, 11: 30, 13: 13})
test: Counter({11: 12, 13: 4, 4: 4})


In [6]:
# load and preprocess the data
# we don't need the 'intent_class', 'lang_id' and 'english_transcription' columns
minds = minds.remove_columns(['intent_class', 'lang_id', 'english_transcription'])
train_set = minds['train']
test_set = minds['test']

In [7]:
test_set[0]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~PAY_BILL/602baa5805f96973d67944a9.wav',
 'audio': {'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~PAY_BILL/602baa5805f96973d67944a9.wav',
  'array': array([0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.00024414]),
  'sampling_rate': 8000},
 'transcription': "I'd like to make a payment on my credit card account"}

In [8]:
# preprocess the data
# the processor bundles a feature extractor (audio -> input_values) and a tokenizer (text -> labels)
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")



In [9]:
# The MInDS-14 dataset has a sampling rate of 8 kHz (8000 Hz; you can find this information in its dataset card),
# which means you’ll need to resample the dataset to 16 kHz (16000 Hz) to use the pretrained Wav2Vec2 model:
from datasets import Audio

minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds['train'][12]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~BALANCE/602b98135f67b421554f636c.wav',
 'audio': {'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~BALANCE/602b98135f67b421554f636c.wav',
  'array': array([-8.86260492e-06, -1.01137081e-05,  8.09526119e-06, ...,
          1.30570130e-04, -1.73291210e-05, -4.58946401e-05]),
  'sampling_rate': 16000},
 'transcription': 'I would like to know the amount in my current account'}
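
To make the effect of resampling concrete, here is a minimal sketch that upsamples a signal by linear interpolation. This is an illustration only: `resample_linear` is a hypothetical helper, it applies no anti-aliasing filter, and it is not the algorithm that `cast_column` uses internally. It only shows that going from 8 kHz to 16 kHz doubles the number of samples for the same duration.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample a list of floats by linear interpolation (illustration only)."""
    if not samples:
        return []
    n_out = round(len(samples) * target_rate / orig_rate)
    step = orig_rate / target_rate  # input samples advanced per output sample
    out = []
    for k in range(n_out):
        pos = k * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

one_second_8k = [0.0, 1.0] * 4000  # 8000 samples = 1 second at 8 kHz
upsampled = resample_linear(one_second_8k, 8_000, 16_000)
print(len(one_second_8k), len(upsampled))  # 8000 16000
```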

In [10]:
# The Wav2Vec2 tokenizer is only trained on uppercase characters 
# so you’ll need to make sure the text matches the tokenizer’s vocabulary:

def uppercase(example):
    return {"transcription": example["transcription"].upper()}

minds = minds.map(uppercase)
minds['train'][12]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~BALANCE/602b98135f67b421554f636c.wav',
 'audio': {'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~BALANCE/602b98135f67b421554f636c.wav',
  'array': array([-8.86260492e-06, -1.01137081e-05,  8.09526119e-06, ...,
          1.30570130e-04, -1.73291210e-05, -4.58946401e-05]),
  'sampling_rate': 16000},
 'transcription': 'I WOULD LIKE TO KNOW THE AMOUNT IN MY CURRENT ACCOUNT'}
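
The sketch below shows why the case matters for a character-level tokenizer. The vocabulary is an assumption made for illustration (four special tokens, a `|` word delimiter, A-Z and an apostrophe, 32 entries in total); the actual contents and id order of the `facebook/wav2vec2-base` vocabulary may differ. Any character outside the vocabulary maps to `<unk>`, so lowercase text would lose almost all of its information.

```python
import string

# Assumed character vocabulary (hypothetical order): specials, word delimiter, A-Z, apostrophe
VOCAB = ["<pad>", "<s>", "</s>", "<unk>", "|"] + list(string.ascii_uppercase) + ["'"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}
UNK_ID = TOKEN_TO_ID["<unk>"]

def encode(text):
    """Map each character to an id; spaces become the word delimiter."""
    return [TOKEN_TO_ID.get("|" if ch == " " else ch, UNK_ID) for ch in text]

def unknown_ratio(text):
    """Fraction of characters that fall outside the vocabulary."""
    ids = encode(text)
    return sum(i == UNK_ID for i in ids) / len(ids)

print(unknown_ratio("pay my bill"))  # all 9 letters are unknown, only the 2 spaces map
print(unknown_ratio("PAY MY BILL"))  # 0.0
```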

In [11]:
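def prepare_dataset(batch):
    # accessing the audio column decodes (and resamples) the audio file
    audio = batch["audio"]
    # the processor extracts input_values from the audio and tokenizes the transcription into labels
    batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
    # keep the input length so that samples can be grouped by length during training
    batch["input_length"] = len(batch["input_values"][0]) # dict_keys(['input_values', 'labels', 'input_length'])
    return batch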

In [12]:
# apply the preprocessing function to both splits with the Datasets map function (num_proc=4 uses four processes)
encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names['train'], num_proc=4)

In [13]:
minds['train'][12]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~BALANCE/602b98135f67b421554f636c.wav',
 'audio': {'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~BALANCE/602b98135f67b421554f636c.wav',
  'array': array([-8.86260492e-06, -1.01137081e-05,  8.09526119e-06, ...,
          1.30570130e-04, -1.73291210e-05, -4.58946401e-05]),
  'sampling_rate': 16000},
 'transcription': 'I WOULD LIKE TO KNOW THE AMOUNT IN MY CURRENT ACCOUNT'}

In [14]:
import torch

from dataclasses import dataclass
from typing import Dict, List

# the @dataclass decorator generates __init__, __repr__, and __eq__ from the annotated fields
@dataclass
class DataCollatorCTCWithPadding:
    processor: AutoProcessor
    padding: bool|str = "longest"

    def __call__(self, features: List[Dict[str, List[int]|torch.Tensor]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"][0]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels
        return batch
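
The padding logic of the collator can be sketched in plain Python. This is a simplified illustration, not the `processor.pad` implementation: `pad_batch` is a hypothetical helper that works on lists instead of tensors. Inputs are padded with `0.0`, while labels are padded with `-100`, the value the collator above uses so that padded positions are ignored by the loss.

```python
IGNORE_INDEX = -100  # label value used above to mark positions the loss should skip

def pad_batch(features, input_pad=0.0, ignore_index=IGNORE_INDEX):
    """Pad inputs and labels separately, each to the longest sequence in the batch."""
    max_input = max(len(f["input_values"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    return {
        "input_values": [
            f["input_values"] + [input_pad] * (max_input - len(f["input_values"]))
            for f in features
        ],
        "labels": [
            f["labels"] + [ignore_index] * (max_label - len(f["labels"]))
            for f in features
        ],
    }

# Toy batch: the first sample has the longer audio, the second the longer transcript
features = [
    {"input_values": [0.1, 0.2, 0.3, 0.4], "labels": [5, 6]},
    {"input_values": [0.5, 0.6], "labels": [7, 8, 9]},
]
batch = pad_batch(features)
print(batch["input_values"])  # [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.0, 0.0]]
print(batch["labels"])        # [[5, 6, -100], [7, 8, 9]]
```

Padding the two fields independently is the reason the collator splits them: audio length and transcript length are not aligned, so a single padding length would not fit both.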

In [15]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest")
print(data_collator)

DataCollatorCTCWithPadding(processor=Wav2Vec2Processor:
- feature_extractor: Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}

- tokenizer: Wav2Vec2CTCTokenizer(name_or_path='facebook/wav2vec2-base', vocab_size=32, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True), padding='longest')


In [16]:
# load the WER metric
import evaluate

wer = evaluate.load('wer')

In [17]:
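# compute the WER between the predictions and the ground-truth labels
import numpy as np

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    # restore the padding token where the collator inserted -100
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # labels are not frame-level, so repeated characters must not be merged
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    # use a new name: assigning to `wer` here would make it a local variable
    # and raise UnboundLocalError on the call to wer.compute
    wer_score = wer.compute(predictions=pred_str, references=label_str)

    return {"wer": wer_score}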
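
The `argmax` followed by `batch_decode` above amounts to greedy CTC decoding. The sketch below reproduces the idea in plain Python under stated assumptions: a hypothetical 3-token vocabulary, and id 0 acting as the CTC blank (the role the pad token plays for this model). It is an illustration, not the tokenizer's actual decoding code.

```python
from itertools import groupby

def ctc_greedy_decode(logits, id_to_token, blank_id=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # 1. pick the highest-scoring token id in every frame
    frame_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. merge runs of identical ids (one character usually spans several frames)
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # 3. remove the blank token, which separates genuine double letters
    return "".join(id_to_token[i] for i in merged if i != blank_id)

# Hypothetical 3-token vocabulary; id 0 plays the role of the CTC blank
ID_TO_TOKEN = {0: "<pad>", 1: "A", 2: "L"}

def one_hot(index, size=3):
    """Build a fake logit vector whose maximum is at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Frame-level winners: A A L L <blank> L  ->  "ALL"
logits = [one_hot(i) for i in [1, 1, 2, 2, 0, 2]]
print(ctc_greedy_decode(logits, ID_TO_TOKEN))  # ALL
```

This also shows why the reference labels are decoded with `group_tokens=False`: merging repeats in a ground-truth transcript would turn "ALL" into "AL".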

In [18]:
from transformers import AutoModelForCTC, TrainingArguments, Trainer

model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForCTC: ['quantizer.weight_proj.bias', 'project_hid.weight', 'project_q.bias', 'project_hid.bias', 'project_q.weight', 'quantizer.weight_proj.weight', 'quantizer.codevectors']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['lm_head.bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predicti

In [24]:
training_args = TrainingArguments(
    output_dir="my_awesome_asr_mind_model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=2000,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,  # requires logging in to the Hugging Face Hub first (see notebook_login)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    tokenizer=processor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # needed so that metric_for_best_model="wer" is available
)
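
It helps to translate these arguments into epochs and learning rates. The sketch below is back-of-the-envelope arithmetic, assuming a single device, the 80 training examples from the split above, and a linear warmup followed by linear decay to zero (the scheduler shape is an assumption here, since the arguments above do not set it explicitly). Both functions are hypothetical helpers, not part of `transformers`.

```python
import math

def schedule_summary(num_examples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Rough training-length arithmetic for a step-based schedule."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "epochs": max_steps / steps_per_epoch,
    }

def linear_schedule_lr(step, peak_lr, warmup_steps, max_steps):
    """Linear warmup to peak_lr, then linear decay to zero (assumed scheduler shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

summary = schedule_summary(num_examples=80, per_device_batch=8, grad_accum=2, max_steps=2000)
print(summary)  # {'effective_batch': 16, 'steps_per_epoch': 5, 'epochs': 400.0}

for step in (0, 250, 500, 1250, 2000):
    print(step, linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=500, max_steps=2000))
```

Under these assumptions, 2000 optimizer steps correspond to roughly 400 passes over the 80 training examples, and the first quarter of training is spent in warmup.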

In [21]:
trainer.train()

In [23]:
from huggingface_hub import notebook_login

notebook_login()
