<a href="https://colab.research.google.com/github/Ingasha-Sharon/Automatic-Speech/blob/main/Automatic_Speech_recognition_(ASR).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Many virtual assistants, such as Siri and Alexa, use ASR models to help users every day, and there are many other applications, such as live captioning and note-taking during meetings. Let's fine-tune the Wav2Vec2-base model, which was pretrained on speech audio sampled at 16 kHz, on a labeled automatic speech recognition dataset (MInDS-14). We need to make sure the speech input is also sampled at 16 kHz.

In [None]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1
!pip install jiwer==3.0.3

In [None]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

# If the login fails, check the Hugging Face service status
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning Wav2Vec2-base"
os.environ["WANDB_NOTES"] = "Fine-tune facebook/wav2vec2-base on MInDS-14"
os.environ["WANDB_NAME"] = "ft-wav2vec2-with-minds-asr"

# Loading MInDS-14

Let's load a smaller subset of the MInDS-14 dataset.

In [None]:
from datasets import load_dataset, Audio

minds=load_dataset("PolyAI/minds14", name="en-US", split="train[:500]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Split the dataset's train split into a train and a test set with the `train_test_split` method:

In [None]:
minds=minds.train_test_split(test_size=0.2)
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 400
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 100
    })
})
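As a rough illustration of what a 80/20 split does, the sketch below shuffles row indices and cuts them into two parts. It is not how `datasets` implements `train_test_split`; `split_indices` and its `seed` argument are hypothetical names used only for this example.

```python
import random


def split_indices(num_rows, test_size, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(num_rows))
    # A seeded generator makes the shuffle reproducible
    random.Random(seed).shuffle(indices)
    num_test = round(num_rows * test_size)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(500, test_size=0.2)
print(len(train_idx), len(test_idx))  # 400 100
```

Because the notebook's call does not fix a seed, the rows that end up in each split can differ between runs; only the 400/100 sizes stay the same.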

Let's focus on the audio and transcription, and remove the other columns with the `remove_columns` method.

In [None]:
minds=minds.remove_columns(["english_transcription", "intent_class", "lang_id"])

In [None]:
minds["train"][0]

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/28aa727f91fee90575c34956bab09d1716cfaf460c6afcba86a10f04a7d58b83/en-US~CASH_DEPOSIT/602b9a59963e11ccd901cbce.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/28aa727f91fee90575c34956bab09d1716cfaf460c6afcba86a10f04a7d58b83/en-US~CASH_DEPOSIT/602b9a59963e11ccd901cbce.wav',
  'array': array([ 0.        ,  0.        , -0.00024414, ..., -0.00024414,
         -0.00024414, -0.00024414]),
  'sampling_rate': 8000},
 'transcription': 'how do I deposit money into my account'}

There are two fields of interest:

* `audio`: a dictionary whose `array` entry is a 1-dimensional array of the speech signal; accessing it loads and resamples the audio file.
* `transcription`: the target text.

# Preprocess

Let's load a Wav2Vec2 processor with `AutoProcessor` to process the audio signal. Multimodal tasks require a processor that combines two types of preprocessing tools: here, a feature extractor for the audio and a tokenizer for the text.

In [None]:
from transformers import AutoProcessor

processor=AutoProcessor.from_pretrained("facebook/wav2vec2-base")

Let's resample the dataset to 16 kHz (16,000 Hz) to match the sampling rate the pretrained Wav2Vec2 model expects:

In [None]:
minds=minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/28aa727f91fee90575c34956bab09d1716cfaf460c6afcba86a10f04a7d58b83/en-US~CASH_DEPOSIT/602b9a59963e11ccd901cbce.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/28aa727f91fee90575c34956bab09d1716cfaf460c6afcba86a10f04a7d58b83/en-US~CASH_DEPOSIT/602b9a59963e11ccd901cbce.wav',
  'array': array([ 2.08798738e-05,  8.13893494e-05, -2.02545234e-05, ...,
         -2.77515530e-04, -2.29573634e-04, -1.21373429e-04]),
  'sampling_rate': 16000},
 'transcription': 'how do I deposit money into my account'}
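To build intuition for what going from 8 kHz to 16 kHz means, here is a minimal sketch of upsampling by linear interpolation. It is an illustration only: the `datasets` library relies on a proper resampling filter rather than this naive method, and `resample_linear` is a hypothetical helper.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Naively resample a 1-D signal by linear interpolation (illustration only)."""
    if not samples:
        return []
    # Number of output samples that covers the same duration
    n_out = int(len(samples) * target_rate / orig_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i on the input time axis
        pos = i * orig_rate / target_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


one_second_8k = [0.0] * 8000
upsampled = resample_linear(one_second_8k, 8_000, 16_000)
print(len(one_second_8k), len(upsampled))  # 8000 16000
```

The same one second of audio is now described by twice as many samples, which is why the `array` values in the output above differ from the original 8 kHz values.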

As we can see in the `transcription` above, the text contains a mix of uppercase and lowercase characters. The Wav2Vec2 **tokenizer** was trained only on uppercase characters, so we need to make sure the text matches the tokenizer's vocabulary:

In [None]:
def uppercase(example):
    return {"transcription": example["transcription"].upper()}

minds=minds.map(uppercase)
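The sketch below shows why the casing matters for a character-level vocabulary. The vocabulary here is a toy one made up for illustration (it is not the actual Wav2Vec2 vocabulary file), and `encode` is a hypothetical helper: any character missing from the vocabulary falls back to the unknown token.

```python
# Toy vocabulary (an assumption for illustration): special tokens, a word
# delimiter, and uppercase letters only.
TOY_VOCAB = {"<pad>": 0, "<unk>": 1, "|": 2}
for offset, letter in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ'"):
    TOY_VOCAB[letter] = 3 + offset


def encode(text):
    """Map each character to an id; spaces become the word delimiter '|'."""
    ids = []
    for ch in text:
        token = "|" if ch == " " else ch
        ids.append(TOY_VOCAB.get(token, TOY_VOCAB["<unk>"]))
    return ids


lower_ids = encode("how do i")
upper_ids = encode("how do i".upper())
print(lower_ids.count(TOY_VOCAB["<unk>"]))  # 6: every lowercase letter is unknown
print(upper_ids.count(TOY_VOCAB["<unk>"]))  # 0
```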

Let's create a preprocessing function that:
1. Calls the `audio` column to load and resample the audio file.
2. Extracts the `input_values` from the audio file and tokenizes the `transcription` column with the processor.

In [None]:
def prepare_dataset(batch):
    audio=batch["audio"]
    batch=processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
    batch["input_length"]=len(batch["input_values"][0])
    return batch

encoded_minds=minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)
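The feature extractor used in this notebook reports `"do_normalize": true` in its printed configuration. A minimal sketch of zero-mean, unit-variance normalization, which is what that setting refers to, is shown below. `normalize` is a hypothetical helper and the small `eps` value is an assumption, not a value taken from this notebook.

```python
import math


def normalize(samples, eps=1e-7):
    """Scale a waveform to zero mean and unit variance (illustrative helper)."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    # eps guards against division by zero on silent audio
    return [(s - mean) / math.sqrt(var + eps) for s in samples]


normalized = normalize([0.1, 0.3, 0.5, 0.7])
new_mean = sum(normalized) / len(normalized)
new_var = sum((s - new_mean) ** 2 for s in normalized) / len(normalized)
print(new_mean, new_var)  # close to 0 and close to 1
```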

We need a data collator that assembles examples into a batch. Because the audio inputs and the labels have different lengths and need different padding methods, we adapt `DataCollatorWithPadding` into a custom collator. It dynamically pads the inputs and labels to the length of the longest element in each batch (instead of the longest in the entire dataset) so that they have a uniform length. While it is possible to pad during preprocessing by setting `padding=True`, dynamic padding is more efficient.

In [None]:
import torch
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class DataCollatorCTCWithPadding:
    processor: AutoProcessor
    padding: Union[bool, str]="longest"

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Split inputs and labels since they have to be of different lengths and need different padding methods
        input_features=[{"input_values": feature["input_values"][0]} for feature in features]
        label_features=[{"input_ids": feature["labels"]} for feature in features]

        batch=self.processor.pad(input_features, padding=self.padding, return_tensors="pt")

        labels_batch=self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels=labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"]=labels

        return batch
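The padding logic of the collator above can be sketched in plain Python, without tensors. This is a simplified illustration: `pad_batch` is a hypothetical helper, and it writes `-100` into the labels directly, whereas the collator first pads with the tokenizer's pad token and then masks those positions.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]


# Audio inputs are padded with 0.0, labels with -100 so the loss skips them
inputs = pad_batch([[0.1, 0.2, 0.3], [0.4]], pad_value=0.0)
labels = pad_batch([[7, 8], [9, 10, 11, 12]], pad_value=-100)
print(inputs)  # [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0]]
print(labels)  # [[7, 8, -100, -100], [9, 10, 11, 12]]
```

Each batch is padded only to its own longest element, so a batch of short clips does not pay for the longest clip in the dataset.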

In [None]:
data_collator=DataCollatorCTCWithPadding(processor=processor, padding="longest")
print(data_collator)

DataCollatorCTCWithPadding(processor=Wav2Vec2Processor:
- feature_extractor: Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}

- tokenizer: Wav2Vec2CTCTokenizer(name_or_path='facebook/wav2vec2-base', vocab_size=32, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	1: AddedToken("<s>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	2: AddedToken("</s>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	3: AddedToken

# Evaluate

Here we load an evaluation metric with the Evaluate library. For this task, load the word error rate (WER) metric:

In [None]:
import evaluate
import numpy as np

wer_eva=evaluate.load("wer")


def compute_metrics(pred):
    pred_logits=pred.predictions
    pred_ids=np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids==-100]=processor.tokenizer.pad_token_id

    pred_str=processor.batch_decode(pred_ids)
    label_str=processor.batch_decode(pred.label_ids, group_tokens=False)

    wer=wer_eva.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

The `compute_metrics` function above passes the predictions and labels to `compute` to calculate the WER.
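For intuition about what the metric measures, here is a minimal sketch of WER as a word-level edit distance divided by the number of reference words. The notebook itself uses `evaluate.load("wer")`; `word_error_rate` below is a hypothetical stand-alone helper, not that library's implementation.

```python
def word_error_rate(references, predictions):
    """WER = word-level edit distance / number of reference words, over a corpus."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Classic dynamic-programming edit distance, one row at a time
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_errors += prev[-1]
        total_words += len(ref_words)
    return total_errors / total_words


print(word_error_rate(["HOW DO I DEPOSIT MONEY"], ["HOW DO I DEPOSIT MONKEY"]))  # 0.2
```

A lower value is better, which is why the training arguments later set `greater_is_better=False` for this metric.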

# Training

In [None]:
from transformers import AutoModelForCTC, TrainingArguments, Trainer

model=AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

print(model.config)

In [None]:
training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    warmup_steps=40,
    max_steps=80,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    save_steps=40,
    eval_steps=20,
    logging_steps=30,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
    push_to_hub=False,
)

trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    tokenizer=processor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
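The arithmetic behind these settings can be sketched as follows. The sketch assumes a single device and a linear warmup followed by linear decay, which is the `Trainer` default schedule as far as I know; both helper functions are hypothetical and only illustrate the numbers.

```python
def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Examples consumed per optimizer step."""
    return per_device * accumulation_steps * num_devices


def linear_warmup_decay_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


batch = effective_batch_size(per_device=4, accumulation_steps=4)
print(batch)             # 16
print(80 * batch / 400)  # 3.2 passes over the 400 training examples
for step in (0, 20, 40, 60, 80):
    print(step, linear_warmup_decay_lr(step, 1e-5, warmup_steps=40, total_steps=80))
```

Under these assumptions, half of the 80 optimizer steps are spent warming up, so the model sees the peak learning rate of `1e-5` only around step 40.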

In [None]:
eval_results=trainer.evaluate()
# Report the evaluation loss and word error rate
print(f"Eval loss: {eval_results['eval_loss']:.2f}")
print(f"Eval WER: {eval_results['eval_wer']:.2f}")

In [None]:
kwargs={
    'model_name': f'{os.getenv("WANDB_NAME")}',
    'finetuned_from': "facebook/wav2vec2-base",
    'tasks': 'automatic-speech-recognition',
    'dataset':'PolyAI/minds14'
}

processor.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(**kwargs)

# Inference

Load an audio file from the dataset, and remember to resample it to match the sampling rate the model expects.

In [None]:
from datasets import load_dataset, Audio

dataset=load_dataset("PolyAI/minds14", "en-US", split="train")
dataset=dataset.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate=dataset.features["audio"].sampling_rate
audio_file=dataset[0]["audio"]["path"]

In [None]:
from transformers import pipeline

transcriber=pipeline("automatic-speech-recognition", model=os.getenv("WANDB_NAME"))
transcriber(audio_file)

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 55bb623 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'text': 'I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT'}

## With PyTorch

In [None]:
from transformers import AutoProcessor

processor=AutoProcessor.from_pretrained(os.getenv("WANDB_NAME"))
inputs=processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

In [None]:
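import torch
from transformers import AutoModelForCTC

model=AutoModelForCTC.from_pretrained(os.getenv("WANDB_NAME"))
# Run the forward pass without tracking gradients
with torch.no_grad():
    logits=model(**inputs).logits

In [None]:
# Pick the most likely token for each frame, then decode the ids to text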
predicted_ids=torch.argmax(logits, dim=-1)
transcription=processor.batch_decode(predicted_ids)
transcription
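The argmax-then-decode step can be sketched as greedy CTC decoding: collapse consecutive repeated ids, then drop the blank token. The vocabulary and frame ids below are made up for illustration, `ctc_greedy_decode` is a hypothetical helper, and treating id `0` (the pad token) as the CTC blank is an assumption of this sketch.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then drop blank tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [id_to_char[token_id] for token_id in collapsed if token_id != blank_id]
    # '|' is used here as the word delimiter
    return "".join(chars).replace("|", " ").strip()


# Hypothetical toy vocabulary and per-frame argmax ids
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "L", 5: "O", 6: "E"}
frames = [2, 2, 0, 6, 4, 4, 0, 4, 5, 5, 1, 2, 3, 3]
print(ctc_greedy_decode(frames, toy_vocab))  # HELLO HI
```

The blank between the two runs of `L` is what lets the decoder keep a genuine double letter. It also explains why `compute_metrics` decodes the reference labels with `group_tokens=False`: labels are already final text, so repeated characters in them must not be collapsed.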