# **Fine-tuning Wav2Vec2 for English ASR with 🤗 Transformers**

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau.

Using a novel contrastive pretraining objective, Wav2Vec2 learns powerful speech representations from more than 50,000 hours of unlabeled speech. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/wav2vec2.png)

For the first time, it has been shown that pretraining followed by fine-tuning on very little labeled speech data achieves results competitive with state-of-the-art ASR systems. Using as little as 10 minutes of labeled data, Wav2Vec2 yields a word error rate (WER) of less than 5% on the clean test set of [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) (*cf.* Table 9 of the [paper](https://arxiv.org/pdf/2006.11477.pdf)).

In [1]:
import os
# Uncomment to make CUDA kernel launches synchronous, which gives clearer error traces when debugging
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
# !pip install librosa


Wav2Vec2 is fine-tuned using Connectionist Temporal Classification (CTC), an algorithm for training neural networks on sequence-to-sequence problems that is used mainly in automatic speech recognition and handwriting recognition.
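
To make the CTC objective more concrete, here is a minimal brute-force sketch. It is not how any library implements CTC; it only illustrates the idea that the probability of a target string is the sum, over every frame-level path that collapses to it, of the product of the per-frame probabilities. The toy alphabet, the `collapse` and `ctc_probability` helpers, and the uniform frame probabilities are all assumptions made for illustration.

```python
from itertools import groupby, product

BLANK = "_"  # hypothetical blank symbol for this toy example


def collapse(path):
    """Apply the CTC collapsing rule: merge repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != BLANK)


def ctc_probability(frame_probs, target):
    """Sum the probability of every frame-level path that collapses to `target`.

    `frame_probs` holds one dict per frame, mapping symbol -> probability.
    This brute force is exponential in the number of frames; practical
    implementations use dynamic programming instead.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            prob = 1.0
            for frame, symbol in zip(frame_probs, path):
                prob *= frame[symbol]
            total += prob
    return total


# Three frames with a uniform distribution over {blank, a, b} in each frame.
uniform = {BLANK: 1 / 3, "a": 1 / 3, "b": 1 / 3}
p_ab = ctc_probability([uniform] * 3, "ab")
print(p_ab)  # 5 of the 27 possible paths collapse to "ab"
```

Training with CTC maximizes this summed probability for the reference transcription, which is why no frame-level alignment between audio and text is needed.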


First, let's try to get a good GPU in our Colab! With the free version of Google Colab it is sadly becoming much harder to get access to a good GPU. With Google Colab Pro, however, it is much easier to get access to a V100 or P100 GPU.

In [2]:
gpu_info = !nvidia-smi

gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sat Dec 10 22:49:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Before we start, let's install pinned versions of `datasets` and `transformers`. We also need the `librosa` package to load audio files and the `jiwer` package to evaluate our fine-tuned model using the [word error rate (WER)](https://huggingface.co/metrics/wer) metric. The same cell installs `torchaudio_augmentations` and `audio_augmentations`, which are imported further below.

In [3]:
%%capture

!pip install datasets==1.18.3
!pip install transformers==4.17.0
!pip install jiwer
!pip install librosa
!pip install torchaudio_augmentations
!pip install audio_augmentations

## Prepare Data, Tokenizer, Feature Extractor

ASR models transcribe speech to text, which means that we need both a feature extractor that processes the speech signal into the model's input format, *e.g.* a feature vector, and a tokenizer that converts the model's output into text.

In 🤗 Transformers, the Wav2Vec2 model is thus accompanied by both a tokenizer, called [Wav2Vec2CTCTokenizer](https://huggingface.co/transformers/master/model_doc/wav2vec2.html#wav2vec2ctctokenizer), and a feature extractor, called [Wav2Vec2FeatureExtractor](https://huggingface.co/transformers/master/model_doc/wav2vec2.html#wav2vec2featureextractor).

Let's start by creating the tokenizer responsible for decoding the model's predictions.

### Create Wav2Vec2CTCTokenizer

First, we load the dataset and take a look at its structure.

In [4]:
from google.colab import drive # Link your drive if you are a colab user
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
AUDIO_ROOT = '/content/drive/MyDrive/IDL/IDL_FINAL_DATASET/audio/*' # dir for audios efs/wav_headMic efs/audio_sample
TXT_ROOT = '/content/drive/MyDrive/IDL/IDL_FINAL_DATASET/texts/*'  # dir for transcriptions efs/prompts efs/txts

In [6]:
# AUDIO_ROOT = '/home/ubuntu/efs/audio_sample/*' # dir for audios efs/wav_headMic efs/audio_sample
# TXT_ROOT = '/home/ubuntu/efs/prompts/*'  # dir for transcriptions efs/prompts efs/txts

In [7]:
# # Sanity check

# import os
# aud = os.listdir(AUDIO_ROOT)
# tt = os.listdir(TXT_ROOT)
# # len(aud), len(tt)

The following cell was used to remove unwanted files from our dataset. Uncomment it if you need it.

In [8]:
# import os

# # path to the folder
# # folder_path = "your/folder/path"
# folder_path = TXT_ROOT

# # get a list of all files in the folder
# files = os.listdir(folder_path)

# # loop through the files
# for file in files:
#   # check if the file name starts with a digit
#   if file[0].isdigit():
#     # if it does, delete the file
#     os.remove(os.path.join(folder_path, file))


In [9]:
# len(os.listdir(AUDIO_ROOT)), len(os.listdir(TXT_ROOT))

In [10]:
import glob

import librosa
import pandas as pd
import torch
import torchaudio

from torchaudio_augmentations import (
    PitchShift, RandomApply, Compose, ComposeMany, Delay, Gain,
    HighPassFilter, LowPassFilter, PolarityInversion, RandomResizedCrop,
    Reverb, Reverse, HighLowPass, Noise,
)
from audio_augmentations import *

# Use the GPU when one is available
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


In [11]:
import os
audio_paths = glob.glob(AUDIO_ROOT)
txt_paths = glob.glob(TXT_ROOT)


## Creating a dataframe for all transcripts and their corresponding audios

In [12]:
###### Looping in texts

# audios = sorted(glob.glob(AUDIO_ROOT))
texts = sorted(glob.glob(TXT_ROOT))

In [13]:
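import re

# Punctuation to strip from the transcripts (raw string so the backslashes reach the regex engine unchanged)
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\[\]\_\/]'

sentences = []
for txt_path in texts:
  with open(txt_path, 'r') as transcript_file:
    # every line of a transcript file becomes one cleaned, lower-cased sentence
    for line in transcript_file:
      text = line.strip()
      text_cleaned = re.sub(chars_to_ignore_regex, '', text).lower() + " "
      sentences.append(text_cleaned)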

In [14]:
# Create a dataframe of the cleaned (pre-processed) transcripts
df = pd.DataFrame(sentences, columns=['text'])

In [15]:
df.head(10)

|   | text |
|---|------|
| 0 | say ahpeee repeatedly |
| 1 | say ahpeee repeatedly |
| 2 | say pahtahkah repeatedly |
| 3 | say eeepah repeatedly |
| 4 | relax your mouth in its normal position |
| 5 | stick |
| 6 | tear as in tear up that paper |
| 7 | except in the winter when the ooze or snow or ... |
| 8 | pat |
| 9 | up |


In [16]:
import librosa
import IPython.display as ipd
audio, sample_rate = librosa.load('/content/drive/MyDrive/IDL/IDL_FINAL_DATASET/audio/F04_S2_0249.wav')
ipd.Audio(audio, rate=sample_rate)

In [17]:
## Extracting all characters
def extract_all_chars(dataframe):
  all_text = " ".join(dataframe["text"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}

In [18]:
vocabs = extract_all_chars(df)

In [19]:
vocab_dict = {}
for i in vocabs['vocab']:
  for indx,j in enumerate(i):
    vocab_dict[j] = indx

In [20]:
vocab_dict["|"] = vocab_dict[' ']
del vocab_dict[' ']

In [21]:
# Add the unknown token at the end of the vocabulary
vocab_dict["[UNK]"] = len(vocab_dict)

In [22]:
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

40

Cool, now our vocabulary is complete and consists of 40 tokens, which means that the linear layer that we will add on top of the pretrained Wav2Vec2 checkpoint needs an output dimension of at least 40, one logit per token.
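
The vocabulary steps above can be condensed into one function. The sketch below is an illustrative variant, not the exact code of this notebook: it sorts the characters so that the ids are reproducible across runs (the iteration order of a `set` of strings can change between Python sessions), and the `build_vocab` name and the toy sentences are assumptions.

```python
def build_vocab(sentences):
    """Map every character in `sentences` to an integer id for a CTC tokenizer."""
    # Sorting makes the ids reproducible across runs (set order is not).
    chars = sorted(set("".join(sentences)))
    vocab = {char: idx for idx, char in enumerate(chars)}
    # Use a visible word delimiter instead of the space character.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Append the special tokens at the end of the vocabulary.
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


toy_vocab = build_vocab(["say pat ", "up "])
print(toy_vocab)
```

Note that switching to sorted ids would change the label ids shown in the outputs of this notebook, so treat it as an option for new runs rather than a drop-in replacement.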

In [23]:
# Saving the vocabulary as a json file.
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

### Creating a Wav2Vec2CTCTokenizer

In [24]:
# Using the json file to instantiate an object of the Wav2Vec2CTCTokenizer class.
from transformers import Wav2Vec2CTCTokenizer
tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

## Setting up the trainer

### Create Wav2Vec2FeatureExtractor

In [25]:
from transformers import Wav2Vec2FeatureExtractor
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

In [26]:
from transformers import Wav2Vec2Processor
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
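
With `do_normalize=True`, the feature extractor rescales each waveform to zero mean and unit variance before it reaches the model. The plain-Python sketch below shows that operation on a toy list of samples; the `zero_mean_unit_variance` helper and the small `eps` stabilizer are assumptions for illustration and are not the library's own code.

```python
import math
from statistics import fmean, pvariance


def zero_mean_unit_variance(samples, eps=1e-7):
    """Shift `samples` to zero mean and scale them to (approximately) unit variance."""
    mean = fmean(samples)
    # eps guards against division by zero for silent (constant) clips.
    std = math.sqrt(pvariance(samples, mu=mean) + eps)
    return [(sample - mean) / std for sample in samples]


normalized = zero_mean_unit_variance([0.5, 1.0, 1.5, 2.0])
print(normalized)
```

Normalizing each clip this way makes the input scale independent of recording volume.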

In [27]:
labels = []
for txt in df["text"]:
  with processor.as_target_processor():
      labels.append(processor(txt).input_ids)

In [28]:
df["labels"] = labels

In [29]:
df.head()

|   | text | labels |
|---|------|--------|
| 0 | say ahpeee repeatedly | [33, 26, 23, 18, 26, 29, 14, 24, 24, 24, 18, 3... |
| 1 | say ahpeee repeatedly | [33, 26, 23, 18, 26, 29, 14, 24, 24, 24, 18, 3... |
| 2 | say pahtahkah repeatedly | [33, 26, 23, 18, 14, 26, 29, 32, 26, 29, 9, 26... |
| 3 | say eeepah repeatedly | [33, 26, 23, 18, 24, 24, 24, 14, 26, 29, 18, 3... |
| 4 | relax your mouth in its normal position | [34, 24, 8, 26, 30, 18, 23, 17, 37, 34, 18, 20... |


In [30]:
audios = sorted(glob.glob(AUDIO_ROOT))
texts = sorted(glob.glob(TXT_ROOT))

df["audio_paths"] = audios
df["txt_paths"] = texts
df.head()

|   | text | labels | audio_paths | txt_paths |
|---|------|--------|-------------|-----------|
| 0 | say ahpeee repeatedly | [33, 26, 23, 18, 26, 29, 14, 24, 24, 24, 18, 3... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/a... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/t... |
| 1 | say ahpeee repeatedly | [33, 26, 23, 18, 26, 29, 14, 24, 24, 24, 18, 3... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/a... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/t... |
| 2 | say pahtahkah repeatedly | [33, 26, 23, 18, 14, 26, 29, 32, 26, 29, 9, 26... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/a... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/t... |
| 3 | say eeepah repeatedly | [33, 26, 23, 18, 24, 24, 24, 14, 26, 29, 18, 3... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/a... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/t... |
| 4 | relax your mouth in its normal position | [34, 24, 8, 26, 30, 18, 23, 17, 37, 34, 18, 20... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/a... | /content/drive/MyDrive/IDL/IDL_FINAL_DATASET/t... |


In [31]:
import os

input_values = []
input_lengths = []
labs = []

for i in range(len(df)):
  file_name = os.path.basename(df["audio_paths"][i])

  # Keep only short utterances (at most 32 label ids) whose file name starts with "FC" or "MC"
  if len(df["labels"][i]) <= 32 and file_name.startswith(("FC", "MC")):
    # sr=None keeps the file's native sampling rate; the feature extractor above is configured for 16 kHz
    audio, _ = librosa.load(df["audio_paths"][i], sr=None)

    input_values.append(audio)
    input_lengths.append(len(audio))
    labs.append(df["labels"][i])

In [32]:
# import os
# audio_paths = sorted(glob.glob(AUDIO_ROOT))
# txt_paths = glob.glob(TXT_ROOT)
# input_values = []
# input_lengths = []
# labs = []
# for i in range(len(audio_paths)): 

#     audio_f = librosa.load(audio_paths[i], sr=None)
#     input_values.append(audio_f[0])
#     input_lengths.append(len(audio_f[0]))
#     labs.append(df["labels"][i])

In [33]:
librosa.load(audio_paths[0], sr=None)[0].shape

(73600,)

In [34]:
librosa.load(audio_paths[0], sr=4000)[0]


array([-0.01657145, -0.01822065, -0.00821408, ..., -0.01123643,
        0.00030647, -0.00368182], dtype=float32)

In [35]:
from datasets.dataset_dict import DatasetDict
from datasets import Dataset
import numpy as np

In [36]:
len(input_values), len(labs)

(5331, 5331)

In [37]:
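# The clips and label sequences have different lengths, so the arrays are created with dtype=object explicitly
train = {"input_values":np.array(input_values[:5000], dtype=object), "input_length":input_lengths[:5000], "labels":np.array(labs[:5000], dtype=object)}
train_d = Dataset.from_dict(train)

# for test
test = {"input_values":np.array(input_values[5000:], dtype=object), "input_length":input_lengths[5000:], "labels":np.array(labs[5000:], dtype=object)}
test_d = Dataset.from_dict(test)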



In [38]:
# train = {"input_values":np.array(input_values), "input_length":input_lengths, "labels":np.array(labs)}
# train_d = Dataset.from_dict(train)

# # for test
# test = {"input_values":np.array(input_values), "input_length":input_lengths, "labels":np.array(labs)}
# test_d = Dataset.from_dict(test)

In [39]:
dta_dict = DatasetDict({"train": train_d, "test":test_d})
dta_dict

DatasetDict({
    train: Dataset({
        features: ['input_values', 'input_length', 'labels'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['input_values', 'input_length', 'labels'],
        num_rows: 331
    })
})

In [40]:
max_input_length_in_sec = 4.0
max_input_length = max_input_length_in_sec * processor.feature_extractor.sampling_rate

# Drop clips of 4 seconds or longer from both splits
dta_dict["train"] = dta_dict["train"].filter(lambda x: x < max_input_length, input_columns=["input_length"])
dta_dict["test"] = dta_dict["test"].filter(lambda x: x < max_input_length, input_columns=["input_length"])



  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]
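
The length filter above compares a number of samples against a duration in seconds multiplied by the sampling rate. The sketch below spells out that arithmetic in plain Python; the `is_short_enough` helper and the clip lengths are hypothetical values chosen for illustration.

```python
SAMPLING_RATE = 16_000         # matches the feature extractor configured above
MAX_INPUT_LENGTH_IN_SEC = 4.0


def is_short_enough(num_samples, sampling_rate=SAMPLING_RATE, max_seconds=MAX_INPUT_LENGTH_IN_SEC):
    """Return True if a clip with `num_samples` samples is shorter than `max_seconds`."""
    return num_samples < max_seconds * sampling_rate


# Hypothetical clip lengths in samples; the limit is 4.0 * 16000 = 64000 samples.
lengths = [32_000, 63_999, 64_000, 73_600]
kept = [n for n in lengths if is_short_enough(n)]
print(kept)  # [32000, 63999]
```

The comparison only corresponds to seconds if the clips were actually loaded at the sampling rate used in the product.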

In [41]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

In [42]:
# show_random_elements(dta_dict["train"], num_examples=1)

In [43]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )
        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        
        return batch

In [44]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
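
Conceptually, the collator pads the audio inputs and the label sequences separately, each to the longest item in the batch, and marks padded label positions with `-100` so that the loss ignores them. The plain-Python sketch below mimics that behaviour on toy lists. It is not the `processor.pad` implementation (which returns tensors), and the `pad_batch` helper is a hypothetical name.

```python
def pad_batch(features, input_pad_value=0.0, label_pad_value=-100):
    """Pad inputs and labels separately to the longest sequence in the batch."""
    max_input = max(len(f["input_values"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    input_values = [
        list(f["input_values"]) + [input_pad_value] * (max_input - len(f["input_values"]))
        for f in features
    ]
    labels = [
        list(f["labels"]) + [label_pad_value] * (max_label - len(f["labels"]))
        for f in features
    ]
    return {"input_values": input_values, "labels": labels}


toy_batch = pad_batch([
    {"input_values": [0.1, 0.2, 0.3], "labels": [5, 6]},
    {"input_values": [0.4], "labels": [7, 8, 9]},
])
print(toy_batch["input_values"])  # [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0]]
print(toy_batch["labels"])        # [[5, 6, -100], [7, 8, 9]]
```

Padding per batch rather than to a global maximum keeps batches small when clips of similar length are grouped together.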

In [45]:
!pip install jiwer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [46]:
### Loading Evaluation metric
from datasets import load_metric
wer_metric = load_metric("wer")

In [47]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    # Restore the padding id where the collator put -100 so that the labels can be decoded
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    # Use the processor built above: it is the one the labels were encoded with
    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

In [48]:
### Loading CheckPoint

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
    # The CTC head gets 64 output units, more than the 40 tokens in the vocabulary built above
    vocab_size = 64,
)

Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForCTC: ['project_hid.bias', 'quantizer.codevectors', 'quantizer.weight_proj.weight', 'project_q.bias', 'project_hid.weight', 'quantizer.weight_proj.bias', 'project_q.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predicti

In [49]:
model.freeze_feature_encoder()

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="./",
  group_by_length=True,
  per_device_train_batch_size=8,
  evaluation_strategy="steps",
  num_train_epochs=20,
  fp16=True,
  gradient_checkpointing=True,
  save_steps=500,
  eval_steps=500,
  logging_steps=500,
  learning_rate=1e-4,
  weight_decay=0.005,
  warmup_steps=1000,
  save_total_limit=2,
)

In [None]:
# import Train_function
from transformers import Trainer
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset= dta_dict["train"],
    eval_dataset= dta_dict["test"],
    tokenizer=processor.feature_extractor,)

In [None]:
trainer.train()

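A fine-tuned checkpoint can be loaded from the 🤗 Hub by name. The snippet below loads the `patrickvonplaten/wav2vec2-base-timit-demo-google-colab` checkpoint together with its processor:

```python
from transformers import AutoModelForCTC, Wav2Vec2Processor

model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-timit-demo-google-colab")
processor = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wav2vec2-base-timit-demo-google-colab")
```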

### Evaluate

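In the final part, we run the model on the test split to get a feeling for how well it works.

Let's load the `processor` and `model`. The processor of the checkpoint is bound to `processor_`, so that `processor` keeps referring to the one built from our own vocabulary, which is needed to decode the labels.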

In [None]:
processor_ = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wav2vec2-base-timit-demo-google-colab")

In [None]:
model = Wav2Vec2ForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-timit-demo-google-colab").cuda()

Now, we will make use of the `map(...)` function to predict the transcription of every test sample and to save the prediction in the dataset itself. We will call the resulting dictionary `"results"`. 

**Note**: we evaluate the test data set with `batch_size=1` on purpose due to this [issue](https://github.com/pytorch/fairseq/issues/3227). Since padded inputs don't yield the exact same output as non-padded inputs, a better WER can be achieved by not padding the input at all.

In [None]:
def map_to_result(batch):
  with torch.no_grad():
    input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
    logits = model(input_values).logits

  pred_ids = torch.argmax(logits, dim=-1)
  # Predictions are decoded with the processor that belongs to the loaded checkpoint
  batch["pred_str"] = processor_.batch_decode(pred_ids)[0]
  # Labels were encoded with our own tokenizer; repeated tokens in the references must not be grouped
  batch["text"] = processor.decode(batch["labels"], group_tokens=False)

  return batch

In [None]:
results = dta_dict["test"].map(map_to_result, remove_columns=dta_dict["test"].column_names)
# results = map_to_result(dta_dict["test"])

Let's compute the overall WER now.

In [None]:
print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["text"])))
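
For reference, the word error rate is the number of word-level substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below computes it in plain Python with an edit-distance table. It follows that standard definition and is not the code behind the `wer` metric, which may preprocess text differently; the `word_error_rate` helper and the example sentences are assumptions.

```python
def word_error_rate(references, predictions):
    """Word-level edit distance summed over all pairs, divided by the total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Single-row dynamic-programming Levenshtein distance over words.
        row = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            prev_diag, row[0] = row[0], i
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                prev_diag, row[j] = row[j], min(
                    row[j] + 1,        # deletion of a reference word
                    row[j - 1] + 1,    # insertion of a predicted word
                    prev_diag + cost,  # substitution or match
                )
        total_errors += row[-1]
        total_words += len(ref_words)
    return total_errors / total_words


toy_wer = word_error_rate(
    ["relax your mouth in its normal position"],
    ["relax your mouth its normal positon"],
)
print(f"{toy_wer:.3f}")  # one deletion and one substitution over 7 words
```

Because the errors are summed before dividing, long utterances weigh more than short ones in the overall score.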

In [None]:
show_random_elements(results)

It becomes clear that the predicted transcriptions are acoustically very similar to the target transcriptions, but often contain spelling or grammatical errors. This shouldn't be very surprising though given that we purely rely on Wav2Vec2 without making use of a language model.

Finally, to better understand how CTC works, it is worth taking a deeper look at the exact output of the model. Let's run the first test sample through the model, take the predicted ids and convert them to their corresponding tokens.

In [None]:
model.to("cuda")

with torch.no_grad():
  logits = model(torch.tensor(dta_dict["test"][:1]["input_values"], device="cuda")).logits

pred_ids = torch.argmax(logits, dim=-1)

# convert ids to tokens
" ".join(processor.tokenizer.convert_ids_to_tokens(pred_ids[0].tolist()))

The output should make it a bit clearer how CTC works in practice. The model is to some extent invariant to speaking rate, since it has learned to simply repeat the same token as long as the speech chunk to be classified still corresponds to that token. This makes CTC a very powerful algorithm for speech recognition, since a speech file's transcription is often largely independent of its length.
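
The decoding rule behind this behaviour can be sketched in a few lines: merge consecutive repeats, drop the padding token (which doubles as the CTC blank for this tokenizer), and turn the word delimiter back into a space. The `ctc_greedy_collapse` helper and the token sequences below are illustrative assumptions, not output of the model.

```python
from itertools import groupby


def ctc_greedy_collapse(tokens, pad_token="[PAD]", word_delimiter="|"):
    """Turn a frame-level token sequence into text using the CTC collapsing rule."""
    # 1) merge consecutive repeats, 2) drop blanks, 3) restore spaces
    merged = [token for token, _ in groupby(tokens)]
    chars = [token for token in merged if token != pad_token]
    return "".join(chars).replace(word_delimiter, " ").strip()


# The same words spoken slowly (many repeated frames) and quickly (one frame per token).
slow = ["s", "s", "s", "[PAD]", "a", "a", "[PAD]", "y", "y", "|", "|", "u", "[PAD]", "p", "p"]
fast = ["s", "a", "y", "|", "u", "p"]
print(ctc_greedy_collapse(slow))  # say up
print(ctc_greedy_collapse(fast))  # say up

# A blank between two identical tokens keeps a genuine double letter.
print(ctc_greedy_collapse(["e", "e", "[PAD]", "e"]))  # ee
```

Both the slow and the fast sequence decode to the same text, which is the speaking-rate invariance described above.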
