## Speech Recognition

Dataset Source: https://www.kaggle.com/datasets/phmanhth/speech-recognition-dataset

#### Import Necessary Libraries

In [1]:
import os, sys, glob
os.environ['TOKENIZERS_PARALLELISM']='false'

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

import numpy as np
import pandas as pd

import datasets
from datasets import Dataset, DatasetDict, Audio, load_dataset

import torch

import transformers
from transformers import AutoModelForCTC, TrainingArguments
from transformers import AutoProcessor, set_seed, Trainer

import evaluate

!git lfs install

Git LFS initialized.


#### Display Library Versions

In [2]:
n = 18

print("Language/Library".rjust(n-2), '|', 'Version')
print('-' * (n-2), '|', '--------')
print("Python :".rjust(n), sys.version.split()[0])
print("NumPy :".rjust(n), np.__version__)
print("Pandas :".rjust(n), pd.__version__)
print("Torch :".rjust(n), torch.__version__)
print("Datasets :".rjust(n), datasets.__version__)
print("Transformers :".rjust(n), transformers.__version__)
print("Evaluate :".rjust(n), evaluate.__version__)

Language/Library | Version
---------------- | --------
          Python : 3.9.12
           NumPy : 1.24.3
          Pandas : 2.0.1
           Torch : 2.0.0
        Datasets : 2.11.0
    Transformers : 4.27.4
        Evaluate : 0.4.0


#### Ingest Dataset Text

In [3]:
parent_dir = "/Users/briandunn/Desktop/Audio_Projects/Audio Datasets/ASR - Speech Recognition Dataset/data"
areas = ['Health & Lifestyle', 'Science & Technology']

text_df = pd.DataFrame()
file_names = []

# Collect the path of every metadata.txt file
for area in areas:
    folder_to_search = os.path.join(parent_dir, 
                                    area, 
                                    "*", 
                                    "metadata.txt")
    
    temp_list = glob.glob(folder_to_search)
    file_names = file_names + temp_list

for file in file_names:
    temp_df = pd.read_csv(file,
                          sep='|',
                          engine='c',
                          names=["file_name", "transcription"])
    
    temp_df['file_name'] = file.split("metadata")[0]\
        .split("/ASR - Speech Recognition Dataset/data/")[-1] + \
        "wavs/" + \
        temp_df['file_name'] + \
        ".wav"
    
    text_df = pd.concat([text_df, temp_df])

pd.set_option('display.max_colwidth', None)

# Save data to file
folder_location_to_save_input_file = "/Users/briandunn/Desktop/Audio_Projects/Audio Datasets/ASR - Speech Recognition Dataset"

text_df.to_csv(os.path.join(folder_location_to_save_input_file,
                            "data",
                            "metadata.csv"), 
              index=False)

text_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22075 entries, 0 to 40
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   file_name      22075 non-null  object
 1   transcription  22075 non-null  object
dtypes: object(2)
memory usage: 517.4+ KB
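
The path handling in the cell above can be looked at in isolation. The sketch below rebuilds a clip's path relative to the `data` folder from the location of its `metadata.txt`. The helper name `relative_wav_path` and the example folder and clip names are hypothetical, and it uses `posixpath` rather than the literal `/` string splits used in the notebook.

```python
import posixpath

def relative_wav_path(metadata_path: str, data_root: str, clip_id: str) -> str:
    """Return the clip's .wav path relative to data_root."""
    # Folder that holds metadata.txt, expressed relative to the data root
    speaker_dir = posixpath.relpath(posixpath.dirname(metadata_path), data_root)
    # Audio files live in a "wavs" subfolder next to metadata.txt
    return posixpath.join(speaker_dir, "wavs", clip_id + ".wav")

# Hypothetical layout: <data_root>/<area>/<speaker>/metadata.txt
example = relative_wav_path(
    "/datasets/asr/data/Health & Lifestyle/speaker_01/metadata.txt",
    "/datasets/asr/data",
    "clip_0001",
)
print(example)
```

Keeping `file_name` relative to the folder that contains `metadata.csv` is what lets the `audiofolder` loader pair each transcription with its audio file.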


In [4]:
pd.reset_option('display.max_colwidth')

#### Ingest Dataset

In [5]:
audio_data = load_dataset(folder_location_to_save_input_file)

audio_data

Resolving data files:   0%|          | 0/22566 [00:00<?, ?it/s]

Downloading and preparing dataset audiofolder/ASR - Speech Recognition Dataset to /Users/briandunn/.cache/huggingface/datasets/audiofolder/ASR - Speech Recognition Dataset-94c4bf3175bdaafa/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc...


Downloading data files:   0%|          | 0/22076 [00:00<?, ?it/s]

Downloading data files:   0%|          | 0/490 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/490 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset audiofolder downloaded and prepared to /Users/briandunn/.cache/huggingface/datasets/audiofolder/ASR - Speech Recognition Dataset-94c4bf3175bdaafa/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['audio', 'transcription'],
        num_rows: 22075
    })
})

#### Dataset Preparation

In [6]:
audio_data = audio_data.cast_column("audio", Audio(sampling_rate=16_000))

#### Split Dataset into Training & Evaluation Datasets

In [7]:
audio_data = audio_data['train'].train_test_split(train_size=0.70)

print("Training Dataset Shape:", audio_data['train'].shape)
print("Testing Dataset Shape:", audio_data['test'].shape)

Training Dataset Shape: (15452, 2)
Testing Dataset Shape: (6623, 2)


#### Basic Values/Constants

In [8]:
MODEL_CKPT = "facebook/wav2vec2-base"
MODEL_NAME = MODEL_CKPT.split("/")[-1] + "-Speech_Recognition_Dataset"

BATCH_SIZE = 8
LR = 1e-5

WARMUP_STEPS = 500
MAX_TRAINING_STEPS = 2000

LOGGING_STEPS = 50
STEPS = 1000

GRAD_ACC_STEPS = 2
STRATEGY = "steps"

REPORTS_TO = "tensorboard"
DEVICE = torch.device("cpu")
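
These constants determine how many examples each optimizer step sees and how long the run lasts. As a rough check, the arithmetic below reproduces the `epoch: 2.07` value and the 828 evaluation batches reported in the training logs. It is a sketch that assumes a single device, that the number of batches per epoch is rounded up, and that optimizer steps per epoch are obtained by floor division by the accumulation steps. The row counts are copied from the split output above.

```python
import math

TRAIN_ROWS, TEST_ROWS = 15452, 6623
BATCH_SIZE, GRAD_ACC_STEPS, MAX_TRAINING_STEPS = 8, 2, 2000

# Examples contributing to one optimizer update
effective_batch_size = BATCH_SIZE * GRAD_ACC_STEPS

# Forward/backward passes per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
updates_per_epoch = batches_per_epoch // GRAD_ACC_STEPS

# Passes over the training data covered by MAX_TRAINING_STEPS updates
epochs_covered = MAX_TRAINING_STEPS / updates_per_epoch

# Batches needed for one full evaluation pass
eval_batches = math.ceil(TEST_ROWS / BATCH_SIZE)

print(effective_batch_size, updates_per_epoch, round(epochs_covered, 2), eval_batches)
```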

#### Define Processor

In [9]:
processor = AutoProcessor.from_pretrained(MODEL_CKPT)



#### Convert Transcriptions to Uppercase

In [10]:
def convert_to_uppercase(example):
    return {"transcription": example["transcription"].upper()}

audio_data = audio_data.map(convert_to_uppercase)

Map:   0%|          | 0/15452 [00:00<?, ? examples/s]

Map:   0%|          | 0/6623 [00:00<?, ? examples/s]

#### Prepare/Encode Dataset

In [11]:
def prepare_dataset(batch):
    audio = batch["audio"]
    batch = processor(audio["array"], 
                      sampling_rate=audio["sampling_rate"], 
                      text=batch["transcription"])
    
    batch["input_length"] = len(batch["input_values"][0])
    return batch

encoded_audio_data = audio_data.map(prepare_dataset, 
                         remove_columns=audio_data.column_names["train"], 
                         num_proc=4)

Map (num_proc=4):   0%|          | 0/15452 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/6623 [00:00<?, ? examples/s]

#### Define Compute Metrics Function

In [12]:
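# Load the metric once instead of on every evaluation call
wer_metric = evaluate.load("wer")

def compute_metrics(preds):
    # Greedy decoding: pick the most likely token at each frame
    preds_ids = np.argmax(preds.predictions, axis=-1)
    
    # Restore the pad token where the collator masked labels with -100
    preds.label_ids[preds.label_ids == -100] = processor.tokenizer.pad_token_id
    
    # Labels are not frame-level, so repeated characters must not be merged
    preds_str = processor.batch_decode(preds_ids)
    labels_str = processor.batch_decode(preds.label_ids, group_tokens=False)
    
    wer = wer_metric.compute(predictions=preds_str, references=labels_str)
    
    return {"wer": wer}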
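
Word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a simplified plain-Python version for illustration only, not the implementation behind `evaluate.load("wer")`. The function names and example sentences are hypothetical.

```python
from typing import List

def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(predictions: List[str], references: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), p.split())
                 for p, r in zip(predictions, references))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words

print(word_error_rate(["THE CAT SAT"], ["THE CAT SAT"]))
print(word_error_rate(["THE BAT SAT DOWN"], ["THE CAT SAT"]))
print(word_error_rate([""], ["THE CAT SAT"]))
```

Under this definition, an empty prediction for every clip yields a WER of exactly 1.0. That is one way to arrive at the `eval_wer` of 1.0 reported later, although this sketch does not establish that it is what happened in this run.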

#### Define Model

In [13]:
model = AutoModelForCTC.from_pretrained(
    MODEL_CKPT,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
).to(DEVICE)

Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForCTC: ['quantizer.weight_proj.bias', 'project_hid.bias', 'project_hid.weight', 'quantizer.codevectors', 'project_q.bias', 'quantizer.weight_proj.weight', 'project_q.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
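
The model is given `pad_token_id` because, in the Wav2Vec2 CTC setup in Transformers, the pad token also serves as the CTC blank. The sketch below shows how frame-level predictions collapse into a label sequence under greedy CTC decoding: consecutive repeats are merged first, then blanks are dropped. The token ids and the helper name `ctc_greedy_collapse` are hypothetical.

```python
from itertools import groupby
from typing import List

def ctc_greedy_collapse(frame_ids: List[int], blank_id: int) -> List[int]:
    """Merge consecutive duplicate ids, then remove the blank id."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]

BLANK = 0  # hypothetical blank/pad id
frames = [0, 7, 7, 0, 7, 4, 4, 0, 0]
print(ctc_greedy_collapse(frames, BLANK))  # [7, 7, 4]
```

The blank between the two runs of `7` is what allows a doubled character to survive the merge. Reference labels are not frame-level, which is why `compute_metrics` decodes them with `group_tokens=False`.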

#### Define Data Collator Class

In [14]:
@dataclass
class DataCollatorCTCWithPadding:
    processor: AutoProcessor
    padding: Union[bool, str] = "longest"
    
    def __call__(self, 
                 features: List[Dict[str, 
                                     Union[List[int], 
                                     torch.Tensor]]]
                 ) -> Dict[str, torch.Tensor]:
        """
        This function splits the inputs & labels since 
        they have to be different lengths & require 
        different padding methods.
        """
        input_features = [{"input_values": feature["input_values"][0]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        
        batch = self.processor.pad(input_features, 
                                   padding=self.padding, 
                                   return_tensors="pt")
        
        labels_batch = self.processor.pad(labels=label_features, 
                                          padding=self.padding, 
                                          return_tensors="pt")
        
        # Replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        
        batch["labels"] = labels
        return batch
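
The label-masking step in the collator can be shown without tensors. The sketch below is a plain-Python stand-in, not what `processor.pad` does internally: it pads every label sequence to the longest one in the batch with `-100`, the value the CTC model treats as "ignore" when computing the loss. The token ids and the helper name `pad_labels` are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # label positions with this value are masked out of the loss

def pad_labels(label_ids: List[List[int]],
               ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad each label sequence to the longest length in the batch."""
    longest = max(len(ids) for ids in label_ids)
    return [ids + [ignore_index] * (longest - len(ids)) for ids in label_ids]

padded = pad_labels([[5, 9, 12], [7]])
print(padded)  # [[5, 9, 12], [7, -100, -100]]
```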

#### Instantiate Data Collator

In [15]:
data_collator = DataCollatorCTCWithPadding(processor=processor, 
                                           padding="longest")

#### Define Training Arguments

In [16]:
args = TrainingArguments(
    output_dir=MODEL_NAME,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACC_STEPS,
    warmup_steps=WARMUP_STEPS,
    max_steps=MAX_TRAINING_STEPS,
    gradient_checkpointing=True,
    learning_rate=LR,
    logging_first_step=True,
    logging_strategy=STRATEGY,
    logging_steps=LOGGING_STEPS,
    save_strategy=STRATEGY,
    save_steps=STEPS,
    evaluation_strategy=STRATEGY,
    eval_steps=STEPS,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,  # WER is an error rate, so lower is better
    report_to=REPORTS_TO,
    hub_private_repo=True,
    push_to_hub=True
)
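
With these arguments the learning rate is not constant. Assuming the default linear scheduler (no `lr_scheduler_type` is set above), it ramps from 0 to `LR` over `WARMUP_STEPS` and then decays linearly to 0 at `MAX_TRAINING_STEPS`. The sketch below is a standalone approximation of that schedule, not the library's scheduler. The function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step: int, base_lr: float = 1e-5,
                       warmup_steps: int = 500, total_steps: int = 2000) -> float:
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup: grow proportionally to the step count
        return base_lr * step / warmup_steps
    # Decay: shrink proportionally to the steps that remain
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (1, 50, 500, 550, 2000):
    print(step, linear_schedule_lr(step))
```

The values at steps 1, 50, 500, and 550 line up with the `learning_rate` entries in the training log (about 2e-08, 1e-06, 1e-05, and 9.67e-06).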

#### Instantiate Trainer

In [17]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_audio_data['train'],
    eval_dataset=encoded_audio_data['test'],
    tokenizer=processor.feature_extractor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co/DunnBC22/wav2vec2-base-Speech_Recognition_Dataset into local empty directory.


#### Train Model

In [18]:
train_results = trainer.train()



  0%|          | 0/2000 [00:00<?, ?it/s]

{'loss': 25.3084, 'learning_rate': 2e-08, 'epoch': 0.0}
{'loss': 37.813, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.05}
{'loss': 32.6564, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.1}
{'loss': 28.1036, 'learning_rate': 3e-06, 'epoch': 0.16}
{'loss': 21.4504, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.21}
{'loss': 0.0, 'learning_rate': 5e-06, 'epoch': 0.26}
{'loss': 0.0, 'learning_rate': 6e-06, 'epoch': 0.31}
{'loss': 0.0, 'learning_rate': 7e-06, 'epoch': 0.36}
{'loss': 0.0, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.41}
{'loss': 0.0, 'learning_rate': 9e-06, 'epoch': 0.47}
{'loss': 0.0, 'learning_rate': 1e-05, 'epoch': 0.52}
{'loss': 0.0, 'learning_rate': 9.666666666666667e-06, 'epoch': 0.57}
{'loss': 0.0, 'learning_rate': 9.333333333333334e-06, 'epoch': 0.62}
{'loss': 0.0, 'learning_rate': 9e-06, 'epoch': 0.67}
{'loss': 0.0, 'learning_rate': 8.666666666666668e-06, 'epoch': 0.72}
{'loss': 0.0, 'learning_rate': 8.333333333333334e-06, 'epoch': 0.78}
{'los

  0%|          | 0/828 [00:00<?, ?it/s]

{'eval_loss': nan, 'eval_wer': 1.0, 'eval_runtime': 8526.0304, 'eval_samples_per_second': 0.777, 'eval_steps_per_second': 0.097, 'epoch': 1.04}


Adding files tracked by Git LFS: ['.DS_Store']. This may take a bit of time if the files are large.


{'loss': 0.0, 'learning_rate': 6.333333333333333e-06, 'epoch': 1.09}
{'loss': 0.0, 'learning_rate': 6e-06, 'epoch': 1.14}
{'loss': 0.0, 'learning_rate': 5.666666666666667e-06, 'epoch': 1.19}
{'loss': 0.0, 'learning_rate': 5.333333333333334e-06, 'epoch': 1.24}
{'loss': 0.0, 'learning_rate': 5e-06, 'epoch': 1.29}
{'loss': 0.0, 'learning_rate': 4.666666666666667e-06, 'epoch': 1.35}
{'loss': 0.0, 'learning_rate': 4.333333333333334e-06, 'epoch': 1.4}
{'loss': 0.0, 'learning_rate': 4.000000000000001e-06, 'epoch': 1.45}
{'loss': 0.0, 'learning_rate': 3.6666666666666666e-06, 'epoch': 1.5}
{'loss': 0.0, 'learning_rate': 3.3333333333333333e-06, 'epoch': 1.55}
{'loss': 0.0, 'learning_rate': 3e-06, 'epoch': 1.6}
{'loss': 0.0, 'learning_rate': 2.666666666666667e-06, 'epoch': 1.66}
{'loss': 0.0, 'learning_rate': 2.3333333333333336e-06, 'epoch': 1.71}
{'loss': 0.0, 'learning_rate': 2.0000000000000003e-06, 'epoch': 1.76}
{'loss': 0.0, 'learning_rate': 1.6666666666666667e-06, 'epoch': 1.81}
{'loss': 0.

  0%|          | 0/828 [00:00<?, ?it/s]

{'eval_loss': nan, 'eval_wer': 1.0, 'eval_runtime': 8315.7991, 'eval_samples_per_second': 0.796, 'eval_steps_per_second': 0.1, 'epoch': 2.07}
{'train_runtime': 181071.902, 'train_samples_per_second': 0.177, 'train_steps_per_second': 0.011, 'train_loss': 2.994333236694336, 'epoch': 2.07}


#### Save Model & Metrics

In [19]:
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

Upload file runs/May17_12-57-26_Brians-Mac-mini/events.out.tfevents.1684346252.Brians-Mac-mini.29839.0: 100%|#…

To https://huggingface.co/DunnBC22/wav2vec2-base-Speech_Recognition_Dataset
   36ca157..4e19e2d  main -> main

To https://huggingface.co/DunnBC22/wav2vec2-base-Speech_Recognition_Dataset
   4e19e2d..5b4b53a  main -> main



***** train metrics *****
  epoch                    =               2.07
  train_loss               =             2.9943
  train_runtime            = 2 days, 2:17:51.90
  train_samples_per_second =              0.177
  train_steps_per_second   =              0.011


#### Push Model to Hub

In [20]:
trainer.push_to_hub()

To https://huggingface.co/DunnBC22/wav2vec2-base-Speech_Recognition_Dataset
   5b4b53a..8700aaf  main -> main



'https://huggingface.co/DunnBC22/wav2vec2-base-Speech_Recognition_Dataset/commit/8700aaf2375bc78e13a826862609b2ea8779b0d8'

#### Evaluate Model

In [21]:
trainer.evaluate()

  0%|          | 0/828 [00:00<?, ?it/s]

{'eval_loss': nan,
 'eval_wer': 1.0,
 'eval_runtime': 8351.8663,
 'eval_samples_per_second': 0.793,
 'eval_steps_per_second': 0.099,
 'epoch': 2.07}

### Notes & Other Takeaways
****
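- I have not yet determined why the logged training loss drops to 0.0 after the first 250-299 steps, which is before the 500 warmup steps conclude. Every evaluation also reports an `eval_loss` of `nan` and a WER of 1.0, so this run did not produce a usable model.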
****
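
One thing that could be checked, offered as a possible diagnostic rather than a confirmed cause, is whether any example has more label tokens than the model produces output frames. CTC loss is infinite for such examples. The sketch below estimates the frame count from the number of audio samples. The kernel sizes and strides are assumed to be the default Wav2Vec2 convolutional feature-extractor settings and should be verified against `model.config.conv_kernel` and `model.config.conv_stride`. The helper names are hypothetical.

```python
from typing import Sequence

# Assumed defaults; confirm against model.config before relying on them
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_output_frames(num_samples: int,
                      kernels: Sequence[int] = CONV_KERNELS,
                      strides: Sequence[int] = CONV_STRIDES) -> int:
    """Frames left after the stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length

def ctc_target_fits(num_samples: int, num_label_tokens: int) -> bool:
    """Necessary (not sufficient) condition for a finite CTC loss."""
    return num_label_tokens <= num_output_frames(num_samples)

print(num_output_frames(16_000))    # one second of 16 kHz audio -> 49 frames
print(ctc_target_fits(16_000, 60))  # False: 60 tokens cannot fit in 49 frames
```

Applied to the encoded dataset, `input_length` and `len(labels)` from `prepare_dataset` would be the two arguments. The check is only a lower bound, since repeated characters need an extra blank frame between them.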

### Citation(s)

- Model Checkpoint
    > https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec#wav2vec-20
    
    > https://arxiv.org/abs/2006.11477 (wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations)

- Metric (Word Error Rate [WER])
    > @inproceedings{woodard1982, author = {Woodard, J.P. and Nelson, J.T.}, year = {1982}, journal = {Workshop on standardisation for speech I/O technology, Naval Air Development Center, Warminster, PA}, title = {An information theoretic measure of speech recognition performance}}
    
    > @inproceedings{morris2004, author = {Morris, Andrew and Maier, Viktoria and Green, Phil}, year = {2004}, month = {01}, pages = {}, title = {From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}}