## Fine-Tuning: ASR Model Wav2Vec2-large-960h-cv
This Jupyter notebook outlines the steps taken to fine-tune the `wav2vec2-large-960h` model on the Common Voice `cv-valid-train` dataset.

In [1]:
# from huggingface_hub import notebook_login

In [2]:
# Log in to upload training checkpoints directly to the Hugging Face Hub while training (for version control and performance tracking)
# notebook_login()


## Tokeniser 
This part of the ASR training pipeline converts the model's output (a sequence of predicted token IDs) into text. 
The `Wav2Vec2CTCTokenizer` is used for this.

The tokenizer's vocabulary is then built from the set of characters that occur in the dataset's transcriptions.
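To make the "output to text" step concrete, here is a minimal sketch of greedy CTC decoding: repeated frame predictions are collapsed, blank tokens are dropped, and the word delimiter becomes a space. The toy vocabulary, the frame IDs and the `ctc_greedy_decode` helper are hypothetical and only illustrate the idea; they are not the tokenizer's actual implementation.

```python
def ctc_greedy_decode(ids, id_to_token, blank_token="[PAD]", word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    tokens = []
    previous = None
    for i in ids:
        # CTC emits one prediction per audio frame, so a held sound repeats its ID
        if i != previous:
            tokens.append(id_to_token[i])
        previous = i
    # the blank token separates genuine double letters and is removed afterwards
    chars = [t for t in tokens if t != blank_token]
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary and per-frame argmax IDs
toy_vocab = {0: "[PAD]", 1: "|", 2: "h", 3: "i", 4: "l", 5: "o", 6: "e"}
frame_ids = [2, 2, 0, 6, 4, 4, 0, 4, 5, 5, 1, 1, 2, 3, 0]

print(ctc_greedy_decode(frame_ids, toy_vocab))
```

Note how the blank between the two runs of `l` is what allows the double letter in "hello" to survive the collapsing step.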

In [17]:
import datasets
from datasets import load_dataset, load_metric

In [18]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [5]:
# Load Common Voice dataset (for a specific language, e.g., English)
common_voice_dataset = datasets.load_dataset("C:/Users/Clarence/Downloads/common_voice", 'cv-valid-train')

# shuffle and split the dataset
common_voice_dataset = common_voice_dataset.shuffle(seed=42)
train_test_ds = common_voice_dataset['train'].train_test_split(test_size=0.3)

# split by training and test sets
# train_dataset = train_test_ds['train']
# test_dataset = train_test_ds['test']

In [6]:
train_test_ds

DatasetDict({
    train: Dataset({
        features: ['filename', 'audio', 'text', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'duration'],
        num_rows: 137043
    })
    test: Dataset({
        features: ['filename', 'audio', 'text', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'duration'],
        num_rows: 58733
    })
})

In [7]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

In [8]:
show_random_elements(train_test_ds["train"].remove_columns(["audio", "up_votes", "down_votes"]), num_examples=10)

| | filename | text | age | gender | accent | duration |
|---|---|---|---|---|---|---|
| 0 | cv-valid-train/sample-115816.mp3 | a partnership in the strictest sense of the word | | | | |
| 1 | cv-valid-train/sample-002049.mp3 | and at that i told him and he took my place | | | | |
| 2 | cv-valid-train/sample-160351.mp3 | and that he a boy could perform miracles | | | | |
| 3 | cv-valid-train/sample-010239.mp3 | don't say that again | | | | |
| 4 | cv-valid-train/sample-108945.mp3 | but they were not there | fourties | male | australia | |
| 5 | cv-valid-train/sample-158625.mp3 | pearl williams works for the president | | | | |
| 6 | cv-valid-train/sample-044841.mp3 | two more months passed and the shelf brought many customers into the crystal shop | fourties | male | australia | |
| 7 | cv-valid-train/sample-023866.mp3 | the oxygen's only to help him till the doctor gets here | fifties | female | indian | |
| 8 | cv-valid-train/sample-102931.mp3 | i'd like you to take me there if you can | | | | |
| 9 | cv-valid-train/sample-054986.mp3 | my seven year resume gap is marked not drugs | thirties | female | us | |


In [9]:
#preprocessing
'''
This section prepares the training dataset for fine-tuning. 
'''
def extract_all_chars(batch):
  all_text = " ".join(batch["text"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}

train_test_ds = train_test_ds.remove_columns(["age", "gender", "up_votes", "down_votes", "accent", "duration"])
vocabs = train_test_ds.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=train_test_ds.column_names["train"])


In [10]:
vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))

In [11]:
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict

{'t': 0,
 'k': 1,
 'n': 2,
 'g': 3,
 ' ': 4,
 'e': 5,
 'h': 6,
 'r': 7,
 'j': 8,
 'f': 9,
 'c': 10,
 'm': 11,
 "'": 12,
 'b': 13,
 'y': 14,
 'p': 15,
 's': 16,
 'z': 17,
 'u': 18,
 'a': 19,
 'w': 20,
 'd': 21,
 'i': 22,
 'l': 23,
 'o': 24,
 'q': 25,
 'v': 26,
 'x': 27}

In [12]:
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

30

In [13]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)


In [10]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
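The tokenizer above is created with `word_delimiter_token="|"`, while the vocabulary printed earlier contains a plain space (`' ': 4`) and no `|` entry. Hugging Face's Wav2Vec2 fine-tuning example renames the space entry to `|` before adding the special tokens. The sketch below shows that step on a toy vocabulary. `add_ctc_special_tokens` is a hypothetical helper, and whether the step is actually needed for this run is an assumption rather than something verified in this notebook.

```python
def add_ctc_special_tokens(vocab_dict, word_delimiter="|"):
    """Return a copy of the vocabulary with the space renamed and [UNK]/[PAD] appended."""
    vocab = dict(vocab_dict)  # work on a copy so the original mapping is untouched
    if " " in vocab:
        # keep the same ID, but make the word boundary a visible character
        vocab[word_delimiter] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


# Hypothetical toy vocabulary, much smaller than the real one
toy_vocab_dict = {"a": 0, " ": 1, "b": 2}
print(add_ctc_special_tokens(toy_vocab_dict))
```

The total vocabulary size is unchanged by the rename, so the real vocabulary would still have 30 entries.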

In [15]:
# repo_name = "wav2vec2-commonvoice-technicaltest"
# tokenizer.push_to_hub(repo_name)

## Feature Extractor

In [11]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

In [12]:
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

## Preprocessing Data

This section inspects the acoustic and textual data in the dataset, then prepares it for training (i.e., preparing the ingredients before the big cookout).

In [18]:
train_test_ds['train'][0]["audio"]

{'path': 'C:/Users/Clarence/Downloads/common_voice\\cv-valid-train/sample-185391.mp3',
 'array': array([-1.63709046e-11,  5.45696821e-12,  0.00000000e+00, ...,
        -5.11246299e-08,  7.41389385e-08,  1.97670431e-08]),
 'sampling_rate': 16000}

In [19]:
# common_voice_train = common_voice_train.cast_column("audio", Audio(sampling_rate=16_000))
# common_voice_test = common_voice_test.cast_column("audio", Audio(sampling_rate=16_000))

In [20]:
import numpy as np 

rand_int = random.randint(0, len(train_test_ds["train"]) - 1)  # randint includes both endpoints

print("Target text:", train_test_ds["train"][rand_int]["text"])
print("Input array shape:", np.asarray(train_test_ds["train"][rand_int]["audio"]["array"]).shape)
print("Sampling rate:", train_test_ds["train"][rand_int]["audio"]["sampling_rate"])

Target text: most of them were staring quietly at the big table
Input array shape: (48384,)
Sampling rate: 16000


In [21]:
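# Check GPU memory allocation (RTX 4070Ti, 12GB VRAM)
from torch import cuda
t = cuda.get_device_properties(0).total_memory  # total device memory, in bytes
r = cuda.memory_reserved(0)   # bytes reserved by PyTorch's caching allocator
a = cuda.memory_allocated(0)  # bytes currently occupied by tensors
f = r-a  # free inside reserved

# Query the driver through NVML for device-wide memory usage (values in bytes)
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(h)
print(f'total    : {info.total}')
print(f'free     : {info.free}')
print(f'used     : {info.used}')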

total    : 12878610432
free     : 10834845696
used     : 2043764736


The `prepare_dataset` function extracts the input values from each audio file and encodes the transcription into label IDs. This is important because it puts the data into the form the model expects during training (the ingredients required by the recipe). 

In [22]:
def prepare_dataset(batch, processor):
    audio = batch["audio"]

    # Process audio
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    # Process labels (text)
    labels = processor(text=batch["text"]).input_ids
    batch["labels"] = labels

    return batch

from transformers import Wav2Vec2Processor
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

train_test_ds = train_test_ds.map(lambda batch: prepare_dataset(batch, processor), 
                                  remove_columns=train_test_ds.column_names["train"], 
                                  num_proc=1)




The code block below is important for avoiding out-of-memory errors on my system (which happened repeatedly). According to the literature on self-attention, the memory requirement scales quadratically with input length, so long clips are filtered out of the training set. As I only have 12GB of VRAM, I settled on a maximum of 3 seconds after trying values heuristically (4 seconds, as suggested in the literature, and 3.5 seconds were tried as well).

In [None]:
max_input_length_in_sec = 3.0
train_test_ds["train"] = train_test_ds["train"].filter(lambda x: x < max_input_length_in_sec * processor.feature_extractor.sampling_rate, input_columns=["input_length"])
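As a rough illustration of why the clip length matters, the sketch below counts how many frames the convolutional feature encoder produces for a clip and how many pairwise attention scores a single attention head then has to hold. The kernel sizes and strides are taken from the model config printed later in this notebook. The `num_frames` helper is hypothetical, and the estimate ignores activations, batching and mixed precision, so treat it as a back-of-the-envelope sketch only.

```python
# conv_kernel and conv_stride as listed in the Wav2Vec2Config output in this notebook
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)
SAMPLING_RATE = 16_000


def num_frames(num_samples):
    """Sequence length seen by the transformer for a clip of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # standard output-length formula for an unpadded 1D convolution
        length = (length - kernel) // stride + 1
    return length


for seconds in (3.0, 3.5, 4.0):
    frames = num_frames(int(seconds * SAMPLING_RATE))
    # self-attention compares every frame with every other frame: frames x frames
    print(f"{seconds:.1f} s -> {frames} frames, {frames * frames:,} attention scores per head")
```

Under these assumptions a 3-second clip gives 149 frames and a 4-second clip gives 199, so the attention matrix for the longer clip is roughly 1.8 times as large.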

## Training
`DataCollatorCTCWithPadding` is a class that pads the inputs and labels of each batch dynamically (labels use the `[PAD]` token), which is important for Connectionist Temporal Classification (CTC) training. It is defined in the wav2vec2 Hugging Face repo examples.

The following code blocks also build up the components necessary for training and evaluation.

In [13]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [14]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
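The label handling inside the collator can be illustrated without tensors. The sketch below pads a batch of label sequences to the longest one and fills the padded positions with -100, the value the collator writes so that those positions are left out of the loss. `pad_labels` and the toy label IDs are hypothetical and only mirror the idea, not the collator's actual code path.

```python
IGNORE_INDEX = -100  # value written into padded label positions by the collator


def pad_labels(label_sequences, ignore_index=IGNORE_INDEX):
    """Pad every label sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in label_sequences)
    return [list(seq) + [ignore_index] * (longest - len(seq)) for seq in label_sequences]


# Hypothetical label IDs for a batch of three transcriptions
toy_labels = [[7, 2, 9], [4], [1, 3]]
padded = pad_labels(toy_labels)
print(padded)
```

Padding to the longest sequence in each batch, rather than to a fixed maximum, keeps batches as small as possible, which matters in a memory-constrained setup.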

In [None]:
#for computing WER
wer_metric = load_metric("wer")

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

In [None]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h",
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
).cuda()

In [47]:
model.freeze_feature_encoder()

Since this is a fine-tuning task, the convolutional feature encoder is frozen, as it does not need to be fine-tuned (mentioned in the wav2vec 2.0 paper; https://arxiv.org/abs/2006.11477).

## Description of Metric and Hyperparameters
1. To keep the evaluation process straightforward, word error rate (WER) is used here (computed via `jiwer`) as the evaluation metric for the ASR model.
2. The hyperparameters selected are as follows: 
    - seed: 42 (for reproducibility)
    - num_devices: 1 (RTX 4070Ti with 12GB VRAM, so the values discussed here are very small compared to running on VM compute clusters)
    - gradient_accumulation_steps: 1 (accumulation saves memory by applying the optimiser update only after n accumulated steps; it was left at 1, i.e. unused, after another optimisation step made it unnecessary)
    - optimiser: Adam (betas = (0.9, 0.999), epsilon = 1e-08; the `Trainer` defaults, as none are set in `TrainingArguments`)
    - lr_scheduler_type: linear (the `Trainer` default; not set explicitly)
    - lr_scheduler_warmup_steps: 500 (lowered number of warmup steps)
    - per_device_train_batch_size: 1 (memory saving: limits each training step to a single example to limit GPU memory allocation by PyTorch)
    - fp16: True (mixed-precision floating point, for memory saving)
    - eval_steps: 99999 _(In a VRAM-tight setup, **checkpoint evaluations are a luxury** and can easily eat up the VRAM needed for training, causing critical failure in the training process even with dynamic memory allocation and caching. As a result, I used the training loss to monitor for local minima and ran evaluation after training as a separate process (see below).)_
    - save_steps, logging_steps: 100 (for a finer-grained view of loss against training iteration when monitoring)
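To show how the batch size and gradient accumulation settings translate into the number of optimiser updates, here is a small arithmetic sketch. `optimisation_steps` is a hypothetical helper that approximates the bookkeeping the `Trainer` does; it is not the library's own code, and the example count of 30,449 is the figure reported in the training log of this notebook.

```python
import math


def optimisation_steps(num_examples, per_device_batch_size, grad_accum_steps,
                       num_devices=1, epochs=1):
    """Approximate (total optimiser updates, effective batch size) for a run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # number of batches the dataloader yields per epoch
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # one optimiser update is applied per grad_accum_steps batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * epochs, effective_batch


steps, effective_batch = optimisation_steps(30_449, per_device_batch_size=1, grad_accum_steps=1)
print(f"effective batch size: {effective_batch}, optimisation steps: {steps}")
```

With a batch size of 1 and no accumulation, every example is its own optimiser update, which is why checkpoints saved every 100 steps correspond to only 100 training examples each.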

In [74]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="./checkpoints",
  group_by_length=True,
  per_device_train_batch_size=1,
  per_device_eval_batch_size=2,
  evaluation_strategy="steps",
  num_train_epochs=1,
  fp16 =True,
  gradient_checkpointing=True,
  gradient_accumulation_steps= 1,
  save_steps=100,
  eval_steps=99999,
  logging_steps=100,
  learning_rate=1e-4,
  weight_decay=0.005,
  warmup_steps=500,
  save_total_limit=5,
  report_to= "tensorboard",
  logging_dir='./logs',
)

PyTorch: setting up devices


In [75]:
model.to(device)

Wav2Vec2ForCTC(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureEncoder(
      (conv_layers): ModuleList(
        (0): Wav2Vec2GroupNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
          (activation): GELUActivation()
          (layer_norm): GroupNorm(512, 512, eps=1e-05, affine=True)
        )
        (1-4): 4 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
        (5-6): 2 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projection): Linear(in_features=512, out_features=1024, bias=True)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder

In [76]:
#loading in parts for the training portion of the wav2vec2-large-960h model
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_test_ds["train"],
    eval_dataset=train_test_ds["test"],    
    tokenizer=processor.feature_extractor,
)

Using auto half precision backend


In [77]:
#for logging purposes
import logging
from transformers import logging as hf_logging

logging.basicConfig(level=logging.INFO)
hf_logging.set_verbosity_info()

In [78]:
import os

# Tune PyTorch's CUDA caching allocator to resolve memory (fragmentation) issues.
# Options use the form "name:value" without spaces, and the variable should be set
# before CUDA is initialised for it to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:4082"

In [None]:
# using Tensorboard for a Hugging Face-like UI evaluation of the training log
%load_ext tensorboard
%tensorboard --logdir ./ref-log

In [79]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForCTC.forward` and have been ignored: input_length. If input_length are not expected by `Wav2Vec2ForCTC.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 30,449
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 30,449
  Number of trainable parameters = 311,261,344


Step,Training Loss,Validation Loss


Saving model checkpoint to ./checkpoints\checkpoint-100
Configuration saved in ./checkpoints\checkpoint-100\config.json
Model weights saved in ./checkpoints\checkpoint-100\pytorch_model.bin
Feature extractor saved in ./checkpoints\checkpoint-100\preprocessor_config.json
Saving model checkpoint to ./checkpoints\checkpoint-200
Configuration saved in ./checkpoints\checkpoint-200\config.json
Model weights saved in ./checkpoints\checkpoint-200\pytorch_model.bin
Feature extractor saved in ./checkpoints\checkpoint-200\preprocessor_config.json
Saving model checkpoint to ./checkpoints\checkpoint-300
Configuration saved in ./checkpoints\checkpoint-300\config.json
Model weights saved in ./checkpoints\checkpoint-300\pytorch_model.bin
Feature extractor saved in ./checkpoints\checkpoint-300\preprocessor_config.json
Saving model checkpoint to ./checkpoints\checkpoint-400
Configuration saved in ./checkpoints\checkpoint-400\config.json
Model weights saved in ./checkpoints\checkpoint-400\pytorch_model.b

KeyboardInterrupt: 

## Observations for Model
As seen in the train/loss graph, the training loss for the `wav2vec2-large-960h-cv` fine-tuning attempts settled at a local minimum of around 3.05. Peaks were observed between steps 6,000 and 8,000, but the loss returned to the minimum range shortly after 8,000 steps. It then remained relatively constant around that region, and training was stopped at 16,300 steps after observing that no further reduction would occur.
This is consistent with the `wav2vec2-large-960h` documentation on Facebook's model card, where "(w)hen lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data". 
We will now move on to model evaluation.

In [80]:
trainer.push_to_hub()

Saving model checkpoint to ./checkpoints
Configuration saved in ./checkpoints\config.json
Model weights saved in ./checkpoints\pytorch_model.bin
Feature extractor saved in ./checkpoints\preprocessor_config.json
Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Automatic Speech Recognition', 'type': 'automatic-speech-recognition'}, 'dataset': {'name': 'common_voice', 'type': 'common_voice', 'config': 'cv-valid-train', 'split': 'train', 'args': 'cv-valid-train'}}


'https://huggingface.co/ClarenceTKX/checkpoints/tree/main/'

The model was pushed to the Hugging Face Hub so that it can be loaded easily for inference. 

## Evaluation
After obtaining the fine-tuned model, a local WER evaluation was run on the test split held out from Common Voice's `cv-valid-train` (see the training parameters above for more details).

In [83]:
#using same processor as above, but with fine-tuned model 
# processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
model = Wav2Vec2ForCTC.from_pretrained("ClarenceTKX/wav2vec2-large-960h-cv").cuda()

loading configuration file config.json from cache at C:\Users\Clarence/.cache\huggingface\hub\models--ClarenceTKX--checkpoints\snapshots\f66639ab6371fc55a63e3abfa4b718beda7b6151\config.json
Model config Wav2Vec2Config {
  "_name_or_path": "facebook/wav2vec2-large-960h",
  "activation_dropout": 0.1,
  "adapter_attn_dim": null,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForCTC"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 256,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": false,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": fals

loading weights file model.safetensors from cache at C:\Users\Clarence/.cache\huggingface\hub\models--ClarenceTKX--checkpoints\snapshots\f66639ab6371fc55a63e3abfa4b718beda7b6151\model.safetensors
All model checkpoint weights were used when initializing Wav2Vec2ForCTC.

All the weights of Wav2Vec2ForCTC were initialized from the model checkpoint at ClarenceTKX/checkpoints.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Wav2Vec2ForCTC for predictions without further training.


In [84]:
#evaluation using fine-tuned model with map function for tracking
def map_to_result(batch):
  with torch.no_grad():
    input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
    logits = model(input_values).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_str"] = processor.batch_decode(pred_ids)[0]
  batch["text"] = processor.decode(batch["labels"], group_tokens=False)

  return batch

In [None]:
results = train_test_ds["test"].map(map_to_result, remove_columns=train_test_ds["test"].column_names)
#approximate runtime: 4H55M (widget decays after deactivating jupyter)

In [91]:
print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["text"])))

Test WER: 1.000


This preliminary WER of 1.000 (as many word errors as there are reference words) looks rather suspicious, but in the interest of time, no further fine-tuning attempts were made after checkpoint 16,300. 


## Applying wav2vec2-large-960h-cv to cv-valid-dev
To apply the fine-tuned model `wav2vec2-large-960h-cv`, the audio files listed in the `cv-valid-dev` CSV file are transcribed and decoded at inference time.  

In [4]:
import pandas as pd
from pathlib import Path
import configparser
import os
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load Wav2Vec2 model and processor
MODEL_ID = "ClarenceTKX/wav2vec2-large-960h-cv"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

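df = pd.read_csv("C:/Users/Clarence/Desktop/GitHub/technical-test/asr/updated_cv-valid-dev.csv")
df['finetuned_text'] = ''

# base_file_path is used below but was never defined in this cell.
# The folder here is an assumption: point it at the directory that the CSV's
# `filename` values are relative to.
base_file_path = Path("C:/Users/Clarence/Downloads/common_voice")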

# Function to transcribe an audio file
def transcribe_audio(file_path):
    speech_array, _ = librosa.load(file_path, sr=16_000)
    inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        # the attention mask is only present if the feature extractor is configured to return one
        logits = model(inputs.input_values, attention_mask=inputs.get("attention_mask")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]
    return transcription

# Transcribe each audio file and update the DataFrame
for index, row in df.iterrows():
    audio_file_full = base_file_path / row['filename']
    if audio_file_full.is_file():
        try:
            transcription = transcribe_audio(audio_file_full)
            df.at[index, 'finetuned_text'] = transcription
            print(transcription)
        except Exception as e:
            print(f"Error processing file {audio_file_full}: {e}")


a large portion of the cylinder has been uncovered
the turf and gravel around it seem charred as if by a sudden explosion
they say it's fake
the boy thought of fatima
i wasn't born yesterday
just by looking at them
duke i'm your sister
folk is not my favorite music genre
nobody knew who melcolm bass
most materoids are more or less rounded
i want them both arrested
we miss you and miss having a friend like you and i am so happy that you too got to catch up
the tribal chieftain called for the boy hand presented him with fifty pieces of gold
half an hour later his shovel hit something solid
i have already described the appearance of that colossal book which was embedded in the ground
i thought you were going to teach me some of the things you knew
everyone loged them and enjoyed them
some went away while i was there and other people came
wearing his new sandals he descended the stairs silently
these are not camels
he heard a muffled grating sound and saw the black mark cerk forward an int

In [5]:
# Save the updated DataFrame
df.to_csv('C:/Users/Clarence/Desktop/GitHub/technical-test/asr-train/updated_v2_cv-valid-dev.csv', index=False)

## Evaluation on cv-valid-dev
Using the code below and the DataFrame generated above, we can now compare the overall performance of the base model (`generated_text`) and the fine-tuned checkpoint (`finetuned_text`). 

In [14]:
import pandas as pd
from jiwer import wer

# Function to calculate WER
def calculate_wer(reference, hypothesis):
    return wer(reference, hypothesis)

# Read the CSV file
# df = pd.read_csv('C:/Users/Clarence/Desktop/GitHub/technical-test/asr-train/updated_v2_cv-valid-dev.csv') 

#convert text to lowercase first
df['generated_text'] = df['generated_text'].str.lower()
df['finetuned_text'] = df['finetuned_text'].str.lower()
df.fillna('', inplace=True)

# Calculate WER for generated_text vs text
wer_generated_text = sum(calculate_wer(ref, hyp) for ref, hyp in zip(df['text'], df['generated_text'])) / len(df)

# Calculate WER for finetuned_text vs text
wer_finetuned_text = sum(calculate_wer(ref, hyp) for ref, hyp in zip(df['text'], df['finetuned_text'])) / len(df)

# Print the results
print(f"WER for generated_text vs text: {wer_generated_text:.2f}")
print(f"WER for finetuned_text vs text: {wer_finetuned_text:.2f}")


WER for generated_text vs text: 0.12
WER for finetuned_text vs text: 0.09
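Note that the cell above averages the WER of each utterance, which weights a one-word clip the same as a long sentence. A corpus-level WER instead divides the total number of word errors by the total number of reference words. The sketch below contrasts the two on toy data, using a plain word-level edit distance instead of `jiwer` so that it stays self-contained. The helper names and toy sentences are hypothetical, and the figures reported above are not recomputed here.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1], len(ref)


def mean_utterance_wer(refs, hyps):
    """Average of per-utterance WER values (the aggregation used in the cell above)."""
    pairs = [word_errors(r, h) for r, h in zip(refs, hyps)]
    # max(n, 1) guards against empty references in this sketch
    return sum(errors / max(n, 1) for errors, n in pairs) / len(pairs)


def corpus_wer(refs, hyps):
    """Total word errors divided by total reference words."""
    pairs = [word_errors(r, h) for r, h in zip(refs, hyps)]
    return sum(errors for errors, _ in pairs) / sum(n for _, n in pairs)


toy_refs = ["the boy thought of fatima", "yes"]
toy_hyps = ["the boy thought of fatima", "no"]
print(f"mean per-utterance WER: {mean_utterance_wer(toy_refs, toy_hyps):.3f}")
print(f"corpus-level WER:       {corpus_wer(toy_refs, toy_hyps):.3f}")
```

On this toy pair the single wrong one-word utterance gives a mean per-utterance WER of 0.5 but a corpus-level WER of about 0.167, so the two aggregations can differ noticeably when clip lengths vary.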
