# Evaluation Notebook

## Objective
The main objective of this notebook is to evaluate the accuracy and performance of the Wav2Vec2-XLSR model trained for speech recognition.

![Evaluation image](https://unctad.org/sites/default/files/inline-images/about-evaluation_600x424.jpg)


# Loading Libraries

In [1]:
%%capture
%pip install jiwer accelerate datasets huggingface_hub transformers

In [2]:
import re
import unicodedata

import pandas as pd
import soundfile as sf
import torch
import torchaudio
from datasets import Dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor



## Dataset
The evaluation is conducted on the dataset used during the training process, which can be found [here](https://huggingface.co/datasets/SakshiRathi77/ASR_CV15_Hindi_wav_16000).

In [3]:
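df = pd.read_csv("/kaggle/input/cv15-hindi/hi/hi/train.tsv", sep='\t', header=0)
# Net votes per clip: keep only clips with at least two more up-votes than down-votes
df["votes"] = df["up_votes"] - df["down_votes"]
df = df[df["votes"] >= 2].copy()
# regex=False treats ".mp3" literally, so the dot cannot match an arbitrary character
df["path"] = df["path"].str.replace(".mp3", ".wav", regex=False)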
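
The filtering step above can be illustrated without pandas. The following is a minimal sketch, assuming made-up rows in place of the real TSV entries; the helper name `keep_and_rename` and the `min_net_votes` parameter are hypothetical. Unlike the string replacement in the cell, it uses `os.path.splitext`, so only the final extension is changed.

```python
import os

# Hypothetical rows standing in for Common Voice TSV entries (not real data).
rows = [
    {"path": "clip_a.mp3", "up_votes": 3, "down_votes": 0},
    {"path": "clip_b.mp3", "up_votes": 2, "down_votes": 1},
    {"path": "my.mp3_take2.mp3", "up_votes": 4, "down_votes": 2},
]


def keep_and_rename(rows, min_net_votes=2):
    """Keep rows whose net votes reach the threshold and swap .mp3 for .wav."""
    kept = []
    for row in rows:
        net = row["up_votes"] - row["down_votes"]
        if net < min_net_votes:
            continue  # too few net up-votes, drop the clip
        stem, ext = os.path.splitext(row["path"])
        new_path = stem + ".wav" if ext == ".mp3" else row["path"]
        kept.append({**row, "votes": net, "path": new_path})
    return kept


filtered = keep_and_rename(rows)
print([r["path"] for r in filtered])
```

The second row is dropped because its net vote count is 1, and the third row shows that an `.mp3` substring in the middle of a file name is left alone.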

## Model Information
The evaluation utilizes the Wav2Vec2-XLSR model, which has been trained on the provided dataset. The details of the training process can be found in the [training notebook](https://www.kaggle.com/code/sakshirathi77/wav2vec2-xlsr-kagglex).

In [4]:
from sklearn.model_selection import train_test_split

# Build absolute paths to the 16 kHz WAV files
df["path"] = "/kaggle/input/cv15-hindi/audio_wav_16000/tmp/CV15_ASR_dataset/audio_wav_16000/" + df["path"]
# Rename the transcript column to "sentence", the name used by the preprocessing step
df.rename(columns={'transcription': 'sentence'}, inplace=True)
# A fixed seed keeps the 90/10 split reproducible
train, test = train_test_split(df, test_size=0.1, random_state=42)
common_voice_test = Dataset.from_pandas(test)

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("SakshiRathi77/wav2vec2-large-xlsr-300m-hi-kagglex")
model = Wav2Vec2ForCTC.from_pretrained("SakshiRathi77/wav2vec2-large-xlsr-300m-hi-kagglex")
model.to("cuda")

# Resampler for 48 kHz sources; it is not applied in the cells below, which read the WAV files as-is
resampler = torchaudio.transforms.Resample(48_000, 16_000)


# Preprocessing Data

In [5]:
def speech_file_to_array_fn(batch):
    # Punctuation and symbols to strip from the reference transcripts
    chars_to_ignore_regex = r'[,?.!\-;:"“%‘”�’\'|&–]'
    # Latin letters are removed so that only the Hindi text remains
    remove_en = r'[A-Za-z]'
    sentence = batch["sentence"].lower()
    sentence = re.sub(chars_to_ignore_regex, "", sentence)
    sentence = re.sub(remove_en, "", sentence)
    batch["sentence"] = unicodedata.normalize("NFKC", sentence)

    # Load the waveform; the sampling rate is not used further in this notebook
    speech_array, sampling_rate = sf.read(batch["path"])
    batch["speech"] = speech_array
    return batch

common_voice_test = common_voice_test.map(speech_file_to_array_fn)

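
To make the effect of the text cleaning concrete, here is a minimal sketch that applies the same three steps (punctuation removal, Latin-letter removal, NFKC normalization) to a single string. The function name `normalize_sentence` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re
import unicodedata

# Character classes mirroring the notebook cell, written as raw strings.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%‘”�’\'|&–]'
LATIN_LETTERS = r'[A-Za-z]'


def normalize_sentence(text):
    """Lowercase, strip punctuation and Latin letters, then apply NFKC."""
    text = text.lower()
    text = re.sub(CHARS_TO_IGNORE, "", text)
    text = re.sub(LATIN_LETTERS, "", text)
    return unicodedata.normalize("NFKC", text)


# Made-up sample sentence mixing Hindi, punctuation, and Latin letters.
sample = "नमस्ते, दुनिया! OK?"
cleaned = normalize_sentence(sample)
print(repr(cleaned))
print(cleaned.split())
```

Note that removing characters can leave stray spaces behind (here, after the deleted Latin word), so comparing on whitespace-split tokens is the safer way to inspect the result.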

# Results
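
The evaluation cell in this section decodes greedily: it takes the most likely token for every audio frame (`torch.argmax`) and lets the processor turn the frame-level ids into text. The sketch below shows the CTC collapsing rule that this kind of decoding relies on, assuming a toy vocabulary; `ctc_greedy_decode`, `vocab`, and the frame sequences are hypothetical and do not come from the model's actual `vocab.json`.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse runs of identical frame labels, then drop the CTC blank."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "c", 2: "a", 3: "t", 4: "o"}

# Repeated labels merge into one character, blanks disappear.
print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 0, 3, 3], vocab))
# A blank between two identical labels keeps them as two characters.
print(ctc_greedy_decode([4, 0, 4], vocab))
# Without the blank, the same two labels collapse into one.
print(ctc_greedy_decode([4, 4], vocab))
```

The order matters: repeats are merged first and blanks are removed second, which is what allows a double letter to survive when a blank separates its two frames.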

In [6]:
def evaluate(batch):
    # Pad each batch to its longest clip and return PyTorch tensors
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda")).logits

    # Greedy decoding: most likely token per frame, then ids to text
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
    return batch

result = common_voice_test.map(evaluate, batched=True, batch_size=8)

print("WER: {:.6f}".format(wer.compute(predictions=result["pred_strings"], references=result["sentence"])))


WER: 0.314115
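
For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, aggregated over the whole corpus; it assumes whitespace tokenization and at least one non-empty reference, and the function name `word_error_rate` and the toy sentences are illustrative. It is not the implementation behind the `wer` metric used in the notebook.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping one DP row at a time
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # match or substitution
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# Toy example: one substitution plus one deletion over five reference words.
refs = ["the cat sat", "hello world"]
hyps = ["the cat sit", "hello"]
print(word_error_rate(refs, hyps))
```

Read this way, the reported value of 0.314115 corresponds to roughly 31 word errors per 100 reference words.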


### Conclusion

On the 10% test split created in this notebook (416 utterances), the fine-tuned Wav2Vec2-XLSR model reaches a Word Error Rate (WER) of 0.314115 with greedy decoding and no external language model. This is a promising result: it corresponds to roughly 31 word-level errors (substitutions, deletions, and insertions) per 100 reference words.

A WER at this level shows that the model transcribes the majority of words in the evaluated Hindi speech correctly, which suggests it could be useful in applications that can tolerate or post-correct some transcription errors. The figure should be read as a measure of performance on this particular split rather than as a general guarantee across speakers, recording conditions, or domains.

Further improvements can be explored, including alternative fine-tuning strategies, additional data augmentation, and decoding with a language model. These could potentially bring the error rate down and make the model more robust.
