# Evaluation Notebook

## Objective
This notebook evaluates the accuracy of the Wav2Vec2-XLSR model trained for speech recognition, measured by word error rate (WER) and character error rate (CER).


In [8]:
%%capture
%pip install jiwer accelerate datasets huggingface_hub transformers

In [3]:
import torch
import torchaudio
from datasets import load_dataset, load_metric, Dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import re
import unicodedata
import pandas as pd
import soundfile as sf

## Dataset
The evaluation uses the same dataset as training, [ASR_CV15_Hindi_wav_16000](https://huggingface.co/datasets/SakshiRathi77/ASR_CV15_Hindi_wav_16000). Only clips with a net vote count of at least 2 are kept, and 10% of them are held out as the test split.

In [4]:
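df = pd.read_csv("/kaggle/input/cv15-hindi/hi/hi/train.tsv", sep="\t", header=0)
# Keep only clips whose net vote count (up votes minus down votes) is at least 2.
df["votes"] = df["up_votes"] - df["down_votes"]
df = df[df["votes"] >= 2].copy()
# The audio was converted to WAV; regex=False treats "." as a literal dot.
df["path"] = df["path"].str.replace(".mp3", ".wav", regex=False)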

## Model Information
The evaluation uses the Wav2Vec2-XLSR model trained on the dataset above. The training process is described in the [training notebook](https://www.kaggle.com/code/sakshirathi77/wav2vec2-xlsr-kagglex).

In [5]:
from sklearn.model_selection import train_test_split

# Turn file names into absolute paths to the 16 kHz WAV files.
df["path"] = "/kaggle/input/cv15-hindi/audio_wav_16000/tmp/CV15_ASR_dataset/audio_wav_16000/" + df["path"]
df.rename(columns={"transcription": "sentence"}, inplace=True)
# Hold out 10% of the filtered clips for evaluation; the fixed seed keeps the split reproducible.
train, test = train_test_split(df, test_size=0.1, random_state=42)
common_voice_test = Dataset.from_pandas(test)
wer = load_metric("wer")
cer = load_metric("cer")

processor = WhisperProcessor.from_pretrained("SakshiRathi77/Fine-tune-Whisper-Kagglex")
# Load the model and move it to the GPU once.
model = WhisperForConditionalGeneration.from_pretrained("SakshiRathi77/Fine-tune-Whisper-Kagglex").to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

In [6]:
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = sf.read(batch["path"])
    batch["speech"] = speech_array
    return batch

common_voice_test = common_voice_test.map(speech_file_to_array_fn)



## Evaluation Metrics
The following metrics are used for evaluating the performance of the model:
- Word Error Rate (WER)
- Character Error Rate (CER)
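
Both metrics follow the same idea: count the minimum number of substitutions, insertions, and deletions needed to turn the prediction into the reference, then divide by the reference length (in words for WER, in characters for CER). The sketch below is a minimal, dependency-free illustration of that definition, not the implementation the notebook uses; the notebook relies on the `wer` and `cer` metrics loaded earlier. The names `edit_distance` and `error_rate` are hypothetical helpers, and the character-level variant assumes every Unicode code point (including spaces and combining marks) counts as one character.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate: total edits divided by total reference length."""
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumption: words are whitespace-separated; characters are code points.
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_len += len(ref_units)
    if total_len == 0:
        raise ValueError("references must contain at least one unit")
    return total_edits / total_len


# Illustrative sentences only; these are not taken from the dataset.
refs = ["the cat sat"]
hyps = ["the cat sit"]
print("WER:", error_rate(refs, hyps, unit="word"))
print("CER:", error_rate(refs, hyps, unit="char"))
```

Because a single wrong character makes the whole word count as an error, WER is usually higher than CER on the same predictions, which is why reporting both gives a fuller picture.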