# Evaluation Notebook

## Objective
This notebook evaluates the transcription accuracy of the Whisper model trained for Hindi speech recognition, measured by Word Error Rate (WER) and Character Error Rate (CER).


## Loading Libraries

In [8]:
%%capture
%pip install jiwer accelerate datasets huggingface_hub transformers

In [3]:
import torch
import torchaudio
from datasets import load_dataset, load_metric, Dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import re
import unicodedata
import pandas as pd
import soundfile as sf

## Dataset
The evaluation is conducted on the dataset used during the training process, which can be found [here](https://huggingface.co/datasets/SakshiRathi77/ASR_CV15_Hindi_wav_16000).

In [4]:
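# Load the Common Voice 15 Hindi metadata (tab-separated).
df = pd.read_csv("/kaggle/input/cv15-hindi/hi/hi/train.tsv", sep="\t", header=0)
# Keep clips whose net votes (up minus down) are at least 2.
df["votes"] = df["up_votes"] - df["down_votes"]
df = df[df["votes"] >= 2].copy()
# Point at the converted WAV files; regex=False treats ".mp3" as a literal string.
df["path"] = df["path"].str.replace(".mp3", ".wav", regex=False)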
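The `.mp3` to `.wav` rename above behaves differently depending on whether the pattern is treated as a regular expression or as a literal string, and the pandas default for `str.replace` has changed between versions. The following is a minimal standard-library sketch of the difference; the filenames are hypothetical and chosen only to expose the pitfall.

```python
import re

# Hypothetical filenames, for illustration only.
names = ["common_voice_hi_0001.mp3", "clip_xmp3_take2.mp3"]

# Regex semantics: "." matches any character, so "xmp3" is rewritten too.
as_regex = [re.sub(".mp3", ".wav", n) for n in names]

# Literal semantics: only the exact substring ".mp3" is replaced.
as_literal = [n.replace(".mp3", ".wav") for n in names]

# Escaped and anchored regex: only a trailing ".mp3" extension is replaced.
as_suffix = [re.sub(r"\.mp3$", ".wav", n) for n in names]

print(as_regex)
print(as_literal)
print(as_suffix)
```

For real Common Voice filenames the three variants will usually agree, but the literal or anchored form avoids surprises if a clip name happens to contain `mp3` elsewhere.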

## Model Information
The evaluation uses the Whisper checkpoint `kingabzpro/whisper-small-hi-cv`, loaded in the code below. The linked [training notebook](https://www.kaggle.com/code/sakshirathi77/wav2vec2-xlsr-kagglex) documents the Wav2Vec2-XLSR training run on the provided dataset.

In [6]:
from sklearn.model_selection import train_test_split
df["path"] = "/kaggle/input/cv15-hindi/audio_wav_16000/tmp/CV15_ASR_dataset/audio_wav_16000/"+df["path"]
df.rename(columns = {'transcription':'sentence'}, inplace = True)
train,test = train_test_split(df, test_size=0.1, random_state=42)
common_voice_test = Dataset.from_pandas(test)

wer = load_metric("wer")
cer = load_metric("cer")

processor = WhisperProcessor.from_pretrained("kingabzpro/whisper-small-hi-cv")
model = WhisperForConditionalGeneration.from_pretrained("kingabzpro/whisper-small-hi-cv").to("cuda")

def speech_file_to_array_fn(batch):
    # Read the WAV file (already converted to 16 kHz) into a NumPy array.
    speech_array, sampling_rate = sf.read(batch["path"])
    batch["speech"] = speech_array
    return batch

common_voice_test = common_voice_test.map(speech_file_to_array_fn)

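def map_to_pred(batch):
    # Convert the raw waveform into log-Mel input features for Whisper.
    input_features = processor(batch["speech"], sampling_rate=16000, return_tensors="pt").input_features
    # Normalize the reference with the same normalizer that is applied to the prediction.
    batch["reference"] = processor.tokenizer._normalize(batch["sentence"])

    # Generate token IDs without tracking gradients, then decode them to text.
    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch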

result = common_voice_test.map(map_to_pred)

print("WER: {:.6f}".format(wer.compute(predictions=result["prediction"], references=result["reference"])))
print("CER: {:.6f}".format(cer.compute(predictions=result["prediction"], references=result["reference"])))

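Both the reference and the prediction are normalized before scoring so that differences in punctuation and spacing are not counted as errors. The sketch below illustrates that idea using only `re` and `unicodedata`. The function `simple_normalize` is hypothetical and is not the Whisper tokenizer's normalizer, so scores computed with it may differ from those reported in this notebook.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Hypothetical normalizer: NFKC, lowercase, drop punctuation, squeeze spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace punctuation (Unicode categories starting with "P") with spaces.
    # Letters and combining marks are kept, since Devanagari vowel signs
    # and the virama are combining marks.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(simple_normalize("Hello,   World!"))
print(simple_normalize("नमस्ते, दुनिया।"))
```

Whichever normalizer is used, applying the same one to both sides keeps the comparison fair.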

## Evaluation Metrics
The following metrics are used for evaluating the performance of the model:
- Word Error Rate (WER): 0.139913
- Character Error Rate (CER): 0.058844
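Both metrics are edit distances divided by the length of the reference: WER counts word-level substitutions, insertions, and deletions, and CER does the same at the character level. The following is a minimal pure-Python sketch for a single reference and hypothesis pair. The helper names and the example sentences are illustrative assumptions, and this is not the implementation behind `load_metric`.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character-level edits divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative sentences, not taken from the dataset.
ref = "the cat sat on the mat"
hyp = "the cat sit on mat"
print(f"WER: {wer(ref, hyp):.6f}")
print(f"CER: {cer(ref, hyp):.6f}")
```

Read this way, a WER of 0.139913 corresponds to roughly 14 word-level edits per 100 reference words, and a CER of 0.058844 to roughly 6 character-level edits per 100 reference characters.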


On the evaluated test split, the Whisper model reached a Word Error Rate (WER) of 0.139913 and a Character Error Rate (CER) of 0.058844. In other words, about 14% of reference words and about 6% of reference characters were transcribed incorrectly.

These values indicate accurate and consistent transcription on the evaluated dataset, and they suggest the model could be useful in real-world settings that require precise speech-to-text transcription.

There is still room for improvement. Possible directions include refining the fine-tuning procedure, exploring additional data augmentation methods, and integrating language modeling approaches, any of which could further improve accuracy and robustness.