# Task 4 Model Evaluation
In this notebook, we compare the base Wav2Vec2 model with our fine-tuned model, using word error rate (WER) and character error rate (CER) as the evaluation metrics.
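
For intuition about what these two metrics measure, here is a minimal pure-Python sketch. It assumes the standard definition (edit distance divided by the number of reference tokens, aggregated over the whole corpus). The helper names `edit_distance`, `error_rate`, `simple_wer`, and `simple_cer` are hypothetical and are not part of `jiwer`; `jiwer` applies its own default text transforms, so its values may differ slightly from this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra token in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, tokenize):
    """Corpus-level error rate: total edits / total reference tokens."""
    total_edits = 0
    total_tokens = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        total_edits += edit_distance(ref_tokens, hyp_tokens)
        total_tokens += len(ref_tokens)
    return total_edits / total_tokens


def simple_wer(references, hypotheses):
    # Tokens are whitespace-separated words.
    return error_rate(references, hypotheses, str.split)


def simple_cer(references, hypotheses):
    # Tokens are individual characters, spaces included (an assumption of this sketch).
    return error_rate(references, hypotheses, list)


# Toy data, made up for illustration only.
refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello word"]
example_wer = simple_wer(refs, hyps)  # 2 word errors over 8 reference words
example_cer = simple_cer(refs, hyps)  # 5 character errors over 33 reference characters
```

Because errors are summed before dividing, long utterances weigh more than short ones; averaging per-utterance rates would generally give a different number.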

In [2]:
import pandas as pd
from jiwer import wer, cer

Transcribe the evaluation dataset using the fine-tuned model. The base-model transcription was already performed in Task 2.

In [2]:
# Use cv-decode.py to transcribe the audio using the fine-tuned model
!python ../asr/cv-decode.py ../data/common_voice/cv-valid-dev.csv ../data/common_voice/cv-valid-dev/ ../task-4/transcribed_data-ft.csv

Transcriptions have been added to the CSV file.


In [5]:
# Load the CSV files for both models
base_model_results_path = '../asr/transcribed_data.csv'
finetuned_model_results_path = 'transcribed_data-ft.csv'

base_data = pd.read_csv(base_model_results_path)
finetuned_data = pd.read_csv(finetuned_model_results_path)

# Extract the ground truth (text) and generated transcriptions
base_actual_labels = base_data['text']
base_generated_text = base_data['generated_text']

finetuned_actual_labels = finetuned_data['text']
finetuned_generated_text = finetuned_data['generated_text']

# Convert elements to strings (missing values become the literal string 'nan')
base_actual_labels = [str(label) for label in base_actual_labels]
base_generated_text = [str(label) for label in base_generated_text]

finetuned_actual_labels = [str(label) for label in finetuned_actual_labels]
finetuned_generated_text = [str(label) for label in finetuned_generated_text]

# Convert both actual labels and generated text to lowercase
base_actual_labels = [label.lower() for label in base_actual_labels]
base_generated_text = [text.lower() for text in base_generated_text]

finetuned_actual_labels = [label.lower() for label in finetuned_actual_labels]
finetuned_generated_text = [text.lower() for text in finetuned_generated_text]

# Compute WER and CER (reference first, hypothesis second)
base_wer = wer(base_actual_labels, base_generated_text)
base_cer = cer(base_actual_labels, base_generated_text)

finetuned_wer = wer(finetuned_actual_labels, finetuned_generated_text)
finetuned_cer = cer(finetuned_actual_labels, finetuned_generated_text)

# Prepare the comparison report
comparison_report = pd.DataFrame({
    "Metric": ["Word Error Rate (WER)", "Character Error Rate (CER)"],
    "Base Model": [base_wer, base_cer],
    "Fine-Tuned Model": [finetuned_wer, finetuned_cer],
    "Improvement": [base_wer - finetuned_wer, base_cer - finetuned_cer]
})

comparison_report

| Metric | Base Model | Fine-Tuned Model | Improvement |
| --- | --- | --- | --- |
| Word Error Rate (WER) | 0.10812 | 0.078751 | 0.029369 |
| Character Error Rate (CER) | 0.045444 | 0.034017 | 0.011427 |


### Results

#### Word Error Rate (WER)
Fine-tuning reduced the WER by 0.0294 in absolute terms (2.94 percentage points), from 0.1081 to 0.0788. This is a substantial improvement, given that the base model was already performing quite well. Relative to the base model, it is a 27.2% reduction in WER ((0.108120 - 0.078751) / 0.108120).

#### Character Error Rate (CER)
Fine-tuning also reduced the CER by 0.011427 in absolute terms (1.14 percentage points). This is a substantial improvement, given that the base model already had a very low CER of 0.045444. Relative to the base model, it is a 25.1% reduction in CER ((0.045444 - 0.034017) / 0.045444).
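
The absolute and relative figures quoted in this section can be reproduced with a few lines of arithmetic. This is a small sketch using the values from the comparison table; the helper name `error_reduction` is hypothetical and introduced only for illustration.

```python
def error_reduction(base, finetuned):
    """Return (absolute, relative) reduction of an error rate."""
    absolute = base - finetuned
    relative = absolute / base
    return absolute, relative


# Values copied from the comparison table.
wer_abs, wer_rel = error_reduction(0.108120, 0.078751)
cer_abs, cer_rel = error_reduction(0.045444, 0.034017)

print(f"WER: {wer_abs * 100:.2f} points absolute, {wer_rel:.1%} relative")
print(f"CER: {cer_abs * 100:.2f} points absolute, {cer_rel:.1%} relative")
```

Keeping the two numbers separate avoids ambiguity: the absolute reduction is measured in percentage points of error rate, while the relative reduction is a fraction of the base model's error.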