## Interactive Model Comparison: Fine-Tuned vs. Pretrained Whisper on ATC Data

### Notebook Overview

This notebook lets you interactively explore how a `Whisper medium.en` model fine-tuned on air traffic control (ATC) data performs, and compare it with the base pretrained `Whisper medium.en` model. For each sample, you can listen to the audio, read the ground-truth transcription alongside each model's prediction, and see the Word Error Rate (WER) for both models.

We’ll look at:
- Samples with the worst WER for both the fine-tuned and pretrained models, giving insight into where each model struggles.
- A way to explore specific samples of your choice, comparing the models' performance on the same data.
- Randomly selected samples from the fine-tuned model to observe its broader performance on ATC data.

The goal of this notebook is to offer a hands-on and interactive approach to model evaluation, allowing you to hear the audio and compare transcription accuracy between models.

---


### Setup: Import Libraries, Load Dataset, and Evaluation Results

First, we import the necessary libraries and load the ATC dataset. Then, we read in the evaluation results from CSV files for both the fine-tuned and pretrained Whisper models. These results will allow us to compare model performance and analyze the Word Error Rate (WER) for each sample.

In [2]:
import pandas as pd
import numpy as np
import IPython.display as ipd
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("jacktol/atc-dataset")

# Load CSV files
csv_file_fine_tuned = "whisper-medium.en-fine-tuned-for-ATC-15.08-WER-evaluation-data.csv"
csv_file_pretrained = "whisper-medium.en-94.59-WER-evaluation-data.csv"

df_fine_tuned = pd.read_csv(csv_file_fine_tuned)
df_pretrained = pd.read_csv(csv_file_pretrained)

### Top N Samples with the Worst WER (Fine-Tuned Model)

Below, we will look at the top N samples where the fine-tuned model struggled the most. This will help identify the cases where the model's predictions deviated significantly from the ground truth, as measured by WER. I've selected the top 2 samples, but feel free to adjust the number.

In [3]:
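def display_top_n_worst_wer_fine_tuned(n):
    """Print the n fine-tuned results with the highest WER and play their audio."""
    try:
        # Sort by WER, highest first, and keep the n worst rows.
        top_n_worst_wer = df_fine_tuned.sort_values(by='WER', ascending=False).head(n)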
        
        for idx, sample in top_n_worst_wer.iterrows():
            sample_number = int(sample['Sample'])
            ground_truth = sample['Ground Truth']
            prediction = sample['Prediction']
            wer = sample['WER']

            try:
                dataset_sample = dataset['test'][sample_number - 1]
                audio_array = np.array(dataset_sample['audio']['array'])
                audio_sr = dataset_sample['audio']['sampling_rate']

                print(f"Sample {sample_number}:")
                print(f"WER: {wer}")
                print(f"Ground Truth: {ground_truth}")
                print(f"Prediction: {prediction}")

                display(ipd.Audio(data=audio_array, rate=audio_sr))
                print("-" * 50)

            except IndexError:
                print(f"Audio for sample {sample_number} not found in the dataset.")
                print("-" * 50)
    
    except Exception as e:
        print(f"An error occurred: {e}")

display_top_n_worst_wer_fine_tuned(2)

Sample 201:
WER: 166.67
Ground Truth: jetstar nine twelve
Prediction: air france one two four


--------------------------------------------------
Sample 1497:
WER: 150.0
Ground Truth: thank you
Prediction: six kilo papa thank you


--------------------------------------------------


### Top N Samples with the Worst WER (Pretrained Model)

Similar to the previous cell, this one lets us listen to the samples where the pretrained model performed poorly. By comparing these results to the fine-tuned model's outputs, we can observe the improvements achieved through fine-tuning.


In [4]:
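def display_top_n_worst_wer_pretrained(n):
    """Print the n pretrained results with the highest WER and play their audio."""
    try:
        # Sort by WER, highest first, and keep the n worst rows.
        top_n_worst_wer = df_pretrained.sort_values(by='WER', ascending=False).head(n)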
        
        for idx, sample in top_n_worst_wer.iterrows():
            sample_number = int(sample['Sample'])
            ground_truth = sample['Ground Truth']
            prediction = sample['Prediction']
            wer = sample['WER']

            try:
                dataset_sample = dataset['test'][sample_number - 1]
                audio_array = np.array(dataset_sample['audio']['array'])
                audio_sr = dataset_sample['audio']['sampling_rate']

                print(f"Sample {sample_number}:")
                print(f"WER: {wer}")
                print(f"Ground Truth: {ground_truth}")
                print(f"Prediction: {prediction}")

                display(ipd.Audio(data=audio_array, rate=audio_sr))
                print("-" * 50)

            except IndexError:
                print(f"Audio for sample {sample_number} not found in the dataset.")
                print("-" * 50)
    
    except Exception as e:
        print(f"An error occurred: {e}")

display_top_n_worst_wer_pretrained(2)


Sample 2357:
WER: 14900.0
Ground Truth: kilo
Prediction: keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto


--------------------------------------------------
Sample 2337:
WER: 8362.5
Ground Truth: ok three three zero sky travel one two
Prediction: okay tango hotel romeo echo echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar november echo tango whiskey oscar oscar 

--------------------------------------------------


### Compare Specific Samples

You might notice that the worst-WER samples for the fine-tuned and pretrained models aren't the same. This makes sense: the two models struggle with different inputs, and to different degrees. WER counts substituted, deleted, and inserted words relative to the number of words in the ground truth, so a prediction that inserts many extra words can score well above 100, as in the repeated-word outputs from the pretrained model above.
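The notebook does not show how the WER values in the CSV files were computed, so the following is only a minimal sketch of the standard word-level definition (edit distance divided by the number of ground-truth words, as a percentage). The function name `word_error_rate` is hypothetical, and any text normalization applied during the original evaluation is not reproduced here. Under those assumptions, it matches the values printed above for Samples 201 and 1497.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    return round(100 * prev[-1] / len(ref), 2)


print(word_error_rate("jetstar nine twelve", "air france one two four"))  # 166.67
print(word_error_rate("thank you", "six kilo papa thank you"))            # 150.0
```

Because the denominator is the ground-truth length, three inserted words against a two-word reference already give a WER of 150.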

If you're curious how both models performed on specific samples, pass a sample number, a list of numbers, or a comma-separated string of numbers to the function in the cell below. It displays both models' predictions and WER, along with the audio clip and ground truth for each sample.


In [7]:
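def display_samples_by_number(sample_numbers):
    """Show both models' results and the audio for the given sample number(s)."""
    try:
        # Accept a single int, a list of ints, or a comma-separated string.
        if isinstance(sample_numbers, int):
            sample_numbers = [sample_numbers]
        
        if isinstance(sample_numbers, str):
            sample_numbers = [int(num.strip()) for num in sample_numbers.split(',')]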
        
        for sample_number in sample_numbers:
            fine_tuned_row = df_fine_tuned[df_fine_tuned['Sample'] == sample_number]
            pretrained_row = df_pretrained[df_pretrained['Sample'] == sample_number]

            if fine_tuned_row.empty or pretrained_row.empty:
                print(f"No data found for Sample {sample_number}.")
                continue

            fine_ground_truth = fine_tuned_row['Ground Truth'].values[0]
            fine_prediction = fine_tuned_row['Prediction'].values[0]
            fine_wer = fine_tuned_row['WER'].values[0]

            pre_prediction = pretrained_row['Prediction'].values[0]
            pre_wer = pretrained_row['WER'].values[0]

            try:
                dataset_sample = dataset['test'][sample_number - 1]
                audio_array = np.array(dataset_sample['audio']['array'])
                audio_sr = dataset_sample['audio']['sampling_rate']

                print(f"Sample {sample_number}:")
                print(f"Fine-tuned Model WER: {fine_wer}")
                print(f"Pretrained Model WER: {pre_wer}")
                print(f"Ground Truth: {fine_ground_truth}")
                print(f"Fine-tuned Prediction: {fine_prediction}")
                print(f"Pretrained Prediction: {pre_prediction}")

                display(ipd.Audio(data=audio_array, rate=audio_sr))
                print("-" * 50)

            except IndexError:
                print(f"Audio for Sample {sample_number} not found in the dataset.")
                print("-" * 50)

    except Exception as e:
        print(f"An error occurred: {e}")

display_samples_by_number("1991, 1053, 913, 549")

Sample 1991:
Fine-tuned Model WER: 0.0
Pretrained Model WER: 45.45
Ground Truth: three three nine zero air berlin five six one kilo bye
Fine-tuned Prediction: three three nine zero air berlin five six one kilo bye
Pretrained Prediction: three nine decimal zero alberding five six one kilo bye bye


--------------------------------------------------
Sample 1053:
Fine-tuned Model WER: 0.0
Pretrained Model WER: 100.0
Ground Truth: six three alfa
Fine-tuned Prediction: six three alfa
Pretrained Prediction: real fun


--------------------------------------------------
Sample 913:
Fine-tuned Model WER: 0.0
Pretrained Model WER: 54.55
Ground Truth: line up runway one three and wait csa nine three zero
Fine-tuned Prediction: line up runway one three and wait csa nine three zero
Pretrained Prediction: planar primary one train wait csa united three zero


--------------------------------------------------
Sample 549:
Fine-tuned Model WER: 100.0
Pretrained Model WER: 200.0
Ground Truth: three seven fi
Fine-tuned Prediction: one eight three seven five
Pretrained Prediction: go ahead and show the fire


--------------------------------------------------
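
One way to decide which sample numbers to pass to `display_samples_by_number` is to rank samples by how much the WER dropped between the pretrained and fine-tuned results. The sketch below is an illustration, not part of the original evaluation: `rank_by_wer_gain` is a hypothetical helper that takes plain lists of row dictionaries with the `Sample` and `WER` keys used in the CSV files, and the example rows are copied from the outputs above.

```python
def rank_by_wer_gain(fine_tuned_rows, pretrained_rows):
    """Pair rows by 'Sample' and sort by how much fine-tuning lowered WER."""
    pretrained_wer = {int(row["Sample"]): float(row["WER"]) for row in pretrained_rows}
    gains = []
    for row in fine_tuned_rows:
        sample = int(row["Sample"])
        if sample not in pretrained_wer:
            continue  # skip samples missing from the pretrained results
        gain = pretrained_wer[sample] - float(row["WER"])
        gains.append((sample, round(gain, 2)))
    # Largest improvement first; a negative gain means fine-tuning did worse.
    return sorted(gains, key=lambda item: item[1], reverse=True)


# Rows copied from the comparison output above (illustrative subset).
fine_tuned_rows = [
    {"Sample": 1991, "WER": 0.0},
    {"Sample": 1053, "WER": 0.0},
    {"Sample": 549, "WER": 100.0},
]
pretrained_rows = [
    {"Sample": 1991, "WER": 45.45},
    {"Sample": 1053, "WER": 100.0},
    {"Sample": 549, "WER": 200.0},
]
print(rank_by_wer_gain(fine_tuned_rows, pretrained_rows))
```

With the DataFrames loaded earlier, the same helper could be called as `rank_by_wer_gain(df_fine_tuned.to_dict("records"), df_pretrained.to_dict("records"))`; the sample numbers at either end of the result are natural candidates for a closer listen.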


### Random N Samples from Fine-Tuned Model

To explore the broader performance of the fine-tuned model on ATC data, this cell displays a random set of samples. Listening to a variety of samples can give a better sense of the model's strengths and weaknesses without focusing solely on the worst-case scenarios.


In [9]:
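def display_random_n_samples_fine_tuned(n):
    """Print n randomly chosen fine-tuned results and play their audio."""
    try:
        # Draw n rows at random (a different set on every run).
        random_samples = df_fine_tuned.sample(n)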

        for idx, sample in random_samples.iterrows():
            sample_number = int(sample['Sample'])
            ground_truth = sample['Ground Truth']
            prediction = sample['Prediction']
            wer = sample['WER']

            try:
                dataset_sample = dataset['test'][sample_number - 1]
                audio_array = np.array(dataset_sample['audio']['array'])
                audio_sr = dataset_sample['audio']['sampling_rate']

                print(f"Sample {sample_number}:")
                print(f"WER: {wer}")
                print(f"Ground Truth: {ground_truth}")
                print(f"Prediction: {prediction}")

                display(ipd.Audio(data=audio_array, rate=audio_sr))
                print("-" * 50)

            except IndexError:
                print(f"Audio for sample {sample_number} not found in the dataset.")
                print("-" * 50)
    
    except Exception as e:
        print(f"An error occurred: {e}")

display_random_n_samples_fine_tuned(3)

Sample 1487:
WER: 25.0
Ground Truth: ground one two on
Prediction: ground one two


--------------------------------------------------
Sample 1161:
WER: 25.0
Ground Truth: austrian three zero four romeo praha radar contact
Prediction: austrian three one four romeo praha radar cont


--------------------------------------------------
Sample 1863:
WER: 0.0
Ground Truth: euro trans si
Prediction: euro trans si


--------------------------------------------------
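
A handful of random samples gives only an impression of broader performance, so it can help to summarize the per-sample WER column as well. The following is a small sketch under stated assumptions: `summarize_wer` is a hypothetical helper, it expects WER values as percentages like those in the CSV files, and the three values in the example are simply the random samples printed above.

```python
from statistics import mean, median


def summarize_wer(wer_values):
    """Summarize a collection of per-sample WER percentages."""
    values = [float(v) for v in wer_values]
    if not values:
        raise ValueError("wer_values is empty")
    return {
        "samples": len(values),
        "mean": round(mean(values), 2),
        "median": round(median(values), 2),
        # Share of samples transcribed with no errors at all.
        "share_perfect": round(sum(v == 0 for v in values) / len(values), 2),
        # Share of samples where errors outnumber ground-truth words.
        "share_over_100": round(sum(v > 100 for v in values) / len(values), 2),
    }


# The three random samples shown above, as a tiny illustration.
print(summarize_wer([25.0, 25.0, 0.0]))
```

On the full results this could be called as `summarize_wer(df_fine_tuned["WER"])`. Note that the mean of per-sample WER values is not the same as a corpus-level WER, which weights each sample by the length of its ground truth.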


### Conclusion

You've now seen how the fine-tuned and pretrained Whisper models perform on ATC data, and where each one excels or struggles. Listening to the audio alongside the WER scores should give a clearer sense of how each model handles this real-world data. Feel free to keep exploring specific or random samples to dig deeper into the models' strengths and weaknesses.
