# Benchmark and compare Whisper models

In this notebook, we apply transcription and classification with Whisper (Large vs. Turbo, CPP and GPU versions) and analyse the results.



## GPU Test

We start by generating transcriptions with both models (turbo and large-v3).

Let's choose a random sample of 100 audio files, so that the sample is representative.

In [1]:
import os
import random
import json

In [2]:

# audio_folder = '../../metadata/chunks/audio' 
# transcriptions_file = '../../metadata/chunks/text_audio_mapping.json'
# saving_file = '../json/selected_audio_files.json'
# AUDIO_NUM = 100

# audio_files = os.listdir(audio_folder)
# random.shuffle(audio_files)

# selected_audio_files = audio_files[:AUDIO_NUM]

# # add original transcriptions
# data = None
# with open(transcriptions_file, 'r') as f:
#     transcriptions = json.load(f)

# json_objects = []

# for file in selected_audio_files:
#     audio_name = file.replace(".mp3", "")
#     transcription = next(t['text'] for t in transcriptions if t['audio'] == audio_name )
#     json_objects.append(
#         {
#             "audio": audio_name,
#             "transcription": transcription
#         }
#     )

# with open(saving_file, 'w') as f:
#     json.dump(json_objects, f, ensure_ascii=False)

# print(selected_audio_files)
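The shuffle-and-slice approach above draws a different sample on every run, which makes the benchmark hard to repeat. Below is a minimal sketch of a reproducible alternative, assuming a fixed seed is acceptable for this comparison. The helper name `pick_sample`, the seed value, and the demo file names are hypothetical and not part of the original notebook.

```python
import random


def pick_sample(file_names, k, seed=42):
    """Return k file names drawn without replacement, reproducibly.

    `seed` is an arbitrary illustrative value (an assumption, not from the notebook).
    """
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    # Sort first so the result does not depend on the order os.listdir() returns
    return rng.sample(sorted(file_names), k)


# Made-up file names, only to show the call
demo_files = [f"{i:03d}_chunk001.mp3" for i in range(1, 21)]
sample = pick_sample(demo_files, 5)
print(sample)
```

Calling `pick_sample` again with the same inputs and seed returns the same list, so the selected files can be regenerated later if needed.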

We will transcribe these audio files using Whisper Large-v3 and Turbo and compare the quality and time.

This will run on a Colab Tesla T4 (12.7 GB VRAM as the runtime shows).


## Whisper Large-v3 stats (first run)


📊 Summary:

Mean time: 27.55 s
Min time : 7.81 s
Max time : 101.91 s

## Whisper Large-v3 stats (second run)

📊 Summary:
Mean time: 26.32 s
Min time : 9.90 s
Max time : 64.60 s

## Whisper Turbo stats


📊 Summary:
Mean time: 8.86 s
Min time : 3.21 s
Max time : 21.19 s

### <span style='color:blue'> *Turbo is roughly 3x faster than Large (mean time per file: 8.86 s vs. 26.32-27.55 s)* </span>
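The speed-up can be checked directly from the mean times reported in the summaries above. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Mean per-file transcription times reported above (seconds)
large_run1_mean = 27.55
large_run2_mean = 26.32
turbo_mean = 8.86

# Speed-up of Turbo relative to each Large-v3 run
speedup_run1 = large_run1_mean / turbo_mean
speedup_run2 = large_run2_mean / turbo_mean

print(f"{speedup_run1:.2f}x, {speedup_run2:.2f}x")  # → 3.11x, 2.97x
```

The ratio sits just above 3 for the first Large-v3 run and just below 3 for the second.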

# Evaluation

The transcriptions are saved in **json/test_whisper_(model).json**

## Manual Evaluation
We evaluate the quality of transcriptions manually
- 013_chunk056: Turbo much better
- 017_chunk152: Large is horrible, Turbo is good
- 013_chunk035: Roughly the same quality, both good
- 008_chunk022: Large misses a portion of speech, Turbo is worse
- 020_chunk039: Turbo is good, Large is the same
- 016_chunk010: Large is better
- 022_chunk015: both are good

## Numerical Evaluation (cosine similarity)

We load both sets of Whisper transcriptions, along with the original YouTube transcriptions.

In [3]:
import json

whisper_turbo_t = '../json/test_whisper_turbo.json'
whisper_large_t = '../json/test_whisper_large-v3.json'
original_t = '../json/text_audio_mapping.json'

with open(whisper_turbo_t, 'r') as f:
    turbo_transcriptions = json.load(f)

with open(whisper_large_t, 'r') as f:
    large_transcriptions = json.load(f)

with open(original_t, 'r') as f:
    original_transcriptions = json.load(f)


Filter the original transcriptions down to the selected audio files

In [10]:
# load chosen audio list
chosen_audio_f = '../json/selected_audio_files.json'
original_filtered = []

with open(chosen_audio_f, 'r') as f:
    chosen_audio_list = json.load(f)
for row in original_transcriptions:
    found = next((1 for f in chosen_audio_list if f['audio'] == row['audio']), None)
    if found:
        original_filtered.append(row)

print(f"quick test: length of filtered list: {len(original_filtered)}")

quick test: length of filtered list: 100
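The filter above rescans the chosen list for every original row. A minimal sketch of the same filter using a set lookup, assuming each entry is a dict with an `audio` key as in the cell above. The helper name `filter_by_audio` and the toy rows are hypothetical.

```python
def filter_by_audio(rows, chosen):
    """Keep only the rows whose 'audio' value appears in the chosen list."""
    chosen_names = {item["audio"] for item in chosen}  # set gives O(1) membership tests
    return [row for row in rows if row["audio"] in chosen_names]


# Toy data, only to show the behaviour
rows = [
    {"audio": "a", "text": "x"},
    {"audio": "b", "text": "y"},
    {"audio": "c", "text": "z"},
]
chosen = [{"audio": "c"}, {"audio": "a"}]

filtered = filter_by_audio(rows, chosen)
print([row["audio"] for row in filtered])  # → ['a', 'c']
```

The result keeps the order of the original rows, not the order of the chosen list, which matches the behaviour of the loop above.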


### Generate embeddings for whisper transcriptions and save them in new files

Log in to Hugging Face

In [3]:
from huggingface_hub import login
import os
from dotenv import load_dotenv
load_dotenv()


your_token = os.getenv("HUGGING_FACE_TOKEN")

login(token=your_token)

Generate embeddings using the `all-MiniLM-L6-v2` model from sentence-transformers

In [5]:
from sentence_transformers import SentenceTransformer

# we use 'sentence-transformers/all-MiniLM-L6-v2' 

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def add_embeddings(data_list, text_key):
    """
    Generates and adds embeddings for each item in a list of dictionaries.

    Args: 
        data_list (list): A list of dictionaries, where each dict has a text field.
        text_key (str): The key for the text field in each dictionary.

    Returns:
        list: The original list (modified in place), with a '<text_key>_embedding' key added to each dictionary.
    """
    texts = [item.get(text_key) for item in data_list]
    embeddings = embedding_model.encode(texts)

    for i, item in enumerate(data_list):
        new_key = text_key + '_embedding'
        item[new_key] = embeddings[i].tolist() # Convert numpy array to list for JSON serialization

    return data_list



Save the embeddings in separate files for clarity

In [12]:

turbo_embeddings_path = '../json/turbo_embeddings.json'
large_embeddings_path = '../json/large_embeddings.json'
origial_embeddings_path = '../json/origial_embeddings.json'

In [None]:
turbo_embeddings = add_embeddings(turbo_transcriptions, "transcription")
large_embeddings = add_embeddings(large_transcriptions, "transcription")
original_embeddings = add_embeddings(original_filtered, "text")


with open(turbo_embeddings_path, 'w') as f:
    json.dump(turbo_embeddings, f, ensure_ascii=False)

with open(large_embeddings_path, 'w') as f:
    json.dump(large_embeddings, f, ensure_ascii=False)

with open(origial_embeddings_path, 'w') as f:
    json.dump(original_embeddings, f, ensure_ascii=False)

## Apply cosine similarity

In [13]:

with open(turbo_embeddings_path, 'r') as f:
    turbo_embeddings = json.load(f)

with open(large_embeddings_path, 'r') as f:
    large_embeddings = json.load(f)

with open(origial_embeddings_path, 'r') as f:
    original_embeddings = json.load(f)

In [14]:
metrics_file = '../json/turbo_vs_large.json'

Apply cosine similarity and save results 

In [16]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

metric_list = []

for row in original_embeddings:
    original_emb = row['text_embedding']
    turbo_emb = next((f['transcription_embedding'] for f in turbo_embeddings if f['audio'].replace(".mp3", "") == row['audio']), None)
    large_emb = next((f['transcription_embedding'] for f in large_embeddings if f['audio'].replace(".mp3", "") == row['audio']), None)

    if turbo_emb is None:
        print(f"NOT FOUND turbo embedding for file {row['audio']}")

    if large_emb is None:
        print(f"NOT FOUND large embedding for file {row['audio']}")

    # Skip this file if either embedding is missing; there is nothing to compare
    if turbo_emb is None or large_emb is None:
        continue

    # cosine_similarity expects 2D arrays of shape (n_samples, n_features)
    original_emb = np.array(original_emb).reshape(1, -1)
    turbo_emb = np.array(turbo_emb).reshape(1, -1)
    large_emb = np.array(large_emb).reshape(1, -1)

    # Cast to plain Python floats so the values serialize cleanly to JSON
    turbo_cosine = float(cosine_similarity(original_emb, turbo_emb)[0][0])
    large_cosine = float(cosine_similarity(original_emb, large_emb)[0][0])

    metric_list.append({
        "audio": row['audio'],
        'turbo_cosine': turbo_cosine,
        'large_cosine': large_cosine
    })

with open(metrics_file, 'w') as f:
    json.dump(metric_list, f, ensure_ascii=False)
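For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python, written without scikit-learn so the arithmetic is visible. The function name `cosine` and the toy vectors are illustrative and not taken from the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing in the same direction score (approximately) 1
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors score 0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Because the score depends only on direction and not on vector length, it measures how close two embeddings are in meaning rather than how exactly the words match.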

## Summary Statistics

In [17]:
import json
import pandas as pd


with open(metrics_file, 'r') as f:
    metric_list = json.load(f)

# put into DataFrame
df = pd.DataFrame(metric_list)

# compute difference
df["difference"] = df["large_cosine"] - df["turbo_cosine"]

# summary statistics
summary = {
    "turbo_mean": df["turbo_cosine"].mean(),
    "large_mean": df["large_cosine"].mean(),
    "turbo_std": df["turbo_cosine"].std(),
    "large_std": df["large_cosine"].std(),
    "avg_difference": df["difference"].mean(),
    "turbo_better_count": (df["difference"] < 0).sum(),
    "large_better_count": (df["difference"] > 0).sum()
}

display(df)
print("\n--- Summary Comparison ---")
for k, v in summary.items():
    print(f"{k}: {v:.4f}")


| | audio | turbo_cosine | large_cosine | difference |
|---|---|---|---|---|
| 0 | 002_chunk007 | 0.957282 | 0.929248 | -0.028034 |
| 1 | 002_chunk018 | 0.953961 | 0.933615 | -0.020346 |
| 2 | 002_chunk019 | 0.894270 | 0.943848 | 0.049578 |
| 3 | 004_chunk005 | 0.903571 | 0.940228 | 0.036657 |
| 4 | 005_chunk002 | 0.972451 | 0.950694 | -0.021757 |
| ... | ... | ... | ... | ... |
| 95 | 022_chunk009 | 0.988576 | 0.833761 | -0.154815 |
| 96 | 022_chunk015 | 0.965587 | 0.876124 | -0.089463 |
| 97 | 022_chunk027 | 0.908584 | 0.920975 | 0.012392 |
| 98 | 022_chunk035 | 0.907040 | 0.886476 | -0.020563 |



--- Summary Comparison ---
turbo_mean: 0.9261
large_mean: 0.9148
turbo_std: 0.0809
large_std: 0.0644
avg_difference: -0.0113
turbo_better_count: 66.0000
large_better_count: 34.0000
