# Benchmark and compare Whisper models

In this notebook, we apply transcription and classification with Whisper (Large-v3 vs Turbo, in both the GPU and the whisper.cpp CPU versions) and analyse the results.



## GPU Test

We start by generating transcriptions with both models (Turbo and Large-v3).

We pick a random sample of 100 audio files; random selection keeps the sample reasonably representative of the dataset.

In [1]:
import os
import random
import json

In [2]:

# audio_folder = '../../metadata/chunks/audio' 
# transcriptions_file = '../../metadata/chunks/text_audio_mapping.json'
# saving_file = '../json/selected_audio_files.json'
# AUDIO_NUM = 100

# audio_files = os.listdir(audio_folder)
# random.shuffle(audio_files)

# selected_audio_files = audio_files[:AUDIO_NUM]

# # add original transcriptions
# data = None
# with open(transcriptions_file, 'r') as f:
#     transcriptions = json.load(f)

# json_objects = []

# for file in selected_audio_files:
#     audio_name = file.replace(".mp3", "")
#     transcription = next(t['text'] for t in transcriptions if t['audio'] == audio_name )
#     json_objects.append(
#         {
#             "audio": audio_name,
#             "transcription": transcription
#         }
#     )

# with open(saving_file, 'w') as f:
#     json.dump(json_objects, f, ensure_ascii=False)

# print(selected_audio_files)
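
The sampling cell above depends on the global random state, so rerunning it picks a different sample. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; `pick_sample` and the file names below are hypothetical and only illustrate the idea:

```python
import random


def pick_sample(file_names, k, seed=42):
    """Return k distinct file names, chosen reproducibly for a given seed."""
    rng = random.Random(seed)       # local RNG, leaves the global state alone
    ordered = sorted(file_names)    # directory listing order is not guaranteed
    return rng.sample(ordered, min(k, len(ordered)))


# Hypothetical file names, for illustration only
names = [f"{i:03d}_chunk{j:03d}.mp3" for i in range(1, 4) for j in range(5)]
sample = pick_sample(names, k=4)
print(sample)
```

Calling `pick_sample` again with the same seed returns the same files, which makes the benchmark easier to repeat.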

We transcribe these audio files with Whisper Large-v3 and Turbo and compare transcription quality and run time.

The runs use a Colab Tesla T4 (12.7 GB VRAM, as the runtime shows).
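
The GPU transcription cell itself is not part of this export. A minimal sketch of what such a timed run could look like, assuming the `openai-whisper` package (the notebook does not say which library was used); `make_whisper_transcriber` and `benchmark` are hypothetical helper names:

```python
import statistics
import time


def make_whisper_transcriber(model_name, language="ar"):
    """Assumption: openai-whisper is installed; not confirmed by the notebook."""
    import whisper  # imported lazily so the timing helper works without it

    model = whisper.load_model(model_name)  # e.g. "large-v3" or "turbo"
    return lambda path: model.transcribe(path, language=language)["text"]


def benchmark(paths, transcribe_fn):
    """Time transcribe_fn on each path and return (rows, summary)."""
    rows = []
    for path in paths:
        start = time.perf_counter()
        text = transcribe_fn(path)
        duration = time.perf_counter() - start
        rows.append({"audio": path, "transcription": text,
                     "duration_seconds": duration})
    durations = [row["duration_seconds"] for row in rows]
    summary = {}
    if durations:
        summary = {"mean": statistics.mean(durations),
                   "min": min(durations),
                   "max": max(durations)}
    return rows, summary


# Stand-in transcriber so the sketch runs without a GPU or model download
rows, summary = benchmark(["a.mp3", "b.mp3"], lambda path: path.upper())
print(summary)
```

On Colab, the stand-in would be replaced by something like `make_whisper_transcriber("turbo")`, with the model loaded once outside the timed loop.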


## Whisper Large-v3 stats (first run)


📊 Summary:

Mean time: 27.55 s
Min time : 7.81 s
Max time : 101.91 s

## Whisper Large-v3 stats (second run)

📊 Summary:
Mean time: 26.32 s
Min time : 9.90 s
Max time : 64.60 s

## Whisper Turbo stats


📊 Summary:
Mean time: 8.86 s
Min time : 3.21 s
Max time : 21.19 s

### <span style='color:blue'> *Turbo is about 3x faster than Large-v3 (mean time per file)* </span>
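
The speed ratio follows directly from the mean times reported above. A small sketch of the arithmetic (`speedup` is a hypothetical helper name):

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


large_mean_times = [27.55, 26.32]  # first and second Large-v3 runs
turbo_mean_time = 8.86

ratios = [round(speedup(t, turbo_mean_time), 2) for t in large_mean_times]
print(ratios)  # [3.11, 2.97]
```

Both Large-v3 runs land close to a factor of three.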

# Evaluation

The transcriptions are saved in **json/test_whisper_(model).json**

## Manual Evaluation
We evaluate the quality of the transcriptions manually on a few files:
- 013_chunk056: Turbo is much better
- 017_chunk152: Large is very poor, Turbo is good
- 013_chunk035: roughly the same quality, both good
- 008_chunk022: Large misses a portion of the speech, Turbo is worse
- 020_chunk039: Turbo is good, Large is the same
- 016_chunk010: Large is better
- 022_chunk015: both are good

## Numerical Evaluation (cosine similarity)

We load both sets of Whisper transcriptions, along with the original YouTube transcriptions.

In [3]:
import json

whisper_turbo_t = '../json/test_whisper_turbo.json'
whisper_large_t = '../json/test_whisper_large-v3.json'
original_t = '../json/text_audio_mapping.json'

with open(whisper_turbo_t, 'r') as f:
    turbo_transcriptions = json.load(f)

with open(whisper_large_t, 'r') as f:
    large_transcriptions = json.load(f)

with open(original_t, 'r') as f:
    original_transcriptions = json.load(f)


Filter the original transcriptions down to the selected sample

In [10]:
# load chosen audio list
chosen_audio_f = '../json/selected_audio_files.json'
original_filtered = []

with open(chosen_audio_f, 'r') as f:
    chosen_audio_list = json.load(f)
for row in original_transcriptions:
    found = next((1 for f in chosen_audio_list if f['audio'] == row['audio']), None)
    if found:
        original_filtered.append(row)

print(f"quick test: length of filtered list: {len(original_filtered)}")

quick test: length of filtered list: 100
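
The filter above rescans the chosen list for every original row. A minimal sketch of the same filter with a set lookup, which gives the same result with a single pass; `filter_by_audio` and the records below are hypothetical:

```python
def filter_by_audio(rows, chosen):
    """Keep only the rows whose 'audio' id appears in the chosen list."""
    chosen_ids = {item["audio"] for item in chosen}
    return [row for row in rows if row["audio"] in chosen_ids]


# Hypothetical records, for illustration only
originals = [
    {"audio": "001_chunk001", "text": "a"},
    {"audio": "001_chunk002", "text": "b"},
    {"audio": "002_chunk001", "text": "c"},
]
chosen = [{"audio": "002_chunk001"}, {"audio": "001_chunk001"}]

filtered = filter_by_audio(originals, chosen)
print([row["audio"] for row in filtered])  # ['001_chunk001', '002_chunk001']
```

The output keeps the order of the original list, as the loop above does.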


### Generate embeddings for whisper transcriptions and save them in new files

Log in to Hugging Face

In [5]:
from huggingface_hub import login
import os
from dotenv import load_dotenv
load_dotenv()


your_token = os.getenv("HUGGING_FACE_TOKEN")

login(token=your_token)

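Generate embeddings with the `sentence-transformers/all-MiniLM-L6-v2` model. Note that this model was trained mainly on English data, so similarity scores on Arabic text should be read with caution.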

In [5]:
from sentence_transformers import SentenceTransformer

# we use 'sentence-transformers/all-MiniLM-L6-v2' 

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def add_embeddings(data_list, text_key):
    """
    Generates and adds embeddings for each item in a list of dictionaries.

    Args: 
        data_list (list): A list of dictionaries, where each dict has a text field.
        text_key (str): The key for the text field in each dictionary.

    Returns:
        list: The original list with an '_embedding' key added to each dictionary.
    """
    texts = [item.get(text_key) for item in data_list]
    embeddings = embedding_model.encode(texts)

    for i, item in enumerate(data_list):
        new_key = text_key + '_embedding'
        item[new_key] = embeddings[i].tolist() # Convert numpy array to list for JSON serialization

    return data_list



Save the embeddings in separate files for clarity

In [12]:

turbo_embeddings_path = '../json/turbo_embeddings.json'
large_embeddings_path = '../json/large_embeddings.json'
origial_embeddings_path = '../json/origial_embeddings.json'

In [None]:
turbo_embeddings = add_embeddings(turbo_transcriptions, "transcription")
large_embeddings = add_embeddings(large_transcriptions, "transcription")
original_embeddings = add_embeddings(original_filtered, "text")


with open(turbo_embeddings_path, 'w') as f:
    json.dump(turbo_embeddings, f, ensure_ascii=False)

with open(large_embeddings_path, 'w') as f:
    json.dump(large_embeddings, f, ensure_ascii=False)

with open(origial_embeddings_path, 'w') as f:
    json.dump(original_embeddings, f, ensure_ascii=False)

## Apply cosine similarity

In [13]:

with open(turbo_embeddings_path, 'r') as f:
    turbo_embeddings = json.load(f)

with open(large_embeddings_path, 'r') as f:
    large_embeddings = json.load(f)

with open(origial_embeddings_path, 'r') as f:
    original_embeddings = json.load(f)

In [14]:
metrics_file = '../json/turbo_vs_large.json'

Apply cosine similarity and save results 

In [16]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

metric_list = []

for row in original_embeddings:
    original_emb = row['text_embedding']
    turbo_emb = next((f['transcription_embedding'] for f in turbo_embeddings if f['audio'].replace(".mp3", "") == row['audio']), None)
    large_emb = next((f['transcription_embedding'] for f in large_embeddings if f['audio'].replace(".mp3", "") == row['audio']), None)

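    if turbo_emb is None:
        print(f"NOT FOUND turbo embedding for file {row['audio']}")

    if large_emb is None:
        print(f"NOT FOUND large embedding for file {row['audio']}")

    # Skip files that are missing either embedding instead of crashing below
    if turbo_emb is None or large_emb is None:
        continue
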
    original_emb = np.array(original_emb).reshape(1, -1)
    turbo_emb = np.array(turbo_emb).reshape(1, -1)
    large_emb = np.array(large_emb).reshape(1, -1)

    turbo_cosine = cosine_similarity(original_emb, turbo_emb)[0][0]
    large_cosine = cosine_similarity(original_emb, large_emb)[0][0]

    metric_list.append({
        "audio": row['audio'],
        'turbo_cosine': turbo_cosine,
        'large_cosine': large_cosine
    })

with open(metrics_file, 'w') as f:
    json.dump(metric_list, f, ensure_ascii=False)
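
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal pure-Python sketch of the quantity computed above (the notebook itself uses scikit-learn; `cosine` is a hypothetical helper name):

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A value near 1 means the two embeddings point in almost the same direction, and 0 means they are orthogonal.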

## Summary Statistics

In [17]:
import json
import pandas as pd


with open(metrics_file, 'r') as f:
    metric_list = json.load(f)

# put into DataFrame
df = pd.DataFrame(metric_list)

# compute difference
df["difference"] = df["large_cosine"] - df["turbo_cosine"]

# summary statistics
summary = {
    "turbo_mean": df["turbo_cosine"].mean(),
    "large_mean": df["large_cosine"].mean(),
    "turbo_std": df["turbo_cosine"].std(),
    "large_std": df["large_cosine"].std(),
    "avg_difference": df["difference"].mean(),
    "turbo_better_count": (df["difference"] < 0).sum(),
    "large_better_count": (df["difference"] > 0).sum()
}

display(df)
print("\n--- Summary Comparison ---")
for k, v in summary.items():
    print(f"{k}: {v:.4f}")


| | audio | turbo_cosine | large_cosine | difference |
|---|---|---|---|---|
| 0 | 002_chunk007 | 0.957282 | 0.929248 | -0.028034 |
| 1 | 002_chunk018 | 0.953961 | 0.933615 | -0.020346 |
| 2 | 002_chunk019 | 0.894270 | 0.943848 | 0.049578 |
| 3 | 004_chunk005 | 0.903571 | 0.940228 | 0.036657 |
| 4 | 005_chunk002 | 0.972451 | 0.950694 | -0.021757 |
| ... | ... | ... | ... | ... |
| 95 | 022_chunk009 | 0.988576 | 0.833761 | -0.154815 |
| 96 | 022_chunk015 | 0.965587 | 0.876124 | -0.089463 |
| 97 | 022_chunk027 | 0.908584 | 0.920975 | 0.012392 |
| 98 | 022_chunk035 | 0.907040 | 0.886476 | -0.020563 |



--- Summary Comparison ---
turbo_mean: 0.9261
large_mean: 0.9148
turbo_std: 0.0809
large_std: 0.0644
avg_difference: -0.0113
turbo_better_count: 66.0000
large_better_count: 34.0000


## <span style="color:green"> Turbo scores slightly higher on this sample: mean cosine similarity 0.926 vs 0.915, and it is ahead on 66 of 100 files </span>
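
One way to check whether a 66 vs 34 split is more than chance is a two-sided sign test on the per-file wins. This is a sketch under assumptions, not part of the original analysis: it treats the files as independent, ignores the size of each difference, and `sign_test_p_value` is a hypothetical helper name.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test: chance of a split at least this uneven under p = 0.5."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_gpu = sign_test_p_value(66, 34)  # turbo_better_count vs large_better_count
print(f"p-value: {p_gpu:.4f}")
```

A small p-value suggests that the win count is unlikely under a coin flip. It says nothing about how large the quality gap is, and it inherits any weakness of the embedding model as a quality proxy.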

# **CPU Test**

Function to transcribe audio with the whisper.cpp command-line tool

In [30]:
import subprocess
from pathlib import Path

def transcribe_audio(
    audio_path,
    text_output_dir,
    exe_path,
    model_name="large-v3",
    language="ar",
    threads=4,
):
    """
    Run Whisper.cpp on a given audio file and store the .txt output
    in a separate folder (text_output_dir).
    Returns the transcribed text.
    """
    model_path = r"C:\Users\ACER\whisper.cpp\models\ggml-" + model_name + ".bin" 
    audio_path = Path(audio_path)
    text_output_dir = Path(text_output_dir)
    text_output_dir.mkdir(parents=True, exist_ok=True)

    # Use the audio file name but save inside text_output_dir
    # Corrected path handling
    txt_output = text_output_dir / audio_path.stem  # no extension here

    cmd = [
        exe_path,
        "-m", model_path,
        "-f", str(audio_path),
        "-l", language,
        "-t", str(threads),
        "-otxt",
        "-of", str(txt_output)  # whisper.cpp will add .txt automatically
    ]

    final_txt_file = txt_output.with_suffix(".txt")

    subprocess.run(cmd, capture_output=True, text=True)

    if final_txt_file.exists():
        return final_txt_file.read_text(encoding="utf-8").strip()
    else:
        return None
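
A failed `whisper-cli` call is easiest to diagnose when its exit code and stderr are surfaced. A minimal sketch of a stricter wrapper around `subprocess.run` that raises instead of silently producing no output; `run_checked` is a hypothetical helper, and the command below is a stand-in so the sketch runs anywhere:

```python
import subprocess
import sys


def run_checked(cmd):
    """Run cmd and return its stdout; raise with stderr attached on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"command failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout


# Stand-in command; in the notebook this would be the whisper-cli command list
out = run_checked([sys.executable, "-c", "print('ok')"])
print(out.strip())  # ok
```

With this in place, a missing model file or an unsupported audio format shows up as an error message rather than as a `None` transcript.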


Get the sample of audio files that we already chose

In [2]:
import json
import os
# load chosen audio list
chosen_audio_f = '../json/selected_audio_files.json'
original_filtered = []

with open(chosen_audio_f, 'r') as f:
    chosen_audio_list = json.load(f)

print(chosen_audio_list[:2])

[{'audio': '013_chunk056', 'transcription': 'جو سوي سور كاينين حتى فيلالجيريا كاين في العالم كامل واحد الناس كيتزادوا تيحس براسه ان ملي كنت كنشعل التلفازه ما كانتش كتعني لي شي حاجه هذوك راه الناس الها اه احنا الناس اللي لهيه فهمتي هذو راه كاينين كاينين في العاصمه حنا شعبون ثاني اه اه فهمتي ماتيعنيونيش الطريقه باش كيلبسوا لابس كوستيوم ختنا لابسه فيست كذا لا علاقه ما عندكش لابارتونونس لهذك لا علاقه انا تنعرف الكارتي ديالنا 200 دولار اخاي بهذا الشي اللي شنو يقدر يدير ب 200 دولار الهضره ديال الصاك كامله صحيحه غير انايا دي نزل للتيران ورا ما محتاجش ل 200 دولار كاين يقدر ما نحتاجش ما قال ما نحتاج حتاجش كاع 200 دولار انا ح انا نقدر نزل للسوق ندير الترجمه نحل فايفر شوف شنو اللي مشي واش نقدر ندير هذا ليماج ايديتينغ اللي طالبين غنمشي نزحم هذا حى انا ندير معه المنافسه هو كيطلب 50 دولار فايفر انا اخويا ندير غير ب 5 دولار انا مسالي ما عندي ما ندير 5 دولار في المغرب والجزائر راه شي حاجه ا 120 ولا صافا را'}, {'audio': '017_chunk152', 'transcription': 'الكونت بيبليسيتير هذاك قد ما تستقوى قد ما هو يسيب

We start with the Large-v3 CPU model, using 8 threads.

In [3]:
saving_dir = '../json'

os.makedirs(saving_dir, exist_ok=True)

In [4]:

import time
import statistics

cpu_test_file = 'test_whispercpp_large-v3.json'

NUM_THREADS = 8
text_files_dir = '../texts'
model_name = "large-v3"

transcriptions_list = []
timings = [] 

for audio_f in chosen_audio_list:
    file_name = audio_f["audio"] + ".mp3"
    audio_dir = "../../metadata/chunks/audio"
    full_audio_path = os.path.join(audio_dir, file_name)


    print(f"\n🔊 Transcribing: {file_name}")
    start = time.perf_counter()
    transcript = transcribe_audio(
        audio_path=full_audio_path,
        exe_path=r"C:\Users\ACER\whisper.cpp\build\bin\Release\whisper-cli.exe",
        language="ar",
        threads=NUM_THREADS,
        text_output_dir=text_files_dir,
        model_name=model_name
    )
    end = time.perf_counter()
    # transcribe_audio returns None when no output file was produced
    transcript = (transcript or "").replace("\n", "")

    duration = end - start
    timings.append(duration)
    print(f"⏱  Took {duration:.2f} seconds")

    transcriptions_list.append({
        "transcription": transcript,
        "audio": file_name,
        "duration_seconds": round(duration, 2)
    })

# Save transcriptions with timing
output_transcriptions = os.path.join(saving_dir, cpu_test_file)
with open(output_transcriptions, 'w', encoding='utf-8') as f:
    json.dump(transcriptions_list, f, ensure_ascii=False, indent=4)
    print("File saved successfully")

# ---- Summary statistics ----
if timings:
    mean_time = statistics.mean(timings)
    min_time = min(timings)
    max_time = max(timings)
    print("\n📊 Summary:")
    print(f"Mean time: {mean_time:.2f} s")
    print(f"Min time : {min_time:.2f} s")
    print(f"Max time : {max_time:.2f} s")
else:
    print("\nNo valid audio files were processed.")





🔊 Transcribing: 013_chunk056.mp3
⏱  Took 106.71 seconds

🔊 Transcribing: 017_chunk152.mp3
⏱  Took 112.47 seconds

🔊 Transcribing: 013_chunk035.mp3
⏱  Took 105.54 seconds

🔊 Transcribing: 017_chunk053.mp3
⏱  Took 126.92 seconds

🔊 Transcribing: 008_chunk022.mp3
⏱  Took 155.55 seconds

🔊 Transcribing: 017_chunk141.mp3
⏱  Took 134.97 seconds

🔊 Transcribing: 020_chunk039.mp3
⏱  Took 121.89 seconds

🔊 Transcribing: 017_chunk020.mp3
⏱  Took 131.88 seconds

🔊 Transcribing: 016_chunk010.mp3
⏱  Took 119.32 seconds

🔊 Transcribing: 017_chunk107.mp3
⏱  Took 146.40 seconds

🔊 Transcribing: 002_chunk018.mp3
⏱  Took 173.38 seconds

🔊 Transcribing: 008_chunk015.mp3
⏱  Took 182.32 seconds

🔊 Transcribing: 016_chunk045.mp3
⏱  Took 109.45 seconds

🔊 Transcribing: 013_chunk058.mp3
⏱  Took 119.22 seconds

🔊 Transcribing: 013_chunk085.mp3
⏱  Took 342.53 seconds

🔊 Transcribing: 017_chunk013.mp3
⏱  Took 104.86 seconds

🔊 Transcribing: 013_chunk029.mp3
⏱  Took 125.01 seconds

🔊 Transcribing: 013_chunk021.m

In [5]:

import time
import statistics

cpu_test_file = 'test_whispercpp_turbo.json'

NUM_THREADS = 8
text_files_dir = '../texts_turbo'
model_name = "turbo"

transcriptions_list = []
timings = [] 

for audio_f in chosen_audio_list:
    file_name = audio_f["audio"] + ".mp3"
    audio_dir = "../../metadata/chunks/audio"
    full_audio_path = os.path.join(audio_dir, file_name)


    print(f"\n🔊 Transcribing: {file_name}")
    start = time.perf_counter()
    transcript = transcribe_audio(
        audio_path=full_audio_path,
        exe_path=r"C:\Users\ACER\whisper.cpp\build\bin\Release\whisper-cli.exe",
        language="ar",
        threads=NUM_THREADS,
        text_output_dir=text_files_dir,
        model_name=model_name
    )
    end = time.perf_counter()
    # transcribe_audio returns None when no output file was produced
    transcript = (transcript or "").replace("\n", "")

    duration = end - start
    timings.append(duration)
    print(f"⏱  Took {duration:.2f} seconds")

    transcriptions_list.append({
        "transcription": transcript,
        "audio": file_name,
        "duration_seconds": round(duration, 2)
    })

# Save transcriptions with timing
output_transcriptions = os.path.join(saving_dir, cpu_test_file)
with open(output_transcriptions, 'w', encoding='utf-8') as f:
    json.dump(transcriptions_list, f, ensure_ascii=False, indent=4)
    print("File saved successfully")

# ---- Summary statistics ----
if timings:
    mean_time = statistics.mean(timings)
    min_time = min(timings)
    max_time = max(timings)
    print("\n📊 Summary:")
    print(f"Mean time: {mean_time:.2f} s")
    print(f"Min time : {min_time:.2f} s")
    print(f"Max time : {max_time:.2f} s")
else:
    print("\nNo valid audio files were processed.")





🔊 Transcribing: 013_chunk056.mp3
⏱  Took 71.43 seconds

🔊 Transcribing: 017_chunk152.mp3
⏱  Took 64.99 seconds

🔊 Transcribing: 013_chunk035.mp3
⏱  Took 63.42 seconds

🔊 Transcribing: 017_chunk053.mp3
⏱  Took 66.06 seconds

🔊 Transcribing: 008_chunk022.mp3
⏱  Took 75.20 seconds

🔊 Transcribing: 017_chunk141.mp3
⏱  Took 95.66 seconds

🔊 Transcribing: 020_chunk039.mp3
⏱  Took 66.12 seconds

🔊 Transcribing: 017_chunk020.mp3
⏱  Took 77.10 seconds

🔊 Transcribing: 016_chunk010.mp3
⏱  Took 64.49 seconds

🔊 Transcribing: 017_chunk107.mp3
⏱  Took 64.73 seconds

🔊 Transcribing: 002_chunk018.mp3
⏱  Took 67.98 seconds

🔊 Transcribing: 008_chunk015.mp3
⏱  Took 65.26 seconds

🔊 Transcribing: 016_chunk045.mp3
⏱  Took 64.88 seconds

🔊 Transcribing: 013_chunk058.mp3
⏱  Took 62.60 seconds

🔊 Transcribing: 013_chunk085.mp3
⏱  Took 90.31 seconds

🔊 Transcribing: 017_chunk013.mp3
⏱  Took 67.59 seconds

🔊 Transcribing: 013_chunk029.mp3
⏱  Took 66.56 seconds

🔊 Transcribing: 013_chunk021.mp3
⏱  Took 71.59 

## Whispercpp Large-v3 stats

📊 Summary:
Mean time: 129.59 s
Min time : 65.72 s
Max time : 342.53 s

## Whispercpp Turbo stats

📊 Summary:
Mean time: 74.25 s
Min time : 55.21 s
Max time : 115.98 s

## <span style="color:blue"> Turbo is about 1.75x faster than Large-v3 on CPU (mean time per file) </span>

# Evaluation

The transcriptions are saved in **json/test_whispercpp_(model).json**

## Manual Evaluation
We evaluate the quality of the transcriptions manually on a few files:
- 013_chunk056: Turbo is clearly better
- 017_chunk152: Large mixes up the sentence order, Turbo is good
- 013_chunk035: roughly the same quality, both good
- 008_chunk022: both are bad and miss a portion of the speech
- 020_chunk039: Turbo is good, Large is the same
- 016_chunk010: Turbo is good but mixes up sentences, Large is good
- 022_chunk015: both are good

## Numerical Evaluation (cosine similarity)

We load both sets of whisper.cpp transcriptions; the original YouTube embeddings saved during the GPU test are reused.

In [1]:
import json

whisper_turbo_t = '../json/test_whispercpp_turbo.json'
whisper_large_t = '../json/test_whispercpp_large-v3.json'
# original_t = '../json/text_audio_mapping.json'

with open(whisper_turbo_t, 'r') as f:
    turbo_transcriptions = json.load(f)

with open(whisper_large_t, 'r') as f:
    large_transcriptions = json.load(f)

# with open(original_t, 'r') as f:
#     original_transcriptions = json.load(f)


### Generate embeddings for whisper transcriptions and save them in new files

Log in to Hugging Face

In [2]:
from huggingface_hub import login
import os
from dotenv import load_dotenv
load_dotenv()


your_token = os.getenv("HUGGING_FACE_TOKEN")

login(token=your_token)

Generate embeddings with the `sentence-transformers/all-MiniLM-L6-v2` model, as in the GPU test

In [3]:
from sentence_transformers import SentenceTransformer

# we use 'sentence-transformers/all-MiniLM-L6-v2' 

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def add_embeddings(data_list, text_key):
    """
    Generates and adds embeddings for each item in a list of dictionaries.

    Args: 
        data_list (list): A list of dictionaries, where each dict has a text field.
        text_key (str): The key for the text field in each dictionary.

    Returns:
        list: The original list with an '_embedding' key added to each dictionary.
    """
    texts = [item.get(text_key) for item in data_list]
    embeddings = embedding_model.encode(texts)

    for i, item in enumerate(data_list):
        new_key = text_key + '_embedding'
        item[new_key] = embeddings[i].tolist() # Convert numpy array to list for JSON serialization

    return data_list



Save the embeddings in separate files for clarity

In [4]:

turbo_embeddings_path = '../json/turbo_cpp_embeddings.json'
large_embeddings_path = '../json/large_cpp_embeddings.json'

origial_embeddings_path = '../json/origial_embeddings.json'

In [5]:
turbo_embeddings = add_embeddings(turbo_transcriptions, "transcription")
large_embeddings = add_embeddings(large_transcriptions, "transcription")


with open(turbo_embeddings_path, 'w') as f:
    json.dump(turbo_embeddings, f, ensure_ascii=False)

with open(large_embeddings_path, 'w') as f:
    json.dump(large_embeddings, f, ensure_ascii=False)


## Apply cosine similarity

In [6]:

with open(turbo_embeddings_path, 'r') as f:
    turbo_embeddings = json.load(f)

with open(large_embeddings_path, 'r') as f:
    large_embeddings = json.load(f)

with open(origial_embeddings_path, 'r') as f:
    original_embeddings = json.load(f)

In [7]:
metrics_file = '../json/turbo_cpp_vs_large.json'

Apply cosine similarity and save results 

In [8]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

metric_list = []

for row in original_embeddings:
    original_emb = row['text_embedding']
    turbo_emb = next((f['transcription_embedding'] for f in turbo_embeddings if f['audio'].replace(".mp3", "") == row['audio']), None)
    large_emb = next((f['transcription_embedding'] for f in large_embeddings if f['audio'].replace(".mp3", "") == row['audio']), None)

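    if turbo_emb is None:
        print(f"NOT FOUND turbo embedding for file {row['audio']}")

    if large_emb is None:
        print(f"NOT FOUND large embedding for file {row['audio']}")

    # Skip files that are missing either embedding instead of crashing below
    if turbo_emb is None or large_emb is None:
        continue
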
    original_emb = np.array(original_emb).reshape(1, -1)
    turbo_emb = np.array(turbo_emb).reshape(1, -1)
    large_emb = np.array(large_emb).reshape(1, -1)

    turbo_cosine = cosine_similarity(original_emb, turbo_emb)[0][0]
    large_cosine = cosine_similarity(original_emb, large_emb)[0][0]

    metric_list.append({
        "audio": row['audio'],
        'turbo_cosine': turbo_cosine,
        'large_cosine': large_cosine
    })

with open(metrics_file, 'w') as f:
    json.dump(metric_list, f, ensure_ascii=False)

## Summary Statistics

In [10]:
import json
import pandas as pd


with open(metrics_file, 'r') as f:
    metric_list = json.load(f)

# put into DataFrame
df = pd.DataFrame(metric_list)

# compute difference
df["difference"] = df["large_cosine"] - df["turbo_cosine"]

# summary statistics
summary = {
    "turbo_mean": df["turbo_cosine"].mean(),
    "large_mean": df["large_cosine"].mean(),
    "turbo_std": df["turbo_cosine"].std(),
    "large_std": df["large_cosine"].std(),
    "avg_difference": df["difference"].mean(),
    "turbo_better_count": (df["difference"] < 0).sum(),
    "large_better_count": (df["difference"] > 0).sum()
}

display(df)
print("\n--- Summary Comparison for CPU ---")
for k, v in summary.items():
    print(f"{k}: {v:.6f}")


| | audio | turbo_cosine | large_cosine | difference |
|---|---|---|---|---|
| 0 | 002_chunk007 | 0.924972 | 0.874039 | -0.050933 |
| 1 | 002_chunk018 | 0.939545 | 0.949126 | 0.009581 |
| 2 | 002_chunk019 | 0.956090 | 0.927396 | -0.028694 |
| 3 | 004_chunk005 | 0.942761 | 0.964063 | 0.021302 |
| 4 | 005_chunk002 | 0.712337 | 0.313424 | -0.398913 |
| ... | ... | ... | ... | ... |
| 95 | 022_chunk009 | 0.985778 | 0.873117 | -0.112661 |
| 96 | 022_chunk015 | 0.971704 | 0.979342 | 0.007638 |
| 97 | 022_chunk027 | 0.967482 | 0.826178 | -0.141304 |
| 98 | 022_chunk035 | 0.908721 | 0.945594 | 0.036873 |



--- Summary Comparison for CPU ---
turbo_mean: 0.916271
large_mean: 0.916341
turbo_std: 0.096982
large_std: 0.092034
avg_difference: 0.000070
turbo_better_count: 52.000000
large_better_count: 48.000000


# Whisper Large on low quality audio

## GPU Test

## CPU Test

In [28]:
import os

LOW_AUDIO_FOLDER = r"C:\Users\ACER\Desktop\ASR\code-low\raw_audio"
low_audio_files =[f for f in os.listdir(LOW_AUDIO_FOLDER) if f.endswith(".mp3") and f.startswith("0")]


In [31]:

import time
import statistics

cpu_test_file = 'low_quality_test_whispercpp_large-v3.json'

NUM_THREADS = 8
text_files_dir = '../low_texts'
model_name = "large-v3"
saving_dir = '../json'

transcriptions_list = []
timings = [] 

for audio_f in low_audio_files:
    file_name = audio_f
    full_audio_path = os.path.join(LOW_AUDIO_FOLDER, file_name)


    print(f"\n🔊 Transcribing: {file_name}")
    start = time.perf_counter()
    transcript = transcribe_audio(
        audio_path=full_audio_path,
        exe_path=r"C:\Users\ACER\whisper.cpp\build\bin\Release\whisper-cli.exe",
        language="ar",
        threads=NUM_THREADS,
        text_output_dir=text_files_dir,
        model_name=model_name
    )
    end = time.perf_counter()
    # transcribe_audio returns None when no output file was produced
    transcript = (transcript or "").replace("\n", "")

    duration = end - start
    timings.append(duration)
    print(f"⏱  Took {duration:.2f} seconds")

    transcriptions_list.append({
        "transcription": transcript,
        "audio": file_name,
        "duration_seconds": round(duration, 2)
    })

# Save transcriptions with timing
output_transcriptions = os.path.join(saving_dir, cpu_test_file)
with open(output_transcriptions, 'w', encoding='utf-8') as f:
    json.dump(transcriptions_list, f, ensure_ascii=False, indent=4)
    print("File saved successfully")

# ---- Summary statistics ----
if timings:
    mean_time = statistics.mean(timings)
    min_time = min(timings)
    max_time = max(timings)
    print("\n📊 Summary:")
    print(f"Mean time: {mean_time:.2f} s")
    print(f"Min time : {min_time:.2f} s")
    print(f"Max time : {max_time:.2f} s")
else:
    print("\nNo valid audio files were processed.")





🔊 Transcribing: 001.mp3
⏱  Took 590.45 seconds

🔊 Transcribing: 002.mp3
⏱  Took 1583.96 seconds

🔊 Transcribing: 003.mp3
⏱  Took 912.35 seconds

🔊 Transcribing: 004.mp3
⏱  Took 347.04 seconds
File saved successfully

📊 Summary:
Mean time: 858.45 s
Min time : 347.04 s
Max time : 1583.96 s


In [9]:
import google.generativeai as genai
import os

from dotenv import load_dotenv
load_dotenv()

gemini_key = os.getenv("GEMINI_KEY")

# Configure gemini API key
genai.configure(api_key=gemini_key)

model = genai.GenerativeModel('gemini-2.0-flash',generation_config={
    "temperature": 0.2,
    
})


In [19]:
import ast

classes = ["comedy", "business", "cars", "tourism", "stories"]


def classify_text(text, language = "algerian"):
    
    prompt = (
        f"The following {language} text may include more than one speaker. "
        f"Understand the general topics, and based on that classify this text into: {classes}. "
        f"Allow multiple classes and return only a Python list "
        f"(response must start with [ and end with ]). \nText:\n{text}"
    )
    try:
        raw = model.generate_content(prompt).text
        result_classes = ast.literal_eval(raw)
        
        return result_classes
    except Exception:
        print("Error during text classification")
        raise
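
Chat models sometimes wrap a list in a code fence or add a label outside the requested set, which makes a bare `ast.literal_eval` call fail or return unexpected classes. A minimal sketch of a more forgiving parser, assuming the response contains one flat Python-style list; `parse_class_list` is a hypothetical helper, not part of the Gemini SDK:

```python
import ast
import re

ALLOWED = ["comedy", "business", "cars", "tourism", "stories"]


def parse_class_list(raw, allowed=ALLOWED):
    """Extract the first [...] list from raw text and keep only allowed labels."""
    match = re.search(r"\[.*?\]", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no list found in model response")
    parsed = ast.literal_eval(match.group(0))
    if not isinstance(parsed, list):
        raise ValueError("model response is not a list")
    return [label for label in parsed if label in allowed]


fence = "`" * 3  # build a code fence without typing it literally
raw_response = f"{fence}python\n['comedy', 'sports', 'stories']\n{fence}"
print(parse_class_list(raw_response))  # ['comedy', 'stories']
```

Inside `classify_text`, the parser would take the place of the direct `ast.literal_eval(raw)` call.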

### Classify the Turbo and Large-v3 results

In [11]:
import json

whisper_turbo_t = '../json/test_whispercpp_turbo.json'
whisper_large_t = '../json/test_whispercpp_large-v3.json'

with open(whisper_turbo_t, 'r') as f:
    turbo_transcriptions = json.load(f)

with open(whisper_large_t, 'r') as f:
    large_transcriptions = json.load(f)


In [12]:
base_refinement_prompt = """ You are an expert in Algerian Arabic (Darja).  
Your task is to proofread the following text.  

- understand and keep the original context, 
- Correct spelling mistakes.  
- Fix spacing (remove extra spaces, add missing spaces).  
- Do NOT change the sentence structure or meaning, 
  and do not add extra words that are not in the sentence. 
- Keep the natural Darja style and tone and emotions

Return only the corrected text. here's the text: """


In [26]:
for row in turbo_transcriptions[:30]:
    try:
        text = row['transcription']
        full_prompt = base_refinement_prompt + text
        refined = model.generate_content(full_prompt).text
        result = classify_text(refined)

        row['refined'] = refined
        row['classes'] = result

        print(result)
    except Exception as e:
        print(e)
        continue
# Write as UTF-8 explicitly so the Arabic text survives on any platform
with open(whisper_turbo_t, 'w', encoding='utf-8') as f:
    json.dump(turbo_transcriptions, f, ensure_ascii=False)

['comedy', 'stories']
['business', 'stories']
['business', 'stories']
['business', 'stories']
['business', 'cars']
['comedy']
['tourism', 'stories']
['business']
['business', 'stories']
['business', 'comedy']
['comedy', 'stories']
['cars', 'business']
['business', 'stories']
['comedy', 'stories', 'business']
['business', 'stories']
['stories', 'business', 'comedy']
['stories', 'comedy']
['stories']
['business', 'stories']
['business', 'comedy']
['business']
['business', 'stories']
['business', 'stories']
['stories']
['comedy', 'business']
['comedy', 'stories']
['business', 'cars', 'stories']
['comedy', 'business']
['business']
['stories', 'comedy']


In [25]:
for row in large_transcriptions[:30]:
    try:
        text = row['transcription']
        full_prompt = base_refinement_prompt + text
        refined = model.generate_content(full_prompt).text
        result = classify_text(refined)

        row['refined'] = refined
        row['classes'] = result

        print(result)
    except Exception as e:
        print(e)
        continue
# Write as UTF-8 explicitly so the Arabic text survives on any platform
with open(whisper_large_t, 'w', encoding='utf-8') as f:
    json.dump(large_transcriptions, f, ensure_ascii=False)

['comedy', 'stories']
['business', 'stories']
['business']
['business', 'stories']
['business', 'cars']
['business', 'stories']
['business', 'stories']
['business', 'stories']
['business', 'stories']
['business', 'comedy']
['comedy', 'stories']
['tourism']
['business', 'tourism']
['stories']
['comedy', 'business']
['stories', 'business']
['stories']
['stories']
['business', 'stories']
['stories', 'comedy']
['business']
['business', 'stories']
['business', 'stories']
['stories']
['business', 'comedy']
['comedy', 'stories']
['comedy', 'business']
['business', 'stories']
['business']
['comedy', 'stories']
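
To compare the two label outputs numerically, one option is the Jaccard overlap between the class sets for each file. This is a sketch, assuming both printed lists follow the same file order (both loops iterate over the first 30 rows of files built from the same sample); `jaccard` is a hypothetical helper name:

```python
def jaccard(labels_a, labels_b):
    """Size of the intersection over size of the union of two label sets."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 1.0  # two empty label sets count as full agreement
    return len(a & b) / len(a | b)


# First three (Turbo, Large-v3) label pairs from the printed outputs above
pairs = [
    (["comedy", "stories"], ["comedy", "stories"]),
    (["business", "stories"], ["business", "stories"]),
    (["business", "stories"], ["business"]),
]
scores = [jaccard(turbo, large) for turbo, large in pairs]
print(scores)  # [1.0, 1.0, 0.5]
```

Averaging these scores over all 30 rows would give a single agreement figure between the Turbo-based and Large-v3-based classifications.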
