<a href="https://colab.research.google.com/github/EdwardFang09/IEE4912/blob/main/whisper_benchmark_(documentation).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Benchmark (no early stopping)</h1>

The audio was recorded in a crowded lab to simulate real operating conditions.

In [None]:
#Library for quick-start on google colab
!pip install faster-whisper jiwer nvidia-ml-py3

Collecting faster-whisper
  Downloading faster_whisper-1.1.1-py3-none-any.whl.metadata (16 kB)
Collecting jiwer
  Downloading jiwer-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting nvidia-ml-py3
  Downloading nvidia-ml-py3-7.352.0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting ctranslate2<5,>=4.0 (from faster-whisper)
  Downloading ctranslate2-4.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting onnxruntime<2,>=1.14 (from faster-whisper)
  Downloading onnxruntime-1.21.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting av>=11 (from faster-whisper)
  Downloading av-14.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.12.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting coloredlogs (from onnxruntime<2,>=1.14->faster-whisper)
  Downloading co

In [None]:
import time
from faster_whisper import WhisperModel, BatchedInferencePipeline
import pandas as pd
import jiwer
import torch
from jiwer import transforms
import nvidia_smi

# Define audio file and its ground truth transcript
audio_file = "edwardmentimeter.m4a"  # Replace with your audio file
ground_truth_transcript = "sora, open mentimeter"  # Replace with the actual transcript

# Model sizes and compute types to benchmark
model_configs = [
    {"size": "tiny", "compute": "float32"},
    {"size": "tiny", "compute": "float16"},
    {"size": "tiny", "compute": "int8"},  # CPU INT8
    {"size": "tiny", "compute": "int8_float16"}, # GPU INT8
    {"size": "base", "compute": "float32"},
    {"size": "base", "compute": "float16"},
    {"size": "base", "compute": "int8_float16"}, # GPU INT8
    {"size": "small", "compute": "float32"},
    {"size": "small", "compute": "float16"},
    {"size": "small", "compute": "int8_float16"}, # GPU INT8
    {"size": "medium", "compute": "float32"},
    {"size": "medium", "compute": "float16"},
    {"size": "medium", "compute": "int8_float16"}, # GPU INT8
    {"size": "large-v2", "compute": "float32"},
    {"size": "large-v2", "compute": "float16"},
    {"size": "large-v2", "compute": "int8_float16"}, # GPU INT8
    {"size": "large-v3", "compute": "float32"},
    {"size": "large-v3", "compute": "float16"},
    {"size": "large-v3", "compute": "int8_float16"}, # GPU INT8
    {"size": "turbo", "compute": "float16"},  # Turbo only supports FP16
]

batch_sizes = [8, 16, 32]  # Experiment with batch sizes

results = []

for config in model_configs:
    for batch_size in batch_sizes:
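        # Note: batch_size is only passed to the turbo (BatchedInferencePipeline) branch below.
        # The other models ignore it, so their three runs per config are effectively repeats.
        if config["size"] in ["base", "small", "medium", "large-v2", "large-v3"]:
            effective_batch_size = min(batch_size, 4)  # Reduce for larger models
        elif config["size"] == "turbo":
            effective_batch_size = min(batch_size, 8)
        else:
            effective_batch_size = batch_size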

        start_time = time.time()  # Timed span includes model loading (and the download on a model's first run)

        try:
            device = "cuda" if config["compute"]!= "int8" else "cpu"
            model = WhisperModel(config["size"], device=device, compute_type=config["compute"])

            if device == "cuda":
                nvidia_smi.nvmlInit()
                handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
                # No memory info printed here

            if config["size"] == "turbo":
                batched_model = BatchedInferencePipeline(model=model)
                segments, info = batched_model.transcribe(audio_file, batch_size=effective_batch_size, language='en')
            else:
                segments, info = model.transcribe(audio_file, beam_size=5, language='en')

            segments_list = list(segments)

            transform = transforms.Compose([
                transforms.ToLowerCase(),
                transforms.RemovePunctuation(),
                transforms.RemoveMultipleSpaces(),
                transforms.Strip(),
            ])

            ground_truth_transformed = transform(ground_truth_transcript)
            predicted_transcript = " ".join([segment.text for segment in segments_list])
            predicted_transcript_transformed = transform(predicted_transcript)


            wer = jiwer.wer(ground_truth_transformed, predicted_transcript_transformed)

            end_time = time.time()
            inference_time = end_time - start_time

            # Query GPU memory only for CUDA runs; NVML reports device-wide usage, not per-process.
            # CPU runs record None instead of reusing a stale value from an earlier GPU run.
            used_memory_gb = None
            if device == "cuda":
                infoo = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
                used_memory_gb = infoo.used / (1024 ** 3)

            results.append({
                "model_size": config["size"],
                "compute_type": config["compute"],
                "batch_size": effective_batch_size,
                "inference_time": inference_time,
                "language": info.language,
                "language_probability": info.language_probability,
                "num_segments": len(segments_list),
                "wer": wer,
                "predicted_transcript": predicted_transcript, #added predicted transcript
                "Used memory (GB)": used_memory_gb,
            })

            print(f"Model: {config['size']}, Compute: {config['compute']}, Batch: {effective_batch_size}, Time: {inference_time:.2f}s, WER: {wer:.2f}, Text:, {predicted_transcript}, Used memory (GB): {used_memory_gb}") # added predicted transcript to the output

        except Exception as e:
            print(f"Error with Model: {config['size']}, Compute: {config['compute']}, Batch: {effective_batch_size}: {e}")
            results.append({
                "model_size": config["size"],
                "compute_type": config["compute"],
                "batch_size": effective_batch_size,
                "inference_time": "Error",
                "error": str(e),
                "wer": "Error",
                "predicted_transcript": "Error" # added predicted transcript in case of error
            })

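        finally:
            # Guard the cleanup: `model` is unbound if loading failed in this iteration
            if 'model' in locals():
                del model
            if 'batched_model' in locals():
                del batched_model
            torch.cuda.empty_cache()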

# Print or save results (e.g., to a CSV file)
df = pd.DataFrame(results)
df.to_csv("no_stop_whisper_benchmark.csv", index=False)
print(df)

tokenizer.json:   0%|          | 0.00/2.20M [00:00<?, ?B/s]

Model: tiny, Compute: float32, Batch: 8, Time: 3.88s, WER: 2.67, Text:,  Yeah, I think so.  Yeah, I think so., Used memory (GB): 0.5604248046875
Model: tiny, Compute: float32, Batch: 16, Time: 1.24s, WER: 3.00, Text:,  So, now we are going to connect the metter., Used memory (GB): 0.5604248046875
Model: tiny, Compute: float32, Batch: 32, Time: 0.98s, WER: 2.33, Text:,  I'm so happy.  So, I'm a 20m., Used memory (GB): 0.5604248046875
Model: tiny, Compute: float16, Batch: 8, Time: 1.12s, WER: 2.67, Text:,  I am so happy.  I am so happy., Used memory (GB): 0.5936279296875
Model: tiny, Compute: float16, Batch: 16, Time: 0.77s, WER: 2.33, Text:,  So, now we're going to connect it., Used memory (GB): 0.5311279296875
Model: tiny, Compute: float16, Batch: 32, Time: 0.78s, WER: 2.00, Text:,  Yes, sorry.  You are a 90-meter., Used memory (GB): 0.6248779296875
Model: tiny, Compute: int8, Batch: 8, Time: 3.49s, WER: 3.00, Text:,  So, now we are going to connect the metter., Used memory (GB): 0.624

model.bin:   0%|          | 0.00/145M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.20M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.31k [00:00<?, ?B/s]

vocabulary.txt:   0%|          | 0.00/460k [00:00<?, ?B/s]

Model: base, Compute: float32, Batch: 4, Time: 2.17s, WER: 3.33, Text:,  Sorry, I can't make it.  Sorry, I can't make it., Used memory (GB): 0.6873779296875
Model: base, Compute: float32, Batch: 4, Time: 1.08s, WER: 3.33, Text:,  Sorry, I can't make it.  Sorry, I can't make it., Used memory (GB): 0.6873779296875
Model: base, Compute: float32, Batch: 4, Time: 0.87s, WER: 3.33, Text:,  Sorry, I can't make it.  Sorry, I can't make it., Used memory (GB): 0.6873779296875
Model: base, Compute: float16, Batch: 4, Time: 0.59s, WER: 3.33, Text:,  Sorry, I can't make it.  Sorry, I can't make it., Used memory (GB): 0.5936279296875
Model: base, Compute: float16, Batch: 4, Time: 0.61s, WER: 3.33, Text:,  Sorry, I can't make it.  Sorry, I can't make it., Used memory (GB): 0.5936279296875
Model: base, Compute: float16, Batch: 4, Time: 0.59s, WER: 3.33, Text:,  Sorry, I can't make it.  Sorry, I can't make it., Used memory (GB): 0.5936279296875
Model: base, Compute: int8_float16, Batch: 4, Time: 0.88s,

model.bin:   0%|          | 0.00/484M [00:00<?, ?B/s]

vocabulary.txt:   0%|          | 0.00/460k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.37k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.20M [00:00<?, ?B/s]

Model: small, Compute: float32, Batch: 4, Time: 6.01s, WER: 0.00, Text:,  SORA OPEN MENTIMETER, Used memory (GB): 1.4061279296875
Model: small, Compute: float32, Batch: 4, Time: 1.93s, WER: 0.00, Text:,  SORA OPEN MENTIMETER, Used memory (GB): 1.3123779296875
Model: small, Compute: float32, Batch: 4, Time: 5.24s, WER: 0.00, Text:,  SORA OPEN MENTIMETER, Used memory (GB): 1.3123779296875
Model: small, Compute: float16, Batch: 4, Time: 1.45s, WER: 0.00, Text:,  SORA OPEN MENTIMETER, Used memory (GB): 0.9998779296875
Model: small, Compute: float16, Batch: 4, Time: 0.95s, WER: 0.00, Text:,  SORA OPEN MENTIMETER, Used memory (GB): 0.9686279296875
Model: small, Compute: float16, Batch: 4, Time: 0.97s, WER: 0.00, Text:,  SORA OPEN MENTIMETER, Used memory (GB): 0.9686279296875
Model: small, Compute: int8_float16, Batch: 4, Time: 2.68s, WER: 0.00, Text:,  SORA OPEN MENTIMETER, Used memory (GB): 0.6248779296875
Model: small, Compute: int8_float16, Batch: 4, Time: 1.74s, WER: 0.00, Text:,  SORA O

model.bin:   0%|          | 0.00/1.53G [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.20M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.26k [00:00<?, ?B/s]

vocabulary.txt:   0%|          | 0.00/460k [00:00<?, ?B/s]

Model: medium, Compute: float32, Batch: 4, Time: 13.76s, WER: 0.67, Text:,  Sora open 90 meter, Used memory (GB): 3.2811279296875
Model: medium, Compute: float32, Batch: 4, Time: 4.54s, WER: 0.67, Text:,  Sora open 90 meter, Used memory (GB): 3.2811279296875
Model: medium, Compute: float32, Batch: 4, Time: 4.93s, WER: 0.67, Text:,  Sora open 90 meter, Used memory (GB): 3.2811279296875
Model: medium, Compute: float16, Batch: 4, Time: 1.40s, WER: 0.67, Text:,  Sora open 90 meter, Used memory (GB): 1.9686279296875
Model: medium, Compute: float16, Batch: 4, Time: 1.63s, WER: 0.67, Text:,  Sora open 90 meter, Used memory (GB): 1.9686279296875
Model: medium, Compute: float16, Batch: 4, Time: 1.94s, WER: 0.67, Text:,  Sora open 90 meter, Used memory (GB): 1.9998779296875
Model: medium, Compute: int8_float16, Batch: 4, Time: 4.10s, WER: 0.67, Text:,  Sora open 90 meter, Used memory (GB): 1.2811279296875
Model: medium, Compute: int8_float16, Batch: 4, Time: 4.60s, WER: 0.67, Text:,  Sora open 9

model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

vocabulary.txt:   0%|          | 0.00/460k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.80k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.20M [00:00<?, ?B/s]

Model: large-v2, Compute: float32, Batch: 4, Time: 46.75s, WER: 1.00, Text:,  you, Used memory (GB): 6.3455810546875
Model: large-v2, Compute: float32, Batch: 4, Time: 15.78s, WER: 1.00, Text:,  you, Used memory (GB): 6.2205810546875
Model: large-v2, Compute: float32, Batch: 4, Time: 16.15s, WER: 1.00, Text:,  you, Used memory (GB): 6.7205810546875
Model: large-v2, Compute: float16, Batch: 4, Time: 9.67s, WER: 1.00, Text:,  you, Used memory (GB): 3.4393310546875
Model: large-v2, Compute: float16, Batch: 4, Time: 6.02s, WER: 1.00, Text:,  you, Used memory (GB): 3.5643310546875
Model: large-v2, Compute: float16, Batch: 4, Time: 4.89s, WER: 1.00, Text:,  you, Used memory (GB): 3.6580810546875
Model: large-v2, Compute: int8_float16, Batch: 4, Time: 9.86s, WER: 1.00, Text:,  you, Used memory (GB): 1.9393310546875
Model: large-v2, Compute: int8_float16, Batch: 4, Time: 11.78s, WER: 1.00, Text:,  you, Used memory (GB): 2.0018310546875
Model: large-v2, Compute: int8_float16, Batch: 4, Time: 11

model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.39k [00:00<?, ?B/s]

vocabulary.json:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

Model: large-v3, Compute: float32, Batch: 4, Time: 52.46s, WER: 1.00, Text:,  Sorry, open 20 meter., Used memory (GB): 6.2205810546875
Model: large-v3, Compute: float32, Batch: 4, Time: 13.67s, WER: 1.00, Text:,  Sorry, open 20 meter., Used memory (GB): 6.2205810546875
Model: large-v3, Compute: float32, Batch: 4, Time: 12.84s, WER: 1.00, Text:,  Sorry, open 20 meter., Used memory (GB): 6.4705810546875
Model: large-v3, Compute: float16, Batch: 4, Time: 7.40s, WER: 1.00, Text:,  Sorry, open 20 meter., Used memory (GB): 3.4393310546875
Model: large-v3, Compute: float16, Batch: 4, Time: 3.70s, WER: 1.00, Text:,  Sorry, open 20 meter., Used memory (GB): 3.4393310546875
Model: large-v3, Compute: float16, Batch: 4, Time: 4.06s, WER: 1.00, Text:,  Sorry, open 20 meter., Used memory (GB): 3.6580810546875
Model: large-v3, Compute: int8_float16, Batch: 4, Time: 7.64s, WER: 1.00, Text:,  Sorry, open 20 meter., Used memory (GB): 1.8768310546875
Model: large-v3, Compute: int8_float16, Batch: 4, Time

vocabulary.json:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.71M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.26k [00:00<?, ?B/s]

model.bin:   0%|          | 0.00/1.62G [00:00<?, ?B/s]

Model: turbo, Compute: float16, Batch: 8, Time: 10.50s, WER: 1.67, Text:,  I'm sorry about 20 meters, Used memory (GB): 2.0643310546875
Model: turbo, Compute: float16, Batch: 8, Time: 2.75s, WER: 1.67, Text:,  I'm sorry about 20 meters, Used memory (GB): 2.0643310546875
Model: turbo, Compute: float16, Batch: 8, Time: 2.15s, WER: 1.67, Text:,  I'm sorry about 20 meters, Used memory (GB): 2.0643310546875
   model_size  compute_type  batch_size  inference_time language  \
0        tiny       float32           8        3.881012       en   
1        tiny       float32          16        1.241115       en   
2        tiny       float32          32        0.983896       en   
3        tiny       float16           8        1.123452       en   
4        tiny       float16          16        0.767862       en   
5        tiny       float16          32        0.776328       en   
6        tiny          int8           8        3.490858       en   
7        tiny          int8          16        3.5

Conclusion: a large model is not necessary. The small model already reaches 100% accuracy (WER 0.00) on this clip; in float16 it uses roughly 1 GB of memory and finishes in under 2 seconds.
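Several WER values in the log above are greater than 1, which can look wrong at first. WER divides the number of word edits (substitutions, deletions, insertions) by the number of reference words, so a long wrong hypothesis against a three-word reference can exceed 1. The sketch below is a minimal stand-alone word-level edit distance, not the jiwer implementation; `word_error_rate` is an illustrative name, and the inputs are assumed to be already lowercased and stripped of punctuation, as the `transform` in the cell above does.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference prefix processed so far
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "sora open mentimeter"
print(round(word_error_rate(reference, "sora open mentimeter"), 2))             # 0.0
print(round(word_error_rate(reference, "sora open 90 meter"), 2))               # 0.67
print(round(word_error_rate(reference, "yeah i think so yeah i think so"), 2))  # 2.67
```

The last two values match the medium and tiny rows in the log: one substitution plus one insertion over three reference words gives 0.67, and eight edits over three reference words gives 2.67.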

<h1>With early stopping (all models)</h1>

In [None]:
import time
from faster_whisper import WhisperModel, BatchedInferencePipeline
import pandas as pd
import jiwer
import nvidia_smi
from jiwer import transforms
import os
import torch

# Define audio files and their ground truth transcripts (dictionary)
audio_ground_truth = {
    "edwardchrome.m4a": "hey sora, open chrome.",
    "edwardkahoot.m4a": "hey sora, open kahoot.",
    #... more audio files and transcripts
}

# Model sizes and compute types to benchmark
model_configs = [
    {"size": "tiny", "compute": "float32"},
    {"size": "tiny", "compute": "float16"},
    {"size": "tiny", "compute": "int8"},  # CPU INT8
    {"size": "base", "compute": "float32"},
    {"size": "base", "compute": "float16"},
    {"size": "small", "compute": "float32"},
    {"size": "small", "compute": "float16"},
    {"size": "medium", "compute": "float32"},
    {"size": "medium", "compute": "float16"},
    {"size": "turbo", "compute": "float16"},  # Turbo only supports FP16
    {"size": "large-v2", "compute": "float32"},
    {"size": "large-v2", "compute": "float16"},
    {"size": "large-v3", "compute": "float32"},
    {"size": "large-v3", "compute": "float16"},
]

results = []

print(f"CUDA available: {torch.cuda.is_available()}")

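for audio_file, ground_truth_transcript in audio_ground_truth.items():
    i = 0  # Reset per audio file so every file gets the same "one extra run" allowance
    for config in model_configs:
        start_time = time.time()  # Timed span includes model loading (and the download on a model's first run)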

        try:
            device = "cuda" if config["compute"]!= "int8" else "cpu"

            if device == "cuda":
                nvidia_smi.nvmlInit()
                handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
                # No memory info printed here

            model = WhisperModel(config["size"], device=device, compute_type=config["compute"])

            if config["size"] == "turbo":
                batched_model = BatchedInferencePipeline(model=model)
                segments, info = batched_model.transcribe(audio_file, language='en')
            else:
                segments, info = model.transcribe(audio_file, beam_size=5, language='en')

            segments_list = list(segments)

            transform = transforms.Compose([
                transforms.ToLowerCase(),
                transforms.RemovePunctuation(),
                transforms.RemoveMultipleSpaces(),
                transforms.Strip(),
            ])

            ground_truth_transformed = transform(ground_truth_transcript)
            predicted_transcript = " ".join([segment.text for segment in segments_list])
            predicted_transcript_transformed = transform(predicted_transcript)

            wer = jiwer.wer(ground_truth_transformed, predicted_transcript_transformed)

            end_time = time.time()
            inference_time = end_time - start_time

            # Query GPU memory only for CUDA runs; NVML reports device-wide usage, not per-process.
            # CPU runs record None instead of reusing a stale value from an earlier GPU run.
            used_memory_gb = None
            if device == "cuda":
                infoo = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
                used_memory_gb = infoo.used / (1024 ** 3)

            results.append({
                "audio_file": audio_file,
                "model_size": config["size"],
                "compute_type": config["compute"],
                "inference_time": inference_time,
                "language": info.language,
                "language_probability": info.language_probability,
                "num_segments": len(segments_list),
                "wer": wer,
                "predicted_transcript": predicted_transcript,
                "Used memory (GB)": used_memory_gb
            })

            print(
                f"Audio: {audio_file}, Model: {config['size']}, Compute: {config['compute']}, Time: {inference_time:.2f}s, WER: {wer:.2f}, Text:, {predicted_transcript}, Used memory (GB): {used_memory_gb}"
            )

            if wer == 0:
                if i == 0:
                    i += 1  # Allow one more configuration after the first perfect result, as a safety check
                else:
                    print("Skipping remaining models for this audio")
                    break

        except Exception as e:
            print(
                f"Error with Audio: {audio_file}, Model: {config['size']}, Compute: {config['compute']}: {e}"
            )
            results.append({
                "audio_file": audio_file,
                "model_size": config["size"],
                "compute_type": config["compute"],
                "inference_time": "Error",
                "error": str(e),
                "wer": "Error",
                "predicted_transcript": "Error",
            })

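        finally:
            # Guard the cleanup: either name is unbound if loading failed in this iteration
            if 'model' in locals():
                del model
            if 'batched_model' in locals():
                del batched_model
            torch.cuda.empty_cache()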

# Print or save results (e.g., to a CSV file)
df = pd.DataFrame(results)
df.to_csv("whisper_benchmark_results.csv", index=False)
print(df)

CUDA available: True
Audio: edwardjamboard.wav, Model: tiny, Compute: float32, Time: 1.24s, WER: 0.50, Text:,  Hesora Open Jamboard., Used memory (GB): 0.5643310546875
Audio: edwardjamboard.wav, Model: tiny, Compute: float16, Time: 0.73s, WER: 0.75, Text:,  Hesora Open Jambod., Used memory (GB): 0.5330810546875
Audio: edwardjamboard.wav, Model: tiny, Compute: int8, Time: 1.83s, WER: 0.75, Text:,  Hesora Open Jambord., Used memory (GB): 0.5330810546875
Audio: edwardjamboard.wav, Model: base, Compute: float32, Time: 1.38s, WER: 0.25, Text:,  Hey Sora, Open Jambore!, Used memory (GB): 0.6893310546875
Audio: edwardjamboard.wav, Model: base, Compute: float16, Time: 0.59s, WER: 0.25, Text:,  Hey Sora, Open Jambore!, Used memory (GB): 0.5955810546875
Audio: edwardjamboard.wav, Model: small, Compute: float32, Time: 4.11s, WER: 0.00, Text:,  Hey Sora! Open Jamboard!, Used memory (GB): 1.3143310546875
Audio: edwardjamboard.wav, Model: small, Compute: float16, Time: 0.98s, WER: 0.00, Text:,  Hey 
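The early-stopping rule in the cell above (stop at the second configuration that reaches WER 0, keeping one extra run as a safety check) can be isolated as a small function. This is a sketch of that rule only, not of the benchmark itself; `runs_before_stop` and `extra_runs` are illustrative names, and the WER list is copied from the log lines above.

```python
def runs_before_stop(wers, extra_runs=1):
    """Return how many configurations are evaluated before early stopping triggers.

    Stopping happens once more than `extra_runs` configurations have reached WER 0.
    If that never happens, every configuration is evaluated.
    """
    perfect_seen = 0
    for count, wer in enumerate(wers, start=1):
        if wer == 0:
            perfect_seen += 1
            if perfect_seen > extra_runs:
                return count
    return len(wers)


# WER per configuration for edwardjamboard.wav, in the order logged above:
# tiny (float32, float16, int8), base (float32, float16), small (float32, float16)
logged_wers = [0.50, 0.75, 0.75, 0.25, 0.25, 0.00, 0.00]
print(runs_before_stop(logged_wers))  # 7
```

With these values the loop stops after the seventh configuration (small, float16), so the medium, turbo, and large models are never loaded for this file.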

<h1>Turbo model (special model - optimized large)</h1>

In [None]:
import time
from faster_whisper import WhisperModel, BatchedInferencePipeline
import pandas as pd
import jiwer
import nvidia_smi
from jiwer import transforms
import os
import torch

# Define audio files and their ground truth transcripts (dictionary)
audio_ground_truth = {
    "edwardchrome.m4a": "hey sora, open chrome.",
    "edwardkahoot.m4a": "hey sora, open kahoot.",
    #... more audio files and transcripts
}

# Model sizes and compute types to benchmark
model_configs = [
    {"size": "turbo", "compute": "float16"},  # Turbo only supports FP16
]

results = []

print(f"CUDA available: {torch.cuda.is_available()}")

for audio_file, ground_truth_transcript in audio_ground_truth.items():
    for config in model_configs:
        start_time = time.time()

        try:
            device = "cuda" if config["compute"]!= "int8" else "cpu"

            if device == "cuda":
                nvidia_smi.nvmlInit()
                handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
                # No memory info printed here

            model = WhisperModel(config["size"], device=device, compute_type=config["compute"])

            if config["size"] == "turbo":
                batched_model = BatchedInferencePipeline(model=model)
                segments, info = batched_model.transcribe(audio_file, language='en')
            else:
                segments, info = model.transcribe(audio_file, beam_size=5, language='en')

            segments_list = list(segments)

            transform = transforms.Compose([
                transforms.ToLowerCase(),
                transforms.RemovePunctuation(),
                transforms.RemoveMultipleSpaces(),
                transforms.Strip(),
            ])

            ground_truth_transformed = transform(ground_truth_transcript)
            predicted_transcript = " ".join([segment.text for segment in segments_list])
            predicted_transcript_transformed = transform(predicted_transcript)

            wer = jiwer.wer(ground_truth_transformed, predicted_transcript_transformed)

            end_time = time.time()
            inference_time = end_time - start_time

            if device == "cuda":
                infoo = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)

            results.append({
                "audio_file": audio_file,
                "model_size": config["size"],
                "compute_type": config["compute"],
                "inference_time": inference_time,
                "language": info.language,
                "language_probability": info.language_probability,
                "num_segments": len(segments_list),
                "wer": wer,
                "predicted_transcript": predicted_transcript,
                "Used memory (GB)": infoo.used / (1024 ** 3)
            })

            print(
                f"Audio: {audio_file}, Model: {config['size']}, Compute: {config['compute']}, Time: {inference_time:.2f}s, WER: {wer:.2f}, Text:, {predicted_transcript}, Used memory (GB): {infoo.used / (1024 ** 3)}"
            )

            if wer == 0:
                print(f"Skipping remaining models for this audio")
                break

        except Exception as e:
            print(
                f"Error with Audio: {audio_file}, Model: {config['size']}, Compute: {config['compute']}: {e}"
            )
            results.append({
                "audio_file": audio_file,
                "model_size": config["size"],
                "compute_type": config["compute"],
                "inference_time": "Error",
                "error": str(e),
                "wer": "Error",
                "predicted_transcript": "Error",
            })

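        finally:
            # Guard the cleanup: either name is unbound if loading failed in this iteration
            if 'model' in locals():
                del model
            if 'batched_model' in locals():
                del batched_model
            torch.cuda.empty_cache()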

# Print or save results (e.g., to a CSV file)
df = pd.DataFrame(results)
df.to_csv("whisper_benchmark_results_turbo.csv", index=False)
print(df)

CUDA available: True
Audio: edwardjamboard.wav, Model: turbo, Compute: float16, Time: 2.27s, WER: 0.00, Text:,  Hey Sora, open Jamboard., Used memory (GB): 2.0955810546875
Skipping remaining models for this audio
Audio: edwardchrome.m4a, Model: turbo, Compute: float16, Time: 2.29s, WER: 0.00, Text:,  Hey Sora, open Chrome., Used memory (GB): 2.0643310546875
Skipping remaining models for this audio
Audio: edwardkahoot.m4a, Model: turbo, Compute: float16, Time: 2.14s, WER: 0.25, Text:,  Hey Sora, open kahut., Used memory (GB): 2.0643310546875
           audio_file model_size compute_type  inference_time language  \
0  edwardjamboard.wav      turbo      float16        2.268531       en   
1    edwardchrome.m4a      turbo      float16        2.286799       en   
2    edwardkahoot.m4a      turbo      float16        2.141664       en   

   language_probability  num_segments   wer       predicted_transcript  \
0                     1             1  0.00   Hey Sora, open Jamboard.   
1       

Conclusion: the turbo model is not used because it consumes more than 2 GB of memory, which exceeds the capacity of the Jetson Nano 2GB.
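The memory argument can be written as a filter over benchmark rows. The sketch below is a rough illustration under stated assumptions: the 2 GB budget and the names `MEMORY_BUDGET_GB` and `fits_budget` are illustrative, and the rows are float16 results copied from the no-early-stopping log. Those figures are device-wide NVML readings on the Colab GPU, so they only approximate what the same model would need on the target board.

```python
MEMORY_BUDGET_GB = 2.0  # assumed budget, taken from the 2 GB board mentioned above

# One float16 row per model, copied from the no-early-stopping log.
rows = [
    {"model_size": "small", "wer": 0.00, "time_s": 0.95, "used_gb": 0.9686279296875},
    {"model_size": "medium", "wer": 0.67, "time_s": 1.40, "used_gb": 1.9686279296875},
    {"model_size": "large-v3", "wer": 1.00, "time_s": 3.70, "used_gb": 3.4393310546875},
    {"model_size": "turbo", "wer": 1.67, "time_s": 2.15, "used_gb": 2.0643310546875},
]


def fits_budget(row, budget_gb=MEMORY_BUDGET_GB):
    """True if the measured memory use is strictly below the budget."""
    return row["used_gb"] < budget_gb


within_budget = [r["model_size"] for r in rows if fits_budget(r)]
perfect_within_budget = [
    r["model_size"] for r in rows if fits_budget(r) and r["wer"] == 0
]
print(within_budget)          # ['small', 'medium']
print(perfect_within_budget)  # ['small']
```

Medium sits just under the budget and leaves almost no headroom, which is consistent with choosing the small model.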

<h1>Trained benchmark</h1>

In [None]:
!pip install transformers torch jiwer datasets

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting 

In [None]:

import time
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import jiwer
from jiwer import transforms
import librosa
import numpy as np

# Load the processor and model
processor = WhisperProcessor.from_pretrained("EdwardFang09/whisper-base-TA-2025-v2")
model = WhisperForConditionalGeneration.from_pretrained("EdwardFang09/whisper-base-TA-2025").to("cuda")

# Load your audio data and ground truth transcripts (replace with your data)
audio_ground_truth = {
    "edwardchrome.m4a": "hey sora, open chrome.",
    "edwardkahoot.m4a": "hey sora, open kahoot.",
    # ... more audio files and transcripts
}

results = []

for audio_file, ground_truth_transcript in audio_ground_truth.items():
    start_time = time.time()

    try:
        # Load audio data using librosa
        audio_data, sr = librosa.load(audio_file, sr=16000)  # Load audio at 16kHz

        # Use the audio data (NumPy array) as input to the processor
        input_features = processor(audio_data, sampling_rate=sr, return_tensors="pt").input_features.to("cuda")

        # Generate predictions
        predicted_ids = model.generate(input_features)
        predicted_transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

        # Calculate WER
        transform = transforms.Compose([
            transforms.ToLowerCase(),
            transforms.RemovePunctuation(),
            transforms.RemoveMultipleSpaces(),
            transforms.Strip(),
        ])

        ground_truth_transformed = transform(ground_truth_transcript)
        predicted_transcript_transformed = transform(predicted_transcript)
        wer = jiwer.wer(ground_truth_transformed, predicted_transcript_transformed)

        end_time = time.time()
        inference_time = end_time - start_time

        # GPU memory is not recorded here: `device`, `handle`, and `nvidia_smi` are not defined
        # in this cell, and the result rows below have no memory column.

        results.append({
            "audio_file": audio_file,
            "inference_time": inference_time,
            "wer": wer,
            "predicted_transcript": predicted_transcript,
        })

        print(f"Audio: {audio_file}, Time: {inference_time:.2f}s, WER: {wer:.2f}, Text: {predicted_transcript}")

    except Exception as e:
        print(f"Error with Audio: {audio_file}: {e}")
        results.append({
            "audio_file": audio_file,
            "inference_time": "Error",
            "error": str(e),
            "wer": "Error",
            "predicted_transcript": "Error",
        })

# Save or print the results
import pandas as pd
df = pd.DataFrame(results)
df.to_csv("huggingface_whisper_benchmark_trained.csv", index=False)
print(df)

Audio: edwardjamboard.wav, Time: 0.11s, WER: 0.75, Text: sora


  audio_data, sr = librosa.load(audio_file, sr=16000)  # Load audio at 16kHz
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


Audio: edwardchrome.m4a, Time: 0.21s, WER: 1.00, Text: kahoot
Audio: edwardkahoot.m4a, Time: 0.20s, WER: 0.75, Text: sora


           audio_file  inference_time   wer predicted_transcript
0  edwardjamboard.wav        0.111058  0.75                 sora
1    edwardchrome.m4a        0.212347  1.00               kahoot
2    edwardkahoot.m4a        0.201498  0.75                 sora


On the same dataset, note how the fine-tuned model differs from the stock models above.

Solutions:
1. Fine-tune the model further so that accuracy improves; an inference time under 2 seconds is already relatively fast.
  - Example misrecognitions to target: kahut, kehut, kuhut, etc.
2. Collect much more data for training and testing, recorded with a poor-quality microphone.


TODO: write code that loads all the audio data inside a folder.
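One possible shape for that loader is sketched below. It builds the same `audio_ground_truth` dictionary used in the cells above. It assumes a layout the notebook does not specify: each audio file has a sidecar transcript with the same name and a `.txt` extension. The function name `collect_audio_ground_truth`, the extension set, and the folder name in the usage comment are all hypothetical.

```python
from pathlib import Path

# Assumed set of audio formats; extend as needed.
AUDIO_EXTENSIONS = {".m4a", ".wav", ".mp3", ".flac"}


def collect_audio_ground_truth(folder):
    """Map each audio file in `folder` to the text of its sidecar .txt transcript.

    Files without a matching transcript are skipped with a message.
    """
    pairs = {}
    for audio_path in sorted(Path(folder).iterdir()):
        if audio_path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        transcript_path = audio_path.with_suffix(".txt")
        if not transcript_path.exists():
            print(f"Skipping {audio_path.name}: no transcript found")
            continue
        pairs[str(audio_path)] = transcript_path.read_text(encoding="utf-8").strip()
    return pairs


# Usage (hypothetical folder name):
# audio_ground_truth = collect_audio_ground_truth("recordings")
```

The returned dictionary can replace the hand-written `audio_ground_truth` in any of the benchmark cells without further changes, since the keys are file paths and the values are transcripts.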

optimized

In [None]:
torch.cuda.get_device_properties(0)

_CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15095MB, multi_processor_count=40, uuid=5c0728cf-d769-745b-017f-61af5f802743, L2_cache_size=4MB)

In [None]:
torch.cuda.get_device_properties(0).total_memory

15828320256

Finding: turbo picks up the Indonesian accent even though the language is set to English in the code (language='en') and the speaker is speaking English.