# Minimal ASR Prototype — Whisper large-v3-turbo

**Model:** [`openai/whisper-large-v3-turbo`](https://huggingface.co/openai/whisper-large-v3-turbo) (809M params, ~6 GB VRAM, multilingual)

**Why this model:** roughly 6x faster than large-v3 while staying within 1-2% WER of it, covers about 100 languages, and its ~6 GB VRAM footprint leaves about 10 GB of headroom on a 16 GB T4.
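
The headroom figure can be sanity-checked with simple arithmetic. The sketch below assumes fp16 weights (2 bytes per parameter, matching the `float16` dtype the notebook uses on GPU) and a T4 with a nominal 16 GB of memory; the helper names are illustrative only. The weights alone come to roughly 1.6 GB, so the ~6 GB figure presumably also covers activations, decoding caches, and framework overhead.

```python
T4_VRAM_GB = 16  # assumption: nominal memory of a Colab T4


def weight_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Memory taken by the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


def headroom_gb(used_gb: float, total_gb: float = T4_VRAM_GB) -> float:
    """Memory left on the card after `used_gb` is allocated."""
    return total_gb - used_gb


print(f"fp16 weights: {weight_gb(809_000_000):.2f} GB")
print(f"headroom at ~6 GB usage: {headroom_gb(6):.0f} GB")
```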

**Alternatives considered:**

| Model | Params | VRAM | Speed | Trade-off |
|---|---|---|---|---|
| **Whisper large-v3-turbo** | 809M | ~6 GB | ~6x RT | Best speed/accuracy balance for T4 |
| Whisper large-v3 | 1.54B | ~10 GB | ~1x RT | Most accurate Whisper, but slow on T4 |
| distil-large-v3 | 756M | ~5 GB | ~6.3x RT | Slightly faster, English-only |
| Canary Qwen 2.5B | 2.5B | ~8 GB | ~418 RTFx | #1 on the Open ASR Leaderboard, requires the NeMo toolkit |
| Parakeet TDT 1.1B | 1.1B | ~4 GB | ~2000 RTFx | Fastest, English-only, requires the NeMo toolkit |

Set **Runtime > Change runtime type > T4 GPU** before running.

In [None]:
!pip install -q transformers accelerate soundfile librosa datasets

In [None]:
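import time

import librosa
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Use the GPU with half precision when available; fall back to CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
print(f"Device: {device}")

model_id = "openai/whisper-large-v3-turbo"
# `dtype` replaces the deprecated `torch_dtype` argument in recent transformers releases.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, dtype=dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono audio
WINDOW_SECONDS = 30   # Whisper's encoder sees at most 30 s per pass

def transcribe(audio_path, task="transcribe"):
    """Transcribe (or translate to English) an audio file of any length."""
    audio, _ = librosa.load(audio_path, sr=SAMPLE_RATE)
    # By default the processor truncates input to 30 s. Longer clips need
    # long-form decoding: untruncated features plus timestamp prediction.
    long_form = len(audio) > WINDOW_SECONDS * SAMPLE_RATE
    extra = {"truncation": False, "padding": "longest"} if long_form else {}
    inputs = processor(
        audio, sampling_rate=SAMPLE_RATE, return_tensors="pt",
        return_attention_mask=True, **extra
    ).to(device, dtype=dtype)
    with torch.no_grad():
        predicted_ids = model.generate(**inputs, task=task, return_timestamps=long_form)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print("Model ready")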

Device: cuda:0


Model ready
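
Whisper's encoder processes at most 30 seconds of audio per pass, so recordings longer than that have to be split into windows or handled by a long-form decoding mode. As a rough illustration of the windowing idea only, the following sketch computes overlapping window boundaries in samples. The function name and the 5-second overlap are assumptions made for this example; it is not a description of what `transformers` does internally.

```python
def window_bounds(n_samples, sample_rate=16000, window_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than window_s")
    bounds = []
    start = 0
    while True:
        end = min(start + window, n_samples)
        bounds.append((start, end))
        if end >= n_samples:  # the last window reaches the end of the audio
            break
        start += step
    return bounds


# A 62.5 s clip at 16 kHz, the length of the LibriSpeech sample used in this notebook.
for start, end in window_bounds(int(62.5 * 16000)):
    print(f"{start / 16000:5.1f}s -> {end / 16000:5.1f}s")
```

Each window could then be transcribed separately and the overlapping text merged, which is the hard part this sketch leaves out.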


## Test with a sample audio file

In [None]:
import soundfile as sf
from datasets import load_dataset

# One long-form LibriSpeech sample (about a minute of read English speech).
ds = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = ds[0]["audio"]
sf.write("sample.wav", sample["array"], sample["sampling_rate"])

audio_len = len(sample["array"]) / sample["sampling_rate"]
# The first call may include one-time GPU warm-up, so later calls can be faster.
start = time.perf_counter()
text = transcribe("sample.wav")
elapsed = time.perf_counter() - start

# RTFx = seconds of audio per second of processing; higher is faster.
print(f"Audio: {audio_len:.1f}s | Transcribed in: {elapsed:.1f}s | RTFx: {audio_len/elapsed:.1f}x")
print(f"\n{text[:500]}...")



Audio: 62.5s | Transcribed in: 25.2s | RTFx: 2.5x

 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all,...
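
The RTFx value printed above is the inverse real-time factor: seconds of audio divided by seconds of processing. A minimal sketch of that calculation, using the numbers from the recorded run, plus a hypothetical helper that projects how long a longer recording would take at the same rate (an extrapolation, not a measurement):

```python
def rtfx(audio_seconds, elapsed_seconds):
    """Inverse real-time factor: how many seconds of audio are processed per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


def projected_seconds(audio_seconds, rate):
    """Estimated processing time for `audio_seconds` of audio at a given RTFx."""
    return audio_seconds / rate


rate = rtfx(62.5, 25.2)  # figures from the run recorded above
print(f"RTFx: {rate:.1f}x")
print(f"One hour of audio at this rate: ~{projected_seconds(3600, rate) / 60:.0f} min")
```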


## Transcribe an uploaded file

In [None]:
from google.colab import files

uploaded = files.upload()
filename = list(uploaded.keys())[0]
print(f"Uploaded: {filename}")
print(f"\nTranscription: {transcribe(filename)}")

Saving download.wav to download.wav
Uploaded: download.wav

Transcription:  Artificial intelligence has transformed many fields in recent years. Natural language processing now powers search engines, chatbots, and translation tools. Computer vision enables self-driving cars and medical image analysis. Speech synthesis, like this demo, can generate human-sounding audio from plain text. The progress has been remarkable, but significant challenges remain around safety, fairness, and ensuring these systems work reliably for everyone.


## Record from browser microphone
Uses the browser's `getUserMedia` and `MediaRecorder` APIs through Colab's `eval_js` bridge, then converts the recording to 16 kHz mono WAV with `ffmpeg`.

Credit: [korakot/record.py gist](https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be)

In [None]:
import subprocess
from base64 import b64decode

from google.colab import output
from IPython.display import Audio, Javascript, display

# Browser-side recorder: captures `time` milliseconds of microphone audio and
# resolves with the recording encoded as a base64 data URL.
RECORD_JS = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = () => resolve(reader.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  const recorder = new MediaRecorder(stream)
  const chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.onstop = async () => {
    const blob = new Blob(chunks)
    const text = await b2text(blob)
    resolve(text)
  }
  recorder.start()
  await sleep(time)
  recorder.stop()
  stream.getTracks().forEach(t => t.stop())
})
"""

def record_mic(seconds=5):
    """Record from the browser microphone and return the path to a 16 kHz mono WAV."""
    display(Javascript(RECORD_JS))
    print(f"Recording {seconds}s... speak now!")
    data = output.eval_js(f'record({seconds * 1000})')
    # The result is a data URL ("data:...;base64,<payload>"); keep only the payload.
    raw = b64decode(data.split(',')[1])
    with open('_raw.webm', 'wb') as f:
        f.write(raw)
    # Convert the browser recording to 16 kHz mono WAV; raise if ffmpeg fails.
    subprocess.run(
        ['ffmpeg', '-y', '-i', '_raw.webm', '-ar', '16000', '-ac', '1', 'mic.wav'],
        capture_output=True, check=True
    )
    print("Recording saved.")
    return 'mic.wav'
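
The recorder hands the audio back to Python as a base64 data URL, which `record_mic` splits on the first comma. The sketch below shows that decoding step in isolation with a little validation added. The function name `decode_data_url` and the sample payload are hypothetical and are not part of the notebook or of Colab's API.

```python
from base64 import b64decode


def decode_data_url(data_url):
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, b64decode(payload)


# Tiny stand-in payload; a real recording would carry the audio bytes instead.
mime, raw = decode_data_url("data:audio/webm;base64,aGVsbG8=")
print(mime, raw)
```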

In [None]:
audio_path = record_mic(seconds=5)
display(Audio(audio_path, autoplay=True))
print(f"\nTranscription: {transcribe(audio_path)}")


Recording 5s... speak now!
Recording saved.



Transcription:  Hi, Good morning, this is a short audio test.


In [None]:
# longer recording
audio_path = record_mic(seconds=15)
display(Audio(audio_path, autoplay=True))
print(f"\nTranscription: {transcribe(audio_path)}")


Recording 15s... speak now!
Recording saved.



Transcription:  Hi, Good morning. This is a short audio test recorded in a quiet end moment. The purpose of this example is to evaluate speech recognition accuracy.


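## Language detection & translation

When no language is specified, Whisper detects the spoken language automatically. The recording used here is English, so `task='translate'` mostly reproduces the transcription. According to the OpenAI Whisper README, the turbo model was not trained for translation, so results for non-English audio may be unreliable.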

In [None]:
print(f"Transcription: {transcribe(audio_path, task='transcribe')}")
print(f"English translation: {transcribe(audio_path, task='translate')}")

Transcription:  Hi, Good morning. This is a short audio test recorded in a quiet end moment. The purpose of this example is to evaluate speech recognition accuracy.
English translation:  Hi, good morning. This is a short audio test recorded in a quiet environment. The purpose of this example is to evaluate speech recognition accuracy.
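
The two outputs above differ in a single phrase ("quiet end moment" versus "quiet environment"), which makes them a convenient way to illustrate how word error rate (WER) is computed. The sketch below treats the second output as the reference purely for illustration; no verified ground-truth transcript exists for this recording. The normalization (lowercasing and stripping punctuation) is a simplifying assumption, and the function names are hypothetical.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


reference = ("Hi, good morning. This is a short audio test recorded in a quiet "
             "environment. The purpose of this example is to evaluate speech "
             "recognition accuracy.")
hypothesis = ("Hi, Good morning. This is a short audio test recorded in a quiet "
              "end moment. The purpose of this example is to evaluate speech "
              "recognition accuracy.")
print(f"WER: {wer(reference, hypothesis):.1%}")
```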
