# Finding the best params
## This notebook looks for parameters that work well and records discoveries along the way

### Whisper Tiny (Hugging Face)

- **Model**: `openai/whisper-tiny` (pre-trained, multilingual)
- **Tasks**: Automatic speech recognition (ASR) and speech translation
- **Architecture**: Encoder–decoder Transformer (sequence-to-sequence)
- **Input pipeline**:
  - Audio resampled to **16 kHz**
  - 80-channel log-magnitude **Mel spectrogram**
  - 25 ms window, 10 ms stride
  - Spectrogram normalized to **[-1, 1]** with near-zero mean
- **References**:
  - Hugging Face: https://huggingface.co/openai/whisper-tiny
  - Paper: https://arxiv.org/abs/2212.04356
  - GitHub: https://github.com/openai/whisper
  - Overview: https://en.wikipedia.org/wiki/Whisper_(speech_recognition_system)
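
The input pipeline numbers above translate into concrete array sizes. A small worked example, assuming the 30 s maximum input length mentioned in the Whisper model card (the variable names are only for illustration):

```python
# Worked arithmetic for the Whisper input pipeline described above.
SAMPLE_RATE_HZ = 16_000   # audio is resampled to 16 kHz
WINDOW_MS = 25            # analysis window length
STRIDE_MS = 10            # hop between consecutive windows
N_MELS = 80               # Mel channels
MAX_AUDIO_S = 30          # assumption: maximum input length per forward pass

window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000   # samples per window
hop_samples = SAMPLE_RATE_HZ * STRIDE_MS // 1000      # samples per hop
frames = MAX_AUDIO_S * SAMPLE_RATE_HZ // hop_samples  # spectrogram frames per 30 s

print("window:", window_samples, "samples")
print("hop:", hop_samples, "samples")
print("spectrogram shape:", (N_MELS, frames))
```

So a full 30 s input becomes an 80 x 3000 log-Mel spectrogram before it reaches the encoder.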

### Try the Hugging Face pipeline
The Hugging Face `pipeline` is a high-level wrapper around feature extraction, the model, and decoding.
- Model docs: https://huggingface.co/docs/transformers/en/model_doc/whisper
- Pipeline usage: https://huggingface.co/docs/transformers/en/model_doc/whisper?usage=Pipeline


https://huggingface.co/openai/whisper-tiny
Long-Form Transcription
"The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers pipeline method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`."
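
To get a feel for what chunking with a stride does, here is a simplified sketch that only computes the time spans of overlapping chunks. It is not the Transformers implementation (which works on sample indices and also merges the overlapping predictions); `chunk_spans` is a hypothetical helper written for this note.

```python
def chunk_spans(duration_s: float,
                chunk_length_s: float = 30.0,
                stride_s: tuple[float, float] = (4.0, 4.0)) -> list[tuple[float, float]]:
    """Return (start, end) times of overlapping chunks covering the audio."""
    left, right = stride_s
    # each new chunk starts this far after the previous one
    step = chunk_length_s - left - right
    if step <= 0:
        raise ValueError("chunk length must exceed the total stride")
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        spans.append((start, end))
        if start + chunk_length_s >= duration_s:
            break  # this chunk already reaches the end of the audio
        start += step
    return spans

print(chunk_spans(60.0))   # a one-minute clip needs several overlapping chunks
print(chunk_spans(12.84))  # a clip shorter than the chunk length
```

In this sketch, any clip shorter than `chunk_length_s` ends up in a single chunk, so the stride values never come into play for it.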

https://huggingface.co/openai/whisper-large-v3
"By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the chunk_length_s parameter to the pipeline. For large-v3, a chunk length of 30-seconds is optimal." 

### ASR 
Word error rate (WER) on English audio seems to be a good measure of success.
https://www.youtube.com/watch?v=TksaY_FDgnk
https://huggingface.co/docs/transformers/en/tasks/asr
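
For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal pure-Python sketch is below; the `normalize` step (lowercasing and dropping punctuation) is an assumption about how text should be compared, and the example sentences are made up. In practice a library such as `jiwer` would normally be used instead.

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dynamic-programming edit distance, keeping only the previous row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# made-up example: three reference words come out as two different words
print(wer("There's only yong tou fu here.", "There's only young tofu here."))
```

The example counts 3 errors (2 substitutions and 1 deletion) against 6 reference words.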


In [3]:
import json, subprocess
from pathlib import Path

audio_files = ["../../Sample 1.mp3", "../../Sample 2.mp3", "../../Sample 3.mp3"]

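def ffprobe_info(path: str) -> dict:
    """Return duration (s) and bitrate (kbps) of the first audio stream via ffprobe."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",  # audio only, so embedded cover art is skipped
        # also request the container bitrate so the fallback below has a value
        "-show_entries", "format=duration,bit_rate:stream=bit_rate",
        "-of", "json",
        path,
    ]
    data = json.loads(subprocess.check_output(cmd))
    duration = float(data["format"]["duration"])
    # prefer the audio stream's bitrate (bits/sec); fall back to the container's
    streams = data.get("streams", [])
    stream_bps = streams[0].get("bit_rate") if streams else None
    bit_rate_bps = int(stream_bps or data["format"]["bit_rate"])
    return {"duration_sec": duration, "bitrate_kbps": bit_rate_bps // 1000}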

info = {Path(f).name: ffprobe_info(f) for f in audio_files}

print("Per file:", info)
avg_duration = sum(v["duration_sec"] for v in info.values()) / len(info)
avg_kbps = sum(v["bitrate_kbps"] for v in info.values()) / len(info)
print("Average duration (sec):", avg_duration)
print("Average bitrate (kbps):", avg_kbps)


Per file: {'Sample 1.mp3': {'duration_sec': 13.12, 'bitrate_kbps': 124}, 'Sample 2.mp3': {'duration_sec': 11.050667, 'bitrate_kbps': 124}, 'Sample 3.mp3': {'duration_sec': 12.842667, 'bitrate_kbps': 126}}
Average duration (sec): 12.337778
Average bitrate (kbps): 124.66666666666667


In [7]:
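# how many minutes of audio fit in a 15 MB file at the average bitrate?
max_size_mb = 15
# 1 MB = 8,000,000 bits; avg_kbps * 1000 * 60 = bits per minute
minutes = (max_size_mb * 8_000_000) / (avg_kbps * 1000 * 60)
minutes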

16.0427807486631


In [2]:
# set up device
import os, torch

# helps on some Macs if an op isn't supported on MPS
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

def pick_device():
    if torch.cuda.is_available():
        return "cuda:0"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
dtype = torch.float16 if device.startswith("cuda") else torch.float32
print("device:", device, "dtype:", dtype)  

device: mps dtype: torch.float32


In [3]:
from transformers import pipeline     



In [4]:
whisper_large_v3_turbo = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    device=device,
    dtype=dtype,
)
# chunked inference: 30 s chunks with a 4 s stride on the left and right
result = whisper_large_v3_turbo("../../Sample 3.mp3", chunk_length_s=30, stride_length_s=(4, 4))
print(result['text'].strip())

`torch_dtype` is deprecated! Use `dtype` instead!
Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensLogitsProce

What should I have for lunch? There's only young tofu, western, Japanese, economic rice stalls here. I am sick of the choices here.


In [5]:
whisper_tiny = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    device=device,
    dtype=dtype,
)
# same chunking settings as the large-v3-turbo run, for a like-for-like comparison
result = whisper_tiny("../../Sample 3.mp3", chunk_length_s=30, stride_length_s=(4, 4))
print(result['text'].strip())



What should I have for lunch? There's only young tofu, Western, Japanese, economic rice stalls here. I'm sick of the choices here.


In [6]:
result = whisper_tiny("../../Sample 1.mp3", chunk_length_s=30, stride_length_s=(4, 4))
print(result['text'].strip())   



My name is Ethan. I was asked to come here by 11. Now it is already 3 p.m. They did not even serve me any food or drinks. Terrible.


In [7]:
result = whisper_tiny("../../Sample 2.mp3", chunk_length_s=30, stride_length_s=(4, 4))
print(result['text'].strip())   



Help me. I can't find my parents. They told me to wait for them, but I saw this pretty butterfly and followed it. Now I am lost.


## Questions
- Would a shorter `chunk_length_s` work better?
- What left and right stride should be used?
- Should we try forcing the language and task?

The initial plan was a custom grid search to quickly find the best chunk length and left/right stride, and to check whether forcing the language and task gives better results, scored with a metric such as WER. However, across the 3 audio clips there are only 2-3 word errors, all from "yong tou fu" being transcribed as "young tofu". That is close enough, since "tofu" is in the Oxford dictionary, so a grid search has little left to improve on these samples.
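
If the grid search becomes worthwhile later (for example with longer or noisier audio), the search space could be enumerated roughly like this. The candidate values are hypothetical, and the sketch only builds the keyword arguments without running any model. It assumes the pipeline call accepts `chunk_length_s`, `stride_length_s`, and `generate_kwargs` (with `language` and `task` keys for Whisper), as the calls earlier in this notebook and the Whisper model cards suggest.

```python
from itertools import product

# hypothetical candidate values, chosen only for illustration
chunk_lengths_s = [10, 20, 30]
strides_s = [(0, 0), (2, 2), (4, 4)]
generate_options = [
    {},                                             # default: automatic language detection
    {"language": "english", "task": "transcribe"},  # forced language and task
]

def build_grid() -> list[dict]:
    """Enumerate every combination as a dict of pipeline call kwargs."""
    grid = []
    for chunk, (left, right), gen in product(chunk_lengths_s, strides_s, generate_options):
        if left + right >= chunk:
            continue  # the combined stride must be shorter than the chunk
        grid.append({
            "chunk_length_s": chunk,
            "stride_length_s": (left, right),
            "generate_kwargs": dict(gen),
        })
    return grid

grid = build_grid()
print(len(grid), "configurations")

# Each entry could then be passed to the pipeline and scored against a reference, e.g.:
#   text = whisper_tiny("../../Sample 3.mp3", **kwargs)["text"]
```

Each configuration would be scored with WER against a hand-written reference transcript, which this notebook does not have yet.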