# Test ground
## Notebook for testing basic model loading and inference

### Try Example on Hugging Face

In [None]:
# !uv pip install torchcodec datasets

In [1]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.config.forced_decoder_ids = None

  from .autonotebook import tqdm as notebook_tqdm
Loading weights: 100%|██████████| 167/167 [00:00<00:00, 2244.05it/s, Materializing param=model.encoder.layers.3.self_attn_layer_norm.weight]  


In [2]:

# load dummy dataset and read audio files
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 
print(sample["sampling_rate"], input_features.shape)

# generate token ids
predicted_ids = model.generate(input_features)

Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.


16000 torch.Size([1, 80, 3000])


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> will take precedence. Please check the docstring of <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> to see related `.generate()` flags.
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensA

In [3]:
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
print(transcription)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)

[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']


#### Notes from the page
- `WhisperProcessor` handles preprocessing (audio → log-Mel spectrogram) and postprocessing (tokens → text).
- Without forcing, Whisper auto-detects language and task.
- Force English transcription (call `get_decoder_prompt_ids` on the `processor` instance, not on the `WhisperProcessor` class):
  ```py
  model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
      language="english", task="transcribe"
  )
  ```

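The `torch.Size([1, 80, 3000])` printed for `input_features` can be checked by hand. The sketch below assumes the usual Whisper feature-extractor settings for `openai/whisper-tiny` (16 kHz audio padded or trimmed to 30 s, a hop of 160 samples, 80 Mel bins). These constants are assumptions stated for illustration, not values read from this notebook's processor, and `expected_feature_shape` is a hypothetical helper.

```python
# Assumed Whisper feature-extractor settings (not read from the loaded processor).
SAMPLE_RATE = 16_000   # Hz, matches the sampling rate printed above
CHUNK_SECONDS = 30     # audio is padded or trimmed to this length
HOP_LENGTH = 160       # samples between consecutive spectrogram frames
N_MELS = 80            # number of Mel frequency bins


def expected_feature_shape(batch_size=1):
    """Return the (batch, mel_bins, frames) shape implied by the settings above."""
    n_samples = SAMPLE_RATE * CHUNK_SECONDS   # 480000 samples in 30 s
    n_frames = n_samples // HOP_LENGTH        # one frame every 160 samples
    return (batch_size, N_MELS, n_frames)


print(expected_feature_shape())  # (1, 80, 3000)
```

Because the audio is always padded to 30 s, the frame count stays at 3000 even for a short clip like the dummy sample.
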
<details> <summary><strong>Warnings to note</strong></summary>

- Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
- Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
- The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
- A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> will take precedence. Please check the docstring of <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> to see related `.generate()` flags.
- A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> will take precedence. Please check the docstring of <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> to see related `.generate()` flags.

</details>


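### Try the Hugging Face pipeline
The Hugging Face `pipeline` is a high-level wrapper that handles audio loading, feature extraction, generation, and decoding in a single call.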

In [None]:
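# set up device
import os

# helps on some Macs if an op isn't supported on MPS;
# set before importing torch so the setting is picked up reliably
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

def pick_device():
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return "cuda:0"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()

# half precision only on CUDA; MPS and CPU stay in float32
dtype = torch.float16 if device.startswith("cuda") else torch.float32
print("device:", device, "dtype:", dtype)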


device: mps dtype: torch.float32


In [22]:
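from transformers import pipeline

print("device:", device, "dtype:", dtype)
# newer transformers releases warn that `torch_dtype` is deprecated in favor of `dtype`;
# switch the keyword if that warning appears
whisper = pipeline("automatic-speech-recognition", model="openai/whisper-tiny", device=device, torch_dtype=dtype)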

device: mps dtype: torch.float32


`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100%|██████████| 167/167 [00:00<00:00, 2981.60it/s, Materializing param=model.encoder.layers.3.self_attn_layer_norm.weight]  


In [23]:
result = whisper("../../Sample 3.mp3", chunk_length_s=30, stride_length_s=(4, 2))



In [17]:
result

{'text': " What should I have for lunch? There's only young tofu, Western, Japanese, economic rice stalls here. I'm sick of the choices here."}

In [18]:
result['text'].strip()

"What should I have for lunch? There's only young tofu, Western, Japanese, economic rice stalls here. I'm sick of the choices here."

## Questions
- Would a shorter `chunk_length_s` give better results?
- Which left and right `stride_length_s` values should we use?
- Should we try forcing the language and task?
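
The first two questions are easier to reason about with the window layout written out. The sketch below is a simplified model of chunking with stride, not the pipeline's actual implementation. `plan_chunks` is a hypothetical helper; it assumes each chunk advances by `chunk_length_s - left - right` seconds and that the overlapping stride on each side is discarded when chunk outputs are merged.

```python
def plan_chunks(duration_s, chunk_length_s=30.0, stride_length_s=(4.0, 2.0)):
    """Lay out chunk windows over an audio clip (simplified sketch).

    Returns a list of (start, end, keep_start, keep_end) tuples in seconds:
    the model sees [start, end], but only [keep_start, keep_end] is assumed
    to contribute to the merged transcript.
    """
    left, right = stride_length_s
    step = chunk_length_s - left - right  # new audio covered per chunk
    if step <= 0:
        raise ValueError("left + right stride must be shorter than the chunk")

    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        is_first = start == 0.0
        is_last = start + chunk_length_s >= duration_s
        # the first chunk has no left context; the last has no right context
        keep_start = start if is_first else start + left
        keep_end = end if is_last else end - right
        chunks.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return chunks


# layout for a hypothetical 70-second clip with the settings used above
for chunk in plan_chunks(70.0):
    print(chunk)
```

Under these assumptions, a 30 s chunk with a (4, 2) stride covers 24 s of new audio per step, so 6 of every 30 s (20%) is processed only as context. Keeping the same stride with a shorter chunk raises that fraction, which is one cost to weigh against any accuracy gain.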