In [1]:
import torch
import torchaudio
from pyprojroot import here
from transformers import WhisperProcessor, WhisperForConditionalGeneration



### Load the Whisper model and processor

In [3]:

model_name = "openai/whisper-base"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

### Load and process audio file

In [None]:
audio_file = here("data/sample/01.wav")
waveform, sample_rate = torchaudio.load(audio_file)

### Resample to the correct sampling rate (Whisper requires 16 kHz)

In [5]:
transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
waveform = transform(waveform)

### Convert waveform to input features

In [6]:
# Average the channels to get a 1-D mono signal; squeeze() alone leaves stereo input 2-D.
input_features = processor(waveform.mean(dim=0).numpy(), sampling_rate=16000, return_tensors="pt").input_features
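
To make the shape of `input_features` less opaque, here is a small arithmetic sketch. It assumes the default `WhisperFeatureExtractor` settings for `openai/whisper-base` (16 kHz audio, 30-second window, hop length of 160 samples, 80 mel bins). These constants are written out by hand as assumptions, not read from the loaded processor, and the function names are hypothetical.

```python
# Assumed defaults of WhisperFeatureExtractor for openai/whisper-base.
SAMPLING_RATE = 16_000  # samples per second
CHUNK_SECONDS = 30      # the processor pads or truncates to this window
HOP_LENGTH = 160        # samples between consecutive spectrogram frames
N_MELS = 80             # mel-frequency bins


def feature_shape(batch_size: int = 1) -> tuple:
    """Expected shape of `input_features`: (batch, mel bins, frames)."""
    n_samples = SAMPLING_RATE * CHUNK_SECONDS
    return (batch_size, N_MELS, n_samples // HOP_LENGTH)


def will_be_truncated(num_samples: int) -> bool:
    """True if a 16 kHz clip is longer than one 30-second window."""
    return num_samples > SAMPLING_RATE * CHUNK_SECONDS


print(feature_shape())  # (1, 80, 3000)
```

Under these assumptions, a clip longer than 30 seconds would be cut off by a single processor call, so longer recordings would need to be split into windows first.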

### Generate transcription

In [7]:
with torch.no_grad():
    predicted_ids = model.generate(input_features)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
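
The first and third warnings above suggest passing the language and the attention mask explicitly. A minimal sketch of how that might look follows. The function name `transcribe` is hypothetical, and the choices `language="en"` and `task="transcribe"` are assumptions about the use case. It relies on `return_attention_mask=True` being forwarded to the Whisper feature extractor and on `generate` accepting `attention_mask`, `language`, and `task`, which holds for recent Transformers releases but may differ in older ones.

```python
def transcribe(model, processor, audio, sampling_rate=16000,
               language="en", task="transcribe"):
    """Hypothetical helper: transcribe one mono clip with explicit settings.

    `audio` is a 1-D float array of samples at `sampling_rate`.
    """
    # Ask the processor for the attention mask alongside the features.
    inputs = processor(
        audio,
        sampling_rate=sampling_rate,
        return_tensors="pt",
        return_attention_mask=True,
    )
    # Pass the mask, language, and task explicitly instead of relying on
    # defaults, as the warnings recommend.
    predicted_ids = model.generate(
        inputs.input_features,
        attention_mask=inputs.attention_mask,
        language=language,
        task=task,
    )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

A call such as `transcribe(model, processor, waveform.squeeze().numpy())` would then replace the generate and decode cells for a mono file. Setting `task="translate"` instead would request translation to English.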


### Decode output text

In [8]:
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)

Transcription:  Kids are talking by the door.
