# Introduction to Whisper


# Loading the model and the processor


In [1]:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-small")



In [2]:
from datasets import load_dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")



Each row of this dataset provides the path to the audio file, the audio samples as an array, and the transcript of the audio. Let's look at the dataset and then at a single sample.

In [3]:
ds

Dataset({
    features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
    num_rows: 73
})

In [4]:
audio_sample = ds[3]

text = audio_sample["text"].lower()
speech_data = audio_sample["audio"]["array"]
speech_file = audio_sample["file"]

text

"he has grave doubts whether sir frederick leighton's work is really greek after all and can discover in it but little of rocky ithaca"

In [5]:
speech_data

array([-0.00045776, -0.00039673, -0.00048828, ..., -0.00021362,
       -0.00018311, -0.00027466])

In [6]:
from IPython.display import Audio, display

display(Audio(speech_file))

In [7]:
inputs = processor.feature_extractor(speech_data, return_tensors="pt", sampling_rate=16_000).input_features.to(device)
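# max_length counts decoder tokens, not audio samples; Whisper's decoder supports at most 448 positions
predicted_ids = model.generate(inputs, max_length=448)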
processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0]


'he has grave doubts whether sir frederick layton is work is really greek after all and can discover in it but little of rocky ithaca'
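
The prediction is close to the reference `text` but not identical ("layton is" instead of "leighton's"). One common way to quantify such a gap is the word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below is a minimal pure-Python version; the helper name `word_error_rate` is ours for illustration and is not part of `transformers`.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = ("he has grave doubts whether sir frederick leighton's work is really "
             "greek after all and can discover in it but little of rocky ithaca")
prediction = ("he has grave doubts whether sir frederick layton is work is really "
              "greek after all and can discover in it but little of rocky ithaca")

wer = word_error_rate(reference, prediction)
print(f"WER: {wer:.3f}")  # one substitution + one insertion over 24 reference words
```

For evaluation over a whole dataset, a maintained metric implementation is usually a better choice than a hand-written one.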

In [8]:
import librosa

# Path to your MP3 file
file_path = '/content/my_recording.mp3'

# Load the MP3 file
speech_data, sampling_rate = librosa.load(file_path, sr=16000)  # Resample to 16 kHz, the rate Whisper expects

print(speech_data)  # Numpy array of audio samples
print(f"Sampling Rate: {sampling_rate}")



[ 0.0000000e+00  0.0000000e+00  0.0000000e+00 ...  1.8301968e-07
 -9.7681550e-08  3.3674418e-08]
Sampling Rate: 16000


In [9]:
inputs = processor.feature_extractor(speech_data, return_tensors="pt", sampling_rate=16_000).input_features.to(device)
predicted_ids = model.generate(inputs, max_length=448)
processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0]

'i want to eat some fried chicken'

## Step-by-step

Let's recap what happens at each step:


1. We start with an audio clip, which is represented as an array of samples.

In [10]:
print(len(speech_data), speech_data)

70589 [ 0.0000000e+00  0.0000000e+00  0.0000000e+00 ...  1.8301968e-07
 -9.7681550e-08  3.3674418e-08]
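
The array length and the sampling rate together determine the clip duration. As a small sketch (the constant and helper names are ours, not part of any library), the following computes the duration of this clip and how many samples of zero-padding separate it from the 30-second window that Whisper's feature extractor works with.

```python
SAMPLING_RATE = 16_000   # samples per second, as used when loading the audio
WINDOW_SECONDS = 30      # Whisper processes audio in 30-second windows


def clip_duration_seconds(n_samples: int, sampling_rate: int = SAMPLING_RATE) -> float:
    """Duration of a mono clip given its number of samples."""
    return n_samples / sampling_rate


def padding_samples(n_samples: int,
                    sampling_rate: int = SAMPLING_RATE,
                    window_seconds: int = WINDOW_SECONDS) -> int:
    """Number of samples missing to fill one full window (0 if the clip is longer)."""
    window = sampling_rate * window_seconds
    return max(window - n_samples, 0)


n_samples = 70_589  # length of speech_data printed above
print(round(clip_duration_seconds(n_samples), 2))  # 4.41
print(padding_samples(n_samples))                  # 409411
```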


2. Using the feature extractor, we pre-process the audio into a format the model can use (i.e. we extract the log-mel spectrogram from the audio).

In [11]:
inputs = processor.feature_extractor(speech_data, return_tensors="pt", sampling_rate=16_000).input_features.to(device)
inputs

tensor([[[-0.8261, -0.8261, -0.8261,  ..., -0.8261, -0.8261, -0.8261],
         [-0.8261, -0.8261, -0.8261,  ..., -0.8261, -0.8261, -0.8261],
         [-0.8261, -0.8261, -0.8261,  ..., -0.8261, -0.8261, -0.8261],
         ...,
         [-0.8261, -0.8261, -0.8261,  ..., -0.8261, -0.8261, -0.8261],
         [-0.8261, -0.8261, -0.8261,  ..., -0.8261, -0.8261, -0.8261],
         [-0.8261, -0.8261, -0.8261,  ..., -0.8261, -0.8261, -0.8261]]])

In [12]:
print(inputs.shape)

torch.Size([1, 80, 3000])
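
The shape can be read as (batch size, mel bins, time frames). A rough sketch of the arithmetic behind it, assuming the default feature extractor settings for this checkpoint (80 mel bins, a hop length of 160 samples, and a 30-second window at 16 kHz); the function name is ours:

```python
SAMPLE_RATE = 16_000  # Hz
CHUNK_SECONDS = 30    # audio is padded or truncated to this length
HOP_SAMPLES = 160     # step between spectrogram frames (10 ms at 16 kHz)
NUM_MEL_BINS = 80     # frequency bins of the log-mel spectrogram


def expected_feature_shape(batch_size: int = 1) -> tuple:
    """Shape of the input features for a batch of padded 30-second clips."""
    samples_per_chunk = SAMPLE_RATE * CHUNK_SECONDS  # 480_000 samples
    num_frames = samples_per_chunk // HOP_SAMPLES    # one frame every 10 ms
    return (batch_size, NUM_MEL_BINS, num_frames)


print(expected_feature_shape())  # (1, 80, 3000)
```

This also explains why the shape is the same for every clip: shorter audio is padded to the full window before the spectrogram is computed.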


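The log-mel spectrogram is only one of many features that can be extracted from audio. Other common ones include:

- Mel-Frequency Cepstral Coefficients (MFCCs)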
- Spectral Features
- Zero-Crossing Rate (ZCR)
- Energy and Root Mean Square Energy
- Pitch and Fundamental Frequency
- Chroma Features
- etc.

3. We pass this input to the model's `generate` method, which returns the predicted token ids.



In [13]:
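# max_length counts decoder tokens, not audio samples; Whisper's decoder supports at most 448 positions
predicted_ids = model.generate(inputs, max_length=448)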
predicted_ids

tensor([[50258, 50259, 50359, 50363,   286,   528,   281,  1862,   512, 10425,
          4662, 50257]])

4. The tokenizer decodes the ids into readable text.

In [14]:
processor.tokenizer.batch_decode(predicted_ids)

['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> I want to eat some fried chicken<|endoftext|>']

As you can see, the decoded output contains special tokens that specify the language, the task, and whether timestamps are predicted. We can remove these tokens with the `skip_special_tokens` parameter.

In [15]:
processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)[0]

' I want to eat some fried chicken'
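
To make the role of the special tokens concrete, here is a small sketch that separates them from the transcript using only a regular expression. It operates on the decoded string shown earlier and is not how the tokenizer removes them internally; `split_special_tokens` is a hypothetical helper.

```python
import re

# Whisper's special tokens appear in decoded text as <|name|>
SPECIAL_TOKEN = re.compile(r"<\|([^|]+)\|>")


def split_special_tokens(decoded: str):
    """Return (list of special token names, remaining text)."""
    names = SPECIAL_TOKEN.findall(decoded)
    text = SPECIAL_TOKEN.sub("", decoded)
    return names, text


decoded = ("<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
           " I want to eat some fried chicken<|endoftext|>")
names, text = split_special_tokens(decoded)
print(names)       # ['startoftranscript', 'en', 'transcribe', 'notimestamps', 'endoftext']
print(repr(text))  # ' I want to eat some fried chicken'
```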

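The decoded text also keeps a leading space and the original capitalization (here the "I"; in the first sample, names such as "Sir Frederick"). We use the `normalize` parameter to make the text consistent, which is useful when comparing it against a lowercased reference.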

In [16]:
processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0]

'i want to eat some fried chicken'
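
As a rough intuition for what normalization does, the sketch below lowercases the text, drops punctuation, and collapses whitespace. It is a simplified stand-in with a hypothetical name (`simple_normalize`); the tokenizer's own normalizer applies additional rules that this sketch omits.

```python
import re


def simple_normalize(text: str) -> str:
    """Lowercase, remove punctuation (keeping apostrophes), collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()


print(simple_normalize(" I want to eat some fried chicken"))
# i want to eat some fried chicken
print(simple_normalize("Sir Frederick Leighton's work, really?"))
# sir frederick leighton's work really
```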

## Inference in Other Languages


In [17]:
from datasets import load_dataset

sample = load_dataset("osanseviero/dummy_ja_audio")["train"]["audio"][0]

speech_data = sample["array"]

In [18]:
inputs = processor.feature_extractor(speech_data, return_tensors="pt", sampling_rate=16_000).input_features.to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe")

predicted_ids = model.generate(inputs, max_length=448, forced_decoder_ids=forced_decoder_ids)
processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0]

'キムラさんに電話をかしてもらいました'
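
`get_decoder_prompt_ids` returns (position, token id) pairs that pin the first decoder tokens to the chosen language and task. To illustrate the idea without loading the tokenizer, the sketch below builds the equivalent prompt as token strings, mirroring the special tokens seen in the decoded English output earlier. Both helpers are hypothetical and work on strings rather than real token ids.

```python
def build_prompt_tokens(language: str, task: str = "transcribe", timestamps: bool = False):
    """Special tokens that open a Whisper decoder sequence, as strings."""
    if task not in {"transcribe", "translate"}:
        raise ValueError(f"unsupported task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


def forced_positions(tokens):
    """Pair each token after the start token with its decoder position."""
    # Position 0 holds the start token itself, so forcing begins at position 1.
    return list(enumerate(tokens))[1:]


prompt = build_prompt_tokens("ja")
print(prompt)
# ['<|startoftranscript|>', '<|ja|>', '<|transcribe|>', '<|notimestamps|>']
print(forced_positions(prompt))
# [(1, '<|ja|>'), (2, '<|transcribe|>'), (3, '<|notimestamps|>')]
```

Swapping `"ja"` for `"en"` reproduces the prefix seen in the English example, which is why the same model can transcribe different languages with only a change of prompt.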