# Learn OpenAI Whisper - Chapter 7
## Notebook 1: Quantizing Whisper with CTranslate2 and running inference with Faster-Whisper

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lFKZCc-mDIf8xH_v7_M1m1hfA-Ke772d)

This notebook walks through quantizing the Whisper model with [CTranslate2](https://opennmt.net/CTranslate2/guides/transformers.html#whisper), a library for efficient inference with Transformer models, and then running inference with Faster-Whisper. Quantization matters when deploying automatic speech recognition (ASR) models such as Whisper in environments with limited computational resources.

![ch07_1-quantizing-whisper-with-ctranslate2.png](https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter07/ch07_1-quantizing-whisper-with-ctranslate2.png)

### 1.	Installing libraries

The code begins by installing `ctranslate2`, `transformers`, and `faster-whisper`. CTranslate2 performs the conversion and quantization, Transformers provides the original Whisper checkpoints and the processor, and Faster-Whisper runs inference on the converted models.
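
After the installation cell has run, it can help to confirm which versions were actually installed. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical, and the package names are assumed to match the ones passed to `pip` in this notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a mapping of package name to installed version (None if missing)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


versions = installed_versions(["ctranslate2", "transformers", "faster-whisper"])
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

A `None` entry suggests the corresponding `pip` command did not complete, so it is worth re-running the install cell before continuing.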


In [None]:
!pip -q install ctranslate2
# Quote the requirement so the shell does not treat ">=4.23" as an output redirection.
!pip -q install "transformers[torch]>=4.23"
!pip -q install faster-whisper

### 2.	Downloading sample audio files
Two sample audio files are downloaded from our GitHub repository to test the Whisper model's transcription capabilities.

In [None]:
!wget -nv https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter01/Learn_OAI_Whisper_Sample_Audio01.mp3
!wget -nv https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter01/Learn_OAI_Whisper_Sample_Audio02.mp3

2024-04-20 12:56:49 URL:https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter01/Learn_OAI_Whisper_Sample_Audio01.mp3 [363247/363247] -> "Learn_OAI_Whisper_Sample_Audio01.mp3" [1]
2024-04-20 12:56:50 URL:https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter01/Learn_OAI_Whisper_Sample_Audio02.mp3 [458561/458561] -> "Learn_OAI_Whisper_Sample_Audio02.mp3" [1]


### 3.	Preprocessing audio files
The first audio file is loaded with librosa, converted to mono, and resampled to 16,000 Hz, the sampling rate that Whisper's feature extractor expects. This step ensures the audio data is in the correct format for the model.

In [None]:
import ctranslate2
from IPython.display import Audio
import librosa
import transformers
# Load and resample the audio file.
sampling_frequency = 16000
audio, _ = librosa.load("Learn_OAI_Whisper_Sample_Audio01.mp3", sr=sampling_frequency, mono=True)
Audio(audio, rate=sampling_frequency)

In [None]:
import torch
this_device = "cuda" if torch.cuda.is_available() else "cpu"

### 4.	Converting to CTranslate2 format
In this step, we convert the Whisper models `openai/whisper-tiny` and `openai/whisper-base` to the CTranslate2 format, which is optimized for inference.

In [None]:
!ct2-transformers-converter --force --model openai/whisper-tiny --output_dir whisper-tiny-ct2

config.json: 100% 1.98k/1.98k [00:00<00:00, 8.21MB/s]
model.safetensors: 100% 151M/151M [00:00<00:00, 256MB/s]
generation_config.json: 100% 3.75k/3.75k [00:00<00:00, 14.8MB/s]
tokenizer_config.json: 100% 283k/283k [00:00<00:00, 1.19MB/s]
vocab.json: 100% 836k/836k [00:00<00:00, 1.21MB/s]
tokenizer.json: 100% 2.48M/2.48M [00:01<00:00, 2.07MB/s]
merges.txt: 100% 494k/494k [00:00<00:00, 22.0MB/s]
normalizer.json: 100% 52.7k/52.7k [00:00<00:00, 97.4MB/s]
added_tokens.json: 100% 34.6k/34.6k [00:00<00:00, 73.0MB/s]
special_tokens_map.json: 100% 2.19k/2.19k [00:00<00:00, 11.7MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
!ct2-transformers-converter --force --model openai/whisper-base --output_dir whisper-base-ct2

config.json: 100% 1.98k/1.98k [00:00<00:00, 9.49MB/s]
model.safetensors: 100% 290M/290M [00:34<00:00, 8.49MB/s]
generation_config.json: 100% 3.81k/3.81k [00:00<00:00, 18.7MB/s]
tokenizer_config.json: 100% 283k/283k [00:00<00:00, 40.6MB/s]
vocab.json: 100% 836k/836k [00:00<00:00, 1.17MB/s]
tokenizer.json: 100% 2.48M/2.48M [00:00<00:00, 9.33MB/s]
merges.txt: 100% 494k/494k [00:00<00:00, 56.3MB/s]
normalizer.json: 100% 52.7k/52.7k [00:00<00:00, 116MB/s]
added_tokens.json: 100% 34.6k/34.6k [00:00<00:00, 96.6MB/s]
special_tokens_map.json: 100% 2.19k/2.19k [00:00<00:00, 11.1MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### 5.	Performing quantization
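The models are then converted again, this time with `--quantization int8`, which stores the weights as 8-bit integers (int8). The `--copy_files` option also copies `tokenizer.json` and `preprocessor_config.json` into the output directory.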
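
To make the idea of int8 quantization concrete, here is a simplified sketch of symmetric quantization applied to a single row of weights. It is an illustration only and not CTranslate2's actual implementation; the function names and the example weights are hypothetical.

```python
def quantize_row_int8(row):
    """Symmetric int8 quantization of one weight row.

    Returns the quantized integers and the scale used to produce them.
    """
    max_abs = max(abs(w) for w in row)
    # Map the largest magnitude in the row to 127; avoid dividing by zero.
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w * scale))) for w in row]
    return quantized, scale


def dequantize_row(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q / scale for q in quantized]


weights = [0.8, -0.25, 0.05, -1.0]  # illustrative values, not real Whisper weights
q, scale = quantize_row_int8(weights)
restored = dequantize_row(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Each weight now needs one byte instead of the four bytes of a 32-bit float, at the cost of a small rounding error bounded by half a quantization step.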

In [None]:
!ct2-transformers-converter --force --model openai/whisper-tiny --output_dir whisper-tiny-ct2-int8 \
--copy_files tokenizer.json preprocessor_config.json --quantization int8

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
preprocessor_config.json: 100% 185k/185k [00:00<00:00, 406kB/s]


In [None]:
!ct2-transformers-converter --force --model openai/whisper-base --output_dir whisper-base-ct2-int8 \
--copy_files tokenizer.json preprocessor_config.json --quantization int8

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
preprocessor_config.json: 100% 185k/185k [00:00<00:00, 45.2MB/s]
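
One way to see the effect of quantization is to compare the on-disk size of the converted directories. The sketch below assumes the four output directories created by the conversion cells in this notebook exist in the working directory; the helper `directory_size_mb` is a hypothetical name, and no sizes are claimed here since they depend on the library versions used.

```python
from pathlib import Path


def directory_size_mb(path):
    """Sum the sizes of all regular files under `path`, in megabytes (10**6 bytes)."""
    root = Path(path)
    total_bytes = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return total_bytes / 1e6


model_dirs = [
    "whisper-tiny-ct2",
    "whisper-tiny-ct2-int8",
    "whisper-base-ct2",
    "whisper-base-ct2-int8",
]
for name in model_dirs:
    if Path(name).is_dir():
        print(f"{name}: {directory_size_mb(name):.1f} MB")
    else:
        print(f"{name}: not found (run the conversion cells first)")
```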


### 6. Detecting language
The quantized `whisper-tiny` model detects the language of the first audio sample.

In [None]:
# Load the model on device
model = ctranslate2.models.Whisper("whisper-tiny-ct2-int8", device=this_device)

In [None]:
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-tiny")
inputs = processor(audio, return_tensors="np", sampling_rate=sampling_frequency)
features = ctranslate2.StorageView.from_array(inputs.input_features)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The processor has computed the input features for the first 30 seconds of audio. The model now uses those features to detect the spoken language.

In [None]:
# Detect the language.
results = model.detect_language(features)
language, probability = results[0][0]
print("Detected language %s with probability %f" % (language, probability))

Detected language <|en|> with probability 0.889251
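
The detection result is a list of `(token, probability)` pairs per input, ordered from most to least probable, and the token is wrapped in Whisper's special-token markers. The sketch below shows one way to turn such a list into plain language codes. The helper `top_languages` is hypothetical, and apart from the `<|en|>` entry taken from the output above, the probabilities are invented for illustration.

```python
def top_languages(detection, k=3):
    """Return the k most probable (language_code, probability) pairs.

    `detection` is a list of (token, probability) pairs such as ("<|en|>", 0.89).
    """
    ranked = sorted(detection, key=lambda pair: pair[1], reverse=True)
    # Strip the "<|" and "|>" markers so "<|en|>" becomes "en".
    return [(token.strip("<|>"), prob) for token, prob in ranked[:k]]


# Stand-in for results[0]; only the first entry comes from the notebook output.
example_detection = [
    ("<|en|>", 0.889251),
    ("<|de|>", 0.02),  # invented value
    ("<|cy|>", 0.03),  # invented value
]

for code, prob in top_languages(example_detection, k=2):
    print(f"{code}: {prob:.3f}")
```

Note that the transcription prompt in the next step needs the original token form (`<|en|>`), so the stripped code is only for display.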


### 7.	Transcribing audio files
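The prompt tokens that describe the task are converted to IDs with `processor.tokenizer.convert_tokens_to_ids()`. The quantized model then generates the transcription with `model.generate()`, and `processor.decode()` turns the output IDs back into text.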

In [None]:
# Describe the task in the prompt.
prompt = processor.tokenizer.convert_tokens_to_ids(
    [
        "<|startoftranscript|>",
        language,
        "<|transcribe|>",
        "<|notimestamps|>",  # Remove this token to generate timestamps.
    ]
)

In [None]:
# Load the model on device
model = ctranslate2.models.Whisper("whisper-tiny-ct2-int8", device=this_device)

In [None]:
# Run generation for the 30-second window.
results = model.generate(features, [prompt])
transcription = processor.decode(results[0].sequences_ids[0])
print(transcription)

 Hello. This is Ho Suey Batista. I am the author of the book Learn Open AI Whisper. Transform your understanding of generative AI through robust and accurate speech processing solutions. This is an audio sample that you can use to try and test and enhance your
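
The transcription above stops mid-sentence because only the first 30-second window of audio was passed to `model.generate()`. One possible way to cover a longer recording with this low-level API is to split the waveform into consecutive 30-second windows and process each one in turn. The sketch below only computes the window boundaries; the function name `window_bounds` and the 45-second clip length are hypothetical.

```python
def window_bounds(num_samples, sampling_rate=16000, window_seconds=30):
    """Yield (start, end) sample indices of consecutive, non-overlapping windows."""
    window = sampling_rate * window_seconds
    for start in range(0, num_samples, window):
        yield start, min(start + window, num_samples)


# A hypothetical 45-second clip sampled at 16 kHz.
bounds = list(window_bounds(45 * 16000))
for start, end in bounds:
    print(f"samples {start}..{end} ({(end - start) / 16000:.0f} s)")
```

Each `(start, end)` pair could be used to slice the `audio` array before calling the processor and `model.generate()` again. Splitting at fixed boundaries can cut a word in half, which is one reason the Faster-Whisper API used in the next section is more convenient for long audio.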


### 8.	Evaluating performance
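After the audio transcription, the code evaluates the performance of each converted model by measuring the time taken to transcribe the second audio sample with Faster-Whisper. Because `transcribe()` returns a lazy generator, decoding happens while iterating over `segments`, which is why the timer wraps the loop. Note that every cell passes `compute_type="int8"`, so the models converted without `--quantization` are converted to int8 at load time where the device supports it; the comparison is therefore between weights quantized at conversion time and weights quantized when the model is loaded.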
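
The benchmark cells below repeat the same timing code four times. As an alternative, the timing logic could be wrapped in a small context manager. This is a sketch rather than the notebook's method: the name `timed` is hypothetical, and it uses `time.perf_counter()`, a monotonic high-resolution clock, instead of `time.time()`. A dummy workload stands in for the transcription loop so the example runs on its own.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("dummy workload", timings):
    # Stand-in for: for segment in segments: print(...)
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```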

In [None]:
# Load and resample the audio file.
sampling_frequency = 16000
audio, _ = librosa.load("Learn_OAI_Whisper_Sample_Audio02.mp3", sr=sampling_frequency, mono=True)
Audio(audio, rate=sampling_frequency)

In [None]:
from faster_whisper import WhisperModel
import time
import datetime

model_size = "whisper-tiny-ct2"
model = WhisperModel(model_size, device=this_device, compute_type="int8")
segments, info = model.transcribe("Learn_OAI_Whisper_Sample_Audio02.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

start = time.time()
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
# Print the end time and the delta in seconds and fractions of a second.
end = time.time()
print('start: ', start)
print('end: ', end)
print('delta: ', end - start)
print('delta: ', datetime.timedelta(seconds=end - start))

Detected language 'en' with probability 0.982467
[0.00s -> 6.00s]  Offstage left. Far left, my voice should be coming directly out of the left speaker.
[6.00s -> 10.00s]  Midway between center and left position.
[10.00s -> 15.00s]  Exact center position. Midway between center and right position.
[15.00s -> 19.00s]  And at the right hand position. Now I'm offstage right.
start:  1713618100.8895595
end:  1713618101.3472176
delta:  0.457658052444458
delta:  0:00:00.457658


In [None]:
from faster_whisper import WhisperModel
import time
import datetime

model_size = "whisper-tiny-ct2-int8"
model = WhisperModel(model_size, device=this_device, compute_type="int8")
segments, info = model.transcribe("Learn_OAI_Whisper_Sample_Audio02.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

start = time.time()
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
# Print the end time and the delta in seconds and fractions of a second.
end = time.time()
print('start: ', start)
print('end: ', end)
print('delta: ', end - start)
print('delta: ', datetime.timedelta(seconds=end - start))

Detected language 'en' with probability 0.982467
[0.00s -> 6.00s]  Offstage left. Far left, my voice should be coming directly out of the left speaker.
[6.00s -> 10.00s]  Midway between center and left position.
[10.00s -> 15.00s]  Exact center position. Midway between center and right position.
[15.00s -> 19.00s]  And at the right hand position. Now I'm offstage right.
start:  1713618106.6446679
end:  1713618106.9435859
delta:  0.2989180088043213
delta:  0:00:00.298918


In [None]:
from faster_whisper import WhisperModel
import time
import datetime

model_size = "whisper-base-ct2"
model = WhisperModel(model_size, device=this_device, compute_type="int8")
segments, info = model.transcribe("Learn_OAI_Whisper_Sample_Audio02.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

start = time.time()
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
# Print the end time and the delta in seconds and fractions of a second.
end = time.time()
print('start: ', start)
print('end: ', end)
print('delta: ', end - start)
print('delta: ', datetime.timedelta(seconds=end - start))

Detected language 'en' with probability 0.979529
[0.00s -> 6.00s]  Offstage left. Far left, my voice should be coming directly out of the left speaker.
[6.00s -> 9.00s]  Midway between center and left position.
[9.00s -> 12.00s]  Exact center position.
[12.00s -> 17.00s]  Midway between center and right position. And at the right hand position.
[17.00s -> 19.00s]  Now I'm offstage right.
start:  1713618112.9229739
end:  1713618113.4219024
delta:  0.4989285469055176
delta:  0:00:00.498929


In [None]:
from faster_whisper import WhisperModel
import time
import datetime

model_size = "whisper-base-ct2-int8"
model = WhisperModel(model_size, device=this_device, compute_type="int8")
segments, info = model.transcribe("Learn_OAI_Whisper_Sample_Audio02.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

start = time.time()
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
# Print the end time and the delta in seconds and fractions of a second.
end = time.time()
print('start: ', start)
print('end: ', end)
print('delta: ', end - start)
print('delta: ', datetime.timedelta(seconds=end - start))

Detected language 'en' with probability 0.979529
[0.00s -> 6.00s]  Offstage left. Far left, my voice should be coming directly out of the left speaker.
[6.00s -> 9.00s]  Midway between center and left position.
[9.00s -> 12.00s]  Exact center position.
[12.00s -> 17.00s]  Midway between center and right position. And at the right hand position.
[17.00s -> 19.00s]  Now I'm offstage right.
start:  1713618119.0564349
end:  1713618119.4064176
delta:  0.34998273849487305
delta:  0:00:00.349983
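
The four `delta` values recorded above can be summarized as speedup ratios. The sketch below copies the timings from the outputs in this notebook; they come from single runs, so the ratios should be read as rough indications rather than stable benchmarks.

```python
# Transcription times in seconds, copied from the outputs above.
recorded_seconds = {
    "whisper-tiny-ct2": 0.457658052444458,
    "whisper-tiny-ct2-int8": 0.2989180088043213,
    "whisper-base-ct2": 0.4989285469055176,
    "whisper-base-ct2-int8": 0.34998273849487305,
}


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate was than the baseline."""
    return baseline_seconds / candidate_seconds


tiny_speedup = speedup(
    recorded_seconds["whisper-tiny-ct2"], recorded_seconds["whisper-tiny-ct2-int8"]
)
base_speedup = speedup(
    recorded_seconds["whisper-base-ct2"], recorded_seconds["whisper-base-ct2-int8"]
)

print(f"tiny: pre-quantized model ran {tiny_speedup:.2f}x faster")
print(f"base: pre-quantized model ran {base_speedup:.2f}x faster")
```

Repeating each measurement several times and reporting the median would give a more reliable comparison.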
