# Learn OpenAI Whisper - Chapter 1
## Using Whisper in Google Colab
This notebook provides a simple template for using OpenAI's Whisper for audio transcription in Google Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1uka0UhZJBWIwLcubsFbiOw8fNGlBBI-a)
## Install Whisper
Run the cell below to install Whisper.

The cell also installs the Python libraries `openai`, `cohere`, and `tiktoken`, which are dependencies of the `llmx` library. Each provides functionality that `llmx` relies on:

1. `openai`: This is the official Python library for the OpenAI API. It provides convenient access to the OpenAI REST API from any Python 3.7+ application. The library includes type definitions for all request parameters and response fields, and offers both synchronous and asynchronous clients powered by `httpx`.

2. `cohere`: The Python client for the Cohere platform, which builds natural language processing and generation into your product with a few lines of code. The platform covers a broad spectrum of natural language use cases, including classification, semantic search, paraphrasing, summarization, and content generation.

3. `tiktoken`: This is a fast Byte Pair Encoding (BPE) tokenizer for use with OpenAI's models. It's used to tokenize text into subwords, a necessary step before feeding text into many modern language models.

In [1]:
%%capture
!pip install -q cohere openai tiktoken
!pip install -q git+https://github.com/openai/whisper.git
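
Because `%%capture` hides the installer output, it can help to confirm afterwards which versions were actually installed. The following is a minimal sketch using only the standard library; the distribution name `openai-whisper` is an assumption about how the Git install registers itself, so adjust it if the lookup reports the package as missing.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a mapping of package name to installed version (None if missing)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# "openai-whisper" is an assumed distribution name for the Git install above.
packages = ["openai-whisper", "openai", "cohere", "tiktoken"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```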

## Option 1: Upload audio file
Use the file upload feature of Google Colab to upload your audio file.

Also, a recording of the author's voice can be found at Packt's GitHub repository:

https://github.com/PacktPublishing/Learn-OpenAI-Whisper/blob/main/Chapter01/Learn_OAI_Whisper_Sample_Audio01.m4a

In [2]:
import ipywidgets as widgets
uploader = widgets.FileUpload(accept='.mp3,.wav,.m4a', multiple=False)
display(uploader)

# Once this block runs, click the upload button below to upload your downloaded .m4a file

FileUpload(value={}, accept='.mp3,.wav,.m4a', description='Upload')

In [3]:
# Convert the dict_items to a list and get the first item (your file and its info)
file_key, file_info = list(uploader.value.items())[0]
file_name = file_info['metadata']['name']
file_content = file_info['content']
with open(file_name, "wb") as fp:
    fp.write(file_content)
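
The cell above assumes that `uploader.value` is a dictionary keyed by file name, which matches the `FileUpload(value={}, ...)` output shown earlier. Newer ipywidgets releases are believed to return a tuple of dictionaries instead. The helper below is a sketch that accepts either layout; `extract_upload` is a hypothetical name, and both layouts are assumptions to check against the installed ipywidgets version.

```python
def extract_upload(value):
    """Return (file_name, content_bytes) for the first uploaded file.

    Assumed layouts of FileUpload.value:
      - ipywidgets 7.x: {file_name: {"metadata": {"name": ...}, "content": bytes}}
      - ipywidgets 8.x: ({"name": ..., "content": memoryview, ...},)
    """
    if not value:
        raise ValueError("No file uploaded yet; click the Upload button first.")
    if isinstance(value, dict):
        file_info = next(iter(value.values()))
        return file_info["metadata"]["name"], bytes(file_info["content"])
    file_info = value[0]
    return file_info["name"], bytes(file_info["content"])


# Stand-in value that mimics the older layout, so the helper can be tried without a widget.
legacy_value = {"demo.m4a": {"metadata": {"name": "demo.m4a"}, "content": b"abc"}}
print(extract_upload(legacy_value))
```

In the notebook, `file_name, file_content = extract_upload(uploader.value)` could stand in for the first three lines of the cell above.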

In [4]:
import ipywidgets as widgets
widgets.Audio.from_file(file_name, autoplay=False, loop=False)

Audio(value=b'ID3\x03\x00\x00\x00\x00\x1fvPRIV\x00\x00\x00\x0e\x00\x00PeakValue\x00\xa1\x7f\x00\x00PRIV\x00\x0…

In [5]:
# One option to run Whisper is using command-line parameters
# This command transcribes the uploaded file using Whisper small size model
!whisper {file_name} --model small

100%|████████████████████████████████████████| 461M/461M [00:03<00:00, 133MiB/s]
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Spanish
[00:00.000 --> 00:02.000]  ¿Cuál es la fecha de tu cumpleaños?


## Option 2: Download sample files

In [6]:
!wget -nv https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter01/Learn_OAI_Whisper_Sample_Audio01.mp3
!wget -nv https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter01/Learn_OAI_Whisper_Sample_Audio02.mp3

2024-04-05 11:37:44 URL:https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter01/Learn_OAI_Whisper_Sample_Audio01.mp3 [363247/363247] -> "Learn_OAI_Whisper_Sample_Audio01.mp3" [1]
2024-04-05 11:37:44 URL:https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter01/Learn_OAI_Whisper_Sample_Audio02.mp3 [458561/458561] -> "Learn_OAI_Whisper_Sample_Audio02.mp3" [1]


In [7]:
mono_file = "Learn_OAI_Whisper_Sample_Audio01.mp3"
stereo_file = "Learn_OAI_Whisper_Sample_Audio02.mp3"

In [8]:
import ipywidgets as widgets
widgets.Audio.from_file(mono_file, autoplay=False, loop=False)

Audio(value=b'\xff\xfb\x90\xc4\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Xing\x00\x00…

In [9]:
import ipywidgets as widgets
widgets.Audio.from_file(stereo_file, autoplay=False, loop=False)

Audio(value=b'\xff\xfb\x90d\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0…

In [10]:
# Another way to run Whisper is by instantiating a model object
import whisper

# Load the small English language model
model = whisper.load_model("small.en")

100%|███████████████████████████████████████| 461M/461M [00:40<00:00, 11.8MiB/s]


In [11]:
# NLTK splits the transcription into sentences so that each one can be
# printed on its own line, as in the output below.

import nltk
nltk.download('punkt')
from nltk import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [12]:
# Transcribe the mono audio file
result = model.transcribe(mono_file)
print("Transcription of mono_file:")
for sent in sent_tokenize(result['text']):
  print(sent)

Transcription of mono_file:
 Hello, this is Josue Batista.
I am the author of the book Learn Open AI Whisper, Transform Your Understanding of Generative AI Through Robust and Accurate Speech Processing Solutions.
This is an audio sample that you can use to try and test and enhance your own implementation of whisper.
Good luck!
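
The loop above prints only the text, but the dictionary returned by `model.transcribe()` also carries a `segments` list with `start`, `end`, and `text` for each segment. The sketch below formats those segments in a style similar to the command-line output; `format_timestamp` and `format_segments` are hypothetical helpers, and `sample_result` is a hand-written stand-in rather than real model output.

```python
def format_timestamp(seconds):
    """Format a time in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result):
    """Return one '[start --> end] text' line per segment of a transcribe() result."""
    lines = []
    for seg in result["segments"]:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} --> {end}] {seg['text'].strip()}")
    return lines


# Stand-in for the dictionary returned by model.transcribe(); values are illustrative only.
sample_result = {
    "text": " Hello. Good luck!",
    "segments": [
        {"start": 0.0, "end": 6.0, "text": " Hello."},
        {"start": 32.88, "end": 33.88, "text": " Good luck!"},
    ],
}
for line in format_segments(sample_result):
    print(line)
```

With a real transcription, `format_segments(result)` would be called on the `result` variable from the cell above.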


In [13]:
# Transcribe the stereo audio file
result = model.transcribe(stereo_file)
print("Transcription of stereo_file:")
for sent in sent_tokenize(result['text']):
  print(sent)

Transcription of stereo_file:
 Offstage left.
Far left.
My voice should be coming directly out of the left speaker.
Midway between center and left position.
Exact center position.
Midway between center and right position.
And at the right hand position.
Now I'm offstage right.


# **The following blocks are examples from Chapter 1 that showcase other functionalities of Whisper**

In [14]:
!wget -nv -O Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3 https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter01/Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3

2024-04-05 11:40:20 URL:https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter01/Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3 [24361/24361] -> "Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3" [1]


In [15]:
import ipywidgets as widgets
spanish_file = "Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3"
widgets.Audio.from_file(spanish_file, autoplay=False, loop=False)

Audio(value=b'ID3\x03\x00\x00\x00\x00\x1fvPRIV\x00\x00\x00\x0e\x00\x00PeakValue\x00\xa1\x7f\x00\x00PRIV\x00\x0…

In [16]:
'''
Specifying language: You can specify the language for more accurate transcription.
'''

!whisper {spanish_file} --model small --language Spanish

[00:00.000 --> 00:02.000]  ¿Cuál es la fecha de tu cumpleaños?


In [None]:
'''
Sending output to a specific folder: By default, Whisper saves the transcription output in the current
working directory. You can direct the output to a specific directory using the --output_dir flag.
'''
!whisper {mono_file} --model small.en --output_dir "/content/WhisperDemoOutputs/"
# Once this block runs, click the refresh folder button on the left to view output folder
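
Once the command has finished, the files in the output folder can also be read back in Python. This sketch assumes the CLI writes a JSON file named after the audio file's stem (for example `Learn_OAI_Whisper_Sample_Audio01.json`) with a top-level `text` field; `load_cli_transcript` is a hypothetical helper, not part of Whisper.

```python
import json
from pathlib import Path


def load_cli_transcript(output_dir, audio_path):
    """Load <audio stem>.json from output_dir and return the parsed dictionary."""
    json_path = Path(output_dir) / (Path(audio_path).stem + ".json")
    with open(json_path, encoding="utf-8") as fp:
        return json.load(fp)


# Example call (requires the command above to have run first):
# transcript = load_cli_transcript("/content/WhisperDemoOutputs/", mono_file)
# print(transcript["text"])
```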

In [17]:
'''
Modeling specific tasks: Whisper can handle different tasks like transcription and translation.
Specify the task using the --task flag. Use --task translate to translate non-English audio into
English text. Whisper will not translate into any target language other than English.
If you have a non-English audio file, upload it above and run this block of code.
'''

!whisper {spanish_file} --model small --task translate --output_dir "/content/WhisperDemoTranslate/"

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Spanish
[00:00.000 --> 00:02.000]  What is the date of your birthday?
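
The commands so far interpolate the file name directly into the shell line, which breaks if an uploaded file name contains spaces. One way around this, sketched below, is to assemble the arguments in Python and quote them. `build_whisper_command` is a hypothetical helper that covers only the flags used in the cells above, and `my recording.m4a` is a made-up file name.

```python
import shlex


def build_whisper_command(audio_path, model="small", language=None, task=None, output_dir=None):
    """Build the argument list for the `whisper` CLI from optional settings."""
    args = ["whisper", audio_path, "--model", model]
    if language:
        args += ["--language", language]
    if task:
        args += ["--task", task]
    if output_dir:
        args += ["--output_dir", output_dir]
    return args


command = build_whisper_command("my recording.m4a", model="small", language="Spanish")
# shlex.join quotes any argument that contains spaces.
print(shlex.join(command))
```

The list form can be passed to `subprocess.run(command)` as is, while the joined string is suitable for pasting after `!` in a Colab cell.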


In [18]:
'''
clip_timestamps: This takes a comma-separated list of start,end,start,end,... timestamps (in seconds)
of clips to process from the audio file. For example, use --clip_timestamps 0,5 to process the first
5 seconds of the audio clip.
'''
!whisper {mono_file} --model small.en --clip_timestamps 0,5

[00:00.000 --> 00:05.000]  Hello, this is Josue Batista.
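
To make the `start,end,start,end,...` format concrete, the sketch below parses such a string into (start, end) pairs. It is an illustration only, not Whisper's own parsing code; it assumes that a trailing start with no matching end means "until the end of the audio", and `parse_clip_timestamps` is a hypothetical name.

```python
def parse_clip_timestamps(spec):
    """Turn 'start,end,start,end,...' (seconds) into a list of (start, end) pairs.

    A trailing start with no end is paired with None, meaning "to the end of the audio".
    """
    values = [float(part) for part in spec.split(",") if part.strip()]
    if len(values) % 2 == 1:
        values.append(None)
    return list(zip(values[0::2], values[1::2]))


print(parse_clip_timestamps("0,5"))
print(parse_clip_timestamps("0,5,20"))
```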


In [19]:
'''
Controlling the number of sampled candidates: Whisper's --best_of parameter sets how many candidate
transcriptions Whisper samples when decoding with a non-zero temperature. Whisper keeps the highest-scoring
candidate rather than returning the alternatives. The CLI default is 5.
'''
!whisper {mono_file} --model small.en --best_of 3

[00:00.000 --> 00:06.000]  Hello, this is Josue Batista.
[00:06.000 --> 00:13.960]  I am the author of the book Learn Open AI Whisper, transform your understanding of generative
[00:13.960 --> 00:20.880]  AI through robust and accurate speech processing solutions.
[00:20.880 --> 00:31.880]  This is an audio sample that you can use to try and test and enhance your own implementation
[00:31.880 --> 00:32.880]  of whisper.
[00:32.880 --> 00:33.880]  Good luck!


In [20]:
'''
Adjusting temperature: The temperature parameter controls the randomness of decoding, for both
transcription and translation. Lower values produce more predictable results.
'''
!whisper {mono_file} --model small.en --temperature 0

[00:00.000 --> 00:06.000]  Hello, this is Josue Batista.
[00:06.000 --> 00:13.960]  I am the author of the book Learn Open AI Whisper, transform your understanding of generative
[00:13.960 --> 00:20.880]  AI through robust and accurate speech processing solutions.
[00:20.880 --> 00:31.880]  This is an audio sample that you can use to try and test and enhance your own implementation
[00:31.880 --> 00:32.880]  of whisper.
[00:32.880 --> 00:33.880]  Good luck!


In [21]:
'''
Adjusting the beam size for decoding: Whisper's --beam_size flag sets the number of beams used in beam search
during decoding. Beam size affects the accuracy and speed of transcription. A larger beam size might improve
accuracy but will slow down processing.
'''
!whisper {mono_file} --model small.en --temperature 0 --beam_size 2

[00:00.000 --> 00:06.000]  Hello, this is Josue Batista.
[00:06.000 --> 00:13.360]  I am the author of the book Learn Open AI Whisper – Transform Your Understanding of
[00:13.360 --> 00:20.880]  Generative AI Through Robust and Accurate Speech Processing Solutions.
[00:20.880 --> 00:31.920]  This is an audio sample that you can use to try and test and enhance your own implementation
[00:31.920 --> 00:32.920]  of whisper.
[00:32.920 --> 00:33.920]  Good luck!


# A word or two about --beam_size and --temperature

The `--beam_size` parameter in OpenAI's Whisper model refers to the number of beams used in [beam search](https://www.width.ai/post/what-is-beam-search) during the decoding process. Beam search is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. In the context of Whisper, which is an automatic speech recognition (ASR) model, beam search is used to find the most likely sequence of words given the audio input.
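
As a rough illustration of the idea, here is a toy beam search in plain Python. It is a sketch, not Whisper's decoder: `toy_model` and its probabilities are invented for the example, and `<eos>` is an assumed end-of-sequence marker.

```python
import math


def beam_search(step_log_probs, beam_size):
    """Keep the `beam_size` best partial sequences at every step.

    step_log_probs: function mapping a tuple of tokens to {next_token: log_prob}.
    Returns finished (sequence, total_log_prob) pairs, best first.
    """
    beams = [((), 0.0)]
    finished = []
    while beams:
        candidates = []
        for seq, score in beams:
            for token, log_prob in step_log_probs(seq).items():
                candidate = (seq + (token,), score + log_prob)
                if token == "<eos>":
                    finished.append(candidate)
                else:
                    candidates.append(candidate)
        # Prune: only the highest-scoring partial sequences survive.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return sorted(finished, key=lambda item: item[1], reverse=True)


def toy_model(seq):
    """Made-up next-token probabilities; any two-word sequence ends with <eos>."""
    table = {
        (): {"recognize": 0.6, "wreck": 0.4},
        ("recognize",): {"speech": 0.5, "peach": 0.5},
        ("wreck",): {"a": 0.9, "the": 0.1},
    }
    probs = table.get(seq, {"<eos>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


for size in (1, 2):
    best_seq, best_score = beam_search(toy_model, size)[0]
    print(size, " ".join(best_seq), f"{math.exp(best_score):.2f}")
```

With these made-up probabilities, a beam of 1 commits to the likelier first word and ends with a sequence of probability 0.30, while a beam of 2 keeps the alternative alive and finds a sequence of probability 0.36.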

The `--temperature` parameter is used to control the randomness of the output during sampling. A higher temperature results in more random outputs, while a lower temperature makes the model's outputs more deterministic. When the temperature is zero and beam search is not in use, the model decodes greedily, always choosing the most likely next token.
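
The effect of temperature can be seen by applying it to a handful of scores. The sketch below uses invented logits for three candidate tokens and treats temperature 0 as a pure argmax; it illustrates the general technique and is not taken from Whisper's source.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities; temperature 0 is treated as argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.2, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top-scoring token towards the others, which is what makes sampled output more varied.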

The relationship between `--beam_size` and `--temperature` is that they both influence the decoding strategy and the diversity of the generated text. A larger `--beam_size` can potentially increase the accuracy of the transcription by considering more alternative word sequences, but it also requires more computational resources and can [slow down the inference process](https://github.com/openai/whisper/discussions/396). On the other hand, `--temperature` affects the variability of the output; a non-zero temperature allows for sampling from a distribution of possible next words, which can introduce variability and potentially capture more nuances in the speech.

In practice, the `--beam_size` parameter is used when the [temperature is set to zero](https://huggingface.co/spaces/aadnk/whisper-webui/blob/main/docs/options.md), indicating that beam search should be used. If the temperature is non-zero, the `--best_of` parameter is used instead to set the number of candidates to sample. Whisper also applies a temperature fallback: it starts at a temperature of 0 and raises it in steps of 0.2, up to 1.0, whenever a decoded segment fails a quality check, such as when the average log probability over the generated tokens is lower than a threshold or when the generated text has a [gzip compression](https://community.openai.com/t/whisper-hallucination-how-to-recognize-and-solve/218307/16) ratio higher than a certain value.
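
The fallback logic can be sketched in a few lines. Everything here is an approximation under stated assumptions: the compression ratio is computed with `zlib`, the thresholds 2.4 and -1.0 are believed to match Whisper's defaults but should be verified, and `decode_at` and `fake_decoder` are hypothetical stand-ins for the real decoding call.

```python
import zlib

TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)


def compression_ratio(text):
    """Ratio of raw to zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(decode_at, compression_threshold=2.4, logprob_threshold=-1.0):
    """Retry at higher temperatures until a result passes both checks.

    decode_at: hypothetical callable, temperature -> (text, avg_logprob).
    Returns (temperature, text) for the first accepted attempt, or the last attempt.
    """
    for temperature in TEMPERATURES:
        text, avg_logprob = decode_at(temperature)
        too_repetitive = compression_ratio(text) > compression_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break
    return temperature, text


def fake_decoder(temperature):
    """Stand-in decoder: loops on a phrase at temperature 0, behaves afterwards."""
    if temperature == 0.0:
        return "thank you " * 50, -0.3
    return "Hello, this is an audio sample.", -0.4


print(decode_with_fallback(fake_decoder))
```

Here the repetitive output at temperature 0 is rejected because it compresses too well, and the attempt at 0.2 is accepted.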

In summary, `--beam_size` controls the breadth of the search in beam search decoding, and `--temperature` controls the randomness of the output during sampling. They are part of the decoding strategy that affects the final transcription or translation produced by the Whisper model.

# Gratitude

Many thanks to Naval Katoch for his valuable insights.