# Tutorial - Use the Whisper Voice to Text Model

Whisper is an open-source model from OpenAI that converts speech to text.

The Whisper GitHub repo is [https://github.com/openai/whisper](https://github.com/openai/whisper).  

## Setup

In Colab, open the Runtime menu, choose "Change runtime type", and select a T4 GPU (or a similar GPU) rather than a CPU.


Install Whisper from the GitHub repo.  This will take a couple of minutes.

In Colab, use the line that starts with !. The ! prefix is a notebook feature for running shell commands: it tells the notebook to pass the command, pip in this case, to the operating system's shell.

In VS Code, use the line that starts with % instead. %pip is a magic command that installs the package into the environment of the running notebook kernel. Better still, run the command from a terminal window without the prefix. (Google Colab now offers a terminal too.)

The git+ prefix tells pip that the package should be cloned directly from a Git repository instead of being downloaded from PyPI (Python Package Index).

Alternatively, you could install the release published on PyPI with pip install openai-whisper.

The [PyPI documentation page for Whisper](https://pypi.org/project/openai-whisper/) has useful information.
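Outside a notebook, the same install can be scripted from Python. This is a minimal standard-library sketch, not part of Whisper: running pip as `sys.executable -m pip` installs into the interpreter that runs the script, which is the same goal %pip serves in a notebook. The constant and function names are hypothetical.

```python
import subprocess
import sys
from typing import List

# The two install sources described above (illustrative constant names).
WHISPER_GIT = "git+https://github.com/openai/whisper.git"  # clone from GitHub
WHISPER_PYPI = "openai-whisper"  # download the release from PyPI


def pip_install_command(package: str) -> List[str]:
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package: str) -> int:
    """Run pip in a child process and return its exit code (0 means success)."""
    completed = subprocess.run(pip_install_command(package))
    return completed.returncode
```

Calling pip_install(WHISPER_GIT) does the same job as the %pip line in the cell below; pip_install(WHISPER_PYPI) installs from PyPI instead.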

In [2]:
#  !pip install git+https://github.com/openai/whisper.git # use for Colab
%pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to c:\users\mark\appdata\local\temp\pip-req-build-pb87w7kw
  Resolved https://github.com/openai/whisper.git to commit 517a43ecd132a2089d85f4ebc044728a71d49f6e
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting more-itertools (from openai-whisper==20240930)
  Using cached more_itertools-10.6.0-py3-none-any.whl.metadata (37 kB)
Collecting numba (from openai-whisper==20240930)
  Using cached numba-0.61.0-cp312-cp312-win_amd64.whl.metadata (2.8 kB)
Collecting tiktoken (from openai-whisper==20240930)
  Using cached tiktoken-0.8.0-cp312-cp312-win_amd64.whl.metadata (6.8 kB)
Collecting torch (from openai-whis

  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git 'C:\Users\Mark\AppData\Local\Temp\pip-req-build-pb87w7kw'

[notice] A new release of pip is available: 24.0 -> 25.0
[notice] To update, run: python.exe -m pip install --upgrade pip
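Once the install finishes, a quick way to confirm that Python can see the package is to ask Whisper which model names it knows about. A minimal sketch: whisper.available_models() belongs to the Whisper package, while the two helper names are hypothetical.

```python
import importlib.util
from typing import List


def whisper_installed() -> bool:
    """Return True if a `whisper` package can be imported in this environment."""
    return importlib.util.find_spec("whisper") is not None


def list_whisper_models() -> List[str]:
    """Return the model names the installed Whisper package can download."""
    if not whisper_installed():
        raise RuntimeError("Whisper is not installed; run the install cell first.")
    import whisper  # imported lazily so this cell loads even without Whisper

    return whisper.available_models()
```

After a successful install, list_whisper_models() should include names such as tiny and medium.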


Install FFmpeg. This is a powerful, open-source command-line tool for processing audio, video, and other multimedia files: it can record, convert, edit, compress, and stream content across many formats. Whisper uses FFmpeg to decode the audio file before transcribing it.

If you are using VS Code on Windows, you need to install FFmpeg separately. Here are instructions for doing that.

### Installing FFmpeg on Windows Using Chocolatey
Chocolatey (choco) is a Windows package manager that makes installing software easier via the command line.

#### Step 1: Open PowerShell as Administrator

Click Start, type PowerShell, right-click it, and select "Run as administrator".

#### Step 2: Install Chocolatey (if not installed)

If you haven't installed Chocolatey yet, run this command in PowerShell:

> Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

After installation, restart PowerShell.

#### Step 3: Install FFmpeg Using Chocolatey
Run the following command:

> choco install ffmpeg -y

-y automatically confirms the installation.  
Chocolatey will download and install FFmpeg for you.

#### Step 4: Verify the Installation

Once installed, check if FFmpeg is working by running:

> ffmpeg -version

If installed correctly, you will see FFmpeg version details.
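Whisper runs ffmpeg as an external program, so it must be on the PATH of whatever process runs Python (your notebook kernel, for example). The same check can be made from Python with the standard library; this is a sketch and the helper names are hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_path() -> Optional[str]:
    """Return the full path of the ffmpeg executable on PATH, or None if missing."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line() -> Optional[str]:
    """Run `ffmpeg -version` and return the first line of its output."""
    path = ffmpeg_path()
    if path is None:
        return None
    completed = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=True
    )
    lines = completed.stdout.splitlines()
    return lines[0] if lines else None
```

If ffmpeg_path() returns None in VS Code even though ffmpeg -version works in PowerShell, restarting VS Code so that it picks up the updated PATH may help.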

#### Alternative: Install the Full Version

By default, Chocolatey installs the FFmpeg essentials package (lighter version). To install the full version:

> choco install ffmpeg-full -y


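In [4]:
# On Colab or Debian/Ubuntu Linux, uncomment to install FFmpeg with apt (Colab usually has it preinstalled)
# !sudo apt update && sudo apt install -y ffmpeg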

Run the Whisper model.

The model comes in several sizes, including tiny, base, small, medium, and large. Start with the tiny model. As the size increases, the transcription becomes more accurate, but the model takes longer to run and may not run on consumer PC hardware. If you are using Colab on a more challenging audio file, increase the model size, up to medium.
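The advice above (start with tiny, step up toward medium for harder audio) can be captured in a small helper. This is a sketch: the size list follows the names above, torch.cuda.is_available() is PyTorch's GPU check, and next_model_size and gpu_available are hypothetical names.

```python
from typing import List

# Model names ordered from smallest/fastest to largest/most accurate.
MODEL_SIZES: List[str] = ["tiny", "base", "small", "medium", "large"]


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def next_model_size(current: str, ceiling: str = "medium") -> str:
    """Step up one model size, without going past `ceiling`."""
    i = MODEL_SIZES.index(current)
    cap = MODEL_SIZES.index(ceiling)
    if i >= cap:
        return current  # already at (or above) the ceiling: keep the current model
    return MODEL_SIZES[i + 1]
```

The returned name can be passed to the --model option of the whisper command.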

Later, if you are using Colab, you may want to switch the runtime from the GPU to the CPU and see how much longer the file takes to transcribe.
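Instead of switching runtimes by hand, you could time both devices in one session. The sketch below assumes PyTorch and Whisper are installed; whisper.load_model(name, device=...) and model.transcribe(...) are Whisper's Python API, while time_call and compare_devices are hypothetical helpers. Passing fp16=False on the CPU avoids Whisper's warning that FP16 is not supported there.

```python
import time
from typing import Any, Callable, Dict, Tuple


def time_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_devices(audio: str, model_name: str = "tiny") -> Dict[str, float]:
    """Transcribe the same audio on the CPU and, if present, a CUDA GPU."""
    import torch
    import whisper  # requires the openai-whisper package installed above

    devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
    timings: Dict[str, float] = {}
    for device in devices:
        model = whisper.load_model(model_name, device=device)
        # Only the transcription is timed; model loading happens beforehand.
        _, seconds = time_call(model.transcribe, audio, fp16=(device == "cuda"))
        timings[device] = seconds
    return timings
```

For example, compare_devices("https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/gettysburg_address.mp3") returns the seconds taken on each device. The first run on a device can include one-off start-up costs, so treat a single measurement as a rough guide.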

The file in the code cell below is a publicly available LibriVox recording of the Gettysburg Address. Feel free to replace it with an audio file of your choosing.

In [1]:
!whisper https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/gettysburg_address.mp3 --model tiny

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:05.000]  This is a Librevox recording. All Librevox recordings are in the public domain.
[00:05.000 --> 00:10.000]  For more information or to volunteer, please visit Librevox.org.
[00:13.000 --> 00:21.000]  The Gettysburg address by Abraham Lincoln, November 19th, 1863.
[00:21.000 --> 00:38.000]  Four score and seven years ago, our fathers brought forth on this continent a new nation conceived in liberty and dedicated to the proposition that all men are created equal.
[00:38.000 --> 00:50.000]  Now we are engaged in a great civil war, testing whether that nation or any nation so conceived and so dedicated can long endure.
[00:50.000 --> 01:05.000]  We are met on a great battlefield of that war. We have come to dedicate a portion of that field as a final resting place for those who hear gave their lives that that nation might live.
[01:05.000 --> 01
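The same transcription can be run from Python instead of the command line. This is a sketch: whisper.load_model and model.transcribe are Whisper's API, and each entry in the result's "segments" list has start, end, and text fields. The helpers format_timestamp, format_segment, and transcribe_lines are hypothetical, written to print lines in the same style as the output above.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.mmm (hours are not split out)."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segment(segment: Dict) -> str:
    """Render one segment (a dict with start, end, text) as a transcript line."""
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    return f"[{start} --> {end}] {segment['text'].strip()}"


def transcribe_lines(audio: str, model_name: str = "tiny") -> List[str]:
    """Transcribe a local path or URL and return one formatted line per segment."""
    import whisper  # requires the openai-whisper package installed above

    model = whisper.load_model(model_name)
    result = model.transcribe(audio)
    return [format_segment(seg) for seg in result["segments"]]
```

For example, loop over transcribe_lines("https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/gettysburg_address.mp3") and print each line. The URL works for the same reason it does on the command line: Whisper hands the input to FFmpeg to decode.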



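In Colab, open the Files icon in the left sidebar to view the files on the hosted runtime; elsewhere, look in the folder where you ran the command. By default, Whisper writes several files named "gettysburg_address" with different file types: .txt contains the plain text, .srt and .vtt are captioning files that also include timestamps, and .tsv and .json record the segment timings as well.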
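Whisper writes these captioning files for you. To see what an .srt file holds, here is a sketch that builds one from segments shaped like Whisper's result["segments"] (dictionaries with start, end, and text). It is not Whisper's own writer, and the function names are hypothetical. A .vtt file differs mainly in starting with a WEBVTT header and using a period instead of a comma before the milliseconds.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT files use."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Build SRT text: numbered cues, each with a time range and its caption."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(cues)  # a blank line separates consecutive cues
```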

You can download or delete these if you wish.
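If you prefer to tidy up from Python, the standard library's pathlib can find and delete the output files. This is a sketch with hypothetical helper names. Note that the pattern also matches the source audio (for example gettysburg_address.mp3) if it sits in the same folder, so check the list before deleting.

```python
from pathlib import Path
from typing import List


def whisper_outputs(stem: str, folder: str = ".") -> List[Path]:
    """List files in `folder` named `stem` with any extension (stem.txt, stem.srt, ...)."""
    return sorted(p for p in Path(folder).glob(f"{stem}.*") if p.is_file())


def delete_outputs(stem: str, folder: str = ".") -> int:
    """Delete the files found by whisper_outputs and return how many were removed."""
    files = whisper_outputs(stem, folder)
    for path in files:
        path.unlink()
    return len(files)
```

For example, whisper_outputs("gettysburg_address") lists the transcript and caption files in the current folder.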