<a href="https://colab.research.google.com/github/Kaarimu/whisper_swahili/blob/main/Using_whisper_in_three_easy_steps_swahili.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font size=64px>Try Whisper in Three Easy Steps</font><a href="https://deepgram.com/openai-whisper"><img src="https://drive.google.com/uc?id=1SPAig9IJ_0cBG-7Vzf7fP4vEDON0HeIO"
height="64" align="right"></a>

<font size=2px>By Ross O'Connell</font>

Whisper is an exciting new model for automatic speech recognition (ASR) developed by OpenAI. There are a few potential pitfalls to installing it on a local machine, so speech recognition experts at [Deepgram](https://deepgram-blog.ghost.io/ghost/#/editor/post/63374e260072bc003d64fd6a) have put together this Colab notebook. Our goal is to make it super easy for everybody to see what Whisper can do!

**We chose some fun audio to transcribe – can you identify it from Whisper's transcription?**

In the first cell we install Whisper!

In [None]:
!pip install git+https://github.com/openai/whisper.git

Next we pull down some audio to transcribe.

In [None]:
!pip install yt-dlp
!yt-dlp "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --format m4a -o "%(id)s.%(ext)s"

Finally, we run Whisper! It may take a little time to get started, but soon the transcription should start to appear.

In [None]:
!whisper "/content/dQw4w9WgXcQ.m4a" --model small --language English

## Checking Whisper's Work

Whisper hasn't just produced text; it has also given us the time intervals where it believes that text occurred. In this section we'll read in Whisper's transcript, split up the audio according to Whisper's timestamps, and then print Whisper's text and play the corresponding audio. How well do they match?

In [None]:
import pandas as pd
import numpy as np
import IPython.display as ipd

import warnings
warnings.filterwarnings('ignore')

Whisper saves its transcript in more than one format, including `.vtt`, which stores timestamps alongside the text. We'll install `webvtt-py`, a package that can read that format.

In [None]:
!pip install webvtt-py

In [None]:
import webvtt
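
To make the format concrete, here is a minimal sketch of what a WebVTT file looks like and how its cues map onto start, end, and text fields. The sample cues and the `parse_vtt` helper are invented for illustration. This is not how `webvtt-py` is implemented, and it ignores optional WebVTT features such as cue identifiers, cue settings, and NOTE blocks.

```python
# Hypothetical sample; a real Whisper transcript will contain different cues.
SAMPLE_VTT = """WEBVTT

00:00:00.000 --> 00:00:04.500
First line of the transcript.

00:00:04.500 --> 00:00:09.000
Second line,
continued on a new row.
"""

def parse_vtt(text):
    """Return a list of (start, end, text) tuples from simple WebVTT text."""
    cues = []
    # Cues are separated by blank lines; the first block is the WEBVTT header.
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if not lines or "-->" not in lines[0]:
            continue  # skip the header and anything that is not a plain cue
        start, end = [part.strip() for part in lines[0].split("-->")]
        # A cue's text may span several lines; join them with spaces.
        cues.append((start, end, " ".join(lines[1:])))
    return cues

cues = parse_vtt(SAMPLE_VTT)
for start, end, caption in cues:
    print(start, end, caption)
```

In the rest of this notebook we rely on `webvtt-py` rather than a hand-written parser, since a library is more likely to cope with the full format.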

`librosa` is a library for reading and manipulating audio files.

In [None]:
import librosa

We have two custom functions here, one to convert H:M:S timestamps into seconds, and another to trim out a chunk of audio corresponding to a particular `start` and `end` time.

In [None]:
def simple_hms(s):
  h,m,sec = [float(x) for x in s.split(':')]
  return 3600*h + 60*m + sec
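
`simple_hms` expects exactly three colon-separated fields. The WebVTT format also allows timestamps that omit the hours field (for example `01:23.500`), and whether you ever see those here depends on the library versions in use. As a defensive sketch, the hypothetical `hms_to_seconds` below accepts both forms by folding the fields from left to right:

```python
def hms_to_seconds(stamp):
    """Convert 'H:M:S' or 'M:S' (seconds may be fractional) to seconds."""
    total = 0.0
    # Each new field is worth 1/60 of the previous one, so multiply as we go.
    for part in stamp.split(":"):
        total = 60 * total + float(part)
    return total

print(hms_to_seconds("00:01:23.500"))  # hours, minutes, seconds
print(hms_to_seconds("01:23.500"))     # minutes and seconds only
```

Both calls describe the same instant, one minute and 23.5 seconds into the recording.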

In [None]:
def trim_audio(row,audio,sample_rate):
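  # Time (in seconds) of every sample in the recording.
  t = np.arange(len(audio))/sample_rate
  # Keep the samples that fall inside this segment's [start_s, end_s] window.
  f = np.where( (t>=row.start_s) & (t<=row.end_s) )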
  return audio[f]
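
Because sample `i` occurs at time `i / sample_rate`, the same window can be selected by converting the start and end times to indices and slicing. The sketch below illustrates that idea with a hypothetical helper, `trim_by_slice`, which is not part of the notebook. It should pick the same samples as `trim_audio`, up to floating-point rounding at the window edges, without building a time array for every row.

```python
import math

def trim_by_slice(audio, sample_rate, start_s, end_s):
    """Return the samples whose times fall in [start_s, end_s]."""
    # First sample at or after start_s, clamped to the start of the audio.
    first = max(0, math.ceil(start_s * sample_rate))
    # One past the last sample at or before end_s; never before `first`.
    stop = max(first, math.floor(end_s * sample_rate) + 1)
    return audio[first:stop]

# Toy signal: 100 samples at 10 samples per second (10 seconds of "audio").
toy_audio = list(range(100))
clip = trim_by_slice(toy_audio, sample_rate=10, start_s=1.0, end_s=2.0)
print(len(clip), clip[0], clip[-1])
```

The example uses a plain list so that it runs without extra packages; NumPy arrays such as the one `librosa` returns support the same slice syntax.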

As promised, we use `webvtt` to read in the transcript and `librosa` to read in the audio.

In [None]:
transcript = webvtt.read('/content/dQw4w9WgXcQ.m4a.vtt')
audio,sample_rate = librosa.load('/content/dQw4w9WgXcQ.m4a')

In [None]:
# Optional: mount Google Drive. The rest of this notebook only reads files under /content.
from google.colab import drive
drive.mount('/content/drive')

For convenience we're going to set up a Pandas dataframe to store the various quantities we want to track. Each row will correspond to one segment of the Whisper transcript.

In [None]:
df = pd.DataFrame(columns=['start','end','text'])

df['start'] = [x.start for x in transcript]
df['end'] = [x.end for x in transcript]
df['text'] = [x.text for x in transcript]
df['start_s'] = df['start'].apply(simple_hms)
df['end_s'] = df['end'].apply(simple_hms)
df['audio'] = df.apply(trim_audio,axis=1,args=(audio,sample_rate))
df.head()

Finally, we'll grab a random segment of the Whisper transcript, print out the text, and play the audio. If there's a particular segment you want to look at, you can **comment out** the `segment = df.sample...` line, **uncomment** the `segment = df.loc...` line, and enter the number of the segment you want to see!

In [None]:
segment = df.sample(n=1).iloc[0]
# segment = df.loc[16]
print(segment.text)
ipd.Audio(segment.audio,rate=sample_rate)

## Looking at Whisper's Word Error Rate (WER)

A simple way to quantify Whisper's performance is to look at its [Word Error Rate](https://blog.deepgram.com/what-is-word-error-rate/) (WER). In this section we're going to load a new audio source, an anonymous reading of a newspaper article, as well as the text of the article. We'll compare Whisper's transcript to the true text and look at the WER!
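
WER is usually defined as WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words needed to turn the reference into the transcript, and N is the number of words in the reference. The sketch below computes it with a word-level edit distance. The function name `word_error_rate` and the example sentences are made up for illustration, and a dedicated library may preprocess text differently, so treat this as an explanation of the metric rather than a replacement for one.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis needs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of an extra word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown fox"))
print(word_error_rate("the quick brown fox", "the quack brown"))
```

In the second call one word is substituted and one is deleted, so two errors over four reference words give a WER of 0.5. Because insertions count as errors too, WER can exceed 1.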

Our example for this section will be "Nikola Tesla Sees a Wireless Vision", a 1915 article from the New York Times. We'll download two files:

* The audio file, a volunteer reading the article for [LibriVox](https://librivox.org/short-nonfiction-collection-vol-025-by-various/)
* A text file with the original text of the article

In [None]:
!wget https://static.deepgram.com/examples/snf025_nikolateslawirelessvision_anonymous_gu.mp3
!wget https://static.deepgram.com/examples/Nikola_Tesla_Sees_a_Wireless_Vision.txt

Now that we've got the audio, we'll have Whisper transcribe it!

In [None]:
!whisper "/content/snf025_nikolateslawirelessvision_anonymous_gu.mp3" --model small --language English

Next we'll read in the text file that Whisper has generated. There are a few lines that describe the recording and are not part of the original text, so we'll strip those out.

In [None]:
with open('/content/snf025_nikolateslawirelessvision_anonymous_gu.mp3.txt','r') as f:
  whisper_lines = [l.strip() for l in f]

# Strip out the boilerplate: 4 lines at the beginning and 2 at the end of the file.
whisper_lines = whisper_lines[4:-2]

We downloaded the true text of the article earlier, so we just need to read it in:

In [None]:
with open('/content/Nikola_Tesla_Sees_a_Wireless_Vision.txt','r') as f:
  true_lines = [l.strip() for l in f]

It's time for a little bit of **text cleaning**. Although Whisper seems to be pretty good at capitalization, there is some unusual use of capitalization in the original text that Whisper couldn't know about. We don't want to penalize Whisper for that, so we'll convert all text to lowercase. We're also going to remove quotation marks -- again, Whisper seems pretty good at getting these in the right place, but we'd like to focus on words for now.

There's more text cleaning we could do -- for example, we could be much more careful with how **numbers** and **numerals** are handled. For this demonstration, though, we're going to keep things simple!

In [None]:
whisper_text = " ".join(whisper_lines)
whisper_text = whisper_text.replace('"','')
whisper_text = whisper_text.lower()

true_text = " ".join(true_lines)
true_text = true_text.replace('"','')
true_text = true_text.lower()
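
If you want to go one step further, the same cleaning can be wrapped in a function and applied to both texts. The hypothetical `normalize` helper below is a sketch of stricter cleaning than this notebook uses: besides lowercasing, it drops all ASCII punctuation and collapses whitespace. It would change the WER reported later, and it has side effects worth knowing about, such as turning "don't" into "dont" and leaving curly quotes untouched.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    # A translation table with a deletion list removes every punctuation mark.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments discards repeated spaces, tabs, and newlines.
    return " ".join(text.split())

print(normalize('He said, "Wireless  power is coming!"'))
```

Whatever cleaning you choose, apply exactly the same steps to the reference and to Whisper's transcript so that neither side is penalized for formatting alone.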

With all that done, we're ready to compute the WER. How did Whisper do?

In [None]:
!pip install jiwer

In [None]:
from jiwer import wer

wer(true_text,whisper_text)

Note that this is how Whisper performs in optimal circumstances, with a single, clear speaker. In other circumstances its performance may be very different! Deepgram researchers will take a look at that in an [upcoming blog post](https://blog.deepgram.com/).