<a href="https://colab.research.google.com/github/fastforwardlabs/whisper-openai/blob/master/WhisperDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Make your own recordings and transcriptions with OpenAI's Whisper!

This notebook is based on OpenAI's [LibriSpeech](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb) Colab example. 

OpenAI [recently released](https://openai.com/blog/whisper/) Whisper, an automatic speech recognition (ASR) system that was trained on a colossal heap of audio data collected from the web. 

## Installs and imports 
The commands below will install the Python packages needed to record audio snippets and use Whisper models for speech-to-text transcription.

In [None]:
! pip install git+https://github.com/openai/whisper.git
! pip install sounddevice wavio
! pip install ipywebrtc notebook

We also need the following in order to record audio from this notebook and process the resulting files. 

In [None]:
!apt-get install -y ffmpeg libportaudio2

In [4]:
import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from ipywebrtc import AudioRecorder, CameraStream
from IPython.display import Audio, display
import ipywidgets as widgets

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

## Make your recording

First, we need to enable some Colab widgets so that we can make an audio recording. 

In [5]:
from google.colab import output
output.enable_custom_widget_manager()

### Time to record! 

Press the circle button and start speaking. It may not look like it, but the widget is capturing sound. Click the circle button again when you are finished. The widget will immediately begin to play back what it captured.

In [17]:
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

The recorder captures audio in the WebM format, which is not readable by PyTorch. In this step, we save the recording to disk and use ffmpeg to convert it into a single-channel (mono) WAV file that PyTorch can understand.

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav my_recording.wav -y -hide_banner -loglevel panic

torch.Size([1, 377280])
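
If you want to double-check the converted file before moving on, a quick look at its header can help. The sketch below is one possible way to do that with Python's standard `wave` module. The helper name `describe_wav` is made up for this example, and it assumes ffmpeg wrote plain PCM audio, which is its usual choice for `.wav` output.

```python
import os
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        # Number of frames divided by frames per second gives the length in seconds.
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration


# Only inspect the recording if the conversion step has already produced it.
if os.path.exists("my_recording.wav"):
    print(describe_wav("my_recording.wav"))
```

Because the ffmpeg command above passes `-ac 1`, the first value should come back as 1 channel.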


### Alternatively... 
If you don't want to make your own recording, you can instead upload an audio file to this notebook. 


## Select options

Whisper is capable of performing transcriptions for many languages, though it performs better for some languages and worse for others.

Whisper is also capable of detecting the input language. However, to be on the safe side, we can explicitly tell Whisper which language to expect. 

In [6]:
language_options = whisper.tokenizer.TO_LANGUAGE_CODE 
language_list = list(language_options.keys())

In [7]:
# Dropdown listing every language name Whisper knows; defaults to English.
lang_dropdown = widgets.Dropdown(options=language_list, value='english')
display(lang_dropdown)

Dropdown(options=('english', 'chinese', 'german', 'spanish', 'russian', 'korean', 'french', 'japanese', 'portu…

Whisper is also capable of several tasks, including English-only transcription, any-to-English translation, and non-English transcription.

Below you can select either "transcribe" (which yields text in the same language as the input) or "translate" (which turns non-English speech into English text).

![Whisper capabilities](https://cdn.openai.com/whisper/draft-20220920a/asr-training-data-desktop.svg)

Image from [Introducing Whisper](https://openai.com/blog/whisper/) by OpenAI

In [8]:
# Dropdown for the task; "transcribe" keeps the spoken language, "translate" outputs English.
task_dropdown = widgets.Dropdown(options=['transcribe', 'translate'], value='transcribe')
display(task_dropdown)

Dropdown(options=('transcribe', 'translate'), value='transcribe')

## Load Whisper model

Whisper comes in five model sizes ("tiny", "base", "small", "medium", and "large"), four of which also have an optimized English-only version. This notebook loads a "base"-sized model (bigger than "tiny" but smaller than the others), which requires about 1 GB of memory (VRAM when running on a GPU).

If you selected English above, the cell below will load the optimized English-only version. Otherwise, it will load the multilingual model. 


In [13]:
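# Load the English-only checkpoint for English, the multilingual one otherwise.
if lang_dropdown.value == "english":
    model = whisper.load_model("base.en")
else:
    model = whisper.load_model("base")

# Report which variant was loaded and how many parameters it has.
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)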

100%|████████████████████████████████████████| 139M/139M [00:01<00:00, 119MiB/s]


Model is English-only and has 71,825,408 parameters.
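
The same choice can be written as a small helper if you want to try other model sizes. This is only a sketch: `pick_model_name` and `ENGLISH_ONLY_SIZES` are names invented for this example, and the set of sizes reflects the point above that the largest model has no English-only version.

```python
# Sizes that ship with an English-only ".en" variant (the largest size does not).
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def pick_model_name(language, size="base"):
    """Return the model name to load for a language and size.

    Hypothetical helper: picks the ".en" variant for English when one
    exists, and the multilingual model in every other case.
    """
    language = language.strip().lower()
    if language in ("english", "en") and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    return size


print(pick_model_name("english"))           # base.en
print(pick_model_name("spanish"))           # base
print(pick_model_name("english", "large"))  # large
```

The returned string could then be passed to `whisper.load_model` in place of the hard-coded names used in the cell above.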


Finally, let's set the rest of our task and language options below and see what we've got. Check that your task and language settings are correct, but don't worry about the other defaults. 

In [16]:
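options = whisper.DecodingOptions(
    language=lang_dropdown.value,
    task=task_dropdown.value,
    without_timestamps=True,
    # Half precision is not reliably supported on CPU, so only use it on a GPU.
    fp16=(model.device.type == "cuda"),
)
options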

DecodingOptions(task='translate', language='english', temperature=0.0, sample_len=None, best_of=None, beam_size=None, patience=0.0, length_penalty=None, prompt=None, prefix=None, suppress_blank=True, suppress_tokens='-1', without_timestamps=True, max_initial_timestamp=0.0, fp16=True)

## Take Whisper for a test drive

All that's left to do now is feed our audio into Whisper. 

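The cell below performs the last processing steps to make this happen. First, it loads our PyTorch-ready audio file. Then it pads or trims the audio to exactly 30 seconds, the window length Whisper expects, so anything past the 30-second mark is dropped. Finally, it creates a log-mel spectrogram (wait, wut??) of the audio, which is a picture of how much energy the sound has at each frequency over time, and feeds it into Whisper along with the options we set above.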

Note: if you chose to upload your own audio file rather than create one through this notebook, you'll need to update the audio filename below. 
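
To make the 30-second window concrete, here is a rough pure-Python illustration of the pad-or-trim idea. It is not Whisper's own implementation, which works on arrays and tensors. The helper `pad_or_trim_samples` is invented for this sketch, and the 16 kHz sample rate is stated as an assumption about the audio Whisper works with.

```python
SAMPLE_RATE = 16_000                     # samples per second (assumed input rate)
CHUNK_SECONDS = 30                       # length of the window fed to the model
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def pad_or_trim_samples(samples, length=N_SAMPLES):
    """Return a list of exactly `length` samples.

    Longer input is cut off at `length`; shorter input is extended
    with zeros (silence) at the end.
    """
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


five_seconds = [0.1] * (5 * SAMPLE_RATE)
window = pad_or_trim_samples(five_seconds)
print(len(window))  # 480000
```

A five-second clip ends up with 25 seconds of silence appended, while a two-minute clip would keep only its first 30 seconds.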

In [None]:
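# Load the WAV file as a mono waveform at the sample rate Whisper expects.
audio = whisper.load_audio("my_recording.wav")
# Pad with silence or cut so the clip is exactly 30 seconds long.
audio = whisper.pad_or_trim(audio)
# Convert the waveform to a log-mel spectrogram on the model's device.
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# Run the model with the task and language options chosen earlier.
result = model.decode(mel, options)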

In [None]:
result.text

'Como cedice, y otra bajo en el banyo con mi gato es muy bueno.'

How well did Whisper do??

For the record, that's *exactly* what I said in my soundbite. :D
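
If you'd like a number instead of a gut feeling, one common metric for speech recognition is word error rate (WER): the fewest word insertions, deletions, and substitutions needed to turn the transcript into what was actually said, divided by the number of words actually said. The sketch below is a minimal pure-Python version. The function name `word_error_rate` and the example sentences are made up for illustration, and it only lowercases and splits on whitespace, so punctuation differences count as errors.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one wrong word out of three.
print(word_error_rate("the cat sat", "the hat sat"))
```

To score your own recording, pass what you actually said as `reference` and `result.text` as `hypothesis`. A score of 0 means a perfect match.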