# Speech to Text with Whisper

This code book allows you to detect the language(s) found in an audio file and transcribe the audio into text, using the Python package Whisper. It was written by [James Baker](https://www.southampton.ac.uk/humanities/about/staff/jwb1n21.page) in October 2022 and is shared under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) (excluding data). Full documentation on how to use Whisper can be found on [GitHub](https://github.com/openai/whisper).

To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Make sure you're running the notebook in Firefox or Google Chrome (it will run in other browsers, but any issues you hit are often browser-related).
3. Run the cells below: that is, hit the play buttons in order, waiting for each to complete (a tick appears to the left of the code block) before moving on to the next.


**1. Install Whisper**

In [1]:
!pip install git+https://github.com/openai/whisper.git -q

In [2]:
import whisper

**2. Load a Whisper model**

The smallest model is `tiny`; the largest is `large`. To use a different model, replace `small` in the cell below with its name. Larger models take longer to process your audio, but are generally more accurate. Model descriptions can be found [on GitHub](https://github.com/openai/whisper#available-models-and-languages).

In [3]:
model = whisper.load_model("small")
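
As a rough aid to choosing a model, the sketch below picks the largest model that fits a size budget. The parameter counts are the approximate figures listed in the Whisper README and may have changed since, so treat them as assumptions; `pick_model` is a hypothetical helper, not part of Whisper.

```python
# Approximate parameter counts in millions, as listed in the Whisper README.
# These are assumptions: check the README for current figures.
MODEL_PARAMS_M = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}


def pick_model(max_params_m):
    """Return the name of the largest model within the budget (hypothetical helper)."""
    fitting = [name for name, size in MODEL_PARAMS_M.items() if size <= max_params_m]
    if not fitting:
        raise ValueError(f"No model has {max_params_m} million parameters or fewer")
    return max(fitting, key=MODEL_PARAMS_M.get)


print(pick_model(300))
```

The returned name can then be passed to `whisper.load_model` in place of `"small"`.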

**3. Upload your audio file**

Choose ONE of the options below.

*Option 1: direct upload*

Our example audio file is **the speaking voice of [Anne-Marie Imafidon](https://commons.wikimedia.org/wiki/File:Anne-Marie_Imafidon_voice_-_en.flac)** converted to mp3, [downloadable here](https://drive.google.com/file/d/1ulVCJEJPp77UNJmdSF9IE3Yb7it4yTy1/view?usp=sharing).

Once downloaded, in the sidebar on the left of the screen, select *Files* (the folder icon), hit the upload icon, and upload your mp3 to your notebook. Once the file appears in the sidebar you are ready to go.

In [29]:
input = whisper.load_audio("Anne-Marie_Imafidon_voice_en.mp3")

*Option 2: mount via Google Drive*

This option is ideal for larger files.

First, add the audio file to a folder on your Google Drive called `voice-audio`.

Second, run the cell below to mount your personal Google Drive in the VM (a prompt may ask for an auth code; that auth is not saved anywhere; after entering your auth code, hit enter).

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Third, load the audio file: in this case, the larger `.flac` version of the speaking voice of [Anne-Marie Imafidon](https://commons.wikimedia.org/wiki/File:Anne-Marie_Imafidon_voice_-_en.flac).

In [5]:
input = whisper.load_audio("/content/drive/MyDrive/voice-audio/Anne-Marie_Imafidon_voice_en.flac")
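
If loading fails, a mistyped path is a likely cause. The sketch below lists the audio files in a folder so the file name can be checked against the one in the cell above. `list_audio_files` is a hypothetical helper, and the set of file extensions is illustrative rather than a list of everything Whisper can read.

```python
from pathlib import Path

# Illustrative set of extensions (an assumption, not an exhaustive list).
AUDIO_SUFFIXES = {".mp3", ".flac", ".wav", ".m4a"}


def list_audio_files(folder):
    """Return the sorted names of audio files in a folder, or [] if it is missing."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(
        path.name
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES
    )


# The folder used in this notebook; prints [] if Drive is not mounted.
print(list_audio_files("/content/drive/MyDrive/voice-audio"))
```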

**4. Detect the language of the audio file**

In [6]:
# pad or trim the loaded audio so that it fits 30 seconds
audio = whisper.pad_or_trim(input)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

Detected language: en
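
The cell above reports only the single most likely language, judged from the 30 seconds of audio kept by `pad_or_trim`. Since `probs` maps language codes to probabilities, the runners-up can be listed as well. This is a minimal sketch: `top_languages` is a hypothetical helper and the example probabilities are made up.

```python
def top_languages(probs, n=3):
    """Return the n most probable (language, probability) pairs, highest first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up probabilities for illustration; in the notebook, pass the real `probs`.
example_probs = {"en": 0.97, "cy": 0.02, "fr": 0.01}
for language, probability in top_languages(example_probs, n=2):
    print(f"{language}: {probability:.2f}")
```

A close second place can be a hint that the recording mixes languages or that the opening 30 seconds are not representative.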


**5. Transcribe the audio file**

In [7]:
result = model.transcribe(input)
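
By default `transcribe` detects the language itself. If you already know it, you can pass it as the `language` option instead. The sketch below wraps that choice in a small function; `transcribe_with_language` is a hypothetical helper, and it assumes only that the model object has a `transcribe` method accepting a `language` keyword, as Whisper models do.

```python
def transcribe_with_language(model, audio, language=None):
    """Call model.transcribe, optionally fixing the language (hypothetical helper)."""
    options = {}
    if language is not None:
        # A language code such as "en"; when set, automatic detection is skipped.
        options["language"] = language
    return model.transcribe(audio, **options)


# In the notebook, with `model` and `input` defined in the cells above:
# result = transcribe_with_language(model, input, language="en")
```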



**6. Print the output**

In [8]:
print(result["text"])

 Hello, I'm Anne-Marie Imaffidon. I was born in Barking Essex and I now run social enterprise, Stemets. I also love Nando's.
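
Besides `"text"`, the result also holds a `"segments"` list, in which each entry has `"start"` and `"end"` times in seconds and its own `"text"`. The sketch below prints one line per segment with its timestamps. `format_segments` is a hypothetical helper and the example segments are made up.

```python
def format_segments(segments):
    """Return one line per segment in the form '[start to end] text'."""
    lines = []
    for segment in segments:
        start, end = segment["start"], segment["end"]
        lines.append(f"[{start:.1f}s to {end:.1f}s] {segment['text'].strip()}")
    return "\n".join(lines)


# Made-up segments for illustration.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello, this is an example."},
    {"start": 2.5, "end": 6.0, "text": " It has two segments."},
]
print(format_segments(example_segments))

# In the notebook, with `result` from step 5:
# print(format_segments(result["segments"]))
```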
