# Audio transcription
Whisper is an open-source speech recognition model [developed by OpenAI](https://openai.com/research/whisper).

Here we see how to use it to convert audio into text, a good example of the potential of large neural networks trained on huge amounts of text and audio.

The model can be installed directly from GitHub:

In [3]:
!pip install git+https://github.com/openai/whisper.git -q
    

In [13]:
pip install --force-reinstall "numpy==1.24"

Collecting numpy==1.24
  Using cached numpy-1.24.0-cp311-cp311-macosx_11_0_arm64.whl (13.8 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.0
    Uninstalling numpy-1.24.0:
      Successfully uninstalled numpy-1.24.0
Successfully installed numpy-1.24.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy
print(numpy.__version__)

1.24.0


Note: ffmpeg must also be installed (copying the executable file is enough) and added to the PATH.
The model needs it to decode audio and video formats (e.g. you can pass a video file as input and its audio track will be extracted).

On Mac: copy it to /usr/local/bin and verify that it works with the command "which ffmpeg".
It might also be necessary to open it once so that macOS adds an exception and records that you trust it even though it is an application from an unidentified developer.
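
The same check can be done from inside the notebook. This is a minimal sketch using only the Python standard library; the helper name `find_ffmpeg` is my own and is not part of Whisper.

```python
import shutil


def find_ffmpeg():
    """Return the full path of the ffmpeg executable, or None if it is not on the PATH."""
    return shutil.which("ffmpeg")


ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    print("ffmpeg not found: add its folder to the PATH and restart the kernel")
else:
    print("ffmpeg found at", ffmpeg_path)
```

If the kernel was started before the PATH was changed, it may need a restart to see the new value.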

## Import module and language model

In [1]:
import whisper

The model supports multiple languages and can also do language identification.

There are five model sizes, from tiny to large, offering different tradeoffs between speed and accuracy. All sizes except large come in an English-only and a multilingual version; large is multilingual only.
I use the multilingual medium version, which has around 770M parameters and requires about 5 GB of VRAM.
The large version is roughly double that.
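
To make the size tradeoff concrete, here is a small sketch that picks the largest model fitting a given amount of VRAM. The figures are the approximate ones I recall from the Whisper README and may change between releases; the helper `largest_model_that_fits` is hypothetical, not a Whisper function.

```python
# Approximate figures (parameters in millions, required VRAM in GB);
# treat them as rough guidance, not as exact requirements.
MODEL_SIZES = {
    "tiny":   {"params_m": 39,   "vram_gb": 1},
    "base":   {"params_m": 74,   "vram_gb": 1},
    "small":  {"params_m": 244,  "vram_gb": 2},
    "medium": {"params_m": 769,  "vram_gb": 5},
    "large":  {"params_m": 1550, "vram_gb": 10},
}


def largest_model_that_fits(available_vram_gb):
    """Return the name of the largest model whose approximate VRAM need fits, or None."""
    fitting = [
        name for name, spec in MODEL_SIZES.items()
        if spec["vram_gb"] <= available_vram_gb
    ]
    # Dictionaries keep insertion order, so the last fitting entry is the largest one
    return fitting[-1] if fitting else None


print(largest_model_that_fits(6))
```

The returned name can be passed straight to `whisper.load_model()`.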

The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data represents English-language audio and matched English transcripts.

In [2]:
sizeMmodel = whisper.load_model("medium")

## Transcribe with a single line

First I try a short audio clip of news in Italian:

In [9]:
transcriptionIT = sizeMmodel.transcribe("NotizieIT.mp3")



As you can see, the only required parameter is the path of the audio file; the language does not need to be specified.
The conversion did not overload the CPU or the memory: I could easily do something else in the meantime.
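
The dictionary returned by _transcribe()_ holds more than the full text: it also has a "language" key and a "segments" list, where each segment carries "start" and "end" times in seconds together with its "text". The sketch below turns such segments into timestamped lines. The helper names and the sample dictionary are made up for illustration; only the key names come from Whisper.

```python
def format_timestamp(seconds):
    """Format a number of seconds as MM:SS (fractions of a second are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def segments_to_lines(result):
    """Turn the "segments" list of a transcription result into timestamped lines."""
    return [
        f"[{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}] {seg['text'].strip()}"
        for seg in result["segments"]
    ]


# Hypothetical result, shaped like the dictionary returned by transcribe()
sample_result = {
    "language": "en",
    "text": " First sentence. Second sentence.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " First sentence."},
        {"start": 2.5, "end": 65.0, "text": " Second sentence."},
    ],
}

for line in segments_to_lines(sample_result):
    print(line)
```

With the notebook's own variables, the call would be `segments_to_lines(transcriptionIT)`.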

In [8]:
# print the transcription
transcriptionIT['text']

" Notizie flash Una delegazione della Commissione Affari Esteri conclude oggi la sua visita in Cile. I deputati hanno discusso con le autorità del Paese, con i rappresentanti della società civile, del rafforzamento del partenariato tra Unione Europea e America Latina. Domani gli eurodeputati si ericheranno in Brasile per il rilancio delle relazioni col Paese, per discutere della posizione del Brasile sulla guerra russa e della lotta contro l'incambiamento climatico e la deforestazione. La presidente del Parlamento Europeo, Roberta Mezzola, è a Zagabria. Ieri ha incontrato il Primo Ministro croato Andrei Plenkovic e ha partecipato a una conferenza sui 10 anni di adesione della Croazia all'Unione Europea, oltre che a un evento con i giovani. Oggi Mezzola parlerà al Parlamento Croato e domani la presidente sarà invece a Londra per partecipare alla conferenza sulla ripresa dell'Ucraina e per pronunciare a Westminster un discorso programmatico all'evento parlamentare della conferenza. Una d

And here is a Chinese audio file:

In [3]:
transcriptionCH = sizeMmodel.transcribe("../../../../Downloads/5-6-1.mp3")



In [5]:
# print the transcription
transcriptionCH['text']

'5. 根据课文2,做下面的练习。课文2 李美丽和山田佑对话。山田佑,你怎么买了这么多衣服?我听说这儿冬天特别冷。你看,我买了一件羽绒服,两件毛衣,两条厚裤子。是不是因为便宜啊?不是,这儿一到十一月就冷了,只穿一条裤子,一定不行。是吗?对,病了要吃药,看病要花钱,还不舒服,所以还是多穿衣服好。今天出租车司机告诉我的,我觉得对。1. 选择正确答案。1. 山田佑没买什么东西。2. 山田佑为什么买这么多衣服?3. 山田佑为什么觉得多穿衣服好?2. 边听录音,边填空,然后朗读。1. 你怎么买了这么多衣服?2. 我买了一件羽绒服,两件毛衣,两条厚裤子。3. 是不是因为便宜啊?4. 这儿一到十一月就冷了,只穿一条裤子,一定不行。'

It worked perfectly for both languages, even better than commercial products. Processing took roughly twice the total audio duration.
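
To measure the speed rather than estimate it, one option is to time the call and divide by the audio duration (the so-called real-time factor). This is only a sketch: `real_time_factor` and `fake_transcribe` are hypothetical names, and the fake function stands in for the real model so that the example runs on its own.

```python
import time


def real_time_factor(transcribe_fn, audio_path, audio_duration_s):
    """Time transcribe_fn(audio_path); return (result, processing time / audio duration)."""
    start = time.perf_counter()
    result = transcribe_fn(audio_path)
    elapsed = time.perf_counter() - start
    return result, elapsed / audio_duration_s


def fake_transcribe(path):
    """Stand-in for a real model call, used only to make this sketch runnable."""
    time.sleep(0.05)
    return {"text": f"transcription of {path}"}


result, rtf = real_time_factor(fake_transcribe, "example.mp3", audio_duration_s=10.0)
# Values above 1 mean slower than real time, values below 1 mean faster
print(f"real-time factor: {rtf:.3f}")
```

With the notebook's model, `sizeMmodel.transcribe` would be passed instead of `fake_transcribe`, along with the clip's real length in seconds.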

I also tried a 20-minute video in German, and its audio was extracted and transcribed perfectly too.
All this with the medium size, not even the large one.
Keep in mind, though, that hallucinations are possible.
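
One hedged way to spot possible hallucinations is to look at the per-segment statistics that come with each entry of the "segments" list: "avg_logprob", "compression_ratio" and "no_speech_prob". The sketch below flags segments that look suspicious. The thresholds mirror what I understand to be the default values used by _transcribe()_ itself; they are heuristics, not guarantees, and the helper name and sample data are made up.

```python
def flag_suspicious_segments(segments, max_compression_ratio=2.4,
                             min_avg_logprob=-1.0, max_no_speech_prob=0.6):
    """Return (start, text, reasons) for every segment that trips at least one heuristic."""
    flagged = []
    for seg in segments:
        reasons = []
        if seg["compression_ratio"] > max_compression_ratio:
            reasons.append("repetitive text")      # highly compressible text is often a loop
        if seg["avg_logprob"] < min_avg_logprob:
            reasons.append("low confidence")       # the model was unsure about its tokens
        if seg["no_speech_prob"] > max_no_speech_prob:
            reasons.append("probably no speech")   # text produced over silence or noise
        if reasons:
            flagged.append((seg["start"], seg["text"].strip(), reasons))
    return flagged


# Hypothetical segments, shaped like the entries of result["segments"]
sample_segments = [
    {"start": 0.0, "text": " A normal sentence.",
     "avg_logprob": -0.2, "compression_ratio": 1.3, "no_speech_prob": 0.01},
    {"start": 5.0, "text": " thank you thank you thank you",
     "avg_logprob": -0.4, "compression_ratio": 3.1, "no_speech_prob": 0.02},
    {"start": 9.0, "text": " Subtitles by someone",
     "avg_logprob": -1.4, "compression_ratio": 1.1, "no_speech_prob": 0.9},
]

for start, text, reasons in flag_suspicious_segments(sample_segments):
    print(f"{start:6.1f}s  {text!r}: {', '.join(reasons)}")
```

Flagged segments are only candidates for a manual check; a clean segment can still be wrong.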

## Detect language

The function _detect_language()_ examines the audio and outputs a probability for each supported language (around one hundred).

In [34]:
audio = whisper.load_audio("21-6-2.mp3")

RuntimeError: Failed to load audio: ffmpeg version 6.0-tessus  https://evermeet.cx/ffmpeg/  Copyright (c) 2000-2023 the FFmpeg developers
[ffmpeg build configuration and library version banner omitted]
21-6-2.mp3: No such file or directory

The last line of the error is the relevant one: ffmpeg could not find the file. The path passed to _load_audio()_ must point to an existing file (relative paths are resolved from the notebook's working directory).


In [4]:
audio = whisper.pad_or_trim(audio)
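
Conceptually, _pad_or_trim()_ makes the audio exactly 30 seconds long, which is the window the model works on, so language detection only looks at the first 30 seconds. The pure-Python sketch below mirrors that idea on a plain list. It is an illustration, not Whisper's implementation (which works on NumPy arrays or tensors), and the function name is made up.

```python
SAMPLE_RATE = 16_000                       # load_audio() resamples to 16 kHz mono
CHUNK_SECONDS = 30                         # length of the window the model works on
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS    # 480,000 samples


def pad_or_trim_list(samples, length=N_SAMPLES):
    """Cut the list to `length` samples, or pad it with zeros (silence) up to `length`."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)     # hypothetical 5-second clip
long_clip = [0.1] * (45 * SAMPLE_RATE)     # hypothetical 45-second clip
print(len(pad_or_trim_list(short_clip)), len(pad_or_trim_list(long_clip)))
```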


In [5]:
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(sizeMmodel.device)


In [6]:
# detect the spoken language
_, probs = sizeMmodel.detect_language(mel)  # probs is a dictionary

In [9]:
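# key with the highest probability (a plain max(probs) would compare the keys alphabetically)
max(probs, key=probs.get)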

'zh'

The keys are two-letter language codes.

In [11]:
probs['zh']

0.9969410300254822

zh is Chinese: the model predicted it with 99.7% probability.
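
Since the result is a plain dictionary mapping language codes to probabilities, it is easy to list the top candidates instead of just one. A minimal sketch follows; `top_languages` is a hypothetical helper, and the sample values are invented except for the "zh" figure, which is rounded from the output above.

```python
def top_languages(probs, k=3):
    """Return the k most probable (language code, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical probabilities, shaped like the dictionary returned by detect_language()
sample_probs = {"en": 0.0012, "it": 0.0001, "ja": 0.0008, "zh": 0.9969}

for code, prob in top_languages(sample_probs):
    print(f"{code}: {prob:.2%}")
```

Looking at the runner-up languages can be useful when the top probability is low, for example on very short or noisy clips.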

## Translate

The model can also translate into English during transcription.
Just call the _transcribe()_ function with the _task_ parameter set to "translate":


In [4]:
transcriptionEN = sizeMmodel.transcribe("../../../../Downloads/5-6-1.mp3", task="translate") # Default is transcribe



In [5]:
transcriptionEN['text']

" Short text one I am listening to the recording. Today, when we were listening to the Chinese class, the teacher played a recording. I listened to it twice and didn't understand it. I plan to listen to it again after I go home. In the afternoon, when I was listening to the CD, my brother suddenly came in from outside. I took off my earphone and asked him what was wrong. He said that his father bought him a CD from the bookstore. There is a very nice song in it. Ask me if I want to listen. I said I was listening to the recording. My brother looked at me and didn't talk. He closed the door and left. After he left, I listened again. But some sentences still didn't understand. At this time, my brother came in again and said, Brother, can you listen to this song? It's very nice. I said, don't talk to me. I don't understand some sentences. My brother said, it's really nice. Listen to it. You can understand it when you listen to it. 1. According to the short text, choose the right answer. 1.

The translation is also quite good; perhaps only _self-serve meal_ would be better rendered as _buffet_ :)