# Audio transcription
Whisper is an open-source speech recognition model [developed by OpenAI](https://openai.com/research/whisper).

Here we will see how to use it to convert audio into text. It is a good example of the potential of large neural networks trained on huge amounts of audio and text.

The model can be installed directly from GitHub:

In [1]:
!pip install git+https://github.com/openai/whisper.git -q
    

Note: also install ffmpeg (you just need to copy the executable file) and add it to the PATH.
The model needs it to convert audio and video formats (for example, you can pass a video file as input and it will extract the audio track).

On Mac: copy it to /usr/local/bin and verify that it works with the command "which ffmpeg".
You might also need to open it once, so that macOS adds an exception and records that you trust it even though it is an application from an unidentified developer.
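
The same check can be done from inside the notebook with `shutil.which` from the Python standard library, which looks a program up on the PATH. This is only a minimal sketch: the helper name `find_ffmpeg` is my own and is not part of Whisper.

```python
import shutil

def find_ffmpeg():
    """Return the full path of the ffmpeg executable, or None if it is not on the PATH."""
    return shutil.which("ffmpeg")

ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    print("ffmpeg not found: add its folder to the PATH before loading audio")
else:
    print("ffmpeg found at", ffmpeg_path)
```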

## Import module and language model

In [1]:
import whisper

The model supports multiple languages and can also do language identification.

There are five model sizes, from tiny to large, offering different speed and accuracy tradeoffs. The four smaller sizes are available in an English-only and a multilingual version; the large model is multilingual only.
I use the multilingual medium version, which has around 770M parameters and requires about 5GB of VRAM.
The large version is about double that.
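
To make the tradeoff concrete, here is a small sketch that picks the largest model fitting in a given amount of VRAM. The figures are the approximate ones listed in the Whisper README (parameters in millions, VRAM in GB); the helper `largest_model_that_fits` is a hypothetical name of mine, not a Whisper function.

```python
# Approximate figures from the Whisper README: (parameters in millions, required VRAM in GB)
MODEL_SIZES = {
    "tiny": (39, 1),
    "base": (74, 1),
    "small": (244, 2),
    "medium": (769, 5),
    "large": (1550, 10),
}

def largest_model_that_fits(vram_gb):
    """Return the name of the biggest model whose VRAM need is <= vram_gb, or None."""
    fitting = [name for name, (_, need) in MODEL_SIZES.items() if need <= vram_gb]
    # dicts keep insertion order (smallest to largest), so the last fitting entry is the biggest
    return fitting[-1] if fitting else None

print(largest_model_that_fits(6))  # medium
```

The returned name can be passed straight to `whisper.load_model()`.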

The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data represents English-language audio and matched English transcripts.

In [2]:
sizeMmodel = whisper.load_model("medium")

## Transcribe with a single line

First I try a short audio clip of news in Italian:

In [9]:
transcriptionIT = sizeMmodel.transcribe("NotizieIT.mp3")



As you can see, the only required parameter is the path of the audio file; the language does not need to be specified.
The conversion did not overload the CPU or the memory: I could easily do something else in the meantime.
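
The language is optional, but it can still be passed explicitly, together with other keyword arguments such as `fp16`. The sketch below only builds the dictionary of options; `transcribe_options` is a hypothetical helper of mine, and the call that would use it is left as a comment because it needs the loaded model and the audio file.

```python
def transcribe_options(language=None, on_gpu=False):
    """Build keyword arguments for model.transcribe()."""
    # Half precision (fp16) is only useful on a GPU; on a CPU Whisper falls back to fp32
    options = {"fp16": on_gpu}
    if language is not None:
        # e.g. "it": when the language is given, automatic detection is skipped
        options["language"] = language
    return options

opts = transcribe_options(language="it")
print(opts)

# With the model loaded above, the options would be used like this:
# transcriptionIT = sizeMmodel.transcribe("NotizieIT.mp3", **opts)
```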

In [8]:
# print the transcription
transcriptionIT['text']

" Notizie flash Una delegazione della Commissione Affari Esteri conclude oggi la sua visita in Cile. I deputati hanno discusso con le autorità del Paese, con i rappresentanti della società civile, del rafforzamento del partenariato tra Unione Europea e America Latina. Domani gli eurodeputati si ericheranno in Brasile per il rilancio delle relazioni col Paese, per discutere della posizione del Brasile sulla guerra russa e della lotta contro l'incambiamento climatico e la deforestazione. La presidente del Parlamento Europeo, Roberta Mezzola, è a Zagabria. Ieri ha incontrato il Primo Ministro croato Andrei Plenkovic e ha partecipato a una conferenza sui 10 anni di adesione della Croazia all'Unione Europea, oltre che a un evento con i giovani. Oggi Mezzola parlerà al Parlamento Croato e domani la presidente sarà invece a Londra per partecipare alla conferenza sulla ripresa dell'Ucraina e per pronunciare a Westminster un discorso programmatico all'evento parlamentare della conferenza. Una d

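Besides `'text'`, the dictionary returned by `transcribe()` also holds the detected `'language'` and a list of `'segments'`, each with its `start` and `end` time in seconds. Here is a small sketch of how they could be printed with timestamps. The sample dictionary is made up for illustration (only its shape mirrors a real result, the timings are invented), and `to_timestamp` and `format_segments` are my own helpers.

```python
def to_timestamp(seconds):
    """Format a number of seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def format_segments(result):
    """Return one line per segment, prefixed by its start and end time."""
    lines = []
    for seg in result["segments"]:
        span = f"[{to_timestamp(seg['start'])} - {to_timestamp(seg['end'])}]"
        lines.append(f"{span} {seg['text'].strip()}")
    return lines

# Made-up sample with the same shape as a real result
sample_result = {
    "text": " Notizie flash Una delegazione",
    "language": "it",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Notizie flash"},
        {"start": 2.5, "end": 65.0, "text": " Una delegazione"},
    ],
}

for line in format_segments(sample_result):
    print(line)
```

With a real result one would simply call `format_segments(transcriptionIT)`.
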
And here is an audio clip in Chinese:

In [12]:
transcriptionCH = sizeMmodel.transcribe("CH6-5-1.mp3")



In [14]:
# print the transcription
transcriptionCH['text']

'对话一你想吃什么就吃什么。我饿了。我们找个饭馆吃饭吧。好。前面有个饭馆,我们去吃自助餐,怎么样?好啊。你喜欢吃中餐还是西餐?什么都行,不过我还是喜欢吃中餐。那里有包子,饺子,炒饭和各种炒菜。你想吃什么就吃什么,走吧!太好了!一,根据对话内容,选择正确答案。一,男的和女的现在在哪儿?二,谁饿了?三,他们准备去吃什么?'

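It worked very well for both languages, in my experience even better than commercial products, although a few small slips are visible in the Italian text (for example _si ericheranno_ and _l'incambiamento climatico_). The transcription took more or less twice the total duration of the audio.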
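
To measure the speed instead of estimating it, one can time the call and divide by the duration of the audio. This is a sketch under my own naming (`timed` and `real_time_factor` are hypothetical helpers); the lines that need the model are commented out, and they rely on the fact that `whisper.load_audio()` returns samples at 16 kHz.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def real_time_factor(elapsed_seconds, audio_seconds):
    """Processing time divided by audio duration: 2.0 means twice as long as the audio."""
    return elapsed_seconds / audio_seconds

# With the model loaded above:
# result, elapsed = timed(sizeMmodel.transcribe, "NotizieIT.mp3")
# duration = len(whisper.load_audio("NotizieIT.mp3")) / 16000
# print(real_time_factor(elapsed, duration))

print(real_time_factor(120.0, 60.0))  # 2.0
```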

I also tried a 20-minute video in German, and the model extracted and transcribed its audio perfectly too.
All this was with the medium size, not even the large one.
Hallucinations are possible, though.
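
One way to spot possible hallucinations is to look at the statistics that come with each segment of the result: `avg_logprob`, `compression_ratio` and `no_speech_prob`. The sketch below flags segments that cross some thresholds; I borrowed the values from the defaults of `transcribe()`, but treating them as hallucination hints is my own assumption, and `suspicious_segments` and the sample data are made up.

```python
def suspicious_segments(result, max_compression=2.4, min_logprob=-1.0, max_no_speech=0.6):
    """Return (start, text, reasons) for every segment that looks unreliable."""
    flagged = []
    for seg in result["segments"]:
        reasons = []
        if seg["compression_ratio"] > max_compression:
            reasons.append("repetitive text")  # highly compressible text is often a loop
        if seg["avg_logprob"] < min_logprob:
            reasons.append("low confidence")
        if seg["no_speech_prob"] > max_no_speech:
            reasons.append("probably no speech")
        if reasons:
            flagged.append((seg["start"], seg["text"].strip(), reasons))
    return flagged

# Made-up segments, only to show the shape of the data
sample_result = {"segments": [
    {"start": 0.0, "text": " Notizie flash",
     "avg_logprob": -0.2, "compression_ratio": 1.3, "no_speech_prob": 0.01},
    {"start": 30.0, "text": " grazie grazie grazie grazie",
     "avg_logprob": -0.4, "compression_ratio": 3.1, "no_speech_prob": 0.02},
    {"start": 60.0, "text": " mmm",
     "avg_logprob": -1.4, "compression_ratio": 1.0, "no_speech_prob": 0.9},
]}

for start, text, reasons in suspicious_segments(sample_result):
    print(start, text, reasons)
```

A flag is only a hint that the segment deserves to be listened to again, not a proof that the text is wrong.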

## Detect language

The function _detect_language()_ examines the audio and outputs the probability of each supported language (around one hundred).

In [3]:
audio = whisper.load_audio("CH6-5-1.mp3")

In [4]:
audio = whisper.pad_or_trim(audio)


In [5]:
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(sizeMmodel.device)
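
`pad_or_trim()` is needed because the model works on fixed windows of 30 seconds: at Whisper's 16 kHz sample rate that is 480,000 samples, so shorter audio is padded with zeros and longer audio is cut. This also means that the language detection only looks at the first 30 seconds. The pure-Python sketch below mimics that behaviour on a plain list just to show the idea; it is not the library's implementation, and `pad_or_trim_list` is a hypothetical name.

```python
SAMPLE_RATE = 16000                       # samples per second used by Whisper
CHUNK_SECONDS = 30                        # length of the window the model expects
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480000

def pad_or_trim_list(samples, length=N_SAMPLES):
    """Cut the list to `length` items, or append zeros until it has `length` items."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))

five_seconds = [0.1] * (SAMPLE_RATE * 5)
forty_five_seconds = [0.1] * (SAMPLE_RATE * 45)

print(len(pad_or_trim_list(five_seconds)))        # 480000
print(len(pad_or_trim_list(forty_five_seconds)))  # 480000
```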


In [6]:
# detect the spoken language
_, probs = sizeMmodel.detect_language(mel)  # probs is a dictionary

In [9]:
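# language code with the highest probability
# (a plain max(probs) would compare the keys alphabetically)
max(probs, key=probs.get)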

'zh'

The keys are short codes (typically two letters) identifying each language.

In [11]:
probs['zh']

0.9969410300254822

"zh" stands for Chinese. The model predicted it with 99.7% probability.
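
It can also be useful to see the runners-up, not only the winner. A minimal sketch that sorts the dictionary by probability; `top_languages` is my own helper, and in the sample dictionary only the value for `zh` comes from the run above, the others are invented.

```python
def top_languages(probs, k=3):
    """Return the k (code, probability) pairs with the highest probability."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

# Only the "zh" value is real; the other entries are made up for illustration
sample_probs = {"en": 0.001, "zh": 0.9969410300254822, "ja": 0.0008, "ko": 0.0005}

for code, prob in top_languages(sample_probs):
    print(f"{code}: {prob:.2%}")
```

With the real output one would call `top_languages(probs)`; the dictionary `whisper.tokenizer.LANGUAGES` maps each code to the language name.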

## Translate

The model can also translate into English while transcribing.
Just call the _transcribe()_ function with the task set to translate:


In [16]:
transcriptionEN = sizeMmodel.transcribe("CH6-5-1.mp3", task="translate") # Default is transcribe



In [17]:
transcriptionEN['text']

" Conversation 1 You can eat whatever you want. I'm hungry. Let's find a restaurant to eat. OK! There is a restaurant in front. How about we go to eat a self-serve meal? OK! Do you like Chinese or Western food? Anything is fine. But I still like Chinese food. There are buns, dumplings, fried rice and all kinds of stir-fried dishes. You can eat whatever you want. Let's go! Great! 1 According to the conversation, choose the right answer. 1 Where are the men and women now? 2 Who is hungry? 3 What are they going to eat?"

The translation is also quite good: maybe only _self-serve meal_ would be better as _buffet_ :)