**Defining the objective**


The aim of this project is to build a speech-to-text application with Whisper, an automatic speech recognition (ASR) system.

Whisper was trained on 680,000 hours of multilingual and multitask supervised speech data collected from the internet. This large and diverse dataset makes Whisper more robust to different accents and to background noise. The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.

You can learn more about Whisper at https://openai.com/blog/whisper/

**Installing Required Libraries**

In [None]:
!pip install --upgrade torch
!pip install pytube
!pip install git+https://github.com/openai/whisper.git
!pip install git+https://github.com/librosa/librosa

In [None]:
#Importing the necessary libraries
import torch
import whisper
import pytube
import librosa
import matplotlib.pyplot as plt
import numpy as np
import IPython.display as ipd

**Please note:** These libraries must be installed before they can be imported. If any import fails, run the installation cell above first.

**Loading the Model**

There are five model sizes to choose from, and four of them also have English-only versions. They offer different trade-offs between speed and accuracy. The model sizes are:
- tiny: 39M Parameters, English-only model (tiny.en), Multilingual model (tiny), Required VRAM (1GB), Relative speed (32x)
- base: 74M Parameters, English-only model (base.en), Multilingual model (base), Required VRAM (1GB), Relative speed (16x)
- small: 244M Parameters, English-only model (small.en), Multilingual model (small), Required VRAM (2GB), Relative speed (6x)
- medium: 769M Parameters, English-only model (medium.en), Multilingual model (medium), Required VRAM (5GB), Relative speed (2x)
- large: 1550M Parameters, English-only model (N/A), Multilingual model (large), Required VRAM (10GB), Relative speed (1x)

The tiny model is best suited to lightweight applications, the large model to cases where accuracy matters most, and the base, small, or medium models to everything in between. For this project, we will use the medium model.

In [None]:
model_m = whisper.load_model('medium')
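
The size list above can be turned into a small lookup, which may help when picking a model for a given GPU. This is only a sketch: `MODEL_VRAM_GB` and `largest_model_for` are hypothetical names, and the VRAM figures are the approximate values listed above, not measurements.

```python
# Approximate VRAM needs in GB, copied from the list above (ordered small -> large).
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def largest_model_for(vram_gb):
    """Return the largest model name whose listed VRAM need fits in vram_gb."""
    fitting = [name for name, need in MODEL_VRAM_GB.items() if need <= vram_gb]
    if not fitting:
        raise ValueError("No listed Whisper model fits in the given VRAM")
    # Dicts keep insertion order, so the last fitting entry is the largest model.
    return fitting[-1]

print(largest_model_for(6))
```

The returned name could then be passed to whisper.load_model().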

**Loading the file**

We start by loading an audio file.

In [None]:
file_path = '/kaggle/input/voice-recording/Record (online-voice-recorder.com).mp3'

I recorded a custom voice clip of myself, stored at file_path above, to use for this project. Next, we load the audio file in file_path using the load_audio() function.

In [None]:
#Loading
audio_13 = whisper.load_audio(file_path)
audio_13

Next, we find the sampling interval, which is the time between consecutive samples. The recording is about 13 seconds long, so we use 13 seconds as the approximate total duration.

In [None]:
T = 13

In [None]:
#Checking the number of samples in our audio file
n_samples =  audio_13.shape[0]
n_samples

There are 200448 samples in the roughly 13-second recording. Now we find the time between samples.

In [None]:
#Time between samples
delta = T/n_samples
delta

The time between samples is 6.485472541507024e-05 seconds. Next, we find the sampling frequency.

In [None]:
#Sampling frequency
Fs = 1/delta
Fs

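The estimated sampling frequency is 15419.076923076924 Hz. This is only an estimate because the duration was rounded to 13 seconds: whisper.load_audio() resamples audio to 16,000 Hz by default, so the exact duration is 200448 / 16000, or about 12.53 seconds. Next, we find the time of each sample.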
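
Before moving on, the same quantities can be cross-checked from the sample rate instead of a rounded duration. The sketch below assumes the audio was loaded by whisper.load_audio() at its default rate of 16,000 Hz; `SAMPLE_RATE`, `exact_duration`, `exact_delta`, and `estimated_fs` are illustrative names.

```python
SAMPLE_RATE = 16000    # assumed: default rate used by whisper.load_audio()
n_samples = 200448     # sample count reported above

exact_duration = n_samples / SAMPLE_RATE   # seconds
exact_delta = 1 / SAMPLE_RATE              # seconds between samples

# Estimate obtained when the duration is rounded to 13 seconds
estimated_fs = n_samples / 13

print(exact_duration, exact_delta, estimated_fs)
```

The gap between the estimated and the assumed rate comes entirely from rounding the duration.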

In [None]:
#Time of each sample
time = np.linspace(0,(n_samples-1) * delta,n_samples)
time
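
The np.linspace() call above is equivalent to multiplying each sample index by the spacing between samples. A small sketch with made-up toy values (`n` and `dt` are illustrative, not taken from the recording):

```python
import numpy as np

n = 5        # toy number of samples (illustrative)
dt = 0.25    # toy spacing between samples in seconds (illustrative)

t_linspace = np.linspace(0, (n - 1) * dt, n)
t_index = np.arange(n) * dt    # sample index times spacing

print(t_linspace)
print(np.allclose(t_linspace, t_index))
```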

Now we plot the amplitude with respect to time:

In [None]:
plt.figure(figsize=(20,10))
plt.title('Signal')
plt.plot(time,audio_13)
plt.ylabel('amplitude')
plt.xlabel('seconds')
plt.show()

Above is the waveform of the signal. Now we can use the pad_or_trim() function, which pads or trims the audio to the 30-second length the model expects, so that the sample is in the right form for inference.

In [None]:
audio = whisper.pad_or_trim(audio_13)
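
To make the padding step concrete, here is a rough NumPy equivalent for 1-D arrays. It is a sketch of the idea, not Whisper's own implementation: `pad_or_trim_sketch` is a hypothetical helper, and the 16,000 Hz rate and 30-second window are assumed values.

```python
import numpy as np

SAMPLE_RATE = 16000             # assumed Whisper sample rate
N_SAMPLES = 30 * SAMPLE_RATE    # 30-second window = 480000 samples

def pad_or_trim_sketch(x, length=N_SAMPLES):
    """Zero-pad at the end, or cut, a 1-D array so it has exactly `length` samples."""
    if x.shape[0] > length:
        return x[:length]
    return np.pad(x, (0, length - x.shape[0]))

short = np.ones(200448, dtype=np.float32)   # same length as the recording above
fixed = pad_or_trim_sketch(short)
print(fixed.shape, fixed[200448:].sum())
```

For a recording shorter than 30 seconds, everything after the original samples is silence (zeros).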

Next, we plot the amplitude with respect to time with trimmed/padded audio.

In [None]:
#Number of samples in our trimmed/padded audio
n_samples =  audio.shape[-1]
#Time of each sample
time = np.linspace(0,(n_samples-1)*delta,n_samples)

In [None]:
plt.figure(figsize=(20,10))
plt.title('Signal')
plt.plot(time,audio)
plt.ylabel('amplitude')
plt.xlabel('seconds')
plt.show()

Next, we compute a mel spectrogram by applying the log_mel_spectrogram() function to our audio. It converts the frequency axis into the mel scale:

In [None]:
mel = whisper.log_mel_spectrogram(audio).to(model_m.device)
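
To give a feel for what the mel scale does, the sketch below uses one common conversion formula (the HTK-style formula). Whisper's own filterbank may use a different variant, so treat this as an illustration only. `hz_to_mel` is a hypothetical helper, and the frame count assumes 16,000 Hz audio with a hop of 160 samples.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel conversion (an illustrative choice, not necessarily Whisper's)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Equal steps in Hz become smaller steps on the mel scale as frequency grows
for f in (0, 1000, 2000, 4000, 8000):
    print(f, round(hz_to_mel(f), 1))

# Number of spectrogram frames for a 30-second window (assumed rate and hop)
n_frames = (30 * 16000) // 160
print(n_frames)
```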

The resulting mel tensor holds log-Mel values, with mel frequency bins along one axis and time frames along the other. Now we draw two subplots: one is the regular representation of sound amplitude over time, and the other is our mel spectrogram:

In [None]:
fig, (ax1, ax2) = plt.subplots(2)
fig.tight_layout(pad=5.0)
ax1.plot(time,audio)
ax1.set_title('Signal')
ax1.set_xlabel('Time, seconds')
ax1.set_ylabel('Amplitude')
#Move the tensor to the CPU before converting to NumPy; np.abs(x) equals (x*x)**(1/2)
ax2.imshow(np.abs(mel.cpu().numpy()),interpolation='nearest', aspect='auto')
ax2.set_title('Mel Spectrogram of a Signal')
ax2.set_xlabel('Time, frames')
ax2.set_ylabel('Mel Scale')

Next, we can move on to language detection.

**Language detection**

We will listen to our audio file and then detect the spoken language. whisper.load_audio() resamples audio to 16,000 Hz, so the sample rate (sr) is 16000, which means that for every second there are 16,000 samples. We can use the ipd.Audio() function to listen to our audio file.

In [None]:
sr=16000
ipd.Audio(audio, rate=sr)

Next, we can obtain the probability of each language by using the detect_language() method.

In [None]:
#detect_language() returns the language tokens and a dict of language probabilities
_, probs = model_m.detect_language(mel)

In [None]:
probs

From the output above, we can see the probability of each language being the spoken language. English has the highest probability, 98.5%, so it is taken as the spoken language in the audio file.
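
Picking the most likely language can also be done in code rather than by eye. The sketch below assumes the probabilities are available as a dict mapping language code to probability. `probs_example` is a hypothetical stand-in: only the 'en' value comes from the text above, and the other entries are made up for illustration.

```python
# Hypothetical probabilities; only the 'en' value is taken from the text above.
probs_example = {"en": 0.985, "es": 0.004, "de": 0.003}

# The key with the highest probability is the detected language
top_language = max(probs_example, key=probs_example.get)
print(f"Detected language: {top_language} ({probs_example[top_language]:.1%})")
```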

Next, we can move on to transcription.

**Transcription**

In [None]:
transcription = model_m.transcribe(file_path, fp16 = False)['text']

In [None]:
transcription

From the transcription above, we can see that our speech-to-text system works very well: it transcribed our audio perfectly.
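
Besides 'text', the dict returned by transcribe() also contains a 'segments' list with start and end times for each piece of text. The sketch below formats such segments as timestamped lines. `format_segments` is a hypothetical helper, and `fake_result` is made-up data shaped like a transcribe() result, not output from the recording.

```python
def format_segments(result):
    """Return one line per segment as '[start -> end] text'."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text'].strip()}")
    return lines

# Hypothetical result shaped like the dict returned by transcribe()
fake_result = {
    "text": " Hello world. This is a test.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello world."},
        {"start": 1.5, "end": 3.0, "text": " This is a test."},
    ],
}

for line in format_segments(fake_result):
    print(line)
```

With a real model, the same helper could be called on the full result of model_m.transcribe(file_path, fp16 = False).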

As an addition, we can translate our audio file to another language.

**Translation**

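We will try to get a Spanish version of our audio by setting language='es'. Note that Whisper's built-in translation task (task='translate') only translates into English. Passing language='es' tells the decoder to treat the speech as Spanish, which on English audio often yields Spanish text, but this is a side effect rather than a documented feature, so the output should be checked.

In [None]:
translation = model_m.transcribe(file_path, language = 'es', fp16 = False)['text']

In [None]:
translation

Other language codes can be tried in the same way, with the same caveat.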
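
Whisper also has a dedicated translation task, selected with task='translate', which produces English text from speech in another language. A minimal sketch, where `translate_to_english` is a hypothetical wrapper and the model and path are passed in by the caller:

```python
def translate_to_english(model, path):
    """Run Whisper's translate task, which outputs English text."""
    result = model.transcribe(path, task="translate", fp16=False)
    return result["text"]

# Usage with the objects defined earlier (not run here):
# english_text = translate_to_english(model_m, file_path)
```

Since the recording used in this project is already in English, this task is mainly useful for recordings in other languages.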

**Conclusion**

We have built a speech-to-text application using Whisper, an automatic speech recognition (ASR) system. Speech-to-text has a variety of use cases, including transcribing audio recordings, dictation, voice commands, online search, and enhanced customer service.