# **Easy Speech-to-Text with Python**

Let’s say you are a podcast creator and you want to transcribe your podcast, so that it can be translated into multiple languages or read by people who are deaf or hard of hearing. Additionally, let’s say you want to improve the discovery of your podcast through search engine optimization. Transcribing your podcast enables search engines to index the text, making the podcast easier to find.

The purpose of this Guided Project is to introduce you to automatic speech recognition (ASR) systems, explain how the signal processing works, outline the transformer architecture behind ASR, and show examples of how to recognize, transcribe, and translate your audio and video files using a publicly available ASR tool.

**Table of Contents**

* Packages to install
* Libraries to Import
* Background
* Loading Models
* Loading File
* Mel Scale and Mel Spectrogram
* Language Detection
* Decoding and Transcribing
* Translation
* Transcription From Youtube


### Packages to install

#### Whisper

In [None]:
!pip install git+https://github.com/openai/whisper.git

#### Librosa

In [None]:
!pip install git+https://github.com/librosa/librosa

### Importing libraries

In [None]:
import torch
import whisper
import pytube
import librosa
import matplotlib.pyplot as plt
import numpy as np
import IPython.display as ipd

### Background (optional)
Whisper is an open-source ASR (automatic speech recognition) system released by OpenAI. The Whisper model was proposed in the paper *Robust Speech Recognition via Large-Scale Weak Supervision* by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.

Whisper was trained on 680,000 hours of speech data collected from the internet. About one-third of the audio data is non-English. The diversity of the dataset helps Whisper handle different accents and background noise.

Whisper is a multitask model based on an encoder-decoder transformer architecture. The input audio is split into 30-second chunks and converted into a log-mel spectrogram, which is fed to the encoder. The decoder then predicts the corresponding text: it can transcribe the speech in the spoken language or translate it into English.

According to the authors, keeping the dataset diverse makes Whisper more robust across varied datasets than many supervised models, although it does not beat specialized models on every benchmark.

### Loading the models
Whisper was released with five model sizes (tiny, base, small, medium, and large), four of which also have English-only versions, offering trade-offs between speed and accuracy.
You can use the tiny model for lightweight applications, the large model when accuracy matters most, and the base, small, or medium models for everything in between.
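
As a small illustration of the naming scheme, the sketch below builds the string passed to `whisper.load_model()`. It assumes that the English-only checkpoints use a `.en` suffix (for example `base.en`) and that `large` has no English-only variant. `pick_model_name` is a hypothetical helper written for this example; it is not part of Whisper.

```python
# Hypothetical helper (not part of the whisper package): builds a model name
# for whisper.load_model(), assuming English-only checkpoints end in ".en".
SIZES = ("tiny", "base", "small", "medium", "large")
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")  # "large" is multilingual only


def pick_model_name(size: str, english_only: bool = False) -> str:
    if size not in SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    return size  # fall back to the multilingual checkpoint


print(pick_model_name("base", english_only=True))
print(pick_model_name("large", english_only=True))
```

The returned string could then be passed on, for example `whisper.load_model(pick_model_name("base", english_only=True))`.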

In [None]:
#Load the tiny size model:
#model_t = whisper.load_model("tiny")

In [None]:
#Load the base size model
#model = whisper.load_model("base")

In [None]:
#Load the medium size model
model_m = whisper.load_model("medium")

### Loading the file
We start by loading an .mp4 file that was previously uploaded to IBM Cloud Object Storage.
To do so, we first define the file path:

In [None]:
file_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0EPMEN/20220627_140242.mp4'

Load the audio with the `load_audio()` function, which decodes the file with ffmpeg, converts it to mono, and resamples it to 16,000 Hz. If you are using your own file, you can replace `file_path` with the name of your file, e.g., 'podcast.mp3'.


In [None]:
audio_35 = whisper.load_audio(file_path)
audio_35

The output above is the waveform: the amplitude of the sound wave at each sample, stored as a NumPy array.

Now, we can find the sampling interval, that is, the time between consecutive measurements. The total duration of the audio sample is about 35 seconds.


In [None]:
T=35

We check how many samples are in our audio file by reading the array's `shape` attribute.


In [None]:
n_samples=audio_35.shape[0]
n_samples

There are 559445 samples in the 35-second audio.

Now, we can find the time between samples by dividing the total time by the number of samples:


In [None]:
delta=T/n_samples
delta

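The time between samples is about 6.256e-05 seconds. Now, we can get the sampling frequency as its reciprocal. Because `load_audio()` resamples to 16,000 Hz and the 35-second duration is a rounded value, the result is an approximation close to 16,000 Hz: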


In [None]:
Fs=1/delta
Fs
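
The same arithmetic can be run in the other direction. Assuming the 16,000 Hz rate that `load_audio()` resamples to, the sketch below recovers the duration from the sample count reported above. The variable names are illustrative.

```python
SAMPLE_RATE = 16_000   # Hz; assumed rate after whisper.load_audio() resampling
n_samples = 559_445    # sample count reported above

sampling_interval = 1 / SAMPLE_RATE      # seconds between consecutive samples
duration = n_samples / SAMPLE_RATE       # total length in seconds

# Timestamps (in seconds) of the first three samples
first_times = [i * sampling_interval for i in range(3)]

print(f"interval: {sampling_interval} s")
print(f"duration: {duration:.3f} s")
print(first_times)
```

Under this assumption the clip is slightly shorter than 35 seconds, which explains why dividing a rounded duration by the sample count gives a frequency a little below 16,000 Hz.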

Now, we can get the time of each sample: 


In [None]:
time=np.linspace(0,(n_samples-1)*delta,n_samples)
time

Finally, we can plot the amplitude with respect to time: 


In [None]:
plt.title('Signal')
plt.plot(time,audio_35 )
plt.ylabel('amplitude')
plt.xlabel('seconds')
plt.show()

Above is the waveform of the signal. Next, we can use the `pad_or_trim()` function to ensure the sample has the right length for inference. In our case the file is about 35 seconds long, so it is trimmed to 30 seconds, the chunk length that is fed into the encoder.


In [None]:
audio = whisper.pad_or_trim(audio_35)
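
To make this step concrete, here is a minimal NumPy sketch of the pad-or-trim idea, assuming a 16,000 Hz sample rate and a 30-second target (480,000 samples). `pad_or_trim_sketch` is a hypothetical stand-in written for illustration; it is not Whisper's implementation.

```python
import numpy as np

SAMPLE_RATE = 16_000                  # assumed sample rate in Hz
N_SAMPLES = 30 * SAMPLE_RATE          # 30 seconds -> 480,000 samples


def pad_or_trim_sketch(signal: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Return a copy of a 1-D signal that is exactly `length` samples long."""
    if signal.shape[0] >= length:
        return signal[:length].copy()          # too long: keep the first `length` samples
    padding = length - signal.shape[0]
    return np.pad(signal, (0, padding))        # too short: append zeros (silence)


long_clip = np.ones(35 * SAMPLE_RATE, dtype=np.float32)   # 35 s of dummy audio
short_clip = np.ones(10 * SAMPLE_RATE, dtype=np.float32)  # 10 s of dummy audio

print(pad_or_trim_sketch(long_clip).shape)
print(pad_or_trim_sketch(short_clip).shape)
```

Both dummy clips come out with the same length, which is what lets the model treat every input as a fixed-size chunk.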

We can plot the amplitude of the trimmed/padded signal over time:


In [None]:
n_samples=audio.shape[0]
time=np.linspace(0,(n_samples-1)*delta,n_samples)

In [None]:
plt.plot(time,audio)

plt.ylabel('amplitude')
plt.xlabel('seconds')
plt.title('Signal')
plt.show()

### Mel scale and mel spectrogram

Studies have shown that humans do not perceive frequencies on a linear scale. We are better at detecting differences between lower frequencies than between higher frequencies. For example, we can easily tell the difference between 500 and 1,000 Hz, but we can hardly tell the difference between 10,000 and 10,500 Hz, even though the distance within each pair is the same.
In 1937, Stevens, Volkmann, and Newman proposed a unit of pitch, the mel, such that equal distances in pitch sound equally distant to the listener. Converting frequencies to the **mel scale** therefore makes sounds that are equally far apart on the scale also sound equally far apart to humans.
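
The effect can be checked numerically. The sketch below uses one widely used Hz-to-mel formula, `2595 * log10(1 + f / 700)`. This is an assumption made for illustration: several mel formulas exist, and this is not necessarily the one used inside Whisper or librosa.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Convert Hz to mels with the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


low_gap = hz_to_mel(1000) - hz_to_mel(500)        # 500 Hz step at low frequencies
high_gap = hz_to_mel(10_500) - hz_to_mel(10_000)  # same 500 Hz step at high frequencies

print(f"500 -> 1000 Hz:    {low_gap:.1f} mels")
print(f"10000 -> 10500 Hz: {high_gap:.1f} mels")
```

The low-frequency step spans several times more mels than the high-frequency one, which matches the perceptual claim above.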

A **mel spectrogram** is a spectrogram where the frequencies are converted to the mel scale.

The *librosa* library has a wrapper for mel spectrograms in its API that can be used directly. However, here we will use Whisper's own `log_mel_spectrogram()` function, which produces the input format that the model expects.



[Understanding the Mel Spectrogram](https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0EPMEN1227-2022-01-01) and [How to Create & Understand Mel-Spectrograms](https://importchris.medium.com/how-to-create-understand-mel-spectrograms-ff7634991056?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0EPMEN1227-2022-01-01) articles have more information about sound interpretation and mel spectrograms.

We can now compute a mel spectrogram by applying the `log_mel_spectrogram()` function to our audio. It maps the frequency axis onto the mel scale and takes the logarithm of the magnitudes:


In [None]:
mel = whisper.log_mel_spectrogram(audio).to(model_m.device)

The result is a tensor of log-mel values, with one row per mel band and one column per time frame. Now, we draw two subplots: one is the regular representation of sound amplitude over time, and the other is our mel spectrogram:


In [None]:
fig, (ax1, ax2) = plt.subplots(2)
fig.tight_layout(pad=5.0)
# Top: waveform of the trimmed/padded audio
ax1.plot(time,audio)
ax1.set_title('Signal')
ax1.set_xlabel('Time, seconds')
ax1.set_ylabel('Amplitude')
# Bottom: magnitude of the log-mel values; move the tensor to the CPU before converting to NumPy
ax2.imshow(np.abs(mel.cpu().numpy()),interpolation='nearest', aspect='auto')
ax2.set_title('Mel Spectrogram of a Signal')
ax2.set_xlabel('Time, frames')
ax2.set_ylabel('Mel Scale')
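
The size of the spectrogram plotted above follows from the framing parameters. The sketch below assumes a 16,000 Hz sample rate, a hop of 160 samples between frames, and 80 mel bands for the medium model. Treat these constants as assumptions and check them against your installed Whisper version.

```python
SAMPLE_RATE = 16_000   # Hz (assumed)
HOP_LENGTH = 160       # samples between consecutive frames (assumed)
N_MELS = 80            # mel bands (assumed for the medium model)
CHUNK_SECONDS = 30     # length of the padded/trimmed audio

n_samples = CHUNK_SECONDS * SAMPLE_RATE
n_frames = n_samples // HOP_LENGTH                # columns of the spectrogram
frame_step_ms = 1000 * HOP_LENGTH / SAMPLE_RATE   # time covered by one hop

print((N_MELS, n_frames))
print(frame_step_ms)
```

Under these assumptions, each column of the image corresponds to a 10-millisecond step, so the frame index can be converted to seconds by dividing by 100.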

### Language Detection
In this example, we will listen to our audio file and detect the spoken language.


Whisper's `load_audio()` resamples audio to 16,000 Hz, which means that there are 16,000 samples for every second of audio. We pass this sample rate (sr) to the `ipd.Audio()` function so that the audio plays back at the correct speed:


In [None]:
sr=16000
ipd.Audio(audio, rate=sr)

We can find the probability of each language by using `detect_language()` method:


In [None]:
_, probs = model_m.detect_language(mel)

We can also print the ten most probable language codes and their probabilities:


In [None]:
print(sorted(probs.items(), key=lambda item: item[1], reverse=True)[:10])

Finally, we can detect the spoken language by selecting the key with the highest probability value:


In [None]:
print(f"Detected language: {max(probs, key=probs.get)}")

In this case, the detected language is English, with 99.97% probability.
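
Assuming `probs` is a plain dictionary that maps language codes to probabilities, ranking it takes only a sort. The dictionary below is made-up placeholder data for illustration, not real model output, and `top_languages` is a hypothetical helper.

```python
def top_languages(probs: dict, k: int = 3) -> list:
    """Return the k (language, probability) pairs with the highest probability."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Placeholder values for illustration only (not produced by Whisper)
example_probs = {"de": 0.01, "en": 0.90, "fr": 0.05, "es": 0.04}

ranked = top_languages(example_probs, k=2)
best_language, best_prob = ranked[0]
print(ranked)
print(f"Detected language: {best_language} ({best_prob:.0%})")
```

Sorting first and slicing afterwards matters: slicing the unsorted items would return whichever languages happen to come first in the dictionary.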


### Decoding and Transcription

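The difference between decoding and transcription is that the `decode()` function processes a single audio segment of up to 30 seconds, whereas the `transcribe()` method processes the entire audio file. Below, we decode our 30-second segment using the `whisper.decode()` function. Setting `fp16 = False` runs the decoding in 32-bit floating point, which is the safe choice when running on a CPU.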

In [None]:
options = whisper.DecodingOptions(fp16 = False)
result = whisper.decode(model_m, mel, options)

We print the recognized text using the `text` attribute:


In [None]:
print(result.text)

The output above is the text recognized in the 30-second audio segment. In contrast, the `transcribe()` method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.


In [None]:
transcription = model_m.transcribe(file_path, fp16 = False)["text"]

In [None]:
transcription
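
Besides `"text"`, the dictionary returned by `transcribe()` also carries a `"segments"` list whose entries include `start` and `end` times (in seconds) and the segment `text`. As a sketch, the hypothetical helpers below format such segments as timestamped lines, which can be handy for captions. The sample segments are hand-written placeholders, not real output.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rest_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rest_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments: list) -> str:
    """Turn a list of {'start', 'end', 'text'} dicts into timestamped lines."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} --> {end}] {seg['text'].strip()}")
    return "\n".join(lines)


# Hand-written placeholder segments shaped like entries of result["segments"]
sample_segments = [
    {"start": 0.0, "end": 4.5, "text": " First sentence."},
    {"start": 4.5, "end": 61.25, "text": " Second sentence."},
]
print(format_segments(sample_segments))
```

With the model loaded in this notebook, you could pass `model_m.transcribe(file_path, fp16=False)["segments"]` instead of the placeholders.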

### Translation

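In this example, we ask for French output by setting `language='fr'`. Note that in Whisper the `language` argument is documented as the language spoken in the audio, and the built-in translation task (`task='translate'`) translates speech into English only. Forcing a different language code on English audio often yields text in that language, but this is not a documented feature and the quality can vary. The list of supported languages is available [here](https://github.com/openai/whisper).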


In [None]:
translation = model_m.transcribe(file_path, language='fr', fp16 = False)["text"]

In [None]:
translation

### Transcription from YouTube

Below, we will pick a YouTube video and access it using the `pytube` library. This one is a 30-second motivational speech.


In [None]:
video_url = "https://www.youtube.com/watch?v=E9lAeMz1DaM"
data = pytube.YouTube(video_url)

We will select the audio-only stream with `streams.get_audio_only()` and download it as an MP4 file with `download()`:


In [None]:
speech = data.streams.get_audio_only()
audio_file=speech.download()
print("audio file path:",audio_file)

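Finally, we will transcribe the audio with `language='ja'` to request Japanese output. This relies on forcing the language code; Whisper's documented translation task (`task='translate'`) produces English only, so results can vary.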


In [None]:
output = model_m.transcribe(audio_file,fp16 = False,language='ja')["text"]

In [None]:
output

**Author**

Alireza Hosseinzadeh