## Problem Statement
OpenAI is a leading player in the Artificial Intelligence industry and has made numerous AI models, such as GPT and CLIP, available to the public.

OpenAI has open-sourced the Whisper models, which have achieved near human-level performance and accuracy in English speech recognition.

This project walks through converting audio into text using OpenAI's Whisper and the HuggingFace Transformers framework.

Upon completion of this project, we hope to be able to transcribe both English and non-English audio into text.
## OpenAI’s Whisper
Whisper models were developed to study the capability of speech-processing systems on speech recognition and translation tasks. They can transcribe speech audio into text.

The models were trained on 680,000 hours of labeled audio data, which the authors report to be one of the largest datasets ever created for supervised speech recognition. The authors also studied the effect of dataset size by training a series of medium-sized models on subsampled versions of the data corresponding to 0.5%, 1%, 2%, 4%, and 8% of the full dataset.
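To give a sense of scale for those subsampled datasets, the short calculation below converts each percentage into hours of audio. It is only a sketch: it assumes the 680,000-hour figure quoted above, uses exact fractions to avoid floating-point rounding, and the function and variable names are illustrative.

```python
from fractions import Fraction

TOTAL_HOURS = 680_000  # size of the full training set, as quoted above
PERCENTAGES = ["0.5", "1", "2", "4", "8"]  # subsample sizes, in percent


def subsample_hours(total_hours, percentages):
    """Return {percentage: hours of audio} for each subsample size."""
    return {p: int(total_hours * Fraction(p) / 100) for p in percentages}


for pct, hours in subsample_hours(TOTAL_HOURS, PERCENTAGES).items():
    print(f"{pct}% of the data = {hours:,} hours")
```

Even the smallest subsample, at 0.5%, still corresponds to 3,400 hours of labeled audio.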


## Installing and importing the modules for audio transcription and translation

In [1]:
!pip install git+https://github.com/openai/whisper.git 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-0wqfc2ch
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-0wqfc2ch
  Resolved https://github.com/openai/whisper.git to commit 7858aa9c08d98f75575035ecd6481f462d66ca27
  Preparing metadata (setup.py) ... done
Collecting transformers>=4.19.0
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting ffmpeg-python==0.2.0
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)

The nvidia-smi command shows information about the GPU allocated to the session. Here is mine:

In [3]:
!nvidia-smi

Fri Feb  3 15:30:20 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P0    31W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
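If the memory figures are needed inside the notebook rather than just on screen, they can be pulled out of the text with a regular expression. The sketch below is an assumption-laden illustration: `parse_gpu_memory` is a hypothetical helper, and the pattern relies on the `used MiB / total MiB` layout shown in the output above, which may differ across driver versions.

```python
import re


def parse_gpu_memory(smi_line):
    """Return (used_mib, total_mib) from a line such as '... 0MiB / 15360MiB ...'.

    Returns None when the line contains no memory figures.
    """
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The memory row copied from the nvidia-smi output above
line = "| N/A   58C    P0    31W /  70W |      0MiB / 15360MiB |      0%      Default |"
used, total = parse_gpu_memory(line)
print(f"{total - used} MiB free of {total} MiB")
```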

Now that everything is installed, we can import the modules and load the model. We will use the large model, which has 1550M parameters and requires about 10 GB of VRAM. Processing time depends heavily on the hardware: transcription is much faster on a GPU than on a CPU.

In [4]:
# Import the libraries 
import whisper
import torch
import os

# Initialize the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model 
whisper_model = whisper.load_model("large", device=device)

100%|██████████████████████████████████████| 2.87G/2.87G [00:23<00:00, 130MiB/s]


The load_model() function receives the device selected on the line before it, so the model is placed on the GPU when one is available. Without an explicit device, PyTorch creates new tensors on the CPU by default.
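The large model is not the only option. The Whisper README lists five sizes with approximate VRAM needs, and a smaller one may be the better choice on a modest GPU. The sketch below picks the largest size that fits a given amount of memory; `pick_model_size` is a hypothetical helper, and the VRAM figures are the README's rough estimates rather than guarantees.

```python
# Approximate VRAM requirements in GB, largest model first.
# Figures are rough estimates taken from the Whisper README.
MODEL_VRAM_GB = [
    ("large", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model_size(available_vram_gb):
    """Return the largest model name whose estimated VRAM need fits."""
    for name, required_gb in MODEL_VRAM_GB:
        if available_vram_gb >= required_gb:
            return name
    # Nothing fits comfortably: fall back to the smallest model
    return "tiny"


print(pick_model_size(15))  # a 15 GB card such as the Tesla T4 above
```

The returned name can then be passed to `whisper.load_model()` in place of the hard-coded `"large"`.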

Now is the time to start extracting audio files…

## Audio Transcription
We first install the pytube library with pip so that we can download the audio from YouTube.

In [5]:
# Install the module
!pip install pytube

# Import the module
from pytube import YouTube

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytube
  Downloading pytube-12.1.2-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-12.1.2


Then, we can implement the helper function as follows:

In [6]:
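def video_to_audio(video_URL, destination, final_filename):

  # Get the video
  video = YouTube(video_URL)

  # Select the first audio-only stream
  audio = video.streams.filter(only_audio=True).first()

  # Save to destination
  output = audio.download(output_path = destination)

  # Rename the file inside destination. Note: this only changes the name;
  # the audio is not re-encoded to MP3 (ffmpeg, which Whisper uses to load
  # audio, detects the real format from the file contents)
  new_file = os.path.join(destination, final_filename + '.mp3')
  os.rename(output, new_file)

  # Return the path of the saved audio file
  return new_file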

The function takes three parameters:

* `video_URL`: the full URL of the YouTube video.
* `destination`: the folder in which to save the final audio.
* `final_filename`: the name to give to the final audio file, without extension.

Finally, we can use the function to download the video and convert it into audio.

## English transcription
The video used here is a 30-second motivational speech on YouTube from Motivation Quickie. Only the first 17 seconds contain actual speech; the rest is noise.

In [7]:
# Video to Audio
video_URL = 'https://www.youtube.com/watch?v=E9lAeMz1DaM'
destination = "."
final_filename = "motivational_speech"
video_to_audio(video_URL, destination, final_filename)

# Audio to text
audio_file = "motivational_speech.mp3"
result = whisper_model.transcribe(audio_file)

# Print the final result
print(result["text"])

 I don't know what that dream is that you have. I don't care how disappointing it might have been as you've been working toward that dream. But that dream that you're holding in your mind, that it's possible.


* `video_URL` is the link to the motivational speech.
* `destination` is my current folder, written as `.`
* `motivational_speech` will be the final name of the audio file.
* `whisper_model.transcribe(audio_file)` applies the model to the audio file to generate the transcription.
* The `transcribe()` function processes the audio with a sliding 30-second window and makes autoregressive sequence-to-sequence predictions on each window.
* Finally, the `print()` statement displays the result.
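To make the 30-second windowing more concrete, the sketch below computes the sample ranges that consecutive windows would cover. It is a simplification under stated assumptions: Whisper resamples audio to 16 kHz, but the real `transcribe()` advances each window based on the timestamps it predicts rather than by a fixed 30-second stride. `window_bounds` is a hypothetical helper and not part of the library.

```python
SAMPLE_RATE = 16_000   # Whisper works on 16 kHz audio
CHUNK_SECONDS = 30     # length of one window, in seconds


def window_bounds(n_samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) sample indices of fixed-stride windows.

    The last window is shorter when the audio length is not a
    multiple of the window length.
    """
    chunk = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk, n_samples))
        for start in range(0, n_samples, chunk)
    ]


# A 70-second clip is covered by two full windows and one 10-second remainder
for start, end in window_bounds(70 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:5.1f}s -> {end / SAMPLE_RATE:5.1f}s")
```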

Here is the link to the video: https://www.youtube.com/watch?v=E9lAeMz1DaM

## Non-English transcription
In addition to English, Whisper can also deal with non-English languages. Let’s have a look at Alassane Dramane Ouattara’s interview on YouTube.

As in the previous approach, we get the video, convert it to audio, and transcribe the content.

In [8]:
URL = "https://www.youtube.com/watch?v=D8ztTzHHqiE"
destination = "."
final_filename = "discours_ADO"
video_to_audio(URL, destination, final_filename)

# Run the test
audio_file = "discours_ADO.mp3"
result_ADO = whisper_model.transcribe(audio_file)

# Show the result
print(result_ADO["text"])

 — Le franc CFA, vous l'avez toujours défendu, Beck et Ong. Est-ce que vous continuez à le faire ou est-ce que vous pensez qu'il faut peut-être changer les choses sans rentrer trop dans les détails techniques? — M. Perelman, je vous dirais tout simplement qu'il y a vraiment du n'importe quoi dans ce débat. Moi, je vais pas manquer de modestie. Mais j'ai été directeur des études de la Banque centrale, j'ai été vice-gouverneur, j'ai été gouverneur de la Banque centrale. Donc je peux vous dire que je sais de quoi je parle. Le franc CFA, c'est notre monnaie. C'est la monnaie des pays membres. Et nous l'avons acceptée et nous l'avons développée, nous l'avons modifiée. Et j'étais là quand la réforme a eu lieu dans les années 73-74. Alors donc tout ce débat est un non-sens. Maintenant, c'est notre monnaie. J'ai quand même eu à superviser la gestion monétaire et financière de plus de 120 pays dans le monde quand j'étais au Fonds monétaire international. Mais je suis bien placé pour dire que si

Above is the final result, and the result is mindblowing 🤯
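Besides `"text"`, the dictionary returned by `transcribe()` also carries the detected `"language"` and a list of `"segments"` with start and end times. The sketch below formats those segments as timestamped lines. `format_segments` is a hypothetical helper, and the `sample` dictionary is made-up data that only mimics the shape of a real result.

```python
def format_segments(result):
    """Return one '[start -> end] text' line per transcribed segment."""
    lines = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        lines.append(f"[{seg['start']:.1f}s -> {seg['end']:.1f}s] {text}")
    return lines


# Made-up data in the same shape as a transcribe() result
sample = {
    "language": "fr",
    "text": " Bonjour. Merci.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Bonjour."},
        {"start": 1.5, "end": 3.0, "text": " Merci."},
    ],
}

print("Detected language:", sample["language"])
for line in format_segments(sample):
    print(line)
```

With a real result, `format_segments(result_ADO)` would be called the same way.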

## Non-English transcription into English
In addition to speech recognition, spoken language identification, and voice activity detection, Whisper can also translate speech from the other languages it supports into English.

In [9]:
URL = "https://www.youtube.com/watch?v=D8ztTzHHqiE"
final_filename = "discours_ADO"
video_to_audio(URL, destination, final_filename)

# Run the test
audio_file = "discours_ADO.mp3"
french_to_english = whisper_model.transcribe(audio_file, task = 'translate')

# Show the result
print(french_to_english["text"])

 France CFA, you have always defended it, Beck et Ongle, do you continue to do so or do you think that perhaps things need to be changed without going into too much technical details? Mr Perelman, I will simply tell you that there is really nonsense in this debate. I don't want to lack modesty, but I was director of the central bank's studies, I was vice-governor, I was governor of the central bank, so I can tell you that I know what I am talking about. France CFA is our currency, it is the currency of the member states, and we have accepted it, we have developed it, we have modified it. I was there when the reform took place in the years 1973-74, so all this debate is nonsense. Now, it is our currency. I have supervised the financial and monetary management of more than 120 countries in the world. When I was at the International Monetary Fund, I was well placed to say that if this currency poses a problem, we will listen to the other heads of state and make the decisions, but this cur

`task='translate'` tells `transcribe()` to translate the speech into English instead of transcribing it in the original language.
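The cell above downloads the same video a second time even though `discours_ADO.mp3` already exists. One possible way to skip the repeated download is sketched below; `ensure_audio` is a hypothetical helper that simply checks for the file before calling whatever download function it is given.

```python
import os


def ensure_audio(audio_path, download):
    """Call download() only when audio_path does not exist yet.

    `download` is any zero-argument callable that creates the file.
    Returns audio_path so the call can be passed straight to transcribe().
    """
    if not os.path.exists(audio_path):
        download()
    return audio_path
```

In this notebook it could be used as `ensure_audio("discours_ADO.mp3", lambda: video_to_audio(URL, destination, final_filename))`.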

## Conclusion
In conclusion, the "Speech-to-Text and Translation with OpenAI's Whisper" project demonstrated how to use OpenAI's Whisper models and the HuggingFace Transformers framework to transcribe audio into text and to translate non-English speech into English. The Whisper models are open-sourced by OpenAI and are considered to have achieved near human-level performance and accuracy in English speech recognition. This project aimed to provide a step-by-step guide that makes transcription and translation of audio simple and accessible. By the end of the project, we have a clear understanding of how to perform speech-to-text and speech translation using OpenAI's Whisper models.