#### Links to Models
##### OpenAI Whisper: https://github.com/openai/whisper
##### Google SR: https://github.com/Uberi/speech_recognition#readme
###### The SpeechRecognition library wraps several recognizers; this notebook uses its Google recognizer (`recognize_google`).
##### Facebook Wav2Vec: https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/README.md

In [1]:
## Declaring some constants
OUTPUT_VIDEO = "/content/Drive/MyDrive/DownloadedVideos/downloadedVideo.mkv"
OUTPUT_AUDIO = "/content/Drive/MyDrive/OutcomeAudios/extractedAudio.wav"
AUDIO_FOLDER = "/content/Drive/MyDrive/OutcomeAudios/"
VIDEO_FOLDER = "/content/Drive/MyDrive/DownloadedVideos/"

In [2]:
## Installing dependencies for downloading the video and extracting the audio
!apt install ffmpeg -q
!pip install yt_dlp -q

Reading package lists...
Building dependency tree...
Reading state information...
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.


In [3]:
import time

In [4]:
from google.colab import drive
drive.mount('/content/Drive')

Drive already mounted at /content/Drive; to attempt to forcibly remount, call drive.mount("/content/Drive", force_remount=True).


In [5]:
import os
def checkAndMakeDir(folder):
  ## exist_ok=True makes this a no-op when the folder already exists
  os.makedirs(folder, exist_ok=True)

checkAndMakeDir(AUDIO_FOLDER)
checkAndMakeDir(VIDEO_FOLDER)

In [6]:
## Download the video from the YouTube URL and store it in the
## DownloadedVideos folder. The 'best[ext=mp4]' format selector asks
## for the best single-file MP4 stream; yt-dlp saves it under the
## name in OUTPUT_VIDEO, so the file carries an .mkv extension even
## though the requested format is MP4.
from yt_dlp import YoutubeDL

## Parameters: URL of the YouTube video as a String
def downloadVideoYT(URL):
  ytdl_format_options = {
    'outtmpl': OUTPUT_VIDEO,
    'format': 'best[ext=mp4]'
  }
  with YoutubeDL(ytdl_format_options) as ydl:
    ydl.download([URL])

In [7]:
start_time = time.time()
videoURL = "https://www.youtube.com/watch?v=n8zSEZX8S5w&ab_channel=FelL."
downloadVideoYT(videoURL)
print("Download Completed")
print("Time to Download : %s seconds" % (time.time() - start_time))

[youtube] Extracting URL: https://www.youtube.com/watch?v=n8zSEZX8S5w&ab_channel=FelL.
[youtube] n8zSEZX8S5w: Downloading webpage
[youtube] n8zSEZX8S5w: Downloading ios player API JSON
[youtube] n8zSEZX8S5w: Downloading android player API JSON
[youtube] n8zSEZX8S5w: Downloading m3u8 information
[info] n8zSEZX8S5w: Downloading 1 format(s): 22
[download] Destination: /content/Drive/MyDrive/DownloadedVideos/downloadedVideo.mkv
[download] 100% of    8.91MiB in 00:00:01 at 7.36MiB/s   
Download Completed
Time to Download : 2.323329448699951 seconds
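
The `start_time = time.time()` / `print(...)` pattern recurs in several cells of this notebook. As a minimal sketch, it could be wrapped in a small context manager. The name `timed` and the `timings` dictionary are hypothetical and not part of the original notebook, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("%s : %s seconds" % (label, elapsed))

# Hypothetical usage: a short sleep stands in for a download or a transcription.
timings = {}
with timed("Sleep", timings):
    time.sleep(0.01)
```

Collecting the values in a dictionary would make it possible to build a summary of all steps from the measurements themselves instead of copying them by hand.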


In [8]:
## We use ffmpeg to extract the audio from the video,
## by giving the input video location as the parameter for -i (input).
## The global -y flag overwrites output files without asking.

import subprocess

def extractAudio():
  subprocess.call(["ffmpeg",
                   "-y",
                   "-i",
                   OUTPUT_VIDEO,
                   OUTPUT_AUDIO])
                  #stdout=subprocess.DEVNULL,
                  #stderr=subprocess.STDOUT

start_time = time.time()
extractAudio()
print("Time to Extract Audio : %s seconds" % (time.time() - start_time))

Time to Extract Audio : 0.4411797523498535 seconds
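
The call above lets ffmpeg pick the output sample rate and channel count. Since the Wav2Vec cell later loads the audio at 16000 Hz, one option is to resample during extraction. The sketch below is an assumption about how that might look, not the notebook's method: `buildFfmpegCommand` and `extractAudioResampled` are hypothetical names, while `-vn` (no video), `-ac` (audio channels) and `-ar` (audio sample rate) are standard ffmpeg options.

```python
import subprocess

def buildFfmpegCommand(videoPath, audioPath, sampleRate=16000):
    """Return the ffmpeg argument list for extracting mono audio at sampleRate Hz."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file without asking
        "-i", videoPath,          # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sampleRate),   # resample the audio to sampleRate Hz
        audioPath,
    ]

def extractAudioResampled(videoPath, audioPath, sampleRate=16000):
    # check=True raises CalledProcessError if ffmpeg exits with a non-zero status
    subprocess.run(buildFfmpegCommand(videoPath, audioPath, sampleRate), check=True)

# Only build and show the command here; the file names are placeholders.
command = buildFfmpegCommand("in.mkv", "out.wav")
print(" ".join(command))
```

If the audio were extracted this way, the frame rate reported further down would be 16000 instead of 44100.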


### Converting the Speech to Text using OpenAI Whisper

In [9]:
!pip install tiktoken -q
!pip install cohere -q
!pip install openai -q
!pip install git+https://github.com/openai/whisper.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [10]:
import whisper
transcriptModel = whisper.load_model('base')

## The BASE model is used here for now; a larger Whisper model can be used for higher accuracy and better results.

In [11]:
start_time = time.time()
## Parameters: Path to the audio file as a String
result = transcriptModel.transcribe(OUTPUT_AUDIO)
print("OpenAI Whisper : %s seconds" % (time.time() - start_time))
print(result['text'])

OpenAI Whisper : 9.558514833450317 seconds
 Hello there! My name is Phil. I have been an online English tutor for almost three years now. I've been helping students from beginner to advanced level improve their English skills. I've tried teaching kids and adults. I used CPR. Look! Rewards! And many more! I would love to help you improve and make progress in learning. From phonics, grammar, pronunciation and up to conversational English. I've already earned my Tassel and Tia Felt certificate, so you're in good hands. Learn with me and let's make English a fun and easy language. See ya!


### Converting the Speech to Text using Google Recognition

In [12]:
!pip install pytube -q
!pip install moviepy -q
!pip install SpeechRecognition -q
!pip install pydub -q

In [13]:
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

r = sr.Recognizer()

## Parameters: Path to the audio file as a String
def transcribe_audio(path):
    with sr.AudioFile(path) as source:
        audio_listened = r.record(source)
        text = r.recognize_google(audio_listened)
    return text

start_time = time.time()
## Keep the transcript so it can be inspected or compared later
googleText = transcribe_audio(OUTPUT_AUDIO)
print("Google SR : %s seconds" % (time.time() - start_time))

Google SR : 12.385565996170044 seconds


### Converting the Speech to Text using Facebook's Wav2Vec Model

In [14]:
from scipy.io import wavfile

In [15]:
## Parameters: Path to the audio file as a String
## wavfile.read returns the sample rate and the sample array
frameRate, samples = wavfile.read(OUTPUT_AUDIO)
print("Frame Rate: " + str(frameRate))
print("Total Time: " + str(len(samples) / frameRate))

Frame Rate: 44100
Total Time: 48.204625850340136
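
The duration above is the number of samples divided by the frame rate, so 48.20 s at 44100 Hz corresponds to 2,125,824 samples per channel. As a hedged alternative sketch, the standard-library `wave` module can read the same two numbers from the file header without loading the samples. `wavDuration` is a hypothetical helper, and the example writes a one-second synthetic file so that it runs without the notebook's audio. Note that `wave` only handles uncompressed PCM WAV files.

```python
import os
import tempfile
import wave

def wavDuration(path):
    """Return (frame rate in Hz, duration in seconds) from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        frameRate = wav.getframerate()
        frameCount = wav.getnframes()
    return frameRate, frameCount / frameRate

# Write one second of 16-bit mono silence at 8000 Hz as a stand-in file.
demoPath = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demoPath, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000)

frameRate, totalTime = wavDuration(demoPath)
print("Frame Rate: " + str(frameRate))
print("Total Time: " + str(totalTime))
```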


In [16]:
!pip install transformers -q

In [17]:
import soundfile as sf
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

In [18]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
fbModel = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
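start_time = time.time()
## Resample to 16 kHz on load, the rate this Wav2Vec checkpoint was trained on
inputAudio,_ = librosa.load(OUTPUT_AUDIO, sr=16000)
inputValues = tokenizer(inputAudio, return_tensors = "pt").input_values
## Inference only, so gradient tracking is not needed
with torch.no_grad():
  logits = fbModel(inputValues).logits
## Most likely token per frame, then decode the token ids to text
predictedIDs = torch.argmax(logits, dim = -1)
outcomeText = tokenizer.batch_decode(predictedIDs)[0]
print("Facebook Wav2Vec : %s seconds" % (time.time() - start_time))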

Facebook Wav2Vec : 50.68428087234497 seconds
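
The `torch.argmax` step followed by `batch_decode` in the cell above corresponds to greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, and drop the blank token. The plain-Python sketch below illustrates the idea only. The toy vocabulary, the blank id, the `|` word separator and the name `greedyCtcDecode` are assumptions for illustration and are not read from the actual checkpoint.

```python
from itertools import groupby

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank token.
VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
BLANK_ID = 0

def greedyCtcDecode(frameIds, vocab=VOCAB, blankId=BLANK_ID):
    """Collapse per-frame token ids into text using the greedy CTC rule."""
    # 1. Merge runs of identical consecutive ids (one token spans many frames).
    collapsed = [tokenId for tokenId, _ in groupby(frameIds)]
    # 2. Drop the blank token, which only separates repeated characters.
    tokens = [vocab[tokenId] for tokenId in collapsed if tokenId != blankId]
    # 3. Turn the word separator into spaces.
    return "".join(tokens).replace("|", " ").strip()

# Per-frame argmax ids; the blank between the two L runs keeps the double L.
frameIds = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1]
print(greedyCtcDecode(frameIds))
```

Because each frame is decoded independently and no language model is applied, misspellings of the kind seen in the Wav2Vec output are a plausible outcome of this decoding style.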


In [20]:
outcomeText

"NOW LO THERE MY NAM IS FELL I HAVE BEEN AN ANLIN ENGLISH TUTOR FOR ALMOST THREE YEARS NOW I'VE BEEN HELPING STUDENTS FROM BIGINNER TO ADVANCE LEVELL IMPROVE THEIR ENGLISH SKILLS I'VE TRIED TEA CHING KITS AND ADALS I USED TE PY ARE LOOK REWARDS AND MANY MARE I WOULD LOVE TO HELP YOU IMPROVE AND MAKE PROGRESS IN LEARNING FROM FENIX GRAMMAR PRONUNCATION AND OUPHT TO CONVERSATIONAL ENGLISH I'VE ALREADY EARNED MY TUSSLE AND TEA FELT CERTIFICATE SO YOU ARE IN GOOD HANDS LEARN WITH ME AND LET'S MAKE ENGLISH A FUN AND EASY LANGUAGE SEE A"

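The Wav2Vec text differs visibly from the Whisper text (for example "ANLIN" versus "online"). One common way to quantify such differences is the word error rate: the word-level edit distance divided by the number of reference words. The sketch below is an illustration under assumptions. The notebook has no ground-truth transcript, so treating the Whisper snippet as the reference is an arbitrary choice, and `normalizeWords` and `wordErrorRate` are hypothetical helpers.

```python
import re

def normalizeWords(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    return re.sub(r"[^a-z' ]", " ", text.lower()).split()

def wordErrorRate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = normalizeWords(reference)
    hyp = normalizeWords(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance over words, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, refWord in enumerate(ref, start=1):
        current = [i]
        for j, hypWord in enumerate(hyp, start=1):
            cost = 0 if refWord == hypWord else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(ref)

# One sentence copied from each transcript printed above.
whisperSnippet = "I have been an online English tutor for almost three years now."
wav2vecSnippet = "I HAVE BEEN AN ANLIN ENGLISH TUTOR FOR ALMOST THREE YEARS NOW"
snippetWer = wordErrorRate(whisperSnippet, wav2vecSnippet)
print(snippetWer)
```

For this sentence, one of the twelve reference words differs, so the result is 1/12. A hand-checked transcript would be a more reliable reference than another model's output.
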
### Time of execution of different steps and models
#### Video Length: 48 sec

| Step    | Execution Time (sec) |
| :---        |    :----:   |
| Downloading the video      | 2.32       |
| Extracting the audio   | 0.44        |
| OpenAI Whisper   | 9.56        |
| Google SR   | 12.39        |
| Wav2Vec   | 50.68        |
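
Dividing each transcription time by the audio length gives a real-time factor, where a value below 1 means the model ran faster than the audio plays. The sketch below uses the rounded values from the table and the 48.2 s duration reported earlier. `realTimeFactor` is a hypothetical helper, and the figures come from a single run, so they are indicative only.

```python
AUDIO_SECONDS = 48.2  # "Total Time" reported earlier, rounded to one decimal

# Transcription times copied from the table above (seconds).
executionSeconds = {
    "OpenAI Whisper": 9.56,
    "Google SR": 12.39,
    "Wav2Vec": 50.68,
}

def realTimeFactor(processingSeconds, audioSeconds=AUDIO_SECONDS):
    """Processing time divided by audio duration; below 1 is faster than real time."""
    return processingSeconds / audioSeconds

realTimeFactors = {
    name: round(realTimeFactor(seconds), 2)
    for name, seconds in executionSeconds.items()
}
for name, factor in realTimeFactors.items():
    print("%s : %.2f" % (name, factor))
```

In this run, Whisper and Google SR come out at roughly 0.20 and 0.26, while Wav2Vec is about 1.05, slightly slower than real time.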