OPENAI WHISPER MODEL INFERENCE

Loading Libraries and Packages

In [None]:
!pip install transformers accelerate datasets[audio]

In [6]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

In [7]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

Model Loading

In [8]:
model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)



Dataset Loading

In [9]:
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

Inference using the Hugging Face 🤗 pipeline
Transcribing (Audio -> Text)

In [10]:
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(sample)
print(result["text"])

 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Lynyll's pictures are a sort of Upgards and Adam paintings, and Mason's exquisite idylls are as national as a django poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, Next, man!


Using your own audio file, or downloading one from YouTube

In [7]:
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

'en_US.UTF-8'

In [3]:
!pip install pytube

Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0


Downloading audio from YouTube

In [25]:
# Import the pytube library
import pytube
# URL of the YouTube video whose audio we want
video = "https://youtube.com/shorts/-9Vc2zo3mIE?si=8qlBFPC2AaEg_JNL"
data = pytube.YouTube(video)
# Select the audio-only stream and download it (saved as an MP4 file)
audio = data.streams.get_audio_only()
audio.download()

'/content/Elon Musk Advice to Young People - Lex Fridman Podcast.mp4'
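
By default the downloaded file takes the video title as its name, as the output above shows. The sketch below wraps the same pytube calls in a small helper that saves the audio under a chosen name, so the path can be handed straight to the pipeline. The function name `download_audio` and the file name `my_audio.mp4` are hypothetical; only `get_audio_only()` and `download()` come from the cell above.

```python
from pathlib import Path


def download_audio(url, filename="audio.mp4", output_dir="."):
    """Download the audio-only stream of a YouTube video and return its path.

    Hypothetical helper for illustration; not part of pytube.
    """
    # Imported here so the function can be defined without pytube installed
    import pytube

    stream = pytube.YouTube(url).streams.get_audio_only()
    # download() returns the path of the saved file as a string
    saved_path = stream.download(output_path=output_dir, filename=filename)
    return Path(saved_path)


# Example usage (needs network access, so it is left commented out):
# path = download_audio(video, filename="my_audio.mp4")
# result = pipe(str(path))
```

Returning the path avoids having to retype the file name in later cells.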

Using the pipeline to convert an audio file to text

In [15]:
result = pipe("audio3.mp4")
print(result["text"])

 If you wanna run away with me, I know a galaxy and I can take you for a ride I had a premonition that we fell into a rhythm where the music don't stop for life Glitter in the sky, glitter in my eyes, shining just the way you like If you're feeling like you need a little bit of company, you met me at the perfect time You want me, I want you baby, my sugar boo, I'm levitating The Milky Way, we're renegading, yeah, yeah, yeah, yeah, yeah


When the language of the audio is known, it can be passed to the pipeline through generate_kwargs:

In [23]:
result = pipe("audio4.mp4", generate_kwargs={"language": "kannada"})
print(result["text"])

 ಮುತ್ತಿನ ಕತೆಯಾ ಹೇಳಿತು ಇದು ಬಂಬೆ ಆ ಕತೆಯಲ್ಲಿದ್ದ ರಾಜನಂಗೆ ನಿನು ಬಂದೆ ಯೋಗವು ಒಮ್ಮೆ ಬರುವುದು ನಮಗೆ ಯೋಗ್ಯತೆ ಒಂದೆ ಉಳಿ ಬುದು ಕನೆಯೆ ಸೂರಿಯನು ಬಾ ಚಂದ್ರನು ಬಾ ರಾಜನು ಒಪ್ಪಾ ಇರಾಜನು ಒಪ್ಪಾ


In [22]:
result = pipe("audio5.mp4", generate_kwargs={"language": "hindi"})
print(result["text"])

 तू ही तो जन्नत मेरी, तू ही मेरा जुनू, तू ही तो मन्नत मेरी, तू ही रुह का सुकूम, तू ही अख्खियों की ठंडक, तू ही दिल की है दस्तक, और कुछ ना जानू, मैं बस इतना ही जानू, तुझ में रब दिखता है, यारा मैं क्या करूँ


By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":

audio5.mp4 is a Hindi audio file; with the task set to "translate", Whisper returns its English translation:

In [24]:
result = pipe("audio5.mp4", generate_kwargs={"task": "translate"})
print(result["text"])

 You are my heaven, You are my passion You are my prayer, You are my peace You are the coolness of my eyes, You are the key of my heart I don't know anything else, I just know this much You have God in you, what can I do?
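
The language and task options used in the cells above are both passed through `generate_kwargs`. As a minimal sketch, the helper below builds that dictionary in one place; `build_generate_kwargs` is a hypothetical name, not part of transformers, and it assumes the two task values are "transcribe" and "translate".

```python
def build_generate_kwargs(language=None, translate=False):
    """Return a dict for the pipeline's generate_kwargs argument.

    Hypothetical helper for illustration; not part of transformers.
    """
    # "transcribe" keeps the source language, "translate" outputs English
    kwargs = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        # Source language of the audio, e.g. "hindi" or "kannada"
        kwargs["language"] = language
    return kwargs


print(build_generate_kwargs("hindi", translate=True))

# Usage with the pipeline defined earlier (not run here):
# result = pipe("audio5.mp4", generate_kwargs=build_generate_kwargs("hindi", translate=True))
```

When `language` is left out, the key is omitted entirely rather than set to a placeholder, so the model is not given a language hint.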


For sentence-level timestamps, pass the return_timestamps argument:

In [26]:
result = pipe("audio6.mp4", return_timestamps=True)  # Conversation between Lex Fridman and Elon Musk
print(result["chunks"])

[{'timestamp': (0.0, 3.68), 'text': ' If we think about young people in high school, maybe in college,'}, {'timestamp': (3.68, 8.64), 'text': ' what advice would you give to them about if they want to try to do something big in this world,'}, {'timestamp': (8.64, 12.08), 'text': ' they want to really have a big positive impact, what advice would you give them?'}, {'timestamp': (12.08, 16.56), 'text': ' Try to be useful. Do things that are useful to your fellow human beings,'}, {'timestamp': (16.56, 23.6), 'text': " to the world. It's very hard to be useful. Very hard. Are you contributing more than you consume?"}, {'timestamp': (26.28, 31.78), 'text': " Very hard. You know, are you contributing more than you consume? You know, like, try to have a positive net contribution to society. I think that's the thing to aim for. You know,"}, {'timestamp': (31.78, 36.36), 'text': ' not to try to be sort of a leader for the sake of being a leader or whatever. A lot'}, {'timestamp': (36.36, 41.5),

In [28]:
result = pipe("audio5.mp4", return_timestamps=True, generate_kwargs={"language": "hindi", "task": "translate"})
print(result["chunks"])



[{'timestamp': (0.0, 5.0), 'text': ' You are my heaven, You are my passion'}, {'timestamp': (5.0, 11.0), 'text': " You are my prayer, You are my soul's peace"}, {'timestamp': (11.0, 17.0), 'text': ' You are the coolness of my eyes, You are the key of my heart'}, {'timestamp': (17.0, 23.0), 'text': " I don't know anything else, I just know this much"}, {'timestamp': (23.0, 29.0), 'text': ' You see God in you, what can I do?'}]
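
The timestamped chunks printed above are plain dictionaries, so they are easy to turn into subtitles. The following is a rough sketch that converts such a list into SRT text. The helper names `to_srt_time` and `chunks_to_srt` are hypothetical, and the handling of a missing end time is a defensive assumption rather than documented behaviour.

```python
def to_srt_time(seconds):
    """Convert seconds (float) to the SRT timestamp format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Build SRT subtitle text from a list of pipeline-style chunks."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        # Assumption: if the end time is missing (None), reuse the start time
        if end is None:
            end = start
        text = chunk["text"].strip()
        entries.append(f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(entries)


# Two chunks copied from the translated output above
chunks = [
    {"timestamp": (0.0, 5.0), "text": " You are my heaven, You are my passion"},
    {"timestamp": (5.0, 11.0), "text": " You are my prayer, You are my soul's peace"},
]
print(chunks_to_srt(chunks))
```

With a real run, `result["chunks"]` can be passed in place of the hand-copied list, and the returned string written to a `.srt` file.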
