In [1]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio
from tqdm import tqdm

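This is very similar to our Summarization Transformer setup. Here, the processor plays the role our tokenizer played: it prepares the data before the model runs and decodes the model's output afterward. (WhisperProcessor bundles a feature extractor for the audio input and a tokenizer for the text output.)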

In [2]:
checkpoint = 'openai/whisper-tiny'
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
processor = WhisperProcessor.from_pretrained(checkpoint)

Whisper was trained on audio sampled at 16 kHz, so our inputs need to be resampled to match.

In [3]:
# This file isn't in the repository; replace the path with any audio file you like
# (WAV and MP3 are known to work; other formats are untested)
audio, sr = torchaudio.load("audio.wav")
audio = torchaudio.transforms.Resample(sr, 16000)(audio)
sr = 16000

In [4]:
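# audio has shape (channels, samples); squeeze(0) assumes a mono file
# The processor returns audio input features, not token IDs
inputs = processor(audio.squeeze(0), sampling_rate=sr, return_tensors='pt')

In [5]:
transcript = model.generate(inputs=inputs.input_features)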



In [6]:
processor.batch_decode(transcript, skip_special_tokens=True)[0]

' The birch can use lid on this smooth planks. Glue the sheet to the dark blue background. It is easy to tell the depth of a well. These days, the chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemon makes fine punch. The box was thrown beside the pork truck. The hogs were fed chopped corn and garbage. Four hours of steady work faced us.'

Note - this should start with: "The birch canoe slid on the smooth planks"  
(First file from: https://www.voiptroubleshooter.com/open_speech/american.html)  
This error comes from using the Tiny model; the Small model transcribes it correctly. As we will see below, the Tiny model usually performs very well.  

**Whisper only handles 30 seconds of audio at a time, so longer recordings need to be split into chunks (yay).  
At the end, the chunk transcripts are joined together to form the final transcript.**
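
The joining step described above is not shown in the cells that follow, so here is a minimal sketch of one way it could look. The names `join_transcripts`, `transcribe_all`, and `transcribe_chunk` are hypothetical and not part of the notebook; `transcribe_chunk` stands in for whatever function turns one chunk into text.

```python
from typing import Callable, Iterable


def join_transcripts(pieces: Iterable[str]) -> str:
    """Join per-chunk transcripts into one string, skipping empty chunks."""
    cleaned = (piece.strip() for piece in pieces)
    return " ".join(piece for piece in cleaned if piece)


def transcribe_all(chunks: Iterable, transcribe_chunk: Callable[[object], str]) -> str:
    """Apply a (hypothetical) per-chunk transcription function, then join the results."""
    return join_transcripts(transcribe_chunk(chunk) for chunk in chunks)


# Stand-in transcriber so the sketch runs without a model or audio file.
fake_transcribe = lambda chunk: f" text for chunk {chunk} "
print(transcribe_all([0, 1, 2], fake_transcribe))
```

Note that cutting at fixed 30-second boundaries can split a word or sentence across two chunks, and a plain join like this does not repair that.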

In [7]:
%%time
# Set up variables for processing audio segments
segment_length_seconds = 30
target_sample_rate = processor.feature_extractor.sampling_rate  # 16 kHz for Whisper
segment_length_samples = int(segment_length_seconds * target_sample_rate)

# Load the audio file
# This is from one of Dr. Magana's Databases lectures from a few years back. I also didn't push this to the repo since it is
# a 50 minute long audio file, so again, you can replace this if you want to test it out
audio_file_path = "C:\\Users\\kajan\\Desktop\\db1.mp3"
waveform, sample_rate = torchaudio.load(audio_file_path)
waveform = torchaudio.transforms.Resample(sample_rate, target_sample_rate)(waveform)
sample_rate = target_sample_rate

# Split the audio file into 30-second segments along the sample dimension.
# torch.split keeps a shorter final segment when the length is not an exact multiple.
segments = torch.split(waveform, segment_length_samples, dim=1)
num_segments = len(segments)

CPU times: total: 13.5 s
Wall time: 5.69 s


Resampling and splitting take under 6 seconds of wall time, which isn't bad for a 50-minute audio file.

In [8]:
len(segments)

100
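
The count of 100 can be sanity-checked with a little arithmetic. This is a rough sketch assuming a hypothetical recording of exactly 50 minutes at 16 kHz (the real file's exact length is not stated); `count_segments` is an illustrative helper, not part of the notebook. It also shows why floor division can undercount: `torch.split` keeps a shorter final chunk, so the number of chunks follows ceiling division.

```python
def count_segments(num_samples: int, segment_length_samples: int) -> tuple:
    """Return (full segments only, segments including a shorter final one)."""
    full_only = num_samples // segment_length_samples
    # Integer ceiling division: counts the trailing partial segment, if any.
    with_remainder = -(-num_samples // segment_length_samples)
    return full_only, with_remainder


sample_rate = 16000                         # Whisper's expected rate
segment_length_samples = 30 * sample_rate   # 480000 samples per 30 s chunk

exactly_50_min = 50 * 60 * sample_rate
print(count_segments(exactly_50_min, segment_length_samples))   # (100, 100)

# Ten extra seconds adds one short chunk that floor division would miss.
a_bit_longer = exactly_50_min + 10 * sample_rate
print(count_segments(a_bit_longer, segment_length_samples))     # (100, 101)
```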

In [9]:
%%time
# Transcribe segments 10-19 only; i counts from 0 within this slice
for i, segment in enumerate(segments[10:20]):

    # Downmix multi-channel audio to mono; otherwise just drop the channel dimension
    if segment.shape[0] > 1:
        segment = torch.mean(segment, dim=0, keepdim=False)
    else:
        segment = segment.squeeze(0)

    # The processor returns log-mel input features, not token IDs
    input_features = processor(segment, sampling_rate=sample_rate, return_tensors='pt').input_features
    output_ids = model.generate(inputs=input_features)
    transcription = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    print(f"Segment {i}: {transcription}\n")

Segment 0:  So today we're going to hopefully finish off that ERD lecture. You have some time for questions on lab one for the let's see. But I have now created the submission box for lab one. So that's here. So it's in it's under lab one in week one, right? So that's typically where I'm going to put submission boxes right next to where the assignment was originally uploaded.

Segment 1:  So you can go and when you're ready, you can submit your lab one here. I don't think I, let me just edit this for a quick. So when I create a submission box in Canvas, obviously I can give you some directions here, and then the number of points it's worked with the assignment group here is the maps to the syllabus. So it allows, I think, to 40% in this class. So when I set up the grade book,

Segment 2:  which I haven't really done yet. As a structure of it, the labs assignment group is going to have the 40% weight and you're great. I'll actually be updated on the fly and you can kind of come check it

Transcribing 10 of the 100 chunks is again reasonably fast, considering this runs on my laptop CPU. It should be faster still on a GPU with CUDA.  
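
As a rough sketch of the GPU idea, the usual PyTorch pattern is to move both the model and the input features to the same device. `transcribe_on_device` is a hypothetical helper that mirrors the loop body above; it has not been timed here, so any speedup is an expectation rather than a measurement.

```python
import torch

# Use the GPU when PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def transcribe_on_device(segment, model, processor, device, sample_rate=16000):
    """Transcribe one mono 1-D waveform; the model is assumed to already live on `device`."""
    features = processor(segment, sampling_rate=sample_rate, return_tensors="pt").input_features
    # Inputs must be on the same device as the model's weights.
    output_ids = model.generate(inputs=features.to(device))
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]


# Usage, assuming `model`, `processor`, and `segments` from the cells above:
# model = model.to(device)
# text = transcribe_on_device(segments[10].mean(dim=0), model, processor, device)
```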
  
Another note: Whisper comes in several sizes (tiny, base, small, medium, large).  
Smaller models run faster but may be less accurate. Tiny seems to be doing just fine right now; it even picks up terms like ERD (entity relationship diagram) from a databases lecture. If we notice it performing poorly in the future, we can switch to a bigger model.
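
Switching sizes only requires changing the checkpoint string. A small sketch, assuming the Hugging Face Hub naming pattern `openai/whisper-<size>` that the `openai/whisper-tiny` checkpoint above follows; `checkpoint_for` is a hypothetical helper, not something the notebook defines.

```python
# Map each size name to its Hub checkpoint ID.
WHISPER_SIZES = ("tiny", "base", "small", "medium", "large")
WHISPER_CHECKPOINTS = {size: f"openai/whisper-{size}" for size in WHISPER_SIZES}


def checkpoint_for(size: str) -> str:
    """Return the checkpoint ID for a size name, or raise on an unknown size."""
    if size not in WHISPER_CHECKPOINTS:
        raise ValueError(f"Unknown Whisper size {size!r}; expected one of {WHISPER_SIZES}")
    return WHISPER_CHECKPOINTS[size]


print(checkpoint_for("small"))

# Usage with the loading cell from the top of the notebook:
# checkpoint = checkpoint_for("small")
# model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
# processor = WhisperProcessor.from_pretrained(checkpoint)
```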