## Whisper
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. However, since the model's weights have been released, it can also be fine-tuned on custom data. The model can convert speech to text in near real time and can also translate speech from other languages into English. Models like this are useful for a wide range of applications, such as meeting transcription, call-center customer assistance, lecture transcription for students, and subtitling and captioning.
[Model Card](https://huggingface.co/openai/whisper-large-v3)

### Install the Dependencies

In [None]:
!pip install --upgrade pip
!pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]


Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.0
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-qn5kq187
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-qn5kq187
  Resolved https://github.com/huggingface/transformers.git to commit 4b3eb19fa7f359d25f62ca9108479f71de912ebc
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ...

In [None]:
from huggingface_hub import notebook_login

notebook_login()



### Load the model

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


# Use the GPU if one is available. The model also runs on CPU, but it will take longer.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# The model to use. I'm choosing the largest model: with only 1.5B parameters it is small enough to fit on Colab's T4 GPU.
model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=False, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# The pipeline that runs the model
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    # Maximum number of tokens to generate; more than 128 may need a bigger GPU than Colab's T4
    max_new_tokens=128,
    # Length in seconds of each audio chunk
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Testing the Model's Transcription Capabilities
To test the model, I will transcribe the audio from Google's 2024 keynote demo of Project Astra: https://www.youtube.com/watch?v=nXVvvRhiGjI

In [None]:
audio_1 = '/content/Project Astra_ Our vision for the future of AI assistants.mp3'

In [None]:
import time

start_time = time.time()

result_1 = pipe(audio_1, return_timestamps = True)

end_time = time.time()
execution_time_1 = end_time - start_time


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
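
The warning above concerns the choice between transcription and translation. As a minimal sketch, these options can be passed per call through the pipeline's `generate_kwargs` argument. The helper name `whisper_generate_kwargs` is hypothetical and only builds the dictionary; the `task` and `language` keys follow the usage shown on the Whisper model card.

```python
def whisper_generate_kwargs(task="transcribe", language=None):
    """Build generation options for a Whisper ASR pipeline call (hypothetical helper)."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    kwargs = {"task": task}
    if language is not None:
        kwargs["language"] = language
    return kwargs


# Transcribe English speech without relying on language detection.
transcribe_kwargs = whisper_generate_kwargs(language="english")
# Translate speech in another language into English.
translate_kwargs = whisper_generate_kwargs(task="translate")

# Assumed usage with the `pipe` object defined earlier (not executed here):
# result = pipe(audio_1, return_timestamps=True, generate_kwargs=translate_kwargs)
```

Since both test recordings are in English, the default behaviour (language detection followed by transcription) is what the cells below rely on.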


In [None]:
result_1['text']

" Okay, let's do some tests. Tell me when you see something that makes sound. I see a speaker which makes sound. What is that part of the speaker called? That is the tweeter. It produces high frequency sounds. Give me a creative alliteration about these. Creative crayons color cheerfully. They certainly craft colorful creations. What does that part of the code do? This code defines encryption and decryption functions. It seems to use AESCBC encryption to encode and decode data based on a key and an initialization vector, IV. That's right. on a key and an initialization vector, IV. That's right. What neighborhood do you think I'm in? This appears to be the King's Cross area of London. It is known for its railway station and transportation connections. Do you remember where you saw my glasses? Yes, I do. Your glasses were on the desk near a red apple. What can I add here to make this system faster? Adding a cache between the server and database could improve speed. What does this remind 

In [None]:
result_1['chunks']

[{'timestamp': (0.0, 9.92), 'text': " Okay, let's do some tests."},
 {'timestamp': (11.02, 13.56),
  'text': ' Tell me when you see something that makes sound.'},
 {'timestamp': (15.38, 17.74), 'text': ' I see a speaker which makes sound.'},
 {'timestamp': (19.68, 22.12),
  'text': ' What is that part of the speaker called?'},
 {'timestamp': (24.1, 27.44),
  'text': ' That is the tweeter. It produces high frequency sounds.'},
 {'timestamp': (31.68, 33.92),
  'text': ' Give me a creative alliteration about these.'},
 {'timestamp': (36.4, 40.88),
  'text': ' Creative crayons color cheerfully. They certainly craft colorful creations.'},
 {'timestamp': (51.22, 55.24),
  'text': ' What does that part of the code do? This code defines encryption and decryption functions.'},
 {'timestamp': (55.24, 62.0),
  'text': ' It seems to use AESCBC encryption to encode and decode data based on a key and an initialization'},
 {'timestamp': (62.0, 64.68), 'text': ' vector, IV.'},
 {'timestamp': (64.68, 6
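
Timestamped chunks like these map naturally onto subtitle files, one of the use cases mentioned in the introduction. The following is a sketch of converting them to SRT text, assuming each chunk has the `timestamp` and `text` keys shown above. The function names are my own, and the fallback for a missing end time is an assumption rather than documented pipeline behaviour.

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Convert pipeline-style chunks to SRT subtitle text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        # Assumption: a missing end time falls back to the start time.
        if end is None:
            end = start
        entries.append(
            f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# First two chunks copied from the output above.
sample_chunks = [
    {"timestamp": (0.0, 9.92), "text": " Okay, let's do some tests."},
    {"timestamp": (11.02, 13.56),
     "text": " Tell me when you see something that makes sound."},
]
srt_text = chunks_to_srt(sample_chunks)
print(srt_text)
```

With the real result, `chunks_to_srt(result_1['chunks'])` would produce the full subtitle text, which can then be written to a `.srt` file.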

In [None]:
execution_time_1

4.643780469894409
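
The start/stop timing pattern is repeated for every run in this notebook, so it can be wrapped in a small helper. This is only a sketch: `timed` and `best_of` are hypothetical names, `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals, and a stand-in workload replaces the model so the block runs on its own.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def best_of(fn, *args, repeats=3, **kwargs):
    """Run fn several times and return (last_result, fastest_elapsed_seconds)."""
    result, best = None, float("inf")
    for _ in range(repeats):
        result, elapsed = timed(fn, *args, **kwargs)
        best = min(best, elapsed)
    return result, best


# Stand-in workload so the sketch runs without a model.
total, elapsed = timed(sum, range(1_000_000))
print(f"sum={total}, elapsed={elapsed:.4f}s")

# Assumed usage with the pipeline defined earlier (not executed here):
# result_1, execution_time_1 = timed(pipe, audio_1, return_timestamps=True)
```

A single measurement can be noisy, and the first call on a GPU may include one-off setup cost, so repeating the run with `best_of` is one way to get a steadier number.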

Another test: transcribing the audio from OpenAI's live demo of GPT-4o's vision capabilities.

In [None]:
audio_2 = '/content/Live demo of GPT-4os vision capabilities.mp3'

In [None]:
import time

start_time = time.time()

result_2 = pipe(audio_2, return_timestamps = True)

end_time = time.time()
execution_time_2 = end_time - start_time


In [None]:
result_2['text']

" So the next one is from BurritoJohn78 who asks, Can you tell what you're feeling just by looking at your face? Barrett, do you want to give this one a try? Absolutely, let's try it out. Hey ChatGPT. Hey there, what's up? How can I brighten your day today? Okay, yeah, so I'm going to show you a selfie of what I look like and then I'd like you to try to see what emotions I'm feeling based on how I'm looking. Sounds like a fun challenge. Go ahead and show me that selfie and I'll put my emotional detective hat on. Okay, so here's me. So what kind of emotions do you think I'm feeling? It seems like I'm looking at a picture of a wooden surface. Oh, you know what? That was the thing I sent you before. Don't worry, I'm not actually a table. Okay, so take another look. Ah, that makes more sense. Ah, there we go. It looks like you're feeling pretty happy and cheerful, with a big smile and maybe even a touch of excitement. Whatever's going on, it seems like you're in a great mood. Care to share

In [None]:
result_2['chunks']

[{'timestamp': (0.0, 4.0),
  'text': ' So the next one is from BurritoJohn78 who asks,'},
 {'timestamp': (4.0, 8.0),
  'text': " Can you tell what you're feeling just by looking at your face?"},
 {'timestamp': (8.0, 10.0),
  'text': ' Barrett, do you want to give this one a try?'},
 {'timestamp': (10.0, 12.0), 'text': " Absolutely, let's try it out."},
 {'timestamp': (15.0, 16.0), 'text': ' Hey ChatGPT.'},
 {'timestamp': (17.0, 20.0),
  'text': " Hey there, what's up? How can I brighten your day today?"},
 {'timestamp': (20.0, 23.0),
  'text': " Okay, yeah, so I'm going to show you a selfie of what I look like"},
 {'timestamp': (23.0, 28.54),
  'text': " and then I'd like you to try to see what emotions I'm feeling based on how I'm looking."},
 {'timestamp': (28.54, 30.04), 'text': ' Sounds like a fun challenge.'},
 {'timestamp': (30.04, 34.82),
  'text': " Go ahead and show me that selfie and I'll put my emotional detective hat on."},
 {'timestamp': (34.82, 36.88), 'text': " Okay, so 

In [None]:
execution_time_2

4.35156512260437

## Whisper Distil-large-v3
Distil-Whisper is produced by applying model distillation to the original Whisper model from OpenAI. This distilled model performs to within 1% WER of large-v3 on long-form audio using both the sequential and chunked algorithms, and outperforms distil-large-v2 by 4.8% using the sequential algorithm. The model is also faster than previous Distil-Whisper models: 6.3x faster than large-v3, and 1.1x faster than distil-large-v2.
Distillation also reduces the number of parameters in the model, so the distilled version can be hosted on servers with less powerful GPUs.
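
WER (word error rate) is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. Below is a minimal sketch of that calculation. The normalization (lowercasing and dropping punctuation) is my own simplification, not the one used for the published figures, and since this notebook has no ground-truth transcript, comparing one model's output against the other's only measures how much they disagree.

```python
import re


def normalize(text):
    """Lowercase the text, drop punctuation, and split it into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Once both transcripts exist, `word_error_rate(result_1['text'], result_3['text'])` would give the disagreement rate between the two models on the first audio file.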

### Loading the model
This model also supports short-form transcription (audio shorter than 30 seconds) by omitting the `chunk_length_s` and `batch_size` parameters when setting up the pipeline. Since my test audio is longer than 30 seconds, I will not be using short-form transcription.
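
As a rough sketch of that choice, the extra pipeline options can be selected from the audio length. The helper `asr_pipeline_kwargs` is hypothetical; the 30-second threshold comes from the paragraph above, and the defaults of 25 and 16 match the values used in the cell that follows.

```python
SHORT_FORM_LIMIT_S = 30  # threshold taken from the text above


def asr_pipeline_kwargs(audio_seconds, chunk_length_s=25, batch_size=16):
    """Return extra pipeline() options for short-form or chunked long-form audio."""
    kwargs = {"max_new_tokens": 128}
    if audio_seconds >= SHORT_FORM_LIMIT_S:
        # Long-form audio: enable the chunked algorithm with batching.
        kwargs["chunk_length_s"] = chunk_length_s
        kwargs["batch_size"] = batch_size
    return kwargs


short_kwargs = asr_pipeline_kwargs(12)
long_kwargs = asr_pipeline_kwargs(180)
print(short_kwargs)
print(long_kwargs)
```

The resulting dictionary would be unpacked into the `pipeline(...)` call alongside the model, tokenizer, and feature extractor arguments.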

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe_1 = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Testing the Distilled model's accuracy and performance

In [None]:
import time

start_time = time.time()

result_3 = pipe_1(audio_1, return_timestamps = True)

end_time = time.time()
execution_time_3 = end_time - start_time

In [None]:
result_3['text']

" Okay, let's do some tests. Tell me when you see something that makes sound. I see a speaker which makes sound. What is that part of the speaker called? That is the tweeter. It produces high frequency sounds. Give me a creative alliteration about these. Creative crayons color cheerfully. They certainly craft colorful creations. What does that part of the code do? This code defines encryption and decryption functions. It seems to use AESCBC encryption to encode and decode data based on a key and an initialization vector, IV. That's right. What neighborhood do you think I'm in? This appears to be the King's Cross area of London. It is known for its railway station and transportation connections. Do you remember where you saw my glasses? Yes, I do. Your glasses were on the desk near a red apple. What can I add here to make this system faster? Adding a cache between the server and database could improve speed. What does this remind you of? Shruginger's cat. All right, give me a band name 

In [None]:
result_3['chunks']

[{'timestamp': (0.0, 14.0),
  'text': " Okay, let's do some tests. Tell me when you see something that makes sound."},
 {'timestamp': (14.0, 23.67),
  'text': ' I see a speaker which makes sound. What is that part of the speaker called?'},
 {'timestamp': (23.67, 25.67), 'text': ' That is the tweeter.'},
 {'timestamp': (25.67, 29.67), 'text': ' It produces high frequency sounds.'},
 {'timestamp': (29.67, 34.67),
  'text': ' Give me a creative alliteration about these.'},
 {'timestamp': (34.67, 38.33), 'text': ' Creative crayons color cheerfully.'},
 {'timestamp': (38.33, 40.89),
  'text': ' They certainly craft colorful creations.'},
 {'timestamp': (45.57, 48.29), 'text': ' What does that part of the code do?'},
 {'timestamp': (50.45, 55.0),
  'text': ' This code defines encryption and decryption functions.'},
 {'timestamp': (55.0, 63.0),
  'text': ' It seems to use AESCBC encryption to encode and decode data based on a key and an initialization vector, IV.'},
 {'timestamp': (63.0, 67.0

In [None]:
execution_time_3

2.2963743209838867

Testing the distilled model on the second audio file.

In [None]:
import time

start_time = time.time()

result_4 = pipe_1(audio_2, return_timestamps = True)

end_time = time.time()
execution_time_4 = end_time - start_time

In [None]:
result_4['text']

" So the next one is from Burrito John 78, who asks, Can you tell what you're feeling just by looking at your face? Barrett, do you want to give this one a try? Absolutely, let's try it out. Hey, chat, GPT. Hey there, what's up? How can I brighten your day today? Okay, yeah, so I'm going to show you a selfie of what I look like and then I'd like you to try to see what emotions I'm feeling based on how I'm looking. Sounds like a fun challenge. Go ahead and show me that selfie and I'll put my emotional detective hat on. Okay, so here's me. So what kind of emotions do you think I'm feeling? Hmm. It seems like I'm looking at a picture of a wooden surface. Oh, you know what? That was the thing I sent you before. Don't worry. I'm not actually a table. Okay, so take another look. That makes more sense. Ah, there makes more sense. Ah, there we go. It looks like you're feeling pretty happy and cheerful, with a big smile and maybe even a touch of excitement. Whatever's going on, it seems like yo

In [None]:
result_4['chunks']

[{'timestamp': (0.0, 4.0),
  'text': ' So the next one is from Burrito John 78, who asks,'},
 {'timestamp': (4.0, 8.0),
  'text': " Can you tell what you're feeling just by looking at your face?"},
 {'timestamp': (8.0, 11.0),
  'text': ' Barrett, do you want to give this one a try?'},
 {'timestamp': (11.0, 14.0), 'text': " Absolutely, let's try it out."},
 {'timestamp': (14.0, 17.0), 'text': ' Hey, chat, GPT.'},
 {'timestamp': (17.0, 27.67),
  'text': " Hey there, what's up? How can I brighten your day today? Okay, yeah, so I'm going to show you a selfie of what I look like and then I'd like you to try to see what emotions I'm feeling based on how I'm looking."},
 {'timestamp': (27.67, 34.67),
  'text': " Sounds like a fun challenge. Go ahead and show me that selfie and I'll put my emotional detective hat on."},
 {'timestamp': (34.67, 39.33),
  'text': " Okay, so here's me. So what kind of emotions do you think I'm feeling?"},
 {'timestamp': (39.33, 40.33), 'text': ' Hmm.'},
 {'timesta

In [None]:
execution_time_4

1.8812119960784912

### Comparing the two models

In [None]:
print("Original model's Transcribed output for the first audio file: ", result_1['text'])
print("Distilled model's Transcribed output for the first audio file: ", result_3['text'])

Original model's Transcribed output for the first audio file:   Okay, let's do some tests. Tell me when you see something that makes sound. I see a speaker which makes sound. What is that part of the speaker called? That is the tweeter. It produces high frequency sounds. Give me a creative alliteration about these. Creative crayons color cheerfully. They certainly craft colorful creations. What does that part of the code do? This code defines encryption and decryption functions. It seems to use AESCBC encryption to encode and decode data based on a key and an initialization vector, IV. That's right. on a key and an initialization vector, IV. That's right. What neighborhood do you think I'm in? This appears to be the King's Cross area of London. It is known for its railway station and transportation connections. Do you remember where you saw my glasses? Yes, I do. Your glasses were on the desk near a red apple. What can I add here to make this system faster? Adding a cache between the s

In [None]:
print("Original model's Transcribed output for the second audio file: ", result_2['text'])
print("Distilled model's Transcribed output for the second audio file: ", result_4['text'])

Original model's Transcribed output for the second audio file:   So the next one is from BurritoJohn78 who asks, Can you tell what you're feeling just by looking at your face? Barrett, do you want to give this one a try? Absolutely, let's try it out. Hey ChatGPT. Hey there, what's up? How can I brighten your day today? Okay, yeah, so I'm going to show you a selfie of what I look like and then I'd like you to try to see what emotions I'm feeling based on how I'm looking. Sounds like a fun challenge. Go ahead and show me that selfie and I'll put my emotional detective hat on. Okay, so here's me. So what kind of emotions do you think I'm feeling? It seems like I'm looking at a picture of a wooden surface. Oh, you know what? That was the thing I sent you before. Don't worry, I'm not actually a table. Okay, so take another look. Ah, that makes more sense. Ah, there we go. It looks like you're feeling pretty happy and cheerful, with a big smile and maybe even a touch of excitement. Whatever'

In [None]:
print("Original model's execution time for audio 1(in seconds): ", execution_time_1)
print("Distilled model's execution time for audio 1(in seconds): ", execution_time_3)


Original model's execution time for audio 1(in seconds):  4.643780469894409
Distilled model's execution time for audio 1(in seconds):  2.2963743209838867


In [None]:
print("Original model's execution time for audio 2 (in seconds): ", execution_time_2)
print("Distilled model's execution time for audio 2 (in seconds): ", execution_time_4)


Original model's execution time for audio 2 (in seconds):  4.35156512260437
Distilled model's execution time for audio 2 (in seconds):  1.8812119960784912
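
The raw times are easier to compare as a ratio. This short sketch computes the speedup of the distilled model from the four measurements printed above; the dictionary layout and labels are my own.

```python
# Execution times in seconds, copied from the outputs above.
timings = {
    "audio 1": {"large-v3": 4.643780469894409, "distil-large-v3": 2.2963743209838867},
    "audio 2": {"large-v3": 4.35156512260437, "distil-large-v3": 1.8812119960784912},
}

# Speedup = original model's time divided by the distilled model's time.
speedups = {
    name: t["large-v3"] / t["distil-large-v3"] for name, t in timings.items()
}

for name, speedup in speedups.items():
    print(f"{name}: distilled model is {speedup:.2f}x faster")
```

These are single runs on short clips, so the ratios need not match the 6.3x figure quoted earlier.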


## Conclusion
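Comparing the outputs of the two models shows that the distilled model is a more efficient alternative for transcription tasks. It ran roughly 2.0x faster on the first audio file (2.30 s vs 4.64 s) and 2.3x faster on the second (1.88 s vs 4.35 s), while producing nearly the same text. The differences are minor: for example, the distilled model wrote "Burrito John 78" and "Hey, chat, GPT" where the original wrote "BurritoJohn78" and "Hey ChatGPT", and each model repeated a phrase once. This makes the distilled model a viable option for applications where processing speed is critical, without significantly compromising the reliability of the transcriptions.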