<a href="https://colab.research.google.com/github/EuclidStellar/Model-Bechmarks-for-stt/blob/main/BertLargeModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install required libraries (run before the imports so they resolve)
!pip install transformers
!pip install librosa
!pip install torchaudio

import time
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, BertTokenizer, BertForSequenceClassification
import torch
import librosa

# Load the speech-to-text model and processor
stt_model_name = "facebook/wav2vec2-large-960h"
processor = Wav2Vec2Processor.from_pretrained(stt_model_name)
model = Wav2Vec2ForCTC.from_pretrained(stt_model_name)

# Load the audio file
audio_path = "/content/testn1.ogg"
speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

# librosa.load already downmixes to mono by default (mono=True);
# this check is kept only as a safeguard.
if speech_array.ndim > 1:
    speech_array = librosa.to_mono(speech_array)

# Preprocess the audio
input_values = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_values

# Measure the time taken for transcription
start_time = time.time()

# Perform inference
with torch.no_grad():
    logits = model(input_values).logits

end_time = time.time()

# Decode the logits to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
transcription_time = end_time - start_time

print("Transcription:", transcription)
print(f"Time taken for transcription: {transcription_time:.2f} seconds")

# Load the tokenizer and model for BERT.
# Note: bert-base-uncased ships without a fine-tuned classification head, so the
# head is randomly initialised and the predicted class below is not meaningful.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Tokenize the transcribed text
inputs = bert_tokenizer(transcription, return_tensors="pt", truncation=True, padding=True)

# Perform inference
with torch.no_grad():
    outputs = bert_model(**inputs)

# Get the predicted class (kept separate from the speech model's logits)
class_logits = outputs.logits
predicted_class = torch.argmax(class_logits, dim=-1).item()

print("Predicted class:", predicted_class)


Some weights of the model checkpoint at facebook/wav2vec2-large-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You s

Transcription: DAM KRISHNER SAHU
Time taken for transcription: 2.37 seconds
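
The figure above comes from a single `time.time()` measurement around the forward pass, so it can include one-off warm-up cost and does not cover decoding. A minimal sketch of a more repeatable measurement follows, assuming a hypothetical helper named `time_call` (not part of the notebook) that wraps any callable with `time.perf_counter`:

```python
import statistics
import time


def time_call(fn, *args, warmup=1, repeats=5, **kwargs):
    """Return (result, mean_seconds, stdev_seconds) for fn(*args, **kwargs).

    Hypothetical helper: runs `warmup` untimed calls first, then
    `repeats` timed calls measured with a monotonic clock.
    """
    for _ in range(warmup):
        fn(*args, **kwargs)

    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)

    # stdev needs at least two samples
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return result, statistics.mean(durations), stdev
```

In the notebook this could wrap the forward pass, for example `time_call(lambda: model(input_values).logits)` inside the `torch.no_grad()` block.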


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Predicted class: 1
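
Because the warning above says the `classifier` weights are newly initialised, the predicted class is the argmax of logits from an untrained head. One way to see how weak such a decision is would be to convert the logits to probabilities. The sketch below uses plain Python and illustrative logit values that are assumptions, not numbers from this run:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits only; these are NOT taken from the notebook output.
example_logits = [0.10, 0.25]
probs = softmax(example_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

With the notebook's tensors, the equivalent call would be `torch.softmax(logits, dim=-1)` on the BERT logits.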


In [2]:
# Libraries are already installed by the first cell, so only the imports are needed.
import time
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, BertTokenizer, BertForSequenceClassification
import torch
import librosa

# Load the speech-to-text model and processor
stt_model_name = "facebook/wav2vec2-large-960h"
processor = Wav2Vec2Processor.from_pretrained(stt_model_name)
model = Wav2Vec2ForCTC.from_pretrained(stt_model_name)

# Load the audio file
audio_path = "/content/testn.ogg"
speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

# Ensure the audio is mono
if len(speech_array.shape) > 1:
    speech_array = librosa.to_mono(speech_array)

# Preprocess the audio
input_values = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_values

# Measure the time taken for transcription
start_time = time.time()

# Perform inference
with torch.no_grad():
    logits = model(input_values).logits

end_time = time.time()

# Decode the logits to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
transcription_time = end_time - start_time

print("Transcription:", transcription)
print(f"Time taken for transcription: {transcription_time:.2f} seconds")

# Load the tokenizer and model for BERT.
# Note: the classification head is randomly initialised, so the predicted class
# below is not meaningful until the model is fine-tuned.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Tokenize the transcribed text
inputs = bert_tokenizer(transcription, return_tensors="pt", truncation=True, padding=True)

# Perform inference
with torch.no_grad():
    outputs = bert_model(**inputs)

# Get the predicted class (kept separate from the speech model's logits)
class_logits = outputs.logits
predicted_class = torch.argmax(class_logits, dim=-1).item()

print("Predicted class:", predicted_class)




Transcription: GORO SING
Time taken for transcription: 2.77 seconds
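
A raw duration is hard to compare across clips of different length. A common normalisation is the real-time factor: processing time divided by audio duration, where the duration is `len(speech_array) / sampling_rate`. The sketch below assumes a hypothetical function name and a hypothetical 4-second clip, since the notebook does not report clip lengths:

```python
def real_time_factor(processing_seconds, num_samples, sampling_rate):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if num_samples <= 0 or sampling_rate <= 0:
        raise ValueError("num_samples and sampling_rate must be positive")
    audio_seconds = num_samples / sampling_rate
    return processing_seconds / audio_seconds


# 2.77 s is the time printed above; the 64000-sample (4 s at 16 kHz) clip
# length is an assumption for illustration only.
rtf = real_time_factor(2.77, num_samples=64000, sampling_rate=16000)
print(f"Real-time factor: {rtf:.3f}")
```

In the notebook, `num_samples` would be `len(speech_array)` and `sampling_rate` the value returned by `librosa.load`.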



Predicted class: 1


In [3]:
# Libraries are already installed by the first cell, so only the imports are needed.
import time
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, BertTokenizer, BertForSequenceClassification
import torch
import librosa

# Load the speech-to-text model and processor
stt_model_name = "facebook/wav2vec2-large-960h"
processor = Wav2Vec2Processor.from_pretrained(stt_model_name)
model = Wav2Vec2ForCTC.from_pretrained(stt_model_name)

# Load the audio file
audio_path = "/content/testp1.ogg"
speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

# Ensure the audio is mono
if len(speech_array.shape) > 1:
    speech_array = librosa.to_mono(speech_array)

# Preprocess the audio
input_values = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_values

# Measure the time taken for transcription
start_time = time.time()

# Perform inference
with torch.no_grad():
    logits = model(input_values).logits

end_time = time.time()

# Decode the logits to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
transcription_time = end_time - start_time

print("Transcription:", transcription)
print(f"Time taken for transcription: {transcription_time:.2f} seconds")

# Load the tokenizer and model for BERT.
# Note: the classification head is randomly initialised, so the predicted class
# below is not meaningful until the model is fine-tuned.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Tokenize the transcribed text
inputs = bert_tokenizer(transcription, return_tensors="pt", truncation=True, padding=True)

# Perform inference
with torch.no_grad():
    outputs = bert_model(**inputs)

# Get the predicted class (kept separate from the speech model's logits)
class_logits = outputs.logits
predicted_class = torch.argmax(class_logits, dim=-1).item()

print("Predicted class:", predicted_class)




Transcription: NINE ZIRO TWO SIX DA GLANE DUZIRO NINE SEVEN
Time taken for transcription: 3.37 seconds
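
Transcriptions like the one above contain visible errors (for example `ZIRO`), but the notebook reports only latency. A usual accuracy metric for speech-to-text is word error rate: the word-level edit distance to a reference transcript, divided by the reference length. The sketch below is a plain-Python version with hypothetical function names; the reference string in the example is illustrative, because the notebook does not provide ground-truth transcripts:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.upper().split()
    hyp_words = hypothesis.upper().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Illustrative strings only; the reference is an assumption, not ground truth.
print(word_error_rate("NINE ZERO TWO SIX", "NINE ZIRO TWO SIX"))
```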



Predicted class: 0


In [4]:
# Libraries are already installed by the first cell, so only the imports are needed.
import time
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, BertTokenizer, BertForSequenceClassification
import torch
import librosa

# Load the speech-to-text model and processor
stt_model_name = "facebook/wav2vec2-large-960h"
processor = Wav2Vec2Processor.from_pretrained(stt_model_name)
model = Wav2Vec2ForCTC.from_pretrained(stt_model_name)

# Load the audio file
audio_path = "/content/testp.ogg"
speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

# Ensure the audio is mono
if len(speech_array.shape) > 1:
    speech_array = librosa.to_mono(speech_array)

# Preprocess the audio
input_values = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_values

# Measure the time taken for transcription
start_time = time.time()

# Perform inference
with torch.no_grad():
    logits = model(input_values).logits

end_time = time.time()

# Decode the logits to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
transcription_time = end_time - start_time

print("Transcription:", transcription)
print(f"Time taken for transcription: {transcription_time:.2f} seconds")

# Load the tokenizer and model for BERT.
# Note: the classification head is randomly initialised, so the predicted class
# below is not meaningful until the model is fine-tuned.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Tokenize the transcribed text
inputs = bert_tokenizer(transcription, return_tensors="pt", truncation=True, padding=True)

# Perform inference
with torch.no_grad():
    outputs = bert_model(**inputs)

# Get the predicted class (kept separate from the speech model's logits)
class_logits = outputs.logits
predicted_class = torch.argmax(class_logits, dim=-1).item()

print("Predicted class:", predicted_class)




Transcription: NINE FIVE FIVE FIVE NINE A ZIRO FOR FOR ZIRO
Time taken for transcription: 3.30 seconds
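
The output above is a spoken digit sequence with spelling variants such as `ZIRO` and `FOR`. If the goal is to recover the number, a small post-processing step can map words to digits. The sketch below is a heuristic under stated assumptions: the alias table and the function name are hypothetical, and treating `FOR` as 4 is only reasonable when the audio is known to contain digits.

```python
# Hypothetical alias table: canonical digit words plus misspellings
# seen in the transcriptions above (ZIRO, FOR).
DIGIT_ALIASES = {
    "ZERO": "0", "ZIRO": "0", "OH": "0",
    "ONE": "1",
    "TWO": "2",
    "THREE": "3",
    "FOUR": "4", "FOR": "4",
    "FIVE": "5",
    "SIX": "6",
    "SEVEN": "7",
    "EIGHT": "8",
    "NINE": "9",
}


def words_to_digits(transcription):
    """Return (digit_string, unknown_tokens) for a spoken-digit transcription."""
    digits = []
    unknown = []
    for token in transcription.upper().split():
        if token in DIGIT_ALIASES:
            digits.append(DIGIT_ALIASES[token])
        else:
            unknown.append(token)  # keep unmatched tokens for manual review
    return "".join(digits), unknown


digits, unknown = words_to_digits("NINE FIVE FIVE FIVE NINE A ZIRO FOR FOR ZIRO")
print(digits, unknown)
```

Returning the unmatched tokens separately makes it visible when the model produced something that the table cannot interpret, rather than silently dropping it.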



Predicted class: 1
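
The four cells above differ only in `audio_path`. A loop over the files would avoid reloading the models and repeating the code. The sketch below assumes a hypothetical `run_benchmark` helper that takes any `transcribe` callable mapping a path to text; wrapping the wav2vec2 steps from the cells above in such a callable is left as an assumption. Unlike the cells above, this times the whole call, including audio loading and decoding.

```python
import time


def run_benchmark(audio_paths, transcribe):
    """Run `transcribe(path)` on each path and record text and elapsed seconds.

    `transcribe` is a hypothetical callable: path -> transcription string.
    """
    results = []
    for path in audio_paths:
        start = time.perf_counter()
        text = transcribe(path)
        elapsed = time.perf_counter() - start
        results.append({"path": path, "transcription": text, "seconds": elapsed})
    return results


def print_results(results):
    """Print one line per file in the same style as the cells above."""
    for row in results:
        print(f"{row['path']}: {row['transcription']} ({row['seconds']:.2f} seconds)")
```

With the notebook's files, `audio_paths` would be `["/content/testn1.ogg", "/content/testn.ogg", "/content/testp1.ogg", "/content/testp.ogg"]`.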
