<a href="https://colab.research.google.com/github/SalmaElSayd/Speech2Text/blob/main/Speech%20Transcript%20with%20Wav2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wav2Vec2
Speech2Text2 is a decoder-only transformer model that can be paired with any speech encoder-only model, such as Wav2Vec2 or HuBERT, for speech-to-text tasks.

This notebook explores the Wav2Vec2 model by transcribing a sample WAV file.

In [None]:
! pip install -q transformers

In [None]:
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

In [None]:
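# Load the pre-trained model and tokenizer.
# Note: Wav2Vec2Tokenizer is deprecated (Wav2Vec2Processor is the recommended
# replacement), which is why the class-mismatch warning appears in the output.
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")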

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Sampling Rate
When testing this model, I first tried a sample file with an 8 kHz sampling rate, which produced gibberish. It turned out that Wav2Vec2 expects audio sampled at 16 kHz, the rate it was trained on. So, using Audacity, I converted the file to 16 kHz before testing the model.
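
A quick way to confirm a file's sampling rate before running the model is to read the WAV header. The sketch below uses only Python's standard `wave` module. It writes a one-second silent 8 kHz file to a temporary path purely for illustration; the file name `demo_8k.wav` and the helper `wav_sample_rate` are hypothetical and not part of this notebook.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Illustration only: create a one-second silent mono file at 8 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_8k.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)      # mono
    wav_file.setsampwidth(2)      # 16-bit samples
    wav_file.setframerate(8000)   # 8 kHz, the rate that caused trouble here
    wav_file.writeframes(b"\x00\x00" * 8000)

rate = wav_sample_rate(demo_path)
needs_resampling = rate != 16000  # the model expects 16 kHz input
print(rate, needs_resampling)
```

If `needs_resampling` is true, the file should be converted (for example in Audacity, as I did) before it is passed to the model.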

In [None]:
# Load any audio file of your choice.
# sr=16000 makes librosa resample to 16 kHz on load if the file uses another rate.
speech, rate = librosa.load("sample.wav", sr=16000)

In [None]:
import IPython.display as display
display.Audio("sample.wav", autoplay=True)

In [None]:
# Convert the waveform into a batched PyTorch tensor of model inputs
input_values = tokenizer(speech, return_tensors="pt").input_values

In [None]:
input_values

tensor([[-0.6486, -0.9056, -0.9203,  ..., -0.2452, -0.2034, -0.1025]])
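
The tensor above is the waveform after preprocessing. My assumption, which this notebook does not verify, is that the checkpoint's preprocessing standardizes the waveform to zero mean and unit variance (the `do_normalize` behaviour). Under that assumption, the step can be sketched in plain Python. The helper name, the `eps` value, and the sample values are all hypothetical.

```python
import math


def normalize_waveform(samples, eps=1e-7):
    """Rescale samples to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return [(s - mean) / math.sqrt(variance + eps) for s in samples]


# Hypothetical raw samples, e.g. a short slice of a waveform in [-1, 1].
raw = [0.02, -0.01, 0.05, -0.04, 0.00, 0.03]
normalized = normalize_waveform(raw)

# Check the result: mean close to 0, variance close to 1.
new_mean = sum(normalized) / len(normalized)
new_var = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(round(new_mean, 6), round(new_var, 3))
```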

In [None]:
# Store the logits (unnormalized prediction scores, one row per audio frame)
logits = model(input_values).logits

In [None]:
logits

tensor([[[ 13.4239, -27.9590, -27.7488,  ...,  -8.1708,  -8.9287, -10.2502],
         [ 13.3152, -27.9506, -27.7361,  ...,  -7.9532,  -8.8911, -10.1365],
         [ 14.0275, -28.4029, -28.1761,  ...,  -7.5711,  -9.4870,  -9.1704],
         ...,
         [ 14.3300, -29.2907, -28.9588,  ...,  -5.8550,  -9.4248,  -8.4351],
         [ 14.3295, -29.2872, -28.9554,  ...,  -5.8580,  -9.4226,  -8.4377],
         [ 13.9431, -28.6452, -28.3593,  ...,  -7.2351,  -9.3091,  -8.8594]]],
       grad_fn=<AddBackward0>)
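
Logits are unnormalized scores: one value per vocabulary entry for each audio frame. A softmax turns one frame's logits into probabilities, and the argmax picks the same index either way. Here is a minimal plain Python sketch using made-up logits for a hypothetical four-token vocabulary (not the model's real vocabulary):

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one frame over a four-token vocabulary.
frame_logits = [13.4, -27.9, -8.2, 2.1]
probs = softmax(frame_logits)

# Index of the most likely token for this frame.
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(sum(probs), 6))
```

In the rows shown above, the first column is consistently the largest. To my understanding, index 0 in this checkpoint is the `<pad>` token, which CTC uses as its blank symbol, so those frames most likely predict "no character here".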

In [None]:
# Store the predicted token IDs (argmax over the vocabulary dimension)
predicted_ids = torch.argmax(logits, dim=-1)

In [None]:
# Decode the predicted token IDs into text
transcriptions = tokenizer.decode(predicted_ids[0])
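
The decode step performs greedy CTC decoding: consecutive repeated IDs are collapsed, blank tokens are dropped, and the remaining IDs are mapped to characters. The sketch below illustrates that idea in plain Python. The six-entry vocabulary, the helper `ctc_greedy_decode`, and the frame IDs are hypothetical stand-ins and do not reproduce the library's implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary: id 0 is the CTC blank, "|" marks a word boundary.
VOCAB = {0: "<pad>", 1: "|", 2: "T", 3: "H", 4: "E", 5: "O"}
BLANK_ID = 0


def ctc_greedy_decode(ids):
    """Collapse repeated ids, drop blanks, then map ids to characters."""
    collapsed = [token_id for token_id, _ in groupby(ids)]
    kept = [token_id for token_id in collapsed if token_id != BLANK_ID]
    text = "".join(VOCAB[token_id] for token_id in kept)
    return text.replace("|", " ").strip()


# Per-frame argmax ids for a made-up utterance.
frame_ids = [0, 2, 2, 0, 3, 3, 3, 4, 0, 1, 1, 2, 0, 5, 5, 0, 0, 5, 0]
print(ctc_greedy_decode(frame_ids))
```

The blank between the two runs of `5` is what lets a genuine double letter survive the collapsing step, so this example prints `THE TOO`.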


In [None]:
print(transcriptions)

THE BIRCH CANOE SLIT ON THE SMOOTH PLANK GLE THE HEE TO THE DARK BLUE BACKGROUND IT IS EASY TO TELL THE DEPTH OF THE WELL THESE DAYS A CICG A MEG IS A RARE DISH RICE IS OX EN SERVED IN ROUND BULL THE USE OF LONDONS MAKES FINE PUNCH THE BOX WAS TON E BESIDE THE PARK TRUK THE HOX ARE SED CHOPPED CORN AND GARBAGE FOUR HOURS A STEADY WORK FACED US E LARGE SIDE AND STOCKINGS IS HARD TO SELL
