# speech2text
- A simple notebook that converts speech to text.
- We will be using wav2vec 2.0 (the `facebook/wav2vec2-base-960h` checkpoint) here.

### 1. Get the libraries
- Make sure you have them installed

In [30]:
import torch
import librosa
import numpy as np
import soundfile as sf
from scipy.io import wavfile
from IPython.display import Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

### 2. Initialize the models

In [31]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 3. Load the audio data and Framerate

In [82]:
audio_path = 'wenometchainsoma.wav'
[framerate, sound_dat] = wavfile.read(audio_path)
time = np.arange( 0, len(sound_dat) )/framerate
print('Sampling rate:', framerate, ' Hz')

Sampling rate: 44100  Hz


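### 4. Resample the audio to 16 kHz
We load the audio at a sampling rate of 16000 Hz because wav2vec 2.0 expects 16 kHz input. Note that `librosa.load` returns the resampled waveform as an array of floats, not a spectrogram.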

In [83]:
input_audio, _ = librosa.load(audio_path, sr=16000)
print(input_audio)

[0.         0.         0.         ... 0.00783363 0.02488752 0.        ]
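As a quick sanity check on the resampling step, the arithmetic below relates sample counts to duration. This is only a sketch: the sample count is a made-up value (not the length of the file above), the helper names are hypothetical, and the exact resampled length depends on how the resampler rounds.

```python
def duration_seconds(num_samples, sampling_rate):
    # Duration is the number of samples divided by samples per second
    return num_samples / sampling_rate


def resampled_length(num_samples, orig_rate, target_rate):
    # Approximate length after resampling; real resamplers may round differently
    return round(num_samples * target_rate / orig_rate)


num_samples = 441_000   # hypothetical clip length at the original rate
orig_rate = 44_100      # the rate reported by wavfile.read above
target_rate = 16_000    # the rate passed to librosa.load

print(duration_seconds(num_samples, orig_rate))               # 10.0
print(resampled_length(num_samples, orig_rate, target_rate))  # 160000
```

The duration stays the same after resampling; only the number of samples per second changes.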


### 5. Convert the waveform to model inputs
The tokenizer takes the raw waveform and returns it as a PyTorch tensor under the key `input_values`, which is what the wav2vec 2.0 model consumes.

In [84]:

token = tokenizer(input_audio, return_tensors='pt')
print(token)
input_values = token['input_values']


{'input_values': tensor([[-5.0579e-05, -5.0579e-05, -5.0579e-05,  ...,  6.2714e-02,
          1.9935e-01, -5.0579e-05]])}
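The silent samples that were exactly 0 in the raw audio now show up as about -5e-05, which suggests the waveform was shifted and rescaled. A minimal sketch of zero-mean, unit-variance normalization, which we assume is what happens here; the function name, the toy samples, and the small epsilon are illustrative and not taken from the library:

```python
import math


def zero_mean_unit_variance(samples, eps=1e-7):
    # Subtract the mean, then divide by the standard deviation.
    # eps (an assumed value) avoids division by zero on silent audio.
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in samples]


toy_audio = [0.0, 0.0, 0.5, -0.5, 0.25, -0.25]  # made-up samples
normalized = zero_mean_unit_variance(toy_audio)
print(normalized)
```

After this step the values have a mean of roughly 0 and a variance of roughly 1, whatever the loudness of the recording.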


### 6. Get the prediction
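- The model returns its prediction as logits: one raw, unnormalized score per vocabulary entry for every audio frame
- In math, the logit is a function that maps probabilities in (0, 1) to R, that is (-inf, inf)
- For two classes, L = ln(P/(1-P)), where P is the probability of the class and 1-P is the probability of the other class
- If you know softmax, logits are what goes into it; softmax turns them back into probabilities (it is not exactly invertible, because adding the same constant to every logit gives the same probabilities)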

In [85]:
logits = model(input_values).logits
print(logits)

tensor([[[ 13.8625, -28.2701, -27.9274,  ...,  -7.2499,  -7.1606,  -7.6149],
         [ 13.9155, -28.3151, -27.9709,  ...,  -7.2218,  -7.1493,  -7.6352],
         [ 14.0325, -28.4185, -28.0714,  ...,  -7.2433,  -7.2460,  -7.6442],
         ...,
         [ 12.9741, -27.0169, -26.7081,  ...,  -6.9358,  -7.0046,  -6.8817],
         [ 13.0958, -27.1575, -26.8484,  ...,  -7.0554,  -7.1086,  -6.9866],
         [ 13.0455, -27.0814, -26.7764,  ...,  -7.0332,  -7.0730,  -7.0397]]],
       grad_fn=<ViewBackward0>)
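To make the logit and softmax relationship concrete, here is a small pure-Python sketch. The scores are made-up numbers, not values from the model, and the helper names are ours:

```python
import math


def logit(p):
    # Binary logit: maps a probability in (0, 1) to a real number
    return math.log(p / (1.0 - p))


def sigmoid(x):
    # Inverse of the binary logit
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the max first for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


frame_logits = [2.0, -1.0, 0.5]  # made-up scores for one frame
probs = softmax(frame_logits)
shifted = softmax([s + 10.0 for s in frame_logits])

print(probs)                 # probabilities that sum to 1
print(shifted)               # same probabilities despite the shift
print(sigmoid(logit(0.8)))   # recovers roughly 0.8
```

Because the shifted logits give the same probabilities, only the differences between logits matter, not their absolute values.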


### 7. But dude, what is argmax?
- argmax is the index of the maximum value in the array
- In this case, every time frame has its own array of scores (logits), one per vocabulary entry
- So the index of the maximum value in that array is the index of the most probable vocabulary entry for that frame (softmax preserves the ordering, so the largest logit is also the largest probability)
- We take that for every frame and call the result the predicted indices

In [86]:
predicted_indices = torch.argmax(logits, dim=-1)
print(predicted_indices)

tensor([[0, 0, 0,  ..., 0, 0, 0]])
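A tiny pure-Python sketch of the same idea, taking the argmax of each frame separately. The scores and the three-entry vocabulary are made up for illustration; for a single clip this mirrors what `torch.argmax(logits, dim=-1)` does along the last dimension:

```python
def argmax(values):
    # Index of the largest value (first one wins on ties)
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:
            best = i
    return best


# Three frames, each with scores over a made-up vocabulary of size 3
toy_logits = [
    [13.9, -28.3, -7.2],
    [-2.0, 1.5, 0.3],
    [0.1, 0.2, 4.0],
]

predicted = [argmax(frame) for frame in toy_logits]
print(predicted)  # [0, 1, 2]
```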


### 8. Did you know that we could use the argmax to get the predicted text?
- The wav2vec 2.0 tokenizer has a function called batch_decode that takes the predicted indices and returns the predicted text
- In this case the predicted indices contain only one array, because the batch holds only one audio clip, so there is only one text in the prediction
- Each entry of that array is the single index with the highest score for one frame.... remember??

In [87]:
transcription = tokenizer.batch_decode(predicted_indices)[0] #Since there is only one
print(transcription)

BY NAMETER IN WITH SOMON IT'S A MORE HOT BEN DOWN WE WE WE WI MILL IN O HE ESENTLY IN E TURN BROWN E WE W AND WE CAN BEAT A BAIN BEN ON LONG THE SKY IS OFF LITTLE
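For intuition about how a long run of per-frame indices becomes a short string, here is a minimal sketch of greedy CTC decoding: collapse repeated indices, drop the blank, then join the characters. The toy vocabulary and the function name are invented, and we assume index 0 is the blank and `|` marks word boundaries, which is our understanding of this checkpoint's vocabulary rather than something the notebook confirms:

```python
from itertools import groupby


def ctc_greedy_decode(indices, id_to_char, blank_id=0, word_sep="|"):
    # 1. Collapse consecutive repeats (the same symbol spanning several frames)
    collapsed = [idx for idx, _ in groupby(indices)]
    # 2. Drop the blank symbol (assumed to be index 0)
    chars = [id_to_char[idx] for idx in collapsed if idx != blank_id]
    # 3. Join, and turn the assumed word separator into spaces
    return "".join(chars).replace(word_sep, " ").strip()


# Hypothetical vocabulary for illustration only
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "L", 5: "O", 6: "E"}

# Frames: H H <pad> E L L <pad> L O O | | H I
toy_indices = [2, 2, 0, 6, 4, 4, 0, 4, 5, 5, 1, 1, 2, 3]

print(ctc_greedy_decode(toy_indices, toy_vocab))  # HELLO HI
```

The blank between the two runs of `L` is what lets the decoder keep a genuine double letter instead of merging it into one.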
