# speech2text
- It's a simple notebook that converts speech to text.
- We will be using wav2vec 2.0 here.

### 1. Get the libraries
- Make sure you have them installed

In [1]:
import torch
import time
import librosa
import numpy as np
import soundfile as sf
from scipy.io import wavfile
from ipywebrtc import AudioRecorder
from IPython.display import Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

### 2. Initialize the models

In [2]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 3. Load the audio data and sampling rate

In [8]:
audio_path = 'audio_5.wav'
framerate, sound_dat = wavfile.read(audio_path)  # sampling rate in Hz, raw samples
# time_axis = np.arange(len(sound_dat)) / framerate  # optional: sample times in seconds
print('Sampling rate:', framerate, ' Hz')


Sampling rate: 44100  Hz
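
The sampling rate together with the number of samples gives the clip duration. Since audio_5.wav is not available outside this notebook, here is a minimal sketch that writes a synthetic tone with the standard-library wave module (an assumption, the notebook itself uses scipy) and reads its sampling rate and duration back. The helper names write_tone and describe_wav are made up for this example.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone(path, sample_rate=44100, seconds=0.5, freq=440.0):
    """Write a mono 16-bit sine tone so there is a file to read back."""
    n_frames = int(sample_rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / sample_rate)))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


def describe_wav(path):
    """Return (sampling rate in Hz, number of frames, duration in seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        n_frames = wav.getnframes()
    return rate, n_frames, n_frames / rate


tmp_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_tone(tmp_path)
rate, n_frames, duration = describe_wav(tmp_path)
print("Sampling rate:", rate, "Hz, duration:", duration, "s")
```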


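### 4. Resample the audio to 16 kHz
The file is sampled at 44100 Hz, but the wav2vec 2.0 checkpoint we loaded was trained on 16000 Hz audio, so we let librosa resample it while loading. The result is still a raw waveform (a float array), not a spectrogram.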

In [10]:
input_audio, _ = librosa.load(audio_path, sr=16000)
print(input_audio)


[ 0.0000000e+00  0.0000000e+00  0.0000000e+00 ... -1.3119541e-04
 -4.4363615e-06  0.0000000e+00]
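
To get a feel for what going from 44100 Hz to 16000 Hz does to the sample count, here is a rough sketch of resampling by plain linear interpolation. This is not what librosa does internally (it uses a proper band-limited resampler), and resample_linear is a hypothetical helper with no anti-aliasing filter, so treat it as an illustration only.

```python
import math


def resample_linear(samples, orig_sr, target_sr):
    """Naively resample by linear interpolation (no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = math.ceil(len(samples) * target_sr / orig_sr)
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr            # position in the original signal
        left = min(int(pos), len(samples) - 1)   # sample just before pos
        right = min(left + 1, len(samples) - 1)  # sample just after pos
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of a 440 Hz tone at 44100 Hz becomes 16000 samples at 16000 Hz.
one_second = [math.sin(2 * math.pi * 440 * t / 44100) for t in range(44100)]
resampled = resample_linear(one_second, 44100, 16000)
print(len(one_second), "->", len(resampled))
```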


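### 5. Prepare the waveform for wav2vec 2.0
The tokenizer does not produce text tokens here: it normalizes the raw waveform and wraps it in a PyTorch tensor (input_values) that the model accepts.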

In [11]:

token = tokenizer(input_audio, return_tensors='pt')
print(token)
input_values = token['input_values']


2022-07-04 13:33:16.853410: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-07-04 13:33:16.856300: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-04 13:33:16.856316: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


{'input_values': tensor([[-9.4060e-05, -9.4060e-05, -9.4060e-05,  ..., -1.0590e-03,
         -1.2669e-04, -9.4060e-05]])}
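
Notice that the zeros in the raw waveform show up as -9.4060e-05 in input_values. That is consistent with the waveform being shifted to zero mean and scaled to unit variance. A minimal sketch of that normalization follows; the epsilon value and the helper name zero_mean_unit_var are assumptions for this example, not read from the library.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return [(s - mean) / math.sqrt(var + eps) for s in samples]


raw = [0.0, 0.0, 0.0, 0.2, -0.1, 0.05, 0.0]
normalized = zero_mean_unit_var(raw)

new_mean = sum(normalized) / len(normalized)
new_var = sum((s - new_mean) ** 2 for s in normalized) / len(normalized)
print("mean:", new_mean, "variance:", new_var)
# The silent samples are no longer exactly zero after the shift.
print("former zero:", normalized[0])
```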


### 6. Get the prediction
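- The model returns its prediction as logits: one raw, unnormalized score per vocabulary entry for every audio frame
- In math, the logit function maps a probability in (0, 1) to R ((-inf, inf))
- L = ln(P/(1-P)), where P is the probability of the class and 1-P is the probability of the other class
- That makes it the inverse of the sigmoid. With more than two classes, softmax plays the sigmoid's role and turns the logits into probabilities (it has no exact inverse, since adding a constant to every logit gives the same probabilities)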
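
A small sketch of how these functions relate, in plain Python. The four scores are copied from the first row of the printed logits tensor and are only a slice of the full vocabulary, so the resulting probabilities are for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Map a probability in (0, 1) to a real number: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


frame_scores = [13.7284, -25.7757, -25.5137, -6.2650]
probs = softmax(frame_scores)
print(probs)

# logit undoes sigmoid (up to floating-point error).
print(sigmoid(logit(0.8)))

# Shifting every score by a constant leaves softmax unchanged.
print(softmax([1.0, 2.0, 3.0]), softmax([101.0, 102.0, 103.0]))
```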

In [12]:
logits = model(input_values).logits
print(logits)

tensor([[[ 13.7284, -25.7757, -25.5137,  ...,  -6.2650,  -6.9997,  -7.8211],
         [ 13.7605, -25.7770, -25.5154,  ...,  -6.2344,  -6.9767,  -7.8130],
         [ 13.8425, -25.7324, -25.4717,  ...,  -6.2120,  -7.0354,  -7.7811],
         ...,
         [ 13.5866, -25.7665, -25.5632,  ...,  -6.2777,  -6.9433,  -7.5408],
         [ 13.7977, -26.1069, -25.8621,  ...,  -6.6500,  -7.7482,  -7.5035],
         [ 13.8326, -26.0904, -25.8520,  ...,  -6.6140,  -7.6000,  -7.5594]]],
       grad_fn=<ViewBackward0>)


### 7. But dude, what is argmax?
- argmax is the index of the maximum value in an array
- In this case, every time step (audio frame) has its own array of scores, one per vocabulary entry (as logits, not probabilities)
- So, the index of the maximum value in a frame is the index of the most probable token for that frame
- We take that for every frame and call the result the predicted indices
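
The same idea on a tiny made-up example: four frames with three scores each (the numbers are invented for illustration, the real tensor has one score per vocabulary entry). Taking the argmax of each row is what torch.argmax with dim=-1 does on the last axis.

```python
def argmax(values):
    """Return the index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# One row per audio frame, one column per token (toy numbers).
frames = [
    [13.7, -25.8, -6.3],
    [-2.0, 9.5, 1.0],
    [-2.0, 9.1, 0.5],
    [0.3, -1.0, 4.2],
]

predicted = [argmax(frame) for frame in frames]
print(predicted)  # [0, 1, 1, 2]
```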

In [13]:
predicted_indices = torch.argmax(logits, dim=-1)
print(predicted_indices)

tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
         10,  0,  0, 27, 17,  0,  0,  4,  0,  6,  0,  5, 15,  0,  0,  0, 15, 10,
          9,  0, 21,  4,  4, 22,  0,  8, 16,  0,  4,  4,  0, 24,  0, 13,  0,  0,
          0,  8,  0,  0,  0,  0,  0,  0,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  6, 11,  0, 10,  0, 12, 12,  0,  4,  0,  0,  0,
          0, 23,  0, 13,  0,  0,  8,  0,  0, 14,  0, 16,  0, 19,  0,  0,  6,  0,
          0,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,  6,  0,  4,  4, 18,
          0,  0, 10, 15,  0,  0,  0, 15,  0,  4,  4,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0, 19, 11, 11,  0,  0,  0,  0,  0,  0,  0,  0,
          7,  0,  9,  9,  9,  0, 21,  5,  5,  0,  4,  4, 22, 22,  0,  8, 16,  0,
          0,  4,  4,  4,  0,  0,  0,  0, 15,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0, 10,  0,  0,

### 8. Did you know that we could use the argmax to get the predicted text?
- The wav2vec 2.0 tokenizer has a function called batch_decode that takes the predicted indices and returns the predicted text
- It merges repeated indices, drops the blank token (index 0 here) and maps the remaining indices to characters
- The predicted indices contain only one array because we passed in only one audio clip (batch size 1), so we take element [0] of the result
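
Here is a rough sketch of that decoding step (greedy CTC decoding) in plain Python. The ids are the first non-trivial stretch of the predicted indices printed above. The VOCAB mapping is an assumed subset of the checkpoint's vocabulary, inferred from the ids and the final transcription, and ctc_greedy_decode is a hypothetical helper, not the tokenizer's actual implementation.

```python
from itertools import groupby

# Assumed subset of the vocabulary (id -> character); "|" marks a word boundary.
VOCAB = {0: "<pad>", 4: "|", 5: "E", 6: "T", 9: "N", 10: "I",
         15: "L", 17: "M", 21: "G", 27: "'"}
BLANK_ID = 0
WORD_DELIMITER = "|"


def ctc_greedy_decode(ids):
    """Collapse repeats, drop blanks, then map ids to characters."""
    collapsed = [key for key, _ in groupby(ids)]            # 10, 10 -> 10
    tokens = [VOCAB[i] for i in collapsed if i != BLANK_ID]  # remove blanks
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


ids = [0, 0, 10, 10, 0, 0, 27, 17, 0, 0, 4, 0, 6, 0, 5, 15,
       0, 0, 0, 15, 10, 9, 0, 21, 4, 4]
print(ctc_greedy_decode(ids))  # I'M TELLING
```

The blanks between the two 15s are what keep the double L in TELLING: without a blank in between, the repeated index would be merged into a single letter.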

In [14]:
transcription = tokenizer.batch_decode(predicted_indices)[0]  # batch of one clip, so take the first result
print(transcription)

I'M TELLING YOU BRO THIS PRODUCT IT WILL CHANGE YOU LIFE


### 9. OK, we save the model and tokenizer now

In [4]:
model.save_pretrained('./saved_model/model')
tokenizer.save_pretrained('./saved_model/tokenizer')

('./saved_model/tokenizer/tokenizer_config.json',
 './saved_model/tokenizer/special_tokens_map.json',
 './saved_model/tokenizer/vocab.json',
 './saved_model/tokenizer/added_tokens.json')

In [6]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("saved_model/tokenizer")
model = Wav2Vec2ForCTC.from_pretrained("saved_model/model")

