In this notebook we will see how to convert speech into text using Facebook's Wav2Vec 2.0 model. Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. The model was trained using connectionist temporal classification (CTC), so its output has to be decoded using Wav2Vec2Tokenizer. To learn more, see the [documentation](https://huggingface.co/transformers/model_doc/wav2vec2.html).

In [1]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting safetensors>=0.3.1
  Downloading safetensors-0.4.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (436 kB)
Collecting huggingface-hub<1.0,>=0.14.1
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting packaging>=20.0
  Downloading packaging-24.0-py3-none-any.whl (53 kB)
Installing collected packages: packaging, tokenizers, safetensors, huggingface-hub, transformers
  Attempting uninstall: packaging
    Found

In [2]:
import transformers
print(transformers.__version__)

4.30.2


If the version shown is lower than 4.3.0, upgrade the package.
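
The same check can be done programmatically. The following is a minimal sketch rather than part of the original notebook: the helper names `version_tuple` and `is_at_least` are hypothetical, and the parser only looks at the leading numeric components of the version string. Comparing integers instead of raw strings matters because, as strings, `"4.10.0"` sorts before `"4.3.0"`.

```python
import re

def version_tuple(version: str) -> tuple:
    """Return the leading numeric components of a version string as integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as "dev0"
        parts.append(int(match.group()))
    # Pad so that "4.3" and "4.3.0" compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)

def is_at_least(installed: str, minimum: str = "4.3.0") -> bool:
    """True if the installed version is greater than or equal to the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

print(is_at_least("4.30.2"))  # True
```

In the notebook this could be called as `is_at_least(transformers.__version__)`.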

In [3]:
!pip install --upgrade torch

Collecting torch
  Downloading torch-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (887.5 MB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
Collecting nvidia-cublas-cu11==11.10.3.66
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
Collecting nvidia-cudnn-cu11==8.5.0.96
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
Installing collected packages: nvidia-cublas-cu11, nvidia-

### Import Libraries

In [4]:
import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import numpy as np

### Load pre-trained Wav2Vec model

In [5]:
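# Load the pre-trained model and tokenizer.
# Note: Wav2Vec2Tokenizer is deprecated in recent transformers releases in favor of Wav2Vec2Processor.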
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Downloading vocab.json:   0%|          | 0.00/291 [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/163 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/1.60k [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.


Downloading pytorch_model.bin:   0%|          | 0.00/378M [00:00<?, ?B/s]

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Load Audio file

In [6]:
# Load the audio file and resample it to 16 kHz, the sampling rate the model was trained on
audio, sampling_rate = librosa.load("/kaggle/input/audio-data/audio1.m4a", sr=16000)



In [7]:
audio, sampling_rate

(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), 16000)

### Play the Audio

In [8]:
# Play the audio file inline
display.Audio("/kaggle/input/audio-data/audio1.m4a", autoplay=True)

### Speech to Text

First, tokenize the audio to obtain the model's input values. Then take the highest-scoring prediction from the logits at each time step, and finally decode those predictions to extract the text.

In [9]:
input_values = tokenizer(audio, return_tensors="pt").input_values
input_values

tensor([[-0.0002, -0.0002, -0.0002,  ..., -0.0002, -0.0002, -0.0002]])
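
In the printed tensor the silent samples appear as -0.0002 rather than 0, which is consistent with the tokenizer rescaling the waveform to zero mean and unit variance before it reaches the model. Below is a minimal NumPy sketch of that kind of normalization. The function name `normalize_waveform`, the epsilon value, and the synthetic signal are assumptions for illustration, not the library's exact code.

```python
import numpy as np

def normalize_waveform(waveform: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Rescale a 1-D waveform to zero mean and (approximately) unit variance."""
    waveform = np.asarray(waveform, dtype=np.float32)
    # eps (an assumed value) avoids division by zero for an all-silent clip.
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)

# Synthetic stand-in for real audio: one second of a 440 Hz tone at 16 kHz,
# padded with half a second of silence on each side.
sr = 16000
t = np.arange(sr) / sr
tone = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
silence = np.zeros(sr // 2, dtype=np.float32)
signal = np.concatenate([silence, tone, silence])

normalized = normalize_waveform(signal)
print(normalized.mean(), normalized.std())
```

The printed mean should be close to 0 and the standard deviation close to 1.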

In [10]:
# Store the logits (unnormalized scores for every time frame and vocabulary token)
logits = model(input_values).logits
logits

tensor([[[ 14.7457, -27.0516, -26.7098,  ...,  -5.9108,  -6.2960,  -7.6940],
         [ 14.8289, -27.1641, -26.8184,  ...,  -5.9686,  -6.3497,  -7.7359],
         [ 14.8723, -27.1996, -26.8532,  ...,  -5.9437,  -6.3394,  -7.7483],
         ...,
         [ 14.6778, -27.3817, -27.0478,  ...,  -6.5107,  -7.2095,  -7.6820],
         [ 14.6694, -27.5049, -27.1693,  ...,  -6.5099,  -7.2421,  -7.3955],
         [ 14.5684, -27.3072, -26.9728,  ...,  -6.5685,  -7.2532,  -7.7025]]],
       grad_fn=<ViewBackward0>)
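
The second dimension of `logits` is the number of time frames, which is much smaller than the number of audio samples because the convolutional feature encoder downsamples the waveform. As a rough sketch, assuming the kernel sizes and strides listed in the `conv_kernel` and `conv_stride` fields of the base model configuration, the frame count can be estimated as follows. `num_frames` is a hypothetical helper name.

```python
# Assumed values, taken from the conv_kernel / conv_stride fields of the base model config.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_frames(num_samples: int) -> int:
    """Estimate how many output frames the feature encoder produces."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length

print(num_frames(16000))  # 49 frames for one second of 16 kHz audio
```

The strides multiply to 320 samples, so under these assumptions each frame covers roughly 20 ms of audio.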

In [11]:
# Store the predicted token ids.
# Taking the argmax over the vocabulary dimension is enough; softmax is not needed to find the most likely token.
predicted_ids = torch.argmax(logits, dim=-1)
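
Softmax is strictly increasing, so the largest logit in each frame is also the largest probability. The small NumPy check below illustrates this with made-up logits (not model output); the `softmax` helper is written here for the example.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Toy logits with shape (batch=1, frames=3, vocab=3).
logits = np.array([[[14.7, -27.1, -5.9],
                    [-3.2, 6.4, 1.0],
                    [0.5, -1.5, 2.5]]])
probs = softmax(logits)

ids_from_logits = logits.argmax(axis=-1)
ids_from_probs = probs.argmax(axis=-1)
print(ids_from_logits)  # [[0 1 2]]
print(np.array_equal(ids_from_logits, ids_from_probs))  # True
```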

In [12]:
# Pass the predicted ids to the tokenizer's decode method to get the transcription
transcriptions = tokenizer.decode(predicted_ids[0])

In [13]:
transcriptions

'HER LOGAGE HOW ARE YOU HOW ARE YOU DOING BRO I AM GOING FINE HARTOARTOU HERLORD'
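
For a CTC model, decoding the per-frame ids means merging consecutive repeats and then dropping the blank token. The sketch below illustrates the idea on a toy vocabulary. The vocabulary, the blank id of 0, the `|` word separator, and the function name `ctc_greedy_decode` are assumptions for illustration, not the actual tokenizer's implementation.

```python
from itertools import groupby

# Toy vocabulary (assumed for this example only).
VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
BLANK_ID = 0
WORD_DELIMITER = "|"

def ctc_greedy_decode(frame_ids):
    """Collapse repeated ids, drop blanks, and map the rest to text."""
    # Step 1: merge runs of identical consecutive ids.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token.
    tokens = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    # Step 3: turn word delimiters into spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()

# The blank between the two runs of "L" is what preserves the double letter.
frame_ids = [0, 2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 0, 1, 0]
print(ctc_greedy_decode(frame_ids))  # HELLO
```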