<a href="https://colab.research.google.com/github/MYTE21/ML.Deep-Learning/blob/main/notebooks/(2)%20Speech%20To%20Text%20(STT).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Speech to Text (STT)
*`Speech-to-text (STT)` is a technology that converts spoken words into written text. It is also known as automatic speech recognition (ASR).*

# Setup

In [1]:
# Jupyter Notebook Magic Command - Auto Reloading
%reload_ext autoreload
%autoreload 2

# Jupyter Notebook Magic Command - Inline Plotting
%matplotlib inline

In [2]:
# Ignore All Warnings
import warnings
warnings.filterwarnings("ignore")

# Necessary Imports

In [3]:
# Necessary installs
! pip install -q transformers


In [4]:
from IPython.display import Audio

import torch
import torchaudio

from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC

# Perform Speech-to-text (STT)

In [5]:
# Load Tokenizer and Model
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

*Both warnings above are harmless for this inference-only notebook. The first appears because the checkpoint was saved with `Wav2Vec2CTCTokenizer`, while the code loads it through `Wav2Vec2Tokenizer`, which is deprecated in favor of `Wav2Vec2Processor`. The second concerns `wav2vec2.masked_spec_embed`, which is only used for SpecAugment-style masking during training.*


In [6]:
audio_path = "/content/mary.wav"

In [7]:
waveform, sample_rate = torchaudio.load(audio_path)
waveform, sample_rate

(tensor([[ 0.0026,  0.0023,  0.0024,  ..., -0.0006, -0.0003, -0.0003]]), 16000)

In [8]:
# Tokenize: normalizes the raw waveform and returns it as a PyTorch tensor
input_values = tokenizer(waveform, return_tensors="pt").input_values
input_values

tensor([[[ 0.0442,  0.0392,  0.0406,  ..., -0.0119, -0.0068, -0.0060]]])
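
The values returned by the tokenizer differ from the raw waveform because the input appears to be standardized to zero mean and unit variance before it reaches the model. Below is a minimal pure-Python sketch of that kind of normalization. The function name `zero_mean_unit_var`, the epsilon value, and the toy samples are assumptions made for illustration; they are not taken from the library's source.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Standardize audio samples to zero mean and (approximately) unit variance.

    `eps` is an assumed small constant that guards against division by zero
    for silent (all-equal) input.
    """
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n  # population variance
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in samples]


# Hypothetical toy samples, not taken from mary.wav
raw = [0.5, -0.25, 0.1, -0.35]
normalized = zero_mean_unit_var(raw)
print(normalized)
```

Standardizing the input this way makes the model less sensitive to recording volume, since quiet and loud recordings end up on the same scale.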

In [9]:
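# Run inference without tracking gradients
with torch.no_grad():
    # squeeze(dim=0) drops the extra leading dimension: (1, 1, samples) -> (1, samples)
    logits = model(input_values.squeeze(dim=0)).logits

    # Greedy CTC decoding: take the most likely token per frame, then decode to text
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)[0]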

In [10]:
# Show Text
transcription

'MARY HAD A LITTLE LAMB'
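
The argmax-and-decode step above is a greedy CTC decode: pick the most likely token for every audio frame, merge consecutive repeats, drop the blank token, and turn word delimiters into spaces. The sketch below illustrates the idea in pure Python. The four-symbol vocabulary, the logits, and the helper names are hypothetical toy values chosen for illustration; the real checkpoint has its own vocabulary and its decoding is handled by `batch_decode`.

```python
def argmax(row):
    """Return the index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the remaining ids to text."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a symbol only when it changes and is not the blank token
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary: id 0 plays the role of the CTC blank
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I"}

# Hypothetical per-frame scores, one row per audio frame
toy_logits = [
    [0.1, 0.0, 2.0, 0.3],  # H
    [0.2, 0.1, 1.5, 0.4],  # H (repeat, merged)
    [3.0, 0.1, 0.2, 0.1],  # blank
    [0.1, 0.2, 0.3, 2.5],  # I
    [0.0, 0.1, 0.2, 1.9],  # I (repeat, merged)
    [0.1, 2.2, 0.0, 0.3],  # word delimiter
]

frame_ids = [argmax(row) for row in toy_logits]
text = ctc_greedy_decode(frame_ids, vocab)
print(text)  # HI
```

The blank token is what lets CTC produce genuine double letters, such as the `TT` in `LITTLE`: two identical ids separated by a blank are kept as two characters, while identical ids with no blank between them are merged into one.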

# Resources and Notes

## Transformers

*`Transformers` provides APIs and tools to easily download and train `state-of-the-art` pretrained models. Using pretrained models can reduce your compute costs and carbon footprint, and save you the time and resources required to train a model from scratch.*

**👻 Transformers**: [`📕 Documentation`](https://huggingface.co/docs/transformers/index)


### Necessary Links
**🐔 Torch**: [`⚙️ PyTorch`](https://pytorch.org/get-started/locally/)

**🔉 Torchaudio**: [`📘 Documentation`](https://pytorch.org/audio/stable/index.html)

## Speech-to-text Models
1. [`🤗 Hugging Face Models`](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending)

*In this notebook:*

Model used: [`Wav2Vec2-Large-960h-Lv60 + Self-Training`](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self)