# IndicWav2Vec Tutorial

## 1. Quick Demo (using HuggingFace)

### Installation and Setup

Install Ubuntu/Debian packages. This command needs root privileges; if the notebook is not running as root, run it from a terminal with `sudo` instead:

In [34]:
! apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev


Install Python packages:
1. [PyTorch](https://pytorch.org/get-started/locally/)
2. [torchaudio](https://pytorch.org/get-started/locally/)
3. HuggingFace's [Transformers](https://huggingface.co/docs/transformers/installation)
4. HuggingFace's [Datasets](https://huggingface.co/docs/datasets/installation)
5. Kensho's [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode)
6. [KenLM](https://github.com/kpu/kenlm)'s Python bindings

For detailed instructions, follow the links above to the respective documentation pages.

In [1]:
! pip install torch torchaudio transformers datasets pyctcdecode
! pip install https://github.com/kpu/kenlm/archive/master.zip
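
After installation, it can help to confirm that every package is importable before moving on. The snippet below is a minimal sketch using only the standard library; the helper name `find_missing_modules` and the module list are illustrative assumptions based on the packages listed above, not part of any of these libraries.

```python
import importlib.util

# Top-level module names for the packages installed above (assumed import names).
REQUIRED_MODULES = ["torch", "torchaudio", "transformers", "datasets", "pyctcdecode", "kenlm"]


def find_missing_modules(module_names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, re-run the corresponding `pip install` command before continuing.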



Import packages:

In [21]:
# Import statements (for libraries: transformers, torchaudio and torch)
from transformers import AutoModelForCTC, AutoProcessor
import torchaudio
import torch
# Enable audio on jupyter notebooks
from IPython.display import Audio, display

# Optional (import datasets)
from datasets import load_dataset

### [Appendix A] Helper Functions

In [22]:
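def load_audio_from_file(file_path):
    """Load a mono audio file and return (1-D waveform tensor, sample rate)."""
    # torchaudio.load returns a tensor of shape (num_channels, num_frames)
    waveform, sample_rate = torchaudio.load(file_path)
    num_channels, _ = waveform.shape
    if num_channels != 1:
        raise ValueError("Waveforms with more than one channel are not supported.")
    return waveform[0], sample_rate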

#### Insight: Why HuggingFace?

### Data Preparation: Load Samples

Download Sample

In [23]:
! mkdir -p ../samples
! wget https://t3638486.p.clickup-attachments.com/t3638486/280ccfa7-bf22-4d3e-9c6d-de22e3c3c467/common_voice_hi_32806346.mp3 && mv common_voice_hi_32806346.mp3 ../samples/

--2022-07-27 01:05:32--  https://t3638486.p.clickup-attachments.com/t3638486/280ccfa7-bf22-4d3e-9c6d-de22e3c3c467/common_voice_hi_32806346.mp3
Resolving t3638486.p.clickup-attachments.com (t3638486.p.clickup-attachments.com)... 13.35.191.18, 13.35.191.100, 13.35.191.83, ...
Connecting to t3638486.p.clickup-attachments.com (t3638486.p.clickup-attachments.com)|13.35.191.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23373 (23K) [application/octet-stream]
Saving to: ‘common_voice_hi_32806346.mp3’


2022-07-27 01:05:32 (99.8 MB/s) - ‘common_voice_hi_32806346.mp3’ saved [23373/23373]



Load the Sample in PyTorch

In [24]:
SAMPLE_AUDIO_PATH = "../samples/common_voice_hi_32806346.mp3"
TARGET_SAMPLE_RATE = 16000

waveform, sample_rate = load_audio_from_file(SAMPLE_AUDIO_PATH)

# Optionally, load a sample from the Common Voice dataset instead of a local file:
# sample = next(iter(load_dataset("common_voice", "hi", split="test")))
# waveform, sample_rate = torch.tensor(sample["audio"]["array"]), sample["audio"]["sampling_rate"]

Resample audio to 16 kHz

In [25]:
resampled_audio = torchaudio.functional.resample(waveform, sample_rate, TARGET_SAMPLE_RATE)
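
Resampling changes the number of samples but not the duration of the clip. The sketch below illustrates that relationship with plain arithmetic. The helper names and the 48 kHz example clip are hypothetical, and the rounding-up rule is an assumption about how the resampler sizes its output, so treat the result as an estimate rather than a guarantee.

```python
def resampled_length(num_samples: int, orig_rate: int, target_rate: int) -> int:
    """Estimate the sample count after resampling, rounding up (assumed rule)."""
    return (num_samples * target_rate + orig_rate - 1) // orig_rate


def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a clip in seconds."""
    return num_samples / sample_rate


# Hypothetical 3-second clip recorded at 48 kHz, resampled to 16 kHz.
orig_samples = 3 * 48000
new_samples = resampled_length(orig_samples, 48000, 16000)
print(new_samples)  # 48000
print(duration_seconds(orig_samples, 48000), duration_seconds(new_samples, 16000))  # 3.0 3.0
```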

#### Listen to the Sample

In [26]:
display(Audio(resampled_audio.numpy(), rate=TARGET_SAMPLE_RATE))

### Run Inference

Load the Model and Processor from the HuggingFace Hub or a Local Directory

In [27]:
# Specify the Hugging Face model ID, or a local path to a model saved in Hugging Face format
MODEL_ID = "/home/speech/fq2hf/indicw2v/indicwav2vec_v1_hindi"

# Select the device on which to place the model
DEVICE_ID = "cuda" if torch.cuda.is_available() else "cpu"
# DEVICE_ID = "cpu"
model = AutoModelForCTC.from_pretrained(MODEL_ID).to(DEVICE_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

Process Audio Data and Run Forward Pass to obtain Logits

In [30]:
input_tensor = processor(resampled_audio, return_tensors="pt", sampling_rate=TARGET_SAMPLE_RATE).input_values

# Run forward pass
with torch.no_grad():
    logits = model(input_tensor.to(DEVICE_ID)).logits
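
The logits have shape `(batch, frames, vocabulary)`, where the number of frames depends on the length of the input audio. The sketch below estimates the frame count, assuming the model uses the standard wav2vec 2.0 convolutional feature encoder (kernel sizes and strides as in the default `Wav2Vec2Config`). This is an assumption; the checkpoint's own config is authoritative. The function name is illustrative.

```python
# Assumed feature-encoder layout (default wav2vec 2.0 values); check the
# model config if the checkpoint was trained with different settings.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio.
print(num_output_frames(16000))  # 49
```

Under these assumptions the total stride is 320 samples, so each frame covers roughly 20 ms of audio at 16 kHz.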

Decode Logits to obtain Final Predictions

In [32]:
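# Assumes the processor bundles a language-model decoder (Wav2Vec2ProcessorWithLM),
# whose batch_decode accepts raw logits and returns an object with a `.text` field
output_str = processor.batch_decode(logits.cpu().numpy()).text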
print(f"Prediction: {output_str}")

Prediction: ['चुडाक पकड़ना मुश्किली नहीं नामिल थे']
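
For intuition about what decoding does with the logits, here is a minimal sketch of greedy (best-path) CTC decoding: pick the highest-scoring token per frame, collapse consecutive repeats, and drop the blank token. This is not the decoder used in the cell above. The toy vocabulary, the scores, and the blank index of 0 are all hypothetical; a real model defines its own vocabulary and blank token.

```python
from typing import Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      vocab: Sequence[str],
                      blank_id: int = 0) -> str:
    """Best-path CTC decoding over per-frame scores."""
    # 1. Take the arg-max token for every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Collapse consecutive repeats, then 3. remove blanks.
    decoded = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)


toy_vocab = ["<blank>", "a", "b"]  # hypothetical vocabulary
toy_scores = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.70, 0.10],  # a (kept: separated from the previous "a" by a blank)
    [0.10, 0.20, 0.70],  # b
]
print(ctc_greedy_decode(toy_scores, toy_vocab))  # aab
```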


## Training ASR Model

### Installation and Setup

#### Insight: End to End ASR Training/Inference Pipeline.

### Data Preparation: Manifest Creation

### Config Setup: What to Change and What Not to Change?

### Start Training

#### Insight: Metrics for Evaluation (WER, CER)
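
Word error rate (WER) and character error rate (CER) are both edit distances normalized by the length of the reference: WER counts word-level substitutions, insertions, and deletions, while CER counts them at the character level. The following is a minimal sketch with illustrative function names and a made-up example sentence. It assumes non-empty references and applies no text normalization, which a real evaluation would likely need.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous_row[j - 1] + (ref_item != hyp_item)
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            current_row.append(min(substitution, insertion, deletion))
        previous_row = current_row
    return previous_row[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes the reference contains at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes the reference is non-empty."""
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of three; one substituted character out of eleven.
print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```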

### Batch Inference

## Improving Performance using Language Model

### Installation and Setup

#### Insight: Greedy vs Beam Search Decoding

### Dataset Preparation: Clean Text Corpus and Lexicon

### Start Training

### Batch Inference

## Deploying Models

### Export models to HuggingFace Format

### Deploy model on HuggingFace Spaces