# IndicWav2Vec Tutorial

## Quick Demo (using HuggingFace)

### Installation and Setup

Install Ubuntu/Debian Packages - 

In [34]:
! sudo apt-get install -y build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev

Note: `apt-get install` needs root privileges (drop `sudo` if you are already root). Without them it fails with `E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)`.


Install Python Packages - 
1. [PyTorch](https://pytorch.org/get-started/locally/)
2. [torchaudio](https://pytorch.org/get-started/locally/)
3. HuggingFace's [Transformers](https://huggingface.co/docs/transformers/installation)
4. HuggingFace's [Datasets](https://huggingface.co/docs/datasets/installation)
5. Kensho's [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode)
6. [KenLM](https://github.com/kpu/kenlm)'s Python bindings

For detailed instructions, follow the links above to the respective documentation pages.

In [1]:
! pip install torch torchaudio transformers datasets pyctcdecode
! pip install https://github.com/kpu/kenlm/archive/master.zip



Import Packages - 

In [1]:
# Import statements (for libraries: transformers, torchaudio and torch)
from transformers import AutoModelForCTC, Wav2Vec2Processor, Wav2Vec2ProcessorWithLM
import torchaudio
import torch
# Enable audio on jupyter notebooks
from IPython.display import Audio, display

# Optional (import datasets)
from datasets import load_dataset

### [Appendix] Helper Functions

In [2]:
def load_audio_from_file(file_path):
    waveform, sample_rate = torchaudio.load(file_path)
    num_channels, _ = waveform.shape
    if num_channels == 1:
        return waveform[0], sample_rate
    else:
        raise ValueError("Waveforms with more than one channel are not supported.")

#### Insight: Why HuggingFace?

### Data Preparation: Load Samples

Download Sample

In [3]:
! mkdir -p ../samples
! wget https://t3638486.p.clickup-attachments.com/t3638486/280ccfa7-bf22-4d3e-9c6d-de22e3c3c467/common_voice_hi_32806346.mp3 && mv common_voice_hi_32806346.mp3 ../samples/

--2022-07-27 02:41:00--  https://t3638486.p.clickup-attachments.com/t3638486/280ccfa7-bf22-4d3e-9c6d-de22e3c3c467/common_voice_hi_32806346.mp3
Resolving t3638486.p.clickup-attachments.com (t3638486.p.clickup-attachments.com)... 13.35.191.27, 13.35.191.18, 13.35.191.100, ...
Connecting to t3638486.p.clickup-attachments.com (t3638486.p.clickup-attachments.com)|13.35.191.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23373 (23K) [application/octet-stream]
Saving to: ‘common_voice_hi_32806346.mp3’


2022-07-27 02:41:00 (145 MB/s) - ‘common_voice_hi_32806346.mp3’ saved [23373/23373]



Load the Sample in PyTorch

In [24]:
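# SAMPLE_AUDIO_PATH = "../samples/common_voice_hi_32806346.mp3"  # the sample downloaded above
# The file below is a local file that this notebook does not download; point the path at any audio file you have.
SAMPLE_AUDIO_PATH = "../samples/blindtest_300139.wav"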
TARGET_SAMPLE_RATE = 16000

waveform, sample_rate = load_audio_from_file(SAMPLE_AUDIO_PATH)

# Optionally, load a sample from the Common Voice dataset instead:
# sample = next(iter(load_dataset("common_voice", "hi", split="test")))
# sample = next(iter(load_dataset("common_voice", "hi", split="test", streaming=True)))  # use this instead if the download is slow
# waveform, sample_rate = torch.tensor(sample["audio"]["array"]), sample["audio"]["sampling_rate"]

Resample Audio to 16 kHz

In [25]:
resampled_audio = torchaudio.functional.resample(waveform, sample_rate, TARGET_SAMPLE_RATE)

#### Listen to Sample

In [26]:
display(Audio(resampled_audio.numpy(), rate=TARGET_SAMPLE_RATE))

### Run Inference

Load Models from HuggingFace Hub

In [12]:
# Specify the Hugging Face model ID, or a local path to a model directory in HuggingFace format (a local path is used here)
MODEL_ID = "/home/speech/fq2hf/indicw2v/indicwav2vec_v1_hindi"

# Select the device to run the model on
DEVICE_ID = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model
model = AutoModelForCTC.from_pretrained(MODEL_ID).to(DEVICE_ID)

# Load Processor without language model
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)

# Load Processor with language model
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained(MODEL_ID)

Process Audio Data and Run Forward Pass to obtain Logits

In [27]:
# Process audio data
input_tensor = processor(resampled_audio, return_tensors="pt", sampling_rate=TARGET_SAMPLE_RATE).input_values
# input_tensor = processor_with_lm(resampled_audio, return_tensors="pt", sampling_rate=TARGET_SAMPLE_RATE).input_values # same as above
# print(input_tensor)

# Run forward pass
with torch.no_grad():
    logits = model(input_tensor.to(DEVICE_ID)).logits.cpu()
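
The `logits` tensor has one row of vocabulary scores per output frame, and the number of frames follows from the input length. The sketch below estimates it, assuming the checkpoint uses the default wav2vec 2.0 convolutional feature encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2). The constants and the helper name `estimate_num_frames` are illustrative assumptions and are not read from the IndicWav2Vec checkpoint.

```python
# Assumed defaults of the wav2vec 2.0 feature encoder (not read from the checkpoint).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def estimate_num_frames(num_samples: int) -> int:
    """Estimate how many time steps the CTC head emits for `num_samples` of audio."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio maps to 49 frames, i.e. roughly 20 ms per frame.
print(estimate_num_frames(16000))  # 49
```

With the loaded model, the actual values should be available as `model.config.conv_kernel` and `model.config.conv_stride`, and the estimate can be compared against `logits.shape[1]`.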

Decode Logits without LM

In [28]:
prediction_ids = torch.argmax(logits, dim=-1)
output_str = processor.batch_decode(prediction_ids)[0]
print(f"Greedy Decoding: {output_str}")

Greedy Decoding: हवा और जमीन का ध्यान नहीं रखेंगे तो कुछ वर्सों में सभी मौसम बदल जाेंगे
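
Greedy decoding takes the highest-scoring token per frame and then applies the CTC collapse rule: merge consecutive repeats, then drop blanks. The following is a minimal sketch of that rule on a toy vocabulary. The vocabulary, the blank ID and the word delimiter are assumptions for illustration; the real values come from the tokenizer bundled with the processor.

```python
from itertools import groupby

# Toy vocabulary for illustration only; the real one ships with the processor.
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "i"}
BLANK_ID = 0          # assumed CTC blank (the pad token in wav2vec 2.0 tokenizers)
WORD_DELIMITER = "|"  # assumed word-boundary token


def ctc_greedy_collapse(frame_ids):
    """Collapse per-frame argmax IDs into text: merge repeats, then drop blanks."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [VOCAB[token_id] for token_id in merged if token_id != BLANK_ID]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Frames: h h <pad> i i <pad> | h i
print(ctc_greedy_collapse([2, 2, 0, 3, 3, 0, 1, 2, 3]))  # hi hi
```

A blank between two identical tokens keeps them separate, which is how CTC represents doubled letters: `[2, 0, 2]` collapses to `hh`, while `[2, 2]` collapses to `h`.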


Decode Logits with LM

In [29]:
output_str = processor_with_lm.batch_decode(logits.numpy())
print(f"LM Decoding: {output_str}")

LM Decoding: Wav2Vec2DecoderWithLMOutput(text=['हवा और जमीन का ध्यान नहीं रखेंगे तो कुछ वर्सों में सभी मौसम बदल जाेंगे'], logit_score=[-1.5213504357321428], lm_score=[-25.486455346159293], word_offsets=None)


## Training ASR Model

### Installation and Setup

#### Insight: End to End ASR Training/Inference Pipeline.

### Data Preparation: Manifest Creation

### Config Setup: What to Change and What Not to Change

### Start Training

#### Insight: Metrics for Evaluation (WER, CER)
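
Word error rate (WER) and character error rate (CER) are both edit-distance ratios: the number of substitutions, insertions and deletions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below is a plain-Python illustration of these definitions and is not the evaluation script used by the training pipeline. It assumes a non-empty reference and counts characters as Unicode code points, so a Devanagari consonant and its vowel sign count as two characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            # Deletion, insertion, or substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("the cat sat", "the cat sit"))  # 1 wrong word out of 3
print(cer("the cat sat", "the cat sit"))  # 1 wrong character out of 11
```

Because one wrong character makes the whole word wrong, WER is usually the harsher of the two metrics on the same output.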

### Batch Inference

## Improving Performance using Language Model

### Installation and Setup

Prerequisite
- Fairseq already installed

Install Debian/Ubuntu Packages
- Install Linux Dependencies
- Build KenLM 
- Build Flashlight

In [None]:
! sudo apt-get install -y build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
! rm -rf kenlm && git clone https://github.com/kpu/kenlm.git && cd kenlm && mkdir -p build && cd build && cmake .. && make -j 16
# Each `!` line runs in its own subshell, so KENLM_ROOT is exported on the same line that needs it.
# The checkout must also run inside the cloned flashlight directory.
! export KENLM_ROOT=$PWD/kenlm && export USE_MKL=0 && rm -rf flashlight && git clone https://github.com/flashlight/flashlight.git && cd flashlight && git checkout 06ddb51857ab1780d793c52948a0759f0ccc6ddb && cd bindings/python && python setup.py install

Install Python Packages

In [None]:
! pip install pyctcdecode pandas matplotlib indic-nlp-library tqdm regex
! pip install https://github.com/kpu/kenlm/archive/master.zip
! pip install git+https://github.com/sutariyaraj/indic-num2words

#### Insight: Greedy vs Beam Search Decoding
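
Greedy decoding keeps only the single most likely alignment, whereas the most likely transcript is the one whose alignments sum to the highest probability. Beam search approximates that sum by tracking several candidates at once. The toy example below uses made-up probabilities and an exhaustive search (not a beam search, and feasible only for tiny inputs) to show a case where the two disagree; all names in it are hypothetical.

```python
from itertools import groupby, product

BLANK = "_"  # hypothetical blank symbol for this toy example
# Per-frame probabilities for two frames (made-up numbers).
FRAMES = [{"_": 0.6, "a": 0.4}, {"_": 0.6, "a": 0.4}]


def collapse(path):
    """Apply the CTC rule: merge repeated symbols, then remove blanks."""
    return "".join(symbol for symbol, _ in groupby(path) if symbol != BLANK)


def greedy_decode(frames):
    """Pick the most likely symbol in each frame independently."""
    return collapse([max(frame, key=frame.get) for frame in frames])


def exhaustive_decode(frames):
    """Sum the probability of every alignment per output string."""
    totals = {}
    for path in product(*(frame.keys() for frame in frames)):
        prob = 1.0
        for frame, symbol in zip(frames, path):
            prob *= frame[symbol]
        text = collapse(path)
        totals[text] = totals.get(text, 0.0) + prob
    return max(totals, key=totals.get), totals


best, totals = exhaustive_decode(FRAMES)
print(repr(greedy_decode(FRAMES)))  # '' (blank wins in both frames)
print(repr(best))                   # 'a' (three alignments sum to about 0.64)
```

Here the empty transcript has probability 0.36 from a single alignment, while `a` collects about 0.64 across the alignments `a_`, `_a` and `aa`. A language model adds a further score on top of this, which is what the LM-based decoder contributes.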

### Dataset Preparation: Clean Text Corpus and Create Lexicon

In [None]:
! python prepare_data.py hi -d "/home/speech/abhigyan/IndicWav2Vec/KENLM/datasets/indic-corp-v1" \
    --data_type "C" --drop_rows strict --dict_dir "/home/speech/abhigyan/IndicWav2Vec/KENLM/datasets/superb_dicts" \
    --out_dir "/home/speech/abhigyan/IndicWav2Vec/KENLM/models"
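
The internals of `prepare_data.py` are not shown here. As a rough illustration of what a lexicon for a character-based CTC model can look like, the sketch below builds entries in the letter-based format commonly used with Flashlight-style decoders: each unique word, a tab, its characters separated by spaces, and a trailing `|` word-boundary token. The function name and the exact format are assumptions and may differ from what the script produces.

```python
def build_lexicon(sentences):
    """Map each unique word to its space-separated characters plus a word-boundary token."""
    words = sorted({word for sentence in sentences for word in sentence.split()})
    return [f"{word}\t{' '.join(word)} |" for word in words]


# Hypothetical cleaned corpus lines.
corpus = ["हवा और जमीन", "और मौसम"]
for entry in build_lexicon(corpus):
    print(entry)
```

Splitting with `' '.join(word)` works on Unicode code points, so vowel signs such as `ा` become separate lexicon symbols. This matches a character-level vocabulary only if the acoustic model's vocabulary is also defined at the code-point level.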

### Start Training

In [None]:
# `$lang` and `$topk` are expanded by IPython from the Python variables `lang` and `topk`; define both in an earlier cell before running this.
! python train_kenlm.py $lang --lm_base_dirpath "/home/speech/abhigyan/IndicWav2Vec/KENLM/models" \
    --lm_dirname "lm_v1" --topk $topk --kenlm_bins "/home/speech/abhigyan/IndicWav2Vec/KENLM/kenlm/build/bin" \
    --arpa_order 6 --max_arpa_memory "90%" --arpa_prune "0|0|0|0|1|2" --intermediate_dir "" --clean_build False \
    --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

### Batch Inference

## Deploying Models

### Export Models to HuggingFace Format

### Deploy Model on HuggingFace Spaces