# IndicWav2Vec Workshop-2022

## Table of Contents:
1. 🔥 Demo of IndicWav2Vec-Hindi ASR Model using HuggingFace
2. 🕒 Fine-Tuning ASR Model from Checkpoint
3. 🤖 Improving Performance of ASR System using Language Model
4. 🚀 Deploying Models using Gradio


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb)

## 1. Quick Demo (using HuggingFace)

### i. Installation and Setup

Install Ubuntu/Debian Packages - 

In [1]:
! apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
! add-apt-repository ppa:savoury1/ffmpeg4 -y && apt-get update && apt-get install ffmpeg # Upgrade ffmpeg version only if unable to load mp3 file

Reading package lists... Done
Building dependency tree       
Reading state information... Done
build-essential is already the newest version (12.4ubuntu1).
libboost-all-dev is already the newest version (1.65.1.0ubuntu1).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
libbz2-dev is already the newest version (1.0.6-8.1ubuntu0.2).
libbz2-dev set to manually installed.
liblzma-dev is already the newest version (5.2.2-1.3ubuntu0.1).
liblzma-dev set to manually installed.
zlib1g-dev is already the newest version (1:1.2.11.dfsg-0ubuntu2.1).
zlib1g-dev set to manually installed.
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Get:3 https://developer.download.nvidia.co

Install Python Packages - 
1. [PyTorch](https://pytorch.org/get-started/locally/)
2. [torchaudio](https://pytorch.org/get-started/locally/)
3. HuggingFace's [Transformers](https://huggingface.co/docs/transformers/installation)
4. HuggingFace's [Datasets](https://huggingface.co/docs/datasets/installation)
5. Kensho's [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode)
6. [KenLM](https://github.com/kpu/kenlm)'s Python bindings
7. [Gradio](https://gradio.app/getting_started/)

For detailed instructions, follow the links above to the respective documentation pages.

In [2]:
! pip install transformers datasets pyctcdecode soundfile gradio;
! pip install https://github.com/kpu/kenlm/archive/master.zip;

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.0-py3-none-any.whl (4.7 MB)
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting pyctcdecode
  Downloading pyctcdecode-0.4.0-py2.py3-none-any.whl (45 kB)
Collecting gradio
  Downloading gradio-3.1.1-py3-none-any.whl (5.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)

Import Packages - 

In [3]:
from transformers import AutoModelForCTC, Wav2Vec2Processor, Wav2Vec2ProcessorWithLM, pipeline
import torchaudio
import torch
from datasets import load_dataset

from IPython.display import Audio, display
import sys
import gradio as gr

### ii. [Appendix] Helper Functions

Load Audio from File

In [4]:
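def load_audio_from_file(file_path):
    """Load a mono audio file and return (1-D waveform tensor, sample rate)."""
    # torchaudio.load returns a tensor of shape (num_channels, num_frames)
    waveform, sample_rate = torchaudio.load(file_path)
    num_channels, _ = waveform.shape
    if num_channels == 1:
        return waveform[0], sample_rate
    else:
        raise ValueError("Waveforms with more than one channel are not supported.")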
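
The helper above rejects multi-channel audio. One common workaround, shown here only as a minimal sketch and not as part of the workshop code, is to average the channels into a single mono channel before passing the audio on. The function name `downmix_to_mono` is hypothetical, and plain Python lists stand in for tensors so the idea is visible without any dependencies.

```python
from typing import List


def downmix_to_mono(channels: List[List[float]]) -> List[float]:
    """Average equally long channels sample by sample into one mono channel."""
    if not channels:
        raise ValueError("Expected at least one channel.")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("All channels must have the same number of samples.")
    # zip(*channels) yields one tuple per time step, holding that step's
    # sample from every channel.
    return [sum(samples) / len(channels) for samples in zip(*channels)]


stereo = [[0.5, 1.0, -1.0], [0.5, 0.0, 1.0]]
mono = downmix_to_mono(stereo)
print(mono)
```

With a `torchaudio` waveform of shape `(num_channels, num_frames)`, the equivalent operation would be `waveform.mean(dim=0)`.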

#### Insight: Why HuggingFace?

### iii. Data Preparation: Load Samples

Define Audio Sample

In [5]:
%cd /content/IndicWav2Vec

TARGET_SAMPLE_RATE = 16000
SAMPLE_AUDIO_PATH = "workshop-2022/samples/blindtest_300139.wav"

# Optionally, iterate over the Hindi test split of Common Voice instead of using the local sample file
cv_dataset_iter = iter(load_dataset("common_voice", "hi", split="test")) # add streaming=True for slow network

[Errno 2] No such file or directory: '/content/IndicWav2Vec'
/content


Downloading builder script:   0%|          | 0.00/5.21k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/11.5k [00:00<?, ?B/s]

Downloading and preparing dataset common_voice/hi (download: 20.43 MiB, generated: 35.57 MiB, post-processed: Unknown size, total: 56.00 MiB) to /root/.cache/huggingface/datasets/common_voice/hi/6.1.0/a1dc74461f6c839bfe1e8cf1262fd4cf24297e3fbd4087a711bd090779023a5e...


Downloading data:   0%|          | 0.00/21.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/157 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/127 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/135 [00:00<?, ? examples/s]

Generating other split:   0%|          | 0/139 [00:00<?, ? examples/s]

Generating validated split:   0%|          | 0/419 [00:00<?, ? examples/s]

Generating invalidated split:   0%|          | 0/60 [00:00<?, ? examples/s]

Dataset common_voice downloaded and prepared to /root/.cache/huggingface/datasets/common_voice/hi/6.1.0/a1dc74461f6c839bfe1e8cf1262fd4cf24297e3fbd4087a711bd090779023a5e. Subsequent calls will reuse this data.


Load Sample and Resample Audio to 16 kHz

In [6]:
#Load from file
# waveform, sample_rate = load_audio_from_file(SAMPLE_AUDIO_PATH)

#Optionally, Load from common-voice iterator
sample = next(cv_dataset_iter)
waveform, sample_rate = torch.tensor(sample["audio"]["array"]), sample["audio"]["sampling_rate"]

#Resample
resampled_audio = torchaudio.functional.resample(waveform, sample_rate, TARGET_SAMPLE_RATE)

#### Visualize Sample

In [7]:
display(Audio(resampled_audio.numpy(), rate=TARGET_SAMPLE_RATE))

### iv. Run Inference

Define Global Variables

In [8]:
# Specify the Hugging Face Model Id 
MODEL_ID = "ai4bharat/indicwav2vec-hindi"

# Specify the Device Id on where to put the model
DEVICE_ID = "cuda" if torch.cuda.is_available() else "cpu"

# Specify Decoder Type:
DECODER_TYPE = "greedy" # Choose "LM" decoding or "greedy" decoding

Load Model and Processor from Huggingface Hub

In [9]:
# Load Model
model_instance = AutoModelForCTC.from_pretrained(MODEL_ID).to(DEVICE_ID)

if DECODER_TYPE == "greedy":
    # Load Processor without language model
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
else:
    # Load Processor with language model
    processor = Wav2Vec2ProcessorWithLM.from_pretrained(MODEL_ID)

Downloading config.json:   0%|          | 0.00/2.05k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.18G [00:00<?, ?B/s]

Downloading preprocessor_config.json:   0%|          | 0.00/260 [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/257 [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/741 [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Process Audio Data and Run Forward Pass to obtain Logits

In [10]:
# Process audio data
input_tensor = processor(resampled_audio, return_tensors="pt", sampling_rate=TARGET_SAMPLE_RATE).input_values

# Run forward pass
with torch.no_grad():
    logits = model_instance(input_tensor.to(DEVICE_ID)).logits.cpu()

Decode Logits

In [11]:
if DECODER_TYPE == "greedy":
    prediction_ids = torch.argmax(logits, dim=-1)
    output_str = processor.batch_decode(prediction_ids)[0]
    print(f"Greedy Decoding: {output_str}")
else:
    output_str = processor.batch_decode(logits.numpy()).text[0]
    print(f"LM Decoding: {output_str}")

Greedy Decoding: कांस्टेबल बन मनोज बाजपेयी दिखा रहे हैं तांडव
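
To make the greedy branch above less of a black box, here is a minimal, dependency-free sketch of what CTC greedy decoding does: take the best token per frame, collapse consecutive repeats, drop the blank token, and turn the word delimiter into spaces. The toy vocabulary, the blank index `0`, and the `|` delimiter are assumptions chosen for illustration; the real tokenizer's vocabulary comes from `vocab.json`.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      vocab: List[str],
                      blank_id: int = 0) -> str:
    """Decode per-frame scores with the CTC greedy (best-path) rule."""
    # 1. Pick the highest-scoring token id for every frame (argmax).
    best_ids = [max(range(len(frame)), key=frame.__getitem__)
                for frame in frame_scores]
    # 2. Collapse runs of the same id into a single id.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Drop blanks, then map ids to characters.
    chars = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 4. "|" marks a word boundary in this toy vocabulary.
    return "".join(chars).replace("|", " ").strip()


def one_hot(index: int, size: int) -> List[float]:
    """Build a toy score vector whose argmax is `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


toy_vocab = ["<pad>", "|", "a", "b"]           # index 0 plays the blank role
frame_ids = [2, 2, 0, 2, 3, 1, 3]              # a a <blank> a b | b
toy_scores = [one_hot(i, len(toy_vocab)) for i in frame_ids]
decoded = ctc_greedy_decode(toy_scores, toy_vocab)
print(decoded)
```

The blank between the second and third frames is what lets the decoder emit two consecutive `a` characters instead of merging them into one.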


### v. Demo App

Create HuggingFace's pipeline for automatic speech recognition

In [12]:
device_no = 0 if DEVICE_ID == "cuda" else -1
if DECODER_TYPE == "greedy":
    asr = pipeline("automatic-speech-recognition", model=model_instance, tokenizer=processor.tokenizer, 
                    feature_extractor=processor.feature_extractor, decoder=None, device=device_no)
else:
    asr = pipeline("automatic-speech-recognition", model=model_instance, tokenizer=processor.tokenizer, 
                    feature_extractor=processor.feature_extractor, decoder=processor.decoder, device=device_no)

Create Gradio App

In [13]:
def transcribe(input_file, language="Hindi", decoding_type="greedy", history=None):
    # Gradio passes the previous "state" value here; start a fresh list on the first call
    history = history or []
    transcription = asr(input_file.name, chunk_length_s=5, stride_length_s=1)["text"]

    history.append({
        "model_id": MODEL_ID,
        "language": language,
        "decoding_type": decoding_type,
        "transcription": transcription,
        "error_message": None
    })

    html_output = "<div class='result'>"
    for item in history:
        if item["error_message"] is not None:
            html_output += f"<div class='result_item result_item_error'>{item['error_message']}</div>"
        else:
            # Use the decoding type stored with each history item, not the one from the current call
            url_suffix = " + LM" if item["decoding_type"] == "lm" else ""
            html_output += "<div class='result_item result_item_success'>"
            html_output += f'<strong><a target="_blank" href="https://huggingface.co/{MODEL_ID}">{MODEL_ID}{url_suffix}</a></strong><br/><br/>'
            html_output += f'{item["transcription"]}<br/>'
            html_output += "</div>"
    html_output += "</div>"

    return html_output, history

gr.Interface(
    transcribe,
    inputs=[
        gr.inputs.Audio(source="microphone", type="file", label="Record here..."),
        gr.inputs.Radio(label="Language", choices=["Hindi"]),
        gr.inputs.Radio(label="Decoding type", choices=["greedy", "lm"]),
        "state"
    ],
    outputs=[
        gr.outputs.HTML(label="Outputs"),
        "state"
    ],
    title="Speech to Text  <=>  IndicWav2Vec",
    description="",
    css="""
    .result {display:flex;flex-direction:column}
    .result_item {padding:15px;margin-bottom:8px;border-radius:15px;width:100%}
    .result_item_success {background-color:mediumaquamarine;color:white;align-self:start}
    .result_item_error {background-color:#ff7070;color:white;align-self:start}
    """,
    allow_flagging="never"
).launch(enable_queue=True)


  "Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your components from gradio.components",
  "Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components",


Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Your interface requires microphone or webcam permissions - this may cause issues in Colab. Use the External URL in case of issues.
Running on public URL: https://19844.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces: https://huggingface.co/spaces


(<gradio.routes.App at 0x7f1a225b57d0>,
 'http://127.0.0.1:7860/',
 'https://19844.gradio.app')
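
The `transcribe` function calls the pipeline with `chunk_length_s=5` and `stride_length_s=1`, so long recordings are processed as overlapping windows. The sketch below illustrates roughly how such windows can be laid out; it is a simplified approximation under stated assumptions (same stride on both sides, no padding), not the Transformers library's actual code, and `chunk_windows` is a hypothetical helper name.

```python
from typing import List, Tuple


def chunk_windows(num_samples: int, sample_rate: int,
                  chunk_length_s: float,
                  stride_length_s: float) -> List[Tuple[int, int, int, int]]:
    """Return (start, end, left_stride, right_stride) tuples in samples.

    The strides mark the overlapping margins of each window whose
    predictions are discarded when windows are stitched back together.
    """
    chunk_len = int(round(chunk_length_s * sample_rate))
    stride = int(round(stride_length_s * sample_rate))
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s.")

    windows = []
    for start in range(0, num_samples, step):
        end = min(start + chunk_len, num_samples)
        is_last = start + chunk_len >= num_samples
        left = 0 if start == 0 else stride     # first window keeps its start
        right = 0 if is_last else stride       # last window keeps its end
        windows.append((start, end, left, right))
        if is_last:
            break
    return windows


# A 12-second clip at 16 kHz with the same settings used in transcribe().
windows = chunk_windows(12 * 16000, 16000, chunk_length_s=5, stride_length_s=1)
for window in windows:
    print(window)
```

After the margins are trimmed, the kept regions of consecutive windows meet exactly, which is why the overlap does not duplicate text in the final transcription.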

## 2. Fine-Tuning ASR Model

### i. Installation and Setup

Prerequisite
- PyTorch is already installed

Install Required System Packages

In [14]:
!apt install -y liblzma-dev libbz2-dev libzstd-dev libsndfile1-dev libopenblas-dev libfftw3-dev libgflags-dev libgoogle-glog-dev build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
build-essential is already the newest version (12.4ubuntu1).
libboost-program-options-dev is already the newest version (1.65.1.0ubuntu1).
libboost-program-options-dev set to manually installed.
libboost-system-dev is already the newest version (1.65.1.0ubuntu1).
libboost-system-dev set to manually installed.
libboost-thread-dev is already the newest version (1.65.1.0ubuntu1).
libboost-thread-dev set to manually installed.
libboost-test-dev is already the newest version (1.65.1.0ubuntu1).
libboost-test-dev set to manually installed.
libopenblas-dev is already the newest version (0.2.20+ds-4).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
libbz2-dev is already the newest version (1.0.6-8.1ubuntu0.2).
liblzma-dev is already the newest version (5.2.2-1.3ubuntu0.1).
libsndfile1-dev is already the newest version (1.0.28-4ubuntu0.18.04.2).
The following package was automatically i

Clone the IndicWav2Vec, Fairseq, KenLM, and Flashlight repositories from GitHub

In [15]:
!rm -rf IndicWav2Vec fairseq kenlm flashlight
!git clone https://github.com/AI4Bharat/IndicWav2Vec.git
!git clone https://github.com/pytorch/fairseq.git
!git clone https://github.com/kpu/kenlm.git
!git clone https://github.com/flashlight/flashlight.git

Cloning into 'IndicWav2Vec'...
remote: Enumerating objects: 1847, done.
remote: Counting objects: 100% (132/132), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 1847 (delta 108), reused 106 (delta 92), pack-reused 1715
Receiving objects: 100% (1847/1847), 127.75 MiB | 24.87 MiB/s, done.
Resolving deltas: 100% (310/310), done.
Cloning into 'fairseq'...
remote: Enumerating objects: 32100, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (118/118), done.
remote: Total 32100 (delta 116), reused 147 (delta 77), pack-reused 31889
Receiving objects: 100% (32100/32100), 22.35 MiB | 19.52 MiB/s, done.
Resolving deltas: 100% (23499/23499), done.
Cloning into 'kenlm'...
remote: Enumerating objects: 14102, done.
remote: Counting objects: 100% (415/415), done.
remote: Compressing objects: 100% (290/290), done.
remote: Total 14102 (delta 126), reused 371 (delta 111), pack-reused 13687
Receiving objects: 

Install Python Packages

In [16]:
%cd /content/IndicWav2Vec
!git checkout new-lm
!pip install packaging soundfile swifter -r w2v_inference/requirements.txt
%cd ..

/content/IndicWav2Vec
Checking out files: 100% (662/662), done.
Branch 'new-lm' set up to track remote branch 'new-lm' from 'origin'.
Switched to a new branch 'new-lm'
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting swifter
  Downloading swifter-1.3.2.tar.gz (825 kB)
Collecting joblib==1.0.0
  Downloading joblib-1.0.0-py3-none-any.whl (302 kB)
Collecting indic-nlp-library
  Downloading indic_nlp_library-0.81-py3-none-any.whl (40 kB)
Collecting tqdm==4.56.0
  Downloading tqdm-4.56.0-py2.py3-none-any.whl (72 kB)
Collecting numpy==1.20.0
  Downloading numpy-1.20.0-cp37-cp37m-manylinux2010_x86_64.whl (15.3 MB)


/content


Build Fairseq

In [17]:
%cd /content/fairseq
!git checkout cf8ff8c3c5242e6e71e8feb40de45dd699f3cc08
!pip install ./
%cd /content

/content/fairseq
Note: checking out 'cf8ff8c3c5242e6e71e8feb40de45dd699f3cc08'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at cf8ff8c3 Add unittests for jitting EMA model
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/fairseq
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can fin

Build KenLM

In [18]:
%cd /content/kenlm
!mkdir -p build
%cd build
!cmake .. 
!make -j 16
%cd /content

/content/kenlm
/content/kenlm/build
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found Boost: /usr/include (found suitable version "1.65.1", minimum required is "1.41.0") foun

Build Flashlight

In [19]:
%cd /content/flashlight/bindings/python
!git checkout 06ddb51857ab1780d793c52948a0759f0ccc6ddb
!export USE_MKL=0 && export KENLM_ROOT="/content/kenlm/" && python setup.py install
%cd /content

/content/flashlight/bindings/python
Note: checking out '06ddb51857ab1780d793c52948a0759f0ccc6ddb'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 06ddb518 improve python binding/ready for pypi (#404)
running install
running bdist_egg
running egg_info
creating flashlight.egg-info
writing flashlight.egg-info/PKG-INFO
writing dependency_links to flashlight.egg-info/dependency_links.txt
writing top-level names to flashlight.egg-info/top_level.txt
writing manifest file 'flashlight.egg-info/SOURCES.txt'
package init file 'flashlight/__init__.py' not found (or not a regular file)
package init file 'flashlight/lib/audio/__init__

#### Insight: End to End ASR Training/Inference Pipeline.

Change directory to IndicWav2Vec

In [20]:
%cd /content/IndicWav2Vec

/content/IndicWav2Vec


### ii. Data Preparation: Manifest Creation

Visualize Manifest Directory Structure

In [21]:
!apt-get -y install tree && tree -dC workshop-2022/asr_data/noa_training_1hr

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 116 not upgraded.
Need to get 40.7 kB of archives.
After this operation, 105 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tree amd64 1.7.0-5 [40.7 kB]
Fetched 40.7 kB in 0s (724 kB/s)
Selecting previously unselected package tree.
(Reading database ... 156592 files and directories currently installed.)
Preparing to unpack .../tree_1.7.0-5_amd64.deb ...
Unpacking tree (1.7.0-5) ...
Setting up tree (1.7.0-5) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
workshop-2022/asr_data/noa_training_1hr
├── audio
│   ├── train
│   │   └── hindi
│   │     
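
The directory listing above is cut off, so the exact manifest files are not visible here. As a hedged illustration, the sketch below assumes the layout commonly used for fairseq wav2vec fine-tuning: a `.tsv` file whose first line is the audio root followed by `relative_path<TAB>num_frames` rows, and a `.ltr` file with one letter-level transcript per line, where `|` closes each word. The helper names and sample paths are hypothetical.

```python
from typing import List, Tuple


def to_letter_line(transcript: str) -> str:
    """Turn 'ab c' into 'a b | c |' (letters separated by spaces, '|' ends a word)."""
    words = transcript.strip().split()
    return " ".join(list("|".join(words))) + " |"


def to_tsv_lines(audio_root: str, entries: List[Tuple[str, int]]) -> List[str]:
    """Build manifest rows: audio root first, then 'relative_path<TAB>num_frames'."""
    return [audio_root] + [f"{path}\t{frames}" for path, frames in entries]


# Hypothetical entries: (path relative to the audio root, number of samples).
tsv_lines = to_tsv_lines("/data/audio", [("train/hindi/utt_0001.wav", 48000)])
ltr_line = to_letter_line("नमस्ते दुनिया")

print("\n".join(tsv_lines))
print(ltr_line)
```

Because Python iterates strings by Unicode code point, Devanagari vowel signs and the virama become separate tokens, which matches a character dictionary that lists signs such as `ँ` and `ं` as their own entries.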

### iii. Start Finetuning

Download Model Checkpoint

In [22]:
%cd /content/IndicWav2Vec/workshop-2022
!mkdir models 
!cd models && rm -rf checkpoint_ft.pt* && wget https://storage.googleapis.com/ai4b-speech/TTS/KENLM/checkpoint_ft.pt
%cd ..

/content/IndicWav2Vec/workshop-2022
--2022-07-27 23:03:49--  https://storage.googleapis.com/ai4b-speech/TTS/KENLM/checkpoint_ft.pt
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.153.128, 142.250.145.128, 173.194.79.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.153.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3808863698 (3.5G) [application/octet-stream]
Saving to: ‘checkpoint_ft.pt’


2022-07-27 23:04:29 (90.9 MB/s) - ‘checkpoint_ft.pt’ saved [3808863698/3808863698]

/content/IndicWav2Vec


#### Insight: Config Setup: What to Change and What Not to Change?

Setup Wandb

In [23]:
!wandb login

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
Aborted!


Run Finetuning

In [24]:
!fairseq-hydra-train task.data=${PWD}"/workshop-2022/asr_data/noa_training_1hr/manifest/hindi" \
    dataset.max_tokens=2000000 \
    common.log_interval=5 \
    common.wandb_project=tutorial-training \
    model.freeze_finetune_updates=100 \
    model.w2v_path=${PWD}"/workshop-2022/models/checkpoint_ft.pt" \
    checkpoint.save_dir=${PWD}"/workshop-2022/models/indicwav2vec_noa" \
    distributed_training.distributed_world_size=1 \
    +optimization.update_freq='[2]' \
    +optimization.lr='[0.0001]' \
    optimization.max_update=1000 \
    checkpoint.save_interval_updates=100 \
    --config-dir ${PWD}"/finetune_configs" \
    --config-name ai4b_xlsr

/bin/bash: fairseq-hydra-train: command not found


#### Insight: Metrics for Evaluation (WER, CER)
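
Word error rate (WER) and character error rate (CER) are both edit distances normalized by the reference length: the number of substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of reference words (WER) or characters (CER). The following is a minimal reference sketch with hypothetical function names, not the evaluation script used by fairseq or this workshop.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))           # distance from empty ref prefix
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]                             # distance to empty hyp prefix
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("Reference must contain at least one word.")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    if not reference:
        raise ValueError("Reference must not be empty.")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Here one substitution (`sat` to `sit`) and one deletion (`the`) give a WER of 2/6. Note that this CER counts Unicode code points, so for Devanagari text each vowel sign is scored as its own character.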

### iv. Inference

In [None]:
%cd /content/IndicWav2Vec
from inference.support import load_model,W2lKenLMDecoder,W2lViterbiDecoder,load_data

model,dictionary = load_model('/content/hindi_large.pt')
model.to('cuda')


from omegaconf import OmegaConf

lmarg = OmegaConf.create({'nbest':1})
generator = W2lViterbiDecoder(lmarg, dictionary)

name2model_dict = dict()
name2model_dict['hi'] = [model,generator,dictionary]

lm_details = {
    "nbest":1, 
    "lexicon":"/content/lexicon.lst", 
    "kenlm_model":"/content/lm.binary", 
    "beam_size_token": 100, 
    "beam":64, 
    "beam_threshold":250,
    "lm_weight":0.5, 
    "word_score":2.0, 
    "sil_weight":0.0
}

/content/IndicWav2Vec
Already up to date.
Loading model..
Successfully loaded model /content/hindi_large.pt


In [None]:
import math

In [None]:
lmarg = OmegaConf.create(lm_details)
lmarg.unk_weight = -math.inf
generator_kenlm = W2lKenLMDecoder(lmarg, dictionary)
name2model_dict['hi_with_lm'] = [model,generator_kenlm,dictionary]

In [None]:
import torch

def infer(fp_arr,DEVICE):
    feature = torch.from_numpy(fp_arr).float()
    if DEVICE != 'cpu' and torch.cuda.is_available():
        feature = feature.to(DEVICE)
    sample = {"net_input":{"source":None,"padding_mask":None}}
    sample["net_input"]["source"] = feature.unsqueeze(0)
    if DEVICE != 'cpu' and torch.cuda.is_available():
        sample["net_input"]["padding_mask"] = torch.BoolTensor(sample["net_input"]["source"].size(1)).fill_(False).unsqueeze(0).to(DEVICE)
    else:
        sample["net_input"]["padding_mask"] = torch.BoolTensor(sample["net_input"]["source"].size(1)).fill_(False).unsqueeze(0)
        
    model,generator,dictionary = name2model_dict['hi']

    with torch.no_grad():
        hypo = generator.generate([model], sample, prefix_tokens=None)
    hyp_pieces = dictionary.string(hypo[0][0]["tokens"].int().cpu())
    tr = hyp_pieces.replace(' ','').replace('|',' ').strip()
    return tr

import pydub
import numpy as np
import soundfile as sf

wav, sr = sf.read('/content/indicAlignment_newsonair_v2_hindi_Regional-Chandigarh-Hindi-1810-2020102818596_sent_11.wav')
print(infer(wav,'cuda'))
%cd /content

श्रीदलाल ने बताया कि ोसना क मुके उधेश प्रगतिशील कसानत्रशिक्क कि रू मे सिर शिक न हे और एक प्रगति शील के सान को उवे में दारी ीजायरी कि वो अपनी आसबास के कमशकमवर  किसानों क प्रेणादे कि किस प्रकार से करश वहं िसयजुड़े वसुबालन ब देयरी भागवानी और मदसे बालिन केिक्ेत्रमं दे कक तक नी कअपनाकर वे अनी आय कसरोर हा सकते हैं
/content
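
The last two lines of `infer` convert the decoder's letter-level output into readable text. Assuming `dictionary.string(...)` returns space-separated letters with `|` as the word boundary, the post-processing can be tried in isolation like this (`letters_to_text` is a hypothetical name for the same chain of `replace` calls):

```python
def letters_to_text(hyp_pieces: str) -> str:
    """Join space-separated letters and turn '|' word boundaries into spaces."""
    return hyp_pieces.replace(" ", "").replace("|", " ").strip()


text = letters_to_text("h e l l o | w o r l d |")
print(text)
```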


In [None]:
import torch

def infer(fp_arr,DEVICE):
    feature = torch.from_numpy(fp_arr).float()
    if DEVICE != 'cpu' and torch.cuda.is_available():
        feature = feature.to(DEVICE)
    sample = {"net_input":{"source":None,"padding_mask":None}}
    sample["net_input"]["source"] = feature.unsqueeze(0)
    if DEVICE != 'cpu' and torch.cuda.is_available():
        sample["net_input"]["padding_mask"] = torch.BoolTensor(sample["net_input"]["source"].size(1)).fill_(False).unsqueeze(0).to(DEVICE)
    else:
        sample["net_input"]["padding_mask"] = torch.BoolTensor(sample["net_input"]["source"].size(1)).fill_(False).unsqueeze(0)
        
    model,generator,dictionary = name2model_dict['hi_with_lm']

    with torch.no_grad():
        hypo = generator.generate([model], sample, prefix_tokens=None)
    hyp_pieces = dictionary.string(hypo[0][0]["tokens"].int().cpu())
    tr = hyp_pieces.replace(' ','').replace('|',' ').strip()
    return tr

import pydub
import numpy as np
import soundfile as sf

wav, sr = sf.read('/content/indicAlignment_newsonair_v2_hindi_Regional-Chandigarh-Hindi-1810-2020102818596_sent_11.wav')
print(infer(wav,'cuda'))
%cd /content

श्री दलाल ने बताया कि सना क मु के उदेश प्रगति शील कसान तर शिक्षक के रूप मे से सिक न हे और एक प्रगति शील के सान को वे में दारी जाय री की वो अपने आस पास के कमसकम व  कि सानों क प्रेणा दे कि किस प्रकार से क्रश वह से जुडे वसु बालन व देरी भागवानी और मद से पालिन किक ेत्र म दे का तक नी कपना कर वे अनी आय करोर हा सकते हैं
/content


## 3. Improving Performance using Language Model

### i. Installation and Setup

Prerequisite
- Fairseq
- `IndicWav2Vec` cloned from GitHub
- KenLM 
- Flashlight
- Other Linux Dependencies

Install Python Packages

In [25]:
#! pip install pyctcdecode pandas matplotlib indic-nlp-library tqdm regex
#! pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install git+https://github.com/sutariyaraj/indic-num2words

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/sutariyaraj/indic-num2words
  Cloning https://github.com/sutariyaraj/indic-num2words to /tmp/pip-req-build-d9_sfc6r
  Running command git clone -q https://github.com/sutariyaraj/indic-num2words /tmp/pip-req-build-d9_sfc6r
Building wheels for collected packages: indic-num2words
  Building wheel for indic-num2words (setup.py) ... done
  Created wheel for indic-num2words: filename=indic_num2words-1.0.0-py3-none-any.whl size=15996 sha256=57ff942cb80a8a5e919e90964f2eacf7d4e29e8d489b1d04fcb3ac79d8830f4b
  Stored in directory: /tmp/pip-ephem-wheel-cache-fg_qmaj1/wheels/3a/3d/04/1244b380ba5711595972f41b56079b5c5532c0dc5788a66dbf
Successfully built indic-num2words
Installing collected packages: indic-num2words
Successfully installed indic-num2words-1.0.0


#### Insight: Greedy vs Beam Search Decoding
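
Greedy decoding commits to the single best token at every step, while beam search keeps several partial hypotheses alive and can recover a sequence whose first token looked worse. The toy example below is a generic sketch of that difference, using a made-up two-step probability table. It is not the CTC beam search used by `pyctcdecode` or Flashlight, which additionally merge equivalent CTC prefixes and add language model scores.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]


def beam_search(next_probs: Callable[[Prefix], Dict[str, float]],
                steps: int, beam_size: int) -> Tuple[Prefix, float]:
    """Return the best sequence and its probability; beam_size=1 is greedy."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]   # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in next_probs(prefix).items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the highest-scoring partial hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    best_prefix, best_score = beams[0]
    return best_prefix, math.exp(best_score)


# Made-up conditional probabilities: "a" looks better first, "b" wins overall.
TOY_MODEL: Dict[Prefix, Dict[str, float]] = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def toy_next_probs(prefix: Prefix) -> Dict[str, float]:
    return TOY_MODEL[prefix]


greedy_path, greedy_prob = beam_search(toy_next_probs, steps=2, beam_size=1)
beam_path, beam_prob = beam_search(toy_next_probs, steps=2, beam_size=2)
print("greedy:", greedy_path, round(greedy_prob, 2))
print("beam  :", beam_path, round(beam_prob, 2))
```

Greedy search locks in `a` (0.6) and ends with probability 0.30, whereas a beam of two keeps `b` alive and finds `b x` with probability 0.36.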

### ii. Dataset Preparation: Clean Text Corpus and Create Lexicon

Visualize LM Data File Structure

In [26]:
!tree workshop-2022/lm_data/indic_corp_100k

workshop-2022/lm_data/indic_corp_100k
├── hindi_dict.txt
└── hi_sents.txt

0 directories, 2 files


Visualize Content of the Folder

In [27]:
%cd /content/IndicWav2Vec/workshop-2022/lm_data
!var="$(cat indic_corp_100k/hindi_dict.txt | wc -l)" && echo "Total Items in hindi_dict.txt = $var"$'\n'
!echo "Displaying first 10 items below..."$'\n' && head indic_corp_100k/hindi_dict.txt
%cd /content

/content/IndicWav2Vec/workshop-2022/lm_data
Total Items in hindi_dict.txt = 64

Displaying first 10 items below...

|
ँ
ं
ः
अ
आ
इ
ई
उ
ऊ
/content


In [28]:
%cd /content/IndicWav2Vec/workshop-2022/lm_data
!var="$(cat indic_corp_100k/hi_sents.txt | wc -l)" && echo "Total Items in hi_sents.txt = $var"$'\n'
!echo "Displaying first 10 sentences below ..."$'\n' && head indic_corp_100k/hi_sents.txt
%cd /content

/content/IndicWav2Vec/workshop-2022/lm_data
Total Items in hi_sents.txt = 100000

Displaying first 10 sentences below ...

आवेदन करने की आखिरी तारीख 31 जनवरी, 2020 है।
इतनी दुआ कर दो हमारे लिए कि जितना प्यार दुनिया ने आपको दिया है, बस उतना ही हमें भी मिल जाए|”
मोदी सरकार के पहले कार्यकाल में भी तीन तलाक को लेकर बिल लाया गया था, हालांकि तब यह राज्यसभा में पास नहीं हो पाया था.
भाजपा के दिवंगत नेता प्रमोद महाजन की बेटी पूनम महाजन को सचिव बनाया गया है.
ऐसी स्थिति में एक न्यायपूर्ण सरकार सार्वजनिक वित्त का इस तरह इस्तेमाल करती है कि संसाधनों का आवंटन, सभी के उपभोग वाले उत्पादों की व्यवहार्यता और समग्र वृहद-आर्थिक प्रबंधन 'निष्पक्षता के रूप में न्याय' को बढ़ाए।
दिलचस्प है कि डीसीएचएल के चेयरमैन टी वेंकटरमन रेड्डी और वाइस चेयरमैन टी विनायक रवि रेड्डी इस बैठक में मौजूद नहीं थे।
इस आम चुनाव में भाजपा नेता सतीश कुमार गौतम को सबसे अधिक 6 लाख 56 हजार 215 वोट प्राप्त हुए.
आयरलैंड टीम के विकेटकीपर बल्लेबाज नियाल ओ'ब्रायन ने अंतर्राष्ट्रीय क्रिकेट से संन्यास का ऐलान कर दिया है। साल 2002 में डेनमार्क 

In [29]:
%cd /content/IndicWav2Vec/
!python lm_training/utils/prepare_data.py hi -d ${PWD}"/workshop-2022/lm_data/indic_corp_100k" \
    --data_type "C" --drop_rows strict --dict_dir ${PWD}"/workshop-2022/lm_data/indic_corp_100k" \
    --out_dir ${PWD}"/workshop-2022/models/indic_corp_lm"
%cd /content

/content/IndicWav2Vec
----------------------------------------------------------------------------
-------------------Cleaning and Stats Calculation #1 ...-------------------
	Loading Data...
	Done Loading Data!
	Starting Processing...
100% 100000/100000 [00:24<00:00, 4074.77it/s]
	Done Processing!

-------------------Cleaning, Filtering and Stats Calculation #2 ...-------------------
	Saving Final Sentences in txt format...
	Saving Unique Words and their frequencies...
	Saving Stats and Stats Plots
----------------------------------------------------------------------------
/content
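
A lexicon for lexicon-constrained decoding maps each allowed word to its spelling in model tokens. The sketch below shows the general idea under two assumptions: the lexicon keeps the `top_k` most frequent words of the cleaned corpus, and each line follows the `word<TAB>letters |` convention used by Flashlight-style decoders. It is an illustration with a hypothetical function name, not the logic of the repository's own scripts.

```python
from collections import Counter
from typing import Iterable, List


def build_lexicon(sentences: Iterable[str], top_k: int) -> List[str]:
    """Return 'word<TAB>l e t t e r s |' lines for the top_k most frequent words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    lines = []
    for word, _ in counts.most_common(top_k):
        spelling = " ".join(word) + " |"       # "|" marks the end of the word
        lines.append(f"{word}\t{spelling}")
    return lines


# Tiny made-up corpus (romanized so the output is easy to read).
corpus = ["ghar bada ghar", "ghar chota"]
lexicon_lines = build_lexicon(corpus, top_k=2)
print("\n".join(lexicon_lines))
```

Words outside the top-k list cannot be produced by a lexicon-based decoder, so the choice of `top_k` trades vocabulary coverage against decoding speed and memory.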


### iii. Start Training

In [30]:
%cd /content/IndicWav2Vec/
!python lm_training/utils/train_kenlm.py hi --lm_base_dirpath ${PWD}"/workshop-2022/models/indic_corp_lm" \
    --lm_dirname "lm" --topk 10000 --kenlm_bins "../kenlm/build/bin" \
    --arpa_order 6 --max_arpa_memory "90%" --arpa_prune "0|0|0|0|1|2" \
    --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
%cd /content

/content/IndicWav2Vec
rm: cannot remove '/content/IndicWav2Vec/workshop-2022/models/indic_corp_lm/hindi/lm/*': No such file or directory

Step 1:	Preparing Data for LM-------------------------------------------
Merging all "*clean_sents.txt" files!...
cat: '/content/IndicWav2Vec/workshop-2022/models/indic_corp_lm/hindi/M_*_sents.txt': No such file or directory
Creating a temporary copy of ALL_SENTS!...
Merging all "C_*words_counters.tsv" files!...
Merging all "M_*words_counters.tsv" files!...
Corpus Stats: Total Words: 1874607	 Total Unique Words: 81713	 Top-k:10000	 %age top-k: 12.24%
Combined Unique Words in the Lexicon, #10000

Step 2:	Building LM-------------------------------------------
Creating ARPA file ...

=== 1/5 Counting and sorting n-grams ===
Reading /content/IndicWav2Vec/workshop-2022/models/indic_corp_lm/hindi/itms/temp_sents.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 1931182080 bytes

### iv. Inference

## 4. Deploying Models

### i. Installation and Setup

Install git-lfs

In [None]:
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
!apt-get install git-lfs
!git lfs install
# !git config --global user.email "ramanabhigyan@gmail.com"
# !git config --global user.name "Abhigyan Raman"

Detected operating system as Ubuntu/bionic.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... done.
Running apt-get update... done.

The repository is setup! You can now install packages.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (3.2.0).
The following package was automatically installed and is no longer required:
  libnvidia-common-470
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 281 not upgraded.
Updated Git hooks.
Git LFS initialized.


Install Gradio

In [None]:
!pip install gradio

Login to huggingface-hub

In [None]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in yo

Import Statements

In [31]:
from transformers import Wav2Vec2Config
from huggingface_hub import create_repo, Repository

import logging
import sys
import torch  # used below for the CUDA availability check
import gradio as gr
from transformers import pipeline, AutoModelForCTC, Wav2Vec2Processor, Wav2Vec2ProcessorWithLM

### ii. Export models to HuggingFace

Create and Initialize Repo

In [None]:
repo_url = create_repo("indicwav2vec-hindi", organization="ai4bharat")
repo = Repository(local_dir="workshop-2022/models/indicwav2vec_noa_hf", clone_from=repo_url)

Cloning https://huggingface.co/ai4bharat/indicwav2vec-hindi into local empty directory.


Save `config.json` from a model with a similar architecture on the HuggingFace Hub

In [None]:
config = Wav2Vec2Config.from_pretrained('facebook/wav2vec2-large-960h-lv60-self')
config.save_pretrained('workshop-2022/models/indicwav2vec_noa_hf');

Convert ASR model to Huggingface's format

In [None]:
%cd "/content/IndicWav2Vec"
!python workshop-2022/utils/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py \
    --pytorch_dump_folder ${PWD}"/workshop-2022/models/indicwav2vec_noa_hf" \
    --checkpoint_path ${PWD}"/workshop-2022/models/indicwav2vec_noa/checkpoint_best.pt" \
    --config_path ${PWD}"/workshop-2022/models/indicwav2vec_noa_hf/config.json" \
    --dict_path ${PWD}"/workshop-2022/asr_data/noa_training_1hr/manifest/hindi/dict.ltr.txt"
%cd /content
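
`--dict_path` points to the fairseq letter dictionary (`dict.ltr.txt`), which lists one symbol and its count per line. The upstream HuggingFace conversion script for wav2vec 2.0 turns such a dictionary into the tokenizer vocabulary and places `<pad>` at index 0, because `<pad>` serves as the CTC blank on the HuggingFace side. The sketch below imitates that mapping with the standard library only. It assumes the workshop copy of the script behaves like the upstream one; `build_vocab` and the sample dictionary lines are hypothetical.

```python
def build_vocab(dict_lines):
    """Map fairseq-style 'symbol count' lines to token ids.

    Assumption: four special tokens come first, in fairseq's default order
    (<s>, <pad>, </s>, <unk>), and <pad>/<s> are then swapped so that
    <pad> (the CTC blank) gets id 0.
    """
    vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
    for line in dict_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Each line is "<symbol> <count>"; only the symbol matters here.
        symbol, _count = line.rsplit(" ", 1)
        vocab[symbol] = len(vocab)
    # Swap the two special tokens: <pad> becomes 0, <s> becomes 1.
    vocab["<pad>"], vocab["<s>"] = 0, 1
    return vocab

# Hypothetical dictionary contents; "|" is the word-boundary symbol.
sample = ["| 1200", "ा 950", "क 800", "र 760"]
vocab = build_vocab(sample)
print(vocab)
```

If the vocabulary size produced this way differs from `vocab_size` in the borrowed `config.json`, the converted model's output layer will not match the tokenizer, so it is worth checking the two against each other after conversion.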

Convert both the ASR model and the LM to HuggingFace's format

In [None]:
%cd "/content/IndicWav2Vec"
!python workshop-2022/utils/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py \
    --pytorch_dump_folder ${PWD}"/workshop-2022/models/indicwav2vec_noa_hf" \
    --checkpoint_path ${PWD}"/workshop-2022/models/indicwav2vec_noa/checkpoint_best.pt" \
    --config_path ${PWD}"/workshop-2022/models/indicwav2vec_noa_hf/config.json" \
    --dict_path ${PWD}"/workshop-2022/asr_data/noa_training_1hr/manifest/hindi/dict.ltr.txt" \
    --lm_path ${PWD}"/workshop-2022/models/indic_corp_lm/hindi/lm/lm.binary" \
    --lexicon_path ${PWD}"/workshop-2022/models/indic_corp_lm/hindi/lm/lexicon.lst" \
    --alpha 0.5 \
    --beta 1.5 \
    --with_LM
%cd /content
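
`--alpha` and `--beta` are the usual knobs for LM-fused CTC beam search: `alpha` weights the language-model score, and `beta` adds a per-word bonus that offsets the LM's preference for shorter outputs. The sketch below shows how such a combined score could rank two hypotheses. The function `fused_score` and all log-probabilities are made-up illustration values, and a real decoder applies this scoring during the beam search rather than afterwards.

```python
def fused_score(am_logp, lm_logp, n_words, alpha=0.5, beta=1.5):
    """Acoustic log-prob + alpha * LM log-prob + beta * word count."""
    return am_logp + alpha * lm_logp + beta * n_words

# Hypothetical candidates: (text, acoustic log-prob, LM log-prob).
candidates = [
    ("मेरा नाम राम है", -4.2, -6.0),
    ("मेरा नाम रामहै", -4.0, -11.0),
]

scored = [
    (fused_score(am, lm, len(text.split())), text) for text, am, lm in candidates
]
best_score, best_text = max(scored)
print(best_text, round(best_score, 2))

# With alpha = beta = 0 the ranking falls back to the acoustic score alone.
acoustic_only = max(
    (fused_score(am, lm, len(text.split()), alpha=0.0, beta=0.0), text)
    for text, am, lm in candidates
)[1]
print(acoustic_only)
```

In this toy case the acoustic model alone slightly prefers the hypothesis with the merged word, while the fused score prefers the correctly segmented one. Suitable values for `alpha` and `beta` are normally tuned on a held-out set.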

Push to Huggingface Model Hub

In [None]:
%cd "/content/IndicWav2Vec/workshop-2022/models/indicwav2vec_noa_hf"
!huggingface-cli lfs-enable-largefiles .
!git lfs track "*.binary"
!git add .
!git commit -m "added language model"
!git push origin main
%cd /content

/content/indicwav2vec-hindi
Local repo set up for largefiles
Tracking "*.binary"
[main e746fa3] added language model
 3 files changed, 1 insertion(+)
 create mode 100644 gradio_queue.db
 create mode 100644 gradio_queue.db-journal
Counting objects: 4, done.
Delta compression using up to 12 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 389 bytes | 389.00 KiB/s, done.
Total 4 (delta 2), reused 0 (delta 0)
To https://huggingface.co/ai4bharat/indicwav2vec-hindi
   1d357da..e746fa3  main -> main
/content


### iii. Deploying as Gradio app

Setup logging and Global Variables

In [32]:
# Basic Logger
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# Variables
LARGE_MODEL_BY_LANGUAGE = {
    "Hindi": {"model_id": "ai4bharat/indicwav2vec-hindi", "has_lm": True},
}
LANGUAGES = sorted(LARGE_MODEL_BY_LANGUAGE.keys())
CACHED_MODELS_BY_ID = {}
DEVICE_ID = 0 if torch.cuda.is_available() else -1

Define run function

In [35]:
def run(input_file, language, decoding_type, history):
    logger.info(f"Running ASR {language}-{decoding_type} for {input_file}")
    history = history or []
    model = LARGE_MODEL_BY_LANGUAGE.get(language, None)

    if decoding_type == "lm" and not model["has_lm"]:
        history.append({
            "error_message": f"lm not available for {language} language :("
        })
    else:
        model_instance = CACHED_MODELS_BY_ID.get(model["model_id"], None)
        if model_instance is None:
            model_instance = AutoModelForCTC.from_pretrained(model["model_id"])
            CACHED_MODELS_BY_ID[model["model_id"]] = model_instance

        # Build the pipeline on DEVICE_ID (GPU 0 if available, otherwise CPU).
        if decoding_type == "lm":
            processor = Wav2Vec2ProcessorWithLM.from_pretrained(model["model_id"])
            asr = pipeline("automatic-speech-recognition", model=model_instance, tokenizer=processor.tokenizer,
                           feature_extractor=processor.feature_extractor, decoder=processor.decoder, device=DEVICE_ID)
        else:
            processor = Wav2Vec2Processor.from_pretrained(model["model_id"])
            asr = pipeline("automatic-speech-recognition", model=model_instance, tokenizer=processor.tokenizer,
                           feature_extractor=processor.feature_extractor, decoder=None, device=DEVICE_ID)

        transcription = asr(input_file.name, chunk_length_s=5, stride_length_s=1)["text"]
        logger.info(f"Transcription for {input_file}: {transcription}")
        history.append({
            "model_id": model["model_id"],
            "language": language,
            "decoding_type": decoding_type,
            "transcription": transcription,
            "error_message": None
        })

    html_output = "<div class='result'>"
    for item in history:
        if item["error_message"] is not None:
            html_output += f"<div class='result_item result_item_error'>{item['error_message']}</div>"
        else:
            # decoding_type is "greedy" or "lm" (lowercase), matching the Radio choices
            url_suffix = " + LM" if item["decoding_type"] == "lm" else ""
            html_output += "<div class='result_item result_item_success'>"
            html_output += f'<strong><a target="_blank" href="https://huggingface.co/{item["model_id"]}">{item["model_id"]}{url_suffix}</a></strong><br/><br/>'
            html_output += f'{item["transcription"]}<br/>'
            html_output += "</div>"
    html_output += "</div>"

    return html_output, history
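
The call `asr(..., chunk_length_s=5, stride_length_s=1)` in `run` splits long recordings into overlapping 5-second windows; the 1-second stride on each side acts as context whose output is dropped when the chunks are merged. The sketch below approximates how such windows could be laid out, in seconds. `chunk_windows` is a hypothetical helper, not the pipeline's internal code, and it ignores details such as rounding to whole samples.

```python
def chunk_windows(total_s, chunk_s=5.0, stride_s=1.0):
    """Yield (start, end, left_stride, right_stride) for each chunk, in seconds."""
    step = chunk_s - 2 * stride_s  # new audio contributed by each chunk
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride")
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        is_last = start + chunk_s >= total_s
        left = 0.0 if start == 0.0 else stride_s  # first chunk has no left context
        right = 0.0 if is_last else stride_s      # last chunk has no right context
        yield (start, end, left, right)
        if is_last:
            break
        start += step

windows = list(chunk_windows(12.0))
for w in windows:
    print(w)

# The part of each chunk that is kept after dropping the stride context.
kept = [(s + l, e - r) for s, e, l, r in windows]
print(kept)
```

Larger strides give each chunk more context at its edges but also mean more chunks, and therefore more compute, for the same recording.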


Define Gradio Interface

In [36]:
gr.Interface(
    run,
    inputs=[
        gr.inputs.Audio(source="microphone", type="file", label="Record here..."),
        gr.inputs.Radio(label="Language", choices=LANGUAGES),
        gr.inputs.Radio(label="Decoding type", choices=["greedy", "lm"]),
        "state"
    ],
    outputs=[
        gr.outputs.HTML(label="Outputs"),
        "state"
    ],
    title="Speech to Text <=> IndicWav2Vec",
    description="",
    css="""
    .result {display:flex;flex-direction:column}
    .result_item {padding:15px;margin-bottom:8px;border-radius:15px;width:100%}
    .result_item_success {background-color:mediumaquamarine;color:white;align-self:start}
    .result_item_error {background-color:#ff7070;color:white;align-self:start}
    """,
    allow_flagging="never"
).launch(enable_queue=True)

  "Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your components from gradio.components",
  "Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components",


Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Your interface requires microphone or webcam permissions - this may cause issues in Colab. Use the External URL in case of issues.
Running on public URL: https://32910.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces: https://huggingface.co/spaces


(<gradio.routes.App at 0x7f1a2228d6d0>,
 'http://127.0.0.1:7862/',
 'https://32910.gradio.app')