```
An n-gram language model can significantly improve the performance of a Wav2Vec2 model. In this notebook, we'll see how to build an n-gram language model, combine it with a Wav2Vec2 model, and measure the difference it makes to the model's performance.
```

# What is an n-gram Language Model?

```
N-gram language models are a type of probabilistic language model where "n" represents the number of consecutive items (usually words or characters) in a sequence. For instance, a unigram is a single word, a bigram is a pair of consecutive words, a trigram is a sequence of three consecutive words, and so on. These models compute the likelihood of a word or character given the previous "n-1" words or characters. N-gram models estimate the probability distribution of sequences of words or characters based on the frequency of their occurrences in the training data.
```


**Suppose we have a sentence: "The quick brown fox jumps over the lazy dog."**

**Unigram (1-gram):**

> "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog."

**Bigram (2-gram):**

> "The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog."

**Trigram (3-gram):**

> "The quick brown", "quick brown fox", "brown fox jumps", "fox jumps over", "jumps over the", "over the lazy", "the lazy dog."

**4-gram (Quadgram):**

> "The quick brown fox", "quick brown fox jumps", "brown fox jumps over", "fox jumps over the", "jumps over the lazy", "over the lazy dog."

In a 4-gram model, each word is predicted from the three words that precede it.
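
The lists above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `ngrams` is made up for illustration, and the sentence is split on whitespace, so punctuation stays attached to the last word exactly as in the lists above.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog."
tokens = sentence.split()  # simple whitespace tokenization

# Print the unigrams, bigrams, trigrams and 4-grams of the sentence
for n in range(1, 5):
    print(n, ngrams(tokens, n))
```

A sentence with 9 words yields 9 unigrams, 8 bigrams, 7 trigrams and 6 4-grams, matching the lists above.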

**N-gram language models are trained on large text corpora to estimate the probabilities of these sequences**. This information is then used to predict the most likely word or character given the context. 
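
To make "estimating probabilities from frequencies" concrete, here is a toy bigram model built from three invented sentences. It uses plain maximum-likelihood counts, which is an assumption made for simplicity; KenLM, used later in this notebook, applies modified Kneser-Ney smoothing instead, so its numbers would differ.

```python
from collections import Counter

# Tiny invented corpus, only for illustration
corpus = [
    "the knight rode at night",
    "the knight drew his sword",
    "the night was cold",
]

context_counts = Counter()
bigram_counts = Counter()
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    context_counts.update(tokens[:-1])              # every token that has a successor
    bigram_counts.update(zip(tokens, tokens[1:]))   # consecutive word pairs

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("the", "knight"))  # 2 of the 3 occurrences of "the" are followed by "knight"
print(bigram_prob("the", "night"))   # 1 of the 3 occurrences of "the" is followed by "night"
```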

# Why do we need a Language Model here?

```
Wav2Vec2 is an acoustic model: it maps speech audio to characters or phonemes. However, it can struggle with punctuation and spelling, because these are not directly encoded in the audio and different spellings can have the same or very similar pronunciation. In some cases, a character is silent when the word is pronounced.
```

> For example, **'Knight' and 'Night'** have the same pronunciation, so the correct spelling will depend on the context.
> **Homophones**: weak/week, too/two, etc.

An n-gram model can boost the performance of an acoustic model by supplying linguistic knowledge: it uses the surrounding context to prefer the words that are most probable in each position.
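
The idea can be sketched with a toy example. All numbers below are invented, and the scoring rule (acoustic log-probability plus a weighted LM log-probability, with a hypothetical weight `alpha`) is a simplified assumption; real decoders such as pyctcdecode apply this kind of weighting inside a beam search with additional terms.

```python
import math

# Invented acoustic scores: the two spellings sound alike,
# so the acoustic model cannot tell them apart.
acoustic_logp = {
    "the night was cold": math.log(0.50),
    "the knight was cold": math.log(0.50),
}

# Invented language-model scores for the same word sequences.
lm_logp = {
    "the night was cold": math.log(0.020),
    "the knight was cold": math.log(0.002),
}

def combined_score(text, alpha=0.5):
    """Acoustic log-probability plus the LM log-probability weighted by alpha."""
    return acoustic_logp[text] + alpha * lm_logp[text]

best = max(acoustic_logp, key=combined_score)
print(best)  # the night was cold
```

With the acoustic scores tied, the language model alone decides which spelling wins.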


[Reference tutorial to understand N-gram models and implementation](https://huggingface.co/blog/wav2vec2-with-ngram)

# Installations

In [1]:
import resource
import os
import pandas as pd

In [2]:
%%capture ts
!pip install jiwer
!pip install bnunicodenormalizer
!pip -q install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
! sudo apt -y install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
! wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
! mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
! ls kenlm/build/bin

In [3]:
from bnunicodenormalizer import Normalizer 
bnorm = Normalizer()
def normalize(sen):
    _words = [bnorm(word)['normalized']  for word in sen.split()]
    return " ".join([word for word in _words if word is not None])

# Load the Data

We'll use all of our training sentences, plus an external Bengali text corpus, to build our 5-gram language model.

In [4]:
df = pd.read_csv("/kaggle/input/bengaliai-speech/train.csv")
train_df = df[df.split=="train"]
train_df.head()

Unnamed: 0,id,sentence,split
0,000005f3362c,ও বলেছে আপনার ঠিকানা!,train
1,00001dddd002,কোন মহান রাষ্ট্রের নাগরিক হতে চাও?,train
2,00001e0bc131,"আমি তোমার কষ্টটা বুঝছি, কিন্তু এটা সঠিক পথ না।",train
3,000024b3d810,নাচ শেষ হওয়ার পর সকলে শরীর ধুয়ে একসঙ্গে ভোজন...,train
4,000028220ab3,"হুমম, ওহ হেই, দেখো।",train


Dump the sentences in a text file

In [5]:
all_sentences = list(set(train_df['sentence'].tolist()))
len(all_sentences)

461532

In [6]:
import re
from tqdm import tqdm
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�\']'

# Write one normalized sentence per line: first the training transcripts,
# then the external Bengali corpus.
with open('text.txt', 'w', encoding='utf-8') as f:
    for sentence in tqdm(all_sentences):
        f.write(normalize(sentence))
        f.write('\n')
    with open("/kaggle/input/dataset-for-lm/hasan-etal-2020-low/2.75M/original_corpus.bn", 'r', encoding='utf-8') as f2:
        lines = f2.readlines()

        for line in tqdm(lines):
            f.write(normalize(line))
            f.write('\n')

100%|██████████| 461532/461532 [38:59<00:00, 197.24it/s]
100%|██████████| 2753069/2753069 [4:53:53<00:00, 156.13it/s]


**Now build a 5-gram ARPA model using KenLM. Since 5-grams are relatively common in speech recognition, we build one by passing the -o 5 parameter.**

In [7]:
pwd

'/kaggle/working'

In [8]:
!kenlm/build/bin/lmplz -o 5 < "text.txt" > "5gram.arpa"

=== 1/5 Counting and sorting n-grams ===
Reading /kaggle/working/text.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 33068042 types 1221383
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:14656596 2:2626467072 3:4924625920 4:7879400960 5:11490793472
Statistics:
1 1221383 D1=0.726508 D2=1.04896 D3+=1.33469
2 10504766 D1=0.804746 D2=1.11287 D3+=1.32831
3 20572612 D1=0.889338 D2=1.24474 D3+=1.35282
4 23528079 D1=0.942228 D2=1.43865 D3+=1.39094
5 22898089 D1=0.81758 D2=1.76743 D3+=1.46412
Memory estimate for binary LM:
type      MB
probing 1673 assuming -p 1.5
probing 1990 assuming -r models -p 1.5
trie     879 without quantization
trie     510 assuming -q 8 -b 8 quantization 
trie     763 assuming -a 22 array pointer compression
trie     394 assuming -a 22 -q 8 -b 8 array pointer

Let's look at what is inside 5gram.arpa

In [9]:
!head -20 5gram.arpa

\data\
ngram 1=1221383
ngram 2=10504766
ngram 3=20572612
ngram 4=23528079
ngram 5=22898089

\1-grams:
-7.066635	<unk>	0
0	<s>	-1.0176687
-1.5402384	</s>	0
-2.5750425	এই	-0.62876683
-3.3284895	সময়	-0.46045333
-3.3767936	এখানে	-0.4176563
-3.60266	থাকা	-0.43585598
-3.3397322	ঠিক	-0.67782724
-3.3999436	না।	-0.7081418
-3.8354008	অন্যদের	-0.59316385
-4.706327	কষ্টের	-0.4007752
-3.1346946	কথা	-0.77826494


Voilà, our language model is built. There is one problem, though:
```
There is a small problem that 🤗 Transformers will not be happy about later on. The 5-gram correctly includes a "Unknown" or <unk>, as well as a begin-of-sentence, <s> token, but no end-of-sentence, </s> token. This sadly has to be corrected currently after the build.

We can simply add the end-of-sentence token by adding the line 0 </s> -0.11831701 below the begin-of-sentence token and increasing the ngram 1 count by 1
```

> [Reference](https://huggingface.co/blog/wav2vec2-with-ngram)

Let's fix it

In [10]:
with open("5gram.arpa", "r") as read_file, open("5gram_correct.arpa", "w") as write_file:
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            count=line.strip().split("=")[-1]
            write_file.write(line.replace(f"{count}", f"{int(count)+1}"))
        elif not has_added_eos and "<s>" in line:
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)
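
To see what this fix does without touching the large file, the same logic can be run on a tiny in-memory ARPA snippet. This is only a sketch: the toy file contents are invented, and `add_eos` is a hypothetical helper that mirrors the loop above (it writes the incremented count directly instead of using a string replace).

```python
import io

def add_eos(lines):
    """Bump the unigram count by one and copy the <s> line as a </s> line."""
    out, added = [], False
    for line in lines:
        if not added and line.startswith("ngram 1="):
            count = int(line.strip().split("=")[-1])
            out.append(f"ngram 1={count + 1}\n")
        elif not added and "<s>" in line:
            out.append(line)
            out.append(line.replace("<s>", "</s>"))
            added = True
        else:
            out.append(line)
    return out

# Invented toy ARPA file with three unigrams and no </s> entry
toy_arpa = io.StringIO(
    "\\data\\\n"
    "ngram 1=3\n"
    "\n"
    "\\1-grams:\n"
    "-1.0\t<unk>\t0\n"
    "0\t<s>\t-0.5\n"
    "-0.7\tfoo\t-0.1\n"
)
fixed = add_eos(toy_arpa)
print("".join(fixed))
```

The output has one extra line (the `</s>` entry right after `<s>`) and the header count goes from 3 to 4.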

In [11]:
!rm 5gram.arpa

Okay, our LM is done! But what do we do with it? It is of no use on its own; we need to integrate it with our wav2vec2 model.

# Combine LM with wav2vec2

We'll use the [yellowking model](https://www.kaggle.com/code/sameen53/yellowking-dlsprint-inference) as our demo wav2vec2 model. It already comes with an LM, so we'll remove that one and use our own LM instead.

In [12]:
!cp -av /kaggle/input/yellowking-dlsprint-model/YellowKing_processor .

'/kaggle/input/yellowking-dlsprint-model/YellowKing_processor' -> './YellowKing_processor'
'/kaggle/input/yellowking-dlsprint-model/YellowKing_processor/added_tokens.json' -> './YellowKing_processor/added_tokens.json'
'/kaggle/input/yellowking-dlsprint-model/YellowKing_processor/preprocessor_config.json' -> './YellowKing_processor/preprocessor_config.json'
'/kaggle/input/yellowking-dlsprint-model/YellowKing_processor/tokenizer_config.json' -> './YellowKing_processor/tokenizer_config.json'
'/kaggle/input/yellowking-dlsprint-model/YellowKing_processor/special_tokens_map.json' -> './YellowKing_processor/special_tokens_map.json'
'/kaggle/input/yellowking-dlsprint-model/YellowKing_processor/vocab.json' -> './YellowKing_processor/vocab.json'
'/kaggle/input/yellowking-dlsprint-model/YellowKing_processor/language_model' -> './YellowKing_processor/language_model'
'/kaggle/input/yellowking-dlsprint-model/YellowKing_processor/language_model/5gram.bin' -> './YellowKing_processor/language_mo

In [13]:
!rm -rf YellowKing_processor/language_model

Let's load the processor without LM

In [14]:
from transformers import Wav2Vec2ProcessorWithLM,Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("/kaggle/working/YellowKing_processor")
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
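
The vocabulary is sorted by token id because the decoder is given a plain list of labels, and position i in that list is matched to column i of the model's output logits. A toy illustration with an invented four-token vocabulary:

```python
# Invented toy vocabulary; a real Wav2Vec2 vocabulary is much larger
vocab_dict = {"B": 2, "<pad>": 0, "A": 1, "|": 3}

# Same expression as above: sort by token id, lower-case the tokens
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
labels = list(sorted_vocab_dict.keys())

print(labels)  # ['<pad>', 'a', 'b', '|']
```

After sorting, each label's position in the list equals its token id.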

``` 
The "labels" and the previously built 5gram_correct.arpa file are all that is needed to build the decoder.
```

In [15]:
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="5gram_correct.arpa",
)

Loading the LM will be faster if you build a binary file.
Reading /kaggle/working/5gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************


Now let's wrap the newly created decoder, together with the processor's tokenizer and feature_extractor, into a Wav2Vec2ProcessorWithLM class.

In [16]:
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)

Yay! We're done. Let's save the processor.

In [17]:
processor_with_lm.save_pretrained("Yellowking_Processor_New_LM")

In [18]:
!rm -rf /kaggle/working/kenlm

# Inference

In [19]:
class CFG:
    my_model_name = '../input/yellowking-dlsprint-model/YellowKing_model'
    processor_name = '/kaggle/working/Yellowking_Processor_New_LM'
    processor_without_LM = '/kaggle/working/YellowKing_processor'

from transformers import Wav2Vec2ProcessorWithLM, pipeline

processor = Wav2Vec2ProcessorWithLM.from_pretrained(CFG.processor_name)
asr_w_LM = pipeline(
    "automatic-speech-recognition",
    model=CFG.my_model_name,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=processor.decoder,
)

Loading the LM will be faster if you build a binary file.
Reading /kaggle/working/Yellowking_Processor_New_LM/language_model/5gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************


In [20]:
import librosa
def infer(audio_path):
    speech, sr = librosa.load(audio_path, sr=processor.feature_extractor.sampling_rate)
    my_LM_prediction = asr_w_LM(speech)

    return normalize(my_LM_prediction['text'])

In [21]:
infer("/kaggle/input/bengaliai-speech/train_mp3s/00001e0bc131.mp3")

'আমি তোমার কষ্টটা বুঝি কিন্তু এটা সঠিক পথ না।'

# Performance comparison: with LM and without LM

In [22]:
processor_without_LM = Wav2Vec2Processor.from_pretrained(CFG.processor_without_LM)
asr_wo_LM = pipeline(
    "automatic-speech-recognition",
    model=CFG.my_model_name,
    feature_extractor=processor_without_LM.feature_extractor,
    tokenizer=processor_without_LM.tokenizer,
)

def infer_wo_LM(audio_path):
    # Resample to the rate expected by this pipeline's own feature extractor
    speech, sr = librosa.load(audio_path, sr=processor_without_LM.feature_extractor.sampling_rate)
    prediction = asr_wo_LM(speech)

    return normalize(prediction['text'])

In [23]:
df = pd.read_csv("/kaggle/input/bengaliai-speech/train.csv")
val = df[df.split=="valid"]
val.head()

Unnamed: 0,id,sentence,split
20,0000e711c2b1,তিনি এবং তাঁর মা তাদের পৈতৃক বাড়িতে থেকে প্রত...,valid
59,00036c2a2d9d,কৃত্তিবাস রামায়ণ-বহির্ভূত অনেক গল্প এই অনুবাদ...,valid
100,00065e317123,তিনি তার সুশৃঙ্খল সামরিক বাহিনী এবং সুগঠিত শাস...,valid
101,00065f40df52,তিনি বিজয়নগর সাম্রাজ্যের বিরুদ্ধে এবং বিজাপুর...,valid
146,0009b022c8ea,এটি মূলত একটি মরুময় অঞ্চল।,valid


In [24]:
from jiwer import wer

In [25]:
import IPython.display as ipd

def example(idx):
    path = val['id'].iloc[idx]
    full_path = "/kaggle/input/bengaliai-speech/train_mp3s/"+path+".mp3"
    display(ipd.Audio(full_path))

    sentence = normalize(val.sentence.iloc[idx])
    # Run each pipeline once and reuse the result for printing and scoring
    pred_with_lm = infer(full_path)
    pred_without_lm = infer_wo_LM(full_path)
    print("Ground Truth : ",sentence)
    print("Prediction With LM : ",pred_with_lm)
    print("Prediction Without LM : ",pred_without_lm)
    print("Word Error Rate with LM : ",wer(sentence,pred_with_lm))
    print("Word Error Rate without LM : ",wer(sentence,pred_without_lm))


In [26]:
idx = 0
example(idx)

Ground Truth :  তিনি এবং তাঁর মা তাদের পৈতৃক বাড়িতে থেকে প্রতিবেশীদের দ্বারা অনেক তিরস্কার সহ্য করেন।
Prediction With LM :  তিনি এবং তাঁর মা তাদের পৈতৃক বাড়িতে থেকে প্রতিবেশীদের দ্বারা অনেক তিরস্কার সহ্য করেন।
Prediction Without LM :  তিনি এবং তার মা তাদের পৈতৃক বাড়িতে থেকে প্রতিবেশীদের দ্বারা অনেক তুরস্কার সহ্য করেন।
Word Error Rate with LM :  0.0
Word Error Rate without LM :  0.14285714285714285
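
The score above can be checked by hand: the prediction without the LM gets 2 of the 14 reference words wrong, and 2/14 ≈ 0.143. Below is a minimal pure-Python sketch of the metric (word-level edit distance divided by the number of reference words). `word_error_rate` is an illustrative helper, not jiwer's implementation, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown dog"))  # 0.25
```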


In [27]:
idx = 5
example(idx)

Ground Truth :  সড়কটি বিহার-পশ্চিমবঙ্গ সীমান্ত অতিক্রম করে পশ্চিমবঙ্গ রাজ্যে প্রবেশ করে উত্তর দিনাজপুর জেলা হয়ে।
Prediction With LM :  সড়কটি বিহার পশ্চিমবঙ্গ সীমান্ত অতিক্রম করে পশ্চিমবঙ্গ রাজ্যে প্রবেশ করে উত্তর দিনাজপুর জেলা হয়ে।
Prediction Without LM :  সড়কটি বিহার পশ্চিমবঙ্গ সীমান্ত অতিক্রম গরে পশ্চিমবঙ্গ রাজ্যে প্রবেশ করে উত্তর দিনাজপুর জেলা হয়ে।
Word Error Rate with LM :  0.15384615384615385
Word Error Rate without LM :  0.23076923076923078


In [28]:
idx = 9
example(idx)

Ground Truth :  যথারীতি সেখানেও সাফল্যের স্বাক্ষর রাখলেন সিদ্দিক।
Prediction With LM :  যথারীতি সেখানেও সাফল্যের স্বাক্ষর রাখলেন সিদ্দিক।
Prediction Without LM :  যথারীচি সেখানেও সাফল্যের স্বাক্ষর রাখলেন সিদ্দিক।
Word Error Rate with LM :  0.0
Word Error Rate without LM :  0.16666666666666666


In [29]:
idx = 12
example(idx)

Ground Truth :  টাইমস ব্যাংক প্রতিষ্ঠা করেছিলেন ভারতের দ্য টাইমস গ্রুপ হিসেবে পরিচিত বেনেট, কোলম্যান এবং কোং লিমিটেড।
Prediction With LM :  টাইমস ব্যাংক প্রতিষ্ঠা করেছিলেন ভারতের দ্য টাইমস গ্রুপ হিসেবে পরিচিত বেনেট কোম্যান এবং কম লিমিটেড।
Prediction Without LM :  টাইমস ব্যাংক প্রতিষ্ঠা করেছিলেন ভারতের দ্যা টাইমস গ্রুপ হিসেবে পরিচিতব বেনে পরোম্যান এবং কোম লিমেটেড।
Word Error Rate with LM :  0.2
Word Error Rate without LM :  0.4


In [30]:
idx = 13
example(idx)

Ground Truth :  তার বাবা লুৎফর রহমান সেখানে একটি বেসরকারি ফার্মে একজন নিরীক্ষণ কর্মকর্তা হিসেবে কর্মরত ছিলেন।
Prediction With LM :  তার বাবা লুৎফর রহমান সেখানে একটি বেসরকারি ফার্মে একজন নিরীক্ষণ কর্মকর্তা হিসেবে কর্মরত ছিলেন।
Prediction Without LM :  তার বাবা লুৎফুর রহমান শেখানে একটি বেসরকারি ফার্মে একজন নিরীক্ষণ কর্মকর্তা হিসেবে কর্মরত ছিলেন।
Word Error Rate with LM :  0.0
Word Error Rate without LM :  0.14285714285714285


As the examples above show, the model with the LM makes fewer word errors than the model without it!