# Overview

With the introduction of the [Samanantar dataset](https://indicnlp.ai4bharat.org/samanantar/), Ramesh et al. have released a parallel corpus for 11 Indic languages with over 49.7 million sentence pairs which can be used for training Indic NMT models. 

To facilitate English to Hindi MT, we will train the NMT model on the en-hi subset of the Samanantar dataset using NVIDIA NeMo. NVIDIA NeMo is an open-source conversational AI toolkit that allows developers to build and train state-of-the-art models.

The NMT model we will train is based on the Transformer architecture, a powerful seq2seq modeling architecture described by Vaswani et al. in the paper *Attention Is All You Need*.

Samanantar is the largest available parallel corpus supporting 11 Indic languages, containing over 49.7M sentence pairs. The data includes samples from previously available datasets and samples mined from the web. In this blog, we will be using the En-Hi pair containing over 8.46M samples. For our experiments, we will use a 95:5 train-val split to train our models, which will then be evaluated on standard test sets. The dataset contains sentences with word counts ranging from one to thousands.


## Downloading dataset

In [None]:
# Set the path where we will download the dataset
dataset_path = '../dataset/'
%env dataset_path = {dataset_path}
!mkdir -p $dataset_path

In [None]:
# Downloading from Samanantar's website
!wget https://storage.googleapis.com/samanantar-public/V0.2/data/en2indic/en-hi.zip -P $dataset_path
!unzip $dataset_path/en-hi.zip -d $dataset_path

## Installing and importing libraries

In [None]:
%%capture
!pip install seaborn
!pip install wordcloud
!pip install inltk
!pip install indic-nlp-library

In [None]:
# One time setup for inltk
from inltk.inltk import setup
setup('hi')

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.core.display import HTML
from sklearn.model_selection import train_test_split

pd.set_option("display.precision", 10)

In [None]:
# Declaring variables with paths to the dataset
dataset_root = dataset_path + "en-hi/"
english_set = dataset_root + "train.en"
hindi_set = dataset_root + "train.hi"

## Reading English dataset

In [None]:
with open(english_set) as f:
    X = f.readlines()

print("Total Number of English Sentences are : ", len(X))

print("\nHere are some Sample English Sentences\n")

for i, sentence in enumerate(X[1:5]):
    print("Sentence Number ", i+1, " : ", sentence)

## Reading Hindi dataset

In [None]:
with open(hindi_set) as f:
    Y = f.readlines()

print("Total Number of Hindi Sentences are : ", len(Y))
print("\nHere are the corresponding Hindi Translations\n")

for i, sentence in enumerate(Y[1:5]):
    print("Sentence Number ", i+1, " : ", sentence)

## Preliminary statistical analysis

### Distribution of sentence lengths

The maximum input sequence length for transformer models has to be fixed. To decide on a maximum input sequence length for training, we will calculate the maximum, minimum, and average sentence length in both the English and the Hindi corpora. Note that `len()` on a string counts characters, so all lengths in this section are measured in characters (including the trailing newline), not in words.

In [None]:
english_set_lengths = []
hindi_set_lengths = []

for eng_sentence in X:
    english_set_lengths.append(len(eng_sentence))

for hin_sentence in Y:
    hindi_set_lengths.append(len(hin_sentence))

print("The Maximum Length of a Single English sentence is %d, The Minimum length is %d and the Average is around %f" %(
    max(english_set_lengths), min(english_set_lengths), np.mean(english_set_lengths)))

print("The Maximum Length of a Single Hindi sentence is %d, The Minimum length is %d and the Average is around %f" %(
    max(hindi_set_lengths), min(hindi_set_lengths), np.mean(hindi_set_lengths)))
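
Character counts and word counts can differ a lot, so it helps to be explicit about which one is being measured. The following is a minimal sketch on two made-up sentences (not taken from Samanantar); the helper name `length_stats` is hypothetical and only serves the illustration.

```python
# Toy sentences, made up for illustration (not from the Samanantar corpus)
samples = ["This is a short sentence.\n", "नमस्ते दुनिया\n"]

def length_stats(lines):
    """Return a (character_count, word_count) tuple for every line.

    The trailing newline is stripped first so that it is not counted.
    """
    stats = []
    for line in lines:
        text = line.rstrip("\n")
        stats.append((len(text), len(text.split())))
    return stats

stats = length_stats(samples)
for text, (n_chars, n_words) in zip(samples, stats):
    print(repr(text), "-> characters:", n_chars, "| words:", n_words)
```

Note that for Devanagari text, `len()` counts Unicode code points, so vowel signs and the virama are counted individually.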

### Distribution of lengths vs corresponding sentence counts

In [None]:
lengths = [100, 200, 400, 500, 750, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]
len_count = []

for j in range(len(lengths)):
    if j == 0:
        len_count.append(len([i for i in english_set_lengths if (i <= lengths[j])]))
    else:
        len_count.append(len([i for i in english_set_lengths if (i <= lengths[j] and i > lengths[j-1])]) + len_count[j-1])

percentage = list(np.array(len_count)/len(X))

distribution = pd.DataFrame(list(zip(lengths,len_count, percentage)), columns = ['Lengths', 'Count (<=)', 'Percentage of Dataset'])
distribution

### Number of English sentences with length < 5000 

In [None]:
eng_u5000 = [i for i in english_set_lengths if i < 5000]
print("Number of sentences with length < 5000 : ", len(eng_u5000))

eng_df = pd.DataFrame(eng_u5000, columns =['Length'])

sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.histplot(data = eng_df['Length'], kde = True)

### Number of English sentences with length < 1000

In [None]:
eng_u1000 = [i for i in english_set_lengths if i < 1000]
print("Number of sentences with length < 1000 : ", len(eng_u1000))

eng_df2 = pd.DataFrame(eng_u1000, columns =['Length'])

sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.histplot(data = eng_df2['Length'], kde = True)

### Number of Hindi sentences with length < 2000

In [None]:
hin_u2000 = [i for i in hindi_set_lengths if i < 2000]
print("Number of Hindi sentences with length < 2000 : ", len(hin_u2000))

hin_df = pd.DataFrame(hin_u2000, columns =['Length'])

sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.histplot(data = hin_df['Length'], kde = True)

The Samanantar dataset provides untokenized and deduplicated data. These sentences cannot be passed directly as inputs to the model. As with every deep learning training regime, we will pre-process the dataset before training the model. The preprocessing steps are normalization, lowercasing, length filtering, and tokenization.

## Data pre-processing

### Data normalization

When we normalize text, we attempt to reduce its randomness, bringing it closer to a predefined “standard”. This helps us to reduce the amount of different information that the computer has to deal with, and therefore improves efficiency. 
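
To make the idea concrete, here is a minimal sketch of text normalization using only Python's standard `unicodedata` module. It is not the Moses or IndicNLP normalizer used in this notebook, and the helper name `normalize_text` is an assumption for illustration. It folds compatibility characters (such as non-breaking spaces and full-width punctuation) into their standard forms and collapses repeated whitespace.

```python
import unicodedata

def normalize_text(text):
    """Apply Unicode NFKC normalization, then collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

# Input with a non-breaking space (U+00A0), extra spaces,
# and a full-width exclamation mark (U+FF01)
raw = "Hello\u00a0  world\uff01"
print(normalize_text(raw))
```

Unlike the Moses script, this sketch does not unify quotation styles; it only shows how several visually similar inputs can be mapped to one standard form.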



#### English set

For normalizing the English set, we will use the normalization script developed by [Moses](https://github.com/moses-smt/mosesdecoder). <br>
This script removes extra spaces, normalizes Unicode punctuation, and handles pseudo-spaces and different quotation styles (French, German, Spanish).

In [None]:
english_sentences = pd.DataFrame(X, columns = ['Text'])
english_sentences.head()

In [None]:
!perl normalize-punctuation.perl -l en < $dataset_path/en-hi/train.en > $dataset_path/en-hi/train.normalized.en

In [None]:
english_set_normalized = dataset_root + '/train.normalized.en'

with open(english_set_normalized) as f:
    X_norm = f.readlines()

#### Hindi set

For normalizing the Hindi set, we will use the Devanagari normalizer provided by the [IndicNLP library](https://indic-nlp-library.readthedocs.io/en/latest/indicnlp.normalize.html). <br>
This function replaces composite characters containing nuktas with their decomposed form, replaces the pipe character '|' with the poorna virama character, and replaces the colon ':' with the visarga.

In [None]:
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer('hi')

**Note:** This step might take a few minutes to complete

In [None]:
Y_normalized = []
for index, input_text in enumerate(Y):
    output_text=normalizer.normalize(input_text)
    Y_normalized.append(output_text)

print("Sample Sentences : \n")
print(Y_normalized[1:5])

In [None]:
hindi_set_normalized = dataset_root + '/train-normalized.hi'

with open(hindi_set_normalized, "w") as output:
    output.writelines(Y_normalized)

In [None]:
with open(hindi_set_normalized) as f:
    Y_normalized = f.readlines()

### Lowercase conversion

The English set is lowercased to reduce the token vocabulary size. The vocabulary size is the count of unique tokens that are used to train an NLP model. Lowercasing the English samples combines the two tokens “European” and “european”, thus decreasing the vocabulary size. 

Lowercasing is skipped for the Hindi set because the Devanagari script has no letter case, so it would not change any token.
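
The effect on vocabulary size can be seen on a toy example. The sketch below uses two made-up sentences and a hypothetical helper `vocabulary`; it simply counts unique whitespace-separated tokens before and after lowercasing.

```python
# Two made-up sentences that differ only in letter case
sentences = [
    "The European Union met today .",
    "the european union MET Today .",
]

def vocabulary(lines):
    """Return the set of unique whitespace-separated tokens."""
    return {token for line in lines for token in line.split()}

cased_vocab = vocabulary(sentences)
lowered_vocab = vocabulary(line.lower() for line in sentences)

print("Vocabulary size before lowercasing:", len(cased_vocab))
print("Vocabulary size after lowercasing: ", len(lowered_vocab))

# Devanagari has no letter case, so lower() leaves Hindi text unchanged
hindi_word = "भारत"
print(hindi_word.lower() == hindi_word)
```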

In [None]:
X_norm[0:5]

In [None]:
X_lower = [sent.lower() for sent in X_norm]

In [None]:
X_lower[0:5]

### Length filtering

We will perform length filtering on the English and Hindi sets with a filter length of 1000 characters. <br>
Sentences that are longer than the maximum input sequence length are typically rejected or chopped short. Because the word order changes from subject-verb-object to subject-object-verb in En-Hi translation, cutting a sentence midway could result in a loss of meaning. <br>
As a result, length filtering aids in the removal of these lengthy sentences and helps the model to learn from meaningful samples.

In [None]:
X_filtered = []
Y_filtered = []
# Set the filtering length
filtering_length = 1000

for index, sentence in enumerate(X_lower):
    if len(sentence) < filtering_length and len(Y_normalized[index]) < filtering_length:
        X_filtered.append(sentence)
        Y_filtered.append(Y_normalized[index])

if len(X_filtered) == len(Y_filtered):
    print("Finally, the number of sentences are : ", len(X_filtered))

print("\nHere are some sample sentences : \n")
for i in range(2):
    print(X_filtered[i])
    print(Y_filtered[i], "\n")

In [None]:
english_set_filtered = dataset_root + "train.filtered_{0}.en".format(filtering_length)
%env english_set_filtered = {english_set_filtered}
hindi_set_filtered = dataset_root + "train.filtered_{0}.hi".format(filtering_length)

with open(english_set_filtered, "w") as output:
    output.writelines(X_filtered)
    
with open(hindi_set_filtered, "w") as output:
    output.writelines(Y_filtered)

### Tokenization for data analysis

Tokenization splits the input sentence into a list of tokens, where a token is a word, a part of a word, or a punctuation mark. A variety of tokenizers are available, each with its own set of tokenization principles.

When the same input is fed to multiple tokenizers, the output can vary based on the operating principles of each tokenizer. 
There are different methods to tokenize a sentence, such as:
- white space tokenization - a sentence is split into tokens on white spaces
- dictionary-based tokenization - tokens are found based on existing tokens in a dictionary
- rule-based tokenization - the rules are based on the grammar of a particular language
- sub-word tokenization - the less frequent words are split into subwords
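
As a rough illustration of how the first and third methods differ, the sketch below compares white space tokenization with a very simple rule-based tokenizer built on a regular expression. Both function names are hypothetical, and the single regex rule is an assumption chosen for Latin-script text; it is far simpler than the rules in Moses or IndicNLP and is not suitable for Devanagari as written.

```python
import re

def whitespace_tokenize(sentence):
    """Split on runs of white space only; punctuation stays attached to words."""
    return sentence.split()

def rule_based_tokenize(sentence):
    """Toy rule: a token is either a run of word characters
    or a single non-space, non-word character (punctuation)."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "Hello, world! It's 2021."
print(whitespace_tokenize(sentence))
print(rule_based_tokenize(sentence))
```

The same sentence yields 4 tokens with the first method and 9 with the second, which is why sequence lengths and vocabulary depend on the tokenizer chosen.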

The table below summarizes the tokenizers that we evaluated.

| English      | Hindi |
| ----------- | ----------- |
| Moses      | IndicNLP       |
| OpenNMT   | iNLTK        |
| SentencePiece   | Moses        |
| NLTK   | OpenNMT        |
| Gruut   | CLTK        |

The effect of various source and target side tokenizers can be studied using the BLEU score of the trained model on the validation dataset. In order to find the best performing pair of tokenizers, we experimented with different combinations of the tokenizers mentioned above for English and Hindi and inferred that Moses-Moses works best.

#### English set tokenization

In [None]:
# Specify the path to save the tokenized dataset
english_set_tokenized = dataset_root + "train.filtered_{0}.tokenized_moses.en".format(filtering_length)
%env english_set_tokenized = {english_set_tokenized}

**Note:** This step might take a few minutes to complete

In [None]:
!perl tokenizer.perl -l en -no-escape < $english_set_filtered > $english_set_tokenized

#### Hindi set tokenization

In [None]:
from indicnlp.tokenize import sentence_tokenize
from indicnlp.tokenize import indic_tokenize  

**Note:** This step might take a few minutes to complete

In [None]:
Y_tokenized = Y_filtered.copy()

for index2, indic_string in enumerate(Y_filtered):
    Y_tokenized[index2] = " ".join(indic_tokenize.trivial_tokenize(indic_string))

In [None]:
# Specify the path to save the tokenized dataset
hindi_set_tokenized = dataset_root + "train.filtered_{0}.tokenized_indicnlp.hi".format(filtering_length)

with open(hindi_set_tokenized, "w") as output:
    output.writelines(Y_tokenized)

### Comparison of Hindi tokenizers

Let's explore the output of two different tokenizers, IndicNLP and iNLTK, to better understand these tokenization algorithms.

In [None]:
string = Y[1723]
print(string)

#### Indic NLP

In [None]:
from indicnlp.tokenize import indic_tokenize 

In [None]:
tokens_indicnlp = indic_tokenize.trivial_tokenize(string)
print(len(tokens_indicnlp))

#### iNLTK

In [None]:
from inltk.inltk import tokenize

In [None]:
tokens_inltk = tokenize(string ,'hi') 
print(len(tokens_inltk))

In [None]:
set(tokens_indicnlp).intersection(set(tokens_inltk))

In [None]:
print(tokens_indicnlp)

In [None]:
print(tokens_inltk)

In [None]:
tokens_inltk_filt = [e[1:] if e[0] == '▁' else e for e in tokens_inltk]

In [None]:
print(tokens_inltk_filt)

In [None]:
tokens_indicnlp[:-1] == tokens_inltk_filt

In [None]:
intersection = set(tokens_indicnlp).intersection(set(tokens_inltk_filt))
print(len(intersection))
print(intersection)

In the above example, the iNLTK tokenizer splits the words गुलामों and सलूक into गुलाम, ो and सल, ूक respectively, whereas the IndicNLP tokenizer keeps them whole. The IndicNLP tokenizer also retains the newline character "\n" at the end of each tokenized sentence. Thus, the same input sentence has two different tokenized outputs with varying sequence lengths of 15 and 16 respectively.


## Data exploration

In [None]:
X_tokenized = []
with open(english_set_tokenized) as f:
    X_tokenized = f.readlines()

Y_tokenized = []
with open(hindi_set_tokenized) as f:
    Y_tokenized = f.readlines()

In [None]:
print("English Tokenized : ")
print(X_tokenized[121].split())

In [None]:
print("Hindi Tokenized : ")
print(Y_tokenized[121].split())

### Wordclouds

A word cloud is a visual representation of text data. It emphasizes the importance of each word using font size and color. The font size depicts the relative frequency of occurrence of each word in the dataset. Let’s look at the word cloud representation of both the English and the Hindi text of our corpora.


#### English

In [None]:
from wordcloud import WordCloud 

# Join all tokenized sentences into one string for the word cloud
english_sentences = "".join(X_tokenized)

**Note:** This step might take a few minutes to complete!

In [None]:
wc = WordCloud(width=1200, height=1160, max_words=500,colormap="Dark2").generate(english_sentences)

plt.figure(figsize=(20,16))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.title("English wordcloud",fontsize=13)
plt.show()

#### Hindi

In [None]:
# Join all tokenized sentences into one string for the word cloud
hindi_sentences = "".join(Y_tokenized)

**Note:** This step might take more than 10 minutes to complete!

In [None]:
wordcloud = WordCloud(font_path='Lohit-Devanagari.ttf',width=1200, height=1160, max_words=500,colormap="Dark2").generate(hindi_sentences)

plt.figure(figsize=(20,16))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Hindi Wordcloud",fontsize=13)
plt.show()

In the word clouds above, you will observe the frequently occurring words in English and Hindi (excluding stopwords like “a”, “an”, “is”, “the”, etc.). The correspondence between the English and Hindi words can also be examined. For example, "india" and "भारत" both have a similar frequency in the corpora.

### Token count

Let's plot the frequency of tokens in both sets. For the English set, we first remove stop words, punctuation, and digits.

In [None]:
from nltk.probability import FreqDist
import nltk
import string

nltk.download('stopwords')
from nltk.corpus import stopwords

In [None]:
final_tokens = []
for sent in X_tokenized:
    final_tokens.extend(sent.split())

print(len(final_tokens))

In [None]:
remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits))

In [None]:
filtered_text = [w for w in final_tokens if w not in remove_these]
fdist_filtered = FreqDist(filtered_text)

In [None]:
print("Total number of tokens : ", fdist_filtered.N())
print("Total number of unique tokens : ", len(fdist_filtered))

In [None]:
fdist_filtered.plot(30,title='Frequency distribution for 30 most common tokens (excluding stopwords)')

In [None]:
final_tokens_hindi = []
for sent in Y_tokenized:
    final_tokens_hindi.extend(sent.split())

print(len(final_tokens_hindi))

In [None]:
fdist_hin = FreqDist(final_tokens_hindi)
fdist_hin

In [None]:
print("Total number of tokens : ", fdist_hin.N())
print("Total number of unique tokens : ", len(fdist_hin))

In [None]:
hindi_tokens = {}

for token in final_tokens_hindi:
    if token in hindi_tokens:
        hindi_tokens[token] += 1
    else:
        hindi_tokens[token] = 1 

In [None]:
sorted_hindi_tokens = sorted(hindi_tokens.items(), key=lambda x: x[1], reverse=True)

In [None]:
dict_for_plot = {}
for x in sorted_hindi_tokens[0:30]:
    dict_for_plot[x[0]] = x[1]

dict_for_plot.keys()    

In [None]:
from matplotlib.font_manager import FontProperties
fp = FontProperties(fname='Lohit-Devanagari.ttf')

fig, ax = plt.subplots(figsize = (15, 6))
idx = np.asarray([i for i in range(30)])

ax.bar(idx, [val for key,val in sorted(dict_for_plot.items(), key=lambda x: x[1], reverse=True)], width=1)
ax.set_xticks(idx)
ax.set_xticklabels(list(dict_for_plot.keys()), font=fp)
ax.set_xlabel('Hindi tokens')
ax.set_ylabel('Number of occurrences')
fig.tight_layout()
plt.show()

## Train and validation split

We will use a 95:5 train-val split to train our models, which will then be evaluated on standard test sets like WAT and WMT.

We are setting aside 5% of the dataset (8.4M samples) for validation, i.e. 0.42M samples, which is quite a lot.<br>
Thus, we recommend experimenting with `val_ratio`: a smaller value increases the training set size, which might improve performance.
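
To see how `val_ratio` changes the split sizes while keeping sentence pairs aligned, here is a minimal pure-Python sketch on made-up data. The function `paired_split` is hypothetical and is not scikit-learn's implementation; it only mirrors the idea of shuffling both sides with the same indices, and it returns the lists in the same order as `train_test_split`.

```python
import random

def paired_split(source, target, val_ratio, seed=1):
    """Shuffle paired lists with shared indices and split off a validation set."""
    assert len(source) == len(target), "source and target must be parallel"
    indices = list(range(len(source)))
    random.Random(seed).shuffle(indices)

    n_val = int(round(len(indices) * val_ratio))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(source, train_idx), pick(source, val_idx),
            pick(target, train_idx), pick(target, val_idx))

# Made-up parallel data: line i of src corresponds to line i of tgt
src = [f"en-{i}\n" for i in range(1000)]
tgt = [f"hi-{i}\n" for i in range(1000)]

for ratio in (0.05, 0.01):
    s_train, s_val, t_train, t_val = paired_split(src, tgt, ratio)
    print(f"val_ratio={ratio}: train={len(s_train)}, val={len(s_val)}")
```

Because the same shuffled indices are applied to both lists, every English sentence stays paired with its Hindi translation in both splits.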

In [None]:
val_ratio = 0.05

X_train, X_val, Y_train, Y_val = train_test_split(X_tokenized, Y_tokenized, test_size=val_ratio, random_state=1, shuffle = True)

In [None]:
with open(dataset_root + "/final_train_norm_lf_{0}_tk_moses.en".format(filtering_length), "w") as output:
    output.writelines(X_train)

In [None]:
with open(dataset_root + "/final_train_norm_lf_{0}_tk_indicnlp.hi".format(filtering_length), "w") as output:
    output.writelines(Y_train)

In [None]:
with open(dataset_root + "/final_val_norm_lf_{0}_tk_moses.en".format(filtering_length), "w") as output:
    output.writelines(X_val)

In [None]:
with open(dataset_root + "/final_val_norm_lf_{0}_tk_indicnlp.hi".format(filtering_length), "w") as output:
    output.writelines(Y_val)

In [None]:
print("Data analysis and preparation is done")