# Machine Translation using Transformers

Language translation is one of the most important tasks in NLP. As people become more connected, translation helps them understand each other better.

Machine translation can be formally defined as the process of reproducing human translation using machine transduction. (Transduction means taking structures we perceive and representing them in another form.)

In real life, this is a complicated task for many reasons. For example, say there is a sentence A written in some language, and we need to translate it into another language. A human translator translates sentence A and produces an output. But would another human translator give the same translation of sentence A? Probably not! Yet both translations would carry the same meaning. So when we build a translation model, it should be able to handle this kind of variation as well.

To see how such a task is achieved, we will look at translating between French and English. We will be using the WMT (Workshop on Machine Translation) dataset provided at [this url](https://www.statmt.org/europarl/v7/fr-en.tgz), which is the Europarl v7 French-English corpus.


In [1]:
from support_scripts.preprocess import *
import pickle
from pickle import dump

First we clean up the English data.

In [2]:
raw_data = load_data('data/translation_data/europarl-v7.fr-en.en')
sentences = get_sentence_list(raw_data)
min_len, max_len = get_sentence_size_range(sentences)
cleaned_sentences = clean_sentences(sentences)

# Use a context manager so the file is closed even if pickling fails.
with open('data/translation_data/English.pkl', 'wb') as outfile:
    pickle.dump(cleaned_sentences, outfile)
print("data saved")

data saved


We repeat the same process for the French data.

In [3]:
raw_data = load_data('data/translation_data/europarl-v7.fr-en.fr')
sentences = get_sentence_list(raw_data)
min_len, max_len = get_sentence_size_range(sentences)
cleaned_sentences = clean_sentences(sentences)
# Use a context manager so the file is closed even if pickling fails.
with open('data/translation_data/French.pkl', 'wb') as outfile:
    pickle.dump(cleaned_sentences, outfile)
print("data saved")

data saved
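
The helpers in `support_scripts.preprocess` are not shown here. As a rough idea of what a cleaning step like `clean_sentences` might involve, here is a minimal sketch. The function name `clean_sentence_sketch` and the specific rules (lowercasing, stripping ASCII punctuation, dropping non-alphabetic tokens) are assumptions for illustration, not the actual implementation.

```python
import string
import unicodedata

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_sentence_sketch(sentence):
    """Hypothetical cleaner: normalize, lowercase, strip punctuation, keep alphabetic tokens."""
    # Normalize Unicode so accented characters have one canonical form.
    sentence = unicodedata.normalize('NFC', sentence)
    tokens = sentence.lower().split()
    # Remove ASCII punctuation from each token.
    tokens = [tok.translate(_PUNCT_TABLE) for tok in tokens]
    # Keep only purely alphabetic tokens (drops numbers and empty strings).
    tokens = [tok for tok in tokens if tok.isalpha()]
    return ' '.join(tokens)


print(clean_sentence_sketch("Resumption of the session, 17 December 1999!"))
# resumption of the session december
```

Note that a rule like this also glues French elisions together (`l'heure` becomes `lheure`), so a real pipeline may want language-specific handling.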


Now we have completed the basic preprocessing. But this is not enough for NLP tasks. We also need a few more steps, such as building the vocabulary with word counts and removing unnecessary (very rare) tokens.

In [14]:
from support_scripts.further_processing import *

In [16]:
english_data = load_cleaned_data('data/translation_data/English.pkl')
vocab = get_vocabulary(english_data)
trimmed_vocab = trim_vocabulary(vocab)
data = update_dataset(english_data, trimmed_vocab)
save_processed_data(data, 'data/translation_data/english_cleaned_sentences.pkl')

Data is saved to disk!


In [17]:
french_data = load_cleaned_data('data/translation_data/French.pkl')
vocab = get_vocabulary(french_data)
trimmed_vocab = trim_vocabulary(vocab)
data = update_dataset(french_data, trimmed_vocab)
save_processed_data(data, 'data/translation_data/french_cleaned_sentences.pkl')

Data is saved to disk!
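
The helpers in `support_scripts.further_processing` are not listed here, so the following is only a sketch of the general idea behind the vocabulary steps above: count tokens, keep the frequent ones, and filter the sentences. The `*_sketch` function names, the `min_count=2` threshold, and the choice to drop rare tokens (rather than, say, replace them with an unknown-word marker) are all assumptions, and the sentences are assumed to be plain whitespace-separated strings.

```python
from collections import Counter


def build_vocab_sketch(sentences):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def trim_vocab_sketch(counts, min_count=2):
    """Keep only tokens seen at least `min_count` times (threshold is an assumption)."""
    return {tok for tok, n in counts.items() if n >= min_count}


def filter_sentences_sketch(sentences, kept):
    """Drop tokens that are not in the kept vocabulary from every sentence."""
    return [' '.join(t for t in s.split() if t in kept) for s in sentences]


corpus = ['the cat likes milk', 'the dog likes bones', 'the cat sleeps']
counts = build_vocab_sketch(corpus)
kept = trim_vocab_sketch(counts, min_count=2)
print(filter_sentences_sketch(corpus, kept))
# ['the cat likes', 'the likes', 'the cat']
```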


OK, so now the data is ready to be plugged into a transformer model. The idea is that the French dataset contains the sentences to translate, and the English dataset holds the reference translations the model can learn from. The model should then be able to translate French to English based on the given data.

But how can we measure the performance of our translations? As we discussed earlier, human translations of a single sentence may take different forms. The BLEU method was introduced to help with that.

#### BLEU Method

Bilingual Evaluation Understudy, or BLEU, is a metric for evaluating machine translations against one or more human reference translations, which compensates for the issue we discussed earlier. BLEU compares the n-grams of the candidate sentence with those of the references, and combines this with a brevity penalty for candidates that are too short. For more detail, check out this [excellent blog post](https://jaketae.github.io/study/bleu/).
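
To make the n-gram idea concrete, here is a minimal pure-Python sketch of sentence-level BLEU: clipped ("modified") n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The function names are made up for this illustration, and it uses uniform weights and no smoothing, so treat it as a teaching aid rather than a replacement for a library implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(references, candidate):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu_sketch(references, candidate, max_n=4):
    """Unsmoothed BLEU with uniform weights over 1..max_n grams."""
    precisions = [modified_precision(references, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(references, candidate) * math.exp(log_avg)


refs = [['the', 'cat', 'likes', 'milk']]
print(modified_precision(refs, ['the', 'the', 'the', 'the'], 1))  # 0.25
print(bleu_sketch(refs, ['the', 'cat', 'likes', 'milk']))         # 1.0
print(bleu_sketch(refs, ['the', 'cat', 'drinks', 'milk']))        # 0.0
```

The first call shows why clipping matters: repeating `the` four times earns credit for only one match. The last call shows how a single missing higher-order n-gram drives the unsmoothed score to zero.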

We will use the BLEU implementation provided by the NLTK package for some examples.

In [22]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [25]:
#Example 1
reference = [['the', 'cat', 'likes', 'milk'], ['cat', 'likes', 'milk']]
candidate = ['the', 'cat', 'likes', 'milk']
score = sentence_bleu(reference, candidate)
print('Example 1', score)

#Example 2
reference = [['the', 'cat', 'likes', 'milk']]
candidate = ['the', 'cat', 'likes', 'milk']
score = sentence_bleu(reference, candidate)
print('Example 2', score)

Example 1 1.0
Example 2 1.0


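So the above is fine, but there are cases where the candidate has no matching n-grams of some order (for example, no overlapping 4-grams), and the n-gram precision then drives the score to zero. In such cases the NLTK package warns and suggests using a smoothing function.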

These smoothing functions help in cases where some n-grams of the candidate sentence do not appear in the reference. This is not complicated, but I am not particularly interested in studying it for the moment, so I will leave it there!
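
Still, as a small illustration, the snippet below compares an unsmoothed score with one that uses NLTK's `SmoothingFunction().method1`, which adds a small epsilon to n-gram counts that would otherwise be zero. The example sentences are made up, and the exact unsmoothed value depends on the NLTK version, so only the smoothed score is meaningful here.

```python
import warnings

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'likes', 'milk']]
candidate = ['the', 'cat', 'drinks', 'milk']  # shares no 3-grams or 4-grams with the reference

# Without smoothing, NLTK warns about the missing higher-order overlaps,
# so the warning is silenced here to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    unsmoothed = sentence_bleu(reference, candidate)

# method1 replaces zero n-gram match counts with a small epsilon.
smoother = SmoothingFunction()
smoothed = sentence_bleu(reference, candidate,
                         smoothing_function=smoother.method1)

print('unsmoothed:', unsmoothed)
print('smoothed:  ', smoothed)
```

With smoothing, a candidate that gets three out of four words right receives a small but non-zero score instead of collapsing to (near) zero.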

***

So far we have looked at data preparation and a simple evaluation method for translation problems. Now let's move on to the model training process. We will use a library called `Trax` from Google for this. We will not be building a model from scratch, but will use provided weights to initialize our model.

For more details about Trax, visit the [Trax Home Page](https://trax-ml.readthedocs.io/en/latest/index.html).

Unfortunately, Windows does not support `jaxlib`, which is a dependency of the Trax installation, so you will need to run the code in a Linux-based environment (or WSL!).

Check out this [Notebook](https://github.com/PacktPublishing/Transformers-for-Natural-Language-Processing/blob/main/Chapter05/Trax_Translation.ipynb).