# Cloning and setting up

Clone the repository and change into its directory. To keep this notebook functional as the repository evolves, we check out a known working commit, [`ec27a2c`](https://github.com/jerinphilip/ilmulti/commit/ec27a2c19ecf06991fea55a8a1d34617a07c1d87).

In [None]:
!rm -rf ilmulti
!git clone https://github.com/jerinphilip/ilmulti
%cd ilmulti/
!git checkout ec27a2c # pin to a known working commit

Cloning into 'ilmulti'...
remote: Enumerating objects: 278, done.
remote: Counting objects: 100% (278/278), done.
remote: Compressing objects: 100% (203/203), done.
remote: Total 950 (delta 135), reused 195 (delta 73), pack-reused 672
Receiving objects: 100% (950/950), 5.49 MiB | 8.67 MiB/s, done.
Resolving deltas: 100% (535/535), done.
/content/ilmulti
Note: checking out 'ec27a2c'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at ec27a2c Reducing default model downloads
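
To confirm that the checkout landed on the intended commit, something like the following could be run from inside the repository. It is a minimal sketch: the helper names `matches_pin` and `current_head` are hypothetical, and it assumes `git` is available on the PATH.

```python
import subprocess

PINNED_COMMIT = "ec27a2c"  # short hash used in the checkout above


def matches_pin(head_sha: str, pinned: str = PINNED_COMMIT) -> bool:
    """Return True if the full HEAD hash starts with the pinned short hash."""
    return head_sha.strip().startswith(pinned)


def current_head(repo_dir: str = ".") -> str:
    """Ask git for the full hash of HEAD in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

Calling `matches_pin(current_head())` from within `ilmulti/` should then return `True` if the pin took effect.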


# Changes to enable translation

We rewrite `requirements.txt` so that the components needed for translation, which are otherwise commented out, are enabled. These are:

* [fairseq-ilmt@lrec-2020](https://github.com/jerinphilip/fairseq-ilmt/tree/lrec-2020)
* torch

In [None]:
new_requirements = """
langid
sentencepiece
nltk

# Optional, tokenizers, work without these as well.
git+https://github.com/jerinphilip/fairseq-ilmt@lrec-2020
torch==1.1.0
"""

with open("requirements.txt", 'w+') as fp:
  fp.write(new_requirements)

!cat requirements.txt


langid
sentencepiece
nltk

# Optional, tokenizers, work without these as well.
git+https://github.com/jerinphilip/fairseq-ilmt@lrec-2020
torch==1.1.0
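
Instead of overwriting the file wholesale, the same effect could be had by uncommenting only the relevant lines. The sketch below is one hypothetical way to do that: `uncomment_matching` is not part of ilmulti, the input list is illustrative rather than the actual file contents, and it assumes the optional entries are disabled with a leading `#`.

```python
def uncomment_matching(lines, keywords):
    """Strip the leading '#' from commented lines that mention any keyword."""
    result = []
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith("#") and any(k in stripped for k in keywords):
            # Drop the comment marker and the whitespace that follows it.
            result.append(stripped.lstrip("#").lstrip())
        else:
            result.append(line)
    return result


# Illustrative input, not the actual contents of requirements.txt.
original = [
    "langid",
    "# git+https://github.com/jerinphilip/fairseq-ilmt@lrec-2020",
    "# torch==1.1.0",
    "# Optional, tokenizers, work without these as well.",
]
enabled = uncomment_matching(original, keywords=["fairseq-ilmt", "torch"])
print("\n".join(enabled))
```

Matching on keywords leaves unrelated comments untouched, which keeps the file closer to its upstream form.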


# Install prerequisites and ilmulti

With the modified `requirements.txt` in place, run `pip install` to set up the environment, then install ilmulti itself.

In [None]:
%%capture
!python3 -m pip install -r requirements.txt
!python3 setup.py install
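
Because `%%capture` hides the installer output, a quick check that the packages are actually importable can be useful. The following is a minimal sketch; `missing_modules` is a hypothetical helper, and the top-level module names are assumed to correspond to the requirements listed earlier.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level module names assumed to correspond to the requirements above.
required = ["langid", "sentencepiece", "nltk", "fairseq", "torch", "ilmulti"]
print("Missing:", missing_modules(required))
```

An empty list suggests the installation succeeded; any name printed points at a package worth reinstalling without `%%capture` to see the error.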

# Download the pretrained models and set them up in $HOME

The following convenience script sets up the models in the `$HOME/.ilmulti` directory. It fetches the pretrained models and their respective fairseq dictionaries and places them in predefined locations there.

This version downloads only M2M-1 and M2EN-3 from [Revisiting Low Resource Status of Indian Languages in Machine Translation](https://arxiv.org/abs/2008.04860).

In [None]:
!bash scripts/download-and-setup-models.sh

+ SEVEN_MODELS=()
+ ELEVEN_MODELS=("mm-all-iter1")
+ MODELS=("${SEVEN_MODELS[@]}" "${ELEVEN_MODELS[@]}")
+ SAVE_DIR=/root/.ilmulti
+ BASE_URL=http://preon.iiit.ac.in/~jerin/resources/models
+ mkdir -p /root/.ilmulti/
+ echo 'Downloading models'
Downloading models
+ for MODEL in ${MODELS[@]}
+ MODEL_DIR=/root/.ilmulti/mm-all-iter1
+ mkdir -p /root/.ilmulti/mm-all-iter1
+ wget --continue http://preon.iiit.ac.in/~jerin/resources/models/mm-all-iter1 -O /root/.ilmulti/mm-all-iter1/checkpoint_last.pt
--2020-10-22 08:36:34--  http://preon.iiit.ac.in/~jerin/resources/models/mm-all-iter1
Resolving preon.iiit.ac.in (preon.iiit.ac.in)... 196.12.53.50
Connecting to preon.iiit.ac.in (preon.iiit.ac.in)|196.12.53.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 781024711 (745M)
Saving to: ‘/root/.ilmulti/mm-all-iter1/checkpoint_last.pt’


2020-10-22 08:43:20 (1.84 MB/s) - ‘/root/.ilmulti/mm-all-iter1/checkpoint_last.pt’ saved [781024711/781024711]

+ for MODEL in ${ELEVEN_M
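
Since the checkpoint is large and fetched over plain HTTP, it may be worth checking that the file arrived intact before moving on. This is a minimal sketch: `checkpoint_ok` is a hypothetical helper, and the path and byte count are taken from the `wget` log above.

```python
import os


def checkpoint_ok(path, expected_size):
    """Return True if path is a file of exactly expected_size bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_size


# Path and size taken from the wget log above.
checkpoint = os.path.expanduser("~/.ilmulti/mm-all-iter1/checkpoint_last.pt")
print(checkpoint_ok(checkpoint, expected_size=781024711))
```

A size check only catches truncated downloads; it says nothing about corrupted content.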

# Training begins

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!bash fairseq-ilmt/cmd.sh

# Sentence-wise BLEU after cyclic backtranslation

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

nltk.download('punkt')

# Backtranslated sentences are the hypotheses; fill in the file path.
with open("", encoding='utf-8') as backtranslated_sentences:
  hyp_list = list(backtranslated_sentences)

# Original sentences are the references; fill in the file path.
with open("", encoding='utf-8') as original_sentences:
  ref_list = list(original_sentences)

# Sentence-level BLEU of each backtranslation against its original.
bleu_list = []
for hyp, ref in zip(hyp_list, ref_list):
  hyp_tokens = word_tokenize(hyp)
  ref_tokens = word_tokenize(ref)
  bleu_list.append(sentence_bleu([ref_tokens], hyp_tokens))

# Intermediate translations; fill in the file path.
with open("", encoding='utf-8') as translated_sentences:
  translated_list = list(translated_sentences)
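
To make clear what the score above measures, here is a simplified, dependency-free illustration of sentence-level BLEU for a single reference: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It is a sketch for intuition only, not NLTK's implementation; the function names are hypothetical, and it applies no smoothing, so it returns 0 whenever some n-gram order has no match.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(ref_tokens, hyp_tokens, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hyp_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


def simple_sentence_bleu(ref_tokens, hyp_tokens, max_n=4):
    """Unsmoothed single-reference BLEU with uniform n-gram weights."""
    precisions = [modified_precision(ref_tokens, hyp_tokens, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hyp_tokens) > len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    return brevity_penalty * math.exp(log_mean)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()
print(simple_sentence_bleu(reference, hypothesis))
```

Short sentences often lack any matching 4-gram, which drives an unsmoothed score to zero; a smoothing function may be worth considering if many backtranslations are short.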

In [None]:
import pandas as pd

pairWise_BLEU = pd.DataFrame(list(zip(ref_list, hyp_list, bleu_list)), columns=['eng', 'hindi', 'bleu'])
pairWise_BLEU.drop_duplicates(subset='eng', keep='first', inplace=True)
pairWise_BLEU.drop_duplicates(subset='hindi', keep='first', inplace=True)

# Deciling the BLEU Scores


In [None]:
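import numpy as np

path = ""
# Assumption: en_hi_bt is the deduplicated DataFrame built above (pairWise_BLEU).
en_hi_bt = pairWise_BLEU

# For each threshold from 0.5 to 0.9, keep the pairs whose BLEU exceeds it.
for i in np.linspace(0.5, 1, 5, endpoint=False):
  print(i)
  temp = en_hi_bt[en_hi_bt['bleu'].astype(float) > i].copy()
  print(temp.shape)
  eng_temp = list(temp['eng'])
  hindi_temp = list(temp['hindi'])
  # Lines keep their trailing newline from the input files, so none is added here.
  with open(path + "train_" + str(i) + ".en", 'w', encoding='utf-8') as f:
    for item in eng_temp:
      f.write("%s" % item)
  with open(path + "train_" + str(i) + ".hi", 'w', encoding='utf-8') as f:
    for item in hindi_temp:
      f.write("%s" % item)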
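
The filtering above uses fixed BLEU cut-offs rather than quantiles, so each higher threshold yields a subset of the previous one. A small pure-Python sketch makes that nesting visible; `keep_above` is a hypothetical helper and the scores are made up purely for illustration.

```python
def keep_above(pairs, threshold):
    """Return the (src, tgt) pairs whose BLEU score is strictly above threshold."""
    return [(src, tgt) for src, tgt, bleu in pairs if bleu > threshold]


# Made-up scores, purely for illustration.
scored = [
    ("src 1", "tgt 1", 0.95),
    ("src 2", "tgt 2", 0.72),
    ("src 3", "tgt 3", 0.55),
    ("src 4", "tgt 4", 0.30),
]
for threshold in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(threshold, len(keep_above(scored, threshold)))
```

Stricter thresholds trade corpus size for round-trip fidelity, so the resulting training sets shrink as the cut-off rises.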

# =========================================

## Demo Inference of the original Multilingual NMT model 

In [None]:
from ilmulti.translator import from_pretrained

translator = from_pretrained(tag='mm-all-iter1')

samples = [
    ("The quick brown fox jumps over the lazy dog.", 'en', 'hi'),
    ("An apple a day keeps the doctor away. He is going.", 'en', "hi"),
    ("This document is being produced at the behest of the perpetrator", 'en', "te"),
    ("वह जा रहा है।", "hi", "ml")

]

for idb, (sample, src_lang, tgt_lang) in enumerate(samples, 1):
  translation = translator(sample, tgt_lang=tgt_lang, src_lang=src_lang)
  print('---', idb)
  for idx, segment in enumerate(translation, 1):
    print(idx, '>', segment['src'])
    print(idx, '<', segment['tgt'])
  print('---')
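
When only the translated text is needed, the per-segment output can be joined back into one string. The sketch below assumes, as the loop above does, that each segment exposes `'src'` and `'tgt'` keys; `join_targets` is a hypothetical helper and the data is a stand-in rather than real translator output.

```python
def join_targets(segments, sep=" "):
    """Concatenate the 'tgt' field of each segment into one string."""
    return sep.join(segment["tgt"] for segment in segments)


# Stand-in for a translator result; real output would come from translator(...).
fake_translation = [
    {"src": "An apple a day keeps the doctor away.", "tgt": "<target 1>"},
    {"src": "He is going.", "tgt": "<target 2>"},
]
print(join_targets(fake_translation))
```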