# **Neural Machine Translation Hands-on for HimangiY**
#### Vandan Mujadia, Dipti Misra Sharma
#### LTRC, IIIT-Hyderabad, Hyderabad

This notebook demonstrates how to train a sequence-to-sequence (seq2seq) model for English-to-Hindi translation, **roughly** based on [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017).

## An example of sequence-to-sequence processing with a Transformer network

<img src="https://www.tensorflow.org/images/tutorials/transformer/apply_the_transformer_to_machine_translation.gif" alt="Applying the Transformer to machine translation">

Source: [Google AI Blog](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)



## Applying the Transformer to machine translation.


<table>
<tr>
  <td>
   <img width=400 src="https://miro.medium.com/max/720/1*57LYNxwBGcCFFhkOCSnJ3g.png"/>
  </td>
</tr>
<tr>
  <th colspan=1>This tutorial: an encoder/decoder connected by a self-attention neural network.</th>
</tr>
</table>
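
The attention operation at the core of the Transformer can be sketched in a few lines. The following is a minimal pure-Python illustration of scaled dot-product attention, not the fairseq implementation: the function names and the toy vectors are assumptions made for this example, and real models add learned projections, multiple heads, and masking.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: a weighted average of `values`.

    The weights for a query q are softmax(q . k / sqrt(d_k)) over all keys k.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs

# Toy self-attention: three hypothetical "token" vectors attend to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
```

In self-attention the queries, keys, and values all come from the same sequence, which is why `tokens` is passed three times.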

# Tools that we are using here

*   Library: PyTorch
*   Library: fairseq, for the neural network implementation


In [None]:
!pip install torch torchvision torchaudio



In [None]:
!pip install -U pip
!git clone https://github.com/pytorch/fairseq.git
%cd fairseq
#!git checkout v0.10.2
##!python -m pip install --user ./
!pip install --editable .
%cd ..

Collecting pip
  Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 10.6 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.2.1
Cloning into 'fairseq'...
remote: Enumerating objects: 34777, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 34777 (delta 0), reused 2 (delta 0), pack-reused 34769
Receiving objects: 100% (34777/34777), 25.03 MiB | 26.53 MiB/s, done.
Resolving deltas: 100% (25248/25248), done.
/content/fairseq
Obtaining file:///content/fairseq
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Pr

# Check GPU

In [None]:
!nvidia-smi

Wed Aug 23 04:54:21 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Tokenizer Tool

In [None]:
!pip install git+https://github.com/vmujadia/tokenizer.git --upgrade

Collecting git+https://github.com/vmujadia/tokenizer.git
  Cloning https://github.com/vmujadia/tokenizer.git to /tmp/pip-req-build-vk09hsa8
  Running command git clone --filter=blob:none --quiet https://github.com/vmujadia/tokenizer.git /tmp/pip-req-build-vk09hsa8
  Resolved https://github.com/vmujadia/tokenizer.git to commit 93cd09b81702108a51c08c9796fd1cc941a1b98b
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: IL-Tokenizer
  Building wheel for IL-Tokenizer (setup.py) ... done
  Created wheel for IL-Tokenizer: filename=IL_Tokenizer-0.0.2-py3-none-any.whl size=7224 sha256=5abfdd3c98345f910040423ad4b63b017180d2e182abde3099560e883dd750f1
  Stored in directory: /tmp/pip-ephem-wheel-cache-a3vgylm3/wheels/9a/fb/5b/3d75bfde8561726121c09f0f0a83389c05312df8a513808c41
Successfully built IL-Tokenizer
DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 23.3 will enforce this behaviour change. A poss

# To Clean and Filter Parallel Corpora

In [None]:
!git clone https://github.com/moses-smt/mosesdecoder.git

Cloning into 'mosesdecoder'...
remote: Enumerating objects: 148097, done.
remote: Counting objects: 100% (525/525), done.
remote: Compressing objects: 100% (229/229), done.
remote: Total 148097 (delta 323), reused 441 (delta 292), pack-reused 147572
Receiving objects: 100% (148097/148097), 129.88 MiB | 13.65 MiB/s, done.
Resolving deltas: 100% (114349/114349), done.


# To tackle the vocabulary issue: subword algorithm

In [None]:
!git clone https://github.com/rsennrich/subword-nmt.git

Cloning into 'subword-nmt'...
remote: Enumerating objects: 597, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 597 (delta 8), reused 12 (delta 4), pack-reused 576
Receiving objects: 100% (597/597), 252.23 KiB | 6.31 MiB/s, done.
Resolving deltas: 100% (357/357), done.


In [None]:
!ls mosesdecoder/scripts/training/clean-corpus-n.perl

mosesdecoder/scripts/training/clean-corpus-n.perl


# Training corpora for this tutorial

## English - Hindi
## (small PMIndia corpus)

In [None]:
! wget -O train.src https://swayam.iiit.ac.in/upload/uploadfiles/ssmt/nmt-IIITH/pmindia.en
! wget -O train.tgt https://swayam.iiit.ac.in/upload/uploadfiles/ssmt/nmt-IIITH/pmindia.hi
! wget -O valid.src https://swayam.iiit.ac.in/upload/uploadfiles/ssmt/nmt-IIITH/dev.hi-en.en
! wget -O valid.tgt https://swayam.iiit.ac.in/upload/uploadfiles/ssmt/nmt-IIITH/dev.hi-en.hi

--2023-08-23 04:54:43--  https://swayam.iiit.ac.in/upload/uploadfiles/ssmt/nmt-IIITH/pmindia.en
Resolving swayam.iiit.ac.in (swayam.iiit.ac.in)... 196.12.53.52
Connecting to swayam.iiit.ac.in (swayam.iiit.ac.in)|196.12.53.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6695890 (6.4M) [text/plain]
Saving to: ‘train.src’


2023-08-23 04:55:54 (93.6 KB/s) - ‘train.src’ saved [6695890/6695890]

--2023-08-23 04:55:54--  https://swayam.iiit.ac.in/upload/uploadfiles/ssmt/nmt-IIITH/pmindia.hi
Resolving swayam.iiit.ac.in (swayam.iiit.ac.in)... 196.12.53.52
Connecting to swayam.iiit.ac.in (swayam.iiit.ac.in)|196.12.53.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16554142 (16M)
Saving to: ‘train.tgt’


2023-08-23 04:56:54 (279 KB/s) - ‘train.tgt’ saved [16554142/16554142]

--2023-08-23 04:56:54--  https://swayam.iiit.ac.in/upload/uploadfiles/ssmt/nmt-IIITH/dev.hi-en.en
Resolving swayam.iiit.ac.in (swayam.iiit.ac.in)... 196.12.53.52
Conn

# Data Numbers

In [None]:
print ('Data Stats')
! wc -l train.*
! wc -l valid.*

Data Stats
   56832 train.src
   56832 train.tgt
  113664 total
  2000 valid.src
  2000 valid.tgt
  4000 total
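
Both sides of a parallel corpus must have the same number of lines, as the `wc -l` counts above confirm. The same check can be sketched in Python so that a mismatch stops the notebook early. The helper names `count_lines` and `check_parallel` are hypothetical, introduced only for this example.

```python
def count_lines(path):
    """Count newline characters in a file, like `wc -l`."""
    with open(path, "rb") as handle:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 20), b""))

def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise ValueError if the two sides differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt}")
    return n_src
```

For the files downloaded above, `check_parallel('train.src', 'train.tgt')` and `check_parallel('valid.src', 'valid.tgt')` would be the natural calls.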


# Tokenize the text

In [None]:
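from ilstokenizer import tokenizer

def to_tokenize_and_lower(input_path, output_path):
  # Read and write UTF-8 explicitly so Hindi text survives on any platform.
  with open(input_path, encoding='utf-8') as infile, \
       open(output_path, 'w', encoding='utf-8') as outfile:
    for line in infile:
      # Tokenize each sentence, then lowercase it
      # (lowercasing leaves Devanagari characters unchanged).
      line = tokenizer.tokenize(line.strip()).lower()
      outfile.write(line + '\n')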

In [None]:
to_tokenize_and_lower('train.src','train.src.tkn')
to_tokenize_and_lower('train.tgt','train.tgt.tkn')

to_tokenize_and_lower('valid.src','valid.src.tkn')
to_tokenize_and_lower('valid.tgt','valid.tgt.tkn')

In [None]:
! cat train.src.tkn > train.all.tkn
! cat train.tgt.tkn >> train.all.tkn

# Data Cleaning

In [None]:
! perl mosesdecoder/scripts/training/clean-corpus-n.perl -ratio 1.5 train src.tkn tgt.tkn train_filtered 1 250

clean-corpus.perl: processing train.src.tkn & .tgt.tkn to train_filtered, cutoff 1-250, ratio 1.5
.....
Input sentences: 56832  Output sentences:  53833
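
The filter applied above can be approximated in Python to make the three arguments concrete: a pair is kept only if both sides have between 1 and 250 tokens and the longer side is at most 1.5 times the shorter. This is a sketch of the idea, not a port of `clean-corpus-n.perl`; the function names and the sample sentences are illustrative assumptions.

```python
def keep_pair(src, tgt, min_len=1, max_len=250, max_ratio=1.5):
    """Return True if a tokenized sentence pair passes the length filters."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        return False  # drop pairs with an empty side
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

def filter_corpus(pairs, **limits):
    """Yield only the (source, target) pairs that pass `keep_pair`."""
    for src, tgt in pairs:
        if keep_pair(src, tgt, **limits):
            yield src, tgt

# Made-up pairs for illustration only.
pairs = [
    ("the ministry provides funds .", "मंत्रालय धन प्रदान करता है ।"),  # 5 vs 6 tokens: kept
    ("yes", "यह एक बहुत लंबा वाक्य है ।"),                               # 1 vs 7 tokens: ratio too high
    ("", "खाली"),                                                      # empty source: dropped
]
kept = list(filter_corpus(pairs))
```

A stricter ratio removes more misaligned pairs but also discards valid translations where one language is naturally more verbose.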


In [None]:
print ('Data Stats')
! wc -l train*
! wc -l valid*

Data Stats
  113664 train.all.tkn
   53833 train_filtered.src.tkn
   53833 train_filtered.tgt.tkn
   56832 train.src
   56832 train.src.tkn
   56832 train.tgt
   56832 train.tgt.tkn
  448658 total
   2000 valid.src
   2000 valid.src.tkn
   2000 valid.tgt
   2000 valid.tgt.tkn
   8000 total


# Train the subword model
## Experiment with the number of subword merge operations

In [None]:
!python subword-nmt/subword_nmt/learn_bpe.py -s 7500 < train.all.tkn > train.codes

100% 7500/7500 [00:12<00:00, 586.03it/s]
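
To make the `-s 7500` argument concrete: byte-pair encoding starts from characters and repeatedly merges the most frequent adjacent pair of symbols, and `-s` sets how many merges are learned. The following is a simplified sketch of that idea on a toy corpus, not the `subword-nmt` code itself; all function names here are assumptions, and the real tool is considerably more optimized.

```python
from collections import Counter

def word_to_symbols(word):
    """Split a word into characters and mark the last one with </w>."""
    return tuple(word[:-1]) + (word[-1] + "</w>",)

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(words, num_merges):
    """Return up to `num_merges` merges, in the order they were learned."""
    vocab = {word_to_symbols(w): f for w, f in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # ties broken alphabetically
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges

corpus = "the then the there in the inn".split()
merges = learn_bpe(corpus, 3)
# merges == [('t', 'h'), ('th', 'e</w>'), ('th', 'e')]
```

Fewer merges give shorter subword units and a smaller vocabulary; more merges keep frequent words whole at the cost of a larger vocabulary.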


# How do subword codes look

In [None]:
! head -n 10 train.codes

#version: 0.2
t h
i n
् र
th e</w>
a n
क े</w>
e n
t i
e r
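
Each line of the codes file is one merge, listed in the order it was learned, and `</w>` marks the end of a word. A sketch of how such a list could be applied to a single word follows. It is a simplified illustration rather than the `apply_bpe.py` implementation; the function name is an assumption, and the merge list uses only pairs visible in the output above.

```python
def apply_bpe(word, merges, separator="@@"):
    """Segment one word with an ordered list of merges (earlier = higher priority)."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no learned merge applies any more
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    symbols[-1] = symbols[-1][: -len("</w>")]  # strip the end-of-word marker
    # Every piece except the last gets the continuation marker.
    return " ".join([s + separator for s in symbols[:-1]] + [symbols[-1]])

merges = [("t", "h"), ("i", "n"), ("th", "e</w>"), ("a", "n")]
print(apply_bpe("the", merges))    # the
print(apply_bpe("thing", merges))  # th@@ in@@ g
```

The `@@` suffix on a piece means "the word continues in the next token", which is what makes the segmentation reversible.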


# Apply Subword to the corpus

In [None]:
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < train_filtered.src.tkn > train.en
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < train_filtered.tgt.tkn > train.hi

!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < valid.src.tkn > valid.en
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < valid.tgt.tkn > valid.hi

# Training Corpus now

In [None]:
! head -n 10 train.en

an advance is plac@@ ed with the medical su@@ per@@ int@@ end@@ ents of such hospit@@ als who then provide assistance on a case to case basis .
since the do@@ h@@ f@@ w provides funds to the hospit@@ als , the gr@@ ants can be given from the department to the hospital directly .
r@@ an func@@ tions can , therefore , be vest@@ ed in do@@ h@@ f@@ w .
man@@ aging committee of r@@ an society will meet to dis@@ sol@@ ve the aut@@ onom@@ ous body ( a@@ b ) as per provisions of societies regist@@ ration act , 18@@ 60 ( s@@ ra ) .
in addition to this , health minister ’ s canc@@ er pati@@ ent fund ( h@@ m@@ cp@@ f ) shall also be trans@@ ferred to the department .
the tim@@ eline required for this is one year .
j@@ s@@ k organiz@@ es various activities with target popul@@ ations as a part of its man@@ date .
there has been no continu@@ ous funding to j@@ s@@ k from the ministry .
population st@@ abil@@ ization strate@@ g@@ ies requ@@ ir@@ e private and corporate funding , which can be acc@@ es

In [None]:
! head -n 10 train.hi

अग@@ ्रि@@ म धन राशि इन अस्प@@ ता@@ लों को चिकित्सा नि@@ री@@ क्ष@@ कों को दी जाएगी , जो हर मामले को देखते हुए सहायता प्रदान करेंगे ।
च@@ ू@@ ंकि स्वास्थ्य एवं परिवार कल्याण विभाग अस्प@@ ता@@ लों को धन@@ राशि प्रदान करता है इसलिए विभाग द्वारा अस्प@@ ता@@ लों को सीधे अनु@@ दान दिया जा सकता है ।
इस तरह आर@@ ए@@ एन का काम@@ का@@ ज स्वास्थ्य एवं परिवार कल्याण विभाग के अध@@ ीन लाया जाएगा ।
आर@@ ए@@ एन , सो@@ साय@@ टी की प्र@@ बंध समिति सो@@ साय@@ टी पंजी@@ करण अधिनियम , 18@@ 60 के प्रावधानों के तहत स्वा@@ य@@ त्@@ त@@ शा@@ सी निका@@ यों को र@@ द्@@ द करने के लिए बैठक करेगा ।
इसके अलावा स्वास्थ्य मंत्री के कैं@@ सर रो@@ गी निधि को भी विभाग को स्थ@@ ान@@ ा@@ ंत@@ रित कर दिया जाएगा ।
इसके लिए एक वर्ष का समय रखा गया है ।
जे@@ एस@@ के लक्ष@@ ित आबादी के म@@ द्@@ दे@@ नजर विभिन्न गतिविधियों का आयोजन करता है ।
मंत्रालय द्वारा जे@@ एस@@ के का कोई लगातार वित्@@ त@@ पोषण नहीं किया जाता ।
जन@@ संख्या स्थि@@ री@@ करण रण@@ नीतियों के निजी और कार्@@ पोरे@@ ट वित्@@ त@@ पोषण की जरूरत होती है , जो जे@@ एस@
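
Because every non-final piece carries the `@@` marker, the segmentation shown above is reversible. A minimal sketch of the inverse step, assuming the default `@@` separator (the helper name `remove_bpe` is hypothetical):

```python
import re

# Matches "@@ " between pieces, and a dangling "@@" at the end of a line.
_BPE_JOIN = re.compile(r"(@@ )|(@@ ?$)")

def remove_bpe(line):
    """Join subword pieces back into words by deleting the `@@` markers."""
    return _BPE_JOIN.sub("", line)

restored = remove_bpe("the tim@@ eline required for this is one year .")
print(restored)  # the timeline required for this is one year .
```

The same function is useful later for turning the model's subword output back into readable text.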

In [None]:
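import os
# Append fairseq to PYTHONPATH; .get() avoids a KeyError when the variable is unset.
os.environ['PYTHONPATH'] = os.environ.get('PYTHONPATH', '') + ":/content/fairseq/"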

! echo $PYTHONPATH

/env/python:/content/fairseq/


# Starting NMT Training
## Preprocessing stage: create dictionaries and prepare the corpora for parallel processing


In [None]:
! python fairseq/fairseq_cli/preprocess.py \
    --joined-dictionary \
    --source-lang en --target-lang hi \
    --trainpref train --validpref valid --testpref valid \
    --destdir data-bin/trial --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

2023-08-23 04:58:05.598942: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:fairseq.tasks.text_to_speech:Please install tensorboardX: pip install tensorboardX
INFO:fairseq_cli.preprocess:Namespace(no_progress_bar=False, log_interval=100, log_format=None, log_file=None, aim_repo=None, aim_run_hash=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, emp

In [None]:
ls data-bin/trial

dict.en.txt        test.en-hi.en.idx   train.en-hi.en.idx  valid.en-hi.en.idx
dict.hi.txt        test.en-hi.hi.bin   train.en-hi.hi.bin  valid.en-hi.hi.bin
preprocess.log     test.en-hi.hi.idx   train.en-hi.hi.idx  valid.en-hi.hi.idx
test.en-hi.en.bin  train.en-hi.en.bin  valid.en-hi.en.bin
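
The `dict.*.txt` files are plain text, with one `token count` pair per line, so the vocabulary built by preprocessing can be inspected directly. The loader below is a small sketch for that purpose, not part of fairseq; the function name is an assumption.

```python
def load_dictionary(path):
    """Read a fairseq-style dictionary file: one `token count` pair per line."""
    counts = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last space: the count is always the final field.
            token, count = line.rsplit(" ", 1)
            counts[token] = int(count)
    return counts
```

Since the command above uses `--joined-dictionary`, `load_dictionary('data-bin/trial/dict.en.txt')` and `load_dictionary('data-bin/trial/dict.hi.txt')` should return the same shared vocabulary.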


# Training
## Parameters to tune for your corpus and language pair



```
    --encoder-embed-dim 128 --encoder-ffn-embed-dim 128 \
    --encoder-layers 2 --encoder-attention-heads 2 \
    --decoder-embed-dim 128 --decoder-ffn-embed-dim 128 \
    --decoder-layers 2 --decoder-attention-heads 2 \
    --dropout 0.3 --weight-decay 0.0 \
    --max-update 4000 \
    --keep-last-epochs 10 \
```



---



In [None]:
! python fairseq/fairseq_cli/train.py --fp16 \
    data-bin/trial \
    --source-lang en --target-lang hi \
    --arch transformer_iwslt_de_en --share-all-embeddings \
    --encoder-embed-dim 128 --encoder-ffn-embed-dim 128 \
    --encoder-layers 2 --encoder-attention-heads 2 \
    --decoder-embed-dim 128 --decoder-ffn-embed-dim 128 \
    --decoder-layers 2 --decoder-attention-heads 2 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.01 --lr-scheduler inverse_sqrt --warmup-updates 10 \
    --max-tokens 4096 --update-freq 16 \
    --max-update 4000 \
    --keep-last-epochs 10 \
    --save-dir trained_models

2023-08-23 04:58:57.447848: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-23 04:59:01 | INFO | numexpr.utils | NumExpr defaulting to 2 threads.
2023-08-23 04:59:01 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
2023-08-23 04:59:03 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_win
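
The training command combines `--lr 0.01`, `--lr-scheduler inverse_sqrt`, and `--warmup-updates 10`. The shape of that schedule can be sketched as follows: the learning rate rises linearly to its peak during warmup and then decays in proportion to the inverse square root of the update number. This is an illustration rather than fairseq's code, and the starting rate of 0 during warmup is an assumption made here.

```python
def inverse_sqrt_lr(step, peak_lr=0.01, warmup_updates=10, warmup_init_lr=0.0):
    """Learning rate after `step` updates: linear warmup, then 1/sqrt(step) decay."""
    if step < warmup_updates:
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    return peak_lr * (warmup_updates / step) ** 0.5

# Learning rate at a few points of the 4000-update run configured above.
schedule = {step: inverse_sqrt_lr(step) for step in (1, 10, 40, 1000, 4000)}
```

With these settings the rate peaks at 0.01 after 10 updates and has fallen to roughly 0.0005 by update 4000, so a short warmup mainly matters for the first few steps.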

In [None]:
ls trained_models

checkpoint116.pt  checkpoint119.pt  checkpoint122.pt  checkpoint125.pt
checkpoint117.pt  checkpoint120.pt  checkpoint123.pt  checkpoint_best.pt
checkpoint118.pt  checkpoint121.pt  checkpoint124.pt  checkpoint_last.pt


In [None]:
! python fairseq/fairseq_cli/interactive.py  data-bin/trial \
    -s en -t hi \
    --distributed-world-size 1  \
    --path trained_models/checkpoint_best.pt \
    --batch-size 64  --buffer-size 2500 --beam 10 --replace-unk \
    --skip-invalid-size-inputs-valid-test \
    --input valid.en

2023-08-23 05:35:49.112739: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:fairseq.tasks.text_to_speech:Please install tensorboardX: pip install tensorboardX
DEBUG:hydra.core.utils:Setting JobRuntime:name=UNKNOWN_NAME
DEBUG:hydra.core.utils:Setting JobRuntime:name=utils
INFO:fairseq_cli.interactive:{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None
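
After the configuration dump, the translation output mixes several tab-separated line types, including `S-<id>` for the source and `H-<id>` for the hypothesis with its score. A sketch of pulling out only the hypotheses and undoing the subword split is shown below. The function name is an assumption and the sample lines are made up for illustration; they are not real model output.

```python
import re

def extract_hypotheses(lines):
    """Collect `H-<id>` lines and return translations ordered by sentence id."""
    hyps = {}
    for line in lines:
        if line.startswith("H-"):
            tag, _score, text = line.rstrip("\n").split("\t", 2)
            # Remove the BPE continuation markers to restore whole words.
            hyps[int(tag[2:])] = re.sub(r"(@@ )|(@@ ?$)", "", text)
    return [hyps[i] for i in sorted(hyps)]

# Hypothetical sample in the S-/H-/P- layout (not real output).
sample = [
    "S-1\tthe tim@@ eline is one year .",
    "H-1\t-0.52\tसमय एक वर्ष है ।",
    "P-1\t-0.40 -0.61 -0.55 -0.48 -0.56",
    "S-0\thospit@@ als",
    "H-0\t-0.41\tअस्प@@ ता@@ लों",
    "P-0\t-0.39 -0.42 -0.42",
]
translations = extract_hypotheses(sample)
```

Sorting by id matters because batched decoding does not necessarily print sentences in input order.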

In [None]:
cat valid.en

others must also have experi@@ mented with initiatives similar to those undertaken by the government .
if we tri@@ ed all that har@@ der , then by the 75@@ th year of our independence , we would have car@@ ved a place for our@@ selves am@@ id@@ st the major tourist dest@@ in@@ ations of the world .
the aim of this go@@ b@@ ar - dhan scheme is ensuring cleanliness in villages and gener@@ ating wealth and energy by conver@@ ting c@@ att@@ le d@@ un@@ g and sol@@ id agricultural waste into com@@ post and bi@@ o gas .
not only this , what re@@ ally sur@@ prised me was the fact that the ath@@ le@@ te , who fin@@ ished four@@ th in this event am@@ ong@@ st div@@ yan@@ g persons and thus mis@@ sed w@@ inning any med@@ al , ac@@ tu@@ ally took less time than the gold med@@ al@@ ist of general categ@@ ory in comple@@ ting the r@@ ace .
i have hear@@ d that in c@@ ud@@ d@@ al@@ ore district of tamil nadu , child mar@@ ri@@ age has been ban@@ ned under a special campaign .
in the past , the or@@ 