<a href="https://colab.research.google.com/github/Pengu007/Translational_Model/blob/main/Translation_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Importing the Dataset (Multi30K)**

In [1]:
! git clone --recursive https://github.com/multi30k/dataset.git multi30k-dataset

Cloning into 'multi30k-dataset'...
remote: Enumerating objects: 313, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 313 (delta 17), reused 21 (delta 16), pack-reused 281
Receiving objects: 100% (313/313), 18.21 MiB | 26.82 MiB/s, done.
Resolving deltas: 100% (69/69), done.
Submodule 'scripts/subword-nmt' (https://github.com/rsennrich/subword-nmt.git) registered for path 'scripts/subword-nmt'
Cloning into '/content/multi30k-dataset/scripts/subword-nmt'...
remote: Enumerating objects: 597, done.        
remote: Counting objects: 100% (21/21), done.        
remote: Compressing objects: 100% (17/17), done.        
remote: Total 597 (delta 8), reused 12 (delta 4), pack-reused 576        
Receiving objects: 100% (597/597), 252.23 KiB | 10.51 MiB/s, done.
Resolving deltas: 100% (357/357), done.
Submodule path 'scripts/subword-nmt': checked out '80b7c1449e2e26673fb0b5cae993fe2d0dc23846'


**Installing Required Python Modules**

In [12]:
%%capture
# Weights & Biases (wandb) -- For Logging
! pip install wandb

# Sacremoses -- For Tokenizing
! pip install sacremoses

# fairseq -- For training and evaluation of the model
! git clone https://github.com/pytorch/fairseq
%cd fairseq
! pip install --editable ./
%cd ..

**Logging in to Weights & Biases (W&B), Which Requires an Account**

In [13]:
import wandb
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mmail2anandved[0m ([33malone_y1[0m). Use [1m`wandb login --relogin`[0m to force relogin


True

**Preprocessing and Binarizing the Data to Build the Vocabularies**

In [14]:
! fairseq-preprocess --source-lang de --target-lang en \
  --trainpref multi30k-dataset/data/task1/tok/train.lc.norm.tok \
  --validpref multi30k-dataset/data/task1/tok/val.lc.norm.tok \
  --testpref  multi30k-dataset/data/task1/tok/test_2018_flickr.lc.norm.tok \
  --destdir data-bin/multi30k.tokenized.de-en \
  --thresholdsrc 2 \
  --thresholdtgt 2

2023-11-24 12:22:27.511524: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-24 12:22:27.511606: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-24 12:22:27.511665: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-24 12:22:27.527110: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:fairseq.tasks.text_to_speech:Please install t
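The `--thresholdsrc 2` and `--thresholdtgt 2` flags ask `fairseq-preprocess` to map words seen fewer than two times to the unknown token, which is why `<unk>` shows up in some source sentences during evaluation. The idea can be sketched in plain Python as follows. This is an illustration only, not fairseq's implementation; the function names `build_vocab` and `apply_vocab` and the toy corpus are made up for this example.

```python
from collections import Counter


def build_vocab(sentences, threshold=2):
    """Return the set of words that occur at least `threshold` times."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, count in counts.items() if count >= threshold}


def apply_vocab(sentence, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary word in `sentence` with `unk`."""
    return " ".join(word if word in vocab else unk for word in sentence.split())


# Toy corpus (hypothetical, not taken from Multi30K).
corpus = [
    "ein mann im wasser",
    "ein mann auf einem segelboot",
    "ein surfer im wasser",
]
vocab = build_vocab(corpus, threshold=2)
print(sorted(vocab))
print(apply_vocab("ein surfer auf einem segelboot", vocab))
```

Words such as `surfer` occur only once in the toy corpus, so they fall below the threshold and are replaced by `<unk>`.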

# Training the model

In [15]:
! fairseq-train data-bin/multi30k.tokenized.de-en \
  --arch transformer \
  --dropout 0.1 \
  --attention-dropout 0.1 \
  --activation-dropout 0.1 \
  --encoder-embed-dim 256 \
  --encoder-ffn-embed-dim 512 \
  --encoder-layers 3 \
  --encoder-attention-heads 8 \
  --encoder-learned-pos \
  --decoder-embed-dim 256 \
  --decoder-ffn-embed-dim 512 \
  --decoder-layers 3 \
  --decoder-attention-heads 8 \
  --decoder-learned-pos \
  --max-epoch 10 \
  --optimizer adam \
  --lr 5e-4 \
  --batch-size 128 \
  --seed 1 \
  --wandb-project "Multi 30K En to De Translation"

2023-11-24 12:27:31 | INFO | numexpr.utils | NumEx
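To get a feel for how small this configuration is, the sketch below estimates the number of weights in the encoder and decoder layers from the flags above (embedding dimension 256, feed-forward dimension 512, 3 layers on each side). It assumes a standard Transformer layer layout with biases on every projection and a scale and bias in every layer norm, and it leaves out token embeddings, learned positional embeddings, and the output projection, since the vocabulary sizes are not shown here. The function names are hypothetical.

```python
def attention_params(d_model):
    # Query, key, value and output projections: four d_model x d_model matrices plus biases.
    return 4 * (d_model * d_model + d_model)


def ffn_params(d_model, d_ffn):
    # Two linear maps: d_model -> d_ffn and d_ffn -> d_model, each with a bias.
    return (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)


def layer_norm_params(d_model):
    # One scale and one bias per feature.
    return 2 * d_model


def encoder_layer_params(d_model, d_ffn):
    # Self-attention + feed-forward block + two layer norms.
    return attention_params(d_model) + ffn_params(d_model, d_ffn) + 2 * layer_norm_params(d_model)


def decoder_layer_params(d_model, d_ffn):
    # Self-attention + encoder-decoder attention + feed-forward block + three layer norms.
    return 2 * attention_params(d_model) + ffn_params(d_model, d_ffn) + 3 * layer_norm_params(d_model)


d_model, d_ffn, num_layers = 256, 512, 3
total = num_layers * (encoder_layer_params(d_model, d_ffn) + decoder_layer_params(d_model, d_ffn))
print(f"encoder layer: {encoder_layer_params(d_model, d_ffn):,}")
print(f"decoder layer: {decoder_layer_params(d_model, d_ffn):,}")
print(f"all layers (without embeddings): {total:,}")
```

The number of attention heads does not appear in the estimate because splitting the projections into 8 heads changes how the weights are used, not how many there are.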

# Evaluate the model

In [16]:
# ckpt_best, beam=5
! fairseq-generate data-bin/multi30k.tokenized.de-en \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 128 \
    --beam 5 \
    --seed 1 \
    --scoring bleu \
    --wandb-project "Multi 30K En to De Translation"

Streaming output truncated to the last 5000 lines.
D-279	-0.3997049331665039	girl playing a game outside .
P-279	-0.3944 -0.2495 -1.7408 -0.1130 -0.1305 -0.1697 -0.0000
S-342	ein radfahrer fährt einer <unk> hinauf .
T-342	a cyclist riding up a grassy dirt trail .
H-342	-0.89820796251297	a cyclist is riding a bike race .
D-342	-0.89820796251297	a cyclist is riding a bike race .
P-342	-0.0657 -1.8332 -0.3939 -0.8106 -0.0730 -1.2386 -2.8078 -0.8609 -0.0000
S-281	dieser surfer versucht , nicht <unk> .
T-281	this surfer is trying to avoid wiping out .
H-281	-0.5625184774398804	this surfer is trying to make a surfer .
D-281	-0.5625184774398804	this surfer is trying to make a surfer .
P-281	-0.0074 -0.1162 -0.0833 -1.0527 -0.3249 -0.9452 -1.7639 -1.0074 -0.3240 -0.0001
S-232	mann im wasser auf einem winzigen segelboot
T-232	man in the water on a tiny sail boat
H-232	-0.6790001392364502	man in the water on a tiny sailboat .
D-232	-0.6790001392364502	man in the water on a tiny sai
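The output above uses tab-separated lines whose prefix marks their role: `S-` is the source sentence, `T-` the reference translation, `H-` the hypothesis with its score, `D-` the detokenized hypothesis, and `P-` the per-token scores. A minimal sketch of a parser for this layout is shown below, using one of the examples printed above. The function name `parse_generate_output` and the dictionary keys are choices made for this illustration, not part of fairseq.

```python
def parse_generate_output(lines):
    """Group S-/T-/H-/D-/P- lines from fairseq-generate output by sentence id."""
    results = {}
    for line in lines:
        tag, sep, rest = line.rstrip("\n").partition("-")
        if not sep or tag not in {"S", "T", "H", "D", "P"}:
            continue  # skip log lines and anything else that is not a result line
        fields = rest.split("\t")
        if not fields[0].isdigit():
            continue
        entry = results.setdefault(int(fields[0]), {})
        if tag == "S" and len(fields) >= 2:
            entry["source"] = fields[1]
        elif tag == "T" and len(fields) >= 2:
            entry["target"] = fields[1]
        elif tag == "H" and len(fields) >= 3:
            entry["score"] = float(fields[1])
            entry["hypothesis"] = fields[2]
        elif tag == "D" and len(fields) >= 3:
            entry["detokenized"] = fields[2]
        elif tag == "P" and len(fields) >= 2:
            entry["token_scores"] = [float(x) for x in fields[1].split()]
    return results


sample = [
    "S-342\tein radfahrer fährt einer <unk> hinauf .",
    "T-342\ta cyclist riding up a grassy dirt trail .",
    "H-342\t-0.89820796251297\ta cyclist is riding a bike race .",
    "D-342\t-0.89820796251297\ta cyclist is riding a bike race .",
    "P-342\t-0.0657 -1.8332 -0.3939 -0.8106 -0.0730 -1.2386 -2.8078 -0.8609 -0.0000",
]
results = parse_generate_output(sample)
entry = results[342]
mean_token_score = sum(entry["token_scores"]) / len(entry["token_scores"])
print(entry["hypothesis"])
print(round(entry["score"], 4), round(mean_token_score, 4))
```

In this example the hypothesis score matches the mean of the per-token scores up to rounding, which suggests the `H-` value is a length-normalized score.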

**Increasing Only the Beam Size to See the Difference in BLEU Score**

In [23]:
# ckpt_best, beam=10
! fairseq-generate data-bin/multi30k.tokenized.de-en \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 128 \
    --beam 10 \
    --seed 1 \
    --scoring bleu \
    --quiet \
    --wandb-project "Multi 30K En to De Translation"

DEBUG:hydra.core.utils:Setting JobRuntime:name=UNK
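The `--beam` flag controls how many partial translations the decoder keeps alive at each step. The toy sketch below shows why a wider beam can find a better sequence than greedy decoding (beam size 1). It is a simplified illustration and not fairseq's search code, which also handles batching and length normalization. The function `beam_search`, the lookup table `TOY_MODEL`, and its probabilities are invented for this example.

```python
import math

# Hypothetical next-token distributions, keyed by the last token generated so far.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def step_fn(tokens):
    """Return the next-token probabilities given the tokens generated so far."""
    return TOY_MODEL[tokens[-1]]


def beam_search(step_fn, bos, eos, beam_size, max_len):
    """Return (tokens, log_probability) of the best sequence found."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the `beam_size` best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished beams if max_len was reached first.
    return max(finished + beams, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(step_fn, "<s>", "</s>", beam_size=1, max_len=5)
beam_tokens, beam_score = beam_search(step_fn, "<s>", "</s>", beam_size=2, max_len=5)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

Greedy decoding commits to `a` because it is the most likely first token, while a beam of 2 also keeps `b` and ends up with a more probable full sequence. A larger beam does not always improve BLEU, which is what the comparison in this notebook checks empirically.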

**Generating an Averaged Checkpoint (Parameter Average of the Last 5 Epoch Checkpoints)**

In [20]:
! python fairseq/scripts/average_checkpoints.py --inputs checkpoints --num-epoch-checkpoints 5 --output checkpoints/averaged.pt

INFO:fairseq.tasks.text_to_speech:Please install t
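Checkpoint averaging takes the element-wise mean of each parameter across several saved checkpoints and stores the result as a single model. The sketch below illustrates the arithmetic with plain Python lists standing in for tensors; the real script works on PyTorch checkpoints, and the name `average_state_dicts` and the toy values are assumptions made for this example.

```python
def average_state_dicts(state_dicts):
    """Element-wise mean of parameters that share the same name across checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = set(state_dicts[0])
    if any(set(sd) != keys for sd in state_dicts):
        raise ValueError("all checkpoints must contain the same parameter names")
    averaged = {}
    for key in state_dicts[0]:
        # zip(*...) walks the same position of this parameter in every checkpoint.
        columns = zip(*(sd[key] for sd in state_dicts))
        averaged[key] = [sum(column) / len(state_dicts) for column in columns]
    return averaged


# Two toy "checkpoints" with one weight vector and one bias each (hypothetical values).
checkpoint_a = {"weight": [1.0, 3.0], "bias": [0.0]}
checkpoint_b = {"weight": [3.0, 5.0], "bias": [1.0]}
averaged = average_state_dicts([checkpoint_a, checkpoint_b])
print(averaged)
```

The result is still one model with one set of weights, so decoding costs the same as for a single checkpoint, unlike an ensemble that runs several models side by side.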

**Test-Set Performance of the Averaged Checkpoint with Beam = 5**

In [21]:
# ckpt_avg, beam=5
! fairseq-generate data-bin/multi30k.tokenized.de-en \
    --path checkpoints/averaged.pt \
    --batch-size 128 \
    --beam 5 \
    --seed 1 \
    --scoring bleu \
    --quiet \
    --wandb-project "Multi 30K En to De Translation"

INFO:fairseq.tasks.text_to_speech:Please install t
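The `--scoring bleu` option used in these evaluation cells reports BLEU, which is built from clipped n-gram precisions combined with a brevity penalty. The sketch below computes only the clipped (modified) n-gram precision for a single sentence pair taken from the evaluation output earlier in this notebook. It is one ingredient of BLEU rather than the full corpus-level metric, and the function names are chosen for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with counts clipped."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # An n-gram is credited at most as often as it appears in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


hypothesis = "a cyclist is riding a bike race ."
reference = "a cyclist riding up a grassy dirt trail ."
p1 = modified_precision(hypothesis, reference, 1)
p2 = modified_precision(hypothesis, reference, 2)
print(p1, p2)
```

For this pair, 5 of the 8 hypothesis unigrams and 1 of the 7 bigrams are matched, which shows how quickly higher-order precision drops when the word order diverges from the reference.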

**Test-Set Performance of the Averaged Checkpoint with Beam = 10**

In [24]:
# ckpt_avg, beam=10
! fairseq-generate data-bin/multi30k.tokenized.de-en \
    --path checkpoints/averaged.pt \
    --batch-size 128 \
    --beam 10 \
    --seed 1 \
    --scoring bleu \
    --quiet \
    --wandb-project "Multi 30K En to De Translation"

DEBUG:hydra.core.utils:Setting JobRuntime:name=UNK

# Export the model

**Exporting the Model Built from the Averaged Checkpoint**

In [25]:
! mkdir -p trained_model
! cp data-bin/multi30k.tokenized.de-en/dict.de.txt trained_model/dict.de.txt
! cp data-bin/multi30k.tokenized.de-en/dict.en.txt trained_model/dict.en.txt
! cp checkpoints/averaged.pt trained_model/model.pt

**Using fairseq-interactive, which translates sentences as they are entered (with a beam of 5, as it gave the highest BLEU score)**

In [26]:
! fairseq-interactive \
  --path trained_model/model.pt \
  --source-lang de --target-lang en \
  --tokenizer moses \
  --task translation --cpu \
  --beam 5 \
  trained_model/

DEBUG:hydra.core.utils:Setting JobRuntime:name=UNK
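The model was trained on the `*.lc.norm.tok` files, and the source sentences in the evaluation output are all lowercase, so sentences typed into `fairseq-interactive` will probably match the vocabulary better if they are lowercased first. The helper below is a small sketch of that preparation step. The name `prepare_source` is hypothetical, tokenization is still left to `--tokenizer moses`, and the punctuation normalization applied to the training data is not reproduced here.

```python
def prepare_source(sentence):
    """Lowercase a sentence and collapse runs of whitespace into single spaces."""
    return " ".join(sentence.lower().split())


print(prepare_source("  Ein Radfahrer fährt   den Hügel hinauf. "))
```

Without this step, capitalized German nouns such as `Radfahrer` are likely to be mapped to `<unk>`, because only their lowercase forms were seen during training.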