# NEURAL MACHINE TRANSLATION - LSTM

## Required modules and config file

In [1]:
import src.LSTM as lstmNMT
from src.Tokenizer import Corpus, LangData, dataLoader
from src.utils import load_config, get_device, train_model, sentence_bleu, corpus_bleu
from src.Translator import Translator
from torch.nn import CrossEntropyLoss
from torch.optim import NAdam
import evaluate
import numpy as np
from torchinfo import summary

# Loading config file
config = load_config()
# Select the device: GPU, MPS backend, or CPU
device = get_device()
print(f"Using device: {device}")

Using the latest cached version of the module from /Users/lucien/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--bleu/9e0985c1200e367cce45605ce0ecb5ede079894e0f24f54613fca08eeb8aff76 (last modified on Thu Jul 18 16:29:52 2024) since it couldn't be found locally at evaluate-metric--bleu, or remotely on the Hugging Face Hub.


Using device: mps


## Load the dataset

In [2]:
# Parallel corpora: English is the source (encoder) side, Afrikaans the target (decoder) side
english_data = Corpus(f"{config.TRAIN_DATA}/english.txt", "English")
afrikaans_data = Corpus(f"{config.TRAIN_DATA}/afrikaans.txt", "Afrikaans")

## Set hyperparameters

In [3]:
# Encoder - source
IN_ENCODER = english_data.vocab_size
ENCODER_EMB = 256

# Decoder - target
IN_DECODER = afrikaans_data.vocab_size
OUT_DECODER = afrikaans_data.vocab_size
DECODER_EMB = 256

# Shared
HIDDEN_SIZE = 1024
NUM_LAYERS = 2

LR = 1e-3
BATCH_SIZE = 128

## Set the model

In [4]:
encoder_net = lstmNMT.Encoder(IN_ENCODER, ENCODER_EMB, HIDDEN_SIZE, NUM_LAYERS).to(device)
decoder_net = lstmNMT.Decoder(IN_DECODER, DECODER_EMB, HIDDEN_SIZE, NUM_LAYERS).to(device)
model = lstmNMT.LSTM_NMT(encoder_net, decoder_net, OUT_DECODER)
summary(model)

Layer (type:depth-idx)                   Param #
LSTM_NMT                                 --
├─Encoder: 1-1                           --
│    └─Embedding: 2-1                    744,448
│    └─LSTM: 2-2                         13,647,872
├─Decoder: 1-2                           --
│    └─Embedding: 2-3                    737,280
│    └─LSTM: 2-4                         13,647,872
│    └─Linear: 2-5                       2,952,000
Total params: 31,729,472
Trainable params: 31,729,472
Non-trainable params: 0
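
The parameter counts in the summary can be reproduced by hand from the hyperparameters. The sketch below assumes standard PyTorch layer shapes: an embedding with `vocab × emb` weights, a unidirectional multi-layer LSTM with two bias vectors per layer, and a linear layer with a bias. The vocabulary sizes are not printed in the notebook; they are inferred here by dividing the embedding counts by the embedding size of 256.

```python
def embedding_params(vocab_size, emb_dim):
    # One emb_dim-sized vector per vocabulary entry
    return vocab_size * emb_dim


def lstm_params(input_size, hidden_size, num_layers):
    # Each layer has 4 gates; each gate has input weights, recurrent weights
    # and two bias vectors (PyTorch keeps b_ih and b_hh separately).
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        total += 4 * hidden_size * (in_size + hidden_size) + 2 * 4 * hidden_size
    return total


def linear_params(in_features, out_features):
    # Weight matrix plus bias vector
    return in_features * out_features + out_features


EMB, HIDDEN, LAYERS = 256, 1024, 2
EN_VOCAB = 744_448 // EMB  # inferred from the summary, not printed in the notebook
AF_VOCAB = 737_280 // EMB  # inferred from the summary, not printed in the notebook

counts = {
    "encoder_embedding": embedding_params(EN_VOCAB, EMB),
    "encoder_lstm": lstm_params(EMB, HIDDEN, LAYERS),
    "decoder_embedding": embedding_params(AF_VOCAB, EMB),
    "decoder_lstm": lstm_params(EMB, HIDDEN, LAYERS),
    "decoder_linear": linear_params(HIDDEN, AF_VOCAB),
}
total = sum(counts.values())
print(counts)
print(f"Total params: {total:,}")
```

Almost all of the capacity sits in the two LSTM stacks, so `HIDDEN_SIZE` and `NUM_LAYERS` dominate the model size far more than the embedding dimensions do.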

In [5]:
train_data = LangData(english_data, afrikaans_data)
train_loader = dataLoader(train_data, BATCH_SIZE)

pad_idx = afrikaans_data.stoi['<pad>']
# Exclude padded target positions from the loss
criterion = CrossEntropyLoss(ignore_index=pad_idx)

optimizer = NAdam(model.parameters(), LR)
translator = Translator(model, english_data, afrikaans_data, device)
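
To make the role of `ignore_index` concrete, here is a minimal pure-Python sketch of a cross-entropy that skips padded targets and averages over the remaining positions, which is how mean-reduced cross-entropy with an ignored index behaves. The function name `masked_cross_entropy` and the toy inputs are hypothetical; returning `0.0` when every target is padding is a simplification chosen for this sketch.

```python
import math


def masked_cross_entropy(logits, targets, pad_idx):
    """Mean negative log-likelihood over the non-padding targets."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad_idx:
            continue  # padded position: contributes nothing to the loss
        # Numerically stable log-sum-exp for the softmax normaliser
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
        count += 1
    # Simplification for this sketch: return 0.0 if everything is padding
    return total / count if count else 0.0


# Toy batch: three positions over a two-token vocabulary, index 0 is padding
toy_logits = [[0.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
toy_targets = [1, 1, 0]
toy_loss = masked_cross_entropy(toy_logits, toy_targets, pad_idx=0)
print(toy_loss)
```

The third position is padding, so the loss is the average over the first two positions only, each of which costs `log(2)` under a uniform prediction.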

In [6]:
# Sentence pair used to monitor translation quality during training
mytext = "<sos> given that we represent the target output as $y\in\{0,1\}$ and we have $n$ training points , we can write the negative log likelihood of the parameters as follows: <eos>"
ground = "<sos> as ons die teikenuittree voorstel as $y\in\{0,1\}$ en ons $n$ afrigpunte het , dan kan ons die negatiewe log-waarskynlikheidskostefunksie skryf as: <eos>"

predicted = translator.translate_sentence(mytext)
bleu = sentence_bleu(prediction=[predicted], reference=[ground])
print(f"Pred: {predicted}")
print(f"Refe: {ground}")
print(f"BLEU SCORES: {bleu}")

Pred: <sos> een een een verlangde liedjie liedjie klok optimaliseer een een een een suig $\alpha$ $\alpha$ $\alpha$ stroom stroom stroom siklusse/monster siklusse/monster stroom siklusse/monster siklusse/monster stroom siklusse/monster siklusse/monster gepleeg diagonaalkovariansiese diagonaalkovariansiese $z^{-1}$ gek vanmiddag herhaal gemaak ‘n seuns stroom siklusse/monster siklusse/monster stroom siklusse/monster siklusse/monster siklusse/monster tevrede $0.25$ $0.25$ bin\^ere een een haarself een haarself een verlangde verlangde verlangde iris verlangde liedjie een een $0.25$ een een suig $\alpha$ $\alpha$ oksitaanse imagin\^{e}re verlangde verlangde liedjie liedjie optimaliseer een een een een suig $\alpha$ $\alpha$ $\alpha$ gespasie\"{e}rde gespasie\"{e}rde gedra gedra gedra gek gek stroom stroom siklusse/monster siklusse/monster siklusse/monster stroom siklusse/monster siklusse/monster tevrede $0.25$ bin\^ere een een haarself verlangde verlangde verlangde geleen verlangde been ve

## Train the model

In [7]:
EPOCHS = 15
params = {
    "model": model,
    "train_loader": train_loader,
    "optimizer": optimizer,
    "criterion": criterion,
    "device": device,
    "epochs": EPOCHS,
    "source_test": mytext,
    "reference": ground,
    "translator": translator,
}

train_loss = train_model(**params)
np.save('lstm_train_loss.npy', np.array(train_loss))

Epoch 1/15: 100%|██████████| 20/20 [00:07<00:00,  2.53batch/s, loss=1.740]


Predicted: <sos> die die <eos>
BLEU Score: [0.021, 0.016, 0.012, 0.0]


Epoch 2/15: 100%|██████████| 20/20 [00:07<00:00,  2.71batch/s, loss=1.562]


Predicted: <sos> die die van van die van die van <eos>
BLEU Score: [0.096, 0.07, 0.052, 0.0]


Epoch 3/15: 100%|██████████| 20/20 [00:07<00:00,  2.69batch/s, loss=1.493]


Predicted: <sos> die volgende van die volgende van die volgende van die volgende van die volgende van die volgende van die volgende van die volgende van die volgende van die volgende van die volgende <eos>
BLEU Score: [0.205, 0.147, 0.105, 0.0]


Epoch 4/15: 100%|██████████| 20/20 [00:07<00:00,  2.77batch/s, loss=1.410]


Predicted: <sos> die volgende sein met 'n volgende sein met 'n volgende sein met 'n faktor van 'n faktor <eos>
BLEU Score: [0.152, 0.117, 0.087, 0.0]


Epoch 5/15: 100%|██████████| 20/20 [00:07<00:00,  2.65batch/s, loss=1.317]


Predicted: <sos> bepaal die volgende sein $x(t)$ word deur die volgende sein $x(t)$ word deur die volgende bladsy <eos>
BLEU Score: [0.301, 0.177, 0.119, 0.0]


Epoch 6/15: 100%|██████████| 20/20 [00:07<00:00,  2.74batch/s, loss=1.201]


Predicted: <sos> die volgende $x(t)$ word deur 'n analoog-na-syfer van die volgende sein $x[n]$ , met 'n faktor -punt khz <eos>
BLEU Score: [0.355, 0.193, 0.126, 0.0]


Epoch 7/15: 100%|██████████| 20/20 [00:07<00:00,  2.79batch/s, loss=1.115]


Predicted: <sos> die volgende sein $x(t)$ word deur 'n analoog-na-syfer sein $x[n]$ , met 'n faktor fft , en mans meer as mans in die gemiddeld <eos>
BLEU Score: [0.425, 0.209, 0.132, 0.0]


Epoch 8/15: 100%|██████████| 20/20 [00:07<00:00,  2.80batch/s, loss=1.014]


Predicted: <sos> die volgende sein $x(t)$ word deur 'n analoog-na-syfer omsetter (adc) teen 'n monsterfrekwensie van $f_s=5$ khz <eos>
BLEU Score: [0.28, 0.172, 0.117, 0.0]


Epoch 9/15: 100%|██████████| 20/20 [00:07<00:00,  2.74batch/s, loss=0.954]


Predicted: <sos> die sein van die dac is 5 en die filters kan word as ideaal beskou word : <eos>
BLEU Score: [0.26, 0.172, 0.128, 0.085]


Epoch 10/15: 100%|██████████| 20/20 [00:07<00:00,  2.65batch/s, loss=0.910]


Predicted: <sos> ons wil die data van die model wat die volgende benodig word deur 'n laagdeurlaatfilter (lpf) met 'n deurlaatband van $0.25$ en $0.35$ siklusse/monster , met 'n faktor 3 afgemonster (downsample) <eos>
BLEU Score: [0.319, 0.186, 0.116, 0.0]


Epoch 11/15: 100%|██████████| 20/20 [00:07<00:00,  2.73batch/s, loss=0.807]


Predicted: <sos> veronderstel ons het 'n datastel met vyf kenmerke , $x_1$ = matriek gemiddeld , $x_2$ = ik toetspunt , $x_3$ = geslag (1 vir vroulik en 0 vir manlik) <eos>
BLEU Score: [0.286, 0.154, 0.1, 0.0]


Epoch 12/15: 100%|██████████| 20/20 [00:07<00:00,  2.74batch/s, loss=0.728]


Predicted: <sos> ons wil die data in die usd/euro wisselkoers in die usd/euro wisselkoers in die usd/euro mark die \% verandering in die britse mark , en die \% verandering in die duitse mark <eos>
BLEU Score: [0.304, 0.164, 0.107, 0.0]


Epoch 13/15: 100%|██████████| 20/20 [00:07<00:00,  2.72batch/s, loss=0.630]


Predicted: <sos> ons wil die data in die usd/euro wisselkoers 'n effektiewe , m.a.w. , die \% verandering in die amerikaanse mark , die \% verandering in die duitse mark , die \% verandering in die duitse mark <eos>
BLEU Score: [0.269, 0.145, 0.095, 0.0]


Epoch 14/15: 100%|██████████| 20/20 [00:07<00:00,  2.73batch/s, loss=0.552]


Predicted: <sos> as ons die teikenuittree voorstel as $y\in\{0,1\}$ en die voorspellings word met die maatstaf van die data in die britse mark en die \% verandering in die duitse mark <eos>
BLEU Score: [0.533, 0.505, 0.483, 0.462]


Epoch 15/15: 100%|██████████| 20/20 [00:07<00:00,  2.73batch/s, loss=0.506]

Predicted: <sos> as ons die teikenuittree voorstel as $y\in\{0,1\}$ en ons $n$ afrigpunte het , dan kan ons die negatiewe log-waarskynlikheidskostefunksie skryf as : <eos>
BLEU Score: [1.0, 1.0, 1.0, 1.0]
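
The per-epoch score lists above appear to hold BLEU-1 through BLEU-4 (an assumption, since `sentence_bleu` from `src.utils` is not shown). Under that reading, the last entry stays at 0.0 for most epochs because BLEU is built on clipped n-gram precision: if the prediction shares no 4-gram with the reference, the 4-gram precision is zero and so is the geometric mean. A minimal sketch with whitespace tokenisation and hypothetical toy sentences:

```python
from collections import Counter


def ngram_counts(tokens, n):
    # Multiset of all contiguous n-grams in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with each
    n-gram credited at most as often as it occurs in the reference."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


toy_candidate = "die die die kat"
toy_reference = "die kat sit"
p1 = clipped_precision(toy_candidate, toy_reference, 1)
p2 = clipped_precision(toy_candidate, toy_reference, 2)
p4 = clipped_precision(toy_candidate, toy_reference, 4)
print(p1, p2, p4)
```

Clipping is what stops a degenerate output such as the repeated `die` and `van` of the early epochs from scoring well: repeated tokens only count as often as they occur in the reference.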





## Evaluate on the training set

In [8]:
EN_SRC = [' '.join(sent) for sent in english_data.data_str]
AF_REF = [[' '.join(sent)] for sent in afrikaans_data.data_str]
TRANSLATED = [translator.translate_sentence(sent) for sent in EN_SRC]
corpus_bleu(TRANSLATED, AF_REF)

                                     BLEU-1                                     
------------------------------------------------------------------------------------------
bleu                : 0.652725667015166
precisions          : [0.6987934376898817]
brevity_penalty     : 0.9340752671819449
length_ratio        : 0.9361558047564873
translation_length  : 34561
reference_length    : 36918
******************************************************************************************
                                     BLEU-2                                     
------------------------------------------------------------------------------------------
bleu                : 0.5706228752102449
precisions          : [0.6987934376898817, 0.5340547620532652]
brevity_penalty     : 0.9340752671819449
length_ratio        : 0.9361558047564873
translation_length  : 34561
reference_length    : 36918
******************************************************************************************
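
The fields in these tables are related by the standard BLEU definition: the score is the brevity penalty times the geometric mean of the n-gram precisions. The sketch below recomputes the training-set scores from the reported precisions and lengths; the helper names are hypothetical and the formula is the textbook one, not a copy of the `corpus_bleu` helper in `src.utils`.

```python
import math


def brevity_penalty(translation_length, reference_length):
    # No penalty when the output is at least as long as the reference
    if translation_length >= reference_length:
        return 1.0
    return math.exp(1.0 - reference_length / translation_length)


def bleu_from_precisions(precisions, translation_length, reference_length):
    # A zero precision at any order drives the geometric mean to zero
    if min(precisions) <= 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(translation_length, reference_length) * math.exp(log_mean)


# Values copied from the training-set tables above
train_bleu1 = bleu_from_precisions([0.6987934376898817], 34561, 36918)
train_bleu2 = bleu_from_precisions(
    [0.6987934376898817, 0.5340547620532652], 34561, 36918
)
print(round(train_bleu1, 4), round(train_bleu2, 4))  # 0.6527 0.5706
```

Because the translations are about 6% shorter than the references, the brevity penalty of roughly 0.934 lowers every score in the table by the same factor.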
           

## Evaluate on the validation set

In [9]:
with open(f"{config.VAL_DATA}/english.txt") as data:
    english_val = data.read().strip().split("\n")
with open(f"{config.VAL_DATA}/afrikaans.txt") as data:
    afrikaans_val = data.read().strip().split("\n")

### Greedy search

In [10]:
VAL_AF_REF = [[sent] for sent in afrikaans_val]

VAL_TRANSLATED = [translator.translate_sentence(sent) for sent in english_val]

corpus_bleu(VAL_TRANSLATED, VAL_AF_REF)

                                     BLEU-1                                     
------------------------------------------------------------------------------------------
bleu                : 0.5794145710994943
precisions          : [0.6209269971443216]
brevity_penalty     : 0.933144433668136
length_ratio        : 0.9352828379674017
translation_length  : 13657
reference_length    : 14602
******************************************************************************************
                                     BLEU-2                                     
------------------------------------------------------------------------------------------
bleu                : 0.4875279863137261
precisions          : [0.6209269971443216, 0.4396031746031746]
brevity_penalty     : 0.933144433668136
length_ratio        : 0.9352828379674017
translation_length  : 13657
reference_length    : 14602
******************************************************************************************
            

### Beam search

In [11]:
VAL_TRANSLATED = [translator.translate_sentence(sent, method="beam", beam_width=3) for sent in english_val]

corpus_bleu(VAL_TRANSLATED, VAL_AF_REF)

                                     BLEU-1                                     
------------------------------------------------------------------------------------------
bleu                : 0.5769388152898909
precisions          : [0.5835295747333425]
brevity_penalty     : 0.9887053549145929
length_ratio        : 0.988768661827147
translation_length  : 14438
reference_length    : 14602
******************************************************************************************
                                     BLEU-2                                     
------------------------------------------------------------------------------------------
bleu                : 0.4841237679717259
precisions          : [0.5835295747333425, 0.41088110006725953]
brevity_penalty     : 0.9887053549145929
length_ratio        : 0.988768661827147
translation_length  : 14438
reference_length    : 14602
******************************************************************************************
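
For readers unfamiliar with the two decoding strategies, the following toy sketch shows how beam search differs from greedy search. It is not the `Translator` implementation, which is not shown in this notebook; the `TOY_MODEL` table is a hypothetical stand-in for the decoder, mapping the last token to next-token probabilities. Greedy search corresponds to `beam_width=1`.

```python
import math

# Hypothetical stand-in for the decoder: last token -> next-token probabilities
TOY_MODEL = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "x": 0.1},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
    "z": {"<eos>": 1.0},
}


def beam_search(beam_width, max_len=10):
    """Return the most probable finished sequence found and its probability."""
    beams = [(0.0, ["<sos>"])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token
        candidates = []
        for score, tokens in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the beam_width best expansions
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == "<eos>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached <eos>
    best_score, best_tokens = max(finished or beams, key=lambda c: c[0])
    return " ".join(best_tokens), math.exp(best_score)


greedy_text, greedy_prob = beam_search(beam_width=1)
beam_text, beam_prob = beam_search(beam_width=2)
print(greedy_text, round(greedy_prob, 2))
print(beam_text, round(beam_prob, 2))
```

In this toy case greedy search commits to `a` because it is the best first step, while a beam of width 2 keeps `b` alive and finds a more probable full sequence. A more probable sequence is not guaranteed to score better on BLEU: in the tables above, beam search with `beam_width=3` produces longer output (14438 tokens against 13657) and a slightly lower BLEU-1 (0.5769 against 0.5794).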
           

## Evaluate on the SUN validation set only

In [12]:
with open(f"{config.VAL_DATA}/sun_english.txt") as data:
    sun_english_val = data.read().strip().split("\n")
with open(f"{config.VAL_DATA}/sun_afrikaans.txt") as data:
    sun_afrikaans_val = data.read().strip().split("\n")

### Greedy search

In [13]:
SUN_VAL_AF = [[sent] for sent in sun_afrikaans_val]
SUN_VAL_TRANSLATED = [translator.translate_sentence(sent) for sent in sun_english_val]
corpus_bleu(SUN_VAL_TRANSLATED, SUN_VAL_AF)

                                     BLEU-1                                     
------------------------------------------------------------------------------------------
bleu                : 0.4032216160041569
precisions          : [0.4032216160041569]
brevity_penalty     : 1.0
length_ratio        : 1.0091767173571053
translation_length  : 3849
reference_length    : 3814
******************************************************************************************
                                     BLEU-2                                     
------------------------------------------------------------------------------------------
bleu                : 0.3089415424548455
precisions          : [0.4032216160041569, 0.2367057540223616]
brevity_penalty     : 1.0
length_ratio        : 1.0091767173571053
translation_length  : 3849
reference_length    : 3814
******************************************************************************************
                                     BLEU-3 

### Beam search

In [14]:
SUN_VAL_TRANSLATED = [translator.translate_sentence(sent, method="beam", beam_width=3) for sent in sun_english_val]
corpus_bleu(SUN_VAL_TRANSLATED, SUN_VAL_AF)

                                     BLEU-1                                     
------------------------------------------------------------------------------------------
bleu                : 0.37233009708737863
precisions          : [0.37233009708737863]
brevity_penalty     : 1.0
length_ratio        : 1.08023072889355
translation_length  : 4120
reference_length    : 3814
******************************************************************************************
                                     BLEU-2                                     
------------------------------------------------------------------------------------------
bleu                : 0.28448744064887416
precisions          : [0.37233009708737863, 0.21736922295581512]
brevity_penalty     : 1.0
length_ratio        : 1.08023072889355
translation_length  : 4120
reference_length    : 3814
******************************************************************************************
                                     BLEU-3

In [15]:
# Inspect ten SUN validation sentences individually, using a wider beam
metric = evaluate.load("bleu")
predictions = [translator.translate_sentence(sent, method="beam", beam_width=5) for sent in sun_english_val[10:20]]
labels = SUN_VAL_AF[10:20]
for source, pred, lab in zip(sun_english_val[10:20], predictions, labels):
    print(f"Source    : {source}")
    print(f"Prediction: {pred[:150]}")
    print(f"Label     : {lab[0][:150]}")
    print(f"BLEU      : {metric.compute(predictions=[pred], references=lab)['bleu']}")
    print()

Source    : <sos> component <eos>
Prediction: <sos> dit te wees <eos>
Label     : <sos> komponent <eos>
BLEU      : 0.0

Source    : <sos> architecture <eos>
Prediction: <sos> dit te wees <eos>
Label     : <sos> argitektuur <eos>
BLEU      : 0.0

Source    : <sos> specification <eos>
Prediction: <sos> dit te wees <eos>
Label     : <sos> spesifikasies <eos>
BLEU      : 0.0

Source    : <sos> at which stage of the design process would we choose the communication protocol between subsystems <eos>
Prediction: <sos> in elk van die gevolglike klassifikasie-gebiede aan of intrees as $0$ of $1$ geklassifiseerder sal word <eos>
Label     : <sos> by watter stap van die ontwerpsproses word die kommunikasie-kanaal tussen substelsels gekies <eos>
BLEU      : 0.0

Source    : <sos> motivate your answer <eos>
Prediction: <sos> motiveer jou antwoord <eos>
Label     : <sos> motiveer jou antwoord <eos>
BLEU      : 1.0

Source    : <sos> describe the meaning if a system is described as a cyber-physical s