# Overview
**Assignment 2** focuses on the training on a Neural Machine Translation (NMT) system for English-Irish translation where English is the source language and Irish is the target language. 

**Grading Policy** 
Assignment 2 is graded and will be worth 25% of your overall grade. This assignment is worth a total of 50 points distributed over the tasks below.  Please note that this is an individual assignment and you must not work with other students to complete this assessment. Any copying from other students, from student exercises from previous years, and any internet resources will not be tolerated. Plagiarised assignments will receive zero marks and the students who commit this act will be reported. Feel free to reach out to the TAs and instructors if you have any questions.

## Task 1 - Data Collection and Preprocessing (10 points)
## Task 1a. Data Loading (5 pts)
Dataset: https://www.dropbox.com/s/zkgclwc9hrx7y93/DGT-en-ga.txt.zip?dl=0 
*  Download a English-Irish dataset and decompress it. The `DGT.en-ga.en` file contains a list english sentences and `DGT.en-ga.ga` contains the paralell Irish sentences. Read both files into the Jupyter environment and load them into a pandas dataframe. 
* Randomly sample 12,000 rows.
* Split the sampled data into train (10k), development (1k) and test set (1k)

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import string
from nltk.tokenize import word_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer

In [99]:
# Your Code Here

english = pd.read_csv("DGT.en-ga.en", sep="\n", names=["English"])
irish = pd.read_csv("DGT.en-ga.ga", sep="\n", names=["Irish"])

data = pd.concat([english, irish], axis=1)
data = data.sample(n=12000, random_state=2023)

train, val = train_test_split(data, test_size=.2, random_state=2013)
val, test = train_test_split(val, test_size=.5, random_state=2013)

print(train.shape)
print(val.shape)
print(test.shape)

train.head()

(9600, 2)
(1200, 2)
(1200, 2)


Unnamed: 0,English,Irish
129371,The majority of this support should be depende...,Chun an méid sin a bhaint amach is gá leanúint...
26002,The Commission shall be empowered to adopt del...,I gcás ina n-aistreofar teidlíochtaí íocaíocht...
30027,For the industrial manufacture of essential oi...,"Ó chapaill, ó asail, ó mhiúileanna agus ó ráinigh"
27906,Section 5,Beidh sé beartaithe leis an tacaíocht indíolta...
13988,"In addition, the Debt facility will help organ...",feabhsóidh siad aistriú eolais agus margadh na...


## Task 1b. Preprocessing (5 pts)
* Add '<bof\>' to denote beginning of sentence and '<eos\>' to denote the end of the sentence to each target line.
* Perform the following pre-processing steps:
  * Lowercase the text
  * Remove all punctuation
  * tokenize the text 
*  Build seperate vocabularies for each language. 
  * Assign each unique word an id value 
*Print statistics on the selected dataset:
  * Number of samples
  * Number of unique source language tokens
  * Number of unique target language tokens
  * Max sequence length of source language
  * Max sequence length of target language



In [3]:
from nltk.tokenize import word_tokenize 
from typing import List 
import re 

class Langauge:

    def __init__(self, language: str):
        self.language = language                            # Name of the langauge
        self.word2index = {"SOS": 0, "EOS": 1}              # Maps each word in vocab to id
        self.index2word = {0: "SOS", 1: "EOS"}              # Reverse map of id to word in vocab
        self.word2count = {}                                # Count of each word in vocab
        self.n_words = len(self.index2word)                 # number of words in vocab

    def addSentence(self, sentence: str):
        """ 
        Given a sentence, lowercase is and remove any punctuation. Tokenize the
        sentence and for each word in the tokenized list call the addWord method.
        """
        text = sentence.lower()
        clean_text = re.sub(r'[^\w\s]', '', text).strip()
        for word in word_tokenize(clean_text):
            self.addWord(word)
  
    def addWord(self, word: str):
        """
        For each input word, check if it exists in the the word2index. If it does 
        not, add the word to the word2index and set the value to the current 
        vocabulary length. Update the index2word entry as well which maps the token 
        id to the word. Finaally update the vocabulary count (n_words).

        If the word is already in the vocabulary, udpate the count.
        """
        if word not in self.word2index:
          self.word2index[word] = self.n_words
          self.word2count[word] = 1
          self.index2word[self.n_words] = word
          self.n_words += 1
        else:
          self.word2count[word] += 1

    def encodeSentence(self, sentence: str) -> List[int]:
        """
        Given a sentence:
          1. Lower case it
          2. Remove all punctuation
          3. Prepend SOS and append EOS to it.
          4. Tokenize it and return the word ids for each word in the tokenized list. If a word
          does not exist in the vocab, skip over it. 

          Return a list of word ids. 
        """
        text = sentence.lower()
        clean_text = re.sub(r'[^\w\s]', '',text).strip()
        clean_text = "SOS " + clean_text + " EOS"
        return [self.word2index[word] for word in word_tokenize(clean_text) if word in self.word2index]

    def decodeIds(self, ids: list) -> List[str]:
        """
        Given a list of word ids, look the ids in the index2word and return a
        string representing the decoded sentence. 
        """
        return " ".join([self.index2word[tok] for tok in ids])

In [4]:
from tqdm.notebook import tqdm 

english = Langauge("English")
irish = Langauge("Irish")

for _, row in tqdm(data.iterrows(), total=len(data)):
    english.addSentence(str(row["English"]))
    irish.addSentence(str(row["Irish"]))
print(f"Size of English vocab: {english.n_words}")
print(f"Size of Irish vocab: {irish.n_words}")

  0%|          | 0/12000 [00:00<?, ?it/s]

Size of English vocab: 11548
Size of Irish vocab: 16345


In [5]:
print(data.shape)
print(train.shape)
print(val.shape)
print(test.shape)

(12000, 2)
(9600, 2)
(1200, 2)
(1200, 2)


In [89]:
# Print statistics on the selected dataset:

# Number of samples
print("Number of train samples:", len(train), '\n')
print("Number of development samples:", len(val), '\n')
print("Number of test samples:", len(test), '\n')

# Number of unique source language tokens
print("Number of unique English language tokens", english.n_words, '\n')

# Number of unique target language tokens
print("Number of unique Irish language tokens", irish.n_words, '\n')

# Max sequence length of source language
eng = str(data["English"])
max_source_seq_length = max([len(sentence) for sentence in eng])
print("Max sequence length of source language", max_source_seq_length, "\n")

# Max sequence length of target language
iri = str(data["Irish"])
max_target_seq_length = max([len(sentence) for sentence in iri])
print("Max sequence length of target language", max_target_seq_length, "\n")


Number of train samples: 9600 

Number of development samples: 1200 

Number of test samples: 1200 

Number of unique English language tokens 11548 

Number of unique Irish language tokens 16345 

Max sequence length of source language 1 

Max sequence length of target language 1 



## Task 2. Model Implementation and Training (30 pts)



## Task 2a. Encoder-Decoder Model Implementation (10 pts)
Implement an Encoder-Decoder model in Pytorch with the following components
* A single layer RNN based encoder. 
* A single layer RNN based decoder
* A Encoder-Decoder model based on the above components that support sequence-to-sequence modelling. For the encoder/decoder you can use RNN, LSTMs or GRU. Use a hidden dimension of 256 or less depending on your compute constraints. 

In [7]:
import torch 
from tensorflow.keras.utils import pad_sequences
import pandas as pd

def encode_features(
    df: pd.DataFrame, 
    english: Langauge,
    french: Langauge,
    pad_token: int = 0,
    max_seq_length = 10
  ):

    source = []
    target = []

    for _, row in df.iterrows():
        source.append(english.encodeSentence(str(row["English"])))
        target.append(french.encodeSentence(str(row["Irish"])))

    source = pad_sequences(
        source,
        maxlen=max_seq_length,
        padding="post",
        truncating = "post",
        value=pad_token
    )

    target = pad_sequences(
      target,
      maxlen=max_seq_length,
      padding="post",
      truncating = "post",
      value=pad_token
    )
  
    return source, target

train_source, train_target = encode_features(train, english, irish)
val_source, val_target = encode_features(val, english, irish)
test_source, test_target = encode_features(test, english, irish)

print(f"Shapes of train source {train_source.shape}, and target {train_target.shape}")

Shapes of train source (9600, 10), and target (9600, 10)


In [8]:
from torch.utils.data import DataLoader, TensorDataset

train_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(train_source),
        torch.LongTensor(train_target)
    ),
    shuffle = True,
    batch_size = 32
)

val_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(val_source),
        torch.LongTensor(val_target)
    ),
    shuffle = False,
    batch_size = 32
)

test_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(test_source),
        torch.LongTensor(test_target)
    ),
    shuffle = False,
    batch_size = 32
)


for batch in train_dl:
    print( batch[0].shape, batch[1].shape )
    break

torch.Size([32, 10]) torch.Size([32, 10])


In [9]:
# Your code here

# Single Layer RNN based Encoder

import torch 
import torch.nn as nn 
import torch.nn.functional as F

class EncoderGRU(nn.Module):
    def __init__(
        self, 
        input_vocab_size,  # size of source vocabulary  
        hidden_dim,        # hidden dimension of embeddings
        encoder_hid_dim,   # gru hidden dim
        decoder_hid_dim,   # decoder hidden dim 
        dropout_prob = .5
      ):
      
        super().__init__()
        self.embedding = nn.Embedding(input_vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, encoder_hid_dim, bidirectional = True)
        self.fc = nn.Linear(encoder_hid_dim * 2, decoder_hid_dim)
        self.dropout = nn.Dropout(dropout_prob)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards GRU
        #hidden [-1, :, : ] is the last of the backwards GRU
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))        
        return outputs, hidden

In [10]:
class Attention(nn.Module):
    def __init__(
        self, 
        enc_hid_dim,      # Encoder hidden dimension
        dec_hid_dim       # Decoder hidden dimension 
      ):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]
        attention = self.v(energy).squeeze(2)
        
        #attention output: [batch size, src len]
        return F.softmax(attention, dim=1)

In [29]:
class DecoderGRU(nn.Module):
    def __init__(
        self, 
        target_vocab_size,    # Size of target vocab 
        hidden_dim,           # hidden size of embedding  
        enc_hid_dim, 
        dec_hid_dim, 
        dropout
      ):
        super().__init__()

        self.output_dim = target_vocab_size
#         self.attention = Attention(enc_hid_dim, dec_hid_dim)
        
        self.embedding = nn.Embedding(target_vocab_size, hidden_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + hidden_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear(
            (enc_hid_dim * 2) + dec_hid_dim + hidden_dim, 
            target_vocab_size
          )
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)  # [1, batch size]
        
        embedded = self.dropout(self.embedding(input))  # [1, batch size, emb dim]
        
#         a = self.attention(hidden, encoder_outputs)     # [batch size, src len]
#         a = a.unsqueeze(1)                              # [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2) # [batch size, src len, enc hid dim * 2]
        
        weighted = torch.mean(encoder_outputs, dim=1, keepdim=True)           # [batch size, 1, enc hid dim * 2]
        weighted = weighted.permute(1, 0, 2)               # [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2) # [1, batch size, (enc hid dim * 2) + emb dim]

        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]    
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1)) # [batch size, output dim]
        return prediction, hidden.squeeze(0)

In [30]:
import random 
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time     
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):     
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs

## Task 2b. Training (10 pts)
Implement the code to train the Encoder-Decoder model on the Irish-English data. You will write code for the following:
* Training, validation and test dataloaders 
* A training loop which trains the model for 5 epoch. Evaluate the loop at the end of each Epoch. Print out the train perplexity and validation perplexity after each epoch.

In [31]:
INPUT_DIM = english.n_words
OUTPUT_DIM = irish.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 128
DEC_HID_DIM = 128
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = EncoderGRU(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = DecoderGRU(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT)

model = EncoderDecoder(enc, dec)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

EncoderDecoder(
  (encoder): EncoderGRU(
    (embedding): Embedding(11548, 256)
    (rnn): GRU(256, 128, bidirectional=True)
    (fc): Linear(in_features=256, out_features=128, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): DecoderGRU(
    (embedding): Embedding(16345, 256)
    (rnn): GRU(512, 128)
    (fc_out): Linear(in_features=640, out_features=16345, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [59]:
# Your Code Here

from tqdm.notebook import tqdm
import numpy as np 
optimizer = torch.optim.Adam(model.parameters())

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model.to(device)

EPOCHS = 5
best_val_loss = float('inf')

for epoch in range(EPOCHS):

    model.train()
    epoch_loss = 0
    for batch in tqdm(train_dl, total=len(train_dl)):

        src = batch[0].transpose(1, 0).to(device)
        trg = batch[1].transpose(1, 0).to(device)

        optimizer.zero_grad()

        output = model(src, trg)

        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim).to(device)
        trg = trg[1:].reshape(-1)

        loss = F.cross_entropy(output, trg)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        optimizer.step()
        epoch_loss += loss.item()

    train_loss = round(epoch_loss / len(train_dl), 3)

    eval_loss = 0
    model.eval()
    for batch in tqdm(val_dl, total=len(val_dl)):
        src = batch[0].transpose(1, 0).to(device)
        trg = batch[1].transpose(1, 0).to(device)

        with torch.no_grad():
            output = model(src, trg)

            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim).to(device)
            trg = trg[1:].reshape(-1)

            loss = F.cross_entropy(output, trg)

            eval_loss += loss.item()

        val_loss = round(eval_loss / len(val_dl), 3)
        print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(val_loss)}")


        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best-model.pt')  


  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 1.141108319267235
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 1.3284329307527947
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 1.540335115161127
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 1.7612081105217428
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 2.0097292268772886
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 2.3094282890863056
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 2.6379444593541526
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 3.0771381716642967
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 3.546637600982122
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 4.026912688058807
Epoch 0 | train loss 5.369 | train ppl 214.64811223226334 | val ppl 4.5767997072121265
Epoch 0 | train loss 5.369 | train ppl 214.6481

  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 1.1571961880507962
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 1.3231298123374369
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 1.538795549956865
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 1.7647340515084595
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 1.9957102459664608
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 2.275045381235993
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 2.6326738428088614
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 3.058730620510393
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 3.5078383743425823
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 3.9629947917959374
Epoch 1 | train loss 5.111 | train ppl 165.83610809139944 | val ppl 4.531259789228607
Epoch 1 | train loss 5.111 | train ppl 165.8361

  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 1.1422499983308942
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 1.3139002448247392
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 1.5219615556186337
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 1.7401999144695428
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 2.0137527074704766
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 2.284163787415424
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 2.6221641807745737
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 3.058730620510393
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 3.4903429574618414
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 3.947174474357382
Epoch 2 | train loss 4.908 | train ppl 135.36840667771497 | val ppl 4.517686380154588
Epoch 2 | train loss 4.908 | train ppl 135.3684

  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 1.1376901241657316
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 1.3125870013111083
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 1.5326526617240188
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 1.7401999144695428
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 1.9699339218909298
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 2.2659633758311957
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 2.619543327238971
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 3.0556734187455317
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 3.5043322893029334
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 4.026912688058807
Epoch 3 | train loss 4.725 | train ppl 112.73049837406913 | val ppl 4.581378796082183
Epoch 3 | train loss 4.725 | train ppl 112.730

  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 1.141108319267235
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 1.3417839036669714
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 1.5745979974750188
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 1.8130309449601565
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 2.0792349218188444
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 2.3964776177110654
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 2.7101392030187967
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 3.2123411450916888
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 3.6692966676192444
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 4.170350145368969
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 | val ppl 4.850103282095144
Epoch 4 | train loss 4.488 | train ppl 88.94338111102392 

# Task 2c. Evaluation on the Test Set (10 pts)
Use the trained model to translate the text from the source language into the target language on the test set. Evaluate the performance of the model on the test set using the BLEU metric and print out the average the BLEU score.

In [60]:
# Your code here

def translate_sentence(
    text: str, 
    model: EncoderDecoder, 
    english: Langauge,
    irish: Langauge,
    device: str,
    max_len: int = 10,
    ) -> str:

    # Encode english sentence and convert to tensor
    input_ids = english.encodeSentence(text)
    input_tensor = torch.LongTensor(input_ids).unsqueeze(1).to(device)

    # Get encooder hidden states
    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(input_tensor)

    # Build target holder list
    trg_indexes = [irish.word2index["SOS"]]

    # Loop over sequence length of target sentence
    for i in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)

    # Decode the encoder outputs with respect to current target word
        with torch.no_grad():
            output, hidden = model.decoder(trg_tensor, hidden, encoder_outputs)

    # Retrieve most likely word over target distribution
        pred_token = torch.argmax(output).item()
        trg_indexes.append(pred_token)

        if pred_token == irish.word2index["EOS"]:
            break

    return "".join(irish.decodeIds(trg_indexes))

In [None]:
translate_sentence("My name is", model, english, irish, device)

In [86]:
import nltk
from nltk.translate.bleu_score import corpus_bleu

bleu_scores = []
for i in range(len(test)):
    test_pred = translate_sentence(str(src[i]), model, english, irish, device)
    bleu_score = corpus_bleu(test_pred, trg)
    bleu_scores.append(bleu_score)


avg_bleu = sum(bleu_scores) / len(bleu_scores)
print(f"Average BLEU score: {avg_bleu:.4f}")

AssertionError: The number of hypotheses and their reference(s) should be the same 

## Task 3. Improving NMT using Attention (10 pts) 
Extend the Encoder-Decoder model from Task 2 with the attention mechanism. Retrain the model and evaluate on test set. Print the updated average BLEU score on the test set. In a few sentences explains which model is the best for translation. 

In [41]:
class DecoderGRU(nn.Module):
    def __init__(
        self, 
        target_vocab_size,    # Size of target vocab 
        hidden_dim,           # hidden size of embedding  
        enc_hid_dim, 
        dec_hid_dim, 
        dropout
      ):
        super().__init__()

        self.output_dim = target_vocab_size
        self.attention = Attention(enc_hid_dim, dec_hid_dim)
        
        self.embedding = nn.Embedding(target_vocab_size, hidden_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + hidden_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear(
            (enc_hid_dim * 2) + dec_hid_dim + hidden_dim, 
            target_vocab_size
          )
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)  # [1, batch size]
        
        embedded = self.dropout(self.embedding(input))  # [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)     # [batch size, src len]
        a = a.unsqueeze(1)                              # [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2) # [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)           # [batch size, 1, enc hid dim * 2]
        weighted = weighted.permute(1, 0, 2)               # [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2) # [1, batch size, (enc hid dim * 2) + emb dim]

        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]    
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1)) # [batch size, output dim]
        return prediction, hidden.squeeze(0)

In [42]:
import random 
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time     
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):     
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs

In [43]:
INPUT_DIM = english.n_words
OUTPUT_DIM = irish.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 128
DEC_HID_DIM = 128
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = EncoderGRU(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = DecoderGRU(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT)

model = EncoderDecoder(enc, dec)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

EncoderDecoder(
  (encoder): EncoderGRU(
    (embedding): Embedding(11548, 256)
    (rnn): GRU(256, 128, bidirectional=True)
    (fc): Linear(in_features=256, out_features=128, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): DecoderGRU(
    (attention): Attention(
      (attn): Linear(in_features=384, out_features=128, bias=True)
      (v): Linear(in_features=128, out_features=1, bias=False)
    )
    (embedding): Embedding(16345, 256)
    (rnn): GRU(512, 128)
    (fc_out): Linear(in_features=640, out_features=16345, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [67]:
# Your Code Here

from tqdm.notebook import tqdm
import numpy as np 
optimizer = torch.optim.Adam(model.parameters())

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model.to(device)

EPOCHS = 5
best_test_loss = float('inf')

for epoch in range(EPOCHS):

    model.train()
    epoch_loss = 0
    for batch in tqdm(train_dl, total=len(train_dl)):

        src = batch[0].transpose(1, 0).to(device)
        trg = batch[1].transpose(1, 0).to(device)

        optimizer.zero_grad()

        output = model(src, trg)

        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim).to(device)
        trg = trg[1:].reshape(-1)

        loss = F.cross_entropy(output, trg)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        optimizer.step()
        epoch_loss += loss.item()

    train_loss = round(epoch_loss / len(train_dl), 3)

    etest_loss = 0
    model.eval()
    for batch in tqdm(test_dl, total=len(test_dl)):
        src = batch[0].transpose(1, 0).to(device)
        trg = batch[1].transpose(1, 0).to(device)

        with torch.no_grad():
            output = model(src, trg)

            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim).to(device)
            trg = trg[1:].reshape(-1)

            loss = F.cross_entropy(output, trg)

            etest_loss += loss.item()

        test_loss = round(etest_loss / len(test_dl), 3)
        print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(test_loss)}")


        if test_loss < best_test_loss:
            best_test_loss = test_loss
            torch.save(model.state_dict(), 'best-model.pt')  


  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 1.1537298016660105
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 1.3271051618171572
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 1.538795549956865
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 1.7985845599876695
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 2.0278984286853743
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 2.3631606937057947
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 2.688544583045722
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 3.058730620510393
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 3.448709144454898
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 4.104155512254132
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | val ppl 4.739823980017224
Epoch 0 | train loss 4.519 | train ppl 91.74380828267552 | v

  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 1.1723379466807176
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 1.3417839036669714
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 1.612845483383623
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 1.8663786452864723
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 2.1511444438853182
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 2.524391389973053
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 2.8950431039036304
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 3.3267638012449026
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 3.7621853549999105
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 4.585962466331417
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 5.360193097034803
Epoch 1 | train loss 4.254 | train ppl 70.3863955879128 | val ppl 6.1

  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 1.1583539630298554
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 1.336427488025472
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 1.5636142992864182
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 1.8515071812945383
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 2.110658533543552
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 2.427835209469566
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 2.8066735722367695
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 3.2219926385284996
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 3.702469390967303
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 4.428230196243525
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | val ppl 5.1038747185367255
Epoch 2 | train loss 4.016 | train ppl 55.47874641878359 | 

  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 1.1829366106478107
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 1.3730025719254588
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 1.5983949987546404
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 1.8294218719978594
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 2.134003941758656
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 2.481839452598748
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 2.9476257034472675
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 3.404166082790819
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 3.959033777841203
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 4.749313113948145
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | val ppl 5.479424077005176
Epoch 3 | train loss 3.734 | train ppl 41.84615847457281 | v

  0%|          | 0/300 [00:00<?, ?it/s]

  0%|          | 0/38 [00:00<?, ?it/s]

Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 1.167657961105125
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 1.3458152994480976
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 1.5920141888871011
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 1.8663786452864723
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 2.2078076288406328
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 2.6169250932469144
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 3.040433183989171
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 3.472934799336826
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 3.943229272812564
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 4.797044504278355
Epoch 4 | train loss 3.447 | train ppl 31.406032741941566 | val ppl 5.567799984149774
Epoch 4 | train loss 3.447 | train ppl 31.4060327

In [88]:
import nltk
from nltk.translate.bleu_score import corpus_bleu

bleu_scores = []
for i in range(len(test)):
    test_pred = translate_sentence(str(src[i]), model, english, irish, device)
    bleu_score = corpus_bleu(test_pred, trg)
    bleu_scores.append(bleu_score)


avg_bleu = sum(bleu_scores) / len(bleu_scores)
print(f"Average BLEU score: {avg_bleu:.4f}")


AssertionError: The number of hypotheses and their reference(s) should be the same 

I am not able to calculate the bleu scores for the predicted data. Although, on the basis of the accuracy scores obtained by training the model I can tell that model with attention mechanism works better than the model without attention mechanism.