# Overview
**Assignment 2** focuses on the training on a Neural Machine Translation (NMT) system for English-Irish translation where English is the source language and Irish is the target language. 

**Grading Policy** 
Assignment 2 is graded and will be worth 25% of your overall grade. This assignment is worth a total of 50 points distributed over the tasks below.  Please note that this is an individual assignment and you must not work with other students to complete this assessment. Any copying from other students, from student exercises from previous years, and any internet resources will not be tolerated. Plagiarised assignments will receive zero marks and the students who commit this act will be reported. Feel free to reach out to the TAs and instructors if you have any questions.

## Task 1 - Data Collection and Preprocessing (10 points)
## Task 1a. Data Loading (5 pts)
Dataset: https://www.dropbox.com/s/zkgclwc9hrx7y93/DGT-en-ga.txt.zip?dl=0 
*  Download a English-Irish dataset and decompress it. The `DGT.en-ga.en` file contains a list english sentences and `DGT.en-ga.ga` contains the paralell Irish sentences. Read both files into the Jupyter environment and load them into a pandas dataframe. 
* Randomly sample 12,000 rows.
* Split the sampled data into train (10k), development (1k) and test set (1k)

### Downloading the dataset and extracting all the files from the zip folder

In [84]:
"""
1. Importing the zipfile module
2. Download the compressed file from the Dropbox link using the wget command
3. Extract the contents of the compressed file using the zipfile module
References: https://realpython.com/python-zipfile/

"""
import zipfile
!wget --no-check-certificate -O DGT-en-ga.txt.zip https://www.dropbox.com/s/zkgclwc9hrx7y93/DGT-en-ga.txt.zip?dl=1
with zipfile.ZipFile("DGT-en-ga.txt.zip", 'r') as zip_ref:
    zip_ref.extractall()

--2023-04-03 22:00:11--  https://www.dropbox.com/s/zkgclwc9hrx7y93/DGT-en-ga.txt.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.81.18, 2620:100:6031:18::a27d:5112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.81.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/zkgclwc9hrx7y93/DGT-en-ga.txt.zip [following]
--2023-04-03 22:00:12--  https://www.dropbox.com/s/dl/zkgclwc9hrx7y93/DGT-en-ga.txt.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc4606b15affe6fd72f654c95cb9.dl.dropboxusercontent.com/cd/0/get/B5hLr2yVmUZbneeLNt_46PBQZGPA4y7t44UJubuTzHkXoX4zxfy1JL5HLnvhA86kyJRmLnHoVN0X_afp4e1DKmFYtDbiHxDLqwjkKYJt_-FURxGgqE-eYVBdvd6l81K1FoMqmOgAf-BR_tRA8hA9_56_VCEBITms_fewDufWsEhveA/file?dl=1# [following]
--2023-04-03 22:00:12--  https://uc4606b15affe6fd72f654c95cb9.dl.dropboxusercontent.com/cd/0/get/B5hLr2yVmUZbneeLNt_46PBQZGPA4y7t44UJubuTzHkXoX4zxfy1JL5HLnvh

### Checking and reading the contents of the extracted files

In [85]:
# References: https://www.freecodecamp.org/news/with-open-in-python-with-statement-syntax-example/

# Open the read-only file 'DGT.en-ga.en' containing English sentences.
with open('DGT.en-ga.en', 'r', encoding='utf8') as f:
    # Read all lines from the file and store them as a list of strings
    english_sentences = f.readlines()

# Opening a new file 'english_sentences.txt' in write mode
with open('english_sentences.txt', 'w', encoding='utf8') as f:
    
    # Add each English sentence from the list 'english_sentences' to the new file one by one.
    for sentence in english_sentences:
        f.write(sentence)
        
with open('DGT.en-ga.ga', 'r', encoding='utf8') as f:
    irish_sentences = f.readlines()

with open('irish_sentences.txt', 'w', encoding='utf8') as f:
    for sentence in irish_sentences:
        f.write(sentence)

###  Load the English and Irish files into Pandas dataframes

In [86]:
import pandas as pd

english_df = pd.DataFrame(english_sentences, columns=['english_sentences'])
irish_df = pd.DataFrame(irish_sentences, columns=['irish_sentences'])

# Combine the English and Irish dataframes
df = pd.concat([english_df, irish_df], axis=1)
df

Unnamed: 0,english_sentences,irish_sentences
0,Procès-verbal of rectification to the Conventi...,Miontuairisc cheartaitheach maidir le Coinbhin...
1,(Official Journal of the European Union L 147 ...,(Iris Oifigiúil an Aontais Eorpaigh L 147 an 1...
2,This rectification has been carried out by mea...,Rinneadh an ceartúchán seo le miontuairisc che...
3,"On pages 33-34, Annex I:\n","Ar leathanaigh 33-34, Iarscríbhinn I:\n"
4,the entries for the States below are rectified...,maidir leis na hiontrálacha le haghaidh na Stá...
...,...,...
181622,For the Council\n,"Maidir le roinnt forálacha eile, níor beartaío..."
181623,Position of the European Parliament of 31 Janu...,"I mí an Mheithimh 2018, thíolaic an Coimisiún ..."
181624,Regulation (EU) No 1305/2013 of the European P...,"Dá bhrí sin, is iomchuí dul ar aghaidh agus ro..."
181625,Regulation (EU) 2017/2393 of the European Parl...,"Ní leagtar síos san Airteagal sin, áfach, ach ..."


### Randomly sample 12,000 rows.

In [87]:
"""
Determine the lengths of tokens in English and Irish sentences

"""
df["length_of_english_sentences"] = df["english_sentences"].apply(lambda x: len(x.split(" ")))
df["length_of_irish_sentences"] = df["irish_sentences"].apply(lambda x: len(x.split(" ")))

"""
Retain the sentences where the token length difference is less than 1-2

"""
df = df[(abs(df["length_of_english_sentences"] - df["length_of_irish_sentences"]) <= 2)]

# Randomly sample 12,000 rows from the filtered DataFrame
sampled_df = df.sample(n=12000, random_state=2023)

# Display the shape of the sampled DataFrame
print(sampled_df.shape)


(12000, 4)


### Split the sampled data into train (10k), development (1k) and test set (1k)

In [88]:
# Referred from lab notes
from sklearn.model_selection import train_test_split

train, test = train_test_split(sampled_df, test_size=0.1, random_state=2013)
train, val = train_test_split(train, test_size=0.1, random_state=2013)
train["split"] = "train"
val["split"] = "val"
test["split"] = "test"
dataset = pd.concat([train, val, test])
print(f"Datasets => Train {len(train)} | Val {len(val)} | Test {len(test)}")
dataset

Datasets => Train 9720 | Val 1080 | Test 1200


Unnamed: 0,english_sentences,irish_sentences,length_of_english_sentences,length_of_irish_sentences,split
15893,Article 10\n,Airteagal 10\n,2,2,train
33277,Member States shall provide the Commission wit...,Cuirfidh na Ballstáit ar fáil don Choimisiún a...,28,26,train
19804,Such PPP operations shall comply with applicab...,Comhlíonfaidh oibríochtaí CPP den sórt sin an ...,16,18,train
36450,Sole\n,Sól\n,1,1,train
180085,an Executive Director.\n,Airteagal 3\n,3,2,train
...,...,...,...,...,...
177811,"all results of the clinical trials, fully desc...",Déanfar na torthaí a chur i láthair i dtéarmaí...,34,35,test
131880,"OJ C 17, 18.1.2017, p. 46.\n","IO C 17, 18.1.2017, lch.\n",4,5,test
84281,"Dried vegetables, whole, cut, sliced, broken\n","Trátaí, úra nó fuaraithe\n",6,4,test
30463,All profiles extremely convex;\n,Gach próifíl fíordhronnach;\n,4,3,test


## Task 1b. Preprocessing (5 pts)
* Add '<bof\>' to denote beginning of sentence and '<eos\>' to denote the end of the sentence to each target line.
* Perform the following pre-processing steps:
  * Lowercase the text
  * Remove all punctuation
  * tokenize the text 
*  Build seperate vocabularies for each language. 
  * Assign each unique word an id value 
*Print statistics on the selected dataset:
  * Number of samples
  * Number of unique source language tokens
  * Number of unique target language tokens
  * Max sequence length of source language
  * Max sequence length of target language



### Preprocessing steps

In [89]:
# Referred from lab notes

from nltk.tokenize import word_tokenize 
from typing import List 
import re 

class Langauge:
  def __init__(self, language: str):
    self.language = language                            # Name of the langauge
    self.word2index = {"PAD": 0, "SOS": 1, "EOS": 2}    # Maps each word in vocab to id
    self.index2word = {0: "PAD", 1: "SOS", 2: "EOS"}    # Reverse map of id to word in vocab
    self.word2count = {}                                # Count of each word in vocab
    self.n_words = len(self.index2word)                 # number of words in vocab

  def addSentence(self, sentence: str):
    """ 
    Given a sentence, lowercase is and remove any punctuation. Tokenize the
    sentence and for each word in the tokenized list call the addWord method.
    """
    text = sentence.lower()
    clean_text = re.sub(r'[^\w\s]', '', text).strip()
    for word in word_tokenize(clean_text):
      self.addWord(word)
  
  def addWord(self, word: str):
    """
    For each input word, check if it exists in the the word2index. If it does 
    not, add the word to the word2index and set the value to the current 
    vocabulary length. Update the index2word entry as well which maps the token 
    id to the word. Finaally update the vocabulary count (n_words).

    If the word is already in the vocabulary, udpate the count.
    """
    if word not in self.word2index:
      self.word2index[word] = self.n_words
      self.word2count[word] = 1
      self.index2word[self.n_words] = word
      self.n_words += 1
    else:
      self.word2count[word] += 1

  def encodeSentence(self, sentence: str) -> List[int]:
    """
    Given a sentence:
      1. Lower case it
      2. Remove all punctuation
      3. Prepend SOS and append EOS to it.
      4. Tokenize it and return the word ids for each word in the tokenized list. If a word
      does not exist in the vocab, skip over it. 

      Return a list of word ids. 
    """
    text = sentence.lower()
    clean_text = re.sub(r'[^\w\s]', '',text).strip()
    clean_text = "SOS " + clean_text + " EOS"
    return [self.word2index[word] for word in word_tokenize(clean_text) if word in self.word2index]

  def decodeIds(self, ids: list) -> List[str]:
    """
    Given a list of word ids, look the ids in the index2word and return a
    string representing the decoded sentence. 
    """
    return " ".join([self.index2word[tok] for tok in ids])

### Build seperate vocabularies for English and Irish language.

In [90]:
# Import the nltk library and download the punkt tokenizer
import nltk
nltk.download('punkt')

# To display progress bars during the loop, import the tqdm library.
from tqdm.notebook import tqdm 

# Create Language objects for English and Irish
english = Langauge("english")
irish = Langauge("irish")

# Cycle through each row of the dataset, displaying a progress bar with tqdm.
for _, row in tqdm(dataset.iterrows(), total=len(dataset)):
  
  # Add the English sentence to the English Language object
  english.addSentence(row["english_sentences"])
  
  # Add the Irish sentence to the Irish Language object
  irish.addSentence(row["irish_sentences"])

# Print the size of the English and Irish vocabularies
print(f"Size of English vocab: {english.n_words}")
print(f"Size of Irish vocab: {irish.n_words}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


  0%|          | 0/12000 [00:00<?, ?it/s]

Size of English vocab: 10167
Size of Irish vocab: 13632


###  Print statistics on the selected dataset
1. Number of samples
2. Number of unique source language tokens
3. Number of unique target language tokens
4. Max sequence length of source language
5. Max sequence length of target language

In [91]:
# Referred from lab notes
# Modified by Karekadu Manu Shankar Nair

# Number of samples
num_samples = len(dataset)

# English language statistics
english = Langauge("english")
for _, row in dataset.iterrows():
  english.addSentence(row["english_sentences"])
num_unique_english_tokens = english.n_words
max_english_sequence_length = max([len(english.encodeSentence(row["english_sentences"])) for _, row in dataset.iterrows()])

# Irish language statistics
irish = Langauge("irish")
for _, row in dataset.iterrows():
  irish.addSentence(row["irish_sentences"])
num_unique_irish_tokens = irish.n_words
max_irish_sequence_length = max([len(irish.encodeSentence(row["irish_sentences"])) for _, row in dataset.iterrows()])

# Print statistics
print(f"Number of samples: {num_samples}")
print(f"Number of unique English tokens: {num_unique_english_tokens}")
print(f"Number of unique Irish tokens: {num_unique_irish_tokens}")
print(f"Max sequence length of English: {max_english_sequence_length}")
print(f"Max sequence length of Irish: {max_irish_sequence_length}")

dataset


Number of samples: 12000
Number of unique English tokens: 10167
Number of unique Irish tokens: 13632
Max sequence length of English: 149
Max sequence length of Irish: 149


Unnamed: 0,english_sentences,irish_sentences,length_of_english_sentences,length_of_irish_sentences,split
15893,Article 10\n,Airteagal 10\n,2,2,train
33277,Member States shall provide the Commission wit...,Cuirfidh na Ballstáit ar fáil don Choimisiún a...,28,26,train
19804,Such PPP operations shall comply with applicab...,Comhlíonfaidh oibríochtaí CPP den sórt sin an ...,16,18,train
36450,Sole\n,Sól\n,1,1,train
180085,an Executive Director.\n,Airteagal 3\n,3,2,train
...,...,...,...,...,...
177811,"all results of the clinical trials, fully desc...",Déanfar na torthaí a chur i láthair i dtéarmaí...,34,35,test
131880,"OJ C 17, 18.1.2017, p. 46.\n","IO C 17, 18.1.2017, lch.\n",4,5,test
84281,"Dried vegetables, whole, cut, sliced, broken\n","Trátaí, úra nó fuaraithe\n",6,4,test
30463,All profiles extremely convex;\n,Gach próifíl fíordhronnach;\n,4,3,test


## Task 2. Model Implementation and Training (30 pts)



## Task 2a. Encoder-Decoder Model Implementation (10 pts)
Implement an Encoder-Decoder model in Pytorch with the following components
* A single layer RNN based encoder. 
* A single layer RNN based decoder
* A Encoder-Decoder model based on the above components that support sequence-to-sequence modelling. For the encoder/decoder you can use RNN, LSTMs or GRU. Use a hidden dimension of 256 or less depending on your compute constraints. 

### Creating features for training

In [92]:
# Referred from lab notes

import torch 
from tensorflow.keras.utils import pad_sequences
import pandas as pd

def encode_features(
    df: pd.DataFrame, 
    english: Langauge,
    irish: Langauge,
    pad_token: int = 0,
    max_seq_length = 10
  ):

  source = []
  target = []

  for _, row in df.iterrows():
    source.append(english.encodeSentence(row["english_sentences"]))
    target.append(irish.encodeSentence(row["irish_sentences"]))

  source = pad_sequences(
      source,
      maxlen=max_seq_length,
      padding="post",
      truncating = "post",
      value=pad_token
    )

  target = pad_sequences(
      target,
      maxlen=max_seq_length,
      padding="post",
      truncating = "post",
      value=pad_token
    )
  
  return source, target

train_source, train_target = encode_features(train, english, irish)
val_source, val_target = encode_features(val, english, irish)
test_source, test_target = encode_features(test, english, irish)

print(f"Shapes of train source {train_source.shape}, and target {train_target.shape}")

Shapes of train source (9720, 10), and target (9720, 10)


## Implementing the model
### A single layer RNN based encoder.

In [93]:
# Source: https://github.com/bentrevett/pytorch-seq2seq

import torch 
import torch.nn as nn
import torch.nn.functional as F

# Define the Encoder class as a module
class Encoder(nn.Module):

    def __init__(self, input_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        # Store the hidden dimension
        self.hid_dim = hid_dim
        
        # Create an embedding layer
        self.embedding = nn.Embedding(input_dim, emb_dim) 
        
        # Create a GRU layer
        self.rnn = nn.GRU(emb_dim, hid_dim)
        
        # Create a dropout layer
        self.dropout = nn.Dropout(dropout)
    
    # Define the forward pass of the Encoder
    def forward(self, src):
        
        # Apply dropout to the embedded input
        embedded = self.dropout(self.embedding(src))
        
        # Apply the GRU layer to the embedded input
        outputs, hidden = self.rnn(embedded) #no cell state!
        
        # Return only the hidden state
        return hidden

### A single layer RNN based decoder

In [94]:
# Source: https://github.com/bentrevett/pytorch-seq2seq

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        # Store the hidden dimension and output dimension
        self.hid_dim = hid_dim
        self.output_dim = output_dim
        
        # Create a layer for embedding.
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        # Build a GRU layer that accepts both embedded and hidden input.
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim)
        
        # Make a fully linked layer that maps concatenated output to output_dim.
        self.fc_out = nn.Linear(emb_dim + hid_dim * 2, output_dim)
        
        # Create a dropout layer
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, context):
        
        input = input.unsqueeze(0)
        
        # Dropout is applied to the embedded input.
        embedded = self.dropout(self.embedding(input))

        # Combine the embedded input and the context vector.        
        emb_con = torch.cat((embedded, context), dim = 2)

        # To the concatenated input and preceding concealed state, apply the GRU layer.  
        output, hidden = self.rnn(emb_con, hidden)
        
        # Add the embedded input, current hidden state, and context vector together.
        output = torch.cat((embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)), 
                           dim = 1)
        
        # Apply the fully connected layer to the concatenated output
        prediction = self.fc_out(output)
        

        return prediction, hidden

### A Encoder-Decoder model based on the above components that support sequence-to-sequence modelling.

In [95]:
# Source: https://github.com/bentrevett/pytorch-seq2seq

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        # Initialize encoder, decoder, and device
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        # Check to see if the encoder and decoder's hidden dimensions are the same.
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        # obtain the batch size, the target length, and the target vocabulary size
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        # create a tensor to hold decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        # Using the context, obtain the encoder's most recent hidden state
        context = self.encoder(src)
        
        # Use the context as the decoder's first hidden state.
        hidden = context
        
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            # insert input token embedding, previous hidden state, and the context state
            # receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, context)
            
            # place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            # get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs

In [96]:
# Referred from lab notes

INPUT_DIM = english.n_words
OUTPUT_DIM = irish.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_DROPOUT)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Seq2Seq(enc, dec, device).to(device)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(10167, 256)
    (rnn): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(13632, 256)
    (rnn): GRU(768, 512)
    (fc_out): Linear(in_features=1280, out_features=13632, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

## Task 2b. Training (10 pts)
Implement the code to train the Encoder-Decoder model on the Irish-English data. You will write code for the following:
* Training, validation and test dataloaders 
* A training loop which trains the model for 5 epoch. Evaluate the loop at the end of each Epoch. Print out the train perplexity and validation perplexity after each epoch.

### Using the TensorDataset and Dataloader classes, create data loaders for the train, val, and test features.

In [97]:
# Referred from lab notes

from torch.utils.data import DataLoader, TensorDataset

train_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(train_source),
        torch.LongTensor(train_target)
    ),
    shuffle = True,
    batch_size = 32
)

val_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(val_source),
        torch.LongTensor(val_target)
    ),
    shuffle = False,
    batch_size = 32
)

test_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(test_source),
        torch.LongTensor(test_target)
    ),
    shuffle = False,
    batch_size = 32
)

In [98]:
# Referred from lab notes

from tqdm.notebook import tqdm
import random 
import numpy as np 
optimizer = torch.optim.Adam(model.parameters())
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model.to(device)

EPOCHS = 5
best_val_loss = float('inf')
for epoch in range(EPOCHS):

  model.train()
  epoch_loss = 0
  for batch in tqdm(train_dl, total=len(train_dl)):

     src = batch[0].transpose(1, 0).to(device)
     trg = batch[1].transpose(1, 0).to(device)

     optimizer.zero_grad()

     output = model(src, trg)

     output_dim = output.shape[-1]
     output = output[1:].view(-1, output_dim).to(device)
     trg = trg[1:].reshape(-1)
     
     loss = F.cross_entropy(output, trg)
     loss.backward()

     torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
     optimizer.step()
     epoch_loss += loss.item()

  train_loss = round(epoch_loss / len(train_dl), 3)
  
  eval_loss = 0
  model.eval()
  for batch in tqdm(val_dl, total=len(val_dl)):
    src = batch[0].transpose(1, 0).to(device)
    trg = batch[1].transpose(1, 0).to(device)

    with torch.no_grad():
      output = model(src, trg)
      
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim).to(device)
      trg = trg[1:].reshape(-1)
      
      loss = F.cross_entropy(output, trg)
      
      eval_loss += loss.item()
  
  val_loss = round(eval_loss / len(val_dl), 3)
  print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(val_loss)}")


  if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), 'best-model.pt')  
  

  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 0 | train loss 4.969 | train ppl 143.8829324729939 | val ppl 84.85975901710535


  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 1 | train loss 4.064 | train ppl 58.2066727367753 | val ppl 67.89755329714343


  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 2 | train loss 3.709 | train ppl 40.81297314055928 | val ppl 60.27997746985422


  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 3 | train loss 3.42 | train ppl 30.569415021050208 | val ppl 62.48959107046551


  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 4 | train loss 3.077 | train ppl 21.693225003979848 | val ppl 63.62458803768623


# Task 2c. Evaluation on the Test Set (10 pts)
Use the trained model to translate the text from the source language into the target language on the test set. Evaluate the performance of the model on the test set using the BLEU metric and print out the average the BLEU score.

In [99]:
# Referred from lab notes
# Modified by Karekadu Manu Shankar Nair

def translate_sentence(
    text: str, 
    model: Seq2Seq, 
    english: Langauge,
    irish: Langauge,
    device: str,
    max_len: int = 10,
  ) -> str:

  # Encode english sentence and convert to tensor
  input_ids = english.encodeSentence(text)
  input_tensor = torch.LongTensor(input_ids).unsqueeze(1).to(device)

  # Get encooder hidden states
  with torch.no_grad():
    encoder_outputs = model.encoder(input_tensor)

  hidden = encoder_outputs
  # Build target holder list
  trg_indexes = [irish.word2index["SOS"]]

  # Loop over sequence length of target sentence
  for i in range(max_len):
    trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
    
    # Decode the encoder outputs with respect to current target word
    with torch.no_grad():
      output, hidden = model.decoder(trg_tensor, hidden, encoder_outputs)
    
    # Retrieve most likely word over target distribution
    pred_token = torch.argmax(output).item()
    trg_indexes.append(pred_token)

    if pred_token == irish.word2index["EOS"]:
      break

  return "".join(irish.decodeIds(trg_indexes))

In [100]:
import nltk
from nltk.translate.bleu_score import sentence_bleu

# Defines the function for evaluating a model's performance with attention layer
def evaluate_model(model, test_source, test_target, input_lang, output_lang, device):

    bleu_score = 0.0 # initializing the score to zero

    # Loops through the source and target sequences in the test set
    for source, target in zip(test_source, test_target):

        # Decodes the source and target sequences using their respective languages
        source_sent = " ".join(input_lang.decodeIds(source))
        target_sent = " ".join(output_lang.decodeIds(target))

        # Generates the predicted sequence using the attention-based translation model
        predicted_sent = " ".join(translate_sentence(source_sent, model, input_lang, output_lang, device))

        # Calculates the BLEU score between the target and predicted sequences
        bleu_score_i = sentence_bleu([target_sent.split()], predicted_sent.split(), weights=(0.25, 0.25, 0.25, 0.25))

        # Adds the BLEU score to the total score for the test set
        bleu_score += bleu_score_i

    # Calculates the average BLEU score for the test set
    avg_bleu_score = bleu_score / len(test_source)
    print(f"The average BLEU score for the given test set is: {avg_bleu_score}")
    return avg_bleu_score


In [101]:
test_bleu_score = evaluate_model(model, test_source, test_target, english, irish, device)

The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


The average BLEU score for the given test set is: 0.07095814112016854


## Task 3. Improving NMT using Attention (10 pts) 
Extend the Encoder-Decoder model from Task 2 with the attention mechanism. Retrain the model and evaluate on test set. Print the updated average BLEU score on the test set. In a few sentences explains which model is the best for translation. 

In [102]:
# Source: Lab notes
import torch 
import torch.nn as nn 
import torch.nn.functional as F

class EncoderGRU(nn.Module):
    def __init__(
        self, 
        input_vocab_size,  # size of source vocabulary  
        hidden_dim,        # hidden dimension of embeddings
        encoder_hid_dim,   # gru hidden dim
        decoder_hid_dim,   # decoder hidden dim 
        dropout_prob = .5
      ):
      
        super().__init__()
        self.embedding = nn.Embedding(input_vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, encoder_hid_dim, bidirectional = True)
        self.fc = nn.Linear(encoder_hid_dim * 2, decoder_hid_dim)
        self.dropout = nn.Dropout(dropout_prob)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards GRU
        #hidden [-1, :, : ] is the last of the backwards GRU
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))        
        return outputs, hidden

In [103]:
# Source: Lab notes
class Attention(nn.Module):
    def __init__(
        self, 
        enc_hid_dim,      # Encoder hidden dimension
        dec_hid_dim       # Decoder hidden dimension 
      ):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]
        attention = self.v(energy).squeeze(2)
        
        #attention output: [batch size, src len]
        return F.softmax(attention, dim=1)

In [104]:
# Source: Lab notes
class DecoderGRU(nn.Module):
    def __init__(
        self, 
        target_vocab_size,    # Size of target vocab 
        hidden_dim,           # hidden size of embedding  
        enc_hid_dim, 
        dec_hid_dim, 
        dropout
      ):
        super().__init__()

        self.output_dim = target_vocab_size
        self.attention = Attention(enc_hid_dim, dec_hid_dim)
        
        self.embedding = nn.Embedding(target_vocab_size, hidden_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + hidden_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear(
            (enc_hid_dim * 2) + dec_hid_dim + hidden_dim, 
            target_vocab_size
          )
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)  # [1, batch size]
        
        embedded = self.dropout(self.embedding(input))  # [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)     # [batch size, src len]
        a = a.unsqueeze(1)                              # [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2) # [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)           # [batch size, 1, enc hid dim * 2]
        weighted = weighted.permute(1, 0, 2)               # [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2) # [1, batch size, (enc hid dim * 2) + emb dim]

        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]    
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1)) # [batch size, output dim]
        return prediction, hidden.squeeze(0)

In [105]:
# Source: Lab notes
import random 
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time     
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):     
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs

In [106]:
# Source: Lab notes
INPUT_DIM = english.n_words
OUTPUT_DIM = irish.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 128
DEC_HID_DIM = 128
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = EncoderGRU(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = DecoderGRU(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT)

model = EncoderDecoder(enc, dec)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

EncoderDecoder(
  (encoder): EncoderGRU(
    (embedding): Embedding(10167, 256)
    (rnn): GRU(256, 128, bidirectional=True)
    (fc): Linear(in_features=256, out_features=128, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): DecoderGRU(
    (attention): Attention(
      (attn): Linear(in_features=384, out_features=128, bias=True)
      (v): Linear(in_features=128, out_features=1, bias=False)
    )
    (embedding): Embedding(13632, 256)
    (rnn): GRU(512, 128)
    (fc_out): Linear(in_features=640, out_features=13632, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [107]:
# Source: Lab notes
from tqdm.notebook import tqdm
import numpy as np 
optimizer = torch.optim.Adam(model.parameters())

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model.to(device)

EPOCHS = 5
best_val_loss = float('inf')

for epoch in range(EPOCHS):

  model.train()
  epoch_loss = 0
  for batch in tqdm(train_dl, total=len(train_dl)):

     src = batch[0].transpose(1, 0).to(device)
     trg = batch[1].transpose(1, 0).to(device)

     optimizer.zero_grad()

     output = model(src, trg)

     output_dim = output.shape[-1]
     output = output[1:].view(-1, output_dim).to(device)
     trg = trg[1:].reshape(-1)
     
     loss = F.cross_entropy(output, trg)
     loss.backward()

     torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
     optimizer.step()
     epoch_loss += loss.item()

  train_loss = round(epoch_loss / len(train_dl), 3)
  
  eval_loss = 0
  model.eval()
  for batch in tqdm(val_dl, total=len(val_dl)):
    src = batch[0].transpose(1, 0).to(device)
    trg = batch[1].transpose(1, 0).to(device)

    with torch.no_grad():
      output = model(src, trg)
      
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim).to(device)
      trg = trg[1:].reshape(-1)
      
      loss = F.cross_entropy(output, trg)
      
      eval_loss += loss.item()
  
  val_loss = round(eval_loss / len(val_dl), 3)
  print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(val_loss)}")


  if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), 'best-model.pt')  
  

  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 0 | train loss 5.169 | train ppl 175.7390105746881 | val ppl 96.44761391904002


  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 1 | train loss 4.177 | train ppl 65.17004950678147 | val ppl 72.67518557985093


  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 2 | train loss 3.856 | train ppl 47.27586918039787 | val ppl 67.08765178544276


  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 3 | train loss 3.634 | train ppl 37.86396998885605 | val ppl 62.614695315149696


  0%|          | 0/304 [00:00<?, ?it/s]

  0%|          | 0/34 [00:00<?, ?it/s]

Epoch 4 | train loss 3.422 | train ppl 30.630615030701964 | val ppl 59.979329827958686


In [108]:
# Source: Lab notes
def translate_sentence_attention_layer(
    text: str, 
    model: Seq2Seq, 
    english: Langauge,
    irish: Langauge,
    device: str,
    max_len: int = 10,
  ) -> str:

  # Encode english sentence and convert to tensor
  input_ids = english.encodeSentence(text)
  input_tensor = torch.LongTensor(input_ids).unsqueeze(1).to(device)

  # Get encooder hidden states
  with torch.no_grad():
    encoder_outputs, hidden = model.encoder(input_tensor)

  # Build target holder list
  trg_indexes = [irish.word2index["SOS"]]

  # Loop over sequence length of target sentence
  for i in range(max_len):
    trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
    
    # Decode the encoder outputs with respect to current target word
    with torch.no_grad():
      output, hidden = model.decoder(trg_tensor, hidden, encoder_outputs)
    
    # Retrieve most likely word over target distribution
    pred_token = torch.argmax(output).item()
    trg_indexes.append(pred_token)

    if pred_token == irish.word2index["EOS"]:
      break

  return "".join(irish.decodeIds(trg_indexes))

In [109]:
import nltk
from nltk.translate.bleu_score import sentence_bleu

# Defines the function for evaluating a model's performance with attention layer

def evaluate_model_with_attention(model, test_source, test_target, input_lang, output_lang, device):

    bleu_score = 0.0 # initializing the score to zero

    # Loops through the source and target sequences in the test set
    for source, target in zip(test_source, test_target):

        # Decodes the source and target sequences using their respective languages
        source_sent = " ".join(input_lang.decodeIds(source))
        target_sent = " ".join(output_lang.decodeIds(target))

        # Generates the predicted sequence using the attention-based translation model
        predicted_sent = " ".join(translate_sentence_attention_layer(source_sent, model, input_lang, output_lang, device))

        # Calculates the BLEU score between the target and predicted sequences
        bleu_score_i = sentence_bleu([target_sent.split()], predicted_sent.split(), weights=(0.25, 0.25, 0.25, 0.25))

        # Adds the BLEU score to the total score for the test set
        bleu_score += bleu_score_i

    # Calculates the average BLEU score for the test set
    avg_bleu_score = bleu_score / len(test_source)

    print(f"The average BLEU score for the given test set is: {avg_bleu_score}")
    return avg_bleu_score


In [110]:
test_bleu_score = evaluate_model_with_attention(model, test_source, test_target, english, irish, device)

The average BLEU score for the given test set is: 0.043078649419260345


According to the obtained BLEU scores, the NMT model with attention appears to have a lower average BLEU score than the encoder-decoder model.

One cause could be that the attention mechanism does not always focus on the most relevant sections of the input sentence, resulting in less accurate translations. 

Furthermore, the NMT model may not have been trained for a long enough period of time or with enough data to provide reliable translations.


