# Assignment 2 - Neural Machine Translation

This assignment introduces neural networks and transformers in the context of NLP. We will prepare a relevant dataset, preprocess it, split it into training, validation, and test sets, build our own language vocabulary, implement a custom transformer model, and configure a training loop. Along the way, we will implement a standard machine learning pipeline and train a transformer model to perform neural machine translation (NMT). This workflow covers foundational principles of natural language processing and gives us hands-on practice with current deep learning tools.

This assignment uses the IWSLT (International Workshop on Spoken Language Translation) dataset, a widely used parallel corpus covering various language pairs. It suits our NMT task because the data is aligned at the sentence level, which streamlines our training and evaluation process. Dataset page: https://huggingface.co/datasets/IWSLT/iwslt2017

### Table of Contents

These are the tasks we will complete in this assignment:

1. Data Preparation
2. Build Custom Vocabulary
3. Custom Transformer Architecture
4. Training Loop
5. Evaluation
6. Report Results

To install all required packages for this assignment into your current environment, follow the instructions below:

```pip install -r requirements.txt```



If you wish to create a new virtual environment, execute the following commands. The first one is optional, since the venv module ships with Python 3:

```pip install virtualenv```

```python -m venv {name your env}```

e.g.

```python -m venv myenv```

This will create your environment folder. You can then activate it. On Windows:

```{the name of your env}\Scripts\activate```

e.g.

```myenv\Scripts\activate```

On macOS or Linux, activate with ```source myenv/bin/activate``` instead. Now install the packages:

```pip install -r requirements.txt```

You should now have all the required packages! If a package fails to install, resolve the error before continuing, because a single failure can leave none of the packages installed. If needed, edit the requirements file to remove the lines causing issues, then install those packages yourself to avoid version conflicts.

### 1. Data Preparation
Objective: Learn how to prepare data for training NMT models

- In this task we will use the IWSLT dataset for a language pair of our choice. You can choose any available pair you like.
- We will then preprocess the data by tokenizing it, initializing symbols and special tokens, and converting text into numerical sequences.
- Finally, we will split our dataset into training, validation, and test sets.
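The loader used below already returns three splits, but if the pair you choose only comes as a single list of sentence pairs, the split step could look roughly like this. This is a minimal sketch, not part of the assignment code: `split_pairs`, the 80/10/10 ratios, and the toy sentence pairs are assumptions made for illustration.

```python
import random

def split_pairs(pairs, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle sentence pairs and split them into train/valid/test lists."""
    pairs = list(pairs)                    # copy so the caller's list keeps its order
    random.Random(seed).shuffle(pairs)     # fixed seed keeps the split reproducible
    n_valid = int(len(pairs) * valid_frac)
    n_test = int(len(pairs) * test_frac)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test

# Toy (hypothetical) parallel data: 20 German/English pairs
toy_pairs = [(f"satz {i}", f"sentence {i}") for i in range(20)]
train, valid, test = split_pairs(toy_pairs)
print(len(train), len(valid), len(test))
```

Shuffling before splitting matters when the corpus is ordered by document or topic, since otherwise the test set may cover different material than the training set.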

In [None]:
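# Note: the torchtext.legacy namespace only exists in torchtext 0.9 to 0.11
import torch
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.legacy import datasets  # makes torchtext.legacy.datasets available below
from torchtext.legacy.data import Field, BucketIterator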

# Define the source and target languages of your choice
SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Prepare tokenizer sets
token_set = {}
vocab_set = {}

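# Using torchtext's tokenizer with spaCy models (full model names work across spaCy versions).
# Install them first: python -m spacy download de_core_news_sm / en_core_web_sm
token_set[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_set[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')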

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

# Define the fields
SRC = Field(tokenize=token_set[SRC_LANGUAGE], init_token='<bos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=token_set[TGT_LANGUAGE], init_token='<bos>', eos_token='<eos>', lower=True)

# Load the dataset
train_data, valid_data, test_data = torchtext.legacy.datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(SRC, TRG), root='data')

The cell above prepares the data for training our NMT model. We use the TorchText package for easy access to parallel texts in multiple languages, including English, German, and French. It's a fantastic place for us to hit the ground running with multilingual machine translation!

### 2. Build Vocabulary

Now we will implement a method to build a vocabulary from the training dataset. We will then convert sentences to sequences of token IDs using the custom vocabulary.

In [None]:
# Build the vocabulary
SRC.build_vocab(train_data.src, min_freq=2, specials=special_symbols)
TRG.build_vocab(train_data.trg, min_freq=2, specials=special_symbols)
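To make the vocabulary step less of a black box, here is a rough pure-Python sketch of the same idea: count tokens, keep those that appear at least `min_freq` times, and map sentences to ID sequences. The helper names `build_vocab` and `numericalize` and the toy corpus are assumptions for illustration; torchtext's own implementation differs in its details.

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_vocab(tokenized_sentences, min_freq=2, specials=SPECIALS):
    """Map each sufficiently frequent token to an integer ID."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)  # special symbols take the first indices
    # Sort by descending frequency, then alphabetically, for a deterministic order
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Wrap a sentence in <bos>/<eos> and replace each token with its ID."""
    unk = stoi['<unk>']
    return [stoi['<bos>']] + [stoi.get(t, unk) for t in tokens] + [stoi['<eos>']]

corpus = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['a', 'cat']]
stoi, itos = build_vocab(corpus)
print(itos)
print(numericalize(['the', 'bird', 'sat'], stoi))
```

Tokens below the frequency threshold (here `dog` and `a`) and tokens never seen in training (here `bird`) both map to `<unk>`, which is why `min_freq` trades vocabulary size against coverage.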

Let's create iterators for efficient data handling during training. We can leverage a CUDA GPU if you have one available (you will need the CUDA build of PyTorch installed); otherwise, the CPU is perfectly fine.

In [None]:
# Create the iterators
BATCH_SIZE = 32

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
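`BucketIterator` batches examples of similar length together so that less padding is needed. The sketch below illustrates that idea in plain Python. It is a simplified assumption of the behaviour, not torchtext's implementation (which also shuffles batches), and `bucket_batches` is a hypothetical helper name.

```python
def bucket_batches(sequences, batch_size, pad_idx=1):
    """Group sequences of similar length, then pad each batch to its own max length."""
    ordered = sorted(sequences, key=len)   # neighbours in this list have similar lengths
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        max_len = max(len(seq) for seq in chunk)
        # Right-pad every sequence in the chunk with pad_idx
        batches.append([seq + [pad_idx] * (max_len - len(seq)) for seq in chunk])
    return batches

# Toy ID sequences (2 = <bos>, 3 = <eos>, 1 = <pad>)
seqs = [[2, 5, 3], [2, 7, 8, 9, 4, 3], [2, 6, 3], [2, 7, 8, 9, 3]]
batches = bucket_batches(seqs, batch_size=2)
for batch in batches:
    print(batch)
```

With these four sequences, bucketing needs a single padding token in total, whereas batching them in their original order would need five.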

### 3. Custom Transformer Architecture

We will now implement a custom transformer class using PyTorch modules and layers. Our implementation will include attention mechanisms, positional encoding, and feed-forward neural networks.

We will also be implementing forward and masking methods for the transformer model.
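The `Transformer` class in the next cell instantiates `PositionalEncoding(d_model, dropout)`, but the notebook never defines it. A minimal sketch of the usual sinusoidal version is given here so the cell can run. The `max_len=5000` default and the requirement that `d_model` be even are assumptions of this sketch, not something the assignment specifies.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position signals to token embeddings, then applies dropout."""

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)               # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )                                                           # (d_model / 2,)
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)             # even dimensions
        pe[:, 0, 1::2] = torch.cos(position * div_term)             # odd dimensions
        # A buffer follows the module to the GPU but is not trained by the optimizer
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (seq_len, batch_size, d_model), the default layout of nn.Transformer
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)
```

Because the encoding is added to the embeddings, sequences longer than `max_len` would fail. Raise the limit if your data contains very long sentences.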

In [None]:
import torch.nn as nn
import math

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, trg_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward, dropout=0.1):
        super(Transformer, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.trg_embedding = nn.Embedding(trg_vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward, dropout)
        self.fc_out = nn.Linear(d_model, trg_vocab_size)
        self.src_vocab_size = src_vocab_size
        self.trg_vocab_size = trg_vocab_size
        self.d_model = d_model

    def generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def forward(self, src, trg, src_mask, trg_mask):
        src = self.src_embedding(src) * math.sqrt(self.d_model)
        trg = self.trg_embedding(trg) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        trg = self.pos_encoder(trg)
        output = self.transformer(src, trg, src_mask, trg_mask)
        output = self.fc_out(output)
        return output

    def encode(self, src, src_mask):
        return self.transformer.encoder(self.pos_encoder(self.src_embedding(src) * math.sqrt(self.d_model)), src_mask)

    def decode(self, trg, memory, trg_mask):
        return self.transformer.decoder(self.pos_encoder(self.trg_embedding(trg) * math.sqrt(self.d_model)), memory, trg_mask)
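To see what the masking method produces, the snippet below rebuilds the same construction as a standalone function (`causal_mask` is just an illustrative name) and inspects it for a length of 3.

```python
import torch

def causal_mask(sz):
    """Same construction as generate_square_subsequent_mask in the class above."""
    # True on and below the diagonal: position i may attend to positions 0..i
    allowed = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    return (allowed.float()
            .masked_fill(allowed == 0, float('-inf'))
            .masked_fill(allowed == 1, 0.0))

mask = causal_mask(3)
print(mask)
```

Row `i` describes the query position `i`: entries of `0.0` leave the attention score untouched, while `-inf` entries (the future positions) receive zero weight after the softmax. The mask has to be as long as the sequence actually fed to the decoder.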

### 4. Training Loop

In this step, we will craft a typical deep learning training loop. We will:

- Define our loss function and optimizer
- Integrate gradient descent optimization, backpropagation, and loss computation
- Train the transformer model and monitor the training and validation loss

In [None]:
import torch.optim as optim

def train_epoch(model, train_iterator, optimizer, criterion, device):
    model.train()
    epoch_loss = 0
    
    for batch in train_iterator:
        src, trg = batch.src.to(device), batch.trg.to(device)
        optimizer.zero_grad()
        
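        # The decoder input drops the last token; the model predicts tokens 1..T
        trg_input = trg[:-1]

        # The encoder may attend to the whole source sentence, so it gets no mask.
        # The causal mask must match the decoder input length, not the full target.
        src_mask = None
        trg_mask = model.generate_square_subsequent_mask(trg_input.size(0)).to(device)

        output = model(src, trg_input, src_mask, trg_mask)
        output = output.reshape(-1, output.shape[-1])
        trg = trg[1:].reshape(-1)
        loss = criterion(output, trg)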
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    
    return epoch_loss / len(train_iterator)

def evaluate(model, valid_iterator, criterion, device):
    model.eval()
    epoch_loss = 0
    
    with torch.no_grad():
        for batch in valid_iterator:
            src, trg = batch.src.to(device), batch.trg.to(device)
            
            # Same shapes as in training: no source mask, causal mask sized to the decoder input
            trg_input = trg[:-1]
            src_mask = None
            trg_mask = model.generate_square_subsequent_mask(trg_input.size(0)).to(device)

            output = model(src, trg_input, src_mask, trg_mask)
            output = output.reshape(-1, output.shape[-1])
            trg = trg[1:].reshape(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    
    return epoch_loss / len(valid_iterator)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Transformer(len(SRC.vocab), len(TRG.vocab), d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.0001)
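# The loss is computed over target tokens, so ignore the target vocabulary's padding index
criterion = nn.CrossEntropyLoss(ignore_index=TRG.vocab.stoi[TRG.pad_token])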

num_epochs = 10
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_iterator, optimizer, criterion, device)
    val_loss = evaluate(model, valid_iterator, criterion, device)
    print(f'Epoch {epoch+1}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
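When monitoring the losses, it can help to record the validation loss per epoch and summarize it afterwards. The sketch below is one possible way to do that: `summarize_history` and the `patience` threshold are assumptions, and the loss values are placeholders, not results from this model. Perplexity is reported as the exponential of the cross-entropy loss, which is only approximate here because the epoch loss is an average of batch averages.

```python
import math

def summarize_history(val_losses, patience=3):
    """Return the best epoch, its perplexity, and whether training has stalled."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    epochs_since_best = len(val_losses) - 1 - best_index
    return {
        'best_epoch': best_index + 1,                      # 1-based, like the printed log
        'perplexity': math.exp(val_losses[best_index]),    # exp of the cross-entropy
        'should_stop': epochs_since_best >= patience,      # no improvement for `patience` epochs
    }

# Placeholder validation losses, for illustration only
history = [4.2, 3.6, 3.3, 3.4, 3.5, 3.6]
summary = summarize_history(history)
print(summary)
```

A validation loss that rises while the training loss keeps falling is the usual sign of overfitting, and the best epoch is the natural checkpoint to evaluate on the test set.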


### 5. Evaluation

Now that we have trained our model, we have to evaluate it! We will evaluate the model on the test set using industry-standard metrics such as BLEU score, analyze the translation quality, and discuss common translation errors.
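As a rough picture of what BLEU measures, the sketch below computes a corpus-level score from clipped n-gram precisions and a brevity penalty. It is a simplified illustration that assumes one reference per hypothesis and no smoothing. `corpus_bleu_sketch` is a hypothetical name, and NLTK's implementation should be preferred for reported numbers.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu_sketch(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing."""
    matched = [0] * max_n
    total = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
            # Clip each n-gram count by how often it occurs in the reference
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives a score of 0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Penalize hypotheses that are shorter than their references
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

reference = ['the', 'cat', 'sat', 'on', 'the', 'mat']
hypothesis = ['the', 'cat', 'sat', 'on', 'a', 'mat']
score = corpus_bleu_sketch([reference], [hypothesis])
print(round(score, 4))
```

Note how a single wrong word lowers the score substantially, because it breaks every n-gram that overlaps it. This is one reason BLEU is more informative over a whole test set than for a single sentence.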

In [None]:
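from nltk.translate.bleu_score import corpus_bleu

def calculate_bleu(data, model, SRC, TRG, device):
    trgs = []       # one list of reference translations per example
    pred_trgs = []  # one predicted token list per example

    for datum in data:
        src = vars(datum)['src']
        trg = vars(datum)['trg']
        # translate_sentence is not defined in this cell; it must return the
        # predicted tokens plus a second value, which is ignored here
        pred_trg, _ = translate_sentence(src, SRC, TRG, model, device)
        pred_trgs.append(pred_trg)
        trgs.append([trg])

    # sentence_bleu scores a single sentence; corpus_bleu aggregates over the whole set
    return corpus_bleu(trgs, pred_trgs)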
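`calculate_bleu` relies on a `translate_sentence` helper that the notebook does not provide. One possible version, based on greedy decoding, is sketched below. It assumes the `encode`, `decode`, `fc_out`, and `generate_square_subsequent_mask` members of the `Transformer` class above, and fields that expose `vocab.stoi`, `vocab.itos`, and the special-token attributes. The `max_len` limit and the `None` returned in place of attention weights are choices made for this sketch.

```python
import torch

def translate_sentence(tokens, src_field, trg_field, model, device, max_len=50):
    """Greedy decoding: repeatedly append the most probable next target token."""
    model.eval()
    # Lower-case to mirror lower=True on the fields, then wrap in <bos>/<eos>
    tokens = [src_field.init_token] + [t.lower() for t in tokens] + [src_field.eos_token]
    unk = src_field.vocab.stoi[src_field.unk_token]
    src_ids = [src_field.vocab.stoi.get(t, unk) for t in tokens]
    src = torch.tensor(src_ids, dtype=torch.long, device=device).unsqueeze(1)  # (src_len, 1)

    bos = trg_field.vocab.stoi[trg_field.init_token]
    eos = trg_field.vocab.stoi[trg_field.eos_token]
    trg_ids = [bos]
    with torch.no_grad():
        memory = model.encode(src, None)  # encode the source sentence once
        for _ in range(max_len):
            trg = torch.tensor(trg_ids, dtype=torch.long, device=device).unsqueeze(1)
            trg_mask = model.generate_square_subsequent_mask(trg.size(0)).to(device)
            out = model.decode(trg, memory, trg_mask)
            # Project only the last position onto the target vocabulary
            next_id = model.fc_out(out[-1]).argmax(dim=-1).item()
            if next_id == eos:
                break
            trg_ids.append(next_id)
    # Drop <bos>; the second value is a placeholder for attention weights
    return [trg_field.vocab.itos[i] for i in trg_ids[1:]], None
```

Greedy decoding is the simplest strategy and tends to be outperformed by beam search, which is a natural improvement to discuss in the report.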


### 6. Conclusion and Reflection
This is your chance to reflect on what has been learned and discuss potential real-world applications and further improvements.

Please write a brief report discussing the experience. Include any challenges faced, a summary of the process you went through (point form is fine, keep it concise), and potential uses of the learned techniques in real-world applications.