# Neural Machine Translation with Sequence Models

This notebook implements and compares several sequence-to-sequence models for German-to-English neural machine translation using PyTorch and TorchText (the source field is German, the target field is English). It includes classic RNN-based architectures (LSTM with and without attention) as well as a Transformer model.

The primary goals of this project are:
- To explore the effectiveness of various encoder-decoder architectures for translation tasks
- To assess the impact of attention mechanisms and transformer-based modeling
- To evaluate translation quality using PPL (Perplexity) and BLEU scores

Models Trained:
- LSTM-based Seq2Seq
- LSTM with Attention
- Transformer (full architecture)

The models are trained and evaluated on the Multi30k English-German translation corpus. Translation quality is measured with BLEU, and translation examples are printed for qualitative analysis.

The notebook also covers data preprocessing, tokenization with spaCy, and custom training loops.

In [None]:
# Setup: Mount Google Drive to access data for NMT model training
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
# Set the directory path where the NMT project notebook is located.
# For example, if the notebook is saved in '/gdrive/MyDrive/NMTproject', set:
root = '/gdrive/MyDrive/NMTproject'

In [None]:
!pip install torchtext==0.6.0
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

Collecting torchtext==0.6.0
  Downloading torchtext-0.6.0-py3-none-any.whl (64 kB)
Collecting sentencepiece (from torchtext==0.6.0)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece, torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.16.0
    Uninstalling torchtext-0.16.0:
      Successfully uninstalled torchtext-0.16.0
Successfully installed sentencepiece-0.1.99 torchtext-0.6.0

In [None]:
# Import necessary libraries for Neural Machine Translation (NMT)
# PyTorch for model building and training
import torch
import torch.nn as nn
from torch.nn.parameter import Parameter
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import SGD
from torch.utils.data import DataLoader

# TorchText for dataset handling, tokenization, and language processing
import torchtext
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
from torchtext.data.utils import get_tokenizer
from torchtext.data.metrics import bleu_score

# SpaCy for additional tokenization and preprocessing
import spacy
from spacy.symbols import ORTH

# Other utilities for general project setup
import os
import numpy as np
import time
import random
import math
from pathlib import Path
import tqdm.notebook as tq
import copy

In [None]:
# Set up basic configurations and hyperparameters for NMT models
# Fix random seed for reproducibility
torch.manual_seed(470)
torch.cuda.manual_seed(470)

from easydict import EasyDict as edict

# Model training hyperparameters
args = edict()
args.batch_size = 32
args.nlayers = 2
args.ninp = 256
args.nhid = 256
args.clip = 1
args.lr_lstm = 0.001
args.dropout = 0.2
args.nhid_attn = 256
args.epochs = 20

# Transformer-specific parameters
args.nhid_tran = 256
args.nhead = 8
args.nlayers_transformer = 6
args.attn_pdrop = 0.1
args.resid_pdrop = 0.1
args.embd_pdrop = 0.1
args.nff = 4 * args.nhid_tran

args.lr_transformer = 0.0001  # Learning rate for Transformer model
args.betas = (0.9, 0.98)

args.gpu = True
device = 'cuda:0' if torch.cuda.is_available() and args.gpu else 'cpu'

# Create a directory to save results
result_dir = Path(root) / 'results'
result_dir.mkdir(parents=True, exist_ok=True)


In [None]:
# Converts a sequence of word ids to a sentence (denumericalization)
def word_ids_to_sentence(id_tensor, vocab, join=' '):
    if isinstance(id_tensor, torch.LongTensor):
        ids = id_tensor.transpose(0, 1).contiguous().view(-1)
    elif isinstance(id_tensor, np.ndarray):
        ids = id_tensor.transpose().reshape(-1)
    batch = [vocab.itos[ind] for ind in ids] # denumericalize
    if join is None:
        return batch
    else:
        return join.join(batch)

# Extracts bias and non-bias parameters from a model for optimization
def get_parameters(model, bias=False):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            if bias:
                yield m.bias
            else:
                yield m.weight
        else:
            if not bias:
                yield m.parameters()

# Runs a single training or evaluation epoch
# Computes loss, accuracy, and updates model parameters during training
def run_epoch(epoch, model, optimizer, is_train=True, data_iter=None):
    total_loss = 0
    n_correct = 0
    n_total = 0
    if data_iter is None:
        data_iter = train_iter if is_train else valid_iter
    if is_train:
        model.train()
    else:
        model.eval()
    for batch in data_iter:
        x, y, length = sort_batch(batch.src.to(device), batch.trg.to(device))
        target = y[1:]
        if isinstance(model, Transformer):
            x, y = x.transpose(0, 1), y.transpose(0, 1)
            target = target.transpose(0, 1) #y[:, 1:]
        pred = model(x, y, length)
        loss = criterion(pred.reshape(-1, trg_ntoken), target.reshape(-1))
        n_targets = (target != pad_id).long().sum().item()
        n_total += n_targets
        target = target.unsqueeze(0)
        n_correct += (pred.argmax(-1) == target)[target != pad_id].long().sum().item()
        if is_train:
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
            optimizer.step()

        total_loss += loss.item() * n_targets
    total_loss /= n_total
    print("Epoch", epoch, 'Train' if is_train else 'Valid',
          "Loss", np.mean(total_loss),
          "Acc", n_correct / n_total,
          "PPL", np.exp(total_loss))
    return total_loss

# Converts word ids to sentence, stops at the EOS token
def word_ids_to_sentence_(ids, vocab):
    sentence = []
    for ind in ids:
        if ind == eos_id:
            break
        sentence.append(vocab.itos[ind])
    return sentence

# Runs the translation process and outputs translation examples
# Computes BLEU score for the translated outputs
def run_translation(model, data_iter, max_len=100, mode='best'):
    with torch.no_grad():
        model.eval()
        load_model(model, mode)
        src_list = []
        gt_list = []
        pred_list = []
        for batch in data_iter:
            x, y, length = sort_batch(batch.src.to(device), batch.trg.to(device))
            target = y[1:]
            if isinstance(model, Transformer):
                x, y = x.transpose(0, 1), y.transpose(0, 1)
                target = target.transpose(0, 1)
            pred = model(x, y, length, max_len=max_len, teacher_forcing=False)
            pred_token = pred.argmax(-1)
            if not isinstance(model, Transformer):
                pred_token = pred_token.transpose(0, 1).cpu().numpy()
                y = y.transpose(0, 1).cpu().numpy()
                x = x.transpose(0, 1).cpu().numpy()
            # pred_token : batch_size x max_len
            for x_, y_, pred_ in zip(x, y, pred_token):
                src_list.append(word_ids_to_sentence_(x_[1:], SRC.vocab))
                gt_list.append([word_ids_to_sentence_(y_[1:], TRG.vocab)])
                pred_list.append(word_ids_to_sentence_(pred_, TRG.vocab))

        for i in range(5):
            print(f"--------- Translation Example {i+1} ---------")
            print("SRC :", ' '.join(src_list[i]))
            print("TRG :", ' '.join(gt_list[i][0]))
            print("PRED:", ' '.join(pred_list[i]))
        print()
        print("BLEU:", bleu_score(pred_list, gt_list))

# Saves the model's state_dict to a checkpoint
def save_model(model, mode="last"):
    torch.save(model.state_dict(),  result_dir / f'{type(model).__name__}_{mode}.ckpt')

# Loads a model's state_dict from a checkpoint
def load_model(model, mode="last"):
    if os.path.exists(result_dir / f'{type(model).__name__}_{mode}.ckpt'):
        model.load_state_dict(torch.load(result_dir / f'{type(model).__name__}_{mode}.ckpt'))

# Sorts the batch by sequence length (for padding)
def sort_batch(X, y, lengths=None):
    if lengths is None:
        lengths = (X != pad_id_src).long().sum(0)
    lengths, indx = lengths.sort(dim=0, descending=True)
    X = torch.index_select(X, 1, indx)
    y = torch.index_select(y, 1, indx)
    return X, y, lengths

# Initializes weights for a model's parameters using uniform distribution
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

In [None]:
# Manually fix Multi30K download link, since the original server is down. (https://github.com/pytorch/text/issues/1756)
Multi30k.urls = [
    "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz",
    "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz",
    "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz",
]

In [None]:
SRC = Field(tokenize = "spacy",
            tokenizer_language="de_core_news_sm",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

TRG = Field(tokenize = "spacy",
            tokenizer_language="en_core_web_sm",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG), test='test')
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

src_ntoken = len(SRC.vocab.stoi)
trg_ntoken = len(TRG.vocab.stoi)

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = args.batch_size,
    device = device)

downloading training.tar.gz


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:00<00:00, 9.63MB/s]


downloading validation.tar.gz


validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 2.78MB/s]


downloading mmt16_task1_test.tar.gz


mmt16_task1_test.tar.gz: 100%|██████████| 67.1k/67.1k [00:00<00:00, 3.35MB/s]


In [None]:
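pad_id_trg = TRG.vocab.stoi[TRG.pad_token]
pad_id_src = SRC.vocab.stoi[SRC.pad_token]
# A single pad_id is shared by the source and target code paths below,
# so both vocabularies must map <pad> to the same index.
assert pad_id_src == pad_id_trg
pad_id = pad_id_src
eos_id = TRG.vocab.stoi[TRG.eos_token]
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)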

for batch in train_iter:
    src, trg, length_src = sort_batch(batch.src, batch.trg)
    print(length_src)
    print(src, src.shape)
    print(trg, trg.shape)
    break

print("##### EXAMPLE #####")
print("SRC: ", word_ids_to_sentence(src[:, 1:2].long().cpu(), SRC.vocab))
print("TRG: ", word_ids_to_sentence(trg[:, 1:2].long().cpu(), TRG.vocab))

print("SRC vocab size", len(SRC.vocab.stoi))
print("TRG vocab size", len(TRG.vocab.stoi))
print("Vocab", list(SRC.vocab.stoi.items())[:10])

tensor([31, 22, 22, 21, 21, 17, 17, 17, 16, 15, 15, 15, 14, 13, 13, 13, 13, 12,
        12, 12, 12, 11, 11, 11, 11, 11, 10, 10, 10,  9,  9,  8],
       device='cuda:0')
tensor([[   2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2],
        [ 766,    5,    8,    5,   27,    8,   18,    5,    5,    5,    5,    5,
            5,    5,    5,    5,    5,    5,    5,    8,    5,   18,    5,   43,
            5,   39,    5,   18,    7,   18,    8,   18],
        [ 800,   13,   22,   13,    6,  274,   30,   13,   13,   96,   96,   13,
           49,   49,   70,   13,  271,   25,  272,  168,    0,  890,   13,  103,
          130,   25,   13,  103, 6458,   25,   67,   73],
        [  41,    7, 5475,  159,  574,   16,    9,   29,    7,   13,   13,    7,
            9,   11,   26,    7,  229,   12,   13,  113,   13,   45,   69,   80,
         

### LSTMCell

In [88]:
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.linear_input = nn.Linear(input_size, 4 * hidden_size)
        self.linear_hidden = nn.Linear(hidden_size, 4 * hidden_size)

    def forward(self, x, state):
        hx, cx = state
        # Sum the input and hidden projections, then split into the four gates.
        gates = self.linear_input(x) + self.linear_hidden(hx)
        forget_gate, in_gate, cell_gate, out_gate = torch.chunk(gates, chunks=4, dim=1)

        fx = torch.sigmoid(forget_gate)
        ix = torch.sigmoid(in_gate)
        gx = torch.tanh(cell_gate)  # candidate cell values
        ox = torch.sigmoid(out_gate)

        cy = fx * cx + ix * gx      # new cell state
        hy = ox * torch.tanh(cy)    # new hidden state

        # Return the step output and the updated (hidden, cell) state.
        return hy, (hy, cy)
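
The gate arithmetic in `LSTMCell.forward` can be checked by hand on a single hidden unit. The following is a minimal, framework-free sketch: the function name `lstm_step_scalar` and the pre-activation values are made up for illustration and are not part of the notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step_scalar(pre_f, pre_i, pre_g, pre_o, c_prev):
    """One LSTM step for a single hidden unit, given the four gate pre-activations.

    In LSTMCell these pre-activations are the four chunks of
    linear_input(x) + linear_hidden(h_prev), ordered forget, input, cell, output.
    """
    f = sigmoid(pre_f)        # forget gate
    i = sigmoid(pre_i)        # input gate
    g = math.tanh(pre_g)      # candidate cell value
    o = sigmoid(pre_o)        # output gate
    c = f * c_prev + i * g    # new cell state
    h = o * math.tanh(c)      # new hidden state
    return h, c

# With all pre-activations at 0, every sigmoid gate is 0.5 and the candidate is 0,
# so the cell state is simply halved: c = 0.5 * 2.0 = 1.0.
h, c = lstm_step_scalar(0.0, 0.0, 0.0, 0.0, c_prev=2.0)
print(h, c)
```

The hidden state is the output gate applied to `tanh(c)`, which is why `h` stays within (-1, 1) even when the cell state grows.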

### LSTM

In [94]:
class LSTMLayer(nn.Module):
    def __init__(self, *cell_args):
        super(LSTMLayer, self).__init__()
        self.cell = LSTMCell(*cell_args)

    def forward(self, x, state, length_x=None):
        inputs = x.unbind(0)
        # Ensure the sequence lengths are sorted in descending order, if provided
        assert (length_x is None) or torch.all(length_x == length_x.sort(descending=True)[0])
        
        outputs = []
        out_hidden_state = []
        out_cell_state = []
        for i in range(len(inputs)):
            # Process each time step using LSTMCell
            out, state = self.cell(inputs[i], state)
            outputs += [out]
            
            if length_x is not None:
                if torch.any(i + 1 == length_x):
                    out_hidden_state = [state[0][i + 1 == length_x]] + out_hidden_state
                    out_cell_state = [state[1][i + 1 == length_x]] + out_cell_state
        
        # If sequence lengths are provided, concatenate the hidden and cell states
        if length_x is not None:
            state = (torch.cat(out_hidden_state, dim=0), torch.cat(out_cell_state, dim=0))
        
        return torch.stack(outputs), state

class LSTM(nn.Module):
    def __init__(self, ninp, nhid, num_layers, dropout):
        super(LSTM, self).__init__()
        self.layers = []
        self.dropout = nn.Dropout(dropout)
        
        # Stack multiple LSTM layers
        for i in range(num_layers):
            if i == 0:
                self.layers.append(LSTMLayer(ninp, nhid))
            else:
                self.layers.append(LSTMLayer(nhid, nhid))
        self.layers = nn.ModuleList(self.layers)

    def forward(self, x, states, length_x=None):
        output_states = []
        length = len(states)
        
        # Pass the input through each LSTM layer
        for i in range(length):
            x, output_state = self.layers[i](x, states[i], length_x=length_x)
            if i != length - 1:
                x = self.dropout(x)  # Apply dropout after each layer except the last
            output_states.append(output_state)
        
        return x, output_states
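
`LSTMLayer.forward` keeps, for each sequence in a padded batch, the state reached at that sequence's last real token. The sketch below reproduces only that bookkeeping with plain lists. It assumes lengths sorted in descending order, as `sort_batch` provides; the helper name `gather_final_states` and the string labels are hypothetical.

```python
def gather_final_states(states_per_step, lengths):
    """states_per_step[t][b] is the state of sequence b after time step t (0-based).

    Returns, for every sequence b, the state after its last real token.
    """
    assert lengths == sorted(lengths, reverse=True)
    collected = []
    for t, step_states in enumerate(states_per_step):
        finished = [s for s, n in zip(step_states, lengths) if n == t + 1]
        # Shorter sequences finish first but sit at the end of the batch,
        # so each newly finished group is prepended to keep batch order.
        collected = finished + collected
    return collected

# Toy states labelled "b{batch index}t{time step}" for lengths 3, 2, 2.
lengths = [3, 2, 2]
states = [[f"b{b}t{t}" for b in range(3)] for t in range(3)]
final = gather_final_states(states, lengths)
print(final)  # ['b0t2', 'b1t1', 'b2t1']
```

Prepending works only because the batch is sorted by length; with unsorted lengths the collected states would no longer line up with the batch order.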

### LSTMEncoder


In [90]:
class LSTMEncoder(nn.Module):
    def __init__(self):
        super(LSTMEncoder, self).__init__()
        ninp = args.ninp
        nhid = args.nhid
        nlayers = args.nlayers
        dropout = args.dropout
        self.embed = nn.Embedding(src_ntoken, ninp, padding_idx=pad_id)
        self.dropout = nn.Dropout(dropout)
        self.lstm = LSTM(ninp, nhid, nlayers, dropout)

    def forward(self, x, states, length_x=None):
        embedded = self.embed(x)
        embedded = self.dropout(embedded)
        output, context_vector = self.lstm(embedded, states, length_x)
        return output, context_vector

### LSTMDecoder

In [91]:
class LSTMDecoder(nn.Module):
    def __init__(self):
        super(LSTMDecoder, self).__init__()
        self.embed = nn.Embedding(trg_ntoken, args.ninp, padding_idx=pad_id)
        self.lstm = LSTM(args.ninp, args.nhid, args.nlayers, args.dropout)
        self.fc_out = nn.Linear(args.nhid, trg_ntoken)
        self.dropout = nn.Dropout(args.dropout)
        self.fc_out.weight = self.embed.weight

    def forward(self, x, states):
        embedded = self.embed(x)
        embedded = self.dropout(embedded)
        output, output_states = self.lstm(embedded, states)
        output = self.fc_out(output)
        return output, output_states

### LSTMSeq2Seq

In [97]:
class LSTMSeq2Seq(nn.Module):
    def __init__(self):
        super(LSTMSeq2Seq, self).__init__()
        self.encoder = LSTMEncoder()
        self.decoder = LSTMDecoder()

    def _get_init_states(self, x):
        init_states = [
            (torch.zeros((x.size(1), args.nhid)).to(x.device),
            torch.zeros((x.size(1), args.nhid)).to(x.device))
            for _ in range(args.nlayers)
        ]
        return init_states

    def forward(self, x, y, length, max_len=None, teacher_forcing=True):
        init_states = self._get_init_states(x)
        output, output_states = self.encoder(x, init_states, length)
        decoder_input = y[0:1]

        decoder_states = output_states
        decoder_outputs = []
        
        if max_len is None:
            trg_len = y.size(0)
        else:
            trg_len = max_len

        output, decoder_states = self.decoder(decoder_input, decoder_states)
        decoder_outputs.append(output)
        for i in range(1, trg_len-1):
            if teacher_forcing:
                decoder_input = y[i:i+1]
            else:
                decoder_input = output.argmax(dim=-1)
            output, decoder_states = self.decoder(decoder_input, decoder_states)
            decoder_outputs.append(output)
        decoder_outputs = torch.cat(decoder_outputs)
        return decoder_outputs
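
The difference between teacher forcing and greedy decoding in `LSTMSeq2Seq.forward` is only in what is fed to the decoder at each step. A toy sketch of the same loop follows, where `decode` and `toy_step` are hypothetical stand-ins (the real decoder returns logits and carries LSTM states):

```python
def decode(step_fn, y, max_len=None, teacher_forcing=True):
    """Mirror of the decoding loop: y[0] is <sos>, and trg_len - 1 tokens are produced."""
    trg_len = len(y) if max_len is None else max_len
    token = y[0]
    outputs = []
    for i in range(trg_len - 1):
        if i > 0:
            # Teacher forcing feeds the reference token; otherwise the
            # model's own previous prediction is fed back in.
            token = y[i] if teacher_forcing else outputs[-1]
        outputs.append(step_fn(token))
    return outputs

def toy_step(token):
    """Hypothetical 'model': predicts token + 1, except that it errs after token 2."""
    return 0 if token == 2 else token + 1

gold = [0, 1, 2, 3, 4]  # 0 plays the role of <sos>
forced = decode(toy_step, gold)
free = decode(toy_step, gold, max_len=5, teacher_forcing=False)
print(forced)  # [1, 2, 0, 4]
print(free)    # [1, 2, 0, 1]
```

With teacher forcing the single mistake stays local, while in free-running mode the wrong token is fed back and the rest of the output drifts. This is one reason why training loss under teacher forcing can look better than translation quality at inference time.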

### Attention


In [None]:
class Attention(nn.Module):
    def __init__(self):
        super().__init__()

        self.nhid_enc = args.nhid
        self.nhid_dec = args.nhid
        self.W1 = nn.Linear(self.nhid_enc, args.nhid_attn)
        self.W2 = nn.Linear(self.nhid_dec, args.nhid_attn)
        self.W3 = nn.Linear(args.nhid_attn, 1)

    def forward(self, x, enc_o, dec_h, length_enc=None):
        enc_w = self.W1(enc_o)
        dec_h = dec_h.unsqueeze(dim=0)
        dec_w = self.W2(dec_h)

        combined = torch.tanh(enc_w + dec_w)
        attn_scores = self.W3(combined)

        if length_enc is not None:
            attn_mask = torch.arange(attn_scores.size(0), device=attn_scores.device).unsqueeze(1) >= length_enc.unsqueeze(0)
            attn_scores = attn_scores.masked_fill(attn_mask.unsqueeze(-1), float('-inf'))

        attn_weights = F.softmax(attn_scores, dim=0)
        attn_applied = torch.sum(enc_o * attn_weights, dim=0, keepdim=True)
        output = torch.cat((x, attn_applied), dim=-1)
        return output
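
The length mask in `Attention.forward` sets the scores of padded encoder positions to negative infinity so that they receive zero weight after the softmax. A plain-Python sketch for a single sentence is given below; `masked_softmax` and `attend` are hypothetical helper names, and the scores are supplied directly instead of being computed by `W1`, `W2` and `W3`.

```python
import math

def masked_softmax(scores, length):
    """Softmax over the first `length` scores; padded positions get weight 0.

    Assumes length >= 1, otherwise every score would be masked.
    """
    masked = [s if t < length else float("-inf") for t, s in enumerate(scores)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attend(enc_outputs, scores, length):
    """Weighted sum of encoder outputs using the masked attention weights."""
    weights = masked_softmax(scores, length)
    dim = len(enc_outputs[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, enc_outputs)) for d in range(dim)]
    return weights, context

enc_outputs = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # the last row is padding
weights, context = attend(enc_outputs, scores=[0.0, 0.0, 5.0], length=2)
print(weights)  # [0.5, 0.5, 0.0]
print(context)  # [0.5, 0.5]
```

Even though the padded position has the highest raw score, it contributes nothing to the context vector.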

### LSTMAttnDecoder

In [None]:
class LSTMAttnDecoder(nn.Module):
    def __init__(self):
        super(LSTMAttnDecoder, self).__init__()
        self.embed = nn.Embedding(trg_ntoken, args.ninp, padding_idx=pad_id)
        self.lstm = LSTM(args.ninp + args.nhid, args.nhid, args.nlayers, args.dropout)
        self.fc_out = nn.Linear(args.nhid, trg_ntoken)
        self.dropout = nn.Dropout(args.dropout)
        self.attn = Attention()
        self.fc_out.weight = self.embed.weight

    def forward(self, x, enc_o, states, length_enc=None):
        embedded_x = self.embed(x)
        dropout_x = self.dropout(embedded_x)
        attn_x = self.attn(dropout_x, enc_o, states[-1][0], length_enc)
        last_x, output_states = self.lstm(attn_x, states)
        fc_out_x = self.fc_out(last_x)
        return fc_out_x, output_states

### LSTMAttnSeq2Seq

In [None]:
class LSTMAttnSeq2Seq(nn.Module):
    def __init__(self):
        super(LSTMAttnSeq2Seq, self).__init__()
        self.encoder = LSTMEncoder()
        self.decoder = LSTMAttnDecoder()

    def _get_init_states(self, x):
        init_states = [
            (torch.zeros((x.size(1), args.nhid)).to(x.device),
            torch.zeros((x.size(1), args.nhid)).to(x.device))
            for _ in range(args.nlayers)
        ]
        return init_states

    def forward(self, x, y, length, max_len=None, teacher_forcing=True):
        init_states = self._get_init_states(x)
        enc_output, enc_states = self.encoder(x, init_states, length)
        decoder_input = y[0:1]
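        # The decoder starts from zero states; the encoder's final states
        # (enc_states) are not passed on, so source information reaches the
        # decoder only through attention over enc_output.
        decoder_states = init_states
        decoder_outputs = []

        if max_len is None:
            trg_len = y.size(0)
        else:
            trg_len = max_len

        for i in range(trg_len-1):
            if teacher_forcing or i == 0:
                decoder_input = y[i:i+1]
            else:
                decoder_input = decoder_output.argmax(-1)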
            decoder_output, decoder_states = self.decoder(decoder_input, enc_output, decoder_states, length)
            decoder_outputs.append(decoder_output)
        decoder_outputs = torch.cat(decoder_outputs)
        return decoder_outputs


### MaskedMultiheadAttention

In [75]:
MAX_LEN = 100

class MaskedMultiheadAttention(nn.Module):
    """
    Vanilla multi-head attention with an optional causal mask.
    It includes key, query, value projections, dropout, and output projection.
    """
    def __init__(self, mask=False):
        super(MaskedMultiheadAttention, self).__init__()
        assert args.nhid_tran % args.nhead == 0  # Ensure hidden dimension is divisible by number of heads
        
        # Key, query, value projections for all heads
        self.key = nn.Linear(args.nhid_tran, args.nhid_tran)
        self.query = nn.Linear(args.nhid_tran, args.nhid_tran)
        self.value = nn.Linear(args.nhid_tran, args.nhid_tran)

        self.attn_drop = nn.Dropout(args.attn_pdrop)
        self.proj = nn.Linear(args.nhid_tran, args.nhid_tran)
        
        if mask:
            self.register_buffer("mask", torch.tril(torch.ones(MAX_LEN, MAX_LEN)))

        self.nhead = args.nhead
        self.d_k = args.nhid_tran // args.nhead

    def forward(self, q, k, v, mask=None):
        nhead = self.nhead
        d_k = self.d_k

        # Project the input queries, keys, and values
        linear_q = self.query(q)
        B, T_q, hidden_size = linear_q.size()
        reshaped_q = linear_q.view(B, T_q, nhead, d_k)
        transposed_q = reshaped_q.permute(0, 2, 1, 3)

        linear_k = self.key(k)
        B, T, hidden_size = linear_k.size()
        reshaped_k = linear_k.view(B, T, nhead, d_k)
        transposed_k = reshaped_k.permute(0, 2, 1, 3)

        linear_v = self.value(v)
        B, T, hidden_size = linear_v.size()
        reshaped_v = linear_v.view(B, T, nhead, d_k)
        transposed_v = reshaped_v.permute(0, 2, 1, 3)

        att_scores = torch.matmul(transposed_q, transposed_k.transpose(-2, -1))
        scaled_att_scores = att_scores / math.sqrt(d_k)

        if mask is not None:
            scaled_att_scores = scaled_att_scores.masked_fill(mask.unsqueeze(1).unsqueeze(2) == 0, float('-inf'))

        if hasattr(self, 'mask'):
            T_q, T = q.size(1), k.size(1)
            scaled_att_scores = scaled_att_scores.masked_fill(self.mask[:T_q, :T].unsqueeze(0).unsqueeze(0) == 0, float('-inf'))

        # Compute attention weights and apply dropout
        att_weights = F.softmax(scaled_att_scores, dim=-1)
        att_dropout = self.attn_drop(att_weights)
        context = torch.matmul(att_dropout, transposed_v)  # B x nhead x T_q x d_k

        # Merge the heads back into one hidden dimension, then project
        context = context.permute(0, 2, 1, 3)
        B, T_q, nhead, d_k = context.size()
        merged = context.reshape(B, T_q, nhead * d_k)
        y = self.proj(merged)
        return y
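
The causal mask registered with `torch.tril` lets position i attend only to positions j <= i. The sketch below computes scaled dot-product attention weights for one head and one sequence in plain Python. `causal_attention_weights` is a hypothetical name, and the query/key vectors are arbitrary toy values.

```python
import math

def causal_attention_weights(q, k):
    """softmax(q k^T / sqrt(d_k)) with a lower-triangular (causal) mask.

    q, k: lists of T vectors of dimension d_k.
    """
    d_k = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        # Only keys at positions 0..i are visible to query position i.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
                  for k_j in k[: i + 1]]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(k) - i - 1))
    return weights

q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = causal_attention_weights(q, k)
for row in w:
    print([round(x, 3) for x in row])
```

The first row is always `[1.0, 0.0, 0.0]`, since the first position can only attend to itself.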


### TransformerEncLayer

In [76]:
class TransformerEncLayer(nn.Module):
    def __init__(self):
        super(TransformerEncLayer, self).__init__()
        self.ln1 = nn.LayerNorm(args.nhid_tran)
        self.ln2 = nn.LayerNorm(args.nhid_tran)
        self.attn = MaskedMultiheadAttention()
        self.dropout1 = nn.Dropout(args.resid_pdrop)
        self.dropout2 = nn.Dropout(args.resid_pdrop)
        self.ff = nn.Sequential(
            nn.Linear(args.nhid_tran, args.nff),
            nn.ReLU(),
            nn.Linear(args.nff, args.nhid_tran)
        )

    def forward(self, x, mask=None):
        norm_x = self.ln1(x)
        att = self.attn(norm_x, norm_x, norm_x, mask)
        drop_att = self.dropout1(att)
        res_con = norm_x + drop_att
        norm_2 = self.ln2(res_con)
        forward = self.ff(norm_2)
        drop_2 = self.dropout2(forward)
        outputs = norm_2 + drop_2
        return outputs

### TransformerDecLayer



In [77]:
class TransformerDecLayer(nn.Module):
    def __init__(self):
        super(TransformerDecLayer, self).__init__()
        self.ln1 = nn.LayerNorm(args.nhid_tran)
        self.ln2 = nn.LayerNorm(args.nhid_tran)
        self.ln3 = nn.LayerNorm(args.nhid_tran)
        self.dropout1 = nn.Dropout(args.resid_pdrop)
        self.dropout2 = nn.Dropout(args.resid_pdrop)
        self.dropout3 = nn.Dropout(args.resid_pdrop)
        self.attn1 = MaskedMultiheadAttention(mask=True) # self-attention
        self.attn2 = MaskedMultiheadAttention() # tgt to src attention
        self.ff = nn.Sequential(
            nn.Linear(args.nhid_tran, args.nff),
            nn.ReLU(),
            nn.Linear(args.nff, args.nhid_tran)
        )

    def forward(self, x, enc_o, enc_mask=None):
        norm_x = self.ln1(x)
        att = self.attn1(norm_x, norm_x, norm_x)
        drop_att = self.dropout1(att)
        res_con = norm_x + drop_att
        norm_2 = self.ln2(res_con)
        att2 = self.attn2(norm_2, enc_o, enc_o, mask=enc_mask)
        drop_att_2 = self.dropout2(att2)
        res_con_2 = norm_2 + drop_att_2
        norm_3 = self.ln3(res_con_2)
        forward = self.ff(norm_3)
        drop_att_3 = self.dropout3(forward)
        outputs = drop_att_3 + norm_3
        return outputs

### TransformerEncoder

In [78]:
class PositionalEncoding(nn.Module):
    def __init__(self, max_len=4096):
        super().__init__()
        dim = args.nhid_tran
        pos = np.arange(0, max_len)[:, None]
        i = np.arange(0, dim // 2)
        denom = 10000 ** (2 * i / dim)

        pe = np.zeros([max_len, dim])
        pe[:, 0::2] = np.sin(pos / denom)
        pe[:, 1::2] = np.cos(pos / denom)
        pe = torch.from_numpy(pe).float()

        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:x.shape[1]]

class TransformerEncoder(nn.Module):
    def __init__(self):
        super(TransformerEncoder, self).__init__()
        # input embedding stem
        self.tok_emb = nn.Embedding(src_ntoken, args.nhid_tran)
        self.pos_enc = PositionalEncoding()
        self.dropout = nn.Dropout(args.embd_pdrop)
        # transformer
        self.transform = nn.ModuleList([TransformerEncLayer() for _ in range(args.nlayers_transformer)])
        # decoder head
        self.ln_f = nn.LayerNorm(args.nhid_tran)


    def forward(self, x, mask):
        x_embd = self.tok_emb(x)
        x_pos = self.pos_enc(x_embd)
        x_drop = self.dropout(x_pos)
        for layer in self.transform:
            x_drop = layer(x_drop, mask)
        outputs = self.ln_f(x_drop)
        return outputs
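
The table built in `PositionalEncoding` interleaves sines and cosines whose wavelengths grow geometrically with the feature index. A standard-library sketch for a single position is shown here; the function name `positional_encoding` is hypothetical, and `dim` is assumed to be even, as `args.nhid_tran` is.

```python
import math

def positional_encoding(pos, dim):
    """Sinusoidal encoding for one position: even slots hold sin, odd slots hold cos."""
    pe = [0.0] * dim
    for i in range(dim // 2):
        angle = pos / (10000 ** (2 * i / dim))
        pe[2 * i] = math.sin(angle)
        pe[2 * i + 1] = math.cos(angle)
    return pe

pe0 = positional_encoding(0, 8)  # position 0: sin(0) = 0 and cos(0) = 1 alternate
pe1 = positional_encoding(1, 8)
print(pe0)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in pe1])
```

Because the encoding depends only on the position and the feature index, it is stored with `register_buffer` and not learned.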

### TransformerDecoder


In [79]:
class TransformerDecoder(nn.Module):
    def __init__(self):
        super(TransformerDecoder, self).__init__()
        self.tok_emb = nn.Embedding(trg_ntoken, args.nhid_tran)
        self.pos_enc = PositionalEncoding()
        self.dropout = nn.Dropout(args.embd_pdrop)
        self.transform = nn.ModuleList([TransformerDecLayer() for _ in range(args.nlayers_transformer)])
        self.ln_f = nn.LayerNorm(args.nhid_tran)
        self.lin_out = nn.Linear(args.nhid_tran, trg_ntoken)
        self.lin_out.weight = self.tok_emb.weight


    def forward(self, x, enc_o, enc_mask):
        x_embd = self.tok_emb(x)
        x_pos = self.pos_enc(x_embd)
        x_drop = self.dropout(x_pos)
        for layer in self.transform:
            x_drop = layer(x_drop, enc_o, enc_mask)
        result = self.ln_f(x_drop)
        outputs = self.lin_out(result)
        logits = outputs
        logits /= args.nhid_tran ** 0.5 # Scaling logits
        return logits

### Transformer

In [82]:
class Transformer(nn.Module):
    def __init__(self):
        super(Transformer, self).__init__()
        self.encoder = TransformerEncoder()
        self.decoder = TransformerDecoder()

    def forward(self, x, y, length_x, max_len=None, teacher_forcing=True):
        B, T = x.size()

        enc_mask = torch.arange(T, device=x.device).unsqueeze(0) < length_x.unsqueeze(1)
        enc_o = self.encoder(x, enc_mask)

        if teacher_forcing or self.training:
            # Feed the gold target prefix (all tokens except the last) in one pass
            outputs = self.decoder(y[:, :-1], enc_o, enc_mask)
            return outputs
        else:
            # Greedy decoding: start from <sos> and append the argmax token each step
            dec_input = y[:, :1]
            for i in range(max_len - 1):
                dec_output = self.decoder(dec_input, enc_o, enc_mask)
                dec_input = torch.cat((dec_input, dec_output[:, -1:].argmax(-1)), dim=1)
            return dec_output
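
The encoder padding mask built in `Transformer.forward` compares every position index against the sequence length of each batch element. A plain-Python sketch of the same comparison, where the helper name `padding_mask` is hypothetical:

```python
def padding_mask(lengths, max_len):
    """mask[b][t] is True for real tokens and False for padding."""
    return [[t < n for t in range(max_len)] for n in lengths]

mask = padding_mask([3, 1], max_len=3)
print(mask)  # [[True, True, True], [True, False, False]]
```

In the attention layers, positions where this mask is `False` are filled with negative infinity before the softmax, so padded source tokens receive zero attention weight.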

### Run Experiment

With the models implemented, you can run the experiments.
Execute all prior cells first so that the helper functions and data iterators are defined.
Training one model for 20 epochs is expected to take less than an hour.

In [None]:
def run_experiment(model):
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)

    optimizer = optim.Adam(model.parameters(), lr=args.lr_lstm if not isinstance(model, Transformer) else args.lr_transformer)

    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
            factor=0.25, patience=1, threshold=0.0001, threshold_mode='rel',
            cooldown=0, min_lr=0, eps=1e-08, verbose=False)

    best_val_loss = np.inf
    for epoch in tq.tqdm(range(args.epochs)):
        run_epoch(epoch, model, optimizer, is_train=True)
        with torch.no_grad():
            val_loss = run_epoch(epoch, model, None, is_train=False)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            save_model(model, 'best')
        save_model(model)
        scheduler.step(val_loss)
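
The effect of `ReduceLROnPlateau` with `factor=0.25` and `patience=1` can be traced by hand. The sketch below is a simplified re-implementation of the `'min'` / `'rel'` rule for illustration only. It ignores `cooldown`, `min_lr` and `eps`, and the validation losses are invented.

```python
def simulate_plateau_schedule(val_losses, lr, factor=0.25, patience=1, threshold=1e-4):
    """Return the learning rate in effect after each epoch's scheduler step.

    The rate is multiplied by `factor` once the validation loss has failed to
    improve for more than `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best * (1.0 - threshold):  # relative improvement in 'min' mode
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

history = simulate_plateau_schedule([3.0, 2.5, 2.6, 2.7, 2.4], lr=0.001)
print(history)
```

In this trace the loss fails to improve in epochs 2 and 3, so the learning rate drops from 0.001 to 0.00025 after the second bad epoch and stays there.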

### Evaluation Metrics
I used two metrics:

- PPL (Perplexity): The exponential of the average per-token cross-entropy loss. It can be read as the effective number of words the model is choosing between at each time step, so a lower perplexity indicates the model is more confident in its predictions.
  
- BLEU (Bilingual Evaluation Understudy) Score: A metric for evaluating machine translation output. It takes into account:
  - Precision: The fraction of n-grams in the predicted sentence that also appear in the reference sentence.
  - Clipping: Each predicted n-gram is counted at most as many times as it occurs in the reference, so repeating a correct word does not inflate the score.
  - Brevity penalty: Lowers the score when the predicted sentence is shorter than the reference.
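
The ingredients above can be computed by hand on small inputs. The following is a sketch of the individual pieces, not a reimplementation of `torchtext.data.metrics.bleu_score`, which combines clipped precisions for several n-gram orders over the whole corpus. The helper names are hypothetical.

```python
import math
from collections import Counter

def perplexity(token_losses):
    """exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(sum(token_losses) / len(token_losses))

def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision: each candidate n-gram counts at most
    as often as it occurs in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / max(sum(cand.values()), 1)

def brevity_penalty(candidate_len, reference_len):
    """1.0 unless the candidate is shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

# A mean validation loss of about 3.576 corresponds to a perplexity of about 35.7.
ppl = perplexity([3.576150340584837])
# Repeating "the" four times earns credit only once: 1 match out of 4 unigrams.
p1 = clipped_precision("the the the the".split(), "the cat sat".split())
bp = brevity_penalty(candidate_len=2, reference_len=4)
print(ppl, p1, bp)
```

Without clipping, the degenerate candidate `the the the the` would score a unigram precision of 1.0; with clipping it scores 0.25.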


In [96]:
lstm_model = LSTMSeq2Seq().to(device)
lstm_model.apply(init_weights)
run_experiment(lstm_model)
run_translation(lstm_model, test_iter, max_len=100)
print('')

  0%|          | 0/20 [00:00<?, ?it/s]

Epoch 0 Train Loss 4.312899874679481 Acc 0.30242674552164034 PPL 74.65667032732284
Epoch 0 Valid Loss 3.576150340584837 Acc 0.37091412742382274 PPL 35.73570541274967
Epoch 1 Train Loss 3.4388460914463868 Acc 0.3964197561035216 PPL 31.150992025328645
Epoch 1 Valid Loss 3.1774623445856935 Acc 0.4268005540166205 PPL 23.985808539143218
Epoch 2 Train Loss 3.0875502400502084 Acc 0.44260856814682664 PPL 21.92330530228815
Epoch 2 Valid Loss 2.914142611102714 Acc 0.4628116343490305 PPL 18.433001374893145
Epoch 3 Train Loss 2.824112078335317 Acc 0.47553214887949363 PPL 16.845980432407735
Epoch 3 Valid Loss 2.7203485062082717 Acc 0.49265927977839336 PPL 15.18561360348192
Epoch 4 Train Loss 2.594577103635457 Acc 0.5047044160414478 PPL 13.390923191078356
Epoch 4 Valid Loss 2.554895649260101 Acc 0.5165512465373961 PPL 12.869956597955268
Epoch 5 Train Loss 2.3991003472663612 Acc 0.5293628876561011 PPL 11.013263809481828
Epoch 5 Valid Loss 2.4421737390046636 Acc 0.532202216066482 PPL 11.49800726447968
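
The PPL column in the log above is consistent with perplexity being the exponential of the reported loss. A minimal check on two of the logged validation values follows; the helper name `perplexity` is introduced here for illustration and is not part of the notebook.

```python
import math


def perplexity(avg_loss):
    """Perplexity as the exponential of the average cross-entropy loss."""
    return math.exp(avg_loss)


# (loss, PPL) pairs copied from the validation lines of the log above.
logged = [
    (3.576150340584837, 35.73570541274967),   # Epoch 0
    (2.4421737390046636, 11.49800726447968),  # Epoch 5
]

for loss, logged_ppl in logged:
    # The recomputed and logged values should agree up to rounding.
    print(perplexity(loss), logged_ppl)
```

Because the mapping is monotonic, comparing models by validation loss or by perplexity gives the same ranking.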

In [None]:
attn_model = LSTMAttnSeq2Seq().to(device)
attn_model.apply(init_weights)
run_experiment(attn_model)
run_translation(attn_model, test_iter, max_len=100)
print('')

  0%|          | 0/20 [00:00<?, ?it/s]

Epoch 0 Train Loss 4.662672258106278 Acc 0.25771646423421884 PPL 105.91874654140688
Epoch 0 Valid Loss 3.863868825174765 Acc 0.3340027700831025 PPL 47.64934219984295
Epoch 1 Train Loss 3.677890702907214 Acc 0.36343507905862804 PPL 39.56285618290238
Epoch 1 Valid Loss 3.338131281180395 Acc 0.40110803324099725 PPL 28.166442333731464
Epoch 2 Train Loss 3.23854581367795 Acc 0.4239253158679342 PPL 25.496617942896453
Epoch 2 Valid Loss 3.0299028945926816 Acc 0.45013850415512463 PPL 20.695222873767253
Epoch 3 Train Loss 2.927964999325048 Acc 0.46587648769520273 PPL 18.689558507372624
Epoch 3 Valid Loss 2.760555970272529 Acc 0.4918282548476454 PPL 15.808629633591845
Epoch 4 Train Loss 2.6647227285311437 Acc 0.49885138933014006 PPL 14.36396627577079
Epoch 4 Valid Loss 2.5594666744863557 Acc 0.5216759002770083 PPL 12.928920153508582
Epoch 5 Train Loss 2.4345600269553325 Acc 0.5304088565214203 PPL 11.410797165604082
Epoch 5 Valid Loss 2.4011797842143974 Acc 0.5478531855955678 PPL 11.0361890246746

In [83]:
transformer_model = Transformer().to(device)
run_experiment(transformer_model)
run_translation(transformer_model, test_iter, max_len=100)
print('')

  0%|          | 0/20 [00:00<?, ?it/s]

Epoch 0 Train Loss 4.726734042237 Acc 0.39351401549402476 PPL 112.92614740246252
Epoch 0 Valid Loss 3.7225598553871513 Acc 0.49584487534626037 PPL 41.37016030549578
Epoch 1 Train Loss 3.5956779093438107 Acc 0.503682885701019 PPL 36.44039490067257
Epoch 1 Valid Loss 3.101942064557379 Acc 0.5505540166204986 PPL 22.241103024044055
Epoch 2 Train Loss 3.124621266065308 Acc 0.5444316821036682 PPL 22.751276781031496
Epoch 2 Valid Loss 2.7900860570804564 Acc 0.5767313019390582 PPL 16.28242095910831
Epoch 3 Train Loss 2.8313955043866 Acc 0.5703853955375253 PPL 16.96912479720312
Epoch 3 Valid Loss 2.5639919575561776 Acc 0.5979224376731302 PPL 12.987559757324153
Epoch 4 Train Loss 2.6134993440620873 Acc 0.591683569979716 PPL 13.64672196985518
Epoch 4 Valid Loss 2.4122349662546307 Acc 0.6137119113573407 PPL 11.15887300071354
Epoch 5 Train Loss 2.4462493563459318 Acc 0.6072875681223882 PPL 11.544964366868209
Epoch 5 Valid Loss 2.2914062272891442 Acc 0.6270083102493075 PPL 9.888833856319454
Epoch 6 

## Conclusion

This notebook presents a comparative study of different sequence models for Neural Machine Translation from English to German, evaluated using BLEU scores and qualitative examples.

**Final BLEU Scores:**
- LSTM Seq2Seq: **0.243**
- LSTM with Attention: **0.335**
- Transformer: **0.360**

Key takeaways:
- Adding attention to the LSTM raises BLEU from 0.243 to 0.335, a relative gain of about 38% over the baseline.
- The Transformer outperforms both LSTM-based models, as expected, reaching a BLEU of 0.360. This is consistent with its stronger handling of long-range dependencies.

The notebook provides a modular implementation pipeline that supports future experimentation (e.g., hyperparameter tuning, larger datasets, custom attention variants). Overall, the Transformer performs best, but the step-by-step progression through simpler architectures offers useful insight into how neural translation systems have evolved.
