<a href="https://colab.research.google.com/github/Shubhamd13/NLP/blob/main/Hw2_Transformer_Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [18]:
%matplotlib inline
# Put your student name here. The training loop prints it alongside each loss value;
# without the name, you will not get points for the training part.
STUDENT_NAME = "Shubham Derhgawen"

# Preliminary
Credit: https://nlp.seas.harvard.edu/annotated-transformer/

Downloads the required packages and imports the libraries.

In [16]:
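!pip uninstall -y torchaudio torchvision
!pip install --force-reinstall torchtext==0.16.2
# Quote the requirement so the shell does not treat ">=2.0.0" as an output redirect
!pip install "portalocker>=2.0.0"
# Pin NumPy below 2.x before invoking spaCy: the log below shows spaCy failing to import under NumPy 2.2.5
!pip install numpy==1.26
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm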


Collecting torchtext==0.16.2
  Using cached torchtext-0.16.2-cp311-cp311-manylinux1_x86_64.whl.metadata (7.5 kB)
Collecting tqdm (from torchtext==0.16.2)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting requests (from torchtext==0.16.2)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting torch==2.1.2 (from torchtext==0.16.2)
  Using cached torch-2.1.2-cp311-cp311-manylinux1_x86_64.whl.metadata (25 kB)
Collecting numpy (from torchtext==0.16.2)
  Using cached numpy-2.2.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting torchdata==0.7.1 (from torchtext==0.16.2)
  Using cached torchdata-0.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting filelock (from torch==2.1.2->torchtext==0.16.2)
  Using cached filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting typing-extensions (from torch==2.1.2->torchtext==0.16.2)
  Using cached typing_extensions-4.13.2-py3-n


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.5 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "/usr/local/lib/python3.11/dist-packages/spacy/__init__.py", line 6, in <module>
  File "/usr/local/lib/python3.11/dist-packages/spacy/errors.py", line 3, in <module>
    from .compat import Literal
  File "/usr/local/lib/python3.11/dist-packages/spacy/compat.py", line 4, in <module>
    from thinc.util import copy_array
  File "/usr/local/lib/p

In [17]:
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax
import math
import copy
import time
from torch.utils.data import DataLoader
import spacy
import warnings
from torchtext.datasets import Multi30k
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence

warnings.filterwarnings("ignore")
# Set to False to skip notebook execution (e.g. for debugging)
RUN_EXAMPLES = True

# 1.Model Architecture

## 1.1 Embedding (5pt)

Defines the word-embedding layer that maps token indices to dense vectors.

The shape of the Embedding should be: **[Vocab, Embedding_size]**

In [19]:
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        ######### add your code here
        self.emb =  nn.Embedding(vocab, d_model)
        #########
        self.d_model = d_model

    def forward(self, x):
        return self.emb(x) * math.sqrt(self.d_model)
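
As a quick sanity check of the shape stated above, the sketch below builds a bare `nn.Embedding` and applies the same `sqrt(d_model)` scaling as `Embeddings.forward`. The sizes `vocab=10` and `d_model=4` are arbitrary illustration values, not taken from the assignment.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model = 10, 4  # toy sizes chosen only for illustration
emb = nn.Embedding(vocab, d_model)

tokens = torch.tensor([[1, 2, 3]])          # [batch=1, seq_len=3]
scaled = emb(tokens) * math.sqrt(d_model)   # same scaling as Embeddings.forward

weight_shape = tuple(emb.weight.shape)      # (vocab, d_model)
output_shape = tuple(scaled.shape)          # (batch, seq_len, d_model)
print(weight_shape, output_shape)
```

The weight matrix has one row per vocabulary entry, and each token index simply selects its row.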

## 1.2 Position Encoding (15pt)

Implements sinusoidal positional encodings so the model can attend to token order without recurrence.

$$
PE_{(pos, 2i)} = \sin\left(pos/10000^{2i/d_{model}}\right) \\
PE_{(pos, 2i+1)} = \cos\left(pos/10000^{2i/d_{model}}\right)
$$

1. `pos` is the position and `i` is the dimension index.
2. Each dimension of the position encoding corresponds to a sinusoid.




In [20]:
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        ######### add your code here
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        #########
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
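
To make the formula concrete, here is a minimal pure-Python sketch that evaluates it directly for a single position. The helper name `sinusoidal_pe` is hypothetical, and the sketch assumes an even `d_model`; it computes `10000^(2i/d_model)` directly rather than through the `exp`/`log` form used in the class above, which is mathematically equivalent.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Return the d_model-dimensional encoding for a single position."""
    pe = []
    for k in range(d_model):
        i = k // 2  # index of the sin/cos pair this dimension belongs to
        angle = pos / (10000 ** (2 * i / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

pe0 = sinusoidal_pe(0, 4)
pe1 = sinusoidal_pe(1, 4)
print(pe0)  # [0.0, 1.0, 0.0, 1.0]
print(pe1)
```

At position 0 every sine term is 0 and every cosine term is 1, so the encoding alternates between the two.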

## 1.3 LayerNorm (10pt)
Provides a custom Layer Normalization module to stabilise training by normalising hidden states across features.

We will follow the basic setting in [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html).

$$
y = \frac{x - E[x]}{\sqrt{Var[x] + \epsilon}}*\gamma + \beta
$$

where:
1. `ϵ`: a value added to the denominator for numerical stability
2. `γ` & `β`: weight (initialized to 1) and bias (initialized to 0).

Note that the implementation below divides by `std + eps` rather than `sqrt(var + eps)`, so `ϵ` is added outside the square root.

In [21]:
class LayerNorm(nn.Module):
    "Construct a layernorm module."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        ######### add your code here
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        #########
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        ######### add your code here
        result = self.a_2 * (x - mean) / (std + self.eps) + self.b_2
        #########
        return result
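
The custom module divides by `std + eps`, and `Tensor.std` uses the unbiased estimator by default, so its output is close to but not identical with `torch.nn.LayerNorm`. The sketch below, using arbitrary toy shapes, compares the two formulations; it is an illustration of the difference, not a claim about which one the assignment expects.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 8)   # [batch, seq_len, features]; toy sizes
eps = 1e-6

# Variant used by the LayerNorm class above: unbiased std, eps added outside.
mean = x.mean(-1, keepdim=True)
custom = (x - mean) / (x.std(-1, keepdim=True) + eps)

# PyTorch's formulation: biased variance, eps added inside the square root.
var = x.var(-1, unbiased=False, keepdim=True)
manual_ref = (x - mean) / torch.sqrt(var + eps)
reference = nn.LayerNorm(8, eps=eps)(x)

max_gap = (custom - reference).abs().max().item()
print(torch.allclose(manual_ref, reference, atol=1e-5), max_gap)
```

The gap comes almost entirely from the `n - 1` versus `n` divisor in the variance estimate, so it shrinks as the feature dimension grows.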

## 1.4 SubLayer Connection (10pt)

Wraps a residual connection around a sub-layer (e.g., attention or FFN). Here we use the pre-norm setup, where layer normalization is applied to the sub-layer's input.

The output of each sub-layer is:
$$
x + Sublayer(LayerNorm(x))
$$
where:
1. `Sublayer(x)` is the function implemented by the sub-layer itself (Attn/MLP).

In [22]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        ######### add your code here
        return x + self.dropout(sublayer(self.norm(x)))
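
For comparison, the sketch below places the pre-norm ordering used here next to the post-norm ordering from the original Transformer paper. It uses `nn.LayerNorm` and a single `nn.Linear` as a stand-in sub-layer, omits dropout, and picks toy sizes; all of these are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8                               # toy size for illustration
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)    # stand-in for attention or the FFN
x = torch.randn(2, 5, d_model)

pre_norm = x + sublayer(norm(x))     # what SublayerConnection computes (dropout omitted)
post_norm = norm(x + sublayer(x))    # normalize after the residual addition

print(pre_norm.shape, post_norm.shape)
```

In the pre-norm form the residual path carries `x` through unnormalized, which is why the encoder and decoder stacks apply one final `LayerNorm` to their output.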
        #########

## 1.5 Multi-Head Attention (20pt)

Implements scaled dot-product self-attention, splits it into h parallel “heads,” and concatenates the results back together.

For each head:
$$
Attention(Q, K, V) = softmax(\frac{Q K^T}{\sqrt{d_k}})V
$$

Multi-Head Attention:
$$
MultiHead(Q, K, V) = Concat(head_1,\dots, head_h)W^O \\
where\quad head_i=Attention(QW^Q_i, KW^K_i, VW^V_i)
$$


In [23]:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    ######### add your code here
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    #########
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    ######### add your code here
    p_attn = scores.softmax(dim=-1)
    #########
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
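
The effect of the mask is easiest to see on a tiny example. The sketch below repeats the same steps as `attention` inline, with a causal mask and arbitrary toy shapes, and shows that masked positions receive zero attention weight.

```python
import math
import torch

torch.manual_seed(0)
q = torch.randn(1, 3, 4)   # [batch, query_len, d_k]; toy sizes
k = torch.randn(1, 3, 4)
v = torch.randn(1, 3, 4)

# Causal mask: position i may only look at positions <= i.
mask = torch.tril(torch.ones(1, 3, 3, dtype=torch.bool))

scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
scores = scores.masked_fill(mask == 0, -1e9)   # masked scores vanish after softmax
p_attn = scores.softmax(dim=-1)
out = torch.matmul(p_attn, v)

print(p_attn[0])
```

Each row of `p_attn` still sums to 1, and the first query can only attend to the first key, so its output equals the first value vector.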

In [24]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.linears = nn.Linear(d_model, d_model)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        ######### add your code here
        query = self.Wq(query).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        key = self.Wk(key).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        value = self.Wv(value).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        #########

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) "Concat" using a view and apply a final linear.
        ######### add your code here
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        #########

        del query
        del key
        del value
        return self.linears(x)
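
The `view`/`transpose` calls in steps 1 and 3 only rearrange the tensor; no values change. A small shape walk-through, with toy sizes chosen for illustration:

```python
import torch

nbatches, seq_len, h, d_k = 2, 5, 4, 16   # toy sizes; d_model = h * d_k = 64
x = torch.randn(nbatches, seq_len, h * d_k)

# Split: [batch, seq, d_model] -> [batch, h, seq, d_k]
heads = x.view(nbatches, -1, h, d_k).transpose(1, 2)

# Merge: [batch, h, seq, d_k] -> [batch, seq, d_model]
merged = heads.transpose(1, 2).contiguous().view(nbatches, -1, h * d_k)

print(heads.shape, merged.shape, torch.equal(merged, x))
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, which `view` cannot reshape directly.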

## 1.6 Feed-Forward Network (FFN) (5pt)
Creates the two-layer position-wise feed-forward network applied after each attention block.

In [25]:
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        ######### add your code here
        self.w_2 = nn.Linear(d_ff, d_model)
        #########
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

# 2.Encoder

## 2.1 Encoder Layer

Defines a single Encoder block consisting of a Multi-Head Self-Attention layer and an FFN, each wrapped in Add & Norm.

In [26]:
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

## 2.2 Encoder Stacks (10pt)
Stacks N Encoder blocks to form the full Encoder.

In [27]:
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class Encoder(nn.Module):
    "Core encoder is a stack of N layers"
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        ######### add your code here
        for layer in self.layers:
            x = layer(x, mask)
        #########
        return self.norm(x)
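
Because `clones` relies on `copy.deepcopy`, the N layers start from identical values but do not share parameters. The sketch below illustrates this with a plain `nn.Linear` standing in for an encoder layer (an assumption made to keep the example short).

```python
import copy
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
stack = nn.ModuleList([copy.deepcopy(layer) for _ in range(3)])

# Copies start with identical values...
same_values = torch.equal(stack[0].weight, stack[1].weight)

# ...but they are separate tensors, so changing one leaves the others untouched.
with torch.no_grad():
    stack[0].weight.add_(1.0)
still_same = torch.equal(stack[0].weight, stack[1].weight)

print(same_values, still_same)  # True False
```

This is also why `make_model` re-initializes every weight matrix after construction: without that step all layers would begin training from the same values.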

# 3.Decoder

## 3.1 Decoder Layer (5pt)

Details a single Decoder block containing masked self-attention, cross-attention, and an FFN, each with residual Add & Norm.

In [28]:
class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        ######### add your code here
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        #########
        return self.sublayer[2](x, self.feed_forward)

## 3.2 Decoder Stacks (10pt)
Stacks N Decoder blocks to build the complete Decoder.

In [29]:
class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        ######### add your code here
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        #########
        return self.norm(x)

# Model

Combines the Encoder and Decoder into a full Transformer model and adds a linear projection to produce logits over the target vocabulary.

In [30]:
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)
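
Since `Generator` returns log-probabilities rather than raw logits, it pairs naturally with a negative log-likelihood loss. The sketch below, using toy sizes and a hypothetical padding index of 1, checks that `log_softmax` followed by `nn.NLLLoss` gives the same value as `cross_entropy` applied to the logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 10)                  # [tokens, vocab]; toy sizes
targets = torch.tensor([3, 1, 0, 7, 1, 9])
pad_idx = 1                                  # hypothetical padding index for this example

# Log-probabilities + NLLLoss, the combination a log_softmax output calls for.
nll = nn.NLLLoss(ignore_index=pad_idx)(F.log_softmax(logits, dim=-1), targets)

# cross_entropy applies log_softmax internally, so it takes raw logits.
ce = F.cross_entropy(logits, targets, ignore_index=pad_idx)

print(nll.item(), ce.item())
```

Targets equal to `pad_idx` are skipped in both cases, so padded positions do not contribute to the average.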

In [14]:
def make_model(
    src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1
):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model, dropout)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

# Train (10pt)

Prepares a small EN→DE dataset (500 training pairs from Multi30k), builds dataloaders, defines the loss and optimizer, then trains for 200 epochs as a quick demonstration. An inference example follows in the next section.

In [31]:
import torch
import torch.nn as nn
import math
import time
from torchtext.datasets import Multi30k
from torchtext.vocab import build_vocab_from_iterator
import spacy
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def subsequent_mask(size, device=None):
    # Mask has 1s in the *allowed* (≤ i) positions, 0 elsewhere
    attn_shape = (1, size, size)
    mask = torch.triu(torch.ones(attn_shape, dtype=torch.bool, device=device), diagonal=1)
    return ~mask      # invert so future positions are 0

# Load tokenizers (ensure you've downloaded the models:
# python -m spacy download en_core_web_sm && python -m spacy download de_core_news_sm)
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_de(text):
    return [tok.text.lower() for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text.lower() for tok in spacy_en.tokenizer(text)]

# Prepare dataset (only 500 samples for a quick test)
raw_iter = Multi30k(split='train', language_pair=('en', 'de'))
raw_data = list(raw_iter)[:500]
test_iter = Multi30k(split='valid', language_pair=('en', 'de'))
test_data = list(test_iter)[:10]

# Build vocabularies with special tokens
SRC_SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']
TGT_SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

vocab_src = build_vocab_from_iterator((tokenize_en(pair[0]) for pair in raw_data), specials=SRC_SPECIALS)
vocab_tgt = build_vocab_from_iterator((tokenize_de(pair[1]) for pair in raw_data), specials=TGT_SPECIALS)

vocab_src.set_default_index(vocab_src['<unk>'])
vocab_tgt.set_default_index(vocab_tgt['<unk>'])

SRC_PAD_IDX = vocab_src['<pad>']
TGT_PAD_IDX = vocab_tgt['<pad>']

def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src, tgt in batch:
        src_tokens = [vocab_src['<bos>']] + [vocab_src[token] for token in tokenize_en(src)] + [vocab_src['<eos>']]
        tgt_tokens = [vocab_tgt['<bos>']] + [vocab_tgt[token] for token in tokenize_de(tgt)] + [vocab_tgt['<eos>']]
        src_batch.append(torch.tensor(src_tokens, dtype=torch.long))
        tgt_batch.append(torch.tensor(tgt_tokens, dtype=torch.long))
    src_batch = pad_sequence(src_batch, padding_value=SRC_PAD_IDX, batch_first=True)
    tgt_batch = pad_sequence(tgt_batch, padding_value=TGT_PAD_IDX, batch_first=True)
    return src_batch, tgt_batch

train_loader = DataLoader(raw_data, batch_size=32, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_data, batch_size=2, shuffle=False, collate_fn=collate_fn)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = make_model(len(vocab_src), len(vocab_tgt), N=2, d_model=64, d_ff=128, h=4, dropout=0.1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.NLLLoss(ignore_index=TGT_PAD_IDX)


model.train()
epochs = 200
for epoch in range(epochs):
    start_time = time.time()
    total_loss = 0
    for i, (src, tgt) in enumerate(train_loader):
        src, tgt = src.to(device), tgt.to(device)
        tgt_input = tgt[:, :-1]
        src_mask = (src != SRC_PAD_IDX).unsqueeze(-2)
        tgt_pad_mask = (tgt_input != TGT_PAD_IDX).unsqueeze(-2)
        tgt_mask     = tgt_pad_mask & subsequent_mask(tgt_input.size(1), device=tgt_input.device)
        ######### add your code here
        out = model(src, tgt_input, src_mask, tgt_mask)
        #########
        logits = model.generator(out)
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"{STUDENT_NAME} Epoch {epoch} | Loss: {total_loss/(i+1):.4f} | Time: {time.time()-start_time:.2f}s")

Shubham Derhgawen Epoch 0 | Loss: 7.1366 | Time: 1.33s
Shubham Derhgawen Epoch 1 | Loss: 6.9783 | Time: 0.37s
Shubham Derhgawen Epoch 2 | Loss: 6.8661 | Time: 0.39s
Shubham Derhgawen Epoch 3 | Loss: 6.7688 | Time: 0.37s
Shubham Derhgawen Epoch 4 | Loss: 6.6818 | Time: 0.40s
Shubham Derhgawen Epoch 5 | Loss: 6.5923 | Time: 0.37s
Shubham Derhgawen Epoch 6 | Loss: 6.5028 | Time: 0.37s
Shubham Derhgawen Epoch 7 | Loss: 6.4126 | Time: 0.37s
Shubham Derhgawen Epoch 8 | Loss: 6.3214 | Time: 0.35s
Shubham Derhgawen Epoch 9 | Loss: 6.2348 | Time: 0.36s
Shubham Derhgawen Epoch 10 | Loss: 6.1479 | Time: 0.37s
Shubham Derhgawen Epoch 11 | Loss: 6.0613 | Time: 0.35s
Shubham Derhgawen Epoch 12 | Loss: 5.9798 | Time: 0.39s
Shubham Derhgawen Epoch 13 | Loss: 5.8950 | Time: 0.37s
Shubham Derhgawen Epoch 14 | Loss: 5.8178 | Time: 0.37s
Shubham Derhgawen Epoch 15 | Loss: 5.7483 | Time: 0.37s
Shubham Derhgawen Epoch 16 | Loss: 5.6743 | Time: 0.38s
Shubham Derhgawen Epoch 17 | Loss: 5.6099 | Time: 0.37s
Sh
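
The shifting and masking done inside the training loop can be traced on a toy batch. The token ids below are made up for illustration; only the padding index of 1 mirrors the position of `<pad>` in the specials list.

```python
import torch

PAD = 1  # '<pad>' is the second entry in the specials list
tgt = torch.tensor([[2, 5, 6, 8, 3],        # <bos> a b c <eos>
                    [2, 7, 3, PAD, PAD]])   # <bos> d <eos> <pad> <pad>

tgt_input = tgt[:, :-1]    # decoder input: drop the last token
tgt_output = tgt[:, 1:]    # prediction target: drop <bos>

pad_mask = (tgt_input != PAD).unsqueeze(-2)                   # [batch, 1, len]
causal = torch.tril(torch.ones(1, 4, 4, dtype=torch.bool))    # [1, len, len]
tgt_mask = pad_mask & causal                                  # [batch, len, len]

print(tgt_mask[1].int())
```

For the second sentence the last column of the mask is all zeros, because that position holds padding, while the lower-triangular pattern keeps each position from seeing later tokens.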

# Inference

In [32]:
# Greedy decoding for inference
@torch.no_grad()  # no gradients are needed while decoding
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    memory = model.encode(src, src_mask)
    ys = torch.full((1, 1), start_symbol, dtype=torch.long, device=src.device)
    for _ in range(max_len - 1):
        tgt_mask = subsequent_mask(ys.size(1), device=src.device)
        out = model.decode(memory, src_mask, ys, tgt_mask)
        # Log-probabilities for the most recent position only
        prob = model.generator(out[:, -1])
        next_word = torch.argmax(prob, dim=1).item()
        ys = torch.cat([ys, torch.full((1, 1), next_word, dtype=torch.long, device=src.device)], dim=1)
        if next_word == vocab_tgt['<eos>']:
            break
    return ys

model.eval()
for index, example in enumerate(test_data[:10]):
    src = torch.tensor(
        [vocab_src['<bos>']] + [vocab_src[t] for t in tokenize_en(example[0])] + [vocab_src['<eos>']],
        dtype=torch.long
    ).unsqueeze(0).to(device)
    src_mask = (src != SRC_PAD_IDX).unsqueeze(-2)
    translation = greedy_decode(model, src, src_mask, max_len=50, start_symbol=vocab_tgt['<bos>'])
    tokens = vocab_tgt.lookup_tokens(translation.squeeze(0).tolist())

    print("Source:", example[0])
    print("Reference:", example[1])
    print("Predicted:", " ".join(tokens))
    print('*******'*50)

Source: A group of men are loading cotton onto a truck
Reference: Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen
Predicted: <bos> eine frau in einem geländer straße auf einem gleis . <eos>
**************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
Source: A man sleeping in a green room on a couch.
Reference: Ein Mann schläft in einem grünen Raum auf einem Sofa.
Predicted: <bos> ein mann , der in einer roten oberteil und sich . <eos>
*******************************************************************************************************************************************************************************************************************************************************************
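
The predictions above still contain the `<bos>` and `<eos>` markers. A minimal post-processing sketch, using a hypothetical helper `clean_tokens` that is not part of the notebook, strips them before display:

```python
def clean_tokens(tokens):
    """Drop <bos>/<pad> and cut the sequence at the first <eos>."""
    cleaned = []
    for tok in tokens:
        if tok == "<eos>":
            break                      # everything after <eos> is discarded
        if tok in ("<bos>", "<pad>"):
            continue                   # skip markers that carry no content
        cleaned.append(tok)
    return " ".join(cleaned)

predicted = "<bos> ein mann , der in einer roten oberteil und sich . <eos>".split()
print(clean_tokens(predicted))  # ein mann , der in einer roten oberteil und sich .
```

`<unk>` is deliberately left in place, since it marks a word the model could not produce rather than a formatting token.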