# Assignment 3 (100 + 5 points)

**Name:** <br>
**Email:** <br>
**Group:** A/B <br>
**Hours spent *(optional)* :** <br>

### Question 1: Transformer model *(100 points)*

As a machine learning engineer at a tech company, you have been asked to develop a machine translation system that translates **English (source) to German (target)**. You may use existing libraries, but the model must be trained from scratch (pretrained weights are not allowed). You are free to choose any dataset for training. Hold out a small subset of the data as a validation set and report the BLEU score on it. Also provide a short description of your transformer architecture, hyperparameters, and training procedure, including the training loss curve.
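
To make the reported metric concrete, here is a minimal sketch of unsmoothed corpus-level BLEU with a single reference per sentence. The function names `ngrams` and `corpus_bleu_sketch` are made up for this illustration, and whitespace tokenization is an assumption. For the actual report, an established implementation (for example NLTK's `corpus_bleu`) is probably the safer choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(references, hypotheses, max_n=4):
    """Unsmoothed corpus-level BLEU, one reference per hypothesis.

    references, hypotheses: equally long lists of token lists.
    """
    matched = [0] * max_n   # clipped n-gram matches, per n-gram order
    total = [0] * max_n     # number of hypothesis n-grams, per n-gram order
    ref_len = hyp_len = 0

    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0

    # Geometric mean of the modified n-gram precisions (uniform weights)
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


refs = [["das", "ist", "ein", "kleiner", "Test"]]
hyps = [["das", "ist", "ein", "kleiner", "Test"]]
score = corpus_bleu_sketch(refs, hyps)
```

Because no smoothing is applied, short validation sets with no matching 4-grams score 0, so the number is only meaningful over a reasonably large set of sentences.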

<h3> Submission </h3>

The test set **(test.txt)** will be released one week before the deadline. You should submit the output of your model on the test set separately. Name the output file as **"first name_last_name_test_result.txt"**. Each line of the submission file should contain only the translated text of the corresponding sentence from 'test.txt'.

The 'first name_last_name_test_result.txt' file will be evaluated by your instructor, and the student with the best BLEU score will receive 5 additional points.
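
The one-translation-per-line format can be sketched as follows. `write_submission` is a hypothetical helper, and the file name in the usage comment is a placeholder, not a prescribed name.

```python
from pathlib import Path


def write_submission(translations, path):
    """Write one translation per line and return the number of lines written.

    Runs of whitespace (including line breaks inside a translation) are
    collapsed to single spaces so that line i always matches sentence i.
    """
    lines = [" ".join(t.split()) for t in translations]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Usage (placeholder file name):
# write_submission(["Hallo Welt", "Guten Morgen"], "jane_doe_test_result.txt")
```

Checking that the returned count equals the number of lines in 'test.txt' is a cheap way to catch misaligned output before submitting.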

**Dataset**

Here are some of the parallel datasets (see Datasets and Resources file):
* Europarl Parallel corpus - https://www.statmt.org/europarl/v7/de-en.tgz
* News Commentary - https://www.statmt.org/wmt14/training-parallel-nc-v9.tgz (use DE-EN parallel data)
* Common Crawl corpus - https://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz (use DE-EN parallel data)

You can also use other datasets of your choice. In the datasets above, the **'.en'** file contains the English text and the **'.de'** file contains the corresponding German translations, line by line.
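
Since the two files are aligned line by line, they can be read in lockstep. The sketch below assumes plain UTF-8 text files and drops pairs where either side is empty, which such corpora may contain. `read_parallel` and its `max_pairs` parameter are hypothetical names.

```python
def read_parallel(en_path, de_path, max_pairs=None):
    """Read aligned English/German lines and return a list of (en, de) pairs."""
    pairs = []
    with open(en_path, encoding="utf-8") as f_en, open(de_path, encoding="utf-8") as f_de:
        for en, de in zip(f_en, f_de):
            en, de = en.strip(), de.strip()
            if en and de:                      # skip pairs with an empty side
                pairs.append((en, de))
            if max_pairs is not None and len(pairs) >= max_pairs:
                break                          # stop early for a small subset
    return pairs
```

Reading lazily with `zip` avoids loading both files fully into memory when only the first `max_pairs` pairs are needed.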

## Notes:

1) If the training dataset is large, consider using a small subset of it.
2) You may run into out-of-memory errors during training, so choose the hyperparameters (for example batch size and sequence length) carefully.
3) Training is much faster on a GPU. On a CPU it may take several hours or even days. You can also use Google Colab GPUs for training: https://colab.research.google.com/
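
Notes 1 and the validation requirement can be combined in one step. The following is a minimal sketch using only the standard library; `subset_and_split` and its default values are assumptions chosen for illustration (scikit-learn's `train_test_split` would work as well).

```python
import random


def subset_and_split(pairs, subset_size, val_fraction=0.05, seed=0):
    """Sample a random subset of sentence pairs and hold out part of it.

    pairs must be a list (or other sequence). Returns (train_pairs, val_pairs).
    """
    rng = random.Random(seed)                       # fixed seed for reproducibility
    sample = rng.sample(pairs, min(subset_size, len(pairs)))
    n_val = max(1, int(len(sample) * val_fraction))  # keep at least one validation pair
    return sample[n_val:], sample[:n_val]
```

Sampling without replacement before splitting guarantees that no pair appears in both the training and the validation set, provided the input list itself has no duplicates.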

In [None]:
import torch
import math
import copy
import torch.nn as nn
import torch.nn.functional as F  # F.softmax is used in MultiHeadAttention
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import nltk
from nltk.translate.bleu_score import sentence_bleu
# nltk.download('punkt')

In [204]:
class Embedding(nn.Module):
    def __init__(self, vocab, d_model):
        super(Embedding, self).__init__()
        
        self.d_model = d_model
        self.lut = nn.Embedding(vocab, d_model) # look up table 

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

<img src='data/img/multi-head-attention_l1A3G7a.png' style='width: 300px'/>

In [225]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0

        # the input consists of queries and keys of dim d_k (head dimension)
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.d_model = d_model

        # linear projections for queries, keys, values, and the output
        self.W_q = nn.Linear(d_model, d_model, bias=False) 
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
    
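    def scaled_dot_product(self, Q, K, V, mask=None):
        'Scaled Dot-Product Attention'
        d_k = Q.size()[-1]
        score = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k) # scaled multiplicative compatibility function

        if mask is not None:
            # masked positions (padding or future tokens) get a large negative score,
            # so they receive (almost) zero weight after the softmax
            score = score.masked_fill(mask == 0, -1e9)

        attention = F.softmax(score, dim=-1) # attention weights, each row sums to 1
        
        return torch.matmul(attention, V)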

    def split_heads(self, x):
        'This method splits the input tensor into multiple heads for parallel processing'
        batch_size, seq_len, d_model = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        'After attention computation, the heads are combined back'
        batch_size, _, seq_len, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
                
    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        attention = self.scaled_dot_product(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attention))
        return output
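
The `mask` argument is meant to hide padded or future positions from the attention. One common way to apply such a mask, sketched here as a standalone function (`masked_attention` is a hypothetical name, and the boolean "True means visible" convention is an assumption), is to overwrite the masked scores with a large negative value before the softmax:

```python
import math
import torch
import torch.nn.functional as F


def masked_attention(Q, K, V, mask=None):
    """Scaled dot-product attention; positions where mask is False get ~zero weight."""
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(~mask, -1e9)   # mask must be a bool tensor
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights


torch.manual_seed(0)
Q = K = V = torch.randn(1, 1, 4, 8)             # (batch, heads, seq_len, d_k)
causal = torch.tril(torch.ones(4, 4)).bool()    # row i may attend to columns 0..i
out, weights = masked_attention(Q, K, V, causal)
```

Using a finite value such as `-1e9` instead of `-inf` keeps the softmax well defined even for a row in which every position is masked.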

In [226]:
# W_q = nn.Linear(512, 512)
# Q = torch.randint(low=1, high=1000, size=(8, 100, 512)).float()
# x = W_q(Q)
# x.size()
# Q.transpose()

In [227]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()

        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        # expand to d_ff, apply ReLU, project back to d_model
        return self.fc2(self.relu(self.fc1(x)))

In [228]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
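
For reference, the buffer `pe` appears to follow the standard sinusoidal encoding, where $pos$ is the token position and $i$ indexes pairs of embedding dimensions:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

The `div_term` line computes the inverse of the denominator in log space, with `torch.arange(0, d_model, 2)` supplying the values $2i$:

$$
\frac{1}{10000^{2i/d_{model}}} = \exp\!\left(-\,2i \cdot \frac{\ln 10000}{d_{model}}\right)
$$

Note that the slicing `pe[:, 0::2]` and `pe[:, 1::2]` only lines up with `div_term` when `d_model` is even.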

<img src='data/img/0*bPKV4ekQr9ZjYkWJ.webp' style='width: 300px'/>

In [229]:
class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(Encoder, self).__init__()
        
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x 

<img src='data/img/0*SPZgT4k8GQi37H__.webp' style='width: 300px'/>

In [230]:
class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(Decoder, self).__init__()

        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_out = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_out))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

In [231]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        # without positional information, attention cannot distinguish token order
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([Encoder(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([Decoder(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        # token id 0 is treated as padding
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, src_len)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)  # (batch, 1, tgt_len, 1)
        seq_length = tgt.size(1)
        # lower-triangular mask so position i cannot peek at positions > i;
        # created on the same device as tgt so that GPU training works
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length, device=tgt.device), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask                # (batch, 1, tgt_len, tgt_len)
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output
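
Producing the test-set output requires generating a translation token by token, since `forward` expects the target prefix as input. A minimal greedy-decoding sketch is given below. It assumes a model that is called as `model(src, tgt)` and returns logits of shape `(batch, tgt_len, vocab)`; `greedy_decode`, `bos_id`, and `eos_id` are hypothetical names, because the notebook does not yet define start and end tokens.

```python
import torch


@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Generate target ids one step at a time, always taking the argmax token."""
    model.eval()
    # every sequence starts with the begin-of-sentence token
    tgt = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    finished = torch.zeros(src.size(0), dtype=torch.bool, device=src.device)
    for _ in range(max_len - 1):
        logits = model(src, tgt)                     # (batch, tgt_len, vocab)
        next_token = logits[:, -1].argmax(dim=-1)    # most likely next token
        tgt = torch.cat([tgt, next_token.unsqueeze(1)], dim=1)
        finished |= (next_token == eos_id)
        if finished.all():                           # stop once every sequence ended
            break
    return tgt
```

Sequences that finish early keep receiving tokens until the whole batch is done, so everything after the first `eos_id` should be cut off before converting ids back to text. Beam search usually gives better BLEU than greedy decoding but is more involved.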

In [71]:
import torch
from transformers import BertTokenizer, BertModel

# Step 1: Load the data from files
def load_data(english_file, german_file):
    # strip the trailing newline that readlines() would keep on every sentence
    with open(english_file, 'r', encoding='utf-8') as f:
        english_sentences = [line.strip() for line in f]
    with open(german_file, 'r', encoding='utf-8') as f:
        german_sentences = [line.strip() for line in f]
    return english_sentences, german_sentences
    
# english_sentences = open('data/de-en/europarl-v7.de-en.%s' % lang1, encoding='utf-8').read().strip().split('\n')
# german_sentences = open('data/de-en/europarl-v7.de-en.%s' % lang2, encoding='utf-8').read().strip().split('\n')

# Step 2: Tokenize the text using BERT tokenizer
def tokenize_sentences(sentences, tokenizer):
    return [tokenizer(sentence, return_tensors='pt', padding=True, truncation=True) for sentence in sentences]

# Step 3: Embed the tokenized text using BERT model
def embed_sentences(tokenized_sentences, model):
    embeddings = []
    with torch.no_grad():
        for tokens in tokenized_sentences:
            outputs = model(**tokens)
            embeddings.append(outputs.last_hidden_state)
    return embeddings

# Main function to process the corpus
def process_parallel_corpus(english_file, german_file):
    # Load data
    english_sentences, german_sentences = load_data(english_file, german_file)
    
    # Initialize tokenizer and model
    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    model = BertModel.from_pretrained('bert-base-multilingual-cased')
    
    # Tokenize sentences
    tokenized_english = tokenize_sentences(english_sentences, tokenizer)
    tokenized_german = tokenize_sentences(german_sentences, tokenizer)
    
    # Embed sentences
    english_embeddings = embed_sentences(tokenized_english, model)
    german_embeddings = embed_sentences(tokenized_german, model)
    
    return english_embeddings, german_embeddings

# Example usage
english_file = 'data/de-en/europarl-v7.de-en.en'
german_file = 'data/de-en/europarl-v7.de-en.de'
# english_embeddings, german_embeddings = process_parallel_corpus(english_file, german_file)

# english_embeddings and german_embeddings are now lists of tensors
# print(english_embeddings[0].shape)  # Example: torch.Size([1, 10, 768]) for a sentence with 10 tokens
# print(german_embeddings[0].shape)
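
The cell above relies on a pretrained BERT model, while the task statement does not allow pretrained weights. One possible from-scratch alternative is to build a vocabulary directly from the training sentences and feed the resulting ids to the `Transformer` defined earlier. The sketch below assumes whitespace tokenization; `build_vocab`, `encode`, and the special-token strings are hypothetical. Padding gets id 0 to match `generate_mask`, which treats 0 as padding.

```python
from collections import Counter

PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"


def build_vocab(sentences, min_freq=1, max_size=None):
    """Map each sufficiently frequent token to an integer id; PAD gets id 0."""
    counts = Counter(tok for s in sentences for tok in s.split())
    words = [w for w, c in counts.most_common(max_size) if c >= min_freq]
    return {tok: i for i, tok in enumerate([PAD, UNK, BOS, EOS] + words)}


def encode(sentence, vocab, max_len):
    """Convert a sentence to a fixed-length id list: BOS, tokens, EOS, padding."""
    tokens = sentence.split()[:max_len - 2]          # leave room for BOS and EOS
    ids = [vocab[BOS]] + [vocab.get(t, vocab[UNK]) for t in tokens] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab(["a b a", "c"])
ids = encode("a z", vocab, max_len=6)
```

Separate vocabularies for English and German would give the `src_vocab_size` and `tgt_vocab_size` arguments of the model. A subword method such as BPE trained on the same data would handle rare German compounds better than whole-word tokens.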