## TC 5033
## Deep Learning
## Transformers

#### Activity 4: Implementing a Translator

- Objective

    To understand the Transformer architecture by implementing a translator.

- Instructions

    This activity requires submission in teams. While teamwork is encouraged, each member is expected to contribute individually to the assignment. The final submission should feature the best arguments and solutions from each team member. Only one person per team needs to submit the completed work, but it is imperative that the names of all team members are listed in a Markdown cell at the very beginning of the notebook (either the first or second cell). Failure to include all team member names will result in the grade being awarded solely to the individual who submitted the assignment, with zero points given to other team members (no exceptions will be made to this rule).

    Follow the provided code. It already implements a Transformer from scratch, as explained in one of the [week 9 videos](https://youtu.be/XefFj4rLHgU).

    Since the provided code already implements a simple translator, your job for this assignment is to understand it fully and document it using pictures, figures, and Markdown cells. You should test your translator with at least 10 sentences. The dataset used for this task was obtained from [Tatoeba, a large dataset of sentences and translations](https://tatoeba.org/en/downloads).
  
- Evaluation Criteria

    - Code Readability and Comments
    - Training a translator
    - Translating at least 10 sentences

- Submission

    Submit this Jupyter Notebook in Canvas with your complete solution, ensuring your code is well commented and includes Markdown cells that explain your design choices, results, and any challenges you encountered.



#### Script to convert csv to text file

In [None]:
# This script requires the TSV file to be converted to CSV first.
# The easiest way is to open it in Calc or Excel and save it as CSV.
PATH = '/content/Sentencee.csv'
import pandas as pd
df = pd.read_csv(PATH, sep='\t', on_bad_lines='skip', engine='python')

In [None]:
# Keep only the English (column 1) and Spanish (column 3) sentences.
# .copy() returns an independent DataFrame, which avoids pandas'
# SettingWithCopyWarning when the helper column is added below.
eng_spa_cols = df.iloc[:, [1, 3]].copy()
# Sort the pairs from shortest to longest English sentence, then drop the helper column
eng_spa_cols['length'] = eng_spa_cols.iloc[:, 0].str.len()
eng_spa_cols = eng_spa_cols.sort_values(by='length')
eng_spa_cols = eng_spa_cols.drop(columns=['length'])

output_file_path = '/content/sample_data/eng-spa4.txt'
eng_spa_cols.to_csv(output_file_path, sep='\t', index=False, header=False)


## Transformer - Attention is all you need

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import math
import numpy as np
import re

torch.manual_seed(23)

<torch._C.Generator at 0x79b8675554b0>

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [None]:
MAX_SEQ_LEN = 128

# Transformer-Based Model

This code implements the core components of a Transformer model, specifically the **Encoder** and **Decoder** layers, as well as supporting elements like positional embeddings, multi-head attention, and feed-forward layers. Below, we provide detailed documentation along with inline comments to explain each part of the code.

## Table of Contents
### 1. PositionalEmbedding
- **Role:** Adds information about the position of each token in the input sequence. Since the model doesn't inherently understand order, these embeddings help it learn the sequence structure.
- **How:** Uses sine and cosine functions to generate a unique positional vector for each position in the input sequence.

### 2. MultiHeadAttention
- **Role:** Allows the model to focus on different parts of the input sequence simultaneously (multiple heads), improving its ability to understand complex relationships.
- **How:** Projects the input into three sets of vectors, Query (Q), Key (K), and Value (V), to compute attention scores and generate weighted values. This process runs in parallel across multiple attention heads, and the head outputs are concatenated.

### 3. PositionFeedForward
- **Role:** Processes the output of the attention layer to capture more complex patterns in the data.
- **How:** A simple feed-forward network that applies two linear transformations with a ReLU activation in between. It helps the model learn non-linear transformations.

### 4. EncoderSubLayer
- **Role:** A single processing unit inside the encoder that includes a self-attention mechanism and a feed-forward network.
- **How:** The input goes through self-attention (to learn relationships between tokens) and then through a feed-forward network (to further process the information). Each step has a residual connection and layer normalization for stability.

### 5. Encoder
- **Role:** Stacks multiple EncoderSubLayers to process the entire input sequence and generate an encoded representation of it.
- **How:** The input is passed through each EncoderSubLayer sequentially. The output is the final, processed sequence that captures rich information about the input.

### 6. DecoderSubLayer
- **Role:** A single processing unit inside the decoder that handles both self-attention (within the target sequence) and cross-attention (with the encoder's output).
- **How:** First, it applies self-attention on the target sequence, then applies cross-attention to combine information from the encoder's output, followed by a feed-forward network. Each step has a residual connection and layer normalization.

### 7. Decoder
- **Role:** Stacks multiple DecoderSubLayers to generate the final output sequence (such as a translation) based on the encoded input.
- **How:** The decoder receives the encoder's output and the current target sequence, and applies each DecoderSubLayer in turn to produce a refined representation of the target sequence.
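
The sinusoidal scheme described under PositionalEmbedding can be checked by hand on a very small table. The sketch below is a plain-Python illustration, not the notebook's implementation: the helper name `positional_encoding` and the tiny sizes are assumptions chosen for readability. It uses the same formula as the class, with the sine on even dimensions and the cosine on odd dimensions.

```python
import math


def positional_encoding(max_seq_len, d_model):
    """Return a max_seq_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_seq_len)]
    for pos in range(max_seq_len):
        for i in range(0, d_model, 2):
            # The divisor grows geometrically with the dimension index,
            # so higher dimensions oscillate more slowly.
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)          # even dimension
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)  # odd dimension
    return table


# Hypothetical tiny configuration, only for inspection
pe = positional_encoding(max_seq_len=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(value, 3) for value in row])
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1; later positions differ in every dimension, which is what lets the model tell them apart.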

In [None]:
### 1. PositionalEmbedding

class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_seq_len=MAX_SEQ_LEN):
        """
        Initializes a matrix of sinusoidal and cosinusoidal positional embeddings.
        These embeddings encode positional information, enabling the model to differentiate
        between token positions in a sequence.

        Args:
            d_model (int): The dimension of the model's embeddings.
            max_seq_len (int): The maximum sequence length for which positional encodings
                               are computed.
        """
        super().__init__()

        # Create a matrix to store the positional embeddings for each token position,
        # of shape (max_seq_len, d_model)
        self.pos_embed_matrix = torch.zeros(max_seq_len, d_model, device=device)

        # Generate positions as a tensor, with each position represented as an integer from 0
        # to max_seq_len - 1. Reshape to (max_seq_len, 1) to prepare for broadcasting.
        token_pos = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

        # Calculate a scaling factor for each dimension in d_model, using the formula
        # exp(-log(10000) * (2i / d_model)), which scales down higher dimensions.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # Apply sinusoidal and cosinusoidal transformations for even and odd indices:
        # sin for dimensions 0, 2, 4, ... (even indices)
        self.pos_embed_matrix[:, 0::2] = torch.sin(token_pos * div_term)
        # cos for dimensions 1, 3, 5, ... (odd indices)
        self.pos_embed_matrix[:, 1::2] = torch.cos(token_pos * div_term)

        # Add a leading batch dimension so the matrix has shape (1, max_seq_len, d_model).
        # The rest of the model is batch-first, so this broadcasts over the batch.
        self.pos_embed_matrix = self.pos_embed_matrix.unsqueeze(0)

    def forward(self, x):
        """
        Adds positional embeddings to the input embeddings.

        Args:
            x (Tensor): Input embeddings of shape [batch_size, seq_len, d_model].

        Returns:
            Tensor: The input embeddings with positional encodings added, of the same shape
                    [batch_size, seq_len, d_model].
        """
        # Add the positional embedding matrix to the input embeddings.
        # Slice along the sequence dimension (dim 1) so that only the first 'seq_len'
        # positions are used; slicing by x.size(0) would index by batch size instead.
        return x + self.pos_embed_matrix[:, :x.size(1), :]


### 2. MultiHeadAttention

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        """
        Initializes multi-head attention by setting up the query, key, value,
        and output projections across multiple attention heads.

        Args:
            d_model (int): The dimensionality of the input embeddings.
            num_heads (int): The number of attention heads.
        """
        super().__init__()

        # Ensure d_model is divisible by num_heads
        assert d_model % num_heads == 0, 'Embedding size must be divisible by the number of heads'

        # Define the dimension per head for query/key/value projections
        self.d_v = d_model // num_heads  # Dimensionality for values
        self.d_k = self.d_v              # Dimensionality for keys (usually equal to d_v)
        self.num_heads = num_heads       # Total number of attention heads

        # Linear transformations for query, key, and value matrices
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

        # Linear transformation for the concatenated output from all heads
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        """
        Applies multi-head attention to the input query, key, and value matrices.

        Args:
            Q (Tensor): Query matrix of shape [batch_size, seq_len, d_model].
            K (Tensor): Key matrix of shape [batch_size, seq_len, d_model].
            V (Tensor): Value matrix of shape [batch_size, seq_len, d_model].
            mask (Tensor, optional): Optional mask to control attention focus. Default: None

        Returns:
            Tensor: Output from multi-head attention, of shape [batch_size, seq_len, d_model].
            Tensor: Attention weights across heads, of shape [batch_size, num_heads, seq_len, seq_len].
        """
        batch_size = Q.size(0)

        # Project Q, K, V to shape [batch_size, num_heads, seq_len, d_k]
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Apply scaled dot-product attention to each head
        weighted_values, attention = self.scale_dot_product(Q, K, V, mask)

        # Concatenate all heads back into a single matrix and project through W_o
        weighted_values = weighted_values.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(weighted_values), attention

    def scale_dot_product(self, Q, K, V, mask=None):
        """
        Computes the scaled dot-product attention scores and applies a softmax to get attention weights.

        Args:
            Q (Tensor): Query tensor of shape [batch_size, num_heads, seq_len, d_k].
            K (Tensor): Key tensor of shape [batch_size, num_heads, seq_len, d_k].
            V (Tensor): Value tensor of shape [batch_size, num_heads, seq_len, d_k].
            mask (Tensor, optional): Optional mask for attention, of shape [batch_size, num_heads, seq_len, seq_len].

        Returns:
            Tensor: Output of shape [batch_size, num_heads, seq_len, d_k] after attention is applied.
            Tensor: Attention weights of shape [batch_size, num_heads, seq_len, seq_len].
        """
        # Calculate dot products between Q and K, then scale by sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # If a mask is provided, fill masked positions (mask == 0) with a large negative
        # value (-1e9) so that softmax assigns them a weight of approximately zero
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Compute attention weights by applying softmax to scores
        attention = F.softmax(scores, dim=-1)

        # Weighted sum of the values based on attention weights
        weighted_values = torch.matmul(attention, V)

        return weighted_values, attention


### 3. PositionFeedForward

class PositionFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        """
        Initializes a feed-forward network with two linear layers.

        Args:
            d_model (int): Input and output embedding dimension.
            d_ff (int): Dimension of the hidden layer.
        """
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Apply feed-forward transformation with ReLU activation
        return self.linear2(F.relu(self.linear1(x)))

### 4. EncoderSubLayer
class EncoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        Initializes a single encoder sub-layer that includes a multi-head self-attention mechanism
        and a feed-forward neural network.

        Args:
            d_model (int): Dimensionality of the model.
            num_heads (int): Number of attention heads in the self-attention mechanism.
            d_ff (int): Dimension of the feed-forward network.
            dropout (float): Dropout probability for regularization.
        """
        super().__init__()

        # Multi-head self-attention layer
        self.self_attn = MultiHeadAttention(d_model, num_heads)

        # Position-wise feed-forward network
        self.ffn = PositionFeedForward(d_model, d_ff)

        # Layer normalization applied after each sub-layer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # Dropout layers to prevent overfitting
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Forward pass through the encoder sub-layer.

        Args:
            x (Tensor): Input tensor of shape [batch_size, seq_len, d_model].
            mask (Tensor, optional): Mask for attention weights. Default: None

        Returns:
            Tensor: Output tensor of shape [batch_size, seq_len, d_model].
        """
        # Apply multi-head self-attention mechanism
        attention_score, _ = self.self_attn(x, x, x, mask)

        # Add and normalize: add attention output to input (residual connection)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)

        # Apply feed-forward network
        ffn_output = self.ffn(x)

        # Add and normalize: add feed-forward output to previous output
        x = x + self.dropout2(ffn_output)
        return self.norm2(x)


### 5. Encoder
class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        """
        Initializes the encoder module, which consists of multiple encoder sub-layers.

        Args:
            d_model (int): Dimensionality of the model.
            num_heads (int): Number of attention heads in each sub-layer.
            d_ff (int): Dimension of the feed-forward network in each sub-layer.
            num_layers (int): Number of encoder sub-layers in the encoder.
            dropout (float): Dropout probability for regularization.
        """
        super().__init__()

        # Stack of encoder sub-layers
        self.layers = nn.ModuleList(
            [EncoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )

        # Final layer normalization after the last encoder layer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        """
        Forward pass through the encoder stack.

        Args:
            x (Tensor): Input tensor of shape [batch_size, seq_len, d_model].
            mask (Tensor, optional): Mask to control attention. Default: None

        Returns:
            Tensor: Encoded output of shape [batch_size, seq_len, d_model].
        """
        # Pass input through each encoder sub-layer
        for layer in self.layers:
            x = layer(x, mask)

        # Apply normalization after the final encoder sub-layer
        return self.norm(x)

### 6. DecoderSubLayer
class DecoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        Initializes a Decoder sub-layer with self-attention, cross-attention, and
        feed-forward components. Each of these is followed by layer normalization and dropout.

        Args:
            d_model (int): The dimension of input embeddings and the hidden layer size.
            num_heads (int): Number of attention heads in multi-head attention layers.
            d_ff (int): Dimensionality of the intermediate feed-forward layer.
            dropout (float): Dropout rate for regularization.
        """
        super().__init__()

        # Self-attention layer for attending to the target sequence itself
        self.self_attn = MultiHeadAttention(d_model, num_heads)

        # Cross-attention layer for attending to the encoder's output (source sequence)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)

        # Feed-forward network for additional non-linear transformations
        self.feed_forward = PositionFeedForward(d_model, d_ff)

        # Layer normalizations for stabilizing training and normalizing intermediate representations
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

        # Dropout layers for regularization after each sub-layer
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, encoder_output, target_mask=None, encoder_mask=None):
        """
        Forward pass for the Decoder sub-layer.

        Args:
            x (Tensor): Input tensor representing the current state of the target sequence,
                        of shape [batch_size, target_seq_len, d_model].
            encoder_output (Tensor): Output tensor from the encoder, representing the
                                     encoded source sequence, of shape [batch_size, src_seq_len, d_model].
            target_mask (Tensor): Optional mask tensor for self-attention, used to mask out
                                  future positions for autoregressive training.
            encoder_mask (Tensor): Optional mask tensor for cross-attention, used to mask
                                   out padding in the source sequence.

        Returns:
            Tensor: The output tensor after applying self-attention, cross-attention,
                    and feed-forward operations, with layer normalization and dropout.
        """

        # --- Self-Attention Block ---
        # Apply self-attention over the target sequence, allowing each position to attend
        # to previous positions up to itself. The optional target_mask ensures that attention
        # is only paid to past and present tokens.
        attention_score, _ = self.self_attn(x, x, x, target_mask)

        # Add residual connection and apply dropout and normalization
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)

        # --- Cross-Attention Block ---
        # Apply cross-attention to allow the decoder to attend to the encoder's output.
        # Here, the decoder uses the encoder's representations as keys and values, allowing
        # each position in the target sequence to attend to the source sequence.
        encoder_attn, _ = self.cross_attn(x, encoder_output, encoder_output, encoder_mask)

        # Add residual connection and apply dropout and normalization
        x = x + self.dropout2(encoder_attn)
        x = self.norm2(x)

        # --- Feed-Forward Network Block ---
        # Pass the output through a feed-forward network, adding non-linearity and additional
        # capacity to the model for complex transformations.
        ff_output = self.feed_forward(x)

        # Add residual connection, dropout, and final normalization
        x = x + self.dropout3(ff_output)
        return self.norm3(x)


### 7. Decoder
class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        """
        Initializes a Decoder, which consists of a series of DecoderSubLayer layers and a
        final layer normalization.

        Args:
            d_model (int): Dimension of the input embeddings and the hidden layer size.
            num_heads (int): Number of attention heads in each multi-head attention layer.
            d_ff (int): Dimensionality of the intermediate feed-forward network.
            num_layers (int): Number of DecoderSubLayer layers in the Decoder.
            dropout (float): Dropout rate for each sublayer in the Decoder.
        """
        super().__init__()

        # Stack of DecoderSubLayer instances, each with its own multi-head attention, cross-attention,
        # and feed-forward networks, dropout, and normalization.
        self.layers = nn.ModuleList([DecoderSubLayer(d_model, num_heads, d_ff, dropout)
                                     for _ in range(num_layers)])

        # Final layer normalization after all decoder sublayers
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, target_mask, encoder_mask):
        """
        Forward pass through the Decoder.

        Args:
            x (Tensor): Input tensor of shape [batch_size, target_seq_len, d_model] representing the
                        target sequence embeddings.
            encoder_output (Tensor): Output tensor from the encoder of shape
                                     [batch_size, src_seq_len, d_model].
            target_mask (Tensor): Mask for self-attention, which prevents attending to future positions.
            encoder_mask (Tensor): Mask for cross-attention, which prevents attending to padding
                                   tokens in the encoder output.

        Returns:
            Tensor: Output tensor of shape [batch_size, target_seq_len, d_model] after passing through
                    the entire stack of DecoderSubLayer layers and the final normalization.
        """

        # Pass through each layer in the Decoder stack
        for layer in self.layers:
            # Each DecoderSubLayer layer processes the input 'x' and the encoder output,
            # applying self-attention, cross-attention, and a feed-forward layer
            x = layer(x, encoder_output, target_mask, encoder_mask)

        # Apply final layer normalization to stabilize the output
        return self.norm(x)
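
To make the arithmetic inside `MultiHeadAttention.scale_dot_product` easier to follow, here is a minimal single-head sketch on plain Python lists. It is an illustration under simplifying assumptions (no batch dimension, no learned projections, hypothetical helper names), not a replacement for the PyTorch code above.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head attention on lists: Q is [n_q][d_k], K is [n_k][d_k], V is [n_k][d_v]."""
    d_k = len(K[0])
    all_weights, outputs = [], []
    for qi, q in enumerate(Q):
        # Dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            # Masked positions get a large negative score, as in the class above
            scores = [s if mask[qi][kj] else -1e9 for kj, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        output = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        all_weights.append(weights)
        outputs.append(output)
    return outputs, all_weights


# Toy inputs: two positions, d_k = d_v = 2, with a causal mask
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
causal_mask = [[1, 0], [1, 1]]

outputs, weights = scaled_dot_product_attention(Q, K, V, causal_mask)
print(weights)
print(outputs)
```

With the causal mask, the first position can only attend to itself, so its output is exactly the first value vector; the second position mixes both value vectors.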

In [None]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers,
                 input_vocab_size, target_vocab_size,
                 max_len=MAX_SEQ_LEN, dropout=0.1):
        super().__init__()
        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)
        self.pos_embedding = PositionalEmbedding(d_model, max_len)
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.output_layer = nn.Linear(d_model, target_vocab_size)

    def forward(self, source, target):
        # Build the source padding mask and the target (padding + causal) mask
        source_mask, target_mask = self.mask(source, target)
        # Embedding and positional Encoding
        source = self.encoder_embedding(source) * math.sqrt(self.encoder_embedding.embedding_dim)
        source = self.pos_embedding(source)
        # Encoder
        encoder_output = self.encoder(source, source_mask)

        # Decoder embedding and positional encoding
        target = self.decoder_embedding(target) * math.sqrt(self.decoder_embedding.embedding_dim)
        target = self.pos_embedding(target)
        # Decoder
        output = self.decoder(target, encoder_output, target_mask, source_mask)

        return self.output_layer(output)



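    def mask(self, source, target):
        # Padding masks: True where the token is not <pad> (index 0).
        # Shape [batch_size, 1, 1, seq_len], so they broadcast over heads and query positions.
        source_mask = (source != 0).unsqueeze(1).unsqueeze(2)
        target_mask = (target != 0).unsqueeze(1).unsqueeze(2)
        size = target.size(1)
        # Causal (lower-triangular) mask: position i may only attend to positions <= i
        no_mask = torch.tril(torch.ones((1, size, size), device=target.device)).bool()
        # Combined target mask of shape [batch_size, 1, size, size]
        target_mask = target_mask & no_mask
        return source_mask, target_mask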
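
The target mask built in `mask` combines two conditions: a key position must not be padding, and it must not lie in the future of the query position. The following plain-Python sketch reproduces that logic for a single sequence; the function name `build_target_mask` and the token ids are hypothetical and only serve the illustration.

```python
PAD = 0  # padding index, as in the notebook's vocabulary


def build_target_mask(target_ids):
    """Return a size x size boolean matrix; entry [row][col] is True when
    query position `row` is allowed to attend to key position `col`."""
    size = len(target_ids)
    not_pad = [token != PAD for token in target_ids]
    return [
        [(col <= row) and not_pad[col] for col in range(size)]
        for row in range(size)
    ]


# Three real tokens followed by one padding token
mask = build_target_mask([5, 7, 9, 0])
for row in mask:
    print([int(allowed) for allowed in row])
```

The printed matrix is lower-triangular, and its last column is all zeros because that key position holds padding.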


#### Simple test

In [None]:
seq_len_source = 10
seq_len_target = 10
batch_size = 2
input_vocab_size = 50
target_vocab_size = 50

source = torch.randint(1, input_vocab_size, (batch_size, seq_len_source))
target = torch.randint(1, target_vocab_size, (batch_size, seq_len_target))

In [None]:
d_model = 512
num_heads = 8
d_ff = 2048
num_layers = 6

model = Transformer(d_model, num_heads, d_ff, num_layers,
                  input_vocab_size, target_vocab_size,
                  max_len=MAX_SEQ_LEN, dropout=0.1)

model = model.to(device)
source = source.to(device)
target = target.to(device)

In [None]:
output = model(source, target)

In [None]:
# Expected output shape -> [batch, seq_len_target, target_vocab_size] i.e. [2, 10, 50]
print(f'output.shape {output.shape}')

output.shape torch.Size([2, 10, 50])


### Translator Eng-Spa

In [None]:
PATH = '/content/sample_data/eng-spa4.txt'

In [None]:
with open(PATH, 'r', encoding='utf-8') as f:
    lines = f.readlines()
eng_spa_pairs = [line.strip().split('\t') for line in lines if '\t' in line]

In [None]:
eng_spa_pairs[:10]

[['Go.', 'Vaya.'],
 ['Go.', 'Ve.'],
 ['Hi.', '¡Hola!'],
 ['So?', '¿Y?'],
 ['Ok!', '¡OK!'],
 ['OK.', '¡Órale!'],
 ['Ah!', '¡Anda!'],
 ['Hi.', 'Hola.'],
 ['Go!', '¡Fuera!'],
 ['Go!', '¡Ya!']]

In [None]:
eng_sentences = [pair[0] for pair in eng_spa_pairs]
spa_sentences = [pair[1] for pair in eng_spa_pairs]

In [None]:
print(eng_sentences[:10])
print(spa_sentences[:10])


['Go.', 'Go.', 'Hi.', 'So?', 'Ok!', 'OK.', 'Ah!', 'Hi.', 'Go!', 'Go!']
['Vaya.', 'Ve.', '¡Hola!', '¿Y?', '¡OK!', '¡Órale!', '¡Anda!', 'Hola.', '¡Fuera!', '¡Ya!']


In [None]:
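def preprocess_sentence(sentence):
    # Lowercase and trim surrounding whitespace
    sentence = sentence.lower().strip()
    # Collapse runs of spaces and double quotes into a single space
    sentence = re.sub(r'[" "]+', " ", sentence)
    # Replace accented vowels with their unaccented form
    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)
    # Replace every run of characters outside a-z with a space. This removes digits
    # and punctuation, and also splits words at unmapped letters such as ñ or ü.
    sentence = re.sub(r"[^a-z]+", " ", sentence)
    sentence = sentence.strip()
    # Mark the start and end of the sentence for the decoder
    sentence = '<sos> ' + sentence + ' <eos>'
    return sentence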

In [None]:
s1 = '¿Hola @ cómo estás? 123'

In [None]:
print(s1)
print(preprocess_sentence(s1))

¿Hola @ cómo estás? 123
<sos> hola como estas <eos>
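
One limitation of `preprocess_sentence` is that letters it does not map explicitly, such as ñ or ü, are replaced by a space, so "niño" becomes "ni o". A possible alternative, shown here only as a sketch and not as part of the provided code, is to strip diacritics with the standard `unicodedata` module before filtering. The function names are hypothetical, and note that folding ñ to n loses a real distinction in Spanish, so this is a trade-off rather than a strict improvement.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')


def preprocess_sentence_nfd(sentence):
    """Variant of preprocess_sentence that keeps words with ñ or ü in one piece."""
    sentence = strip_accents(sentence.lower().strip())
    # Replace everything outside a-z with a single space, then trim
    sentence = re.sub(r"[^a-z]+", " ", sentence).strip()
    return '<sos> ' + sentence + ' <eos>'


print(preprocess_sentence_nfd('¿Dónde está el niño?'))
```

Switching to this variant would change the vocabulary, so the model would need to be retrained on sentences preprocessed the same way.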


In [None]:
eng_sentences = [preprocess_sentence(sentence) for sentence in eng_sentences]
spa_sentences = [preprocess_sentence(sentence) for sentence in spa_sentences]

In [None]:
spa_sentences[:10]

['<sos> vaya <eos>',
 '<sos> ve <eos>',
 '<sos> hola <eos>',
 '<sos> y <eos>',
 '<sos> ok <eos>',
 '<sos> orale <eos>',
 '<sos> anda <eos>',
 '<sos> hola <eos>',
 '<sos> fuera <eos>',
 '<sos> ya <eos>']

In [None]:
def build_vocab(sentences):
    words = [word for sentence in sentences for word in sentence.split()]
    word_count = Counter(words)
    sorted_word_counts = sorted(word_count.items(), key=lambda x:x[1], reverse=True)
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word
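
As a quick check of how `build_vocab` assigns indices, the sketch below repeats the function on two toy sentences (made up for this example) and encodes a sentence that contains an unseen word. Indices 0 and 1 are reserved for `<pad>` and `<unk>`, and the remaining words are numbered from 2 in order of decreasing frequency.

```python
from collections import Counter


def build_vocab(sentences):
    # Same logic as the notebook's build_vocab, repeated so this example runs on its own
    words = [word for sentence in sentences for word in sentence.split()]
    word_count = Counter(words)
    sorted_word_counts = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word


# Toy corpus, only for illustration
toy_sentences = ['<sos> hola <eos>', '<sos> hola mundo <eos>']
word2idx, idx2word = build_vocab(toy_sentences)
print(word2idx)

# 'adios' is not in the vocabulary, so it maps to <unk> (index 1)
encoded = [word2idx.get(word, word2idx['<unk>']) for word in '<sos> adios mundo <eos>'.split()]
print(encoded)
```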

In [None]:
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
spa_word2idx, spa_idx2word = build_vocab(spa_sentences)
eng_vocab_size = len(eng_word2idx)
spa_vocab_size = len(spa_word2idx)

In [None]:
print(eng_vocab_size, spa_vocab_size)

26559 45162


In [None]:
class EngSpaDataset(Dataset):
    def __init__(self, eng_sentences, spa_sentences, eng_word2idx, spa_word2idx):
        self.eng_sentences = eng_sentences
        self.spa_sentences = spa_sentences
        self.eng_word2idx = eng_word2idx
        self.spa_word2idx = spa_word2idx

    def __len__(self):
        return len(self.eng_sentences)

    def __getitem__(self, idx):
        eng_sentence = self.eng_sentences[idx]
        spa_sentence = self.spa_sentences[idx]
        # return tokens idxs
        eng_idxs = [self.eng_word2idx.get(word, self.eng_word2idx['<unk>']) for word in eng_sentence.split()]
        spa_idxs = [self.spa_word2idx.get(word, self.spa_word2idx['<unk>']) for word in spa_sentence.split()]

        return torch.tensor(eng_idxs), torch.tensor(spa_idxs)

In [None]:
def collate_fn(batch):
    eng_batch, spa_batch = zip(*batch)
    eng_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in eng_batch]
    spa_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in spa_batch]
    eng_batch = torch.nn.utils.rnn.pad_sequence(eng_batch, batch_first=True, padding_value=0)
    spa_batch = torch.nn.utils.rnn.pad_sequence(spa_batch, batch_first=True, padding_value=0)
    return eng_batch, spa_batch
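
The effect of `collate_fn` is easier to see on plain lists. This sketch mirrors its two steps, truncating each sequence and then right-padding to the longest sequence in the batch. It uses a hypothetical helper `pad_batch` and made-up token ids rather than `pad_sequence`.

```python
def pad_batch(sequences, max_len, pad_value=0):
    """Truncate each sequence to max_len, then right-pad all to the same length."""
    truncated = [seq[:max_len] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    return [seq + [pad_value] * (longest - len(seq)) for seq in truncated]


# Made-up token ids; the second sequence is longer than max_len and gets cut
batch = [[2, 7, 4], [2, 9, 8, 5, 4], [2, 4]]
padded = pad_batch(batch, max_len=4)
for row in padded:
    print(row)
```

Note that truncation can cut off the final `<eos>` token of a long sentence, as happens to the second sequence here.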


In [None]:
def train(model, dataloader, loss_function, optimiser, epochs):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for i, (eng_batch, spa_batch) in enumerate(dataloader):
            eng_batch = eng_batch.to(device)
            spa_batch = spa_batch.to(device)
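            # Teacher forcing: the decoder input drops the last token and the expected
            # output drops the first one (<sos>), so position t predicts token t + 1
            target_input = spa_batch[:, :-1]
            target_output = spa_batch[:, 1:].contiguous().view(-1)
            # Zero grads
            optimiser.zero_grad()
            # Run the model and flatten the logits to [batch_size * seq_len, vocab_size]
            output = model(eng_batch, target_input)
            output = output.view(-1, output.size(-1))
            # Loss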
            loss = loss_function(output, target_output)
            # gradient and update parameters
            loss.backward()
            optimiser.step()
            total_loss += loss.item()

        avg_loss = total_loss/len(dataloader)
        print(f'Epoch: {epoch}/{epochs}, Loss: {avg_loss:.4f}')
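
The slicing in `train` (`spa_batch[:, :-1]` and `spa_batch[:, 1:]`) shifts the target by one position so that each decoder input token is paired with the token it should predict. A small sketch with made-up ids for a single padded sentence shows the pairing, and which pairs contribute to the loss when `ignore_index=0` is used:

```python
PAD = 0

# Made-up ids: <sos>=2, two words, <eos>=4, then padding
spa_sequence = [2, 15, 8, 4, 0, 0]

target_input = spa_sequence[:-1]   # what the decoder sees
target_output = spa_sequence[1:]   # what it should predict at each position

# Pairs whose expected token is padding are skipped by the loss
counted = [(inp, out) for inp, out in zip(target_input, target_output) if out != PAD]

print(target_input)
print(target_output)
print(counted)
```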



In [None]:
BATCH_SIZE = 64
dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)

In [None]:
model = Transformer(d_model=512, num_heads=8, d_ff=2048, num_layers=6,
                    input_vocab_size=eng_vocab_size, target_vocab_size=spa_vocab_size,
                    max_len=MAX_SEQ_LEN, dropout=0.1)

In [None]:
model = model.to(device)
loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimiser = optim.Adam(model.parameters(), lr=0.0001)


In [None]:
train(model, dataloader, loss_function, optimiser, epochs = 10)

Epoch: 0/10, Loss: 3.6251
Epoch: 1/10, Loss: 2.2215
Epoch: 2/10, Loss: 1.7152
Epoch: 3/10, Loss: 1.3823
Epoch: 4/10, Loss: 1.1274
Epoch: 5/10, Loss: 0.9219
Epoch: 6/10, Loss: 0.7532
Epoch: 7/10, Loss: 0.6230
Epoch: 8/10, Loss: 0.5263
Epoch: 9/10, Loss: 0.4569


In [None]:
def sentence_to_indices(sentence, word2idx):
    return [word2idx.get(word, word2idx['<unk>']) for word in sentence.split()]

def indices_to_sentence(indices, idx2word):
    return ' '.join([idx2word[idx] for idx in indices if idx in idx2word and idx2word[idx] != '<pad>'])

def translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    # Greedy decoding: feed the tokens generated so far and keep the most likely next token.
    # Note: the <sos> and <eos> ids are read from the global spa_word2idx.
    model.eval()
    sentence = preprocess_sentence(sentence)
    input_indices = sentence_to_indices(sentence, eng_word2idx)
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)

    # Initialize the target sequence with the <sos> token
    tgt_indices = [spa_word2idx['<sos>']]
    tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

    with torch.no_grad():
        for _ in range(max_len):
            # One score vector per target position generated so far
            output = model(input_tensor, tgt_tensor)
            output = output.squeeze(0)
            # Most likely token at the last position
            next_token = output.argmax(dim=-1)[-1].item()
            tgt_indices.append(next_token)
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)
            # Stop as soon as the model emits <eos>
            if next_token == spa_word2idx['<eos>']:
                break

    # The returned string still contains the <sos> and <eos> markers
    return indices_to_sentence(tgt_indices, spa_idx2word)
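
Because `indices_to_sentence` only filters `<pad>`, the string returned by `translate_sentence` keeps the `<sos>` and `<eos>` markers. If cleaner output is wanted, a post-processing step along these lines could be applied. `strip_special_tokens` is a hypothetical helper, not part of the provided code.

```python
def strip_special_tokens(translation, specials=("<sos>", "<eos>", "<pad>")):
    """Remove marker tokens from a space-separated translation string.

    Hypothetical helper for display purposes; it does not change decoding.
    """
    return " ".join(tok for tok in translation.split() if tok not in specials)


print(strip_special_tokens("<sos> buenas noches <eos>"))
```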

In [None]:
def evaluate_translations(model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    for sentence in sentences:
        translation = translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len, device)
        print(f'Input sentence: {sentence}')
        print(f'Traducción: {translation}')
        print()

# Example sentences to test the translator
test_sentences = [
    "Hello, how are you?",
    "I am learning artificial intelligence.",
    "Artificial intelligence is great.",
    "Good night!",
    "you have a nice day",
    "evaluating the accuracy of the translation",
    "How good are you at translating?",
    "This is a sentence that will check if the translator is capable of handling long sentences",
    "This is a beautiful day",
    " The advanced machine learning course has been a great experience",
    " I love the rainy days ",
    "She loves books",
    "Although it was raining, he went for a walk",
    "It's raining cats and dogs",
    "The kids are playing soccer"
]

# Assuming the model is trained and loaded
# Set the device to 'cpu' or 'cuda' as needed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Evaluate translations
evaluate_translations(model, test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)


Input sentence: Hello, how are you?
Traducción: <sos> hola que tal <eos>

Input sentence: I am learning artificial intelligence.
Traducción: <sos> estoy aprendiendo inteligencia artificial <eos>

Input sentence: Artificial intelligence is great.
Traducción: <sos> la inteligencia artificial es genial <eos>

Input sentence: Good night!
Traducción: <sos> buenas noches <eos>

Input sentence: you have a nice day
Traducción: <sos> que tengas un buen dia <eos>

Input sentence: evaluating the accuracy of the translation
Traducción: <sos> la traduccion del caso es la correcta <eos>

Input sentence: How good are you at translating?
Traducción: <sos> como te va traduciendo bien <eos>

Input sentence: This is a sentence that will check if the translator is capable of handling long sentences
Traducción: <sos> la oracion es un traductor puede revisar si la oracion de un traductor <eos>

Input sentence: This is a beautiful day
Traducción: <sos> este es un bello dia <eos>

Input sentence:  The advance



### Model's Limitations (limitations in translation capacity)

**Sequence Length Constraints:** The PositionalEmbedding is built for a fixed maximum sequence length (`MAX_SEQ_LEN`), so text longer than that limit cannot be encoded in full. Translation quality also degrades on long, complex sentences, as the long test sentence in the evaluation above shows.
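
To make the fixed-length constraint concrete, here is a minimal sketch of a precomputed position table. It assumes the sinusoidal scheme from the original Transformer paper; the notebook's PositionalEmbedding may differ in detail, and the sizes are small hypothetical values.

```python
import math
import torch

def sinusoidal_table(max_len, d_model):
    """Build a (max_len, d_model) table of fixed sinusoidal position encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_table(max_len=20, d_model=8)

short = torch.zeros(1, 12, 8)              # 12 token embeddings fit in the table
print((short + pe[:short.size(1)]).shape)

# Asking for 30 positions only returns the 20 rows that exist.
print(pe[:30].shape)
```

A sequence longer than the table has no encoding for its trailing positions, so it has to be truncated or the table rebuilt with a larger `max_len`.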

**Attention Head Limitations:** MultiHeadAttention lets the model attend to different parts of a sequence, but it may still fail to capture long-range dependencies, which are needed to translate nuanced or complex sentence structures accurately.

**Simple Feed-Forward Layers:** The PositionFeedForward layers use a basic two-layer structure, which may not model the complex, non-linear relationships needed for high-fidelity translations.
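
For reference, a two-layer position-wise block of the kind described can be sketched as follows. The ReLU activation and the class name `FeedForwardSketch` are assumptions; only `d_model=512` and `d_ff=2048` come from the model configuration in this notebook.

```python
import torch
import torch.nn as nn

class FeedForwardSketch(nn.Module):
    """Two linear layers with a ReLU in between, applied to each position independently."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand
        self.linear2 = nn.Linear(d_ff, d_model)   # project back

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

ffn = FeedForwardSketch()
x = torch.randn(2, 10, 512)                       # (batch, seq_len, d_model)
out = ffn(x)
n_params = sum(p.numel() for p in ffn.parameters())
print(out.shape, n_params)
```

With these sizes a single block holds about 2.1 million parameters, and one such block sits in every encoder and decoder layer.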

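**Masking in Decoding:** The masks in the DecoderSubLayer hide future target positions, so each token is generated from the source sentence and the tokens already produced. Because `translate_sentence` picks the most likely token at each step (greedy argmax), an early mistake cannot be revised later, which can hurt translation accuracy.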
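
A look-ahead (causal) mask of this kind can be illustrated with a lower-triangular matrix. This is a generic sketch with dummy scores, not the exact mask construction used in the notebook's DecoderSubLayer.

```python
import torch

seq_len = 4
# True where attention is allowed: position i may look at positions 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.zeros(seq_len, seq_len)                     # dummy attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))   # hide future positions
weights = torch.softmax(scores, dim=-1)

print(weights)
```

The first row puts all of its weight on position 0, while the last row spreads it evenly over all four positions, since only the last position is allowed to see the whole prefix.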

**Effect of LayerNorm and Dropout:** These stabilize training, but normalization and randomly dropped activations may also attenuate fine-grained information, possibly affecting detailed or precise translations.
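
The two layers behave differently at training and inference time, which the following sketch shows with standard PyTorch modules. The tensor sizes are arbitrary; `p=0.1` matches the `dropout=0.1` used when the model is created.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.1)
norm = nn.LayerNorm(8)
x = torch.randn(2, 8)

drop.train()
train_out = drop(x)     # some entries zeroed, the rest scaled by 1 / (1 - p)
drop.eval()
eval_out = drop(x)      # dropout is the identity in eval mode

normed = norm(x)        # each row rescaled to zero mean and unit variance
print(torch.equal(eval_out, x))
print(normed.mean(dim=-1))
```

Since `translate_sentence` calls `model.eval()`, dropout only affects training; LayerNorm is applied in both modes.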

**Limited Model Depth:** Depth is fixed by num_layers (6 in this notebook). With few layers the model may not capture intricate language structures, especially in complex texts, which limits its expressive power.

**Lack of Adaptive Context Mechanisms:** The model translates each sentence in isolation, with no context-adaptive mechanism, so it may struggle to handle cultural nuances or varied language styles.