## TC 5033 - Deep Learning - Transformers

#### Activity 3: Implementing a Translator

- Objective

To understand the Transformer Architecture by Implementing a translator.

- Instructions

    This activity requires submission in teams. While teamwork is encouraged, each member is expected to contribute individually to the assignment. The final submission should feature the best arguments and solutions from each team member. Only one person per team needs to submit the completed work, but it is imperative that the names of all team members are listed in a Markdown cell at the very beginning of the notebook (either the first or second cell). Failure to include all team member names will result in the grade being awarded solely to the individual who submitted the assignment, with zero points given to other team members (no exceptions will be made to this rule).

    Follow the provided code. The code already implements a transformer from scratch as explained in [this video](https://youtu.be/XefFj4rLHgU)

    Since the provided code already implements a simple translator, your job for this assignment is to understand it fully, and document it using pictures, figures, and markdown cells.  You should test your translator with at least 10 sentences. The dataset used for this task was obtained from [Tatoeba, a large dataset of sentences and translations](https://tatoeba.org/en/downloads).
  
- Evaluation Criteria

    - Code Readability and Comments
    - Traning a translator
    - Translating at least 10 sentences.

- Submission

Submit this Jupyter Notebook in canvas with your complete solution, ensuring your code is well-commented and includes Markdown cells that explain your design choices, results, and any challenges you encountered.

## Translation Using Transformers

### Team : NoName
#### Team Members:
* A01661090 Juan Pablo Cabrera Quiroga
* A01709522 Arturo Cristián Díaz López
* A01704076 Adrián Galván Díaz
* A01368818 Joel Sánchez Olvera
* A01708634 Carlos Eduardo Velasco 

### Abstract
This project focuses on implementing a machine translation system using the Transformer architecture, a state-of-the-art model in natural language processing. The model is trained to translate English sentences into Spanish using a dataset derived from Tatoeba. By leveraging the core components of the Transformer, including multi-head attention and positional embeddings, this implementation aims to explore the inner workings of the architecture. The project also emphasizes code readability, documentation, and practical evaluation by testing translations on sample sentences. 

### Dataset
The dataset for this project was sourced from [Tatoeba](https://tatoeba.org/en/downloads), focusing on English and Spanish sentence pairs. The data was originally in TSV format, which we converted to a more convenient TXT format using a custom converter, allowing for seamless integration with our model. This conversion step ensures compatibility and ease of use, but it is optional and only required if working with the original TSV files.
  

### Objectives 
* Analyzing and understanding the inner workings of the Transformer model
* Building a translator capable of converting English sentences into Spanish.
* Creating detailed documentation using markdown cells, comments and visualizations.
* Evaluating translation quality by using the necessary metrics.

## Data Preparation
We explore the dataset directory to verify the presence of necessary files, such as eng-spa4.txt, which contains the English-Spanish sentence pairs needed for translation.

In [1]:
import os

# Route to the dataset run on Kaggle
dataset_path= "/kaggle/input/eng-spa4"

os.listdir(dataset_path)

['eng-spa4.txt']

## Data Conversion -- Converting TSV to TXT
**ONLY RUN IF NEEDED**


This code block reads a TSV (tab-separated values) file containing English-Spanish sentence pairs, selects the relevant columns, and saves the data as a tab-separated TXT file.

In [None]:
#SCRIPT USADO POR EL EQUIPO

import pandas as pd

input_file = dataset_path + '/eng-spa.tsv'  # Archivo de Entrada (.tsv)
output_file = dataset_path + '/eng-spa4.txt'  # Archivo de Salida (.txt)

# Lee el TSV con las columnas necesarias
data = pd.read_csv(input_file, sep='\t', usecols=[1, 3], header=None, names=["English", "Spanish"])

# Guarda el TXT
data.to_csv(output_file, sep='\t', index=False, header=False, encoding='utf-8')

print(f"Archivo procesado como {output_file}")

## Transformer - Attention is all you need
This block sets up the Transformer architecture, inspired by the "Attention Is All You Need" paper. It includes essential components like multi-head attention, positional encoding, and encoder-decoder blocks to enable sequence-to-sequence tasks such as language translation.

First, we import the neccesary libraries for the project:

In [2]:
# Import required libraries for the Transformer model
import torch  
import torch.nn as nn  
import torch.nn.functional as F  
import torch.optim as optim  
from torch.utils.data import Dataset, DataLoader  
from collections import Counter  # For vocabulary building
import math  
import numpy as np  
import re  # Regular expressions for text preprocessing
from tqdm import tqdm  # Progress bar for loops

# Set the random seed for reproducibility
torch.manual_seed(23)

<torch._C.Generator at 0x79b3523886b0>

In [3]:
# Configure device: CUDA for GPU, otherwise CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


Max Sequence length

In [4]:
#Constant for the model
MAX_SEQ_LEN = 128 #Maximum Sequence length

### Positional Embedding
Positional embedding module designed to add positional information to the input token embeddings, enabling the Transformer to understand the sequential order of tokens. Recomputes a positional embedding matrix based on sine and cosine functions. This matrix encodes positional information for each token in the sequence.


**Inputs:**
* d_model: Dimensionality of token embeddings.
* max_seq_len: Maximum sequence length.

**Output:**
* Initializes the pos_embed_matrix attribute to store positional encodings.


In [5]:

class PositionalEmbedding(nn.Module):
    
    #Initializes the PositionalEmbedding module.
    def __init__(self, d_model, max_seq_len = MAX_SEQ_LEN):
        """    
        Args:
        * d_model (int): The size of each token's embedding vector.
        * max_seq_len (int): The maximum sequence length the model can handle.
        """
        super().__init__()
        self.pos_embed_matrix = torch.zeros(max_seq_len, d_model, device=device)
        token_pos = torch.arange(0, max_seq_len, dtype = torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0)/d_model))
        self.pos_embed_matrix[:, 0::2] = torch.sin(token_pos * div_term)
        self.pos_embed_matrix[:, 1::2] = torch.cos(token_pos * div_term)
        self.pos_embed_matrix = self.pos_embed_matrix.unsqueeze(0).transpose(0,1)

    #Adds positional embeddings to the input tensor.
    def forward(self, x):
        """
        Args:
        * x (torch.Tensor): Input tensor of shape [seq_len, batch_size, d_model].
        Returns:
        * torch.Tensor: The input tensor with positional embeddings added,
        maintaining the same shape.
        """
        # print(self.pos_embed_matrix.shape)
        # print(x.shape)
        return x + self.pos_embed_matrix[:x.size(0), :]



## MultiHeadAttention
The MultiHeadAttention module implements the multi-head attention mechanism, allowing the Transformer to focus on different parts of the sequence simultaneously. It performs scaled dot-product attention across multiple heads and combines the results to generate contextualized representations of the input.

**Inputs:**

* d_model: Dimensionality of token embeddings.
* num_heads: Number of parallel attention heads.

**Output:**

Initializes the attention weights (W_q, W_k, W_v, W_o) and splits embeddings across heads.


In [6]:
class MultiHeadAttention(nn.Module):
     # Initializes the MultiHeadAttention module.
    def __init__(self, d_model = 512, num_heads = 8):
        """
        Args:
        * d_model (int): The embedding size of the input.
        * num_heads (int): The number of attention heads.
        """
        super().__init__()
        assert d_model % num_heads == 0, 'Embedding size not compatible with num heads'

        self.d_v = d_model // num_heads
        self.d_k = self.d_v
        self.num_heads = num_heads

        # Define projection layers for queries, keys, and values
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    # Computes multi-head attention.
    def forward(self, Q, K, V, mask = None):
        """
        Args:
        * Q (torch.Tensor): Query tensor of shape [batch_size, seq_len, d_model].
        * K (torch.Tensor): Key tensor of shape [batch_size, seq_len, d_model].
        * V (torch.Tensor): Value tensor of shape [batch_size, seq_len, d_model].
        * mask (torch.Tensor, optional): Mask to prevent attention to certain positions.
        
        Returns:
        * torch.Tensor: Attention outputs of shape [batch_size, seq_len, d_model].
        * torch.Tensor: Attention weights.
        """
        batch_size = Q.size(0)
        
        # Project Q, K, and V into the respective subspaces and reshape
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )

        # Compute scaled dot-product attention
        weighted_values, attention = self.scale_dot_product(Q, K, V, mask)

        # Concatenate attention heads and apply the output transformation
        weighted_values = weighted_values.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads*self.d_k)
        weighted_values = self.W_o(weighted_values)

        return weighted_values, attention

    #Computes scaled dot-product attention.
    def scale_dot_product(self, Q, K, V, mask = None):
        """
        Args:
        * Q (torch.Tensor): Query tensor of shape [batch_size, num_heads, seq_len, d_k].
        * K (torch.Tensor): Key tensor of shape [batch_size, num_heads, seq_len, d_k].
        * V (torch.Tensor): Value tensor of shape [batch_size, num_heads, seq_len, d_k].
        * mask (torch.Tensor, optional): Mask to prevent attention to certain positions.

        Returns:
        * torch.Tensor: Weighted sum of value vectors, shape [batch_size, num_heads, seq_len, d_k].
        * torch.Tensor: Attention weights, shape [batch_size, num_heads, seq_len, seq_len].
        """
        #Compute attention Scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        #Apply mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        #Compute attention weights
        attention = F.softmax(scores, dim = -1)

        #Compute weighted sum
        weighted_values = torch.matmul(attention, V)

        return weighted_values, attention


## PositionFeedForward
The PositionFeedForward module applies a two-layer feed-forward network to each position in the sequence independently. It introduces non-linearity with a ReLU activation and projects embeddings back to their original dimensions.

**Inputs:**
* d_model (int): Dimensionality of token embeddings.
* d_ff (int): Size of the hidden feed-forward layer.

**Outputs:**
* Transformed tensor of shape [batch_size, seq_len, d_model].

In [44]:
#Position-wise feed-forward network applied independently to each token in the sequence.
class PositionFeedForward(nn.Module):
    """
    Attributes:
        linear1 (nn.Linear): First linear transformation (expansion).
        linear2 (nn.Linear): Second linear transformation (projection).
    """
    # Initializes the feed-forward network.
    def __init__(self, d_model, d_ff):
        """
        Args:
        * d_model (int): Size of the input embeddings.
        * d_ff (int): Size of the hidden feed-forward layer.
        """
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    #Applies the feed-forward transformation.    
    def forward(self, x):
        """
        Args:
        * x (torch.Tensor): Input tensor of shape [batch_size, seq_len, d_model].
        
        Returns:
        * torch.Tensor: Transformed tensor of the same shape.
        """
        return self.linear2(F.relu(self.linear1(x)))

## EncoderSubLayer
Functionality/Description:
The EncoderSubLayer combines self-attention, feed-forward networks, and residual connections with layer normalization. This forms the building block for the Transformer encoder.

**Inputs:**

* d_model (int): Dimensionality of token embeddings.
* num_heads (int): Number of attention heads.
* d_ff (int): Hidden layer size for the feed-forward network.
* dropout (float): Dropout rate for regularization.
  
**Outputs:**
* Transformed tensor of shape [batch_size, seq_len, d_model].

In [45]:
#Single sublayer of the Transformer encoder, combining self-attention,
#feed-forward networks, and layer normalization.
class EncoderSubLayer(nn.Module):
    """
    Attributes:
        self_attn (MultiHeadAttention): Multi-head self-attention mechanism.
        ffn (PositionFeedForward): Feed-forward network.
        norm1, norm2 (nn.LayerNorm): Layer normalization layers.
        dropout1, dropout2 (nn.Dropout): Dropout layers for regularization.
    """

    # Initializes the encoder sublayer.
    def __init__(self, d_model, num_heads, d_ff, dropout = 0.1):
        """
        Args:
        * d_model (int): Dimensionality of token embeddings.
        * num_heads (int): Number of attention heads.
        * d_ff (int): Hidden layer size for the feed-forward network.
        * dropout (float): Dropout rate for regularization.
        """
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.droupout1 = nn.Dropout(dropout)
        self.droupout2 = nn.Dropout(dropout)

    # Applies self-attention and feed-forward transformation.
    def forward(self, x, mask = None):
        """
        Args:
        * x (torch.Tensor): Input tensor of shape [batch_size, seq_len, d_model].
        * mask (torch.Tensor, optional): Attention mask.
        
        Returns:
        * torch.Tensor: Transformed tensor of the same shape.
        """
        attention_score, _ = self.self_attn(x, x, x, mask)
        x = x + self.droupout1(attention_score)
        x = self.norm1(x)
        x = x + self.droupout2(self.ffn(x))
        return self.norm2(x)



## Encoder
The Encoder module consists of a stack of encoder sublayers, each applying self-attention and feed-forward transformations. This module processes input embeddings and outputs contextualized token representations.

**Inputs:**
* d_model (int): Dimensionality of token embeddings.
* num_heads (int): Number of attention heads.
* d_ff (int): Hidden layer size for the feed-forward network.
* num_layers (int): Number of encoder sublayers in the stack.
* dropout (float): Dropout rate for regularization.

**Outputs:**
* Contextualized token embeddings of shape [batch_size, seq_len, d_model].

In [32]:
#Transformer encoder consisting of multiple stacked encoder sublayers.
class Encoder(nn.Module):
    """
    Attributes:
        layers (nn.ModuleList): List of `EncoderSubLayer` instances.
        norm (nn.LayerNorm): Layer normalization for the final output.
    """
    # Initializes the encoder.
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        """
        Args:
        * d_model (int): Dimensionality of token embeddings.
        * num_heads (int): Number of attention heads.
        * d_ff (int): Hidden layer size for the feed-forward network.
        * num_layers (int): Number of encoder sublayers.
        * dropout (float): Dropout rate for regularization.
        """
        super().__init__()
        self.layers = nn.ModuleList([EncoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    # Processes input embeddings through the encoder stack.
    def forward(self, x, mask=None):
        """
        Args:
        * x (torch.Tensor): Input tensor of shape [batch_size, seq_len, d_model].
        * mask (torch.Tensor, optional): Attention mask.
        
        Returns:
        * torch.Tensor: Contextualized token embeddings of the same shape.
        """
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)


## DecoderSubLayer
The DecoderSubLayer processes target embeddings by combining self-attention, encoder-decoder cross-attention, and a feed-forward network. It uses residual connections and normalization to stabilize training.

**Inputs:**

* d_model (int): Dimensionality of token embeddings.
* num_heads (int): Number of attention heads.
* d_ff (int): Hidden layer size for the feed-forward network.
* dropout (float): Dropout rate for regularization.

**Outputs:**
* Transformed tensor of shape [batch_size, seq_len, d_model].


In [33]:
#Single sublayer of the Transformer decoder, combining self-attention, 
# cross-attention, and feed-forward networks with normalization.

class DecoderSubLayer(nn.Module):
    """
    Attributes:
    * self_attn (MultiHeadAttention): Multi-head self-attention mechanism.
    * cross_attn (MultiHeadAttention): Encoder-decoder cross-attention mechanism.
    * feed_forward (PositionFeedForward): Feed-forward network.
    * norm1, norm2, norm3 (nn.LayerNorm): Layer normalization layers.
    * dropout1, dropout2, dropout3 (nn.Dropout): Dropout layers for regularization.
    """
    
    # Initializes the decoder sublayer.
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        Args:
        * d_model (int): Dimensionality of token embeddings.
        * num_heads (int): Number of attention heads.
        * d_ff (int): Hidden layer size for the feed-forward network.
        * dropout (float): Dropout rate for regularization.
        """
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    # Applies self-attention, cross-attention, and feed-forward transformation.
    def forward(self, x, encoder_output, target_mask=None, encoder_mask=None):
        """
        Args:
        * x (torch.Tensor): Target embeddings of shape [batch_size, seq_len, d_model].
        * encoder_output (torch.Tensor): Encoder output of shape [batch_size, seq_len, d_model].
        * target_mask (torch.Tensor, optional): Mask for the target sequence.
        * encoder_mask (torch.Tensor, optional): Mask for the encoder sequence.
        
        Returns:
        * torch.Tensor: Transformed tensor of the same shape as input.
        """
        # Self-attention with residual connection and normalization
        attention_score, _ = self.self_attn(x, x, x, target_mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)

        # Cross-attention with residual connection and normalization
        encoder_attn, _ = self.cross_attn(x, encoder_output, encoder_output, encoder_mask)
        x = x + self.dropout2(encoder_attn)
        x = self.norm2(x)

        # Feed-forward network with residual connection and normalization
        ff_output = self.feed_forward(x)
        x = x + self.dropout3(ff_output)
        return self.norm3(x)


## Decoder
The Decoder module consists of a stack of decoder sublayers, each applying self-attention, cross-attention, and a feed-forward transformation. It generates predictions for the target sequence.

**Inputs:**
* d_model (int): Dimensionality of token embeddings.
* num_heads (int): Number of attention heads.
* d_ff (int): Hidden layer size for the feed-forward network.
* num_layers (int): Number of decoder sublayers in the stack.
* dropout (float): Dropout rate for regularization.

**Outputs:**
* Predictions for the target sequence of shape [batch_size, seq_len, d_model].

In [46]:
#Transformer decoder consisting of multiple stacked decoder sublayers.
class Decoder(nn.Module):
    """
    Attributes:
    * layers (nn.ModuleList): List of `DecoderSubLayer` instances.
    * norm (nn.LayerNorm): Layer normalization for the final output.
    """
    
    #Initializes the decoder.
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        """
        Args:
        * d_model (int): Dimensionality of token embeddings.
        * num_heads (int): Number of attention heads.
        * d_ff (int): Hidden layer size for the feed-forward network.
        * num_layers (int): Number of decoder sublayers.
        * dropout (float): Dropout rate for regularization.
        """
        super().__init__()
        self.layers = nn.ModuleList([DecoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)
        
    #Processes target embeddings through the decoder stack.
    def forward(self, x, encoder_output, target_mask, encoder_mask):
        """
        Args:
        * x (torch.Tensor): Target embeddings of shape [batch_size, seq_len, d_model].
        * encoder_output (torch.Tensor): Encoder output of shape [batch_size, seq_len, d_model].
        * target_mask (torch.Tensor): Mask for the target sequence.
        * encoder_mask (torch.Tensor): Mask for the encoder sequence.
        
        Returns:
        * torch.Tensor: Predictions for the target sequence of the same shape as input.
        """
        for layer in self.layers:
            x = layer(x, encoder_output, target_mask, encoder_mask)
        return self.norm(x)

## Transformer
Combines the Encoder and Decoder modules into the full Transformer architecture. Processes source and target sequences, generating logits for target vocabulary predictions.

**Inputs:**
* d_model (int): Dimensionality of token embeddings.
* num_heads (int): Number of attention heads.
* d_ff (int): Hidden layer size for the feed-forward network.
* num_layers (int): Number of layers for both encoder and decoder.
* input_vocab_size (int): Vocabulary size of the source language.
* target_vocab_size (int): Vocabulary size of the target language.
* max_len (int): Maximum sequence length.
* dropout (float): Dropout rate for regularization.

**Output:**
* Logits for the target vocabulary of shape [batch_size, target_seq_len, target_vocab_size].

In [47]:
class Transformer(nn.Module):

    # Initializes the Transformer architecture.
    def __init__(self, d_model, num_heads, d_ff, num_layers, input_vocab_size, target_vocab_size, max_len=MAX_SEQ_LEN, dropout=0.1):
        """
        Args:
        * d_model (int): Dimensionality of token embeddings.
        * num_heads (int): Number of attention heads.
        * d_ff (int): Hidden layer size for the feed-forward network.
        * num_layers (int): Number of layers for both encoder and decoder.
        * input_vocab_size (int): Vocabulary size of the source language.
        * target_vocab_size (int): Vocabulary size of the target language.
        * max_len (int): Maximum sequence length.
        * dropout (float): Dropout rate for regularization.
        """
        super().__init__()
        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)
        self.pos_embedding = PositionalEmbedding(d_model, max_len)
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.output_layer = nn.Linear(d_model, target_vocab_size)

    def forward(self, source, target):
        # Encoder mask
        source_mask, target_mask = self.mask(source, target)
        # Embedding and positional Encoding
        source = self.encoder_embedding(source) * math.sqrt(self.encoder_embedding.embedding_dim)
        source = self.pos_embedding(source)
        # Encoder
        encoder_output = self.encoder(source, source_mask)

        # Decoder embedding and postional encoding
        target = self.decoder_embedding(target) * math.sqrt(self.decoder_embedding.embedding_dim)
        target = self.pos_embedding(target)
        # Decoder
        output = self.decoder(target, encoder_output, target_mask, source_mask)

        return self.output_layer(output)



    def mask(self, source, target):
        source_mask = (source != 0).unsqueeze(1).unsqueeze(2)
        target_mask = (target != 0).unsqueeze(1).unsqueeze(2)
        size = target.size(1)
        no_mask = torch.tril(torch.ones((1, size, size), device=device)).bool()
        target_mask = target_mask & no_mask
        return source_mask, target_mask

#### Simple testing

In [48]:
seq_len_source = 10
seq_len_target = 10
batch_size = 2
input_vocab_size = 50
target_vocab_size = 50

source = torch.randint(1, input_vocab_size, (batch_size, seq_len_source))
target = torch.randint(1, target_vocab_size, (batch_size, seq_len_target))

In [49]:
d_model = 512
num_heads = 8
d_ff = 2048
num_layers = 6

model = Transformer(d_model, num_heads, d_ff, num_layers,
                  input_vocab_size, target_vocab_size,
                  max_len=MAX_SEQ_LEN, dropout=0.1)

model = model.to(device)
source = source.to(device)
target = target.to(device)

In [50]:
output = model(source, target)

In [51]:
# Expected output shape -> [batch, seq_len_target, target_vocab_size] i.e. [2, 10, 50]
print(f'ouput.shape {output.shape}')

ouput.shape torch.Size([2, 10, 50])


## Translator English-Spanish
This code implements a sequence-to-sequence translator using the Transformer architecture. The translator takes English sentences as input and generates their corresponding translations in Spanish. The model uses the core components of a Transformer (encoder, decoder, multi-head attention, positional encoding, and feed-forward layers) to achieve this.

**Functionality**

* Dataset Preparation: The English-Spanish sentence pairs are preprocessed, cleaned, and tokenized.
Vocabulary indices are created for both languages.
Sentences are padded to a uniform length for batch processing.

* Model Training:
    * The model uses an encoder-decoder Transformer structure.
    * The encoder processes English input sentences to generate contextualized embeddings.
    * The decoder uses these embeddings and the Spanish target sentence (shifted for teacher forcing) to         generate the translated output.
 
    

* Translation Process:
    * The model predicts the next token in the Spanish sequence for each step.
    * Outputs are probabilities over the Spanish vocabulary, which are converted into tokens to form the
      final translation.

**FIle Readin**g
This section loads the English-Spanish sentence pairs from a .txt file, splits the data, and separates it into two lists: eng_sentences and spa_sentences.

In [53]:
PATH = dataset_path + '/eng-spa4.txt'

with open(PATH, 'r', encoding='utf-8') as f:
    lines = f.readlines()
eng_spa_pairs = [line.strip().split('\t') for line in lines if '\t' in line]

In [55]:
eng_spa_pairs[:10]

[['Hi.', 'Hola.'],
 ['Ow!', '¡Ay!'],
 ['So?', '¿Y?'],
 ['So?', '¿Y qué?'],
 ['No.', 'No.'],
 ['Go.', 'Vaya.'],
 ['Ok!', '¡OK!'],
 ['So?', '¿Entonces?'],
 ['OK.', 'Bueno.'],
 ['Go!', '¡Sal!']]

In [56]:
eng_sentences = [pair[0] for pair in eng_spa_pairs]
spa_sentences = [pair[1] for pair in eng_spa_pairs]

In [57]:
print(eng_sentences[:10])
print(spa_sentences[:10])

['Hi.', 'Ow!', 'So?', 'So?', 'No.', 'Go.', 'Ok!', 'So?', 'OK.', 'Go!']
['Hola.', '¡Ay!', '¿Y?', '¿Y qué?', 'No.', 'Vaya.', '¡OK!', '¿Entonces?', 'Bueno.', '¡Sal!']


#### **Preprocessing Sentences**

Cleans sentences by removing special characters, converting to lowercase, and adding <sos> and <eos> tokens.

In [58]:
#Cleans and preprocesses a sentence.
def preprocess_sentence(sentence):
    """
    Args:
    * sentence (str): Input sentence to preprocess.

    Returns:
    * str: Preprocessed sentence with <sos> and <eos> tokens.
    """
    sentence = sentence.lower().strip()
    sentence = re.sub(r'[" "]+', " ", sentence)
    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)
    sentence = re.sub(r"[^a-z]+", " ", sentence)
    sentence = sentence.strip()
    sentence = '<sos> ' + sentence + ' <eos>'
    return sentence

#### **Preprocessing Example**

We test our sentence preprocessing function on a test sentence

In [59]:
s1 = '¿Hola @ cómo estás? 123'

In [60]:
print(s1)
print(preprocess_sentence(s1))

¿Hola @ cómo estás? 123
<sos> hola como estas <eos>


We preprocess the sentences in both languages

In [61]:
eng_sentences = [preprocess_sentence(sentence) for sentence in eng_sentences]
spa_sentences = [preprocess_sentence(sentence) for sentence in spa_sentences]

In [62]:
spa_sentences[:10]

['<sos> hola <eos>',
 '<sos> ay <eos>',
 '<sos> y <eos>',
 '<sos> y que <eos>',
 '<sos> no <eos>',
 '<sos> vaya <eos>',
 '<sos> ok <eos>',
 '<sos> entonces <eos>',
 '<sos> bueno <eos>',
 '<sos> sal <eos>']

#### **Building Vocabulary**

Creates word-to-index and index-to-word mappings for tokenizing sentences.

In [63]:
#Builds a vocabulary dictionary for the given sentences.
def build_vocab(sentences):
    """
    Args:
    * sentences (list): List of sentences.

    Returns:
    * word2idx (dict): Maps words to indices.
    * idx2word (dict): Maps indices to words.
    """
    words = [word for sentence in sentences for word in sentence.split()]
    word_count = Counter(words)
    sorted_word_counts = sorted(word_count.items(), key=lambda x:x[1], reverse=True)

    # Create word-to-index and index-to-word mappings
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word

In [64]:
# Build vocabularies for English and Spanish
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
spa_word2idx, spa_idx2word = build_vocab(spa_sentences)

eng_vocab_size = len(eng_word2idx)
spa_vocab_size = len(spa_word2idx)

In [65]:
print(eng_vocab_size, spa_vocab_size)

27688 46991


#### **English Spanish Dataset**
Custom PyTorch Dataset that handles the preprocessing and indexing of English and Spanish sentences. It transforms text-based sentences into numerical tensors that represent the tokens in the sentences.

**Key Features:**

* **Sentence Pair Storage:** The dataset stores English (eng_sentences) and Spanish (spa_sentences) sentences after preprocessing.
* **Tokenization and Index Mapping:** Each sentence is tokenized (split into words). Words are replaced with their corresponding indices using word2idx mappings for each language.
* **Tensor Conversion:** Each tokenized sentence is converted into a PyTorch tensor for efficient processing.
* **Integration with DataLoader:** The __getitem__ method ensures compatibility with PyTorch's DataLoader, allowing the data to be accessed in batches.

In [66]:
#Initializes the dataset.
class EngSpaDataset(Dataset):
    def __init__(self, eng_sentences, spa_sentences, eng_word2idx, spa_word2idx):
        """
        Args:
        * eng_sentences (list): List of English sentences.
        * spa_sentences (list): List of Spanish sentences.
        * eng_word2idx (dict): Vocabulary mapping for English (word to index).
        * spa_word2idx (dict): Vocabulary mapping for Spanish (word to index).
        """
        self.eng_sentences = eng_sentences
        self.spa_sentences = spa_sentences
        self.eng_word2idx = eng_word2idx
        self.spa_word2idx = spa_word2idx

    def __len__(self):
        """
        Returns:
        * int: Number of sentence pairs in the dataset.
        """
        return len(self.eng_sentences)

    # Retrieves a specific pair of sentences by index.
    def __getitem__(self, idx
        """
        Args:
        * idx (int): Index of the sentence pair.

        Returns:
        * torch.Tensor: Tokenized and indexed English sentence.
        * torch.Tensor: Tokenized and indexed Spanish sentence.
        """
        eng_sentence = self.eng_sentences[idx]
        spa_sentence = self.spa_sentences[idx]
        # return tokens idxs
        eng_idxs = [self.eng_word2idx.get(word, self.eng_word2idx['<unk>']) for word in eng_sentence.split()]
        spa_idxs = [self.spa_word2idx.get(word, self.spa_word2idx['<unk>']) for word in spa_sentence.split()]

        return torch.tensor(eng_idxs), torch.tensor(spa_idxs)

#### Collate Function
The DataLoader is used to handle batching and padding of the dataset. Since sentences in the dataset have varying lengths, they need to be padded to a uniform length to form batches.

In [67]:
#Custom collate function to pad sentences in a batch.
def collate_fn(batch):
    """
    Args:
    * batch (list of tuples): Each tuple contains an English tensor and a Spanish tensor.

    Returns:
    * torch.Tensor: Padded batch of English sentences.
    * torch.Tensor: Padded batch of Spanish sentences.
    """
    eng_batch, spa_batch = zip(*batch)

    # Pad sentences to uniform length
    eng_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in eng_batch]
    spa_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in spa_batch]
    eng_batch = torch.nn.utils.rnn.pad_sequence(eng_batch, batch_first=True, padding_value=0)
    spa_batch = torch.nn.utils.rnn.pad_sequence(spa_batch, batch_first=True, padding_value=0)
    return eng_batch, spa_batch

#### Train Function
It handles the process of iterating through the data, feeding it into the model, calculating the loss, and updating the model's parameters to minimize the loss.

**Functionality:**
* Loops through the dataset in batches for a specified number of epochs.
* Prepares the input data and targets for the model.
* Uses teacher forcing by feeding the previous target token during training.
* Computes the loss for each prediction and updates the model parameters using backpropagation.
* Logs the training progress and loss for each epoch.

**Inputs:**
* model: The Transformer model to be trained.
* dataloader: DataLoader that provides batches of input and target tensors.
* loss_function: The function used to compute the loss (e.g., CrossEntropyLoss).
* optimiser: The optimizer used for gradient descent (e.g., Adam).
* epochs: Number of iterations over the entire dataset.

**Outputs:**
* Trained model with updated parameters.
* Loss values logged for each epoch.


In [68]:
#Trains the Transformer model on the given dataset.
def train(model, dataloader, loss_function, optimiser, epochs):
    """
    Args:
    * model: Transformer model to be trained.
    * dataloader: PyTorch DataLoader providing batches of tokenized sentences.
    * loss_function: Loss function (e.g., CrossEntropyLoss) for token prediction.
    * optimiser: Optimizer (e.g., Adam) for updating model parameters.
    * epochs: Number of epochs for training.

    Returns:
    None
    """
    model.train()
    for epoch in range(epochs):
        total_loss = 0

        # Create a progress bar for the epoch using tqdm.
        loop = tqdm(dataloader, desc=f'Epoch {epoch + 1}/{epochs}')

        # Iterate through batches in the DataLoader.
        for eng_batch, spa_batch in loop:
            eng_batch = eng_batch.to(device)
            spa_batch = spa_batch.to(device)
            # Decoder preprocessing
            target_input = spa_batch[:, :-1]
            target_output = spa_batch[:, 1:].contiguous().view(-1)
            # Zero grads
            optimiser.zero_grad()
            # run model
            output = model(eng_batch, target_input)
            output = output.view(-1, output.size(-1))
            # loss\
            loss = loss_function(output, target_output)
            # gradient and update parameters
            loss.backward()
            optimiser.step()

            # Accumulate the batch loss into total loss.
            total_loss += loss.item()

            # Update the progress bar with the current loss value.
            loop.set_postfix(loss=loss.item())

        avg_loss = total_loss/len(dataloader)
        print(f'Epoch: {epoch}/{epochs}, Loss: {avg_loss:.4f}')


#### Batch Size and DataLoader Creation
* BATCH_SIZE defines the number of samples to process at once.
* EngSpaDataset processes the English and Spanish sentence pairs into tensors of token indices.
* DataLoader organizes the dataset into batches and shuffles the data at the start of each epoch for randomness.

In [69]:
BATCH_SIZE = 64 #define batch size
dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx) #create dataset
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn) # Initialize DataLoader.

#### **Model Initialization**
Initializes a Transformer model with the specified architecture:
 * d_model=512: Each token is represented as a 512-dimensional embedding.
 * num_heads=8: Multi-head attention divides d_model into 8 parts for parallel processing. 
 * num_layers=6: The encoder and decoder each have 6 layers.
 * Vocabulary sizes and maximum sequence length are set based on the dataset.


In [70]:
model = Transformer(
    d_model=512,  # Embedding dimension.
    num_heads=8,  # Number of attention heads.
    d_ff=2048,  # Size of feed-forward hidden layers.
    num_layers=6,  # Number of encoder/decoder layers.
    input_vocab_size=eng_vocab_size,  # Size of the English vocabulary.
    target_vocab_size=spa_vocab_size,  # Size of the Spanish vocabulary.
    max_len=MAX_SEQ_LEN,  # Maximum sequence length.
    dropout=0.1  # Dropout rate.
)

#### Loss Function and Optimizer
* nn.CrossEntropyLoss: Computes the loss for predicting each token, ignoring <pad> tokens (ignore_index=0).
* optim.Adam: Optimizes model parameters with a learning rate of 0.0001.


In [71]:
model = model.to(device)
print(device)

loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimiser = optim.Adam(model.parameters(), lr=0.0001)

cuda


##### We train the translation model

In [73]:
train(model, dataloader, loss_function, optimiser, epochs = 5)

Epoch 1/5: 100%|██████████| 4173/4173 [10:35<00:00,  6.57it/s, loss=2.25]


Epoch: 0/5, Loss: 3.5424


Epoch 2/5: 100%|██████████| 4173/4173 [10:34<00:00,  6.57it/s, loss=1.28]


Epoch: 1/5, Loss: 2.1964


Epoch 3/5: 100%|██████████| 4173/4173 [10:33<00:00,  6.59it/s, loss=1.22] 


Epoch: 2/5, Loss: 1.6991


Epoch 4/5: 100%|██████████| 4173/4173 [10:34<00:00,  6.58it/s, loss=1.2]  


Epoch: 3/5, Loss: 1.3707


Epoch 5/5: 100%|██████████| 4173/4173 [10:35<00:00,  6.56it/s, loss=1.13] 

Epoch: 4/5, Loss: 1.1207





#### Analysis and Conclusions
**Training Results**
The training loss values decrease consistently across the epochs, as shown in the logs:

* Epoch 1: Loss = 3.5424
* Epoch 2: Loss = 2.1964
* Epoch 3: Loss = 1.6991
* Epoch 4: Loss = 1.3707
* Epoch 5: Loss = 1.1207

This steady decrease in loss demonstrates that the Transformer model successfully learns patterns in the dataset. The final loss of 1.1207 suggests that the model has effectively minimized its prediction error, showing good convergence.

We save the model

In [74]:
torch.save(model.state_dict(), "translator_model.pth")
print("Model saved to translator_model.pth")

Model saved to translator_model.pth


## Model Evaluation
### Functions for evaluation

In [77]:
#Converts a sentence into token indices.
def sentence_to_indices(sentence, word2idx):
    """
    Args:
    * sentence (str): The input sentence to be tokenized.
    * word2idx (dict): Vocabulary mapping words to indices.

    Returns:
    * list[int]: Token indices representing the sentence.
    """
    return [word2idx.get(word, word2idx['<unk>']) for word in sentence.split()]

#Converts a list of token indices into a sentence.
def indices_to_sentence(indices, idx2word):
    """
    Args:
    * indices (list[int]): List of token indices.
    * idx2word (dict): Vocabulary mapping indices to words.

    Returns:
    * str: Human-readable sentence reconstructed from indices.
    """
    return ' '.join([idx2word[idx] for idx in indices if idx in idx2word and idx2word[idx] != '<pad>'])


#Translates an English sentence into Spanish using the Transformer model.
def translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    """
    Args:
    * model (Transformer): The trained Transformer model.
    * sentence (str): The input English sentence.
    * eng_word2idx (dict): English vocabulary mapping words to indices.
    * spa_idx2word (dict): Spanish vocabulary mapping indices to words.
    * max_len (int): Maximum length of the target sentence.
    * device (str): Device to use for translation ('cpu' or 'cuda').

    Returns:
    * str: Translated Spanish sentence.
    """
    model.eval()
    sentence = preprocess_sentence(sentence)

    # Convert the input sentence to indices
    input_indices = sentence_to_indices(sentence, eng_word2idx)
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)

    # Initialize the target tensor with <sos> token
    tgt_indices = [spa_word2idx['<sos>']]
    tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

    with torch.no_grad():
        for _ in range(max_len):
            # Forward pass through the model
            output = model(input_tensor, tgt_tensor)
            output = output.squeeze(0)  
            
            # Get the most probable next token
            next_token = output.argmax(dim=-1)[-1].item()
            tgt_indices.append(next_token)
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

            # Stop if the <eos> token is generated
            if next_token == spa_word2idx['<eos>']:
                break

    return indices_to_sentence(tgt_indices, spa_idx2word)

#### Evaluating Translations 
Evaluates the translation performance of the model on a list of test sentences. For each sentence, it prints the input sentence and its translation.

In [78]:
#Evaluates the model's translations for a list of test sentences.
def evaluate_translations(model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    """
    Args:
    * model (Transformer): The trained Transformer model.
    * sentences (list[str]): List of English sentences to translate.
    * eng_word2idx (dict): English vocabulary mapping words to indices.
    * spa_idx2word (dict): Spanish vocabulary mapping indices to words.
    * max_len (int): Maximum length of the target sentence.
    * device (str): Device to use for translation ('cpu' or 'cuda').

    Returns:
    None
    """
    for sentence in sentences:
        translation = translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len, device)
        print(f'Input sentence: {sentence}')
        print(f'Traducción: {translation}')
        print()




### Test sentences chosen by the team

In [79]:
# Example sentences to test the translator chosen by the team
test_sentences = [
    "Hello, how are you?",
    "I am learning artificial intelligence.",
    "Artificial intelligence is great.",
    "Good night!",
    "Have a Great day!",
    "Let´s go get some ice cream",
    "Today is my Birthday!",
    "I am really excited to be here with you",
    "We need your help with the project",
    "You know it is due tomorrow",
    "Do you want to go to the park later?",
    "This is the first time I am traveling abroad.",
    "They are watching a movie together.",
    "I enjoy playing soccer with my friends.",
    "My brother is studying at the university."
]

**We evaluate the model on the sentences**

In [82]:
# Set the device to 'cpu' or 'cuda' as needed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Evaluate translations
evaluate_translations(model, test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)

Input sentence: Hello, how are you?
Traducción: <sos> hola como estas <eos>

Input sentence: I am learning artificial intelligence.
Traducción: <sos> estoy aprendiendo inteligencia artificial <eos>

Input sentence: Artificial intelligence is great.
Traducción: <sos> la inteligencia artificial es genial <eos>

Input sentence: Good night!
Traducción: <sos> buenas noches <eos>

Input sentence: Have a Great day!
Traducción: <sos> que tengas un gran dia <eos>

Input sentence: Let´s go get some ice cream
Traducción: <sos> vamos a ir a un helado <eos>

Input sentence: Today is my Birthday!
Traducción: <sos> hoy es mi cumplea os <eos>

Input sentence: I am really excited to be here with you
Traducción: <sos> estoy muy emocionada por estar aqui contigo <eos>

Input sentence: We need your help with the project
Traducción: <sos> necesitamos tu ayuda con la ayuda <eos>

Input sentence: You know it is due tomorrow
Traducción: <sos> sabes que va ma ana <eos>

Input sentence: Do you want to go to the

### Results analysis

**Evaluation of Translations**

The translations generated for the test sentences show promising results. Here's an analysis of the output:

* **Accurate Translations:**
    * Many of the test sentences are translated correctly, maintaining the meaning and structure of the original English sentence. Examples include:
        * "Hello, how are you?" → "hola como estas"
        * "Good night!" → "buenas noches"
        * "Artificial intelligence is great." → "la inteligencia artificial es genial"
          
* **Maintaining Context:**
    * The translations preserve the semantic meaning of the original sentences. For instance:
        * "Let's go get some ice cream." → "vamos a ir a un helado"
    * While slightly unnatural, it still captures the intended meaning.



* **Potential Improvements:**
     * Some translations could be refined for grammatical accuracy or fluency in Spanish. For instance:
         * "We need your help with the project." → "necesitamos tu ayuda con la ayuda"
    * Here, the repetition of "ayuda" could be improved.

* **Handling of Special Tokens:**
    * The <sos> and <eos> tokens are correctly included in the output, indicating that the model has learned to manage sequence boundaries effectively.