## Deep Learning
## Transformers

#### Activity 4: Implementing a Translator

- Objective

To understand the Transformer architecture by implementing a translator.

- Instructions

    This activity requires submission in teams. While teamwork is encouraged, each member is expected to contribute individually to the assignment. The final submission should feature the best arguments and solutions from each team member. Only one person per team needs to submit the completed work, but it is imperative that the names of all team members are listed in a Markdown cell at the very beginning of the notebook (either the first or second cell). Failure to include all team member names will result in the grade being awarded solely to the individual who submitted the assignment, with zero points given to other team members (no exceptions will be made to this rule).

    Follow the provided code. The code already implements a transformer from scratch, as explained in one of the [week 9 videos](https://youtu.be/XefFj4rLHgU).

    Since the provided code already implements a simple translator, your job for this assignment is to understand it fully and document it using pictures, figures, and Markdown cells. You should test your translator with at least 10 sentences. The dataset used for this task was obtained from [Tatoeba, a large dataset of sentences and translations](https://tatoeba.org/en/downloads).
  
- Evaluation Criteria

    - Code Readability and Comments
    - Training a translator
    - Translating at least 10 sentences.

- Submission

Submit this Jupyter Notebook in Canvas with your complete solution, ensuring your code is well-commented and includes Markdown cells that explain your design choices, results, and any challenges you encountered.

#### Script to convert csv to text file

To start the activity, we need to convert the TSV file to a CSV file. To do this, we used Microsoft Excel to open the TSV file and then save it as a UTF-8 CSV file.

A few parameters in the code were changed so that it works with the CSV file instead of the TSV file.

In [None]:
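# This script expects the TSV file to have been converted to CSV first.
# The easiest way is to open it in Calc or Excel and save it as a UTF-8 CSV file.
import pandas as pd

PATH = './eng-spa2024.csv'
# The file has no header row, so columns are addressed by position
df = pd.read_csv(PATH, header=None)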

This block of code extracts the second and fourth columns from the DataFrame `df` (the English and Spanish sentences) and creates a copy of them. It then adds a new column named `length` that contains the character length of the text in the first of the two selected columns, that is, the English sentence. The DataFrame is sorted by this `length` column in ascending order, and the `length` column is subsequently removed. Finally, the processed DataFrame is saved to a tab-separated file named `eng-spa4.txt` without the index or a header row.

In [None]:
eng_spa_cols = df.iloc[:, [1, 3]].copy()
eng_spa_cols['length'] = eng_spa_cols.iloc[:, 0].str.len()
eng_spa_cols = eng_spa_cols.sort_values(by='length')
eng_spa_cols = eng_spa_cols.drop(columns=['length'])

output_file_path = './eng-spa4.txt'
eng_spa_cols.to_csv(output_file_path, sep='\t', index=False, header=False)
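
The spreadsheet round-trip can probably be skipped, because pandas reads tab-separated files directly. The sketch below is an alternative under stated assumptions, not the method used above: it assumes the Tatoeba export has four tab-separated columns (English id, English sentence, Spanish id, Spanish sentence). The function name `tsv_to_sorted_pairs` and the file paths are hypothetical.

```python
import csv
import pandas as pd

def tsv_to_sorted_pairs(tsv_path, out_path):
    """Read a 4-column TSV and write English/Spanish pairs sorted by English sentence length."""
    # QUOTE_NONE stops quotation marks inside sentences from being parsed as field quotes
    df = pd.read_csv(tsv_path, sep='\t', header=None,
                     quoting=csv.QUOTE_NONE, encoding='utf-8')
    pairs = df.iloc[:, [1, 3]].copy()
    pairs.columns = ['eng', 'spa']
    # Stable sort by character length of the English sentence (ties keep file order)
    order = pairs['eng'].str.len().sort_values(kind='mergesort').index
    pairs = pairs.loc[order]
    # Write one "english<TAB>spanish" pair per line
    with open(out_path, 'w', encoding='utf-8') as f:
        for eng, spa in zip(pairs['eng'], pairs['spa']):
            f.write(f'{eng}\t{spa}\n')
    return pairs
```

A call such as `tsv_to_sorted_pairs('./eng-spa2024.tsv', './eng-spa4.txt')` (file names assumed) would produce a file in the same two-column layout that the translator section reads later.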

## Transformer - Attention is all you need

Importing all the necessary libraries

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import math
import numpy as np
import re
from tqdm import tqdm

# Set the seed for reproducibility
torch.manual_seed(23)

<torch._C.Generator at 0x7fc035d58030>

Obtaining the CUDA device if it is available

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


We set the maximum sequence length for the model. This value sizes the positional-encoding table, and `collate_fn` truncates any sequence longer than it; shorter sequences are padded to the length of the longest sequence in their batch.

In [None]:
MAX_SEQ_LEN = 128
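
`MAX_SEQ_LEN` also fixes the number of rows in the sinusoidal positional table that `PositionalEmbedding` builds. As a rough illustration, assuming the standard formulation from "Attention Is All You Need" (sine on even columns, cosine on odd columns, with wavelengths growing geometrically up to 10000), the table can be computed on its own. `sinusoidal_table` is a hypothetical helper name, and `d_model=16` is chosen only to keep the output small.

```python
import math
import torch

def sinusoidal_table(max_seq_len, d_model):
    """Return a [max_seq_len, d_model] table of sinusoidal position encodings."""
    position = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1)  # [max_seq_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))                # [d_model / 2]
    table = torch.zeros(max_seq_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)  # even columns
    table[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return table

table = sinusoidal_table(max_seq_len=128, d_model=16)
print(table.shape)
# Position 0 gives sin(0) = 0 on even columns and cos(0) = 1 on odd columns
print(table[0, :4].tolist())  # [0.0, 1.0, 0.0, 1.0]
```

Each row is the vector added to the token embedding at that position, so a sequence of length `n` uses the first `n` rows.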

In [None]:
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_seq_len = MAX_SEQ_LEN):
        super().__init__()
        # Sinusoidal table with one row per position: [max_seq_len, d_model]
        self.pos_embed_matrix = torch.zeros(max_seq_len, d_model, device=device)
        token_pos = torch.arange(0, max_seq_len, dtype = torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0)/d_model))
        self.pos_embed_matrix[:, 0::2] = torch.sin(token_pos * div_term)
        self.pos_embed_matrix[:, 1::2] = torch.cos(token_pos * div_term)
        # Add a leading batch dimension so the table broadcasts over the batch: [1, max_seq_len, d_model]
        self.pos_embed_matrix = self.pos_embed_matrix.unsqueeze(0)

    def forward(self, x):
        # x is batch-first, [batch_size, seq_len, d_model], so slice the table by sequence length (dim 1)
        return x + self.pos_embed_matrix[:, :x.size(1), :]

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model = 512, num_heads = 8):
        super().__init__()
        assert d_model % num_heads == 0, 'Embedding size not compatible with num heads'

        self.d_v = d_model // num_heads
        self.d_k = self.d_v
        self.num_heads = num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask = None):
        batch_size = Q.size(0)
        # Projected Q, K, V -> [batch_size, seq_len, num_heads*d_k]
        # After view + transpose -> [batch_size, num_heads, seq_len, d_k]
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )

        weighted_values, attention = self.scale_dot_product(Q, K, V, mask)
        weighted_values = weighted_values.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads*self.d_k)
        weighted_values = self.W_o(weighted_values)

        return weighted_values, attention


    def scale_dot_product(self, Q, K, V, mask = None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention = F.softmax(scores, dim = -1)
        weighted_values = torch.matmul(attention, V)

        return weighted_values, attention


class PositionFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))

class EncoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask = None):
        # Self-attention sub-layer with residual connection and layer norm
        attention_score, _ = self.self_attn(x, x, x, mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)
        # Feed-forward sub-layer with residual connection
        x = x + self.dropout2(self.ffn(x))
        return self.norm2(x)

class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([EncoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)
    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class DecoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, encoder_output, target_mask=None, encoder_mask=None):
        attention_score, _ = self.self_attn(x, x, x, target_mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)

        encoder_attn, _ = self.cross_attn(x, encoder_output, encoder_output, encoder_mask)
        x = x + self.dropout2(encoder_attn)
        x = self.norm2(x)

        ff_output = self.feed_forward(x)
        x = x + self.dropout3(ff_output)
        return self.norm3(x)

class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([DecoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, target_mask, encoder_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, target_mask, encoder_mask)
        return self.norm(x)
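
To see what `scale_dot_product` computes, the following minimal sketch applies softmax(QKᵀ / sqrt(d_k)) V to small random tensors, outside any class. The tensor sizes and the example mask are illustrative assumptions, not values from the notebook.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq_len, d_k = 1, 2, 4, 8
Q = torch.randn(batch, heads, seq_len, d_k)
K = torch.randn(batch, heads, seq_len, d_k)
V = torch.randn(batch, heads, seq_len, d_k)

# Hide the last key position, as if it were padding.
# Shape [1, 1, 1, seq_len] broadcasts over heads and query positions.
mask = torch.tensor([1, 1, 1, 0]).view(1, 1, 1, seq_len)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # [batch, heads, seq_len, seq_len]
scores = scores.masked_fill(mask == 0, -1e9)                     # masked keys get a very low score
attention = F.softmax(scores, dim=-1)                            # each row sums to 1
weighted_values = torch.matmul(attention, V)                     # [batch, heads, seq_len, d_k]

print(attention.sum(dim=-1))       # all ones
print(attention[..., -1].max())    # weight on the masked key is numerically zero
```

Dividing by `sqrt(d_k)` keeps the scores in a range where the softmax does not saturate, and filling masked positions with `-1e9` drives their softmax weight to zero.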

In [None]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers,
                 input_vocab_size, target_vocab_size,
                 max_len=MAX_SEQ_LEN, dropout=0.1):
        super().__init__()
        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)
        self.pos_embedding = PositionalEmbedding(d_model, max_len)
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.output_layer = nn.Linear(d_model, target_vocab_size)

    def forward(self, source, target):
        # Build the source (padding) mask and the target (padding + causal) mask
        source_mask, target_mask = self.mask(source, target)
        # Embedding and positional Encoding
        source = self.encoder_embedding(source) * math.sqrt(self.encoder_embedding.embedding_dim)
        source = self.pos_embedding(source)
        # Encoder
        encoder_output = self.encoder(source, source_mask)

        # Decoder embedding and positional encoding
        target = self.decoder_embedding(target) * math.sqrt(self.decoder_embedding.embedding_dim)
        target = self.pos_embedding(target)
        # Decoder
        output = self.decoder(target, encoder_output, target_mask, source_mask)

        return self.output_layer(output)



    def mask(self, source, target):
        source_mask = (source != 0).unsqueeze(1).unsqueeze(2)
        target_mask = (target != 0).unsqueeze(1).unsqueeze(2)
        size = target.size(1)
        no_mask = torch.tril(torch.ones((1, size, size), device=target.device)).bool()
        target_mask = target_mask & no_mask
        return source_mask, target_mask
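
The `mask` method combines two conditions for the decoder: a position may not attend to padding, and it may not attend to later positions. The sketch below reproduces the same tensor operations on a single toy target sequence; the token indices are made up for illustration.

```python
import torch

# One target sequence whose last two tokens are <pad> (index 0)
target = torch.tensor([[5, 7, 9, 0, 0]])

padding_mask = (target != 0).unsqueeze(1).unsqueeze(2)        # [batch, 1, 1, seq_len]
size = target.size(1)
causal_mask = torch.tril(torch.ones((1, size, size))).bool()  # [1, seq_len, seq_len]
target_mask = padding_mask & causal_mask                      # [batch, 1, seq_len, seq_len]

for row in target_mask[0, 0].int().tolist():
    print(row)
# [1, 0, 0, 0, 0]
# [1, 1, 0, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
```

Row `i` lists which positions token `i` may attend to: the lower-triangular pattern enforces causality, and the two all-zero columns on the right remove the padding.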


#### Simple test

In [None]:
seq_len_source = 10
seq_len_target = 10
batch_size = 2
input_vocab_size = 50
target_vocab_size = 50

source = torch.randint(1, input_vocab_size, (batch_size, seq_len_source))
target = torch.randint(1, target_vocab_size, (batch_size, seq_len_target))

In [None]:
d_model = 512
num_heads = 8
d_ff = 2048
num_layers = 6

model = Transformer(d_model, num_heads, d_ff, num_layers,
                  input_vocab_size, target_vocab_size,
                  max_len=MAX_SEQ_LEN, dropout=0.1)

model = model.to(device)
source = source.to(device)
target = target.to(device)

In [None]:
output = model(source, target)

In [None]:
# Expected output shape -> [batch, seq_len_target, target_vocab_size] i.e. [2, 10, 50]
print(f'output.shape {output.shape}')

output.shape torch.Size([2, 10, 50])


### Translator Eng-Spa

In [None]:
PATH = './eng-spa4.txt'

In [None]:
with open(PATH, 'r', encoding='utf-8') as f:
    lines = f.readlines()
eng_spa_pairs = [line.strip().split('\t') for line in lines if '\t' in line]

In [None]:
eng_spa_pairs[:10]

[['Hi!', '¡Hola!'],
 ['Go!', '¡Sal!'],
 ['Go!', '¡Ya!'],
 ['Go!', '¡Fuera!'],
 ['OK.', 'Bueno.'],
 ['Ow!', '¡Ay!'],
 ['So?', '¿Y qué?'],
 ['Go.', 'Váyase.'],
 ['No.', 'No.'],
 ['So?', '¿Y?']]

In [None]:
eng_sentences = [pair[0] for pair in eng_spa_pairs]
spa_sentences = [pair[1] for pair in eng_spa_pairs]

In [None]:
print(eng_sentences[:10])
print(spa_sentences[:10])


['Hi!', 'Go!', 'Go!', 'Go!', 'OK.', 'Ow!', 'So?', 'Go.', 'No.', 'So?']
['¡Hola!', '¡Sal!', '¡Ya!', '¡Fuera!', 'Bueno.', '¡Ay!', '¿Y qué?', 'Váyase.', 'No.', '¿Y?']


In [None]:
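def preprocess_sentence(sentence):
    # Lowercase and trim surrounding whitespace
    sentence = sentence.lower().strip()
    # Collapse runs of spaces or double quotes into a single space
    sentence = re.sub(r'[" "]+', " ", sentence)
    # Replace accented vowels (or runs of them) with the unaccented vowel
    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)
    # Replace every run of characters other than a-z (digits, punctuation,
    # and letters such as ñ or ü) with a single space
    sentence = re.sub(r"[^a-z]+", " ", sentence)
    sentence = sentence.strip()
    # Add the start-of-sequence and end-of-sequence markers
    sentence = '<sos> ' + sentence + ' <eos>'
    return sentence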

In [None]:
s1 = '¿Hola @ cómo estás? 123'

In [None]:
print(s1)
print(preprocess_sentence(s1))

¿Hola @ cómo estás? 123
<sos> hola como estas <eos>


In [None]:
eng_sentences = [preprocess_sentence(sentence) for sentence in eng_sentences]
spa_sentences = [preprocess_sentence(sentence) for sentence in spa_sentences]

In [None]:
spa_sentences[:10]

['<sos> hola <eos>',
 '<sos> sal <eos>',
 '<sos> ya <eos>',
 '<sos> fuera <eos>',
 '<sos> bueno <eos>',
 '<sos> ay <eos>',
 '<sos> y que <eos>',
 '<sos> vayase <eos>',
 '<sos> no <eos>',
 '<sos> y <eos>']

In [None]:
def build_vocab(sentences):
    words = [word for sentence in sentences for word in sentence.split()]
    word_count = Counter(words)
    sorted_word_counts = sorted(word_count.items(), key=lambda x:x[1], reverse=True)
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word
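
As a small worked example of how `build_vocab` assigns indices, the sketch below repeats the function so that it runs on its own and applies it to two toy sentences. More frequent words receive smaller indices, starting at 2, because 0 and 1 are reserved for `<pad>` and `<unk>`.

```python
from collections import Counter

def build_vocab(sentences):
    # Same logic as the notebook's build_vocab, repeated here for a self-contained example
    words = [word for sentence in sentences for word in sentence.split()]
    word_count = Counter(words)
    sorted_word_counts = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word

toy_sentences = ['<sos> hola <eos>', '<sos> hola que tal <eos>']
word2idx, idx2word = build_vocab(toy_sentences)
print(word2idx)
# {'<sos>': 2, 'hola': 3, '<eos>': 4, 'que': 5, 'tal': 6, '<pad>': 0, '<unk>': 1}

# A word that was never seen falls back to <unk> (index 1)
encoded = [word2idx.get(word, word2idx['<unk>']) for word in 'hola mundo'.split()]
print(encoded)  # [3, 1]
```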

In [None]:
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
spa_word2idx, spa_idx2word = build_vocab(spa_sentences)
eng_vocab_size = len(eng_word2idx)
spa_vocab_size = len(spa_word2idx)

In [None]:
print(eng_vocab_size, spa_vocab_size)

27650 46924


In [None]:
class EngSpaDataset(Dataset):
    def __init__(self, eng_sentences, spa_sentences, eng_word2idx, spa_word2idx):
        self.eng_sentences = eng_sentences
        self.spa_sentences = spa_sentences
        self.eng_word2idx = eng_word2idx
        self.spa_word2idx = spa_word2idx

    def __len__(self):
        return len(self.eng_sentences)

    def __getitem__(self, idx):
        eng_sentence = self.eng_sentences[idx]
        spa_sentence = self.spa_sentences[idx]
        # return tokens idxs
        eng_idxs = [self.eng_word2idx.get(word, self.eng_word2idx['<unk>']) for word in eng_sentence.split()]
        spa_idxs = [self.spa_word2idx.get(word, self.spa_word2idx['<unk>']) for word in spa_sentence.split()]

        return torch.tensor(eng_idxs), torch.tensor(spa_idxs)

In [None]:
def collate_fn(batch):
    eng_batch, spa_batch = zip(*batch)
    eng_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in eng_batch]
    spa_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in spa_batch]
    eng_batch = torch.nn.utils.rnn.pad_sequence(eng_batch, batch_first=True, padding_value=0)
    spa_batch = torch.nn.utils.rnn.pad_sequence(spa_batch, batch_first=True, padding_value=0)
    return eng_batch, spa_batch
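
The effect of `collate_fn` is easiest to see on a toy batch. The sketch below uses a small stand-in limit (`MAX_LEN = 4`, an assumption for illustration only) in place of `MAX_SEQ_LEN`, truncates each sequence to it, and then pads to the longest remaining sequence.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 4  # small stand-in for MAX_SEQ_LEN

# Three sequences of different lengths (token indices are made up)
batch = [torch.tensor([2, 5, 3]),
         torch.tensor([2, 6, 7, 8, 9, 3]),
         torch.tensor([2, 3])]

truncated = [seq[:MAX_LEN] for seq in batch]                          # cut anything longer than MAX_LEN
padded = pad_sequence(truncated, batch_first=True, padding_value=0)   # pad with <pad> = 0

print(padded.tolist())
# [[2, 5, 3, 0], [2, 6, 7, 8], [2, 3, 0, 0]]
```

Note that truncation simply cuts the tail, so a truncated sequence loses its final `<eos>` token, as the second row shows.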


In [None]:
def train(model, dataloader, loss_function, optimiser, epochs):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for eng_batch, spa_batch in tqdm(dataloader):
            eng_batch = eng_batch.to(device)
            spa_batch = spa_batch.to(device)
            # Decoder preprocessing
            target_input = spa_batch[:, :-1]
            target_output = spa_batch[:, 1:].contiguous().view(-1)
            # Zero grads
            optimiser.zero_grad()
            # run model
            output = model(eng_batch, target_input)
            output = output.view(-1, output.size(-1))
            # Compute the loss
            loss = loss_function(output, target_output)
            # gradient and update parameters
            loss.backward()
            optimiser.step()
            total_loss += loss.item()

        avg_loss = total_loss/len(dataloader)
        print(f'Epoch: {epoch}/{epochs}, Loss: {avg_loss:.4f}')
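
The "decoder preprocessing" step in `train` implements teacher forcing: the decoder receives the target sequence without its last token and must predict the same sequence shifted one position to the left. A minimal sketch on one toy sequence, with made-up token indices:

```python
import torch

# One padded target sequence: <sos>=2, three word tokens, <eos>=3, <pad>=0 (indices are illustrative)
spa_batch = torch.tensor([[2, 10, 11, 12, 3, 0]])

target_input = spa_batch[:, :-1]                         # what the decoder sees
target_output = spa_batch[:, 1:].contiguous().view(-1)   # what it must predict, flattened

print(target_input.tolist())   # [[2, 10, 11, 12, 3]]
print(target_output.tolist())  # [10, 11, 12, 3, 0]
```

At each position the decoder sees the tokens up to that point and is scored on the next one; the trailing `0` in `target_output` is padding.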



Next, we set the batch size and create the dataset and dataloader. No separate test split is created here; the whole corpus is used for training. The `EngSpaDataset` class loads the English-Spanish dataset using the following parameters:

- `eng_sentences` is a list of English sentences.
- `spa_sentences` is a list of Spanish sentences.
- `eng_word2idx` is a dictionary that maps English words to their corresponding indices.
- `spa_word2idx` is a dictionary that maps Spanish words to their corresponding indices.

The dataloader is used to load the dataset in batches. The batch size is set to 64, and the dataset is shuffled before each epoch.

In [None]:
BATCH_SIZE = 64
dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)

Initializing a Transformer model with specific hyperparameters. The model is configured with a dimensionality of 512 for the model's hidden layers (`d_model`), 8 attention heads (`num_heads`), a feed-forward network dimension of 2048 (`d_ff`), and 6 layers (`num_layers`). The input and target vocabulary sizes are set to `eng_vocab_size` and `spa_vocab_size`, respectively, which correspond to the sizes of the English and Spanish vocabularies. The maximum sequence length is defined by `MAX_SEQ_LEN`, and a dropout rate of 0.1 is applied to prevent overfitting.



In [None]:
model = Transformer(d_model=512, num_heads=8, d_ff=2048, num_layers=6,
                    input_vocab_size=eng_vocab_size, target_vocab_size=spa_vocab_size,
                    max_len=MAX_SEQ_LEN, dropout=0.1)

We move the model to the CUDA device if one is available.

For the optimizer, we use Adam with a learning rate of 0.0001.

For the loss function, we use `CrossEntropyLoss` with `ignore_index=0`, so that positions holding the padding token (index 0) are excluded from the loss calculation.

In [None]:
model = model.to(device)
loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimiser = optim.Adam(model.parameters(), lr=0.0001)
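
To illustrate what `ignore_index=0` does, the sketch below compares the loss over a toy batch containing padding with the loss computed only on the non-padded positions. The logits are random and the sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 6
logits = torch.randn(4, vocab_size)    # scores for 4 target positions
targets = torch.tensor([3, 5, 0, 0])   # the last two positions are <pad>

# Loss that skips padded positions
loss_ignoring_pad = nn.CrossEntropyLoss(ignore_index=0)(logits, targets)
# Loss computed by hand on the two real tokens only
loss_real_tokens = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(torch.isclose(loss_ignoring_pad, loss_real_tokens))  # tensor(True)
```

The two values match because, with the default mean reduction, ignored targets contribute neither to the sum nor to the count used for averaging.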


We now train the model on the dataset provided by the Tatoeba project. The dataset is not very large, so the model will not translate every sentence correctly, although it does translate some sentences correctly.

Training the model is computationally demanding, so we used a cloud service. The model was trained for 10 epochs on an RTX A4000 GPU with 16 GB of memory, 8 CPU cores, and 48 GB of RAM. Each epoch took about 9 minutes, so the full run took approximately 1 hour and 30 minutes.

For reference, we used the service provided by [Paperspace](https://www.paperspace.com/).

In [None]:
# Train the model; this step takes a while, depending on the available compute

train(model, dataloader, loss_function, optimiser, epochs=10)

100%|██████████| 4162/4162 [09:08<00:00,  7.59it/s]


Epoch: 0/10, Loss: 3.5995


100%|██████████| 4162/4162 [09:09<00:00,  7.58it/s]


Epoch: 1/10, Loss: 2.2074


100%|██████████| 4162/4162 [09:07<00:00,  7.60it/s]


Epoch: 2/10, Loss: 1.7063


100%|██████████| 4162/4162 [09:08<00:00,  7.59it/s]


Epoch: 3/10, Loss: 1.3786


100%|██████████| 4162/4162 [09:08<00:00,  7.59it/s]


Epoch: 4/10, Loss: 1.1267


100%|██████████| 4162/4162 [09:07<00:00,  7.60it/s]


Epoch: 5/10, Loss: 0.9234


100%|██████████| 4162/4162 [09:07<00:00,  7.61it/s]


Epoch: 6/10, Loss: 0.7588


100%|██████████| 4162/4162 [09:07<00:00,  7.60it/s]


Epoch: 7/10, Loss: 0.6298


100%|██████████| 4162/4162 [09:08<00:00,  7.59it/s]


Epoch: 8/10, Loss: 0.5355


100%|██████████| 4162/4162 [09:08<00:00,  7.59it/s]

Epoch: 9/10, Loss: 0.4669





We save the trained model's weights (its `state_dict`) to a checkpoint file. This file can later be loaded to translate sentences without training the model again, which matters because training is computationally expensive.

In [None]:
# Saving the model to disk for later use

model_path = './transformer_eng_spa.pth'
torch.save(model.state_dict(), model_path)


We also added a cell that restores the trained weights from the checkpoint file, using PyTorch's `torch.load` together with `load_state_dict`.

In [None]:
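# To load the model, first build the same architecture (the Transformer defined above),
# then load the weights from the checkpoint file.
# map_location lets a checkpoint saved on a GPU be loaded on a CPU-only machine.

model_path = './transformer_eng_spa.pth'
model.load_state_dict(torch.load(model_path, map_location=device))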

<All keys matched successfully>

In [None]:
def sentence_to_indices(sentence, word2idx):
    """
    Converts a sentence to a list of token indices.

    Parameters:
    sentence (str): The input sentence to be converted.
    word2idx (dict): A dictionary mapping words to their corresponding indices.

    Returns:
    list: A list of token indices.

    This function splits the input sentence into words and retrieves the corresponding indices from the word2idx dictionary.
    """
    return [word2idx.get(word, word2idx['<unk>']) for word in sentence.split()]

def indices_to_sentence(indices, idx2word):
    """
    Converts a list of token indices to a sentence.

    Parameters:
    indices (list): A list of token indices.
    idx2word (dict): A dictionary mapping token indices to words.

    Returns:
    str: The sentence formed by the token indices.

    This function iterates over the token indices and retrieves the corresponding words from the idx2word dictionary.
    """
    return ' '.join([idx2word[idx] for idx in indices if idx in idx2word and idx2word[idx] != '<pad>'])

def translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    """
    Translates a given sentence from English to Spanish using a translation model.

    Parameters:
    model: The translation model to be used for generating translations.
    sentence (str): The input sentence in English to be translated.
    eng_word2idx (dict): A dictionary mapping English words to their corresponding indices.
    spa_idx2word (dict): A dictionary mapping Spanish indices to their corresponding words.
    max_len (int): The maximum length of the output sequence (default is MAX_SEQ_LEN).
    device (str): The device to run the model on (default is 'cpu').

    Returns:
    str: The translated sentence in Spanish.

    This function preprocesses the input sentence, converts it to indices, and feeds it into the model.
    It then iteratively generates the translated sentence token by token until the end-of-sequence token is produced
    or the maximum sequence length is reached. The function returns the translated sentence as a string.
    """

    # Set the model to evaluation mode, then preprocess the sentence and convert it to a tensor of indices
    model.eval()
    sentence = preprocess_sentence(sentence)
    input_indices = sentence_to_indices(sentence, eng_word2idx)
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)

    # Invert spa_idx2word so the <sos>/<eos> indices come from the arguments rather than a global
    spa_word2idx = {word: idx for idx, word in spa_idx2word.items()}

    # Initialize the target sequence with the <sos> token
    tgt_indices = [spa_word2idx['<sos>']]
    tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

    # Generate the translated sentence token by token (greedy decoding)
    with torch.no_grad():
        # Iterate until the end-of-sequence token is produced or the maximum sequence length is reached
        for _ in range(max_len):
            # Run the model on the source and on the target generated so far
            output = model(input_tensor, tgt_tensor)

            # Drop the batch dimension -> [tgt_len, target_vocab_size]
            output = output.squeeze(0)

            # Pick the most likely token at the last position
            next_token = output.argmax(dim=-1)[-1].item()

            # Append it and rebuild the target tensor for the next step
            tgt_indices.append(next_token)
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

            # Stop once the end-of-sequence token is produced
            if next_token == spa_word2idx['<eos>']:
                break

    # Return the translated sentence in Spanish
    return indices_to_sentence(tgt_indices, spa_idx2word)

Now we test the translator with some English sentences. This is a simple check that the translator produces reasonable Spanish output.

In [None]:
def evaluate_translations(model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    """
    Evaluates translations for a list of input sentences using a given translation model.

    Parameters:
    model: The translation model to be used for generating translations.
    sentences (list of str): A list of sentences in the source language (English) to be translated.
    eng_word2idx (dict): A dictionary mapping English words to their corresponding indices.
    spa_idx2word (dict): A dictionary mapping Spanish indices to their corresponding words.
    max_len (int): The maximum length of the sequences (default is MAX_SEQ_LEN).
    device (str): The device to run the model on (default is 'cpu').

    Returns:
    None

    This function iterates over each sentence in the input list, generates a translation using the model,
    and prints both the input sentence and its corresponding translation.
    """

    # Iterate over each input sentence and generate the corresponding translation
    for sentence in sentences:
        # Generate the translation using the translate_sentence function defined above
        translation = translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len, device)
        print(f'Input sentence: {sentence}')
        print(f'Traducción: {translation}')
        print()

# Example sentences to test the translator
test_sentences = [
    "Hello, how are you?",
    "I am learning artificial intelligence.",
    "Artificial intelligence is great.",
    "Good night!"
]

# Assuming the model is trained and loaded
# Set the device to 'cpu' or 'cuda' as needed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Evaluate translations
evaluate_translations(model, test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)


Input sentence: Hello, how are you?
Traducción: <sos> hola que tal <eos>

Input sentence: I am learning artificial intelligence.
Traducción: <sos> estoy aprendiendo inteligencia artificial <eos>

Input sentence: Artificial intelligence is great.
Traducción: <sos> la inteligencia artificial es grande <eos>

Input sentence: Good night!
Traducción: <sos> buenas noches <eos>
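
The printed translations still carry the `<sos>` and `<eos>` markers. If cleaner output is wanted, a small post-processing step can drop them. The helper below is a hypothetical addition (the name `strip_special_tokens` is not part of the provided code) and only operates on the decoded string.

```python
def strip_special_tokens(translation, special_tokens=('<sos>', '<eos>', '<pad>')):
    """Remove marker tokens from a decoded sentence, leaving only the words."""
    return ' '.join(word for word in translation.split() if word not in special_tokens)

print(strip_special_tokens('<sos> hola que tal <eos>'))  # hola que tal
```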



At this point, most of the translations look acceptable, but some sentences are not translated as we expected. For example, "I am a student" is translated as "Soy un estudiante", whereas we expected "Yo soy un estudiante"; likewise, "I am a teacher" is translated as "Soy un profesor" rather than "Yo soy un profesor". Note that Spanish commonly drops the subject pronoun, so both forms are valid translations.

#### Additional sentences to test the translator

In [None]:
# Extra test sentences to evaluate the model further
extra_test_sentences = [
    "Hello, can I ask you a question?",
    "I am learning how to build a transformer model.",
    "The transformer model is a type of neural network.",
    "Goodbye!",
    "We are studying advanced natural language processing techniques.",
    "Natural language processing is a fascinating field.",
    "I'm a student at the university.",
    "I live in Mexico.",
    "We like to eat pizza on Fridays.",
    "The weather is nice today.",
    "The cat is sleeping on the sofa.",
    "The dog is playing in the garden.",
    "My favorite color is blue.",
    "The sky is clear and the sun is shining.",
    "The moon is visible in the night sky.",
]

# Evaluate translations
evaluate_translations(model, extra_test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)


Input sentence: Hello, can I ask you a question?
Traducción: <sos> hola puedo hacerte una pregunta <eos>

Input sentence: I am learning how to build a transformer model.
Traducción: <sos> estoy aprendiendo a construir una modelo de internet <eos>

Input sentence: The transformer model is a type of neural network.
Traducción: <sos> el modelo es un tipo de red de la red de red <eos>

Input sentence: Goodbye!We are studying advanced natural language processing techniques.
Traducción: <sos> estamos estudiando diferentes tecnicas para que los poemas naturales <eos>

Input sentence: Natural language processing is a fascinating field.
Traducción: <sos> el lenguaje natural es una lengua natural fascinante <eos>

Input sentence: I'm a student at the university.
Traducción: <sos> soy estudiante en la universidad <eos>

Input sentence: I live in Mexico.
Traducción: <sos> vivo en mexico <eos>

Input sentence: We like to eat pizza on Fridays.
Traducción: <sos> nos gusta comer pizza los viernes <eos>

Clearly, the translator has several limitations, such as a limited vocabulary and limited training data; for example, "transformer model" was rendered as "modelo de internet". However, it is a good starting point for understanding the Transformer architecture.