This project contains the following:

- A loose implementation of the **attention mechanism** from the paper "Attention Is All You Need"

- **Sequence-to-sequence** translation using a **transformer**

- **Visualising** the encoder, decoder and cross attention using **bertviz**

The pipeline is as follows:

1. Prepare the data.
2. Get the embeddings of the training partition.
3. Write the elements of the self-attention mechanism.
4. Train a transformer model.
5. Evaluate the model.
6. Visualize the attention mechanism.


In [35]:
! pip install datasets
! pip install transformers
! pip install bertviz
! pip install ipywidgets


## 0. Load the dataset

In [36]:
import torch as nn  # note: in this notebook `nn` aliases the top-level torch package, not torch.nn
import numpy as np
import random
import datasets
import re

initial_dataset = datasets.load_dataset("Thermostatic/parallel_corpus_europarl_english_spanish")
initial_dataset = initial_dataset['train']

print(initial_dataset, type(initial_dataset))

Dataset({
    features: ['en', 'es'],
    num_rows: 1965734
}) <class 'datasets.arrow_dataset.Dataset'>


In [37]:
print(initial_dataset[0])

{'en': 'Resumption of the session', 'es': 'Reanudación del período de sesiones'}


## 1. Data pre-processing

In [38]:
from torch.utils.data import random_split

# select the first 10000 rows in the original corpus.
dataset = initial_dataset.select(range(10000))

new_dataset_size = len(dataset)

train_size = int(0.7 * new_dataset_size)
test_size = int(0.2 * new_dataset_size)
val_size = new_dataset_size - train_size - test_size


train_dataset, test_dataset, val_dataset = random_split(dataset, [train_size, test_size, val_size])

In [39]:
# run this line to check that the sizes of your partitions are correct.
print(len(train_dataset), len(test_dataset), len(val_dataset))

7000 2000 1000


In [40]:
# run this line to display a sentence pair in the train partition.
train_dataset[4]

{'en': 'Madam President, ladies and gentlemen, what has occurred in a tributary of the Danube is considered by many experts to be an accident on an environmental scale as serious as the scale of the Chernobyl disaster.',
 'es': 'Señora Presidenta, señorías, lo que ha ocurrido en un afluente del Danubio ha sido considerado por numerosos especialistas como un accidente de un impacto medioambiental tan grave como el del accidente de Chernobyl.'}

###  Prepare the data

Write a function, `sentencePairs`, that returns a list of lists called `sentence_pairs`. Each inner list must contain one sentence pair in the form [*sentence in English*, *sentence in Spanish*], with all characters in lowercase.

For example, for an instance such as `train_dataset[0]`, the result would look like this:

```
[['this situation can be achieved over a period of time.', 'se puede lograr esa situación al cabo de un período de tiempo.']]
```

In [41]:
from torch.utils.data import Subset
from typing import List

def sentencePairs(dataset : Subset) -> List[List[str]]:
    sentence_pairs = []

    for i in range(len(dataset)):
        english_sentence = dataset[i]['en'].lower()
        spanish_sentence = dataset[i]['es'].lower()

        # Append the sentence pair to the list
        sentence_pairs.append([english_sentence, spanish_sentence])
        
    return sentence_pairs

In [152]:
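# run these lines to check that the sizes of your lists are correct.
# note: the held-out list is built from `val_dataset`, and a notebook cell
# only displays the value of its last expression.

sentence_pairs_train = sentencePairs(train_dataset)
sentence_pairs_test = sentencePairs(val_dataset)
len(sentence_pairs_train)
len(sentence_pairs_test)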

1000

In [43]:
# run this line to display a sentence pair inside a list.

sentence_pairs_train[0]

['here, unfortunately, the european parliament has bowed to the interests of industry.',
 'en este caso, desgraciadamente, el parlamento europeo se ha plegado a los intereses de la industria.']

In [44]:
import nltk
nltk.download('punkt')

def tokenize(text):
    return nltk.word_tokenize(text)

def obtain_tokens(sentence_pairs):
    tokens_eng_train = [] 
    tokens_spa_train = [] 
    for pair in sentence_pairs:
        tokens_eng_train.append(tokenize(pair[0]))
        tokens_spa_train.append(tokenize(pair[1]))

    return tokens_eng_train, tokens_spa_train

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aashishmukund/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [157]:
# test the function.

tokens_eng_train, tokens_spa_train = obtain_tokens(sentence_pairs_train)
tokens_eng_test, tokens_spa_test = obtain_tokens(sentence_pairs_test)
print(len(tokens_eng_test[0]), len(tokens_spa_test[0]))

34 33


In [158]:
tokens_spa_test[0]

['por',
 'eso',
 ',',
 'nosotros',
 'solicitamos',
 'que',
 'su',
 'programa',
 'sea',
 'valiente',
 ',',
 'pues',
 ',',
 'si',
 'lo',
 'fuera',
 ',',
 'le',
 'puedo',
 'asegurar',
 'que',
 'este',
 'parlamento',
 'estará',
 'con',
 'la',
 'comisión',
 'en',
 'ese',
 'proceso',
 'de',
 'reformas',
 '.']

An output example from the `obtain_tokens` function:

```python
print(tokens_eng_train[0])

['nevertheless',
 ',',
 'these',
 'same',
 'institutions',
 'showed',
 'no',
 'reluctance',
 'to',
 'accept',
 'turkey',
 'as',
 'a',
 'candidate',
 'for',
 'membership',
 'of',
 'the',
 'european',
 'union',
 ',',
 'despite',
 'its',
 'well-known',
 'human',
 'rights',
 'violations',
 '.']
 ```

## 2. Embeddings

It is time to turn the strings into numerical representations. First, you will obtain the vocabulary (*the set of all unique tokens*) in the training partition using the `get_word_tokens_eng` function. Then, you will assign a unique numerical ID to every token. Finally, you will write a function, `create_embeddings_eng`, which maps each token ID in the vocabulary to a vector.
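As a rough illustration of the vocabulary-to-ID step, the sketch below builds a token-to-ID mapping from a toy list of tokenized sentences. Sorting the vocabulary and reserving IDs for `<pad>` and `<unk>` are assumptions added here for reproducibility; `build_vocab` and `encode` are hypothetical helper names, not functions used elsewhere in this notebook.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens, not used elsewhere in this notebook


def build_vocab(tokenized_sentences: List[List[str]]) -> Dict[str, int]:
    """Map every unique token to an integer ID, with fixed IDs for PAD and UNK."""
    unique_tokens = {tok for sent in tokenized_sentences for tok in sent}
    vocab = {PAD: 0, UNK: 1}
    # Sorting makes the IDs identical across runs; iterating a raw set does not guarantee that.
    for tok in sorted(unique_tokens):
        vocab[tok] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each token by its ID, falling back to the UNK ID for unseen tokens."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


toy_corpus = [["the", "house", "is", "open"], ["the", "session", "is", "closed"]]
toy_vocab = build_vocab(toy_corpus)
toy_ids = encode(["the", "session", "is", "adjourned"], toy_vocab)
print(toy_vocab)
print(toy_ids)
```

With a reserved padding ID, padded positions can later be told apart from real tokens, and an unknown-token ID avoids silently dropping words that were not seen during training.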

In [47]:
def get_word_tokens_eng(sentence_pairs: List[List[str]]) -> set:
    tokens_eng_train, tokens_spa_train = obtain_tokens(sentence_pairs)
    unique_tokens=set()
    for token_list in tokens_eng_train:
        for token in token_list:
            unique_tokens.add(token)
            
    return unique_tokens

Here, we create the variable `vocabulary_train_eng`, which contains the tokens in **English**.

In [48]:
# test the function.
# expected output: {'token_0', 'token_1', ...}

vocabulary_train_eng = get_word_tokens_eng(sentence_pairs_train)
print(vocabulary_train_eng)



Now, modify the `get_word_tokens_eng` function (or write a new one) and create the `vocabulary_train_spa`.

In [49]:
def get_word_tokens_spa(sentence_pairs: List[List[str]]) -> set:
    tokens_eng_train, tokens_spa_train = obtain_tokens(sentence_pairs)
    unique_tokens=set()
    for token_list in tokens_spa_train:
        for token in token_list:
            unique_tokens.add(token)
            
    return unique_tokens
vocabulary_train_spa = get_word_tokens_spa(sentence_pairs_train)

In [50]:
# test the function.
# expected output: {'token_0', 'token_1', ...}

print(vocabulary_train_spa)

{'practicadas', 'lección', 'siria', 'amigo', 'aplicarán', 'ong', 'sme', 'vigilancia', 'ideología', 'comunes', 'originando', 'renunciando', 'mirar', 'preocupación', 'facie', 'interpretada', 'regular', 'hora', 'oportunidad', 'prepare', 'atentado', 'martens', 'comunicó', 'produciría', 'deriven', 'colaboren', 'adapta', 'embarcaciones', 'aumentados', 'árabe-israelí', 'tanto', 'informando', '14,7', 'meta', 'inusitada', 'futura', 'desprenden', 'simultáneamente', 'días', 'enriqueces', 'tratado', 'márgenes', 'conseguiremos', 'c5-0243/1999', 'pseudo', 'conservar', 'bastan', 'legislativa', 'urgentemente', 'desfavorecen', 'referida', 'suscitan', 'consideramos', 'estacada', 'sigamos', 'descentralizados', 'eurojust', 'establecimiento', 'transparenta', 'accidente', '200.000', '100', 'espontánea', 'gana', '344', 'artículo', 'señales', 'julio', 'criticado', 'indefinidamente', 'familiares', 'kazajstán', 'anexos', 'normalizaciones', '28', 'pecio', 'representativos', 'antirural', '1993', 'subrayamos', '¿a

In [51]:
print(len(vocabulary_train_eng), len(vocabulary_train_spa))

9650 14043


In [52]:
def create_dict_from_set_eng(vocabulary_train_eng):
    vocab_list = list(vocabulary_train_eng)
    string_dict = {token: i for i, token in enumerate(vocab_list)}
    return string_dict

In [53]:
# test the function.
# expected output: {'token_0': 0, 'token_1': 1, ...}

tokens_and_ids_eng = create_dict_from_set_eng(vocabulary_train_eng)
print(tokens_and_ids_eng)



In [54]:
def create_dict_from_set_spa(vocabulary_train_spa):
    vocab_list = list(vocabulary_train_spa)
    string_dict = {token: i for i, token in enumerate(vocab_list)}
    return string_dict

# test the function.
# expected output: {'token_0': 0, 'token_1': 1, ...}
tokens_and_ids_spa = create_dict_from_set_spa(vocabulary_train_spa)
print(tokens_and_ids_spa)

{'practicadas': 0, 'lección': 1, 'siria': 2, 'amigo': 3, 'aplicarán': 4, 'ong': 5, 'sme': 6, 'vigilancia': 7, 'ideología': 8, 'comunes': 9, 'originando': 10, 'renunciando': 11, 'mirar': 12, 'preocupación': 13, 'facie': 14, 'interpretada': 15, 'regular': 16, 'hora': 17, 'oportunidad': 18, 'prepare': 19, 'atentado': 20, 'martens': 21, 'comunicó': 22, 'produciría': 23, 'deriven': 24, 'colaboren': 25, 'adapta': 26, 'embarcaciones': 27, 'aumentados': 28, 'árabe-israelí': 29, 'tanto': 30, 'informando': 31, '14,7': 32, 'meta': 33, 'inusitada': 34, 'futura': 35, 'desprenden': 36, 'simultáneamente': 37, 'días': 38, 'enriqueces': 39, 'tratado': 40, 'márgenes': 41, 'conseguiremos': 42, 'c5-0243/1999': 43, 'pseudo': 44, 'conservar': 45, 'bastan': 46, 'legislativa': 47, 'urgentemente': 48, 'desfavorecen': 49, 'referida': 50, 'suscitan': 51, 'consideramos': 52, 'estacada': 53, 'sigamos': 54, 'descentralizados': 55, 'eurojust': 56, 'establecimiento': 57, 'transparenta': 58, 'accidente': 59, '200.000'

In [55]:
# using the IDs created in the previous step, create a tensor object called 'id_tensor'
# that contains all the IDs.

import torch as nn  # `nn` aliases the top-level torch package here, not torch.nn

def create_tensor_from_dict_eng(tokens_and_ids_eng: dict) -> nn.Tensor:
    # the dictionary values are the integer IDs assigned to the tokens.
    ids = list(tokens_and_ids_eng.values())
    id_tensor = nn.tensor(ids)
    return id_tensor

In [56]:
# test the function.

tensor_ids_eng = create_tensor_from_dict_eng(tokens_and_ids_eng)
print(tensor_ids_eng) # expected output: tensor([
print(tensor_ids_eng.shape)

tensor([   0,    1,    2,  ..., 9647, 9648, 9649])
torch.Size([9650])


In [57]:
def create_tensor_from_dict_spa(tokens_and_ids_spa: dict) -> nn.Tensor:
    # the dictionary values are the integer IDs assigned to the tokens.
    ids = list(tokens_and_ids_spa.values())
    id_tensor = nn.tensor(ids)
    return id_tensor
tensor_ids_spa = create_tensor_from_dict_spa(tokens_and_ids_spa)

In [58]:
# test the function.

print(tensor_ids_spa.shape)

torch.Size([14043])


In [59]:
from torch.nn import Embedding

def create_embeddings_eng(tensor_ids_eng):
    vocab_size = tensor_ids_eng.max().item() + 1
    embedding_layer = Embedding(num_embeddings=vocab_size, embedding_dim=16)
    embeddings = embedding_layer(tensor_ids_eng)
    return embeddings

In [60]:
# test the function.

embeddings_eng = create_embeddings_eng(tensor_ids_eng)
print(embeddings_eng) # expected output: tensor([[...
print(embeddings_eng.shape)

tensor([[ 0.3249,  0.3505,  0.7199,  ...,  1.1245, -0.1455,  0.9517],
        [ 1.2730,  0.6371, -1.0654,  ...,  2.7308,  1.1513, -0.8126],
        [-1.2891,  1.0081, -0.4011,  ..., -0.7228, -1.6713,  0.4613],
        ...,
        [-0.2151, -0.4253, -0.2614,  ..., -0.6409, -1.7682, -1.0090],
        [-0.3418,  0.4584,  1.2790,  ...,  0.5091, -0.2032,  1.5239],
        [-0.6806,  0.3071,  0.4895,  ...,  1.1236,  0.4352,  0.7335]],
       grad_fn=<EmbeddingBackward0>)
torch.Size([9650, 16])


---

In [61]:
embeddings_eng.shape

torch.Size([9650, 16])

Now, we have an object that contains a vector of dimension $16$ for each unique ID that represents the tokens in the English partition of the training corpus.
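To make the lookup concrete, here is a minimal sketch of how an embedding layer turns the IDs of one sentence into vectors. The vocabulary size and the token IDs are toy values assumed for illustration; only the vector length of 16 matches the notebook.

```python
import torch
from torch.nn import Embedding

torch.manual_seed(0)  # fixed seed so the randomly initialised vectors are repeatable

toy_vocab_size = 10   # assumed toy value; the notebook's English vocabulary is much larger
embedding_dim = 16    # same vector length as used above

embedding_layer = Embedding(num_embeddings=toy_vocab_size, embedding_dim=embedding_dim)

# IDs of the tokens of one (made-up) sentence, in order.
sentence_ids = torch.tensor([3, 7, 7, 1])
sentence_vectors = embedding_layer(sentence_ids)

print(sentence_vectors.shape)  # one 16-dimensional row per token
# Repeated IDs map to the same row of the embedding table.
print(torch.equal(sentence_vectors[1], sentence_vectors[2]))
```

The layer behaves like a lookup table: the output has one row per input ID, so a sentence of four tokens yields a 4 × 16 tensor.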

## 3. Self-attention mechanism

This technique makes use of three matrices: query, key and value. Each one is obtained by multiplying a weight matrix by the token embeddings:
- $W_{query} \cdot embedding(i)$
- $W_{key} \cdot embedding(i)$
- $W_{value} \cdot embedding(i)$

where $embedding(i)$ is the embedding representation of each token, $i$, in the training vocabulary.

The weight matrices $W_{query}$, $W_{key}$ and $W_{value}$ are initialised with random (continuous) values drawn from a standard normal distribution. Each one has shape $(\text{dimension of vectors}, \text{embedding dimension})$.


In [64]:
# the output must be three matrices of shapes:
# - wq.shape: (dim_query_vectors, dim_embedding).
# - wk.shape: (dim_key_vectors, dim_embedding).
# - wv.shape: (dim_value_vectors, dim_embedding).
import torch
import torch.nn.init as init

def create_matrices(embeddings_eng, dim_query_vectors, dim_key_vectors, dim_value_vectors):
    # the second axis of the embeddings is the embedding dimension (16 here).
    dim_embedding = embeddings_eng.shape[1]
    # each weight matrix is filled with values drawn from a standard normal distribution.
    W_query = torch.nn.Parameter(torch.empty(dim_query_vectors, dim_embedding))
    init.normal_(W_query, mean=0, std=1)
    W_key = torch.nn.Parameter(torch.empty(dim_key_vectors, dim_embedding))
    init.normal_(W_key, mean=0, std=1)
    W_value = torch.nn.Parameter(torch.empty(dim_value_vectors, dim_embedding))
    init.normal_(W_value, mean=0, std=1)
    return W_query, W_key, W_value

In [65]:
# test the function.

# we will assume a value of 8 for computational simplicity, although in the
# original paper the authors used a value of 64.
dim_query_vectors = 8
dim_key_vectors = 8
dim_value_vectors = 8

# this generates three matrices of random values, each of shape (dim_*_vectors, embedding dimension).
wq, wk, wv = create_matrices(embeddings_eng, dim_query_vectors, dim_key_vectors, dim_value_vectors)

print(wq.shape, wk.shape, wv.shape)

torch.Size([8, 16]) torch.Size([8, 16]) torch.Size([8, 16])


In [68]:
def attention_values(wq, wk, wv):
  embeddings = torch.transpose(embeddings_eng, 0, 1)
  queries = torch.matmul(wq, embeddings)
  keys = torch.matmul(wk, embeddings)
  values = torch.matmul(wv, embeddings)

  return queries, keys, values

In [69]:
# test the function.

queries, keys, values = attention_values(wq, wk, wv)

print(queries.shape, keys.shape, values.shape)

torch.Size([8, 9650]) torch.Size([8, 9650]) torch.Size([8, 9650])


In [70]:
def generate_attention_weights(queries, keys):
  dot_product = torch.matmul(queries, keys.T)
  attention_weights = torch.nn.functional.softmax(dot_product, dim=-1)
  return attention_weights

In [71]:
# test the function.

attention_weights = generate_attention_weights(queries, keys)
attention_weights.shape

torch.Size([8, 8])

In [72]:
import torch.nn.functional as F

def normalize_attention_weights(queries, keys):
  # compute the dot product of queries and keys.
  dot_product = torch.matmul(queries, keys.T)

  # apply softmax to get the initial attention weights.
  attention_weights = torch.nn.functional.softmax(dot_product, dim=-1)

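  # normalize the attention weights. softmax rows already sum to 1, so this
  # division leaves the values unchanged (up to floating-point rounding).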
  normalized_attention_weights = attention_weights / torch.sum(attention_weights, dim=-1, keepdim=True)
  return normalized_attention_weights

In [73]:
# test the function.
normalized_attention_weights = normalize_attention_weights(queries, keys)
normalized_attention_weights # expected output: tensor([[...

tensor([[0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0.]], grad_fn=<DivBackward0>)

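We have generated the normalized attention weights for the English partition of the training *corpus*. Note that, because `queries` and `keys` are stored with shape (vector dimension, vocabulary size), the product `queries @ keys.T` is an 8 × 8 matrix over vector dimensions rather than a token-by-token matrix, and its rows are close to one-hot because the unscaled dot products are large.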
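For comparison, the paper defines attention as $\text{softmax}(QK^T / \sqrt{d_k})\,V$, where every token is scored against every other token. The following is a minimal sketch of that formula on toy data, not a drop-in replacement for the functions above: the sizes, the random inputs, and the layout with one row per token are all assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_tokens, d_model, d_k = 5, 16, 8      # toy sizes chosen for illustration
x = torch.randn(n_tokens, d_model)     # one embedding row per token

# Projection matrices stored as (d_model, d_k) so that x @ W gives one row per token.
w_q = torch.randn(d_model, d_k)
w_k = torch.randn(d_model, d_k)
w_v = torch.randn(d_model, d_k)

q, k, v = x @ w_q, x @ w_k, x @ w_v    # each has shape (n_tokens, d_k)

scores = q @ k.T / math.sqrt(d_k)      # (n_tokens, n_tokens): every token scored against every token
weights = F.softmax(scores, dim=-1)    # each row sums to 1
context = weights @ v                  # (n_tokens, d_k): weighted mix of value vectors

print(weights.shape, context.shape)
print(weights.sum(dim=-1))
```

Dividing by $\sqrt{d_k}$ keeps the scores in a range where the softmax is less likely to saturate into near one-hot rows.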

## 4. Train the transformer model

So far, we have followed the steps to implement the attention mechanism used in the encoder part of a transformer model. First, we obtained the tokens in English, then we created the vector embeddings that represent those tokens. If we were implementing a complete transformer architecture from scratch, we would also create vector embeddings for the tokens in Spanish. These two sets of embeddings would be part of the **encoder** of our transformer. This encoder would need further training so that the numeric representation of our tokens would model the semantic relations in the original dataset.


Next, we would implement the **decoder** part of the model (as the name suggests, the decoder takes as input a numeric representation of $n$ tokens, and outputs a string using that input).

Here, we will not write the decoder, nor the transformer architecture itself. We will take advantage of PyTorch's `Transformer` class, which implements the architecture. Its constructor takes, among others, the following arguments:

1. `d_model`: This integer is the length of the vectors in the transformer. The paper "Attention Is All You Need" (2017) uses 512; since we follow it only loosely, we use 16. All of the vectors in our model have this length.
2. `nhead`: Number of attention heads. This varies from model to model, but it is usually higher than 1. Here, we set it to 1 for computational simplicity.
3. `num_encoder_layers`: Number of encoder layers. This varies from model to model, but it is usually higher than 1. Here, we set it to 1 for computational simplicity.
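The arguments above can be tried out in isolation. The sketch below instantiates `torch.nn.Transformer` with the same small settings and feeds it random tensors in place of real embeddings; `batch_first=True` and `dim_feedforward=64` are assumptions chosen for this example and are not the settings used by the class defined later.

```python
import torch
from torch.nn import Transformer

torch.manual_seed(0)

# batch_first=True is an assumption made here so tensors are (batch, sequence, d_model);
# by default nn.Transformer expects (sequence, batch, d_model).
toy_transformer = Transformer(
    d_model=16,
    nhead=1,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=64,   # assumed: 4 * d_model, following the ratio used in the paper
    batch_first=True,
)

src = torch.randn(2, 16, 16)  # 2 source sequences, 16 positions, 16 features each
tgt = torch.randn(2, 12, 16)  # 2 target sequences, 12 positions, 16 features each

out = toy_transformer(src, tgt)
print(out.shape)  # one output vector per target position
```

The output has the shape of the target input: one `d_model`-sized vector per target position, which a final linear layer can then project onto the vocabulary.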


The following class, `Seq2SeqTransformer`, reimplements some of the earlier steps and produces a ready-to-use seq2seq model. Although we have already implemented part of the encoder, we will use PyTorch's `nn.Embedding` layer instead: it is easier to assemble the `Seq2SeqTransformer` class from the output of `nn.Embedding` than to fit in our own embeddings (we are taking advantage of PyTorch's abstractions).

In [74]:
########################### DO NOT CHANGE THIS! ##############################
# run this block that puts together all the parts that we defined before.

import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn import Transformer, Module, Linear

# in the original transformer model ("Attention is All You Need", Vaswani et al., 2017),
# the hidden dimension of the feed-forward network is 4 times the hidden dimension
# of the model. The authors used a model's hidden dimension of 512, so their
# feed-forward network's hidden dimension is 2048. Here, we use one head and one
# layer for simplicity.
class Seq2SeqTransformer(Module):
    def __init__(self, num_tokens, embed_size=16, num_heads=1, num_layers=1):
        super(Seq2SeqTransformer, self).__init__()
        self.encoder = Embedding(num_tokens, embed_size)
        self.transformer = Transformer(d_model=embed_size, nhead=num_heads,
                                       num_encoder_layers=num_layers,
                                       num_decoder_layers=num_layers)
        self.decoder = Linear(embed_size, num_tokens)

    def forward(self, src, tgt):
        src = self.encoder(src)
        tgt = self.encoder(tgt)
        output = self.transformer(src, tgt)
        return self.decoder(output)

class TranslationDataset(Dataset):
    def __init__(self, src_data, tgt_data):
        self.src_data = src_data
        self.tgt_data = tgt_data

    def __len__(self):
        return len(self.src_data)

    def __getitem__(self, idx):
        return self.src_data[idx], self.tgt_data[idx]

In [75]:
# Lets print one sentence pair in our training partition to remember how they looked.
sentence_pairs_train[0]

['here, unfortunately, the european parliament has bowed to the interests of industry.',
 'en este caso, desgraciadamente, el parlamento europeo se ha plegado a los intereses de la industria.']

In [79]:
# Create an object, sentence_pairs_train_ids_eng, that contains the
# IDs of every token in every sentence of the English partition.
# Then, create another object, sentence_pairs_train_ids_spa, for the Spanish partition.
# For example:
# sentence_pairs_train_ids_eng = [[101, 3452], [59, 53, 4628, 1234, 13942], ...]
# sentence_pairs_train_ids_spa = [[11, 900,...], [77, 435,...], ...]
def tokenize_sentences_with_dict(sentence_pairs, tokens_and_ids_eng, tokens_and_ids_spa):
    sentence_pairs_train_ids_eng = []
    sentence_pairs_train_ids_spa = []
    
    for pair in sentence_pairs:
        # Split both sentences on whitespace. Note that this differs from the nltk
        # tokenization used to build the vocabularies, so words with attached
        # punctuation (e.g. "here,") are not in the dictionaries and are skipped below.
        tokens_eng = pair[0].split()
        tokens_spa = pair[1].split()
        
        # Convert tokens to IDs, keeping only the tokens found in each dictionary
        ids_eng = [tokens_and_ids_eng[token] for token in tokens_eng if token in tokens_and_ids_eng]
        ids_spa = [tokens_and_ids_spa[token] for token in tokens_spa if token in tokens_and_ids_spa]
        
        # Append IDs to the respective lists
        sentence_pairs_train_ids_eng.append(ids_eng)
        sentence_pairs_train_ids_spa.append(ids_spa)
    
    return sentence_pairs_train_ids_eng, sentence_pairs_train_ids_spa

# Example usage:
sentence_pairs_train_ids_eng, sentence_pairs_train_ids_spa = tokenize_sentences_with_dict(sentence_pairs_train, tokens_and_ids_eng, tokens_and_ids_spa)

[5157, 792, 6760, 11436, 1723, 9941, 6706, 11179, 13163, 6706, 642, 12803, 9932, 13786, 11807, 4067, 6397, 11179, 9932, 10056, 11455, 3849, 7001, 4067, 9872, 6706]


In [81]:
# Create an object, sentence_pairs_train_ids_eng_padded, that contains
# the IDs of every token in English, but, this time, all the lists must have the
# same length: 16. If a list has fewer elements, add zeros until its length is 16;
# longer lists are truncated to 16. Note that 0 is also the ID of a real token.
# Then, create another object, sentence_pairs_train_ids_spa_padded, for the Spanish partition.
# For example:
# sentence_pairs_train_ids_eng_padded = [[101, 3452, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [59, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...]
def pad_sentences(sentence_pairs_ids, max_length=16):
    padded_sentence_pairs = []
    for sentence_ids in sentence_pairs_ids:
        if len(sentence_ids) >= max_length:
            padded_sentence = sentence_ids[:max_length]  # Truncate if longer than max_length
        else:
            padded_sentence = sentence_ids + [0] * (max_length - len(sentence_ids))  # Pad with zeros
        padded_sentence_pairs.append(padded_sentence)
    return padded_sentence_pairs

# Example usage:
sentence_pairs_train_ids_eng_padded = pad_sentences(sentence_pairs_train_ids_eng)
sentence_pairs_train_ids_spa_padded = pad_sentences(sentence_pairs_train_ids_spa)
print(sentence_pairs_train_ids_eng_padded[0])

[3276, 4712, 1172, 7313, 5361, 266, 3276, 3115, 3956, 0, 0, 0, 0, 0, 0, 0]


In [95]:
from torch.nn import CrossEntropyLoss

num_tokens = max([len(vocabulary_train_eng), len(vocabulary_train_spa)])

loss_fn = CrossEntropyLoss()
model = Seq2SeqTransformer(num_tokens)
optimizer = optim.Adam(model.parameters())



In [96]:
type(sentence_pairs_train_ids_eng_padded[0])

list

In [100]:
sentence_pairs_train_ids_eng_padded=torch.tensor(sentence_pairs_train_ids_eng_padded)
sentence_pairs_train_ids_spa_padded=torch.tensor(sentence_pairs_train_ids_spa_padded)
dataset_train = TranslationDataset(sentence_pairs_train_ids_eng_padded, sentence_pairs_train_ids_spa_padded)
dataloader_train = DataLoader(dataset_train, batch_size=32, shuffle=True)

In [101]:
print(tensor_ids_eng.shape, tensor_ids_spa.shape)

torch.Size([9650]) torch.Size([14043])


For the next steps, you will need the `src_data` and `tgt_data` objects.

The first one must contain the embeddings that we generated in section 2 of this notebook. You must generate the second set of embeddings for the Spanish sentences, since we did not obtain those in the first part of the assignment.
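Before the full training loop, it may help to look at a single optimisation step in isolation. The sketch below is one common way to set it up, under several assumptions that differ from the loop in this notebook: toy sizes, `batch_first=True`, a shifted target (teacher forcing), a causal mask, and `ignore_index` for padding. All variable names are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab_size, d_model, pad_id = 20, 16, 0   # toy values; pad_id=0 mirrors the zero padding used above

embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=1, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=32,
                             dropout=0.0, batch_first=True)
project = nn.Linear(d_model, vocab_size)

params = list(embed.parameters()) + list(transformer.parameters()) + list(project.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)   # padded positions do not contribute to the loss

src = torch.randint(1, vocab_size, (4, 6))           # (batch, source length) of token IDs
tgt = torch.randint(1, vocab_size, (4, 7))           # (batch, target length) of token IDs

# Teacher forcing: the decoder sees tokens 0..n-2 and is trained to predict tokens 1..n-1.
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
# Causal mask so that a target position cannot attend to later target positions.
causal_mask = transformer.generate_square_subsequent_mask(tgt_in.size(1))

optimizer.zero_grad()                                # clear gradients left over from a previous step
hidden = transformer(embed(src), embed(tgt_in), tgt_mask=causal_mask)
logits = project(hidden)                             # (batch, target length - 1, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
loss.backward()
optimizer.step()

print(logits.shape, loss.item())
```

Shifting the target and masking future positions prevents the decoder from simply copying the token it is asked to predict.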

In [139]:
# the following lines train the Seq2SeqTransformer model.
# you do not need to implement any changes here.
# Training loop

# if you have an error with the indices, check the shapes of src and tgt.
# src represents each list of integers, where each list represents each sentence
# in the English training corpus. Each int in every list represents one token.
# this is why we padded the lists --the model() function requires src and tgt
# to be the same size. This means that every sentence in English, and its equivalent
# sentence in Spanish, must have 16 tokens (or 16 ids, to be precise).

for epoch in range(10):  # Suppose we train for 10 epochs
    for data in dataloader_train:
        src, tgt = data
        optimizer.zero_grad()  # reset gradients so they do not accumulate across batches
        output = model(src, tgt)
        loss = loss_fn(output.view(-1, num_tokens), tgt.view(-1))
        print(loss)
        loss.backward()
        optimizer.step()

tensor(6.4055, grad_fn=<NllLossBackward0>)
tensor(6.1898, grad_fn=<NllLossBackward0>)
tensor(6.2072, grad_fn=<NllLossBackward0>)
tensor(6.5708, grad_fn=<NllLossBackward0>)
tensor(6.4400, grad_fn=<NllLossBackward0>)
tensor(6.0372, grad_fn=<NllLossBackward0>)
tensor(6.2967, grad_fn=<NllLossBackward0>)
tensor(6.3066, grad_fn=<NllLossBackward0>)
tensor(6.2440, grad_fn=<NllLossBackward0>)
tensor(5.7619, grad_fn=<NllLossBackward0>)
tensor(6.2071, grad_fn=<NllLossBackward0>)
tensor(6.4629, grad_fn=<NllLossBackward0>)
tensor(6.2432, grad_fn=<NllLossBackward0>)
tensor(6.5824, grad_fn=<NllLossBackward0>)
tensor(6.0679, grad_fn=<NllLossBackward0>)
tensor(6.1716, grad_fn=<NllLossBackward0>)
tensor(6.4192, grad_fn=<NllLossBackward0>)
tensor(6.0330, grad_fn=<NllLossBackward0>)
tensor(6.2386, grad_fn=<NllLossBackward0>)
tensor(6.0389, grad_fn=<NllLossBackward0>)
tensor(6.1766, grad_fn=<NllLossBackward0>)
tensor(6.1821, grad_fn=<NllLossBackward0>)
tensor(6.1382, grad_fn=<NllLossBackward0>)
tensor(5.95

## 5. Evaluate the model

It's time to evaluate the performance of our model. For this, we will use the BLEU metric (you can check the details about this metric [here](https://huggingface.co/spaces/evaluate-metric/bleu)).

This phase consists of the following steps:

1. Turn the sentence pairs in the `val_dataset` into suitable inputs for the `model()` function, which predicts IDs from a series of input IDs.
2. Generate the output IDs that represent the translated sentences.
3. Translate the predicted output IDs into strings.
4. Compare the strings obtained with the expected strings.
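To build some intuition for step 4, the sketch below computes the clipped ("modified") n-gram precision that BLEU is built from, using only the standard library. It is a simplified illustration, not the full metric, and the sentences and function names are made up for this example.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Fraction of candidate n-grams found in the reference, counting each one at most
    as often as it occurs in the reference (the modified precision used by BLEU)."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


reference = ["el", "parlamento", "europeo", "y", "la", "comisión"]
good = ["el", "parlamento", "europeo", "y", "la", "comisión"]
degenerate = ["y"] * 6   # a model that repeats one frequent token

print(clipped_precision(good, reference, 1), clipped_precision(good, reference, 2))
print(clipped_precision(degenerate, reference, 1), clipped_precision(degenerate, reference, 2))
```

BLEU combines these precisions for several values of n through a geometric mean (together with a brevity penalty), so a candidate with no matching bigrams scores 0 even if some of its single tokens appear in the reference.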

In [103]:
sentence_pairs_test = sentencePairs(val_dataset)
print(len(sentence_pairs_test), '\n', sentence_pairs_test[0])

1000 
 ['we therefore ask that his programme be a bold one and, if it is, i can assure him that he will have the support of this house in the reform process.', 'por eso, nosotros solicitamos que su programa sea valiente, pues, si lo fuera, le puedo asegurar que este parlamento estará con la comisión en ese proceso de reformas.']


In [129]:
from torchtext.data.metrics import bleu_score

# convert the held-out sentence pairs to padded ID tensors, as done for the training data.
sentence_pairs_test_ids_eng, sentence_pairs_test_ids_spa = tokenize_sentences_with_dict(sentence_pairs_test, tokens_and_ids_eng, tokens_and_ids_spa)
sentence_pairs_test_ids_eng_padded = pad_sentences(sentence_pairs_test_ids_eng)
sentence_pairs_test_ids_spa_padded = pad_sentences(sentence_pairs_test_ids_spa)
sentence_pairs_test_ids_eng_padded = torch.tensor(sentence_pairs_test_ids_eng_padded)
sentence_pairs_test_ids_spa_padded = torch.tensor(sentence_pairs_test_ids_spa_padded)

# source and target data for evaluation.
src_data_eval = sentence_pairs_test_ids_eng_padded
tgt_data_eval = sentence_pairs_test_ids_spa_padded

In [133]:
# Set the model to evaluation mode
model.eval()

# Store the generated translations and the target translations
generated_translations = []
target_translations = [[tgt] for tgt in tgt_data_eval]  # bleu_score expects a list of lists

# Disable gradient computation
with torch.no_grad():
    for src, tgt in zip(src_data_eval, tgt_data_eval):
        # Forward pass
        src = src.unsqueeze(0)  # src is already a tensor; add a batch dimension
        output = model(src, src)  # Use src as tgt for evaluation
        output = output.argmax(dim=-1)  # Get the token IDs of the generated translations

        # Store the generated translations
        generated_translations.append(output.tolist())
print(len(generated_translations))



1000


In [159]:
generated_translations_strings = []
# invert the Spanish dictionary so that IDs map back to tokens.
tokens_and_ids_spa_reversed = {v: k for k, v in tokens_and_ids_spa.items()}
for sentence_ids in generated_translations:
    sentence_strings = []
    # sentence_ids has a leading batch dimension of size 1.
    for token_id in sentence_ids[0]:
        # map each predicted ID to its Spanish token.
        translated_token = tokens_and_ids_spa_reversed[token_id]
        sentence_strings.append(translated_token)
    # keep each translation as a list of tokens, as expected by bleu_score.
    generated_translations_strings.append(sentence_strings)
print(generated_translations_strings[0])
print(tokens_spa_test[0])

['y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y']
['por', 'eso', ',', 'nosotros', 'solicitamos', 'que', 'su', 'programa', 'sea', 'valiente', ',', 'pues', ',', 'si', 'lo', 'fuera', ',', 'le', 'puedo', 'asegurar', 'que', 'este', 'parlamento', 'estará', 'con', 'la', 'comisión', 'en', 'ese', 'proceso', 'de', 'reformas', '.']


In [160]:
# Compute the BLEU score, averaged over the evaluated sentences
generated_translations_tokens = generated_translations_strings
num_evaluated = len(generated_translations_tokens)
s = 0
for i in range(num_evaluated):
    # bleu_score expects a list of candidates and, for each candidate, a list of references.
    s += bleu_score([generated_translations_tokens[i]], [[tokens_spa_test[i]]])

# divide by the number of sentences actually scored.
score = s / num_evaluated

print(f"BLEU Score: {score}")

BLEU Score: 0.0


## 6. Visualization of the attention mechanism

Finally, we will use `bertviz` to visualize the attention mechanism, and gain further intuition into this technique.

In [163]:
# visualize how the attention mechanism works in a pre-trained transformer model.
# This block should display an interactive visualization that allows you to view
# the attention in the encoder (source language), decoder (target language), and
# cross-attention (how the source and target strings are mapped).
# Explore the cross-attention visualizations using the menu in the widget.


from transformers import AutoTokenizer, AutoModel, utils
# link to the original bertviz repository: https://github.com/jessevig/bertviz?tab=readme-ov-file#encoder-decoder-models-bart-t5-etc
from bertviz import model_view
utils.logging.set_verbosity_error()  # suppress standard warnings.

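# specify the model name.
# BART is a pre-trained encoder-decoder (sequence-to-sequence) transformer.
# note: `facebook/bart-base` was pre-trained on English text with a denoising
# objective and is not fine-tuned for translation, so it serves here only to
# illustrate the attention patterns.
# find popular HuggingFace models here: https://huggingface.co/models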
model_name = "facebook/bart-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

# try a sentence in the corpus, and its translated pair.
input_sentence = "This is a test sentence."
output_sentence = "Esto es una oración de prueba."


encoder_input_ids = tokenizer(input_sentence, return_tensors="pt", add_special_tokens=True).input_ids
# tokenize the target sentence with `text_target` (the `as_target_tokenizer()` context manager is deprecated)
decoder_input_ids = tokenizer(text_target=output_sentence, return_tensors="pt", add_special_tokens=True).input_ids

outputs = model(input_ids=encoder_input_ids, decoder_input_ids=decoder_input_ids)

encoder_text = tokenizer.convert_ids_to_tokens(encoder_input_ids[0])
decoder_text = tokenizer.convert_ids_to_tokens(decoder_input_ids[0])

model_view(
    encoder_attention=outputs.encoder_attentions,
    decoder_attention=outputs.decoder_attentions,
    cross_attention=outputs.cross_attentions,
    encoder_tokens=encoder_text,
    decoder_tokens=decoder_text
)


## 7. Import a pre-trained model for translation

In steps 0-5, we implemented and evaluated our own transformer model. In step 6, we imported a pre-trained `BART` model and visualized its attention for a sentence pair. As a final step, we load an already trained multilingual model, `mBART-50`, and experiment with how it translates a sentence from Spanish to English.

In [167]:
from transformers import MBartForConditionalGeneration, MBart50Tokenizer

mbart_checkpoint = 'facebook/mbart-large-50-many-to-many-mmt'

# mBART-50 checkpoints pair with the MBart50Tokenizer class; here we translate Spanish -> English
tokenizer = MBart50Tokenizer.from_pretrained(mbart_checkpoint, src_lang="es_XX", tgt_lang="en_XX")
input_text = "Pero es necesario no someterlas a objetivos y a aspiraciones de una política más general, social y económica negativa, y que estas desarrollen su papel independiente."
expected_output = "They must, however, not be subordinated to the objectives and aspirations of a more generally negative economic and social policy, but must develop their own self-sufficient role."

inputs = tokenizer(input_text, return_tensors="pt")
model = MBartForConditionalGeneration.from_pretrained(mbart_checkpoint)

# Force the first generated token to be the target-language code
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
final_str = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

print(final_str, '\n', expected_output)

However, it is necessary not to subject them to objectives and aspirations of a more general, negative social and economic policy, and that these develop their independent role. 
 They must, however, not be subordinated to the objectives and aspirations of a more generally negative economic and social policy, but must develop their own self-sufficient role.


How can this project be extended for fine-tuning?

We would preprocess and tokenise the parallel data with the same mBART tokenizer used in the code above. We would then train with a suitable loss function (such as cross-entropy over the target tokens) and optimiser (such as Adam), feeding batches through dataloaders and using the model classes provided by the `transformers` library. Finally, we would evaluate the fine-tuned model with a metric such as the BLEU score, following the same methodology as above.
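
The fine-tuning recipe described above could be sketched roughly as follows. This is an untested outline under several assumptions: the helper name `fine_tune_mbart` and all hyperparameter defaults are hypothetical, `AdamW` is swapped in as a common variant of Adam, and the installed `transformers` release is assumed to provide `MBart50Tokenizer` and the `text_target` tokenizer argument. The cross-entropy loss is not written out explicitly because the model computes it internally when `labels` are passed.

```python
def fine_tune_mbart(pairs,
                    checkpoint="facebook/mbart-large-50-many-to-many-mmt",
                    src_lang="es_XX", tgt_lang="en_XX",
                    epochs=1, batch_size=4, lr=3e-5, max_length=128):
    """Fine-tune mBART-50 on a list of (source, target) sentence pairs.

    Hypothetical helper: names and default values are illustrative only.
    """
    # Heavy imports live inside the function so that defining it stays cheap
    import torch
    from torch.utils.data import DataLoader
    from transformers import MBartForConditionalGeneration, MBart50Tokenizer

    tokenizer = MBart50Tokenizer.from_pretrained(
        checkpoint, src_lang=src_lang, tgt_lang=tgt_lang)
    model = MBartForConditionalGeneration.from_pretrained(checkpoint)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    def collate(batch):
        sources, targets = zip(*batch)
        # Tokenize both sides; the target IDs are returned under the "labels" key
        enc = tokenizer(list(sources), text_target=list(targets),
                        padding=True, truncation=True,
                        max_length=max_length, return_tensors="pt")
        # Exclude padding positions from the cross-entropy loss
        enc["labels"][enc["labels"] == tokenizer.pad_token_id] = -100
        return enc

    loader = DataLoader(pairs, batch_size=batch_size, shuffle=True,
                        collate_fn=collate)

    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # cross-entropy over the target tokens
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model, tokenizer
```

With the corpus loaded in step 0, the pairs could be built with something like `list(zip(initial_dataset['es'][:1000], initial_dataset['en'][:1000]))`. A model of this size realistically needs a GPU and a held-out validation split, both of which this sketch leaves out.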

