# Assignment 4

This time we will implement the **attention mechanism** --one of the most important techniques in the field of NLP. Thanks to it, NLP researchers have been able to create the impressive **large language models** like ChatGPT and Claude.

To see this mechanism in action, we will tackle the task of **sequence to sequence** translation using a neural architecture called the **transformer** (the name *sequence-to-sequence* comes from the fact that the input is a string -or sequence- in a given language, and the output is an equivalent sequence but in another language).

The **transformer architecture** is a combination of various modules that manipulate strings in different ways. To this day, this software architecture remains the state of the art in NLP (although many variations are used nowadays). It was first proposed in 2017, and it is based on the self-attention mechanism $^{1}$.

In this assignment, we will implement a transformer model with the self-attention mechanism. Currently, language technologies make use of complex transformers with many layers and attention mechanisms --here, we will write a simple one-layer transformer.

The pipeline we will follow in this assignment is the following:

1. Prepare our data.
2. Get the embeddings of the training partition.
3. Write the elemnents of the self-attention mechanism.
4. Train a transformer model
5. Evaluate the model.
6. Visualize the attention mechanism.

$^{1}$ *The original publication can be found [here](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).*

*(If you want to learn more about the Transformer arquitecture, you can read [this](https://nlp.seas.harvard.edu/2018/04/03/attention.html) post in which the authors implement it step-by-step)*

In [1]:
%%capture 
! pip install datasets
! pip install transformers
! pip install bertviz

Capture added to remove prints, as shown here: https://stackoverflow.com/questions/23692950/how-do-you-suppress-output-in-jupyter-running-ipython

In [2]:
#Warning removal from https://stackoverflow.com/questions/14463277/how-to-disable-python-warnings
import warnings
warnings.filterwarnings('ignore') #Turn off warnings to make it more readable

In [3]:
import torch as nn
import numpy as np
import random
import gc

random.seed(42)
nn.manual_seed(42)

<torch._C.Generator at 0x213ca68d470>

## 0. Load the dataset

We will load a datataset designed for machine translation. This task takes one sequence in a given language, and outputs an equivalent sentence in another language.

To load the dataset, we will make use of `huggingface`, a very popular machine learning platform where various models, datasets and applications are hosted.

In [4]:
#%%cython
import datasets

# more details about the dataset can be found here:
# https://huggingface.co/datasets/Thermostatic/parallel_corpus_europarl_english_spanish
dataset = datasets.load_dataset("Thermostatic/parallel_corpus_europarl_english_spanish")
dataset = dataset['train']
print(dataset, type(dataset))

Dataset({
    features: ['en', 'es'],
    num_rows: 1965734
}) <class 'datasets.arrow_dataset.Dataset'>


## 1. Data pre-processing

### 0.5 pts. - Partitions

Once you have loaded the dataset into memory, create three different partitions from the `dataset` object. These partitions will be `train_dataset`, `test_dataset`, and `val_dataset`. Their sizes will be 70%, 20% and 10% of the original size of `dataset`.

Hint: Since the datasets hosted in
🤗 Hugging Face make use of Pytorch, use `torch.utils.data` to easily split the `dataset` object into the three partitions.

In [5]:

# 0.5 pts: split the corpus into three partitions.
from torch.utils.data import random_split

# select the first 10000 rows in the original corpus.
dataset = dataset.select(range(10000))

new_dataset_size = len(dataset)

In [6]:

train_dataset, test_dataset, val_dataset = random_split(dataset, [7000,2000,1000]) #Split the dataset 70% 20% 10%

In [7]:
# run this line to check that the sizes of your partitions are correct.
# expected output: 7000 2000 1000
print(len(train_dataset), len(test_dataset), len(val_dataset))

7000 2000 1000


In [8]:
train_dataset[4]

{'en': 'They must, however, not be subordinated to the objectives and aspirations of a more generally negative economic and social policy, but must develop their own self-sufficient role.',
 'es': 'Pero es necesario no someterlas a objetivos y a aspiraciones de una política más general, social y económica negativa, y que estas desarrollen su papel independiente.'}

In [9]:
print(type(train_dataset))

<class 'torch.utils.data.dataset.Subset'>


### 0.5 pts. - Prepare the data

Write a function, `sentencePairs`, that returns a list of lists called `sentence_pairs`. The inner lists must contain sentence pairs. These have to be: [*sentence in English*, *sentence in Spanish*]. Ensure that the lists contain only lowercase characters.

For example, using the instance we printed before, `train_dataset[0]`:

```
[['This situation can be achieved over a period of time.', 'Se puede lograr esa situación al cabo de un período de tiempo.']]
```

In [10]:
# 0.5 pts - complete this function.
# don't forget to make all the characters lowercase.
import numpy as np
from torch.utils.data import Subset
from typing import List

def sentencePairs(dataset) -> List[List[str]]:
  #sentences: a function to vectorize the input list of dictionaries.
  #Input: dictionary with an english and spanish sentence
  #Output: the lowercase sentences, returned separately for vectorization to create an english and spanish list
  def sentences(sentence_pair):
      eng_sentence = sentence_pair["en"].lower() #Get the lowercase english sentence
      spn_sentence = sentence_pair["es"].lower() #Get the lowercase spanish sentence
      return eng_sentence, spn_sentence #Return the two sentences

  vectorize_sentences = np.vectorize(sentences) #Vectorize the sentences function
  en_sentence, sp_sentence = vectorize_sentences(dataset) #Get a list of english and spanish sentences through the sentences function
  sentence_pairs = [list(zipped) for zipped in zip(en_sentence, sp_sentence)] #Zip them together, force the zip to be in lists instead of tuples, and put them in a bigger list

  return sentence_pairs #Return the new sentence pair list of lists

In [11]:

# run this line to check that the size of your train list is correct.
# expected output: 7000.
sentence_pairs_train = sentencePairs(train_dataset)
len(sentence_pairs_train)

7000

In [12]:
# run this line to display a sentence pair inside a list.
print(sentence_pairs_train[0])

['nevertheless, these same institutions showed no reluctance to accept turkey as a candidate for membership of the european union, despite its well-known human rights violations.', 'y sin embargo, estas mismas instituciones no dudaron en aceptar la adhesión de turquía a la unión europea, cuando se sabe que en ese país se producen actos de violación de los derechos del hombre.']


## 2. Embeddings

It is time to turn the strings into numerical representations. To achieve this, first you will obtain the vocabulary (<i> the set of all unique tokens </i>) in the training partition using the `get_word_tokens` function. Then, you will assign a unique numerical ID to every token. Finally, you must complete the function, `createEmbeddings`, which maps a token in the vocabulary to a numerical representation.

In [13]:
import numpy as np
# 0.5 pts - obtain the tokens in the training partition
#Note: loading data into a variable takes space, and thus time.
#Another note: calling back to Assignment 2; set().union(*2d list) flattens it at very fast speeds.
def getWordTokens(sentence_pairs: List[List[str]]) -> set:
    #Return the set of unique words, the list of words coming from a flattened set of sentences split by word. 
    #The original version contained another flattening, but the single-language expectation warranted the removal of this step.
    return set(np.unique(list(set().union(*[sentence.split() for sentence in sentence_pairs]))))

In [14]:
# test the function.
# expected output: {'unite', 'ratified,', 'rights,', 'onion', ...
vocabulary_train = getWordTokens([sentence[0] for sentence in sentence_pairs_train])

#Print the first 5 items in the set in the style of the set print, just without printing the whole thing
# since the whole thing takes a lot of memory to print
print("{", end = "")
i = 0
for item in vocabulary_train:
    print(f"'{item}'", end = "")
    if i == 5:
        print("}")
        break
    print(", ", end = "")
    i += 1

{'appeared,', 'observer.', '11,', 'crushed.', 'penal', 'socio-economic'}


In [15]:
print(len(vocabulary_train))

14634


In [16]:
# 0.5 pts - assign an integer ID to every token.
# the function must output a dictionary where each entry is "string_n": unique_id_n
# where "string_n" is every token in the previously created set, and
# "unique_id_n" is the integer ID associated to every token in the vocabulary.

# Hint: You can just convert the vocab to a list and use the list indices as the
# unique IDs for each string.
def createDictFromSet(vocabulary_train):
    #Return the vocabulary with a range of numbers of the same length then formed into a dictionary of format word:id
    return dict(zip(vocabulary_train, range(len(vocabulary_train))))

In [17]:
# test the function.
# expected output: {'unite': 0, 'ratified,': 1, 'rights,': 2, ...
tokens_and_ids = createDictFromSet(vocabulary_train)

#Print the first 5 items of the dictionary in a similar manner to the dictionary print,
# as the whole print here before printed way too much
i = 0
print("{", end = "")
for key, value in tokens_and_ids.items():
    print(f"'{key}': {value}", end = "")
    if i == 5:
        print("}")
        break

    print(", ", end = "")
    i += 1

{'appeared,': 0, 'observer.': 1, '11,': 2, 'crushed.': 3, 'penal': 4, 'socio-economic': 5}


In [18]:
# 0.5 pts - create a torch.tensor with the IDs.
# using the IDs created in the previous step, create a nn.tensor object called 'id_tensor'
# that contains all the IDs.
import torch as nn

def createTensorFromDict(tokens_and_ids : dict) -> nn.tensor:
    #Return a tensor of the id numbers, which could be pulled out from the dictionary using the .values() module
    return nn.as_tensor(list(tokens_and_ids.values()))

In [19]:
# test the function.
tensor_ids = createTensorFromDict(tokens_and_ids)
print(tensor_ids) # expected output: tensor([    0,     1,     2,  ..., 14689, 14690, 14691])
print(tensor_ids.shape) # expected output: torch.Size([14692])

tensor([    0,     1,     2,  ..., 14631, 14632, 14633])
torch.Size([14634])


In [20]:
# this function will take the tensor ids, and generate 16-dimensional vectors
# for every value in the input. Run this function without changing anything.
import torch.nn as nn
def createEmbeddings(tensor_ids):
    embedding_layer = nn.Embedding(tensor_ids.max().item() + 1, 16)
    embeddings = embedding_layer(tensor_ids)
    return embeddings

In [21]:
embeddings = createEmbeddings(tensor_ids)
print(embeddings) # expected output: tensor([[ 1.0272, ...
print(embeddings.shape) # expected output: torch.Size([14692, 16]).

tensor([[ 1.0272, -1.1723,  0.4068,  ..., -1.3069,  0.9697, -0.0504],
        [ 0.6256,  1.1964, -0.6190,  ...,  0.9098,  0.8712,  2.0274],
        [-0.2565, -0.1453,  0.8568,  ..., -0.0066, -0.6608,  1.0100],
        ...,
        [-0.7364, -2.0366, -0.6164,  ...,  0.5109,  0.8284,  0.3069],
        [ 2.2783, -0.4957, -1.5542,  ...,  0.1764, -0.9432, -1.4817],
        [ 1.4046,  0.8429,  0.3188,  ...,  2.2152,  2.0987,  0.0971]],
       grad_fn=<EmbeddingBackward0>)
torch.Size([14634, 16])


---

In [22]:
embeddings.shape

torch.Size([14634, 16])

Now, we have an object that contains a vector of dimension $16$ for each unique ID that represents the tokens in the training corpus.

## 3. Self-attention mechanism

This technique makes use of three matrices: query, key, value. Each matrix is generated by calculating the dot product of:
- $W_{query} * embedding(i)$
- $W_{key} * embedding(i)$
- $W_{value} * embedding(i)$

where $embedding(i)$ is the embedding representation of each token, $i$, in the training corpus.

To generate these matrices, $W_{query}$, $W_{key}$, $W_{value}$, we will assign each one of them random (continuous) values in the range $[\text{dimension of vectors}, \text{vocabulary length}]$.


In [23]:
# 0.5 pts - complete this function that creates the matrices wq, wk,  wv.
# the output must be three matrices of shapes:
# - wq.shape: (dim_query_vectors, dim_vocabulary).
# - wk.shape: (dim_key_vectors, dim_vocabulary).
# - wv.shape: (dim_values_vectors, dim_vocabulary).
import torch

def createMatrices(embeddings, dim_query_vectors, dim_key_vectors, dim_value_vectors):
    embeddings_size = len(embeddings[0])
    W_query = torch.nn.Parameter(torch.as_tensor([[random.uniform(0, 1) for i in range(embeddings_size)] for j in range(dim_query_vectors)]))
    # generate a matrix with random values in the range [dim_key_vectors, dim_vocabulary).
    W_key = torch.nn.Parameter(torch.as_tensor([[random.uniform(0, 1) for i in range(embeddings_size)] for j in range(dim_key_vectors)]))
    # generate a matrix with random values in the range [dim_value_vectors, dim_vocabulary).
    W_value = torch.nn.Parameter(torch.as_tensor([[random.uniform(0, 1) for i in range(embeddings_size)] for j in range(dim_value_vectors)]))

    return W_query, W_key, W_value

In [24]:
# test the function.

# we will assume a value of 8 for computational simplicity, although in the
# original paper the authors used a value of 64.
dim_query_vectors = 8
dim_key_vectors = 8
dim_value_vectors = 8

# this will generate a matrix will random values in the range [dim_query_vectors, dim_vocabulary).
wq, wk, wv = createMatrices(embeddings, dim_query_vectors, dim_key_vectors, dim_value_vectors)

print(wq.shape, wk.shape, wv.shape)
# expected output: (torch.Size([8, 16])), (torch.Size([8, 16])), (torch.Size([8, 16]))

torch.Size([8, 16]) torch.Size([8, 16]) torch.Size([8, 16])


In [25]:
print(embeddings.shape)

torch.Size([14634, 16])


In [26]:
# 0.5 points - complete this function that calculates the attention values, q, k, v.
# use the matrices generated in the previous step.

# HINT: use the matmul function, and the transpose of the embeddings object.
# matmul and the transpose of a tensor are already implemented in PyTorch. You
# only need to go to the documentation and see how they should be used.


def attentionValues(wq, wk, wv):
  queries = torch.matmul(wq, embeddings.T).T
  keys = torch.matmul(wk, embeddings.T).T
  values = torch.matmul(wv, embeddings.T).T

  return queries, keys, values

In [27]:
# test the function.
queries, keys, values = attentionValues(wq, wk, wv)

# expected output: torch.Size([14634, 8]) torch.Size([14634, 8]) torch.Size([14634, 8])
print(queries.shape, keys.shape, values.shape)

torch.Size([14634, 8]) torch.Size([14634, 8]) torch.Size([14634, 8])


In [28]:
print(queries.shape, keys.T.shape)

torch.Size([14634, 8]) torch.Size([8, 14634])


In [29]:
# 0.5 pts - complete this function that generates the attention weights.

# HINT: check the shapes of the tensors (queries, keys) before completing this
# function --print(queries.shape, keys.T.shape).
# this will give you an intuition of how you must handle the matrix multiplication
# given the dimensions of each matrix.
def generateAttentionWeights(queries, keys):

  return torch.matmul(queries, keys.T)

In [30]:
# test the function.

attention_weights = generateAttentionWeights(queries, keys)
attention_weights.shape # expected output: torch.Size([14634, 14634]).

torch.Size([14634, 14634])

In [31]:
#%%cython
# 0.5 pts - complete this function that generates the attention weights.

# HINT: check the shapes of the tensors (queries, keys) before completing this
# function (use print(queries.shape, keys.T.shape) ).
# this will give you an intuition of how you must handle the matrix multiplication
# given the dimensions of each matrix.

import torch
import torch.nn.functional as F

def normalizeAttentionWeights(queries, keys):
  # compute the dot product of queries and keys.
  #dot_product =

  # apply softmax to get the initial attention weights.
  #weights =

  # normalize the attention weights.
  #normalized_attention_weights =
  return F.normalize(F.softmax(torch.matmul(queries, keys.T), dim = 1))

In [32]:
# test the function.
normalized_attention_weights = normalizeAttentionWeights(queries, keys)
normalized_attention_weights # expected output: tensor([[4.4391e-15, ...

tensor([[4.3804e-06, 3.9670e-05, 2.8301e-05,  ..., 3.7517e-06, 8.7340e-05,
         1.5635e-07],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         2.7858e-42],
        [3.1900e-14, 1.1813e-07, 1.3488e-12,  ..., 3.4536e-16, 1.7820e-15,
         1.8093e-07],
        ...,
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 2.2141e-43, 1.4013e-45,
         0.0000e+00],
        [5.0956e-29, 7.0065e-45, 3.3819e-32,  ..., 3.8511e-22, 9.1468e-24,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00]], grad_fn=<DivBackward0>)

In [33]:
del normalized_attention_weights, attention_weights, embeddings, queries, keys, values, wq, wk, wv, tensor_ids, vocabulary_train, tokens_and_ids #Clear out some memory
gc.collect() #Collect garbage data

0

## 4. Train the transformer model

Now, we will put together all the elements in that we have implemented before, and then we will train our transformer model.

The following function reimplements some of the previous things we did before, and generates a ready-to-use seq2seq model.

Optimization Sources:

Zero-Grad Speedup: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html

Cytorch: https://medium.com/@bryan.santos/lord-of-the-notebooks-optimizing-jupyter-9cc168debcc7

Time difference: two epochs being too slow to even run overnight to around 100 seconds per batch of 32 entries.

Pytorch Lightning: https://lightning.ai/docs/pytorch/stable/starter/introduction.html#

Time Difference: 363 minutes per epoch to around 5 minutes per epoch with outliers

Note on optimization: this started out of necessity. My notebook cannot handle more than batch size 4, which made it chug. Google Colab would stop from inactivity from trying to train it. As such, the following changes were made:

Zero-Grad Update: switches optimizer.zero_grad to a for loop. This was an optimization shown on the Pytorch website. It removed some of the memory overhead required to run the function.

Cython Compilation: This not only solved memory issues involved with stepping up batch size (allowing for a batch size of 32 without problems as opposed to 4), it provided some significant speedup. This is likely due to Cython sending the model to the C compiler, which tends to be fast. 

The trade-off of Cython is how each cell needs every required aspect in it. This includes imports. This makes Cython strong for functions in Jupyter Notebooks. One of the main reasons to use Cython is list comprehension speedup however, so the sentence collection and tensor conversions have also been put into functions for Cython.

Pytorch Lightning: This change caused significant speedup, even when just using it as a wrapper. This change allowed for proper experimentation with the model.

Failed Optimizations:

torch.compile and JIT: linux only. I did not want to change setups midway through.

intel_extension_for_pytorch: linux only. This also means my Intel GPU could not be used.

Pytorch Autograd: added overhead was worse than the alternative.

Pytorch Quantization: the accuracy loss was too much for this use case.

Bitsandbytes Adam Optimizer: Would only work when Pytorch Quantization was being used. It required GPU-enabled Pytorch otherwise, and I did not want to work on GPU enabling when my GPU will not even work for it.

Distributed/Parallel Processing: Jupyter Notebooks really do not like this. The only way to get it to work is to have a second Python program running to allow for it, which is out of scope for this project

In [34]:
#Note that this is only being added in here for the earlier cells to be run without needing this download.
#Both the dataset download and normalized attention would benefit from its usage as well, but they do not
#require it to run like with the models.

import cython #add in cython compilation for memory management and speed.

In [35]:
%load_ext Cython

In [36]:
#Create a dictionary from the word tokens of the sentence pairs.
#The getWordTokens only has one flattening layer now to get the unique words
#based on the English set, so the Spanish and English sentences have to be flattened
#out here now to accomodate
word_dict = createDictFromSet(getWordTokens(list(set().union(*[sentence for sentence in sentence_pairs_train]))))
word_dict.setdefault("", 0) #Set the default so the test version does not fail for unknown vocabulary

0

In [37]:
%%cython
# 0.5 pts - change the inputs in the following lines, and use the dataset we
# downloaded initially instead of the placeholders. The code written here
# is just an example for you.
#Include imports again, since each cython section needs to have everything it needs in its cell
import numpy as np
import lightning as L
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn import Transformer
from tqdm import tqdm
import time

import logging
logging.getLogger("lightning.pytorch.utilities.rank_zero").setLevel(logging.WARNING)

############################################################## DO NOT CHANGE THIS! ########################################################################################################

# run this block that puts together all the parts that we defined before.

# in the original transformer model ("Attention is All You Need", Vaswani et al., 2017),
# the hidden dimension of the feed-forward network is 4 times the hidden dimension
# of the model. The authors used a model's hidden dimension of 512, so their
# feed-forward network's hidden dimension is 2048. Here, we use one head and one
# layer for simplicity.
class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_tokens, embed_size=512, num_heads=1, num_layers=1):
        super(Seq2SeqTransformer, self).__init__()
        self.encoder = nn.Embedding(num_tokens, embed_size)
        self.transformer = Transformer(d_model=embed_size, nhead=num_heads, num_encoder_layers=num_layers, num_decoder_layers=num_layers)
        self.decoder = nn.Linear(embed_size, num_tokens)
    
    def forward(self, src, tgt):
        src = self.encoder(src)
        tgt = self.encoder(tgt)
        output = self.transformer(src, tgt)
        return self.decoder(output)

class TranslationDataset(Dataset):
    def __init__(self, src_data, tgt_data):
        self.src_data = src_data
        self.tgt_data = tgt_data

    def __len__(self):
        return len(self.src_data)

    def __getitem__(self, idx):
        return self.src_data[idx], self.tgt_data[idx]

########################################################################################################################################################
#No change has been made to the classes. They have only been moved to exist in the Cython cell

#Lightning: the class created to enable Pytorch Lightning
class Lightning(L.LightningModule):

    #Init: initializes the Lightning class
    #Input: the class' self, the model, the number of tokens
    #Output: none
    def __init__(self, model, num_tokens):
        super().__init__()
        self.model = model #Collect the model
        self.num_tokens = num_tokens #Collect the number of tokens

    #Get_model: get the model from the class
    #Input: the class' self
    #Output: the model
    def get_model(self):
        return self.model #Return the model once trained
    
    #Train: train the model; adapted from the version provided with the assignment, but altered into a function to send to Cython
    #Input: the model, the dataloader, the number of tokens
    #Output: the trained model
    def training_step(self, dataloader):
        torch.backends.cudnn.benchmark = True #Allow benchmarking for speedup
        loss_fn = nn.CrossEntropyLoss(ignore_index = 0) #Set up the loss function
        #optimizer = bnb.optim.Adam8bit(self.model.parameters()) #Set up the optimizer
        dataloader = DataLoader(dataloader, batch_size=32, shuffle=True)
        optimizer = optim.Adam(self.model.parameters())
    
        for epoch in range(200):  # number of epochs. You can play with this hyper-parameter and see how it changes the BLEU score.
            plus_one = epoch + 1
            if plus_one%10 == 1:
                start = time.time() #Start a timer to show speed
            for i, (src, tgt) in enumerate(dataloader):  # iterate over batches from the dataloader.
                #Optimizer Zero-Grad but quicker with less memory from https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
                for param in self.model.parameters():
                    param.grad = None #Set the gradients to none
        
                #with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  #Failed autocast line, but kept for documentation           
                output = self.model(src, tgt) #Run an iteration of the model
                loss = loss_fn(output.view(-1, self.num_tokens), tgt.view(-1)) #Determine the loss
                    
                loss.backward() #Run a backward pass with the loss
                optimizer.step() #Step the optimizer forward

            if plus_one%10 == 0:
                print(f"Epoch: {plus_one}, Loss: {loss}, Time for 10 epochs: {(time.time() - start)/60} Minutes.") #Print the loss and time taken per epoch

    #Configure_optimizers: the required class to update optimizers for Pytorch Lightning
    #Input: the class' self variable
    #Output: the optimizer
    def configure_optimizers(self):
        optimizer = optim.Adam(self.model.parameters()) #Set up the optimizer
        return optimizer #Return the optimizer

#Get_tokens: convert a sentence to tokens, then pad them because tensor conversion would not work otherwise
#Input: the sentence, the max size, and the dictionary of tokens.
#Output: the tokens
def get_tokens(sentence, the_max, tokens_and_ids):
    tokens = [tokens_and_ids.get(word, 0) for word in sentence.split()[:the_max]] #Get a list of tokens by searching for them in the dictionary

    #While the token list is not long enough to fit the required structure, pad it out
    while len(tokens) < the_max:
        tokens.append(0) #Add a padding 0
    return tokens #Return the token list

#Get_tensors: Get the tensors for the source and target training data
#Input: The sentence pair, the word dictionary
#Output: The source and target vectors
def get_tensors(sentence_pairs_train, word_dict):
    eng_sentences = [sentence[0] for sentence in sentence_pairs_train] #Pull the English sentences out of the combined set
    spn_sentences = [sentence[1] for sentence in sentence_pairs_train] #Pull the Spanish sentences out of the combined set

    #Get the largest sentence out of all of them for padding purposes
    #Otherwise the tensor conversion will not work
    the_max = int(max(np.percentile([len(sentence) for sentence in eng_sentences],95), np.percentile([len(sentence) for sentence in spn_sentences],95)))

    #Get the English sentences as tokens, turn that into a tensor, and hold it as the source data
    src_data = torch.as_tensor([get_tokens(sentence, the_max, word_dict) for sentence in eng_sentences]) #torch.randint(0, num_tokens, (1000, 10))  # 1000 sequences of length 10.

    #Get the Spanish sentences as tokens, turn that into a tensor, and hold it as the target data
    tgt_data = torch.as_tensor([get_tokens(sentence, the_max, word_dict) for sentence in spn_sentences]) #torch.randint(0, num_tokens, (1000, 10))  # 1000 sequences of length 10.

    return src_data, tgt_data #Return the source and target data

def train_model(sentence_pairs_train, word_dict, batch_size):

    num_tokens = len(word_dict) + 1 #Get the number of tokens
    model = Seq2SeqTransformer(num_tokens) #Start the model

    src_data, tgt_data = get_tensors(sentence_pairs_train, word_dict) #Pull the data tensors from a Cython function

    dataset = TranslationDataset(src_data, tgt_data) #Generate a dataset from the source and target data
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True) #Send this dataset to a dataloader

    lit = Lightning(model, num_tokens)

    trainer = L.Trainer(limit_train_batches=1, max_epochs=1, enable_progress_bar = False, enable_model_summary = False, accelerator = "cpu")
    trainer.fit(model = lit, train_dataloaders = dataloader)

    return lit.get_model(), dataset #Train the model

In [38]:
batch_size = 32 #Set the batch size
model, dataset = train_model(sentence_pairs_train, word_dict, batch_size)

Epoch: 10, Loss: 4.313168525695801, Time for 10 epochs: 6.653059069315592 Minutes.
Epoch: 20, Loss: 1.8662501573562622, Time for 10 epochs: 6.340749474366506 Minutes.
Epoch: 30, Loss: 0.25070813298225403, Time for 10 epochs: 6.7920990546544395 Minutes.
Epoch: 40, Loss: 0.016913574188947678, Time for 10 epochs: 6.58725490172704 Minutes.
Epoch: 50, Loss: 0.017586765810847282, Time for 10 epochs: 6.403577359517415 Minutes.
Epoch: 60, Loss: 0.011657151393592358, Time for 10 epochs: 6.537095550696055 Minutes.
Epoch: 70, Loss: 0.008516669273376465, Time for 10 epochs: 6.42178555727005 Minutes.
Epoch: 80, Loss: 0.0070632812567055225, Time for 10 epochs: 6.442419052124023 Minutes.
Epoch: 90, Loss: 0.002543305279687047, Time for 10 epochs: 6.439032185077667 Minutes.
Epoch: 100, Loss: 0.005694982595741749, Time for 10 epochs: 6.473510328928629 Minutes.
Epoch: 110, Loss: 0.00225905142724514, Time for 10 epochs: 6.479875528812409 Minutes.
Epoch: 120, Loss: 0.0047256010584533215, Time for 10 epochs

## 5. Evaluate the model

It's time to evaluate the performance of our model. For this, we will use the BLEU metric (you can check the details about this metric [here](https://huggingface.co/spaces/evaluate-metric/bleu).)

In [39]:
%%cython
#Include all necessary inputs to send to Cython
import torch
import torch.nn as nn
from tqdm import tqdm
import time

#Depad: Nievely pulls the sentence out from the sentence by checking for double-tokens
#Input: the sentence
#Output: the depadded sentence
def depad(sentence):
    i = 1 #Create an iteration variable for sentence testing
    actual_sentence = [] #Create a list to hold the sentence

    #While there are no double tokens and the sentence has not ended, add the token to the list
    while sentence[i-1] != sentence[i] and i < len(sentence):
        actual_sentence.append(sentence[i-1]) #Add the i-1 token to the list, as the i-th token could still be doubled with token i+1
        i += 1 #Increment the iteration variable

    return actual_sentence #Return the sentence

#Test: runs a test of the data. Adapted from the given function with the assignment in order to sent to Cython to keep the batch size of 32
#Input: the model, the dataloader for the test set
#Output: the Bleu score
def test(model, test_dataloader):
    model.eval()  # set the model to evaluation mode.
    
    references = [] #Create a list of references
    candidates = [] #Create a list of candidates

    #Make sure the model runs without calculating gradients
    with torch.no_grad():
        
        #For each source and target item in the dataloader, add them to the references and candidates list
        for i, (src, tgt) in tqdm(enumerate(test_dataloader)):
            output = model(src, tgt)
            output = output.argmax(dim=-1)  # get the index of the max log-probability.
    
            # convert tensor to list.
            output = output.tolist()
            tgt = tgt.tolist()
    
            # wrap each tgt and output in an additional list.
            references.extend([[depad(t)] for t in tgt])
            candidates.extend([depad(o) for o in output])

            #print(references)
            #print(candidates)
    
            #print(f"{(i+1)*batch_size}/{len(eng_test)}", end = " ")

    return references, candidates #Return the bleu score for the test, calculated using the references and candidates lists

In [40]:
# 0.5 pts - change the inputs in the following lines, and complete the missing
# code and implement the BLEU score. The code here is just an example.

# generate random sequences for test_src_data and test_tgt_data.
# This has been altered to generate them from the test pairs via the previously-used get_tensors Cython function

#Try to get the sentence pairs for the testing dataset. If this does not work, then a previous run collected them already
try:
    sentence_pairs_test = sentencePairs(test_dataset) #Collect the sentences from the test set

#If the test pairs cannot be retreived now, they had already been collected
except:
    a = 1 #Add a random variable to fill the except line
    
test_src_data, test_tgt_data = get_tensors(sentence_pairs_test, word_dict) #Get the source and target data from the get_tensors function

test_dataset = TranslationDataset(test_src_data, test_tgt_data)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

In [41]:
references, candidates = test(model, test_dataloader)

63it [09:06,  8.67s/it]


In [42]:
flipped_dict = dict(zip(word_dict.values(), word_dict.keys())) #Create a token to word dictionary
sentence_pairs_model = model

In [43]:
print(references[100]) #Print the token list for a specified sentence
print([flipped_dict[word] for word in references[100][0]]) #Use the flipping dictionary to show it as a sentence
print(candidates[100]) #Print the predicted token list from the model
print([flipped_dict[word] for word in candidates[100]]) #Use the flipping dictionary to show the output as a sentence

[[31909, 29930, 30722, 3124, 20385, 17805, 3124, 0, 10929, 5506, 10153, 729, 1019, 17847, 0, 17805, 30722, 5678, 16334, 11792, 16016, 20104, 28625]]
['no', 'olvidemos', 'que', 'las', 'elecciones', 'para', 'las', '', 'municipales', 'y', 'los', 'parlamentos', 'regionales', 'se', '', 'para', 'que', 'haya', 'una', 'representación', 'de', 'la', 'población']
[31909, 16016, 30722, 3124, 20104, 17805, 3124, 31909, 32437, 5506, 10153, 729, 19230, 17847, 22124, 17805, 30722, 18276, 16334, 27771, 16016, 20104, 28625, 3124, 9357, 22124, 25991, 22124, 13855]
['no', 'de', 'que', 'las', 'la', 'para', 'las', 'no', 'palabras', 'y', 'los', 'parlamentos', 'vota', 'se', 'if', 'para', 'que', 'sólo', 'una', 'estamos', 'de', 'la', 'población', 'las', 'esta', 'if', 'europeo', 'if', 'programas']


In [44]:
# the following lines evaluate the trained model using the BLEU score.
# you do not need to implement any changes here.
from nltk.translate.bleu_score import corpus_bleu

bleu_score = corpus_bleu(references, candidates) #Test the model to get the bleu score, using a test function that has been sent to Cython so I do not need to change batch size
print(f'Bleu Score: {bleu_score*100:.2f}') #Print the bleu score

Bleu Score: 13.59


This bleu score may seem low, but it is quite good given the comparatively smaller dataset. This function originally gave scores in the high 90's, but further investigations proved that this was due to the padding being accounted for, giving great scores for lines of all zeros. This problem was fixed by two methods:

1. Truncating the data based on the 95th percentile of lengths. This allowed the outliers to be kept without their absurd length causing additional padding in every other sentence. This change also caused significant speedup, making epochs with Pytorch Lightning go from around 5 minutes to around 0.8 minutes (or around 48 seconds).

2.  Adding the ignore_index = 0 line to the loss function. This made the function ignore zero as an input. The tradeoff to this was that the model outputs had lines of another random token to replace the zeros. This is why I added the depad function to the test function, as removing the padding before the bleu score calculation would thus remove the large error caused by padding 0 != padding other token the model decided on. It is a bit nieve in implementation, simply checking for double words to make the call on when the padding had occurred. This, however, accounts for the wide variety of padding tokens I have seen the model choose by not assuming which token would need to be removed.

## 6. Visualization of the attention mechanism

Finally, we will use `bertviz` to visualize the attention mechanism, and gain further intuition into this technique.

In [45]:
# 0.5 pts - complete the missing code and visualize how the attention mechanism
# works in our trained transformer model. This block should display an interactive
# visualization that allows you to view the attention in the encoder (source language),
# decoder (target language), and cross-attention (how the source and target strings
# are mapped). Explore the cross-attention visualizations using the menu in the widget.


from transformers import AutoTokenizer, AutoModel, utils
# link to the original bertviz repository: https://github.com/jessevig/bertviz?tab=readme-ov-file#encoder-decoder-models-bart-t5-etc
from bertviz import model_view
utils.logging.set_verbosity_error()  # suppress standard warnings.

# specify the model name.
# this pre-trained transformer model, BART, was designed specifically for
# machine translation tasks.
# find popular HuggingFace models here: https://huggingface.co/models
model_name = "facebook/bart-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True, low_cpu_mem_usage=True)

# try a sentence in the corpus, and its translated pair.
print(sentence_pairs_train[1])
input_sentence = sentence_pairs_train[1][0] # try a sentence in the corpus.
output_sentence = sentence_pairs_train[1][1] # try a sentence in the corpus.


encoder_input_ids = tokenizer(input_sentence, return_tensors="pt", add_special_tokens=True).input_ids
with tokenizer.as_target_tokenizer():
    decoder_input_ids = tokenizer(output_sentence, return_tensors="pt", add_special_tokens=True).input_ids

outputs = model(input_ids=encoder_input_ids, decoder_input_ids=decoder_input_ids)

encoder_text = tokenizer.convert_ids_to_tokens(encoder_input_ids[0])
decoder_text = tokenizer.convert_ids_to_tokens(decoder_input_ids[0])

model_view(
    encoder_attention=outputs.encoder_attentions,
    decoder_attention=outputs.decoder_attentions,
    cross_attention=outputs.cross_attentions,
    encoder_tokens= encoder_text,
    decoder_tokens = decoder_text,
    include_layers = [0, 1] #Reduce the number of layers for the print
)

['indeed, i am forced to observe that this was not the case when the commission drew up its communication, published on 13 october last year.', 'sin embargo, tengo que constatar que esto no se ha producido en el momento de la preparación del comunicado de la comisión, publicado el pasado 13 de octubre.']


<IPython.core.display.Javascript object>

In [46]:
sentence_pairs_val = sentencePairs(val_dataset)

In [47]:
%reset_selective -f ^(?!sentence_pairs_).*$ #Remove all variables except the sentence_pairs variables

In [48]:
gc.collect() #Collect garbage data

7

Memory issues at this point

    - Loading the Bart model, which adds memory usage to both the browser and the notebook

    - Previous Jupyter Notebook outputs

Fixes added:

    - low_cpu_mem_usage added to Bart load

    - Previous print statements reduced

    - Number of layers for the above visual reduced

    - reset_selective removes all but the required sentence_pair variables

    - Cython compiliation of the Bart model

## 7. Compare the models

In steps 0-6, we implemented our own transformer model In steps 7-8, we imported a pre-trained `BART` model, and then visualized how it encodes the attention mechanism with regards to a sentence-pair.

In [51]:
## %%cython
#Cython was once again needed for memory management

# 3.0 points - Using our dataset, fine-tune the pre-trained BART model (1 point),
# evaluate it using the BLEU metric (1 point), and write a couple of lines in
# which you discuss why you think the model with the better performance achieved
# such performance (1 point). Feel free to check online resources on how to do
# the fine-tuning step.

# you only need to write the missing code where the "..." are.
# Cython changes also needed to be made for memory reasons

from transformers import MBartForConditionalGeneration, Trainer, TrainingArguments, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from datasets import load_metric
from torch.utils.data import Dataset
import numpy as np
from peft import LoraConfig, get_peft_model
import lightning as L
import torch.optim as optim
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel, utils, MBartTokenizer

class TranslationDataset(Dataset):
    def __init__(self, src_data, tgt_data):
        self.src_data = src_data
        self.tgt_data = tgt_data

    def __len__(self):
        return len(self.src_data)

    def __getitem__(self, idx):
        return self.src_data[idx], self.tgt_data[idx]

class UpdatedTranslationDataset(Dataset):
    def __init__(self, text, label):
        self.text = text
        self.label = label

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        return {"input_ids":self.text[idx], "labels":self.label[idx]} #This is the only change from the original. The model would not take it unless these were labeled like this

#Bart_Lightning: the class created to create a Lightning wrapper around the Bart model
class Bart_Lightning(L.LightningModule):

    #Init: initializes the Bart_Lightning class
    #Input: the class' self, the model, the tokenizer, the metric, and the old model
    #Output: none
    def __init__(self, model, tokenizer, metric, old_model):
        super().__init__()
        self.model = model #Set up the model in the class
        self.metric = metric #Set up the metric in the class
        self.tokenizer = tokenizer #Set up the tokenizer in the class
        self.old_model = old_model #Set up the old model's parameters in the class in order to cheese the optimizer
        self.automatic_optimization=False #Disable auto-optimization so I can cheese the optimizer

    def compute_metrics(self, eval_pred):
        predictions, labels = eval_pred
        decoded_preds = [self.tokenizer.batch_decode(prediction, skip_special_tokens=True) for prediction in predictions]
        # this line of code is used to replace all instances of -100 in the labels
        # with the ID of the padding token.
        # -100 is often used in the labels as a special value that indicates that
        # the model should not compute loss for that particular token. This is
        # typically used for padding or other special tokens.
        labels = np.where(labels != -100, labels, self.tokenizer.pad_token_id)
        decoded_labels = [self.tokenizer.batch_decode(label, skip_special_tokens=True) for label in labels]

        # compute BLEU score.
        result = self.metric.compute(predictions = decoded_preds, references = decoded_labels)
        result = {"bleu": result["score"]}

        print(result)
    
        return result

        
    #Train: train the model; adapted from the version provided with the assignment, but altered into a function to send to Cython
    #Input: the model, the data (merged into one variable to pass properly)
    #Output: none
    def training_step(self, train_data):
        # use these hyperparameters.
        
        #Note: the dataloaders sent into Lightning come back out as two tensors for each one, eliminating the dataset structure. The resulting shape is 
        #((train_data ids, train_data labels), (test_data ids, test_data labels)). This was the best way to keep the test data intact during the pass in.
        #This means train_data ids is in train_data[0][0], train_data labels is in train_data[0][1]
        # test_data ids is in train_data[1][0], and test_data labels is in train_data[1][1]

        #These need to be changed into a new dataset for the trainer, but also need to be in shape {"input_ids": data, "labels": data} so the model recognizes them.
        #This is why UpdatedTranslationDataset exists. It does the same thing as the original, just in a way that would make Bart not throw a fit
        
        training_args = Seq2SeqTrainingArguments( #Set up the training arguments
            output_dir='./results', #Load results to a separate directory
            overwrite_output_dir=True, #Overwrite previous outputs
            num_train_epochs=5, #Set the number of epochs
            per_device_train_batch_size=2, #Set the training batch size
            per_device_eval_batch_size=2, #Set the evaluation batch size
            warmup_steps=50, #Set the ramp-up for the learning rate to 100 steps
            weight_decay=0.01, #Bring the weights down a small amount to prevent overfitting
            predict_with_generate=True, #Have the model generate predictions to compare to for metrics
            use_cpu=True, #Use CPU
            evaluation_strategy="epoch", #Have the model validate every epoch
            save_strategy="epoch", #Have the model save every epoch, as that is required for metric_for_best_model
            load_best_model_at_end=True, #Have the best model loaded at the end
            metric_for_best_model="eval_bleu", #Set the bleu metric as the metric to optimize rather than loss
            disable_tqdm=False, #Make sure the bars are allowed to print
            gradient_accumulation_steps = 32, #Extend the batches by lowering accumulation frequency
            eval_accumulation_steps = 32, #Accumulate the evaluations in the same way as the training set
            log_level = "warning", #Escalate the log level to not print all the information
            logging_strategy = "epoch" #Log loss for each epoch
        )

        trainer = Seq2SeqTrainer( #Set up the transformers Bart model trainer
            model = self.model, #Input the Bart model
            args = training_args, #Include the above training arguments
            train_dataset = UpdatedTranslationDataset(train_data[0][0], train_data[0][1]), #Include the training data as explained above
            eval_dataset = UpdatedTranslationDataset(train_data[1][0], train_data[1][1]), #Include the testing data as explained above
            data_collator = DataCollatorForSeq2Seq(self.tokenizer),
            compute_metrics = self.compute_metrics #Send in the compute metrics function provided in the assignment
        )
        
        trainer.train() #Train the Bart model

        #Ideas shown in https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt#fine-tuning-the-model-with-the-trainer-api
        trainer.evaluate(UpdatedTranslationDataset(train_data[2][0], train_data[2][1])) #Evaluate the Bart model

    #Configure_optimizers: the required class to update optimizers for Pytorch Lightning
    #Input: the class' self variable
    #Output: the optimizer
    def configure_optimizers(self):
        optimizer = optim.Adam(self.old_model.parameters()) #Set up an optimizer on the old model's parameters to get around Lightning's requirements, since I am only using it as a wrapper
        return optimizer #Return the optimizer

#Tokenize: tokenize the sentences in sentence_pairs_train per the bart tokenizer
#Input: English-Spanish sentence pairs
#Output: a dataset of the tokenized sentences
def tokenize(sentence_pairs_train, tokenizer):
    eng_sentences = [sentence[0] for sentence in sentence_pairs_train] #Pull the English sentences out of the combined set
    spn_sentences = [sentence[1] for sentence in sentence_pairs_train] #Pull the Spanish sentences out of the combined set

    the_max = int(max(np.percentile([len(sentence) for sentence in eng_sentences],95), np.percentile([len(sentence) for sentence in spn_sentences],95)))
    
    #Tokenize the sentences based on the model's tokenizer. Note the padding, as it avoids errors per 
    #https://discuss.huggingface.co/t/the-model-did-not-return-a-loss-from-the-inputs-only-the-following-keys-logits-for-reference-the-inputs-it-received-are-input-values/25420/18
    train_token = tokenizer(eng_sentences, text_target = spn_sentences, return_tensors="pt", add_special_tokens=True, 
                            padding = "max_length", truncation = True, max_length = the_max)
    return TranslationDataset(train_token["input_ids"], train_token["labels"]) #Return the dataset version of the tokenizers

#Train_bart: Trains the Bart model using Pytorch Lightning
#Input: the sentence pairs for the training, testing, and validation sets, the batch size for the dataloaders, the old model's parameters
#Output: none
def train_bart(sentence_pairs_train, sentence_pairs_test, sentence_pairs_val, batch_size, old_model_params):
    model_name = "facebook/mbart-large-50-many-to-many-mmt" #Pull in the model name from the previous section
    tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-50-many-to-many-mmt', src_lang="en_XX", tgt_lang="es_XX") #Pull down the tokenizer for Cython usage
    model = MBartForConditionalGeneration.from_pretrained(model_name, output_attentions=False) #Pull the model from HuggingFace
    #for param in model.model.parameters():
    #    param.requires_grad = False
    model.enable_input_require_grads() #Force the LoRA to have grads

    lora_config = LoraConfig( #Create a configuration for LoRA to be passed to PEFT
        r=1, #Set a rank of 1 or 2, which is small but should be fine for this computer
        lora_alpha=2, #Set the alpha to 2 times the rank, as having a double alpha appeared to provide a better bleu value
        target_modules="all-linear", #Let the model know that all layers are linear, thus it does not need anthing else special
        task_type="SEQ_2_SEQ_LM", #Tell LoRA we are dealing with a Seq2Seq task
    )
    
    model = get_peft_model(model, lora_config) #Send the model to PEFT for parameter tuning
    
    metric = load_metric("sacrebleu") #Load the bleu metric
    
    train_data = tokenize(sentence_pairs_train, tokenizer) #Tokenize the training data and put it in a translation dataset
    val_data = tokenize(sentence_pairs_val, tokenizer) #Tokenize validation data and put it in a translation dataset
    test_data = tokenize(sentence_pairs_test, tokenizer) #Tokenize the testing data and put it in a translation dataset

    #Datasets and Dataloaders must be remade due to the changed tokenization
    #train_data = DataLoader(train_data, batch_size = len(sentence_pairs_train), shuffle=True) #Run a dataloader for the training data
    #val_data = DataLoader(val_data, batch_size = len(sentence_pairs_val), shuffle=True) #Run a dataloader for the validation data
    #test_data = DataLoader(test_data, batch_size = len(sentence_pairs_test), shuffle=True) #Run a dataloader for the test data

    train_data = DataLoader(train_data, batch_size = batch_size, shuffle=True) #Run a dataloader for the training data
    val_data = DataLoader(val_data, batch_size = batch_size, shuffle=True) #Run a dataloader for the validation data
    test_data = DataLoader(test_data, batch_size = batch_size, shuffle=True) #Run a dataloader for the test data
    
    lit = Bart_Lightning(model, tokenizer, metric, old_model_params) #Initialize Bart Lightning

    #Set up the Lightning Bart Trainer
    trainer = L.Trainer(limit_train_batches=1, max_epochs=1, enable_progress_bar = False, enable_model_summary = False, accelerator = "cpu")
    trainer.fit(model = lit, train_dataloaders = (train_data, val_data, test_data)) #Fit the Lightning Bart trainer

In [52]:
import time #Reimport time since it was collected with the garbage earlier
batch_size = 128 #Set the batch size the dataloaders for testing the model

start = time.time() #Start a timer
train_bart(sentence_pairs_train, sentence_pairs_test, sentence_pairs_val, batch_size, sentence_pairs_model) #Train the Bart model
print(f"Time Taken: {(time.time() - start) / 60} Minutes") #Print how long it took

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'MBart50Tokenizer'. 
The class this function is called from is 'MBartTokenizer'.


Epoch,Training Loss,Validation Loss,Bleu
1,11.4984,11.436323,0.319144
2,11.4751,11.435586,0.319144
3,11.4995,11.434255,0.319144
4,11.4592,11.432287,0.319144
5,11.5095,11.429636,0.319144


{'bleu': 0.31914419846928027}


Checkpoint destination directory ./results\checkpoint-2 already exists and is non-empty. Saving will proceed but saved results may be invalid.


{'bleu': 0.31914419846928027}
{'bleu': 0.31914419846928027}
{'bleu': 0.31914419846928027}
{'bleu': 0.31914419846928027}


{'bleu': 0.3212787627196081}
Time Taken: 622.0447508613269 Minutes


The Cython alone was not enough, even at batch sizes of 1. Adding Lightning around it appears to be helpful in getting it to go. Looking at task manager shows the disk and cpu raising to 100% and giving memory a minute to breathe as opposed to previous runs, where memory maxed out followed by kernel crashes (even after hours of loading and overnight runs.) Lightning also appears to make it faster, which likely comes from the extra memory breathing room it gets here despite only acting as a wrapper in this case. These together create the base for the model to even run, or at least they did at first. PEFT ended up making Cython not be required. Times were about the same with and without Cython after that factor was added. In fact, Cython became a hinderance because it made it so the epochs did not print, and thus I could not see if it was running properly or not. As such, it was removed.

I will also note the change from Trainer to Seq2SeqTrainer and the associated predict_with_grads training argument. All of the examples and information I could find on the internet did not allow for both TrainingArguments and Compute_Metrics to work without any change. It also did not make sense to reinvent the wheel for the trainers. The Seq2Seq versions are made for this purpose. I tried to keep as much intact as possible for the rest of the assignment, but I could not find a way to get this one going without either doing this switch or changing the rest of the code to the point of being unrecognizable. I feel like this is the best outcome for this reason.

Interestingly, a small training batch size with a larger accumulation proved to be much faster with similar loss to larger batch sizes. The training could then stay at 32 due to memory savings, but the model will train faster.

PEFT and LoRA then make the model usable by using ranking systems to lower the number of trainable parameters needed for fine tuning. This turned the model from 150M to around 885,000 at rank 16 or 221,000 at rank 4 for the base model, or 34M to 885,000/221,000 for a model with everything frozen except an encoder and decoder. 

This, however, only could produce a bleu score of less than 1 since the base model is not trained with other languages. I either had to switch models or do the training with all layers unfrozen for it to pick up the other language. I ended up deciding to switch to MBart with PEFT and LoRA, as 1M trainable parameters at rank 2 with knowledge of other languages is still better than 150M parameters in the base model. This provided good results right away, with a 32 Bleu with a training size of 32. It is much slower of course, since 1M parameters is more than 221,000. I decided to run a couple experiments on it to test PEFT with the model.

Results were as follows:

The subset 128 for 3 epochs: Bleu = 31.36

The subset of 700 with 50 warm up steps for 1 epoch: Bleu = 31.83

The subset 128 with 5 epochs and best_metric = eval_bleu: Bleu = 32.13

These were chosen to test various aspects of the training. I had gotten the feeling that translation comes more from the untouched layers with the previous BART model due to it only getting Bleu values less than 1, so I hoped these three experiments would show these ideas for sure. The first one looked at a smaller size for more epochs, which acted as more of a baseline. This provided a Bleu of 31.36. The next experiment worked with a larger dataset for one epoch. This was originally going to be the full training set, but that was way too slow. This experiment was traded for a dataset of 700 entries and 50 warm up steps, which would represent the larger dataset. It only came to a Bleu of 31.83 by the end, which seems to prove the initial hypothesis. I decided to run one more experiment emphasizing the use of the Bleu metric to select the best model just in case this was the important factor rather than the base layers of the model. This also provided a poor Bleu result in the end, which further proved how the model could not go so far past the initial score without training the original layers for translation. This means that, without proper computing power to train the whole model for the task, this model is not a good choice to be fine-tuned for the task. MBart specifically, however, could still be used for few-shot learning due to its stronger base knowledge of language. It would not provide the best results compared to others on the market, but its inherent knowledge from its huge training set still provides better result than the model implemented from scratch for the assignment.

Sources:

Trainers: https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps

LoRA: https://github.com/microsoft/LoRA

PEFT: https://huggingface.co/docs/peft/en/index

---

As a final step, you must compare the BLEU metric of both of these models, and write some lines about why you think you obtained the results you obtained (1 point).





Bleu scores are higher on the pretrained model due to the previous training it had, giving it a large advantage. Bart's large trainset, even when not looking at translation specifically, got to look at more contexts for these words than this data could possibly give to the fresh model.

The Bleu scores reflect this, as the fresh model had to work through many iterations to hit a score of 15. It could reach close to 20 with a ridiculous amount of overfitting (1000 iterations), but that only allowed it to have more knowledge of a few more words. Even still, the predictions would still end up mostly being el, la, and de since they are very common Spanish words. Compare that to the Bart model, who would consistently score more around 25 during the one-or-two-sentence one-epoch tests I was doing to make sure the functions were working correctly (given the time tradeoff to actually get the model going.) This was with the whole model and nothing frozen, so quick training with a few elements provides better results than the other model. Using PEFT or freezing with the regular model provides much worse results, however. Training would not even get to 1, as the base model was not trained with other languages. Training with the regular Bart thus cannot provide good results without training the whole model, which very computationally expensive.

I switched over to MBart after making this realization, as PEFT with MBart is still much better in terms of trainable parameters than training the entire Bart model. MBart with rank 2 has around a million parameters, and 32 examples took around 45 minutes to train a single epoch on that set. The Bleu score came to about 32 however, which is better than the regular Bart and previous model. The listed experiments above show how little PEFT will work in this specific translation task without delving into the main layers. The previous language training is what pushes the score up, so a few-shot implementation would be the best way to use the model in a low-resource setting such as this one. There are just plenty of other models out there that could do the translation with few-shot learning better than MBart.