# Sentence Reconstruction

Name and ID: Angelo Galavotti 0001103433

This notebook contains the submission for the Deep Learning of 14/06/2023. 

### Description of the task
- Take as input a __sequence of words__ corresponding to a random permutation of a given English sentence, and __reconstruct the original sentence__.

- The output can be either produced in a single shot, or through an iterative (autoregressive) loop generating a single token at a time.

CONSTRAINTS:

- No pretrained model can be used.
- The neural network models should have __less than 20M parameters__.

## Solution approach
To compute a valid solution, I decided to adopt a model based on Transformers and multi-head attention.

In this notebook, I will describe the most important steps of the whole approach. Additionally, at the end of the notebook, I will briefly describe my previous attempts.

----

# Downloading the dataset

In [23]:
!pip install datasets
!pip3 install apache-beam


In [24]:
from random import Random

# Instantiate the Random instance with random seed = 42 to ensure reproducibility
randomizer = Random(42)

In [25]:
!pip install gdown
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical, pad_sequences
import numpy as np 
import pickle
import gdown
import random

In [26]:
from datasets import load_dataset

dataset = load_dataset("wikipedia", "20220301.simple")

data = dataset['train'][:20000]['text']


# Tokenization

In [27]:
# run this cell only the first time, to create and save the tokenizer and the data
dump = True

tokenizer = Tokenizer(split=' ', filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n', num_words=10000, oov_token='<unk>')

corpus = []

# Split of each piece of text into sentences
for elem in data:
  corpus += elem.lower().replace("\n", "").split(".")[:]

print("corpus dim: ",len(corpus))

#add a start and an end token
corpus = ['<start> '+s+' <end>' for s in corpus]


# Tokenization	
tokenizer.fit_on_texts(corpus)
#print(tokenizer.word_index['<unk>'])

if dump:
    with open('tokenizer.pickle', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

original_data = [sen for sen in tokenizer.texts_to_sequences(corpus) if (len(sen) <= 32 and len(sen)>4 and not(1 in sen))]

if dump:
    with open('original.pickle', 'wb') as handle:
        pickle.dump(original_data, handle, protocol=pickle.HIGHEST_PROTOCOL)

print ("filtered sentences: ",len(original_data))

sos = tokenizer.word_index['<start>']
eos = tokenizer.word_index['<end>']
#print(eos)
#print(tokenizer.index_word[sos])

tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

# dimension of the vocabulary of tokens
vocab_dimension = len(tokenizer.word_index) + 1

corpus dim:  510023
filtered sentences:  137301


In [28]:
# Use the seeded generator defined above, so that the shuffles are reproducible
shuffled_data = [randomizer.sample(s[1:-1],len(s)-2) for s in original_data]
shuffled_data = [[sos]+s+[eos] for s in shuffled_data] # shuffled_data is an input of the model
target_data = [s[1:] for s in original_data] # target_data is the same as original data but offset by one timestep
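
To make the relationship between the three lists concrete, here is a minimal sketch on a single toy sentence. The token ids and the `SOS`/`EOS` values are made up for illustration; in the notebook they come from the tokenizer.

```python
from random import Random

SOS, EOS = 2, 3                      # hypothetical ids for <start> and <end>
rng = Random(42)                     # seeded generator, for reproducibility

original = [SOS, 10, 11, 12, 13, EOS]

# Shuffle only the inner words; <start> and <end> stay in place.
inner = original[1:-1]
shuffled = [SOS] + rng.sample(inner, len(inner)) + [EOS]

# Decoder target: the original sentence shifted left by one timestep.
target = original[1:]

print("encoder input :", shuffled)
print("decoder input :", original)
print("decoder target:", target)
```

With teacher forcing, the decoder sees `original` up to position `t` and is trained to predict `target[t]`, which is the next word of the original sentence.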

In [29]:
from sklearn.model_selection import train_test_split

x_train, x_test, c_train, c_test, y_train, y_test = train_test_split(original_data, shuffled_data, target_data, test_size = 0.3, random_state = 42)


## Score function



In [30]:
from difflib import SequenceMatcher

def score(s,p):
  match = SequenceMatcher(None, s, p).find_longest_match()
  #print(match.size)
  return (match.size/max(len(s),len(p)))

def clean_sentence(x):
  x = x.replace('<start>', '').replace('<end>', '').replace('<pad>', '').strip()
  return x
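
As a worked example of the metric, the sketch below scores a prediction against its original sentence. Since the inputs are strings, the match is computed on characters rather than on words. The helper name `score_sketch` is hypothetical, and it passes explicit bounds to `find_longest_match`, which is an assumption made only so that the call also works on Python versions older than 3.9.

```python
from difflib import SequenceMatcher

def score_sketch(s, p):
    # Longest contiguous block shared by s and p, relative to the longer string.
    match = SequenceMatcher(None, s, p).find_longest_match(0, len(s), 0, len(p))
    return match.size / max(len(s), len(p))

original = "the cat sat"
predicted = "sat the cat"

# The longest shared run of characters is "the cat" (7 characters),
# and both strings are 11 characters long, so the score is 7/11.
print(score_sketch(original, predicted))
print(score_sketch(original, original))
```

A perfect reconstruction scores 1, while a prediction that only gets a contiguous chunk right is rewarded in proportion to the length of that chunk.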


In [32]:
i = np.random.randint(len(original_data))
print("original sentence: ",original_data[i])
print("shuffled sentence: ",shuffled_data[i])

original sentence:  [2, 4, 780, 14, 5, 60, 829, 6, 1043, 20, 188, 1520, 21, 191, 31, 9, 75, 172, 1520, 18, 56, 23, 2053, 1777, 3]
shuffled sentence:  [2, 9, 60, 31, 18, 23, 2053, 191, 780, 172, 188, 14, 75, 56, 6, 1777, 20, 1520, 1520, 21, 4, 5, 829, 1043, 3]


## Dataset padding/formatting

In [33]:
max_sequence_len = max([len(x) for x in original_data])

x_train = pad_sequences(x_train, maxlen=max_sequence_len, padding='post')
x_test = pad_sequences(x_test, maxlen=max_sequence_len, padding='post')
c_train = pad_sequences(c_train, maxlen=max_sequence_len, padding='post')
c_test = pad_sequences(c_test, maxlen=max_sequence_len, padding='post')
y_train = pad_sequences(y_train, maxlen=max_sequence_len, padding='post')
y_test = pad_sequences(y_test, maxlen=max_sequence_len, padding='post')

print("x_train size:", len(x_train))
assert(len(x_train)==len(c_train)==len(y_train))
print(len(x_train))
print(len(c_train))
print(len(y_train))

x_train size: 96110
96110
96110
96110


In [34]:
i = np.random.randint(len(x_train))
print("original sentence: ",tokenizer.sequences_to_texts([x_train[i]])[0])
print("shuffled sentence: ",tokenizer.sequences_to_texts([c_train[i]])[0])

original sentence:  <start> in this way people can read many articles easily but it is illegal <end> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
shuffled sentence:  <start> many way people articles it read in easily but illegal this is can <end> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


# The model
After some attempts using RNNs and LSTMs, I decided to opt for a Transformer. This is due to many reasons, mainly:

- Transformers capture hidden dependencies in data.

- They make no assumptions about the __spatial__ relationships across data.

The latter property was essential for the performance of this model. In fact, the model should behave the same regardless of the ordering of the inputs: a property that is not ensured by LSTMs.


## Building the layers

The model is made up of the following types of layers:
- Base attention layer
- Cross attention layer 
- Global and Causal self attention layer
- Feed Forward layer

Let's look over their code and their inner functioning.

### Base Attention Layer

The Base attention layer is made up of a Multi-Head attention layer and an Add & Norm layer. It only declares these sub-layers; the subclasses below define how they are called.

In particular, each attention head can specialize in different aspects or dependencies of the sequence it receives.


In [72]:
import tensorflow as tf
from keras.layers import Embedding

class BaseAttention(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.layernorm = tf.keras.layers.LayerNormalization()
    self.add = tf.keras.layers.Add()
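
For intuition about what each head computes, here is a minimal NumPy sketch of scaled dot-product attention. It is a simplification under stated assumptions: a single head, no batch dimension, and no learned projections, all of which `MultiHeadAttention` adds on top. The function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention on 2-D arrays."""
    scores = q @ k.T / np.sqrt(k.shape[-1])   # shape (len_q, len_k)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                   # 5 tokens, 8 features each
out, weights = attention(x, x, x)             # self-attention: q = k = v
print(out.shape, weights.sum(axis=-1))
```

Each output row is a weighted average of the value rows, with weights given by how well the query matches each key.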

### Cross-attention layer
The cross-attention layer connects the encoder and the decoder of the model: the decoder sequence provides the queries, while the encoder output (the context) provides the keys and the values.

In [36]:
class CrossAttention(BaseAttention):
  def call(self, x, context):
    attn_output, attn_scores = self.mha(
        query=x,
        key=context,
        value=context,
        return_attention_scores=True)

    # Cache the attention scores
    self.last_attn_scores = attn_scores

    x = self.add([x, attn_output])
    x = self.layernorm(x)

    return x

### Global self attention layer
This layer is responsible for processing/generating the context sequence, and propagating information along its length.

In [37]:
class GlobalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x

### Causal self attention layer
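This layer does the same thing as the Global self attention layer, but for the output sequence.

As a matter of fact, their structure is very similar: the only difference is the causal mask, which prevents each position from attending to the positions that follow it.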

In [38]:
class CausalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x,
        use_causal_mask = True)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
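
The effect of the causal mask can be sketched in a few lines of NumPy. This is an illustration of the masking idea only, not of how Keras implements `use_causal_mask` internally; the function name is hypothetical.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores where position i may only attend to positions <= i."""
    n = scores.shape[0]
    allowed = np.tril(np.ones((n, n), dtype=bool))    # lower triangle is visible
    masked = np.where(allowed, scores, -np.inf)       # hide future positions
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                # exp(-inf) == 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = causal_attention_weights(rng.normal(size=(4, 4)))
print(np.round(w, 2))
```

Every entry above the diagonal is zero, so the first position can only attend to itself, the second to the first two, and so on. This is what allows training on the whole target sentence at once without leaking future words.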

### Feed forward layer

This layer is made up of two dense layers (the first with ReLU activation) followed by a dropout layer, which helps in reducing overfitting. As in the attention layers, the result is added to the input and normalized.

In [39]:
class FeedForward(tf.keras.layers.Layer):
  def __init__(self, d_model, dff, dropout_rate=0.1):
    super().__init__()
    self.seq = tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),
      tf.keras.layers.Dense(d_model),
      tf.keras.layers.Dropout(dropout_rate)
    ])
    self.add = tf.keras.layers.Add()
    self.layer_norm = tf.keras.layers.LayerNormalization()

  def call(self, x):
    x = self.add([x, self.seq(x)])
    x = self.layer_norm(x) 
    return x

### Positional Embedding Layer

A normal embedding layer converts each token index into a dense vector, so that it can be given as input to a neural network.

A positional embedding adds a positional encoding to that vector, so that the model can take into account the position of a word in a sequence.

In [40]:
def positional_encoding(length, depth):
  depth = depth/2

  positions = np.arange(length)[:, np.newaxis]    
  depths = np.arange(depth)[np.newaxis, :]/depth   

  angle_rates = 1 / (10000**depths)         
  angle_rads = positions * angle_rates      

  pos_encoding = np.concatenate(
      [np.sin(angle_rads), np.cos(angle_rads)],
      axis=-1) 

  return tf.cast(pos_encoding, dtype=tf.float32)
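
Written as a formula, and assuming an even depth `d`, the function above computes the following for position `pos` and index `i`. Note that the sine and cosine halves are concatenated along the feature axis, rather than interleaved as in the notation of the original paper.

```latex
\mathrm{PE}(pos,\, i) = \sin\!\left(\frac{pos}{10000^{\,2i/d}}\right), \qquad
\mathrm{PE}\!\left(pos,\, i + \tfrac{d}{2}\right) = \cos\!\left(\frac{pos}{10000^{\,2i/d}}\right), \qquad
i = 0, \dots, \tfrac{d}{2} - 1
```

Here `2i/d` corresponds to `depths` in the code, which is `np.arange(d/2) / (d/2)`.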


In [42]:
class PositionalEmbedding(tf.keras.layers.Layer):
  def __init__(self, vocab_size, d_model):
    super().__init__()
    self.d_model = d_model
    self.embedding = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True) 
    self.pos_encoding = positional_encoding(length=2048, depth=d_model)

  def compute_mask(self, *args, **kwargs):
    return self.embedding.compute_mask(*args, **kwargs)

  def call(self, x):
    length = tf.shape(x)[1]
    x = self.embedding(x)
    # This factor sets the relative scale of the embedding and the positional encoding.
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x = x + self.pos_encoding[tf.newaxis, :length, :]
    return x


# Encoder

The encoder takes as input the shuffled sentence and computes the context, which is given to the decoder through the cross-attention layer.

It is made of a stack of encoder layers.

## Encoder Layer
Each encoder layer is made of a Global self attention layer and a feed forward layer.

In [43]:
class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self,*, d_model, num_heads, dff, dropout_rate=0.1):
    super().__init__()

    self.self_attention = GlobalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)

    self.ffn = FeedForward(d_model, dff)

  def call(self, x):
    x = self.self_attention(x)
    x = self.ffn(x)
    return x

In the encoder, the positional embedding layer is replaced by a normal embedding layer.

There is a reason for this: without the positional embedding, the input is seen as a "bag of words", in which the order of the words is not taken into account.

This is exactly what we want: the model should behave in the same way for every possible ordering of the same set of words.

In [44]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads,
               dff, vocab_size, dropout_rate=0.1):
    super().__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    # No positional embedding, since we need this model to treat input as BoW
    self.embedding = Embedding(input_dim=vocab_size, output_dim=d_model) 

    self.enc_layers = [
        EncoderLayer(d_model=d_model,
                     num_heads=num_heads,
                     dff=dff,
                     dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

  def call(self, x):
    x = self.embedding(x)  

    x = self.dropout(x)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x)

    return x 
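
The "bag of words" behaviour can be checked numerically. The sketch below is a simplified NumPy illustration (single head, no learned projections, no feed forward layer), so it is an assumption-laden stand-in for the encoder rather than the model itself. It shows two things: without positional information, permuting the input tokens simply permutes the self-attention output in the same way, and a cross-attention query gets the same result whatever the order of the context.

```python
import numpy as np

def attend(q, kv):
    """Unprojected single-head attention of queries q over rows of kv."""
    scores = q @ kv.T / np.sqrt(kv.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ kv

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))          # 6 token embeddings, no positional term
perm = rng.permutation(6)             # a random reordering of the tokens

# Self-attention: reordering the input reorders the output in the same way.
equivariant = np.allclose(attend(x, x)[perm], attend(x[perm], x[perm]))

# Cross-attention: a decoder query is unaffected by the order of the context.
q = rng.normal(size=(3, 16))
invariant = np.allclose(attend(q, x), attend(q, x[perm]))

print(equivariant, invariant)
```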

# Decoder
The structure of the decoder is very similar to the structure of the encoder, aside from a few differences.

## Decoder layer
Each decoder layer is made of a Causal self attention layer and a feed forward layer.

In addition, it incorporates the cross attention layer, which receives the context from the encoder.

In [45]:
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self,
               *,
               d_model,
               num_heads,
               dff,
               dropout_rate=0.1):
    super(DecoderLayer, self).__init__()

    self.causal_self_attention = CausalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)

    self.cross_attention = CrossAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)

    self.ffn = FeedForward(d_model, dff)

  def call(self, x, context):
    x = self.causal_self_attention(x=x)
    x = self.cross_attention(x=x, context=context)

    # Cache the last attention scores
    self.last_attn_scores = self.cross_attention.last_attn_scores

    x = self.ffn(x)  
    return x

As opposed to the encoder, the decoder uses a positional embedding: during teacher forcing it receives the original sentence, and it must capture the positional information of its words.

In [46]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads, dff, vocab_size,
               dropout_rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size,
                                             d_model=d_model)
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dec_layers = [
        DecoderLayer(d_model=d_model, num_heads=num_heads,
                     dff=dff, dropout_rate=dropout_rate)
        for _ in range(num_layers)]

    self.last_attn_scores = None

  def call(self, x, context):
    x = self.pos_embedding(x)  

    x = self.dropout(x)

    for i in range(self.num_layers):
      x  = self.dec_layers[i](x, context)

    self.last_attn_scores = self.dec_layers[-1].last_attn_scores

    return x

# Final transformer

Putting everything together, we obtain the transformer. 

We also add a final Dense layer, which converts the resulting vector at each location into output token logits (unnormalized scores over the vocabulary).

In [47]:
class Transformer(tf.keras.Model):
  def __init__(self, *, num_layers, d_model, num_heads, dff,
               input_vocab_size, target_vocab_size, dropout_rate=0.1):
    super().__init__()
    self.encoder = Encoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=input_vocab_size,
                           dropout_rate=dropout_rate)

    self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=target_vocab_size,
                           dropout_rate=dropout_rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inputs):
    # computing and giving context to decoder
    context, x  = inputs
    context = self.encoder(context)
    x = self.decoder(x, context)

    # Final linear layer output.
    logits = self.final_layer(x) 

    try:
      # Drop the keras mask, so it doesn't scale the losses/metrics.
      del logits._keras_mask
    except AttributeError:
      pass

    return logits

### Instantiating the model

The model is instantiated with the following parameters.

Each of them was chosen through trial and error, by training different models with different combinations of parameters.

Some of the most influential were the number of heads and the dropout rate.
- The number of heads influences how the model captures the underlying dependencies in sequences.
- The dropout rate influences how prone the model is to overfitting and underfitting.

In [50]:
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.2

transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=10_000,
    target_vocab_size=10_000,
    dropout_rate=dropout_rate)

# Training the model

---

The model uses an Adam optimizer. The learning rate schedule was chosen according to the paper "Attention is all you need", in which Transformers were first introduced.

In [51]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super().__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    step = tf.cast(step, dtype=tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
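
To see the shape of this schedule without TensorFlow, the same formula can be evaluated in plain Python. The function name `lr_at` is hypothetical; the defaults mirror the values used in this notebook (`d_model = 128`, `warmup_steps = 4000`).

```python
def lr_at(step, d_model=128, warmup_steps=4000):
    """Learning rate at a given optimizer step (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 64000):
    print(step, f"{lr_at(step):.6f}")
```

The rate grows linearly during the warmup steps, peaks at `step == warmup_steps` (about 1.4e-3 with these values), and then decays proportionally to the inverse square root of the step.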

In [53]:
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

### Loss function and metrics
The sparse categorical cross-entropy and accuracy are extended to include a padding mask.


In [54]:
def masked_loss(label, pred):
  mask = label != 0
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
  loss = loss_object(label, pred)

  mask = tf.cast(mask, dtype=loss.dtype)
  loss *= mask

  loss = tf.reduce_sum(loss)/tf.reduce_sum(mask)
  return loss


def masked_accuracy(label, pred):
  pred = tf.argmax(pred, axis=2)
  label = tf.cast(label, pred.dtype)
  match = label == pred

  mask = label != 0

  match = match & mask

  match = tf.cast(match, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(match)/tf.reduce_sum(mask)
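
The effect of the padding mask can be seen on a toy example. The sketch below reproduces the logic of `masked_accuracy` in NumPy on made-up token ids, assuming that the predictions have already been reduced to token ids with an arg-max.

```python
import numpy as np

PAD = 0
labels = np.array([[5, 7, 9, PAD, PAD]])
predicted = np.array([[5, 2, 9, PAD, 4]])     # already arg-maxed token ids

mask = labels != PAD                          # True on real tokens only
match = (labels == predicted) & mask          # correct and not padding

accuracy = match.sum() / mask.sum()           # 2 correct out of 3 real tokens
print(accuracy)
```

Without the mask, the padded positions would also be counted (here giving 3 out of 5), which would make the metric depend on the amount of padding rather than on the quality of the predictions.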

### Compiling and training the model

The model is built and set up for training. 

In [66]:
transformer.compile(
    loss=masked_loss,
    optimizer=optimizer,
    metrics=[masked_accuracy]
)

In [None]:
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

early_stopping = EarlyStopping(monitor='val_masked_accuracy', mode='max', verbose=1, patience=5)

epochs = 50
batch_size = 256

transformer.fit(
    (c_train, x_train),
    y_train,
    epochs=epochs,
    batch_size=batch_size,
    callbacks = [early_stopping],
    validation_split = 0.05
)

Epoch 1/50
...
Epoch 50/50



<keras.callbacks.History at 0x7f122acaadd0>

In [80]:
transformer.summary()

Model: "transformer_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 encoder_1 (Encoder)         multiple                  3918848   
                                                                 
 decoder (Decoder)           multiple                  6029824   
                                                                 
 dense_16 (Dense)            multiple                  1290000   
                                                                 
Total params: 11,238,672
Trainable params: 11,238,672
Non-trainable params: 0
_________________________________________________________________
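
The parameter counts in the summary can be reproduced by hand, which also confirms that the model stays below the 20M limit. The sketch below assumes the usual Keras parameter layout (dense layers and attention projections with biases, layer normalization with a scale and an offset). Since `key_dim` is set to `d_model`, each of the 8 heads has a 128-dimensional projection, so the inner width of the attention layer is 8 * 128 = 1024.

```python
d_model, dff, num_heads, key_dim = 128, 512, 8, 128
vocab, num_layers = 10_000, 4

def dense(n_in, n_out):
    return n_in * n_out + n_out            # weights + biases

def mha():
    # Query, key and value projections, plus the output projection.
    inner = num_heads * key_dim
    return 3 * dense(d_model, inner) + dense(inner, d_model)

layer_norm = 2 * d_model                   # scale and offset
attention_block = mha() + layer_norm
ffn_block = dense(d_model, dff) + dense(dff, d_model) + layer_norm
embedding = vocab * d_model

encoder = embedding + num_layers * (attention_block + ffn_block)
decoder = embedding + num_layers * (2 * attention_block + ffn_block)
final_dense = dense(d_model, vocab)

total = encoder + decoder + final_dense
print(encoder, decoder, final_dense, total)
```

The three partial counts match the `Param #` column above, and the total matches the 11,238,672 parameters reported by `transformer.summary()`.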


In [None]:
### This code was used to save the trained model weights to Google Drive. It should be ignored.

# from google.colab import drive
# drive.mount('/content/drive')

# transformer.save_weights('drive/MyDrive/saved_model_weights_8_128_02/my_model_weights')

## Translator module

This module is responsible for wrapping the computation of the transformer.
In essence, it builds a bag of words from each shuffled sentence in a batch and then, one position at a time, picks among the words still in the bag the one to which the transformer assigns the highest score.


In [86]:
class Translator(tf.Module):
    def __init__(self, transformer, tokenizer):
        self.transformer = transformer
        self.tokenizer = tokenizer
              
    def __call__(self, sentences, max_length=max_sequence_len):
        batch_size = sentences.shape[0]
        
        # generate word list for each sentence
        bow = [[word for word in sentence if word not in [sos, eos, 0]] for sentence in sentences]
        # starting vector for prediction, it contains the sos index
        output = [[self.tokenizer.word_index['<start>']] for _ in range(batch_size)]
        # during inference, output will be filled with the final sentence. 

        for i in range(1, max_length):
            # (enc_input, dec_input)
            predictions = np.array(self.transformer((np.array(sentences), np.array(output))))
            
            # keep only the scores for the last position
            predictions = predictions[:, -1, :] 

            for j in range(batch_size):
                if len(bow[j]) == 0:
                    # no more words to use
                    cand_token = eos
                else:
                    # choose index with highest score
                    s_prediction = predictions[j, np.array(bow[j])]
                    cand_index = np.argmax(s_prediction)
                    cand_token = bow[j][cand_index]
                    del bow[j][cand_index]
                output[j].append(cand_token)
                
        return output
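
The selection step at the core of the translator can be isolated in a short sketch. The scores below are made up and stand in for the transformer output at each step; `pick_from_bag` is a hypothetical helper, not part of the notebook.

```python
def pick_from_bag(scores, bag):
    """Return the token in `bag` with the highest score and remove it from the bag."""
    best = max(range(len(bag)), key=lambda i: scores[bag[i]])
    return bag.pop(best)

# Hypothetical per-step scores over a 6-token vocabulary (index = token id).
step_scores = [
    [0.1, 0.9, 0.3, 0.2, 0.0, 0.5],   # an unconstrained arg-max would pick token 1
    [0.2, 0.8, 0.1, 0.6, 0.0, 0.4],
    [0.0, 0.7, 0.9, 0.1, 0.0, 0.3],
]

bag = [2, 3, 5]                       # words available in the shuffled input
output = [pick_from_bag(scores, bag) for scores in step_scores]
print(output)                         # [5, 3, 2]
```

Restricting the arg-max to the bag guarantees that the output is always a permutation of the input words, even when the model would prefer a word that is not in the sentence (token 1 in this example).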

In [87]:
translator = Translator(transformer, tokenizer)

## Computing the score

Now, we test our translator and compute the score.

To do that, we compute the score on 3K samples from the test set.

Since running the translator on all 3K samples at once could give us some problems in Colab, the score is computed in smaller batches of `score_batch_size` samples each.

Then, the total score is computed as the average over all samples.

In [None]:
score_batch_size = 100
total_test_size = 3000
score_ = 0

for i in range(total_test_size//score_batch_size):
    ordered = x_test[i*score_batch_size:(i+1)*score_batch_size]
    shuffled = c_test[i*score_batch_size:(i+1)*score_batch_size]
    y_pred = translator(shuffled)
    b_score = 0                   # score associated with each batch

    pred_sentences = tokenizer.sequences_to_texts(y_pred)
    original_sentences = tokenizer.sequences_to_texts(ordered)
    
    for j in range(score_batch_size) :         
      b_score += score(clean_sentence(original_sentences[j]), clean_sentence(pred_sentences[j])) 

    score_ += b_score
    print("\n====BATCH OVER====") 
    print("Score as of batch ", i, ": ", score_/((i+1)*score_batch_size))
    
score_ = score_/total_test_size
print("\n====ALL OVER====") 
print("Final score: ", score_)



====BATCH OVER=====
Score as of batch  0 :  0.4948857841637906

====BATCH OVER=====
Score as of batch  1 :  0.5061785304182481

====BATCH OVER=====
Score as of batch  2 :  0.5058148684303929

====BATCH OVER=====
Score as of batch  3 :  0.5027345913851916

====BATCH OVER=====
Score as of batch  4 :  0.517508295140673

====BATCH OVER=====
Score as of batch  5 :  0.530376627646052

====BATCH OVER=====
Score as of batch  6 :  0.5344367026985729

====BATCH OVER=====
Score as of batch  7 :  0.5379621676018749

====BATCH OVER=====
Score as of batch  8 :  0.542301335438453

====BATCH OVER=====
Score as of batch  9 :  0.54320980641855

====BATCH OVER=====
Score as of batch  10 :  0.5395313403534979

====BATCH OVER=====
Score as of batch  11 :  0.5377743859102094

====BATCH OVER=====
Score as of batch  12 :  0.5375765857504796

====BATCH OVER=====
Score as of batch  13 :  0.5374369683136379

====BATCH OVER=====
Score as of batch  14 :  0.5370951441912

# Conclusion
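The model obtains average performance, with a running score of about 0.54 on the evaluated test batches.
Further parameter tuning led to similar or lower scores. In particular, I tried:
- increasing the number of attention heads
- increasing the dropout rate
- increasing the model size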

### Previous attempts 
In the previous iteration, I tried using a stack of LSTM layers in an encoder/decoder structure, using a context vector to communicate between the two. The model also made use of teacher forcing.

The results provided by this architecture were unsatisfactory, with a score well below average, presumably because the model failed to capture the underlying relationship between sequences during training.

This led to the adoption of the transformer model. 