# Sentence Reconstruction

The purpose of this project is to take as input a sequence of words corresponding to a random permutation of a given English sentence and to reconstruct the original sentence.

The output can be produced either in a single shot or through an iterative (autoregressive) loop that generates one token at a time.


CONSTRAINTS:
* No pretrained model can be used.
* The neural network models should have fewer than 20M parameters.
* No postprocessing should be done (e.g. no beam search).
* You cannot use additional training data.


BONUS PARAMETERS:

A bonus of 0-2 points will be awarded to encourage the adoption of models with a low number of parameters.

# Dataset

The dataset is composed of sentences taken from the generics_kb dataset on Hugging Face. We restricted the vocabulary to the 10K most frequent words and only took sentences that use this vocabulary.

In [3]:
!pip install datasets

Download the dataset

In [4]:
from datasets import load_dataset
from keras.layers import TextVectorization
import tensorflow as tf
import numpy as np
np.random.seed(42)
ds = load_dataset('generics_kb',trust_remote_code=True)['train']

Keep only the sentences with more than 8 words.


> Note: \<start>, \<end> and \<comma> are vectorized the same way as the plain strings start, end and comma, which confuses the Transformer. I therefore chose different special tokens.
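
The collision can be reproduced without TensorFlow. The sketch below only approximates the `lower_and_strip_punctuation` standardization (lowercasing, then deleting ASCII punctuation); the helper name `standardize` is illustrative and not part of Keras.

```python
import string

def standardize(text):
    """Approximate lower_and_strip_punctuation: lowercase, drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

# Angle-bracket markers collapse onto ordinary words once the brackets are stripped ...
print(standardize("<start> the sentence <end>"))
# ... while the custom markers cannot be confused with any ordinary word.
print(standardize("lemmastart the sentence lemmaend"))
```

With the angle brackets gone, a marker such as `<end>` shares a token id with the word "end", which is why distinct strings are used instead.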

In [328]:
START_STRING = "lemmastart"
COMMA_STRING = "lemmacomma"
END_STRING = "lemmaend"
UNK_STRING = "[UNK]"

In [329]:
ds = ds.filter(lambda row: len(row["generic_sentence"].split(" "))>8 )
corpus = [ START_STRING+" "+ row['generic_sentence'].replace(","," "+COMMA_STRING) +" "+END_STRING for row in ds ]
corpus = np.array(corpus)



Create a tokenizer and a detokenizer

In [605]:
tokenizer=TextVectorization( max_tokens=10000, standardize="lower_and_strip_punctuation", encoding="utf-8",) # max_tokens keeps the most frequent tokens; the vocabulary is sorted from most to least frequent
tokenizer.adapt(corpus)

class TextDetokenizer:
    def __init__(self, vectorize_layer):
        self.vectorize_layer = vectorize_layer
        vocab = self.vectorize_layer.get_vocabulary()
        self.index_to_word = {index: word for index, word in enumerate(vocab)}

    def __detokenize_tokens(self, tokens):
        def check_token(t):
          if t == 2:
            s=START_STRING
          elif t ==3:
            s=END_STRING
          elif t ==7:
            s=COMMA_STRING
          else:
            s=self.index_to_word.get(t, UNK_STRING)
          return s

        return ' '.join([ check_token(token) for token in tokens if token != 0])

    def __call__(self, batch_tokens):
       return [self.__detokenize_tokens(tokens) for tokens in batch_tokens]



detokenizer = TextDetokenizer( tokenizer )
sentences = tokenizer( corpus ).numpy()


Remove from the corpus the sentences in which any unknown word appears

In [331]:
mask = np.sum( (sentences==1) , axis=1) >= 1
original_data = np.delete( sentences, mask , axis=0)

In [332]:
original_data.shape

(241194, 28)

In [604]:
tokenizer.get_vocabulary()

['',
 '[UNK]',
 'lemmastart',
 'lemmaend',
 'the',
 'of',
 'and',
 'lemmacomma',
 'is',
 'to',
 'a',
 'in',
 'are',
 'that',
 'can',
 'for',
 'or',
 'as',
 'have',
 'with',
 'their',
 'on',
 'most',
 'by',
 'from',
 'some',
 'an',
 'many',
 'people',
 'they',
 'also',
 'more',
 'be',
 'which',
 'than',
 'water',
 'one',
 'all',
 'when',
 'at',
 'other',
 'it',
 'use',
 'very',
 'used',
 'important',
 'plants',
 'but',
 'life',
 'cause',
 'has',
 'body',
 'into',
 'food',
 'common',
 'animals',
 'women',
 'often',
 'time',
 'only',
 'children',
 'through',
 'cells',
 'form',
 'human',
 'energy',
 'different',
 'small',
 'up',
 'blood',
 'because',
 'like',
 'disease',
 'both',
 'our',
 'species',
 'light',
 'cancer',
 'large',
 'during',
 'make',
 'live',
 'high',
 'part',
 'about',
 'two',
 'usually',
 'world',
 'natural',
 'air',
 'such',
 'health',
 'way',
 'any',
 'well',
 'work',
 'between',
 'occur',
 'soil',
 'out',
 'much',
 'no',
 'process',
 'system',
 'skin',
 'over',
 'do',


Shuffle the sentences

In [333]:
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, data, batch_size=32, shuffle=True, seed=42):
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.data) / self.batch_size))

    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        data_batch = np.array([self.data[k] for k in indexes])
        #copy of ordered sequences
        result = np.copy(data_batch)
        #shuffle only the relevant positions for each batch
        for i in range(data_batch.shape[0]):
          np.random.shuffle(data_batch[i,1:data_batch[i].argmin() - 1])

        return data_batch , result
    
    def indexes(self):
        return self.indexes
    
    def batch_size(self):
        return self.batch_size

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.data))
        if self.shuffle:
            if self.seed is not None:
                np.random.seed(self.seed)
            np.random.shuffle(self.indexes)

In [365]:
# Make a random permutation of training and test set
np.random.seed(42)
# Shuffle all the data
shuffled_indices = np.random.permutation(len(original_data))
shuffled_data = original_data[shuffled_indices]

In [335]:
#split the dataset
train_generator = DataGenerator(shuffled_data[:220000])
test_generator = DataGenerator(shuffled_data[220000:])

In [360]:
x, y = test_generator.__getitem__(1)
x_det = detokenizer(x)
y_det = detokenizer(y)

for i in range(7):
  print("original: ", y_det[i])
  print("shuffled: ", x_det[i])
  print("\n")

original:  lemmastart feathers are complex lemmaend designed structures required for flight lemmaend and are today found only on birds lemmacomma
shuffled:  lemmastart complex lemmaend found feathers on designed today for birds only and lemmaend required are are flight structures lemmacomma


original:  lemmastart most types of cancer are more likely to affect people as they get older lemmacomma
shuffled:  lemmastart affect as cancer are more most types to older likely get they people of lemmacomma


original:  lemmastart some scientists study the needs of human or animal minds and bodies lemmacomma
shuffled:  lemmastart and or study bodies scientists the some minds animal human of needs lemmacomma


original:  lemmastart crawlers are pale green in color and move about the leaf seeking a suitable feeding site lemmacomma
shuffled:  lemmastart the are seeking pale a and color crawlers move leaf about in suitable feeding site green lemmacomma


original:  lemmastart wood preservation mean

# Metrics

Let s be the source string and p your prediction. The quality of the results will be measured according to the following metric:

1.  find the longest common substring w of s and p
2.  compute |w|/max(|s|,|p|)

If the match is exact, the score is 1.

When computing the score, you should NOT consider the start and end tokens.



The longest common substring can be computed with the SequenceMatcher class of difflib, which allows a simple definition of our metric.

In [276]:
from difflib import SequenceMatcher

def score(s,p):
  match = SequenceMatcher(None, s, p).find_longest_match()
  return (match.size/max(len(p),len(s)))

Let's do an example.

In [24]:
original = "at first henry wanted to be friends with the king of france"
generated = "henry wanted to be friends with king of france at the first"

print("your score is ",score(original,generated))

your score is  0.5423728813559322


The score must be computed as an average over at least 3K random examples taken from the test set.
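
A minimal sketch of that averaging step, assuming the evaluation data is available as a list of `(original, predicted)` string pairs with start and end tokens already stripped. The helper name `mean_score` and its parameters are illustrative, not part of the assignment.

```python
import random
from difflib import SequenceMatcher

def score(s, p):
    # Same metric as above; the explicit bounds also work on Python versions before 3.9.
    match = SequenceMatcher(None, s, p).find_longest_match(0, len(s), 0, len(p))
    return match.size / max(len(p), len(s))

def mean_score(pairs, n_samples=3000, seed=42):
    """Average the score over up to n_samples randomly chosen (original, predicted) pairs."""
    rng = random.Random(seed)
    sample = rng.sample(pairs, min(n_samples, len(pairs)))
    return sum(score(s, p) for s, p in sample) / len(sample)

# Toy data: one exact reconstruction and one partial reconstruction.
pairs = [
    ("at first henry wanted to be friends", "at first henry wanted to be friends"),
    ("the king of france", "of france the king"),
]
print(mean_score(pairs))
```

Sampling without replacement keeps each test sentence from being counted twice.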

# What to deliver

You are supposed to deliver a single notebook, suitably commented.
The notebook should describe a single model, although you may briefly discuss additional attempts you did.

The notebook should contain a full trace of the training.
Weights should be made available on request.

You must also give a clear assessment of the performance of the model, computed with the metric that has been given to you.

# Good work!

# Introduction

To tackle this problem I used a Transformer-based model that predicts the position of each input token in the target sequence.

These are the key points of my work:

 - First, I chose to build a model that processes shuffled and unshuffled data in the same way. To do so, I removed the positional encoding from the encoder input of the Transformer architecture we studied (keeping it on the decoder input), so that the model cannot exploit the order of the tokens in the input sequence.

 - Second, I trained the network on unshuffled data first, so that the first part of training focuses on sentence understanding. It was then trained on the shuffled data of the actual task.

 - Third, I let the network predict the next token over the token space of the input sequence instead of the token space of the whole dictionary, which reduces both the computational and the memory cost.

Details follow in the sections below.
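
The first key point rests on a property of attention that can be checked directly: without positional information, self-attention is permutation-equivariant, so shuffling the input rows only shuffles the output rows in the same way. A minimal pure-Python sketch, assuming a single head with identity projections (no learned weights), which is a simplification of a real multi-head attention layer:

```python
import math
import random

def self_attention(X):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

random.seed(0)
X = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 5 "tokens", 4 features
perm = [3, 0, 4, 1, 2]
X_perm = [X[i] for i in perm]

Y = self_attention(X)
Y_perm = self_attention(X_perm)

# Row i of the permuted output should match row perm[i] of the original output.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(Y_perm[i], Y[p])
)
print(same)
```

Adding a positional encoding to `X` before the attention step would break this property, which is the reason for leaving it out on the encoder side.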

# Dependencies

In [901]:
import keras
from keras.utils import Sequence
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk.stem import WordNetLemmatizer
from tqdm.notebook import tqdm

# Data Pre-processing

## Custom Data Generator

Data Generator for Encoder-Decoder Transformer.

This is the very same generator provided with the project, except that it yields a triple of values: the encoder input, the decoder input and the target for the cross-entropy computation.

> Note: LabelEncoder is used to encode a unique ordering that the model will learn.\
 The encoding is simply the sorted order of the token ids in each sentence, giving labels from 0 (padding) up to at most max_length - 1.\
 More details in the Model Creation section.
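
The label encoding can be reproduced in plain Python: each token id is replaced by its rank among the distinct ids of the sentence. The helpers `rank_encode` and `rank_decode` are illustrative names, and the sample row is built from the `X_dec[0]` example printed further down (shifted by one position, with the end token 3 appended and the padding shortened).

```python
def rank_encode(token_ids):
    """Replace each id by its rank among the sorted distinct ids (what LabelEncoder does)."""
    classes = sorted(set(token_ids))
    rank = {tok: i for i, tok in enumerate(classes)}
    return [rank[t] for t in token_ids], classes

def rank_decode(labels, classes):
    """Invert rank_encode, given the sorted distinct ids of the sentence."""
    return [classes[label] for label in labels]

target = [1281, 77, 8, 4, 22, 54, 77, 6, 490, 330, 49, 5, 139, 11, 556, 111, 3, 0, 0]
labels, classes = rank_encode(target)
print(labels)
# [16, 10, 5, 2, 7, 9, 10, 4, 14, 13, 8, 3, 12, 6, 15, 11, 1, 0, 0]
print(rank_decode(labels, classes) == target)
```

Because padding (0) and the end token (3) are the two smallest ids, they always receive labels 0 and 1, and a repeated word (77 here) gets the same label at every occurrence.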

In [717]:
class ShuffleEDGenerator(Sequence):
    def __init__(self, enc_data, dec_data, result, batch_size=32, shuffle=True, raw=False, seed=42):
        self.enc_data = enc_data
        self.dec_data = dec_data
        self.result = result
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.raw = raw
        self.seed = seed
        self.label_encoder = LabelEncoder()
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.enc_data) / self.batch_size))

    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        data_batch = np.array([self.enc_data[k] for k in indexes])
        dec_data = np.array([self.dec_data[k] for k in indexes]) if self.dec_data is not None else []
        result = np.array([self.label_encoder.fit_transform(self.result[k]) for k in indexes])
        raw = np.array([self.result[k] for k in indexes]) if self.raw else None
        #shuffle only the relevant positions for each batch
        if self.shuffle:
            for i in range(data_batch.shape[0]):
                np.random.shuffle(data_batch[i,:data_batch[i].argmin()])
        return (data_batch, dec_data), result

    def indexes(self):
        return self.indexes
    
    def batch_size(self):
        return self.batch_size
    
    def result_data(self):
        return self.result[self.indexes]

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.enc_data))
        # if self.shuffle: ### always shuffle
        if self.shuffle:
            if self.seed is not None:
                np.random.seed(self.seed)
            np.random.shuffle(self.indexes)

## Stem Test

Let's apply a stemming transformation to the text (WordNet lemmatization, in the cells below) to try to reduce the number of tokens.

> Result: applying this process to the data reduces the number of tokens to be embedded from 10000 to ~8000.

> Note: Stemming won't be used in the model. It poses the problem of decoding different words back from the same stemmed token.\
 Therefore, since this process does not seem to be reversible, I don't use it.\
  Moreover, I didn't initially train the model on stemmed words, and I can't afford to lose my progress by restarting the training.  

In [39]:
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/utenteadmin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [93]:
# detokenize original_data
corpus_data = [detokenizer([seq])[0] for seq in original_data]

In [94]:
# lemmatize corpus_data
corpus_lemmatized = [" ".join([lemmatizer.lemmatize(token) for token in sequence.split()]) for sequence in corpus_data]

In [95]:
len(corpus_lemmatized)

241194

In [100]:
# create new lemmatized tokenizer and adapt it to lemmatized corpus
tokenizer_lemmatized=TextVectorization( max_tokens=10_000, standardize="lower_and_strip_punctuation", encoding="utf-8",) # max_tokens keeps the most frequent tokens; the vocabulary is sorted from most to least frequent
tokenizer_lemmatized.adapt(corpus_lemmatized)

detokenizer_lemmatized = TextDetokenizer( tokenizer_lemmatized )
sentences_lemmatized = tokenizer_lemmatized( corpus_lemmatized ).numpy()

In [99]:
len(sentences_lemmatized)

241194

In [104]:
# mask [UNK]
mask_lemm = np.sum( (sentences_lemmatized==1) , axis=1) >= 1
shuffled_data_lemmatized = np.delete( sentences_lemmatized, mask_lemm , axis=0)

In [105]:
shuffled_data_lemmatized.shape

(241194, 28)

In [106]:
print(f"number of tokens lemmatized:{len(tokenizer_lemmatized.get_vocabulary())}")

number of tokens lemmatized:8267


## Data split

Let's split the data for Encoder-Decoder Transformer processing.


 1. The Encoder needs plain sentences without START_STRING and END_STRING.

 2. The Decoder needs sentences with START_STRING in front, since its input has to simulate the previous output.

 3. The target sentences must have END_STRING at the tail.
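
The three views can be illustrated on a single padded row. This is a plain-Python sketch, assuming token ids 2 and 3 for the start and end markers (as in the vocabulary listing above) and 0 for padding; `split_views` is an illustrative helper, not the vectorized code used in the next cell.

```python
PAD, START, END = 0, 2, 3

def split_views(row):
    """Return (encoder input, decoder input, target), each padded back to len(row)."""
    max_length = len(row)

    def pad(seq):
        return seq + [PAD] * (max_length - len(seq))

    encoder_in = pad([t for t in row if t not in (START, END)])  # no markers at all
    decoder_in = pad([t for t in row if t != END])               # keeps START in front
    target = pad([t for t in row if t != START])                 # keeps END at the tail
    return encoder_in, decoder_in, target

enc, dec, tgt = split_views([2, 56, 12, 9242, 3, 0, 0])
print(enc)  # [56, 12, 9242, 0, 0, 0, 0]
print(dec)  # [2, 56, 12, 9242, 0, 0, 0]
print(tgt)  # [56, 12, 9242, 3, 0, 0, 0]
```

The decoder input and the target are the same sentence shifted by one position, which is what lets the decoder learn to predict the next token from the previous ones.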

In [904]:
train_data = shuffled_data[:220000]
valid_data = train_data[180000:]
test_data = shuffled_data[220000:]

start_token = tokenizer(START_STRING)
end_token = tokenizer(END_STRING)

max_length = train_data.shape[1]

# encoder inputs
X_train = tf.reshape(train_data[(train_data!=start_token) & (train_data!=end_token)], [-1,max_length-2])    # remove both START_STRING and END_STRING from original + restore matrix dimensions
X_train = tf.concat([X_train, np.zeros_like(X_train[:,0:2])], axis=-1)                                      # restore max_length adding 0s

X_valid = tf.reshape(valid_data[(valid_data!=start_token) & (valid_data!=end_token)], [-1,max_length-2])    # remove both START_STRING and END_STRING from original + restore matrix dimensions
X_valid = tf.concat([X_valid, np.zeros_like(X_valid[:,0:2])], axis=-1)                                      # restore max_length adding 0s

X_test = tf.reshape(test_data[(test_data!=start_token) & (test_data!=end_token)], [-1,max_length-2])        # remove both START_STRING and END_STRING from original + restore matrix dimensions
X_test = tf.concat([X_test, np.zeros_like(X_test[:,0:2])], axis=-1)                                         # restore max_length adding 0s

# decoder inputs -  test must be self-injected by previous decoder output
X_train_dec = tf.reshape(train_data[(train_data!=end_token)], [-1,max_length-1])    # remove only END_STRING from original
X_train_dec = tf.concat([X_train_dec, np.zeros_like(X_train_dec[:,0:1])], axis=-1)  # restore max_length adding 0s
X_valid_dec = tf.reshape(valid_data[(valid_data!=end_token)], [-1,max_length-1])    # remove only END_STRING from original 
X_valid_dec = tf.concat([X_valid_dec, np.zeros_like(X_valid_dec[:,0:1])], axis=-1)  # restore max_length adding 0s

# targets
Y_train = tf.reshape(train_data[(train_data!=start_token)], [-1,max_length-1])    # remove only START_STRING from original
Y_train = tf.concat([Y_train, np.zeros_like(Y_train[:,0:1])], axis=-1)  # restore max_length adding 0s
Y_valid = tf.reshape(valid_data[(valid_data!=start_token)], [-1,max_length-1])    # remove only START_STRING from original 
Y_valid = tf.concat([Y_valid, np.zeros_like(Y_valid[:,0:1])], axis=-1)  # restore max_length adding 0s
Y_test = tf.reshape(test_data[(test_data!=start_token)], [-1,max_length-1])    # remove only START_STRING from original 
Y_test = tf.concat([Y_test, np.zeros_like(Y_test[:,0:1])], axis=-1)  # restore max_length adding 0s

print(f"X_train.shape:{X_train.shape} -> \n{X_train[0]}")
print(f"X_valid.shape:{X_valid.shape} -> \n{X_valid[0]}")
print(f"X_test.shape:{X_test.shape} -> \n{X_test[0]}")
print(f"X_train_dec.shape:{X_train_dec.shape} -> \n{X_train_dec[0]}")
print(f"X_valid_dec.shape:{X_valid_dec.shape} -> \n{X_valid_dec[0]}")
print(f"Y_train.shape:{Y_train.shape} -> \n{Y_train[0]}")
print(f"Y_valid.shape:{Y_valid.shape} -> \n{Y_valid[0]}")
print(f"Y_test.shape:{Y_test.shape} -> \n{Y_test[0]}")

X_train.shape:(220000, 28) -> 
[2771   12  622   55    7   19  124  640    7   13   14  694   10  713
    5  566 3154   11   10  395  633    5   58    0    0    0    0    0]
X_valid.shape:(40000, 28) -> 
[8452   12  193   80    4  449  193   41    8  507    7    6    8  784
    9  106    0    0    0    0    0    0    0    0    0    0    0    0]
X_test.shape:(21194, 28) -> 
[  56   12 9242    9    4   91    5  424    6  690    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0]
X_train_dec.shape:(220000, 28) -> 
[   2 2771   12  622   55    7   19  124  640    7   13   14  694   10
  713    5  566 3154   11   10  395  633    5   58    0    0    0    0]
X_valid_dec.shape:(40000, 28) -> 
[   2 8452   12  193   80    4  449  193   41    8  507    7    6    8
  784    9  106    0    0    0    0    0    0    0    0    0    0    0]
Y_train.shape:(220000, 28) -> 
[2771   12  622   55    7   19  124  640    7   13   14  694   10  713
    5  566 3154   11   1


In [902]:
# example of data generation
(X,X_dec),Y = ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=True).__getitem__(0)
X[0],X_dec[0],Y[0]

(array([  77,   22,  330,   54,    5,    8,  556,  139,  111,    6,    4,
          77,  490,   11,   49, 1281,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0]),
 array([   2, 1281,   77,    8,    4,   22,   54,   77,    6,  490,  330,
          49,    5,  139,   11,  556,  111,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0]),
 array([16, 10,  5,  2,  7,  9, 10,  4, 14, 13,  8,  3, 12,  6, 15, 11,  1,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]))

# Model Creation

## Model Architecture utility functions

Each of these functions returns both a layer output and a model wrapping the same layers, so that the model builder can choose how to assemble the final model.

### Positional Encoder
Positional encoding function as studied in class
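
For reference, this is the sinusoidal encoding that the class below computes, with $p$ the position, $d$ = `embed_size` and $k = 0, \dots, d/2 - 1$:

$$
PE(p, 2k) = \sin\left(\frac{p}{10000^{2k/d}}\right), \qquad
PE(p, 2k+1) = \cos\left(\frac{p}{10000^{2k/d}}\right)
$$

Even embedding dimensions receive the sine term and odd dimensions the cosine term, matching the `::2` and `1::2` slices in the code.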

In [176]:
class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, max_length, embed_size, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        assert embed_size % 2 == 0, "embed_size must be even"
        p, i = np.meshgrid(np.arange(max_length),
                           2 * np.arange(embed_size // 2))
        pos_emb = np.empty((1, max_length, embed_size))
        pos_emb[0, :, ::2] = np.sin(p / 10_000 ** (i / embed_size)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10_000 ** (i / embed_size)).T
        self.pos_encodings = tf.constant(pos_emb.astype(self.dtype))
        self.supports_masking = True

    def call(self, inputs):
        batch_max_length = tf.shape(inputs)[1]
        return inputs + self.pos_encodings[:, :batch_max_length]

### Input Embedder
Generic embedder of input, which can handle positional encoders as well.

In [177]:
# generic embedder of input
def input_embedder(inputs, max_length, vocab_size, embed_size, positional_layer=None, positional=False, name=None):
    embeddings = keras.layers.Embedding(vocab_size, embed_size,
                                                    mask_zero=True)(inputs)
    if positional and positional_layer is not None:
        embeddings = positional_layer(embeddings)
    return embeddings, keras.Model(inputs, embeddings, name=name)

### Transformer Encoder

The encoder architecture is a stack of N blocks, each made of the following components:

 - Residual connection: prevents vanishing gradients and restores "spatial" information lost in the processing.

 - Masking: to avoid processing the padding of the input.

 - MultiHead Attention module: applies num_heads attention heads to the inputs, learning Values, Queries and Keys by means of small linear layers.

 - Dropout: dropout is applied to avoid learning correlations between activations, making the network more robust.

 - LayerNormalization: prevents vanishing gradients by computing the required normalization statistics at each time step for each instance.

 - Dense: decision making based on the outputs of the attention mechanism.

In [178]:

def attn_encoder(encoder_in, N, num_heads, dense_units, embed_size, dropout_rate):
    Z = encoder_in
    for i in range(N):
        skip = Z                                                                                            # Residual
        Z = keras.layers.Masking(mask_value=0.0, name=f"encoder_mask_{i}")(Z)                               # Masking
        attn_layer = keras.layers.MultiHeadAttention(                                                       
            num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate, name=f"encoder_self_attn_{i}")   
        Z = attn_layer(Z, value=Z)#, attention_mask=encoder_pad_mask)                                       # Multi-Head Attention
        Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))                                # Normalization
        skip = Z                                                                                            # Residual
        Z = keras.layers.Dense(dense_units, activation="relu")(Z)                                           # Dense
        Z = keras.layers.Dense(embed_size)(Z)                                                               # Dense no-activation
        Z = keras.layers.Dropout(dropout_rate)(Z)                                                           # Dropout
        Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))                                # Normalization
    return Z, keras.Model(encoder_in, Z, name=f"attn_encoder")

### Transformer Decoder

The decoder architecture is a stack of N blocks, each made of the following components:

 - Residual connection: prevents vanishing gradients and restores "spatial" information lost in the processing.

 - Masking: to avoid processing the padding of the input.

 - Masked MultiHead Attention module: applies num_heads attention heads to the inputs, learning Values, Queries and Keys by means of small linear layers. It is causally masked, so it cannot cheat by looking at future tokens.

 - Dropout: dropout is applied to avoid learning correlations between activations, making the network more robust.

 - LayerNormalization: prevents vanishing gradients by computing the required normalization statistics at each time step for each instance.

 - Dense: decision making based on the outputs of the attention mechanism.

> Note:\
The first skip connection has been removed to avoid cheating.\
In fact, without a proper mask, the decoder would see the target and use it.
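
The causal mask itself is just a lower-triangular boolean matrix. A plain-Python sketch of the same pattern that `tf.linalg.band_part` produces in the cell below (`causal_mask` is an illustrative helper):

```python
def causal_mask(n):
    """Lower-triangular boolean mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print("".join("1" if allowed else "." for allowed in row))
# 1...
# 11..
# 111.
# 1111
```

Row i marks the positions visible when predicting output i, so each step only sees the tokens already produced.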

In [179]:
def attn_decoder(decoder_in, N, num_heads, dense_units, embed_size, dropout_rate, encoder_outputs):
    Z = decoder_in  # the decoder starts with its own inputs
    causal_mask = tf.linalg.band_part(  # creates a lower triangular matrix
            tf.ones((max_length, max_length), tf.bool), -1, 0).numpy()
    for i in range(N):
        skip = Z                                                                                            # Residual
        Z = keras.layers.Masking(mask_value=0.0, name=f"decoder_mask1_{i}")(Z)                              # Masking
        # self-attention: creates queries
        attn_layer = keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate, name=f"decoder_self_attn_{i}")
        Z = attn_layer(Z, value=Z, use_causal_mask=True, attention_mask=causal_mask)                        # Masked Multi-Head Attention
        Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))                                # Normalization
        # skip = Z # avoid cheating                                                                         # Residual - deleted
        Z = keras.layers.Masking(mask_value=0.0, name=f"decoder_mask2_{i}")(Z)                              # Masking
        attn_layer = keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate, name=f"decoder_attn_{i}")
        Z = attn_layer(Z, value=encoder_outputs)#, attention_mask=encoder_pad_mask)                         # Multi-Head Attention
        Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))                                # Normalization
        skip = Z                                                                                            # Residual
        Z = keras.layers.Dense(dense_units, activation="relu")(Z)                                           # Dense
        Z = keras.layers.Dense(embed_size)(Z)                                                               # Dense
        Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))                                # Normalization
    return Z, keras.Model((decoder_in,encoder_outputs), Z, name=f"attn_decoder")

## Model Architecture

Let's build the model.

Positional encoding is applied only to decoder inputs, while encoder inputs are left unchanged.

> Note:\
The last Dense layer outputs max_length probabilities instead of a probability distribution over the whole vocabulary.\
This leads to a cheaper model in terms of resources and forces the model to learn the LabelEncoder mapping used to encode the target positions of the input tokens, as well as the meaning of the words and of the sentence itself.\
Specifically, the encoder must not only vectorize the sentence but also carry the mapping of the input tokens, so that they are encoded as LabelEncoder does.
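
To give an idea of the saving, the size of the output layer can be computed by hand. This back-of-the-envelope sketch assumes `embed_size = 128`, `max_length = 28` (the width of the data matrix) and a vocabulary of 10,000 tokens (the `max_tokens` of the tokenizer); `dense_params` is an illustrative helper.

```python
def dense_params(n_inputs, n_outputs):
    """Number of trainable parameters of a Dense layer: weights plus biases."""
    return n_inputs * n_outputs + n_outputs

embed_size, max_length, vocab_size = 128, 28, 10_000

position_head = dense_params(embed_size, max_length)  # output over sentence positions
vocab_head = dense_params(embed_size, vocab_size)     # output over the whole vocabulary

print(position_head)  # 3612
print(vocab_head)     # 1290000
```

Under these assumptions the final layer shrinks by a factor of roughly 350, which counts directly toward the parameter bonus.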

In [757]:
N = 2  # num of stacked modules
num_heads = 8
embed_size = 128
vocab_size = len(tokenizer.get_vocabulary())
dropout_rate = 0.1
n_units = 128  # for the first Dense layer in each Feed Forward block
max_length = shuffled_data.shape[-1]

## "Sequential" declaration of model (layers over layers) - returned layers won't be used, instead models will be used both in training and inference 
encoder_inputs = keras.layers.Input(shape=[max_length,])
decoder_inputs = keras.layers.Input(shape=[max_length,])
pos_embed_layer = PositionalEncoding(max_length, embed_size, name="pos_encoder")
encoder_in, encoder_embedder = input_embedder(encoder_inputs, max_length, vocab_size, embed_size, positional=False, name="encoder_embedder")
decoder_in, decoder_embedder = input_embedder(decoder_inputs, max_length, vocab_size, embed_size, positional=True, positional_layer=pos_embed_layer, name="decoder_embedder")
Z, encoder_model = attn_encoder(encoder_in, N, num_heads, n_units, embed_size, dropout_rate)
encoder_outputs = Z  # let's save the encoder's final outputs
Z_seq, decoder_model = attn_decoder(decoder_in, N, num_heads, n_units, embed_size, dropout_rate, encoder_outputs)

## "Functional" declaration of model (encoder+decoder macro models) - useful in inference
enc_embed = encoder_embedder(encoder_inputs)
Z = encoder_model(enc_embed)
dec_embed = decoder_embedder(decoder_inputs)
Z_fun = decoder_model((dec_embed,Z))

In [758]:
# Y_proba = tf.keras.layers.Dense(max_length, activation="softmax", name="attn_softmax")(Z_seq)
# model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
#                        outputs=[Y_proba], name=f"transformer")
dense = keras.layers.Dense(max_length, activation="softmax", name="attn_softmax")
Y_proba = dense(Z_fun)
model = keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba], name=f"transformer")
model.summary()

## Breakdown Structure of Model

In [182]:
encoder_embedder.summary()
decoder_embedder.summary()
encoder_model.summary()
decoder_model.summary()

# Training

Let's start training the model.

In [183]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])

## Overfitting one Batch

A first basic test: try to overfit a single batch of unshuffled data to check whether the model, the loss and the hyperparameters are working at all.

In [184]:
model.fit(ShuffleEDGenerator(X_train[0:32], X_train_dec[0:32], Y_train[0:32], shuffle=False), epochs=100,
          validation_data=ShuffleEDGenerator(X_valid[0:32], X_valid_dec[0:32], Y_valid[0:32], shuffle=False))

Epoch 1/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 37s 37s/step - accuracy: 0.0045 - loss: 3.8419 - val_accuracy: 0.5257 - val_loss: 2.5808
Epoch 2/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 5s 5s/step - accuracy: 0.5100 - loss: 2.6068 - val_accuracy: 0.0223 - val_loss: 4.1139
Epoch 3/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 5s 5s/step - accuracy: 0.0201 - loss: 4.0794 - val_accuracy: 0.5257 - val_loss: 1.9421
Epoch 4/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 4s 4s/step - accuracy: 0.5100 - loss: 1.9914 - val_accuracy: 0.5257 - val_loss: 1.8057
Epoch 5/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - accuracy: 0.5100 - loss: 1.8554 - val_accuracy: 0.5257 - val_loss: 1.7153
Epoch 6/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - accuracy: 0.5100 - loss: 1.7637 - val_accuracy: 0.5257 - val_loss: 1.6390
Epoch 7/100
1/1 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x29f0daee0>

### Some Test on 1Batch-Overfitted Model

These are shown just for illustration, on unshuffled and shuffled data from the batch.

> Note: Cheating involved!\
 Here the decoder part of the model is fed with the target (using the causal mask), not with its own previous output. Since accuracy is ~1.0 this is not a problem (i.e. target ~= previous output).
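
At inference time the target is not available, so the decoder has to be fed its own previous predictions. A minimal sketch of such a greedy loop follows. `predict_step` is a hypothetical callable standing in for a call to the trained model followed by an argmax at position `t`, and the rank-to-token mapping assumes that labels are ranks among the sorted distinct ids of the sentence, as produced by `LabelEncoder`.

```python
PAD, START, END = 0, 2, 3  # token ids as in the vocabulary listing above

def greedy_decode(predict_step, encoder_input, max_length):
    """Decode one sentence token by token, feeding back the model's own output."""
    # Rank labels index into the sorted distinct ids of the sentence.
    classes = sorted(set(encoder_input) | {PAD, END})
    decoder_input = [START] + [PAD] * (max_length - 1)
    decoded = []
    for t in range(max_length):
        label = predict_step(encoder_input, decoder_input, t)
        token = classes[label]
        if token in (END, PAD):
            break
        decoded.append(token)
        if t + 1 < max_length:
            decoder_input[t + 1] = token  # previous output becomes next input
    return decoded

# Stand-in for the trained model: an oracle that always returns the right label.
truth = [56, 12, 9242, 9, 4]
shuffled = [9, 56, 4, 9242, 12, 0, 0, 0]

def oracle_step(encoder_input, decoder_input, t):
    classes = sorted(set(encoder_input) | {PAD, END})
    token = truth[t] if t < len(truth) else END
    return classes.index(token)

print(greedy_decode(oracle_step, shuffled, max_length=8))
# [56, 12, 9242, 9, 4]
```

The loop takes a single argmax per step and keeps no alternative hypotheses, so it stays within the "no postprocessing" constraint.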

In [186]:
idx = 0
shuffle = False
(X,X_dec),Y =ShuffleEDGenerator(X_train[idx:idx+1], X_train_dec[idx:idx+1], Y_train[idx:idx+1], shuffle=shuffle).__getitem__(0)
y_proba = model.predict((X, X_dec)) # cheating a bit: the decoder is fed the target
print(X,X_dec)
tf.argmax(y_proba,axis=-1), Y

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 397ms/step
[[2771   12  622   55    7   19  124  640    7   13   14  694   10  713
     5  566 3154   11   10  395  633    5   58    0    0    0    0    0]] [[   2 2771   12  622   55    7   19  124  640    7   13   14  694   10
   713    5  566 3154   11   10  395  633    5   58    0    0    0    0]]


(<tf.Tensor: shape=(1, 28), dtype=int64, numpy=
 array([[20,  6, 15, 10,  3,  9, 12, 17,  3,  7,  8, 18,  4, 19,  2, 14,
         21,  5,  4, 13, 16,  2, 11,  1,  0,  0,  0,  0]])>,
 array([[20,  6, 15, 10,  3,  9, 12, 17,  3,  7,  8, 18,  4, 19,  2, 14,
         21,  5,  4, 13, 16,  2, 11,  1,  0,  0,  0,  0]]))

In [187]:
idx = 0
shuffle = True
(X,X_dec),Y =ShuffleEDGenerator(X_train[idx:idx+1], X_train_dec[idx:idx+1], Y_train[idx:idx+1], shuffle=shuffle).__getitem__(0)
y_proba = model.predict((X, X_dec)) # cheating a bit: the decoder is fed the target
print(X,X_dec)
tf.argmax(y_proba,axis=-1), Y

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 214ms/step
[[ 633  622   14   10   12   19  395  713   58    7    5    5   13  694
   566   55  640   11  124 3154    7 2771   10    0    0    0    0    0]] [[   2 2771   12  622   55    7   19  124  640    7   13   14  694   10
   713    5  566 3154   11   10  395  633    5   58    0    0    0    0]]


(<tf.Tensor: shape=(1, 28), dtype=int64, numpy=
 array([[2771,   12,  622,   55,    7,   19,  124,  640,    7,   13,   14,
          694,   10,  713,    5,  566, 3154,   11,   10,  395,  633,    5,
           58,    0,    0,    0,    0,    0]])>,
 <tf.Tensor: shape=(1, 28), dtype=int64, numpy=
 array([[20,  6, 15, 10,  3,  9, 12, 17,  3,  7,  8, 18,  4, 19,  2, 14,
         21,  5,  4, 13, 16,  2, 11,  1,  0,  0,  0,  0]])>,
 array([[20,  6, 15, 10,  3,  9, 12, 17,  3,  7,  8, 18,  4, 19,  2, 14,
         21,  5,  4, 13, 16,  2, 11,  1,  0,  0,  0,  0]]),
 1.0,
 'period big can a are with short lot time lemmacomma of of that cover real animals legs in long estate lemmacomma elk a      ',
 'elk are big animals lemmacomma with long legs lemmacomma that can cover a lot of real estate in a short period of time lemmaend     ',
 'elk are big animals lemmacomma with long legs lemmacomma that can cover a lot of real estate in a short period of time lemmaend     ')

# Fit Training Set

Let's fit the whole training set, first training on unshuffled data and then on shuffled data.

> Note: Many checkpoints have been saved for fast recovery, if needed.

## Training on Unshuffled Data

In [535]:
model.fit(ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=False), epochs=2, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid, X_valid_dec, Y_valid, shuffle=False))

Epoch 1/2
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1779s 259ms/step - accuracy: 0.6727 - loss: 0.9415 - val_accuracy: 0.7368 - val_loss: 0.7405
Epoch 2/2
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1792s 261ms/step - accuracy: 0.7418 - loss: 0.7296 - val_accuracy: 0.7748 - val_loss: 0.6321


<keras.src.callbacks.history.History at 0x28ff9f520>

In [536]:
model.save_weights("./models/EXAM_same_attn_2layers_2epoch_checkpoint.weights.h5")

In [537]:
model.fit(ShuffleEDGenerator(X_train[-1000:], X_train_dec[-1000:], Y_train[-1000:], shuffle=True), epochs=5, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid[-1000:], X_valid_dec[-1000:], Y_valid[-1000:], shuffle=False))

Epoch 1/5
31/31 ━━━━━━━━━━━━━━━━━━━━ 9s 291ms/step - accuracy: 0.7749 - loss: 0.6284 - val_accuracy: 0.8071 - val_loss: 0.5380
Epoch 2/5
31/31 ━━━━━━━━━━━━━━━━━━━━ 9s 288ms/step - accuracy: 0.7969 - loss: 0.5642 - val_accuracy: 0.8233 - val_loss: 0.4926
Epoch 3/5
31/31 ━━━━━━━━━━━━━━━━━━━━ 10s 318ms/step - accuracy: 0.8123 - loss: 0.5113 - val_accuracy: 0.8402 - val_loss: 0.4409
Epoch 4/5
31/31 ━━━━━━━━━━━━━━━━━━━━ 9s 301ms/step - accuracy: 0.8260 - loss: 0.4760 - val_accuracy: 0.8495 - val_loss: 0.4131
Epoch 5/5
31/31 ━━━━━━━━━━━━━━━━━━━━ 9s 295ms/step - accuracy: 0.8299 - loss: 0.4618 - val_accuracy: 0.8526 - val_loss: 0.4033


<keras.src.callbacks.history.History at 0x8362e9fa0>

In [538]:
model.save_weights("./models/EXAM_same_attn_2layers_2_1_epoch_checkpoint.weights.h5")

In [539]:
model.fit(ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=False), epochs=3, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid, X_valid_dec, Y_valid, shuffle=False))

Epoch 1/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1773s 258ms/step - accuracy: 0.7720 - loss: 0.6463 - val_accuracy: 0.8001 - val_loss: 0.5667
Epoch 2/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1805s 263ms/step - accuracy: 0.7936 - loss: 0.5883 - val_accuracy: 0.8207 - val_loss: 0.5129
Epoch 3/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1791s 260ms/step - accuracy: 0.8102 - loss: 0.5448 - val_accuracy: 0.8346 - val_loss: 0.4756


<keras.src.callbacks.history.History at 0x28ffb20a0>

In [None]:
model.save_weights("./models/EXAM_same_attn_2layers_5epoch_checkpoint.weights.h5")

In [572]:
opt = keras.optimizers.AdamW(learning_rate=5e-4)
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt,
              metrics=["accuracy"])

In [580]:
model.fit(ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=False), epochs=3, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid, X_valid_dec, Y_valid, shuffle=False))

Epoch 1/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1379s 201ms/step - accuracy: 0.8395 - loss: 0.4618 - val_accuracy: 0.8617 - val_loss: 0.3991
Epoch 2/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1338s 195ms/step - accuracy: 0.8496 - loss: 0.4337 - val_accuracy: 0.8676 - val_loss: 0.3834
Epoch 3/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1333s 194ms/step - accuracy: 0.8545 - loss: 0.4198 - val_accuracy: 0.8713 - val_loss: 0.3724


<keras.src.callbacks.history.History at 0x9378887c0>

In [581]:
model.save_weights("./models/EXAM_same_attn_2layers_8epoch_checkpoint.weights.h5")

In [190]:
opt = keras.optimizers.AdamW(learning_rate=1e-4)
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt,
              metrics=["accuracy"])

In [583]:
model.fit(ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=False), epochs=3, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid, X_valid_dec, Y_valid, shuffle=False))

Epoch 1/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1584s 217ms/step - accuracy: 0.8579 - loss: 0.4103 - val_accuracy: 0.8745 - val_loss: 0.3645
Epoch 2/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1338s 195ms/step - accuracy: 0.8610 - loss: 0.4023 - val_accuracy: 0.8778 - val_loss: 0.3549
Epoch 3/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1378s 200ms/step - accuracy: 0.8636 - loss: 0.3943 - val_accuracy: 0.8803 - val_loss: 0.3486


<keras.src.callbacks.history.History at 0x82cf35430>

In [590]:
model.save_weights("./models/EXAM_same_attn_2layers_11epoch_checkpoint.weights.h5")

In [594]:
model.fit(ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=False), epochs=3, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid, X_valid_dec, Y_valid, shuffle=False))

Epoch 1/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1614s 221ms/step - accuracy: 0.8667 - loss: 0.3865 - val_accuracy: 0.8826 - val_loss: 0.3418
Epoch 2/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1339s 195ms/step - accuracy: 0.8686 - loss: 0.3804 - val_accuracy: 0.8854 - val_loss: 0.3339
Epoch 3/3
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1343s 195ms/step - accuracy: 0.8706 - loss: 0.3749 - val_accuracy: 0.8877 - val_loss: 0.3277


<keras.src.callbacks.history.History at 0x70940f190>

In [194]:
model.save_weights("./models/EXAM_same_attn_2layers_14epoch_checkpoint.weights.h5")

In [670]:
model.fit(ShuffleEDGenerator(X_train[:2000], X_train_dec[:2000], Y_train[:2000], shuffle=False), epochs=4, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid[:200], X_valid_dec[:200], Y_valid[:200], shuffle=False))

Epoch 1/4
62/62 ━━━━━━━━━━━━━━━━━━━━ 13s 194ms/step - accuracy: 0.8726 - loss: 0.3763 - val_accuracy: 0.8793 - val_loss: 0.3558
Epoch 2/4
62/62 ━━━━━━━━━━━━━━━━━━━━ 13s 206ms/step - accuracy: 0.8800 - loss: 0.3504 - val_accuracy: 0.8741 - val_loss: 0.3623
Epoch 3/4
62/62 ━━━━━━━━━━━━━━━━━━━━ 12s 201ms/step - accuracy: 0.8869 - loss: 0.3280 - val_accuracy: 0.8767 - val_loss: 0.3629
Epoch 4/4
62/62 ━━━━━━━━━━━━━━━━━━━━ 12s 195ms/step - accuracy: 0.8911 - loss: 0.3167 - val_accuracy: 0.8761 - val_loss: 0.3645


<keras.src.callbacks.history.History at 0x88f570f10>

In [None]:
model.save_weights("./models/EXAM_same_attn_2layers_14_1_epoch_checkpoint.weights.h5")

## Training on Shuffled Data

Now that the model is stabilizing on unshuffled data (very slowly, though), we can assume it has developed a sense of how to encode sentence meaning (encoder) and how to use this encoding while focusing on previously emitted tokens (decoder).

Next, I'll try to add some (good) noise to the SGD search in parameter space by injecting shuffled data (moving towards our target task), to check whether the model behaves as expected.

> Note: In my opinion it should have been trained a little longer on unshuffled data, reaching >0.9 accuracy, but unfortunately I don't have time for that.

### Overfit a few samples of Shuffled data

Let's check whether the model is able to overfit a small shuffled dataset.

I notice that, at first, it tries to counterbalance the shift in the order distribution of the data, diverging both in loss and accuracy. Before, the order was always the same for each unshuffled sample, so in theory the model could have learnt it by heart (not really feasible in our case with the whole dataset).

Then it stabilizes and starts to overfit the shuffled data, showing that we're moving in the right direction of training. Now the order distribution changes continuously for each sample, but the meaning does not (or at least not by much).
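
A minimal sketch of what shuffling one sample means here, assuming `0` is the padding id and that padding only appears at the end of a row (as the trailing zeros in the printed batches suggest). `shuffle_tokens` is a hypothetical helper, not the notebook's `ShuffleEDGenerator`; only the encoder input is permuted, while the target keeps the original order.

```python
import random

PAD_ID = 0  # assumption: 0 marks padding


def shuffle_tokens(ids, rng):
    """Permute the non-padding ids and put the padding back at the end.

    Assumes padding is trailing; padding found elsewhere would be moved to the end.
    """
    tokens = [t for t in ids if t != PAD_ID]
    padding = [t for t in ids if t == PAD_ID]
    rng.shuffle(tokens)  # in-place permutation driven by the given generator
    return tokens + padding


rng = random.Random(42)  # fixed seed so the illustration is repeatable
sentence = [2771, 12, 622, 55, 7, 0, 0]  # illustrative ids, padded to length 7
shuffled = shuffle_tokens(sentence, rng)
```

The shuffled row contains exactly the same ids as the original one, so the model has to recover the order from the words themselves rather than from their positions.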

In [199]:
model.fit(ShuffleEDGenerator(X_train[:1000], X_train_dec[:1000], Y_train[:1000], shuffle=True), epochs=10, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid[:1000], X_valid_dec[:1000], Y_valid[:1000], shuffle=True))

Epoch 1/10
31/31 ━━━━━━━━━━━━━━━━━━━━ 13s 420ms/step - accuracy: 0.8745 - loss: 0.3634 - val_accuracy: 0.8940 - val_loss: 0.3109
Epoch 2/10
31/31 ━━━━━━━━━━━━━━━━━━━━ 7s 228ms/step - accuracy: 0.8779 - loss: 0.3500 - val_accuracy: 0.8934 - val_loss: 0.3129
Epoch 3/10
31/31 ━━━━━━━━━━━━━━━━━━━━ 7s 219ms/step - accuracy: 0.8848 - loss: 0.3334 - val_accuracy: 0.8929 - val_loss: 0.3139
Epoch 4/10
31/31 ━━━━━━━━━━━━━━━━━━━━ 7s 219ms/step - accuracy: 0.8848 - loss: 0.3305 - val_accuracy: 0.8926 - val_loss: 0.3154
Epoch 5/10
31/31 ━━━━━━━━━━━━━━━━━━━━ 7s 225ms/step - accuracy: 0.8941 - loss: 0.3097 - val_accuracy: 0.8911 - val_loss: 0.3200
Epoch 6/10
31/31 ━━━━━━━━━━━━━━━━━━━━ 7s 217ms/step - accuracy: 0.8973 - loss: 0.2910 - val_accuracy: 0.8922 - val_loss: 0.3204
Epoch 7/10
31/31 ━━━━━━━━

<keras.src.callbacks.history.History at 0x32f2a0820>

In [200]:
model.save_weights("./models/EXAM_same_attn_2layers_14_2_epoch_checkpoint.weights.h5")

### Fit Shuffled data


In [201]:
model.fit(ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=True), epochs=4, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid, X_valid_dec, Y_valid, shuffle=True))

Epoch 1/4
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1685s 245ms/step - accuracy: 0.8712 - loss: 0.3751 - val_accuracy: 0.8886 - val_loss: 0.3241
Epoch 2/4
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1852s 269ms/step - accuracy: 0.8736 - loss: 0.3673 - val_accuracy: 0.8916 - val_loss: 0.3167
Epoch 3/4
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1409s 205ms/step - accuracy: 0.8760 - loss: 0.3607 - val_accuracy: 0.8940 - val_loss: 0.3097
Epoch 4/4
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1390s 202ms/step - accuracy: 0.8778 - loss: 0.3558 - val_accuracy: 0.8953 - val_loss: 0.3059


<keras.src.callbacks.history.History at 0x32f2b1100>

In [203]:
model.save_weights("./models/EXAM_same_attn_2layers_18epoch_checkpoint.weights.h5")

In [204]:
model.fit(ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=True), epochs=2, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid, X_valid_dec, Y_valid, shuffle=True))

Epoch 1/2
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1332s 194ms/step - accuracy: 0.8798 - loss: 0.3496 - val_accuracy: 0.8967 - val_loss: 0.3017
Epoch 2/2
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1320s 192ms/step - accuracy: 0.8811 - loss: 0.3457 - val_accuracy: 0.8992 - val_loss: 0.2948


<keras.src.callbacks.history.History at 0x3ee6b5f10>

In [205]:
model.save_weights("./models/EXAM_same_attn_2layers_20epoch_checkpoint.weights.h5")

In [206]:
model.fit(ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=True), epochs=5, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid, X_valid_dec, Y_valid, shuffle=True))

Epoch 1/5
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1301s 189ms/step - accuracy: 0.8826 - loss: 0.3416 - val_accuracy: 0.9008 - val_loss: 0.2912
Epoch 2/5
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1314s 191ms/step - accuracy: 0.8842 - loss: 0.3373 - val_accuracy: 0.9020 - val_loss: 0.2873
Epoch 3/5
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1324s 193ms/step - accuracy: 0.8856 - loss: 0.3332 - val_accuracy: 0.9026 - val_loss: 0.2846
Epoch 4/5
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1339s 195ms/step - accuracy: 0.8871 - loss: 0.3291 - val_accuracy: 0.9054 - val_loss: 0.2779
Epoch 5/5
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1318s 192ms/step - accuracy: 0.8885 - loss: 0.3254 - val_accuracy: 0.9068 - val_loss: 0.2740


<keras.src.callbacks.history.History at 0x4eb7d3b80>

In [207]:
model.save_weights("./models/EXAM_same_attn_2layers_25epoch_checkpoint.weights.h5")

In [209]:
model.fit(ShuffleEDGenerator(X_train, X_train_dec, Y_train, shuffle=True), epochs=5, batch_size=32,
          validation_data=ShuffleEDGenerator(X_valid, X_valid_dec, Y_valid, shuffle=True))

Epoch 1/5
6875/6875 ━━━━━━━━━━━━━━━━━━━━ 1332s 194ms/step - accuracy: 0.8897 - loss: 0.3216 - val_accuracy: 0.9079 - val_loss: 0.2707
Epoch 2/5
 295/6875 ━━━━━━━━━━━━━━━━━━━━ 20:38 188ms/step - accuracy: 0.8929 - loss: 0.3132

KeyboardInterrupt: 

> Note:\
At this point the model is not able to learn anything more: validation loss starts going down while the loss is diverging and the accuracy is dropping.\
The accuracy achieved is ~0.89, which is fairly good.
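
The decision to stop above was taken by hand. As a minimal sketch, assuming one wants a patience rule on validation loss instead, a plateau check could look like the following; `has_plateaued` and its default thresholds are hypothetical and were not used in this notebook.

```python
def has_plateaued(val_losses, patience=3, min_delta=1e-3):
    """True when the last `patience` epochs did not beat the earlier best by `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return best_before - recent_best < min_delta


# Illustrative histories (made-up numbers, not taken from the runs above).
flat_history = [0.50, 0.40, 0.40, 0.40, 0.40]
improving_history = [0.50, 0.45, 0.40, 0.35, 0.30]

stop_flat = has_plateaued(flat_history)
stop_improving = has_plateaued(improving_history)
```

Keras ships a comparable built-in, `keras.callbacks.EarlyStopping`, which takes `monitor`, `patience` and `min_delta` arguments and can be passed to `model.fit` through `callbacks`.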

# Testing

Let's test the model on the shuffled test dataset.

In [900]:
vocab = tokenizer.get_vocabulary()

In [739]:
def score_model(model, X_test, Y_test):
    '''
        Use the score function to evaluate the model on the data X_test against the target Y_test.

        Returns a list of scores, one per sample.
    '''
    test_generator = ShuffleEDGenerator(X_test, None, Y_test, shuffle=True, raw=True)
    indexes = test_generator.indexes
    batch_size = test_generator.batch_size

    scores = []
    batch_idx = 0
    for (X_batch,_),Y_batch in tqdm(test_generator):
        if X_batch.shape[0] == 0:
            break
        preds = np.zeros_like(X_batch)
        preds[:,0] = tokenizer(START_STRING)[0]
        for i in range(X_batch.shape[1]-1):
            y_proba = model.predict((X_batch, preds))
            y_proba_maxs = np.argmax(y_proba,axis=-1)[:,i] # most probable class id at the i-th position
            pred_ids = [Y_test[indexes[batch_idx*batch_size+j],np.argmax(Y_batch[j]==y_proba_maxs[j])].numpy() for j in range(X_batch.shape[0])]
            pred_words = [vocab[id] for id in pred_ids]
            preds[:,i+1:i+2] = tokenizer(pred_words)

        preds_det = detokenizer([preds[i,1:np.argmin(preds[i]!=tokenizer(END_STRING)[0])] for i in range(len(preds))])   
        Y_raw_det = detokenizer([Y_test[indexes[batch_idx*batch_size+i],:np.argmin(Y_test[indexes[batch_idx*batch_size+i],:])-1].numpy() for i in range(len(Y_batch))])
        _scores = [score(preds_det[i], Y_raw_det[i]) for i in range(len(preds_det))]
        scores += _scores
        print(preds_det)
        print(Y_raw_det)
        print(_scores)
        print("avg batch score:",np.average(scores))
        batch_idx += 1
    return scores
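
The autoregressive loop inside `score_model` can be sketched in isolation as follows. This is a simplified, single-sentence illustration: `greedy_decode` and `toy_step` are hypothetical names, `toy_step` stands in for `model.predict` followed by an argmax, and the batching and index-to-token mapping of the real function are left out. Unlike the function above, the sketch stops as soon as the end id is produced.

```python
def greedy_decode(step_fn, encoder_ids, start_id, end_id, max_len):
    """Generate ids one at a time, feeding each prediction back as decoder input."""
    decoded = [start_id]
    for _ in range(max_len - 1):
        next_id = step_fn(encoder_ids, decoded)  # most likely next id given the prefix
        decoded.append(next_id)
        if next_id == end_id:
            break
    return decoded


def toy_step(encoder_ids, prefix):
    """Hypothetical stand-in for the model: replays a fixed answer, ignoring the encoder input."""
    answer = [5, 7, 9, 3]  # 3 plays the role of the end id in this toy example
    return answer[len(prefix) - 1]


result = greedy_decode(toy_step, encoder_ids=[9, 5, 7], start_id=2, end_id=3, max_len=10)
truncated = greedy_decode(toy_step, encoder_ids=[9, 5, 7], start_id=2, end_id=99, max_len=3)
```

Because each step only sees ids that were already emitted, this loop has no access to the target, which is what separates the test-time score from the teacher-forced accuracy reported during training.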

### 25-epoch Model Score

In [732]:
model.load_weights("./models/EXAM_same_attn_2layers_25epoch_checkpoint.weights.h5")

scores_25 = score_model(model, X_test, Y_test)
print("score avg:",np.average(scores_25))
print("score std:",np.std(scores_25))

  0%|          | 0/662 [00:00<?, ?it/s]

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 138ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 63ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 4

### Random Model

In [903]:
X_test.shape

TensorShape([21194, 28])

In [877]:
random_scores = [ score(X_test[i].numpy(),tf.random.shuffle(X_test[i]).numpy()) for i in range(X_test.shape[0])]

# RESULTS

With a Transformer-based model of *5.86M* parameters, I achieved *~43%* on the provided metric, with *~89%* accuracy on the training set. The model reached its full capacity in *25 epochs*.

The model score is *4.8* standard deviations away from the baseline (random model) score.

A summary of the scores below.

In [899]:
import pandas as pd
df = pd.DataFrame(np.vstack([scores_25,random_scores]).T, columns=["my_score","random_scores"])
df.describe()

| | my_score | random_scores |
|---|---|---|
| count | 21194.0 | 21194.0 |
| mean | 0.429385 | 0.177574 |
| std | 0.25039 | 0.065375 |
| min | 0.065574 | 0.035714 |
| 25% | 0.247423 | 0.142857 |
| 50% | 0.347222 | 0.178571 |
| 75% | 0.520408 | 0.214286 |
| max | 1.0 | 0.535714 |


## Notes

 - Previously, I tried an LSTM-based encoder-decoder sequence-to-sequence model, but learning was slower.

 - Larger models have not been tested due to the limited resources available. The model capacity, and the score accordingly, could have benefited from them.

 - Stemming preprocessing is not applied. The score, the computation time and the number of parameters of the two Embedding layers could have benefited from it. I didn't use it, both because I tried it too late and because it's not a completely reversible process.
 
 - The first residual connection of the decoder was removed, because it carried the unmasked target to the following layers, letting the network cheat. A solution could have been to develop a custom layer able to apply a time-distributed mask to that residual link. Backpropagation and the score would have benefited from it, but I didn't have the time.
 
 - Other interesting approaches to the task:
    - Same Transformer-based architecture, but using positional encoding on the encoder inputs and training on unshuffled data only. After training, replace only the encoder with an identical one without positional encoding and, keeping it as the only trainable component, train the model a second time on shuffled data. This second training could also be a regression task on the latent space produced by the first Transformer-based architecture.
    - Diffusion-based architecture, which requires a notion of randomness or noise for shuffled text data to be learnt.

### Tech Notes

 - Weights file compatibility: I used Keras 3 and TensorFlow 2.16.