# Sentence Reconstruction

The purpose of this project is to take as input a sequence of words corresponding to a random permutation of a given English sentence, and to reconstruct the original sentence.

The output can be produced either in a single shot or through an iterative (autoregressive) loop that generates a single token at a time.


CONSTRAINTS:
* No pretrained model can be used.
* The neural network models should have fewer than 20M parameters.
* No postprocessing should be done (e.g. no beam search).
* You cannot use additional training data.


BONUS PARAMETERS:

A bonus of 0-2 points will be awarded to encourage the adoption of models with a low number of parameters.

In [1]:
!pip install datasets
!pip install --upgrade keras





In [2]:
import tensorflow as tf
import keras
from keras import ops
from keras import layers
from keras.layers import Embedding

from datasets import load_dataset
import string
import re

import numpy as np
import math

## Data preprocessing

1) Download the data

2) Create the tokenizer and detokenizer

3) Remove the sentences containing unknown tokens

4) Create a generator to feed the data to the model while training, validating and testing

### Download the dataset

In [3]:
VOCAB_SIZE = 10000
SEQ_LEN = 28

BATCH_SIZE = 256

In [4]:
ds = load_dataset('generics_kb', trust_remote_code=True)['train']
ds = ds.filter(lambda row: len(row["generic_sentence"].split(" ")) > 8)

Generating train split:   0%|          | 0/1020868 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1020868 [00:00<?, ? examples/s]

### Create the tokenizer and detokenizer
Define the tokens that will be used by the tokenizer

In [5]:
# Define a class that contains all the special tokens we need
class Tokens:
    COMMA = '<comma>'
    START = '<start>'
    END = '<end>'

Add the tokens to the original sentences

In [6]:
# Define a vectorized function that adds the special tokens to the original strings in the dataset
add_token_vect = np.vectorize(
    lambda x: f'{Tokens.START} ' + x.replace(',', f' {Tokens.COMMA}') + f' {Tokens.END}')

# Apply the function to the 'generic_sentence' column of the dataset
corpus = add_token_vect(ds['generic_sentence'])

Create a custom preprocessing function that removes every special character from the sentences of the original dataset, except `,` (which is encoded as a token) and `<`, `>` (which are part of the tokens)

In [7]:
# This function removes every special character other than `<`, `>` and `,` from the original sentences.
def custom_preprocessing(text):
    chars = string.punctuation
    chars = chars.replace(",", "")
    chars = chars.replace("<", "")
    chars = chars.replace(">", "")
    # Remove punctuation
    text = tf.strings.regex_replace(text, '[%s]' % re.escape(chars), '')
    # Lowercase
    text = tf.strings.lower(text)
    return text
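
To see what the standardisation does without running TensorFlow, the same steps can be reproduced with the standard library. This is only a sketch: `preprocess_py` is a hypothetical helper that is not part of the notebook, and it uses Python's `re` module rather than the regex engine behind `tf.strings.regex_replace`.

```python
import re
import string

# Punctuation to strip: everything except ',', '<' and '>' (kept for the special tokens).
chars = string.punctuation
for keep in ",<>":
    chars = chars.replace(keep, "")
pattern = re.compile("[%s]" % re.escape(chars))

def preprocess_py(text):
    """Pure-Python stand-in for custom_preprocessing (hypothetical helper)."""
    return pattern.sub("", text).lower()

print(preprocess_py("<start> Dogs' tails wag <comma> don't they? <end>"))
# -> <start> dogs tails wag <comma> dont they <end>
```

The apostrophes and the question mark disappear, while the angle brackets of the special tokens survive, so `<start>`, `<comma>` and `<end>` remain single vocabulary entries.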

Create the tokenizer, using the `custom_preprocessing` function in order to standardise the input  

In [8]:
tokenizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize=custom_preprocessing,
    output_sequence_length=SEQ_LEN,
    output_mode='int',
    pad_to_max_tokens=True,
)

#adapt the tokenizer to the text of the ds
tokenizer.adapt(corpus)

vocab = tokenizer.get_vocabulary()

#visualize the first 10 tokens of the tokenizer
print(vocab[:10])

['', '[UNK]', '<start>', '<end>', 'the', 'of', 'and', '<comma>', 'is', 'to']


Create a detokenizer. This class is especially useful during testing, where the generated sentences must be detokenized and compared with the original ones in order to compute the score

In [9]:
class TextDetokenizer:
    def __init__(self, vectorize_layer):
        self.vectorize_layer = vectorize_layer
        vocab = self.vectorize_layer.get_vocabulary()
        self.index_to_word = {index: word for index, word in enumerate(vocab)}

    def __detokenize_tokens(self, tokens):
        def check_token(t):
            if t == 2:
                s = "<start>"
            elif t == 3:
                s = "<end>"
            elif t == 7:
                s = "<comma>"
            else:
                s = self.index_to_word.get(t, '[UNK]')
            return s

        return ' '.join([check_token(token) for token in tokens if token != 0])

    def __call__(self, batch_tokens):
        return [self.__detokenize_tokens(tokens) for tokens in batch_tokens]

#instantiate the detokenizer
detokenizer = TextDetokenizer(tokenizer)

#tokenize the content of the whole dataset (corpus)
sentences = tokenizer( corpus ).numpy()

Delete from the dataset all the sentences containing at least one `[UNK]` token

In [10]:
mask = np.sum( (sentences==1), axis=1) >= 1
original_data = np.delete( sentences, mask , axis=0)
original_data.shape

(241194, 28)

### Create a generator to feed the data to the model

Since we work with the transformer architecture, we need to define a generator that provides the appropriate input.

The output is composed of two elements: a tuple (containing the input information) and the target variable:

`encoder_input`: the input of the model (scrambled sentence)

`decoder_input`: the previous knowledge of the model about the next word to be generated. Similar to `decoder_output`, but preceded by the `<start>` token

`decoder_output`: the final sequence that should be generated by the decoder (the sentence before being scrambled)


In [11]:
class DataGenerator(keras.utils.PyDataset):
    def __init__(self, data, batch_size=32, shuffle=True, seed=42, **kwargs):
        super().__init__(**kwargs)
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.indexes = np.arange(len(self.data))

    def __len__(self):
        return math.ceil(len(self.data) / self.batch_size)

    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        data_batch = np.array([self.data[k] for k in indexes])
        result = np.copy(data_batch)

        # shuffle the words between <start> and <end> (the first 0 marks the start of the padding)
        for i in range(data_batch.shape[0]):
            np.random.shuffle(data_batch[i, 1:data_batch[i].argmin() - 1])

        encoder_input = data_batch
        decoder_input = np.copy(result)
        decoder_output = np.copy(result)
        decoder_output = decoder_output[:, 1:]

        #we need to add a column of zeroes at the end in order to match the sizes
        decoder_output = np.pad(decoder_output, [[0, 0], [0, 1]], mode='constant')

        return (encoder_input, decoder_input), decoder_output
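
The layout returned by `__getitem__` can be illustrated on a single padded sentence using plain lists. This is a simplified sketch, not the generator itself: `make_example` is a hypothetical helper, the word ids are made up, and the end of the sentence is located through the `<end>` id instead of `argmin`.

```python
import random

PAD, START, END = 0, 2, 3  # ids as in the vocabulary printed above

def make_example(sentence_ids, rng):
    """Build (encoder_input, decoder_input), decoder_output for one padded sentence."""
    end_pos = sentence_ids.index(END)
    words = sentence_ids[1:end_pos]            # tokens strictly between <start> and <end>
    rng.shuffle(words)                         # the slice is a copy, the original stays intact
    encoder_input = [sentence_ids[0]] + words + sentence_ids[end_pos:]
    decoder_input = list(sentence_ids)
    decoder_output = sentence_ids[1:] + [PAD]  # shifted left by one, padded at the end
    return (encoder_input, decoder_input), decoder_output

sentence = [START, 11, 12, 13, 14, END, PAD, PAD]
(enc, dec_in), dec_out = make_example(sentence, random.Random(0))
print(enc)
print(dec_in)   # [2, 11, 12, 13, 14, 3, 0, 0]
print(dec_out)  # [11, 12, 13, 14, 3, 0, 0, 0]
```

At every position the decoder input holds the tokens seen so far and the decoder output holds the token that should come next, which is what makes teacher-forced training possible.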

Since the dataset contains long runs of related sentences next to each other, it is important to shuffle the data, so that topics and words are spread as evenly as possible across the splits.

To do so we shuffle the data three times (a single uniformly random permutation would already be enough, so the repetitions are harmless but not required).

In [12]:
# Shuffle the whole dataset with a fixed seed before splitting it into training, validation and test sets
np.random.seed(42)
# Apply three successive random permutations
shuffled_indices = np.random.permutation(len(original_data))
shuffled_data = original_data[shuffled_indices]
shuffled_indices = np.random.permutation(len(shuffled_data))
shuffled_data = shuffled_data[shuffled_indices]
shuffled_indices = np.random.permutation(len(shuffled_data))
shuffled_data = shuffled_data[shuffled_indices]

Create the train, validation and test generators from the cleaned and shuffled dataset

In [13]:
#split the dataset
train_generator = DataGenerator(shuffled_data[:220000], batch_size=BATCH_SIZE)
val_generator = DataGenerator(shuffled_data[220000:235000], batch_size=BATCH_SIZE)
test_generator = DataGenerator(shuffled_data[235000:], batch_size=BATCH_SIZE)

The following is an example of what the output of the generator looks like

In [14]:
x, y = train_generator.__getitem__(1)

x_encoder_inputs_detokenized = detokenizer(x[0])
x_decoder_inputs_detokenized = detokenizer(x[1])
y_decoded = detokenizer(y)

for i in range(2):

  print("encoder_input: ",(x_encoder_inputs_detokenized[i]))
  print("decoder_input: ",(x_decoder_inputs_detokenized[i]))
  print("target: ", y_decoded[i])

  print("\n")

encoder_input:  <start> factor in of persons age a developing important disease plays an the chances <end>
decoder_input:  <start> age plays an important factor in a persons chances of developing the disease <end>
target:  age plays an important factor in a persons chances of developing the disease <end>


encoder_input:  <start> exercise ease help also mental regular can stress related disturbances and <end>
decoder_input:  <start> regular exercise can also help ease stress and related mental disturbances <end>
target:  regular exercise can also help ease stress and related mental disturbances <end>




# Metrics

Let s be the source string and p your prediction. The quality of the results will be measured according to the following metric:

1.  find the longest common substring w of s and p
2.  compute |w|/max(|s|,|p|)

If the match is exact, the score is 1.

When computing the score, you should NOT consider the start and end tokens.



The longest common substring can be computed with the `SequenceMatcher` class of `difflib`, which allows a simple definition of our metric.

In [15]:
from difflib import SequenceMatcher


def score(s, p):
    # Explicit bounds keep the call compatible with Python versions older than 3.9
    match = SequenceMatcher(None, s, p).find_longest_match(0, len(s), 0, len(p))
    return match.size / max(len(p), len(s))

Let's do an example.

In [16]:
original = "at first henry wanted to be friends with the king of france"
generated = "henry wanted to be friends with king of france at the first"

print("your score is ", score(original, generated))

your score is  0.5423728813559322
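
The value above can be broken down into its parts. The sketch below repeats the same `SequenceMatcher` call and prints the matched substring and the two lengths; note that the metric works on characters (spaces included), not on words.

```python
from difflib import SequenceMatcher

s = "at first henry wanted to be friends with the king of france"
p = "henry wanted to be friends with king of france at the first"

m = SequenceMatcher(None, s, p).find_longest_match(0, len(s), 0, len(p))
w = s[m.a:m.a + m.size]

print(repr(w))                       # 'henry wanted to be friends with '
print(m.size, len(s), len(p))        # 32 59 59
print(m.size / max(len(s), len(p)))  # 0.5423728813559322
```

The longest common substring has 32 characters and both strings have 59, so the score is 32/59.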


The score must be computed as an average of at least 3K random examples taken from the test set.

# What to deliver

You are supposed to deliver a single notebook, suitably commented.
The notebook should describe a single model, although you may briefly discuss additional attempts you made.

The notebook should contain a full trace of the training.
Weights should be made available on request.

You must also give a clear assessment of the performance of the model, computed with the metric that has been given to you.

# Good work!

# Proposed model: seq to seq transformer

The proposed model is a seq to seq transformer.

This kind of model is composed of 2 main parts:

- Encoder: reads the input sequence (in this case the shuffled words) and produces a sequence of contextual vector representations, one per input token.
- Decoder: generates the output sequence (the original sentence) from the representations provided by the Encoder.

Models of this kind are well known and widely used in natural language processing (NLP) tasks such as translation, summarization and classification.

## Why the transformer
The main reason for choosing this architecture is the self-attention mechanism, which should help the model capture the semantic meaning of the words and thus achieve good performance in reordering the words of a sentence.



## Define the transformer model


In [17]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is not None:
            padding_mask = ops.cast(mask[:, None, :], dtype="int32")
        else:
            padding_mask = None

        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = ops.shape(inputs)[-1]
        positions = ops.arange(0, length, 1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        if mask is None:
            return None
        else:
            return ops.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "vocab_size": self.vocab_size,
                "embed_dim": self.embed_dim,
            }
        )
        return config


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(latent_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = ops.cast(mask[:, None, :], dtype="int32")
            padding_mask = ops.minimum(padding_mask, causal_mask)
        else:
            padding_mask = None

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = ops.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = ops.arange(sequence_length)[:, None]
        j = ops.arange(sequence_length)
        mask = ops.cast(i >= j, dtype="int32")
        mask = ops.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = ops.concatenate(
            [ops.expand_dims(batch_size, -1), ops.convert_to_tensor([1, 1])],
            axis=0,
        )
        return ops.tile(mask, mult)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "latent_dim": self.latent_dim,
                "num_heads": self.num_heads,
            }
        )
        return config
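
The mask built by `get_causal_attention_mask` is a lower-triangular matrix repeated for every element of the batch. A plain-Python sketch of a single such matrix (the helper name `causal_mask` is only illustrative):

```python
def causal_mask(seq_len):
    """Row i has a 1 in column j when position i may attend to position j (j <= i)."""
    return [[1 if i >= j else 0 for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each decoder position can therefore look only at itself and at earlier positions, which prevents the model from reading the words it is supposed to predict.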

Define a function for an easy construction of the model.

Notice that we are not using the `PositionalEmbedding` in the encoder, since the order of the words in the scrambled input sentences carries no meaningful structure
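
The effect of dropping the positional information can be checked on a toy case. The sketch below implements a single attention head with no learned weights (Q = K = V = input), which is a simplifying assumption and not the `MultiHeadAttention` layer used in the model. It shows that permuting the input vectors simply permutes the outputs in the same way, so without positions the encoder treats the sentence as an unordered set of words.

```python
import math

def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

x = [[1.0, 0.0], [0.0, 2.0], [1.5, -1.0]]   # three toy token embeddings
perm = [2, 0, 1]
x_perm = [x[i] for i in perm]

y = self_attention(x)
y_perm = self_attention(x_perm)

# Output i of the permuted input should equal output perm[i] of the original input.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, src in enumerate(perm)
    for a, b in zip(y_perm[i], y[src])
)
print(same)  # True
```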

In [18]:
def get_model(embed_dim, latent_dim, num_heads, sequence_length, vocab_size):
  encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
  x = Embedding(vocab_size, embed_dim)(encoder_inputs)
  encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
  encoder = keras.Model(encoder_inputs, encoder_outputs)

  decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
  encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
  x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
  x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
  x = layers.Dropout(0.5)(x)
  decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
  decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

  decoder_outputs = decoder([decoder_inputs, encoder_outputs])

  transformer = keras.Model(
      [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
  )

  return transformer

# Training the model

## Custom Loss and Scheduling functions

To help the model achieve the best possible performance, some custom functions have been defined.

- `custom_loss`: computes the per-token loss with `SparseCategoricalCrossentropy`, masks out the padding positions, and averages the result over the remaining tokens

- `custom_accuracy`: follows the same principle as `custom_loss`, but measures the fraction of non-padding tokens of the decoded sentence that are in the correct place

- `WarmupScheduler`: a custom learning-rate scheduler that gradually increases the learning rate during a warm-up phase and then follows a specific schedule based on the step count. It is not defined in the cell below; the training run shown in this notebook uses a fixed initial learning rate together with `ReduceLROnPlateau`.

In [19]:
# Per-token loss used inside the custom loss.
# The model ends with a softmax layer, so its outputs are probabilities, not logits.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False, reduction='none')

# Masked loss: padding positions do not contribute
def custom_loss(label, pred):
    # Boolean mask that is True on every real token and False on the padding
    mask = label != 0

    # Per-token loss
    loss = loss_object(label, pred)

    # Cast the mask to the dtype of the loss
    mask = tf.cast(mask, dtype=loss.dtype)

    # Zero the loss on the padding that follows the <end> token
    loss *= mask

    # Average over the non-padding tokens
    return tf.reduce_sum(loss)/tf.reduce_sum(mask)

#same principle of the custom loss
def custom_accuracy(label, pred):
    pred = tf.argmax(pred, axis=2)
    label = tf.cast(label, pred.dtype)

    # check where pred and labels are equals
    match = tf.math.equal(label, pred)

    # mask that is True where the label is not padding (i.e. up to and including the <end> token)
    mask = tf.math.logical_not(tf.math.equal(label, 0))

    #see where the label is equal to prediction and both are not zeroes
    match = match & mask

    #cast to dtype
    match = tf.cast(match, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)

    #return the average accuracy
    return tf.reduce_sum(match)/tf.reduce_sum(mask)
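
The masking idea shared by the two functions can be shown on a single toy sentence. The numbers below are made up and `masked_mean` is a hypothetical helper; the point is only that positions whose label is `0` (padding) are left out of both the sum and the count.

```python
def masked_mean(values, labels, pad_id=0):
    """Average `values` over the positions whose label is not padding."""
    kept = [v for v, y in zip(values, labels) if y != pad_id]
    return sum(kept) / len(kept)

labels         = [15, 42, 3, 0, 0]          # two words, <end>, then padding
per_token_loss = [0.5, 1.5, 1.0, 9.0, 9.0]  # made-up values; the last two fall on padding
predicted      = [15, 40, 3, 7, 7]          # made-up argmax predictions

loss = masked_mean(per_token_loss, labels)
acc = masked_mean([float(p == y) for p, y in zip(predicted, labels)], labels)
print(loss)  # 1.0
print(acc)
```

Without the mask the large values on the padding positions would dominate the average, even though those positions carry no information about the sentence.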


## Define the model hyperparameters and training the model

These are the parameters proposed for the final model:
- `EMBEDDING_DIM` = 128
- `LATENT_DIM` = 512
- `NUM_HEADS` = 20

This gives a **total of ~8M parameters** for the final model, well below the 20M limit and quite small for a transformer-based architecture
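
The figure can be sanity-checked with a back-of-the-envelope count. The sketch below assumes the usual parameter shapes of the Keras layers involved (biases enabled everywhere, attention value dimension equal to `key_dim`); the output of `transformer.summary()` remains the authoritative number.

```python
def dense(n_in, n_out):
    return n_in * n_out + n_out  # kernel + bias

def multi_head_attention(embed_dim, num_heads, key_dim):
    proj = dense(embed_dim, num_heads * key_dim)  # one of the Q, K, V projections
    out = dense(num_heads * key_dim, embed_dim)   # output projection
    return 3 * proj + out

VOCAB, SEQ, EMBED, LATENT, HEADS = 10_000, 28, 128, 512, 20

mha = multi_head_attention(EMBED, HEADS, key_dim=EMBED)
ffn = dense(EMBED, LATENT) + dense(LATENT, EMBED)
layer_norm = 2 * EMBED                            # scale + offset

encoder = VOCAB * EMBED + mha + ffn + 2 * layer_norm
decoder = VOCAB * EMBED + SEQ * EMBED + 2 * mha + ffn + 3 * layer_norm
head = dense(EMBED, VOCAB)

total = encoder + decoder + head
print(f"{total:,}")  # 8,073,872
```

Under these assumptions roughly half of the parameters sit in the three attention blocks, because each of the 20 heads uses a key dimension equal to the full embedding size.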

For the training we are choosing the following hyperparameters:
- `EPOCH` = 30
- `BATCH_SIZE` = 256

During the fit procedure we also use 3 callbacks:
- `EarlyStopping`: monitors the validation accuracy during training and stops early if it stops improving, preventing overfitting.
- `ModelCheckpoint`: saves the model's weights during training whenever the validation loss improves.
- `ReduceLROnPlateau`: decreases the learning rate when the validation loss has not improved for 2 epochs.
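
The patience rule used by `EarlyStopping` can be sketched in a few lines. This is a simplified model of the behaviour with `mode="max"` and no minimum delta, not the Keras implementation, and the metric values are illustrative only.

```python
def early_stop_epoch(metric_per_epoch, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Higher metric values are treated as better (mode="max").
    """
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best, wait = value, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

val_acc = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.80]  # illustrative values only
print(early_stop_epoch(val_acc))  # 6
```

In this toy run the best value is reached at epoch 3; epochs 4, 5 and 6 do not beat it, so training stops at epoch 6 and the later improvement at epoch 7 is never seen. A larger patience trades training time for robustness against such temporary plateaus.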

In [20]:
#Parameters of the models
EMBEDDING_DIM = 128
LATENT_DIM = 512
NUM_HEADS = 20
EPOCH = 30

run_name = 'final_model'

transformer = get_model(EMBEDDING_DIM, LATENT_DIM, NUM_HEADS, SEQ_LEN, VOCAB_SIZE)
transformer.summary()


optimizer = keras.optimizers.AdamW(learning_rate=1e-3)
transformer.compile(optimizer=optimizer, loss=custom_loss, metrics=[custom_accuracy])


history = transformer.fit(
    train_generator,
    epochs=EPOCH,
    validation_data=val_generator,
    callbacks=[
        # The metric is logged as `val_custom_accuracy` (see the training log), so that is the name to monitor
        tf.keras.callbacks.EarlyStopping(monitor="val_custom_accuracy", mode="max", patience=3, restore_best_weights=True),
        tf.keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=2, min_lr=0.00001, verbose=1),
        tf.keras.callbacks.ModelCheckpoint(run_name + '.weights.h5', verbose=1, save_best_only=True, save_weights_only=True),
    ],
)

Epoch 1/30
Epoch 1: val_loss improved from inf to 2.33572, saving model to final_model.weights.h5
860/860 ━━━━━━━━━━━━━━━━━━━━ 85s 81ms/step - custom_accuracy: 0.2608 - loss: 5.5970 - val_custom_accuracy: 0.6309 - val_loss: 2.3357 - learning_rate: 0.0010
Epoch 2/30
Epoch 2: val_loss improved from 2.33572 to 1.41016, saving model to final_model.weights.h5
860/860 ━━━━━━━━━━━━━━━━━━━━ 61s 71ms/step - custom_accuracy: 0.6455 - loss: 2.1975 - val_custom_accuracy: 0.7289 - val_loss: 1.4102 - learning_rate: 0.0010
Epoch 3/30
Epoch 3: val_loss improved from 1.41016 to 1.18767, saving model to final_model.weights.h5
860/860 ━━━━━━━━━━━━━━━━━━━━ 61s 71ms/step - custom_accuracy: 0.7333 - loss: 1.4353 - val_custom_accuracy: 0.7592 - val_loss: 1.1877 - learning_ra

In [21]:
# Save the weights currently held by the model (this writes to the same file used by ModelCheckpoint)
weights_file_name = './' + run_name + '.weights.h5'
transformer.save_weights(weights_file_name)

# Testing the model

In the following section we evaluate the trained model using the given metric

## Reload the best model found
If the `load_weights` flag is set to `True`, the best weights found so far are loaded. This is needed in order to test the model without retraining it.

In [22]:
# Set to True to load previously saved weights instead of retraining
load_weights = False

if load_weights:
  # Load the weights
  transformer.load_weights("./final_model.weights.h5")

### Define a function to decode an input sequence

`decode_sequence()`: takes a scrambled input sentence and uses the trained transformer model to predict the next token of the sequence, gradually building the decoded sentence until the `<end>` token is produced or the maximum length is reached.
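
The control flow of this loop can be sketched independently of the model. In the sketch below `toy_next_token` is a hypothetical stand-in for the transformer that simply emits the source words in alphabetical order; only the structure of the loop (append one token at a time, stop at `<end>` or at the maximum length) mirrors `decode_sequence()`.

```python
START, END = "<start>", "<end>"

def greedy_decode(next_token_fn, source_words, max_len=27):
    """Repeatedly ask next_token_fn for the next token and append it to the output."""
    decoded = [START]
    for _ in range(max_len):
        token = next_token_fn(source_words, decoded)
        decoded.append(token)
        if token == END:
            break
    return " ".join(decoded)

def toy_next_token(source_words, decoded):
    """Hypothetical stand-in for the transformer: emits the source words in sorted order."""
    remaining = sorted(source_words)
    produced = len(decoded) - 1  # tokens generated so far, excluding <start>
    return remaining[produced] if produced < len(remaining) else END

print(greedy_decode(toy_next_token, ["cats", "are", "animals"]))
# <start> animals are cats <end>
```

Because the most probable token is committed at every step and never revised, this is greedy decoding, which is compatible with the constraint that forbids beam search.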


In [23]:
vocab = tokenizer.get_vocabulary()
# Map each token id back to its word
spa_index_lookup = dict(zip(range(len(vocab)), vocab))
max_decoded_sentence_length = 27

def decode_sequence(input_sentence):
    # Tokenize the scrambled sentence once; it is the fixed encoder input
    tokenized_input_sentence = tokenizer([input_sentence])
    decoded_sentence = Tokens.START
    for i in range(max_decoded_sentence_length):
        # Re-tokenize what has been generated so far and drop the last position
        tokenized_target_sentence = tokenizer([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])

        # Greedy choice: take the most probable token at position i
        sampled_token_index = ops.convert_to_numpy(
            ops.argmax(predictions[0, i, :])
        ).item(0)

        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token

        # Stop as soon as the end token is produced
        if sampled_token == Tokens.END:
            break
    return decoded_sentence

## Define a testing function

`transformer`: the model to evaluate

`test_generator`: the generator that provides the test data

`batch_size`: the size of each batch of the test generator

`n_test_elem`: the number of sentences to test the model on

`verbouse`: if set to `True`, prints the full log of the process, consisting of the input, the model prediction and the computed score

In [24]:
def test_model(transformer, test_generator, batch_size, n_test_elem, verbouse=False, return_model_predictions=True, log_status=True):
    
    num_batch_test= math.ceil(n_test_elem / batch_size)
    
    if log_status: 
        print("testing on: ", num_batch_test, " batches")
    
    scores = np.array([])
    result_list = []

    for n_batch in range(num_batch_test):
        
        print(f'starting analysing batch n. {n_batch+1} out of {num_batch_test}')
        x, y = test_generator.__getitem__(n_batch)
        x_encoder_inputs_detokenized = detokenizer(x[0])
        x_decoder_inputs_detokenized = detokenizer(x[1])
        y_decoded = detokenizer(y)

        total_phrase_num = len(x[0])
        for i in range(total_phrase_num):
          
          if i%32 == 0 and log_status: 
                print(f"\tstatus {i+1}/{total_phrase_num}")
            
          input_sentence = (x_encoder_inputs_detokenized[i])
          prediction = decode_sequence(input_sentence)

          # strip the <start>/<end> tokens from target and prediction before scoring
          target = y_decoded[i].replace(Tokens.START, "")
          target = target.replace(Tokens.END, "")

          prediction = prediction.replace(Tokens.START, "")
          prediction = prediction.replace(Tokens.END, "")
          prediction = prediction.lstrip()

          calc_score = score(target, prediction)
          scores = np.append(scores, calc_score)
    

          if verbouse:
            print(f"Batch {n_batch+1}/{num_batch_test}, phrase {i+1}/{total_phrase_num}")
            print('original: ', target)
            print('input: ', input_sentence)
            print('prediction: ', prediction)
            print('score: ', calc_score)
            print('---'*4)
        
          if return_model_predictions:
                result_list.append({
                    'original': target, 
                    'input' : input_sentence, 
                    'prediction' : prediction, 
                    'score' : calc_score
                })
        
        print('Batch end, scores so far: ')
        print(f"\tscore:\t{scores.mean()}")
        print(f"\tstd:\t{scores.std()}")
        print('--'*4)

    #print the results 
    scores = np.array(scores)
    print(f"\n\nTested on {num_batch_test*batch_size} examples \n")
    print("Results:")
    print(f"\tscore:\t{scores.mean()}")
    print(f"\tstd:\t{scores.std()}")
    
    if return_model_predictions: 
        return scores, result_list
    
    return scores

In [25]:
scores, result_list = test_model(transformer, test_generator, batch_size=BATCH_SIZE, n_test_elem=3000, verbouse=False, log_status=True)

starting analysing batch n. 1 out of 12
Batch end, scores so far: 
	score:	0.5073161963063024
	std:	0.2852046958451032
--------
starting analysing batch n. 2 out of 12
Batch end, scores so far: 
	score:	0.5050732153126545
	std:	0.2819123532885841
--------
starting analysing batch n. 3 out of 12
Batch end, scores so far: 
	score:	0.4962788556283621
	std:	0.2831283523432325
--------
starting analysing batch n. 4 out of 12
Batch end, scores so far: 
	score:	0.49347653758049237
	std:	0.28095406924554883
--------
starting analysing batch n. 5 out of 12
Batch end, scores so far: 
	score:	0.4941662950599661
	std:	0.28005937781620627
--------
starting analysing batch n. 6 out of 12
Batch end, scores so far: 
	score:	0.4898037438468042
	std:	0.27911145391882236
--------
starting analysing batch n. 7 out of 12
Batch end, scores so far: 
	score:	0.4910350354198184
	std:	0.27916190303512317
--------
starting analysing batch n. 8 out of 12
Batch end, scores so far: 
	score:	0.4898299427357385
	std:

# Results and conclusions
As the output of the evaluation function shows, the proposed model reaches a score of `~0.49` with the provided scoring function, well above the performance of a random baseline, estimated at ~0.19 with a standard deviation of ~0.06.
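
The notebook does not show how the random baseline was obtained. One plausible way to estimate such a baseline, sketched here under that assumption, is to put the words of each sentence back in a random order and score the result with the same metric. `random_baseline` is a hypothetical helper and the two sentences are taken from the generator example above; a real estimate would need a few thousand test sentences.

```python
import random
from difflib import SequenceMatcher

def score(s, p):
    match = SequenceMatcher(None, s, p).find_longest_match(0, len(s), 0, len(p))
    return match.size / max(len(p), len(s))

def random_baseline(sentences, seed=0):
    """Mean and standard deviation of the score when words are put back in random order."""
    rng = random.Random(seed)
    scores = []
    for sentence in sentences:
        words = sentence.split()
        rng.shuffle(words)
        scores.append(score(sentence, " ".join(words)))
    mean = sum(scores) / len(scores)
    std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5
    return mean, std

toy_sentences = [
    "age plays an important factor in a persons chances of developing the disease",
    "regular exercise can also help ease stress and related mental disturbances",
]
mean, std = random_baseline(toy_sentences)
print(f"mean={mean:.3f} std={std:.3f}")
```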