# <span style="color:red"><b>RUNS ON PC</b></span>
# Code Attention

The goal of this demo is to teach you how to code an encoder-decoder model with an attention mechanism!
Since this is just a demo, we will use generated data, the same generated data we used to demonstrate the encoder-decoder. You will tackle a real problem during the exercise; the goal here is to focus on building the model and the training loop.

## Import libraries

In [None]:
# ################################################
# We keep the same data as in yesterday's example
# vocabulary of 100 words
# fixed-length sequences
#     10 tokens in input
#      5 tokens in output
# 10_000 training examples
#  5_000 validation examples


import tensorflow as tf 
# import pathlib 
# import pandas as pd 
import os
import io

# import warnings
# warnings.filterwarnings('ignore')

# import json
from random import randint
from numpy import array
# from numpy import argmax
# from numpy import array_equal
# from tensorflow.keras.utils import to_categorical
# from tensorflow.keras.models import Model
# from tensorflow.keras.layers import Input
# from tensorflow.keras.layers import LSTM
# from tensorflow.keras.layers import Dense

## Generate data

We will generate random input and target data for the purpose of the demonstration.

In [None]:
k_voc_size_input    = 100
k_in_seq_len     = 10
k_out_seq_len    = 5

In [None]:
# generate a sequence of random integers from 2 to n_unique-1 included
def generate_sequence(length, n_unique):
	return [randint(2, n_unique-1) for _ in range(length)]

In [None]:
generate_sequence(k_in_seq_len, k_voc_size_input)

In [None]:
# prepare data
def get_dataset(n_in, n_out, cardinality, n_samples, printing=False):
  X1, y = list(), list()
  for _ in range(n_samples):
    # generate source sequence
    source = generate_sequence(n_in, cardinality)
    source_pad = source
    if printing:
      print("source:", source_pad)
    # define the target sequence
    # we add the <start> token at the beginning of each sequence
    # here we simply encode the start token with the index 0
    # (generate_sequence never produces 0, since it draws from 2 upward)
    target = source[:n_out]
    target.reverse()
    target = [0] + target
    if printing:
      print("target:", target)
    # store
    X1.append(source_pad)
    y.append(target)
  return array(X1), array(y)

In [None]:
sample_input, sample_target = get_dataset(k_in_seq_len, k_out_seq_len, k_voc_size_input, 1, True)

The data we are generating consists of random sequences of integers (they could just as well represent encoded letters, words, sentences, or anything else you can think of).

The target is built from the first elements of the input, in reversed order. We add a special token at the beginning of every target sequence for teacher forcing.
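As a small illustration of the rule described above, here is a minimal sketch in plain Python. The helper name `make_target` is hypothetical (it is not part of the notebook); it simply mirrors what `get_dataset` does for a single sequence.

```python
def make_target(source, n_out, start_token=0):
    """Reverse the first n_out elements of source and prepend the start token."""
    head = source[:n_out]              # slicing copies, so source is left untouched
    return [start_token] + head[::-1]

source = [5, 9, 3, 7, 2, 8]
print(make_target(source, 3))  # [0, 3, 9, 5]
```

The target is one element longer than `n_out` because of the start token, which is why the training loop later iterates from index 1.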

Now that we understand this, let's create the training data and validation data.

In [None]:
X_train, y_train = get_dataset(k_in_seq_len,k_out_seq_len,k_voc_size_input,10_000)
X_val, y_val     = get_dataset(k_in_seq_len,k_out_seq_len,k_voc_size_input,5_000)

Let's turn the training set into a batched dataset.

In [None]:
k_Batch_Size  = 128
train_batch = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(len(X_train)).batch(k_Batch_Size)
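To get a feel for what the batching above produces, here is a quick back-of-the-envelope check in plain Python. It assumes the default behaviour of `batch`, which keeps the last incomplete batch rather than dropping it.

```python
import math

n_samples = 10_000   # size of the training set generated above
batch_size = 128     # same value as k_Batch_Size

n_batches = math.ceil(n_samples / batch_size)
last_batch = n_samples - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 79 16
```

So one epoch should go through 79 batches, the last one containing only 16 sequences.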

## Create the encoder decoder with attention

In what follows we will code a model that reproduces the following architecture: an encoder-decoder model with Bahdanau-style attention.

![bahdanau](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Attention-encoder-decoder.drawio.png)

### Create encoder model

In this step we will define the encoder model.

The goal of the encoder is to create a representation of the input data, to extract information from the input data which will then be interpreted by the decoder model.

The encoder receives sequences as input and outputs sequences with a given depth of representation (the dimension we previously called channels).
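The first thing the encoder does to a batch of token ids is an embedding lookup. The sketch below reproduces only that step in NumPy, with a random table standing in for trained embedding weights (an assumption made for illustration), to show how the depth dimension appears.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 100, 32                      # same values as k_voc_size_input and k_n_embed
table = rng.normal(size=(vocab_size, embed_dim))     # stand-in for trained embedding weights

batch = rng.integers(2, vocab_size, size=(1, 10))    # one sequence of 10 token ids
embedded = table[batch]                              # one table row per token id
print(embedded.shape)  # (1, 10, 32)
```

Each integer is replaced by a vector of 32 values, so a (1, 10) batch becomes a (1, 10, 32) tensor before it reaches the recurrent layer.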

In [None]:
# let's start by defining the number of units needed for the embedding and
# the GRU layers

k_n_embed = 32
k_n_gru   = 32

In [None]:
class encoder_factory(tf.keras.Model):
  
  def __init__(self, in_vocab_size, embed_dim, n_units):
    super().__init__()
    # instantiate an embedding layer
    self.n_units = n_units
    self.embed = tf.keras.layers.Embedding(input_dim = in_vocab_size, output_dim = embed_dim)
    # instantiate GRU layer
    self.gru = tf.keras.layers.GRU(units = n_units, return_sequences = True, return_state = True)
  
  
  def __call__(self, input_batch):
    # each output will be saved as a class attribute so we can easily access
    # them to control the shapes throughout the demo
    self.embed_out = self.embed(input_batch)
    self.gru_out, self.gru_state = self.gru(self.embed_out)    #, initial_state=initial_state)
    return self.gru_out, self.gru_state


That's it, it does not need to be any more complicated than this. Note that we keep the sequential nature of the data (return_sequences = True) and also output the final GRU state, which will serve as the initial state for the decoder!

Let's try it out on an input to see what comes out!

In [None]:
# Quick test
encoder = encoder_factory(k_voc_size_input, k_n_embed, k_n_gru)

In [None]:
# Check that the sequence length is 10
X_train[0]

In [None]:
# Pass one sequence through the encoder (expand_dims adds the batch dimension)
encoder_output, encoder_state = encoder(tf.expand_dims(X_train[0],0))

In [None]:
# sequence length = 10
# GRU units       = 32
encoder_output

In [None]:
encoder_state

The first output has a shape of (1,10,32), which is expected because we applied the encoder to 1 input sequence of 10 elements (we chose return_sequences = True for the gru layer), with 32 channels since the gru layer has 32 units.

The second output is the gru state, which has shape (1,32): one input sequence and 32 units on the gru layer.

### Create the Attention layer

Let's now create the attention layer 

In [None]:
class Bahdanau_attention_factory(tf.keras.layers.Layer):
  def __init__(self, attention_units):
    super().__init__()

    # The attention layer contains three dense layers
    self.W1 = tf.keras.layers.Dense(units=attention_units)
    self.W2 = tf.keras.layers.Dense(units=attention_units)
    self.V  = tf.keras.layers.Dense(units=1)                 # ! must have exactly 1 unit (one score per time step)

  def __call__(self, enc_out, state):
    # the choice of name of the arguments here is not random, enc_out
    # will represent the encoder output which will be used to create
    # the attention weights and then used to create the context vector once we
    # apply the attention weights
    # the state will be a hidden state from a recurrent unit coming either
    # from the encoder at first, and from the decoder as we make further 
    # predictions
    self.W1_out = self.W1(enc_out) # shape (batch_size, 10, attention_units)

    # If you have taken a close look at the model's schema you will have noticed
    # that we are going to sum the outputs from W1 and W2, though the shapes
    # are incompatible
    # the enc_out is (batch_size,10,n_units) -> W1 -> (batch_size,10,attention_units)
    # the state is (batch_size,n_units) -> W2 -> (batch_size,attention_units)
    # thus we need to artificially add a dimension to the state along axis 1
    self.state  = tf.expand_dims(state, axis = 1) 
    self.W2_out = self.W2(self.state)                                           # shape (batch_size,1,attention_units)

    self.sum        = self.W1_out + self.W2_out                                 # shape (batch_size,10,attention_units)
    self.sum_scale  = tf.nn.tanh(self.sum)                                      # shape (batch_size,10,attention_units)

    self.score = self.V(self.sum_scale)                                         # shape (batch_size,10,1)

    self.attention_weights = tf.nn.softmax(self.score, axis=1)                  # shape (batch_size,10,1)

    self.weighted_enc_out = enc_out * self.attention_weights                    # shape (batch_size,10,n_units)

    self.context_vector = tf.reduce_sum(self.weighted_enc_out, axis=1)          # shape (batch_size,n_units)

    return self.context_vector, self.attention_weights

In [None]:
attention_layer = Bahdanau_attention_factory(8)                                 # 8 units in the dense layers W1 and W2
attention_layer(encoder_output, encoder_state)                                  # let's look at what comes out when we call it

# the first output is the context vector, shape (1, 32) because the GRU has 32 units
# the second output is the attention weights, shape (1, 10, 1): one weight per input position
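To make the shape bookkeeping of the attention layer easier to check, here is a minimal NumPy sketch of the same additive (Bahdanau-style) computation. It is an illustration under assumptions, not the layer itself: the weight matrices are random, the dense-layer biases are left out, and the function name `additive_attention` is hypothetical.

```python
import numpy as np

def additive_attention(enc_out, state, W1, W2, v):
    """enc_out: (batch, steps, units), state: (batch, units)."""
    proj_enc = enc_out @ W1                              # (batch, steps, att_units)
    proj_state = state[:, None, :] @ W2                  # (batch, 1, att_units)
    score = np.tanh(proj_enc + proj_state) @ v           # (batch, steps, 1)
    score = score - score.max(axis=1, keepdims=True)     # for numerical stability
    weights = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)  # softmax over steps
    context = (enc_out * weights).sum(axis=1)            # (batch, units)
    return context, weights

rng = np.random.default_rng(0)
batch, steps, units, att_units = 2, 10, 32, 8
enc_out = rng.normal(size=(batch, steps, units))
state = rng.normal(size=(batch, units))
W1 = rng.normal(size=(units, att_units))
W2 = rng.normal(size=(units, att_units))
v = rng.normal(size=(att_units, 1))

context, weights = additive_attention(enc_out, state, W1, W2, v)
print(context.shape, weights.shape)  # (2, 32) (2, 10, 1)
```

The softmax runs along the time axis, so for each sequence the 10 weights sum to 1 and the context vector is a weighted average of the 10 encoder outputs.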

### Create decoder

The goal of the decoder is to use the encoder output and the previous target element to predict the next target element!
This means its output is a sequence with as many elements as the target (this is where the target prefixed with the start token comes in, it will serve as input) and must have a number of channels equal to the number of possible values for target elements.

Here we can't use the standard Sequential framework to build the model because the initial state of the decoder has to be set to the encoder state.

In addition to this, the same model (with the same weights) has to be run in two ways: one for training, and one for inference (prediction on new unknown data). We'll detail the reason for this in what follows.


<img src="./decoder.png"  />

In [None]:
class decoder_factory(tf.keras.Model):
  def __init__(self, tar_vocab_size, embed_dim, n_units):
    super().__init__()
    # The decoder contains an embedding layer to play with the teacher forcing
    # input, which comes from the target data
    # A gru layer
    # A dense layer to make the predictions
    # And an attention layer
    self.embed = tf.keras.layers.Embedding(input_dim=tar_vocab_size, output_dim=embed_dim)
    self.gru = tf.keras.layers.GRU(units=n_units, return_sequences=True, return_state=True)     # ! return_state=True is important
    self.pred = tf.keras.layers.Dense(units = tar_vocab_size, activation="softmax")
    self.attention = Bahdanau_attention_factory(attention_units=n_units)

  def __call__(self, dec_in, enc_out, state):
    # first let's apply the attention layer
    self.context_vector, self.attention_weights = self.attention(enc_out,state)

    # now the decoder will ingest one sequence element from the teacher forcing
    # this will be of shape (batch_size, 1)
    self.embed_out = self.embed(dec_in)                                                        # shape (batch_size,1,embed_dim)

    # then we need to concatenate the embedding output and the context vector
    # though their shapes are incompatible
    # embed out (batch_size, 1, embed_dim)
    # context vector (batch_size, n_units) where n_units was defined in the encoder
    # so we need to add one dimension along axis 1
    self.context_vector_expanded = tf.expand_dims(self.context_vector, axis=1)                  # shape (batch_size,1,n_units)
    self.concat = tf.keras.layers.concatenate([self.embed_out, self.context_vector_expanded])   # shape (batch_size,1, embed_dim + n_units)
    
    # now we get to apply the gru layer
    self.gru_out, self.gru_state = self.gru(self.concat)                                        # shapes (batch_size, 1, n_units) and (batch_size, n_units)

    # let's reshape the gru output before feeding it to the dense layer
    self.gru_out_reshape = tf.reshape(self.gru_out, shape=(-1, self.gru_out.shape[2]))          # why a reshape here? We go from (batch_size, 1, n_units) to (batch_size, n_units)
                                                                                                # so that the dense layer outputs one probability vector per sequence,
                                                                                                # matching targ[:, t] when the loss is computed in the training loop

    # now let's make a prediction
    self.pred_out = self.pred(self.gru_out_reshape)                                             # shape (batch_size, tar_vocab_size)

    return self.pred_out, self.gru_state, self.attention_weights

Let's now try and use the decoder using the encoder output, the encoder state and the first element of the teacher forcing

In [None]:

# ! We force the output vocab size to be the same as the input vocab size
decoder = decoder_factory(tar_vocab_size=k_voc_size_input, embed_dim=k_n_embed, n_units=k_n_gru)

In [None]:
decoder_input = tf.expand_dims(tf.expand_dims(y_train[0][0], axis=0), axis=0) # the teacher forcing is
# the first element of the target sequence which corresponds to the <start> token
# we use expand dim to artificially add the batch size dimension

In [None]:
decoder_input

In [None]:
decoder(decoder_input,encoder_output, encoder_state)

Everything worked well. Now all there is to do is to apply the decoder again to the second element of the teacher forcing sequence, replacing the encoder state with the decoder state, to produce the subsequent predictions.
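Teacher forcing boils down to pairing each target element with the one that follows it. The plain-Python sketch below shows which input the decoder receives at each step and which token it is expected to predict; the helper name `teacher_forcing_pairs` is hypothetical.

```python
def teacher_forcing_pairs(target):
    """Pair each decoder input with the token it should predict."""
    return list(zip(target[:-1], target[1:]))

target = [0, 3, 9, 5]   # <start> token followed by three target tokens
for dec_in, expected in teacher_forcing_pairs(target):
    print(f"decoder input {dec_in} -> expected prediction {expected}")
```

The last target element is never used as an input, and the start token is never used as a label, which is why a target of length 6 gives 5 prediction steps.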

## Training the encoder decoder model

We are almost there, but contrary to the classic encoder decoder architecture, using attention forces us to manually code the training steps because the encoder output is used for each prediction once weighted by the attention weights.

In [None]:
optimizer = tf.keras.optimizers.Adam()
loss_function = tf.keras.losses.SparseCategoricalCrossentropy()
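For intuition about the loss chosen above, here is a minimal NumPy sketch of sparse categorical cross-entropy averaged over a batch. It is a simplified stand-in, assuming probabilities as input and ignoring the small clipping Keras applies for numerical safety; the function below is written for illustration and is not the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """labels: (batch,) integer ids, probs: (batch, vocab) with rows summing to 1."""
    picked = probs[np.arange(len(labels)), labels]   # probability given to the true class
    return float(-np.log(picked).mean())

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = sparse_categorical_crossentropy(labels, probs)
print(round(loss, 4))
```

The labels are plain integer ids (no one-hot encoding needed), which is why the training loop can pass `targ[:, t]` directly.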

In [None]:

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)

In [None]:
def train_step(inp, targ):#, enc_initial_state):
  loss = 0

  with tf.GradientTape() as tape: # we use the gradient tape to track all
  # the different operations happening in the network in order to be able
  # to compute the gradients later

    enc_output, enc_state = encoder(inp)#,enc_initial_state) # the input sequence is fed to the 
    # encoder to produce the encoder output and the encoder state

    dec_state = enc_state # the initial state used in the decoder is the encoder
    # state

    dec_input = tf.expand_dims(targ[:,0], axis=1) # the first decoder input
    # is the first sequence element of the target batch, which in our case
    # represents the <start> token for each sequence in the batch. This is
    # what we call the teacher forcing!

    # Everything is set up for the first step, now we need to loop over the
    # teacher forcing sequence to produce the predictions, we already have 
    # defined the first step (element 0) so we will loop from 1 to targ.shape[1]
    # which is the target sequence length
    
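    # t as in token (position in the target sequence)
    # targ is a batch of target sequences, shape (batch_size, 6)
    # at each iteration we look at the same position in every sequence of
    # the batch: all tokens at index 0, then all tokens at index 1, etc.
    # e.g. at t = 2 we process the tokens at index 2 of all sequences at once
    for t in range(1, targ.shape[1]):                                        # range starts at 1 because position 0 is already handled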
      # passing dec_input, dec_state and enc_output to the decoder
      # in order to produce the prediction, the new state, and the attention
      # weights which we will not need explicitly here
      pred, dec_state, _ = decoder(dec_input, enc_output, dec_state)

      # loss on token t of the targ batch
      loss += loss_function(targ[:, t], pred) # we compare the prediction
      # produced by teacher forcing with the next element of the target and
      # increment the loss

      # The new decoder input becomes the next element of the target sequence
      # which we just attempted to predict (teacher forcing)
      dec_input = tf.expand_dims(targ[:, t], 1)                      # changes at every iteration t; the value set at the last iteration is never used

  # We are in training mode
  # We have just done a forward pass
  # now we need the loss (which was incremented at each loop iteration)
  # see the loss_function variable defined above

  batch_loss = (loss / int(targ.shape[1])) # we divide the loss by the target
  # sequence's length to get the average loss across the sequence

  variables = encoder.trainable_variables + decoder.trainable_variables # here
  # we concatenate the lists of trainable variables for the encoder and the
  # decoder

  # compute the gradient based on the loss and the trainable variables
  gradients = tape.gradient(loss, variables) 

  # then update the model's  parameters
  optimizer.apply_gradients(zip(gradients, variables)) 

  return batch_loss

In [None]:
import time
EPOCHS = 10

for epoch in range(EPOCHS):
  start = time.time()

  total_loss = 0

  for (batch, (inp, targ)) in enumerate(train_batch):
    batch_loss = train_step(inp, targ)
    total_loss += batch_loss

    if batch % 10 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch, batch_loss.numpy()))
  
  # saving (checkpoint) the model every epoch
  checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss))
  print('Time taken for 1 epoch {} sec'.format(time.time() - start))

  #classic encoder input
  enc_input = X_val

  # the first decoder input is the special token 0
  dec_input = tf.zeros(shape=(len(X_val),1))

  
  # we compute once and for all the encoder output and the encoder
  # state (a GRU has a single hidden state, no separate h and c states)
  enc_out, enc_state = encoder(enc_input) #, initial_state)

  # The encoder state will serve as the initial state for the decoder
  dec_state = enc_state

  pred = []  # we'll store the predictions in here

  # we loop over the expected length of the target, but actually the loop can run
  # for as many steps as we wish, which is the advantage of the encoder decoder
  # architecture
  
  # Here we run inference on the validation set
  # We start from the <start> token and then loop
  for i in range(y_val.shape[1]-1):
    dec_out, dec_state, attention_w = decoder(dec_input, enc_out, dec_state)
    # the decoder state is updated and we get the first prediction probability 
    # vector
    decoded_out = tf.expand_dims(tf.argmax(dec_out, axis=-1), axis=1)
    # we decode the softmax vector into an index
    pred.append(tf.expand_dims(dec_out,axis=1)) # update the prediction list
    dec_input = decoded_out # the previous pred will be used as the new input

  pred = tf.concat(pred, axis=1).numpy()
  print("\n val loss :", loss_function(y_val[:,1:],pred),"\n") # we can now display the loss on the validation set

Nice! The training is over, and it looks as though the model performs really well both on train and validation sets!

## Make predictions with the inference model

To make predictions on the validation set, we cannot use teacher forcing: the model has to rely on its own predictions!
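The idea of feeding the model its own predictions can be sketched in a few lines of plain Python. This is a toy illustration only: `toy_step` is a hypothetical stand-in for the decoder (it has no state and no attention), and `greedy_decode` is a name chosen for this example.

```python
def greedy_decode(step_fn, start_token, n_steps):
    """Feed the model's own argmax prediction back in as the next input."""
    token, decoded = start_token, []
    for _ in range(n_steps):
        probs = step_fn(token)                                   # one probability per vocabulary entry
        token = max(range(len(probs)), key=probs.__getitem__)    # argmax
        decoded.append(token)
    return decoded

# Hypothetical stand-in for the decoder: always favours (previous token + 1) mod 4
def toy_step(prev_token, vocab_size=4):
    probs = [0.1] * vocab_size
    probs[(prev_token + 1) % vocab_size] = 0.7
    return probs

print(greedy_decode(toy_step, start_token=0, n_steps=5))  # [1, 2, 3, 0, 1]
```

With the real decoder, each step would also receive the encoder output and the previous decoder state, but the feedback loop has the same structure.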

In [None]:
# the validation set has 5000 examples
# we do the same thing as before


enc_input = X_val # 5000 input sequences
#classic encoder input

dec_input = tf.zeros(shape=(len(X_val),1))                 # 5000 start tokens
# the first decoder input is the special token 0

#initial_state = encoder.state_initializer(len(X_val))

enc_out, enc_state = encoder(enc_input)#, initial_state)
# we compute once and for all the encoder output and the encoder
# state (a GRU has a single hidden state, no separate h and c states)

dec_state = enc_state
# The encoder state will serve as the initial state for the
# decoder

pred = []  # we'll store the predictions in here

# we loop over the expected length of the target, but actually the loop can run
# for as many steps as we wish, which is the advantage of the encoder decoder
# architecture
for i in range(y_val.shape[1]-1):
  dec_out, dec_state, attention_w = decoder(dec_input, enc_out, dec_state)
  # the decoder state is updated and we get the first prediction probability 
  # vector
  decoded_out = tf.expand_dims(tf.argmax(dec_out, axis=-1), axis=1) # argmax picks the predicted word, which we store
  # we decode the softmax vector into an index
  pred.append(decoded_out) # update the prediction list
  dec_input = decoded_out # the previous pred will be used as the new input

pred = tf.concat(pred, axis=-1).numpy()
for i in range(10):
  print("pred:", pred[i,:].tolist())
  print("true:", y_val[i,:].tolist()[1:])
  print("\n")

The results do not look bad at all, almost perfect actually! This is a clear improvement over the plain encoder-decoder! Attention must be really powerful!
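Rather than judging the predictions by eye, one could measure how often they match the targets. The NumPy sketch below is one possible way to do it; the helper `sequence_accuracy` is hypothetical and the arrays are made-up toy values, not results from the model.

```python
import numpy as np

def sequence_accuracy(pred, true):
    """Return (token-level accuracy, exact-sequence accuracy)."""
    pred, true = np.asarray(pred), np.asarray(true)
    matches = pred == true
    return float(matches.mean()), float(matches.all(axis=1).mean())

pred = np.array([[3, 9, 5], [4, 4, 7]])   # toy predictions
true = np.array([[3, 9, 5], [4, 2, 7]])   # toy targets
token_acc, seq_acc = sequence_accuracy(pred, true)
print(token_acc, seq_acc)
```

On the notebook's data, the call would presumably look like `sequence_accuracy(pred, y_val[:, 1:])`, dropping the start token from the targets as in the print loop above.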

The fact that the model reuses the encoder output at each step with different weights helps it achieve better predictions in a shorter amount of time (that is, in fewer epochs).

I hope you found this demonstration useful! Now it is time for you to apply what you have learned to a real world automatic translation problem!