# <span style="color:red"><b>RUNS ON PC</b></span>
# Code Encoder Decoder

The goal of this demo is to teach you how to code an encoder decoder model!
Since this is just a demo, we will use generated data; you'll be able to tackle the real problem during the exercise. The goal here is to focus on building the model and the training loop.

## Import libraries

In [None]:
# Import TensorFlow and the other libraries
import pathlib
import os
import io
import json

import tensorflow             as tf
import pandas                 as pd

from random                   import randint
from numpy                    import array
from numpy                    import argmax
from numpy                    import array_equal
from tensorflow.keras.utils   import to_categorical
from tensorflow.keras.models  import Model
from tensorflow.keras.layers  import Input
from tensorflow.keras.layers  import LSTM
from tensorflow.keras.layers  import Dense
from tensorflow.keras.utils   import plot_model

import warnings
warnings.filterwarnings('ignore')




## Generate data

We will generate random input and target data for the purpose of the demonstration.

In [None]:
input_vocab_size = 100           # number of words in the vocabulary
input_seq_len    = 10            # 10 words in the source (FR) sentences
target_seq_len   = 5             # 5 words in the target (US) sentences

In [None]:
# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(1, n_unique-1) for _ in range(length)]

In [None]:
generate_sequence(input_seq_len, input_vocab_size)

In [None]:
# the targets must not be purely random
# we act as if the sentences had meaning
# Source: we draw 10 ints
# Target: we take the first 5 ints of the source and reverse them

# prepare data for the LSTM
def get_dataset(n_in, n_out, cardinality, n_samples, printing=False):
  X1, X2, y = list(), list(), list()
  for _ in range(n_samples):
    # generate the source sequence
    source = generate_sequence(n_in, cardinality)
    if printing:
      print("source:", source)
    # define the target: the first n_out source tokens, reversed
    target = source[:n_out]
    target.reverse()
    if printing:
      print("target:", target)
    # create the padded (shifted) target fed to the decoder: this is teacher forcing
    # the leading 0 is the start token
    # note that the last token of the target is dropped
    # that is fine, because the decoder is trained to generate the next token
    target_in = [0] + target[:-1]
    if printing:
      print("padded target:", target_in)
    # store
    X1.append(source)
    X2.append(target_in)
    y.append(target)
  return array(X1), array(X2), array(y)

In [None]:
source, padded_target, target = get_dataset(input_seq_len, target_seq_len, input_vocab_size, 1, True)

The data we are generating consists of random sequences of numbers (they could very well represent encoded letters, words, sentences or anything else you can think of).

The target is built using the first elements of the input in reversed order.

We also create a padded target sequence for teacher forcing. Remember: with teacher forcing, the previous element of the true target is given to the decoder as information to predict the next element of the target.
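
To make the shift concrete, here is a minimal sketch in plain Python, independent of the `get_dataset` function used in this notebook. The helper name `make_teacher_forcing_pair` and the sample source sequence are assumptions chosen for illustration.

```python
def make_teacher_forcing_pair(source, n_out, start_token=0):
    """Build (decoder_input, target) from a source sequence.

    The target is the first n_out source tokens in reversed order.
    The decoder input is the target shifted right by one position,
    with start_token placed in front.
    """
    target = list(reversed(source[:n_out]))
    decoder_input = [start_token] + target[:-1]
    return decoder_input, target


# Hypothetical source sequence of 10 tokens
source = [62, 21, 89, 47, 91, 5, 12, 33, 70, 8]
decoder_input, target = make_teacher_forcing_pair(source, n_out=5)
print("target       :", target)         # [91, 47, 89, 21, 62]
print("decoder input:", decoder_input)  # [0, 91, 47, 89, 21]
```

At every position the decoder input holds the token that precedes the one to be predicted, so the last target token never appears as an input.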

Now that we understand this, let's create the training data and validation data.

In [None]:
X_train, padded_y_train, y_train  = get_dataset(input_seq_len,target_seq_len,input_vocab_size,10000)
X_val, padded_y_val, y_val        = get_dataset(input_seq_len,target_seq_len,input_vocab_size,5000)

## Create encoder model

In this step we will define the encoder model.

The goal of the encoder is to create a representation of the input data, to extract information from the input data which will then be interpreted by the decoder model.

The encoder receives sequences as input and produces a representation with a given depth (the dimension we previously called channels).

In [None]:
# Defining the encoder architecture is the easy part
# The architecture is the same for training and for inference
# we work on ints
# the ints go through a word embedding layer, then through an LSTM layer
# and that's all

# we fix the hyperparameters

# let's start by defining the number of units needed for the embedding and
# the lstm layers

n_embed = 32
n_lstm  = 16     # a power of 2

In [None]:
# the encoder architecture is sequential, but here we practice wiring it by hand
# We are used to writing things like:
# model = Sequential([
#   vectorize_layer,                                            # this layer encodes the strings as sequences of int
#   Embedding(vocab_size, embedding_dim, name="embedding"),     # the embedding layer; input_dim must equal the vocabulary size + 1 (because of the zero padding)
#   SimpleRNN(units=64, return_sequences=True),                 # maintains the sequential nature
#   SimpleRNN(units=32, return_sequences=False),                # returns the last output
#   Dense(16, activation='relu'),                               # a dense layer
#   Dense(1, activation="sigmoid")                              # the prediction layer
# ])


# We must declare an Input instance that fixes the input size
# input_seq_len = 10 words per sentence
encoder_input = tf.keras.Input(shape=(input_seq_len,), name="Encoder_In")
encoder_embed = tf.keras.layers.Embedding(input_dim = input_vocab_size, output_dim = n_embed, name="Encoder_WE")
# remember that we do not care about the output BUT we want h and C => return_state=True
encoder_lstm = tf.keras.layers.LSTM(n_lstm, return_state=True, name="Encoder_LSTM")



# WIRING THE LAYERS ###############
# Now that the layers are declared, we wire the outputs and inputs between layers by hand
# Here the flow is sequential, so it is very easy
encoder_embed_output = encoder_embed(encoder_input)
encoder_output       = encoder_lstm(encoder_embed_output)




# we finish with the Model class and specify the inputs and outputs
encoder = tf.keras.Model(inputs = encoder_input, outputs = encoder_output)


plot_model(encoder)

That's it; it does not need to be any more complicated than this. Note that we did not preserve the sequential nature of the data (we did not set `return_sequences=True`). Instead, the encoder outputs its final hidden state and cell state, which will serve as the initial state of the decoder!
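
To see why an LSTM that does not return sequences yields three values, here is a minimal sketch of a single-unit LSTM in plain Python. It follows the standard LSTM equations; the weights are arbitrary illustrative values, and the names `lstm_step` and `run_lstm` are hypothetical, not part of Keras.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One step of a single-unit LSTM with a scalar input x.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate values
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


def run_lstm(sequence, w):
    """Mimic return_sequences=False, return_state=True: (output, h, c)."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h, h, c  # the "output" is simply the last hidden state


weights = {  # arbitrary values for illustration, not trained
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, -0.2, 1.0),
    "output": (0.8, 0.4, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}
output, h, c = run_lstm([0.2, -0.1, 0.7], weights)
print(output == h)  # True: the output duplicates the final hidden state
```

Only the last two values carry the summary of the whole input sequence that is passed on to the decoder.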

Let's try it out on an input to see what comes out!

In [None]:
X_train[0]
print(X_train[0])

# We cannot pass X_train[0] directly
# The model expects a batch
# hence the expand_dims
tf.expand_dims(X_train[0],0)


In [None]:
# we check that everything works when we pass it a tensor
encoder(tf.expand_dims(X_train[0],0))


# size 16 because n_lstm = 16
# we get, in order:
#   the last output
#   the hidden state  (= the last output)
#   the cell state

## Create decoder

* The goal of the decoder is to use the encoder output and the previous target element to predict the next target element.
* This means its output is a sequence with as many elements as the target.
    * This is where the padded target comes in: it serves as the decoder input.
* Its output must have a number of channels equal to the number of possible values for target elements.
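
As a rough illustration of the last point, the sketch below uses plain Python and a toy vocabulary of 4 tokens (an assumption, the demo uses 100). Each time step produces one score per vocabulary entry, a softmax turns the scores into probabilities, and the argmax gives the predicted token.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up decoder scores: 3 time steps, 4 possible tokens per step
logits_per_step = [
    [2.0, 0.1, 0.1, 0.1],
    [0.0, 0.0, 3.0, 0.5],
    [0.2, 1.5, 0.2, 0.2],
]
probs_per_step = [softmax(row) for row in logits_per_step]
predicted_tokens = [row.index(max(row)) for row in probs_per_step]
print(predicted_tokens)  # [0, 2, 1]
```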

We can't use the standard **Sequential** framework to build the model because the initial state of the decoder has to be set to the encoder states.
* In addition, two versions of the same model (with the same weights) have to be prepared:
    * one for training
    * one for inference (prediction on new, unknown data).
        


### Decoder for training

* Training the decoder requires that we use the teacher forcing mechanism
* We provide the correct answer from the previous element to predict the next element of the output sequence.

In [None]:
# quick test
# we retrieve elements 1 and 2
# these are the h and C states of the encoder
encoder_output[1:]

In [None]:
# the decoder has 2 operating modes
#     training  = teacher forcing
#     inference = one token at a time



# LAYER DECLARATIONS ###############

# Here we wire the layers for training
# in training target_seq_len = 5 because the target sentences have 5 words
# this input goes through the word embedding, the LSTM, then the dense layer before the output

# The input is a vector of 5 tokens because we want target sentences of 5 words
decoder_input = tf.keras.Input(shape=(target_seq_len,), name="Decoder_Input") # target_seq_len = 5

# We add a word embedding layer because the LSTM cannot take raw tokens directly
# It outputs a 5 x 32 matrix
decoder_embed = tf.keras.layers.Embedding(input_dim = input_vocab_size, output_dim = n_embed, name="Decoder_WE") # input_dim = 100 (vocabulary size), n_embed = 32

# the 5 x 32 matrix goes into the LSTM
# 16 units for the LSTM
# return_sequences and return_state are both True
# so it outputs 5 x 16 + hidden state (16) + cell state (16)
decoder_lstm = tf.keras.layers.LSTM(n_lstm, return_sequences=True, return_state=True, name="Decoder_LSTM") # n_lstm = 16




# the 5 x 16 output goes into a dense layer
# 100 neurons and a softmax
# There are 100 neurons because that is the vocabulary size, and the softmax designates the most probable word

# TF lets us apply a dense layer to a matrix: it is applied to each row
# Each row yields a 1 x 100 vector (5 x 100 overall)
# Softmax => each row sums to 1
# In practice we look at the argmax of each row
decoder_pred = tf.keras.layers.Dense(input_vocab_size, activation="softmax", name="Decoder_Softmax") # input_vocab_size = 100 words in the vocabulary





# WIRING THE LAYERS ###################



# TEACHER FORCING input
# This is where we indicate that the target must be reused as input
# the decoder input is actually the padded target we created earlier, remember
# if target is              :    [91, 47, 89, 21, 62]
# the decoder input will be : [0, 91, 47, 89, 21]
# teacher forcing happens here
decoder_embed_output = decoder_embed(decoder_input)
# decoder_embed = Embedding(input_dim=vocab_size, output_dim=embedding_dim, name="decoder_embedding")
# decoder_embed_output = decoder_embed(decoder_input)

# Note the initial_state, which is set to encoder_output[1:]
# encoder_output[1:] => we retrieve h and C
decoder_lstm_output, _, _ = decoder_lstm(decoder_embed_output, initial_state = encoder_output[1:])

# in the step described above the decoder receives the encoder state as its
# initial state.
decoder_output = decoder_pred(decoder_lstm_output)
# then the dense layer will convert the vector representation for each element
# in the sequence into a probability distribution across all possible tokens
# in the vocabulary!


# Note that the model takes 2 inputs: encoder_input and decoder_input
encoder_decoder = tf.keras.Model(inputs = [encoder_input, decoder_input], outputs = decoder_output)








plot_model(encoder_decoder, show_shapes=True, show_dtype=True, show_layer_names=True)

Let's try out the decoder model on some input sequences!

In [None]:
# the 100 values of each row sum to 1
# the token with the highest value is the prediction
encoder_decoder([tf.expand_dims(X_train[0],0),tf.expand_dims(padded_y_train[0],0)])

### Decoder for inference (prediction)

Contrary to the training case, for inference we do not have access to the target nor the padded target. The decoder input will be made out of a sequence starting with $0$ which is the special start token in our case, then followed by the predictions of the decoder as they come.
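
The feedback loop described above can be sketched in plain Python before wiring it with Keras. In this sketch, `greedy_decode` and `toy_step` are hypothetical names: `toy_step` is a stand-in for the real decoder and simply replays a remembered list of tokens in reverse, so that the loop has something to predict.

```python
def greedy_decode(step_fn, initial_state, max_len, start_token=0):
    """Generate max_len tokens, feeding each prediction back as the next input.

    step_fn(token, state) must return (scores, new_state), where scores is a
    list holding one score per vocabulary entry.
    """
    token, state, output = start_token, initial_state, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = scores.index(max(scores))  # argmax over the vocabulary
        output.append(token)
    return output


def toy_step(token, state):
    """Hypothetical stand-in for the decoder: pops the next remembered token."""
    remaining = list(state)
    wanted = remaining.pop() if remaining else 0
    scores = [1.0 if i == wanted else 0.0 for i in range(10)]
    return scores, remaining


# The "encoder state" here is just the first tokens of a made-up source
print(greedy_decode(toy_step, initial_state=[3, 7, 5], max_len=3))  # [5, 7, 3]
```

The number of iterations is a free parameter of the loop, which is why this scheme is not tied to a fixed output length.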

In [None]:
# We change the wiring but not the layers
# So we only declare new inputs
# and we keep the layers


# LAYER DECLARATIONS ###############

# At inference the input is no longer 5 tokens long but 1 token at a time
# one token => vector of size 1
decoder_input_inf = tf.keras.Input(shape=(1,), name = "Inference_in")


# On the first step we feed the decoder with the h and C states of the encoder
# for the following steps, they will be the h and C states from the decoder itself
# since the decoder input sequence is unknown we will have to predict step by step using a loop
decoder_state_input_h = Input(shape=(n_lstm,), name="Hidden_state")
decoder_state_input_c = Input(shape=(n_lstm,), name="Cell_state")
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]







# WIRING THE LAYERS ###################


decoder_embed_output = decoder_embed(decoder_input_inf)
# the decoder input here is of shape 1 because we will feed the elements in the
# sequence one by one


# Note the initial_state, which is set to decoder_states_inputs
# the final output, after the dense layer, is 1 x 100
decoder_outputs, state_h, state_c = decoder_lstm(decoder_embed_output, initial_state=decoder_states_inputs)
# the lstm layer works in the same way, the output from the embedding is used
# and the decoder state is used as described above

decoder_states = [state_h, state_c]
# we store the lstm states in a specific object as we'll have to use them as
# initial state for the next inference step

decoder_outputs = decoder_pred(decoder_outputs)
# the lstm output is then converted to a probability distribution over the
# target vocabulary

# Finally we wrap up the model building by setting up the inputs and outputs
# In inference mode the model outputs 2 things
#     - its predictions
#     - its internal states: these are the states it will reuse to translate the next word
decoder_inf = Model(inputs   = [decoder_input_inf, decoder_states_inputs],
                    outputs  = [decoder_outputs, decoder_states])

plot_model(decoder_inf, show_shapes=True, show_dtype=True, show_layer_names=True)

Here is an example of how this version of the model produces predictions. We'll need to write a loop for this!

In [None]:
# we test what happens when we pass a "start" token
enc_input = tf.expand_dims(X_train[0],0)
# classic encoder input


# we pass a 0, i.e. a start token
dec_input = tf.zeros(shape=(1,1))
# the first decoder input is the special token 0

enc_out, state_h_inf, state_c_inf = encoder(enc_input)
# we compute once and for all the encoder output and the encoder
# h state and c state

# we will use dec_state to initialize the decoder later
dec_state = [state_h_inf, state_c_inf]

# we'll store the predictions in here
pred = []  

# we loop over the expected length of the target, but actually the loop can run
# for as many steps as we wish
# which is the advantage of the encoder decoder
# architecture
# target_seq_len = 5
for i in range(target_seq_len):
  dec_out, dec_state = decoder_inf([dec_input, dec_state])
  print(dec_out)
  print("----------------")

  # the decoder state is updated and we get the first prediction probability
  # vector
  decoded_out = tf.argmax(dec_out, axis=-1)
  # we decode the softmax vector into an index
  pred.append(decoded_out) # update the prediction list
  dec_input = decoded_out # the previous pred will be used as the new input

pred

## Training the encoder decoder model

We are almost there! The difficult part was building the model; the training step will be super easy.
All we have to do is `compile` the model to assign a loss function, then call the `fit` method!
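
Since the targets are sequences of integers, a sparse categorical cross-entropy is a natural fit. The sketch below recomputes that loss and the matching accuracy by hand in plain Python, on made-up probabilities, to show what the two quantities measure. It is an illustration of the formula, not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(labels, probs):
    """Mean of -log(p[label]) over all positions.

    labels: list of integer class ids; probs: list of probability vectors.
    """
    losses = [-math.log(p[y]) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)


def sparse_categorical_accuracy(labels, probs):
    """Fraction of positions where the argmax equals the integer label."""
    hits = [p.index(max(p)) == y for y, p in zip(labels, probs)]
    return sum(hits) / len(hits)


labels = [2, 0, 1]
probs = [
    [0.1, 0.1, 0.8],  # correct and confident
    [0.5, 0.3, 0.2],  # correct, less confident
    [0.6, 0.3, 0.1],  # wrong
]
loss = sparse_categorical_crossentropy(labels, probs)
accuracy = sparse_categorical_accuracy(labels, probs)
print(round(loss, 4), round(accuracy, 4))

# Baseline: a uniform guess over 100 tokens costs -log(1/100) per position
uniform_loss = sparse_categorical_crossentropy([0], [[0.01] * 100])
print(round(uniform_loss, 4))
```

The uniform baseline gives a reference point: a model whose loss stays near that value has learned nothing about the task.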

In [None]:
# We need to train and evaluate the model
# The loss is shared by both parts (encoder and decoder)
# "Sparse" because the targets are given as vectors of integers

# SparseCategoricalCrossentropy is used when class labels are provided as integers
# rather than one-hot vectors, and each example belongs to exactly one class.
# With three classes, the labels must be the integers 0, 1 and 2,
# not the one-hot vectors [1, 0, 0], [0, 1, 0], [0, 0, 1].
# The matching metric is SparseCategoricalAccuracy.

encoder_decoder.compile(
    optimizer="Adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),    # sparse: the labels are integers
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)

# We monitor the accuracy of the model
# It reaches about 70% accuracy
# That is really not bad: guessing at random, it would have about 1 chance in 100 of producing the right word


In [None]:
encoder_decoder.fit(x=[X_train,padded_y_train],y=y_train,epochs=50, validation_data=([X_val,padded_y_val],y_val))

Nice! The training is over, and it looks as though we could have continued to train the model even longer since it has not yet started to overfit!
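
One way to check this kind of claim is to look at the validation loss per epoch. In Keras, `fit` returns a `History` object whose `history` dictionary holds one list per tracked quantity, including `val_loss` when validation data is provided. The sketch below works on such a list; the helper names and the loss values are made up for illustration and are not results from this notebook.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1


def still_improving(val_losses, patience=3):
    """True if the best validation loss occurred within the last `patience` epochs."""
    return len(val_losses) - best_epoch(val_losses) < patience


# Made-up validation losses, for illustration only
improving = [2.9, 2.1, 1.6, 1.3, 1.1, 1.0]
overfitting = [2.9, 2.1, 1.6, 1.7, 1.9, 2.2]
print(still_improving(improving), still_improving(overfitting))  # True False
```

If the validation loss is still at its minimum in the final epochs, training longer is likely to help; if the minimum lies well in the past, the model has started to overfit.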

In [None]:
# plot_model()
# 3 and 4 are the h state and the C state

## Make predictions with the inference model

You may have noticed that we used the exact same layers for the training model and the inference model, so they share the same weights. The difference is that the inference model can be used on new data, where teacher forcing is no longer possible!

In [None]:
# We test on the validation set because we know the expected output

enc_input = X_val
#classic encoder input

dec_input = tf.zeros(shape=(len(X_val),1))
# the first decoder input is the special token 0


# We pass all of X_val through the encoder
# We retrieve h and C
enc_out, state_h_inf, state_c_inf = encoder(enc_input)
# we compute once and for all the encoder output and the encoder
# h state and c state


dec_state = [state_h_inf, state_c_inf]
# The encoder h state and c state will serve as initial states for the
# decoder

pred = []  # we'll store the predictions in here



# 5 words => 5 iterations
# we loop over the expected length of the target, but actually the loop can run
# for as many steps as we wish, which is the advantage of the encoder decoder
# architecture
for i in range(target_seq_len):
  # ! on the first iteration we pass the encoder states as dec_state
  # from the 2nd iteration on we pass the decoder's own dec_state, no longer the encoder's
  dec_out, dec_state = decoder_inf([dec_input, dec_state])
  # the decoder state is updated and we get the first prediction probability vector
  decoded_out = tf.argmax(dec_out, axis=-1)
  # we decode the softmax vector into an index
  pred.append(decoded_out) # update the prediction list
  dec_input = decoded_out # the previous pred will be used as the new input


# Note that once it starts making mistakes, it gets the rest of the sequence wrong
# If it is wrong from the very start, everything after is wrong too
pred = tf.concat(pred, axis=-1).numpy()
for i in range(10):
  print("pred:", pred[i,:])
  print("true:", y_val[i,:])
  print("\n")

The results do not look so bad. However, once the model makes a mistake on one of the predictions, the rest of the sequence tends to be predicted poorly as well!

This behaviour can be explained in the following way: the information taken from the encoder is only taken into account directly in the first decoding step, which means that everything that happens after this step depends on what information the decoder feeds itself from that point onwards.
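
A simple way to quantify this effect is to measure accuracy separately at each position of the output sequence, along with the share of sequences that are correct from start to finish. The sketch below does this in plain Python on made-up predictions; the function names are hypothetical, and the same functions should also accept the `pred` and `y_val` arrays built above, since they only rely on indexing.

```python
def per_position_accuracy(preds, targets):
    """Accuracy at each time step, computed across all sequences."""
    n = len(preds)
    steps = len(preds[0])
    return [sum(p[t] == y[t] for p, y in zip(preds, targets)) / n
            for t in range(steps)]


def exact_match_rate(preds, targets):
    """Fraction of sequences predicted correctly from start to finish."""
    return sum(list(p) == list(y) for p, y in zip(preds, targets)) / len(preds)


# Made-up predictions, for illustration only
targets = [[9, 4, 7], [3, 3, 1], [5, 8, 2], [6, 1, 1]]
preds   = [[9, 4, 7], [3, 2, 4], [5, 8, 9], [1, 7, 2]]
print(per_position_accuracy(preds, targets))  # [0.75, 0.5, 0.25]
print(exact_match_rate(preds, targets))       # 0.25
```

An accuracy that drops from one position to the next is the signature of errors propagating through the decoder's own feedback loop.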

Nevertheless, the encoder decoder framework has made major advances possible, especially for predicting sequences of arbitrary length. Tomorrow we'll learn about a solution that deals with the problem of predictions worsening over the sequence!

I hope you found this demonstration useful! Now it is time for you to apply what you have learned to a real world automatic translation problem!