In [1]:
import numpy as np
import pandas as pd
import operator

This is an introduction to basic sequence-to-sequence learning using a long short-term memory (LSTM) module.

Given a string of characters representing a math problem, such as "3141+42", we would like to generate the string of characters representing the correct solution, "3183". Our network will learn how to do basic mathematical operations.

The important part is that we will not first use our human intelligence to break the string up into integers and a mathematical operator. We want the computer to figure all that out by itself.

Each math problem is an input sequence: a list of digits in {0,...,9} and math operation symbols.
The result of the operation ("$3141+42$" $\rightarrow$ "$3183$") is the sequence to decode.
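
To make the input representation concrete, here is a minimal sketch of how a problem string can be turned into one one-hot vector per character. The vocabulary and the helper name `one_hot_encode` are illustrative assumptions, not the notebook's own code (the `Seq2seq` class further down builds its vocabulary from the data instead).

```python
import numpy as np

# Hypothetical fixed vocabulary: the ten digits plus the five operator symbols.
VOCABULARY = sorted("0123456789+-*/%")
CHAR_INDEX = {char: index for index, char in enumerate(VOCABULARY)}

def one_hot_encode(expression, max_length):
    """Return a (max_length, vocabulary_size) array with a single 1 per character."""
    encoded = np.zeros((max_length, len(VOCABULARY)), dtype="float32")
    for t, char in enumerate(expression):
        encoded[t, CHAR_INDEX[char]] = 1.0
    return encoded

encoded = one_hot_encode("3141+42", max_length=7)
print(encoded.shape)
```

Shorter expressions simply leave their trailing rows at zero, which acts as padding.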

**math_operators** is the set of $5$ operations we are going to use to build our input sequences.<br/>
The math_expressions_generation function uses them to generate a large set of examples.

In [2]:
def math_expressions_generation(n_samples=1000, n_digits=3, invert=True):
    X, Y = [], []
    math_operators = {
        '+': operator.add, 
        '-': operator.sub,
        '*': operator.mul,
        '/': operator.truediv,
        '%': operator.mod
    }
    for i in range(n_samples):
        a, b = np.random.randint(1, 10**n_digits, size=2)
        op = np.random.choice(list(math_operators.keys()))
        res = math_operators[op](a, b)
        x = "".join([str(elem) for elem in (a, op, b)])
        if invert is True:
            x = x[::-1]
        y = "{:.5f}".format(res) if isinstance(res, float) else str(res)
        X.append(x)
        Y.append(y)
    return X, Y

In [3]:
X, y = math_expressions_generation(n_samples=int(1e5), n_digits=3, invert=True)
for X_i, y_i in list(zip(X, y))[:20]:
    print(X_i[::-1], '=', y_i)

483*920 = 444360
671/833 = 0.80552
489-55 = 434
653%684 = 653
72%231 = 72
78-866 = -788
248%221 = 27
788-273 = 515
287%598 = 287
477*786 = 374922
380*163 = 61940
245+975 = 1220
884*924 = 816816
310-765 = -455
823-694 = 129
57%587 = 57
335*734 = 245890
517%588 = 517
732-581 = 151
934-289 = 645


# I - Sequence to sequence model

### The Seq2Seq architecture
Two LSTMs: an encoder and a decoder

<img src="../images/teacher_forcing_train.png" style="width: 600px;" />

## Training with teacher forcing
   - Build the Seq2Seq model for training
   - Example:
        - Input sequence "94+8" is given to the encoding LSTM
        - True previous answers "1", "0", "2" are given to the decoder LSTM
            - Helps the decoder predict the next token well during training
   - We advise using the Keras functional API here
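
The example above can be sketched in plain Python: with teacher forcing, the decoder input is the true answer shifted by one position relative to the target. The helper name `teacher_forcing_pairs` is hypothetical; the GO (`=`) and EOS (`\n`) characters are the ones this notebook uses.

```python
GO, EOS = "=", "\n"  # same special tokens as the notebook's Seq2seq class

def teacher_forcing_pairs(answer):
    """Return (decoder_input, decoder_target) token lists for one true answer."""
    wrapped = GO + answer + EOS
    decoder_input = list(wrapped[:-1])   # what the decoder is fed at each step
    decoder_target = list(wrapped[1:])   # what it must predict, one step ahead
    return decoder_input, decoder_target

decoder_input, decoder_target = teacher_forcing_pairs("102")
for fed, expected in zip(decoder_input, decoder_target):
    print(repr(fed), "->", repr(expected))
```

The notebook's `construct_dataset` achieves the same alignment by writing each character at position `t` of the input and position `t - 1` of the target.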

### encoder
   - Define a layer **encoder_inputs** of shape $(None, $**self.encoder_vocabulary_size**$)$ 
   - Define the encoder LSTM layer before connecting it
        - Call it **encoder_lstm**
        - Use **return_state** param so it returns its last **state_h** and **state_c**
           - Have to be passed to the decoder LSTM afterwards to connect the $2$ LSTMs
   - Connect **encoder_lstm** to **encoder_inputs**
       - Get **encoder_lstm**'s last **state_h** and **state_c** node variables
           - Stack them in an **encoder_states** $=$ $[$**state_h**$,  $**state_c**$]$ variable
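
As a rough illustration of what the last **state_h** and **state_c** are, the following NumPy sketch runs a standard LSTM recurrence over one input sequence and keeps only the final states. The weights are random and untrained, the function name `lstm_last_state` is hypothetical, and the gate ordering inside the weight matrices is an arbitrary choice for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_state(inputs, kernel, recurrent_kernel, bias):
    """Run an LSTM over inputs of shape (timesteps, input_dim); return the last (h, c)."""
    units = recurrent_kernel.shape[0]
    h, c = np.zeros(units), np.zeros(units)
    for x_t in inputs:
        z = x_t @ kernel + h @ recurrent_kernel + bias
        i, f, g, o = np.split(z, 4)          # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
timesteps, input_dim, units = 7, 15, 64
inputs = rng.normal(size=(timesteps, input_dim))
kernel = rng.normal(scale=0.1, size=(input_dim, 4 * units))
recurrent_kernel = rng.normal(scale=0.1, size=(units, 4 * units))
bias = np.zeros(4 * units)

state_h, state_c = lstm_last_state(inputs, kernel, recurrent_kernel, bias)
encoder_states = [state_h, state_c]  # what would initialise the decoder
print(state_h.shape, state_c.shape)
```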
      
### decoder
   - Define a layer **decoder_inputs** of shape $(None, $**self.decoder_vocabulary_size**$)$ 
   - Define the decoder LSTM layer before connecting it
        - Call it **decoder_lstm**
        - Pass the encoder's last $[$**state_h**$, $**state_c**$]$ to the decoder's **initial_state** argument to connect the two LSTMs
        - Use the **return_sequences** param so the decoder returns all the $h_{t}^{dec}$
            - We need them to compute the predictions using the $h_{t}^{dec}$
        - Use the **return_state** param so the decoder also returns its last **state_h** and **state_c**
            - We ignore those now but we will need them for inference
   - Connect **decoder_lstm** to **decoder_inputs**
       - Get the $h_{t}^{dec}$ hidden layers in a **decoder_all_hdec** node variable
       - Ignore **decoder_lstm**'s last **state_h** and **state_c** returned

### output
   - At this point, all the $h_{t}^{dec}$ are in a **decoder_all_hdec** node
      - Ready to be used to perform a token prediction for all timesteps
   - Define a Dense layer. Call it **decoder_dense**
       - Give it a softmax activation and **self.decoder_vocabulary_size** output dimensionality
   - Connect **decoder_dense** to **decoder_all_hdec**
     - Get the $\hat{y}^t$ predictions in a **decoder_outputs** node variable
     - Each $h_{t}^{dec}$ has been mapped to a $($**self.decoder_vocabulary_size**$,1)$ probability distribution over the next token
   - **decoder_outputs** should be of shape $(batch,$ **self.max_decoder_sequence_length**$,$ **self.decoder_vocabulary_size**$)$
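
To check the expected output shape, here is a small NumPy sketch of what a softmax Dense layer does to every $h_{t}^{dec}$. The arrays are random stand-ins for **decoder_all_hdec** and for the layer's weights, so the numbers themselves carry no meaning; only the shapes and the normalisation do.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, latent_dim, vocabulary_size = 2, 11, 64, 14

# Random stand-ins for decoder_all_hdec and the Dense layer's weights (untrained).
decoder_all_hdec = rng.normal(size=(batch, timesteps, latent_dim))
kernel = rng.normal(size=(latent_dim, vocabulary_size))
bias = np.zeros(vocabulary_size)

# The same affine map is applied independently at every timestep.
logits = decoder_all_hdec @ kernel + bias

# Numerically stable softmax over the vocabulary axis.
logits -= logits.max(axis=-1, keepdims=True)
probabilities = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

print(probabilities.shape)
```

Each of the `batch * timesteps` rows sums to one, i.e. it is a distribution over the next token.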

## Inference (testing time - no teacher forcing)

   - We are going to see how to perform inference
       - Decoding a new sequence with trained weights without using teacher forcing
   - We won't provide the <...EOS> part of the sequences like during training
   - Predictions have to be performed one step at a time
   - At first we will use encoder's last state and GO token
       - Produces the $1^{st}$ decoder hidden layer  $h_{0}^{dec}$
   - Secondly we will use $h_{0}^{dec}$ to predict $\hat{y}^0$ token
   - Thirdly we will use $h_{0}^{dec}$ and $\hat{y}^0$ token
       - Produces $h_{1}^{dec}$ then used to predict $\hat{y}^1$ token
   - etc.
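
The loop described above can be sketched without any trained model. In the snippet below, `toy_decoder_step` is a hypothetical stand-in for one decoder step: its "state" is just a position counter and it always spells out a scripted answer, so only the control flow (feed previous token and state, take the argmax, stop on EOS or when too long) reflects the real procedure.

```python
import numpy as np

GO, EOS = "=", "\n"
VOCABULARY = sorted("0123456789.-" + GO + EOS)
CHAR_INDEX = {char: index for index, char in enumerate(VOCABULARY)}

def toy_decoder_step(token, state):
    """Hypothetical decoder step: returns (next-token distribution, next state)."""
    scripted_answer = "3183" + EOS
    probabilities = np.zeros(len(VOCABULARY))
    probabilities[CHAR_INDEX[scripted_answer[state]]] = 1.0
    return probabilities, state + 1

def greedy_decode(initial_state, max_length):
    """Decode one token at a time, feeding each prediction back as the next input."""
    token, state, decoded = GO, initial_state, ""
    while len(decoded) < max_length:
        probabilities, state = toy_decoder_step(token, state)
        token = VOCABULARY[int(np.argmax(probabilities))]
        if token == EOS:
            break
        decoded += token
    return decoded

print(greedy_decode(initial_state=0, max_length=11))
```

Unlike the notebook's `decode_sequence`, this sketch drops the EOS character from the returned string.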

### Requirements

   - To perform inference we are going to need an **inference encoder model**
       - Takes in the input sequence and returns the last $h$ and $c$ state
           - To be passed to the decoder
   - To perform inference we are also going to need an **inference decoder model**
       - Takes in the previous hidden state $h_{t-1}^{dec}$
       - Takes in a token: GO or previous $\hat{y}^{t-1}$
       - Returns next $h_{t}^{dec}$ and $\hat{y}^{t}$
       - Iterates over these steps
           - Until it produces the EOS token or decoded sequence is too long

We are going to reuse layers and nodes from before:
   - **encoder_inputs**, **encoder_states** and **decoder_all_hdec** nodes that are already connected
   - **decoder_lstm**, **decoder_inputs** and **decoder_dense** layers

### inference_encoder_model
   - Use the class Model from keras.models
   - Make the node **encoder_inputs** the model's input
   - Make the node **encoder_states** the model's output

### inference_decoder_model
   - Define $2$ $Input$ keras.layers of dimensionality **latent_dim**
       - **decoder_state_input_h** and **decoder_state_input_c**
          - **decoder_model**'s last state
          - To be given later to **inference_decoder_model**'s predict function
          - Stack them in a **decoder_states_inputs** variable
              - **decoder_states_inputs**$ = [$**decoder_state_input_h**$, $**decoder_state_input_c**$]$
              
   - Connect **decoder_lstm** to **decoder_inputs**
       - While connecting use the argument **initial_state**$ = $**decoder_states_inputs**
       - Get **decoder_all_hdec**, **decoder_state_h**, **decoder_state_c** from that connection
          - **decoder_all_hdec** is all the $h_{t}^{dec}$ produced
             - $1$ token at a time is given, thus **decoder_all_hdec** shape is $(1,$**latent_dim**$)$
          - **decoder_state_h** is the last $h_{t}^{dec}$
              - First part of **decoder_lstm**'s last state
              - Will be **decoder_state_input_h** at next iteration
          - **decoder_state_c**, is the last $c_{t}^{dec}$
              - Second part of **decoder_lstm**'s last state
              - Will be **decoder_state_input_c** at next iteration
          - Stack **decoder_state_h** and **decoder_state_c** in a **decoder_states** variable
              - **decoder_states**$ = [$**decoder_state_h**$, $**decoder_state_c**$]$
          - **decoder_state_h** and **decoder_state_c** will be returned along with prediction $\hat{y}^{t}$
          
   - Connect **decoder_dense** layer from before to **decoder_all_hdec**
       - Produces the probability distribution $\hat{y}^t$ over the next token
       - Get the $\hat{y}^t$ prediction in a **decoder_outputs** node variable
       
   - At this point we have
       - Next prediction **decoder_outputs**
       - Last state **decoder_states**
       
   - We are ready to define the decoder_model
       - Make $[$**decoder_inputs**$] + $**decoder_states_inputs** the model's inputs
       - Make $[$**decoder_outputs**$] + $**decoder_states** the model's outputs

**GO** is the character ("=") that marks the beginning of decoding for the decoder LSTM<br/>
**EOS** is the character ("\n") that marks the end of sequence to decode for the decoder LSTM

In [4]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
from sklearn.model_selection import train_test_split

class Seq2seq():
    def __init__(self, X, y):
        # Special tokens
        self.GO = '='
        self.EOS = '\n'
        # Dataset properties
        self.X = None
        self.y = None
        self.X_tr = None
        self.X_val = None
        self.y_tr = None
        self.y_val = None
        self.n = None
        self.encoder_char_index = None
        self.encoder_char_index_inversed = None
        self.decoder_char_index = None
        self.decoder_char_index_inversed = None
        self.encoder_vocabulary_size = None
        self.decoder_vocabulary_size = None
        self.max_encoder_sequence_length = None
        self.max_decoder_sequence_length = None
        # Preprocessed data
        self.encoder_input_data_tr = None
        self.encoder_input_data_val = None
        self.decoder_input_data_tr = None
        self.decoder_input_data_val = None
        self.decoder_target_data_tr = None
        self.decoder_target_data_val = None
        # Model properties
        self.training_model = None
        self.inference_encoder_model = None
        self.inference_decoder_model = None
        self.batch_size = None
        self.epochs = None
        self.latent_dim = None
        # Model layers and states that we want to keep in memory between training and inference
        ## Encoder
        self.encoder_inputs = None
        self.encoder_states = None
        ## Decoder
        self.decoder_inputs = None
        self.decoder_lstm = None
        self.decoder_all_hdec = None
        self.decoder_dense = None
        # Dataset construction call
        self.load_and_preprocess_data(X, y)
        self.construct_dataset()
        
    def load_and_preprocess_data(self, X, y):
        self.X = list(X)
        self.y = list(map(lambda token: self.GO + token + self.EOS, y))
        self.n = len(self.X)
        encoder_characters = sorted(list(set("".join(self.X))))
        decoder_characters = sorted(list(set("".join(self.y))))
        self.encoder_char_index = dict((c, i) for i, c in enumerate(encoder_characters))
        self.encoder_char_index_inversed = dict((i, c) for i, c in enumerate(encoder_characters))
        self.decoder_char_index = dict((c, i) for i, c in enumerate(decoder_characters))
        self.decoder_char_index_inversed = dict((i, c) for i, c in enumerate(decoder_characters))
        self.encoder_vocabulary_size = len(self.encoder_char_index)
        self.decoder_vocabulary_size = len(self.decoder_char_index)
        self.max_encoder_sequence_length = max([len(sequence) for sequence in self.X])
        self.max_decoder_sequence_length = max([len(sequence) for sequence in self.y])
        print('Number of samples:', self.n)
        print('Number of unique encoder tokens:', self.encoder_vocabulary_size)
        print('Number of unique decoder tokens:', self.decoder_vocabulary_size)
        print('Max sequence length for encoding:', self.max_encoder_sequence_length)
        print('Max sequence length for decoding:', self.max_decoder_sequence_length)
        (self.X_tr, self.X_val, 
         self.y_tr, self.y_val) = train_test_split(
            self.X, 
            self.y,
            random_state=42
        )
        
    def construct_dataset(self):
        encoder_input_data = np.zeros(
            (self.n, self.max_encoder_sequence_length, self.encoder_vocabulary_size),
            dtype='float32')
        # The second axis is the time axis, so it must be the maximum decoded length
        decoder_input_data = np.zeros(
            (self.n, self.max_decoder_sequence_length, self.decoder_vocabulary_size),
            dtype='float32')
        decoder_target_data = np.zeros(
            (self.n, self.max_decoder_sequence_length, self.decoder_vocabulary_size),
            dtype='float32')
        for i, (X_i, y_i) in enumerate(zip(self.X, self.y)):
            for t, char in enumerate(X_i):
                encoder_input_data[i, t, self.encoder_char_index[char]] = 1.
            for t, char in enumerate(y_i):
                decoder_input_data[i, t, self.decoder_char_index[char]] = 1.
                if t > 0:
                    decoder_target_data[i, t - 1, self.decoder_char_index[char]] = 1.
        (self.encoder_input_data_tr, self.encoder_input_data_val, 
         self.decoder_input_data_tr, self.decoder_input_data_val,
         self.decoder_target_data_tr, self.decoder_target_data_val) = train_test_split(
            encoder_input_data, 
            decoder_input_data, 
            decoder_target_data,
            random_state=42
        )
    
    """
    ENCODER LAYERS:
        - define an Input Keras object in self.encoder_inputs
        - apply an LSTM layer on self.encoder_inputs to get the last state_h and state_c
        - stack those states into an array self.encoder_states
    DECODER LAYERS:
        - define an Input Keras object in self.decoder_inputs
        - define an LSTM layer in self.decoder_lstm, make sure you set return_sequences=True
        to be able to return all hidden states
        - apply this LSTM layer on self.decoder_inputs with the states initialized with self.encoder_states
        and output all the hidden states in self.decoder_all_hdec
        - define a Dense layer in self.decoder_dense with a softmax activation, and output the results 
        in decoder_outputs using self.decoder_all_hdec as inputs
    MODEL DEFINITION:
        - now you can build your global Model:
        Model([self.encoder_inputs, self.decoder_inputs], decoder_outputs)
    """
    def design_and_compile_training_model(self, batch_size=64, latent_dim=256):
        # Hyperparameters
        self.batch_size = batch_size
        self.latent_dim = latent_dim
        # Encoder layers
        self.encoder_inputs = Input(shape=(None, self.encoder_vocabulary_size))
        encoder_lstm = LSTM(self.latent_dim, return_state=True)
        _, state_h, state_c = encoder_lstm(self.encoder_inputs)
        self.encoder_states = [state_h, state_c]
        # Decoder layers
        self.decoder_inputs = Input(shape=(None, self.decoder_vocabulary_size))
        self.decoder_lstm = LSTM(self.latent_dim, return_state=True, return_sequences=True)
        self.decoder_all_hdec, _, _ = self.decoder_lstm(self.decoder_inputs, initial_state=self.encoder_states)
        self.decoder_dense = Dense(self.decoder_vocabulary_size, activation='softmax')
        decoder_outputs = self.decoder_dense(self.decoder_all_hdec)
        # Model definition and compilation
        self.training_model = Model([self.encoder_inputs, self.decoder_inputs], decoder_outputs)
        self.training_model.compile(optimizer='adam', loss='categorical_crossentropy')
        self.training_model.summary()
        
    def train(self, epochs=15):
        # Hyperparameters
        self.epochs = epochs
        # Model actual training
        self.training_model.fit(
            [self.encoder_input_data_tr, self.decoder_input_data_tr], self.decoder_target_data_tr,
            batch_size=self.batch_size,
            epochs=self.epochs,
            validation_data=(
                [self.encoder_input_data_val, self.decoder_input_data_val], self.decoder_target_data_val
            )
        )
    
    """
    ENCODER MODEL:
        - create a Keras Model self.inference_encoder_model 
        with self.encoder_inputs as inputs and self.encoder_states as output
    DECODER MODEL:
        - define two Input Keras objects: one for the h_state and the other for the c_state, then stack
        them into a decoder_states_inputs array
        - reuse the already trained self.decoder_lstm layer with self.decoder_inputs as input
        and decoder_states_inputs as initial_state
            - you should get three outputs: decoder_all_hdec, decoder_state_h and decoder_state_c
        - again, stack the output decoder_state_h and decoder_state_c into a decoder_states list
        - now reuse the already trained self.decoder_dense layer with decoder_all_hdec as input,
        and store the output into decoder_outputs
        - you can finally create a Keras Model self.inference_decoder_model
        with [self.decoder_inputs] + decoder_states_inputs as inputs 
        and [decoder_outputs] + decoder_states as output
    """
    def design_inference_model(self):
        if self.training_model is None:
            print("No training model has been defined yet!")
            return None
        # Encoder model
        self.inference_encoder_model = Model(self.encoder_inputs, self.encoder_states)
        # Decoder model
        ## Inputs: latent variables from the encoder
        decoder_state_input_h = Input(shape=(self.latent_dim,))
        decoder_state_input_c = Input(shape=(self.latent_dim,))
        decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
        ## Decoding using the LSTM trained layer from the decoder
        decoder_all_hdec, decoder_state_h, decoder_state_c = self.decoder_lstm(
            self.decoder_inputs, initial_state=decoder_states_inputs
        )
        decoder_states = [decoder_state_h, decoder_state_c]
        ## Get outputs using the Dense trained layer from the decoder
        decoder_outputs = self.decoder_dense(decoder_all_hdec)
        ## Define the whole decoding model
        self.inference_decoder_model = Model(
            [self.decoder_inputs] + decoder_states_inputs,
            [decoder_outputs] + decoder_states)
        
    def decode_sequence(self, input_sequence):
        states_value = self.inference_encoder_model.predict(input_sequence)
        target_sequence = np.zeros((1, 1, self.decoder_vocabulary_size))
        target_sequence[0, 0, self.decoder_char_index[self.GO]] = 1.
        decoded_sentence = ''
        while len(decoded_sentence) <= self.max_decoder_sequence_length:
            output_tokens, h, c = self.inference_decoder_model.predict(
                [target_sequence] + states_value
            )
            states_value = [h, c]
            sampled_token_index = np.argmax(output_tokens[0, -1, :])
            sampled_char = self.decoder_char_index_inversed[sampled_token_index]
            decoded_sentence += sampled_char
            if sampled_char == self.EOS:
                break
            target_sequence = np.zeros((1, 1, self.decoder_vocabulary_size))
            target_sequence[0, 0, sampled_token_index] = 1.
        return decoded_sentence

Using TensorFlow backend.


In [5]:
seq2seq = Seq2seq(X, y)

Number of samples: 100000
Number of unique encoder tokens: 15
Number of unique decoder tokens: 14
Max sequence length for encoding: 7
Max sequence length for decoding: 11


In [6]:
seq2seq.design_and_compile_training_model(latent_dim=64)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 15)     0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 14)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, 64), (None,  20480       input_1[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, None, 64), ( 20224       input_2[0][0]                    
                                                                 lstm_1[0][1]                     
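
The `Param #` column for the two LSTM layers can be reproduced by hand. A standard LSTM layer has four weight blocks (input, forget and output gates plus the candidate cell), each with an input kernel, a recurrent kernel and a bias; the helper name below is hypothetical.

```python
def lstm_parameter_count(input_dim, units):
    """Number of trainable weights in a standard LSTM layer with biases."""
    # Four blocks, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)

encoder_parameters = lstm_parameter_count(input_dim=15, units=64)
decoder_parameters = lstm_parameter_count(input_dim=14, units=64)
print(encoder_parameters)  # 20480, as reported for lstm_1
print(decoder_parameters)  # 20224, as reported for lstm_2
```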
          

In [7]:
seq2seq.train(epochs=5)

Train on 75000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [8]:
seq2seq.design_inference_model()

In [9]:
for sequence_index in range(10):
    input_sequence = seq2seq.encoder_input_data_val[sequence_index: sequence_index + 1]
    decoded_sentence = seq2seq.decode_sequence(input_sequence)
    print('-')
    raw_input_sequence = "".join(
        [seq2seq.encoder_char_index_inversed[np.argmax(token)] for token in np.squeeze(input_sequence)][::-1]
    )
    print('Input sentence:', seq2seq.X_val[sequence_index][::-1])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: 581+495
Decoded sentence: 1041

-
Input sentence: 79+231
Decoded sentence: 704

-
Input sentence: 648/414
Decoded sentence: 1.91515

-
Input sentence: 248*10
Decoded sentence: 1040

-
Input sentence: 625+909
Decoded sentence: 1149

-
Input sentence: 423+441
Decoded sentence: 1014

-
Input sentence: 937+215
Decoded sentence: 1019

-
Input sentence: 662-651
Decoded sentence: 12

-
Input sentence: 698%467
Decoded sentence: 18

-
Input sentence: 920-281
Decoded sentence: 581
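
One simple way to quantify these results is exact-match accuracy. The sketch below recomputes the expected answer for each input with the same formatting rule as the generator and compares it to the decoded string; the helper name `expected_answer` is an assumption, and the pairs are copied from the output above.

```python
import operator

OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
             "/": operator.truediv, "%": operator.mod}

def expected_answer(expression):
    """Recompute the target string with the same formatting rule as the generator."""
    for symbol in OPERATORS:
        # Operands are positive, so the operator never sits at position 0.
        position = expression.find(symbol, 1)
        if position != -1:
            left = int(expression[:position])
            right = int(expression[position + 1:])
            result = OPERATORS[symbol](left, right)
            return "{:.5f}".format(result) if isinstance(result, float) else str(result)
    raise ValueError("no operator found in " + expression)

# (input, decoded) pairs copied from the output above.
decoded_pairs = [
    ("581+495", "1041"), ("79+231", "704"), ("648/414", "1.91515"),
    ("248*10", "1040"), ("625+909", "1149"), ("423+441", "1014"),
    ("937+215", "1019"), ("662-651", "12"), ("698%467", "18"), ("920-281", "581"),
]
matches = sum(expected_answer(x) == y.strip() for x, y in decoded_pairs)
print("exact match:", matches, "/", len(decoded_pairs))
```

On these ten samples none of the decoded strings matches exactly after $5$ epochs, although several have the right number of digits.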



### Questions:

- 1) Explain the interest in using teacher forcing during training. What is specific about this process?

<span style="color:green">
Teacher forcing means that, at each timestep, the decoder is fed the correct previous token rather than its own previous prediction, which is what it would receive otherwise. This stabilizes training: each step is evaluated given a correct input, so weight updates target the steps where the network actually predicts the wrong token, rather than penalizing steps that would have been correct had an earlier mistake not corrupted their input.
</span>


- 2) Describe step by step how the encoder-decoder couple works in this case (~ 5-10 lines)

<span style="color:green">
There are $2$ symmetric networks, $2$ LSTMs, that have similar purposes. The first one, the encoder, processes vectors, $1$ at a time, and outputs a vector, $h_{t}^{enc}$. The position in vector space of $h_{t}^{enc}$ tells the decoder which token to predict first and contains information about all the vectors that have been processed. Thus at time $t$ there are $2$ choices: either provide $h_{t}^{enc}$ to the decoder so it knows which $1^{st}$ token to decode, or process the next input vector and produce $h_{t+1}^{enc}$.
The final $h_{T}^{enc}$ (together with the cell state $c_{T}^{enc}$) is provided to the decoder along with the GO token. Combined, they produce a vector $h_{0}^{dec}$, which a fully connected layer uses as input to predict the first decoded token $\hat{y}^{0}$. We iterate over this process, now providing $h_{t-1}^{dec}$ and the previous token (the true $y^{t-1}$ during training, the predicted $\hat{y}^{t-1}$ during inference) as input, until the decoder predicts the EOS token.
</span>

- 3) Why is it mandatory to have different implementations between training and inference? Why do we need different models? (~ 3-6 lines)

<span style="color:green">
During training, since we use teacher forcing, the whole sequence of input tokens for the decoder is available in advance: the true token $y^{t-1}$ for each timestep $t$. That is how we organize the data and feed it to the decoder LSTM. During inference, we cannot use teacher forcing: we have no way to provide a full sequence of inputs to the decoder from the start. Thus we have to produce each prediction in turn and feed it as the input token to the decoder LSTM at the next timestep. These $2$ different ways of providing the input data to the decoder LSTM are the reason why different implementations are needed.
</span>

# II - Sequence to sequence model with attention mechanism

Try to improve your previous model from part I with an attention mechanism<br/>
Note that it will be a bit different from the implementation seen in the course for practical reasons<br/>
In this part, you will concatenate the attention context vectors (the attention-weighted sums of the encoder hidden states) with the hidden decoder states after the decoding pass, and feed the result to the final Dense layers
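
Before the Keras version, here is a NumPy sketch of the dot-product attention described above, using random stand-ins for the encoder and decoder hidden states. It is meant only to show the tensor shapes at each step (scores, softmax weights, context, concatenation); the variable names mirror the notebook's but the arrays are not produced by any trained layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, t_dec, t_enc, latent_dim = 2, 11, 7, 64

# Random stand-ins for the hidden states of both LSTMs.
decoder_all_hdec = rng.normal(size=(batch, t_dec, latent_dim))
encoder_outputs = rng.normal(size=(batch, t_enc, latent_dim))

# Alignment scores: dot product over the latent axis -> (batch, t_dec, t_enc).
scores = np.einsum("bdl,bel->bde", decoder_all_hdec, encoder_outputs)

# Softmax over the encoder timesteps.
scores -= scores.max(axis=-1, keepdims=True)
attention = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Context: attention-weighted sum of encoder states -> (batch, t_dec, latent_dim).
context = np.einsum("bde,bel->bdl", attention, encoder_outputs)

# Concatenate context and decoder states along the feature axis.
combined = np.concatenate([context, decoder_all_hdec], axis=-1)
print(attention.shape, context.shape, combined.shape)
```

The concatenated tensor has `2 * latent_dim` features per decoder timestep, which is what the tanh Dense layer receives.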

In [10]:
from keras.layers import Activation, dot, concatenate, TimeDistributed

class Seq2seqAttention(Seq2seq):
    def __init__(self, X, y):
        # Attention layers hyperparameters
        self.latent_attention_dim = 64
        # All hidden states from the encoder that must now be stored
        self.encoder_outputs = None
        # Attention layers and states
        self.dense_tanh = None
        self.dense_final = None
        # Seq2Seq class initialization
        super(Seq2seqAttention, self).__init__(X, y)
    
    """
    ENCODER LAYERS:
        - define an Input Keras object in self.encoder_inputs
        - apply an LSTM layer with return_sequences=True on self.encoder_inputs
        to get (self.encoder_outputs, state_h, state_c)
        - stack state_h and state_c into an array self.encoder_states
    DECODER LAYERS:
        - define an Input Keras object in self.decoder_inputs
        - define an LSTM layer in self.decoder_lstm, make sure you set return_sequences=True
        to be able to return all hidden states
        - apply this LSTM layer on self.decoder_inputs with the states initialized with self.encoder_states
        and output all the hidden states in self.decoder_all_hdec
    ATTENTION LAYERS:
        - apply a dot product between self.decoder_all_hdec and self.encoder_outputs along their last
        dimension (the latent one), then a softmax activation, and store the result into attention
        - compute the context tensor with a dot product between the attention tensor and self.encoder_outputs
        along the last dimension (softmax values) for attention and time dimension for self.encoder_outputs
        - concatenate the result with self.decoder_all_hdec
        - define the two final Dense layers: 
            - the first with tanh activation and self.latent_attention_dim size
            - the second with softmax activation and self.decoder_vocabulary_size
        - output the final result into attention_outputs
    MODEL DEFINITION:
        - now you can build your global Model:
        Model([self.encoder_inputs, self.decoder_inputs], attention_outputs)
    """
    def design_and_compile_training_model(self, batch_size=64, latent_dim=256, latent_attention_dim=64):
        # Hyperparameters
        self.batch_size = batch_size
        self.latent_dim = latent_dim
        self.latent_attention_dim = latent_attention_dim
        # Encoder layers
        self.encoder_inputs = Input(shape=(None, self.encoder_vocabulary_size))
        encoder_lstm = LSTM(self.latent_dim, return_state=True, return_sequences=True)
        encoder_outputs, state_h, state_c = encoder_lstm(self.encoder_inputs)
        self.encoder_states = [state_h, state_c]
        self.encoder_outputs = encoder_outputs
        print(encoder_outputs.shape)
        # Decoder layers
        self.decoder_inputs = Input(shape=(None, self.decoder_vocabulary_size))
        self.decoder_lstm = LSTM(self.latent_dim, return_state=True, return_sequences=True)
        self.decoder_all_hdec, _, _ = self.decoder_lstm(self.decoder_inputs, initial_state=self.encoder_states)
        # Attention layers
        attention = dot([self.decoder_all_hdec, self.encoder_outputs], axes=[2, 2])
        attention = Activation('softmax', name='attention')(attention)
        context = dot([attention, self.encoder_outputs], axes=[2, 1])
        decoder_combined_context = concatenate([context, self.decoder_all_hdec])
        self.dense_tanh = Dense(self.latent_attention_dim, activation="tanh")
        self.dense_final = Dense(self.decoder_vocabulary_size, activation="softmax")
        attention_outputs = self.dense_final(self.dense_tanh(decoder_combined_context))
        # Model definition and compilation
        self.training_model = Model([self.encoder_inputs, self.decoder_inputs], attention_outputs)
        self.training_model.compile(optimizer='adam', loss='categorical_crossentropy')
        self.training_model.summary()
        
    """
    ENCODER MODEL:
        - create a Keras Model self.inference_encoder_model 
        with self.encoder_inputs as inputs and self.encoder_states as output
    DECODER MODEL:
        - define two Input Keras objects: one for the h_state and the other for the c_state, then stack
        them into a decoder_states_inputs array
        - reuse the already trained self.decoder_lstm layer with self.decoder_inputs as input
        and decoder_states_inputs as initial_state
            - you should get three outputs: decoder_all_hdec, decoder_state_h and decoder_state_c
        - again, stack the output decoder_state_h and decoder_state_c into a decoder_states array
        - now apply a dot product between decoder_all_hdec and self.encoder_outputs along their last
        dimension (the latent one), then a softmax activation: it is your attention tensor
        - compute the context tensor with a dot product between the attention tensor and self.encoder_outputs
        along the last dimension (softmax values) for attention and time dimension for self.encoder_outputs
        - concatenate the result with decoder_all_hdec
        - reuse the two trained Dense layers and output the final result into attention_outputs
        - you can finally create a Keras Model self.inference_decoder_model
        with [self.encoder_inputs] + [self.decoder_inputs] + decoder_states_inputs as inputs 
        and [attention_outputs] + decoder_states as output
    """
    def design_inference_model(self):
        if self.training_model is None:
            print("No training model has been defined yet!")
            return None
        # Encoder model
        self.inference_encoder_model = Model(self.encoder_inputs, self.encoder_states)
        # Decoder model
        decoder_state_input_h = Input(shape=(self.latent_dim,))
        decoder_state_input_c = Input(shape=(self.latent_dim,))
        decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
        decoder_all_hdec, decoder_state_h, decoder_state_c = self.decoder_lstm(
            self.decoder_inputs, initial_state=decoder_states_inputs
        )
        decoder_states = [decoder_state_h, decoder_state_c]
        
        attention = dot([decoder_all_hdec, self.encoder_outputs], axes=[2, 2])
        attention = Activation('softmax', name='attention')(attention)
        context = dot([attention, self.encoder_outputs], axes=[2, 1])
        decoder_combined_context = concatenate([context, decoder_all_hdec])
        attention_outputs = self.dense_final(self.dense_tanh(decoder_combined_context))
        
        self.inference_decoder_model = Model(
            [self.encoder_inputs] + [self.decoder_inputs] + decoder_states_inputs,
            [attention_outputs] + decoder_states)
        
    def decode_sequence(self, input_sequence):
        # Encode the input once to get the initial decoder states [h, c]
        states_value = self.inference_encoder_model.predict(input_sequence)
        # Start decoding from a one-hot GO token
        target_sequence = np.zeros((1, 1, self.decoder_vocabulary_size))
        target_sequence[0, 0, self.decoder_char_index[self.GO]] = 1.
        decoded_sentence = ''
        # Greedy decoding: one character per step, until EOS or the length limit
        while len(decoded_sentence) <= self.max_decoder_sequence_length:
            output_tokens, h, c = self.inference_decoder_model.predict(
                [input_sequence] + [target_sequence] + states_value
            )
            # Carry the decoder states over to the next step
            states_value = [h, c]
            # Keep the most probable character at the last timestep
            sampled_token_index = np.argmax(output_tokens[0, -1, :])
            sampled_char = self.decoder_char_index_inversed[sampled_token_index]
            decoded_sentence += sampled_char
            if sampled_char == self.EOS:
                break
            # Feed the sampled character back in as the next decoder input
            target_sequence = np.zeros((1, 1, self.decoder_vocabulary_size))
            target_sequence[0, 0, sampled_token_index] = 1.
        return decoded_sentence

In [11]:
seq2seq_attention = Seq2seqAttention(X, y)

Number of samples: 100000
Number of unique encoder tokens: 15
Number of unique decoder tokens: 14
Max sequence length for encoding: 7
Max sequence length for decoding: 11


In [12]:
seq2seq_attention.design_and_compile_training_model(latent_dim=64)

(?, ?, 64)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, None, 15)     0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, None, 14)     0                                            
__________________________________________________________________________________________________
lstm_3 (LSTM)                   [(None, None, 64), ( 20480       input_5[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LSTM)                   [(None, None, 64), ( 20224       input_6[0][0]                    
                                                                 lstm_3[0][1]                     

In [13]:
seq2seq_attention.train(epochs=5)

Train on 75000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [14]:
seq2seq_attention.design_inference_model()

In [15]:
for sequence_index in range(10):
    input_sequence = seq2seq_attention.encoder_input_data_val[sequence_index: sequence_index + 1]
    decoded_sentence = seq2seq_attention.decode_sequence(input_sequence)
    print('-')
    raw_input_sequence = "".join(
        [seq2seq_attention.encoder_char_index_inversed[np.argmax(token)] 
         for token in np.squeeze(input_sequence)][::-1]
    )
    print('Input sentence:', seq2seq_attention.X_val[sequence_index][::-1])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: 581+495
Decoded sentence: 1041

-
Input sentence: 79+231
Decoded sentence: 542

-
Input sentence: 648/414
Decoded sentence: 1.64441

-
Input sentence: 248*10
Decoded sentence: 12280

-
Input sentence: 625+909
Decoded sentence: 1441

-
Input sentence: 423+441
Decoded sentence: 942

-
Input sentence: 937+215
Decoded sentence: 1041

-
Input sentence: 662-651
Decoded sentence: -13

-
Input sentence: 698%467
Decoded sentence: 188

-
Input sentence: 920-281
Decoded sentence: 418



### Questions:

- 1) Explain the main differences with the previous part. How does the attention mechanism work? (~ 5-10 lines)

<span style="color:green">
The attention mechanism lets the decoder focus on a specific part of the input sequence when predicting the token at a given timestep. Instead of relying solely on the final encoder state $h_{T}^{enc}$ to predict the whole decoded sequence, the decoder recombines and weighs all the $h_{t}^{enc}$ at each decoding step to focus on those most relevant to the prediction.
At each decoding step $t'$, a scalar product is computed between $h_{t'}^{dec}$ and every $h_{t}^{enc}$, which gives a similarity score between the decoder state and each encoder state. A softmax is applied to this vector of scores so that the coefficients are positive and sum to $1$. They are then used as weights to compute a weighted mean of the $h_{t}^{enc}$ (the context vector); the network can focus on some input tokens by making their coefficients much larger than the others. The context vector is concatenated with $h_{t'}^{dec}$ and passed through a fully connected $\tanh$ layer, which reduces the dimension of the input to the next operation. The final step is a fully connected softmax layer over the $\tanh$ output, which predicts the next decoded token. Applying the attention mechanism means repeating this computation at every decoding timestep.
</span>
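
The steps described above can be sketched in plain NumPy for a single example. This is only an illustration of dot-product attention, not the notebook's Keras code: the function names `softmax` and `dot_product_attention` are hypothetical, the hidden states are random, and the sizes (7 encoder steps, 11 decoder steps, 64 units) are borrowed from the cell outputs above purely for illustration. The two dense layers are left out.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the max for numerical stability; the result is unchanged.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def dot_product_attention(h_dec, h_enc):
    """h_dec: (dec_steps, d) decoder states, h_enc: (enc_steps, d) encoder states."""
    # Similarity between every decoder state and every encoder state.
    scores = h_dec @ h_enc.T                    # (dec_steps, enc_steps)
    # One weight distribution over the encoder steps per decoder step.
    weights = softmax(scores, axis=-1)
    # Weighted mean of the encoder states: one context vector per decoder step.
    context = weights @ h_enc                   # (dec_steps, d)
    # Context and decoder state are concatenated before the dense layers.
    combined = np.concatenate([context, h_dec], axis=-1)  # (dec_steps, 2 * d)
    return weights, context, combined

# Random stand-ins for the LSTM hidden states (assumption: illustrative values only).
rng = np.random.default_rng(0)
h_enc = rng.normal(size=(7, 64))
h_dec = rng.normal(size=(11, 64))

weights, context, combined = dot_product_attention(h_dec, h_enc)
print(weights.shape, context.shape, combined.shape)
```

Each row of `weights` sums to $1$, so plotting it as a heat map is one simple way to see which input tokens a decoding step attends to.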

- 2) Compare the performance of your model at inference time with and without the attention mechanism

<span style="color:green">
In this example, there is no noticeable difference in performance between the $2$ implementations, with and without the attention mechanism. A quick visualization of the attention weights also shows that the network does not focus strongly on any particular part of the input when predicting a single decoded token. The reason is that this encoding-decoding problem is specific in that almost all input tokens are involved in producing each output token.
</span>
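
One possible way to make such a comparison concrete is an exact-match score: the fraction of expressions whose decoded string equals the true result. The sketch below is an assumption about how this could be measured, not something the notebook does; `expected_answer` and `exact_match_accuracy` are hypothetical helpers. The ten expression/decoded pairs are copied from the output of cell 15 above, so the score only describes those ten samples after 5 epochs of training.

```python
import operator

# Same five operators as the data generator at the top of the notebook.
OPERATORS = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': operator.truediv,
    '%': operator.mod,
}

def expected_answer(expression):
    """Compute the reference string for an expression such as '581+495'."""
    # Operands are positive integers, so the first non-digit is the operator.
    position = next(i for i, char in enumerate(expression) if char in OPERATORS)
    a, b = int(expression[:position]), int(expression[position + 1:])
    result = OPERATORS[expression[position]](a, b)
    # Floats are formatted with five decimals, as in the generator.
    return "{:.5f}".format(result) if isinstance(result, float) else str(result)

def exact_match_accuracy(expressions, decoded):
    """Fraction of decoded strings that equal the true result exactly."""
    # strip() drops any trailing end-of-sequence whitespace (assumption).
    matches = sum(expected_answer(x) == y.strip() for x, y in zip(expressions, decoded))
    return matches / len(expressions)

# Pairs copied from the decoding output shown above.
expressions = ['581+495', '79+231', '648/414', '248*10', '625+909',
               '423+441', '937+215', '662-651', '698%467', '920-281']
decoded = ['1041', '542', '1.64441', '12280', '1441',
           '942', '1041', '-13', '188', '418']

accuracy = exact_match_accuracy(expressions, decoded)
print(accuracy)  # 0.0: none of these ten decoded strings matches exactly
```

Running the same function on the decoded outputs of both models over the whole validation set would give one number per model to compare.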