# Translations 2.0 - Neural Machine Translation

According to the Google paper [*Attention is all you need*](https://arxiv.org/abs/1706.03762), you only need layers of Attention to make a Deep Learning model understand the complexity of a sentence. We will try to implement this type of model for our translator. 

### Data import 

You will have the same `.txt` file containing a sentence with its translation separated by a tab (`\t`). You will have to import this data and read it via `pandas`.

Your data can be found on this link: https://go.aws/38ECHUB

### Preprocessing 

The whole purpose of your preprocessing is to express your (French) entry sentence in a sequence of clues.

i.e. :

* je suis heureux---> `[123, 21, 34, 0, 0, 0, 0, 0]`

This gives a *shape* -> `(batch_size, max_len_of_a_sentence)`.

The zeros correspond to what are called [*padded_sequences*](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) which allow all word sequences to have the same length across a set of sequences (mandatory for your algorithm). 

You will run the same preprocessing on the target sequences, and add a `<start>` token at the beginning of each sequence.

* `<start>` I am happy ---> `[1, 43, 2, 42, 0, 0]`

### Modeling 

For modeling, you will need to set up layers of attention. You'll need to: 

* Create an `Encoder` class that inherits from `tf.keras.Model`.
* Create a Bahdanau Attention Layer that will be a class that inherits `tf.keras.layers.Layer`
* Finally create a `Decoder` class that inherits from `tf.keras.Model`.


You will need to create your own cost function as well as your own training loop. 


### Tips 

Don't take the whole dataset at the beginning for your experiments, just take 5000 or even 3000 sentences. This will allow you to iterate faster and avoid bugs simply related to your need for computing power, and memory space.

Good Luck!


In [108]:
# Import necessaries librairies
import pandas               as pd
# import numpy                as np 
# import tensorflow_datasets  as tfds
import tensorflow           as tf 
import os
import time
from tensorflow.keras.utils import plot_model

from sklearn.model_selection import train_test_split

tf.__version__

'2.10.0'

## Importing data & Preprocessing

1. Load the data using the following url https://go.aws/38ECHUB you can read this using `pd.read_csv` with the `"\t"` delimiter and `header=None`

In [109]:
# Loading function for txt document
def load_doc(url):
  df = pd.read_csv(url, delimiter="\t", header=None)
  return df

In [110]:
# Loading txt document
doc = load_doc("https://go.aws/38ECHUB")
doc.head()

Unnamed: 0,0,1
0,Go.,Va !
1,Hi.,Salut !
2,Run!,Cours !
3,Run!,Courez !
4,Wow!,Ça alors !


In [111]:
len(doc)

160538

2. Create an object `doc` containing the first 5000 rows from the file.

In [112]:
# Let's just take a sample of 5000 sentences to avoid slowness. 
doc = doc.iloc[:5_000,:]

3. Add the word `<start>` to the beginning of each target sentence in order to create a new column named `padded_en`

In [113]:
# Add a <start> token 
def begin_sentence(sentence):
  sentence = "<start> "+ sentence
  return sentence

In [114]:
# Add <start> and <end> token
doc.iloc[:, 0] = doc.iloc[:, 0].apply(lambda x: begin_sentence(x))


In [115]:
doc

Unnamed: 0,0,1
0,<start> Go.,Va !
1,<start> Hi.,Salut !
2,<start> Run!,Cours !
3,<start> Run!,Courez !
4,<start> Wow!,Ça alors !
...,...,...
9995,"<start> Help me, please.","Aide-moi, s'il te plait."
9996,"<start> Help me, please.","Aidez-moi, je vous en supplie."
9997,"<start> Help us, please.","Aidez-nous, je vous prie !"
9998,"<start> Help us, please.","Aide-nous, je te prie !"


4. Create two objects : `tokenizer_fr` and `tokenizer_en` that will be instances of the `tf.keras.preprocessing.text.Tokenizer` class. 

Be careful! Since we added a special token containing special characters, make sure you setup the tokenizers right so this token is well interpreted! (use the `filters` argument for example).

In [116]:
tokenizer_fr = tf.keras.preprocessing.text.Tokenizer()
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')

5. Fit the tokenizers on the french, and english sentences respectively.

In [117]:
tokenizer_en.fit_on_texts(doc.iloc[:,0])
tokenizer_fr.fit_on_texts(doc.iloc[:,1])

6. Create three new columns in your Dataframe for the encoded french, english sentences.

In [118]:
doc["fr_indices"] = tokenizer_fr.texts_to_sequences(doc.iloc[:,1])
doc["en_indices"] = tokenizer_en.texts_to_sequences(doc.iloc[:,0])

In [119]:
doc.head()

Unnamed: 0,0,1,fr_indices,en_indices
0,<start> Go.,Va !,[42],"[1, 14]"
1,<start> Hi.,Salut !,[546],"[1, 868]"
2,<start> Run!,Cours !,[2234],"[1, 123]"
3,<start> Run!,Courez !,[2235],"[1, 123]"
4,<start> Wow!,Ça alors !,"[24, 2236]","[1, 1493]"


7. It's rather difficult to work with sequences with variable length, use zero-padding to normalize the length of all the sequences in each category.

In [120]:
# Use of Keras to create token sequences of the same length
padded_fr_indices = tf.keras.preprocessing.sequence.pad_sequences(doc["fr_indices"], padding="post")
padded_en_indices = tf.keras.preprocessing.sequence.pad_sequences(doc["en_indices"], padding="post")

8. What are the shapes of the arrays you just created for the french, and english sentences?

In [121]:
padded_fr_indices.shape

(10000, 10)

In [122]:
padded_en_indices.shape

(10000, 6)

9. Use `sklearn` `train_test_split` function to divide your sample into train and validation sets.

In [123]:

X_train, X_val, y_train, y_val = train_test_split(padded_fr_indices, padded_en_indices)

10. Set a `BATCH_SIZE` then create a `train`, and `val` tensor datasets, apply `.shuffle` on the `train` set and `.batch` on both sets.

In [124]:
BATCH_SIZE = 128
train = tf.data.Dataset.from_tensor_slices((X_train,y_train)).shuffle(len(X_train)).batch(BATCH_SIZE)
val = tf.data.Dataset.from_tensor_slices((X_val,y_val)).batch(BATCH_SIZE)

## Modeling

1. Set up the following variables:
  * `n_embed` for the models' embedding output dimensions
  * `n_gru` for the models' gru number of units
  * `vocab_inp_size` for the french vocab size
  * `vocab_tar_size` for the english vocab size

In [125]:
# Creation of variables that we will reuse for our models
# let's start by defining the number of units needed for the embedding and
# the lstm layers

n_embed         = 1024
n_gru           = 256
vocab_in_size   = len(tokenizer_fr.word_index)
vocab_out_size  = len(tokenizer_en.word_index)

### Encoder

2. Define a class `encoder_maker` inheriting from `tf.keras.Model` that can instanciate and encoder type model according to the following schema: 

![bahdanau](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Attention-encoder-decoder.drawio.png)

In [126]:
class encoder_factory(tf.keras.Model):
  
  def __init__(self, in_vocab_size, embed_dim, n_units):
    super().__init__()
    # instanciate an embedding layer
    self.n_units = n_units
    self.embed = tf.keras.layers.Embedding(input_dim=in_vocab_size, output_dim=embed_dim)
    # instantiate GRU layer
    self.gru = tf.keras.layers.GRU(units=n_units, return_sequences=True, return_state=True)
  
  
  def __call__(self, input_batch):
    # each output will be saved as a class attribute so we can easily access
    # them to control the shapes throughout the demo
    self.embed_out = self.embed(input_batch)
    self.gru_out, self.gru_state = self.gru(self.embed_out)#, initial_state=initial_state)

    return self.gru_out, self.gru_state


3. Define an instance of the class called... `encoder`!

In [128]:
encoder = encoder_factory(vocab_in_size + 1, n_embed, n_gru)

4. Use the `__call__` method of `encoder` on some data to create an object `encoder_output`, and an `encoder_state` (remember your encoder has two different outputs!). Then print out `encoder_output`, and `encoder_state`.

In [91]:
encoder_output, encoder_state = encoder(tf.expand_dims(X_train[0],0))

ValueError: This model has not yet been built. Build the model first by calling `build()` or by calling the model on a batch of data.

In [92]:
encoder_output

<tf.Tensor: shape=(1, 12, 256), dtype=float32, numpy=
array([[[ 0.00905645, -0.02516365, -0.01188511, ..., -0.01581522,
          0.02389781,  0.01224535],
        [ 0.00900744, -0.01763661, -0.01643122, ..., -0.00787944,
         -0.00081653,  0.00402039],
        [ 0.00947303,  0.00433335, -0.00149967, ...,  0.0033477 ,
          0.03288288, -0.00484431],
        ...,
        [-0.05784671, -0.0274683 , -0.02136509, ..., -0.00858189,
         -0.0159177 ,  0.02291612],
        [-0.05779962, -0.02760827, -0.02112174, ..., -0.00958698,
         -0.01513653,  0.02328783],
        [-0.05757864, -0.02765651, -0.02094681, ..., -0.01013853,
         -0.01453455,  0.02355005]]], dtype=float32)>

In [93]:
encoder_state

<tf.Tensor: shape=(1, 256), dtype=float32, numpy=
array([[-0.05757864, -0.02765651, -0.02094681,  0.06427027,  0.00590402,
         0.0217327 ,  0.00020604,  0.02051996, -0.00044382,  0.04917082,
         0.07857566, -0.0086846 , -0.01751825,  0.01698195,  0.03614137,
         0.0180245 , -0.00389607,  0.00564247, -0.01470011, -0.02662497,
         0.06235842, -0.03729329,  0.03318333,  0.00102663,  0.03631185,
        -0.02107344,  0.01817353, -0.00698844,  0.00273566, -0.04410892,
        -0.00393473,  0.01625925,  0.01973428, -0.01830165, -0.0603941 ,
        -0.00653724,  0.01747008,  0.0251917 , -0.02825532, -0.01008458,
        -0.02244198,  0.08356526,  0.01149821,  0.04836292, -0.00207378,
        -0.00800344,  0.00935877, -0.00093986, -0.01929062, -0.00657349,
        -0.00496154, -0.07652334, -0.0070941 ,  0.01111371,  0.00632809,
        -0.03125188, -0.0059572 ,  0.04404977, -0.01731559,  0.00066816,
         0.03259735,  0.01874358,  0.00758127,  0.02120544, -0.00584927,
 

### Attention layer

5. Create a `Bahdanau_attention_maker` class that lets you instanciate an attention layer that you will include in your decoder model. You may follow the instructions from this schema: 

![bahdanau](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Attention-encoder-decoder.drawio.png)

And get inspiration (as much as you want) from the lecture's demo!

In [94]:
class Bahdanau_attention_maker(tf.keras.layers.Layer):
  
  def __init__(self, attention_units):
    super().__init__()

    # The attention layer contains three dense layers
    self.W1 = tf.keras.layers.Dense(units=attention_units)
    self.W2 = tf.keras.layers.Dense(units=attention_units)
    self.V = tf.keras.layers.Dense(units=1)

  def __call__(self, enc_out, state):
    # the choice of name of the arguments here is not random, enc_out
    # will represent the encoder output which will be used to create
    # the attention weights and then used to create the context vector once we
    # apply the attention weights
    # the state will be a hidden state from a recurrent unit coming either
    # from the encoder at first, and from the decoder as we make further 
    # predictions
    self.W1_out = self.W1(enc_out) # shape (1,12,attention_units)

    # If you have taken a close look the model's schema you would have noticed
    # that we are going to sum the outputs from W1 and W2, though the shapes
    # are incompatible
    # the enc_out is (batch_size,12,16) -> W1 -> (batch_size,12,attention_units)
    # the state is (batch_size,16) -> W2 -> (batch_size,attention_units)
    # thus we need to artificially add a dimension to the stata along axis 1
    self.state = tf.expand_dims(state, axis = 1)
    self.W2_out = self.W2(self.state) # shape (batch_size,1,attention_units)

    self.sum = self.W1_out + self.W2_out  # shape (batch_size,12,attention_units)
    self.sum_scale = tf.nn.tanh(self.sum) # shape (batch_size,12,attention_units)

    self.score = self.V(self.sum_scale) # shape (batch_size,12,1)

    self.attention_weights = tf.nn.softmax(self.score, axis=1) # shape (batch_size,12,1)

    self.weighted_enc_out = enc_out * self.attention_weights # shape (batch_size,12,16)

    self.context_vector = tf.reduce_sum(self.weighted_enc_out, axis=1) # Somme selon l'axe 1 du tenseur (b, 12, 16), donc colonne => shape (batch_size,16)

    return self.context_vector, self.attention_weights

6. Create an instance of the class called `attention_layer`.

In [95]:
attention_layer = Bahdanau_attention_maker(8)

In [105]:
plot_model(attention_layer)

NameError: name 'plot_model' is not defined

7. Try out the `__call__` method on the `encoder_output`, and `encoder_state`.

In [96]:
attention_layer(encoder_output, encoder_state)

(<tf.Tensor: shape=(1, 256), dtype=float32, numpy=
 array([[-0.03044341, -0.01617486, -0.0163315 ,  0.03019225, -0.00105563,
          0.01620287, -0.00621167,  0.01842681, -0.00171171,  0.03193591,
          0.03495656, -0.00626692, -0.00456389,  0.01549913,  0.01471619,
          0.01473392, -0.00562287,  0.0090839 , -0.00797053, -0.01652147,
          0.04017272, -0.03352743,  0.01866774, -0.00043037,  0.0133545 ,
         -0.0002742 ,  0.00795678, -0.00210773,  0.00408419, -0.02375865,
         -0.01870133,  0.00274507,  0.01029423, -0.01120158, -0.03912488,
         -0.00025478,  0.01719798,  0.0068701 , -0.01790881, -0.00913726,
         -0.00990078,  0.05565539, -0.00016384,  0.02713146, -0.00375317,
         -0.01437655,  0.01637932, -0.0075435 , -0.0047624 , -0.00221256,
         -0.0091235 , -0.05413564, -0.0133197 ,  0.00675143,  0.00341251,
         -0.01962358,  0.00178746,  0.03600667, -0.01437644,  0.00047946,
          0.01887695,  0.01412492,  0.00887732,  0.00489913, 

### Decoder

8. Set up a `decoder_maker` class that will let you create decoder models according to the demo and the following schema: 

![bahdanau](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Attention-encoder-decoder.drawio.png)

In [97]:
class decoder_maker(tf.keras.Model):
  def __init__(self, target_vocab_size, embed_dim, n_units):
    super().__init__()
    # The decoder contains an embedding layer to play with the teacher forcing
    # input, which comes from the target data
    # A gru layer
    # A dense layer to make the predictions
    # And an attention layer
    self.embed = tf.keras.layers.Embedding(input_dim=target_vocab_size, output_dim=embed_dim)
    self.gru = tf.keras.layers.GRU(units=n_units, return_sequences=True, return_state=True)
    self.pred = tf.keras.layers.Dense(units=target_vocab_size, activation="softmax")
    self.attention = Bahdanau_attention_maker(attention_units=n_units)

  def __call__(self, dec_in, enc_out, state):
    # first let's apply the attention layer
    self.context_vector, self.attention_weights = self.attention(enc_out,state)

    # now the decoder will ingest one sequence element from the teacher forcing
    # this will be of shape (bacth_size, 1)
    self.embed_out = self.embed(dec_in) # shape (batch_size,1,embed_dim)

    # then we need to concatenate the embedding output and the context vector
    # though their shapes are incompatible
    # embed out (batch_size, 1, embed_dim)
    # context vector (batch_size, n_units) where n_units was defined in the encoder
    # so we need to add one dimension along axis 1
    self.context_vector_expanded = tf.expand_dims(self.context_vector, axis=1)
    # shape (batch_size,1,n_units)
    self.concat = tf.keras.layers.concatenate([self.embed_out,
                                               self.context_vector_expanded])
    # shape (bacth_size,1, embed_dim + n_units)
    
    # now we get to apply the gru layer
    self.gru_out, self.gru_state = self.gru(self.concat) 
    # shapes (batch_size, 1, n_units) and (batch_size, n_units)

    # let's reshape the gru output before feeding it to the dense layer
    self.gru_out_reshape = tf.reshape(self.gru_out, shape=(-1,
                                                           self.gru_out.shape[2]))

    # now let's make a prediction
    self.pred_out = self.pred(self.gru_out_reshape) # shape (batch_size, 1, tar_vocab_size)

    return self.pred_out, self.gru_state, self.attention_weights

9. Create an instance of the class called...... `decoder` !

In [98]:
decoder = decoder_maker(target_vocab_size=vocab_out_size+1, embed_dim=n_embed, n_units=n_gru)

ValueError: This model has not yet been built. Build the model first by calling `build()` or by calling the model on a batch of data.

10. Try out the decoder on some teacher forcing data and the encoder outputs.

In [99]:
decoder_input = tf.expand_dims(tf.expand_dims(y_train[0][0], axis=0), axis=0) # the teacher forcing is
# the first element of the target sequence which corresponds to the <start> token
# we use expand dim to artificially add the batch size dimension

In [100]:
decoder(decoder_input,encoder_output,encoder_state)

(<tf.Tensor: shape=(1, 4576), dtype=float32, numpy=
 array([[0.00021811, 0.00021775, 0.0002189 , ..., 0.00021817, 0.00021816,
         0.00021865]], dtype=float32)>,
 <tf.Tensor: shape=(1, 256), dtype=float32, numpy=
 array([[-2.88544083e-03, -3.03292628e-02,  7.15822607e-05,
         -1.95521731e-02, -2.61017941e-02,  4.24746536e-02,
          5.47750574e-03,  2.46608350e-03, -3.41445976e-03,
          4.65776538e-03, -3.36298607e-02,  2.46483716e-03,
          2.89762742e-03,  3.58047560e-02, -9.09270905e-03,
          6.08691666e-03, -2.06586923e-02,  5.87580958e-03,
          1.05534857e-02, -1.12922722e-02,  2.30217315e-02,
          1.21390994e-03,  1.67560820e-02,  6.45752717e-03,
         -1.26004629e-02, -1.85700646e-03,  8.72948673e-03,
         -1.73054803e-02,  5.02653420e-03, -1.94900595e-02,
          1.59663074e-02, -7.44627044e-03, -3.24146799e-03,
         -1.46506950e-02,  5.02036139e-03, -5.25137177e-03,
         -1.44381355e-02,  5.82653238e-03,  2.10345294e-02,
   

### Loss

11. Look at the following loss function, what is the purpose of it, what will it change about the way the model learns?

In [101]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)

12. Set up a checkpoint for the optimizer, the encoder, and the decoder.

In [102]:

checkpoint_dir = './training_checkpoints2'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

## Training 

1. Define a `train_step` function that will take as arguments `inp` which represents a batch of input sequences, and `targ` which represents an input of target sequences.

This function will:
* Initiate `loss` to zero
* Track all operations with `tf.GradientTape() as tape`
* Use the encoder on `inp` to compute its outputs
* Set `dec_state` as the encoder state
* Set `dec_input` as the first sequence element of the target batch `targ` (careful with the shapes)
* Start a loop that will go through each subsequent elements of the target sequence, and will do:
  * Apply the decoder on the encoder outputs and `dec_input`, this will create the prediction's probability vector, and update the decoder state
  * Calculate  the loss based on the next element of `targ`, and the prediction probability vector and add it to `loss`
  * Set the new decoder input as the next element of `targ`
* Create `batch_loss` as equal to the average value of the loss over the target sequence.
* Create a `variables` object containing both the encoder's and the decoder's training variables.
* Compute the gradient and update the training variables.
* Return `batch_loss`


In [103]:
def train_step(inp, targ):#, enc_initial_state):
  loss = 0

  with tf.GradientTape() as tape: # we use the gradient tape to track all
  # the different operations happening in the network in order to be able
  # to compute the gradients later

    enc_output, enc_state = encoder(inp)#,enc_initial_state) # the input sequence is fed to the 
    # encoder to produce the encoder output and the encoder state

    dec_state = enc_state # the initial state used in the decoder is the encoder
    # state

    dec_input = tf.expand_dims(targ[:,0], axis=1) # the first decoder input
    # is the first sequence element of the target batch, which in our case
    # represents the <start> token for each sequence in the batch. This is
    # what we call the teacher forcing!

    # Everything is set up for the first step, now we need to loop over the
    # teacher forcing sequence to produce the predictions, we already have 
    # defined the first step (element 0) so we will loop from 1 to targ.shape[1]
    # which is the target sequence length
    for t in range(1, targ.shape[1]):
      # passing dec_input, dec_state and enc_output to the decoder
      # in order to produce the prediction, the new state, and the attention
      # weights which we will not need explicitely here
      pred, dec_state, _ = decoder(dec_input, enc_output, dec_state)

      loss += loss_function(targ[:, t], pred) # we compare the prediction
      # produced by teacher forcing with the next element of the target and
      # increment the loss

      # The new decoder input becomes the next element of the target sequence
      # which we just attempted to predict (teacher forcing)
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1])) # we divide the loss by the target
  # sequence's length to get the average loss across the sequence

  variables = encoder.trainable_variables + decoder.trainable_variables # here
  # we concatenate the lists of trainable variables for the encoder and the
  # decoder

  gradients = tape.gradient(loss, variables) # compute the gradient based on the
  # loss and the trainable variables

  optimizer.apply_gradients(zip(gradients, variables)) # then update the model's
  # parameters

  return batch_loss

2. Code the training loop.
It needs to loop across the number of epochs you wish to train for, use the train step, print out the train loss every now and then, and the val loss at the end of each epoch (optional)

In [104]:
EPOCHS = 100

for epoch in range(EPOCHS):
  start = time.time()

  total_loss = 0

  for (batch, (inp, targ)) in enumerate(train):
    batch_loss = train_step(inp, targ)#, initial_state)
    total_loss += batch_loss

    if batch % 10 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  
  # saving (checkpoint) the model every epoch
  checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss))
  print('Time taken for 1 epoch {} sec'.format(time.time() - start))

  enc_input = X_val
  #classic encoder input

  dec_input = tf.zeros(shape=(len(X_val),1))
  # the first decoder input is the special token 0

  enc_out, enc_state = encoder(enc_input)#, initial_state)
  # we compute once and for all the encoder output and the encoder
  # h state and c state

  dec_state = enc_state
  # The encoder h state and c state will serve as initial states for the
  # decoder

  pred = []  # we'll store the predictions in here

  # we loop over the expected length of the target, but actually the loop can run
  # for as many steps as we wish, which is the advantage of the encoder decoder
  # architecture
  for i in range(y_val.shape[1]-1):
    dec_out, dec_state, attention_w = decoder(dec_input, enc_out, dec_state)
    # the decoder state is updated and we get the first prediction probability 
    # vector
    decoded_out = tf.expand_dims(tf.argmax(dec_out, axis=-1), axis=1)
    # we decode the softmax vector into and index
    pred.append(tf.expand_dims(dec_out,axis=1)) # update the prediction list
    dec_input = decoded_out # the previous pred will be used as the new input

  pred = tf.concat(pred, axis=1).numpy()
  print("\n val loss :", loss_function(y_val[:,1:],pred),"\n")

Epoch 1 Batch 0 Loss 3.6464
Epoch 1 Batch 10 Loss 3.0781
Epoch 1 Batch 20 Loss 2.7352
Epoch 1 Batch 30 Loss 2.7237
Epoch 1 Batch 40 Loss 2.5665
Epoch 1 Batch 50 Loss 2.5956
Epoch 1 Batch 60 Loss 2.6283
Epoch 1 Batch 70 Loss 2.4825
Epoch 1 Batch 80 Loss 2.4743
Epoch 1 Batch 90 Loss 2.5013
Epoch 1 Batch 100 Loss 2.4846
Epoch 1 Batch 110 Loss 2.4850
Epoch 1 Batch 120 Loss 2.4214
Epoch 1 Batch 130 Loss 2.4074
Epoch 1 Batch 140 Loss 2.5011
Epoch 1 Batch 150 Loss 2.4017
Epoch 1 Batch 160 Loss 2.3676
Epoch 1 Batch 170 Loss 2.3941
Epoch 1 Loss 455.0478
Time taken for 1 epoch 78.25395655632019 sec

 val loss : tf.Tensor(3.2851744, shape=(), dtype=float32) 

Epoch 2 Batch 0 Loss 2.4032
Epoch 2 Batch 10 Loss 2.4467
Epoch 2 Batch 20 Loss 2.1991
Epoch 2 Batch 30 Loss 2.3230
Epoch 2 Batch 40 Loss 2.2807
Epoch 2 Batch 50 Loss 2.2440
Epoch 2 Batch 60 Loss 2.3523
Epoch 2 Batch 70 Loss 2.2965
Epoch 2 Batch 80 Loss 2.2317
Epoch 2 Batch 90 Loss 2.2456
Epoch 2 Batch 100 Loss 2.1492
Epoch 2 Batch 110 Loss 2

KeyboardInterrupt: 

3. What do you think of the training process, did it work well on the train set?  On the validation set?

In [None]:
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
encoder_latest=checkpoint.encoder
decoder_latest=checkpoint.decoder

4. Use `X_val` to compute all the predictions for the validation set and convert them  back to text. Compare them with the actual target values, what do you think? What about the results on the training set?

In [None]:
enc_input = X_val
#classic encoder input

dec_input = tf.zeros(shape=(len(X_val),1))
# the first decoder input is the special token 0

enc_out, enc_state = encoder_latest(enc_input)#, initial_state)
# we compute once and for all the encoder output and the encoder
# h state and c state

dec_state = enc_state
# The encoder h state and c state will serve as initial states for the
# decoder

pred = []  # we'll store the predictions in here

# we loop over the expected length of the target, but actually the loop can run
# for as many steps as we wish, which is the advantage of the encoder decoder
# architecture
for i in range(y_val.shape[1]-1):
  dec_out, dec_state, attention_w = decoder_latest(dec_input, enc_out, dec_state)
  # the decoder state is updated and we get the first prediction probability 
  # vector
  decoded_out = tf.expand_dims(tf.argmax(dec_out, axis=-1), axis=1)
  # we decode the softmax vector into and index
  pred.append(decoded_out) # update the prediction list
  dec_input = decoded_out # the previous pred will be used as the new input

pred = tf.concat(pred, axis=-1).numpy()

pred_text = tokenizer_en.sequences_to_texts(pred)
y_val_text = tokenizer_en.sequences_to_texts(y_val[:,1:])
for i in range(10):
  print("pred:", pred_text[i])
  print("true:", y_val_text[i])
  print("\n")

pred: my job my job
true: it's my cd


pred: i'm certain i'm certain
true: i'm baffled


pred: i guess please i
true: i wonder who


pred: you're stuck on cops
true: you're stuck


pred: they won they won
true: they won


pred: we are we done
true: we're done


pred: i'm not sad duty
true: i'm not shy


pred: tom is mad tom
true: tom's crazy


pred: you were shy you
true: you're thin


pred: i will did it
true: i did see it




5. Now that everything works well, it's time to increase our number of samples and start another training, did the results improve?