# Translations with Encoder Decoder

With LSTMs and the encoder-decoder framework, we can build some pretty powerful things, such as *translators*! Let's see how to create a French > English translator with TensorFlow.

### Tips 

Don't use the whole dataset for your first experiments; take just 5000 or even 3000 sentences. This lets you iterate faster and avoids problems that come purely from limited computing power.

Let's get started!

## Import Libraries

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np 
import sklearn
import tensorflow_datasets as tfds
import tensorflow as tf 
tf.__version__

'2.8.0'

## Importing data 

1. Load the data from the following URL: https://go.aws/38ECHUB. You can read it using `pd.read_csv` with the `"\t"` delimiter and `header=None`.

In [2]:
# Load the tab-separated sentence pairs into a DataFrame
df = pd.read_csv("https://go.aws/38ECHUB", delimiter="\t", header=None)
df.head()

Unnamed: 0,0,1
0,Go.,Va !
1,Hi.,Salut !
2,Run!,Cours !
3,Run!,Courez !
4,Wow!,Ça alors !


2. Create an object `doc` containing the first 5000 rows from the file.

In [3]:
# Let's just take a sample of 5000 sentences to avoid slowness.
# .copy() gives doc its own data, so new columns can be added to it safely.
doc = df.iloc[:5000,:].copy()

In [4]:
len(doc)

5000

3. In your opinion, are we going to need to lemmatize and remove stop words for a translation problem?

No. Stop words and inflected word forms are important for understanding meaning, and a translation has to reproduce them.

4. Add the word `<start>` to the beginning of each target sentence in order to create a new column named `padded_en`

In [5]:
doc["padded_en"] = doc.iloc[:,0].apply(lambda x: "<start> "+x)
doc.head()


Unnamed: 0,0,1,padded_en
0,Go.,Va !,<start> Go.
1,Hi.,Salut !,<start> Hi.
2,Run!,Cours !,<start> Run!
3,Run!,Courez !,<start> Run!
4,Wow!,Ça alors !,<start> Wow!


5. Create two objects, `tokenizer_fr` and `tokenizer_en`, that will be instances of the `tf.keras.preprocessing.text.Tokenizer` class.

Be careful! Since we added a special token containing special characters, make sure you set up the tokenizers correctly so that this token is interpreted properly (use the `filters` argument, for example).

In [6]:
tokenizer_fr = tf.keras.preprocessing.text.Tokenizer()
# "<" and ">" are left out of filters so that "<start>" is kept as a single token
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
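
To see why the `filters` argument matters for `<start>`, here is a minimal, library-free sketch. `split_words` is a hypothetical stand-in for the tokenizer's word-splitting step (lowercase, turn every filtered character into a space, split on spaces); it is an approximation for illustration, not the Keras implementation itself.

```python
# Simplified stand-in for the splitting step of a Keras-style tokenizer.
# DEFAULT_FILTERS is the tokenizer's default filter string; CUSTOM_FILTERS
# is the same string without "<" and ">".
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
CUSTOM_FILTERS = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'


def split_words(text, filters):
    """Lowercase, replace every filtered character by a space, then split."""
    table = str.maketrans(filters, " " * len(filters))
    return text.lower().translate(table).split()


sentence = "<start> Go."
default_tokens = split_words(sentence, DEFAULT_FILTERS)
custom_tokens = split_words(sentence, CUSTOM_FILTERS)

print(default_tokens)  # the angle brackets are stripped
print(custom_tokens)   # "<start>" survives as one token
```

With the default filters the special token would collapse into the ordinary word `start`, which could collide with a real English word.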

6. Fit the tokenizers on the French and **padded** English sentences, respectively.

In [7]:
tokenizer_fr.fit_on_texts(doc.iloc[:,1])
tokenizer_en.fit_on_texts(doc["padded_en"])

7. Create three new columns in your DataFrame for the encoded French, English, and padded English sentences.

In [8]:
doc["fr_indices"] = tokenizer_fr.texts_to_sequences(doc.iloc[:,1])
doc["en_indices"] = tokenizer_en.texts_to_sequences(doc.iloc[:,0])
doc["padded_en_indices"] = tokenizer_en.texts_to_sequences(doc["padded_en"])

doc.head()

Unnamed: 0,0,1,padded_en,fr_indices,en_indices,padded_en_indices
0,Go.,Va !,<start> Go.,[36],[11],"[1, 11]"
1,Hi.,Salut !,<start> Hi.,[404],[616],"[1, 616]"
2,Run!,Cours !,<start> Run!,[1212],[111],"[1, 111]"
3,Run!,Courez !,<start> Run!,[1213],[111],"[1, 111]"
4,Wow!,Ça alors !,<start> Wow!,"[22, 1214]",[872],"[1, 872]"


8. We learned from the tutorial that the padded target sequences need to have the same length as the target sequences, so we remove the last element of each padded target sequence. The result is the target shifted one step to the right, which is what the decoder is fed during teacher forcing.

In [9]:
doc["padded_en_indices_clean"] = doc["padded_en_indices"].apply(lambda x: x[:-1])
doc.head()


Unnamed: 0,0,1,padded_en,fr_indices,en_indices,padded_en_indices,padded_en_indices_clean
0,Go.,Va !,<start> Go.,[36],[11],"[1, 11]",[1]
1,Hi.,Salut !,<start> Hi.,[404],[616],"[1, 616]",[1]
2,Run!,Cours !,<start> Run!,[1212],[111],"[1, 111]",[1]
3,Run!,Courez !,<start> Run!,[1213],[111],"[1, 111]",[1]
4,Wow!,Ça alors !,<start> Wow!,"[22, 1214]",[872],"[1, 872]",[1]
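
The shift used for teacher forcing can be sketched without any library. The helper `make_decoder_pair` and the token ids below are made up for illustration; the only thing taken from the notebook is that `<start>` has index 1 and that the last element of the padded sequence is dropped.

```python
START_ID = 1  # index of "<start>", as seen in the padded_en_indices column


def make_decoder_pair(target_ids, start_id=START_ID):
    """Return (decoder_input, target) for one sentence.

    decoder_input is the target shifted one step to the right:
    it starts with <start> and loses its last element.
    """
    padded = [start_id] + list(target_ids)
    decoder_input = padded[:-1]
    return decoder_input, list(target_ids)


decoder_input, target = make_decoder_pair([91, 47, 89])
for step, (fed, expected) in enumerate(zip(decoder_input, target)):
    print(f"step {step}: decoder is fed {fed}, should predict {expected}")
```

At every step the decoder sees the true previous token and is asked for the next one, which is why both sequences must have the same length.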


9. Sequences of variable length are rather difficult to work with. Use zero-padding to normalize the length of all the sequences in each category.

In [10]:
# Use of Keras to create token sequences of the same length
padded_fr_indices = tf.keras.preprocessing.sequence.pad_sequences(doc["fr_indices"], padding="post")
padded_en_indices = tf.keras.preprocessing.sequence.pad_sequences(doc["en_indices"], padding="post")
teacher_forcing_en = tf.keras.preprocessing.sequence.pad_sequences(doc["padded_en_indices_clean"], padding="post")
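
As a rough picture of what post-padding does, here is a small pure-Python sketch. `pad_post` is a hypothetical helper written for this illustration; it only mimics the `padding="post"` behaviour with the default target length (the longest sequence).

```python
def pad_post(sequences, value=0):
    """Append `value` to each sequence until all have the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


# Three encoded sentences of different lengths
padded = pad_post([[36], [22, 1214], [404]])
for row in padded:
    print(row)
```

Index 0 is never assigned to a word by the tokenizer, so it is free to serve as the padding value.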

10. What are the shapes of the arrays you just created for the French, padded English, and English sentences?

In [11]:
# Shape of the padded French array
padded_fr_indices.shape

(5000, 10)

In [12]:
padded_en_indices.shape

(5000, 4)

In [13]:
teacher_forcing_en.shape

(5000, 4)

11. Use the `train_test_split` function from `sklearn` to divide your sample into train and validation sets.

In [14]:
from sklearn.model_selection import train_test_split
en_train, en_val, fr_train, fr_val, teacher_train, teacher_val =  train_test_split(padded_en_indices,
                                                                                   padded_fr_indices,
                                                                                   teacher_forcing_en,
                                                                                   test_size=0.3)
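
Passing several arrays to one split call matters because each French sentence must stay on the same row as its English and teacher forcing counterparts. The sketch below shows the idea with a shared index permutation; `aligned_split` is a hypothetical helper, not the `sklearn` implementation, and the row labels are placeholders.

```python
import random


def aligned_split(arrays, test_size=0.3, seed=0):
    """Split several equally long lists with the same shuffled indices."""
    n = len(arrays[0])
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    result = []
    for arr in arrays:
        result.append([arr[i] for i in train_idx])
        result.append([arr[i] for i in val_idx])
    return result


# Placeholder rows: en_3 is the translation of fr_3, and so on
en = [f"en_{i}" for i in range(10)]
fr = [f"fr_{i}" for i in range(10)]

en_train, en_val, fr_train, fr_val = aligned_split([en, fr], test_size=0.3)
print(list(zip(en_val, fr_val)))  # pairs still match row by row
```

Splitting each array with a separate call and a different shuffle would break this pairing.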

## Model

Now it's time to code the model. Thankfully, you can largely base it on the code provided during the demo!

1. Create the following variables:
* `n_embed`: the number of dimensions you want for the embedding output spaces
* `n_lstm`: the number of units you want for the LSTM layers
* `fr_len`: the length of a French sentence
* `en_len`: the length of an English or teacher forcing sentence
* `vocab_size_fr`: the number of tokens in the French vocabulary
* `vocab_size_en`: the number of tokens in the English vocabulary (based on the padded sequences, so `<start>` is included!)

In [15]:
# let's start by defining the number of units needed for the embedding and
# the lstm layers

n_embed = 128
n_lstm = 64
fr_len = padded_fr_indices.shape[1]
en_len = padded_en_indices.shape[1]
vocab_size_fr = len(tokenizer_fr.word_index)
vocab_size_en = len(tokenizer_en.word_index)
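
The vocabulary sizes above count words only, while word indices start at 1 and index 0 is reserved for padding. The following sketch, with a hypothetical `build_word_index` helper that loosely imitates a frequency-ordered tokenizer, shows why an embedding needs `vocab_size + 1` rows.

```python
def build_word_index(sentences):
    """Assign indices starting at 1, most frequent word first."""
    counts = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            counts[word] = counts.get(word, 0) + 1
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}


word_index = build_word_index(["<start> go", "<start> hi", "<start> run"])
vocab_size = len(word_index)
largest_index = max(word_index.values())
embedding_rows = vocab_size + 1  # indices 0..vocab_size must all be valid

print(word_index)
print("largest index:", largest_index, "| rows needed:", embedding_rows)
```

Because `<start>` appears in every padded sentence, it is the most frequent token and receives index 1.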

2. Set up the encoder

This will work in the same way as the demo. Just make sure the input dimension of the embedding is equal to the number of words in the French vocabulary + 1 (for the zero-padding).

In [16]:
encoder_input = tf.keras.Input(shape=(fr_len,))
encoder_embed = tf.keras.layers.Embedding(input_dim=vocab_size_fr+1, output_dim=n_embed)
# return_state=True makes the LSTM return [output, state_h, state_c]
encoder_lstm = tf.keras.layers.LSTM(n_lstm, return_state=True)

encoder_embed_output = encoder_embed(encoder_input)
encoder_output = encoder_lstm(encoder_embed_output)

encoder = tf.keras.Model(inputs = encoder_input, outputs = encoder_output)

3. Try the encoder on the French train data (using the call method).

In [17]:
encoder(fr_train)

[<tf.Tensor: shape=(3500, 64), dtype=float32, numpy=
 array([[ 0.03127307, -0.0247299 , -0.01131879, ..., -0.02837454,
          0.0015578 , -0.01084446],
        [ 0.0109025 , -0.00775949, -0.00843466, ..., -0.02145365,
         -0.00233755, -0.00621878],
        [ 0.03008093, -0.02630412, -0.01079366, ..., -0.0263291 ,
          0.00063774, -0.01061366],
        ...,
        [ 0.02951849, -0.02076108, -0.01152032, ..., -0.02872757,
          0.00683473, -0.01130117],
        [ 0.03043274, -0.01922366, -0.01356811, ..., -0.03045454,
          0.00175662, -0.00974883],
        [ 0.02961827, -0.01894046, -0.01261697, ..., -0.03098983,
          0.00296335, -0.01341497]], dtype=float32)>,
 <tf.Tensor: shape=(3500, 64), dtype=float32, numpy=
 array([[ 0.03127307, -0.0247299 , -0.01131879, ..., -0.02837454,
          0.0015578 , -0.01084446],
        [ 0.0109025 , -0.00775949, -0.00843466, ..., -0.02145365,
         -0.00233755, -0.00621878],
        [ 0.03008093, -0.02630412, -0.01079366,

4. Set up the decoder

This will work in the same way as the demo. Just make sure the input dimension of the embedding is equal to the number of words in the English vocabulary + 1 (for the zero-padding). The same goes for the last Dense layer!

In [18]:
decoder_input = tf.keras.Input(shape=(en_len,))
decoder_embed = tf.keras.layers.Embedding(input_dim=vocab_size_en+1, 
                                          output_dim=n_embed)
decoder_lstm = tf.keras.layers.LSTM(n_lstm, return_sequences=True, return_state=True)
decoder_pred = tf.keras.layers.Dense(vocab_size_en+1, activation="softmax")

decoder_embed_output = decoder_embed(decoder_input) # teacher forcing happens here
# the decoder input is actually the padded target we created earlier, remember:
# if the target is:          [91, 47, 89, 21, 62]
# the decoder input will be: [1, 91, 47, 89, 21]  (1 is the index of <start>)
decoder_lstm_output, _, _ = decoder_lstm(decoder_embed_output, initial_state=encoder_output[1:])
# in the step described above the decoder receives the encoder state as its
# initial state.
decoder_output = decoder_pred(decoder_lstm_output)
# then the dense layer will convert the vector representation for each element
# in the sequence into a probability distribution across all possible tokens
# in the vocabulary!

decoder = tf.keras.Model(inputs = [encoder_input,decoder_input], outputs = decoder_output)
# all we need to do is put the model together using the input output framework!

5. Try the decoder on the French train data and the teacher forcing data.

In [19]:
decoder([fr_train,teacher_train])

<tf.Tensor: shape=(3500, 4, 1258), dtype=float32, numpy=
array([[[0.00079825, 0.00079616, 0.00079543, ..., 0.00078999,
         0.00079716, 0.00079411],
        [0.00079777, 0.00079625, 0.00079807, ..., 0.00079203,
         0.00079785, 0.00079019],
        [0.00079561, 0.00079708, 0.00079858, ..., 0.00079194,
         0.00079863, 0.00079213],
        [0.00079565, 0.00079878, 0.00079763, ..., 0.00079251,
         0.0007975 , 0.00079583]],

       [[0.00079641, 0.00079744, 0.00079473, ..., 0.00079119,
         0.0007973 , 0.00079444],
        [0.00079652, 0.00079821, 0.00079695, ..., 0.00079222,
         0.00079664, 0.00079214],
        [0.00079917, 0.00079716, 0.00079847, ..., 0.00079343,
         0.00079604, 0.00079391],
        [0.00079894, 0.00079612, 0.00080092, ..., 0.00079451,
         0.00079332, 0.00079531]],

       [[0.00079804, 0.00079618, 0.00079557, ..., 0.00078997,
         0.00079744, 0.00079443],
        [0.00079758, 0.00079625, 0.00079813, ..., 0.0007921 ,
         0.00
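
Every probability in the output above is close to 0.000795. That is what a softmax over 1258 classes gives when all logits are nearly equal, as one would expect from an untrained model. A short check of the arithmetic, using a hand-written `softmax` helper rather than the Keras layer:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


n_classes = 1258  # last dimension of the output above (vocab_size_en + 1)
probs = softmax([0.0] * n_classes)  # identical logits: a uniform distribution

print("probability per token:", probs[0])
print("1 / n_classes:        ", 1 / n_classes)
```

Training should move these values away from uniform, concentrating probability on the correct next token.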

6. Set up the inference decoder

The code here will be identical to the code from the demo, unless you changed some naming conventions!

In [20]:
decoder_state_input_h = tf.keras.Input(shape=(n_lstm,))
decoder_state_input_c = tf.keras.Input(shape=(n_lstm,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# at the first step of inference, these inputs will be the hidden state and
# the cell state (C) of the encoder model, respectively
# for the following steps, they will be the hidden and cell states of the
# decoder itself: since the decoder input sequence is unknown at inference
# time, we have to predict step by step using a loop

decoder_input_inf = tf.keras.Input(shape=(1,))
decoder_embed_output = decoder_embed(decoder_input_inf)
# the decoder input here is of shape 1 because we will feed the elements in the 
# sequence one by one

decoder_outputs, state_h, state_c = decoder_lstm(decoder_embed_output, initial_state=decoder_states_inputs)
# the lstm layer works in the same way, the output from the embedding is used
# and the decoder state is used as described above

decoder_states = [state_h, state_c]
# we store the lstm states in a specific object as we'll have to use them as 
# initial state for the next inference step

decoder_outputs = decoder_pred(decoder_outputs)
# the lstm output is then converted to a probability distribution over the
# target vocabulary

decoder_inf = tf.keras.Model(inputs = [decoder_input_inf, decoder_states_inputs], 
                     outputs = [decoder_outputs, decoder_states])
# Finally we wrap up the model building by setting up the inputs and outputs

7. Compile the decoder (the training version) using the appropriate loss and metric functions.

In [21]:
decoder.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)
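
The "sparse" variant of categorical cross-entropy takes integer token indices as targets instead of one-hot vectors, which matches the `en_train` arrays. As a rough, library-free illustration, the per-step loss is the negative log of the probability assigned to the correct index. The helper and the probabilities below are made up for the example.

```python
import math


def sparse_categorical_crossentropy(target_ids, probs_per_step):
    """Mean of -log(p[target]) over the steps of one sequence."""
    losses = [-math.log(p[t]) for t, p in zip(target_ids, probs_per_step)]
    return sum(losses) / len(losses)


# Two decoding steps over a toy vocabulary of 3 tokens (made-up numbers)
probs_per_step = [
    [0.1, 0.7, 0.2],  # step 0: the model favours token 1
    [0.8, 0.1, 0.1],  # step 1: the model favours token 0
]
target_ids = [1, 0]   # integer targets, no one-hot encoding needed

loss = sparse_categorical_crossentropy(target_ids, probs_per_step)
print("loss:", loss)
```

The more probability the model puts on the right token at each step, the closer the loss gets to zero.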

8. Train the decoder for 50 epochs; this should take about 10 minutes. Is there overfitting?

In [22]:
decoder.fit(x=[fr_train, teacher_train], y=en_train,epochs=50, validation_data=([fr_val, teacher_val], en_val))

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x14b8604c880>
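
One way to answer the overfitting question is to compare the training and validation losses stored in the `history` attribute of the returned `History` object (a dict with keys such as `"loss"` and `"val_loss"`). The sketch below assumes the result of `fit` was kept in a variable, which the cell above does not do, and it uses made-up numbers rather than results from this notebook. `overfitting_report` is a hypothetical helper.

```python
def overfitting_report(history_dict):
    """Summarise train/validation loss curves from a History-style dict."""
    val_loss = history_dict["val_loss"]
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best + 1,  # epochs are counted from 1
        "final_gap": val_loss[-1] - history_dict["loss"][-1],
        "val_loss_rose_after_best": val_loss[-1] > val_loss[best],
    }


# Made-up curves, for illustration only (NOT the output of this notebook)
toy_history = {
    "loss": [2.0, 1.2, 0.7, 0.4, 0.2],
    "val_loss": [2.1, 1.5, 1.3, 1.4, 1.6],
}

report = overfitting_report(toy_history)
print(report)
```

A training loss that keeps falling while the validation loss rises after its minimum is the usual sign of overfitting.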

9. Adapt the code from the demo to make some predictions on the validation data.

Be careful: in the demo, the starting index for the teacher forcing sequences was 0. Which index is the starting point of the teacher forcing sequences now?

Set up the first decoder input with the right dimension too!

In [23]:
enc_input = fr_val
#classic encoder input

dec_input = tf.ones(shape=(len(fr_val),1))
# the first decoder input is the special token <start>, whose index is 1

enc_out, state_h_inf, state_c_inf = encoder(enc_input)
# we compute once and for all the encoder output and the encoder
# h state and c state

dec_state = [state_h_inf, state_c_inf]
# The encoder h state and c state will serve as initial states for the
# decoder

pred = []  # we'll store the predictions in here

# we loop over the expected length of the target, but actually the loop can run
# for as many steps as we wish, which is the advantage of the encoder decoder
# architecture
for i in range(en_len):
  dec_out, dec_state = decoder_inf([dec_input, dec_state])
  # the decoder state is updated and we get the first prediction probability 
  # vector
  decoded_out = tf.argmax(dec_out, axis=-1)
  # we decode the softmax vector into an index
  pred.append(decoded_out) # update the prediction list
  dec_input = decoded_out # the previous pred will be used as the new input

pred = tf.concat(pred, axis=-1).numpy()
for i in range(10):
  print("pred:", pred[i,:])
  print("true:", en_val[i,:])
  print("\n")

pred: [  9  44   4 410]
true: [  9 834   0   0]


pred: [ 94 388  56  32]
true: [ 94 388   0   0]


pred: [17 23  4 68]
true: [17 23  4  0]


pred: [241  53  75 118]
true: [241  53   0   0]


pred: [ 16   7 343  82]
true: [ 16 576   0   0]


pred: [ 3 46 53 53]
true: [  3 248   0   0]


pred: [ 2 76 36 11]
true: [  2 785   4   0]


pred: [70 98 45 12]
true: [143  92 283   0]


pred: [  8 149 860  39]
true: [  8 864  12   0]


pred: [  2 116 262 118]
true: [  2 710   0   0]
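
The inference loop above is a greedy decoder: at each step it keeps only the most probable token and feeds it back as the next input. Stripped of TensorFlow, the control flow looks roughly like this. `greedy_decode`, `toy_step` and the probability table are hypothetical stand-ins for `decoder_inf` and a trained model.

```python
def greedy_decode(step_fn, start_id, init_state, max_len):
    """Feed the argmax of each step back in as the next input."""
    token, state, decoded = start_id, init_state, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        decoded.append(token)
    return decoded


# Toy "model": next-token probabilities depend only on the current token
TOY_TABLE = {
    0: [1.0, 0.0, 0.0, 0.0],
    1: [0.0, 0.1, 0.7, 0.2],
    2: [0.1, 0.0, 0.1, 0.8],
    3: [0.9, 0.0, 0.05, 0.05],
}


def toy_step(token, state):
    """Return (probabilities, new_state); the state just counts steps."""
    return TOY_TABLE[token], state + 1


decoded = greedy_decode(toy_step, start_id=1, init_state=0, max_len=4)
print(decoded)
```

Because each prediction depends on the previous one, a single early mistake can send the rest of the sequence off course.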




10. Use the tokenizer to convert the target and predicted sequences back to text. What do you think of the translations?

In [24]:
y_sample = tokenizer_en.sequences_to_texts(en_val)[:10]
pred_sample = tokenizer_en.sequences_to_texts(pred)[:10]

for i, j in zip(y_sample,pred_sample):
  print("true:", i)
  print("pred", j)
  print("\n")

true: we succeeded
pred we need it fit


true: nice timing
pred nice timing on here


true: don't do it
pred don't do it out


true: anyone home
pred anyone home now drink


true: it's broken
pred it's a book hurt


true: i'm punctual
pred i'm not home home


true: i rewrote it
pred i saw him go


true: grab my hand
pred let's talk to me


true: he pinched me
pred he has braces in


true: i surrender
pred i hate dogs drink
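
Converting indices back to text amounts to a reverse lookup in the tokenizer's word index, in which indices with no associated word (such as the padding value 0) are skipped. A minimal sketch with a hypothetical three-word vocabulary:

```python
def sequences_to_texts(sequences, word_index):
    """Map index sequences back to strings, skipping unknown indices."""
    index_word = {index: word for word, index in word_index.items()}
    return [
        " ".join(index_word[i] for i in seq if i in index_word)
        for seq in sequences
    ]


toy_word_index = {"<start>": 1, "go": 2, "home": 3}  # made-up vocabulary
texts = sequences_to_texts([[2, 3, 0, 0], [1, 2, 0, 0]], toy_word_index)
for text in texts:
    print(text)
```

This is why the zero-padded `true` sequences print as short sentences without any visible padding.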




11. Now that you have reached the end of the exercise, go back to the beginning and increase the number of sentences your model trains on. This should significantly improve the quality of the predictions!