# Translations 2.0 - Neural Machine Translation

According to the Google paper [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762), attention layers are enough for a deep learning model to capture the complexity of a sentence. We will build an attention-based model for our translator, here a recurrent encoder-decoder with Bahdanau attention.

## Project description 

 

Our data can be found on this link: https://go.aws/38ECHUB

### Preprocessing 

The whole purpose of the preprocessing is to express each (French) input sentence as a sequence of indices.

For example:

* je suis malade ---> `[123, 21, 34, 0, 0, 0, 0, 0]`

This gives a *shape* of `(batch_size, max_len_of_a_sentence)`.

Each index is a number that we assign to a word token.

The zeros come from [*padded_sequences*](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences): padding brings all word sequences to the same length, which the model requires.
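As a rough illustration of the two steps described above (indexing, then padding), here is a minimal pure-Python sketch. The toy vocabulary `toy_vocab` and the helpers `encode` and `pad_post` are hypothetical names, and the index values are made up. In the notebook itself, the Keras `Tokenizer` and `pad_sequences` play these roles.

```python
# Hypothetical toy vocabulary; index 0 is kept free for padding.
toy_vocab = {"je": 123, "suis": 21, "malade": 34}

def encode(sentence, vocab):
    """Map each whitespace-separated word to its integer index."""
    return [vocab[word] for word in sentence.lower().split()]

def pad_post(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [encode("je suis malade", toy_vocab), encode("je suis", toy_vocab)]
padded = pad_post(batch)
print(padded)  # [[123, 21, 34], [123, 21, 0]]
```

Every row of `padded` now has the same length, so the batch can be stacked into a single `(batch_size, max_len_of_a_sentence)` array.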

This time, we won't have to one-hot encode the target variable. We can simply create a vector of indices, just like for the input sentence.

For example:

* I am sick ---> `[43, 2, 42, 0, 0]`

Note, however, that the preprocessing needs one extra step: for each sentence, we must add a `<start>` and an `<end>` token to mark the beginning and end of the sentence. We can do this via `spaCy` or with plain string concatenation.

We will use:

* `pandas` or `NumPy` for reading the text file
* `spaCy` for tokenization
* `TensorFlow` for [padded_sequence](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)

### Modeling 

For modeling, we will need to set up layers of attention. We will need to: 

* Create an `Encoder` class that inherits from `tf.keras.Model`.
* Create a Bahdanau attention layer as a class that inherits from `tf.keras.layers.Layer`.
* Finally create a `Decoder` class that inherits from `tf.keras.Model`.


We will need to create our own cost function as well as our own training loop.


### Tips 

We will not use the whole dataset for our first experiments; we just take 5,000 or even 3,000 sentences. This lets us iterate faster and avoid problems that are simply due to limited computing power.

Also, we acknowledge the inspiration from the TensorFlow tutorial [Neural Machine Translation with Attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention).




In [1]:
!pip install --upgrade tensorflow 

Requirement already up-to-date: tensorflow in /usr/local/lib/python3.6/dist-packages (2.4.0)


In [2]:
# Import the necessary libraries
import pandas as pd
import numpy as np 
import tensorflow_datasets as tfds
import tensorflow as tf 
tf.__version__

'2.4.0'

## Import data

In [3]:
# Loading function for the tab-separated txt document
def load_doc(url):
  # Read from the URL passed in instead of a hard-coded address
  df = pd.read_csv(url, delimiter="\t", header=None)
  return df

In [4]:
# Loading txt document
doc = load_doc("https://go.aws/38ECHUB")
doc.head()

| | 0 | 1 |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Hi. | Salut ! |
| 2 | Run! | Cours ! |
| 3 | Run! | Courez ! |
| 4 | Wow! | Ça alors ! |


In [5]:
len(doc)

160538

In [6]:
# Let's just take a sample of 50,000 sentences to avoid slowness.
doc = doc.sample(50000)

In [7]:
# Add a <start> and <end> token 
def begin_end_sentence(sentence):
  sentence = "<start> "+ sentence + " <end>"
  return sentence

In [8]:
# Add <start> and <end> token
doc.iloc[:, 0] = doc.iloc[:, 0].apply(lambda x: begin_end_sentence(x))
doc.iloc[:, 1] = doc.iloc[:, 1].apply(lambda x: begin_end_sentence(x))

In [9]:
doc

| | 0 | 1 |
|---|---|---|
| 18344 | <start> I want to be safe. <end> | <start> Je veux être en sécurité. <end> |
| 48108 | <start> Where are your manners? <end> | <start> Qu'avez-vous fait de vos bonnes manièr... |
| 24273 | <start> The glass is empty. <end> | <start> Le verre est vide. <end> |
| 104113 | <start> I agree with what you've written. <end> | <start> Je suis d'accord avec ce que tu as écr... |
| 50387 | <start> I don't miss you at all. <end> | <start> Vous ne me manquez pas du tout. <end> |
| ... | ... | ... |
| 92427 | <start> Did the storm cause any damage? <end> | <start> Est-ce que la tempête a causé des dégâ... |
| 115141 | <start> She is very thoughtful and patient. <end> | <start> Elle est vraiment attentive et patient... |
| 142595 | <start> Selfie sticks are not allowed in this ... | <start> Les perches à selfie ne sont pas autor... |
| 94975 | <start> I'm training for the triathlon. <end> | <start> Je m'entraîne pour le triathlon. <end> |


In [10]:
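# '<' and '>' are left out of the filters (unlike the default) so that
# the <start> and <end> tokens survive tokenization
tokenizer_fr = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')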

In [11]:
tokenizer_en.fit_on_texts(doc.iloc[:,0])
tokenizer_fr.fit_on_texts(doc.iloc[:,1])

In [12]:
tokenizer_en.word_index

{'<start>': 1,
 '<end>': 2,
 'i': 3,
 'you': 4,
 'to': 5,
 'the': 6,
 'a': 7,
 'is': 8,
 'tom': 9,
 'he': 10,
 'that': 11,
 'it': 12,
 'of': 13,
 'do': 14,
 'this': 15,
 'me': 16,
 'in': 17,
 'have': 18,
 "don't": 19,
 'was': 20,
 'my': 21,
 'are': 22,
 'for': 23,
 'your': 24,
 'what': 25,
 "i'm": 26,
 'we': 27,
 'be': 28,
 'she': 29,
 'not': 30,
 'want': 31,
 'on': 32,
 'with': 33,
 'like': 34,
 'know': 35,
 'can': 36,
 'his': 37,
 'at': 38,
 'all': 39,
 "you're": 40,
 'how': 41,
 'did': 42,
 'him': 43,
 'they': 44,
 'think': 45,
 'go': 46,
 'and': 47,
 "it's": 48,
 "can't": 49,
 'very': 50,
 'time': 51,
 'about': 52,
 'here': 53,
 'will': 54,
 'get': 55,
 'there': 56,
 'her': 57,
 "didn't": 58,
 'had': 59,
 'as': 60,
 'were': 61,
 'if': 62,
 'no': 63,
 'one': 64,
 'why': 65,
 'just': 66,
 'up': 67,
 'has': 68,
 'out': 69,
 'going': 70,
 'would': 71,
 'good': 72,
 'come': 73,
 'so': 74,
 'tell': 75,
 'an': 76,
 'when': 77,
 'need': 78,
 "i'll": 79,
 'by': 80,
 'from': 81,
 'see': 82,
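
The Keras `Tokenizer` ranks words by frequency: the most frequent word gets index 1, and index 0 is never assigned, which leaves it free for padding. Since every sentence carries `<start>` and `<end>`, these two tokens come first. The pure-Python sketch below imitates that ranking on three made-up sentences. `build_word_index` is a hypothetical helper, and how the real tokenizer breaks ties between equally frequent words may differ.

```python
from collections import Counter

def build_word_index(sentences):
    """Assign indices 1, 2, 3, ... by descending word frequency (0 stays free for padding)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1)}

sentences = [
    "<start> i am sick <end>",
    "<start> i am here <end>",
    "<start> you are here <end>",
]
word_index = build_word_index(sentences)
print(word_index)
```

Words that appear more often, such as `i`, end up with smaller indices than rare ones such as `sick`.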


In [13]:
doc["fr_indices"] = tokenizer_fr.texts_to_sequences(doc.iloc[:,1])
doc["en_indices"] = tokenizer_en.texts_to_sequences(doc.iloc[:,0])

In [14]:
doc.head()

| | 0 | 1 | fr_indices | en_indices |
|---|---|---|---|---|
| 18344 | <start> I want to be safe. <end> | <start> Je veux être en sécurité. <end> | [1, 3, 37, 50, 20, 616, 2] | [1, 3, 31, 5, 28, 555, 2] |
| 48108 | <start> Where are your manners? <end> | <start> Qu'avez-vous fait de vos bonnes manièr... | [1, 872, 6, 42, 4, 200, 1098, 2021, 2] | [1, 94, 22, 24, 1889, 2] |
| 24273 | <start> The glass is empty. <end> | <start> Le verre est vide. <end> | [1, 11, 432, 15, 1125, 2] | [1, 6, 695, 8, 1003, 2] |
| 104113 | <start> I agree with what you've written. <end> | <start> Je suis d'accord avec ce que tu as écr... | [1, 3, 25, 373, 39, 14, 7, 13, 67, 427, 2] | [1, 3, 477, 33, 25, 208, 712, 2] |
| 50387 | <start> I don't miss you at all. <end> | <start> Vous ne me manquez pas du tout. <end> | [1, 6, 9, 24, 2709, 5, 40, 34, 2] | [1, 3, 19, 507, 4, 38, 39, 2] |


In [15]:
# Use of Keras to create token sequences of the same length
padded_fr_indices = tf.keras.preprocessing.sequence.pad_sequences(doc["fr_indices"], padding="post")
padded_en_indices = tf.keras.preprocessing.sequence.pad_sequences(doc["en_indices"], padding="post")

In [16]:
# Creation of tf.data.Dataset for each of the French and English tensors
fr_ds = tf.data.Dataset.from_tensor_slices(padded_fr_indices)
en_ds = tf.data.Dataset.from_tensor_slices(padded_en_indices)

In [17]:
# Zip both into a single TensorFlow dataset of (French, English) pairs
tf_ds = tf.data.Dataset.zip((fr_ds, en_ds))

In [18]:
# Creation of variables that we will reuse for our models
BATCH_SIZE = 64
TAKE_SIZE = int(0.7*len(doc)/BATCH_SIZE)
BUFFER_SIZE = TAKE_SIZE * BATCH_SIZE
steps_per_epoch = TAKE_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(tokenizer_fr.word_index)
vocab_tar_size = len(tokenizer_en.word_index)

In [19]:
# Batch (shuffling is applied after the train/test split)
tf_ds = tf_ds.batch(BATCH_SIZE, drop_remainder=True)

In [20]:
# Train Test Split
train_data = tf_ds.take(TAKE_SIZE).shuffle(TAKE_SIZE)
test_data = tf_ds.skip(TAKE_SIZE).shuffle(BUFFER_SIZE-TAKE_SIZE)
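
To make the split concrete, the arithmetic below reproduces the batch counts, assuming the 50,000-sentence sample and `BATCH_SIZE = 64` used above. The variable names are illustrative only.

```python
n_sentences = 50_000   # size of the sample drawn above
batch_size = 64

# batch(..., drop_remainder=True) discards the last incomplete batch
total_batches = n_sentences // batch_size            # 781
take_size = int(0.7 * n_sentences / batch_size)      # 546 batches for training
test_batches = total_batches - take_size             # 235 batches for testing

print(total_batches, take_size, test_batches)
print(round(take_size / total_batches, 3))           # share of batches used for training
```

Because `take`, `skip` and `shuffle` are applied after `batch`, they operate on whole batches: the shuffle reorders batches, not the sentences inside them.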

In [21]:
input_text, output_text = next(iter(train_data))
print(input_text.numpy().shape)
print(output_text.numpy().shape)

(64, 49)
(64, 39)


In [22]:
vocab_inp_size

17747

In [23]:
# Encoder
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

In [24]:
# +1 because index 0 is reserved for padding and is not in word_index
encoder = Encoder(vocab_inp_size + 1, embedding_dim, units, BATCH_SIZE)

# Sample output
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(input_text, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

Encoder output shape: (batch size, sequence length, units) (64, 49, 1024)
Encoder Hidden state shape: (batch size, units) (64, 1024)


In [25]:
class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # This is done to calculate our "attention" score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # We get 1 on the last axis because we apply the score to self.V
    # The shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

In [26]:
attention_layer = BahdanauAttention(100)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

Attention result shape: (batch size, units) (64, 1024)
Attention weights shape: (batch_size, sequence_length, 1) (64, 49, 1)
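
The shapes printed above can be traced by hand. The following NumPy sketch redoes the additive (Bahdanau) score, the softmax over the time axis and the weighted sum, with small made-up sizes. The random matrices `W1`, `W2` and `V` only stand in for the `Dense` layers, and biases are omitted, so this is an illustration of the arithmetic rather than a copy of the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, att_units = 2, 5, 8, 4  # small made-up sizes

values = rng.normal(size=(batch, seq_len, hidden))  # encoder outputs
query = rng.normal(size=(batch, hidden))            # decoder hidden state

# Random matrices standing in for the Dense layers W1, W2 and V (no biases)
W1 = rng.normal(size=(hidden, att_units))
W2 = rng.normal(size=(hidden, att_units))
V = rng.normal(size=(att_units, 1))

# Add a time axis so the query broadcasts over every position: (batch, 1, hidden)
query_with_time_axis = query[:, None, :]

# score shape: (batch, seq_len, 1)
score = np.tanh(values @ W1 + query_with_time_axis @ W2) @ V

# Softmax over the time axis so the weights of each sentence sum to 1
exp_score = np.exp(score - score.max(axis=1, keepdims=True))
attention_weights = exp_score / exp_score.sum(axis=1, keepdims=True)

# Weighted sum of the encoder outputs: (batch, hidden)
context_vector = (attention_weights * values).sum(axis=1)

print(attention_weights.shape, context_vector.shape)  # (2, 5, 1) (2, 8)
```

The context vector has the size of the encoder hidden state, not the size of the attention layer, which is why `BahdanauAttention(100)` above still returns a `(64, 1024)` result.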


In [27]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # Used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenate == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # Passing from the concatenated vector to the GRU layer
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

In [28]:
decoder = Decoder(vocab_tar_size + 1, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))

Decoder output shape: (batch_size, vocab size) (64, 9493)


## Loss

In [29]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)
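
To see what the mask does, here is a small NumPy sketch on made-up numbers for a single time step. It also shows that averaging over the whole batch, as `loss_function` does, differs from averaging over the non-padded tokens only. The second normalisation is a common alternative, not what this notebook uses.

```python
import numpy as np

# Made-up per-token cross-entropy values for one time step (batch of 4)
per_token_loss = np.array([2.0, 1.0, 3.0, 4.0])
# Target token ids at that time step; 0 marks padding
real = np.array([17, 5, 0, 0])

mask = (real != 0).astype(per_token_loss.dtype)   # [1., 1., 0., 0.]
masked_loss = per_token_loss * mask               # padded positions contribute 0

mean_over_batch = masked_loss.mean()              # what loss_function above does
mean_over_real_tokens = masked_loss.sum() / mask.sum()  # alternative normalisation

print(mean_over_batch, mean_over_real_tokens)     # 0.75 1.5
```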

In [30]:
import os
checkpoint_dir = './training_checkpoints2'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

## Training

In [31]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims(targ[:,0], 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss
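
Teacher forcing in `train_step` amounts to pairing each target token with the token that precedes it. The pure-Python sketch below lists those pairs for one made-up, already padded target sentence (1 = `<start>`, 2 = `<end>`, 0 = padding). No model is involved; it only shows which token is fed in and which one the prediction is scored against.

```python
# One made-up target sentence, already indexed and padded
targ = [1, 3, 31, 5, 2, 0, 0]

# At step t the decoder receives targ[t - 1] and is scored against targ[t]
pairs = [(targ[t - 1], targ[t]) for t in range(1, len(targ))]

for dec_input, expected in pairs:
    note = "ignored by the loss mask" if expected == 0 else "counts towards the loss"
    print(f"input={dec_input:>2}  target={expected:>2}  ({note})")
```

The decoder input never depends on what the model predicted at the previous step, which is what distinguishes training from the decoding loop used at inference time.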

In [32]:
import time
EPOCHS = 10
steps_per_epoch = TAKE_SIZE

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(train_data.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    if batch % 10 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  
  # saving (checkpoint) the model every epoch
  checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 1.6768
Epoch 1 Batch 10 Loss 1.0945
Epoch 1 Batch 20 Loss 1.1084
Epoch 1 Batch 30 Loss 1.1516
Epoch 1 Batch 40 Loss 1.0215
Epoch 1 Batch 50 Loss 1.1138
Epoch 1 Batch 60 Loss 1.0599
Epoch 1 Batch 70 Loss 1.1098
Epoch 1 Batch 80 Loss 1.0791
Epoch 1 Batch 90 Loss 1.0371
Epoch 1 Batch 100 Loss 1.0077
Epoch 1 Batch 110 Loss 0.9639
Epoch 1 Batch 120 Loss 1.0445
Epoch 1 Batch 130 Loss 1.0129
Epoch 1 Batch 140 Loss 1.0308
Epoch 1 Batch 150 Loss 0.9066
Epoch 1 Batch 160 Loss 1.0090
Epoch 1 Batch 170 Loss 0.9995
Epoch 1 Batch 180 Loss 0.9122
Epoch 1 Batch 190 Loss 0.9500
Epoch 1 Batch 200 Loss 0.9307
Epoch 1 Batch 210 Loss 0.8807
Epoch 1 Batch 220 Loss 0.9226
Epoch 1 Batch 230 Loss 0.9380
Epoch 1 Batch 240 Loss 0.7882
Epoch 1 Batch 250 Loss 0.9326
Epoch 1 Batch 260 Loss 0.8901
Epoch 1 Batch 270 Loss 0.9136
Epoch 1 Batch 280 Loss 0.8741
Epoch 1 Batch 290 Loss 0.8974
Epoch 1 Batch 300 Loss 0.8045
Epoch 1 Batch 310 Loss 0.9830
Epoch 1 Batch 320 Loss 0.8048
Epoch 1 Batch 330 Los

In [33]:
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
encoder_latest=checkpoint.encoder
decoder_latest=checkpoint.decoder

In [38]:
for inp, targ in test_data.take(1):
  print("input sentence : {}".format(tokenizer_fr.sequences_to_texts(inp.numpy())[0]))
  print("target sentence : {}".format(tokenizer_en.sequences_to_texts(targ.numpy())[0]))
  enc_hidden = encoder_latest.initialize_hidden_state()
  enc_output, enc_hidden = encoder_latest(inp, enc_hidden)

  # tensor containing the first token of target :  <start>
  result = tf.expand_dims(tokenizer_en.sequences_to_texts([[index] for index in targ[:,0].numpy()]),1)

  dec_hidden = enc_hidden

  dec_input = tf.expand_dims(targ[:,0], 1)

  # Greedy decoding - feeding the model's own prediction back as the next input
  for t in range(1, targ.shape[1]):
    # passing enc_output to the decoder
    predictions, dec_hidden, _ = decoder_latest(dec_input, dec_hidden, enc_output)


    # get text predictions
    pred_index = tf.argmax(predictions, axis = 1).numpy()
    corresponding_word = tf.expand_dims(tokenizer_en.sequences_to_texts([[index] for index in pred_index]),1)
    result = tf.concat((result,corresponding_word), axis=1)

    # no teacher forcing here: the predicted token becomes the next decoder input
    dec_input = tf.expand_dims(pred_index,1)

  result = [" ".join([word.decode("utf-8") for word in sentence]) for sentence in result.numpy()]
  print(result[0])

input sentence : <start> avez vous réglé la note <end>
target sentence : <start> did you settle the bill <end>
<start> did you open the instructions <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end>
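
The translation keeps emitting `<end>` because the decoding loop runs for the full padded target length instead of stopping at the first `<end>`. One possible post-processing step, sketched here with a hypothetical helper `trim_at_end`, is to cut the decoded string at the first `<end>` token and drop the leading `<start>`.

```python
def trim_at_end(sentence, end_token="<end>", start_token="<start>"):
    """Keep the words between the start token and the first end token."""
    words = sentence.split()
    if end_token in words:
        words = words[:words.index(end_token)]
    if words and words[0] == start_token:
        words = words[1:]
    return " ".join(words)

raw = "<start> did you open the instructions <end> <end> <end>"
print(trim_at_end(raw))  # did you open the instructions
```

An alternative would be to break out of the decoding loop as soon as `<end>` is predicted, at the cost of handling each sentence of the batch separately.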
