*SLOGAN:* **Unless you can teach it, you don't know it**.


---


This notebook is an adaptation of tensorflow's tutorial to train a sequence to sequence model. It has been adapted to translate sentences from german to english. So yes!!! we will be building our own google translator, although it can translate only one language to the next. It is still a step in the right direction.

---
**New addition?**
1. Encoder-Decoder model.
2. Attention mechanism.
3. Natural language processing.

In [1]:
pip install tensorflow_addons

Collecting tensorflow_addons
  Downloading tensorflow_addons-0.13.0-cp37-cp37m-manylinux2010_x86_64.whl (679 kB)
[?25l[K     |▌                               | 10 kB 15.9 MB/s eta 0:00:01[K     |█                               | 20 kB 18.0 MB/s eta 0:00:01[K     |█▌                              | 30 kB 20.9 MB/s eta 0:00:01[K     |██                              | 40 kB 15.5 MB/s eta 0:00:01[K     |██▍                             | 51 kB 5.1 MB/s eta 0:00:01[K     |███                             | 61 kB 4.7 MB/s eta 0:00:01[K     |███▍                            | 71 kB 5.3 MB/s eta 0:00:01[K     |███▉                            | 81 kB 5.4 MB/s eta 0:00:01[K     |████▍                           | 92 kB 5.4 MB/s eta 0:00:01[K     |████▉                           | 102 kB 5.4 MB/s eta 0:00:01[K     |█████▎                          | 112 kB 5.4 MB/s eta 0:00:01[K     |█████▉                          | 122 kB 5.4 MB/s eta 0:00:01[K     |██████▎                 

**Import required Libraries**

In [2]:
import tensorflow as tf
import tensorflow_addons as tfa
import os, re, time, io
import numpy as np
import unicodedata
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
from sklearn.model_selection import train_test_split

In this block of our notebook we create **a custom function** and **a custom class**:

1. The *make_path* function downloads our data from the Github repo for this course and creates a path to the data for us. 

---
Tensorflow comes with a nice function called get_file, which can download files for us (even in zip format). This function will download the file and store it in the dataset repository. *E.g ~/.keras/datasets/example.txt*




In [3]:
def make_path(url):
  path_to_file = tf.keras.utils.get_file('deu.txt', origin=url)
  return path_to_file

github_url = 'https://raw.githubusercontent.com/PelFritz/Machine-Learning-courses-IPK/master/Data/deu.txt'


2. The *CreateDataset* class processes every sentence in the file, separating every english sentence with its corresponding german translation. It has a call function which makes the class callable like a function and when we call this class, it will return **train** and **validation** datasets. It will also return **two language tokenizers** to us (one per langauge). 


---


A tokenizer simply takes a sentence and translates it into a nice numeric representaion, so we can feed our models numeric functions.

In [4]:
class CreateDataset:
  def __init__(self, problem_type='deu-eng'):
    self.problem_type= problem_type
    self.input_lang_tokenizer= None
    self.target_lang_tokenizer= None
  
  # this function returns the unicode normalized form of words in our text with the exception of nonspacing marks(Mn)
  def unicode_verify(self, s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
  

  # this function processes each sentence sentence in 3 steps
  def preprocess(self, w):
    w = self.unicode_verify(w.lower().strip())

    # space puntuations and words e.g "Is Arabidopsis a plant?" becomes "Is Arabidopsis a plant ?"
    # collapse all double spacing to single spacing
    w = re.sub(r"([?.!,])", r"\1 ", w)
    w = re.sub("\s{2,}", " ", w)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "öüäß")
    w = re.sub(r"[^a-zA-Z?.!,öüäß]+", " ", w)
    w = w.strip()

    # add <start> and <end> token
    w = '<start> ' + w + ' <end>'

    return w 
  
  def generate_processed_data(self, url, num_examples):
    lines = open(make_path(url), encoding='UTF-8').read().strip().split('\n')
    tot_examples = len(lines)

    # if you wish to use all the examples just use tot_examples inplace of num_examples in the code line below
    word_pairs = [[self.preprocess(w) for w in line.split('\t')][:-1] for line in lines[:num_examples]]
    
    return zip(*word_pairs)
  
  # This function creates the tokenizers
  def tokenize(self, lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<oov>')
    lang_tokenizer.fit_on_texts(lang)
    tensor = lang_tokenizer.texts_to_sequences(lang)

    # we pad all sequences to the length of the longest sentence for the sake of the encoder 
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

    return tensor, lang_tokenizer
  
  def load_dataset(self, url, num_examples):
    targ_lang, inp_lang = self.generate_processed_data(url, num_examples)

    inp_tensor, inp_tok = self.tokenize(inp_lang)
    targ_tensor, targ_tok = self.tokenize(targ_lang)

    return inp_tensor, targ_tensor, inp_tok, targ_tok
  
  def __call__(self, BUFFER_SIZE, url, num_examples, batch_size):
    inp_tensor, targ_tensor, self.input_lang_tokenizer, self.target_lang_tokenizer = self.load_dataset(url, num_examples)

    # split data into train and test sets using sklearn
    inp_train, inp_val, targ_train, targ_val = train_test_split(inp_tensor, targ_tensor, test_size=0.2)

    train_dataset = tf.data.Dataset.from_tensor_slices((inp_train, targ_train))
    train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(batch_size, drop_remainder=True)

    validation_dataset = tf.data.Dataset.from_tensor_slices((inp_val, targ_val))
    validation_dataset = validation_dataset.shuffle(BUFFER_SIZE).batch(batch_size, drop_remainder=True)

    return train_dataset, validation_dataset, self.input_lang_tokenizer, self.target_lang_tokenizer
  

Now we define some important parameters 

In [5]:
buffer_size = 32000
batch_size = 64
num_examples = 30000
dataset_creator = CreateDataset()
train_data, val_data, inp_tok, targ_tok = dataset_creator(buffer_size, github_url, num_examples, batch_size)

example_inp_batch, example_targ_batch = next(iter(train_data))
vocab_inp_size = len(inp_tok.word_index) + 1
vocab_targ_size = len(targ_tok.word_index) + 1
max_len_inp = example_inp_batch.shape[1]
max_len_targ = example_targ_batch.shape[1]
embedding_dim = 256
units = 1024

print(max_len_inp, max_len_targ)
print(vocab_inp_size)


Downloading data from https://raw.githubusercontent.com/PelFritz/Machine-Learning-courses-IPK/master/Data/deu.txt
12 8
10452


**Building the sequence to sequence Model**

---
We will begin building our sequence to sequence model by first building the encoder network. This network will contain an embedding layer and a single LSTM layer.


---
**LSTM** stands for long short term memory. This are the building blocks of recurrent neural networks. They are an upgrade of simple recurrent neural network cells because they have some kind of memory state and can recall longer steps into the past as compared to simple rnn cells.

**GRU** stands for gated recurrent unit. It is an alternative but simplified version of LSTM which also works quite well. 



In [6]:
# ENCODER

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_size, enc_units, batch_size):
    super(Encoder, self).__init__()
    self.batch_size = batch_size
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    # ---LSTM layer for encoder--- #
    self.lstm_layer = tf.keras.layers.LSTM(units=self.enc_units, return_state=True, return_sequences=True, recurrent_initializer='glorot_uniform')
  
  def call(self, x, hidden):
    x = self.embedding(x)
    output, h, c = self.lstm_layer(x, initial_state=hidden)
    return output, h, c
  
  def initialize_hidden_state(self):
    return [tf.zeros((self.batch_size, self.enc_units)), tf.zeros((self.batch_size, self.enc_units))]


encoder = Encoder(vocab_inp_size, embedding_dim, units, batch_size)

**Test what we have so far!!!!**

---


We test the enoder Stack to see if it works well with no surprises or wrong in 
or outputs.

In [7]:
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_h, sample_c = encoder(example_inp_batch, sample_hidden)
print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder h vector shape: (batch_size, units) {}'.format(sample_h.shape))
print('Encoder c vector shape: (batch_size, units) {}'.format(sample_c.shape))

Encoder output shape: (batch size, sequence length, units) (64, 12, 1024)
Encoder h vector shape: (batch_size, units) (64, 1024)
Encoder c vector shape: (batch_size, units) (64, 1024)


**Now we build the decoder model!!!**

---
The decoder network is the second part of our sequence to sequence model. This is where the magic happens. The decoder does the translation magic.

---




In [8]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_size, dec_units, batch_size, attention_type='luong'):
    super(Decoder, self).__init__()
    self.batch_size = batch_size
    self.dec_units = dec_units
    self.attention_type = attention_type

    #-----Embedding layer Decoder----#
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    # Final dense layer on which softmax will be applied
    self.fc = tf.keras.layers.Dense(vocab_size)

    # Fundamental cell in decoder recurrent structure
    self.decoder_rnn_cell = tf.keras.layers.LSTMCell(self.dec_units)

    # Attention mechanism
    self.attention_mechanism = self.build_attention_mechanism(self.dec_units, None, self.batch_size*[max_len_inp], self.attention_type )

    # sampler
    self.sampler = tfa.seq2seq.sampler.TrainingSampler()

    # Define decoder with respect to basic rnn cell, see official tensorflow site for more understnding
    self.rnn_cell = self.build_rnn_cell()
    self.decoder = tfa.seq2seq.BasicDecoder(self.rnn_cell, sampler=self.sampler, output_layer=self.fc)

  def build_attention_mechanism(self, dec_units, memory, memory_sequence_length, attention_type='luong'):
    # attention_type: type of attention to use, bahdanau or luong
    # dec_units: final dimension of attention layers
    # memory:   encoder hidden states of shape (batch size, max_length, enc_units)
    # memory_sequence_length: 1d array of shape (batch size) with every element set to max_length_input (for masking purpose)

    if attention_type == 'bahdanau':
      return tfa.seq2seq.BahdanauAttention(units=dec_units, memory=memory, memory_sequence_length=memory_sequence_length,)
    else:
      return tfa.seq2seq.LuongAttention(units=dec_units, memory=memory, memory_sequence_length=memory_sequence_length)
  
    

  def build_rnn_cell(self):
    rnn_cell = tfa.seq2seq.AttentionWrapper(self.decoder_rnn_cell, self.attention_mechanism, attention_layer_size=self.dec_units)
    return rnn_cell
  

  def build_initial_state(self, batch_size, encoder_state, Dtype):
    decoder_initial_state = self.rnn_cell.get_initial_state(batch_size=batch_size, dtype=Dtype)
    decoder_initial_state =decoder_initial_state.clone(cell_state=encoder_state)
    return decoder_initial_state

  def call(self, inputs, initial_state):
    x = self.embedding(inputs)
    output, _, _, = self.decoder(x, initial_state=initial_state, sequence_length=self.batch_size*[max_len_targ-1])
    return output 


decoder = Decoder(vocab_targ_size, embedding_dim, units, batch_size, attention_type='luong')

**Test what we have so far!!!**

---
We test our decoder stack just as we tested our encoder stack to ensure we have no surprises or bugs.


In [9]:
sample_x = tf.random.uniform((batch_size, max_len_targ))
decoder.attention_mechanism.setup_memory(sample_output) # the sample output is for the memory argument
initial_state = decoder.build_initial_state(batch_size, [sample_h, sample_c], tf.float32)

sample_decoder_outputs = decoder(sample_x, initial_state)
print('Decoder output shape: ', sample_decoder_outputs.rnn_output.shape)


Decoder output shape:  (64, 7, 6597)


**Loss Function and Optimizer**

---
Now we build a loss function and our optimizer. In this tutorial we selected Adam optimizer. You can try other optimizers from tensorflow and see if the work better.


In [10]:
optimizer = tf.keras.optimizers.Adam()

def loss_functiion(real, pred):
  cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
  loss = cross_entropy(y_true=real, y_pred=pred)
  mask = tf.logical_not(tf.math.equal(real, 0))
  mask = tf.cast(mask, dtype=loss.dtype)
  loss = mask*loss
  loss = tf.reduce_mean(loss)
  return loss

In [11]:
# Create checpoints to save model so we can call it after training to translate for us
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

**Build training function**


---
1. *@tf.function* : is a constructor that helps speed computation time of our *train_step* function

2. *GradientTape* : When we train neural networks we use backpropagation to update weights. Tensorflow needs to keep track of which operations happened during the forward pass of the network so it can back propagate the gradients. This is were GradientTape steps in. It records ("tape") the relevant operations operations for us.
3. *tape.gradient*: Once we have recorded the relevant operations, we use the GradientTape.gradient to calculate the gradient of our loss with respect to our model variables.


---
We do all of this detailed work because we are not using tensorflow built loops like the model.fit and model.call



In [12]:
@tf.function
def train_step(input, target, enc_hidden):
  loss=0

  with tf.GradientTape() as tape:
    enc_output, enc_h, enc_c = encoder(input, enc_hidden)

    dec_input = target[:, :-1] # ignore the <end> token
    real = target[:, 1:] # ignore the <start> token

    # set attention mechanism with encoder outputs
    decoder.attention_mechanism.setup_memory(enc_output)
    # create AttentionWrapperState as initial_state for decoder
    decoder_initial_state = decoder.build_initial_state(batch_size, [enc_h, enc_c], tf.float32)
    pred = decoder(dec_input, decoder_initial_state)
    logits = pred.rnn_output
    loss = loss_functiion(real, logits)
  
  variables = encoder.trainable_variables + decoder.trainable_variables
  gradients = tape.gradient(loss, variables) # calculate gradient of loss w.r.t variables
  optimizer.apply_gradients(zip(gradients, variables))

  return loss

**Train model inside a loop for 10 epochs**

---



In [13]:
Epochs = 10
steps_per_epoch = num_examples//batch_size

for epoch in range(Epochs):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(train_data.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    # print the loss after every 100 batch
    if batch % 100 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch+1,
                                                   batch,
                                                   batch_loss.numpy()))
    
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix=checkpoint_prefix)
  
  print('Epoch {} Loss {:.4f}'.format(epoch+1,
                                      total_loss/ steps_per_epoch))
  print('Time taken to complete epoch {}'.format(time.time() - start))
  
      

Epoch 1 Batch 0 Loss 5.6732
Epoch 1 Batch 100 Loss 2.5265
Epoch 1 Batch 200 Loss 2.3700
Epoch 1 Batch 300 Loss 2.2364
Epoch 1 Loss 2.0541
Time taken to complete epoch 1150.8677504062653
Epoch 2 Batch 0 Loss 1.8535
Epoch 2 Batch 100 Loss 1.9676
Epoch 2 Batch 200 Loss 1.7979
Epoch 2 Batch 300 Loss 1.5734
Epoch 2 Loss 1.3800
Time taken to complete epoch 1117.8610577583313
Epoch 3 Batch 0 Loss 1.1543
Epoch 3 Batch 100 Loss 1.2249
Epoch 3 Batch 200 Loss 1.1717
Epoch 3 Batch 300 Loss 1.1390
Epoch 3 Loss 0.9647
Time taken to complete epoch 1129.5855646133423
Epoch 4 Batch 0 Loss 0.8140
Epoch 4 Batch 100 Loss 0.8012
Epoch 4 Batch 200 Loss 0.8388
Epoch 4 Batch 300 Loss 0.7309
Epoch 4 Loss 0.6569
Time taken to complete epoch 1094.1770067214966
Epoch 5 Batch 0 Loss 0.5274
Epoch 5 Batch 100 Loss 0.5309
Epoch 5 Batch 200 Loss 0.5817
Epoch 5 Batch 300 Loss 0.5606
Epoch 5 Loss 0.4522
Time taken to complete epoch 1104.8975496292114
Epoch 6 Batch 0 Loss 0.4332
Epoch 6 Batch 100 Loss 0.3591
Epoch 6 Batc

**Now we evaluate our sentence**
---
Here we build an inference function that takes in an input sentence and sends out the predictions.
1. We used the *TrainingSampler* for training but for inference we will use the *GreedyEmbeddingsampler*.


In [14]:
def evaluate_sentence(sentence):
  sentence = dataset_creator.preprocess(sentence)

  inputs = [inp_tok.word_index[i] for i in sentence.split(' ')]
  inputs = tf.keras.preprocessing.sequence.pad_sequences(sequences=[inputs], maxlen=max_len_inp, padding='post')
  inputs= tf.convert_to_tensor(inputs)
  inference_batch_size = inputs.shape[0]

  enc_start_state = [tf.zeros((inference_batch_size, units)), tf.zeros((inference_batch_size, units))]
  enc_out, enc_h, enc_c = encoder(inputs, enc_start_state)

  start_tokens = tf.fill([inference_batch_size], targ_tok.word_index['<start>'])
  end_token = targ_tok.word_index['<end>']

  greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler()
  decoder_instance = tfa.seq2seq.BasicDecoder(cell=decoder.rnn_cell, sampler=greedy_sampler, output_layer=decoder.fc)
  # setup memory in decoder stack
  decoder.attention_mechanism.setup_memory(enc_out)

  # set decoder initial state
  decoder_initial_state = decoder.build_initial_state(inference_batch_size, [enc_h, enc_c], tf.float32)

  # Since the BasicDecoder wraps around the decoders rnn sell only, you have to ensure that the inputs to the Basic
  # decoder decoding step is output of the embedding layer. GreedySEmbeddingSampling takes care of this. 
  # We only get the weights of the embedding layer and pass them to the Basicdecoder

  decoder_embedding_matrix = decoder.embedding.variables[0]
  output, _, _ = decoder_instance(decoder_embedding_matrix, start_tokens=start_tokens, end_token=end_token,
                         initial_state=decoder_initial_state)
  
  return output.sample_id.numpy()


**Hurray you can now translate!!!**

---



In [34]:
def translator(sentence):
  result = evaluate_sentence(sentence)
  print(result)
  result = targ_tok.sequences_to_texts(result)
  print('Input: %s'%(sentence))
  print('Translation: {}'.format(result))

# Restore last checkpoint and translate
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
translator(u'wann kommst du?')
translator(u'wo bist du?')
translator(u'wo sind Sie')
translator(u'Ich habe Hunger')  # I love food, so one food's sake
translator(u'Du bist klug')

print('\n Some translations are wrong \n')
translator(u'Er ist der Mann')

[[274  23   6 838   3]]
Input: wann kommst du?
Translation: ['when are you coming? <end>']
[[68 23 74  3]]
Input: wo bist du?
Translation: ['where are you? <end>']
[[68 23 74  3]]
Input: wo sind Sie
Translation: ['where are you? <end>']
[[  4  13 240   3]]
Input: Ich habe Hunger
Translation: ['i m hungry. <end>']
[[  6  16 571   3]]
Input: Du bist klug
Translation: ['you re smart. <end>']

 Some translations are wrong 

[[ 21   8 106 669   3]]
Input: Er ist der Mann
Translation: ['that s his seat. <end>']
