# Translator

Hello everyone, 

Welcome to this little notebook which will help you build your own translator!
Follow the instructions provided in the README as we go!


In our example, we will translate from English to German, but feel free to pick any language pair you want.
![Flags overview image](Images/germany_uk_flags.png)

To keep things simple, I will call the sentences in the language we translate from the 'English sentences', and the sentences in the target language the 'German sentences'.


Let's first import all the libraries we will need:

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import unicodedata
import re
import random
from sklearn.model_selection import train_test_split
import numpy as np
from tensorflow.keras.layers import Layer, Softmax
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential, load_model 
from tensorflow.keras.layers import Flatten, Dense, Conv2D, MaxPooling2D, BatchNormalization, Dropout, Input, Embedding, LSTM
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
import matplotlib.pyplot as plt
import json

# 1: Load the dataset

In [None]:
# Load the dataset

NUM_EXAMPLES = 1000     #200000                # HERE, YOU MIGHT WANT TO REDUCE THE NUMBER OF EXAMPLES IF TRAINING THE MODEL TAKES TOO LONG.
data_examples = []
with open('deu.txt', 'r', encoding='utf8') as f:       # HERE, ENTER THE NAME OF YOUR TEXT FILE.
    for line in f.readlines():
        if len(data_examples) < NUM_EXAMPLES:
            data_examples.append(line)
        else:
            break

In [None]:
# These functions will help simplify special characters, feel free to add more.

def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

def preprocess_sentence(sentence):
    sentence = sentence.lower().strip()
    sentence = re.sub(r"ü", 'ue', sentence)
    sentence = re.sub(r"ä", 'ae', sentence)
    sentence = re.sub(r"ö", 'oe', sentence)
    sentence = re.sub(r'ß', 'ss', sentence)
    
    sentence = unicode_to_ascii(sentence)
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r"[^a-z?.!,']+", " ", sentence)
    sentence = re.sub(r" +", " ", sentence)   # collapse repeated spaces
    
    return sentence.strip()

# 2: Preprocess the data

Here we will:
* Create separate lists of English and German sentences, and preprocess them using the `preprocess_sentence` function created above.
* Add a special `"<start>"` and `"<end>"` token to the beginning and end of every German sentence.
* Tokenize the German sentences, ensuring that no character filters are applied.
* Pad the end of the tokenized German sequences with zeros, and batch the complete set of sequences into one numpy array.
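
To make the last three steps concrete, here is a minimal plain-Python sketch on two made-up sentences. It is only an illustration: the helper names (`add_start_end`, `build_word_index`, `tokenize_and_pad`) and the toy data are assumptions, and indices are assigned in order of first appearance, whereas the Keras `Tokenizer` used in the notebook orders words by frequency.

```python
# Plain-Python sketch of the start/end tokens, tokenization and padding steps.
# Illustration only: the notebook itself uses the Keras Tokenizer and pad_sequences.

def add_start_end(sentence):
    """Wrap a preprocessed sentence in the special tokens."""
    return "<start> " + sentence + " <end>"

def build_word_index(sentences):
    """Map each distinct token to an integer, starting at 1 (0 is kept for padding)."""
    word_index = {}
    for sentence in sentences:
        for token in sentence.split(" "):   # split on spaces only, no character filtering
            if token not in word_index:
                word_index[token] = len(word_index) + 1
    return word_index

def tokenize_and_pad(sentences, word_index):
    """Turn sentences into integer sequences and pad the end with zeros."""
    sequences = [[word_index[t] for t in s.split(" ")] for s in sentences]
    max_len = max(len(seq) for seq in sequences)
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]

toy_german = ["ich bin tom .", "hallo !"]   # hypothetical, already preprocessed
toy_wrapped = [add_start_end(s) for s in toy_german]
toy_index = build_word_index(toy_wrapped)
toy_padded = tokenize_and_pad(toy_wrapped, toy_index)
print(toy_padded)   # [[1, 2, 3, 4, 5, 6], [1, 7, 8, 6, 0, 0]]
```

Because no characters are filtered out, `<start>`, `<end>` and the punctuation marks each get their own index, which is what lets the decoder learn where a sentence begins and ends.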

In [None]:
# Split each line into its English and German sentences

English_sentences = []
German_sentences = []

for i in data_examples:
    
    English = re.search(r"^[^\t]*[.!?]", i)
    German = re.search(r"\t[^\t]*[.!?]", i)
    
    if English is None or German is None:
        continue
    
    English_sentences.append(English[0])
    German_sentences.append(German[0][1:])

# Preprocess the data with the above functions 

English_preprocessed = [preprocess_sentence(i) for i in English_sentences]
German_preprocessed = [preprocess_sentence(i) for i in German_sentences]

# Add <start> and <end> to each sentence of the target language

German_preprocessed_2 = ["<start> " + i + " <end>" for i in German_preprocessed]

# Tokenize the target-language sentences

Tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=None, filters='', split=' ', 
                                                         char_level=False, oov_token= None)

Tokenizer.fit_on_texts(German_preprocessed_2)

German_tokenized = Tokenizer.texts_to_sequences(German_preprocessed_2)


# Get the number of unique words (useful later)

word_index = Tokenizer.word_index
num_german_tokens = len(word_index)
tokenizer_config = Tokenizer.get_config()

# Pad the target-language sequences

German_padded = tf.keras.preprocessing.sequence.pad_sequences(German_tokenized, 
                                                              maxlen=len(max(German_tokenized, key=len)),
                                                              padding='post')


# 3: Load the embedding layer

As in many NLP applications, we need to embed our inputs. In this project, we use a pre-trained module from TensorFlow Hub. The URL for the module is https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1. 
This embedding takes a batch of text tokens in a 1-D tensor of strings as input. It then embeds the separate tokens into a 128-dimensional space. 

In [None]:
embedding_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1", output_shape=[128], input_shape=[], dtype=tf.string)

# 4: Prepare the training and validation Datasets.

In this section, we will:

* Create a random training and validation set split of the data, reserving 20% of the data for validation.
* Load the training and validation sets into a tf.data.Dataset object, passing in a tuple of English and German data for both training and validation sets.
* Create a function to map over the datasets that splits each English sentence at spaces, and apply it to both Dataset objects using the map method. 
* Create and apply a function to map over the datasets that embeds each sequence of English words using the loaded embedding layer/model. 
* Create and apply a function to filter out dataset examples where the English sentence is more than 13 (embedded) tokens in length. 
* Create and apply a function to map over the datasets that pads each English sequence of embeddings with some distinct padding value before the sequence, so that each sequence is length 13. 
* Batch both training and validation Datasets with a batch size of 16.
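
The filter and padding steps above can be sketched with NumPy as follows. This is a rough illustration under stated assumptions, not the notebook's `tf.data` pipeline: random vectors stand in for real embeddings, zero is assumed as the padding value, and `keep_example` and `pre_pad` are hypothetical helper names.

```python
import numpy as np

MAX_LEN = 13   # maximum number of English tokens, as in the steps above
EMB_DIM = 128  # embedding dimension of the TF Hub module

def keep_example(embedded):
    """Filter step: keep sequences of at most MAX_LEN embedded tokens."""
    return embedded.shape[0] <= MAX_LEN

def pre_pad(embedded):
    """Padding step: add zero rows before the sequence, up to MAX_LEN rows."""
    n_missing = MAX_LEN - embedded.shape[0]
    return np.pad(embedded, ((n_missing, 0), (0, 0)), mode="constant")

# Random vectors stand in for real embeddings (hypothetical data).
rng = np.random.default_rng(0)
toy_examples = [rng.normal(size=(n, EMB_DIM)) for n in (4, 13, 20)]

toy_kept = [e for e in toy_examples if keep_example(e)]   # the 20-token example is dropped
toy_padded = np.stack([pre_pad(e) for e in toy_kept])
print(toy_padded.shape)   # (2, 13, 128)
```

Padding before the sequence rather than after it keeps the real tokens next to the end of the sequence, so the encoder reads them last.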

In [None]:
# Randomly split the data into training and validation examples, with 20 percent for validation.

X_train, X_test, Y_train, Y_test = train_test_split(English_preprocessed, German_padded, test_size=0.2, 
                                         random_state=None, shuffle=True, stratify=None)

X_train = np.array(X_train)
X_test = np.array(X_test)

# Pass training and test sets into dataset

Training_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
Test_dataset = tf.data.Dataset.from_tensor_slices((X_test,  Y_test))

# Split function

def split_English_sentences(English, German):
    
    English = tf.strings.split(English)
    
    return English, German 

# Map this function over the datasets
Training_dataset = Training_dataset.map(split_English_sentences)
Test_dataset = Test_dataset.map(split_English_sentences)

# Embedding function

def Embedd_English_sentences(English, German):
    
    English = embedding_layer(English)
    
    return English, German 

# Map this function over the datasets
Training_dataset = Training_dataset.map(Embedd_English_sentences)
Test_dataset = Test_dataset.map(Embedd_English_sentences)

# Filtering function

def Filter_English_sentences(dataset):
    
    def filter_func(English, German):
        # Keep only sentences with at most 13 embedded tokens
        return tf.shape(English)[0] <= 13

    filtered_dataset = dataset.filter(filter_func)

    return filtered_dataset

# Apply the function to the datasets
Training_dataset = Filter_English_sentences(Training_dataset)
Test_dataset = Filter_English_sentences(Test_dataset)

# Padding function: pre-pad each English sequence with zeros up to length 13

def Pad_English_sentences(English, German):
    
    paddings = [[13-tf.shape(English)[0] ,0], tf.constant([0,0])]
    English = tf.pad(English, paddings, "CONSTANT")
    English = tf.reshape(English, [13, 128])
    return English, German 

# Map this function over the datasets
Training_dataset = Training_dataset.map(Pad_English_sentences)
Test_dataset = Test_dataset.map(Pad_English_sentences)

Training_dataset = Training_dataset.batch(16, drop_remainder=True)
Test_dataset = Test_dataset.batch(16, drop_remainder=True)

Training_dataset.element_spec
Test_dataset.element_spec


# 5: End token embedding.

In this section, we will create a custom layer to add the learned end token embedding to the encoder model.

![Encoder schematic](Images/neural_translation_model_encoder.png)

More specifically we will create a custom layer that takes a batch of English data examples from one of the Datasets, and adds a learned embedded ‘end’ token to the end of each sequence. 
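
As a shape check, the operation this layer performs can be sketched in NumPy. This is only an illustration of the tile-and-concatenate idea: `append_end_token` is a hypothetical helper, the batch is made-up data, and a fixed random vector stands in for the learned embedding.

```python
import numpy as np

def append_end_token(batch, end_token):
    """Append the same end-token vector after the last step of every sequence."""
    batch_size = batch.shape[0]
    # (emb_dim,) -> (1, 1, emb_dim) -> (batch_size, 1, emb_dim)
    tiled = np.tile(end_token.reshape(1, 1, -1), (batch_size, 1, 1))
    return np.concatenate([batch, tiled], axis=1)

toy_batch = np.zeros((16, 13, 128))   # hypothetical batch of padded embeddings
toy_end_token = np.random.default_rng(0).uniform(size=128)   # stands in for the learned vector
toy_out = append_end_token(toy_batch, toy_end_token)
print(toy_out.shape)   # (16, 14, 128)
```

Each sequence grows from 13 to 14 steps, and the last step is identical across the batch. In the real layer that vector is a trainable variable, so it is updated along with the rest of the encoder.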


In [None]:
# Custom layer to add the 'end' token

class EndTokenLayer(Layer):

    def __init__(self, embedding_dim=128, **kwargs):
        super(EndTokenLayer, self).__init__(**kwargs)
        self.end_token_embedding = tf.Variable(initial_value=tf.random.uniform(shape=(embedding_dim,)), trainable=True)

    def call(self, inputs):
        end_token = tf.tile(tf.reshape(self.end_token_embedding, shape=(1, 1, self.end_token_embedding.shape[0])),
                            [tf.shape(inputs)[0],1,1])
        
        return tf.keras.layers.concatenate([inputs, end_token], axis=1)  

End_Token_Layer = EndTokenLayer()

# 6: Build the encoder network.
The encoder network follows the schematic diagram above. We will now build the RNN encoder model. 

In [None]:
# Build the encoder

inputs = Input(shape=(13,128))
x = End_Token_Layer(inputs)
x = tf.keras.layers.Masking(mask_value=0.0)(x)
LSTM_output, hidden_state, cell_states = tf.keras.layers.LSTM(units = 512, return_sequences=True, return_state=True)(x)
outputs = [hidden_state, cell_states]
Encoder = tf.keras.models.Model(inputs=inputs, outputs=outputs)

# 7: Build the decoder network
The decoder network follows the schematic diagram below. 

![Decoder schematic](Images/neural_translation_model_decoder.png)

More specifically, it will be composed of:

* An Embedding layer with vocabulary size set to the number of unique German tokens, embedding dimension 128, and set to mask zero values in the input.
* An LSTM layer with 512 units, that returns its hidden and cell states, and also returns sequences.
* A Dense layer with number of units equal to the number of unique German tokens, and no activation function.

In [None]:
# Create the decoder

class Decoder1(Model):
    def __init__(self, **kwargs):
        super(Decoder1, self).__init__(**kwargs)

        self.embLayer = Embedding(input_dim = num_german_tokens+1, output_dim=128, mask_zero=True)
        self.lstmLayer = LSTM(units=512, return_state=True, return_sequences=True)
        self.denseLayer = Dense(units = num_german_tokens+1, activation=None)

    def call(self, inputs, hidden_state=None, cell_state=None):
        
        x = self.embLayer(inputs)

        if (hidden_state is None) or (cell_state is None):
            
            x, hidden_state, cell_state = self.lstmLayer(x)
        else:
            x, hidden_state, cell_state = self.lstmLayer(x, initial_state=(hidden_state, cell_state))
        
        x = self.denseLayer(x)

        return x, hidden_state, cell_state
Decoder = Decoder1()

# 8: Training loop
Let's now write a custom training loop as our model is a bit complex. Here, we will:

* Define a function that takes a Tensor batch of German data (as extracted from the training Dataset), and returns a tuple containing German inputs and outputs for the decoder model (refer to schematic diagram above).
* Define a function that computes the forward and backward pass for your translation model. More specifically, it will:
    * Pass the English input into the encoder, to get the hidden and cell states of the encoder LSTM.
    * These hidden and cell states are then passed into the decoder, along with the German inputs, which returns a sequence of outputs (the hidden and cell state outputs of the decoder LSTM are unused in this function).
    * The loss should then be computed between the decoder outputs and the German output function argument.
    * The function returns the loss and gradients with respect to the encoder and decoder’s trainable variables.
* Define and run a custom training loop for a number of epochs (for you to choose) that does the following:
    * Iterates through the training dataset, and creates decoder inputs and outputs from the German sequences.
    * Updates the parameters of the translation model using the gradients of the function above and an optimizer object.
    * Every epoch, computes the validation loss on a number of batches from the validation set and saves the epoch training and validation losses.
* Plot the learning curves for loss vs epoch for both training and validation sets (I recommend always doing this to make sure everything looks alright ^^)
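
The first bullet and the loss computation can be illustrated without TensorFlow. The sketch below is a plain-Python approximation under stated assumptions: the token indices are made up, and `sparse_cross_entropy_from_logits` is a hypothetical helper that scores a single time step, whereas the Keras loss used in the notebook averages over the whole batch.

```python
import math

def german_input_output(sequence):
    """Teacher forcing: the decoder sees tokens 0..n-2 and must predict tokens 1..n-1."""
    return sequence[:-1], sequence[1:]

def sparse_cross_entropy_from_logits(target_index, logits):
    """Negative log-probability of the target token under softmax(logits)."""
    max_logit = max(logits)   # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[target_index]

toy_sequence = [1, 5, 7, 2, 0, 0]   # hypothetical: <start>=1, <end>=2, 0 = padding
toy_inputs, toy_outputs = german_input_output(toy_sequence)
print(toy_inputs)    # [1, 5, 7, 2, 0]
print(toy_outputs)   # [5, 7, 2, 0, 0]

# With uniform logits over 4 tokens the loss is log(4), whatever the target is.
toy_loss = sparse_cross_entropy_from_logits(2, [0.0, 0.0, 0.0, 0.0])
print(toy_loss)
```

At every position the decoder receives the true previous word and is scored on the next one, which is why the inputs and outputs are the same sequence shifted by one step.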

This model is computationally demanding to train. If you have a toaster like me, I really recommend using the GPU accelerator hardware on Colab.

In [None]:
# Prepare the inputs and outputs to train the decoder

def German_input_output(GermanData):
    inputs = GermanData[:,:-1]
    outputs = GermanData[:,1:]

    return inputs, outputs

# Define the optimizer and loss

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compute forward and backward propagation

@tf.function
def grad(English_inputs, German_inputs, German_outputs):
    
    with tf.GradientTape() as tape:
        
        hs, cs = Encoder(English_inputs)
        Decoder_output, _, _ = Decoder(German_inputs, hs, cs)
        loss_value = loss(German_outputs, Decoder_output) 
      
    return (loss_value, tape.gradient(loss_value, Encoder.trainable_variables + Decoder.trainable_variables))

def train_translator(num_epochs = 5):
 
    train_loss_results = []
    validation_loss_results = []

    

    for epoch in range(num_epochs):
        epoch_loss_avg = tf.keras.metrics.Mean()
        Val_epoch_loss_avg = tf.keras.metrics.Mean()
        #Training loop
        for x, y in Training_dataset:
            #Optimize the model
            
            German_inputs, German_outputs = German_input_output(y)
            
            loss_value, grads = grad(English_inputs = x, German_inputs=German_inputs, German_outputs = German_outputs)
            optimizer.apply_gradients(zip(grads, Encoder.trainable_variables + Decoder.trainable_variables))
        
            #Update the running mean of the training loss
            epoch_loss_avg(loss_value)
            
        
        #Validation loop (grad() is reused for its loss value only: the gradients are discarded and no weights are updated)
        for x, y in Test_dataset:
            
            German_inputs, German_outputs = German_input_output(y)
            validation_loss_value, _ = grad(English_inputs = x, German_inputs=German_inputs, German_outputs = German_outputs)
            
            Val_epoch_loss_avg(validation_loss_value)
            
            
        train_loss_results.append(epoch_loss_avg.result().numpy())
        validation_loss_results.append(Val_epoch_loss_avg.result().numpy())
        
        
        print("Epoch {:03d}: Training loss: {:.3f}, Validation loss: {:.3f}".format(epoch, epoch_loss_avg.result(), Val_epoch_loss_avg.result()))
    
    return train_loss_results, validation_loss_results

In [None]:
# ANNND HERE WE GO: LET'S TRAIN

Train_Loss, Validation_Loss = train_translator(1)

In [None]:
# Optional: if you want to look at how the training went
fig, axes = plt.subplots(1, 2, sharex=True, figsize=(12, 5))

axes[0].set_xlabel("Epochs", fontsize=14)
axes[0].set_ylabel("Train_Loss", fontsize=14)
axes[0].set_title('Loss vs epochs')
axes[0].plot(Train_Loss)

axes[1].set_title('Loss vs epochs')
axes[1].set_ylabel("Validation_Loss", fontsize=14)
axes[1].set_xlabel("Epochs", fontsize=14)
axes[1].plot(Validation_Loss)
plt.show()

# Here we are! Enter your sentence!

In [None]:
# ENTER THE SENTENCE YOU WANT TO TRANSLATE! Make sure it has at most 13 tokens (words and punctuation marks), since that is the padding size we chose

Sentence_to_translate = "Hello I am Tom!"

In [None]:
Sentence_preprocessed = preprocess_sentence(Sentence_to_translate)
Sentence_splited = tf.strings.split(Sentence_preprocessed)
Sentence_embedded = embedding_layer(Sentence_splited) 
paddings = [[13-tf.shape(Sentence_embedded)[0] ,0], tf.constant([0,0])]
Sentence_padded = tf.pad(Sentence_embedded, paddings, "CONSTANT")
Sentence_padded = tf.reshape(Sentence_padded, [13, 128])

In [None]:
# First the encoder
Sentence_padded = np.expand_dims(Sentence_padded, 0)
hs_cs = Encoder(Sentence_padded)

# And finally the decoder

index_word = json.loads(tokenizer_config['index_word'])
start_index = word_index['<start>']   # look the special tokens up instead of assuming indices 1 and 2
end_index = word_index['<end>']

# Feed the <start> token together with the encoder states
a = np.expand_dims(np.array([start_index]), 0)
Prediction, h0, c0 = Decoder(a, hs_cs[0], hs_cs[1])
predicted_sentence = [int(np.argmax(Prediction[0, 0, :], axis=0))]

# Greedy decoding: feed each predicted word back in until <end> (at most 10 words in total)
for x in range(9):
    if predicted_sentence[-1] == end_index:
        break
    a = np.expand_dims(np.array([predicted_sentence[-1]]), 0)
    Prediction, h0, c0 = Decoder(a, h0, c0)
    predicted_sentence.append(int(np.argmax(Prediction[0, 0, :], axis=0)))

# Map the indices back to words, leaving out <end> and the padding index 0
Translation = [index_word[str(i)] for i in predicted_sentence if i not in (0, end_index)]

print('Here is your translation:')
print(' '.join(Translation))