# Neural Machine Translation

* This notebook builds a Neural Machine Translation (NMT) model with attention that translates human-readable dates ("25th of June, 2009") into machine-readable dates ("2009-06-25").



In [7]:
import tensorflow as tf
from tensorflow.keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from tensorflow.keras.layers import RepeatVector, Dense, Activation, Lambda
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model, Model
import tensorflow.keras.backend as K
import numpy as np

from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
from nmt_utils import *
import matplotlib.pyplot as plt
%matplotlib inline

### Dataset
The model is trained on a dataset of 10,000 human-readable dates and their equivalent, standardized, machine-readable dates.

In [8]:
m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

100%|██████████| 10000/10000 [00:00<00:00, 23539.08it/s]


- `dataset`: a list of (human-readable date, machine-readable date) tuples.
- `human_vocab`: a Python dictionary mapping every character used in the human-readable dates to an integer index.
- `machine_vocab`: a Python dictionary mapping every character used in the machine-readable dates to an integer index.
- `inv_machine_vocab`: the inverse of `machine_vocab`, mapping indices back to characters.
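
To make the vocabulary dictionaries concrete, here is a minimal sketch that builds a character vocabulary and its inverse from a few target strings. The names `sample_targets`, `toy_machine_vocab`, and `toy_inv_machine_vocab` are hypothetical, and assigning indices in sorted character order is an assumption; the real dictionaries come from `load_dataset` in `nmt_utils`.

```python
# Toy target strings; the real vocabularies are returned by load_dataset.
sample_targets = ["2004-01-21", "1973-05-10", "2016-02-29", "1980-11-20"]

# Assumption: indices are assigned in sorted character order.
chars = sorted(set("".join(sample_targets)))
toy_machine_vocab = {ch: i for i, ch in enumerate(chars)}

# Inverse mapping: index -> character.
toy_inv_machine_vocab = {i: ch for ch, i in toy_machine_vocab.items()}

print(toy_machine_vocab)
print(len(toy_machine_vocab))  # 11 characters: '-' plus the ten digits
```

With this ordering, `'-'` maps to 0 and the digits `'0'` to `'9'` map to 1 through 10.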

In [9]:
dataset[:10]

[('21 jan 2004', '2004-01-21'),
 ('01.07.20', '2020-07-01'),
 ('12/1/73', '1973-12-01'),
 ('thursday may 10 1973', '1973-05-10'),
 ('thursday january 13 2011', '2011-01-13'),
 ('thursday december 1 1994', '1994-12-01'),
 ('saturday december 21 2019', '2019-12-21'),
 ('29 feb 2016', '2016-02-29'),
 ('13 apr 2003', '2003-04-13'),
 ('thursday november 20 1980', '1980-11-20')]

In [12]:
Tx = 30  # assumed maximum length of a human-written date
Ty = 10  # length of the machine-readable output "YYYY-MM-DD"
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

print("X.shape:", X.shape)      # human-readable dates as padded index sequences
print("Y.shape:", Y.shape)      # machine-readable dates as index sequences
print("Xoh.shape:", Xoh.shape)  # one-hot version of X
print("Yoh.shape:", Yoh.shape)  # one-hot version of Y

X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37)
Yoh.shape: (10000, 10, 11)
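
The shapes above follow from two steps: each string becomes a fixed-length sequence of indices, and each index becomes a one-hot vector. The sketch below illustrates this with a hypothetical helper `encode` and a toy vocabulary; it is not the `preprocess_data` implementation from `nmt_utils`, and the handling of unknown characters and padding is an assumption.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration; the real one comes from load_dataset).
toy_chars = " 0123456789abcdefghijklmnopqrstuvwxyz"
toy_vocab = {ch: i for i, ch in enumerate(toy_chars)}
toy_vocab["<unk>"] = len(toy_vocab)
toy_vocab["<pad>"] = len(toy_vocab)

def encode(text, length, vocab):
    """Hypothetical helper: map characters to indices, then truncate or pad to `length`."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text.lower()[:length]]
    return ids + [vocab["<pad>"]] * (length - len(ids))

toy_Tx = 30
toy_X = np.array([encode(t, toy_Tx, toy_vocab) for t in ["21 jan 2004", "12/1/73"]])
toy_Xoh = np.eye(len(toy_vocab))[toy_X]  # one-hot: (examples, Tx, vocab size)

print(toy_X.shape, toy_Xoh.shape)  # (2, 30) (2, 30, 39)
```

The last axis of the one-hot array equals the vocabulary size, which is why `Xoh` ends in 37 and `Yoh` ends in 11 in the notebook.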


In [14]:
index = 106
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])
print("Target after preprocessing (one-hot):", Yoh[index])

Source date: sunday september 16 2012
Target date: 2012-09-16

Source after preprocessing (indices): [29 31 25 16 13 34  0 29 17 27 30 17 24 14 17 28  0  4  9  0  5  3  4  5
 36 36 36 36 36 36]
Target after preprocessing (indices): [ 3  1  2  3  0  1 10  0  2  7]

Source after preprocessing (one-hot): [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
Target after preprocessing (one-hot): [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
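
The printed indices can be checked by hand. The sketch below decodes the target indices back into a string and counts the padding positions in the source. The mapping `assumed_inv_machine_vocab` is an assumption (indices in sorted character order), chosen because it is consistent with the output shown above; in the notebook itself, `inv_machine_vocab` plays this role.

```python
# Assumption: '-' -> 0, '0' -> 1, ..., '9' -> 10.
assumed_inv_machine_vocab = dict(enumerate("-0123456789"))

target_indices = [3, 1, 2, 3, 0, 1, 10, 0, 2, 7]  # Y[index] as printed above
decoded = "".join(assumed_inv_machine_vocab[i] for i in target_indices)

# The source string is shorter than Tx = 30, so the remainder is padding.
source_text = "sunday september 16 2012"
n_pad = 30 - len(source_text)

print(decoded, n_pad)  # 2012-09-16 6
```

The six padding positions correspond to the six trailing `36` entries in the source index sequence.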


<table>
<td> 
<img src="images/attn_model.png" style="width:500px;height:500px;"> <br>
    <center>Entire model</center>
</td> 
<td> 
<img src="images/attn_mechanism.png" style="width:500px;height:500px;"> <br>
    <center>Attention mechanism</center>
</td> 
</table>
<caption><center> Neural machine translation with attention</center></caption>


#### Pre-attention and Post-attention LSTMs on both sides of the attention mechanism
- There are two separate LSTMs in this model: a pre-attention LSTM and a post-attention LSTM.
- *Pre-attention* Bi-LSTM: the bidirectional LSTM at the bottom of the left-hand diagram. It comes *before* the attention mechanism.
    - The attention mechanism is shown in the middle of the left-hand diagram.
    - The pre-attention Bi-LSTM goes through $T_x$ time steps.
- *Post-attention* LSTM: the LSTM at the top of the diagram. It comes *after* the attention mechanism.
    - The post-attention LSTM goes through $T_y$ time steps.
- The post-attention LSTM passes the hidden state $s^{\langle t \rangle}$ and cell state $c^{\langle t \rangle}$ from one time step to the next.

#### Each time step does not use predictions from the previous time step
* The post-attention LSTM at time $t$ does not take the previous time step's prediction $y^{\langle t-1 \rangle}$ as input.
* Instead, the post-attention LSTM at time $t$ takes only the context vector for step $t$ together with the previous hidden state $s^{\langle t-1\rangle}$ and cell state $c^{\langle t-1\rangle}$ as input.
* The model is designed this way because there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.

- The attention mechanism uses a `RepeatVector` node to copy $s^{\langle t-1 \rangle}$'s value $T_x$ times.
- Then it uses `Concatenate` to join each copy of $s^{\langle t-1 \rangle}$ with the corresponding $a^{\langle t' \rangle}$.
- The concatenation of $s^{\langle t-1 \rangle}$ and $a^{\langle t' \rangle}$ is fed into a small fully connected network (`Dense` layers), which computes $e^{\langle t, t' \rangle}$.
- $e^{\langle t, t' \rangle}$ is then passed through a softmax over $t'$ to compute $\alpha^{\langle t, t' \rangle}$.
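
Written out, the softmax step and the weighted sum that follows it take the standard form below. The symbol $context^{\langle t \rangle}$ for the vector passed to the post-attention LSTM is a notational choice made here.

$$
\alpha^{\langle t, t' \rangle} = \frac{\exp\left(e^{\langle t, t' \rangle}\right)}{\sum_{t''=1}^{T_x} \exp\left(e^{\langle t, t'' \rangle}\right)},
\qquad
context^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle}\, a^{\langle t' \rangle}
$$

For each output step $t$, the weights $\alpha^{\langle t, t' \rangle}$ are non-negative and sum to 1 over $t'$, so the context is a weighted average of the Bi-LSTM hidden states.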

In [74]:
# Define shared layers as global variables so their weights are reused at every output time step
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # custom softmax over axis 1 (the Tx axis), available via the nmt_utils import
dotor = Dot(axes = 1)

In [75]:
def one_step_attention(a, s_prev):
    """
    Performs one step of attention: outputs a context vector computed as the dot product of the attention weights
    "alphas" and the hidden states "a" of the Bi-LSTM.
    
    Arguments:
    a -- hidden state output of the Bi-LSTM, tensor of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, tensor of shape (m, n_s)
    
    Returns:
    context -- context vector of shape (m, 1, 2*n_a), input of the next (post-attention) LSTM cell
    """
    
    # Repeat s_prev Tx times to get shape (m, Tx, n_s), so it can be concatenated with every hidden state in "a".
    s_prev = repeator(s_prev)
    # Concatenate a and s_prev on the last axis.
    concat = concatenator([a, s_prev])
    # Pass concat through a small fully connected layer to compute the intermediate energies "e".
    e = densor1(concat)
    # Pass e through a second fully connected layer to compute the "energies".
    energies = densor2(e)
    # Apply the softmax "activator" to the energies to get the attention weights "alphas".
    alphas = activator(energies)
    # Take the dot product of alphas and a to get the context vector for the next (post-attention) LSTM cell.
    context = dotor([alphas, a])
    
    
    return context
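
The same computation can be traced in plain NumPy to check the shapes at each step. This is only a sketch: the sizes are toy values, and `W1`, `b1`, `W2`, `b2` are randomly initialized stand-ins for the trainable weights of `densor1` and `densor2`, not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_m, toy_Tx, toy_na, toy_ns = 2, 5, 3, 4  # toy sizes, not the notebook's

toy_a = rng.normal(size=(toy_m, toy_Tx, 2 * toy_na))  # Bi-LSTM hidden states
toy_s_prev = rng.normal(size=(toy_m, toy_ns))         # previous post-attention hidden state

# Hypothetical stand-ins for the Dense layer weights.
W1 = rng.normal(size=(2 * toy_na + toy_ns, 10))
b1 = np.zeros(10)
W2 = rng.normal(size=(10, 1))
b2 = np.zeros(1)

s_rep = np.repeat(toy_s_prev[:, None, :], toy_Tx, axis=1)  # RepeatVector: (m, Tx, n_s)
concat = np.concatenate([toy_a, s_rep], axis=-1)           # (m, Tx, 2*n_a + n_s)
e = np.tanh(concat @ W1 + b1)                              # (m, Tx, 10)
energies = np.maximum(e @ W2 + b2, 0.0)                    # ReLU: (m, Tx, 1)

# Softmax over the Tx axis (axis=1), so the weights for each example sum to 1.
exp = np.exp(energies - energies.max(axis=1, keepdims=True))
alphas = exp / exp.sum(axis=1, keepdims=True)              # (m, Tx, 1)

# Dot(axes=1): contract alphas and a over the Tx axis.
context = np.einsum("mto,mtd->mod", alphas, toy_a)         # (m, 1, 2*n_a)

print(context.shape)  # (2, 1, 6)
```

Normalizing over axis 1 rather than the last axis matters here: the last axis of `energies` has size 1, so a softmax over it would return all ones.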

In [76]:
n_a = 32 # number of units for the pre-attention, bi-directional LSTM's hidden state 'a'
n_s = 64 # number of units for the post-attention LSTM's hidden state "s"

post_activation_LSTM_cell = LSTM(n_s, return_state = True) # post-attention LSTM 
output_layer = Dense(len(machine_vocab), activation=softmax)

In [77]:
def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """
    
    # Model input of shape (Tx, human_vocab_size): one-hot encoded characters
    # s0 (initial hidden state) and c0 (initial cell state) for the post-attention LSTM, each of shape (n_s,)
    X = Input(shape=(Tx, human_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0
    
    
    outputs = []
    
    
    
    # pre-attention Bi-LSTM. 
    a = Bidirectional(LSTM(units=n_a, return_sequences=True))(X)
    
    # Iterate for Ty steps
    for t in range(Ty):
    
        # one step of the attention mechanism to get back the context vector at step t 
        context = one_step_attention(a,s)
        
        # Apply the post-attention LSTM cell to the "context" vector.
        s, _, c = post_activation_LSTM_cell(inputs=context, initial_state=[s,c])
        
        # Apply Dense layer to the hidden state output of the post-attention LSTM 
        out = output_layer(s)
        outputs.append(out)
    
    # Create model instance taking three inputs and returning the list of outputs.
    model = Model(inputs = [X,s0,c0],outputs = outputs)
        
    return model

In [78]:
# Note: this rebinds the name `model` from the builder function to the Keras model instance.
model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))

In [79]:
model.summary()

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 30, 37)]     0                                            
__________________________________________________________________________________________________
s0 (InputLayer)                 [(None, 64)]         0                                            
__________________________________________________________________________________________________
bidirectional_3 (Bidirectional) (None, 30, 64)       17920       input_4[0][0]                    
__________________________________________________________________________________________________
repeat_vector_2 (RepeatVector)  (None, 30, 64)       0           s0[0][0]                         
                                                                 lstm_6[0][0]               
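
The parameter count reported for the bidirectional layer can be reproduced by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias); `lstm_param_count` is a hypothetical helper name.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

# Input size 37 (human vocabulary), n_a = 32 units, two directions.
bilstm_params = 2 * lstm_param_count(input_dim=37, units=32)
print(bilstm_params)  # 17920
```

This matches the `17920` shown for `bidirectional_3` in the summary.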

In [80]:
# `learning_rate` replaces the deprecated `lr` argument name
opt = Adam(learning_rate = 0.005, beta_1 = 0.9, beta_2 = 0.999, decay = 0.01)
model.compile(optimizer = opt, loss = "categorical_crossentropy",metrics = ["acc"])
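
To get a feel for what `decay = 0.01` does, the sketch below assumes the legacy Keras behavior, in which the learning rate is divided by `1 + decay * iterations` at every update step. Both that schedule and the helper `decayed_lr` are assumptions for illustration; the exact behavior depends on the installed TensorFlow version.

```python
base_lr, decay = 0.005, 0.01
steps_per_epoch = 10000 // 100  # dataset size / batch size used in this notebook

def decayed_lr(iteration):
    # Assumed legacy Keras schedule: inverse-time decay per update step.
    return base_lr / (1.0 + decay * iteration)

for epoch in (0, 1, 10, 20):
    print(epoch, decayed_lr(epoch * steps_per_epoch))
```

Under this assumption the learning rate halves after the first epoch and ends up at roughly 1/21 of its starting value after 20 epochs.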

In [81]:
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1))
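
The model returns a list of `Ty` outputs, one per output position, so the targets need the same layout. A minimal sketch of what `swapaxes` followed by `list` does, using a toy array in place of `Yoh`:

```python
import numpy as np

toy_Yoh = np.zeros((4, 10, 11))             # (examples, Ty, vocab size), toy batch of 4
toy_outputs = list(toy_Yoh.swapaxes(0, 1))  # Ty arrays, each of shape (examples, vocab size)

print(len(toy_outputs), toy_outputs[0].shape)  # 10 (4, 11)
```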

In [82]:
model.fit([Xoh, s0, c0], outputs, epochs=20, batch_size=100)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20


Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f9edaa308d0>

Keras reports a separate accuracy for each of the $T_y$ output positions. For example, `dense_2_acc_8: 0.8562` means that the model predicts the 7th character of the output correctly 85.62% of the time on the current batch of data.
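
A per-position accuracy of this kind can be sketched in NumPy as follows. The arrays `toy_pred` and `toy_true` contain made-up indices for illustration only; they are not outputs of this model.

```python
import numpy as np

# Toy integer predictions and targets of shape (examples, positions); values are made up.
toy_pred = np.array([[3, 1, 2, 3], [3, 1, 1, 2], [2, 10, 8, 10]])
toy_true = np.array([[3, 1, 2, 3], [3, 1, 2, 2], [2, 10, 8, 9]])

# Fraction of examples predicted correctly at each output position.
per_position_acc = (toy_pred == toy_true).mean(axis=0)
print(per_position_acc)
```

Here the first two positions are always correct, while the last two are each correct for two of the three examples.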

In [92]:
model.save_weights("my_model.h5")

In [95]:
model.load_weights('my_model.h5')

In [96]:
EXAMPLES = ['3 May 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']

source = np.array([string_to_int(i, Tx, human_vocab) for i in EXAMPLES])

source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source)))

s0 = np.zeros((len(EXAMPLES), n_s)) # initial states need one row per input example
c0 = np.zeros((len(EXAMPLES), n_s))

prediction = model.predict([source, s0, c0])

prediction = np.argmax(prediction, axis = -1).swapaxes(1,0)

for t in range(prediction.shape[0]):
    output = [inv_machine_vocab[int(i)] for i in prediction[t]]

    print("input:", EXAMPLES[t])
    print("source:", prediction[t])
    print("output:", ''.join(output))

input: 3 May 1979
source: [ 2 10  8 10  0  1  6  0  1  4]
output: 1979-05-03
input: 5 April 09
source: [ 3  1 10 10  0  1  5  0  1  6]
output: 2099-04-05
input: 21th of August 2016
source: [3 1 2 7 0 1 9 0 1 1]
output: 2016-08-00
input: Tue 10 Jul 2007
source: [3 1 1 8 0 1 8 0 2 1]
output: 2007-07-10
input: Saturday May 9 2018
source: [ 3  1  2  9  0  1  6  0  1 10]
output: 2018-05-09
input: March 3 2001
source: [3 1 1 2 0 1 4 0 1 4]
output: 2001-03-03
input: March 3rd 2001
source: [3 1 1 2 0 1 4 0 1 4]
output: 2001-03-03
input: 1 March 2001
source: [3 1 1 2 0 1 4 0 1 2]
output: 2001-03-01
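
The `argmax` and `swapaxes` calls used above turn the list of per-position probability arrays into one row of indices per example. The sketch below traces the shapes with a fake prediction built from two of the index rows printed above; `toy_prediction` is a stand-in for the output of `model.predict`, not a real model output.

```python
import numpy as np

toy_Ty, toy_vocab_size = 10, 11
expected = np.array([[2, 10, 8, 10, 0, 1, 6, 0, 1, 4],
                     [3, 1, 1, 2, 0, 1, 4, 0, 1, 2]])  # two index rows from the output above

# Fake prediction: a list of Ty arrays of shape (examples, vocab size),
# each row one-hot on the expected index.
toy_prediction = [np.eye(toy_vocab_size)[expected[:, t]] for t in range(toy_Ty)]

idx = np.argmax(toy_prediction, axis=-1)  # (Ty, examples)
idx = idx.swapaxes(1, 0)                  # (examples, Ty)

print(idx.shape)  # (2, 10)
```

After the swap, each row of `idx` holds the ten character indices for one input, ready to be mapped through `inv_machine_vocab`.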


In [89]:
import tkinter as tk

from tkinter import ttk

win = tk.Tk()

win.title('Machine Translator')

''

In [90]:
def translateMessage():
    EXAMPLES = [humanDate.get()]

    source = np.array([string_to_int(i, Tx, human_vocab) for i in EXAMPLES])

    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source)))

    s0 = np.zeros((len(EXAMPLES), n_s)) # initial states need one row per input example
    c0 = np.zeros((len(EXAMPLES), n_s))

    prediction = model.predict([source, s0, c0])

    prediction = np.argmax(prediction, axis = -1).swapaxes(1,0)

    for t in range(prediction.shape[0]):
        output = [inv_machine_vocab[int(i)] for i in prediction[t]]
        tk.Label(win, text = ''.join(output)).grid(row = 1, column = 1)
        print("output:", ''.join(output))

In [91]:
tk.Label(win, text="Your Date").grid(row=0)
tk.Label(win, text="Machine Translation").grid(row=1)

humanDate = tk.Entry(win)


humanDate.grid(row=0, column=1)


tk.Button(win,text='Translate', command=translateMessage).grid(row=3)


    
win.mainloop()

output: 2001-08-13
