# HW6 Machine translation with Encoder-Decoder model

## Due April 24th, 23:59

In this homework, you are first shown an example of encoder-decoder machine translation model for a dummy problem. Make sure you understand how it works. Then you will need to build a similar model for a real machine translation data set. The data set provided in this homework is an italiano-english dataset (perché italiano 
è mia lingua preferita), but feel free to download your preferred language pari here (http://www.manythings.org/anki/).


You are given the following files:
- `Machine-Translation.ipynb`: This notebook file
- `ita.txt`: Training dataset (see http://www.manythings.org/anki/ to understand the structure)
- `utils/`: folder containing all utility code for the series of homeworks


### Deliverables (zip them all)

- pdf or html version of your final notebook
- Show some translation examples in your notebook
- writeup.pdf: Add a short essay discussing the biggest challenges you encounter during this assignment and what you have learnt.

(**You are encouraged to add the writeup doc into your notebook
using markdown/html langauge, just like how this notes is prepared**)

# Set up

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os, sys
# add utils folder to path
p = os.path.dirname(os.getcwd())
if p not in sys.path:
    sys.path = [p] + sys.path

from utils.general import show_keras_model

from tensorflow.keras.models import Model

# Dummy Translation Problem
We are not doing anything real here, rather, we create a dummy problem to demonstrate how easy or hard to use a S2S model for machine translation.

The dummy prblem I choose here is to translate datestr like "Aug-30-1989" to another format "1989/08/30". Sounds easy, isn't it? But think about it, you feel this simple because you have so much prior knowledge. You know the English meaning of "Aug", you know the different ways of representing dates, MM-DD-YYYY vs YYYY/MM/DD. But our model starts from absolute ignorance. Imagine you show this problem to a 2-year-old child, how much time does it make for him to figure out the rule? 

## Generate Training Data

In [2]:
import numpy as np

choice = np.random.choice
def source_generation(batch=100):
    months = choice(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], batch)
    days = choice(range(1, 28), batch)
    years = choice(range(1990, 2050), batch)
    
    return [ f"{m}-{d}-{y}" for m, d, y in zip(months, days, years)]

def translate(src):
    if type(src) == str: src = [src]
    mmap = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': "06", 'Jul': "07", 
            'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
    result = []
    for d in src:
        m, d, y = d.split('-')
        result.append(f"{y}/{mmap[m]}/{str(d).rjust(2, '0')}")
        
    return result

In [3]:
# Let's generate some data
train_X_raw = source_generation(10000)
train_Y_raw = translate(train_X_raw)

# Verify the translation
print(train_X_raw[:5])
print(train_Y_raw[:5])

['Apr-16-2043', 'Mar-17-2041', 'Feb-11-1997', 'Jan-1-2029', 'Feb-17-2004']
['2043/04/16', '2041/03/17', '1997/02/11', '2029/01/01', '2004/02/17']


## Other dummy tasks

You are encouraged to generate your own dummy tasks, for example, what about a simple calculator, can you train your model to understand "186+95" equal to "281"?

# Encoder-Decoder Model

In [4]:
encoder_input_len = 11
decoder_input_len = 10
latent_dim = 100

## Raw data transformer

As of today, I guess you should be quite familar with what we are doing here.

In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

char_vocab = list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-/0123456789$^')

reverse_vocab = {k:v for v, k in enumerate(char_vocab)}
def char_to_num(X_raw, is_encoder=True):
    """
    Translate the raw input to the numerical encoding. We take different treatments for the
    encoder inputs and decoder inputs. This is because we need a starter character "^" for the 
    decoder inputs.
    """
    result = [[reverse_vocab[c] for c in sent] for sent in X_raw]
    
    if(is_encoder):
        assert all([len(row) <= encoder_input_len for row in X_raw])
        return pad_sequences(sequences=result, maxlen=encoder_input_len, 
                             padding='post', truncating='post', 
                             value=reverse_vocab['$'])
    else:
        assert all([len(row) == decoder_input_len for row in X_raw])
        return pad_sequences(sequences=result, maxlen=decoder_input_len+1, 
                             padding='pre', truncating='post', 
                             value=reverse_vocab['^'])

    return pad_sequences(result)

def num_to_char(X):
    return [''.join([char_vocab[c] for c in row]) for row in X]

## Training model

In [6]:
# from keras.models import Model
from tensorflow.keras.layers import (Input, LSTM, Dense, Bidirectional, Embedding, 
                          TimeDistributed, Concatenate)

"""
Define an input Layer. We use one-hot encoding instead of embedding layer here. Since
we are using character based model, embedding may not be necessary, and may not be very 
helpful neither. Do you know why?
"""
encoder_inputs = Input(shape=(encoder_input_len, len(char_vocab)), name="Encoder_Input")
# For encoder, we can see the entire sentence at once, so we can use Bidirectional LSTM
encoder_lstm = Bidirectional(LSTM(latent_dim, return_state=True, name="Encoder_LSTM"))
# Bidrectional LSTM has 4 states instead of 2, we concatenate them to be comparable
# with the decoder LSTM
_, forward_h, forward_c, backward_h, backward_c = encoder_lstm(encoder_inputs)
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# Set up the decoder, using `encoder_states` as initial state
encoder_states = [state_h, state_c]
decoder_inputs = Input(shape=(decoder_input_len, len(char_vocab)), name="Decoder_Input")
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, name="Decoder_LSTM")
decoder_lstm_outputs = decoder_lstm(decoder_inputs,
                                    initial_state=encoder_states)
decoder_dense = Dense(len(char_vocab), activation='softmax')
decoder_outputs = TimeDistributed(decoder_dense)(decoder_lstm_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# show_keras_model(model)

## Train training model

In [7]:
# Run training
from tensorflow.keras.utils import to_categorical
"""
Don't be suprized that this model actually needs quite quite a lot of epochs to train, so please be patient.
After the model is trained, you can use the history.history object to plot the metrics improvment process.

While you are waiting for the model to train, feel free to read the next cell.
"""
batch_size = 1000
epochs = 75

# Here it's just some data transformation to translate the raw data to matrix inputs
encoder_input_data = to_categorical(char_to_num(train_X_raw, True), num_classes=len(char_vocab))
train_Y = to_categorical(char_to_num(train_Y_raw, False), num_classes=len(char_vocab))
# for decoder, the target lags input by 1 time step
decoder_input_data = train_Y[:, :-1, :]
decoder_target_data = train_Y[:, 1:, :]

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/75
Epoch 2/75
Epoch 3/75
Epoch 4/75
Epoch 5/75
Epoch 6/75
Epoch 7/75
Epoch 8/75
Epoch 9/75
Epoch 10/75
Epoch 11/75
Epoch 12/75
Epoch 13/75
Epoch 14/75
Epoch 15/75
Epoch 16/75
Epoch 17/75
Epoch 18/75
Epoch 19/75
Epoch 20/75
Epoch 21/75
Epoch 22/75
Epoch 23/75
Epoch 24/75
Epoch 25/75
Epoch 26/75
Epoch 27/75
Epoch 28/75
Epoch 29/75
Epoch 30/75
Epoch 31/75
Epoch 32/75
Epoch 33/75
Epoch 34/75
Epoch 35/75
Epoch 36/75
Epoch 37/75
Epoch 38/75
Epoch 39/75
Epoch 40/75
Epoch 41/75
Epoch 42/75
Epoch 43/75
Epoch 44/75
Epoch 45/75
Epoch 46/75
Epoch 47/75
Epoch 48/75
Epoch 49/75
Epoch 50/75
Epoch 51/75
Epoch 52/75
Epoch 53/75
Epoch 54/75
Epoch 55/75
Epoch 56/75


Epoch 57/75
Epoch 58/75
Epoch 59/75
Epoch 60/75
Epoch 61/75
Epoch 62/75
Epoch 63/75
Epoch 64/75
Epoch 65/75
Epoch 66/75
Epoch 67/75
Epoch 68/75
Epoch 69/75
Epoch 70/75
Epoch 71/75
Epoch 72/75
Epoch 73/75
Epoch 74/75
Epoch 75/75


## Inference model

Similar to HW04, we need a different model structure for the inference model. The inference model should copy exactly the same weights from the training model, but it predicts only 1 time step at a time.

In [8]:
# Trucate the encoder part of the training model as encoder model
encoder_model = Model(encoder_inputs, encoder_states)
# show_keras_model(encoder_model)

In [9]:
# Build the inference model
inference_inputs = Input(batch_shape=(1,1, len(char_vocab)), name="Inference_Input")
inference_lstm = LSTM(latent_dim*2, stateful=True,
                      name="Inference_LSTM")
inference_lstm_outputs = inference_lstm(inference_inputs)

inference_dense = Dense(len(char_vocab), activation='softmax')
inference_outputs = inference_dense(inference_lstm_outputs)

# Assign the weights of decoder to inference model
inference_lstm.set_weights(decoder_lstm.get_weights())
inference_dense.set_weights(decoder_dense.get_weights())

inference_model = Model(inference_inputs, inference_outputs)
# show_keras_model(inference_model)

In [10]:
def inference(encoder_input_data):
    """
    A utility function to generate the model prediction
    """
    states_h, states_c = encoder_model.predict(encoder_input_data)
    results = []
    
    for h, c in zip(states_h, states_c):
        sent, seed = [], reverse_vocab['^']
        inference_lstm.states[0].assign(h[None, :])
        inference_lstm.states[1].assign(c[None, :])
        for i in range(decoder_input_len):
            seed = to_categorical(np.array([seed]), num_classes=len(char_vocab))[None, :, :]
            seed = inference_model.predict(seed)[0].argmax()
            sent.append(seed)
            
        results.append(sent)
        
    return num_to_char(results)

In [11]:
# Let's look at some output
print(num_to_char(encoder_input_data[:10].argmax(axis=2)))
print(inference(encoder_input_data[:10]))

['Apr-16-2043', 'Mar-17-2041', 'Feb-11-1997', 'Jan-1-2029$', 'Feb-17-2004', 'Oct-27-2013', 'Aug-1-2021$', 'Jul-13-1995', 'May-21-2017', 'Feb-20-2009']
['2043/04/16', '2041/04/15', '1997/02/11', '2029/01/09', '2004/02/16', '2013/10/27', '2012/08/01', '1995/07/10', '2017/06/12', '2009/02/20']


# Real Machine translation 

In [12]:
"""
Now are you ready for the real challenge? You can use the ita.txt file as training data. 
But feel free to download different language from http://www.manythings.org/anki/. If you
happen to speak French or Japanese, it's time to show off!

1. Implement a Bidrectional LSTM Encoder-Decoder model, or other viable models to translate 
   the language dataset you choose.

2. Write the function to calculate the BLEU score of your model
"""
import pandas as pd
import re
import string
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from keras.initializers import Constant

lines= pd.read_csv('spa.txt', names=['eng', 'spa', 'source'], sep='\t')

# Lowercase all characters
lines.eng=lines.eng.apply(lambda x: x.lower())
lines.spa=lines.spa.apply(lambda x: x.lower())

# Remove quotes
lines.eng=lines.eng.apply(lambda x: re.sub("'", '', x))
lines.spa=lines.spa.apply(lambda x: re.sub("'", '', x))
exclude = set(string.punctuation) # Set of all special characters

# Remove all the special characters
lines.eng=lines.eng.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines.spa=lines.spa.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

# Remove all numbers from text
remove_digits = str.maketrans('', '', string.digits)
lines.eng=lines.eng.apply(lambda x: x.translate(remove_digits))
lines.spa = lines.spa.apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))

# Remove extra spaces
lines.eng=lines.eng.apply(lambda x: x.strip())
lines.spa=lines.spa.apply(lambda x: x.strip())
lines.eng=lines.eng.apply(lambda x: re.sub(" +", " ", x))
lines.spa=lines.spa.apply(lambda x: re.sub(" +", " ", x))

# Add start and end tokens to target sequences
start_token = '<START> '
end_token = '<END>'
lines.spa = lines.spa.apply(lambda x : ''.join([start_token, x, end_token]))

Using TensorFlow backend.


In [13]:
lines.sample(10)

Unnamed: 0,eng,spa,source
52543,this word comes from latin,<START> esta palabra viene del latín<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
69668,tom doesnt remember very much,<START> tom no recuerda tanto<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
119350,sometimes we do what we have to do not what we...,<START> a veces hacemos lo que debemos hacer n...,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
94371,i fell in love with her at first sight,<START> me enamoré de ella a la primera vista<...,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
18884,tom got mary drunk,<START> tom emborrachó a mary<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
71848,i have never been to the states,<START> no he estado nunca en los estados unid...,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
27095,try to be brave tom,<START> trata de ser valiente tom<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
85928,i can assure you that you are wrong,<START> te puedo asegurar que estás equivocada...,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
35495,tom must be very tired,<START> tom debe estar muy cansado<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
110788,her old bike squeaked as she rode down the hill,<START> su vieja bicicleta chirrió mientras ba...,CC-BY 2.0 (France) Attribution: tatoeba.org #3...


In [14]:
# Vocabulary of English
all_eng_words=set()
for eng in lines.eng:
    for word in eng.split():
        if word not in all_eng_words:
            all_eng_words.add(word)

# Vocabulary of French 
all_spa_words=set()
for spa in lines.spa:
    for word in spa.split():
        if word not in all_spa_words:
            all_spa_words.add(word)

In [15]:
# Max Length of source sequence
lenght_list=[]
for l in lines.eng:
    lenght_list.append(len(l.split(' ')))
max_length_src = np.max(lenght_list)
max_length_src

47

In [16]:
# Max Length of target sequence
lenght_list=[]
for l in lines.spa:
    lenght_list.append(len(l.split(' ')))
max_length_tar = np.max(lenght_list)
max_length_tar

50

In [17]:
input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_spa_words))
num_encoder_tokens = len(all_eng_words)
num_decoder_tokens = len(all_spa_words)
num_encoder_tokens, num_decoder_tokens

(13475, 36608)

In [18]:
# For zero padding
num_decoder_tokens += 1
num_encoder_tokens += 1

In [19]:
input_token_index = dict([(word, i+1) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i+1) for i, word in enumerate(target_words)])
reverse_input_char_index = dict((i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict((i, word) for word, i in target_token_index.items())

lines = shuffle(lines)
lines.head(10)

Unnamed: 0,eng,spa,source
93840,do you see that house thats my house,<START> ¿ves aquella casa esa es mi casa<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
95883,unfortunately there was no one around,<START> desafortunadamente no había ninguna pe...,CC-BY 2.0 (France) Attribution: tatoeba.org #7...
68739,please have someone else do it,<START> por favor que alguien más lo haga<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
18342,she was in a hurry,<START> ella estaba apresurada<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
5969,dogs are smart,<START> los perros son inteligentes<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
68818,she carried a baby on her back,<START> ella cargó un bebé en su espalda<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
6478,i was an idiot,<START> era un idiota<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
80640,steam is coming out of the engine,<START> está saliendo vapor del motor<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
44026,tom cant find his shoes,<START> tom no encuentra sus zapatos<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
47697,she is at home in english,<START> ella es hábil en el inglés<END>,CC-BY 2.0 (France) Attribution: tatoeba.org #3...


In [20]:
# Train - Test Split
X, y = lines.eng, lines.spa
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)
X_train.shape, X_test.shape

((111393,), (12377,))

In [21]:
def generate_batch(X = X_train, y = y_train, batch_size = 128):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X), batch_size):
            encoder_input_data = np.zeros((batch_size, max_length_src),dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar),dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens),dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j: j + batch_size], y[j: j + batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word] # encoder input seq
                for t, word in enumerate(target_text.split()):
                    if t < len(target_text.split())-1:
                        decoder_input_data[i, t] = target_token_index[word] # decoder input seq
                    if t > 0:
                        # decoder target sequence (one hot encoded)
                        # does not include the START token
                        # Offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.
                        
            yield [encoder_input_data, decoder_input_data], decoder_target_data, [None]

In [22]:
import numpy as np

glove = pd.read_csv("glove_6B_100d_top100k.csv")

embedding_matrix = np.zeros((num_encoder_tokens, 100))
for word, i in input_token_index.items():
    if word in glove.columns:
        embedding_matrix[i] = glove.loc[:, word].to_numpy()
        

In [26]:
encoder_inputs = Input(shape=(None,), name="Encoder_Input")
enc_emb =  Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)

# For encoder, we can see the entire sentence at once, so we can use Bidirectional LSTM
encoder_lstm = Bidirectional(LSTM(latent_dim, return_state=True, name="Encoder_LSTM"))
# Bidrectional LSTM has 4 states instead of 2, we concatenate them to be comparable
# with the decoder LSTM
_, forward_h, forward_c, backward_h, backward_c = encoder_lstm(enc_emb)
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# Set up the decoder, using `encoder_states` as initial state
encoder_states = [state_h, state_c]


decoder_inputs = Input(shape=(None,), name="Decoder_Input")
dec_emb = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)

decoder_lstm = LSTM(latent_dim*2, return_sequences=True, return_state=True, name="Decoder_LSTM")
decoder_outputs, _, _ = decoder_lstm(dec_emb,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = TimeDistributed(decoder_dense)(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [24]:
train_samples = len(X_train)
val_samples = len(X_test)
batch_size = 128
epochs = 50

In [1]:
model.fit(generate_batch(X_train, y_train, batch_size = batch_size),
                    steps_per_epoch = train_samples // batch_size,
                    epochs=epochs,
                    validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
                    validation_steps = val_samples // batch_size)

NameError: name 'model' is not defined

**The kernel keeps crashing. So I continue my work on .py file**

In [None]:
# Encode the input sequence to get the "thought vectors"
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2= Embedding(num_decoder_tokens, latent_dim)(decoder_inputs) # Get the embeddings of the decoder sequence

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2) # A dense softmax layer to generate prob dist. over the target vocabulary

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)


In [None]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = target_token_index[start_token]

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' '+sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == end_token or
           len(decoded_sentence) > 50):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence

In [None]:
val_gen = generate_batch(X_test, y_test, batch_size = 1)
(input_seq, actual_output), _ = next(val_gen)
decoded_sentence = decode_sequence(input_seq)

import nltk

hypothesis = decoded_sentence
reference = actual_output
#there may be several references
BLEU_score = nltk.translate.bleu_score.sentence_bleu(reference, hypothesis)

print(BLEU_score)