# Assignment 3: Build a seq2seq model for machine translation.

### Name: Aughdon Breslin

### Task: Change LSTM model to Bidirectional LSTM Model， translate English to target language and evaluate using Bleu score.

### Due Date: Tuesday, April 19th, 11:59PM

## 0. You will do the following:

1. Read and run the code. Please make sure you have installed keras or tensorflow.Running the script on colab will speed up the training process and also prevent package loading issue. 
2. Complete the code in Section 1.1, you may fill in your data directory.
3. Directly modify the code in Section 3. Change the current LSTM layer to a Bidirectional LSTM Model.
4. Training your model and translate English to Spanish in Section 4.2. You could try translating other languages.
5. Complete the code in Section 5.

### Hint: 

To implement ```Bi-LSTM```, you will need the following code to build the encoder. Do NOT use Bi-LSTM for the decoder. But there are other codes you need to modify to make it work.

In [2]:
import re
import string
from unicodedata import normalize
import numpy

from keras.layers import Input, LSTM, Bidirectional, Concatenate, Dense
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from keras.models import Model
from keras.utils.vis_utils import model_to_dot, plot_model

from IPython.display import SVG

from sklearn.model_selection import train_test_split

from nltk.translate.bleu_score import sentence_bleu

## 1. Data preparation (10 points)

1. Download spanish-english data from http://www.manythings.org/anki/
2. You may try to use other languages.
3. Unzip the .ZIP file.
4. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".
5. Fill in your data directory in section 1.1.

### 1.1. Load and clean text


In [3]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars form each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return numpy.array(cleaned)

#### Fill the following blanks:

In [4]:
# e.g., filename = 'Data/deu.txt'
filename = 'Data/spa.txt'

# e.g., n_train = 20000
n_train = 20000

In [5]:
# load dataset
doc = load_doc(filename)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)

# clean sentences
clean_pairs = clean_data(pairs)[0:n_train, :]

In [6]:
for i in range(3000, 3010):
    print('[' + clean_pairs[i, 0] + '] => [' + clean_pairs[i, 1] + ']')

[youre here] => [estas aqui]
[youre here] => [estais aqui]
[youre late] => [estas retrasado]
[youre lost] => [estas perdido]
[youre mean] => [eres mala]
[youre mean] => [eres mezquino]
[youre mine] => [tu eres mio]
[youre nice] => [eres simpatico]
[youre nuts] => [estas loco]
[youre nuts] => [estas chiflado]


In [7]:
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

print('Length of input_texts:  ' + str(input_texts.shape))
print('Length of target_texts: ' + str(input_texts.shape))

Length of input_texts:  (20000,)
Length of target_texts: (20000,)


In [8]:
max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))

max length of input  sentences: 18
max length of target sentences: 55


**Remark:** To this end, you have two lists of sentences: input_texts and target_texts

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.

In [9]:
# encode and pad sequences
def text2sequences(max_len, lines):
    tokenizer = Tokenizer(char_level=True, filters='')
    tokenizer.fit_on_texts(lines)
    seqs = tokenizer.texts_to_sequences(lines)
    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
    return seqs_pad, tokenizer.word_index


encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length, 
                                                      input_texts)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length, 
                                                       target_texts)

print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))
print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))

shape of encoder_input_seq: (20000, 18)
shape of input_token_index: 27
shape of decoder_input_seq: (20000, 55)
shape of target_token_index: 29


In [10]:
num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))

num_encoder_tokens: 28
num_decoder_tokens: 30


**Remark:** To this end, the input language and target language texts are converted to 2 matrices. 

- Their number of rows are both n_train.
- Their number of columns are respective max_encoder_seq_length and max_decoder_seq_length.

The followings print a sentence and its representation as a sequence.

In [11]:
target_texts[100]

'\tno puede ser\n'

In [12]:
decoder_input_seq[100, :]

array([ 6,  8,  3,  1, 17, 14,  2, 15,  2,  1,  5,  2, 10,  7,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0])

## 2.2. One-hot encode

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.
- It is represented by a $n\times t \times v$ tensor ($t$ is the number of unique chars) after the one-hot encoding.

In [13]:
# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = numpy.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq, 
                                    max_decoder_seq_length, 
                                    num_decoder_tokens)

print(encoder_input_data.shape)
print(decoder_input_data.shape)

(20000, 18, 28)
(20000, 55, 30)


## 3. Build the networks (for training) (20 points)

- In this section, we have already implemented the LSTM model for you. You can run the code and see what the code is doing.  

- You need to change the existing LSTM model to a Bidirectional LSTM model. Just modify the network structrue and do not change the training cell in section 3.4.

- Build encoder, decoder, and connect the two modules to get "model". 

- Fit the model on the bilingual data to train the parameters in the encoder and decoder.



### 3.1. Encoder network

- Input:  one-hot encode of the input language

- Return: 

    -- output (all the hidden states   $h_1, \cdots , h_t$) are always discarded
    
    -- the final hidden state  $h_t$
    
    -- the final conveyor belt $c_t$

In [14]:
latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens), 
                       name='encoder_inputs')

# set the LSTM layer
encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs, 
                      outputs=[state_h, state_c],
                      name='encoder')

Print a summary and save the encoder network structure to "./encoder.pdf"

In [15]:
SVG(model_to_dot(encoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=encoder_model, show_shapes=False,
    to_file='encoder.pdf'
)

encoder_model.summary()

Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
bidirectional (Bidirectional)   [(None, 512), (None, 583680      encoder_inputs[0][0]             
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 512)          0           bidirectional[0][1]              
                                                                 bidirectional[0][3]              
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           bidirectional[0][2]        

### 3.2. Decoder network

- Inputs:  

    -- one-hot encode of the target language
    
    -- The initial hidden state $h_t$ 
    
    -- The initial conveyor belt $c_t$ 

- Return: 

    -- output (all the hidden states) $h_1, \cdots , h_t$

    -- the final hidden state  $h_t$ (discarded in the training and used in the prediction)
    
    -- the final conveyor belt $c_t$ (discarded in the training and used in the prediction)

In [16]:
# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim*2,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim*2,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x, 
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

Print a summary and save the encoder network structure to "./decoder.pdf"

In [17]:
SVG(model_to_dot(decoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=decoder_model, show_shapes=False,
    to_file='decoder.pdf'
)

decoder_model.summary()

Model: "decoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input_x (InputLayer)    [(None, None, 30)]   0                                            
__________________________________________________________________________________________________
decoder_input_h (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
decoder_input_c (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1112064     decoder_input_x[0][0]            
                                                                 decoder_input_h[0][0]      

### 3.3. Connect the encoder and decoder

In [18]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x], 
              outputs=decoder_pred, 
              name='model_training')

In [19]:
SVG(model_to_dot(model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=model, show_shapes=False,
    to_file='model_training.pdf'
)

model.summary()

Model: "model_training"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
decoder_input_x (InputLayer)    [(None, None, 30)]   0                                            
__________________________________________________________________________________________________
encoder (Functional)            [(None, 512), (None, 583680      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1112064     decoder_input_x[0][0]            
                                                                 encoder[0][0]       

### 3.4. Fit the model on the bilingual dataset

- encoder_input_data: one-hot encode of the input language

- decoder_input_data: one-hot encode of the input language

- decoder_target_data: labels (left shift of decoder_input_data)

- tune the hyper-parameters

- stop when the validation loss stop decreasing.

In [20]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(20000, 18, 28)
shape of decoder_input_data(20000, 55, 30)
shape of decoder_target_data(20000, 55, 30)


In [21]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=16, validation_split=0.2)

model.save('seq2seq.h5')

Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


## 4. Make predictions

- In this section, you need to complete section 4.2 to translate English to the target language.


### 4.1. Translate English to XXX

1. Encoder read a sentence (source language) and output its final states, $h_t$ and $c_t$.
2. Take the [star] sign "\t" and the final state $h_t$ and $c_t$ as input and run the decoder.
3. Get the new states and predicted probability distribution.
4. sample a char from the predicted probability distribution
5. take the sampled char and the new states as input and repeat the process (stop if reach the [stop] sign "\n").

In [22]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [23]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # this line of code is greedy selection
        # try to use multinomial sampling instead (with temperature)
        sampled_token_index = numpy.argmax(output_tokens[0, -1, :])
        
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]

    return decoded_sentence


In [24]:
for seq_index in range(2100, 2120):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('English:       ', input_texts[seq_index])
    print('Spanish (true): ', target_texts[seq_index][1:-1])
    print('Spanish (pred): ', decoded_sentence[0:-1])


-
English:        hes skinny
Spanish (true):  el esta delgado
Spanish (pred):  el es a ententado
-
English:        hes strong
Spanish (true):  el es fuerte
Spanish (pred):  el es a andiendo
-
English:        hes stupid
Spanish (true):  el es estupido
Spanish (pred):  el es a anuerdo
-
English:        hes stupid
Spanish (true):  no le llega agua al tanque
Spanish (pred):  el es a anuerdo
-
English:        hes stupid
Spanish (true):  es un salame
Spanish (pred):  el es a anuerdo
-
English:        help me out
Spanish (true):  ayudame
Spanish (pred):  ayudame a la cara
-
English:        help me out
Spanish (true):  ayudame a salir
Spanish (pred):  ayudame a la cara
-
English:        help me out
Spanish (true):  echeme la mano
Spanish (pred):  ayudame a la cara
-
English:        help me out
Spanish (true):  ayudame a salir
Spanish (pred):  ayudame a la cara
-
English:        here i come
Spanish (true):  aqui vengo
Spanish (pred):  aqui esta el casa
-
English:        here i come
Spanish (tru

### 4.2. Translate an English sentence to the target language （20 points）

1. Tokenization
2. One-hot encode
3. Translate

In [25]:
input_sentence = 'I love you'

input_sequence, _ = text2sequences(max_encoder_seq_length, [input_sentence])

input_x = onehot_encode(input_sequence, max_encoder_seq_length, num_encoder_tokens)

translated_sentence = decode_sequence(input_x)

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)

source sentence is: I love you
translated sentence is: tom esta aqui



# 5. Evaluate the translation using BLEU score

- We have already translated from English to target language, but how can we evaluate the performance of our model quantitatively? 

- In this section, you need to re-train the model we built in secton 3 and then evaluate the bleu score on testing dataset.

Reference:

https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

https://en.wikipedia.org/wiki/BLEU

#### Hint:
- You may use packages to calculate bleu score, e.g., sentence_bleu() from nltk package.

### 5.1. Partition the dataset to training, validation, and test. Build new token index. (10 points)

1. You may try to load more data/lines from text file.

- Randomly partition the dataset to training, validation, and test.

In [74]:
clean_pairs = clean_data(pairs)[0:1000, :]
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)

print("Max Encoder Sequence Length:", max_encoder_seq_length)
print("Max Decoder Sequence Length:", max_decoder_seq_length)

Max Encoder Sequence Length: 9
Max Decoder Sequence Length: 30


In [75]:
x_train, x_test, y_train, y_test = train_test_split(input_texts, target_texts, test_size = 0.2, random_state = 0)

print('Length of x_train:  ' + str(x_train.shape))
print('Length of x_test: ' + str(x_test.shape))

print('Length of y_train:  ' + str(len(y_train)))
print('Length of y_test: ' + str(len(y_test)))

print(x_train[:4])
print(y_train[:4])
print(x_test[:4])
print(y_test[:4])

Length of x_train:  (800,)
Length of x_test: (200,)
Length of y_train:  800
Length of y_test: 200
['i can run' 'take tom' 'get lost' 'take care']
['\tpuedo correr\n', '\tllevate a tom\n', '\tlargate\n', '\tte cuidas\n']
['thank you' 'keep this' 'back off' 'who came']
['\tgracias a ti\n', '\tguarde esto\n', '\taparta\n', '\tquien vino\n']


In [76]:
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size = 0.2, random_state = 0)

print('Length of x_tr:  ' + str(x_tr.shape))
print('Length of x_val: ' + str(x_val.shape))

print('Length of y_tr:  ' + str(len(y_tr)))
print('Length of y_val: ' + str(len(y_val)))

print(x_tr[:4])
print(y_tr[:4])
print(x_val[:4])
print(y_val[:4])

Length of x_tr:  (640,)
Length of x_val: (160,)
Length of y_tr:  640
Length of y_val: 160
['help me' 'keep warm' 'it rained' 'i hate it']
['\tayudame\n', '\tmantente abrigado\n', '\tllovio\n', '\tme la baja\n']
['i am old' 'who won' 'i talked' 'use this']
['\testoy viejo\n', '\tquien gano\n', '\thable\n', '\tusa esto\n']



- Evaluate the BLEU score using the test set. Report the average.

In [77]:
score = 0;
for i, sentence in enumerate(x_test):
  input_sequence, _ = text2sequences(max_encoder_seq_length, [sentence])
  input_x = onehot_encode(input_sequence, max_encoder_seq_length, num_encoder_tokens)
  translated_sentence = decode_sequence(input_x)
  score += sentence_bleu(translated_sentence,y_test[i],[1])
  # print("Sentence:", sentence, "Translation:", translated_sentence, "True:", y_test[i], "Score:", sentence_bleu(translated_sentence,y_test[i],[1]))
score /= len(x_test)
print("Average BLEU Score using the test set:", score)

Average BLEU Score using the test set: 0.22661638275724968


2. Convert text to sequences and build token index using training data.
3. One-hot encode your training and validation text sequences.

In [78]:
x_train_encoder_input_seq, x_train_input_token_index = text2sequences(max_encoder_seq_length, x_train)
y_train_decoder_input_seq, y_train_target_token_index = text2sequences(max_decoder_seq_length, y_train)

In [79]:
num_encoder_tokens = len(x_train_input_token_index) + 1
num_decoder_tokens = len(y_train_target_token_index) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))

num_encoder_tokens: 28
num_decoder_tokens: 28


In [80]:
x_train_input_data = onehot_encode(x_train_encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
y_train_input_data = onehot_encode(y_train_decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

# encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
# decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

In [81]:
print("x_train sequence vs orig encoder sequence:", x_train_encoder_input_seq.shape, encoder_input_seq.shape)
print("y_train sequence vs orig decoder sequence:", y_train_decoder_input_seq.shape, decoder_input_seq.shape)

print("x_train input vs orig encoder input data:", x_train_input_data.shape, encoder_input_data.shape)
print("y_train input vs orig decoder input data:", y_train_input_data.shape, decoder_input_data.shape)

x_train sequence vs orig encoder sequence: (800, 9) (20000, 18)
y_train sequence vs orig decoder sequence: (800, 30) (20000, 55)
x_train input vs orig encoder input data: (800, 9, 28) (20000, 18, 28)
y_train input vs orig decoder input data: (800, 30, 28) (20000, 55, 30)


In [82]:
y_train_decoder_target_seq = numpy.zeros(y_train_decoder_input_seq.shape)
y_train_decoder_target_seq[:, 0:-1] = y_train_decoder_input_seq[:, 1:]
y_train_decoder_target_data = onehot_encode(y_train_decoder_target_seq, 
                                    max_decoder_seq_length, 
                                    num_decoder_tokens)

In [83]:
print("y_train decoder target vs orig decoder target:", y_train_decoder_target_seq.shape, decoder_target_seq.shape)
print("y_train decoder target data vs orig decoder data:", y_train_decoder_target_data.shape, decoder_target_data.shape)

y_train decoder target vs orig decoder target: (800, 30) (20000, 55)
y_train decoder target data vs orig decoder data: (800, 30, 28) (20000, 55, 30)


In [None]:
# x_tr_encoder_input_seq, new_input_token_index = text2sequences(max_encoder_seq_length, x_tr)
# y_tr_decoder_input_seq, new_target_token_index = text2sequences(max_decoder_seq_length, y_tr)

# x_tr_input_data = onehot_encode(x_tr_encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
# y_tr_input_data = onehot_encode(y_tr_decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

# x_val_encoder_input_seq, _ = text2sequences(max_encoder_seq_length, x_val)
# x_val_input_data = onehot_encode(x_val_encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
# y_val_decoder_input_seq, _ = text2sequences(max_decoder_seq_length, y_val)
# y_val_input_data = onehot_encode(y_val_decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

# y_tr_decoder_target_seq = numpy.zeros(y_tr_decoder_input_seq.shape)
# y_tr_decoder_target_seq[:, 0:-1] = y_tr_decoder_input_seq[:, 1:]
# y_tr_decoder_target_data = onehot_encode(y_tr_decoder_target_seq, 
#                                     max_decoder_seq_length, 
#                                     num_decoder_tokens)

# print('shape of x_tr_encoder_input_seq: ' + str(x_tr_encoder_input_seq.shape))
# print('shape of new_token_index: ' + str(len(new_input_token_index)))
# print('shape of x_tr_input_data: ' + str(x_tr_input_data.shape))

# print('shape of x_val_encoder_input_seq: ' + str(x_val_encoder_input_seq.shape))
# print('shape of new_token_index: ' + str(len(new_input_token_index)))
# print('shape of x_val_input_data: ' + str(x_val_input_data.shape))

# print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
# print('shape of target_token_index: ' + str(len(target_token_index)))

### 5.2 Retrain your previous Bidirectional LSTM model with training and validation data and tune the parameters (learning rate, optimizer, etc) based on validation score. (25 points)

1. Use the model structure in section 3 to train a new model with new training and validation datasets.
2. Based on validation BLEU score or loss to tune parameters.

num_encoder_tokens: 28
num_decoder_tokens: 27


In [84]:
latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens), 
                       name='encoder_inputs')

# set the LSTM layer
encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs, 
                      outputs=[state_h, state_c],
                      name='encoder')
encoder_model.summary()

Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
bidirectional_3 (Bidirectional) [(None, 512), (None, 583680      encoder_inputs[0][0]             
__________________________________________________________________________________________________
concatenate_6 (Concatenate)     (None, 512)          0           bidirectional_3[0][1]            
                                                                 bidirectional_3[0][3]            
__________________________________________________________________________________________________
concatenate_7 (Concatenate)     (None, 512)          0           bidirectional_3[0][2]      

In [85]:
# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim*2,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim*2,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x, 
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

decoder_model.summary()

Model: "decoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
decoder_input_h (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
decoder_input_c (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1107968     decoder_input_x[0][0]            
                                                                 decoder_input_h[0][0]      

In [86]:
# num_decoder_tokens

In [87]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x], 
              outputs=decoder_pred, 
              name='model_training')
model.summary()

Model: "model_training"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
decoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
encoder (Functional)            [(None, 512), (None, 583680      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1107968     decoder_input_x[0][0]            
                                                                 encoder[0][0]       

In [88]:
print('shape of ours vs encoder_input_data', x_train_input_data.shape, encoder_input_data.shape)
print('shape of ours vs decoder_input_data', y_train_input_data.shape, decoder_input_data.shape)
print('shape of ours vs decoder_target_data', y_train_decoder_target_data.shape, decoder_target_data.shape)

shape of ours vs encoder_input_data (800, 9, 28) (20000, 18, 28)
shape of ours vs decoder_input_data (800, 30, 28) (20000, 55, 30)
shape of ours vs decoder_target_data (800, 30, 28) (20000, 55, 30)


In [89]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit([x_train_input_data, y_train_input_data],  # training data
          y_train_decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=16, validation_split=0.2)

model.save('seq2seq.h5')

Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


### 5.3 Evaluate the BLEU score using the test set. (15 points)

1. Use trained model above to calculate the BLEU score with testing dataset.

In [90]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in x_train_input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in y_train_target_token_index.items())

In [91]:
score = 0;
for i, sentence in enumerate(x_test):
  print(i+1, "/", len(x_test))
  print("Sentence:", sentence, "True:", y_test[i])
  input_sequence, _ = text2sequences(max_encoder_seq_length, [sentence])
  input_x = onehot_encode(input_sequence, max_encoder_seq_length, num_encoder_tokens)
  translated_sentence = decode_sequence(input_x)
  print("Translation:", translated_sentence, "Score:", sentence_bleu(translated_sentence,y_test[i],[1]))
  score += sentence_bleu(translated_sentence, y_test[i], [1])
score /= len(x_test)
print("Average BLEU Score using the test set:", score)

1 / 200
Sentence: thank you True: 	gracias a ti

Translation: este eno
 Score: 0.2857142857142857
2 / 200
Sentence: keep this True: 	guarde esto

Translation: este eno
 Score: 0.46153846153846156
3 / 200
Sentence: back off True: 	aparta

Translation: este eno
 Score: 0.25
4 / 200
Sentence: who came True: 	quien vino

Translation: este eno
 Score: 0.4166666666666667
5 / 200
Sentence: how awful True: 	que horror

Translation: este eno
 Score: 0.3333333333333333
6 / 200
Sentence: stop here True: 	detengase aqui

Translation: eete e o
 Score: 0.25
7 / 200
Sentence: wait True: 	esperate

Translation: soy aa
 Score: 0.3
8 / 200
Sentence: its me True: 	soy yo

Translation: este eno
 Score: 0.5
9 / 200
Sentence: be quiet True: 	estate quieto

Translation: este eno
 Score: 0.4
10 / 200
Sentence: i grinned True: 	sonrei

Translation: este eno
 Score: 0.625
11 / 200
Sentence: stand up True: 	parate

Translation: este eno
 Score: 0.375
12 / 200
Sentence: who paid True: 	quien pago

Translation: es

Translation: este eno
 Score: 0.45454545454545453
92 / 200
Sentence: stand by True: 	preparate

Translation: este eno
 Score: 0.2727272727272727
93 / 200
Sentence: read this True: 	lea esto

Translation: este eno
 Score: 0.6
94 / 200
Sentence: wake up True: 	despierta

Translation: este eno
 Score: 0.36363636363636365
95 / 200
Sentence: feel this True: 	tenta esto

Translation: este eno
 Score: 0.5833333333333334
96 / 200
Sentence: she runs True: 	ella corre

Translation: este eno
 Score: 0.3333333333333333
97 / 200
Sentence: you lost True: 	perdio usted

Translation: este eno
 Score: 0.42857142857142855
98 / 200
Sentence: no way True: 	minga

Translation: este eno
 Score: 0.2857142857142857
99 / 200
Sentence: humor me True: 	complaceme

Translation: este eno
 Score: 0.25
100 / 200
Sentence: forget me True: 	olvidate de mi

Translation: eete eno
 Score: 0.3125
101 / 200
Sentence: im happy True: 	soy feliz

Translation: este eno
 Score: 0.45454545454545453
102 / 200
Sentence: leave tom 

Translation: este eno
 Score: 0.5
181 / 200
Sentence: im angry True: 	estoy enojada

Translation: este eno
 Score: 0.4666666666666667
182 / 200
Sentence: thats me True: 	ese soy yo

Translation: este eno
 Score: 0.4166666666666667
183 / 200
Sentence: kiss tom True: 	besa a tomas

Translation: este eno
 Score: 0.42857142857142855
184 / 200
Sentence: dont go True: 	no te vayas

Translation: este eno
 Score: 0.5384615384615384
185 / 200
Sentence: take this True: 	tomen esto

Translation: este eno
 Score: 0.5833333333333334
186 / 200
Sentence: read this True: 	lee esto

Translation: este eno
 Score: 0.6
187 / 200
Sentence: i like it True: 	me gusta

Translation: este eno
 Score: 0.5
188 / 200
Sentence: its his True: 	es suyo

Translation: este eno
 Score: 0.5555555555555556
189 / 200
Sentence: im broke True: 	estoy sin blanca

Translation: este eno
 Score: 0.3888888888888889
190 / 200
Sentence: hold on True: 	sujeta

Translation: este eno
 Score: 0.5
191 / 200
Sentence: its here True: 	est