# Home 4: Build a seq2seq model for machine translation.

### Name: Sarita Hedaya

### Task: Translate English to French

## 0. You will do the following:

1. Read and run my code.
2. Complete the code in Section 1.1 and Section 4.2.

    * Translating English to **German** is not acceptable! Choose another language.
    
3. **Make improvements.** Directly modify the code in Section 3. Do at least one of the following. Doing more earns up to 2 bonus points toward the total.

    * Bi-LSTM instead of LSTM
    
    * Multi-task learning (e.g., both English to French and English to Spanish)
    
    * Attention
    
4. Evaluate the translation using the BLEU score. 

    * Optional. Up to 1 bonus point toward the total.
    
5. Convert the notebook to .HTML file. 

    * The HTML file must contain the code and the output after execution.

6. Put the .HTML file in your Google Drive, Dropbox, or Github repo. 

7. Submit the link to the HTML file to Canvas.    


### Hint: 

To implement ```Bi-LSTM```, you will need the following code to build the encoder; the decoder won't be much different.

In [0]:
from keras.layers import LSTM, Bidirectional, Concatenate

encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

### Hint: 

To implement multi-task training, you can refer to ```Section 7.1.3 Multi-output models``` of the textbook, ```Deep Learning with Python```.

## 1. Data preparation

1. Download data (e.g., "deu-eng.zip") from http://www.manythings.org/anki/
2. Unzip the .ZIP file.
3. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".

### 1.1. Load and clean text


In [0]:
import re
import string
from unicodedata import normalize
import numpy

# load doc into memory
def load_doc(filename):
    # open the file read-only; the context manager closes it afterwards
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        text = file.read()
    return text


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return numpy.array(cleaned)
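
To make the effect of the cleaning steps concrete, here is a minimal sketch that applies the same steps to a single sentence. The helper name `clean_sentence` and the two sample sentences are illustrative assumptions, not part of the assignment code.

```python
import re
import string
from unicodedata import normalize

def clean_sentence(sentence):
    """Apply the same cleaning steps as clean_data to one sentence (hypothetical helper)."""
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    # strip accents: decompose, then drop the non-ASCII combining marks
    text = normalize('NFD', sentence).encode('ascii', 'ignore').decode('UTF-8')
    # lowercase each token and remove punctuation
    words = [w.lower().translate(table) for w in text.split()]
    # remove non-printable characters
    words = [re_print.sub('', w) for w in words]
    # drop tokens that are empty or contain digits
    return ' '.join(w for w in words if w.isalpha())

print(clean_sentence("Qu'est-ce que c'est ?"))  # questce que cest
print(clean_sentence("Où est-il ?"))            # ou estil
```

Note that accents, apostrophes, and hyphens all disappear, which is why the cleaned French text contains forms such as "questce" and "estil".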

#### Fill the following blanks:

Import the French text.

In [0]:
# e.g., filename = 'Data/deu.txt'
filename = "/content/data/fra.txt"

# e.g., n_train = 20000
n_train = 20000

In [0]:
# load dataset
doc = load_doc(filename)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)

# clean sentences
clean_pairs = clean_data(pairs)[0:n_train, :]

In [9]:
for i in range(3000, 3010):
    print('[' + clean_pairs[i, 0] + '] => [' + clean_pairs[i, 1] + ']')

[whats that] => [quest cela]
[whats this] => [questce que cest]
[whats this] => [cest quoi]
[whats this] => [cest quoi ca]
[where is he] => [ou estil]
[where is it] => [ou estil]
[where is it] => [ou estelle]
[where was i] => [ou en etaisje]
[where was i] => [ou etaisje]
[wheres tom] => [ou est tom]


clean_pairs holds the English input in column 0 and the French translation in column 1.

Put the English input into "input_texts".

Add the start and stop chars to each French sequence and put the result into "target_texts".

In [10]:
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

print('Length of input_texts:  ' + str(input_texts.shape))
print('Length of target_texts: ' + str(numpy.asarray(target_texts).shape))

Length of input_texts:  (20000,)
Length of target_texts: (20000,)


In [13]:
max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))

max length of input  sentences: 17
max length of target sentences: 56


**Remark:** At this point, you have two lists of sentences: input_texts and target_texts.

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.

So far:
  

*   input_texts is a list of sentences of varying lengths
*   target_texts is a list of sentences of varying lengths


Next:
*   build a character-level vocabulary and convert each sentence to a list of integers
*   for both English and French, return the zero-padded integer matrix and the token index (a char-to-integer dictionary)



In [14]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# encode and pad sequences
def text2sequences(max_len, lines):
    tokenizer = Tokenizer(char_level=True, filters='')
    tokenizer.fit_on_texts(lines)
    seqs = tokenizer.texts_to_sequences(lines)
    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
    return seqs_pad, tokenizer.word_index


encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length, 
                                                      input_texts)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length, 
                                                       target_texts)

print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))
print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))

Using TensorFlow backend.


shape of encoder_input_seq: (20000, 17)
shape of input_token_index: 27
shape of decoder_input_seq: (20000, 56)
shape of target_token_index: 29


In [15]:
num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))

num_encoder_tokens: 28
num_decoder_tokens: 30


***Question: why + 1?***
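
One likely answer: the Keras `Tokenizer` numbers tokens starting from 1, and `pad_sequences` fills with 0 by default, so the padded matrix contains one more distinct integer than there are characters. The sketch below reproduces that behaviour in plain Python. The helpers `build_char_index` and `encode_and_pad` are hypothetical stand-ins for the Keras calls, not the Keras implementation.

```python
from collections import Counter

def build_char_index(lines):
    """Map each character to an integer starting at 1, most frequent first."""
    counts = Counter(ch for line in lines for ch in line)
    return {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common())}

def encode_and_pad(lines, index, max_len):
    """Convert lines to integer lists, right-padded with 0."""
    return [[index[ch] for ch in line] + [0] * (max_len - len(line))
            for line in lines]

lines = ["go", "good"]
char_index = build_char_index(lines)
padded = encode_and_pad(lines, char_index, max_len=4)

# distinct integer codes = real characters plus the padding code 0
num_tokens = len(char_index) + 1
print(char_index)   # {'o': 1, 'g': 2, 'd': 3}
print(padded)       # [[2, 1, 0, 0], [2, 1, 1, 3]]
print(num_tokens)   # 4
```

The one-hot vectors therefore need `len(token_index) + 1` columns so that index 0 (padding) has a slot of its own.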

**Remark:** At this point, the input language and target language texts have been converted to 2 matrices. 

- Both have n_train rows.
- They have max_encoder_seq_length and max_decoder_seq_length columns, respectively.

The following cells print a sentence and its representation as a sequence.

In [16]:
target_texts[100]

'\tentre\n'

In [17]:
decoder_input_seq[100, :]

array([ 9,  1,  7,  6, 12,  1, 10,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0], dtype=int32)

### 2.2. One-hot encode

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.
- It is represented by a $n\times t \times v$ tensor ($v$ is the number of unique chars, plus one for the padding index) after the one-hot encoding.

In [18]:
from keras.utils import to_categorical

# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = numpy.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq, 
                                    max_decoder_seq_length, 
                                    num_decoder_tokens)

print(encoder_input_data.shape)
print(decoder_input_data.shape)

(20000, 17, 28)
(20000, 56, 30)


The encoder and decoder input data have shape
(number of sentences, max sequence length, vocabulary size).
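
The decoder target is the decoder input shifted left by one step, so at each position the label is the next character. A minimal plain-Python sketch of the shift and the one-hot step, using the first nine entries of `decoder_input_seq[100]` printed above; the helpers `shift_left` and `one_hot` are illustrative names, not part of the notebook:

```python
def shift_left(sequence, pad_value=0):
    """Return the sequence moved one step left, padded at the end."""
    return sequence[1:] + [pad_value]

def one_hot(sequence, vocab_size):
    """One row per time step, with a single 1 at the token's index."""
    return [[1 if j == token else 0 for j in range(vocab_size)]
            for token in sequence]

# '\tentre\n' followed by two padding zeros (truncated for readability)
decoder_input = [9, 1, 7, 6, 12, 1, 10, 0, 0]
decoder_target = shift_left(decoder_input)
target_onehot = one_hot(decoder_target, vocab_size=30)

print(decoder_target)                             # [1, 7, 6, 12, 1, 10, 0, 0, 0]
print(len(target_onehot), len(target_onehot[0]))  # 9 30
```

Padding positions are encoded as a 1 in column 0, which matches what `to_categorical` does with the zero-padded matrix.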

## 3. Build the networks (for training)

- Build encoder, decoder, and connect the two modules to get "model". 

- Fit the model on the bilingual data to train the parameters in the encoder and decoder.

### 3.1. Encoder network

- Input:  one-hot encode of the input language

- Return: 

    -- output (all the hidden states   $h_1, \cdots , h_t$) are always discarded
    
    -- the final hidden state  $h_t$
    
    -- the final conveyor belt $c_t$

In [19]:
from keras.layers import Input, LSTM
from keras.models import Model
from keras.layers import Bidirectional, Concatenate

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens), 
                       name='encoder_inputs')

# set the LSTM layer
# encoder_lstm = LSTM(latent_dim, return_state=True, 
#                     dropout=0.5, name='encoder_lstm')
# _, state_h, state_c = encoder_lstm(encoder_inputs)


# --- Encoder Bi-LSTM --- #
encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs, 
                      outputs=[state_h, state_c],
                      name='encoder')







Print a summary and save the encoder network structure to "/content/encoder.png".

In [20]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(encoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=encoder_model, show_shapes=False,
    to_file='/content/encoder.png'
)

encoder_model.summary()

Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     (None, None, 28)     0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 512), (None, 583680      encoder_inputs[0][0]             
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           bidirectional_1[0][1]            
                                                                 bidirectional_1[0][3]            
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 512)          0           bidirectional_1[0][2]      

### 3.2. Decoder network

- Inputs:  

    -- one-hot encode of the target language
    
    -- the initial hidden state (the encoder's final $h_t$)
    
    -- the initial conveyor belt (the encoder's final $c_t$)

- Return: 

    -- output (all the hidden states) $h_1, \cdots , h_t$

    -- the final hidden state  $h_t$ (discarded in the training and used in the prediction)
    
    -- the final conveyor belt $c_t$ (discarded in the training and used in the prediction)

In [0]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim*2,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim*2,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x, 
                                                      initial_state=[decoder_input_h, decoder_input_c])

# --- Decoder Bi-LSTM --- #
# decoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
#                                   dropout=0.5, name='decoder_lstm'))
# _, forward_h, forward_c, backward_h, backward_c = decoder_bilstm(decoder_input_x)

# state_h = Concatenate()([forward_h, backward_h])
# state_c = Concatenate()([forward_c, backward_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

Print a summary and save the decoder network structure to "/content/decoder.png".

In [22]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(decoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=decoder_model, show_shapes=False,
    to_file='/content/decoder.png'
)

decoder_model.summary()

Model: "decoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input_x (InputLayer)    (None, None, 30)     0                                            
__________________________________________________________________________________________________
decoder_input_h (InputLayer)    (None, 512)          0                                            
__________________________________________________________________________________________________
decoder_input_c (InputLayer)    (None, 512)          0                                            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1112064     decoder_input_x[0][0]            
                                                                 decoder_input_h[0][0]      

### 3.3. Connect the encoder and decoder

In [0]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x], 
              outputs=decoder_pred, 
              name='model_training')

In [24]:
print(state_h)
print(decoder_input_h)

Tensor("decoder_lstm/while/Exit_2:0", shape=(?, 512), dtype=float32)
Tensor("decoder_input_h:0", shape=(?, 512), dtype=float32)


In [25]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=model, show_shapes=False,
    to_file='/content/model_training.png'
)

model.summary()

Model: "model_training"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    (None, None, 28)     0                                            
__________________________________________________________________________________________________
decoder_input_x (InputLayer)    (None, None, 30)     0                                            
__________________________________________________________________________________________________
encoder (Model)                 [(None, 512), (None, 583680      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1112064     decoder_input_x[0][0]            
                                                                 encoder[1][0]       

### 3.4. Fit the model on the bilingual dataset

- encoder_input_data: one-hot encoding of the input language

- decoder_input_data: one-hot encoding of the target language

- decoder_target_data: labels (decoder_input_data shifted left by one step)

- tune the hyper-parameters

- stop when the validation loss stops decreasing.

In [26]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(20000, 17, 28)
shape of decoder_input_data(20000, 56, 30)
shape of decoder_target_data(20000, 56, 30)


In [27]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.save_weights('model_pretrain.h5')











### Train on entire dataset.

In Section 5, where the BLEU score is computed, 20% of the data is held out for testing and the model is retrained on the remaining 80%.

In [0]:
model.load_weights('model_pretrain.h5')
model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=25, validation_split=0.2)

model.save('seq2seq.h5')

## 4. Make predictions


### 4.1. Translate English to French

1. The encoder reads a sentence (source language) and outputs its final states, $h_t$ and $c_t$.
2. Take the [start] sign "\t" and the final states $h_t$ and $c_t$ as input and run the decoder.
3. Get the new states and the predicted probability distribution.
4. Sample a char from the predicted probability distribution.
5. Take the sampled char and the new states as input and repeat the process (stop when the [stop] sign "\n" is reached).
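
The sampling step can be made more or less greedy with a temperature. A minimal sketch, assuming the same rule as the decoding code in this section (raise each probability to the power 1/temperature, then renormalize); the function names and the toy distribution are illustrative:

```python
import random

def apply_temperature(probs, temperature):
    """Raise each probability to 1/temperature and renormalize."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_index(probs, rng=random):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.2]
sharp = apply_temperature(probs, 0.2)   # low temperature: close to argmax
flat = apply_temperature(probs, 1.0)    # temperature 1: distribution unchanged

print(sharp)
print(sample_index(sharp, random.Random(0)))
```

With a temperature of 0.2 the most likely character dominates, so decoding behaves almost greedily; values closer to 1 keep more of the model's original uncertainty.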

In [0]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [0]:

def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    temperature = 0.2
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # numpy.random.multinomial casts its input to float64, which can make the
        # probabilities no longer sum to 1; cast manually first, then renormalize
        casted_output_tokens = numpy.asarray(output_tokens[0, -1, :]).astype('float64')
        tempered_output_tokens = casted_output_tokens ** (1 / temperature)        
        tempered_output_tokens_pd = tempered_output_tokens / numpy.sum(tempered_output_tokens)

        # print(tempered_output_tokens_pd.shape)
        # print(numpy.sum(tempered_output_tokens_pd))
        next_one_hot = numpy.random.multinomial(1, tempered_output_tokens_pd, 1)
        sampled_token_index = numpy.argmax(next_one_hot)

        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]

    return decoded_sentence


In [51]:
for seq_index in range(2100, 2120):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('English:       ', input_texts[seq_index])
    print('French (true): ', target_texts[seq_index][1:-1])
    print('French (pred): ', decoded_sentence[0:-1])


-
English:        i am a monk
French (true):  je suis un moine
French (pred):  laisseq tomber tes pots
-
English:        i am a twin
French (true):  jai un jumeau
French (pred):  jai pris partir
-
English:        i am better
French (true):  je vais mieux
French (pred):  repondeq ca
-
English:        i am better
French (true):  je suis mieux
French (pred):  le fairestom
-
English:        i am coming
French (true):  jarrive
French (pred):  fais sumplement le travail
-
English:        i am joking
French (true):  je plaisante
French (pred):  continue a tententre
-
English:        i am single
French (true):  je suis celibataire
French (pred):  tu es mon perton
-
English:        i am taller
French (true):  je suis plus grand
French (pred):  tom a parle
-
English:        i apologize
French (true):  je vous prie de mexcuser
French (pred):  ne me trompe pas
-
English:        i asked tom
French (true):  jai demande a tom
French (pred):  zuestce zuil peut la parte
-
English:        i assume so
Fr

### 4.2. Translate an English sentence to the target language

1. Tokenization
2. One-hot encode
3. Translate

In [53]:
input_sentence = 'why is that'

input_sequence = []

for char in input_sentence:
  input_sequence.append(input_token_index[char])

input_sequence = [input_sequence]
seqs_pad = pad_sequences(input_sequence, maxlen=max_encoder_seq_length, padding='post')
print(seqs_pad)

input_x = onehot_encode(seqs_pad, max_encoder_seq_length, num_encoder_tokens)
print(input_x.shape)


translated_sentence = decode_sequence(input_x)

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)

[[16 10 14  1  3  7  1  5 10  6  5  0  0  0  0  0  0]]
(1, 17, 28)
source sentence is: why is that
translated sentence is: zui est cela



## 5. Evaluate the translation using BLEU score

Reference: 
- https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
- https://en.wikipedia.org/wiki/BLEU


**Hint:** 

- Randomly partition the dataset to training, validation, and test. 

- Evaluate the BLEU score using the test set. Report the average.

- A reasonable BLEU score should be 0.1 ~ 0.3.
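
For very short sentences, a 4-gram BLEU without smoothing is often 0 or unstable. The following is a simplified sentence-level BLEU in plain Python, shown only to illustrate the ingredients (clipped n-gram precision, geometric mean, brevity penalty). The choice of `max_n=2` and add-one smoothing on every n-gram order are assumptions made for this sketch; it is not NLTK's `sentence_bleu` and will not reproduce its scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, candidate, max_n=2):
    """Simplified sentence BLEU: clipped precision, add-one smoothing, brevity penalty."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # add-one smoothing keeps the logarithm finite when overlap is 0
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # penalize candidates that are shorter than the reference
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_precision)

reference = ['nous', 'expliquerons']
print(simple_bleu(reference, ['nous', 'expliquerons']))
print(simple_bleu(reference, ['nous', 'regardons']))
```

A side effect of add-one smoothing is that a completely wrong translation still receives a small non-zero score, so scores from this sketch are only comparable with each other.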

***To avoid evaluating a network that was trained on the whole dataset, I re-ran the cells that define the model, which resets the weights to their random initial values from before they saw any data.***

### Split the Data into Train Val Test

In [0]:
rand_indices = numpy.random.permutation(20000)
train_indices = rand_indices[0:int(20000*.8)]
# valid_indices = rand_indices[int(20000*.6):int(20000*.8)]
test_indices = rand_indices[int(20000*.8):int(20000)]

input_train = input_texts[train_indices]
# input_valid = input_texts[valid_indices]
input_test = input_texts[test_indices]

target_train = numpy.asarray(target_texts)[train_indices]
# target_valid = numpy.asarray(target_texts)[valid_indices]
target_test = numpy.asarray(target_texts)[test_indices]
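
If a separate validation set is wanted, as in the hint, the same permutation idea extends to three parts. A minimal sketch using only the standard library; the 60/20/20 proportions, the fixed seed, and the function name `split_indices` are assumptions for illustration:

```python
import random

def split_indices(n, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle range(n) and cut it into train, validation and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return (indices[:n_train],
            indices[n_train:n_train + n_valid],
            indices[n_train + n_valid:])

train_idx, valid_idx, test_idx = split_indices(20000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The three lists are disjoint and together cover every index, so no test sentence can leak into training. Fixing the seed makes the split reproducible across runs.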


In [46]:
print(input_test[0:50])
print(target_test[0:50])


['come at once' 'go straight ahead' 'well explain' 'ive had enough'
 'its terrible' 'please say yes' 'it began to snow' 'i was busy'
 'can you do that' 'follow me' 'im buying' 'they deceived us' 'were sad'
 'tom has a dog' 'this is so hard' 'he is well off' 'i freaked out'
 'whats the hurry' 'i was kidnapped' 'youre reliable' 'lets do our job'
 'dont be rude' 'i want some paper' 'pick a card' 'we must check'
 'do ants have ears' 'cross the bridge' 'ill never change'
 'i quit smoking' 'dont be so lazy' 'he is in trouble' 'go slow'
 'youre very good' 'he had no money' 'he is an acrobat' 'that was nothing'
 'i was hammered' 'were in trouble' 'i usually walk' 'youre my enemy'
 'what a disaster' 'its a sign' 'behave like a man' 'ive found a job'
 'what is this' 'its different' 'i love doing this' 'they all stopped'
 'im still single' 'you must stay']
['\tviens ici immediatement\n' '\tva tout droit\n' '\tnous expliquerons\n'
 '\tca suffit\n' '\tcest terrible\n' '\tje vous prie de dire oui\n'

The data is currently stored as arrays of sentences (strings).

Convert them to integer matrices.

In [32]:

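# NOTE: refitting the tokenizers reassigns input_token_index and
# target_token_index, so the reverse lookups are rebuilt here as well;
# otherwise decoded characters may be mapped through a stale index
encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length, 
                                                      input_train)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length, 
                                                       target_train)
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())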



print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))
print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))

shape of encoder_input_seq: (16000, 17)
shape of input_token_index: 27
shape of decoder_input_seq: (16000, 56)
shape of target_token_index: 29


In [33]:
encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq, 
                                    max_decoder_seq_length, 
                                    num_decoder_tokens)

print(encoder_input_data.shape)
print(decoder_input_data.shape)

(16000, 17, 28)
(16000, 56, 30)


Build the networks for training

In [41]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.load_weights('model_pretrain.h5')
model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          validation_split=0.2,
          batch_size=64, epochs=25)

model.save('seq2seq_split.h5')


Train on 12800 samples, validate on 3200 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


### Run BLEU score


In [52]:
from nltk.translate.bleu_score import sentence_bleu

def find_bleu_score(reference, translation):
  """Calculates bleu score based on two sentences from the same language: one generated by the MT model and one given as reference"""
  translation_words = str(translation).split()
  reference_words = str(reference).split()

  print("The reference is ", reference_words)
  print("The translation is ", translation_words)

  score = sentence_bleu([reference_words], translation_words)
  return score


sentence_idx = 0
scores_list = []
for input_sentence in input_test:
    input_sequence = []

    for char in input_sentence:
      input_sequence.append(input_token_index[char])
    input_sequence = [input_sequence]
    input_sequence = numpy.array(input_sequence).reshape(1, len(input_sequence[0]))
    seqs_pad = pad_sequences(input_sequence, maxlen=max_encoder_seq_length, padding='post')
    input_x = onehot_encode(seqs_pad, max_encoder_seq_length, num_encoder_tokens)

    translated_sentence = decode_sequence(input_x)

    print('\n' + 'English sentence is: ' + input_sentence)

    score = find_bleu_score(target_test[sentence_idx], translated_sentence)
    scores_list.append(score)
    sentence_idx += 1
    print("Sentence number ", sentence_idx, "has a score of ", score)


print("the average of the bleu scores is:", numpy.mean(scores_list))


English sentence is: come at once
The reference is  ['viens', 'ici', 'immediatement']
The translation is  ['veneq', 'cheq', 'moi']
Sentence number  1 has a score of  0

English sentence is: go straight ahead
The reference is  ['va', 'tout', 'droit']
The translation is  ['vasy', 'attrapee', 'de', 'le', 'confiance']
Sentence number  2 has a score of  0

English sentence is: well explain
The reference is  ['nous', 'expliquerons']
The translation is  ['nous', 'regardons']
Sentence number  3 has a score of  0.8408964152537145


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().



English sentence is: ive had enough
The reference is  ['ca', 'suffit']
The translation is  ['jen', 'ai', 'preszue', 'mort']
Sentence number  4 has a score of  0

English sentence is: its terrible
The reference is  ['cest', 'terrible']
The translation is  ['cest', 'le', 'mitron']
Sentence number  5 has a score of  0.7598356856515925

English sentence is: please say yes
The reference is  ['je', 'vous', 'prie', 'de', 'dire', 'oui']
The translation is  ['veuilleq', 'vous', 'apprecir']
Sentence number  6 has a score of  0.27952792741962756

English sentence is: it began to snow
The reference is  ['il', 'commenca', 'a', 'neiger']
The translation is  ['ca', 'fait', 'partir', 'partir']
Sentence number  7 has a score of  0

English sentence is: i was busy
The reference is  ['jetais', 'occupee']
The translation is  ['jetais', 'pret']
Sentence number  8 has a score of  0.8408964152537145

English sentence is: can you do that
The reference is  ['pouvezvous', 'faire', 'cela']
The translation is  [

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().



English sentence is: this is so hard
The reference is  ['cest', 'si', 'difficile']
The translation is  ['cest', 'trop', 'binn']
Sentence number  15 has a score of  0.7598356856515925

English sentence is: he is well off
The reference is  ['il', 'est', 'riche']
The translation is  ['il', 'est', 'tout', 'a', 'fait', 'pret']
Sentence number  16 has a score of  0.5081327481546147

English sentence is: i freaked out
The reference is  ['jai', 'eu', 'les', 'foies']
The translation is  ['jai', 'pris', 'me', 'confiance']
Sentence number  17 has a score of  0.7071067811865476

English sentence is: whats the hurry
The reference is  ['pourquoi', 'cette', 'precipitation']
The translation is  ['zuelle', 'est', 'le', 'moisson']
Sentence number  18 has a score of  0

English sentence is: i was kidnapped
The reference is  ['jai', 'ete', 'kidnappe']
The translation is  ['jai', 'ete', 'inditee']
Sentence number  19 has a score of  0.7598356856515925

English sentence is: youre reliable
The reference is 

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().



English sentence is: do ants have ears
The reference is  ['les', 'fourmis', 'ontelles', 'des', 'oreilles']
The translation is  ['les', 'chans', 'pouvent', 'des', 'les']
Sentence number  26 has a score of  0.7952707287670506

English sentence is: cross the bridge
The reference is  ['traverse', 'le', 'pont']
The translation is  ['compre', 'la', 'porte']
Sentence number  27 has a score of  0

English sentence is: ill never change
The reference is  ['je', 'ne', 'changerai', 'jamais']
The translation is  ['je', 'ne', 'le', 'derai', 'jamais']
Sentence number  28 has a score of  0.6223329772884784

English sentence is: i quit smoking
The reference is  ['jai', 'arrete', 'de', 'fumer']
The translation is  ['je', 'viens', 'de', 'la', 'ganne']
Sentence number  29 has a score of  0.668740304976422

English sentence is: dont be so lazy
The reference is  ['ne', 'soyez', 'pas', 'si', 'paresseuses']
The translation is  ['ne', 'soyeq', 'pas', 'si', 'parts']
Sentence number  30 has a score of  0.622332