# Artificial Intelligence Nanodegree
## Machine Translation Project
In this notebook, sections that end with **'(IMPLEMENTATION)'** in the header indicate that the following blocks of code will require additional functionality which you must provide. Please be sure to read the instructions carefully!

## Introduction
In this notebook, you will build a deep neural network that functions as part of an end-to-end machine translation pipeline. Your completed pipeline will accept English text as input and return the French translation.

- **Preprocess** - You'll convert text to sequence of integers.
- **Models** Create models which accepts a sequence of integers as input and returns a probability distribution over possible translations. After learning about the basic types of neural networks that are often used for machine translation, you will engage in your own investigations, to design your own model!
- **Prediction** Run the model on English text.

In [1]:
%load_ext autoreload
%aimport helper, tests
%autoreload 1

In [2]:
import collections

import helper
import numpy as np
import project_tests as tests

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Using TensorFlow backend.


### Verify access to the GPU
The following test applies only if you expect to be using a GPU, e.g., while running in a Udacity Workspace or using an AWS instance with GPU support. Run the next cell, and verify that the device_type is "GPU".
- If the device is not GPU & you are running from a Udacity Workspace, then save your workspace with the icon at the top, then click "enable" at the bottom of the workspace.
- If the device is not GPU & you are running from an AWS instance, then refer to the cloud computing instructions in the classroom to verify your setup steps.

In [3]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13207787024565188850
]


## Dataset
We begin by investigating the dataset that will be used to train and evaluate your pipeline.  The most common datasets used for machine translation are from [WMT](http://www.statmt.org/).  However, that will take a long time to train a neural network on.  We'll be using a dataset we created for this project that contains a small vocabulary.  You'll be able to train your model in a reasonable time with this dataset.
### Load Data
The data is located in `data/small_vocab_en` and `data/small_vocab_fr`. The `small_vocab_en` file contains English sentences with their French translations in the `small_vocab_fr` file. Load the English and French data from these files from running the cell below.

In [4]:
# Load English data
english_sentences = helper.load_data('data/small_vocab_en')
# Load French data
french_sentences = helper.load_data('data/small_vocab_fr')

print('Dataset Loaded')

Dataset Loaded


### Files
Each line in `small_vocab_en` contains an English sentence with the respective translation in each line of `small_vocab_fr`.  View the first two lines from each file.

In [5]:
for sample_i in range(2):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))

small_vocab_en Line 1:  new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line 1:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line 2:  the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr Line 2:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .


From looking at the sentences, you can see they have been preprocessed already.  The puncuations have been delimited using spaces. All the text have been converted to lowercase.  This should save you some time, but the text requires more preprocessing.
### Vocabulary
The complexity of the problem is determined by the complexity of the vocabulary.  A more complex vocabulary is a more complex problem.  Let's look at the complexity of the dataset we'll be working with.

In [6]:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"


For comparison, _Alice's Adventures in Wonderland_ contains 2,766 unique words of a total of 15,500 words.
## Preprocess
For this project, you won't use text data as input to your model. Instead, you'll convert the text into sequences of integers using the following preprocess methods:
1. Tokenize the words into ids
2. Add padding to make all the sequences the same length.

Time to start preprocessing the data...
### Tokenize (IMPLEMENTATION)
For a neural network to predict on text data, it first has to be turned into data it can understand. Text data like "dog" is a sequence of ASCII character encodings.  Since a neural network is a series of multiplication and addition operations, the input data needs to be number(s).

We can turn each character into a number or each word into a number.  These are called character and word ids, respectively.  Character ids are used for character level models that generate text predictions for each character.  A word level model uses word ids that generate text predictions for each word.  Word level models tend to learn better, since they are lower in complexity, so we'll use those.

Turn each sentence into a sequence of words ids using Keras's [`Tokenizer`](https://keras.io/preprocessing/text/#tokenizer) function. Use this function to tokenize `english_sentences` and `french_sentences` in the cell below.

Running the cell will run `tokenize` on sample data and show output for debugging.

In [7]:
def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    # TODO: Implement
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    
    tokens = tokenizer.texts_to_sequences(x)
    return tokens, tokenizer

tests.test_tokenize(tokenize)

# Tokenize Example output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


### Padding (IMPLEMENTATION)
When batching the sequence of word ids together, each sequence needs to be the same length.  Since sentences are dynamic in length, we can add padding to the end of the sequences to make them the same length.

Make sure all the English sequences have the same length and all the French sequences have the same length by adding padding to the **end** of each sequence using Keras's [`pad_sequences`](https://keras.io/preprocessing/sequence/#pad_sequences) function.

In [8]:
def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    # TODO: Implement
    if length is None:
        length = 0
        for s in x:
            length = max(length, len(s))
            
    return pad_sequences(x, maxlen=length, dtype='int32',
                         padding='post', truncating='post', value=0)
tests.test_pad(pad)

# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12  2 13 14 15 16  3 17]
  Output: [10 11 12  2 13 14 15 16  3 17]
Sequence 3 in x
  Input:  [18 19  3 20 21]
  Output: [18 19  3 20 21  0  0  0  0  0]


### Preprocess Pipeline
Your focus for this project is to build neural network architecture, so we won't ask you to create a preprocess pipeline.  Instead, we've provided you with the implementation of the `preprocess` function.

In [9]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344


In [10]:
print('shape of preproc_english_sentences: {}'.format(preproc_english_sentences.shape))
print('shape of preproc_french_sentences:  {}'.format(preproc_french_sentences.shape))

shape of preproc_english_sentences: (137861, 15)
shape of preproc_french_sentences:  (137861, 21, 1)


## Models
In this section, you will experiment with various neural network architectures.
You will begin by training four relatively simple architectures.
- Model 1 is a simple RNN
- Model 2 is a RNN with Embedding
- Model 3 is a Bidirectional RNN
- Model 4 is an optional Encoder-Decoder RNN

After experimenting with the four simple architectures, you will construct a deeper architecture that is designed to outperform all four models.
### Ids Back to Text
The neural network will be translating the input to words ids, which isn't the final form we want.  We want the French translation.  The function `logits_to_text` will bridge the gab between the logits from the neural network to the French translation.  You'll be using this function to better understand the output of the neural network.

In [9]:
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.


### Model 1: RNN (IMPLEMENTATION)
![RNN](images/rnn.png)
A basic RNN model is a good baseline for sequence data.  In this model, you'll build a RNN that translates English to French.

In [24]:
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a basic RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Build the layers
    hidden_size = 256
    learning_rate = 1e-3
    
    # input size = (None, 21, 1)
    x = Input(shape=input_shape[1:])
    
    # output = (None, 21, 256)
    hidden = GRU(units=hidden_size, return_sequences=True)(x)
    
    # output = (None, 21, 344)
    y = TimeDistributed(Dense(french_vocab_size, activation='softmax'))(hidden)
    
    model = Model(inputs=x, outputs=y)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model
tests.test_simple_model(simple_model)

# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)

print(simple_rnn_model.summary())

simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 21, 1)             0         
_________________________________________________________________
gru_2 (GRU)                  (None, 21, 256)           198144    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 21, 344)           88408     
Total params: 286,552
Trainable params: 286,552
Non-trainable params: 0
_________________________________________________________________
None
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
new jersey est parfois calme en l' et il est est est en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


##### Save Model

In [25]:
import os

directory='./models/'
filename='simple_rnn_model.h5'

if not os.path.exists(directory):
    os.makedirs(directory)
    
simple_rnn_model.save(directory + filename)

##### Comparsion between Different Hyperparameters
Aside from hidden_size=256, I also used hidden_size and results are listed below:

| hidden_size | acc    | val_acc |
|-------------|--------|---------|
| 128         | 0.656  | 0.6564  |
| 256         | 0.6824 | 0.6841  |
| 512         | 0.7167 | 0.7276  |


### Model 2: Embedding (IMPLEMENTATION)
![RNN](images/embedding.png)
You've turned the words into ids, but there's a better representation of a word.  This is called word embeddings.  An embedding is a vector representation of the word that is close to similar words in n-dimensional space, where the n represents the size of the embedding vectors.

In this model, you'll create a RNN model using embedding.

In [30]:
from keras.backend import expand_dims

def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Implement
    lr = 1e-3
    embed_size = 128
    hidden_size = 256
    
    # input = (None, 21)
    x = Input(shape=input_shape[1:])

    # output = (None, 21, 128)
    embed = Embedding(input_dim=english_vocab_size, 
                      output_dim=embed_size,
                     input_length=output_sequence_length)(x)
    
    # output = (None, 21, 256)
    rnn = GRU(units=hidden_size, return_sequences=True)(embed)
    
    # output = (None, 21, 344)
    y = TimeDistributed(Dense(french_vocab_size, activation='softmax'))(rnn)
    
    model = Model(inputs=x, outputs=y)
    model.compile(optimizer=Adam(lr=lr), 
                  loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    
    return model
tests.test_embed_model(embed_model)


# TODO: Reshape the input
## 1. pad sequences
## 2. No need to add dim because Embedding will do this
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
# tmp_x = expand_dims(tmp_x, axis=-1)

# TODO: Train the neural network
embed_rnn_model = embed_model(tmp_x.shape, 
                              max_french_sequence_length,
                              english_vocab_size,
                                french_vocab_size)

print(embed_rnn_model.summary())

embed_rnn_model.fit(x=tmp_x, y=preproc_french_sentences,
               batch_size=1024, epochs=10, validation_split=0.2)

# TODO: Print prediction(s)
print(logits_to_text(embed_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_37 (InputLayer)        (None, 21)                0         
_________________________________________________________________
embedding_36 (Embedding)     (None, 21, 128)           25472     
_________________________________________________________________
gru_36 (GRU)                 (None, 21, 256)           295680    
_________________________________________________________________
time_distributed_21 (TimeDis (None, 21, 344)           88408     
Total params: 409,560
Trainable params: 409,560
Non-trainable params: 0
_________________________________________________________________
None
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
new jersey est parfois calme en l' de l' automne et neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <P

#### Save Model

In [34]:
import os
directory='./models/'
filename='embed_rnn_model.h5'

if not os.path.exists(directory):
    os.makedirs(directory)
    
embed_rnn_model.save(directory + filename)

### Model 3: Bidirectional RNNs (IMPLEMENTATION)
![RNN](images/bidirectional.png)
One restriction of a RNN is that it can't see the future input, only the past.  This is where bidirectional recurrent neural networks come in.  They are able to see the future data.

#### Note:
The performance comparison between different merge_mode in Bidirectional() can be found in the ["How to Develop a Bidirectional LSTM For Sequence Classification in Python with Keras
"](https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/)

In [26]:
import pdb
from keras.backend import expand_dims

def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size, merge_mode='concat'):
    """
    Build and train a bidirectional RNN model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Implement
    hidden_size = 256
    lr = 1e-3
    
    # input = (None, 21, 1)
    x = Input(shape=input_shape[1:])
    
#     pdb.set_trace()
    
    # output = (None, 21, 256)
    rnn = Bidirectional(GRU(units=hidden_size, return_sequences=True),
                       merge_mode=merge_mode)(x)
    
    # output = (None, 21, 344)
    y = TimeDistributed(Dense(french_vocab_size, activation='softmax'))(rnn)
    model = Model(inputs=x, outputs=y)
    model.compile(optimizer=Adam(lr=lr),
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    
    return model
tests.test_bd_model(bd_model)

# pad english sequences. Shape=(None, 15) => (None, 21)
tmp_x = pad(preproc_english_sentences, max_french_sequence_length )

# add dim at last axis  shape=(None, 21) => (None, 21, 1)
# cannot use expand_dims. It triggers the following error message.
# TypeError: float() argument must be a string or a number, not 'Dimension'

# tmp_x = expand_dims(tmp_x, axis=-1)  
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))


# pdb.set_trace()

# TODO: Train and Print prediction(s)
bd_rnn_model = bd_model(tmp_x.shape,
                       max_french_sequence_length,
                       english_vocab_size,
                       french_vocab_size) 

print(bd_rnn_model.summary())

bd_rnn_model.fit(tmp_x, preproc_french_sentences,
                batch_size=1024,
                epochs=10,
                validation_split=0.2)

print(logits_to_text(bd_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 21, 1)             0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 21, 512)           396288    
_________________________________________________________________
time_distributed_4 (TimeDist (None, 21, 344)           176472    
Total params: 572,760
Trainable params: 572,760
Non-trainable params: 0
_________________________________________________________________
None
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
new jersey est parfois calme en l' de il et il est en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


#### Save Model

In [29]:
import os

directory='./models/'
filename='bd_rnn_model_concat_wo_embed.h5'

if not os.path.exists(directory):
    os.makedirs(directory)
    
bd_rnn_model.save(directory + filename)

### (Extra Experiment) Birectional RNN with Embedding
The accuracy of birectional RNN w/o embedding layer is inferior than simple RNN w/ embedding. This is wiered. As a result, I add embedding layer to BD RNN.

In [13]:
import pdb
from keras.backend import expand_dims

def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size, merge_mode='concat'):
    """
    Build and train a bidirectional RNN model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Implement
    embed_size = 128
    hidden_size = 256
    lr = 1e-3
    
    # input = (None, 21)
    x = Input(shape=input_shape[1:])
    
    # output = (None, 21, 128)
    embed = Embedding(input_dim=english_vocab_size, 
                      output_dim=embed_size,
                     input_length=output_sequence_length)(x)
    
    # output = (None, 21, 256)
    rnn = Bidirectional(GRU(units=hidden_size, return_sequences=True),
                       merge_mode=merge_mode)(embed)
    
    # output = (None, 21, 344)
    y = TimeDistributed(Dense(french_vocab_size, activation='softmax'))(rnn)
    model = Model(inputs=x, outputs=y)
    model.compile(optimizer=Adam(lr=lr),
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    
    return model
# tests.test_bd_model(bd_model)

# pad english sequences. Shape=(None, 15) => (None, 21)
tmp_x = pad(preproc_english_sentences, max_french_sequence_length )

# add dim at last axis  shape=(None, 21) => (None, 21, 1)
# cannot use expand_dims. It triggers the following error message.
# TypeError: float() argument must be a string or a number, not 'Dimension'

# tmp_x = expand_dims(tmp_x, axis=-1)  
# tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))


# pdb.set_trace()

# TODO: Train and Print prediction(s)
bd_rnn_model = bd_model(tmp_x.shape,
                       max_french_sequence_length,
                       english_vocab_size,
                       french_vocab_size,
                       merge_mode='concat') 

print(bd_rnn_model.summary())

bd_rnn_model.fit(tmp_x, preproc_french_sentences,
                batch_size=1024,
                epochs=10,
                validation_split=0.2)

print(logits_to_text(bd_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 21)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 21, 128)           25472     
_________________________________________________________________
bidirectional_2 (Bidirection (None, 21, 512)           591360    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 21, 344)           176472    
Total params: 793,304
Trainable params: 793,304
Non-trainable params: 0
_________________________________________________________________
None
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
new jersey est parfois calme au l' automne et il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <

#### Save Model

In [14]:
import os

directory='./models/'
filename='bd_rnn_model_concat.h5'

if not os.path.exists(directory):
    os.makedirs(directory)
    
bd_rnn_model.save(directory + filename)

### Comparision of Different Merge Mode
Aside from the default 'concat' mode, I also tried 'sum' mode. The difference between these 2 modes is insignificant. The results are listed below:

| merge_mode | acc    | val_acc |
|------------|--------|---------|
| concat     | 0.9394 | 0.9429  |
| sum        | 0.9399 | 0.9415  |


### Model 4: Encoder-Decoder (OPTIONAL)
Time to look at encoder-decoder models.  This model is made up of an encoder and decoder. The encoder creates a matrix representation of the sentence.  The decoder takes this matrix as input and predicts the translation as output.

Create an encoder-decoder model in the cell below.

#### Note: 
The following snippet references article [Encoder-Decoder Models for Text Summarization in Keras](https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/)

In [30]:
import pdb

def encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train an encoder-decoder model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # OPTIONAL: Implement
    lr = 1e-3
    encoder_size = 256
    decoder_size = 256
    
    # shape = (None, 21, 1)
    x = Input(shape=input_shape[1:])
    
    # Encoder
    
    # ouput = (None, 256)
    rnn_encoder = GRU(units=decoder_size, return_sequences=False)(x)
    
#     pdb.set_trace()
    
    # output = (None, 21, 256)
    encoder_out = RepeatVector(output_sequence_length)(rnn_encoder)
    
    # Decoder
    # output = (None, 21, 256)
    rnn_decoder = GRU(units=decoder_size, return_sequences=True)(encoder_out)
    
    # output = (None, 21, 344)
    y = TimeDistributed(Dense(french_vocab_size, activation='softmax'))(rnn_decoder)
    
    model = Model(inputs=x, outputs=y)
    model.compile(optimizer=Adam(lr=lr),
                  loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    return model
tests.test_encdec_model(encdec_model)

# pad english sentences from (None, 15) to (None, 21)
tmp_x = pad(preproc_english_sentences, max_french_sequence_length )

# add dim by reshape()  (shpae = (None, 21, 1))
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# OPTIONAL: Train and Print prediction(s)
encdec_rnn_model = encdec_model(tmp_x.shape,
                               max_french_sequence_length,
                               english_vocab_size,
                               french_vocab_size)

print(encdec_rnn_model.summary())

encdec_rnn_model.fit(tmp_x, preproc_french_sentences,
                batch_size=1024,
                epochs=10,
                validation_split=0.2)

print(logits_to_text(encdec_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 21, 1)             0         
_________________________________________________________________
gru_7 (GRU)                  (None, 256)               198144    
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 21, 256)           0         
_________________________________________________________________
gru_8 (GRU)                  (None, 21, 256)           393984    
_________________________________________________________________
time_distributed_6 (TimeDist (None, 21, 344)           88408     
Total params: 680,536
Trainable params: 680,536
Non-trainable params: 0
_________________________________________________________________
None
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epo

In [31]:
import os

directory='./models/'
filename='encdec_rnn_model.h5'

if not os.path.exists(directory):
    os.makedirs(directory)
    
encdec_rnn_model.save(directory + filename)

### (Extra Experiment) Encoder-Decoder with Embedding Layer

In [16]:
import pdb

def encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train an encoder-decoder model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # OPTIONAL: Implement
    lr = 1e-3
    embed_size = 128
    encoder_size = 256
    decoder_size = 256
    
    # shape = (None, 21)
    x = Input(shape=input_shape[1:])
    
    # Encoder
    
    # shape = (None, 21, 128)
    embed = Embedding(input_dim=english_vocab_size,
                     output_dim=embed_size,
                     input_length=output_sequence_length)(x)
    
    # ouput = (None, 256)
    rnn_encoder = GRU(units=decoder_size, return_sequences=False)(embed)
    
#     pdb.set_trace()
    
    # output = (None, 21, 256)
    encoder_out = RepeatVector(output_sequence_length)(rnn_encoder)
    
    # Decoder
    # output = (None, 21, 256)
    rnn_decoder = GRU(units=decoder_size, return_sequences=True)(encoder_out)
    
    # output = (None, 21, 344)
    y = TimeDistributed(Dense(french_vocab_size, activation='softmax'))(rnn_decoder)
    
    model = Model(inputs=x, outputs=y)
    model.compile(optimizer=Adam(lr=lr),
                  loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    return model
# tests.test_encdec_model(encdec_model)

# pad english sentences from (None, 15) to (None, 21)
tmp_x = pad(preproc_english_sentences, max_french_sequence_length )

# add dim by reshape()  (shpae = (None, 21, 1))
# tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# OPTIONAL: Train and Print prediction(s)
encdec_rnn_model = encdec_model(tmp_x.shape,
                               max_french_sequence_length,
                               english_vocab_size,
                               french_vocab_size)

print(encdec_rnn_model.summary())

encdec_rnn_model.fit(tmp_x, preproc_french_sentences,
                batch_size=1024,
                epochs=10,
                validation_split=0.2)

print(logits_to_text(encdec_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 21)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 21, 128)           25472     
_________________________________________________________________
gru_11 (GRU)                 (None, 256)               295680    
_________________________________________________________________
repeat_vector_6 (RepeatVecto (None, 21, 256)           0         
_________________________________________________________________
gru_12 (GRU)                 (None, 21, 256)           393984    
_________________________________________________________________
time_distributed_6 (TimeDist (None, 21, 344)           88408     
Total params: 803,544
Trainable params: 803,544
Non-trainable params: 0
_________________________________________________________________
None

#### Save Model

In [17]:
import os

directory='./models/'
filename='encdec_rnn_model_embed.h5'

if not os.path.exists(directory):
    os.makedirs(directory)
    
encdec_rnn_model.save(directory + filename)

### Encoder-Decoder Recursive Model 

For more information of this model, please check [How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras](https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/).

##### Preprocesssing 
Add "< BOS >" (Begin Of Sequence) and < EOS > (End Of Sequence) to French senteneces.

In [10]:
import pdb

def preprocess_with_special_tokens(x, y, bos='0bos0', eos='0eos0'):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :param bos: 'Begin Of Sequence' token
    :param eos: 'End Of Sequence' token
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    len_y = len(y)
    
    # tokenize sentences
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y + [bos , eos])
    
#     pdb.set_trace()
    
    # remove sentence with special tokens only
    preprocess_y = preprocess_y[:len_y]

    # wrap sequence with <BOS> (Begin of Sequence) and <EOS> (End Of Sequence)
    token_bos = y_tk.word_index[bos]
    token_eos = y_tk.word_index[eos]
    
    preprocess_y_wrapped = []
    
    for p in preprocess_y:
        preprocess_y_wrapped.append([token_bos] + p + [token_eos])
#         pdb.set_trace()
    
    preprocess_y = preprocess_y_wrapped
    
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y_reshaped = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, preprocess_y_reshaped, x_tk, y_tk

preproc_english_sentences_eos, preproc_french_sentences_eos, \
    preproc_french_sentences_eos_reshaped, english_tokenizer_eos, \
    french_tokenizer_eos = preprocess_with_special_tokens(english_sentences, french_sentences)
    
max_english_sequence_length_eos = preproc_english_sentences_eos.shape[1]
max_french_sequence_length_eos = preproc_french_sentences_eos.shape[1]
english_vocab_size_eos = len(english_tokenizer_eos.word_index)
french_vocab_size_eos = len(french_tokenizer_eos.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length_eos)
print("Max French sentence length:", max_french_sequence_length_eos)
print("English vocabulary size:", english_vocab_size_eos)
print("French vocabulary size:", french_vocab_size_eos)

Data Preprocessed
Max English sentence length: 15
Max French sentence length: 23
English vocabulary size: 199
French vocabulary size: 346


#### Build Model

In [11]:
from keras.layers import LSTM

def recursive_encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train an encoder-decoder model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    embed_encoder_size = 128
    embed_decoder_size = 128
    hidden_size = 256

    lr = 1e-3
    
    ############################
    # Enocder 
    ############################
    # expected shape = (None, <15)
    encoder_inputs = Input(shape=(None,),
                          name='input_encoder')
    
    # expected shape in training and inference: (None, <15, 128)
    encoder_embed = Embedding(input_dim=english_vocab_size,
                             output_dim=embed_encoder_size,
                             name="embed_encoder")(encoder_inputs)
    
    # expected shape in training and inference: (None, 256)  (dim=2, because of return_sequences=False)
    encoder_rnn, encoder_states_h, encoder_states_c = LSTM(units=hidden_size,
                                                              return_sequences=False,
                                                              return_state=True,
                                                              name='lstm_encoder')(encoder_embed)
    # concate states for exporting
    encoder_states = [encoder_states_h, encoder_states_c]
    
    ############################
    # Decoder for training
    ############################
    # Declare shared inputs
    # expected shape: (None, <(21+1) )
    decoder_inputs = Input(shape=(None,),
                          name='input_decoder')
    
    # Declare Shared Layer: Embedding layer
    # expected shape in training:   (None, <(21+1), 128)
    #          shape in predicting: (None, 1, 128)
    decoder_embed_layer = Embedding(input_dim=french_vocab_size,
                                 output_dim=embed_decoder_size,
                                   name='embed_decoder')
    
    # wire embed layer
    decoder_embed = decoder_embed_layer(decoder_inputs)
    
    # Declare Shared Layer: RNN
    # expected shape in training:   (None, >21, 256)
    #          shape in predicting: (None, 1, 256)
    decoder_rnn_layer = LSTM(units=hidden_size, 
                             return_sequences=True,
                            return_state=True,
                            name='lstm_decoder')
    
    # wire RNN layer and gives initial states
    decoder_rnn, _, _ = decoder_rnn_layer(decoder_embed,
                                          initial_state=encoder_states)
    
    # Declare Shared layer: FC
    decoder_fc = Dense(french_vocab_size, activation='softmax',
                      name='dense_decoder')
    
    # wire RNN and FC layer
    decoder_outputs = decoder_fc(decoder_rnn)
    
    ############################
    # Decoder for inference
    ############################
    
    inf_decoder_states_input_h = Input(shape=(hidden_size,), name='Decoder_State_H')
    inf_decoder_states_input_c = Input(shape=(hidden_size,), name='Decoder_State_C')
    inf_decoder_states_inputs = [inf_decoder_states_input_h, inf_decoder_states_input_c]
    
    # wire inputs and embed layer
    inf_decoder_embed = decoder_embed_layer(decoder_inputs)
    
    # wire embed layer and RNN layer
    inf_decoder_rnn, inf_decoder_state_h, inf_decoder_state_c = \
                decoder_rnn_layer(inf_decoder_embed, 
                                  initial_state=inf_decoder_states_inputs)
    
    # concate hidden states for exporting
    inf_decoder_states_outputs = [inf_decoder_state_h, inf_decoder_state_c]
    
    # wire RNN layer and FC layer
    inf_decoder_outputs = decoder_fc(inf_decoder_rnn)
    
    ############################
    # Export 3 models
    # class model: for trained (w/ optimizer)
    # class inf_encoder: for inference (w/o optimizer)
    # class inf_decoder: for inference (w/o optimizer)
    model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)
    model.compile(optimizer=Adam(lr=lr),
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    
    encoder = Model(inputs=encoder_inputs,
                   outputs=encoder_states)
    
    inf_decoder = Model(inputs=[decoder_inputs] + inf_decoder_states_inputs, 
                    outputs=[inf_decoder_outputs] + inf_decoder_states_outputs)
    

    return  model, encoder, inf_decoder


recursive_encdec_rnn_model, encoder, inf_decoder  = recursive_encdec_model(preproc_english_sentences_eos.shape,
                                                                           max_french_sequence_length_eos,
                                                                           english_vocab_size_eos + 1,
                                                                           french_vocab_size_eos + 1)

print('%%%%%%%%%%%%%%%%%%%%%%')
print('model')
print('%%%%%%%%%%%%%%%%%%%%%%')
print(recursive_encdec_rnn_model.summary())

print('%%%%%%%%%%%%%%%%%%%%%%')
print('encoder')
print('%%%%%%%%%%%%%%%%%%%%%%')
print(encoder.summary())

print('%%%%%%%%%%%%%%%%%%%%%%')
print('inference decoder')
print('%%%%%%%%%%%%%%%%%%%%%%')
print(inf_decoder.summary())

recursive_encdec_rnn_model.fit([preproc_english_sentences_eos, preproc_french_sentences_eos[:,:-1]], 
                               preproc_french_sentences_eos_reshaped[:,1:,:],
                                batch_size=1024,
                                epochs=10,
                                validation_split=0.2)


%%%%%%%%%%%%%%%%%%%%%%
model
%%%%%%%%%%%%%%%%%%%%%%
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_encoder (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
input_decoder (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
embed_encoder (Embedding)       (None, None, 128)    25600       input_encoder[0][0]              
__________________________________________________________________________________________________
embed_decoder (Embedding)       (None, None, 128)    44416       input_decoder[0][0]              
_________________________________________________________

<keras.callbacks.History at 0x7fb67602fe10>

In [28]:
from keras.layers import LSTM
from keras.backend import argmax

def recursive_encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train an encoder-decoder model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    embed_encoder_size = 128
    embed_decoder_size = 128
    hidden_size = 256

    lr = 1e-3
    
    ############################
    # Enocder 
    ############################
    # expected shape = (None, <15)
    encoder_inputs = Input(shape=(None,),
                          name='input_encoder')
    
    # expected shape in training and inference: (None, <15, 128)
    encoder_embed = Embedding(input_dim=english_vocab_size,
                             output_dim=embed_encoder_size,
                             name="embed_encoder")(encoder_inputs)
    
    # expected shape in training and inference: (None, 256)  (dim=2, because of return_sequences=False)
    encoder_rnn, encoder_states_h, encoder_states_c = LSTM(units=hidden_size,
                                                              return_sequences=False,
                                                              return_state=True,
                                                              name='lstm_encoder')(encoder_embed)
    # concate states for exporting
    encoder_states = [encoder_states_h, encoder_states_c]
    
    ############################
    # Decoder for training
    ############################
    # Declare shared inputs
    # expected shape: (None, <(21+1) )
    decoder_inputs = Input(shape=(None,),
                          name='input_decoder')
    
    # Declare Shared Layer: Embedding layer
    # expected shape in training:   (None, <(21+1), 128)
    #          shape in predicting: (None, 1, 128)
    decoder_embed_layer = Embedding(input_dim=french_vocab_size,
                                 output_dim=embed_decoder_size,
                                   name='embed_decoder')
    
    # wire embed layer
    decoder_embed = decoder_embed_layer(decoder_inputs)
    
    # Declare Shared Layer: RNN
    # expected shape in training:   (None, >21, 256)
    #          shape in predicting: (None, 1, 256)
    decoder_rnn_layer = LSTM(units=hidden_size, 
                             return_sequences=True,
                            return_state=True,
                            name='lstm_decoder')
    
    # wire RNN layer and gives initial states
    decoder_rnn, _, _ = decoder_rnn_layer(decoder_embed,
                                          initial_state=encoder_states)
    
    # Declare Shared layer: FC
    decoder_fc = Dense(french_vocab_size, activation='softmax',
                      name='dense_decoder')
    
    # wire RNN and FC layer
    decoder_outputs = decoder_fc(decoder_rnn)
    
    ############################
    # Decoder for inference
    ############################
    
    inf_decoder_states_input_h = Input(shape=(hidden_size,), name='Decoder_State_H')
    inf_decoder_states_input_c = Input(shape=(hidden_size,), name='Decoder_State_C')
    inf_decoder_states_inputs = [inf_decoder_states_input_h, inf_decoder_states_input_c]
    
    # wire inputs and embed layer
    inf_decoder_embed = decoder_embed_layer(decoder_inputs)
    
    # wire embed layer and RNN layer
    inf_decoder_rnn, inf_decoder_state_h, inf_decoder_state_c = \
                decoder_rnn_layer(inf_decoder_embed, 
                                  initial_state=inf_decoder_states_inputs)
    
    # concate hidden states for exporting
    inf_decoder_states_outputs = [inf_decoder_state_h, inf_decoder_state_c]
    
    # wire RNN layer and FC layer
    inf_decoder_outputs = decoder_fc(inf_decoder_rnn)
    
    ############################
    # Export 3 models
    # class model: for trained (w/ optimizer)
    # class inf_encoder: for inference (w/o optimizer)
    # class inf_decoder: for inference (w/o optimizer)
    model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)
    model.compile(optimizer=Adam(lr=lr),
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    
    encoder = Model(inputs=encoder_inputs,
                   outputs=encoder_states)
    
    inf_decoder = Model(inputs=[decoder_inputs] + inf_decoder_states_inputs, 
                    outputs=[inf_decoder_outputs] + inf_decoder_states_outputs)
    

    return  model, encoder, inf_decoder


recursive_encdec_rnn_model, encoder, inf_decoder  = recursive_encdec_model(preproc_english_sentences_eos.shape,
                                                                           max_french_sequence_length_eos,
                                                                           english_vocab_size_eos + 1,
                                                                           french_vocab_size_eos + 1)

print('%%%%%%%%%%%%%%%%%%%%%%')
print('model')
print('%%%%%%%%%%%%%%%%%%%%%%')
print(recursive_encdec_rnn_model.summary())

print('%%%%%%%%%%%%%%%%%%%%%%')
print('encoder')
print('%%%%%%%%%%%%%%%%%%%%%%')
print(encoder.summary())

print('%%%%%%%%%%%%%%%%%%%%%%')
print('inference decoder')
print('%%%%%%%%%%%%%%%%%%%%%%')
print(inf_decoder.summary())

index = 20

recursive_encdec_rnn_model.fit([preproc_english_sentences_eos[:index], preproc_french_sentences_eos[:index,:-1]], 
                               preproc_french_sentences_eos_reshaped[:index,1:,:],
                                batch_size=10,
                                epochs=1,
                                validation_split=0.2)


%%%%%%%%%%%%%%%%%%%%%%
model
%%%%%%%%%%%%%%%%%%%%%%
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_encoder (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
input_decoder (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
embed_encoder (Embedding)       (None, None, 128)    25600       input_encoder[0][0]              
__________________________________________________________________________________________________
embed_decoder (Embedding)       (None, None, 128)    44416       input_decoder[0][0]              
_________________________________________________________

<keras.callbacks.History at 0x7f0d031bdc50>

In [13]:
def get_fake_model():
    test_input = Input(shape=(None,))
    inf_decoder_embed = test_input 
    inf_decoder = Model(inputs=test_input, 
                    outputs=inf_decoder_embed)
    
    return inf_decoder

fake_model = get_fake_model()

#### Prdict sentence

In [None]:
import pdb

def sequence_to_text(sequence, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[s] for s in sequence])


def predict(english_sequence, encoder, inf_decoder, bos_token, max_sequence_length):
    # shape of inputs: (1, 15)
    states = encoder.predict(english_sequence)
    
    french_sequence = []
    decoder_in_token = np.array(bos_token)
    decoder_in_token = np.reshape(decoder_in_token, (1, 1))
    
    for _ in range(max_sequence_length):
        decoder_out_token, h, c = inf_decoder.predict([decoder_in_token] + states)
        decoder_out_token = np.argmax(decoder_out_token, axis=-1)
        
        french_sequence.append(decoder_out_token[0,0])

        
        decoder_in_token = decoder_out_token
        states = [h, c]
        
        
    return french_sequence
    
french_sequence = predict(preproc_english_sentences_eos[:1],
                         encoder, inf_decoder,
                         french_tokenizer_eos.word_index['0bos0'],
                         max_french_sequence_length_eos)


print(sequence_to_text(french_sequence, french_tokenizer_eos))

#### Save model

In [11]:
import os

directory='./models'
filename_model='encdec_recursive_model.h5'
filename_encoder='encdec_recursive_model_encoder.h5'
filename_inf_decoder='encdec_recursive_model_inf_dcoder.h5'

if not os.path.exists(directory):
    os.makedirs(directory)
    
recursive_encdec_rnn_model.save(directory + filename_model)
encoder.save(directory + filename_encoder)
inf_decoder.save(directory + filename_inf_decoder)

  str(node.arguments) + '. They will not be included '
  str(node.arguments) + '. They will not be included '


### Recap
I've implemented several architectures which validation accuracy are listed in the following table. So far, birectional RNN with embedding layer has highest accuracy.

| Architecture        | With <br> Embedding | Note            | val_acc |
|---------------------|----------------|-----------------|---------|
| Simple RNN          | No             | hidden_size=256 | 0.6841  |
| Simple RNN          | Yes            | hidden_size=256 | 0.9087  |
| Bidirectional RNN   | No             | N/A             | 0.7211  |
| Bidirectional RNN   | Yes            | N/A             | 0.9429  |
| Encoder-Decoder RNN | No             | concat mode     | 0.6427  |
| Encoder-Decoder RNN | Yes            | concat mode     | 0.6813  |



## BLEU Score between Simple RNN and Birectional RNN
The accuracy between 'Simple RNN with embedding layer' and 'Bidirectional RNN with embedding layer' is less than 5%. I want to see how different they are upon BLEU score. Hence I implement a function to calculate BLEU score.

* Implement Logits to text again  
    * function name: logits_to_text_without_pad()
    * Similar to previous version, but this time the French sentence cannot have <PAD> word.

  
* Implement iterator of French sequence (reference).
    * class name: FrenchSentenceIterator
    * This class tokenize the French sequence


* Implement iterator of predicted French sequence (candidate).
    * class name: PredictionIter
    * This class returns predicted and tokenized French sequence.



In [12]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from datetime import datetime
import pdb

def logits_to_text_without_pad(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = ' '
#     pdb.set_trace()
    
    return [index_to_words[prediction] for prediction in np.argmax(logits, 1) if prediction != 0]

# Generate 
class FrenchSentenceIterator(object):
    def __init__(self, sentences):
        self.sentences = sentences
        self.max = len(sentences)
        
        
    def __iter__(self):
        self.n = 0
        return self
        
    
    def __next__(self):
        if self.n < self.max:
            seq = self.sentences[self.n]
            self.n += 1
            tokens = [seq.split()]
            return tokens
        else:
            raise StopIteration
    

class PredictionIter(object):
    def __init__(self, sentences, model, french_tokenizer):
        self.max = len(sentences)
        self.sentences = sentences
        self.model = model
        
    
    def __iter__(self):
        self.n = 0
        return self
    
    
    def __next__(self):
        if self.n < self.max:
            s = self.sentences[self.n: (self.n+1)]
            predictions = model.predict(s)[0]
            self.n += 1
            
            # convert encoded prediction to french sentences
            french_sen = logits_to_text_without_pad(predictions, self.french_tokenizer)
            return french_sen
        else:
            raise StopIteration

            
def get_bleu_score(reference_iter, candidate_iter):
    smoothie = SmoothingFunction().method4
    
    ave_score = 0.0
    for index, (reference, candidate) in enumerate(zip(reference_iter, candidate_iter), start=1):
        
        score = sentence_bleu(reference, candidate, 
                              weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothie)
        # smoothed average
        new_ave_score = 1.0/index * score + (1.0-1.0/index) * ave_score
        ave_score = new_ave_score
        
    return ave_score

#### Calculate score of simple RNN  w/ embedding layer

In [13]:
from keras.models import load_model

directory='./models/'
filename='embed_rnn_model.h5'

bd_rnn_model = load_model(directory + filename)
model = bd_rnn_model

tmp_x = pad(preproc_english_sentences, max_french_sequence_length )

reference_iter = FrenchSentenceIterator(french_sentences[-1000:])
candidate_iter = PredictionIter(tmp_x[-1000:], model, french_tokenizer)

start = datetime.now()
bleu_score = get_bleu_score(reference_iter, candidate_iter)
end = datetime.now()

print('Execution Time:{}'.format(end - start))
print('bleu score={}'.format(bleu_score))

Execution Time:0:00:04.859489
bleu score=0.5329397908993343


#### Calculate score of birectional RNN  w/ embedding layer

In [14]:
from keras.models import load_model

directory='./models/'
filename='bd_rnn_model_concat.h5'

bd_rnn_model = load_model(directory + filename)
model = bd_rnn_model

tmp_x = pad(preproc_english_sentences, max_french_sequence_length )

reference_iter = FrenchSentenceIterator(french_sentences[-1000:])
candidate_iter = PredictionIter(tmp_x[-1000:], model, french_tokenizer)

start = datetime.now()
bleu_score = get_bleu_score(reference_iter, candidate_iter)
end = datetime.now()

print('Execution Time:{}'.format(end - start))
print('bleu score={}'.format(bleu_score))

Execution Time:0:00:08.410817
bleu score=0.6174853541055615


#### Comparison 

| Architecture      | With Embedding | Note            | val_acc | BLEU-4 |
|-------------------|----------------|-----------------|---------|--------|
| Simple RNN        | Yes            | hidden_size=256 | 0.9087  | 0.532  |
| Bidirectional RNN | Yes            | N/A             | 0.9429  | 0.617  |

## Model 5: Custom (IMPLEMENTATION)
Use everything you learned from the previous models to create a model that incorporates embedding and a bidirectional rnn into one model.

In [13]:
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a model that incorporates embedding, encoder-decoder, and bidirectional RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Implement
    lr = 1e-3
    embed_size = 128
    h1_size = 256
    h2_size = 256
    
    input_sequence_length = input_shape[1]
    
    x = Input(shape=input_shape[1:])
    
    embed = Embedding(input_dim=english_vocab_size,
                     output_dim=embed_size,
                     input_length=input_sequence_length)(x)
    
    # output shape = (None, 256)
    bd_rnn = Bidirectional(GRU(units=h1_size, return_sequences=False), 
                          merge_mode='concat')(embed)
    
    # expand seq length 
    # output shape = (None, 21, 256)
    encoder_out = RepeatVector(output_sequence_length)(bd_rnn)
    
    # output shape = (None, 21, 256)
    rnn = GRU(units=h2_size, return_sequences=True)(encoder_out)
    
    # shape = (None, 21, 344)
    y = TimeDistributed(Dense(french_vocab_size, activation='softmax'))(rnn)
    
    model = Model(inputs=x, outputs=y)
    model.compile(optimizer=Adam(lr=lr),
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    return model
tests.test_model_final(model_final)


print('Final Model Loaded')
# TODO: Train the final model

# # pad english sentence, shape = (None, 15) => (None, 21)
# tmp_x = pad(preproc_english_sentences, max_french_sequence_length)

# rnn_model_final = model_final(tmp_x.shape,
#                                max_french_sequence_length,
#                                english_vocab_size,
#                                french_vocab_size)

# print(rnn_model_final.summary())

# rnn_model_final.fit(tmp_x, preproc_french_sentences,
#                 batch_size=1024,
#                 epochs=10,
#                 validation_split=0.2)

# print(logits_to_text(rnn_model_final.predict(tmp_x[:1])[0], french_tokenizer))

Final Model Loaded


## Prediction (IMPLEMENTATION)

In [15]:
import os

def final_predictions(x, y, x_tk, y_tk):
    """
    Gets predictions using the final model
    :param x: Preprocessed English data
    :param y: Preprocessed French data
    :param x_tk: English tokenizer
    :param y_tk: French tokenizer
    """
    # TODO: Train neural network using model_final
    model = model_final(x.shape,
                        y.shape[1],
                       len(x_tk.word_index),
                       len(y_tk.word_index))
    
    print(model.summary())

    model.fit(x, y, batch_size=1024, epochs=10, validation_split=0.2)

    
    ## DON'T EDIT ANYTHING BELOW THIS LINE
    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = '<PAD>'

    sentence = 'he saw a old yellow truck'
    sentence = [x_tk.word_index[word] for word in sentence.split()]
    sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
    sentences = np.array([sentence[0], x[0]])
    predictions = model.predict(sentences, len(sentences))

    print('Sample 1:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[0]]))
    print('Il a vu un vieux camion jaune')
    print('Sample 2:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[1]]))
    print(' '.join([y_id_to_word[np.max(x)] for x in y[0]]))

    # save model
    directory='./models/'
    filename='final_model.h5'

    if not os.path.exists(directory):
        os.makedirs(directory)

    model.save(directory + filename)

final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 15)                0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 15, 128)           25472     
_________________________________________________________________
bidirectional_4 (Bidirection (None, 512)               591360    
_________________________________________________________________
repeat_vector_4 (RepeatVecto (None, 21, 512)           0         
_________________________________________________________________
gru_8 (GRU)                  (None, 21, 256)           590592    
_________________________________________________________________
time_distributed_3 (TimeDist (None, 21, 344)           88408     
Total params: 1,295,832
Trainable params: 1,295,832
Non-trainable params: 0
_________________________________________________________________


In [10]:
a = [1, 2]
b = [3, 4]

c = [a] + b
print(c)
c0, c1, c2 = c
print('{}, {}, {}'.format(c0, c1, c2))

d = [a] + [b]
print(d)
(d0, (d11, d12)) = d
print('{}, {}, {}'.format(d0, d11, d12))

[[1, 2], 3, 4]
[1, 2], 3, 4
[[1, 2], [3, 4]]
[1, 2], 3, 4


In [9]:
import pdb

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.utils import to_categorical
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from keras.layers import Dense

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(1, n_unique-1) for _ in range(length)]

# prepare data for the LSTM
def get_dataset(n_in, n_out, cardinality, n_samples):
	X1, X2, y = list(), list(), list()
	for _ in range(n_samples):
		# generate source sequence
		source = generate_sequence(n_in, cardinality)
		# define padded target sequence
		target = source[:n_out]
		target.reverse()
		# create padded input target sequence
		target_in = [0] + target[:-1]
		# encode
		src_encoded = to_categorical([source], num_classes=cardinality)
		tar_encoded = to_categorical([target], num_classes=cardinality)
		tar2_encoded = to_categorical([target_in], num_classes=cardinality)
		# store
		X1.append(src_encoded)
		X2.append(tar2_encoded)
		y.append(tar_encoded)
	return array(X1), array(X2), array(y)

# returns train, inference_encoder and inference_decoder models
def define_models(n_input, n_output, n_units):
	# define training encoder
	encoder_inputs = Input(shape=(None, n_input))
	encoder = LSTM(n_units, return_state=True)
	encoder_outputs, state_h, state_c = encoder(encoder_inputs)
	encoder_states = [state_h, state_c]
	# define training decoder
	decoder_inputs = Input(shape=(None, n_output))
	decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
	decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
	decoder_dense = Dense(n_output, activation='softmax')
	decoder_outputs = decoder_dense(decoder_outputs)
	model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
	# define inference encoder
	encoder_model = Model(encoder_inputs, encoder_states)
	# define inference decoder
	decoder_state_input_h = Input(shape=(n_units,))
	decoder_state_input_c = Input(shape=(n_units,))
	decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
	decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
	decoder_states = [state_h, state_c]
	decoder_outputs = decoder_dense(decoder_outputs)
	decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
	# return all models
	return model, encoder_model, decoder_model

# generate target given source sequence
def predict_sequence(infenc, infdec, source, n_steps, cardinality):
    # encode
    state = infenc.predict(source)
    # start of sequence input
    target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
    
#     pdb.set_trace()
    
    # collect predictions
    output = list()
    for t in range(n_steps):
        # predict next char
        yhat, h, c = infdec.predict([target_seq] + state)
        
        pdb.set_trace()
        
        # store prediction
        output.append(yhat[0,0,:])
        # update state
        state = [h, c]
        # update target sequence
        target_seq = yhat
    return array(output)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# configure problem
n_features = 50 + 1
n_steps_in = 6
n_steps_out = 3
# define model
train, infenc, infdec = define_models(n_features, n_features, 128)

######## Test code ####
# X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 100)
# target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)
#######################

print('-----------------------')
print('Train')
print(train.summary())
print('-----------------------')
print('Inf encoder')
print(infenc.summary())
print('-----------------------')
print('inf decoder')
print(infdec.summary())
print('-----------------------')

train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
# generate training dataset
# X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 100000)
X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 100)

print(X1.shape,X2.shape,y.shape)

# pdb.set_trace()

# train model
train.fit([X1, X2], y, epochs=1)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
    X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
    target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)
    if array_equal(one_hot_decode(y[0]), one_hot_decode(target)):
        correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
	X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
	target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)
	print('X=%s y=%s, yhat=%s' % (one_hot_decode(X1[0]), one_hot_decode(y[0]), one_hot_decode(target)))

-----------------------
Train
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, None, 51)     0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, None, 51)     0                                            
__________________________________________________________________________________________________
lstm_3 (LSTM)                   [(None, 128), (None, 92160       input_5[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LSTM)                   [(None, None, 128),  92160       input_6[0][0]                    
                                                                 lstm_3[0][1]  

BdbQuit: 

In [8]:
X1.shape
X2.shape
y.shape
target.shape

(3, 51)

## Submission
When you're ready to submit, complete the following steps:
1. Review the [rubric](https://review.udacity.com/#!/rubrics/1004/view) to ensure your submission meets all requirements to pass
2. Generate an HTML version of this notebook

  - Run the next cell to attempt automatic generation (this is the recommended method in Workspaces)
  - Navigate to **FILE -> Download as -> HTML (.html)**
  - Manually generate a copy using `nbconvert` from your shell terminal
```
$ pip install nbconvert
$ python -m nbconvert machine_translation.ipynb
```
  
3. Submit the project

  - If you are in a Workspace, simply click the "Submit Project" button (bottom towards the right)
  
  - Otherwise, add the following files into a zip archive and submit them 
  - `helper.py`
  - `machine_translation.ipynb`
  - `machine_translation.html`
    - You can export the notebook by navigating to **File -> Download as -> HTML (.html)**.

### Generate the html

**Save your notebook before running the next cell to generate the HTML output.** Then submit your project.

In [2]:
# Save before you run this cell!
!!jupyter nbconvert *.ipynb

['[NbConvertApp] Converting notebook machine_translation.ipynb to html',
 '[NbConvertApp] Writing 305996 bytes to machine_translation.html']

## Optional Enhancements

This project focuses on learning various network architectures for machine translation, but we don't evaluate the models according to best practices by splitting the data into separate test & training sets -- so the model accuracy is overstated. Use the [`sklearn.model_selection.train_test_split()`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to create separate training & test datasets, then retrain each of the models using only the training set and evaluate the prediction accuracy using the hold out test set. Does the "best" model change?