## Introduction
In this notebook, i will build a deep neural network that functions as part of an end-to-end machine translation pipeline.completed pipeline will accept English text as input and return the French translation.


In [2]:
import collections

import helper
import numpy as np
import project_tests as tests

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Using TensorFlow backend.


## Dataset
We begin by investigating the dataset that will be used to train and evaluate your pipeline.  The most common datasets used for machine translation are from [WMT](http://www.statmt.org/).  However, that will take a long time to train a neural network on.  i'll be using a dataset udacity ai  created for this project that contains a small vocabulary.

In [4]:
# Load English data
english_sentences = helper.load_data('data/small_vocab_en')
# Load French data
french_sentences = helper.load_data('data/small_vocab_fr')

print('Dataset Loaded')

Dataset Loaded


In [5]:
for sample_i in range(2):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))

small_vocab_en Line 1:  new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line 1:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line 2:  the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr Line 2:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .


In [6]:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"



Time to start preprocessing the data...
### Tokenize 
For a neural network to predict on text data, it first has to be turned into data it can understand. Text data like "dog" is a sequence of ASCII character encodings.  Since a neural network is a series of multiplication and addition operations, the input data needs to be number(s).

We can turn each character into a number or each word into a number.  These are called character and word ids, respectively.  Character ids are used for character level models that generate text predictions for each character.  A word level model uses word ids that generate text predictions for each word.  Word level models tend to learn better, since they are lower in complexity, so we'll use those.


In [19]:
def tokenize(x):

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    data = tokenizer.texts_to_sequences(x)
    return (data,tokenizer)
tests.test_tokenize(tokenize)

# Tokenize Example output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


### Padding
When batching the sequence of word ids together, each sequence needs to be the same length.  Since sentences are dynamic in length, we can add padding to the end of the sequences to make them the same length.


In [29]:
def pad(x, length=None):

    if not length:
        m=len(max(x, key=len))
        pad=pad_sequences(x,maxlen=m,padding='post')
    else:
        pad=pad_sequences(x,maxlen=length,padding='post')
    return pad
tests.test_pad(pad)

# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12  2 13 14 15 16  3 17]
  Output: [10 11 12  2 13 14 15 16  3 17]
Sequence 3 in x
  Input:  [18 19  3 20 21]
  Output: [18 19  3 20 21  0  0  0  0  0]


In [30]:
def preprocess(x, y):

    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344


## Models
In this section, i will experiment with various neural network architectures.
- Model 1 is a simple RNN
- Model 2 is a RNN with Embedding
- Model 3 is a Bidirectional RNN
- Model 4 is an optional Encoder-Decoder RNN


### Ids Back to Text
The neural network will be translating the input to words ids, which isn't the final form we want.  We want the French translation.  The function `logits_to_text` will bridge the gab between the logits from the neural network to the French translation.

In [31]:
def logits_to_text(logits, tokenizer):

    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.


### Model 1: RNN 
![RNN](images/rnn.png)
A basic RNN model is a good baseline for sequence data.

In [62]:
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):

    learning_rate=0.001
    inputs=Input(shape=input_shape[1:])
    gru=GRU(units=output_sequence_length,return_sequences=True)(inputs)
    layers = TimeDistributed(Dense(2 * french_vocab_size, 
                                    activation='relu'))(gru)
    outputs = Dense(french_vocab_size, 
                                    activation='softmax')(layers)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                 metrics=['accuracy'])
    print(model.summary())
    return model

# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_36 (InputLayer)        (None, 21, 1)             0         
_________________________________________________________________
gru_33 (GRU)                 (None, 21, 21)            1449      
_________________________________________________________________
time_distributed_32 (TimeDis (None, 21, 688)           15136     
_________________________________________________________________
dense_33 (Dense)             (None, 21, 344)           237016    
Total params: 253,601
Trainable params: 253,601
Non-trainable params: 0
_________________________________________________________________
None
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_37 (InputLayer)        (None, 21, 1)             0         
_________________________________________________________________

### Model 2: Embedding 
![RNN](images/embedding.png)


In [54]:
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):

    learning_rate=0.001
    inputs=Input(shape=input_shape[1:])
    encoded_inputs=layers = Embedding(english_vocab_size, english_vocab_size)(inputs)
    gru=GRU(units=output_sequence_length,return_sequences=True)(encoded_inputs)
    layers = TimeDistributed(Dense(2 * french_vocab_size, 
                                    activation='relu'))(gru)
    outputs = Dense(french_vocab_size, 
                                    activation='softmax')(layers)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                 metrics=['accuracy'])
    print(model.summary())
    return model

# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])

# Train the neural network
embed_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index) + 1,
    len(french_tokenizer.word_index) + 1)

embed_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, 
                    epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(embed_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_29 (InputLayer)        (None, 21)                0         
_________________________________________________________________
embedding_12 (Embedding)     (None, 21, 199)           39601     
_________________________________________________________________
gru_26 (GRU)                 (None, 21, 21)            13923     
_________________________________________________________________
time_distributed_26 (TimeDis (None, 21, 688)           15136     
_________________________________________________________________
time_distributed_27 (TimeDis (None, 21, 344)           237016    
Total params: 305,676
Trainable params: 305,676
Non-trainable params: 0
_________________________________________________________________
None
_________________________________________________________________
Layer (type)                 Output Shape              Param #   

### Model 3: Bidirectional RNNs 
![RNN](images/bidirectional.png)
One restriction of a RNN is that it can't see the future input, only the past.  This is where bidirectional recurrent neural networks come in.  They are able to see the future data.

In [61]:
def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):

    learning_rate=0.001
    inputs=Input(shape=input_shape[1:])
    gru=Bidirectional(GRU(units=output_sequence_length,return_sequences=True))(inputs)
    layers = TimeDistributed(Dense(2 * french_vocab_size, 
                                    activation='relu'))(gru)
    outputs =Dense(french_vocab_size, 
                                    activation='softmax')(layers)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                 metrics=['accuracy'])
    print(model.summary())
    return model
# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
bd_rnn_model = bd_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index) + 1,
    len(french_tokenizer.word_index) + 1)

bd_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, 
                    epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(bd_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_35 (InputLayer)        (None, 21, 1)             0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 21, 42)            2898      
_________________________________________________________________
time_distributed_30 (TimeDis (None, 21, 690)           29670     
_________________________________________________________________
time_distributed_31 (TimeDis (None, 21, 345)           238395    
Total params: 270,963
Trainable params: 270,963
Non-trainable params: 0
_________________________________________________________________
None
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
new jersey est parfois calme en en et il est il en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

### Model 4: Encoder-Decoder 
Time to look at encoder-decoder models.  This model is made up of an encoder and decoder. The encoder creates a matrix representation of the sentence.  The decoder takes this matrix as input and predicts the translation as output.

In [64]:
def encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):

    learning_rate = .001
    inputs = Input(shape=input_shape[1:])
    gru = GRU(output_sequence_length)(inputs)
    repeats = RepeatVector(output_sequence_length)(gru)
    layers = GRU(output_sequence_length,return_sequences=True)(repeats)
    layers = TimeDistributed(Dense(2 * french_vocab_size, 
                                    activation='relu'))(layers)
    outputs =Dense(french_vocab_size, 
                                    activation='softmax')(layers)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(
        loss=sparse_categorical_crossentropy,
        optimizer=Adam(learning_rate),
        metrics=['accuracy']
    )
    print(model.summary())
    return model
# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
encdec_rnn_model = encdec_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index) + 1,
    len(french_tokenizer.word_index) + 1)

encdec_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, 
                    epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(encdec_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_39 (InputLayer)        (None, 15, 1)             0         
_________________________________________________________________
gru_37 (GRU)                 (None, 21)                1449      
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 21, 21)            0         
_________________________________________________________________
gru_38 (GRU)                 (None, 21, 21)            2709      
_________________________________________________________________
time_distributed_34 (TimeDis (None, 21, 688)           15136     
_________________________________________________________________
dense_37 (Dense)             (None, 21, 344)           237016    
Total params: 256,310
Trainable params: 256,310
Non-trainable params: 0
_________________________________________________________________
None

### Model 5: Custom
here i will make a custom model from all the previous ones to try get better result

In [82]:
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):

    # build the layers
    learning_rate = .005
    inputs = Input(shape=input_shape[1:])
    layers = Embedding(english_vocab_size, english_vocab_size, 
                       mask_zero=False)(inputs)
    layers = Bidirectional(GRU(output_sequence_length))(layers)
    layers = RepeatVector(output_sequence_length)(layers)
    layers = Bidirectional(GRU(output_sequence_length,return_sequences=True))(layers)
    layers = TimeDistributed(Dense(2 * french_vocab_size, activation='relu'))(layers)
    outputs = Dense(french_vocab_size, activation='softmax')(layers)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(
        loss=sparse_categorical_crossentropy,
        optimizer=Adam(learning_rate),
        metrics=['accuracy']
    )
    print(model.summary())
    return model

tests.test_model_final(model_final)

print('Final Model Loaded')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_56 (InputLayer)        (None, 15)                0         
_________________________________________________________________
embedding_33 (Embedding)     (None, 15, 199)           39601     
_________________________________________________________________
bidirectional_34 (Bidirectio (None, 42)                27846     
_________________________________________________________________
repeat_vector_19 (RepeatVect (None, 21, 42)            0         
_________________________________________________________________
bidirectional_35 (Bidirectio (None, 21, 42)            8064      
_________________________________________________________________
time_distributed_69 (TimeDis (None, 21, 688)           29584     
_________________________________________________________________
dense_86 (Dense)             (None, 21, 344)           237016    
Total para

## Prediction

In [83]:
import os.path
from keras.models import load_model
from keras.callbacks import ModelCheckpoint
model_file = 'final_model.h5'

def fit_model(model, x, y):
    checkpoint = ModelCheckpoint(filepath=model_file, 
                                   monitor='val_loss',
                                   save_best_only=True, 
                                   verbose=1)
    model.fit(x, y, batch_size=1024, 
                epochs=100, validation_split=0.2, 
                callbacks=[checkpoint],
                verbose=1)

def final_predictions(x, y, x_tk, y_tk):
    x = pad(x, y.shape[1])
    if os.path.isfile(model_file):        
        print('Continue with last save: {}'.format(model_file))
        model = load_model(model_file)
    else:
        # Train neural network using model_final
        model = model_final(
            x.shape,
            y.shape[1],
            len(x_tk.word_index) + 1,
            len(y_tk.word_index) + 1)
    fit_model(model, x, y)

    # Print prediction(s)
    print(logits_to_text(model.predict(x[:1])[0], y_tk))

    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = '<PAD>'

    sentence = 'he saw a old yellow truck'
    sentence = [x_tk.word_index[word] for word in sentence.split()]
    sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
    sentences = np.array([sentence[0], x[0]])
    predictions = model.predict(sentences, len(sentences))

    print('Sample 1:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[0]]))
    print('Il a vu un vieux camion jaune')
    print('Sample 2:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[1]]))
    print(' '.join([y_id_to_word[np.max(x)] for x in y[0]]))


final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)

Continue with last save: final_model.h5
Train on 110288 samples, validate on 27573 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100


Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100


Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100


Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
new jersey est parfois calme pendant l'automne de l' automne neigeux il est neigeux <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Sample 1:
il a vu un petit camion noir <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Il a vu un vieux camion jaune
Sample 2:
new jersey est parfois calme pendant l'automne de l' automne neigeux il est neigeux <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
new jersey est parfois calme pendant l' automne et il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
