# Recurrent Neural Networks
## Machine Translation
In this notebook, we'll build a deep neural network that functions as part of an end-to-end machine translation pipeline. Our completed pipeline will accept English text as input and return the French translation as output. This project was taken from the [Udacity AI Nanodegree](https://www.udacity.com/course/artificial-intelligence-nanodegree--nd889). We can break this project down into three steps:

1. **Preprocess** - We'll need to convert text to a sequence of integers.
2. **Modeling** - Then, we'll create models that accept a sequence of integers as input and return a probability distribution over possible translations.
3. **Prediction** - Lastly, we'll write a function to run our model on English text and see our results.

### Dataset
The most common datasets used for machine translation are from [WMT](http://www.statmt.org/). However, training a neural network on them takes a long time. Instead, we'll use a small-vocabulary dataset that can be found in the `data/` folder. The `small_vocab_en` file contains English sentences, and the `small_vocab_fr` file contains their French translations.

### Load Data

In [1]:
import os

def load_data(path):
    """
    Load dataset from a given path.
    """
    input_file = os.path.join(path)
    # Read as UTF-8 so accented French characters are decoded correctly
    with open(input_file, "r", encoding="utf-8") as f:
        data = f.read()

    return data.split('\n')


# Load English data
english_sentences = load_data('data/small_vocab_en')
# Load French data
french_sentences = load_data('data/small_vocab_fr')
print('Dataset Loaded')

# Sample from dataset
for sample_i in range(3):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))

Dataset Loaded
small_vocab_en Line 1:  new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line 1:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line 2:  the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr Line 2:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .
small_vocab_en Line 3:  california is usually quiet during march , and it is usually hot in june .
small_vocab_fr Line 3:  california est généralement calme en mars , et il est généralement chaud en juin .


We can also gauge the complexity of the problem by looking at the vocabulary: the larger the vocabulary, the harder the problem. Let's look at the vocabulary of the dataset we'll be working with.

In [2]:
import collections

# Create a counter object for each dataset
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"


Since we are using an abridged dataset, we'll only need to learn about 355 unique French words. For comparison, _Alice's Adventures in Wonderland_ contains 2,766 unique words out of a total of 15,500 words. Learning a larger vocabulary would simply require more data and more training time.
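
As a rough illustration of the two numbers compared above (total words versus unique words), the sketch below computes both for a toy corpus. The sentences and the helper name `vocab_stats` are made up for this example and are not part of the project code.

```python
from collections import Counter


def vocab_stats(sentences):
    """Return (total word count, unique word count) for whitespace-split sentences."""
    counter = Counter(word for sentence in sentences for word in sentence.split())
    # Summing the counts gives the total without building the word list a second time
    total = sum(counter.values())
    return total, len(counter)


# Toy corpus, invented for illustration only
toy_sentences = [
    "new jersey is sometimes quiet during autumn .",
    "california is usually quiet during march .",
]
total_words, unique_words = vocab_stats(toy_sentences)
print(total_words, unique_words)
```

Note that punctuation counts as a word here because the sentences are split on whitespace only, just as in the cell above.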

## Preprocessing

Normally, the text we are given will be messy. For example, if we scrape a website for our vocabulary, we'll end up with a bunch of HTML tags and markup that aren't useful inputs for our objective. For this reason, text processing is usually our first step in Natural Language Processing. Common text processing steps include:
* Cleaning - removing unwanted symbols, tags, stopwords, etc., so that we are left with plain text.
* Normalization - making everything lowercase, removing punctuation, etc.
* Tokenization - converting words to symbols that can be fed into our model.
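
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the preprocessing used in this notebook: the helper names `clean`, `normalize`, and `tokenize_words` and the sample string are assumptions, and the normalization rule only keeps unaccented letters, so it would need adjusting for French text.

```python
import re


def clean(text):
    """Strip HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def normalize(text):
    """Lowercase and drop everything except a-z, digits, apostrophes, and spaces."""
    return re.sub(r"[^a-z0-9' ]", "", text.lower())


def tokenize_words(text):
    """Split on whitespace into word tokens."""
    return text.split()


# Hypothetical scraped input, invented for illustration
raw = "<p>New Jersey is <b>sometimes</b> quiet, during autumn.</p>"
tokens = tokenize_words(normalize(clean(raw)))
print(tokens)
```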

From looking at the sentences above, we can see they have mostly been preprocessed already. Punctuation has been delimited with spaces, and all the text has been converted to lowercase. This will save us some time, but we still need to tokenize our data. We'll convert the text into sequences of integers using the following preprocessing steps:
1. Tokenize the words into ids.
2. Add padding to make all the sequences the same length.
3. Run our data through both steps.

### Tokenize
For a neural network to make predictions on text data, the text first has to be turned into data the network can understand. Text data like "dog" is a sequence of ASCII character encodings. Since a neural network is a series of multiplication and addition operations, the input data needs to be numbers.

We can turn each character into a number or each word into a number. These are called character ids and word ids, respectively. Character ids are used by character-level models, which generate a prediction for each character. Word ids are used by word-level models, which generate a prediction for each word. Word-level models tend to learn better since they are lower in complexity, so we'll use those.
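
To make the difference concrete, here is a small sketch that builds both kinds of ids for one sentence. The id assignment (first-seen order, starting at 1) is an assumption chosen for simplicity; a real tokenizer may order ids differently, for example by word frequency.

```python
sentence = "the quick brown fox jumps over the lazy dog"

# Word ids: one integer per distinct word, starting at 1 (0 is left free for padding)
word_to_id = {}
for word in sentence.split():
    word_to_id.setdefault(word, len(word_to_id) + 1)
word_ids = [word_to_id[word] for word in sentence.split()]

# Character ids: one integer per distinct character (spaces included)
char_to_id = {}
for char in sentence:
    char_to_id.setdefault(char, len(char_to_id) + 1)
char_ids = [char_to_id[char] for char in sentence]

print(len(word_ids), len(word_to_id))   # sequence length and vocabulary size at word level
print(len(char_ids), len(char_to_id))   # sequence length and vocabulary size at character level
```

The character-level sequence is several times longer than the word-level one, which is one way to see why a word-level model has less to learn per sentence on a small vocabulary like ours.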

We'll turn each sentence into a sequence of word ids using Keras's [`Tokenizer`](https://keras.io/preprocessing/text/#tokenizer) class.

In [3]:
from keras.preprocessing.text import Tokenizer

def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    # Initiate tokenizer
    tokenizer = Tokenizer()
    
    # Fit tokenizer to text
    tokenizer.fit_on_texts(x)
    
    # Get tokenized data
    tokenized_data = tokenizer.texts_to_sequences(x)
    
    return tokenized_data, tokenizer


# Tokenize Example output
text_sentences = ['The quick brown fox jumps over the lazy dog .',
                  'By Jove , my quick study of lexicography won a prize .',
                  'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

Using TensorFlow backend.


{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


### Padding

When batching sequences of word ids together, each sequence needs to be the same length. Since sentences vary in length, we can add padding to the end of the shorter sequences to make them all the same length. We'll be using Keras's [`pad_sequences`](https://keras.io/preprocessing/sequence/#pad_sequences) function.
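
For intuition, padding at the end of each sequence can be sketched in plain Python. The helper `pad_post` below is hypothetical and only mimics the idea; in particular, it truncates over-long sequences from the end, which is a choice made for this sketch and may differ from the library's default behavior.

```python
def pad_post(sequences, length=None, pad_value=0):
    """Right-pad each sequence with pad_value so that all have the same length."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    # A negative repeat count yields an empty list, so long sequences are simply cut
    return [list(seq[:length]) + [pad_value] * (length - len(seq)) for seq in sequences]


padded = pad_post([[1, 2, 3], [4], [5, 6]])
print(padded)
```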

In [4]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    # Default to the length of the longest sequence
    if length is None:
        length = max(len(seq) for seq in x)
        
    # Get padded sequences
    padded_seq = pad_sequences(sequences=x, maxlen=length, padding='post')
    
    return padded_seq


# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12  2 13 14 15 16  3 17]
  Output: [10 11 12  2 13 14 15 16  3 17]
Sequence 3 in x
  Input:  [18 19  3 20 21]
  Output: [18 19  3 20 21  0  0  0  0  0]


### Process the Data
Let's run our data through our tokenizer and pad it.

In [5]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    # Tokenize our data
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)
    
    # Pad our data
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Reshape our data
    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)

print('Data Preprocessed')
# Sample from processed dataset
for sample_i in range(3):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, preproc_english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, preproc_french_sentences[sample_i]))

Data Preprocessed
small_vocab_en Line 1:  [17 23  1  8 67  4 39  7  3  1 55  2 44  0  0]
small_vocab_fr Line 1:  [[ 35]
 [ 34]
 [  1]
 [  8]
 [ 67]
 [ 37]
 [ 11]
 [ 24]
 [  6]
 [  3]
 [  1]
 [112]
 [  2]
 [ 50]
 [  0]
 [  0]
 [  0]
 [  0]
 [  0]
 [  0]
 [  0]]
small_vocab_en Line 2:  [ 5 20 21  1  9 62  4 43  7  3  1  9 51  2 45]
small_vocab_fr Line 2:  [[ 4]
 [32]
 [31]
 [ 1]
 [12]
 [19]
 [ 2]
 [49]
 [ 6]
 [ 3]
 [95]
 [69]
 [ 2]
 [51]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]]
small_vocab_en Line 3:  [22  1  9 67  4 38  7  3  1  9 68  2 34  0  0]
small_vocab_fr Line 3:  [[101]
 [  1]
 [ 12]
 [ 67]
 [  2]
 [ 45]
 [  6]
 [  3]
 [  1]
 [ 12]
 [ 21]
 [  2]
 [ 41]
 [  0]
 [  0]
 [  0]
 [  0]
 [  0]
 [  0]
 [  0]
 [  0]]


### Ids Back to Text
The neural network will be translating the input to word ids, which isn't the final form we want. We want the French translation. Let's write a function, `logits_to_text`, that bridges the gap between the logits from the neural network and the French translation.

In [6]:
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    # Build a reverse lookup from word id to word
    index_to_words = {word_id: word for word, word_id in tokenizer.word_index.items()}
    # Insert '<PAD>' in place of zeros.
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.
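
To see what this decoding step does without running a model, here is a plain-Python sketch on made-up scores. The helper `decode_rows`, the tiny vocabulary, and the score values are all invented for illustration; each row stands for one time step and holds one score per word id.

```python
def decode_rows(rows, index_to_word):
    """Pick the highest-scoring id in each row and map it back to a word."""
    words = []
    for row in rows:
        best_id = max(range(len(row)), key=row.__getitem__)
        words.append(index_to_word[best_id])
    return " ".join(words)


# Hypothetical vocabulary and per-step scores
index_to_word = {0: "<PAD>", 1: "est", 2: "il", 3: "chaud"}
scores = [
    [0.1, 0.2, 0.6, 0.1],  # highest score at id 2
    [0.0, 0.7, 0.2, 0.1],  # highest score at id 1
    [0.1, 0.1, 0.1, 0.7],  # highest score at id 3
    [0.9, 0.0, 0.1, 0.0],  # highest score at id 0 (padding)
]
print(decode_rows(scores, index_to_word))
```

Because only the position of the largest value matters, this works the same way on raw logits and on softmax probabilities.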


## Modeling

Now that our data is processed, let's build our models. We'll start by experimenting with various neural network architectures:
* Simple RNN
* RNN with Embedding
* Bidirectional RNN
* Encoder-Decoder RNN

After experimenting with the four simple architectures, we will construct a deeper architecture that is designed to outperform all four models.

### Model 1: Simple RNN
![RNN](images/rnn.png)
A basic RNN model is a good baseline for sequence data. Unlike feedforward networks, RNNs carry a hidden state from one time step to the next and use it as input to the next calculation. This allows RNNs to learn patterns in sequences, such as predicting the next word in a sentence from the first few words.
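
The idea of carrying state forward can be sketched with a single-unit vanilla RNN in plain Python. This is a simplification for intuition only: the weights are arbitrary made-up values, and the models in this notebook use GRU layers, which add gating on top of this basic recurrence.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a single-unit vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# Both sequences end with the same two inputs, but differ in the first one
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])
print(states_a)
print(states_b)
```

Even though the last two inputs are identical, the final states differ, because the first input is still reflected in the hidden state.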

In [7]:
from keras.layers import GRU, Input, Dense, TimeDistributed
from keras.models import Model, Sequential
from keras.layers import Activation
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy


def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a basic RNN model for x and y (training happens outside this function)
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Initiate model
    model = Sequential()
    
    # Add Gated Recurrent Layers
    # Return sequences set to True to remember full sequence and not just last output
    # Add modest recurrent dropout to prevent overfitting
    model.add(GRU(1024, input_shape=input_shape[1:], return_sequences=True))
    model.add(GRU(512, return_sequences=True, recurrent_dropout=0.3))
    
    # Add fully connected layer and softmax activation
    # Add a Time distributed wrapper to dense layer
    model.add(TimeDistributed(Dense(french_vocab_size)))
    model.add(Activation('softmax'))
    
    # Compile
    learning_rate = .001
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    return model


# Reshape the input to work with a simple RNN
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
simple_rnn_model = simple_model(tmp_x.shape,
                                preproc_french_sentences.shape[1],
                                len(english_tokenizer.word_index) + 1,  # +1 for the padding id 0
                                len(french_tokenizer.word_index) + 1)

simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=256, epochs=10, validation_split=0.2)

# Print predictions
print('Model predicts:')
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
print('Actual translation:')
print(french_sentences[0])

Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model predicts:
new jersey est parfois calme en l' automne l' il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Actual translation:
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .


With this simple RNN we achieved a validation accuracy of about 87%.
* Prediction: new jersey est parfois calme en l' automne l' il est neigeux en avril
* Actual: new jersey est parfois calme pendant l' automne et il est neigeux en avril
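
The accuracy quoted here is measured per output position, not per sentence. As a hedged sketch of what that means, the snippet below compares made-up id sequences position by position. It also shows a caveat worth keeping in mind: assuming padded positions are not masked out, they count as correct predictions too, which can make the number look better than the translation quality alone would suggest.

```python
def token_accuracy(predicted_ids, target_ids):
    """Fraction of positions where the predicted id equals the target id."""
    if len(predicted_ids) != len(target_ids):
        raise ValueError("sequences must have the same length")
    matches = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return matches / len(target_ids)


# Made-up ids for one sentence; 0 stands for padding
target = [4, 7, 1, 9, 3, 0, 0, 0]
predicted = [4, 7, 1, 2, 3, 0, 0, 0]

overall = token_accuracy(predicted, target)

# Same comparison restricted to the non-padding positions of the target
pairs = [(p, t) for p, t in zip(predicted, target) if t != 0]
without_padding = sum(p == t for p, t in pairs) / len(pairs)

print(overall, without_padding)
```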

### Model 2: Embedding
![RNN](images/embedding.png)
We've turned our words into ids, but there's a better representation of a word, called a word embedding. An embedding is a vector representation of a word that lies close to the vectors of similar words in n-dimensional space, where n is the size of the embedding vectors.

In other words, we run our word ids through a separate layer that outputs however many features we want for each word. Words that are similar in meaning will end up closer together. We'll be using Keras's [`Embedding`](https://keras.io/layers/embeddings/) layer.
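
Conceptually, an embedding is a lookup table from word id to vector. The sketch below uses a hand-written table with made-up 3-dimensional vectors to show the lookup and how closeness can be measured with cosine similarity. In a real model these vectors are learned during training rather than chosen by hand, and the words in the comments are only labels for this example.

```python
import math

# Hypothetical embeddings, hand-picked for illustration
embedding_table = {
    1: [0.9, 0.1, 0.0],  # "hot"
    2: [0.8, 0.2, 0.1],  # "warm"
    3: [0.0, 0.1, 0.9],  # "truck"
}


def embed(word_ids):
    """Replace each word id with its embedding vector."""
    return [embedding_table[word_id] for word_id in word_ids]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


hot, warm, truck = embed([1, 2, 3])
print(cosine_similarity(hot, warm))
print(cosine_similarity(hot, truck))
```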

In [8]:
from keras.layers import Embedding


def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build an RNN model that uses word embeddings for x and y (training happens outside this function)
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Initiate model
    model = Sequential()
    
    # Add embedding layer
    model.add(Embedding(english_vocab_size, 1024, input_length=input_shape[1]))
    
    # Add Gated Recurrent Layers
    # Return sequences set to True to remember full sequence and not just last output
    # Add modest recurrent dropout to prevent overfitting
    model.add(GRU(1024, return_sequences=True))
    model.add(GRU(512, return_sequences=True, recurrent_dropout=0.3))
    
    # Add fully connected layer and softmax activation
    # Add a Time distributed wrapper to dense layer
    model.add(TimeDistributed(Dense(french_vocab_size)))
    model.add(Activation('softmax'))
    
    # Compile
    learning_rate = .001
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    return model


# Reshape the input to work with embeddings
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])

# Train the neural network
embed_rnn_model = embed_model(tmp_x.shape,
                              preproc_french_sentences.shape[1],
                              len(english_tokenizer.word_index) + 1,  # +1 for the padding id 0
                              len(french_tokenizer.word_index) + 1)

embed_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=256, epochs=10, validation_split=0.2)

# Print prediction
print('Model predicts:')
print(logits_to_text(embed_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
print('Actual translation:')
print(french_sentences[0])

Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model predicts:
new jersey est parfois calme en l' automne et il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Actual translation:
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .


With this Embedded RNN we achieved a validation accuracy of about 93%.
* Prediction: new jersey est parfois calme en l' automne et il est neigeux en avril
* Actual: new jersey est parfois calme pendant l' automne et il est neigeux en avril

### Model 3: Bidirectional RNNs
![RNN](images/bidirectional.png)
One limitation of an RNN is that it can't see future inputs in the sequence, only past ones. Bidirectional RNNs address this by also processing the sequence in reverse, so that at each position the network has context from both the preceding and the following words. We'll be using Keras's [`Bidirectional`](https://keras.io/layers/wrappers/#bidirectional) wrapper.
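
A minimal plain-Python sketch of the idea: run a simple recurrence left to right, run it again right to left, and pair up the two states at each position. The single-unit tanh recurrence and its weights are made up for illustration, and both directions share the same weights here purely to keep the sketch short; a real bidirectional layer learns a separate set of weights per direction.

```python
import math


def run_direction(inputs, w_x=0.5, w_h=0.8):
    """Single-unit vanilla RNN over a sequence, returning the state at every step."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(inputs):
    """Pair the forward state with the backward state at each position."""
    forward = run_direction(inputs)
    # Process the reversed sequence, then flip the states back to the original order
    backward = run_direction(inputs[::-1])[::-1]
    return list(zip(forward, backward))


# The only non-zero input is at the end of the sequence
states = bidirectional([0.0, 0.0, 1.0])
print(states[0])
```

At position 0 the forward state is still zero, since nothing informative has been seen yet, while the backward state already reflects the input at the end of the sequence.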

In [9]:
from keras.layers import Bidirectional


def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a bidirectional RNN model for x and y (training happens outside this function)
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Initiate model
    model = Sequential()
    
    # Add Gated Recurrent Layers
    # Add Bidirectional wrapper to learn from future input
    # Return sequences set to True to remember full sequence and not just last output
    # Add modest recurrent dropout to prevent overfitting
    model.add(Bidirectional(GRU(1024, return_sequences=True), input_shape=input_shape[1:]))
    model.add(Bidirectional(GRU(512, return_sequences=True, recurrent_dropout=0.3)))
    
    # Add fully connected layer and softmax activation
    # Add a Time distributed wrapper to dense layer
    model.add(TimeDistributed(Dense(french_vocab_size)))
    model.add(Activation('softmax'))
    
    # Compile
    learning_rate = .001
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    return model


# Reshape the input to work with bidirectional model
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the model
bd_rnn_model = bd_model(tmp_x.shape,
                        preproc_french_sentences.shape[1],
                        len(english_tokenizer.word_index) + 1,  # +1 for the padding id 0
                        len(french_tokenizer.word_index) + 1)

bd_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=256, epochs=10, validation_split=0.2)

# Print prediction(s)
print('Model predicts:')
print(logits_to_text(bd_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
print('Actual translation:')
print(french_sentences[0])

Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model predicts:
new jersey est parfois calme au cours automne il automne neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Actual translation:
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .


With this Bidirectional RNN we achieved a validation accuracy of about 87%.
* Prediction: new jersey est parfois calme au cours automne il automne neigeux en avril
* Actual: new jersey est parfois calme pendant l' automne et il est neigeux en avril

### Model 4: Encoder-Decoder
Another useful model is the Encoder-Decoder. As the name suggests, this model is made up of an encoder and a decoder. The encoder creates a fixed-length vector representation of the sentence. The decoder takes this vector as input and predicts the translation as output. Think of this as a two-step neural network: one network comes up with an encoding, and the other decodes it. We'll be using Keras's [`RepeatVector`](https://faroit.github.io/keras-docs/2.0.6/layers/core/#repeatvector) layer to feed the encoding to the decoder at every output step.
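
The repeat step is simple enough to sketch in plain Python: the encoder produces one vector per sentence, and that vector is copied once for every output position so the decoder receives a sequence. The helper `repeat_vector` and the numbers below are invented for illustration.

```python
def repeat_vector(vector, n):
    """Copy one fixed-length vector n times, one copy per decoder time step."""
    return [list(vector) for _ in range(n)]


# Made-up encoder output: a single vector summarizing the whole input sentence
encoded_sentence = [0.2, -0.5, 0.7]

# The decoder then sees the same summary at each of its 4 output steps
decoder_input = repeat_vector(encoded_sentence, 4)
print(len(decoder_input), len(decoder_input[0]))
```

One consequence of this design is that everything the decoder knows about the input has to fit into that single vector, which may help explain why this model alone scores lower than the others here.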

In [10]:
from keras.layers import RepeatVector


def encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build an encoder-decoder model for x and y (training happens outside this function)
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Initiate model
    model = Sequential()
    
    # Add Gated Recurrent Layers
    model.add(GRU(1024, input_shape=input_shape[1:], return_sequences=False))
    # Add Repeat vector to repeat the last output
    model.add(RepeatVector(output_sequence_length))
    # Add modest recurrent dropout to prevent overfitting
    model.add(GRU(512, return_sequences=True, recurrent_dropout=0.3))
    
    # Add fully connected layer and softmax activation
    # Add a Time distributed wrapper to dense layer
    model.add(TimeDistributed(Dense(french_vocab_size)))
    model.add(Activation('softmax'))
    
    # Compile
    learning_rate = .001
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    return model


# Reshape the input to work with model
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
encdec_rnn_model = encdec_model(tmp_x.shape,
                                preproc_french_sentences.shape[1],
                                len(english_tokenizer.word_index) + 1,  # +1 for the padding id 0
                                len(french_tokenizer.word_index) + 1)

encdec_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=256, epochs=10, validation_split=0.2)

# Print predictions
print('Model predicts:')
print(logits_to_text(encdec_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
print('Actual translation:')
print(french_sentences[0])

Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model predicts:
new jersey est parfois chaud au l' et il automne et en est en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Actual translation:
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .


With this Encoder-Decoder RNN we achieved a validation accuracy of about 82%.
* Prediction: new jersey est parfois chaud au l' et il automne et en est en
* Actual: new jersey est parfois calme pendant l' automne et il est neigeux en avril

### Model 5: Putting it all together
Let's combine the ideas from the previous models into one final model for machine translation. We'll include embeddings, bidirectionality, and the encoder-decoder structure.
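
Before reading the model code, it can help to trace how the per-sample shape changes from layer to layer. The helper below is a hypothetical sketch that only does the arithmetic, without building a model. It assumes a bidirectional wrapper concatenates its forward and backward outputs (doubling the unit count), and the target vocabulary size of 300 is a placeholder, not the real value.

```python
def trace_shapes(input_length, output_length, embedding_dim,
                 encoder_units, decoder_units, target_vocab_size):
    """Per-sample output shape after each layer (batch dimension omitted)."""
    return [
        ("Embedding", (input_length, embedding_dim)),
        # Encoder keeps only its final state; two directions are concatenated
        ("Bidirectional GRU (encoder)", (2 * encoder_units,)),
        ("RepeatVector", (output_length, 2 * encoder_units)),
        # Decoder returns a state for every output step
        ("Bidirectional GRU (decoder)", (output_length, 2 * decoder_units)),
        ("TimeDistributed Dense + softmax", (output_length, target_vocab_size)),
    ]


shapes = trace_shapes(input_length=21, output_length=21, embedding_dim=1024,
                      encoder_units=1024, decoder_units=512,
                      target_vocab_size=300)  # placeholder vocabulary size
for name, shape in shapes:
    print(name, shape)
```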

In [11]:
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a model that incorporates embedding, encoder-decoder, and bidirectional RNN for x and y (training happens outside this function)
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Initiate model
    model = Sequential()
    
    # Add embedding layer
    model.add(Embedding(english_vocab_size, 1024, input_length=input_shape[1]))
    
    # Add Gated Recurrent Layers, Bidirectional, RepeatVector, dropout
    model.add(Bidirectional(GRU(1024, return_sequences=False)))
    model.add(RepeatVector(output_sequence_length))
    model.add(Bidirectional(GRU(512, return_sequences=True, recurrent_dropout=0.3)))
    
    # Add fully connected layer and softmax activation
    # Add a Time distributed wrapper to dense layer
    model.add(TimeDistributed(Dense(french_vocab_size)))
    model.add(Activation('softmax'))
    
    # Compile
    learning_rate = .001
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    return model


# Reshape the input to work with model
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])

# Train the neural network
model = model_final(tmp_x.shape,
                    preproc_french_sentences.shape[1],
                    len(english_tokenizer.word_index) + 1,  # +1 for the padding id 0
                    len(french_tokenizer.word_index) + 1)

model.fit(tmp_x, preproc_french_sentences, batch_size=256, epochs=10, validation_split=0.2)

# Print predictions
print('Model predicts:')
print(logits_to_text(model.predict(tmp_x[:1])[0], french_tokenizer))
print('Actual translation:')
print(french_sentences[0])

Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model predicts:
new jersey est parfois calme pendant l' automne et il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Actual translation:
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .


With our final RNN we achieved a validation accuracy of about 98%. We could likely do even better with longer training times.
* Prediction: new jersey est parfois calme pendant l' automne et il est neigeux en avril
* Actual: new jersey est parfois calme pendant l' automne et il est neigeux en avril

## Translation Pipeline
Lastly, let's create a function that can take in an unprocessed English sentence and return the French translation.

In [14]:
def translate(sentence):
    """
    Takes in an English sentence and returns the translated French sentence.
    """
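    # Pre-process sentence: look up each word's id, then pad to the model's input length (21)
    # Note: a word outside the training vocabulary raises a KeyError here
    sentence = [english_tokenizer.word_index[word] for word in sentence.split()]
    sentence = pad([sentence], length=21)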
    
    # Make prediction
    prediction = logits_to_text(model.predict(sentence)[0], french_tokenizer)
    
    return prediction

print('he saw a old yellow truck')
print('Translation:')
print(translate('he saw a old yellow truck'))
print('Actual:')
print('il a vu un vieux camion jaune')

he saw a old yellow truck
Translation:
il a vu un vieux camion jaune <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Actual:
il a vu un vieux camion jaune


Of course, this only works for English words in our training vocabulary. For a more general-purpose translator, we would need to train on more data.
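
Since `translate` looks words up directly in the tokenizer's dictionary, any word outside the vocabulary raises a `KeyError`. One way to soften that, sketched below under stated assumptions, is to normalize the input and report unknown words instead of failing. The helper `encode_sentence` and the toy `toy_word_index` are hypothetical; mapping unknown words to id 0 reuses the padding id, which is a simplification, and a dedicated out-of-vocabulary id would be the cleaner choice in a real system.

```python
import re


def encode_sentence(sentence, word_index, length, unknown_id=0):
    """Lowercase, strip punctuation, map words to ids, and right-pad to `length`.

    Returns the padded ids together with the list of words that were not found.
    """
    words = re.sub(r"[^\w\s'-]", " ", sentence.lower()).split()
    ids = [word_index.get(word, unknown_id) for word in words][:length]
    unknown = [word for word in words if word not in word_index]
    return ids + [0] * (length - len(ids)), unknown


# Toy vocabulary, invented for illustration
toy_word_index = {"he": 1, "saw": 2, "a": 3, "old": 4, "yellow": 5, "truck": 6}

ids, unknown = encode_sentence("He saw a shiny yellow truck .", toy_word_index, length=10)
print(ids)
print(unknown)
```

Reporting the unknown words makes it easy to tell a poor translation apart from an input the model was never trained to handle.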