# RNN Translator
This notebook builds an English-to-French translator using an RNN sequence-to-sequence (seq2seq) model.
![](https://cdn-images-1.medium.com/max/1600/1*3lj8AGqfwEE5KCTJ-dXTvg.png)

In [1]:
import numpy as np
import pickle
import tensorflow as tf
from tensorflow.python.layers.core import Dense

## Load the data

In [2]:
en_text_path = './data/small_vocab_en'
fr_text_path = './data/small_vocab_fr'
with open(en_text_path, 'r', encoding='utf-8') as f:
    en_text = f.read()
with open(fr_text_path, 'r', encoding='utf-8') as f:
    fr_text = f.read()

In [3]:
en_text.split('\n')[:5]

['new jersey is sometimes quiet during autumn , and it is snowy in april .',
 'the united states is usually chilly during july , and it is usually freezing in november .',
 'california is usually quiet during march , and it is usually hot in june .',
 'the united states is sometimes mild during june , and it is cold in september .',
 'your least liked fruit is the grape , but my least liked is the apple .']

In [4]:
fr_text.split('\n')[:5]

["new jersey est parfois calme pendant l' automne , et il est neigeux en avril .",
 'les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .',
 'california est généralement calme en mars , et il est généralement chaud en juin .',
 'les états-unis est parfois légère en juin , et il fait froid en septembre .',
 'votre moins aimé fruit est le raisin , mais mon moins aimé est la pomme .']

## Prepare the input data
![](https://cdn-images-1.medium.com/max/1600/1*Ismhi-muID5ooWf3ZIQFFg.png)   

Although the seq2seq model is designed for variable-length input and output sequences, every batch fed to the network must still contain sequences of the same length. This is handled with a few special tokens:
* **&lt;PAD&gt;**: During training, we feed examples to the network in batches. All inputs in a batch must have the same width for the network to do its calculations, but our examples have different lengths, so we pad shorter inputs up to the width of the batch.
* **&lt;EOS&gt;**: Also a necessity of batching, but on the decoder side. It tells the decoder where a sentence ends, and it lets the decoder signal the same thing in its own outputs.
* **&lt;UNK&gt;**: When training on real data, you can vastly improve the resource efficiency of the model by ignoring words that don’t show up often enough in the vocabulary to warrant consideration. Those words are replaced with **&lt;UNK&gt;**.
* **&lt;GO&gt;**: The input to the first time step of the decoder, which tells the decoder when to start generating output.
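
To make the role of these tokens concrete, here is a minimal pure-Python sketch on a toy vocabulary. The `encode` helper, the toy words and the ids are assumptions made for illustration; they are not the notebook's real vocabulary or functions.

```python
# Toy vocabulary; the ids are illustrative, not the notebook's real ids.
SPECIAL_TOKENS = ['<PAD>', '<EOS>', '<UNK>', '<GO>']
toy_words = ['a', 'car', 'i', 'saw']
toy_vocab_to_int = {w: i for i, w in enumerate(SPECIAL_TOKENS + toy_words)}

def encode(sentence, vocab_to_int, max_len, add_eos=False):
    """Map words to ids, use <UNK> for unknown words, optionally append <EOS>, then pad."""
    ids = [vocab_to_int.get(w, vocab_to_int['<UNK>']) for w in sentence.lower().split()]
    if add_eos:
        ids.append(vocab_to_int['<EOS>'])
    # Right-pad with <PAD> up to the requested width
    return ids + [vocab_to_int['<PAD>']] * (max_len - len(ids))

# 'truck' is not in the toy vocabulary, so it becomes <UNK>
encoded = encode('I saw a truck', toy_vocab_to_int, max_len=6, add_eos=True)
print(encoded)
```

Here the unknown word maps to the **&lt;UNK&gt;** id, the sentence is closed with **&lt;EOS&gt;**, and the remaining position is filled with **&lt;PAD&gt;**.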

In [5]:
from collections import Counter
def get_vocab_int(text):
    text = text.lower()
    words = text.split()
    vocab = sorted(set(words))
    # Count occurrences over all words; counting the de-duplicated set would give 1 for every word
    vocab_counter = Counter(words)
    # Low-frequency words are not filtered out yet
    vocab = ['<PAD>','<EOS>','<UNK>','<GO>'] + vocab
    vocab_to_int = {word: index for index, word in enumerate(vocab)}
    int_to_vocab = {index: word for word, index in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab, vocab_counter

In [6]:
en_vocab_to_int, en_int_to_vocab, en_vocab_counter = get_vocab_int(en_text)
fr_vocab_to_int, fr_int_to_vocab, fr_vocab_counter = get_vocab_int(fr_text)
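
The comment in `get_vocab_int` leaves low-frequency filtering for later. One possible way to do it is sketched below; `build_vocab` and its `min_count` parameter are hypothetical names introduced here, not part of the notebook. Words that fall below the threshold are simply left out of the vocabulary, so they map to **&lt;UNK&gt;** at lookup time.

```python
from collections import Counter

def build_vocab(text, min_count=1):
    """Build vocabulary maps, keeping only words seen at least `min_count` times."""
    word_counts = Counter(text.lower().split())
    kept = sorted(w for w, c in word_counts.items() if c >= min_count)
    vocab = ['<PAD>', '<EOS>', '<UNK>', '<GO>'] + kept
    vocab_to_int = {w: i for i, w in enumerate(vocab)}
    int_to_vocab = {i: w for w, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab, word_counts

toy_text = 'the cat sat\nthe cat ran\nthe dog sat'
v2i, i2v, counts = build_vocab(toy_text, min_count=2)
print(sorted(v2i, key=v2i.get))
```

With `min_count=2`, the words `ran` and `dog` (seen once each) are dropped from the toy vocabulary.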

In [7]:
def text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int):

    # Lowercase the text so it matches the vocabulary built by get_vocab_int
    source_id_text_line = source_text.lower().split('\n')
    target_id_text_line = target_text.lower().split('\n')
    source_id_text = []
    target_id_text = []
    for line in source_id_text_line:
        new_line = [source_vocab_to_int.get(word, source_vocab_to_int['<UNK>']) for word in line.split()]
        source_id_text.append(new_line)
        
    for line in target_id_text_line:
        new_line = [target_vocab_to_int.get(word, target_vocab_to_int['<UNK>']) for word in line.split()]
        new_line.append(target_vocab_to_int['<EOS>'])
        target_id_text.append(new_line)
        
    return source_id_text, target_id_text

In [8]:
en_text_to_id, fr_text_to_id = text_to_ids(en_text, fr_text, en_vocab_to_int, fr_vocab_to_int)

The data now has the following form:   
<img src='./img/format.png' width='600px'>

In [9]:
max_en_seq_length = max([len(line) for line in en_text_to_id])
max_fr_seq_length = max([len(line) for line in fr_text_to_id])

In [10]:
en_text_to_id = [sentence + [en_vocab_to_int['<PAD>']] * (max_en_seq_length - len(sentence))
                    for sentence in en_text_to_id]
fr_text_to_id = [sentence + [fr_vocab_to_int['<PAD>']] * (max_fr_seq_length - len(sentence))
                    for sentence in fr_text_to_id]

### Save data

In [11]:
with open('preprocess.p', 'wb') as out_file:
    pickle.dump((
        (en_text_to_id, fr_text_to_id),
        (en_vocab_to_int, fr_vocab_to_int),
        (en_int_to_vocab, fr_int_to_vocab)), out_file)

## Components of Network

### Input placeholder

In [12]:
def get_input_placeholder():
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')

    target_sequence_length = tf.placeholder(tf.int32, (None,), name='target_sequence_length')
    max_target_sequence_length = tf.reduce_max(target_sequence_length, name='max_target_length')
    source_sequence_length = tf.placeholder(tf.int32, (None,), name='source_sequence_length')
    
    return input_data, targets, lr, target_sequence_length, max_target_sequence_length, source_sequence_length

### Encoder layer
![](./img/encoder.png)

In [13]:
def encoding_layer(input_data, rnn_size, num_layers,
                   source_sequence_length, source_vocab_size, 
                   encoding_embedding_size):


    # Encoder embedding
    enc_embed_input = tf.contrib.layers.embed_sequence(input_data, source_vocab_size, encoding_embedding_size)

    # RNN cell
    def make_cell(rnn_size):
        enc_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        return enc_cell

    enc_cell = tf.contrib.rnn.MultiRNNCell([make_cell(rnn_size) for _ in range(num_layers)])
    
    enc_output, enc_state = tf.nn.dynamic_rnn(enc_cell, enc_embed_input, sequence_length=source_sequence_length, dtype=tf.float32)
    
    return enc_output, enc_state

### Decoder

The last token of each target sequence is never used as decoder input, so we remove it and prepend **&lt;GO&gt;** to the start of each sequence.
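
The same shift can be sketched in plain Python on a toy batch. `shift_decoder_input` and the ids below are illustrative assumptions; the notebook itself performs this step with `tf.strided_slice` and `tf.concat`.

```python
def shift_decoder_input(target_batch, go_id):
    """Drop the last id of every row and prepend the <GO> id."""
    return [[go_id] + row[:-1] for row in target_batch]

# Illustrative ids: 0 = <PAD>, 1 = <EOS>, 3 = <GO>
toy_targets = [[10, 11, 12, 1],
               [20, 21, 1, 0]]
decoder_inputs = shift_decoder_input(toy_targets, go_id=3)
print(decoder_inputs)
```

Each row keeps its length: the decoder sees **&lt;GO&gt;** at step 0 and the previous target word at every later step.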

In [14]:
def process_decoder_input(target_data, vocab_to_int, batch_size):
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input

![](./img/decoder.png)

In [15]:
def decoding_layer(target_letter_to_int, decoding_embedding_size, num_layers, rnn_size,
                   target_sequence_length, max_target_sequence_length, enc_state, dec_input):
    # 1. Decoder Embedding
    target_vocab_size = len(target_letter_to_int)
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

    # 2. Construct the decoder cell
    def make_cell(rnn_size):
        dec_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        return dec_cell

    dec_cell = tf.contrib.rnn.MultiRNNCell([make_cell(rnn_size) for _ in range(num_layers)])
     
    # 3. Dense layer to translate the decoder's output at each time 
    # step into a choice from the target vocabulary
    output_layer = Dense(target_vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))


    # 4. Set up a training decoder and an inference decoder
    # Training Decoder
    with tf.variable_scope("decode"):

        # Helper for the training process. Used by BasicDecoder to read inputs.
        training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                            sequence_length=target_sequence_length,
                                                            time_major=False)
        
        
        # Basic decoder
        training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                           training_helper,
                                                           enc_state,
                                                           output_layer) 
        
        # Perform dynamic decoding using the decoder
        training_decoder_output = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                                       impute_finished=True,
                                                                       maximum_iterations=max_target_sequence_length)[0]
    # 5. Inference Decoder
    # Reuses the same parameters trained by the training process
    with tf.variable_scope("decode", reuse=True):
        start_tokens = tf.tile(tf.constant([target_letter_to_int['<GO>']], dtype=tf.int32), [batch_size], name='start_tokens')

        # Helper for the inference process.
        inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings,
                                                                start_tokens,
                                                                target_letter_to_int['<EOS>'])

        # Basic decoder
        inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                        inference_helper,
                                                        enc_state,
                                                        output_layer)
        
        # Perform dynamic decoding using the decoder
        inference_decoder_output = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                            impute_finished=True,
                                                            maximum_iterations=max_target_sequence_length)[0]
         

    
    return training_decoder_output, inference_decoder_output

### Put encoder and decoder together to build the model

In [16]:
def seq2seq_model(input_data, targets, lr, target_sequence_length, 
                  max_target_sequence_length, source_sequence_length,
                  source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size, 
                  rnn_size, num_layers):
    
    # Pass the input data through the encoder. We'll ignore the encoder output, but use the state
    _, enc_state = encoding_layer(input_data, 
                                  rnn_size, 
                                  num_layers, 
                                  source_sequence_length,
                                  source_vocab_size, 
                                  enc_embedding_size)
    
    
    # Prepare the target sequences we'll feed to the decoder in training mode.
    # Note: target_vocab_to_int and batch_size are read from the notebook's global scope.
    dec_input = process_decoder_input(targets, target_vocab_to_int, batch_size)
    
    # Pass encoder state and decoder inputs to the decoders
    training_decoder_output, inference_decoder_output = decoding_layer(target_vocab_to_int, 
                                                                       dec_embedding_size, 
                                                                       num_layers, 
                                                                       rnn_size,
                                                                       target_sequence_length,
                                                                       max_target_sequence_length,
                                                                       enc_state, 
                                                                       dec_input) 
    
    return training_decoder_output, inference_decoder_output

### Hyperparameters

In [23]:
# Number of Epochs
epochs = 8
# Batch Size
batch_size = 128
# RNN Size
rnn_size = 50
# Number of Layers
num_layers = 2
# Embedding Size
encoding_embedding_size = 15
decoding_embedding_size = 15
# Learning Rate
learning_rate = 0.001

### Load the preprocessed data

In [18]:
with open('preprocess.p', mode='rb') as in_file:
    ((source_text_int, target_text_int),
    (source_vocab_to_int, target_vocab_to_int),
    (source_int_to_vocab, target_int_to_vocab)) = pickle.load(in_file)

In [19]:
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training
with train_graph.as_default():
    
    # Load the model inputs    
    input_data, targets, lr, target_sequence_length, max_target_sequence_length, source_sequence_length = get_input_placeholder()
    
    # Create the training and inference logits
    training_decoder_output, inference_decoder_output = seq2seq_model(input_data, 
                                                                      targets, 
                                                                      lr, 
                                                                      target_sequence_length, 
                                                                      max_target_sequence_length, 
                                                                      source_sequence_length,
                                                                      len(source_vocab_to_int),
                                                                      len(target_vocab_to_int),
                                                                      encoding_embedding_size, 
                                                                      decoding_embedding_size, 
                                                                      rnn_size, 
                                                                      num_layers)    
    
    # Name the training logits and the inference predictions (sampled word ids) so they can be retrieved later
    training_logits = tf.identity(training_decoder_output.rnn_output, 'logits')
    inference_logits = tf.identity(inference_decoder_output.sample_id, name='predictions')
    
    # Create the weights for sequence_loss
    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
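
As a rough illustration of what the mask does in the loss, here is a pure-Python sketch of a masked, averaged softmax cross-entropy. `sequence_mask` and `masked_sequence_loss` are simplified stand-ins written for this example; they only approximate the behaviour of `tf.sequence_mask` and `tf.contrib.seq2seq.sequence_loss` with default averaging, and are not the library code.

```python
import math

def sequence_mask(lengths, max_len):
    """1.0 for real time steps, 0.0 for padding positions."""
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]

def masked_sequence_loss(logits, targets, mask):
    """Average softmax cross-entropy over the unmasked positions only."""
    total, weight = 0.0, 0.0
    for row_logits, row_targets, row_mask in zip(logits, targets, mask):
        for step_logits, target_id, m in zip(row_logits, row_targets, row_mask):
            log_norm = math.log(sum(math.exp(x) for x in step_logits))
            total += m * (log_norm - step_logits[target_id])
            weight += m
    return total / weight

# Toy batch: 2 sequences, 2 time steps, 2-word vocabulary, uniform logits
toy_logits = [[[0.0, 0.0], [0.0, 0.0]],
              [[0.0, 0.0], [0.0, 0.0]]]
toy_targets = [[0, 1], [1, 0]]
toy_mask = sequence_mask([2, 1], max_len=2)
toy_loss = masked_sequence_loss(toy_logits, toy_targets, toy_mask)
print(toy_mask, toy_loss)
```

Positions where the mask is 0 contribute nothing to the loss, which is why the lengths passed to `target_sequence_length` matter.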


### Get the batch

In [20]:
def get_batches(targets, sources, batch_size):
    for batch_i in range(0, len(sources)//batch_size):
        start_i = batch_i * batch_size
        sources_batch = sources[start_i:start_i + batch_size]
        targets_batch = targets[start_i:start_i + batch_size]
        
        # Sequence lengths for the *_sequence_length placeholders.
        # The data was padded during preprocessing, so these are the padded lengths.
        targets_lengths = []
        for target in targets_batch:
            targets_lengths.append(len(target))

        source_lengths = []
        for source in sources_batch:
            source_lengths.append(len(source))
        
        yield targets_batch, sources_batch, targets_lengths, source_lengths
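
Because the sequences are padded before batching, `get_batches` reports the padded length for every example. A possible variant that reports unpadded lengths is sketched below. The function name and the `*_pad_id` parameters are assumptions made for this example; the default of 0 matches the position of `<PAD>` at the start of the vocabulary list in `get_vocab_int`. Each batch is also trimmed to its longest real sequence, so the batch width stays consistent with the maximum length in that batch.

```python
def get_batches_with_true_lengths(targets, sources, batch_size,
                                  target_pad_id=0, source_pad_id=0):
    """Yield full batches, trimmed to the longest real sequence in each batch."""
    def true_length(seq, pad_id):
        # Assumes <PAD> only ever appears as trailing padding
        return sum(1 for token_id in seq if token_id != pad_id)

    for start in range(0, len(sources) - batch_size + 1, batch_size):
        sources_batch = sources[start:start + batch_size]
        targets_batch = targets[start:start + batch_size]
        source_lengths = [true_length(s, source_pad_id) for s in sources_batch]
        target_lengths = [true_length(t, target_pad_id) for t in targets_batch]
        # Trim so the batch width matches the longest unpadded sequence
        max_source, max_target = max(source_lengths), max(target_lengths)
        sources_batch = [s[:max_source] for s in sources_batch]
        targets_batch = [t[:max_target] for t in targets_batch]
        yield targets_batch, sources_batch, target_lengths, source_lengths

toy_sources = [[5, 6, 0, 0], [7, 0, 0, 0]]
toy_targets = [[8, 1, 0], [9, 10, 1]]
batch = next(get_batches_with_true_lengths(toy_targets, toy_sources, batch_size=2))
print(batch)
```

Whether this improves the translations has not been tested here; it only shows how the true lengths could be recovered.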

## Train the network

In [24]:
# Split data to training and validation sets
train_source = source_text_int[batch_size:]
train_target = target_text_int[batch_size:]
valid_source = source_text_int[:batch_size]
valid_target = target_text_int[:batch_size]
(valid_targets_batch, valid_sources_batch, valid_targets_lengths, valid_sources_lengths) = next(get_batches(valid_target, valid_source, batch_size))

display_step = 200 # Report training and validation loss every 200 batches

checkpoint = "best_model.ckpt" 
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
        
    for epoch_i in range(1, epochs+1):
        for batch_i, (targets_batch, sources_batch, targets_lengths, sources_lengths) in enumerate(
                get_batches(train_target, train_source, batch_size)):
            assert len(targets_batch) == len(sources_batch)
            assert len(targets_lengths) == len(sources_lengths)
            # Training step
            _, loss = sess.run(
                [train_op, cost],
                {input_data: sources_batch,
                 targets: targets_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths})

            # Debug message updating us on the status of the training
            if batch_i % display_step == 0 and batch_i > 0:
                
                # Calculate validation cost
                validation_loss = sess.run(
                [cost],
                {input_data: valid_sources_batch,
                 targets: valid_targets_batch,
                 lr: learning_rate,
                 target_sequence_length: valid_targets_lengths,
                 source_sequence_length: valid_sources_lengths})
                
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}  - Validation loss: {:>6.3f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(train_source) // batch_size, 
                              loss, 
                              validation_loss[0]))

    
    
    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, checkpoint)
    print('Model Trained and Saved')

Epoch   1/8 Batch  200/1076 - Loss:  2.475  - Validation loss:  2.490
Epoch   1/8 Batch  400/1076 - Loss:  2.130  - Validation loss:  2.082
Epoch   1/8 Batch  600/1076 - Loss:  1.817  - Validation loss:  1.790
Epoch   1/8 Batch  800/1076 - Loss:  1.593  - Validation loss:  1.519
Epoch   1/8 Batch 1000/1076 - Loss:  1.323  - Validation loss:  1.309
Epoch   2/8 Batch  200/1076 - Loss:  1.124  - Validation loss:  1.101
Epoch   2/8 Batch  400/1076 - Loss:  1.004  - Validation loss:  0.968
Epoch   2/8 Batch  600/1076 - Loss:  0.884  - Validation loss:  0.872
Epoch   2/8 Batch  800/1076 - Loss:  0.816  - Validation loss:  0.800
Epoch   2/8 Batch 1000/1076 - Loss:  0.744  - Validation loss:  0.740
Epoch   3/8 Batch  200/1076 - Loss:  0.696  - Validation loss:  0.679
Epoch   3/8 Batch  400/1076 - Loss:  0.647  - Validation loss:  0.642
Epoch   3/8 Batch  600/1076 - Loss:  0.606  - Validation loss:  0.613
Epoch   3/8 Batch  800/1076 - Loss:  0.598  - Validation loss:  0.592
Epoch   3/8 Batch 10

## Do the Translation

### Convert a sentence to a sequence of word ids for feeding the model

In [25]:
def sentence_to_seq(text, vocab_to_int):
    text = text.lower()
    text = text.split()
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text]

In [38]:
translate_sentence = 'i saw a car'

In [39]:
translate_sentence = sentence_to_seq(translate_sentence, source_vocab_to_int)

checkpoint = "./best_model.ckpt"
loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
    source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')

    translate_logits = sess.run(logits, {input_data: [translate_sentence]*batch_size,
                                         target_sequence_length: [len(translate_sentence)*2]*batch_size,
                                         source_sequence_length: [len(translate_sentence)]*batch_size})[0]

print('Input')
print('  Word Ids:      {}'.format([i for i in translate_sentence]))
print('  English Words: {}'.format([source_int_to_vocab[i] for i in translate_sentence]))

print('\nPrediction')
print('  Word Ids:      {}'.format([i for i in translate_logits]))
print('  French Words: {}'.format(" ".join([target_int_to_vocab[i] for i in translate_logits])))

INFO:tensorflow:Restoring parameters from ./best_model.ckpt
Input
  Word Ids:      [99, 176, 7, 39]
  English Words: ['i', 'saw', 'a', 'car']

Prediction
  Word Ids:      [118, 177, 41, 10, 1]
  French Words: est la automne . <EOS>
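
The printed prediction still contains the `<EOS>` marker. A small post-processing helper can cut the output at the first `<EOS>` and skip any padding. `ids_to_sentence` is a hypothetical helper, and the toy mapping below only contains the ids visible in the output above (plus `<PAD>` as id 0, its position in the vocabulary list).

```python
def ids_to_sentence(ids, int_to_vocab, eos_token='<EOS>', pad_token='<PAD>'):
    """Join predicted word ids into a sentence, stopping at the first <EOS>."""
    words = []
    for word_id in ids:
        word = int_to_vocab[word_id]
        if word == eos_token:
            break
        if word != pad_token:
            words.append(word)
    return ' '.join(words)

# Mapping reconstructed from the prediction printed above
toy_int_to_vocab = {0: '<PAD>', 1: '<EOS>', 10: '.', 41: 'automne', 118: 'est', 177: 'la'}
sentence = ids_to_sentence([118, 177, 41, 10, 1, 0], toy_int_to_vocab)
print(sentence)
```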


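## The result
The result is rather bad: for the input `i saw a car` the model outputs `est la automne .`, which is not a correct translation. One possible contributing factor is that the sequence lengths fed to the network are computed on already-padded sequences, so padding is treated as real input during training.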