# Project4 - Language Translation
This project trains a sequence-to-sequence model on a dataset of paired English and French sentences, then uses the trained model to translate new English sentences into French.

### Data Import 
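The training data is a small-vocabulary set of English sentences (`small_vocab_en.txt`) and their French counterparts (`small_vocab_fr.txt`), so it covers only a small portion of each language.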

In [1]:
import helper
import problem_unittests as tests

source_path = 'data/small_vocab_en.txt'
target_path = 'data/small_vocab_fr.txt'
source_text = helper.load_data(source_path)
target_text = helper.load_data(target_path)

A few printouts to provide some insight into the dataset.

In [2]:
import numpy as np

view_sentence_range = (0, 10)

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in source_text.split()})))

sentences = source_text.split('\n')
word_counts = [len(sentence.split()) for sentence in sentences]
print('Number of sentences: {}'.format(len(sentences)))
print('Average number of words in a sentence: {}'.format(np.average(word_counts)))

print()
print('English sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(source_text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))
print()
print('French sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(target_text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))

Dataset Stats
Roughly the number of unique words: 227
Number of sentences: 137861
Average number of words in a sentence: 13.225277634719028

English sentences 0 to 10:
new jersey is sometimes quiet during autumn , and it is snowy in april .
the united states is usually chilly during july , and it is usually freezing in november .
california is usually quiet during march , and it is usually hot in june .
the united states is sometimes mild during june , and it is cold in september .
your least liked fruit is the grape , but my least liked is the apple .
his favorite fruit is the orange , but my favorite is the grape .
paris is relaxing during december , but it is usually chilly in july .
new jersey is busy during spring , and it is never hot in march .
our least liked fruit is the lemon , but my least liked is the grape .
the united states is sometimes busy during january , and it is sometimes warm in november .

French sentences 0 to 10:
new jersey est parfois calme pendant l' automne 

### Data Preprocessing
Convert the source and target text into lists of word ids that the network can consume. An `<EOS>` id is appended to the end of every target sentence.

In [3]:
def text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int):
    """
    Convert source and target text to proper word ids
    :param source_text: String that contains all the source text.
    :param target_text: String that contains all the target text.
    :param source_vocab_to_int: Dictionary to go from the source words to an id
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :return: A tuple of lists (source_id_text, target_id_text)
    """
    source_id_text = [[source_vocab_to_int[x] for x in y.split()] for y in source_text.split('\n')]
    target_id_text = [[target_vocab_to_int[a] for a in b.split()] for b in target_text.split('\n')]
    
    target_id_text = [entry + [target_vocab_to_int['<EOS>']]for entry in target_id_text]
    
    return source_id_text, target_id_text
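
To make the conversion concrete, here is a minimal sketch that applies the same steps to a two-sentence toy corpus. The vocabularies and the name `toy_text_to_ids` are made up for illustration; the real dictionaries are built by `helper`.

```python
# Toy vocabularies (hypothetical; the real ones come from helper)
source_vocab = {'the': 0, 'cat': 1, 'sat': 2}
target_vocab = {'<EOS>': 0, 'le': 1, 'chat': 2, 'assis': 3}

def toy_text_to_ids(source_text, target_text, source_vocab, target_vocab):
    # One list of ids per line; target lines also get a trailing <EOS> id
    source_ids = [[source_vocab[w] for w in line.split()]
                  for line in source_text.split('\n')]
    target_ids = [[target_vocab[w] for w in line.split()] + [target_vocab['<EOS>']]
                  for line in target_text.split('\n')]
    return source_ids, target_ids

src_ids, tgt_ids = toy_text_to_ids('the cat\nthe cat sat',
                                   'le chat\nle chat assis',
                                   source_vocab, target_vocab)
print(src_ids)  # [[0, 1], [0, 1, 2]]
print(tgt_ids)  # [[1, 2, 0], [1, 2, 3, 0]]
```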

Use `helper` to preprocess the data and save the result to file.

In [4]:
helper.preprocess_and_save_data(source_path, target_path, text_to_ids)

Load the preprocessed data from disk.

In [5]:
(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = helper.load_preprocess()

## Building the Neural Network

In [6]:
import tensorflow as tf
from tensorflow.python.layers.core import Dense

### Input
The following function creates all the placeholders the network needs.

In [7]:
def model_inputs():
    """
    Create TF Placeholders for input, targets, learning rate, and lengths of source and target sequences.
    :return: Tuple (input, targets, learning rate, keep probability, target sequence length,
    max target sequence length, source sequence length)
    """
    
    inputs = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    learning_rate = tf.placeholder(tf.float32, name='learning_rate')
    source_len = tf.placeholder(tf.int32, [None], name='source_sequence_length')
    target_len = tf.placeholder(tf.int32, [None], name='target_sequence_length')
    max_target_len = tf.reduce_max(target_len)
    
    return inputs, targets, learning_rate, keep_prob, target_len, max_target_len, source_len


### Process Decoder Input
Removes the last word id from each sequence in `target_data` and concatenates the `<GO>` id to the beginning of each sequence.

In [8]:
def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    """
    Preprocess target data for decoding
    :param target_data: Target Placeholder
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param batch_size: Batch Size
    :return: Preprocessed target data
    """
    # Drop the last word id from each sequence in the batch
    trimmed_batch = tf.strided_slice(target_data, [0,0], [batch_size, -1], [1, 1])

    # Create GO_ID data to add to the batches
    go_id = tf.fill([batch_size, 1], target_vocab_to_int["<GO>"])
    
    # Consolidate Go ID & trimmed batch
    processed = tf.concat([go_id, trimmed_batch], 1)
    
    return processed
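
The same transformation can be sketched in NumPy to show its effect on a small batch. The ids used here (`<GO>` = 1, `<EOS>` = 2, `<PAD>` = 0) are assumptions for illustration only and need not match the ids in `target_vocab_to_int`.

```python
import numpy as np

# Hypothetical ids: <GO> = 1, <EOS> = 2, <PAD> = 0
GO_ID = 1
target_batch = np.array([[5, 6, 7, 2],
                         [8, 9, 2, 0]])

# Drop the last id of every row
trimmed = target_batch[:, :-1]

# Prepend a column filled with the <GO> id
go_column = np.full((target_batch.shape[0], 1), GO_ID)
decoder_input = np.concatenate([go_column, trimmed], axis=1)

print(decoder_input)
# [[1 5 6 7]
#  [1 8 9 2]]
```

The decoder therefore sees each target sequence shifted right by one step, so at every position it is fed the previous word and asked to predict the current one.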

### Encoding
Embed encoding input and construct stacked LSTM cells with dropout.

In [9]:
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob,
                  source_sequence_length, source_vocab_size, 
                  encoding_embedding_size):
    """
    Create encoding layer
    :param rnn_inputs: Inputs for the RNN
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param keep_prob: Dropout keep probability
    :param source_sequence_length: a list of the lengths of each sequence in the batch
    :param source_vocab_size: vocabulary size of source data
    :param encoding_embedding_size: embedding size of source data
    :return: tuple (RNN output, RNN state)
    """
    
    # Creating the embedding
    embed = tf.contrib.layers.embed_sequence(rnn_inputs, 
                                            source_vocab_size,
                                            encoding_embedding_size)
    
    # LSTM Cell stack
    cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers)])
    
    # Wrapping multi-cell stack w. dropout
    drop = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob = keep_prob)
    
    # Create RNN output & state
    rnn_out, rnn_state = tf.nn.dynamic_rnn(drop, embed, 
                                           sequence_length = source_sequence_length, 
                                           dtype = tf.float32)
    
    return rnn_out, rnn_state

### Decoding - Training
Using the seq2seq functions in tf.contrib to obtain decoder outputs.

In [16]:
def decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                        target_sequence_length, max_summary_length,
                        output_layer, keep_prob):
    """
    Create a decoding layer for training
    :param encoder_state: Encoder State
    :param dec_cell: Decoder RNN Cell
    :param dec_embed_input: Decoder embedded input
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_summary_length: The length of the longest sequence in the batch
    :param output_layer: Function to apply the output layer
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing training logits and sample_id
    """
    
    # Helper for the training process
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs = dec_embed_input,
                                                       sequence_length = target_sequence_length,
                                                       time_major = False)
    
    # Basic Decoder
    training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, 
                                                      training_helper,
                                                      encoder_state,
                                                      output_layer)
    
    # Perform dynamic decoding
    training_output, state, seq_len = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                              impute_finished = True,
                                                              maximum_iterations = max_summary_length)
    
    return training_output

### Decoding - Inference
Again using the `tf.contrib.seq2seq` functions, this time to create the inference decoder and obtain its outputs.

In [22]:
def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings,
                        start_of_sequence_id, end_of_sequence_id,
                         max_target_sequence_length, vocab_size,
                        output_layer, batch_size, keep_prob):
    """
    Create a decoding layer for inference
    :param encoder_state: Encoder state
    :param dec_cell: Decoder RNN Cell
    :param dec_embeddings: Decoder embeddings
    :param start_of_sequence_id: GO ID
    :param end_of_sequence_id: EOS Id
    :param max_target_sequence_length: Maximum length of target sequences
    :param vocab_size: Size of decoder/target vocabulary
    :param output_layer: Function to apply the output layer
    :param batch_size: Batch size
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing inference logits and sample_id
    """
    #Start tokens
    start_tokens = tf.tile(tf.constant([start_of_sequence_id],
                                        dtype = tf.int32),
                                        [batch_size],
                                        name = 'start_tokens')
    
    # Helper for the inference process
    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings,
                                                               start_tokens,
                                                               end_of_sequence_id)
    
    # Basic decoder
    inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                       inference_helper,
                                                       encoder_state,
                                                       output_layer)
    
    # Perform dynamic decoding
    inference_output, state, seq_len = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                               impute_finished = True,
                                                               maximum_iterations = max_target_sequence_length)
    
    return inference_output
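
Greedy decoding, the strategy behind `GreedyEmbeddingHelper`, can be illustrated without TensorFlow. The sketch below is a simplification: `logits_table` is a hypothetical stand-in for the decoder cell plus output layer and depends only on the previous token, whereas the real decoder also carries the LSTM state from step to step. The token ids are assumptions as well.

```python
import numpy as np

GO_ID, EOS_ID = 0, 1  # hypothetical ids

# Hypothetical stand-in for "decoder cell + output layer":
# row i holds the logits produced after feeding token i.
logits_table = np.array([
    [0.1, 0.0, 2.0, 0.3],  # after <GO>  -> token 2 scores highest
    [0.0, 1.0, 0.0, 0.0],  # after <EOS> (never reached)
    [0.2, 0.1, 0.0, 1.5],  # after 2     -> token 3 scores highest
    [0.0, 3.0, 0.1, 0.2],  # after 3     -> <EOS> scores highest
])

def greedy_decode(max_steps):
    """Feed the most likely token back in until <EOS> or max_steps."""
    token, output = GO_ID, []
    for _ in range(max_steps):
        token = int(np.argmax(logits_table[token]))
        output.append(token)
        if token == EOS_ID:
            break
    return output

print(greedy_decode(max_steps=10))  # [2, 3, 1]
```

The `max_steps` argument plays the role of `maximum_iterations`: it bounds the output length when no `<EOS>` is produced.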

### Decoding Layer
Create the RNN layer for the decoder: embed the target sequences, construct the stacked decoder LSTM cells, and build the dense output layer.

In [23]:
def decoding_layer(dec_input, encoder_state,
                   target_sequence_length, max_target_sequence_length,
                   rnn_size,
                   num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    """
    Create decoding layer
    :param dec_input: Decoder input
    :param encoder_state: Encoder state
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_target_sequence_length: Maximum length of target sequences
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param target_vocab_size: Size of target vocabulary
    :param batch_size: The size of the batch
    :param keep_prob: Dropout keep probability
    :param decoding_embedding_size: Decoding embedding size
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    
    # Decoder embedding
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    
    # Construct Decoder cell
    def make_cell(rnn_size):
        dec_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                      initializer = tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        
        return dec_cell
    
    dec_cell = tf.contrib.rnn.MultiRNNCell([make_cell(rnn_size) for _ in range(num_layers)])
    
    # Dense layer to translate decoder output
    output_layer = Dense(target_vocab_size,
                        kernel_initializer = tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
    
    go_id = target_vocab_to_int['<GO>']
    eos_id = target_vocab_to_int['<EOS>']
    
    with tf.variable_scope('decode') as scope:
        # Call training decoder
        dec_train_out = decoding_layer_train(encoder_state,
                                            dec_cell,
                                            dec_embed_input,
                                            target_sequence_length,
                                            max_target_sequence_length,
                                            output_layer,
                                            keep_prob)
        
        # Reuse parameter variables for inference function call
        scope.reuse_variables()
        
        # Call to inference decoder
        dec_infer_out = decoding_layer_infer(encoder_state,
                                            dec_cell,
                                            dec_embeddings,
                                            go_id,
                                            eos_id,
                                            max_target_sequence_length,
                                            target_vocab_size,
                                            output_layer,
                                            batch_size,
                                            keep_prob)
        
        return dec_train_out, dec_infer_out

### Building the Network

In [24]:
def seq2seq_model(input_data, target_data, keep_prob, batch_size,
                  source_sequence_length, target_sequence_length,
                  max_target_sentence_length,
                  source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size,
                  rnn_size, num_layers, target_vocab_to_int):
    """
    Build the Sequence-to-Sequence part of the neural network
    :param input_data: Input placeholder
    :param target_data: Target placeholder
    :param keep_prob: Dropout keep probability placeholder
    :param batch_size: Batch Size
    :param source_sequence_length: Sequence Lengths of source sequences in the batch
    :param target_sequence_length: Sequence Lengths of target sequences in the batch
    :param max_target_sentence_length: Length of the longest target sequence in the batch
    :param source_vocab_size: Source vocabulary size
    :param target_vocab_size: Target vocabulary size
    :param enc_embedding_size: Encoder embedding size
    :param dec_embedding_size: Decoder embedding size
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    # Pass data through encoder
    ignore, enc_state = encoding_layer(input_data,
                                      rnn_size,
                                      num_layers,
                                      keep_prob,
                                      source_sequence_length,
                                      source_vocab_size,
                                      enc_embedding_size)
    
    # Prepare target sequences
    dec_input = process_decoder_input(target_data,
                                     target_vocab_to_int,
                                     batch_size)
    
    # Pass encoder state & inputs to decoders
    train_dec_out, infer_dec_out = decoding_layer(dec_input,
                                                 enc_state,
                                                 target_sequence_length,
                                                 max_target_sentence_length,
                                                 rnn_size,
                                                 num_layers,
                                                 target_vocab_to_int,
                                                 target_vocab_size,
                                                 batch_size,
                                                 keep_prob,
                                                 dec_embedding_size)
    
    return train_dec_out, infer_dec_out
    

## Network Training
### HyperParameters

In [25]:
# Number of Epochs
epochs = 4
# Batch Size
batch_size = 256
# RNN Size
rnn_size = 150
# Number of Layers
num_layers = 2
# Embedding Size
encoding_embedding_size = 150
decoding_embedding_size = 150
# Learning Rate
learning_rate = 0.01
# Dropout Keep Probability
keep_probability = 0.8
display_step = 50

### Building the graph

In [26]:
save_path = 'checkpoints/dev'
(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = helper.load_preprocess()
max_target_sentence_length = max([len(sentence) for sentence in source_int_text])

train_graph = tf.Graph()
with train_graph.as_default():
    input_data, targets, lr, keep_prob, target_sequence_length, max_target_sequence_length, source_sequence_length = model_inputs()

    #sequence_length = tf.placeholder_with_default(max_target_sentence_length, None, name='sequence_length')
    input_shape = tf.shape(input_data)

    train_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                   targets,
                                                   keep_prob,
                                                   batch_size,
                                                   source_sequence_length,
                                                   target_sequence_length,
                                                   max_target_sequence_length,
                                                   len(source_vocab_to_int),
                                                   len(target_vocab_to_int),
                                                   encoding_embedding_size,
                                                   decoding_embedding_size,
                                                   rnn_size,
                                                   num_layers,
                                                   target_vocab_to_int)


    training_logits = tf.identity(train_logits.rnn_output, name='logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')

    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
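
Two pieces of the graph above are easier to follow with numbers: the mask that weights the loss and the gradient clipping. The NumPy sketch below mirrors what `tf.sequence_mask` and `tf.clip_by_value` compute. The per-token loss values and gradients are invented, and the masked mean assumes the default averaging behaviour of `sequence_loss` (sum of weighted losses divided by the sum of weights).

```python
import numpy as np

def sequence_mask(lengths, max_len):
    """1.0 for real time steps, 0.0 for padding."""
    steps = np.arange(max_len)[None, :]
    return (steps < np.asarray(lengths)[:, None]).astype(np.float32)

masks = sequence_mask([3, 1], max_len=4)
print(masks)
# [[1. 1. 1. 0.]
#  [1. 0. 0. 0.]]

# Hypothetical per-token cross-entropy values for a batch of two sequences
token_losses = np.array([[0.5, 1.0, 1.5, 9.0],
                         [2.0, 9.0, 9.0, 9.0]], dtype=np.float32)

# Padded positions contribute nothing to the averaged loss
masked_mean = float((token_losses * masks).sum() / masks.sum())
print(masked_mean)  # 1.25

# Element-wise clipping to [-1, 1], as applied to each gradient tensor
grads = np.array([-3.0, 0.2, 1.7])
clipped = np.clip(grads, -1.0, 1.0)
print(clipped)  # [-1.   0.2  1. ]
```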


### Batch and Pad
Pad the sentences of each batch to a common length and yield the padded batches together with their sequence lengths.

In [27]:
def pad_sentence_batch(sentence_batch, pad_int):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [pad_int] * (max_sentence - len(sentence)) for sentence in sentence_batch]


def get_batches(sources, targets, batch_size, source_pad_int, target_pad_int):
    """Batch targets, sources, and the lengths of their sentences together"""
    for batch_i in range(0, len(sources)//batch_size):
        start_i = batch_i * batch_size

        # Slice the right amount for the batch
        sources_batch = sources[start_i:start_i + batch_size]
        targets_batch = targets[start_i:start_i + batch_size]

        # Pad
        pad_sources_batch = np.array(pad_sentence_batch(sources_batch, source_pad_int))
        pad_targets_batch = np.array(pad_sentence_batch(targets_batch, target_pad_int))

        # Need the lengths for the _lengths parameters
        pad_targets_lengths = []
        for target in pad_targets_batch:
            pad_targets_lengths.append(len(target))

        pad_source_lengths = []
        for source in pad_sources_batch:
            pad_source_lengths.append(len(source))

        yield pad_sources_batch, pad_targets_batch, pad_source_lengths, pad_targets_lengths
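
A small example shows what padding does to a batch. It also highlights a detail of `get_batches`: the lengths are measured after padding, so every sequence in a batch reports the padded length. The sketch below contrasts that with lengths taken before padding. `PAD_ID` and `pad_batch` are hypothetical names used only here.

```python
PAD_ID = 0  # hypothetical id for <PAD>

def pad_batch(batch, pad_int):
    """Right-pad every sentence to the length of the longest one."""
    max_len = max(len(s) for s in batch)
    return [s + [pad_int] * (max_len - len(s)) for s in batch]

batch = [[4, 5, 6], [7], [8, 9]]
padded = pad_batch(batch, PAD_ID)

true_lengths = [len(s) for s in batch]      # lengths before padding
padded_lengths = [len(s) for s in padded]   # what get_batches records

print(padded)          # [[4, 5, 6], [7, 0, 0], [8, 9, 0]]
print(true_lengths)    # [3, 1, 2]
print(padded_lengths)  # [3, 3, 3]
```

Feeding the padded lengths means the loss mask and the encoder treat `<PAD>` positions as real time steps. Passing the pre-padding lengths instead is a possible variation, not something this notebook does.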


### Training

In [28]:
def get_accuracy(target, logits):
    """
    Calculate accuracy
    """
    max_seq = max(target.shape[1], logits.shape[1])
    if max_seq - target.shape[1]:
        target = np.pad(
            target,
            [(0,0),(0,max_seq - target.shape[1])],
            'constant')
    if max_seq - logits.shape[1]:
        logits = np.pad(
            logits,
            [(0,0),(0,max_seq - logits.shape[1])],
            'constant')

    return np.mean(np.equal(target, logits))

# Split data to training and validation sets
train_source = source_int_text[batch_size:]
train_target = target_int_text[batch_size:]
valid_source = source_int_text[:batch_size]
valid_target = target_int_text[:batch_size]
(valid_sources_batch, valid_targets_batch, valid_sources_lengths, valid_targets_lengths ) = next(get_batches(valid_source,
                                                                                                             valid_target,
                                                                                                             batch_size,
                                                                                                             source_vocab_to_int['<PAD>'],
                                                                                                             target_vocab_to_int['<PAD>']))                                                                                                  
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(epochs):
        for batch_i, (source_batch, target_batch, sources_lengths, targets_lengths) in enumerate(
                get_batches(train_source, train_target, batch_size,
                            source_vocab_to_int['<PAD>'],
                            target_vocab_to_int['<PAD>'])):

            _, loss = sess.run(
                [train_op, cost],
                {input_data: source_batch,
                 targets: target_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths,
                 keep_prob: keep_probability})


            if batch_i % display_step == 0 and batch_i > 0:


                batch_train_logits = sess.run(
                    inference_logits,
                    {input_data: source_batch,
                     source_sequence_length: sources_lengths,
                     target_sequence_length: targets_lengths,
                     keep_prob: 1.0})


                batch_valid_logits = sess.run(
                    inference_logits,
                    {input_data: valid_sources_batch,
                     source_sequence_length: valid_sources_lengths,
                     target_sequence_length: valid_targets_lengths,
                     keep_prob: 1.0})

                train_acc = get_accuracy(target_batch, batch_train_logits)

                valid_acc = get_accuracy(valid_targets_batch, batch_valid_logits)

                print('Epoch {:>3} Batch {:>4}/{} - Train Accuracy: {:>6.4f}, Validation Accuracy: {:>6.4f}, Loss: {:>6.4f}'
                      .format(epoch_i, batch_i, len(source_int_text) // batch_size, train_acc, valid_acc, loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_path)
    print('Model Trained and Saved')

Epoch   0 Batch   50/538 - Train Accuracy: 0.4904, Validation Accuracy: 0.5202, Loss: 2.1273
Epoch   0 Batch  100/538 - Train Accuracy: 0.5494, Validation Accuracy: 0.5717, Loss: 1.1451
Epoch   0 Batch  150/538 - Train Accuracy: 0.5938, Validation Accuracy: 0.5985, Loss: 0.7500
Epoch   0 Batch  200/538 - Train Accuracy: 0.6260, Validation Accuracy: 0.6270, Loss: 0.5806
Epoch   0 Batch  250/538 - Train Accuracy: 0.6553, Validation Accuracy: 0.6731, Loss: 0.4781
Epoch   0 Batch  300/538 - Train Accuracy: 0.7169, Validation Accuracy: 0.7198, Loss: 0.3604
Epoch   0 Batch  350/538 - Train Accuracy: 0.8164, Validation Accuracy: 0.7795, Loss: 0.2760
Epoch   0 Batch  400/538 - Train Accuracy: 0.8374, Validation Accuracy: 0.8269, Loss: 0.1927
Epoch   0 Batch  450/538 - Train Accuracy: 0.8776, Validation Accuracy: 0.8716, Loss: 0.1479
Epoch   0 Batch  500/538 - Train Accuracy: 0.9171, Validation Accuracy: 0.8823, Loss: 0.0937
Epoch   1 Batch   50/538 - Train Accuracy: 0.9195, Validation Accuracy
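
The accuracy values in this log come from `get_accuracy`, which pads the shorter of the two arrays with zeros and then compares them position by position. The sketch below reproduces that logic on a tiny invented example (`padded_accuracy` is a hypothetical name). Zero-padding lines up with the padding id only if `<PAD>` maps to 0, which is assumed here.

```python
import numpy as np

def padded_accuracy(target, predictions):
    """Zero-pad both arrays to the same width, then compare element-wise."""
    max_seq = max(target.shape[1], predictions.shape[1])
    target = np.pad(target, [(0, 0), (0, max_seq - target.shape[1])], 'constant')
    predictions = np.pad(predictions,
                         [(0, 0), (0, max_seq - predictions.shape[1])],
                         'constant')
    return float(np.mean(np.equal(target, predictions)))

target = np.array([[5, 6, 7, 1]])
predictions = np.array([[5, 6, 1]])  # one token shorter than the target

acc = padded_accuracy(target, predictions)
print(acc)  # 0.5
```

Because the comparison is positional, a prediction that drops or inserts a single word is penalised at every following position as well.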

### Save Model

In [29]:
# Save parameters for checkpoint
helper.save_params(save_path)

Reload the vocabularies and the saved checkpoint path from file.

In [30]:
_, (source_vocab_to_int, target_vocab_to_int), (source_int_to_vocab, target_int_to_vocab) = helper.load_preprocess()
load_path = helper.load_params()

### Sentence to Sequence
Before a sentence can be fed into the model for translation, it must be converted to a sequence of word ids. The following function performs this conversion, mapping any word that is not in the vocabulary to the `<UNK>` id.

In [31]:
def sentence_to_seq(sentence, vocab_to_int):
    """
    Convert a sentence to a sequence of ids
    :param sentence: String
    :param vocab_to_int: Dictionary to go from the words to an id
    :return: List of word ids
    """
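    # Lowercase the sentence before looking up its words
    low_case = sentence.lower()
    
    # Map each word to its id, falling back to <UNK> for unknown words
    word_ids = [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in low_case.split()]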
    
    return word_ids
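
As a quick check of the intended behaviour (lowercase, split on whitespace, fall back to `<UNK>`), here is a self-contained sketch with a toy vocabulary. The ids and the name `to_ids` are hypothetical.

```python
# Toy vocabulary with made-up ids
toy_vocab = {'<UNK>': 2, 'he': 10, 'saw': 11, 'a': 12, 'truck': 13, '.': 14}

def to_ids(sentence, vocab_to_int):
    """Lowercase, split on whitespace, and map unknown words to <UNK>."""
    return [vocab_to_int.get(word, vocab_to_int['<UNK>'])
            for word in sentence.lower().split()]

ids = to_ids('He saw a PURPLE truck .', toy_vocab)
print(ids)  # [10, 11, 12, 2, 13, 14]
```

Here `He` matches `he` after lowercasing, while `purple` is not in the toy vocabulary and becomes the `<UNK>` id.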

### Translation
Testing the model's translation capabilities on an English sentence.

In [32]:
translate_sentence = 'he saw a old yellow truck .'

In [33]:
translate_sentence = sentence_to_seq(translate_sentence, source_vocab_to_int)

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(load_path + '.meta')
    loader.restore(sess, load_path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
    source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')

    translate_logits = sess.run(logits, {input_data: [translate_sentence]*batch_size,
                                         target_sequence_length: [len(translate_sentence)*2]*batch_size,
                                         source_sequence_length: [len(translate_sentence)]*batch_size,
                                         keep_prob: 1.0})[0]

print('Input')
print('  Word Ids:      {}'.format([i for i in translate_sentence]))
print('  English Words: {}'.format([source_int_to_vocab[i] for i in translate_sentence]))

print('\nPrediction')
print('  Word Ids:      {}'.format([i for i in translate_logits]))
print('  French Words: {}'.format(" ".join([target_int_to_vocab[i] for i in translate_logits])))

INFO:tensorflow:Restoring parameters from checkpoints/dev
Input
  Word Ids:      [198, 25, 28, 50, 147, 212, 115]
  English Words: ['he', 'saw', 'a', 'old', 'yellow', 'truck', '.']

Prediction
  Word Ids:      [270, 108, 70, 326, 317, 326, 236, 12, 134, 1]
  French Words: il a vu au rendre au printemps prochain . <EOS>


### Limitations

Because this model was trained on a limited vocabulary of roughly 227 unique English words, translation quality varies: sentences that resemble the training data translate better than others, and any word outside the vocabulary is replaced by `<UNK>`.