# Project4 - Language Translation
This project trains a sequence-to-sequence model on a dataset of paired English and French sentences. Once trained, the model attempts to translate new English sentences into French.

### Data Import 
The training data covers only a small subset of the vocabulary of each language.

In [1]:
import helper
import problem_unittests as tests

source_path = 'data/small_vocab_en.txt'
target_path = 'data/small_vocab_fr.txt'
source_text = helper.load_data(source_path)
target_text = helper.load_data(target_path)

A few printouts give some insight into the dataset.

In [2]:
import numpy as np

view_sentence_range = (0, 10)

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len(set(source_text.split()))))

sentences = source_text.split('\n')
word_counts = [len(sentence.split()) for sentence in sentences]
print('Number of sentences: {}'.format(len(sentences)))
print('Average number of words in a sentence: {}'.format(np.average(word_counts)))

print()
print('English sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(source_text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))
print()
print('French sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(target_text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))

Dataset Stats
Roughly the number of unique words: 227
Number of sentences: 137861
Average number of words in a sentence: 13.225277634719028

English sentences 0 to 10:
new jersey is sometimes quiet during autumn , and it is snowy in april .
the united states is usually chilly during july , and it is usually freezing in november .
california is usually quiet during march , and it is usually hot in june .
the united states is sometimes mild during june , and it is cold in september .
your least liked fruit is the grape , but my least liked is the apple .
his favorite fruit is the orange , but my favorite is the grape .
paris is relaxing during december , but it is usually chilly in july .
new jersey is busy during spring , and it is never hot in march .
our least liked fruit is the lemon , but my least liked is the grape .
the united states is sometimes busy during january , and it is sometimes warm in november .

French sentences 0 to 10:
new jersey est parfois calme pendant l' automne 

### Data Preprocessing
Convert the source and target text into lists of word ids that the network can consume. An `<EOS>` id is appended to the end of every target sentence.

In [3]:
def text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int):
    """
    Convert source and target text to proper word ids
    :param source_text: String that contains all the source text.
    :param target_text: String that contains all the target text.
    :param source_vocab_to_int: Dictionary to go from the source words to an id
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :return: A tuple of lists (source_id_text, target_id_text)
    """
    source_id_text = [[source_vocab_to_int[x] for x in y.split()] for y in source_text.split('\n')]
    target_id_text = [[target_vocab_to_int[a] for a in b.split()] for b in target_text.split('\n')]
    
    target_id_text = [entry + [target_vocab_to_int['<EOS>']] for entry in target_id_text]
    
    return source_id_text, target_id_text
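
As a quick sanity check, the conversion can be tried on a toy corpus. The sketch below assumes two tiny hand-made vocabularies (`toy_source_vocab` and `toy_target_vocab`), which are illustrative and not the ones produced by `helper`. It repeats the same logic as `text_to_ids` under a different name so that it runs on its own.

```python
# Toy vocabularies, made up for illustration only.
toy_source_vocab = {'new': 0, 'jersey': 1, 'is': 2, 'quiet': 3}
toy_target_vocab = {'<EOS>': 0, 'new': 1, 'jersey': 2, 'est': 3, 'calme': 4}


def toy_text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int):
    """Same logic as text_to_ids, repeated so the example is self-contained."""
    source_ids = [[source_vocab_to_int[word] for word in line.split()]
                  for line in source_text.split('\n')]
    target_ids = [[target_vocab_to_int[word] for word in line.split()]
                  for line in target_text.split('\n')]
    # Mark the end of every target sentence with the <EOS> id.
    target_ids = [ids + [target_vocab_to_int['<EOS>']] for ids in target_ids]
    return source_ids, target_ids


toy_source_ids, toy_target_ids = toy_text_to_ids(
    'new jersey is quiet\njersey is quiet',
    'new jersey est calme\njersey est calme',
    toy_source_vocab, toy_target_vocab)

print(toy_source_ids)  # [[0, 1, 2, 3], [1, 2, 3]]
print(toy_target_ids)  # [[1, 2, 3, 4, 0], [2, 3, 4, 0]]
```

Only the target sequences receive the `<EOS>` id; the source sequences are left unchanged.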

Use the helper to preprocess the data and save it to a file.

In [4]:
helper.preprocess_and_save_data(source_path, target_path, text_to_ids)

Load the preprocessed data from disk.

In [5]:
(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = helper.load_preprocess()

## Building the Neural Network

In [6]:
import tensorflow as tf
from tensorflow.python.layers.core import Dense

### Input
The following function creates all the placeholders the network needs.

In [7]:
def model_inputs():
    """
    Create TF placeholders for input, targets, learning rate, keep probability, and the lengths of the source and target sequences.
    :return: Tuple (input, targets, learning rate, keep probability, target sequence length,
    max target sequence length, source sequence length)
    """
    
    inputs = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    learning_rate = tf.placeholder(tf.float32, name='learning_rate')
    source_len = tf.placeholder(tf.int32, [None], name='source_sequence_length')
    target_len = tf.placeholder(tf.int32, [None], name='target_sequence_length')
    max_target_len = tf.reduce_max(target_len)
    
    return inputs, targets, learning_rate, keep_prob, target_len, max_target_len, source_len
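
The `[None, None]` shapes leave both the batch size and the sequence length dynamic, so every batch fed to `input` and `targets` has to be padded into a rectangular array, with the true lengths supplied separately. A minimal NumPy sketch of preparing such a batch follows. The function `pad_batch` and the pad id of 0 are hypothetical choices made for illustration and are not part of the project's helper.

```python
import numpy as np


def pad_batch(id_sequences, pad_id=0):
    """Pad variable-length id lists to a rectangle and return the true lengths.

    pad_batch and pad_id=0 are illustrative assumptions.
    """
    lengths = np.array([len(seq) for seq in id_sequences], dtype=np.int32)
    padded = np.full((len(id_sequences), lengths.max()), pad_id, dtype=np.int32)
    for row, seq in enumerate(id_sequences):
        padded[row, :len(seq)] = seq
    return padded, lengths


batch, lengths = pad_batch([[5, 6, 7], [8, 9]])
print(batch.shape)    # (2, 3)
print(lengths)        # [3 2]
print(lengths.max())  # 3, the value max_target_len would take for this batch
```

Here `batch` is the kind of array the `targets` placeholder expects, and `lengths` is what `target_sequence_length` would be fed.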


### Process Decoder Input
Removes the last word id from each sequence in `target_data` and concatenates the `<GO>` id to the beginning of each sequence.

In [12]:
def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    """
    Preprocess target data for decoding
    :param target_data: Target Placeholder
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param batch_size: Batch Size
    :return: Preprocessed target data
    """
    # Drop the last word id from each sequence in the batch
    trimmed_batch = tf.strided_slice(target_data, [0,0], [batch_size, -1], [1, 1])

    # Create GO_ID data to add to the batches
    go_id = tf.fill([batch_size, 1], target_vocab_to_int["<GO>"])
    
    # Consolidate Go ID & trimmed batch
    processed = tf.concat([go_id, trimmed_batch], 1)
    
    return processed
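
The effect of the slice and the concatenation is easier to see on concrete numbers. The NumPy sketch below mirrors the two steps on a small batch. The ids chosen for `<GO>` and `<EOS>` are assumptions for illustration; the real values come from `target_vocab_to_int`.

```python
import numpy as np

GO_ID = 1   # assumed id for <GO>, illustrative only
EOS_ID = 2  # assumed id for <EOS>, illustrative only

# Two target sequences of word ids, each ending in <EOS>.
target_batch = np.array([[10, 11, 12, EOS_ID],
                         [20, 21, 22, EOS_ID]])

# Drop the last column, then prepend a column of <GO> ids.
trimmed = target_batch[:, :-1]
go_column = np.full((target_batch.shape[0], 1), GO_ID)
decoder_input = np.concatenate([go_column, trimmed], axis=1)

print(decoder_input)
# [[ 1 10 11 12]
#  [ 1 20 21 22]]
```

The result is the target shifted right by one position: at each step the decoder receives the previous target word as input and is trained to predict the current one.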

### Encoding
Embed encoding input and construct stacked LSTM cells with dropout.

In [15]:
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob,
                  source_sequence_length, source_vocab_size, 
                  encoding_embedding_size):
    """
    Create encoding layer
    :param rnn_inputs: Inputs for the RNN
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param keep_prob: Dropout keep probability
    :param source_sequence_length: a list of the lengths of each sequence in the batch
    :param source_vocab_size: vocabulary size of source data
    :param encoding_embedding_size: embedding size of source data
    :return: tuple (RNN output, RNN state)
    """
    
    # Creating the embedding
    embed = tf.contrib.layers.embed_sequence(rnn_inputs, 
                                            source_vocab_size,
                                            encoding_embedding_size)
    
    # LSTM Cell stack
    cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers)])
    
    # Wrap the multi-cell stack with dropout
    drop = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob = keep_prob)
    
    # Create RNN output & state
    rnn_out, rnn_state = tf.nn.dynamic_rnn(drop, embed, 
                                           sequence_length = source_sequence_length, 
                                           dtype = tf.float32)
    
    return rnn_out, rnn_state
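
Conceptually, the embedding step maps every word id to one row of a trainable matrix. The NumPy sketch below illustrates that lookup and the shape it produces. The sizes and the random matrix are made up for illustration, and the sketch does not reproduce how `embed_sequence` initializes or trains its weights.

```python
import numpy as np

rng = np.random.default_rng(0)

source_vocab_size = 6        # illustrative sizes only
encoding_embedding_size = 4

# One row per word id; in the real model these weights are learned.
embedding_matrix = rng.normal(size=(source_vocab_size, encoding_embedding_size))

# A batch of 2 sequences with 3 word ids each.
rnn_inputs = np.array([[0, 2, 5],
                       [1, 1, 3]])

# Integer-array indexing performs the lookup for every id at once.
embedded = embedding_matrix[rnn_inputs]

print(embedded.shape)  # (2, 3, 4)
```

The embedded batch has shape (batch size, sequence length, embedding size), which is the batch-major layout `tf.nn.dynamic_rnn` expects by default.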

### Decoding - Training
Use the seq2seq functions in `tf.contrib` to obtain the decoder outputs during training.

In [19]:
def decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                        target_sequence_length, max_summary_length,
                        output_layer, keep_prob):
    """
    Create a decoding layer for training
    :param encoder_state: Encoder State
    :param dec_cell: Decoder RNN Cell
    :param dec_embed_input: Decoder embedded input
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_summary_length: The length of the longest sequence in the batch
    :param output_layer: Function to apply the output layer
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing training logits and sample_id
    """
    
    # Helper for the training process
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs = dec_embed_input,
                                                       sequence_length = target_sequence_length,
                                                       time_major = False)
    
    # Basic Decoder
    training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, 
                                                      training_helper,
                                                      encoder_state,
                                                      output_layer)
    
    # Perform dynamic decoding
    training_output, state = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                              impute_finished = True,
                                                              maximum_iterations = max_summary_length)
    
    return training_output
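
With `impute_finished = True`, `dynamic_decode` zeroes the outputs of batch entries that have already finished, so padded steps do not carry stale values. The effect on the outputs can be pictured with a NumPy mask. The sketch below illustrates the idea with made-up shapes and is not the library's implementation.

```python
import numpy as np

# Pretend decoder outputs: batch of 2, 4 time steps, 3 logits per step.
outputs = np.ones((2, 4, 3))
target_sequence_length = np.array([4, 2])  # second sequence ends after 2 steps

# mask[b, t] is True while step t is still inside sequence b.
steps = np.arange(outputs.shape[1])
mask = steps[None, :] < target_sequence_length[:, None]

# Zero every step that lies past the end of its sequence.
masked_outputs = outputs * mask[:, :, None]

print(masked_outputs[1, :, 0])  # [1. 1. 0. 0.]
```

The first sequence fills all four steps and is left untouched; the second is zeroed from its third step onward.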

### Decoding - Inference
Again use the `tf.contrib.seq2seq` functions, this time to create the inference decoder and obtain its outputs.