# An introduction to sequence-to-sequence learning in Keras

#### Word-embedding-based version, following Chollet's tutorial

https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py

Note to self: See my local version for working notes.

This code is based on Chollet's character-based seq2seq tutorial, which ends with a section giving guidance and code for a word-embedding version.

I've also referenced a couple of Brownlee's tutorials.

[How to Use Word Embedding Layers for Deep Learning with Keras](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)

Also his tutorial on text preparation:

[How to Prepare Text Data for Deep Learning with Keras](https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/)

From the Chollet tutorial post:

>#### Data download

>- English to French sentence pairs: 
http://www.manythings.org/anki/fra-eng.zip

>- Lots of neat sentence pairs datasets can be found at:
http://www.manythings.org/anki/

>#### References
>- Sequence to Sequence Learning with Neural Networks
    https://arxiv.org/abs/1409.3215
>- Learning Phrase Representations using
    RNN Encoder-Decoder for Statistical Machine Translation
    https://arxiv.org/abs/1406.1078

In [1]:
# Imports
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding
from keras.preprocessing.text import text_to_word_sequence, Tokenizer
from keras.preprocessing.sequence import pad_sequences

import numpy as np

Using TensorFlow backend.


In [2]:
# Model configuration
batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
data_path = 'data/fra-eng/fra.txt'

#### Vectorize the data (word embedding and one-hot encoding)

In [5]:
# Text preparation
# For source and target texts:
# Split phrases into word sequences, filtering punctuation.
# Collect unique vocabularies.

# Raw text lists, kept for diagnostics; the Tokenizer below produces the actual sequences.
input_texts = []
target_texts = []

with open(data_path, 'r', encoding='utf-8') as f: # ensures the file is closed.
    lines = f.read().split('\n') # split doc into lines at newline.
for line in lines[: min(num_samples, len(lines) - 1)]: # parse no more than num_samples lines.
    # Separate source and target phrases. Taking only the first two fields
    # also tolerates files that carry an extra tab-separated column.
    input_text, target_text = line.split('\t')[:2]
    # Replace some observed unicode spaces (from the French) with plain spaces,
    # and add a space before [.?!] so they survive splitting as separate tokens.
    input_text = input_text.lower().replace(u"\xa0", u" ").replace(u"\u202f", u" ").replace('.', ' .').replace('?', ' ?').replace('!', ' !')
    target_text = target_text.lower().replace(u"\xa0", u" ").replace(u"\u202f", u" ").replace('.', ' .').replace('?', ' ?').replace('!', ' !')
    # collect
    input_texts.append(input_text)
    target_texts.append(target_text)                   
    
# Tokenizer integer sequences
filters = '"#$%&()*+,-/:;<=>@[\\]^_`{|}~' # punctuation to strip; [.!?] are kept as tokens

input_text_tokr = Tokenizer(lower=True, filters=filters)
input_text_tokr.fit_on_texts(input_texts)
input_texts_seq = input_text_tokr.texts_to_sequences(input_texts)
max_encoder_seq_length = max([len(seq) for seq in input_texts_seq])
encoder_input_data = pad_sequences(
    input_texts_seq, max_encoder_seq_length, padding='post')

target_text_tokr = Tokenizer(lower=True, filters=filters)
target_text_tokr.fit_on_texts(target_texts)
target_texts_seq = target_text_tokr.texts_to_sequences(target_texts)
max_decoder_seq_length = max([len(seq) for seq in target_texts_seq])
decoder_input_data = pad_sequences(
    target_texts_seq, max_decoder_seq_length, padding='post') 

# vocabulary sizes (+1 because index 0 is reserved and never assigned to a word)
num_encoder_tokens = len(input_text_tokr.word_index.keys()) + 1
num_decoder_tokens = len(target_text_tokr.word_index.keys()) + 1
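
To make the cell above concrete, here is a minimal pure-Python sketch of what the tokenizing and padding steps produce. It is a simplified stand-in for `Tokenizer` and `pad_sequences`, not their actual implementation: the helper names `build_word_index`, `to_sequences`, and `pad_post` are hypothetical, and it assumes plain whitespace splitting with indices assigned by descending word frequency, starting at 1 so that 0 stays free for padding.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]


def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen`."""
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]


# Toy phrases, already lowercased and with a space before the punctuation.
texts = ["go .", "run !", "who is it ?", "go on ."]

word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)
max_len = max(len(seq) for seq in sequences)
padded = pad_post(sequences, max_len)
vocab_size = len(word_index) + 1  # +1 for the reserved index 0

for row in padded:
    print(row)
print("vocab_size:", vocab_size)
```

Because no word is ever assigned index 0, a 0 in a padded row can only mean "no token here", which is why the vocabulary sizes are `len(word_index) + 1`.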


In [6]:
# one-hot decoder_target_data initialization
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

The indexing used to fill `decoder_target_data` in the following cell may need some experimentation if performance is poor.

In [7]:
# Populate decoder_target_data with one-hot vectors, one timestep ahead of decoder_input_data
for i, seq in enumerate(target_texts_seq):
    for t, idx in enumerate(seq):
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # Q: What about start and stop tokens? How were they used in char-based version?
            # Tokenizer.word_index starts at 1 because index 0 is reserved (it is the padding value here),
            # so idx indexes the one-hot axis directly; no idx-1 shift is needed.
            decoder_target_data[i, t - 1, idx] = 1.
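
On the start/stop question in the cell above: in the character-based version, each target text is wrapped in a start character (`'\t'`) and an end character (`'\n'`), and the target is the decoder input shifted left by one step. A minimal sketch of the same pattern at word level follows. The marker ids `SOS` and `EOS` and the helpers `shift_for_teacher_forcing` and `one_hot` are hypothetical and not part of this notebook; in practice the markers would have to be strings that survive the `Tokenizer` filters.

```python
# Hypothetical reserved ids for this sketch; real word ids would start at 3.
PAD, SOS, EOS = 0, 1, 2


def shift_for_teacher_forcing(word_ids, max_len):
    """Return (decoder_input, decoder_target), both padded to max_len.

    decoder_target[t] is the token to predict after seeing decoder_input[:t + 1].
    """
    wrapped = [SOS] + list(word_ids) + [EOS]
    decoder_input = wrapped[:-1]   # SOS w1 ... wn
    decoder_target = wrapped[1:]   # w1 ... wn EOS
    pad = [PAD] * (max_len - len(decoder_input))
    return decoder_input + pad, decoder_target + pad


def one_hot(ids, depth):
    """One-hot encode ids; padding positions stay all-zero rows."""
    rows = []
    for idx in ids:
        row = [0.0] * depth
        if idx != PAD:
            row[idx] = 1.0
        rows.append(row)
    return rows


dec_in, dec_tgt = shift_for_teacher_forcing([5, 7, 4], max_len=6)
target_one_hot = one_hot(dec_tgt, depth=8)

print("decoder input: ", dec_in)
print("decoder target:", dec_tgt)
```

Without such markers, the loop above never asks the model to predict the first target word, and the model gets no signal for where a translation should stop.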

In [226]:
#help(target_text_tokr.word_index)

In [227]:
# diagnostic
#help(target_text_tokr.texts_to_sequences)
#help(pad_sequences)

In [228]:
# Diagnostic
#target_texts_seq
#print(type(input_texts_seq))
#print(type(input_texts_seq[0]))
#input_texts_seq

In [229]:
# diagnostic
#print(encoder_input_data)

In [8]:
# diagnostic
print('num_encoder_tokens:', num_encoder_tokens)
print('num_decoder_tokens:', num_decoder_tokens)
print('max_decoder_seq_length:', max_decoder_seq_length)

num_encoder_tokens: 2165
num_decoder_tokens: 4253
max_decoder_seq_length: 11
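
These sizes determine how large the one-hot `decoder_target_data` array is. A rough back-of-the-envelope calculation, assuming all `num_samples = 10000` lines were read and `float32` storage (4 bytes per element):

```python
# Values taken from the configuration cell and the diagnostic output above.
num_samples = 10000
max_decoder_seq_length = 11
num_decoder_tokens = 4253
bytes_per_float32 = 4

n_elements = num_samples * max_decoder_seq_length * num_decoder_tokens
n_bytes = n_elements * bytes_per_float32
n_gib = n_bytes / 1024 ** 3

print(f"{n_elements:,} elements, {n_bytes:,} bytes, about {n_gib:.2f} GiB")
# prints: 467,830,000 elements, 1,871,320,000 bytes, about 1.74 GiB
```

If that is too much memory, one option in Keras is the `sparse_categorical_crossentropy` loss, which takes integer targets instead of one-hot vectors. That change has not been tried in this notebook.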


In [231]:
# diagnostic

# A dictionary of words and their counts.
#print('input_text_tokr.word_counts')
#print(input_text_tokr.word_counts)
#print()

# Number of documents processed.
#print('input_text_tokr.document_count')
#print(input_text_tokr.document_count)
#print()

# A dictionary of words and their uniquely assigned integers.
# This is what I want for embedding. I need to convert each phrase using this dict.
# And I need to create a reverse dict.
#print('input_text_tokr.word_index')
#print(input_text_tokr.word_index)
#print()

#print('target_text_tokr.word_index')
#print(target_text_tokr.word_index)
#print()

#print(input_text_tokr.filters)

# A dictionary of words and the number of documents each appeared in.
#print('input_text_tokr.word_docs')
#print(input_text_tokr.word_docs)
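
The reverse dictionary mentioned in the comments above can be built by swapping the keys and values of `word_index`. A minimal sketch, using a small hand-written `word_index` in place of the fitted `Tokenizer`'s; the names `reverse_word_index` and `sequence_to_text` are hypothetical:

```python
# Stand-in for input_text_tokr.word_index (word -> integer, starting at 1).
word_index = {"go": 1, ".": 2, "run": 3, "!": 4}

# Swap keys and values to get integer -> word.
reverse_word_index = {idx: word for word, idx in word_index.items()}


def sequence_to_text(seq, reverse_index):
    """Turn a padded integer sequence back into a phrase, skipping 0 padding."""
    return " ".join(reverse_index[idx] for idx in seq if idx != 0)


text = sequence_to_text([3, 4, 0, 0], reverse_word_index)
print(text)  # prints: run !
```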


In [232]:
# diagnostic
# Tokenizer word_index dictionary
#input_token_index

In [233]:
# diagnostic
# Tokenizer word_index dictionary
#target_token_index

In [234]:
#input_texts[-30:-1]

In [235]:
#input_texts_seq[-30:-1]

In [236]:
#target_texts

In [237]:
#target_texts_seq

### Encoder

I'm following the code in the blog post, but unlike the character-based version it isn't provided as a .py file that is known to run. If I have trouble with it, I'll refer back to the character-based version.

In [9]:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None,)) 
x_e = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x_e, state_h, state_c = LSTM(latent_dim, return_state=True)(x_e)
encoder_states = [state_h, state_c] # The decoder will work with these.


### Decoder

In [10]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,)) 
x_d = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x_d = LSTM(latent_dim, return_sequences=True)(x_d, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x_d)


### Model

In [11]:
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
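
As a sanity check on the model just defined, its trainable parameter count can be worked out by hand and compared against `model.summary()`. The sketch below assumes the vocabulary sizes printed earlier (2165 and 4253), `latent_dim = 256`, and the standard LSTM layout of four gates with biases; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


num_encoder_tokens = 2165
num_decoder_tokens = 4253
latent_dim = 256

counts = {
    "encoder embedding": embedding_params(num_encoder_tokens, latent_dim),
    "encoder lstm": lstm_params(latent_dim, latent_dim),
    "decoder embedding": embedding_params(num_decoder_tokens, latent_dim),
    "decoder lstm": lstm_params(latent_dim, latent_dim),
    "decoder dense": dense_params(latent_dim, num_decoder_tokens),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>18}: {n:>9,}")
print(f"{'total':>18}: {total:>9,}")
```

Over half of the parameters sit in the decoder's embedding and output layers, both of which scale with `num_decoder_tokens`.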

In [None]:
# Compile and run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# Note that `decoder_target_data` needs to be one-hot encoded,
# rather than sequences of integers like `decoder_input_data`!
# RW: Does decoder_target_data still need to be ahead of 
# decoder_input_data by one timestep? He doesn't say so, but I will 
# assume that the pattern is still required. 
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)


In [None]:
# Save model
model.save('s2s.h5')

#### Everything below is out of sync