# Basic Seq2Seq
This is a basic seq2seq implementation to show what can be done for conversational models.  The task we'll train it on is predicting company responses to consumers.

This notebook shows how to prepare the data and construct the Keras model, but it will not train quickly!  Instead, it demonstrates how the network progresses toward natural responses, and allows replying to arbitrary text, as shown below.  Unfortunately, reaching interesting results takes longer than an hour on Kaggle's non-GPU notebooks, so you'll need to download the notebook and run it on your own machine.

This configuration tops out at a test loss of ~1.8, and provides nuanced responses to some requests, like "[the I problem](http://www.refinery29.com/2017/11/179790/ios-11-1-bug-keyboard-problem)" for @AppleSupport, after around 6 hours of training on a CUDA 5.0 GPU.

![seq2seq model architecture](https://i.imgur.com/JmuryKu.png)

In [1]:
import re
import random
import time

print('Library versions:')

import keras
print(f'keras:{keras.__version__}')
#import pandas as pd
#print(f'pandas:{pd.__version__}')
import sklearn
print(f'sklearn:{sklearn.__version__}')
import nltk
print(f'nltk:{nltk.__version__}')
import numpy as np
print(f'numpy:{np.__version__}')

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import casual_tokenize

#from tqdm import tqdm_notebook as tqdm # Special jupyter notebook progress bar 💫

Library versions:


  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


keras:2.1.4
sklearn:0.19.1
nltk:3.2.5
numpy:1.14.2


## Model Parameters

In [113]:
# 16384 (2**14) - large enough for demonstration, larger values make network training slower
MAX_VOCAB_SIZE = 2**14
# seq2seq generally relies on fixed length message vectors - longer messages provide more info
# but result in slower training and larger networks
MAX_MESSAGE_LEN = 100  
# Embedding size for words - gives a trade off between expressivity of words and network size
EMBEDDING_SIZE = 100
# Embedding size for whole messages, same trade off as word embeddings
CONTEXT_SIZE = 100
# Larger batch sizes generally reach the average response faster, but small batch sizes are
# required for the model to learn nuanced responses.  Also, GPU memory limits max batch size.
BATCH_SIZE = 4
# Helps regularize network and prevent overfitting.
DROPOUT = 0.2
# High learning rate helps model reach average response faster, but can make it hard to 
# converge on nuanced responses
LEARNING_RATE = 0.005

# Tokens needed for seq2seq
UNK = 1  # words that aren't found in the vocab
PAD = 0  # after message has finished, this fills all remaining vector positions
START = 2  # provided to the model at position 0 for every response predicted

# Implementation detail for allowing this to be run in Kaggle's notebook hardware
SUB_BATCH_SIZE = 100
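
To see why training is chunked into sub-batches, it helps to estimate how large the one-hot targets for a single sub-batch would be, since the model predicts a distribution over the whole vocab at every position. The sketch below is plain arithmetic. The 4 bytes per value is an assumption (32-bit floats), and `one_hot_bytes` is a hypothetical helper that is not part of this notebook.

```python
# Rough size of the one-hot targets for one sub-batch.
# Assumption: each entry is stored as a 4-byte (32-bit) float.
MAX_VOCAB_SIZE = 2**14
MAX_MESSAGE_LEN = 100
SUB_BATCH_SIZE = 100
BYTES_PER_VALUE = 4  # assumed dtype size, adjust if targets are stored differently

def one_hot_bytes(n_messages, message_len, vocab_size, bytes_per_value):
    """ Bytes needed for a dense (n_messages, message_len, vocab_size) array. """
    return n_messages * message_len * vocab_size * bytes_per_value

size = one_hot_bytes(SUB_BATCH_SIZE, MAX_MESSAGE_LEN, MAX_VOCAB_SIZE, BYTES_PER_VALUE)
print(f'{size / 2**20:.0f} MiB per sub-batch of targets')
```

Under these assumptions a single sub-batch of 100 messages already needs 625 MiB for its targets, which is why `SUB_BATCH_SIZE` stays small.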


## Data Prep
Here, we'll prepare the data for training our seq2seq model, including:

- Replace screen names with `@__sn__` token to show model the commonality between them
- Build a vocab to turn tokens into integers suitable for our seq2seq model
- Tokenize input and target text into fixed size vectors
- Partition our dataset into train and test sets

### Data Loading and Reshaping
Pulled from [this kernel](https://www.kaggle.com/soaxelbrooke/first-inbound-and-response-tweets).

### Tokenizing and Vocab Build

We'll use NLTK's `casual_tokenize`, which handles a lot of corner cases found in social media data ("casual" text data), along with scikit-learn's `CountVectorizer`.  We won't use `CountVectorizer` to count anything; it just serves as a convenient vocabulary builder, which we'll apply with functions that turn text into "word indexes" (integers that represent each word) and back.

In [None]:
count_vec = CountVectorizer(tokenizer=casual_tokenize, max_features=MAX_VOCAB_SIZE - 3)
print("Fitting CountVectorizer on X and Y text data...")
count_vec.fit(x_text + y_text)
analyzer = count_vec.build_analyzer()
vocab = {k: v + 3 for k, v in count_vec.vocabulary_.items()}
vocab['__unk__'] = UNK
vocab['__pad__'] = PAD
vocab['__start__'] = START
# Used to turn seq2seq predictions into human readable strings
reverse_vocab = {v: k for k, v in vocab.items()}
print(f"Learned vocab of {len(vocab)} items.")
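
The same idea can be sketched without scikit-learn or NLTK, which may make the index bookkeeping easier to follow. This is a simplified stand-in, not the notebook's method: `collections.Counter` replaces `CountVectorizer`, whitespace splitting replaces `casual_tokenize`, and `build_vocab` and the toy texts are hypothetical. Note that this sketch assigns indexes by frequency, whereas `CountVectorizer` orders its vocabulary differently.

```python
from collections import Counter

PAD, UNK, START = 0, 1, 2  # reserved indexes, matching the parameters above
N_RESERVED = 3

def build_vocab(texts, max_vocab_size):
    """ Maps the most frequent tokens to integers, leaving 0-2 for special tokens. """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    most_common = counts.most_common(max_vocab_size - N_RESERVED)
    vocab = {tok: i + N_RESERVED for i, (tok, _) in enumerate(most_common)}
    vocab.update({'__pad__': PAD, '__unk__': UNK, '__start__': START})
    return vocab

toy_texts = ['my phone is broken', 'my order is late', 'phone is fine']
toy_vocab = build_vocab(toy_texts, max_vocab_size=6)
# Reverse mapping, used to turn predicted indexes back into words.
toy_reverse_vocab = {idx: tok for tok, idx in toy_vocab.items()}
print(toy_vocab)
```

With a cap of 6 entries, only the three most frequent words survive; every other word would later map to `UNK`.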

In [3]:
import pickle

with open('./data/embeddings.pkl', 'rb') as fp:
    our_embedding, idx2word, word2idx = pickle.load(fp)

In [4]:
word2idx

{'ad': 3,
 'sales': 4,
 'boost': 5,
 'time': 6,
 'warner': 7,
 'profit': 8,
 'dollar': 9,
 'gains': 10,
 'on': 11,
 'greenspan': 12,
 'speech': 13,
 'yukos': 14,
 'unit': 15,
 'buyer': 16,
 'faces': 17,
 'loan': 18,
 'claim': 19,
 'high': 20,
 'fuel': 21,
 'prices': 22,
 'hit': 23,
 'ba': 24,
 "'": 25,
 's': 26,
 'profits': 27,
 'pernod': 28,
 'takeover': 29,
 'talk': 30,
 'lifts': 31,
 'domecq': 32,
 'japan': 33,
 'narrowly': 34,
 'escapes': 35,
 'recession': 36,
 'jobs': 37,
 'growth': 38,
 'still': 39,
 'slow': 40,
 'in': 41,
 'the': 42,
 'us': 43,
 'india': 44,
 'calls': 45,
 'for': 46,
 'fair': 47,
 'trade': 48,
 'rules': 49,
 'ethiopia': 50,
 'crop': 51,
 'production': 52,
 'up': 53,
 '24': 54,
 '%': 55,
 'court': 56,
 'rejects': 57,
 '$': 58,
 'tobacco': 59,
 'case': 60,
 'ask': 61,
 'jeeves': 62,
 'tips': 63,
 'online': 64,
 'revival': 65,
 'indonesians': 66,
 'face': 67,
 'price': 68,
 'rise': 69,
 'peugeot': 70,
 'deal': 71,
 'boosts': 72,
 'mitsubishi': 73,
 'telegraph': 74,

In [5]:
idx2word

{3: 'ad',
 4: 'sales',
 5: 'boost',
 6: 'time',
 7: 'warner',
 8: 'profit',
 9: 'dollar',
 10: 'gains',
 11: 'on',
 12: 'greenspan',
 13: 'speech',
 14: 'yukos',
 15: 'unit',
 16: 'buyer',
 17: 'faces',
 18: 'loan',
 19: 'claim',
 20: 'high',
 21: 'fuel',
 22: 'prices',
 23: 'hit',
 24: 'ba',
 25: "'",
 26: 's',
 27: 'profits',
 28: 'pernod',
 29: 'takeover',
 30: 'talk',
 31: 'lifts',
 32: 'domecq',
 33: 'japan',
 34: 'narrowly',
 35: 'escapes',
 36: 'recession',
 37: 'jobs',
 38: 'growth',
 39: 'still',
 40: 'slow',
 41: 'in',
 42: 'the',
 43: 'us',
 44: 'india',
 45: 'calls',
 46: 'for',
 47: 'fair',
 48: 'trade',
 49: 'rules',
 50: 'ethiopia',
 51: 'crop',
 52: 'production',
 53: 'up',
 54: '24',
 55: '%',
 56: 'court',
 57: 'rejects',
 58: '$',
 59: 'tobacco',
 60: 'case',
 61: 'ask',
 62: 'jeeves',
 63: 'tips',
 64: 'online',
 65: 'revival',
 66: 'indonesians',
 67: 'face',
 68: 'price',
 69: 'rise',
 70: 'peugeot',
 71: 'deal',
 72: 'boosts',
 73: 'mitsubishi',
 74: 'telegraph',

In [6]:
our_embedding[word2idx['fuel']]

array([-7.0881e-01,  8.4256e-01,  6.8714e-01, -6.9636e-01,  3.0655e-01,
       -1.5754e+00, -7.9960e-04,  5.8261e-01,  3.3351e-01,  1.2829e+00,
       -1.5290e-01,  2.5007e-01, -3.5430e-01,  9.4143e-02, -4.1984e-01,
       -8.2855e-01, -2.9726e-01,  1.2291e-01,  4.0926e-01, -2.7786e-01,
        6.6257e-01, -5.2272e-01,  3.9382e-01, -8.7131e-02,  1.7506e-01,
        1.5439e-01, -1.2746e+00,  1.8409e-01,  5.7692e-02,  7.1842e-01,
       -4.3843e-01, -1.1107e-02, -1.2128e+00, -8.3243e-02, -1.7166e-01,
        6.2765e-01,  9.7759e-01, -3.5805e-03, -1.4293e-01,  3.5790e-01,
       -4.0073e-01, -1.0119e+00, -2.8738e-01, -2.1578e-01,  1.0012e+00,
       -8.4636e-03, -1.3581e-01, -7.0063e-01, -4.2808e-01, -7.4352e-01,
        4.2888e-01,  5.0330e-01, -8.1019e-01,  1.1835e+00,  4.1573e-01,
       -1.5368e+00, -2.2367e-01, -3.9961e-02,  2.0855e+00, -2.6799e-01,
        5.2742e-01, -3.9748e-01, -4.4802e-03,  6.5526e-01,  7.1186e-01,
        2.7329e-01,  6.6516e-01, -1.0332e+00,  4.4366e-02, -5.38

In [7]:
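# Register the special tokens in both lookup tables so decoding never hits a missing key.
for token, idx in (('__unk__', UNK), ('__pad__', PAD), ('__start__', START)):
    word2idx[token] = idx
    idx2word[idx] = token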

In [100]:
with open('data/xy.pkl', 'rb') as fp:
    x, y = pickle.load(fp)

In [101]:
y

[[3,
  4,
  5,
  6,
  7,
  8,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [9,
  10,
  11,
  12,
  13,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,


In [102]:
x, y = np.array(x), np.array(y)

In [103]:
y

array([[   3,    4,    5, ...,    0,    0,    0],
       [   9,   10,   11, ...,    0,    0,    0],
       [  14,   15,   16, ...,    0,    0,    0],
       ...,
       [1232,    1,  292, ...,    0,    0,    0],
       [ 184,  167, 1233, ...,    0,    0,    0],
       [1067, 1068, 1234, ...,    0,    0,    0]])

### Vocab Helper Functions
These helper functions take strings and turn them into word indexes used by the actual seq2seq models.  This turns something like "This is how we do it." into a padded array of integers, like [153, 4, 643, 48, 94, 54, 8, 0, 0, 0].  We'll apply the `to_word_idx` function to our text data to get our `N x MAX_MESSAGE_LEN` training/test data.

In [104]:
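import nltk

def to_word_idx(sentence):
    """ Tokenizes a sentence into word indexes, padded or truncated to MAX_MESSAGE_LEN. """
    full_length = [word2idx.get(tok, UNK) for tok in nltk.word_tokenize(sentence)] + [PAD] * MAX_MESSAGE_LEN
    return full_length[:MAX_MESSAGE_LEN]

def from_word_idx(word_idxs):
    """ Turns word indexes back into a string, skipping padding. """
    # Fall back to '__unk__' for indexes missing from idx2word instead of raising KeyError.
    return ' '.join(idx2word.get(idx, '__unk__') for idx in word_idxs if idx != PAD).strip()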
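
A round trip through helpers like these can be checked on a toy vocabulary. The sketch below assumes a hypothetical six-entry vocab, a message length of 8, and whitespace splitting in place of the NLTK tokenizer; the `toy_` names are illustrative only.

```python
PAD, UNK = 0, 1
TOY_MESSAGE_LEN = 8  # hypothetical, much shorter than MAX_MESSAGE_LEN

toy_word2idx = {'__pad__': PAD, '__unk__': UNK, 'my': 3, 'phone': 4, 'is': 5, 'broken': 6}
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

def toy_to_word_idx(sentence):
    # Whitespace split stands in for a real tokenizer here.
    idxs = [toy_word2idx.get(tok, UNK) for tok in sentence.lower().split()]
    # Pad generously, then cut to the fixed length.
    return (idxs + [PAD] * TOY_MESSAGE_LEN)[:TOY_MESSAGE_LEN]

def toy_from_word_idx(word_idxs):
    return ' '.join(toy_idx2word[idx] for idx in word_idxs if idx != PAD)

encoded = toy_to_word_idx('My phone is totally broken')
decoded = toy_from_word_idx(encoded)
print(encoded)  # [3, 4, 5, 1, 6, 0, 0, 0]
print(decoded)  # my phone is __unk__ broken
```

The word "totally" is not in the toy vocab, so it comes back as `__unk__`: the round trip is lossy for out-of-vocab words by design.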


In [62]:
# Make sure our helpers work as expected...
x_text.head().apply(to_word_idx).apply(from_word_idx)

NameError: name 'x_text' is not defined

[3,
 4,
 5,
 6,
 7,
 8,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

### Train / Test Split
Here, we split our data into training and test sets.  For simplicity, we use a random split, which may result in different distributions between the training and test set, but we won't worry about that for this case.

In [105]:
all_idx = list(range(len(x)))
train_idx = set(random.sample(all_idx, int(0.8 * len(all_idx))))
test_idx = {idx for idx in all_idx if idx not in train_idx}

train_x = x[list(train_idx)]
test_x = x[list(test_idx)]
train_y = y[list(train_idx)]
test_y = y[list(test_idx)]

assert train_x.shape == train_y.shape
assert test_x.shape == test_y.shape

print(f'Training data of shape {train_x.shape} and test data of shape {test_x.shape}.')

Training data of shape (408, 100) and test data of shape (102, 100).
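
If a repeatable split is wanted, one option is to seed the shuffle. The sketch below is a variation on the cell above rather than the notebook's code: `split_indexes` is a hypothetical helper, and the seed and sorted index lists are choices made here for reproducibility.

```python
import random

def split_indexes(n_items, train_fraction=0.8, seed=0):
    """ Shuffles 0..n_items-1 with a fixed seed and cuts it into train/test index lists. """
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(train_fraction * n_items)
    return sorted(idxs[:cut]), sorted(idxs[cut:])

train_idx, test_idx = split_indexes(510)
print(len(train_idx), len(test_idx))  # 408 102
```

Calling `split_indexes` again with the same seed returns the same two lists, so results can be compared across runs.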


In [106]:
train_y[0].reshape(6).tolist()

ValueError: cannot reshape array of size 100 into shape (6,)

## Model Creation
We'll create and compile the model here.  It will consist of the following components:

- Shared word embeddings
  - A shared embedding layer that turns word indexes (a sparse representation) into a dense/compressed representation.  This embeds both the request from the customer, and also the last words uttered by the model that are fed back into the model.
- Encoder RNN
  - In this case, a single LSTM layer.  This encodes the whole input sentence into a context vector (or thought vector) that represents completely what the customer is saying, and produces a single output.
- Decoder RNN
  - This RNN (also an LSTM in this case) decodes the context vector into a string of tokens/utterances.  For each time step, it takes the context vector and the embedded last utterance and produces the next utterance, which is fed back into the model.  More complex and effective models copy the encoder state into the decoder, add more layers of LSTMs, and apply attention mechanisms - but these are out of the scope of this simple example.
- Next Word Dense+Softmax
  - These two layers take the decoder output and turn it into the next word to be uttered.  The dense layer allows the decoder to not map directly to words uttered, and the softmax turns the dense layer output into a probability distribution, from which we pick the most likely next word.
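
The last step in this list, turning scores into a next word, can be illustrated in plain Python. The vocab and scores below are made up for illustration; in the real model the scores come from the dense layer and the vocab has `MAX_VOCAB_SIZE` entries.

```python
import math

def softmax(scores):
    """ Turns arbitrary real scores into probabilities that sum to 1. """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

toy_vocab = ['__pad__', '__unk__', '__start__', 'hi', 'thanks', 'sorry']
toy_scores = [0.1, -1.0, -2.0, 1.5, 0.3, 2.2]  # made-up dense layer outputs

probs = softmax(toy_scores)
# Greedy choice: take the index with the highest probability.
next_word = toy_vocab[max(range(len(probs)), key=probs.__getitem__)]
print(next_word)  # sorry
```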

![seq2seq model structure](https://i.imgur.com/JmuryKu.png)

In [107]:
# keras imports, because there are like... A million of them.
from keras.models import Model
from keras.optimizers import Adam
from keras.layers import Dense, Input, LSTM, Dropout, Embedding, RepeatVector, concatenate, \
    TimeDistributed
from keras.utils import np_utils

In [108]:
def create_model():
    shared_embedding = Embedding(
        output_dim=EMBEDDING_SIZE,
        input_dim=MAX_VOCAB_SIZE,
        input_length=MAX_MESSAGE_LEN,
        name='embedding',
    )
    
    # ENCODER
    
    encoder_input = Input(
        shape=(MAX_MESSAGE_LEN,),
        dtype='int32',
        name='encoder_input',
    )
    
    embedded_input = shared_embedding(encoder_input)
    
    # No return_sequences - since the encoder here only produces a single value for the
    # input sequence provided.
    encoder_rnn = LSTM(
        CONTEXT_SIZE,
        name='encoder',
        dropout=DROPOUT
    )
    
    context = RepeatVector(MAX_MESSAGE_LEN)(encoder_rnn(embedded_input))
    
    # DECODER
    
    last_word_input = Input(
        shape=(MAX_MESSAGE_LEN, ),
        dtype='int32',
        name='last_word_input',
    )
    
    embedded_last_word = shared_embedding(last_word_input)
    # Combines the context produced by the encoder and the last word uttered as inputs
    # to the decoder.
    decoder_input = concatenate([embedded_last_word, context], axis=2)
    
    # return_sequences causes LSTM to produce one output per timestep instead of one at the
    # end of the input, which is important for sequence producing models.
    decoder_rnn = LSTM(
        CONTEXT_SIZE,
        name='decoder',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    decoder_output = decoder_rnn(decoder_input)
    
    # TimeDistributed allows the dense layer to be applied to each decoder output per timestep
    next_word_dense = TimeDistributed(
        Dense(int(MAX_VOCAB_SIZE / 2), activation='relu'),
        name='next_word_dense',
    )(decoder_output)
    
    next_word = TimeDistributed(
        Dense(MAX_VOCAB_SIZE, activation='softmax'),
        name='next_word_softmax'
    )(next_word_dense)
    
    return Model(inputs=[encoder_input, last_word_input], outputs=[next_word])

s2s_model = create_model()
optimizer = Adam(lr=LEARNING_RATE, clipvalue=5.0)
# Pass the configured optimizer instance so LEARNING_RATE and clipvalue take effect.
s2s_model.compile(optimizer=optimizer, loss='categorical_crossentropy')
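
To get a feel for where the weights in this architecture live, the layer sizes can be estimated by hand. This sketch uses the standard parameter-count formulas for LSTM and dense layers together with the parameter values from this notebook; `lstm_params` and `dense_params` are hypothetical helpers, and `s2s_model.summary()` remains the authoritative source.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input/output pair plus one bias per output.
    return input_dim * units + units

MAX_VOCAB_SIZE, EMBEDDING_SIZE, CONTEXT_SIZE = 2**14, 100, 100

param_counts = {
    'embedding': MAX_VOCAB_SIZE * EMBEDDING_SIZE,
    'encoder': lstm_params(EMBEDDING_SIZE, CONTEXT_SIZE),
    # The decoder sees the embedded last word concatenated with the context.
    'decoder': lstm_params(EMBEDDING_SIZE + CONTEXT_SIZE, CONTEXT_SIZE),
    'next_word_dense': dense_params(CONTEXT_SIZE, MAX_VOCAB_SIZE // 2),
    'next_word_softmax': dense_params(MAX_VOCAB_SIZE // 2, MAX_VOCAB_SIZE),
}
for name, count in param_counts.items():
    print(f'{name:>18}: {count:>12,}')
print(f'{"total":>18}: {sum(param_counts.values()):>12,}')
```

By this estimate the final softmax layer holds about 134 million of roughly 137 million weights, so `MAX_VOCAB_SIZE` is the setting that matters most for model size.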

## Model Training
We'll train the model here.  After each sub-batch of the dataset, we'll test with static input strings to see how the model is progressing in human readable terms.  It's important to have these tests along with traditional model evaluation to provide a better understanding of how well the model is training.

It's also important to pull test strings from the real distribution of the data.  It can be hard to really put yourself in customers' shoes when writing test messages, and you will get non-representative results when you provide test examples that don't fit the true distribution of the input data (when your input text doesn't sound like real customer requests).

In [109]:
def add_start_token(y_array):
    """ Adds the start token to vectors.  Used for training data. """
    return np.hstack([
        START * np.ones((len(y_array), 1)),
        y_array[:, :-1],
    ])

def binarize_labels(labels):
    """ Helper function that turns integer word indexes into sparse binary matrices for 
        the expected model output.
    """
    return np.array([np_utils.to_categorical(row, num_classes=MAX_VOCAB_SIZE)
                     for row in labels])
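
What `add_start_token` does to a single message is easiest to see on a short list. During training the decoder is shown the correct previous word at each position, a setup commonly called teacher forcing. The sketch below uses plain lists and a hypothetical `shift_right` helper instead of NumPy arrays.

```python
START, PAD = 2, 0

def shift_right(target_row):
    """ Decoder input for one message: START, then the target minus its last token. """
    return [START] + target_row[:-1]

target = [3, 4, 5, 6, PAD, PAD]   # what the model should output at each position
decoder_in = shift_right(target)  # what the decoder is shown at each position

for position, (seen, expected) in enumerate(zip(decoder_in, target)):
    print(f'position {position}: sees {seen}, should predict {expected}')
```

At position 0 the decoder sees `START` and should predict the first word; at every later position it sees the word it should have produced one step earlier.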

In [110]:
def respond_to(model, text):
    """ Helper function that takes a text input and provides a text output. """
    input_y = add_start_token(PAD * np.ones((1, MAX_MESSAGE_LEN)))
    idxs = np.array(to_word_idx(text)).reshape((1, MAX_MESSAGE_LEN))
    for position in range(MAX_MESSAGE_LEN - 1):
        prediction = model.predict([idxs, input_y]).argmax(axis=2)[0]
        input_y[:,position + 1] = prediction[position]
    return from_word_idx(model.predict([idxs, input_y]).argmax(axis=2)[0])
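
The feedback loop in `respond_to` is a greedy decoder: it fills in the decoder input one position at a time with the model's own best guess. The sketch below reproduces that loop with a made-up stand-in for `model.predict`, so it runs without Keras. `toy_predict`, its counting rule, and `greedy_decode` are all hypothetical.

```python
START, PAD = 2, 0
TOY_MESSAGE_LEN = 6

def toy_predict(decoder_input):
    """ Stand-in for model.predict: returns one predicted index per position.
        Made-up rule: after token t comes t + 1, until 5, then padding.
    """
    return [tok + 1 if PAD < tok < 5 else PAD for tok in decoder_input]

def greedy_decode(predict):
    decoder_input = [START] + [PAD] * (TOY_MESSAGE_LEN - 1)
    for position in range(TOY_MESSAGE_LEN - 1):
        prediction = predict(decoder_input)
        # Feed the word predicted at this position back in as the next input.
        decoder_input[position + 1] = prediction[position]
    return predict(decoder_input)

print(greedy_decode(toy_predict))  # [3, 4, 5, 0, 0, 0]
```

As in `respond_to`, every step reruns the model over the full sequence, so one reply costs as many prediction calls as there are positions in a message.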

In [111]:
def train_mini_epoch(model, start_idx, end_idx):
    """ Batching seems necessary in Kaggle Jupyter Notebook environments, since
        `model.fit` seems to freeze on larger batches (somewhere 1k-10k).
    """
    b_train_y = binarize_labels(train_y[start_idx:end_idx])
    input_train_y = add_start_token(train_y[start_idx:end_idx])
    
    model.fit(
        [train_x[start_idx:end_idx], input_train_y], 
        b_train_y,
        epochs=1,
        batch_size=BATCH_SIZE,
    )
    
    rand_idx = random.sample(list(range(len(test_x))), SUB_BATCH_SIZE)
    print('Test results:', model.evaluate(
        [test_x[rand_idx], add_start_token(test_y[rand_idx])],
        binarize_labels(test_y[rand_idx])
    ))
    
    input_strings = [
        "@AppleSupport I fix I this I stupid I problem I",
        "@AmazonHelp I hadnt expected that such a big brand like amazon would have such a poor customer service.",
    ]
    
    for input_string in input_strings:
        output_string = respond_to(model, input_string)
        print(f'> "{input_string}"\n< "{output_string}"')


### Train the model!

You can stop training by pressing the stop button; the training code catches the `KeyboardInterrupt` exception triggered that way.  Otherwise, it runs until the configured time limit below.
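
This stopping logic can be sketched in isolation. The helper `run_with_time_limit` and its injectable `clock` argument are hypothetical additions that make the pattern testable with a fake clock; the training cell itself uses `time.time()` directly.

```python
import time

class TimesUpInterrupt(Exception):
    pass

def run_with_time_limit(step, n_steps, time_limit_seconds, clock=time.time):
    """ Calls step(i) until n_steps are done, the time limit passes, or the user interrupts. """
    stop_after = clock() + time_limit_seconds
    completed = 0
    try:
        for i in range(n_steps):
            step(i)
            completed += 1
            # The limit is only checked between steps, so a step is never cut short.
            if clock() > stop_after:
                raise TimesUpInterrupt
    except KeyboardInterrupt:
        print('Halting from keyboard interrupt.')
    except TimesUpInterrupt:
        print('Halting after reaching the time limit.')
    return completed

# Fake clock that advances one "second" per call, so the example runs instantly.
ticks = iter(range(1000))
done = run_with_time_limit(lambda i: None, n_steps=50, time_limit_seconds=3,
                           clock=lambda: next(ticks))
print(done)  # 4
```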


Let's start the training! 🚀

In [114]:
training_time_limit = 360 * 60  # seconds (6 hours; Kaggle notebooks terminate after 1 hour)
start_time = time.time()
stop_after = start_time + training_time_limit

class TimesUpInterrupt(Exception):
    pass

try:
    for epoch in range(100):
        print(f'Training in epoch {epoch}...')
        for start_idx in range(0, len(train_x), SUB_BATCH_SIZE):
            train_mini_epoch(s2s_model, start_idx, start_idx + SUB_BATCH_SIZE)
            if time.time() > stop_after:
                raise TimesUpInterrupt
except KeyboardInterrupt:
    print("Halting training from keyboard interrupt.")
except TimesUpInterrupt:
    print(f"Halting after {time.time() - start_time} seconds spent training.")

Training in epoch 0...


ValueError: Error when checking input: expected encoder_input to have shape (30,) but got array with shape (100,)

In [35]:
respond_to(s2s_model, '''@AppleSupport iPhone 8 touchID doesnt unlock while charging on 
    110v w/ 61w laptop charger to usbc lightning cable just uh.. so you guys know''')

ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 81))



KeyboardInterrupt: 

In [None]:
respond_to(s2s_model, '''@sprintcare I can't make calls... wtf''')