# Custom Conversational Model

In this notebook, I implement a basic seq2seq conversational chatbot. The work is similar to [the Deep Q&A project](https://github.com/Conchylicultor/DeepQA), but I use [Keras](https://keras.io/) as the neural network framework because it is cleaner and easier to work with than the underlying TensorFlow. For the dataset, I chose the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), which provides enough dialogues to build a simple chit-chat system.

<img src="img/rawpixel-633846-unsplash.jpg" alt="Building a Bot" width="400">

This chatbot implementation can be broken down into five steps:
<ol>
      <li>Read text data</li>
      <li>Transform the texts into a format suitable for training</li>
      <li>Build a basic seq2seq model</li>
      <li>Train the model by using prepared dialogue data</li>
      <li>Evaluate the trained model</li>
</ol>

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2'

import numpy as np
import yaml, pickle

The following cell loads the important parameters from a YAML configuration file.

In [2]:
config_name = 'config02'

with open('{}.yml'.format(config_name)) as config_file:
    # safe_load needs no explicit Loader and does not construct arbitrary Python objects
    configs = yaml.safe_load(config_file)

MAX_LEN = configs['params']['sequence']['max_len']
START_TOKEN = configs['params']['sequence']['start_token']
KEPT_SYMBOLS = configs['params']['sequence']['kept_symbols']

NUM_WORDS = configs['params']['tokenizer']['num_words']    
OOV_TOKEN = configs['params']['tokenizer']['oov_token']
LOWER = configs['params']['tokenizer']['lower']
FILTERS = configs['params']['tokenizer']['filters']

EMBEDDING_DIM = configs['params']['model']['embedding_dim']
STATE_DIM = configs['params']['model']['state_dim']

BATCH_SIZE = configs['params']['training']['batch_size']
EPOCHS = configs['params']['training']['epochs']
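The configuration file itself is not shown in this notebook. The sketch below reproduces only the key layout implied by the lookups above, as a plain Python dictionary. Values marked "assumed" are placeholders rather than the settings actually used; the others are inferred from outputs printed later in the notebook. The helper `get_param` is hypothetical.

```python
# Hypothetical stand-in for the parsed contents of config02.yml.
# The key layout follows the lookups in the cell above; values marked
# "assumed" are placeholders, not the settings used for the results here.
configs = {
    'params': {
        'sequence': {
            'max_len': 15,                         # matches the 15-entry padded sample printed later
            'start_token': '<start>',              # matches the printed decoder inputs
            'kept_symbols': ['.', ',', '?', '!'],  # assumed
        },
        'tokenizer': {
            'num_words': None,     # assumed: None keeps the full vocabulary
            'oov_token': '<unk>',  # assumed
            'lower': False,        # assumed
            'filters': '',         # assumed
        },
        'model': {
            'embedding_dim': 300,  # matches the model summary
            'state_dim': 512,      # matches the model summary
        },
        'training': {
            'batch_size': 64,      # assumed
            'epochs': 100,         # the training note mentions 100 epochs
        },
    }
}


def get_param(cfg, section, key):
    """Return cfg['params'][section][key], naming the missing entry on failure."""
    try:
        return cfg['params'][section][key]
    except KeyError as err:
        raise KeyError('missing config entry params.{}.{}'.format(section, key)) from err


print(get_param(configs, 'sequence', 'max_len'))
print(get_param(configs, 'model', 'state_dim'))
```

A helper like this is optional; its only benefit over direct indexing is a clearer error when a key is absent from the file.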

## Read Texts

The original dataset has been processed by [ParlAI](http://www.parl.ai/), resulting in ready-to-use data files in a tab-separated format.

In [3]:
# paths to the data txt files on disk
train_path = '../data/CornellMovie/train.txt'
valid_path = '../data/CornellMovie/valid.txt'
test_path = '../data/CornellMovie/test.txt'

In [4]:
def process_symbol(texts):
    if not KEPT_SYMBOLS:
        return texts
    for s in KEPT_SYMBOLS:
        texts = [text.replace(s, ' {} '.format(s)) for text in texts]        
    return texts

def read_text(text_path):
    input_texts = []
    target_texts = []

    with open(text_path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')

    for line in lines:
        if not line:
            continue
        
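        # drop the leading index token (everything up to the first space)
        line = line[line.find(' ')+1:]
        
        # keep only well-formed "input<TAB>target" pairs
        fields = line.split('\t')
        if len(fields) != 2:
            continue
            
        input_text, target_text = fields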
        
        input_texts.append(input_text)
        target_texts.append(target_text)
        
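    # save the raw pairs as parallel files, using the same encoding as the source
    with open(text_path.replace('.txt', '_inputs.txt'), 'w', encoding='utf-8') as f:
        f.write('\n'.join(input_texts))
              
    with open(text_path.replace('.txt', '_targets.txt'), 'w', encoding='utf-8') as f:
        f.write('\n'.join(target_texts))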
        
    input_texts = process_symbol(input_texts)
    target_texts = process_symbol(target_texts)

    print("{}".format(text_path))
    for i in range(min(3, len(input_texts))):
        print("Input_{}: {}".format(i, input_texts[i]))
        print("Target_{}: {}".format(i, target_texts[i]))
    print()            
    
    return input_texts, target_texts
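To make the per-line handling concrete, here is a minimal standalone sketch of the two text operations used above: stripping the leading index and splitting on the tab, then spacing out kept symbols. The sample line is made up, and its layout (an index, a space, then `input<TAB>target`) is an assumption based on how `read_text` slices each line. The names `parse_line` and `space_symbols` are hypothetical.

```python
def parse_line(line):
    """Return (input_text, target_text), or None if the line is malformed."""
    # Drop the leading index token, i.e. everything up to the first space.
    line = line[line.find(' ') + 1:]
    fields = line.split('\t')
    if len(fields) != 2:
        return None
    return fields[0], fields[1]


def space_symbols(text, symbols):
    """Surround each kept symbol with spaces so it becomes its own token."""
    for s in symbols:
        text = text.replace(s, ' {} '.format(s))
    return text


# Made-up sample in the assumed "index input<TAB>target" layout.
sample = "1 What's your name again?\tForget it."
pair = parse_line(sample)
print(pair)
print(repr(space_symbols(pair[0], ['?', '.'])))
```

Spacing out the symbols matters because the tokenizer splits on whitespace: without it, `again?` and `again` would count as two different words.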

Let's read and take a glimpse at the data.

In [5]:
train_text_inputs, train_text_targets = read_text(train_path)
valid_text_inputs, valid_text_targets = read_text(valid_path)
test_text_inputs, test_text_targets = read_text(test_path)

../data/CornellMovie/train.txt
Input_0: You're asking me out .   That's so cute .  What's your name again ? 
Target_0: Forget it . 
Input_1: No ,  no ,  it's my fault -- we didn't have a proper introduction ---
Target_1: Cameron . 
Input_2: The thing is ,  Cameron -- I'm at the mercy of a particularly hideous breed of loser .   My sister .   I can't date until she does . 
Target_2: Seems like she could get a date easy enough .  .  . 

../data/CornellMovie/valid.txt
Input_0: Can we make this quick ?   Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad .   Again . 
Target_0: Well ,  I thought we'd start with pronunciation ,  if that's okay with you . 
Input_1: Not the hacking and gagging and spitting part .   Please . 
Target_1: Okay .  .  .  then how 'bout we try out some French cuisine .   Saturday ?   Night ? 
Input_2: How do you get your hair to look like that ? 
Target_2: Eber's Deep Conditioner every two days .  And I never ,  ever u

With teacher forcing, the decoder input is the target reply shifted right by one position, so it begins with the start token.

In [6]:
def insert_start(texts):
    return [START_TOKEN + ' ' + text for text in texts]

train_text_targets_with_start = insert_start(train_text_targets)
valid_text_targets_with_start = insert_start(valid_text_targets)
test_text_targets_with_start = insert_start(test_text_targets)

print(train_text_targets_with_start[0])
print(valid_text_targets_with_start[0])
print(test_text_targets_with_start[0])

<start> Forget it . 
<start> Well ,  I thought we'd start with pronunciation ,  if that's okay with you . 
<start> You're sweet . 
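The relation between the decoder input and the target can be sketched with plain token lists. This is an illustration only: the notebook works with integer ids, where the padding id 0 plays the role of the `<pad>` placeholder used here, and `teacher_forcing_pair` is a hypothetical helper.

```python
START = '<start>'


def teacher_forcing_pair(reply_tokens, max_len, pad='<pad>'):
    """Build the (decoder_input, target) pair for one reply.

    At step t the decoder reads decoder_input[t] and is trained to
    predict target[t], i.e. the token one position further along.
    """
    decoder_input = ([START] + reply_tokens)[:max_len]
    target = reply_tokens[:max_len]
    decoder_input += [pad] * (max_len - len(decoder_input))
    target += [pad] * (max_len - len(target))
    return decoder_input, target


dec_in, tgt = teacher_forcing_pair(['Forget', 'it', '.'], max_len=5)
for step, (seen, expected) in enumerate(zip(dec_in, tgt)):
    print(step, seen, '->', expected)
```

After the last real token, the expected output is padding, which is why the decoding loop later in the notebook treats index 0 as the end of a reply.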


## Preprocessing

The raw texts must be tokenized and vectorized before they enter the seq2seq network. Keras provides text processing modules that make this kind of task easier.

In [7]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=NUM_WORDS, oov_token=OOV_TOKEN, lower=LOWER, filters=FILTERS)
tokenizer.fit_on_texts(train_text_inputs + train_text_targets_with_start)

with open('../models/tokenizer_{}.pkl'.format(config_name), 'wb') as handle:
    pickle.dump(tokenizer, handle)

Using TensorFlow backend.


The tokenizer is now fitted on the training texts and can be used to transform texts into the integer sequences that a seq2seq model consumes.

In [8]:
word2index = tokenizer.word_index
index2word = dict(map(reversed, tokenizer.word_index.items()))

def transform_text(texts):
    sequences = tokenizer.texts_to_sequences(texts)
    data = pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post')
    return data
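As a rough mental model of the word-to-index mapping, the plain-Python sketch below mimics how I understand the Keras `Tokenizer` to assign ids: the OOV token gets index 1, remaining words are numbered by descending frequency, and index 0 is never assigned to a word. Lowercasing and character filters are left out, and `build_word_index` and `texts_to_ids` are hypothetical names, not Keras functions.

```python
from collections import Counter


def build_word_index(texts, oov_token='<unk>'):
    """Map words to ids: OOV token first, then words by descending count."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count; ties keep first-seen order.
    words = [oov_token] + [w for w, _ in counts.most_common()]
    # Ids start at 1, so 0 stays free for padding.
    return {w: i + 1 for i, w in enumerate(words)}


def texts_to_ids(texts, word_index, num_words=None, oov_token='<unk>'):
    """Convert texts to id lists; unknown or too-rare words map to the OOV id."""
    oov = word_index[oov_token]
    result = []
    for text in texts:
        ids = []
        for w in text.split():
            i = word_index.get(w, oov)
            if num_words is not None and i >= num_words:
                i = oov
            ids.append(i)
        result.append(ids)
    return result


word_index = build_word_index(['a b a', 'a c'])
print(word_index)
print(texts_to_ids(['a d c'], word_index))
print(texts_to_ids(['a d c'], word_index, num_words=3))
```

In this toy case the largest id equals `len(word_index)`, so a lookup table covering every id plus the padding id 0 needs `len(word_index) + 1` rows.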

In [9]:
train_data_inputs = transform_text(train_text_inputs)
valid_data_inputs = transform_text(valid_text_inputs)
test_data_inputs = transform_text(test_text_inputs)

train_data_targets_with_start = transform_text(train_text_targets_with_start)
valid_data_targets_with_start = transform_text(valid_text_targets_with_start)
test_data_targets_with_start = transform_text(test_text_targets_with_start)

train_data_targets = transform_text(train_text_targets)[:, :, np.newaxis]
valid_data_targets = transform_text(valid_text_targets)[:, :, np.newaxis]
test_data_targets = transform_text(test_text_targets)[:, :, np.newaxis]

In [10]:
print(train_text_inputs[7])
print(train_data_inputs[7])

#print(train_text_targets[7])
#print(train_data_targets[7])

#print(train_text_targets_with_start[7])
#print(train_data_targets_with_start[7])

How is our little Find the Wench A Date plan progressing ? 
[   93    18   143   127  1991     7 33323   113 10791   737 19670     4
     0     0     0]
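The trailing zeros in the output above come from post-padding. A minimal sketch of what `padding='post', truncating='post'` amounts to for a single sequence follows; `pad_post` is a hypothetical stand-in, not the Keras function.

```python
def pad_post(seq, max_len, value=0):
    """Cut the sequence to max_len from the end, then pad at the end."""
    seq = list(seq)[:max_len]
    return seq + [value] * (max_len - len(seq))


print(pad_post([93, 18, 143], 5))           # shorter than max_len: zeros appended
print(pad_post([1, 2, 3, 4, 5, 6, 7], 5))   # longer than max_len: tail dropped
```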


## Building a Basic Seq2Seq Model

A basic seq2seq model comprises two RNNs, an encoder and a decoder. The encoder reads the chat text and passes the information it gathers to the decoder as its final state. The decoder processes this information and generates a reply. The model can be illustrated as follows.

<img src="img/seq2seq.png" alt="A Basic Seq2Seq Model">

This concept can be translated into code with the Keras functional API. The implementation here is essentially a modification of the code from [the official Keras blog](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html).

In [11]:
from keras.models import Model
from keras.layers import Input, Embedding, CuDNNLSTM, Dense

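# Keras word indices start at 1 and index 0 is reserved for padding,
# so the full vocabulary needs len(word2index) + 1 embedding rows
vocab_size = NUM_WORDS if NUM_WORDS else len(word2index) + 1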

# shared embedder
embedding_layer = Embedding(vocab_size, EMBEDDING_DIM, name='embedding_layer')

# encoder
encoder_inputs = Input(shape=(None,), name='encoder_inputs')
encoder_rnn = CuDNNLSTM(STATE_DIM, return_state=True, name='encoder_rnn')

x = embedding_layer(encoder_inputs)
encoder_outputs, state_h, state_c = encoder_rnn(x)
encoder_states = [state_h, state_c]

# decoder
decoder_inputs = Input(shape=(None,), name='decoder_inputs')
decoder_rnn = CuDNNLSTM(STATE_DIM, return_sequences=True, return_state=True, name='decoder_rnn')
decoder_dense = Dense(vocab_size, activation='softmax', name='decoder_dense')

y = embedding_layer(decoder_inputs)
y, _, _ = decoder_rnn(y, initial_state=encoder_states)
decoder_outputs = decoder_dense(y)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_inputs (InputLayer)     (None, None)         0                                            
__________________________________________________________________________________________________
encoder_inputs (InputLayer)     (None, None)         0                                            
__________________________________________________________________________________________________
embedding_layer (Embedding)     (None, None, 300)    18329700    encoder_inputs[0][0]             
                                                                 decoder_inputs[0][0]             
__________________________________________________________________________________________________
encoder_rnn (CuDNNLSTM)         [(None, 512), (None, 1667072     embedding_layer[0][0]            
__________
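The parameter counts in the summary can be checked by hand. The sketch below assumes that the vocabulary size is 18,329,700 / 300 = 61,099 (inferred from the embedding row of the summary) and that `CuDNNLSTM` keeps two bias vectors per gate, which is consistent with the 1,667,072 reported for the encoder. The function names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # one embedding vector per vocabulary entry
    return vocab_size * embedding_dim


def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, output_dim):
    # weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim


embedding_dim, state_dim = 300, 512
vocab_size = 18329700 // embedding_dim   # inferred from the summary above

embedding = embedding_params(vocab_size, embedding_dim)
lstm = cudnn_lstm_params(embedding_dim, state_dim)
dense = dense_params(state_dim, vocab_size)

# shared embedding + encoder LSTM + decoder LSTM + output layer
total = embedding + 2 * lstm + dense

print(vocab_size)   # 61099
print(embedding)    # 18329700
print(lstm)         # 1667072
print(dense)        # 31343787
print(total)        # 53007631
```

Under these assumptions, the output layer holds more than half of all parameters, because its size grows with the vocabulary.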

## Training 

The model can now be trained on the prepared data.

In [12]:
from keras.callbacks import ModelCheckpoint
from keras.callbacks import CSVLogger

model_file = '../models/seq2seq-cornell_{}.hdf5'.format(config_name)
log_file = '../models/seq2seq-cornell_{}.txt'.format(config_name)

checkpoint = ModelCheckpoint(model_file)
csv_logger = CSVLogger(log_file, append=True, separator=',')

callbacks_list = [checkpoint, csv_logger]

In [13]:
from keras.optimizers import Adam
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy')

Training can be time-consuming: running 100 epochs on this dataset, with about 50M model parameters, can take more than 10 hours on a Tesla K80 GPU.

In [None]:
model.fit([train_data_inputs, train_data_targets_with_start], train_data_targets,
          batch_size=BATCH_SIZE,
          epochs=EPOCHS, 
          callbacks=callbacks_list,
          validation_data=([valid_data_inputs, valid_data_targets_with_start], valid_data_targets))

When training has finished, the trained model is ready to be evaluated.

In [14]:
model.load_weights(model_file)

In [None]:
model.evaluate([test_data_inputs, test_data_targets_with_start], test_data_targets)

In [None]:
predictions = model.predict([test_data_inputs, test_data_targets_with_start])
for p in predictions[:10]:
    words = []
    for index in np.argmax(p, axis=-1):
        if index == 0:
            break
        words.append(index2word[index])
    print(words)

## Prediction

In real-world use, the complete reply is not known in advance, so the decoder must receive its own previous prediction as input, together with the previous state. The trained model therefore needs a small modification so that the decoder runs one step at a time.

In [15]:
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()

decoder_state_input_h = Input(shape=(STATE_DIM,))
decoder_state_input_c = Input(shape=(STATE_DIM,))
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]

y = embedding_layer(decoder_inputs)
y, state_h, state_c = decoder_rnn(y, initial_state=decoder_state_inputs)
decoder_outputs = decoder_dense(y)
decoder_state_outputs = [state_h, state_c]

decoder_model = Model(
    [decoder_inputs] + decoder_state_inputs,
    [decoder_outputs] + decoder_state_outputs)

decoder_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
encoder_inputs (InputLayer)  (None, None)              0         
_________________________________________________________________
embedding_layer (Embedding)  (None, None, 300)         18329700  
_________________________________________________________________
encoder_rnn (CuDNNLSTM)      [(None, 512), (None, 512) 1667072   
Total params: 19,996,772
Trainable params: 19,996,772
Non-trainable params: 0
_________________________________________________________________
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_inputs (InputLayer)     (None, None)         0                                            
_________________________________________________________________________________________________

In [16]:
def decode_sequence(input_seq):
    # encode the input into the initial decoder state [h, c]
    states_value = encoder_model.predict(input_seq)
    # the first decoder input is the start token
    target_seq = np.array([[tokenizer.word_index[START_TOKEN]]])
    
    decoded_sentence = []   
    
    word_count = 0
    
    while True:
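        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # greedy choice: take the most probable next token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        
        # index 0 is the padding id, which also marks the end of a reply
        if (sampled_token_index == 0 or word_count >= MAX_LEN):
            break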
            
        decoded_sentence.append(index2word[sampled_token_index])
        word_count += 1
        
        states_value = [h, c]
        target_seq = np.array([[sampled_token_index]])  

    return ' '.join(decoded_sentence)

def format_text(text):
    if not KEPT_SYMBOLS:
        return text    
    for s in KEPT_SYMBOLS:
        text = text.replace(' {}'.format(s), s)
    return text
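Stripped of the Keras calls, the loop in `decode_sequence` is a greedy decoder: feed the previous token and state, take the most probable next token, and stop at the end id or the length limit. The sketch below shows that control flow with a hypothetical `step_fn` standing in for `decoder_model.predict`; the toy step function simply replays a fixed script.

```python
def greedy_decode(step_fn, init_state, start_id, max_len, end_id=0):
    """Generate token ids one step at a time, always taking the argmax."""
    token, state, output = start_id, init_state, []
    while len(output) < max_len:
        probs, state = step_fn(token, state)
        # index of the largest probability
        token = max(range(len(probs)), key=probs.__getitem__)
        if token == end_id:
            break
        output.append(token)
    return output


def toy_step(token, state):
    """Toy stand-in for the decoder: emits ids 3, 2, then the end id 0."""
    script = [3, 2, 0]
    next_id = script[state] if state < len(script) else 0
    probs = [0.0] * 4
    probs[next_id] = 1.0
    return probs, state + 1


print(greedy_decode(toy_step, init_state=0, start_id=1, max_len=15))  # [3, 2]
print(greedy_decode(toy_step, init_state=0, start_id=1, max_len=1))   # [3]
```

Greedy decoding commits to the single best token at every step, so it is cheap but can miss replies that are more probable overall.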

Job done! We can now chat with the bot about anything. Let's test it.

In [17]:
texts = ["Hello!",
         "How are you?",
         "Who are you?",
         "What is your name?",
         "How old are you?",
         "Tell me a joke.",
         "What time is it?",
         "Who is Skywalker?",
         "What is immoral?",
         "What is morality?",
         "What is the purpose of existence?",         
         "What is the purpose of being intelligent?",         
         "What happens if machines can think?",
         "Do you prefer cats or dogs?",
         "I play tennis. What do you play?"]

input_data = transform_text(process_symbol(texts))

for text, data in zip(texts, input_data):
    decoded_sentence = format_text(decode_sequence(data[np.newaxis, :]))
    print("Q: {}".format(text))
    print("A: {}".format(decoded_sentence))

Q: Hello!
A: Well? Tell us! How'd it go?
Q: How are you?
A: Not bad.
Q: Who are you?
A: That would be difficult to explain.
Q: What is your name?
A: Bedevere, my Liege.
Q: How old are you?
A: Older.
Q: Tell me a joke.
A: Tell him there was Miss Rossi in the other name of the corner. Why
Q: What time is it?
A: Eleven o'clock... I'll be back later.
Q: Who is Skywalker?
A: Yes, sir.
Q: What is immoral?
A: You're not going to let it!
Q: What is morality?
A: One of the plagues on members of Congress office going out front of the Creator
Q: What is the purpose of existence?
A: The numbers stop in the country. I never noticed it, but it is
Q: What is the purpose of being intelligent?
A: Everything. Why do you think I thought it was a really great thing about
Q: What happens if machines can think?
A: If we go for fun, are you?
Q: Do you prefer cats or dogs?
A: I'll be fine.
Q: I play tennis. What do you play?
A: Nothing.


## Remarks

<ul>
    <li> I've found that, in this project, selecting the model by its validation score leads to underfitting. The model from the epoch with the best validation score often replies with "i don't know". If training instead continues until the training loss is low enough, the quality of the replies improves. </li>
    <li> Interestingly, the undertrained model answers "i don't know" as if it really did not understand the question. </li>
</ul>

<ul>
    <li> With a large vocabulary and long sequences, for example 60,000 words and a sequence length of 20, the training loss becomes NaN at some point and the model cannot be trained further. Some regularization techniques may solve this problem. </li>
    <li> With some settings, the validation loss becomes NaN while the training loss does not. </li>
</ul>

<ul>
    <li> There is still a lot of room for model improvement, for example
        <ul>
            <li>reversing an input sequence and padding at the front instead</li>
            <li>improving the tokenizer so that it does not rely on spaces alone as separators</li>
            <li>adding the attention mechanism</li>
            <li>adding more layers to the RNN</li>
            <li>using pretrained embeddings, e.g. GloVe</li>
            <li>possibly sharing the embedding weights with the output layer too</li>
            <li>adding regularization methods to the model, such as dropout and batch normalization</li>
        </ul>
    </li>
</ul>
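The first suggestion in the list above, reversing the input and padding at the front, could be sketched for a single sequence as follows. This is an untested idea for this project rather than something the notebook does, and `reverse_and_pad_pre` is a hypothetical helper.

```python
def reverse_and_pad_pre(seq, max_len, value=0):
    """Truncate at the end, reverse the tokens, then pad at the front."""
    seq = list(seq)[:max_len]
    seq.reverse()
    return [value] * (max_len - len(seq)) + seq


print(reverse_and_pad_pre([93, 18, 143], 5))   # [0, 0, 143, 18, 93]
```

With this layout the encoder reads the padding first and the real tokens last, so its final state is computed right after the first words of the input instead of after a run of zeros.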

<ul>
    <li> This is still far from a rational, intelligent conversational agent. It lacks consistency, a persona, world knowledge and many other qualities that a practical chatbot should have. Nonetheless, this project demonstrates that a seq2seq model, even in its most basic form, can produce fairly sensible interactions with human users when given proper data and parameters. </li>
</ul>