# QA Bot

In this notebook we have an implementation of the ChatBot using the End to End memory networks implementation.

First, let's load the data.

### References
* End-to-End Memory Networks paper: https://arxiv.org/abs/1503.08895

In [1]:
import pickle
import numpy

with open('../../datasets/train_qa.txt', 'rb') as file:
    train_data = pickle.load(file)
    
with open('../../datasets/test_qa.txt', 'rb') as file:
    test_data = pickle.load(file)


In [2]:
train_count = len(train_data)
test_count = len(test_data)
print(f"Train data count: {train_count}")
print(f"Test data count: {test_count}")

Train data count: 10000
Test data count: 1000


We can take a look at the kind of data we can find in there.

In [3]:
import random

def print_record(data, index=None):
    if not index:
        index = random.randint(0, len(data) - 1)
    record = data[index]
    story = " ".join(record[0])
    story = [s.strip() for s in story.split(".")]
    story = "\n".join(story)
    question = " ".join(record[1])
    answer = record[2]
    print(f"Story:\n{story}")
    print(f"Question: {question}\n")
    print(f"Answer: {answer}")

In [4]:
print_record(train_data, 0)

Story:
Sandra went back to the hallway
John moved to the bathroom
Sandra journeyed to the kitchen
John journeyed to the office

Question: Is John in the garden ?

Answer: no


In [5]:
print_record(train_data)

Story:
Sandra went back to the bedroom
Sandra went back to the garden
Sandra went back to the kitchen
John travelled to the kitchen

Question: Is Sandra in the kitchen ?

Answer: yes


We need to ensure both the train and test data are used to create a vocabulary, this is to ensure the dictionary contains the elements used in both

In [6]:
all_data = train_data + test_data

# find all distinct elements between all the words
vocab = set()
for story, question, answer in all_data:
    vocab = vocab.union(set(story))
    vocab = vocab.union(set(question))
    
vocab.add("yes")
vocab.add("no")

In [7]:
vocab

{'.',
 '?',
 'Daniel',
 'Is',
 'John',
 'Mary',
 'Sandra',
 'apple',
 'back',
 'bathroom',
 'bedroom',
 'discarded',
 'down',
 'dropped',
 'football',
 'garden',
 'got',
 'grabbed',
 'hallway',
 'in',
 'journeyed',
 'kitchen',
 'left',
 'milk',
 'moved',
 'no',
 'office',
 'picked',
 'put',
 'the',
 'there',
 'to',
 'took',
 'travelled',
 'up',
 'went',
 'yes'}

In [8]:
vocab_len = len(vocab) + 1
vocab_len

38

Calculate the longest story and question

In [46]:
all_story_len = [len(data[0]) for data in all_data]
max_story_len = max(all_story_len)
all_question_len = [len(data[1]) for data in all_data]
max_question_len = max(all_question_len)

Now let's vectorize the data.

We need to fit a tokenizer with the vocabulary we just built to have a word index so we can feed it into the neural network

In [48]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(filters=[])
tokenizer.fit_on_texts(vocab)
tokenizer.index_word

{1: 'grabbed',
 2: 'office',
 3: 'got',
 4: 'is',
 5: 'moved',
 6: 'back',
 7: 'left',
 8: 'dropped',
 9: 'picked',
 10: 'discarded',
 11: 'bedroom',
 12: 'there',
 13: 'to',
 14: '.',
 15: 'daniel',
 16: 'journeyed',
 17: 'garden',
 18: 'yes',
 19: 'the',
 20: 'travelled',
 21: 'no',
 22: 'put',
 23: 'in',
 24: 'apple',
 25: 'sandra',
 26: 'up',
 27: 'kitchen',
 28: 'went',
 29: 'hallway',
 30: 'bathroom',
 31: '?',
 32: 'football',
 33: 'john',
 34: 'took',
 35: 'down',
 36: 'mary',
 37: 'milk'}

This is just adding up all the texts grouped by type so we can later vectorize them, i.e., convert them into matrices.

In [49]:
train_story_text = []
train_question_text = []
train_answers = []

for story, question, answer in train_data:
    train_story_text.append(story)
    train_question_text.append(question)
    train_answers.append(answer)


### Vectorize stories
This is a very important function, what this is doing is converting the human readable texts for each type (story, questions and answers) into their corresponding vector representations using the tokenizer we just trained.

In [50]:
import numpy as np

def vectorize_stories(data, word_index, max_story_length, max_question_length):
    """Vectorize stories.
    
    @param data: story, question and answer packed together
    @param word_index: a pre-trained tokenizer word index
    @param max_story_length: max story length characters to pad story sequences
    @param max_question_length: max question length characters to pad question seq.
    """
    # stories
    X = []
    # questions
    Xq = []
    # answers
    Y = []
    # every record in 'data' corresponds to a story, question and an answer
    # that we need to vectorize to contruct the input data to feed the NN.
    for story, question, answer in data:
        # indices for every word in the story
        x = [word_index[word.lower()] for word in story]
        # indices for every word in the question
        xq = [word_index[word.lower()] for word in question]
        
        # initialize the target matrix with the length of the word index
        # plus one for keras pad sequence
        y = np.zeros(len(word_index) + 1)
        # only ligth up the element corresponding to the answer.
        y[word_index[answer]] = 1
        
        X.append(x)
        Xq.append(xq)
        Y.append(y)
    # finally pad the sequences accordingly so stories and questions have the same
    # lenght.
    return (pad_sequences(X, maxlen=max_story_length), pad_sequences(Xq, maxlen=max_question_length), np.array(Y))

In [51]:
input_train, queries_train, answerts_train = vectorize_stories(train_data, tokenizer.word_index, max_story_len, max_question_len)
input_test, queries_test, answerts_test = vectorize_stories(test_data, tokenizer.word_index, max_story_len, max_question_len)

Let's observe how the data looks, we should obtain matrices where each row corresponds to a single record of story-question-answer. For the first two matrices, each column corresponds to the vocabulary word with the index value. And for the last one, each column corresponds to the index of the answer.

In [52]:
input_train

array([[ 0,  0,  0, ..., 19, 11, 14],
       [ 0,  0,  0, ..., 19, 29, 14],
       [ 0,  0,  0, ..., 19, 30, 14],
       ...,
       [ 0,  0,  0, ..., 19, 11, 14],
       [ 0,  0,  0, ..., 37, 12, 14],
       [ 0,  0,  0, ..., 24, 12, 14]], dtype=int32)

If we take a look at the answers, we should observe arrays with zeros, except for the index positions of the words 'yes' and 'no' which corresponds to the tokenizer word index.

In [53]:
answerts_test

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [54]:
tokenizer.word_index['yes']

18

In [55]:
tokenizer.word_index['no']

21

If we sum up the answerts test array, we will obtain the amount of yes/no answers and the column index position corresponds to the word index value for that word, i.e., 'yes' has index 18, then if we count the columns indices in the answers_test array, we will find the amount of yes indices, i.e., 497.

In [56]:
sum(answerts_test)

array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0., 497.,   0.,   0., 503.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.])

In [57]:
answerts_test.shape

(1000, 38)

## Neural Network implementation

In [58]:
from keras.models import Sequential, Model
from keras.layers import Embedding, Input, Activation, Dense, Permute, Dropout, add, dot, concatenate, LSTM

input_seq = Input((max_story_len,))
question = Input((max_question_len,))

vocab_size = len(vocab) + 1

# Input encoder M (for stories)
input_encoder_m = Sequential()
input_encoder_m.add(Embedding(input_dim=vocab_size, output_dim=64))
input_encoder_m.add(Dropout(0.3))
# this will output a tensor of shape (samples, story_max_len, embedding_dim)

# Input encoder C
input_encoder_c = Sequential()
input_encoder_c.add(Embedding(input_dim=vocab_size, output_dim=max_question_len))
input_encoder_c.add(Dropout(0.3))
# this will output a tensor of shape (samples, story_max_len, max_question_len)

# Questions encoder
question_encoder = Sequential()
question_encoder.add(Embedding(input_dim=vocab_size, output_dim=64, input_length=max_question_len))
question_encoder.add(Dropout(0.3))
# this will output a tensor of shape (samples, query_maxlen, embedding_dim)

Notice here we are using the functional API of Keras

In [60]:
question

<tf.Tensor 'input_4:0' shape=(?, 6) dtype=float32>

In [61]:
input_encoded_m = input_encoder_m(input_seq)
input_encoded_c = input_encoder_c(input_seq)
question_encoded = question_encoder(question)

Now, according to the paper we need to compute a match $u^T \cdot m_i$ where:
* $u^T$ is the result of embedding the queries with similar dimensions as the story to obtain an internal state $u$
* $m_i$ is the encoded (embedded) matrix of stories or sentences we want to store in memory.

We later need to pass this to a softmax function to compute the match:
$$
p_i=Softmax(u^T \cdot m_i)
$$

In [66]:
match = dot([input_encoded_m, question_encoded], axes=(2, 2))
match = Activation('softmax')(match)

Now, we need to calculate the response vector from the memory $o$ which is a sum (add) over the transformed inputs $c_i$ weighted by the probability vector from the input.
$$
o=\sum_{i} p_ic_i
$$

In [67]:
response = add([match, input_encoded_c])
response = Permute((2, 1))(response)

Now we can concatenate the response with the question encoded

In [68]:
answer = concatenate([response, question_encoded])

In [69]:
answer

<tf.Tensor 'concatenate_2/concat:0' shape=(?, 6, 220) dtype=float32>

And pass this into a LSTM layer

In [70]:
answer = LSTM(32)(answer)
answer = Dropout(0.5)(answer)
answer = Dense(vocab_size)(answer) # (samples, vocab_size) # YES/NO
answer = Activation('softmax')(answer)

# Then build the final model
model = Model([input_seq, question], answer)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 156)          0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 6)            0                                            
__________________________________________________________________________________________________
sequential_4 (Sequential)       multiple             2432        input_3[0][0]                    
__________________________________________________________________________________________________
sequential_6 (Sequential)       (None, 6, 64)        2432        input_4[0][0]                    
__________________________________________________________________________________________________
dot_2 (Dot