# Deep Learning Question & Answer Chatbot

We will implement a chatbot that answers questions about a set of sentences. The chatbot uses a subset of the bAbI dataset from Facebook Research, which already contains stories (sentences), queries (questions), and answers.
Here is a link to the bAbI datasets and the research paper this is based on:

Full Details: https://research.fb.com/downloads/babi/

- Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush,
  "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks",
  http://arxiv.org/abs/1502.05698
  
The bot currently returns a yes or a no to each question asked. However, I plan to integrate a Natural Language Generation component to introduce some meaningful dialogue, as well as a speech-to-text component.

## Imports

In [2]:
import pickle
import numpy as np

In [3]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [4]:
# For creating the model
from keras.models import Sequential, Model
from keras.layers import Embedding
from keras.layers import Input, Activation, Dense, Permute, Dropout
from keras.layers import add, dot, concatenate
from keras.layers import LSTM

## Load the Data

In [5]:
with open("train_qa.txt", "rb") as fp:   # Unpickling
    train_data =  pickle.load(fp) # List

In [6]:
with open("test_qa.txt", "rb") as fp:   # Unpickling
    test_data =  pickle.load(fp) # List

The training and test data have a 10:1 ratio: train_data has 10,000 examples and test_data has 1,000.

In [59]:
train_data[10]

(['Sandra',
  'went',
  'back',
  'to',
  'the',
  'hallway',
  '.',
  'Sandra',
  'moved',
  'to',
  'the',
  'office',
  '.'],
 ['Is', 'Sandra', 'in', 'the', 'office', '?'],
 'yes')

In [60]:
story_sentence = ' '.join(train_data[10][0]) #Story/Sentence
query_question = ' '.join(train_data[10][1]) #Query/Question
answer = train_data[10][2] #Answer to question

In [61]:
print("Sentence: ", story_sentence)
print("Question: ", query_question)
print("Answer:   ", answer)

Sentence:  Sandra went back to the hallway . Sandra moved to the office .
Question:  Is Sandra in the office ?
Answer:    yes


## Create a Vocabulary of all of the Words

In [62]:
# Set that contains the vocab words
vocab = set()

In [63]:
all_data = train_data + test_data 

for story, question , answer in all_data:
    # Creates a vocabulary of all the distinct words inside our dataset 
    vocab = vocab | set(story) # vocab ∪ Story. Continuously adds unique words
    vocab = vocab | set(question) # vocab ∪ question. Continuously adds unique words

In [64]:
# Add in the two possible answers 
vocab.add('no')
vocab.add('yes')

In [65]:
vocab

{'.',
 '?',
 'Daniel',
 'Is',
 'John',
 'Mary',
 'Sandra',
 'apple',
 'back',
 'bathroom',
 'bedroom',
 'discarded',
 'down',
 'dropped',
 'football',
 'garden',
 'got',
 'grabbed',
 'hallway',
 'in',
 'journeyed',
 'kitchen',
 'left',
 'milk',
 'moved',
 'no',
 'office',
 'picked',
 'put',
 'the',
 'there',
 'to',
 'took',
 'travelled',
 'up',
 'went',
 'yes'}

In [66]:
# Reserve index 0 for the padding value used by Keras's pad_sequences
vocab_size = len(vocab) + 1

In [67]:
# Find the lengths of the longest story and the longest query
longestStory = max(len(data[0]) for data in all_data)
longestQuery = max(len(data[1]) for data in all_data)

In [68]:
print(longestStory)
print(longestQuery)

156
6


## Vectorize Data

In [69]:
# Creates an integer encoding for the words
tokenizer = Tokenizer(filters = []) # Nothing is filtered out, so '.' and '?' stay in the index
tokenizer.fit_on_texts(vocab) # Builds the word index; words are lowercased by default

In [70]:
tokenizer.word_index

{'hallway': 1,
 'down': 2,
 'moved': 3,
 'journeyed': 4,
 'mary': 5,
 'discarded': 6,
 'in': 7,
 'took': 8,
 'left': 9,
 'got': 10,
 '?': 11,
 '.': 12,
 'john': 13,
 'went': 14,
 'dropped': 15,
 'put': 16,
 'daniel': 17,
 'sandra': 18,
 'bedroom': 19,
 'kitchen': 20,
 'apple': 21,
 'back': 22,
 'bathroom': 23,
 'travelled': 24,
 'yes': 25,
 'the': 26,
 'football': 27,
 'garden': 28,
 'to': 29,
 'milk': 30,
 'is': 31,
 'there': 32,
 'no': 33,
 'picked': 34,
 'office': 35,
 'up': 36,
 'grabbed': 37}
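
The keys of `word_index` are all lowercase because `Tokenizer` lowercases text by default, which is why later lookups need to go through `word.lower()`. The idea can be sketched with a simplified stand-in; `build_word_index` and `toy_vocab` are hypothetical names used only for illustration, and this is not how Keras implements it internally:

```python
def build_word_index(words):
    """Map each distinct lowercased word to an integer, starting at 1.

    Index 0 is deliberately left unused so it can serve as the padding value.
    """
    index = {}
    for word in words:
        key = word.lower()
        if key not in index:
            index[key] = len(index) + 1
    return index


# Hypothetical toy vocabulary (a list, so the ordering is deterministic)
toy_vocab = ['the', 'Is', 'office', 'Sandra', '?', 'in']
toy_index = build_word_index(toy_vocab)

# Encode a question by looking up the lowercased form of each word
question = ['Is', 'Sandra', 'in', 'the', 'office', '?']
encoded = [toy_index[word.lower()] for word in question]
print(encoded)
```

Looking up `'Sandra'` directly in `toy_index` would raise a `KeyError`, since only `'sandra'` is stored.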

In [71]:
trainStoryText = []
trainQueryText = []
trainAnswers = []

for story, query, answer in train_data:
    trainStoryText.append(story)
    trainQueryText.append(query)
    trainAnswers.append(answer)

In [72]:
# Transforms each word in the sentences to a sequence of integers.
trainStorySeq = tokenizer.texts_to_sequences(trainStoryText)

In [73]:
def vectorizeStories(data, word_index = tokenizer.word_index, maxStoryLen = longestStory, maxQueryLen = longestQuery):
    """
    Vectorizes stories, queries, & answers into padded sequences. 
   
    Parameters: 
        data: All the data (Stories, Queries, Answers)
        word_index: A word index dictionary. Defaults to our tokenizer.word_index.
                    Can be overridden for other datasets or other sets of questions.
        maxStoryLen: Length of the longest story (used by the pad_sequences function)
        maxQueryLen: Length of the longest query (used by the pad_sequences function)
        
        We need the max story & query lengths because we are using padded sequences: not every story/query 
        is the same length, and the RNN we're training needs every input to be the same length. 
        pad_sequences pads a story or query with 0s if it is too short and truncates it if it is too long.
        
    Returns: 
        this (tuple): A tuple of the form (X, Q, A) (padded based on max lengths)    
    """
    X = [] # X := Stories
    Q = [] # Q := Queries
    A = [] # A := Answers (yes/no)
    
    for story, query, answer in data:
        # Convert the raw words into integers through a word index value
        
        # Grabs the word index for every word in story
        # e.g. [9, 34, ...]
        x = [word_index[word.lower()] for word in story]
        # Grabs the word index for every word in query
        q = [word_index[word.lower()] for word in query]
        
        # Index 0 is reserved for padding, so the answer vector needs len(word_index) + 1 entries
        a = np.zeros(len(word_index) + 1)
        
        # a is a vector of zeros; setting the answer's index to 1 one-hot encodes the answer (yes/no)
        a[word_index[answer]] = 1
        
        # We now append each set to their appropriate output list.
        X.append(x)
        Q.append(q)
        A.append(a)
    
    # Now that we have converted the words to numbers, we pad the sequences so they are all of equal length.
    X_padded_seqs = pad_sequences(X, maxlen = maxStoryLen)
    Q_padded_seqs = pad_sequences(Q, maxlen= maxQueryLen)
    answers = np.array(A)
    
    # Now that the sequences are padded based on their max length, the RNN can be trained on uniformly long sequences.
    # Returns tuple for unpacking. 
    return (X_padded_seqs, Q_padded_seqs, answers)
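
The padding step explains the leading zeros in the vectorized stories: with its default settings, `pad_sequences` pads at the front and also truncates from the front. A minimal pure-Python sketch of that behavior on a single sequence follows; `pad_pre` is a hypothetical helper written for illustration, not part of Keras:

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate seq so that it has exactly maxlen items."""
    if len(seq) >= maxlen:
        # Too long: drop the earliest items and keep the last maxlen
        return seq[len(seq) - maxlen:]
    # Too short: fill the front with the padding value
    return [value] * (maxlen - len(seq)) + seq


short = pad_pre([14, 29, 26], 5)            # zeros are added in front
long = pad_pre([1, 2, 3, 4, 5, 6, 7], 5)    # the earliest items are dropped
print(short)
print(long)
```

Because 0 is used as the fill value, no real word may have index 0, which is the reason for the `+ 1` in `vocab_size`.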

In [74]:
inputsTrain, queriesTrain, answersTrain = vectorizeStories(train_data)
inputsTest, queriesTest, answersTest = vectorizeStories(test_data)

In [75]:
print(inputsTrain)

[[ 0  0  0 ... 26 19 12]
 [ 0  0  0 ... 26  1 12]
 [ 0  0  0 ... 26 23 12]
 ...
 [ 0  0  0 ... 26 19 12]
 [ 0  0  0 ... 30 32 12]
 [ 0  0  0 ... 21 32 12]]


In [76]:
print(inputsTest)

[[ 0  0  0 ... 26 19 12]
 [ 0  0  0 ... 26 28 12]
 [ 0  0  0 ... 26 28 12]
 ...
 [ 0  0  0 ... 26 21 12]
 [ 0  0  0 ... 26 28 12]
 [ 0  0  0 ... 21 32 12]]


In [77]:
print(answersTest)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [78]:
tokenizer.word_index['yes']

25

In [79]:
tokenizer.word_index['no']

33

In [80]:
sum(answersTest)

array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0., 497.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
       503.,   0.,   0.,   0.,   0.])

We can see that we have 497 'yes' answers at index 25 and 503 'no' answers at index 33, matching the word indices of 'yes' and 'no' shown above.

Our stories, queries, and answers are now successfully vectorized.
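
Summing one-hot answer rows column by column gives the number of examples per answer word, which is what `sum(answersTest)` reports. A small sketch on made-up data; the indices in `toy_index` and the answers in `toy_answers` are assumptions for illustration only:

```python
import numpy as np

toy_index = {'yes': 1, 'no': 2}   # hypothetical indices; 0 stays reserved for padding
toy_answers = ['yes', 'no', 'no', 'yes', 'no']

# One row per example, one column per index (including the reserved 0)
one_hot = np.zeros((len(toy_answers), len(toy_index) + 1))
for row, word in enumerate(toy_answers):
    one_hot[row, toy_index[word]] = 1

# Column sums count how many examples have each answer
counts = one_hot.sum(axis=0)
print(counts)
```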

## Create the model

In [81]:
# We have 2 inputs: the stories and the questions
# We need placeholders for them, so we use `Input()` to instantiate a Keras tensor.

In [82]:
# The shape argument excludes the batch dimension, so the tensors are (batch size, longestStory) and (batch size, longestQuery)
inputSequence = Input((longestStory, ))
query = Input((longestQuery, ))

## Build the Neural Network 

There are three encoders we will build
* Input Encoder C
* Input Encoder M
* Question Encoder

We are following this model from the paper:
![PaperModel](..\PaperModel.png)
*Figure 1: (a): A single layer version of the model. (b): A three layer version of the model.*

### Input Encoder C

In [83]:
# This input gets embedded into a sequence of vectors
inputEncoderC = Sequential()
# Add 2 layers to it
# output_dim is longestQuery so that this encoder's output can later be added to the match matrix
inputEncoderC.add(Embedding(input_dim = vocab_size, output_dim = longestQuery))
# Turns off a random fraction of neurons, which helps against overfitting. Can increase dropout and train longer if wanted
inputEncoderC.add(Dropout(0.3))

# This encoder will output:
# output: (samples, stories max len, longestQuery)

### Input Encoder M

In [84]:
# This input gets embedded into a sequence of vectors
inputEncoderM = Sequential()
# Add 2 layers to it
# The dimension is set to 64 as the researchers found it to give good results for that vocab size. 
inputEncoderM.add(Embedding(input_dim = vocab_size, output_dim = 64)) 
# Turns off a random fraction of neurons, which helps against overfitting. Can increase dropout and train longer if wanted
inputEncoderM.add(Dropout(0.3))

# This encoder will output:
# output: (samples, stories max len, embedding dim)

### Question Encoder

In [85]:
# This input gets embedded into a sequence of vectors. The paper states: "The query q is also embedded
# (again, in the simplest case via another embedding matrix B with the same dimensions as A) to obtain
# an internal state u." So the output dimension matches that of encoder M.
questionEncoder = Sequential()
# Add 2 layers to it
questionEncoder.add(Embedding(input_dim = vocab_size, output_dim = 64, input_length = longestQuery))
# Turns off a random fraction of neurons, which helps against overfitting. Can increase dropout and train longer if wanted
questionEncoder.add(Dropout(0.3))

# This encoder will output:
# output: (samples, longestQuery, embedding dim)
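
Conceptually, an embedding layer is a lookup table: each integer index selects one row of a weight matrix. The sketch below imitates that with plain NumPy indexing; the sizes and the random `table` are toy assumptions standing in for learned weights, not the values used in this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_dim = 10, 4

# Stands in for the learned embedding weights: one row per word index
table = rng.normal(size=(toy_vocab_size, toy_dim))

# A padded query of length 6 (the leading zeros are padding)
padded_query = np.array([0, 0, 3, 7, 5, 2])

# Row lookup turns (6,) indices into a (6, 4) sequence of vectors
embedded = table[padded_query]
print(embedded.shape)
```

Note that the padding index 0 is looked up like any other index here, so every padded position receives the same vector (row 0 of the table).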

### Encode the Sequences 

In [86]:
# Encoded <-- Encoder(input)
# Encode the input sequence and questions (which are indices) into sequences of dense vectors
# We already have our placeholders for the inputs (inputSequence & query)
inputEncodedM = inputEncoderM(inputSequence)
inputEncodedC = inputEncoderC(inputSequence)
questionEncoded = questionEncoder(query)

In [87]:
# As stated in the paper: 
# In the embedding space, we compute the match between u (the embedded query) and each memory m_i (the embedded story) by taking the inner product
# shape: `(samples, story_maxlen, query_maxlen)`
match = dot([inputEncodedM, questionEncoded], axes = (2,2))

Now we call an activation function on this match (Softmax).

$$p_{i}=\operatorname{Softmax}\left(u^{T} m_{i}\right)$$
Where
$$\operatorname{Softmax}\left(z_{i}\right)=e^{z_{i}} / \sum_{j} e^{z_{j}}$$
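
The softmax formula can be sketched directly in NumPy. This version subtracts the maximum before exponentiating, a common numerical-stability step that is not part of the formula above but leaves the result unchanged because the shift cancels between numerator and denominator. The function name and inputs are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array: exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()      # avoids overflow in exp; does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum()


p = softmax([2.0, 1.0, 0.1])
print(p)         # larger scores get larger probabilities
print(p.sum())
```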

In [88]:
match = Activation('softmax')(match)

Now we add this match matrix to the second input vector sequence (the output of Input Encoder C).

In [89]:
response_vector = add([match, inputEncodedC]) # (samples, longestStory, longestQuery)
response_vector = Permute((2, 1))(response_vector)  # (samples, longestQuery, longestStory)

In [90]:
# Concatenate the response vector with the question vector sequence
answer = concatenate([response_vector, questionEncoded])
answer # (batch size, 6, 220), where 220 = 156 + 64

<tf.Tensor 'concatenate_4/concat:0' shape=(?, 6, 220) dtype=float32>
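
The shapes through the dot, softmax, add, permute, and concatenate steps can be traced with a NumPy sketch. The dimensions below are small toy values (not 156, 6, and 64) and the arrays are random stand-ins for the encoder outputs, so this only illustrates the shape arithmetic, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, story_len, query_len, embed_dim = 2, 5, 3, 4

m = rng.normal(size=(batch, story_len, embed_dim))   # stands in for inputEncodedM
c = rng.normal(size=(batch, story_len, query_len))   # stands in for inputEncodedC
u = rng.normal(size=(batch, query_len, embed_dim))   # stands in for questionEncoded

# Inner product over the embedding axis -> (batch, story_len, query_len)
match = np.einsum('bse,bqe->bsq', m, u)

# Softmax over the last axis
match = np.exp(match - match.max(axis=-1, keepdims=True))
match = match / match.sum(axis=-1, keepdims=True)

# Element-wise add works only because c has the same shape as match
response = match + c                                 # (batch, story_len, query_len)
response = response.transpose(0, 2, 1)               # (batch, query_len, story_len)

# Concatenate along the last axis -> (batch, query_len, story_len + embed_dim)
answer = np.concatenate([response, u], axis=-1)
print(answer.shape)
```

With the notebook's sizes the last axis becomes 156 + 64 = 220, which matches the tensor shape shown above. It also shows why Input Encoder C uses `longestQuery` as its output dimension: the addition requires the same shape as the match matrix.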

Now that we have our answer, we are going to reduce it with an RNN, specifically an LSTM layer.

In [91]:
# Reduce with LSTM
answer = LSTM(32)(answer)  # (samples, 32)

In [96]:
# Apply one more round of regularization with Dropout
answer = Dropout(0.5)(answer)
# Dense output layer over the vocabulary: (samples, vocab_size). Only the YES/NO entries should end up with high scores
answer = Dense(vocab_size)(answer) 

We output a probability distribution over the vocabulary. We expect nearly all of the probability mass to fall on YES and NO, with values close to zero everywhere else. Passing the dense layer's output through a softmax turns it into probabilities that sum to 1.

In [101]:
# Output a probability distribution over the vocabulary
answer = Activation('softmax')(answer)

# Building the final model
# This answer links together all the encoders (encoder C, encoder M, Question encoder).
# This is how we link our model to those encodings.
model = Model([inputSequence, query], answer) 

# We expect to see high probabilities only on YES or NO, but we are not using a binary cross-entropy loss
# because the output covers the entire vocabulary, which is larger than two classes.
# We therefore use categorical_crossentropy.
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
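
For a one-hot target, categorical cross-entropy reduces to the negative log of the probability assigned to the correct word. A minimal sketch on a hypothetical 4-word vocabulary follows; the function and the numbers are illustrative, and Keras's own implementation differs in details such as batching:

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """-sum(t * log(p)); only the target's position contributes for a one-hot target."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs)
                if t > 0)


target = [0.0, 0.0, 1.0, 0.0]            # correct answer sits at index 2
confident = [0.01, 0.01, 0.97, 0.01]     # most probability on the correct word
unsure = [0.25, 0.25, 0.25, 0.25]        # uniform guess over the vocabulary

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(loss_confident, loss_unsure)
```

The loss shrinks toward 0 as the predicted probability of the correct word approaches 1, so probability placed on any word other than the correct YES or NO is penalized.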

In [102]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 156)          0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 6)            0                                            
__________________________________________________________________________________________________
sequential_8 (Sequential)       multiple             2432        input_3[0][0]                    
__________________________________________________________________________________________________
sequential_9 (Sequential)       (None, 6, 64)        2432        input_4[0][0]                    
__________________________________________________________________________________________________
dot_4 (Dot