# Deep Learning Question & Answer Chatbot

We will implement a chatbot that answers questions about a given set of sentences. The chatbot uses a subset of the bAbI dataset from Facebook Research, which already contains stories (sentences), queries (questions), and answers.
Here is a link to the bAbI datasets and the research paper this is based on:

Full Details: https://research.fb.com/downloads/babi/

- Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush,
  "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks",
  http://arxiv.org/abs/1502.05698
  
The bot currently answers each question with a yes or a no. However, I plan on integrating a Natural Language Generation component to introduce some meaningful dialogue, as well as a speech-to-text component.

## Imports

In [4]:
import pickle
import numpy as np

In [5]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [6]:
# For creating the model
from keras.models import Sequential, Model
from keras.layers.embeddings import Embedding
from keras.layers import Input, Activation, Dense, Permute, Dropout
from keras.layers import add, dot, concatenate
from keras.layers import LSTM

## Load the Data

In [7]:
with open("train_qa.txt", "rb") as fp:   # Unpickling
    train_data =  pickle.load(fp) # List

In [8]:
with open("test_qa.txt", "rb") as fp:   # Unpickling
    test_data =  pickle.load(fp) # List

The training and test data are split 10:1: there are 10,000 examples in train_data and 1,000 examples in test_data.

In [9]:
train_data[10]

(['Sandra',
  'went',
  'back',
  'to',
  'the',
  'hallway',
  '.',
  'Sandra',
  'moved',
  'to',
  'the',
  'office',
  '.'],
 ['Is', 'Sandra', 'in', 'the', 'office', '?'],
 'yes')

In [10]:
story_sentence = ' '.join(train_data[10][0]) #Story/Sentence
query_question = ' '.join(train_data[10][1]) #Query/Question
answer = train_data[10][2] #Answer to question

In [11]:
print("Sentence: ", story_sentence)
print("Question: ", query_question)
print("Answer:   ", answer)

Sentence:  Sandra went back to the hallway . Sandra moved to the office .
Question:  Is Sandra in the office ?
Answer:    yes


## Create a Vocabulary of all of the Words

In [12]:
# Set that contains the vocab words
vocab = set()

In [13]:
all_data = train_data + test_data 

for story, question , answer in all_data:
    # Creates a vocabulary of all the distinct words inside our dataset 
    vocab = vocab | set(story) # vocab ∪ Story. Continuously adds unique words
    vocab = vocab | set(question) # vocab ∪ question. Continuously adds unique words

In [14]:
# Add in the two possible answers 
vocab.add('no')
vocab.add('yes')
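To make the vocabulary step concrete, here is a minimal sketch on two made-up (story, question, answer) triples. The names `toy_data` and `toy_vocab` are hypothetical and exist only for this illustration; `set.update` is used as an in-place equivalent of the `vocab | set(...)` union above.

```python
# Toy (story, question, answer) triples in the same shape as the dataset
toy_data = [
    (["Mary", "went", "to", "the", "kitchen", "."],
     ["Is", "Mary", "in", "the", "kitchen", "?"], "yes"),
    (["John", "moved", "to", "the", "garden", "."],
     ["Is", "John", "in", "the", "kitchen", "?"], "no"),
]

toy_vocab = set()
for story, question, _answer in toy_data:
    # update() adds every unseen word in place; duplicates are ignored
    toy_vocab.update(story)
    toy_vocab.update(question)

# The answer words may never occur inside a story or question, so add them explicitly
toy_vocab.update({"yes", "no"})

print(sorted(toy_vocab))
print(len(toy_vocab))  # 14
```

Note that the set is still case-sensitive at this point ('Is' and 'Mary' keep their capitals).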

In [15]:
vocab

{'.',
 '?',
 'Daniel',
 'Is',
 'John',
 'Mary',
 'Sandra',
 'apple',
 'back',
 'bathroom',
 'bedroom',
 'discarded',
 'down',
 'dropped',
 'football',
 'garden',
 'got',
 'grabbed',
 'hallway',
 'in',
 'journeyed',
 'kitchen',
 'left',
 'milk',
 'moved',
 'no',
 'office',
 'picked',
 'put',
 'the',
 'there',
 'to',
 'took',
 'travelled',
 'up',
 'went',
 'yes'}

In [16]:
# Reserve 0 for Keras pad_sequences 
vocab_size = len(vocab) + 1 # + 1 to add an extra space for a 0 for Keras's pad_sequences

In [17]:
# Find the lengths of the longest story and the longest query
longestStory = max(( (len(data[0])) for data in all_data )) 
longestQuery = max(( (len(data[1])) for data in all_data )) 

In [18]:
print(longestStory)
print(longestQuery)

156
6
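These two maximum lengths become the target lengths when the sequences are padded later on. The sketch below imitates, in plain Python, what I understand to be the default behaviour of Keras's `pad_sequences` (zeros added at the front, and overly long sequences cut at the front). `pad_left` is a hypothetical helper, not a Keras function.

```python
def pad_left(sequences, maxlen, value=0):
    """Hypothetical helper: left-pad (or left-truncate) each sequence to maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[5, 2], [7, 1, 4, 9], [3]]
toy_maxlen = max(len(seq) for seq in toy_sequences)

print(pad_left(toy_sequences, toy_maxlen))
# [[0, 0, 5, 2], [7, 1, 4, 9], [0, 0, 0, 3]]
print(pad_left(toy_sequences, 3))
# [[0, 5, 2], [1, 4, 9], [0, 0, 3]]
```

Because 0 is used as the padding value, no real word may be assigned index 0, which is why the vocabulary size is `len(vocab) + 1`.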


## Vectorize Data

In [19]:
# Creates integer encoding for the sequences of words
tokenizer = Tokenizer(filters = [])
tokenizer.fit_on_texts(vocab) # Builds the word index. Words are lowercased; each vocab word occurs once, so the index order is arbitrary

In [20]:
tokenizer.word_index

{'?': 1,
 'to': 2,
 'picked': 3,
 'discarded': 4,
 'office': 5,
 'daniel': 6,
 'mary': 7,
 'bathroom': 8,
 'kitchen': 9,
 'put': 10,
 'left': 11,
 'apple': 12,
 'grabbed': 13,
 'sandra': 14,
 'went': 15,
 'no': 16,
 'back': 17,
 'journeyed': 18,
 '.': 19,
 'in': 20,
 'hallway': 21,
 'dropped': 22,
 'football': 23,
 'john': 24,
 'milk': 25,
 'garden': 26,
 'got': 27,
 'bedroom': 28,
 'down': 29,
 'yes': 30,
 'took': 31,
 'travelled': 32,
 'is': 33,
 'the': 34,
 'moved': 35,
 'up': 36,
 'there': 37}
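Notice that every key in the word index is lowercase ('daniel', 'is'), although the vocabulary contained 'Daniel' and 'Is'. As far as I know, the Keras `Tokenizer` lowercases text by default and numbers words from 1, most frequent first. The plain-Python sketch below imitates that behaviour under those assumptions; `build_word_index` is a hypothetical helper, not part of Keras.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: map each lowercased token to a 1-based integer id."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _count) in enumerate(counts.most_common(), start=1)}

toy_index = build_word_index(["Is Mary in the kitchen ?", "Mary went to the kitchen ."])
print(toy_index)
# {'mary': 1, 'the': 2, 'kitchen': 3, 'is': 4, 'in': 5, '?': 6, 'went': 7, 'to': 8, '.': 9}
```

Index 0 is never assigned, so it stays free for padding. Any later lookup has to lowercase the word first.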

In [21]:
trainStoryText = []
trainQueryText = []
trainAnswers = []

for story, query, answer in train_data:
    trainStoryText.append(story)
    trainQueryText.append(query)
    trainAnswers.append(answer)

In [22]:
# Transforms each word in the sentences to a sequence of integers.
trainStorySeq = tokenizer.texts_to_sequences(trainStoryText)

In [23]:
def vectorizeStories(data, word_index = tokenizer.word_index, maxStoryLen = longestStory, maxQueryLen = longestQuery):
    """
    Vectorizes stories, queries, & answers into padded sequences. 
   
    Parameters: 
        data: All the data (Stories, Queries, Answers)
        word_index: A word index dictionary. Defaulted to our tokenizer.word_index
                    Can be overridden for other datasets or other sets of questions
        maxStoryLen: Length of the longest story (Will be used for the pad_sequences function)
        maxQueryLen: Length of the longest query (Will be used for the pad_sequences function)
        
        We need the maximum story and query lengths because we use padded sequences: not every story or query
        is the same length, and the RNN we train needs all inputs to be the same length.
        We pad the inputs with 0s when a story or query is too short, and truncate it when it is too long.
        
    Returns: 
        this (tuple): A tuple of the form (X, Q, A) (padded based on max lengths)    
    """
    X = [] # X := Stories
    Q = [] # Q := Queries
    A = [] # A := Answers (yes/no)
    
    for story, query, answer in data:
        # Convert the raw words into integers through a word index value
        
        # Grabs the word index for every word in story
        # [9, 34, ...]
        x = [word_index[word.lower()] for word in story]
        # Grabs the word index for every word in query
        q = [word_index[word.lower()] for word in query]
        
        # Index 0 is reserved since we are using pad sequences, so we add + 1
        a = np.zeros(len(word_index) + 1)
        
        # a is a vector of zeros; setting the entry at the answer's word index to 1 one-hot encodes the answer (yes/no)
        a[word_index[answer]] = 1
        
        # We now append each set to their appropriate output list.
        X.append(x)
        Q.append(q)
        A.append(a)
    
    # Now that we have converted the words to numbers, we pad the sequences so they are all of equal length.
    X_padded_seqs = pad_sequences(X, maxlen = maxStoryLen)
    Q_padded_seqs = pad_sequences(Q, maxlen= maxQueryLen)
    answers = np.array(A)
    
    # Now that the sequences are padded based on their max length, the RNN can be trained on uniformly long sequences.
    # Returns tuple for unpacking. 
    return (X_padded_seqs, Q_padded_seqs, answers)
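The answer encoding inside the function is a one-hot vector over the whole word index. Here is a minimal sketch with a made-up three-word index, using plain lists instead of NumPy so that it runs without dependencies. `one_hot_answer` and `toy_word_index` are hypothetical names.

```python
toy_word_index = {"no": 1, "yes": 2, "kitchen": 3}

def one_hot_answer(answer, word_index):
    """Hypothetical helper: vector of zeros with a 1 at the answer's word id."""
    vector = [0.0] * (len(word_index) + 1)  # slot 0 stays reserved for padding
    vector[word_index[answer]] = 1.0
    return vector

print(one_hot_answer("yes", toy_word_index))  # [0.0, 0.0, 1.0, 0.0]
print(one_hot_answer("no", toy_word_index))   # [0.0, 1.0, 0.0, 0.0]
```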

In [24]:
inputsTrain, queriesTrain, answersTrain = vectorizeStories(train_data)
inputsTest, queriesTest, answersTest = vectorizeStories(test_data)

In [25]:
print(inputsTrain)

[[ 0  0  0 ... 34 28 19]
 [ 0  0  0 ... 34 21 19]
 [ 0  0  0 ... 34  8 19]
 ...
 [ 0  0  0 ... 34 28 19]
 [ 0  0  0 ... 25 37 19]
 [ 0  0  0 ... 12 37 19]]


In [26]:
print(inputsTest)

[[ 0  0  0 ... 34 28 19]
 [ 0  0  0 ... 34 26 19]
 [ 0  0  0 ... 34 26 19]
 ...
 [ 0  0  0 ... 34 12 19]
 [ 0  0  0 ... 34 26 19]
 [ 0  0  0 ... 12 37 19]]


In [27]:
print(answersTest)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [28]:
tokenizer.word_index['yes']

30

In [29]:
tokenizer.word_index['no']

16

In [30]:
sum(answersTest)

array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0., 503.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0., 497.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.])

We can see that we have 497 'yes' answers at index location 30 and 503 'no' answers at index location 16, matching the word indices of 'yes' and 'no' printed above.

Our stories, queries, and answers are now successfully vectorized.
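Reading the counts off the raw array is error-prone, so the column sums can be mapped back to words instead. This is a small sketch on made-up one-hot rows; `toy_word_index` and `toy_answers` are hypothetical stand-ins for `tokenizer.word_index` and `answersTest`.

```python
toy_word_index = {"no": 1, "yes": 2}
toy_answers = [
    [0.0, 0.0, 1.0],  # yes
    [0.0, 1.0, 0.0],  # no
    [0.0, 0.0, 1.0],  # yes
]

# Column-wise sum, the same quantity that sum(answersTest) computes
column_totals = [sum(column) for column in zip(*toy_answers)]

# Invert the word index so each column position maps back to its word
index_to_word = {index: word for word, index in toy_word_index.items()}
answer_counts = {
    index_to_word[i]: int(total)
    for i, total in enumerate(column_totals)
    if total > 0
}
print(answer_counts)  # {'no': 1, 'yes': 2}
```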

## Create the model

In [31]:
# We have 2 inputs: The stories and questions
# We'll need placeholders for them, so we will use `Input()` to instantiate a Keras tensor.

In [33]:
# shape = (longestStory,) excludes the batch size; the tensor itself has shape (batch size, longestStory)
inputSequence = Input((longestStory, ))
query = Input((longestQuery, ))

## Build the Neural Network 

We will build three encoders:
* Input Encoder C
* Input Encoder M
* Question Encoder

We are following this model from the paper:
![PaperModel](..\PaperModel.png)
*Figure 1: (a): A single layer version of the model. (b): A three layer version of the model.*
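As I read the paper, the single-layer model in the figure is an attention step over the story: the embedded question u is compared with each input memory vector m_i, the scores go through a softmax to give weights p_i, and the output o is the weighted sum of the output memory vectors c_i. Presumably Input Encoder M produces the m_i, Input Encoder C the c_i, and the Question Encoder u. The sketch below shows only that computation, with made-up 2-dimensional vectors and hypothetical helper names; it is not the Keras model itself.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up embeddings for a story with three memory slots
memory_m = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # input memory vectors m_i
memory_c = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]  # output memory vectors c_i
u = [0.0, 3.0]                                   # embedded question

# p_i = softmax(u . m_i): how strongly the question matches each memory slot
p = softmax([dot_product(u, m) for m in memory_m])

# o = sum_i p_i * c_i: attention-weighted sum of the output vectors
o = [sum(p_i * c[d] for p_i, c in zip(p, memory_c)) for d in range(len(u))]

print(p)
print(o)
```

The second and third memory slots score highest here because they overlap with the question vector, so they dominate the output.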

### Input Encoder C

In [95]:
# This input gets embedded into a sequence of vectors
inputEncoderC = Sequential()
# Add 2 layers to it: an Embedding layer and a Dropout layer
inputEncoderC.add(Embedding(input_dim = vocab_size, output_dim = longestQuery))
# Randomly turns off a fraction (here 30%) of the units during training, which helps against overfitting.
# The dropout rate can be increased, together with longer training, if wanted
inputEncoderC.add(Dropout(0.3))

# This encoder will output:
# output: (samples, stories max len, longestQuery)
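To see where the three-dimensional output shape comes from, it helps to think of an embedding layer as a lookup table with one row per word id. Below is a minimal plain-Python sketch with random weights; `make_embedding_table` and `embed` are hypothetical helpers and only imitate the shape behaviour, not Keras's training.

```python
import random

def make_embedding_table(vocab_size, output_dim, seed=0):
    """Hypothetical stand-in for an embedding layer's weight matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(vocab_size)]

def embed(batch, table):
    """Replace every integer id with its row of the table."""
    return [[table[token_id] for token_id in sequence] for sequence in batch]

toy_table = make_embedding_table(vocab_size=5, output_dim=3)
toy_batch = [[0, 0, 4, 2], [0, 1, 3, 3]]  # 2 samples, each padded to length 4

embedded = embed(toy_batch, toy_table)
# (samples, sequence length, output_dim)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 4 3
```

Identical ids always map to the same vector, so the two padding zeros in the first sample share one row of the table.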

### Input Encoder M

In [92]:
# This input gets embedded into a sequence of vectors
inputEncoderM = Sequential()
# Add 2 layers to it: an Embedding layer and a Dropout layer
inputEncoderM.add(Embedding(input_dim = vocab_size, output_dim = 64))
# Randomly turns off a fraction (here 30%) of the units during training, which helps against overfitting.
# The dropout rate can be increased, together with longer training, if wanted
inputEncoderM.add(Dropout(0.3))

# This encoder will output:
# output: (samples, stories max len, embedding dim)
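For readers unfamiliar with dropout, here is a rough plain-Python sketch of the idea. It assumes the "inverted dropout" convention, which I believe Keras follows: values that survive are scaled up by 1 / (1 - rate), and nothing is dropped outside of training. The `dropout` function here is a hypothetical helper, not the Keras layer.

```python
import random

def dropout(values, rate, training=True, seed=None):
    """Hypothetical helper: zero each value with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(values)
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - rate)  # keeps the expected sum unchanged
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

activations = [1.0] * 10
print(dropout(activations, rate=0.3, seed=42))         # some zeros, the rest scaled up
print(dropout(activations, rate=0.3, training=False))  # unchanged at inference time
```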

### Question Encoder

In [96]:
# This input gets embedded into a sequence of vectors. The paper states: "The query q is also embedded (again, in the
# simplest case via another embedding matrix B with the same dimensions as A) to obtain an internal state u."
# So the output dimension will match our Input Encoder M.
questionEncoder = Sequential()
# Add 2 layers to it: an Embedding layer and a Dropout layer
# input_length is the query length, since this encoder receives the questions
questionEncoder.add(Embedding(input_dim = vocab_size, output_dim = 64, input_length = longestQuery))
# Randomly turns off a fraction (here 30%) of the units during training, which helps against overfitting.
# The dropout rate can be increased, together with longer training, if wanted
questionEncoder.add(Dropout(0.3))

# This encoder will output:
# output: (samples, longestQuery, embedding dim)