
# Question and Answer Chat Bots

## Loading the Data

We will be working with the Babi Data Set from Facebook Research.

Full Details: https://research.fb.com/downloads/babi/

- Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush,
  "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks",
  http://arxiv.org/abs/1502.05698


In [0]:
import pickle
import numpy as np

In [0]:
with open("train_qa.txt", "rb") as fp:   # Unpickling
    train_data =  pickle.load(fp)

In [0]:
with open("test_qa.txt", "rb") as fp:   # Unpickling
    test_data =  pickle.load(fp)

----

### Tip: It may be a good idea to explore the dataset!

Below is just a sample of what you can do:

In [4]:
type(test_data)

list

In [5]:
len(train_data)

10000

In [6]:
train_data[0]

(['Mary',
  'moved',
  'to',
  'the',
  'bathroom',
  '.',
  'Sandra',
  'journeyed',
  'to',
  'the',
  'bedroom',
  '.'],
 ['Is', 'Sandra', 'in', 'the', 'hallway', '?'],
 'no')

-----

## Setting up Vocabulary of All Words

In [0]:
# Create a set that holds the vocab words
vocab = set()

In [0]:
all_data = test_data + train_data

In [0]:
for story, question , answer in all_data:
    # In case you don't know what a union of sets is:
    # https://www.programiz.com/python-programming/methods/set/union
    vocab = vocab.union(set(story))
    vocab = vocab.union(set(question))

In [0]:
# Include any other words in the bot's vocabulary
vocab.add('no')
vocab.add('yes')

In [11]:
vocab

{'.',
 '?',
 'Daniel',
 'Is',
 'John',
 'Mary',
 'Sandra',
 'apple',
 'back',
 'bathroom',
 'bedroom',
 'discarded',
 'down',
 'dropped',
 'football',
 'garden',
 'got',
 'grabbed',
 'hallway',
 'in',
 'journeyed',
 'kitchen',
 'left',
 'milk',
 'moved',
 'no',
 'office',
 'picked',
 'put',
 'the',
 'there',
 'to',
 'took',
 'travelled',
 'up',
 'went',
 'yes'}

In [0]:
vocab_len = len(vocab) + 1 #we add an extra space to hold a 0 for Keras's pad_sequences

In [0]:
max_story_len = max([len(data[0]) for data in all_data])

In [14]:
max_story_len

156

In [0]:
max_question_len = max([len(data[1]) for data in all_data])

In [16]:
max_question_len

6

## Vectorizing the Data

In [17]:
vocab

{'.',
 '?',
 'Daniel',
 'Is',
 'John',
 'Mary',
 'Sandra',
 'apple',
 'back',
 'bathroom',
 'bedroom',
 'discarded',
 'down',
 'dropped',
 'football',
 'garden',
 'got',
 'grabbed',
 'hallway',
 'in',
 'journeyed',
 'kitchen',
 'left',
 'milk',
 'moved',
 'no',
 'office',
 'picked',
 'put',
 'the',
 'there',
 'to',
 'took',
 'travelled',
 'up',
 'went',
 'yes'}

In [0]:
# Reserve 0 for pad_sequences
vocab_size = len(vocab) + 1

-----------

In [19]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [0]:
# integer encode sequences of words
tokenizer = Tokenizer(filters=[])

# TODO: Fit tokenizer on text
tokenizer.fit_on_texts(vocab)

In [21]:
tokenizer.word_index

{'.': 37,
 '?': 7,
 'apple': 27,
 'back': 3,
 'bathroom': 28,
 'bedroom': 31,
 'daniel': 6,
 'discarded': 15,
 'down': 10,
 'dropped': 4,
 'football': 34,
 'garden': 9,
 'got': 12,
 'grabbed': 1,
 'hallway': 23,
 'in': 18,
 'is': 8,
 'john': 33,
 'journeyed': 25,
 'kitchen': 19,
 'left': 13,
 'mary': 21,
 'milk': 35,
 'moved': 32,
 'no': 2,
 'office': 26,
 'picked': 30,
 'put': 20,
 'sandra': 36,
 'the': 11,
 'there': 16,
 'to': 5,
 'took': 29,
 'travelled': 14,
 'up': 24,
 'went': 22,
 'yes': 17}

In [0]:
train_story_text = []
train_question_text = []
train_answers = []

# TODO: Fill the story, question, and answers list
for story,question,answer in train_data:
    train_story_text.append(story)
    train_question_text.append(question) 
    train_answers.append(answer)

In [0]:
# TODO: Vectorize into word sequences.
train_story_seq = tokenizer.texts_to_sequences(train_story_text)

In [24]:
len(train_story_text)

10000

In [25]:
len(train_story_seq)

10000

### Functionalize Vectorization

In [0]:
def vectorize_stories(data, 
                      word_index=tokenizer.word_index, 
                      max_story_len=max_story_len,
                      max_question_len=max_question_len):
    '''
    INPUT: 
    
    data: consisting of Stories,Queries,and Answers
    word_index: word index dictionary from tokenizer
    max_story_len: the length of the longest story (used for pad_sequences function)
    max_question_len: length of the longest question (used for pad_sequences function)


    OUTPUT:
    
    Vectorizes the stories,questions, and answers into padded sequences. We first loop for every story, query , and
    answer in the data. Then we convert the raw words to an word index value. Then we append each set to their appropriate
    output list. Then once we have converted the words to numbers, we pad the sequences so they are all of equal length.
    
    Returns this in the form of a tuple (X,Xq,Y) (padded based on max lengths)
    '''
    
    
    # X = STORIES
    X = []
    # Xq = QUERY/QUESTION
    Xq = []
    # Y = CORRECT ANSWER
    Y = []
    
    
    for story, query, answer in data:
        x = []
        xq = []
        y = []

        # TODO: Store every word from story into a list
        # TODO: Store every word from query into a list
        
        # TODO: One-hot encode the label into a list
        
        x = [word_index[word.lower()] for word in story]
        
        xq = [word_index[word.lower()] for word in question]
        
        y = np.zeros(len(word_index) + 1) #Index 0 Reserved when padding the sequences
        y[word_index[answer]] = 1
        # Append each set of story,query, and answer to their respective holding lists
        X.append(x)
        Xq.append(xq)
        Y.append(y)
        
    # RETURN TUPLE of paded, uniform sequences FOR UNPACKING
    return (pad_sequences(X, maxlen=max_story_len),pad_sequences(Xq, maxlen=max_question_len), np.array(Y))

In [0]:
inputs_train, queries_train, answers_train = vectorize_stories(train_data)

In [0]:
inputs_test, queries_test, answers_test = vectorize_stories(test_data)

In [29]:
inputs_test

array([[ 0,  0,  0, ..., 11, 31, 37],
       [ 0,  0,  0, ..., 11,  9, 37],
       [ 0,  0,  0, ..., 11,  9, 37],
       ...,
       [ 0,  0,  0, ..., 11, 27, 37],
       [ 0,  0,  0, ..., 11,  9, 37],
       [ 0,  0,  0, ..., 27, 16, 37]], dtype=int32)

In [30]:
queries_test

array([[ 8, 21, 18, 11, 31,  7],
       [ 8, 21, 18, 11, 31,  7],
       [ 8, 21, 18, 11, 31,  7],
       ...,
       [ 8, 21, 18, 11, 31,  7],
       [ 8, 21, 18, 11, 31,  7],
       [ 8, 21, 18, 11, 31,  7]], dtype=int32)

In [31]:
answers_test

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [32]:
sum(answers_test)

array([  0.,   0., 503.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0., 497.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.])

In [33]:
tokenizer.word_index['yes']

17

In [34]:
tokenizer.word_index['no']

2

## Creating the Model

In [0]:
from keras.models import Sequential, Model
from keras.layers.embeddings import Embedding
from keras.layers import Input, Activation, Dense, Permute, Dropout
from keras.layers import add, dot, concatenate
from keras.layers import LSTM

### Placeholders for Inputs

Recall we technically have two inputs, stories and questions. So we need to use placeholders. `Input()` is used to instantiate a Keras tensor.


In [36]:
input_sequence = Input((max_story_len,))
question = Input((max_question_len,))





### Building the Networks

To understand why we chose this setup, make sure to read the paper we are using:

* Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus,
  "End-To-End Memory Networks",
  http://arxiv.org/abs/1503.08895

## Encoders

The input to your neural network for an NLP task requires you to setup an Embedding layer which creates word embedding for you(aka word2vec).

Also it would be a good idea to experiment with different hyperparameters like using/not using dropout layers, learning rate etc




### Input Encoder m

In [37]:
# Input gets embedded to a sequence of vectors
input_encoder_m = Sequential()
input_encoder_m.add(Embedding(input_dim=vocab_size,output_dim=64))

# Optional: Create any additional layers for neural network




### Input Encoder c

In [0]:
# Embed the input into a sequence of vectors of size query_maxlen
input_encoder_c = Sequential()
input_encoder_c.add(Embedding(input_dim=vocab_size,output_dim=max_question_len))

# Optional: Create any additional layers for neural network

### Question Encoder

In [0]:
# embed the question into a sequence of vectors
question_encoder = Sequential()
question_encoder.add(Embedding(input_dim=vocab_size,
                               output_dim=64,
                               input_length=max_question_len))

# Optional: Create any additional layers for neural network

### Encode the Sequences

In [0]:
# TODO: Encode input sequence and questions (which are indices)
# to sequences of dense vectors
input_encoded_m = input_encoder_m(input_sequence)
input_encoded_c = input_encoder_c(input_sequence)
question_encoded = question_encoder(question)

##### Use dot product to compute the match between first input vector seq and the query

In [0]:
# shape: `(samples, story_maxlen, query_maxlen)`
match = dot([input_encoded_m,question_encoded], axes = (2,2))
match = Activation('softmax')(match)

#### Add this match matrix with the second input vector sequence

In [0]:
# Add the match matrix with the second input vector sequence
response = add([match,input_encoded_c])
response = Permute((2,1))(response) #Permute Layer: permutes dimensions of input

#### Concatenate

In [0]:
# Concatenate the match matrix with the question vector sequence
answer = concatenate([response, question_encoded])

In [44]:
answer

<tf.Tensor 'concatenate_1/concat:0' shape=(?, 6, 220) dtype=float32>

In [0]:
answer = LSTM(32)(answer)

In [46]:
# Regularization with Dropout
answer = Dropout(0.5)(answer)
answer = Dense(vocab_size)(answer)  # (samples, vocab_size)


Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [47]:
# TODO :Output a probability distribution over the vocabulary
answer = Activation('softmax')(answer)
#Build the model
model = Model([input_sequence,question], answer)

model.compile(optimizer='rmsprop', loss = 'categorical_crossentropy', metrics = ['accuracy'])





In [48]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 156)          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 6)            0                                            
__________________________________________________________________________________________________
sequential_1 (Sequential)       multiple             2432        input_1[0][0]                    
__________________________________________________________________________________________________
sequential_3 (Sequential)       (None, 6, 64)        2432        input_2[0][0]                    
____________________________________________________________________________________________

In [49]:
# TODO: Train Model
epochs = 100
batch_size=32
history = model.fit([inputs_train,queries_train],answers_train, batch_size = batch_size, epochs = epochs, validation_data = ([inputs_test,queries_test],answers_test))

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Train on 10000 samples, validate on 1000 samples
Epoch 1/100





Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100


### Saving the Model

In [0]:
filename = 'babi_chatbot_100_epochs.h5'
model.save(filename)

## Evaluating the Model

### Plotting Out Training History

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline
# Plot out training history here

### Evaluating on Given Test Set

In [0]:
model.load_weights(filename)

# TODO: Predict with the model
pred_results = model.predict(([inputs_test,queries_test]))

In [53]:
test_data[0][0]

['Mary',
 'got',
 'the',
 'milk',
 'there',
 '.',
 'John',
 'moved',
 'to',
 'the',
 'bedroom',
 '.']

In [54]:
story =' '.join(word for word in test_data[0][0])
print(story)

Mary got the milk there . John moved to the bedroom .


In [55]:
query = ' '.join(word for word in test_data[0][1])
print(query)

Is John in the kitchen ?


In [56]:
print("True Test Answer from Data is:",test_data[0][2])

True Test Answer from Data is: no


In [57]:
#Generate prediction from model
val_max = np.argmax(pred_results[0])

for key, val in tokenizer.word_index.items():
    if val == val_max:
        k = key

print("Predicted answer is: ", k)
print("Probability of certainty was: ", pred_results[0][val_max])

Predicted answer is:  no
Probability of certainty was:  0.50597763


## Writing Your Own Stories and Questions

Remember you can only use words from the existing vocab

In [58]:
vocab

{'.',
 '?',
 'Daniel',
 'Is',
 'John',
 'Mary',
 'Sandra',
 'apple',
 'back',
 'bathroom',
 'bedroom',
 'discarded',
 'down',
 'dropped',
 'football',
 'garden',
 'got',
 'grabbed',
 'hallway',
 'in',
 'journeyed',
 'kitchen',
 'left',
 'milk',
 'moved',
 'no',
 'office',
 'picked',
 'put',
 'the',
 'there',
 'to',
 'took',
 'travelled',
 'up',
 'went',
 'yes'}

In [59]:
# Note the whitespace of the periods
my_story = "John left the kitchen . Sandra dropped the football in the garden ."
my_story.split()

['John',
 'left',
 'the',
 'kitchen',
 '.',
 'Sandra',
 'dropped',
 'the',
 'football',
 'in',
 'the',
 'garden',
 '.']

In [0]:
my_question = "Is the football in the garden ?"

In [61]:
my_question.split()

['Is', 'the', 'football', 'in', 'the', 'garden', '?']

In [0]:
mydata = [(my_story.split(),my_question.split(),'yes')]

In [66]:
my_story,my_ques,my_ans = vectorize_stories(mydata)

OperatorNotAllowedInGraphError: ignored

In [67]:
pred_results = model.predict(([ my_story, my_ques]))

NameError: ignored

In [0]:
#Generate prediction from model
val_max = np.argmax(pred_results[0])

for key, val in tokenizer.word_index.items():
    if val == val_max:
        k = key

print("Predicted answer is: ", k)
print("Probability of certainty was: ", pred_results[0][val_max])