___

___
# Question and Answer Chat Bots

----

The objective of this notebook is to build a simple chat bot that answers simple questions, using a model trained on the provided data. 

## Loading the Data

We will be working with the bAbI data set from **Facebook Research**.

Full Details: https://research.fb.com/downloads/babi/

- Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush,
  "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks",
  http://arxiv.org/abs/1502.05698


The data is stored in Pickle format (serialized Python objects), so a couple of imports are needed to load the training and testing data. 

In [None]:
import pickle
import numpy as np

If you are running this notebook on your own environment, the following two cells will load the needed files.   

If you are running on a cloud-based environment such as **IBM Watson Studio** you need to load the files from the **Cloud Object Storage**

In [None]:
with open("data/train_qa.txt", "rb") as fp:   # Unpickling
    train_data =  pickle.load(fp)

In [None]:
with open("data/test_qa.txt", "rb") as fp:   # Unpickling
    test_data =  pickle.load(fp)

The notebook in **Watson Studio** can insert auto-generated code to read `.csv` files. For any other type of file it will not generate the reading code, so you will most likely insert a `StreamingBody object` or a SparkSession setup instead. 

To upload a pickle file, select the file in the `Find and add data` side pane, and click **Insert to code** and **StreamingBody object**. This will add the credentials to access the **Cloud Object Storage** and create a StreamingBody object.

In [None]:
## Insert code here

Now, you have a `StreamingBody object` that is simply an HTTP response that the boto client returns.

Read the object into memory using the following command:

In [None]:
readrawdata = streaming_body_1.read()

Convert the object to `BytesIO` so that it can be read with pickle or any other reader. For example, an Excel file could be read with xlrd in the same way. 

Complete these steps: 

In [None]:
from io import BytesIO

test_data = pickle.load(BytesIO(readrawdata))

**REDO THE SAME FOR THE `train_qa.txt` pickle file to load into the `train_data` variable.**
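
The in-memory loading step can be tried without any cloud storage. The sketch below is only an illustration: `load_pickle_from_bytes` is a hypothetical helper name, and the bytes produced by `pickle.dumps` stand in for what `streaming_body_1.read()` would return.

```python
import pickle
from io import BytesIO

def load_pickle_from_bytes(raw_bytes):
    """Unpickle an object from raw bytes held in memory (hypothetical helper)."""
    return pickle.load(BytesIO(raw_bytes))

# Toy sample mimicking the (story, question, answer) layout of the data set
sample = [(['Mary', 'moved', 'to', 'the', 'bathroom', '.'],
           ['Is', 'Mary', 'in', 'the', 'bathroom', '?'],
           'yes')]

raw = pickle.dumps(sample)               # stands in for streaming_body.read()
restored = load_pickle_from_bytes(raw)
print(restored == sample)
```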

----

## Exploring the Format of the Data

Let's look at the type of the data and its length.

In [None]:
type(test_data)

In [None]:
type(train_data)

In [None]:
len(test_data)

In [None]:
len(train_data)

We have more or less a 10:1 ratio between training data and testing data. 
Let's take a look at training data. 

In [None]:
train_data[0]

Looking at the data we can see the main three components: 
+ The story: `['Mary','moved','to','the','bathroom','.','Sandra','journeyed','to','the','bedroom','.']` 
+ The question: `['Is', 'Sandra', 'in', 'the', 'hallway', '?']` 
+ The answer: `no` 

Use the `join` function to format it a bit nicer. 

In [None]:
' '.join(train_data[0][0])

In [None]:
' '.join(train_data[0][1])

In [None]:
train_data[0][2]
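
To look at several samples at once, the tuples can be unpacked in a loop. The sketch below uses three made-up samples in the same layout (they are assumptions, not rows from the real data) and also counts the answers with `collections.Counter`.

```python
from collections import Counter

# Toy samples in the same (story, question, answer) layout as the data set
samples = [
    (['Mary', 'moved', 'to', 'the', 'bathroom', '.'],
     ['Is', 'Mary', 'in', 'the', 'bathroom', '?'], 'yes'),
    (['Sandra', 'journeyed', 'to', 'the', 'bedroom', '.'],
     ['Is', 'Sandra', 'in', 'the', 'hallway', '?'], 'no'),
    (['John', 'went', 'to', 'the', 'kitchen', '.'],
     ['Is', 'John', 'in', 'the', 'garden', '?'], 'no'),
]

# Unpack each tuple and print it as readable text
for story, question, answer in samples:
    print(' '.join(story), '|', ' '.join(question), '->', answer)

# Count how often each answer occurs
answer_counts = Counter(answer for _, _, answer in samples)
print(answer_counts)
```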

-----

## Setting up Vocabulary of All Words

To begin with, we will set up a **vocabulary** of all the words in our data set. To do this, we take both the training data and the test data into account. That guarantees that the model doesn't get confused at test time by words, such as new names, that did not show up in the training data. 

In [None]:
# Create a set that holds the vocab words
vocab = set()

Remember, the test data and the training data are just large lists of tuples, where each tuple holds three elements: `story`, `question` and `answer`. 

Concatenating the two gives one big list of tuples, so its length should be eleven thousand.

In [None]:
all_data = test_data + train_data

In [None]:
len(all_data)

We want a set of all the unique words. 
> A set in Python is an unordered collection of unique elements. 
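
As a small illustration of how sets drop duplicates, the sketch below builds a toy vocabulary from one made-up story and question. It shows both `union`, which returns a new set, and `update`, which changes the set in place.

```python
story = ['Mary', 'moved', 'to', 'the', 'bathroom', '.']
question = ['Is', 'Mary', 'in', 'the', 'bathroom', '?']

toy_vocab = set()
toy_vocab = toy_vocab.union(set(story))   # union returns a new set
toy_vocab.update(question)                # update modifies the set in place

# 'Mary', 'the' and 'bathroom' appear in both lists but are stored only once
print(sorted(toy_vocab))
print(len(toy_vocab))
```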

In [None]:
for story, question, answer in all_data:
    # In case you don't know what a union of sets is:
    # https://www.programiz.com/python-programming/methods/set/union
    vocab = vocab.union(set(story))
    vocab = vocab.union(set(question))

In [None]:
vocab.add('no')
vocab.add('yes')

In [None]:
vocab

In [None]:
vocab_len = len(vocab) + 1 #we add an extra space to hold a 0 for Keras's pad_sequences
print(vocab_len)

There are actually not that many words in our data, so we will be limited when testing our bot. 

How long is the longest story? This is going to be needed later on for padding our sequences.

In [None]:
max_story_len = max([len(data[0]) for data in all_data])

In [None]:
max_story_len

In [None]:
max_question_len = max([len(data[1]) for data in all_data])

In [None]:
max_question_len

## Vectorizing the Data

In this section we'll vectorize our data. 

In [None]:
vocab

In [None]:
# Reserve 0 for pad_sequences
vocab_size = len(vocab) + 1

-----------

Let's import from **Keras** some functions that will help us to tokenize our vocab entries. 

In [None]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

In [None]:
# integer encode sequences of words
tokenizer = Tokenizer(filters=[])
tokenizer.fit_on_texts(vocab)

In [None]:
tokenizer.word_index

After running the **Tokenizer**, we have our tokenized text. Basically, it mapped every single entry to a specific index number. 
> Notice as well that every word has been lowercased in the process.  

Next we'll vectorize the story, question and answer in a similar way. That is, we build lists that contain word indexes rather than the words themselves. 
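
The idea of a word index can be sketched in plain Python. This is a simplification and not how the Keras `Tokenizer` works internally: `build_word_index` is a hypothetical helper that sorts the words alphabetically, so the index numbers will differ from the ones Keras assigns.

```python
def build_word_index(words):
    """Map each lowercased word to an integer, starting at 1 (0 is left for padding)."""
    unique_words = sorted({word.lower() for word in words})
    return {word: idx for idx, word in enumerate(unique_words, start=1)}

toy_vocab = {'Mary', 'moved', 'to', 'the', 'bathroom', '.', 'Is', 'in', '?'}
toy_index = build_word_index(toy_vocab)

# Replace every word of a story by its index
story = ['Mary', 'moved', 'to', 'the', 'bathroom', '.']
story_seq = [toy_index[word.lower()] for word in story]

print(toy_index)
print(story_seq)
```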

In [None]:
train_story_text = []
train_question_text = []
train_answers = []

for story, question, answer in train_data:
    train_story_text.append(story)
    train_question_text.append(question)
    train_answers.append(answer)

In [None]:
train_story_seq = tokenizer.texts_to_sequences(train_story_text)

In [None]:
len(train_story_text)

In [None]:
len(train_story_seq)

In [None]:
#print(train_story_text[1])
#print(train_story_seq[1])

### Functionalize Vectorization

Here is a function that does all the preprocessing of vectorization for us based on input parameters. 
> Note that we use the max length variables to pad every story to the maximum story length and every question to the maximum question length (the answers are one-hot encoded, so they need no padding). The reason is that Keras works with inputs of the same length, so shorter sequences are padded with *zeros*. 
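
To make the padding concrete, here is a minimal plain-Python sketch. `pad_pre` is a hypothetical helper that mimics what we understand to be the default behaviour of `pad_sequences`: zeros are added at the front, and sequences that are too long lose their first elements.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad a sequence with `value` up to `maxlen` (assumes maxlen >= 1)."""
    seq = list(sequence)[-maxlen:]               # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq   # fill the front with padding

print(pad_pre([6, 7, 9], 5))
print(pad_pre([1, 2, 3, 4, 5, 6], 4))
```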

Take some time to read and understand the function below. 

In [None]:
def vectorize_stories(data, word_index=tokenizer.word_index, max_story_len=max_story_len,max_question_len=max_question_len):
    '''
    INPUT: 
    
    data: consisting of Stories,Queries,and Answers
    word_index: word index dictionary from tokenizer
    max_story_len: the length of the longest story (used for pad_sequences function)
    max_question_len: length of the longest question (used for pad_sequences function)


    OUTPUT:
    
    Vectorizes the stories,questions, and answers into padded sequences. We first loop for every story, query , and
    answer in the data. Then we convert the raw words to an word index value. Then we append each set to their appropriate
    output list. Then once we have converted the words to numbers, we pad the sequences so they are all of equal length.
    
    Returns this in the form of a tuple (X,Xq,Y) (padded based on max lengths)
    '''
    
    
    # X = STORIES
    X = []
    # Xq = QUERY/QUESTION
    Xq = []
    # Y = CORRECT ANSWER
    Y = []
    
    
    for story, query, answer in data:
        
        # Grab the word index for every word in story
        x = [word_index[word.lower()] for word in story]
        # Grab the word index for every word in query
        xq = [word_index[word.lower()] for word in query]
        
        # Grab the Answers (either Yes/No so we don't need to use list comprehension here)
        # Index 0 is reserved so we're going to use + 1
        y = np.zeros(len(word_index) + 1)
        
        # Now that y is all zeros and we know its just Yes/No , we can use numpy logic to create this assignment
        #
        y[word_index[answer]] = 1
        
        # Append each set of story, question, and answer to their respective holding lists
        X.append(x)
        Xq.append(xq)
        Y.append(y)
        
    # Finally, pad the sequences based on their max length so the RNN can be trained on uniformly long sequences.
        
    # RETURN TUPLE FOR UNPACKING
    return (pad_sequences(X, maxlen=max_story_len),pad_sequences(Xq, maxlen=max_question_len), np.array(Y))

Use the above function to vectorize our training and testing data sets. 

In [None]:
inputs_train, queries_train, answers_train = vectorize_stories(train_data)

In [None]:
inputs_test, queries_test, answers_test = vectorize_stories(test_data)

In [None]:
inputs_test

In [None]:
queries_test

In [None]:
answers_test

In [None]:
sum(answers_test)

In [None]:
tokenizer.word_index['yes']

In [None]:
tokenizer.word_index['no']

At this point we have successfully vectorized our stories, questions and answers. The data is now in the correct format for creating the model, which we'll do next using Keras layers. 
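
The answer encoding can be traced by hand on a toy example. In the sketch below, `one_hot_answer` and `toy_index` are made up for illustration. It also shows why summing the answer vectors, as done above with `sum(answers_test)`, gives the number of samples per answer.

```python
def one_hot_answer(answer, word_index):
    """Return a vector of zeros with a single 1 at the index of the answer."""
    vector = [0.0] * (len(word_index) + 1)   # index 0 is reserved for padding
    vector[word_index[answer]] = 1.0
    return vector

toy_index = {'no': 1, 'yes': 2, 'kitchen': 3}
encoded = one_hot_answer('yes', toy_index)
print(encoded)

# Summing the one-hot vectors column by column counts each answer
batch = [one_hot_answer(a, toy_index) for a in ['yes', 'no', 'no']]
counts = [sum(column) for column in zip(*batch)]
print(counts)
```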

## Creating the Model

> A quick note as a reminder. It is important that you read the paper provided in the resources since it's going to be fundamental to understanding how the network and the encoders work. 

We'll now start building the neural network. The diagram below shows the network we are going to produce, including the encoders and the LSTM (the RNN) that follows them, based on the paper. 

<img src='../Resources/OverallModel.png' width=600/>

In [None]:
from keras.models import Sequential, Model
from keras.layers import Embedding
from keras.layers import Input, Activation, Dense, Permute, Dropout
from keras.layers import add, dot, concatenate
from keras.layers import LSTM

### Placeholders for Inputs

Recall we technically have two inputs, stories and questions. So we need to use placeholders. `Input()` is used to instantiate a Keras tensor, and we need to pass in a shape which is going to be based on the max story length and the max question length.


In [None]:
input_sequence = Input((max_story_len,))
question = Input((max_question_len,))

### Building the Networks

To understand why we chose this setup, make sure to read the paper we are using:

* Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus,
  "End-To-End Memory Networks",
  http://arxiv.org/abs/1503.08895

## Encoders

### Input Encoder m

In [None]:
# Reserve 0 for pad_sequences
vocab_size = len(vocab) + 1

Let's create our input Encoder `m`, as referred to in the paper, with two layers:
1. The **Embedding layer** with an input dimension equal to our vocab size and an output dimension of **64** 
2. A **Dropout layer** which drops a percentage of neurons. In our case, 30 percent of the neurons are randomly turned off during training. This helps prevent **overfitting**.
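
Conceptually, an embedding is a lookup table with one row of weights per word index, and dropout randomly zeroes values during training. The plain-Python sketch below illustrates both ideas with toy sizes. The helper names are hypothetical, the weights are random rather than learned, and the rescaling of the kept values by `1 / (1 - rate)` reflects our understanding of how Keras applies dropout.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4   # toy sizes; the notebook uses 64 dimensions

# One row of weights per word index (in a real model these are learned)
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(sequence, table):
    """Replace every word index by its row in the embedding table."""
    return [table[idx] for idx in sequence]

def dropout(vector, rate, rng):
    """Zero each value with probability `rate` and rescale the values that are kept."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

padded_story = [0, 0, 6, 7, 9]
embedded = embed(padded_story, embedding_table)
print(len(embedded), len(embedded[0]))    # (sequence length, embedding_dim)

dropped = dropout(embedded[2], rate=0.3, rng=random.Random(1))
print(dropped)
```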

In [None]:
# Input gets embedded to a sequence of vectors
input_encoder_m = Sequential()
input_encoder_m.add(Embedding(input_dim=vocab_size,output_dim=64))
input_encoder_m.add(Dropout(0.3))

# This encoder will output:
# (samples, story_maxlen, embedding_dim)

### Input Encoder c

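We'll do pretty much the same for Encoder `c`, except for the **output dimension**: it is set to the maximum question length so that the output has the same shape as the match matrix we add it to later.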

In [None]:
# embed the input into a sequence of vectors of size query_maxlen
input_encoder_c = Sequential()
input_encoder_c.add(Embedding(input_dim=vocab_size,output_dim=max_question_len))
input_encoder_c.add(Dropout(0.3))
# output: (samples, story_maxlen, query_maxlen)

### Question Encoder

And we do the same for the question Encoder. 

In [None]:
# embed the question into a sequence of vectors
question_encoder = Sequential()
question_encoder.add(Embedding(input_dim=vocab_size,
                               output_dim=64,
                               input_length=max_question_len))
question_encoder.add(Dropout(0.3))
# output: (samples, query_maxlen, embedding_dim)

### Encode the Sequences

Now that we have the **input encoder m**, **input encoder c** and the **question encoder**, it's time to actually encode the sequences.

In [None]:
# encode input sequence and questions (which are indices)
# to sequences of dense vectors
input_encoded_m = input_encoder_m(input_sequence)
input_encoded_c = input_encoder_c(input_sequence)
question_encoded = question_encoder(question)

#### Use the dot product to compute the match between the first input vector sequence and the query (the encoded question)
Refer to the paper section 2.1. The `softmax` activation function gives the probability vector over the inputs. 
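
The match computation can be traced on tiny hand-made vectors. The sketch below uses plain Python lists instead of Keras tensors. All values are made up, and the softmax is taken over the last axis, which to our knowledge is the default of the Keras `softmax` activation.

```python
import math

def dot_product(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

story_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 story positions
question_vectors = [[1.0, 0.0], [0.5, 0.5]]            # 2 question positions

# One score per (story position, question position) pair
match = [[dot_product(s, q) for q in question_vectors] for s in story_vectors]
# Normalize each row, i.e. along the last axis
match_probs = [softmax(row) for row in match]

print(match)
print(match_probs)
```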

In [None]:
# shape: `(samples, story_maxlen, query_maxlen)`
match = dot([input_encoded_m, question_encoded], axes=(2, 2))
match = Activation('softmax')(match)

#### Add this match matrix with the second input vector sequence

In [None]:
# add the match matrix with the second input vector sequence
response = add([match, input_encoded_c])  # (samples, story_maxlen, query_maxlen)
response = Permute((2, 1))(response)  # (samples, query_maxlen, story_maxlen)

#### Concatenate

Now that we have a response, we can concatenate it with the question vector sequence.
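
It can help to follow the shapes through the add, permute and concatenate steps. The sketch below does this for a single sample with nested Python lists and toy sizes. The values are placeholders, and the batch dimension is left out.

```python
def shape(matrix):
    """Return (rows, columns) of a nested list."""
    return (len(matrix), len(matrix[0]))

story_len, query_len, embedding_dim = 5, 3, 4   # toy sizes

match = [[0.1] * query_len for _ in range(story_len)]                 # (story_len, query_len)
input_encoded_c = [[0.2] * query_len for _ in range(story_len)]       # (story_len, query_len)
question_encoded = [[0.3] * embedding_dim for _ in range(query_len)]  # (query_len, embedding_dim)

# Element-wise addition keeps the shape
response = [[m + c for m, c in zip(m_row, c_row)]
            for m_row, c_row in zip(match, input_encoded_c)]
# Permute((2, 1)) swaps the two non-batch axes, which is a transpose per sample
response = [list(column) for column in zip(*response)]                # (query_len, story_len)
# Concatenation joins the rows along the last axis
answer = [r_row + q_row for r_row, q_row in zip(response, question_encoded)]

print(shape(response))
print(shape(answer))
```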

In [None]:
# concatenate the match matrix with the question vector sequence
answer = concatenate([response, question_encoded])

In [None]:
answer

In [None]:
# Reduce with RNN (LSTM)
answer = LSTM(32)(answer)  # (samples, 32)

In [None]:
# Regularization with Dropout
answer = Dropout(0.5)(answer)
answer = Dense(vocab_size)(answer)  # (samples, vocab_size)

In [None]:
# we output a probability distribution over the vocabulary
answer = Activation('softmax')(answer)

# build the final model
model = Model([input_sequence, question], answer)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
model.summary()

## Training the model

We now have a defined model. All that is left is to train it, evaluate it on the given test data, and then write our own stories and questions to see how it performs. 

> Note that we use the training input stories and questions to train the model. We set the number of epochs to 120. Each epoch takes between 5 and 10 seconds, so the training is going to take some time. You can set the number of epochs to any value you like. 

You don't have to run the fitting for a large number of epochs if you don't want to. You can instead load the pre-trained model that we have already trained, saved and provided for you. 

In [None]:
# train
history = model.fit([inputs_train, queries_train], answers_train,batch_size=32,epochs=120,validation_data=([inputs_test, queries_test], answers_test))

### Saving the Model

In [None]:
filename = 'data/chatbot_120_epochs.h5'
model.save(filename)

## Evaluating the Model

### Plotting Out Training History

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
print(history.history.keys())
# Older Keras versions log 'acc'/'val_acc', newer ones 'accuracy'/'val_accuracy'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
# summarize history for accuracy
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Looking at the above plot, we can see that accuracy starts to improve drastically at about 15 epochs. You can play around with the plotting options to plot the loss or any other metric. 

### Evaluating on Given Test Set

In this step we are going to predict the answers using the `model.predict` function, passing it a list with two inputs:
1. the test inputs (stories)
2. the test queries (questions)  

There is no need to pass the answers, because they are exactly what we are trying to predict from the given test set.  
At this point you could also provide your own stories and questions and see the predicted answer. We'll do this step later on. 

In [None]:
model.load_weights(filename)
pred_results = model.predict(([inputs_test, queries_test]))

As a reminder, let's take another fresh look at our test data. 

In [None]:
test_data[0][0]

In [None]:
story =' '.join(word for word in test_data[0][0])
print(story)

In [None]:
query = ' '.join(word for word in test_data[0][1])
print(query)

In [None]:
print("True Test Answer from Data is:",test_data[0][2])

The output below shows the predicted probability for every index in the word index built from our vocabulary. The values for words like 'John' or 'milk' should be very low, since we don't expect them to be the correct answer; we expect a **yes** or a **no**.  
Looking at the probabilities, one word index has a probability close to 99%, which is most likely our **predicted answer**.  

In [None]:
pred_results[0]

Let's perform an `argmax` on the predicted results to retain only the index with the **max probability** and directly generate the value (yes/no) from it. 
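
The decoding step can be illustrated on its own with a made-up probability vector. In the sketch below, `toy_index` and `probabilities` are assumptions for illustration. Inverting the word index once avoids looping over all its items to find the matching word.

```python
toy_index = {'no': 1, 'yes': 2, 'kitchen': 3}
# Invert the mapping so that an index can be turned back into a word
index_to_word = {idx: word for word, idx in toy_index.items()}

probabilities = [0.001, 0.97, 0.02, 0.009]   # position 0 is the padding index

# argmax: position of the highest probability
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted = index_to_word.get(best_index, '<pad>')

print("Predicted answer is:", predicted)
print("Probability of certainty was:", probabilities[best_index])
```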

In [None]:
#Generate prediction from model
val_max = np.argmax(pred_results[0])

for key, val in tokenizer.word_index.items():
    if val == val_max:
        k = key

print("Predicted answer is: ", k)
print("Probability of certainty was: ", pred_results[0][val_max])

## Writing Your Own Stories and Questions

This is the cool part where we write our own stories and questions and see how the model performs.  

**Remember that you can only use words from the existing vocab, and that you need to use the correct formatting.**
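
Before vectorizing your own text, it can save time to check for words the model has never seen. The sketch below is one possible check: `find_unknown_words` is a hypothetical helper, and `toy_vocab` stands in for the lowercased keys of `tokenizer.word_index`.

```python
def find_unknown_words(text, known_words):
    """Return the whitespace-separated tokens of `text` that are not in `known_words`."""
    return [word for word in text.split() if word.lower() not in known_words]

toy_vocab = {'john', 'left', 'the', 'kitchen', '.', 'is', 'in', '?'}

good = find_unknown_words("John left the kitchen .", toy_vocab)
bad = find_unknown_words("John left the kitchen.", toy_vocab)   # no space before the period

print(good)
print(bad)
```

An empty list means every token is known. A token such as `kitchen.` shows up as unknown because the period was not separated by whitespace.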

In [None]:
vocab

In [None]:
# Note the whitespace of the periods
my_story = "John left the kitchen . Sandra dropped the football in the garden ."
my_story.split()

In [None]:
my_question = "Is the football in the garden ?"

In [None]:
my_question.split()

Let's format our own story and question the same way as the data we trained the model on: use the `split()` function, wrap everything in a list holding a single tuple, and **vectorize our inputs**. We also provide the correct answer, which is **yes** because Sandra actually dropped the football in the garden.  
> Note: although we provide the answer for formatting purposes, we'll let the model predict it for us. 

In [None]:
mydata = [(my_story.split(),my_question.split(),'yes')]

In [None]:
my_story,my_ques,my_ans = vectorize_stories(mydata)

In [None]:
pred_results = model.predict(([ my_story, my_ques]))

In [None]:
#Generate prediction from model
val_max = np.argmax(pred_results[0])

for key, val in tokenizer.word_index.items():
    if val == val_max:
        k = key

print(my_question)
print("Predicted answer is: ", k)
print("Probability of certainty was: ", pred_results[0][val_max])

# Great Job!

You have completed this lab. I hope you enjoyed it and understood the approach. As an optional exercise, you can provide a series of additional stories and questions and see what the model predicts.  

You can change the question and re-run the cell to see the changes in the predicted answer. 

In [None]:
#Provide a story
my_story2 = "Daniel went to the office . Mary left the apple in the bedroom ."
my_story2.split()
#Provide a question
my_question2 = "Is the apple in the kitchen ?"
my_question2.split()
mydata2 = [(my_story2.split(),my_question2.split(),'yes')]
my_story2,my_ques2,my_ans2 = vectorize_stories(mydata2)
pred_results = model.predict(([ my_story2, my_ques2]))

#Generate prediction from model
val_max = np.argmax(pred_results[0])

for key, val in tokenizer.word_index.items():
    if val == val_max:
        k = key

print(my_question2)
print("Predicted answer is: ", k)
print("Probability of certainty was: ", pred_results[0][val_max])