## Natural Language Question Answering

### Anna Bethke, Andy Keller
### Originally Presented at Intel AI DevCon 2018

This notebook gives an overview of an End-to-End Memory Network for goal oriented dialogue, implemented in ngraph.

Goal oriented dialogue is a subset of open-domain dialogue where an automated agent has a specific
goal for the outcome of the interaction. At a high level, the system needs to understand a user
request and complete a related task with a clear goal within a limited number of dialog turns.
This task could be making a restaurant reservation, placing an order, setting a timer, or any of the many tasks handled by digital personal assistants.

End-to-End Memory Networks are generic semi-recurrent neural networks which allow for a bank of
external memories to be read from and used during execution. They can be used in place of traditional
slot-filling algorithms to accomplish goal oriented dialogue tasks without the need for expensive
hand-labeled dialogue data. End-to-End Memory Networks have also been shown to be useful for
Question-Answering and information retrieval tasks.

This demonstration goes through a short training cycle for conducting full dialogues, followed by an interactive demonstration of what a fully trained model can do. Training a full model generally does not take long, but it may exceed the duration of this session. The implementation is based on the paper by A. Bordes, Y. Boureau, and J. Weston, [Learning End-to-End Goal-Oriented Dialog](https://arxiv.org/abs/1605.07683) (2016), and the GitHub repository [chatbot-MemN2N-tensorflow](https://github.com/vyraun/chatbot-MemN2N-tensorflow).

The model was trained and evaluated on the 6 bAbI Dialog tasks with the following results.

| Task | This implementation | Published | This implementation (w/ match-type) | Published (w/ match-type) |
|------|---------------------|-----------|-------------------------------------|---------------------------|
| 1    | 99.8   | 99.9      | 100.0                | 100.0                    |
| 2    | 100.0  | 100.0     | 100.0                | 98.3                     |
| 3    | 74.8   | 74.9      | 74.6                 | 74.9                     |
| 4    | 57.2   | 59.5      | 100.0                | 100.0                    |
| 5    | 96.4   | 96.1      | 95.6                 | 93.4                     |
| 6    | 48.1   | 41.1      | 45.4                 | 41.0                     |

![](https://camo.githubusercontent.com/ba1c7dbbccc5dd51d4a76cc6ef849bca65a9bf4d/687474703a2f2f692e696d6775722e636f6d2f6e7638394a4c632e706e67)


![](https://i.imgur.com/5pQJqjM.png)

Our first step is to import the necessary libraries. For simplicity's sake, this tutorial uses some code from the NLP Architect GitHub repository. The code presented in this demonstration notebook is also included in that repository.

NLP Architect is a repository of models exploring state-of-the-art deep learning techniques for natural language processing and natural language understanding. It is intended to be a platform for future research and collaboration.

The library includes our past and ongoing NLP research efforts as part of Intel AI Lab. For more information on the library, please see the documentation at https://intellabs.github.io/nlp-architect, or the GitHub repository at https://github.com/IntelLabs/nlp-architect.

In [None]:
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import absolute_import
from builtins import input
import numpy as np
from copy import copy
from functools import reduce
import itertools
from tqdm import tqdm

import ngraph as ng
from ngraph.frontends.neon import Layer
from ngraph.frontends.neon import make_bound_computation
from ngraph.frontends.neon import NgraphArgparser
from ngraph.frontends.neon import ArrayIterator
from ngraph.frontends.neon import Saver
from ngraph.frontends.neon import GaussianInit, Adam, GradientDescentMomentum, RMSProp
import ngraph.transformers as ngt

from nlp_architect.data.babi_dialog import BABI_Dialog
from nlp_architect.models.memn2n_dialogue import MemN2N_Dialog
from nlp_architect.contrib.ngraph.modified_lookup_table import ModifiedLookupTable
from contextlib import closing

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

from IPython.display import display
import os  # needed for the os.path.exists checks in the training loop
import time

### Setting up arguments and getting the data

To start, we set the parameters we will need throughout this demonstration. Feel free to change them. The most important thing to remember is that the interactive demo must be run with the same parameters that were used when the model was saved.

In [None]:
task = 5 # The task ID to train/test on from bAbI-dialog dataset (1-6)
emb_size = 32 # Size of the word-embedding used in the model.
nhops = 3 # Number of memory hops in the network
use_match_type = False # Use match type features
cache_match_type = False # Cache pre-processed match type answers
use_oov = False # Use OOV test set and for the interactive mode, allow out of vocabulary words
lr = 0.001 # learning rate
eps = 1e-8 # Epsilon used to avoid divide by zero in softmax renormalization
model_file = './memn2n_task5_weights_retrain.npz' # Where you want to store your model weights
data_dir = './data/' # The directory of the dataset
restore = False # Restore model weights if found
batch_size = 32 # Size of the batch during training
gradient_clip_norm = 40.0 # Clip gradients such that norm is below this value
cache_vectorized = False # Cache preprocessed/vectorized dataset
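
Because the interactive demo has to use the same parameters as the saved model, one option is to write them to a small JSON file next to the weights. The sketch below is only an illustration: `save_params`, `load_params`, and the `.json` suffix are hypothetical and are not part of NLP Architect.

```python
import json
import os
import tempfile


def save_params(params, weights_path):
    """Write the parameter dict to <weights_path>.json (hypothetical convention)."""
    params_path = weights_path + '.json'
    with open(params_path, 'w') as f:
        json.dump(params, f, indent=2, sort_keys=True)
    return params_path


def load_params(weights_path):
    """Read back the parameters stored by save_params."""
    with open(weights_path + '.json') as f:
        return json.load(f)


# A subset of the parameters from the cell above
params = {'task': 5, 'emb_size': 32, 'nhops': 3,
          'use_match_type': False, 'use_oov': False}

# A temporary directory is used here so the example has no side effects
weights_path = os.path.join(tempfile.mkdtemp(), 'memn2n_task5_weights_retrain.npz')
save_params(params, weights_path)
restored = load_params(weights_path)
```

Comparing `restored` against the current settings before restoring weights catches mismatches such as a different `emb_size` early.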

Before we dive into the memory network code, let's get to know our data a little better. The dataset used for training and evaluation belongs to the Facebook bAbI dialog tasks
(https://research.fb.com/downloads/babi/). The terms and conditions of the dataset license apply. Intel does not grant any rights to the data files. The dataset is downloaded automatically if it is not found,
and all preprocessing happens at the beginning of training.

There are six separate tasks. Tasks 1 through 5 are simulated conversations between a customer
and a restaurant booking bot (created by Facebook), and task 6 consists of more realistic natural language
restaurant booking conversations from the Dialog State Tracking Challenge 2 (DSTC2).

The six tasks are as follows:

- bAbI dialog dataset:
    - Task 1: Issuing API Calls
    - Task 2: Updating API Calls
    - Task 3: Displaying Options
    - Task 4: Providing Extra Information
    - Task 5: Conducting Full Dialogs

- Dialog State Tracking Challenge 2 Dataset:
    - Task 6: DSTC2 Full Dialogs
    
The BABI_Dialog class takes the dataset directory and a few parameters and prepares the data for the memory network task. Let's see what our data looks like before and after it goes through this class.

The core of the dataset is conversations, for example the first conversation for Task 5 (our focus today) is:

```
1 good morning	hello what can i help you with today
2 i'd like to book a table with italian food	i'm on it
3 <SILENCE>	where should it be
4 in paris	how many people would be in your party
5 for six people please	which price range are looking for
6 in a cheap price range please	ok let me look into some options for you
7 <SILENCE>	api_call italian paris six cheap
8 actually i would prefer for two people	sure is there anything else to update
9 instead could it be in madrid	sure is there anything else to update
10 instead could it be with spanish food	sure is there anything else to update
11 no	ok let me look into some options for you
12 <SILENCE>	api_call spanish madrid two cheap
13 resto_madrid_cheap_spanish_1stars R_phone resto_madrid_cheap_spanish_1stars_phone
14 resto_madrid_cheap_spanish_1stars R_cuisine spanish
15 resto_madrid_cheap_spanish_1stars R_address resto_madrid_cheap_spanish_1stars_address
16 resto_madrid_cheap_spanish_1stars R_location madrid
17 resto_madrid_cheap_spanish_1stars R_number two
18 resto_madrid_cheap_spanish_1stars R_price cheap
19 resto_madrid_cheap_spanish_1stars R_rating 1
20 resto_madrid_cheap_spanish_6stars R_phone resto_madrid_cheap_spanish_6stars_phone
21 resto_madrid_cheap_spanish_6stars R_cuisine spanish
22 resto_madrid_cheap_spanish_6stars R_address resto_madrid_cheap_spanish_6stars_address
23 resto_madrid_cheap_spanish_6stars R_location madrid
24 resto_madrid_cheap_spanish_6stars R_number two
25 resto_madrid_cheap_spanish_6stars R_price cheap
26 resto_madrid_cheap_spanish_6stars R_rating 6
27 resto_madrid_cheap_spanish_8stars R_phone resto_madrid_cheap_spanish_8stars_phone
28 resto_madrid_cheap_spanish_8stars R_cuisine spanish
29 resto_madrid_cheap_spanish_8stars R_address resto_madrid_cheap_spanish_8stars_address
30 resto_madrid_cheap_spanish_8stars R_location madrid
31 resto_madrid_cheap_spanish_8stars R_number two
32 resto_madrid_cheap_spanish_8stars R_price cheap
33 resto_madrid_cheap_spanish_8stars R_rating 8
34 <SILENCE>	what do you think of this option: resto_madrid_cheap_spanish_8stars
35 no this does not work for me	sure let me find an other option for you
36 <SILENCE>	what do you think of this option: resto_madrid_cheap_spanish_6stars
37 do you have something else	sure let me find an other option for you
38 <SILENCE>	what do you think of this option: resto_madrid_cheap_spanish_1stars
39 it's perfect	great let me do the reservation
40 may i have the phone number of the restaurant	here it is resto_madrid_cheap_spanish_1stars_phone
41 thanks	is there anything i can help you with
42 no thank you	you're welcome
```

It is a mixture of conversation turns between the user and the agent, API calls issued by the agent, and the resulting database responses.
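
To make the format concrete, here is a minimal parsing sketch. It assumes, as in the sample above, that each line starts with a turn number, that user and bot utterances are separated by a tab, and that lines without a tab are database results. `parse_dialog` is a hypothetical helper for illustration, not the parser used inside BABI_Dialog.

```python
def parse_dialog(lines):
    """Split one dialog into (user, bot) turns and knowledge-base facts."""
    turns, kb_facts = [], []
    for line in lines:
        # Drop the leading turn number, e.g. "4 in paris<TAB>how many ..."
        _, _, text = line.strip().partition(' ')
        if '\t' in text:
            user, bot = text.split('\t')
            turns.append((user, bot))
        else:
            # No tab: a database result such as "<entity> R_cuisine spanish"
            kb_facts.append(tuple(text.split(' ')))
    return turns, kb_facts


sample = [
    "1 good morning\thello what can i help you with today",
    "12 <SILENCE>\tapi_call spanish madrid two cheap",
    "14 resto_madrid_cheap_spanish_1stars R_cuisine spanish",
]
turns, kb_facts = parse_dialog(sample)
```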


In [None]:
# Now let's process the data
# This involves parsing the above into user/bot utterances
# Then encoding each word using an integer index into our vocabulary.
# This does take some time (5-10 minutes)
babi = BABI_Dialog(
    path=data_dir,
    task=task,
    oov=use_oov,
    use_match_type=use_match_type,
    cache_match_type=cache_match_type)

The data within the babi object is a dictionary split into train, dev, and test sets.

If we look at the training data, there are answers, user utterances, a memory mask, and the memory. Each data object carries axes information along with the actual data.
The data has been preprocessed into a bag-of-words representation. It is already fairly clean: everything is lowercase and there are no spelling or grammatical errors.

In [None]:
babi.data_dict['train'].keys()

In [None]:
babi.data_dict['train']['user_utt']

In [None]:
# So for example the second sentence "i'd like to book a table with italian food"  looks like:
babi.data_dict['train']['user_utt']['data'][1]

In [None]:
# The memory contains the conversation thus far,
# so after one back and forth with the agent, it looks like the following:
babi.data_dict['train']['memory']['data'][1]

We can see that all utterances are padded to the maximum sentence length with 0's, and the empty memory slots are filled with all-0 sentences. The embeddings for these padding values are explicitly forced to zero in our embedding layer and are never updated.
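
The encoding and padding can be sketched in a few lines of NumPy. This is an illustrative sketch only: `build_vocab` and `vectorize` are hypothetical helpers, and the real vocabulary ordering produced by BABI_Dialog may differ. It follows the convention described in the model docstring, where index 0 is reserved for padding and index 1 for out-of-vocabulary words.

```python
import numpy as np

PAD_IDX = 0  # reserved for padding
OOV_IDX = 1  # reserved for out-of-vocabulary words


def build_vocab(sentences):
    """Map each unique word to an integer index, starting after the reserved ones."""
    words = sorted({w for s in sentences for w in s.split()})
    return {w: i + 2 for i, w in enumerate(words)}


def vectorize(sentence, vocab, max_len):
    """Encode a sentence as word indices, padded with zeros up to max_len."""
    idxs = [vocab.get(w, OOV_IDX) for w in sentence.split()][:max_len]
    return np.array(idxs + [PAD_IDX] * (max_len - len(idxs)), dtype=np.int32)


sentences = ["good morning", "in paris", "for six people please"]
vocab = build_vocab(sentences)

# "tonight" is not in the vocabulary, so it maps to OOV_IDX
vec = vectorize("in paris please tonight", vocab, max_len=6)
```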

In [None]:
babi.data_dict['train']['memory_mask']['data'][1]

The 'memory_mask' is used to mask out empty memory slots after the softmax is applied during the computation of the scalar similarity values. This is a trick to ensure model correctness, and emulate a softmax over a dynamic memory size. 
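
The effect of masking after the softmax can be checked numerically. The NumPy sketch below is a simplified illustration (the helper names are assumptions, and it is not the ngraph code): it applies a softmax over all slots, zeroes the empty ones, renormalizes, and compares the result to a softmax taken over the non-empty slots only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def masked_attention(scores, mask, eps=1e-8):
    """Softmax over all slots, then zero out empty slots and renormalize."""
    probs_masked = softmax(scores, axis=-1) * mask
    return probs_masked / (probs_masked.sum(axis=-1, keepdims=True) + eps)


# One example with five memory slots, the last two of which are empty
scores = np.array([[2.0, 1.0, 0.5, 0.0, 0.0]])
mask = np.array([[1.0, 1.0, 1.0, 0.0, 0.0]])

attn = masked_attention(scores, mask)
reference = softmax(scores[:, :3])  # softmax over the three real memories only
```

Up to the small `eps` term, `attn[:, :3]` matches `reference`, which is why the mask emulates a softmax over a dynamic memory size.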

### Setting up for training

Now that we have the data, let's look at the model we will train on it.

The memory network heavily uses lookup tables - these are mappings from one object space to another. They are being used to map our vocab-size one-hot vectors to embedding-size dense vectors, allowing the model to decrease the input dimensionality and additionally learn how words are associated with each other. 
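
As a small NumPy illustration of this idea (a sketch with made-up sizes, not the ModifiedLookupTable implementation), multiplying one-hot vectors by an embedding matrix gives the same result as indexing rows of that matrix directly. The padding row is held at zero so padded positions contribute nothing to the sentence sum.

```python
import numpy as np

vocab_size, emb_size, pad_idx = 6, 4, 0  # toy sizes chosen for illustration
rng = np.random.RandomState(0)

# Embedding matrix: one dense row per vocabulary entry
W = rng.normal(0.0, 0.1, size=(vocab_size, emb_size))
W[pad_idx] = 0.0  # padding row is fixed at zero

sentence = np.array([3, 5, 2, 0, 0])  # word indices, padded with zeros

# One-hot route: [sentence_len, vocab_size] times [vocab_size, emb_size]
one_hot = np.eye(vocab_size)[sentence]
dense_via_matmul = np.dot(one_hot, W)

# Lookup route: select the rows directly
dense_via_lookup = W[sentence]

# Bag-of-words sentence vector: sum of the word embeddings
sentence_vec = dense_via_lookup.sum(axis=0)
```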

The class MemN2N_Dialog is the core of the memory network model. It starts with the various arguments that we will pass in when training is initialized.

In [None]:
class MemN2N_Dialog(Layer):
    """
    End-to-End Memory Networks for Goal Oriented Dialogue

    After the model is initialized, it accepts a BABI_Dialog class formatted dataset 
    as input and returns a probability distribution over candidate answers. 

    Args:
        cands (np.array): Vectorized array of potential candidate answers, encoded
            as integers, as returned by BABI_Dialog class. Shape = [num_cands, max_cand_length]
        num_cands (int): Number of potential candidate answers. 
        max_cand_len (int): Maximum length of a candidate answer sentence in number of words. 
        memory_size (int): Maximum number of sentences to keep in memory at any given time.
        max_utt_len (int): Maximum length of any given sentence / user utterance 
        vocab_size (int): Number of unique words in the vocabulary + 2 (0 is reserved for 
            a padding symbol, and 1 is reserved for OOV)
        emb_size (int): Dimensionality of word embeddings to use 
        batch_size (int): Number of training examples per batch 
        use_match_type (bool, optional): Flag to use match-type features
        kb_ents_to_type (dict, optional): For use with match-type features, dictionary of 
            entities found in the dataset mapping to their associated match-type
        kb_ents_to_cand_idxs (dict, optional): For use with match-type features, dictionary
            mapping from each entity in the knowledge base to the set of indices in the
            candidate_answers array that contain that entity.
        match_type_idxs (dict, optional): For use with match-type features, dictionary
            mapping from match-type to the associated fixed index of the candidate vector
            which indicates this match type.
        nhops (int, optional): Number of memory hops to perform during fprop
        eps (float, optional): Small epsilon used for numerical stability in
            softmax renormalization
        init (Initializer, optional): Initializer object used to initialize lookup table
            and projection layer.
    """
    def __init__(
        self,
        cands,
        num_cands,
        max_cand_len,
        memory_size,
        max_utt_len,
        vocab_size,
        emb_size,
        batch_size,
        use_match_type=False,
        kb_ents_to_type=None,
        kb_ents_to_cand_idxs=None,
        match_type_idxs=None,
        nhops=3,
        eps=1e-6,
        init=GaussianInit(
            mean=0.0,
            std=0.1)):
        super(MemN2N_Dialog, self).__init__()

        self.cands = cands
        self.memory_size = memory_size
        self.max_utt_len = max_utt_len
        self.vocab_size = vocab_size
        self.num_cands = num_cands
        self.max_cand_len = max_cand_len
        self.batch_size = batch_size
        self.use_match_type = use_match_type
        self.kb_ents_to_type = kb_ents_to_type
        self.kb_ents_to_cand_idxs = kb_ents_to_cand_idxs
        self.match_type_idxs = match_type_idxs
        self.nhops = nhops
        self.eps = eps
        self.init = init

        # Make axes
        self.batch_axis = ng.make_axis(length=batch_size, name='N')
        self.sentence_rec_axis = ng.make_axis(length=max_utt_len, name='REC')
        self.memory_axis = ng.make_axis(length=memory_size, name='memory_axis')
        self.embedding_axis = ng.make_axis(length=emb_size, name='F')
        self.embedding_axis_proj = ng.make_axis(length=emb_size, name='F_proj')
        self.cand_axis = ng.make_axis(length=num_cands, name='cand_axis')
        self.cand_rec_axis = ng.make_axis(length=max_cand_len, name='REC')

        # Weight sharing of A's across all hops, input and output
        self.LUT_A = ModifiedLookupTable(
            vocab_size, emb_size, init, update=True, pad_idx=0)
        # Use lookuptable W to embed the candidate answers
        self.LUT_W = ModifiedLookupTable(
            vocab_size, emb_size, init, update=True, pad_idx=0)

        # Initialize projection matrix between internal model states
        self.R_proj = ng.variable(axes=[self.embedding_axis, self.embedding_axis_proj],
            initial_value=init)

        if not self.use_match_type:
            # Initialize constant matrix of all candidate answers
            self.cands_mat = ng.constant(self.cands, axes=[self.cand_axis, self.cand_rec_axis])

    def __call__(self, inputs):
        query = ng.cast_axes(inputs['user_utt'], [self.batch_axis, self.sentence_rec_axis])

        # Query embedding [batch, sentence_axis, F]
        q_emb = self.LUT_A(query)

        # Sum the embeddings
        u_0 = ng.sum(q_emb, reduction_axes=[self.sentence_rec_axis])

        # Start a list of the internal states of the model. Will be appended to
        # after each memory hop
        u = [u_0]

        for hopn in range(self.nhops):
            story = ng.cast_axes(inputs['memory'], [self.batch_axis, self.memory_axis, self.sentence_rec_axis])

            # Re-use the query embedding matrix to embed the memory sentences
            # [batch, memory_axis, sentence_axis, F]
            m_emb_A = self.LUT_A(story)
            m_A = ng.sum(m_emb_A, reduction_axes=[self.sentence_rec_axis])  # [batch, memory_axis, F]

            # Compute scalar similarity between internal state and each memory
            # Equivalent to dot product between u[-1] and each memory in m_A
            # [batch, memory_axis]
            dotted = ng.sum(u[-1] * m_A, reduction_axes=[self.embedding_axis])

            # [batch, memory_axis]
            probs = ng.softmax(dotted, self.memory_axis)

            # Renormalize probabilities according to non-empty memories
            probs_masked = probs * inputs['memory_mask']
            renorm_sum = ng.sum(probs_masked, reduction_axes=[self.memory_axis]) + self.eps
            probs_renorm = (probs_masked + self.eps) / renorm_sum

            # Compute weighted sum of memory embeddings
            o_k = ng.sum(probs_renorm * m_A, reduction_axes=[self.memory_axis])  # [batch, F]

            # Add the output back into the internal state and project
            u_k = ng.cast_axes(ng.dot(self.R_proj, o_k), [
                               self.embedding_axis, self.batch_axis]) + u[-1]  # [batch, F_proj]

            # Add new internal state
            u.append(u_k)

        if self.use_match_type:
            # [batch_axis, cand_axis, cand_rec_axis, F]
            self.cands_mat = inputs['cands_mat']

        # Embed all candidate responses using LUT_W
        # [<batch_axis>, cand_axis, cand_rec_axis, F]
        cand_emb_W = self.LUT_W(self.cands_mat)
        # No position encoding added yet
        cands_mat_emb = ng.sum(cand_emb_W, reduction_axes=[self.cand_rec_axis])  # [<batch_axis>, cand_axis, F]

        # Compute predicted answer from product of final internal state
        # and embedded candidate answers
        # same as: a_logits = ng.dot(cands_mat_emb, u[-1]) # [batch, cand_axis]
        a_logits = ng.sum(u[-1] * cands_mat_emb, reduction_axes=[self.embedding_axis])

        # rename V to cand_axis to match answer
        a_logits = ng.cast_axes(a_logits, [self.batch_axis, self.cand_axis])
        a_pred = ng.softmax(a_logits, self.cand_axis)

        return a_pred, probs_renorm
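
For readers less familiar with ngraph axes, the same forward pass can be sketched in plain NumPy. This is an approximation for illustration only: `memn2n_forward` is a hypothetical function, match-type features are omitted, and the orientation of the projection `R` is an assumption about how `ng.dot` contracts the shared axis.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def memn2n_forward(query, memory, memory_mask, cands, A, W, R, nhops=3, eps=1e-8):
    """query [B, T], memory [B, M, T], cands [C, Tc] hold word indices;
    memory_mask [B, M] is 1 for non-empty slots; A, W are [V, F]; R is [F, F]."""
    u = A[query].sum(axis=1)   # bag-of-words query state, [B, F]
    m = A[memory].sum(axis=2)  # bag-of-words memories, [B, M, F]
    for _ in range(nhops):
        scores = np.einsum('bf,bmf->bm', u, m)  # dot product with each memory
        probs = softmax(scores, axis=1) * memory_mask
        probs = (probs + eps) / (probs.sum(axis=1, keepdims=True) + eps)
        o = np.einsum('bm,bmf->bf', probs, m)   # attention-weighted memory sum
        u = o.dot(R) + u                        # project and add back to the state
    c = W[cands].sum(axis=1)                    # embedded candidates, [C, F]
    return softmax(u.dot(c.T), axis=1), probs   # [B, C], [B, M]


# Toy shapes chosen for illustration
rng = np.random.RandomState(0)
B, M, T, C, Tc, V, F = 2, 4, 5, 3, 6, 10, 8
A = rng.normal(0, 0.1, (V, F)); A[0] = 0  # index 0 is padding
W = rng.normal(0, 0.1, (V, F)); W[0] = 0
R = rng.normal(0, 0.1, (F, F))
query = rng.randint(1, V, (B, T))
memory = rng.randint(1, V, (B, M, T))
memory_mask = np.array([[1, 1, 0, 0], [1, 1, 1, 0]], dtype=float)
memory[memory_mask == 0] = 0              # empty slots are all-zero sentences
cands = rng.randint(1, V, (C, Tc))

a_pred, attention = memn2n_forward(query, memory, memory_mask, cands, A, W, R)
```

Each row of `a_pred` is a probability distribution over the candidate answers, and `attention` holds the final hop's weights over the memory slots.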

The next stage in our training process is to initialize the memory network model by calling MemN2N_Dialog(). This is a good place to modify hyperparameters and other settings such as the weight initialization. You can also choose a different optimizer. We are using Adam here, but the neon frontend of ngraph also includes GradientDescentMomentum and RMSProp. These are already imported in case you would like to experiment with them.

In [None]:
weights_save_path = model_file

weight_saver = Saver()

# Set num iterations to 1 epoch since we loop over epochs & shuffle
ndata = babi.data_dict['train']['memory']['data'].shape[0]
num_iterations = ndata // batch_size

train_set = ArrayIterator(babi.data_dict['train'], batch_size=batch_size,
                          total_iterations=num_iterations)
dev_set = ArrayIterator(babi.data_dict['dev'], batch_size=batch_size)
test_set = ArrayIterator(babi.data_dict['test'], batch_size=batch_size)
inputs = train_set.make_placeholders()

memn2n = MemN2N_Dialog(
    babi.cands,
    babi.num_cands,
    babi.max_cand_len,
    babi.memory_size,
    babi.max_utt_len,
    babi.vocab_size,
    emb_size,
    batch_size,
    use_match_type=use_match_type,
    kb_ents_to_type=babi.kb_ents_to_type,
    kb_ents_to_cand_idxs=babi.kb_ents_to_cand_idxs,
    match_type_idxs=babi.match_type_idxs,
    nhops=nhops,
    eps=eps,
    init=GaussianInit(
        mean=0.0,
        std=0.1))

# Compute answer predictions
a_pred, attention = memn2n(inputs)

# specify loss function, calculate loss and update weights
loss = ng.cross_entropy_multi(a_pred, inputs['answer'], usebits=True)

# Compute sum of losses
mean_cost = ng.sum(loss, out_axes=[])

# initialize optimizer and create update-op
optimizer = Adam(learning_rate=lr)
updates = optimizer(loss)
batch_cost = ng.sequential([updates, mean_cost])

# provide outputs for bound computation
train_outputs = dict(batch_cost=batch_cost, train_preds=a_pred)

# Create additional inference graphs to handle things like dropout automatically
with Layer.inference_mode_on():
    a_pred_inference, attention_inference = memn2n(inputs)
    eval_loss = ng.cross_entropy_multi(
        a_pred_inference, inputs['answer'], usebits=True)

# Create interactive and eval outputs
interactive_outputs = dict(
    test_preds=a_pred_inference,
    attention=attention_inference)
eval_outputs = dict(test_cross_ent_loss=eval_loss, test_preds=a_pred_inference)

Our final stage is to set up and run the training loop. This cell creates the computation objects, restores any previously saved weights, and then trains the memory network for each epoch. At the end of each training epoch, the validation set is evaluated to track how well the network is training.
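
The error tracked in the loop is the fraction of examples in a batch whose highest-probability candidate differs from the labeled answer. A standalone NumPy sketch of that metric, with a hypothetical `batch_error` helper and made-up values, looks like this:

```python
import numpy as np


def batch_error(answer_one_hot, pred_probs):
    """Fraction of examples where the argmax prediction misses the one-hot answer."""
    target = answer_one_hot.argmax(axis=1)
    preds = pred_probs.argmax(axis=1)
    return np.mean(target != preds)


# Four examples, three candidate answers each
answers = np.array([[0, 1, 0],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0]])
probs = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7],
                  [0.6, 0.3, 0.1]])

err = batch_error(answers, probs)  # the second and fourth examples are wrong
```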

In [None]:
epochs = 3
save_epochs = 1

# Train Loop
with closing(ngt.make_transformer()) as transformer:
    # bind the computations
    train_computation = make_bound_computation(transformer, train_outputs, inputs)
    loss_computation = make_bound_computation(transformer, eval_outputs, inputs)
    interactive_computation = make_bound_computation(transformer, interactive_outputs, inputs)

    weight_saver.setup_save(transformer=transformer, computation=train_outputs)

    # Load weights if found
    if restore and os.path.exists(weights_save_path):
        print("Loading weights from {}".format(weights_save_path))
        weight_saver.setup_restore(
            transformer=transformer,
            computation=train_outputs,
            filename=weights_save_path)
        weight_saver.restore()
    elif restore:
        print("Could not find weights at {}. ".format(weights_save_path)
              + "Running with random initialization.")

    for e in range(epochs):
        train_error = []
        train_cost = []
        
        # Loop over training batches, and update model with the train computation
        for idx, data in enumerate(
            tqdm(train_set, total=train_set.nbatches,
                 unit='minibatches', desc="Epoch {}".format(e))):
            train_output = train_computation(data)
            
            # Store loss and error for monitoring
            train_cost.append(train_output['batch_cost'])
            preds = np.argmax(train_output['train_preds'], axis=1)
            error = np.mean(data['answer'].argmax(axis=1) != preds)
            train_error.append(error)

        train_cost_str = "Epoch {}: train_cost {}, train_error {}".format(
            e, np.mean(train_cost), np.mean(train_error))
        print(train_cost_str)

        if e % save_epochs == 0:
            print("Saving model to {}".format(weights_save_path))
            weight_saver.save(filename=weights_save_path)
            print("Saving complete - Running validation set")

        # Eval after each epoch
        test_loss = []
        test_error = []
        for idx, data in enumerate(dev_set):
            # only compute loss on evaluation set
            test_output = loss_computation(data)
            test_loss.append(np.sum(test_output['test_cross_ent_loss']))
            preds = np.argmax(test_output['test_preds'], axis=1)
            error = np.mean(data['answer'].argmax(axis=1) != preds)
            test_error.append(error)

        val_cost_str = "Epoch {}: validation_cost {}, validation_error {}".format(
            e, np.mean(test_loss), np.mean(test_error))
        print(val_cost_str)

        # Shuffle training set and reset the others
        shuf_idx = np.random.permutation(range(train_set.data_arrays['memory'].shape[0]))
        train_set.data_arrays = {k: v[shuf_idx] for k, v in train_set.data_arrays.items()}
        train_set.reset()
        dev_set.reset()

    print('Training Complete.')
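
The shuffle at the end of each epoch applies one permutation to every array in the dataset, so memories, utterances, and answers stay aligned. A minimal NumPy sketch of that idea follows; `shuffle_in_unison` is a hypothetical helper, not part of ArrayIterator.

```python
import numpy as np


def shuffle_in_unison(data_arrays, rng=np.random):
    """Apply the same random permutation to the first axis of every array."""
    n = next(iter(data_arrays.values())).shape[0]
    shuf_idx = rng.permutation(n)
    return {k: v[shuf_idx] for k, v in data_arrays.items()}


# Toy data: row i of 'memory' starts with 2 * i and pairs with answer 10 * i
data = {'memory': np.arange(12).reshape(6, 2),
        'answer': np.arange(6) * 10}

shuffled = shuffle_in_unison(data, rng=np.random.RandomState(0))
```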

### Moving to interactive.py

Now that we have a trained model, we can talk to it! This is best done from the command line.

At the command line, type:

```
cd private-nlp-architect
python interactive.py --task 5 --data_dir ./data/ --model_file ./examples/memn2n_dialogue/memn2n_weights.npz
```

Now it's up to you to play with the model architecture above and hyperparameter settings to observe changes in model performance!