Question answering involves memory: facts are stored in memory, and the question refers back to past information.

How do models for sequential processing memorize things?

Two approaches:
* Flat memory: RNNs and LSTMs
* Responsive memory: end-to-end memory networks

Language is inherently sequential and contextual, and often involves long-range dependencies. Question answering is an especially memory-intensive task: a question is answered on the basis of a preceding sequence of facts.

Given a large dataset of hand-annotated questions linked to supporting facts (here, the fact needed to answer a question is found in a single sentence), how can we retrieve the information needed to answer the question? Since it is not known in advance which fragment of which sentence contains the answer, the model must store all of the sentences (the story).

The Facebook bAbI dataset contains stories like these:

    1 Mary moved to the bathroom.
    2 John went to the hallway.
    3 Where is Mary?     bathroom    1
    4 Daniel went back to the hallway.
    5 Sandra moved to the garden.
    6 Where is Daniel?     hallway    4
    7 John moved to the office.
    8 Sandra journeyed to the bathroom.
    9 Where is Daniel?     hallway    4
    10 Mary moved to the hallway.
    11 Daniel travelled to the office.
    12 Where is Daniel?     office    11
    13 John went back to the garden.
    14 John moved to the bedroom.
    15 Where is Sandra?     bathroom    8
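
Each line is either a numbered fact or a numbered question followed by its answer and the number of the supporting fact. A minimal sketch of how such lines could be told apart is shown here. It assumes the raw files separate question, answer and supporting ids with tabs (the regexes used later in these notes rely on the same assumption); `parse_line` and the two pattern names are hypothetical helpers, not part of the dataset tooling.

```python
import re

# Assumed line formats (tab-separated question lines):
#   "<id> <fact>."
#   "<id> <question>? \t<answer>\t<supporting fact ids>"
FACT_RE = re.compile(r"^(\d+)\s(.+)\.$")
QUESTION_RE = re.compile(r"^(\d+)\s(.+)\?\s*\t([^\t]+)\t(.+)$")

def parse_line(line):
    """Return a ('fact', ...) or ('question', ...) tuple, or None if unrecognized."""
    line = line.rstrip()
    m = QUESTION_RE.match(line)
    if m:
        support = [int(i) for i in m.group(4).split()]
        return ("question", int(m.group(1)), m.group(2), m.group(3), support)
    m = FACT_RE.match(line)
    if m:
        return ("fact", int(m.group(1)), m.group(2))
    return None

print(parse_line("1 Mary moved to the bathroom."))
print(parse_line("3 Where is Mary? \tbathroom\t1"))
```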

The model can be explored under two conditions:
* Only using a question and the supporting fact
* Using all facts in a story, including non-relevant ones for answering the particular question
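
The difference between the two conditions comes down to which facts are handed to the model for a given question. A rough sketch, assuming a story is kept as a dict from fact number to fact text (`select_context` is a hypothetical helper used only for illustration):

```python
def select_context(story, supporting_ids, use_context=False):
    """Join either the supporting facts or every fact seen so far into one string.

    `story` maps fact number -> fact text (an assumed structure for this sketch).
    """
    if use_context:
        ids = sorted(story)      # all facts, in story order
    else:
        ids = supporting_ids     # only the facts the annotation points to
    return " ".join(story[i] for i in ids)

story = {
    1: "Mary moved to the bathroom",
    2: "John went to the hallway",
}
print(select_context(story, [1]))                    # supporting fact only
print(select_context(story, [1], use_context=True))  # whole story so far
```

In the second condition the model has to find the relevant fact among distractors by itself, which is where memory matters.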

## Data Vectorization
We need three vectors:
* A list holding all facts as vectors
* A list of vectorized questions
* A list of labels: word indices referring to the word that is the answer to a question
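
To make the three shapes concrete, here is a small pure-Python sketch with made-up word indices. `pad_left` imitates the default behaviour of Keras `pad_sequences` (zeros added at the front, truncation from the front), and `one_hot` builds the label vector; both are illustrative stand-ins, and the indices below are assumptions rather than real tokenizer output.

```python
def pad_left(seq, maxlen, value=0):
    """Left-pad (or left-truncate) a list of word indices to length maxlen (> 0)."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

def one_hot(index, vocab_size):
    """Label vector with a single 1 at the answer word's index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

# Hypothetical word indices, for illustration only
fact = [5, 9, 2, 1, 7]    # e.g. "mary moved to the bathroom"
question = [3, 4, 5]      # e.g. "where is mary"
answer_index = 7          # e.g. "bathroom"

X = pad_left(fact, maxlen=8)
Q = pad_left(question, maxlen=4)
y = one_hot(answer_index, vocab_size=10)
print(X, Q, y)
```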

First of all, we need to tokenize the stories:

In [12]:
import re
from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer(training_data, test_data):
    text = []
    
    # Training file: keep each fact, and each question together with its answer word
    with open(training_data, "r") as f:
        for line in f:
            line = line.rstrip()
            m = re.match(r"^\d+\s([^.]+)\.", line)
            if m:
                text.append(m.group(1))
            else:
                m = re.match(r"^\d+\s([^?]+)\?\s\t([^\t]+)", line)
                if m:
                    text.append(m.group(1) + ' ' + m.group(2))
    
    # Test file: keep facts and questions, but leave out the answers
    with open(test_data, "r") as f:
        for line in f:
            line = line.rstrip()
            m = re.match(r"^\d+\s([^.]+)\.", line)
            if m:
                text.append(m.group(1))
            else:
                m = re.match(r"^\d+\s([^?]+)\?", line)
                if m:
                    text.append(m.group(1))
    
    # Note: `text` holds whole sentences, so this is the set of unique sentences.
    # For these stories its size exceeds the number of distinct words,
    # so it serves as a loose upper bound for num_words.
    vocabulary = set(text)
    max_words = len(vocabulary)
    
    print(vocabulary)
                      
    tokenizer = Tokenizer(num_words=max_words, char_level=False, split=' ')
    tokenizer.fit_on_texts(text)
    return tokenizer, max_words

In [17]:
train = 'datasets/tasksv11/en/qa1_single-supporting-fact_train.txt'
test = 'datasets/tasksv11/en/qa1_single-supporting-fact_test.txt'
tokenizer, max_words = create_tokenizer(train, test)

{'Daniel journeyed to the hallway', 'Sandra travelled to the kitchen', 'Sandra travelled to the bathroom', 'Where is Sandra office', 'Mary went to the kitchen', 'John journeyed to the hallway', 'John went back to the kitchen', 'Sandra went to the bedroom', 'Daniel went to the office', 'Daniel moved to the kitchen', 'Mary moved to the bedroom', 'John journeyed to the office', 'John went to the garden', 'Sandra went back to the hallway', 'John travelled to the bathroom', 'Sandra went back to the bedroom', 'John went to the office', 'John went to the bathroom', 'Daniel moved to the office', 'Where is Mary kitchen', 'Daniel travelled to the kitchen', 'Where is Sandra bathroom', 'Where is John bedroom', 'Mary went to the bedroom', 'Mary went back to the bedroom', 'John journeyed to the garden', 'Daniel journeyed to the kitchen', 'Daniel went back to the bedroom', 'Where is Daniel garden', 'Where is Daniel', 'Where is John', 'John went back to the office', 'Mary journeyed to the hallway', 'J

The facts of a story are concatenated into one string, which is then tokenized.

A boolean flag determines whether the entire story is kept or only the fact holding the answer to the question.

Once the relevant facts for answering the question have been determined, they are vectorized along with the question and the answer, and the results are appended to designated output variables for the entire training and test sets.
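
As a rough illustration of what "vectorizing" means here, the sketch below builds a word index by hand and turns concatenated facts into a list of integers. It is a simplified stand-in for the Keras `Tokenizer` (lowercasing, most frequent words first, index 0 left free for padding), not its actual implementation; `build_word_index` and `to_sequence` are hypothetical names.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each lowercased word to an integer, most frequent first, starting at 1.

    Index 0 is left free for padding.
    """
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Convert a string to a list of word indices, skipping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

facts = ["Mary moved to the bathroom", "John went to the hallway"]
word_index = build_word_index(facts)

# Concatenate the facts into one string, then map it to integers
story_vector = to_sequence(" ".join(facts), word_index)
print(word_index)
print(story_vector)
```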

In [33]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def vectorize(s, tokenizer):
    vector = tokenizer.texts_to_sequences([s])
    return vector[0]

def process_stories(filename, tokenizer, max_story_len, vocab_size,
                    use_context=False, max_query_len=None):
    X = []
    Q = []
    y = []
    story = {}
    
    with open(filename, 'r') as f:
        for line in f:
            line = line.rstrip()
            m = re.match(r"^(\d+)\s(.+)\.", line)
            if m:
                fact_id = int(m.group(1))
                if fact_id == 1:
                    # index 1 marks the start of a new story
                    story = {}
                # store every fact, including the first one, under its line number
                story[fact_id] = m.group(2)
            else:
                # otherwise, try to read a question: text, answer, supporting fact ids
                m = re.match(r"^\d+\s(.+)\?\s\t([^\t]+)\t(.+)", line)
                if m:
                    question = m.group(1)
                    answer = m.group(2)
                    answer_ids = [int(x) for x in m.group(3).split(" ")]
                    if not use_context:
                        # keep only the supporting fact(s)
                        facts = ' '.join(story[i] for i in answer_ids)
                    else:
                        # keep every fact seen so far in this story
                        facts = ' '.join(story.values())
                    vectorized_answer = vectorize(answer, tokenizer)
                    X.append(vectorize(facts, tokenizer))
                    Q.append(vectorize(question, tokenizer))
                    # one-hot label at the answer word's index
                    label = np.zeros(vocab_size)
                    label[vectorized_answer[0]] = 1
                    y.append(label)
    
    # max_query_len=None lets pad_sequences pad to the longest question
    X = pad_sequences(X, maxlen=max_story_len)
    Q = pad_sequences(Q, maxlen=max_query_len)
    
    return np.array(X), np.array(Q), np.array(y)

In [34]:
process_stories(train, tokenizer, max_story_len = 100, vocab_size = max_words)