## Notebook summary

In this notebook, we perform the data preprocessing, which consists of the following steps:

- Loading the SQuAD data
- Tokenizing with Stanford CoreNLP
- Embedding with GloVe vectors pretrained on the 840B Common Crawl corpus and kept fixed (not trained further)

In the end, we require a 2 x n array containing the input data and target for each question-answer pair. The target here is [UNRESOLVED, see questions]. The input data for each example consists of a paragraph and a question, each encoded as a sequence of word vectors. This gives two matrices, of shape p x l and q x l respectively, where p is the number of words in the paragraph, q the number of words in the question, and l the length of the word embeddings. Whether these can be concatenated or should be separate items in an array is [UNRESOLVED, see questions].

## Questions to resolve:

- Should the target answer in training be text or two indices? Same goes for the model output. 
    - Assumption: should be text

It seems that the model should output start and end indices for the span of the answer, but is evaluated on the words contained in the span. If this is the case, there has to be a step between model output and evaluation, where the two indices are converted to words in the span. 
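
A minimal sketch of that intermediate step, assuming the model emits inclusive start and end token indices into the tokenized paragraph. The helper name `span_to_text` is hypothetical, and joining tokens with single spaces is a simplification that will not always reproduce the original whitespace.

```python
def span_to_text(paragraph_tokens, start, end):
    """Return the answer text for an inclusive token span [start, end]."""
    if start > end:
        # Assumption: an inverted span is treated as an empty answer
        return ""
    return " ".join(paragraph_tokens[start:end + 1])


# Hand-written token list, for illustration only
tokens = ["The", "Broncos", "won", "Super", "Bowl", "50", "."]
print(span_to_text(tokens, 3, 5))
```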

- Where should this conversion from indices to words take place?

- Can the document and question embedding matrices be concatenated or should they be passed as separate list items?
- Why do we need the answer_start flag if we only match the output text with the target text?
- Should the target answer texts be tokenized too?
- Should we make full words out of tokenized contractions? (e.g. you're -> you, are instead of you're -> you, 're)
    - Assumption: No, we stick with the original tokens and hope that they're in GloVe
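
One possible use for `answer_start`, should the target turn out to be two indices: it is a character offset into the paragraph, so it can be mapped onto token positions. The sketch below assumes each token carries `characterOffsetBegin` and `characterOffsetEnd` fields, as in the CoreNLP JSON output. The helper name `char_span_to_token_span` and the toy token list are hypothetical.

```python
def char_span_to_token_span(tokens, answer_start, answer_text):
    """Map a character-level answer span onto inclusive token indices.

    `tokens` is a list of dicts with 'characterOffsetBegin' and
    'characterOffsetEnd' keys (the end offset is exclusive).
    Returns (start_idx, end_idx), or None if no token overlaps the span.
    """
    answer_end = answer_start + len(answer_text)
    overlapping = [
        i for i, tok in enumerate(tokens)
        if tok['characterOffsetBegin'] < answer_end
        and tok['characterOffsetEnd'] > answer_start
    ]
    if not overlapping:
        return None
    return overlapping[0], overlapping[-1]


# Toy tokens, hand-built to mimic tokenizer output for the context below
context = "It was in 1999."
toy_tokens = [
    {'word': 'It',   'characterOffsetBegin': 0,  'characterOffsetEnd': 2},
    {'word': 'was',  'characterOffsetBegin': 3,  'characterOffsetEnd': 6},
    {'word': 'in',   'characterOffsetBegin': 7,  'characterOffsetEnd': 9},
    {'word': '1999', 'characterOffsetBegin': 10, 'characterOffsetEnd': 14},
    {'word': '.',    'characterOffsetBegin': 14, 'characterOffsetEnd': 15},
]
print(char_span_to_token_span(toy_tokens, context.index("1999"), "1999"))
```

If the answer boundaries fall inside a token, this version returns every token that overlaps the answer, which is a design choice rather than a requirement.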

In [99]:
import pandas as pd
import numpy as np
import pickle

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000') # Set server url 
# If server is offline, run the command below in Terminal from the stanford CoreNLP folder
# java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000


In [75]:
squad = pd.read_json('../data/train-v1.1.json', orient='records')

In [76]:
squad.shape

(442, 2)

Okay, so we need to extract a couple of things:

- answers, which are just texts (that should be tokenized? [UNRESOLVED, see questions])
- questions, which should be tokenized and embedded
- paragraphs, which should be tokenized and embedded
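
The nesting that these three components sit in can be sketched on a hand-built record. The toy article below is made up for illustration and only mimics the field names found in the SQuAD v1.1 JSON (`paragraphs`, `context`, `qas`, `question`, `answers`, `text`, `answer_start`). The helper `iter_examples` is hypothetical.

```python
def iter_examples(articles):
    """Yield (context, question, answer_texts) for every question in the data."""
    for article in articles:
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                answer_texts = [answer['text'] for answer in qa['answers']]
                yield context, qa['question'], answer_texts


# Made-up record in the shape of one SQuAD article
toy_articles = [{
    'title': 'Toy_article',
    'paragraphs': [{
        'context': 'GloVe was released in 2014.',
        'qas': [{
            'id': 'toy-0',
            'question': 'When was GloVe released?',
            'answers': [{'text': '2014', 'answer_start': 22}],
        }],
    }],
}]

for context, question, answers in iter_examples(toy_articles):
    print(question, '->', answers)
```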

### The Stanford CoreNLP Tokenizer

In [79]:
def tokenize(text, annotator=nlp):
    """
    Calls the Stanford CoreNLP tokenizer running on a local server to tokenize the input text.
    
    Returns:
    List of token strings, in order of appearance
    """
    annotated_text = annotator.annotate(text, properties={'annotators': 'tokenize', 'outputFormat': 'json'})
    # Keep only the surface form of each token
    return [token['word'] for token in annotated_text['tokens']]

### The GloVe word embeddings

From the DCN paper:

"We use as GloVe word vectors pretrained
on the 840B Common Crawl corpus (Pennington et al., 2014). We limit the vocabulary
to words that are present in the Common Crawl corpus and set embeddings for out-of-vocabulary
words to zero. Empirically, we found that training the embeddings consistently led to overfitting and
subpar performance, and hence only report results with fixed word embeddings."


When reading in the GloVe vectors, we found that some lines did not have the expected length and contained odd words (such as name@example.com) and values (such as '.'). We don't know whether this is intrinsic to the data or a result of how we parse it. Either way, of the 2196016 total lines, 29 had the wrong length. We therefore decided to skip those 29 lines, which means the corresponding words get a zero vector, like any other out-of-vocabulary word.
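
One possible explanation, offered as an assumption rather than something verified here: `line.split()` splits on every run of whitespace, so a token that itself contains a space would yield more than 301 fields. Below is a sketch of an alternative parse that splits from the right, so that the last `dim` fields are always taken as the vector. The helper name `parse_glove_line` is hypothetical, and the example uses 3-dimensional toy vectors rather than real GloVe lines.

```python
import numpy as np


def parse_glove_line(line, dim=300):
    """Split one GloVe line into (word, vector), keeping spaces inside the word."""
    # Split at most `dim` times from the right: whatever is left over is the word
    parts = line.rstrip('\n').rsplit(' ', dim)
    if len(parts) != dim + 1:
        raise ValueError("Expected a word followed by {} values".format(dim))
    word = parts[0]
    vector = np.array([float(val) for val in parts[1:]])
    return word, vector


# Toy lines with 3-dimensional vectors; the second token contains spaces
print(parse_glove_line("the 0.1 0.2 0.3\n", dim=3))
print(parse_glove_line(". . . 0.4 0.5 0.6\n", dim=3))
```

Whether tokens recovered this way are worth keeping is a separate question, since the tokenizer may never produce them.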


In [64]:
glove_file_path = "../data/glove.840B.300d.txt"

def load_glove_embeddings(file_path):
    """
    Loads the glove word vectors from a textfile and parses it into a dictionary with words and vectors.
    
    Returns:
    A dictionary of words and corresponding vectors
    """
    
    print("Loading Glove Model")
    embeddings_dict = {}
    with open(file_path, 'r', encoding="utf8") as f:
        for line in f:
            split_line = line.split()
            
            # Skip aberrant lines: expect one word followed by 300 values
            if len(split_line) != 301:
                continue

            word = split_line[0]
            embedding = np.array([float(val) for val in split_line[1:]])
            embeddings_dict[word] = embedding
            
    print("Done. {} words loaded!".format(len(embeddings_dict)))
    return embeddings_dict


In [65]:
embeddings = load_glove_embeddings(glove_file_path)

Loading Glove Model
Done. 2195875 words loaded!


In [70]:
def embed(words, embeddings):
    """
    Takes words and returns corresponding GloVe word embeddings. Returns a zero vector if no embedding is found.
    
    Returns:
    List of word vectors
    """
    word_vectors = []
    
    for word in words:
        # Match word with vector
        try:
            vector = embeddings[word]
        except KeyError:
            # Set to zero vector if no match
            vector = np.zeros(300)
            
        word_vectors.append(vector)
    
    return word_vectors
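
To connect this with the p x l and q x l matrices described in the summary: the list returned by `embed` can be stacked into a single array. The sketch below re-declares a compact stand-in for `embed` so that it runs on its own. The `dim` parameter and the 4-dimensional embedding table are made up for illustration and are not part of the function above.

```python
import numpy as np


def embed(words, embeddings, dim=300):
    """Compact stand-in for the function above: zero vector for unknown words."""
    return [embeddings.get(word, np.zeros(dim)) for word in words]


# Made-up 4-dimensional embeddings, for illustration only
toy_embeddings = {
    'the': np.array([0.1, 0.2, 0.3, 0.4]),
    'cat': np.array([0.5, 0.6, 0.7, 0.8]),
}

paragraph_tokens = ['the', 'cat', 'purrs']  # 'purrs' is out of vocabulary
paragraph_matrix = np.stack(embed(paragraph_tokens, toy_embeddings, dim=4))

print(paragraph_matrix.shape)  # (number of tokens, embedding length)
```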

In [112]:
# Preprocess paragraphs, questions, and answers

def preprocess(dataset):
    """
    Parses the dataset and extracts questions, paragraphs, and answers. Also tokenizes 
    and applies GloVe embeddings to the questions and paragraphs. Answers are kept as raw text.
    
    Returns: 
    All three processed components in a nested list of the form 
    [[paragraphs, questions], answers], where index i in each list belongs to the same question
    """
    data_answers = []
    data_paragraphs = []
    data_questions = []
    for article in dataset:
        for paragraph in article['paragraphs']:
            # Tokenize and embed each paragraph once, then reuse it for all its questions
            tokenized_paragraph = tokenize(paragraph['context'])
            embedded_paragraph = embed(tokenized_paragraph, embeddings)

            for qa in paragraph['qas']:
                # Include the embedded paragraph with every question
                data_paragraphs.append(embedded_paragraph)

                tokenized_question = tokenize(qa['question'])
                embedded_question = embed(tokenized_question, embeddings)
                data_questions.append(embedded_question)

                # Keep the answer texts only; answer_start is dropped here
                answers = [answer['text'] for answer in qa['answers']]
                data_answers.append(answers)

    return [[data_paragraphs, data_questions], data_answers]

In [119]:
dataset = squad['data']
training_data = preprocess(dataset)

In [120]:
# Save dataset
with open("../data/training_data.txt", "wb") as fp:
    pickle.dump(training_data, fp)

In [121]:
# Check if it saved correctly
with open("../data/training_data.txt", "rb") as fp:   # Unpickling
    tdata = pickle.load(fp)
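
Loading the file only shows that it unpickles. A sketch of a stricter round-trip check follows. It assumes the nested `[[paragraphs, questions], answers]` layout returned by `preprocess`, runs on made-up data written to a temporary file, and the helper name `same_dataset` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np


def same_dataset(a, b):
    """Compare two [[paragraphs, questions], answers] structures element-wise."""
    (paragraphs_a, questions_a), answers_a = a
    (paragraphs_b, questions_b), answers_b = b
    if answers_a != answers_b:
        return False
    for seqs_a, seqs_b in ((paragraphs_a, paragraphs_b), (questions_a, questions_b)):
        if len(seqs_a) != len(seqs_b):
            return False
        for vectors_a, vectors_b in zip(seqs_a, seqs_b):
            if len(vectors_a) != len(vectors_b):
                return False
            if not all(np.array_equal(x, y) for x, y in zip(vectors_a, vectors_b)):
                return False
    return True


# Made-up data: one paragraph of two tokens, one question of one token
toy_data = [[[[np.zeros(3), np.ones(3)]], [[np.ones(3)]]], [['an answer']]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_training_data.pkl")
    with open(path, "wb") as fp:
        pickle.dump(toy_data, fp)
    with open(path, "rb") as fp:
        restored = pickle.load(fp)

print(same_dataset(toy_data, restored))
```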