## Notebook summary

In this notebook, we perform the necessary data preprocessing. This consists of the following components:

- Loading the SQuAD data
- Tokenization with Stanford CoreNLP
- Embedding with GloVe vectors pretrained on the 840B Common Crawl corpus and kept fixed

In the end, we require a 2 x n array containing the input data and target for each of the n question-answer pairs. The target here is [UNRESOLVED, see questions]. The input data for each example consists of a paragraph and a question, each encoded as word vectors. The paragraph is therefore a p x l matrix and the question a q x l matrix, where p is the number of words in the paragraph, q the number of words in the question, and l the length of the word embeddings. Whether these can be concatenated or should be separate items in an array is [UNRESOLVED, see questions].
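To make the shapes concrete, here is a minimal sketch using placeholder matrices. The sizes `p`, `q` and `l` are made-up values for illustration, and which of the two layouts the model should receive is still the open question noted above.

```python
import numpy as np

p, q, l = 7, 4, 300  # hypothetical paragraph length, question length, embedding size

paragraph = np.zeros((p, l))  # one row per paragraph token
question = np.zeros((q, l))   # one row per question token

# Option 1: keep the two matrices as separate items
separate = (paragraph, question)

# Option 2: stack them along the token axis; possible because both share l
concatenated = np.concatenate([paragraph, question], axis=0)

print(concatenated.shape)  # (11, 300)
```

Note that concatenation loses the boundary between paragraph and question unless the paragraph length is stored alongside the matrix.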

## Questions to resolve:

- Should the target answer in training be text or two indices? Same goes for the model output. 
    - Assumption: should be text

It seems that the model should output start and end indices for the span of the answer, but is evaluated on the words contained in the span. If this is the case, there has to be a step between model output and evaluation, where the two indices are converted to words in the span. 
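A minimal sketch of such a conversion step, assuming the model predicts token-level start and end indices with an inclusive end. The function name `span_to_text` and the join-on-spaces detokenization are assumptions for illustration; joining on spaces will not exactly reproduce the original answer string around punctuation.

```python
def span_to_text(tokens, start, end):
    """Return the words tokens[start..end] (end inclusive) as one string."""
    if not 0 <= start <= end < len(tokens):
        raise ValueError(
            "invalid span ({}, {}) for {} tokens".format(start, end, len(tokens))
        )
    return " ".join(tokens[start:end + 1])


paragraph_tokens = ["It", "appeared", "to", "Saint", "Bernadette",
                    "Soubirous", "in", "1858", "."]
print(span_to_text(paragraph_tokens, 3, 5))  # Saint Bernadette Soubirous
```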

- Where should this conversion from indices to words take place?

- Can the document and question embedding matrices be concatenated or should they be passed as separate list items?
- Why do we need the answer_start flag if we only match the output text with the target text?
- Should the target answer texts be tokenized too?
- Should we make full words out of tokenized contractions? (e.g. you're -> you, are | you're -> you, 're)
    - Assumption: No, we stick with the original tokens and hope that they're in GloVe
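If the target turns out to be two indices rather than text, `answer_start` could be used to derive them, since it is a character offset into the paragraph. The sketch below assumes each token carries its character offsets, as the CoreNLP JSON output does in the `characterOffsetBegin` and `characterOffsetEnd` fields. The function name `char_span_to_token_span` and the hand-written tokens are illustrative only.

```python
def char_span_to_token_span(tokens, answer_start, answer_text):
    """Map a character-level answer to inclusive (start, end) token indices.

    tokens: list of dicts with 'characterOffsetBegin' and
            'characterOffsetEnd' (end is exclusive).
    Returns None if the answer overlaps no token.
    """
    answer_end = answer_start + len(answer_text)
    overlapping = [
        i for i, tok in enumerate(tokens)
        if tok['characterOffsetBegin'] < answer_end
        and tok['characterOffsetEnd'] > answer_start
    ]
    if not overlapping:
        return None
    return overlapping[0], overlapping[-1]


# Hand-built tokens for the text "to Saint Bernadette in 1858"
tokens = [
    {'word': 'to', 'characterOffsetBegin': 0, 'characterOffsetEnd': 2},
    {'word': 'Saint', 'characterOffsetBegin': 3, 'characterOffsetEnd': 8},
    {'word': 'Bernadette', 'characterOffsetBegin': 9, 'characterOffsetEnd': 19},
    {'word': 'in', 'characterOffsetBegin': 20, 'characterOffsetEnd': 22},
    {'word': '1858', 'characterOffsetBegin': 23, 'characterOffsetEnd': 27},
]
print(char_span_to_token_span(tokens, 3, 'Saint Bernadette'))  # (1, 2)
```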

In [4]:
import pandas as pd
import numpy as np
import pickle

# from torch.utils.data import Dataset
from pycorenlp import StanfordCoreNLP
# from torch.nn.utils.rnn import *

nlp = StanfordCoreNLP('http://localhost:9001')
# If the server is offline, run the command below in a terminal from the Stanford CoreNLP folder
# java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 15000


In [5]:
squad = pd.read_json('../data/train-v1.1.json', orient='records')

In [6]:
squad.shape

(442, 2)

Okay, so we need to extract a couple of things:

- answers, which are just texts (that should be tokenized? [UNRESOLVED, see questions])
- questions, which should be tokenized and embedded
- paragraphs, which should be tokenized and embedded

### The Stanford CoreNLP Tokenizer

In [7]:
def tokenize(text, annotator=nlp):
    """
    Calls the Stanford CoreNLP tokenizer running on a local server to tokenize the input text.
    
    Returns:
    A list of token strings
    """
    annotated_text = annotator.annotate(text, properties={'annotators': 'tokenize', 'outputFormat': 'json'})
    # Keep only the surface form of each token
    return [token['word'] for token in annotated_text['tokens']]

### The GloVe word embeddings

From the DCN paper:

"We use as GloVe word vectors pretrained
on the 840B Common Crawl corpus (Pennington et al., 2014). We limit the vocabulary
to words that are present in the Common Crawl corpus and set embeddings for out-of-vocabulary
words to zero. Empirically, we found that training the embeddings consistently led to overfitting and
subpar performance, and hence only report results with fixed word embeddings."


When reading in the GloVe vectors, we found that some lines did not split into the expected 301 fields (one word plus 300 values) and contained odd words (such as name@example.com) and values (such as '.'). We don't know whether this is intrinsic to the data or whether we are importing it incorrectly. Either way, out of the 2196016 total lines, 29 were of the wrong length. We therefore decided to drop those 29 vectors and set the embeddings for the corresponding words to zero.
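One possible cause, which has not been verified against the 29 lines, is that some GloVe tokens themselves contain spaces, so that `line.split()` yields more than 301 fields. Below is a hedged sketch of a parser that splits from the right instead, so that only the last `dim` fields are treated as values. `parse_glove_line` is a hypothetical helper and the example line is made up.

```python
def parse_glove_line(line, dim=300):
    """Split a GloVe text line into (word, values), allowing spaces in the word."""
    parts = line.rstrip("\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        return None  # too few fields: still an aberrant line
    word, values = parts[0], parts[1:]
    try:
        return word, [float(val) for val in values]
    except ValueError:
        return None  # a value field that is not a number


# Made-up 3-dimensional line whose token contains spaces
print(parse_glove_line(". . . 0.1 -0.2 0.3", dim=3))  # ('. . .', [0.1, -0.2, 0.3])
```

Lines for which this returns `None` would still have to be skipped, as in the loader below.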


In [8]:
glove_file_path = "../data/glove.840B.300d.txt"

def load_glove_embeddings(file_path):
    """
    Loads the GloVe word vectors from a text file and parses them into a dictionary.
    
    Returns:
    A dictionary mapping each word to its embedding vector
    """
    
    print("Loading Glove Model")
    embeddings_dict = {}
    with open(file_path, 'r', encoding="utf8") as f:
        for line in f:
            split_line = line.split()
            
            # Skip aberrant lines: expect 1 word followed by 300 values
            if len(split_line) != 301:
                continue

            word = split_line[0]
            embedding = np.array([float(val) for val in split_line[1:]])
            embeddings_dict[word] = embedding
            
    print("Done. {} words loaded!".format(len(embeddings_dict)))
    return embeddings_dict


In [9]:
embeddings = load_glove_embeddings(glove_file_path)

Loading Glove Model


FileNotFoundError: [Errno 2] No such file or directory: '../data/glove.840B.300d.txt'

In [None]:
def embed(words, embeddings):
    """
    Takes a list of words and returns the corresponding GloVe word embeddings. Words without an embedding get a zero vector.
    
    Returns:
    Array of shape (len(words), 300) with one word vector per row
    """
    word_vectors = np.zeros((len(words), 300))
    
    for i, word in enumerate(words):
        # Match word with vector
        try:
            vector = embeddings[word]
        except KeyError:
            # Set to zero vector if no match
            vector = np.zeros(300)
            
        word_vectors[i] = vector
    
    return word_vectors

TODO: Add a check to verify that the amount of null vectors is relatively low.
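A minimal sketch of such a check. Since `embed` falls back to a zero vector whenever a word is missing from the dictionary, counting missing words gives the share of null vectors. The function name `null_vector_ratio` and the 10% threshold are arbitrary choices for illustration, and the toy dictionary stands in for the real GloVe embeddings.

```python
def null_vector_ratio(words, embeddings):
    """Fraction of words that have no entry in the embeddings dictionary."""
    if not words:
        return 0.0
    missing = sum(1 for word in words if word not in embeddings)
    return missing / len(words)


toy_embeddings = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}  # stand-in for GloVe
ratio = null_vector_ratio(["hello", "world", "jorren", "hello"], toy_embeddings)
print(ratio)  # 0.25
if ratio > 0.10:  # arbitrary threshold
    print("Warning: {:.0%} of tokens have no embedding".format(ratio))
```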

In [None]:
# Preprocess paragraphs, questions, and answers

def preprocess(text):
    """
    Tokenizes and applies word embeddings to a text.
    """
    tokenized_text = tokenize(text)
    embedded_text = embed(tokenized_text, embeddings)

    return embedded_text

In [None]:
from torch.utils.data import Dataset

class SquadDataset(Dataset):
    """Stanford Question Answering Dataset."""

    def __init__(self, json_file, transform=None):
        """
        Args:
            json_file (string): Path to the SQuAD JSON file.
            transform (callable, optional): Optional transform to be applied
                to the paragraph text and the question of a sample.
        """
        self.dataset = pd.read_json(json_file, orient='records')['data']
        self.dataset = self.flatten_data(self.dataset)
        self.transform = transform

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        
        text = item['text']
        question = item['question']
        answer = item['answer']
        
        if self.transform:
            text = self.transform(text)
            question = self.transform(question)
            
        sample = {'text': text, 'question': question, 'answer': answer}

        return sample
    
    def flatten_data(self, data):
        flat_data = []
        for article in data:
            for paragraph in article['paragraphs']:
                for qa in paragraph['qas']:
                    flat_data.append({'text': paragraph['context'], 
                                      'question': qa['question'], 
                                      'answer': qa['answers'][0]['text']})
        return flat_data

In [None]:
squad_dataset = SquadDataset(json_file='../data/train-v1.1.json', transform=preprocess)

In [1]:
glove_file_path = "../data/glove.840B.300d.txt"
squad_file_path = '../data/train-v1.1.json'


In [2]:
import sys

In [3]:
sys.path.append('../')

In [4]:
from src.dataset import SquadDataset
from src.preprocessing import Preprocessing

In [5]:
d = SquadDataset(squad_file_path, glove_file_path)

Loading Glove Model
Done. 2195875 words loaded!


In [None]:
len(d)

In [6]:
d[0]

text: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


{'answer': 'Saint Bernadette Soubirous',
 'question': array([[-0.058238 ,  0.20478  , -0.036143 , ..., -0.041456 , -0.17519  ,
          0.19846  ],
        [ 0.17536  ,  0.0073103,  0.07546  , ...,  0.0066219,  0.19711  ,
         -0.38826  ],
        [-0.068894 ,  0.38769  , -0.2612   , ...,  0.19304  ,  0.37526  ,
          0.14579  ],
        ..., 
        [ 0.70281  , -0.42784  ,  0.01525  , ..., -0.11392  ,  0.25031  ,
          0.072933 ],
        [ 0.014575 ,  0.52839  , -0.12192  , ..., -0.16678  ,  0.76594  ,
          0.14542  ],
        [-0.086864 ,  0.19161  ,  0.10915  , ..., -0.01516  ,  0.11108  ,
          0.2065   ]]),
 'text': array([[ 0.36565 , -0.1154  ,  0.34923 , ...,  0.085372,  0.17609 ,
         -0.04315 ],
        [-0.082752,  0.67204 , -0.14987 , ..., -0.1918  , -0.37846 ,
         -0.06589 ],
        [ 0.27204 , -0.06203 , -0.1884  , ...,  0.13015 , -0.18317 ,  0.1323  ],
        ..., 
        [ 0.060216,  0.21799 , -0.04249 , ...,  0.11709 , -0.16692 ,
   

In [None]:
p.preprocess('hey hello my name is jorren how are you')