# Code to run
See the src folder for the up-to-date versions of the functions used here.

## Notebook summary
In this notebook, we perform the necessary data preprocessing. This consists of the following components:

- Loading the SQuAD data
- Tokenization with Stanford CoreNLP
- Embeddings with GloVe, pretrained on 840B Common Crawl, fixed

In the end, we require a 2 x n array containing the input data and target for each question-answer pair. The target here is [UNRESOLVED, see questions]. The input data for each example consists of a paragraph and a question, each encoded as word vectors. This means they are matrices of size p x l and q x l respectively, where p is the number of words in the paragraph, l the length of the word embeddings, and q the number of words in the question. Whether these can be concatenated or should be separate items in an array is [UNRESOLVED, see questions].
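To make the shapes above concrete, here is a minimal sketch of both unresolved options. It uses random stand-in vectors instead of real GloVe embeddings, and the word counts and variable names are hypothetical.

```python
import numpy as np

l = 300          # embedding length (the GloVe 840B vectors used here are 300-dimensional)
p, q = 120, 11   # hypothetical word counts for one paragraph and one question

rng = np.random.default_rng(0)
paragraph = rng.standard_normal((p, l))  # stand-in for the embedded paragraph
question = rng.standard_normal((q, l))   # stand-in for the embedded question

# Option 1: keep the two matrices as separate items
separate = {"text": paragraph, "question": question}

# Option 2: concatenate along the word axis; the boundary index p must then
# be stored somewhere, otherwise paragraph and question cannot be told apart
concatenated = np.concatenate([paragraph, question], axis=0)

print(separate["text"].shape, separate["question"].shape)
print(concatenated.shape)
```

Both options hold the same numbers; the difference is only whether the model receives the paragraph/question boundary implicitly (separate items) or has to be told where it is (concatenation).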

## Questions to resolve:

- Should the target answer in training be text or two indices? Same goes for the model output. 
    - Assumption: should be text

It seems that the model should output start and end indices for the span of the answer, but is evaluated on the words contained in the span. If this is the case, there has to be a step between model output and evaluation, where the two indices are converted to words in the span. 
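The conversion step described above could look roughly like the sketch below. It assumes the model returns an inclusive start and end token index into the tokenized paragraph; `span_to_text` is a hypothetical helper, not something from the src folder. Joining tokens with spaces is lossy compared to the original text (punctuation spacing changes), so an evaluation script would need to normalize for that.

```python
def span_to_text(tokens, start, end):
    """Return the words covered by an inclusive [start, end] token span."""
    if not 0 <= start <= end < len(tokens):
        raise ValueError("span out of range")
    return " ".join(tokens[start:end + 1])


tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
print(span_to_text(tokens, 5, 5))  # Paris
print(span_to_text(tokens, 1, 3))  # capital of France
```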

- Where should this conversion from indices to words take place?

- Can the document and question embedding matrices be concatenated or should they be passed as separate list items?
- Why do we need the answer_start flag if we only match the output text with the target text?
- Should the target answer texts be tokenized too?
- Should we make full words out of tokenized contractions? (e.g. you're -> you, are | you're -> you, 're)
    - Assumption: No, we stick with the original tokens and hope that they're in GloVe
    - https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
    - https://stanfordnlp.github.io/CoreNLP/tokenize.html

In [25]:
glove_file_path = "../data/glove.840B.300d.txt"
squad_file_path = '../data/train-v1.1.json'

In [26]:
import sys

In [27]:
sys.path.append('../')

In [28]:
from src.dataset import SquadDataset
from src.preprocessing import Preprocessing

In [5]:
print(sys.path)

['', 'C:\\ProgramData\\Anaconda3\\python36.zip', 'C:\\ProgramData\\Anaconda3\\DLLs', 'C:\\ProgramData\\Anaconda3\\lib', 'C:\\ProgramData\\Anaconda3', 'C:\\ProgramData\\Anaconda3\\lib\\site-packages', 'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\win32', 'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\win32\\lib', 'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\Pythonwin', 'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\jetze\\.ipython', '../']


In [29]:
data = SquadDataset(squad_file_path, glove_file_path,'text')

Found pickled GloVe file. Loading...
Done. 2195875 words loaded!


In [7]:
len(data)

87599

In [8]:
q = SquadDataset(squad_file_path, glove_file_path,'question')

Found pickled GloVe file. Loading...
Done. 2195875 words loaded!


In [9]:
len(q)

87599

In [10]:
a = SquadDataset(squad_file_path, glove_file_path,'answer')

Found pickled GloVe file. Loading...
Done. 2195875 words loaded!


In [11]:
len(a)

87599

# Sandbox for preprocessing
This is outdated -> check src folder for up to date function

In [19]:
import pandas as pd
import numpy as np
import pickle

from torch.utils.data import Dataset
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9001')
# If server is offline, run the command below in Terminal from the stanford CoreNLP folder
# java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 15000

In [13]:
squad = pd.read_json('../data/train-v1.1.json', orient='records')

In [14]:
squad.shape

(442, 2)

Okay, so we need to extract a couple of things:

- answers, which are just texts (that should be tokenized? [UNRESOLVED, see questions])
- questions, which should be tokenized and embedded
- paragraphs, which should be tokenized and embedded

### Stanford CoreNLP Tokenizer

In [15]:
def tokenize(text, annotator=nlp):
    """
    Calls the Stanford CoreNLP Tokenizer running on a local server, which tokenizes the input text.
    
    Returns:
    Tokenized text
    """
    annotated_text = annotator.annotate(text, properties={'annotators': 'tokenize','outputFormat': 'json'})
    tokenized_text = []
    for token in annotated_text['tokens']:
        word = token['word']
        tokenized_text.append(word)
        
    return tokenized_text

### GloVe word embeddings
From the DCN paper:

"We use as GloVe word vectors pretrained on the 840B Common Crawl corpus (Pennington et al., 2014). We limit the vocabulary to words that are present in the Common Crawl corpus and set embeddings for out-of-vocabulary words to zero. Empirically, we found that training the embeddings consistently led to overfitting and subpar performance, and hence only report results with fixed word embeddings."

When reading in the GloVe vectors, we found that some lines had the wrong length and contained odd words (such as name@example.com) and values (such as '.'). We don't know whether this is intrinsic to the data or whether we are importing it incorrectly. Either way, out of the 2196016 total lines, 29 were of the wrong length. We therefore decided to drop those 29 vectors and set the embeddings for the corresponding words to 0.
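One possible explanation, stated here as an assumption and not verified against the file, is that a few tokens themselves contain spaces, so splitting on every space produces more than 301 fields. The sketch below shows an alternative that splits from the right, so that only the last `dim` fields are treated as values. `parse_glove_line` is a hypothetical helper, and the demo uses 3-dimensional toy lines instead of real GloVe data.

```python
import numpy as np


def parse_glove_line(line, dim=300):
    """Split one GloVe line into (word, vector); return None if it is malformed."""
    # Split from the right so that spaces inside the word are preserved
    parts = line.rstrip("\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        return None
    word, values = parts[0], parts[1:]
    try:
        return word, np.array([float(val) for val in values])
    except ValueError:
        # A field that should be a number could not be parsed
        return None


lines = ["cat 0.1 0.2 0.3\n", ". . . 0.4 0.5 0.6\n", "broken 0.7 0.8\n"]
for line in lines:
    naive_fields = len(line.split())      # what a plain split() would see
    parsed = parse_glove_line(line, dim=3)
    print(naive_fields, parsed)
```

In this toy input the second line has six fields under a plain `split()` and would be dropped by a length check, while the right split keeps it with the word `. . .`. The third line is too short either way and is rejected.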

In [24]:
# from preprocessing

import os
import pickle
import pandas as pd
import numpy as np
from pathlib import Path
from pycorenlp import StanfordCoreNLP

class Preprocessing():
    """
    Class containing tokenization and embeddings functions, borrowing from the Stanford
    CoreNLP tokenizer and the pretrained GloVe word embeddings
    
    About the CoreNLP server: if server is offline, run the command below in
    Terminal from the stanford CoreNLP folder:
    
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 15000
    """
    
    
    def __init__(self, glove_file_path):
        self.annotator = StanfordCoreNLP('http://localhost:9001')
        self.embeddings = self.load_glove_embeddings(glove_file_path)
        
        
    def load_glove_embeddings(self, file_path):
        """
        Loads the GloVe word vectors from a text file and parses it into a dictionary 
        with words and vectors.
        
        Returns:
        A dictionary of words and corresponding vectors
        """
        
        cached_file = '../data/glove.pickle'
        if os.path.isfile(cached_file):
            print("Found pickled GloVe file. Loading...")
            with open(cached_file, 'rb') as handle:
                embeddings_dict = pickle.load(handle)
        else:
            print("Loading GloVe model from .txt...")
            with open(file_path, 'r', encoding='utf8') as f:
                embeddings_dict = {}
                for line in f:
                    split_line = line.split()
                    
                    # Skip aberrant lines: a valid line has 1 word + 300 values
                    if len(split_line) != 301:
                        continue
                    
                    # Field 0 is the word itself; the vector starts at index 1
                    word = split_line[0]
                    embedding = np.array([float(val) for val in split_line[1:]])
                    embeddings_dict[word] = embedding
                
                with open(cached_file, 'wb') as handle:
                    print("Saving GloVes as pickle...")
                    pickle.dump(embeddings_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
                
            print("Done. {} words loaded!".format(len(embeddings_dict)))
            
        return embeddings_dict
    
    
    def preprocess(self, text):
        """
        Tokenizes and applies word embeddings to a text. Also pads or cuts the whole
        sequence to length 600.
        """
        tokenized_text = self.tokenize(text)
        embedded_text = self.embed(tokenized_text)
        embedded_text = embedded_text[:600,:]
        padded_embeddings = np.zeros([600, 300])
        padded_embeddings[:embedded_text.shape[0], :embedded_text.shape[1]] = embedded_text
        
        return padded_embeddings
    
    
    def tokenize(self, text):
        """
        Calls the Stanford CoreNLP Tokenizer running on a local server, which tokenizes 
        the input text.
        """
        annotated_text = self.annotator.annotate(text,
                                                properties = {'annotators': 'tokenize',
                                                             'outputFormat': 'json'})
        tokenized_text = []
        for token in annotated_text['tokens']:
            word = token['word']
            tokenized_text.append(word)
            
        return tokenized_text
    
    
    def embed(self, words):
        """
        Takes words and returns corresponding GloVe word embeddings.
        Returns a zero vector if no embedding is found.
        
        Returns:
        Array of shape (len(words), 300) with one word vector per row
        """
        word_vectors = np.zeros((len(words),300))
        
        for i, word in enumerate(words):
            # Match word with vector
            try:
                vector = self.embeddings[word]
            except KeyError:
                # Set to zero vector if no match
                vector = np.zeros(300)
            
            word_vectors[i] = vector
            
        return word_vectors

Imports in a class:

https://stackoverflow.com/questions/6861487/importing-modules-inside-python-class

PEP 8 (https://www.python.org/dev/peps/pep-0008/):
'Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.'

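In the code above, what happens at the skip aberrant lines step? Any line that does not split into exactly 301 fields (one word plus 300 values) is skipped, so its word never enters the dictionary and later gets a zero vector in `embed`.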

TODO: Add a check to verify that the amount of null vectors is relatively low.
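A minimal sketch of such a check, assuming it is applied to the output of `embed` (before padding, since `preprocess` pads with zero rows that would inflate the count). `zero_vector_rate` and the toy array are hypothetical, and what counts as "relatively low" is left open.

```python
import numpy as np


def zero_vector_rate(word_vectors):
    """Fraction of rows in a (n_words, dim) array that are entirely zero."""
    if len(word_vectors) == 0:
        return 0.0
    is_zero = ~word_vectors.any(axis=1)  # True for rows without any nonzero value
    return float(is_zero.mean())


# Toy example: 4 "words" with 2-dimensional vectors, two of them unmatched
embedded = np.array([[0.1, 0.2], [0.0, 0.0], [0.3, 0.0], [0.0, 0.0]])
rate = zero_vector_rate(embedded)
print(rate)  # 0.5
```

Averaging this rate over a sample of paragraphs and questions would give a rough out-of-vocabulary estimate for the whole dataset.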

In [23]:
import torch

class SquadDataset(Dataset):
    """
    Dataset object for the Stanford Question Answering Dataset.
    """
    
    def __init__(self, data_file_path, glove_file_path, target):
        """
        Args:
            data_file_path (string): Path to the SQuAD JSON file.
            glove_file_path (string): Path to the GloVe vectors text file.
            target (string): Field to return per item: 'text', 'question' or 'answer'.
        """
        self.dataset = pd.read_json(data_file_path, orient='records')['data']
        self.dataset = self.flatten_data(self.dataset)
        self.preprocess = Preprocessing(glove_file_path)
        self.target = target
        
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self,idx):
        item = self.dataset[idx]
        
        if self.target in ['text','question']:
            data_point = item[self.target]
            data_point = self.preprocess.preprocess(data_point)
            sample = {self.target: torch.from_numpy(data_point)}
        else:
            data_point = item['answer']
            sample = {self.target: data_point}
            
        return sample
    
    def flatten_data(self, data):
        flat_data = []
        for article in data:
            for paragraph in article['paragraphs']:
                for qa in paragraph['qas']:
                    flat_data.append({'text': paragraph['context'],
                                     'question': qa['question'],
                                     'answer': qa['answers'][0]['text']})
        return flat_data

TODO: update args in SquadDataset

Question: what happens in flatten?
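As a worked example of what `flatten_data` does, the sketch below runs a standalone copy of the method on a tiny hand-made input with the same nesting as the SQuAD JSON (articles, then paragraphs, then question-answer pairs). The article content is invented for illustration.

```python
def flatten_data(data):
    """Turn nested SQuAD articles into one flat dict per question."""
    flat_data = []
    for article in data:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                flat_data.append({
                    "text": paragraph["context"],        # repeated for every question
                    "question": qa["question"],
                    "answer": qa["answers"][0]["text"],  # only the first answer is kept
                })
    return flat_data


articles = [{
    "title": "Example",
    "paragraphs": [{
        "context": "Paris is the capital of France.",
        "qas": [
            {"question": "What is the capital of France?",
             "answers": [{"text": "Paris", "answer_start": 0}]},
            {"question": "Which country has Paris as its capital?",
             "answers": [{"text": "France", "answer_start": 24}]},
        ],
    }],
}]

flat = flatten_data(articles)
print(len(articles), len(flat))  # 1 2
for row in flat:
    print(row["question"], "->", row["answer"])
```

This is why `squad.shape` reports 442 rows (one per article) while `len` of the dataset is 87599 (one per question). Note that `answer_start` is dropped during flattening.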