# Building the Question Answering System

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import keras

Using TensorFlow backend.


## Reading in Data

In [2]:
train_df=pd.read_json("C:/Users/Lukas Buteliauskas/Desktop/training_data.json").reset_index(drop=True)
dev_df=pd.read_json("C:/Users/Lukas Buteliauskas/Desktop/validation_data.json").reset_index(drop=True)

## Word Vectorization
To use words, phrases, questions or other natural language constructs in our model, we need to provide our neural network with a numerical representation of our words (as these are the elemental NLP 'particles'). The simplest implementation would be to use 'one hot encoding' and define each word as a vector the size of our dictionary (the number of unique words found in our collection of documents, our corpus). However, this approach will most likely be insufficient for the purposes of a question answering system, because every pair of distinct words looks equally dissimilar. word2vec and GloVe are 2 popular, more sophisticated options for word embeddings that also capture word similarities. I will not go into the details of either architecture other than to say that we will not be re-training the word vectors due to the insufficient size of the dataset, and we will begin with the GloVe word embeddings due to its superior performance in most 'downstream' modelling tasks. Having said that, given the simplicity of swapping word vector representations, we will also test performance with word2vec (provided we can do so in a time-efficient manner).

Info and download links for GloVe can be found at: https://nlp.stanford.edu/projects/glove/
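
To make the difference between the two representations concrete, here is a minimal sketch comparing cosine similarity under one-hot encoding and under dense vectors. The three-word vocabulary and the dense values are made up for illustration and are not real GloVe vectors; `cosine_similarity` is a helper defined here, not part of the notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = ["king", "queen", "apple"]

# One-hot encoding: each word is a unit vector the size of the vocabulary.
identity = np.eye(len(toy_vocab))
one_hot = {word: identity[i] for i, word in enumerate(toy_vocab)}

# Made-up dense vectors; real GloVe values would differ.
dense = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9]),
}

# Distinct one-hot vectors are always orthogonal, so similarity is zero.
one_hot_sim = cosine_similarity(one_hot["king"], one_hot["queen"])
# Dense vectors can place related words closer together than unrelated ones.
dense_related = cosine_similarity(dense["king"], dense["queen"])
dense_unrelated = cosine_similarity(dense["king"], dense["apple"])
print(one_hot_sim, dense_related, dense_unrelated)
```

With one-hot vectors no notion of 'closeness' between different words is available to the model, which is the main motivation for using pre-trained embeddings here.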

### Defining Custom Functions
The functions below are defined to avoid repeating code, reduce the risk of bugs and encourage good programming practice/design.

In [3]:
def get_word_vector_dict(url_or_path):
    """Takes a local path to a GloVe text file and returns a dictionary of GloVe word vectors where the key is the
    word and the value is the word vector with the dimension specified in the input file. The file is read with
    open(), so despite the parameter name a URL will not work as written."""
    
    with open(url_or_path, encoding="utf8") as glove_text:
        word_lines=glove_text.readlines()
    # Each line has the form: word value_1 value_2 ... value_n
    word_embeddings=[line.split(" ") for line in word_lines]
    word_vector_dict={element[0]:list(map(float, element[1:])) for element in word_embeddings}
    
    return word_vector_dict


def get_word_vector_df(url_path_or_dict):
    """Takes a path like the previous function, or a word vector dictionary, and returns a word vector dataframe.
    Rows of the dataframe are the word vectors, columns are the dimensions of the word vector, indices are the words."""
    
    if isinstance(url_path_or_dict, str):
        # Reuse the dictionary loader rather than duplicating the parsing code.
        word_vector_dict=get_word_vector_dict(url_path_or_dict)
    else:
        word_vector_dict=url_path_or_dict
    
    # Words become the columns first, so transpose to get one row per word.
    word_vector_df=pd.DataFrame(word_vector_dict).transpose()
    
    return word_vector_df
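
The loaders above read the whole file with `readlines()` before parsing it. As an alternative, here is a minimal sketch that parses the same format one line at a time and stores each vector as a `float32` NumPy array, which may reduce memory use for the larger GloVe files. `parse_glove_lines` is a hypothetical helper, not part of the notebook, and the two-line sample stands in for a real GloVe file.

```python
import io
import numpy as np

def parse_glove_lines(lines):
    """Build a {word: vector} dict from an iterable of GloVe-format lines."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # First token is the word, the rest are the vector components.
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return vectors

# Tiny in-memory stand-in for a GloVe file (made-up values).
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
sample_vectors = parse_glove_lines(sample)
print(sample_vectors["cat"])
```

Because an open file object is also an iterable of lines, the same function could be called as `parse_glove_lines(glove_text)` inside a `with open(...)` block.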

### Setting up the Word Vectors
As mentioned above, the choice of word vector model is still open. It is also important to note that the dimension of the word vectors is a hyperparameter of the neural networks to come. To keep our options open we import word vectors of a few different dimensions, and the custom functions defined above make this a 'one line of code' affair (dictionary or dataframe).


In [4]:
word_vector_50_dict=get_word_vector_dict("C:/Users/Lukas Buteliauskas/Desktop/glove.6B.50d.txt")
word_vector_50_df=get_word_vector_df(word_vector_50_dict)
vocab=np.array(list(word_vector_50_dict.keys())) # 400k words as per the documentation. list() is needed so NumPy builds an array of words rather than wrapping the dict_keys view.
word_vector_100_dict=get_word_vector_dict("C:/Users/Lukas Buteliauskas/Desktop/glove.6B.100d.txt")
word_vector_100_df=get_word_vector_df(word_vector_100_dict)
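
To illustrate why the vector dimension matters downstream, here is a minimal sketch of stacking word vectors into a matrix whose second axis is that dimension. `build_embedding_matrix` and the toy dictionary are hypothetical and only show the shape bookkeeping; this is not necessarily how the later models consume the vectors.

```python
import numpy as np

def build_embedding_matrix(word_vector_dict, words, dim):
    """Stack the vectors for `words` into a (len(words), dim) matrix.
    Words missing from the dictionary keep an all-zero row."""
    matrix = np.zeros((len(words), dim), dtype=np.float32)
    for row, word in enumerate(words):
        vector = word_vector_dict.get(word)
        if vector is not None:
            matrix[row] = vector
    return matrix

# Toy 3-dimensional dictionary (made-up values, not real GloVe vectors).
toy_dict = {"the": [0.1, 0.2, 0.3], "cat": [-0.5, 0.0, 0.25]}
toy_matrix = build_embedding_matrix(toy_dict, ["the", "cat", "unseen"], dim=3)
print(toy_matrix.shape)
```

Swapping `word_vector_50_dict` for `word_vector_100_dict` would change only `dim`, which is what makes the dimension easy to treat as a hyperparameter.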

In [6]:
"""Some quick test prints/sanity checks to make sure that the word vectors loaded correctly. These prints are mostly
for the reader's benefit; most of the testing of the code has already been done, for this stage at least."""
print(word_vector_50_df.info())
print(word_vector_50_df.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 400000 entries, ! to ￥
Data columns (total 50 columns):
0     400000 non-null float64
1     400000 non-null float64
2     400000 non-null float64
3     400000 non-null float64
4     400000 non-null float64
5     400000 non-null float64
6     400000 non-null float64
7     400000 non-null float64
8     400000 non-null float64
9     400000 non-null float64
10    400000 non-null float64
11    400000 non-null float64
12    400000 non-null float64
13    400000 non-null float64
14    400000 non-null float64
15    400000 non-null float64
16    400000 non-null float64
17    400000 non-null float64
18    400000 non-null float64
19    400000 non-null float64
20    400000 non-null float64
21    400000 non-null float64
22    400000 non-null float64
23    400000 non-null float64
24    400000 non-null float64
25    400000 non-null float64
26    400000 non-null float64
27    400000 non-null float64
28    400000 non-null float64
29    400000 non-null float64

In [5]:
"""Original code for reading in and composing the dataframe for the word vectors."""

"""# Reading in the pre-trained 50-dimensional GloVe vectors
with open("C:/Users/Lukas Buteliauskas/Desktop/glove.6B.50d.txt", encoding="utf8") as glove_50d_text:
    word_lines=glove_50d_text.readlines()
    
# Array of arrays of words and their word vector values.
word_embeddings=[line.split(" ") for line in word_lines]
# Word vector dictionary. Words are the keys, word vectors are the values.
word_vector_dict={element[0]:list(map(float, element[1:])) for element in word_embeddings}
# Word vector dataframe. Word vectors are rows, words are the indices, columns are the dimensions of the word vectors. 
word_vector_df=pd.DataFrame(word_vector_dict).transpose()"""
