# Load Data

In [1]:
# Run the Python script that contains the same code as the data-loading notebook
%run loadData.py
# Now we can access train, dev, and test,
# along with trainSents, devSents, and testSents

# Demo Data

In [2]:
trainSents[0][2]

u'Phonograph records are generally described by their diameter in inches (12", 10", 7"), the rotational speed in rpm at which they are played (16 2\u20443, 33 1\u20443, 45, 78), and their time capacity resulting from a combination of those parameters (LP \u2013 long playing 33 1\u20443 rpm, SP \u2013 78 rpm single, EP \u2013 12-inch single or extended play, 33 or 45 rpm); their reproductive quality or level of fidelity (high-fidelity, orthophonic, full-range, etc.), and the number of audio channels provided (mono, stereo, quad, etc.).'

In [3]:
train[0][0]

{u'answer': u'long playing',
 u'answer_sentence': 2,
 u'question': u'What does LP stand for when it comes to time capacity?'}

In [4]:
documents = trainSents[0]
questions = train[0]

# Useful Imports

In [5]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

# Shared Workflow Thoughts (dealing with .ipynb notebooks)

For each feature below, write generalized functions with clear names, grouped by type, that compose easily.

Then build or reuse the small demo template below, using the fixed documents/questions above, to build intuition, sanity-check results, and iterate quickly. This helps keep us all on the same page.

This keeps everything well contained, documented, and explainable, which will help with report writing.

Then, in a separate notebook, we do statistical validation/testing for error exploration and analysis, using the generalized functions above so they are easy to change or copy.

Finally, put it all in a Python file that does a full run. Write TODOs to illustrate next steps/improvements, so we can stay on top of them and track progress easily.

# Sentence Retrieval

The first part of your basic QA system will use a bag-of-words (BOW) vector space model to identify the sentence in the Wikipedia article which is most likely to contain the answer to a question, using standard information retrieval techniques. Here the "query" is the question, the "documents" are actually sentences, and each Wikipedia article should be viewed as a separate "document collection". You should apply various preprocessing steps appropriate to this situation, including term weighting; if you are at all uncertain about what choices to make, you should evaluate them using the dev data, and use the results to justify your choice in your report.
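
To make the vector space idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity over a few toy sentences. It is an illustration under simplifying assumptions (whitespace tokenization, a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, no stop-word removal). The helper names `tokenize`, `build_idf`, `tfidf_vector`, `cosine`, and `best_sentence` are hypothetical and not part of this notebook, which relies on scikit-learn's `TfidfVectorizer` instead.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split, stripping surrounding punctuation
    tokens = [tok.strip('.,;:?!()"\'') for tok in text.lower().split()]
    return [tok for tok in tokens if tok]

def build_idf(sentences):
    # Smoothed inverse document frequency; each sentence is one "document"
    n_docs = len(sentences)
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(tokenize(sentence)))
    return {term: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf_vector(text, idf):
    # Sparse vector as a dict; terms unseen in the collection are dropped
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}

def cosine(vec_a, vec_b):
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_sentence(question, sentences):
    # Return (index, similarity) of the sentence most similar to the question
    idf = build_idf(sentences)
    query_vec = tfidf_vector(question, idf)
    scores = [cosine(query_vec, tfidf_vector(s, idf)) for s in sentences]
    best_index = max(range(len(sentences)), key=lambda i: scores[i])
    return best_index, scores[best_index]

toy_sentences = [
    "The phonautograph recorded sound waves on paper.",
    "Records are described by their diameter in inches.",
    "LP stands for long playing.",
]
index, similarity = best_sentence("What does LP stand for?", toy_sentences)
```

Here the question shares the terms "lp" and "for" only with the third toy sentence, so that sentence gets the highest similarity. Note that "stand" and "stands" do not match without stemming or lemmatization, which is one reason the preprocessing choices matter.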

##### TODO

* Improve tuning of the preprocessing/lemmatize functions for the QA use case
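
One way to compare preprocessing choices is to measure how often the retrieved sentence index matches the gold `answer_sentence` field shown in the demo data. The sketch below assumes a hypothetical `retrieve(question_text, sentences)` callable that returns a sentence index, and assumes dev and devSents share the layout of train and trainSents. `retrieval_accuracy`, `overlap_retrieve`, and the toy data are illustrative names, not part of this notebook.

```python
def retrieval_accuracy(question_sets, sentence_sets, retrieve):
    # question_sets[i] is the list of question dicts for article i;
    # sentence_sets[i] is the list of sentences for the same article
    correct = 0
    total = 0
    for questions, sentences in zip(question_sets, sentence_sets):
        for question in questions:
            predicted = retrieve(question["question"], sentences)
            if predicted == question["answer_sentence"]:
                correct += 1
            total += 1
    return float(correct) / total if total else 0.0

def overlap_retrieve(question_text, sentences):
    # Placeholder retriever (an assumption for the example): pick the sentence
    # sharing the most lowercase word types with the question
    query_words = set(question_text.lower().split())
    overlaps = [len(query_words & set(s.lower().split())) for s in sentences]
    return overlaps.index(max(overlaps))

toy_sentences = [["Cats sleep a lot.", "Dogs bark at night.", "Birds sing at dawn."]]
toy_questions = [[
    {"question": "When do birds sing?", "answer_sentence": 2, "answer": "at dawn"},
    {"question": "What do dogs do at night?", "answer_sentence": 1, "answer": "bark"},
]]
accuracy = retrieval_accuracy(toy_questions, toy_sentences, overlap_retrieve)
```

Swapping `overlap_retrieve` for a wrapper around the TF-IDF functions in this notebook would give a single number per configuration, which makes it easier to justify a choice in the report.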

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

In [7]:
# Tuning functions

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Follow lemmatize function from guide notebook: WSTA_N1B_preprocessing.ipynb
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

def lemmatize(word):
    lemma = lemmatizer.lemmatize(word,'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word,'n')
    return lemma
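
The `lemmatize` function above is not yet wired into the vectorizer. One possible way to do so, sketched here as an assumption rather than this notebook's actual approach, is to build a tokenizer callable and pass it to `TfidfVectorizer` through its `tokenizer` argument. `make_tokenizer` is a hypothetical helper. The regular expression and the toy lemma table stand in for a real tokenizer and for `lemmatize`, so the example runs without NLTK data.

```python
import re

def make_tokenizer(lemmatize_fn, stop_words):
    # Returns a callable mapping raw text to a list of lemmatized, non-stop-word tokens
    token_pattern = re.compile(r"[a-z0-9]+")

    def tokenizer(text):
        tokens = token_pattern.findall(text.lower())
        # Stop words are filtered before lemmatization
        return [lemmatize_fn(tok) for tok in tokens if tok not in stop_words]

    return tokenizer

# Toy stand-ins (assumptions) so the sketch is self-contained
toy_lemmas = {"records": "record", "played": "play", "playing": "play"}
toy_stop_words = {"the", "are", "at", "which", "they"}

tokenize = make_tokenizer(lambda word: toy_lemmas.get(word, word), toy_stop_words)
tokens = tokenize("The records are played at 78 rpm")
```

In the notebook this might look like `TfidfVectorizer(tokenizer=make_tokenizer(lemmatize, stop_words))`. The regex here keeps only ASCII letters and digits, so a name such as "Léon" would be split; a Unicode-aware pattern would be needed for real data.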

In [8]:
# Core functions

def vectorize_documents(text_documents):

    vectorizer = TfidfVectorizer(stop_words='english')
    vector_documents = vectorizer.fit_transform(text_documents)
    
    return [vector_documents, vectorizer]

def vectorize_query(vectorizer, text_query):
    return vectorizer.transform([text_query])

def process_neighbours(vector_documents):
    
    # Pass n_neighbors by keyword; recent scikit-learn versions reject it positionally
    neighbours = NearestNeighbors(n_neighbors=1, algorithm="brute", metric="cosine")
    neighbours.fit(vector_documents)
    
    return neighbours

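def closest_document(neighbours, vector_query):

    # kneighbors returns (distances, indices), each of shape (n_queries, n_neighbors)
    result = neighbours.kneighbors(vector_query, n_neighbors=1, return_distance=True)

    result_index = result[1][0][0]
    # Note: with metric="cosine" this value is the cosine *distance*
    # (1 - cosine similarity), so lower means a closer match
    result_similarity = result[0][0][0]
    
    return [result_similarity, result_index]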

In [9]:
# Demonstration function

def demo_process_set(questions, documents):
    
    vector_documents, vectorizer = vectorize_documents(documents)
    analyze = vectorizer.build_analyzer()
    neighbours = process_neighbours(vector_documents)

    print "=" * 20
    print "Vector documents shape: {0}".format(vector_documents.shape)
    print "Actual documents length: {0}".format(len(documents))
    print "=" * 20, "\n"
    
    for question in questions[10:10+3]:
        
        text_query = question["question"]

        print "Text query:\n\n\t{0}\n".format(text_query)

        vector_query = vectorize_query(vectorizer, text_query)

        print "Vector query shape:\n\n\t{0}".format(vector_query.shape)

        result_similarity, result_index = closest_document(neighbours, vector_query)
        
        print

        print "Result:\n\n\tSimilarity ({0}), Index ({1})\n".format(result_similarity, result_index)

        print

        print "Query (text):\n\n\t{0}\n".format(text_query)
        print "Document (text):\n\n\t{0}".format(documents[result_index].encode("utf-8"))

        print

        print "Query (vector text):\n"
        pp.pprint(analyze(text_query))
        print
        
        print "Document (vector text): \n\n"
        pp.pprint(analyze(documents[result_index]))
        
        print "\n", "=" * 20, "\n"

In [10]:
demo_process_set(questions, documents)

Vector documents shape: (463, 2425)
Actual documents length: 463

Text query:

	What was the original intent of the phonautograph?

Vector query shape:

	(1, 2425)

Result:

	Similarity (0.72338690096), Index (9)


Query (text):

	What was the original intent of the phonautograph?

Document (text):

	The phonautograph, patented by Léon Scott in 1857, used a vibrating diaphragm and stylus to graphically record sound waves as tracings on sheets of paper, purely for visual analysis and without any intent of playing them back.

Query (vector text):

[u'original', u'intent', u'phonautograph']

Document (vector text): 


[   u'phonautograph',
    u'patented',
    u'l\xe9on',
    u'scott',
    u'1857',
    u'used',
    u'vibrating',
    u'diaphragm',
    u'stylus',
    u'graphically',
    u'record',
    u'sound',
    u'waves',
    u'tracings',
    u'sheets',
    u'paper',
    u'purely',
    u'visual',
    u'analysis',
    u'intent',
    u'playing']


Text query:

	In what years where phonauto

# Entity Extraction

##### TODO

* TODO

# Answer Ranking

##### TODO

* TODO