# Labeling answers
Given a plain text, we're going to output the words from it that we consider answer-worthy.

We'll first extract the words and their features from the new text, as we did in *Feature Engineering*, then we'll one-hot encode them and use our predictor from *Training*.

In [137]:
#Common imports 
import pandas as pd

## Pickling

In [138]:
import pickle
from pathlib import Path

def dumpPickle(fileName, content):
    #Protocol -1 selects the highest pickle protocol available
    with open(fileName, 'wb') as pickleFile:
        pickle.dump(content, pickleFile, -1)

def loadPickle(fileName):
    with open(fileName, 'rb') as pickleFile:
        return pickle.load(pickleFile)

def pickleExists(fileName):
    #True only if the path exists and is a regular file
    return Path(fileName).is_file()

## Extracting words and their features

In [139]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

#There seems to be a bug with spaCy's stop words: capitalized and
#upper-case variants are not flagged, so mark them explicitly.
from spacy.lang.en.stop_words import STOP_WORDS
for word in STOP_WORDS:
    for w in (word, word.capitalize(), word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True
        
#Extract answers and the sentence they are in
def extractAnswers(qas, doc):
    answers = []

    for senId, sentence in enumerate(doc.sents):
        #start_char and end_char are character offsets into the original text,
        #so the whitespace between sentences is accounted for
        for answer in qas:
            answerStart = answer['answers'][0]['answer_start']

            if (sentence.start_char <= answerStart < sentence.end_char):   #if answer lies within the sentence range
                answers.append({'sentenceId': senId, 'text': answer['answers'][0]['text']})

    return answers

#TODO - Clean answers from stopwords?
def tokenIsAnswer(token, sentenceId, answers):
    for i in range(len(answers)):
        if (answers[i]['sentenceId'] == sentenceId):
            if (answers[i]['text'] == token):
                return True
    return False

#Save named entities start points

def getNEStartIndexs(doc):
    neStarts = {}
    for ne in doc.ents:
        neStarts[ne.start] = ne
        
    return neStarts 

def getSentenceStartIndexes(doc):
    senStarts = []
    
    for sentence in doc.sents:
        senStarts.append(sentence[0].i)
    
    return senStarts
    
def getSentenceForWordPosition(wordPos, senStarts):
    for i in range(1, len(senStarts)):
        if (wordPos < senStarts[i]):
            return i - 1
    #No later sentence start was found, so the word is in the last sentence
    return len(senStarts) - 1
        
def addWordsForParagrapgh(newWords, text):
    doc = nlp(text)

    neStarts = getNEStartIndexs(doc)
    senStarts = getSentenceStartIndexes(doc)
    
    #index of word in spacy doc text
    i = 0
    
    while (i < len(doc)):
        #If the token is the start of a named entity, add the whole entity and move the index to its end
        if (i in neStarts):
            word = neStarts[i]
            #add word
            currentSentence = getSentenceForWordPosition(word.start, senStarts)
            wordLen = word.end - word.start
            shape = ''
            for wordIndex in range(word.start, word.end):
                shape += (' ' + doc[wordIndex].shape_)

            newWords.append([word.text,
                            0,
                            0,
                            currentSentence,
                            wordLen,
                            word.label_,
                            None,
                            None,
                            None,
                            shape])
            i = neStarts[i].end - 1
        #If not a NE, add the word only if it is alphabetic and not a stop word
        else:
            if (doc[i].is_stop == False and doc[i].is_alpha == True):
                word = doc[i]

                currentSentence = getSentenceForWordPosition(i, senStarts)
                wordLen = 1

                newWords.append([word.text,
                                0,
                                0,
                                currentSentence,
                                wordLen,
                                None,
                                word.pos_,
                                word.tag_,
                                word.dep_,
                                word.shape_])
        i += 1
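
The sentence lookup used above scans the list of sentence starts one by one. A minimal sketch of an alternative, assuming the start indexes are sorted in ascending order (which they are when collected by walking `doc.sents` in order), is a binary search with the standard library's `bisect`. The function name `sentence_for_word_position` and the sample start indexes are hypothetical and only serve the illustration; this version also returns an index for tokens in the final sentence.

```python
from bisect import bisect_right

def sentence_for_word_position(word_pos, sen_starts):
    # sen_starts must be sorted ascending.
    # bisect_right counts how many sentence starts are <= word_pos;
    # subtracting 1 gives the index of the sentence containing the token.
    return bisect_right(sen_starts, word_pos) - 1

# Hypothetical start indexes for a three-sentence document.
example_starts = [0, 9, 25]

print(sentence_for_word_position(0, example_starts))   # 0
print(sentence_for_word_position(8, example_starts))   # 0
print(sentence_for_word_position(9, example_starts))   # 1
print(sentence_for_word_position(30, example_starts))  # 2
```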


## Loading the text for which we want to label the words

In [140]:
train = pd.read_json('../data/squad-v1/train-v1.1.json', orient='columns')
dev = pd.read_json('../data/squad-v1/dev-v1.1.json', orient='columns')

df = pd.concat([train, dev], ignore_index=True)

In [141]:
titleId = 0
paragraphId = 0

text = df['data'][titleId]['paragraphs'][paragraphId]['context']
text

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

In [142]:
words = []
addWordsForParagrapgh(words, text)

In [143]:
wordColumns = ['text', 'titleId', 'paragrapghId', 'sentenceId','wordCount', 'NER', 'POS', 'TAG', 'DEP','shape']
df = pd.DataFrame(words, columns=wordColumns)
df.head()

Unnamed: 0,text,titleId,paragrapghId,sentenceId,wordCount,NER,POS,TAG,DEP,shape
0,Architecturally,0,0,0.0,1,,ADV,RB,advmod,Xxxxx
1,school,0,0,0.0,1,,NOUN,NN,nsubj,xxxx
2,Catholic,0,0,0.0,1,NORP,,,,Xxxxx
3,character,0,0,0.0,1,,NOUN,NN,dobj,xxxx
4,Atop,0,0,1.0,1,,ADP,IN,prep,Xxxx


## One-hot encoding

In [144]:
columnsToEncode = ['NER', 'POS', "TAG", 'DEP']

for column in columnsToEncode:
    print(column)
    one_hot = pd.get_dummies(df[column])
    one_hot = one_hot.add_prefix(column + '_')

    df = df.drop(column, axis = 1)
    df = df.join(one_hot)

NER
POS
TAG
DEP


In [145]:
df.head()

Unnamed: 0,text,titleId,paragrapghId,sentenceId,wordCount,shape,NER_CARDINAL,NER_DATE,NER_FAC,NER_GPE,...,DEP_amod,DEP_appos,DEP_attr,DEP_compound,DEP_conj,DEP_dobj,DEP_nsubj,DEP_pobj,DEP_prep,DEP_relcl
0,Architecturally,0,0,0.0,1,Xxxxx,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,school,0,0,0.0,1,xxxx,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
2,Catholic,0,0,0.0,1,Xxxxx,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,character,0,0,0.0,1,xxxx,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,Atop,0,0,1.0,1,Xxxx,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


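Dammit! One-hot encoding on the full sample gave me more columns than it does here, because this single paragraph doesn't contain every NER, POS, TAG and DEP value. I need to add the missing columns so the frame matches what the predictor expects.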

In [146]:
predictorFeaturesName = '../data/pickles/nb-predictor-features.pkl'
predictorColumns = loadPickle(predictorFeaturesName)


In [147]:
wordsDf = pd.DataFrame(columns=predictorColumns)

In [148]:
wordsDf.head()

Unnamed: 0,wordCount,NER_CARDINAL,NER_DATE,NER_EVENT,NER_FAC,NER_GPE,NER_LANGUAGE,NER_LAW,NER_LOC,NER_MONEY,...,DEP_nummod,DEP_oprd,DEP_parataxis,DEP_pcomp,DEP_pobj,DEP_poss,DEP_predet,DEP_prep,DEP_relcl,DEP_xcomp


In [149]:
wordsDf.columns
len(wordsDf)

0

In [150]:
df.columns
len(df)

54

In [151]:
for column in wordsDf.columns:
    if (column in df.columns):
        wordsDf[column] = df[column]
    else:
        wordsDf[column] = 0
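
The column-by-column loop above could probably be expressed with `DataFrame.reindex`. The sketch below is not the notebook's own code: `training_columns` and `sample` are hypothetical stand-ins for the pickled predictor columns and the words frame, kept tiny so the effect is visible.

```python
import pandas as pd

# Hypothetical stand-ins: the columns the predictor was trained on,
# and a small frame of words from one paragraph.
training_columns = ['wordCount', 'POS_ADV', 'POS_NOUN', 'POS_VERB']
sample = pd.DataFrame({'text': ['school', 'Atop'],
                       'wordCount': [1, 1],
                       'POS': ['NOUN', 'ADP']})

# One-hot encode only the categories present in this sample.
encoded = sample.join(pd.get_dummies(sample['POS'], prefix='POS')).drop(columns='POS')

# reindex keeps the training columns in training order, fills the ones
# missing from this sample with 0, and drops columns the predictor never saw.
aligned = encoded.reindex(columns=training_columns, fill_value=0)

print(list(aligned.columns))
```

Here `POS_ADV` and `POS_VERB` are created and filled with 0, while `text` and `POS_ADP` are dropped because they are not in the hypothetical training list.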

In [152]:
wordsDf.head()

Unnamed: 0,wordCount,NER_CARDINAL,NER_DATE,NER_EVENT,NER_FAC,NER_GPE,NER_LANGUAGE,NER_LAW,NER_LOC,NER_MONEY,...,DEP_nummod,DEP_oprd,DEP_parataxis,DEP_pcomp,DEP_pobj,DEP_poss,DEP_predet,DEP_prep,DEP_relcl,DEP_xcomp
0,1,False,False,0,False,False,0,0,False,0,...,0,0,0,0,False,0,0,False,False,0
1,1,False,False,0,False,False,0,0,False,0,...,0,0,0,0,False,0,0,False,False,0
2,1,False,False,0,False,False,0,0,False,0,...,0,0,0,0,False,0,0,False,False,0
3,1,False,False,0,False,False,0,0,False,0,...,0,0,0,0,False,0,0,False,False,0
4,1,False,False,0,False,False,0,0,False,0,...,0,0,0,0,False,0,0,True,False,0


In [153]:
wordsDf.columns

Index(['wordCount', 'NER_CARDINAL', 'NER_DATE', 'NER_EVENT', 'NER_FAC',
       'NER_GPE', 'NER_LANGUAGE', 'NER_LAW', 'NER_LOC', 'NER_MONEY',
       'NER_NORP', 'NER_ORDINAL', 'NER_ORG', 'NER_PERCENT', 'NER_PERSON',
       'NER_PRODUCT', 'NER_QUANTITY', 'NER_TIME', 'NER_WORK_OF_ART', 'POS_ADJ',
       'POS_ADP', 'POS_ADV', 'POS_NOUN', 'POS_NUM', 'POS_PROPN', 'POS_SCONJ',
       'POS_VERB', 'TAG_CD', 'TAG_IN', 'TAG_JJ', 'TAG_JJR', 'TAG_JJS',
       'TAG_NN', 'TAG_NNP', 'TAG_NNPS', 'TAG_NNS', 'TAG_RB', 'TAG_RBR',
       'TAG_RBS', 'TAG_VB', 'TAG_VBD', 'TAG_VBG', 'TAG_VBN', 'TAG_VBP',
       'TAG_VBZ', 'DEP_ROOT', 'DEP_acl', 'DEP_acomp', 'DEP_advcl',
       'DEP_advmod', 'DEP_amod', 'DEP_appos', 'DEP_attr', 'DEP_aux',
       'DEP_auxpass', 'DEP_cc', 'DEP_ccomp', 'DEP_compound', 'DEP_conj',
       'DEP_csubj', 'DEP_dative', 'DEP_dep', 'DEP_dobj', 'DEP_nmod',
       'DEP_npadvmod', 'DEP_nsubj', 'DEP_nsubjpass', 'DEP_nummod', 'DEP_oprd',
       'DEP_parataxis', 'DEP_pcomp', 'DEP_pobj', 'DEP_p

I can't believe this worked.

## Predict

In [154]:
predictorPickleName = '../data/pickles/nb-predictor.pkl'
predictor = loadPickle(predictorPickleName)

In [155]:
y_pred = predictor.predict(wordsDf)

print(len(wordsDf))
wordsDf.columns

54


Index(['wordCount', 'NER_CARDINAL', 'NER_DATE', 'NER_EVENT', 'NER_FAC',
       'NER_GPE', 'NER_LANGUAGE', 'NER_LAW', 'NER_LOC', 'NER_MONEY',
       'NER_NORP', 'NER_ORDINAL', 'NER_ORG', 'NER_PERCENT', 'NER_PERSON',
       'NER_PRODUCT', 'NER_QUANTITY', 'NER_TIME', 'NER_WORK_OF_ART', 'POS_ADJ',
       'POS_ADP', 'POS_ADV', 'POS_NOUN', 'POS_NUM', 'POS_PROPN', 'POS_SCONJ',
       'POS_VERB', 'TAG_CD', 'TAG_IN', 'TAG_JJ', 'TAG_JJR', 'TAG_JJS',
       'TAG_NN', 'TAG_NNP', 'TAG_NNPS', 'TAG_NNS', 'TAG_RB', 'TAG_RBR',
       'TAG_RBS', 'TAG_VB', 'TAG_VBD', 'TAG_VBG', 'TAG_VBN', 'TAG_VBP',
       'TAG_VBZ', 'DEP_ROOT', 'DEP_acl', 'DEP_acomp', 'DEP_advcl',
       'DEP_advmod', 'DEP_amod', 'DEP_appos', 'DEP_attr', 'DEP_aux',
       'DEP_auxpass', 'DEP_cc', 'DEP_ccomp', 'DEP_compound', 'DEP_conj',
       'DEP_csubj', 'DEP_dative', 'DEP_dep', 'DEP_dobj', 'DEP_nmod',
       'DEP_npadvmod', 'DEP_nsubj', 'DEP_nsubjpass', 'DEP_nummod', 'DEP_oprd',
       'DEP_parataxis', 'DEP_pcomp', 'DEP_pobj', 'DEP_p

In [156]:
y_pred

array([ True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True, False,  True, False,  True,  True,
       False,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True, False,  True])

## Moment of truth

In [157]:
for i in range(len(y_pred)):
    if (y_pred[i]):   
        print('T', df.iloc[i]['text'])
    else:
        print('F', df.iloc[i]['text'])

T Architecturally
T school
T Catholic
T character
F Atop
T the Main Building's
T gold
T dome
T golden
F statue
T the Virgin Mary.
T Immediately
T the Main Building
F facing
T copper
F statue
T Christ
T arms
F upraised
T legend
T Venite Ad Me Omnes
T the Main Building
F Basilica
T the Sacred Heart
T Immediately
T basilica
T Grotto
T Marian
T place
T prayer
T reflection
F replica
T grotto
T Lourdes
T France
T the Virgin Mary
T reputedly
F appeared
T Saint Bernadette Soubirous
T 1858
T end
T main
T drive
T direct
T line
F connects
T 3
T statues
T the Gold Dome
T simple
T modern
T stone
F statue
T Mary


Well, since most of the words are labeled as answers, let's look at just the ones that were not.

In [158]:
for i in range(len(y_pred)):
    if (y_pred[i] == False):   
        print(df.iloc[i]['text'])

Atop
statue
facing
statue
upraised
Basilica
replica
appeared
connects
statue


Wow! That's pretty great actually!

It would be great to get a confidence score for each word being an answer, rather than just a binary classification.
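
A minimal sketch of how that might look, assuming the pickled predictor is a scikit-learn classifier that exposes `predict_proba` (the file name `nb-predictor.pkl` suggests Naive Bayes, but that is an assumption). The toy data and the names `toy_predictor`, `X_train`, `y_train` and `X_new` are hypothetical and stand in for the real model and feature matrix.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data standing in for the real feature matrix:
# two binary features per word, label True when the word is an answer.
X_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])
y_train = np.array([True, True, False, False, True, False])

toy_predictor = GaussianNB().fit(X_train, y_train)

X_new = np.array([[1, 0], [0, 1]])

# predict_proba returns one row per sample and one column per class,
# ordered as in toy_predictor.classes_.
probabilities = toy_predictor.predict_proba(X_new)
answer_column = list(toy_predictor.classes_).index(True)
answer_confidence = probabilities[:, answer_column]

for features, confidence in zip(X_new, answer_confidence):
    print(features, round(float(confidence), 3))
```

With the real predictor, the equivalent call would be `predictor.predict_proba(wordsDf)`, provided the loaded model supports it.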

Let's not spoil this magic moment, when I actually believe I'm classifying all the appropriate words, and move on to the incorrect answer generation.