# Data exploration - SQuAD v1

In [93]:
#Common imports 
import pandas as pd
from IPython.display import Markdown, display, clear_output
from nltk import tokenize
from scipy import stats
from IPython.core.debugger import set_trace
from pathlib import Path

### Pretty printing

In [94]:
def printBold(string):
    display(Markdown('**' + string + '**'))
    
#TODO    
#def printColor():
#     display(Markdown('<span style="color:blue">blue</span>'))

### Pickling

In [95]:
import _pickle as cPickle
from pathlib import Path

def dumpPickle(fileName, content):
    # Protocol -1 selects the highest pickle protocol available.
    with open(fileName, 'wb') as pickleFile:
        cPickle.dump(content, pickleFile, -1)

def loadPickle(fileName):
    with open(fileName, 'rb') as file:
        return cPickle.load(file)

def pickleExists(fileName):
    return Path(fileName).is_file()

## Reading the datasets

Since we aren't actually answering the questions, which is what the dataset is intended for, we'll merge the train and dev sets into one. The test set is probably hidden, since there's a competition for it.

In [96]:
train = pd.read_json('../data/squad-v1/train-v1.1.json', orient='columns')
dev = pd.read_json('../data/squad-v1/dev-v1.1.json', orient='columns')

In [97]:
#Merge the train and dev sets into a single dataframe
df = pd.concat([train, dev], ignore_index=True)

In [98]:
df.head()

Unnamed: 0,data,version
0,"{'title': 'University_of_Notre_Dame', 'paragra...",1.1
1,"{'title': 'Beyoncé', 'paragraphs': [{'context'...",1.1
2,"{'title': 'Montana', 'paragraphs': [{'context'...",1.1
3,"{'title': 'Genocide', 'paragraphs': [{'context...",1.1
4,"{'title': 'Antibiotics', 'paragraphs': [{'cont...",1.1


Let's look at what we've got.

In [99]:
def showQuestion(titleId, paragraphId, questionId):

    title = df['data'][titleId]['title']
    paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
    question = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['question']
    answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']
    answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']

    printBold('Title')
    print(title)
    printBold('Paragraph')
    print(paragraph)
    printBold('Question')
    print(question)
    printBold('Answer')
    print(answerStart)
    print(answer)

In [100]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

515
Saint Bernadette Soubirous


## Dataset size

In [101]:
titlesCount = len(df['data'])
totalParagraphsCount = 0
totalQuestionsCount = 0

for titleId in range(titlesCount):
    paragraphsCount = len(df['data'][titleId]['paragraphs'])
    totalParagraphsCount += paragraphsCount
    
    for paragraphId in range(paragraphsCount):
        questionsCount = len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])
        
        totalQuestionsCount += questionsCount
        
print('Titles', titlesCount)
print('Paragraphs', totalParagraphsCount)
print('Questions', totalQuestionsCount)

Titles 490
Paragraphs 20963
Questions 98169


In [102]:
df['data']

0      {'title': 'University_of_Notre_Dame', 'paragra...
1      {'title': 'Beyoncé', 'paragraphs': [{'context'...
2      {'title': 'Montana', 'paragraphs': [{'context'...
3      {'title': 'Genocide', 'paragraphs': [{'context...
4      {'title': 'Antibiotics', 'paragraphs': [{'cont...
                             ...                        
485    {'title': 'Islamism', 'paragraphs': [{'context...
486    {'title': 'Imperialism', 'paragraphs': [{'cont...
487    {'title': 'United_Methodist_Church', 'paragrap...
488    {'title': 'French_and_Indian_War', 'paragraphs...
489    {'title': 'Force', 'paragraphs': [{'context': ...
Name: data, Length: 490, dtype: object
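
The title → paragraph → question nesting is walked by hand several times in this notebook. As a rough sketch (not something the notebook itself does), the structure could be flattened once into one record per question. `flattenSquad` and the tiny `sampleArticles` list are hypothetical names and made-up data in the same shape as `df['data']`.

```python
def flattenSquad(articles):
    """Yield one flat record per question from SQuAD-style nested articles."""
    for article in articles:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                # Like the rest of the notebook, only the first answer is used.
                firstAnswer = qa['answers'][0]
                yield {
                    'title': article['title'],
                    'context': paragraph['context'],
                    'question': qa['question'],
                    'answer': firstAnswer['text'],
                    'answerStart': firstAnswer['answer_start'],
                }

# Hand-made sample in the same shape as df['data'] (not real SQuAD rows).
sampleArticles = [{
    'title': 'Example',
    'paragraphs': [{
        'context': 'Paris is the capital of France.',
        'qas': [{'question': 'What is the capital of France?',
                 'answers': [{'text': 'Paris', 'answer_start': 0}]}],
    }],
}]

records = list(flattenSquad(sampleArticles))
print(len(records), records[0]['answer'])
```

On the real data, something like `pd.DataFrame(list(flattenSquad(df['data'])))` should give one row per question, which would make the later per-question loops simpler.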

## Titles

In [103]:
titles = []
for titleId in range(len(df['data'])):
    titles.append(df['data'][titleId]['title'])
    
for i in range(20):
    print(titles[i])

University_of_Notre_Dame
Beyoncé
Montana
Genocide
Antibiotics
Frédéric_Chopin
Sino-Tibetan_relations_during_the_Ming_dynasty
IPod
The_Legend_of_Zelda:_Twilight_Princess
Spectre_(2015_film)
2008_Sichuan_earthquake
New_York_City
To_Kill_a_Mockingbird
Solar_energy
Tajikistan
Anthropology
Portugal
Kanye_West
Buddhism
American_Idol


The titles are fairly random. There seem to be a lot of locations, such as countries and cities, but not nearly enough to justify splitting the dataset.

## Questions

One of our main assumptions is that the sentence that contains the answer could be turned into a question just by removing the answer from it. Let's see how much of that is true for the questions in this dataset.

In [104]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

515
Saint Bernadette Soubirous


In [105]:
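def extractSentence(paragraph, answerStart):
    #Walk through the sentences while tracking the character offset where each one starts,
    #and return the first sentence whose end lies at or beyond the answer's start offset.
    sentences = tokenize.sent_tokenize(paragraph)
    sentenceStart = 0

    for sentence in sentences:
        if (sentenceStart + len(sentence) >= answerStart):
            return sentence

        #Assumes exactly one separator character between consecutive sentences.
        sentenceStart += len(sentence) + 1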
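
The offset bookkeeping above assumes a single separator character between sentences, so it can drift when a paragraph contains double spaces or newlines. A minimal sketch of an alternative is to keep the real character span of every sentence. To stay self-contained it uses a naive regex splitter instead of NLTK, so it will split on abbreviations like "Dr."; `sentenceSpans` and `sentenceAt` are hypothetical names. NLTK's Punkt tokenizer also offers `span_tokenize`, which would probably be the better fit here.

```python
import re

def sentenceSpans(paragraph):
    """Return (start, end, sentence) triples using a naive regex splitter."""
    spans = []
    # A sentence starts at a non-space character and runs up to ., ! or ?
    # followed by whitespace, or up to the end of the paragraph.
    pattern = r'\S.*?(?:[.!?](?=\s|$)|$)'
    for match in re.finditer(pattern, paragraph, flags=re.S):
        spans.append((match.start(), match.end(), match.group()))
    return spans

def sentenceAt(paragraph, answerStart):
    """Return the sentence whose character span contains answerStart."""
    for start, end, sentence in sentenceSpans(paragraph):
        if start <= answerStart < end:
            return sentence
    return None  # offset falls between sentences or outside the paragraph

# Note the double space after the first sentence.
paragraph = 'The dome is gold.  A statue stands on top. It depicts Mary.'
print(sentenceAt(paragraph, paragraph.index('statue')))
```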

In [106]:
import nltk
nltk.download('punkt')

paragraph = df['data'][0]['paragraphs'][0]['context']
answerStart = df['data'][0]['paragraphs'][0]['qas'][0]['answers'][0]['answer_start']

sentence = extractSentence(paragraph, answerStart)
print(sentence)

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ktmay\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [107]:
#Containment score: the fraction of the question's tokens that also appear in the text
def containedInText(text, question):
    
    questionWords = tokenize.word_tokenize(question.lower())
    textWords = tokenize.word_tokenize(text.lower())
    wordsContained = 0

    for questionWord in questionWords:
        for textWord in textWords:
            if (questionWord == textWord):
                wordsContained += 1
                break

    print(len(questionWords))
    return wordsContained / len(questionWords)

In [108]:
question =  df['data'][0]['paragraphs'][0]['qas'][0]['question']
contained = containedInText(sentence, question)

14


In [109]:
printBold('Question')
print(question)
printBold('Sentence')
print(sentence)
printBold("Contained")
print(contained)


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Sentence**

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.


**Contained**

0.6428571428571429


I wouldn't expect 100% containment, simply because the questions contain **question words** like *Why*, *Who*, *Whom*, *What*.

In this example we also see that the word *appear* occurs in the original sentence, but in the **past tense** (*appeared*). We could handle that by comparing the **stems** of the words, but I think it's better to measure the least imaginative way of forming questions.

We are also counting **stopwords** (common words like *to*, *the*, *in*), which could occur anywhere in the sentence, but again we want to measure the least creative questions.

In this sentence *(damn, that was a good example)* we also see that the question uses the word *allegedly*, which is a **synonym** of *reputedly* in the sentence. Handling synonyms could be nice for question forming, but I think it's overkill.

We also see that the question actually covers the **words around the answer, rather than the entire sentence**. That is a definite must-do when we form our questions.
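
To get a feeling for how much the stopwords matter, here is a small sketch of the same containment idea with an optional stopword filter. It is not the function used for the numbers in this notebook: it tokenizes with a simple regex (so punctuation like *?* is not counted as a token), and `STOPWORDS` is a tiny illustrative list I made up, not NLTK's.

```python
import re

# Small illustrative stopword list; an assumption, not NLTK's full list.
STOPWORDS = {'a', 'an', 'the', 'to', 'in', 'of', 'is', 'did', 'it'}

def simpleTokens(text):
    """Lowercase the text and keep only alphanumeric runs."""
    return re.findall(r'[a-z0-9]+', text.lower())

def containment(text, question, dropStopwords=False):
    """Fraction of question tokens that also occur in the text."""
    questionWords = simpleTokens(question)
    textWords = set(simpleTokens(text))  # set lookup instead of a nested loop
    if dropStopwords:
        questionWords = [w for w in questionWords if w not in STOPWORDS]
    if not questionWords:
        return 0.0  # avoid dividing by zero for empty questions
    return sum(w in textWords for w in questionWords) / len(questionWords)

sentence = ('It is a replica of the grotto at Lourdes, France where the Virgin Mary '
            'reputedly appeared to Saint Bernadette Soubirous in 1858.')
question = 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'

print(containment(sentence, question))
print(containment(sentence, question, dropStopwords=True))
```

Because of the different tokenization, the first score will not be identical to the one reported by `containedInText`, but the comparison with and without stopwords is what matters here.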

Let's see what the score is on all of the questions. I'm also curious to see the score on the entire paragraph.

This may come in handy in the future. Pretty printing the progress.

In [110]:
#Printing the percentage completed
def printPercentage(currentStep, maxStep):
    stepSize = maxStep / 100   #number of steps that make up 1% of progress

    #Print only when the whole-percent progress is greater than at the previous step
    if (int(currentStep / stepSize) > ((currentStep - 1) / stepSize)):
        clear_output()
        print('{}%'.format(int(currentStep / stepSize)))

In [111]:
questionContainmentDfPickleName = 'pickles/questionContainmentDf.pkl'

#If the dataframe is already generated, load it.
if (pickleExists(questionContainmentDfPickleName)):
    print("Pickle found. Saved some time.")
    questionContainmentDf = loadPickle(questionContainmentDfPickleName)
else:
    sentenceScore = []
    paragraphScore = []

    #For each title
    titlesCount = len(df['data'])
    for titleId in range(titlesCount):
        printPercentage(titleId, titlesCount)

        #For each paragraph
        for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
            paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']

            #For each question
            for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
                question = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['question']
                answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']
                sentence = extractSentence(paragraph, answerStart) #the sentence that contains the answer

                sentenceScore.append(containedInText(sentence, question))
                paragraphScore.append(containedInText(paragraph, question))           
                
    #Merge dataframes into one                
    sentenceScoreDf = pd.DataFrame(sentenceScore, columns=['sentence'])
    paragraphScoreDf = pd.DataFrame(paragraphScore, columns=['paragraph'])

    questionContainmentDf = pd.concat([sentenceScoreDf, paragraphScoreDf], axis=1)
    
    #Pickle the result
    dumpPickle(questionContainmentDfPickleName, questionContainmentDf)
    
    print("Result not pickled. Generated and saved it.")


Pickle found. Saved some time.
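
The "load the pickle if it exists, otherwise compute and save" pattern will probably come back for other expensive steps, so here is a minimal sketch of it as a reusable helper. `loadOrCompute` is a hypothetical name, it uses the standard `pickle` module, and the demo writes into a temporary directory rather than the real `pickles/` folder.

```python
import pickle
import tempfile
from pathlib import Path

def loadOrCompute(fileName, compute):
    """Return the pickled result if present, otherwise compute, pickle and return it."""
    path = Path(fileName)
    if path.is_file():
        with path.open('rb') as f:
            return pickle.load(f)

    result = compute()
    # Create the target folder if needed, so the dump cannot fail on a missing directory.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as f:
        pickle.dump(result, f, pickle.HIGHEST_PROTOCOL)
    return result

calls = []

def expensive():
    calls.append(1)  # record how often the computation really runs
    return {'score': 0.5}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'pickles' / 'demo.pkl'
    first = loadOrCompute(target, expensive)   # computes and saves
    second = loadOrCompute(target, expensive)  # loads from disk

print(first == second, len(calls))
```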


In [112]:
questionContainmentDf.describe()

Unnamed: 0,sentence,paragraph
count,98169.0,98169.0
mean,0.463937,0.582157
std,0.190377,0.159055
min,0.0,0.0
25%,0.333333,0.5
50%,0.461538,0.6
75%,0.6,0.7
max,1.0,1.0


I would argue that almost half of the question words being contained in the answer sentence is a pretty good result.

As expected, containment within the entire paragraph is higher.

I do wonder about those questions that are 100% contained in the answer sentence.

In [113]:
questionContainmentDf.head(10)

Unnamed: 0,sentence,paragraph
0,0.642857,0.571429
1,0.636364,0.636364
2,0.533333,0.6
3,0.375,0.5
4,0.333333,0.416667
5,0.272727,0.636364
6,0.3,0.8
7,0.363636,0.727273
8,0.0,0.545455
9,0.266667,0.733333


In [114]:
#looking for specific cases in the questions
def getQuestionAt(index):
    currentIndex = 0
    
    for titleId in range(len(df['data'])):
        for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
            for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
                if (currentIndex == index):
                    return titleId, paragraphId, questionId
                currentIndex += 1
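
`getQuestionAt` rescans the dataset on every call. If many lookups are needed, one option is to build the list of `(titleId, paragraphId, questionId)` triples once and index into it. This is only a sketch: `buildQuestionIndex` is a hypothetical name and `sampleArticles` is made-up data shaped like `df['data']`.

```python
def buildQuestionIndex(articles):
    """List of (titleId, paragraphId, questionId) in the same order as the flat score rows."""
    index = []
    for titleId, article in enumerate(articles):
        for paragraphId, paragraph in enumerate(article['paragraphs']):
            for questionId in range(len(paragraph['qas'])):
                index.append((titleId, paragraphId, questionId))
    return index

# Made-up structure: first article has paragraphs with 2 and 1 questions, second has 1.
sampleArticles = [
    {'paragraphs': [{'qas': [{}, {}]}, {'qas': [{}]}]},
    {'paragraphs': [{'qas': [{}]}]},
]

questionIndex = buildQuestionIndex(sampleArticles)
print(questionIndex[2])
```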

Let's see question #8 which has 0 containment in the answer sentence. 

In [115]:
getQuestionAt(8)

(0, 1, 3)

In [116]:
titleId = 0
paragraphId = 1 
questionId = 3

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, when some students believed that The Observer began to show a conservative bias, a lib

**Question**

How many student news papers are found at Notre Dame?


**Answer**

126
three


The question is actually formed from the previous sentence.

### 0% containment

In [117]:
questionContainmentDf[questionContainmentDf['paragraph'] == 0].head()

Unnamed: 0,sentence,paragraph
269,0.0,0.0
363,0.0,0.0
505,0.0,0.0
2781,0.0,0.0
3678,0.0,0.0


In [118]:
getQuestionAt(269)

(1, 0, 0)

In [119]:
titleId = 1
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

Beyoncé


**Paragraph**

Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".


**Question**

When did Beyonce start becoming popular?


**Answer**

269
in the late 1990s


A **synonym** case: instead of *rose to fame*, *start becoming popular* is used.

In [120]:
getQuestionAt(505)

(1, 18, 6)

In [121]:
titleId = 1
paragraphId = 18 
questionId = 6

showQuestion(titleId, paragraphId, questionId)

**Title**

Beyoncé


**Paragraph**

In 2011, documents obtained by WikiLeaks revealed that Beyoncé was one of many entertainers who performed for the family of Libyan ruler Muammar Gaddafi. Rolling Stone reported that the music industry was urging them to return the money they earned for the concerts; a spokesperson for Beyoncé later confirmed to The Huffington Post that she donated the money to the Clinton Bush Haiti Fund. Later that year she became the first solo female artist to headline the main Pyramid stage at the 2011 Glastonbury Festival in over twenty years, and was named the highest-paid performer in the world per minute.


**Question**

When did this leak happen?


**Answer**

3
2011


That's just a bad question. It could only be asked in combination with the text.

### 100% containment

In [122]:
questionContainmentDf[questionContainmentDf['sentence'] == 1]

Unnamed: 0,sentence,paragraph
21911,1.0,1.0
39394,1.0,1.0
45064,1.0,1.0
48874,1.0,1.0
53226,1.0,1.0
67425,1.0,1.0


In [123]:
getQuestionAt(53226)

(258, 23, 0)

In [124]:
titleId = 258
paragraphId = 23 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

Utrecht


**Paragraph**

Utrecht city has an active cultural life, and in the Netherlands is second only to Amsterdam. There are several theatres and theatre companies. The 1941 main city theatre was built by Dudok. Besides theatres there is a large number of cinemas including three arthouse cinemas. Utrecht is host to the international Early Music Festival (Festival Oude Muziek, for music before 1800) and the Netherlands Film Festival. The city has an important classical music hall Vredenburg (1979 by Herman Hertzberger). Its acoustics are considered among the best of the 20th-century original music halls.[citation needed] The original Vredenburg music hall has been redeveloped as part of the larger station area redevelopment plan and in 2014 has gained additional halls that allowed its merger with the rock club Tivoli and the SJU jazzpodium. There are several other venues for music throughout the city. Young musicians are educated in the conservatory, a department of the Utrecht School of the Arts. There is 

**Question**

Cultural life in Utrecht is second to 


**Answer**

0
Utrecht city has an active cultural life, and in the Netherlands is second only to Amsterdam


Strange question. The question words all appear in the sentence, but not in the same order. The answer is the entire sentence, which obviously contains needless information. Looking further into it, the question is actually wrong, because it should say second *in the Netherlands*. This question should be scrapped...

In [125]:
getQuestionAt(67425)

(341, 25, 2)

In [126]:
titleId = 341
paragraphId = 25 
questionId = 2

showQuestion(titleId, paragraphId, questionId)

**Title**

Energy


**Paragraph**

Thermodynamics divides energy transformation into two kinds: reversible processes and irreversible processes. An irreversible process is one in which energy is dissipated (spread) into empty energy states available in a volume, from which it cannot be recovered into more concentrated forms (fewer quantum states), without degradation of even more energy. A reversible process is one in which this sort of dissipation does not happen. For example, conversion of energy from one type of potential field to another, is reversible, as in the pendulum system described above. In processes where heat is generated, quantum states of lower energy, present as possible excitations in fields between atoms, act as a reservoir for part of the energy, from which it cannot be recovered, in order to be converted with 100% efficiency into other forms of energy. In this case, the energy must partly stay as heat, and cannot be completely recovered as usable energy, except at the price of an increase in some ot

**Question**

A reversible process is one in which this does not happen.


**Answer**

406
dissipation


This is basically just the kind of question I expect to generate. The answer is removed and the sentence is descriptive enough to fill in the missing word.
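
As a minimal sketch of that idea, removing the answer from its sentence gives a fill-in-the-blank question. `makeClozeQuestion` and the blank marker are hypothetical choices, and only the first occurrence of the answer is replaced.

```python
def makeClozeQuestion(sentence, answer, blank='_____'):
    """Replace the first occurrence of the answer in the sentence with a blank."""
    position = sentence.find(answer)
    if position == -1:
        return None  # the answer does not occur in this sentence
    return sentence[:position] + blank + sentence[position + len(answer):]

sentence = 'A reversible process is one in which this sort of dissipation does not happen.'
print(makeClozeQuestion(sentence, 'dissipation'))
```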

### Summary

The assumption that the **question consists mostly of words from the sentence the answer is in** seems correct.

There are some obvious differences:
- **Question words** are added: who, why, when...
- **Synonyms** are used instead of the words in the sentence.
- Turning the sentence into a question can also change the **tense** of a word.
- In long sentences, only a **part of the sentence is used**. For example, when a sentence is split by commas, the commas often separate two logical statements.

I also managed to find some outliers, which turned out to be poorly asked questions.

## Answers

A couple of ideas to explore:
- Are all the answers phrases from the text?
- The type of the answers: numbers, dates, names, locations, similarity to the title
- Part of speech: verb, noun
- Answer length in words
- Words around the answer
- Answer location in the sentence: first word, last word
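
For the last idea in the list, a rough sketch of how the answer location could be classified is below. `answerPosition` is a hypothetical helper that works on plain characters, not tokens, and it assumes the answer text occurs in the sentence.

```python
def answerPosition(sentence, answer):
    """Classify where the answer sits in the sentence: 'start', 'middle' or 'end'."""
    start = sentence.find(answer)
    if start == -1:
        return None  # answer text not found in this sentence
    if start == 0:
        return 'start'
    # Ignore trailing punctuation when checking for the last position.
    if sentence.rstrip(' .!?').endswith(answer):
        return 'end'
    return 'middle'

sentence = ('It is a replica of the grotto at Lourdes, France where the Virgin Mary '
            'reputedly appeared to Saint Bernadette Soubirous in 1858.')
print(answerPosition(sentence, 'Saint Bernadette Soubirous'))
print(answerPosition(sentence, '1858'))
```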

### Answers contained in the text

In [127]:
answersInText = 0
answersNotInText = 0

for titleId in range(len(df['data'])):
    for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
        paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
        for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
            answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']
            if (answer in paragraph):
                answersInText += 1
            else:
                answersNotInText += 1
                
printBold('Answers in text')
print(answersInText)
printBold('Answers not in text')
print(answersNotInText)

**Answers in text**

98169


**Answers not in text**

0


All the answers are phrases from the text. Seems like that has been a requirement from the start, since the answers also have an index indicating their start location in the paragraph.
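
A slightly stricter check than `answer in paragraph` would be to confirm that the answer sits exactly at its recorded start index. A minimal sketch, with `answerMatchesOffset` as a hypothetical name and the start of the WikiLeaks paragraph shown earlier as the example:

```python
def answerMatchesOffset(context, answer, answerStart):
    """True if the answer text is found in the context exactly at answerStart."""
    return context[answerStart:answerStart + len(answer)] == answer

# Beginning of the paragraph shown earlier; the recorded answer start was 3.
context = 'In 2011, documents obtained by WikiLeaks revealed that Beyoncé was one of many entertainers'
print(answerMatchesOffset(context, '2011', 3))
```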

### Extracting the answers

In [128]:
answers = []
sentences = []

for titleId in range(len(df['data'])):

    for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
        paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']

        for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
            answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']
            answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']
            #answerStart is the character offset of the answer; fetch the sentence that contains it
            sentence = extractSentence(paragraph, answerStart)

            answers.append(answer)
            sentences.append(sentence)

In [129]:
answerTextsDf = pd.DataFrame(answers, columns=['answer'])
sentenceDf = pd.DataFrame(sentences, columns=['sentence'])

answersDf = pd.concat([answerTextsDf, sentenceDf], axis=1)
answersDf.head()

Unnamed: 0,answer,sentence
0,Saint Bernadette Soubirous,"It is a replica of the grotto at Lourdes, Fran..."
1,a copper statue of Christ,Immediately in front of the Main Building and ...
2,the Main Building,Next to the Main Building is the Basilica of t...
3,a Marian place of prayer and reflection,"Immediately behind the basilica is the Grotto,..."
4,a golden statue of the Virgin Mary,Atop the Main Building's gold dome is a golden...


### Answer word length

In [130]:
wordCount = []

for i in range(len(answersDf)):
    wordCount.append(len(tokenize.word_tokenize(answersDf.iloc[i]['answer'])))

In [131]:
answersDf = pd.concat([answersDf, pd.DataFrame(wordCount, columns=['wordCount'])], axis=1)

In [132]:
answersDf.head()

Unnamed: 0,answer,sentence,wordCount
0,Saint Bernadette Soubirous,"It is a replica of the grotto at Lourdes, Fran...",3
1,a copper statue of Christ,Immediately in front of the Main Building and ...,5
2,the Main Building,Next to the Main Building is the Basilica of t...,3
3,a Marian place of prayer and reflection,"Immediately behind the basilica is the Grotto,...",7
4,a golden statue of the Virgin Mary,Atop the Main Building's gold dome is a golden...,7

In [133]:
answersDf['wordCount'].describe()

count    98169.000000
mean         3.355031
std          3.731700
min          1.000000
25%          1.000000
50%          2.000000
75%          4.000000
max         46.000000
Name: wordCount, dtype: float64

In [134]:
answersDf['wordCount'].value_counts()

wordCount
1     32156
2     25228
3     14348
4      7562
5      4659
6      3051
7      2222
8      1676
9      1206
10      975
11      755
12      652
13      566
14      461
15      407
16      313
18      274
17      269
19      244
20      191
21      182
23      138
22      131
25      120
24      101
26       77
28       59
27       58
29       28
30       19
31       12
32       11
33        6
38        2
34        2
35        2
36        2
37        2
46        1
42        1
Name: count, dtype: int64

About 1/3 of the answers are single words, and almost 3/4 are at most 3 words long. Let's get an overview of the groups.
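
The shares can be checked directly from the counts in the `value_counts()` output above. This is just the arithmetic, with the three counts and the total copied by hand:

```python
# Counts copied from the value_counts() output above.
countsByLength = {1: 32156, 2: 25228, 3: 14348}
totalAnswers = 98169

running = 0
cumulativeShare = {}
for length in sorted(countsByLength):
    running += countsByLength[length]
    # Share of answers that are at most `length` words long.
    cumulativeShare[length] = running / totalAnswers
    print(length, round(cumulativeShare[length], 3))
```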

In [135]:
answersDf[answersDf['wordCount'] == 1].sample(20, random_state=42)

Unnamed: 0,answer,sentence,wordCount
94041,quickly,"As a practice area and specialist domain, phar...",1
16141,1985,"By 1985, the USFL had ceased football operatio...",1
4182,65000,"At 2.7 million in 2012, New York's non-Hispani...",1
70863,Jews,"Moving to reduce Italian influence, in October...",1
19072,148,"It has a number of parks and green spaces, the...",1
6351,Nepal,"According to Buddhist tradition, the Buddha li...",1
33608,5.5,It is estimated that 5.5 million tonnes of ura...,1
83840,Hannibal,Extraordinary circumstances called for extraor...,1
23810,Babylonia,The Roman abacus was used in Babylonia as earl...,1
8244,arrested,Several protesters who tried to disrupt the re...,1


There seem to be a lot of years and other numbers, and some names.

The two-word answers seem to be dominated by names. There are also a lot of answers where one of the words isn't useful. Some, like *a* and *the*, could easily be removed. *six years* and *three times* could also be turned into just 6 and 3. The *13.3%* seems to be just misplaced. Not sure if it's because of the *"."* or the *"%"*.

In [136]:
answersDf[answersDf['wordCount'] == 2].sample(n=20, random_state=5)

Unnamed: 0,answer,sentence,wordCount
62843,25 genes,"By the end of 2005, 25 genes had been associat...",2
4799,Notre Dame,"In 2006, Lee was awarded an honorary doctorate...",2
44145,Charles Pillsbury,There he met fellow student and later Green Pa...,2
68124,Lionel Robbins,"With the help of Mises, in the late 1920s Haye...",2
7152,The Beatles,"The single, ""A Moment Like This"", went on to b...",2
37185,Prince Albert,"Victoria married her first cousin, Prince Albe...",2
72091,Western Railroad,It was formerly used by the Milwaukee Road fro...,2
4851,Mockingbird groupies,"Local residents call them ""Mockingbird groupie...",2
89845,two points,"Luther's rediscovery of ""Christ and His salvat...",2
17988,Copeland Award,Kansas also won the 1981–82 Copeland Award.,2


The two-word answers seem to be dominated by names. There are also a lot of answers where one of the words isn't useful. Some, like *a*, *in* and *the*, could easily be removed. *six years* could be turned into just 6. The *23.02%* seems to be just misplaced. Not sure if it's because of the *"."* or the *"%"*.
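
Dropping those unhelpful leading words could look roughly like this. It is only a sketch: `stripLeadingFillers` and the `LEADING_FILLERS` set are hypothetical, and the set is limited to the words mentioned above.

```python
# Words that add little when they open an answer; an illustrative list only.
LEADING_FILLERS = {'a', 'an', 'the', 'in'}

def stripLeadingFillers(answer):
    """Remove filler words from the start of the answer, but never the last remaining word."""
    words = answer.split()
    while len(words) > 1 and words[0].lower() in LEADING_FILLERS:
        words = words[1:]
    return ' '.join(words)

for answer in ['the CAP theorem', 'in C4 plants', 'The Beatles', 'Notre Dame']:
    print(answer, '->', stripLeadingFillers(answer))
```

Note that *The Beatles* loses its article too, so names would probably need to be detected first (for example with NER) and left untouched.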

In [137]:
answersDf[answersDf['wordCount'] == 3].sample(n=20, random_state=5)

Unnamed: 0,answer,sentence,wordCount
76445,the CAP theorem,In recent years there was a high demand for ma...,3
28489,Futbol Club Barcelona,"With the end of Franco's dictatorship in 1974,...",3
85710,21st Army Group,Field Marshal Montgomery insisted priority be ...,3
96528,in C4 plants,Cyclic photophosphorylation is common in C4 pl...,3
61068,970 and 1190,The Chalukya dynasty ruled parts of southern a...,3
33328,General Auguste-Alexandre Ducrot,What made a bad situation much worse was the c...,3
64822,inside the egg,The fertilization and development takes place ...,3
24545,the South site,"Eventually, owing to space constraints and the...",3
54848,Stop TB Partnership,"The World Health Organization declared TB a ""g...",3
84203,the Roku player,Google made YouTube available on the Roku play...,3


Again names, more institution names as well. 

In [138]:
answersDf[answersDf['wordCount'] == 5].sample(n=20, random_state=5)

Unnamed: 0,answer,sentence,wordCount
23833,system of pulleys and wires,It used a system of pulleys and wires to autom...,5
46951,Inner London and Outer London,Greater London is split for some purposes into...,5
97610,within the Church of England,The movement which would become The United Met...,5
94650,Annual Status of Education Report,"The Annual Status of Education Report (ASER), ...",5
23462,less than one per cent,Throughout the period monks remained a very sm...,5
32401,a system of concentric layers,"Air defence in naval tactics, especially withi...",5
20665,the structure of the Alps,In simple terms the structure of the Alps cons...,5
74712,quadrivium and scholastic logic.,The people were associated with the studia hum...,5
22411,poor management and financial control,The Ministry of Defence has been criticised in...,5
76929,between 1 and 1.5 million,The total number of people killed has been mos...,5


As the number of words increases, it seems harder to create deceptive incorrect answers. A viable option for some would be to mix the individual words, like:

*end of World War I -> start of World War I, end of World War II, start of World War II, end of the Balkan Wars*...

*large tumour on her liver -> large tumour on her brain, large tumour on her lungs, large (some other medical term) on her liver*

Though this would become more difficult, because if we use 2 generated words, they must fit with each other as well as with the original words.
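
The word-mixing idea can be sketched as taking every combination of per-word alternatives and dropping the original answer. `generateDistractors` is a hypothetical name, and the `substitutions` table is hand-written here; where the alternative words would really come from (embeddings, a thesaurus, ...) is left open.

```python
from itertools import product

def generateDistractors(answer, substitutions):
    """Combine per-word alternatives into candidate incorrect answers."""
    words = answer.split()
    # For every word: the original word first, then its alternatives (if any).
    options = [[word] + substitutions.get(word, []) for word in words]
    candidates = (' '.join(combo) for combo in product(*options))
    return [candidate for candidate in candidates if candidate != answer]

# Hand-written alternatives, purely for illustration.
substitutions = {'end': ['start'], 'I': ['II']}
distractors = generateDistractors('end of World War I', substitutions)
print(distractors)
```

This does nothing to check that two swapped words still fit with each other, which is exactly the difficulty described above.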

Some of the answers look like logical phrases. For generating them, I would argue that a text-summarization approach would work. And with longer answers we could employ **True/False** questions.

In [139]:
answersDf[answersDf['wordCount'] == 20].sample(n=10, random_state=5)

Unnamed: 0,answer,sentence,wordCount
14715,"to saturate broken (""dangling"") bonds of amorp...","Hydrogen is employed to saturate broken (""dang...",20
70704,"Bullied for being a Bedouin, he was proud of h...","Bullied for being a Bedouin, he was proud of h...",20
39422,"the Bill & Melinda Gates Foundation Trust, whi...","In October 2006, the Bill & Melinda Gates Foun...",20
21062,There are 64 possible codons (four possible nu...,":6 Additionally, a ""start codon"", and three ""s...",20
5882,"Francesinha (Frenchie) from Porto, and bifanas...",Typical fast food dishes include the Francesin...,20
34718,into four summaries that look specifically at ...,"However, results can be further simplified int...",20
26946,elected members and special office bearers suc...,The legislature consists of elected members an...,20
96100,support from China for a planned $2.5 billion ...,"Kenyatta was ""[a]ccompanied by 60 Kenyan busin...",20
77789,"On 26 December 1999, Chelsea became the first ...","On 26 December 1999, Chelsea became the first ...",20
97887,format of the congress and many specifics of t...,"Nevertheless, the format of the congress and m...",20


In [140]:
#row at position 8 (the 9th row) of the sample
answersDf[answersDf['wordCount'] == 20].sample(n=20, random_state=5).iloc[8]['answer']

'On 26 December 1999, Chelsea became the first Premier League side to field an entirely foreign starting line-up,'

I would argue that several questions with single-word answers could be created from this sentence, like:
- In what year? - *1999*
- Which team? - *Chelsea*

And our longest answer with 46 words

In [141]:
answersDf[answersDf['wordCount'] == 46].iloc[0]['answer']  #from 1st row

'that the sudden shift of a huge quantity of water into the region could have relaxed the tension between the two sides of the fault, allowing them to move apart, and could have increased the direct pressure on it, causing a violent rupture'

*sudden shift of a huge quantity of water* seems like a good answer to the question *What could have relaxed the tension between the two sides?*

In [142]:
answersDf[answersDf['wordCount'] == 42].iloc[0]['answer']

'Hillary Clinton (2008), Howard Dean (2004), Gary Hart (1984 and 1988), Paul Tsongas (1992), Pat Robertson (1988) and Jerry Brown (1976, 1980, 1992).'

The second longest answer seems to be a sequence of correct answers to something like *Who has been a presidential candidate?* This could be great for questions with multiple correct answers as well as multiple incorrect ones.
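A minimal sketch of how such a list answer could be split into separate correct answers. It assumes the items are separated by ',' or 'and' and that any separator inside parentheses belongs to the item; the function name and the regular expression are my own illustration, not something used elsewhere in this notebook.

```python
import re

# Split on ',' or ' and ', but only when the separator is outside parentheses
SEPARATOR = re.compile(r',\s*(?![^()]*\))|\s+and\s+(?![^()]*\))')

def splitListAnswer(answer):
    parts = SEPARATOR.split(answer.rstrip('.'))
    return [part.strip() for part in parts if part.strip()]

answer = ('Hillary Clinton (2008), Howard Dean (2004), Gary Hart (1984 and 1988), '
          'Paul Tsongas (1992), Pat Robertson (1988) and Jerry Brown (1976, 1980, 1992).')

candidates = splitListAnswer(answer)
for candidate in candidates:
    print(candidate)
```

Names that themselves contain 'and' or a comma would be split incorrectly, so this only suits answers that are clearly enumerations.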

### Word types

**spaCy** turned out to be a pretty great tool which could provide me with *NER (named entity recognition), part-of-speech detection, word-embedding similarity* and some more functions which may or may not be useful in my case.

In [143]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

#### Named entity recognition

In [144]:
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
print([(X.text, X.label_) for X in doc.ents])   #entity labelling

[('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]


In [145]:
def NerForWord(text):
    doc = nlp(text)
    
    entitiesFound = len(doc.ents)
    
    if (entitiesFound > 0):
        #TODO - Could potentially find multiple entities in the text. We're returning only the first one.
        return doc.ents[0].label_
    else:
        return ''

In [146]:
NerForWord('Portugal')

'GPE'

*spacy.explain* is a useful function for deciphering the tags. They really go deep into the grammatical types, most of which I hadn't even heard of until now. I suspect I'll have to group them up or not use some of the information at all.

In [147]:
spacy.explain("dobj")

'direct object'

Since the *spacy* tagging works on tokens (not necessarily single words; an entity such as a name can span multiple words), it'll significantly ease my work if (for now) I work only with the answers which contain only 1 token.

By my judgment, most of the multiple-token answers contain a single important token and a few words describing it, or are multiple correct tokens separated by 'and' or ','. I could try to extract the important tokens, but I don't think it's worth it at this point.

There are some great questions with multi-token answers, but I think it's better if I limit myself to single-token answers. That way I can work more easily with word embeddings and detect the tokens appropriate to be answers.

In [148]:
def isSingleToken(text):
    doc = nlp(text)
    
    #The entire text is a single named entity 
    entitiesFound = len(doc.ents)
    if(entitiesFound == 1 and doc.ents[0].text == text):   #.text, .label_ 
        return True
    
    #The text is not a named entity, but is a single token
    tokensFound = len(doc)
    if (tokensFound == 1):
        return True
    
    return False

In [149]:
isSingleToken('George R. R. Martin')

True

Let's see how many of our answers we're gonna cut.

In [150]:
singleTokenCount = 0

sampleSize =  int(len(answersDf) / 10)
for i in range(sampleSize):
        
    printPercentage(i, sampleSize)
    
    if (isSingleToken(answersDf.iloc[i]['answer'])):
        singleTokenCount += 1

99%


In [151]:
singleTokenCount / sampleSize

0.5755908720456397

On 10% of the data about 58% is retained. I expected worse.
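As a rough check on how far this 10% estimate can be trusted, here is a sketch of a normal-approximation interval for the retained share. The counts (5650 of 9816) are back-calculated from the ratio above and from the dataset size implied by the notebook's outputs, so treat them as assumptions. The interval also assumes a random sample, while the loop above takes the first 10% of the rows, so it probably understates the real uncertainty.

```python
import math

def proportionInterval(successes, total, z=1.96):
    """Normal-approximation (Wald) interval for a proportion."""
    p = successes / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p, p - margin, p + margin

# Assumed counts: single-token answers out of the answers checked above
p, low, high = proportionInterval(5650, 9816)
print(f'{p:.3f} ({low:.3f} - {high:.3f})')
```

Sampling the rows randomly (for example with `answersDf.sample`) instead of taking the first rows would make this kind of interval more meaningful.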

Let's do some of the more interesting spacy tags - NER, POS, DEP, TAG, SHAPE...

It's better to provide the full text to spacy, because a token's tags are influenced by its relationship with the other words in the text. But we'll do that when we do the feature engineering and also tag the non-answer words.

In [152]:
doc = nlp('James R. Scott')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, len(doc.ents), doc.ents[0].label_)
    
shape = doc[0].shape_
for wordIndex in range(1, len(doc)):   #shape: the visual pattern of each token, e.g. Xxxxx
    shape += (' ' + doc[wordIndex].shape_)
        
print(shape)

James James PROPN NNP compound Xxxxx True False 1 PERSON
R. R. PROPN NNP compound X. False False 1 PERSON
Scott Scott PROPN NNP ROOT Xxxxx True False 1 PERSON
Xxxxx X. Xxxxx
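To make the shape strings easier to read, here is an approximate pure-Python re-implementation. It is an assumption inferred from the outputs in this notebook (uppercase becomes X, lowercase becomes x, digits become d, and runs of the same class are cut off after 4 characters); the real values come from spaCy's `token.shape_`.

```python
def approximateShape(text):
    """Approximate a spaCy-style word shape for a single token."""
    shape = []
    last = ''
    runLength = 0

    for char in text:
        if char.isalpha():
            shapeChar = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            shapeChar = 'd'
        else:
            shapeChar = char   # punctuation and symbols are kept as they are

        if shapeChar == last:
            runLength += 1
        else:
            runLength = 0
            last = shapeChar

        # Keep at most 4 consecutive characters of the same class
        if runLength < 4:
            shape.append(shapeChar)

    return ''.join(shape)

for word in ['James', 'R.', '1931', 'meditation', '70.4']:
    print(word, approximateShape(word))
```

This explains why long lowercase words such as *meditation* and *intelligence* all end up with the same shape.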


In [153]:
spacy.explain('CARDINAL')

'Numerals that do not fall under another type'

Adding the additional columns

In [154]:
answersDf['isSingleToken'] = False
answersDf['NER'] = ''   #named entity label
answersDf['POS'] = ''   #coarse part of speech
answersDf['TAG'] = ''   #fine-grained part-of-speech tag
answersDf['DEP'] = ''   #syntactic dependency
answersDf['shape'] = ''   #word shape, e.g. Xxxxx
answersDf['isAlpha'] = False   #consists of alphabetic characters only
answersDf['isStop'] = False   #is a stop word

In [155]:
answersDf.head()

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
0,Saint Bernadette Soubirous,"It is a replica of the grotto at Lourdes, Fran...",3,False,,,,,,False,False
1,a copper statue of Christ,Immediately in front of the Main Building and ...,5,False,,,,,,False,False
2,the Main Building,Next to the Main Building is the Basilica of t...,3,False,,,,,,False,False
3,a Marian place of prayer and reflection,"Immediately behind the basilica is the Grotto,...",7,False,,,,,,False,False
4,a golden statue of the Virgin Mary,Atop the Main Building's gold dome is a golden...,7,False,,,,,,False,False


Populating the single-token answers

In [156]:
singleTokenCount = 0

sampleSize = int(len(answersDf) / 10)

for i in range(sampleSize):
        
    printPercentage(i, sampleSize)
    
    answer = answersDf.iloc[i]['answer']
    if (isSingleToken(answer)):
        answersDf.at[i, 'isSingleToken'] = True
        
        answersDf.at[i, 'NER'] = NerForWord(answer)
        
        #At this point I've called spacy's nlp method 3 times for the same words...
        doc = nlp(answer)
        
        answersDf.at[i, 'POS'] = doc[0].pos_   #.at[] accesses/modifies a single cell by label
        answersDf.at[i, 'TAG'] = doc[0].tag_
        answersDf.at[i, 'DEP'] = doc[0].dep_
        answersDf.at[i, 'isAlpha'] = doc[0].is_alpha
        answersDf.at[i, 'isStop'] = doc[0].is_stop
        
        shape = doc[0].shape_
        for wordIndex in range(1, len(doc)):
            shape += (' ' + doc[wordIndex].shape_)
            
        answersDf.at[i, 'shape'] = shape
        
        

99%
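The loop above parses each answer three times (in *isSingleToken*, in *NerForWord* and once more for the token attributes). A possible way around that, sketched here under the assumption that the parsed document is passed in, is to collect everything from a single parse. The function name is hypothetical; it only relies on the `Doc` and `Token` attributes already used in this notebook.

```python
def featuresFromDoc(doc, text):
    """Collect the answer features from an already parsed spaCy Doc."""
    ents = list(doc.ents)

    # Same rule as isSingleToken: one entity covering the whole text, or one token
    isSingle = (len(ents) == 1 and ents[0].text == text) or len(doc) == 1
    if not isSingle:
        return {'isSingleToken': False}

    first = doc[0]
    return {
        'isSingleToken': True,
        'NER': ents[0].label_ if ents else '',   # first entity only, as in NerForWord
        'POS': first.pos_,
        'TAG': first.tag_,
        'DEP': first.dep_,
        'isAlpha': first.is_alpha,
        'isStop': first.is_stop,
        'shape': ' '.join(token.shape_ for token in doc),
    }
```

It would be called as `featuresFromDoc(nlp(answer), answer)`, so each answer goes through the pipeline only once.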


#### Stopwords

In [157]:
answersDf['isStop'].value_counts()

isStop
False    97675
True       494
Name: count, dtype: int64

We can safely not bother with stopwords.

#### Named Entity Recognition

In [158]:
answersDf['NER'].value_counts()

NER
               93737
PERSON          1044
CARDINAL         965
DATE             958
ORG              562
GPE              353
PERCENT          151
MONEY             99
NORP              98
LOC               53
FAC               37
ORDINAL           34
QUANTITY          32
TIME              14
WORK_OF_ART       11
EVENT              9
LANGUAGE           7
LAW                3
PRODUCT            2
Name: count, dtype: int64

Note that I've run the NER on only 10% of the dataset, so 9,816 out of 98,169 answers. About 45% of those (4,432) have a NER label.

In [159]:
answersDf[answersDf['NER'] == 'ORG'].sample(n=10, random_state=5)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
1937,The New York Times,"On the occasion of the composer's bicentenary,...",4,True,ORG,DET,DT,det,Xxx Xxx Xxxx Xxxxx,True,True
5060,The International Energy Agency,The International Energy Agency has said that ...,4,True,ORG,DET,DT,det,Xxx Xxxxx Xxxxx Xxxxx,True,True
2080,KK,The present standard musicological reference f...,1,True,ORG,PROPN,NNP,ROOT,XX,True,False
4342,Queens Borough Public Library,Queens is served by the Queens Borough Public ...,4,True,ORG,PROPN,NNP,compound,Xxxxx Xxxxx Xxxxx Xxxxx,True,False
3635,Tzu Chi Foundation,Beijing accepted the aid of the Tzu Chi Founda...,3,True,ORG,PROPN,NNP,compound,Xxx Xxx Xxxxx,True,False
3928,Federal Hall,"In 1789, the first President of the United Sta...",2,True,ORG,PROPN,NNP,compound,Xxxxx Xxxx,True,False
1618,eolomelodicon,He was engaged by the inventors of a mechanica...,1,True,ORG,PROPN,NNP,ROOT,xxxx,True,False
8597,CNN,"On April 17, Xinhua condemned what it called ""...",1,True,ORG,PROPN,NNP,ROOT,XXX,True,False
9061,Krugman,Krugman's contention (that the growth of a com...,1,True,ORG,PROPN,NNP,ROOT,Xxxxx,True,False
2477,Apple,The iPod is a line of portable media players a...,1,True,ORG,NOUN,NN,ROOT,Xxxxx,True,False


In [160]:
answersDf['isAlpha'].value_counts()

isAlpha
False    94146
True      4023
Name: count, dtype: int64

#### Part of speech

In [161]:
answersDf['POS'].value_counts()

POS
         92519
PROPN     2267
NUM       1715
NOUN       874
ADJ        294
VERB       167
DET        134
SYM         72
ADV         54
ADP         26
X           22
PRON        10
PUNCT        6
INTJ         5
AUX          3
PART         1
Name: count, dtype: int64

##### Nouns

The answers are dominated by nouns. The difference between a noun (NOUN) and a proper noun (PROPN) is that proper nouns are names of specific people, places, ideas... while common nouns are non-specific (cat, woman, bottle...).

In [162]:
answersDf[answersDf['POS'] == 'PROPN'].sample(n=5, random_state=16)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
4022,New Jersey,The Hudson River separates the city from the U...,2,True,GPE,PROPN,NNP,compound,Xxx Xxxxx,True,False
152,Frank Eck Stadium,"Also, there are many outdoor fields, as the Fr...",3,True,PERSON,PROPN,NNP,compound,Xxxxx Xxx Xxxxx,True,False
2167,Arthur Hutchings,While his illness and his love-affairs conform...,2,True,PERSON,PROPN,NNP,compound,Xxxxx Xxxxx,True,False
9178,Merrill Lynch,"The volume ""Credit Correlation: Life After Cop...",2,True,ORG,PROPN,NNP,compound,Xxxxx Xxxxx,True,False
2431,Zhang Juzheng,"Before he left, he sent a letter and gifts to ...",2,True,PERSON,PROPN,NNP,compound,Xxxxx Xxxxx,True,False


In [163]:
answersDf[answersDf['POS'] == 'NOUN'].sample(n=5, random_state=16)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
4394,jazz,"The city was a center of jazz in the 1940s, ab...",1,True,,NOUN,NN,ROOT,xxxx,True,False
6695,meditation,While there is no convincing evidence for medi...,1,True,,NOUN,NN,ROOT,xxxx,True,False
7765,dukkōn,The term may possibly derive from Proto-German...,1,True,,NOUN,NN,ROOT,xxxx,True,False
7889,intelligence,Dog intelligence is the ability of the dog to ...,1,True,,NOUN,NN,ROOT,xxxx,True,False
6772,shramanas,"[note 16] These groups, whose members were kno...",1,True,,NOUN,NNS,ROOT,xxxx,True,False


##### Numerals

The second most prominent category is NUM. It's pretty much years and other numbers.

In [164]:
answersDf[answersDf['POS'] == 'NUM'].sample(n=10, random_state=16)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
6880,487 million,"According to Johnson and Grim (2013), Buddhism...",2,True,CARDINAL,NUM,CD,compound,ddd xxxx,False,False
4054,1931,The Art Deco style of the Chrysler Building (1...,1,True,DATE,NUM,CD,ROOT,dddd,False,False
9605,1975,By 1975 the majority of local authorities in E...,1,True,DATE,NUM,CD,ROOT,dddd,False,False
3264,48,"On Metacritic, the film has a rating of 60 out...",1,True,CARDINAL,NUM,CD,ROOT,dd,False,False
2448,1565,"In 1565, the powerful Rinbung princes were ove...",1,True,DATE,NUM,CD,ROOT,dddd,False,False
2514,2007,"In 2007, Apple modified the iPod interface aga...",1,True,DATE,NUM,CD,ROOT,dddd,False,False
1742,26 February 1832,On 26 February 1832 Chopin gave a debut Paris ...,3,True,DATE,NUM,CD,nummod,dd Xxxxx dddd,False,False
1137,4.40%,The United States Census Bureau estimates that...,2,True,PERCENT,NUM,CD,nummod,d.dd %,False,False
8389,70,\n India: Due to concerns about pro-Tibet prot...,1,True,CARDINAL,NUM,CD,ROOT,dd,False,False
7123,seven,"Starting with season seven, contestants may pe...",1,True,CARDINAL,NUM,CD,ROOT,xxxx,True,False


##### Adjectives and Verbs

Didn't really expect much of those, but they seem like adequate answers. 

In [165]:
answersDf[(answersDf['POS'] == 'ADJ') & (answersDf['wordCount'] == 1)].sample(n=5, random_state=4)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
4540,Republican,New York City has not been carried by a Republ...,1,True,NORP,ADJ,JJ,ROOT,Xxxxx,True,False
5828,Galician-Portuguese,Portuguese is a Romance language that originat...,1,True,NORP,ADJ,JJ,amod,Xxxxx - Xxxxx,True,False
6391,monastic,"Most accept that he lived, taught and founded ...",1,True,,ADJ,JJ,ROOT,xxxx,True,False
5438,Ethical,Ethical commitments in anthropology include no...,1,True,,ADJ,JJ,ROOT,Xxxxx,True,False
5493,French,Portuguese and their allied British troops fou...,1,True,NORP,ADJ,JJ,ROOT,Xxxxx,True,False


In [166]:
answersDf[(answersDf['POS'] == 'VERB') & (answersDf['wordCount'] == 1)].sample(n=5, random_state=4)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
649,Flawless,She would later align herself more publicly wi...,1,True,,VERB,VB,ROOT,Xxxxx,True,False
8044,taboo,"However, Western, South Asian, African, and Mi...",1,True,,VERB,VB,ROOT,xxxx,True,False
1269,destroyed,The definition upholds the centrality of inten...,1,True,,VERB,VBN,ROOT,xxxx,True,False
7879,Neutering,Neutering reduces problems caused by hypersexu...,1,True,,VERB,VBG,ROOT,Xxxxx,True,False
2893,helmet,"However, as Hyrule Castle collapses, it is rev...",1,True,,VERB,VB,ROOT,xxxx,True,False


##### Symbols

All of the symbol answers are multi-word answers with a dollar sign in front of the amount.

In [167]:
answersDf[(answersDf['POS'] == 'SYM')].sample(n=5, random_state=4)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
3777,$26 million,The association has also collected a total of ...,3,True,MONEY,SYM,$,quantmod,$ dd xxxx,False,False
9259,US$2.5 trillion,"During the last quarter of 2008, these central...",4,True,MONEY,SYM,$,quantmod,XX$ d.d xxxx,False,False
9067,$70 trillion,"In a Peabody Award winning program, NPR corres...",3,True,MONEY,SYM,$,quantmod,$ dd xxxx,False,False
3252,$70.4 million,The film ended up grossing $70.4 million in it...,3,True,MONEY,SYM,$,quantmod,$ dd.d xxxx,False,False
3675,$772 million,Many donated through text messaging on mobile ...,3,True,MONEY,SYM,$,quantmod,$ ddd xxxx,False,False


## Answers in a bigger picture

Let's take a look at the highlighted answers.

I suspect:

1. There are other (many more?) obviously good words for answers that were just not selected.
2. There are some sentences that just don't contain answers.

In [168]:
def highlightAnswers(titleId, paragraphId):

    paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
    
    answers = df['data'][titleId]['paragraphs'][paragraphId]['qas']

    #Get answer starts and answer length
    answerPosition = {}
    for answer in answers:
        answerStart = answer['answers'][0]['answer_start']
        answerLength = len(answer['answers'][0]['text'])

        answerPosition[answerStart] = answerLength

    #Bold answers
    shiftStart = 0
    highlightedText = ''
    currentPlaceInText = 0
    
    #Append text between previous answer and current answer + bold sign + answer + bold sign
    for answerStart in sorted(answerPosition.keys()):
        highlightedText += paragraph[currentPlaceInText:answerStart]
        highlightedText += '**'
        highlightedText += paragraph[answerStart:answerStart + answerPosition[answerStart]]
        highlightedText += '**'
        
        currentPlaceInText = answerStart + answerPosition[answerStart]
    
    #Append the remaining text after the last answer
    highlightedText += paragraph[currentPlaceInText:len(paragraph)]

    #Display the highlighted text
    display(Markdown(highlightedText))  #rendering as Markdown makes the ** ** bold markers work
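If two answers in a paragraph overlap, the loop above would emit the overlapping part twice. A minimal sketch of a variant that merges overlapping spans first; it works on plain `(start, length)` pairs rather than on the SQuAD structure, and the function name is hypothetical.

```python
def highlightSpans(paragraph, spans):
    """Wrap (start, length) spans of a paragraph in Markdown bold markers.

    Overlapping or touching spans are merged first, so no text is repeated.
    """
    merged = []
    for start, length in sorted(spans):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    pieces = []
    position = 0
    for start, end in merged:
        pieces.append(paragraph[position:start])
        pieces.append('**' + paragraph[start:end] + '**')
        position = end
    pieces.append(paragraph[position:])

    return ''.join(pieces)

text = 'The overuse of antibiotics has been associated with resistance.'
# Two overlapping answers
spans = [(text.index(answer), len(answer))
         for answer in ('overuse of antibiotics', 'antibiotics')]
print(highlightSpans(text, spans))
```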

In [169]:
titleId = 24
paragraphId = 0

highlightAnswers(titleId, paragraphId)

Located approximately 250 kilometres (**160** mi) east of Puerto Rico and the nearer Virgin Islands, St. Barthélemy lies immediately southeast of the islands of Saint Martin and Anguilla. It is one of **the Renaissance** Islands. St. Barthélemy is separated from Saint Martin by **the Saint-Barthélemy Channel**. It lies northeast of Saba and St Eustatius, and north of St Kitts. Some small **satellite islets** belong to St. Barthélemy including Île Chevreau (Île Bonhomme), Île Frégate, Île Toc Vers, Île Tortue and Gros Îlets (Îlots Syndare). A much bigger islet, Île Fourchue, lies on the north of the island, in the Saint-Barthélemy Channel. Other rocky islets which include Coco, the Roques (or **little Turtle rocks**), the Goat, and the Sugarloaf.

In [170]:
titleId = 4
paragraphId = 12

highlightAnswers(titleId, paragraphId)

**Inappropriate antibiotic treatment and overuse** of antibiotics have contributed to the emergence of antibiotic-resistant bacteria. **Self prescription** of antibiotics is an example of misuse. Many antibiotics are frequently prescribed to treat symptoms or diseases that do not respond to antibiotics or that are likely to resolve without treatment. Also, incorrect or suboptimal antibiotics are prescribed for certain bacterial infections. The **overuse of antibiotics**, like penicillin and erythromycin, has been associated with emerging antibiotic resistance since the 1950s. Widespread usage of antibiotics in hospitals has also been associated with increases in bacterial strains and species that no longer respond to treatment with the most common antibiotics.

In [171]:
titleId = 52
paragraphId = 4

highlightAnswers(titleId, paragraphId)

According to the **endurance running hypothesis**, long-distance running as in **persistence hunting**, a method still practiced by **some hunter-gatherer groups** in modern times, was likely the driving evolutionary force leading to the evolution of certain human characteristics. This hypothesis does not necessarily contradict the **scavenging hypothesis**: **both subsistence strategies** could have been in use – sequentially, alternating or even simultaneously.

In [172]:
titleId = 453
paragraphId = 1

highlightAnswers(titleId, paragraphId)

The first commercially successful true engine, in that it could generate power and transmit it to a machine, was the **atmospheric engine**, invented by **Thomas Newcomen** around **1712**. It was an improvement over Savery's **steam pump**, using a piston as proposed by **Papin**. Newcomen's engine was relatively inefficient, and in most cases was used for pumping water. It worked by creating a partial vacuum by condensing steam under a piston within a cylinder. It was employed for draining mine workings at depths hitherto impossible, and also for providing a reusable water supply for driving waterwheels at factories sited away from a suitable "head". Water that had passed over the wheel was pumped back up into a storage reservoir above the wheel.

It definitely seems that there are a lot more words that could become good answers. But I'm optimistic I can extract the selected words' features even if I don't have all of the possible answer words.

At first glance, it seems like the answers are spread throughout the entire text and there aren't many sentences without an answer. Though a better experiment would be to just count the sentences without answers against the ones with.
But I don't see a large enough benefit to do it (deadline approaching).
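For reference, the counting experiment could look roughly like the sketch below. It uses a naive regular-expression sentence splitter as a stand-in (an assumption; it would break on abbreviations such as "St. Barthélemy", so a proper tokenizer like *nltk*'s `sent_tokenize` would be the better choice), and it takes the answer start offsets as a plain list.

```python
import re

def countAnswerSentences(paragraph, answerStarts):
    """Return (sentences with an answer, sentences without one)."""
    paragraph = paragraph.rstrip()

    # Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace
    boundaries = [match.end() for match in re.finditer(r'[.!?]\s+', paragraph)]
    starts = [0] + boundaries
    ends = boundaries + [len(paragraph)]

    withAnswer = 0
    for start, end in zip(starts, ends):
        if any(start <= position < end for position in answerStarts):
            withAnswer += 1

    return withAnswer, len(starts) - withAnswer

paragraph = ('It was invented by Thomas Newcomen around 1712. '
             'It was relatively inefficient. It used a piston.')
answerStarts = [paragraph.index(answer)
                for answer in ('Thomas Newcomen', '1712', 'piston')]

print(countAnswerSentences(paragraph, answerStarts))
```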

## Noun chunks
Another neat thing spacy gives us is noun chunks.

In [173]:
text = df['data'][0]['paragraphs'][0]['context']
doc = nlp(text)

for noun_chunk in doc.noun_chunks:
    print(noun_chunk)

the school
a Catholic character
the Main Building's gold dome
a golden statue
the Virgin Mary
front
the Main Building
it
a copper statue
Christ
arms
the legend
"Venite Ad Me Omnes
the Main Building
the Basilica
the Sacred Heart
the basilica
the Grotto
a Marian place
prayer
reflection
It
a replica
the grotto
Lourdes
France
the Virgin Mary
the end
the main drive
a direct line
that
3 statues
the Gold Dome
a simple, modern stone statue
Mary


In [174]:
titleId = 0
paragraphId = 0

highlightAnswers(titleId, paragraphId)

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is **a golden statue of the Virgin Mary**. Immediately in front of the Main Building and facing it, is **a copper statue of Christ** with arms upraised with the legend "Venite Ad Me Omnes". Next to **the Main Building** is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, **a Marian place of prayer and reflection**. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to **Saint Bernadette Soubirous** in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

In the first paragraph we have 2 answers that are clean noun phrases, though only the first one appears in the chunk list printed above:
1. **the Main Building**
2. **Saint Bernadette Soubirous**

While the other 3 answers are partially cut:
1. **a golden statue** of the Virgin Mary
2. **a copper statue** of *Christ* 
3. **a Marian place** of *prayer* and *reflection*
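The exact/partial distinction made above can be checked mechanically. A minimal sketch, assuming the noun chunks are available as plain strings; the function name is hypothetical and the chunk list is a hand-copied subset of the output above.

```python
def matchAnswerToChunks(answer, chunks):
    """Classify an answer as an 'exact', 'partial' or 'none' noun-chunk match."""
    if answer in chunks:
        return 'exact'

    # Pad with spaces so that chunks only match on whole words
    padded = f' {answer} '
    if any(f' {chunk} ' in padded for chunk in chunks):
        return 'partial'

    return 'none'

chunks = ['the Main Building', 'a golden statue', 'the Virgin Mary',
          'a copper statue', 'Christ']

for answer in ['the Main Building', 'a golden statue of the Virgin Mary', 'in 1858']:
    print(answer, '->', matchAnswerToChunks(answer, chunks))
```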

Though I would argue that all of the other noun chunks would make great answers.
I could potentially use only noun chunks for the answers and sacrifice the verbs and adjectives. But the noun chunks are mostly multi-word. That would pose a problem with my features:
1. **Part of speech** - Couldn't really do it on multiple words.
2. **TF-IDF** - Would need to modify it by either aggregating the scores of the single words or scoring the entire noun chunk... or both.
3. **Title similarity** - Aggregation of the single words.
4. **Incorrect answers** - That would be tricky, because I would need to find similar words for each word in the chunk and mix and match with the other similar words... That is bound to produce some inadequate mixes. But it still may not be a bad thing if I rely on a final filtering by a human.
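For the aggregation option mentioned for TF-IDF and title similarity, a minimal sketch of turning per-word scores into one score per chunk. The word scores are made-up numbers for illustration only, and the choice of aggregate (max, sum, mean...) is an open question rather than something decided in this notebook.

```python
def chunkScore(chunk, wordScores, aggregate=max):
    """Aggregate per-word scores into a single score for a multi-word chunk."""
    scores = [wordScores.get(word.lower(), 0.0) for word in chunk.split()]
    return aggregate(scores) if scores else 0.0

# Made-up TF-IDF values, for illustration only
wordScores = {'the': 0.01, 'main': 0.20, 'building': 0.35}

print(chunkScore('the Main Building', wordScores))                  # highest word score
print(chunkScore('the Main Building', wordScores, aggregate=sum))   # sum of word scores
```

Taking the maximum keeps stop words like "the" from dragging the chunk's score down, which a plain mean would do.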