# Data exploration - SQuAD v1

In [2]:
#Common imports 
import pandas as pd
from IPython.display import Markdown, display, clear_output
from nltk import tokenize
from scipy import stats
from IPython.core.debugger import set_trace
from pathlib import Path

### Pretty printing

In [3]:
def printBold(string):
    display(Markdown('**' + string + '**'))
    
#TODO    
#def printColor():
#     display(Markdown('<span style="color:blue">blue</span>'))

### Pickling

In [4]:
import _pickle as cPickle
from pathlib import Path

def dumpPickle(fileName, content):
    #Protocol -1 selects the highest pickle protocol available
    with open(fileName, 'wb') as pickleFile:
        cPickle.dump(content, pickleFile, -1)

def loadPickle(fileName):
    with open(fileName, 'rb') as pickleFile:
        return cPickle.load(pickleFile)

def pickleExists(fileName):
    return Path(fileName).is_file()

## Reading the datasets

Since we aren't actually going to answer the questions, which is what the dataset is intended for, we'll merge the train and dev datasets into one. The test dataset is probably hidden, since there's a competition for it.

In [5]:
train = pd.read_json('../data/squad-v1/train-v1.1.json', orient='columns')
dev = pd.read_json('../data/squad-v1/dev-v1.1.json', orient='columns')

In [6]:
df = pd.concat([train, dev], ignore_index=True)

In [7]:
df.head()

| | data | version |
|---|---|---|
| 0 | {'title': 'University_of_Notre_Dame', 'paragra... | 1.1 |
| 1 | {'title': 'Beyoncé', 'paragraphs': [{'context'... | 1.1 |
| 2 | {'title': 'Montana', 'paragraphs': [{'context'... | 1.1 |
| 3 | {'title': 'Genocide', 'paragraphs': [{'context... | 1.1 |
| 4 | {'title': 'Antibiotics', 'paragraphs': [{'cont... | 1.1 |


In [10]:
df.shape

(490, 2)

Let's look at what we've got.

In [7]:
def showQuestion(titleId, paragraphId, questionId):

    title = df['data'][titleId]['title']
    paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
    question = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['question']
    answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']
    answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']

    printBold('Title')
    print(title)
    printBold('Paragraph')
    print(paragraph)
    printBold('Question')
    print(question)
    printBold('Answer')
    print(answerStart)
    print(answer)

In [8]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

515
Saint Bernadette Soubirous


## Dataset size

In [9]:
titlesCount = len(df['data'])
totalParagraphsCount = 0
totalQuestionsCount = 0

for titleId in range(titlesCount):
    paragraphsCount = len(df['data'][titleId]['paragraphs'])
    totalParagraphsCount += paragraphsCount
    
    for paragraphId in range(paragraphsCount):
        questionsCount = len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])
        
        totalQuestionsCount += questionsCount
        
print('Titles', titlesCount)
print('Paragraphs', totalParagraphsCount)
print('Questions', totalQuestionsCount)

Titles 490
Paragraphs 20963
Questions 98169
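
The same three nested loops come back several times in this notebook. As a rough sketch (not something the notebook relies on), the nesting could be flattened once into one record per question. `iterQuestions` and `toyData` are hypothetical names, and `toyData` is a tiny hand-made stand-in for `df['data']`, not real SQuAD content.

```python
def iterQuestions(data):
    """Yield one flat record per question from SQuAD-style nested articles."""
    for titleId, article in enumerate(data):
        for paragraphId, paragraph in enumerate(article['paragraphs']):
            for questionId, qa in enumerate(paragraph['qas']):
                # Like the rest of the notebook, only the first answer is used.
                firstAnswer = qa['answers'][0]
                yield {
                    'titleId': titleId,
                    'paragraphId': paragraphId,
                    'questionId': questionId,
                    'title': article['title'],
                    'context': paragraph['context'],
                    'question': qa['question'],
                    'answer': firstAnswer['text'],
                    'answerStart': firstAnswer['answer_start'],
                }

# Tiny hand-made stand-in for df['data'] (not real SQuAD content).
toyData = [
    {'title': 'A', 'paragraphs': [
        {'context': 'Cats purr. Dogs bark.', 'qas': [
            {'question': 'What do cats do?',
             'answers': [{'text': 'purr', 'answer_start': 5}]},
            {'question': 'What do dogs do?',
             'answers': [{'text': 'bark', 'answer_start': 16}]},
        ]},
    ]},
    {'title': 'B', 'paragraphs': [
        {'context': 'Birds fly.', 'qas': [
            {'question': 'What do birds do?',
             'answers': [{'text': 'fly', 'answer_start': 6}]},
        ]},
    ]},
]

records = list(iterQuestions(toyData))
print(len(records))
```

On the real data, `pd.DataFrame(list(iterQuestions(df['data'])))` should give one row per question, so the counts above would become `len()` and `nunique()` calls on that frame.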


## Titles

In [10]:
titles = []
for titleId in range(len(df['data'])):
    titles.append(df['data'][titleId]['title'])
    
for i in range(20):
    print(titles[i])

University_of_Notre_Dame
Beyoncé
Montana
Genocide
Antibiotics
Frédéric_Chopin
Sino-Tibetan_relations_during_the_Ming_dynasty
IPod
The_Legend_of_Zelda:_Twilight_Princess
Spectre_(2015_film)
2008_Sichuan_earthquake
New_York_City
To_Kill_a_Mockingbird
Solar_energy
Tajikistan
Anthropology
Portugal
Kanye_West
Buddhism
American_Idol


The titles are pretty random. There seem to be a lot of locations, like countries and cities, but not nearly enough to afford splitting the dataset.

## Questions

One of our main assumptions is that the sentence that contains the answer could be turned into a question just by removing the answer from it. Let's see how much of that is true for the questions in this dataset.

In [11]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

515
Saint Bernadette Soubirous


In [12]:
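def extractSentence(paragraph, answerStart):
    
    sentences = tokenize.sent_tokenize(paragraph)
    #Running character offset of the current sentence.
    #Assumes consecutive sentences are separated by exactly one character.
    sentenceStart = 0
    
    for sentence in sentences:
        #The first sentence that ends at or after answerStart contains the answer
        if (sentenceStart + len(sentence) >= answerStart):
            return sentence         
        
        sentenceStart += len(sentence) + 1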

In [13]:
paragraph = df['data'][0]['paragraphs'][0]['context']
answerStart = df['data'][0]['paragraphs'][0]['qas'][0]['answers'][0]['answer_start']

sentence = extractSentence(paragraph, answerStart)
print(sentence)

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.
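
`extractSentence` tracks the offset by assuming exactly one separator character between sentences, so the offset could drift if a paragraph has double spaces or newlines. A possible alternative, sketched below under that assumption, is to look up each sentence's real position with `str.find`. `sentenceAtOffset` is a hypothetical helper; it takes the sentence list as an argument (for example the output of `sent_tokenize`), and the demo uses hand-split sentences so it runs without NLTK data.

```python
def sentenceAtOffset(paragraph, sentences, answerStart):
    """Return the sentence whose character span in `paragraph` covers `answerStart`."""
    searchFrom = 0
    for sentence in sentences:
        start = paragraph.find(sentence, searchFrom)
        if start == -1:
            # The tokenizer output does not occur verbatim; skip it.
            continue
        end = start + len(sentence)
        if start <= answerStart < end:
            return sentence
        searchFrom = end
    # No sentence covers the offset.
    return None

# Toy paragraph with a double space and a newline between sentences.
paragraph = 'First one.  Second one.\nThird one.'
sentences = ['First one.', 'Second one.', 'Third one.']

print(sentenceAtOffset(paragraph, sentences, paragraph.index('Third')))
```

Returning `None` when nothing matches keeps the behaviour explicit; the caller can then decide whether to skip the question or fall back to the whole paragraph.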


In [14]:
def containedInText(text, question):
    
    questionWords = tokenize.word_tokenize(question.lower())
    #A set gives constant-time membership checks
    textWords = set(tokenize.word_tokenize(text.lower()))

    #Count every question token (duplicates included) that occurs in the text
    wordsContained = sum(1 for questionWord in questionWords if questionWord in textWords)

    return wordsContained / len(questionWords)

In [15]:
question =  df['data'][0]['paragraphs'][0]['qas'][0]['question']

contained = containedInText(sentence, question)

In [16]:
printBold('Question')
print(question)
printBold('Sentence')
print(sentence)
printBold("Contained")
print(contained)

**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Sentence**

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.


**Contained**

0.6428571428571429


I wouldn't expect 100% containment, simply because the questions will contain **question-like words** like *Why*, *Who*, *Whom* and *What*.

In this example we also see that the word *appear* occurs in the original sentence, but in the **past tense** (*appeared*). We could take care of that by comparing the **stems** of the words, but I think it's better to measure the least imaginative way of forming questions.

We are also counting some **stopwords**, common words like *to, the, in*, which could show up at different places in the sentence, but again we want to measure the least creative questions.

In this sentence *(damn, that was a good example)* we also see that the question uses the word *allegedly*, which is a **synonym** of *reputedly* in the sentence. Handling that could be nice for question forming, but I think it's overkill.

We also see that the question actually covers the **words around the answer, rather than the entire sentence**. That's a definite must-do when we form our questions.
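
To get a feel for how much the question words and stopwords matter, here is a rough sketch of a containment score that ignores them. It is not the measure used in the rest of the notebook: the word lists are deliberately tiny and hand-picked, `simpleTokens` is a regex stand-in for `word_tokenize`, and `contentContainment` is a hypothetical name.

```python
import re

# Hypothetical, deliberately tiny word lists; a real run would need fuller ones.
QUESTION_WORDS = {'who', 'whom', 'what', 'when', 'where', 'why', 'which', 'how'}
STOP_WORDS = {'a', 'an', 'the', 'to', 'in', 'of', 'is', 'it', 'did', 'at'}

def simpleTokens(text):
    """Lowercase and keep alphanumeric runs only (a stand-in for word_tokenize)."""
    return re.findall(r'[a-z0-9]+', text.lower())

def contentContainment(text, question):
    """Share of the question's content words that also occur in `text`."""
    ignored = QUESTION_WORDS | STOP_WORDS
    questionWords = [word for word in simpleTokens(question) if word not in ignored]
    if not questionWords:
        return 0.0
    textWords = set(simpleTokens(text))
    return sum(word in textWords for word in questionWords) / len(questionWords)

question = 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
sentence = ('It is a replica of the grotto at Lourdes, France where the Virgin Mary '
            'reputedly appeared to Saint Bernadette Soubirous in 1858.')
print(contentContainment(sentence, question))
```

With these lists, seven content words remain in the question and five of them occur in the sentence; the two misses are exactly the tense (*appear*) and synonym (*allegedly*) cases described above.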

Let's see what the score is on all of the questions. I'm also curious to see the score on the entire paragraph.

This may come in handy in the future. Pretty printing the progress.

In [17]:
#Printing the percentage completed
def printPercentage(currentStep, maxStep):
    stepSize = maxStep / 100
    
    if (int(currentStep / stepSize) > ((currentStep - 1) / stepSize)):
        clear_output()
        print('{}%'.format(int(currentStep / stepSize)))

In [18]:
questionContainmentDfPickleName = 'pickles/questionContainmentDf.pkl'

#If the dataframe is already generated, load it.
if (pickleExists(questionContainmentDfPickleName)):
    print("Pickle found. Saved some time.")
    questionContainmentDf = loadPickle(questionContainmentDfPickleName)
else:
    sentenceScore = []
    paragraphScore = []

    #For each title
    titlesCount = len(df['data'])
    for titleId in range(titlesCount):
        printPercentage(titleId, titlesCount)

        #For each paragraph
        for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
            paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']

            #For each question
            for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
                question = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['question']
                answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']
                sentence = extractSentence(paragraph, answerStart)

                sentenceScore.append(containedInText(sentence, question))
                paragraphScore.append(containedInText(paragraph, question))           
                
    #Merge dataframes into one                
    sentenceScoreDf = pd.DataFrame(sentenceScore, columns=['sentence'])
    paragraphScoreDf = pd.DataFrame(paragraphScore, columns=['paragraph'])

    questionContainmentDf = pd.concat([sentenceScoreDf, paragraphScoreDf], axis=1)
    
    #Pickle the result
    dumpPickle(questionContainmentDfPickleName, questionContainmentDf)
    
    print("Pickle not found. Result generated and pickled.")


Pickle found. Saved some time.
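
The load-or-generate pattern in the cell above could be wrapped in a small helper so that other expensive results can reuse it. This is only a sketch: `loadOrCompute` is a hypothetical name, it uses the standard `pickle` module rather than the helpers defined earlier, and the demo writes to a throwaway temporary directory.

```python
import pickle
import tempfile
from pathlib import Path

def loadOrCompute(fileName, compute):
    """Return the pickled object at `fileName`, or compute, pickle and return it."""
    path = Path(fileName)
    if path.is_file():
        with path.open('rb') as file:
            return pickle.load(file)
    result = compute()
    # Create the target directory if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as file:
        pickle.dump(result, file, protocol=pickle.HIGHEST_PROTOCOL)
    return result

# Demo: the second call reads the pickle instead of recomputing.
calls = []

def expensive():
    calls.append(1)
    return {'sentence': [0.5], 'paragraph': [0.6]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'pickles' / 'demo.pkl'
    first = loadOrCompute(target, expensive)
    second = loadOrCompute(target, expensive)

print(first == second, len(calls))
```

Passing the computation as a function keeps the caching logic separate from the scoring loop; the trade-off is that a stale pickle has to be deleted by hand whenever the scoring code changes.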


In [19]:
questionContainmentDf.describe()

| | sentence | paragraph |
|---|---|---|
| count | 98169.0 | 98169.0 |
| mean | 0.463937 | 0.582157 |
| std | 0.190377 | 0.159055 |
| min | 0.0 | 0.0 |
| 25% | 0.333333 | 0.5 |
| 50% | 0.461538 | 0.6 |
| 75% | 0.6 | 0.7 |
| max | 1.0 | 1.0 |


I would argue that having almost half of the question's words contained in the answer sentence is a pretty good result.

As expected, containment within the entire paragraph is higher.

I do wonder about those questions that are 100% contained in the answer sentence.

In [20]:
questionContainmentDf.head(10)

| | sentence | paragraph |
|---|---|---|
| 0 | 0.642857 | 0.571429 |
| 1 | 0.636364 | 0.636364 |
| 2 | 0.533333 | 0.6 |
| 3 | 0.375 | 0.5 |
| 4 | 0.333333 | 0.416667 |
| 5 | 0.272727 | 0.636364 |
| 6 | 0.3 | 0.8 |
| 7 | 0.363636 | 0.727273 |
| 8 | 0.0 | 0.545455 |
| 9 | 0.266667 | 0.733333 |


In [21]:
def getQuestionAt(index):
    currentIndex = 0
    
    for titleId in range(len(df['data'])):
        for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
            for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
                if (currentIndex == index):
                    return titleId, paragraphId, questionId
                currentIndex += 1

Let's see question #8 which has 0 containment in the answer sentence. 

In [22]:
getQuestionAt(8)

(0, 1, 3)

In [23]:
titleId = 0
paragraphId = 1 
questionId = 3

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, when some students believed that The Observer began to show a conservative bias, a lib

**Question**

How many student news papers are found at Notre Dame?


**Answer**

126
three


The question is actually formed from the previous sentence.

### 0% containment

In [24]:
questionContainmentDf[questionContainmentDf['paragraph'] == 0].head()

| | sentence | paragraph |
|---|---|---|
| 269 | 0.0 | 0.0 |
| 363 | 0.0 | 0.0 |
| 505 | 0.0 | 0.0 |
| 2781 | 0.0 | 0.0 |
| 3678 | 0.0 | 0.0 |


In [25]:
getQuestionAt(269)

(1, 0, 0)

In [26]:
titleId = 1
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

Beyoncé


**Paragraph**

Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".


**Question**

When did Beyonce start becoming popular?


**Answer**

269
in the late 1990s


A **synonym** case: instead of *rose to fame*, *start becoming popular* is used. The question also spells *Beyonce* without the accent, so even the name doesn't match.

In [27]:
getQuestionAt(505)

(1, 18, 6)

In [28]:
titleId = 1
paragraphId = 18 
questionId = 6

showQuestion(titleId, paragraphId, questionId)

**Title**

Beyoncé


**Paragraph**

In 2011, documents obtained by WikiLeaks revealed that Beyoncé was one of many entertainers who performed for the family of Libyan ruler Muammar Gaddafi. Rolling Stone reported that the music industry was urging them to return the money they earned for the concerts; a spokesperson for Beyoncé later confirmed to The Huffington Post that she donated the money to the Clinton Bush Haiti Fund. Later that year she became the first solo female artist to headline the main Pyramid stage at the 2011 Glastonbury Festival in over twenty years, and was named the highest-paid performer in the world per minute.


**Question**

When did this leak happen?


**Answer**

3
2011


That's just a bad question. It could only be asked in combination with the text.

### 100% containment

In [29]:
questionContainmentDf[questionContainmentDf['sentence'] == 1]

| | sentence | paragraph |
|---|---|---|
| 21911 | 1.0 | 1.0 |
| 39394 | 1.0 | 1.0 |
| 45064 | 1.0 | 1.0 |
| 48874 | 1.0 | 1.0 |
| 53226 | 1.0 | 1.0 |
| 67425 | 1.0 | 1.0 |


In [30]:
getQuestionAt(53226)

(258, 23, 0)

In [31]:
titleId = 258
paragraphId = 23 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

Utrecht


**Paragraph**

Utrecht city has an active cultural life, and in the Netherlands is second only to Amsterdam. There are several theatres and theatre companies. The 1941 main city theatre was built by Dudok. Besides theatres there is a large number of cinemas including three arthouse cinemas. Utrecht is host to the international Early Music Festival (Festival Oude Muziek, for music before 1800) and the Netherlands Film Festival. The city has an important classical music hall Vredenburg (1979 by Herman Hertzberger). Its acoustics are considered among the best of the 20th-century original music halls.[citation needed] The original Vredenburg music hall has been redeveloped as part of the larger station area redevelopment plan and in 2014 has gained additional halls that allowed its merger with the rock club Tivoli and the SJU jazzpodium. There are several other venues for music throughout the city. Young musicians are educated in the conservatory, a department of the Utrecht School of the Arts. There is 

**Question**

Cultural life in Utrecht is second to 


**Answer**

0
Utrecht city has an active cultural life, and in the Netherlands is second only to Amsterdam


Strange question. The question words all appear in the sentence, but not in the same order. The answer is the entire sentence, which obviously has needless information in it. Looking further into it, the question is actually wrong, because it should state second *in the Netherlands*. This question should be scrapped...

In [32]:
getQuestionAt(67425)

(341, 25, 2)

In [33]:
titleId = 341
paragraphId = 25 
questionId = 2

showQuestion(titleId, paragraphId, questionId)

**Title**

Energy


**Paragraph**

Thermodynamics divides energy transformation into two kinds: reversible processes and irreversible processes. An irreversible process is one in which energy is dissipated (spread) into empty energy states available in a volume, from which it cannot be recovered into more concentrated forms (fewer quantum states), without degradation of even more energy. A reversible process is one in which this sort of dissipation does not happen. For example, conversion of energy from one type of potential field to another, is reversible, as in the pendulum system described above. In processes where heat is generated, quantum states of lower energy, present as possible excitations in fields between atoms, act as a reservoir for part of the energy, from which it cannot be recovered, in order to be converted with 100% efficiency into other forms of energy. In this case, the energy must partly stay as heat, and cannot be completely recovered as usable energy, except at the price of an increase in some ot

**Question**

A reversible process is one in which this does not happen.


**Answer**

406
dissipation


This is, basically, just the question I expect to generate. The answer is removed and the sentence is descriptive enough to fill in the missing word.

### Summary

The assumption that the **question consists mostly of words from the sentence the answer is in** seems correct.

There are some obvious differences like:
- **Question-like words** are added - who, why, when...
- **Synonyms** are used instead of the words used in the sentence.
- Changing the sentence to a question also changes the **tense** of the verb.
- In long sentences, only a **part of the sentence is used**. For example, if the sentence is separated by commas, the comma actually divides two logical statements.

I also managed to find some outliers which turned out to be poorly asked questions.

## Answers

A couple of ideas to explore:
- Are all the answers phrases from the text?
- The type of the answers - numbers, dates, names, locations, similarity to the title
- Part of speech - verb, noun
- Answer length in words
- Words around the answer
- Answer location in the sentence - first word, last word

### Answers contained in the text

In [34]:
answersInText = 0
answersNotInText = 0

for titleId in range(len(df['data'])):
    for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
        paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
        for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
            answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']
            if (answer in paragraph):
                answersInText += 1
            else:
                answersNotInText += 1
                
printBold('Answers in text')
print(answersInText)
printBold('Answers not in text')
print(answersNotInText)

**Answers in text**

98169


**Answers not in text**

0


All the answers are phrases from the text. Seems like that has been a requirement from the start, since the answers also have an index indicating their start location in the paragraph.
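
The `answer in paragraph` check only shows that the answer string occurs somewhere in the paragraph. A slightly stricter check, sketched below, compares the answer with the slice that starts at its `answer_start` offset. `answerMatchesOffset` is a hypothetical helper and the sentence is a made-up toy example, not SQuAD data.

```python
def answerMatchesOffset(context, answerText, answerStart):
    """True when `answerText` sits exactly at `answerStart` inside `context`."""
    return context[answerStart:answerStart + len(answerText)] == answerText

# Made-up toy context.
context = 'The fair opened in 1893 and closed in 1894.'

# Correct offset: the slice equals the answer.
print(answerMatchesOffset(context, '1894', context.index('1894')))
# Wrong offset: '1894' is in the text, but not at this position.
print(answerMatchesOffset(context, '1894', context.index('1893')))
```

This matters most for short answers such as years or single words, which can easily appear more than once in a paragraph.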

### Extracting the answers

In [35]:
answers = []
sentences = []

for titleId in range(len(df['data'])):
    
    for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
        paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
        
        for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
            answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']
            answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']
            
            sentence = extractSentence(paragraph, answerStart)
            
            answers.append(answer)
            sentences.append(sentence)

In [36]:
answerTextsDf = pd.DataFrame(answers, columns=['answer'])
sentenceDf = pd.DataFrame(sentences, columns=['sentence'])

answersDf = pd.concat([answerTextsDf, sentenceDf], axis=1)
answersDf.head()

| | answer | sentence |
|---|---|---|
| 0 | Saint Bernadette Soubirous | It is a replica of the grotto at Lourdes, Fran... |
| 1 | a copper statue of Christ | Immediately in front of the Main Building and ... |
| 2 | the Main Building | Next to the Main Building is the Basilica of t... |
| 3 | a Marian place of prayer and reflection | Immediately behind the basilica is the Grotto,... |
| 4 | a golden statue of the Virgin Mary | Atop the Main Building's gold dome is a golden... |


### Answer word length

In [37]:
wordCount = []

for i in range(len(answersDf)):
    wordCount.append(len(tokenize.word_tokenize(answersDf.iloc[i]['answer'])))

In [38]:
answersDf = pd.concat([answersDf, pd.DataFrame(wordCount, columns=['wordCount'])], axis=1)

In [39]:
answersDf['wordCount'].describe()

count    98169.000000
mean         3.354919
std          3.731475
min          1.000000
25%          1.000000
50%          2.000000
75%          4.000000
max         46.000000
Name: wordCount, dtype: float64

In [40]:
answersDf['wordCount'].value_counts()

1     32158
2     25227
3     14349
4      7562
5      4657
6      3051
7      2222
8      1676
9      1206
10      975
11      755
12      653
13      565
14      461
15      407
16      313
18      274
17      269
19      243
20      191
21      183
23      138
22      132
25      120
24      101
26       76
28       59
27       58
29       29
30       18
31       12
32       11
33        6
38        2
34        2
35        2
36        2
37        2
42        1
46        1
Name: wordCount, dtype: int64

About 1/3 of the answers are single words, and about 3/4 (73%) are up to 3 words. Let's get an overview of the groups.
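
The shares can be checked directly from the counts printed above. The sketch below copies the first five rows of the `value_counts()` output and the total of 98169 answers; the variable names are just for illustration.

```python
from itertools import accumulate

totalAnswers = 98169
# First five rows of the value_counts() output above.
countsByLength = {1: 32158, 2: 25227, 3: 14349, 4: 7562, 5: 4657}

lengths = sorted(countsByLength)
runningTotals = accumulate(countsByLength[n] for n in lengths)
cumulativeShare = {n: total / totalAnswers for n, total in zip(lengths, runningTotals)}

for n in lengths:
    print('up to {} words: {:.1%}'.format(n, cumulativeShare[n]))
```

Directly on the dataframe, `answersDf['wordCount'].value_counts(normalize=True).sort_index().cumsum()` should produce the same cumulative shares for every length.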

In [41]:
answersDf[answersDf['wordCount'] == 1].sample(10, random_state=42)

| | answer | sentence | wordCount |
|---|---|---|---|
| 52643 | Hanja | Hanja are still used to some extent, particula... | 1 |
| 79467 | subtropical | Iran's climate ranges from arid or semiarid, t... | 1 |
| 88679 | 1130 | Roger's son, Roger II of Sicily, was crowned k... | 1 |
| 35390 | microphone | The second controller lacked the START and SEL... | 1 |
| 34469 | rarely | Since Elizabeth rarely gives interviews, littl... | 1 |
| 60270 | 1774 | This period also saw some contacts with Jesuit... | 1 |
| 10684 | ZigBee | Many newer control systems are using wireless ... | 1 |
| 43072 | 1956 | She then formed the World Women's Wrestling As... | 1 |
| 43751 | 1866 | In 1866, the feud between Austria and Prussia ... | 1 |
| 43557 | mid-1980s | The Museo Tamayo was opened in the mid-1980s t... | 1 |


There seem to be a lot of years and some names.

The two-word answers seem to be dominated by names. There are also a lot of answers where one of the words isn't useful. Some, like *a* and *the*, could easily be removed. *six years* and *three times* could also be turned into just 6 and 3. The *13.3%* seems to be just misplaced. Not sure if it's because of the *"."* or the *"%"*.

In [42]:
answersDf[answersDf['wordCount'] == 2].sample(n=20, random_state=5)

| | answer | sentence | wordCount |
|---|---|---|---|
| 72764 | German nationalism | Red, white, and black were the colors of the G... | 2 |
| 4799 | Notre Dame | In 2006, Lee was awarded an honorary doctorate... | 2 |
| 52778 | American colonies | The war had removed Bermuda's primary trading ... | 2 |
| 59586 | Wayne County | It is the seat of Wayne County, the most popul... | 2 |
| 7152 | The Beatles | The single, "A Moment Like This", went on to b... | 2 |
| 94263 | phagocytic cells | The complement system and phagocytic cells are... | 2 |
| 72088 | Mumbai area | 1,500 V DC is used in the Netherlands, Japan, ... | 2 |
| 4851 | Mockingbird groupies | Local residents call them "Mockingbird groupie... | 2 |
| 69075 | Vagad region | The hilly Vagad region, home to the cities of ... | 2 |
| 17980 | Lew Perkins | Under former athletic director Lew Perkins, th... | 2 |


The two-word answers seem to be dominated by names. There are also a lot of answers where one of the words isn't useful. Some, like *a*, *in* and *the*, could easily be removed. *six years* could be turned into just 6. The *23.02%* seems to be just misplaced. Not sure if it's because of the *"."* or the *"%"*.
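
Removing the unhelpful leading words could look roughly like the sketch below. The filler list is a hand-picked assumption and `stripLeadingFillers` is a hypothetical helper, not something used later in the notebook.

```python
# Assumption: a hand-picked list of leading words that rarely carry the answer.
LEADING_FILLERS = ('a', 'an', 'the', 'in')

def stripLeadingFillers(answer):
    """Drop leading filler words, but never reduce the answer to nothing."""
    words = answer.split()
    while len(words) > 1 and words[0].lower() in LEADING_FILLERS:
        words = words[1:]
    return ' '.join(words)

for answer in ['the Main Building', 'in the late 1990s', 'Notre Dame', 'The Beatles']:
    print(repr(answer), '->', repr(stripLeadingFillers(answer)))
```

The last example shows the catch: in a name like *The Beatles* the article is part of the answer, so a rule like this would probably need a named-entity check before stripping anything.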

In [43]:
answersDf[answersDf['wordCount'] == 3].sample(n=20, random_state=5)

| | answer | sentence | wordCount |
|---|---|---|---|
| 49157 | Vasco da Gama | Portugal had during the 15th century – particu... | 3 |
| 28486 | Copa del Generalísimo | The 1960s saw the emergence of Josep Maria Fus... | 3 |
| 79879 | individual football associations | Any decision regarding points awarded for aban... | 3 |
| 95828 | fear of betrayal | In 1354, when Toghtogha led a large army to cr... | 3 |
| 61068 | 970 and 1190 | The Chalukya dynasty ruled parts of southern a... | 3 |
| 92998 | keyed Northumbrian smallpipes | John Dunn, inventor of keyed Northumbrian smal... | 3 |
| 66064 | The Weather Company | On October 28, 2015, IBM announced its acquisi... | 3 |
| 24543 | 10 February 1931 | The city that was later dubbed "Lutyens' Delhi... | 3 |
| 50145 | political and moral | He is without parallel in any age, excepting p... | 3 |
| 89964 | The Black Cloister | Luther and his wife moved into a former monast... | 3 |


Again names, more institution names as well. 

In [44]:
answersDf[answersDf['wordCount'] == 5].sample(n=20, random_state=5)

| | answer | sentence | wordCount |
|---|---|---|---|
| 23975 | two and one hundred registers | There are typically between two and one hundre... | 5 |
| 46957 | for the purposes of lieutenancies | Within London, both the City of London and the... | 5 |
| 97610 | within the Church of England | The movement which would become The United Met... | 5 |
| 94664 | fully funded by private parties | The private 'un-aided' schools are fully funde... | 5 |
| 23638 | Babylonian Captivity of the Papacy | During the tumultuous 14th century, disputes w... | 5 |
| 32428 | SAM systems with ECCM capabilities | It is an arms race; as better jamming, counter... | 5 |
| 20708 | The stress of the movement | The stress of the movement causes the ice to b... | 5 |
| 74713 | Latin and Greek classical texts | Renaissance humanism took a close study of the... | 5 |
| 22455 | the Japanese state broadcaster NHK | In 1979, the Japanese state broadcaster NHK fi... | 5 |
| 76974 | a railway and air blockade | The initial post-Soviet years were marred by e... | 5 |


As the number of words increases, it seems harder to create deceptive incorrect answers. A viable option for some would be to mix the individual words like:

*end of World War I* -> *start of World War I*, *end of World War II*, *start of World War II*, *end of Balkan Wars*...

*large tumour on her liver* -> *large tumour on her brain*, *large tumour on her lungs*, *large (some other medical term) on her liver*

Though this would become more difficult, because if we use 2 generated words, they must fit with each other as well as with the original words.
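
The word-mixing idea can be sketched as a Cartesian product over per-slot alternatives. The alternatives here are written by hand for the first example; where they would actually come from (embeddings, entity lists) is left open, and `mixDistractors` is a hypothetical name.

```python
from itertools import product

def mixDistractors(slots, correct):
    """Combine per-slot alternatives into phrases, leaving out the correct one."""
    candidates = (' '.join(choice) for choice in product(*slots))
    return [phrase for phrase in candidates if phrase != correct]

# Hypothetical hand-written alternatives for each slot of the answer.
slots = [['end of', 'start of'], ['World War I', 'World War II']]
distractors = mixDistractors(slots, 'end of World War I')
print(distractors)
```

The product grows quickly with every extra slot, and it does nothing to check that the generated words fit each other, which is the difficulty mentioned above.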

Some of the answers look like logical phrases. For their generation I would argue that a text-summarization approach would work. And with longer answers we could employ **True/False** questions.

In [45]:
answersDf[answersDf['wordCount'] == 20].sample(n=10, random_state=5)

| | answer | sentence | wordCount |
|---|---|---|---|
| 14715 | to saturate broken ("dangling") bonds of amorp... | Hydrogen is employed to saturate broken ("dang... | 20 |
| 70704 | Bullied for being a Bedouin, he was proud of h... | Bullied for being a Bedouin, he was proud of h... | 20 |
| 39422 | the Bill & Melinda Gates Foundation Trust, whi... | In October 2006, the Bill & Melinda Gates Foun... | 20 |
| 21062 | There are 64 possible codons (four possible nu... | :6 Additionally, a "start codon", and three "s... | 20 |
| 5882 | Francesinha (Frenchie) from Porto, and bifanas... | Typical fast food dishes include the Francesin... | 20 |
| 34718 | into four summaries that look specifically at ... | However, results can be further simplified int... | 20 |
| 26946 | elected members and special office bearers suc... | The legislature consists of elected members an... | 20 |
| 96100 | support from China for a planned $2.5 billion ... | Kenyatta was "[a]ccompanied by 60 Kenyan busin... | 20 |
| 77789 | On 26 December 1999, Chelsea became the first ... | On 26 December 1999, Chelsea became the first ... | 20 |
| 97887 | format of the congress and many specifics of t... | Nevertheless, the format of the congress and m... | 20 |


In [46]:
answersDf[answersDf['wordCount'] == 20].sample(n=20, random_state=5).iloc[8]['answer']

'On 26 December 1999, Chelsea became the first Premier League side to field an entirely foreign starting line-up,'

I would argue that several questions with single-word answers could be created from this sentence, like:
- In what year? - *1999*
- Which team? - *Chelsea*

And our longest answer, with 46 words:

In [47]:
answersDf[answersDf['wordCount'] == 46].iloc[0]['answer']

'that the sudden shift of a huge quantity of water into the region could have relaxed the tension between the two sides of the fault, allowing them to move apart, and could have increased the direct pressure on it, causing a violent rupture'

*sudden shift of a huge quantity of water* seems like a good answer to the question *What could have relaxed the tension between the two sides?*

In [48]:
answersDf[answersDf['wordCount'] == 42].iloc[0]['answer']

'Hillary Clinton (2008), Howard Dean (2004), Gary Hart (1984 and 1988), Paul Tsongas (1992), Pat Robertson (1988) and Jerry Brown (1976, 1980, 1992).'

The second longest answer seems to be a sequence of correct answers to something like *Who has been a presidential candidate?*. This could be great for questions with multiple correct answers as well as multiple incorrect ones.

### Word types

**spaCy** turned out to be a pretty great tool which could provide me with *NER (named entity recognition), part-of-speech detection, word embedding similarity* and some more functions which may or may not be useful in my case.

In [50]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

#nlp = spacy.load('en_core_web_md')

#### Named entity recognition

In [51]:
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
print([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]


In [52]:
def NerForWord(text):
    doc = nlp(text)
    
    entitiesFound = len(doc.ents)
    
    if (entitiesFound > 0):
        #TODO - Could potentially find multiple entities in the text. We're returning only the first one.
        return doc.ents[0].label_
    else:
        return ''

In [53]:
NerForWord('Portugal')

'GPE'

A useful function for deciphering the tags. They really go deep into the grammatical types, most of which I hadn't even heard of until now. I suspect I'll have to group them up or not use some of the information at all.

In [54]:
spacy.explain("dobj")

'direct object'

Since the *spaCy* tagging works on tokens, it'll significantly ease my work if (for now) I work only with the answers that contain only 1 token. I use "token" loosely here: either a single spaCy token or a whole named entity, which can span several words (e.g. names).

By my judgment, most of the multi-token answers contain a single important token and a few words describing it, or are multiple correct tokens separated by 'and' or ','. I could try to extract the important tokens, but I don't think it's worth it at this point.

There are some great questions with multi-token answers, but I think it's better if I limit myself to single-token answers. That way I can work more easily with word embeddings and detect the tokens appropriate to be answers.

In [55]:
def isSingleToken(text):
    doc = nlp(text)
    
    #The entire text is a single named entity 
    entitiesFound = len(doc.ents)
    if(entitiesFound == 1 and doc.ents[0].text == text):
        return True
    
    #The text is not a named entity, but is a single token
    tokensFound = len(doc)
    if (tokensFound == 1):
        return True
    
    return False

In [56]:
isSingleToken('George R. R. Martin')

True

Let's see how many of our answers we're gonna cut.

In [57]:
singleTokenCount = 0

sampleSize =  int(len(answersDf) / 10)
for i in range(sampleSize):
        
    printPercentage(i, sampleSize)
    
    if (isSingleToken(answersDf.iloc[i]['answer'])):
        singleTokenCount += 1

99%


In [58]:
singleTokenCount / sampleSize

0.561837815810921

On 10% of the data about 56% is retained. I expected worse.

Let's do some of the more interesting spacy tags - NER, POS, DEP, TAG, SHAPE...

It's better to provide the full text to spaCy, because a token's tags are influenced by its relationship with the other words in the text. But we'll do that when we do the feature engineering and also tag the non-answer words.

In [59]:
doc = nlp('James R. Scott')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, len(doc.ents), doc.ents[0].label_)
    
shape = doc[0].shape_
for wordIndex in range(1, len(doc)):
    shape += (' ' + doc[wordIndex].shape_)
        
print(shape)

James James PROPN NNP compound Xxxxx True False 1 PERSON
R. R. PROPN NNP compound X. False False 1 PERSON
Scott Scott PROPN NNP ROOT Xxxxx True False 1 PERSON
Xxxxx X. Xxxxx
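
The `shape_` strings follow a simple pattern: uppercase letters become `X`, lowercase letters `x`, digits `d`, other characters are kept, and a run of the same shape character is cut off after four. The sketch below reimplements that rule in plain Python to make the outputs easier to read. It is an approximation for illustration, not spaCy's own code: `word_shape` and `text_shape` are hypothetical names, and splitting on whitespace is a simplification, since spaCy tokenizes more finely (for example it separates `$` from the number).

```python
def word_shape(word):
    """Approximate a spaCy-style shape string for a single word."""
    shape = []
    previous = ''
    run_length = 0
    for char in word:
        if char.isalpha():
            shape_char = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            shape_char = 'd'
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == previous:
            run_length += 1
        else:
            previous = shape_char
            run_length = 1
        # Keep at most four repeats of the same shape character
        if run_length <= 4:
            shape.append(shape_char)
    return ''.join(shape)


def text_shape(text):
    """Join the shapes of whitespace-separated words with spaces."""
    return ' '.join(word_shape(word) for word in text.split())


print(text_shape('James R. Scott'))  # Xxxxx X. Xxxxx
```

This also explains why longer words such as *suffering* all collapse to `xxxx` in the tables further down.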


In [60]:
spacy.explain('CARDINAL')

'Numerals that do not fall under another type'

Adding the additional columns

In [61]:
answersDf['isSingleToken'] = False
answersDf['NER'] = ''
answersDf['POS'] = ''
answersDf['TAG'] = ''
answersDf['DEP'] = ''
answersDf['shape'] = ''
answersDf['isAlpha'] = False
answersDf['isStop'] = False

In [62]:
answersDf.head()

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
0,Saint Bernadette Soubirous,"It is a replica of the grotto at Lourdes, Fran...",3,False,,,,,,False,False
1,a copper statue of Christ,Immediately in front of the Main Building and ...,5,False,,,,,,False,False
2,the Main Building,Next to the Main Building is the Basilica of t...,3,False,,,,,,False,False
3,a Marian place of prayer and reflection,"Immediately behind the basilica is the Grotto,...",7,False,,,,,,False,False
4,a golden statue of the Virgin Mary,Atop the Main Building's gold dome is a golden...,7,False,,,,,,False,False


Populating the single-token answers

In [63]:
singleTokenCount = 0

sampleSize = int(len(answersDf) / 10)

for i in range(sampleSize):
        
    printPercentage(i, sampleSize)
    
    answer = answersDf.iloc[i]['answer']
    if (isSingleToken(answer)):
        answersDf.at[i, 'isSingleToken'] = True
        
        answersDf.at[i, 'NER'] = NerForWord(answer)
        
        #At this point I've called spacy's nlp method 3 times for the same words...
        doc = nlp(answer)
        
        answersDf.at[i, 'POS'] = doc[0].pos_
        answersDf.at[i, 'TAG'] = doc[0].tag_
        answersDf.at[i, 'DEP'] = doc[0].dep_
        answersDf.at[i, 'isAlpha'] = doc[0].is_alpha
        answersDf.at[i, 'isStop'] = doc[0].is_stop
        
        shape = doc[0].shape_
        for wordIndex in range(1, len(doc)):
            shape += (' ' + doc[wordIndex].shape_)
            
        answersDf.at[i, 'shape'] = shape
        
        

99%
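
As the comment in the loop notes, each answer is parsed three times (once in `isSingleToken`, once in `NerForWord`, once for the tags). A possible way around that is to parse once and derive everything from the same `doc`. This is only a sketch: `describeSingleTokenAnswer` is a hypothetical helper, and `nlp` is assumed to be the spaCy pipeline loaded earlier.

```python
def describeSingleTokenAnswer(nlp, answer):
    """Parse `answer` once; return its features, or None if it is not single-token."""
    doc = nlp(answer)

    # Same rule as isSingleToken: one entity spanning the whole text, or one token
    isWholeEntity = len(doc.ents) == 1 and doc.ents[0].text == answer
    if not (isWholeEntity or len(doc) == 1):
        return None

    first = doc[0]
    return {
        'isSingleToken': True,
        # Same rule as NerForWord: label of the first entity, or '' if none
        'NER': doc.ents[0].label_ if len(doc.ents) > 0 else '',
        'POS': first.pos_,
        'TAG': first.tag_,
        'DEP': first.dep_,
        'isAlpha': first.is_alpha,
        'isStop': first.is_stop,
        # Shapes of all tokens joined with spaces, as in the loop above
        'shape': ' '.join(token.shape_ for token in doc),
    }
```

Each key of the returned dict matches one of the columns added earlier, so the values could be written into the row with `answersDf.at[i, column] = value`.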


#### Stopwords

In [64]:
answersDf['isStop'].value_counts()

False    97670
True       499
Name: isStop, dtype: int64

We can safely not bother with stopwords.

#### Named Entity Recognition

In [65]:
answersDf['NER'].value_counts()

               94185
CARDINAL        1021
PERSON           973
DATE             886
ORG              419
GPE              256
PERCENT          150
MONEY             88
NORP              56
LOC               43
QUANTITY          32
ORDINAL           19
FAC               16
TIME              12
WORK_OF_ART        5
EVENT              4
PRODUCT            3
LAW                1
Name: NER, dtype: int64

Note that I've run NER on only 10% of the dataset, so on 9,816 of the 98,169 answers (going by the counts above). About 40% of those sampled answers have a NER label.
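
The 40% figure can be checked directly from the value counts printed above. A small sketch, assuming the tagged sample is the first tenth of the rows (as in the loop that populated the columns):

```python
# Counts copied from the NER value_counts output above ('' means no label)
nerCounts = {
    '': 94185, 'CARDINAL': 1021, 'PERSON': 973, 'DATE': 886, 'ORG': 419,
    'GPE': 256, 'PERCENT': 150, 'MONEY': 88, 'NORP': 56, 'LOC': 43,
    'QUANTITY': 32, 'ORDINAL': 19, 'FAC': 16, 'TIME': 12, 'WORK_OF_ART': 5,
    'EVENT': 4, 'PRODUCT': 3, 'LAW': 1,
}

totalRows = sum(nerCounts.values())
sampleSize = int(totalRows / 10)      # only the first 10% of rows were tagged
labelled = totalRows - nerCounts['']  # rows with a non-empty NER label

print(totalRows, sampleSize, labelled)  # 98169 9816 3984
print(round(labelled / sampleSize, 3))  # 0.406
```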

In [66]:
answersDf[answersDf['NER'] == 'ORG'].sample(n=10, random_state=5)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
3082,MGM,In November 2013 MGM and the McClory estate fo...,1,True,ORG,PROPN,NNP,ROOT,XXX,True,False
6415,Suddhodana,"According to this narrative, shortly after the...",1,True,ORG,PROPN,NNP,ROOT,Xxxxx,True,False
994,Parkwood Topshop Athletic Ltd,The 50-50 venture is called Parkwood Topshop A...,4,True,ORG,PROPN,NNP,compound,Xxxxx Xxxxx Xxxxx Xxx,True,False
8224,People's Daily,"In response to the demonstrations, an editoria...",3,True,ORG,NOUN,NNS,poss,Xxxxx 'x Xxxxx,True,False
990,Parkwood Topshop Athletic Ltd,The 50-50 venture is called Parkwood Topshop A...,4,True,ORG,PROPN,NNP,compound,Xxxxx Xxxxx Xxxxx Xxx,True,False
9376,Gustavia,The islanders developed commerce through the p...,1,True,ORG,PROPN,NNP,ROOT,Xxxxx,True,False
4115,National Park Service,The Statue of Liberty National Monument and El...,3,True,ORG,PROPN,NNP,compound,Xxxxx Xxxx Xxxxx,True,False
7734,Walt Disney World,"On February 14, 2009, The Walt Disney Company ...",3,True,ORG,PROPN,NNP,compound,Xxxx Xxxxx Xxxxx,True,False
2749,Entertainment Weekly,Entertainment Weekly put it on its end-of-the-...,2,True,ORG,PROPN,NNP,compound,Xxxxx Xxxxx,True,False
3723,The State Council,The State Council declared a three-day period ...,3,True,ORG,DET,DT,det,Xxx Xxxxx Xxxxx,True,True


In [67]:
answersDf['isAlpha'].value_counts()

False    94264
True      3905
Name: isAlpha, dtype: int64

#### Part of speech

In [68]:
answersDf['POS'].value_counts()

         92654
PROPN     2686
NUM       1633
NOUN       546
ADJ        198
DET        119
X           90
SYM         71
VERB        63
ADP         54
ADV         43
PUNCT        7
PRON         2
INTJ         2
PART         1
Name: POS, dtype: int64

##### Nouns

The answers are dominated by nouns. The difference between a common noun (NOUN) and a proper noun (PROPN) is that proper nouns name specific people, places, ideas... while common nouns are non-specific (cat, woman, bottle...).

In [69]:
answersDf[answersDf['POS'] == 'PROPN'].sample(n=5, random_state=16)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
6784,Vedas,The Buddha says that it was on this alteration...,1,True,ORG,PROPN,NNP,ROOT,Xxxxx,True,False
6707,pāramitā,"It is one of the three practices (sīla, samādh...",1,True,,PROPN,NNP,ROOT,xxxx,True,False
7353,DioGuardi,Both Allen and Lambert released the coronation...,1,True,,PROPN,NNP,ROOT,XxxXxxxx,True,False
7627,Fox,The show pushed Fox to become the number one U...,1,True,ORG,PROPN,NNP,ROOT,Xxx,True,False
6593,disquietude,"Although the term is often translated as ""suff...",1,True,,PROPN,NNP,ROOT,xxxx,True,False


In [70]:
answersDf[answersDf['POS'] == 'NOUN'].sample(n=5, random_state=16)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
6597,suffering,In the Nikayas anatta is not meant as a metaph...,1,True,,NOUN,NN,ROOT,xxxx,True,False
694,R&B,"Beyoncé's music is generally R&B, but she also...",3,True,,NOUN,NN,ROOT,X&X,False,False
2844,boss,Link navigates these dungeons and fights a bos...,1,True,,NOUN,NN,ROOT,xxxx,True,False
7983,horse,Use of dogs as pack animals in these cultures ...,1,True,,NOUN,NN,ROOT,xxxx,True,False
2102,Elsner,Chopin's polonaises show a marked advance on t...,1,True,,NOUN,NN,ROOT,Xxxxx,True,False


##### Numerals

The second most prominent category is NUM. It's pretty much years and other numbers.

In [71]:
answersDf[answersDf['POS'] == 'NUM'].sample(n=10, random_state=16)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
9042,33,The average hours per work week declined to 33...,1,True,CARDINAL,NUM,CD,ROOT,dd,False,False
3320,10 km,The focus was deeper than 10 km.,2,True,QUANTITY,NUM,CD,nummod,dd xx,False,False
6256,7:35 pm,"On November 10, 2007, at approximately 7:35 pm...",2,True,TIME,NUM,CD,nummod,d:dd xx,False,False
3097,1983,Oberhauser shares his name with Hannes Oberhau...,1,True,DATE,NUM,CD,ROOT,dddd,False,False
7580,21.7 million,The season attracted an average of 21.7 millio...,2,True,CARDINAL,NUM,CD,compound,dd.d xxxx,False,False
176,2015,A theology library was also opened in fall of ...,1,True,DATE,NUM,CD,ROOT,dddd,False,False
5667,18,Continental Portugal is agglomerated into 18 d...,1,True,CARDINAL,NUM,CD,ROOT,dd,False,False
2182,1402–1424,In hopes of reviving the unique relationship o...,1,True,,NUM,CD,ROOT,dddd–dddd,False,False
5270,13,By 1898 the American Association for the Advan...,1,True,CARDINAL,NUM,CD,ROOT,dd,False,False
4235,180000,The city's fashion industry provides approxima...,1,True,CARDINAL,NUM,CD,ROOT,"ddd,ddd",False,False


##### Adjectives and Verbs

Didn't really expect much of those, but they seem like adequate answers. 

In [72]:
answersDf[(answersDf['POS'] == 'ADJ') & (answersDf['wordCount'] == 1)].sample(n=5, random_state=4)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
8019,emotional,This gives dogs the ability to recognize emoti...,1,True,,ADJ,JJ,ROOT,xxxx,True,False
1500,Romantic,Frédéric François Chopin (/ˈʃoʊpæn/; French pr...,1,True,,ADJ,JJ,ROOT,Xxxxx,True,False
8296,peaceful,Several hundred pro-Tibet protesters gathered ...,1,True,,ADJ,JJ,ROOT,xxxx,True,False
6720,seventh,"For the complete list, the seventh precept is ...",1,True,ORDINAL,ADJ,JJ,ROOT,xxxx,True,False
6768,Tantric,"Though based upon Mahayana, Tibeto-Mongolian B...",1,True,,ADJ,JJ,ROOT,Xxxxx,True,False


In [73]:
answersDf[(answersDf['POS'] == 'VERB') & (answersDf['wordCount'] == 1)].sample(n=5, random_state=4)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
444,Listen,"To promote the film, Beyoncé released ""Listen""...",1,True,,VERB,VB,ROOT,Xxxxx,True,False
5376,increasing,The kind of issues addressed and implications ...,1,True,,VERB,VBG,ROOT,xxxx,True,False
4550,crack,Others cite the end of the crack epidemic and ...,1,True,,VERB,VB,ROOT,xxxx,True,False
3696,Promise,"In June, Hong Kong actor Jackie Chan, who dona...",1,True,,VERB,VB,ROOT,Xxxxx,True,False
5373,participating,"More simply, applied anthropology is the pract...",1,True,,VERB,VBG,ROOT,xxxx,True,False


##### Symbols

All of the symbols are multi-word answers with a dollar sign in front.

In [74]:
answersDf[(answersDf['POS'] == 'SYM')].sample(n=5, random_state=4)

Unnamed: 0,answer,sentence,wordCount,isSingleToken,NER,POS,TAG,DEP,shape,isAlpha,isStop
3777,$26 million,The association has also collected a total of ...,3,True,MONEY,SYM,$,quantmod,$ dd xxxx,False,False
9260,$1.5 trillion,Following a model initiated by the United King...,3,True,MONEY,SYM,$,quantmod,$ d.d xxxx,False,False
9086,$650 billion,"Bernanke explained that between 1996 and 2004,...",3,True,MONEY,SYM,$,quantmod,$ ddd xxxx,False,False
3252,$70.4 million,The film ended up grossing $70.4 million in it...,3,True,MONEY,SYM,$,quantmod,$ dd.d xxxx,False,False
3675,$772 million,Many donated through text messaging on mobile ...,3,True,MONEY,SYM,$,quantmod,$ ddd xxxx,False,False


## Answers in a bigger picture

Let's take a look at the highlighted answers.

I suspect:

1. There are other (many more?) obviously good words for answers that were just not selected.
2. There are some sentences that just don't contain answers.

In [75]:
def highlightAnswers(titleId, paragraphId):

    paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
    
    answers = df['data'][titleId]['paragraphs'][paragraphId]['qas']

    #Get answer starts and answer length
    answerPosition = {}
    for answer in answers:
        answerStart = answer['answers'][0]['answer_start']
        answerLength = len(answer['answers'][0]['text'])

        answerPosition[answerStart] = answerLength

    #Bold answers
    highlightedText = ''
    currentPlaceInText = 0
    
    #Append text between previous answer and current answer + bold sign + answer + bold sign
    for answerStart in sorted(answerPosition.keys()):
        highlightedText += paragraph[currentPlaceInText:answerStart]
        highlightedText += '**'
        highlightedText += paragraph[answerStart:answerStart + answerPosition[answerStart]]
        highlightedText += '**'
        
        currentPlaceInText = answerStart + answerPosition[answerStart]
    
    #Append the remaining text after the last answer
    highlightedText += paragraph[currentPlaceInText:len(paragraph)]

    #Display the highlighted text
    display(Markdown(highlightedText))
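
One caveat with `highlightAnswers`: it keys the answers by start offset and assumes they never overlap. If two answers in a paragraph do overlap, the slice-and-append loop would repeat the shared text. Below is a minimal sketch of one way to guard against that, by merging overlapping `(start, length)` spans before inserting the bold markers. `mergeSpans` and `boldSpans` are hypothetical helper names, and the sentence is a toy example rather than dataset text.

```python
def mergeSpans(spans):
    """Merge overlapping or touching (start, length) spans into (start, end) pairs."""
    merged = []
    for start, length in sorted(spans):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def boldSpans(text, spans):
    """Wrap each merged span of `text` in Markdown bold markers."""
    pieces = []
    position = 0
    for start, end in mergeSpans(spans):
        pieces.append(text[position:start])
        pieces.append('**' + text[start:end] + '**')
        position = end
    pieces.append(text[position:])
    return ''.join(pieces)


text = 'The overuse of antibiotics has been studied.'
# Two overlapping answers: 'overuse of antibiotics' and 'antibiotics'
spans = [(4, 22), (15, 11)]
print(boldSpans(text, spans))  # The **overuse of antibiotics** has been studied.
```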

In [76]:
titleId = 24
paragraphId = 0

highlightAnswers(titleId, paragraphId)

Located approximately 250 kilometres (**160** mi) east of Puerto Rico and the nearer Virgin Islands, St. Barthélemy lies immediately southeast of the islands of Saint Martin and Anguilla. It is one of **the Renaissance** Islands. St. Barthélemy is separated from Saint Martin by **the Saint-Barthélemy Channel**. It lies northeast of Saba and St Eustatius, and north of St Kitts. Some small **satellite islets** belong to St. Barthélemy including Île Chevreau (Île Bonhomme), Île Frégate, Île Toc Vers, Île Tortue and Gros Îlets (Îlots Syndare). A much bigger islet, Île Fourchue, lies on the north of the island, in the Saint-Barthélemy Channel. Other rocky islets which include Coco, the Roques (or **little Turtle rocks**), the Goat, and the Sugarloaf.

In [77]:
titleId = 4
paragraphId = 12

highlightAnswers(titleId, paragraphId)

**Inappropriate antibiotic treatment and overuse** of antibiotics have contributed to the emergence of antibiotic-resistant bacteria. **Self prescription** of antibiotics is an example of misuse. Many antibiotics are frequently prescribed to treat symptoms or diseases that do not respond to antibiotics or that are likely to resolve without treatment. Also, incorrect or suboptimal antibiotics are prescribed for certain bacterial infections. The **overuse of antibiotics**, like penicillin and erythromycin, has been associated with emerging antibiotic resistance since the 1950s. Widespread usage of antibiotics in hospitals has also been associated with increases in bacterial strains and species that no longer respond to treatment with the most common antibiotics.

In [78]:
titleId = 52
paragraphId = 4

highlightAnswers(titleId, paragraphId)

According to the **endurance running hypothesis**, long-distance running as in **persistence hunting**, a method still practiced by **some hunter-gatherer groups** in modern times, was likely the driving evolutionary force leading to the evolution of certain human characteristics. This hypothesis does not necessarily contradict the **scavenging hypothesis**: **both subsistence strategies** could have been in use – sequentially, alternating or even simultaneously.

In [79]:
titleId = 453
paragraphId = 1

highlightAnswers(titleId, paragraphId)

The first commercially successful true engine, in that it could generate power and transmit it to a machine, was the **atmospheric engine**, invented by **Thomas Newcomen** around **1712**. It was an improvement over Savery's **steam pump**, using a piston as proposed by **Papin**. Newcomen's engine was relatively inefficient, and in most cases was used for pumping water. It worked by creating a partial vacuum by condensing steam under a piston within a cylinder. It was employed for draining mine workings at depths hitherto impossible, and also for providing a reusable water supply for driving waterwheels at factories sited away from a suitable "head". Water that had passed over the wheel was pumped back up into a storage reservoir above the wheel.

It definitely seems that there are a lot more words that could become good answers. But I'm optimistic I can extract the selected words' features even if I don't have all of the possible answer words.

At first glance, it seems like the answers are spread throughout the entire text and there aren't that many sentences without an answer. A better experiment would be to count the sentences without answers against the ones with,
but I don't see a large enough benefit to do it (deadline approaching).
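
For reference, the counting experiment could look roughly like this. It is a sketch only: the sentence splitter is a naive regular expression, chosen to keep the example self-contained (abbreviations like "St." will trip it up, and `nltk.tokenize.sent_tokenize` would be the more robust choice). `sentenceSpans` and `countSentencesWithAnswers` are hypothetical helpers, and the paragraph is a toy example.

```python
import re


def sentenceSpans(paragraph):
    """Return (start, end) character offsets of naively split sentences."""
    spans = []
    # A sentence is a run of text followed by any ., ! or ? characters
    for match in re.finditer(r'[^.!?]+[.!?]*', paragraph):
        if match.group().strip():
            spans.append((match.start(), match.end()))
    return spans


def countSentencesWithAnswers(paragraph, answerStarts):
    """Return (sentences containing an answer start, total sentences)."""
    spans = sentenceSpans(paragraph)
    withAnswer = sum(
        any(start <= answerStart < end for answerStart in answerStarts)
        for start, end in spans
    )
    return withAnswer, len(spans)


paragraph = 'It was invented by Thomas Newcomen. It was inefficient. It pumped water.'
answerStarts = [paragraph.index('Thomas Newcomen'), paragraph.index('water')]
print(countSentencesWithAnswers(paragraph, answerStarts))  # (2, 3)
```

Summing the two numbers over all paragraphs would give the share of sentences that contain at least one answer.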

## Noun chunks
Another neat thing spacy gives us is noun chunks.

In [80]:
text = df['data'][0]['paragraphs'][0]['context']
doc = nlp(text)

for noun_chunk in doc.noun_chunks:
    print(noun_chunk)

the school
a Catholic character
the Main Building's gold dome
a golden statue
the Virgin Mary
front
the Main Building
it
a copper statue
Christ
arms
the legend
"Venite Ad Me Omnes
the Main Building
the Basilica
the Sacred Heart
the basilica
the Grotto
a Marian place
prayer
reflection
It
a replica
the grotto
Lourdes
France
the Virgin Mary
Saint Bernadette Soubirous
the end
the main drive
a direct line
3 statues
the Gold Dome
a simple, modern stone statue
Mary


In [81]:
titleId = 0
paragraphId = 0

highlightAnswers(titleId, paragraphId)

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is **a golden statue of the Virgin Mary**. Immediately in front of the Main Building and facing it, is **a copper statue of Christ** with arms upraised with the legend "Venite Ad Me Omnes". Next to **the Main Building** is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, **a Marian place of prayer and reflection**. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to **Saint Bernadette Soubirous** in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

In the first paragraph we have 2 of the answers entirely recognized as noun chunks:
1. **the Main Building**
2. **Saint Bernadette Soubirous**

While the other 3 answers are partially cut:
1. **a golden statue** of the Virgin Mary
2. **a copper statue** of *Christ* 
3. **a Marian place** of *prayer* and *reflection*
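
The same comparison can be done mechanically. The sketch below classifies each answer as an exact noun-chunk match, as starting with a noun chunk, or neither. The chunk strings are a subset of the output above, the answers are the five highlighted ones, and `classifyAnswer` is a hypothetical helper.

```python
# A subset of the noun chunks printed above
nounChunks = {
    'the Main Building', 'Saint Bernadette Soubirous', 'a golden statue',
    'the Virgin Mary', 'a copper statue', 'Christ', 'a Marian place',
    'prayer', 'reflection',
}

# The five highlighted answers of the first paragraph
answers = [
    'a golden statue of the Virgin Mary',
    'a copper statue of Christ',
    'the Main Building',
    'a Marian place of prayer and reflection',
    'Saint Bernadette Soubirous',
]


def classifyAnswer(answer, chunks):
    """Return 'exact', 'partial' (a chunk is a leading prefix) or 'none'."""
    if answer in chunks:
        return 'exact'
    if any(answer.startswith(chunk + ' ') for chunk in chunks):
        return 'partial'
    return 'none'


for answer in answers:
    print(classifyAnswer(answer, nounChunks), '-', answer)
```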

Though I would argue that all of the other noun chunks would make great answers.
I could potentially use only noun chunks for the answers and sacrifice the verbs and adjectives. But the noun chunks are mostly multi-word. That would pose a problem with my features:
1. **Part of speech** - Couldn't really do it on multiple words.
2. **TF-IDF** - Would need to modify it by either aggregating the scores of the single words or scoring the entire noun chunk... or both.
3. **Title similarity** - Aggregation of the single words.
4. **Incorrect answers** - That would be tricky, because I would need to find similar words for each word in the chunk and mix and match with the other similar words... That is bound to produce some inadequate mixes. But it still may not be a bad thing if I rely on a final filtering by a human.
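
To make points 2 to 4 a bit more concrete, here is a rough sketch of the two ideas: aggregating per-word scores over a chunk, and mixing similar words to build incorrect-answer candidates. All scores and similar-word lists are made-up placeholders and the helper names are hypothetical. In practice the inputs would come from the TF-IDF and word-embedding steps.

```python
from itertools import product


def aggregateScore(chunkWords, wordScores, how='mean'):
    """Combine per-word scores (e.g. TF-IDF) into one score for the chunk."""
    scores = [wordScores.get(word.lower(), 0.0) for word in chunkWords]
    if not scores:
        return 0.0
    if how == 'max':
        return max(scores)
    return sum(scores) / len(scores)


def mixedCandidates(similarWords):
    """Build every combination of one similar word per position in the chunk."""
    return [' '.join(combination) for combination in product(*similarWords)]


# Placeholder scores, not real TF-IDF values
wordScores = {'copper': 0.8, 'statue': 0.4}
print(aggregateScore(['copper', 'statue'], wordScores))         # mean of the two
print(aggregateScore(['copper', 'statue'], wordScores, 'max'))  # larger of the two

# Placeholder similar words for each position of 'copper statue'
similarWords = [['bronze', 'silver'], ['sculpture', 'monument']]
print(mixedCandidates(similarWords))
```

The number of combinations grows multiplicatively with chunk length, which is one more reason a human filtering step would be needed.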