# Data exploration - SQuAD v1

In [1]:
#Imports
import pandas as pd
from IPython.display import Markdown, display, clear_output
from nltk import tokenize
from scipy import stats
from IPython.core.debugger import set_trace

### Pretty printing

In [2]:
def printBold(string):
    display(Markdown('**' + string + '**'))
    
#def printColor():
#     display(Markdown('<span style="color:blue">blue</span>'))

## Reading the datasets

Since we aren't actually answering the questions, which is the dataset's intended use, we'll merge the train and dev datasets into one. The test dataset is probably hidden, since there's a competition for it.

In [3]:
train = pd.read_json('../data/squad-v1/train-v1.1.json', orient='columns')
dev = pd.read_json('../data/squad-v1/dev-v1.1.json', orient='columns')

In [4]:
df = pd.concat([train, dev], ignore_index=True)

In [5]:
df.head()

Unnamed: 0,data,version
0,"{'title': 'University_of_Notre_Dame', 'paragra...",1.1
1,"{'title': 'Beyoncé', 'paragraphs': [{'context'...",1.1
2,"{'title': 'Montana', 'paragraphs': [{'context'...",1.1
3,"{'title': 'Genocide', 'paragraphs': [{'context...",1.1
4,"{'title': 'Antibiotics', 'paragraphs': [{'cont...",1.1


Let's look at what we've got.

In [6]:
def showQuestion(titleId, paragraphId, questionId):

    title = df['data'][titleId]['title']
    paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
    question = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['question']
    answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']
    answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']

    printBold('Title')
    print(title)
    printBold('Paragraph')
    print(paragraph)
    printBold('Question')
    print(question)
    printBold('Answer')
    print(answerStart)
    print(answer)

In [7]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

515
Saint Bernadette Soubirous


## Dataset size

In [8]:
titlesCount = len(df['data'])
totalParagraphsCount = 0
totalQuestionsCount = 0

for titleId in range(titlesCount):
    paragraphsCount = len(df['data'][titleId]['paragraphs'])
    totalParagraphsCount += paragraphsCount
    
    for paragraphId in range(paragraphsCount):
        questionsCount = len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])
        
        totalQuestionsCount += questionsCount
        
print('Titles', titlesCount)
print('Paragraphs', totalParagraphsCount)
print('Questions', totalQuestionsCount)

Titles 490
Paragraphs 20963
Questions 98169
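
The nested index loops used for this count come back several times in the notebook. As a minimal sketch, a generator can walk the same structure once and yield one record per question. `iter_questions` is a hypothetical helper that is not part of the notebook, it assumes `articles` is a sequence of dicts shaped like the entries of `df['data']`, and the tiny `sample` is made up for illustration.

```python
def iter_questions(articles):
    """Yield (titleId, paragraphId, questionId, context, qa) for every question."""
    for title_id, article in enumerate(articles):
        for paragraph_id, paragraph in enumerate(article['paragraphs']):
            for question_id, qa in enumerate(paragraph['qas']):
                yield title_id, paragraph_id, question_id, paragraph['context'], qa


# Made-up sample with the same nesting as the SQuAD v1.1 files
sample = [
    {'title': 'A', 'paragraphs': [
        {'context': 'Cats purr.', 'qas': [
            {'question': 'What do cats do?',
             'answers': [{'text': 'purr', 'answer_start': 5}]},
        ]},
        {'context': 'Dogs bark. Birds sing.', 'qas': [
            {'question': 'What do dogs do?',
             'answers': [{'text': 'bark', 'answer_start': 5}]},
            {'question': 'What do birds do?',
             'answers': [{'text': 'sing', 'answer_start': 17}]},
        ]},
    ]},
]

records = list(iter_questions(sample))
print('Questions', len(records))
print(records[2][:3])  # flat question index 2 -> (titleId, paragraphId, questionId)
```

On the real data this would presumably be called as `iter_questions(df['data'])`, and indexing the resulting list gives the same kind of lookup from a flat question number to its three ids.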


## Titles

In [9]:
titles = []
for titleId in range(len(df['data'])):
    titles.append(df['data'][titleId]['title'])
    
for i in range(20):
    print(titles[i])

University_of_Notre_Dame
Beyoncé
Montana
Genocide
Antibiotics
Frédéric_Chopin
Sino-Tibetan_relations_during_the_Ming_dynasty
IPod
The_Legend_of_Zelda:_Twilight_Princess
Spectre_(2015_film)
2008_Sichuan_earthquake
New_York_City
To_Kill_a_Mockingbird
Solar_energy
Tajikistan
Anthropology
Portugal
Kanye_West
Buddhism
American_Idol


The titles are pretty random. There seem to be a lot of locations, like countries and cities, but not nearly enough to justify splitting the dataset.

## Questions

One of our main assumptions is that the sentence that contains the answer could be turned into a question just by removing the answer from it. Let's see how much of that is true for the questions in this dataset.

In [10]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

515
Saint Bernadette Soubirous


In [11]:
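def extractSentence(paragraph, answerStart):
    """Return the sentence of `paragraph` that contains the character offset `answerStart`."""
    sentences = tokenize.sent_tokenize(paragraph)
    sentenceStart = 0

    for sentence in sentences:
        # The answer starts before the end of this sentence, so this is the one
        if (sentenceStart + len(sentence) >= answerStart):
            return sentence

        # Assumes consecutive sentences are separated by exactly one character
        sentenceStart += len(sentence) + 1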
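
`extractSentence` assumes that consecutive sentences are separated by exactly one character, so the running offset may drift when a paragraph contains double spaces or newlines. A possible alternative, sketched below, looks up each sentence's real position with `str.find`. `locate_sentence` is a hypothetical helper, and it takes the already tokenized sentences as an argument so the sketch does not depend on a particular tokenizer.

```python
def locate_sentence(paragraph, sentences, answer_start):
    """Return the sentence whose character span covers answer_start, else None."""
    cursor = 0
    for sentence in sentences:
        # Find where this sentence really starts, searching from the previous end
        start = paragraph.find(sentence, cursor)
        if start == -1:
            continue  # the tokenizer altered the text, so skip this sentence
        end = start + len(sentence)
        if start <= answer_start < end:
            return sentence
        cursor = end
    return None


# Made-up paragraph with irregular whitespace between sentences
paragraph = 'First one.  Second one here.\nThird.'
sentences = ['First one.', 'Second one here.', 'Third.']
print(locate_sentence(paragraph, sentences, paragraph.index('Third')))
```

Returning `None` explicitly when no span matches makes it easier to spot questions whose offsets do not line up with any sentence.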

In [12]:
paragraph = df['data'][0]['paragraphs'][0]['context']
answerStart = df['data'][0]['paragraphs'][0]['qas'][0]['answers'][0]['answer_start']

sentence = extractSentence(paragraph, answerStart)
print(sentence)

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.


In [13]:
def containedInText(text, question):
    
    questionWords = tokenize.word_tokenize(question.lower())
    textWords = tokenize.word_tokenize(text.lower())
    wordsContained = 0

    for questionWord in questionWords:
        for textWord in textWords:
            if (questionWord == textWord):
                wordsContained += 1
                break

    return wordsContained / len(questionWords)

In [14]:
question =  df['data'][0]['paragraphs'][0]['qas'][0]['question']

contained = containedInText(sentence, question)

In [15]:
printBold('Question')
print(question)
printBold('Sentence')
print(sentence)
printBold("Contained")
print(contained)

**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Sentence**

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.


**Contained**

0.6428571428571429


I wouldn't expect 100% containment, simply because the questions will contain **question-like words** like *Why*, *Who*, *Whom*, *What*.

In this example we also see that the word *appear* is contained in the original sentence, but in the **past tense**. We could take care of that by taking the **stems** of the words, but I think it's better to measure the least imaginative way of forming questions.

We are also counting some **common words like *to, the, in***, which could be encountered at different places in the sentence, but again we want to measure the least creative questions.
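
For comparison, here is a rough sketch of a containment score that ignores question words and common words. The two word lists are hand-picked assumptions, `content_containment` is a hypothetical helper, and the regex tokenization is a simplification that drops punctuation, so its numbers are not directly comparable with `containedInText`.

```python
import re

# Hand-picked lists, assumed for illustration only
QUESTION_WORDS = {'who', 'whom', 'whose', 'what', 'which', 'when', 'where', 'why', 'how'}
COMMON_WORDS = {'a', 'an', 'the', 'to', 'in', 'of', 'at', 'on',
                'is', 'are', 'was', 'were', 'did', 'do', 'does', 'it'}


def content_containment(text, question):
    """Share of the question's content words that also occur in text."""
    text_words = set(re.findall(r'\w+', text.lower()))
    question_words = [
        word for word in re.findall(r'\w+', question.lower())
        if word not in QUESTION_WORDS and word not in COMMON_WORDS
    ]
    if not question_words:
        return 0.0  # nothing left to compare after filtering
    return sum(word in text_words for word in question_words) / len(question_words)


sentence = ('It is a replica of the grotto at Lourdes, France where the Virgin Mary '
            'reputedly appeared to Saint Bernadette Soubirous in 1858.')
question = 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
print(content_containment(sentence, question))
```

On this example only *allegedly* and *appear* remain unmatched, which are exactly the synonym and tense cases.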

In this sentence *(damn, that was a good example)* we also see that the question uses the word *allegedly*, which is a **synonym** of *reputedly* in the sentence. That could be nice for question forming, but I think it's overkill.

We also see that the question actually encompasses the **words around the answer, rather than the entire sentence**, which is a definite must-do when we form our questions.

Let's see what the score is on all of the questions. I'm also curious to see the score on the entire paragraph.

This may come in handy in the future. Pretty printing the progress.

In [16]:
#Printing the percentage done
def printPercentage(currentStep, maxStep):
    stepSize = maxStep / 100
    
    if (int(currentStep / stepSize) > ((currentStep - 1) / stepSize)):
        clear_output()
        print('{}%'.format(int(currentStep / stepSize)))

In [17]:
sentenceScore = []
paragraphScore = []

#For each title
titlesCount = len(df['data'])
for titleId in range(titlesCount):
    printPercentage(titleId, titlesCount)
    
    #For each paragraph
    for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
        paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
        
        #For each question
        for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
            question = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['question']
            answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']
            sentence = extractSentence(paragraph, answerStart)
          
            sentenceScore.append(containedInText(sentence, question))
            paragraphScore.append(containedInText(paragraph, question))            

99%


In [18]:
sentenceScoreDf = pd.DataFrame(sentenceScore, columns=['sentence'])
paragraphScoreDf = pd.DataFrame(paragraphScore, columns=['paragraph'])

questionContainmentDf = pd.concat([sentenceScoreDf, paragraphScoreDf], axis=1)
questionContainmentDf.describe()

Unnamed: 0,sentence,paragraph
count,98169.0,98169.0
mean,0.463937,0.582157
std,0.190377,0.159055
min,0.0,0.0
25%,0.333333,0.5
50%,0.461538,0.6
75%,0.6,0.7
max,1.0,1.0


I would argue that having almost half of the question words contained in the sentence is a pretty good result. 

As expected, containment within the entire paragraph is higher.

I do wonder about those questions that are 100% contained in the answer sentence.

In [19]:
questionContainmentDf.head(10)

Unnamed: 0,sentence,paragraph
0,0.642857,0.571429
1,0.636364,0.636364
2,0.533333,0.6
3,0.375,0.5
4,0.333333,0.416667
5,0.272727,0.636364
6,0.3,0.8
7,0.363636,0.727273
8,0.0,0.545455
9,0.266667,0.733333


In [20]:
def getQuestionAt(index):
    currentIndex = 0
    
    for titleId in range(len(df['data'])):
        for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
            for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
                if (currentIndex == index):
                    return titleId, paragraphId, questionId
                currentIndex += 1

Let's see question #8 which has 0 containment in the answer sentence. 

In [21]:
getQuestionAt(8)

(0, 1, 3)

In [22]:
titleId = 0
paragraphId = 1 
questionId = 3

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, when some students believed that The Observer began to show a conservative bias, a lib

**Question**

How many student news papers are found at Notre Dame?


**Answer**

126
three


The question is actually formed from the previous sentence.

### 0% containment

In [23]:
questionContainmentDf[questionContainmentDf['paragraph'] == 0].head()

Unnamed: 0,sentence,paragraph
269,0.0,0.0
363,0.0,0.0
505,0.0,0.0
2781,0.0,0.0
3678,0.0,0.0


In [24]:
getQuestionAt(269)

(1, 0, 0)

In [25]:
titleId = 1
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

Beyoncé


**Paragraph**

Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".


**Question**

When did Beyonce start becoming popular?


**Answer**

269
in the late 1990s


A **synonym** case - instead of *rose to fame*, *start becoming popular* is used.

In [26]:
getQuestionAt(505)

(1, 18, 6)

In [27]:
titleId = 1
paragraphId = 18 
questionId = 6

showQuestion(titleId, paragraphId, questionId)

**Title**

Beyoncé


**Paragraph**

In 2011, documents obtained by WikiLeaks revealed that Beyoncé was one of many entertainers who performed for the family of Libyan ruler Muammar Gaddafi. Rolling Stone reported that the music industry was urging them to return the money they earned for the concerts; a spokesperson for Beyoncé later confirmed to The Huffington Post that she donated the money to the Clinton Bush Haiti Fund. Later that year she became the first solo female artist to headline the main Pyramid stage at the 2011 Glastonbury Festival in over twenty years, and was named the highest-paid performer in the world per minute.


**Question**

When did this leak happen?


**Answer**

3
2011


That's just a bad question. It could only be asked in combination with the text.

### 100% containment

In [28]:
questionContainmentDf[questionContainmentDf['sentence'] == 1]

Unnamed: 0,sentence,paragraph
21911,1.0,1.0
39394,1.0,1.0
45064,1.0,1.0
48874,1.0,1.0
53226,1.0,1.0
67425,1.0,1.0


In [29]:
getQuestionAt(53226)

(258, 23, 0)

In [30]:
titleId = 258
paragraphId = 23 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

Utrecht


**Paragraph**

Utrecht city has an active cultural life, and in the Netherlands is second only to Amsterdam. There are several theatres and theatre companies. The 1941 main city theatre was built by Dudok. Besides theatres there is a large number of cinemas including three arthouse cinemas. Utrecht is host to the international Early Music Festival (Festival Oude Muziek, for music before 1800) and the Netherlands Film Festival. The city has an important classical music hall Vredenburg (1979 by Herman Hertzberger). Its acoustics are considered among the best of the 20th-century original music halls.[citation needed] The original Vredenburg music hall has been redeveloped as part of the larger station area redevelopment plan and in 2014 has gained additional halls that allowed its merger with the rock club Tivoli and the SJU jazzpodium. There are several other venues for music throughout the city. Young musicians are educated in the conservatory, a department of the Utrecht School of the Arts. There is 

**Question**

Cultural life in Utrecht is second to 


**Answer**

0
Utrecht city has an active cultural life, and in the Netherlands is second only to Amsterdam


Strange question. The question words all appear in the sentence, but not in order. But the answer is the entire sentence, which obviously has needless information inside it. Looking further into it, the question is actually wrong, because it should state second *in the Netherlands*. This question should be scrapped...

In [31]:
getQuestionAt(67425)

(341, 25, 2)

In [32]:
titleId = 341
paragraphId = 25 
questionId = 2

showQuestion(titleId, paragraphId, questionId)

**Title**

Energy


**Paragraph**

Thermodynamics divides energy transformation into two kinds: reversible processes and irreversible processes. An irreversible process is one in which energy is dissipated (spread) into empty energy states available in a volume, from which it cannot be recovered into more concentrated forms (fewer quantum states), without degradation of even more energy. A reversible process is one in which this sort of dissipation does not happen. For example, conversion of energy from one type of potential field to another, is reversible, as in the pendulum system described above. In processes where heat is generated, quantum states of lower energy, present as possible excitations in fields between atoms, act as a reservoir for part of the energy, from which it cannot be recovered, in order to be converted with 100% efficiency into other forms of energy. In this case, the energy must partly stay as heat, and cannot be completely recovered as usable energy, except at the price of an increase in some ot

**Question**

A reversible process is one in which this does not happen.


**Answer**

406
dissipation


This is, basically, just the question I expect to generate. The answer is removed and the sentence is descriptive enough to fill in the missing word.
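
A minimal sketch of that kind of fill-in-the-blank generation, assuming the answer appears verbatim in the sentence. `make_cloze` and the blank marker are hypothetical names, not something defined elsewhere in the notebook.

```python
def make_cloze(sentence, answer, blank='_____'):
    """Replace the first occurrence of answer in sentence with a blank."""
    start = sentence.find(answer)
    if start == -1:
        return None  # the answer is not a literal substring of the sentence
    return sentence[:start] + blank + sentence[start + len(answer):]


sentence = 'A reversible process is one in which this sort of dissipation does not happen.'
print(make_cloze(sentence, 'dissipation'))
```

Only the first occurrence is blanked, so a sentence that repeats the answer would still give it away further on.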

### Summary

The assumption that the **question consists mostly of words from the sentence the answer is in** seems correct.

There are some obvious differences like:
- **Question-like words** are added - who, why, when...
- **Synonyms** are used instead of the words used in the sentence
- Changing the sentence into a question also changes the **tense** of the verb.
- In long sentences, only a **part of the sentence is used**. For example, when a sentence is split by commas, a comma often separates two logical statements.
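
The last point could be approximated by keeping only the comma-delimited part of the sentence that holds the answer. This is a rough sketch under the assumption that commas mark clause boundaries, which is often not true (lists, place names such as *Lourdes, France*). `clause_with_answer` is a hypothetical helper.

```python
def clause_with_answer(sentence, answer):
    """Return the comma-delimited clause of sentence that contains answer."""
    answer_start = sentence.find(answer)
    if answer_start == -1:
        return None  # the answer does not occur in the sentence
    answer_end = answer_start + len(answer)

    clause_start = 0
    for clause in sentence.split(','):
        clause_end = clause_start + len(clause)
        if clause_start <= answer_start and answer_end <= clause_end:
            return clause.strip()
        clause_start = clause_end + 1  # skip the comma itself
    # An answer that spans a comma does not fit in any single clause
    return sentence


sentence = 'Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection.'
print(clause_with_answer(sentence, 'the Grotto'))
```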

I also managed to find some outliers which turned out to be not-so-well asked questions.

## Answers

Let's see:
- Are all the answers phrases from the text?
- The type of the answers: numbers, dates, names, locations, close to the title
- Part of speech: verb, noun
- Answer length in words
- Words around the answer
- Answer location in the sentence: first word, last word

### Answers contained in the text

In [33]:
answersInText = 0
answersNotInText = 0

for titleId in range(len(df['data'])):
     for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
        paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
        for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
            answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']
            if (answer in paragraph):
                answersInText += 1
            else:
                answersNotInText += 1
                
printBold('Answers in text')
print(answersInText)
printBold('Answers not in text')
print(answersNotInText)

**Answers in text**

98169


**Answers not in text**

0


All the answers are phrases from the text. Seems like that has been a requirement from the start, since the answers also have an index indicating their start location in the paragraph.
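
The `in` check only shows that the answer text occurs somewhere in the paragraph. A stricter check, sketched below, is that the text sits exactly at the recorded start index. `answer_matches_offset` is a hypothetical helper, and the example paragraph is made up rather than taken from the dataset.

```python
def answer_matches_offset(paragraph, answer, answer_start):
    """True when the answer text sits exactly at answer_start in the paragraph."""
    return paragraph[answer_start:answer_start + len(answer)] == answer


# Made-up example; the offset is computed here, not read from the dataset
paragraph = 'The grotto is a replica. The original grotto is at Lourdes.'
answer = 'grotto'
second_occurrence = paragraph.index(answer, paragraph.index(answer) + 1)

print(answer in paragraph)
print(answer_matches_offset(paragraph, answer, second_occurrence))
print(answer_matches_offset(paragraph, answer, second_occurrence + 1))
```

This matters when the answer string occurs more than once, because only the offset says which occurrence, and therefore which sentence, the question refers to.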

### Extracting the answers

In [34]:
answers = []
sentences = []

for titleId in range(len(df['data'])):
    
     for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
        paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
        
        for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
            answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']
            answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']
            
            sentence = extractSentence(paragraph, answerStart)
            
            answers.append(answer)
            sentences.append(sentence)

In [35]:
answerTextsDf = pd.DataFrame(answers, columns=['answer'])
sentenceDf = pd.DataFrame(sentences, columns=['sentence'])

answersDf = pd.concat([answerTextsDf, sentenceDf], axis=1)
answersDf.head()

Unnamed: 0,answer,sentence
0,Saint Bernadette Soubirous,"It is a replica of the grotto at Lourdes, Fran..."
1,a copper statue of Christ,Immediately in front of the Main Building and ...
2,the Main Building,Next to the Main Building is the Basilica of t...
3,a Marian place of prayer and reflection,"Immediately behind the basilica is the Grotto,..."
4,a golden statue of the Virgin Mary,Atop the Main Building's gold dome is a golden...


### Answer word length

In [36]:
wordCount = []

for i in range(len(answersDf)):
    wordCount.append(len(tokenize.word_tokenize(answersDf.iloc[i]['answer'])))

In [37]:
answersDf = pd.concat([answersDf, pd.DataFrame(wordCount, columns=['wordCount'])], axis=1)

In [38]:
answersDf['wordCount'].describe()

count    98169.000000
mean         3.354511
std          3.731074
min          1.000000
25%          1.000000
50%          2.000000
75%          4.000000
max         46.000000
Name: wordCount, dtype: float64

In [41]:
answersDf['wordCount'].value_counts()

1     32161
2     25233
3     14350
4      7557
5      4654
6      3050
7      2222
8      1676
9      1206
10      974
11      755
12      653
13      565
14      462
15      406
16      313
18      275
17      269
19      243
20      191
21      182
23      138
22      132
25      120
24      101
26       78
28       58
27       57
29       29
30       18
31       12
32       11
33        6
38        2
34        2
35        2
36        2
37        2
42        1
46        1
Name: wordCount, dtype: int64

About 1/3 of the answers are single words, and about 73% are at most 3 words. Let's get an overview of the groups.
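
The shares can be checked with a bit of arithmetic on the counts printed above. The sketch copies the total and the first four rows of the `value_counts()` output by hand instead of recomputing them from `answersDf`.

```python
from itertools import accumulate

total = 98169
# Counts for 1-, 2-, 3- and 4-word answers, copied from the output above
counts = {1: 32161, 2: 25233, 3: 14350, 4: 7557}

# Running totals: how many answers have at most that many words
cumulative = dict(zip(counts, accumulate(counts.values())))
for words, running_total in cumulative.items():
    print(words, round(running_total / total, 3))
```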

In [75]:
answersDf[answersDf['wordCount'] == 1].sample(10, random_state=42)

Unnamed: 0,answer,sentence,wordCount
52642,mul,"For example, the name for the hanja 水 is 물 수 (...",1
79457,"11,000–16,000",The total Iranian casualties in the war were e...,1
88678,Saracens,"From these bases, the Normans eventually captu...",1
35390,microphone,The second controller lacked the START and SEL...,1
34469,rarely,"Since Elizabeth rarely gives interviews, littl...",1
57333,1991,"By the late 1980s, digital media, in the form ...",1
10684,ZigBee,Many newer control systems are using wireless ...,1
43080,1990s,Intergender singles bouts were first fought on...,1
43755,1870,"In 1870, after France attacked Prussia, Prussi...",1
65525,Champs-Élysées,"As of 2013 the City of Paris had 1,570 hotels ...",1


There seem to be a lot of years and some names.

The two-word answers seem to be dominated by names. There are also a lot of answers where one of the words isn't useful. Some could easily be removed, like *a* and *the*. *six years* and *three times* could also be turned into just 6 and 3. The *13.3%* seems to be just misplaced. Not sure if it's because of the *"."* or the *"%"*. 

In [72]:
answersDf[answersDf['wordCount'] == 2].sample(n=20, random_state=5)

Unnamed: 0,answer,sentence,wordCount
31777,six years,A peace agreement was signed in which John ret...,2
4799,Notre Dame,"In 2006, Lee was awarded an honorary doctorate...",2
21766,gamma-aminobutyric acid,The two neurotransmitters that are used most w...,2
28267,Thomas Aquinas,"During the Middle Ages, the Aristotelian view ...",2
7152,The Beatles,"The single, ""A Moment Like This"", went on to b...",2
26176,migratory species,The state is also a host to a large population...,2
33975,over five,"For example, over five columns of text were de...",2
4851,Mockingbird groupies,"Local residents call them ""Mockingbird groupie...",2
85540,Alan Rogerson,Former members Heather and Gary Botting compar...,2
77579,Sheffield United,The first ever Premier League goal was scored ...,2


The two-word answers seem to be dominated by names. There are also a lot of answers where one of the words isn't useful. Some could easily be removed, like *a*, *in* and *the*. *six years* could be turned into just 6. The *23.02%* seems to be just misplaced. Not sure if it's because of the *"."* or the *"%"*. 
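
A rough sketch of that clean-up, assuming a small hand-written list of filler words and number words. `simplify_answer` and both lookup tables are hypothetical, and the rule is deliberately naive.

```python
# Hand-written tables, assumed for illustration only
NUMBER_WORDS = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
                'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10}
LEADING_FILLERS = {'a', 'an', 'the', 'in'}


def simplify_answer(answer):
    """Drop a leading filler word and turn a leading number word into digits."""
    words = answer.split()
    if len(words) > 1 and words[0].lower() in LEADING_FILLERS:
        words = words[1:]
    if words and words[0].lower() in NUMBER_WORDS:
        # Keep only the number, e.g. 'six years' becomes '6'
        return str(NUMBER_WORDS[words[0].lower()])
    return ' '.join(words)


for answer in ['six years', 'the Roku player', 'Notre Dame', 'The Beatles']:
    print(answer, '->', simplify_answer(answer))
```

The last case shows why a blanket rule is risky: the article is part of the name in *The Beatles*, so names would need to be protected before stripping anything.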

In [76]:
answersDf[answersDf['wordCount'] == 3].sample(n=20, random_state=5)

Unnamed: 0,answer,sentence,wordCount
49157,Vasco da Gama,Portugal had during the 15th century – particu...,3
28486,Copa del Generalísimo,The 1960s saw the emergence of Josep Maria Fus...,3
91654,magnetic tape shortage,"During the following years, a magnetic tape sh...",3
95828,fear of betrayal,"In 1354, when Toghtogha led a large army to cr...",3
61090,Arab Umayyad Caliphate,"After conquering Persia, the Arab Umayyad Cali...",3
92998,keyed Northumbrian smallpipes,"John Dunn, inventor of keyed Northumbrian smal...",3
66068,The Weather Company,"On October 28, 2015, IBM announced its acquisi...",3
24543,10 February 1931,"The city that was later dubbed ""Lutyens' Delhi...",3
50145,political and moral,"He is without parallel in any age, excepting p...",3
84203,the Roku player,Google made YouTube available on the Roku play...,3


Again names, more institution names as well. 

In [78]:
answersDf[answersDf['wordCount'] == 5].sample(n=20, random_state=5)

Unnamed: 0,answer,sentence,wordCount
85209,conduct surveys of party colleagues,"For instance, to keep their party colleagues ""...",5
20338,Robert Bideleux and Ian Jeffries,Significant legislative changes in the status ...,5
22367,in excess of £3.3 billion,The total annual cost to support the defence e...,5
64526,Koninklijk Conservatorium Artesis Hogeschool A...,She is now also professor mandolin at the musi...,5
97009,end of World War I,"At the end of World War I, the Rhineland was s...",5
54390,partly cold-based and partly warm-based,Glaciers which are partly cold-based and partl...,5
90084,body and blood of Christ,Luther insisted on the Real Presence of the bo...,5
39942,the eastern waterfront in Buceo,"The Museo Naval, is located on the eastern wat...",5
95598,School of Social Service Administration,"In 1955, Eero Saarinen was contracted to devel...",5
58652,protruded from the road surface,"However, the company ceased trading in 1875 af...",5


As the word count increases, it seems harder to create deceptive incorrect answers. A viable option for some would be to mix the individual words, like:

*end of World War I -> start of World War I, end of World War II, start of World War II, end of Balkan Wars*...

*large tumour on her liver -> large tumor on her brain, large tumor on her lungs, large (some other medical term) on her liver*

Though this would become more difficult, because if we use 2 generated words, they must fit with each other as well as with the original words.
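
One way to sidestep that compatibility problem is to change only one word per candidate. The sketch below does that with a tiny substitution table. The table and `single_swap_distractors` are hypothetical, and a usable table would have to come from a lexicon or some similarity model.

```python
# Hypothetical substitution table, written by hand for this one example
ALTERNATIVES = {
    'end': ['start'],
    'I': ['II'],
}


def single_swap_distractors(answer, alternatives=ALTERNATIVES):
    """Generate candidate wrong answers by replacing exactly one word at a time."""
    words = answer.split()
    distractors = []
    for position, word in enumerate(words):
        for replacement in alternatives.get(word, []):
            candidate = words[:position] + [replacement] + words[position + 1:]
            distractors.append(' '.join(candidate))
    return distractors


print(single_swap_distractors('end of World War I'))
```

Combinations such as *start of World War II* are left out on purpose, since each candidate differs from the correct answer in a single word.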

Some of the answers look like logical phrases. For generating those, I would argue that a text-summarization approach would work. And with longer answers we could use **True/False** questions.

In [81]:
answersDf[answersDf['wordCount'] == 20].sample(n=10, random_state=5)

Unnamed: 0,answer,sentence,wordCount
14715,"to saturate broken (""dangling"") bonds of amorp...","Hydrogen is employed to saturate broken (""dang...",20
70704,"Bullied for being a Bedouin, he was proud of h...","Bullied for being a Bedouin, he was proud of h...",20
39422,"the Bill & Melinda Gates Foundation Trust, whi...","In October 2006, the Bill & Melinda Gates Foun...",20
21062,There are 64 possible codons (four possible nu...,":6 Additionally, a ""start codon"", and three ""s...",20
5882,"Francesinha (Frenchie) from Porto, and bifanas...",Typical fast food dishes include the Francesin...,20
34718,into four summaries that look specifically at ...,"However, results can be further simplified int...",20
26946,elected members and special office bearers suc...,The legislature consists of elected members an...,20
96100,support from China for a planned $2.5 billion ...,"Kenyatta was ""[a]ccompanied by 60 Kenyan busin...",20
77789,"On 26 December 1999, Chelsea became the first ...","On 26 December 1999, Chelsea became the first ...",20
97887,format of the congress and many specifics of t...,"Nevertheless, the format of the congress and m...",20


In [84]:
answersDf[answersDf['wordCount'] == 20].sample(n=20, random_state=5).iloc[8]['answer']

'On 26 December 1999, Chelsea became the first Premier League side to field an entirely foreign starting line-up,'

I would argue that several questions with single-word answers could be created from this sentence, like:
- In what year? - *1999*
- Which team? - *Chelsea*
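
The year case is easy to sketch with a regular expression. The pattern below only covers four-digit years from 1000 to 2099, which is an assumption, and `year_questions` is a hypothetical helper. Finding the team name would need something like named-entity recognition, which is not shown here.

```python
import re


def year_questions(sentence):
    """Build (cloze question, answer) pairs for every four-digit year in sentence."""
    pairs = []
    for match in re.finditer(r'\b(?:1\d{3}|20\d{2})\b', sentence):
        cloze = sentence[:match.start()] + '_____' + sentence[match.end():]
        pairs.append((cloze, match.group()))
    return pairs


sentence = ('On 26 December 1999, Chelsea became the first Premier League side '
            'to field an entirely foreign starting line-up,')
for cloze, answer in year_questions(sentence):
    print(answer, '<-', cloze)
```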

And our longest answer with 46 words

In [90]:
answersDf[answersDf['wordCount'] == 46].iloc[0]['answer']

'that the sudden shift of a huge quantity of water into the region could have relaxed the tension between the two sides of the fault, allowing them to move apart, and could have increased the direct pressure on it, causing a violent rupture'

*sudden shift of a huge quantity of water* seems like a good answer to the question *What could have relaxed the tension between the two sides?*

In [91]:
answersDf[answersDf['wordCount'] == 42].iloc[0]['answer']

'Hillary Clinton (2008), Howard Dean (2004), Gary Hart (1984 and 1988), Paul Tsongas (1992), Pat Robertson (1988) and Jerry Brown (1976, 1980, 1992).'

The second longest answer seems to be a sequence of correct answers to something like *Who has been a presidential candidate*. This could be great for questions with multiple correct answers as well as multiple incorrect ones.
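
For this particular answer, the individual names can be pulled apart with a regular expression. The pattern is tailored to the "Name (years)" layout seen here, so treat it as a sketch for this one string rather than a general splitter.

```python
import re

answer = ('Hillary Clinton (2008), Howard Dean (2004), Gary Hart (1984 and 1988), '
          'Paul Tsongas (1992), Pat Robertson (1988) and Jerry Brown (1976, 1980, 1992).')

# Runs of capitalised words that are directly followed by an opening parenthesis
names = [name.strip() for name in re.findall(r'((?:[A-Z][a-z]+ )+)\(', answer)]
print(names)
```

Each extracted name could then serve as one of several correct options.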