# Data exploration - SQuAD v1

In [47]:
#Imports
import pandas as pd
from IPython.display import Markdown, display
from nltk import tokenize
from scipy import stats

### Pretty printing

In [2]:
def printBold(string):
    display(Markdown('**' + string + '**'))
    
#def printColor():
#     display(Markdown('<span style="color:blue">blue</span>'))

## Reading the datasets

Since we aren't using the dataset for its intended purpose, question answering, we'll merge the train and dev datasets into one. The test dataset is probably hidden, since there's a competition for it.

In [3]:
# 'columns' is the documented orient value for column-oriented JSON
train = pd.read_json('../data/squad-v1/train-v1.1.json', orient='columns')
dev = pd.read_json('../data/squad-v1/dev-v1.1.json', orient='columns')

In [4]:
df = pd.concat([train, dev], ignore_index=True)

In [5]:
df.head()

| | data | version |
|---|---|---|
| 0 | {'title': 'University_of_Notre_Dame', 'paragra... | 1.1 |
| 1 | {'title': 'Beyoncé', 'paragraphs': [{'context'... | 1.1 |
| 2 | {'title': 'Montana', 'paragraphs': [{'context'... | 1.1 |
| 3 | {'title': 'Genocide', 'paragraphs': [{'context... | 1.1 |
| 4 | {'title': 'Antibiotics', 'paragraphs': [{'cont... | 1.1 |


Let's look at what we've got.

In [6]:
def showQuestion(titleId, paragraphId, questionId):

    title = df['data'][titleId]['title']
    paragraph = df['data'][titleId]['paragraphs'][paragraphId]['context']
    question = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['question']
    answer = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['text']

    printBold('Title')
    print(title)
    printBold('Paragraph')
    print(paragraph)
    printBold('Question')
    print(question)
    printBold('Answer')
    print(answer)

In [7]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

Saint Bernadette Soubirous


## Dataset size

In [8]:
titlesCount = len(df['data'])
totalParagraphsCount = 0
totalQuestionsCount = 0

for titleId in range(titlesCount):
    paragraphsCount = len(df['data'][titleId]['paragraphs'])
    totalParagraphsCount += paragraphsCount
    
    for paragraphId in range(paragraphsCount):
        questionsCount = len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])
        
        totalQuestionsCount += questionsCount
        
print('Titles', titlesCount)
print('Paragraphs', totalParagraphsCount)
print('Questions', totalQuestionsCount)

Titles 490
Paragraphs 20963
Questions 98169
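
The counting loop above walks three nested levels: titles, paragraphs, questions. As a minimal sketch, the same traversal can be wrapped in a generator so later cells don't have to repeat the index arithmetic. The helper name `iter_questions` and the sample data are hypothetical; only the key layout (`title`, `paragraphs`, `context`, `qas`, `answers`) is taken from the dataset.

```python
from typing import Iterable, Iterator, Tuple

def iter_questions(articles: Iterable[dict]) -> Iterator[Tuple[str, str, dict]]:
    """Yield (title, context, qa) for every question in SQuAD-style articles."""
    for article in articles:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                yield article['title'], paragraph['context'], qa

# Tiny hand-made sample in the SQuAD v1.1 layout (not real dataset content)
sample_articles = [
    {'title': 'Example_A', 'paragraphs': [
        {'context': 'Cats purr. Dogs bark.', 'qas': [
            {'question': 'What do cats do?',
             'answers': [{'text': 'purr', 'answer_start': 5}]},
            {'question': 'What do dogs do?',
             'answers': [{'text': 'bark', 'answer_start': 16}]},
        ]},
    ]},
    {'title': 'Example_B', 'paragraphs': [
        {'context': 'Water is wet.', 'qas': [
            {'question': 'What is wet?',
             'answers': [{'text': 'Water', 'answer_start': 0}]},
        ]},
    ]},
]

rows = list(iter_questions(sample_articles))
titles_count = len(sample_articles)
paragraphs_count = sum(len(a['paragraphs']) for a in sample_articles)
questions_count = len(rows)
print(titles_count, paragraphs_count, questions_count)
```

In the notebook this would presumably be called as `iter_questions(df['data'])`, since iterating a pandas Series yields its values.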


## Titles

In [9]:
titles = []
for titleId in range(len(df['data'])):
    titles.append(df['data'][titleId]['title'])
    
titles

['University_of_Notre_Dame',
 'Beyoncé',
 'Montana',
 'Genocide',
 'Antibiotics',
 'Frédéric_Chopin',
 'Sino-Tibetan_relations_during_the_Ming_dynasty',
 'IPod',
 'The_Legend_of_Zelda:_Twilight_Princess',
 'Spectre_(2015_film)',
 '2008_Sichuan_earthquake',
 'New_York_City',
 'To_Kill_a_Mockingbird',
 'Solar_energy',
 'Tajikistan',
 'Anthropology',
 'Portugal',
 'Kanye_West',
 'Buddhism',
 'American_Idol',
 'Dog',
 '2008_Summer_Olympics_torch_relay',
 'Alfred_North_Whitehead',
 'Financial_crisis_of_2007%E2%80%9308',
 'Saint_Barth%C3%A9lemy',
 'Genome',
 'Comprehensive_school',
 'Republic_of_the_Congo',
 'Prime_minister',
 'Institute_of_technology',
 'Wayback_Machine',
 'Dutch_Republic',
 'Symbiosis',
 'Canadian_Armed_Forces',
 'Cardinal_(Catholicism)',
 'Iranian_languages',
 'Lighting',
 'Separation_of_powers_under_the_United_States_Constitution',
 'Architecture',
 'Human_Development_Index',
 'Southern_Europe',
 'BBC_Television',
 'Arnold_Schwarzenegger',
 'Plymouth',
 'Heresy',
 'Warsa

Titles are pretty random. There seem to be a lot of locations, such as countries and cities, but not nearly enough to justify splitting the dataset.

## Questions

One of our main assumptions is that the sentence that contains the answer could be turned into a question just by removing the answer from it. Let's see how much of that is true for the questions in this dataset.

In [10]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

Saint Bernadette Soubirous


In [11]:
#def containedInText()
    #for each word in the question check if it's contained in the sentence
    
#def extractSentence()

#or the paragraph

#return fraction for containment in the range [0;1]

In [35]:
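def extractSentence(paragraph, answerStart):
    # Split the paragraph into sentences
    sentences = tokenize.sent_tokenize(paragraph)
    
    # Estimated character offset at which the current sentence begins
    sentenceStart = 0
    
    for sentence in sentences:
        # Return the first sentence whose estimated end reaches the answer start
        if sentenceStart + len(sentence) >= answerStart:
            return sentence
        
        # Assumes exactly one separator character between sentences
        sentenceStart += len(sentence) + 1

    # Fall back to the last sentence if the estimate never reached the answer
    return sentences[-1] if sentences else paragraph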

In [36]:
paragraph = df['data'][0]['paragraphs'][0]['context']
answerStart = df['data'][0]['paragraphs'][0]['qas'][0]['answers'][0]['answer_start']

sentence = extractSentence(paragraph, answerStart)
print(sentence)

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.
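
`extractSentence` estimates each sentence's position by adding up sentence lengths plus one separator character. If a paragraph has more than one whitespace character between sentences, that estimate can drift. A minimal sketch of an alternative, assuming the sentences are already tokenized, looks each sentence up in the paragraph to get its real offset. The name `locate_sentence` and the sample paragraph are hypothetical, not part of the notebook.

```python
from typing import List, Optional

def locate_sentence(paragraph: str, sentences: List[str], answer_start: int) -> Optional[str]:
    """Return the first sentence whose real character span ends after answer_start."""
    cursor = 0
    for sentence in sentences:
        # Find the sentence's true offset, searching from the previous sentence's end
        start = paragraph.find(sentence, cursor)
        if start == -1:
            continue  # the tokenizer altered the text; skip this sentence
        end = start + len(sentence)
        if answer_start < end:
            return sentence
        cursor = end
    return None

# Hand-made paragraph with six spaces after the first sentence
sample_paragraph = 'Aaa.      Bbb ccc. Ddd.'
sample_sentences = ['Aaa.', 'Bbb ccc.', 'Ddd.']
answer_start = sample_paragraph.index('ccc')

found = locate_sentence(sample_paragraph, sample_sentences, answer_start)
print(found)
```

On this sample the cumulative-length estimate would place the end of `'Bbb ccc.'` before the answer and move on to `'Ddd.'`, while the offset lookup stays on the sentence that actually contains it.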


In [38]:
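def containedInText(text, question):
    # Fraction of question tokens (punctuation included) that also occur in the text
    questionWords = tokenize.word_tokenize(question.lower())
    # A set makes each membership check constant time
    textWords = set(tokenize.word_tokenize(text.lower()))

    # Guard against an empty question to avoid dividing by zero
    if not questionWords:
        return 0.0

    wordsContained = sum(1 for questionWord in questionWords if questionWord in textWords)

    return wordsContained / len(questionWords)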

In [39]:
question =  df['data'][0]['paragraphs'][0]['qas'][0]['question']

contained = containedInText(sentence, question)

In [16]:
printBold('Question')
print(question)
printBold('Sentence')
print(sentence)
printBold("Contained")
print(contained)

**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Sentence**

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.


**Contained**

0.6428571428571429


I wouldn't expect 100% containment, simply because the questions will contain **question-like words** like *Why*, *Who*, *Whom*, *What*.
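
To see where the 0.64 comes from, here is a small sketch that lists which question tokens were matched and which were missed. It uses a regex tokenizer as a rough stand-in for `word_tokenize` (an assumption; NLTK can split other texts differently), and `simple_tokens` is a hypothetical helper.

```python
import re

def simple_tokens(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

question = 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
sentence = ('It is a replica of the grotto at Lourdes, France where the Virgin Mary '
            'reputedly appeared to Saint Bernadette Soubirous in 1858.')

question_tokens = simple_tokens(question)
sentence_vocab = set(simple_tokens(sentence))

matched = [t for t in question_tokens if t in sentence_vocab]
missed = [t for t in question_tokens if t not in sentence_vocab]

print(len(matched), '/', len(question_tokens))  # 9 / 14
print(missed)  # ['whom', 'did', 'allegedly', 'appear', '?']
```

Nine of the fourteen tokens match, which gives 9 / 14 ≈ 0.643. Note that the question mark counts as a token in the denominator.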

In this example we also see that the word *appear* is contained in the original sentence, but in the **past tense**. We could take care of that by comparing the **stems** of the words, but I think it's better to measure the least imaginative way of forming questions.

We are also counting some **common words like *to, the, in***, which could appear at different places in the sentence, but again we want to measure the least creative questions.
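
For comparison, here is a hedged sketch of what the score would look like if question words, common words and punctuation were ignored. The `IGNORED` list is hand-picked for this one example and is purely an assumption; the scoring in this notebook deliberately keeps these tokens.

```python
import re

# Hypothetical, hand-picked ignore list; a real run might use a full stop-word list
IGNORED = {'who', 'whom', 'what', 'why', 'when', 'where', 'which', 'how',
           'did', 'do', 'does', 'is', 'was', 'to', 'the', 'in', 'of', 'a', 'an'}

def content_tokens(text):
    """Lowercased word tokens with punctuation and ignored words dropped."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in IGNORED]

def content_containment(text, question):
    """Fraction of the question's content tokens that also occur in the text."""
    question_tokens = content_tokens(question)
    if not question_tokens:
        return 0.0
    vocab = set(content_tokens(text))
    return sum(t in vocab for t in question_tokens) / len(question_tokens)

question = 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
sentence = ('It is a replica of the grotto at Lourdes, France where the Virgin Mary '
            'reputedly appeared to Saint Bernadette Soubirous in 1858.')

score = content_containment(sentence, question)
print(content_tokens(question))
print(score)
```

With this list, seven content tokens remain in the question and five of them (*virgin, mary, 1858, lourdes, france*) occur in the sentence.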

In this sentence *(damn, that was a good example)* we also see that the question uses the word *allegedly*, which is a **synonym** of *reputedly* in the sentence. That could be nice for question forming, but I think it's overkill.

We also see that the question actually encompasses the **words around the answer, rather than the entire sentence**, which is a definite must-do when we form our questions.
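
One way to picture "the words around the answer" is a fixed-size window of words on each side of the answer span. This is only a sketch: the helper `words_around_answer` and the window size of 5 are assumptions, not something the dataset or this notebook prescribes.

```python
def words_around_answer(sentence, answer, window=5):
    """Return the answer with up to `window` words of context on each side."""
    start = sentence.find(answer)
    if start == -1:
        return None  # the answer does not occur verbatim in the sentence

    before = sentence[:start].split()
    after = sentence[start + len(answer):].split()

    # max(...) keeps window=0 from selecting the whole list via before[-0:]
    left = before[max(len(before) - window, 0):]
    right = after[:window]
    return ' '.join(left + [answer] + right)

sentence = ('It is a replica of the grotto at Lourdes, France where the Virgin Mary '
            'reputedly appeared to Saint Bernadette Soubirous in 1858.')
answer = 'Saint Bernadette Soubirous'

context_window = words_around_answer(sentence, answer, window=5)
print(context_window)
# Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.
```

For this example the window covers most of the words the question reuses, although *Lourdes* and *France* fall outside it.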

Let's see what the score is across all of the questions. I'm also curious to see the score against the entire paragraph.

In [184]:
from IPython.core.debugger import set_trace

In [40]:
sentenceScore = []
paragrapghScore = []


#For each title
for titleId, article in enumerate(df['data']):
    print(titleId)  # progress indicator
    #For each paragraph
    for paragraphData in article['paragraphs']:
        paragraph = paragraphData['context']
        #For each question
        for qa in paragraphData['qas']:
            question = qa['question']
            #Only the first reference answer is used
            answerStart = qa['answers'][0]['answer_start']
            sentence = extractSentence(paragraph, answerStart)

            sentenceScore.append(containedInText(sentence, question))
            paragrapghScore.append(containedInText(paragraph, question))


In [46]:
stats.describe(sentenceScore)

DescribeResult(nobs=98169, minmax=(0.0, 1.0), mean=0.4639372890613787, variance=0.036243536466669425, skewness=-0.1260186187793197, kurtosis=-0.49616712681619424)

In [48]:
stats.describe(paragrapghScore)

DescribeResult(nobs=98169, minmax=(0.0, 1.0), mean=0.5821566303382768, variance=0.025298457602856428, skewness=-0.37555623686086964, kurtosis=-0.07449784279571592)

On average, about 46% of a question's tokens also appear in the sentence that contains the answer. I would argue that almost half is a pretty good result.

As expected, containment within the entire paragraph is higher, with a mean of about 0.58.
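
To put the two means in context, the reported variances can be turned into standard deviations. This small sketch only reuses the numbers printed by `stats.describe`; nothing is recomputed from the data.

```python
import math

# Values copied from the two stats.describe outputs
sentence_mean, sentence_var = 0.4639372890613787, 0.036243536466669425
paragraph_mean, paragraph_var = 0.5821566303382768, 0.025298457602856428

sentence_std = math.sqrt(sentence_var)
paragraph_std = math.sqrt(paragraph_var)
mean_gap = paragraph_mean - sentence_mean

print(round(sentence_std, 2), round(paragraph_std, 2), round(mean_gap, 3))
# 0.19 0.16 0.118
```

The gap between the means (about 0.12) is smaller than either standard deviation, so the two score distributions overlap considerably.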

I do wonder about those questions that are 100% contained in the answer sentence.
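
A possible way to inspect them, sketched under the assumption that the question texts are collected in a list parallel to the scores (the scoring loop in this notebook stores only the scores). The helper `fully_contained` and the sample values are hypothetical.

```python
def fully_contained(scores, questions, threshold=1.0):
    """Return (index, question) pairs whose containment score reaches the threshold."""
    return [(i, q) for i, (s, q) in enumerate(zip(scores, questions)) if s >= threshold]

# Made-up scores and questions, for illustration only
sample_scores = [0.64, 1.0, 0.25, 1.0]
sample_questions = ['question a', 'question b', 'question c', 'question d']

hits = fully_contained(sample_scores, sample_questions)
print(hits)  # [(1, 'question b'), (3, 'question d')]
```

Lowering `threshold` (for example to 0.9) would also surface near-verbatim questions.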