# Data exploration - SQuAD v1

In [14]:
#Imports
import pandas as pd
from IPython.display import Markdown, display, clear_output
from nltk import tokenize
from scipy import stats
from IPython.core.debugger import set_trace

### Pretty printing

In [15]:
def printBold(string):
    display(Markdown('**' + string + '**'))
    
#def printColor():
#     display(Markdown('<span style="color:blue">blue</span>'))

## Reading the datasets

Since we aren't actually answering the questions, which is the dataset's intended use, we'll merge the train and dev datasets into one. The test dataset is probably hidden, since there's a competition for it.

In [16]:
# 'columns' is the documented orient value (the original 'column' is not a listed option)
train = pd.read_json('../data/squad-v1/train-v1.1.json', orient='columns')
dev = pd.read_json('../data/squad-v1/dev-v1.1.json', orient='columns')

In [17]:
df = pd.concat([train, dev], ignore_index=True)

In [18]:
df.head()

| | data | version |
|---|---|---|
| 0 | {'title': 'University_of_Notre_Dame', 'paragra... | 1.1 |
| 1 | {'title': 'Beyoncé', 'paragraphs': [{'context'... | 1.1 |
| 2 | {'title': 'Montana', 'paragraphs': [{'context'... | 1.1 |
| 3 | {'title': 'Genocide', 'paragraphs': [{'context... | 1.1 |
| 4 | {'title': 'Antibiotics', 'paragraphs': [{'cont... | 1.1 |


Let's look at what we've got.

In [59]:
def showQuestion(titleId, paragraphId, questionId):
    # Walk the nested SQuAD structure: article -> paragraph -> question -> answer
    article = df['data'][titleId]
    paragraph = article['paragraphs'][paragraphId]
    qa = paragraph['qas'][questionId]
    firstAnswer = qa['answers'][0]  # only the first listed answer is shown

    printBold('Title')
    print(article['title'])
    printBold('Paragraph')
    print(paragraph['context'])
    printBold('Question')
    print(qa['question'])
    printBold('Answer')
    print(firstAnswer['answer_start'])
    print(firstAnswer['text'])

In [60]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

515
Saint Bernadette Soubirous


## Dataset size

In [21]:
titlesCount = len(df['data'])
totalParagraphsCount = 0
totalQuestionsCount = 0

for titleId in range(titlesCount):
    paragraphsCount = len(df['data'][titleId]['paragraphs'])
    totalParagraphsCount += paragraphsCount
    
    for paragraphId in range(paragraphsCount):
        questionsCount = len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])
        
        totalQuestionsCount += questionsCount
        
print('Titles', titlesCount)
print('Paragraphs', totalParagraphsCount)
print('Questions', totalQuestionsCount)

Titles 490
Paragraphs 20963
Questions 98169


## Titles

In [22]:
titles = []
for titleId in range(len(df['data'])):
    titles.append(df['data'][titleId]['title'])
    
titles

['University_of_Notre_Dame',
 'Beyoncé',
 'Montana',
 'Genocide',
 'Antibiotics',
 'Frédéric_Chopin',
 'Sino-Tibetan_relations_during_the_Ming_dynasty',
 'IPod',
 'The_Legend_of_Zelda:_Twilight_Princess',
 'Spectre_(2015_film)',
 '2008_Sichuan_earthquake',
 'New_York_City',
 'To_Kill_a_Mockingbird',
 'Solar_energy',
 'Tajikistan',
 'Anthropology',
 'Portugal',
 'Kanye_West',
 'Buddhism',
 'American_Idol',
 'Dog',
 '2008_Summer_Olympics_torch_relay',
 'Alfred_North_Whitehead',
 'Financial_crisis_of_2007%E2%80%9308',
 'Saint_Barth%C3%A9lemy',
 'Genome',
 'Comprehensive_school',
 'Republic_of_the_Congo',
 'Prime_minister',
 'Institute_of_technology',
 'Wayback_Machine',
 'Dutch_Republic',
 'Symbiosis',
 'Canadian_Armed_Forces',
 'Cardinal_(Catholicism)',
 'Iranian_languages',
 'Lighting',
 'Separation_of_powers_under_the_United_States_Constitution',
 'Architecture',
 'Human_Development_Index',
 'Southern_Europe',
 'BBC_Television',
 'Arnold_Schwarzenegger',
 'Plymouth',
 'Heresy',
 'Warsa

Titles are pretty random. There seem to be a lot of locations, like countries and cities, but not nearly enough to justify splitting the dataset.
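Some titles in the list above are percent-encoded (for example `Financial_crisis_of_2007%E2%80%9308` and `Saint_Barth%C3%A9lemy`). If readable titles are ever needed, a minimal sketch could decode them with the standard library. The helper name `readableTitle` is hypothetical and not part of the notebook.

```python
from urllib.parse import unquote

def readableTitle(title):
    # Decode %XX escapes, then turn underscores into spaces
    return unquote(title).replace('_', ' ')

print(readableTitle('Financial_crisis_of_2007%E2%80%9308'))
print(readableTitle('Saint_Barth%C3%A9lemy'))
```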

## Questions

One of our main assumptions is that the sentence that contains the answer could be turned into a question just by removing the answer from it. Let's see how much of that is true for the questions in this dataset.

In [23]:
titleId = 0
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Answer**

Saint Bernadette Soubirous


In [25]:
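def extractSentence(paragraph, answerStart):
    # Split the paragraph into sentences with NLTK's sentence tokenizer
    sentences = tokenize.sent_tokenize(paragraph)
    
    # Running character offset of the current sentence within the paragraph.
    # Assumes consecutive sentences are separated by exactly one character.
    sentenceStart = 0
    
    for sentence in sentences:
        # The first sentence that ends at or after the answer offset is returned
        if (sentenceStart + len(sentence) >= answerStart):
            return sentence
        
        sentenceStart += len(sentence) + 1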

In [26]:
paragrapgh = df['data'][0]['paragraphs'][0]['context']
answerStart = df['data'][0]['paragraphs'][0]['qas'][0]['answers'][0]['answer_start']

sentence = extractSentence(paragrapgh, answerStart)
print(sentence)

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.
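The sentence lookup above tracks offsets by adding one character per sentence boundary. A hedged alternative, sketched below, locates each sentence in the paragraph with `str.find`, so the amount of whitespace between sentences no longer matters. The function name `findAnswerSentence` is hypothetical, and the sentence list is passed in by hand here to keep the sketch free of NLTK; in the notebook it would presumably come from `tokenize.sent_tokenize`.

```python
def findAnswerSentence(paragraph, sentences, answerStart):
    # `sentences` is assumed to be the tokenizer's output for `paragraph`, in order
    searchFrom = 0
    for sentence in sentences:
        start = paragraph.find(sentence, searchFrom)
        if start == -1:
            continue  # sentence text not found verbatim; skip it
        end = start + len(sentence)
        if start <= answerStart < end:
            return sentence
        searchFrom = end
    return None  # the offset did not fall inside any sentence

# Made-up paragraph with irregular spacing between sentences
paragraph = 'First one.   Second one here.\nThird.'
sentences = ['First one.', 'Second one here.', 'Third.']
print(findAnswerSentence(paragraph, sentences, paragraph.index('here')))
```

Returning `None` when nothing matches keeps the failure visible instead of silently picking a neighbouring sentence.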


In [27]:
def containedInText(text, question):
    
    questionWords = tokenize.word_tokenize(question.lower())
    textWords = tokenize.word_tokenize(text.lower())
    wordsContained = 0

    for questionWord in questionWords:
        for textWord in textWords:
            if (questionWord == textWord):
                wordsContained += 1
                break

    return wordsContained / len(questionWords)

In [28]:
question =  df['data'][0]['paragraphs'][0]['qas'][0]['question']

contained = containedInText(sentence, question)

In [29]:
printBold('Question')
print(question)
printBold('Sentence')
print(sentence)
printBold("Contained")
print(contained)

**Question**

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


**Sentence**

It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.


**Contained**

0.6428571428571429
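To see where this number comes from, here is the same computation with the tokens written out by hand. The two token lists are an assumption about what the tokenizer yields for this question and sentence (lowercased, punctuation as separate tokens); the set lookup is just a shorter equivalent of the nested loop.

```python
questionTokens = ['to', 'whom', 'did', 'the', 'virgin', 'mary', 'allegedly',
                  'appear', 'in', '1858', 'in', 'lourdes', 'france', '?']
sentenceTokens = ['it', 'is', 'a', 'replica', 'of', 'the', 'grotto', 'at',
                  'lourdes', ',', 'france', 'where', 'the', 'virgin', 'mary',
                  'reputedly', 'appeared', 'to', 'saint', 'bernadette',
                  'soubirous', 'in', '1858', '.']

sentenceVocabulary = set(sentenceTokens)
# Every question token counts, so a repeated word like 'in' is counted twice
matched = [token for token in questionTokens if token in sentenceVocabulary]
missed = [token for token in questionTokens if token not in sentenceVocabulary]
score = len(matched) / len(questionTokens)

print(matched)
print(missed)
print(score)
```

Under that assumption, 9 of the 14 question tokens are found, and the five misses are *whom*, *did*, *allegedly*, *appear* and the question mark.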


I wouldn't expect 100% containment, simply because the questions contain **question-like words** such as *Why*, *Who*, *Whom*, *What*.

In this example we also see that the word *appear* occurs in the original sentence, but in the **past tense** (*appeared*). We could take care of that by comparing the **stems** of the words, but I think it's better to measure the least imaginative way of forming questions.
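For reference, a stem-based variant could look roughly like the sketch below. It is not what the notebook measures. `crudeStem` is a deliberately simplistic, hypothetical suffix stripper standing in for a real stemmer (NLTK's `PorterStemmer` would be the more realistic choice), and the regex tokenizer is a stand-in for `word_tokenize`.

```python
import re

def crudeStem(word):
    # Strip a few common English suffixes; a real stemmer handles far more cases
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def stemmedContainment(text, question):
    textStems = {crudeStem(w) for w in re.findall(r'\w+', text.lower())}
    questionStems = [crudeStem(w) for w in re.findall(r'\w+', question.lower())]
    if not questionStems:
        return 0.0
    # Share of question stems that also occur among the text stems
    return sum(stem in textStems for stem in questionStems) / len(questionStems)

sentence = 'the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858'
question = 'To whom did the Virgin Mary allegedly appear in 1858?'
print(stemmedContainment(sentence, question))
```

With stems, *appear* now matches *appeared*, while *whom*, *did* and *allegedly* still do not.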

We are also counting some **common words like *to, the, in***, which could occur anywhere in the sentence, but again we want to measure the least creative questions.
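If we ever want a score that ignores question-like words and common words, a sketch along these lines could work. The two word lists are small, hand-picked illustrations (not a standard stop-word list), `contentContainment` is a hypothetical name, and the regex tokenizer is a stand-in for `word_tokenize`.

```python
import re

# Hand-picked lists for illustration only
QUESTION_WORDS = {'who', 'whom', 'what', 'when', 'where', 'why', 'how', 'which'}
COMMON_WORDS = {'to', 'the', 'in', 'a', 'an', 'of', 'is', 'did', 'do', 'does'}

def contentContainment(text, question):
    ignored = QUESTION_WORDS | COMMON_WORDS
    textWords = set(re.findall(r'\w+', text.lower()))
    # Keep only the question words that carry content
    questionWords = [w for w in re.findall(r'\w+', question.lower())
                     if w not in ignored]
    if not questionWords:
        return 0.0
    return sum(w in textWords for w in questionWords) / len(questionWords)

sentence = ('It is a replica of the grotto at Lourdes, France where the Virgin Mary '
            'reputedly appeared to Saint Bernadette Soubirous in 1858.')
question = 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
print(contentContainment(sentence, question))
```

Here only *allegedly* and *appear* remain unmatched among the seven content words, so the synonym and tense effects become easier to isolate.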

In this sentence *(damn, that was a good example)* we also see that the question uses the word *allegedly*, which is a **synonym** of *reputedly* in the sentence. That could be nice for question forming, but I think it's overkill.

We also see that the question actually covers the **words around the answer, rather than the entire sentence**, which is a definite must-do when we form our questions.
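One simple way to get at the words around the answer is a fixed window of words on either side of it. The sketch below is only an illustration: `wordsAroundAnswer` and its `windowSize` parameter are hypothetical, and whitespace splitting is a crude stand-in for proper tokenization.

```python
def wordsAroundAnswer(sentence, answer, windowSize=6):
    words = sentence.split()
    answerWords = answer.split()
    for i in range(len(words) - len(answerWords) + 1):
        # Compare while ignoring punctuation glued to the ends of a word
        candidate = [w.strip('.,;:!?"') for w in words[i:i + len(answerWords)]]
        if candidate == answerWords:
            start = max(0, i - windowSize)
            end = min(len(words), i + len(answerWords) + windowSize)
            return ' '.join(words[start:end])
    return sentence  # answer not found: fall back to the whole sentence

sentence = ('It is a replica of the grotto at Lourdes, France where the Virgin Mary '
            'reputedly appeared to Saint Bernadette Soubirous in 1858.')
print(wordsAroundAnswer(sentence, 'Saint Bernadette Soubirous'))
```

For this sentence the window drops the leading *It is a replica of the grotto at Lourdes, France where* part and keeps the clause the question was actually built from.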

Let's see what the score is on all of the questions. I'm also curious to see the score on the entire paragraph.

First, a small helper for pretty-printing the progress. It may come in handy in the future.

In [131]:
#Printing the percentage done
def printPercentage(currentStep, maxStep):
    stepSize = maxStep / 100
    
    if (int(currentStep / stepSize) > ((currentStep - 1) / stepSize)):
        clear_output()
        print('{}%'.format(int(currentStep / stepSize)))

In [132]:
sentenceScore = []
paragrapghScore = []

#For each title
titlesCount = len(df['data'])
for titleId in range(titlesCount):
    printPercentage(titleId, titlesCount)
    
    #For each paragraph
    for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
        paragrapgh = df['data'][titleId]['paragraphs'][paragraphId]['context']
        
        #For each question
        for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
            question = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['question']
            answerStart = df['data'][titleId]['paragraphs'][paragraphId]['qas'][questionId]['answers'][0]['answer_start']
            sentence = extractSentence(paragrapgh, answerStart)
          
            sentenceScore.append(containedInText(sentence, question))
            paragrapghScore.append(containedInText(paragrapgh, question))            

99%


In [109]:
sentenceScoreDf = pd.DataFrame(sentenceScore, columns=['sentence'])
paragrapghScoreDf = pd.DataFrame(paragrapghScore, columns=['paragraph'])

questionContainmentDf = pd.concat([sentenceScoreDf, paragrapghScoreDf], axis=1)
questionContainmentDf.describe()

| | sentence | paragraph |
|---|---|---|
| count | 98169.0 | 98169.0 |
| mean | 0.463937 | 0.582157 |
| std | 0.190377 | 0.159055 |
| min | 0.0 | 0.0 |
| 25% | 0.333333 | 0.5 |
| 50% | 0.461538 | 0.6 |
| 75% | 0.6 | 0.7 |
| max | 1.0 | 1.0 |


I would argue that almost half of the question words being contained in the answer sentence (mean 0.46) is a pretty good result.

As expected, containment within the entire paragraph is higher on average (mean 0.58).

I do wonder about those questions that are 100% contained in the answer sentence.

In [110]:
questionContainmentDf.head(10)

| | sentence | paragraph |
|---|---|---|
| 0 | 0.642857 | 0.571429 |
| 1 | 0.636364 | 0.636364 |
| 2 | 0.533333 | 0.6 |
| 3 | 0.375 | 0.5 |
| 4 | 0.333333 | 0.416667 |
| 5 | 0.272727 | 0.636364 |
| 6 | 0.3 | 0.8 |
| 7 | 0.363636 | 0.727273 |
| 8 | 0.0 | 0.545455 |
| 9 | 0.266667 | 0.733333 |


In [111]:
def getQuestionAt(index):
    currentIndex = 0
    
    for titleId in range(len(df['data'])):
        for paragraphId in range(len(df['data'][titleId]['paragraphs'])):
            for questionId in range(len(df['data'][titleId]['paragraphs'][paragraphId]['qas'])):
                if (currentIndex == index):
                    return titleId, paragraphId, questionId
                currentIndex += 1
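`getQuestionAt` walks the whole nested structure on every call. If many lookups are needed, a hedged alternative is to build the list of `(titleId, paragraphId, questionId)` tuples once and index into it. `buildQuestionIndex` and `toyData` are hypothetical names; the toy structure only mimics the nesting of `df['data']`.

```python
def buildQuestionIndex(data):
    # One (titleId, paragraphId, questionId) tuple per question, in traversal order
    return [
        (titleId, paragraphId, questionId)
        for titleId, article in enumerate(data)
        for paragraphId, paragraph in enumerate(article['paragraphs'])
        for questionId, _ in enumerate(paragraph['qas'])
    ]

# Tiny made-up structure with the same nesting as the dataset
toyData = [
    {'title': 'A', 'paragraphs': [{'context': '', 'qas': [{}, {}]},
                                  {'context': '', 'qas': [{}]}]},
    {'title': 'B', 'paragraphs': [{'context': '', 'qas': [{}, {}, {}]}]},
]

questionIndex = buildQuestionIndex(toyData)
print(questionIndex[3])
```

In the notebook this would presumably be called as `buildQuestionIndex(df['data'])`, after which `questionIndex[8]` plays the role of `getQuestionAt(8)`.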

Let's see question #8 which has 0 containment in the answer sentence. 

In [112]:
getQuestionAt(8)

(0, 1, 3)

In [113]:
titleId = 0
paragraphId = 1 
questionId = 3

showQuestion(titleId, paragraphId, questionId)

**Title**

University_of_Notre_Dame


**Paragraph**

As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, when some students believed that The Observer began to show a conservative bias, a lib

**Question**

How many student news papers are found at Notre Dame?


**Answer**

126
three


The question is actually formed from the previous sentence.

In [116]:
questionContainmentDf[questionContainmentDf['paragraph'] == 0].head()

| | sentence | paragraph |
|---|---|---|
| 269 | 0.0 | 0.0 |
| 363 | 0.0 | 0.0 |
| 505 | 0.0 | 0.0 |
| 2781 | 0.0 | 0.0 |
| 3678 | 0.0 | 0.0 |


In [117]:
getQuestionAt(269)

(1, 0, 0)

In [118]:
titleId = 1
paragraphId = 0 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

Beyoncé


**Paragraph**

Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".


**Question**

When did Beyonce start becoming popular?


**Answer**

269
in the late 1990s


A **synonym** case: instead of *rose to fame*, *start becoming popular* is used.
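Part of the zero score here is also down to spelling: the question writes *Beyonce* without the accent, while the paragraph has *Beyoncé*, so the exact-match comparison misses the name too. If that kind of mismatch should not count, accents could be folded before comparing. This is a sketch with the standard library; `foldAccents` is a hypothetical helper.

```python
import unicodedata

def foldAccents(text):
    # Decompose characters, then drop the combining marks (for example é -> e)
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(foldAccents('Beyoncé').lower() == 'beyonce')
```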

In [120]:
getQuestionAt(505)

(1, 18, 6)

In [121]:
titleId = 1
paragraphId = 18 
questionId = 6

showQuestion(titleId, paragraphId, questionId)

**Title**

Beyoncé


**Paragraph**

In 2011, documents obtained by WikiLeaks revealed that Beyoncé was one of many entertainers who performed for the family of Libyan ruler Muammar Gaddafi. Rolling Stone reported that the music industry was urging them to return the money they earned for the concerts; a spokesperson for Beyoncé later confirmed to The Huffington Post that she donated the money to the Clinton Bush Haiti Fund. Later that year she became the first solo female artist to headline the main Pyramid stage at the 2011 Glastonbury Festival in over twenty years, and was named the highest-paid performer in the world per minute.


**Question**

When did this leak happen?


**Answer**

3
2011


That's just a bad question. It only makes sense when read together with the paragraph.

In [123]:
questionContainmentDf[questionContainmentDf['sentence'] == 1]

| | sentence | paragraph |
|---|---|---|
| 21911 | 1.0 | 1.0 |
| 39394 | 1.0 | 1.0 |
| 45064 | 1.0 | 1.0 |
| 48874 | 1.0 | 1.0 |
| 53226 | 1.0 | 1.0 |
| 67425 | 1.0 | 1.0 |


In [126]:
getQuestionAt(53226)

(258, 23, 0)

In [127]:
titleId = 258
paragraphId = 23 
questionId = 0

showQuestion(titleId, paragraphId, questionId)

**Title**

Utrecht


**Paragraph**

Utrecht city has an active cultural life, and in the Netherlands is second only to Amsterdam. There are several theatres and theatre companies. The 1941 main city theatre was built by Dudok. Besides theatres there is a large number of cinemas including three arthouse cinemas. Utrecht is host to the international Early Music Festival (Festival Oude Muziek, for music before 1800) and the Netherlands Film Festival. The city has an important classical music hall Vredenburg (1979 by Herman Hertzberger). Its acoustics are considered among the best of the 20th-century original music halls.[citation needed] The original Vredenburg music hall has been redeveloped as part of the larger station area redevelopment plan and in 2014 has gained additional halls that allowed its merger with the rock club Tivoli and the SJU jazzpodium. There are several other venues for music throughout the city. Young musicians are educated in the conservatory, a department of the Utrecht School of the Arts. There is 

**Question**

Cultural life in Utrecht is second to 


**Answer**

0
Utrecht city has an active cultural life, and in the Netherlands is second only to Amsterdam


Strange question. The question words all appear in the sentence, but not in the same order. The answer, however, is the entire sentence, which obviously carries needless information. Looking further into it, the question is actually wrong, because it should say second *in the Netherlands*. This question should be scrapped...

In [129]:
getQuestionAt(67425)

(341, 25, 2)

In [130]:
titleId = 341
paragraphId = 25 
questionId = 2

showQuestion(titleId, paragraphId, questionId)

**Title**

Energy


**Paragraph**

Thermodynamics divides energy transformation into two kinds: reversible processes and irreversible processes. An irreversible process is one in which energy is dissipated (spread) into empty energy states available in a volume, from which it cannot be recovered into more concentrated forms (fewer quantum states), without degradation of even more energy. A reversible process is one in which this sort of dissipation does not happen. For example, conversion of energy from one type of potential field to another, is reversible, as in the pendulum system described above. In processes where heat is generated, quantum states of lower energy, present as possible excitations in fields between atoms, act as a reservoir for part of the energy, from which it cannot be recovered, in order to be converted with 100% efficiency into other forms of energy. In this case, the energy must partly stay as heat, and cannot be completely recovered as usable energy, except at the price of an increase in some ot

**Question**

A reversible process is one in which this does not happen.


**Answer**

406
dissipation


This is, basically, just the question I expect to generate. The answer is removed and the sentence is descriptive enough to fill in the missing word.
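The idea of removing the answer from its sentence can be sketched in a few lines. This is only an illustration of the approach, not a finished generator; `makeClozeQuestion` and the `blank` placeholder are hypothetical choices.

```python
def makeClozeQuestion(sentence, answer, blank='_____'):
    # Replace only the first occurrence of the answer with a blank
    if answer not in sentence:
        return None
    return sentence.replace(answer, blank, 1)

sentence = 'A reversible process is one in which this sort of dissipation does not happen.'
print(makeClozeQuestion(sentence, 'dissipation'))
```

Since the dataset also stores `answer_start`, a version that blanks the answer by character offset would avoid hitting an earlier, unrelated occurrence of the same text.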

### Summary

The assumption that the **question consists mostly of words from the sentence the answer is in** seems correct.

There are some obvious differences:
- **Question-like words** are added: who, why, when...
- **Synonyms** are used instead of the words in the sentence.
- Changing the sentence to a question also changes the **tense** of the verb.
- In long sentences, only a **part of the sentence is used**. For example, when a sentence is separated by commas, each comma may divide two logical statements.

I also managed to find some outliers which turned out to be not-so-well asked questions.