## Content
1. [Data loading](#1.-Data-loading)  
2. [Structure of the loaded data](#2.-Structure-of-the-loaded-data)  
    2.1 [General overview](#2.1-General-overview)  
    2.2 [Conclusion](#2.2-Conclusion)

## 1. Data loading

In [1]:
import json
from pprint import pprint

IN_PATH = '../data/squad/'

def load_data(filename):
    """Load a SQuAD JSON file and return the parsed dict."""
    with open(filename) as f:
        return json.load(f)

train = load_data(IN_PATH + 'train-v1.1.json')
dev = load_data(IN_PATH + 'dev-v1.1.json')

## 2. Structure of the loaded data
### 2.1 General overview
**A general overview of how the data is organized. The train and dev sets share the same structure, so only the train set is explored here.**

In [2]:
print(type(train))

<type 'dict'>


In [3]:
print(train.keys())

[u'version', u'data']


In [4]:
print(train['version'])

1.1


In [5]:
print(type(train['data']))
print(len(train['data']))

<type 'list'>
442


In [6]:
print(type(train['data'][0]))
print(dir(train['data'][0]))

<type 'dict'>
['__class__', '__cmp__', '__contains__', '__delattr__', '__delitem__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'has_key', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values', 'viewitems', 'viewkeys', 'viewvalues']


In [7]:
print(train['data'][0].keys())

[u'paragraphs', u'title']


##### Let's look at the first article in the train set. An article contains a title and a list of paragraphs.

In [8]:
paragraphs = train['data'][0]['paragraphs']
title = train['data'][0]['title']
print('First article')
print('Title: {}'.format(title))
print('Paragraphs[0]: {}'.format(paragraphs[0]))

First article
Title: University_of_Notre_Dame
Paragraphs[0]: {u'qas': [{u'question': u'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', u'id': u'5733be284776f41900661182', u'answers': [{u'text': u'Saint Bernadette Soubirous', u'answer_start': 515}]}, {u'question': u'What is in front of the Notre Dame Main Building?', u'id': u'5733be284776f4190066117f', u'answers': [{u'text': u'a copper statue of Christ', u'answer_start': 188}]}, {u'question': u'The Basilica of the Sacred heart at Notre Dame is beside to which structure?', u'id': u'5733be284776f41900661180', u'answers': [{u'text': u'the Main Building', u'answer_start': 279}]}, {u'question': u'What is the Grotto at Notre Dame?', u'id': u'5733be284776f41900661181', u'answers': [{u'text': u'a Marian place of prayer and reflection', u'answer_start': 381}]}, {u'question': u'What sits on top of the Main Building at Notre Dame?', u'id': u'5733be284776f4190066117e', u'answers': [{u'text': u'a golden statue of the Virgin

In [9]:
print(type(paragraphs[0]))
print(paragraphs[0].keys())

<type 'dict'>
[u'qas', u'context']
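
Since every paragraph carries a `context` and a `qas` list, the size of an article can be summarized with a few counts. The following is a minimal sketch, not part of the original notebook: `summarize_article` and `toy_article` are hypothetical names, and the toy data only mimics the keys seen above rather than quoting the real dataset.

```python
def summarize_article(article):
    """Return basic counts for one article dict (keys: 'title', 'paragraphs')."""
    paragraphs = article['paragraphs']
    return {
        'title': article['title'],
        'n_paragraphs': len(paragraphs),
        # Each paragraph holds its questions under the 'qas' key.
        'n_questions': sum(len(p['qas']) for p in paragraphs),
    }


# Toy article with the same keys as the real data (contents are made up).
toy_article = {
    'title': 'Toy_Article',
    'paragraphs': [
        {
            'context': 'The sky is blue.',
            'qas': [{
                'id': 'q1',
                'question': 'What color is the sky?',
                'answers': [{'text': 'blue', 'answer_start': 11}],
            }],
        },
        {'context': 'Grass is green.', 'qas': []},
    ],
}

print(summarize_article(toy_article))
```

With the real data loaded, the same helper could be called as `summarize_article(train['data'][0])`.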


In [10]:
print(type(paragraphs[0]['qas']))
print(len(paragraphs[0]['qas']))
print(paragraphs[0]['qas'][0])
print(type(paragraphs[0]['qas'][0]))
print(type(paragraphs[0]['qas'][0]['answers']))

<type 'list'>
5
{u'question': u'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', u'id': u'5733be284776f41900661182', u'answers': [{u'text': u'Saint Bernadette Soubirous', u'answer_start': 515}]}
<type 'dict'>
<type 'list'>
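
Each answer stores its `text` together with an integer `answer_start`. Assuming `answer_start` is a character offset into the paragraph's `context` (the outputs above suggest this, but the notebook does not verify it), the assumption can be checked by slicing. The helper name `answer_matches_context` and the toy strings below are hypothetical.

```python
def answer_matches_context(context, answer):
    """Check that answer['text'] occurs in context at offset answer['answer_start'].

    Assumes 'answer_start' is a character offset into the context string.
    """
    start = answer['answer_start']
    text = answer['text']
    return context[start:start + len(text)] == text


# Toy example (not taken from the dataset).
toy_context = 'The Main Building has a gold dome.'
toy_answer = {'text': 'a gold dome', 'answer_start': 22}

print(answer_matches_context(toy_context, toy_answer))
```

On the real data, the check could be run over every answer in `paragraphs[0]['qas']` against `paragraphs[0]['context']`.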


### 2.2 Conclusion
**The structure of the data can be summarized as follows:**
<pre>       
            dataset   
            (dict)   
            /    \  
           /      \  
      'version'   'data'
      (string)    (list)
                  # each element in the list represents an article (dict)  
                     |
                     |
                  articles
                   (dict)
                   /    \
                  /      \
             'title'   'paragraphs'  
             (string)     (list)
                          # each element in the list is a dict with 'qas' and 'context' keys
                             |
                             |
                            pair
                           (dict)
                          /      \
                         /        \
                        /          \
                       /            \
                 'context'          'qas'
                 (string)           (list)
                  # context of      # each element in the list is a dict with 3 keys
                    the questions      |
                    and answers        |
                                       |
                                 question-answer
                                     (dict)
                                    /   |   \
                                   /    |    \
                                  /     |     \
                               'id' 'question' 'answers'
                             (string)(string)   (list)
                                                # each element in the list is an answer dict with 
                                                'text' of the answer and 'answer_start'
                                                  |
                                                  |
                                                answer
                                                (dict)
                                                /     \
                                               /       \
                                        'answer_start' 'text'
                                            (int)     (string)  
                                               
                                                 
</pre>
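
The tree above can be walked with four nested loops, one per list level. Below is a minimal sketch that flattens the nesting into one record per answer. The function name `iter_examples`, the record field names, and the toy dataset are illustrative assumptions, not part of the original notebook.

```python
def iter_examples(dataset):
    """Yield one flat dict per answer, following the nesting
    dataset -> articles -> paragraphs -> question-answer pairs -> answers."""
    for article in dataset['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                for answer in qa['answers']:
                    yield {
                        'title': article['title'],
                        'context': paragraph['context'],
                        'id': qa['id'],
                        'question': qa['question'],
                        'answer_text': answer['text'],
                        'answer_start': answer['answer_start'],
                    }


# Toy dataset with the same shape as the diagram (contents are made up).
toy_dataset = {
    'version': '1.1',
    'data': [{
        'title': 'Toy_Article',
        'paragraphs': [{
            'context': 'The sky is blue.',
            'qas': [{
                'id': 'q1',
                'question': 'What color is the sky?',
                'answers': [{'text': 'blue', 'answer_start': 11}],
            }],
        }],
    }],
}

examples = list(iter_examples(toy_dataset))
print(len(examples))
print(examples[0]['question'])
```

A generator keeps memory use low when only one pass over the data is needed; calling `list(iter_examples(train))` would materialize all records at once.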