### Downloading SQuAD-v2 from source

In [None]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

In [2]:
import json
# with open('dev-v2.0.json') as f:
#   squad_dev = json.load(f)
with open('train-v2.0.json') as f:
  squad_train = json.load(f)

### Fine-tune with questions related to Buddhism
Find the article titled 'Buddhism' among the titles in the training set.

In [3]:
title_of_interest = [(i, j['title']) 
                     for i, j in enumerate(squad_train['data']) 
                     if j['title'] == 'Buddhism']

In [4]:
title_of_interest

[(11, 'Buddhism')]

The Buddhism article is at index 11. Take a look at its first paragraph entry.

In [8]:
squad_train['data'][title_of_interest[0][0]]['paragraphs'][0]

{'qas': [{'question': 'What type of religion is Buddhism?',
   'id': '56cff91b234ae51400d9c1bb',
   'answers': [{'text': 'nontheistic', 'answer_start': 25}],
   'is_impossible': False},
  {'question': 'What are the practices of Buddhism based on?',
   'id': '56cff91b234ae51400d9c1bc',
   'answers': [{'text': 'teachings attributed to Gautama Buddha',
     'answer_start': 202}],
   'is_impossible': False},
  {'question': 'Where did the Buddha live?',
   'id': '56cff91b234ae51400d9c1bd',
   'answers': [{'text': 'present-day Nepal', 'answer_start': 402}],
   'is_impossible': False},
  {'question': 'How do Buddhists believe their suffering can be ended?',
   'id': '56cff91b234ae51400d9c1be',
   'answers': [{'text': 'through the direct understanding and perception of dependent origination and the Four Noble Truths',
     'answer_start': 706}],
   'is_impossible': False},
  {'question': 'What did the Buddha teach should be given up to end suffering?',
   'id': '56cff91b234ae51400d9c1bf',
   '

### Organize the `data.csv` 
Below are the requirements for the fine-tuning dataset, quoted from the instruction page.
>Input: A directory containing a 'data.csv' file.
>- The first column of the 'data.csv' should have a question.
>- The second column should have the corresponding context.
>- The third column should have the integer character starting position for the answer in the context.
>- The fourth column should have the integer character ending position for the answer in the context.

The following nested for loop goes through each context, question, and answer, and collects one row per answer.

In [37]:
rows = []
for paragraph in squad_train['data'][title_of_interest[0][0]]['paragraphs']:
    context = paragraph['context']
    for qas in paragraph['qas']:
        question = qas['question']
        for answer in qas['answers']:
            answer_text = answer['text']
            answer_start = answer['answer_start']
            # Inclusive end offset: the index of the answer's last character
            answer_end = answer_start + len(answer_text) - 1
            rows.append([question, context, answer_start, answer_end])
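
As a sanity check on the offsets, the same loop can be run over a small hand-made paragraph. This is a minimal sketch: `toy_paragraph`, `toy_rows`, and `skipped` are hypothetical names, and the text and offsets are made up for illustration rather than taken from SQuAD. It assumes the end offset is inclusive, as computed in the loop above, and it also shows that a question with an empty `answers` list (as unanswerable SQuAD-v2 questions have) produces no row.

```python
# Toy paragraph in the SQuAD v2 layout; text and offsets are made up for illustration.
toy_paragraph = {
    "context": "Buddhism is a nontheistic religion.",
    "qas": [
        {
            "question": "What type of religion is Buddhism?",
            "answers": [{"text": "nontheistic", "answer_start": 14}],
            "is_impossible": False,
        },
        {
            "question": "Who founded the religion in 1900?",
            "answers": [],  # an unanswerable question carries no answers
            "is_impossible": True,
        },
    ],
}

toy_rows = []
skipped = []
for qa in toy_paragraph["qas"]:
    if not qa["answers"]:
        # The nested loop would silently skip this question; record it instead.
        skipped.append(qa["question"])
        continue
    for answer in qa["answers"]:
        start = answer["answer_start"]
        end = start + len(answer["text"]) - 1  # inclusive end offset
        # Slicing up to end + 1 should reproduce the answer text exactly.
        assert toy_paragraph["context"][start:end + 1] == answer["text"]
        toy_rows.append([qa["question"], toy_paragraph["context"], start, end])

print(toy_rows[0][2:], skipped)
```

If the assertion fails on real data, the offsets and the context are out of sync, which is worth catching before fine-tuning.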

Save `rows` to a CSV file.

In [38]:
import csv

with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
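
Because the contexts contain commas and double quotes, it can be reassuring to confirm that the CSV round-trips cleanly into four columns. The sketch below writes to an in-memory buffer instead of `data.csv` so it does not touch the real file; `sample_rows` is a hypothetical one-row stand-in for `rows`, with made-up text and offsets.

```python
import csv
import io

# Hypothetical stand-in for `rows`: question, context, start, inclusive end.
sample_rows = [
    ['What did she say?', 'She said, "hello" to him.', 10, 16],
]

# Write with the default csv dialect, as in the cell above.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(sample_rows)

# Read back and check the four-column layout and integer offsets.
buffer.seek(0)
parsed = []
for row in csv.reader(buffer):
    if len(row) != 4:
        raise ValueError(f"expected 4 columns, got {len(row)}")
    question, context, start, end = row
    parsed.append([question, context, int(start), int(end)])

print(parsed == sample_rows)
```

The `csv` module quotes fields that contain commas or quotes and doubles embedded quotes, which is why the context in the `head` output appears wrapped in `"` with `""` inside.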

In [39]:
!head -n2 data.csv

What type of religion is Buddhism?,"Buddhism /ˈbudɪzəm/ is a nontheistic religion[note 1] or philosophy (Sanskrit: धर्म dharma; Pali: धम्म dhamma) that encompasses a variety of traditions, beliefs and spiritual practices largely based on teachings attributed to Gautama Buddha, commonly known as the Buddha (""the awakened one""). According to Buddhist tradition, the Buddha lived and taught in the eastern part of the Indian subcontinent, present-day Nepal sometime between the 6th and 4th centuries BCE.[note 1] He is recognized by Buddhists as an awakened or enlightened teacher who shared his insights to help sentient beings end their suffering through the elimination of ignorance and craving. Buddhists believe that this is accomplished through the direct understanding and perception of dependent origination and the Four Noble Truths.",25,35
What are the practices of Buddhism based on?,"Buddhism /ˈbudɪzəm/ is a nontheistic religion[note 1] or philosophy (Sanskrit: धर्म dharma; Pali: धम्म 

### Uploading `data.csv` to an S3 bucket

In [40]:
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()

In [43]:
sagemaker.s3.S3Uploader.upload(local_path='data.csv',
                               desired_s3_uri=f's3://{bucket}/sagemaker-studio-book/chapter08/buddhism',
                               sagemaker_session=sess)

's3://sagemaker-us-west-2-552106442228/sagemaker-studio-book/chapter08/buddhism/data.csv'
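
Since the requirement above asks for a directory containing `data.csv`, the directory URI (rather than the file URI returned by the upload) is likely what a later step needs. A minimal sketch of deriving it with the standard library follows; the variable names are hypothetical, and the URI is the one returned above.

```python
from urllib.parse import urlparse

# URI returned by the upload cell above.
s3_uri = "s3://sagemaker-us-west-2-552106442228/sagemaker-studio-book/chapter08/buddhism/data.csv"

parsed_uri = urlparse(s3_uri)
bucket_name = parsed_uri.netloc           # bucket is the "host" part of the URI
object_key = parsed_uri.path.lstrip("/")  # key is the path without the leading slash

# Directory that contains data.csv: drop the final path component.
training_dir_uri = s3_uri.rsplit("/", 1)[0]

print(bucket_name)
print(object_key)
print(training_dir_uri)
```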