This notebook was developed using the `Python 3 (Data Science)` kernel on an `ml.t3.medium` instance.
### Downloading SQuAD-v2 from source

In [None]:
import os
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'sagemaker-studio-book/chapter08'

local_prefix = 'buddhism'
os.makedirs(local_prefix, exist_ok=True)

In [None]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -P {local_prefix}

In [None]:
import json
with open(f'{local_prefix}/train-v2.0.json') as f:
    squad_train = json.load(f)

### Fine-tune with questions related to Buddhism
Find the article titled "Buddhism" among the dataset's articles and record its index.

In [None]:
title_of_interest = [(i, j['title']) 
                     for i, j in enumerate(squad_train['data']) 
                     if j['title'] == 'Buddhism']

In [None]:
title_of_interest

The Buddhism article is located at index 11. Take a look at its first paragraph, which is a dictionary holding a context and its questions.

In [None]:
squad_train['data'][title_of_interest[0][0]]['paragraphs'][0]
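
To make the layout of one article easier to see, here is a minimal sketch on a made-up article that mimics the SQuAD v2 structure (`context`, `qas`, `answers`, `is_impossible`). The toy text and the helper name `count_questions` are illustrative assumptions, not part of the dataset or of any library.

```python
# Toy article mimicking the SQuAD v2 layout (contents are made up for illustration).
toy_article = {
    "title": "Example",
    "paragraphs": [
        {
            "context": "The Eightfold Path has eight parts.",
            "qas": [
                {
                    "id": "q1",
                    "question": "How many parts does the Eightfold Path have?",
                    "is_impossible": False,
                    "answers": [{"text": "eight", "answer_start": 23}],
                },
                {
                    "id": "q2",
                    "question": "Who wrote the Eightfold Path?",
                    "is_impossible": True,
                    "answers": [],  # unanswerable questions carry no answers
                },
            ],
        }
    ],
}


def count_questions(article):
    """Hypothetical helper: count answerable and unanswerable questions in one article."""
    answerable = unanswerable = 0
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            if qa.get("is_impossible", False):
                unanswerable += 1
            else:
                answerable += 1
    return answerable, unanswerable


print(count_questions(toy_article))
```

In this notebook the same helper could be called as `count_questions(squad_train['data'][title_of_interest[0][0]])` to see how many Buddhism questions have no answer.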

### Organize the `data.csv` 
Below are the requirements for the fine-tuning dataset, quoted from the instruction page.
>Input: A directory containing a 'data.csv' file.
>- The first column of the 'data.csv' should have a question.
>- The second column should have the corresponding context.
>- The third column should have the integer character starting position for the answer in the context.
>- The fourth column should have the integer character ending position for the answer in the context.

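The following nested loop walks through every context, question, and answer, and appends one row per answer. Questions whose `answers` list is empty (the unanswerable questions in SQuAD v2) produce no rows.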

In [None]:
rows = []
for paragraph in squad_train['data'][title_of_interest[0][0]]['paragraphs']:
    context = paragraph['context']
    for qas in paragraph['qas']:
        question = qas['question']
        for answer in qas['answers']:
            answer_text = answer['text']
            answer_start = answer['answer_start']
            # Index of the last character of the answer (inclusive end)
            answer_end = answer_start + len(answer_text) - 1
            rows.append([question, context, answer_start, answer_end])
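
The loop above computes an inclusive end index. As a quick sanity check, the sketch below applies the same arithmetic to a made-up context and confirms that slicing with `end + 1` recovers the answer text. The helper `answer_span` and the sample sentence are assumptions for illustration. The quoted requirement does not say whether the ending position is inclusive or exclusive, so it is worth checking against the instruction page.

```python
def answer_span(answer_text, answer_start):
    """Hypothetical helper: return (start, end) with an inclusive end index."""
    answer_end = answer_start + len(answer_text) - 1
    return answer_start, answer_end


# Made-up context and answer, used only to illustrate the index arithmetic.
context = "Gautama Buddha taught in the eastern part of ancient India."
answer_text = "ancient India"

start, end = answer_span(answer_text, context.index(answer_text))

# With an inclusive end, the Python slice needs end + 1 to cover the full answer.
assert context[start:end + 1] == answer_text
print(start, end, context[start:end + 1])
```

If the training script turned out to expect an exclusive end, dropping the `- 1` in the loop would be the only change needed.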

Save `rows` to `data.csv`.

In [None]:
import csv

with open(f'{local_prefix}/data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
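
Contexts often contain commas and quotation marks, so it can be reassuring to read the file back and check that each row still has four columns. The sketch below does this on a made-up row written to a temporary directory. The row content is an assumption for illustration. Note that `csv.reader` returns every field as a string, so the positions need converting back to `int`.

```python
import csv
import os
import tempfile

# Made-up row: question, context (with a comma and quotes), start, inclusive end.
toy_rows = [
    ["Who taught the Dharma?", 'He said, "Buddha taught the Dharma."', 10, 15],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")

    # Write the rows the same way as the cell above.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(toy_rows)

    # Read them back; the csv module handles quoting of commas and quotes.
    with open(path, newline="", encoding="utf-8") as f:
        read_back = list(csv.reader(f))

question, context, start, end = read_back[0]
start, end = int(start), int(end)  # fields come back as strings

assert len(read_back[0]) == 4
print(context[start:end + 1])
```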

In [None]:
!head -n2 {local_prefix}/data.csv

### Uploading `data.csv` to an S3 bucket

In [None]:
sagemaker.s3.S3Uploader.upload(local_path=f'{local_prefix}/data.csv',
                               desired_s3_uri=f's3://{bucket}/{prefix}/{local_prefix}',
                               sagemaker_session=sess)