<a href="https://colab.research.google.com/github/Karthick47v2/question-generator/blob/main/data_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Download dataset

In [None]:
# SQuAD dataset
!wget https://data.deepai.org/squad1.1.zip
!unzip squad1.1.zip

# SciQ dataset
!wget https://ai2-public-datasets.s3.amazonaws.com/sciq/SciQ.zip
!unzip SciQ.zip

### Import libraries

In [2]:
import json
import pandas as pd

### Extract data from json

In [3]:
def parse_json(filepath):
  """Load json file from storage.

  Args:
    filepath (str): Path of json file.

  Returns:
    list(dict(obj)): List of nested dictionaries.
  """
  data = {}

  with open(filepath) as file:
    data = json.load(file)

  return data

***SQuAD***

- SQuAD dataset doesn't contain null values, so, no need to check.
- We are only interested in generating questions from simple answers. So answers with more than 5 words will be filtered out.

In [4]:
def extract_from_squad(data):
  """Extract data from SQuAD dataset.

  Args:
    data (list(dict(obj))): List of nested dictionaries.

  Returns:
    tuple(list(str), list(str)): tuple of lists of model input and output. 
  """
  source = []
  target = []

  for topic in data['data']:
    for dict_set in topic['paragraphs']:
      for qna_set in dict_set['qas']:
        if len(qna_set['answers'][0]['text'].split()) <= 5:
          source.append(f"context: {dict_set['context']} answer: {qna_set['answers'][0]['text']}")
          target.append(qna_set['question'])

  return source, target

***SciQ***

- SCiQ dataset contains empty string for some values of `support` (mentioned in dataset readme.txt). So, that will be filtered out.
- We are only interested in generating questions from simple answers. So answers with more than 5 words will be filtered out.

In [5]:
def extract_from_sciq(data):
  """Extract data from SciQ dataset.

  Args:
    data (list(dict(obj))): List of nested dictionaries.

  Returns:
    tuple(list(str), list(str)): tuple of lists of model input and output. 
  """
  source = []
  target = []

  for dict_set in data:
    if dict_set['support'] == "":
      continue
    if len(dict_set['correct_answer']) <= 5:
      continue
    else:
      source.append(f"context: {dict_set['support']} answer: {dict_set['correct_answer']}")
      target.append(dict_set['question'])

  return source, target

In [6]:
source_text = []
target_text = []

dataset = 'sciq' # squad or sciq

if dataset == 'squad':
  data = parse_json('train-v1.1.json')
  source_text, target_text = extract_from_squad(data)

else:
  for filename in ['train', 'test', 'valid']:
    data = parse_json(f"SciQ dataset-2 3/{filename}.json")
    source, target = extract_from_sciq(data)

    source_text.extend(source)
    target_text.extend(target)

***SQuAD***
- Total data: 87,599
- Filtered data: 76,135


***SciQ***
- Totla data: 13,679
- Filtered data: 10,640

In [7]:
df = pd.DataFrame({'source_text': source_text, 'target_text': target_text})

### Review

In [None]:
df.head()

### Export as *.csv and upload to GDrive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [10]:
output = f"{'SQuAD' if dataset == 'squad' else 'SciQ'}-processed.csv"

df.to_csv(output, index=False)
!mv $output gdrive/MyDrive/mcq-gen