## Dataset creation
This dataset is built from the QASC dataset, which can be downloaded [here](https://leaderboard.allenai.org/qasc/submissions/get-started).  
I have already sent the dataset along, so you don't need to download it.

A bit later I found out that this dataset is also available on [HF](https://huggingface.co/datasets/qasc).


**Note:** I used only dev.jsonl and train.jsonl files because test.jsonl doesn't have answers.

### Creation
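1. Process the data and bring it into a convenient format
2. Use the OpenAI API (GPT-3 model `text-babbage-001`) to generate topics for each question
    - Add `OPENAI_API_KEY` (your OpenAI API key) to the environment
3. Explode the questions so that each row holds one question and one topic
4. Some augmentation could be done here (I skipped it)
5. Save the dataset on HF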

The final dataset has about **40k** samples.

This part could definitely be improved by removing the dependency on the local dataset files.

In [None]:
path = 'QASC_Dataset/' 
files = ['train.jsonl', 'dev.jsonl']

In [None]:
import pandas as pd

cols = ['formatted_question', 'combinedfact', 'answerKey']
train = pd.read_json(path + files[0], lines=True)[cols]
dev = pd.read_json(path + files[1], lines=True)[cols]

In [None]:
# merge train and dev
train = pd.concat([train, dev], ignore_index=True)

In [None]:
len(train)

9060

In [None]:
train = train.apply(lambda x: x.astype(str).str.lower())

In [None]:
from pprint import pprint
import numpy as np

n = np.random.randint(0, len(train))
pprint(train.iloc[n].to_dict())

{'answerKey': 'e',
 'combinedfact': 'overexposure to vibrating matter can damage hearing',
 'formatted_question': 'what can overexposure to vibrating matter cause? (a) '
                       'symptoms (b) hypothyroidism (c) pollution (d) '
                       'hyperthyroidism (e) damaged hearing (f) electrical '
                       'energy (g) relieve pain (h) decrease stamina'}


**The most interesting part of the dataset creation is the use of an OpenAI GPT-3 model (`text-babbage-001`)!**

In [None]:
from dotenv import load_dotenv
import openai
import time
import numpy as np
import os
load_dotenv()


openai.api_key = os.getenv("OPENAI_API_KEY")

class TopicMaker:
  def __init__(self, query_file, max_queries=3):
    with open(query_file, 'r') as f:
      query = f.read().lower()
    self.query = query
    self.max_queries = max_queries

  def gen_query(self, question):
    new_line = f'question: {question}\ntopics:'
    return self.query + new_line
  
  def send_query(self, query):
    response = None
    for _ in range(self.max_queries):
      try:
        response = openai.Completion.create(
          model="text-babbage-001",
          prompt=query,
          temperature=0,
          max_tokens=60,
          top_p=1.0,
          frequency_penalty=0.5,
          presence_penalty=0.0
        )
        # sleep a random 1-4 seconds between requests (randint's upper bound is exclusive)
        time.sleep(np.random.randint(1, 5))
        break
      except Exception as e:
        print('Error', e)
      
    return response
  
  def get_topics(self, response):
    if response is None:
      return []
    return response['choices'][0]['text'].strip().lower().split(', ')
  
  def __call__(self, question):
    query = self.gen_query(question)
    response = self.send_query(query)
    topics = self.get_topics(response)
    return topics
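
`get_topics` splits the completion on `', '`, so a trailing comma, extra whitespace, or text generated after the topics line would end up in the topic list. Below is a minimal sketch of a more forgiving parser. The name `parse_topics` and the choice to keep only the first line of the completion are my assumptions, not part of the original notebook.

```python
def parse_topics(text):
    """Turn a comma-separated completion into a clean list of topics.

    Hypothetical helper: keeps only the first non-empty line of the
    completion, lowercases it, and drops empty or repeated entries.
    """
    stripped = text.strip()
    if not stripped:
        return []
    first_line = stripped.splitlines()[0]
    topics = []
    for part in first_line.lower().split(','):
        topic = part.strip()
        # skip empty pieces (e.g. from a trailing comma) and duplicates
        if topic and topic not in topics:
            topics.append(topic)
    return topics


print(parse_topics(' Physics, clouds,  meteorology, '))
```

Something like this could replace the `split(', ')` call inside `get_topics`, at the cost of silently discarding anything the model writes after the first line.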




In [None]:
tm = TopicMaker('query.txt')
q = train['formatted_question'][0]
print('Question:', q)
print('Query:')
print(tm.gen_query(q))
print('Response:')
tm(train['formatted_question'][0])

Question: what type of water formation is formed by clouds? (a) pearls (b) streams (c) shells (d) diamonds (e) rain (f) beads (g) cooled (h) liquid
Query:
determine general topics for each question. don't add trailing comma at the end!

# example
question:  what are used for protection by fish? (a) scales (b) fins (c) streams. (d) coral (e) gills (f) collagen (g) mussels (h) whiskers
topics: biology, anatomy, marine biology, zoology, evolution, fish anatomy, aquatic ecosystems, natural selection

question:  what are pangolins covered in? (a) tunicates (b) echinoids (c) shells (d) exoskeleton (e) blastoids (f) barrel-shaped (g) protection (h) white
topics: biology, anatomy, zoology, endangered species, wildlife conservation, animal behavior, mammals

question:  what are covered with protection? (a) apples (b) trees (c) coral (d) clams (e) roses (f) wings (g) hats (h) fish
topics: biology, anatomy, botany, botanical morphology, horticulture, agriculture, plant physiology, plant reproduct

['physics', 'clouds', 'atmospheric science', 'meteorology']

**I don't recommend running this cell. It takes a long time and makes a lot of API calls.**

If you still want to run it, change `readyToRun` to `True`.

In [None]:
import pickle

readyToRun = False
topics_file = 'topics.pkl'
if readyToRun:
    # create an empty Series on the first run so the loop below has something to append to
    if not os.path.exists(topics_file):
        with open(topics_file, 'wb') as f:
            pickle.dump(pd.Series(dtype=object), f)

    # label the questions in batches and save after each batch
    batch_size = 1000
    for i in range(0, len(train), batch_size):
        new_topics = train[i:i+batch_size]['formatted_question'].apply(tm)
        with open(topics_file, 'rb') as f:
            topics = pickle.load(f)
        topics = pd.concat([topics, new_topics])
        with open(topics_file, 'wb') as f:
            pickle.dump(topics, f)

print('Success!')

Success!
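
The cell above saves the pickle after every batch, but it always starts again from the first row, so rerunning it after an interruption would repeat API calls. Here is a minimal sketch of a resumable variant that skips whatever is already saved. `label_in_batches`, `label_fn` and `checkpoint_path` are hypothetical names, and a plain list stands in for the pandas Series to keep the example self-contained.

```python
import os
import pickle
import tempfile


def label_in_batches(items, label_fn, checkpoint_path, batch_size=1000):
    """Apply label_fn to items in batches, checkpointing after each batch."""
    results = []
    # resume from an earlier run if a checkpoint exists
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, 'rb') as f:
            results = pickle.load(f)

    # start at the first item that has no saved result yet
    for start in range(len(results), len(items), batch_size):
        batch = items[start:start + batch_size]
        labeled = [label_fn(item) for item in batch]
        results.extend(labeled)
        with open(checkpoint_path, 'wb') as f:
            pickle.dump(results, f)
    return results


# toy usage: str.upper stands in for the (slow, paid) topic model
demo_path = os.path.join(tempfile.mkdtemp(), 'topics_demo.pkl')
print(label_in_batches(['q1', 'q2', 'q3'], str.upper, demo_path, batch_size=2))
```

This assumes the order of `items` does not change between runs, since the checkpoint is matched to the input by position only.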


In [None]:
with open(topics_file, 'rb') as f:
    topics = pickle.load(f)

In [None]:
print('Topics:', topics[0])
print('Length:', len(topics))

Topics: ['physics', 'clouds', 'atmospheric science', 'meteorology']
Length: 9060


**Well, almost 40k samples**

In [None]:
# add topics to train
assert len(topics) == len(train)
train['topics'] = topics
# explode topics
train = train.explode('topics')
# rename topics to topic
train.rename(columns={'topics': 'topic'}, inplace=True) 
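
To make the explode step concrete, here is a small sketch on toy data (the rows are invented for illustration). It also shows one thing to watch for: a question whose topic list is empty, which is what `get_topics` returns when the request fails, becomes a row with a missing topic after `explode`.

```python
import pandas as pd

# toy stand-in for the real frame; the values are made up
toy = pd.DataFrame({
    'formatted_question': ['q1', 'q2', 'q3'],
    'topics': [['physics', 'meteorology'], ['biology'], []],
})

# one row per (question, topic) pair; the empty list turns into NaN
exploded = toy.explode('topics').rename(columns={'topics': 'topic'})
print(len(exploded))

# drop questions that received no topic and rebuild a clean 0..n-1 index
cleaned = exploded.dropna(subset=['topic']).reset_index(drop=True)
print(cleaned)
```

Whether to drop such rows is a judgment call; the original notebook keeps whatever `explode` produces.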

In [None]:
n = np.random.randint(0, len(train))
pprint(train.iloc[n].to_dict())

{'answerKey': 'a',
 'combinedfact': 'wind can cause damage to thin soil.',
 'formatted_question': 'what can cause the most damage to thin soil? (a) wind '
                       '(b) storms (c) fronts (d) flowers (e) rivers (f) slugs '
                       '(g) compost (h) rain',
 'topic': 'agriculture'}


### Pushing to HF

In [None]:
from datasets import Dataset
from huggingface_hub import notebook_login

notebook_login()

In [None]:
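# note: after explode the index contains duplicates, so from_pandas keeps it
# as an extra `__index_level_0__` column (pass preserve_index=False to drop it)
data = Dataset.from_pandas(train)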
data.push_to_hub('labeled-multiple-choice')