# AdapTable Dataset Demo

This notebook demonstrates how to load the AdapTable dataset from the Hugging Face Hub. For more information, see [our GitHub repo](https://github.com/JunShern/few-shot-adaptation).

In [1]:
!pip install -q datasets

In [2]:
from datasets import load_dataset

distribution_names = [
    # Full dataset
    "MicPie/adaptable_full",
    # 5k random tasks from full dataset
    "MicPie/adaptable_5k",
    # Filtered to 1 task per website
    "MicPie/adaptable_unique",
    # Single-website tasks
    "MicPie/adaptable_baseball.fantasysports.yahoo.com",
    "MicPie/adaptable_bulbapedia.bulbagarden.net",
    "MicPie/adaptable_cappex.com",
    "MicPie/adaptable_cram.com",
    "MicPie/adaptable_dividend.com",
    "MicPie/adaptable_dummies.com",
    "MicPie/adaptable_en.wikipedia.org",
    "MicPie/adaptable_ensembl.org",
    "MicPie/adaptable_gamefaqs.com",
    "MicPie/adaptable_mgoblog.com",
    "MicPie/adaptable_mmo-champion.com",
    "MicPie/adaptable_msdn.microsoft.com",
    "MicPie/adaptable_phonearena.com",
    "MicPie/adaptable_sittercity.com",
    "MicPie/adaptable_sporcle.com",
    "MicPie/adaptable_studystack.com",
    "MicPie/adaptable_support.google.com",
    "MicPie/adaptable_w3.org",
    "MicPie/adaptable_wiki.openmoko.org",
    "MicPie/adaptable_wkdu.org",
    # Single cluster tasks
    # (clusters 00 through 29, plus the noise cluster)
    *[f"MicPie/adaptable_cluster{i:02d}" for i in range(30)],
    "MicPie/adaptable_cluster-noise",
    # Manual-rated tasks
    "MicPie/adaptable_rated-low", "MicPie/adaptable_rated-medium", "MicPie/adaptable_rated-high",
]

# Let's look at the 5k sample dataset
dataset = load_dataset('MicPie/adaptable_5k')
print(dataset['train'])

Using custom data configuration default
Reusing dataset adaptable_5k (/root/.cache/huggingface/datasets/MicPie___adaptable_5k/default/1.0.0/5ab3257953bcc82cddb96f38905de6dd5c1ed65754cdf1a6d5c97ddbdc32814e)



Dataset({
    features: ['task', 'input', 'output', 'options', 'pageTitle', 'outputColName', 'url', 'wdcFile'],
    num_rows: 84492
})
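
The names in `distribution_names` fall into a few families (general subsets, single-website tasks, clusters, and manually rated tasks). As a minimal sketch, the names could be grouped by family before choosing one to load. The helpers `categorize` and `group_names` and the category labels are hypothetical and not part of the dataset or the `datasets` library; the grouping relies only on the naming pattern visible in the list above.

```python
from collections import defaultdict


def categorize(name: str) -> str:
    """Map a Hub dataset name to a coarse family (labels are hypothetical)."""
    # Everything after the shared "adaptable_" prefix identifies the subset.
    suffix = name.split("adaptable_", 1)[1]
    if suffix.startswith("cluster"):
        return "cluster"
    if suffix.startswith("rated-"):
        return "rated"
    if suffix in {"full", "5k", "unique"}:
        return "general"
    # Anything else in the list is named after a website.
    return "website"


def group_names(names):
    """Group dataset names by family, preserving their original order."""
    groups = defaultdict(list)
    for name in names:
        groups[categorize(name)].append(name)
    return dict(groups)


# A small subset of the names listed above, used here for illustration.
example_names = [
    "MicPie/adaptable_5k",
    "MicPie/adaptable_cram.com",
    "MicPie/adaptable_cluster07",
    "MicPie/adaptable_cluster-noise",
    "MicPie/adaptable_rated-high",
]
groups = group_names(example_names)
for category, members in groups.items():
    print(category, members)
```

Any of the grouped names can then be passed to `load_dataset`, for example `load_dataset(groups["rated"][0], split="train")`, assuming that dataset is still available on the Hub.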


In [3]:
from itertools import islice

# Print the first 20 rows of the dataset (contains rows from all tasks in a flat list)
for row in islice(dataset['train'], 20):
    print(row)
print("...")

{'task': '2962b1e5_asteurized_Milk____Eating_Well__COMMENTS', 'input': '[POSTED] 07/27/2015 - 2:34am [COMMENTS] ', 'output': 'Homogenization is a major contributor to heart disease bc the fat molecules that would not normally pass into the blood stream quickly are broken down and actually damage blood vessels due to the speed of absorption/smaller size, over years this causes cholesterol to adhere to the micro lesions formed by the fat molecules clogging the blood vessel. — Anonymous', 'options': [], 'pageTitle': 'Is Raw Milk More Nutritious than Pasteurized Milk? - Eating Well', 'outputColName': 'COMMENTS', 'url': 'http://www.eatingwell.com/nutrition_health/nutrition_news_information/is_raw_milk_more_nutritious_than_pasteurized_milk?order=timestamp&sort=desc&quicktabs_1=1', 'wdcFile': '25/1438042987155.85_20150728002307-00234-ip-10-236-191-2_417594383_0.json'}
{'task': '2962b1e5_asteurized_Milk____Eating_Well__COMMENTS', 'input': '[POSTED] 07/10/2015 - 8:52pm [COMMENTS] ', 'output': "

In [4]:
# Separate rows by task
dataset_by_task = {}
for row in dataset['train']:
    dataset_by_task.setdefault(row['task'], []).append(row)
# Sanity check: every row was assigned to exactly one task
assert sum(len(task_rows) for task_rows in dataset_by_task.values()) == len(dataset['train'])
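
Once rows are grouped by task, one possible use is to assemble a few-shot prompt from a single task's rows. The sketch below assumes that each `input` string is meant to be continued directly by its `output`, as the trailing column tag in the rows suggests. The function `build_few_shot_prompt` and the toy rows are hypothetical illustrations, not part of the dataset or the repo.

```python
def build_few_shot_prompt(task_rows, num_shots=2):
    """Return (prompt, target): `num_shots` solved rows followed by one query.

    Each solved row contributes its input immediately followed by its output;
    the query row contributes only its input, and its output is the target.
    """
    if len(task_rows) <= num_shots:
        raise ValueError("need at least num_shots + 1 rows in the task")
    shots = task_rows[:num_shots]
    query = task_rows[num_shots]
    parts = [f"{row['input']}{row['output']}" for row in shots]
    parts.append(query["input"])
    return "\n".join(parts), query["output"]


# Toy rows in the same input/output shape as the dataset (made-up values).
toy_rows = [
    {"input": "[Country] France [Capital] ", "output": "Paris"},
    {"input": "[Country] Japan [Capital] ", "output": "Tokyo"},
    {"input": "[Country] Kenya [Capital] ", "output": "Nairobi"},
]
prompt, target = build_few_shot_prompt(toy_rows, num_shots=2)
print(prompt)
print("TARGET:", target)
```

With the real data, the same call could be made on any entry of `dataset_by_task`, for example `build_few_shot_prompt(dataset_by_task[task_name])`, provided the task has more rows than `num_shots`.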

In [5]:
# Show tasks
num_tasks = 10
max_rows_per_task = 3
for task_name, task_rows in islice(dataset_by_task.items(), num_tasks):
    print(f"TASK: {task_name} ({len(task_rows)} rows)")
    for row in task_rows[:max_rows_per_task]:
        print(f"INPUT  : {row['input']}")
        print(f"OUTPUT : {row['output']}")
    print("...\n")

TASK: 2962b1e5_asteurized_Milk____Eating_Well__COMMENTS (10 rows)
INPUT  : [POSTED] 07/27/2015 - 2:34am [COMMENTS] 
OUTPUT : Homogenization is a major contributor to heart disease bc the fat molecules that would not normally pass into the blood stream quickly are broken down and actually damage blood vessels due to the speed of absorption/smaller size, over years this causes cholesterol to adhere to the micro lesions formed by the fat molecules clogging the blood vessel. — Anonymous
INPUT  : [POSTED] 07/10/2015 - 8:52pm [COMMENTS] 
OUTPUT : I am curious about how all of these seemingly imminent dangers are in touch with reality. China, with it's incredible population and diversity, does not pasteurize it's milk. All of their dairy is raw. According to these raging statistics, they should be wiped out in swaths by listeria. But they are not - in fact, I have yet to actually see an article about Chinese dairy anywhere. I feel like it's all hype to keep us in the grocery stores, but that 