# TACO - Prepare Dataset

The goal of this notebook is to prepare the train and test splits for use.  
For this, some decisions have to be made:
- What will the input be?
- Will it include the desired output?
- What will be the "task" delimiter: the skill or the algorithm?


In [72]:

import polars as pl
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from datasets import load_dataset, load_from_disk
from pywaffle import Waffle
import json
import os
import ast


## Load and Visualize Datasets

In [73]:
PATH = "../../data/TACO"
train = load_from_disk(f"{PATH}/train.hf")
test = load_from_disk(f"{PATH}/test.hf")
# Note: the slice [0:-1] excludes the last row of each split.
df_train = pl.DataFrame(train[0:-1])
df_test = pl.DataFrame(test[0:-1])

In [74]:
df_train.head()

| question | solutions | starter_code | input_output | difficulty | raw_tags | name | source | tags | skill_types | url | Expected Auxiliary Space | time_limit | date | picture_num | memory_limit | Expected Time Complexity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str |
| `"This is an interactive problem…` | `"[]"` | `""` | `"{"inputs": ["hack\n30\n1 0 1 1…` | `"HARD"` | `"['interactive', 'binary search…` | null | `"codeforces"` | `"['Geometry', 'Sorting', 'Const…` | `"['Sorting']"` | `"https://codeforces.com/problem…` | null | `"2.0 seconds"` | null | null | `"256.0 megabytes"` | null |
| `"There are $n$ candy boxes in f…` | `"["INF = 10000000000.0\nmax_n =…` | `""` | `"{"inputs": ["5 3 10\n1 2 3 4 5…` | `"HARD"` | `"['dp']"` | null | `"codeforces"` | `"['Dynamic programming']"` | `"['Dynamic programming']"` | `"https://codeforces.com/problem…` | null | null | `"2019-12-31"` | null | null | null |
| `"Little Petya likes to play a l…` | `"[]"` | `""` | `"{"inputs": ["10 10\n5 1 2 4 1 …` | `"VERY_HARD"` | `"['data structures', 'dsu']"` | null | `"codeforces"` | `"['Spanning trees', 'Data struc…` | `"['Data structures']"` | `"https://codeforces.com/problem…` | null | `"1.0 seconds"` | null | null | `"64.0 megabytes"` | null |
| `"If you visit Aizu Akabeko shri…` | `"["def sub(maxs, mins):\n\tfor …` | `""` | `"{"inputs": ["9714431", "166123…` | `"UNKNOWN_DIFFICULTY"` | `"[]"` | null | `"aizu"` | `"[]"` | `"[]"` | null | null | `"1.0 seconds"` | null | null | `"268.435456 megabytes"` | null |
| `"You have a deck of $n$ cards, …` | `"["import heapq\nfrom math impo…` | `""` | `"{"inputs": ["4\n4\n1 2 3 4\n5\…` | `"EASY"` | `"['data structures', 'greedy', …` | null | `"codeforces"` | `"['Data structures', 'Mathemati…` | `"['Data structures', 'Greedy al…` | `"https://codeforces.com/problem…` | null | `"1 second"` | `"2021-02-23"` | `"0"` | `"512 megabytes"` | null |


In [75]:
train[0]

{'question': 'This is an interactive problem.\n\nIn good old times dwarves tried to develop extrasensory abilities:\n\n  * Exactly n dwarves entered completely dark cave. \n  * Each dwarf received a hat — white or black. While in cave, none of the dwarves was able to see either his own hat or hats of other Dwarves. \n  * Dwarves went out of the cave to the meadow and sat at an arbitrary place one after the other. When a dwarf leaves the cave, he sees the colors of all hats of all dwarves that are seating on the meadow (i.e. left the cave before him). However, he is not able to see the color of his own hat and none of the dwarves can give him this information. \n  * The task for dwarves was to got diverged into two parts — one with dwarves with white hats and one with black hats. \n\n\n\nAfter many centuries, dwarves finally managed to select the right place on the meadow without error. Will you be able to repeat their success?\n\nYou are asked to successively name n different integer p

## Process Input Baseline 

The goal here is to create a .feather file with the following columns:
- input: the question joined with the starter_code, when it exists
- tags: the parsed tags, exploded into one row per tag; these represent our desired tasks
- difficulty
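The tag handling described above can be illustrated without any dataframe library. The following is a minimal sketch in plain Python, assuming the tags arrive as stringified lists like those in the `tags` column; the helper names (`parse_tags`, `explode_tags`, `tag_flags`) and the sample values are hypothetical and not part of the notebook.

```python
import ast

def parse_tags(raw):
    """Parse a stringified list such as "['Sorting', 'Geometry']" into a list of tags."""
    return list(ast.literal_eval(raw)) if raw else []

def explode_tags(raw_tags):
    """Return (id, tag) pairs, one per tag; problems without tags get (id, None)."""
    rows = []
    for problem_id, raw in enumerate(raw_tags):
        tags = parse_tags(raw)
        rows.extend((problem_id, tag) for tag in tags or [None])
    return rows

def tag_flags(raw_tags):
    """Return one dict per problem with a boolean flag for every tag seen."""
    parsed = [parse_tags(raw) for raw in raw_tags]
    vocabulary = sorted({tag for tags in parsed for tag in tags})
    return [{tag: tag in tags for tag in vocabulary} for tags in parsed]

# Hypothetical sample values shaped like the `tags` column.
sample = ["['Geometry', 'Sorting']", "['Dynamic programming']", "[]"]
rows = explode_tags(sample)
flags = tag_flags(sample)
```

`explode_tags` mirrors the long format (one row per problem and tag), while `tag_flags` gives the wide boolean view. Either can serve as the task label, depending on how the tasks are consumed downstream.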

In [76]:
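processed_train = (
    df_train
    .select(["question", "starter_code", "tags", "difficulty"])
    .with_row_index("id")  # row position becomes the problem id
    .with_columns(
        # Model input: the question followed by the starter code.
        pl.concat_str(["question", "starter_code"], separator="\n").alias("input"),
        # Tags are stored as stringified Python lists; parse them into real lists.
        pl.col("tags").map_elements(ast.literal_eval).alias("tags"),
    )
    .explode("tags")  # one row per (problem, tag); empty tag lists become null
    .select("id", "difficulty", "tags", "input")
)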



In [77]:
processed_train.filter(pl.col("tags").is_not_null()).select("id").unique().count()

id
u32
16877


In [78]:
processed_train.filter(pl.col("tags").is_null()).select("id").unique().count()

id
u32
8565


In [79]:
processed_train.head()

| id | difficulty | tags | input |
| --- | --- | --- | --- |
| u32 | str | str | str |
| 0 | `"HARD"` | `"Geometry"` | `"This is an interactive problem…` |
| 0 | `"HARD"` | `"Sorting"` | `"This is an interactive problem…` |
| 0 | `"HARD"` | `"Constructive algorithms"` | `"This is an interactive problem…` |
| 1 | `"HARD"` | `"Dynamic programming"` | `"There are $n$ candy boxes in f…` |
| 2 | `"VERY_HARD"` | `"Spanning trees"` | `"Little Petya likes to play a l…` |


In [80]:
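processed_test = (
    df_test
    .select(["question", "starter_code", "tags", "difficulty"])
    .with_row_index("id")  # row position becomes the problem id
    .with_columns(
        # Model input: the question followed by the starter code.
        pl.concat_str(["question", "starter_code"], separator="\n").alias("input"),
        # Tags are stored as stringified Python lists; parse them into real lists.
        pl.col("tags").map_elements(ast.literal_eval).alias("tags"),
    )
    .explode("tags")  # one row per (problem, tag); empty tag lists become null
    .select("id", "difficulty", "tags", "input")
)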



In [81]:
processed_train.write_ipc(f"{PATH}/processed/train.feather")
processed_test.write_ipc(f"{PATH}/processed/test.feather")

## Process Evaluation Tests

In [None]:
ids = pl.Series([], dtype=pl.Int64)
test_ids = pl.Series([], dtype=pl.Int64)
inputs = pl.Series([], dtype=pl.String)
outputs = pl.Series([], dtype=pl.String)

# Flatten each problem's test cases into one row per (problem id, test id).
for id in range(train.num_rows):
    try:
        input_output = json.loads(train[id]["input_output"])
    except (TypeError, ValueError):
        # Treat a missing or malformed field as having no test cases.
        input_output = {"inputs": [], "outputs": []}
    for t in range(len(input_output["inputs"])):
        ids.append(pl.Series([id]))
        test_ids.append(pl.Series([t]))
        inputs.append(pl.Series([str(input_output["inputs"][t])]))
        outputs.append(pl.Series([str(input_output["outputs"][t])]))

In [120]:
train_evaluation_tests = pl.DataFrame({"id": ids, "test_id": test_ids, "input": inputs, "output": outputs})
train_evaluation_tests.write_ipc(f"{PATH}/processed/train_evaluation_tests.feather")
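The flattening step above can also be expressed as a small generator. This is a hedged sketch in plain Python, assuming `input_output` is a JSON object with parallel `inputs` and `outputs` lists as in the rows shown earlier; the function name `iter_test_cases` and the sample string are hypothetical. Unlike the loop above, it pairs inputs and outputs with `zip`, so it silently stops at the shorter list instead of raising.

```python
import json

def iter_test_cases(problem_id, raw_input_output):
    """Yield (id, test_id, input, output) tuples from a JSON-encoded test field."""
    try:
        parsed = json.loads(raw_input_output)
    except (TypeError, ValueError):
        return  # missing or malformed field: no test cases
    if not isinstance(parsed, dict):
        return  # unexpected shape: no test cases
    inputs = parsed.get("inputs", [])
    outputs = parsed.get("outputs", [])
    for test_id, (inp, out) in enumerate(zip(inputs, outputs)):
        yield problem_id, test_id, str(inp), str(out)

# Hypothetical field value with two test cases.
sample = '{"inputs": ["1 2\\n", "3 4\\n"], "outputs": ["3\\n", "7\\n"]}'
good_rows = list(iter_test_cases(0, sample))
bad_rows = list(iter_test_cases(1, "not json"))
missing_rows = list(iter_test_cases(2, None))
```

Collecting the tuples into plain lists and building the DataFrame once at the end avoids appending one-element Series inside the loop.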

In [124]:
ids = pl.Series([], dtype=pl.Int64)
test_ids = pl.Series([], dtype=pl.Int64)
inputs = pl.Series([], dtype=pl.String)
outputs = pl.Series([], dtype=pl.String)

# Flatten each problem's test cases into one row per (problem id, test id).
for id in range(test.num_rows):
    try:
        input_output = json.loads(test[id]["input_output"])
    except (TypeError, ValueError):
        # Treat a missing or malformed field as having no test cases.
        input_output = {"inputs": [], "outputs": []}
    for t in range(len(input_output["inputs"])):
        ids.append(pl.Series([id]))
        test_ids.append(pl.Series([t]))
        inputs.append(pl.Series([str(input_output["inputs"][t])]))
        outputs.append(pl.Series([str(input_output["outputs"][t])]))

In [125]:
test_evaluation_tests = pl.DataFrame({"id": ids, "test_id": test_ids, "input": inputs, "output": outputs})
test_evaluation_tests.write_ipc(f"{PATH}/processed/test_evaluation_tests.feather")

## Process Expected Solutions

In [85]:
ids = pl.Series([], dtype=pl.Int64)
solutions = pl.Series([], dtype=pl.String)

# One row per (problem id, reference solution).
for id in range(train.num_rows):
    sample_solutions = json.loads(train[id]["solutions"])
    for s in sample_solutions:
        ids.append(pl.Series([id]))
        solutions.append(pl.Series([s]))

In [86]:
train_solutions = pl.DataFrame({"id": ids, "solution": solutions})
train_solutions.write_ipc(f"{PATH}/processed/train_solutions.feather")


In [93]:
ids = pl.Series([], dtype=pl.Int64)
solutions = pl.Series([], dtype=pl.String)

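# One row per (problem id, reference solution).
for id in range(test.num_rows):
    try:
        sample_solutions = json.loads(test[id]["solutions"])
    except (TypeError, ValueError):
        # Some test entries are not valid JSON; treat them as having no solutions.
        sample_solutions = []
    for s in sample_solutions:
        ids.append(pl.Series([id]))
        solutions.append(pl.Series([s]))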

In [91]:
test[11]

{'question': 'Kevin has a string S consisting of N lowercase English letters.  \n\nKevin wants to split it into 4 pairwise different non-empty parts.  For example, string "happynewyear" can be splitted into "happy", "new", "ye" and "ar". He can\'t delete any characters or change the order of the characters.\n\nHelp Kevin and find if there exist at least one possible spliting.\n\nInput format:\n\nThe first line of input will contain an integer T, denoting the number of test cases. Each of the next T lines contains a string S.\n\nOutput format:\n\nFor every test case output "YES" if it is possible to split the string and "NO" otherwise.\n\nConstraints:\n1 ≤ T ≤ 100\n1 ≤ N ≤ 1000\nN ≤ 20 in test data worth 40% of all points\n\nSAMPLE INPUT\n2\nababca\naaabb\n\nSAMPLE OUTPUT\nYES\nNO',
 'solutions': '[\'\\\'\\\'\\\'\\n# Read input from stdin and provide input before running code\\n\\nname = raw_input(\\\'What is your name?\\\\n\\\')\\nprint \\\'Hi, %s.\\\' % name\\n\\\'\\\'\\\'\\n\\ndef ge

In [95]:
test_solutions = pl.DataFrame({"id": ids, "solution": solutions})
test_solutions.write_ipc(f"{PATH}/processed/test_solutions.feather")
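The `solutions` field printed for `test[11]` appears to be a Python-style list literal with single-quoted strings rather than JSON, which would explain why `json.loads` fails on some test rows. The following is a hedged sketch of a parser that falls back to `ast.literal_eval` in that case. The function name `parse_solutions` and the sample strings are hypothetical, and whether recovering these solutions is desirable is left open.

```python
import ast
import json

def parse_solutions(raw):
    """Parse a solutions field as JSON, falling back to a Python list literal."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        try:
            parsed = ast.literal_eval(raw)
        except (TypeError, ValueError, SyntaxError):
            return []  # neither JSON nor a Python literal
    if not isinstance(parsed, list):
        return []
    return [s for s in parsed if isinstance(s, str)]

# Hypothetical field values: valid JSON, a Python literal, garbage, and a missing value.
from_json = parse_solutions('["print(1)"]')
from_literal = parse_solutions("['print(1)']")
from_garbage = parse_solutions("not a list")
from_none = parse_solutions(None)
```

With such a helper, rows that are currently skipped by the `except` branch would contribute their solutions instead of an empty list.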
