# Extracting example test cases from the APPS dataset
Original code from CodeRL repo - Copyright (c) 2022, salesforce.com, inc.
All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
For full license text, see the LICENSE file in https://github.com/salesforce/CodeRL/blob/main/LICENSE.txt
or https://opensource.org/licenses/BSD-3-Clause

Adapted to load the APPS dataset through huggingface/datasets, so the dataset no longer has to be downloaded manually.

In [107]:
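# Standard library imports
import glob
import io
import json
import os
import pdb
import pickle as pkl
import random  # the original `from random import random` shadowed this module and was dropped
import re

# Third-party imports
import numpy as np
from tqdm import tqdm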

# Local imports
from apps_dataset import load_apps_dataset, APPSTask

In [108]:
# Load the APPS dataset
apps_dataset = load_apps_dataset()

train_dataset = apps_dataset["train"]
test_dataset = apps_dataset["test"]

### IMPORTANT: SET TO DESIRED DATASET PORTION
target_dataset = test_dataset

No config specified, defaulting to: apps/all
Found cached dataset apps (C:/Users/noahv/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5)
100%|██████████| 2/2 [00:00<00:00, 17.33it/s]


In [109]:
def check_if_key_matches_problem_id(dataset):
    for i, problem in enumerate(dataset):
        problem_id = problem["problem_id"]
        if problem_id != i:
            print("Problem ID {} does not match index {}".format(problem_id, i))
            return False

    return True

print(check_if_key_matches_problem_id(train_dataset))
print(check_if_key_matches_problem_id(test_dataset))

True
True


## Debugging

In [110]:
problem = train_dataset[15]
problem["problem_id"]

15

In [111]:
problem["question"]

"Screen resolution of Polycarp's monitor is $a \\times b$ pixels. Unfortunately, there is one dead pixel at his screen. It has coordinates $(x, y)$ ($0 \\le x < a, 0 \\le y < b$). You can consider columns of pixels to be numbered from $0$ to $a-1$, and rows\xa0— from $0$ to $b-1$.\n\nPolycarp wants to open a rectangular window of maximal size, which doesn't contain the dead pixel. The boundaries of the window should be parallel to the sides of the screen.\n\nPrint the maximal area (in pixels) of a window that doesn't contain the dead pixel inside itself.\n\n\n-----Input-----\n\nIn the first line you are given an integer $t$ ($1 \\le t \\le 10^4$)\xa0— the number of test cases in the test. In the next lines you are given descriptions of $t$ test cases.\n\nEach test case contains a single line which consists of $4$ integers $a, b, x$ and $y$ ($1 \\le a, b \\le 10^4$; $0 \\le x < a$; $0 \\le y < b$)\xa0— the resolution of the screen and the coordinates of a dead pixel. It is guaranteed th

In [112]:
full_text = problem["question"]

# Inverse of "".join(lines): split on newlines and re-append "\n" to every piece,
# then strip the last piece so it does not gain an extra trailing newline
lines = [line + "\n" for line in full_text.split("\n")]
lines[-1] = lines[-1].strip()
lines

["Screen resolution of Polycarp's monitor is $a \\times b$ pixels. Unfortunately, there is one dead pixel at his screen. It has coordinates $(x, y)$ ($0 \\le x < a, 0 \\le y < b$). You can consider columns of pixels to be numbered from $0$ to $a-1$, and rows\xa0— from $0$ to $b-1$.\n",
 '\n',
 "Polycarp wants to open a rectangular window of maximal size, which doesn't contain the dead pixel. The boundaries of the window should be parallel to the sides of the screen.\n",
 '\n',
 "Print the maximal area (in pixels) of a window that doesn't contain the dead pixel inside itself.\n",
 '\n',
 '\n',
 '-----Input-----\n',
 '\n',
 'In the first line you are given an integer $t$ ($1 \\le t \\le 10^4$)\xa0— the number of test cases in the test. In the next lines you are given descriptions of $t$ test cases.\n',
 '\n',
 'Each test case contains a single line which consists of $4$ integers $a, b, x$ and $y$ ($1 \\le a, b \\le 10^4$; $0 \\le x < a$; $0 \\le y < b$)\xa0— the resolution of the screen 

In [113]:
print(full_text)

Screen resolution of Polycarp's monitor is $a \times b$ pixels. Unfortunately, there is one dead pixel at his screen. It has coordinates $(x, y)$ ($0 \le x < a, 0 \le y < b$). You can consider columns of pixels to be numbered from $0$ to $a-1$, and rows — from $0$ to $b-1$.

Polycarp wants to open a rectangular window of maximal size, which doesn't contain the dead pixel. The boundaries of the window should be parallel to the sides of the screen.

Print the maximal area (in pixels) of a window that doesn't contain the dead pixel inside itself.


-----Input-----

In the first line you are given an integer $t$ ($1 \le t \le 10^4$) — the number of test cases in the test. In the next lines you are given descriptions of $t$ test cases.

Each test case contains a single line which consists of $4$ integers $a, b, x$ and $y$ ($1 \le a, b \le 10^4$; $0 \le x < a$; $0 \le y < b$) — the resolution of the screen and the coordinates of a dead pixel. It is guaranteed that $a+b>2$ (e.g. $a=b=1$ is im

In [114]:
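# Note: `in_outs` is only built in the extraction loop further down, so this cell
# relies on state from an earlier run. Its keys follow `target_dataset`, so the
# entry shown need not belong to the train problem inspected above.
print(in_outs[15])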

{'inputs': ['1 7 3\n', '10 10 0\n', '1 -4 5\n', '0 60 50\n'], 'outputs': ['YES\n', 'YES\n', 'NO\n', 'NO\n']}


In [115]:
print("".join(lines))

Screen resolution of Polycarp's monitor is $a \times b$ pixels. Unfortunately, there is one dead pixel at his screen. It has coordinates $(x, y)$ ($0 \le x < a, 0 \le y < b$). You can consider columns of pixels to be numbered from $0$ to $a-1$, and rows — from $0$ to $b-1$.

Polycarp wants to open a rectangular window of maximal size, which doesn't contain the dead pixel. The boundaries of the window should be parallel to the sides of the screen.

Print the maximal area (in pixels) of a window that doesn't contain the dead pixel inside itself.


-----Input-----

In the first line you are given an integer $t$ ($1 \le t \le 10^4$) — the number of test cases in the test. In the next lines you are given descriptions of $t$ test cases.

Each test case contains a single line which consists of $4$ integers $a, b, x$ and $y$ ($1 \le a, b \le 10^4$; $0 \le x < a$; $0 \le y < b$) — the resolution of the screen and the coordinates of a dead pixel. It is guaranteed that $a+b>2$ (e.g. $a=b=1$ is im

In [116]:
# See if they're equivalent
full_text == "".join(lines)

True
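
The split-and-rejoin step above can be checked on a small string. This is a minimal sketch, where `split_keepends` is a hypothetical helper name that does not appear in the notebook. It also shows that, because of the `.strip()` on the last piece, the round trip is only exact when the text has no leading or trailing whitespace on its final line.

```python
def split_keepends(text):
    # Same steps as the cell above: split on newlines, re-append the newline
    # to every piece, then strip the last piece.
    lines = [line + "\n" for line in text.split("\n")]
    lines[-1] = lines[-1].strip()
    return lines


sample = "first line\n\nthird line"
roundtrip_ok = "".join(split_keepends(sample)) == sample

# Trailing spaces on the last line are removed by .strip(), so this round trip is lossy.
padded = "first line\nlast line  "
roundtrip_padded = "".join(split_keepends(padded)) == padded

print(roundtrip_ok, roundtrip_padded)
```

For the problem inspected here the comparison returns `True`, so the lossy case does not apply to it.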

In [117]:
problem["test_example"] = "test"

In [118]:
problem

{'problem_id': 15,
 'question': "Screen resolution of Polycarp's monitor is $a \\times b$ pixels. Unfortunately, there is one dead pixel at his screen. It has coordinates $(x, y)$ ($0 \\le x < a, 0 \\le y < b$). You can consider columns of pixels to be numbered from $0$ to $a-1$, and rows\xa0— from $0$ to $b-1$.\n\nPolycarp wants to open a rectangular window of maximal size, which doesn't contain the dead pixel. The boundaries of the window should be parallel to the sides of the screen.\n\nPrint the maximal area (in pixels) of a window that doesn't contain the dead pixel inside itself.\n\n\n-----Input-----\n\nIn the first line you are given an integer $t$ ($1 \\le t \\le 10^4$)\xa0— the number of test cases in the test. In the next lines you are given descriptions of $t$ test cases.\n\nEach test case contains a single line which consists of $4$ integers $a, b, x$ and $y$ ($1 \\le a, b \\le 10^4$; $0 \\le x < a$; $0 \\le y < b$)\xa0— the resolution of the screen and the coordinates of a

## Extracting example test cases

In [119]:
def find_in_out(lines):
    start_example = False
    start_input = False
    start_output = False 
    inputs = []
    outputs = []
    curr_input = ''
    curr_output = ''
    for line in lines:
        
        if len(line.strip())==0: 
            start_output = False
            start_input = False 
            continue
        
        line1 = line.lower()
        
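        # Section headers that start the examples block.
        # Note: '-example 3:' carries a leading dash, unlike its 'example N:' siblings,
        # so it matches fewer lines; it is left unchanged so the counts reported below still hold.
        if '-examples-' in line1 or '-example-' in line1 or '-example -' in line1 or \
            '-example 1-' in line1 or '-example 2-' in line1 or '-example 3-' in line1 or \
            '-example 4-' in line1 or '-example 5-' in line1 or \
            'example:' in line1 or \
            'example 1:' in line1 or 'example 2:' in line1 or '-example 3:' in line1 or \
            'example 4:' in line1 or 'example 5:' in line1: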
            start_example = True
            continue
        
        if '-note-' in line1:
            start_example = False
            start_output = False
            start_input = False
            continue
            
        if (start_example and 'Input' in line) or ('-Sample Input' in line) \
            or ('-Example Input' in line) or ('Sample Input:' in line) \
            or ('-Sample input' in line):
            start_input = True
            start_output = False
            
            if len(curr_output)>0:
                outputs.append(curr_output)
                curr_output = ''
            
            if ('-sample input' not in line1) and ('-example input' not in line1):
                
                if 'input:' in line1:
                    temp = line1.replace('example','').replace('sample','').replace('input:','')
                    if len(temp.strip())>0: 
                        curr_input = temp        
            continue
        
        if (start_example and 'Output' in line) or ('-Sample Output' in line) \
            or ('-Example Output' in line) or ('Sample Output:' in line) \
            or ('-Sample output' in line):
            start_output = True
            start_input = False
            
            if len(curr_input)>0:
                inputs.append(curr_input)
                curr_input = ''
                
            if ('-sample output' not in line1) and ('-example output' not in line1):
                if 'output:' in line1:
                    temp = line1.replace('example','').replace('sample','').replace('output:','')
                    if len(temp.strip())>0: 
                        curr_output = temp
            continue 
        
        if start_input:
            curr_input += line 
        
        if start_output:
            curr_output += line 
            
    if len(curr_output)>0: 
        outputs.append(curr_output)
        start_output = False
            
    if len(inputs)==0 or len(inputs) != len(outputs) or (start_output or start_input):
        return None
        
    return {'inputs': inputs, 'outputs': outputs}
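
To make the expected input and output shape concrete, here is a deliberately simplified sketch that handles only the `-----Examples-----` / `Input` / `Output` layout. The name `extract_examples_simple` and the toy statement are assumptions for illustration. It is not a replacement for `find_in_out`, which covers many more header variants and treats blank lines as block terminators.

```python
def extract_examples_simple(text):
    inputs, outputs = [], []
    in_examples = False
    mode = None  # "input", "output", or None
    buffer = ""

    def flush():
        # Store the collected block in the list that matches the current mode.
        nonlocal buffer
        if buffer:
            (inputs if mode == "input" else outputs).append(buffer)
        buffer = ""

    for line in text.split("\n"):
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("-----example"):
            in_examples = True
            continue
        if lowered.startswith("-----note"):
            flush()
            in_examples, mode = False, None
            continue
        if not in_examples:
            continue
        if stripped in ("Input", "Output"):
            flush()
            mode = stripped.lower()
            continue
        if stripped and mode is not None:
            buffer += line + "\n"
    flush()

    # Mirror the original contract: None when nothing usable was found.
    if not inputs or len(inputs) != len(outputs):
        return None
    return {"inputs": inputs, "outputs": outputs}


statement = (
    "Add two integers.\n\n"
    "-----Examples-----\n"
    "Input\n"
    "1 2\n\n"
    "Output\n"
    "3\n\n"
    "Input\n"
    "10 -4\n\n"
    "Output\n"
    "6\n\n"
    "-----Note-----\n"
    "The sum fits in 32 bits.\n"
)
examples = extract_examples_simple(statement)
print(examples)
```

The result has the same shape as the dictionary returned by `find_in_out`: two parallel lists of newline-terminated strings, or `None` when the inputs and outputs cannot be paired.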

In [120]:
non_eng = {}
no_test = {}
in_outs = {}

for problem in tqdm(target_dataset, total=len(target_dataset)):
    problem: APPSTask
    problem_idx = problem["problem_id"]
    full_text = problem["question"]

    # Inverse of "".join(lines): split on newlines and re-append "\n" to every piece,
    # then strip the last piece so it does not gain an extra trailing newline
    lines = [line + "\n" for line in full_text.split("\n")]
    lines[-1] = lines[-1].strip()
    temp = full_text

    in_out = find_in_out(lines)
    in_outs[problem_idx] = in_out
    
    # special case with non-English problems
    if 'Входные' in temp: 
        non_eng[problem_idx] = temp

    elif in_outs[problem_idx] is None:
        no_test[problem_idx] = temp
    
print("Non-English task: {}".format(len(non_eng))) 
print("Zero-example-test task: {}".format(len(no_test)))       

# Results recorded from earlier runs:
#
# Train dataset:
#   Non-English task: 0
#   Zero-example-test task: 2770
#
# Test dataset:
#   Non-English task: 17
#   Zero-example-test task: 46

100%|██████████| 5000/5000 [00:01<00:00, 3633.67it/s]

Non-English task: 17
Zero-example-test task: 46

In [121]:
nb_example_tests = []
for k,v in in_outs.items(): 
    if v is None:
        nb_example_tests.append(0)
    else:
        nb_example_tests.append(len(v['inputs']))

print("Total number of samples: {}".format(np.array(nb_example_tests).shape))
print("Average extracted example test cases: {}".format(np.array(nb_example_tests).mean()))

# Results recorded from earlier runs:
#
# Train dataset:
#   Total number of samples: (5000,)
#   Average extracted example test cases: 0.7406
#
# Test dataset:
#   Total number of samples: (5000,)
#   Average extracted example test cases: 1.9764

Total number of samples: (5000,)
Average extracted example test cases: 1.9764
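
The mean alone does not show how many problems ended up with zero, one, or several examples. A minimal sketch of a per-count histogram follows. `toy_in_outs` is a made-up stand-in for `in_outs`, so the numbers printed here are not results from the dataset.

```python
from collections import Counter

# Toy stand-in for `in_outs`: problem id -> extracted examples, or None when nothing was found.
toy_in_outs = {
    0: {"inputs": ["1\n", "2\n"], "outputs": ["1\n", "4\n"]},
    1: None,
    2: {"inputs": ["3\n"], "outputs": ["9\n"]},
    3: None,
}

# Same counting rule as the cell above: None counts as zero examples.
counts = [0 if v is None else len(v["inputs"]) for v in toy_in_outs.values()]
histogram = Counter(counts)
mean_count = sum(counts) / len(counts)

print(sorted(histogram.items()), mean_count)
```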

In [106]:
# Save in_outs under example_test_cases/, creating the directory if it does not exist yet.
# The file name must match the split chosen in target_dataset above.
os.makedirs("example_test_cases", exist_ok=True)

# with open("example_test_cases/train.json", "w", encoding="utf-8") as f:
#     json.dump(in_outs, f)

with open("example_test_cases/test.json", "w", encoding="utf-8") as f:
    json.dump(in_outs, f)
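
One thing to keep in mind when reading the saved file back: JSON object keys are always strings, so the integer problem ids used as keys in `in_outs` come back as `"15"` rather than `15`. A minimal sketch of the round trip, using a made-up `in_outs_demo` dictionary rather than the real data:

```python
import json

# Made-up stand-in for `in_outs`, including a problem without extracted examples.
in_outs_demo = {15: {"inputs": ["1 7 3\n"], "outputs": ["YES\n"]}, 16: None}

# Serialise and parse again, as writing and re-reading the JSON file would do.
restored = json.loads(json.dumps(in_outs_demo))
print(list(restored.keys()))

# Convert the keys back to integers to index by problem id again.
restored_int_keys = {int(k): v for k, v in restored.items()}
print(restored_int_keys == in_outs_demo)
```

Entries that were `None` are written as `null` and load back as `None`, so only the keys need converting.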