# Convert ARC JSON Tasks to Custom JSONL Format

This notebook converts a folder of ARC task `.json` files into a JSON Lines (`.jsonl`) file where each line is a JSON object with three fields:

- **train**: A string with formatted training examples (inputs and outputs).
- **test**: A string with formatted test inputs.
- **test_answer**: The ground-truth test output as raw grid arrays: a single grid when the task has one test example, or a list of grids when it has several.

The notebook is organized around the following helper functions:

1. **format_grid_to_string**: Returns a single grid as a string with one row per line.
2. **get_train_string**: Returns the formatted training examples string.
3. **get_test_string**: Returns the formatted test inputs string.
4. **get_test_answer**: Returns the ground truth output(s) from test examples.
5. **format_task_file_to_strings**: Returns the tuple `(train_string, test_string, test_answer)` for one task file; the first two are strings, the third is the raw grid data.

Then the notebook tests the functions on a single task file and finally loops over all files in a folder to create the JSONL file.

In [40]:
# ## 1. Setup: Import Libraries
import json
from pathlib import Path
from tqdm import tqdm  # for progress visualization
import traceback  # for improved error reporting

print("Libraries imported.")

Libraries imported.


In [41]:
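# ## 2. Define the Grid Formatting Helper

def format_grid_to_string(grid):
    """
    Converts a grid (list of lists) into a bracketed string with one row per line.
    Each row is written as `[a,b,c],` on its own line. The trailing comma is kept
    after the last row as well, so the result is a valid Python literal but not valid JSON.

    Args:
        grid (list[list[int]]): The input grid.

    Returns:
        str: The formatted grid, or an empty string if the grid is empty or not a list of lists.
    """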
    if not grid or not isinstance(grid, list) or not isinstance(grid[0], list):
        return ""
    inner_lists = [f"[{','.join(map(str, row))}]," for row in grid]
    return "[\n" + "\n".join(inner_lists) + "\n]"
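
As a quick check of the format, the sketch below repeats `format_grid_to_string` (copied so the block runs on its own) and applies it to a small made-up 2×2 grid. Because every row, including the last one, ends with a comma, the string is a valid Python literal but not valid JSON. `ast.literal_eval` is used here as one assumed way to parse the string back; it is not part of the original notebook.

```python
import ast
import json


def format_grid_to_string(grid):
    # Copy of the notebook helper so this example is self-contained.
    if not grid or not isinstance(grid, list) or not isinstance(grid[0], list):
        return ""
    inner_lists = [f"[{','.join(map(str, row))}]," for row in grid]
    return "[\n" + "\n".join(inner_lists) + "\n]"


sample_grid = [[1, 2], [3, 4]]  # hypothetical grid, not taken from a task file
formatted = format_grid_to_string(sample_grid)
print(formatted)

# The trailing comma after the last row is accepted by Python's literal parser...
round_tripped = ast.literal_eval(formatted)

# ...but rejected by the JSON parser.
try:
    json.loads(formatted)
    parses_as_json = True
except json.JSONDecodeError:
    parses_as_json = False
```

Anyone who needs to recover grids from the `train` or `test` strings downstream would therefore have to use a Python-style parser or strip the trailing commas first.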


In [42]:
# ## 3. Define Formatting Functions for Each Component

def get_train_string(task_data, train_template):
    """
    Formats the training examples using the provided template.
    
    Args:
        task_data (dict): The loaded JSON data for a task.
        train_template (str): Template for training examples. Expected placeholders: {index}, {input}, {output}.
    
    Returns:
        str: The formatted training examples.
    """
    train_examples = task_data.get("train", [])
    formatted_examples = []
    for idx, example in enumerate(train_examples):
        input_str = format_grid_to_string(example.get("input"))
        output_str = format_grid_to_string(example.get("output"))
        formatted_examples.append(
            train_template.format(index=idx + 1, input=input_str, output=output_str)
        )
    return "\n\n".join(formatted_examples)


def get_test_string(task_data, test_template):
    """
    Formats the test inputs using the provided template.
    
    Args:
        task_data (dict): The loaded JSON data for a task.
        test_template (str): Template for test examples. Expected placeholders: {index}, {input}.
    
    Returns:
        str: The formatted test input examples.
    """
    test_examples = task_data.get("test", [])
    formatted_examples = []
    for idx, example in enumerate(test_examples):
        input_str = format_grid_to_string(example.get("input"))
        formatted_examples.append(
            test_template.format(index=idx + 1, input=input_str)
        )
    return "\n\n".join(formatted_examples)


def get_test_answer(task_data):
    """
    Extracts the ground truth output(s) from the test examples.
    If there is only one test example with an output, returns it directly; otherwise returns a list of outputs.
    
    Args:
        task_data (dict): The loaded JSON data for a task.
    
    Returns:
        object: The formatted test output or list of outputs.
    """
    test_examples = task_data.get("test", [])
    answers = []
    for example in test_examples:
        answers.append(example.get("output", []))
    return answers[0] if len(answers) == 1 else answers


def format_task_file_to_strings(json_path, train_template, test_template):
    """
    Loads a JSON task file and returns a tuple with the three formatted components:
    (train_string, test_string, test_answer).
    
    Args:
        json_path (str or Path): Path to the JSON file.
        train_template (str): Template for formatting training examples.
        test_template (str): Template for formatting test inputs.
    
    Returns:
        tuple: (train_string, test_string, test_answer) or None if an error occurs.
    """
    json_path = Path(json_path)  # convert first so the except block can always use .name
    try:
        with open(json_path, 'r', encoding='utf-8') as infile:
            task_data = json.load(infile)
        
        train_str = get_train_string(task_data, train_template)
        test_str = get_test_string(task_data, test_template)
        test_answer = get_test_answer(task_data)
        
        return train_str, test_str, test_answer
    except Exception as e:
        print(f"\nError processing file {json_path.name}: {e}")
        # Uncomment below to see the full traceback if needed
        # traceback.print_exc()
        return None

print("Custom formatting functions defined.")

Custom formatting functions defined.
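
Because `get_test_answer` returns either one grid or a list of grids, code that consumes the JSONL file has to tell the two shapes apart. The following is a minimal sketch of one way to do that. `normalize_test_answer` is a hypothetical helper that is not part of the notebook, and it assumes every test example has a non-empty output grid.

```python
def normalize_test_answer(test_answer):
    """Return the answer as a list of grids (hypothetical helper, not in the notebook)."""
    if not test_answer:
        # Empty list: no test examples, or a single example without an output.
        return []
    first = test_answer[0]
    # In a single grid, the first element is a row, so its cells are not lists.
    is_single_grid = isinstance(first, list) and (
        not first or not isinstance(first[0], list)
    )
    return [test_answer] if is_single_grid else list(test_answer)


single = [[6, 5], [9, 3]]          # one grid, as stored for a one-test task
multiple = [[[1, 2]], [[3], [4]]]  # two grids, as stored for a two-test task

print(normalize_test_answer(single))
print(normalize_test_answer(multiple))
```

The heuristic inspects nesting depth only, so an answer list whose first grid is empty would be misread as a single grid.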


In [43]:
# ## 4. Test the Function on a Single Task File

# Define custom templates for train and test inputs
train_template = (
    "### Train Example {index}:\n"
    "Input:\n"
    "{input}\n\n"
    "Output:\n"
    "{output}"
)

test_template = (
    "### Test Input:\n"
    "{input}"
)

# Set the sample task filename (adjust the folder/filename as needed)
test_input_folder = "modded_tasks"  # Folder containing your ARC JSON task files
example_task_filename = "8dab14c2_test1.json"  # Example file name

example_task_path = Path(test_input_folder) / example_task_filename

print(f"--- Testing custom formatting for: {example_task_path.name} ---")

if example_task_path.is_file():
    result = format_task_file_to_strings(example_task_path, train_template, test_template)
    if result is not None:
        train_str, test_str, test_answer = result
        print("\n--- TRAIN ---\n")
        print(train_str)
        print("\n--- TEST ---\n")
        print(test_str)
        print("\n--- TEST_ANSWER ---\n")
        print(test_answer)
    else:
        print("An error occurred during formatting.")
else:
    print(f"Error: Example file not found at {example_task_path}. Please check your input folder.")

print("\n--- End of Test ---")

--- Testing custom formatting for: 8dab14c2_test1.json ---

--- TRAIN ---

### Train Example 1:
Input:
[
[6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6],
[6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6],
[6,6,5,5,5,5,5,5,5,5,6,5,5,5,6,6],
[6,6,5,5,5,5,5,5,5,5,5,5,5,5,6,6],
[6,6,6,5,5,5,5,5,5,5,5,5,5,5,6,6],
[6,6,5,5,5,5,5,5,5,5,5,5,5,5,6,6],
[6,6,5,5,5,5,5,5,5,5,5,5,5,5,6,6],
[6,6,6,6,6,5,6,6,6,6,6,5,5,5,6,6],
[6,6,6,6,6,6,6,6,6,6,6,5,5,5,6,6],
[6,6,6,6,6,6,6,6,6,6,6,5,5,5,6,6],
[6,6,6,6,6,6,6,6,6,6,6,5,5,6,6,6],
[6,6,6,6,6,6,6,6,6,6,6,5,5,5,6,6],
[6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6],
[6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6],
]

Output:
[
[6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6],
[6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6],
[6,6,5,5,5,6,5,5,5,5,6,5,5,5,6,6],
[6,6,5,5,5,5,5,5,5,5,5,5,5,5,6,6],
[6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,6],
[6,6,5,5,5,5,5,5,5,5,5,5,5,5,6,6],
[6,6,5,5,5,5,5,5,5,5,5,5,5,5,6,6],
[6,6,6,6,6,5,6,6,6,6,5,5,5,5,6,6],
[6,6,6,6,6,6,6,6,6,6,6,5,5,5,6,6],
[6,6,6,6,6,6,6,6,6,6,6,5,5,5,6,6],
[6,6,6,6,6,6,6,6,6,6,5,5,5,6,6,6],
[6,6,6,
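
One detail of the templates used in this test: `test_template` does not contain `{index}`, and `str.format` silently ignores unused keyword arguments, so a task with several test inputs would get the identical `### Test Input:` header for each one. If distinguishable headers are wanted, a variant along the following lines could be used. The name `indexed_test_template` and the placeholder grid strings are illustrative assumptions.

```python
# Hypothetical variant of the notebook's test template that renders the index.
indexed_test_template = (
    "### Test Input {index}:\n"
    "{input}"
)

# Placeholder grid strings in the same layout format_grid_to_string produces.
grid_strings = ["[\n[1,2],\n]", "[\n[3,4],\n]"]

formatted_tests = "\n\n".join(
    indexed_test_template.format(index=i + 1, input=g)
    for i, g in enumerate(grid_strings)
)
print(formatted_tests)

# For comparison, the notebook's template accepts index= but never renders it.
plain = "### Test Input:\n{input}".format(index=1, input=grid_strings[0])
```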

In [44]:
# ## 5. Define the Main Function to Create the JSONL File

def create_formatted_jsonl_from_folder(input_folder, output_jsonl_path, train_template, test_template):
    """
    Iterates through .json files in the input folder, formats each using the custom functions,
    and writes the result to a JSON Lines file. Each line is a JSON object with keys:
    "train", "test", and "test_answer".
    
    Args:
        input_folder (str or Path): Directory containing the JSON task files.
        output_jsonl_path (str or Path): Output filename for the JSONL file.
        train_template (str): Template for training examples.
        test_template (str): Template for test inputs.
    """
    input_folder = Path(input_folder)
    output_jsonl_path = Path(output_jsonl_path)
    json_files = sorted(input_folder.glob("*.json"))
    num_files = len(json_files)

    if num_files == 0:
        print("Warning: No .json files found. Output file will be empty.")
        output_jsonl_path.touch()
        return

    print("--- Starting Formatted JSONL Creation ---")
    print(f"Reading from: {input_folder.resolve()}")
    print(f"Writing to:   {output_jsonl_path.resolve()}")
    print(f"Found {num_files} .json files to process.")

    lines_written = 0
    files_skipped = 0
    
    with open(output_jsonl_path, 'w', encoding='utf-8') as outfile:
        for json_path in tqdm(json_files, desc="Processing JSON files"):
            result = format_task_file_to_strings(json_path, train_template, test_template)
            if result is not None:
                train_str, test_str, test_answer = result
                json_obj = {"train": train_str, "test": test_str, "test_answer": test_answer}
                outfile.write(json.dumps(json_obj, separators=(',', ':')) + "\n")
                lines_written += 1
            else:
                files_skipped += 1

    print("--- Formatted JSONL Creation Complete ---")
    print(f"Successfully wrote {lines_written} lines to {output_jsonl_path}")
    if files_skipped > 0:
        print(f"Skipped {files_skipped} file(s) due to processing errors.")

print("Main JSONL creation function defined.")

Main JSONL creation function defined.
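
The formatted `train` and `test` strings contain many newline characters, which might look incompatible with a one-record-per-line file. The sketch below, using a made-up record, shows why it works: `json.dumps` escapes each newline as the two characters `\n`, so the serialized record stays on a single physical line, and `json.loads` restores the original text.

```python
import json

# Made-up record with the same three keys the notebook writes.
record = {
    "train": "### Train Example 1:\nInput:\n[\n[3,1],\n]",
    "test": "### Test Input:\n[\n[6,5],\n]",
    "test_answer": [[6, 5]],
}

# Same compact separators as in create_formatted_jsonl_from_folder.
line = json.dumps(record, separators=(",", ":"))
print(line)

# Parsing the line gives back the original multi-line strings.
restored = json.loads(line)
```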


In [45]:
# ## 6. Configure Paths for the Conversion and Run the Main Conversion

# Path to the folder with your .json task files
input_folder_path = "modded_tasks"

# Output JSONL filename (you can add a path like "output/formatted_tasks.jsonl" if needed)
output_jsonl_filename = "formatted_arc_tasks_custom.jsonl"

# Resolve paths
input_folder = Path(input_folder_path)
output_file = Path(output_jsonl_filename)

print(f"Input directory set to: {input_folder.resolve()}")
print(f"Output file set to:    {output_file.resolve()}")

# Run the conversion over all JSON files in the folder
create_formatted_jsonl_from_folder(input_folder, output_file, train_template, test_template)

Input directory set to: C:\Users\Lukhausen\github\Lepus\experimental\lukas\preprocessing\modded_tasks
Output file set to:    C:\Users\Lukhausen\github\Lepus\experimental\lukas\preprocessing\formatted_arc_tasks_custom.jsonl
--- Starting Formatted JSONL Creation ---
Reading from: C:\Users\Lukhausen\github\Lepus\experimental\lukas\preprocessing\modded_tasks
Writing to:   C:\Users\Lukhausen\github\Lepus\experimental\lukas\preprocessing\formatted_arc_tasks_custom.jsonl
Found 30138 .json files to process.


Processing JSON files: 100%|██████████| 30138/30138 [00:47<00:00, 636.38it/s]


--- Formatted JSONL Creation Complete ---
Successfully wrote 30138 lines to formatted_arc_tasks_custom.jsonl


In [46]:
# ## 7. Verification (Optional)
# To verify the conversion, we display the first JSON object in the output file.
with open(output_file, 'r', encoding='utf-8') as f:
    first_line = f.readline()
    try:
        json_obj = json.loads(first_line)
        print("First JSON object in the output file:")
        print(json.dumps(json_obj, indent=2))
    except Exception as e:
        print(f"Error reading the output file: {e}")

First JSON object in the output file:
{
  "train": "### Train Example 1:\nInput:\n[\n[3,1],\n[1,4],\n]\n\nOutput:\n[\n[3,1,3,1,3,1],\n[1,4,1,4,1,4],\n[1,3,1,3,1,3],\n[4,1,4,1,4,1],\n[3,1,3,1,3,1],\n[1,4,1,4,1,4],\n]\n\n### Train Example 2:\nInput:\n[\n[9,0],\n[4,6],\n]\n\nOutput:\n[\n[9,0,9,0,9,0],\n[4,6,4,6,4,6],\n[0,9,0,9,0,9],\n[6,4,6,4,6,4],\n[9,0,9,0,9,0],\n[4,6,4,6,4,6],\n]",
  "test": "### Test Input:\n[\n[6,5],\n[9,3],\n]",
  "test_answer": [
    [
      6,
      5,
      6,
      5,
      6,
      5
    ],
    [
      9,
      3,
      9,
      3,
      9,
      3
    ],
    [
      5,
      6,
      5,
      6,
      5,
      6
    ],
    [
      3,
      9,
      3,
      9,
      3,
      9
    ],
    [
      6,
      5,
      6,
      5,
      6,
      5
    ],
    [
      9,
      3,
      9,
      3,
      9,
      3
    ]
  ]
}
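
Printing the first record confirms the format, but not that every line is well formed. A fuller check could look roughly like the sketch below. `validate_jsonl` and `REQUIRED_KEYS` are hypothetical names that do not appear in the notebook, and the demo runs on a small temporary file rather than the real output.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"train", "test", "test_answer"}


def validate_jsonl(path):
    """Return (valid_count, problems) for a JSONL file (hypothetical helper)."""
    valid_count = 0
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_number, "blank line"))
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(obj, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = REQUIRED_KEYS - obj.keys()
            if missing:
                problems.append((line_number, f"missing keys: {sorted(missing)}"))
                continue
            valid_count += 1
    return valid_count, problems


# Demo on a throwaway file: one complete record and one missing "test_answer".
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.jsonl"
    demo_path.write_text(
        '{"train":"a","test":"b","test_answer":[[1]]}\n'
        '{"train":"a","test":"b"}\n',
        encoding="utf-8",
    )
    valid_count, problems = validate_jsonl(demo_path)

print(valid_count, problems)
```

In the notebook itself, the same function could be called as `validate_jsonl(output_file)` after the conversion cell has run.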


Upload to Hugging Face.

In [47]:
%pip install datasets

from datasets import Dataset

# Load your JSONL file
dataset = Dataset.from_json("formatted_arc_tasks_custom.jsonl")

# Upload to the Hugging Face Hub (requires being logged in with a token that has write access)
dataset.push_to_hub("Lukhausen/arc-agi-lepus-v1")





CommitInfo(commit_url='https://huggingface.co/datasets/Lukhausen/arc-agi-lepus-v1/commit/c223c28fb08beb76f77f01df3a529d6888443326', commit_message='Upload dataset', commit_description='', oid='c223c28fb08beb76f77f01df3a529d6888443326', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/Lukhausen/arc-agi-lepus-v1', endpoint='https://huggingface.co', repo_type='dataset', repo_id='Lukhausen/arc-agi-lepus-v1'), pr_revision=None, pr_num=None)