# Convert ARC JSON Tasks to Formatted Text in JSON Lines (JSONL) - Refactored

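This notebook converts a folder of standard ARC task `.json` files into a single `.jsonl` file. Compared with the previous version, the per-file formatting logic is factored into its own function so that it can be tested on one task in isolation.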

**Approach:**

1.  A function `format_single_task_file` is defined to handle the formatting of **one** task file.
2.  The main function `create_formatted_jsonl_from_folder` iterates through the input folder, calling the single-file function for each task.
3.  A test cell demonstrates the output of `format_single_task_file` for a specific example task.

**Formatting Applied (Identical to previous version):**

*   **Grid Conversion:** Grids become multi-line strings.
*   **Structured Text:** Output includes `### Train Example N:` and `### Test Example N:` headers, each followed by `Input:` and `Output:` labels.
*   **JSONL Output:** Each line is `{"text": "formatted_string_for_task"}`.
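
To make the three rules above concrete, here is a minimal sketch on a made-up 2x2 grid (the grid values are an assumption for illustration, not taken from any task file). It shows that `json.dumps` escapes the newlines, so the multi-line text still occupies exactly one line of the `.jsonl` file.

```python
import json

# Hypothetical 2x2 grid, for illustration only.
grid = [[1, 0], [0, 1]]

# Grid conversion: each row becomes a string of digits, rows are joined by newlines.
grid_text = "\n".join("".join(str(cell) for cell in row) for row in grid)

# Structured text: a header line followed by the labelled grid.
task_text = f"### Train Example 1:\nInput:\n{grid_text}"

# JSONL output: the newlines are escaped, so the record stays on a single line.
record = json.dumps({"text": task_text})
print(record)
# {"text": "### Train Example 1:\nInput:\n10\n01"}
```

Loading the line back with `json.loads` restores the original multi-line string.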

## 1. Setup: Import Libraries

In [2]:
import json
from pathlib import Path
from tqdm import tqdm # Optional: for progress visualization
import traceback # For better error reporting if a single file fails

print("Libraries imported.")

Libraries imported.


## 2. Define Helper Function: Format Grid to String

This function is unchanged from the previous version: it converts a grid (a list of lists) into a newline-separated string.

In [3]:
def format_grid_to_string(grid):
    """Converts a list of lists (grid) into a newline-separated string.
    
    Args:
        grid (list[list[int]]): The input grid.
        
    Returns:
        str: A string representation of the grid, or an empty string 
             if the grid is invalid or empty.
    """
    if not grid or not isinstance(grid, list) or not isinstance(grid[0], list):
        return "" 
    return '\n'.join([''.join(map(str, row)) for row in grid])
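
A short usage sketch may help show how the helper behaves on valid and invalid input. The function body is repeated here only so that the snippet runs on its own; the sample grids are made up.

```python
def format_grid_to_string(grid):
    # Same logic as the helper above, repeated so this snippet is self-contained.
    if not grid or not isinstance(grid, list) or not isinstance(grid[0], list):
        return ""
    return "\n".join("".join(map(str, row)) for row in grid)

print(format_grid_to_string([[1, 1, 4], [4, 4, 1]]))  # two rows: 114 and 441
print(repr(format_grid_to_string([])))                # '' for an empty grid
print(repr(format_grid_to_string(None)))              # '' for a missing grid
print(repr(format_grid_to_string([1, 2, 3])))         # '' because the rows are not lists
```

Note that invalid input is mapped silently to an empty string, so a malformed grid shows up in the output as an empty `Input:` or `Output:` section rather than as an error.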

## 3. Define Function to Format a *Single* Task File

This function takes the path to one JSON file, reads it, applies the formatting rules, and returns the complete formatted text string for that task. It returns `None` if there's an error reading or processing the file.

In [4]:
def format_single_task_file(json_path):
    """Reads one ARC task JSON file and returns its formatted text representation.
    
    Args:
        json_path (str or Path): The path to the input .json file.
        
    Returns:
        str or None: The fully formatted text string for the task, or None if 
                     an error occurs during file reading or processing.
    """
    try:
        json_path = Path(json_path)
        with open(json_path, 'r', encoding='utf-8') as infile:
            task_data = json.load(infile)
        
        formatted_text_parts = []
        
        # Process Training examples
        train_examples = task_data.get('train', [])
        for i, example in enumerate(train_examples):
            input_grid_str = format_grid_to_string(example.get('input'))
            output_grid_str = format_grid_to_string(example.get('output'))
            
            formatted_text_parts.append(f"### Train Example {i+1}:")
            formatted_text_parts.append(f"Input:\n{input_grid_str}")
            formatted_text_parts.append(f"\nOutput:\n{output_grid_str}")
            if i < len(train_examples) - 1 or task_data.get('test'):
                formatted_text_parts.append("\n\n")
                   
        # Process Test examples
        test_examples = task_data.get('test', [])
        for i, example in enumerate(test_examples):
            input_grid_str = format_grid_to_string(example.get('input'))
            
            formatted_text_parts.append(f"### Test Example {i+1}:")
            formatted_text_parts.append(f"Input:\n{input_grid_str}")
            
            if 'output' in example:
                output_grid_str = format_grid_to_string(example.get('output'))
                formatted_text_parts.append(f"\nOutput:\n{output_grid_str}")
            else:
                formatted_text_parts.append("\nOutput:")

            if i < len(test_examples) - 1:
                formatted_text_parts.append("\n\n")

        # Join all parts
        final_task_string = "\n".join(formatted_text_parts)
        return final_task_string
        
    except Exception as e:
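        # json_path may not be a Path if the conversion itself failed, so fall back to the raw value
        print(f"\nError processing file {getattr(json_path, 'name', json_path)}: {e}")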
        # Optional: Print full traceback for debugging
        # traceback.print_exc()
        return None # Indicate failure for this file
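
The spacing of the final string follows from how the parts are joined: every part is separated by one `"\n"`, and the extra `"\n\n"` part between examples adds to that. The sketch below reproduces the join on hand-written parts (the grid contents are made up) to show the resulting layout.

```python
# Parts in the same order the function appends them, with made-up grid content.
parts = [
    "### Train Example 1:",
    "Input:\n10",
    "\nOutput:\n01",
    "\n\n",                  # separator part appended between examples
    "### Test Example 1:",
    "Input:\n11",
    "\nOutput:",             # empty slot used when the test example has no output
]

text = "\n".join(parts)
print(text)

# One blank line separates each Input grid from its Output label,
# and three blank lines separate consecutive examples.
```

This matches the three blank lines visible between examples in the test output of the next section.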

## 4. Test the Single-File Formatting

Let's test the `format_single_task_file` function on one specific task file from our input folder to see the output format directly.

In [6]:
# --- Configuration for the test ---
test_input_folder = "modded_tasks" # Make sure this points to your JSON files
example_task_filename = "8dab14c2_test1.json" # An example original task filename
# --- End of configuration ---

example_task_path = Path(test_input_folder) / example_task_filename

print(f"--- Testing formatting for single file: {example_task_path.name} ---")

if example_task_path.is_file():
    # Call the function to get the formatted string
    formatted_string = format_single_task_file(example_task_path)
    
    if formatted_string is not None:
        print("Formatted Text Output:\n")
        print(formatted_string)
    else:
        print("Function returned None, indicating an error during processing.")
else:
    print(f"Error: Example file not found at {example_task_path}")
    print(f"Please ensure '{test_input_folder}' contains '{example_task_filename}'.")

print("--- End of Single File Test ---")

--- Testing formatting for single file: 8dab14c2_test1.json ---
Formatted Text Output:

### Train Example 1:
Input:
1111111111111111
1111111111111111
1144444444144411
1144444444444411
1114444444444411
1144444444444411
1144444444444411
1111141111144411
1111111111144411
1111111111144411
1111111111144111
1111111111144411
1111111111111111
1111111111111111

Output:
1111111111111111
1111111111111111
1144414444144411
1144444444444411
1114444444444441
1144444444444411
1144444444444411
1111141111444411
1111111111144411
1111111111144411
1111111111444111
1111111111144411
1111111111111111
1111111111111111



### Train Example 2:
Input:
111111111111111
111111111114111
111111111444411
111111111444411
111111111444411
111111111444411
111111111444411
114444444444411
114444444444411
111444444444411
114444444444411
114444444444411
111111111111111
111111111111111

Output:
111111111111111
111111111114111
111111111444411
111111111444411
111111111444411
111111111444411
111111111444411
114444444444411
1144444

## 5. Define the Main Function to Create JSONL

This function loops through the input folder, calls `format_single_task_file` for each JSON file, wraps the result in `{"text": ...}`, and writes it to the output JSONL file.

In [None]:
def create_formatted_jsonl_from_folder(input_folder, output_jsonl_path):
    """
    Iterates through .json files in input_folder, formats each using 
    format_single_task_file, and writes the results to a JSON Lines file.
    Each line is a JSON object: {"text": "formatted_task_string"}.

    Args:
        input_folder (str or Path): Path to the directory containing .json files.
        output_jsonl_path (str or Path): Path to the output .jsonl file.
    """
    input_folder = Path(input_folder)
    output_jsonl_path = Path(output_jsonl_path)

    print("--- Starting Formatted JSONL Creation (Main Process) ---")
    print(f"Reading from: {input_folder}")
    print(f"Writing to:   {output_jsonl_path}")

    json_files = sorted(input_folder.glob("*.json"))
    num_files = len(json_files)

    if num_files == 0:
        print("Warning: No .json files found. Output file will be empty.")
        output_jsonl_path.touch()
        return

    print(f"Found {num_files} .json files to process.")

    lines_written = 0
    files_skipped = 0
    # Open the output file
    with open(output_jsonl_path, 'w', encoding='utf-8') as outfile:
        for json_path in tqdm(json_files, desc="Creating Formatted JSONL"):
            # Get the formatted text for the current file
            formatted_text = format_single_task_file(json_path)
            
            # Only write if formatting was successful
            if formatted_text is not None:
                # Create the output dictionary structure { "text": "..." }
                output_line_dict = {"text": formatted_text}
                
                # Convert the dictionary to a JSON string and write to the file
                json_string = json.dumps(output_line_dict, separators=(',', ':'))
                outfile.write(json_string + '\n')
                lines_written += 1
            else:
                # Keep track of files that failed formatting
                files_skipped += 1

    print("--- Formatted JSONL Creation Complete ---")
    print(f"Successfully wrote {lines_written} lines to {output_jsonl_path}")
    if files_skipped > 0:
        print(f"Skipped {files_skipped} files due to processing errors (see details above).")

## 6. Configure Paths for Main Conversion

**Important:** Modify `input_folder_path` if needed.

In [None]:
# --- PLEASE EDIT THESE VARIABLES IF NEEDED --- 

# Path to the folder containing ALL your .json task files to be converted
input_folder_path = "modded_tasks"  # <<< CHANGE THIS IF NEEDED

# Desired name for the final output .jsonl file
output_jsonl_filename = "formatted_arc_tasks_refactored.jsonl"

# --- End of Editable Section --- 

# Resolve and display the paths (this only prints them; it does not check that they exist)
input_folder = Path(input_folder_path)
output_file = Path(output_jsonl_filename)

print(f"Input directory set to: {input_folder.resolve()}")
print(f"Output file set to:    {output_file.resolve()}")
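
The cell above only prints the resolved paths. If an explicit check is wanted before running the conversion, something along these lines could be used. This is a sketch: `check_paths` is a hypothetical helper that is not part of the notebook, and the checks it performs are an assumption about what is worth validating.

```python
from pathlib import Path

def check_paths(input_folder, output_file):
    """Hypothetical helper: return a list of problems; an empty list means the paths look usable."""
    input_folder, output_file = Path(input_folder), Path(output_file)
    problems = []
    if not input_folder.is_dir():
        problems.append(f"Input folder not found: {input_folder}")
    elif not any(input_folder.glob("*.json")):
        problems.append(f"No .json files in: {input_folder}")
    if not output_file.parent.is_dir():
        problems.append(f"Output directory not found: {output_file.parent}")
    return problems

# Report any problems for the paths configured in this notebook.
for problem in check_paths("modded_tasks", "formatted_arc_tasks_refactored.jsonl"):
    print(f"Warning: {problem}")
```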

## 7. Run the Main Conversion

Execute the main function to process the entire folder.

In [None]:
create_formatted_jsonl_from_folder(input_folder, output_file)

## 8. Verification (Optional)

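Check the first line of the created file. It should be a single JSON object of the form `{"text":"...formatted string..."}`, with no spaces after the separators because the main function calls `json.dumps` with `separators=(',', ':')`, and with the newlines inside the text escaped as `\n`.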

In [None]:
# Display the first line of the output file
!head -n 1 "{output_file}" 

# Optional: To see the full formatted text from the first line more clearly:
# try:
#     with open(output_file, 'r', encoding='utf-8') as f:
#        first_line_json = json.loads(f.readline())
#        print("\n--- Formatted text from first task in JSONL: ---")
#        print(first_line_json['text'])
# except Exception as e:
#     print(f"Could not read or parse the first line: {e}")
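
Beyond looking at the first line, the whole file can be checked for well-formed records. The following is a minimal sketch, assuming every line should parse as a JSON object with a string `text` field; `count_valid_records` is a hypothetical helper, and the demo file is a throwaway stand-in for the real output.

```python
import json
import tempfile
from pathlib import Path

def count_valid_records(jsonl_path):
    """Hypothetical helper: count lines shaped like {"text": <str>}; raise on the first bad line."""
    count = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"Line {line_number} is not a record with a string 'text' field")
            count += 1
    return count

# Demo on a temporary two-line file (made-up content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text('{"text":"Input:\\n10"}\n{"text":"Input:\\n01"}\n', encoding="utf-8")
    print(count_valid_records(demo_path))  # 2
```

Run against the real output file, the returned count should equal the number of lines the main function reported as written.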

--- Notebook Complete ---