# Interacting with the ARC Dataset and LLMs

This notebook provides a foundational workflow for exploring the Abstraction and Reasoning Corpus (ARC) dataset using Large Language Models (LLMs).

**Understanding ARC:**

The ARC dataset, created by François Chollet, is designed to test abstract reasoning and intelligence, moving beyond simple pattern recognition. Each ARC task challenges a system to infer an underlying transformation rule from a few examples and then apply that rule to new, unseen inputs.

**Task Structure:**

Every ARC task consists of:
*   **`train` pairs:** A small set (usually 2-5) of input/output grid examples. The goal is to *learn* the transformation rule by observing how the input grids change to become the output grids in these examples.
*   **`test` pairs:** One or more input grids (and their corresponding *unseen* solution output grids). After inferring the rule from the `train` set, the system must apply it to the `test` input grids to generate the correct `test` output grids.

**The Challenge:** The core challenge is the *abstraction* of the rule from the `train` examples and its *generalization* to the `test` inputs.

**Evaluation Rule (ARC Prize):**

For a task to be considered solved:
1.  The system must generate the correct output grid for **every single `test` input grid** within that task.
2.  Each prediction must be an *exact* match to the ground truth solution grid.
3.  If a task has multiple `test` inputs, **all** of them must be solved correctly based on the single rule inferred from the `train` set.
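
The rule above can be sketched as a small all-or-nothing check. This is only an illustration, not the official ARC Prize scorer: `is_task_solved` is a hypothetical helper name, and the sketch assumes exactly one prediction per test input.

```python
from typing import List, Optional

Grid = List[List[int]]

def is_task_solved(predictions: List[Optional[Grid]], solutions: List[Grid]) -> bool:
    """Return True only if every test prediction exactly matches its solution."""
    # A missing or extra prediction means the task cannot count as solved.
    if len(predictions) != len(solutions):
        return False
    # List equality compares nested lists element by element, so this is an
    # exact cell-by-cell match. A None prediction never equals a grid.
    return all(pred == sol for pred, sol in zip(predictions, solutions))

# Invented two-test-case example: one wrong cell fails the whole task.
solutions = [[[0, 1], [0, 0]], [[2, 2]]]
all_right = [[[0, 1], [0, 0]], [[2, 2]]]
one_wrong = [[[0, 1], [0, 0]], [[2, 0]]]
print(is_task_solved(all_right, solutions))
print(is_task_solved(one_wrong, solutions))
```

The first call prints `True` and the second prints `False`, because a single mismatched cell in any test grid fails the task.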

**This Notebook's Goal (Initial Steps):**

This part focuses on the initial setup and data handling:
1.  Setting up the environment (installing and importing libraries).
2.  Performing a basic API call test (assuming credentials are set).
3.  Loading ARC task data from JSON files.
4.  Understanding the structure of the loaded data.
5.  Providing helper functions to easily access specific parts of a task (train pairs, test inputs, test outputs).

## 1. Setup: Libraries and API Test

First, we install and import the necessary libraries. We assume you have Python and pip installed.
*   `python-dotenv`: Loads API keys from a `.env` file if one exists (this notebook does not check that the key itself is present).
*   `litellm`: To interact with LLM APIs.
*   `numpy`: Used later to compare predicted grids against ground-truth grids (not needed for loading or parsing).

We'll also perform a minimal API call to ensure `litellm` is configured correctly and can reach the service. **Note:** This step assumes your API key (e.g., `OPENAI_API_KEY`) is already set as an environment variable or globally configured for `litellm`. If not set, this test call will fail.
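
If you would rather fail early with a clear message than wait for the API call to error out, a check along these lines may help. It is a minimal sketch: `find_missing_keys` is a hypothetical helper, and `OPENAI_API_KEY` is assumed only because it is the example variable named above. Other providers use other variable names.

```python
import os
from typing import List, Mapping, Optional, Sequence

def find_missing_keys(
    required: Sequence[str] = ("OPENAI_API_KEY",),
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required environment variables that are unset or empty."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = find_missing_keys()
if missing:
    print(f"Missing environment variables: {missing}")
else:
    print("All required API key variables are set.")
```

This only confirms that a variable is set. It says nothing about whether the key is valid, so the test call below is still worth running.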

In [159]:
# Install required packages if they aren't already installed
print("Checking/Installing required libraries...")
try:
    import litellm
    import dotenv
    import numpy
    print("Libraries found.")
except ImportError:
    print("Installing python-dotenv, litellm, numpy...")
    %pip install -q python-dotenv litellm numpy
    print("Installation complete. You might need to restart the kernel.")

Checking/Installing required libraries...
Libraries found.


In [160]:
import os
import json
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any, Union

# numpy is not needed for loading; it is used later to compare predicted and ground-truth grids
import numpy as np 

from dotenv import load_dotenv
import litellm
from litellm import completion

# Attempt to load environment variables from .env file (optional)
load_dotenv()

# Reduce LiteLLM's default logging verbosity for a cleaner output
litellm.set_verbose = False

print("Libraries imported.")

Libraries imported.


In [161]:
# --- Minimal API Key Test --- 
# This performs a very small API call to check basic connectivity.
# It ASSUMES your API key is correctly set in your environment.
# If this cell fails, check your API key setup (e.g., OPENAI_API_KEY env var).
print("Performing minimal API test call...")
try:
    response = completion(
        model="gpt-4o", # Or "gpt-3.5-turbo" or another model you have access to
        messages=[{"role": "user", "content": "Respond with just 'OK'."}],
        max_tokens=2,        # Limit response length
        request_timeout=20 # Short timeout
    )
    # We don't strictly need to check the content, just that it didn't crash
    print("API test call successful (received a response).") 
    # print(f"Test Response: {response.choices[0].message.content}") # Optional: view response
except Exception as e:
    print(f"\033[91mAPI test call failed: {e}\033[0m")
    print("Please ensure your API key (e.g., OPENAI_API_KEY) is set correctly as an environment variable or in a .env file,")
    print("and that you have access to the specified model ('gpt-4o' used here).")

Performing minimal API test call...
API test call successful (received a response).


## 2. Data Loading

This function loads ARC tasks from a specified directory containing `.json` files. It assumes the files are correctly formatted ARC tasks. We are skipping complex error checking for this developer-focused setup.
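
Because the loader trusts every file, a malformed task would only surface later as a `KeyError`. If that becomes a problem, a structural check such as the following could be run on each loaded task. It is a sketch under assumptions: `find_task_problems` is a hypothetical helper, and it deliberately does not require `'output'` on test pairs so that it also accepts files where solutions are stored separately.

```python
from typing import Any, List

def find_task_problems(task_data: Any) -> List[str]:
    """Return a list of human-readable structural problems (empty if none found)."""
    if not isinstance(task_data, dict):
        return ["task is not a JSON object"]
    problems: List[str] = []
    for split in ("train", "test"):
        pairs = task_data.get(split)
        if not isinstance(pairs, list) or not pairs:
            problems.append(f"'{split}' is missing or empty")
            continue
        for i, pair in enumerate(pairs):
            if not isinstance(pair, dict) or "input" not in pair:
                problems.append(f"{split}[{i}] has no 'input' grid")
            elif split == "train" and "output" not in pair:
                # Training pairs are only useful with both halves present.
                problems.append(f"train[{i}] has no 'output' grid")
    return problems

good_task = {
    "train": [{"input": [[0]], "output": [[1]]}],
    "test": [{"input": [[0]]}],
}
print(find_task_problems(good_task))
print(find_task_problems({"train": []}))
```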

In [162]:
# --- Type Definitions --- 
# Define types for clarity, even in a simplified script
Grid = List[List[int]] # Represents a 2D grid of integers
TaskPair = Dict[str, Grid] # Represents one {'input': Grid, 'output': Grid} pair
TaskData = Dict[str, List[TaskPair]] # Represents the core content {'train': List[TaskPair], 'test': List[TaskPair]}

def load_arc_tasks_simple(data_dir: str, limit: Optional[int] = None) -> Dict[str, TaskData]:
    """Loads ARC tasks from JSON files into a dictionary (simplified error handling)."""
    arc_path = Path(data_dir)
    if not arc_path.is_dir():
        print(f"Error: ARC data directory not found: {arc_path.resolve()}")
        # In a truly simple script, we might just let it raise the error later,
        # but checking the dir is a minimal useful check.
        return {}

    # Assume directory scanning works
    json_files = sorted(list(arc_path.glob("*.json")))

    if not json_files:
        print(f"Warning: No JSON files found in {arc_path.resolve()}")
        return {}

    if limit is not None:
        json_files = json_files[:limit]
        print(f"Limiting loading to the first {len(json_files)} tasks.")

    loaded_tasks: Dict[str, TaskData] = {}
    print(f"Processing {len(json_files)} task files from {arc_path.resolve()}...")

    for json_file in json_files:
        task_id = json_file.stem
        # Assume file is readable and valid JSON with 'train' and 'test' keys
        with open(json_file, 'r', encoding='utf-8') as f:
            task_data = json.load(f)
        loaded_tasks[task_id] = task_data # Store the raw loaded data

    print(f"\nSuccessfully attempted to load {len(loaded_tasks)} tasks.")
    return loaded_tasks

# --- Load the Data --- 
# <<< IMPORTANT: UPDATE THIS PATH TO YOUR ARC DATA DIRECTORY >>>
ARC_DATA_DIR = "../ARC-AGI/data/evaluation" # Adjust this path!
TASK_LOAD_LIMIT = 20 # Load only a few tasks for faster testing

print(f"\nAttempting to load data from: {Path(ARC_DATA_DIR).resolve()}")
all_task_data: Dict[str, TaskData] = load_arc_tasks_simple(ARC_DATA_DIR, limit=TASK_LOAD_LIMIT)

if not all_task_data:
    print("\n--- CRITICAL: No tasks were loaded. Check ARC_DATA_DIR path. --- ")
else:
    print(f"\nExample Task IDs loaded: {list(all_task_data.keys())[:10]}")


Attempting to load data from: C:\Users\Lukhausen\github\Lepus\experimental\lukas\ARC-AGI\data\evaluation
Limiting loading to the first 20 tasks.
Processing 20 task files from C:\Users\Lukhausen\github\Lepus\experimental\lukas\ARC-AGI\data\evaluation...

Successfully attempted to load 20 tasks.

Example Task IDs loaded: ['0934a4d8', '135a2760', '136b0064', '13e47133', '142ca369', '16b78196', '16de56c4', '1818057f', '195c6913', '1ae2feb7']


## 3. Understanding ARC Task Data Structure

The `all_task_data` variable loaded above is a Python dictionary.

*   **Keys:** Each key is a string representing the `task_id` (e.g., `'0934a4d8'`).
*   **Values:** Each value is another dictionary (`TaskData`) containing the data for that specific task, loaded directly from the corresponding JSON file.

Inside each `TaskData` dictionary, the crucial keys are:

1.  **`'train'`**: This holds a *list* of training examples.
    *   Each element in the `'train'` list is a dictionary (`TaskPair`).
    *   Each `TaskPair` dictionary has two keys:
        *   `'input'`: Contains the input grid (`Grid`), which is a list of lists of integers (e.g., `[[0, 1], [2, 0]]`).
        *   `'output'`: Contains the corresponding output grid (`Grid`) after the transformation rule has been applied.

2.  **`'test'`**: This holds a *list* of test problems.
    *   Each element in the `'test'` list is also a `TaskPair` dictionary.
    *   Each `TaskPair` in the test list contains:
        *   `'input'`: The test input grid (`Grid`) to which the learned rule must be applied.
        *   `'output'`: The *ground truth* solution grid (`Grid`). The LLM does **not** see this during prediction; it's used for evaluation later.
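
To make the nesting concrete, here is a hand-written toy task with the same shape as one loaded `TaskData` value. The grids are invented for illustration and do not come from the ARC dataset; `grid_shape` is a hypothetical helper.

```python
from typing import List, Tuple

Grid = List[List[int]]

# Invented toy task: each grid is mirrored left to right.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 0]], "output": [[0, 3], [0, 0]]},
    ],
}

def grid_shape(grid: Grid) -> Tuple[int, int]:
    """Return (rows, columns) for a rectangular grid."""
    return (len(grid), len(grid[0]) if grid else 0)

# Walk both splits and report the shape of every grid.
for split in ("train", "test"):
    for i, pair in enumerate(toy_task[split]):
        print(split, i, grid_shape(pair["input"]), "->", grid_shape(pair["output"]))
```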

**LLM Interaction Strategy:**

To ask an LLM to solve *one* test case for a given task, we need to provide:
1.  All the `train` input/output pairs for that task.
2.  The specific `test` input grid we want it to solve.

Since a task can have multiple test cases in its `'test'` list, we typically need to make a separate LLM call for *each* test input grid within that task.
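
The one-call-per-test-input idea can be sketched as a small generator. `iter_test_cases` is a hypothetical name and the task data is invented; the point is only that every test case reuses the same training pairs while the solution grid stays out of what is sent to the model.

```python
from typing import Dict, Iterator, List, Tuple

Grid = List[List[int]]

def iter_test_cases(task: Dict[str, list]) -> Iterator[Tuple[int, List[dict], Grid]]:
    """Yield (test_index, train_pairs, test_input) for each test case in one task."""
    train_pairs = task["train"]
    for index, pair in enumerate(task["test"]):
        # Only the input is passed on; pair["output"] is kept for evaluation.
        yield index, train_pairs, pair["input"]

# Invented task with two test cases.
two_test_task = {
    "train": [{"input": [[1, 0]], "output": [[0, 1]]}],
    "test": [
        {"input": [[2, 0]], "output": [[0, 2]]},
        {"input": [[3, 0]], "output": [[0, 3]]},
    ],
}

for index, train_pairs, test_input in iter_test_cases(two_test_task):
    print(f"LLM call {index}: {len(train_pairs)} train pair(s), test input {test_input}")
```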

## 4. Helper Functions for Accessing Task Components

To make it easier to prepare data for the LLM, let's create simple functions to extract the relevant pieces from the loaded `TaskData` for a specific task ID.

In [163]:
def get_train_pairs(task_id: str, all_tasks: Dict[str, TaskData]) -> List[TaskPair]:
    """Returns the list of training pairs for a given task ID."""
    # Assumes task_id exists and has 'train' key with a list
    return all_tasks[task_id]['train']

def get_test_inputs(task_id: str, all_tasks: Dict[str, TaskData]) -> List[Grid]:
    """Returns a list of all test input grids for a given task ID."""
    # Assumes task_id exists and has 'test' key with a list of pairs, each having 'input'
    test_pairs = all_tasks[task_id]['test']
    return [pair['input'] for pair in test_pairs]

def get_test_outputs(task_id: str, all_tasks: Dict[str, TaskData]) -> List[Grid]:
    """Returns a list of all test output (solution) grids for a given task ID."""
    # Assumes task_id exists and has 'test' key with a list of pairs, each having 'output'
    test_pairs = all_tasks[task_id]['test']
    return [pair['output'] for pair in test_pairs]

print("Helper functions defined: get_train_pairs, get_test_inputs, get_test_outputs")

Helper functions defined: get_train_pairs, get_test_inputs, get_test_outputs


### Example Usage of Helper Functions

Let's see how to use these functions to get the data components for the first task we loaded.

In [164]:
if all_task_data:
    example_task_id = list(all_task_data.keys())[0] # Get the first loaded task ID
    print(f"--- Example data extraction for Task ID: {example_task_id} ---")

    # Get training pairs
    train_pairs = get_train_pairs(example_task_id, all_task_data)
    print(f"Number of training pairs: {len(train_pairs)}")
    if train_pairs:
        # Print the structure of the first training pair (input/output keys)
        print(f"Structure of first train pair: {train_pairs[0].keys()}") 
        # print(f"First train input grid: {train_pairs[0]['input']}") # Optional: print grid

    # Get test inputs
    test_inputs = get_test_inputs(example_task_id, all_task_data)
    print(f"\nNumber of test inputs: {len(test_inputs)}")
    if test_inputs:
        print(f"Type of first test input: {type(test_inputs[0])}")
        # print(f"First test input grid: {test_inputs[0]}") # Optional: print grid

    # Get test outputs (solutions)
    test_outputs = get_test_outputs(example_task_id, all_task_data)
    print(f"\nNumber of test outputs (solutions): {len(test_outputs)}")
    if test_outputs:
        print(f"Type of first test output: {type(test_outputs[0])}")
        # print(f"First test output grid: {test_outputs[0]}") # Optional: print grid
else:
    print("\nNo task data loaded, skipping example.")

--- Example data extraction for Task ID: 0934a4d8 ---
Number of training pairs: 4
Structure of first train pair: dict_keys(['input', 'output'])

Number of test inputs: 1
Type of first test input: <class 'list'>

Number of test outputs (solutions): 1
Type of first test output: <class 'list'>


## 5. Define Prompt Strategy Templates

Instead of complex functions, we can define our prompting strategy using simple string templates with placeholders. We'll use `str.format`-style placeholders like `{placeholder_name}`, which are filled in later with `.format(...)`.

We need placeholders for:
*   `{train_examples_string}`: Where we will insert the formatted string of all training input/output pairs.
*   `{test_input_string}`: Where we will insert the formatted string of the single test input grid we want the LLM to solve.

Below are example templates. You can modify these strings to experiment with different instructions.
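
One thing to watch when editing these templates: `str.format` treats every single brace as a placeholder delimiter, so a literal `{` or `}` in the template text must be doubled. The short sketch below uses an invented template to show this; braces inside the substituted value are not reinterpreted.

```python
# Invented template: the doubled braces become literal braces after formatting.
template = (
    "Return JSON shaped like {{\"grid\": [[0, 1]]}}.\n"
    "TEST INPUT GRID:\n"
    "{test_input_string}"
)

filled = template.format(test_input_string="[[0,3],[0,0]]")
print(filled)
```

Forgetting to double a literal brace typically raises a `KeyError` or `ValueError` at `.format(...)` time, which is a useful hint when a modified template suddenly fails.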

In [166]:
# Define simple string templates for the prompts
SYSTEM_PROMPT_TEMPLATE = (
    "You are an ARC puzzle solver. Analyze the train examples (input/output pairs). "
    "Apply the deduced rule to the test input grid. "
    "Output your reasoning and then a JSON array in a Codeblock for the predicted test output grid."
)

USER_PROMPT_TEMPLATE = (
    "**TRAIN EXAMPLES:**\n"
    "{train_examples_string}\n\n"
    "**TEST INPUT GRID:**\n"
    "{test_input_string}\n\n"
    "**PREDICTED OUTPUT GRID:**"
)

print("Prompt templates defined (SYSTEM_PROMPT_TEMPLATE, USER_PROMPT_TEMPLATE).")

Prompt templates defined (SYSTEM_PROMPT_TEMPLATE, USER_PROMPT_TEMPLATE).


## 6. Formatting Helpers and Prompt Preparation

We need helper functions to format the ARC grid data into strings, and then a function to insert these strings into our templates to create the final messages for the LLM API.

In [167]:
# --- Formatting Helpers (Simplified) --- 
# Reuse or redefine the formatting helpers if needed

def format_grid_to_string(grid: Grid) -> str:
    """Converts a grid (list of lists) to a compact JSON string representation."""
    # Assume grid is a list of lists of ints
    return json.dumps(grid, separators=(',', ':'))

def format_train_pairs_to_string(train_pairs: List[TaskPair]) -> str:
    """Formats a list of training pairs into a single string for the prompt."""
    formatted_pairs = []
    for i, pair in enumerate(train_pairs):
        input_str = format_grid_to_string(pair['input'])
        output_str = format_grid_to_string(pair['output'])
        formatted_pairs.append(f"Example {i+1} Input:\n{input_str}\nExample {i+1} Output:\n{output_str}")
    return "\n\n".join(formatted_pairs)

# --- Prompt Preparation Function --- 

def prepare_prompt_messages(
    system_template: str,
    user_template: str,
    train_pairs: List[TaskPair],
    test_input_grid: Grid
) -> List[Dict[str, str]]:
    """Formats data and inserts it into prompt templates to create API messages."""

    # Format the training examples and test input into strings
    train_examples_str = format_train_pairs_to_string(train_pairs)
    test_input_str = format_grid_to_string(test_input_grid)

    # Fill the template with str.format (these are plain strings, not f-strings).
    # The keyword argument names below must match the placeholders in the
    # template (e.g., {train_examples_string}); our local variable names
    # (e.g., train_examples_str) are mapped onto them explicitly.
    final_user_prompt = user_template.format(
        train_examples_string=train_examples_str,
        test_input_string=test_input_str
    )
    
    # System prompt usually doesn't need formatting in this simple case
    final_system_prompt = system_template 

    # Return the structured messages
    messages = [
        {"role": "system", "content": final_system_prompt},
        {"role": "user", "content": final_user_prompt}
    ]
    return messages

print("Formatting and prompt preparation functions defined.")

Formatting and prompt preparation functions defined.


In [168]:

# Sample Training Pairs (List of TaskPair dictionaries)
sample_train_pairs: List[TaskPair] = [
    {
        'input': [[1, 0], [0, 0]],
        'output': [[0, 1], [0, 0]]
    },
    {
        'input': [[0, 0], [2, 0]],
        'output': [[0, 0], [0, 2]]
    }
]

# Sample Test Input Grid (Grid)
sample_test_input_grid: Grid = [[0, 3], [0, 0]]

# --- Call the Preparation Function ---

print("--- Preparing messages using sample data ---")
prepared_messages = prepare_prompt_messages(
    system_template=SYSTEM_PROMPT_TEMPLATE,
    user_template=USER_PROMPT_TEMPLATE,
    train_pairs=sample_train_pairs,
    test_input_grid=sample_test_input_grid
)

# --- Print the Result ---
# Use json.dumps for pretty printing the list of dictionaries
print("\nResulting 'messages' list (ready for LLM API):")
print(json.dumps(prepared_messages, indent=2))

# --- Optionally, print the formatted User prompt content for clarity ---
print("\n--- Content of the 'user' message: ---")
print(prepared_messages[1]['content'])
print("--------------------------------------")

--- Preparing messages using sample data ---

Resulting 'messages' list (ready for LLM API):
[
  {
    "role": "system",
    "content": "You are an ARC puzzle solver. Analyze the train examples (input/output pairs). Apply the deduced rule to the test input grid. Output your reasoning and then a JSON array in a Codeblock for the predicted test output grid."
  },
  {
    "role": "user",
    "content": "**TRAIN EXAMPLES:**\nExample 1 Input:\n[[1,0],[0,0]]\nExample 1 Output:\n[[0,1],[0,0]]\n\nExample 2 Input:\n[[0,0],[2,0]]\nExample 2 Output:\n[[0,0],[0,2]]\n\n**TEST INPUT GRID:**\n[[0,3],[0,0]]\n\n**PREDICTED OUTPUT GRID:**"
  }
]

--- Content of the 'user' message: ---
**TRAIN EXAMPLES:**
Example 1 Input:
[[1,0],[0,0]]
Example 1 Output:
[[0,1],[0,0]]

Example 2 Input:
[[0,0],[2,0]]
Example 2 Output:
[[0,0],[0,2]]

**TEST INPUT GRID:**
[[0,3],[0,0]]

**PREDICTED OUTPUT GRID:**
--------------------------------------


## 7. Get LLM Response (Simple)

This function takes the prepared messages and sends them to the specified LLM using `litellm`. It returns the raw text response content, with minimal error handling.

In [169]:
def get_llm_response(messages: List[Dict[str, str]], model_name: str) -> Optional[str]:
    """Sends messages to the LLM and returns the raw response content."""
    print(f"Sending request to model: {model_name}...")
    try:
        response = completion(
            model=model_name,
            messages=messages,
            temperature=0.0, # Set low for deterministic output if possible
            max_tokens=2048, # Adjust as needed, ARC grids can be large
            request_timeout=120 # Timeout in seconds
        )
        # Extract the content directly - assumes success and standard response structure
        response_content = response.choices[0].message.content
        print("Response received.")
        return response_content
    except Exception as e:
        print(f"\033[91mLLM API call failed: {e}\033[0m")
        # In this simplified version, we just return None on failure
        return None

print("LLM interaction function 'get_llm_response' defined.")

LLM interaction function 'get_llm_response' defined.


Okay, now let's add a parser that can extract the JSON grid from the LLM response.

In [171]:
import re # Import the regular expression module

def parse_llm_response_for_grid(response_content: Optional[str]) -> Optional[Grid]:
    """
    Attempts to find and parse a JSON grid array from the LLM response,
    prioritizing content within ```json ... ``` or ``` ... ``` code blocks.

    Returns:
        Optional[Grid]: The parsed grid if found and valid, otherwise None.
    """
    if not response_content or not isinstance(response_content, str):
        print("Parser received empty or non-string content.")
        return None

    response_text = response_content.strip()
    
    # 1. Prioritize finding ```json [...] ``` code blocks
    # Regex explanation:
    # ```json       - Matches the opening ```json tag
    # \s*          - Matches any whitespace (including newlines)
    # (\[         - Start capturing group 1, matches the opening square bracket
    #   .*?        - Matches any character (including newlines) non-greedily
    #  \])        - Matches the closing square bracket, end capturing group 1
    # \s*          - Matches any whitespace
    # ```          - Matches the closing ``` tag
    json_block_match = re.search(r"```json\s*(\[.*?\])\s*```", response_text, re.DOTALL)
    
    potential_json_str = None
    if json_block_match:
        potential_json_str = json_block_match.group(1).strip()
        # print("Found JSON content inside ```json block.")
    else:
        # 2. If no ```json block, look for a plain ``` [...] ``` block
        code_block_match = re.search(r"```\s*(\[.*?\])\s*```", response_text, re.DOTALL)
        if code_block_match:
            potential_json_str = code_block_match.group(1).strip()
            # print("Found JSON content inside plain ``` block.")
        else:
            # 3. Fallback: Look for the first [[ ... ]] span in the entire text.
            # A plain non-greedy \[.*?\] would stop at the first inner "]" and
            # cut a nested grid short, so we match from "[[" to the first "]]".
            first_bracket_match = re.search(r"(\[\s*\[.*?\]\s*\])", response_text, re.DOTALL)
            if first_bracket_match:
                potential_json_str = first_bracket_match.group(1).strip()
                # print("Found JSON content directly in text (fallback).")

    if not potential_json_str:
        # print("Could not find any potential JSON grid content.")
        return None

    # 4. Attempt to parse the extracted string
    try:
        parsed_grid = json.loads(potential_json_str)
    except json.JSONDecodeError as e:
        # print(f"JSON decoding failed for extracted string: {e}")
        # print(f"String was: '{potential_json_str}'")
        return None

    # 5. Basic Validation (Is it a list of lists of ints? Is it rectangular?)
    if not isinstance(parsed_grid, list):
        # print("Parsed content is not a list.")
        return None
    
    row_len = -1
    for i, row in enumerate(parsed_grid):
        if not isinstance(row, list):
            # print(f"Row {i} is not a list.")
            return None
        if i == 0:
            row_len = len(row)
        elif len(row) != row_len:
             # print(f"Grid rows have inconsistent lengths (row 0: {row_len}, row {i}: {len(row)}).")
             return None # Ensure rectangular grid

        for j, cell in enumerate(row):
            if not isinstance(cell, int):
                # Try conversion for string digits, otherwise fail
                if isinstance(cell, str) and cell.isdigit():
                    try:
                        parsed_grid[i][j] = int(cell) # Modify in place
                    except ValueError:
                        # print(f"Cell ({i},{j}) is non-integer ('{cell}') and couldn't be converted.")
                        return None
                else:
                     # print(f"Cell ({i},{j}) is not an integer ('{cell}').")
                     return None
            # Optional: Check 0-9 range
            # if not (0 <= parsed_grid[i][j] <= 9): return None 
            
    # print("Successfully parsed and validated grid.")
    return parsed_grid

print("LLM response parsing function 'parse_llm_response_for_grid' defined.")

LLM response parsing function 'parse_llm_response_for_grid' defined.
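
For responses without a code block, one possible alternative to a regex fallback is to let the JSON decoder find where a value ends. The sketch below is not part of the parser above: `find_first_json_grid` is a hypothetical helper that tries `json.JSONDecoder.raw_decode` at each `[` and returns the first value that is a non-empty list of lists.

```python
import json
from typing import Any, Optional

def find_first_json_grid(text: str) -> Optional[Any]:
    """Return the first JSON list-of-lists embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        # Accept only a non-empty list whose items are all lists (2D structure).
        if isinstance(value, list) and value and all(isinstance(row, list) for row in value):
            return value
        start = text.find("[", start + 1)
    return None

reply = "The rule [see above] gives [[1,0],[0,0]] as the answer."
print(find_first_json_grid(reply))
```

This handles arbitrary nesting because the decoder, not a pattern, decides where the array closes. The result would still need the same rectangular and integer validation as in the parser above.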


## 8. Example: Run Strategy and Compare Output

Let's select a task and a test case, prepare the prompt using our templates, get the LLM's raw response, parse the predicted grid out of it, and compare that grid against the ground truth solution.

In [175]:
# --- Configuration for Simple Example --- 
EXAMPLE_TASK_ID = '0934a4d8' # Choose a task ID that was loaded
TEST_CASE_INDEX = 0          # Which test case within the task (usually 0)
MODEL_NAME = 'gpt-4o'     # The LLM model to use
# --- End Configuration ---

# Check if data is loaded
if not all_task_data or EXAMPLE_TASK_ID not in all_task_data:
    print(f"\033[91mError: Task data not loaded or Task ID '{EXAMPLE_TASK_ID}' not found.\033[0m")
else:
    print(f"--- Running Simple Example: Task '{EXAMPLE_TASK_ID}', Test Case {TEST_CASE_INDEX} ---")

    # 1. Get Task Data Components
    train_pairs = get_train_pairs(EXAMPLE_TASK_ID, all_task_data)
    test_inputs = get_test_inputs(EXAMPLE_TASK_ID, all_task_data)
    test_outputs_truth = get_test_outputs(EXAMPLE_TASK_ID, all_task_data)

    # Check if the test case index is valid
    if TEST_CASE_INDEX >= len(test_inputs):
        print(f"\033[91mError: Test case index {TEST_CASE_INDEX} is out of bounds for task '{EXAMPLE_TASK_ID}' (has {len(test_inputs)} test cases).\033[0m")
    else:
        target_test_input = test_inputs[TEST_CASE_INDEX]
        ground_truth_output = test_outputs_truth[TEST_CASE_INDEX]

        # 2. Prepare Prompt Messages using Templates
        messages = prepare_prompt_messages(
            system_template=SYSTEM_PROMPT_TEMPLATE,
            user_template=USER_PROMPT_TEMPLATE,
            train_pairs=train_pairs,
            test_input_grid=target_test_input
        )
        
        # 3. Get LLM Response
        raw_llm_response = get_llm_response(messages, MODEL_NAME)

        # 4. Parse the LLM Response
        print("\n--- Parsing LLM Response ---")
        parsed_predicted_grid = parse_llm_response_for_grid(raw_llm_response)

        # 5. Print Raw Response, Parsed Prediction, and Ground Truth
        print("\n--- Comparison --- ")
        print(f"\nRaw LLM Response (first 500 chars):\n```\n{str(raw_llm_response)[:500]}...\n```" if raw_llm_response else "Raw LLM Response: None")
        
        if parsed_predicted_grid is not None:
            parsed_grid_string = format_grid_to_string(parsed_predicted_grid)
            print(f"\nParsed Predicted Grid (JSON):\n```json\n{parsed_grid_string}\n```")
        else:
            print("\nParsed Predicted Grid: \033[91mParsing Failed (None)\033[0m")

        # Format ground truth for easy comparison
        ground_truth_string = format_grid_to_string(ground_truth_output)
        print(f"\nGround Truth Output (JSON):\n```json\n{ground_truth_string}\n```")
        
        # Compare parsed grid with ground truth using numpy for robustness
        if parsed_predicted_grid is not None:
            # Assume numpy is imported as np earlier
            is_correct = np.array_equal(np.array(parsed_predicted_grid), np.array(ground_truth_output))

            if is_correct:
                 print("\n\033[92mParsed Grid Match: Parsed prediction matches ground truth!\033[0m")
            else:
                 print("\n\033[91mParsed Grid Match: Parsed prediction differs from ground truth.\033[0m")
        else:
            print("\nCannot compare parsed grid as parsing failed.")

--- Running Simple Example: Task '0934a4d8', Test Case 0 ---
Sending request to model: gpt-4o...
Response received.

--- Parsing LLM Response ---

--- Comparison --- 

Raw LLM Response (first 500 chars):
```
To solve this problem, we need to identify the pattern or transformation applied to the input grids to produce the output grids in the training examples. Let's analyze the given examples:

1. **Example 1:**
   - Input: A 30x30 grid.
   - Output: A 9x4 grid.
   - Observation: The output grid seems to be a subgrid extracted from the input grid. Specifically, it appears to be a 9x4 section from the bottom-right corner of the input grid.

2. **Example 2:**
   - Input: A 30x30 grid.
   - Output: A 4x...
```

Parsed Predicted Grid (JSON):
```json
[[9,7,7,5],[7,5,9,7],[5,1,6,1],[5,7,5,9],[1,6,1,5],[7,5,9,7],[5,1,6,1],[5,7,5,9],[1,6,1,5]]
```

Ground Truth Output (JSON):
```json
[[7,7,9],[7,2,9],[7,2,9],[7,7,9],[4,4,7],[4,4,7],[6,6,1],[6,6,6],[1,6,1]]
```

Parsed Grid Match: Parsed pr