# Using Frontier Models on ARC-AGI via LangChain

In this notebook we'll walk through a basic "hello world" example of using an LLM on [ARC-AGI tasks](https://arcprize.org/). This notebook is for demonstration purposes only. We encourage you to evolve it with your own ideas!

While notebooks like these are not eligible for [ARC Prize 2024](https://arcprize.org/competition), they can be used to make a submission to [ARC-AGI-Pub](https://arcprize.org/arc-agi-pub), a separate leaderboard which allows API calls. For more information on ARC-AGI-Pub visit the [leaderboard](https://arcprize.org/leaderboard).

Feel free to reach out to the ARC Prize team on [X](https://twitter.com/arcprize), [YouTube](https://www.youtube.com/channel/UC_rdrp-QkrZn-ce9uCE-0EA), [Discord](https://discord.gg/9b77dPAmcA) or team@arcprize.org for questions!

Let's get started!

**Goal**: Create a `submission.json` file with our predictions for each challenge. Even though we are not submitting to Kaggle, we will use this file to score ourselves against the public task sets.

We are starting with a new notebook instance, so we'll first have to install our packages. We'll use [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction), an open source library for LLM orchestration, for its ability to help with swapping models, prompt templates and output parsing.

_Run each cell one by one by clicking the 'play' button in the top left of each cell or shift-enter with a cell highlighted_

In [1]:
!pip install -qU langchain 2>/dev/null
!pip install -qU langchain-openai 2>/dev/null
!pip install -qU langchain-anthropic 2>/dev/null

Next we'll import our packages. Notice the `from kaggle_secrets import UserSecretsClient` import. This is how Kaggle manages API keys and secrets. See this [article](https://www.kaggle.com/discussions/product-feedback/114053) for more information.

In [12]:
from kaggle_secrets import UserSecretsClient # Used to manage secrets. Similar to .env

import langchain # Main LangChain import
from langchain_openai import ChatOpenAI # To work with OpenAI
from langchain_anthropic import ChatAnthropic # To work with Anthropic (optional)
from langchain_core.output_parsers import JsonOutputParser # To help with structured output
from langchain_core.prompts import PromptTemplate # To help create our prompt
from langchain_core.pydantic_v1 import BaseModel, Field # To help with defining what output structure we want

from typing import List, Tuple
import json

## ARC-AGI Data

Next let's take a look at the files in our environment. This will help us navigate them later.

In [3]:
import os
print ("Files included")
for dirname, _, filenames in os.walk('./data/inputs'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Files included
./data/inputs/arc-agi_evaluation_solutions.json
./data/inputs/.DS_Store
./data/inputs/arc-agi_test_challenges.json
./data/inputs/arc-agi_training_solutions.json
./data/inputs/sample_submission.json
./data/inputs/arc-agi_training_challenges.json
./data/inputs/arc-agi_evaluation_challenges.json


As you can see we have a few different `challenge` and `solution` files. The files held in `kaggle/input/arc-prize-2024` are hosted by the official competition. For more information on these files check out the [competition data overview](https://www.kaggle.com/competitions/arc-prize-2024/data) or [ARC Prize Guide](https://arcprize.org/guide#data-structure).

Then let's set up a quick dictionary that will allow us to swap sets of `challenges` and `solutions` quickly.
* `training` : This is a set of 400 public training tasks from the official [ARC-AGI repo](https://github.com/fchollet/ARC-AGI)
* `evaluation` : This is a set of 400 evaluation tasks from the official [ARC-AGI repo](https://github.com/fchollet/ARC-AGI)

* `_challenges` : Contains a series of `train` input and output pairs along with a `test` input
* `_solutions` : Contains the output to the `test` input. This is what your model will try to predict.

You are not limited to testing with these task sets (though they are very convenient to use!). You could also make your own tasks that follow the same format.
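As a rough illustration of "the same format", here is a small sketch of a hand-made task. The task id `my_task_01`, the grids, and the `check_task_format` helper are all made up for this example; only the `train` / `test` / `input` / `output` layout comes from the files described above.

```python
import json

# A hypothetical hand-made task: the rule is "flip the grid left to right".
custom_challenges = {
    "my_task_01": {
        "train": [
            {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
            {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
        ],
        "test": [{"input": [[0, 5], [6, 0]]}],
    }
}

# Solutions hold one output grid per test input, in the same order.
custom_solutions = {"my_task_01": [[[5, 0], [0, 6]]]}


def check_task_format(challenges: dict, solutions: dict) -> None:
    """Raise AssertionError if the dicts do not follow the layout described above."""
    for task_id, task in challenges.items():
        assert set(task) == {"train", "test"}, f"{task_id}: unexpected keys"
        for pair in task["train"]:
            assert "input" in pair and "output" in pair
        for pair in task["test"]:
            assert "input" in pair
        assert len(solutions[task_id]) == len(task["test"])


check_task_format(custom_challenges, custom_solutions)

# Round-trip through JSON to confirm the structure serializes cleanly.
assert json.loads(json.dumps(custom_challenges)) == custom_challenges
```

If you save these two dicts with `json.dump`, you should be able to point a new entry in your task set dictionary at the resulting files.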

If you ever want to see one of these public tasks, head over to [ARCprize.org/play](https://arcprize.org/play?task=00576224) along with your `task_id` (ex: https://arcprize.org/play?task=00576224).

In [9]:
task_sets = {
    'training' : {
        'challenges' : 'data/inputs/arc-agi_training_challenges.json',
        'solutions' : 'data/inputs/arc-agi_training_solutions.json',
    },
    'evaluation' : {
        'challenges' : 'data/inputs/arc-agi_evaluation_challenges.json',
        'solutions' : 'data/inputs/arc-agi_evaluation_solutions.json',
    }
}

Then let's create a function that will load up our challenges and solutions according to the task set we choose

In [10]:
def load_tasks_from_file(task_set):
    """
    Loads the tasks from the file and returns the challenges and solutions tasks
    """
    with open(task_set['challenges'], "r") as tasks:
        challenges = json.load(tasks)

    with open(task_set['solutions'], "r") as tasks:
        solutions = json.load(tasks)

    return challenges, solutions

Let's look at a quick example of a task challenge

In [22]:
challenges, solutions = load_tasks_from_file(task_set=task_sets['training'])
challenges['0520fde7']
solutions['0520fde7']

[[[2, 0, 2], [0, 0, 0], [0, 0, 0]]]

The cell displays only its last expression, so the output above is the solution: a single 3x3 output grid for the task's one `test` input. The challenge itself (`challenges['0520fde7']`) holds the `train` input-output pairs (there are 3 pairs) and the `test` input. You can view this task at https://arcprize.org/play?task=0520fde7

Some tasks have more than one test input. Your model must make a prediction for each of them. Your score for the task will be the sum of your scores on each input (0 or 1) divided by the number of inputs. For example, on a task with 3 test inputs, getting 1 of 3 correct gives a score of 33% for the task.

Multi-input tasks are the exception: most tasks (~96%) have only one test input.
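The per-task scoring rule described above can be sketched in a few lines. The helper name `task_score` is made up for this illustration and is not part of the official scoring code.

```python
def task_score(correct_per_input: list) -> float:
    """Fraction of test inputs solved; each input contributes 0 or 1."""
    return sum(1 for ok in correct_per_input if ok) / len(correct_per_input)


# A task with 3 test inputs where only the first prediction was right.
print(f"{task_score([True, False, False]):.0%}")

# The common case: a single test input, solved.
print(f"{task_score([True]):.0%}")
```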

## LLM Setup

We'll use [LangChain](https://www.langchain.com/) for our LLM orchestration. This will allow us to use their out of the box output parsing, model selection and prompt templates. You can also use [LangSmith](https://www.langchain.com/langsmith) (free tier) to observe your output but that is outside the scope of this template.

First let's get our model ready. I'll be using `gpt-4o` to start, but you can swap in whatever model you'd like! See more models [here](https://python.langchain.com/v0.2/docs/integrations/chat/). The cell below has both an OpenAI and an Anthropic line; leave uncommented the one you want to use. We set `max_tokens=3000` because the default token limit may not capture the full output of a prediction.

Make sure that your API key secret name below matches what you put in your Kaggle secrets.

In [7]:
# llm = ChatOpenAI(model='gpt-4o', openai_api_key=UserSecretsClient().get_secret('OPENAI_API_KEY'), max_tokens=3000)

# And in case you want to try Anthropic
llm = ChatAnthropic(model='claude-3-5-sonnet-20240620', api_key=UserSecretsClient().get_secret("ANTHROPIC_API_KEY"), max_tokens=3000)

LLMs cannot ingest a JSON object directly, so the first order of business is to convert the `train` pairs and `test` input into a string (which LLMs love).

This is a highly creative process and the below is only a starting point. There is much more prompt engineering you can do to the below that will likely improve your scores.

Do not take the format below as the way you *should* do it, but rather as one example. Have fun and share what works for you!
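To show how much room there is to experiment, here is one alternative grid encoding as a sketch. It assumes every cell value is a single digit (0-9), and the helper name `grid_to_compact_string` is made up for this example. Whether a compact encoding like this helps or hurts a given model is something you would need to test.

```python
def grid_to_compact_string(grid: list) -> str:
    """Render a grid as one line per row, with cell values concatenated."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)


# The first train input of task 0520fde7.
example_grid = [
    [1, 0, 0, 5, 0, 1, 0],
    [0, 1, 0, 5, 1, 1, 1],
    [1, 0, 0, 5, 0, 0, 0],
]
print(grid_to_compact_string(example_grid))
```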

The output of the `json_task_to_string` (below) will be used in the prompt we give to the LLM.

In [21]:
def json_task_to_string(challenge_tasks: dict, task_id: str, test_input_index: int) -> str:
    """
    challenge_tasks: dict a list of tasks
    task_id: str the id of the task we want to convert to a string
    test_input_index: int the index of the test input to include in the string
    
    Convert your json task into a string so you can pass it to your LLM.
    This is a crucial step where you can use your creativity to edit how tasks are represented.
    """
    json_task = challenge_tasks[task_id]
    train_tasks = json_task['train']
    test_task = json_task['test']
    final_output = f"{len(train_tasks)} Known Examples:\n"
    for i, task in enumerate(train_tasks):
        final_output += f"Known {i + 1}: Input\n["
        for row in task['input']:
            final_output += f"\n  {str(row)},"
        final_output += "\n]\n\n"
        final_output += f"Known {i + 1}: Output\n["
        for row in task['output']:
            final_output += f"\n  {str(row)},"
        final_output += "\n]\n\n"

    final_output += "Current Problem\n["
    for row in test_task[test_input_index]['input']:
        final_output += f"\n  {str(row)}"

    final_output += "\n]\n\nSolution:"

    return final_output

task_string = json_task_to_string(challenges, '0520fde7', 0)
print (task_string)

3 Known Examples:
Known 1: Input
[
  [1, 0, 0, 5, 0, 1, 0],
  [0, 1, 0, 5, 1, 1, 1],
  [1, 0, 0, 5, 0, 0, 0],
]

Known 1: Output
[
  [0, 0, 0],
  [0, 2, 0],
  [0, 0, 0],
]

Known 2: Input
[
  [1, 1, 0, 5, 0, 1, 0],
  [0, 0, 1, 5, 1, 1, 1],
  [1, 1, 0, 5, 0, 1, 0],
]

Known 2: Output
[
  [0, 2, 0],
  [0, 0, 2],
  [0, 2, 0],
]

Known 3: Input
[
  [0, 0, 1, 5, 0, 0, 0],
  [1, 1, 0, 5, 1, 0, 1],
  [0, 1, 1, 5, 1, 0, 1],
]

Known 3: Output
[
  [0, 0, 0],
  [2, 0, 0],
  [0, 0, 2],
]

Current Problem
[
  [1, 0, 1, 5, 1, 0, 1]
  [0, 1, 0, 5, 1, 0, 1]
  [1, 0, 1, 5, 0, 1, 0]
]

Solution:


The string printed above is the result for the task we looked at before (`0520fde7`).

## Output Parsing

Awesome! Now we have a string we can work with. But what about the output from the LLM?

LLMs aren't *great* at outputting valid json, so we'll take any help we can get.

LangChain has a [few ways](https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/) to do output parsing (ensuring the output is in the format you'd like). We'll use a [JsonOutputParser](https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/types/json/) for our use case. Feel free to use any other you'd like.

To do this we need a data structure. We'll use a simple prediction that is a list of lists. This won't be 100% accurate all the time, and there are better ways to enforce data structures (like [instructor](https://github.com/jxnl/instructor)), but we'll add some retries later in case it fails.

In [9]:
# Defining a prediction as a list of lists
class ARCPrediction(BaseModel):
    prediction: List[List] = Field(..., description="A prediction for a task")
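To make the "list of lists" requirement concrete, here is a standard-library-only sketch of the kind of check we want a model reply to pass. It is not how `JsonOutputParser` works internally; the function name `parse_prediction` is made up for this illustration, and it assumes the reply is a bare JSON string with no surrounding text.

```python
import json


def parse_prediction(raw_reply: str) -> list:
    """Parse a JSON reply and return the grid, or raise ValueError."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"Reply is not valid JSON: {err}") from err

    # Accept either {"prediction": [[...]]} or a bare [[...]].
    grid = data.get("prediction", data) if isinstance(data, dict) else data

    is_grid = isinstance(grid, list) and all(
        isinstance(row, list) and all(isinstance(cell, int) for cell in row)
        for row in grid
    )
    if not is_grid:
        raise ValueError("Reply must be a list of lists of integers.")
    return grid


print(parse_prediction('{"prediction": [[0, 2], [2, 0]]}'))
```

Raising `ValueError` on a malformed reply gives the calling code a single exception type to catch and retry on.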

## Language Model Prediction
Now that we have our data structure let's move to the LLM prompt and call to get our prediction.

There will be 3 main pieces to this LLM call
* Model: This is the LLM that we'll use
* Prompt: The prompt that we'll send to the LLM. Because this is a 'hello world' example, we'll use an extremely simple prompt. You should edit this with different ideas you have about how to score better on ARC-AGI.
* Parser: The output parser that we'll use to ensure our output is in the correct format

We'll wrap these up with [LangChain Expression Language](https://python.langchain.com/v0.1/docs/expression_language/) to keep it simple.

In [10]:
def get_task_prediction(challenge_tasks, task_id, test_input_index) -> List[List]:
    """
    challenge_tasks: dict a list of tasks
    task_id: str the id of the task we want to get a prediction for
    test_input_index: the index of your test input. 96% of tests only have 1 input.

    Given a task, predict the test output
    """

    # Get the string representation of your task
    task_string = json_task_to_string(challenge_tasks, task_id, test_input_index)
    
    # Set up a parser to inject instructions into the prompt template.
    parser = JsonOutputParser(pydantic_object=ARCPrediction)

    # Create your prompt template. This is very rudimentary! You should edit this to do much better.
    # For example, we don't tell the model what its first attempt was (so it can do a different one), that might help!
    # Note: adjacent string literals are joined as-is, so each piece ends with a space or newline.
    prompt = PromptTemplate(
        template="You are a bot that is very good at solving puzzles. Below is a list of input and output pairs with a pattern. "
                    "Identify the pattern, then apply that pattern to the test input to give a final output. "
                    "Just give valid json list of lists response back, nothing else. Do not explain your thoughts.\n"
                    "{format_instructions}\n{task_string}\n",
        input_variables=["task_string"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    # Wrap up your chain with LCEL
    chain = prompt | llm | parser

    # Optional, print out the prompt if you want to see it. If you use LangSmith you could view this there as well.
    # print (f"Prompt:\n\n{prompt.format(task_string=task_string)}")
    
    # Finally, go get your prediction from your LLM. This will make the API call.
    output = chain.invoke({"task_string": task_string})

    # Because the output is structured, get the prediction key. If it isn't there, then just get the output
    if isinstance(output, dict):
        prediction = output.get('prediction', output)
    else:
        prediction = output

    # Safety measure to error out if you don't get a list of lists of ints back. This will spark a retry later.
    if not all(isinstance(sublist, list) and all(isinstance(item, int) for item in sublist) for sublist in prediction):
        print("Warning: Output must be a list of lists of integers.")
        print (f"Errored Output: {prediction}")
        raise ValueError("Output must be a list of lists of integers.")
    
    # Let's find the shape of our prediction
    num_rows = len(prediction)
    num_cols = len(prediction[0]) if num_rows > 0 else 0
    print(f"    Prediction Grid Size: {num_rows}x{num_cols}\n")
    
    return prediction
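To build some intuition for what `prompt | llm | parser` does, here is a toy sketch of pipe-style composition in plain Python. This is not LangChain's implementation. The `Step` class and the `fake_prompt`, `fake_llm` and `fake_parser` stand-ins are hypothetical, and the "model" simply returns a canned string so no API call is made.

```python
import json


class Step:
    """A toy pipeline step: wraps a function and supports `a | b` composition."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Feed this step's output into the next step.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins for the prompt, the model and the parser.
fake_prompt = Step(lambda inputs: f"Solve this puzzle:\n{inputs['task_string']}")
fake_llm = Step(lambda prompt_text: '{"prediction": [[0, 2], [2, 0]]}')
fake_parser = Step(json.loads)

chain = fake_prompt | fake_llm | fake_parser
print(chain.invoke({"task_string": "Current Problem ..."}))
```

Each step only needs to accept the previous step's output, which is what makes it easy to swap one model for another in the middle of the chain.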

Great! Now that we have the pieces to make our prediction we can run through our challenges and start predicting them.

Let's make a function that will run through challenges and then output a list of submissions for us.

Note: This list of submissions needs to be in a specific format. We'll later use this to populate a `submission.json` file that matches the format used for the official Kaggle competition. See more information on that [here](https://www.kaggle.com/competitions/arc-prize-2024/overview/evaluation) or view an example of a `submission.json` file [here](https://www.kaggle.com/competitions/arc-prize-2024/data?select=sample_submission.json).

Note: For ARC Prize 2024, you get 2 attempts at each task input. Both attempts must be submitted at the same time. If either of them is correct you get a full score of 1 for that input.

In [11]:
def run_model(challenges, NUM_ATTEMPTS=2, RETRY_ATTEMPTS=3, NUM_TASKS=None):
    """
    challenges: dict a list of challenges. This should come directly from your _challenges file
    NUM_ATTEMPTS: int the number of times to attempt a prediction. The official competition has 2 attempts.
    RETRY_ATTEMPTS: int the number of times to retry a prediction if it fails
    NUM_TASKS: int, if set, the number of tasks you'd like to test. If None, then all challenges will be tested

    Loop through your challenges and return a submission dict that you can later save as submission.json.
    """

    # A dict to hold your submissions that you'll return after all predictions are made
    submission = {}

    # Run through each task in your challenge set
    for i, task_id in enumerate(challenges):
        task_attempts = []  # List to store all attempts for the current task

        # Go through each test pair to get a prediction. 96% of challenges have 1 pair.
        for t, pair in enumerate(challenges[task_id]['test']):
            print(f"Starting task #{i + 1} ({task_id}), pair #{t+1}")

            # Dictionary to store attempts for the current test pair
            pair_attempts = {}  

            # Run through each prediction attempt
            for attempt in range(1, NUM_ATTEMPTS + 1):
                attempt_key = f"attempt_{attempt}"
                pair_attempts[attempt_key] = [] # Init your attempt

                # Try to get a prediction, with retries in case of failure
                for retry in range(RETRY_ATTEMPTS):
                    try:
                        print(f"    Predicting attempt #{attempt}, retry #{retry + 1}")
                        prediction = get_task_prediction(challenge_tasks=challenges,
                                                         task_id=task_id,
                                                         test_input_index=t)
                        
                        # If you get a valid prediction (list of lists of ints) with no error, then log the attempt
                        pair_attempts[attempt_key] = prediction
                        break  # Break the retry loop if prediction is successful
                    except Exception as e:
                        print(f"Retrying: {e}")
                        if retry == RETRY_ATTEMPTS - 1:
                            pair_attempts[attempt_key] = []  # Fall back to an empty prediction if all retries fail

            # After you get your attempts, append them to the task attempts
            task_attempts.append(pair_attempts)

        # Append the task attempts to the submission with the task_id as the key
        submission[task_id] = task_attempts

        # Stop early once NUM_TASKS tasks have been processed (if a limit was set)
        if NUM_TASKS is not None and i + 1 == NUM_TASKS:
            break

    return submission
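If you'd like to check the shape of the submission without spending API credits, one option is to dry-run the same loop with a stub predictor. The sketch below is a compact stand-in for the loop above (no retries or logging). `build_submission`, `echo_input` and the toy task are all made up for this illustration.

```python
def build_submission(challenges: dict, predict, num_attempts: int = 2) -> dict:
    """Compact sketch of the submission loop, without retries or logging."""
    submission = {}
    for task_id, task in challenges.items():
        submission[task_id] = [
            {
                f"attempt_{n}": predict(challenges, task_id, index)
                for n in range(1, num_attempts + 1)
            }
            for index in range(len(task["test"]))
        ]
    return submission


def echo_input(challenges: dict, task_id: str, test_input_index: int) -> list:
    """Hypothetical stub predictor: returns the test input unchanged."""
    return challenges[task_id]["test"][test_input_index]["input"]


# A toy task with two test inputs, to show one attempts dict per input.
toy_challenges = {
    "toy_task": {
        "train": [{"input": [[1]], "output": [[1]]}],
        "test": [{"input": [[4, 4]]}, {"input": [[7]]}],
    }
}
print(build_submission(toy_challenges, echo_input))
```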

Wow! Now we have a way to get a prediction for each challenge. Let's try this out with one example from the `training` tasks

In [12]:
# Load up training tasks
challenges, solutions = load_tasks_from_file(task_set=task_sets['training'])

# Run the model on a single task
submission = run_model(challenges, NUM_TASKS=1)

# Print the submission
print (submission)

Starting task #1 (007bbfb7), pair #1
    Predicting attempt #1, retry #1
    Prediction Grid Size: 9x9

    Predicting attempt #2, retry #1
    Prediction Grid Size: 9x9

{'007bbfb7': [{'attempt_1': [[7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 7, 0, 7, 7, 0, 7, 7, 0], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 7, 0, 7, 7, 0, 7, 7, 0], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 7, 0, 7, 7, 0, 7, 7, 0]], 'attempt_2': [[7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 7, 0, 7, 7, 0, 7, 7, 0], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 7, 0, 7, 7, 0, 7, 7, 0], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 0, 7, 7, 0, 7, 7, 0, 7], [7, 7, 0, 7, 7, 0, 7, 7, 0]]}]}


Awesome! That is great. Let's break down what we see:
* `{'007bbfb7':`: This is our key that will let us know which task the solution is for
* `[{'attempt_1':`: This list holds our attempt dicts, one dict per test input (96% of the time there will only be one). `attempt_1` is our first attempt
* `'attempt_2':`: Our second attempt
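Because the structure is easy to get subtly wrong, it can help to sanity-check it before scoring. Here is a small sketch of such a check. The helper `check_submission_format` is made up for this illustration and is not part of the official tooling. It only checks the layout described above, not whether the grids are correct.

```python
def check_submission_format(submission: dict, challenges: dict, num_attempts: int = 2) -> list:
    """Return a list of problems found; an empty list means the layout looks right."""
    problems = []
    expected_keys = {f"attempt_{n}" for n in range(1, num_attempts + 1)}
    for task_id, task in challenges.items():
        pairs = submission.get(task_id)
        if not isinstance(pairs, list):
            problems.append(f"{task_id}: missing or not a list")
            continue
        if len(pairs) != len(task["test"]):
            problems.append(f"{task_id}: expected {len(task['test'])} entries, got {len(pairs)}")
        for index, attempts in enumerate(pairs):
            if not isinstance(attempts, dict) or set(attempts) != expected_keys:
                problems.append(f"{task_id} pair {index + 1}: expected keys {sorted(expected_keys)}")
    return problems


# Toy data: one task with a single test input.
toy_challenges = {"toy_task": {"train": [], "test": [{"input": [[1]]}]}}
good = {"toy_task": [{"attempt_1": [[1]], "attempt_2": [[1]]}]}
bad = {"toy_task": [{"attempt_1": [[1]]}]}

print(check_submission_format(good, toy_challenges))
print(check_submission_format(bad, toy_challenges))
```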

Let's create a quick function that will take our submission output and save it as a `submission.json` file.

In [13]:
def create_submission_file(submission, file_name='submission.json'):
    """
    Save a submission file to the specified file name
    """
    with open(file_name, "w") as file:
        json.dump(submission, file)

    print (f"Submission saved to {file_name}")

Lastly, we want a way to score our submissions. We'll do this by comparing `submission.json` to the `_solutions` file for the corresponding challenge.

This function will walk down the submission dict, look at each task, find the corresponding solution and then compare the two. The extra code is to account for multiple attempts and multiple test inputs. See more on scoring [here](https://www.kaggle.com/code/gregkamradt/arc-prize-scoring).

In [14]:
def score_submission(submission_file_name, solutions) -> dict:
    """
    submission_file_name: str, the file name of your submission file
    solutions: dict, the ground truth solutions you'd like to test against
    
    Read a submission from file, score it, then return the score
    """
    print (f"Scoring {submission_file_name}\n")

    # Open your submission file
    with open(submission_file_name, "r") as file:
        submission = json.load(file)

    total_score = 0
    total_tasks = 0

    # Loop through each task in your submission to grade it
    for task_id, task_submission in submission.items():
        total_tasks += 1
        task_score = 0
        num_pairs = len(task_submission)

        # Go through each test pair in the task. Most tasks will only have 1
        for pair_index, pair_attempts in enumerate(task_submission):
            print(f"Scoring Task {task_id} pair #{pair_index+1}")
            pair_correct = False

            # Look at both of your attempts
            for attempt_key, attempt in pair_attempts.items():
                
                # check to see if one is correct
                if attempt == solutions[task_id][pair_index]:
                    print(f"Task Id {task_id} pair {pair_index+1} {attempt_key} matches solution")
                    pair_correct = True
                    break # If it is correct, log it and break the loop

            if pair_correct:
                task_score += 1

        task_score /= num_pairs
        total_score += task_score

    return {
        'total_score': total_score,
        'total_tasks_scored': total_tasks
    }

## Bring it all together

Great! Now that we have a way to get a prediction for each challenge and a way to score our submissions, let's put it all together.

This is a simple function that will load up the tasks, run the model, create a submission file and then score the submission.

In [15]:
def main(task_set='training', NUM_TASKS=None, submission_file_name='submission.json'):
    # Load datasets
    challenges, solutions = load_tasks_from_file(task_set=task_sets[task_set])

    # Run the model
    submission = run_model(challenges, NUM_TASKS=NUM_TASKS)

    # Create (and overwrite) a submission file
    create_submission_file(submission, file_name=submission_file_name)

    # Score the submission
    score_result = score_submission(solutions = solutions, submission_file_name=submission_file_name)

    print(f"Final score: {score_result['total_score']} of {score_result['total_tasks_scored']} ({round(score_result['total_score']/score_result['total_tasks_scored'] * 100, 2)}%)")

## Let's run our model!

In [16]:
main(task_set='evaluation', NUM_TASKS=150)

Starting task #1 (00576224), pair #1
    Predicting attempt #1, retry #1
    Prediction Grid Size: 6x6

    Predicting attempt #2, retry #1
    Prediction Grid Size: 6x6

Starting task #2 (009d5c81), pair #1
    Predicting attempt #1, retry #1
    Prediction Grid Size: 14x14

    Predicting attempt #2, retry #1
    Prediction Grid Size: 14x14

Starting task #3 (00dbd492), pair #1
    Predicting attempt #1, retry #1
    Prediction Grid Size: 20x20

    Predicting attempt #2, retry #1
    Prediction Grid Size: 20x20

Starting task #4 (03560426), pair #1
    Predicting attempt #1, retry #1
    Prediction Grid Size: 10x10

    Predicting attempt #2, retry #1
    Prediction Grid Size: 10x10

Starting task #5 (05a7bcf2), pair #1
    Predicting attempt #1, retry #1
    Prediction Grid Size: 30x30

    Predicting attempt #2, retry #1
    Prediction Grid Size: 30x30

Starting task #6 (0607ce86), pair #1
    Predicting attempt #1, retry #1
    Prediction Grid Size: 24x22

    Predicting attempt 

Awesome! Congratulations. You just made predictions for ARC-AGI tasks using gpt-4o. Remember, with LangChain you can easily swap out different models. How do you think [claude-3.5-sonnet](https://www.anthropic.com/news/claude-3-5-sonnet) or [Gemini](https://gemini.google.com/) would do?

If you would like to make an ARC-AGI-Pub high-score claim, run your model against `kaggle/input/arc-prize-2024/arc-agi_evaluation_challenges.json` and report your score [here](https://docs.google.com/forms/d/e/1FAIpQLSdyPg16R2BmGb6nZpsTAty4HqI4WjhpZcg951ApjzSfHJ7Kpw/viewform). Serious submissions will be considered for testing against a semi-private set of challenges. Top scores on ARC-AGI-Pub will be reported on the [ARC-AGI-Pub Leaderboard](https://arcprize.org/leaderboard).

We love hearing from the community. If you expand on this notebook please share it with us on [Twitter](https://twitter.com/arcprize), [Discord](https://discord.gg/9b77dPAmcA) or at team@arcprize.org.