# Agent Maze Test

In this notebook, we will use a procedural maze generator to create a maze and then use an agent to navigate the maze.

This lets us test how well an agent navigates the maze and compare different LLMs on:
- time spent
- number of tool calls
- whether they ever find the treasure (we cap the number of steps to avoid infinite loops)

In [None]:
%pip install -U llama-index-llms-openai

In [1]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

## Maze Generation

We've vibe-coded a procedural maze generator in `procedural_maze_generator.py`. Hopefully it generates a valid maze! (It does)

Let's start by generating three mazes with different difficulties.

In [2]:
import os
import shutil
from procedural_maze_generator import (
    ProceduralMazeGenerator,
    MazeConfig,
    DifficultyLevel,
)

maze_configs = [
    MazeConfig(
        depth=2,
        difficulty=DifficultyLevel.EASY,
        theme="fantasy",
    ),
    MazeConfig(
        depth=3,
        difficulty=DifficultyLevel.MEDIUM,
        theme="sci-fi",
    ),
    MazeConfig(
        depth=5,
        difficulty=DifficultyLevel.HARD,
        theme="mystery",
        enable_math=True,
        enable_coordinates=True,
        enable_riddles=True,
    ),
]

maze_paths = [
    "./mazes/easy_maze",
    "./mazes/medium_maze",
    "./mazes/hard_maze",
]

# Clear out the mazes directory, then recreate it (even on a first run)
if os.path.exists("./mazes"):
    shutil.rmtree("./mazes")
os.makedirs("./mazes", exist_ok=True)

# Generate the mazes
generators = [ProceduralMazeGenerator(maze_config) for maze_config in maze_configs]
for i, maze_path in enumerate(maze_paths):
    generators[i].generate_maze(maze_path)

    stats = generators[i].get_maze_stats()
    print(f"Stats: {stats['total_nodes']} nodes, {stats['total_files']} files")
    print(f"Solution: {' → '.join(stats['solution_path'])}")
    print("-" * 50)

🎲 Generating procedural maze (depth=2, theme=fantasy)
✅ Generated maze with 12 nodes
🎯 Treasure path: entrance → ancient_tome_chamber
Stats: 12 nodes, 6 files
Solution: entrance → ancient_tome_chamber
--------------------------------------------------
🎲 Generating procedural maze (depth=3, theme=sci-fi)
✅ Generated maze with 21 nodes
🎯 Treasure path: entrance → bay → neural_matrix_chamber
Stats: 21 nodes, 15 files
Solution: entrance → bay → neural_matrix_chamber
--------------------------------------------------
🎲 Generating procedural maze (depth=5, theme=mystery)
✅ Generated maze with 183 nodes
🎯 Treasure path: entrance → gallery → attic → study → secret_formula_chamber
Stats: 183 nodes, 68 files
Solution: entrance → gallery → attic → study → secret_formula_chamber
--------------------------------------------------
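
A quick sanity check on a generated maze is to count what actually landed on disk. The following is a minimal sketch, assuming a maze is just a directory tree; `count_tree` is a hypothetical helper and not part of `procedural_maze_generator.py`, and its directory count need not match the generator's own notion of "nodes".

```python
import os
import tempfile


def count_tree(root: str) -> tuple[int, int]:
    """Return (directories, files) found under root, counting root itself."""
    n_dirs, n_files = 0, 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        n_dirs += 1  # os.walk yields one tuple per directory, root included
        n_files += len(filenames)
    return n_dirs, n_files


# Build a tiny throwaway tree so the sketch runs without the generator.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "tower", "temple"))
    os.makedirs(os.path.join(tmp, "dungeon"))
    with open(os.path.join(tmp, "welcome.txt"), "w") as f:
        f.write("hello")
    demo_counts = count_tree(tmp)

print(demo_counts)
```

Calling `count_tree("./mazes/easy_maze")` after generation would give the on-disk totals for the easy maze.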


## Constructing our Agent

To navigate the maze, we will provide the agent with a set of simple tools and instructions.

Some tools could find the solution almost instantly, but the point is to measure how well the agent navigates the maze, so we deliberately limit how useful the tools are.

### Tools

First, we can construct a simple toolkit for the agent to use. This will contain the tools that the agent can use to navigate the maze and solve the puzzles.

In [3]:
import asyncio
import os


class AgentMazeToolkit:
    def __init__(self, generator: ProceduralMazeGenerator, maze_path: str):
        self.current_generator = generator
        self.maze_path = maze_path

    async def list_directory(self, path: str) -> list[str]:
        """List the contents of a directory."""
        # Check if the path is within the maze directory
        if not path.startswith(self.maze_path):
            return ["❌ Error: Path is outside of maze directory"]

        # Check if the path is a file
        if os.path.isfile(path):
            return ["❌ Error: Path is a file"]

        def _list_directory():
            paths = os.listdir(path)

            # Add markers to distinguish between directories and files
            for idx, p in enumerate(paths):
                if os.path.isdir(os.path.join(path, p)):
                    paths[idx] = "(dir) " + p
                else:
                    paths[idx] = "(file) " + p

            return paths

        return await asyncio.to_thread(_list_directory)

    async def read_file(self, path: str) -> str:
        """Read the contents of a file."""
        # Check if the path is within the maze directory
        if not path.startswith(self.maze_path):
            return "❌ Error: Path is outside of maze directory"

        # Check if the path is a directory
        if os.path.isdir(path):
            return "❌ Error: Path is a directory"

        def _read():
            with open(path, "r") as f:
                return f.read()

        return await asyncio.to_thread(_read)

    async def check_coordinate(self, x: str, y: str) -> str:
        """Check if coordinates lead to the treasure location."""

        # Get the actual solution path from the generator
        solution_path = self.current_generator.get_solution_path()
        target_path = f"{x}/{y}"

        # Check if this matches any part of the solution
        solution_parts = "/".join(solution_path[1:])  # Skip entrance
        if target_path in solution_parts or (
            x in solution_parts and y in solution_parts
        ):
            result = f"🎯 Coordinates ({x}, {y}) are CORRECT! This leads toward the treasure!"
        else:
            result = (
                f"❌ Coordinates ({x}, {y}) do not lead to treasure. Keep searching!"
            )

        return result
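
The `startswith` checks in the toolkit compare raw strings, so a path such as `./mazes/easy_maze/../hard_maze` would pass them. Here is a minimal sketch of a stricter containment check, assuming it is acceptable to resolve paths with `os.path.realpath` first; `is_within` is a hypothetical helper, not something the toolkit above defines.

```python
import os


def is_within(root: str, candidate: str) -> bool:
    """True if candidate resolves to root itself or to something beneath it."""
    root_real = os.path.realpath(root)
    cand_real = os.path.realpath(candidate)
    try:
        # commonpath compares whole path components, not raw string prefixes
        return os.path.commonpath([root_real, cand_real]) == root_real
    except ValueError:
        # Raised on Windows when the two paths are on different drives
        return False


maze_root = "./mazes/easy_maze"
inside = is_within(maze_root, "./mazes/easy_maze/tower")
escaped = is_within(maze_root, "./mazes/easy_maze/../hard_maze")
sibling = is_within(maze_root, "./mazes/easy_maze_2")
print(inside, escaped, sibling)
```

The paths do not need to exist for the check to work, and a path written without the leading `./` resolves to the same location as one written with it.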

### Prompts

We need to give the agent two prompts:
- the system prompt
- the initial task prompt

The hope is that the agent will run the task and won't stop until it has found the treasure.

In [4]:
def get_system_prompt(maze_path: str, theme: str, difficulty: str, depth: int) -> str:
    return f"""You are competing in a procedural maze tournament!

This is a {difficulty} difficulty {theme} themed maze with {depth} levels.

The maze is represented as a directory structure starting at {maze_path}.

Your goal: Find the treasure hidden in this randomly generated maze.

Available tools:
- list_directory(path): Explore directories
- read_file(path): Read file contents for clues
- check_coordinate(x, y): Test coordinate combinations

Strategy for {theme} theme:
- Look for {theme}-themed location names and objects
- Clues will reference the theme's vocabulary
- The treasure will have a {theme}-appropriate name

Show your systematic reasoning skills! Remember, the maze path starts at {maze_path}.
"""


def get_task_prompt(maze_path: str, theme: str, difficulty: str) -> str:
    return f"""Navigate this procedural {theme} maze at {maze_path} to find the hidden treasure!

This maze was randomly generated with {difficulty} difficulty. Use your tools strategically to:
1. Explore the maze structure systematically
2. Read files for clues and puzzle solutions
3. Decode any encrypted messages you find
4. Follow logical progression of clues to the treasure

The maze contains both helpful clues and red herrings."""

## Testing the Agents

Using everything we have so far, we can test the agents on the mazes.

We will use the following metrics for each LLM/Maze combination:
- Tool calls
- Time spent
- Success

Using the `FunctionAgent` class from `llama_index`, we can create an agent that can use the tools we have defined.

As the agent runs, we can track the events from the stream to keep track of the tool calls and time spent.
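
The counting-and-timing pattern can be sketched without any LLM at all. In this minimal example, `FakeToolResult` and `fake_event_stream` are hypothetical stand-ins for the agent's event stream and are not `llama_index` classes; only the loop shape is the point.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class FakeToolResult:
    """Hypothetical stand-in for a tool-call result event."""

    tool_name: str
    tool_output: str


async def fake_event_stream():
    """Hypothetical stream mixing tool results with other event types."""
    yield "some other event"
    yield FakeToolResult("list_directory", "['(dir) tower']")
    await asyncio.sleep(0)
    yield FakeToolResult("read_file", "Welcome, brave explorer!")


async def run_and_measure() -> dict:
    metrics = {"tool_calls": 0, "time_spent": 0.0}
    start = time.perf_counter()  # start the clock once, before streaming
    async for event in fake_event_stream():
        # Only tool results count toward the tool-call metric
        if isinstance(event, FakeToolResult):
            metrics["tool_calls"] += 1
    metrics["time_spent"] = time.perf_counter() - start
    return metrics


metrics = asyncio.run(run_and_measure())
print(metrics["tool_calls"])
```

Inside a notebook, where an event loop is already running, `await run_and_measure()` takes the place of `asyncio.run(...)`.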

Once the agent stops, an automated check confirms whether it actually found the treasure or bailed early; if it bailed, we run it again.
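
The automated check amounts to a case-insensitive substring search over the collected tool outputs. The following is a minimal sketch of that idea; `find_treasure` is a hypothetical helper, and the treasure names used here are assumptions for illustration rather than the generator's actual list.

```python
def find_treasure(tool_outputs, treasure_names):
    """Return the first treasure name found in any tool output, else None.

    Matching is case-insensitive and substring-based.
    """
    lowered = [name.lower() for name in treasure_names]
    for output in tool_outputs:
        text = str(output).lower()
        for name in lowered:
            if name in text:
                return name
    return None


# Assumed treasure names, for illustration only
names = ["CRYSTAL_CROWN", "GOLDEN_IDOL"]
hit = find_treasure(["['(dir) tower']", "The crystal_crown awaits"], names)
miss = find_treasure(["['(dir) tower', '(dir) temple']"], names)
print(hit, miss)
```

Because this is a plain substring match, it can also fire on a file that merely mentions a treasure name, so a positive result is best read as "the agent saw the name" rather than proof that it reached the final chamber.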

In [5]:
import time
from llama_index.core.agent import FunctionAgent, ToolCallResult
from llama_index.core.workflow import Context
from llama_index.llms.openai import OpenAIResponses

llms_to_test = ["gpt-5-2025-08-07", "gpt-5-mini-2025-08-07", "gpt-5-nano-2025-08-07"]
test_results = {llm_name: {} for llm_name in llms_to_test}

for llm_name in llms_to_test:
    for maze_path, generator in zip(maze_paths, generators):
        print("=" * 70)
        print(f"Testing {llm_name} on {maze_path}")
        print("=" * 70)

        # Note: the OpenAI API was quite flaky while testing this, so we've raised max_retries and timeout
        llm = OpenAIResponses(model=llm_name, max_retries=10, timeout=360)
        kit = AgentMazeToolkit(generator, maze_path)
        test_results[llm_name][maze_path] = {
            "tool_calls": 0,
            "time_spent": 0,
            "success": False,
        }

        agent = FunctionAgent(
            llm=llm,
            tools=[
                kit.list_directory,
                kit.read_file,
                kit.check_coordinate,
            ],
            system_prompt=get_system_prompt(
                maze_path,
                generator.config.theme,
                generator.config.difficulty.value,
                generator.config.depth,
            ),
        )
        ctx = Context(agent)

        task = get_task_prompt(
            maze_path, generator.config.theme, generator.config.difficulty.value
        )

        is_done = False
        tool_results = []
        # Start the clock once so every agent run in the loop counts toward the total
        start_time = time.time()
        while not is_done:

            try:
                handler = agent.run(task, ctx=ctx, max_iterations=100)
                async for event in handler.stream_events():
                    if isinstance(event, ToolCallResult):
                        test_results[llm_name][maze_path]["tool_calls"] += 1
                        tool_results.append(event.tool_output)
                        print(
                            f"Calling {event.tool_name}({event.tool_kwargs}) -> {event.tool_output}"
                        )

                response = await handler
            except Exception as e:
                if "Max iterations" in str(e):
                    print(
                        "Max iterations reached! Agent has failed! Halting search to avoid infinite loops..."
                    )
                    break
                else:
                    raise

            # Confirm if actually done by checking the tool results for the treasure
            possible_treasure_names = [
                name.lower()
                for name in generator.themes[generator.config.theme]["treasures"]
            ]
            for tool_result in tool_results:
                for treasure_name in possible_treasure_names:
                    if treasure_name in str(tool_result).lower():
                        print(f"Found treasure! {treasure_name}")
                        is_done = True
                        break

                if is_done:
                    break

        test_results[llm_name][maze_path]["time_spent"] = time.time() - start_time
        test_results[llm_name][maze_path]["success"] = is_done

print(test_results)

Testing gpt-5-2025-08-07 on ./mazes/easy_maze
Calling list_directory({'path': './mazes/easy_maze'}) -> ['(dir) tower', '(file) rules.txt', '(dir) dungeon', '(dir) ruins', '(file) welcome.txt', '(dir) ancient_tome_chamber']
Calling read_file({'path': './mazes/easy_maze/welcome.txt'}) -> Welcome, brave explorer! The CRYSTAL_CROWN awaits those clever enough to solve the ancient puzzles.
Calling read_file({'path': './mazes/easy_maze/rules.txt'}) -> Use your tools wisely. Beware of false paths and red herrings!
Calling list_directory({'path': './mazes/easy_maze/tower'}) -> ['(file) broken_relic.txt', '(dir) temple']
Calling list_directory({'path': './mazes/easy_maze/ruins'}) -> ['(dir) ruins_2', '(dir) throne']
Calling list_directory({'path': './mazes/easy_maze/dungeon'}) -> ['(dir) dungeon_2', '(dir) castle', '(dir) statue']
Calling list_directory({'path': './mazes/easy_maze/ancient_tome_chamber'}) -> ['(file) GOLDEN_IDOL.txt', '(dir) cavern']
Calling list_directory({'path': './mazes/easy_

## Results

Let's compare the results across the different LLMs and mazes.
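
For a more compact view, the nested results dict can be flattened into one row per LLM/maze pair. This is a minimal sketch, assuming the same dict shape as `test_results`; `results_to_rows` is a hypothetical helper, and the sample holds two entries copied from the run output in this notebook.

```python
def results_to_rows(results: dict) -> list[tuple]:
    """Flatten {llm: {maze: metrics}} into (llm, maze, calls, seconds, success) rows."""
    rows = []
    for llm_name, mazes in results.items():
        for maze_path, m in mazes.items():
            rows.append(
                (
                    llm_name,
                    maze_path,
                    m["tool_calls"],
                    round(m["time_spent"], 1),
                    m["success"],
                )
            )
    return rows


sample = {
    "gpt-5-2025-08-07": {
        "./mazes/easy_maze": {
            "tool_calls": 12,
            "time_spent": 44.90377712249756,
            "success": True,
        },
        "./mazes/hard_maze": {
            "tool_calls": 190,
            "time_spent": 712.5433399677277,
            "success": False,
        },
    },
}

rows = results_to_rows(sample)
for llm, maze, calls, secs, ok in rows:
    print(f"{llm:<24} {maze:<22} {calls:>5} {secs:>8.1f} {ok}")
```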

In [6]:
for llm_name in test_results:
    print("=" * 70)
    print(f"Results for {llm_name}:")
    for maze_path in test_results[llm_name]:
        print(f"Maze path: {maze_path}")
        print(f"Tool calls: {test_results[llm_name][maze_path]['tool_calls']}")
        print(f"Time spent: {test_results[llm_name][maze_path]['time_spent']}")
        print(f"Success: {test_results[llm_name][maze_path]['success']}")
        print("-" * 70)
    print("\n\n")

Results for gpt-5-2025-08-07:
Maze path: ./mazes/easy_maze
Tool calls: 12
Time spent: 44.90377712249756
Success: True
----------------------------------------------------------------------
Maze path: ./mazes/medium_maze
Tool calls: 64
Time spent: 177.27516102790833
Success: True
----------------------------------------------------------------------
Maze path: ./mazes/hard_maze
Tool calls: 190
Time spent: 712.5433399677277
Success: False
----------------------------------------------------------------------



Results for gpt-5-mini-2025-08-07:
Maze path: ./mazes/easy_maze
Tool calls: 16
Time spent: 25.987674951553345
Success: True
----------------------------------------------------------------------
Maze path: ./mazes/medium_maze
Tool calls: 28
Time spent: 32.300190925598145
Success: True
----------------------------------------------------------------------
Maze path: ./mazes/hard_maze
Tool calls: 117
Time spent: 245.28652715682983
Success: False
-------------------------------------