πŸ—ΊοΈ LLM Maze Benchmark Data Generator

A TypeScript-based maze generation system that creates diverse, solvable mazes with optimal A* pathfinding solutions, designed specifically for LLM fine-tuning and benchmarking.

πŸ“‹ Overview

This project generates mazes of varying sizes (5x5, 10x10, 20x20) using three distinct algorithms, ensuring structural diversity and algorithmic variety. Each generated maze is guaranteed to be solvable and includes the optimal path from start (P) to destination (D).

🎯 Features

  • Three Distinct Generation Algorithms:

    • Recursive Backtracker (DFS-based): Long, winding corridors with many dead ends
    • Prim's Algorithm (MST-based): Blockier, denser feel with room-like areas
    • Cellular Automata: Organic, cave-like structures with irregular walls
  • Robust A* Pathfinding: Finds optimal solutions using the Manhattan distance heuristic

  • Quality Guarantees:

    • All generated mazes are solvable
    • Start (P) and destination (D) are well-separated
    • Single connected component (no isolated areas)
  • Dual Output Formats:

    • Fine-tuning dataset (JSONL)
    • Benchmark dataset (JSON with metadata)

πŸš€ Quick Start

Installation

bun install

Generate Mazes

bun run generate
# or
bun start

This will generate:

  • data/finetuning_dataset.jsonl: 900 training samples (100 per algorithm per size)
  • data/benchmark_maps.json: 90 benchmark samples (10 per algorithm per size)

Visualize Mazes

bun run visualize
# or open visualizer.html in your browser

Opens an interactive web-based visualizer to inspect generated mazes with:

  • Visual grid display with color-coded cells (walls, empty, start, destination, path)
  • Path visualization - show/hide optimal path, step through it move-by-move
  • Filtering - view specific algorithms or sizes
  • Navigation - Previous/Next buttons or keyboard arrows (←/β†’)
  • Random selection - Jump to a random maze for spot-checking
  • Keyboard shortcuts - Arrow keys to navigate, Space to toggle path

Visualizer Features:

  • 🎨 Beautiful, responsive UI with color-coded maze elements
  • πŸ“Š Displays map metadata (ID, algorithm, size, path length)
  • πŸ”„ Easy navigation between all 90 benchmark maps
  • 🎲 Random button for quick spot-checking
  • ⏭️ Step-by-step path visualization to trace the solution
  • πŸ“± Mobile-friendly responsive design

Run LLM Benchmark

Test how well different LLMs can solve the generated mazes:

# 1. Set up your OpenRouter API key
echo "OPENROUTER_API_KEY=your_key_here" > .env.local

# 2. Run the benchmark
bun run benchmark

Features:

  • πŸ€– Test multiple models in parallel (Claude, GPT-4, etc.)
  • πŸ“Š Automatic tracking of accuracy, tokens, cost, and latency
  • πŸ’Ύ Results saved as JSON and CSV for analysis
  • ⚑ Configurable parallelism for fast execution
  • πŸ”„ Progress reporting in real-time

Results include:

  • Accuracy percentage per model
  • Average response time
  • Total tokens used and estimated cost
  • Detailed per-test breakdown in CSV format

πŸ“ Project Structure

mazesolver/
β”œβ”€β”€ lib/                                  # πŸ“¦ Reusable libraries
β”‚   └── llm-benchmark-runner/             # LLM benchmark library (can be extracted)
β”‚       β”œβ”€β”€ src/
β”‚       β”‚   β”œβ”€β”€ index.ts                  # Public API exports
β”‚       β”‚   β”œβ”€β”€ types.ts                  # Core types and interfaces
β”‚       β”‚   β”œβ”€β”€ openrouter.ts             # OpenRouter API client
β”‚       β”‚   └── runner.ts                 # Parallel benchmark runner
β”‚       β”œβ”€β”€ package.json                  # Library package config
β”‚       └── README.md                     # Library documentation
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ types.ts                          # Type definitions
β”‚   β”œβ”€β”€ index.ts                          # Main generation script
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   └── gridOperations.ts             # Grid utilities (walls, P/D placement)
β”‚   β”œβ”€β”€ pathfinding/
β”‚   β”‚   └── astar.ts                      # A* pathfinding algorithm
β”‚   β”œβ”€β”€ mapGenerator/
β”‚   β”‚   β”œβ”€β”€ recursiveBacktracker.ts       # DFS-based maze generator
β”‚   β”‚   β”œβ”€β”€ prim.ts                       # Prim's algorithm generator
β”‚   β”‚   └── cellularAutomata.ts           # Cellular automata generator
β”‚   └── benchmark/                        # Maze benchmarking
β”‚       β”œβ”€β”€ mazeSolver.ts                 # Maze-specific logic
β”‚       └── run.ts                        # Main benchmark script
β”œβ”€β”€ data/                                 # Generated datasets
β”œβ”€β”€ results/                              # Benchmark results
β”œβ”€β”€ visualizer.html                       # Interactive maze viewer
β”œβ”€β”€ serve-visualizer.ts                   # Local web server for visualizer
└── package.json

🧩 Map Representation

Map Elements

  • #: Wall (impassable)
  • P: Player start position
  • D: Destination (goal)
  • .: Walkable tile (empty space)

Example 5x5 Map

#######
#P....#
#.###.#
#.#...#
#.#.#.#
#...#D#
#######

Note: Map dimensions (5x5, 10x10, 20x20) refer to the inner walkable grid, excluding the outer perimeter walls. A 5x5 inner grid becomes a 7x7 grid with walls.
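For illustration, here is a minimal TypeScript sketch of parsing such a map string into a grid. The type and helper names below are hypothetical; the project's own definitions live in src/types.ts and src/utils/gridOperations.ts and may differ.

// Illustrative sketch only; the project's actual types may differ.
type Cell = '#' | '.' | 'P' | 'D';
type Grid = Cell[][];

// Parse a newline-separated map string into a 2D grid of cells.
function parseMap(map: string): Grid {
  return map.trim().split('\n').map((row) => row.split('') as Cell[]);
}

// Find the coordinates of a marker ('P' or 'D') in the grid, if present.
function findCell(grid: Grid, marker: Cell): { row: number; col: number } | null {
  for (let row = 0; row < grid.length; row++) {
    for (let col = 0; col < grid[row].length; col++) {
      if (grid[row][col] === marker) return { row, col };
    }
  }
  return null;
}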

🎲 Generation Algorithms

1. Recursive Backtracker (DFS-based)

Description: Creates "perfect" mazes with no loops or isolated sections. Each pair of cells has exactly one path between them.

Characteristics:

  • Long, winding corridors
  • Many dead ends
  • Spindly, organic branching
  • High path complexity

Use Case: Testing an LLM's ability to handle complex, maze-like navigation with many decision points.
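For intuition, here is a minimal sketch of the carving idea: an iterative DFS over cells at odd coordinates that knocks out the wall between a cell and its chosen neighbour. This is illustrative only; the real generator in src/mapGenerator/recursiveBacktracker.ts may differ in details.

// Sketch of DFS-based maze carving; assumes odd rows/cols, '#' = wall, '.' = floor.
function carveRecursiveBacktracker(rows: number, cols: number): string[][] {
  const grid = Array.from({ length: rows }, () => Array.from({ length: cols }, () => '#'));
  const stack: [number, number][] = [[1, 1]];
  grid[1][1] = '.';

  while (stack.length > 0) {
    const [r, c] = stack[stack.length - 1];
    // Unvisited cells two steps away, in random order (quick shuffle for illustration).
    const next = [[r - 2, c], [r + 2, c], [r, c - 2], [r, c + 2]]
      .filter(([nr, nc]) => nr > 0 && nr < rows - 1 && nc > 0 && nc < cols - 1 && grid[nr][nc] === '#')
      .sort(() => Math.random() - 0.5);

    if (next.length === 0) {
      stack.pop(); // Dead end: backtrack.
    } else {
      const [nr, nc] = next[0];
      grid[(r + nr) / 2][(c + nc) / 2] = '.'; // Remove the wall between the two cells.
      grid[nr][nc] = '.';
      stack.push([nr, nc]);
    }
  }
  return grid;
}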

2. Prim's Algorithm (MST-based)

Description: Also creates perfect mazes but with a different structural feel.

Characteristics:

  • Blockier, denser layout
  • More interconnected "room-like" areas
  • Wider passages
  • Bulbous cavern structures

Use Case: Testing an LLM's spatial reasoning with more open, room-based layouts.
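The corresponding frontier-based carving can be sketched as follows. Again, this is illustrative only; see src/mapGenerator/prim.ts for the actual generator.

// Sketch of randomized Prim's carving on the same odd-coordinate cell scheme.
function carvePrim(rows: number, cols: number): string[][] {
  const grid = Array.from({ length: rows }, () => Array.from({ length: cols }, () => '#'));
  grid[1][1] = '.';
  const frontier: [number, number][] = [];
  if (rows > 4) frontier.push([3, 1]);
  if (cols > 4) frontier.push([1, 3]);

  while (frontier.length > 0) {
    // Pick and remove a random frontier cell.
    const idx = Math.floor(Math.random() * frontier.length);
    const [r, c] = frontier.splice(idx, 1)[0];
    if (grid[r][c] === '.') continue; // Already carved via another route.

    // Connect it to a random already-carved cell two steps away.
    const carved = [[r - 2, c], [r + 2, c], [r, c - 2], [r, c + 2]]
      .filter(([nr, nc]) => nr > 0 && nr < rows - 1 && nc > 0 && nc < cols - 1 && grid[nr][nc] === '.');
    if (carved.length === 0) continue;
    const [nr, nc] = carved[Math.floor(Math.random() * carved.length)];
    grid[(r + nr) / 2][(c + nc) / 2] = '.';
    grid[r][c] = '.';

    // Its uncarved two-step neighbours become new frontier cells.
    for (const [fr, fc] of [[r - 2, c], [r + 2, c], [r, c - 2], [r, c + 2]]) {
      if (fr > 0 && fr < rows - 1 && fc > 0 && fc < cols - 1 && grid[fr][fc] === '#') {
        frontier.push([fr, fc]);
      }
    }
  }
  return grid;
}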

3. Cellular Automata

Description: Generates organic, natural-looking cave systems using Game of Life-like rules.

Algorithm:

  1. Initialize with random wall distribution (~45%)
  2. Apply cellular automata rules for 5-8 iterations
  3. Ensure single connected component via flood fill
  4. Smooth edges with additional iterations

Characteristics:

  • Organic, irregular shapes
  • Natural cavern feel
  • Open areas with scattered obstacles
  • Less predictable structure

Use Case: Testing an LLM's adaptability to non-traditional, organic layouts.

πŸ“Š Output Formats

Fine-tuning Dataset (finetuning_dataset.jsonl)

Each line is a JSON object:

{"input": "Map:\n#######\n#P....#\n#.....#\n#.....#\n#....D#\n#######", "output": "DOWN,DOWN,DOWN,RIGHT,RIGHT,RIGHT,RIGHT"}

Benchmark Dataset (benchmark_maps.json)

JSON array of objects with metadata:

[
  {
    "id": "rb_5x5_bm_001",
    "algorithm": "recursiveBacktracker",
    "size": "5x5",
    "map": "#######\n#P....#\n#.....#\n#.....#\n#....D#\n#######",
    "optimalPath": ["DOWN", "DOWN", "DOWN", "RIGHT", "RIGHT", "RIGHT", "RIGHT"]
  }
]
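For reference, one entry can be modeled with a small TypeScript interface inferred from the example above. The names here are illustrative; the project's own type definitions live in src/types.ts and may differ.

// Shape of one entry in data/benchmark_maps.json, inferred from the example above.
interface BenchmarkMap {
  id: string;                 // e.g. "rb_5x5_bm_001"
  algorithm: 'recursiveBacktracker' | 'prim' | 'cellularAutomata';
  size: '5x5' | '10x10' | '20x20';
  map: string;                // Newline-separated grid, including outer walls
  optimalPath: string[];      // Sequence of "UP" | "DOWN" | "LEFT" | "RIGHT" moves
}

// Hypothetical loading example using Bun's file API.
const benchmarkMaps: BenchmarkMap[] = await Bun.file('data/benchmark_maps.json').json();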

πŸ”§ Configuration

Edit src/index.ts to customize generation:

const CONFIG = {
  numMapsPerAlgorithmPerSize: {
    finetuning: 100,  // Maps per algorithm per size for training
    benchmark: 10     // Maps per algorithm per size for testing
  },
  mapSizes: [5, 10, 20],  // Inner dimensions
  algorithms: ['recursiveBacktracker', 'prim', 'cellularAutomata']
};

πŸ§ͺ Benchmark Framework

The benchmark system uses @fallom/llm-benchmark-runner - a reusable library in lib/llm-benchmark-runner/.

Library Features (in lib/llm-benchmark-runner/):

  • βœ… OpenRouter API integration with automatic cost/token tracking
  • βœ… Parallel execution with configurable concurrency
  • βœ… Progress reporting and error handling
  • βœ… Flexible output parsing and evaluation
  • βœ… Results export to JSON and CSV
  • βœ… Ready to extract as a standalone npm package

Maze-Specific Logic (in src/benchmark/):

  • Output parsing for direction sequences
  • Path evaluation (exact match, length comparison)
  • Prompt formatting for maze problems

The library is structured to be easily extracted and published. See lib/llm-benchmark-runner/README.md for library documentation.
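As an illustration of the maze-specific parsing and evaluation, a model's free-form answer can be reduced to a move sequence and compared against the stored optimal path roughly like this. This is a sketch only; the actual logic in src/benchmark/mazeSolver.ts may be more forgiving.

type Move = 'UP' | 'DOWN' | 'LEFT' | 'RIGHT';

// Extract a direction sequence from free-form model output,
// accepting answers like "DOWN, DOWN, RIGHT" or "down -> right".
function parseMoves(output: string): Move[] {
  return (output.toUpperCase().match(/\b(UP|DOWN|LEFT|RIGHT)\b/g) ?? []) as Move[];
}

// Exact-match evaluation against the stored optimal path.
function isExactMatch(predicted: Move[], optimal: string[]): boolean {
  return predicted.length === optimal.length &&
    predicted.every((move, i) => move === optimal[i]);
}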

Example: Adding a new model

Edit src/benchmark/run.ts:

const models: ModelConfig[] = [
  {
    id: 'anthropic/claude-3.5-sonnet',
    name: 'Claude 3.5 Sonnet',
    settings: { temperature: 0, max_tokens: 1000 }
  },
  {
    id: 'openai/gpt-4o',  // Add your model
    name: 'GPT-4o',
    settings: { temperature: 0, max_tokens: 1000 }
  }
];

🎯 Key Implementation Details

P & D Placement

  • Only placed on walkable tiles (.)
  • Minimum separation distance ensures meaningful paths:
    • Small maps (5x5): minimum 2 cells apart
    • Larger maps: minimum ~1/3 of grid size apart
  • Up to 1000 attempts to find well-separated positions
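A rough sketch of that placement loop, with hypothetical names (the real helper is placePandD in src/utils/gridOperations.ts):

// Sketch: pick two walkable cells at least minDist apart (Manhattan), up to 1000 attempts.
function placeStartAndGoal(
  grid: string[][],
  minDist: number
): { start: [number, number]; end: [number, number] } | null {
  const floor: [number, number][] = [];
  grid.forEach((row, r) => row.forEach((cell, c) => {
    if (cell === '.') floor.push([r, c]);
  }));

  for (let attempt = 0; attempt < 1000; attempt++) {
    const start = floor[Math.floor(Math.random() * floor.length)];
    const end = floor[Math.floor(Math.random() * floor.length)];
    const dist = Math.abs(start[0] - end[0]) + Math.abs(start[1] - end[1]);
    if (dist >= minDist) {
      grid[start[0]][start[1]] = 'P';
      grid[end[0]][end[1]] = 'D';
      return { start, end };
    }
  }
  return null; // No well-separated pair found.
}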

Outer Walls

  • Automatically added after inner grid generation
  • A* pathfinding runs on the full grid (with outer walls)
  • Inner dimensions + 2 = final grid dimensions
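The wrapping step itself is simple; here is a sketch of the idea behind addOuterWalls (illustrative, not the project's exact code):

// Sketch: surround an inner grid with a one-cell border of walls.
function addBorder(inner: string[][]): string[][] {
  const width = inner[0].length + 2;
  const wallRow = () => Array.from({ length: width }, () => '#');
  return [wallRow(), ...inner.map((row) => ['#', ...row, '#']), wallRow()];
}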

Validation

  • Every generated map is validated with A*
  • Maps without valid paths are discarded and regenerated
  • Ensures 100% solvability of output dataset
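Put together, the generate-and-validate loop looks roughly like this. It is a sketch: the helper names and signatures are illustrative stand-ins for the project's addOuterWalls, placePandD, and findPath, and the real flow lives in src/index.ts.

// Sketch of the regenerate-until-solvable loop.
function generateSolvableMap(
  generate: () => string[][],
  minDist: number
): { grid: string[][]; path: string[] } {
  for (;;) {
    const grid = addBorder(generate());          // Wrap the inner maze in outer walls.
    const placed = placeStartAndGoal(grid, minDist);
    if (!placed) continue;                       // Could not separate P and D: retry.
    const path = findPath(grid, placed.start, placed.end);
    if (path) return { grid, path };             // Solvable: keep this map.
    // No valid path: discard and regenerate.
  }
}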

Connectivity

  • Recursive Backtracker & Prim's: Perfect mazes by design, additional connectivity pass adds strategic loops
  • Cellular Automata: Flood fill ensures single connected component, isolated areas converted to walls
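One way to implement that connectivity pass is to keep only the largest floor region found by flood fill. The following is an illustrative sketch, not the project's exact code:

// Sketch: keep only the largest connected floor region; turn everything else into walls.
function keepLargestRegion(grid: string[][]): void {
  const rows = grid.length;
  const cols = grid[0].length;
  const seen = grid.map((row) => row.map(() => false));
  let largest: [number, number][] = [];

  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      if (grid[r][c] !== '.' || seen[r][c]) continue;
      // Breadth-first flood fill from this floor cell.
      const region: [number, number][] = [[r, c]];
      seen[r][c] = true;
      for (let i = 0; i < region.length; i++) {
        const [cr, cc] = region[i];
        for (const [nr, nc] of [[cr - 1, cc], [cr + 1, cc], [cr, cc - 1], [cr, cc + 1]]) {
          if (nr >= 0 && nr < rows && nc >= 0 && nc < cols && grid[nr][nc] === '.' && !seen[nr][nc]) {
            seen[nr][nc] = true;
            region.push([nr, nc]);
          }
        }
      }
      if (region.length > largest.length) largest = region;
    }
  }

  // Wall off every floor cell that is not in the largest region.
  const keep = new Set(largest.map(([kr, kc]) => `${kr},${kc}`));
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      if (grid[r][c] === '.' && !keep.has(`${r},${c}`)) grid[r][c] = '#';
    }
  }
}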

πŸ“ˆ Performance

  • Small maps (5x5): ~1-2ms per map
  • Medium maps (10x10): ~5-10ms per map
  • Large maps (20x20): ~20-50ms per map

Total generation time for default configuration (990 maps): ~1-5 minutes

πŸ› οΈ Development

Type Safety

The project uses strict TypeScript with:

  • Strict null checks
  • No unchecked indexed access
  • Comprehensive type definitions in types.ts

Testing a Single Map

import { generateRecursiveBacktrackerMaze } from './src/mapGenerator/recursiveBacktracker';
import { addOuterWalls, placePandD, mapToString } from './src/utils/gridOperations';
import { findPath } from './src/pathfinding/astar';

const innerGrid = generateRecursiveBacktrackerMaze(10, 10);
const withWalls = addOuterWalls(innerGrid);
const result = placePandD(withWalls);

if (result) {
  const path = findPath(result.grid, result.start, result.end);
  console.log(mapToString(result.grid));
  console.log('Path:', path);
}

πŸ“ Algorithm Details

A* Pathfinding

  • Heuristic: Manhattan distance (optimal for 4-directional grid movement)
  • Movement: 4-directional (UP, DOWN, LEFT, RIGHT) - no diagonals
  • Cost: Uniform cost of 1 per move
  • Optimality: Guaranteed to find shortest path
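A compact sketch of such an A* search over these maps (illustrative only; the project's implementation lives in src/pathfinding/astar.ts and may differ):

type Point = { row: number; col: number };

// Minimal A* sketch on a character grid; returns moves like ["DOWN", "RIGHT", ...] or null.
function aStar(grid: string[][], start: Point, goal: Point): string[] | null {
  const key = (p: Point) => `${p.row},${p.col}`;
  const h = (p: Point) => Math.abs(p.row - goal.row) + Math.abs(p.col - goal.col); // Manhattan
  const moves: [string, number, number][] = [
    ['UP', -1, 0], ['DOWN', 1, 0], ['LEFT', 0, -1], ['RIGHT', 0, 1],
  ];

  const open: { p: Point; g: number; f: number; path: string[] }[] = [
    { p: start, g: 0, f: h(start), path: [] },
  ];
  const bestG = new Map<string, number>([[key(start), 0]]);

  while (open.length > 0) {
    // Naive priority queue: pick the node with the smallest f. Fine for small grids.
    open.sort((a, b) => a.f - b.f);
    const current = open.shift()!;
    if (current.p.row === goal.row && current.p.col === goal.col) return current.path;

    for (const [name, dr, dc] of moves) {
      const next = { row: current.p.row + dr, col: current.p.col + dc };
      if (grid[next.row]?.[next.col] === undefined || grid[next.row][next.col] === '#') continue;
      const g = current.g + 1; // Uniform cost of 1 per move.
      if (g < (bestG.get(key(next)) ?? Infinity)) {
        bestG.set(key(next), g);
        open.push({ p: next, g, f: g + h(next), path: [...current.path, name] });
      }
    }
  }
  return null; // No path exists.
}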

Cellular Automata Rules

  • Initialization: 45% wall density
  • Rule: Cell becomes wall if β‰₯5 neighbors are walls, otherwise floor
  • Iterations: Adaptive based on grid size (typically 5-8)
  • Post-processing: Flood fill + connectivity check + smoothing
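One iteration of that rule can be sketched as follows. This is illustrative; in particular, treating out-of-bounds neighbours as walls is an assumption here, and the real generator is in src/mapGenerator/cellularAutomata.ts.

// Sketch of a single cellular-automata smoothing step: a cell becomes a wall
// when 5 or more of its 8 neighbours are walls (out-of-bounds counted as wall, by assumption).
function caStep(grid: string[][]): string[][] {
  const rows = grid.length;
  const cols = grid[0].length;
  return grid.map((row, r) =>
    row.map((_, c) => {
      let walls = 0;
      for (let dr = -1; dr <= 1; dr++) {
        for (let dc = -1; dc <= 1; dc++) {
          if (dr === 0 && dc === 0) continue;
          const nr = r + dr, nc = c + dc;
          if (nr < 0 || nr >= rows || nc < 0 || nc >= cols || grid[nr][nc] === '#') walls++;
        }
      }
      return walls >= 5 ? '#' : '.';
    })
  );
}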

🀝 Contributing

This is a benchmark generation tool. Contributions welcome for:

  • Additional maze generation algorithms
  • Performance optimizations
  • Extended validation checks
  • Output format variations

πŸ“„ License

MIT License - see LICENSE file for details

πŸ™ Acknowledgments

Built with:

  • Bun - Fast JavaScript runtime
  • TypeScript - Type-safe development
  • Classic maze generation algorithms

Generated with ❀️ for LLM training and benchmarking
