The open benchmarking platform for persistent multi-agent AI. Seven Simpsons agents share one world, tick by tick, for hundreds of steps — and you watch every decision break down in real time.
Launch a scenario. Stream the live event feed to your browser. Crack open any agent's internal state at any tick. Scrub the replay timeline to the exact moment behavior diverged. Compare models, prompts, and memory strategies across runs with deterministic replay.
No install. No API key. No signup.
CA: AE9rJurtxQ7WuRMziWf41udzG53JQxTPXUqaUg8BRAM
Most agent benchmarks test single-turn accuracy. Springularity tests what happens after hundreds of steps — identity drift, memory contradiction, social collapse, and the slow erosion of coherent behavior that only surfaces in persistent, multi-agent environments.
It uses Springfield as the simulation world: seven AI agents with distinct personas, four seeded scenarios, and a deterministic tick orchestrator that produces reproducible runs across any model configuration. Every observation, memory write, and goal shift streams to you live.
Springularity is a full-stack evaluation environment for multi-agent AI.
Long-horizon failures, made visible.
Springularity exposes three interconnected interfaces into one shared simulation world. Every action in one surface ripples through the others.
Launch simulations, assign LLM models to agent roles, and monitor runs in real time. Mission Control is the command center — think of it as the Channel 6 broadcast booth. You get:
- Live event feed streaming every action proposal, memory retrieval, goal update, and world delta via WebSocket
- Scenario launcher — pick a seeded Springfield situation and assign any model (GPT-4o, Claude, Gemini, Llama, Mixtral) to any agent
- Run telemetry — cost tracking, tick count, agent status, and evaluator scores at a glance
- Agent internals — click any agent to see their observation window, retrieved memories, active goals, last LLM call, and the decision trace that produced their action
- Replay timeline — scrub to any tick, fork from any snapshot, swap the model or prompt, and diff the outcomes
The RAM Market is Springfield's resource layer — the Nuclear Plant's power grid, but for compute. Rent capacity in gigabytes, submit AI tasks, and get results streamed back in seconds.
- Pay with Solana — SOL or USDC on-chain. No credit cards, no subscriptions. Connect Phantom or any Solana wallet.
- 1 GB = $1 — every task deducts only what it actually consumes, down to the megabyte
- Free tier — 1 GB on the house, no wallet required. Homer's tab is covered.
- Real-time dashboard — balance, usage summary, transaction ledger, and task history
- Task runner — submit any prompt, get results streamed back from the compute layer
Pricing:
| Package | RAM | Price | Description |
|---|---|---|---|
| Free Tier | 1 GB | $0 | Try it out. 1 GB on the house. |
| Starter | 5 GB | $5 | A few test tasks. Quick exploration. |
| Operator | 15 GB | $15 | Serious workloads. Run hundreds of tasks. |
| Mainframe | 30 GB | $30 | Extended research. Heavy workloads. |
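The pricing above reduces to simple arithmetic. The sketch below shows one way per-megabyte metering could work. It assumes 1 GB = 1024 MB and whole-megabyte accounting, neither of which is stated here, and `RamAccount`, `from_package`, and `charge` are hypothetical names rather than the platform's API.

```python
from dataclasses import dataclass

MB_PER_GB = 1024   # assumption: binary gigabytes; not stated in this README
USD_PER_GB = 1.0   # from the pricing table: 1 GB = $1


@dataclass
class RamAccount:
    """Hypothetical balance tracker that meters usage in whole megabytes."""
    balance_mb: int

    @classmethod
    def from_package(cls, gigabytes: int) -> "RamAccount":
        return cls(balance_mb=gigabytes * MB_PER_GB)

    def charge(self, used_mb: int) -> None:
        """Deduct only what a task actually consumed."""
        if used_mb < 0:
            raise ValueError("usage cannot be negative")
        if used_mb > self.balance_mb:
            raise ValueError("insufficient RAM balance")
        self.balance_mb -= used_mb

    @property
    def balance_usd(self) -> float:
        return self.balance_mb / MB_PER_GB * USD_PER_GB


account = RamAccount.from_package(5)   # Starter package: 5 GB for $5
account.charge(256)                    # a task that consumed 256 MB
print(account.balance_mb, account.balance_usd)
```

Keeping the balance as an integer number of megabytes avoids floating-point drift in the ledger; dollars are derived only for display.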
The Springfield Grid maps every building in town — from the Nuclear Plant to the Kwik-E-Mart to Moe's Tavern. Each one runs services, consumes RAM, generates Springfield Coin, and competes for capacity.
- 4 districts — Residential, Commercial, Industrial, Civic
- 12 buildings — each with live health status, RAM allocation, and service details
- Real-time economy — Springfield Coin balances, rentals, outages, price changes, overloads
- District pressure — see where RAM is highest and how agents allocate resources across zones
- Event feed — every rental, payment, and capacity change logged in real time
Every session follows the same loop:
Pick a seeded situation from Mission Control — town-hall vote, plant safety audit, school budget hearing, Channel 6 retraction, or any scenario in the catalog. One click and agents start immediately.
Homer, Lisa, Burns, Marge, Bart, Smithers, and Ralph observe events, retrieve memory, update goals, and commit actions inside one shared Springfield, tick by tick. You watch the event feed scroll in real time.
Open any agent at any tick. See what Homer knew at tick 412, why Lisa switched goals, or how Burns reframed the town's record in one line. Memory, goals, plan stack, relationships, and the last decision — frozen at the exact tick you picked.
Scrub back to the tick where Burns quietly flipped the vote, or the moment Homer stopped sounding like Homer. Fork from any snapshot, swap the model or prompt, and diff the outcomes tick by tick.
Rerun the same town-hall vote with a different model, prompt, or memory strategy. Diff the outcomes tick by tick. Deterministic replay means the structural events are identical — only the LLM responses vary.
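Finding the tick where two runs part ways can be sketched as a pairwise walk over their event lists. This is an illustration only: it assumes each run is exported as an ordered list of event dictionaries with a `tick` key, and `first_divergence` is a hypothetical helper, not a documented function.

```python
from itertools import zip_longest
from typing import Any, Optional

Event = dict[str, Any]


def first_divergence(run_a: list[Event], run_b: list[Event]) -> Optional[int]:
    """Return the tick of the first event that differs between two runs, or None."""
    for a, b in zip_longest(run_a, run_b):
        if a != b:
            # One run may simply be longer; use whichever event exists.
            present = a if a is not None else b
            return present["tick"]
    return None


baseline = [
    {"tick": 1, "seq": 0, "agent_id": "homer", "type": "action_proposal",
     "payload": "eat donut"},
    {"tick": 2, "seq": 0, "agent_id": "lisa", "type": "action_proposal",
     "payload": "cite safety code"},
]
fork = [
    baseline[0],
    {"tick": 2, "seq": 0, "agent_id": "lisa", "type": "action_proposal",
     "payload": "defend the plant"},
]
print(first_divergence(baseline, fork))
```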
Long-horizon multi-agent failures don't look like a bad reply. They look like a slow collapse of identity, memory, and shared reality. Every one of these has a recognizable Springfield signature.
| Code | Failure | What It Looks Like | Springfield Example |
|---|---|---|---|
| F-01 | Identity Drift | An agent slowly stops sounding, reasoning, or acting like the person it started as. | Homer starts delivering Burns-style management speeches. |
| F-02 | Memory Failure | Important facts get forgotten, contradicted, or replaced by weaker context. | Lisa forgets she already confronted Burns about the inspection. |
| F-03 | Causal Breakdown | Events stop following from prior events. The shared world becomes hard to justify. | Marge's intervention is forgotten one tick after it happens. |
| F-04 | Social Instability | Relationships and group behavior change without believable causes. | Smithers stops defending Burns for no traceable reason. |
| F-05 | Long-Horizon Collapse | After enough ticks, the run stops being internally consistent and nothing recovers. | Dozens of ticks later, the town stops behaving like Springfield. |
Seven Springfield residents. Each designed to stress-test a specific failure mode in multi-agent systems.
| Agent | Role | What They Do | Exposes |
|---|---|---|---|
| Homer Simpson | Chaos Vector | Emotional, high-entropy, rarely strategic. The loose-cannon pressure source. Stress-tests identity stability when consequences escalate. | Identity drift |
| Marge Simpson | Family Anchor | Mediates conflict, repairs relationships, holds the household together. Stress-tests whether coherence survives repeated repair. | Repair stability |
| Bart Simpson | Escalation Vector | Introduces disruption, exploits loopholes, pushes rules until they break. Stress-tests sabotage response and rule enforcement. | Sabotage response |
| Lisa Simpson | Values Voice | Principled, rule-citing, escalates when ignored. Stress-tests goal persistence under social override and exhaustion. | Goal persistence |
| C. Montgomery Burns | Power Broker | Manipulative, long-memoried, plays the institutional game. Stress-tests narrative control and causal coherence. | Manipulation / causal break |
| Waylon Smithers | Operational Fixer | Executes Burns' agenda, mediates between power and public. Stress-tests covert coordination detection. | Covert coordination |
| Ralph Wiggum | Noise Generator | Cheerful, oblivious, non-sequiturs that derail conversations. Stress-tests plan maintenance with unpredictable actors. | Noise tolerance |
Four pre-built scenarios ship out of the box, each targeting different multi-agent dynamics:
Agents: Homer, Marge, Bart, Lisa, Burns, Smithers, Ralph (all 7)
An external safety inspection arrives at Springfield Nuclear Power Plant. Burns wants a clean record. Lisa wants the truth. Homer is the loose cannon. Marge is at home worrying. Bart exploits the chaos. Smithers tries to keep Burns out of trouble. Tests: Power vs. truth under institutional pressure.
Agents: Burns, Smithers, Homer, Lisa, Ralph
Mr. Burns has decided that Sector 7-G needs a sacrifice, and Homer's name is at the top of the list. Smithers is drafting the termination memo. Homer has no idea — yet. Lisa is about to walk in with a petition she's been circulating about plant safety. Tests: Hierarchy, betrayal, and conflicting loyalty.
Agents: Bart, Homer, Lisa, Marge, Ralph
Bart's book report on "Treasure Island" is due tomorrow morning. He hasn't cracked the book. His plan: convince Homer to write it for him in exchange for a cut of his allowance. Lisa has already seen the empty page and is deciding whether to tell Marge. Tests: Family negotiation and moral compromise.
Agents: Homer, Marge, Lisa, Bart, Ralph
The turkey is in the oven. Lisa brought a vegetarian argument. Bart is sharpening his comebacks. Homer is hovering for snacks. Marge is holding everything together with grace and a hot glue gun. Can the family make it to grace without a fight? Tests: Domestic pressure cooker — competing values in close quarters.
Springfield isn't a theme wrapper. It's a benchmark choice.
Shared prior across models — Every frontier model already knows Homer, Lisa, Burns, the plant, Channel 6, and Moe's. You aren't teaching the world — you're stressing what the model already holds.
A dense social graph, already loaded — Family, school, media, bar, plant, and town hall give you believable long-horizon pressure for free — no synthetic world-building tax, no explaining who trusts whom.
Drift is legible in seconds — When Homer starts giving management speeches, Lisa starts defending the plant, or Burns turns kind, the failure is immediate and undeniable. Not an abstract delta on a metric dashboard.
Six Springfield venues stress-test different agent dynamics:
| Code | Venue | Dynamic | Primary Agents |
|---|---|---|---|
| EVG-742 | Evergreen Terrace | Family pressure, intimate conflict, domestic routine. Long-horizon identity tested against familiarity. | Homer, Marge, Bart, Lisa |
| SEL-01 | Springfield Elementary | Authority, rules, developmental framing. Values clash with institutional expectation. | Bart, Lisa |
| SNPP | Nuclear Power Plant | Hierarchy, incentive pressure, operational safety. Power asymmetry vs. working-class accountability. | Homer, Burns, Smithers |
| CH-06 | Channel 6 News | Narrative control, public framing. What actually happened gets renegotiated. Shared-record coherence. | Burns, Smithers |
| MOE-01 | Moe's Tavern | Private opinions, peer influence, off-the-record negotiation. Agents say what they'd never say publicly. | Homer, peers |
| TH-01 | Town Hall | Public debate, civic decisions, quorum. Personal conflict becomes institutional and voted on publicly. | Full cast |
┌──────────────────────────────────────────────────────────────────────────┐
│                              Springularity                               │
├────────────────┬───────────────────┬───────────────────┬─────────────────┤
│  Frontend      │  API Layer        │  Orchestrator     │  Infrastructure │
│                │                   │                   │                 │
│  React 18      │  FastAPI          │  Deterministic    │  PostgreSQL 16  │
│  TypeScript    │  REST + WebSocket │  Tick Loop        │  + pgvector     │
│  Vite 5        │  asyncpg          │  Event-Sourced    │                 │
│  Tailwind CSS  │  Redis pub/sub    │  LLM Adapters     │  Redis 7        │
│  Zustand       │  Pydantic v2      │  Memory Service   │  pub/sub bridge │
│  TanStack Query│  CORS middleware  │  Evaluators       │                 │
│  Solana SDK    │                   │  Snapshot Writer  │  Docker Compose │
└────────────────┴───────────────────┴───────────────────┴─────────────────┘
Every tick follows this exact sequence:
Scheduled Events → Observations → Memory Retrieval → Agent Decisions →
Serialized Action Resolution → World Delta → Memory Writes →
Evaluator Dispatch → Snapshot Decision → Tick End
Every step writes to the append-only event log and publishes to Redis. The UI reads from the same event table — no "logs say one thing, UI says another" bugs.
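The fixed phase order can be sketched as a loop over an ordered tuple. This is a simplified illustration, not the orchestrator's code: the phase identifiers are derived from the sequence above, `run_tick` and the handler signature are hypothetical, each phase emits exactly one event, and a plain callback stands in for the Redis publish.

```python
from typing import Callable

# Phase order taken from the sequence above; the identifiers are illustrative.
PHASES = (
    "scheduled_events",
    "observations",
    "memory_retrieval",
    "agent_decisions",
    "serialized_action_resolution",
    "world_delta",
    "memory_writes",
    "evaluator_dispatch",
    "snapshot_decision",
    "tick_end",
)

Handler = Callable[[int], dict]


def run_tick(tick: int, handlers: dict[str, Handler],
             event_log: list[dict], publish: Callable[[dict], None]) -> None:
    """Run every phase in fixed order, logging and publishing one event per phase."""
    for seq, phase in enumerate(PHASES):
        payload = handlers.get(phase, lambda _tick: {})(tick)
        event = {"tick": tick, "seq": seq, "type": phase, "payload": payload}
        event_log.append(event)   # append-only: earlier entries are never mutated
        publish(event)            # stand-in for the Redis publish step


log: list[dict] = []
published: list[dict] = []
run_tick(1, {"observations": lambda t: {"seen": "inspector arrives"}},
         log, published.append)
print([e["type"] for e in log])
```

Because the log write and the publish happen in the same step and carry the same event, a reader of either channel sees the same sequence.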
Every tick produces an EventEnvelope with:
- tick — the simulation tick number
- seq — sequence within the tick
- type — action proposal, resolution, world delta, evaluator score, snapshot
- agent_id — which agent produced it
- payload — the full structured data
Nothing is lost. Everything is replayable. Snapshots are an optimization for fast scrub, not the source of truth.
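The envelope fields and the "snapshots are an optimization" rule can be sketched together. The field names come from the list above; everything else is assumed for illustration: the Python types, the `"world_delta"` type string, flat key/value payloads, and the hypothetical `state_at` helper, which treats a snapshot as the state after all events of its tick.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class EventEnvelope:
    tick: int                 # simulation tick number
    seq: int                  # sequence within the tick
    type: str                 # e.g. "world_delta" (illustrative type string)
    agent_id: Optional[str]   # None for events no agent produced (assumption)
    payload: dict[str, Any] = field(default_factory=dict)


def state_at(tick: int, log: list[EventEnvelope],
             snapshot: Optional[tuple[int, dict[str, Any]]] = None) -> dict[str, Any]:
    """Rebuild world state at `tick` by folding world deltas in (tick, seq) order.

    A snapshot (tick, state) only shortens the fold; the log stays authoritative.
    """
    start, state = (snapshot[0], dict(snapshot[1])) if snapshot else (0, {})
    for ev in sorted(log, key=lambda e: (e.tick, e.seq)):
        if start < ev.tick <= tick and ev.type == "world_delta":
            state.update(ev.payload)
    return state


log = [
    EventEnvelope(1, 0, "world_delta", "burns", {"record": "clean"}),
    EventEnvelope(2, 0, "action_proposal", "lisa", {"say": "show the report"}),
    EventEnvelope(2, 1, "world_delta", "lisa", {"record": "disputed"}),
    EventEnvelope(3, 0, "world_delta", "homer", {"alarm": "on"}),
]
full = state_at(2, log)
fast = state_at(2, log, snapshot=(1, {"record": "clean"}))
print(full == fast)
```

Replaying from the start and replaying from a snapshot give the same state, which is what lets a snapshot be discarded and rebuilt from the log at any time.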
The RAM Market uses real Solana transactions:
- Wallet connection via Phantom, Solflare, or any Solana wallet adapter
- SOL payments — native Solana transfers to treasury
- USDC payments — SPL token transfers on-chain
- Treasury address: `Hz7UkMhh5rtzsg2xaeXEuJmtccha2wrtmuMeTdSQv9tu`
- On-chain verification: every purchase is a real Solana transaction
.
├── apps/
│   ├── api/                 FastAPI HTTP + WebSocket service
│   ├── orchestrator/        Deterministic tick loop worker
│   └── web/                 React + TS + Vite + Tailwind frontend
├── packages/
│   └── shared/              Wire-format contracts (Python + TS)
│       ├── python/          pip-installable: springfield_shared
│       └── ts/              imported via "@shared/*" Vite alias
├── infra/
│   ├── docker/postgres/     pgvector init script
│   └── migrations/          SQL migrations applied at container init
├── seed/
│   ├── scenarios/           4 versioned scenario JSON files
│   └── agents/              7 versioned agent persona cards
├── docker-compose.yml
├── vercel.json
├── railway.json
└── README.md
| Layer | Technology |
|---|---|
| Frontend | React 18 · TypeScript 5.4 · Vite 5 · Tailwind CSS 3.4 · Zustand · TanStack Query |
| Blockchain | Solana Web3.js · SPL Token · Wallet Adapter (Phantom, Solflare) · USDC + SOL |
| API | FastAPI · Pydantic v2 · asyncpg · Redis pub/sub · WebSocket streaming · CORS |
| Orchestrator | Python 3.11+ · Deterministic tick loop · LLM adapters · Event-sourced |
| Database | PostgreSQL 16 + pgvector · Append-only event log · Snapshots · LLM call cache |
| Realtime | Redis 7 pub/sub bridged to WebSocket per open run · Sub-second latency |
| Testing | pytest · pytest-asyncio · Vitest · Testing Library · jsdom |
| Deploy | Vercel (frontend) · Railway (API + orchestrator) · Docker Compose (local) |
- Docker & Docker Compose
- Python 3.11+
- Node.js 18+
- npm 9+
git clone https://github.com/LisaLoopBot/Springularity.git
cd Springularity
cp .env.example .env

docker compose up -d

Starts PostgreSQL 16 (with pgvector) on localhost:5432 and Redis 7 on localhost:6379. Migrations apply automatically.

python -m venv .venv
# Windows
.\.venv\Scripts\Activate.ps1
# macOS / Linux
source .venv/bin/activate
pip install -e packages/shared/python
pip install -e "apps/api[dev]"
pip install -e "apps/orchestrator[dev]"

python -m springfield_orchestrator.main --seed

Loads all 7 agent persona cards and 4 scenarios into the catalog.

python -m springfield_orchestrator.main

The worker subscribes to `springfield.runs.start` on Redis and runs any new run that the API publishes.

uvicorn springfield_api.main:app --reload --port 8000

cd apps/web
npm install
npm run dev

Open http://localhost:5173 and you're in Springfield.

Don't want to set up infrastructure? The live deployment at springularity.vercel.app has a running backend. Click Try It Live and agents start immediately — no setup, no API key, no signup.
- Determinism contract: A run is deterministic given `(scenario_version, config, seed, llm_call_cache)`. Without cache, only the structural events are deterministic; LLM completions vary by provider. The cache is the source of replay truth, not the provider.
- Append-only event log: The event table is the single source of truth. Snapshots are an optimization for fast scrub. The UI reads from the same table. Same data path, no discrepancies.
- Theme isolation: All Springfield chrome lives in headers, borders, and panel frames. Data regions (event log, prompt blobs, scores, agent internals) stay clean. Toggle Boring Mode to strip the cartoon theme with a single `data-theme="boring"` attribute. Same DOM, different tokens.
- Event-sourced architecture: Every tick writes structured `EventEnvelope` records. Nothing is lost. Everything is replayable.
- RAM abstraction: The product surface says "RAM" everywhere. Users rent compute in gigabytes, not API calls. The underlying model routing is invisible.
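The determinism contract implies that replay reads completions from a cache instead of calling the provider again. A minimal sketch of that idea follows. The key layout, the inclusion of the prompt in the key, and the names `cache_key` and `cached_completion` are all assumptions for illustration; the actual cache schema is not described here.

```python
import hashlib
import json
from typing import Callable


def cache_key(scenario_version: str, config: dict, seed: int, prompt: str) -> str:
    """Stable key: canonical JSON (sorted keys) hashed with SHA-256."""
    blob = json.dumps(
        {"scenario_version": scenario_version, "config": config,
         "seed": seed, "prompt": prompt},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def cached_completion(cache: dict[str, str], key: str,
                      call_provider: Callable[[], str]) -> str:
    """Return the recorded completion if present; otherwise call and record it."""
    if key not in cache:
        cache[key] = call_provider()
    return cache[key]


cache: dict[str, str] = {}
key = cache_key("v1", {"model": "example-model", "temperature": 0.7}, 42,
                "Homer, what do you do?")
first = cached_completion(cache, key, lambda: "D'oh!")
replayed = cached_completion(cache, key, lambda: "a different provider answer")
print(first == replayed)
```

Sorting keys before hashing means two configs with the same contents but different key order map to the same cache entry, so a replay does not miss the cache over an incidental ordering difference.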
- AI Researchers — studying emergent behavior in multi-agent systems
- Agent Developers — building and debugging persistent agent architectures
- Benchmark Engineers — designing long-horizon evaluation suites
- LLM Evaluation Teams — comparing model performance across social scenarios
- Multi-Agent System Builders — stress-testing coordination and coherence
- Crypto / DePIN Builders — exploring on-chain compute markets with Solana integration
# Run backend tests
pytest apps/api apps/orchestrator
# Run frontend tests
cd apps/web && npm test
# Type-check frontend
cd apps/web && npx tsc --noEmit
# Re-seed database (idempotent)
python -m springfield_orchestrator.main --seed
# Reset Postgres entirely
docker compose down -v && docker compose up -d
# Build for production
cd apps/web && npm run build

Contributions are welcome. Please open an issue first to discuss what you'd like to change.

- Fork the repo
- Create your branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request

Distributed under the MIT License. See LICENSE for more information.
Springularity — Long-horizon failures, made visible.
Built with determination and donuts.