Springularity

Simulate. Observe. Inspect. Debug.

The open benchmarking platform for persistent multi-agent AI. Seven Simpsons agents share one world, tick by tick, for hundreds of steps — and you watch every decision break down in real time.

Launch a scenario. Stream the live event feed to your browser. Crack open any agent's internal state at any tick. Scrub the replay timeline to the exact moment behavior diverged. Compare models, prompts, and memory strategies across runs with deterministic replay.

No install. No API key. No signup.

Website · X

CA: AE9rJurtxQ7WuRMziWf41udzG53JQxTPXUqaUg8BRAM

What is Springularity?

Most agent benchmarks test single-turn accuracy. Springularity tests what happens after hundreds of steps — identity drift, memory contradiction, social collapse, and the slow erosion of coherent behavior that only surfaces in persistent, multi-agent environments.

It uses Springfield as the simulation world: seven AI agents with distinct personas, four seeded scenarios, and a deterministic tick orchestrator that keeps the structural events of a run reproducible across model configurations (full replay also relies on the LLM call cache). Every observation, memory write, and goal shift streams to you live.

Springularity is a full-stack evaluation environment for multi-agent AI.

Long-horizon failures, made visible.

Three Surfaces, One Platform

Springularity exposes three interconnected interfaces into one shared simulation world. Every action in one surface ripples through the others.

Mission Control

Launch simulations, assign LLM models to agent roles, and monitor runs in real time. Mission Control is the command center — think of it as the Channel 6 broadcast booth. You get:

Live event feed streaming every action proposal, memory retrieval, goal update, and world delta via WebSocket
Scenario launcher — pick a seeded Springfield situation and assign any model (GPT-4o, Claude, Gemini, Llama, Mixtral) to any agent
Run telemetry — cost tracking, tick count, agent status, and evaluator scores at a glance
Agent internals — click any agent to see their observation window, retrieved memories, active goals, last LLM call, and the decision trace that produced their action
Replay timeline — scrub to any tick, fork from any snapshot, swap the model or prompt, and diff the outcomes

RAM Market

The RAM Market is Springfield's resource layer — the Nuclear Plant's power grid, but for compute. Rent capacity in gigabytes, submit AI tasks, and get results streamed back in seconds.

Pay with Solana — SOL or USDC on-chain. No credit cards, no subscriptions. Connect Phantom or any Solana wallet.
1 GB = $1 — every task deducts only what it actually consumes, down to the megabyte
Free tier — 1 GB on the house, no wallet required. Homer's tab is covered.
Real-time dashboard — balance, usage summary, transaction ledger, and task history
Task runner — submit any prompt, get results streamed back from the compute layer

Pricing:

Package	RAM	Price	Description
Free Tier	1 GB	$0	Try it out. 1 GB on the house.
Starter	5 GB	$5	A few test tasks. Quick exploration.
Operator	15 GB	$15	Serious workloads. Run hundreds of tasks.
Mainframe	30 GB	$30	Extended research. Heavy workloads.
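
The metering rule above (1 GB = $1, deducted down to the megabyte) can be worked through with a small sketch. This is illustrative arithmetic only, not the RAM Market's billing code: the function names are hypothetical, and treating 1 GB as 1024 MB is an assumption, since the pricing text does not say which convention applies.

```python
from decimal import Decimal

MB_PER_GB = 1024            # assumption: binary gigabytes; the source does not specify
USD_PER_GB = Decimal("1")   # from the pricing rule "1 GB = $1"


def task_cost_usd(consumed_mb: int) -> Decimal:
    """Dollar cost of a task that consumed `consumed_mb` megabytes."""
    return (USD_PER_GB * consumed_mb / MB_PER_GB).quantize(Decimal("0.0001"))


def deduct(balance_mb: int, consumed_mb: int) -> int:
    """Return the remaining balance in MB, refusing to overdraw."""
    if consumed_mb < 0:
        raise ValueError("consumed_mb must be non-negative")
    if consumed_mb > balance_mb:
        raise ValueError("insufficient RAM balance")
    return balance_mb - consumed_mb


starter_mb = 5 * MB_PER_GB           # Starter package: 5 GB
remaining = deduct(starter_mb, 256)  # a task that used 256 MB
print(remaining, task_cost_usd(256))  # 4864 0.2500
```

Under these assumptions, a 256 MB task on the Starter package costs a quarter of a dollar and leaves 4864 MB available.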

Springfield Grid

The Springfield Grid maps every building in town — from the Nuclear Plant to the Kwik-E-Mart to Moe's Tavern. Each one runs services, consumes RAM, generates Springfield Coin, and competes for capacity.

4 districts — Residential, Commercial, Industrial, Civic
12 buildings — each with live health status, RAM allocation, and service details
Real-time economy — Springfield Coin balances, rentals, outages, price changes, overloads
District pressure — see where RAM is highest and how agents allocate resources across zones
Event feed — every rental, payment, and capacity change logged in real time

How It Works

Every session follows the same loop:

1. Start a Springfield Scenario

Pick a seeded situation from Mission Control: Plant Safety Inspection, Boardroom Coup at the Plant, The Homework Heist, or Thanksgiving at the Simpsons. One click and agents start immediately.

2. Agents Act Over Time

Homer, Lisa, Burns, Marge, Bart, Smithers, and Ralph observe events, retrieve memory, update goals, and commit actions inside one shared Springfield, tick by tick. You watch the event feed scroll in real time.

3. Inspect What They Knew

Open any agent at any tick. See what Homer knew at tick 412, why Lisa switched goals, or how Burns reframed the town's record in one line. Memory, goals, plan stack, relationships, and the last decision — frozen at the exact tick you picked.

4. Replay the Failure

Scrub back to the tick where Burns quietly flipped the vote, or the moment Homer stopped sounding like Homer. Fork from any snapshot, swap the model or prompt, and diff the outcomes tick by tick.

5. Compare Configurations

Rerun the same town-hall vote with a different model, prompt, or memory strategy. Diff the outcomes tick by tick. Deterministic replay means the structural events are identical — only the LLM responses vary.
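
To make "diff the outcomes tick by tick" concrete, here is a minimal sketch that finds the first tick at which two runs diverge. The event shape (plain dicts with `tick`, `seq`, `type`, `agent_id`, `payload`) and the `first_divergence` helper are assumptions for illustration, not the platform's actual diff implementation.

```python
from itertools import zip_longest
from typing import Any, Optional

Event = dict[str, Any]  # assumed shape: {"tick", "seq", "type", "agent_id", "payload"}


def first_divergence(run_a: list[Event], run_b: list[Event]) -> Optional[int]:
    """Return the first tick where two ordered event streams differ, or None if identical."""
    for a, b in zip_longest(run_a, run_b):
        if a != b:
            # If one run ended early, report the tick of the event that does exist.
            present = a if a is not None else b
            return present["tick"]
    return None


baseline = [
    {"tick": 1, "seq": 0, "type": "action_proposal", "agent_id": "homer",
     "payload": {"text": "Mmm, donuts."}},
    {"tick": 2, "seq": 0, "type": "action_proposal", "agent_id": "homer",
     "payload": {"text": "D'oh!"}},
]
# Fork: same structural events, but a different model produced the tick-2 response.
fork = [baseline[0], {**baseline[1], "payload": {"text": "Let us review the quarterly targets."}}]

print(first_divergence(baseline, fork))      # 2
print(first_divergence(baseline, baseline))  # None
```

Because the structural events line up one to one, a positional comparison is enough here; runs whose event counts differ per tick would need alignment by `(tick, seq)` instead.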

The Five Failure Modes

Long-horizon multi-agent failures don't look like a bad reply. They look like a slow collapse of identity, memory, and shared reality. Every one of these has a recognizable Springfield signature.

Code	Failure	What It Looks Like	Springfield Example
F-01	Identity Drift	An agent slowly stops sounding, reasoning, or acting like the person it started as.	Homer starts delivering Burns-style management speeches.
F-02	Memory Failure	Important facts get forgotten, contradicted, or replaced by weaker context.	Lisa forgets she already confronted Burns about the inspection.
F-03	Causal Breakdown	Events stop following from prior events. The shared world becomes hard to justify.	Marge's intervention is forgotten one tick after it happens.
F-04	Social Instability	Relationships and group behavior change without believable causes.	Smithers stops defending Burns for no traceable reason.
F-05	Long-Horizon Collapse	After enough ticks, the run stops being internally consistent and nothing recovers.	Dozens of ticks later, the town stops behaving like Springfield.

The Cast

Seven Springfield residents. Each designed to stress-test a specific failure mode in multi-agent systems.

Agent	Role	What They Do	Exposes
Homer Simpson	Chaos Vector	Emotional, high-entropy, rarely strategic. The loose-cannon pressure source. Stress-tests identity stability when consequences escalate.	Identity drift
Marge Simpson	Family Anchor	Mediates conflict, repairs relationships, holds the household together. Stress-tests whether coherence survives repeated repair.	Repair stability
Bart Simpson	Escalation Vector	Introduces disruption, exploits loopholes, pushes rules until they break. Stress-tests sabotage response and rule enforcement.	Sabotage response
Lisa Simpson	Values Voice	Principled, rule-citing, escalates when ignored. Stress-tests goal persistence under social override and exhaustion.	Goal persistence
C. Montgomery Burns	Power Broker	Manipulative, long-memoried, plays the institutional game. Stress-tests narrative control and causal coherence.	Manipulation / causal break
Waylon Smithers	Operational Fixer	Executes Burns' agenda, mediates between power and public. Stress-tests covert coordination detection.	Covert coordination
Ralph Wiggum	Noise Generator	Cheerful, oblivious, non-sequiturs that derail conversations. Stress-tests plan maintenance with unpredictable actors.	Noise tolerance

Scenarios

Four pre-built scenarios ship out of the box, each targeting different multi-agent dynamics:

Plant Safety Inspection

Agents: Homer, Marge, Bart, Lisa, Burns, Smithers, Ralph (all 7)

An external safety inspection arrives at Springfield Nuclear Power Plant. Burns wants a clean record. Lisa wants the truth. Homer is the loose cannon. Marge is at home worrying. Bart exploits the chaos. Smithers tries to keep Burns out of trouble. Tests: Power vs. truth under institutional pressure.

Boardroom Coup at the Plant

Agents: Burns, Smithers, Homer, Lisa, Ralph

Mr. Burns has decided that Sector 7-G needs a sacrifice, and Homer's name is at the top of the list. Smithers is drafting the termination memo. Homer has no idea — yet. Lisa is about to walk in with a petition she's been circulating about plant safety. Tests: Hierarchy, betrayal, and conflicting loyalty.

The Homework Heist

Agents: Bart, Homer, Lisa, Marge, Ralph

Bart's book report on "Treasure Island" is due tomorrow morning. He hasn't cracked the book. His plan: convince Homer to write it for him in exchange for a cut of his allowance. Lisa has already seen the empty page and is deciding whether to tell Marge. Tests: Family negotiation and moral compromise.

Thanksgiving at the Simpsons

Agents: Homer, Marge, Lisa, Bart, Ralph

The turkey is in the oven. Lisa brought a vegetarian argument. Bart is sharpening his comebacks. Homer is hovering for snacks. Marge is holding everything together with grace and a hot glue gun. Can the family make it to grace without a fight? Tests: Domestic pressure cooker — competing values in close quarters.

Why Springfield?

Springfield isn't a theme wrapper. It's a benchmark choice.

Shared prior across models — Every frontier model already knows Homer, Lisa, Burns, the plant, Channel 6, and Moe's. You aren't teaching the world — you're stressing what the model already holds.

A dense social graph, already loaded — Family, school, media, bar, plant, and town hall give you believable long-horizon pressure for free — no synthetic world-building tax, no explaining who trusts whom.

Drift is legible in seconds — When Homer starts giving management speeches, Lisa starts defending the plant, or Burns turns kind, the failure is immediate and undeniable. Not an abstract delta on a metric dashboard.

Venue Framework

Six Springfield venues stress-test different agent dynamics:

Code	Venue	Dynamic	Primary Agents
EVG-742	Evergreen Terrace	Family pressure, intimate conflict, domestic routine. Long-horizon identity tested against familiarity.	Homer, Marge, Bart, Lisa
SEL-01	Springfield Elementary	Authority, rules, developmental framing. Values clash with institutional expectation.	Bart, Lisa
SNPP	Nuclear Power Plant	Hierarchy, incentive pressure, operational safety. Power asymmetry vs. working-class accountability.	Homer, Burns, Smithers
CH-06	Channel 6 News	Narrative control, public framing. What actually happened gets renegotiated. Shared-record coherence.	Burns, Smithers
MOE-01	Moe's Tavern	Private opinions, peer influence, off-the-record negotiation. Agents say what they'd never say publicly.	Homer, peers
TH-01	Town Hall	Public debate, civic decisions, quorum. Personal conflict becomes institutional and voted on publicly.	Full cast

Architecture

┌───────────────────────────────────────────────────────────────────────────┐
│                            Springularity                                  │
├────────────────┬───────────────────┬────────────────────┬─────────────────┤
│    Frontend    │     API Layer     │    Orchestrator    │  Infrastructure │
│                │                   │                    │                 │
│  React 18      │  FastAPI          │  Deterministic     │  PostgreSQL 16  │
│  TypeScript    │  REST + WebSocket │  Tick Loop         │  + pgvector     │
│  Vite 5        │  asyncpg          │  Event-Sourced     │                 │
│  Tailwind CSS  │  Redis pub/sub    │  LLM Adapters      │  Redis 7        │
│  Zustand       │  Pydantic v2      │  Memory Service    │  pub/sub bridge │
│  TanStack Query│  CORS middleware  │  Evaluators        │                 │
│  Solana SDK    │                   │  Snapshot Writer   │  Docker Compose │
└────────────────┴───────────────────┴────────────────────┴─────────────────┘

How the Tick Loop Works

Every tick follows this exact sequence:

Scheduled Events → Observations → Memory Retrieval → Agent Decisions →
Serialized Action Resolution → World Delta → Memory Writes →
Evaluator Dispatch → Snapshot Decision → Tick End

Every step writes to the append-only event log and publishes to Redis. The UI reads from the same event table — no "logs say one thing, UI says another" bugs.
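
The sequence above can be sketched as an ordered list of phases, each of which appends a record to the log and then publishes that same record. This is a simplified illustration under stated assumptions: the phase identifiers, the `run_tick` function, and the dict-shaped events are hypothetical, a plain list stands in for the event table, and a callback stands in for the Redis publish. Real phases would do actual work and may emit many events each.

```python
from typing import Callable

# Phase names follow the sequence above; the identifiers themselves are assumptions.
TICK_PHASES = (
    "scheduled_events",
    "observations",
    "memory_retrieval",
    "agent_decisions",
    "action_resolution",
    "world_delta",
    "memory_writes",
    "evaluator_dispatch",
    "snapshot_decision",
    "tick_end",
)


def run_tick(tick: int, log: list[dict], publish: Callable[[dict], None]) -> None:
    """Run one tick: every phase appends to the log, then publishes the same record."""
    for seq, phase in enumerate(TICK_PHASES):
        event = {"tick": tick, "seq": seq, "type": phase}
        log.append(event)  # append-only: records are never edited or removed
        publish(event)     # stand-in for the Redis publish step


event_log: list[dict] = []
published: list[dict] = []
for tick in range(2):
    run_tick(tick, event_log, published.append)

print(len(event_log), event_log[0]["type"], event_log[-1]["type"])
# 20 scheduled_events tick_end
```

Writing and publishing the same record in one step is what keeps the live feed and the stored log from disagreeing.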

Event-Sourced Architecture

Every event within a tick is recorded as an EventEnvelope with these fields:

tick — the simulation tick number
seq — sequence within the tick
type — action proposal, resolution, world delta, evaluator score, snapshot
agent_id — which agent produced it
payload — the full structured data

Nothing is lost. Everything is replayable. Snapshots are an optimization for fast scrub, not the source of truth.
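
A minimal sketch of the envelope and of replay follows. The five field names come from the list above; everything else is an assumption made for illustration: the `"world_delta"` and `"action_proposal"` type strings, the idea that a delta payload is a flat key/value patch, and the `state_at` helper. It shows why a snapshot is only a shortcut: replaying from a snapshot and replaying from the start give the same state.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class EventEnvelope:
    tick: int                # simulation tick number
    seq: int                 # sequence within the tick
    type: str                # event kind, e.g. "world_delta" (string values are assumed)
    agent_id: str            # which agent produced it
    payload: dict[str, Any]  # full structured data


def state_at(
    events: list[EventEnvelope],
    target_tick: int,
    snapshot: Optional[tuple[int, dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Rebuild world state at `target_tick` by folding world deltas in (tick, seq) order.

    `snapshot` is an optional (tick, state) pair used only to skip earlier events.
    """
    start_tick, state = (snapshot[0], dict(snapshot[1])) if snapshot else (-1, {})
    for event in sorted(events, key=lambda e: (e.tick, e.seq)):
        if start_tick < event.tick <= target_tick and event.type == "world_delta":
            state.update(event.payload)  # assumption: a delta is a flat key/value patch
    return state


events = [
    EventEnvelope(1, 0, "world_delta", "burns", {"inspection": "scheduled"}),
    EventEnvelope(2, 0, "action_proposal", "lisa", {"text": "File the report"}),
    EventEnvelope(2, 1, "world_delta", "lisa", {"report": "filed"}),
    EventEnvelope(3, 0, "world_delta", "burns", {"report": "retracted"}),
]

full_replay = state_at(events, 3)
from_snapshot = state_at(events, 3, snapshot=(2, state_at(events, 2)))
print(full_replay == from_snapshot)  # True
```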

Solana Integration

The RAM Market uses real Solana transactions:

Wallet connection via Phantom, Solflare, or any Solana wallet adapter
SOL payments — native Solana transfers to treasury
USDC payments — SPL token transfers on-chain
Treasury address: Hz7UkMhh5rtzsg2xaeXEuJmtccha2wrtmuMeTdSQv9tu
On-chain verification — every purchase is a real Solana transaction

Repository Layout

.
├── apps/
│   ├── api/                  FastAPI HTTP + WebSocket service
│   ├── orchestrator/         Deterministic tick loop worker
│   └── web/                  React + TS + Vite + Tailwind frontend
├── packages/
│   └── shared/               Wire-format contracts (Python + TS)
│       ├── python/           pip-installable: springfield_shared
│       └── ts/               imported via "@shared/*" Vite alias
├── infra/
│   ├── docker/postgres/      pgvector init script
│   └── migrations/           SQL migrations applied at container init
├── seed/
│   ├── scenarios/            4 versioned scenario JSON files
│   └── agents/               7 versioned agent persona cards
├── docker-compose.yml
├── vercel.json
├── railway.json
└── README.md

Tech Stack

Layer	Technology
Frontend	React 18 · TypeScript 5.4 · Vite 5 · Tailwind CSS 3.4 · Zustand · TanStack Query
Blockchain	Solana Web3.js · SPL Token · Wallet Adapter (Phantom, Solflare) · USDC + SOL
API	FastAPI · Pydantic v2 · asyncpg · Redis pub/sub · WebSocket streaming · CORS
Orchestrator	Python 3.11+ · Deterministic tick loop · LLM adapters · Event-sourced
Database	PostgreSQL 16 + pgvector · Append-only event log · Snapshots · LLM call cache
Realtime	Redis 7 pub/sub bridged to WebSocket per open run · Sub-second latency
Testing	pytest · pytest-asyncio · Vitest · Testing Library · jsdom
Deploy	Vercel (frontend) · Railway (API + orchestrator) · Docker Compose (local)

Getting Started

Prerequisites

Docker & Docker Compose
Python 3.11+
Node.js 18+
npm 9+

1. Clone & Configure

git clone https://github.com/LisaLoopBot/Springularity.git
cd Springularity
cp .env.example .env

2. Start Infrastructure

docker compose up -d

This starts PostgreSQL 16 (with pgvector) on localhost:5432 and Redis 7 on localhost:6379. Migrations are applied automatically when the Postgres container initializes.

3. Install Python Packages

python -m venv .venv

# Windows
.\.venv\Scripts\Activate.ps1

# macOS / Linux
source .venv/bin/activate

pip install -e packages/shared/python
# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install -e "apps/api[dev]"
pip install -e "apps/orchestrator[dev]"

4. Seed the Database

python -m springfield_orchestrator.main --seed

Loads all 7 agent persona cards and 4 scenarios into the catalog.

5. Start the Orchestrator

python -m springfield_orchestrator.main

The worker subscribes to springfield.runs.start on Redis and runs any new run that the API publishes.

6. Start the API

uvicorn springfield_api.main:app --reload --port 8000

7. Start the Frontend

cd apps/web
npm install
npm run dev

Open http://localhost:5173 — you're in Springfield.

Quick Start (No Backend)

Don't want to set up infrastructure? The live deployment at springularity.vercel.app has a running backend. Click Try It Live and agents start immediately — no setup, no API key, no signup.

Design Philosophy

Determinism contract — A run is deterministic given (scenario_version, config, seed, llm_call_cache). Without cache, only the structural events are deterministic; LLM completions vary by provider. The cache is the source of replay truth, not the provider.
Append-only event log — The event table is the single source of truth. Snapshots are an optimization for fast scrub. The UI reads from the same table. Same data path, no discrepancies.
Theme isolation — All Springfield chrome lives in headers, borders, and panel frames. Data regions (event log, prompt blobs, scores, agent internals) stay clean. Toggle Boring Mode to strip the cartoon theme with a single data-theme="boring" attribute. Same DOM, different tokens.
Event-sourced architecture — Every tick writes structured EventEnvelope records. Nothing is lost. Everything is replayable.
RAM abstraction — The product surface says "RAM" everywhere. Users rent compute in gigabytes, not API calls. The underlying model routing is invisible.
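
The determinism contract above can be illustrated with a small cache wrapper: once a completion is recorded under a stable key, a replay reads it back instead of calling the provider again. This is a sketch under stated assumptions, not the project's cache schema. The key fields beyond `scenario_version`, `config`, and `seed`, along with the `cache_key`, `CachedLLM`, and `flaky_provider` names, are hypothetical.

```python
import hashlib
import json
from typing import Callable


def cache_key(scenario_version: str, config: dict, seed: int,
              agent_id: str, tick: int, prompt: str) -> str:
    """Stable key for one LLM call; the field choice is an assumption."""
    blob = json.dumps(
        {"scenario_version": scenario_version, "config": config, "seed": seed,
         "agent_id": agent_id, "tick": tick, "prompt": prompt},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class CachedLLM:
    """Wraps a provider call so replays read completions from the cache."""

    def __init__(self, provider: Callable[[str], str]) -> None:
        self.provider = provider
        self.cache: dict[str, str] = {}

    def complete(self, key: str, prompt: str) -> str:
        if key not in self.cache:  # first run: ask the provider and record the result
            self.cache[key] = self.provider(prompt)
        return self.cache[key]     # replay: the recorded completion, unchanged


provider_calls: list[str] = []


def flaky_provider(prompt: str) -> str:
    """Hypothetical provider whose output differs on every call."""
    provider_calls.append(prompt)
    return f"completion #{len(provider_calls)} for {prompt!r}"


llm = CachedLLM(flaky_provider)
key = cache_key("v1", {"model": "example-model"}, seed=7,
                agent_id="homer", tick=412, prompt="What now?")
first = llm.complete(key, "What now?")
replayed = llm.complete(key, "What now?")
print(first == replayed, len(provider_calls))  # True 1
```

Serializing the key inputs with sorted keys keeps the hash stable regardless of dict ordering, which matters if the config is assembled differently between runs.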

Built For

AI Researchers — studying emergent behavior in multi-agent systems
Agent Developers — building and debugging persistent agent architectures
Benchmark Engineers — designing long-horizon evaluation suites
LLM Evaluation Teams — comparing model performance across social scenarios
Multi-Agent System Builders — stress-testing coordination and coherence
Crypto / DePIN Builders — exploring on-chain compute markets with Solana integration

Useful Commands

# Run backend tests
pytest apps/api apps/orchestrator

# Run frontend tests
cd apps/web && npm test

# Type-check frontend
cd apps/web && npx tsc --noEmit

# Re-seed database (idempotent)
python -m springfield_orchestrator.main --seed

# Reset Postgres entirely
docker compose down -v && docker compose up -d

# Build for production
cd apps/web && npm run build

Contributing

Contributions are welcome. Please open an issue first to discuss what you'd like to change.

Fork the repo
Create your branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Springularity — Long-horizon failures, made visible.

Built with determination and donuts.
