Delphi

Synthetic populations as a computational substrate. Built on Gemini 3.5 Flash for the Google I/O Hackathon, 2026-05-23.

You ask any question (Will the Fed cut rates in Q3? Pretest this tagline. Stress-test this decision.) and a swarm of Gemini 3.5 Flash sub-agents reasons about it in parallel. Each sub-agent role-plays a different American persona generated from US Census–aligned demographic axes and is grounded in live web search. In ~60 seconds you get back: a probabilistic forecast with a confidence interval, the strongest reasons for and against drawn from the agents' own reasoning, the demographic axes where groups diverged most, a striking outlier quote, and a Wall Street Journal–style synthesis paragraph, all produced by one final Gemini call after aggregation.
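One way to picture "the demographic axes where groups diverged most" is to group agent forecasts by each axis value and rank the axes by how far apart the group means sit. The sketch below is an illustration only: the record fields (`persona`, `probability`), the axis names, and the max-minus-min spread metric are assumptions, not the repository's actual schema or method.

```python
from collections import defaultdict
from statistics import mean


def axis_divergence(agents, axes):
    """Rank axes by the spread (max - min) of per-group mean forecasts."""
    spreads = {}
    for axis in axes:
        groups = defaultdict(list)
        for agent in agents:
            # Bucket each agent's forecast under its value on this axis.
            groups[agent["persona"][axis]].append(agent["probability"])
        group_means = [mean(values) for values in groups.values()]
        spreads[axis] = max(group_means) - min(group_means)
    # Largest spread first: these are the axes where groups disagree most.
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)


# Hypothetical agent records; real personas would carry many more axes.
agents = [
    {"persona": {"age_band": "18-29", "region": "West"}, "probability": 0.25},
    {"persona": {"age_band": "18-29", "region": "South"}, "probability": 0.75},
    {"persona": {"age_band": "65+", "region": "West"}, "probability": 0.25},
    {"persona": {"age_band": "65+", "region": "South"}, "probability": 0.75},
]

ranked = axis_divergence(agents, ["age_band", "region"])
for axis, spread in ranked:
    print(f"{axis}: spread of group means = {spread:.2f}")
```

In this toy data the two regions disagree while the two age bands do not, so `region` ranks first.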

It is not a chatbot. It is not a copilot. It is a new primitive.

Headline result

Live, on stage, N = 500 synthetic Americans, the question "Will the Fed cut interest rates in Q3 2026?":

Metric	Before shock	After shock	Δ
Headline	13.3%	79.5%	+66.2 pp
±1σ band	[7%, 19%]	[73%, 86%]	width 12 pp → 13 pp
Bullish bucket (≥ 0.8)	0%	62%	+62 pp
Per-agent success	363 / 500 (72.6%)	409 / 500 (81.8%)	+46 agents (+9.2 pp)

Shock injected: "Surprise May CPI prints 2.1%, well below forecasts."
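The table's three summary numbers can be reproduced from a list of per-agent probabilities. The following is a minimal sketch, assuming the headline is the mean of the valid forecasts, the band is the mean plus or minus one sample standard deviation, and failed agents are dropped before aggregation. The README does not state how the band is actually computed, so treat `aggregate` and its return shape as hypothetical.

```python
from statistics import mean, stdev


def aggregate(probabilities, bullish_threshold=0.8):
    """Summarise valid per-agent forecasts (each a probability in [0, 1])."""
    if len(probabilities) < 2:
        raise ValueError("need at least two valid forecasts")
    headline = mean(probabilities)
    sigma = stdev(probabilities)  # sample standard deviation (assumption)
    # Clamp the band so it stays a valid probability range.
    low = max(0.0, headline - sigma)
    high = min(1.0, headline + sigma)
    # Share of agents at or above the bullish threshold.
    bullish = sum(p >= bullish_threshold for p in probabilities) / len(probabilities)
    return {"headline": headline, "band": (low, high), "bullish_share": bullish}


# Toy input, not data from the live run.
result = aggregate([0.7, 0.8, 0.9])
print(result)
```

Running the same function on the pre-shock and post-shock forecast lists would give the two columns being compared.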

The post-shock synthesis, verbatim from a Gemini call after aggregation:

A strong majority of forecasters expect the Federal Reserve to cut interest rates in Q3 2026, driven by conviction that the surprise 2.1% May CPI print removes any remaining justification for restrictive monetary policy.

The dissenting view, surfaced by the same call:

Institutional preference for multi-month trend lines over a single data point. Concerns that premature easing could trigger a secondary wave of inflation.

The agents genuinely re-reasoned given the new context. Not keyword substitution.
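The shock re-run can be thought of as appending the new fact to each agent's context and asking the same question again. A minimal sketch of that idea follows; `build_prompt` and its wording are hypothetical and are not the prompt used in the repository.

```python
def build_prompt(persona, question, shocks=()):
    """Assemble one agent's prompt; shocks are appended as extra context."""
    lines = [
        f"You are {persona}.",
        f"Question: {question}",
    ]
    for index, shock in enumerate(shocks, start=1):
        lines.append(f"Breaking news {index}: {shock}")
    lines.append('Reply with JSON: {"probability": <0..1>, "reasoning": <string>}')
    return "\n".join(lines)


persona = "a 58-year-old retired teacher in Ohio"  # hypothetical persona
question = "Will the Fed cut interest rates in Q3 2026?"
shock = "Surprise May CPI prints 2.1%, well below forecasts."

before = build_prompt(persona, question)
after = build_prompt(persona, question, shocks=[shock])
print(after)
```

Because the persona and question lines are unchanged, any shift in the forecast comes from the model reasoning over the appended line rather than from a different prompt template.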

Why this only works on Gemini 3.5 Flash

We ran the same swarm — same personas, same question, same prompt, same concurrency — on Gemini 3.5 Flash vs Gemini 2.5 Flash. Results from our cross-model harness:

Model	Headline	Per-agent success	Median latency
`gemini-3.5-flash`	86.8% [80.5%, 93.0%]	100% (8/8)	20.9 s
`gemini-2.5-flash`	75.0% (1 valid call)	12.5% (1/8)	24.3 s
Δ (2.5 vs 3.5)	−11.8 pp	−87.5 pp	+3.4 s

Flash 3.5 is not an incremental upgrade here: in our 8-agent harness run, Flash 2.5 failed 7 of 8 calls on identical prompts at identical concurrency. The cost × speed × intelligence frontier of Flash 3.5 is what makes hundreds of parallel grounded reasoning agents economically viable. Swap to the previous generation and the primitive collapses.

Source: delphi/eval/cross_model.py · raw output in eval_results/cross_model.json.

Run it

# 1 · Backend
cp .env.example .env          # fill in your GEMINI_API_KEY
uv sync --all-extras
uv run uvicorn delphi.api:app --port 8000

# 2 · Frontend  (in another terminal)
cd web
npm install
npm run dev                   # serves at http://localhost:4000

# 3 · CLI alternative
uv run python main.py "Will Apple ship smart glasses in 2027?"

# 4 · Offline demo with no API key  (FakeClient harness)
uv run python -m delphi.harness

Then open http://localhost:4000 and convene the swarm.

Architecture

WEB FRONTEND · Next.js / React / Three.js (R3F) / Tailwind
   left rail (controls)  ·  centre (globe + dots)  ·  drill-down + synthesis
                              │
                              │ REST  +  WebSocket
                              ▼
ORCHESTRATOR · FastAPI + asyncio · delphi/api.py
   persona generator  ──▶  swarm runner  ──▶  shock re-run
        (1 batched           (N parallel        (append +
         Gemini call)         + semaphore K)     re-run)
                              │
                              ▼
                       aggregator  ──▶  Gemini summary call
                                         (narrative + quote)
                              │
                              ▼
N REASONING SUB-AGENTS · delphi/agent.py
   Gemini 3.5 Flash + google_search grounding + 1M context per agent
   structured JSON output · 25s per-agent timeout

Full architecture, request-lifecycle sequence diagram, data models, and decision log live in PRD.md §8.
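The swarm-runner box in the diagram (N parallel agents, a semaphore of size K, a 25 s per-agent timeout) maps naturally onto `asyncio`. The sketch below shows that pattern with a fake agent in place of the Gemini call. `run_swarm`, `fake_agent`, and `slow_agent` are hypothetical names, and the real code in delphi/ may be structured differently.

```python
import asyncio
import random


async def fake_agent(persona_id: int) -> dict:
    """Stand-in for one grounded model call (hypothetical; no network)."""
    await asyncio.sleep(random.uniform(0.001, 0.01))
    return {"persona_id": persona_id, "probability": 0.5}


async def slow_agent(persona_id: int) -> dict:
    """Hypothetical agent that always exceeds a short timeout."""
    await asyncio.sleep(0.5)
    return {"persona_id": persona_id, "probability": 0.5}


async def run_swarm(n: int, k: int, timeout_s: float, agent=fake_agent) -> list:
    """Run n agents with at most k in flight; drop agents that time out."""
    semaphore = asyncio.Semaphore(k)

    async def guarded(persona_id: int):
        async with semaphore:
            try:
                # The timeout covers only the call, not time spent queued.
                return await asyncio.wait_for(agent(persona_id), timeout=timeout_s)
            except asyncio.TimeoutError:
                return None  # a per-agent failure; the swarm keeps going

    results = await asyncio.gather(*(guarded(i) for i in range(n)))
    return [r for r in results if r is not None]


valid = asyncio.run(run_swarm(n=20, k=5, timeout_s=25.0))
print(f"{len(valid)} / 20 agents succeeded")

timed_out = asyncio.run(run_swarm(n=3, k=3, timeout_s=0.05, agent=slow_agent))
print(f"{len(timed_out)} / 3 slow agents succeeded")
```

Returning `None` for a timed-out agent, rather than raising, is what lets a run report a per-agent success rate below 100% and still produce a headline.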

What's in this repo

Path	Contents
`delphi/`	Python backend — orchestrator, agent, swarm, summary, shock
`delphi/eval/`	Validation harnesses — adversarial, cross-model, stress, persona-quality
`web/`	Next.js frontend — globe, drill-down, synthesis panel, shock injector
`tests/`	27 pytest tests · `FakeClient` integration harness · 0.75 s suite
`slides/`	Demo + evaluation deck (Marp markdown + exported PDF)
`eval_results/`	JSON outputs from each validation harness
`PRD.md`	Product requirements document
`DELPHI.md`	Project pitch, positioning, competitive landscape
`CASES.md`	Three live-run case studies — one per mode

Validation

Layer	What we measured	Result
Architecture	Aggregator math · schema parsing · HTTP / WS lifecycle · failure paths	27 / 27 tests pass, 0.75 s suite
Persona stability	LLM-as-judge in-character score across 24 charged-prompt trials	4.92 / 5 mean, 0% drift to centrist
Model dependence	Flash 3.5 vs Flash 2.5 on identical swarm + identical concurrency	87.5 pp per-agent success-rate gap
Scale	End-to-end at N = 200 and N = 500	207 s / 89% success at N = 200; 363 / 500 (73%) success at N = 500 live, pre-shock

Sources: delphi/eval/adversarial.py · delphi/eval/cross_model.py · delphi/eval/stress.py · delphi/eval/persona_quality.py. Raw outputs in eval_results/.

What is not yet validated

Calibration against real-world baselines. Synthetic forecasts have not been benchmarked against prediction markets (Polymarket), polls (Pew, Gallup), or the Good Judgment Project on matched questions.
Larger persona panels. N = 200 validated end-to-end; N = 500 demonstrated live with 27% per-agent failure; N = 1000 untested under quota.
Adversarial test breadth. Stability measured against 8 charged prompts × 3 personas. Wider sweep needed for production claims.
Cross-model breadth. Compared gemini-3.5-flash vs gemini-2.5-flash. Other Flash-class models (Claude Haiku, GPT-4o-mini) not yet tested.

Delphi today is a prototype of a primitive, not a calibrated forecasting product. The next major piece is a backtest harness against historical events (past Fed decisions, elections, product launches), scored by Brier loss against the Good Judgment Project median.
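For reference, the Brier score for binary events is the mean squared difference between the forecast probability and the 0/1 outcome, so lower is better. The sketch below shows how a backtest could score a forecaster against a baseline on matched questions. The forecast values are made up for illustration and `brier_score` is a hypothetical helper, not part of the repository.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes (0 or 1)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical resolved questions: 1 = event happened, 0 = it did not.
outcomes = [1, 0, 1]
swarm_forecasts = [0.8, 0.3, 0.6]     # made-up swarm headlines
baseline_forecasts = [0.7, 0.2, 0.9]  # made-up baseline medians

swarm_brier = brier_score(swarm_forecasts, outcomes)
baseline_brier = brier_score(baseline_forecasts, outcomes)
print(f"swarm: {swarm_brier:.4f}  baseline: {baseline_brier:.4f}")
```

A forecaster that always answers 0.5 scores 0.25, which makes a convenient sanity floor when reading backtest results.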

Forecasting is the wedge

The same primitive underwrites:

Vertical	Use
Marketing	Pre-test campaigns on 1,000 synthetic ICPs before spend
Policy / governance	War-game regulation against 1,000 affected constituencies
Comms / PR	Stress-test a statement against critic, supporter, journalist personas
Legal	Synthetic jury for trial-message testing
Product	A/B-test features with synthetic users before code
Public health	Disease-spread and behaviour modelling grounded in real demographics

We didn't make an existing thing faster. We built a category that did not exist last year.

Built solo, on 2026-05-23

One person, one day, one model. Built at Shack15, San Francisco, for the Google I/O Hackathon hosted with the Google DeepMind team.
