An open source re-implementation of Google's AI co-scientist (Gottweis et al., Nature, 2026; research blog, 2025) — a multi-agent system that takes a natural-language research goal and produces a tournament-ranked research overview of novel hypotheses.
The agent roster, prompts, and control flow follow the paper. Source materials that were used to instruct the coding agent (Claude Code) is included with the repo:
reference/8 Pseudocode of Co-Scientist agents— the supplementary pseudocode for Supervisor, Generation, Reflection, Ranking, Evolution, Proximity, Meta-review.reference/9 Prompts for the specialized agents in .md— the per-agent prompts from the paper's supplement, used verbatim (modulo Jinja interpolation) inconfig/prompts/.reference/AICoScientist-*.png— the architecture and component diagrams from the paper.
The agents:
- Generation — proposes hypotheses via literature review and simulated scientific debate.
- Reflection — reviews hypotheses for novelty, correctness, and testability; deep-verifies the underlying assumptions.
- Ranking — runs an Elo tournament with simulated debates between hypotheses.
- Evolution — combines, simplifies, makes more feasible, or out-of-box-reimagines top-ranked hypotheses.
- Proximity — embeds and clusters hypotheses to drive dedup and informative tournament pairings.
- Meta-review — synthesizes system-wide feedback and the final research overview.
A Supervisor parses the goal into a research plan and schedules agent tasks through a durable SQLite-backed queue with bounded concurrency.
This is an independent re-implementation in Python on top of pluggable LLM provider SDKs — not affiliated with Google or the paper's authors.
docs/BENCH_RESULTS.md— every cross-model bench ever run on this code, with per-candidate Elo, every hypothesis produced, gold-set hits, and direct file pointers. Auto-generated from the bench DB.
- Architecture
- Install
- Initialize
- Run a research session
- LLM provider
- Configuration
- Bench: compare models head-to-head
- Repository layout
co-scientist run "<goal>"
│
▼
┌──────────────────────────────────────┐
│ Supervisor │ durable task queue (SQLite)
│ • parse_goal → ResearchPlan │ bounded concurrency
│ • enqueue initial Generation tasks │ lease + dead-letter + resume
│ • main loop: claim → run → follow-up│ termination: BUDGET / WALL_CLOCK
│ • decide_next_steps when idle │ / ELO_STABLE / IDLE / EXTERNAL
│ • finalize: meta-review overview │
└──────────────────────────────────────┘
│ tasks
┌─────────────────────┼─────────────────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Generation │ hyp │ Reflection │ review │ Ranking │
│ literature │─────►│ full + │─────────────►│ pairwise vs │──► Elo
│ + debate │ │ verification│ │ debate │
└──────────────┘ └──────────────┘ └──────────────┘
▲ ▲ │
│ │ informative pairings ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Evolution │◄─────│ Meta-review │ │ Proximity │
│ combine / │ feed │ system fdbk │ │ FAISS embed │
│ simplify / │ back │ + final │ │ + cluster / │
│ feasibility /│ │ overview │ │ dedup │
│ out_of_box │ └──────────────┘ └──────────────┘
└──────────────┘
│
▼
new hypotheses re-enter the cycle
Shared infrastructure
─────────────────────
• LLMProvider ─ anthropic / openai / openrouter / gemini / groq /
together / mistral / ollama / openai_compatible
• ToolRegistry ─ web_fetch + pubmed_search / arxiv_search / europe_pmc_search;
web_search auto-registered iff TAVILY/BRAVE key set;
science-skills discovered via SKILL.md frontmatter
• TokenBudget ─ per-agent shares + global cap; reservation released on retry
• EventBus ─ in-memory fan-out to SSE for the live web UI
• FaissStore ─ IndexFlatIP per session, asyncio-locked, atomic save/load;
Voyage → OpenAI → hash-fallback embedder chain
• SQLite ─ sessions / hypotheses / reviews / tournament_matches /
elo_journal / tasks / transcripts / system_feedback /
embeddings_meta / spans / events / bench_* (15 tables;
WAL, busy_timeout, idempotent migration runner)
# Recommended: Python 3.11–3.13 (FAISS wheel availability)
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
# fill in the API key for whichever LLM provider you'll use (see below).co-scientist init
co-scientist listinit creates data/ (artifacts, vectors, logs) and applies migrations to data/co_scientist.db. The output prints which LLM provider it sees configured and whether its API key is set.
co-scientist run "Identify hypotheses about microbiome-driven inflammation" \
--n 3 --budget-usd 2.0 --wall-clock 600This kicks off Generation → Reflection → Ranking → Evolution → Meta-review under the configured LLM provider. The Supervisor schedules tasks, the Elo tournament refines a leaderboard, and the final research overview is written to data/artifacts/<session_id>/final/overview.md.
co-scientist serve # FastAPI + htmx + SSE dashboard at localhost:7878
co-scientist report <id> # print the final overview
co-scientist status <id> # session metadata + counts
co-scientist pause <id> | resume <id> | abort <id>
co-scientist feedback <id> --kind directive --text "focus on metabolic pathways"
co-scientist estimate # pre-flight cost estimate; warns if > 1.2× budget
co-scientist eval [agent] # run the rubric eval bundle (offline mode optional)
co-scientist tools list # show every registered tool the agents can callThe agents are provider-agnostic — every agent talks to one LLM provider per session, picked in config/default.toml (override with your own co-scientist.toml). Any of the providers below works; pick whichever you have a key for.
Config is deep-merged over config/default.toml, whose [models] defaults are Claude model ids. So if you switch provider away from anthropic, override every key in [models] — any key you leave out keeps its Claude default and will be sent to your new provider, which will reject it. Fill in model ids your chosen provider exposes (see the provider table below for examples per vendor):
[llm]
# Pick one. See the provider table below.
provider = "openai" # anthropic | openai | openrouter | gemini | google | groq | together | mistral | ollama | openai_compatible
[models]
# Override ALL of these with model ids from your chosen provider.
# (Two tiers are enough: a stronger model for reasoning-heavy agents and a
# cheaper one for the rest — set them to whatever your provider offers.)
parse_goal = "<cheap-model>"
generation = "<strong-model>"
reflection = "<strong-model>"
evolution = "<strong-model>"
ranking_pairwise = "<cheap-model>"
ranking_debate = "<strong-model>"
ranking_priority = "<strong-model>"
metareview_feedback = "<cheap-model>"
metareview_final = "<strong-model>"
classifier = "<cheap-model>"
judge = "<cheap-model>"Providers are listed alphabetically — none is preferred; pick whichever you have a key for.
| provider | Endpoint | API-key env var | Example models |
|---|---|---|---|
anthropic |
api.anthropic.com | ANTHROPIC_API_KEY |
claude-opus-4-7, claude-sonnet-4-6 |
gemini / google |
generativelanguage.googleapis.com (OpenAI-compat) | GEMINI_API_KEY |
gemini-2.5-pro, gemini-2.5-flash |
groq |
api.groq.com | GROQ_API_KEY |
llama-3.3-70b-versatile, mixtral-8x7b-32768 |
mistral |
api.mistral.ai | MISTRAL_API_KEY |
mistral-large-latest, codestral-latest |
ollama |
localhost:11434 — local models | (none) | llama3.3:70b, qwen2.5:32b |
openai |
api.openai.com | OPENAI_API_KEY |
gpt-5, gpt-4o, o3-mini |
openai_compatible |
Anything else; set [llm.openai] base_url explicitly |
OPENAI_API_KEY |
depends |
openrouter |
openrouter.ai — 200+ models from every major vendor | OPENROUTER_API_KEY |
openai/gpt-5, google/gemini-2.5-pro, anthropic/claude-3.5-sonnet, meta-llama/llama-3.3-70b-instruct |
together |
api.together.xyz | TOGETHER_API_KEY |
meta-llama/Llama-3.3-70B-Instruct-Turbo |
Key precedence: for every OpenAI-compatible preset (
openrouter,gemini,groq,together,mistral,ollama),OPENAI_API_KEYis used first if it's set, and the provider-specific var above is only the fallback. So if you have a strayOPENAI_API_KEYin your environment it will be sent to the preset's endpoint (and rejected) — unset it, or set only the provider's own key, when using a preset.
Mixing vendors per session requires picking the provider once; for multi-vendor routing in a single session, use provider = "openrouter" and let OpenRouter dispatch upstream per model:
[llm]
provider = "openrouter"
[llm.openrouter]
referer = "https://your-app.example.com" # optional, for catalog attribution
title = "My Co-Scientist"
[models]
generation = "openai/gpt-5"
reflection = "anthropic/claude-3.5-sonnet"
ranking_pairwise = "google/gemini-2.5-flash"
metareview_final = "meta-llama/llama-3.3-70b-instruct"Any per-agent model can point at any vendor — the example above just mixes four. Use whatever combination you prefer.
Cost is estimated via co_scientist/llm/routing.py's PRICE_TABLE; unknown models match a family-hint (flash / mini / opus / sonnet / gemini / llama / mistral) so brand-new previews price sensibly. Tighten [run] budget_usd if running on a new model you haven't sanity-checked.
Provider feature support. Tool / function calling is required — the agent pipeline is built on it, so a provider (or openai_compatible endpoint) that can't do function calling won't work. The other three rows are optional vendor-specific accelerators: when a provider doesn't support one, it's transparently skipped, never an error.
| Feature | anthropic |
everything else (OpenAI + all OpenAI-compatible providers) |
|---|---|---|
| Tool / function call (required) | ✅ | ✅ native OpenAI; on other endpoints it must be supported or the run fails |
| Extended reasoning | ✅ via thinking budgets |
✅ via reasoning_effort, only for reasoning models — the model id must start with o1/o3/o4 or contain reasoning; for any other model (e.g. gpt-4o) the thinking budget is dropped |
| Prompt-cache breakpoints | ✅ | ❌ (stripped before sending) |
| Batch API (50%-off ranking) | ✅ | ❌ (Anthropic-only; other providers run all matches synchronously) |
Note: the reasoning-model check is a name heuristic (
openai_client.py_is_reasoning_model). Newer reasoning-capable models whose ids don't match the pattern (e.g.gpt-5) won't getreasoning_effortuntil the heuristic is updated — they still work, just without an explicit reasoning budget.
Layered: config/default.toml → ~/.co-scientist/config.toml → ./co-scientist.toml → --config <path>. Secrets come from environment only (see .env.example).
co-scientist bench runs the same goal under N different (provider, model) configurations and ranks them via a single shared Elo tournament. Each candidate independently generates hypotheses; then every candidate-pair plays --matches head-to-head debates, judged by ONE fixed judge model (picked separately so no candidate scores its own work).
For live numbers — per-candidate Elo, the actual hypotheses each model proposed, gold-set hits, and what the data showed — see
docs/BENCH_RESULTS.md. It includes a headline-findings section at the top so you don't have to scroll through every bench.
--preset |
What it does |
|---|---|
paper |
Co-Scientist paper baselines (Gemini 2 Flash Thinking, Gemini 2 Pro, OpenAI o1, Claude Haiku) via OpenRouter, head-to-head Elo only |
paper-aml |
Same candidates + the paper's AML drug-repurposing goal + gold-set recall scoring (defaults to the strict top-3 set: Nanvuranlat / KIRA6 / Leflunomide) |
paper-aml-vs-raw |
paper-aml but each model runs both in the full pipeline AND as a single raw LM call — isolates the multi-agent harness's value-add |
frontier-aml-vs-raw |
Same pipeline-vs-raw setup but with current frontier models (Claude Opus 4.7, GPT-5, Gemini 3 Pro / Flash) |
# Reproduce the paper preference-ranking comparison:
co-scientist bench --preset paper --budget-per-candidate 1.5
# Score against the paper's AML drug picks:
co-scientist bench --preset paper-aml --n 3 --matches 2
# Compare multi-agent pipeline vs raw model call on the same goal
# (--budget-per-candidate defaults to 3.0; frontier models need it):
co-scientist bench --preset paper-aml-vs-raw --n 1
# Current frontier models, pipeline vs raw:
co-scientist bench --preset frontier-aml-vs-raw --n 1The --preset *-vs-raw presets pit each model's full co-scientist Generation pipeline (literature tools + tool loop + dedup + record_hypothesis) against a single raw LM call with the same model + a forced record_hypothesis function call (no tools). Lets you measure how much of the system's output quality comes from the multi-agent harness vs the underlying model. → live numbers in docs/BENCH_RESULTS.md.
paper-aml* presets score recall against a curated answer key from the Co-Scientist paper. Two gold sets ship; both stay registered so historical bench artifacts remain interpretable.
| label | size | what it is |
|---|---|---|
aml-repurposing-paper-top3 (default for paper-aml*) |
3 | Top-3 of the original paper's list: candidates with no prior published AML repurposing, no prior preclinical evidence in AML, and no external inputs (no DepMap scores, no expert curation). → Nanvuranlat (JPH-203 / KYT-0353), KIRA6, Leflunomide (Arava / HWA-486 / Teriflunomide / Aubagio) |
aml-repurposing-paper-5 |
5 | Broader 5-drug list referenced in the paper's main text: Binimetinib (MEK162), Pacritinib (SB1518 / Vonjo), Cerivastatin (Baycol), Pravastatin (Pravachol), Dimethyl fumarate (DMF / BG-12 / Tecfidera) |
Swap with --goldset:
co-scientist bench --preset paper-aml --goldset aml-repurposing-paper-5 # broader list
co-scientist bench --preset paper-aml --goldset none # head-to-head onlyThe matcher is whole-token, case-insensitive, and looks at every searched field of every hypothesis (title / summary / full_text / entities / citation excerpts). Drug class mentions (e.g. "DHODH inhibitor") do not count — the candidate has to name the actual compound (or one of its registered aliases).
label=provider:model[@mode]. mode is pipeline (default) or direct. Pipeline goes through the full Generation agent stack; direct is a single forced-tool LM call with no literature tools.
co-scientist bench "Identify hypotheses about X" \
-c flash3=openrouter:google/gemini-3-flash-preview \
-c flash3-raw=openrouter:google/gemini-3-flash-preview@direct \
-c gpt5=openai:gpt-5 \
-c opus=anthropic:claude-opus-4.7 \
--judge anthropic:claude-sonnet-4-6Every bench writes to SQLite + JSON on disk:
data/co_scientist.db ← SQLite, all metadata
bench_runs one row per bench
bench_candidates one row per (bench × candidate × mode)
bench_matches one row per head-to-head
data/artifacts/<session_id>/ ← JSON on disk
bench/<bench_id>.json run summary + per-entity gold_hit_detail
hypotheses/<hyp_id>.json every hypothesis the bench produced
transcripts/generation/<trn_id>.json every LLM call
The auto-generated docs/BENCH_RESULTS.md (rebuild with python scripts/build_bench_report.py) walks every recorded bench and renders the per-candidate result table, every hypothesis attributed to the model that produced it, and a post-hoc rescore against every registered gold set.
- Generation runs in parallel per candidate under a deep-copied Config (
cfg.llm.provider,cfg.models.*, thinking budgets zeroed for non-Anthropic). - Round-robin pairings: every pair plays
--matcheshead-to-heads (one random hypothesis from each side per match). - Structured verdict via a forced
record_verdictfunction call — no fragilebetter idea: <N>text parsing across providers. - Bench runs are isolated from regular sessions — they don't write to
tournament_matchesor affect any session's leaderboard.
co_scientist/
agents/ # supervisor + 6 specialized agents (base, generation, reflection,
# ranking, evolution, proximity, metareview)
bench/ # cross-model bench runner (Elo tournament + gold-set scoring)
llm/ # provider abstraction (anthropic/openai/openrouter/gemini/...),
# tool loop, token budgets, model routing, retry, batch, estimator
storage/ # SQLite schema + migrations, db connection, 10 repos
tools/ # tool registry; web_fetch, web_search, pubmed/arxiv/europe_pmc,
# science-skills bridge
vectors/ # embeddings (Voyage/OpenAI/hash-fallback) + FAISS IndexFlatIP
orchestrator/ # task scheduling, Elo updates, termination, event bus
safety/ # injection quoting, classifier, citation verifier
obs/ # metrics (tokens, cost, cache hit ratio, latency)
web/ # FastAPI + htmx + SSE UI + sanitized markdown renderer
evals/ # per-agent + e2e + regression evals
tests/ # 213 unit tests + fixtures + smoke
config/
default.toml
prompts/ # 14 Jinja2 templates (one per agent.mode), derived from
# the paper's supplementary prompts
docs/
BENCH_RESULTS.md # every bench ever run (auto-generated)
scripts/
build_bench_report.py
reference/ # paper source materials (pseudocode, prompts, diagrams)
data/ # gitignored; runtime artifacts (SQLite, FAISS, transcripts)
vendor/ # gitignored; pinned clone of google-deepmind/science-skills
Apache-2.0.