AI co-scientist

An open source re-implementation of Google's AI co-scientist (Gottweis et al., Nature, 2026; research blog, 2025) — a multi-agent system that takes a natural-language research goal and produces a tournament-ranked research overview of novel hypotheses.

The agent roster, prompts, and control flow follow the paper. Source materials that were used to instruct the coding agent (Claude Code) is included with the repo:

reference/8 Pseudocode of Co-Scientist agents — the supplementary pseudocode for Supervisor, Generation, Reflection, Ranking, Evolution, Proximity, Meta-review.
reference/9 Prompts for the specialized agents in .md — the per-agent prompts from the paper's supplement, used verbatim (modulo Jinja interpolation) in config/prompts/.
reference/AICoScientist-*.png — the architecture and component diagrams from the paper.

The agents:

Generation — proposes hypotheses via literature review and simulated scientific debate.
Reflection — reviews hypotheses for novelty, correctness, and testability; deep-verifies the underlying assumptions.
Ranking — runs an Elo tournament with simulated debates between hypotheses.
Evolution — combines, simplifies, makes more feasible, or out-of-box-reimagines top-ranked hypotheses.
Proximity — embeds and clusters hypotheses to drive dedup and informative tournament pairings.
Meta-review — synthesizes system-wide feedback and the final research overview.

A Supervisor parses the goal into a research plan and schedules agent tasks through a durable SQLite-backed queue with bounded concurrency.

This is an independent re-implementation in Python on top of pluggable LLM provider SDKs — not affiliated with Google or the paper's authors.

docs/BENCH_RESULTS.md — every cross-model bench ever run on this code, with per-candidate Elo, every hypothesis produced, gold-set hits, and direct file pointers. Auto-generated from the bench DB.

Architecture

                       co-scientist run "<goal>"
                                  │
                                  ▼
            ┌──────────────────────────────────────┐
            │            Supervisor                │  durable task queue (SQLite)
            │  • parse_goal → ResearchPlan         │  bounded concurrency
            │  • enqueue initial Generation tasks  │  lease + dead-letter + resume
            │  • main loop: claim → run → follow-up│  termination: BUDGET / WALL_CLOCK
            │  • decide_next_steps when idle       │              / ELO_STABLE / IDLE / EXTERNAL
            │  • finalize: meta-review overview    │
            └──────────────────────────────────────┘
                                  │  tasks
            ┌─────────────────────┼─────────────────────────────┐
            ▼                     ▼                             ▼
   ┌──────────────┐      ┌──────────────┐              ┌──────────────┐
   │  Generation  │ hyp  │  Reflection  │ review       │   Ranking    │
   │  literature  │─────►│  full +      │─────────────►│ pairwise vs  │──► Elo
   │  + debate    │      │  verification│              │   debate     │
   └──────────────┘      └──────────────┘              └──────────────┘
            ▲                     ▲                             │
            │                     │ informative pairings        ▼
   ┌──────────────┐      ┌──────────────┐              ┌──────────────┐
   │  Evolution   │◄─────│ Meta-review  │              │  Proximity   │
   │ combine /    │ feed │ system fdbk  │              │ FAISS embed  │
   │ simplify /   │ back │ + final      │              │ + cluster /  │
   │ feasibility /│      │ overview     │              │ dedup        │
   │ out_of_box   │      └──────────────┘              └──────────────┘
   └──────────────┘
            │
            ▼
       new hypotheses re-enter the cycle


  Shared infrastructure
  ─────────────────────
  • LLMProvider  ─ anthropic / openai / openrouter / gemini / groq /
                   together / mistral / ollama / openai_compatible
  • ToolRegistry ─ web_fetch + pubmed_search / arxiv_search / europe_pmc_search;
                   web_search auto-registered iff TAVILY/BRAVE key set;
                   science-skills discovered via SKILL.md frontmatter
  • TokenBudget  ─ per-agent shares + global cap; reservation released on retry
  • EventBus     ─ in-memory fan-out to SSE for the live web UI
  • FaissStore   ─ IndexFlatIP per session, asyncio-locked, atomic save/load;
                   Voyage → OpenAI → hash-fallback embedder chain
  • SQLite       ─ sessions / hypotheses / reviews / tournament_matches /
                   elo_journal / tasks / transcripts / system_feedback /
                   embeddings_meta / spans / events / bench_* (15 tables;
                   WAL, busy_timeout, idempotent migration runner)

Install

# Recommended: Python 3.11–3.13 (FAISS wheel availability)
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

cp .env.example .env
# fill in the API key for whichever LLM provider you'll use (see below).

Initialize

co-scientist init
co-scientist list

init creates data/ (artifacts, vectors, logs) and applies migrations to data/co_scientist.db. The output prints which LLM provider it sees configured and whether its API key is set.

Run a research session

co-scientist run "Identify hypotheses about microbiome-driven inflammation" \
  --n 3 --budget-usd 2.0 --wall-clock 600

This kicks off Generation → Reflection → Ranking → Evolution → Meta-review under the configured LLM provider. The Supervisor schedules tasks, the Elo tournament refines a leaderboard, and the final research overview is written to data/artifacts/<session_id>/final/overview.md.

co-scientist serve            # FastAPI + htmx + SSE dashboard at localhost:7878
co-scientist report <id>      # print the final overview
co-scientist status <id>      # session metadata + counts
co-scientist pause <id> | resume <id> | abort <id>
co-scientist feedback <id> --kind directive --text "focus on metabolic pathways"
co-scientist estimate         # pre-flight cost estimate; warns if > 1.2× budget
co-scientist eval [agent]     # run the rubric eval bundle (offline mode optional)
co-scientist tools list       # show every registered tool the agents can call

LLM provider

The agents are provider-agnostic — every agent talks to one LLM provider per session, picked in config/default.toml (override with your own co-scientist.toml). Any of the providers below works; pick whichever you have a key for.

Config is deep-merged over config/default.toml, whose [models] defaults are Claude model ids. So if you switch provider away from anthropic, override every key in [models] — any key you leave out keeps its Claude default and will be sent to your new provider, which will reject it. Fill in model ids your chosen provider exposes (see the provider table below for examples per vendor):

[llm]
# Pick one. See the provider table below.
provider = "openai"   # anthropic | openai | openrouter | gemini | google | groq | together | mistral | ollama | openai_compatible

[models]
# Override ALL of these with model ids from your chosen provider.
# (Two tiers are enough: a stronger model for reasoning-heavy agents and a
#  cheaper one for the rest — set them to whatever your provider offers.)
parse_goal          = "<cheap-model>"
generation          = "<strong-model>"
reflection          = "<strong-model>"
evolution           = "<strong-model>"
ranking_pairwise    = "<cheap-model>"
ranking_debate      = "<strong-model>"
ranking_priority    = "<strong-model>"
metareview_feedback = "<cheap-model>"
metareview_final    = "<strong-model>"
classifier          = "<cheap-model>"
judge               = "<cheap-model>"

Providers are listed alphabetically — none is preferred; pick whichever you have a key for.

provider	Endpoint	API-key env var	Example models
`anthropic`	api.anthropic.com	`ANTHROPIC_API_KEY`	`claude-opus-4-7`, `claude-sonnet-4-6`
`gemini` / `google`	generativelanguage.googleapis.com (OpenAI-compat)	`GEMINI_API_KEY`	`gemini-2.5-pro`, `gemini-2.5-flash`
`groq`	api.groq.com	`GROQ_API_KEY`	`llama-3.3-70b-versatile`, `mixtral-8x7b-32768`
`mistral`	api.mistral.ai	`MISTRAL_API_KEY`	`mistral-large-latest`, `codestral-latest`
`ollama`	localhost:11434 — local models	(none)	`llama3.3:70b`, `qwen2.5:32b`
`openai`	api.openai.com	`OPENAI_API_KEY`	`gpt-5`, `gpt-4o`, `o3-mini`
`openai_compatible`	Anything else; set `[llm.openai] base_url` explicitly	`OPENAI_API_KEY`	depends
`openrouter`	openrouter.ai — 200+ models from every major vendor	`OPENROUTER_API_KEY`	`openai/gpt-5`, `google/gemini-2.5-pro`, `anthropic/claude-3.5-sonnet`, `meta-llama/llama-3.3-70b-instruct`
`together`	api.together.xyz	`TOGETHER_API_KEY`	`meta-llama/Llama-3.3-70B-Instruct-Turbo`

Key precedence: for every OpenAI-compatible preset (openrouter, gemini, groq, together, mistral, ollama), OPENAI_API_KEY is used first if it's set, and the provider-specific var above is only the fallback. So if you have a stray OPENAI_API_KEY in your environment it will be sent to the preset's endpoint (and rejected) — unset it, or set only the provider's own key, when using a preset.

Mixing vendors per session requires picking the provider once; for multi-vendor routing in a single session, use provider = "openrouter" and let OpenRouter dispatch upstream per model:

[llm]
provider = "openrouter"
[llm.openrouter]
referer = "https://your-app.example.com"   # optional, for catalog attribution
title   = "My Co-Scientist"

[models]
generation         = "openai/gpt-5"
reflection         = "anthropic/claude-3.5-sonnet"
ranking_pairwise   = "google/gemini-2.5-flash"
metareview_final   = "meta-llama/llama-3.3-70b-instruct"

Any per-agent model can point at any vendor — the example above just mixes four. Use whatever combination you prefer.

Cost is estimated via co_scientist/llm/routing.py's PRICE_TABLE; unknown models match a family-hint (flash / mini / opus / sonnet / gemini / llama / mistral) so brand-new previews price sensibly. Tighten [run] budget_usd if running on a new model you haven't sanity-checked.

Provider feature support. Tool / function calling is required — the agent pipeline is built on it, so a provider (or openai_compatible endpoint) that can't do function calling won't work. The other three rows are optional vendor-specific accelerators: when a provider doesn't support one, it's transparently skipped, never an error.

Feature	`anthropic`	everything else (OpenAI + all OpenAI-compatible providers)
Tool / function call (required)	✅	✅ native OpenAI; on other endpoints it must be supported or the run fails
Extended reasoning	✅ via `thinking` budgets	✅ via `reasoning_effort`, only for reasoning models — the model id must start with `o1`/`o3`/`o4` or contain `reasoning`; for any other model (e.g. `gpt-4o`) the thinking budget is dropped
Prompt-cache breakpoints	✅	❌ (stripped before sending)
Batch API (50%-off ranking)	✅	❌ (Anthropic-only; other providers run all matches synchronously)

Note: the reasoning-model check is a name heuristic (openai_client.py _is_reasoning_model). Newer reasoning-capable models whose ids don't match the pattern (e.g. gpt-5) won't get reasoning_effort until the heuristic is updated — they still work, just without an explicit reasoning budget.

Configuration

Layered: config/default.toml → ~/.co-scientist/config.toml → ./co-scientist.toml → --config <path>. Secrets come from environment only (see .env.example).

Bench: compare models head-to-head

co-scientist bench runs the same goal under N different (provider, model) configurations and ranks them via a single shared Elo tournament. Each candidate independently generates hypotheses; then every candidate-pair plays --matches head-to-head debates, judged by ONE fixed judge model (picked separately so no candidate scores its own work).

For live numbers — per-candidate Elo, the actual hypotheses each model proposed, gold-set hits, and what the data showed — see docs/BENCH_RESULTS.md. It includes a headline-findings section at the top so you don't have to scroll through every bench.

Presets

`--preset`	What it does
`paper`	Co-Scientist paper baselines (Gemini 2 Flash Thinking, Gemini 2 Pro, OpenAI o1, Claude Haiku) via OpenRouter, head-to-head Elo only
`paper-aml`	Same candidates + the paper's AML drug-repurposing goal + gold-set recall scoring (defaults to the strict top-3 set: Nanvuranlat / KIRA6 / Leflunomide)
`paper-aml-vs-raw`	`paper-aml` but each model runs both in the full pipeline AND as a single raw LM call — isolates the multi-agent harness's value-add
`frontier-aml-vs-raw`	Same pipeline-vs-raw setup but with current frontier models (Claude Opus 4.7, GPT-5, Gemini 3 Pro / Flash)

# Reproduce the paper preference-ranking comparison:
co-scientist bench --preset paper --budget-per-candidate 1.5

# Score against the paper's AML drug picks:
co-scientist bench --preset paper-aml --n 3 --matches 2

# Compare multi-agent pipeline vs raw model call on the same goal
# (--budget-per-candidate defaults to 3.0; frontier models need it):
co-scientist bench --preset paper-aml-vs-raw --n 1

# Current frontier models, pipeline vs raw:
co-scientist bench --preset frontier-aml-vs-raw --n 1

Pipeline vs raw LM (one model, isolated)

The --preset *-vs-raw presets pit each model's full co-scientist Generation pipeline (literature tools + tool loop + dedup + record_hypothesis) against a single raw LM call with the same model + a forced record_hypothesis function call (no tools). Lets you measure how much of the system's output quality comes from the multi-agent harness vs the underlying model. → live numbers in docs/BENCH_RESULTS.md.

Gold-set scoring (AML drug repurposing)

paper-aml* presets score recall against a curated answer key from the Co-Scientist paper. Two gold sets ship; both stay registered so historical bench artifacts remain interpretable.

label	size	what it is
`aml-repurposing-paper-top3` (default for `paper-aml`)*	3	Top-3 of the original paper's list: candidates with no prior published AML repurposing, no prior preclinical evidence in AML, and no external inputs (no DepMap scores, no expert curation). → Nanvuranlat (JPH-203 / KYT-0353), KIRA6, Leflunomide (Arava / HWA-486 / Teriflunomide / Aubagio)
`aml-repurposing-paper-5`	5	Broader 5-drug list referenced in the paper's main text: Binimetinib (MEK162), Pacritinib (SB1518 / Vonjo), Cerivastatin (Baycol), Pravastatin (Pravachol), Dimethyl fumarate (DMF / BG-12 / Tecfidera)

Swap with --goldset:

co-scientist bench --preset paper-aml --goldset aml-repurposing-paper-5   # broader list
co-scientist bench --preset paper-aml --goldset none                       # head-to-head only

The matcher is whole-token, case-insensitive, and looks at every searched field of every hypothesis (title / summary / full_text / entities / citation excerpts). Drug class mentions (e.g. "DHODH inhibitor") do not count — the candidate has to name the actual compound (or one of its registered aliases).

Custom candidates

label=provider:model[@mode]. mode is pipeline (default) or direct. Pipeline goes through the full Generation agent stack; direct is a single forced-tool LM call with no literature tools.

co-scientist bench "Identify hypotheses about X" \
  -c flash3=openrouter:google/gemini-3-flash-preview \
  -c flash3-raw=openrouter:google/gemini-3-flash-preview@direct \
  -c gpt5=openai:gpt-5 \
  -c opus=anthropic:claude-opus-4.7 \
  --judge anthropic:claude-sonnet-4-6

Where results live

Every bench writes to SQLite + JSON on disk:

data/co_scientist.db                          ← SQLite, all metadata
  bench_runs                                  one row per bench
  bench_candidates                            one row per (bench × candidate × mode)
  bench_matches                               one row per head-to-head

data/artifacts/<session_id>/                  ← JSON on disk
  bench/<bench_id>.json                       run summary + per-entity gold_hit_detail
  hypotheses/<hyp_id>.json                    every hypothesis the bench produced
  transcripts/generation/<trn_id>.json        every LLM call

The auto-generated docs/BENCH_RESULTS.md (rebuild with python scripts/build_bench_report.py) walks every recorded bench and renders the per-candidate result table, every hypothesis attributed to the model that produced it, and a post-hoc rescore against every registered gold set.

Mechanics

Generation runs in parallel per candidate under a deep-copied Config (cfg.llm.provider, cfg.models.*, thinking budgets zeroed for non-Anthropic).
Round-robin pairings: every pair plays --matches head-to-heads (one random hypothesis from each side per match).
Structured verdict via a forced record_verdict function call — no fragile better idea: <N> text parsing across providers.
Bench runs are isolated from regular sessions — they don't write to tournament_matches or affect any session's leaderboard.

Repository layout

co_scientist/
  agents/       # supervisor + 6 specialized agents (base, generation, reflection,
                # ranking, evolution, proximity, metareview)
  bench/        # cross-model bench runner (Elo tournament + gold-set scoring)
  llm/          # provider abstraction (anthropic/openai/openrouter/gemini/...),
                # tool loop, token budgets, model routing, retry, batch, estimator
  storage/      # SQLite schema + migrations, db connection, 10 repos
  tools/        # tool registry; web_fetch, web_search, pubmed/arxiv/europe_pmc,
                # science-skills bridge
  vectors/      # embeddings (Voyage/OpenAI/hash-fallback) + FAISS IndexFlatIP
  orchestrator/ # task scheduling, Elo updates, termination, event bus
  safety/       # injection quoting, classifier, citation verifier
  obs/          # metrics (tokens, cost, cache hit ratio, latency)
  web/          # FastAPI + htmx + SSE UI + sanitized markdown renderer
  evals/        # per-agent + e2e + regression evals
  tests/        # 213 unit tests + fixtures + smoke
config/
  default.toml
  prompts/      # 14 Jinja2 templates (one per agent.mode), derived from
                # the paper's supplementary prompts
docs/
  BENCH_RESULTS.md   # every bench ever run (auto-generated)
scripts/
  build_bench_report.py
reference/      # paper source materials (pseudocode, prompts, diagrams)
data/           # gitignored; runtime artifacts (SQLite, FAISS, transcripts)
vendor/         # gitignored; pinned clone of google-deepmind/science-skills

License

Apache-2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
.github/workflows		.github/workflows
co_scientist		co_scientist
config		config
docs		docs
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI co-scientist

Contents

Architecture

Install

Initialize

Run a research session

LLM provider

Configuration

Bench: compare models head-to-head

Presets

Pipeline vs raw LM (one model, isolated)

Gold-set scoring (AML drug repurposing)

Custom candidates

Where results live

Mechanics

Repository layout

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AI co-scientist

Contents

Architecture

Install

Initialize

Run a research session

LLM provider

Configuration

Bench: compare models head-to-head

Presets

Pipeline vs raw LM (one model, isolated)

Gold-set scoring (AML drug repurposing)

Custom candidates

Where results live

Mechanics

Repository layout

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages