Caliper

The research assistant that tells you when to trust it.

An AI analyst that runs your data pipelines end-to-end, reports calibrated confidence on every result, and escalates to a human exactly when it shouldn't be trusted — getting sharper with each correction.

Modern science runs on software that most scientists were never trained to use — and can't easily tell whether to believe. Caliper does the analysis for you. And it does the one thing no tool does today: it tells you, honestly, when the answer is solid and when you should ask a human.

It is a ready-to-use agent. The calibrated-trust-and-feedback layer is what makes it worth trusting unattended.

Why it exists

Existing science agents build a curated toolbox and an agent to drive it, then hand you an answer with no bound on how often it is wrong, so a resource-constrained scientist still can't act on it unattended. Caliper closes that gap, across fields.

How it works — three thin layers

Layer	What it is
① Domain Pack	A small, versioned registry of a field's vetted tools (metadata — the model already knows the rest).
② Agent Core	Plan → pick tools → run code → keep a reproducible record.
③ Trust & Feedback	Calibrated confidence on every result, a provable bound on confidently-wrong answers, escalation to a human, and recalibration from each correction.

One core, swappable packs. Ships with a working bio pack (genomics); astro is a skeleton proving the core is domain-agnostic.
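
As a rough illustration of "one core, swappable packs", the sketch below models a pack as a small versioned registry of tool metadata and a core that only runs tools the pack has vetted, logging each step. Every name here (`ToolSpec`, `DomainPack`, `AgentCore`, `run_step`, and the example tool) is hypothetical and chosen for illustration; Caliper's actual classes are described in docs/DEVELOPERS.md and may look quite different.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolSpec:
    """Metadata for one vetted tool (hypothetical shape)."""
    name: str
    version: str
    summary: str


@dataclass
class DomainPack:
    """A small, versioned registry of a field's vetted tools."""
    domain: str
    version: str
    tools: dict[str, ToolSpec] = field(default_factory=dict)

    def register(self, tool: ToolSpec) -> None:
        if tool.name in self.tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self.tools[tool.name] = tool


class AgentCore:
    """Domain-agnostic core: it only knows the pack it was given."""

    def __init__(self, pack: DomainPack) -> None:
        self.pack = pack
        self.record: list[dict] = []  # reproducible record of steps taken

    def run_step(self, tool_name: str, note: str) -> dict:
        tool = self.pack.tools[tool_name]  # KeyError if the tool is not vetted
        entry = {
            "domain": self.pack.domain,
            "pack_version": self.pack.version,
            "tool": tool.name,
            "tool_version": tool.version,
            "note": note,
        }
        self.record.append(entry)
        return entry


# Hypothetical packs: a populated bio pack and an empty astro skeleton.
bio = DomainPack(domain="bio", version="0.1.0")
bio.register(ToolSpec("example_aligner", "1.0", "placeholder tool for illustration"))
astro = DomainPack(domain="astro", version="0.0.1")

core = AgentCore(bio)
step = core.run_step("example_aligner", "align reads")
```

Swapping fields then means constructing `AgentCore(astro)` instead; the core code does not change, only the registry it reads from.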

How Caliper is different from other science agents

Most science-AI agents focus on capability — assembling tools and automating the analysis. Caliper adds the piece they leave out: an honest, calibrated signal of when to trust the result, and the discipline to hand borderline cases back to a human.

	Curated tools	Runs analysis end-to-end	Calibrated confidence	Defers to a human when unsure	Learns from your corrections
Caliper	✅	✅	✅	✅	✅
General-purpose science agents	✅	✅	—	—	—
Autonomous research agents	✅	✅	—	✗ (no human in the loop)	—
A bare LLM + tools	partial	partial	—	—	—

Others hand you an answer. Caliper hands you an answer plus a provable bound on how often it is wrong when it doesn't ask for help — and sharpens that judgment every time you correct it.

Comparison reflects each category's described design at the time of writing. These are complementary efforts; Caliper's trust layer can in principle sit on top of an existing agent.

The promise you can set

Give Caliper a rule in plain terms — "never let more than 1 in 10 of the answers you hand me unchecked be wrong" — and it keeps it. From a modest set of expert-checked examples it learns exactly how confident it must be before answering on its own; anything below that bar it escalates. The guarantee is finite-sample and distribution-free, and it holds even when the underlying judge is imperfect — because it would rather ask for help than mislead you.
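
One way to picture such a rule is a threshold search over expert-checked examples. The sketch below is an illustration under stated assumptions, not necessarily the procedure Caliper uses (that is documented in docs/DEVELOPERS.md). It swaps in a standard technique: a Hoeffding upper bound on the error rate among answered results, with a Bonferroni correction across a fixed grid of candidate thresholds. The names `calibrate_threshold`, `decide`, `alpha`, `delta`, and the grid are all hypothetical. Assuming the checked examples are drawn independently from the same distribution as future results, then with probability at least `1 - delta` the error rate among results answered at the returned threshold is at most `alpha`.

```python
import math


def calibrate_threshold(confidences, correct, alpha=0.1, delta=0.05, grid=None):
    """Return the lowest grid threshold whose error bound is <= alpha, else None.

    confidences: confidence scores for the expert-checked examples
    correct:     matching booleans, True where the expert agreed with the result
    alpha:       tolerated error rate among answers given without escalation
    delta:       allowed probability that the calibration itself is unlucky
    """
    if grid is None:
        grid = [i / 20 for i in range(20)]  # 0.00, 0.05, ..., 0.95
    per_test_delta = delta / len(grid)  # Bonferroni correction over the grid
    for t in sorted(grid):
        answered = [ok for c, ok in zip(confidences, correct) if c >= t]
        n = len(answered)
        if n == 0:
            continue  # nothing would be answered at this threshold
        error_rate = sum(1 for ok in answered if not ok) / n
        # Hoeffding upper confidence bound on the true error rate among answers
        bound = error_rate + math.sqrt(math.log(1 / per_test_delta) / (2 * n))
        if bound <= alpha:
            return t
    return None  # not enough evidence yet: escalate everything


def decide(confidence, threshold):
    """Answer unattended only when a threshold exists and is met."""
    if threshold is not None and confidence >= threshold:
        return "answer"
    return "escalate"
```

In this sketch the square-root term shrinks as more checked examples arrive, so additional corrections can lower the threshold and reduce how often results are escalated. With too few examples no threshold passes, and every result goes to a human.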

Quickstart

pip install -e .
python examples/bio_demo.py        # full run: analyze → grade confidence → decide
python examples/feedback_loop.py   # watch it grow more confident as feedback arrives
python -m unittest discover -s tests

LLM providers

Provider-agnostic via make_llm(provider=...) or the CALIPER_PROVIDER env var.

from caliper import make_llm
llm = make_llm()                              # default
llm = make_llm("openai")                       # OpenAI
llm = make_llm("openai", model="gpt-5-chat-latest")
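
For readers wondering how an explicit argument and the CALIPER_PROVIDER variable might interact, here is a minimal sketch of one plausible resolution order. The precedence shown (explicit argument first, then the environment variable, then a fallback) is an assumption, as are the helper name `resolve_provider` and the placeholder fallback value `"default"`; check the `make_llm` source for the actual behavior.

```python
import os


def resolve_provider(provider=None, fallback="default"):
    """Pick a provider name: explicit argument, then env var, then fallback.

    Hypothetical helper; the precedence order is an assumption for illustration.
    """
    return provider or os.environ.get("CALIPER_PROVIDER") or fallback


# With the variable set, a call without arguments picks it up,
# while an explicit argument still takes priority in this sketch.
os.environ["CALIPER_PROVIDER"] = "openai"
from_env = resolve_provider()
explicit = resolve_provider("some-other-provider")
```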

Deployment — where it runs

Caliper deploys as two cooperating tiers, so your data never leaves your own machines:

A control host — a small, always-on server (a VM in your private network / VPC, or any low-cost cloud instance). It serves the web app, runs the agent and the trust layer, handles login + audit, and keeps chat history. It stays light — it holds no large data.
Your private analysis server — where the raw data and the field's tools already live (on-prem or in your VPC). The control host connects to it over a secure connection and runs each analysis step there, inside a confined working directory.

   Browser ──HTTPS──▶  Control host  (web · agent · trust · login + audit)
                            │   per step: secure connection
                            ▼
                     Your private server   (data + tools)
                     • runs each step in a confined workspace
                     • raw data stays here — only small results return

In practice:

Users reach it at a domain you control, over HTTPS, behind a login (per-user email + password; every sign-in is audited — who, when, and source IP).
Every analysis step is workspace-confined: it writes only to a dedicated directory and reads your data read-only.
Large data never leaves your server — only small results and logs flow back.
History and the experience log are kept as files on both tiers — no database required.
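
The workspace confinement described above can be approximated with a path check before any write. This is a minimal sketch of the general technique, not Caliper's implementation: the function name `confined_path` is hypothetical, and a real deployment would likely combine a check like this with operating-system permissions (for example a read-only mount for the raw data). It needs Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path


def confined_path(workspace, relative):
    """Resolve `relative` inside `workspace`, refusing anything that escapes it.

    Resolving first collapses `..` segments and follows symlinks, so the
    comparison is made on the real target rather than the spelled path.
    """
    root = Path(workspace).resolve()
    target = (root / relative).resolve()
    if not target.is_relative_to(root):
        raise PermissionError(f"path escapes workspace: {relative}")
    return target
```

A step that asks to write `out/result.csv` gets a path under the workspace; one that asks for `../escape.txt` or an absolute path such as `/etc/passwd` raises `PermissionError` before any file is touched.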

To deploy: point a control host at your private server (its address + an account), set your model provider's API key and a login, and expose a domain over HTTPS. The control host is intentionally tiny; all heavy compute runs on your server, next to the data.

Status & roadmap

Research preview. Working: thin core, bio pack, calibrated gate (tested), multi-provider models, live feedback recalibration, offline + live demos. Next: exact-validity calibration (Learn-then-Test), a real genomics tool environment and a reproduced published study, distribution-shift robustness, and a fleshed-out astro pack. See docs/DEVELOPERS.md for the architecture and the math.

License

Apache 2.0 — see LICENSE. Caliper is an independent reimplementation; it reuses no third-party code verbatim. Prior art is credited in NOTICE. Underlying domain tools carry their own licenses; review before commercial use.
