A shared time-calibration corpus for coding agents, served over MCP. It counters the systematic over-estimation that LLM agents inherit from ~30 years of human software-engineering timelines — by handing the agent real "a human estimated X, it actually took Y" rows before it scopes your task, instead of letting it reach for an engineer-weeks prior.
Telling an agent "you're powerful, you'll be fast" doesn't override its training prior. Examples do. TimeCal is the examples.
flowchart LR
Preamble["timecal://preamble<br/>resets the prior"] --> Agent
Agent["Coding agent<br/>Claude Code · Codex<br/>Cursor · Cline"] -->|calibrate_task| Rank
Agent -->|log_completion| DB
Rank["Retrieval<br/>term overlap + concept layer"] --> DB
DB[("Corpus<br/>(SQLite)")] --> Rank
Rank --> Out["Ranked rows<br/>two clocks · regime"]
Out --> Agent
classDef box fill:#14171c,stroke:#2a313d,color:#ecebe5
classDef store fill:#0f1115,stroke:#2a313d,color:#ecebe5,stroke-width:2px
classDef out fill:#2a2117,stroke:#4a3a22,color:#f4d495
class Agent,Preamble box
class DB store
class Rank,Out out
Ships as a zero-install MCP server: uvx timecal boots it over stdio against a local SQLite corpus that auto-seeds on first run. No database setup, no API keys, no network.
When an agent reads "build a Reddit→Claude pipeline," its training prior maps that to engineer-weeks. A capable agent driving Claude Code can ship the same thing in an afternoon. That bad prior cascades: things get called "infeasible solo," scope gets cut that didn't need cutting, multi-phase rollouts get proposed where one session would do.
The fix isn't a pep talk — it's grounded data. TimeCal models the calibration explicitly: a small MCP server backed by a local SQLite corpus of real "human-estimated → actually-took" rows that any MCP-aware agent retrieves before scoping. Every row separates the two clocks (wall-clock days vs. active hours — different units, never compared raw) and carries a regime so the reading agent can tell whether a human "months" estimate was a fake prior or a real external constraint. No black box: the agent sees the rows, not a number.
Zero-install, with uv:
uvx timecal # runs the MCP server over stdio
Or install it:
pip install timecal
python -m timecal.server
The corpus DB is created and seeded automatically on first run (10-row example corpus) at ~/.timecal/timecal.db, with no setup step. Point TIMECAL_DB at any path to use your own corpus instead; the server, scripts, and tests all read it at call time.
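A minimal sketch of the path resolution and first-run seeding described above. It is not the project's `db.py`: the function names `resolve_db_path` and `ensure_db`, the two-column table, and the stand-in rows are all assumptions made for illustration.

```python
import os
import sqlite3
from pathlib import Path

DEFAULT_DB = Path.home() / ".timecal" / "timecal.db"

# Hypothetical stand-in rows; the shipped example corpus has 10 rows.
EXAMPLE_ROWS = [("example row A", "example"), ("example row B", "example")]


def resolve_db_path() -> Path:
    """Read TIMECAL_DB at call time, falling back to the default path."""
    override = os.environ.get("TIMECAL_DB")
    return Path(override).expanduser() if override else DEFAULT_DB


def ensure_db(path: Path) -> sqlite3.Connection:
    """Create missing parent dirs, create the table, and seed only a fresh DB."""
    fresh = not path.exists()
    path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)
    # Simplified schema, assumed for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS projects (what_shipped TEXT, source TEXT)"
    )
    if fresh:
        conn.executemany("INSERT INTO projects VALUES (?, ?)", EXAMPLE_ROWS)
    conn.commit()
    return conn
```

Because the environment variable is read inside `resolve_db_path()` rather than at import time, a test can point `TIMECAL_DB` at a temporary file without reloading anything.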
From source (for development):
git clone https://github.com/Conalh/timecal && cd timecal
pip install -e ".[dev]"
pytest -q # 30 passing
Claude Code:
claude mcp add timecal -- uvx timecal
Generic MCP client (mcp.json / client config):
{
"mcpServers": {
"timecal": {
"command": "uvx",
"args": ["timecal"],
"env": { "TIMECAL_DB": "/path/to/your/corpus.db" }
}
}
}
(env is optional; drop it to use the auto-seeded default.) Once connected, the agent gets two tools and one resource:
| Surface | What it does |
|---|---|
| `calibrate_task(task_description, …)` | Retrieve similar past rows, ranked, before scoping. Optional `regime`, `limit`, `include_low_provenance`. |
| `log_completion(…)` | Append a new row as work finishes, so the corpus grows from real outcomes. |
| `timecal://preamble` | A resource to load at conversation start that resets the agent's estimation prior. |
Same task, same model — the difference is one calibrate_task call:
User: Can we build a Slack bot that summarizes standup messages?
Without TimeCal: "That's a solid 2–3 week project — you'll need to design the webhook integration, build the summarization pipeline, handle Slack's API rate limits, and test across channels. I'd suggest phasing it: week 1 plumbing, week 2 the LLM layer, week 3 hardening."
With TimeCal (after `calibrate_task("slack bot that summarizes standup messages")` returns the `standup-summary-bot` row: human bucket weeks, actual ~5h active, agent_executed): "A close match in the corpus shipped in about 5 active hours, agent-executed; the webhook plumbing dominated, not the LLM call. This is an afternoon, not a sprint. Want me to start on the webhook handler?"
Every row in the projects table:
| Field | What it is |
|---|---|
| `regime` | What kind of work it was. This is the load-bearing column: it decides whether a human estimate was a fake prior or a real constraint. |
| `wall_clock_days` | Calendar span, idle gaps included. Never compared raw to active hours. |
| `active_hours` | Real work time at the keyboard. |
| `estimate_bucket` | What a human team would have guessed (`hours` … `year_plus`), with `estimate_raw` for nuance. |
| `data_quality` | How the row was measured: `dates_only`, `timed_session`, or `self_reported`. |
| `source` | Provenance. Empty-source rows are hidden from the default response. |
| `what_shipped` · `stack` · `tags` | Free-text description + tags the retrieval layer matches against. |
Modeling choices worth flagging:
- The two clocks are never reconciled. `wall_clock_days` and `active_hours` use different units on purpose; the formatter surfaces both and refuses to multiply or compare them, because the gap between them (idle time) is exactly the signal a human "it took three weeks" estimate hides.
- `regime` is what makes a human estimate readable. The three values map directly to why an estimate was what it was:
  - `agent_executed`: agent does the work end-to-end; a human-week estimate is usually really agent-hours.
  - `review_bound`: agent produces code in minutes, but human review / re-prompting dominates wall-clock.
  - `external_bound`: gated by people, data accrual, or training runs; "months" is months, not a prior.
- Provenance gating keeps synthetic data out of the prior. Rows with an empty `source` are filtered from the default `calibrate_task` response, so exploration/demo rows can't pollute what the agent reads. Pass `include_low_provenance=True` to override.
- `estimate_bucket` is ordinal, not a number: it keeps the human prior comparable across rows without pretending to a false precision the source data never had.
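The choices above can be sketched as a small data model. This is an illustration under assumptions, not the project's code: the `Row` dataclass, `visible_rows`, and `format_row` are hypothetical names, the output layout is invented, and the sample values in the usage lines (other than the field names and enum values listed above) are made up.

```python
from dataclasses import dataclass
from enum import Enum


class Regime(str, Enum):
    AGENT_EXECUTED = "agent_executed"
    REVIEW_BOUND = "review_bound"
    EXTERNAL_BOUND = "external_bound"


@dataclass(frozen=True)
class Row:
    what_shipped: str
    regime: Regime
    wall_clock_days: float  # calendar span, idle gaps included
    active_hours: float     # work time only; a different unit on purpose
    estimate_bucket: str    # ordinal label, kept as text rather than a number
    source: str = ""        # empty means low provenance


def visible_rows(rows, include_low_provenance=False):
    """Hide empty-source rows unless the caller explicitly opts in."""
    return [r for r in rows if include_low_provenance or r.source.strip()]


def format_row(row: Row) -> str:
    """Report both clocks side by side; never convert one into the other."""
    return (
        f"{row.what_shipped} [{row.regime.value}] "
        f"human estimate: {row.estimate_bucket} | "
        f"wall clock: {row.wall_clock_days:g} days | "
        f"active: {row.active_hours:g} h | "
        f"source: {row.source or 'none'}"
    )


# Illustrative rows only; the day and hour values are invented.
rows = [
    Row("standup summary bot", Regime.AGENT_EXECUTED, 2, 5, "weeks", "example"),
    Row("scratch demo", Regime.REVIEW_BOUND, 1, 1, "hours"),
]
for r in visible_rows(rows):
    print(format_row(r))
```

Keeping the two durations as separate, differently named fields with no conversion helper is one way to make "never compared raw" hard to violate by accident.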
Retrieval is deterministic and dependency-free — at corpus scale (tens of rows) this beats embeddings on cost and inspectability. The query and every row are expanded the same way before counting overlap, so a query for "authentication" matches a row that only ever says "OAuth"/"session", and "chatbot" matches "slack bot". See src/timecal/calibrate.py.
Tokenize ─── query + each row → lowercased tokens
(what_shipped · stack · name · tags)
Concept ─── singularize ("bots" → "bot"), then map synonyms to one concept:
expansion authentication · oauth · session · jwt → auth
bot · chatbot · assistant → bot
dashboard · chart · viz · graph → dataviz
pipeline · etl · scraper · crawler → pipeline
… cli · ml · docs · security · migration · payments · compliance
Filter ─── estimate_bucket present · regime (if requested)
source non-empty (low-provenance hidden unless include_low_provenance)
Score ─── overlap = | query_concepts ∩ row_concepts |
drop rows with 0 overlap · sort desc · take top `limit`
The agent-facing formatter then prints each match with both clocks, its regime and gloss, the human estimate vs. the actual, and provenance — never collapsing them into a single misleading "estimate."
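The tokenize, expand, and score steps in the diagram above can be sketched in a few lines. This is a rough approximation, not the contents of `calibrate.py`: the synonym groups are only the four spelled out in the diagram, the `singularize` rule is a naive assumption, the Filter step is left out, and the function names are hypothetical.

```python
import re

# Synonym groups copied from the diagram; the real map covers more concepts.
CONCEPTS = {
    "auth": ["authentication", "oauth", "session", "jwt"],
    "bot": ["bot", "chatbot", "assistant"],
    "dataviz": ["dashboard", "chart", "viz", "graph"],
    "pipeline": ["pipeline", "etl", "scraper", "crawler"],
}
SYNONYM_TO_CONCEPT = {
    word: concept for concept, words in CONCEPTS.items() for word in words
}


def singularize(token: str) -> str:
    """Naive plural stripping (an assumption): 'bots' -> 'bot'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def concepts(text: str) -> set[str]:
    """Lowercase, tokenize, singularize, then fold synonyms into concepts."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    singular = (singularize(t) for t in tokens)
    return {SYNONYM_TO_CONCEPT.get(t, t) for t in singular}


def rank(query: str, rows: list[dict], limit: int = 5) -> list[tuple[int, dict]]:
    """Score rows by concept overlap, drop zero scores, return the top `limit`."""
    query_concepts = concepts(query)
    scored = []
    for row in rows:
        fields = ("what_shipped", "stack", "name", "tags")
        text = " ".join(row.get(k, "") for k in fields)
        overlap = len(query_concepts & concepts(text))
        if overlap > 0:
            scored.append((overlap, row))
    # Python's sort is stable, so ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

Because the query and every row pass through the same `concepts()` function, a query for "authentication" and a row mentioning only "OAuth" both reduce to the `auth` concept and overlap, which is the behavior the section describes.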
pip install -e ".[dev]"
pytest -q
30 tests covering retrieval ranking and the regime / provenance / limit filters, the concept layer (synonym + plural matching, and that unrelated concepts stay excluded), the two-clock agent formatting, insert validation and enum rejection in log_completion, the reviewed-CSV importer, and the auto-init/auto-seed DB behavior (seeds a fresh DB, never reseeds an existing one, creates missing parent dirs). CI runs ruff + pytest on Python 3.11 for every push and PR (.github/workflows/ci.yml).
timecal/
├── src/timecal/
│ ├── server.py MCP server entrypoint (FastMCP, stdio)
│ ├── calibrate.py retrieval / ranking + concept layer + agent formatting
│ ├── log.py validation + insert (mcp-free, unit-tested)
│ ├── db.py DB path + auto-init/seed (honors TIMECAL_DB)
│ └── data/
│ ├── schema.sql SQLite schema (shipped in the wheel)
│ └── example.csv 10-row synthetic corpus, auto-seeded on first run
├── scripts/
│ ├── bootstrap.py pre-create + seed the DB without starting the server
│ ├── init_db.py create an empty DB from schema
│ └── import_seed.py import a reviewed CSV (validates enums)
└── tests/ pytest suite (db · calibrate · log · import_seed)
The auto-seeded example rows (marked source=example) are synthetic — enough to make the demo runnable, not enough to be your real prior. Build your own:
- point `TIMECAL_DB` at a fresh path, then log tasks as you finish them via the `log_completion` tool, or
- import a reviewed CSV into your `TIMECAL_DB`: `python scripts/import_seed.py path/to/your.csv` (rejects, rather than silently drops, rows with bad enums or empty `what_shipped`).
To keep your corpus free of the example rows, delete them with DELETE FROM projects WHERE source = 'example'; or start from an empty DB via python scripts/init_db.py.
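The importer's reject-rather-than-drop behavior might look roughly like the sketch below. It is not `import_seed.py`: `validate_rows` is a hypothetical name, the CSV column names are assumed to match the field names listed earlier, and failing the whole file on any bad row is one possible reading of "rejects".

```python
import csv
import io

REGIMES = {"agent_executed", "review_bound", "external_bound"}
DATA_QUALITY = {"dates_only", "timed_session", "self_reported"}


def validate_rows(csv_text: str) -> list[dict]:
    """Return every row if all are valid; otherwise raise, naming each problem."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    for n, row in enumerate(rows, start=1):
        regime = row.get("regime") or ""
        quality = row.get("data_quality") or ""
        if regime not in REGIMES:
            problems.append(f"row {n}: bad regime {regime!r}")
        if quality not in DATA_QUALITY:
            problems.append(f"row {n}: bad data_quality {quality!r}")
        if not (row.get("what_shipped") or "").strip():
            problems.append(f"row {n}: empty what_shipped")
    if problems:
        # Fail loudly instead of importing a silently thinned corpus.
        raise ValueError("; ".join(problems))
    return rows
```

Collecting every problem before raising means a reviewer can fix the CSV in one pass instead of rerunning the import once per bad row.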
v0.1.0 — published on PyPI, uvx timecal runs it with zero setup, CI green. The retrieval layer is deliberately simple (deterministic term overlap + a hand-built concept map); embedding-based retrieval can land later if relevance is genuinely poor on a larger corpus. The shipped corpus is a synthetic example — the real value compounds as agents log their own completions back.
Deliberately out of scope for now: a hosted/shared multi-user corpus, embeddings, and any analytics beyond term overlap. The MCP server + local SQLite file is the whole artifact.
MIT — see LICENSE.