TimeCal

A shared time-calibration corpus for coding agents, served over MCP. It counters the systematic over-estimation that LLM agents inherit from ~30 years of human software-engineering timelines — by handing the agent real "a human estimated X, it actually took Y" rows before it scopes your task, instead of letting it reach for an engineer-weeks prior.

Telling an agent "you're powerful, you'll be fast" doesn't override its training prior. Examples do. TimeCal is the examples.

flowchart LR
    Preamble["timecal://preamble<br/>resets the prior"] --> Agent
    Agent["Coding agent<br/>Claude Code · Codex<br/>Cursor · Cline"] -->|calibrate_task| Rank
    Agent -->|log_completion| DB
    Rank["Retrieval<br/>term overlap + concept layer"] --> DB
    DB[("Corpus<br/>(SQLite)")] --> Rank
    Rank --> Out["Ranked rows<br/>two clocks · regime"]
    Out --> Agent

    classDef box fill:#14171c,stroke:#2a313d,color:#ecebe5
    classDef store fill:#0f1115,stroke:#2a313d,color:#ecebe5,stroke-width:2px
    classDef out fill:#2a2117,stroke:#4a3a22,color:#f4d495
    class Agent,Preamble box
    class DB store
    class Rank,Out out

Ships as a zero-install MCP server: uvx timecal boots it over stdio against a local SQLite corpus that auto-seeds on first run. No database setup, no API keys, no network.

Why this exists

When an agent reads "build a Reddit→Claude pipeline," its training prior maps that to engineer-weeks. A capable agent driving Claude Code can ship the same thing in an afternoon. That bad prior cascades: things get called "infeasible solo," scope gets cut that didn't need cutting, multi-phase rollouts get proposed where one session would do.

The fix isn't a pep talk — it's grounded data. TimeCal models the calibration explicitly: a small MCP server backed by a local SQLite corpus of real "human-estimated → actually-took" rows that any MCP-aware agent retrieves before scoping. Every row separates the two clocks (wall-clock days vs. active hours — different units, never compared raw) and carries a regime so the reading agent can tell whether a human "months" estimate was a fake prior or a real external constraint. No black box: the agent sees the rows, not a number.

Run it

Zero-install, with uv:

uvx timecal                  # runs the MCP server over stdio

Or install it:

pip install timecal
python -m timecal.server

The corpus DB is created and seeded automatically on first run (10-row example corpus) at ~/.timecal/timecal.db — no setup step. Point TIMECAL_DB at any path to use your own corpus instead; the server, scripts, and tests all read it at call time.
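The sketch below shows the first-run behavior described above, written with the standard library only. The helper names (`db_path`, `ensure_db`), the stand-in schema, and the single seed row are hypothetical; the real logic lives in src/timecal/db.py and may differ. It reads TIMECAL_DB on every call, creates missing parent directories, and seeds a database file only when that file did not exist before.

```python
import os
import sqlite3
from pathlib import Path

DEFAULT_DB = Path.home() / ".timecal" / "timecal.db"

# Hypothetical stand-ins for data/schema.sql and data/example.csv.
SCHEMA = """
CREATE TABLE IF NOT EXISTS projects (
    id INTEGER PRIMARY KEY,
    name TEXT,
    what_shipped TEXT NOT NULL,
    source TEXT
);
"""
SEED_ROWS = [
    ("standup-summary-bot", "Slack bot that summarizes standup messages", "example"),
]


def db_path() -> Path:
    # Looked up per call, so tests and scripts can redirect TIMECAL_DB at runtime.
    return Path(os.environ.get("TIMECAL_DB") or DEFAULT_DB).expanduser()


def ensure_db() -> Path:
    path = db_path()
    is_new = not path.exists()
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing parent dirs
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success
            conn.executescript(SCHEMA)
            if is_new:
                # Seed only a freshly created file; never reseed an existing corpus.
                conn.executemany(
                    "INSERT INTO projects (name, what_shipped, source) VALUES (?, ?, ?)",
                    SEED_ROWS,
                )
    finally:
        conn.close()
    return path
```

Keying the seed decision on whether the file existed beforehand is what makes an empty database created by `scripts/init_db.py` stay empty.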

From source (for development):

git clone https://github.com/Conalh/timecal && cd timecal
pip install -e ".[dev]"
pytest -q                    # 30 passing

Use it from an agent

Claude Code:

claude mcp add timecal -- uvx timecal

Generic MCP client (mcp.json / client config):

{
  "mcpServers": {
    "timecal": {
      "command": "uvx",
      "args": ["timecal"],
      "env": { "TIMECAL_DB": "/path/to/your/corpus.db" }
    }
  }
}

(env is optional — drop it to use the auto-seeded default.) Once connected, the agent gets two tools and one resource:

| Surface | What it does |
|---|---|
| `calibrate_task(task_description, …)` | Retrieve similar past rows, ranked, before scoping. Optional `regime`, `limit`, `include_low_provenance`. |
| `log_completion(…)` | Append a new row as work finishes, so the corpus grows from real outcomes. |
| `timecal://preamble` | A resource to load at conversation start that resets the agent's estimation prior. |
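To see what a connected client actually sends, the sketch below builds the two requests an agent would typically make. It assumes the standard MCP JSON-RPC methods `resources/read` and `tools/call` over the stdio transport, where each message is one line of JSON. The argument names come from the table above. The argument values, including `limit` of 3, are illustrative, and the sketch omits the `initialize` handshake that a real client performs first.

```python
import json


def tools_call(request_id: int, name: str, arguments: dict) -> str:
    # JSON-RPC 2.0 request for the MCP "tools/call" method.
    msg = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # stdio transport: one message per line, no embedded newlines.
    return json.dumps(msg) + "\n"


def resources_read(request_id: int, uri: str) -> str:
    msg = {"jsonrpc": "2.0", "id": request_id, "method": "resources/read", "params": {"uri": uri}}
    return json.dumps(msg) + "\n"


# Load the preamble at conversation start, then calibrate before scoping.
preamble_req = resources_read(1, "timecal://preamble")
calibrate_req = tools_call(
    2,
    "calibrate_task",
    {
        "task_description": "slack bot that summarizes standup messages",
        "regime": "agent_executed",      # optional filter
        "limit": 3,                      # optional; illustrative value
        "include_low_provenance": False,
    },
)
```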

What it changes — before / after

Same task, same model — the difference is one calibrate_task call:

User: Can we build a Slack bot that summarizes standup messages?

Without TimeCal: "That's a solid 2–3 week project — you'll need to design the webhook integration, build the summarization pipeline, handle Slack's API rate limits, and test across channels. I'd suggest phasing it: week 1 plumbing, week 2 the LLM layer, week 3 hardening."

With TimeCal (after calibrate_task("slack bot that summarizes standup messages") returns the standup-summary-bot row — human bucket: weeks, actual: ~5h active, agent_executed): "A close match in the corpus shipped in about 5 active hours, agent-executed — the webhook plumbing dominated, not the LLM call. This is an afternoon, not a sprint. Want me to start on the webhook handler?"

Data model

Every row in the projects table:

| Field | What it is |
|---|---|
| `regime` | What kind of work it was: the load-bearing column. Decides whether a human estimate was a fake prior or a real constraint. |
| `wall_clock_days` | Calendar span, idle gaps included. Never compared raw to active hours. |
| `active_hours` | Real work time at the keyboard. |
| `estimate_bucket` | What a human team would have guessed (`hours` … `year_plus`), with `estimate_raw` for nuance. |
| `data_quality` | How the row was measured: `dates_only`, `timed_session`, or `self_reported`. |
| `source` | Provenance. Rows with an empty `source` are hidden from the default response. |
| `what_shipped` · `stack` · `tags` | Free-text description and tags that the retrieval layer matches against. |
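The sketch below shows one way to express this table in SQLite so that the database itself rejects bad enum values. The column names follow the table above, but the DDL is an assumption and not the shipped data/schema.sql. It adds a `name` column, because the retrieval layer matches on `name`. Making `regime` NOT NULL is also a guess. `estimate_bucket` stays free text because the full list of bucket values is elided above.

```python
import sqlite3

# Hypothetical DDL; only the enum values spelled out in this README are enforced.
SCHEMA = """
CREATE TABLE projects (
    id              INTEGER PRIMARY KEY,
    name            TEXT,
    what_shipped    TEXT NOT NULL CHECK (what_shipped <> ''),
    stack           TEXT,
    tags            TEXT,
    regime          TEXT NOT NULL
                    CHECK (regime IN ('agent_executed', 'review_bound', 'external_bound')),
    wall_clock_days REAL,   -- calendar span, idle gaps included
    active_hours    REAL,   -- keyboard time; a different unit, never compared raw
    estimate_bucket TEXT,   -- ordinal label such as 'hours' ... 'year_plus'
    estimate_raw    TEXT,
    data_quality    TEXT
                    CHECK (data_quality IN ('dates_only', 'timed_session', 'self_reported')),
    source          TEXT NOT NULL DEFAULT ''  -- empty = low provenance
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO projects (name, what_shipped, regime, active_hours, estimate_bucket, source)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    ("standup-summary-bot", "Slack standup summarizer", "agent_executed", 5, "weeks", "example"),
)

try:
    # "fast" is not one of the three regimes, so the CHECK constraint fires.
    conn.execute("INSERT INTO projects (what_shipped, regime) VALUES (?, ?)", ("misc", "fast"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```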

Modeling choices worth flagging:

- The two clocks are never reconciled. wall_clock_days and active_hours use different units on purpose. The formatter surfaces both and refuses to multiply or compare them, because the gap between them (idle time) is exactly the signal that a human "it took three weeks" estimate hides.
- regime is what makes a human estimate readable. The three values map directly to why an estimate was what it was:
  - agent_executed: the agent does the work end-to-end; a human-week estimate is usually really agent-hours.
  - review_bound: the agent produces code in minutes, but human review / re-prompting dominates wall-clock.
  - external_bound: gated by people, data accrual, or training runs; "months" is months, not a prior.
- Provenance gating keeps unsourced rows out of the prior. Rows with an empty source are filtered from the default calibrate_task response, so exploration/demo rows can't pollute what the agent reads. Pass include_low_provenance=True to override. The shipped example rows carry source=example, so this gate does not filter them.
- estimate_bucket is ordinal, not a number. That keeps the human prior comparable across rows without claiming a precision the source data never had.

How matching works

Retrieval is deterministic and dependency-free — at corpus scale (tens of rows) this beats embeddings on cost and inspectability. The query and every row are expanded the same way before counting overlap, so a query for "authentication" matches a row that only ever says "OAuth"/"session", and "chatbot" matches "slack bot". See src/timecal/calibrate.py.

Tokenize  ─── query + each row → lowercased tokens
              (what_shipped · stack · name · tags)

Concept   ─── singularize ("bots" → "bot"), then map synonyms to one concept:
expansion       authentication · oauth · session · jwt   → auth
                bot · chatbot · assistant                → bot
                dashboard · chart · viz · graph          → dataviz
                pipeline · etl · scraper · crawler       → pipeline
                … cli · ml · docs · security · migration · payments · compliance

Filter    ─── estimate_bucket present  ·  regime (if requested)
              source non-empty (low-provenance hidden unless include_low_provenance)

Score     ─── overlap = | query_concepts ∩ row_concepts |
              drop rows with 0 overlap  ·  sort desc  ·  take top `limit`
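The pipeline above can be sketched in standard-library Python. This is an illustration under stated assumptions, not the code in calibrate.py. The concept map contains only the four groups spelled out in full, because the synonyms for the remaining groups are elided above. The plural rule is a naive trailing-s strip, and tokens outside the map count as their own concept. The default `limit` of 5 is made up.

```python
import re

# Hypothetical subset of the concept map: only the groups listed in full above.
CONCEPTS = {
    **dict.fromkeys(["authentication", "oauth", "session", "jwt"], "auth"),
    **dict.fromkeys(["bot", "chatbot", "assistant"], "bot"),
    **dict.fromkeys(["dashboard", "chart", "viz", "graph"], "dataviz"),
    **dict.fromkeys(["pipeline", "etl", "scraper", "crawler"], "pipeline"),
}
TEXT_FIELDS = ("what_shipped", "stack", "name", "tags")


def singularize(token: str) -> str:
    # Naive plural strip: "bots" -> "bot"; leaves "ss" endings ("class") alone.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def concepts(text: str) -> set[str]:
    # Tokenize, singularize, then collapse synonyms; unmapped tokens stand for themselves.
    tokens = (singularize(t) for t in re.findall(r"[a-z0-9]+", text.lower()))
    return {CONCEPTS.get(t, t) for t in tokens}


def rank(query, rows, regime=None, limit=5, include_low_provenance=False):
    query_concepts = concepts(query)
    scored = []
    for row in rows:
        if not row.get("estimate_bucket"):
            continue  # no human estimate, nothing to calibrate against
        if regime is not None and row.get("regime") != regime:
            continue
        if not row.get("source") and not include_low_provenance:
            continue  # provenance gate
        text = " ".join(str(row.get(f) or "") for f in TEXT_FIELDS)
        overlap = len(query_concepts & concepts(text))
        if overlap:
            scored.append((overlap, row))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep corpus order
    return [row for _, row in scored[:limit]]


CORPUS = [
    {"name": "standup-summary-bot", "what_shipped": "Slack bot that posts standup summaries",
     "tags": "slack", "estimate_bucket": "weeks", "regime": "agent_executed", "source": "example"},
    {"name": "login-flow", "what_shipped": "OAuth sessions for a web app",
     "estimate_bucket": "weeks", "regime": "review_bound", "source": "log"},
    {"name": "scratch", "what_shipped": "chatbot experiment",
     "estimate_bucket": "hours", "regime": "agent_executed", "source": ""},
]
```

Here `rank("authentication", CORPUS)` returns only the OAuth row, and `rank("chatbot", CORPUS)` returns the Slack bot row. The unsourced `scratch` row stays hidden unless `include_low_provenance=True`. Each score is a plain set intersection, so you can explain any score by listing the shared concepts. The sketch does no stopword filtering, so common words such as "that" can add overlap.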

The agent-facing formatter then prints each match with both clocks, its regime and gloss, the human estimate vs. the actual, and provenance — never collapsing them into a single misleading "estimate."
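The sketch below shows a formatter that follows the two-clock rule. It prints each clock in its own unit and never derives one from the other. The layout, the `format_match` name, and the one-line regime glosses (paraphrased from the Data model section) are assumptions, not the real formatter's output.

```python
# Regime glosses paraphrased from the Data model section (hypothetical wording).
GLOSS = {
    "agent_executed": "agent did the work end-to-end; a human-week estimate is usually agent-hours",
    "review_bound": "human review / re-prompting dominated wall-clock",
    "external_bound": "gated by people, data accrual, or training runs; 'months' means months",
}


def _clock(value, unit: str) -> str:
    # One clock, in its own unit; missing data stays visibly missing.
    return "not recorded" if value is None else f"{value:g} {unit}"


def format_match(row: dict) -> str:
    # Each clock gets its own line; nothing is multiplied, summed, or compared.
    return "\n".join([
        f"{row['name']} [{row['regime']}]: {GLOSS[row['regime']]}",
        f"  human estimate: {row['estimate_bucket']}",
        f"  actual, active time: {_clock(row.get('active_hours'), 'h')}",
        f"  actual, calendar span: {_clock(row.get('wall_clock_days'), 'days')}",
        f"  provenance: {row.get('source') or '(none)'} "
        f"({row.get('data_quality') or 'unknown quality'})",
    ])
```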

Tests

pip install -e ".[dev]"
pytest -q

30 tests covering retrieval ranking and the regime / provenance / limit filters, the concept layer (synonym + plural matching, and that unrelated concepts stay excluded), the two-clock agent formatting, insert validation and enum rejection in log_completion, the reviewed-CSV importer, and the auto-init/auto-seed DB behavior (seeds a fresh DB, never reseeds an existing one, creates missing parent dirs). CI runs ruff + pytest on Python 3.11 for every push and PR (.github/workflows/ci.yml).

Project layout

timecal/
├── src/timecal/
│   ├── server.py        MCP server entrypoint (FastMCP, stdio)
│   ├── calibrate.py     retrieval / ranking + concept layer + agent formatting
│   ├── log.py           validation + insert (mcp-free, unit-tested)
│   ├── db.py            DB path + auto-init/seed (honors TIMECAL_DB)
│   └── data/
│       ├── schema.sql   SQLite schema (shipped in the wheel)
│       └── example.csv  10-row synthetic corpus, auto-seeded on first run
├── scripts/
│   ├── bootstrap.py     pre-create + seed the DB without starting the server
│   ├── init_db.py       create an empty DB from schema
│   └── import_seed.py   import a reviewed CSV (validates enums)
└── tests/               pytest suite (db · calibrate · log · import_seed)

Bring your own corpus

The auto-seeded example rows (marked source=example) are synthetic — enough to make the demo runnable, not enough to be your real prior. Build your own:

- Point TIMECAL_DB at a fresh path, then log tasks as you finish them via the log_completion tool, or
- import a reviewed CSV into your TIMECAL_DB: python scripts/import_seed.py path/to/your.csv (it rejects, rather than silently drops, rows with bad enums or an empty what_shipped).

To keep your corpus free of the example rows, delete them with DELETE FROM projects WHERE source = 'example'; or start from an empty DB via python scripts/init_db.py.
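The sketch below shows import-with-rejection followed by the cleanup query quoted above. It assumes a CSV whose header names match the column names. The `import_csv` helper, the stand-in four-column table, and the error format are hypothetical; scripts/import_seed.py is the real importer. This version validates every row before inserting any, so a file with a bad row imports nothing. That all-or-nothing behavior is a choice made for this sketch.

```python
import csv
import io
import sqlite3

REGIMES = {"agent_executed", "review_bound", "external_bound"}
DATA_QUALITY = {"dates_only", "timed_session", "self_reported"}


def import_csv(conn: sqlite3.Connection, text: str) -> int:
    rows = list(csv.DictReader(io.StringIO(text)))
    errors = []
    for n, row in enumerate(rows, start=1):
        if not (row.get("what_shipped") or "").strip():
            errors.append(f"row {n}: empty what_shipped")
        if row.get("regime") not in REGIMES:
            errors.append(f"row {n}: bad regime {row.get('regime')!r}")
        if row.get("data_quality") and row["data_quality"] not in DATA_QUALITY:
            errors.append(f"row {n}: bad data_quality {row['data_quality']!r}")
    if errors:
        raise ValueError("; ".join(errors))  # reject loudly instead of skipping rows
    with conn:
        conn.executemany(
            "INSERT INTO projects (what_shipped, regime, data_quality, source)"
            " VALUES (:what_shipped, :regime, :data_quality, :source)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (what_shipped TEXT, regime TEXT, data_quality TEXT, source TEXT)")
conn.execute("INSERT INTO projects VALUES ('demo row', 'agent_executed', 'dates_only', 'example')")
imported = import_csv(
    conn, "what_shipped,regime,data_quality,source\nCLI tool,agent_executed,timed_session,log\n"
)
with conn:
    # The cleanup query from this section: drop the synthetic seed rows.
    conn.execute("DELETE FROM projects WHERE source = 'example'")
```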

Status

v0.1.0 — published on PyPI, uvx timecal runs it with zero setup, CI green. The retrieval layer is deliberately simple (deterministic term overlap + a hand-built concept map); embedding-based retrieval can land later if relevance is genuinely poor on a larger corpus. The shipped corpus is a synthetic example — the real value compounds as agents log their own completions back.

Deliberately out of scope for now: a hosted/shared multi-user corpus, embeddings, and any analytics beyond term overlap. The MCP server + local SQLite file is the whole artifact.

License

MIT — see LICENSE.
