TimeCal

A shared time-calibration corpus for coding agents, served over MCP. It counters the systematic over-estimation that LLM agents inherit from ~30 years of human software-engineering timelines: before the agent scopes your task, TimeCal hands it real "a human estimated X, it actually took Y" rows, so it does not reach for an engineer-weeks prior.

Telling an agent "you're powerful, you'll be fast" doesn't override its training prior. Examples do. TimeCal is the examples.

flowchart LR
    Preamble["timecal://preamble<br/>resets the prior"] --> Agent
    Agent["Coding agent<br/>Claude Code · Codex<br/>Cursor · Cline"] -->|calibrate_task| Rank
    Agent -->|log_completion| DB
    Rank["Retrieval<br/>term overlap + concept layer"] --> DB
    DB[("Corpus<br/>(SQLite)")] --> Rank
    Rank --> Out["Ranked rows<br/>two clocks · regime"]
    Out --> Agent

    classDef box fill:#14171c,stroke:#2a313d,color:#ecebe5
    classDef store fill:#0f1115,stroke:#2a313d,color:#ecebe5,stroke-width:2px
    classDef out fill:#2a2117,stroke:#4a3a22,color:#f4d495
    class Agent,Preamble box
    class DB store
    class Rank,Out out

Ships as a zero-install MCP server: uvx timecal boots it over stdio against a local SQLite corpus that auto-seeds on first run. No database setup, no API keys, no network.

Why this exists

When an agent reads "build a Reddit→Claude pipeline," its training prior maps that to engineer-weeks. A capable agent driving Claude Code can ship the same thing in an afternoon. That bad prior cascades: things get called "infeasible solo," scope gets cut that didn't need cutting, multi-phase rollouts get proposed where one session would do.

The fix isn't a pep talk — it's grounded data. TimeCal models the calibration explicitly: a small MCP server backed by a local SQLite corpus of real "human-estimated → actually-took" rows that any MCP-aware agent retrieves before scoping. Every row separates the two clocks (wall-clock days vs. active hours — different units, never compared raw) and carries a regime so the reading agent can tell whether a human "months" estimate was a fake prior or a real external constraint. No black box: the agent sees the rows, not a number.

Run it

Zero-install, with uv:

uvx timecal                  # runs the MCP server over stdio

Or install it:

pip install timecal
python -m timecal.server

The corpus DB is created and seeded automatically on first run (10-row example corpus) at ~/.timecal/timecal.db — no setup step. Point TIMECAL_DB at any path to use your own corpus instead; the server, scripts, and tests all read it at call time.
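
The path lookup and seed-once behavior described above could look roughly like the sketch below. The function names (resolve_db_path, connect) and the two-column table are hypothetical stand-ins, not the contents of db.py; only the TIMECAL_DB variable and the default path come from this README.

```python
import os
import sqlite3
from pathlib import Path

# Default location named in this README.
DEFAULT_DB = Path.home() / ".timecal" / "timecal.db"


def resolve_db_path() -> Path:
    """Read TIMECAL_DB on every call, so a changed environment takes effect."""
    override = os.environ.get("TIMECAL_DB")
    return Path(override).expanduser() if override else DEFAULT_DB


def connect(seed_rows=()):
    """Open the corpus, creating parent dirs and seeding only a brand-new file."""
    path = resolve_db_path()
    fresh = not path.exists()
    path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)
    # Hypothetical minimal table; the real schema ships as schema.sql.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS projects (what_shipped TEXT NOT NULL, source TEXT)"
    )
    if fresh:
        # An existing DB is never reseeded.
        conn.executemany(
            "INSERT INTO projects (what_shipped, source) VALUES (?, ?)", seed_rows
        )
        conn.commit()
    return conn
```

Resolving the path inside the function, rather than once at import, is what lets the server, scripts, and tests all honor a TIMECAL_DB that is set late.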

From source (for development):

git clone https://github.com/Conalh/timecal && cd timecal
pip install -e ".[dev]"
pytest -q                    # 30 passing

Use it from an agent

Claude Code:

claude mcp add timecal -- uvx timecal

Generic MCP client (mcp.json / client config):

{
  "mcpServers": {
    "timecal": {
      "command": "uvx",
      "args": ["timecal"],
      "env": { "TIMECAL_DB": "/path/to/your/corpus.db" }
    }
  }
}

(env is optional — drop it to use the auto-seeded default.) Once connected, the agent gets two tools and one resource:

| Surface | What it does |
| --- | --- |
| calibrate_task(task_description, …) | Retrieve similar past rows, ranked, before scoping. Optional regime, limit, include_low_provenance. |
| log_completion(…) | Append a new row as work finishes, so the corpus grows from real outcomes. |
| timecal://preamble | A resource to load at conversation start that resets the agent's estimation prior. |

What it changes — before / after

Same task, same model — the difference is one calibrate_task call:

User: Can we build a Slack bot that summarizes standup messages?

Without TimeCal: "That's a solid 2–3 week project — you'll need to design the webhook integration, build the summarization pipeline, handle Slack's API rate limits, and test across channels. I'd suggest phasing it: week 1 plumbing, week 2 the LLM layer, week 3 hardening."

With TimeCal (after calibrate_task("slack bot that summarizes standup messages") returns the standup-summary-bot row — human bucket: weeks, actual: ~5h active, agent_executed): "A close match in the corpus shipped in about 5 active hours, agent-executed — the webhook plumbing dominated, not the LLM call. This is an afternoon, not a sprint. Want me to start on the webhook handler?"

Data model

Every row in the projects table:

| Field | What it is |
| --- | --- |
| regime | What kind of work it was. This is the load-bearing column: it decides whether a human estimate was a fake prior or a real constraint. |
| wall_clock_days | Calendar span, idle gaps included. Never compared raw to active hours. |
| active_hours | Real work time at the keyboard. |
| estimate_bucket | What a human team would have guessed (a bucket from hours to year_plus), with estimate_raw for nuance. |
| data_quality | How the row was measured: dates_only, timed_session, or self_reported. |
| source | Provenance. Empty-source rows are hidden from the default response. |
| what_shipped · stack · tags | Free-text description + tags the retrieval layer matches against. |
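
As a rough illustration of these fields, here is a minimal sketch of a projects table in SQLite. The column types, nullability, and CHECK constraints are assumptions made for the example; the schema the project actually uses is src/timecal/data/schema.sql. The data_quality value on the sample row is also illustrative.

```python
import sqlite3

# Assumed shape only: column names follow the field list above,
# while types and constraints are guesses for illustration.
SCHEMA = """
CREATE TABLE projects (
    id              INTEGER PRIMARY KEY,
    name            TEXT,
    what_shipped    TEXT NOT NULL CHECK (length(trim(what_shipped)) > 0),
    stack           TEXT DEFAULT '',
    tags            TEXT DEFAULT '',
    regime          TEXT NOT NULL
        CHECK (regime IN ('agent_executed', 'review_bound', 'external_bound')),
    wall_clock_days REAL,   -- calendar span, idle gaps included
    active_hours    REAL,   -- time at the keyboard; a different unit on purpose
    estimate_bucket TEXT,
    estimate_raw    TEXT,
    data_quality    TEXT
        CHECK (data_quality IN ('dates_only', 'timed_session', 'self_reported')),
    source          TEXT DEFAULT ''
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One row modeled on the standup-summary-bot example from this README.
conn.execute(
    "INSERT INTO projects (name, what_shipped, regime, active_hours,"
    " estimate_bucket, data_quality, source) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("standup-summary-bot", "Slack bot that summarizes standup messages",
     "agent_executed", 5.0, "weeks", "timed_session", "example"),
)
conn.commit()
```

With constraints like these, a row carrying an unknown regime fails at insert time with sqlite3.IntegrityError instead of entering the corpus.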

Modeling choices worth flagging:

  • The two clocks are never reconciled. wall_clock_days and active_hours use different units on purpose; the formatter surfaces both and refuses to multiply or compare them — because the gap between them (idle time) is exactly the signal a human "it took three weeks" estimate hides.
  • regime is what makes a human estimate readable. The three values map directly to why an estimate was what it was:
    • agent_executed — agent does the work end-to-end; a human-week estimate is usually really agent-hours.
    • review_bound — agent produces code in minutes, but human review / re-prompting dominates wall-clock.
    • external_bound — gated by people, data accrual, or training runs; "months" is months, not a prior.
  • Provenance gating keeps synthetic data out of the prior. Rows with an empty source are filtered from the default calibrate_task response, so exploration/demo rows can't pollute what the agent reads. Pass include_low_provenance=True to override.
  • estimate_bucket is ordinal, not a number — it keeps the human prior comparable across rows without pretending to a false precision the source data never had.
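
To make the last two points concrete, the sketch below treats the bucket as a rank and pairs it with a regime gloss instead of converting it to a number. The names BUCKET_ORDER, bucket_rank, and read_estimate are hypothetical, and the full bucket list is an assumption: only hours, weeks, and year_plus appear in this README.

```python
# Buckets in increasing order. "hours", "weeks", and "year_plus" appear in this
# README; "days" and "months" are assumed here to fill the gaps.
BUCKET_ORDER = ["hours", "days", "weeks", "months", "year_plus"]

# Glosses paraphrased from the three regime descriptions above.
REGIME_GLOSS = {
    "agent_executed": "agent does the work end-to-end; likely a prior, not a constraint",
    "review_bound": "human review and re-prompting dominate wall-clock",
    "external_bound": "gated by people, data accrual, or training runs; a real constraint",
}


def bucket_rank(bucket: str) -> int:
    """Position in the ordering; supports comparison but carries no duration."""
    return BUCKET_ORDER.index(bucket)


def read_estimate(bucket: str, regime: str) -> str:
    """Pair the human bucket with its regime gloss, without any arithmetic."""
    return f"human estimate: {bucket} ({REGIME_GLOSS[regime]})"
```

Ranks can say that weeks is a larger guess than hours, but nothing in the sketch claims how many hours a "weeks" estimate is worth.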

How matching works

Retrieval is deterministic and dependency-free — at corpus scale (tens of rows) this beats embeddings on cost and inspectability. The query and every row are expanded the same way before counting overlap, so a query for "authentication" matches a row that only ever says "OAuth"/"session", and "chatbot" matches "slack bot". See src/timecal/calibrate.py.

Tokenize  ─── query + each row → lowercased tokens
              (what_shipped · stack · name · tags)

Concept   ─── singularize ("bots" → "bot"), then map synonyms to one concept:
expansion       authentication · oauth · session · jwt   → auth
                bot · chatbot · assistant                → bot
                dashboard · chart · viz · graph          → dataviz
                pipeline · etl · scraper · crawler       → pipeline
                … cli · ml · docs · security · migration · payments · compliance

Filter    ─── estimate_bucket present  ·  regime (if requested)
              source non-empty (low-provenance hidden unless include_low_provenance)

Score     ─── overlap = | query_concepts ∩ row_concepts |
              drop rows with 0 overlap  ·  sort desc  ·  take top `limit`
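
The four stages above can be sketched in a few lines of Python. This is a simplified illustration, not the code in calibrate.py: the function names, the naive plural rule, the default limit of 5, and the dict-shaped rows are assumptions, the concept map contains only the four groups spelled out above, and stop-word handling is left out.

```python
import re

# Only the four concept groups listed above; the real map has more.
CONCEPTS = {
    "auth": {"authentication", "oauth", "session", "jwt"},
    "bot": {"bot", "chatbot", "assistant"},
    "dataviz": {"dashboard", "chart", "viz", "graph"},
    "pipeline": {"pipeline", "etl", "scraper", "crawler"},
}
SYNONYM_TO_CONCEPT = {word: c for c, words in CONCEPTS.items() for word in words}


def singularize(token: str) -> str:
    """Naive plural stripping ("bots" -> "bot"); an assumed rule."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def concepts(text: str) -> set:
    """Tokenize, singularize, then fold synonyms into one concept."""
    out = set()
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        token = singularize(token)
        out.add(SYNONYM_TO_CONCEPT.get(token, token))
    return out


def row_text(row: dict) -> str:
    return " ".join(str(row.get(k) or "") for k in ("what_shipped", "stack", "name", "tags"))


def calibrate(query, rows, regime=None, limit=5, include_low_provenance=False):
    """Filter, score by concept overlap, drop zero-overlap rows, keep the top `limit`."""
    query_concepts = concepts(query)
    scored = []
    for row in rows:
        if not row.get("estimate_bucket"):
            continue
        if regime and row.get("regime") != regime:
            continue
        if not include_low_provenance and not (row.get("source") or "").strip():
            continue
        overlap = len(query_concepts & concepts(row_text(row)))
        if overlap:
            scored.append((overlap, row))
    # Stable sort: rows with equal overlap keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:limit]]
```

Because the query and the rows go through the same concepts() function, a query for "chatbot" and a row that says "Slack bot" meet at the shared concept bot without any embedding model.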

The agent-facing formatter then prints each match with both clocks, its regime and gloss, the human estimate vs. the actual, and provenance — never collapsing them into a single misleading "estimate."
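
A formatter in this spirit might look like the sketch below. The function name format_match, the output layout, and the optional regime_gloss argument are all hypothetical; the point is only that each clock gets its own line and no derived number is computed from them.

```python
def format_match(row: dict, regime_gloss: dict = None) -> str:
    """Render one match with both clocks on separate lines, never combined."""
    gloss = (regime_gloss or {}).get(row["regime"], "")
    days = row.get("wall_clock_days")
    hours = row.get("active_hours")
    lines = [
        f"{row['name']} [{row['regime']}]" + (f": {gloss}" if gloss else ""),
        f"  human estimate: {row['estimate_bucket']}",
        # Each clock keeps its own unit; a missing clock is reported, not inferred.
        f"  wall-clock: {days} days" if days is not None else "  wall-clock: not recorded",
        f"  active: {hours} hours" if hours is not None else "  active: not recorded",
        f"  provenance: {row.get('source') or 'none'} ({row.get('data_quality') or 'unknown'})",
    ]
    return "\n".join(lines)
```

A missing clock is printed as "not recorded" rather than back-filled from the other one, which would amount to the reconciliation the data model avoids.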

Tests

pip install -e ".[dev]"
pytest -q

30 tests covering retrieval ranking and the regime / provenance / limit filters, the concept layer (synonym + plural matching, and that unrelated concepts stay excluded), the two-clock agent formatting, insert validation and enum rejection in log_completion, the reviewed-CSV importer, and the auto-init/auto-seed DB behavior (seeds a fresh DB, never reseeds an existing one, creates missing parent dirs). CI runs ruff + pytest on Python 3.11 for every push and PR (.github/workflows/ci.yml).

Project layout

timecal/
├── src/timecal/
│   ├── server.py        MCP server entrypoint (FastMCP, stdio)
│   ├── calibrate.py     retrieval / ranking + concept layer + agent formatting
│   ├── log.py           validation + insert (mcp-free, unit-tested)
│   ├── db.py            DB path + auto-init/seed (honors TIMECAL_DB)
│   └── data/
│       ├── schema.sql   SQLite schema (shipped in the wheel)
│       └── example.csv  10-row synthetic corpus, auto-seeded on first run
├── scripts/
│   ├── bootstrap.py     pre-create + seed the DB without starting the server
│   ├── init_db.py       create an empty DB from schema
│   └── import_seed.py   import a reviewed CSV (validates enums)
└── tests/               pytest suite (db · calibrate · log · import_seed)

Bring your own corpus

The auto-seeded example rows (marked source=example) are synthetic — enough to make the demo runnable, not enough to be your real prior. Build your own:

  • point TIMECAL_DB at a fresh path, then log tasks as you finish them via the log_completion tool, or
  • import a reviewed CSV into your TIMECAL_DB: python scripts/import_seed.py path/to/your.csv (rejects, rather than silently drops, rows with bad enums or empty what_shipped).

To keep your corpus free of the example rows, delete them with DELETE FROM projects WHERE source = 'example'; or start from an empty DB via python scripts/init_db.py.
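
The "rejects, rather than silently drops" behavior of the importer can be illustrated with a small validation pass over CSV text. This is a sketch under assumptions, not scripts/import_seed.py: the function name validate_rows, the column set, and the shape of the rejection report are invented for the example.

```python
import csv
import io

REGIMES = {"agent_executed", "review_bound", "external_bound"}
DATA_QUALITY = {"dates_only", "timed_session", "self_reported"}


def validate_rows(csv_text: str):
    """Split CSV rows into accepted rows and (line number, problems) rejections."""
    accepted, rejected = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        problems = []
        if not (row.get("what_shipped") or "").strip():
            problems.append("empty what_shipped")
        if row.get("regime") not in REGIMES:
            problems.append(f"bad regime {row.get('regime')!r}")
        if row.get("data_quality") not in DATA_QUALITY:
            problems.append(f"bad data_quality {row.get('data_quality')!r}")
        if problems:
            rejected.append((line_no, problems))
        else:
            accepted.append(row)
    return accepted, rejected
```

Returning the rejections with line numbers means a reviewer can fix the CSV, rather than discovering later that rows never made it into the corpus.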

Status

v0.1.0 — published on PyPI, uvx timecal runs it with zero setup, CI green. The retrieval layer is deliberately simple (deterministic term overlap + a hand-built concept map); embedding-based retrieval can land later if relevance is genuinely poor on a larger corpus. The shipped corpus is a synthetic example — the real value compounds as agents log their own completions back.

Deliberately out of scope for now: a hosted/shared multi-user corpus, embeddings, and any analytics beyond term overlap. The MCP server + local SQLite file is the whole artifact.

License

MIT — see LICENSE.
