feat: ground chat agent in a cached, auto-generated PolicyEngine API reference #36

vahid-ahmadi merged 5 commits into main
Conversation
Adds scripts/build_reference.py which walks the installed policyengine_uk_compiled library and dumps docstrings, capabilities(), and the full Parameters JSON schema to backend/reference.md (~30k tokens). The chat loop sends it as a second cache_control=ephemeral system block so the agent grounds code generation against the actually installed API instead of training-data memory, at ~10% of the normal input cost thanks to prompt caching. Re-run scripts/build_reference.py after bumping policyengine-uk-compiled to refresh the reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
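For context, a minimal sketch of the two-block caching pattern described above, using the Anthropic Python SDK; the variable names, model id, and loading location are assumptions, not the actual `backend/routes/chatbot.py` code:

```python
# Minimal sketch: SYSTEM_PROMPT plus the generated reference sent as two
# system blocks, the second marked cacheable. Names and model id are
# assumptions; the real wiring lives in backend/routes/chatbot.py.
from pathlib import Path

import anthropic

REFERENCE_MD = Path("backend/reference.md").read_text()   # built by scripts/build_reference.py
SYSTEM_PROMPT = "You are PolicyEngine's chat agent. ..."   # behavioural rules only

client = anthropic.Anthropic()

def chat(messages: list[dict]):
    return client.messages.create(
        model="claude-haiku-4-5",          # assumed; routing may pick Sonnet instead
        max_tokens=2048,
        system=[
            # Block 1: behavioural guardrails kept in the prompt.
            {"type": "text", "text": SYSTEM_PROMPT},
            # Block 2: the ~30k-token generated API reference. The
            # cache_control marker caches the whole prefix up to this point,
            # so later turns read it at the reduced cached-input rate.
            {
                "type": "text",
                "text": REFERENCE_MD,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=messages,
    )
```

Because the cache breakpoint covers the full prefix, the tool definitions and both system blocks are all read from cache on subsequent turns, which is where the ~10% input-cost figure comes from.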
Those sections were hardcoded examples that drift every time policyengine-uk-compiled ships a release. The cached reference added in #36 — docstrings, capabilities() snapshot, full Parameters JSON schema — is already the authoritative source, and the agent reaches the same answers by calling capabilities() dynamically. Keep only the behavioral guardrails and point the model at the reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SakshiKekre
left a comment
I like the idea here. One thought I had is that right now the app is still consuming an already-generated reference.md. If the follow-up PR wires regeneration into the build / startup path so the backend always comes up serving a freshly generated reference, then I think this concern mostly goes away.
Otherwise, it feels like we may just be recreating the same drift issue in a different form: not stale handwritten prompt examples, but a stale generated reference artifact.
Looking forward to the follow-up PR for that piece!
One separate concern: the generator currently seems to include inherited/generic docs as well as real package docs. In the generated reference I can already see built-in container docstrings and large chunks of generic Pydantic ModelMetaclass boilerplate. That increases prompt size/cost and lowers signal.
Could we tighten the generator so it prefers actual PolicyEngine API surface and skips framework/builtin boilerplate?
- Skip re-exports from outside policyengine_uk_compiled (stdlib, pydantic).
- Drop inherited docstrings (Pydantic BaseModel boilerplate was emitted on every model, adding ~30 lines of noise per entry).
- For module-level data constants, emit the value repr instead of the container class's stdlib docstring (DATASETS now shows the actual tuple, HOUSEHOLD_DEFAULTS shows the actual dict).
- Regenerate reference.md during `docker build` so deployed images always ship a reference matching the installed library version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
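A sketch of those filtering rules, with illustrative helper names (the actual generator lives in scripts/build_reference.py):

```python
# Sketch of the tightened collection rules; helper names are illustrative,
# the real logic is in scripts/build_reference.py.
import inspect

PACKAGE = "policyengine_uk_compiled"

def is_own_object(obj) -> bool:
    """Skip re-exports: only keep objects defined inside the package itself."""
    module = getattr(obj, "__module__", "") or ""
    return module == PACKAGE or module.startswith(PACKAGE + ".")

def own_docstring(cls: type) -> str | None:
    """Drop inherited docstrings (Pydantic BaseModel boilerplate): only a
    __doc__ defined on the class itself gets emitted."""
    doc = vars(cls).get("__doc__")
    return inspect.cleandoc(doc) if doc else None

def describe_constant(name: str, value: object) -> str:
    """For module-level data constants, emit the value repr rather than the
    container type's stdlib docstring (tuple/dict docs are pure noise)."""
    return f"{name} = {value!r}"
```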
Thanks @SakshiKekre — both points fair, pushed 8fb456d on this branch:

- Drift (build/startup regeneration): reference.md is now regenerated during `docker build`, so deployed images ship a reference matching the installed library version.
- Generator noise: tightened the generator to skip re-exports from outside policyengine_uk_compiled, drop inherited docstrings, and emit value reprs for module-level constants.

Will share the post-rebuild size delta once the preview redeploys. Expected shape: the reference section drops from ~3100 lines to roughly a third of that, dominated now by real signatures + the Parameters JSON schema.
SakshiKekre
left a comment
Thanks for the follow-up. I think this is moving in the right direction.
Question: Do we need to do this for Modal as well since our deployed backend path is Modal?
Also, the committed backend/reference.md still contains the builtin dict docs and Pydantic BaseModel boilerplate, so I'm not sure the tightened generator was run before committing.
Two follow-ups to the second-round review:

1. Modal is the production deploy path, but its image only had `add_local_dir("backend", ...)` after pip_install — so it shipped whatever reference.md was checked into git, not a fresh build. Add `.run_commands("python scripts/build_reference.py")` after the local dir is copied so Modal regenerates against its own installed policyengine-uk-compiled, mirroring the Dockerfile (a sketch follows this note).
2. Stop committing backend/reference.md. Both deploy paths now regenerate it at image build time, so the only role of a checked-in copy is to drift — which is exactly what the previous reviewer flagged (builtin dict / Pydantic ModelMetaclass docs surviving from the pre-tightened generator). Add it to .gitignore; local dev regenerates it via `docker-compose exec backend python scripts/build_reference.py` or by running the script directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
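A rough sketch of the Modal change suggested in point 1; the image chain, package list, paths, and app name are assumptions based on the comment above, not the repo's actual Modal app definition:

```python
# Sketch of the suggested Modal image; packages, paths, and the app name are
# assumptions, not the repo's actual Modal config.
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("policyengine-uk-compiled", "fastapi", "anthropic")
    # copy=True bakes the backend into the image at build time; a runtime
    # mount would not exist yet when run_commands executes.
    .add_local_dir("backend", "/root/backend", copy=True)
    # Regenerate the reference against the library installed in this image,
    # mirroring the Dockerfile, instead of shipping the checked-in copy.
    .run_commands("cd /root/backend && python scripts/build_reference.py")
)

app = modal.App("policyengine-chat-backend", image=image)
```

The ordering matters: the regeneration step has to run after the backend directory is copied into the image, otherwise it would rebuild against nothing or against the checked-in file.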
Thanks @SakshiKekre — both fair. Pushed 2d92139:

- Modal regeneration: you're right, Modal was the gap. The image now regenerates reference.md after the backend directory is copied, mirroring the Dockerfile.
- Stale committed reference: you diagnosed it correctly — the committed file was generated before the tightened generator landed, which is why it still had the builtin dict docs and Pydantic boilerplate. The file is now regenerated at image build time instead of committed.
# Conflicts:
#	backend/routes/chatbot.py
Summary
Replaces drift-prone hardcoded API examples in the system prompt with a freshly generated, prompt-cached reference extracted from the installed library.
Commit 1 — inject cached reference (2084bf5)

- `backend/scripts/build_reference.py` walks the installed `policyengine_uk_compiled`, dumps docstrings, `capabilities()` output, and the full `Parameters` JSON schema to `backend/reference.md` (~30k tokens).
- `backend/routes/chatbot.py` loads it at import and sends it as a second `cache_control=ephemeral` system block alongside `SYSTEM_PROMPT`.
- `_select_chat_model` counts the reference in the Haiku→Sonnet routing estimate (sketched below, after the commit breakdown).

Commit 2 — drop stale prompt blocks (0214aa4)

- Drops the `COMMON WORKFLOWS`, `MODELLING SCOPE`, and `DATASETS` sections from `SYSTEM_PROMPT` — all hardcoded examples now covered by the reference.
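The routing tweak could look roughly like the following; the estimator, threshold, and model ids are illustrative, and only the idea of counting the reference in the estimate comes from this PR:

```python
# Illustrative routing sketch: the threshold and token estimator are made up;
# only "count the reference in the Haiku -> Sonnet estimate" comes from the PR.
HAIKU = "claude-haiku-4-5"        # assumed model ids
SONNET = "claude-sonnet-4-5"

def _estimate_tokens(text: str) -> int:
    # Crude ~4 chars/token heuristic; good enough for routing, not billing.
    return len(text) // 4

def _select_chat_model(user_message: str, system_prompt: str, reference_md: str) -> str:
    # The reference rides along on every request (cached or not), so it has
    # to be part of the size estimate that decides Haiku vs Sonnet.
    estimated = (
        _estimate_tokens(system_prompt)
        + _estimate_tokens(reference_md)
        + _estimate_tokens(user_message)
    )
    return SONNET if estimated > 150_000 else HAIKU
```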
Why

Claude's training-data memory of the PolicyEngine API is stale the moment the library ships a release. The hardcoded examples in `SYSTEM_PROMPT` drift the same way and need a human edit to sync. Together these two changes eliminate both drift sources: the version-sensitive facts move to a file regenerated from the live library, and the prompt keeps only the behavioral rules. Prompt caching keeps the per-turn cost of the ~30k-token reference at ~10% of sending it fresh every time.
Measured impact

Local smoke test on Haiku 4.5:

- First message: `cache_creation_input_tokens ≈ 70k` (reference + prompt + tools cached).
- Follow-up messages: `cache_read_input_tokens ≈ 70k`, `cache_creation_input_tokens = 0`.
- The agent answers dataset questions from the `capabilities()` snapshot without the deleted `DATASETS` / `MODELLING SCOPE` blocks.
Refreshing the reference

Re-run `python scripts/build_reference.py` after any `policyengine-uk-compiled` bump. A follow-up PR should wire this into `pr-beta-deploy.yml` so it runs automatically on every library upgrade.

Test plan

- `docker-compose up` → send one chat message → the second message should show `cache_read_input_tokens > 0`
- Ask which datasets are available → the agent answers from a `capabilities()` call, not from the deleted `DATASETS` block
- The agent builds `Parameters.model_validate({...})` correctly against the current schema without the hardcoded example

🤖 Generated with Claude Code
docker-compose up→ send one chat message → second message should showcache_read_input_tokens > 0capabilities()call, not from the deletedDATASETSblockParameters.model_validate({...})correctly against the current schema without the hardcoded example🤖 Generated with Claude Code