docs: agent guide + manual verification checklist (calibrated on bank-chat-system) #35
Two new docs targeting real-world rollout on enterprise Java projects:
docs/AGENT-GUIDE.md — copy-paste-into-QWEN.md/CLAUDE.md block.
- Forced reasoning preamble (Q-class + Pick + Why on every call)
- Decision tree mapping user intent to first tool
- Full reference for all 22 MCP tools with required args, common
mistakes, minimal examples
- Ontology glossary (v9) with verbatim enum values for roles,
capabilities, route_framework, route_kind, client_kind, call_match
- Recovery playbook (7 common failure modes -> fix)
- Slash-style prompt aliases (/who-calls, /flow, /impact, /diff-risk, …)
Engineered for weak / mid models that otherwise pick the wrong tool
(e.g. semantic search for exact-call-graph questions, simple names
where FQNs are required).
docs/MANUAL-VERIFICATION-CHECKLIST.md — agent-driven manual test plan.
- 7 phases: index health, roles & capabilities, routes, call graph,
cross-service edges, semantic search, brownfield overrides
- Each item: ☐ checkbox · copy-paste verification prompt · expected
output (calibrated on tests/bank-chat-system, ontology v9) · fix
- Pre-flight build instructions and post-completion guidance
- Calibration source pinned: master @ d62b48c, 84 files / 92 types /
17 routes / 793 calls / 2 HTTP_CALLS / 5 ASYNC_CALLS
README.md — added 'Driving this MCP from an agent' callout linking
both docs.
Test baseline unchanged: 290 passed, 4 skipped.
Per real-world feedback: weak models burn calls on argument-shape
mistakes more than tool-selection mistakes. Two specific failures
observed in the wild:
1. Stringified arrays. Agent passes "[\"DTO\",\"ENTITY\"]" instead of
["DTO","ENTITY"] for exclude_roles (and similar list params),
tripping FastMCP/Pydantic validation.
2. Overloaded method needles. Agent passes Foo#bar() for an overloaded
method that has Foo#bar(String) too, and gets back only the no-arg
match — interpreting empty/short results as 'method not used'.
New section 'Argument shapes — what the parser actually wants' inserted
between the forced reasoning preamble and the decision tree, covering:
§A. JSON, not stringified JSON. Right vs wrong table for every list
and primitive param. One-line rule: if the schema says list[str],
send a JSON array; if it says str, send a JSON string.
§B. Method needles. Exact FQN format spec (simple type names only,
generics erased, no spaces, <init> for ctors, dot-separated nested
types). Verbatim examples copied from the bank-chat-system index.
Three needle shapes ranked by precision. Explicit overload-recovery
recipe (drop parens to list overloads, or use type FQN to fan out
via DECLARES, or recover the FQN via codebase_search).
§C. Path templates. The normalised servlet form vs the raw annotation
value. With a fallback recipe (list_routes path_prefix → copy path).
Also:
- Forced reasoning preamble now nudges 'sanity-check arguments before
issuing the call'.
- find_callers / find_callees tool reference: corrected the example
(was com.foo.Bar#baz(java.lang.String) — wrong, parens take simple
type names) and cross-linked to §B.
- Recovery playbook: 3 new rows (overload mismatch, stringified JSON,
raw path_template).
- Slash aliases: /who-calls and /calls-from now explicitly require
fqn-with-sig and document the simple-name fallback.
Test baseline unchanged: 290 passed, 4 skipped.
Follow-up: argument-shape failures (
| Param | ✅ Right | ❌ Wrong |
|---|---|---|
| `exclude_roles` | `["DTO","ENTITY","CONFIG","OTHER"]` | `"[\"DTO\",\"ENTITY\",\"CONFIG\",\"OTHER\"]"` |
| `edge_types` | `["EXTENDS","IMPLEMENTS"]` | `"EXTENDS,IMPLEMENTS"` |
| `confirm` / `limit` / `min_confidence` | `true` / `20` / `0.9` | `"true"` / `"20"` / `"0.9"` |
| string enums | `"CONTROLLER"` | `["CONTROLLER"]` (single value, not list) |
Plus the one-line rule: if the schema says list[str], send a JSON array; if it says str, send a JSON string.
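The rule above can be checked mechanically before a call goes out. The following is a minimal sketch, not part of the guide: `find_stringified_args` is a hypothetical helper that flags string values which decode to a JSON list, boolean, or number. It is only a heuristic: a parameter that legitimately takes a digit-only string would be flagged too, and comma-joined values such as `"EXTENDS,IMPLEMENTS"` are not detected.

```python
import json


def find_stringified_args(args: dict) -> dict:
    """Return {param: decoded_value} for string values that look like
    JSON-encoded lists, booleans, or numbers (hypothetical helper)."""
    suspicious = {}
    for name, value in args.items():
        if not isinstance(value, str):
            continue  # already a real list / bool / number
        try:
            decoded = json.loads(value)
        except ValueError:
            continue  # a plain string such as "CONTROLLER" is fine as-is
        if isinstance(decoded, (list, bool, int, float)):
            suspicious[name] = decoded
    return suspicious


call_args = {
    "exclude_roles": "[\"DTO\",\"ENTITY\"]",  # wrong: stringified array
    "limit": "20",                            # wrong: stringified integer
    "role": "CONTROLLER",                     # right: a genuine string enum
    "confirm": True,                          # right: a genuine boolean
}
print(find_stringified_args(call_args))
```

Replacing each flagged value with its decoded form yields the shapes shown in the "Right" column.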
§B. Method needles — FQN + signature, with simple type names
Exact format spec verified against the live bank-chat-system index:
<package>.<Type>[.<NestedType>]#<methodName>(<SimpleType1>,<SimpleType2>,…)
Key rules surfaced explicitly:
- Simple type names only (`String`, not `java.lang.String`)
- Generics erased (`List<String>` → `List`)
- No spaces after commas
- `<init>` for constructors
- Nested types dot-separated under the outer type
Verbatim calibration examples lifted from the fixture's actual stored FQNs.
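As an illustration of the format rules, here is a small sketch that assembles a needle from a declaring type, a method name, and parameter types. `build_needle` and its helpers are hypothetical and not part of the MCP server; arrays, varargs, and nested types used as parameters are not covered, since the PR does not say how the index stores them.

```python
def erase_generics(type_name: str) -> str:
    """Drop every <...> group, nested ones included: Map<String,List<X>> -> Map."""
    depth, kept = 0, []
    for ch in type_name:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


def simple_name(type_name: str) -> str:
    """java.util.List<java.lang.String> -> List (generics erased, package dropped)."""
    return erase_generics(type_name).strip().rsplit(".", 1)[-1]


def build_needle(type_fqn: str, method: str, param_types: list) -> str:
    """Assemble <package>.<Type>#<method>(<SimpleType1>,<SimpleType2>,...)."""
    params = ",".join(simple_name(p) for p in param_types)  # no spaces after commas
    return f"{type_fqn}#{method}({params})"


print(build_needle("com.foo.Bar", "baz",
                   ["java.lang.String", "java.util.List<java.lang.String>"]))
print(build_needle("com.foo.Bar", "<init>", []))  # constructors use <init>
```

The first call prints `com.foo.Bar#baz(String,List)` and the second prints `com.foo.Bar#<init>()`.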
Overload-recovery recipe — three options when you don't know the exact signature:
- Drop the parens entirely → simple-name lookup matches all overloads → pick the right one and re-query with full FQN+sig
- Pass the type FQN → fans out to every method via `DECLARES`
- `codebase_search` with `auto_hybrid=true` → recover the exact stored FQN
Plus a "How to find the FQN you need" subsection (how to extract FQNs from codebase_search/list_by_role/find_implementors results) and a warning to never pass phantom-row needles like ?HashMap<>#<init>(0).
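The first two recovery options can be derived from a needle by plain string manipulation. The sketch below is illustrative only: `needle_fallbacks` is a hypothetical helper, and it assumes that "dropping the parens" leaves the `Type#method` prefix in place, which this PR does not spell out.

```python
def needle_fallbacks(needle: str) -> list[str]:
    """Return needle shapes from most to least precise:
    FQN+signature, the same needle without parens, the declaring type FQN."""
    type_fqn, sep, rest = needle.partition("#")
    if not sep:
        return [needle]  # already a bare type FQN, nothing to relax
    method = rest.split("(", 1)[0]
    shapes = [needle, f"{type_fqn}#{method}", type_fqn]
    return list(dict.fromkeys(shapes))  # de-duplicate, keep order


for shape in needle_fallbacks("com.foo.Bar#baz(String)"):
    print(shape)
```

An agent would try each shape in order and stop at the first one that returns the overload it is looking for, then re-query with the full FQN+sig.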
§C. Path templates — the normalised servlet form
Right-vs-wrong table for get_route_by_path / find_route_callers path_template arg, including the concatenation rule for class-level @RequestMapping + method-level @GetMapping.
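The concatenation rule can be pictured with a short sketch. `join_path_template` is a hypothetical helper and the paths are invented for illustration; the index's actual normalisation may go further than slash handling, so copying the path from `list_routes` remains the safer fallback.

```python
def join_path_template(class_prefix: str, method_path: str) -> str:
    """Join a class-level @RequestMapping prefix with a method-level mapping,
    collapsing duplicate and trailing slashes (assumed normalisation)."""
    parts = [p.strip("/") for p in (class_prefix, method_path)]
    return "/" + "/".join(p for p in parts if p)


print(join_path_template("/api/accounts", "/{id}"))   # /api/accounts/{id}
print(join_path_template("/api/", "accounts/"))       # /api/accounts
```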
Other touch-ups
- Forced reasoning preamble now nudges "sanity-check arguments before issuing the call" — most weak-model failures here are wrong-argument-shape, not wrong-tool-choice.
- `find_callers` / `find_callees` tool reference: fixed an incorrect example (was `com.foo.Bar#baz(java.lang.String)`, which is wrong because the parens take simple types only) and cross-linked to §B.
- Recovery playbook: 3 new rows mapping these failure symptoms straight to fixes.
- Slash aliases: `/who-calls` and `/calls-from` now explicitly say fqn-with-sig and document the simple-name fallback inline.
Diff
docs/AGENT-GUIDE.md | 134 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 127 insertions(+), 7 deletions(-)
Test baseline unchanged: 290 passed, 4 skipped.
Calibration source for the FQN examples: tests/bank-chat-system index built on master @ d62b48c.
Summary
Two new docs to make this MCP usable on real enterprise Java projects, especially with weaker LLMs (Qwen Code, Sonnet-mini, etc.) that previously got confused selecting and invoking tools — leveraging the work done in PRs #25–#33.
`docs/AGENT-GUIDE.md` (387 lines)
Copy-paste-into-`QWEN.md`/`CLAUDE.md` block (delimited with `<!-- BEGIN/END user-rag MCP guide -->` markers) containing:
- Forced reasoning preamble: a `Q-class:` + `Pick:` + `Why:` prefix on every call. Cheap scaffolding that dramatically improves tool selection on mid-sized models.
- Decision tree mapping user intent to the first tool.
- Full reference for all 22 MCP tools with required args, common mistakes, and minimal examples.
- Ontology glossary (v9) with verbatim enum values for roles and capabilities plus `VALID_ROUTE_FRAMEWORKS`, `VALID_ROUTE_KINDS`, `VALID_CLIENT_KINDS`, `VALID_CALL_MATCHES`, sourced from `java_ontology.py`.
- Recovery playbook mapping common failure modes to fixes.
- Slash-style prompt aliases: `/who-calls`, `/calls-from`, `/route`, `/handler`, `/who-hits`, `/why-no-route`, `/role-of`, `/impact`, `/cross-service`, `/flow`, `/diff-risk`, `/health`. Just prompt templates, not real commands, but they reliably steer weak models.

`docs/MANUAL-VERIFICATION-CHECKLIST.md` (636 lines)
Agent-driven manual test plan to run after indexing your real project. 7 phases, ~24 items total:
- Index health: `ontology_version=9`, parse error rate, symbol counts, LanceDB tables
- Roles & capabilities
- Routes
- Call graph: `find_callers` matches IDE "Find Usages", end-to-end chain via `find_callees`, phantom rate
- Cross-service edges: `HTTP_CALLS` resolution, `ASYNC_CALLS` producer↔listener, `cross_service_resolution` flag
- Semantic search: `auto_hybrid` for identifiers
- Brownfield overrides: `@CodebaseRole` flips role, `@CodebaseRoute` registers route, `@CodebaseClient` creates outbound edge

Each item has:
- ☐ checkbox
- Copy-paste verification prompt
- Expected output (calibrated on `tests/bank-chat-system`, ontology v9)
- Fix

Plus pre-flight (build the graph), per-phase "red flags" sections, and post-completion guidance.

`README.md` (+11 lines)
Added "Driving this MCP from an agent" callout right under the existing "Tuning for your codebase" callout, linking both new docs.

Calibration source
`tests/bank-chat-system` indexed with `master @ d62b48c` (post PR-H1, ontology version 9): 84 files / 92 types / 17 routes / 793 calls / 2 `HTTP_CALLS` / 5 `ASYNC_CALLS`.

Tests
`290 passed, 4 skipped in 52.56s`: no code change, baseline unchanged.

Future work (not in this PR)
- Split `AGENT-GUIDE.md` into per-skill files (`skills/codebase-search.md`, `skills/route-investigation.md`, …) that an orchestrator pulls in on demand. The monolith is the right starting point: it is easier to copy-paste into a single `QWEN.md`.
- Generate the tool reference from `server.py` introspection (the `name=`, `description=`, and `Field()` blocks) so it never drifts from the actual tool surface.