Conversation
Manual test results (tmux, dev mode against real API)All three test-plan items verified end-to-end. Used Test A — deferred tool discoveryPrompt: "请用 ask_user_question 工具问我喜欢哪种颜色(红/蓝/绿)" Model trajectory:
Confirms: the model sees Test B —
|
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
…hemas Large MCP deployments push the function-declaration list past 15K tokens per request. This change lets tools opt out of the initial declaration list via `shouldDefer`, and adds a new `ToolSearch` tool the model calls to fetch schemas on demand — either by exact name (`select:Name1,Name2`) or keyword search with name/description/searchHint scoring. - `DeclarativeTool` gains `shouldDefer`, `alwaysLoad`, `searchHint` opts. - MCP tools default to `shouldDefer=true`; lsp, cron_*, ask_user_question, and exit_plan_mode are flagged too. - `ToolRegistry.getFunctionDeclarations()` filters deferred tools by default; `revealDeferredTool()` re-includes them after ToolSearch loads their schemas. - `getCoreSystemPrompt` appends a "Deferred Tools" list (names + first line of description) so the model knows what's reachable. - Subagent wildcard inheritance keeps including deferred tools so existing `tools: ['*']` configs still see MCP schemas. - Resume-session support: `startChat` scans history for prior calls to deferred tools and re-reveals them so the API doesn't reject follow-up calls. `resetChat` clears the revealed set for a clean slate. - Skipped when ToolSearch is filtered out by the permission manager.
Registers a synthetic `structured_output` tool whose parameter schema IS the user-supplied JSON Schema. In headless mode (`qwen -p`), the first successful call terminates the session and exposes the validated payload via the result message's `structured_result` field. Invalid schemas are rejected at CLI parse time via a new strict Ajv compile helper so they can't silently no-op at runtime.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR introduces on-demand loading for deferred tool schemas via a new ToolSearch tool to reduce prompt/tool-declaration token usage, and adds a synthetic structured_output tool for --json-schema structured results in non-interactive CLI mode.
Changes:
- Add tool deferral controls (
shouldDefer,alwaysLoad,searchHint) and session “reveal” tracking to selectively include tool schemas in function declarations. - Implement
ToolSearchplus prompt updates to advertise deferred tools only when discovery is available. - Add
--json-schemasupport: strict schema compilation, syntheticstructured_outputtool registration, and structured result emission for headless runs.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/utils/schemaValidator.ts | Adds compileStrict() to fail fast on invalid JSON Schemas. |
| packages/core/src/utils/schemaValidator.test.ts | Adds test cases for compileStrict(). |
| packages/core/src/tools/tools.ts | Extends DeclarativeTool with deferral/search metadata flags. |
| packages/core/src/tools/tool-search.ts | Adds ToolSearch tool, scoring/tokenization helpers. |
| packages/core/src/tools/tool-search.test.ts | Comprehensive tests for ToolSearch modes/scoring/reveal behavior. |
| packages/core/src/tools/tool-registry.ts | Filters deferred tools by default; adds reveal tracking + deferred summary. |
| packages/core/src/tools/tool-registry.test.ts | Tests deferred filtering, includeDeferred, alwaysLoad, reveal, summary. |
| packages/core/src/tools/tool-names.ts | Adds tool_search and structured_output tool names/display names. |
| packages/core/src/tools/syntheticOutput.ts | Adds synthetic structured_output tool for JSON-schema final output. |
| packages/core/src/tools/syntheticOutput.test.ts | Tests schema passthrough + validation behavior for synthetic tool. |
| packages/core/src/tools/mcp-tool.ts | Marks MCP tools as deferred and adds server-name search hint. |
| packages/core/src/tools/lsp.ts | Marks LSP tool as deferred and adds search hints. |
| packages/core/src/tools/exitPlanMode.ts | Marks ExitPlanMode tool as deferred and adds search hints. |
| packages/core/src/tools/cron-list.ts | Marks cron-list tool as deferred and adds search hints. |
| packages/core/src/tools/cron-delete.ts | Marks cron-delete tool as deferred and adds search hints. |
| packages/core/src/tools/cron-create.ts | Marks cron-create tool as deferred and adds search hints. |
| packages/core/src/tools/askUserQuestion.ts | Marks ask-user-question tool as deferred and adds search hints. |
| packages/core/src/test-utils/mock-tool.ts | Extends MockTool to support new deferral/search flags in tests. |
| packages/core/src/index.ts | Exports ToolSearch + SyntheticOutput tool types. |
| packages/core/src/core/prompts.ts | Appends a “Deferred Tools” section to system prompts when applicable. |
| packages/core/src/core/client.ts | Warms tools before prompt build, re-reveals deferred tools on resume, clears on /clear. |
| packages/core/src/core/client.test.ts | Updates test stubs/expectations for new prompt argument + registry methods. |
| packages/core/src/config/config.ts | Adds jsonSchema to config + registers ToolSearch and synthetic output tool. |
| packages/core/src/agents/runtime/agent-core.ts | Ensures wildcard/all-tool inheritance includes deferred tools. |
| packages/cli/src/ui/commands/contextCommand.ts | Ensures token estimates include deferred tools for consistency. |
| packages/cli/src/nonInteractiveCli.ts | Implements structured-output termination behavior in headless mode. |
| packages/cli/src/nonInteractiveCli.test.ts | Updates config mock to include getJsonSchema(). |
| packages/cli/src/nonInteractive/io/BaseJsonOutputAdapter.ts | Adds structured_result and forces result to JSON string when structured. |
| packages/cli/src/nonInteractive/io/BaseJsonOutputAdapter.test.ts | Tests structured_result emission and back-compat omission when undefined. |
| packages/cli/src/config/jsonSchemaArg.test.ts | New tests for parsing/reading/validating --json-schema. |
| packages/cli/src/config/config.ts | Adds --json-schema arg parsing + strict Ajv compile-time validation. |
Comments suppressed due to low confidence (1)
packages/core/src/utils/schemaValidator.ts:1
compileStrict()claims the schema must be a JSON object, but the current guard allows arrays (typeof [] === 'object'). This yields less clear errors (Ajv compile message) instead of the intended descriptive message; update the guard to explicitly rejectArray.isArray(schema).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Resolves 2 #3589 review threads: - `max_results` schema: declared as unconstrained `number` but the implementation clamps to the integer range [1, 20]. Updated to `type: 'integer'` with `minimum: 1`, `maximum: HARD_MAX_RESULTS`, `default: DEFAULT_MAX_RESULTS` so the model gets accurate contract guidance and out-of-range values fail validation early instead of silently being clamped after a wasted call. - `execute()` signature now takes `_signal: AbortSignal` to match the base `ToolInvocation.execute` contract. The signal is unused today (the search is sync), but matching the shared signature avoids accidental divergence and makes future cancellation wiring trivial. Test: existing `enforces max_results cap` split into: - schema-rejection (`max_results: 100` → throws at build time) - clamp-on-in-range (`max_results: 20` capped on the candidate side) 21/21 tool-search.test.ts pass; tsc + ESLint clean.
wenshao
left a comment
There was a problem hiding this comment.
测试覆盖缺口(无法映射到具体行)
以下关键路径缺少测试,建议补充:
packages/cli/src/nonInteractiveCli.ts—structured_output成功路径(模型调用 structured_output → emitResult 带 structuredResult → 提前返回)无测试覆盖。packages/cli/src/nonInteractiveCli.ts—--json-schema纯文本错误路径(模型产生纯文本而非调用 structured_output →process.exitCode=1)无测试覆盖。packages/core/src/core/client.ts—startChat中延迟工具恢复扫描(遍历历史重新暴露已用延迟工具)无测试覆盖。packages/core/src/core/client.ts—deferredTools传给系统提示的条件逻辑无测试覆盖。
…tools Closes 3 #3589 review threads: - Critical: `setTools()` failure during reveal was silently swallowed via `debugLogger.warn` (off in production). Schemas appeared in `llmContent` so the model thought the tools were callable, but the chat's declaration list never updated — the next call surfaced as an "unknown tool" API error, leaving the session in an unrecoverable degraded state. Now returns a proper `ToolResult.error` with the concrete failure reason and instructions to retry; schemas are withheld from `llmContent` so the model doesn't act on a non-ready tool. - Critical: `collectCandidates` returned every deferred tool that matched `shouldDefer && !alwaysLoad` regardless of whether ToolSearch had already revealed it earlier in the session. Already-revealed tools are in the model's declaration list, so re-surfacing them in later keyword searches wasted tokens and risked the model retrying a load it already had. Filter now also skips tools where `registry.isDeferredToolRevealed(name) === true`. `select:<name>` mode is unaffected (the model may legitimately want to re-inspect the schema of a loaded tool). - Suggestion: `--json-schema` plain-text terminal path set `process.exitCode = 1` and emitted `isError: true` to the JSON adapter, but TEXT-mode users only saw a silent exit-code-1 with no visible context (`emitResult` is a no-op for the TEXT-mode error case). Echo the full `'Model produced plain text instead of calling the structured_output tool as required by --json-schema.'` line to stderr so headless runs are debuggable without scraping `--output-format json`. Tests: 2 new in `tool-search.test.ts`: - `keyword search excludes already-revealed deferred tools`: pins the dedupe behavior across two consecutive searches. - `returns an error result when setTools() throws`: pins that failures don't expose schemas as "ready" and the agent gets the underlying message in `error.message`. 23/23 tool-search.test.ts pass; tsc + ESLint clean. DEFERRED to follow-up PRs (replied on threads): - Critical: structured_output + side-effect-tool race in same turn — needs a pre-scan + synthesized "skipped" tool_results, design overlaps with #3598 PR-2's existing skippedOutput pattern. - Suggestion: `+` prefix parsing edge cases (C++, `+ slack`). - Suggestion: `instanceof DiscoveredMCPTool` hard couple — needs a type tag on AnyDeclarativeTool, broader API surface change. - Suggestion: SyntheticOutputTool registered in interactive mode. - Suggestion: resume scan O(history × parts) early-exit. - Suggestion: deferredToolsSection cap.
wenshao
left a comment
There was a problem hiding this comment.
[Critical] --json-schema cannot work in bare mode because createToolRegistry() returns from the bare-mode branch before registering the synthetic structured_output tool. The CLI accepts the flag in headless mode, but with bare mode the model never receives the only tool that can satisfy the structured-output contract. Register structured_output before the bare-mode early return, or reject --json-schema with bare mode during argument/config validation.
— gpt-5.5 via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
Reviewed the full diff (+1572/-13, 31 files). The deferred-tool mechanism and ToolSearch are well-designed — scoring, select/keyword modes, reveal lifecycle, resume support, and the structured_output flow all check out. No Critical issues found. A few suggestions below.
The two non-interactive exit paths in `main()` hardcoded `process.exit(0)` after `runNonInteractive` / `runNonInteractiveStreamJson` returned. This silently overwrote any `process.exitCode = 1` set inside the run — most visibly the `--json-schema` plain-text contract: the JSON adapter emits `isError: true` and stderr gets the explanation, but the shell saw exit code 0 and assumed success. Replace the hardcoded 0 with `process.exit(process.exitCode ?? 0)` on both paths so non-zero exits propagate. The success case is unchanged (exitCode is undefined → exits 0).
Closes review-flagged coverage gaps for #3589: `json-schema.test.ts` (6 cases) covers the headless structured-output contract end-to-end: - structured_result emits when the model fills the schema (success path) - @path/to/schema.json file-load works - parse-time validation rejects invalid JSON, invalid JSON Schema, and missing files (no LLM, fast) - plain-text path: when structured_output is not callable (`--exclude-tools structured_output`), the run exits 1 with `is_error: true` and the contract error message — locks in the exit-code fix from the prior commit. `tool-search.test.ts` (3 cases) covers the deferred-tool flow: - select:<name> reveals a tool and the model can invoke it in the same turn (asserts call order so a missed reveal would surface as an unknown-tool API error instead of a silent pass) - keyword query (no select: prefix) hits the tool_search tool - feature-flag-off: with experimental.cron disabled, cron tools are never registered and never appear in tool calls LLM-dependent tests use the cron tools as a deterministic deferred target (gated by experimental.cron, no MCP server required).
Closes 3 #3589 review threads: - Schemas like `{"type":"string"}` and `{"type":"array"}` compiled fine (they're valid JSON Schemas in isolation), but the `--json-schema` value becomes the synthetic structured_output tool's parameter schema and tool-call arguments are object-shaped. Reject any non-undefined top-level type that is not "object" so the user sees the contract violation at parse time, not as an unrecoverable runtime mismatch. - `SchemaValidator.compileStrict` accepted arrays since `typeof [] === 'object'` — Ajv would later emit a confusing error. Add an explicit `Array.isArray` guard so the contract stated by the function name is honored at the boundary. - `compileStrict` shared the project-wide Ajv instances configured with `strictSchema: false` (intentionally lenient so MCP servers can ship custom keywords without breaking runtime validation). That leniency is wrong for the `--json-schema` surface — typos like `propertees` were silently ignored. Compile inside a dedicated `strict: true` Ajv so user-supplied schemas surface mistakes immediately. Tests: - jsonSchemaArg: rejects non-object top-level type ("string"/"array"). - schemaValidator.compileStrict: rejects arrays; flags unknown keywords (typos) under strict mode.
Closes 1 #3589 review thread. `loadAndReturnSchemas` revealed each requested tool BEFORE calling `setTools()` because `getFunctionDeclarations()` filters by the revealedDeferred set — the reveal has to be in place when setTools() rebuilds the chat's declaration list. But if setTools() throws (e.g. chat not yet initialised), the registry was left holding orphaned reveals: the tool was marked "revealed" while the API never received its schema. Subsequent keyword searches would then exclude that tool from candidates (per `collectCandidates`'s isDeferredToolRevealed filter), making it unreachable until `/clear`. Track the names this call NEWLY revealed (skipping tools that were already revealed by an earlier ToolSearch in the same session) and unreveal them on setTools() failure. Added `unrevealDeferredTool` to the registry as the one-tool inverse of `revealDeferredTool`; `clearRevealedDeferredTools` is unchanged and still wipes the whole set on `/clear`. Test: extends the existing `setTools() throws` test to also assert that (a) the failed call's reveal is rolled back and (b) a tool revealed by an earlier call stays revealed (not whole-set wiped).
Closes one of the test-coverage gaps in #3589 reviews (gpt-5.5 review S8). Adds two deterministic L1 unit tests in nonInteractiveCli.test.ts that mock the LLM at sendMessageStream — no model API hit, no flake, ~10ms total. 1. structured_output success path: model fires the synthetic tool once, runtime sets structuredSubmission, aborts background tasks, and emitResult fires exactly once with `structuredResult` matching the model's args. No follow-up turn is issued (single-shot contract). 2. plain-text error path under --json-schema: model emits text only; runtime sets process.exitCode=1, writes the contract-violation line to stderr, and emits an isError result with the canonical "Model produced plain text" message. Both tests inject a stub adapter via runNonInteractive's `options.adapter` hook, so they assert against direct emitResult calls instead of parsing JSON stdout. process.exitCode is snapshot/restored to keep the test hermetic. The L2 integration tests in integration-tests/cli/json-schema.test.ts remain as smoke coverage against a real model.
Critical correctness: - `client.ts`: when ToolSearch is filtered out (allow/deny rules, `--exclude-tools tool_search`), eagerly reveal every deferred tool so they all land in the function declaration list. Without this the user sees those tools just disappear silently — the deferred- tool discovery surface is gone, but the tools are still hidden by the registry filter, so they're effectively invisible AND uncallable. Token-saving rationale of deferral was predicated on the discovery surface being available; if not, eager reveal preserves the invariant "all registered tools are callable". - `config.ts`: `--json-schema` now requires the root schema to declare `type: "object"` (or array containing it). Tool-call args are always validated as objects, so root-only `anyOf` / `oneOf` / `allOf` / `not` would create schemas the model can't consistently satisfy — surface as a startup error instead of mid-session "Model produced plain text" failures users can't easily diagnose. - `nonInteractiveCli.ts`: structured_output + sibling tools in the same turn no longer leaks side effects. Pre-scan reorders structured_output to the front of `toolCallRequests`; once it succeeds, sibling tools (write_file, shell, …) get a synthesized `Skipped: this turn's structured_output contract took precedence as the terminal output. Re-issue this call in a separate turn if needed.` tool_result instead of running. If structured_output fails (e.g. validation), siblings still execute via the normal loop body, same as a turn that didn't issue structured_output at all. Reveal-set hygiene: - `tool-registry.ts`: `removeMcpToolsByServer`, `removeDiscoveredTools`, and `discoverToolsForServer` (the re-discovery path) now also drop the affected tool names from `revealedDeferred`. Without this, an MCP server disconnect / reconnect that re-registers a tool of the same name inherits `revealed: true` from before the disconnect — the schema lands in `getFunctionDeclarations` before the model has any way to know the tool exists this session. Defensive: - `config.ts`: `resolveJsonSchemaArg` caps `@path/to/schema.json` reads at 4 MiB. Real schemas are well under (decompose with `$ref` if needed); the cap catches accidental wrong-path arguments (`@./node_modules/.cache/*.json`) before they OOM `fs.readFileSync` + `JSON.parse`. Tests: - New regression in `tool-registry.test.ts` for the `removeMcpToolsByServer` revealedDeferred prune. - 23/23 tool-search.test.ts, 23/23 tool-registry.test.ts, 226/229 nonInteractiveCli.test.ts (3 skipped pre-existing), 195/197 config.test.ts (2 skipped pre-existing) — all pass. Deferred to follow-up (replied + tracked): - 10-positional-param API on DeclarativeTool (refactor breadth). - `instanceof DiscoveredMCPTool` (needs `toolType` tag). - `structured_result` intersection vs canonical interface. - Resume-scan error/permission-denied filter + early-exit. - `getAllTools()` sort discarded (perf, ~negligible). - DeferredTools section cap. - `setTools` → `warmAll` undercutting deferral (theoretical; factories are nearly empty in practice today).
Closes 2 fresh #3589 review threads: - `tool-search.ts`: `select:` mode now strips a single layer of matching `"…"` / `'…'` from each tool name before lookup. Models often paste names back verbatim from the deferred-tools system prompt section, which renders them as JSON string literals (`"cron_list"`); without quote-strip the lookup searches for the literal-with-quotes name and misses every time. - `nonInteractiveCli.ts`: moved the `import { writeStderrLine } …` to sit with the other top-of-file imports (eslint-plugin-import's `import/first` rule) and hoisted `createDebugLogger(...)` below the imports — was wedged between them. Test: new `select: tolerates JSON-quoted tool names` regression in tool-search.test.ts pins both `"…"` and `'…'` shapes; 29/29 pass.
wenshao
left a comment
There was a problem hiding this comment.
以下发现无法映射到 diff 行(目标行不在 PR 变更范围内):
Critical
config.test.ts:1693,1698,1711,1725 [typecheck] — TypeScript 类型错误:4 处 possibly 'undefined' 错误导致 tsc --noEmit 失败。在访问 mcpServers 属性前添加可选链或非空断言。
Suggestion
nonInteractiveCli.ts:421-426 [review] — --json-schema 契约违例在 max-turns 耗尽时不可见。handleMaxTurnsExceededError 调用 exitAfterCleanup()(Promise<never>),导致 898 行的 --json-schema 契约检查不可达。用户只能看到通用的 "Reached max session turns"。建议在调用 handleMaxTurnsExceededError 前为 --json-schema 添加专项检查。
client.test.ts:509-543 [review] — resetChat() 测试未断言 clearRevealedDeferredTools() 被调用。mock 已将其设为 vi.fn(),但测试仅验证了 chat 重置,遗漏此关键断言。添加 expect(mockToolRegistry.clearRevealedDeferredTools).toHaveBeenCalled()。
prompts.test.ts [review] — getCoreSystemPrompt/getCustomSystemPrompt 新增的 deferredTools 参数插入位置未测试。buildDeferredToolsSection 测试充分,但调用方的拼接顺序(deferred section 应在 user memory 之前)无回归保护。
— DeepSeek/deepseek-v4-pro via Qwen Code /review
Closes 2 #3589 review threads (deepseek-v4-pro): - ToolSearch.loadAndReturnSchemas: an `ensureTool()` throw mid-batch used to propagate out of the for loop with previous tools already revealed but never setTools()-synced — same orphaned-reveal failure the setTools() catch block guards against. Wrap ensureTool in try/catch so a failure surfaces as a `missing` entry and the rest of the batch is processed normally; the throw is logged at debug level for diagnostics. - nonInteractiveCli `--json-schema` plain-text error: the static message gave operators no diagnostic context. Now appends turn count + a JSON-quoted preview of the model's actual plain text (capped at 200 chars across all turns). Operators debugging a headless run no longer need to scrape `--output-format json` to understand why the contract failed; the stderr line and the JSON result both carry the same enriched body. Tests: - ensureTool throws on bravo mid-batch; alpha + charlie still load and reveal, bravo reported missing, registry stays consistent (bravo NOT revealed). - existing plain-text error test now also asserts the turn-count suffix and the model's actual content ("plain answer") shows up in both emitResult and stderr. Not done: deepseek's MCP `__` segment-boundary scoring suggestion turned out to be a non-issue on inspection — `endsWith('_'+term)` already matches every case `endsWith('__'+term)` would catch (the latter is a subset of the former since `__term` always ends with `_term` too). Reverted the proposed change after the test exposed that the boundary is already covered. Filing a thread reply.
Closes 1 #3589 review thread (deepseek-v4-pro): the existing client test mocked `getDeferredToolSummary: () => []` and `getTool(TOOL_SEARCH): () => null`, which short-circuited every deferred-tool code path in `startChat()` — ~50 lines of logic (resume re-reveal, no-ToolSearch eager-reveal, already-revealed filter) were unreachable from tests. Add `isDeferredToolRevealed` to the base registry mock so default tests don't crash, then add a `describe('startChat — deferred tools')` block with three cases: 1. Resume scan: history with a `functionCall` to a deferred tool re-reveals exactly that tool; siblings stay deferred. Pins the resume-rejected-tool guard. 2. ToolSearch unavailable: every deferred tool is revealed eagerly so the model can still reach them via the regular declaration list. Pins the silent-disappearance fix. 3. ToolSearch available + no history match: nothing is revealed (deferral is preserved). Pins the negative case so future refactors can't regress to "always reveal everything".
…bstring (6) #3589 review thread suggested adding an explicit `isMcp && nameLower.endsWith('__' + term)` arm to the MCP scoring path on the assumption that the existing `endsWith('_' + term)` fails to match `mcp__server__toolname` patterns. Verified the premise is incorrect: `endsWith('_x')` returns true for strings ending in `__x` because the last 2 chars (`_x`) are present. JS verification: `'mcp__slack__send_message'.endsWith('_send_message')` → true; same for `'_issue'` on `'mcp__github__create_issue'` etc. So the suggested code change would have been a redundant no-op (adding an OR-arm that fires only when the existing arm already matches). Instead, lock the existing behavior in with a regression test that asserts MCP tools get the exact-suffix score (12) on both the trailing tokenized toolname and a single tail token — so a future refactor to a tighter word-boundary regex can't silently downgrade MCP scoring without the test catching it. 30/30 tool-search.test.ts pass.
wenshao
left a comment
There was a problem hiding this comment.
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
The headline ToolSearch feature is in solid shape. Across the review waves it's picked up real hardening — max_results schema cap, setTools() rollback on failure, MCP description JSON-quoting plus the < escape, eager-reveal when the discovery surface is filtered out, ensureTool isolation, and the new startChat deferred-tool tests. What's left on the ToolSearch side at HEAD is genuinely cosmetic.
The residual gaps cluster entirely on the --json-schema side, and they have a structural reason: the synthetic structured_output tool makes a strong contractual claim ("the only way to deliver the final result, terminates the session"), and that contract interacts with every orthogonal flag in the runtime — cron drains, Monitor, --bare, Agent subagents. Each interaction is a separate design question, not a typo, and answering them all in this PR's review thread is becoming the bottleneck for the ToolSearch work that's already done.
Would you be open to splitting this into two PRs?
- PR A — ToolSearch. Everything around
tool_search, deferred filtering, theshouldDefer/alwaysLoad/searchHintframework, the prompt section,startChatre-reveal and eager-reveal, and the per-tool defer markers. Lands now with a tight follow-up review. - PR B —
--json-schema. The synthetic tool, the schema-arg parser, the strict-validator scoping, and the headless terminal handling — on top of PR A's framework. Givesstructured_output's contract its own focused review and lets the four integration questions below be designed in isolation.
The alwaysLoad flag belongs with PR A as part of the deferred framework, even without a consumer in that PR; PR B then just uses it. The two branches separate cleanly.
The four open --json-schema integration gaps that would migrate to PR B:
1. Drain-turn structured submissions are silently dropped (severity: medium · confidence: very high)
In a qwen -p "..." --json-schema '<schema>' run that also has a scheduled cron, the main turn can produce text and the cron's drain turn can later deliver the structured_output call. The drain loop runs that call as an ordinary tool, the runtime never flips its terminal sentinel, and the post-drain branch then reports "Model produced plain text" with exit code 1. The model fulfilled the contract but the user sees a failure and the structured payload is discarded.
2. Structured-output exit silently drops Monitor notifications (severity: medium · confidence: very high)
When the model invokes a Monitor tool earlier in the run and later submits via structured_output, the structured-output exit path skips one-shot-monitor finalization and the queued-notification flush that the success and cancel paths run. The trailing finally block then aborts the monitor registry without notify, so SDK consumers see task_started for the monitor but never receive the paired task_notification. The dropped events are silent, not just delayed.
3. --json-schema is unfulfillable in --bare mode (severity: medium · confidence: very high)
The bare-mode path in the tool-registry builder returns early after registering only read_file, edit, and shell — before the --json-schema synthetic-tool registration runs. Result: qwen -p "..." --json-schema '<schema>' --bare never registers structured_output, the model has no way to satisfy the contract, plain-text fallback fires, and the run exits with code 1.
4. Subagents can complete structured_output instead of the parent (severity: medium · confidence: high)
The synthetic tool is alwaysLoad: true and is not in the subagent exclusion set, so subagents see it during --json-schema sessions. Its description tells the model "this is the ONLY way to deliver the final result" — a subagent following that contract calls it, but only the top-level non-interactive loop treats the call as terminal. The subagent's scheduler runs it as a normal tool, consumes the "session will end" payload locally, and the parent agent never receives the structured result.
Verdict
COMMENT — proposing the split so ToolSearch can land independently. Happy to re-review the split PRs as soon as they're up.
Closes 2 #3589 review threads (glm-5.1): - nonInteractiveCli.test.ts: the existing batch test put structured_output at index 0, so the pre-scan reorder branch and the validation-failure fallback were both unreachable. The inline comment claimed "tracked as follow-up", but the pre-scan is now in shipped code (nonInteractiveCli.ts:509-535) since 9588231. Two new cases: 1. "reorders structured_output before side-effect tools so siblings never run": batch ordered as [write_file, structured_output] — pre-scan must hoist structured_output to position 0, then break-after-success keeps write_file from executing. Pins the irreversible-side-effect guard. 2. "lets siblings run when structured_output validation fails so the model can retry": batch ordered as [structured_output(bad args), write_file] — structured_output's executeToolCall fails, hasStructuredSubmission stays false, sibling runs normally, loop falls through to second turn (model gives up with plain text) and the plain-text terminal branch fires. Pins the fallback semantics. Also updates the existing test's stale comment to point at the new sibling case rather than claiming the pre-scan is still TODO. - client.test.ts: `resetChat()` now calls `clearRevealedDeferredTools()` (added back when /clear behavior was sorted out), but no test asserted it. A regression here would silently carry deferred-tool reveals across `/clear`, defeating the clean-slate expectation. New test pins the call.
…xt turn Closes 1 #3589 review thread (Copilot). The previous description said ToolSearch returns the matched tools' "complete JSONSchema definitions" and that "once a tool's schema appears in that result, it is callable exactly like any tool defined at the top of the prompt." Both phrasings could lead the model to assume the returned `<functions>` block itself made the tool invocable in the same turn. Reality: ToolSearch returns full function declarations (name + description + parameter schema), reveals them in the registry, and calls `setTools()` to update the active chat's declaration list. The schema becomes a real callable tool only on the NEXT model turn. Reword the description to make this two-step contract explicit so a model can't waste a turn trying to call a "callable schema" embedded in the same response. No test changes — none assert the description text verbatim and the new wording keeps the same query-form summary the keyword tests exercise.
…esized Closes 1 #3589 review thread (Copilot). The pre-scan comment claimed siblings receive a "synthesized 'skipped' tool_result" after structured_output succeeds. The implementation actually breaks out of the loop without emitting any tool_result for the skipped calls. The transcript is missing the function_response entries for them, but the session terminates via emitResult immediately so no follow-up API call ever sees the mismatch — the missing entries are harmless in the single-shot contract. Update the comment to describe what the code actually does. The existing tests already pin the contract (no executeToolCall for the skipped tool, no emitToolResult for its callId).
| const wasRevealed = registry.isDeferredToolRevealed(canonical); | ||
| registry.revealDeferredTool(canonical); | ||
| if (!wasRevealed) { | ||
| newlyRevealed.push(canonical); | ||
| } | ||
| loaded.push(tool); | ||
| } | ||
|
|
||
| // Re-sync the active chat's tool list so the revealed tools appear in the | ||
| // next API request. | ||
| let setToolsError: string | undefined; | ||
| if (loaded.length > 0) { | ||
| const geminiClient = this.config.getGeminiClient(); | ||
| if (!geminiClient) { | ||
| // Optional chaining (`?.setTools()`) used to silently no-op here, | ||
| // leaving the registry with reveals the API never received — | ||
| // exactly the inconsistency `setTools() throws` already guards | ||
| // against. Treat null client identically: rollback + surface an | ||
| // error so the caller can retry once init is complete. | ||
| setToolsError = 'GeminiClient not initialised'; | ||
| } else { | ||
| try { | ||
| await geminiClient.setTools(); | ||
| } catch (err) { |
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: self-PR. — gpt-5.5 via Qwen Code /review
| // Text-format users only see the exit code (1) without | ||
| // visible context — `emitResult` is a no-op in TEXT mode for | ||
| // the isError-true path. Echo to stderr so the failure mode | ||
| // is discoverable without scraping `--output-format json`. | ||
| writeStderrLine(`qwen --json-schema: ${errorMessage}`); | ||
| process.exitCode = 1; |
Summary
Two independent features sharing the same branch:
ToolSearchfor on-demand deferred tool schemas — shrinks thedefault tool-declaration list by hiding MCP tools and a few
low-frequency built-ins behind a discovery tool. A typical 39-tool
setup previously spent ~15K tokens per request on declarations.
--json-schemaheadless structured output — registers asynthetic
structured_outputtool whose parameter schema IS theuser-supplied JSON Schema. In
qwen -pmode the first valid callterminates the session and exposes the validated payload via the
result message's
structured_resultfield.Feature 1 — ToolSearch / deferred tools
Mechanism
DeclarativeToolbase class gains three optional flags:shouldDefer— hide from initial function-declaration listalwaysLoad— always include even when deferred is the defaultsearchHint— optional keywords for fuzzy-match scoringToolRegistry.getFunctionDeclarations()filters out deferred tools bydefault. New
{ includeDeferred: true }opts back in.revealDeferredTool(name)/unrevealDeferredTool(name)/isDeferredToolRevealed(name)/clearRevealedDeferredTools()track per-session "revealed" state.
getDeferredToolSummary()returns a compact{name, description}list for the system-prompt section.
getCoreSystemPrompt/getCustomSystemPromptaccept an optionaldeferredToolsargument and append a "Deferred Tools" section.ToolSearchsupports two query modes:select:Name1,Name2— exact lookup (case-insensitive, deduped,capped by
max_results)+must-wordrequired-term prefixmax_results: integer,[1, HARD_MAX_RESULTS=20], default 5; out-of-range values fail at
tool.build()time, in-range values are clampedinternally as a defense.
10/5/4/2onexact-name/substring/searchHint/description; MCP
12/6onexact-name/substring to bias toward surfacing MCP.
ensureTool, marks itrevealed, and calls
client.setTools()so the next API requestincludes the schema. If
setTools()throws, the call's reveals arerolled back (keeps the registry consistent with the chat's
declaration list) and a
ToolResult.erroris returned so the agentcan choose to retry — no silent debug-only swallow.
Default deferred tools
them, hence the scoring boost above)
lsp,cron_create,cron_list,cron_delete,ask_user_question,exit_plan_modeEdge cases covered
tools: ['*']) inheritance keepsincludeDeferred: trueso existing configs that relied on MCP via wildcard still work.resetChat(/clear) clears revealed state for a clean slate;compression path (
startChat(newHistory)) preserves it so cross-turnrevealed tools survive compaction.
startChatscans history for prior function calls todeferred tools and re-reveals them before
setTools()runs, so theAPI doesn't reject follow-up calls when resuming a transcript.
ToolSearchis filtered out by the permission manager, thesystem prompt omits the "Deferred Tools" section to avoid advertising
an unreachable tool.
(
collectCandidatesfilter) so the model doesn't waste a lookupcall on schemas it already has.
select:mode is unaffected so themodel can still re-inspect a loaded tool's schema.
contextCommandand agent wildcard inheritance use{ includeDeferred: true }so token accounting and inheritancesemantics stay consistent with pre-change behavior.
Feature 2 —
--json-schemaheadless structured outputCLI
Mechanism
resolveJsonSchemaArg(CLI parse time): reads inline JSON or@path,parses it, runs
SchemaValidator.compileStrict(a dedicated Ajvconfigured with
strictSchema: trueto catch unknown-keyword typoslike
propertees, plusallowUnionTypes: truefor spec-validtype: ["a","b"]unions, and explicit opt-outs ofstrictRequired/
strictTypes/validateFormatsso spec-valid shapes such as{type:'object', required:['x']}and customformatvalues aren'trejected as lint failures), and rejects any non-object root
(top-level
typemust be"object"or a type-array containing"object"; absenttypeis allowed for{}/ composition cases).Bad schemas fail at parse time so they cannot silently no-op at
runtime.
SyntheticOutputTool(registered only whenjsonSchemais set):alwaysLoad: true,shouldDefer: false. Its parameter schema IS theuser-supplied schema, so
BaseDeclarativeTool.validateToolParams(Ajv) gates the model's args before
execute()runs. Validationerrors flow back to the model, which can retry with corrected fields.
runNonInteractive: when the synthetic tool's first successful callis detected, the loop aborts background tasks and emits a result
message with
structured_result: <model-supplied payload>andresult: JSON.stringify(payload).structured_output, the runtime setsprocess.exitCode = 1, writesqwen --json-schema: Model produced plain text…to stderr, and emitsan error result. The exit code now actually propagates to the shell
(the headless
main()was hardcodingprocess.exit(0)regardless).Review-driven hardening
a17a357max_resultsschema declared as integer w/ min/max/default;execute()matches the AbortSignal contract1fa1d75setTools()failure surfaces as a tool error; keyword search excludes already-revealed toolse8712ebmain()honorsprocess.exitCodeinstead of hardcodingprocess.exit(0)38726567--json-schemarejects non-object roots and arrays;compileStrictswitched from the project-wide lenient Ajv to a dedicated instance withstrictSchema: trueso unknown-keyword typos surface (later refined in9c031da0— see below)9c031da0compileStrictswaps the broadstrict: true(which also enabledstrictRequired/strictTypes/validateFormats) for the targetedstrictSchema: trueplus explicit opt-outs, so spec-valid schemas like{type:'object', required:['x']}and custom-format fields aren't rejected as lint failures6153efbsetTools()failure rolls back this call's reveals so the registry stays consistent with the chat's declaration list210ec2bcompileStrictaccepts spec-valid type unions (["string","number"],["object","null"]) viaallowUnionTypes: true; CLI guard treats type-arrays containing"object"as object-allowed7f235c2select:mode now also honorsmax_resultssoselect:a,b,c,…cannot return unbounded schemasTests
L1 unit (deterministic, mocked LLM, ~ms each)
tool-search.test.ts(25 cases): tokenize, scoreTool weights, bothquery modes,
max_resultsschema rejection / cap (keyword ANDselect),
+must-wordfilter, reveal propagation, empty query,setTools()throws → error result + reveal rollback.tool-registry.test.ts(5 new cases): filter,includeDeferred,alwaysLoad, reveal, summary sorting +clearRevealedDeferredTools.schemaValidator.test.ts(38 cases including 6 new):compileStrictrejects null/undefined/string/array; flags unknown keywords (typos);
accepts type-union arrays.
jsonSchemaArg.test.ts(13 cases): inline +@path+ parse errors +array root rejection + non-object type rejection + type-array with
"object"accepted.syntheticOutput.test.ts,nonInteractiveCli.test.ts(2 new):registration; structured_output success path emits
structuredResultand stops in one turn; plain-text fallback sets
process.exitCode=1and writes the contract violation to stderr.
L2 integration (real CLI binary, real LLM, smoke)
integration-tests/cli/json-schema.test.ts(6 cases): inlineschema →
structured_result.answer;@pathschema; parse-timefailures (invalid JSON / invalid Schema / missing file);
--exclude-tools structured_outputforces the plain-text path andasserts
is_error: true+exitCode: 1.integration-tests/cli/tool-search.test.ts(3 cases):select:→invoke (asserts call order so a missed reveal would surface as an
unknown-tool API error); keyword search; feature-flag-off negative.
Test plan
tsc --noEmitonpackages/coreandpackages/clivitest runonpackages/coreandpackages/cliToolSearchis invoked whenthe model needs an MCP tool, and the schema appears in the next
request
/clearand verify deferred list is re-announced withoutstale revealed state
qwen --resumea session that used a deferred tool,verify the tool is callable without calling
ToolSearchagainDeferred to follow-up PRs (replied on threads)
(needs pre-scan + synthesized "skipped" tool_results — design
overlaps with feat(cli): add --json-schema for structured output in headless mode #3598)
+prefix parsing edge cases (C++,+ slack)instanceof DiscoveredMCPToolhard couple (replace with type tag)exclusivity but interactive mode doesn't honor it)
revealedDeferrednot pruned on MCP server disconnect--json-schema(anyOf/oneOf/notcombinators that forbid object root)
--json-schema @pathstructured_resultfield on canonicalCLIResultMessageSuccesstypeDeclarativeTool10-positional-param constructor → options objectre-revealing