fix(planner): handle tool-loop exhaustion + tool_use-only responses#201
Merged
Conversation
Post-merge manual smoke of #200 surfaced two bugs on the first tool-using turn: the LLM called filesystem_list, hit MAX_PLANNER_TOOL_TURNS=4 still mid-call, and the raw tool_use block leaked into the chat UI as [{'id': 'toolu_01...', 'input': {'__arg1': 'dashboard/src/lib'}, 'name': 'filesystem_list', 'type': 'tool_use'}] Backend logs showed: [PLANNER] Hit max tool turns (4) Unexpected LLM response content type: list Fixes: - Bump MAX_PLANNER_TOOL_TURNS default from 4 to 6. Four turns is tight for the Planner's from-scratch exploration (ls dashboard/src/lib, ls dashboard/src/lib/components, read package.json, produce final answer = easily 4+). Architect Phase 2 uses 4 but only runs AFTER Phase 1 already narrowed the work. - `_extract_text_from_content` now silently skips tool_use / tool_result / input_json_delta blocks and returns "" when a list contains only those (instead of stringifying the dict). Empty list also returns "". - `_invoke_with_optional_tools` detects tool-loop exhaustion (returned response has no usable text) and retries the unbound LLM on the original messages. Note: we can't replay the loop's mid-flight messages because the final tool_use has no paired tool_result — Anthropic would reject that sequence. The wrap-up loses the loop's learnings but produces a valid conversational response instead of the raw-dict dump the user saw. Tests: 6 new in TestExtractTextFromContent + TestPlannerReadOnlyTools — covers tool_use-only response handling, empty-list handling, mixed text + tool_use, and the exhaustion wrap-up path. 87/87 planner tests pass; 293/293 related tests pass; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Abernaughty
added a commit
that referenced
this pull request
Apr 17, 2026
… + rich logging Follow-up to #201. The wrap-up strategy there saved the user from seeing raw tool_use dicts, but threw away all the tool results the loop gathered — so the unbound fallback LLM fabricated the answer (claimed BottomPanel.jsx + React/Vite when the real project is Svelte + TypeScript) and even wrote synthetic `<tool_call>` XML tags into the prose. Three compounding fixes: 1. `_run_tool_loop` takes a new `return_messages=False` kwarg. When True, returns a 4-tuple (response, tokens, log, messages). The Architect / Developer / QA call sites stay on the existing 3-tuple; only the Planner opts in. 2. Planner wrap-up rewritten: on exhaustion, take the loop's accumulated messages, drop only the single trailing assistant message with unresolved tool_use blocks (Anthropic rejects unpaired sequences), then append an explicit HumanMessage forbidding tool-call simulation and demanding the LLM answer from what it already learned. The wrap-up LLM now has the real filesystem output, not a clean slate. 3. `MAX_PLANNER_TOOL_TURNS` default 6 → 50. Per user feedback: even a well-documented issue like #113 took 6+ calls, complex issues could easily need 10-20. The cap exists as a safety net against runaway loops, not as a cost shape. The env var remains for operator tuning. Rich diagnostic logging added at every decision point so "why did this misbehave?" is answerable from the backend log alone, without code changes or a session re-run: - Planner turn-start: session id, model, workspace, tool count, GitHub context size, user message preview - Tool loader: failures upgraded from debug to info (no workspace / missing config / init error each log distinctly) - `_run_tool_loop`: per-tool-call log with name, sanitized args, result length, wall-clock ms, success flag - Tool executor: pre-call log so hung calls are visible - Tool failures include the args that triggered them - Loop exhaustion summary: total turns, per-tool counts, failure count, final response block types - Wrap-up entry log: messages-before / messages-after / dropped count - Wrap-up LLM returning empty content also logs at warning Tests: - `test_wrap_up_threads_real_loop_messages_not_originals` — repro for the fabrication bug: assert the wrap-up LLM is called with messages containing the real tool result, the unresolved tool_use message is dropped, and the appended nudge contains "do not call" / "tool_call" / "invent". - New `TestIsUnresolvedTail` class — 6 unit tests for the helper. - `TestRunToolLoopReturnMessages` in test_tool_binding.py — regression guard: default returns 3-tuple, `return_messages=True` returns 4-tuple, existing Architect/Dev/QA call sites keep working. - Updated existing `test_invoke_with_optional_tools_*` mocks for the new 4-tuple return. 1135 tests pass (2 pre-existing unrelated Windows path-separator failures confirmed on main). Ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
Abernaughty
added a commit
that referenced
this pull request
Apr 18, 2026
… + rich logging (#202) Follow-up to #201. The wrap-up strategy there saved the user from seeing raw tool_use dicts, but threw away all the tool results the loop gathered — so the unbound fallback LLM fabricated the answer (claimed BottomPanel.jsx + React/Vite when the real project is Svelte + TypeScript) and even wrote synthetic `<tool_call>` XML tags into the prose. Three compounding fixes: 1. `_run_tool_loop` takes a new `return_messages=False` kwarg. When True, returns a 4-tuple (response, tokens, log, messages). The Architect / Developer / QA call sites stay on the existing 3-tuple; only the Planner opts in. 2. Planner wrap-up rewritten: on exhaustion, take the loop's accumulated messages, drop only the single trailing assistant message with unresolved tool_use blocks (Anthropic rejects unpaired sequences), then append an explicit HumanMessage forbidding tool-call simulation and demanding the LLM answer from what it already learned. The wrap-up LLM now has the real filesystem output, not a clean slate. 3. `MAX_PLANNER_TOOL_TURNS` default 6 → 50. Per user feedback: even a well-documented issue like #113 took 6+ calls, complex issues could easily need 10-20. The cap exists as a safety net against runaway loops, not as a cost shape. The env var remains for operator tuning. Rich diagnostic logging added at every decision point so "why did this misbehave?" is answerable from the backend log alone, without code changes or a session re-run: - Planner turn-start: session id, model, workspace, tool count, GitHub context size, user message preview - Tool loader: failures upgraded from debug to info (no workspace / missing config / init error each log distinctly) - `_run_tool_loop`: per-tool-call log with name, sanitized args, result length, wall-clock ms, success flag - Tool executor: pre-call log so hung calls are visible - Tool failures include the args that triggered them - Loop exhaustion summary: total turns, per-tool counts, failure count, final response block types - Wrap-up entry log: messages-before / messages-after / dropped count - Wrap-up LLM returning empty content also logs at warning Tests: - `test_wrap_up_threads_real_loop_messages_not_originals` — repro for the fabrication bug: assert the wrap-up LLM is called with messages containing the real tool result, the unresolved tool_use message is dropped, and the appended nudge contains "do not call" / "tool_call" / "invent". - New `TestIsUnresolvedTail` class — 6 unit tests for the helper. - `TestRunToolLoopReturnMessages` in test_tool_binding.py — regression guard: default returns 3-tuple, `return_messages=True` returns 4-tuple, existing Architect/Dev/QA call sites keep working. - Updated existing `test_invoke_with_optional_tools_*` mocks for the new 4-tuple return. 1135 tests pass (2 pre-existing unrelated Windows path-separator failures confirmed on main). Ruff clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Post-merge manual smoke of #200 surfaced two bugs on the first tool-using turn. The LLM called
filesystem_liston `dashboard/src/lib`, hit `MAX_PLANNER_TOOL_TURNS=4` still mid-tool-call, and the raw `tool_use` block leaked into the chat UI as a Python dict dump:```
[{'id': 'toolu_01GxiJrS8Xj4jN4FeFT3vsJu', 'input': {'__arg1': 'dashboard/src/lib'},
'name': 'filesystem_list', 'type': 'tool_use'}]
```
Backend logs confirmed:
```
[PLANNER] Hit max tool turns (4)
Unexpected LLM response content type: list
```
Fixes
MAX_PLANNER_TOOL_TURNS4 → 6. Four turns is tight for the Planner's from-scratch exploration (lsdashboard/src/lib→ lsdashboard/src/lib/components→ readpackage.json→ final answer = easily 4+). Architect Phase 2 uses 4 but only runs AFTER Phase 1 already narrowed the work._extract_text_from_contentnow silently skipstool_use/tool_result/input_json_deltablocks and returns""when a list contains only those. Empty list also returns"". No morestr(content)fallback rendering dict dumps._invoke_with_optional_toolsdetects tool-loop exhaustion (returned response has no usable text) and retries the unbound LLM on the original messages. Note: we can't replay the loop's mid-flight messages because the finaltool_usehas no pairedtool_result— Anthropic would reject that sequence. The wrap-up loses the loop's learnings but produces a valid conversational response instead of the raw-dict dump.Test plan
uv run ruff check src/ tests/— cleanuv run pytest tests/test_planner.py— 87/87 pass (6 new)uv run pytest tests/test_planner.py tests/test_github_fetch.py tests/test_mcp_tools.py tests/test_architect_two_phase.py tests/test_api.py— 293/293 passTestExtractTextFromContent::test_list_with_only_tool_use_returns_empty— regression for the dict-dump bugTestExtractTextFromContent::test_mixed_text_and_tool_use_returns_only_text— strips tool_use from mixed listsTestPlannerReadOnlyTools::test_invoke_with_optional_tools_wraps_up_on_exhaustion— regression with the exact dict shape the user sawRefs #193, follow-up to #200
🤖 Generated with Claude Code