fix(planner): handle tool-loop exhaustion + tool_use-only responses by Abernaughty · Pull Request #201 · Abernaughty/agent-dev

Abernaughty · 2026-04-17T22:55:55Z

Summary

Post-merge manual smoke of #200 surfaced two bugs on the first tool-using turn. The LLM called filesystem_list on `dashboard/src/lib`, hit `MAX_PLANNER_TOOL_TURNS=4` still mid-tool-call, and the raw `tool_use` block leaked into the chat UI as a Python dict dump:

```
[{'id': 'toolu_01GxiJrS8Xj4jN4FeFT3vsJu', 'input': {'__arg1': 'dashboard/src/lib'},
'name': 'filesystem_list', 'type': 'tool_use'}]
```

Backend logs confirmed:
```
[PLANNER] Hit max tool turns (4)
Unexpected LLM response content type: list
```

Fixes

Bump MAX_PLANNER_TOOL_TURNS 4 → 6. Four turns is tight for the Planner's from-scratch exploration (ls dashboard/src/lib → ls dashboard/src/lib/components → read package.json → final answer = easily 4+). Architect Phase 2 uses 4 but only runs AFTER Phase 1 already narrowed the work.
_extract_text_from_content now silently skips tool_use / tool_result / input_json_delta blocks and returns "" when a list contains only those. Empty list also returns "". No more str(content) fallback rendering dict dumps.
_invoke_with_optional_tools detects tool-loop exhaustion (returned response has no usable text) and retries the unbound LLM on the original messages. Note: we can't replay the loop's mid-flight messages because the final tool_use has no paired tool_result — Anthropic would reject that sequence. The wrap-up loses the loop's learnings but produces a valid conversational response instead of the raw-dict dump.

Test plan

uv run ruff check src/ tests/ — clean
uv run pytest tests/test_planner.py — 87/87 pass (6 new)
uv run pytest tests/test_planner.py tests/test_github_fetch.py tests/test_mcp_tools.py tests/test_architect_two_phase.py tests/test_api.py — 293/293 pass
New tests:
- TestExtractTextFromContent::test_list_with_only_tool_use_returns_empty — regression for the dict-dump bug
- TestExtractTextFromContent::test_mixed_text_and_tool_use_returns_only_text — strips tool_use from mixed lists
- TestPlannerReadOnlyTools::test_invoke_with_optional_tools_wraps_up_on_exhaustion — regression with the exact dict shape the user saw
Post-merge smoke: same "Review and implement the fix for GitHub Issue bug: Terminal panel drag-to-resize breaks after ~1/3 screen height #113" prompt — Planner should finish within 6 turns with a proper conversational response

Refs #193, follow-up to #200

🤖 Generated with Claude Code

Post-merge manual smoke of #200 surfaced two bugs on the first tool-using turn: the LLM called filesystem_list, hit MAX_PLANNER_TOOL_TURNS=4 still mid-call, and the raw tool_use block leaked into the chat UI as [{'id': 'toolu_01...', 'input': {'__arg1': 'dashboard/src/lib'}, 'name': 'filesystem_list', 'type': 'tool_use'}] Backend logs showed: [PLANNER] Hit max tool turns (4) Unexpected LLM response content type: list Fixes: - Bump MAX_PLANNER_TOOL_TURNS default from 4 to 6. Four turns is tight for the Planner's from-scratch exploration (ls dashboard/src/lib, ls dashboard/src/lib/components, read package.json, produce final answer = easily 4+). Architect Phase 2 uses 4 but only runs AFTER Phase 1 already narrowed the work. - `_extract_text_from_content` now silently skips tool_use / tool_result / input_json_delta blocks and returns "" when a list contains only those (instead of stringifying the dict). Empty list also returns "". - `_invoke_with_optional_tools` detects tool-loop exhaustion (returned response has no usable text) and retries the unbound LLM on the original messages. Note: we can't replay the loop's mid-flight messages because the final tool_use has no paired tool_result — Anthropic would reject that sequence. The wrap-up loses the loop's learnings but produces a valid conversational response instead of the raw-dict dump the user saw. Tests: 6 new in TestExtractTextFromContent + TestPlannerReadOnlyTools — covers tool_use-only response handling, empty-list handling, mixed text + tool_use, and the exhaustion wrap-up path. 87/87 planner tests pass; 293/293 related tests pass; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… + rich logging Follow-up to #201. The wrap-up strategy there saved the user from seeing raw tool_use dicts, but threw away all the tool results the loop gathered — so the unbound fallback LLM fabricated the answer (claimed BottomPanel.jsx + React/Vite when the real project is Svelte + TypeScript) and even wrote synthetic `<tool_call>` XML tags into the prose. Three compounding fixes: 1. `_run_tool_loop` takes a new `return_messages=False` kwarg. When True, returns a 4-tuple (response, tokens, log, messages). The Architect / Developer / QA call sites stay on the existing 3-tuple; only the Planner opts in. 2. Planner wrap-up rewritten: on exhaustion, take the loop's accumulated messages, drop only the single trailing assistant message with unresolved tool_use blocks (Anthropic rejects unpaired sequences), then append an explicit HumanMessage forbidding tool-call simulation and demanding the LLM answer from what it already learned. The wrap-up LLM now has the real filesystem output, not a clean slate. 3. `MAX_PLANNER_TOOL_TURNS` default 6 → 50. Per user feedback: even a well-documented issue like #113 took 6+ calls, complex issues could easily need 10-20. The cap exists as a safety net against runaway loops, not as a cost shape. The env var remains for operator tuning. Rich diagnostic logging added at every decision point so "why did this misbehave?" is answerable from the backend log alone, without code changes or a session re-run: - Planner turn-start: session id, model, workspace, tool count, GitHub context size, user message preview - Tool loader: failures upgraded from debug to info (no workspace / missing config / init error each log distinctly) - `_run_tool_loop`: per-tool-call log with name, sanitized args, result length, wall-clock ms, success flag - Tool executor: pre-call log so hung calls are visible - Tool failures include the args that triggered them - Loop exhaustion summary: total turns, per-tool counts, failure count, final response block types - Wrap-up entry log: messages-before / messages-after / dropped count - Wrap-up LLM returning empty content also logs at warning Tests: - `test_wrap_up_threads_real_loop_messages_not_originals` — repro for the fabrication bug: assert the wrap-up LLM is called with messages containing the real tool result, the unresolved tool_use message is dropped, and the appended nudge contains "do not call" / "tool_call" / "invent". - New `TestIsUnresolvedTail` class — 6 unit tests for the helper. - `TestRunToolLoopReturnMessages` in test_tool_binding.py — regression guard: default returns 3-tuple, `return_messages=True` returns 4-tuple, existing Architect/Dev/QA call sites keep working. - Updated existing `test_invoke_with_optional_tools_*` mocks for the new 4-tuple return. 1135 tests pass (2 pre-existing unrelated Windows path-separator failures confirmed on main). Ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… + rich logging (#202) Follow-up to #201. The wrap-up strategy there saved the user from seeing raw tool_use dicts, but threw away all the tool results the loop gathered — so the unbound fallback LLM fabricated the answer (claimed BottomPanel.jsx + React/Vite when the real project is Svelte + TypeScript) and even wrote synthetic `<tool_call>` XML tags into the prose. Three compounding fixes: 1. `_run_tool_loop` takes a new `return_messages=False` kwarg. When True, returns a 4-tuple (response, tokens, log, messages). The Architect / Developer / QA call sites stay on the existing 3-tuple; only the Planner opts in. 2. Planner wrap-up rewritten: on exhaustion, take the loop's accumulated messages, drop only the single trailing assistant message with unresolved tool_use blocks (Anthropic rejects unpaired sequences), then append an explicit HumanMessage forbidding tool-call simulation and demanding the LLM answer from what it already learned. The wrap-up LLM now has the real filesystem output, not a clean slate. 3. `MAX_PLANNER_TOOL_TURNS` default 6 → 50. Per user feedback: even a well-documented issue like #113 took 6+ calls, complex issues could easily need 10-20. The cap exists as a safety net against runaway loops, not as a cost shape. The env var remains for operator tuning. Rich diagnostic logging added at every decision point so "why did this misbehave?" is answerable from the backend log alone, without code changes or a session re-run: - Planner turn-start: session id, model, workspace, tool count, GitHub context size, user message preview - Tool loader: failures upgraded from debug to info (no workspace / missing config / init error each log distinctly) - `_run_tool_loop`: per-tool-call log with name, sanitized args, result length, wall-clock ms, success flag - Tool executor: pre-call log so hung calls are visible - Tool failures include the args that triggered them - Loop exhaustion summary: total turns, per-tool counts, failure count, final response block types - Wrap-up entry log: messages-before / messages-after / dropped count - Wrap-up LLM returning empty content also logs at warning Tests: - `test_wrap_up_threads_real_loop_messages_not_originals` — repro for the fabrication bug: assert the wrap-up LLM is called with messages containing the real tool result, the unresolved tool_use message is dropped, and the appended nudge contains "do not call" / "tool_call" / "invent". - New `TestIsUnresolvedTail` class — 6 unit tests for the helper. - `TestRunToolLoopReturnMessages` in test_tool_binding.py — regression guard: default returns 3-tuple, `return_messages=True` returns 4-tuple, existing Architect/Dev/QA call sites keep working. - Updated existing `test_invoke_with_optional_tools_*` mocks for the new 4-tuple return. 1135 tests pass (2 pre-existing unrelated Windows path-separator failures confirmed on main). Ruff clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Abernaughty merged commit 87f6700 into main Apr 17, 2026
3 checks passed

Abernaughty deleted the fix/planner-tool-loop-exhaustion branch April 17, 2026 22:59

Abernaughty mentioned this pull request Apr 17, 2026

fix(planner): thread real tool results into wrap-up + raise cap to 50 + rich logging #202

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(planner): handle tool-loop exhaustion + tool_use-only responses#201

fix(planner): handle tool-loop exhaustion + tool_use-only responses#201
Abernaughty merged 1 commit into
mainfrom
fix/planner-tool-loop-exhaustion

Abernaughty commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Abernaughty commented Apr 17, 2026

Summary

Fixes

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant