feat: tool-calling Phase 3 — cost tracking, feature flag, token budget #773
Chris0Jeky merged 8 commits into main from
Conversation
…Result for token budget enforcement
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
Self-review findings (adversarial analysis):

1. Cost tracking accuracy when orchestrator aborts early
2. Feature flag preserves single-turn behavior
3. Null-safety in TruncateToolResult
4. TruncateToolResult performance concern
5. Log ordering (TruncateToolResult vs TruncateForLog)
6. DI injection correctness
7. Mock provider in tests always available

No issues found requiring fixes. All 93 tool-calling tests pass.
Pull request overview
This PR finalizes Phase 3 of the LLM tool-calling rollout by adding a runtime feature flag, enforcing a tool-result byte budget before feeding tool output back into the LLM, and ensuring multi-round token usage is aggregated and recorded via ILlmQuotaService.
Changes:
- Add `LlmToolCallingSettings` (`Enabled`, `MaxToolResultBytes`) and wire it through DI, `ChatService`, and `ToolCallingChatOrchestrator`.
- Record aggregated orchestration token usage to quota tracking with provider/model attribution.
- Add unit tests covering feature-flag routing, token aggregation, quota recording, and tool-result truncation boundaries.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/STATUS.md | Documents Phase 3 tool-calling refinements and test coverage. |
| docs/IMPLEMENTATION_MASTERPLAN.md | Marks #651 as delivered and updates roadmap status. |
| backend/tests/Taskdeck.Application.Tests/Services/ToolCallingFeatureFlagAndCostTests.cs | Adds unit coverage for feature flag behavior, quota usage recording, token aggregation, and truncation boundaries. |
| backend/src/Taskdeck.Application/Services/ToolCallingChatOrchestrator.cs | Injects settings and truncates tool results before returning them to the LLM. |
| backend/src/Taskdeck.Application/Services/LlmToolCallingSettings.cs | Introduces settings object for tool-calling enablement and tool-result byte budget. |
| backend/src/Taskdeck.Application/Services/ChatService.cs | Gates orchestrator usage behind feature flag and records accumulated token usage to quota service. |
| backend/src/Taskdeck.Api/Extensions/LlmProviderRegistration.cs | Registers LlmToolCallingSettings from configuration as a singleton. |
| backend/src/Taskdeck.Api/appsettings.json | Adds LlmToolCalling configuration defaults. |
```csharp
const string marker = "...(truncated)";
// Walk back until the byte count fits
var maxChars = maxBytes - System.Text.Encoding.UTF8.GetByteCount(marker);
if (maxChars <= 0) return marker;
```
TruncateToolResult can return the full "...(truncated)" marker even when maxBytes is smaller than the marker’s UTF-8 byte length (maxChars <= 0 branch). That violates the method’s stated byte-budget contract and can still blow the context window for very small limits; consider ensuring the returned string is always <= maxBytes (e.g., truncate the marker itself or return an empty string when the budget can’t fit it).
```csharp
var encoded = System.Text.Encoding.UTF8.GetByteCount(content);
if (encoded <= maxBytes)
    return content;

// Truncate by characters (approximate — UTF-8 multi-byte chars are uncommon
// in typical JSON tool results but we avoid cutting in the middle of a char).
const string marker = "...(truncated)";
// Walk back until the byte count fits
var maxChars = maxBytes - System.Text.Encoding.UTF8.GetByteCount(marker);
if (maxChars <= 0) return marker;

// Binary-search-like: estimate character count from byte ratio then clamp
var ratio = (double)maxChars / encoded;
var estimate = (int)(content.Length * ratio);
while (estimate > 0 && System.Text.Encoding.UTF8.GetByteCount(content[..estimate]) > maxChars)
    estimate--;

return estimate > 0 ? content[..estimate] + marker : marker;
```
TruncateToolResult repeatedly allocates substrings (content[..estimate]) and recomputes UTF-8 byte counts in a decrementing loop; for large tool results this can become very expensive (O(n^2) worst-case) and add GC pressure. Consider using a bounded binary search over index (low/high) and Encoding.UTF8.GetByteCount(ReadOnlySpan) to avoid repeated allocations and reduce byte-count calls.
Suggested change:

```diff
-var encoded = System.Text.Encoding.UTF8.GetByteCount(content);
-if (encoded <= maxBytes)
-    return content;
-// Truncate by characters (approximate — UTF-8 multi-byte chars are uncommon
-// in typical JSON tool results but we avoid cutting in the middle of a char).
-const string marker = "...(truncated)";
-// Walk back until the byte count fits
-var maxChars = maxBytes - System.Text.Encoding.UTF8.GetByteCount(marker);
-if (maxChars <= 0) return marker;
-// Binary-search-like: estimate character count from byte ratio then clamp
-var ratio = (double)maxChars / encoded;
-var estimate = (int)(content.Length * ratio);
-while (estimate > 0 && System.Text.Encoding.UTF8.GetByteCount(content[..estimate]) > maxChars)
-    estimate--;
-return estimate > 0 ? content[..estimate] + marker : marker;
+var utf8 = Encoding.UTF8;
+var encoded = utf8.GetByteCount(content);
+if (encoded <= maxBytes)
+    return content;
+// Truncate by characters (approximate — UTF-8 multi-byte chars are uncommon
+// in typical JSON tool results but we avoid cutting in the middle of a char).
+const string marker = "...(truncated)";
+var maxContentBytes = maxBytes - utf8.GetByteCount(marker);
+if (maxContentBytes <= 0)
+    return marker;
+var span = content.AsSpan();
+var low = 0;
+var high = content.Length;
+var best = 0;
+while (low <= high)
+{
+    var mid = low + ((high - low) / 2);
+    var byteCount = utf8.GetByteCount(span[..mid]);
+    if (byteCount <= maxContentBytes)
+    {
+        best = mid;
+        low = mid + 1;
+    }
+    else
+    {
+        high = mid - 1;
+    }
+}
+return best > 0 ? content[..best] + marker : marker;
```
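The truncation contract under discussion can be sketched outside C# as well. Below is a minimal Python sketch of the same binary-search idea — function and parameter names are my own, and (unlike the suggestion above) it also trims the marker itself when the budget cannot fit it, matching the fix described in the follow-up commit:

```python
def truncate_tool_result(content: str, max_bytes: int,
                         marker: str = "...(truncated)") -> str:
    """Shorten content so its UTF-8 encoding fits max_bytes, appending a marker.

    Illustrative sketch only; mirrors the PR's binary-search approach.
    """
    if max_bytes <= 0:
        return content  # 0/negative disables truncation, as in the PR
    if len(content.encode("utf-8")) <= max_bytes:
        return content
    budget = max_bytes - len(marker.encode("utf-8"))
    if budget <= 0:
        # Budget smaller than the marker: trim the marker so the byte-budget
        # contract still holds (the original code returned the full marker here)
        return marker.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
    # Binary search for the longest prefix whose UTF-8 byte count fits the budget
    lo, hi, best = 0, len(content), 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if len(content[:mid].encode("utf-8")) <= budget:
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return content[:best] + marker
```

Multi-byte content (e.g. CJK at 3 bytes per character) exercises the search path, since character count and byte count diverge.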
```csharp
/// Maximum byte length of a single tool result before it is truncated.
/// Keeps oversized responses within the provider's context window.
/// 0 = no truncation limit (not recommended for production).
/// Default is 8 000 bytes (~6 000 tokens at typical tokenisation ratios).
```
The XML doc comment claims the default 8,000-byte limit is "~6,000 tokens", but 8 KB of UTF-8 content cannot be anywhere near 6k tokens under typical tokenization. This is likely misleading; consider removing the token estimate or replacing it with a more accurate/qualified statement (e.g., "a few thousand tokens depending on content and tokenizer").
Suggested change:

```diff
-/// Default is 8 000 bytes (~6 000 tokens at typical tokenisation ratios).
+/// Default is 8 000 bytes (roughly a few thousand tokens depending on content and tokenizer).
```
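A back-of-envelope check makes the comment's point concrete. Assuming ~4 UTF-8 bytes per token for typical English/JSON text (the exact ratio varies by tokenizer and content):

```python
# Rough sanity check on the token estimate. Typical English/JSON text
# tokenizes at roughly 3-5 bytes per token, so 8 000 bytes is on the
# order of 2 000 tokens -- nowhere near the ~6 000 the doc comment claimed.
bytes_limit = 8_000
approx_bytes_per_token = 4  # assumption; varies by tokenizer and content
approx_tokens = bytes_limit / approx_bytes_per_token
print(approx_tokens)  # → 2000.0
```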
```markdown
  - **Inbox triage action visibility** (`#688`/`#743`): 21 new tests in `InboxView.spec.ts` covering single-item triage action states and bulk action bar visibility with DOM-level assertions
  - **Webhook HMAC signature verification** (`#726`/`#750`): 11 tests in `OutboundWebhookHmacDeliveryTests.cs` covering header format, HMAC round-trip, wrong-key rejection, secret rotation, large payload, and timing-safe comparison; adversarial review fixed rotation test and replaced BCL-testing stubs with real domain property tests
  - **Webhook delivery reliability and SSRF boundary** (`#710`/`#756`): 78 webhook tests across 9 files (endpoint guard, service, signature, delivery worker, HMAC delivery, API, repository, domain delivery, domain subscription); SSRF coverage via `OutboundWebhookEndpointGuardTests` includes private IPv4/IPv6 ranges; delivery reliability covers retry/backoff, dead-letter, concurrent delivery, HMAC at worker boundary; `HttpClient` resource leak fixed in tests
- **LLM tool-calling Phase 3 refinements** (`#651`/PR TBD, 2026-04-04): cost tracking (token accumulation across all rounds reported to `ILlmQuotaService` with provider+model attribution), `EnableToolCalling` feature flag (`LlmToolCalling:Enabled` in `appsettings.json`, default `true`) wired into `ChatService` to bypass orchestrator when disabled, token budget enforcement via `TruncateToolResult` (configurable `MaxToolResultBytes`, default 8 000 bytes with `"...(truncated)"` marker), `LlmToolCallingSettings` registered as DI singleton and injected into both `ToolCallingChatOrchestrator` and `ChatService`; 15 new tests covering feature flag bypass, cost accumulation, quota service integration, and `TruncateToolResult` boundary conditions; all 93 tool-calling tests pass
```
This new bullet is not indented like the surrounding sub-items in this section (lines 88–90), so it renders as a top-level list item rather than part of the current wave. If it’s intended to be part of the same nested list, indent it to match the other sub-bullets.
Suggested change:

```diff
-- **LLM tool-calling Phase 3 refinements** (`#651`/PR TBD, 2026-04-04): cost tracking (token accumulation across all rounds reported to `ILlmQuotaService` with provider+model attribution), `EnableToolCalling` feature flag (`LlmToolCalling:Enabled` in `appsettings.json`, default `true`) wired into `ChatService` to bypass orchestrator when disabled, token budget enforcement via `TruncateToolResult` (configurable `MaxToolResultBytes`, default 8 000 bytes with `"...(truncated)"` marker), `LlmToolCallingSettings` registered as DI singleton and injected into both `ToolCallingChatOrchestrator` and `ChatService`; 15 new tests covering feature flag bypass, cost accumulation, quota service integration, and `TruncateToolResult` boundary conditions; all 93 tool-calling tests pass
+  - **LLM tool-calling Phase 3 refinements** (`#651`/PR TBD, 2026-04-04): cost tracking (token accumulation across all rounds reported to `ILlmQuotaService` with provider+model attribution), `EnableToolCalling` feature flag (`LlmToolCalling:Enabled` in `appsettings.json`, default `true`) wired into `ChatService` to bypass orchestrator when disabled, token budget enforcement via `TruncateToolResult` (configurable `MaxToolResultBytes`, default 8 000 bytes with `"...(truncated)"` marker), `LlmToolCallingSettings` registered as DI singleton and injected into both `ToolCallingChatOrchestrator` and `ChatService`; 15 new tests covering feature flag bypass, cost accumulation, quota service integration, and `TruncateToolResult` boundary conditions; all 93 tool-calling tests pass
```
|
Warning: Gemini encountered an error creating the review. You can try again by commenting …
…dversarial review findings

- Replace linear walk-back in TruncateToolResult with a proper binary search over ReadOnlySpan<char>, eliminating the O(n) worst case and all intermediate heap allocations during the search.
- Fix byte-budget contract violation: when maxBytes <= marker.Length the old code returned the full 14-byte marker even though it exceeded the stated budget; now trims the marker itself to fit.
- Correct misleading doc comment on MaxToolResultBytes (~6 000 tokens was inaccurate; replaced with a qualified estimate).
- Fix STATUS.md indentation: the new bullet was a top-level item instead of a sub-bullet matching surrounding wave entries; also replace "PR TBD" with the actual PR number 773.
- Add two adversarial regression tests: budget smaller than marker (previously violated the contract) and multi-byte CJK content (exercises the binary search path).

Test count: 15 → 17.
**Adversarial Review**

**CI Status**

All 23 CI checks pass (Backend Unit, API Integration, Architecture, E2E Smoke, CodeQL, Container Images, Dependency Review, Docs Governance, Frontend Unit, OpenAPI Guardrail, Workflow Lint — both ubuntu-latest and windows-latest where applicable). No failures.

**Test Results (run locally from PR branch)**

76 tool-calling tests total. After fixes: 17 tests in …

**Bot Comments**

Copilot (4 inline comments): …

Codex: hit usage limit, no findings.

**Adversarial Findings**

1. …
2. Linear walk-back replaced with binary search (PERFORMANCE + CORRECTNESS — FIXED)
3. Feature flag checked in correct order (PASS)
4. Orchestrator still constructed when flag is disabled (ACCEPTED)
5. Token accumulation when exiting early (PASS)
6. Cancellation token threading (PASS)
7. DI lifetime correctness (PASS)
8. Default values match appsettings.json (PASS)
9. No test from actual IConfiguration (ACCEPTED)
10. …

**Verdict**

3 real issues found and fixed (byte-budget contract violation, O(n) walk-back loop, misleading docs + STATUS.md indentation). All are addressed in commit …
Closes #651
Summary
Phase 3 of LLM tool-calling (#647 tracker). Implements the remaining refinements to make tool-calling production-ready:
- **Feature flag (`LlmToolCalling:Enabled`):** `ChatService` now checks `LlmToolCallingSettings.Enabled` before routing to the orchestrator. When `false`, every request falls through to single-turn `CompleteAsync` — no orchestrator invocation, no tool schemas sent. Default is `true` (preserves existing behaviour). Configurable in `appsettings.json` without code changes.
- **Cost tracking:** multi-round token usage is aggregated into `ToolCallingResult.TokensUsed`. `ChatService` reports this total to `ILlmQuotaService.RecordUsageAsync` with provider+model attribution, the same path as single-turn calls. Tokens are tracked even on degraded exits (timeout, loop detection, max rounds).
- **Token budget (`TruncateToolResult`):** new `internal static` method on `ToolCallingChatOrchestrator` truncates oversized tool results to `MaxToolResultBytes` (default 8 000 bytes, configurable) before they are fed back to the LLM. Appends a `"...(truncated)"` marker so the LLM knows the result was cut. UTF-8 byte-safe; 0/negative disables truncation.
- **DI wiring:** `LlmToolCallingSettings` registered as a singleton in `LlmProviderRegistration` and injected into both `ToolCallingChatOrchestrator` (for `MaxToolResultBytes`) and `ChatService` (for the `Enabled` flag).

Test plan
- `ToolCallingFeatureFlagAndCostTests.cs`:
  - default `LlmToolCallingSettings` has `Enabled=true`, `MaxToolResultBytes=8000`
  - `MaxToolResultBytes` …
  - `QuotaService.RecordUsageAsync` called with accumulated total (100 tokens across 2 rounds)
  - `TruncateToolResult`: short content unchanged, zero/negative limit no-op, oversized content truncated with marker, exact-at-limit unchanged, empty string unchanged
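The flag-gating and quota-recording flow in the summary can be sketched end to end. This is a minimal Python sketch under stated assumptions — every name here (`Settings`, `Quota`, `chat`, the `"provider-id"`/`"model-id"` attribution values) is an illustrative stand-in for the C# types, not the PR's actual API:

```python
from dataclasses import dataclass

@dataclass
class Settings:
    """Stand-in for LlmToolCallingSettings (names assumed)."""
    enabled: bool = True
    max_tool_result_bytes: int = 8000

class Quota:
    """Stand-in for ILlmQuotaService: records (provider, model, tokens) tuples."""
    def __init__(self):
        self.recorded = []
    def record_usage(self, provider, model, tokens):
        self.recorded.append((provider, model, tokens))

def chat(settings, quota, single_turn, orchestrate):
    if not settings.enabled:
        # Flag off: fall through to single-turn completion, no tool schemas sent,
        # quota recording follows the existing single-turn path (omitted here)
        return single_turn()
    # Orchestrator aggregates token usage across all tool-calling rounds,
    # including degraded exits (timeout, loop detection, max rounds)
    reply, tokens_used = orchestrate()
    quota.record_usage("provider-id", "model-id", tokens_used)
    return reply
```

For example, with the flag disabled the orchestrator is never invoked and nothing is recorded against tool-calling usage; with it enabled, the aggregated total (e.g. 100 tokens across 2 rounds, as in the test plan) reaches the quota service in one call.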