Skip to content

feat: tool-calling Phase 3 — cost tracking, feature flag, token budget#773

Merged
Chris0Jeky merged 8 commits intomainfrom
feat/tool-calling-refinements-651
Apr 4, 2026
Merged

feat: tool-calling Phase 3 — cost tracking, feature flag, token budget#773
Chris0Jeky merged 8 commits intomainfrom
feat/tool-calling-refinements-651

Conversation

@Chris0Jeky
Copy link
Copy Markdown
Owner

Closes #651

Summary

Phase 3 of LLM tool-calling (#647 tracker). Implements the remaining refinements to make tool-calling production-ready:

  • Feature flag (LlmToolCalling:Enabled): ChatService now checks LlmToolCallingSettings.Enabled before routing to the orchestrator. When false, every request falls through to single-turn CompleteAsync — no orchestrator invocation, no tool schemas sent. Default is true (preserves existing behaviour). Configurable in appsettings.json without code changes.
  • Cost tracking integration: Token usage is already accumulated across all orchestration rounds in ToolCallingResult.TokensUsed. ChatService reports this total to ILlmQuotaService.RecordUsageAsync with provider+model attribution, the same path as single-turn calls. Tokens are tracked even on degraded exits (timeout, loop detection, max rounds).
  • Token budget enforcement (TruncateToolResult): New internal static method on ToolCallingChatOrchestrator truncates oversized tool results to MaxToolResultBytes (default 8 000 bytes, configurable) before they are fed back to the LLM. Appends "...(truncated)" marker so the LLM knows the result was cut. UTF-8 byte-safe; 0/negative disables truncation.
  • DI wiring: LlmToolCallingSettings registered as singleton in LlmProviderRegistration, injected into both ToolCallingChatOrchestrator (for MaxToolResultBytes) and ChatService (for Enabled flag).

Test plan

  • 15 new unit tests in ToolCallingFeatureFlagAndCostTests.cs
    • Feature flag disabled → orchestrator not invoked, single-turn provider called
    • Feature flag enabled → orchestrator invoked for board-scoped sessions
    • No orchestrator registered + flag disabled → single-turn path works
    • Non-board session always uses single-turn regardless of flag
    • Default LlmToolCallingSettings has Enabled=true, MaxToolResultBytes=8000
    • Token accumulation verified across 3-round + final orchestration (3×25+50=125)
    • Token accumulation correct on degraded abort (round-1 tokens reported)
    • Token budget truncation enforced at small MaxToolResultBytes
    • QuotaService.RecordUsageAsync called with accumulated total (100 tokens across 2 rounds)
    • TruncateToolResult: short content unchanged, zero/negative limit no-op, oversized content truncated with marker, exact-at-limit unchanged, empty string unchanged
  • All 93 existing tool-calling tests still pass

Copilot AI review requested due to automatic review settings April 4, 2026 18:33
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@Chris0Jeky
Copy link
Copy Markdown
Owner Author

Self-review findings

Adversarial analysis

1. Cost tracking accuracy when orchestrator aborts early
totalTokensUsed is incremented on every round before any degraded check (totalTokensUsed += llmResult.TokensUsed). All early-exit paths (timeout, loop detection, max rounds, degraded provider) return the accumulated total. Tokens from round 1 are always reported even if round 2 fails. PASS

2. Feature flag preserves single-turn behavior
The only code change in SendMessageAsync is adding && _toolCallingSettings.Enabled to the orchestrator guard condition. The single-turn path, proposal creation, quota recording, and all other logic are unchanged. PASS

3. Null-safety in TruncateToolResult

  • string.IsNullOrEmpty(content) guard handles null/empty before byte counting.
  • maxBytes <= 0 guard prevents negative-size truncation.
  • maxChars <= 0 guard returns the marker directly if even the marker doesn't fit.
  • estimate > 0 guard before slicing prevents zero-length slice.
  • While loop terminates because estimate decrements toward 0 and the loop condition guards estimate > 0. PASS

4. TruncateToolResult performance concern
The while loop decrements by 1 character each iteration. Worst case: content is entirely multi-byte UTF-8 (e.g. CJK) and maxBytes is very small. For the default 8 000 byte limit, the theoretical worst case is ~4 000 iterations for CJK text. Acceptable for a chat service — this is not a hot path and tool results are typically ASCII JSON. If this becomes a concern, a binary search can replace it. ACCEPTABLE

5. Log ordering (TruncateToolResult vs TruncateForLog)
TruncateToolResult runs first (caps at MaxToolResultBytes = 8 000 bytes). TruncateForLog then caps the already-truncated string at 200 chars for the log entry. This is correct: the LLM sees the byte-budgeted content, the log shows a brief summary. PASS

6. DI injection correctness
LlmToolCallingSettings is registered as services.AddSingleton(instance). ASP.NET Core DI will inject it into ChatService (optional parameter with null default). Since the singleton is registered, DI resolves it and passes it — no null fallback in production. Test constructors pass settings explicitly. PASS

7. Mock provider in tests always available
The feature flag only affects whether ChatService routes to the orchestrator. MockLlmProvider.CompleteAsync (single-turn) is always callable regardless of the flag. PASS

No issues found requiring fixes. All 93 tool-calling tests pass.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR finalizes Phase 3 of the LLM tool-calling rollout by adding a runtime feature flag, enforcing a tool-result byte budget before feeding tool output back into the LLM, and ensuring multi-round token usage is aggregated and recorded via ILlmQuotaService.

Changes:

  • Add LlmToolCallingSettings (Enabled, MaxToolResultBytes) and wire it through DI, ChatService, and ToolCallingChatOrchestrator.
  • Record aggregated orchestration token usage to quota tracking with provider/model attribution.
  • Add unit tests covering feature-flag routing, token aggregation, quota recording, and tool-result truncation boundaries.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
docs/STATUS.md Documents Phase 3 tool-calling refinements and test coverage.
docs/IMPLEMENTATION_MASTERPLAN.md Marks #651 as delivered and updates roadmap status.
backend/tests/Taskdeck.Application.Tests/Services/ToolCallingFeatureFlagAndCostTests.cs Adds unit coverage for feature flag behavior, quota usage recording, token aggregation, and truncation boundaries.
backend/src/Taskdeck.Application/Services/ToolCallingChatOrchestrator.cs Injects settings and truncates tool results before returning them to the LLM.
backend/src/Taskdeck.Application/Services/LlmToolCallingSettings.cs Introduces settings object for tool-calling enablement and tool-result byte budget.
backend/src/Taskdeck.Application/Services/ChatService.cs Gates orchestrator usage behind feature flag and records accumulated token usage to quota service.
backend/src/Taskdeck.Api/Extensions/LlmProviderRegistration.cs Registers LlmToolCallingSettings from configuration as a singleton.
backend/src/Taskdeck.Api/appsettings.json Adds LlmToolCalling configuration defaults.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +437 to +441
const string marker = "...(truncated)";
// Walk back until the byte count fits
var maxChars = maxBytes - System.Text.Encoding.UTF8.GetByteCount(marker);
if (maxChars <= 0) return marker;

Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TruncateToolResult can return the full "...(truncated)" marker even when maxBytes is smaller than the marker’s UTF-8 byte length (maxChars <= 0 branch). That violates the method’s stated byte-budget contract and can still blow the context window for very small limits; consider ensuring the returned string is always <= maxBytes (e.g., truncate the marker itself or return an empty string when the budget can’t fit it).

Copilot uses AI. Check for mistakes.
Comment on lines +431 to +448
var encoded = System.Text.Encoding.UTF8.GetByteCount(content);
if (encoded <= maxBytes)
return content;

// Truncate by characters (approximate — UTF-8 multi-byte chars are uncommon
// in typical JSON tool results but we avoid cutting in the middle of a char).
const string marker = "...(truncated)";
// Walk back until the byte count fits
var maxChars = maxBytes - System.Text.Encoding.UTF8.GetByteCount(marker);
if (maxChars <= 0) return marker;

// Binary-search-like: estimate character count from byte ratio then clamp
var ratio = (double)maxChars / encoded;
var estimate = (int)(content.Length * ratio);
while (estimate > 0 && System.Text.Encoding.UTF8.GetByteCount(content[..estimate]) > maxChars)
estimate--;

return estimate > 0 ? content[..estimate] + marker : marker;
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TruncateToolResult repeatedly allocates substrings (content[..estimate]) and recomputes UTF-8 byte counts in a decrementing loop; for large tool results this can become very expensive (O(n^2) worst-case) and add GC pressure. Consider using a bounded binary search over index (low/high) and Encoding.UTF8.GetByteCount(ReadOnlySpan) to avoid repeated allocations and reduce byte-count calls.

Suggested change
var encoded = System.Text.Encoding.UTF8.GetByteCount(content);
if (encoded <= maxBytes)
return content;
// Truncate by characters (approximate — UTF-8 multi-byte chars are uncommon
// in typical JSON tool results but we avoid cutting in the middle of a char).
const string marker = "...(truncated)";
// Walk back until the byte count fits
var maxChars = maxBytes - System.Text.Encoding.UTF8.GetByteCount(marker);
if (maxChars <= 0) return marker;
// Binary-search-like: estimate character count from byte ratio then clamp
var ratio = (double)maxChars / encoded;
var estimate = (int)(content.Length * ratio);
while (estimate > 0 && System.Text.Encoding.UTF8.GetByteCount(content[..estimate]) > maxChars)
estimate--;
return estimate > 0 ? content[..estimate] + marker : marker;
var utf8 = Encoding.UTF8;
var encoded = utf8.GetByteCount(content);
if (encoded <= maxBytes)
return content;
// Truncate by characters (approximate — UTF-8 multi-byte chars are uncommon
// in typical JSON tool results but we avoid cutting in the middle of a char).
const string marker = "...(truncated)";
var maxContentBytes = maxBytes - utf8.GetByteCount(marker);
if (maxContentBytes <= 0)
return marker;
var span = content.AsSpan();
var low = 0;
var high = content.Length;
var best = 0;
while (low <= high)
{
var mid = low + ((high - low) / 2);
var byteCount = utf8.GetByteCount(span[..mid]);
if (byteCount <= maxContentBytes)
{
best = mid;
low = mid + 1;
}
else
{
high = mid - 1;
}
}
return best > 0 ? content[..best] + marker : marker;

Copilot uses AI. Check for mistakes.
/// Maximum byte length of a single tool result before it is truncated.
/// Keeps oversized responses within the provider's context window.
/// 0 = no truncation limit (not recommended for production).
/// Default is 8 000 bytes (~6 000 tokens at typical tokenisation ratios).
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The XML doc comment claims the default 8,000-byte limit is "~6,000 tokens", but 8 KB of UTF-8 content cannot be anywhere near 6k tokens under typical tokenization. This is likely misleading; consider removing the token estimate or replacing it with a more accurate/qualified statement (e.g., "a few thousand tokens depending on content and tokenizer").

Suggested change
/// Default is 8 000 bytes (~6 000 tokens at typical tokenisation ratios).
/// Default is 8 000 bytes (roughly a few thousand tokens depending on content and tokenizer).

Copilot uses AI. Check for mistakes.
Comment thread docs/STATUS.md Outdated
- **Inbox triage action visibility** (`#688`/`#743`): 21 new tests in `InboxView.spec.ts` covering single-item triage action states and bulk action bar visibility with DOM-level assertions
- **Webhook HMAC signature verification** (`#726`/`#750`): 11 tests in `OutboundWebhookHmacDeliveryTests.cs` covering header format, HMAC round-trip, wrong-key rejection, secret rotation, large payload, and timing-safe comparison; adversarial review fixed rotation test and replaced BCL-testing stubs with real domain property tests
- **Webhook delivery reliability and SSRF boundary** (`#710`/`#756`): 78 webhook tests across 9 files (endpoint guard, service, signature, delivery worker, HMAC delivery, API, repository, domain delivery, domain subscription); SSRF coverage via `OutboundWebhookEndpointGuardTests` includes private IPv4/IPv6 ranges; delivery reliability covers retry/backoff, dead-letter, concurrent delivery, HMAC at worker boundary; `HttpClient` resource leak fixed in tests
- **LLM tool-calling Phase 3 refinements** (`#651`/PR TBD, 2026-04-04): cost tracking (token accumulation across all rounds reported to `ILlmQuotaService` with provider+model attribution), `EnableToolCalling` feature flag (`LlmToolCalling:Enabled` in `appsettings.json`, default `true`) wired into `ChatService` to bypass orchestrator when disabled, token budget enforcement via `TruncateToolResult` (configurable `MaxToolResultBytes`, default 8 000 bytes with `"...(truncated)"` marker), `LlmToolCallingSettings` registered as DI singleton and injected into both `ToolCallingChatOrchestrator` and `ChatService`; 15 new tests covering feature flag bypass, cost accumulation, quota service integration, and `TruncateToolResult` boundary conditions; all 93 tool-calling tests pass
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new bullet is not indented like the surrounding sub-items in this section (lines 88–90), so it renders as a top-level list item rather than part of the current wave. If it’s intended to be part of the same nested list, indent it to match the other sub-bullets.

Suggested change
- **LLM tool-calling Phase 3 refinements** (`#651`/PR TBD, 2026-04-04): cost tracking (token accumulation across all rounds reported to `ILlmQuotaService` with provider+model attribution), `EnableToolCalling` feature flag (`LlmToolCalling:Enabled` in `appsettings.json`, default `true`) wired into `ChatService` to bypass orchestrator when disabled, token budget enforcement via `TruncateToolResult` (configurable `MaxToolResultBytes`, default 8 000 bytes with `"...(truncated)"` marker), `LlmToolCallingSettings` registered as DI singleton and injected into both `ToolCallingChatOrchestrator` and `ChatService`; 15 new tests covering feature flag bypass, cost accumulation, quota service integration, and `TruncateToolResult` boundary conditions; all 93 tool-calling tests pass
- **LLM tool-calling Phase 3 refinements** (`#651`/PR TBD, 2026-04-04): cost tracking (token accumulation across all rounds reported to `ILlmQuotaService` with provider+model attribution), `EnableToolCalling` feature flag (`LlmToolCalling:Enabled` in `appsettings.json`, default `true`) wired into `ChatService` to bypass orchestrator when disabled, token budget enforcement via `TruncateToolResult` (configurable `MaxToolResultBytes`, default 8 000 bytes with `"...(truncated)"` marker), `LlmToolCallingSettings` registered as DI singleton and injected into both `ToolCallingChatOrchestrator` and `ChatService`; 15 new tests covering feature flag bypass, cost accumulation, quota service integration, and `TruncateToolResult` boundary conditions; all 93 tool-calling tests pass

Copilot uses AI. Check for mistakes.
@gemini-code-assist
Copy link
Copy Markdown

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

…dversarial review findings

- Replace linear walk-back in TruncateToolResult with a proper binary
  search over ReadOnlySpan<char>, eliminating the O(n) worst case and
  all intermediate heap allocations during the search.
- Fix byte-budget contract violation: when maxBytes <= marker.Length the
  old code returned the full 14-byte marker even though it exceeded the
  stated budget; now trims the marker itself to fit.
- Correct misleading doc comment on MaxToolResultBytes (~6 000 tokens
  was inaccurate; replaced with a qualified estimate).
- Fix STATUS.md indentation: new bullet was a top-level item instead of
  a sub-bullet matching surrounding wave entries; also replaces "PR TBD"
  with the actual PR number 773.
- Add two adversarial regression tests: budget smaller than marker
  (previously violated the contract) and multi-byte CJK content
  (exercises the binary search path). Test count: 15 → 17.
@Chris0Jeky
Copy link
Copy Markdown
Owner Author

Adversarial Review

CI Status

All 23 CI checks pass (Backend Unit, API Integration, Architecture, E2E Smoke, CodeQL, Container Images, Dependency Review, Docs Governance, Frontend Unit, OpenAPI Guardrail, Workflow Lint — both ubuntu-latest and windows-latest where applicable). No failures.

Test Results (run locally from PR branch)

Passed! - Failed: 0, Passed: 74, Skipped: 0, Total: 74 - Taskdeck.Application.Tests.dll
Passed! - Failed: 0, Passed:  2, Skipped: 0, Total:  2 - Taskdeck.Api.Tests.dll

76 tool-calling tests total. After fixes: 17 tests in ToolCallingFeatureFlagAndCostTests (was 15).

Bot Comments

Copilot (4 inline comments):

  • TruncateToolResult byte-budget contract violation when maxBytes <= marker.LengthFIXED
  • O(n) walk-back loop — FIXED (replaced with binary search)
  • Misleading doc comment ~6 000 tokensFIXED
  • STATUS.md indentation mismatch — FIXED

Codex: Hit usage limit, no findings.
Gemini: Encountered error, no findings.
Owner self-review: 7 items analyzed, all pass — agreed with the analysis.

Adversarial Findings

1. TruncateToolResult byte-budget contract violation (BUG — FIXED)
When maxBytes is positive but smaller than the 14-byte "...(truncated)" marker (e.g., maxBytes = 5), the old code hit the if (maxChars <= 0) return marker; branch and returned the full 14-byte marker, violating the stated byte budget. While MaxToolResultBytes defaults to 8 000 and typical use would never set it below 14, the contract promise in the XML doc ("always within maxBytes") was broken. Fixed by clamping the marker itself when the budget is tighter than the marker. Added regression test TruncateToolResult_MaxBytesSmallerThanMarker_ResultIsWithinBudget.

2. Linear walk-back replaced with binary search (PERFORMANCE + CORRECTNESS — FIXED)
The original used a comment saying "binary-search-like" but implemented a decrementing loop. Worst-case O(n) iterations for all-multi-byte content; each iteration called GetByteCount(content[..estimate]) and allocated a new string slice. Replaced with a proper binary search over ReadOnlySpan<char> using Encoding.UTF8.GetByteCount(ReadOnlySpan<char>), which avoids heap allocations in the search inner loop. Added regression test TruncateToolResult_OversizedMultiByteContent_ResultIsWithinBudget (CJK characters, 3 bytes each) to exercise this path.

3. Feature flag checked in correct order (PASS)
The guard condition _toolCallingOrchestrator != null && _toolCallingSettings.Enabled && session.BoardId.HasValue short-circuits correctly. The orchestrator object existence is checked first (cheap null check), then the flag, then the board-scope requirement. Evaluation order is correct and C# short-circuit semantics are reliable here.

4. Orchestrator still constructed when flag is disabled (ACCEPTED)
ToolCallingChatOrchestrator is constructed by DI (scoped) regardless of Enabled. When Enabled = false, the orchestrator is constructed but never invoked. This is a minor wasted allocation per request but is architecturally clean — the flag is a runtime bypass, not a registration gate. Acceptable for the stated use case (cost-control toggle, not permanent removal).

5. Token accumulation when exiting early (PASS)
All early-exit paths (BuildTimeoutResult, BuildDegradedResult, BuildLoopDetectedResult, BuildExhaustedResult) receive the accumulated totalTokensUsed. The Orchestrator_TokensAccumulated_WhenDegradedEarly test confirms round-1 tokens are preserved even when round 2 throws. PASS.

6. Cancellation token threading (PASS)
Both ExecuteAsync overloads accept a CancellationToken. A linked CancellationTokenSource is created per round with a per-round timeout. The outer ct is checked at the top of each loop iteration with ct.ThrowIfCancellationRequested(). Tool executor calls pass ct (not the linked one), which is intentional — the per-round LLM timeout shouldn't cancel an in-progress tool execution that the LLM already triggered. PASS.

7. DI lifetime correctness (PASS)
LlmToolCallingSettings is registered as a singleton instance (services.AddSingleton(instance)). The concrete instance is bound from configuration at startup and shared for the process lifetime. This is the same pattern used for LlmProviderSettings, LlmQuotaSettings, and LlmKillSwitchSettings. PASS.

8. Default values match appsettings.json (PASS)
LlmToolCallingSettings defaults: Enabled = true, MaxToolResultBytes = 8_000. appsettings.json values: "Enabled": true, "MaxToolResultBytes": 8000. They agree.

9. No test from actual IConfiguration (ACCEPTED)
Tests construct LlmToolCallingSettings directly. The DI wiring (configuration.GetSection("LlmToolCalling").Get<LlmToolCallingSettings>()) is exercised by the CI integration tests (which start the full API host). A dedicated unit test for the section binding would require IConfiguration mocking and adds little beyond what the integration tests already cover.

10. MaxToolResultBytes = 0 validation (ACCEPTED WITH NOTE)
Setting MaxToolResultBytes = 0 is explicitly documented as "no truncation" and is handled by if (maxBytes <= 0) return content. No validation is required. Docs should discourage it for production — the XML doc comment already says "(not recommended for production)". PASS.

Verdict

3 real issues found and fixed (byte-budget contract violation, O(n) loop, misleading docs + STATUS.md indentation). All are addressed in commit 17209b2e pushed to this branch. 17 tests pass (2 new regression tests added). No blocking issues remain.

@Chris0Jeky Chris0Jeky merged commit 58e7df5 into main Apr 4, 2026
23 checks passed
@Chris0Jeky Chris0Jeky deleted the feat/tool-calling-refinements-651 branch April 4, 2026 21:26
@github-project-automation github-project-automation bot moved this from Pending to Done in Taskdeck Execution Apr 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

LLM-08: Phase 3 — Tool-calling refinements (loop detection, cost tracking, feature flag)

2 participants