Test: LLM provider abstraction and tool-calling edge cases (#709) #747
Chris0Jeky merged 6 commits into main from
Conversation
18 tests covering: per-round timeout, empty/null tool call lists, multiple concurrent tools in one round, mixed error handling, tool-not-found with available tool suggestion, generic provider exception, cancellation token propagation, userId context passing, large result truncation in log, metadata JSON generation, token accumulation across rounds, null content on complete, exhausted rounds partial summary, and status notifier invocation per tool.
24 tests covering: default CompleteWithToolsAsync throws NotSupportedException, MockLlmProvider edge cases (empty messages, non-user roles, actionable message detection, very long input, empty tool-calling messages, previous results summary, error results, large result truncation, health/probe endpoints), provider selection edge cases (null/empty settings, case insensitive provider names), record default values, degraded result accessibility, and kill switch settings defaults.
48 tests covering: negation filtering (don't, do not, never, stop, cancel, avoid), other-tool questions (Trello, Jira, Asana, Notion), positive detection for all intent types, non-actionable inputs, null/empty/whitespace handling, very long inputs, mixed case, newlines, prompt injection patterns, archive vs move disambiguation, plural noun support, and alternate verb coverage (generate, build, prepare, set up, modify, change, sort, rearrange, reorganize).
10 tests covering: empty registry returns null, case-insensitive lookup, mixed case lookup, empty registered tool names, multiple executor registration, non-existent tool lookup, empty string lookup, ToolExecutionContext property access, and value equality.
Adversarial Self-Review Findings

Potential Issues Found: No Issues Found
Code Review
This pull request introduces a comprehensive suite of edge case and boundary tests for the LLM-related services, including the intent classifier, provider abstractions, and the tool-calling orchestrator. The review feedback identifies several opportunities for improvement: refining test cases to better target negation logic, adding missing assertions for newline detection, refactoring the orchestrator to support configurable timeouts for faster testing, and implementing missing test cases for the tool executor registry as noted in its documentation.
```csharp
[InlineData("stop create new tasks")]
[InlineData("cancel the delete of card 5")]
[InlineData("don't remove that task")]
[InlineData("avoid creating a task please")] // "avoid" + "creating" uses "avoid" in the negation list
```
The test case "avoid creating a task please" passes because the word "creating" (gerund) is not recognized as an actionable verb by the classifier's regex patterns, rather than being caught by the negation logic. The NegationPattern in LlmIntentClassifier only includes base verb forms like create. If the intent is to test negation, consider using a base verb form like "avoid create a task". Additionally, note that the negation logic for "set up" in the classifier uses a literal space, which is inconsistent with the creation patterns that use \s+; adding a test case with multiple spaces (e.g., "don't set up") would be a good edge case to cover.
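Applying that suggestion, the theory data might look like the following sketch (hypothetical cases; it assumes the classifier's negation list contains bare verb forms and that "set up" is matched with a literal single space):

```csharp
// Base verb forms, so the actionable-verb regex fires and only the
// negation logic can suppress the match:
[InlineData("avoid create a task")]       // "avoid" + bare verb "create"
[InlineData("don't set  up the board")]   // double space probes the literal-space "set up" pattern
```

If the second case fails to be suppressed, that would confirm the `\s+` vs literal-space inconsistency flagged above.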
```csharp
public void Classify_NewlinesInInput_StillDetects(string input)
{
    // Regex patterns work per-line or across depending on implementation
    var (isActionable, _) = LlmIntentClassifier.Classify(input);

    // Regardless of detection, it should not throw
    // (The actual behavior depends on regex mode - this tests safety)
}
```
The test Classify_NewlinesInInput_StillDetects is missing assertions. While the comment suggests it only checks for exceptions, the test name implies it should verify that the intent is correctly detected despite newlines. Given that \s in .NET regex matches newlines, this should be explicitly asserted.
```csharp
[Theory]
[InlineData("create a card\nand some other text")]
[InlineData("create\na\ncard")]
public void Classify_NewlinesInInput_StillDetects(string input)
{
    var (isActionable, actionIntent) = LlmIntentClassifier.Classify(input);
    isActionable.Should().BeTrue($"input with newlines '{input}' should be detected");
    actionIntent.Should().Be("card.create");
}
```

```csharp
    IReadOnlyList<ToolCallResult>? _, CancellationToken ct) =>
{
    // Wait until cancellation triggers (per-round timeout)
    await Task.Delay(TimeSpan.FromSeconds(60), ct);
```
This test will take at least 30 seconds to execute because PerRoundTimeoutSeconds is a hardcoded constant in ToolCallingChatOrchestrator. Hardcoded timeouts make unit tests slow and brittle. Consider refactoring the orchestrator to accept a settings object or a TimeSpan for timeouts in its constructor, which would allow tests to use much shorter durations (e.g., 100ms) while still verifying the timeout logic.
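One possible shape for that refactor, sketched below. The options type, property names, and defaults are assumptions for illustration, not the orchestrator's actual API:

```csharp
// Hypothetical options type so tests can inject a short per-round timeout.
public sealed class ToolCallingChatOrchestratorOptions
{
    // Production default mirrors the current hardcoded 30-second constant.
    public TimeSpan PerRoundTimeout { get; init; } = TimeSpan.FromSeconds(30);
}

public sealed class ToolCallingChatOrchestrator
{
    private readonly TimeSpan _perRoundTimeout;

    public ToolCallingChatOrchestrator(ToolCallingChatOrchestratorOptions? options = null)
    {
        // Tests can pass PerRoundTimeout = TimeSpan.FromMilliseconds(100)
        // to exercise the timeout path without real 30-second waits.
        _perRoundTimeout = (options ?? new ToolCallingChatOrchestratorOptions()).PerRoundTimeout;
    }
}
```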
```csharp
/// duplicate tool names, and null/empty tool name lookup.
/// </summary>
```
The class summary mentions covering "duplicate tool names" and "null/empty tool name lookup", but there are no test cases for these scenarios. Since ToolExecutorRegistry uses ToDictionary (which throws on duplicate keys) and TryGetValue (which throws on a null key), tests should be added to verify these crash conditions or ensure the registry handles them gracefully.
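The missing cases could be sketched roughly as follows. The helper `FakeExecutor`, the registry constructor shape, and the `GetExecutor` method name are assumptions based on the behavior described above:

```csharp
[Fact]
public void Constructor_DuplicateToolNames_Throws()
{
    // ToDictionary throws ArgumentException when two executors share a tool name.
    var executors = new[] { new FakeExecutor("card.create"), new FakeExecutor("card.create") };
    var act = () => new ToolExecutorRegistry(executors);
    act.Should().Throw<ArgumentException>();
}

[Fact]
public void GetExecutor_NullToolName_Throws()
{
    // Dictionary.TryGetValue throws ArgumentNullException for a null key.
    var registry = new ToolExecutorRegistry(Array.Empty<IToolExecutor>());
    var act = () => registry.GetExecutor(null!);
    act.Should().Throw<ArgumentNullException>();
}
```

Alternatively, if crashing is undesirable, the registry could normalize these inputs and the tests would assert graceful null returns instead.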
- Replace no-assertion newline test with proper NotThrow assertion and add separate test for single-line actionable detection
- Strengthen MockProvider truncation test to verify the full string does NOT appear (was a weak LessThan assertion)
- Fix misleading negation test: use "avoid create" (infinitive) since the negation regex requires bare verbs, not gerunds like "creating"
Self-Review Fixes Applied (d96f9a0)

Fixed the three issues from the adversarial review:
Remaining known gaps (documented, not blocking):
Pull request overview
Adds extensive edge-case unit coverage around the backend LLM subsystem, focusing on tool-calling orchestration boundaries and provider abstraction behaviors to reduce regressions and improve confidence in failure-mode handling.
Changes:
- Adds new boundary tests for `ToolCallingChatOrchestrator` (timeouts, empty tool calls, tool-not-found, mixed tool errors, metadata, token accumulation, round exhaustion).
- Adds edge-case tests for `ILlmProvider` default tool-calling behavior, `MockLlmProvider`, provider selection policy, and record defaults.
- Adds expanded intent-classifier edge-case tests (negation, other-tool queries, long inputs, casing/newlines, injection-like strings, verb/plural coverage) and tool registry tests.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| backend/tests/Taskdeck.Application.Tests/Services/ToolExecutorRegistryEdgeCaseTests.cs | New tests for tool executor registry lookup behavior and ToolExecutionContext value semantics. |
| backend/tests/Taskdeck.Application.Tests/Services/ToolCallingChatOrchestratorEdgeCaseTests.cs | New orchestrator boundary tests for timeout/degraded paths, tool execution logging, and notifier invocation. |
| backend/tests/Taskdeck.Application.Tests/Services/LlmProviderAbstractionEdgeCaseTests.cs | New tests for provider abstraction defaults, mock provider edge cases, selection policy, and settings defaults. |
| backend/tests/Taskdeck.Application.Tests/Services/LlmIntentClassifierEdgeCaseTests.cs | New classifier tests for negation/other-tool suppression and robustness against varied inputs. |
```csharp
// Simulate a provider that takes longer than PerRoundTimeoutSeconds
var mock = new Mock<ILlmProvider>();
mock.Setup(p => p.CompleteWithToolsAsync(
    It.IsAny<ChatCompletionRequest>(),
    It.IsAny<IReadOnlyList<TaskdeckToolSchema>>(),
    It.IsAny<IReadOnlyList<ToolCallResult>?>(),
    It.IsAny<CancellationToken>()))
    .Returns(async (ChatCompletionRequest _, IReadOnlyList<TaskdeckToolSchema> _,
        IReadOnlyList<ToolCallResult>? _, CancellationToken ct) =>
    {
        // Wait until cancellation triggers (per-round timeout)
        await Task.Delay(TimeSpan.FromSeconds(60), ct);
        return new LlmToolCompletionResult(
            Content: "Should not reach here",
            TokensUsed: 0,
            Provider: "Test",
            Model: "test-v1",
            ToolCalls: null,
            IsComplete: true);
    });
```
This per-round timeout test will sleep until the orchestrator's real PerRoundTimeoutSeconds (30s) elapses (the provider delays 60s but is cancelled after 30s). That makes the test suite significantly slower and can cause CI timeouts. Consider simulating the timeout without real waiting (e.g., have the mocked provider immediately throw an OperationCanceledException while the external ct is not cancelled, which hits the orchestrator's per-round timeout catch path), or refactor the orchestrator to allow injecting shorter timeout values for tests.
Suggested replacement for the mocked setup quoted above:

```csharp
// Simulate a per-round timeout/cancellation from the provider without real waiting.
var mock = new Mock<ILlmProvider>();
mock.Setup(p => p.CompleteWithToolsAsync(
    It.IsAny<ChatCompletionRequest>(),
    It.IsAny<IReadOnlyList<TaskdeckToolSchema>>(),
    It.IsAny<IReadOnlyList<ToolCallResult>?>(),
    It.IsAny<CancellationToken>()))
    .ThrowsAsync(new OperationCanceledException(new CancellationToken(canceled: false)));
```
```csharp
[Theory]
[InlineData("create a card\nand some other text")]
[InlineData("create\na\ncard")]
public void Classify_NewlinesInInput_DoesNotThrow(string input)
{
    // Verify that newlines in input do not cause exceptions.
    // The classifier may or may not detect the intent depending on
    // whether the regex matches across line boundaries, but it must
    // never crash.
    var act = () => LlmIntentClassifier.Classify(input);
    act.Should().NotThrow("newlines in input must not cause exceptions");
```
This test currently has no explicit assertion (it only passes unless Classify throws). To make the intent clear and avoid accidental no-op coverage, add an explicit assertion such as act.Should().NotThrow() and/or assert expected classification for these newline cases.
…ge, slow timeout
- PromptInjection test now verifies classification results, not just no-crash
- PerRoundTimeout test uses immediate OperationCanceledException instead of waiting 30s for a real timeout (same code path exercised, 30s faster)
- Add loop detection tests: identical consecutive tool calls abort, but retries after errors are allowed (core orchestrator feature was untested)
- Add ToolExecutorRegistry tests for null tool name and duplicate tool names (mentioned in docstring but were missing)
Second-Pass Adversarial Review

Issues Found and Fixed (commit b20d2fe)

1. False-positive test:
2. 30-second test:
3. Missing coverage: Loop detection (FIXED -- 2 new tests)
4. Missing coverage: ToolExecutorRegistry null/duplicate (FIXED -- 2 new tests)

Issues Reviewed and Accepted (no fix needed)

5. CI failure: API Integration -- Pre-existing
6. Gemini/Copilot bot comments -- Both bots flagged the same issues I fixed above (newline test, timeout test, registry docstring). The self-review fix (d96f9a0) addressed the newline and truncation issues but left the timeout and loop detection gaps open.

Remaining known gaps (not blocking)

Test count: 107 (was 101, net +6 new tests)
Summary
Closes #709
Test plan