Add LlmIntentClassifier and chat-to-proposal integration tests #580

Merged
Chris0Jeky merged 4 commits into main from test/577-intent-classifier-tests
Mar 29, 2026

@Chris0Jeky
Owner

Summary

  • Fixes #577 (Add LlmIntentClassifier and chat-to-proposal integration tests)
  • Add edge case tests for LlmIntentClassifier: null input (documents NullReferenceException), very long strings, whitespace-only input, special characters, and pattern matching within strings containing special characters/newlines
  • Add chat-to-proposal flow integration tests in ChatServiceTests: structured syntax classifier hit with parser success, natural language classifier miss with no planner call, explicit RequestProposal with parser failure (graceful error), and actionable classification with parser failure (hint shown)

Test plan

  • All new tests pass with current codebase (1609 total, 0 failures)
  • LlmIntentClassifier tests cover all current patterns (existing) plus edge cases (new)
  • Known gap cases documented as tests asserting current (limited) behavior (existing)
  • ChatService proposal flow tested end-to-end with mocks (new + existing)
  • Edge cases covered: null, empty, whitespace, special chars, very long strings

@Chris0Jeky
Owner Author

Self-Review Findings

Resilience to classifier improvements

  • All tests assert current behavior (not desired behavior). Tests documenting known gaps assert isActionable.Should().BeFalse() with explanatory because messages.
  • If the classifier is improved to detect natural language, the existing "known gap" tests in the pre-existing region will need updating, but the new edge case tests (null, long strings, special chars) will remain stable.

Mock accuracy

  • ChatService tests mock ILlmProvider.CompleteAsync to return specific IsActionable values, simulating the real flow in which the provider's IsActionable flag drives proposal creation.
  • The planner mock correctly returns Result.Failure with ErrorCodes.ValidationError to simulate parse failures, matching real behavior.

Test quality

  • No flaky patterns detected: all tests are deterministic with no time-dependent assertions or external dependencies.
  • Classify_NullInput_ThrowsNullReferenceException documents a real gap (no null guard) without being prescriptive about fixing it.
  • Test names clearly describe the scenario and expected outcome.

Overlap with existing tests

  • SendMessageAsync_StructuredSyntax_ClassifierHit_ParserSuccess_ProposalCreated overlaps somewhat with the existing SendMessageAsync_ShouldAutoCreateProposal_WhenActionableIntentDetected_WithoutExplicitRequestProposal, but adds an explicit Verify on the exact message being passed to the planner and has a clearer name for documenting the pipeline.
  • SendMessageAsync_ActionableClassification_ParserFails_ShowsParseHint overlaps with SendMessageAsync_ShouldReturnStatusWithParseHint_WhenActionableButPlannerFails; both test the same code path. The new test frames the scenario in the context of #577 but is duplicative. Not blocking, since the test names are distinct and provide different documentation angles.

No issues found requiring changes

The diff is clean, all 1609 tests pass, and no tests assert aspirational behavior.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds integration tests for the Chat-to-Proposal flow, covering successful parsing, classifier misses, and error handling for invalid instructions. It also introduces edge-case tests for the LLM intent classifier, addressing long strings, special characters, and null inputs. Review feedback suggests enhancing the classifier's robustness by handling null inputs gracefully and improving test specificity by verifying exact call arguments in mock setups.

Comment on lines +147 to +155
    [Fact]
    public void Classify_NullInput_ThrowsNullReferenceException()
    {
        // The classifier calls message.ToLowerInvariant() without a null guard.
        // This documents that null input is not handled gracefully.
        var act = () => LlmIntentClassifier.Classify(null!);

        act.Should().Throw<NullReferenceException>();
    }

high

While it's good to document current behavior, a public static method like Classify should ideally be more robust and not throw a NullReferenceException on null input. It would be better to handle null (and whitespace) gracefully by returning (false, null). I recommend updating LlmIntentClassifier.Classify to handle this and changing this test to assert the graceful handling instead of the exception. This prevents potential unhandled exceptions in the application.

    [Fact]
    public void Classify_NullInput_ReturnsNotActionable()
    {
        // A null guard should be in place for public methods.
        var (isActionable, actionIntent) = LlmIntentClassifier.Classify(null!);

        isActionable.Should().BeFalse();
        actionIntent.Should().BeNull();
    }

Comment on lines +890 to +894
            p => p.ParseInstructionAsync(
                It.IsAny<string>(), userId, boardId,
                It.IsAny<CancellationToken>(), ProposalSourceType.Chat,
                session.Id.ToString(), It.IsAny<string?>()),
            Times.Once);

medium

For a more robust test, it's better to verify that ParseInstructionAsync was called with the exact message content. Using It.IsAny<string>() for the instruction makes the test less specific and could potentially mask issues if the wrong content is passed to the planner.

            p => p.ParseInstructionAsync(
                "please create some tasks for the deployment checklist", userId, boardId,
                It.IsAny<CancellationToken>(), ProposalSourceType.Chat,
                session.Id.ToString(), It.IsAny<string?>()),
            Times.Once);
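
The difference between the loose and the exact verification can be sketched without Moq. This is a minimal, dependency-free illustration, not the project's test code: `IPlannerSketch`, `RecordingPlanner`, and `VerifySketch` are hypothetical names, and the single-argument `ParseInstruction` is an assumed simplification of the real `ParseInstructionAsync` signature.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down planner interface (assumption: the real
// ParseInstructionAsync takes more parameters and is asynchronous).
public interface IPlannerSketch
{
    bool ParseInstruction(string instruction);
}

// Hand-rolled recording fake standing in for the Moq planner mock.
public sealed class RecordingPlanner : IPlannerSketch
{
    public List<string> Instructions { get; } = new List<string>();

    public bool ParseInstruction(string instruction)
    {
        Instructions.Add(instruction);
        return true;
    }
}

public static class VerifySketch
{
    // Loose check, comparable to verifying with It.IsAny<string>():
    // passes for exactly one call, whatever text was forwarded.
    public static bool CalledOnce(RecordingPlanner planner) =>
        planner.Instructions.Count == 1;

    // Strict check, comparable to verifying with the literal message:
    // passes only if exactly one recorded call carried the expected text.
    public static bool CalledOnceWith(RecordingPlanner planner, string expected) =>
        planner.Instructions.Count(i => i == expected) == 1;
}
```

If the service forwarded the wrong text, `CalledOnce` would still pass while `CalledOnceWith` would fail, which is the kind of masking the comment describes. The trade-off is that the strict form breaks whenever the forwarded message is deliberately transformed.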


        result.IsSuccess.Should().BeTrue();
        result.Value.MessageType.Should().Be("status");
        result.Value.Content.Should().Contain("detected a task request but could not parse it");

medium

To make this test more robust, consider adding a verification step to ensure ParseInstructionAsync was called on the planner mock with the expected input. This explicitly confirms that the flow reached the planner as intended before failing.

        result.Value.Content.Should().Contain("detected a task request but could not parse it");

        _plannerMock.Verify(
            p => p.ParseInstructionAsync(
                "create card for testing without quotes", userId, boardId,
                It.IsAny<CancellationToken>(), ProposalSourceType.Chat,
                session.Id.ToString(), It.IsAny<string?>()),
            Times.Once);

@Chris0Jeky
Owner Author

Adversarial Review — PR #580

Critical

None found.

Major

1. Two new tests are near-exact duplicates of existing tests (false coverage inflation)

  • SendMessageAsync_StructuredSyntax_ClassifierHit_ParserSuccess_ProposalCreated duplicates SendMessageAsync_ShouldAutoCreateProposal_WhenActionableIntentDetected_WithoutExplicitRequestProposal (lines 182-228). Both test the same code path: LLM returns IsActionable=true, planner succeeds, result is proposal-reference. The only difference is the new test adds a Verify call on the exact message string — but the existing test already covers the behavior. The added Verify is also fragile: it asserts the exact user message string is forwarded to the planner, which will break if the message routing logic ever transforms the input.

  • SendMessageAsync_ActionableClassification_ParserFails_ShowsParseHint duplicates SendMessageAsync_ShouldReturnStatusWithParseHint_WhenActionableButPlannerFails (lines 279-307). Both test the identical code path: LLM returns IsActionable=true, planner fails, result contains "detected a task request but could not parse it". The self-review acknowledges this overlap but dismisses it as "not blocking." In a test suite of 1600+ tests, duplicate tests increase maintenance burden for zero additional safety.

Recommendation: Remove these two duplicate tests, or consolidate them with the originals. Adding a Verify call to the existing test would capture the only new assertion without the duplication.

2. SendMessageAsync_NaturalLanguage_ClassifierMiss_NoPlannerCall is also largely duplicative

This test covers the same path as SendMessageAsync_NaturalLanguage_WithoutRequestProposal_NoProposalAttempt (lines 813-843 in the existing NLP gap region). Both set IsActionable=false, no RequestProposal, and verify ParseInstructionAsync is never called. The existing test even uses the same Times.Never verification. The new test uses a different message string but tests no new code path.

Minor

3. Classify_NullInput_ThrowsNullReferenceException asserts on an implementation detail (NRE), not a contract

This test documents that null throws NullReferenceException. If someone later adds a null guard (returning (false, null) or throwing ArgumentNullException), this test will break. A more resilient approach would be to assert that the method throws any exception on null, or better, to assert the desired behavior with a comment noting the current gap. As written, it locks in a bug as the expected contract.
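
One way to express a contract-level check is sketched below. It is a third option beyond the two suggested above, and an assumption rather than the project's test: it asserts only the invariant that null input is never classified as actionable, so it holds both for an unguarded classifier that throws and for a guarded one that returns `(false, null)`. `NullInputContractSketch` is a hypothetical name, and the classifier is passed in as a delegate to keep the example self-contained.

```csharp
#nullable enable
using System;

public static class NullInputContractSketch
{
    // Returns true when null input is never reported as actionable:
    // either the call throws (any exception type) or it returns a
    // non-actionable result with no intent.
    public static bool NullIsNeverActionable(
        Func<string?, (bool IsActionable, string? ActionIntent)> classify)
    {
        try
        {
            var (isActionable, actionIntent) = classify(null);
            return !isActionable && actionIntent is null;
        }
        catch (Exception)
        {
            // Throwing on null is tolerated; the exception type is not part of the contract.
            return true;
        }
    }
}
```

A check like this would not need to change when a null guard is added, at the cost of not documenting which of the two behaviors is currently in effect.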

4. No test for empty-string input to Classify

The existing Classify_NonActionable_ShouldReturnFalse already covers "" (line 133), so the edge case region does not need it — but the edge case region's XML summary claims to cover "input extremes" without acknowledging that empty string is already tested elsewhere. This could mislead future readers into thinking it was overlooked.

5. SendMessageAsync_ExplicitRequestProposal_NaturalLanguage_ParserFailsGracefully is largely duplicative of the existing SendMessageAsync_NaturalLanguage_WithRequestProposal_ShowsParseError

Both tests: set RequestProposal: true, set IsActionable: false, mock planner to return Failure, and assert MessageType == "status" with content containing "Could not create the requested proposal". Same code path (lines 252-255 in ChatService.cs). The only variation is the user message string.

Nits

6. Inconsistent edge-case completeness for Classify

The edge case region tests very long strings (50K chars) but misses a boundary value: a string of exactly MaxPromptLength (4000 chars). While the classifier itself has no length limit, the ChatService enforces one upstream, so this is cosmetic.

7. Test region naming ambiguity

The new #region Chat-to-Proposal Flow — Classifier → Parser Integration (#577) is placed before the existing #region NLP Gap Tests — Documents #570. Given the significant overlap between the two regions (both test classifier-to-planner flow), a reader may not immediately understand why they are separate.

8. Classify_VeryLongStringContainingPattern_StillMatches is useful but has a minor naming issue

The name says "StillMatches" but doesn't specify what intent it matches. Classify_VeryLongStringContainingCreateCard_MatchesCardCreate would be clearer.

Overall Assessment

Pass with fixes. The LlmIntentClassifier edge-case tests (null, long strings, whitespace, special chars, newlines) are genuinely valuable and well-constructed. However, 3 of the 4 ChatService flow tests duplicate existing tests covering the same code paths. This adds maintenance cost without improving coverage. The duplicates should be removed or the overlapping existing tests should be enhanced instead.

Summary of recommended changes:

  1. Remove SendMessageAsync_StructuredSyntax_ClassifierHit_ParserSuccess_ProposalCreated — add the Verify to the existing test if desired.
  2. Remove SendMessageAsync_ActionableClassification_ParserFails_ShowsParseHint — identical path to existing test.
  3. Remove SendMessageAsync_NaturalLanguage_ClassifierMiss_NoPlannerCall — identical path to existing NLP gap test.
  4. Keep SendMessageAsync_ExplicitRequestProposal_NaturalLanguage_ParserFailsGracefully only if the framing under #577 adds distinct documentary value compared with the existing test for #570 (Chat-to-proposal NLP gap: natural language fails to produce proposals). Otherwise remove.
  5. Consider changing the null-input test to assert act.Should().Throw<Exception>() instead of the specific NullReferenceException type.

- Remove 4 ChatServiceTests that duplicated existing tests covering
  identical code paths (structured-syntax success, classifier miss,
  explicit RequestProposal failure, actionable-but-parser-fails)
- Change Classify_NullInput test to assert base Exception instead of
  NullReferenceException so it survives addition of a null guard
@Chris0Jeky
Owner Author

Follow-up: Fixes Applied

Pushed commit 4c5fdb5f addressing the Major findings from the adversarial review.

Changes made:

  1. Removed 4 duplicate ChatService flow tests (176 lines deleted):

    • SendMessageAsync_StructuredSyntax_ClassifierHit_ParserSuccess_ProposalCreated — duplicated SendMessageAsync_ShouldAutoCreateProposal_WhenActionableIntentDetected_WithoutExplicitRequestProposal
    • SendMessageAsync_NaturalLanguage_ClassifierMiss_NoPlannerCall — duplicated SendMessageAsync_NaturalLanguage_WithoutRequestProposal_NoProposalAttempt
    • SendMessageAsync_ExplicitRequestProposal_NaturalLanguage_ParserFailsGracefully — duplicated SendMessageAsync_NaturalLanguage_WithRequestProposal_ShowsParseError
    • SendMessageAsync_ActionableClassification_ParserFails_ShowsParseHint — duplicated SendMessageAsync_ShouldReturnStatusWithParseHint_WhenActionableButPlannerFails
  2. Hardened null-input test: Changed Classify_NullInput_ThrowsNullReferenceException to Classify_NullInput_Throws, asserting base Exception instead of NullReferenceException. This way the test survives if someone later adds an ArgumentNullException null guard.

Test results:

All 1605 tests pass (0 failures). Test count reduced from 1609 by removing the 4 duplicates — no coverage lost since the same code paths are exercised by the existing tests.

Remaining items (Minor/Nit, not blocking):

  • Edge case region could note that empty-string is already covered in the Non-Actionable region
  • Classify_VeryLongStringContainingPattern_StillMatches name could be more specific about the matched intent

@Chris0Jeky
Owner Author

Addressed Gemini review feedback: added a string.IsNullOrWhiteSpace guard at the top of LlmIntentClassifier.Classify() so null/whitespace input returns (false, null) instead of throwing. Updated the corresponding test to assert the new graceful behavior. All 1,605 backend tests pass.
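
The guard described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the merged code: `IntentClassifierSketch` is a hypothetical stand-in for `LlmIntentClassifier`, and the `"create card"` pattern and `"card.create"` intent label are illustrative only, since the real pattern list is not shown in this PR.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for LlmIntentClassifier.
public static class IntentClassifierSketch
{
    public static (bool IsActionable, string? ActionIntent) Classify(string? message)
    {
        // Guard first: null, empty, and whitespace-only input is never actionable,
        // so the ToLowerInvariant() call below can no longer throw on null.
        if (string.IsNullOrWhiteSpace(message))
        {
            return (false, null);
        }

        var lower = message.ToLowerInvariant();

        // Assumed pattern and intent label, for illustration only.
        if (lower.Contains("create card"))
        {
            return (true, "card.create");
        }

        return (false, null);
    }
}
```

With a guard of this shape, `Classify(null)` and `Classify("   ")` both take the early return, while input containing a recognized pattern is classified as before.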

@Chris0Jeky Chris0Jeky merged commit b948742 into main Mar 29, 2026
18 checks passed
@Chris0Jeky Chris0Jeky deleted the test/577-intent-classifier-tests branch March 29, 2026 22:22
@github-project-automation github-project-automation bot moved this from Pending to Done in Taskdeck Execution Mar 29, 2026
Chris0Jeky added a commit that referenced this pull request Mar 29, 2026
Update two analysis docs (chat-to-proposal gap and manual testing findings) to reflect recent fixes and testing status. Key changes: add Last Updated and status notes; mark Tier 1 improvements shipped (intent classifier regex/stemming/negation fixes, substring ordering bug, PR #579), UX parse hints shipped (PR #582), unit/integration tests shipped (PR #580), and note PR range #578 to #582. In manual testing findings mark OBS-2/OBS-3 resolved (PR #581) and BUG-M5 resolved (PR #578), update resolutions and remove duplicate checklist items. Minor editorial clarifications and test counts added.