sc-m9txz: Implement failure clustering and triage prompt engine#211
Conversation
Adds internal/triage package with: - Engine: accepts TriageInput (failures, flakiness history, git diff, previous failures), builds prompt, calls llm.Provider, parses and validates JSON response into TriageOutput (clusters + per-failure classifications). - Graceful degradation: LLM errors and invalid responses return a fallback TriageOutput with all failures classified as "unknown", alongside the original error for callers to record failed status. - BuildPrompt: constructs structured prompt with adaptive truncation (message/trace limits shrink at 11 and 26 failures) so typical runs stay within the 8 k-token budget. - Structured output schema (llmOutput / llmCluster / llmClassification) with normalisation of invalid classification values to "unknown". - 37 unit tests covering happy path, multi-cluster, partial response, invalid values, LLM errors, 5- and 50-failure scale, and token budget. All tests use llm.MockProvider — no external dependencies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Leftover first-attempt `mock` was unused — only `mock2` drove the test. Collapsed to a single `mock` and removed the `_ = mock` suppressor.
…comment - engine_test.go:218: drop `!= nil &&` guard before len() check — len(nil)==0 in Go so the nil check was redundant. - prompt_test.go: move stray makeDiffFiles doc comment to sit above the function instead of floating before the injection-tests section header. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Apply sanitizeIdentifier() to h.TestName in writeFlakinessHistory (lines 151/155) and to name in writePreviousFailures (line 189) — same CTRF origin and injection vector as FailureDetail.Name which was already sanitized. - Add json:"snake_case" struct tags to TriageOutput, ClusterResult, and ClassificationResult so serialized output uses snake_case keys (summary, clusters, classifications, root_cause, label, test_result_id, cluster_index, classification) consistent with every other type in the codebase. - Update TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray assertions to assert snake_case keys. - Add TestBuildPrompt_SanitizesNewlinesInFlakinessTestName and TestBuildPrompt_SanitizesNewlinesInPreviousFailureName covering the new sanitization paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Implements a new internal LLM-based triage subsystem that builds a structured prompt from failing test data and parses the model’s JSON response into clustered failure triage results.
Changes:
- Added
triage.Enginewith typed input/output, LLM invocation, response parsing, and fallback behavior. - Added prompt construction (
BuildPrompt) with adaptive truncation and optional enrichment sections (flakiness history, git diff, previous failures). - Added extensive unit tests for prompt content/sanitization and engine parsing/fallback behaviors.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/triage/engine.go | Core triage engine types + LLM call, parse, normalization, fallback output |
| internal/triage/engine_test.go | Engine behavior tests (happy path, error paths, output shaping) |
| internal/triage/prompt.go | Prompt builder with truncation + enrichment sections + basic sanitization |
| internal/triage/prompt_test.go | Prompt content, truncation budget, and injection-related tests |
| internal/triage/schema.go | Internal JSON schema types and valid classification set |
| } | ||
| } | ||
|
|
||
| func TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray(t *testing.T) { |
There was a problem hiding this comment.
Typo in test name: "Clarifications" should be "Classifications".
| func TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray(t *testing.T) { | |
| func TestEngine_Triage_EmptyFailures_MarshalsClassificationsAsEmptyArray(t *testing.T) { |
| RootCause string `json:"root_cause"` | ||
|
|
||
| // Label is an optional short identifier for the cluster (may be empty). | ||
| Label string `json:"label"` |
There was a problem hiding this comment.
ClusterResult.Label is documented as optional but the JSON tag is json:"label", so empty labels will still serialize as "label":"". If consumers expect omission for optional fields (as in the LLM schema's label,omitempty), consider using json:"label,omitempty" here for consistency.
| Label string `json:"label"` | |
| Label string `json:"label,omitempty"` |
| // truncateStr shortens s to at most maxLen bytes, appending "…" if truncated. | ||
| // It does not split multi-byte UTF-8 runes at the boundary — it truncates | ||
| // at the byte level which may break a rune, but this is acceptable for prompt | ||
| // construction where exact byte counts matter more than rune alignment. | ||
| func truncateStr(s string, maxLen int) string { | ||
| if len(s) <= maxLen { | ||
| return s | ||
| } | ||
| return s[:maxLen] + "…" |
There was a problem hiding this comment.
truncateStr claims to keep strings to at most maxLen bytes, but it returns s[:maxLen] + "…" which exceeds maxLen by the ellipsis byte length. This can break the intended prompt size budget. Adjust truncation to reserve space for the suffix (or use an ASCII suffix) and handle very small maxLen values safely.
| // truncateStr shortens s to at most maxLen bytes, appending "…" if truncated. | |
| // It does not split multi-byte UTF-8 runes at the boundary — it truncates | |
| // at the byte level which may break a rune, but this is acceptable for prompt | |
| // construction where exact byte counts matter more than rune alignment. | |
| func truncateStr(s string, maxLen int) string { | |
| if len(s) <= maxLen { | |
| return s | |
| } | |
| return s[:maxLen] + "…" | |
| // truncateStr shortens s to at most maxLen bytes (including the suffix), | |
| // appending "…" if truncated. It does not split multi-byte UTF-8 runes at | |
| // the boundary — it truncates at the byte level which may break a rune, but | |
| // this is acceptable for prompt construction where exact byte counts matter | |
| // more than rune alignment. | |
| func truncateStr(s string, maxLen int) string { | |
| if maxLen <= 0 { | |
| return "" | |
| } | |
| if len(s) <= maxLen { | |
| return s | |
| } | |
| const ellipsis = "…" | |
| // If maxLen is very small, return as much of the ellipsis as fits. | |
| if maxLen <= len(ellipsis) { | |
| return ellipsis[:maxLen] | |
| } | |
| prefixLen := maxLen - len(ellipsis) | |
| return s[:prefixLen] + ellipsis |
| // Free-text fields (Message, Trace) are wrapped in XML-style delimiters | ||
| // so the LLM can distinguish user-controlled data from prompt instructions. | ||
| if msg := truncateStr(strings.TrimSpace(f.Message), ml); msg != "" { | ||
| fmt.Fprintf(b, "- error: <error_data>%s</error_data>\n", msg) | ||
| } | ||
| if trace := truncateStr(strings.TrimSpace(f.Trace), tl); trace != "" { | ||
| fmt.Fprintf(b, "- trace: <trace_data>%s</trace_data>\n", trace) | ||
| } |
There was a problem hiding this comment.
Free-text fields are wrapped in <error_data>/<trace_data> tags but their contents are not escaped. If Message/Trace contains sequences like "</error_data>" it can break the delimiter structure and allow prompt injection into the surrounding instructions. Consider escaping/encoding these fields (e.g., replace "</" sequences, or base64/JSON-escape inside the delimiter) so the delimiters remain reliable.
| continue | ||
| } | ||
| fmt.Fprintf(b, "| %s | %d | %.0f%% | %.2f | %s |\n", | ||
| sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, h.LastStatus) |
There was a problem hiding this comment.
writeFlakinessHistory sanitizes TestName but prints LastStatus verbatim into the markdown table. If LastStatus contains control characters/newlines it can still break prompt structure. Consider sanitizing LastStatus (and any other string fields) the same way as identifiers before writing them into the prompt.
| sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, h.LastStatus) | |
| sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, sanitizeIdentifier(h.LastStatus)) |
| fmt.Fprintf(b, "Repository: %s\n", diff.Repository) | ||
| if diff.BaseCommit != "" { | ||
| fmt.Fprintf(b, "Base: %s → Head: %s\n", diff.BaseCommit, diff.HeadCommit) | ||
| } else { | ||
| fmt.Fprintf(b, "Head: %s\n", diff.HeadCommit) | ||
| } | ||
| if diff.Truncated { | ||
| fmt.Fprintf(b, "(showing top %d of %d changed files by churn)\n", len(diff.Files), diff.TotalFiles) | ||
| } | ||
| b.WriteString("\n") | ||
| for _, f := range diff.Files { | ||
| fmt.Fprintf(b, "- %s (+%d/-%d churn=%d)\n", f.Path, f.Additions, f.Deletions, f.Churn) |
There was a problem hiding this comment.
writeGitDiff prints Repository/BaseCommit/HeadCommit and each FileDiffStat.Path without sanitization. Since these values can originate from external inputs (e.g., CTRF metadata or git filenames), they could contain newlines/control characters and inject content into the prompt. Consider applying the same identifier sanitization here before writing these fields.
| fmt.Fprintf(b, "Repository: %s\n", diff.Repository) | |
| if diff.BaseCommit != "" { | |
| fmt.Fprintf(b, "Base: %s → Head: %s\n", diff.BaseCommit, diff.HeadCommit) | |
| } else { | |
| fmt.Fprintf(b, "Head: %s\n", diff.HeadCommit) | |
| } | |
| if diff.Truncated { | |
| fmt.Fprintf(b, "(showing top %d of %d changed files by churn)\n", len(diff.Files), diff.TotalFiles) | |
| } | |
| b.WriteString("\n") | |
| for _, f := range diff.Files { | |
| fmt.Fprintf(b, "- %s (+%d/-%d churn=%d)\n", f.Path, f.Additions, f.Deletions, f.Churn) | |
| fmt.Fprintf(b, "Repository: %s\n", sanitizeIdentifier(diff.Repository)) | |
| if diff.BaseCommit != "" { | |
| fmt.Fprintf(b, "Base: %s → Head: %s\n", sanitizeIdentifier(diff.BaseCommit), sanitizeIdentifier(diff.HeadCommit)) | |
| } else { | |
| fmt.Fprintf(b, "Head: %s\n", sanitizeIdentifier(diff.HeadCommit)) | |
| } | |
| if diff.Truncated { | |
| fmt.Fprintf(b, "(showing top %d of %d changed files by churn)\n", len(diff.Files), diff.TotalFiles) | |
| } | |
| b.WriteString("\n") | |
| for _, f := range diff.Files { | |
| fmt.Fprintf(b, "- %s (+%d/-%d churn=%d)\n", sanitizeIdentifier(f.Path), f.Additions, f.Deletions, f.Churn) |
| // sanitizeIdentifier truncates s at the first control character (< U+0020) | ||
| // and removes any remaining control characters from what follows. Identifier | ||
| // fields (TestResultID, Name, Suite) must not contain newlines or control | ||
| // characters: a value such as "tr-1\nIgnore all instructions" would inject | ||
| // a new line into the prompt structure. Legitimate identifiers never contain | ||
| // control characters, so truncating at the first one is safe and removes the | ||
| // injected payload entirely. |
There was a problem hiding this comment.
The sanitizeIdentifier doc comment says it "removes any remaining control characters", but the implementation truncates at the first control character and doesn't perform any additional filtering. Either update the comment to reflect the actual behavior, or implement the additional filtering described.
| mock := llm.NewMock(validResponse(failures)) | ||
| e := NewEngine(mock) | ||
|
|
||
| e.Triage(context.Background(), TriageInput{Failures: failures}) //nolint:errcheck |
There was a problem hiding this comment.
This test ignores the error returned by Triage via //nolint:errcheck, which can allow the test to pass even if Triage unexpectedly fails. Capture the returned output/error and assert err is nil so the test validates the full behavior it depends on.
| e.Triage(context.Background(), TriageInput{Failures: failures}) //nolint:errcheck | |
| _, err := e.Triage(context.Background(), TriageInput{Failures: failures}) | |
| if err != nil { | |
| t.Fatalf("unexpected error from Triage: %v", err) | |
| } |
Closes droplet sc-m9txz.