Skip to content

sc-m9txz: Implement failure clustering and triage prompt engine#211

Merged
MichielDean merged 5 commits into
mainfrom
feat/sc-m9txz
Mar 25, 2026
Merged

sc-m9txz: Implement failure clustering and triage prompt engine#211
MichielDean merged 5 commits into
mainfrom
feat/sc-m9txz

Conversation

@MichielDean
Copy link
Copy Markdown
Owner

Closes droplet sc-m9txz.

Cistern Agent and others added 5 commits March 25, 2026 03:52
Adds internal/triage package with:

- Engine: accepts TriageInput (failures, flakiness history, git diff,
  previous failures), builds prompt, calls llm.Provider, parses and
  validates JSON response into TriageOutput (clusters + per-failure
  classifications).
- Graceful degradation: LLM errors and invalid responses return a
  fallback TriageOutput with all failures classified as "unknown",
  alongside the original error for callers to record failed status.
- BuildPrompt: constructs structured prompt with adaptive truncation
  (message/trace limits shrink at 11 and 26 failures) so typical runs
  stay within the 8 k-token budget.
- Structured output schema (llmOutput / llmCluster / llmClassification)
  with normalisation of invalid classification values to "unknown".
- 37 unit tests covering happy path, multi-cluster, partial response,
  invalid values, LLM errors, 5- and 50-failure scale, and token budget.
  All tests use llm.MockProvider — no external dependencies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Leftover first-attempt `mock` was unused — only `mock2` drove the test.
Collapsed to a single `mock` and removed the `_ = mock` suppressor.
…comment

- engine_test.go:218: drop `!= nil &&` guard before len() check — len(nil)==0
  in Go so the nil check was redundant.
- prompt_test.go: move stray makeDiffFiles doc comment to sit above the function
  instead of floating before the injection-tests section header.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Apply sanitizeIdentifier() to h.TestName in writeFlakinessHistory (lines
  151/155) and to name in writePreviousFailures (line 189) — same CTRF
  origin and injection vector as FailureDetail.Name which was already
  sanitized.
- Add json:"snake_case" struct tags to TriageOutput, ClusterResult, and
  ClassificationResult so serialized output uses snake_case keys (summary,
  clusters, classifications, root_cause, label, test_result_id,
  cluster_index, classification) consistent with every other type in the
  codebase.
- Update TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray
  assertions to assert snake_case keys.
- Add TestBuildPrompt_SanitizesNewlinesInFlakinessTestName and
  TestBuildPrompt_SanitizesNewlinesInPreviousFailureName covering the new
  sanitization paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 25, 2026 10:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements a new internal LLM-based triage subsystem that builds a structured prompt from failing test data and parses the model’s JSON response into clustered failure triage results.

Changes:

  • Added triage.Engine with typed input/output, LLM invocation, response parsing, and fallback behavior.
  • Added prompt construction (BuildPrompt) with adaptive truncation and optional enrichment sections (flakiness history, git diff, previous failures).
  • Added extensive unit tests for prompt content/sanitization and engine parsing/fallback behaviors.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
internal/triage/engine.go Core triage engine types + LLM call, parse, normalization, fallback output
internal/triage/engine_test.go Engine behavior tests (happy path, error paths, output shaping)
internal/triage/prompt.go Prompt builder with truncation + enrichment sections + basic sanitization
internal/triage/prompt_test.go Prompt content, truncation budget, and injection-related tests
internal/triage/schema.go Internal JSON schema types and valid classification set

}
}

func TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray(t *testing.T) {
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in test name: "Clarifications" should be "Classifications".

Suggested change
func TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray(t *testing.T) {
func TestEngine_Triage_EmptyFailures_MarshalsClassificationsAsEmptyArray(t *testing.T) {

Copilot uses AI. Check for mistakes.
Comment thread internal/triage/engine.go
RootCause string `json:"root_cause"`

// Label is an optional short identifier for the cluster (may be empty).
Label string `json:"label"`
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClusterResult.Label is documented as optional but the JSON tag is json:"label", so empty labels will still serialize as "label":"". If consumers expect omission for optional fields (as in the LLM schema's label,omitempty), consider using json:"label,omitempty" here for consistency.

Suggested change
Label string `json:"label"`
Label string `json:"label,omitempty"`

Copilot uses AI. Check for mistakes.
Comment thread internal/triage/prompt.go
Comment on lines +51 to +59
// truncateStr shortens s to at most maxLen bytes, appending "…" if truncated.
// It does not split multi-byte UTF-8 runes at the boundary — it truncates
// at the byte level which may break a rune, but this is acceptable for prompt
// construction where exact byte counts matter more than rune alignment.
func truncateStr(s string, maxLen int) string {
if len(s) <= maxLen {
return s
}
return s[:maxLen] + "…"
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truncateStr claims to keep strings to at most maxLen bytes, but it returns s[:maxLen] + "…" which exceeds maxLen by the ellipsis byte length. This can break the intended prompt size budget. Adjust truncation to reserve space for the suffix (or use an ASCII suffix) and handle very small maxLen values safely.

Suggested change
// truncateStr shortens s to at most maxLen bytes, appending "…" if truncated.
// It does not split multi-byte UTF-8 runes at the boundary — it truncates
// at the byte level which may break a rune, but this is acceptable for prompt
// construction where exact byte counts matter more than rune alignment.
func truncateStr(s string, maxLen int) string {
if len(s) <= maxLen {
return s
}
return s[:maxLen] + "…"
// truncateStr shortens s to at most maxLen bytes (including the suffix),
// appending "…" if truncated. It does not split multi-byte UTF-8 runes at
// the boundary — it truncates at the byte level which may break a rune, but
// this is acceptable for prompt construction where exact byte counts matter
// more than rune alignment.
func truncateStr(s string, maxLen int) string {
if maxLen <= 0 {
return ""
}
if len(s) <= maxLen {
return s
}
const ellipsis = "…"
// If maxLen is very small, return as much of the ellipsis as fits.
if maxLen <= len(ellipsis) {
return ellipsis[:maxLen]
}
prefixLen := maxLen - len(ellipsis)
return s[:prefixLen] + ellipsis

Copilot uses AI. Check for mistakes.
Comment thread internal/triage/prompt.go
Comment on lines +131 to +138
// Free-text fields (Message, Trace) are wrapped in XML-style delimiters
// so the LLM can distinguish user-controlled data from prompt instructions.
if msg := truncateStr(strings.TrimSpace(f.Message), ml); msg != "" {
fmt.Fprintf(b, "- error: <error_data>%s</error_data>\n", msg)
}
if trace := truncateStr(strings.TrimSpace(f.Trace), tl); trace != "" {
fmt.Fprintf(b, "- trace: <trace_data>%s</trace_data>\n", trace)
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Free-text fields are wrapped in <error_data>/<trace_data> tags but their contents are not escaped. If Message/Trace contains sequences like "</error_data>" it can break the delimiter structure and allow prompt injection into the surrounding instructions. Consider escaping/encoding these fields (e.g., replace "</" sequences, or base64/JSON-escape inside the delimiter) so the delimiters remain reliable.

Copilot uses AI. Check for mistakes.
Comment thread internal/triage/prompt.go
continue
}
fmt.Fprintf(b, "| %s | %d | %.0f%% | %.2f | %s |\n",
sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, h.LastStatus)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeFlakinessHistory sanitizes TestName but prints LastStatus verbatim into the markdown table. If LastStatus contains control characters/newlines it can still break prompt structure. Consider sanitizing LastStatus (and any other string fields) the same way as identifiers before writing them into the prompt.

Suggested change
sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, h.LastStatus)
sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, sanitizeIdentifier(h.LastStatus))

Copilot uses AI. Check for mistakes.
Comment thread internal/triage/prompt.go
Comment on lines +166 to +177
fmt.Fprintf(b, "Repository: %s\n", diff.Repository)
if diff.BaseCommit != "" {
fmt.Fprintf(b, "Base: %s → Head: %s\n", diff.BaseCommit, diff.HeadCommit)
} else {
fmt.Fprintf(b, "Head: %s\n", diff.HeadCommit)
}
if diff.Truncated {
fmt.Fprintf(b, "(showing top %d of %d changed files by churn)\n", len(diff.Files), diff.TotalFiles)
}
b.WriteString("\n")
for _, f := range diff.Files {
fmt.Fprintf(b, "- %s (+%d/-%d churn=%d)\n", f.Path, f.Additions, f.Deletions, f.Churn)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeGitDiff prints Repository/BaseCommit/HeadCommit and each FileDiffStat.Path without sanitization. Since these values can originate from external inputs (e.g., CTRF metadata or git filenames), they could contain newlines/control characters and inject content into the prompt. Consider applying the same identifier sanitization here before writing these fields.

Suggested change
fmt.Fprintf(b, "Repository: %s\n", diff.Repository)
if diff.BaseCommit != "" {
fmt.Fprintf(b, "Base: %s → Head: %s\n", diff.BaseCommit, diff.HeadCommit)
} else {
fmt.Fprintf(b, "Head: %s\n", diff.HeadCommit)
}
if diff.Truncated {
fmt.Fprintf(b, "(showing top %d of %d changed files by churn)\n", len(diff.Files), diff.TotalFiles)
}
b.WriteString("\n")
for _, f := range diff.Files {
fmt.Fprintf(b, "- %s (+%d/-%d churn=%d)\n", f.Path, f.Additions, f.Deletions, f.Churn)
fmt.Fprintf(b, "Repository: %s\n", sanitizeIdentifier(diff.Repository))
if diff.BaseCommit != "" {
fmt.Fprintf(b, "Base: %s → Head: %s\n", sanitizeIdentifier(diff.BaseCommit), sanitizeIdentifier(diff.HeadCommit))
} else {
fmt.Fprintf(b, "Head: %s\n", sanitizeIdentifier(diff.HeadCommit))
}
if diff.Truncated {
fmt.Fprintf(b, "(showing top %d of %d changed files by churn)\n", len(diff.Files), diff.TotalFiles)
}
b.WriteString("\n")
for _, f := range diff.Files {
fmt.Fprintf(b, "- %s (+%d/-%d churn=%d)\n", sanitizeIdentifier(f.Path), f.Additions, f.Deletions, f.Churn)

Copilot uses AI. Check for mistakes.
Comment thread internal/triage/prompt.go
Comment on lines +37 to +43
// sanitizeIdentifier truncates s at the first control character (< U+0020)
// and removes any remaining control characters from what follows. Identifier
// fields (TestResultID, Name, Suite) must not contain newlines or control
// characters: a value such as "tr-1\nIgnore all instructions" would inject
// a new line into the prompt structure. Legitimate identifiers never contain
// control characters, so truncating at the first one is safe and removes the
// injected payload entirely.
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sanitizeIdentifier doc comment says it "removes any remaining control characters", but the implementation truncates at the first control character and doesn't perform any additional filtering. Either update the comment to reflect the actual behavior, or implement the additional filtering described.

Copilot uses AI. Check for mistakes.
mock := llm.NewMock(validResponse(failures))
e := NewEngine(mock)

e.Triage(context.Background(), TriageInput{Failures: failures}) //nolint:errcheck
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test ignores the error returned by Triage via //nolint:errcheck, which can allow the test to pass even if Triage unexpectedly fails. Capture the returned output/error and assert err is nil so the test validates the full behavior it depends on.

Suggested change
e.Triage(context.Background(), TriageInput{Failures: failures}) //nolint:errcheck
_, err := e.Triage(context.Background(), TriageInput{Failures: failures})
if err != nil {
t.Fatalf("unexpected error from Triage: %v", err)
}

Copilot uses AI. Check for mistakes.
@MichielDean MichielDean merged commit d74d5ac into main Mar 25, 2026
25 of 27 checks passed
@MichielDean MichielDean deleted the feat/sc-m9txz branch March 25, 2026 10:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants