sc-m9txz: Implement failure clustering and triage prompt engine by MichielDean · Pull Request #211 · MichielDean/ScaledTest

MichielDean · 2026-03-25T10:25:40Z

Closes droplet sc-m9txz.

Adds internal/triage package with: - Engine: accepts TriageInput (failures, flakiness history, git diff, previous failures), builds prompt, calls llm.Provider, parses and validates JSON response into TriageOutput (clusters + per-failure classifications). - Graceful degradation: LLM errors and invalid responses return a fallback TriageOutput with all failures classified as "unknown", alongside the original error for callers to record failed status. - BuildPrompt: constructs structured prompt with adaptive truncation (message/trace limits shrink at 11 and 26 failures) so typical runs stay within the 8 k-token budget. - Structured output schema (llmOutput / llmCluster / llmClassification) with normalisation of invalid classification values to "unknown". - 37 unit tests covering happy path, multi-cluster, partial response, invalid values, LLM errors, 5- and 50-failure scale, and token budget. All tests use llm.MockProvider — no external dependencies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Leftover first-attempt `mock` was unused — only `mock2` drove the test. Collapsed to a single `mock` and removed the `_ = mock` suppressor.

…comment - engine_test.go:218: drop `!= nil &&` guard before len() check — len(nil)==0 in Go so the nil check was redundant. - prompt_test.go: move stray makeDiffFiles doc comment to sit above the function instead of floating before the injection-tests section header. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Apply sanitizeIdentifier() to h.TestName in writeFlakinessHistory (lines 151/155) and to name in writePreviousFailures (line 189) — same CTRF origin and injection vector as FailureDetail.Name which was already sanitized. - Add json:"snake_case" struct tags to TriageOutput, ClusterResult, and ClassificationResult so serialized output uses snake_case keys (summary, clusters, classifications, root_cause, label, test_result_id, cluster_index, classification) consistent with every other type in the codebase. - Update TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray assertions to assert snake_case keys. - Add TestBuildPrompt_SanitizesNewlinesInFlakinessTestName and TestBuildPrompt_SanitizesNewlinesInPreviousFailureName covering the new sanitization paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Implements a new internal LLM-based triage subsystem that builds a structured prompt from failing test data and parses the model’s JSON response into clustered failure triage results.

Changes:

Added triage.Engine with typed input/output, LLM invocation, response parsing, and fallback behavior.
Added prompt construction (BuildPrompt) with adaptive truncation and optional enrichment sections (flakiness history, git diff, previous failures).
Added extensive unit tests for prompt content/sanitization and engine parsing/fallback behaviors.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
internal/triage/engine.go	Core triage engine types + LLM call, parse, normalization, fallback output
internal/triage/engine_test.go	Engine behavior tests (happy path, error paths, output shaping)
internal/triage/prompt.go	Prompt builder with truncation + enrichment sections + basic sanitization
internal/triage/prompt_test.go	Prompt content, truncation budget, and injection-related tests
internal/triage/schema.go	Internal JSON schema types and valid classification set

Copilot · 2026-03-25T10:30:35Z

+	}
+}
+
+func TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray(t *testing.T) {


Typo in test name: "Clarifications" should be "Classifications".

Suggested change

func TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray(t *testing.T) {

func TestEngine_Triage_EmptyFailures_MarshalsClassificationsAsEmptyArray(t *testing.T) {

Copilot · 2026-03-25T10:30:35Z

+	RootCause string `json:"root_cause"`
+
+	// Label is an optional short identifier for the cluster (may be empty).
+	Label string `json:"label"`


ClusterResult.Label is documented as optional but the JSON tag is json:"label", so empty labels will still serialize as "label":"". If consumers expect omission for optional fields (as in the LLM schema's label,omitempty), consider using json:"label,omitempty" here for consistency.

Suggested change

Label string `json:"label"`

Label string `json:"label,omitempty"`

Copilot · 2026-03-25T10:30:35Z

+// truncateStr shortens s to at most maxLen bytes, appending "…" if truncated.
+// It does not split multi-byte UTF-8 runes at the boundary — it truncates
+// at the byte level which may break a rune, but this is acceptable for prompt
+// construction where exact byte counts matter more than rune alignment.
+func truncateStr(s string, maxLen int) string {
+	if len(s) <= maxLen {
+		return s
+	}
+	return s[:maxLen] + "…"


truncateStr claims to keep strings to at most maxLen bytes, but it returns s[:maxLen] + "…" which exceeds maxLen by the ellipsis byte length. This can break the intended prompt size budget. Adjust truncation to reserve space for the suffix (or use an ASCII suffix) and handle very small maxLen values safely.

Suggested change

// truncateStr shortens s to at most maxLen bytes, appending "…" if truncated.

// It does not split multi-byte UTF-8 runes at the boundary — it truncates

// at the byte level which may break a rune, but this is acceptable for prompt

// construction where exact byte counts matter more than rune alignment.

func truncateStr(s string, maxLen int) string {

if len(s) <= maxLen {

return s

}

return s[:maxLen] + "…"

// truncateStr shortens s to at most maxLen bytes (including the suffix),

// appending "…" if truncated. It does not split multi-byte UTF-8 runes at

// the boundary — it truncates at the byte level which may break a rune, but

// this is acceptable for prompt construction where exact byte counts matter

// more than rune alignment.

func truncateStr(s string, maxLen int) string {

if maxLen <= 0 {

return ""

}

if len(s) <= maxLen {

return s

}

const ellipsis = "…"

// If maxLen is very small, return as much of the ellipsis as fits.

if maxLen <= len(ellipsis) {

return ellipsis[:maxLen]

}

prefixLen := maxLen - len(ellipsis)

return s[:prefixLen] + ellipsis

Copilot · 2026-03-25T10:30:36Z

+		// Free-text fields (Message, Trace) are wrapped in XML-style delimiters
+		// so the LLM can distinguish user-controlled data from prompt instructions.
+		if msg := truncateStr(strings.TrimSpace(f.Message), ml); msg != "" {
+			fmt.Fprintf(b, "- error: <error_data>%s</error_data>\n", msg)
+		}
+		if trace := truncateStr(strings.TrimSpace(f.Trace), tl); trace != "" {
+			fmt.Fprintf(b, "- trace: <trace_data>%s</trace_data>\n", trace)
+		}


Free-text fields are wrapped in <error_data>/<trace_data> tags but their contents are not escaped. If Message/Trace contains sequences like "</error_data>" it can break the delimiter structure and allow prompt injection into the surrounding instructions. Consider escaping/encoding these fields (e.g., replace "</" sequences, or base64/JSON-escape inside the delimiter) so the delimiters remain reliable.

Copilot · 2026-03-25T10:30:36Z

+			continue
+		}
+		fmt.Fprintf(b, "| %s | %d | %.0f%% | %.2f | %s |\n",
+			sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, h.LastStatus)


writeFlakinessHistory sanitizes TestName but prints LastStatus verbatim into the markdown table. If LastStatus contains control characters/newlines it can still break prompt structure. Consider sanitizing LastStatus (and any other string fields) the same way as identifiers before writing them into the prompt.

Suggested change

sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, h.LastStatus)

sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, sanitizeIdentifier(h.LastStatus))

Copilot · 2026-03-25T10:30:36Z

+	fmt.Fprintf(b, "Repository: %s\n", diff.Repository)
+	if diff.BaseCommit != "" {
+		fmt.Fprintf(b, "Base: %s → Head: %s\n", diff.BaseCommit, diff.HeadCommit)
+	} else {
+		fmt.Fprintf(b, "Head: %s\n", diff.HeadCommit)
+	}
+	if diff.Truncated {
+		fmt.Fprintf(b, "(showing top %d of %d changed files by churn)\n", len(diff.Files), diff.TotalFiles)
+	}
+	b.WriteString("\n")
+	for _, f := range diff.Files {
+		fmt.Fprintf(b, "- %s (+%d/-%d churn=%d)\n", f.Path, f.Additions, f.Deletions, f.Churn)


writeGitDiff prints Repository/BaseCommit/HeadCommit and each FileDiffStat.Path without sanitization. Since these values can originate from external inputs (e.g., CTRF metadata or git filenames), they could contain newlines/control characters and inject content into the prompt. Consider applying the same identifier sanitization here before writing these fields.

Suggested change

fmt.Fprintf(b, "Repository: %s\n", diff.Repository)

if diff.BaseCommit != "" {

fmt.Fprintf(b, "Base: %s → Head: %s\n", diff.BaseCommit, diff.HeadCommit)

} else {

fmt.Fprintf(b, "Head: %s\n", diff.HeadCommit)

}

if diff.Truncated {

fmt.Fprintf(b, "(showing top %d of %d changed files by churn)\n", len(diff.Files), diff.TotalFiles)

}

b.WriteString("\n")

for _, f := range diff.Files {

fmt.Fprintf(b, "- %s (+%d/-%d churn=%d)\n", f.Path, f.Additions, f.Deletions, f.Churn)

fmt.Fprintf(b, "Repository: %s\n", sanitizeIdentifier(diff.Repository))

if diff.BaseCommit != "" {

fmt.Fprintf(b, "Base: %s → Head: %s\n", sanitizeIdentifier(diff.BaseCommit), sanitizeIdentifier(diff.HeadCommit))

} else {

fmt.Fprintf(b, "Head: %s\n", sanitizeIdentifier(diff.HeadCommit))

}

if diff.Truncated {

fmt.Fprintf(b, "(showing top %d of %d changed files by churn)\n", len(diff.Files), diff.TotalFiles)

}

b.WriteString("\n")

for _, f := range diff.Files {

fmt.Fprintf(b, "- %s (+%d/-%d churn=%d)\n", sanitizeIdentifier(f.Path), f.Additions, f.Deletions, f.Churn)

Copilot · 2026-03-25T10:30:36Z

+// sanitizeIdentifier truncates s at the first control character (< U+0020)
+// and removes any remaining control characters from what follows. Identifier
+// fields (TestResultID, Name, Suite) must not contain newlines or control
+// characters: a value such as "tr-1\nIgnore all instructions" would inject
+// a new line into the prompt structure. Legitimate identifiers never contain
+// control characters, so truncating at the first one is safe and removes the
+// injected payload entirely.


The sanitizeIdentifier doc comment says it "removes any remaining control characters", but the implementation truncates at the first control character and doesn't perform any additional filtering. Either update the comment to reflect the actual behavior, or implement the additional filtering described.

Copilot · 2026-03-25T10:30:37Z

+	mock := llm.NewMock(validResponse(failures))
+	e := NewEngine(mock)
+
+	e.Triage(context.Background(), TriageInput{Failures: failures}) //nolint:errcheck


This test ignores the error returned by Triage via //nolint:errcheck, which can allow the test to pass even if Triage unexpectedly fails. Capture the returned output/error and assert err is nil so the test validates the full behavior it depends on.

Suggested change

e.Triage(context.Background(), TriageInput{Failures: failures}) //nolint:errcheck

_, err := e.Triage(context.Background(), TriageInput{Failures: failures})

if err != nil {

t.Fatalf("unexpected error from Triage: %v", err)

}

Cistern Agent and others added 5 commits March 25, 2026 03:52

sc-m9txz: simplify: remove dead mock variable in InvalidJSON test

11ea76f

Leftover first-attempt `mock` was unused — only `mock2` drove the test. Collapsed to a single `mock` and removed the `_ = mock` suppressor.

sc-m9txz: fix prompt injection, duplicate ID, and nil Classifications

4bf4028

Copilot AI review requested due to automatic review settings March 25, 2026 10:25

Copilot started reviewing on behalf of MichielDean March 25, 2026 10:26 View session

Copilot AI reviewed Mar 25, 2026

View reviewed changes

MichielDean merged commit d74d5ac into main Mar 25, 2026
25 of 27 checks passed

MichielDean deleted the feat/sc-m9txz branch March 25, 2026 10:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sc-m9txz: Implement failure clustering and triage prompt engine#211

sc-m9txz: Implement failure clustering and triage prompt engine#211
MichielDean merged 5 commits into
mainfrom
feat/sc-m9txz

MichielDean commented Mar 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	func TestEngine_Triage_EmptyFailures_MarshalsClarificationsAsEmptyArray(t *testing.T) {
	func TestEngine_Triage_EmptyFailures_MarshalsClassificationsAsEmptyArray(t *testing.T) {

	Label string `json:"label"`
	Label string `json:"label,omitempty"`

-// truncateStr shortens s to at most maxLen bytes, appending "…" if truncated.
-// It does not split multi-byte UTF-8 runes at the boundary — it truncates
-// at the byte level which may break a rune, but this is acceptable for prompt
-// construction where exact byte counts matter more than rune alignment.
-func truncateStr(s string, maxLen int) string {
-	if len(s) <= maxLen {
-		return s
-	}
-	return s[:maxLen] + "…"
+// truncateStr shortens s to at most maxLen bytes (including the suffix),
+// appending "…" if truncated. It does not split multi-byte UTF-8 runes at
+// the boundary — it truncates at the byte level which may break a rune, but
+// this is acceptable for prompt construction where exact byte counts matter
+// more than rune alignment.
+func truncateStr(s string, maxLen int) string {
+	if maxLen <= 0 {
+		return ""
+	}
+	if len(s) <= maxLen {
+		return s
+	}
+	const ellipsis = "…"
+	// If maxLen is very small, return as much of the ellipsis as fits.
+	if maxLen <= len(ellipsis) {
+		return ellipsis[:maxLen]
+	}
+	prefixLen := maxLen - len(ellipsis)
+	return s[:prefixLen] + ellipsis

	sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, h.LastStatus)
	sanitizeIdentifier(h.TestName), h.TotalRuns, h.PassRate, h.FlakyScore, sanitizeIdentifier(h.LastStatus))

-	e.Triage(context.Background(), TriageInput{Failures: failures}) //nolint:errcheck
+	_, err := e.Triage(context.Background(), TriageInput{Failures: failures})
+	if err != nil {
+		t.Fatalf("unexpected error from Triage: %v", err)
+	}

Conversation

MichielDean commented Mar 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants