Skip to content

feat(cli): add aicr diff for configuration drift detection#582

Open
sanjeevrg89 wants to merge 6 commits intoNVIDIA:mainfrom
sanjeevrg89:feat/aicr-diff
Open

feat(cli): add aicr diff for configuration drift detection#582
sanjeevrg89 wants to merge 6 commits intoNVIDIA:mainfrom
sanjeevrg89:feat/aicr-diff

Conversation

@sanjeevrg89
Copy link
Copy Markdown

@sanjeevrg89 sanjeevrg89 commented Apr 15, 2026

Adds aicr diff — compares a snapshot against a recipe to surface constraint violations and component version drift.

What's included:

  • Constraint evaluation: flags mismatches between expected recipe values and actual snapshot state
  • Component drift: detects version skew between recipe chart versions and deployed images
  • Prometheus metrics for drift tracking (counters, gauges, histograms)
  • Table output for human-readable reports
  • Full test coverage (190 tests, race-clean)

Scoped to just the diff feature per @xdu31's feedback on #464 — node-validate and node-scoped constraints will follow as separate PRs.

Summary by CodeRabbit

  • New Features

    • Added a new diff subcommand to compare two snapshots and report field-level changes; supports table and serialized outputs and a fail-on-drift option.
    • Integrates snapshot source selection and kubeconfig-aware snapshot loading with CLI timeout handling.
  • Documentation

    • Added guidance describing planned drift-detection use cases and future integration notes.
  • Tests

    • New unit tests covering CLI flags, snapshot diff behavior, table output, and metrics recording.

Add snapshot-vs-recipe comparison that detects constraint violations
and component version drift. Includes Prometheus metrics, table output,
and comprehensive test coverage.

New packages: pkg/diff (core engine, metrics, table renderer)
New command: pkg/cli/diff.go registered in root.go
@sanjeevrg89 sanjeevrg89 requested a review from a team as a code owner April 15, 2026 03:59
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mchmarny mchmarny self-requested a review April 15, 2026 17:25
@mchmarny mchmarny added the enhancement New feature or request label Apr 15, 2026
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome — solid first contribution. Good test coverage (190 tests), clean pkg/cli / pkg/diff separation, and smart reuse of constraints.Evaluate. The tactical items (Close errors, missing timeout, dead metrics code) are straightforward fixes — see inline comments.

Comment thread pkg/cli/diff.go
Comment thread pkg/cli/diff.go Outdated
Comment thread pkg/cli/diff.go Outdated
Comment thread pkg/diff/_future/metrics.go.bak
Comment thread pkg/diff/diff.go Outdated
Comment thread pkg/diff/diff.go
Comment thread pkg/diff/diff.go Outdated
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bigger questions are architectural:

  1. Parallel constraint evaluation pathevaluateConstraint reimplements the loop from validator.checkReadiness(). Consider extracting a shared function so both commands stay in sync.

  2. Union result typeResult serves both modes via optional fields keyed on .Mode. Separate types per mode would be type-safe and easier to schema-validate for CI consumers.

  3. Component drift matching is heuristic — string-matching ComponentRef.Name against image keys will produce false "not-observed" in real clusters where image names don't exactly match component names.

  4. Scope — Snapshot-vs-snapshot comparison is genuinely novel. Recipe-vs-snapshot is a read-only subset of what validate already does. Would it be cleaner to ship snapshot-vs-snapshot here, and add recipe drift as aicr validate --drift-only (skip job execution, report constraint state)? That keeps constraint evaluation in one code path.

Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two regressions remain in the GPU workflow changes.

The new trigger/filter lists no longer include pkg/validator/checks/conformance/** or pkg/validator/checks/deployment/**. That means a PR that only changes validator checks will stop triggering these GPU workflows on pull-request/* pushes, creating a silent CI gap for the main end-to-end validator path.

The conformance evidence collection step now requires validate-conformance to have run with a success or failure outcome. If chainsaw test or expected-resource verification fails first, validate-conformance is skipped and evidence collection never runs. Previously we still uploaded conformance evidence after those earlier failures, which was useful for debugging common GPU CI breakages. The same regression exists in both the inference and training workflows.

@yuanchen8911
Copy link
Copy Markdown
Contributor

I think this is really two different features. Snapshot-vs-snapshot feels like a natural diff command. Recipe-vs-snapshot feels more like validate in an offline/read-only mode than an extension of snapshot.

That also reinforces my earlier concern about the union Result type: one command is currently carrying two distinct concepts and output schemas behind a .Mode switch.

My preference would be to keep diff focused on snapshot comparison, and put recipe drift under validate so the constraint-evaluation path stays unified.

Copy link
Copy Markdown
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome, and thanks for the contribution -- configuration drift detection is a useful feature. Mark's review covers the key architectural questions well. The main one to address first is the scope question: whether recipe-vs-snapshot belongs here or under validate --drift-only, since that drives the type design (union Result vs separate types) and avoids the parallel constraint evaluation path.

Once you have a chance to respond to the review feedback, the tactical items (Close errors, missing timeout, unused metrics registration) should be quick fixes. Looking forward to seeing the next iteration.

Address review feedback on NVIDIA#582 by removing recipe-vs-snapshot mode from
aicr diff. Reviewers agreed recipe drift belongs under validate --drift-only
to keep constraint evaluation in one code path.

Changes:
- Remove RecipeVsSnapshot path, union Result type, component drift matching
- Remove pkg/diff/metrics.go (dead code — promauto was registering gauges
  at init that nothing populated)
- Add context.WithTimeout on CLI I/O using defaults.CLISnapshotTimeout
- Fix writable f.Close() and serializer closer.Close() to capture errors

Preserved in pkg/diff/_future/ with a README documenting the use cases
that justify recipe-vs-snapshot drift detection (airgapped audit,
historical compliance, fleet-scale monitoring, CI/CD gating) and a
reuse plan for validate --drift-only.

Net: -1,004 lines. 176 tests pass with -race. go vet clean.
@sanjeevrg89
Copy link
Copy Markdown
Author

Thanks @mchmarny, @yuanchen8911, and @ArangoGutierrez for the detailed feedback — all well taken. Pushed an update that scopes this PR down to snapshot-vs-snapshot only, which I think addresses the architectural concerns.

What changed in this push:

  • Stripped RecipeVsSnapshot, the union Result type, component drift matching, and the parallel constraint evaluation loop from pkg/diff. aicr diff is now just snapshot-vs-snapshot — one concept, one type, one code path.
  • Removed pkg/diff/metrics.go. Agreed it was dead code in this PR — promauto was registering gauges at init that nothing populated.
  • Added context.WithTimeout(ctx, defaults.CLISnapshotTimeout) around the CLI I/O path.
  • Fixed the writable f.Close() and the serializer closer.Close() to capture and log errors instead of discarding them.

Net change is -1,004 lines on the feature. 196 tests pass with -race, go vet clean.

On recipe-vs-snapshot:

I agree with the suggestion to put it under aicr validate --drift-only so constraint evaluation stays in one place. I preserved the recipe-mode code in pkg/diff/_future/ (outside Go's build graph) with a README documenting the use cases that I think justify keeping it as a first-class capability even if it moves:

  • Airgapped/disconnected audit — auditor has snapshot + recipe, no cluster access
  • Historical compliance — "was cluster X compliant on 2026-03-15?" given that day's snapshot
  • Fleet-scale drift monitoring — analyze 500 snapshots offline instead of round-tripping validate
  • CI/CD drift gating — sub-second drift check on a recipe change without spinning up a cluster
  • Pre-flight before validate — cheap read-only check before the heavy Job-deploying run
  • Evidence/attestation trail — reproducible signed drift report from recipe hash + snapshot hash

The reviewers' concern was that constraint evaluation should live in one path — they're right. The suggested validate --drift-only design unifies the code path while keeping both use cases first-class. Happy to take that on as a follow-up PR, with the constraint evaluator extracted to pkg/constraints as a shared function and a structured ComponentRef→image mapping to replace the heuristic matching.

Let me know if the scope on this PR now feels right.

@coderabbitai

This comment was marked as resolved.

@mchmarny mchmarny dismissed their stale review April 16, 2026 14:40

issues addressed

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cli/diff_test.go`:
- Around line 39-51: The variable name requiredFlags in pkg/cli/diff_test.go is
misleading because some entries are optional; rename it to expectedFlags and
update all references in the test (the slice definition and the for-range that
iterates it) so the intent is clear, keeping the rest of the logic (loop over
cmd.Flags and hasFlag) unchanged; ensure the test error message that uses
flagName remains the same.

In `@pkg/cli/diff.go`:
- Around line 138-141: The drift-detection branch currently returns
errors.New(..., errors.ErrCodeInternal) which is the wrong code; update the
error creation in the block that checks cmd.Bool("fail-on-drift") &&
result.HasDrift() to use errors.ErrCodeInvalidRequest (or add a new constant
like ErrCodeConfigurationMismatch in the central error codes and use that)
instead of ErrCodeInternal, and ensure the new/changed error code maps to the
intended exit code (exit code 2) in the error handling/mapping logic.

In `@pkg/diff/_future/cli_diff.go.bak`:
- Around line 251-259: The table branch currently treats any non-empty output as
a filesystem path (creating a file for "--output -"), so in the block guarded by
outFormat == serializer.FormatTable replace the manual os.Create logic with the
same normalization used elsewhere: call serializer.NewFileWriterOrStdout(output)
to obtain the writer, propagate any error, assign the returned writer to w, and
if the returned writer implements io.Closer defer Close() to avoid leaking file
descriptors; reference the symbols outFormat, serializer.FormatTable, output, w
and serializer.NewFileWriterOrStdout when locating and changing the code.
- Around line 258-272: The JSON/YAML output path currently ignores Close()
errors on the serializer (ser) just like the earlier file branch where defer
f.Close() drops errors; change both close paths to capture and propagate Close()
failures: for the file branch replace defer f.Close() with a deferred closure
that checks f.Close() error and returns or wraps it into the function's error
return, and for the serializer branch change the anonymous defer to call
closer.Close(), check its error, and if non-nil wrap and return it (preserve
existing error wrapping style used around serializer.NewFileWriterOrStdout);
reference the variables/functions ser, f, serializer.NewFileWriterOrStdout and
ensure the function returns the Close() error instead of discarding it.

In `@pkg/diff/_future/README.md`:
- Line 1: Replace the current H2 heading "## Recipe-vs-Snapshot Drift Detection
(deferred)" with a top-level H1 heading so the README begins with a single
top-level heading; update the line in pkg/diff/_future/README.md that contains
"Recipe-vs-Snapshot Drift Detection (deferred)" to use an H1 (single leading
hash) instead of H2 to satisfy MD041.

In `@pkg/diff/_future/recipe_diff.go.bak`:
- Around line 267-281: The current loop in the block that assigns
result.ValidationPhases = checkValidationPhases(rec, snap) is aggregating
phase-level constraint outcomes (iterating vp.ConstraintResults) into
result.Summary.Constraints* while those constraint rows are not present in
result.ConstraintResults, causing inconsistent counts. Fix by either (A) keeping
separate phase counters on result (e.g., result.PhaseSummary.* or new fields)
and increment those instead of result.Summary, or (B) append/flatten each
vp.ConstraintResults into result.ConstraintResults before computing
result.Summary.Total and the ConstraintsPassed/Failed/Error counts; update the
code paths that compute result.Summary (referencing result.ValidationPhases,
vp.ConstraintResults, result.ConstraintResults, and result.Summary) so top-level
summary only reflects entries actually present in result.ConstraintResults.
- Around line 230-236: Guard against nil inputs at the start of
RecipeVsSnapshot: check if rec==nil or snap==nil before using rec.Constraints or
dereferencing snap, and return a safe empty Result (or an explicit error Result)
instead of letting the function panic. Concretely, compute a constraintsLen :=
0; if rec != nil { constraintsLen = len(rec.Constraints) } and use that when
allocating ConstraintResults, and if snap == nil or rec == nil return the
initialized Result (Mode "recipe-vs-snapshot") immediately or set an error field
so downstream code in RecipeVsSnapshot, and any uses of rec or snap, do not
dereference nil values.

In `@pkg/diff/diff.go`:
- Around line 86-92: The docstring for Snapshots contradicts its behavior:
instead of requiring non-nil inputs, the function currently returns an empty
*Result when baseline or target is nil; update the comment for Snapshots to
state it is nil-safe and that when either baseline or target is nil the function
returns an empty Result (with Changes initialized empty), referencing the Result
and Change return values and the snapshotter.Snapshot parameters so callers know
they don't need to pre-validate; do not change the function signature.

In `@pkg/diff/table.go`:
- Around line 56-60: The else branch printing "NO DRIFT" is unreachable because
the function returns when len(result.Changes) == 0 and by the time we reach this
block result.HasDrift() will always be true; remove the dead else branch and
simplify the block to always print the drift message (e.g., replace the if/else
using result.HasDrift() with a single fmt.Fprintln(w, "DRIFT DETECTED")). Ensure
references to result.Changes, result.Summary.Total (populated by Snapshots()),
and result.HasDrift() remain correct elsewhere.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 392d5693-49cb-4486-be78-91531e3ee884

📥 Commits

Reviewing files that changed from the base of the PR and between 03b82bd and dc011a8.

📒 Files selected for processing (13)
  • pkg/cli/diff.go
  • pkg/cli/diff_test.go
  • pkg/cli/root.go
  • pkg/diff/_future/README.md
  • pkg/diff/_future/cli_diff.go.bak
  • pkg/diff/_future/metrics.go.bak
  • pkg/diff/_future/metrics_test.go.bak
  • pkg/diff/_future/recipe_diff.go.bak
  • pkg/diff/_future/recipe_diff_test.go.bak
  • pkg/diff/_future/table.go.bak
  • pkg/diff/diff.go
  • pkg/diff/diff_test.go
  • pkg/diff/table.go

Comment thread pkg/cli/diff_test.go Outdated
Comment thread pkg/cli/diff.go
Comment on lines +251 to +259
if outFormat == serializer.FormatTable {
w := os.Stdout
if output != "" {
f, err := os.Create(output)
if err != nil {
return errors.Wrap(errors.ErrCodeInternal, "failed to create output file", err)
}
defer f.Close()
w = f
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Honor stdout sentinels in the table path.

The table branch treats any non-empty --output as a filesystem path, so --output - (and the stdout URI) will create a file instead of writing to stdout. Reuse the same output normalization as serializer.NewFileWriterOrStdout here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/diff/_future/cli_diff.go.bak` around lines 251 - 259, The table branch
currently treats any non-empty output as a filesystem path (creating a file for
"--output -"), so in the block guarded by outFormat == serializer.FormatTable
replace the manual os.Create logic with the same normalization used elsewhere:
call serializer.NewFileWriterOrStdout(output) to obtain the writer, propagate
any error, assign the returned writer to w, and if the returned writer
implements io.Closer defer Close() to avoid leaking file descriptors; reference
the symbols outFormat, serializer.FormatTable, output, w and
serializer.NewFileWriterOrStdout when locating and changing the code.

Comment on lines +258 to +272
defer f.Close()
w = f
}
return diff.WriteTable(w, result)
}

// JSON/YAML use standard serializer
ser, err := serializer.NewFileWriterOrStdout(outFormat, output)
if err != nil {
return errors.Wrap(errors.ErrCodeInternal, "failed to create output writer", err)
}
defer func() {
if closer, ok := ser.(interface{ Close() error }); ok {
_ = closer.Close()
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Propagate close failures from output writers.

Both close paths drop errors. That can silently lose the last buffered bytes and still return success. The future copy should keep the same close/error pattern as the live CLI before it is brought back.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/diff/_future/cli_diff.go.bak` around lines 258 - 272, The JSON/YAML
output path currently ignores Close() errors on the serializer (ser) just like
the earlier file branch where defer f.Close() drops errors; change both close
paths to capture and propagate Close() failures: for the file branch replace
defer f.Close() with a deferred closure that checks f.Close() error and returns
or wraps it into the function's error return, and for the serializer branch
change the anonymous defer to call closer.Close(), check its error, and if
non-nil wrap and return it (preserve existing error wrapping style used around
serializer.NewFileWriterOrStdout); reference the variables/functions ser, f,
serializer.NewFileWriterOrStdout and ensure the function returns the Close()
error instead of discarding it.

@@ -0,0 +1,75 @@
## Recipe-vs-Snapshot Drift Detection (deferred)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add top-level heading per markdown standards.

Static analysis flags MD041: first line should be a top-level heading. The current line 1 is ## Recipe-vs-Snapshot Drift Detection (deferred) which is H2.

📝 Suggested fix
+# Deferred Implementation
+
 ## Recipe-vs-Snapshot Drift Detection (deferred)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## Recipe-vs-Snapshot Drift Detection (deferred)
# Deferred Implementation
## Recipe-vs-Snapshot Drift Detection (deferred)
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 1-1: First line in a file should be a top-level heading

(MD041, first-line-heading, first-line-h1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/diff/_future/README.md` at line 1, Replace the current H2 heading "##
Recipe-vs-Snapshot Drift Detection (deferred)" with a top-level H1 heading so
the README begins with a single top-level heading; update the line in
pkg/diff/_future/README.md that contains "Recipe-vs-Snapshot Drift Detection
(deferred)" to use an H1 (single leading hash) instead of H2 to satisfy MD041.

Comment on lines +230 to +236
func RecipeVsSnapshot(rec *recipe.RecipeResult, snap *snapshotter.Snapshot) *Result {
result := &Result{
Mode: "recipe-vs-snapshot",
ConstraintResults: make([]ConstraintResult, 0, len(rec.Constraints)),
ComponentDrifts: make([]ComponentDrift, 0),
ValidationPhases: make([]ValidationPhaseSummary, 0),
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard nil recipe/snapshot inputs before constructing the result.

len(rec.Constraints) panics when rec is nil, and a nil snap is dereferenced later during constraint and component evaluation. For a reusable package API, this should return a safe empty result or an explicit error path instead of crashing. Based on learnings: Functional Packages (pkg/oci, pkg/bundler, pkg/recipe, pkg/collector, pkg/validator) must be self-contained, reusable business logic usable independently without CLI/API.

Possible hardening
 func RecipeVsSnapshot(rec *recipe.RecipeResult, snap *snapshotter.Snapshot) *Result {
+	if rec == nil || snap == nil {
+		return &Result{
+			Mode:              "recipe-vs-snapshot",
+			ConstraintResults: []ConstraintResult{},
+			ComponentDrifts:   []ComponentDrift{},
+			ValidationPhases:  []ValidationPhaseSummary{},
+		}
+	}
+
 	result := &Result{
 		Mode:              "recipe-vs-snapshot",
 		ConstraintResults: make([]ConstraintResult, 0, len(rec.Constraints)),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func RecipeVsSnapshot(rec *recipe.RecipeResult, snap *snapshotter.Snapshot) *Result {
result := &Result{
Mode: "recipe-vs-snapshot",
ConstraintResults: make([]ConstraintResult, 0, len(rec.Constraints)),
ComponentDrifts: make([]ComponentDrift, 0),
ValidationPhases: make([]ValidationPhaseSummary, 0),
}
func RecipeVsSnapshot(rec *recipe.RecipeResult, snap *snapshotter.Snapshot) *Result {
if rec == nil || snap == nil {
return &Result{
Mode: "recipe-vs-snapshot",
ConstraintResults: []ConstraintResult{},
ComponentDrifts: []ComponentDrift{},
ValidationPhases: []ValidationPhaseSummary{},
}
}
result := &Result{
Mode: "recipe-vs-snapshot",
ConstraintResults: make([]ConstraintResult, 0, len(rec.Constraints)),
ComponentDrifts: make([]ComponentDrift, 0),
ValidationPhases: make([]ValidationPhaseSummary, 0),
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/diff/_future/recipe_diff.go.bak` around lines 230 - 236, Guard against
nil inputs at the start of RecipeVsSnapshot: check if rec==nil or snap==nil
before using rec.Constraints or dereferencing snap, and return a safe empty
Result (or an explicit error Result) instead of letting the function panic.
Concretely, compute a constraintsLen := 0; if rec != nil { constraintsLen =
len(rec.Constraints) } and use that when allocating ConstraintResults, and if
snap == nil or rec == nil return the initialized Result (Mode
"recipe-vs-snapshot") immediately or set an error field so downstream code in
RecipeVsSnapshot, and any uses of rec or snap, do not dereference nil values.

Comment on lines +267 to +281
// 3. Summarize validation phase configuration and count phase-level constraints
result.ValidationPhases = checkValidationPhases(rec, snap)
for _, vp := range result.ValidationPhases {
for _, cr := range vp.ConstraintResults {
if cr.Error != "" {
result.Summary.ConstraintsError++
} else if cr.Passed {
result.Summary.ConstraintsPassed++
} else {
result.Summary.ConstraintsFailed++
}
}
}

result.Summary.Total = len(result.ConstraintResults)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't mix phase constraint totals into the top-level constraint summary.

This loop folds phase-level constraint outcomes into Summary.Constraints*, but those rows are not added to ConstraintResults. Any consumer that renders ConstraintResults with Summary will show inconsistent counts versus rows. Keep separate phase counters, or flatten phase constraint results into the main slice before summarizing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/diff/_future/recipe_diff.go.bak` around lines 267 - 281, The current loop
in the block that assigns result.ValidationPhases = checkValidationPhases(rec,
snap) is aggregating phase-level constraint outcomes (iterating
vp.ConstraintResults) into result.Summary.Constraints* while those constraint
rows are not present in result.ConstraintResults, causing inconsistent counts.
Fix by either (A) keeping separate phase counters on result (e.g.,
result.PhaseSummary.* or new fields) and increment those instead of
result.Summary, or (B) append/flatten each vp.ConstraintResults into
result.ConstraintResults before computing result.Summary.Total and the
ConstraintsPassed/Failed/Error counts; update the code paths that compute
result.Summary (referencing result.ValidationPhases, vp.ConstraintResults,
result.ConstraintResults, and result.Summary) so top-level summary only reflects
entries actually present in result.ConstraintResults.

Comment thread pkg/diff/diff.go
Comment thread pkg/diff/table.go Outdated
@mchmarny
Copy link
Copy Markdown
Member

mchmarny commented Apr 16, 2026

@sanjeevrg89 looks like some linting issues remain, otherwise LGTM. Sounds like coderabbit though has some other ideas ;)

Address review feedback on NVIDIA#582 by narrowing aicr diff to just
snapshot-vs-snapshot comparison. Reviewers agreed recipe-vs-snapshot
drift belongs under validate --drift-only to keep constraint
evaluation in one code path.

Architectural changes:
- Remove RecipeVsSnapshot path, union Result type, component drift
  matching, and the parallel constraint evaluation loop
- Remove pkg/diff/metrics.go (dead code — promauto was registering
  gauges at init that nothing populated)

Tactical review fixes:
- Add context.WithTimeout on CLI I/O using defaults.CLISnapshotTimeout
- Capture and log errors from writable f.Close() and serializer
  closer.Close() instead of discarding them
- Use ErrCodeInvalidRequest (exit 2) instead of ErrCodeInternal (exit
  8) when fail-on-drift triggers — drift is an expected outcome, not
  an internal failure
- Document nil-safe behavior on Snapshots() instead of claiming both
  inputs must be non-nil
- Remove unreachable "NO DRIFT" branch in table.go (early return
  handles the empty-changes path)
- Rename requiredFlags to expectedFlags in diff_test.go

Lint fixes:
- gocritic unlambda: drop wrapper closure around runDiffCmd
- govet shadow: reassign err instead of shadowing at initDataProvider
- prealloc: preallocate changes slice in addedMeasurement and
  removedMeasurement

Preserved in pkg/diff/_future/ with a README documenting the use
cases (airgapped audit, historical compliance, fleet-scale monitoring,
CI/CD gating, pre-flight validate, evidence trail) that justify
recipe-vs-snapshot drift detection as a first-class capability under
validate --drift-only.

golangci-lint v2.10.1: 0 issues. 196 tests pass with -race.
@sanjeevrg89
Copy link
Copy Markdown
Author

sanjeevrg89 commented Apr 16, 2026

Thanks @mchmarny — linting's green now. Just pushed a fix.

Resolved all 4 golangci-lint failures from tests / Lint:

  • pkg/cli/diff.go:54 (gocritic unlambda) — dropped the wrapper closure, use Action: runDiffCmd directly
  • pkg/cli/diff.go:105 (govet shadow) — reassign outer err at initDataProvider instead of shadowing
  • pkg/diff/diff.go:220,228 (prealloc) — preallocate the changes slice in addedMeasurement and removedMeasurement using len(m.Subtypes)

While I was in there I also took the valid CodeRabbit nits on the live code:

  • pkg/cli/diff.go:136 — use ErrCodeInvalidRequest (exit 2) instead of ErrCodeInternal (exit 8) when --fail-on-drift triggers. Drift is an expected outcome, not an internal failure — this also makes the exit code useful for CI scripting.
  • pkg/diff/diff.goSnapshots() docstring now documents the nil-safe behavior instead of claiming both inputs must be non-nil (the implementation was already nil-safe).
  • pkg/diff/table.go — removed the unreachable NO DRIFT branch; the empty-changes path returns early above it.
  • pkg/cli/diff_test.go — renamed requiredFlagsexpectedFlags since the slice includes optional flags like --output and --format.

Skipped the CodeRabbit suggestions against files in pkg/diff/_future/ — those .bak files aren't in the Go build graph or linted, they're archived preservation for when someone picks up validate --drift-only. The README H1 nit is also there; markdown isn't in the repo's lint pipeline.

Verified locally with golangci-lint v2.10.1 (matches CI): 0 issues. 196 tests pass with -race.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
pkg/cli/diff.go (1)

153-161: ⚠️ Potential issue | 🟠 Major

Return Close() failures for writable outputs instead of only logging them.

Close() is where buffered I/O errors often surface. Logging these failures still lets aicr diff exit 0 after a truncated output file. Make writeDiffResult use a named err return and promote close failures when no earlier error exists.

💾 Suggested pattern
-func writeDiffResult(ctx context.Context, cmd *cli.Command, outFormat serializer.Format, result *diff.Result) error {
+func writeDiffResult(ctx context.Context, cmd *cli.Command, outFormat serializer.Format, result *diff.Result) (err error) {
 	output := cmd.String("output")
 ...
 			defer func() {
-				if closeErr := f.Close(); closeErr != nil {
-					slog.Error("failed to close output file", "error", closeErr)
+				if closeErr := f.Close(); closeErr != nil && err == nil {
+					err = errors.Wrap(errors.ErrCodeInternal, "failed to close output file", closeErr)
 				}
 			}()
 ...
 	defer func() {
 		if closer, ok := ser.(interface{ Close() error }); ok {
-			if closeErr := closer.Close(); closeErr != nil {
-				slog.Error("failed to close serializer", "error", closeErr)
+			if closeErr := closer.Close(); closeErr != nil && err == nil {
+				err = errors.Wrap(errors.ErrCodeInternal, "failed to close serializer", closeErr)
 			}
 		}
 	}()

As per coding guidelines "Always capture and check Close() error on writable file handles (files opened with Create or OpenFile with write modes)".

Also applies to: 172-178

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/diff.go` around lines 153 - 161, The defer that closes the created
file (variable f) must promote Close() failures to the function return instead
of only logging them: change writeDiffResult to use a named return variable
(err) and in the defer check closeErr := f.Close(); if closeErr != nil && err ==
nil { err = errors.Wrap(errors.ErrCodeInternal, "failed to close output file",
closeErr) } so Close() errors are returned; apply the same pattern for the other
writable output file close at the second occurrence (lines around the other
defer) to ensure buffered-write errors are propagated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/diff/diff.go`:
- Around line 152-156: indexMeasurements currently dereferences m.Type without
checking for nil, causing a panic if measurements contains nil entries; update
the function (indexMeasurements in pkg/diff/diff.go) to skip nil measurement
pointers by adding a guard (if m == nil { continue }) before accessing m.Type so
nil entries are ignored (or handled) instead of panicking, ensuring the map
creation remains safe on invalid input.

---

Duplicate comments:
In `@pkg/cli/diff.go`:
- Around line 153-161: The defer that closes the created file (variable f) must
promote Close() failures to the function return instead of only logging them:
change writeDiffResult to use a named return variable (err) and in the defer
check closeErr := f.Close(); if closeErr != nil && err == nil { err =
errors.Wrap(errors.ErrCodeInternal, "failed to close output file", closeErr) }
so Close() errors are returned; apply the same pattern for the other writable
output file close at the second occurrence (lines around the other defer) to
ensure buffered-write errors are propagated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 6efb79ea-a4e0-4d31-8e33-44cff404500f

📥 Commits

Reviewing files that changed from the base of the PR and between dc011a8 and ca9e24f.

📒 Files selected for processing (4)
  • pkg/cli/diff.go
  • pkg/cli/diff_test.go
  • pkg/diff/diff.go
  • pkg/diff/table.go

Comment thread pkg/diff/diff.go
Comment thread pkg/diff/table.go Outdated
Three fixes from review:

- pkg/diff/table.go: wrap writes to the output Writer with an errWriter
  helper so the first write failure (broken pipe, full disk) surfaces
  as a structured pkg/errors.Wrap(ErrCodeInternal, ...) return value
  instead of a silent exit 0. Also wraps the bare err from tw.Flush()
  that CLAUDE.md's "never return bare err" rule forbids.

- pkg/cli/diff.go: convert writeDiffResult to a named-return function
  and promote Close() failures on the output file and serializer into
  the return value when no earlier error exists. Buffered-write
  failures on Close were previously logged-and-lost, allowing aicr
  diff to exit 0 after a truncated output file.

- pkg/diff/diff.go: guard against nil *measurement.Measurement entries
  in indexMeasurements. A snapshot loaded from malformed YAML could
  contain nil entries and panic the dereference on m.Type; now
  skipped cleanly.

Added tests for the new failure paths: TestSnapshots_NilMeasurementEntries
exercises the nil guard, TestWriteTable_PropagatesWriteErrors exercises
both the empty and non-empty output branches with a failing writer.

golangci-lint v2.10.1: 0 issues. 200 tests pass with -race.
@sanjeevrg89
Copy link
Copy Markdown
Author

Good catches from CodeRabbit — pushed fixes for all three:

1. pkg/diff/table.go write propagation. Wrote an errWriter helper that captures the first io.Writer error and turns subsequent writes into no-ops. WriteTable now wraps any failure with errors.Wrap(errors.ErrCodeInternal, "failed to write diff table output", err). This covers the broken-pipe / full-disk scenario and also addresses the bare tw.Flush() return that was violating the "never return bare err" rule.

2. pkg/cli/diff.go Close() error propagation. Converted writeDiffResult to a named-return function (err error) and promote Close() failures on both the output file (f.Close()) and the serializer (closer.Close()) into the return value when no earlier error exists. Previously these were logged and dropped — meaning aicr diff could exit 0 after a truncated output file.

3. pkg/diff/diff.go nil guard on indexMeasurements. Skip nil entries instead of panicking on m.Type. Real snapshots from the snapshotter won't contain nil entries, but malformed YAML deserialization can produce them and we shouldn't crash on invalid input.

Added two tests for the new failure paths: TestSnapshots_NilMeasurementEntries exercises the nil guard, TestWriteTable_PropagatesWriteErrors uses a failingWriter to verify both the empty-changes and with-changes paths return wrapped errors when the target writer fails.

golangci-lint v2.10.1: 0 issues. 200 tests pass with -race.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants