feat(cluster): manifest-vs-storage reconciliation — Phase 5 anti-entropy #412

Merged: xe-nvdk merged 8 commits into main from feat/phase5-manifest-reconciliation on Apr 27, 2026

Conversation

xe-nvdk (Member) commented Apr 27, 2026

Summary

  • Adds a periodic reconciler that detects and repairs drift between the Raft-replicated cluster file manifest and physical storage. Two kinds of drift are addressed: orphan manifest entries (the manifest references a missing file — caused by retention/compaction succeeding storage-side and then losing Raft quorum) and orphan storage files (a file exists but has no manifest entry — caused by a crash between storage.Write and the file-registrar Raft propose, or by pre-Phase-1 history).
  • Druid-style posture: off by default. Once enabled, cron auto-acts on drift older than a 24h grace window with a 10k per-run blast cap. Anti-entropy that doesn't require babysitting matches Druid's coordinator-kill task and Pinot's retention manager.
  • Replaces explicit "Phase 5 will add" TODOs at retention.go:826, delete.go:687, job.go:827, coordinator.go:2101.

Design

Cluster gate is dual-mode by backend topology:

  • Shared storage (S3/Azure/MinIO): one node sweeps via IsActiveCompactor — reuses the failover-managed compactor lease as a single-sweeper election with no new state.
  • Local storage: every node walks its own disk filtered by OriginNodeID == self.
  • Standalone: defensive guard in sweepOrphanStorage refuses to act (no manifest to reconcile against).

Manifest-before-storage invariant upheld at the cycle level: orphan-manifest sweep runs first; orphan-storage runs only if the first half completes cleanly. Mirrors the retention chunked-batch pattern at retention.go:818-879 — same chunk size, same manifest-failure abort semantics, same per-file marshal safety.

Re-checks before every delete catch concurrent writer/retention/compaction races. Sorted output (computeDiff) makes cap-bounded runs deterministic about which orphans get cleaned. Per-chunk gate re-evaluation ensures lease handoff aborts the run promptly.
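
In code, the invariant reduces to a strict ordering. A minimal sketch, assuming the sweep method names used elsewhere in this PR (the receiver and signatures are illustrative, not the PR's actual code):

```go
func (r *Reconciler) runCycle(ctx context.Context, run *Run) error {
	// Half 1: repair manifest entries that point at missing files.
	if err := r.sweepOrphanManifest(ctx, run); err != nil {
		return err // abort the cycle; storage is never touched after a dirty pass
	}
	// Half 2: delete storage files with no manifest entry, only after a
	// clean first half.
	return r.sweepOrphanStorage(ctx, run)
}
```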

API

| Route | Auth | Behavior |
| --- | --- | --- |
| GET /api/v1/reconciliation/status | RequireRead | Last 10 runs, gate state, schedule, next_run |
| POST /api/v1/reconciliation/trigger | RequireAdmin | Defaults to dry-run; act mode requires both dry_run=false AND act=true (defense against accidental curl) |
| GET /api/v1/reconciliation/runs/:id | RequireAdmin | Single run from the in-memory ring buffer |
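
For illustration, a manual trigger via curl (host, port, and the auth header are assumptions; the dual-flag requirement is the one described above):

```bash
# Default: dry-run report, nothing is deleted.
curl -X POST -H "Authorization: Bearer $ARC_ADMIN_TOKEN" \
  "http://localhost:8080/api/v1/reconciliation/trigger"

# Act mode needs BOTH flags; either one alone stays dry-run.
curl -X POST -H "Authorization: Bearer $ARC_ADMIN_TOKEN" \
  "http://localhost:8080/api/v1/reconciliation/trigger?dry_run=false&act=true"
```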

Configuration (ARC_RECONCILIATION_*)

Defaults lean conservative: enabled=false, schedule="17 4 * * *", grace_window=24h, max_deletes_per_run=10000, max_manifest_size=200000. manifest_only_dry_run exists as a safety bridge for operators who want a few weeks of report-only runs before flipping to full auto.
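
An illustrative arc.toml fragment with these defaults (the section name and key spellings are inferred from the ARC_RECONCILIATION_* env prefix, so treat them as assumptions):

```toml
[reconciliation]
enabled               = false          # off by default
schedule              = "17 4 * * *"   # daily at 04:17
grace_window          = "24h"          # never act on drift younger than this
max_deletes_per_run   = 10000          # per-run blast cap
max_manifest_size     = 200000         # abort the run above this entry count
manifest_only_dry_run = true           # report-only bridge before full auto
```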

Audit

Events are emitted per batch (not per file, to keep the audit log proportional): reconcile.run_started, reconcile.run_completed, reconcile.run_aborted, reconcile.manifest_batch_deleted, reconcile.storage_batch_deleted, reconcile.cap_hit. The Run.WalkPartial flag plus Run.Errors surface per-prefix list failures so an operator can disambiguate "0 orphans found" from "we couldn't see most of the bucket".

License

Rides on FeatureClustering at coordinator construction. Standalone Arc has no manifest, so reconciliation is naturally Enterprise-only — no separate flag (avoiding InfluxDB-style fragmentation).

Scope of changes

  • New: internal/reconciliation/ (engine, scheduler, gate, walk, diff, two sweeps, tests)
  • New: internal/api/reconciliation.go (handler with auth-wrapped routes)
  • Wiring: cmd/arc/main.go (constructor + dual-mode reconciliationClusterGate)
  • Config: internal/config/config.go (ReconciliationConfig + viper defaults)
  • Comment cleanup: removed Phase 5 TODOs at retention/delete/compaction/coordinator
  • Release notes: RELEASE_NOTES_2026.05.1.md

No FSM changes. No new Raft commands. No storage.Backend interface changes. Reuses retention's BatchFileOpsInManifest chunking pattern verbatim.

Test plan

  • Unit tests (40+) covering: clean state, orphan-manifest only, orphan-storage only, both kinds mixed, grace window, clock skew, manifest-too-large abort, Raft error abort, ctx cancel, dry-run vs act, idempotent already-gone delete, blast cap mid-run, race re-check, local-mode foreign-origin filter, walk-partial flag, tightened path-shape check
  • Real-backend integration tests against LocalBackend (real fs walk, real Chtimes, real Exists)
  • Scheduler tests (cron lifecycle, gating, license, no-overlap, TriggerNow, status fields)
  • go build ./cmd/... ./internal/... clean
  • go vet clean
  • Three parallel staff/principal-engineer reviews (correctness / security / code quality) — all blocking findings addressed in the same commit
  • Manual end-to-end with two-node MinIO docker-compose (planted drift)

Reviewer notes

  • The Coordinator interface in reconciler.go is the only cross-package surface; defined in the consumer package to break the cluster→reconciliation→cluster cycle.
  • markAborted defaults unknown errors to AbortUnknown (not AbortRaftQuorumLoss) to avoid polluting Raft dashboards.
  • Run.WalkPartial is set when any per-prefix list failed; per-prefix errors are written into Run.Errors.
  • looksLikeManagedPath requires the full 7-segment Arc layout (db/m/yyyy/mm/dd/hh/file) and rejects ./.. segments — the strict-mode safety net for DeletePreManifestOrphans=false.
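
A minimal sketch of that check (the segment rules follow the bullet above; the .parquet suffix test is an assumption borrowed from the candidate filter, and the real function may differ in detail):

```go
package reconciliation

import "strings"

func looksLikeManagedPath(p string) bool {
	segs := strings.Split(p, "/")
	if len(segs) != 7 { // db/measurement/yyyy/mm/dd/hh/file
		return false
	}
	for _, s := range segs {
		if s == "" || s == "." || s == ".." {
			return false // reject empty, ".", and ".." segments
		}
	}
	return strings.HasSuffix(segs[6], ".parquet") // assumed suffix rule
}
```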

🤖 Generated with Claude Code


Closes #391

Periodically detects and repairs drift between the Raft-replicated cluster
file manifest and physical storage. Two kinds of drift:

- Orphan manifest entries (manifest references a missing file)
- Orphan storage files (file exists but no manifest entry)

Druid-style posture: off by default; once enabled, cron auto-acts on drift
older than a 24h grace window with a 10k per-run blast cap. Mirrors the
retention chunked-batch + abort-on-Raft-error pattern at retention.go:818-879.

Cluster gate is dual-mode by backend topology:
- Shared storage (S3/Azure/MinIO): one node sweeps via IsActiveCompactor
- Local storage: every node walks own disk, filtered by OriginNodeID
- Standalone: defensive guard refuses to act (no manifest to reconcile)

Manifest-before-storage invariant upheld at the cycle level: orphan-manifest
sweep runs first, orphan-storage runs only if the first half completes.
Re-checks before every delete catch concurrent writer/retention/compaction
races. Sorted output makes cap-bounded runs deterministic.

API: GET /status (RequireRead), POST /trigger (RequireAdmin, defaults
dry_run=true; act mode requires both dry_run=false AND act=true), GET
/runs/:id (RequireAdmin). Audit events emitted per batch with bounded
sample paths.
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist this PR adds Phase 5 manifest-vs-storage reconciliation (anti-entropy) for Arc Enterprise clusters. Please pay particular attention to:

  1. Concurrent-execution races between the reconciler and writer flush / retention / compaction (manifest-before-storage invariant + per-file re-checks).
  2. Standalone-mode safety — defensive guard in sweepOrphanStorage rejects BackendStandalone; verify no other code path can reach destructive deletes without a manifest.
  3. Local-storage OriginNodeID filtering in manifestToKeys — does it correctly avoid false orphan-manifest signals on peer-replicated files?
  4. Cap accounting — the per-chunk vs per-batch CapHit detection in both sweeps. Look for off-by-one or subtle edge cases when re-check skips fill the chunk.
  5. Path-shape validation in looksLikeManagedPath (sweep_storage.go) — strict-mode (DeletePreManifestOrphans=false) safety.
  6. Gate-revoke vs Raft-quorum-loss classification in markAborted — gate handoff should not pollute Raft-failure dashboards.

Three internal staff/principal-level review agents already ran; their blocking findings are addressed in this PR. Looking for fresh eyes.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciler to detect and repair drift between the Raft-replicated cluster file manifest and physical storage, specifically addressing orphan manifest entries and orphan storage files. The implementation includes a core reconciliation engine, a cron-based scheduler, and a new API handler for status reporting and manual triggers. Feedback focuses on a potential data loss risk where storage backends lacking ObjectLister support bypass the grace window due to zeroed modification times. Additionally, there are concerns regarding the robustness of error classification using substring matching and the potential for excessive I/O during the root walk discovery process in clusters with many databases.

Comment thread internal/reconciliation/walk.go Outdated
Comment thread internal/reconciliation/reconciler.go Outdated
Comment thread internal/reconciliation/walk.go Outdated
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a manifest-vs-storage reconciliation system for Enterprise clusters to repair drift between Raft metadata and physical storage. The implementation includes a periodic reconciler with safety features like grace windows and blast caps, a cron-based scheduler, and a new API for status monitoring and manual triggers. Review feedback suggests several optimizations, including increasing initial map capacities for better performance, using strings.Join for efficient path sampling in audit logs, improving the robustness of error classification, and explicitly clearing large manifest slices to reduce memory pressure during runs.

Comment thread internal/reconciliation/walk.go Outdated
Comment thread internal/reconciliation/sweep_manifest.go Outdated
Comment thread internal/reconciliation/reconciler.go Outdated
Comment thread internal/reconciliation/reconciler.go
Round 1 (HIGH + 2 MEDIUMs):

- HIGH: zero-mtime files now treated as YOUNG (protected) instead of
  eligible for deletion. The List+StatFile fallback path (used when a
  backend doesn't implement ObjectLister) produces zero LastModified
  values; the previous behavior would have made those files
  immediately eligible, bypassing the grace window. Conservative
  default prevents data loss; emit a Warn so operators know the
  safety net engaged. (walk.go, diff.go)

- MEDIUM: replace substring-match isRaftError with errors.Is against
  a new raft.ErrManifestApply sentinel. Coordinator now wraps every
  manifest-mutation error (RegisterFile / DeleteFile / BatchFileOps /
  UpdateFile, leader and forwarded paths) via errors.Join so the
  classification is robust to future log-string renames. Sentinel
  lives in the leaf raft package to avoid an import cycle.

- MEDIUM: bound derivePrefixes' root-walk via cfg.MaxRootWalkDatabases
  (default 1000). On a sprawling cluster or shared-bucket leak, the
  unbounded inner ListDirectories loop could trigger thousands of
  metadata calls per tick. Cap=0 disables the root walk entirely.

Round 2 (3 MEDIUMs):

- MEDIUM: bump derivePrefixes set map initial capacity from 32 to 128
  to cover typical production schemas without re-allocs.

- MEDIUM: replace joinSample loop concatenation with strings.Join.

- MEDIUM: nil out the full manifest slice after reducing to ObjectKeys
  so ~40 MB of pinned FileEntry data (SHA256, SizeBytes, CreatedAt)
  becomes GC-eligible during the rest of a 200k-entry run.

Refs PR #412.
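
A sketch of the conservative zero-mtime rule above (the function name matches the follow-up comment; the exact signature is assumed):

```go
package reconciliation

import "time"

// isYoungerThan treats a zero LastModified as young: a backend that cannot
// report mtimes must never make files immediately eligible for deletion.
func isYoungerThan(lastModified time.Time, grace time.Duration, now time.Time) bool {
	if lastModified.IsZero() {
		return true // unknown age: protect it and let the caller Warn
	}
	return now.Sub(lastModified) < grace
}
```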
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist addressed your findings in commit a04b862. Summary:

Round 1 (HIGH + 2 MEDIUMs)

  • ✅ Zero-mtime files now treated as YOUNG (protected) — isYoungerThan returns true for zero values; List+StatFile fallback emits a Warn so operators see the degraded-but-safe path engaged.
  • ✅ Replaced isRaftError substring matching with errors.Is(err, raft.ErrManifestApply). New sentinel in internal/cluster/raft/errors.go; coordinator wraps every manifest-mutation error via errors.Join.
  • ✅ Added cfg.MaxRootWalkDatabases (default 1000, 0 disables) to cap derivePrefixes root-walk fan-out.

Round 2 (3 MEDIUMs)

  • ✅ Bumped derivePrefixes set map initial capacity 32 → 128.
  • ✅ Replaced joinSample loop concat with strings.Join.
  • ✅ Nil out the full manifest slice after reducing to ObjectKeys (~40 MB GC-able mid-run on 200k-entry manifests).

New tests added: zero-mtime-protected, root-walk-cap-bounds-listdirectories, root-walk-cap-zero-disables.

Looking for fresh eyes on the typed-error wiring (verify no other manifest-write callers were missed) and any additional findings.
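
For reviewers tracing the typed-error wiring, the sentinel shape in question looks roughly like this (file path from the commit; the Wrap helper is illustrative — the PR wraps via errors.Join at each coordinator call site):

```go
// internal/cluster/raft/errors.go
package raft

import "errors"

// ErrManifestApply marks every manifest-mutation failure (RegisterFile,
// DeleteFile, BatchFileOps, UpdateFile) so callers can classify with
// errors.Is instead of matching log strings.
var ErrManifestApply = errors.New("raft: manifest apply failed")

// Wrap joins the sentinel onto a concrete apply error.
func Wrap(err error) error {
	if err == nil {
		return nil
	}
	return errors.Join(ErrManifestApply, err)
}
```

Callers then classify with errors.Is(err, raft.ErrManifestApply), which survives any future log-string rename.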

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a new manifest-vs-storage reconciliation feature to detect and repair drift between the Raft-replicated manifest and physical storage. The implementation includes a periodic reconciler, a scheduler, and API endpoints for status and manual triggering. I have identified several issues in the root walk logic where database-level checks were ineffective, and I have suggested improvements for log management to prevent log flooding during transient storage failures.

Comment thread internal/reconciliation/walk.go
Comment thread internal/reconciliation/walk.go Outdated
Comment thread internal/reconciliation/sweep_manifest.go Outdated
Comment thread internal/reconciliation/sweep_storage.go Outdated
Comment thread internal/reconciliation/walk.go Outdated
- HIGH: derivePrefixes was checking the database-only key against the
  database/measurement set, which always missed — every known database
  consumed MaxRootWalkDatabases budget and starved the discovery of
  truly-unknown databases. Build a separate dbSet keyed by
  "database/" so the membership check is correct. Regression test
  TestReconcile_RootWalkCapDoesNotCountKnownDatabases pins the fix.

- MEDIUM: rate-limit storage.Exists failure logs in sweepOrphanManifest.
  A transient backend hiccup that fails Exists across an entire chunk
  produced N Warns, drowning operator dashboards. Per-file detail now
  goes into run.Errors (capped at 32) and a single summary Warn fires
  at the end of each batch.

- MEDIUM: same pattern for storage.Delete failures in
  applyStorageDeletes — already captured in run.Errors, now emits one
  summary Warn per batch instead of one per file.

- MEDIUM: hoist the ObjectLister-missing Warn to once-per-run in
  walkStorage instead of once-per-prefix in listPrefix. On a cluster
  with 1000 measurements this drops the warning count from 1000 to 1.

Refs PR #412.
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 3 fixes pushed in commit 96802d6:

  • ✅ HIGH: dbSet keyed by "database/" fixes the root-walk cap-budget bug (was checking against the "database/measurement/" set, always missed). Regression test added.
  • ✅ MEDIUM: per-batch summary Warn for storage.Exists failures in sweepOrphanManifest (was 1 Warn per file).
  • ✅ MEDIUM: same pattern for storage.Delete failures in applyStorageDeletes.
  • ✅ MEDIUM: ObjectLister-missing Warn hoisted to once-per-run in walkStorage (was per-prefix).

Per-file failure detail still flows into run.Errors (capped at 32). Looking for any remaining findings.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces the Phase 5 Manifest-vs-Storage Reconciliation feature, which periodically identifies and resolves drift between the cluster's Raft manifest and physical storage. The implementation includes a new reconciliation package, a cron-based scheduler, and an HTTP API for status monitoring and manual triggers. Review feedback highlights several improvement opportunities: bounding the initial directory listing in the root walk to prevent potential memory issues, parallelizing synchronous existence checks during the manifest sweep to improve performance on remote backends, and utilizing a circular buffer for the run history to avoid inefficient slice prepending.

Comment thread internal/reconciliation/walk.go Outdated
Comment thread internal/reconciliation/sweep_manifest.go Outdated
Comment thread internal/reconciliation/reconciler.go Outdated
All three round-4 findings addressed (no defers):

- MEDIUM: bound the initial ListDirectories("") iteration in
  derivePrefixes to 4 × MaxRootWalkDatabases. The previous bound only
  capped the inspected counter; a shared bucket with thousands of
  unrelated top-level prefixes could still pin megabytes of dirnames
  and burn CPU on the dbSet lookup loop. Headroom of 4× covers the
  case where most top-level entries are known databases that the
  dbSet check skips. Operator-visible Warn fires when the limit
  trims the returned list.

- MEDIUM: parallel storage.Exists during the manifest sweep.
  Sequential HEAD-per-file on S3/Azure was the slow path; with chunks
  up to BatchSize=1000 a manifest sweep could take minutes per
  network RTT. New parallelExists helper runs a bounded worker pool
  (default 8 via RecheckConcurrency, 1 forces sequential) feeding
  into a fixed-index result slice so the existing sequential
  ops-build loop is unchanged. Conservative default stays under
  cloud per-second rate limits. Two new tests pin the concurrency
  bound and the sequential-when-1 mode.

- MEDIUM: ring buffer for runHistory. The previous
  append([]*Run{run}, r.runHistory...) prepend was O(N); the new
  fixed-size slice + head index + size counter is O(1). Read methods
  walk the ring newest-first; behavior of RecentRuns and FindRun is
  unchanged from the caller's perspective. Regression test pins
  newest-first ordering and FIFO eviction across wrap-around.

Refs PR #412.
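
A sketch of the bounded-pool shape described in the second item above (the parallelExists name comes from the commit; the backend interface and signature are assumptions):

```go
package reconciliation

import (
	"context"
	"sync"
)

// existsChecker is the minimal slice of the storage backend this sketch needs.
type existsChecker interface {
	Exists(ctx context.Context, path string) (bool, error)
}

// parallelExists re-checks paths with at most `workers` in-flight calls,
// writing into fixed-index slices so the caller's sequential ops-build loop
// stays unchanged.
func parallelExists(ctx context.Context, be existsChecker, paths []string, workers int) ([]bool, []error) {
	if workers < 1 {
		workers = 1 // RecheckConcurrency=1 forces sequential
	}
	results := make([]bool, len(paths))
	errs := make([]error, len(paths))
	sem := make(chan struct{}, workers) // bounded worker pool
	var wg sync.WaitGroup
	for i, p := range paths {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, p string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i], errs[i] = be.Exists(ctx, p)
		}(i, p)
	}
	wg.Wait()
	return results, errs
}
```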
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 4 fixes pushed in commit 86b69fd. All three findings addressed (no defers):

  • ✅ MEDIUM: bound initial ListDirectories("") iteration to 4 × MaxRootWalkDatabases. Operator-visible Warn fires when the limit trims the returned list.
  • ✅ MEDIUM: parallel storage.Exists worker pool in the manifest sweep — new RecheckConcurrency config (default 8, set to 1 to force sequential). Two new tests assert the concurrency bound and sequential mode.
  • ✅ MEDIUM: ring buffer for runHistory — O(1) record (was O(N) prepend). Regression test pins newest-first ordering and FIFO wrap-around eviction.

This should be the final pass — looking for any remaining findings before merge.
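
The ring-buffer change summarized above, sketched (Run is the package's run-summary type; the struct and method names here are illustrative):

```go
package reconciliation

// runRing keeps the last N runs with O(1) record; reads walk newest-first.
type runRing struct {
	buf  []*Run
	head int // index of the most recent run, -1 when empty
	size int
}

func newRunRing(capacity int) *runRing {
	return &runRing{buf: make([]*Run, capacity), head: -1}
}

func (r *runRing) record(run *Run) {
	r.head = (r.head + 1) % len(r.buf) // O(1); the oldest entry is overwritten
	r.buf[r.head] = run
	if r.size < len(r.buf) {
		r.size++
	}
}

func (r *runRing) recent(n int) []*Run {
	if n > r.size {
		n = r.size
	}
	out := make([]*Run, 0, n)
	for i := 0; i < n; i++ { // newest-first, walking back from head
		idx := (r.head - i + len(r.buf)) % len(r.buf)
		out = append(out, r.buf[idx])
	}
	return out
}
```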

…dditions

The original release-notes entry for the reconciler reflected commit
ab0d4ce; rounds 1-4 of the gemini review added behavior the operator
docs hadn't caught up with:

- Standalone-mode safety guard (orphan-storage sweep refuses to act)
- AbortReason taxonomy via raft.ErrManifestApply sentinel
  (raft_quorum_loss / lease_lost / unknown distinct on dashboards)
- Parallel storage.Exists re-checks (RecheckConcurrency, default 8)
- Root-walk fan-out cap (MaxRootWalkDatabases, default 1000)
- Strict 7-segment looksLikeManagedPath when DeletePreManifestOrphans=false
- Conservative protection on un-mtime'd backends (zero LastModified -> YOUNG)
- WalkPartial flag + per-prefix errors in run summary
- Auth wiring on all 3 routes (RequireRead on /status, RequireAdmin on
  /trigger and /runs/:id)

Two new config knobs added to the table: max_root_walk_databases,
recheck_concurrency.
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciler to detect and repair drift between the Raft-replicated file manifest and physical storage, addressing both orphan manifest entries and storage files. The implementation includes a new internal/reconciliation package, a cron-based scheduler, and administrative API endpoints for status and manual triggers. Review feedback suggests improving code organization by centralizing utility functions, clarifying release note descriptions for path-based deletion rules, and optimizing the scheduler by caching parsed cron schedules to avoid redundant parsing.

Comment thread internal/reconciliation/sweep_storage.go Outdated
Comment thread RELEASE_NOTES_2026.05.1.md
Comment thread internal/reconciliation/scheduler.go
Combined gemini round-5 (3 items) + 4 parallel internal staff reviews
(correctness / code-quality / security / perf-observability) into a
single commit so the next gemini pass lands on a near-clean diff.

HIGH (correctness — internal):
- CapHit lost when a manifest-sweep chunk is all-recheck-skipped at the
  cap boundary: continue skipped the audit + return paths so the
  reconciler kept doing recheck I/O on subsequent chunks. Same bug in
  storage sweep where len(toDelete)==0 used continue instead of return.
- WalkPartial and CapHit not in run_completed/run_aborted audit detail.
  SIEM consumers had no way to tell "0 orphans, all clean" from "blind
  walk produced 0 candidates" or "we hit the cap and stopped early."
- Inner ListDirectories failures in derivePrefixes silently swallowed —
  now surfaced via prefixDiscovery.errors and Run.WalkPartial.
- ctx.Err() not checked inside the root-walk loop after my round-4
  scan-limit change; a misbehaving backend could burn the whole budget
  after cancel. Added per-iteration check.

HIGH (gemini + code-quality):
- appendBounded was in sweep_storage.go but used cross-file. Moved all
  four cap/sample helpers (appendBounded, joinSample, sampleStrings,
  sampleCandidatePaths, boolStr) into helpers.go. Renamed `cap` shadow
  to `limit`. Collapsed sampleStrings/sampleCandidatePaths into a
  generic sampleSlice.
- scheduler.tick used direct == against sentinel errors (silently
  breaks if anyone wraps them). Switched to errors.Is.

MEDIUM (security):
- License re-validation middleware on /trigger and /runs/:id (per
  CLAUDE.md "License middleware goes after auth middleware"). New
  License.CanUseClustering() / Client.CanUseClustering() and a
  requireClusteringLicense fiber middleware mirroring the
  requireTieringLicense pattern. /status remains license-free so
  expired clusters can still report disabled state.
- applyDefaults clamps operator overrides: MaxRootWalkDatabases at
  1_000_000 (defends against int overflow in the × 4 scan-limit
  arithmetic), RecheckConcurrency at 64 (cloud rate-limit safety),
  negative MaxDeletesPerRun / GraceWindow / etc default to safe values
  (no negative-slice-index panics).

MEDIUM (perf):
- ErrGateRevoked sentinel + errors.Is replaces the substring-match
  isLeaseLostError. Both sweeps wrap %w ErrGateRevoked.
- Cached parsed cron.Schedule on the Scheduler struct + module-level
  cronParser. nextRun no longer re-parses on every Status() request.
- Replaced fmt.Sprintf("%d", n) hot-path calls with strconv.Itoa /
  strconv.FormatInt in the audit-emit and cap-hit paths.
- computeDiff now uses a single map[string]bool (manifestSeen) instead
  of two separate maps. Saves ~16 MB of transient allocation per run
  on a 200k-entry manifest.

MEDIUM (code quality):
- backend_kind threaded into run_completed and run_aborted audit
  details. Operators inspecting an audit row can now tell shared-vs-
  local without correlating to status.
- skipped_recheck added to run_completed audit.
- Dropped unreachable LocalNodeID=="" branch in manifestToKeys
  (NewReconciler already rejects that combination).

MEDIUM (release notes):
- Tightened delete_pre_manifest_orphans description per gemini's
  feedback — explicitly calls out the 7-segment layout and `.`/`..`
  rejection rather than the vague "database/measurement/..." wording.

New tests:
- TestApplyDefaults_ClampsOperatorOverridesToCeilings — pins the new
  MaxRootWalkDatabases / RecheckConcurrency / MaxDeletesPerRun clamp
  semantics.

CLAUDE.md updated locally (not committed — gitignored): post-
implementation review section now mandates a 4-agent parallel review
including a perf/observability/operational-hygiene lens explicitly
charged with catching the line-level items gemini flags.

Refs PR #412.
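
The single-map diff from the perf section, sketched (computeDiff is the PR's function name; this signature is an assumption):

```go
package reconciliation

import "sort"

// computeDiff classifies drift with one map instead of two: manifest keys
// start false and are flipped true when seen in storage.
func computeDiff(manifestKeys, storagePaths []string) (orphanManifest, orphanStorage []string) {
	manifestSeen := make(map[string]bool, len(manifestKeys))
	for _, k := range manifestKeys {
		manifestSeen[k] = false
	}
	for _, p := range storagePaths {
		if _, ok := manifestSeen[p]; ok {
			manifestSeen[p] = true // present on both sides: not drift
		} else {
			orphanStorage = append(orphanStorage, p) // file with no manifest entry
		}
	}
	for k, seen := range manifestSeen {
		if !seen {
			orphanManifest = append(orphanManifest, k) // manifest entry with no file
		}
	}
	// Sorted output keeps cap-bounded runs deterministic about which
	// orphans get cleaned first.
	sort.Strings(orphanManifest)
	sort.Strings(orphanStorage)
	return orphanManifest, orphanStorage
}
```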
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 5 fixes pushed in commit 3f41088. Combined gemini's 3 round-5 findings + 4 parallel internal staff/principal reviews into one commit so this lands closer to a clean diff.

HIGH (correctness, internal review):

  • ✅ CapHit lost when chunk all-recheck-skipped — both sweeps now return nil instead of continue past the cap audit.
  • walk_partial / cap_hit / backend_kind / skipped_recheck now emitted in run_completed/run_aborted audit detail.
  • ✅ Inner ListDirectories failures in derivePrefixes now surfaced via Run.WalkPartial + Run.Errors.
  • ctx.Err() checked inside the root-walk loop.

HIGH (gemini + code-quality):

  • appendBounded + joinSample + sample helpers consolidated into new helpers.go. Generic sampleSlice[T] collapses two near-duplicates.
  • scheduler.tick switched from == to errors.Is for sentinel checks.

MEDIUM (security):

  • ✅ License re-validation middleware (requireClusteringLicense) on /trigger and /runs/:id. New License.CanUseClustering() / Client.CanUseClustering().
  • ✅ Config clamps in applyDefaults: MaxRootWalkDatabases ≤ 1_000_000 (defends against × 4 overflow), RecheckConcurrency ≤ 64, negative values default safely.

MEDIUM (perf):

  • ErrGateRevoked sentinel + errors.Is replaces substring-match isLeaseLostError.
  • cron.Schedule cached on Scheduler; nextRun no longer re-parses on every Status() call.
  • fmt.Sprintf("%d", n) → strconv.Itoa / strconv.FormatInt across audit-emit hot paths.
  • computeDiff uses a single map[string]bool instead of two maps. ~16 MB saved per 200k-entry run.

MEDIUM (code quality):

  • backend_kind + skipped_recheck in run_completed/run_aborted audit detail.
  • ✅ Dropped unreachable LocalNodeID=="" branch in manifestToKeys.

MEDIUM (release notes):

  • ✅ Tightened delete_pre_manifest_orphans description with explicit 7-segment layout + ./.. rejection note.

New regression test: TestApplyDefaults_ClampsOperatorOverridesToCeilings pins the clamp semantics.

Looking for any final findings before merge.
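
The license middleware from the security section, sketched with gofiber-style signatures (the concrete license client type is this PR's; the interface here is an assumption):

```go
package api

import "github.com/gofiber/fiber/v2"

// clusteringLicense is the minimal capability this sketch needs.
type clusteringLicense interface {
	CanUseClustering() bool
}

// requireClusteringLicense re-validates the clustering entitlement after the
// auth middleware, mirroring the requireTieringLicense pattern. /status is
// deliberately left unwrapped so expired clusters can still report state.
func requireClusteringLicense(lic clusteringLicense) fiber.Handler {
	return func(c *fiber.Ctx) error {
		if !lic.CanUseClustering() {
			return c.Status(fiber.StatusForbidden).JSON(fiber.Map{
				"error": "clustering feature not licensed",
			})
		}
		return c.Next()
	}
}
```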

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciler to detect and repair drift between the Raft-replicated cluster file manifest and physical storage, addressing both orphan manifest entries and orphan storage files. The implementation includes a core reconciliation engine, a cron-based scheduler, an HTTP API for status and manual triggers, and integration with the existing clustering and storage layers. Feedback highlights a logic error in applyDefaults that prevents disabling the root walk by setting MaxRootWalkDatabases to 0. Additionally, it is suggested to change the default for delete_pre_manifest_orphans to false to prevent accidental data loss in shared storage environments.

Comment thread internal/reconciliation/reconciler.go Outdated
Comment thread RELEASE_NOTES_2026.05.1.md Outdated
Gemini round 6 flagged two items; pre-emptive same-shape audit confirmed
gemini's findings and surfaced no additional fields with the same bug
shape. Combined into a single commit so the next gemini pass lands on
a clean diff.

HIGH (gemini): config-driven MaxRootWalkDatabases=0 was silently
overridden to 1000 by applyDefaults, preventing operators from
reaching the disabled state via arc.toml. Convention now: `< 0` means
"operator did not set this, fill default 1000"; `0` means "explicitly
disabled" and passes through to the downstream guard at
walk.go:derivePrefixes (which gates root walk on `> 0`); positive
values used (clamped at maxRootWalkDatabasesCap to defend against
× 4 overflow). All other fields remain on `<= 0` "fill default"
semantics — overloading every numeric knob with shape-#1 semantics
would diverge from the existing ManifestOnlyDryRun behavioral knob
and surprise operators in unrelated config paths.

MEDIUM (gemini): delete_pre_manifest_orphans default flipped from
true to false. Shared-bucket deployments where stray non-Arc files
co-exist with Arc data must NOT see those files deleted on first
cron run. Operators who deliberately want pre-Phase-1 / migration-
residual cleanup now opt in explicitly. Mirrors Iceberg's
remove_orphan_files / Delta VACUUM secure-by-default posture.

MEDIUM (internal review): manifest_only_dry_run default flipped
from false to true. Even though the feature is opt-in via
reconciliation.enabled, flipping enabled=true alone should not
auto-delete on the very first cron tick. Operators review dry-run
audits, then explicitly flip manifest_only_dry_run=false to allow
real deletes. Druid's coordinator kill task and Iceberg's
remove_orphan_files ship the same posture.

Test infrastructure:
- newReconciler test helper sets MaxRootWalkDatabases=-1 when the
  caller passes 0 so tests don't accidentally land in the disabled
  state; tests that DO want disabled set 0 explicitly.
- Two integration tests using NewReconciler directly add
  MaxRootWalkDatabases:-1 to opt into root walk.
- New TestApplyDefaults_MaxRootWalkDatabasesZeroFromConfigDisablesRootWalk
  pins the gemini-flagged convention.

Release notes updated: "off by default AND report-only on first run"
prose, plus the two flipped default rows in the config table.

Refs PR #412.
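
The three-way convention, sketched (constant names are illustrative; only MaxRootWalkDatabases carries the sentinel semantics):

```go
package reconciliation

const (
	defaultMaxRootWalkDatabases = 1000
	maxRootWalkDatabasesCap     = 1_000_000 // guards the ×4 scan-limit arithmetic
)

// applyMaxRootWalkDefault: <0 means "unset, fill default"; 0 means
// "explicitly disabled" and passes through to derivePrefixes, which gates
// the root walk on >0; positive values are clamped.
func applyMaxRootWalkDefault(v int) int {
	switch {
	case v < 0:
		return defaultMaxRootWalkDatabases
	case v == 0:
		return 0 // disabled
	case v > maxRootWalkDatabasesCap:
		return maxRootWalkDatabasesCap
	default:
		return v
	}
}
```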
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 6 fixes pushed in commit 38da2a2.

HIGH (your finding):

  • MaxRootWalkDatabases=0 from config now reaches the disabled state. Convention: < 0 means "operator did not set this, fill default"; 0 means "disabled"; > 0 is the cap (clamped at the int-overflow ceiling). Other numeric fields stay on <= 0 "fill default" — overloading every knob with shape-#1 semantics would diverge from the existing ManifestOnlyDryRun behavioral knob already documented for audit-only mode.
  • ✅ New regression test: TestApplyDefaults_MaxRootWalkDatabasesZeroFromConfigDisablesRootWalk.

MEDIUM (your finding) — secure-by-default flips:

  • delete_pre_manifest_orphans default → false. Shared-bucket deployments are protected from accidental stray-file deletion on first cron.
  • manifest_only_dry_run default → true (additional internal-review finding of the same shape). First cron tick produces a dry-run audit; operator flips after reviewing. Mirrors Druid kill / Iceberg remove_orphan_files first-run posture.

Process note: I ran a pre-emptive same-shape audit on the package looking for other fields with the config-defaults-vs-disabled confusion you spotted. Confirmed there were no other instances — only MaxRootWalkDatabases had the overloaded ==0 check. Audit also confirmed two more secure-by-default opportunities (the manifest_only_dry_run flip above + larger policy-level changes around grace window / blast cap defaults that I'm leaving as a follow-up issue rather than baking into this PR).

Looking for any final findings before merge.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciler to detect and repair drift between the Raft-replicated cluster file manifest and physical storage, addressing both orphan manifest entries and orphan storage files. The implementation includes a core reconciliation engine, a cron-based scheduler, and API endpoints for status and manual triggers. Feedback suggests refining the DryRun logic in the core engine to allow manual overrides via the API even when the global dry-run configuration is enabled. Additionally, the check for excluding compaction state files should be made more specific to avoid incorrectly ignoring valid user data that might contain the reserved string in its path.

Comment thread internal/reconciliation/reconciler.go Outdated
Comment thread internal/reconciliation/walk.go Outdated
…ction-state filter

Gemini round 7 flagged two MEDIUMs; pre-emptive same-shape audit
escalated #1 to HIGH (operator-intent suppression with no diagnostic)
and confirmed no other instances of either shape elsewhere in the
package.

HIGH (gemini, escalated): Reconcile previously did
`DryRun = dryRun || cfg.ManifestOnlyDryRun`. With the round-6 secure-
by-default flip of ManifestOnlyDryRun=true, this silently forced
every Reconcile call into dry-run, including the manual API path.
Operators hitting POST /trigger?dry_run=false&act=true (which already
encodes the explicit double-confirmation) would be silently denied
the act-mode they asked for. The cron path still passes
cfg.ManifestOnlyDryRun directly via scheduler.tick, so dropping the OR
preserves cron safety while honoring legitimate manual overrides.

Two regression tests pin both behaviors:
- TestReconcile_ManualOverrideHonorsDryRunFalseEvenWithCfgDryRunOnly
- TestReconcile_CronPathDryRunStillRespectsCfg

MEDIUM (gemini): isParquetCandidate used
`strings.Contains(p, "_compaction_state/")` which is a substring
match, not segment-aware. A measurement named with that substring
embedded — e.g. "db/m_compaction_state_logs/..." — would have all
its valid data files silently filtered out of orphan-storage
candidacy. Replaced with an explicit `strings.Split + segment ==
"_compaction_state"` check, mirroring the segment-aware shape used
by looksLikeManagedPath. Added three regression cases to the
TestIsParquetCandidate table covering substring-but-not-segment
positions where false positives would previously trigger.

Refs PR #412.
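
The fix in before/after form (a fragment for orientation, not the PR's literal diff):

```go
// Before round 7: the OR silently forced dry-run everywhere once the
// round-6 default flip landed, including the manual API path.
//	run.DryRun = dryRun || r.cfg.ManifestOnlyDryRun

// After: the engine honors the caller; the cron path stays safe because
// scheduler.tick passes cfg.ManifestOnlyDryRun as the dryRun argument.
//	run.DryRun = dryRun
```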
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 7 fixes pushed in commit ba98d58.

MEDIUM (your finding, internally re-graded HIGH) — manual API override now honored:

  • ✅ Dropped || r.cfg.ManifestOnlyDryRun from Reconcile. The cron path already passes cfg.ManifestOnlyDryRun via scheduler.tick (line 146), so cron safety is preserved upstream. The OR was silently masking legitimate manual API overrides — operators hitting POST /trigger?dry_run=false&act=true (which already encodes the double-confirmation) would otherwise be denied act-mode.
  • Two regression tests pin both behaviors: TestReconcile_ManualOverrideHonorsDryRunFalseEvenWithCfgDryRunOnly and TestReconcile_CronPathDryRunStillRespectsCfg.

MEDIUM (your finding) — segment-aware compaction-state filter:

  • ✅ Replaced strings.Contains(p, "_compaction_state/") with an explicit strings.Split + segment == "_compaction_state" check. A measurement named with that substring embedded (db/m_compaction_state_logs/...) is no longer silently filtered out. Mirrors the segment-aware shape already used by looksLikeManagedPath.
  • Added three regression cases to TestIsParquetCandidate covering substring-but-not-segment positions.

Pre-emptive same-shape audit: ran a parallel staff-engineer audit before fixing, looking for other belt-and-braces overrides that suppress operator intent and other path-shape substring matches. Confirmed no other instances of either bug shape in the package.

Looking for any final findings before merge.

@xe-nvdk xe-nvdk merged commit 829f5d8 into main Apr 27, 2026
6 checks passed
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciliation system for Enterprise clusters to detect and repair drift between the Raft-replicated file manifest and physical storage. It addresses orphan manifest entries and orphan storage files with safety features like dry-run modes, grace windows, and blast caps. Feedback suggests improving the root-walk discovery to cover orphan measurements in known databases, removing redundant deduplication logic in the storage walk, and optimizing path segment checks to reduce allocations.

Comment on lines +228 to +232
	if _, exists := dbSet[key]; exists {
		// Already covered by the manifest-derived dbSet;
		// don't count toward the inspection cap.
		continue
	}
gemini-code-assist Bot (severity: high)

The root-walk fallback currently skips any database already present in the manifest (dbSet). This optimization prevents the reconciler from discovering orphan measurements (measurements that exist in storage but not in the manifest) within known databases. While this saves on ListDirectories calls, it creates a gap in anti-entropy coverage. Consider if a more thorough discovery (perhaps sampled or periodic) is needed for known databases.

Comment on lines +61 to +100
	res := walkResult{
		records:      make([]objectRecord, 0, len(manifest)),
		partial:      discovery.partial,
		prefixErrors: append([]string{}, discovery.errors...),
	}
	seen := make(map[string]struct{}, len(manifest))

	for _, prefix := range prefixes {
		if ctx.Err() != nil {
			return res, ctx.Err()
		}
		records, err := r.listPrefix(ctx, prefix)
		if err != nil {
			// A per-prefix list failure is logged and the prefix is
			// skipped; the run continues against the remaining prefixes
			// so a single transient backend hiccup doesn't kill the
			// whole cycle. Errors are reported on the Run summary so
			// operators can spot "0 orphans found" runs that were
			// actually blind to most of the bucket.
			res.partial = true
			res.prefixErrors = append(res.prefixErrors, fmt.Sprintf("prefix %q: %v", prefix, err))
			r.logger.Warn().
				Err(err).
				Str("prefix", prefix).
				Msg("Reconciliation: prefix list failed; skipping")
			continue
		}
		for _, rec := range records {
			if _, dup := seen[rec.path]; dup {
				continue
			}
			seen[rec.path] = struct{}{}
			res.records = append(res.records, rec)
		}
	}

	return res, nil
}

// prefixDiscovery is the result of derivePrefixes. partial=true and/or
gemini-code-assist Bot (severity: medium)

The seen map used for deduplication in walkStorage is redundant because the prefixes generated by derivePrefixes are disjoint by construction (each is a unique database/measurement/ pair). Removing this map and using a direct append will reduce memory overhead and improve performance, especially for large manifests.

	res := walkResult{
		records:      make([]objectRecord, 0, len(manifest)),
		partial:      discovery.partial,
		prefixErrors: append([]string{}, discovery.errors...),
	}

	for _, prefix := range prefixes {
		if ctx.Err() != nil {
			return res, ctx.Err()
		}
		records, err := r.listPrefix(ctx, prefix)
		if err != nil {
			// A per-prefix list failure is logged and the prefix is
			// skipped; the run continues against the remaining prefixes
			// so a single transient backend hiccup doesn't kill the
			// whole cycle. Errors are reported on the Run summary so
			// operators can spot "0 orphans found" runs that were
			// actually blind to most of the bucket.
			res.partial = true
			res.prefixErrors = append(res.prefixErrors, fmt.Sprintf("prefix %q: %v", prefix, err))
			r.logger.Warn().
				Err(err).
				Str("prefix", prefix).
				Msg("Reconciliation: prefix list failed; skipping")
			continue
		}
		res.records = append(res.records, records...)
	}

	return res, nil

Comment on lines +324 to +328
	for _, seg := range strings.Split(p, "/") {
		if seg == "_compaction_state" {
			return false
		}
	}
gemini-code-assist Bot (severity: medium)

Using strings.Split here for every file found in storage can lead to significant allocations during large reconciliation runs. Since we only need to check if _compaction_state is a path segment, and we already know the path ends in .parquet, we can use more efficient string checks.

	if strings.HasPrefix(p, "_compaction_state/") || strings.Contains(p, "/_compaction_state/") {
		return false
	}
