feat(cluster): manifest-vs-storage reconciliation — Phase 5 anti-entropy #412

Merged: xe-nvdk merged 8 commits into main from feat/phase5-manifest-reconciliation on Apr 27, 2026

Conversation

xe-nvdk (Member) commented Apr 27, 2026

Summary

  • Adds a periodic reconciler that detects and repairs drift between the Raft-replicated cluster file manifest and physical storage. Two kinds of drift are addressed: orphan manifest entries (the manifest references a missing file — caused by retention/compaction succeeding storage-side and then losing Raft quorum) and orphan storage files (a file exists but has no manifest entry — caused by a crash between storage.Write and the file-registrar Raft propose, or by pre-Phase-1 history).
  • Druid-style posture: off by default. Once enabled, cron auto-acts on drift older than a 24h grace window with a 10k per-run blast cap. Anti-entropy that doesn't require babysitting matches Druid's coordinator-kill task and Pinot's retention manager.
  • Replaces explicit "Phase 5 will add" TODOs at retention.go:826, delete.go:687, job.go:827, coordinator.go:2101.

Design

Cluster gate is dual-mode by backend topology:

  • Shared storage (S3/Azure/MinIO): one node sweeps via IsActiveCompactor — reuses the failover-managed compactor lease as a single-sweeper election with no new state.
  • Local storage: every node walks its own disk filtered by OriginNodeID == self.
  • Standalone: defensive guard in sweepOrphanStorage refuses to act (no manifest to reconcile against).

Manifest-before-storage invariant upheld at the cycle level: orphan-manifest sweep runs first; orphan-storage runs only if the first half completes cleanly. Mirrors the retention chunked-batch pattern at retention.go:818-879 — same chunk size, same manifest-failure abort semantics, same per-file marshal safety.

Re-checks before every delete catch concurrent writer/retention/compaction races. Sorted output (computeDiff) makes cap-bounded runs deterministic about which orphans get cleaned. Per-chunk gate re-evaluation ensures lease handoff aborts the run promptly.
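
In code, the invariant reduces to a strict ordering. A minimal sketch, assuming the sweep method names used elsewhere in this PR (the receiver and signatures are illustrative, not the PR's actual code):

```go
func (r *Reconciler) runCycle(ctx context.Context, run *Run) error {
	// Half 1: repair manifest entries that point at missing files.
	if err := r.sweepOrphanManifest(ctx, run); err != nil {
		return err // abort the cycle; storage is never touched after a dirty pass
	}
	// Half 2: delete storage files with no manifest entry, only after a
	// clean first half.
	return r.sweepOrphanStorage(ctx, run)
}
```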

API

| Route | Auth | Behavior |
| --- | --- | --- |
| GET /api/v1/reconciliation/status | RequireRead | Last 10 runs, gate state, schedule, next_run |
| POST /api/v1/reconciliation/trigger | RequireAdmin | Defaults to dry-run; act mode requires both dry_run=false AND act=true (defense against accidental curl) |
| GET /api/v1/reconciliation/runs/:id | RequireAdmin | Single run from the in-memory ring buffer |
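
For illustration, a manual trigger via curl (host, port, and the auth header are assumptions; the dual-flag requirement is the one described above):

```bash
# Default: dry-run report, nothing is deleted.
curl -X POST -H "Authorization: Bearer $ARC_ADMIN_TOKEN" \
  "http://localhost:8080/api/v1/reconciliation/trigger"

# Act mode needs BOTH flags; either one alone stays dry-run.
curl -X POST -H "Authorization: Bearer $ARC_ADMIN_TOKEN" \
  "http://localhost:8080/api/v1/reconciliation/trigger?dry_run=false&act=true"
```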

Configuration (ARC_RECONCILIATION_*)

Defaults lean conservative: enabled=false, schedule="17 4 * * *", grace_window=24h, max_deletes_per_run=10000, max_manifest_size=200000. manifest_only_dry_run exists as a safety bridge for operators who want a few weeks of report-only runs before flipping to full auto.
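
An illustrative arc.toml fragment with these defaults (the section name and key spellings are inferred from the ARC_RECONCILIATION_* env prefix, so treat them as assumptions):

```toml
[reconciliation]
enabled               = false          # off by default
schedule              = "17 4 * * *"   # daily at 04:17
grace_window          = "24h"          # never act on drift younger than this
max_deletes_per_run   = 10000          # per-run blast cap
max_manifest_size     = 200000         # abort the run above this entry count
manifest_only_dry_run = true           # report-only bridge before full auto
```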

Audit

Events are emitted per batch (not per file, to keep the audit log proportional): reconcile.run_started, reconcile.run_completed, reconcile.run_aborted, reconcile.manifest_batch_deleted, reconcile.storage_batch_deleted, reconcile.cap_hit. The Run.WalkPartial flag plus Run.Errors surface per-prefix list failures so an operator can disambiguate "0 orphans found" from "we couldn't see most of the bucket".

License

Rides on FeatureClustering at coordinator construction. Standalone Arc has no manifest, so reconciliation is naturally Enterprise-only — no separate flag (avoiding InfluxDB-style fragmentation).

Scope of changes

  • New: internal/reconciliation/ (engine, scheduler, gate, walk, diff, two sweeps, tests)
  • New: internal/api/reconciliation.go (handler with auth-wrapped routes)
  • Wiring: cmd/arc/main.go (constructor + dual-mode reconciliationClusterGate)
  • Config: internal/config/config.go (ReconciliationConfig + viper defaults)
  • Comment cleanup: removed Phase 5 TODOs at retention/delete/compaction/coordinator
  • Release notes: RELEASE_NOTES_2026.05.1.md

No FSM changes. No new Raft commands. No storage.Backend interface changes. Reuses retention's BatchFileOpsInManifest chunking pattern verbatim.

Test plan

  • Unit tests (40+) covering: clean state, orphan-manifest only, orphan-storage only, both kinds mixed, grace window, clock skew, manifest-too-large abort, Raft error abort, ctx cancel, dry-run vs act, idempotent already-gone delete, blast cap mid-run, race re-check, local-mode foreign-origin filter, walk-partial flag, tightened path-shape check
  • Real-backend integration tests against LocalBackend (real fs walk, real Chtimes, real Exists)
  • Scheduler tests (cron lifecycle, gating, license, no-overlap, TriggerNow, status fields)
  • go build ./cmd/... ./internal/... clean
  • go vet clean
  • Three parallel staff/principal-engineer reviews (correctness / security / code quality) — all blocking findings addressed in the same commit
  • Manual end-to-end with two-node MinIO docker-compose (planted drift)

Reviewer notes

  • The Coordinator interface in reconciler.go is the only cross-package surface; defined in the consumer package to break the cluster→reconciliation→cluster cycle.
  • markAborted defaults unknown errors to AbortUnknown (not AbortRaftQuorumLoss) to avoid polluting Raft dashboards.
  • Run.WalkPartial is set when any per-prefix list failed; per-prefix errors are written into Run.Errors.
  • looksLikeManagedPath requires the full 7-segment Arc layout (db/m/yyyy/mm/dd/hh/file) and rejects ./.. segments — the strict-mode safety net for DeletePreManifestOrphans=false.
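
A minimal sketch of that check (the segment rules follow the bullet above; the .parquet suffix test is an assumption borrowed from the candidate filter, and the real function may differ in detail):

```go
package reconciliation

import "strings"

func looksLikeManagedPath(p string) bool {
	segs := strings.Split(p, "/")
	if len(segs) != 7 { // db/measurement/yyyy/mm/dd/hh/file
		return false
	}
	for _, s := range segs {
		if s == "" || s == "." || s == ".." {
			return false // reject empty, ".", and ".." segments
		}
	}
	return strings.HasSuffix(segs[6], ".parquet") // assumed suffix rule
}
```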

🤖 Generated with Claude Code


Closes #391

Periodically detects and repairs drift between the Raft-replicated cluster
file manifest and physical storage. Two kinds of drift:

- Orphan manifest entries (manifest references a missing file)
- Orphan storage files (file exists but no manifest entry)

Druid-style posture: off by default; once enabled, cron auto-acts on drift
older than a 24h grace window with a 10k per-run blast cap. Mirrors the
retention chunked-batch + abort-on-Raft-error pattern at retention.go:818-879.

Cluster gate is dual-mode by backend topology:
- Shared storage (S3/Azure/MinIO): one node sweeps via IsActiveCompactor
- Local storage: every node walks own disk, filtered by OriginNodeID
- Standalone: defensive guard refuses to act (no manifest to reconcile)

Manifest-before-storage invariant upheld at the cycle level: orphan-manifest
sweep runs first, orphan-storage runs only if the first half completes.
Re-checks before every delete catch concurrent writer/retention/compaction
races. Sorted output makes cap-bounded runs deterministic.

API: GET /status (RequireRead), POST /trigger (RequireAdmin, defaults
dry_run=true; act mode requires both dry_run=false AND act=true), GET
/runs/:id (RequireAdmin). Audit events emitted per batch with bounded
sample paths.
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist this PR adds Phase 5 manifest-vs-storage reconciliation (anti-entropy) for Arc Enterprise clusters. Please pay particular attention to:

  1. Concurrent-execution races between the reconciler and writer flush / retention / compaction (manifest-before-storage invariant + per-file re-checks).
  2. Standalone-mode safety — defensive guard in sweepOrphanStorage rejects BackendStandalone; verify no other code path can reach destructive deletes without a manifest.
  3. Local-storage OriginNodeID filtering in manifestToKeys — does it correctly avoid false orphan-manifest signals on peer-replicated files?
  4. Cap accounting — the per-chunk vs per-batch CapHit detection in both sweeps. Look for off-by-one or subtle edge cases when re-check skips fill the chunk.
  5. Path-shape validation in looksLikeManagedPath (sweep_storage.go) — strict-mode (DeletePreManifestOrphans=false) safety.
  6. Gate-revoke vs Raft-quorum-loss classification in markAborted — gate handoff should not pollute Raft-failure dashboards.

Three internal staff/principal-level review agents already ran; their blocking findings are addressed in this PR. Looking for fresh eyes.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciler to detect and repair drift between the Raft-replicated cluster file manifest and physical storage, specifically addressing orphan manifest entries and orphan storage files. The implementation includes a core reconciliation engine, a cron-based scheduler, and a new API handler for status reporting and manual triggers. Feedback focuses on a potential data loss risk where storage backends lacking ObjectLister support bypass the grace window due to zeroed modification times. Additionally, there are concerns regarding the robustness of error classification using substring matching and the potential for excessive I/O during the root walk discovery process in clusters with many databases.

Comment thread internal/reconciliation/walk.go Outdated
Comment thread internal/reconciliation/reconciler.go Outdated
Comment thread internal/reconciliation/walk.go Outdated
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a manifest-vs-storage reconciliation system for Enterprise clusters to repair drift between Raft metadata and physical storage. The implementation includes a periodic reconciler with safety features like grace windows and blast caps, a cron-based scheduler, and a new API for status monitoring and manual triggers. Review feedback suggests several optimizations, including increasing initial map capacities for better performance, using strings.Join for efficient path sampling in audit logs, improving the robustness of error classification, and explicitly clearing large manifest slices to reduce memory pressure during runs.

Comment thread internal/reconciliation/walk.go Outdated
Comment thread internal/reconciliation/sweep_manifest.go Outdated
Comment thread internal/reconciliation/reconciler.go Outdated
Comment thread internal/reconciliation/reconciler.go
Round 1 (HIGH + 2 MEDIUMs):

- HIGH: zero-mtime files now treated as YOUNG (protected) instead of
  eligible for deletion. The List+StatFile fallback path (used when a
  backend doesn't implement ObjectLister) produces zero LastModified
  values; the previous behavior would have made those files
  immediately eligible, bypassing the grace window. Conservative
  default prevents data loss; emit a Warn so operators know the
  safety net engaged. (walk.go, diff.go)

- MEDIUM: replace substring-match isRaftError with errors.Is against
  a new raft.ErrManifestApply sentinel. Coordinator now wraps every
  manifest-mutation error (RegisterFile / DeleteFile / BatchFileOps /
  UpdateFile, leader and forwarded paths) via errors.Join so the
  classification is robust to future log-string renames. Sentinel
  lives in the leaf raft package to avoid an import cycle.

- MEDIUM: bound derivePrefixes' root-walk via cfg.MaxRootWalkDatabases
  (default 1000). On a sprawling cluster or shared-bucket leak, the
  unbounded inner ListDirectories loop could trigger thousands of
  metadata calls per tick. Cap=0 disables the root walk entirely.

Round 2 (3 MEDIUMs):

- MEDIUM: bump derivePrefixes set map initial capacity from 32 to 128
  to cover typical production schemas without re-allocs.

- MEDIUM: replace joinSample loop concatenation with strings.Join.

- MEDIUM: nil out the full manifest slice after reducing to ObjectKeys
  so ~40 MB of pinned FileEntry data (SHA256, SizeBytes, CreatedAt)
  becomes GC-eligible during the rest of a 200k-entry run.

Refs PR #412.
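
A sketch of the conservative zero-mtime rule above (the function name matches the follow-up comment; the exact signature is assumed):

```go
package reconciliation

import "time"

// isYoungerThan treats a zero LastModified as young: a backend that cannot
// report mtimes must never make files immediately eligible for deletion.
func isYoungerThan(lastModified time.Time, grace time.Duration, now time.Time) bool {
	if lastModified.IsZero() {
		return true // unknown age: protect it and let the caller Warn
	}
	return now.Sub(lastModified) < grace
}
```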
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist addressed your findings in commit a04b862. Summary:

Round 1 (HIGH + 2 MEDIUMs)

  • ✅ Zero-mtime files now treated as YOUNG (protected) — isYoungerThan returns true for zero values; List+StatFile fallback emits a Warn so operators see the degraded-but-safe path engaged.
  • ✅ Replaced isRaftError substring matching with errors.Is(err, raft.ErrManifestApply). New sentinel in internal/cluster/raft/errors.go; coordinator wraps every manifest-mutation error via errors.Join.
  • ✅ Added cfg.MaxRootWalkDatabases (default 1000, 0 disables) to cap derivePrefixes root-walk fan-out.

Round 2 (3 MEDIUMs)

  • ✅ Bumped derivePrefixes set map initial capacity 32 → 128.
  • ✅ Replaced joinSample loop concat with strings.Join.
  • ✅ Nil out the full manifest slice after reducing to ObjectKeys (~40 MB GC-able mid-run on 200k-entry manifests).

New tests added: zero-mtime-protected, root-walk-cap-bounds-listdirectories, root-walk-cap-zero-disables.

Looking for fresh eyes on the typed-error wiring (verify no other manifest-write callers were missed) and any additional findings.
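
For reviewers tracing the typed-error wiring, the sentinel shape in question looks roughly like this (file path from the commit; the Wrap helper is illustrative — the PR wraps via errors.Join at each coordinator call site):

```go
// internal/cluster/raft/errors.go
package raft

import "errors"

// ErrManifestApply marks every manifest-mutation failure (RegisterFile,
// DeleteFile, BatchFileOps, UpdateFile) so callers can classify with
// errors.Is instead of matching log strings.
var ErrManifestApply = errors.New("raft: manifest apply failed")

// Wrap joins the sentinel onto a concrete apply error.
func Wrap(err error) error {
	if err == nil {
		return nil
	}
	return errors.Join(ErrManifestApply, err)
}
```

Callers then classify with errors.Is(err, raft.ErrManifestApply), which survives any future log-string rename.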

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a new manifest-vs-storage reconciliation feature to detect and repair drift between the Raft-replicated manifest and physical storage. The implementation includes a periodic reconciler, a scheduler, and API endpoints for status and manual triggering. I have identified several issues in the root walk logic where database-level checks were ineffective, and I have suggested improvements for log management to prevent log flooding during transient storage failures.

Comment thread internal/reconciliation/walk.go
Comment thread internal/reconciliation/walk.go Outdated
Comment thread internal/reconciliation/sweep_manifest.go Outdated
Comment thread internal/reconciliation/sweep_storage.go Outdated
Comment thread internal/reconciliation/walk.go Outdated
- HIGH: derivePrefixes was checking the database-only key against the
  database/measurement set, which always missed — every known database
  consumed MaxRootWalkDatabases budget and starved the discovery of
  truly-unknown databases. Build a separate dbSet keyed by
  "database/" so the membership check is correct. Regression test
  TestReconcile_RootWalkCapDoesNotCountKnownDatabases pins the fix.

- MEDIUM: rate-limit storage.Exists failure logs in sweepOrphanManifest.
  A transient backend hiccup that fails Exists across an entire chunk
  produced N Warns, drowning operator dashboards. Per-file detail now
  goes into run.Errors (capped at 32) and a single summary Warn fires
  at the end of each batch.

- MEDIUM: same pattern for storage.Delete failures in
  applyStorageDeletes — already captured in run.Errors, now emits one
  summary Warn per batch instead of one per file.

- MEDIUM: hoist the ObjectLister-missing Warn to once-per-run in
  walkStorage instead of once-per-prefix in listPrefix. On a cluster
  with 1000 measurements this drops the warning count from 1000 to 1.

Refs PR #412.
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 3 fixes pushed in commit 96802d6:

  • ✅ HIGH: dbSet keyed by "database/" fixes the root-walk cap-budget bug (was checking against the "database/measurement/" set, always missed). Regression test added.
  • ✅ MEDIUM: per-batch summary Warn for storage.Exists failures in sweepOrphanManifest (was 1 Warn per file).
  • ✅ MEDIUM: same pattern for storage.Delete failures in applyStorageDeletes.
  • ✅ MEDIUM: ObjectLister-missing Warn hoisted to once-per-run in walkStorage (was per-prefix).

Per-file failure detail still flows into run.Errors (capped at 32). Looking for any remaining findings.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces the Phase 5 Manifest-vs-Storage Reconciliation feature, which periodically identifies and resolves drift between the cluster's Raft manifest and physical storage. The implementation includes a new reconciliation package, a cron-based scheduler, and an HTTP API for status monitoring and manual triggers. Review feedback highlights several improvement opportunities: bounding the initial directory listing in the root walk to prevent potential memory issues, parallelizing synchronous existence checks during the manifest sweep to improve performance on remote backends, and utilizing a circular buffer for the run history to avoid inefficient slice prepending.

Comment thread internal/reconciliation/walk.go Outdated
Comment thread internal/reconciliation/sweep_manifest.go Outdated
Comment thread internal/reconciliation/reconciler.go Outdated
All three round-4 findings addressed (no defers):

- MEDIUM: bound the initial ListDirectories("") iteration in
  derivePrefixes to 4 × MaxRootWalkDatabases. The previous bound only
  capped the inspected counter; a shared bucket with thousands of
  unrelated top-level prefixes could still pin megabytes of dirnames
  and burn CPU on the dbSet lookup loop. Headroom of 4× covers the
  case where most top-level entries are known databases that the
  dbSet check skips. Operator-visible Warn fires when the limit
  trims the returned list.

- MEDIUM: parallel storage.Exists during the manifest sweep.
  Sequential HEAD-per-file on S3/Azure was the slow path; with chunks
  up to BatchSize=1000 a manifest sweep could take minutes per
  network RTT. New parallelExists helper runs a bounded worker pool
  (default 8 via RecheckConcurrency, 1 forces sequential) feeding
  into a fixed-index result slice so the existing sequential
  ops-build loop is unchanged. Conservative default stays under
  cloud per-second rate limits. Two new tests pin the concurrency
  bound and the sequential-when-1 mode.

- MEDIUM: ring buffer for runHistory. The previous
  append([]*Run{run}, r.runHistory...) prepend was O(N); the new
  fixed-size slice + head index + size counter is O(1). Read methods
  walk the ring newest-first; behavior of RecentRuns and FindRun is
  unchanged from the caller's perspective. Regression test pins
  newest-first ordering and FIFO eviction across wrap-around.

Refs PR #412.
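
A sketch of the bounded-pool shape described in the second item above (the parallelExists name comes from the commit; the backend interface and signature are assumptions):

```go
package reconciliation

import (
	"context"
	"sync"
)

// existsChecker is the minimal slice of the storage backend this sketch needs.
type existsChecker interface {
	Exists(ctx context.Context, path string) (bool, error)
}

// parallelExists re-checks paths with at most `workers` in-flight calls,
// writing into fixed-index slices so the caller's sequential ops-build loop
// stays unchanged.
func parallelExists(ctx context.Context, be existsChecker, paths []string, workers int) ([]bool, []error) {
	if workers < 1 {
		workers = 1 // RecheckConcurrency=1 forces sequential
	}
	results := make([]bool, len(paths))
	errs := make([]error, len(paths))
	sem := make(chan struct{}, workers) // bounded worker pool
	var wg sync.WaitGroup
	for i, p := range paths {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, p string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i], errs[i] = be.Exists(ctx, p)
		}(i, p)
	}
	wg.Wait()
	return results, errs
}
```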
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 4 fixes pushed in commit 86b69fd. All three findings addressed (no defers):

  • ✅ MEDIUM: bound initial ListDirectories("") iteration to 4 × MaxRootWalkDatabases. Operator-visible Warn fires when the limit trims the returned list.
  • ✅ MEDIUM: parallel storage.Exists worker pool in the manifest sweep — new RecheckConcurrency config (default 8, set to 1 to force sequential). Two new tests assert the concurrency bound and sequential mode.
  • ✅ MEDIUM: ring buffer for runHistory — O(1) record (was O(N) prepend). Regression test pins newest-first ordering and FIFO wrap-around eviction.

This should be the final pass — looking for any remaining findings before merge.
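
The ring-buffer change summarized above, sketched (Run is the package's run-summary type; the struct and method names here are illustrative):

```go
package reconciliation

// runRing keeps the last N runs with O(1) record; reads walk newest-first.
type runRing struct {
	buf  []*Run
	head int // index of the most recent run, -1 when empty
	size int
}

func newRunRing(capacity int) *runRing {
	return &runRing{buf: make([]*Run, capacity), head: -1}
}

func (r *runRing) record(run *Run) {
	r.head = (r.head + 1) % len(r.buf) // O(1); the oldest entry is overwritten
	r.buf[r.head] = run
	if r.size < len(r.buf) {
		r.size++
	}
}

func (r *runRing) recent(n int) []*Run {
	if n > r.size {
		n = r.size
	}
	out := make([]*Run, 0, n)
	for i := 0; i < n; i++ { // newest-first, walking back from head
		idx := (r.head - i + len(r.buf)) % len(r.buf)
		out = append(out, r.buf[idx])
	}
	return out
}
```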

…dditions

The original release-notes entry for the reconciler reflected commit
ab0d4ce; rounds 1-4 of the gemini review added behavior the operator
docs hadn't caught up with:

- Standalone-mode safety guard (orphan-storage sweep refuses to act)
- AbortReason taxonomy via raft.ErrManifestApply sentinel
  (raft_quorum_loss / lease_lost / unknown distinct on dashboards)
- Parallel storage.Exists re-checks (RecheckConcurrency, default 8)
- Root-walk fan-out cap (MaxRootWalkDatabases, default 1000)
- Strict 7-segment looksLikeManagedPath when DeletePreManifestOrphans=false
- Conservative protection on un-mtime'd backends (zero LastModified -> YOUNG)
- WalkPartial flag + per-prefix errors in run summary
- Auth wiring on all 3 routes (RequireRead on /status, RequireAdmin on
  /trigger and /runs/:id)

Two new config knobs added to the table: max_root_walk_databases,
recheck_concurrency.
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciler to detect and repair drift between the Raft-replicated file manifest and physical storage, addressing both orphan manifest entries and storage files. The implementation includes a new internal/reconciliation package, a cron-based scheduler, and administrative API endpoints for status and manual triggers. Review feedback suggests improving code organization by centralizing utility functions, clarifying release note descriptions for path-based deletion rules, and optimizing the scheduler by caching parsed cron schedules to avoid redundant parsing.

Comment thread internal/reconciliation/sweep_storage.go Outdated
Comment thread RELEASE_NOTES_2026.05.1.md
Comment thread internal/reconciliation/scheduler.go
Combined gemini round-5 (3 items) + 4 parallel internal staff reviews
(correctness / code-quality / security / perf-observability) into a
single commit so the next gemini pass lands on a near-clean diff.

HIGH (correctness — internal):
- CapHit lost when a manifest-sweep chunk is all-recheck-skipped at the
  cap boundary: continue skipped the audit + return paths so the
  reconciler kept doing recheck I/O on subsequent chunks. Same bug in
  storage sweep where len(toDelete)==0 used continue instead of return.
- WalkPartial and CapHit not in run_completed/run_aborted audit detail.
  SIEM consumers had no way to tell "0 orphans, all clean" from "blind
  walk produced 0 candidates" or "we hit the cap and stopped early."
- Inner ListDirectories failures in derivePrefixes silently swallowed —
  now surfaced via prefixDiscovery.errors and Run.WalkPartial.
- ctx.Err() not checked inside the root-walk loop after my round-4
  scan-limit change; a misbehaving backend could burn the whole budget
  after cancel. Added per-iteration check.

HIGH (gemini + code-quality):
- appendBounded was in sweep_storage.go but used cross-file. Moved all
  four cap/sample helpers (appendBounded, joinSample, sampleStrings,
  sampleCandidatePaths, boolStr) into helpers.go. Renamed `cap` shadow
  to `limit`. Collapsed sampleStrings/sampleCandidatePaths into a
  generic sampleSlice.
- scheduler.tick used direct == against sentinel errors (silently
  breaks if anyone wraps them). Switched to errors.Is.

MEDIUM (security):
- License re-validation middleware on /trigger and /runs/:id (per
  CLAUDE.md "License middleware goes after auth middleware"). New
  License.CanUseClustering() / Client.CanUseClustering() and a
  requireClusteringLicense fiber middleware mirroring the
  requireTieringLicense pattern. /status remains license-free so
  expired clusters can still report disabled state.
- applyDefaults clamps operator overrides: MaxRootWalkDatabases at
  1_000_000 (defends against int overflow in the × 4 scan-limit
  arithmetic), RecheckConcurrency at 64 (cloud rate-limit safety),
  negative MaxDeletesPerRun / GraceWindow / etc default to safe values
  (no negative-slice-index panics).

MEDIUM (perf):
- ErrGateRevoked sentinel + errors.Is replaces the substring-match
  isLeaseLostError. Both sweeps wrap %w ErrGateRevoked.
- Cached parsed cron.Schedule on the Scheduler struct + module-level
  cronParser. nextRun no longer re-parses on every Status() request.
- Replaced fmt.Sprintf("%d", n) hot-path calls with strconv.Itoa /
  strconv.FormatInt in the audit-emit and cap-hit paths.
- computeDiff now uses a single map[string]bool (manifestSeen) instead
  of two separate maps. Saves ~16 MB of transient allocation per run
  on a 200k-entry manifest.

MEDIUM (code quality):
- backend_kind threaded into run_completed and run_aborted audit
  details. Operators inspecting an audit row can now tell shared-vs-
  local without correlating to status.
- skipped_recheck added to run_completed audit.
- Dropped unreachable LocalNodeID=="" branch in manifestToKeys
  (NewReconciler already rejects that combination).

MEDIUM (release notes):
- Tightened delete_pre_manifest_orphans description per gemini's
  feedback — explicitly calls out the 7-segment layout and `.`/`..`
  rejection rather than the vague "database/measurement/..." wording.

New tests:
- TestApplyDefaults_ClampsOperatorOverridesToCeilings — pins the new
  MaxRootWalkDatabases / RecheckConcurrency / MaxDeletesPerRun clamp
  semantics.

CLAUDE.md updated locally (not committed — gitignored): post-
implementation review section now mandates a 4-agent parallel review
including a perf/observability/operational-hygiene lens explicitly
charged with catching the line-level items gemini flags.

Refs PR #412.
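
The single-map diff from the perf section, sketched (computeDiff is the PR's function name; this signature is an assumption):

```go
package reconciliation

import "sort"

// computeDiff classifies drift with one map instead of two: manifest keys
// start false and are flipped true when seen in storage.
func computeDiff(manifestKeys, storagePaths []string) (orphanManifest, orphanStorage []string) {
	manifestSeen := make(map[string]bool, len(manifestKeys))
	for _, k := range manifestKeys {
		manifestSeen[k] = false
	}
	for _, p := range storagePaths {
		if _, ok := manifestSeen[p]; ok {
			manifestSeen[p] = true // present on both sides: not drift
		} else {
			orphanStorage = append(orphanStorage, p) // file with no manifest entry
		}
	}
	for k, seen := range manifestSeen {
		if !seen {
			orphanManifest = append(orphanManifest, k) // manifest entry with no file
		}
	}
	// Sorted output keeps cap-bounded runs deterministic about which
	// orphans get cleaned first.
	sort.Strings(orphanManifest)
	sort.Strings(orphanStorage)
	return orphanManifest, orphanStorage
}
```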
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 5 fixes pushed in commit 3f41088. Combined gemini's 3 round-5 findings + 4 parallel internal staff/principal reviews into one commit so this lands closer to a clean diff.

HIGH (correctness, internal review):

  • ✅ CapHit lost when chunk all-recheck-skipped — both sweeps now return nil instead of continue past the cap audit.
  • walk_partial / cap_hit / backend_kind / skipped_recheck now emitted in run_completed/run_aborted audit detail.
  • ✅ Inner ListDirectories failures in derivePrefixes now surfaced via Run.WalkPartial + Run.Errors.
  • ctx.Err() checked inside the root-walk loop.

HIGH (gemini + code-quality):

  • appendBounded + joinSample + sample helpers consolidated into new helpers.go. Generic sampleSlice[T] collapses two near-duplicates.
  • scheduler.tick switched from == to errors.Is for sentinel checks.

MEDIUM (security):

  • ✅ License re-validation middleware (requireClusteringLicense) on /trigger and /runs/:id. New License.CanUseClustering() / Client.CanUseClustering().
  • ✅ Config clamps in applyDefaults: MaxRootWalkDatabases ≤ 1_000_000 (defends against × 4 overflow), RecheckConcurrency ≤ 64, negative values default safely.

MEDIUM (perf):

  • ErrGateRevoked sentinel + errors.Is replaces substring-match isLeaseLostError.
  • cron.Schedule cached on Scheduler; nextRun no longer re-parses on every Status() call.
  • fmt.Sprintf("%d", n) → strconv.Itoa / strconv.FormatInt across audit-emit hot paths.
  • computeDiff uses a single map[string]bool instead of two maps. ~16 MB saved per 200k-entry run.

MEDIUM (code quality):

  • backend_kind + skipped_recheck in run_completed/run_aborted audit detail.
  • ✅ Dropped unreachable LocalNodeID=="" branch in manifestToKeys.

MEDIUM (release notes):

  • ✅ Tightened delete_pre_manifest_orphans description with explicit 7-segment layout + ./.. rejection note.

New regression test: TestApplyDefaults_ClampsOperatorOverridesToCeilings pins the clamp semantics.

Looking for any final findings before merge.
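
The license middleware from the security section, sketched with gofiber-style signatures (the concrete license client type is this PR's; the interface here is an assumption):

```go
package api

import "github.com/gofiber/fiber/v2"

// clusteringLicense is the minimal capability this sketch needs.
type clusteringLicense interface {
	CanUseClustering() bool
}

// requireClusteringLicense re-validates the clustering entitlement after the
// auth middleware, mirroring the requireTieringLicense pattern. /status is
// deliberately left unwrapped so expired clusters can still report state.
func requireClusteringLicense(lic clusteringLicense) fiber.Handler {
	return func(c *fiber.Ctx) error {
		if !lic.CanUseClustering() {
			return c.Status(fiber.StatusForbidden).JSON(fiber.Map{
				"error": "clustering feature not licensed",
			})
		}
		return c.Next()
	}
}
```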

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciler to detect and repair drift between the Raft-replicated cluster file manifest and physical storage, addressing both orphan manifest entries and orphan storage files. The implementation includes a core reconciliation engine, a cron-based scheduler, an HTTP API for status and manual triggers, and integration with the existing clustering and storage layers. Feedback highlights a logic error in applyDefaults that prevents disabling the root walk by setting MaxRootWalkDatabases to 0. Additionally, it is suggested to change the default for delete_pre_manifest_orphans to false to prevent accidental data loss in shared storage environments.

Comment thread internal/reconciliation/reconciler.go Outdated
Comment thread RELEASE_NOTES_2026.05.1.md Outdated
Gemini round 6 flagged two items; pre-emptive same-shape audit confirmed
gemini's findings and surfaced no additional fields with the same bug
shape. Combined into a single commit so the next gemini pass lands on
a clean diff.

HIGH (gemini): config-driven MaxRootWalkDatabases=0 was silently
overridden to 1000 by applyDefaults, preventing operators from
reaching the disabled state via arc.toml. Convention now: `< 0` means
"operator did not set this, fill default 1000"; `0` means "explicitly
disabled" and passes through to the downstream guard at
walk.go:derivePrefixes (which gates root walk on `> 0`); positive
values used (clamped at maxRootWalkDatabasesCap to defend against
× 4 overflow). All other fields remain on `<= 0` "fill default"
semantics — overloading every numeric knob with shape-#1 semantics
would diverge from the existing ManifestOnlyDryRun behavioral knob
and surprise operators in unrelated config paths.

MEDIUM (gemini): delete_pre_manifest_orphans default flipped from
true to false. Shared-bucket deployments where stray non-Arc files
co-exist with Arc data must NOT see those files deleted on first
cron run. Operators who deliberately want pre-Phase-1 / migration-
residual cleanup now opt in explicitly. Mirrors Iceberg's
remove_orphan_files / Delta VACUUM secure-by-default posture.

MEDIUM (internal review): manifest_only_dry_run default flipped
from false to true. Even though the feature is opt-in via
reconciliation.enabled, flipping enabled=true alone should not
auto-delete on the very first cron tick. Operators review dry-run
audits, then explicitly flip manifest_only_dry_run=false to allow
real deletes. Druid's coordinator kill task and Iceberg's
remove_orphan_files ship the same posture.

Test infrastructure:
- newReconciler test helper sets MaxRootWalkDatabases=-1 when the
  caller passes 0 so tests don't accidentally land in the disabled
  state; tests that DO want disabled set 0 explicitly.
- Two integration tests using NewReconciler directly add
  MaxRootWalkDatabases:-1 to opt into root walk.
- New TestApplyDefaults_MaxRootWalkDatabasesZeroFromConfigDisablesRootWalk
  pins the gemini-flagged convention.

Release notes updated: "off by default AND report-only on first run"
prose, plus the two flipped default rows in the config table.

Refs PR #412.
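
The three-way convention, sketched (constant names are illustrative; only MaxRootWalkDatabases carries the sentinel semantics):

```go
package reconciliation

const (
	defaultMaxRootWalkDatabases = 1000
	maxRootWalkDatabasesCap     = 1_000_000 // guards the ×4 scan-limit arithmetic
)

// applyMaxRootWalkDefault: <0 means "unset, fill default"; 0 means
// "explicitly disabled" and passes through to derivePrefixes, which gates
// the root walk on >0; positive values are clamped.
func applyMaxRootWalkDefault(v int) int {
	switch {
	case v < 0:
		return defaultMaxRootWalkDatabases
	case v == 0:
		return 0 // disabled
	case v > maxRootWalkDatabasesCap:
		return maxRootWalkDatabasesCap
	default:
		return v
	}
}
```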
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 6 fixes pushed in commit 38da2a2.

HIGH (your finding):

  • MaxRootWalkDatabases=0 from config now reaches the disabled state. Convention: < 0 means "operator did not set this, fill default"; 0 means "disabled"; > 0 is the cap (clamped at the int-overflow ceiling). Other numeric fields stay on <= 0 "fill default" — overloading every knob with shape-#1 semantics would diverge from the existing ManifestOnlyDryRun behavioral knob already documented for audit-only mode.
  • ✅ New regression test: TestApplyDefaults_MaxRootWalkDatabasesZeroFromConfigDisablesRootWalk.

MEDIUM (your finding) — secure-by-default flips:

  • delete_pre_manifest_orphans default → false. Shared-bucket deployments are protected from accidental stray-file deletion on first cron.
  • manifest_only_dry_run default → true (additional internal-review finding of the same shape). First cron tick produces a dry-run audit; operator flips after reviewing. Mirrors Druid kill / Iceberg remove_orphan_files first-run posture.

Process note: I ran a pre-emptive same-shape audit on the package looking for other fields with the config-defaults-vs-disabled confusion you spotted. Confirmed there were no other instances — only MaxRootWalkDatabases had the overloaded ==0 check. Audit also confirmed two more secure-by-default opportunities (the manifest_only_dry_run flip above + larger policy-level changes around grace window / blast cap defaults that I'm leaving as a follow-up issue rather than baking into this PR).

Looking for any final findings before merge.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciler to detect and repair drift between the Raft-replicated cluster file manifest and physical storage, addressing both orphan manifest entries and orphan storage files. The implementation includes a core reconciliation engine, a cron-based scheduler, and API endpoints for status and manual triggers. Feedback suggests refining the DryRun logic in the core engine to allow manual overrides via the API even when the global dry-run configuration is enabled. Additionally, the check for excluding compaction state files should be made more specific to avoid incorrectly ignoring valid user data that might contain the reserved string in its path.

Comment thread internal/reconciliation/reconciler.go Outdated
Comment thread internal/reconciliation/walk.go Outdated
…ction-state filter

Gemini round 7 flagged two MEDIUMs; pre-emptive same-shape audit
escalated #1 to HIGH (operator-intent suppression with no diagnostic)
and confirmed no other instances of either shape elsewhere in the
package.

HIGH (gemini, escalated): Reconcile previously did
`DryRun = dryRun || cfg.ManifestOnlyDryRun`. With the round-6 secure-
by-default flip of ManifestOnlyDryRun=true, this silently forced
every Reconcile call into dry-run, including the manual API path.
Operators hitting POST /trigger?dry_run=false&act=true (which already
encodes the explicit double-confirmation) would be silently denied
the act-mode they asked for. The cron path still passes
cfg.ManifestOnlyDryRun directly via scheduler.tick, so dropping the OR
preserves cron safety while honoring legitimate manual overrides.

Two regression tests pin both behaviors:
- TestReconcile_ManualOverrideHonorsDryRunFalseEvenWithCfgDryRunOnly
- TestReconcile_CronPathDryRunStillRespectsCfg

MEDIUM (gemini): isParquetCandidate used
`strings.Contains(p, "_compaction_state/")` which is a substring
match, not segment-aware. A measurement named with that substring
embedded — e.g. "db/m_compaction_state_logs/..." — would have all
its valid data files silently filtered out of orphan-storage
candidacy. Replaced with an explicit `strings.Split + segment ==
"_compaction_state"` check, mirroring the segment-aware shape used
by looksLikeManagedPath. Added three regression cases to the
TestIsParquetCandidate table covering substring-but-not-segment
positions where false positives would previously trigger.

Refs PR #412.
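
The fix in before/after form (a fragment for orientation, not the PR's literal diff):

```go
// Before round 7: the OR silently forced dry-run everywhere once the
// round-6 default flip landed, including the manual API path.
//	run.DryRun = dryRun || r.cfg.ManifestOnlyDryRun

// After: the engine honors the caller; the cron path stays safe because
// scheduler.tick passes cfg.ManifestOnlyDryRun as the dryRun argument.
//	run.DryRun = dryRun
```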
xe-nvdk (Member, Author) commented Apr 27, 2026

/gemini review

@gemini-code-assist round 7 fixes pushed in commit ba98d58.

MEDIUM (your finding, internally re-graded HIGH) — manual API override now honored:

  • ✅ Dropped || r.cfg.ManifestOnlyDryRun from Reconcile. The cron path already passes cfg.ManifestOnlyDryRun via scheduler.tick (line 146), so cron safety is preserved upstream. The OR was silently masking legitimate manual API overrides — operators hitting POST /trigger?dry_run=false&act=true (which already encodes the double-confirmation) would otherwise be denied act-mode.
  • Two regression tests pin both behaviors: TestReconcile_ManualOverrideHonorsDryRunFalseEvenWithCfgDryRunOnly and TestReconcile_CronPathDryRunStillRespectsCfg.

MEDIUM (your finding) — segment-aware compaction-state filter:

  • ✅ Replaced strings.Contains(p, "_compaction_state/") with an explicit strings.Split + segment == "_compaction_state" check. A measurement named with that substring embedded (db/m_compaction_state_logs/...) is no longer silently filtered out. Mirrors the segment-aware shape already used by looksLikeManagedPath.
  • Added three regression cases to TestIsParquetCandidate covering substring-but-not-segment positions.

Pre-emptive same-shape audit: ran a parallel staff-engineer audit before fixing, looking for other belt-and-braces overrides that suppress operator intent and other path-shape substring matches. Confirmed no other instances of either bug shape in the package.

Looking for any final findings before merge.

@xe-nvdk xe-nvdk merged commit 829f5d8 into main Apr 27, 2026
6 checks passed
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a periodic reconciliation system for Enterprise clusters to detect and repair drift between the Raft-replicated file manifest and physical storage. It addresses orphan manifest entries and orphan storage files with safety features like dry-run modes, grace windows, and blast caps. Feedback suggests improving the root-walk discovery to cover orphan measurements in known databases, removing redundant deduplication logic in the storage walk, and optimizing path segment checks to reduce allocations.

Comment on lines +228 to +232
	if _, exists := dbSet[key]; exists {
		// Already covered by the manifest-derived dbSet;
		// don't count toward the inspection cap.
		continue
	}
gemini-code-assist Bot (severity: high)

The root-walk fallback currently skips any database already present in the manifest (dbSet). This optimization prevents the reconciler from discovering orphan measurements (measurements that exist in storage but not in the manifest) within known databases. While this saves on ListDirectories calls, it creates a gap in anti-entropy coverage. Consider if a more thorough discovery (perhaps sampled or periodic) is needed for known databases.

Comment on lines +61 to +100
	res := walkResult{
		records:      make([]objectRecord, 0, len(manifest)),
		partial:      discovery.partial,
		prefixErrors: append([]string{}, discovery.errors...),
	}
	seen := make(map[string]struct{}, len(manifest))

	for _, prefix := range prefixes {
		if ctx.Err() != nil {
			return res, ctx.Err()
		}
		records, err := r.listPrefix(ctx, prefix)
		if err != nil {
			// A per-prefix list failure is logged and the prefix is
			// skipped; the run continues against the remaining prefixes
			// so a single transient backend hiccup doesn't kill the
			// whole cycle. Errors are reported on the Run summary so
			// operators can spot "0 orphans found" runs that were
			// actually blind to most of the bucket.
			res.partial = true
			res.prefixErrors = append(res.prefixErrors, fmt.Sprintf("prefix %q: %v", prefix, err))
			r.logger.Warn().
				Err(err).
				Str("prefix", prefix).
				Msg("Reconciliation: prefix list failed; skipping")
			continue
		}
		for _, rec := range records {
			if _, dup := seen[rec.path]; dup {
				continue
			}
			seen[rec.path] = struct{}{}
			res.records = append(res.records, rec)
		}
	}

	return res, nil
}

// prefixDiscovery is the result of derivePrefixes. partial=true and/or
gemini-code-assist Bot (severity: medium)

The seen map used for deduplication in walkStorage is redundant because the prefixes generated by derivePrefixes are disjoint by construction (each is a unique database/measurement/ pair). Removing this map and using a direct append will reduce memory overhead and improve performance, especially for large manifests.

	res := walkResult{
		records:      make([]objectRecord, 0, len(manifest)),
		partial:      discovery.partial,
		prefixErrors: append([]string{}, discovery.errors...),
	}

	for _, prefix := range prefixes {
		if ctx.Err() != nil {
			return res, ctx.Err()
		}
		records, err := r.listPrefix(ctx, prefix)
		if err != nil {
			// A per-prefix list failure is logged and the prefix is
			// skipped; the run continues against the remaining prefixes
			// so a single transient backend hiccup doesn't kill the
			// whole cycle. Errors are reported on the Run summary so
			// operators can spot "0 orphans found" runs that were
			// actually blind to most of the bucket.
			res.partial = true
			res.prefixErrors = append(res.prefixErrors, fmt.Sprintf("prefix %q: %v", prefix, err))
			r.logger.Warn().
				Err(err).
				Str("prefix", prefix).
				Msg("Reconciliation: prefix list failed; skipping")
			continue
		}
		res.records = append(res.records, records...)
	}

	return res, nil

Comment on lines +324 to +328
	for _, seg := range strings.Split(p, "/") {
		if seg == "_compaction_state" {
			return false
		}
	}
gemini-code-assist Bot (severity: medium)

Using strings.Split here for every file found in storage can lead to significant allocations during large reconciliation runs. Since we only need to check if _compaction_state is a path segment, and we already know the path ends in .parquet, we can use more efficient string checks.

	if strings.HasPrefix(p, "_compaction_state/") || strings.Contains(p, "/_compaction_state/") {
		return false
	}
