feat(cluster): writer-only retention with Raft manifest propagation #407

Merged
xe-nvdk merged 13 commits into main from feat/cluster-safe-retention on Apr 20, 2026

Conversation

xe-nvdk (Member) commented Apr 20, 2026

Summary

  • Only the primary writer node runs retention policies in cluster mode, preventing races on shared (S3/Azure/NFS) and per-node (local) storage
  • After each file delete, DeleteFileFromManifest commits the removal into the Raft log, keeping the cluster manifest consistent
  • Reader nodes clean up their local copies via the existing onFileDeleted FSM callback and delete-worker pool (shared with compaction — no new code needed)
  • Standalone deployments are completely unaffected (nil gate = no check)

Test plan

  • go build ./cmd/... ./internal/... passes clean
  • go test ./internal/scheduler/... ./internal/api/... ./internal/cluster/... all pass
  • Standalone: retention runs normally, no gate log messages
  • Cluster (3-node): only writer logs "Triggering scheduled retention"; readers log "Scheduler idle"
  • Cluster: after retention runs, GET /api/v1/cluster/files shows no orphaned entries for deleted files
  • Writer failover: new primary picks up retention on next cron tick

Only the primary writer node runs retention policies in cluster mode,
preventing races on shared/per-node storage. After each file delete,
DeleteFileFromManifest commits the removal into the Raft log so reader
nodes clean up their local copies via the existing onFileDeleted worker
pool.
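
For readers skimming the diff, the gate wiring is roughly this shape (a minimal sketch, not the actual code; the ClusterGate interface and the log lines are illustrative):

```go
// Minimal sketch of the writer-only gate, assuming a ClusterGate interface held
// by the scheduler; names are illustrative, not the actual API.
package scheduler

import "log"

// ClusterGate reports whether this node is allowed to run retention.
type ClusterGate interface {
	CanRunRetention() bool
}

type RetentionScheduler struct {
	clusterGate ClusterGate // nil in standalone deployments
}

func (s *RetentionScheduler) runRetention() {
	// nil gate = standalone: no check, retention always runs.
	if s.clusterGate != nil && !s.clusterGate.CanRunRetention() {
		log.Println("Scheduler idle: node is not the primary writer")
		return
	}
	log.Println("Triggering scheduled retention")
	// ... execute retention policies ...
}
```
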
gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution for enterprise deployments, ensuring that retention tasks only run on the primary writer node and that file deletions are propagated to the Raft manifest. Feedback indicates that the current implementation of Raft manifest updates is inefficient and should be batched to improve performance. Additionally, the cluster gate logic in the scheduler needs to be moved from the initialization phase to the execution phase to correctly handle node role transitions and failover scenarios.

Comment thread internal/api/retention.go Outdated
Comment thread internal/scheduler/retention_scheduler.go Outdated
…te check

- Batch all retention file deletions into a single Raft proposal via
  BatchFileOpsInManifest instead of one proposal per file, reducing
  Raft log pressure on large retention runs
- Move cluster gate check from Start() into runRetention() so role
  transitions (failover, demotion) take effect on the next tick without
  requiring a node restart
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Both findings from your review have been addressed in the latest push:

  1. Batch manifest updates — replaced the per-file DeleteFileFromManifest calls with a single BatchFileOpsInManifest call after the deletion loop, matching how compaction handles manifest updates.

  2. Per-tick gate check — moved CanRunRetention() out of Start() and into runRetention() so role transitions (failover, demotion) take effect on the next cron tick without requiring a node restart. Status() now reports the live can_run value instead of a stale startup snapshot.

Could you do another pass?
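
For reference, the batched update from item 1 looks roughly like this (sketch only; the ManifestOp type and the BatchFileOpsInManifest signature shown here are simplified stand-ins for the real coordinator API):

```go
// Sketch of the batched manifest update, using simplified stand-ins for the
// coordinator API; the real ManifestOp/BatchFileOpsInManifest shapes may differ.
package api

import (
	"context"
	"fmt"
)

type ManifestOp struct {
	Type string // e.g. "delete_file"
	Path string
}

type Coordinator interface {
	BatchFileOpsInManifest(ctx context.Context, ops []ManifestOp) error
}

// commitDeletes collects one op per deleted file and submits them as a single
// Raft proposal, instead of one DeleteFileFromManifest proposal per file.
func commitDeletes(ctx context.Context, coord Coordinator, deletedPaths []string) error {
	if len(deletedPaths) == 0 {
		return nil
	}
	ops := make([]ManifestOp, 0, len(deletedPaths))
	for _, p := range deletedPaths {
		ops = append(ops, ManifestOp{Type: "delete_file", Path: p})
	}
	if err := coord.BatchFileOpsInManifest(ctx, ops); err != nil {
		return fmt.Errorf("batch manifest update failed: %w", err)
	}
	return nil
}
```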

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized across the cluster via the Raft manifest. The changes include the introduction of cluster gates in the scheduler and batching logic for manifest updates in the retention handler. Feedback focuses on enhancing the reliability of manifest updates by addressing potential orphaned entries and Raft log size limitations through chunking, as well as optimizing performance by removing unnecessary mutex locking for immutable fields.

Comment thread internal/api/retention.go Outdated
Comment thread internal/api/retention.go Outdated
Comment thread internal/scheduler/retention_scheduler.go Outdated
- Chunk manifest batch updates at 1000 ops to avoid oversized Raft log entries
- Upgrade manifest failure log to Error with honest comment (no self-heal today,
  Phase 5 reconciliation will clean orphans)
- Remove unnecessary mutex around immutable clusterGate in runRetention and TriggerNow
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Three findings from the second review addressed:

  1. Chunked manifest updates — batch is now split into 1000-op chunks to prevent oversized Raft log entries on large retention runs.
  2. Error log + honest comment — upgraded from Warn to Error on manifest failure; removed the misleading "self-heal" comment and replaced with an explicit note that affected paths will be orphaned in the manifest until Phase 5 reconciliation ships.
  3. Removed unnecessary mutex: clusterGate is immutable after construction; removed the lock/copy pattern in both runRetention and TriggerNow.

Could you do a final pass?
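
The chunking from item 1 looks roughly like this (sketch; the helper and the commit callback are illustrative, and at this stage of the PR a failed chunk is logged at Error level and the loop keeps going):

```go
// Sketch of the 1000-op chunking; helper name and commit callback are
// illustrative. Abort-on-failure comes in a later commit.
package api

import "log"

const manifestChunkSize = 1000

func commitInChunks(paths []string, commit func(chunk []string) error) {
	for start := 0; start < len(paths); start += manifestChunkSize {
		end := start + manifestChunkSize
		if end > len(paths) {
			end = len(paths)
		}
		if err := commit(paths[start:end]); err != nil {
			// No self-heal today: these paths stay orphaned in the manifest
			// until Phase 5 reconciliation.
			log.Printf("ERROR: manifest update failed for %d files: %v", end-start, err)
		}
	}
}
```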

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution, ensuring that retention tasks only run on the primary writer node and that file deletions are propagated to the Raft manifest. Feedback highlights a potential consistency issue where files are deleted from storage before the manifest is updated; it is recommended to reverse this order to avoid orphaned manifest entries in the event of a Raft quorum loss. Additionally, error handling for manifest updates should be improved by terminating the retention cycle upon failure instead of merely logging the error and continuing with subsequent chunks.

Comment thread internal/api/retention.go Outdated
Comment thread internal/api/retention.go Outdated
…lure

Update cluster manifest before deleting from storage so a Raft failure
leaves the file intact in both places and the next retention run can
retry it. Previously, storage was deleted first, creating permanent
orphan manifest entries on failure.

On manifest update failure, abort the entire retention cycle rather than
continuing to the next chunk — a Raft quorum loss is not transient and
proceeding would only create more orphans for the same root cause.

Also removes the now-redundant !dryRun guard (deletedFilePaths is only
populated when dryRun=false).
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Three findings from the latest round addressed:

  1. Manifest-before-storage ordering — the manifest is now updated before deleting from storage. If the Raft update fails, the file still exists in both places and the next retention run will retry it naturally. A storage-delete failure after a successful manifest update leaves a harmless ghost file on disk (invisible to the cluster, will be swept by the next compaction pass).

  2. Abort on manifest failure — if BatchFileOpsInManifest fails, the retention cycle returns an error immediately rather than continuing to the next chunk. A Raft quorum loss is not transient; proceeding would only produce more orphans for the same root cause.

  3. Removed redundant !dryRun guard: deletedFilePaths/eligiblePaths is only populated when dryRun=false, so the outer check was unnecessary.

Could you do a final pass?
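
The new ordering from item 1, sketched (interfaces and signatures are simplified stand-ins; the real handler works on chunks of manifest ops rather than bare paths):

```go
// Sketch of the new ordering: manifest first, storage second, abort on manifest
// failure. Interfaces are simplified stand-ins for the real coordinator/storage.
package api

import (
	"context"
	"fmt"
	"log"
)

type Storage interface {
	Delete(ctx context.Context, path string) error
}

type ManifestCoordinator interface {
	RemoveFilesFromManifest(ctx context.Context, paths []string) error
}

func deleteFiles(ctx context.Context, coord ManifestCoordinator, store Storage, paths []string) error {
	// 1. Commit the removals to the Raft manifest first. On failure the files
	//    still exist in both places and the next retention run retries them.
	if err := coord.RemoveFilesFromManifest(ctx, paths); err != nil {
		return fmt.Errorf("manifest update failed, aborting retention cycle: %w", err)
	}
	// 2. Then delete from storage. A failure here leaves a ghost file on disk,
	//    invisible to the cluster, to be swept by the next compaction pass.
	for _, p := range paths {
		if err := store.Delete(ctx, p); err != nil {
			log.Printf("storage delete failed for %s: %v", p, err)
		}
	}
	return nil
}
```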

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-aware retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and propagate deletions through the Raft manifest. Key changes include the addition of a cluster gate for the retention scheduler and a refactored deletion process in the retention handler that updates the cluster manifest before removing files from storage. Feedback highlights a potential SQL injection vulnerability in how file paths are handled during metadata retrieval and suggests interleaving manifest updates with storage deletions to improve the robustness of the process and minimize orphaned files.

Comment thread internal/api/retention.go
Comment thread internal/api/retention.go Outdated
…rquet path

- Interleave manifest update and storage delete per chunk instead of two
  separate passes; limits orphan blast radius to one chunk on mid-run failure
- Escape single quotes in read_parquet() file paths to prevent SQL injection
  via crafted database/measurement names
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Both findings addressed:

  1. Interleaved manifest+storage per chunk — instead of two separate passes (all manifest updates then all storage deletes), each chunk now updates the manifest and immediately deletes from storage before moving to the next chunk. This limits the orphan blast radius to one chunk on a mid-run failure, matching your suggested implementation.

  2. SQL injection fix in getFileMaxTimeAndRowCount — single quotes in the file path are now escaped (strings.ReplaceAll(filePath, "'", "''")) before interpolation into the read_parquet() query, since DuckDB doesn't support parameterized read_parquet() calls.

Could you do a final pass?
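
The escaping from item 2 in isolation (the strings.ReplaceAll escaping is the actual fix; the function name and query shape here are illustrative):

```go
// DuckDB's read_parquet() cannot take a bound parameter for the path, so single
// quotes are doubled before interpolation. Function name and query shape are
// illustrative.
package api

import (
	"fmt"
	"strings"
)

func buildReadParquetQuery(filePath string) string {
	// '' is the SQL escape for ' inside a string literal, so a crafted
	// database/measurement name cannot break out of the path literal.
	escaped := strings.ReplaceAll(filePath, "'", "''")
	return fmt.Sprintf("SELECT max(time), count(*) FROM read_parquet('%s')", escaped)
}
```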

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution for enterprise deployments, ensuring that retention tasks only run on the primary writer node and that file deletions are synchronized via the Raft manifest. Key improvements include batching deletions in chunks of 1000 and adding SQL escaping to DuckDB queries to prevent injection. A logic bug was identified in the chunk processing logic where a failure to marshal a manifest update for a file would not prevent its deletion from storage, potentially leading to manifest inconsistency.

Comment thread internal/api/retention.go
Track subPaths/subRows in parallel with ops so a json.Marshal failure
skips a file from both the manifest batch and the storage delete loop.
Previously a marshal failure would skip the manifest op but still delete
the file from storage, creating an orphaned manifest entry.
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Logic bug fixed: subPaths and subRows are now built in parallel with ops during the marshal loop, so a json.Marshal failure skips the file from both the manifest batch and the storage delete loop. Previously, a marshal failure would silently exclude the file from ops but leave it in chunk, causing a storage delete without a corresponding manifest removal.

Could you do a final pass?
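
Sketch of the parallel-slice pattern described above (types and field names are illustrative):

```go
// A file whose manifest op fails to marshal is skipped in both the manifest
// batch and the storage-delete loop. Types and field names are illustrative.
package api

import (
	"encoding/json"
	"log"
)

type fileInfo struct {
	Path string
	Rows int64
}

type deletePayload struct {
	Path string `json:"path"`
}

func buildChunkOps(chunk []fileInfo) (ops [][]byte, subPaths []string, subRows []int64) {
	ops = make([][]byte, 0, len(chunk))
	subPaths = make([]string, 0, len(chunk))
	subRows = make([]int64, 0, len(chunk))
	for _, f := range chunk {
		b, err := json.Marshal(deletePayload{Path: f.Path})
		if err != nil {
			// Skip the file entirely: it stays in the manifest AND in storage,
			// rather than being deleted from storage with its manifest entry left behind.
			log.Printf("skipping %s: marshal failed: %v", f.Path, err)
			continue
		}
		ops = append(ops, b)
		subPaths = append(subPaths, f.Path)
		subRows = append(subRows, f.Rows)
	}
	// Callers delete from storage using subPaths, never the original chunk.
	return ops, subPaths, subRows
}
```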

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution for enterprise deployments, ensuring that retention tasks only run on the primary writer node and that file deletions are synchronized across the cluster via Raft manifest updates. The implementation includes a new cluster gate mechanism, batched manifest updates to maintain consistency, and SQL injection hardening for DuckDB queries. Feedback was provided to improve the readability of slice initialization within the retention handler's deletion logic.

Comment thread internal/api/retention.go Outdated
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Addressed — replaced [:0:0] with make([]string, 0, len(chunk)) / make([]int64, 0, len(chunk)) for clarity. Could you do a final pass?

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized across the cluster via the Raft manifest. The changes include the implementation of cluster gates in the scheduler and chunked manifest updates in the retention handler. Feedback recommends enhancing error handling for manifest update failures to avoid redundant logging during cluster outages and implementing concurrency controls in the scheduler to prevent overlapping retention cycles.

Comment thread internal/api/retention.go
Comment thread internal/scheduler/retention_scheduler.go
- Add runningJob flag to RetentionScheduler to prevent overlapping cycles
  when a run exceeds the cron interval; applies to both scheduled and
  manual TriggerNow paths
- Abort ExecutePolicy early on manifest error (Raft quorum loss is not
  transient — continuing to the next measurement only creates more orphans)
- Add license check in ExecutePolicy to guard direct programmatic calls
  that bypass the scheduler's license gate
- Wire licenseClient into RetentionHandler constructor
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Four issues addressed from Gemini's latest round and a parallel principal-level review:

  1. Concurrent execution guard — added runningJob bool flag to RetentionScheduler. If a previous cycle is still in progress when the next tick fires, the tick is skipped with a warn log. Applies to both the cron path and TriggerNow.

  2. Early abort on manifest error in ExecutePolicy — when deleteOldFiles returns a manifest error and a coordinator is wired (cluster mode), ExecutePolicy now records the execution as failed and returns immediately instead of continuing to the next measurement. Storage-only errors (standalone mode) still continue as before.

  3. License check in ExecutePolicy — added a license gate directly in ExecutePolicy so direct programmatic/API calls that bypass the scheduler's Start() check are also protected.

  4. licenseClient wired into RetentionHandler — constructor updated to accept *license.Client; call site in main.go and test updated accordingly.

Could you do a final pass?
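
The guard from item 1, roughly (sketch only; the tryAcquire/release helpers are illustrative and the real code may inline the locking):

```go
// Sketch of the overlapping-run guard on the scheduler.
package scheduler

import (
	"errors"
	"log"
	"sync"
)

type RetentionScheduler struct {
	mu         sync.Mutex
	runningJob bool
}

func (s *RetentionScheduler) tryAcquire() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.runningJob {
		return false
	}
	s.runningJob = true
	return true
}

func (s *RetentionScheduler) release() {
	s.mu.Lock()
	s.runningJob = false
	s.mu.Unlock()
}

// runRetention skips the tick if the previous cycle is still in progress.
func (s *RetentionScheduler) runRetention() {
	if !s.tryAcquire() {
		log.Println("previous retention cycle still running, skipping this tick")
		return
	}
	defer s.release()
	// ... gate check, license check, execute policies ...
}

// TriggerNow rejects a manual run while a scheduled cycle is in progress.
func (s *RetentionScheduler) TriggerNow() error {
	if !s.tryAcquire() {
		return errors.New("a retention cycle is already in progress")
	}
	defer s.release()
	// ... execute policies ...
	return nil
}
```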

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution, ensuring that retention runs exclusively on the primary writer node and propagates file deletions through the Raft manifest. It also introduces license validation for retention tasks and enhances SQL safety by escaping file paths in DuckDB queries. Feedback was provided regarding the need for concurrency protection in the manual trigger method to prevent overlapping retention cycles.

Comment thread internal/scheduler/retention_scheduler.go
Prevents a manual trigger from starting a second concurrent cycle when
a scheduled run is already in progress.
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Applied the runningJob guard to TriggerNow as well — a manual trigger now returns an error immediately if a scheduled cycle is already in progress, matching the behaviour added to runRetention. Could you do a final pass?

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution for enterprise deployments. Key changes include restricting retention tasks to the primary writer node via a cluster gate, propagating file deletions to the Raft manifest for cluster-wide consistency, and implementing concurrency control to prevent overlapping retention cycles. Additionally, the update includes license validation for retention execution and SQL escaping for DuckDB queries to mitigate injection risks. Feedback was provided regarding the efficiency of JSON marshaling within the file deletion loop.

Comment thread internal/api/retention.go Outdated
…uster gate

- ExecutePolicy: remove coordinator nil-guard around abort — standalone mode
  should also abort the cycle on storage/manifest failure rather than silently
  continuing to the next measurement and creating inconsistent partial state
- handleExecute: propagate c.Context() into deleteOldFiles (was context.Background())
  so client disconnects and server timeouts cancel in-flight deletion; record
  execution as "failed" and return 500 on first error rather than swallowing it
- retentionClusterGate.CanRunRetention: guard GetLocalNode() return value against
  nil so a gate check during cluster startup does not panic
xe-nvdk (Member Author) commented Apr 20, 2026

Three additional fixes from a parallel staff/principal-engineer review:

  1. ExecutePolicy abort in standalone mode — removed the if h.coordinator != nil guard around the abort path. Standalone mode was silently continuing past errors and producing partial, inconsistent state; it now aborts the entire cycle on the first failure in both modes.

  2. handleExecute context propagation — was calling deleteOldFiles(context.Background(), ...), changed to c.Context() so client disconnects and server timeouts cancel in-flight deletions. Also records execution as "failed" and returns 500 on error instead of swallowing it.

  3. retentionClusterGate.CanRunRetention nil safety: GetLocalNode() can return nil during cluster startup; added a guard to return false instead of panicking.

@gemini-code-assist please do a final review pass on these three commits.
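
The nil guard from item 3, roughly (sketch; the Node type and the gate struct are simplified stand-ins for the cluster package types):

```go
// Sketch of the nil-safe gate check during cluster startup.
package cluster

type Node struct {
	primaryWriter bool
}

func (n *Node) IsPrimaryWriter() bool { return n.primaryWriter }

type retentionClusterGate struct {
	getLocalNode func() *Node
}

func (g *retentionClusterGate) CanRunRetention() bool {
	node := g.getLocalNode()
	if node == nil {
		// Cluster still starting up: report "cannot run" instead of panicking;
		// the next cron tick re-checks once the local node is registered.
		return false
	}
	return node.IsPrimaryWriter()
}
```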

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution, ensuring that retention tasks run only on the primary writer node and that file deletions are synchronized with the Raft manifest. Key changes include the implementation of a cluster gate for the scheduler, batch manifest updates prior to storage deletion to maintain consistency, and concurrency controls to prevent overlapping retention cycles. Additionally, the PR includes a security fix to escape file paths in DuckDB queries to prevent SQL injection. Feedback was provided regarding the lack of cluster role verification in the manual API execution path, which currently allows reader nodes to trigger retention and potentially cause races.

Comment thread internal/api/retention.go
handleExecute was missing the primary-writer check — a reader node could
trigger retention via the API and race with the writer over shared or local
storage. Add CanRunRetention()/Role() to RetentionCoordinator interface and
check them in handleExecute before executing (dry-run is allowed everywhere
since it makes no mutations). Implement the two new methods on
*cluster.Coordinator, delegating to GetLocalNode().IsPrimaryWriter().
xe-nvdk (Member Author) commented Apr 20, 2026

Fixed the reader-node API race flagged in the latest review.

handleExecute now checks h.coordinator.CanRunRetention() before executing any mutations — a reader node gets a 503 Service Unavailable with the node's current role in the error message. Dry-run requests are allowed on any node since they make no mutations.

Implementation: extended RetentionCoordinator interface with CanRunRetention() bool and Role() string, implemented both on *cluster.Coordinator (delegates to GetLocalNode().IsPrimaryWriter() with nil safety). Standalone mode is unaffected — coordinator is nil, check is skipped.

@gemini-code-assist please do a final pass.
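
Shape of the new check (sketch of the decision logic only, since the real handler is framework-specific; the CanRunRetention/Role method names match the PR, everything else is illustrative):

```go
// Decision logic for gating manual retention runs to the primary writer.
package api

import (
	"fmt"
	"net/http"
)

type RetentionCoordinator interface {
	CanRunRetention() bool
	Role() string
}

type RetentionHandler struct {
	coordinator RetentionCoordinator // nil in standalone mode
}

// checkWriter returns 503 for mutating requests on non-writer nodes; dry runs
// and standalone deployments are always allowed.
func (h *RetentionHandler) checkWriter(dryRun bool) (int, error) {
	if dryRun || h.coordinator == nil {
		return http.StatusOK, nil
	}
	if !h.coordinator.CanRunRetention() {
		return http.StatusServiceUnavailable,
			fmt.Errorf("retention can only run on the primary writer; this node's role is %q", h.coordinator.Role())
	}
	return http.StatusOK, nil
}
```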

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized via the Raft manifest. Key changes include the introduction of cluster gates to manage node roles, batching manifest updates before storage deletions to maintain consistency, and adding concurrency protection to prevent overlapping retention cycles. Feedback focuses on improving auditability by including policy identifiers in deletion reasons, making the hardcoded 30-minute execution timeout configurable, and refining error handling to distinguish between transient storage failures and non-transient manifest errors.

Comment thread internal/api/retention.go Outdated
Comment thread internal/scheduler/retention_scheduler.go
Comment thread internal/api/retention.go
…ge errors

- Pass reason="retention:<policy_id>" to deleteOldFiles so manifest ops carry
  the policy ID for auditability in the Raft log and delete-worker traces
- Downgrade storage.Delete failures from Error to Warn — a network blip or
  already-deleted file is transient; the cycle should continue. Manifest
  failures remain Error + abort since Raft quorum loss is non-transient.
xe-nvdk (Member Author) commented Apr 20, 2026

Two more fixes from the latest review:

  1. Policy ID in manifest reason: deleteOldFiles now takes a reason string parameter; callers pass "retention:<policy_id>" (e.g. "retention:42"). This propagates to DeleteFilePayload.Reason in every Raft manifest op, making delete-worker traces and Raft log entries auditable by policy.

  2. Storage error severity: storage.Delete failures downgraded from Error to Warn. A network blip or already-deleted file is transient; the cycle continues to the next file. Manifest failures remain Error + abort since Raft quorum loss is non-transient.

@gemini-code-assist please do a final pass.
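
Sketch of both changes (the "retention:<policy_id>" format and the Warn-vs-Error split come from the change itself; helper names and the Storage interface are illustrative):

```go
// Audit reason for manifest ops plus the storage-vs-manifest error split.
package api

import (
	"context"
	"fmt"
	"log"
)

type Storage interface {
	Delete(ctx context.Context, path string) error
}

// retentionReason is carried in every Raft manifest op (DeleteFilePayload.Reason),
// making delete-worker traces and Raft log entries auditable by policy.
func retentionReason(policyID int64) string {
	return fmt.Sprintf("retention:%d", policyID) // e.g. "retention:42"
}

func deleteFromStorage(ctx context.Context, store Storage, paths []string) {
	for _, p := range paths {
		if err := store.Delete(ctx, p); err != nil {
			// Transient (network blip, already-deleted file): warn and continue.
			// Manifest failures stay at Error and abort the whole cycle.
			log.Printf("WARN: storage delete failed for %s: %v", p, err)
		}
	}
}
```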

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized across the cluster via the Raft manifest. It also introduces concurrency protection in the scheduler to prevent overlapping runs and adds SQL escaping to protect against injection in DuckDB queries. Review feedback identifies an issue where cumulative deletion counters are not updated when a measurement fails partially, which results in inaccurate execution metrics being recorded.

Comment thread internal/api/retention.go
Comment thread internal/api/retention.go
…ent error

Accumulate deleted/filesDeleted before checking the error so that
recordExecutionComplete reports all completed work accurately when a mid-cycle
abort occurs. deleteOldFiles returns its partial progress even on failure, so
the prior behaviour was discarding counts from the failed measurement chunk.
xe-nvdk (Member Author) commented Apr 20, 2026

Fixed the partial-progress counter issue.

totalDeleted and totalFilesDeleted are now accumulated before the error check in both ExecutePolicy and handleExecute. deleteOldFiles returns its partial progress even on abort (rows/files deleted in completed chunks before the manifest failure), so the execution record now accurately reflects all completed work rather than dropping counts from the failed measurement.

@gemini-code-assist please do a final pass.
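
Sketch of the accumulation order (signatures are illustrative; the point is that counters are updated before the error check, since deleteOldFiles reports what it completed even when it aborts):

```go
// Accumulate partial progress before deciding whether to abort.
package api

import "context"

type deleteResult struct {
	RowsDeleted  int64
	FilesDeleted int64
}

func runPolicies(
	ctx context.Context,
	measurements []string,
	deleteOldFiles func(ctx context.Context, measurement string) (deleteResult, error),
) (totalRows, totalFiles int64, err error) {
	for _, m := range measurements {
		res, derr := deleteOldFiles(ctx, m)
		// Count completed work first, then decide whether to abort, so the
		// execution record includes chunks that finished before the failure.
		totalRows += res.RowsDeleted
		totalFiles += res.FilesDeleted
		if derr != nil {
			return totalRows, totalFiles, derr
		}
	}
	return totalRows, totalFiles, nil
}
```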

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution for enterprise deployments. Key changes include ensuring that retention policies run exclusively on the primary writer node via a new cluster gate, propagating file deletions through the Raft manifest for cluster-wide consistency, and implementing concurrency control in the scheduler to prevent overlapping execution cycles. Additionally, the PR adds license validation for retention tasks and hardens DuckDB queries against SQL injection by escaping file paths. There are no review comments to address, and I have no further feedback to provide.

xe-nvdk merged commit 347b965 into main on Apr 20, 2026
6 checks passed