feat(cluster): writer-only retention with Raft manifest propagation #407
Conversation
Only the primary writer node runs retention policies in cluster mode, preventing races on shared/per-node storage. After each file delete, DeleteFileFromManifest commits the removal into the Raft log so reader nodes clean up their local copies via the existing onFileDeleted worker pool.
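For reference, a minimal sketch of the flow described above, assuming hypothetical Coordinator/Manifest/Storage interface shapes rather than the project's real signatures; the ordering and batching of the manifest updates get refined later in this thread:

```go
// Writer-only retention with per-file Raft manifest propagation.
// Coordinator, Manifest, and Storage are assumed shapes for illustration.
package retention

import (
	"context"
	"fmt"
	"log/slog"
)

type Coordinator interface {
	// IsPrimaryWriter reports whether this node may run retention.
	IsPrimaryWriter() bool
}

type Manifest interface {
	// DeleteFileFromManifest commits the removal into the Raft log; reader
	// nodes then drop their local copies via the onFileDeleted worker pool.
	DeleteFileFromManifest(ctx context.Context, path, reason string) error
}

type Storage interface {
	Delete(ctx context.Context, path string) error
}

func runRetentionCycle(ctx context.Context, coord Coordinator, m Manifest, st Storage, expired []string) error {
	// Cluster gate: only the primary writer deletes, so readers never race
	// with it over shared or per-node storage.
	if coord != nil && !coord.IsPrimaryWriter() {
		slog.Info("retention skipped: not the primary writer")
		return nil
	}
	for _, path := range expired {
		if err := st.Delete(ctx, path); err != nil {
			return fmt.Errorf("delete %s: %w", path, err)
		}
		// Propagate the deletion cluster-wide after the local delete.
		if err := m.DeleteFileFromManifest(ctx, path, "retention"); err != nil {
			return fmt.Errorf("manifest delete %s: %w", path, err)
		}
	}
	return nil
}
```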
Code Review
This pull request implements cluster-safe retention policy execution for enterprise deployments, ensuring that retention tasks only run on the primary writer node and that file deletions are propagated to the Raft manifest. Feedback indicates that the current implementation of Raft manifest updates is inefficient and should be batched to improve performance. Additionally, the cluster gate logic in the scheduler needs to be moved from the initialization phase to the execution phase to correctly handle node role transitions and failover scenarios.
…te check

- Batch all retention file deletions into a single Raft proposal via BatchFileOpsInManifest instead of one proposal per file, reducing Raft log pressure on large retention runs
- Move cluster gate check from Start() into runRetention() so role transitions (failover, demotion) take effect on the next tick without requiring a node restart (sketched below)
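A minimal sketch of the second point, assuming a scheduler that holds an optional cluster gate; the type and field names are illustrative, not the PR's exact code:

```go
// Check the cluster gate on every tick inside runRetention, not once in
// Start, so a role change takes effect without a restart.
package retention

import "log/slog"

type clusterGate interface {
	CanRunRetention() bool
}

type RetentionScheduler struct {
	gate clusterGate // nil in standalone mode
}

func (s *RetentionScheduler) runRetention() {
	// A node demoted from primary writer stops running retention on its next
	// tick; a newly promoted node starts on its next tick.
	if s.gate != nil && !s.gate.CanRunRetention() {
		slog.Debug("retention skipped: node is not the primary writer")
		return
	}
	// ... evaluate policies and delete expired files ...
}
```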
@gemini-code-assist Both findings from your review have been addressed in the latest push:
Could you do another pass?
Code Review
This pull request implements cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized across the cluster via the Raft manifest. The changes include the introduction of cluster gates in the scheduler and batching logic for manifest updates in the retention handler. Feedback focuses on enhancing the reliability of manifest updates by addressing potential orphaned entries and Raft log size limitations through chunking, as well as optimizing performance by removing unnecessary mutex locking for immutable fields.
- Chunk manifest batch updates at 1000 ops to avoid oversized Raft log entries (sketched below)
- Upgrade manifest failure log to Error with honest comment (no self-heal today, Phase 5 reconciliation will clean orphans)
- Remove unnecessary mutex around immutable clusterGate in runRetention and TriggerNow
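A sketch of the chunking, assuming BatchFileOpsInManifest accepts a slice of ops; the op type and signature are assumptions for illustration:

```go
// Split manifest ops into fixed-size chunks so a large retention run never
// produces an oversized Raft log entry.
package retention

import "context"

const manifestBatchSize = 1000

type FileOp struct {
	Path   string
	Delete bool
}

type manifestBatcher interface {
	BatchFileOpsInManifest(ctx context.Context, ops []FileOp) error
}

func applyManifestOps(ctx context.Context, m manifestBatcher, ops []FileOp) error {
	for start := 0; start < len(ops); start += manifestBatchSize {
		end := start + manifestBatchSize
		if end > len(ops) {
			end = len(ops)
		}
		if err := m.BatchFileOpsInManifest(ctx, ops[start:end]); err != nil {
			// A Raft quorum loss is not transient: surface it to the caller.
			return err
		}
	}
	return nil
}
```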
@gemini-code-assist Three findings from the second review addressed:
Could you do a final pass?
Code Review
This pull request introduces cluster-safe retention policy execution, ensuring that retention tasks only run on the primary writer node and that file deletions are propagated to the Raft manifest. Feedback highlights a potential consistency issue where files are deleted from storage before the manifest is updated; it is recommended to reverse this order to avoid orphaned manifest entries in the event of a Raft quorum loss. Additionally, error handling for manifest updates should be improved by terminating the retention cycle upon failure instead of merely logging the error and continuing with subsequent chunks.
…lure

Update cluster manifest before deleting from storage so a Raft failure leaves the file intact in both places and the next retention run can retry it. Previously, storage was deleted first, creating permanent orphan manifest entries on failure.

On manifest update failure, abort the entire retention cycle rather than continuing to the next chunk — a Raft quorum loss is not transient and proceeding would only create more orphans for the same root cause.

Also removes the now-redundant !dryRun guard (deletedFilePaths is only populated when dryRun=false).
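A minimal sketch of that ordering, with the manifest and storage calls passed in as plain functions to stand in for the PR's DeleteFileFromManifest and storage delete (assumed shapes):

```go
// Manifest first, storage second: a failed Raft proposal leaves the file
// intact in both places, so the next retention run can retry it.
package retention

import (
	"context"
	"fmt"
)

func deleteExpiredFiles(
	ctx context.Context,
	paths []string,
	deleteFromManifest func(ctx context.Context, path string) error,
	deleteFromStorage func(ctx context.Context, path string) error,
) error {
	for _, p := range paths {
		if err := deleteFromManifest(ctx, p); err != nil {
			// Raft quorum loss is not transient: abort the whole cycle
			// instead of creating more orphans for the same root cause.
			return fmt.Errorf("manifest update failed, aborting retention cycle: %w", err)
		}
		if err := deleteFromStorage(ctx, p); err != nil {
			return fmt.Errorf("storage delete %s: %w", p, err)
		}
	}
	return nil
}
```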
@gemini-code-assist Three findings from the latest round addressed:
Could you do a final pass?
Code Review
This pull request introduces cluster-aware retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and propagate deletions through the Raft manifest. Key changes include the addition of a cluster gate for the retention scheduler and a refactored deletion process in the retention handler that updates the cluster manifest before removing files from storage. Feedback highlights a potential SQL injection vulnerability in how file paths are handled during metadata retrieval and suggests interleaving manifest updates with storage deletions to improve the robustness of the process and minimize orphaned files.
…rquet path

- Interleave manifest update and storage delete per chunk instead of two separate passes; limits orphan blast radius to one chunk on mid-run failure
- Escape single quotes in read_parquet() file paths to prevent SQL injection via crafted database/measurement names (sketched below)
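A sketch of the escaping, using a hypothetical helper name and query; the point is doubling single quotes before the path is interpolated into the read_parquet() string literal:

```go
// Escape single quotes in file paths before building a DuckDB query, so a
// crafted database or measurement name cannot break out of the literal.
// escapeSQLString and the query shown are illustrative, not the PR's code.
package retention

import (
	"fmt"
	"strings"
)

func escapeSQLString(s string) string {
	// Doubling single quotes is the standard SQL escape for string literals.
	return strings.ReplaceAll(s, "'", "''")
}

func parquetMetadataQuery(path string) string {
	return fmt.Sprintf("SELECT count(*) FROM read_parquet('%s')", escapeSQLString(path))
}
```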
@gemini-code-assist Both findings addressed:
Could you do a final pass?
Code Review
This pull request introduces cluster-safe retention policy execution for enterprise deployments, ensuring that retention tasks only run on the primary writer node and that file deletions are synchronized via the Raft manifest. Key improvements include batching deletions in chunks of 1000 and adding SQL escaping to DuckDB queries to prevent injection. A logic bug was identified in the chunk processing logic where a failure to marshal a manifest update for a file would not prevent its deletion from storage, potentially leading to manifest inconsistency.
Track subPaths/subRows in parallel with ops so a json.Marshal failure skips a file from both the manifest batch and the storage delete loop. Previously a marshal failure would skip the manifest op but still delete the file from storage, creating an orphaned manifest entry.
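A sketch of the parallel tracking; FileMeta and the op payload layout are assumptions for illustration:

```go
// Build the manifest ops and the storage-delete worklist in lockstep, so a
// json.Marshal failure drops the file from both.
package retention

import (
	"encoding/json"
	"log/slog"
)

type FileMeta struct {
	Path string
	Rows int64
}

func buildChunk(files []FileMeta) (ops [][]byte, subPaths []string, subRows []int64) {
	for _, f := range files {
		payload, err := json.Marshal(f)
		if err != nil {
			// Skip the file entirely: it must appear in neither the manifest
			// batch nor the storage delete loop, or the two drift apart.
			slog.Warn("skipping file: marshal failed", "path", f.Path, "error", err)
			continue
		}
		ops = append(ops, payload)
		subPaths = append(subPaths, f.Path)
		subRows = append(subRows, f.Rows)
	}
	return ops, subPaths, subRows
}
```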
@gemini-code-assist Logic bug fixed. Could you do a final pass?
Code Review
This pull request introduces cluster-safe retention policy execution for enterprise deployments, ensuring that retention tasks only run on the primary writer node and that file deletions are synchronized across the cluster via Raft manifest updates. The implementation includes a new cluster gate mechanism, batched manifest updates to maintain consistency, and SQL injection hardening for DuckDB queries. Feedback was provided to improve the readability of slice initialization within the retention handler's deletion logic.
@gemini-code-assist Addressed — replaced the slice initialization flagged in the retention handler's deletion logic with the more readable form.
Code Review
This pull request introduces cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized across the cluster via the Raft manifest. The changes include the implementation of cluster gates in the scheduler and chunked manifest updates in the retention handler. Feedback recommends enhancing error handling for manifest update failures to avoid redundant logging during cluster outages and implementing concurrency controls in the scheduler to prevent overlapping retention cycles.
- Add runningJob flag to RetentionScheduler to prevent overlapping cycles when a run exceeds the cron interval; applies to both scheduled and manual TriggerNow paths (sketched below)
- Abort ExecutePolicy early on manifest error (Raft quorum loss is not transient — continuing to the next measurement only creates more orphans)
- Add license check in ExecutePolicy to guard direct programmatic calls that bypass the scheduler's license gate
- Wire licenseClient into RetentionHandler constructor
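A sketch of the overlap guard, assuming an atomic flag; the PR may guard it differently (for example with a mutex), but the effect is the same for both the cron tick and TriggerNow:

```go
// A single in-progress flag shared by the scheduled and manual paths, so a
// long run cannot overlap the next tick or a manual trigger.
package retention

import (
	"log/slog"
	"sync/atomic"
)

type RetentionScheduler struct {
	runningJob atomic.Bool
}

// runRetention is called by both the cron tick and TriggerNow, so both paths
// are covered by the same guard.
func (s *RetentionScheduler) runRetention() {
	if !s.runningJob.CompareAndSwap(false, true) {
		slog.Warn("retention cycle already in progress, skipping this run")
		return
	}
	defer s.runningJob.Store(false)
	// ... cluster gate check, policy evaluation, deletions ...
}
```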
@gemini-code-assist Four issues addressed from Gemini's latest round and a parallel principal-level review:
Could you do a final pass?
Code Review
This pull request implements cluster-safe retention policy execution, ensuring that retention runs exclusively on the primary writer node and propagates file deletions through the Raft manifest. It also introduces license validation for retention tasks and enhances SQL safety by escaping file paths in DuckDB queries. Feedback was provided regarding the need for concurrency protection in the manual trigger method to prevent overlapping retention cycles.
Prevents a manual trigger from starting a second concurrent cycle when a scheduled run is already in progress.
@gemini-code-assist Applied the runningJob guard to the manual TriggerNow path as well.
Code Review
This pull request introduces cluster-safe retention policy execution for enterprise deployments. Key changes include restricting retention tasks to the primary writer node via a cluster gate, propagating file deletions to the Raft manifest for cluster-wide consistency, and implementing concurrency control to prevent overlapping retention cycles. Additionally, the update includes license validation for retention execution and SQL escaping for DuckDB queries to mitigate injection risks. Feedback was provided regarding the efficiency of JSON marshaling within the file deletion loop.
…uster gate

- ExecutePolicy: remove coordinator nil-guard around abort — standalone mode should also abort the cycle on storage/manifest failure rather than silently continuing to the next measurement and creating inconsistent partial state
- handleExecute: propagate c.Context() into deleteOldFiles (was context.Background()) so client disconnects and server timeouts cancel in-flight deletion; record execution as "failed" and return 500 on first error rather than swallowing it
- retentionClusterGate.CanRunRetention: guard GetLocalNode() return value against nil so a gate check during cluster startup does not panic (sketched below)
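A sketch of the nil guard on the gate; Node and the coordinator shape are assumptions for illustration:

```go
// During cluster startup GetLocalNode may not have a node yet, so the gate
// must answer "no" rather than panic.
package retention

type Node struct{ primaryWriter bool }

func (n *Node) IsPrimaryWriter() bool { return n.primaryWriter }

type coordinator interface {
	GetLocalNode() *Node
}

type retentionClusterGate struct {
	coord coordinator
}

func (g *retentionClusterGate) CanRunRetention() bool {
	node := g.coord.GetLocalNode()
	if node == nil {
		// Cluster membership not established yet; do not run retention.
		return false
	}
	return node.IsPrimaryWriter()
}
```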
Three additional fixes from a parallel staff/principal-engineer review:
@gemini-code-assist please do a final review pass on these three commits.
Code Review
This pull request introduces cluster-safe retention policy execution, ensuring that retention tasks run only on the primary writer node and that file deletions are synchronized with the Raft manifest. Key changes include the implementation of a cluster gate for the scheduler, batch manifest updates prior to storage deletion to maintain consistency, and concurrency controls to prevent overlapping retention cycles. Additionally, the PR includes a security fix to escape file paths in DuckDB queries to prevent SQL injection. Feedback was provided regarding the lack of cluster role verification in the manual API execution path, which currently allows reader nodes to trigger retention and potentially cause races.
handleExecute was missing the primary-writer check — a reader node could trigger retention via the API and race with the writer over shared or local storage. Add CanRunRetention()/Role() to RetentionCoordinator interface and check them in handleExecute before executing (dry-run is allowed everywhere since it makes no mutations). Implement the two new methods on *cluster.Coordinator, delegating to GetLocalNode().IsPrimaryWriter().
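A sketch of the gate logic with the check factored into a framework-neutral helper; CanRunRetention and Role are the interface methods named above, the helper itself is hypothetical:

```go
// API-path gate: a real retention run is writer-only, while a dry-run is
// allowed on any node because it makes no mutations.
package retention

import "fmt"

type RetentionCoordinator interface {
	CanRunRetention() bool
	Role() string
}

func checkExecuteAllowed(coord RetentionCoordinator, dryRun bool) error {
	if coord == nil || dryRun {
		return nil // standalone mode, or a read-only preview
	}
	if !coord.CanRunRetention() {
		return fmt.Errorf("retention can only be executed on the primary writer node (this node's role: %s)", coord.Role())
	}
	return nil
}
```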
Fixed the reader-node API race flagged in the latest review.
Implementation: extended the RetentionCoordinator interface with CanRunRetention()/Role(), implemented on *cluster.Coordinator.
@gemini-code-assist please do a final pass.
Code Review
This pull request implements cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized via the Raft manifest. Key changes include the introduction of cluster gates to manage node roles, batching manifest updates before storage deletions to maintain consistency, and adding concurrency protection to prevent overlapping retention cycles. Feedback focuses on improving auditability by including policy identifiers in deletion reasons, making the hardcoded 30-minute execution timeout configurable, and refining error handling to distinguish between transient storage failures and non-transient manifest errors.
…ge errors

- Pass reason="retention:<policy_id>" to deleteOldFiles so manifest ops carry the policy ID for auditability in the Raft log and delete-worker traces
- Downgrade storage.Delete failures from Error to Warn — a network blip or already-deleted file is transient; the cycle should continue. Manifest failures remain Error + abort since Raft quorum loss is non-transient. (Both changes are sketched below.)
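A sketch covering both changes; the function shapes are assumptions, the point is the reason string and the Warn/Error split:

```go
// Manifest failures abort the cycle at Error severity; storage failures are
// logged at Warn and the cycle continues.
package retention

import (
	"context"
	"fmt"
	"log/slog"
)

func deleteWithSeveritySplit(
	ctx context.Context,
	paths []string,
	manifestDelete func(ctx context.Context, path, reason string) error,
	storageDelete func(ctx context.Context, path string) error,
	policyID string,
) error {
	reason := "retention:" + policyID // carried into the Raft log for auditability
	for _, p := range paths {
		if err := manifestDelete(ctx, p, reason); err != nil {
			// Raft quorum loss is not transient: abort the cycle.
			slog.Error("manifest update failed, aborting retention cycle", "path", p, "error", err)
			return fmt.Errorf("manifest update: %w", err)
		}
		if err := storageDelete(ctx, p); err != nil {
			// A network blip or an already-deleted file is transient; keep going.
			slog.Warn("storage delete failed, continuing", "path", p, "error", err)
		}
	}
	return nil
}
```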
Two more fixes from the latest review:
@gemini-code-assist please do a final pass.
Code Review
This pull request implements cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized across the cluster via the Raft manifest. It also introduces concurrency protection in the scheduler to prevent overlapping runs and adds SQL escaping to protect against injection in DuckDB queries. Review feedback identifies an issue where cumulative deletion counters are not updated when a measurement fails partially, which results in inaccurate execution metrics being recorded.
…ent error

Accumulate deleted/filesDeleted before checking the error so that recordExecutionComplete reports all completed work accurately when a mid-cycle abort occurs. deleteOldFiles returns its partial progress even on failure, so the prior behaviour was discarding counts from the failed measurement chunk.
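A sketch of the ordering, assuming deleteOldFiles returns its partial progress alongside the error; the surrounding names are illustrative:

```go
// Accumulate partial progress before checking the error, so a mid-cycle abort
// still reports the work that completed.
package retention

import "context"

func executePolicy(ctx context.Context, measurements []string,
	deleteOldFiles func(ctx context.Context, measurement string) (deleted int64, filesDeleted int, err error),
) (totalDeleted int64, totalFiles int, err error) {
	for _, m := range measurements {
		deleted, files, derr := deleteOldFiles(ctx, m)
		// Accumulate first: otherwise recordExecutionComplete undercounts the
		// files removed by the measurement that failed partway through.
		totalDeleted += deleted
		totalFiles += files
		if derr != nil {
			return totalDeleted, totalFiles, derr
		}
	}
	return totalDeleted, totalFiles, nil
}
```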
Fixed the partial-progress counter issue.
@gemini-code-assist please do a final pass.
Code Review
This pull request introduces cluster-safe retention policy execution for enterprise deployments. Key changes include ensuring that retention policies run exclusively on the primary writer node via a new cluster gate, propagating file deletions through the Raft manifest for cluster-wide consistency, and implementing concurrency control in the scheduler to prevent overlapping execution cycles. Additionally, the PR adds license validation for retention tasks and hardens DuckDB queries against SQL injection by escaping file paths. There are no review comments to address, and I have no further feedback to provide.
Summary
- DeleteFileFromManifest commits the removal into the Raft log, keeping the cluster manifest consistent
- Reader nodes clean up their local copies via the existing onFileDeleted FSM callback and delete-worker pool (shared with compaction — no new code needed)

Test plan
- go build ./cmd/... ./internal/... passes clean
- go test ./internal/scheduler/... ./internal/api/... ./internal/cluster/... all pass
- GET /api/v1/cluster/files shows no orphaned entries for deleted files