feat(cluster): writer-only retention with Raft manifest propagation #407

Merged
xe-nvdk merged 13 commits into main from feat/cluster-safe-retention on Apr 20, 2026

Conversation

xe-nvdk (Member) commented Apr 20, 2026

Summary

  • Only the primary writer node runs retention policies in cluster mode, preventing races on shared (S3/Azure/NFS) and per-node (local) storage
  • After each file delete, DeleteFileFromManifest commits the removal into the Raft log, keeping the cluster manifest consistent
  • Reader nodes clean up their local copies via the existing onFileDeleted FSM callback and delete-worker pool (shared with compaction — no new code needed)
  • Standalone deployments are completely unaffected (nil gate = no check)

Test plan

  • go build ./cmd/... ./internal/... passes clean
  • go test ./internal/scheduler/... ./internal/api/... ./internal/cluster/... all pass
  • Standalone: retention runs normally, no gate log messages
  • Cluster (3-node): only writer logs "Triggering scheduled retention"; readers log "Scheduler idle"
  • Cluster: after retention runs, GET /api/v1/cluster/files shows no orphaned entries for deleted files
  • Writer failover: new primary picks up retention on next cron tick

Only the primary writer node runs retention policies in cluster mode,
preventing races on shared/per-node storage. After each file delete,
DeleteFileFromManifest commits the removal into the Raft log so reader
nodes clean up their local copies via the existing onFileDeleted worker
pool.
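
For readers skimming the diff, the gate wiring is roughly this shape (a minimal sketch, not the actual code; the ClusterGate interface and the log lines are illustrative):

```go
// Minimal sketch of the writer-only gate, assuming a ClusterGate interface held
// by the scheduler; names are illustrative, not the actual API.
package scheduler

import "log"

// ClusterGate reports whether this node is allowed to run retention.
type ClusterGate interface {
	CanRunRetention() bool
}

type RetentionScheduler struct {
	clusterGate ClusterGate // nil in standalone deployments
}

func (s *RetentionScheduler) runRetention() {
	// nil gate = standalone: no check, retention always runs.
	if s.clusterGate != nil && !s.clusterGate.CanRunRetention() {
		log.Println("Scheduler idle: node is not the primary writer")
		return
	}
	log.Println("Triggering scheduled retention")
	// ... execute retention policies ...
}
```
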
gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution for enterprise deployments, ensuring that retention tasks only run on the primary writer node and that file deletions are propagated to the Raft manifest. Feedback indicates that the current implementation of Raft manifest updates is inefficient and should be batched to improve performance. Additionally, the cluster gate logic in the scheduler needs to be moved from the initialization phase to the execution phase to correctly handle node role transitions and failover scenarios.

Comment thread internal/api/retention.go Outdated
Comment thread internal/scheduler/retention_scheduler.go Outdated
…te check

- Batch all retention file deletions into a single Raft proposal via
  BatchFileOpsInManifest instead of one proposal per file, reducing
  Raft log pressure on large retention runs
- Move cluster gate check from Start() into runRetention() so role
  transitions (failover, demotion) take effect on the next tick without
  requiring a node restart
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Both findings from your review have been addressed in the latest push:

  1. Batch manifest updates — replaced the per-file DeleteFileFromManifest calls with a single BatchFileOpsInManifest call after the deletion loop, matching how compaction handles manifest updates.

  2. Per-tick gate check — moved CanRunRetention() out of Start() and into runRetention() so role transitions (failover, demotion) take effect on the next cron tick without requiring a node restart. Status() now reports the live can_run value instead of a stale startup snapshot.

Could you do another pass?
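
For reference, the batched update from item 1 looks roughly like this (sketch only; the ManifestOp type and the BatchFileOpsInManifest signature shown here are simplified stand-ins for the real coordinator API):

```go
// Sketch of the batched manifest update, using simplified stand-ins for the
// coordinator API; the real ManifestOp/BatchFileOpsInManifest shapes may differ.
package api

import (
	"context"
	"fmt"
)

type ManifestOp struct {
	Type string // e.g. "delete_file"
	Path string
}

type Coordinator interface {
	BatchFileOpsInManifest(ctx context.Context, ops []ManifestOp) error
}

// commitDeletes collects one op per deleted file and submits them as a single
// Raft proposal, instead of one DeleteFileFromManifest proposal per file.
func commitDeletes(ctx context.Context, coord Coordinator, deletedPaths []string) error {
	if len(deletedPaths) == 0 {
		return nil
	}
	ops := make([]ManifestOp, 0, len(deletedPaths))
	for _, p := range deletedPaths {
		ops = append(ops, ManifestOp{Type: "delete_file", Path: p})
	}
	if err := coord.BatchFileOpsInManifest(ctx, ops); err != nil {
		return fmt.Errorf("batch manifest update failed: %w", err)
	}
	return nil
}
```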

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized across the cluster via the Raft manifest. The changes include the introduction of cluster gates in the scheduler and batching logic for manifest updates in the retention handler. Feedback focuses on enhancing the reliability of manifest updates by addressing potential orphaned entries and Raft log size limitations through chunking, as well as optimizing performance by removing unnecessary mutex locking for immutable fields.

Comment thread internal/api/retention.go Outdated
Comment thread internal/api/retention.go Outdated
Comment thread internal/scheduler/retention_scheduler.go Outdated
- Chunk manifest batch updates at 1000 ops to avoid oversized Raft log entries
- Upgrade manifest failure log to Error with honest comment (no self-heal today,
  Phase 5 reconciliation will clean orphans)
- Remove unnecessary mutex around immutable clusterGate in runRetention and TriggerNow
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Three findings from the second review addressed:

  1. Chunked manifest updates — batch is now split into 1000-op chunks to prevent oversized Raft log entries on large retention runs.
  2. Error log + honest comment — upgraded from Warn to Error on manifest failure; removed the misleading "self-heal" comment and replaced with an explicit note that affected paths will be orphaned in the manifest until Phase 5 reconciliation ships.
  3. Removed unnecessary mutex: clusterGate is immutable after construction; removed the lock/copy pattern in both runRetention and TriggerNow.

Could you do a final pass?
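
The chunking from item 1 looks roughly like this (sketch; the helper and the commit callback are illustrative, and at this stage of the PR a failed chunk is logged at Error level and the loop keeps going):

```go
// Sketch of the 1000-op chunking; helper name and commit callback are
// illustrative. Abort-on-failure comes in a later commit.
package api

import "log"

const manifestChunkSize = 1000

func commitInChunks(paths []string, commit func(chunk []string) error) {
	for start := 0; start < len(paths); start += manifestChunkSize {
		end := start + manifestChunkSize
		if end > len(paths) {
			end = len(paths)
		}
		if err := commit(paths[start:end]); err != nil {
			// No self-heal today: these paths stay orphaned in the manifest
			// until Phase 5 reconciliation.
			log.Printf("ERROR: manifest update failed for %d files: %v", end-start, err)
		}
	}
}
```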

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution, ensuring that retention tasks only run on the primary writer node and that file deletions are propagated to the Raft manifest. Feedback highlights a potential consistency issue where files are deleted from storage before the manifest is updated; it is recommended to reverse this order to avoid orphaned manifest entries in the event of a Raft quorum loss. Additionally, error handling for manifest updates should be improved by terminating the retention cycle upon failure instead of merely logging the error and continuing with subsequent chunks.

Comment thread internal/api/retention.go Outdated
Comment thread internal/api/retention.go Outdated
…lure

Update cluster manifest before deleting from storage so a Raft failure
leaves the file intact in both places and the next retention run can
retry it. Previously, storage was deleted first, creating permanent
orphan manifest entries on failure.

On manifest update failure, abort the entire retention cycle rather than
continuing to the next chunk — a Raft quorum loss is not transient and
proceeding would only create more orphans for the same root cause.

Also removes the now-redundant !dryRun guard (deletedFilePaths is only
populated when dryRun=false).
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Three findings from the latest round addressed:

  1. Manifest-before-storage ordering — the manifest is now updated before deleting from storage. If the Raft update fails, the file still exists in both places and the next retention run will retry it naturally. A storage-delete failure after a successful manifest update leaves a harmless ghost file on disk (invisible to the cluster, will be swept by the next compaction pass).

  2. Abort on manifest failure — if BatchFileOpsInManifest fails, the retention cycle returns an error immediately rather than continuing to the next chunk. A Raft quorum loss is not transient; proceeding would only produce more orphans for the same root cause.

  3. Removed redundant !dryRun guard: deletedFilePaths/eligiblePaths is only populated when dryRun=false, so the outer check was unnecessary.

Could you do a final pass?
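
The new ordering from item 1, sketched (interfaces and signatures are simplified stand-ins; the real handler works on chunks of manifest ops rather than bare paths):

```go
// Sketch of the new ordering: manifest first, storage second, abort on manifest
// failure. Interfaces are simplified stand-ins for the real coordinator/storage.
package api

import (
	"context"
	"fmt"
	"log"
)

type Storage interface {
	Delete(ctx context.Context, path string) error
}

type ManifestCoordinator interface {
	RemoveFilesFromManifest(ctx context.Context, paths []string) error
}

func deleteFiles(ctx context.Context, coord ManifestCoordinator, store Storage, paths []string) error {
	// 1. Commit the removals to the Raft manifest first. On failure the files
	//    still exist in both places and the next retention run retries them.
	if err := coord.RemoveFilesFromManifest(ctx, paths); err != nil {
		return fmt.Errorf("manifest update failed, aborting retention cycle: %w", err)
	}
	// 2. Then delete from storage. A failure here leaves a ghost file on disk,
	//    invisible to the cluster, to be swept by the next compaction pass.
	for _, p := range paths {
		if err := store.Delete(ctx, p); err != nil {
			log.Printf("storage delete failed for %s: %v", p, err)
		}
	}
	return nil
}
```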

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-aware retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and propagate deletions through the Raft manifest. Key changes include the addition of a cluster gate for the retention scheduler and a refactored deletion process in the retention handler that updates the cluster manifest before removing files from storage. Feedback highlights a potential SQL injection vulnerability in how file paths are handled during metadata retrieval and suggests interleaving manifest updates with storage deletions to improve the robustness of the process and minimize orphaned files.

Comment thread internal/api/retention.go
Comment thread internal/api/retention.go Outdated
…rquet path

- Interleave manifest update and storage delete per chunk instead of two
  separate passes; limits orphan blast radius to one chunk on mid-run failure
- Escape single quotes in read_parquet() file paths to prevent SQL injection
  via crafted database/measurement names
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Both findings addressed:

  1. Interleaved manifest+storage per chunk — instead of two separate passes (all manifest updates then all storage deletes), each chunk now updates the manifest and immediately deletes from storage before moving to the next chunk. This limits the orphan blast radius to one chunk on a mid-run failure, matching your suggested implementation.

  2. SQL injection fix in getFileMaxTimeAndRowCount — single quotes in the file path are now escaped (strings.ReplaceAll(filePath, "'", "''")) before interpolation into the read_parquet() query, since DuckDB doesn't support parameterized read_parquet() calls.

Could you do a final pass?
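
The escaping from item 2 in isolation (the strings.ReplaceAll escaping is the actual fix; the function name and query shape here are illustrative):

```go
// DuckDB's read_parquet() cannot take a bound parameter for the path, so single
// quotes are doubled before interpolation. Function name and query shape are
// illustrative.
package api

import (
	"fmt"
	"strings"
)

func buildReadParquetQuery(filePath string) string {
	// '' is the SQL escape for ' inside a string literal, so a crafted
	// database/measurement name cannot break out of the path literal.
	escaped := strings.ReplaceAll(filePath, "'", "''")
	return fmt.Sprintf("SELECT max(time), count(*) FROM read_parquet('%s')", escaped)
}
```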

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution for enterprise deployments, ensuring that retention tasks only run on the primary writer node and that file deletions are synchronized via the Raft manifest. Key improvements include batching deletions in chunks of 1000 and adding SQL escaping to DuckDB queries to prevent injection. A logic bug was identified in the chunk processing logic where a failure to marshal a manifest update for a file would not prevent its deletion from storage, potentially leading to manifest inconsistency.

Comment thread internal/api/retention.go
Track subPaths/subRows in parallel with ops so a json.Marshal failure
skips a file from both the manifest batch and the storage delete loop.
Previously a marshal failure would skip the manifest op but still delete
the file from storage, creating an orphaned manifest entry.
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Logic bug fixed: subPaths and subRows are now built in parallel with ops during the marshal loop, so a json.Marshal failure skips the file from both the manifest batch and the storage delete loop. Previously, a marshal failure would silently exclude the file from ops but leave it in chunk, causing a storage delete without a corresponding manifest removal.

Could you do a final pass?
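
Sketch of the parallel-slice pattern described above (types and field names are illustrative):

```go
// A file whose manifest op fails to marshal is skipped in both the manifest
// batch and the storage-delete loop. Types and field names are illustrative.
package api

import (
	"encoding/json"
	"log"
)

type fileInfo struct {
	Path string
	Rows int64
}

type deletePayload struct {
	Path string `json:"path"`
}

func buildChunkOps(chunk []fileInfo) (ops [][]byte, subPaths []string, subRows []int64) {
	ops = make([][]byte, 0, len(chunk))
	subPaths = make([]string, 0, len(chunk))
	subRows = make([]int64, 0, len(chunk))
	for _, f := range chunk {
		b, err := json.Marshal(deletePayload{Path: f.Path})
		if err != nil {
			// Skip the file entirely: it stays in the manifest AND in storage,
			// rather than being deleted from storage with its manifest entry left behind.
			log.Printf("skipping %s: marshal failed: %v", f.Path, err)
			continue
		}
		ops = append(ops, b)
		subPaths = append(subPaths, f.Path)
		subRows = append(subRows, f.Rows)
	}
	// Callers delete from storage using subPaths, never the original chunk.
	return ops, subPaths, subRows
}
```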

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution for enterprise deployments, ensuring that retention tasks only run on the primary writer node and that file deletions are synchronized across the cluster via Raft manifest updates. The implementation includes a new cluster gate mechanism, batched manifest updates to maintain consistency, and SQL injection hardening for DuckDB queries. Feedback was provided to improve the readability of slice initialization within the retention handler's deletion logic.

Comment thread internal/api/retention.go Outdated
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Addressed — replaced [:0:0] with make([]string, 0, len(chunk)) / make([]int64, 0, len(chunk)) for clarity. Could you do a final pass?

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized across the cluster via the Raft manifest. The changes include the implementation of cluster gates in the scheduler and chunked manifest updates in the retention handler. Feedback recommends enhancing error handling for manifest update failures to avoid redundant logging during cluster outages and implementing concurrency controls in the scheduler to prevent overlapping retention cycles.

Comment thread internal/api/retention.go
Comment thread internal/scheduler/retention_scheduler.go
- Add runningJob flag to RetentionScheduler to prevent overlapping cycles
  when a run exceeds the cron interval; applies to both scheduled and
  manual TriggerNow paths
- Abort ExecutePolicy early on manifest error (Raft quorum loss is not
  transient — continuing to the next measurement only creates more orphans)
- Add license check in ExecutePolicy to guard direct programmatic calls
  that bypass the scheduler's license gate
- Wire licenseClient into RetentionHandler constructor
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Four issues addressed from Gemini's latest round and a parallel principal-level review:

  1. Concurrent execution guard — added runningJob bool flag to RetentionScheduler. If a previous cycle is still in progress when the next tick fires, the tick is skipped with a warn log. Applies to both the cron path and TriggerNow.

  2. Early abort on manifest error in ExecutePolicy — when deleteOldFiles returns a manifest error and a coordinator is wired (cluster mode), ExecutePolicy now records the execution as failed and returns immediately instead of continuing to the next measurement. Storage-only errors (standalone mode) still continue as before.

  3. License check in ExecutePolicy — added a license gate directly in ExecutePolicy so direct programmatic/API calls that bypass the scheduler's Start() check are also protected.

  4. licenseClient wired into RetentionHandler — constructor updated to accept *license.Client; call site in main.go and test updated accordingly.

Could you do a final pass?
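
The guard from item 1, roughly (sketch only; the tryAcquire/release helpers are illustrative and the real code may inline the locking):

```go
// Sketch of the overlapping-run guard on the scheduler.
package scheduler

import (
	"errors"
	"log"
	"sync"
)

type RetentionScheduler struct {
	mu         sync.Mutex
	runningJob bool
}

func (s *RetentionScheduler) tryAcquire() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.runningJob {
		return false
	}
	s.runningJob = true
	return true
}

func (s *RetentionScheduler) release() {
	s.mu.Lock()
	s.runningJob = false
	s.mu.Unlock()
}

// runRetention skips the tick if the previous cycle is still in progress.
func (s *RetentionScheduler) runRetention() {
	if !s.tryAcquire() {
		log.Println("previous retention cycle still running, skipping this tick")
		return
	}
	defer s.release()
	// ... gate check, license check, execute policies ...
}

// TriggerNow rejects a manual run while a scheduled cycle is in progress.
func (s *RetentionScheduler) TriggerNow() error {
	if !s.tryAcquire() {
		return errors.New("a retention cycle is already in progress")
	}
	defer s.release()
	// ... execute policies ...
	return nil
}
```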

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution, ensuring that retention runs exclusively on the primary writer node and propagates file deletions through the Raft manifest. It also introduces license validation for retention tasks and enhances SQL safety by escaping file paths in DuckDB queries. Feedback was provided regarding the need for concurrency protection in the manual trigger method to prevent overlapping retention cycles.

Comment thread internal/scheduler/retention_scheduler.go
Prevents a manual trigger from starting a second concurrent cycle when
a scheduled run is already in progress.
xe-nvdk (Member Author) commented Apr 20, 2026

@gemini-code-assist Applied the runningJob guard to TriggerNow as well — a manual trigger now returns an error immediately if a scheduled cycle is already in progress, matching the behaviour added to runRetention. Could you do a final pass?

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution for enterprise deployments. Key changes include restricting retention tasks to the primary writer node via a cluster gate, propagating file deletions to the Raft manifest for cluster-wide consistency, and implementing concurrency control to prevent overlapping retention cycles. Additionally, the update includes license validation for retention execution and SQL escaping for DuckDB queries to mitigate injection risks. Feedback was provided regarding the efficiency of JSON marshaling within the file deletion loop.

Comment thread internal/api/retention.go Outdated
…uster gate

- ExecutePolicy: remove coordinator nil-guard around abort — standalone mode
  should also abort the cycle on storage/manifest failure rather than silently
  continuing to the next measurement and creating inconsistent partial state
- handleExecute: propagate c.Context() into deleteOldFiles (was context.Background())
  so client disconnects and server timeouts cancel in-flight deletion; record
  execution as "failed" and return 500 on first error rather than swallowing it
- retentionClusterGate.CanRunRetention: guard GetLocalNode() return value against
  nil so a gate check during cluster startup does not panic
xe-nvdk (Member Author) commented Apr 20, 2026

Three additional fixes from a parallel staff/principal-engineer review:

  1. ExecutePolicy abort in standalone mode — removed the if h.coordinator != nil guard around the abort path. Standalone mode was silently continuing past errors and producing partial, inconsistent state; it now aborts the entire cycle on the first failure in both modes.

  2. handleExecute context propagation — was calling deleteOldFiles(context.Background(), ...), changed to c.Context() so client disconnects and server timeouts cancel in-flight deletions. Also records execution as "failed" and returns 500 on error instead of swallowing it.

  3. retentionClusterGate.CanRunRetention nil safety: GetLocalNode() can return nil during cluster startup; added a guard to return false instead of panicking.

@gemini-code-assist please do a final review pass on these three commits.
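
The nil guard from item 3, roughly (sketch; the Node type and the gate struct are simplified stand-ins for the cluster package types):

```go
// Sketch of the nil-safe gate check during cluster startup.
package cluster

type Node struct {
	primaryWriter bool
}

func (n *Node) IsPrimaryWriter() bool { return n.primaryWriter }

type retentionClusterGate struct {
	getLocalNode func() *Node
}

func (g *retentionClusterGate) CanRunRetention() bool {
	node := g.getLocalNode()
	if node == nil {
		// Cluster still starting up: report "cannot run" instead of panicking;
		// the next cron tick re-checks once the local node is registered.
		return false
	}
	return node.IsPrimaryWriter()
}
```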

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution, ensuring that retention tasks run only on the primary writer node and that file deletions are synchronized with the Raft manifest. Key changes include the implementation of a cluster gate for the scheduler, batch manifest updates prior to storage deletion to maintain consistency, and concurrency controls to prevent overlapping retention cycles. Additionally, the PR includes a security fix to escape file paths in DuckDB queries to prevent SQL injection. Feedback was provided regarding the lack of cluster role verification in the manual API execution path, which currently allows reader nodes to trigger retention and potentially cause races.

Comment thread internal/api/retention.go
handleExecute was missing the primary-writer check — a reader node could
trigger retention via the API and race with the writer over shared or local
storage. Add CanRunRetention()/Role() to RetentionCoordinator interface and
check them in handleExecute before executing (dry-run is allowed everywhere
since it makes no mutations). Implement the two new methods on
*cluster.Coordinator, delegating to GetLocalNode().IsPrimaryWriter().
xe-nvdk (Member Author) commented Apr 20, 2026

Fixed the reader-node API race flagged in the latest review.

handleExecute now checks h.coordinator.CanRunRetention() before executing any mutations — a reader node gets a 503 Service Unavailable with the node's current role in the error message. Dry-run requests are allowed on any node since they make no mutations.

Implementation: extended RetentionCoordinator interface with CanRunRetention() bool and Role() string, implemented both on *cluster.Coordinator (delegates to GetLocalNode().IsPrimaryWriter() with nil safety). Standalone mode is unaffected — coordinator is nil, check is skipped.

@gemini-code-assist please do a final pass.
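
Shape of the new check (sketch of the decision logic only, since the real handler is framework-specific; the CanRunRetention/Role method names match the PR, everything else is illustrative):

```go
// Decision logic for gating manual retention runs to the primary writer.
package api

import (
	"fmt"
	"net/http"
)

type RetentionCoordinator interface {
	CanRunRetention() bool
	Role() string
}

type RetentionHandler struct {
	coordinator RetentionCoordinator // nil in standalone mode
}

// checkWriter returns 503 for mutating requests on non-writer nodes; dry runs
// and standalone deployments are always allowed.
func (h *RetentionHandler) checkWriter(dryRun bool) (int, error) {
	if dryRun || h.coordinator == nil {
		return http.StatusOK, nil
	}
	if !h.coordinator.CanRunRetention() {
		return http.StatusServiceUnavailable,
			fmt.Errorf("retention can only run on the primary writer; this node's role is %q", h.coordinator.Role())
	}
	return http.StatusOK, nil
}
```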

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized via the Raft manifest. Key changes include the introduction of cluster gates to manage node roles, batching manifest updates before storage deletions to maintain consistency, and adding concurrency protection to prevent overlapping retention cycles. Feedback focuses on improving auditability by including policy identifiers in deletion reasons, making the hardcoded 30-minute execution timeout configurable, and refining error handling to distinguish between transient storage failures and non-transient manifest errors.

Comment thread internal/api/retention.go Outdated
Comment thread internal/scheduler/retention_scheduler.go
Comment thread internal/api/retention.go
…ge errors

- Pass reason="retention:<policy_id>" to deleteOldFiles so manifest ops carry
  the policy ID for auditability in the Raft log and delete-worker traces
- Downgrade storage.Delete failures from Error to Warn — a network blip or
  already-deleted file is transient; the cycle should continue. Manifest
  failures remain Error + abort since Raft quorum loss is non-transient.
xe-nvdk (Member Author) commented Apr 20, 2026

Two more fixes from the latest review:

  1. Policy ID in manifest reason: deleteOldFiles now takes a reason string parameter; callers pass "retention:<policy_id>" (e.g. "retention:42"). This propagates to DeleteFilePayload.Reason in every Raft manifest op, making delete-worker traces and Raft log entries auditable by policy.

  2. Storage error severity: storage.Delete failures downgraded from Error to Warn. A network blip or already-deleted file is transient; the cycle continues to the next file. Manifest failures remain Error + abort since Raft quorum loss is non-transient.

@gemini-code-assist please do a final pass.
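
Sketch of both changes (the "retention:<policy_id>" format and the Warn-vs-Error split come from the change itself; helper names and the Storage interface are illustrative):

```go
// Audit reason for manifest ops plus the storage-vs-manifest error split.
package api

import (
	"context"
	"fmt"
	"log"
)

type Storage interface {
	Delete(ctx context.Context, path string) error
}

// retentionReason is carried in every Raft manifest op (DeleteFilePayload.Reason),
// making delete-worker traces and Raft log entries auditable by policy.
func retentionReason(policyID int64) string {
	return fmt.Sprintf("retention:%d", policyID) // e.g. "retention:42"
}

func deleteFromStorage(ctx context.Context, store Storage, paths []string) {
	for _, p := range paths {
		if err := store.Delete(ctx, p); err != nil {
			// Transient (network blip, already-deleted file): warn and continue.
			// Manifest failures stay at Error and abort the whole cycle.
			log.Printf("WARN: storage delete failed for %s: %v", p, err)
		}
	}
}
```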

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements cluster-safe retention policy execution, ensuring that retention tasks run exclusively on the primary writer node and that file deletions are synchronized across the cluster via the Raft manifest. It also introduces concurrency protection in the scheduler to prevent overlapping runs and adds SQL escaping to protect against injection in DuckDB queries. Review feedback identifies an issue where cumulative deletion counters are not updated when a measurement fails partially, which results in inaccurate execution metrics being recorded.

Comment thread internal/api/retention.go
Comment thread internal/api/retention.go
…ent error

Accumulate deleted/filesDeleted before checking the error so that
recordExecutionComplete reports all completed work accurately when a mid-cycle
abort occurs. deleteOldFiles returns its partial progress even on failure, so
the prior behaviour was discarding counts from the failed measurement chunk.
xe-nvdk (Member Author) commented Apr 20, 2026

Fixed the partial-progress counter issue.

totalDeleted and totalFilesDeleted are now accumulated before the error check in both ExecutePolicy and handleExecute. deleteOldFiles returns its partial progress even on abort (rows/files deleted in completed chunks before the manifest failure), so the execution record now accurately reflects all completed work rather than dropping counts from the failed measurement.

@gemini-code-assist please do a final pass.
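
Sketch of the accumulation order (signatures are illustrative; the point is that counters are updated before the error check, since deleteOldFiles reports what it completed even when it aborts):

```go
// Accumulate partial progress before deciding whether to abort.
package api

import "context"

type deleteResult struct {
	RowsDeleted  int64
	FilesDeleted int64
}

func runPolicies(
	ctx context.Context,
	measurements []string,
	deleteOldFiles func(ctx context.Context, measurement string) (deleteResult, error),
) (totalRows, totalFiles int64, err error) {
	for _, m := range measurements {
		res, derr := deleteOldFiles(ctx, m)
		// Count completed work first, then decide whether to abort, so the
		// execution record includes chunks that finished before the failure.
		totalRows += res.RowsDeleted
		totalFiles += res.FilesDeleted
		if derr != nil {
			return totalRows, totalFiles, derr
		}
	}
	return totalRows, totalFiles, nil
}
```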

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces cluster-safe retention policy execution for enterprise deployments. Key changes include ensuring that retention policies run exclusively on the primary writer node via a new cluster gate, propagating file deletions through the Raft manifest for cluster-wide consistency, and implementing concurrency control in the scheduler to prevent overlapping execution cycles. Additionally, the PR adds license validation for retention tasks and hardens DuckDB queries against SQL injection by escaping file paths. There are no review comments to address, and I have no further feedback to provide.

xe-nvdk merged commit 347b965 into main on Apr 20, 2026
6 checks passed