fix(gc): add random delay to avoid a cluster of nodes run GC and reboot RPC services at the same time#6639

Merged
hanabi1224 merged 3 commits into main from
hm/gc-random-delay
Feb 23, 2026

Conversation

@hanabi1224
Contributor

@hanabi1224 hanabi1224 commented Feb 23, 2026

Summary of changes

Changes introduced in this pull request:

Reference issue to close (if applicable)

Closes #6594

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Outside contributions

  • I have read and agree to the CONTRIBUTING document.
  • I have read and agree to the AI Policy document. I understand that failure to comply with the guidelines will lead to rejection of the pull request.

Summary by CodeRabbit

  • Chores
    • Added randomized jitter (0–30 epochs) to garbage-collection scheduling to better spread work across cycles.
    • Scheduler now applies a per-cycle random delay when evaluating GC timing to reduce simultaneous work spikes.
    • Log messages now include the randomized delay when GC is scheduled or skipped for easier diagnostics.
  • Documentation
    • Clarified default GC interval (20160 epochs / 7 days) and noted the added per-cycle random delay.
    • Documented that the GC scheduler is disabled with --no-gc.
  • Changelog
    • Added release note entry describing the random GC delay and rationale.
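To put the numbers in the summary above into wall-clock terms, here is a minimal sketch. It assumes a 30-second epoch, which is what "20160 epochs / 7 days" implies; the constant and function names are hypothetical and not taken from the Forest codebase.

```rust
use std::time::Duration;

/// Assumed epoch length in seconds (an assumption of this sketch, inferred
/// from "20160 epochs / 7 days" in the summary).
const EPOCH_DURATION_SECONDS: u64 = 30;

/// Convert a number of epochs to a wall-clock duration.
fn epochs_to_duration(epochs: u64) -> Duration {
    Duration::from_secs(epochs * EPOCH_DURATION_SECONDS)
}
```

Under that assumption the default interval of 20160 epochs is 7 days, and the maximum jitter of 30 epochs shifts a node's GC by at most 15 minutes.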

@hanabi1224 hanabi1224 marked this pull request as ready for review February 23, 2026 08:26
@hanabi1224 hanabi1224 requested a review from a team as a code owner February 23, 2026 08:26
@hanabi1224 hanabi1224 requested review from LesnyRumcajs and sudo-shashank and removed request for a team February 23, 2026 08:26
@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a small randomized delay to the Snapshot GC scheduling check so each GC cycle samples a jitter and requires an extra epoch gap before running snapshot GC. Changes touch the scheduler logic, user docs, and changelog; references issue #6594.
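The per-cycle sampling described above can be sketched roughly as follows. This is not the code from the PR: the PR samples from `forest_rng()` via `rand::Rng`, whereas this sketch substitutes a hypothetical xorshift64 generator so it runs with the standard library alone. All function names here are illustrative assumptions.

```rust
/// Largest delay (inclusive) that may be sampled for a given interval,
/// following the `0..=30` epoch bound capped at one fifth of the interval.
/// The `.max(0)` clamp is a defensive addition of this sketch.
fn max_random_delay_epochs(snap_gc_interval_epochs: i64) -> i64 {
    30_i64.min(snap_gc_interval_epochs / 5).max(0)
}

/// Hypothetical stand-in for the node's RNG: one xorshift64 step.
/// `state` must be non-zero. The PR itself uses `forest_rng()`.
fn next_u64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Sample a delay in `0..=max`. The slight modulo bias is acceptable
/// for an illustration.
fn sample_delay_epochs(state: &mut u64, snap_gc_interval_epochs: i64) -> i64 {
    let max = max_random_delay_epochs(snap_gc_interval_epochs) as u64;
    (next_u64(state) % (max + 1)) as i64
}
```

Because every node draws its own delay on every cycle, nodes that last ran GC at the same epoch are unlikely to cross the threshold at the same epoch again.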

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| GC Scheduler Jitter: `src/db/gc/snapshot.rs` | Import `rand::Rng`; sample `gc_interval_random_delay_epochs` from `forest_rng()` in each cycle with bounds `0..=30.min(snap_gc_interval_epochs / 5)`; require `head_epoch - car_db_head_epoch >= snap_gc_interval_epochs + gc_interval_random_delay_epochs` to schedule Snap GC; add logging and comments referencing issue #6594. |
| Docs (GC guide): `docs/docs/users/guides/gc.md` | Clarify default GC interval is 20160 epochs (7 days); document that a small random delay is appended to the GC interval each cycle to avoid synchronized GC/reboots; note scheduler disabled when using `--no-gc`. |
| Changelog: `CHANGELOG.md` | Add entry under Forest unreleased describing the randomized GC delay to avoid simultaneous GC across nodes. |
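As a rough illustration of the scheduling condition summarized for `src/db/gc/snapshot.rs`, the check can be written as a pure predicate. The function name `snap_gc_due` is hypothetical; only the variable names and the inequality come from the summary, and the real scheduler code may be structured differently.

```rust
/// Returns true when enough epochs have passed since the CAR DB head
/// to schedule Snap GC, including this cycle's random delay.
fn snap_gc_due(
    head_epoch: i64,
    car_db_head_epoch: i64,
    snap_gc_interval_epochs: i64,
    gc_interval_random_delay_epochs: i64,
) -> bool {
    head_epoch - car_db_head_epoch
        >= snap_gc_interval_epochs + gc_interval_random_delay_epochs
}
```

For example, with the default interval of 20160 epochs and a sampled delay of 12, a node whose CAR DB head is at epoch 0 would not schedule GC at epoch 20171 but would at epoch 20172.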

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Suggested reviewers

  • sudo-shashank
  • LesnyRumcajs
  • akaladarshi
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding a random delay to GC to prevent simultaneous node reboots. |
| Linked Issues check | ✅ Passed | The PR successfully implements the core requirement from issue #6594: introducing a random delay to GC scheduling to avoid cluster-wide synchronized GC and RPC service reboots. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to the linked issue objective: GC scheduling logic modifications, documentation updates, and changelog entry. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@codecov

codecov bot commented Feb 23, 2026

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.36%. Comparing base (cc869f5) to head (6d22792).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/db/gc/snapshot.rs | 0.00% | 5 Missing ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/db/gc/snapshot.rs | 19.02% <0.00%> (-0.17%) ⬇️ |

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc869f5...6d22792. Read the comment docs.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)

211-211: Consider logging the random delay for observability.

The scheduled GC log message includes snap_gc_interval_epochs but not the gc_interval_random_delay_epochs that was applied. Logging this value would help operators understand why different nodes trigger GC at different times.

💡 Proposed enhancement
-                    tracing::info!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC scheduled");
+                    tracing::info!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, %gc_interval_random_delay_epochs, "Snap GC scheduled");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/db/gc/snapshot.rs` at line 211, The log line using tracing::info!
currently emits car_db_head_epoch, head_epoch, network_head_epoch and
snap_gc_interval_epochs but omits the actual random delay applied; update the
tracing::info! call in snapshot.rs (the site that logs "Snap GC scheduled") to
include gc_interval_random_delay_epochs (or whatever local variable holds the
random delay) so operators can see the applied delay value; ensure the log
fields include %gc_interval_random_delay_epochs alongside the existing %...
fields and keep the message text unchanged.
ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5a5c3b and e278108.

📒 Files selected for processing (1)
  • src/db/gc/snapshot.rs

Member

@LesnyRumcajs LesnyRumcajs left a comment


Should we add a note in the GC docs? And a changelog entry?

@hanabi1224
Contributor Author

Should we add a note in the GC docs? And a changelog entry?

@LesnyRumcajs Fixed.

@hanabi1224 hanabi1224 enabled auto-merge February 23, 2026 10:31
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
docs/docs/users/guides/gc.md (2)

8-8: Fix spacing in the default interval sentence.

Minor formatting: add a space in “20160 epochs(7 days)”.

Proposed tweak
-The default GC interval is 20160 epochs(7 days).
+The default GC interval is 20160 epochs (7 days).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/docs/users/guides/gc.md` at line 8, The sentence "The default GC
interval is 20160 epochs(7 days)." has a missing space before the parenthesis;
update the text to "The default GC interval is 20160 epochs (7 days)." and
ensure the environment variable name FOREST_SNAPSHOT_GC_INTERVAL_EPOCHS remains
unchanged.

9-9: Tighten grammar and keep the docs’ friendly tone.

“avoid a cluster of nodes run GC” is ungrammatical. Consider a smoother, approachable phrasing.

Proposed tweak
-Note that, an extra random small delay is added to the GC interval on every GC cycle to avoid a cluster of nodes run GC and reboot RPC services at the same time.
+Note: an extra small random delay is added to each GC cycle so a whole cluster doesn’t all run GC (and reboot RPC) at once.

Based on learnings: “Maintain a playful, approachable tone in Forest project documentation (Markdown files) rather than strictly formal/corporate language.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/docs/users/guides/gc.md` at line 9, Fix the ungrammatical phrase in the
sentence that reads "avoid a cluster of nodes run GC and reboot RPC services at
the same time" by rewording it to something smooth and friendly (e.g., "avoid
many nodes running GC and rebooting RPC services at the same time" or "prevent a
cluster of nodes from all running GC and restarting RPC services
simultaneously") while preserving the playful, approachable tone used elsewhere
in the Forest docs; update the sentence in the GC guide (the line describing the
extra random small delay added to the GC interval) to one of these clearer
phrasings.
ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c711ec and 6d22792.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • docs/docs/users/guides/gc.md
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md

@hanabi1224 hanabi1224 added this pull request to the merge queue Feb 23, 2026
Merged via the queue into main with commit 80db71d Feb 23, 2026
44 of 46 checks passed
@hanabi1224 hanabi1224 deleted the hm/gc-random-delay branch February 23, 2026 11:26
Development

Successfully merging this pull request may close these issues.

[GC] Add random delay to avoid a cluster of nodes run GC at the same time

2 participants