fix(db): wait for in-flight timer callback in DbMetricsUpdater.Dispose #11270

Merged
asdacap merged 3 commits into master from setup-claude-code
Apr 22, 2026

@asdacap
Contributor

@asdacap asdacap commented Apr 21, 2026

Changes

  • DbMetricsUpdater.Dispose() now uses Timer.Dispose(WaitHandle) + WaitHandle.WaitOne() so it returns only after any running UpdateMetrics callback has completed. _timer is swapped atomically to make disposal idempotent.
  • Removed a stale comment on the ProcessStatisticsString path that claimed stats were not being extracted — the line immediately above already extracts them.
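
For readers who have not used this overload, the disposal pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `MetricsUpdaterSketch`, its `Ticks` counter, and the body of `UpdateMetrics` are hypothetical stand-ins, and only `Timer.Dispose(WaitHandle)`, `WaitHandle.WaitOne()`, and `Interlocked.Exchange` are real .NET APIs.

```csharp
#nullable enable
using System;
using System.Threading;

// Hypothetical stand-in for DbMetricsUpdater; only the disposal pattern matters here.
public sealed class MetricsUpdaterSketch : IDisposable
{
    private Timer? _timer;
    private int _ticks;

    public int Ticks => Volatile.Read(ref _ticks);

    public MetricsUpdaterSketch(TimeSpan period)
    {
        _timer = new Timer(_ => UpdateMetrics(), null, period, period);
    }

    private void UpdateMetrics()
    {
        // Placeholder for the real stat reads against the native database handle.
        Interlocked.Increment(ref _ticks);
    }

    public void Dispose()
    {
        // Atomically take ownership of the timer so a second Dispose() is a no-op.
        Timer? timer = Interlocked.Exchange(ref _timer, null);
        if (timer is null) return;

        using ManualResetEvent callbacksDone = new(false);

        // Dispose(WaitHandle) signals the handle once all queued callbacks have completed.
        // It returns false if the timer was already disposed; nothing will signal then,
        // so the wait must be skipped.
        if (timer.Dispose(callbacksDone))
        {
            callbacksDone.WaitOne();
        }
    }
}
```

After `Dispose()` returns, no callback is running and none will start, so the owner can release whatever the callback was reading.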

Why

DbOnTheRocks.Dispose() (DbOnTheRocks.cs:1473-1492) disposes _metricsUpdaters before calling ReleaseUnmanagedResources(), which destroys the native RocksDb handle. The old parameterless Timer.Dispose() returned immediately without waiting for pending callbacks, so an UpdateMetrics tick that was already running could continue executing and call db.GetProperty(...) / dbOptions.GetStatisticsString() on a freed native handle. The callback's managed try/catch doesn't reliably contain a native-side use-after-free.

Statistics collection is opt-in (EnableDbStatistics + EnableMetricsUpdater, both default false) and the default dump period is 600 s, so the window is narrow — but the race is real and cheap to close.
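
To make the race concrete, here is a small standalone sketch that does not touch RocksDB at all. `DisposeRaceDemo` and the 300 ms `Thread.Sleep` (standing in for a slow native call) are assumptions made for illustration; the point is only that the parameterless `Timer.Dispose()` can return while a callback is still executing, whereas `Dispose(WaitHandle)` plus `WaitOne()` does not.

```csharp
using System;
using System.Threading;

public static class DisposeRaceDemo
{
    // Returns true if the callback was still running when disposal returned.
    public static bool CallbackStillRunningAfterDispose(bool waitForCallbacks)
    {
        using ManualResetEventSlim started = new(false);
        int running = 0;

        // One-shot timer whose callback simulates a slow native call.
        Timer timer = new Timer(_ =>
        {
            Volatile.Write(ref running, 1);
            started.Set();
            Thread.Sleep(300);
            Volatile.Write(ref running, 0);
        }, null, 0, Timeout.Infinite);

        started.Wait(); // the callback is now in flight

        if (waitForCallbacks)
        {
            using ManualResetEvent done = new(false);
            if (timer.Dispose(done)) done.WaitOne();
        }
        else
        {
            timer.Dispose(); // returns without waiting for the running callback
        }

        return Volatile.Read(ref running) == 1;
    }
}
```

In the real code, the point where this method returns corresponds to the point where the native handle would be freed.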

Types of changes

  • Bugfix (a non-breaking change that fixes an issue)

Testing

Requires testing

  • No

Notes on testing

The fix is a use of the standard .NET Timer.Dispose(WaitHandle) guarantee. Existing parsing tests in Nethermind.Db.Test/DbMetricsUpdaterTests.cs (22 tests) pass unchanged. A race-reproducing unit test would need a test double for db and dbOptions that is faster than the timer — deemed not worth the complexity for a framework-level guarantee.

Documentation

Requires documentation update

  • No

Requires explanation in Release Notes

  • No

🤖 Generated with Claude Code

The previous Dispose() called the parameterless Timer.Dispose(), which
does not wait for a callback that is already running. Because
DbOnTheRocks.Dispose() disposes metrics updaters before the underlying
RocksDb handle is released, an in-flight UpdateMetrics could touch
db.GetProperty / dbOptions.GetStatisticsString after the native handle
was freed.

Switch to Timer.Dispose(WaitHandle) + WaitOne so the method returns only
once any pending callback has completed. Also atomically null out _timer
so Dispose is idempotent.

Removes a stale comment claiming stats are not extracted (they are, by
ProcessStatisticsString on the line above).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@claude
Contributor

claude Bot commented Apr 21, 2026

Claude finished @asdacap's task in 2m 21s


Code Review

  • Gather context and read changed files
  • Read relevant rules files
  • Analyze the changes for correctness, security, and performance
  • Post findings

Summary

The fix correctly addresses a real use-after-free race: the old _timer?.Dispose() returned immediately without waiting for any in-flight UpdateMetrics callback, so the callback could continue executing db.GetProperty() / dbOptions.GetStatisticsString() on a freed native RocksDB handle after ReleaseUnmanagedResources() ran. The new implementation using Interlocked.Exchange + Timer.Dispose(WaitHandle) + WaitOne() is the standard .NET idiom and closes the race correctly. The stale comment removal is also accurate — ProcessStatisticsString was already being called on that line.


Findings

| Severity | Count |
|----------|-------|
| Critical | 0     |
| High     | 0     |
| Medium   | 1     |
| Low      | 0     |

Medium — Unbounded WaitOne() on the shutdown path (see inline comment)

waitHandle.WaitOne() at line 179 has no timeout. If UpdateMetrics gets stuck inside a native RocksDB call (db.GetProperty / dbOptions.GetStatisticsString) — for example, blocked on an internal RocksDB mutex during heavy compaction — DbOnTheRocks.Dispose() will block indefinitely, leaving the process hung and requiring a SIGKILL. A bounded wait (e.g., 5 s) with a logged warning would preserve the safety guarantee while keeping shutdown responsive. This only affects users with EnableDbStatistics + EnableMetricsUpdater enabled (both default false), so the impact is narrow but real for those users.
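
A bounded wait of the kind suggested here might look roughly like the sketch below. `BoundedTimerDispose`, `DisposeAndWait`, and the `warn` delegate are hypothetical names, not Nethermind APIs, and the message text is made up. One deliberate assumption: the wait handle is disposed only once it is known that nothing will signal it later, so on timeout it is left to the garbage collector rather than wrapped in `using`.

```csharp
using System;
using System.Threading;

public static class BoundedTimerDispose
{
    // Returns false if a callback was still running when the timeout elapsed.
    public static bool DisposeAndWait(Timer timer, TimeSpan timeout, Action<string> warn)
    {
        ManualResetEvent done = new(false);

        if (!timer.Dispose(done))
        {
            // The timer was already disposed; nothing will ever signal the handle.
            done.Dispose();
            return true;
        }

        if (done.WaitOne(timeout))
        {
            // All callbacks have completed and the handle has been signaled.
            done.Dispose();
            return true;
        }

        // Timed out: the runtime may still signal the handle when the stuck
        // callback finishes, so it is intentionally not disposed here.
        warn($"Timer callback did not finish within {timeout.TotalMilliseconds} ms; continuing shutdown.");
        return false;
    }
}
```

The caller decides what a `false` result means. In this PR's context, shutdown proceeds anyway and the warning is the only trace that the guard was given up.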


What looks good

  • Idempotency via Interlocked.Exchange — double-Dispose() is handled correctly; the null-guard exits immediately on the second call.
  • if (timer.Dispose(waitHandle)) — the false branch is correct: when Dispose cannot register the notify object (e.g., timer already stopped), the WaitHandle is never set by the runtime, so we must skip WaitOne() to avoid blocking forever.
  • using ManualResetEvent — the handle is disposed even when Dispose(waitHandle) returns false; since in that branch the runtime never references the handle again, there's no use-after-dispose risk.
  • Removal of the stale comment — the comment // Currently we don't extract any DB statistics but we can do it here was factually wrong; ProcessStatisticsString was already extracting them on the line immediately above.
  • The fix targets a narrow, opt-in code path (EnableMetricsUpdater + EnableDbStatistics, both off by default) where the race window is small — but the cost of closing it is trivial and the change is correct.

Comment thread src/Nethermind/Nethermind.Db.Rocks/Statistics/DbMetricsUpdater.cs Outdated
Unbounded WaitOne() on the shutdown path could hang forever if
UpdateMetrics is stuck inside a native RocksDB call (e.g. blocked on an
internal RocksDB mutex during compaction), forcing a SIGKILL. Use a 5s
timeout and log a warning if it fires — keeps shutdown responsive at
the cost of giving up the use-after-free guard in the pathological
hang case, which is the correct tradeoff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@asdacap asdacap marked this pull request as ready for review April 21, 2026 04:57
@claude
Contributor

claude Bot commented Apr 21, 2026

Claude finished @asdacap's task in 2m 43s


Code Review

  • Gather context and read changed files
  • Read relevant rules files
  • Analyze changes for correctness, security, and performance
  • Post findings

Summary

The two-commit PR correctly closes a real use-after-free race. The previous Medium finding (unbounded WaitOne()) has been addressed by 0005f65 with a 5 s timeout + warning log. Final state is clean.

Disposal ordering confirmed: DbOnTheRocks.Dispose() (lines 1479–1481) disposes _metricsUpdaters before calling ReleaseUnmanagedResources() which calls _db.Dispose() (the native handle). The fix ensures UpdateMetrics has finished before that free occurs.


Findings

| Severity | Count |
|----------|-------|
| Critical | 0     |
| High     | 0     |
| Medium   | 0     |
| Low      | 1     |

Low — Potential ObjectDisposedException on the timeout path

When WaitOne(5 s) times out, the using block disposes the ManualResetEvent. The .NET timer runtime still holds a reference to that handle and will call handle.Set() when the stuck callback eventually finishes (via SignalNoCallbacksRunning()). Calling Set() on a disposed ManualResetEvent throws ObjectDisposedException; in .NET 6+ that is an unhandled exception on a thread-pool thread, which terminates the process.

Mitigating factors:

  • Both EnableDbStatistics and EnableMetricsUpdater must be true (both default false)
  • The callback must be stuck >5 s (already a broken/hung state)
  • If we reached the timeout the native RocksDB handle gets freed anyway, so the callback was going to cause a use-after-free regardless — a crash here is not meaningfully worse than the alternative
  • UpdateMetrics is fully wrapped in try/catch, so ordinary exceptions from RocksDB won't trigger this path

For a future hardening pass the WaitHandle could be left undisposed in the timeout branch (letting it be GC'd after the timer eventually releases it), but that's a minor improvement on an already-narrow path and is not blocking.
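
The underlying hazard can be shown in isolation. The sketch below is not the timer runtime's code path; it only exercises the documented behavior that `EventWaitHandle.Set()` throws `ObjectDisposedException` once the handle has been closed. `DisposedHandleDemo` is a hypothetical name.

```csharp
using System;
using System.Threading;

public static class DisposedHandleDemo
{
    // Returns true if signaling a disposed event throws ObjectDisposedException.
    public static bool SetThrowsAfterDispose()
    {
        ManualResetEvent handle = new(false);
        handle.Dispose(); // mirrors the using block ending after a timed-out wait

        try
        {
            handle.Set(); // mirrors a late signal arriving after disposal
            return false;
        }
        catch (ObjectDisposedException)
        {
            return true;
        }
    }
}
```

Here the exception is caught by the caller. In the scenario the finding describes, the signal would come from a thread-pool thread with no handler around it, which is why leaving the handle undisposed on timeout is the suggested hardening.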


What looks good

  • Idempotency: Interlocked.Exchange(ref _timer, null) — double-Dispose() exits immediately on the second call
  • false branch: if (timer.Dispose(waitHandle) && ...) — correctly skips WaitOne() when the runtime will never signal the handle (timer already stopped before our call)
  • using ManualResetEvent: handle is properly freed in the normal (non-timeout) path and in the Dispose(waitHandle) == false path
  • Timeout + warn log (0005f65): directly addresses the previous Medium finding; keeps shutdown responsive while still logging a visible warning if the callback hangs
  • Stale comment removal: ProcessStatisticsString was already being called on the line above — the comment was factually wrong
  • Disposal order in DbOnTheRocks.Dispose() (lines 1479–1487): metrics updaters are disposed before ReleaseUnmanagedResources() — the fix is effective in that call sequence

Verdict: mergeable. No Critical, High, or Medium findings remain.

@LukaszRozmej
Member

LukaszRozmej commented Apr 21, 2026

Is a freeze of up to 5 s acceptable on shutdown? That sounds too long.

@asdacap
Contributor Author

asdacap commented Apr 22, 2026

(This is Claude replying.)

The 5 s is a worst-case upper bound, not a typical wait. In practice UpdateMetrics only calls db.GetProperty and dbOptions.GetStatisticsString(), both fast RocksDB stat reads that finish in under a millisecond under normal conditions. The timer fires at most once per cycle, so the actual wait should be ~0 ms in almost every shutdown.

The wait is necessary to avoid a use-after-free: without it, the callback could still be running inside a native RocksDB call when ReleaseUnmanagedResources() frees the handle. The 5 s bound was chosen conservatively for the pathological case where RocksDB is stuck on an internal mutex (e.g., heavy compaction).

If 5 s feels too long, 1 s is a reasonable tightening — it still covers any realistic stat-read latency and only triggers in a clearly broken/hung state where the node is already in trouble. Happy to reduce it if you prefer.

@LukaszRozmej
Member

I would go to 1 s, as things like Docker might just kill the process if shutdown takes too long.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@asdacap asdacap merged commit 658c75a into master Apr 22, 2026
474 of 476 checks passed
@asdacap asdacap deleted the setup-claude-code branch April 22, 2026 11:18
asdacap added a commit that referenced this pull request Apr 25, 2026
#11270)

* fix(db): wait for in-flight timer callback in DbMetricsUpdater.Dispose

* fix(db): bound DbMetricsUpdater dispose wait to 5s with warn log

* fix(db): reduce DbMetricsUpdater dispose timeout to 1s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>