Automatic child lifecycle management in StopWaiter #4536

Merged
pmikolajczyk41 merged 50 commits into master from pmikolajczyk/nit-4684-stop-waiter-api
Mar 27, 2026

Member

@pmikolajczyk41 pmikolajczyk41 commented Mar 20, 2026

StopWaiter API: TrackChild / StartAndTrackChild

Added two methods to StopWaiter for automatic child lifecycle management:

  • TrackChild(child) — registers a child for automatic shutdown in LIFO order
  • StartAndTrackChild(child) — starts a child with the parent's managed context and tracks it

Tracked children are stopped automatically during StopOnly and waited on during StopAndWait. Children are taken atomically, guarded by the ChildrenTaken flag, so no child can slip in between the take and the stop. Calling TrackChild after shutdown stops the child immediately. The new methods are adopted at 15+ call sites, which removes most custom StopAndWait overrides.
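
For readers who have not opened the diff, the mechanism described above can be sketched roughly as follows. This is a simplified stand-in, not the code from this PR: the `Child` interface, the `Parent` type, the `recorder` helper and `runDemo` are hypothetical names, and the choice to call `StopAndWait` on a late-tracked child is an assumption made for the sketch.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Child is a hypothetical minimal lifecycle interface for this sketch.
type Child interface {
	Start(ctx context.Context)
	StopOnly()
	StopAndWait()
}

// Parent is a simplified stand-in for StopWaiter (not the nitro code).
type Parent struct {
	ctx           context.Context
	cancel        context.CancelFunc
	mu            sync.Mutex
	children      []Child
	childrenTaken bool
}

func NewParent() *Parent {
	ctx, cancel := context.WithCancel(context.Background())
	return &Parent{ctx: ctx, cancel: cancel}
}

// TrackChild registers an already-started child for shutdown.
func (p *Parent) TrackChild(c Child) {
	p.mu.Lock()
	if p.childrenTaken {
		p.mu.Unlock()
		c.StopAndWait() // assumption: a late child is stopped and waited on right away
		return
	}
	p.children = append(p.children, c)
	p.mu.Unlock()
}

// StartAndTrackChild starts the child with the parent's context, then tracks it.
func (p *Parent) StartAndTrackChild(c Child) {
	c.Start(p.ctx)
	p.TrackChild(c)
}

// takeChildren sets the flag and reads the list under one lock, so a
// concurrent TrackChild either lands in the list or sees the flag.
func (p *Parent) takeChildren() (children []Child, first bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	first = !p.childrenTaken
	p.childrenTaken = true
	return p.children, first
}

// StopOnly signals shutdown to children in LIFO order; repeated calls are no-ops.
func (p *Parent) StopOnly() {
	children, first := p.takeChildren()
	if !first {
		return
	}
	p.cancel()
	for i := len(children) - 1; i >= 0; i-- {
		children[i].StopOnly()
	}
}

// StopAndWait signals shutdown, then waits on children in LIFO order.
func (p *Parent) StopAndWait() {
	p.StopOnly()
	children, _ := p.takeChildren()
	for i := len(children) - 1; i >= 0; i-- {
		children[i].StopAndWait()
	}
}

// recorder is a test double that logs each lifecycle call.
type recorder struct {
	name string
	log  *[]string
}

func (r *recorder) Start(context.Context) { *r.log = append(*r.log, "start:"+r.name) }
func (r *recorder) StopOnly()             { *r.log = append(*r.log, "stop:"+r.name) }
func (r *recorder) StopAndWait()          { *r.log = append(*r.log, "wait:"+r.name) }

func runDemo() []string {
	var log []string
	p := NewParent()
	p.StartAndTrackChild(&recorder{"a", &log})
	p.StartAndTrackChild(&recorder{"b", &log})
	p.StopAndWait()
	return log
}

func main() {
	fmt.Println(runDemo()) // [start:a start:b stop:b stop:a wait:b wait:a]
}
```

Because the flag is set and the list is read in the same critical section, there is no window in which a child is appended after the take but before the stop.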

Bug fixes discovered during migration

  • AddressFilter.FilterService: child addressChecker was started but never stopped — now tracked via StartAndTrackChild
  • BidValidator: child producer was started but never stopped — now tracked via StartAndTrackChild
  • ExecutionEngine: child transactionFiltererRPCClient was started (externally) but only stopped if non-nil in a manual override — now tracked via TrackChild, which handles nil-safety implicitly
  • ExpressLaneService: expressLaneTracker had unclear ownership — started by ExecutionNode but stopped by ExpressLaneService. Now consistently owned (started and tracked) by ExecutionNode
  • BroadcastClients: primary clients and routers were started with the parent's context but never explicitly tracked for shutdown — now tracked via TrackChild
  • ValidationServer (redis consumer): boldSpawner was started dynamically from a background goroutine but could be missed during shutdown if StopAndWait was called before it was created — now tracked via StartAndTrackChild, and late-tracked children are immediately stopped via the ChildrenTaken safety net
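
The last item (a child created from a background goroutine racing with shutdown) is the case the ChildrenTaken safety net is meant to cover. A minimal sketch of why no child can be missed, using hypothetical `tracker` and `spawner` types rather than the real ValidationServer code:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawner is a hypothetical child that only records whether it was stopped.
type spawner struct{ stopped atomic.Bool }

func (s *spawner) StopAndWait() { s.stopped.Store(true) }

// tracker is a simplified stand-in for the parent's child bookkeeping.
type tracker struct {
	mu       sync.Mutex
	children []*spawner
	taken    bool
}

// TrackChild either records the child or, once shutdown has taken the
// list, stops it on the spot.
func (t *tracker) TrackChild(s *spawner) {
	t.mu.Lock()
	if t.taken {
		t.mu.Unlock()
		s.StopAndWait()
		return
	}
	t.children = append(t.children, s)
	t.mu.Unlock()
}

// StopAndWait takes the list and sets the flag atomically, then stops
// the taken children in LIFO order.
func (t *tracker) StopAndWait() {
	t.mu.Lock()
	t.taken = true
	children := t.children
	t.children = nil
	t.mu.Unlock()
	for i := len(children) - 1; i >= 0; i-- {
		children[i].StopAndWait()
	}
}

// leakedAfterShutdown races n background TrackChild calls against one
// StopAndWait and counts children that were never stopped.
func leakedAfterShutdown(n int) int {
	t := &tracker{}
	spawners := make([]*spawner, n)
	var wg sync.WaitGroup
	for i := range spawners {
		spawners[i] = &spawner{}
		wg.Add(1)
		go func(s *spawner) {
			defer wg.Done()
			t.TrackChild(s)
		}(spawners[i])
	}
	t.StopAndWait() // may run before, between, or after the TrackChild calls
	wg.Wait()
	leaked := 0
	for _, s := range spawners {
		if !s.stopped.Load() {
			leaked++
		}
	}
	return leaked
}

func main() {
	fmt.Println("leaked children:", leakedAfterShutdown(100))
}
```

Whatever the interleaving, each child either lands in the list before the take and is stopped by `StopAndWait`, or observes the flag and stops itself, so the leak count stays at zero.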

@codecov

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 67.50000% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.57%. Comparing base (72adaee) to head (729bb1a).
⚠️ Report is 51 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4536      +/-   ##
==========================================
+ Coverage   33.99%   34.57%   +0.58%     
==========================================
  Files         498      498              
  Lines       58994    58977      -17     
==========================================
+ Hits        20053    20392     +339     
+ Misses      35314    34965     -349     
+ Partials     3627     3620       -7     

@pmikolajczyk41 pmikolajczyk41 changed the title Add TrackChild/StartAndTrackChild to StopWaiter for automatic child lifecycle management Automatic child lifecycle management in StopWaiter Mar 20, 2026
@github-actions
Contributor

github-actions bot commented Mar 20, 2026

❌ 12 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 4634 | 12 | 4622 | 0 |
View the top 3 failed tests by shortest run time
TestPruningDBSizeReduction
Stack Traces | 0.000s run time
=== RUN   TestPruningDBSizeReduction
--- FAIL: TestPruningDBSizeReduction (0.00s)
TestAliasingFlaky
Stack Traces | -0.000s run time
=== RUN   TestAliasingFlaky
=== PAUSE TestAliasingFlaky
=== CONT  TestAliasingFlaky
    common_test.go:768: BuildL1 deployConfig: DeployBold=true, DeployReferenceDAContracts=false
TestBatchPosterL1SurplusMatchesBatchGasFlaky
Stack Traces | 0.530s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2079f52]

goroutine 52 [running]:
testing.tRunner.func1.2({0x37dd320, 0x61ec9b0})
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1872 +0x237
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1875 +0x35b
panic({0x37dd320?, 0x61ec9b0?})
	/opt/hostedtoolcache/go/1.25.8/x64/src/runtime/panic.go:783 +0x132
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).GetBatchCount(0x1ac55900?)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:210 +0x12
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).FindInboxBatchContainingMessage(0x0, 0x7)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:225 +0x2f
github.com/offchainlabs/nitro/system_tests.TestBatchPosterL1SurplusMatchesBatchGasFlaky(0xc0005028c0)
	/home/runner/work/nitro/nitro/system_tests/batch_poster_test.go:838 +0x725
testing.tRunner(0xc0005028c0, 0x41ae150)
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1997 +0x465
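
A note on reading the trace above: `FindInboxBatchContainingMessage` is entered with a receiver of `0x0`, so the fault at `addr=0x0` is consistent with a method being called on a nil `*InboxTracker`. The sketch below reproduces that failure pattern in isolation; the `tracker` type and its field are hypothetical and do not mirror the real `InboxTracker`.

```go
package main

import "fmt"

// tracker is a hypothetical stand-in for a type such as InboxTracker.
type tracker struct{ batchCount uint64 }

// GetBatchCount reads a field through the receiver, so it faults when t is nil.
func (t *tracker) GetBatchCount() uint64 { return t.batchCount }

// callOnNil invokes the method on a nil pointer and reports whether it panicked.
func callOnNil() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	var t *tracker // nil: never initialised
	_ = t.GetBatchCount()
	return false
}

func main() {
	fmt.Println("panicked:", callOnNil()) // panicked: true
}
```

Go allows the method call itself on a nil pointer; the panic only fires at the first dereference inside the method, which is why the innermost frame is the getter rather than the test.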


@pmikolajczyk41 pmikolajczyk41 requested a review from tsahee March 23, 2026 13:24
@pmikolajczyk41 pmikolajczyk41 marked this pull request as ready for review March 23, 2026 13:24
Member

@eljobe eljobe left a comment

Automated PR Review (Claude pr-review-toolkit)

Note: These findings were generated by Claude's pr-review-toolkit agents and have not been checked for accuracy by Pepper. Please evaluate each comment on its merits.

Overview

Excellent infrastructure PR. The TrackChild/StartAndTrackChild mechanism is well-designed with proper mutual exclusion, atomic ChildrenTaken guarding, and thorough test coverage. It directly fixes several real bugs (leaked children in AddressFilter, BidValidator, ExecutionEngine, BroadcastClients, ValidationServer). All 15+ migration sites were verified for correct shutdown ordering.

Strengths

  • Directly fixes real child-leak bugs across multiple components
  • Clean LIFO mechanism with takeChildren() + ChildrenTaken flag
  • Late-tracking safety net (immediate stop after shutdown) handles dynamic startBoldSpawner
  • Excellent test suite covering LIFO order, concurrency, grandchild hierarchy
  • All ValidatorWallet implementations updated with StopOnly()
  • Careful handling of exceptions (BroadcastClients secondary, MultiProtocolStaker wallet)

Summary

| Severity | Count |
|---|---|
| Critical | 1 |
| Important | 5 |
| Suggestions | 5 |
| Test gaps | 3 |

See inline comments for details.

@eljobe eljobe assigned pmikolajczyk41 and unassigned eljobe Mar 26, 2026
@eljobe
Member

eljobe commented Mar 27, 2026

Approved, but it would be good to understand why the default-B tests are failing.

@pmikolajczyk41 pmikolajczyk41 added this pull request to the merge queue Mar 27, 2026
@pmikolajczyk41
Member Author

Approved, but it would be good to understand why the default-B tests are failing.

just flakiness, rerun helped

Merged via the queue into master with commit fd2a5b0 Mar 27, 2026
57 of 59 checks passed
@pmikolajczyk41 pmikolajczyk41 deleted the pmikolajczyk/nit-4684-stop-waiter-api branch March 27, 2026 14:32

2 participants