Skip to content

fix(l1): juno shutdown doesn't hang#3708

Merged
rodrodros merged 3 commits into
mainfrom
rdr/remove-close-delay
Jun 8, 2026
Merged

fix(l1): juno shutdown doesn't hang#3708
rodrodros merged 3 commits into
mainfrom
rdr/remove-close-delay

Conversation

@rodrodros
Copy link
Copy Markdown
Contributor

Fix #3684

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 8, 2026

Claude finished @rodrodros's task in 2m 12s —— View job


Review: fix(l1): juno shutdown doesn't hang


Summary

The fix is correct and well-tested. The root cause was that subscribeToUpdates and finalisedHeight retried with time.Sleep(resubscribeDelay), which is not interruptible — on shutdown the loop could block for the full delay (default 10s) before noticing ctx.Done(). Replacing time.Sleep with a time.Timer selected against ctx.Done() makes cancellation prompt. I confirmed there are no remaining time.Sleep calls in the l1 production code.

A few things I checked and that look right:

  • Timer usage is correct — time.NewTimer(0) fires the first attempt immediately, and timer.Reset is only called after the timer has fired (received from timer.C), so no drain is needed.
  • nil/false sentinels are unambiguoussubscribeToUpdates returns nil and finalisedHeight returns found=false only on ctx.Done() (the per-call timeouts are separate child contexts), so callers correctly treat them as a clean shutdown rather than masking a real error.
  • Behavior change in Run — it now returns nil instead of "context canceled before resubscribe was successful" on shutdown. This is intentional and the TestFailToCreateSubscription assertion was updated to match. Genuine (non-cancel) subscribe errors were already retried indefinitely before, so no real error is being swallowed.
  • The synctest tests are a clean way to assert this without flaky sleeps — a 1h resubscribeDelay in virtual time, synctest.Wait() to reach a durable block, then cancel() and assert virtual time stayed ~0. Requires Go 1.25+ synctest.Test; repo is on go 1.26.0, so fine.

Nits (non-blocking)

  • l1/contract/starknet.go:44 — unfinished/dangling comment (// I am pretty sure this shouldn't be \big.Int` and more closesly`) with a typo; looks accidentally committed. (inline)
  • l1/l1.go:117 — typo stablishedestablished. (inline)

Pre-existing (out of scope, FYI)

  • l1/l1.go:243defer updateSub.Unsubscribe() inside the resubscribe loop (already //nolint:gocritic). The old sub is explicitly Unsubscribe()'d at line 237, so the deferred calls are mostly redundant, but the defer chain still grows by one per resubscribe for the lifetime of watchL1StateUpdates. On a long-running node with a flaky L1 endpoint this is unbounded (if small) growth. Not introduced by this PR, but worth a follow-up since the lines are nearby.

No blockers — this is good to merge once the two comment nits are addressed.
· rdr/remove-close-delay

Comment thread l1/contract/starknet.go Outdated
Comment thread l1/l1.go
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 8, 2026

Claude finished @rodrodros's task in 1m 16s —— View job


Review: fix(l1): juno shutdown doesn't hang


This PR is clean and good to merge. The latest commits also resolved both nits from the prior review.

What I checked

Fix correctness — The root cause (uninterruptible time.Sleep(resubscribeDelay) in the subscribeToUpdates and finalisedHeight retry loops) is fixed by selecting a time.Timer against ctx.Done(). No time.Sleep remains in l1 production code. Cancellation is now prompt.

Timer usagetime.NewTimer(0) fires the first attempt immediately; timer.Reset(...) is only called after the timer has fired (received via <-timer.C), so no channel drain is needed. defer timer.Stop() is in place. Correct.

Sentinel semanticssubscribeToUpdates returns nil and finalisedHeight returns found=false only on ctx.Done(). The per-call deadlines use separate child contexts (callCtx), so a per-attempt timeout surfaces as an err and retries rather than being mistaken for shutdown. Callers (watchL1StateUpdates:215/240, setL1Head:400, catchUpL1HeadUpdates:369) all treat the sentinel as a clean shutdown and return nil — no genuine error is swallowed, since non-cancel errors were already retried indefinitely.

Tests — The two synctest tests are a clean, non-flaky way to assert prompt cancellation: a 1h resubscribeDelay in virtual time, synctest.Wait() to reach a durable block, cancel(), then assert virtual time stayed well under a minute. Requires Go 1.25+ synctest.Test; go.mod is on go 1.26.0, so fine. TestFailToCreateSubscription was correctly updated to expect nil (cancel = clean shutdown).

Prior nits — resolved

  • l1/contract/starknet.go:44 — dangling comment is now a proper // todo(rdr): ....
  • l1/l1.gostablishedestablished typo fixed.

Pre-existing (out of scope, FYI)

  • l1/l1.go:243defer updateSub.Unsubscribe() inside the resubscribe loop (already //nolint:gocritic). The defer chain grows by one per resubscribe for the lifetime of watchL1StateUpdates; bounded-but-unbounded growth on a long-running node with a flaky L1 endpoint. Not introduced by this PR — worth a follow-up.

No blockers.
· rdr/remove-close-delay

@rodrodros rodrodros force-pushed the rdr/remove-close-delay branch from 0cdef78 to bc5c0fa Compare June 8, 2026 10:51
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 8, 2026

Claude finished @rodrodros's task in 2m 9s —— View job


Review: fix(l1): juno shutdown doesn't hang


This PR is clean — good to merge. go build ./l1/... passes locally, and the two nits from the earlier reviews (the dangling starknet.go comment and the stablished typo) are resolved in the latest commits.

What I verified

Root cause & fix — The hang came from the uninterruptible time.Sleep(c.resubscribeDelay) in the subscribeToUpdates and finalisedHeight retry loops: on shutdown the loop could block for the full delay before checking ctx.Done(). Replacing it with a time.Timer selected against ctx.Done() makes cancellation prompt. No time.Sleep remains in l1 production code (grep confirms — only the test assertion strings mention it).

Timer correctnesstime.NewTimer(0) fires the first attempt immediately; timer.Reset(...) is only called after the timer has fired (received via <-timer.C), so no channel drain is needed, and defer timer.Stop() is in place. Correct in both loops.

Sentinel semanticssubscribeToUpdates returns nil and finalisedHeight returns found=false only on ctx.Done(). Per-attempt deadlines use separate child contexts (callCtx/finalisedHeightTimeout), so a per-call timeout surfaces as an err and retries rather than being mistaken for shutdown. All callers (watchL1StateUpdates:215/240, setL1Head:400) treat the sentinel as clean shutdown and return nil — and since non-cancel errors were already retried indefinitely, no genuine error is swallowed.

Bonus correctness improvementsetL1Head now early-returns on !found instead of proceeding with finalisedHeight == 0. The old path would iterate nonFinalisedLogs and delete any block-0 entry during a shutdown; the early return avoids mutating state on cancellation. Nice.

Tests — The two synctest tests (l1_pkg_test.go:790/850) are a clean, non-flaky way to assert prompt cancellation: a 1h resubscribeDelay in virtual time, synctest.Wait() to reach a durable block, cancel(), then assert virtual time stayed well under a minute. Requires Go 1.25+ synctest.Test; go.mod is on go 1.26.0, so fine. TestFailToCreateSubscription was correctly updated to expect nil (cancel = clean shutdown).

Pre-existing (out of scope, FYI)

  • l1/l1.go:243defer updateSub.Unsubscribe() inside the resubscribe loop (already //nolint:gocritic). The old sub is explicitly Unsubscribe()'d at line 237, so the deferred calls are mostly redundant, but the defer chain grows by one per resubscribe for the lifetime of watchL1StateUpdates — bounded-but-unbounded growth on a long-running node with a flaky L1 endpoint. Not introduced by this PR; worth a follow-up since the lines are right here.

No blockers.
· rdr/remove-close-delay

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 67.50000% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.63%. Comparing base (a79a303) to head (d203445).

Files with missing lines Patch % Lines
l1/l1.go 76.47% 7 Missing and 1 partial ⚠️
l1/contract/starknet.go 16.66% 5 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3708   +/-   ##
=======================================
  Coverage   76.62%   76.63%           
=======================================
  Files         409      409           
  Lines       37030    37050   +20     
=======================================
+ Hits        28376    28392   +16     
- Misses       6666     6672    +6     
+ Partials     1988     1986    -2     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@RafaelGranza RafaelGranza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicer than the time.After version I suggested earlier.

Comment thread l1/contract/starknet.go
Comment thread l1/l1.go
rodrodros added 3 commits June 8, 2026 10:33
Happened specifically when trying to resubscriby
or looking for the finalized height. Now the node promptly exits
@RafaelGranza RafaelGranza force-pushed the rdr/remove-close-delay branch from bc5c0fa to d203445 Compare June 8, 2026 13:33
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 8, 2026

Claude finished @RafaelGranza's task in 57s —— View job


Review: fix(l1): juno shutdown doesn't hang


Clean — no blockers. go build ./l1/... passes, and the nits from earlier reviews (the dangling starknet.go comment and the stablished typo) are resolved.

What I verified

Root cause & fix — The hang came from the uninterruptible time.Sleep(c.resubscribeDelay) in the subscribeToUpdates and finalisedHeight retry loops: on shutdown the loop could block for the full delay before checking ctx.Done(). Replacing it with a time.Timer selected against ctx.Done() makes cancellation prompt.

Timer correctnesstime.NewTimer(0) fires the first attempt immediately; timer.Reset(...) is only called after the timer fires (received via <-timer.C), so no channel drain is needed and defer timer.Stop() is in place. Correct in both loops (l1.go:378, l1.go:394).

Sentinel semanticssubscribeToUpdates returns nil and finalisedHeight returns found=false only on ctx.Done(). Per-attempt deadlines use a separate child context (callCtx), so a per-call timeout surfaces as an err and retries rather than being mistaken for shutdown. All callers treat the sentinel as a clean shutdown and return nil; since non-cancel errors were already retried indefinitely, no genuine error is swallowed.

Bonus correctnesssetL1Head:401 now early-returns on !found instead of proceeding with finalisedHeight == 0, avoiding state mutation (the block-0 delete path) on cancellation.

Tests — The two synctest tests are a clean, non-flaky way to assert prompt cancellation: a 1h resubscribeDelay in virtual time, synctest.Wait() to reach a durable block, cancel(), then assert virtual time stayed under a minute. TestFailToCreateSubscription was correctly updated to expect nil (cancel = clean shutdown). Requires Go 1.25+ synctest.Test; go.mod is on go 1.26.0, so fine.

Pre-existing (out of scope, FYI)

  • l1/l1.go:243defer updateSub.Unsubscribe() inside the resubscribe loop (already //nolint:gocritic). The old sub is explicitly Unsubscribe()'d first, so the deferred calls are mostly redundant, but the defer chain grows by one per resubscribe for the lifetime of watchL1StateUpdates — bounded-but-unbounded growth on a long-running node with a flaky L1 endpoint. Not introduced by this PR; worth a follow-up.

Already approved by @RafaelGranza — good to merge.
· rdr/remove-close-delay

@rodrodros rodrodros merged commit 4977012 into main Jun 8, 2026
17 of 18 checks passed
@rodrodros rodrodros deleted the rdr/remove-close-delay branch June 8, 2026 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid waiting time on context cancellation

2 participants