Skip to content

fix(eventbus): timer-drain hang + silent-drop observability (#112)#116

Merged
intel352 merged 2 commits into
mainfrom
fix/eventbus-112-drain-observability-2026-05-28T2027
May 29, 2026
Merged

fix(eventbus): timer-drain hang + silent-drop observability (#112)#116
intel352 merged 2 commits into
mainfrom
fix/eventbus-112-drain-observability-2026-05-28T2027

Conversation

@intel352
Copy link
Copy Markdown
Contributor

Fixes #112 — one real bug plus two observability gaps in modules/eventbus.

P1 (bug) — MemoryEventBus.Publish hang in deliveryMode: "timeout"

The legacy if !deadline.Stop() { <-deadline.C } drain blocks the publisher forever once the timer fires (on Go 1.23+ timer channels are unbuffered, so the unconditional re-receive never returns). Replaced with the race-free non-blocking drain, covering both the timer-fired and ctx-cancelled exit branches. Kept inline (not defer) because the block is inside a per-subscriber for loop.

P2 (observability) — CustomMemoryEventBus dropped silently

Added deliveredCount/droppedCount + Stats() mirroring MemoryEventBus. Also changed EngineRouter.CollectStats/CollectPerEngineStats to dispatch over a new statsProvider interface instead of type-asserting the concrete *MemoryEventBus, so the custom engine (and any future stats-bearing engine) participates in module.Stats()/PerEngineStats(). DurableMemoryEventBus.Stats() has a single-value return and intentionally does not satisfy the interface (it never drops).

P3 (observability) — Unsubscribe/Stop abandoned buffered events uncounted

Both engines now count events still buffered in a subscriber channel as dropped on handler-goroutine exit (drainSubscription, run via defer so no exit path skips it), plus the dequeued-but-cancelled event. MemoryEventBus.Stop additionally drains the worker pool after wg.Wait() (race-free: all senders/receivers are wg-tracked and have exited) so abandoned async tasks are counted — closing the async conservation gap. CustomMemoryEventBus gained the WaitGroup + finished-channel shutdown synchronization the standard engine already had, so Stop waits for handlers to drain and Stats() is final on return.

Contract: at-most-once delivery across teardown — delivered + dropped == enqueued once publishers quiesce. Corrected the false "processes all in-flight events" Stop doc comments in eventbus.go and module.go. Decision recorded in decisions/0001-eventbus-at-most-once-teardown.md.

Process

  • Design + adversarial-design-review (2 Critical + 4 Important found and resolved across rev 1→2): docs/plans/2026-05-28-eventbus-112-fix.md. The async-workerPool conservation gap (C1) and the dequeued-cancelled accounting hole (C2) were both caught at design time.
  • Adversarial code review of the implementation: PASS, zero Critical.
  • TDD. New tests: issue112_memory_test.go, issue112_custom_test.go (timer-hang guard, sync + async conservation across Unsubscribe and Stop, custom Stats, router statsProvider wiring).

Regression-invariant proof (timer hang, P1)

With fix reverted:
$ go test -race -run TestIssue112_TimeoutModePublishDoesNotHang
  --- FAIL: TestIssue112_TimeoutModePublishDoesNotHang (3.00s)
      Publish hung in timeout delivery mode (issue #112 P1): timer-drain deadlock

With fix applied:
$ go test -race -run TestIssue112_TimeoutModePublishDoesNotHang
  --- PASS: TestIssue112_TimeoutModePublishDoesNotHang (0.16s)

The conservation tests showed the same RED→GREEN: pre-fix delivered+dropped was 1/6 (Unsubscribe) and 2/30 (async Stop — 28 events lost in the worker pool); post-fix both reconcile exactly.

Verification

  • cd modules/eventbus && go test -race ./... → ok (25.9s)
  • gofmt clean, go vet clean, golangci-lint run → 0 issues

Closes #112.

intel352 and others added 2 commits May 28, 2026 21:05
Design+plan for issue #112. Passed adversarial-design-review (rev 2):
2 Critical + 4 Important found in rev 1, all resolved in rev 2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fixes three issues in modules/eventbus surfaced in #112:

P1 (bug): MemoryEventBus.Publish hung forever in deliveryMode "timeout".
The legacy `if !deadline.Stop() { <-deadline.C }` drain blocks on an
already-drained timer channel (on Go 1.23+ timer channels are unbuffered).
Replaced with a non-blocking, version-safe drain covering both the
timer-fired and ctx-cancelled exit branches.

P2 (observability): CustomMemoryEventBus dropped events silently. Added
deliveredCount/droppedCount + Stats() mirroring MemoryEventBus, and changed
EngineRouter.CollectStats/CollectPerEngineStats to dispatch over a new
statsProvider interface so the custom engine participates in module-level
stats (previously they type-asserted the concrete *MemoryEventBus and
ignored every other engine).

P3 (observability): Unsubscribe/Stop abandoned buffered events without
counting them. Both engines now count events still buffered in a subscriber
channel as dropped on handler-goroutine exit (drainSubscription, run via
defer so no exit path skips it), plus the dequeued-but-cancelled event.
MemoryEventBus also drains the worker pool after wg.Wait in Stop so
abandoned async tasks are counted (closes the async conservation gap).
CustomMemoryEventBus gained the WaitGroup + finished-channel shutdown
synchronization the standard engine already had, so Stop waits for handler
goroutines to drain and Stats() is final on return.

Contract: at-most-once delivery across teardown (delivered + dropped ==
enqueued once publishers quiesce). Corrected the false "processes all
in-flight events" Stop doc comments in eventbus.go and module.go.
Decision recorded in decisions/0001-eventbus-at-most-once-teardown.md.

Design + adversarial review: docs/plans/2026-05-28-eventbus-112-fix.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

📋 API Contract Changes Summary

No breaking changes detected - only additions and non-breaking modifications

Changed Components:

Core Framework

Contract diff saved to artifacts/diffs/core.json

Module: auth

Contract diff saved to artifacts/diffs/auth.json

Module: cache

Contract diff saved to artifacts/diffs/cache.json

Module: database

Contract diff saved to artifacts/diffs/database.json

Module: eventbus

Contract diff saved to artifacts/diffs/eventbus.json

Module: jsonschema

Contract diff saved to artifacts/diffs/jsonschema.json

Module: letsencrypt

Contract diff saved to artifacts/diffs/letsencrypt.json

Module: reverseproxy

Contract diff saved to artifacts/diffs/reverseproxy.json

Artifacts

📁 Full contract diffs and JSON artifacts are available in the workflow artifacts.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@intel352 intel352 merged commit ca4bf71 into main May 29, 2026
25 checks passed
@intel352 intel352 deleted the fix/eventbus-112-drain-observability-2026-05-28T2027 branch May 29, 2026 01:11
intel352 added a commit that referenced this pull request May 29, 2026
…lity (#117)

Post-merge retrospective for PR #116 (squash ca4bf71). Covers which
gates worked (adversarial review caught C1+C2+I1+I3 pre-code; TDD
RED→GREEN; CI race+lint+BDD all green) and carry-forward items
(DurableMemoryEventBus stats arity, multi-engine config threading,
residual publish-after-drain race).

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

eventbus: timer-drain hang in MemoryEventBus timeout-mode + silent drop observability gaps

1 participant