Skip to content

fix(clients/go): Nack placeholder, eager Connect ping, Consumer panic recovery#115

Merged
NikolayS merged 5 commits intomainfrom
fix/go-driver-bugs
Apr 30, 2026
Merged

fix(clients/go): Nack placeholder, eager Connect ping, Consumer panic recovery#115
NikolayS merged 5 commits intomainfrom
fix/go-driver-bugs

Conversation

@NikolayS
Copy link
Copy Markdown
Owner

Closes #91.

Bugs fixed

Three regressions caught by red tests written in commit 672615b.

# Bug Fix commit
1 Nack() SQL had 11 placeholders ($2..$12) but pgque.message has 10 fields — pgx rejected the query with 42846 6d7b04a
2 Connect() called pgxpool.New (lazy) and returned nil error for unreachable hosts — failure only surfaced on the first query bc4a2b9
3 Handler panic in Consumer.Start killed the poll goroutine and stalled the queue — subsequent messages were never delivered 0102caf

A fourth commit (7f2797d) improves test setup/teardown: lowers per-queue ticker thresholds, opens a fresh pool for cleanup to avoid use-after-close, and purges retry_queue/dead_letter rows before drop_queue.

Full go test output (local PG, all green)

=== RUN   TestSend
--- PASS: TestSend (0.05s)
=== RUN   TestSendAndReceive
--- PASS: TestSendAndReceive (0.03s)
=== RUN   TestNack
--- PASS: TestNack (0.03s)
=== RUN   TestNackPlaceholderCount
--- PASS: TestNackPlaceholderCount (0.02s)
=== RUN   TestConnect_UnreachableHostFailsImmediately
--- PASS: TestConnect_UnreachableHostFailsImmediately (0.00s)
=== RUN   TestConsumer_HandlerPanicRecoversAndContinues
2026/04/30 02:16:44 pgque: handler error for panic.test: handler panic: simulated handler panic
--- PASS: TestConsumer_HandlerPanicRecoversAndContinues (0.07s)
=== RUN   TestConsumerHandlerNacksOnError
2026/04/30 02:16:44 pgque: handler error for fail.test: simulated handler failure
--- PASS: TestConsumerHandlerNacksOnError (0.07s)
=== RUN   TestConsumerHandlerDispatch
2026/04/30 02:16:44 pgque: receive error: pgque: receive: closed pool
--- PASS: TestConsumerHandlerDispatch (0.02s)
PASS
ok  	github.com/NikolayS/pgque/clients/go	0.675s

Anti-leak confirmation

No internal GitLab, prospect, or benchmark work-item references in any changed file.

NikolayS and others added 5 commits April 29, 2026 20:01
Adds three failing tests, one per bug, to anchor red/green TDD:

- TestNackPlaceholderCount: exercises Client.Nack against a live DB.
  Pre-fix pgx errors with 'expected 12 arguments, got 11' because the
  ROW(...) cast targeting pgque.message (10 fields) has a trailing $12
  typo while only 11 args are supplied.

- TestConnect_UnreachableHostFailsImmediately: dials port 1 (reserved,
  refuses connections). Pre-fix Connect returns nil error within
  microseconds because pgxpool.New is lazy. Post-fix it errors on
  pool.Ping within ~3s.

- TestConsumer_HandlerPanicRecoversAndContinues: handler panics on the
  first message, succeeds on the second. Pre-fix the panic kills the
  consumer goroutine; the second message never reaches its handler.
  Post-fix the consumer recovers, nacks the panicking message, and keeps
  polling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pgque.message has 10 fields (msg_id … extra4); the ROW() cast
in Nack() had 11 placeholders ($2..$12) and 11 args — one extra.
Trim to $2..$11 so the placeholder count equals the field count.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pgxpool.New is lazy and never dials; Connect now calls pool.Ping
so unreachable hosts fail at call-site, not on the first query.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wrap each handler dispatch in dispatchWithRecover; a panic is
caught by defer/recover, converted to an error, and the message
is nack'd so it retries. The poll loop continues to the next msg.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Lower ticker thresholds per-queue so tests don't wait 3 s for a
tick. Teardown opens a fresh pool to avoid use-after-close when
defer client.Close() fires before t.Cleanup; purge retry_queue
and dead_letter rows before drop_queue to prevent stale leaks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NikolayS
Copy link
Copy Markdown
Owner Author

REV Review — PR #115

CI: green — claude-review pass, client-smoke pass, test (14/15/16/17/18) all pass, verify pass.

Functional evidence: Reviewed the post-fix files on fix/go-driver-bugs directly. Bug 1: pgque.message in sql/pgque.sql:4559-4570 has exactly 10 fields (msg_id, batch_id, type, payload, retry_count, created_at, extra1..extra4); the new Nack() SQL uses $1 for batch_id and $2..$11 for the 10 ROW fields, and the Go variadic passes 11 args in the matching order — placeholder/arg/field counts now line up. Bug 2: Connect() calls pool.Ping(ctx) before returning success and pool.Close()s on Ping failure, so unreachable hosts surface at call site without leaking a pool. Bug 3: dispatchWithRecover is invoked per-message (not once per goroutine) with a deferred recover() that converts the panic into a non-nil error; the existing error path in Start() then nacks the offending message and the loop continues. The three red tests assert the post-fix behavior (live Nack() succeeds, Connect() errors within 4 s on port 1, second handler call fires AND a retry_queue row appears for the panicking message). Live PG run was attached by the author in the PR body and shows all 8 tests green; I did not re-run locally.

Verdict: READY FOR USER REVIEW

Blocking

  • None.

Non-blocking

  • [LOW] [7/10] commit subject lengths — Several commits exceed CLAUDE.md's "Subject < 50 chars" rule: test(clients/go): red — three regressions for nack/connect/panic bugs (~69), fix(clients/go): consumer recovers from handler panics (~54), test(clients/go): improve setup/teardown robustness (~51). PR title is also ~79 chars. Fix: style nit only — not worth a re-spin; tighten on next round.
  • [LOW] [6/10] clients/go/consumer.go Start docstring — The docstring on Consumer.Start was not updated to mention the new panic-recovery contract, even though Connect's and Nack's docstrings were. Fix: add one line like "Handler panics are recovered and treated as handler errors (the message is nacked, the loop continues)."

Potential

  • [INFO] [5/10] clients/go/pgque_test.go Connect coverage — Only the unreachable-host case is exercised; no test for invalid-DSN syntax. Likely fine since pgx parses DSN inside pgxpool.New, which already returned a non-nil error pre-fix; the new code doesn't change that path. Worth a one-line test if you want belt-and-suspenders. Fix: optional — add TestConnect_InvalidDSN.
  • [INFO] [4/10] clients/go/consumer.go:38-103 Start loop ack semantics — When a handler errors (or now, panics), the message is nacked but the loop still falls through to Ack(batchID) after processing all messages, which acks the batch even though one message was nacked. This is pre-existing behavior (not introduced by this PR — TestConsumerHandlerNacksOnError already covers it), and the docstring says "A batch is acked only if every handled message in it returned nil" which contradicts the implementation. Out of scope for this PR but worth a follow-up to either fix the doc or fix the code. Fix: track in a separate issue.
  • [INFO] [4/10] dispatchWithRecover and runtime.Goexitrecover() does not catch runtime.Goexit(); if a handler ever calls t.FailNow or similar inside a goroutine that hits the dispatch path, the consumer goroutine still exits. Realistic? Almost never — production handlers shouldn't call Goexit. Documenting the contract ("panics are recovered; Goexit is not") would close the loop. Fix: optional doc-only.

Summary

Area Findings Potential Filtered
Security 0 0 0
Bugs 0 1 0
Tests 0 1 0
Guidelines 1 0 0
Docs 1 1 0

Anti-leak: clean — no GitLab/internal/benchmark-WI references in diff, commits, title, or body.


REV-style review (security, bugs, tests, guidelines, docs). SOC2 items skipped per project policy.

@NikolayS NikolayS merged commit bad299b into main Apr 30, 2026
8 checks passed
@NikolayS NikolayS deleted the fix/go-driver-bugs branch April 30, 2026 09:32
NikolayS added a commit that referenced this pull request Apr 30, 2026
Remove TestConnect_UnreachableHost and
TestConsumer_HandlerPanicKillsLoop; both document pre-#115 behavior
(lazy Connect, panic kills loop) that the bugfix PR corrected and
re-tested with TestConnect_UnreachableHostFailsImmediately and
TestConsumer_HandlerPanicRecoversAndContinues. Keeping them would
cause test failures and misrepresent current semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NikolayS added a commit that referenced this pull request Apr 30, 2026
PR #82 (Python) and #83 (TypeScript) are now merged on main.
Drop --ignore flags for test_consumer.py/test_smoke.py (Python)
and --passWithNoTests (TypeScript); all real tests now run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NikolayS added a commit that referenced this pull request Apr 30, 2026
PR #82 (Python) and #83 (TypeScript) are now merged on main.
Drop --ignore flags for test_consumer.py/test_smoke.py (Python)
and --passWithNoTests (TypeScript); all real tests now run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NikolayS added a commit that referenced this pull request Apr 30, 2026
…edge cases, benchmarks (#81)

* test(clients/go): comprehensive coverage suite

Add 6 new test files covering the Go driver beyond the happy path:

- helpers_test.go (115 LOC) — shared connectOrSkip, setupFreshQueue with
  unique random suffixes per test, retryQueueCount, dlqCount helpers
- integration_test.go (231 LOC) — multi-event batches, max-batch limits,
  empty-queue receive, payload round-trip, default event type, DLQ-at-limit
- errors_test.go (208 LOC) — bad DSN, unreachable host, missing queue,
  missing consumer, after-close panic-safety on Send/Receive/Ack/Nack,
  context cancellation, unmarshalable payload (chan)
- consumer_test.go (239 LOC) — clean Start/Stop, poll-interval respected,
  unregistered-event nack, context propagation to handler, single-ack-
  per-batch, documents current handler-panic behavior (kills goroutine)
- concurrency_test.go (272 LOC) — concurrent Send under -race, full
  send/receive/ack loop, handler-nack-under-load, two consumers same
  queue, double-Start safety
- edge_test.go (262 LOC) — empty payload, 1 MiB payload, unicode (CJK +
  emoji), special-char event types, timestamp sanity, retry_count
  initial state, string + array payload shapes
- bench_test.go (169 LOC) — BenchmarkSend, BenchmarkReceive_Empty,
  BenchmarkSendReceiveAck, BenchmarkConsumer_DispatchThroughput

All integration tests env-gated via PGQUE_TEST_DSN; t.Skip when
unreachable. Each test uses a uniquely-named queue + consumer with
t.Cleanup so parallel runs cannot collide.

Verified: go build ./... clean, go vet ./... clean, gofmt -l clean.
Live integration runs require PGQUE_TEST_DSN; not exercised in this
commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(clients/go): drop stale tests superseded by #115

Remove TestConnect_UnreachableHost and
TestConsumer_HandlerPanicKillsLoop; both document pre-#115 behavior
(lazy Connect, panic kills loop) that the bugfix PR corrected and
re-tested with TestConnect_UnreachableHostFailsImmediately and
TestConsumer_HandlerPanicRecoversAndContinues. Keeping them would
cause test failures and misrepresent current semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(clients/go): strengthen weak assertions

- TestSend_MissingQueue: t.Logf → t.Errorf for queue/not-found check
- TestSend_ContextCancelled: t.Logf → t.Errorf for context.Canceled wrap
- TestRace_HandlerNackUnderLoad: t.Logf → t.Errorf on failed==0
- TestNack_ToDLQAtRetryLimit: remove Ack-after-Nack (mutually exclusive
  on same batch); raise cycles to 5; t.Skipf → t.Fatalf on dlqCount==0
- TestSend_AfterClose / TestReceive_AfterClose: use a second live client
  for cleanup so t.Cleanup does not run on a closed pool

* test(clients/go): rename tests and fix helpers

- TestConsumer_AckOnlyOnceForBatch → TestConsumer_AllMessagesDispatched
  (body counts handler calls, not Ack invocations)
- TestConsumer_PollIntervalRespected → TestConsumer_LivenessUnderEmptyQueue
  (body tests liveness, not measured cadence)
- TestConsumer_StartTwiceFromSameInstance → TestConsumer_DoubleStart_DoesNotPanic
  (scope is limited to no-panic guarantee)
- TestConcurrent_TwoConsumersSameQueue → TestConcurrent_TwoConsumersDistinctNames
  using distinct consumer registrations; assert each sees all messages
- benchSuffix: replace time+name suffix with crypto/rand (collision-safe)
- helpers_test.go: correct stale comment about Connect laziness
- Remove PR-number references from source comments

---------

Co-authored-by: Nik Samokhvalov <nik@niks-mbp.lan>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NikolayS added a commit that referenced this pull request Apr 30, 2026
PR #82 (Python) and #83 (TypeScript) are now merged on main.
Drop --ignore flags for test_consumer.py/test_smoke.py (Python)
and --passWithNoTests (TypeScript); all real tests now run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NikolayS added a commit that referenced this pull request Apr 30, 2026
PR #82 (Python) and #83 (TypeScript) are now merged on main.
Drop --ignore flags for test_consumer.py/test_smoke.py (Python)
and --passWithNoTests (TypeScript); all real tests now run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

W1: Go driver bugfixes — Nack $12 placeholder, eager Connect, Consumer panic recovery

1 participant