fix(indexer): errgroup.WithContext so ETL halt actually exits the process#884
raymondjacobson wants to merge 2 commits
CoreIndexer.Start used a bare errgroup.Group{}, which doesn't share or
cancel any ctx. When etlIndexer.Run() returned the halt-on-block-error
introduced by go-openaudio#323 / api#883:
- the errgroup captured the error
- but aggregatesCalculator.Start kept spinning on an uncancelled ctx
- eg.Wait() blocked forever waiting for it
- main.go's panic(err) never received anything
- the pod stayed 1/1 Running with the ETL goroutine dead
Observed in prod tonight on cd94ede: a 25P02 cascade in the plays-hook
savepoint poisoned the pgx pool, indexBlocks() correctly returned per
#323, but the pod kept running with the parity jobs (which start outside
the errgroup) ticking happily and MAX(blocks.height) advancing via the
still-running Python indexer — making the wedge invisible to health
checks. The whole point of #323's halt-on-error was defeated by this
wrapper.
The fix is one-liner-ish: errgroup.WithContext(ctx) so the first error
cancels gCtx, which aggregatesCalculator and the parity jobs already
honor. eg.Wait then returns, main panics, the supervisor restarts the
pod with a fresh pgx pool, and the ETL self-heals — same retry semantics
#323 promised.
Test plan
- go build ./..., go vet ./..., gofmt clean.
- go test ./indexer/... still passes (10 pubkey + user_events hook tests).
- After deploy, we can synthesize the ETL-halt path and confirm the pod
actually crash-restarts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing for now. The fix in this PR is correct (the bare `errgroup.Group{}` really does block forever when ETL halts), but shipping it on top of the current state would convert today's silent ETL wedge into a continuous pod crashloop during dual-run with Python; see https://github.com/AudiusProject/api/pull/884#issuecomment-… [discussion in PR thread].

The underlying cause is that the on-chain plays bridge from #881 doesn't appear to ON CONFLICT-protect against rows Python has already written, so #883 / go-openaudio#323's halt-on-error fires on essentially every block. That makes #883 incompatible with the dual-run state. We're rolling back #883 first (and reverting upstream go-openaudio#323) to restore stability, then will audit cross-writer collision points (plays bridge first, anywhere else ETL and Python write the same row second), then re-land #883, then re-land this PR. The diagnosis here stays correct; the deploy sequence isn't.
…)" (#885)

## Summary

Reverts #883, pinning go-openaudio back to `v1.3.1-0.20260529221831-4d1c9dfdfb52`.

The halt-on-error behavior from upstream go-openaudio#323 is correct in isolation, but is **incompatible with the current dual-run state**: Python and api-side ETL both write to overlapping tables, and the on-chain plays bridge from #881 doesn't ON CONFLICT-protect against rows Python has already written. So:

- Pre-#883: the failure was silently swallowed by `continue`. ETL was effectively a no-op on essentially every block since #881 deployed, but block_diff stayed green because Python's writes kept `MAX(blocks.height)` moving. Block-level data loss masked by Python carrying the load.
- Post-#883: the same failure crashes the indexing loop. We saw it tonight at `processBlock failed` on block 25415514, reproducibly across pod restarts because Python writes the same plays in the same block before the ETL gets to it. Once #884 (the api-wrapper fix that makes that halt actually exit the process) ships, every pod would crashloop the moment it tries to index any recent block.

So shipping #883 + #884 without first handling the cross-writer collision points would convert today's silent wedge into a continuous outage that takes the parity jobs (`IndexChallengesJob`, `UserListeningHistory`, `HourlyPlayCounts`, etc.) down with the ETL. Strictly worse.

## Plan

1. **This PR**: pin upstream back to the pre-halt version. Today's silent wedge stays in place (bad, but bounded) and the parity jobs keep ticking.
2. Close #884 (already done). The diagnosis there is correct, but it amplifies #883's bad sequencing, so we re-land it after #883 is safe to re-ship.
3. Revert OpenAudio/go-openaudio#323 upstream too, so no future bump trips this accidentally.
4. **Audit + fix the cross-writer collision points in pkg/etl**: start with the plays bridge (#881), apply the same ON CONFLICT pattern #319 used for the `blocks` table. Then sweep anywhere else ETL and Python touch the same row.
5. Re-land go-openaudio#323, then api#883, then api#884 (in that order). At that point the halt-on-error guarantee is honest.

## Bump details (revert direction)

| | from | to |
|---|---|---|
| `github.com/OpenAudio/go-openaudio` | `v1.3.1-0.20260529230137-819100b28c94` | `v1.3.1-0.20260529221831-4d1c9dfdfb52` |
| `github.com/OpenAudio/go-openaudio/pkg/etl` | `v1.3.1-0.20260529230137-819100b28c94` | `v1.3.1-0.20260529221831-4d1c9dfdfb52` |

## Test plan

- [x] `go build ./...` clean.
- [ ] After deploy: confirm new pod boots, no `processBlock failed` halt log on block 25415514 (it'll go back to silent `continue`).
- [ ] Verify parity jobs still tick and block_diff stays at 0 (no functional change vs. pre-#883 prod).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Summary
`CoreIndexer.Start` used a bare `errgroup.Group{}`, which doesn't share or cancel any ctx. When `etlIndexer.Run()` returned the halt-on-block-error introduced by go-openaudio#323 / api#883:
- `aggregatesCalculator.Start` kept spinning on an uncancelled ctx,
- `eg.Wait()` blocked forever waiting for it,
- `main.go`'s `panic(err)` never received anything,
- the pod stayed `1/1 Running` with the ETL goroutine dead.

Observed in prod tonight: a 25P02 cascade in the plays-hook savepoint poisoned the pgx pool, `indexBlocks()` correctly returned per #323, but the pod kept running. The parity jobs (which `startParityJobs` launches outside the errgroup) kept ticking, `MAX(blocks.height)` advanced because the still-running Python indexer was writing them, and `health_check` looked totally green, making the wedge invisible. The whole point of #323's halt-on-error was defeated by this wrapper.

Fix
`errgroup.WithContext(ctx)` so the first error cancels `gCtx`. `aggregatesCalculator.Start` already has a proper `for { select { case <-ctx.Done() ... } }` loop and the parity jobs honor ctx via `ScheduleEvery`, so both exit naturally; `eg.Wait()` returns, `main.go` panics, the pod crash-restarts with a fresh pgx pool, and the ETL self-heals on the same block. Same retry semantics #323 promised.

Caveats (called out in the rewritten comment)
`etl.Indexer.Run()` still uses its own internal `context.Background()`, so when the outer ctx (SIGTERM) is cancelled, `gCtx` cancels and the aggregates / parity jobs exit, but `eg.Wait()` still blocks on ETL. The k8s grace-period SIGKILL handles that today; same tradeoff as before, just made explicit.

Test plan
- `go build ./...`, `go vet ./...`, gofmt clean.
- `go test ./indexer/...` still passes (existing pubkey + user_events hook tests).

🤖 Generated with Claude Code