fix(indexer): errgroup.WithContext so ETL halt actually exits the process by raymondjacobson · Pull Request #884 · AudiusProject/api

raymondjacobson · 2026-05-29T23:56:55Z

Summary

CoreIndexer.Start used a bare errgroup.Group{}, which doesn't share or cancel any ctx. When etlIndexer.Run() returned the halt-on-block-error introduced by go-openaudio#323 / api#883:

the errgroup captured the error,
but aggregatesCalculator.Start kept spinning on an uncancelled ctx,
eg.Wait() blocked forever waiting for it,
main.go's panic(err) never received anything,
and the pod stayed 1/1 Running with the ETL goroutine dead.

Observed in prod tonight: a 25P02 cascade in the plays-hook savepoint poisoned the pgx pool, indexBlocks() correctly returned per #323, but the pod kept running. The parity jobs (which startParityJobs launches outside the errgroup) kept ticking, MAX(blocks.height) advanced because the still-running Python indexer was writing them, and health_check looked totally green — making the wedge invisible. The whole point of #323's halt-on-error was defeated by this wrapper.

Fix

errgroup.WithContext(ctx) so the first error cancels gCtx. aggregatesCalculator.Start already has a proper for { select { case <-ctx.Done() ... } } loop and the parity jobs honor ctx via ScheduleEvery, so both exit naturally; eg.Wait() returns, main.go panics, the pod crash-restarts with a fresh pgx pool, and the ETL self-heals on the same block. Same retry semantics #323 promised.

- eg := errgroup.Group{}
+ eg, gCtx := errgroup.WithContext(ctx)
  eg.Go(func() error {
-   return ci.aggregatesCalculator.Start(ctx)
+   return ci.aggregatesCalculator.Start(gCtx)
  })
  eg.Go(func() error {
    ci.logger.Info("Starting ETL indexer")
    return ci.etlIndexer.Run()
  })
- ci.startParityJobs(ctx)
+ ci.startParityJobs(gCtx)
  return eg.Wait()

Caveats (called out in the rewritten comment)

etl.Indexer.Run() still uses its own internal context.Background(), so when the outer ctx (SIGTERM) is cancelled, gCtx cancels and the aggregates / parity jobs exit, but eg.Wait() still blocks on ETL. The k8s grace-period SIGKILL handles that today — same tradeoff as before, just made explicit.

Test plan

go build ./..., go vet ./..., gofmt clean.
go test ./indexer/... still passes (existing pubkey + user_events hook tests).
After deploy: synthesize an ETL-halt path (or wait for the next transient) and confirm the pod actually crash-restarts instead of going zombie.

🤖 Generated with Claude Code

…cess CoreIndexer.Start used a bare errgroup.Group{}, which doesn't share or cancel any ctx. When etlIndexer.Run() returned the halt-on-block-error introduced by go-openaudio#323 / api#883: - the errgroup captured the error - but aggregatesCalculator.Start kept spinning on an uncancelled ctx - eg.Wait() blocked forever waiting for it - main.go's panic(err) never received anything - the pod stayed 1/1 Running with the ETL goroutine dead Observed in prod tonight on cd94ede: a 25P02 cascade in the plays-hook savepoint poisoned the pgx pool, indexBlocks() correctly returned per #323, but the pod kept running with the parity jobs (which start outside the errgroup) ticking happily and MAX(blocks.height) advancing via the still-running Python indexer — making the wedge invisible to health checks. The whole point of #323's halt-on-error was defeated by this wrapper. The fix is one-liner-ish: errgroup.WithContext(ctx) so the first error cancels gCtx, which aggregatesCalculator and the parity jobs already honor. eg.Wait then returns, main panics, the supervisor restarts the pod with a fresh pgx pool, and the ETL self-heals — same retry semantics #323 promised. Test plan - go build ./..., go vet ./..., gofmt clean. - go test ./indexer/... still passes (10 pubkey + user_events hook tests). - After deploy, can synthesize the ETL-halt path and confirm the pod actually crash-restarts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

raymondjacobson · 2026-05-30T00:03:06Z

Closing for now. The fix in this PR is correct (the bare `errgroup.Group{}` really does block forever when ETL halts), but shipping it on top of the current state would convert today's silent ETL wedge into a continuous pod crashloop during dual-run with Python — see https://github.com/AudiusProject/api/pull/884#issuecomment-… [discussion in PR thread].

The underlying cause is that the on-chain plays bridge from #881 doesn't appear to ON CONFLICT-protect against rows Python has already written, so #883 / go-openaudio#323's halt-on-error fires on essentially every block. That makes #883 incompatible with the dual-run state.

We're rolling back #883 first (and reverting upstream go-openaudio#323) to restore stability, then will audit cross-writer collision points (plays bridge first, anywhere else ETL and Python write the same row second), then re-land #883, then re-land this PR. The diagnosis here stays correct; the deploy sequence isn't.

…)" (#885) ## Summary Reverts #883, pinning go-openaudio back to `v1.3.1-0.20260529221831-4d1c9dfdfb52`. The halt-on-error behavior from upstream go-openaudio#323 is correct in isolation, but is **incompatible with the current dual-run state**: Python and api-side ETL both write to overlapping tables, and the on-chain plays bridge from #881 doesn't ON CONFLICT-protect against rows Python has already written. So: - Pre-#883: the failure was silently swallowed by `continue` — ETL was effectively a no-op on essentially every block since #881 deployed, but block_diff stayed green because Python's writes kept `MAX(blocks.height)` moving. Block-level data loss masked by Python carrying the load. - Post-#883: the same failure crashes the indexing loop. We saw it tonight at `processBlock failed` on block 25415514, reproducibly across pod restarts because Python writes the same plays in the same block before the ETL gets to it. Once #884 (the api-wrapper fix that makes that halt actually exit the process) ships, every pod would crashloop the moment it tries to index any recent block. So shipping #883 + #884 without first handling the cross-writer collision points would convert today's silent wedge into a continuous outage that takes the parity jobs (`IndexChallengesJob`, `UserListeningHistory`, `HourlyPlayCounts`, etc.) down with the ETL. Strictly worse. ## Plan 1. **This PR**: pin upstream back to the pre-halt version. Today's silent wedge stays in place — bad, but bounded — and the parity jobs keep ticking. 2. Close #884 (already done). The diagnosis there is correct, but it amplifies #883's bad sequencing, so we re-land it after #883 is safe to re-ship. 3. Revert OpenAudio/go-openaudio#323 upstream too, so no future bump trips this accidentally. 4. **Audit + fix the cross-writer collision points in pkg/etl** — start with the plays bridge (#881), apply the same ON CONFLICT pattern #319 used for the `blocks` table. Then sweep anywhere else ETL and Python touch the same row. 5. Re-land go-openaudio#323, then api#883, then api#884 (in that order). At that point the halt-on-error guarantee is honest. ## Bump details (revert direction) | | from | to | |---|---|---| | `github.com/OpenAudio/go-openaudio` | `v1.3.1-0.20260529230137-819100b28c94` | `v1.3.1-0.20260529221831-4d1c9dfdfb52` | | `github.com/OpenAudio/go-openaudio/pkg/etl` | `v1.3.1-0.20260529230137-819100b28c94` | `v1.3.1-0.20260529221831-4d1c9dfdfb52` | ## Test plan - [x] `go build ./...` clean. - [ ] After deploy: confirm new pod boots, no `processBlock failed` halt log on block 25415514 (it'll go back to silent `continue`). - [ ] Verify parity jobs still tick and block_diff stays at 0 (no functional change vs. pre-#883 prod). 🤖 Generated with [Claude Code](https://claude.com/claude-code)

raymondjacobson and others added 2 commits May 29, 2026 16:56

Remove comment

dc1bde6

raymondjacobson closed this May 30, 2026

This was referenced May 30, 2026

Revert "chore(deps): bump go-openaudio ETL to halt-on-block-error (#883)" #885

Merged

Revert "fix(etl): halt on block indexing failure (#323)" OpenAudio/go-openaudio#324

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(indexer): errgroup.WithContext so ETL halt actually exits the process#884

fix(indexer): errgroup.WithContext so ETL halt actually exits the process#884
raymondjacobson wants to merge 2 commits into
mainfrom
api/indexer-errgroup-with-context

raymondjacobson commented May 29, 2026

Uh oh!

raymondjacobson commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

raymondjacobson commented May 29, 2026

Summary

Fix

Caveats (called out in the rewritten comment)

Test plan

Uh oh!

raymondjacobson commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant