Skip to content

fix(indexer): errgroup.WithContext so ETL halt actually exits the process#884

Closed
raymondjacobson wants to merge 2 commits into
mainfrom
api/indexer-errgroup-with-context
Closed

fix(indexer): errgroup.WithContext so ETL halt actually exits the process#884
raymondjacobson wants to merge 2 commits into
mainfrom
api/indexer-errgroup-with-context

Conversation

@raymondjacobson
Copy link
Copy Markdown
Member

Summary

CoreIndexer.Start used a bare errgroup.Group{}, which doesn't share or cancel any ctx. When etlIndexer.Run() returned the halt-on-block-error introduced by go-openaudio#323 / api#883:

  • the errgroup captured the error,
  • but aggregatesCalculator.Start kept spinning on an uncancelled ctx,
  • eg.Wait() blocked forever waiting for it,
  • main.go's panic(err) never received anything,
  • and the pod stayed 1/1 Running with the ETL goroutine dead.

Observed in prod tonight: a 25P02 cascade in the plays-hook savepoint poisoned the pgx pool, indexBlocks() correctly returned per #323, but the pod kept running. The parity jobs (which startParityJobs launches outside the errgroup) kept ticking, MAX(blocks.height) advanced because the still-running Python indexer was writing them, and health_check looked totally green — making the wedge invisible. The whole point of #323's halt-on-error was defeated by this wrapper.

Fix

errgroup.WithContext(ctx) so the first error cancels gCtx. aggregatesCalculator.Start already has a proper for { select { case <-ctx.Done() ... } } loop and the parity jobs honor ctx via ScheduleEvery, so both exit naturally; eg.Wait() returns, main.go panics, the pod crash-restarts with a fresh pgx pool, and the ETL self-heals on the same block. Same retry semantics #323 promised.

- eg := errgroup.Group{}
+ eg, gCtx := errgroup.WithContext(ctx)
  eg.Go(func() error {
-   return ci.aggregatesCalculator.Start(ctx)
+   return ci.aggregatesCalculator.Start(gCtx)
  })
  eg.Go(func() error {
    ci.logger.Info("Starting ETL indexer")
    return ci.etlIndexer.Run()
  })
- ci.startParityJobs(ctx)
+ ci.startParityJobs(gCtx)
  return eg.Wait()

Caveats (called out in the rewritten comment)

  • etl.Indexer.Run() still uses its own internal context.Background(), so when the outer ctx (SIGTERM) is cancelled, gCtx cancels and the aggregates / parity jobs exit, but eg.Wait() still blocks on ETL. The k8s grace-period SIGKILL handles that today — same tradeoff as before, just made explicit.

Test plan

  • go build ./..., go vet ./..., gofmt clean.
  • go test ./indexer/... still passes (existing pubkey + user_events hook tests).
  • After deploy: synthesize an ETL-halt path (or wait for the next transient) and confirm the pod actually crash-restarts instead of going zombie.

🤖 Generated with Claude Code

raymondjacobson and others added 2 commits May 29, 2026 16:56
…cess

CoreIndexer.Start used a bare errgroup.Group{}, which doesn't share or
cancel any ctx. When etlIndexer.Run() returned the halt-on-block-error
introduced by go-openaudio#323 / api#883:

  - the errgroup captured the error
  - but aggregatesCalculator.Start kept spinning on an uncancelled ctx
  - eg.Wait() blocked forever waiting for it
  - main.go's panic(err) never received anything
  - the pod stayed 1/1 Running with the ETL goroutine dead

Observed in prod tonight on cd94ede: a 25P02 cascade in the plays-hook
savepoint poisoned the pgx pool, indexBlocks() correctly returned per
#323, but the pod kept running with the parity jobs (which start outside
the errgroup) ticking happily and MAX(blocks.height) advancing via the
still-running Python indexer — making the wedge invisible to health
checks. The whole point of #323's halt-on-error was defeated by this
wrapper.

The fix is one-liner-ish: errgroup.WithContext(ctx) so the first error
cancels gCtx, which aggregatesCalculator and the parity jobs already
honor. eg.Wait then returns, main panics, the supervisor restarts the
pod with a fresh pgx pool, and the ETL self-heals — same retry semantics
#323 promised.

Test plan
- go build ./..., go vet ./..., gofmt clean.
- go test ./indexer/... still passes (10 pubkey + user_events hook tests).
- After deploy, can synthesize the ETL-halt path and confirm the pod
  actually crash-restarts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raymondjacobson
Copy link
Copy Markdown
Member Author

Closing for now. The fix in this PR is correct (the bare `errgroup.Group{}` really does block forever when ETL halts), but shipping it on top of the current state would convert today's silent ETL wedge into a continuous pod crashloop during dual-run with Python — see https://github.com/AudiusProject/api/pull/884#issuecomment-… [discussion in PR thread].

The underlying cause is that the on-chain plays bridge from #881 doesn't appear to ON CONFLICT-protect against rows Python has already written, so #883 / go-openaudio#323's halt-on-error fires on essentially every block. That makes #883 incompatible with the dual-run state.

We're rolling back #883 first (and reverting upstream go-openaudio#323) to restore stability, then will audit cross-writer collision points (plays bridge first, anywhere else ETL and Python write the same row second), then re-land #883, then re-land this PR. The diagnosis here stays correct; the deploy sequence isn't.

raymondjacobson added a commit that referenced this pull request May 30, 2026
…)" (#885)

## Summary

Reverts #883, pinning go-openaudio back to
`v1.3.1-0.20260529221831-4d1c9dfdfb52`.

The halt-on-error behavior from upstream go-openaudio#323 is correct in
isolation, but is **incompatible with the current dual-run state**:
Python and api-side ETL both write to overlapping tables, and the
on-chain plays bridge from #881 doesn't ON CONFLICT-protect against rows
Python has already written. So:

- Pre-#883: the failure was silently swallowed by `continue` — ETL was
effectively a no-op on essentially every block since #881 deployed, but
block_diff stayed green because Python's writes kept
`MAX(blocks.height)` moving. Block-level data loss masked by Python
carrying the load.
- Post-#883: the same failure crashes the indexing loop. We saw it
tonight at `processBlock failed` on block 25415514, reproducibly across
pod restarts because Python writes the same plays in the same block
before the ETL gets to it. Once #884 (the api-wrapper fix that makes
that halt actually exit the process) ships, every pod would crashloop
the moment it tries to index any recent block.

So shipping #883 + #884 without first handling the cross-writer
collision points would convert today's silent wedge into a continuous
outage that takes the parity jobs (`IndexChallengesJob`,
`UserListeningHistory`, `HourlyPlayCounts`, etc.) down with the ETL.
Strictly worse.

## Plan

1. **This PR**: pin upstream back to the pre-halt version. Today's
silent wedge stays in place — bad, but bounded — and the parity jobs
keep ticking.
2. Close #884 (already done). The diagnosis there is correct, but it
amplifies #883's bad sequencing, so we re-land it after #883 is safe to
re-ship.
3. Revert OpenAudio/go-openaudio#323 upstream too, so no future bump
trips this accidentally.
4. **Audit + fix the cross-writer collision points in pkg/etl** — start
with the plays bridge (#881), apply the same ON CONFLICT pattern #319
used for the `blocks` table. Then sweep anywhere else ETL and Python
touch the same row.
5. Re-land go-openaudio#323, then api#883, then api#884 (in that order).
At that point the halt-on-error guarantee is honest.

## Bump details (revert direction)

| | from | to |
|---|---|---|
| `github.com/OpenAudio/go-openaudio` |
`v1.3.1-0.20260529230137-819100b28c94` |
`v1.3.1-0.20260529221831-4d1c9dfdfb52` |
| `github.com/OpenAudio/go-openaudio/pkg/etl` |
`v1.3.1-0.20260529230137-819100b28c94` |
`v1.3.1-0.20260529221831-4d1c9dfdfb52` |

## Test plan

- [x] `go build ./...` clean.
- [ ] After deploy: confirm new pod boots, no `processBlock failed` halt
log on block 25415514 (it'll go back to silent `continue`).
- [ ] Verify parity jobs still tick and block_diff stays at 0 (no
functional change vs. pre-#883 prod).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant