feat(challenges): make Phase 3 incremental + partial GIN for on_user_challenge cooldown trigger by raymondjacobson · Pull Request #877 · AudiusProject/api

raymondjacobson · 2026-05-29T19:39:19Z

Summary

Two related fixes for the wedge the IndexChallengesJob hit right after #842 deployed. The first post-deploy reconcile tick stalled (20+ min, no completion):

The four new Phase 3 processors (m/r/rv/rd) were full-scan over 2.3M user_events rows.
Every is_complete=true upsert fired the on_user_challenge trigger's cooldown-window check — an unindexed scan against the 8 GB notification table (~23.5M rows) that ran 19s+ per call (caught in pg_stat_activity, IO-bound on DataFileRead).

Block indexing stayed healthy throughout (the indexer uses separate connections) — this PR is purely a perf fix for the challenge job.

Two-pronged fix

(a) Phase 3 processors go incremental, #875-style

Each tick rescans only user_events rows whose blocknumber moved past a per-processor checkpoint, then re-derives state for the affected users. Same pattern the merged #875 used for profile_completion/track_upload/first_playlist/cosign.

Supporting migrations:

Migration	What it does
`0209_user_events_blocknumber_idx.sql`	`CREATE INDEX CONCURRENTLY` btree on `user_events(blocknumber)` so the dirty scan is an index range read instead of a 2.3M-row seq-scan.
`0211_seed_phase_3_user_event_checkpoints.sql`	Seeds the four checkpoints (`challenges:m/r/rv/rd:last_blocknumber`) to current `max(user_events.blocknumber)` so prod starts "from now" and skips re-deriving 2.3M historical rows. Python already populated `user_challenges`; the upserts are idempotent.

(b) Partial GIN on `notification` for the trigger's slow path

CREATE INDEX CONCURRENTLY ix_notification_cooldown_user_ids
    ON public.notification USING gin (user_ids)
    WHERE type = 'reward_in_cooldown';

The on_user_challenge trigger does, for every is_complete=true upsert where cooldown_days > 0:

SELECT id FROM notification
 WHERE type='reward_in_cooldown'
   AND new.user_id = ANY(user_ids)
   AND timestamp >= (new.completed_at - interval '1 hour')
 LIMIT 1

The full GIN on user_ids matches across the entire 8 GB table. The partial GIN restricted to type='reward_in_cooldown' is small (one type out of ~30) and lets the planner go straight to user-id matches within that slice. Benefits every cooldown_days>0 challenge (p, u, the Phase 2 ones, etc.), not just Phase 3.

Caveat

The r/rv dirty scan keys on user_events.blocknumber, so a referrer's verification flip (their users.is_verified going true) is not picked up until the referred user's user_events row changes again. Verification changes are rare and the old full-scan code only caught them on its next tick; this matches the precedent #875 set for the other incremental processors. Called out in the file-level docstring.

Test plan

go build ./..., go vet ./jobs/challenges/..., gofmt clean.
All 5 existing user_events processor tests pass — checkpoint defaults to 0 in the test DB, so the first run still catches the seeded block-101 rows (same end-state as the full-scan version).
Added TestMobileInstall_SkipsRowsBelowCheckpoint — pre-seeds the checkpoint past the row's blocknumber and asserts no upsert, explicitly covering the dirty-set skip.
Full jobs/challenges suite green (incl. feat(challenges): make heavy processors incremental (dirty-set) #875's incremental tests).
Schema dump + migration tracker regenerated.

Cutover note

Prod is currently on cd94ede (#842) with the challenge job's first tick still grinding on the slow path. Once this deploys:

0209 creates the user_events blocknumber index (CONCURRENTLY, no write blocking).
0210 creates the partial GIN on notification (CONCURRENTLY).
0211 seeds the four Phase 3 checkpoints to the current high-water mark.
The next tick of IndexChallengesJob picks up the new processors with checkpoint = max, so the dirty set is ~0 rows and the tick completes in milliseconds.

The currently-stuck in-flight tick on cd94ede will continue running until it finishes naturally (or until the pod is replaced by this deploy, which terminates it).

🤖 Generated with Claude Code

Two related fixes for the wedge the IndexChallengesJob hit right after #842 deployed. The first post-deploy reconcile tick stalled (20+ min, no completion) because the four new Phase 3 processors (m/r/rv/rd) were full-scan over 2.3M user_events, and every is_complete=true upsert fired the on_user_challenge trigger's cooldown-window check — an unindexed scan against the 8 GB notification table that ran 19s+ per call. This change splits the fix: - Phase 3 processors go incremental (#875-style dirty-set). Each tick rescans only user_events rows whose blocknumber moved past a per-processor checkpoint, then re-derives state for the affected users. New supporting files: * jobs/challenges/user_event_challenges.go (rewrite) * ddl/migrations/0209: btree on user_events(blocknumber) so the dirty scan is an index range read instead of a 2.3M-row seq-scan * ddl/migrations/0211: seed the four checkpoints to current max user_events.blocknumber so prod starts "from now" and skips the redundant backfill (Python already populated user_challenges; the upserts are idempotent) - Partial GIN on notification(user_ids) WHERE type='reward_in_cooldown' (ddl/migrations/0210, CONCURRENTLY). Lets the trigger's cooldown-window query hit a small in-subset index instead of the full 8 GB table. Helps every cooldown_days>0 challenge, not just Phase 3. Caveat noted in user_event_challenges.go: r/rv key the dirty scan on user_events.blocknumber, so a referrer's verification flip is not picked up until the *referred* user's row changes again. Verification changes are rare and the old full-scan code only caught them on the next tick; this matches the precedent #875 set for the other incremental processors. Tests: existing 5 user_events processor tests still pass (checkpoint defaults to 0 in the test DB → first run catches the seeded block-101 rows); added TestMobileInstall_SkipsRowsBelowCheckpoint to explicitly cover the dirty-set skip path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e used (#880) ## Summary The actual fix for the `IndexChallengesJob` wedge that #877 was supposed to enable. #877's partial GIN was correct — but the trigger SQL couldn't reach it. ## Root cause The `on_user_challenge` and `on_challenge_disbursement` triggers both do per-row lookups against the 8 GB / ~23.5 M-row `notification` table to dedupe reward / cooldown notifications, e.g.: ```sql SELECT id FROM notification WHERE type = 'reward_in_cooldown' AND new.user_id = ANY(user_ids) AND timestamp >= (new.completed_at - interval '1 hour') ``` **PostgreSQL's GIN operator class supports `@>`, `&&`, `<@` — but NOT `scalar = ANY(array)`.** Even with the full `ix_notification` GIN, and the partial GIN added by `0210` (#877), every trigger call fell back to a parallel sequential scan of the entire 8 GB table. Confirmed in prod on cd94ede: ``` -> Parallel Seq Scan on notification (cost=0.00..963930.62) Filter: type='reward_in_cooldown' AND scalar = ANY(user_ids) Rows Removed by Filter: 7,846,045 Execution: 13,640 ms ``` And `pg_stat_user_indexes` showed `ix_notification_cooldown_user_ids.idx_scan = 0` since it was built — completely unused. ## Fix Rewrite the predicate to the canonical `@>` form. Semantics are identical (both test array membership), but only `@>` is GIN-eligible. Same EXPLAIN on the same row, with the same data: | Form | Plan | Execution | |---|---|---| | `new.user_id = ANY(user_ids)` (before) | Parallel Seq Scan | **13,640 ms** | | `user_ids @> ARRAY[new.user_id]` (after) | Bitmap Index Scan on `ix_notification_cooldown_user_ids` | **2 ms** | Three call sites updated: | File | What it does | |---|---| | `ddl/functions/handle_user_challenges.sql` | `reward_in_cooldown` dedupe path of `handle_on_user_challenge()` — fires on every `is_complete=true` write to `user_challenges` | | `ddl/functions/handle_challenge_disbursements.sql` (×2) | `challenge_reward` dedupe in both `handle_challenge_disbursement()` (legacy table) and `handle_sol_reward_disbursement()` (new indexer's table) | Schema dump (`sql/01_schema.sql`) and migration tracker checksums updated to match. ## Impact - **Challenge job**: per-upsert trigger cost drops from ~13 s → ~2 ms. The `IndexChallengesJob` first-tick wedge clears once a fresh backend picks up the new function (so a `core-indexer` pod restart after this deploys is the last manual step, unless the deploy itself replaces the pod). - **Rewards / disbursements**: per-disbursement trigger cost drops by the same factor; the rewards attester is no longer rate-limited by trigger latency. - **All other `cooldown_days > 0` challenges** (`p`, `u`, the Phase 2 ones, etc.) get the same speedup whenever they fire the trigger. ## Out of scope The wider codebase has **~50 other `= any(user_ids)` occurrences** across other trigger functions (notification triggers added in #851, etc.). Same anti-pattern — same fix. Worth a separate sweep PR; I left it out here to keep this one minimal and reviewable. ## Test plan - [x] `go build ./...`, `go vet ./...` clean (no Go changes; sanity check). - [x] Confirmed in prod via `EXPLAIN (ANALYZE, BUFFERS)` that the `@>` form picks `ix_notification_cooldown_user_ids` and completes in 2 ms (vs 13.6 s for the `= ANY` form). - [x] Function checksums in `sql/03_migration_tracker.sql` updated so `pg_migrate.sh` re-applies them on deploy. - [x] `sql/01_schema.sql` updated in lockstep so a fresh test template reflects the new function bodies. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

raymondjacobson merged commit f6e633c into main May 29, 2026
5 checks passed

raymondjacobson deleted the api/challenges-incremental-phase3 branch May 29, 2026 20:12

raymondjacobson mentioned this pull request May 29, 2026

fix(triggers): rewrite user_ids = ANY(...) to @> so GIN indexes can be used #880

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(challenges): make Phase 3 incremental + partial GIN for on_user_challenge cooldown trigger#877

feat(challenges): make Phase 3 incremental + partial GIN for on_user_challenge cooldown trigger#877
raymondjacobson merged 1 commit into
mainfrom
api/challenges-incremental-phase3

raymondjacobson commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

raymondjacobson commented May 29, 2026

Summary

Two-pronged fix

(a) Phase 3 processors go incremental, #875-style

(b) Partial GIN on notification for the trigger's slow path

Caveat

Test plan

Cutover note

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

(b) Partial GIN on `notification` for the trigger's slow path