feat(challenges): make Phase 3 incremental + partial GIN for on_user_challenge cooldown trigger #877
Merged
Two related fixes for the wedge the IndexChallengesJob hit right after #842 deployed. The first post-deploy reconcile tick stalled (20+ min, no completion) because the four new Phase 3 processors (m/r/rv/rd) were full-scan over 2.3M user_events, and every is_complete=true upsert fired the on_user_challenge trigger's cooldown-window check, an unindexed scan against the 8 GB notification table that ran 19s+ per call.

This change splits the fix:

- Phase 3 processors go incremental (#875-style dirty-set). Each tick rescans only user_events rows whose blocknumber moved past a per-processor checkpoint, then re-derives state for the affected users. New supporting files:
  * jobs/challenges/user_event_challenges.go (rewrite)
  * ddl/migrations/0209: btree on user_events(blocknumber) so the dirty scan is an index range read instead of a 2.3M-row seq-scan
  * ddl/migrations/0211: seed the four checkpoints to current max user_events.blocknumber so prod starts "from now" and skips the redundant backfill (Python already populated user_challenges; the upserts are idempotent)
- Partial GIN on notification(user_ids) WHERE type='reward_in_cooldown' (ddl/migrations/0210, CONCURRENTLY). Lets the trigger's cooldown-window query hit a small in-subset index instead of the full 8 GB table. Helps every cooldown_days>0 challenge, not just Phase 3.

Caveat noted in user_event_challenges.go: r/rv key the dirty scan on user_events.blocknumber, so a referrer's verification flip is not picked up until the *referred* user's row changes again. Verification changes are rare and the old full-scan code only caught them on the next tick; this matches the precedent #875 set for the other incremental processors.

Tests: existing 5 user_events processor tests still pass (checkpoint defaults to 0 in the test DB → first run catches the seeded block-101 rows); added TestMobileInstall_SkipsRowsBelowCheckpoint to explicitly cover the dirty-set skip path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
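The dirty-set scan described in the commit message above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in `user_event_challenges.go`: the `UserEvent` type and `DirtyUsers` function are hypothetical names, and the linear loop stands in for what the real job does as an index range read on `user_events(blocknumber)`.

```go
package main

import (
	"fmt"
	"sort"
)

// UserEvent is a hypothetical, trimmed-down stand-in for a user_events row.
type UserEvent struct {
	UserID      int
	BlockNumber int64
}

// DirtyUsers returns the distinct users whose rows sit strictly above the
// checkpoint, plus the advanced checkpoint (the highest blocknumber seen).
// Rows at or below the checkpoint are skipped entirely.
func DirtyUsers(events []UserEvent, checkpoint int64) ([]int, int64) {
	seen := map[int]bool{}
	next := checkpoint
	for _, e := range events {
		if e.BlockNumber <= checkpoint {
			continue // already handled on an earlier tick
		}
		seen[e.UserID] = true
		if e.BlockNumber > next {
			next = e.BlockNumber
		}
	}
	users := make([]int, 0, len(seen))
	for id := range seen {
		users = append(users, id)
	}
	sort.Ints(users) // deterministic order for the affected-user set
	return users, next
}

func main() {
	events := []UserEvent{
		{UserID: 1, BlockNumber: 100},
		{UserID: 2, BlockNumber: 101},
		{UserID: 3, BlockNumber: 105},
	}

	// Checkpoint 0 (as in a fresh test DB): every row is dirty.
	users, cp := DirtyUsers(events, 0)
	fmt.Println(users, cp) // [1 2 3] 105

	// Checkpoint seeded to the current max: nothing is dirty.
	users, cp = DirtyUsers(events, 105)
	fmt.Println(users, cp) // [] 105
}
```

The second call mirrors the effect of seeding the checkpoints to the current maximum: the first tick after deploy has an empty dirty set, so no historical rows are re-derived.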
raymondjacobson added a commit that referenced this pull request May 29, 2026
…e used (#880)

## Summary

The actual fix for the `IndexChallengesJob` wedge that #877 was supposed to enable. #877's partial GIN was correct, but the trigger SQL couldn't reach it.

## Root cause

The `on_user_challenge` and `on_challenge_disbursement` triggers both do per-row lookups against the 8 GB / ~23.5 M-row `notification` table to dedupe reward / cooldown notifications, e.g.:

```sql
SELECT id FROM notification
WHERE type = 'reward_in_cooldown'
  AND new.user_id = ANY(user_ids)
  AND timestamp >= (new.completed_at - interval '1 hour')
```

**PostgreSQL's GIN operator class supports `@>`, `&&`, `<@`, but NOT `scalar = ANY(array)`.** Even with the full `ix_notification` GIN, and the partial GIN added by `0210` (#877), every trigger call fell back to a parallel sequential scan of the entire 8 GB table. Confirmed in prod on cd94ede:

```
-> Parallel Seq Scan on notification (cost=0.00..963930.62)
     Filter: type='reward_in_cooldown' AND scalar = ANY(user_ids)
     Rows Removed by Filter: 7,846,045
Execution: 13,640 ms
```

And `pg_stat_user_indexes` showed `ix_notification_cooldown_user_ids.idx_scan = 0` since it was built: completely unused.

## Fix

Rewrite the predicate to the canonical `@>` form. Semantics are identical (both test array membership), but only `@>` is GIN-eligible.
Same EXPLAIN on the same row, with the same data:

| Form | Plan | Execution |
|---|---|---|
| `new.user_id = ANY(user_ids)` (before) | Parallel Seq Scan | **13,640 ms** |
| `user_ids @> ARRAY[new.user_id]` (after) | Bitmap Index Scan on `ix_notification_cooldown_user_ids` | **2 ms** |

Three call sites updated:

| File | What it does |
|---|---|
| `ddl/functions/handle_user_challenges.sql` | `reward_in_cooldown` dedupe path of `handle_on_user_challenge()`; fires on every `is_complete=true` write to `user_challenges` |
| `ddl/functions/handle_challenge_disbursements.sql` (×2) | `challenge_reward` dedupe in both `handle_challenge_disbursement()` (legacy table) and `handle_sol_reward_disbursement()` (new indexer's table) |

Schema dump (`sql/01_schema.sql`) and migration tracker checksums updated to match.

## Impact

- **Challenge job**: per-upsert trigger cost drops from ~13 s → ~2 ms. The `IndexChallengesJob` first-tick wedge clears once a fresh backend picks up the new function (so a `core-indexer` pod restart after this deploys is the last manual step, unless the deploy itself replaces the pod).
- **Rewards / disbursements**: per-disbursement trigger cost drops by the same factor; the rewards attester is no longer rate-limited by trigger latency.
- **All other `cooldown_days > 0` challenges** (`p`, `u`, the Phase 2 ones, etc.) get the same speedup whenever they fire the trigger.

## Out of scope

The wider codebase has **~50 other `= any(user_ids)` occurrences** across other trigger functions (notification triggers added in #851, etc.). Same anti-pattern, same fix. Worth a separate sweep PR; I left it out here to keep this one minimal and reviewable.

## Test plan

- [x] `go build ./...`, `go vet ./...` clean (no Go changes; sanity check).
- [x] Confirmed in prod via `EXPLAIN (ANALYZE, BUFFERS)` that the `@>` form picks `ix_notification_cooldown_user_ids` and completes in 2 ms (vs 13.6 s for the `= ANY` form).
- [x] Function checksums in `sql/03_migration_tracker.sql` updated so `pg_migrate.sh` re-applies them on deploy.
- [x] `sql/01_schema.sql` updated in lockstep so a fresh test template reflects the new function bodies.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
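To make the difference between the two plans above more concrete, here is a conceptual Go sketch of a per-row filter versus a lookup in an inverted index built over only one notification type. It is an analogy under stated assumptions, not PostgreSQL's GIN implementation or planner logic: the `Notification` type and the `ScanFor` and `BuildPartialIndex` functions are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// Notification is a hypothetical, trimmed-down stand-in for a notification row.
type Notification struct {
	ID      int
	Type    string
	UserIDs []int
}

// ScanFor mimics the sequential-scan plan: every row is visited and filtered
// on type and on membership of userID in the row's user id array.
func ScanFor(rows []Notification, typ string, userID int) []int {
	var ids []int
	for _, n := range rows {
		if n.Type != typ {
			continue
		}
		for _, u := range n.UserIDs {
			if u == userID {
				ids = append(ids, n.ID)
				break
			}
		}
	}
	return ids
}

// BuildPartialIndex builds an inverted index (user id -> notification ids)
// over rows of a single type only, loosely analogous to a partial index.
func BuildPartialIndex(rows []Notification, typ string) map[int][]int {
	idx := map[int][]int{}
	for _, n := range rows {
		if n.Type != typ {
			continue // rows of other types never enter the index
		}
		seen := map[int]bool{}
		for _, u := range n.UserIDs {
			if !seen[u] {
				seen[u] = true
				idx[u] = append(idx[u], n.ID)
			}
		}
	}
	return idx
}

func main() {
	rows := []Notification{
		{ID: 1, Type: "reward_in_cooldown", UserIDs: []int{10, 11}},
		{ID: 2, Type: "follow", UserIDs: []int{10}},
		{ID: 3, Type: "reward_in_cooldown", UserIDs: []int{12}},
	}
	idx := BuildPartialIndex(rows, "reward_in_cooldown")

	fmt.Println(ScanFor(rows, "reward_in_cooldown", 10)) // [1]
	fmt.Println(idx[10])                                  // [1]
}
```

Both paths return the same rows; the difference is that the scan's cost grows with the whole table while the lookup touches only the entries for one user id within one type.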
## Summary

Two related fixes for the wedge the `IndexChallengesJob` hit right after #842 deployed. The first post-deploy reconcile tick stalled (20+ min, no completion):

- The four new Phase 3 processors (`m`/`r`/`rv`/`rd`) were full-scan over 2.3M `user_events` rows.
- Every `is_complete=true` upsert fired the `on_user_challenge` trigger's cooldown-window check, an unindexed scan against the 8 GB `notification` table (~23.5M rows) that ran 19s+ per call (caught in `pg_stat_activity`, IO-bound on `DataFileRead`).

Block indexing stayed healthy throughout (the indexer uses separate connections); this PR is purely a perf fix for the challenge job.
## Two-pronged fix

### (a) Phase 3 processors go incremental, #875-style

Each tick rescans only `user_events` rows whose `blocknumber` moved past a per-processor checkpoint, then re-derives state for the affected users. Same pattern the merged #875 used for `profile_completion` / `track_upload` / `first_playlist` / `cosign`.

Supporting migrations:

- `0209_user_events_blocknumber_idx.sql`: `CREATE INDEX CONCURRENTLY` btree on `user_events(blocknumber)` so the dirty scan is an index range read instead of a 2.3M-row seq-scan.
- `0211_seed_phase_3_user_event_checkpoints.sql`: seeds the four checkpoints (`challenges:m/r/rv/rd:last_blocknumber`) to current `max(user_events.blocknumber)` so prod starts "from now" and skips re-deriving 2.3M historical rows. Python already populated `user_challenges`; the upserts are idempotent.

### (b) Partial GIN on `notification` for the trigger's slow path

The `on_user_challenge` trigger runs its cooldown-window query against `notification` for every `is_complete=true` upsert where `cooldown_days > 0`.

The full GIN on `user_ids` matches across the entire 8 GB table. The partial GIN restricted to `type='reward_in_cooldown'` is small (one type out of ~30) and lets the planner go straight to user-id matches within that slice. Benefits every `cooldown_days>0` challenge (`p`, `u`, the Phase 2 ones, etc.), not just Phase 3.

## Caveat

The r/rv dirty scan keys on `user_events.blocknumber`, so a referrer's verification flip (their `users.is_verified` going true) is not picked up until the referred user's `user_events` row changes again. Verification changes are rare and the old full-scan code only caught them on its next tick; this matches the precedent #875 set for the other incremental processors. Called out in the file-level docstring.

## Test plan

- `go build ./...`, `go vet ./jobs/challenges/...`, gofmt clean.
- `TestMobileInstall_SkipsRowsBelowCheckpoint`: pre-seeds the checkpoint past the row's blocknumber and asserts no upsert, explicitly covering the dirty-set skip.
- `jobs/challenges` suite green (incl. #875's incremental tests).

## Cutover note

Prod is currently on `cd94ede` (#842) with the challenge job's first tick still grinding on the slow path. Once this deploys:

1. `0209` creates the `user_events` blocknumber index (CONCURRENTLY, no write blocking).
2. `0210` creates the partial GIN on `notification` (CONCURRENTLY).
3. `0211` seeds the four Phase 3 checkpoints to the current high-water mark.
4. `IndexChallengesJob` picks up the new processors with checkpoint = max, so the dirty set is ~0 rows and the tick completes in milliseconds.

The currently-stuck in-flight tick on `cd94ede` will continue running until it finishes naturally (or until the pod is replaced by this deploy, which terminates it).

🤖 Generated with Claude Code
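The r/rv caveat described above can be illustrated with a small simulation. This is a sketch under assumptions, not the processor's actual logic: the `Referral` type and `Tick` function are hypothetical, and it assumes for illustration that a referral only counts once the referrer is verified.

```go
package main

import "fmt"

// Referral is a hypothetical stand-in for a referred user's user_events row.
type Referral struct {
	ReferredID  int
	ReferrerID  int
	BlockNumber int64
}

// Tick evaluates only referrals above the checkpoint. It returns the
// referrers found verified on this tick and the advanced checkpoint.
func Tick(rows []Referral, verified map[int]bool, checkpoint int64) ([]int, int64) {
	var completed []int
	next := checkpoint
	for _, r := range rows {
		if r.BlockNumber <= checkpoint {
			continue // not dirty, so the referrer's state is never re-read
		}
		if verified[r.ReferrerID] {
			completed = append(completed, r.ReferrerID)
		}
		if r.BlockNumber > next {
			next = r.BlockNumber
		}
	}
	return completed, next
}

func main() {
	rows := []Referral{{ReferredID: 2, ReferrerID: 1, BlockNumber: 100}}
	verified := map[int]bool{}

	// Tick 1: the row is dirty, but the referrer is not verified yet.
	done, cp := Tick(rows, verified, 0)
	fmt.Println(len(done), cp) // 0 100

	// The referrer becomes verified; user_events does not change.
	verified[1] = true
	done, cp = Tick(rows, verified, cp)
	fmt.Println(len(done), cp) // 0 100

	// The referred user's row changes again, so the flip is finally seen.
	rows[0].BlockNumber = 120
	done, cp = Tick(rows, verified, cp)
	fmt.Println(done, cp) // [1] 120
}
```

The second tick is the gap the caveat describes: the verification flip exists, but nothing in the dirty set points at it.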