Skip to content

feat(challenges): make Phase 3 incremental + partial GIN for on_user_challenge cooldown trigger#877

Merged
raymondjacobson merged 1 commit into
mainfrom
api/challenges-incremental-phase3
May 29, 2026
Merged

feat(challenges): make Phase 3 incremental + partial GIN for on_user_challenge cooldown trigger#877
raymondjacobson merged 1 commit into
mainfrom
api/challenges-incremental-phase3

Conversation

@raymondjacobson
Copy link
Copy Markdown
Member

Summary

Two related fixes for the wedge the IndexChallengesJob hit right after #842 deployed. The first post-deploy reconcile tick stalled (20+ min, no completion):

  • The four new Phase 3 processors (m/r/rv/rd) were full-scan over 2.3M user_events rows.
  • Every is_complete=true upsert fired the on_user_challenge trigger's cooldown-window check — an unindexed scan against the 8 GB notification table (~23.5M rows) that ran 19s+ per call (caught in pg_stat_activity, IO-bound on DataFileRead).

Block indexing stayed healthy throughout (the indexer uses separate connections) — this PR is purely a perf fix for the challenge job.

Two-pronged fix

(a) Phase 3 processors go incremental, #875-style

Each tick rescans only user_events rows whose blocknumber moved past a per-processor checkpoint, then re-derives state for the affected users. Same pattern the merged #875 used for profile_completion/track_upload/first_playlist/cosign.

Supporting migrations:

Migration What it does
0209_user_events_blocknumber_idx.sql CREATE INDEX CONCURRENTLY btree on user_events(blocknumber) so the dirty scan is an index range read instead of a 2.3M-row seq-scan.
0211_seed_phase_3_user_event_checkpoints.sql Seeds the four checkpoints (challenges:m/r/rv/rd:last_blocknumber) to current max(user_events.blocknumber) so prod starts "from now" and skips re-deriving 2.3M historical rows. Python already populated user_challenges; the upserts are idempotent.

(b) Partial GIN on notification for the trigger's slow path

CREATE INDEX CONCURRENTLY ix_notification_cooldown_user_ids
    ON public.notification USING gin (user_ids)
    WHERE type = 'reward_in_cooldown';

The on_user_challenge trigger does, for every is_complete=true upsert where cooldown_days > 0:

SELECT id FROM notification
 WHERE type='reward_in_cooldown'
   AND new.user_id = ANY(user_ids)
   AND timestamp >= (new.completed_at - interval '1 hour')
 LIMIT 1

The full GIN on user_ids matches across the entire 8 GB table. The partial GIN restricted to type='reward_in_cooldown' is small (one type out of ~30) and lets the planner go straight to user-id matches within that slice. Benefits every cooldown_days>0 challenge (p, u, the Phase 2 ones, etc.), not just Phase 3.

Caveat

The r/rv dirty scan keys on user_events.blocknumber, so a referrer's verification flip (their users.is_verified going true) is not picked up until the referred user's user_events row changes again. Verification changes are rare and the old full-scan code only caught them on its next tick; this matches the precedent #875 set for the other incremental processors. Called out in the file-level docstring.

Test plan

  • go build ./..., go vet ./jobs/challenges/..., gofmt clean.
  • All 5 existing user_events processor tests pass — checkpoint defaults to 0 in the test DB, so the first run still catches the seeded block-101 rows (same end-state as the full-scan version).
  • Added TestMobileInstall_SkipsRowsBelowCheckpoint — pre-seeds the checkpoint past the row's blocknumber and asserts no upsert, explicitly covering the dirty-set skip.
  • Full jobs/challenges suite green (incl. feat(challenges): make heavy processors incremental (dirty-set) #875's incremental tests).
  • Schema dump + migration tracker regenerated.

Cutover note

Prod is currently on cd94ede (#842) with the challenge job's first tick still grinding on the slow path. Once this deploys:

  1. 0209 creates the user_events blocknumber index (CONCURRENTLY, no write blocking).
  2. 0210 creates the partial GIN on notification (CONCURRENTLY).
  3. 0211 seeds the four Phase 3 checkpoints to the current high-water mark.
  4. The next tick of IndexChallengesJob picks up the new processors with checkpoint = max, so the dirty set is ~0 rows and the tick completes in milliseconds.

The currently-stuck in-flight tick on cd94ede will continue running until it finishes naturally (or until the pod is replaced by this deploy, which terminates it).

🤖 Generated with Claude Code

Two related fixes for the wedge the IndexChallengesJob hit right after #842
deployed. The first post-deploy reconcile tick stalled (20+ min, no
completion) because the four new Phase 3 processors (m/r/rv/rd) were full-scan
over 2.3M user_events, and every is_complete=true upsert fired the
on_user_challenge trigger's cooldown-window check — an unindexed scan against
the 8 GB notification table that ran 19s+ per call.

This change splits the fix:

- Phase 3 processors go incremental (#875-style dirty-set). Each tick rescans
  only user_events rows whose blocknumber moved past a per-processor
  checkpoint, then re-derives state for the affected users. New supporting
  files:
    * jobs/challenges/user_event_challenges.go (rewrite)
    * ddl/migrations/0209: btree on user_events(blocknumber) so the dirty
      scan is an index range read instead of a 2.3M-row seq-scan
    * ddl/migrations/0211: seed the four checkpoints to current max
      user_events.blocknumber so prod starts "from now" and skips the
      redundant backfill (Python already populated user_challenges; the
      upserts are idempotent)

- Partial GIN on notification(user_ids) WHERE type='reward_in_cooldown'
  (ddl/migrations/0210, CONCURRENTLY). Lets the trigger's cooldown-window
  query hit a small in-subset index instead of the full 8 GB table. Helps
  every cooldown_days>0 challenge, not just Phase 3.

Caveat noted in user_event_challenges.go: r/rv key the dirty scan on
user_events.blocknumber, so a referrer's verification flip is not picked up
until the *referred* user's row changes again. Verification changes are rare
and the old full-scan code only caught them on the next tick; this matches
the precedent #875 set for the other incremental processors.

Tests: existing 5 user_events processor tests still pass (checkpoint defaults
to 0 in the test DB → first run catches the seeded block-101 rows); added
TestMobileInstall_SkipsRowsBelowCheckpoint to explicitly cover the dirty-set
skip path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raymondjacobson raymondjacobson merged commit f6e633c into main May 29, 2026
5 checks passed
@raymondjacobson raymondjacobson deleted the api/challenges-incremental-phase3 branch May 29, 2026 20:12
raymondjacobson added a commit that referenced this pull request May 29, 2026
…e used (#880)

## Summary

The actual fix for the `IndexChallengesJob` wedge that #877 was supposed
to enable. #877's partial GIN was correct — but the trigger SQL couldn't
reach it.

## Root cause

The `on_user_challenge` and `on_challenge_disbursement` triggers both do
per-row lookups against the 8 GB / ~23.5 M-row `notification` table to
dedupe reward / cooldown notifications, e.g.:

```sql
SELECT id FROM notification
 WHERE type = 'reward_in_cooldown'
   AND new.user_id = ANY(user_ids)
   AND timestamp >= (new.completed_at - interval '1 hour')
```

**PostgreSQL's GIN operator class supports `@>`, `&&`, `<@` — but NOT
`scalar = ANY(array)`.** Even with the full `ix_notification` GIN, and
the partial GIN added by `0210` (#877), every trigger call fell back to
a parallel sequential scan of the entire 8 GB table.

Confirmed in prod on cd94ede:

```
->  Parallel Seq Scan on notification  (cost=0.00..963930.62)
       Filter: type='reward_in_cooldown' AND scalar = ANY(user_ids)
       Rows Removed by Filter: 7,846,045
       Execution: 13,640 ms
```

And `pg_stat_user_indexes` showed
`ix_notification_cooldown_user_ids.idx_scan = 0` since it was built —
completely unused.

## Fix

Rewrite the predicate to the canonical `@>` form. Semantics are
identical (both test array membership), but only `@>` is GIN-eligible.
Same EXPLAIN on the same row, with the same data:

| Form | Plan | Execution |
|---|---|---|
| `new.user_id = ANY(user_ids)` (before) | Parallel Seq Scan | **13,640
ms** |
| `user_ids @> ARRAY[new.user_id]` (after) | Bitmap Index Scan on
`ix_notification_cooldown_user_ids` | **2 ms** |

Three call sites updated:

| File | What it does |
|---|---|
| `ddl/functions/handle_user_challenges.sql` | `reward_in_cooldown`
dedupe path of `handle_on_user_challenge()` — fires on every
`is_complete=true` write to `user_challenges` |
| `ddl/functions/handle_challenge_disbursements.sql` (×2) |
`challenge_reward` dedupe in both `handle_challenge_disbursement()`
(legacy table) and `handle_sol_reward_disbursement()` (new indexer's
table) |

Schema dump (`sql/01_schema.sql`) and migration tracker checksums
updated to match.

## Impact

- **Challenge job**: per-upsert trigger cost drops from ~13 s → ~2 ms.
The `IndexChallengesJob` first-tick wedge clears once a fresh backend
picks up the new function (so a `core-indexer` pod restart after this
deploys is the last manual step, unless the deploy itself replaces the
pod).
- **Rewards / disbursements**: per-disbursement trigger cost drops by
the same factor; the rewards attester is no longer rate-limited by
trigger latency.
- **All other `cooldown_days > 0` challenges** (`p`, `u`, the Phase 2
ones, etc.) get the same speedup whenever they fire the trigger.

## Out of scope

The wider codebase has **~50 other `= any(user_ids)` occurrences**
across other trigger functions (notification triggers added in #851,
etc.). Same anti-pattern — same fix. Worth a separate sweep PR; I left
it out here to keep this one minimal and reviewable.

## Test plan

- [x] `go build ./...`, `go vet ./...` clean (no Go changes; sanity
check).
- [x] Confirmed in prod via `EXPLAIN (ANALYZE, BUFFERS)` that the `@>`
form picks `ix_notification_cooldown_user_ids` and completes in 2 ms (vs
13.6 s for the `= ANY` form).
- [x] Function checksums in `sql/03_migration_tracker.sql` updated so
`pg_migrate.sh` re-applies them on deploy.
- [x] `sql/01_schema.sql` updated in lockstep so a fresh test template
reflects the new function bodies.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant