feat(challenges): make heavy processors incremental (dirty-set) by raymondjacobson · Pull Request #875 · AudiusProject/api

raymondjacobson · 2026-05-29T17:59:25Z

Summary

The aggregating challenge processors rescanned their entire source tables on every 30s tick, which doesn't scale at prod volume. Measured against prod (read-only):

Processor	Before (full scan, every tick)	After (dirty-set, capped)
`profile_completion` (p)	>90s timeout	168ms
`track_upload` (u)	~14.3s	68ms

(Both measured over a 2h38m window of activity — far larger than any 30s tick. Prod source sizes: follows 26M, saves 10M, reposts 5.9M, users 3.2M, tracks 1.9M.)

These processors are not yet running in prod — user_challenges is still driven by the legacy Python stack — so this lands the fix before the bomb ships.

How it works

New jobs/challenges/incremental.go: reconcileIncrementalUsers checkpoints a per-processor blocknumber high-water mark and only recomputes the users/owners touched since the last tick. Every base-table mutation advances blocknumber (in-place tables bump it on update/delete; append-only versioned tables insert at the current block), and all six sources are btree-indexed on blocknumber, so the dirty scan is an index range scan. highWaterMark includes a don't-split-a-block rule for downtime catch-up; dirtyScanBatch=5000 bounds a catch-up tick.
profile_completion (p), track_upload (u), first_playlist (fp) use the shared helper.
cosign (cs) uses a tailored action-based incremental scan on the reposts+saves checkpoint (inactive in prod today, but now future-safe).
Threshold checks use LIMIT/EXISTS instead of unbounded COUNT(*) — this is what turns the 90s timeout into 168ms.
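The dirty-set tick described above can be sketched in memory as follows. This is a minimal illustration, not the contents of incremental.go: the `row` type and `dirtyOwners` function are hypothetical names, and the reading of the don't-split-a-block rule (when the batch cap lands inside a block, consume the rest of that block so the checkpoint never points mid-block) is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical stand-in for one source-table row: the block that
// last touched it and the user/owner it belongs to.
type row struct {
	Blocknumber int64
	OwnerID     int64
}

// dirtyOwners returns the distinct owners touched after checkpoint and the
// new high-water mark. rows must be sorted by Blocknumber ascending.
// At most batch rows are consumed, except that a block straddling the cap is
// consumed whole (assumed reading of the "don't-split-a-block" rule).
func dirtyOwners(rows []row, checkpoint int64, batch int) ([]int64, int64) {
	// Analog of the index range scan: first row with Blocknumber > checkpoint.
	start := sort.Search(len(rows), func(i int) bool {
		return rows[i].Blocknumber > checkpoint
	})
	pending := rows[start:]
	if len(pending) == 0 {
		return nil, checkpoint // nothing changed; checkpoint stays put
	}

	end := len(pending)
	if batch > 0 && end > batch {
		end = batch
		// Extend through the remainder of the boundary block.
		boundary := pending[end-1].Blocknumber
		for end < len(pending) && pending[end].Blocknumber == boundary {
			end++
		}
	}

	// Deduplicate owners so each is recomputed once per tick.
	seen := make(map[int64]bool)
	var owners []int64
	for _, r := range pending[:end] {
		if !seen[r.OwnerID] {
			seen[r.OwnerID] = true
			owners = append(owners, r.OwnerID)
		}
	}
	return owners, pending[end-1].Blocknumber
}

func main() {
	rows := []row{{10, 1}, {11, 2}, {11, 3}, {11, 2}, {12, 4}}

	// Catch-up tick with a cap of 2 rows: block 11 straddles the cap,
	// so it is consumed whole and the checkpoint lands on 11.
	owners, hwm := dirtyOwners(rows, 9, 2)
	fmt.Println(owners, hwm)

	// Next tick resumes strictly after block 11.
	owners, hwm = dirtyOwners(rows, hwm, 2)
	fmt.Println(owners, hwm)
}
```

Extending through the boundary block guarantees forward progress even when a single block holds more rows than the cap; backing off to the previous block would be the other obvious choice, but it can stall on an oversized block.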

Cadence stays fast (30s): a profile completed seconds ago is picked up on the next tick. Work is now proportional to changes, not table size.
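To illustrate why the LIMIT/EXISTS threshold checks mentioned under "How it works" are cheap, here is a hedged sketch. The SQL shape, the table and column names, and the `reachesThreshold` helper are all assumptions made for illustration; the PR does not show the actual queries.

```go
package main

import "fmt"

// boundedCountSQL is a hypothetical query shape: the inner LIMIT stops the
// scan once $2 matching rows are found, so the cost is bounded by the
// threshold rather than by how many rows the user owns. Table and column
// names are illustrative only.
const boundedCountSQL = `
SELECT count(*) FROM (
    SELECT 1 FROM follows WHERE follower_user_id = $1 LIMIT $2
) capped`

// reachesThreshold is an in-memory analog of the bounded check: it stops as
// soon as threshold matches are seen and reports how many rows it visited.
func reachesThreshold(owners []int64, owner int64, threshold int) (bool, int) {
	if threshold <= 0 {
		return true, 0
	}
	matches := 0
	for i, o := range owners {
		if o == owner {
			matches++
			if matches >= threshold {
				return true, i + 1 // early exit: the rest is never scanned
			}
		}
	}
	return false, len(owners)
}

func main() {
	owners := []int64{7, 7, 3, 7, 7, 7, 7, 7}
	ok, scanned := reachesThreshold(owners, 7, 3)
	fmt.Println(ok, scanned) // true 4
}
```

An unbounded COUNT(*) would visit every matching row before comparing against the threshold; the bounded form only needs to know whether the threshold is reached.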

Migration

0208_seed_challenge_checkpoints.sql seeds the four checkpoints to the current max blocknumber on deploy, so prod starts "from now" and skips a redundant multi-hour historical backfill (Python already populated user_challenges; the upserts are idempotent). Seeding over empty tables yields 0, so fresh test templates are unaffected. Prod applies it against its live schema_version regardless of the tracker dump.
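The seeding rule can be sketched as a small pure function. `seedCheckpoint` is a hypothetical name, and the SQL analog mentioned in the comment (a `COALESCE(MAX(blocknumber), 0)`) is an assumption about how an empty table ends up as 0, not a quote from the migration.

```go
package main

import "fmt"

// seedCheckpoint returns the value a checkpoint would be seeded to: the
// largest blocknumber across the given source tables, or 0 when all of them
// are empty. A SQL analog might be COALESCE(MAX(blocknumber), 0); that exact
// expression is an assumption.
func seedCheckpoint(sources ...[]int64) int64 {
	var seed int64 // stays 0 if every source is empty
	for _, blocks := range sources {
		for _, b := range blocks {
			if b > seed {
				seed = b
			}
		}
	}
	return seed
}

func main() {
	// Populated sources: start "from now" at the highest block seen.
	fmt.Println(seedCheckpoint([]int64{101, 250}, []int64{199})) // 250

	// Empty test template: seed is 0, so the first tick still sees all rows.
	fmt.Println(seedCheckpoint(nil, []int64{})) // 0
}
```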

Behavior note

The incremental processors no longer downgrade current_step_count for users who drop below a threshold after deleting content — they skip zero-result owners and rely on sticky is_complete, matching the old GROUP-BY's "only emit rows for users with data" behavior.
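A minimal sketch of the two rules stated in this note, skipping zero-result owners and keeping is_complete sticky. The `userChallenge` type and `applyRecompute` function are hypothetical. How a lower but nonzero count is handled is not stated in the PR; this sketch simply writes the count as computed, which is an assumption.

```go
package main

import "fmt"

// userChallenge is a hypothetical in-memory stand-in for a user_challenges row.
type userChallenge struct {
	CurrentStepCount int
	IsComplete       bool
}

// applyRecompute folds one recomputed count into the stored state.
//   - count == 0: the owner is skipped entirely, so nothing is downgraded.
//   - is_complete is sticky: once true it is never flipped back.
//
// Writing a lower nonzero count as-is is an assumption of this sketch.
func applyRecompute(state map[int64]userChallenge, userID int64, count, threshold int) {
	if count == 0 {
		return // zero-result owner: no upsert is emitted
	}
	row := state[userID]
	row.CurrentStepCount = count
	row.IsComplete = row.IsComplete || count >= threshold
	state[userID] = row
}

func main() {
	state := map[int64]userChallenge{1: {CurrentStepCount: 3, IsComplete: true}}

	// User 1 deleted all content: the recompute yields 0 and the row is untouched.
	applyRecompute(state, 1, 0, 3)
	fmt.Println(state[1])

	// User 2 reaches the threshold on a later tick.
	applyRecompute(state, 2, 1, 3)
	applyRecompute(state, 2, 3, 3)
	fmt.Println(state[2])
}
```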

Test plan

go build ./..., go vet ./jobs/..., gofmt clean
profile_completion / track_upload / first_playlist / cosign tests pass
Migration validated against the test template (rolled back); seeds to 0 over empty tables
Confirm parity with Python-written user_challenges during the parallel-run period before retiring Python

🤖 Generated with Claude Code

The aggregating challenge processors rescanned their entire source tables on every 30s tick, which doesn't scale: against prod (follows 26M, saves 10M, reposts 5.9M, users 3.2M, tracks 1.9M) a full profile_completion recompute timed out past 90s and track_upload took ~14s, every cycle.

Rewrite profile_completion (p), track_upload (u), first_playlist (fp), and cosign (cs) to checkpoint a per-processor blocknumber high-water mark and only recompute the users/owners/actions touched since the last tick. Every base-table mutation advances blocknumber (in-place tables bump it; append-only versioned tables insert at the current block), and all six sources are btree-indexed on blocknumber, so the dirty scan is an index range scan. Threshold checks use LIMIT/EXISTS instead of unbounded COUNT(*).

Measured against prod: profile_completion 90s+ -> 168ms, track_upload 14.3s -> 68ms, over a window far larger than a 30s tick. This keeps the fast cadence (a profile completed seconds ago is picked up on the next tick) without re-walking whole tables.

Migration 0208 seeds the four checkpoints to the current max blocknumber on deploy so prod starts "from now" and skips a redundant multi-hour historical backfill (the legacy Python stack already populated user_challenges; upserts are idempotent). Seeding over empty tables yields 0, so test templates are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…op signals system

The mobile-install ("m") and referral ("r"/"rv"/"rd") challenges now derive from on-chain profile metadata instead of a net-new client-reported signals endpoint, matching the poll-based pattern of the other challenge processors.

- indexer: new user_events post-hook records events.is_mobile_user / events.referrer into user_events on User Create+Update (is_mobile_user sticky, referrer set-once), mirroring legacy Python update_user_events.
- challenges: replace signals.go with user_event_challenges.go; the m/r/rv/rd processors recompute completions from user_events each Reconcile.
- remove the POST /v1/challenges/signals endpoint and the challenge_signals table + enum; migration 0205 becomes a seed-only Phase 3 catalog migration.
- regenerate schema dump (challenge_signals removed) and migration tracker.

Relanded on top of #875 (incremental dirty-set processors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…urced) (#842)

## Summary

Phase 3 of the challenge-processor port: the **mobile-install** (`m`) and **referral** (`r`/`rv`/`rd`) challenges. Unlike the original draft of this PR, these are now derived from **on-chain profile metadata** rather than a client-reported signals endpoint, matching the poll-based, recompute-from-source pattern of the Phase 1/2 processors.

After this PR, every non-Solana challenge in apps' `challenges.json` is implemented (across #835, #841, and this PR).

## How it works

The client encodes `is_mobile_user` / `referrer` in the user's profile metadata `events` object on a normal User tx. The indexer records these into the `user_events` table; the challenge processors recompute completions from that table on every Reconcile, with no checkpoint, same as the on-chain-derived Phase 1/2 challenges.

- **`indexer/user_events_hook.go`**: a `pkg/etl` post-hook on User Create + Update. Merges `events.is_mobile_user` (sticky: once true, stays true) and `events.referrer` (set-once; self-referral ignored) into `user_events`, versioned via `is_current`. Mirrors legacy Python `update_user_events`.
- **`jobs/challenges/user_event_challenges.go`**: the processors:

| ID | Source | Reward target | Gate |
|---|---|---|---|
| `m` mobile_install | `user_events.is_mobile_user` | the user | none |
| `r` referral | `user_events.referrer` | the referrer | referrer NOT verified |
| `rv` verified_referral | `user_events.referrer` | the referrer | referrer IS verified |
| `rd` referred_signup | `user_events.referrer` | the referred user | none |

## Migration

`0205_seed_phase_3_challenges.sql` is now a **seed-only** catalog migration for `m`/`r`/`rv`/`rd`. No `challenge_signals` table is created; the source of truth is on-chain profile metadata.

## Removed vs the original draft of this PR

- `POST /v1/challenges/signals` endpoint (was net-new vs Python and created a behavioral gap).
- `challenge_signals` table + `challenge_signal_type` enum.
- The `one_shot` (`o`) challenge: no longer supported.

## Test plan

- [x] `indexer`: 6 hook tests: mobile insert, referrer insert, sticky-mobile + set-once-referrer merge, no-op on unchanged, self-referral ignored, no `events` object.
- [x] `jobs/challenges`: 5 user_events processor tests + all prior Phase 1/2/3 + incremental tests pass.
- [x] `go build ./...`, `go vet ./...`, gofmt clean.

## Note

Relanded on top of **#875** (incremental dirty-set processors), which was merged into this branch; #842 carries both into main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
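The merge rules of the #842 hook (sticky `is_mobile_user`, set-once `referrer`, self-referral ignored, no-op on unchanged) and the `r`/`rv` gate can be sketched as below. The types and function names are hypothetical, and using 0 to mean "no referrer" is an assumption of this sketch rather than the real column semantics.

```go
package main

import "fmt"

// userEvents is a hypothetical in-memory view of a user's current
// user_events row. Referrer == 0 means "not set" in this sketch.
type userEvents struct {
	IsMobileUser bool
	Referrer     int64
}

// incoming holds the values parsed from the profile metadata `events` object.
type incoming struct {
	IsMobileUser bool
	Referrer     int64
}

// mergeUserEvents applies the merge rules described in the commit message and
// reports whether anything changed (false means the hook would be a no-op).
func mergeUserEvents(userID int64, cur userEvents, in incoming) (userEvents, bool) {
	next := cur
	if in.IsMobileUser {
		next.IsMobileUser = true // sticky: once true, stays true
	}
	// Set-once: only fill an empty referrer, and never with the user itself.
	if cur.Referrer == 0 && in.Referrer != 0 && in.Referrer != userID {
		next.Referrer = in.Referrer
	}
	return next, next != cur
}

// referrerChallenge picks the challenge ID rewarded to the referrer, based on
// the gate column of the processor table: `rv` if verified, otherwise `r`.
func referrerChallenge(referrerVerified bool) string {
	if referrerVerified {
		return "rv"
	}
	return "r"
}

func main() {
	// Self-referral is ignored; the mobile flag is recorded.
	ev, changed := mergeUserEvents(42, userEvents{}, incoming{IsMobileUser: true, Referrer: 42})
	fmt.Println(ev, changed)

	// The first real referrer is accepted.
	ev, changed = mergeUserEvents(42, ev, incoming{Referrer: 7})
	fmt.Println(ev, changed)

	// A later, different referrer is ignored, and the mobile flag stays true.
	ev, changed = mergeUserEvents(42, ev, incoming{Referrer: 9})
	fmt.Println(ev, changed)

	fmt.Println(referrerChallenge(false), referrerChallenge(true))
}
```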

…challenge cooldown trigger (#877)

## Summary

Two related fixes for the wedge the `IndexChallengesJob` hit right after #842 deployed. The first post-deploy reconcile tick stalled (20+ min, no completion):

- The four new Phase 3 processors (`m`/`r`/`rv`/`rd`) were full-scan over **2.3M `user_events` rows**.
- Every `is_complete=true` upsert fired the `on_user_challenge` trigger's cooldown-window check: an unindexed scan against the **8 GB `notification` table (~23.5M rows)** that ran **19s+ per call** (caught in `pg_stat_activity`, IO-bound on `DataFileRead`).

Block indexing stayed healthy throughout (the indexer uses separate connections); this PR is purely a perf fix for the challenge job.

## Two-pronged fix

### (a) Phase 3 processors go incremental, #875-style

Each tick rescans only `user_events` rows whose `blocknumber` moved past a per-processor checkpoint, then re-derives state for the affected users. Same pattern the merged #875 used for `profile_completion`/`track_upload`/`first_playlist`/`cosign`.

Supporting migrations:

| Migration | What it does |
|---|---|
| `0209_user_events_blocknumber_idx.sql` | `CREATE INDEX CONCURRENTLY` btree on `user_events(blocknumber)` so the dirty scan is an index range read instead of a 2.3M-row seq-scan. |
| `0211_seed_phase_3_user_event_checkpoints.sql` | Seeds the four checkpoints (`challenges:m/r/rv/rd:last_blocknumber`) to current `max(user_events.blocknumber)` so prod starts "from now" and skips re-deriving 2.3M historical rows. Python already populated `user_challenges`; the upserts are idempotent. |

### (b) Partial GIN on `notification` for the trigger's slow path

```sql
CREATE INDEX CONCURRENTLY ix_notification_cooldown_user_ids
  ON public.notification USING gin (user_ids)
  WHERE type = 'reward_in_cooldown';
```

The `on_user_challenge` trigger does, for every `is_complete=true` upsert where `cooldown_days > 0`:

```sql
SELECT id FROM notification
WHERE type='reward_in_cooldown'
  AND new.user_id = ANY(user_ids)
  AND timestamp >= (new.completed_at - interval '1 hour')
LIMIT 1
```

The full GIN on `user_ids` matches across the entire 8 GB table. The partial GIN restricted to `type='reward_in_cooldown'` is small (one type out of ~30) and lets the planner go straight to user-id matches within that slice. Benefits **every** `cooldown_days>0` challenge (`p`, `u`, the Phase 2 ones, etc.), not just Phase 3.

## Caveat

The r/rv dirty scan keys on `user_events.blocknumber`, so a referrer's verification flip (their `users.is_verified` going true) is **not** picked up until the *referred* user's `user_events` row changes again. Verification changes are rare and the old full-scan code only caught them on its next tick; this matches the precedent #875 set for the other incremental processors. Called out in the file-level docstring.

## Test plan

- [x] `go build ./...`, `go vet ./jobs/challenges/...`, gofmt clean.
- [x] All 5 existing user_events processor tests pass: checkpoint defaults to 0 in the test DB, so the first run still catches the seeded block-101 rows (same end-state as the full-scan version).
- [x] Added `TestMobileInstall_SkipsRowsBelowCheckpoint`: pre-seeds the checkpoint past the row's blocknumber and asserts no upsert, explicitly covering the dirty-set skip.
- [x] Full `jobs/challenges` suite green (incl. #875's incremental tests).
- [x] Schema dump + migration tracker regenerated.

## Cutover note

Prod is currently on `cd94ede` (#842) with the challenge job's first tick still grinding on the slow path. Once this deploys:

1. `0209` creates the `user_events` blocknumber index (CONCURRENTLY, no write blocking).
2. `0210` creates the partial GIN on `notification` (CONCURRENTLY).
3. `0211` seeds the four Phase 3 checkpoints to the current high-water mark.
4. The next tick of `IndexChallengesJob` picks up the new processors with checkpoint = max, so the dirty set is ~0 rows and the tick completes in milliseconds.

The currently-stuck in-flight tick on `cd94ede` will continue running until it finishes naturally (or until the pod is replaced by this deploy, which terminates it).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

raymondjacobson merged commit bf24992 into api/challenges-phase-3 May 29, 2026
4 checks passed

raymondjacobson deleted the api/challenges-incremental branch May 29, 2026 18:30

raymondjacobson mentioned this pull request May 29, 2026

feat(challenges): Phase 3 — mobile-install & referral (user_events-sourced) #842

Merged

3 tasks

raymondjacobson mentioned this pull request May 29, 2026

feat(challenges): make Phase 3 incremental + partial GIN for on_user_challenge cooldown trigger #877

Merged

5 tasks

raymondjacobson merged 1 commit into api/challenges-phase-3 from api/challenges-incremental
