feat(challenges): make heavy processors incremental (dirty-set) #875
Merged
raymondjacobson merged 1 commit on May 29, 2026
The aggregating challenge processors rescanned their entire source tables on every 30s tick, which doesn't scale: against prod (follows 26M, saves 10M, reposts 5.9M, users 3.2M, tracks 1.9M) a full profile_completion recompute timed out past 90s and track_upload took ~14s, every cycle.

Rewrite profile_completion (p), track_upload (u), first_playlist (fp), and cosign (cs) to checkpoint a per-processor blocknumber high-water mark and only recompute the users/owners/actions touched since the last tick. Every base-table mutation advances blocknumber (in-place tables bump it; append-only versioned tables insert at the current block), and all six sources are btree-indexed on blocknumber, so the dirty scan is an index range scan. Threshold checks use LIMIT/EXISTS instead of unbounded COUNT(*).

Measured against prod: profile_completion 90s+ -> 168ms, track_upload 14.3s -> 68ms, over a window far larger than a 30s tick. This keeps the fast cadence (a profile completed seconds ago is picked up on the next tick) without re-walking whole tables.

Migration 0208 seeds the four checkpoints to the current max blocknumber on deploy so prod starts "from now" and skips a redundant multi-hour historical backfill (the legacy Python stack already populated user_challenges; upserts are idempotent). Seeding over empty tables yields 0, so test templates are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
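To illustrate why a bounded threshold check is cheaper than an unbounded count, here is a minimal in-memory sketch. It is not the PR's SQL: `reachesThreshold` and `countAll` are hypothetical names, and the threshold of 5 in the usage note is an assumed value. The bounded version stops at the n-th match, which is roughly what `LIMIT n` buys over `COUNT(*)`.

```go
package main

// reachesThreshold reports whether at least n rows satisfy match, and how many
// rows it had to inspect. It returns as soon as the n-th match is found, the
// in-memory analogue of a LIMIT n / EXISTS style check.
// Hypothetical helper for illustration; not taken from the PR.
func reachesThreshold[T any](rows []T, n int, match func(T) bool) (ok bool, inspected int) {
	if n <= 0 {
		return true, 0
	}
	matched := 0
	for _, r := range rows {
		inspected++
		if match(r) {
			matched++
			if matched == n {
				return true, inspected
			}
		}
	}
	return false, inspected
}

// countAll is the unbounded COUNT(*) analogue: it always walks every row,
// even when the caller only needs to know whether a small threshold is met.
func countAll[T any](rows []T, match func(T) bool) (count, inspected int) {
	for _, r := range rows {
		inspected++
		if match(r) {
			count++
		}
	}
	return count, inspected
}
```

With 1,000 matching rows and an assumed threshold of 5, `reachesThreshold` inspects 5 rows while `countAll` inspects all 1,000; the gap grows with the table, which is the effect the commit message attributes to replacing `COUNT(*)`.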
raymondjacobson
added a commit
that referenced
this pull request
May 29, 2026
…op signals system
The mobile-install ("m") and referral ("r"/"rv"/"rd") challenges now derive
from on-chain profile metadata instead of a net-new client-reported signals
endpoint, matching the poll-based pattern of the other challenge processors.
- indexer: new user_events post-hook records events.is_mobile_user /
events.referrer into user_events on User Create+Update (is_mobile_user
sticky, referrer set-once), mirroring legacy Python update_user_events.
- challenges: replace signals.go with user_event_challenges.go — the
m/r/rv/rd processors recompute completions from user_events each Reconcile.
- remove the POST /v1/challenges/signals endpoint and the challenge_signals
table + enum; migration 0205 becomes a seed-only Phase 3 catalog migration.
- regenerate schema dump (challenge_signals removed) and migration tracker.
Relanded on top of #875 (incremental dirty-set processors).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
raymondjacobson
added a commit
that referenced
this pull request
May 29, 2026
…urced) (#842)

## Summary

Phase 3 of the challenge-processor port: the **mobile-install** (`m`) and **referral** (`r`/`rv`/`rd`) challenges. Unlike the original draft of this PR, these are now derived from **on-chain profile metadata** rather than a client-reported signals endpoint, matching the poll-based, recompute-from-source pattern of the Phase 1/2 processors.

After this PR, every non-Solana challenge in apps' `challenges.json` is implemented (across #835, #841, and this PR).

## How it works

The client encodes `is_mobile_user` / `referrer` in the user's profile metadata `events` object on a normal User tx. The indexer records these into the `user_events` table; the challenge processors recompute completions from that table on every Reconcile (no checkpoint, same as the on-chain-derived Phase 1/2 challenges).

- **`indexer/user_events_hook.go`**: a `pkg/etl` post-hook on User Create + Update. Merges `events.is_mobile_user` (sticky: once true, stays true) and `events.referrer` (set-once; self-referral ignored) into `user_events`, versioned via `is_current`. Mirrors legacy Python `update_user_events`.
- **`jobs/challenges/user_event_challenges.go`**: the processors:

| ID | Source | Reward target | Gate |
|---|---|---|---|
| `m` mobile_install | `user_events.is_mobile_user` | the user | none |
| `r` referral | `user_events.referrer` | the referrer | referrer NOT verified |
| `rv` verified_referral | `user_events.referrer` | the referrer | referrer IS verified |
| `rd` referred_signup | `user_events.referrer` | the referred user | none |

## Migration

`0205_seed_phase_3_challenges.sql` is now a **seed-only** catalog migration for `m`/`r`/`rv`/`rd`. No `challenge_signals` table is created; the source of truth is on-chain profile metadata.

## Removed vs the original draft of this PR

- `POST /v1/challenges/signals` endpoint (was net-new vs Python and created a behavioral gap).
- `challenge_signals` table + `challenge_signal_type` enum.
- The `one_shot` (`o`) challenge: no longer supported.

## Test plan

- [x] `indexer`: 6 hook tests: mobile insert, referrer insert, sticky-mobile + set-once-referrer merge, no-op on unchanged, self-referral ignored, no `events` object.
- [x] `jobs/challenges`: 5 user_events processor tests + all prior Phase 1/2/3 + incremental tests pass.
- [x] `go build ./...`, `go vet ./...`, gofmt clean.

## Note

Relanded on top of **#875** (incremental dirty-set processors), which was merged into this branch; #842 carries both into main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
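The merge rules described for the `user_events` hook (sticky `is_mobile_user`, set-once `referrer`, self-referral ignored, no-op when nothing changes) and the `r`/`rv` gate can be sketched as below. This is an illustrative reading of the description, not the code in `indexer/user_events_hook.go`: the type and function names are hypothetical, and treating a referrer of `0` as "unset" is an assumption made to keep the example small.

```go
package main

// userEvents is a hypothetical in-memory stand-in for a user's current
// user_events row. Referrer == 0 means "no referrer recorded" (assumption).
type userEvents struct {
	IsMobileUser bool
	Referrer     int64
}

// mergeUserEvents applies an incoming profile-metadata events object to the
// current row and reports whether anything changed.
//   - is_mobile_user is sticky: it can go false -> true, never back.
//   - referrer is set-once: the first non-self referrer wins.
//   - a user naming themselves as referrer is ignored.
func mergeUserEvents(userID int64, cur, in userEvents) (next userEvents, changed bool) {
	next = cur
	if in.IsMobileUser && !cur.IsMobileUser {
		next.IsMobileUser = true
		changed = true
	}
	if cur.Referrer == 0 && in.Referrer != 0 && in.Referrer != userID {
		next.Referrer = in.Referrer
		changed = true
	}
	return next, changed
}

// referralChallengeID picks which referrer-side challenge a referral counts
// toward, following the Gate column: verified referrers earn "rv", others "r".
func referralChallengeID(referrerVerified bool) string {
	if referrerVerified {
		return "rv"
	}
	return "r"
}
```

Returning `changed` separately is what would let a caller skip writing a new versioned row on an unchanged update, which corresponds to the "no-op on unchanged" hook test in the test plan.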
raymondjacobson
added a commit
that referenced
this pull request
May 29, 2026
…challenge cooldown trigger (#877)

## Summary

Two related fixes for the wedge the `IndexChallengesJob` hit right after #842 deployed. The first post-deploy reconcile tick stalled (20+ min, no completion):

- The four new Phase 3 processors (`m`/`r`/`rv`/`rd`) were full-scan over **2.3M `user_events` rows**.
- Every `is_complete=true` upsert fired the `on_user_challenge` trigger's cooldown-window check: an unindexed scan against the **8 GB `notification` table (~23.5M rows)** that ran **19s+ per call** (caught in `pg_stat_activity`, IO-bound on `DataFileRead`).

Block indexing stayed healthy throughout (the indexer uses separate connections); this PR is purely a perf fix for the challenge job.

## Two-pronged fix

### (a) Phase 3 processors go incremental, #875-style

Each tick rescans only `user_events` rows whose `blocknumber` moved past a per-processor checkpoint, then re-derives state for the affected users. Same pattern the merged #875 used for `profile_completion`/`track_upload`/`first_playlist`/`cosign`.

Supporting migrations:

| Migration | What it does |
|---|---|
| `0209_user_events_blocknumber_idx.sql` | `CREATE INDEX CONCURRENTLY` btree on `user_events(blocknumber)` so the dirty scan is an index range read instead of a 2.3M-row seq-scan. |
| `0211_seed_phase_3_user_event_checkpoints.sql` | Seeds the four checkpoints (`challenges:m/r/rv/rd:last_blocknumber`) to current `max(user_events.blocknumber)` so prod starts "from now" and skips re-deriving 2.3M historical rows. Python already populated `user_challenges`; the upserts are idempotent. |

### (b) Partial GIN on `notification` for the trigger's slow path

```sql
CREATE INDEX CONCURRENTLY ix_notification_cooldown_user_ids
  ON public.notification USING gin (user_ids)
  WHERE type = 'reward_in_cooldown';
```

The `on_user_challenge` trigger does, for every `is_complete=true` upsert where `cooldown_days > 0`:

```sql
SELECT id FROM notification
WHERE type='reward_in_cooldown'
  AND new.user_id = ANY(user_ids)
  AND timestamp >= (new.completed_at - interval '1 hour')
LIMIT 1
```

The full GIN on `user_ids` matches across the entire 8 GB table. The partial GIN restricted to `type='reward_in_cooldown'` is small (one type out of ~30) and lets the planner go straight to user-id matches within that slice. Benefits **every** `cooldown_days>0` challenge (`p`, `u`, the Phase 2 ones, etc.), not just Phase 3.

## Caveat

The r/rv dirty scan keys on `user_events.blocknumber`, so a referrer's verification flip (their `users.is_verified` going true) is **not** picked up until the *referred* user's `user_events` row changes again. Verification changes are rare and the old full-scan code only caught them on its next tick; this matches the precedent #875 set for the other incremental processors. Called out in the file-level docstring.

## Test plan

- [x] `go build ./...`, `go vet ./jobs/challenges/...`, gofmt clean.
- [x] All 5 existing user_events processor tests pass: checkpoint defaults to 0 in the test DB, so the first run still catches the seeded block-101 rows (same end-state as the full-scan version).
- [x] Added `TestMobileInstall_SkipsRowsBelowCheckpoint`, which pre-seeds the checkpoint past the row's blocknumber and asserts no upsert, explicitly covering the dirty-set skip.
- [x] Full `jobs/challenges` suite green (incl. #875's incremental tests).
- [x] Schema dump + migration tracker regenerated.

## Cutover note

Prod is currently on `cd94ede` (#842) with the challenge job's first tick still grinding on the slow path. Once this deploys:

1. `0209` creates the `user_events` blocknumber index (CONCURRENTLY, no write blocking).
2. `0210` creates the partial GIN on `notification` (CONCURRENTLY).
3. `0211` seeds the four Phase 3 checkpoints to the current high-water mark.
4. The next tick of `IndexChallengesJob` picks up the new processors with checkpoint = max, so the dirty set is ~0 rows and the tick completes in milliseconds.

The currently-stuck in-flight tick on `cd94ede` will continue running until it finishes naturally (or until the pod is replaced by this deploy, which terminates it).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
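As a rough mental model of why the partial index helps the trigger's cooldown check, the sketch below builds a lookup that contains only `reward_in_cooldown` notifications, keyed by user id, and then answers the same question as the trigger's `SELECT ... LIMIT 1`. It is an in-memory analogy under stated assumptions, not how Postgres implements a partial GIN: the types and function names are hypothetical, and timestamps are modeled as Unix seconds to keep the example dependency-free.

```go
package main

// notification is a hypothetical in-memory stand-in for a notification row.
type notification struct {
	Type      string
	UserIDs   []int64
	Timestamp int64 // Unix seconds (assumption for this sketch)
}

// cooldownIndex maps a user id to the timestamps of that user's
// reward_in_cooldown notifications. Rows of every other type are never
// added, which loosely mirrors the WHERE clause of the partial index.
type cooldownIndex map[int64][]int64

func buildCooldownIndex(rows []notification) cooldownIndex {
	idx := cooldownIndex{}
	for _, n := range rows {
		if n.Type != "reward_in_cooldown" {
			continue
		}
		for _, u := range n.UserIDs {
			idx[u] = append(idx[u], n.Timestamp)
		}
	}
	return idx
}

// hasRecentCooldown answers the trigger's existence check: is there a
// reward_in_cooldown notification for userID with
// timestamp >= completedAt - 1 hour? It returns on the first hit (LIMIT 1).
func (idx cooldownIndex) hasRecentCooldown(userID, completedAt int64) bool {
	cutoff := completedAt - 3600
	for _, ts := range idx[userID] {
		if ts >= cutoff {
			return true
		}
	}
	return false
}
```

The point of the analogy is that the lookup only ever touches the one notification type the trigger asks about, so its size tracks that slice rather than the full table.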
Summary
The aggregating challenge processors rescanned their entire source tables on every 30s tick, which doesn't scale at prod volume. Measured against prod (read-only):
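
| Processor | Full rescan | Incremental |
|---|---|---|
| `profile_completion` (`p`) | 90s+ (timed out) | 168ms |
| `track_upload` (`u`) | 14.3s | 68ms |

(Both measured over a 2h38m window of activity, far larger than any 30s tick. Prod source sizes: follows 26M, saves 10M, reposts 5.9M, users 3.2M, tracks 1.9M.)
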
These processors are not yet running in prod (`user_challenges` is still driven by the legacy Python stack), so this lands the fix before the bomb ships.

How it works
- `jobs/challenges/incremental.go`: `reconcileIncrementalUsers` checkpoints a per-processor `blocknumber` high-water mark and only recomputes the users/owners touched since the last tick. Every base-table mutation advances `blocknumber` (in-place tables bump it on update/delete; append-only versioned tables insert at the current block), and all six sources are btree-indexed on `blocknumber`, so the dirty scan is an index range scan.
- `highWaterMark` includes a don't-split-a-block rule for downtime catch-up; `dirtyScanBatch=5000` bounds a catch-up tick.
- `profile_completion` (`p`), `track_upload` (`u`), `first_playlist` (`fp`) use the shared helper.
- `cosign` (`cs`) uses a tailored action-based incremental scan on the reposts+saves checkpoint (inactive in prod today, but now future-safe).
- Threshold checks use `LIMIT`/`EXISTS` instead of unbounded `COUNT(*)`; this is what turns the 90s timeout into 168ms.

Cadence stays fast (30s): a profile completed seconds ago is picked up on the next tick. Work is now proportional to changes, not table size.
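One way to read the don't-split-a-block rule is sketched below: when a catch-up batch would end in the middle of a block, the checkpoint stops at the last fully covered block so the remaining rows of the partial block are picked up on the next tick. This is an assumed interpretation, not the code in `incremental.go`; the function signatures are hypothetical, and the rows are taken to be everything past the checkpoint, already sorted by block number.

```go
package main

// dirtyRow is a hypothetical row from a dirty scan: the block in which a
// source row changed and the user/owner it belongs to.
type dirtyRow struct {
	Block  int64
	UserID int64
}

// highWaterMark decides how many of the sorted rows to process this tick and
// what the new checkpoint should be, given a batch bound.
//   - If everything fits in the batch, take it all.
//   - If the batch boundary falls inside a block, back up to the end of the
//     previous block so no block is split across ticks.
//   - If a single block alone exceeds the batch, take that whole block so the
//     checkpoint can still advance (assumption for this sketch).
func highWaterMark(checkpoint int64, rows []dirtyRow, batch int) (mark int64, take int) {
	if len(rows) == 0 {
		return checkpoint, 0
	}
	if batch <= 0 || len(rows) <= batch {
		return rows[len(rows)-1].Block, len(rows)
	}
	edge := rows[batch-1].Block
	if rows[batch].Block != edge {
		return edge, batch // boundary already falls between two blocks
	}
	start := batch - 1
	for start > 0 && rows[start-1].Block == edge {
		start--
	}
	if start > 0 {
		return rows[start-1].Block, start
	}
	end := batch
	for end < len(rows) && rows[end].Block == edge {
		end++
	}
	return edge, end
}

// dirtyUsers returns the distinct users in rows, in first-seen order; these
// are the only users whose challenge state needs recomputing this tick.
func dirtyUsers(rows []dirtyRow) []int64 {
	seen := make(map[int64]bool, len(rows))
	var out []int64
	for _, r := range rows {
		if !seen[r.UserID] {
			seen[r.UserID] = true
			out = append(out, r.UserID)
		}
	}
	return out
}
```

A tick would then recompute `dirtyUsers(rows[:take])` and persist `mark` as the new checkpoint. Advancing the checkpoint only to a fully processed block is what makes a `blocknumber > checkpoint` scan safe to resume after downtime.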
Migration

`0208_seed_challenge_checkpoints.sql` seeds the four checkpoints to the current max `blocknumber` on deploy, so prod starts "from now" and skips a redundant multi-hour historical backfill (Python already populated `user_challenges`; the upserts are idempotent). Seeding over empty tables yields `0`, so fresh test templates are unaffected. Prod applies it against its live `schema_version` regardless of the tracker dump.

Behavior note
The incremental processors no longer downgrade `current_step_count` for users who drop below a threshold after deleting content: they skip zero-result owners and rely on sticky `is_complete`, matching the old GROUP-BY's "only emit rows for users with data" behavior.

Test plan
- `go build ./...`, `go vet ./jobs/...`, gofmt clean
- `profile_completion`/`track_upload`/`first_playlist`/`cosign` tests pass
- `user_challenges` during the parallel-run period before retiring Python

🤖 Generated with Claude Code
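The behavior note above (skip zero-result owners, rely on sticky `is_complete`) can be sketched as a small decision function. This is one possible reading, not the processors' actual upsert: `userChallenge` and `upsertProgress` are hypothetical names, and leaving an already-complete row untouched is an assumption about what "sticky" means here.

```go
package main

// userChallenge is a hypothetical in-memory stand-in for a user_challenges row.
type userChallenge struct {
	CurrentStepCount int
	IsComplete       bool
}

// upsertProgress decides what, if anything, to write for a dirty owner whose
// recomputed count is n, for a challenge that completes at stepCount.
//   - n == 0: the owner is skipped entirely, so an existing row is never
//     downgraded after content is deleted.
//   - existing row already complete: left untouched (sticky is_complete).
//   - otherwise: write the count, capped at stepCount, and mark the row
//     complete once the threshold is reached.
func upsertProgress(existing userChallenge, found bool, n, stepCount int) (row userChallenge, write bool) {
	if n <= 0 {
		return existing, false
	}
	if found && existing.IsComplete {
		return existing, false
	}
	count := n
	if count > stepCount {
		count = stepCount
	}
	return userChallenge{CurrentStepCount: count, IsComplete: count >= stepCount}, true
}
```

Under this reading, a user who completed a challenge and later deleted the qualifying content keeps their completed row, because either the zero-result skip or the sticky check prevents a write.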