feat(challenges): make heavy processors incremental (dirty-set)#875

Merged: raymondjacobson merged 1 commit into api/challenges-phase-3 from api/challenges-incremental on May 29, 2026
@raymondjacobson
Member

Summary

The aggregating challenge processors rescanned their entire source tables on every 30s tick, which doesn't scale at prod volume. Measured against prod (read-only):

| Processor | Before (full scan, every tick) | After (dirty-set, capped) |
|---|---|---|
| profile_completion (p) | >90s timeout | 168ms |
| track_upload (u) | ~14.3s | 68ms |

(Both measured over a 2h38m window of activity — far larger than any 30s tick. Prod source sizes: follows 26M, saves 10M, reposts 5.9M, users 3.2M, tracks 1.9M.)

These processors are not yet running in prod (user_challenges is still driven by the legacy Python stack), so this lands the fix before the bomb ships.

How it works

  • New jobs/challenges/incremental.go: reconcileIncrementalUsers checkpoints a per-processor blocknumber high-water mark and only recomputes the users/owners touched since the last tick. Every base-table mutation advances blocknumber (in-place tables bump it on update/delete; append-only versioned tables insert at the current block), and all six sources are btree-indexed on blocknumber, so the dirty scan is an index range scan. highWaterMark includes a don't-split-a-block rule for downtime catch-up; dirtyScanBatch=5000 bounds a catch-up tick.
  • profile_completion (p), track_upload (u), first_playlist (fp) use the shared helper.
  • cosign (cs) uses a tailored action-based incremental scan on the reposts+saves checkpoint (inactive in prod today, but now future-safe).
  • Threshold checks use LIMIT/EXISTS instead of unbounded COUNT(*) — this is what turns the 90s timeout into 168ms.
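
As a rough illustration of the don't-split-a-block rule, the sketch below picks the next checkpoint from the blocknumbers of the dirty rows. `nextHighWaterMark` and its signature are hypothetical and not taken from incremental.go; only the batch cap (dirtyScanBatch) and the rule itself come from the description above. It assumes `batch` is at least 1 and that the dirty scan selects rows with blocknumber strictly above the checkpoint.

```go
package main

import "fmt"

// nextHighWaterMark returns the checkpoint one tick should advance to.
// blocks holds the blocknumber of every row changed after checkpoint, in
// ascending order; batch caps how many rows a single tick may take.
// Hypothetical helper: the real highWaterMark may be shaped differently.
func nextHighWaterMark(checkpoint int64, blocks []int64, batch int) int64 {
	if len(blocks) == 0 {
		return checkpoint // nothing changed, so the checkpoint stays put
	}
	if len(blocks) <= batch {
		return blocks[len(blocks)-1] // everything fits in one tick
	}
	// blocks[batch] is the first row that does not fit. A naive cutoff would
	// split its block, so stop at the last block strictly below it.
	overflow := blocks[batch]
	for i := batch - 1; i >= 0; i-- {
		if blocks[i] < overflow {
			return blocks[i]
		}
	}
	// The whole batch sits inside one oversized block: take that block whole
	// so the tick still makes progress.
	return overflow
}

func main() {
	dirty := []int64{101, 101, 102, 103, 103}
	fmt.Println(nextHighWaterMark(100, dirty, 3)) // 102: block 103 is left for the next tick
	fmt.Println(nextHighWaterMark(100, dirty, 5)) // 103: the full dirty set fits
}
```

Under those assumptions, stopping at a block boundary matters because the checkpoint is a single blocknumber: a tick that consumed only part of block 103 and then stored 103 would never rescan the rest of that block.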

Cadence stays fast (30s): a profile completed seconds ago is picked up on the next tick. Work is now proportional to changes, not table size.
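
The effect of swapping COUNT(*) for LIMIT/EXISTS can be sketched in memory: a threshold check only needs to read as many rows as the threshold, however large the table is. Everything below (`reachesThreshold`, `rowSource`, the threshold of 5, and the query shape in the comment) is hypothetical and stands in for the real queries, which this description does not show.

```go
package main

import "fmt"

// reachesThreshold reports whether at least threshold rows match, stopping as
// soon as the answer is known. next yields one matching row per call and
// returns false once the source is exhausted. This mirrors a query shaped
// like "SELECT ... LIMIT <threshold>" (hypothetical), as opposed to a
// COUNT(*) that has to visit every matching row.
func reachesThreshold(next func() bool, threshold int) (met bool, rowsRead int) {
	for rowsRead < threshold {
		if !next() {
			return false, rowsRead
		}
		rowsRead++
	}
	return true, rowsRead
}

// rowSource simulates a table holding total matching rows.
func rowSource(total int) func() bool {
	remaining := total
	return func() bool {
		if remaining == 0 {
			return false
		}
		remaining--
		return true
	}
}

func main() {
	// 26M matching rows, but only 5 are read before the check succeeds.
	met, read := reachesThreshold(rowSource(26_000_000), 5)
	fmt.Println(met, read) // true 5
}
```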

Migration

0208_seed_challenge_checkpoints.sql seeds the four checkpoints to the current max blocknumber on deploy, so prod starts "from now" and skips a redundant multi-hour historical backfill (Python already populated user_challenges; the upserts are idempotent). Seeding over empty tables yields 0, so fresh test templates are unaffected. Prod applies it against its live schema_version regardless of the tracker dump.

Behavior note

The incremental processors no longer downgrade current_step_count for users who drop below a threshold after deleting content — they skip zero-result owners and rely on sticky is_complete, matching the old GROUP-BY's "only emit rows for users with data" behavior.
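
A minimal sketch of the behavior described in this note, assuming a stored row with a step count and a completion flag. The `progress` type and `applyRecompute` are hypothetical names. How a lower but non-zero recount is handled is not stated here, so the sketch simply stores the recount in that case; treat that branch as an assumption.

```go
package main

import "fmt"

// progress is a hypothetical stand-in for one user_challenges row.
type progress struct {
	StepCount  int
	IsComplete bool
}

// applyRecompute merges a freshly recomputed step count into the stored row.
// The second result is false when the upsert should be skipped entirely.
func applyRecompute(stored progress, recomputed, threshold int) (progress, bool) {
	if recomputed == 0 {
		// Zero-result owner: emit nothing, so the stored row is untouched.
		return stored, false
	}
	return progress{
		// Assumption: a non-zero recount is written as-is.
		StepCount: recomputed,
		// Sticky: once complete, a later recount never clears the flag.
		IsComplete: stored.IsComplete || recomputed >= threshold,
	}, true
}

func main() {
	done := progress{StepCount: 3, IsComplete: true}

	// The user deleted all content: the row is skipped, not downgraded.
	row, write := applyRecompute(done, 0, 3)
	fmt.Println(row, write) // {3 true} false

	// A user reaching the threshold for the first time is marked complete.
	row, write = applyRecompute(progress{}, 3, 3)
	fmt.Println(row, write) // {3 true} true
}
```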

Test plan

  • go build ./..., go vet ./jobs/..., gofmt clean
  • profile_completion / track_upload / first_playlist / cosign tests pass
  • Migration validated against the test template (rolled back); seeds to 0 over empty tables
  • Confirm parity with Python-written user_challenges during the parallel-run period before retiring Python

🤖 Generated with Claude Code

The aggregating challenge processors rescanned their entire source tables
on every 30s tick, which doesn't scale: against prod (follows 26M, saves
10M, reposts 5.9M, users 3.2M, tracks 1.9M) a full profile_completion
recompute timed out past 90s and track_upload took ~14s — every cycle.

Rewrite profile_completion (p), track_upload (u), first_playlist (fp),
and cosign (cs) to checkpoint a per-processor blocknumber high-water mark
and only recompute the users/owners/actions touched since the last tick.
Every base-table mutation advances blocknumber (in-place tables bump it;
append-only versioned tables insert at the current block), and all six
sources are btree-indexed on blocknumber, so the dirty scan is an index
range scan. Threshold checks use LIMIT/EXISTS instead of unbounded
COUNT(*). Measured against prod: profile_completion 90s+ -> 168ms,
track_upload 14.3s -> 68ms, over a window far larger than a 30s tick.

This keeps the fast cadence (a profile completed seconds ago is picked up
on the next tick) without re-walking whole tables.

Migration 0208 seeds the four checkpoints to the current max blocknumber
on deploy so prod starts "from now" and skips a redundant multi-hour
historical backfill (the legacy Python stack already populated
user_challenges; upserts are idempotent). Seeding over empty tables
yields 0, so test templates are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raymondjacobson raymondjacobson merged commit bf24992 into api/challenges-phase-3 May 29, 2026
4 checks passed
@raymondjacobson raymondjacobson deleted the api/challenges-incremental branch May 29, 2026 18:30
raymondjacobson added a commit that referenced this pull request May 29, 2026
…op signals system

The mobile-install ("m") and referral ("r"/"rv"/"rd") challenges now derive
from on-chain profile metadata instead of a net-new client-reported signals
endpoint, matching the poll-based pattern of the other challenge processors.

- indexer: new user_events post-hook records events.is_mobile_user /
  events.referrer into user_events on User Create+Update (is_mobile_user
  sticky, referrer set-once), mirroring legacy Python update_user_events.
- challenges: replace signals.go with user_event_challenges.go — the
  m/r/rv/rd processors recompute completions from user_events each Reconcile.
- remove the POST /v1/challenges/signals endpoint and the challenge_signals
  table + enum; migration 0205 becomes a seed-only Phase 3 catalog migration.
- regenerate schema dump (challenge_signals removed) and migration tracker.

Relanded on top of #875 (incremental dirty-set processors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
raymondjacobson added a commit that referenced this pull request May 29, 2026
…urced) (#842)

## Summary

Phase 3 of the challenge-processor port: the **mobile-install** (`m`)
and **referral** (`r`/`rv`/`rd`) challenges. Unlike the original draft
of this PR, these are now derived from **on-chain profile metadata**
rather than a client-reported signals endpoint — matching the
poll-based, recompute-from-source pattern of the Phase 1/2 processors.

After this PR, every non-Solana challenge in apps' `challenges.json` is
implemented (across #835, #841, and this PR).

## How it works

The client encodes `is_mobile_user` / `referrer` in the user's profile
metadata `events` object on a normal User tx. The indexer records these
into the `user_events` table; the challenge processors recompute
completions from that table on every Reconcile — no checkpoint, same as
the on-chain-derived Phase 1/2 challenges.

- **`indexer/user_events_hook.go`** — a `pkg/etl` post-hook on User
Create + Update. Merges `events.is_mobile_user` (sticky: once true,
stays true) and `events.referrer` (set-once; self-referral ignored) into
`user_events`, versioned via `is_current`. Mirrors legacy Python
`update_user_events`.
- **`jobs/challenges/user_event_challenges.go`** — the processors:

| ID | Source | Reward target | Gate |
|---|---|---|---|
| `m` mobile_install | `user_events.is_mobile_user` | the user | — |
| `r` referral | `user_events.referrer` | the referrer | referrer NOT verified |
| `rv` verified_referral | `user_events.referrer` | the referrer | referrer IS verified |
| `rd` referred_signup | `user_events.referrer` | the referred user | — |
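
The merge rules of the hook (sticky `is_mobile_user`, set-once `referrer`, self-referral ignored, no-op on unchanged) can be sketched as a pure function. The types, the `mergeUserEvents` name, and the use of 0 for "no referrer" are assumptions made for illustration; they are not the code in `indexer/user_events_hook.go`.

```go
package main

import "fmt"

// userEvents is a hypothetical view of the current user_events row.
type userEvents struct {
	IsMobileUser bool
	Referrer     int64 // 0 means no referrer recorded (assumed encoding)
}

// incomingEvents is the events object from one User tx; a nil pointer means
// the field was absent.
type incomingEvents struct {
	IsMobileUser *bool
	Referrer     *int64
}

// mergeUserEvents applies the merge rules and reports whether anything
// changed, so the caller can skip writing a new version on a no-op.
func mergeUserEvents(userID int64, prev userEvents, in incomingEvents) (userEvents, bool) {
	next := prev
	if in.IsMobileUser != nil && *in.IsMobileUser {
		next.IsMobileUser = true // sticky: a later false never clears it
	}
	if in.Referrer != nil && prev.Referrer == 0 &&
		*in.Referrer != 0 && *in.Referrer != userID {
		next.Referrer = *in.Referrer // set-once, self-referral ignored
	}
	return next, next != prev
}

func main() {
	yes, no := true, false
	ref, other, self := int64(7), int64(9), int64(42)

	state, changed := mergeUserEvents(42, userEvents{}, incomingEvents{IsMobileUser: &yes, Referrer: &ref})
	fmt.Println(state, changed) // {true 7} true

	// A later tx cannot unset the mobile flag or replace the referrer.
	state, changed = mergeUserEvents(42, state, incomingEvents{IsMobileUser: &no, Referrer: &other})
	fmt.Println(state, changed) // {true 7} false

	// Self-referral is ignored.
	state, changed = mergeUserEvents(42, userEvents{}, incomingEvents{Referrer: &self})
	fmt.Println(state, changed) // {false 0} false
}
```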

## Migration

`0205_seed_phase_3_challenges.sql` is now a **seed-only** catalog
migration for `m`/`r`/`rv`/`rd`. No `challenge_signals` table is created
— the source of truth is on-chain profile metadata.

## Removed vs the original draft of this PR

- `POST /v1/challenges/signals` endpoint (was net-new vs Python and
created a behavioral gap).
- `challenge_signals` table + `challenge_signal_type` enum.
- The `one_shot` (`o`) challenge — no longer supported.

## Test plan

- [x] `indexer`: 6 hook tests — mobile insert, referrer insert,
sticky-mobile + set-once-referrer merge, no-op on unchanged,
self-referral ignored, no `events` object.
- [x] `jobs/challenges`: 5 user_events processor tests + all prior Phase
1/2/3 + incremental tests pass.
- [x] `go build ./...`, `go vet ./...`, gofmt clean.

## Note

Relanded on top of **#875** (incremental dirty-set processors), which
was merged into this branch; #842 carries both into main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
raymondjacobson added a commit that referenced this pull request May 29, 2026
…challenge cooldown trigger (#877)

## Summary

Two related fixes for the wedge the `IndexChallengesJob` hit right after
#842 deployed. The first post-deploy reconcile tick stalled (20+ min, no
completion):

- The four new Phase 3 processors (`m`/`r`/`rv`/`rd`) were full-scan
over **2.3M `user_events` rows**.
- Every `is_complete=true` upsert fired the `on_user_challenge`
trigger's cooldown-window check — an unindexed scan against the **8 GB
`notification` table (~23.5M rows)** that ran **19s+ per call** (caught
in `pg_stat_activity`, IO-bound on `DataFileRead`).

Block indexing stayed healthy throughout (the indexer uses separate
connections) — this PR is purely a perf fix for the challenge job.

## Two-pronged fix

### (a) Phase 3 processors go incremental, #875-style

Each tick rescans only `user_events` rows whose `blocknumber` moved past
a per-processor checkpoint, then re-derives state for the affected
users. Same pattern the merged #875 used for
`profile_completion`/`track_upload`/`first_playlist`/`cosign`.

Supporting migrations:

| Migration | What it does |
|---|---|
| `0209_user_events_blocknumber_idx.sql` | `CREATE INDEX CONCURRENTLY` btree on `user_events(blocknumber)` so the dirty scan is an index range read instead of a 2.3M-row seq-scan. |
| `0211_seed_phase_3_user_event_checkpoints.sql` | Seeds the four checkpoints (`challenges:m/r/rv/rd:last_blocknumber`) to current `max(user_events.blocknumber)` so prod starts "from now" and skips re-deriving 2.3M historical rows. Python already populated `user_challenges`; the upserts are idempotent. |

### (b) Partial GIN on `notification` for the trigger's slow path

```sql
CREATE INDEX CONCURRENTLY ix_notification_cooldown_user_ids
    ON public.notification USING gin (user_ids)
    WHERE type = 'reward_in_cooldown';
```

The `on_user_challenge` trigger does, for every `is_complete=true`
upsert where `cooldown_days > 0`:

```sql
SELECT id FROM notification
 WHERE type='reward_in_cooldown'
   AND new.user_id = ANY(user_ids)
   AND timestamp >= (new.completed_at - interval '1 hour')
 LIMIT 1
```

The full GIN on `user_ids` matches across the entire 8 GB table. The
partial GIN restricted to `type='reward_in_cooldown'` is small (one type
out of ~30) and lets the planner go straight to user-id matches within
that slice. Benefits **every** `cooldown_days>0` challenge (`p`, `u`,
the Phase 2 ones, etc.), not just Phase 3.

## Caveat

The r/rv dirty scan keys on `user_events.blocknumber`, so a referrer's
verification flip (their `users.is_verified` going true) is **not**
picked up until the *referred* user's `user_events` row changes again.
Verification changes are rare and the old full-scan code only caught
them on its next tick; this matches the precedent #875 set for the other
incremental processors. Called out in the file-level docstring.

## Test plan

- [x] `go build ./...`, `go vet ./jobs/challenges/...`, gofmt clean.
- [x] All 5 existing user_events processor tests pass — checkpoint
defaults to 0 in the test DB, so the first run still catches the seeded
block-101 rows (same end-state as the full-scan version).
- [x] Added `TestMobileInstall_SkipsRowsBelowCheckpoint` — pre-seeds the
checkpoint past the row's blocknumber and asserts no upsert, explicitly
covering the dirty-set skip.
- [x] Full `jobs/challenges` suite green (incl. #875's incremental
tests).
- [x] Schema dump + migration tracker regenerated.

## Cutover note

Prod is currently on `cd94ede` (#842) with the challenge job's first
tick still grinding on the slow path. Once this deploys:
1. `0209` creates the `user_events` blocknumber index (CONCURRENTLY, no
write blocking).
2. `0210` creates the partial GIN on `notification` (CONCURRENTLY).
3. `0211` seeds the four Phase 3 checkpoints to the current high-water
mark.
4. The next tick of `IndexChallengesJob` picks up the new processors
with checkpoint = max, so the dirty set is ~0 rows and the tick
completes in milliseconds.

The currently-stuck in-flight tick on `cd94ede` will continue running
until it finishes naturally (or until the pod is replaced by this
deploy, which terminates it).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>