
[Perf] Feed query: LATERAL + OFFSET 0 fence + per-followee cap #798

Merged
raymondjacobson merged 1 commit into main from ray/perf-feed-lateral-fence on May 9, 2026

Conversation

@raymondjacobson
Member

Summary

Fix the planner cliff in /v1/users/:userId/feed that makes the same query take 125 ms for one user and 9-18 s for another with nearly identical follow counts.

Three changes pin the planner to nested-loop semantics (a sketch of the resulting branch shape follows the list):

  1. follow_set CTE → MATERIALIZED (row count fixed for downstream planning).
  2. Each UNION branch uses CROSS JOIN LATERAL with an OFFSET 0 optimization fence inside the lateral subquery — prevents Postgres from flattening the lateral back into a merge-join.
  3. Per-followee LIMIT 100 (50 for owned playlists) caps cost for users whose followees are very active. The outer query takes only the top-@limit by created_at, so older entries past the per-followee top-100 can never reach the response.
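
A minimal sketch of one branch under this shape, with illustrative table and column names (the real query has one branch per entity type, more columns, and the per-year window as implemented in the diff):

```sql
with follow_set as materialized (              -- (1) row count pinned for downstream planning
    select followee_user_id
    from follows
    where follower_user_id = @user_id
)
select f.item_id, f.created_at
from follow_set fs
cross join lateral (                           -- (2) one indexed probe per followee
    select r.repost_item_id as item_id,
           r.created_at
    from reposts r
    where r.user_id = fs.followee_user_id
      and r.created_at > now() - interval '1 year'
    order by r.created_at desc
    limit 100                                  -- (3) per-followee cap
    offset 0                                   -- fence: stops Postgres from flattening the
                                               -- lateral back into a merge join
) f
order by f.created_at desc
limit @limit;
```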

Why

/v1/users/:userId/feed is the worst signed-in endpoint by p95 in Axiom. Per the post-merge histogram (13 hours, 5,831 calls), 60 % of requests took >2 s, 23 % took >5 s, and 137 requests took >10 s.

EXPLAIN on prod replica showed the smoking gun — three users with similar follow counts produce completely different plans:

| User | Follows | Plan | Time |
|---|---|---|---|
| 20 (Phuture) | 1752 | nested-loop | 125 ms |
| 222 | 1820 | nested-loop, lots of data | 4.5 s |
| 755516 | 1816 | merge join, materialize all 2M reposts of last year + hash-join 1.4M tracks | 9-18 s |

follow_set is estimated at 17,290 rows but actually returns 1,816. That bad cardinality estimate flips the join strategy.
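
For reference, the statistic behind the misestimate can be inspected directly (a standard catalog query, not part of this PR):

```sql
-- A stale / too-low n_distinct inflates the per-follower row estimate
-- (roughly reltuples / n_distinct when the value is not in the MCV list).
select attname, n_distinct, null_frac
from pg_stats
where tablename = 'follows'
  and attname = 'follower_user_id';

-- ANALYZE follows would refresh this statistic; the PR pins the plan instead
-- so it cannot flip again between users.
```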

Impact

End-to-end on local server pointed at prod replica:

| User | Before this PR | After |
|---|---|---|
| 20 (1752 follows) | 500-750 ms | 280-300 ms |
| 222 (1820 follows) | ~4.5 s | 1.3-1.5 s (3×) |
| 755516 (1816 follows) | 9-18 s | 640-700 ms (~20×) |

EXPLAIN-only DB time on the same users (warm cache):

| User | Before | After |
|---|---|---|
| 20 | 113 ms | 105 ms |
| 222 | 4.4 s | 1.5 s |
| 755516 | 9-18 s | 700 ms |

Risk

  • Per-followee LIMIT 100 trims the input set. A followee who reposts >100 things in the past year contributes only their top-100 most recent. The outer query orders by created_at DESC LIMIT @limit (max 100), so older reposts past the per-followee top-100 cannot reach the rendered response. Equivalent for owned tracks (LIMIT 100) and owned playlists (LIMIT 50) — virtually no artist publishes more than that per year.
  • MATERIALIZED and OFFSET 0 are planner directives, not logic changes. The shape of history and the outer aggregation/sort/limit are unchanged.
  • Existing TestUsersFeed covers the four entity branches plus the no-followees empty case. Full ./api/... suite is green.

Test plan

  • go test -count=1 ./api/... (full suite, all green)
  • EXPLAIN ANALYZE on prod read replica across four users (1545-1820 follows) confirms ~3-25× speedup on the slow cohort with no regression on the lucky cohort
  • Local server end-to-end timings confirm the same shape as the EXPLAIN data

🤖 Generated with Claude Code

The feed query had a ~10x planner-cliff for some users: identical
SQL took 125ms for one user with 1752 follows but 9-18s for another
user with 1816 follows. Cause: stale n_distinct stats on
follows.follower_user_id make Postgres estimate follow_set at
~17,290 rows when actual is <2,000 — for the unlucky users it
flips from a sane nested-loop plan to "materialize all 2M reposts
of the past year, merge-join, then hash-join 1.4M tracks."

Three changes hold the planner to nested-loop semantics:

1. follow_set CTE marked MATERIALIZED so its row count is fixed
   downstream rather than re-estimated through inlining.
2. Each branch joins follow_set via CROSS JOIN LATERAL with an
   OFFSET 0 fence inside the lateral subquery — this is the well-
   known optimization barrier that prevents Postgres from flattening
   the lateral back into a merge-join.
3. Per-followee LIMIT 100 (50 for owned playlists) caps the cost
   for users whose followees are very active. The outer query takes
   only the top-@limit by created_at, so reposts/tracks past the
   per-followee top-100 can never reach the response anyway.

Verified end-to-end against the prod read replica via local server:

  user 20  (1752 follows)  500-750ms -> 280-300ms warm
  user 222 (1820 follows)  ~4.5s     -> 1.3-1.5s     (3x)
  user 755516 (1816 follows)  9-18s  -> 640-700ms    (~20x)

Existing TestUsersFeed regression covers the entity-type branches
and the no-followees empty case; full ./api/... suite is green.
raymondjacobson merged commit 3f4fc78 into main on May 9, 2026
5 checks passed
raymondjacobson deleted the ray/perf-feed-lateral-fence branch on May 9, 2026 00:03
raymondjacobson added a commit that referenced this pull request May 9, 2026
…800)

Two independent fixes for the API's two hottest queries (GetUsers ~287M
calls and GetTracks ~268M calls in `pg_stat_statements`).
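
Call counts like these can be read straight from `pg_stat_statements`; something like the following surfaces the hottest statements (illustrative query, not part of the PR, requires the extension and PG 13+ for `mean_exec_time`):

```sql
select queryid,
       calls,
       round(mean_exec_time::numeric, 2) as mean_ms,
       left(query, 60)                   as query_head
from pg_stat_statements
order by calls desc
limit 10;
```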

## 1. GetUsers — rewrite `current_user_followee_follow_count`

Signed-in `GetUsers` was **700× slower than unsigned** (2-3 s vs 4 ms
for 20 users). Profiling each personalization subquery in isolation:

| Subquery (myId=20, 20 users) | Mean |
|---|---|
| `does_current_user_follow` | 0.3 ms |
| `does_follow_current_user` | 2.2 ms |
| `does_current_user_subscribe` | 2.2 ms |
| `artist_coin_badge` | 0.8 ms |
| **`current_user_followee_follow_count`** | **2,246 ms** ← entire delta |

The old shape let Postgres pick a Merge Join that walked the **full
follower list of the target user** — 492k-1.9M rows for popular
users like @audius — just to intersect with my ~1,752 followees.

The rewrite drives the loop from "my followees" (always small — at most
a few thousand) and probes whether each follows the target. The `LIMIT 1
OFFSET 0` inside the `EXISTS` is the same optimization fence used by
#798 (feed): it pins the planner to nested-loop semantics so the plan
never flips back to merge join.
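
A minimal sketch of that shape, with illustrative parameter and column names rather than the exact subquery from the diff:

```sql
-- Count how many of my followees follow the target user, driving the loop
-- from the small side ("my followees") and probing the other side per row.
select count(*)
from follows mine
where mine.follower_user_id = @my_id            -- my followees: at most a few thousand rows
  and mine.is_delete = false
  and exists (
      select 1
      from follows f
      where f.follower_user_id = mine.followee_user_id
        and f.followee_user_id = @target_user_id
        and f.is_delete = false
      limit 1 offset 0                          -- fence: keep this a nested-loop index probe
  );
```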

### Verified on prod read replica

Full `GetUsers`, 20 popular target users, three warm runs each:

| Scenario | Before | After | Δ |
|---|---|---|---|
| myId=0 (unsigned) | 4 ms | 2 ms | sanity / unchanged |
| myId=20 (1752 follows) | 2-3 s | **127-155 ms** | **~15-20×** |
| myId=755516 (1816 follows) | 2.5 s | **142-157 ms** | **~15-18×** |

End-to-end through local server:
`/v1/full/users/handle/audius?user_id=Wem1e` (target has 1.95 M
followers) → **60-85 ms warm**. Response shape unchanged;
`current_user_followee_follow_count` returns the same count as before.

## 2. GetTracks — partial index for `album_backlink`

The `album_backlink` subquery does ~200 random `playlists_pkey` lookups
per popular track to filter for `is_album = true AND is_delete = false
AND is_current = true`. **~99.98 % of those lookups get rejected by the
filter** — for 50 popular tracks that's 10,115 heap probes returning 1-2
actual matches.

The partial index covers only published-album playlists, so non-album
lookups skip the heap entirely — the planner sees no row at the index
level and moves on without fetching the page.

```sql
create index concurrently if not exists idx_playlists_albums_published
    on playlists (playlist_id)
    where is_album = true and is_delete = false and is_current = true;
```

- Size: ~55,671 album rows × ~12 bytes ≈ **700 KB**.
- Built `CONCURRENTLY` so no `ACCESS EXCLUSIVE` lock — follows the
pattern from #196 (the migration whose comment explains the prior 0195
outage).
- Expected: GetTracks `album_backlink` portion drops from ~38 ms (50
popular tracks, warm) to ~10-15 ms — most of GetTracks's "always-on"
cost.
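
One way to sanity-check that the planner actually prefers the partial index for an album lookup (a hypothetical spot check, not part of the migration):

```sql
explain (analyze, buffers)
select playlist_id
from playlists
where playlist_id = 12345            -- any known album id
  and is_album = true
  and is_delete = false
  and is_current = true;
-- Expect a scan on idx_playlists_albums_published rather than a
-- playlists_pkey lookup followed by a filter on the album flags.
```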

## Risk

- **GetUsers rewrite is semantically identical.** Same `count(*)` over
the same intersection, just with the join driven from the small side.
Existing user tests (TestV1UsersRelated, TestUsersFeed, etc.) pass; full
`./api/...` suite is green.
- **Partial index is additive.** No existing query plan can regress —
the planner picks it over `playlists_pkey` only when its `WHERE` clause
is satisfied (i.e. the lookup is for an album row).

## Test plan

- [x] `go test -count=1 ./api/...` (full suite, all green)
- [x] EXPLAIN ANALYZE on prod read replica across three myId regimes
(unsigned, mid follows, heavy follows)
- [x] Local server smoke test:
`/v1/full/users/handle/audius?user_id=Wem1e` returns identical response
shape, ~70 ms warm

🤖 Generated with [Claude Code](https://claude.com/claude-code)