ore: introduce mz_ore::pager with swap and file backends #36391

Draft
antiguru wants to merge 25 commits into MaterializeInc:main from antiguru:moritz/clu-65-pager

@antiguru antiguru commented May 4, 2026

Motivation

CLU-65: introduce an explicit pager so we can mark cold columnar data and pull it back on demand, choosing at runtime between an MADV_COLD-based swap backend and a file-backed scratch backend. Today Materialize spills to Linux swap and pays direct-reclaim latency on user threads. Design lives at doc/developer/design/20260504_pager.md.

Depends on CLU-64 (Column::Aligned becomes Vec<u64>), which makes columnar buffers a natural caller.

Description

mz_ore::pager exposes a small handle-oriented API (pageout, read_at_many, read_at, take) over a global atomic backend selector. The handle's backend is fixed at pageout time so live config flips do not invalidate existing data. Two backends:

  • Swap keeps the input Vec<u64> chains resident and calls MADV_COLD on the page-aligned subrange of each chunk. Reads copy with extend_from_slice across chunk boundaries; take is zero-copy when the handle holds a single chunk.
  • File writes one named scratch file per handle in a per-process subdirectory under the configured scratch root. No file descriptor is retained across calls — pageout opens, vector-writes via Write::write_vectored, and closes; reads reopen and pread. A reaper sweeps stale subdirectories from crashed predecessors at config time.
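
The file backend's open, vector-write, close, then reopen-and-pread cycle described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in this PR: `FileHandle`, `as_bytes`, and the signatures of `pageout` and `read_at` are hypothetical, the sketch is Unix-only, and words are stored in native byte order.

```rust
use std::fs::File;
use std::io::{self, IoSlice, Write};
use std::os::unix::fs::FileExt; // read_exact_at wraps pread(2) on Unix
use std::path::{Path, PathBuf};

/// Hypothetical handle: keeps only the path and length, never an open descriptor.
pub struct FileHandle {
    pub path: PathBuf,
    pub len_bytes: usize,
}

/// Views a `u64` slice as native-endian bytes.
fn as_bytes(chunk: &[u64]) -> &[u8] {
    // SAFETY: `u8` has alignment 1 and any initialized `u64` is valid as 8 bytes.
    unsafe { std::slice::from_raw_parts(chunk.as_ptr().cast::<u8>(), chunk.len() * 8) }
}

/// Opens the scratch file, vector-writes every chunk, and closes it on return.
pub fn pageout(path: &Path, chunks: &[Vec<u64>]) -> io::Result<FileHandle> {
    let bytes: Vec<&[u8]> = chunks.iter().map(|c| as_bytes(c)).collect();
    let total: usize = bytes.iter().map(|b| b.len()).sum();
    let mut file = File::create(path)?;
    let mut written = 0usize;
    while written < total {
        // `write_vectored` may be short, so rebuild the list past what is already written.
        let mut skip = written;
        let slices: Vec<IoSlice<'_>> = bytes
            .iter()
            .filter_map(|b| {
                if skip >= b.len() {
                    skip -= b.len();
                    None
                } else {
                    let s = IoSlice::new(&b[skip..]);
                    skip = 0;
                    Some(s)
                }
            })
            .collect();
        let n = file.write_vectored(&slices)?;
        if n == 0 {
            return Err(io::ErrorKind::WriteZero.into());
        }
        written += n;
    }
    Ok(FileHandle { path: path.to_path_buf(), len_bytes: total })
}

/// Reopens the file and reads `len_words` words starting at `offset_words`.
pub fn read_at(handle: &FileHandle, offset_words: u64, len_words: usize) -> io::Result<Vec<u64>> {
    let file = File::open(&handle.path)?;
    let mut buf = vec![0u8; len_words * 8];
    file.read_exact_at(&mut buf, offset_words * 8)?;
    let mut out = Vec::with_capacity(len_words);
    for b in buf.chunks_exact(8) {
        let mut word = [0u8; 8];
        word.copy_from_slice(b);
        out.push(u64::from_ne_bytes(word));
    }
    Ok(out)
}
```

Because no descriptor outlives a call, the number of live handles in this sketch is not bounded by the process's file-descriptor limit; the cost is an extra open and close per operation.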

Documented as doc/developer/design/20260504_pager.md.

Verification

Unit + integration tests cover round-trip on both backends, scatter/gather correctness, drop-without-take reclaim, the swap fast path's pointer identity, and backend flip mid-process.
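
The backend-flip case can be illustrated with a global atomic selector whose value is captured in the handle at pageout time. The sketch below is an assumption about the general shape only: `Backend`, `set_backend`, `current_backend`, and the in-memory `Handle` are hypothetical names, and the real handle would carry backend-specific state rather than a plain `Vec<u64>`.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical backend tag; the names are illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Swap,
    File,
}

/// Global selector, stored as a small integer so it fits in an atomic.
static BACKEND: AtomicU8 = AtomicU8::new(0);

pub fn set_backend(backend: Backend) {
    let tag = match backend {
        Backend::Swap => 0,
        Backend::File => 1,
    };
    BACKEND.store(tag, Ordering::Relaxed);
}

pub fn current_backend() -> Backend {
    match BACKEND.load(Ordering::Relaxed) {
        0 => Backend::Swap,
        _ => Backend::File,
    }
}

/// The handle records the backend that was active when it was created.
pub struct Handle {
    backend: Backend,
    data: Vec<u64>,
}

pub fn pageout(data: Vec<u64>) -> Handle {
    Handle { backend: current_backend(), data }
}

impl Handle {
    /// Later config flips do not change this value.
    pub fn backend(&self) -> Backend {
        self.backend
    }

    /// Consumes the handle and returns the data.
    pub fn take(self) -> Vec<u64> {
        self.data
    }
}
```

A handle created while the selector says `Swap` keeps reporting `Swap` after `set_backend(Backend::File)`, so reads and `take` can dispatch on the handle rather than on the global.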

Two micro-benches plus a merge-batcher example (cargo run --release --features pager --example pager_merge) for sizing the swap-vs-file trade-off under real memory pressure. Sanity numbers on a 32 GiB working set with 16 GiB cap, ext4 scratch: file finishes the merge pass in ~67 s vs swap's ~210 s, dominated by swap-in TLB shootdowns on the swap path.

antiguru and others added 24 commits May 4, 2026 15:46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ive matches)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reuse buffers across iterations via iter_custom so allocator cost is
paid once at setup. Read one u64 per page after take to force the
kernel to actually fault pages in (relevant under memory pressure).
2 MiB single-chunk plus scatter sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
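
The "read one u64 per page" step from the bench commit above can be sketched as follows. The 4 KiB page size and the name `touch_pages` are assumptions for illustration; the extra read of the last word is an addition of this sketch so that a buffer that does not start on a page boundary still has its final page touched.

```rust
/// Assumed page size; real code would query the system instead.
const PAGE_BYTES: usize = 4096;
const WORDS_PER_PAGE: usize = PAGE_BYTES / 8;

/// Reads one word per page so the kernel has to fault every page back in.
/// Returns a checksum so the loads have an observable result.
pub fn touch_pages(data: &[u64]) -> u64 {
    let mut acc = 0u64;
    for word in data.iter().step_by(WORDS_PER_PAGE) {
        // SAFETY: `word` is a valid, aligned reference into `data`.
        // The volatile read keeps the compiler from eliding the load.
        acc = acc.wrapping_add(unsafe { std::ptr::read_volatile(word) });
    }
    if let Some(last) = data.last() {
        // Covers the trailing partial page of an unaligned buffer.
        // SAFETY: `last` is a valid, aligned reference into `data`.
        acc = acc.wrapping_add(unsafe { std::ptr::read_volatile(last) });
    }
    acc
}
```
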
Builds two chains of 2 MiB chunks then performs a merge pass that
reads every cache line of both inputs and emits a new chain. Designed
to be run under systemd-run with MemoryMax to simulate a working set
that exceeds RAM, exposing real swap-eviction or disk-I/O cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
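
A merge pass over two chains of chunks, as in the example commit above, might look like the sketch below. It assumes both input chains are sorted and stands in for the example's real record type with plain `u64` words; `merge_chains` and `chunk_words` are hypothetical names.

```rust
/// Merges two sorted chains into a new chain of chunks of at most `chunk_words` words.
/// Every word of both inputs is read exactly once.
pub fn merge_chains(a: &[Vec<u64>], b: &[Vec<u64>], chunk_words: usize) -> Vec<Vec<u64>> {
    assert!(chunk_words > 0, "chunk size must be non-zero");
    let mut left = a.iter().flatten().copied().peekable();
    let mut right = b.iter().flatten().copied().peekable();
    let mut out: Vec<Vec<u64>> = Vec::new();
    let mut cur: Vec<u64> = Vec::with_capacity(chunk_words);
    loop {
        let l = left.peek().copied();
        let r = right.peek().copied();
        let next = match (l, r) {
            // Ties go to the left input, which keeps the merge stable.
            (Some(x), Some(y)) => {
                if x <= y {
                    left.next()
                } else {
                    right.next()
                }
            }
            (Some(_), None) => left.next(),
            (None, Some(_)) => right.next(),
            (None, None) => break,
        };
        cur.push(next.expect("value was just peeked"));
        if cur.len() == chunk_words {
            // Seal the full chunk and start a fresh one.
            out.push(std::mem::replace(&mut cur, Vec::with_capacity(chunk_words)));
        }
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}
```
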
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `prefetch(&Handle)` and `prefetch_at(&Handle, offset, len)` to let
callers overlap the next chunk's I/O with current chunk processing.
The swap backend issues `MADV_WILLNEED`; the file backend opens the
scratch file briefly and issues `posix_fadvise(POSIX_FADV_WILLNEED)`,
both of which kick async kernel work and return promptly.

The merge example now prefetches one chunk ahead. With a 32 GiB working
set and 16 GiB cap on ext4, file-backend merge drops from 47.7 s to
45.2 s. Swap-backend merge is unchanged at ~141 s because under that
much pressure the kernel is reclaim-bound, not stall-bound.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per project policy, raw `as` is disallowed in favor of mz_ore::cast::CastFrom,
mz_ore::cast::CastLossy, or std::convert::TryFrom. The pager's pointer-arithmetic
paths now use stable `*const T::addr()` and `byte_add` to keep provenance, with
`cast::<U>()` and `cast_mut()` replacing pointer-type `as` casts. FFI integer
arguments now go through `try_from` with explicit panics on overflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
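
As an illustration of the cast-free pointer arithmetic described in the commit above, the sketch below computes the largest page-aligned subrange of a buffer using `addr()` and `byte_add` instead of `as`. The 4 KiB page size and the function names are assumptions, and `addr()` requires a toolchain on which the strict-provenance APIs are stable (Rust 1.84 or later).

```rust
/// Assumed page size; real code would query the system instead.
const PAGE_BYTES: usize = 4096;

/// Returns (byte offset from the start, byte length) of the largest page-aligned
/// subrange inside `[start_addr, start_addr + len_bytes)`, or `None` if the
/// range contains no complete page.
pub fn aligned_subrange(start_addr: usize, len_bytes: usize) -> Option<(usize, usize)> {
    let end = start_addr.checked_add(len_bytes)?;
    // Round the start up and the end down to page boundaries.
    let first = start_addr.checked_add(PAGE_BYTES - 1)? / PAGE_BYTES * PAGE_BYTES;
    let last = end / PAGE_BYTES * PAGE_BYTES;
    if first < last {
        Some((first - start_addr, last - first))
    } else {
        None
    }
}

/// Applies the computation to a buffer while keeping pointer provenance.
pub fn aligned_region(data: &[u64]) -> Option<(*const u8, usize)> {
    let base = data.as_ptr().cast::<u8>();
    let (offset, len) = aligned_subrange(base.addr(), std::mem::size_of_val(data))?;
    // SAFETY: `offset + len` does not exceed the byte length of `data`.
    Some((unsafe { base.byte_add(offset) }, len))
}
```

The returned pointer and length are what a caller would hand to an `madvise`-style call; that FFI call is left out here because it needs a crate outside the standard library.
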
Each worker gets a 1/threads share of the total chain so working set
stays constant across thread counts. Cap=16 GiB / total chain=32 GiB:
file backend speeds up at 2 threads (64 -> 46 s, ~1.4x), regresses at
4, recovers at 8; swap backend halves wall at 4 threads (215 -> 127 s)
because kernel reclaim overlaps with other workers' compute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
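
The 1/threads split from the commit above can be sketched with scoped threads. Here each worker only sums its share as a stand-in for the real merge work; `per_worker_sums` is a hypothetical name, and the sketch assumes the chain is split by whole chunks.

```rust
use std::thread;

/// Splits `chain` into at most `threads` contiguous shares and processes each on
/// its own worker, so the total working set is the same for any thread count.
/// Returns one result per non-empty share.
pub fn per_worker_sums(chain: &[Vec<u64>], threads: usize) -> Vec<u64> {
    assert!(threads > 0, "need at least one worker");
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let share = std::cmp::max(1, (chain.len() + threads - 1) / threads);
    thread::scope(|s| {
        let workers: Vec<_> = chain
            .chunks(share)
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .flatten()
                        .fold(0u64, |acc, &w| acc.wrapping_add(w))
                })
            })
            .collect();
        workers
            .into_iter()
            .map(|w| w.join().expect("worker panicked"))
            .collect()
    })
}
```
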
…oughput and perf data

Adds a section that captures the swap-vs-file trade-off as actually
observed: file saturates the disk (1.47 GiB/s on encrypted NVMe), swap
floors at ~0.36 GiB/s regardless of cap or parallelism. perf stat plus
/proc/vmstat deltas show swap spends ~7x more sys-time than file because every
4 KiB readback page-faults synchronously on the user thread (5.2M
minor-faults vs 4K, 2.1M pswpin vs 2.2K). Operational guidance: swap
when resident, file when spilling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without `required-features` cargo tries to build the example with the
default feature set, where `#![cfg(feature = "pager")]` strips the
entire file and leaves no `main`. Declare the feature requirement so
the example is skipped when the feature is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measured ~5% improvement on the file path at depth 16, within run-to-run
variance, and zero on the swap path. Not worth the API surface for v1.
Kernel readahead handles the file path adequately; swap is reclaim-bound
under pressure and prefetch can't help.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@antiguru antiguru requested a review from DAlperin May 5, 2026 15:09
…less-of-parallelism claim

The earlier "swap caps at ~0.36 GiB/s regardless of cap or parallelism"
headline holds only at low thread counts. On a 64 vCPU box with two striped
local NVMes, swap-backend merge scales 13× from 1 → 64 threads and reaches
~75% of file-backend throughput, because enough independent direct-reclaim
contexts run in parallel to keep the swap stripe nearly busy.

Reorganize the operational characteristics section into two benches —
encrypted NVMe (1.4 GB/s ceiling) and r8gd.16xlarge with striped instance
NVMe (~7 GB/s ceiling) — and add file-backend (1 TiB / cap 256G) and
swap-backend (128 GiB / cap 32G) thread-scaling tables for the second.
Operational guidance now distinguishes low-thread (file wins ~3–5×) from
high-thread (within ~25%) regimes and calls out the multi-tenant RSS
argument as a separate reason to prefer file regardless of throughput.

Drop the dead --prefetch-depth 4 reference; that flag was removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@antiguru antiguru marked this pull request as draft May 12, 2026 13:50