ore: introduce mz_ore::pager with swap and file backends by antiguru · Pull Request #36391 · MaterializeInc/materialize

antiguru · 2026-05-04T19:14:04Z

Motivation

CLU-65: introduce an explicit pager so we can mark cold columnar data and pull it back on demand, choosing at runtime between an MADV_COLD-based swap backend and a file-backed scratch backend. Today Materialize spills to Linux swap and pays direct-reclaim latency on user threads. Design lives at doc/developer/design/20260504_pager.md.

Depends on CLU-64 (Column::Aligned becomes Vec<u64>), which makes columnar buffers a natural caller.

Description

mz_ore::pager exposes a small handle-oriented API (pageout, read_at_many, read_at, take) over a global atomic backend selector. The handle's backend is fixed at pageout time so live config flips do not invalidate existing data. Two backends:

Swap keeps the input Vec<u64> chains resident and calls MADV_COLD on the page-aligned subrange. Reads copy out with extend_from_slice across chunk boundaries; take is zero-copy when the handle holds a single chunk.
File writes one named scratch file per handle in a per-process subdirectory under the configured scratch root. No file descriptor is retained across calls: pageout opens, vector-writes via Write::write_vectored, and closes; reads reopen and pread. A reaper sweeps stale subdirectories from crashed predecessors at config time.
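To make the file backend's open/write/close and reopen/pread cycle concrete, here is a minimal sketch using only the standard library on Unix. The function names `pageout_to_file` and `read_words_at` are hypothetical and are not the PR's API. The byte copy in `pageout_to_file` is a simplification, and a real implementation would likely reinterpret the u64 buffers in place.

```rust
use std::convert::{TryFrom, TryInto};
use std::fs::File;
use std::io::{self, IoSlice, Write};
use std::os::unix::fs::FileExt; // positional reads (pread) on Unix
use std::path::Path;

/// Hypothetical helper: writes all chunks to `path` with vectored writes and
/// closes the file on return. Returns the number of u64 words written.
fn pageout_to_file(path: &Path, chunks: &[Vec<u64>]) -> io::Result<usize> {
    // Simplification: serialize each chunk to native-endian bytes by copying.
    let bytes: Vec<Vec<u8>> = chunks
        .iter()
        .map(|c| c.iter().flat_map(|w| w.to_ne_bytes()).collect())
        .collect();
    let total: usize = bytes.iter().map(Vec::len).sum();

    let mut file = File::create(path)?;
    let mut written = 0;
    while written < total {
        // write_vectored may write fewer bytes than requested, so rebuild the
        // slice list starting at the first unwritten byte on every iteration.
        let mut skip = written;
        let mut slices = Vec::with_capacity(bytes.len());
        for b in &bytes {
            if skip >= b.len() {
                skip -= b.len();
                continue;
            }
            slices.push(IoSlice::new(&b[skip..]));
            skip = 0;
        }
        let n = file.write_vectored(&slices)?;
        if n == 0 {
            return Err(io::ErrorKind::WriteZero.into());
        }
        written += n;
    }
    // `file` is dropped when this function returns, closing the descriptor.
    Ok(total / 8)
}

/// Hypothetical helper: reopens `path` and reads `len` words starting at word
/// index `offset` with a positional read, so no descriptor outlives the call.
fn read_words_at(path: &Path, offset: usize, len: usize) -> io::Result<Vec<u64>> {
    let file = File::open(path)?;
    let mut buf = vec![0u8; len * 8];
    let byte_offset = u64::try_from(offset * 8).expect("offset fits in u64");
    file.read_exact_at(&mut buf, byte_offset)?;
    Ok(buf
        .chunks_exact(8)
        .map(|b| u64::from_ne_bytes(b.try_into().expect("8-byte chunk")))
        .collect())
}
```

Because neither function keeps the `File` alive past its return, the number of live handles does not count against the process's descriptor limit, at the cost of an open and close per call.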

Documented as doc/developer/design/20260504_pager.md.
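The interaction between the global selector and per-handle backends described above can be sketched as follows. This is a toy model under stated assumptions: the names `set_backend`, `current_backend`, and the `Handle` layout are illustrative rather than the PR's definitions, and the data stays in memory for both backends. The only point is that a handle records the backend once, at pageout time.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Swap,
    File,
}

// Global selector (assumed encoding: 0 = Swap, 1 = File).
static BACKEND: AtomicU8 = AtomicU8::new(0);

fn set_backend(backend: Backend) {
    let tag = match backend {
        Backend::Swap => 0,
        Backend::File => 1,
    };
    BACKEND.store(tag, Ordering::Relaxed);
}

fn current_backend() -> Backend {
    match BACKEND.load(Ordering::Relaxed) {
        0 => Backend::Swap,
        _ => Backend::File,
    }
}

/// Toy handle: remembers which backend was active when it was created.
struct Handle {
    backend: Backend,
    data: Vec<u64>, // stand-in for swap-resident chunks or a scratch file
}

impl Handle {
    fn backend(&self) -> Backend {
        self.backend
    }
}

/// Reads the selector exactly once; later flips do not affect this handle.
fn pageout(data: Vec<u64>) -> Handle {
    Handle { backend: current_backend(), data }
}

/// Consumes the handle and returns its data.
fn take(handle: Handle) -> Vec<u64> {
    handle.data
}
```

A handle created while the selector says Swap still reports Swap after the selector flips to File, which is the property that lets a live config change coexist with outstanding handles.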

Verification

Unit + integration tests cover round-trip on both backends, scatter/gather correctness, drop-without-take reclaim, the swap fast path's pointer identity, and backend flip mid-process.

Two micro-benches plus a merge-batcher example (cargo run --release --features pager --example pager_merge) for sizing the swap-vs-file trade-off under real memory pressure. Sanity numbers on a 32 GiB working set with 16 GiB cap, ext4 scratch: file finishes the merge pass in ~67 s vs swap's ~210 s, dominated by swap-in TLB shootdowns on the swap path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reuse buffers across iterations via iter_custom so allocator cost is paid once at setup. After take, read one u64 per page to force the kernel to actually fault the pages in (relevant under memory pressure). Covers a 2 MiB single chunk plus a scatter sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
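The "one u64 per page" read can be sketched as below. The 4 KiB page size and the name `touch_pages` are assumptions for illustration; the bench itself may query the page size or structure the loop differently.

```rust
use std::hint::black_box;

// Assumption: 4 KiB pages. Real code might query the page size at runtime.
const PAGE_BYTES: usize = 4096;
const WORDS_PER_PAGE: usize = PAGE_BYTES / std::mem::size_of::<u64>();

/// Reads one word per page-sized stride so the kernel has to make each
/// touched page resident. Returns the number of words read.
/// If `buf` is not page-aligned, the final partial page may be skipped.
fn touch_pages(buf: &[u64]) -> usize {
    let mut touched = 0;
    for word in buf.iter().step_by(WORDS_PER_PAGE) {
        // black_box keeps the compiler from optimizing the load away.
        black_box(*word);
        touched += 1;
    }
    touched
}
```

For a 2 MiB chunk (262,144 words) this performs 512 loads, which keeps the benchmark's own work small relative to the page-fault cost it is trying to expose.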

Builds two chains of 2 MiB chunks, then performs a merge pass that reads every cache line of both inputs and emits a new chain. Designed to be run under systemd-run with MemoryMax to simulate a working set that exceeds RAM, exposing real swap-eviction or disk-I/O cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
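As a rough illustration of such a merge pass, the sketch below merges two sorted chains of chunks into a new chain with bounded chunk size. It works on plain in-memory `Vec<Vec<u64>>` chains and does not involve the pager; `merge_chains` and its signature are hypothetical, not the example's actual code.

```rust
/// Merges two sorted chains into a new sorted chain whose chunks hold at
/// most `chunk_len` words each. Every input word is read exactly once.
fn merge_chains(a: &[Vec<u64>], b: &[Vec<u64>], chunk_len: usize) -> Vec<Vec<u64>> {
    assert!(chunk_len > 0, "chunk_len must be positive");
    let mut left = a.iter().flatten().copied().peekable();
    let mut right = b.iter().flatten().copied().peekable();
    let mut out = Vec::new();
    let mut current = Vec::with_capacity(chunk_len);
    loop {
        // Copy the heads out so the iterators can be advanced afterwards.
        let l = left.peek().copied();
        let r = right.peek().copied();
        let next = match (l, r) {
            (Some(x), Some(y)) => {
                if x <= y {
                    left.next()
                } else {
                    right.next()
                }
            }
            (Some(_), None) => left.next(),
            (None, Some(_)) => right.next(),
            (None, None) => None,
        };
        match next {
            Some(word) => current.push(word),
            None => break,
        }
        // Seal the output chunk once it is full and start a fresh one.
        if current.len() == chunk_len {
            out.push(std::mem::replace(&mut current, Vec::with_capacity(chunk_len)));
        }
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}
```

In a paged variant, each input chunk would be pulled back from the pager just before it is consumed, and each sealed output chunk would be a candidate for pageout.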

Add `prefetch(&Handle)` and `prefetch_at(&Handle, offset, len)` to let callers overlap the next chunk's I/O with processing of the current chunk. The swap backend issues `MADV_WILLNEED`; the file backend opens the scratch file briefly and issues `posix_fadvise(POSIX_FADV_WILLNEED)`. Both calls start asynchronous kernel work and return promptly.

The merge example now prefetches one chunk ahead. With a 32 GiB working set and a 16 GiB cap on ext4, file-backend merge drops from 47.7 s to 45.2 s. Swap-backend merge is unchanged at ~141 s because under that much pressure the kernel is reclaim-bound, not stall-bound.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per project policy, raw `as` is disallowed in favor of mz_ore::cast::CastFrom, mz_ore::cast::CastLossy, or std::convert::TryFrom. The pager's pointer-arithmetic paths now use the stable pointer methods `addr()` and `byte_add` to keep provenance, with `cast::<U>()` and `cast_mut()` replacing pointer-type `as` casts. FFI integer arguments now go through `try_from` and panic explicitly on overflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
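A minimal sketch of the `as`-free style, using only standard-library calls. The helper names are hypothetical, and `i64` stands in for an FFI offset type such as `off_t`. The mz_ore cast traits are not shown because their exact signatures are not given here.

```rust
use std::convert::TryFrom;

/// Converts a length to a signed FFI-style integer, panicking with an
/// explicit message on overflow instead of silently truncating with `as`.
fn ffi_offset(len: usize) -> i64 {
    i64::try_from(len).expect("length overflows i64")
}

/// Reads the word located `byte_offset` bytes into `words` using `byte_add`,
/// which offsets the pointer in bytes while preserving its provenance.
fn word_at_byte_offset(words: &[u64], byte_offset: usize) -> u64 {
    let word_bytes = std::mem::size_of::<u64>();
    assert!(byte_offset % word_bytes == 0, "offset must be word-aligned");
    assert!(byte_offset / word_bytes < words.len(), "offset out of bounds");
    let base: *const u64 = words.as_ptr();
    // SAFETY: the asserts above guarantee the resulting pointer is aligned
    // for u64 and points at an initialized element inside `words`.
    unsafe { base.byte_add(byte_offset).read() }
}
```

The `try_from` path turns an overflow into a visible panic at the call site, whereas `as` would wrap or truncate and pass a wrong value across the FFI boundary.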

Each worker gets a 1/threads share of the total chain so the working set stays constant across thread counts. With cap = 16 GiB and total chain = 32 GiB, the file backend speeds up at 2 threads (64 -> 46 s, ~1.4x), regresses at 4, and recovers at 8; the swap backend cuts wall time by ~40% at 4 threads (215 -> 127 s, ~1.7x) because kernel reclaim overlaps with other workers' compute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
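One way to split a fixed chain into per-worker shares is sketched below. `worker_shares` and `parallel_sum` are hypothetical names, and the summing workload is a stand-in for the merge pass. The total amount of data is independent of the thread count, which is the property the paragraph above relies on.

```rust
use std::ops::Range;
use std::thread;

/// Splits `total_chunks` into `threads` contiguous ranges whose lengths
/// differ by at most one, so the total stays constant as `threads` changes.
fn worker_shares(total_chunks: usize, threads: usize) -> Vec<Range<usize>> {
    assert!(threads > 0, "need at least one worker");
    let base = total_chunks / threads;
    let extra = total_chunks % threads;
    let mut start = 0;
    (0..threads)
        .map(|i| {
            // The first `extra` workers take one additional chunk.
            let len = base + usize::from(i < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}

/// Stand-in workload: each scoped worker sums only its own share.
fn parallel_sum(chain: &[Vec<u64>], threads: usize) -> u64 {
    let shares = worker_shares(chain.len(), threads);
    thread::scope(|s| {
        let handles: Vec<_> = shares
            .into_iter()
            .map(|range| {
                let part = &chain[range];
                s.spawn(move || part.iter().flatten().sum::<u64>())
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}
```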

…oughput and perf data

Adds a section that captures the swap-vs-file trade-off as actually observed: file saturates the disk (1.47 GiB/s on encrypted NVMe), while swap floors at ~0.36 GiB/s regardless of cap or parallelism. perf stat plus /proc/vmstat deltas show swap spends ~7x the sys-time of file because every 4 KiB readback page-faults synchronously on the user thread (5.2M minor-faults vs 4K, 2.1M pswpin vs 2.2K). Operational guidance: swap when resident, file when spilling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Without `required-features`, cargo tries to build the example with the default feature set, where `#![cfg(feature = "pager")]` strips the entire file and leaves no `main`. Declare the feature requirement so the example is skipped when the feature is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prefetch measured a ~5% improvement on the file path at depth 16, within run-to-run variance, and zero on the swap path. Not worth the API surface for v1. Kernel readahead handles the file path adequately; swap is reclaim-bound under pressure and prefetch can't help.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…less-of-parallelism claim

The earlier "swap caps at ~0.36 GiB/s regardless of cap or parallelism" headline holds only at low thread counts. On a 64 vCPU box with two striped local NVMes, swap-backend merge scales 13× from 1 → 64 threads and reaches ~75% of file-backend throughput, because enough independent direct-reclaim contexts run in parallel to keep the swap stripe nearly busy.

Reorganize the operational characteristics section into two benches, encrypted NVMe (1.4 GB/s ceiling) and r8gd.16xlarge with striped instance NVMe (~7 GB/s ceiling), and add file-backend (1 TiB / cap 256G) and swap-backend (128 GiB / cap 32G) thread-scaling tables for the second. Operational guidance now distinguishes low-thread (file wins ~3–5×) from high-thread (within ~25%) regimes and calls out the multi-tenant RSS argument as a separate reason to prefer file regardless of throughput.

Drop the dead --prefetch-depth 4 reference; that flag was removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

antiguru and others added 24 commits May 4, 2026 15:46

ore: add pager feature flag

bd34e13

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: skeleton mz_ore::pager module with Backend enum

eeb7574

ore: pager scratch dir lifecycle and stale-subdir reaper

559e28a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager Handle type and inner storage scaffolding

6f90093

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager swap backend pageout with MADV_COLD

339e137

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
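madvise requires a page-aligned start address, so a swap-style pageout can only advise the whole pages that lie inside a buffer. The sketch below shows just that arithmetic, assuming 4 KiB pages. `page_aligned_subrange` is a hypothetical helper, and the actual madvise call is omitted because it needs an FFI binding.

```rust
// Assumption: 4 KiB pages. Real code would query the page size at runtime.
const PAGE: usize = 4096;

/// Given a buffer's start address and length in bytes, returns the address
/// and length of the largest page-aligned range fully inside the buffer,
/// or None if the buffer does not cover a whole page.
fn page_aligned_subrange(addr: usize, len: usize) -> Option<(usize, usize)> {
    let end = addr.checked_add(len)?;
    // Round the start up and the end down to page boundaries.
    let start = addr.checked_add(PAGE - 1)? / PAGE * PAGE;
    let stop = end / PAGE * PAGE;
    if start < stop {
        Some((start, stop - start))
    } else {
        None
    }
}
```

Rounding inward means the partial pages at either end of a chunk are never advised, which avoids hinting memory that belongs to a neighboring allocation.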

ore: pager swap backend read_at_many

b80cc65

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager swap backend take with zero-copy fast path

850217f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager public dispatch surface (pageout/read_at/take)

0eb5715

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager file backend pageout with pwritev

4158e06

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager file backend read_at_many with coalescing

d268554

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager file backend take and drop reclaim

20bc8d3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager cross-backend integration tests

8691ce2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager Criterion bench harness

ce5d983

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager clippy + lint cleanups (write_vectored, cast_from, exhaust…

d55a73c

…ive matches) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager copyright headers and test-attribute lint compliance

e24f2d7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: update Cargo.lock for pager tempfile dev-dep

2cc7344

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

antiguru requested a review from DAlperin May 5, 2026 15:09

antiguru marked this pull request as draft May 12, 2026 13:50

antiguru mentioned this pull request May 14, 2026

timely-util: column_pager with policy + lz4 #36552

Draft

5 tasks

DAlperin mentioned this pull request May 19, 2026

Dov/column paged merge batcher #36627

Draft

antiguru wants to merge 25 commits into MaterializeInc:main from antiguru:moritz/clu-65-pager
