timely-util: column_pager with policy + lz4 by antiguru · Pull Request #36552 · MaterializeInc/materialize

antiguru · 2026-05-14T15:41:18Z

Motivation

Stacked on #36391. Layers a typed Column<C> API over mz_ore::pager with an injected policy that decides, per call, whether to keep resident, page out raw, or page out lz4-compressed. Once the pager lands, this is the seam between in-memory columnar buffers and out-of-core storage envisioned in doc/developer/design/20260504_pager.md.

Description

Adds mz_ore::pager::pageout_with for explicit-backend dispatch, bypassing the global atomic so layered consumers can route per call without racing other writers.
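The difference between the global-atomic path and explicit dispatch can be sketched as follows. This is a minimal illustration, not the `mz_ore::pager` implementation: the `Backend` variants come from the PR text, but the `Handle` fields, the `Vec<Vec<u64>>` chunk type, and `set_default_backend` are assumptions made for the example.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Backend selector; the variants are named in the PR, the encoding is assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Swap,
    File,
}

/// Process-wide default, standing in for the pager's "global atomic".
static DEFAULT_BACKEND: AtomicU8 = AtomicU8::new(0);

/// Hypothetical setter: any thread may flip the default at any time.
pub fn set_default_backend(backend: Backend) {
    let tag = match backend {
        Backend::Swap => 0,
        Backend::File => 1,
    };
    DEFAULT_BACKEND.store(tag, Ordering::Relaxed);
}

fn default_backend() -> Backend {
    match DEFAULT_BACKEND.load(Ordering::Relaxed) {
        0 => Backend::Swap,
        _ => Backend::File,
    }
}

/// Illustrative handle that only records which backend owns the bytes.
#[derive(Debug)]
pub struct Handle {
    pub backend: Backend,
    pub body: Vec<u64>,
}

/// Explicit dispatch: the caller names the backend and the global is never
/// read, so a concurrent `set_default_backend` cannot change the routing.
pub fn pageout_with(backend: Backend, chunks: &mut Vec<Vec<u64>>) -> Handle {
    let body = chunks.drain(..).flatten().collect();
    Handle { backend, body }
}

/// Convenience path: reads the global once and forwards.
pub fn pageout(chunks: &mut Vec<Vec<u64>>) -> Handle {
    pageout_with(default_backend(), chunks)
}
```

A layered consumer that wants swap for one column and file for the next calls `pageout_with` twice with different arguments, instead of writing the global between calls.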

New mz_timely_util::column_pager module:

PagingPolicy trait with decide(PageHint) -> PageDecision (Skip / Page { backend, codec }) and record(PageEvent) for budget bookkeeping and metrics.
PagedColumn<C> with three variants: Resident(Column<C>, ResidentTicket), Paged { handle, meta }, and Compressed { inner, meta } (memory or pager-backed framed bytes).
ResidentTicket is a drop guard that fires PageEvent::ResidentReleased { bytes } whether the caller calls ColumnPager::take or drops the column without taking — so policy budgets don't leak.
ColumnPager::page drains via ContainerBytes; the Column::Align(Vec<u64>) uncompressed path moves the body Vec directly into the pager handle. Compressed path wraps the target in FrameEncoder so into_bytes streams straight through lz4 — no intermediate uncompressed Vec<u8>. Compressed File path pads to u64 alignment; the frame trailer self-delimits so no length prefix or unpad is needed on read.
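The policy seam in the list above might look roughly like this. The trait and enum names follow the PR text; the `bytes` field on `PageHint`, the `&self` receivers, the `Codec` variants, and the toy `SizeThreshold` policy are assumptions for illustration only.

```rust
use std::cell::Cell;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Swap,
    File,
}

/// Codec variant names are assumed; the PR only says "raw" or "lz4".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Codec {
    Raw,
    Lz4,
}

/// What the pager tells the policy about a column (field is an assumption).
#[derive(Clone, Copy, Debug)]
pub struct PageHint {
    pub bytes: usize,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PageDecision {
    Skip,
    Page { backend: Backend, codec: Codec },
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PageEvent {
    ResidentReleased { bytes: usize },
}

pub trait PagingPolicy {
    /// Decide, per call, whether to keep resident or page out (and how).
    fn decide(&self, hint: PageHint) -> PageDecision;
    /// Bookkeeping hook for budgets and metrics.
    fn record(&self, event: PageEvent);
}

/// Toy policy (hypothetical): small columns stay resident, medium ones go
/// raw to swap, large ones go lz4-compressed to file.
pub struct SizeThreshold {
    keep_below: usize,
    compress_from: usize,
    released: Cell<usize>,
}

impl SizeThreshold {
    pub fn new(keep_below: usize, compress_from: usize) -> Self {
        SizeThreshold { keep_below, compress_from, released: Cell::new(0) }
    }

    /// Total bytes reported through `ResidentReleased` so far.
    pub fn released(&self) -> usize {
        self.released.get()
    }
}

impl PagingPolicy for SizeThreshold {
    fn decide(&self, hint: PageHint) -> PageDecision {
        if hint.bytes < self.keep_below {
            PageDecision::Skip
        } else if hint.bytes < self.compress_from {
            PageDecision::Page { backend: Backend::Swap, codec: Codec::Raw }
        } else {
            PageDecision::Page { backend: Backend::File, codec: Codec::Lz4 }
        }
    }

    fn record(&self, event: PageEvent) {
        match event {
            PageEvent::ResidentReleased { bytes } => {
                self.released.set(self.released.get() + bytes);
            }
        }
    }
}
```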

Concrete TieredPolicy in column_pager::policy: each Timely worker draws from a fixed per-worker byte budget (kept in thread_local! state), spills to a process-wide shared pool when exhausted, and forces a pageout when both are full. Release returns budget to the shared pool first so other workers unblock sooner. Single-instance-per-process by construction (one shared LOCAL static); documented.
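A rough sketch of the two-tier accounting described above, under simplifying assumptions: the budget sizes are made up, each request is served entirely from one tier, and `admit` / `release` / `shared_free` are hypothetical names rather than the `TieredPolicy` API. It does reflect the stated design points: per-worker state in `thread_local!`, a process-wide pool behind one atomic consumed through a CAS loop, and release refilling the shared pool first.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical budget sizes, chosen only to keep the example small.
const PER_WORKER_BUDGET: usize = 1024;
const SHARED_BUDGET: usize = 2048;

/// Process-wide pool; a single static is why only one instance can exist.
static SHARED_FREE: AtomicUsize = AtomicUsize::new(SHARED_BUDGET);

thread_local! {
    /// Bytes still free in this worker's private budget.
    static LOCAL_FREE: Cell<usize> = Cell::new(PER_WORKER_BUDGET);
    /// Bytes this worker currently holds from the shared pool.
    static BORROWED: Cell<usize> = Cell::new(0);
}

#[derive(Debug, PartialEq, Eq)]
pub enum Admit {
    Resident,
    MustPage,
}

/// CAS loop: subtract `bytes` from the shared pool only if enough remains.
fn try_take_shared(bytes: usize) -> bool {
    let mut free = SHARED_FREE.load(Ordering::Relaxed);
    loop {
        if free < bytes {
            return false;
        }
        match SHARED_FREE.compare_exchange_weak(
            free,
            free - bytes,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return true,
            Err(actual) => free = actual,
        }
    }
}

/// Local budget first, shared pool second, otherwise force a pageout.
pub fn admit(bytes: usize) -> Admit {
    let local = LOCAL_FREE.with(|c| c.get());
    if local >= bytes {
        LOCAL_FREE.with(|c| c.set(local - bytes));
        return Admit::Resident;
    }
    if try_take_shared(bytes) {
        BORROWED.with(|c| c.set(c.get() + bytes));
        return Admit::Resident;
    }
    Admit::MustPage
}

/// Return budget to the shared pool first so other workers unblock sooner.
pub fn release(bytes: usize) {
    let to_shared = BORROWED.with(|c| {
        let n = c.get().min(bytes);
        c.set(c.get() - n);
        n
    });
    if to_shared > 0 {
        SHARED_FREE.fetch_add(to_shared, Ordering::AcqRel);
    }
    LOCAL_FREE.with(|c| c.set(c.get() + (bytes - to_shared)));
}

/// Inspection helper for the example.
pub fn shared_free() -> usize {
    SHARED_FREE.load(Ordering::Relaxed)
}
```

Only the fallback path touches the atomic; a worker that stays within its private budget never contends with others.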

Criterion bench (cargo bench -p mz-timely-util --bench column_pager) covers 4 KiB / 256 KiB / 4 MiB × Swap / File × raw / lz4 for both round-trip and operator-loop shapes. Swap-backend results are labelled swap-warm to flag that the bench never builds enough working set to force kernel eviction; a follow-up bench under systemd-run --user --scope -p MemoryMax=... will exercise the cold path.

Checklist

This PR has adequate test coverage / QA is not needed.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.


Reuse buffers across iterations via iter_custom so allocator cost is paid once at setup. Read one u64 per page after take to force the kernel to actually fault pages in (relevant under memory pressure). 2 MiB single-chunk plus scatter sweep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
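The "read one u64 per page" step might be written like this. It is a sketch under the assumption of 4 KiB pages; `touch_every_page` is a hypothetical name, and `black_box` is used so the compiler cannot discard the loads.

```rust
use std::hint::black_box;

/// Assumed page size; real code could query the system instead.
const PAGE_BYTES: usize = 4096;

/// Reads one word from every page so the kernel must fault each page in.
/// Returns the wrapping sum of the touched words to give the loads a use.
pub fn touch_every_page(words: &[u64]) -> u64 {
    let stride = PAGE_BYTES / std::mem::size_of::<u64>();
    let mut sum = 0u64;
    for word in words.iter().step_by(stride) {
        sum = sum.wrapping_add(black_box(*word));
    }
    sum
}
```

Reading a single word per page keeps the measured cost dominated by page faults rather than by memory bandwidth.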

Builds two chains of 2 MiB chunks then performs a merge pass that reads every cache line of both inputs and emits a new chain. Designed to be run under systemd-run with MemoryMax to simulate a working set that exceeds RAM, exposing real swap-eviction or disk-I/O cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
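A merge pass of this shape can be sketched as follows. The commit does not say what the merge computes, so this example assumes both chains hold sorted `u64` values and performs an ordinary sorted merge; `merge_chains` and the chunk length parameter are hypothetical.

```rust
/// Merges two chains of sorted chunks into a new chain of `chunk_len`-sized
/// chunks. Every element of both inputs is read exactly once.
pub fn merge_chains(a: &[Vec<u64>], b: &[Vec<u64>], chunk_len: usize) -> Vec<Vec<u64>> {
    assert!(chunk_len > 0, "chunk_len must be positive");
    let mut left = a.iter().flatten().copied().peekable();
    let mut right = b.iter().flatten().copied().peekable();
    let mut out = Vec::new();
    let mut current = Vec::with_capacity(chunk_len);
    loop {
        // Copy the heads out so no borrow of the iterators outlives this line.
        let l = left.peek().copied();
        let r = right.peek().copied();
        let next = match (l, r) {
            (Some(x), Some(y)) => {
                if x <= y {
                    left.next()
                } else {
                    right.next()
                }
            }
            (Some(_), None) => left.next(),
            (None, Some(_)) => right.next(),
            (None, None) => break,
        };
        if let Some(value) = next {
            current.push(value);
        }
        // Seal a full chunk and start the next one.
        if current.len() == chunk_len {
            let full = std::mem::replace(&mut current, Vec::with_capacity(chunk_len));
            out.push(full);
        }
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}
```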

Add `prefetch(&Handle)` and `prefetch_at(&Handle, offset, len)` to let callers overlap the next chunk's I/O with current chunk processing. The swap backend issues `MADV_WILLNEED`; the file backend opens the scratch file briefly and issues `posix_fadvise(POSIX_FADV_WILLNEED)`, both of which kick async kernel work and return promptly. The merge example now prefetches one chunk ahead. With a 32 GiB working set and 16 GiB cap on ext4, file-backend merge drops from 47.7 s to 45.2 s. Swap-backend merge is unchanged at ~141 s because under that much pressure the kernel is reclaim-bound, not stall-bound. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per project policy, raw `as` is disallowed in favor of mz_ore::cast::CastFrom, mz_ore::cast::CastLossy, or std::convert::TryFrom. The pager's pointer-arithmetic paths now use the stable `<*const T>::addr()` and `byte_add` methods to keep provenance, with `cast::<U>()` and `cast_mut()` replacing pointer-type `as` casts. FFI integer arguments now go through `try_from` and panic explicitly on overflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
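For readers unfamiliar with these methods, here is a small standard-library-only sketch of each replacement. The helper names are hypothetical and do not come from the pager; `addr()` needs a toolchain where the strict-provenance pointer APIs are stable.

```rust
use std::convert::TryFrom;

/// Offset of `ptr` within its page, using `addr()` instead of `ptr as usize`.
pub fn page_offset(ptr: *const u64, page_size: usize) -> usize {
    assert!(page_size.is_power_of_two());
    ptr.addr() & (page_size - 1)
}

/// Reads the word at `byte_offset`, using `byte_add` instead of casting to
/// an integer, adding, and casting back (which would lose provenance).
pub fn read_u64_at(words: &[u64], byte_offset: usize) -> u64 {
    let word = std::mem::size_of::<u64>();
    assert!(byte_offset % word == 0 && byte_offset < words.len() * word);
    let base: *const u64 = words.as_ptr();
    // SAFETY: the offset is word-aligned and in bounds, checked above.
    unsafe { base.byte_add(byte_offset).read() }
}

/// Pointer-type conversion with `cast::<U>()` instead of `as *mut u8`.
pub fn as_bytes_ptr(words: &mut [u64]) -> *mut u8 {
    words.as_mut_ptr().cast::<u8>()
}

/// Constness conversion with `cast_mut()` instead of `as *mut T`.
/// The caller must still have the right to mutate before writing through it.
pub fn to_mut(ptr: *const u8) -> *mut u8 {
    ptr.cast_mut()
}

/// Integer conversion for an FFI argument: panic loudly rather than truncate.
pub fn ffi_len(len: usize) -> i64 {
    i64::try_from(len).expect("length does not fit in i64")
}
```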

Each worker gets a 1/threads share of the total chain so working set stays constant across thread counts. Cap=16 GiB / total chain=32 GiB: file backend speeds up at 2 threads (64 -> 46 s, ~1.4x), regresses at 4, recovers at 8; swap backend halves wall at 4 threads (215 -> 127 s) because kernel reclaim overlaps with other workers' compute. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oughput and perf data

Adds a section that captures the swap-vs-file trade-off as actually observed: file saturates the disk (1.47 GiB/s on encrypted NVMe), swap floors at ~0.36 GiB/s regardless of cap or parallelism. perf stat plus /proc/vmstat deltas show swap loses ~7x sys-time vs file because every 4 KiB readback page-faults synchronously on the user thread (5.2M minor-faults vs 4K, 2.1M pswpin vs 2.2K).

Operational guidance: swap when resident, file when spilling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Without `required-features` cargo tries to build the example with the default feature set, where `#![cfg(feature = "pager")]` strips the entire file and leaves no `main`. Declare the feature requirement so the example is skipped when the feature is off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Measured ~5% improvement on the file path at depth 16, within run-to-run variance, and zero on the swap path. Not worth the API surface for v1. Kernel readahead handles the file path adequately; swap is reclaim-bound under pressure and prefetch can't help. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…less-of-parallelism claim

The earlier "swap caps at ~0.36 GiB/s regardless of cap or parallelism" headline holds only at low thread counts. On a 64 vCPU box with two striped local NVMes, swap-backend merge scales 13× from 1 → 64 threads and reaches ~75% of file-backend throughput, because enough independent direct-reclaim contexts run in parallel to keep the swap stripe nearly busy.

Reorganize the operational characteristics section into two benches, encrypted NVMe (1.4 GB/s ceiling) and r8gd.16xlarge with striped instance NVMe (~7 GB/s ceiling), and add file-backend (1 TiB / cap 256G) and swap-backend (128 GiB / cap 32G) thread-scaling tables for the second. Operational guidance now distinguishes low-thread (file wins ~3–5×) from high-thread (within ~25%) regimes and calls out the multi-tenant RSS argument as a separate reason to prefer file regardless of throughput.

Drop the dead --prefetch-depth 4 reference; that flag was removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `pageout_with(backend, chunks)` alongside `pageout`. Lets callers select the backend per call instead of going through the global atomic, so layered consumers (next commit's column-pager) can route swap and file pageouts independently without racing other writers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Bridges `mz_ore::pager` to typed `Column<C>` via `ContainerBytes`. Callers drain a column into a `PagedColumn<C>` and rehydrate on demand; backend and compression are decided per call by an injected `PagingPolicy`, not the pager's global atomic.

Three resting variants cover the matrix:

* `Resident(Column<C>)`: policy returned `Skip`.
* `Paged { handle, meta }`: raw u64-aligned bytes via `pager::Handle`.
* `Compressed { inner, meta }`: lz4 frame; bytes live either in memory or in a `pager::Handle` (padded to u64).

Fast paths:

* `Column::Align(Vec<u64>)` uncompressed: moves the body Vec into the handle, no copy on the swap backend.
* Compressed: `FrameEncoder` wraps the target so `into_bytes` streams serialized bytes straight through lz4 with no uncompressed staging.
* Compressed file: the frame trailer self-delimits, so no `compressed_len` field and no unpad on read.

Tests cover skip, swap/file × uncompressed/lz4 round trips, and the align-variant fast path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
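Why a self-delimiting frame needs neither a length field nor an unpad step can be shown with a toy framing. This is not lz4 and not the column-pager code: it only mimics the relevant property of the lz4 frame format, where size-prefixed blocks are followed by a zero end mark. All function names here are hypothetical.

```rust
use std::convert::TryFrom;

/// Toy frame: each block is prefixed by its little-endian u32 length, and a
/// zero length acts as the end mark. Blocks must be non-empty.
pub fn encode_frame(blocks: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for block in blocks {
        assert!(!block.is_empty(), "empty block would look like the end mark");
        let len = u32::try_from(block.len()).expect("block too large");
        out.extend_from_slice(&len.to_le_bytes());
        out.extend_from_slice(block);
    }
    out.extend_from_slice(&0u32.to_le_bytes());
    out
}

/// Packs bytes into u64 words, zero-padding the final word.
pub fn pad_to_words(bytes: &[u8]) -> Vec<u64> {
    bytes
        .chunks(8)
        .map(|chunk| {
            let mut word = [0u8; 8];
            word[..chunk.len()].copy_from_slice(chunk);
            u64::from_le_bytes(word)
        })
        .collect()
}

/// Flattens words back into bytes, padding included.
pub fn words_to_bytes(words: &[u64]) -> Vec<u8> {
    words.iter().flat_map(|w| w.to_le_bytes()).collect()
}

/// Decodes until the end mark and ignores whatever follows it, so trailing
/// padding is harmless. Panics on truncated input.
pub fn decode_frame(mut input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let (head, rest) = input.split_at(4);
        let len = u32::from_le_bytes([head[0], head[1], head[2], head[3]]);
        if len == 0 {
            return out;
        }
        let len = usize::try_from(len).expect("length fits in usize");
        let (block, rest) = rest.split_at(len);
        out.extend_from_slice(block);
        input = rest;
    }
}
```

The decoder stops at the end mark, so the reader never needs to know how many padding bytes were appended to reach u64 alignment.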

Adds `ResidentTicket`, a drop guard carried inside `PagedColumn::Resident` that fires a new `PageEvent::ResidentReleased { bytes }` when the resident column is consumed via `ColumnPager::take` or dropped without being taken. Lets policies track outstanding resident memory without leaking budget if a caller drops a column unexpectedly.

Introduces `TieredPolicy` in `column_pager::policy`. Each Timely worker draws from a fixed per-worker byte budget; once exhausted it falls back to a process-wide shared pool, and only when both are full does it page out via a configured backend and codec.

Per-worker state lives in a `thread_local!` static so worker threads see independent counters. This limits the design to one `TieredPolicy` per process, which is sufficient for the expected configuration, and the constraint is documented.

Release order returns budget to the shared pool first so other workers unblock sooner. The shared pool is a single `AtomicUsize` consumed via a CAS loop; only the cold fallback path touches it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
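The drop-guard idea can be illustrated in isolation. This sketch assumes a single-threaded `Rc<Cell<usize>>` counter in place of the real policy callback, and `Budget`, `Resident`, and `admit` are hypothetical names; only `ResidentTicket` and the release-on-take-or-drop behavior come from the commit message.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for the policy's count of outstanding resident bytes.
#[derive(Clone, Default)]
pub struct Budget {
    outstanding: Rc<Cell<usize>>,
}

/// Drop guard: gives the bytes back exactly once, however the owner goes away.
pub struct ResidentTicket {
    bytes: usize,
    budget: Budget,
}

impl Drop for ResidentTicket {
    fn drop(&mut self) {
        // Stands in for firing `PageEvent::ResidentReleased { bytes }`.
        let current = self.budget.outstanding.get();
        self.budget.outstanding.set(current - self.bytes);
    }
}

/// A resident column together with its ticket.
pub struct Resident<T> {
    column: Vec<T>,
    _ticket: ResidentTicket,
}

impl Budget {
    pub fn outstanding(&self) -> usize {
        self.outstanding.get()
    }

    /// Charges `bytes` against the budget and wraps the column.
    pub fn admit<T>(&self, column: Vec<T>, bytes: usize) -> Resident<T> {
        self.outstanding.set(self.outstanding.get() + bytes);
        let ticket = ResidentTicket { bytes, budget: self.clone() };
        Resident { column, _ticket: ticket }
    }
}

impl<T> Resident<T> {
    /// Consumes the wrapper; the ticket is dropped when this call returns.
    pub fn take(self) -> Vec<T> {
        self.column
    }
}
```

Because the release lives in `Drop`, neither the `take` path nor an early drop needs its own bookkeeping call, and the two cannot double-count.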

Measures round-trip (`page` + `take`) and operator-loop (`page` with column reuse) throughput across three axes: column size (4 KiB, 256 KiB, 4 MiB), pager backend (Swap, File), and codec (uncompressed, lz4). 24 cases total, throughput reported in bytes/sec via Criterion's `Throughput`.

Run with:

    cargo bench -p mz-timely-util --bench column_pager

The bench uses an `AlwaysPage` stub policy so every iteration exercises the paging path rather than the resident fast path. Smoke-tested at 4 KiB/swap/raw at ~8.6 GiB/s on a development laptop, which is close to the underlying pager's memcpy ceiling and confirms the column-pager layer adds no measurable overhead at that size.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
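The count of 24 is the product of two shapes, three sizes, two backends, and two codecs. A small sketch that enumerates the matrix and computes a bytes-per-second figure, with made-up case names and without Criterion itself:

```rust
use std::time::Duration;

/// Enumerates the bench matrix. The name format is hypothetical.
pub fn bench_cases() -> Vec<String> {
    let shapes = ["round_trip", "operator_loop"];
    let sizes = ["4KiB", "256KiB", "4MiB"];
    let backends = ["swap", "file"];
    let codecs = ["raw", "lz4"];
    let mut cases = Vec::new();
    for shape in shapes.iter() {
        for size in sizes.iter() {
            for backend in backends.iter() {
                for codec in codecs.iter() {
                    cases.push(format!("{}/{}/{}/{}", shape, size, backend, codec));
                }
            }
        }
    }
    cases
}

/// Bytes per second for `iters` iterations that each move `bytes` bytes.
pub fn throughput(bytes: u64, iters: u64, elapsed: Duration) -> f64 {
    (bytes * iters) as f64 / elapsed.as_secs_f64()
}
```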

The pager's swap backend keeps the body Vec resident and hints MADV_COLD; the kernel evicts only under memory pressure. The column_pager bench round-trips one column at a time and never builds enough working set to trigger eviction, so swap-backend numbers measure the in-memory fast path (Vec move + bookkeeping), not the cost of a page-in from disk. Relabel the axis as `swap-warm` to make the distinction visible in every measurement name, and add a module-level caveat explaining what the numbers do and don't represent. A follow-up `column_pager_pressure` bench under `systemd-run --user --scope -p MemoryMax=...` will exercise the real eviction path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

antiguru and others added 30 commits May 14, 2026 15:41

ore: add pager feature flag

386f849

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: skeleton mz_ore::pager module with Backend enum

fb9edde

ore: pager scratch dir lifecycle and stale-subdir reaper

c3d9e20

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager Handle type and inner storage scaffolding

fa1ea53

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager swap backend pageout with MADV_COLD

96f1fb3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager swap backend read_at_many

a5d992c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager swap backend take with zero-copy fast path

991a5e7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager public dispatch surface (pageout/read_at/take)

22c3cba

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager file backend pageout with pwritev

b0d744f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager file backend read_at_many with coalescing

e946d4f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager file backend take and drop reclaim

b4a90fa

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager cross-backend integration tests

d885439

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager Criterion bench harness

f69df0a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager clippy + lint cleanups (write_vectored, cast_from, exhaust…

dff9f9b

…ive matches) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager copyright headers and test-attribute lint compliance

d2bbde6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: update Cargo.lock for pager tempfile dev-dep

5520d23

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DAlperin mentioned this pull request May 19, 2026

Dov/column paged merge batcher #36627

Draft

antiguru wants to merge 30 commits into MaterializeInc:main from antiguru:moritz/column-pager
