timely-util: column_pager with policy + lz4 #36552

Draft
antiguru wants to merge 30 commits into MaterializeInc:main from antiguru:moritz/column-pager
@antiguru
Member

Motivation

Stacked on #36391. Layers a typed Column<C> API over mz_ore::pager with an injected policy that decides, per call, whether to keep a column resident, page it out raw, or page it out lz4-compressed. Once the pager lands, this is the seam between in-memory columnar buffers and the out-of-core storage envisioned in doc/developer/design/20260504_pager.md.

Description

Adds mz_ore::pager::pageout_with for explicit-backend dispatch, bypassing the global atomic so layered consumers can route per call without racing other writers.

New mz_timely_util::column_pager module:

  • PagingPolicy trait with decide(PageHint) -> PageDecision (Skip / Page { backend, codec }) and record(PageEvent) for budget bookkeeping and metrics.
  • PagedColumn<C> with three variants: Resident(Column<C>, ResidentTicket), Paged { handle, meta }, and Compressed { inner, meta } (memory or pager-backed framed bytes).
  • ResidentTicket is a drop guard that fires PageEvent::ResidentReleased { bytes } whether the caller calls ColumnPager::take or drops the column without taking it, so policy budgets don't leak.
  • ColumnPager::page drains via ContainerBytes. On the uncompressed path, Column::Align(Vec<u64>) moves the body Vec directly into the pager handle. The compressed path wraps the target in FrameEncoder so into_bytes streams straight through lz4, with no intermediate uncompressed Vec<u8>. The compressed File path pads to u64 alignment; the frame trailer self-delimits, so no length prefix or unpad is needed on read.
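
As a rough illustration of the policy seam described above, the sketch below models the decide/record pair. Only the names PagingPolicy, PageHint, PageDecision, PageEvent, and the Skip / Page { backend, codec } / ResidentReleased { bytes } shapes come from this description. The field of PageHint, the Backend and Codec enums, the `&self` receivers, and the SizeThreshold policy are assumptions made for the example, not the PR's actual definitions.

```rust
use std::cell::Cell;

// Assumed enums: the PR text only names the `backend` and `codec` fields.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend { Swap, File }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Codec { Raw, Lz4 }

/// What the pager tells the policy about a column (the field is an assumption).
#[derive(Clone, Copy, Debug)]
pub struct PageHint { pub bytes: usize }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PageDecision {
    /// Keep the column resident.
    Skip,
    /// Page it out through `backend`, encoded with `codec`.
    Page { backend: Backend, codec: Codec },
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PageEvent {
    ResidentReleased { bytes: usize },
}

pub trait PagingPolicy {
    fn decide(&self, hint: PageHint) -> PageDecision;
    fn record(&self, event: PageEvent);
}

/// Hypothetical toy policy: small columns stay resident, large ones go to
/// the file backend with lz4.
pub struct SizeThreshold {
    threshold: usize,
    resident: Cell<usize>,
}

impl SizeThreshold {
    pub fn new(threshold: usize) -> Self {
        Self { threshold, resident: Cell::new(0) }
    }

    pub fn resident_bytes(&self) -> usize {
        self.resident.get()
    }
}

impl PagingPolicy for SizeThreshold {
    fn decide(&self, hint: PageHint) -> PageDecision {
        if hint.bytes < self.threshold {
            // Charge the resident budget; `record` gives it back later.
            self.resident.set(self.resident.get() + hint.bytes);
            PageDecision::Skip
        } else {
            PageDecision::Page { backend: Backend::File, codec: Codec::Lz4 }
        }
    }

    fn record(&self, event: PageEvent) {
        match event {
            PageEvent::ResidentReleased { bytes } => {
                self.resident.set(self.resident.get().saturating_sub(bytes));
            }
        }
    }
}
```

Because the decision is returned per call, two consecutive columns can take different paths without touching any global state.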

Concrete TieredPolicy in column_pager::policy: each Timely worker draws from a fixed per-worker byte budget (kept in thread_local! state), spills to a process-wide shared pool when that budget is exhausted, and forces a pageout when both are full. Release returns budget to the shared pool first so other workers unblock sooner. Only one TieredPolicy can exist per process by construction (one shared LOCAL static); this constraint is documented.

Criterion bench (cargo bench -p mz-timely-util --bench column_pager) covers 4 KiB / 256 KiB / 4 MiB × Swap / File × raw / lz4 for both round-trip and operator-loop shapes. Swap-backend results are labelled swap-warm to flag that the bench never builds enough working set to force kernel eviction; a follow-up bench under systemd-run --user --scope -p MemoryMax=... will exercise the cold path.

Checklist

  • This PR has adequate test coverage / QA is not needed.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

antiguru and others added 30 commits May 14, 2026 15:41
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ive matches)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reuse buffers across iterations via iter_custom so allocator cost is
paid once at setup. Read one u64 per page after take to force the
kernel to actually fault pages in (relevant under memory pressure).
2 MiB single-chunk plus scatter sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds two chains of 2 MiB chunks then performs a merge pass that
reads every cache line of both inputs and emits a new chain. Designed
to be run under systemd-run with MemoryMax to simulate a working set
that exceeds RAM, exposing real swap-eviction or disk-I/O cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `prefetch(&Handle)` and `prefetch_at(&Handle, offset, len)` to let
callers overlap the next chunk's I/O with current chunk processing.
The swap backend issues `MADV_WILLNEED`; the file backend opens the
scratch file briefly and issues `posix_fadvise(POSIX_FADV_WILLNEED)`,
both of which kick async kernel work and return promptly.

The merge example now prefetches one chunk ahead. With a 32 GiB working
set and 16 GiB cap on ext4, file-backend merge drops from 47.7 s to
45.2 s. Swap-backend merge is unchanged at ~141 s because under that
much pressure the kernel is reclaim-bound, not stall-bound.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per project policy, raw `as` is disallowed in favor of mz_ore::cast::CastFrom,
mz_ore::cast::CastLossy, or std::convert::TryFrom. The pager's pointer-arithmetic
paths now use stable `*const T::addr()` and `byte_add` to keep provenance, with
`cast::<U>()` and `cast_mut()` replacing pointer-type `as` casts. FFI integer
arguments now go through `try_from` with explicit panics on overflow.
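
A small sketch of the cast replacements this commit describes, assuming a recent toolchain where the pointer `addr` method is stable. The function names are hypothetical and the pager's actual call sites are not shown here.

```rust
/// Offset of `ptr` within a block of `block_size` bytes, without `as usize`.
pub fn offset_in_block(ptr: *const u64, block_size: usize) -> usize {
    // `addr()` yields only the address; provenance stays with `ptr`.
    ptr.addr() % block_size
}

/// Byte pointer to word `index` of `words`, without pointer-type `as` casts.
pub fn word_bytes(words: &[u64], index: usize) -> *const u8 {
    assert!(index < words.len(), "index out of bounds");
    let base: *const u64 = words.as_ptr();
    // SAFETY: `index` is in bounds, so the byte offset stays inside the
    // allocation backing `words`.
    let word = unsafe { base.byte_add(index * std::mem::size_of::<u64>()) };
    // `cast` changes the pointee type and keeps provenance.
    word.cast::<u8>()
}

/// Length converted for an FFI call that expects a signed 64-bit integer.
pub fn ffi_len(len: usize) -> i64 {
    // Explicit panic on overflow instead of a silently wrapping `as` cast.
    i64::try_from(len).expect("length does not fit in i64")
}
```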

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each worker gets a 1/threads share of the total chain so working set
stays constant across thread counts. Cap=16 GiB / total chain=32 GiB:
file backend speeds up at 2 threads (64 -> 46 s, ~1.4x), regresses at
4, recovers at 8; swap backend cuts wall time ~1.7x at 4 threads (215 -> 127 s)
because kernel reclaim overlaps with other workers' compute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oughput and perf data

Adds a section that captures the swap-vs-file trade-off as actually
observed: file saturates the disk (1.47 GiB/s on encrypted NVMe), swap
floors at ~0.36 GiB/s regardless of cap or parallelism. perf stat plus
/proc/vmstat deltas show swap spends ~7x the sys-time of file because every
4 KiB readback page-faults synchronously on the user thread (5.2M
minor-faults vs 4K, 2.1M pswpin vs 2.2K). Operational guidance: swap
when resident, file when spilling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without `required-features` cargo tries to build the example with the
default feature set, where `#![cfg(feature = "pager")]` strips the
entire file and leaves no `main`. Declare the feature requirement so
the example is skipped when the feature is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measured ~5% improvement on the file path at depth 16, within run-to-run
variance, and zero on the swap path. Not worth the API surface for v1.
Kernel readahead handles the file path adequately; swap is reclaim-bound
under pressure and prefetch can't help.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…less-of-parallelism claim

The earlier "swap caps at ~0.36 GiB/s regardless of cap or parallelism"
headline holds only at low thread counts. On a 64 vCPU box with two striped
local NVMes, swap-backend merge scales 13× from 1 → 64 threads and reaches
~75% of file-backend throughput, because enough independent direct-reclaim
contexts run in parallel to keep the swap stripe nearly busy.

Reorganize the operational characteristics section into two benches —
encrypted NVMe (1.4 GB/s ceiling) and r8gd.16xlarge with striped instance
NVMe (~7 GB/s ceiling) — and add file-backend (1 TiB / cap 256G) and
swap-backend (128 GiB / cap 32G) thread-scaling tables for the second.
Operational guidance now distinguishes low-thread (file wins ~3–5×) from
high-thread (within ~25%) regimes and calls out the multi-tenant RSS
argument as a separate reason to prefer file regardless of throughput.

Drop the dead --prefetch-depth 4 reference; that flag was removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `pageout_with(backend, chunks)` alongside `pageout`. Lets callers
select the backend per call instead of going through the global atomic,
so layered consumers (next commit's column-pager) can route swap and
file pageouts independently without racing other writers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bridges `mz_ore::pager` to typed `Column<C>` via `ContainerBytes`.
Callers drain a column into a `PagedColumn<C>` and rehydrate on demand;
backend and compression are decided per call by an injected
`PagingPolicy`, not the pager's global atomic.

Three resting variants cover the matrix:

* `Resident(Column<C>)` — policy returned `Skip`.
* `Paged { handle, meta }` — raw u64-aligned bytes via `pager::Handle`.
* `Compressed { inner, meta }` — lz4 frame; bytes live either in memory
  or in a `pager::Handle` (padded to u64).

Fast paths:

* `Column::Align(Vec<u64>)` uncompressed — moves the body Vec into the
  handle, no copy on the swap backend.
* Compressed — `FrameEncoder` wraps the target so `into_bytes` streams
  serialized bytes straight through lz4 with no uncompressed staging.
* Compressed file — the frame trailer self-delimits, so no
  `compressed_len` field and no unpad on read.
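
A minimal sketch of the u64 padding step on the compressed file path, assuming the handle stores whole u64 words. The helper names `pad_to_u64_words` and `words_to_bytes` are hypothetical. Per the description above, the frame trailer tells the decoder where the payload ends, so the trailing zero bytes added here never need to be trimmed; this sketch shows only the padding itself, not the lz4 framing.

```rust
/// Copy `bytes` into u64 words, zero-padding the final word if needed.
pub fn pad_to_u64_words(bytes: &[u8]) -> Vec<u64> {
    bytes
        .chunks(8)
        .map(|chunk| {
            let mut word = [0u8; 8];
            word[..chunk.len()].copy_from_slice(chunk);
            u64::from_ne_bytes(word)
        })
        .collect()
}

/// View the stored words as bytes again, padding included.
pub fn words_to_bytes(words: &[u64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(words.len() * 8);
    for word in words {
        out.extend_from_slice(&word.to_ne_bytes());
    }
    out
}
```

An 11-byte payload occupies two words, and reading it back yields 16 bytes whose first 11 match the input.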

Tests cover skip, swap/file × uncompressed/lz4 round trips, and the
align-variant fast path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `ResidentTicket`, a drop guard carried inside `PagedColumn::Resident`
that fires a new `PageEvent::ResidentReleased { bytes }` when the resident
column is consumed via `ColumnPager::take` or dropped without being
taken. Lets policies track outstanding resident memory without leaking
budget if a caller drops a column unexpectedly.
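
The drop-guard idea can be sketched as follows. This is an illustration under assumptions, not the PR's code: the `Rc<dyn PagingPolicy>` field, the `Resident` wrapper, `CountingPolicy`, and `admit` are all hypothetical, and the real `ResidentTicket` may hold its policy differently.

```rust
use std::cell::Cell;
use std::rc::Rc;

pub enum PageEvent {
    ResidentReleased { bytes: usize },
}

pub trait PagingPolicy {
    fn record(&self, event: PageEvent);
}

/// Drop guard: reports the released bytes exactly once, when it is dropped.
pub struct ResidentTicket {
    bytes: usize,
    policy: Rc<dyn PagingPolicy>,
}

impl Drop for ResidentTicket {
    fn drop(&mut self) {
        self.policy.record(PageEvent::ResidentReleased { bytes: self.bytes });
    }
}

/// Stand-in for the resident variant: the column travels with its ticket.
pub struct Resident<T> {
    column: T,
    ticket: ResidentTicket,
}

impl<T> Resident<T> {
    /// Hand the column back; the ticket drops at the end of this function.
    pub fn take(self) -> T {
        let Resident { column, ticket: _ticket } = self;
        column
    }
}

/// Hypothetical policy that only tracks outstanding resident bytes.
#[derive(Default)]
pub struct CountingPolicy {
    pub outstanding: Cell<usize>,
}

impl PagingPolicy for CountingPolicy {
    fn record(&self, event: PageEvent) {
        match event {
            PageEvent::ResidentReleased { bytes } => {
                self.outstanding.set(self.outstanding.get().saturating_sub(bytes));
            }
        }
    }
}

/// Charge `bytes` to the policy and wrap the column with a ticket.
pub fn admit<T>(policy: &Rc<CountingPolicy>, column: T, bytes: usize) -> Resident<T> {
    policy.outstanding.set(policy.outstanding.get() + bytes);
    let as_dyn: Rc<dyn PagingPolicy> = policy.clone();
    Resident { column, ticket: ResidentTicket { bytes, policy: as_dyn } }
}
```

Both exits go through the same `Drop` impl, so taking the column and dropping it unread return the same amount of budget.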

Introduces `TieredPolicy` in `column_pager::policy`. Each Timely worker
draws from a fixed per-worker byte budget; once exhausted it falls back
to a process-wide shared pool, and only when both are full does it page
out via a configured backend and codec. Per-worker state lives in a
`thread_local!` static so worker threads see independent counters. This
limits the design to one `TieredPolicy` per process — sufficient for the
expected configuration, and the constraint is documented.

Release order returns budget to the shared pool first so other workers
unblock sooner. The shared pool is a single `AtomicUsize` consumed via a
CAS loop; only the cold fallback path touches it.
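
The tiering and release order might be sketched like this. It is a simplified model under assumptions: the budget size, the `SharedPool` type, the `Admission` enum, and the free functions are hypothetical, and each admission here draws from exactly one tier, which the real `TieredPolicy` may not do.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Assumed per-worker budget, kept tiny so the numbers are easy to follow.
const PER_WORKER_BUDGET: usize = 1024;

thread_local! {
    /// Bytes still free in this worker's private budget.
    static LOCAL_REMAINING: Cell<usize> = Cell::new(PER_WORKER_BUDGET);
    /// Bytes this worker currently holds from the shared pool.
    static SHARED_BORROWED: Cell<usize> = Cell::new(0);
}

/// Process-wide fallback pool backed by a single atomic counter.
pub struct SharedPool {
    remaining: AtomicUsize,
}

impl SharedPool {
    pub fn new(bytes: usize) -> Self {
        Self { remaining: AtomicUsize::new(bytes) }
    }

    /// CAS loop: reserve `bytes` only if the pool still has that much.
    pub fn try_reserve(&self, bytes: usize) -> bool {
        let mut current = self.remaining.load(Ordering::Relaxed);
        loop {
            if current < bytes {
                return false;
            }
            match self.remaining.compare_exchange_weak(
                current,
                current - bytes,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual,
            }
        }
    }

    pub fn release(&self, bytes: usize) {
        self.remaining.fetch_add(bytes, Ordering::AcqRel);
    }

    pub fn remaining(&self) -> usize {
        self.remaining.load(Ordering::Acquire)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Admission { Local, Shared, MustPage }

/// Try the worker-local budget first, then the shared pool, else page out.
pub fn admit(pool: &SharedPool, bytes: usize) -> Admission {
    let took_local = LOCAL_REMAINING.with(|local| {
        if local.get() >= bytes {
            local.set(local.get() - bytes);
            true
        } else {
            false
        }
    });
    if took_local {
        return Admission::Local;
    }
    if pool.try_reserve(bytes) {
        SHARED_BORROWED.with(|b| b.set(b.get() + bytes));
        return Admission::Shared;
    }
    Admission::MustPage
}

/// Return budget to the shared pool first, then to the local budget.
pub fn release(pool: &SharedPool, bytes: usize) {
    let to_shared = SHARED_BORROWED.with(|b| {
        let n = bytes.min(b.get());
        b.set(b.get() - n);
        n
    });
    pool.release(to_shared);
    LOCAL_REMAINING.with(|local| local.set(local.get() + (bytes - to_shared)));
}

pub fn local_remaining() -> usize {
    LOCAL_REMAINING.with(|local| local.get())
}
```

The hot path touches only thread-local cells; the atomic is reached only after the local budget is exhausted, which matches the "cold fallback" framing above.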

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Measures round-trip (`page` + `take`) and operator-loop (`page` with
column reuse) throughput across three axes: column size (4 KiB, 256 KiB,
4 MiB), pager backend (Swap, File), and codec (uncompressed, lz4). That is 12
combinations per shape, 24 cases total; throughput is reported in bytes/sec via
Criterion's `Throughput`.

Run with:

    cargo bench -p mz-timely-util --bench column_pager

The bench uses an `AlwaysPage` stub policy so every iteration exercises
the paging path rather than the resident fast path. Smoke-tested at
4 KiB/swap/raw at ~8.6 GiB/s on a development laptop, which is close to
the underlying pager's memcpy ceiling and confirms the column-pager
layer adds no measurable overhead at that size.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The pager's swap backend keeps the body Vec resident and hints
MADV_COLD; the kernel evicts only under memory pressure. The
column_pager bench round-trips one column at a time and never builds
enough working set to trigger eviction, so swap-backend numbers measure
the in-memory fast path (Vec move + bookkeeping), not the cost of a
page-in from disk.

Relabel the axis as `swap-warm` to make the distinction visible in
every measurement name, and add a module-level caveat explaining what
the numbers do and don't represent. A follow-up `column_pager_pressure`
bench under `systemd-run --user --scope -p MemoryMax=...` will exercise
the real eviction path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>