Skip to content

compute: store correction v2 chunks in a columnar chunk region#36577

Merged
antiguru merged 11 commits into
mainfrom
claude/fix-stackwrapper-spilling-uLluU
Jun 4, 2026
Merged

compute: store correction v2 chunks in a columnar chunk region#36577
antiguru merged 11 commits into
mainfrom
claude/fix-stackwrapper-spilling-uLluU

Conversation

@antiguru
Copy link
Copy Markdown
Member

@antiguru antiguru commented May 17, 2026

Motivation

#30083 added a spillable MV correction buffer that uses StackWrapper/ColumnationStack for the chunk type. That storage has two issues for the spilling use case:

  1. The outer Vec<T> lives on the regular heap (not lgalloc-backed), so it does not spill.
  2. The columnation regions inside grow without a convenient way to pre-size them, so per-chunk memory is hard to predict.

This PR replaces the chunk storage with a columnar chunk region: each chunk becomes a single, fixed-size allocation, so the correction buffer's memory behavior is predictable and spillable.

Part of MaterializeInc/database-issues#8464.

Changes

Chunk storage. Each chunk now holds a ChunkData<D> enum with two variants:

  • Typed(ContainerOf<(D, Timestamp, Diff)>) — typed columnar container, used while a chunk is still being filled.
  • Align(Vec<u64>) — finished chunk encoded into a single aligned allocation.

This mirrors mz_timely_util::columnar::Column minus its Bytes variant (which wraps Arc<dyn Any> and is !Sync, breaking the + Send iterator returned by updates_before).

ChunkBuilder. New helper that mints chunks at the same ~2 MiB serialized-size boundary used by mz_timely_util::columnar::builder::ColumnBuilder (SHIP_WORDS = 1 << 18). When the in-progress container reaches that boundary, its bytes are encoded into a Vec<u64> of exactly the required capacity and emitted as a ChunkData::Align. Each minted chunk is therefore a single predictably-sized allocation, and remainder data is flushed as a ChunkData::Typed on finalize.

ChainBuilder. Wraps a ChunkBuilder and drains minted chunks into a Chain<D>. All the chain-producing paths (Stage::{insert, flush}, Cursor::{into_chain, try_unwrap}, CorrectionV2::{merge_2, merge_many}) go through it; Chain itself no longer owns a partial chunk or carries a chunk_capacity.

Data trait. Now requires Columnar (with Container: Send + Sync + Clone) instead of Columnation, plus for<'a> Borrow<Ref<'a>: Eq + Ord> so the merge/heap code can compare update refs directly without into_owned on the hot path. Both Row and DataflowErrorSer already implement Columnar; the latter gained #[columnar(derive(Eq, PartialEq, Ord, PartialOrd))] so its auto-generated Reference satisfies the new ref-level bound.

Lifetime helper for pop_equal. <C as Borrow>::Ref<'a> is invariant in 'a through trait projection, so the compiler can't shorten two refs with unrelated lifetimes by variance on its own. A small refs_eq helper explicitly reborrows both inputs to a fresh local lifetime via Columnar::reborrow before delegating to Eq.

Chain invariant. The chain_proportionality invariant now compares chains by update count rather than chunk count. Chunk count was a faithful proxy under columnation (fixed updates per chunk) but is not anymore: any chain below the ~2 MiB boundary is a single chunk regardless of update count, so the comparison would degenerate to 1-vs-1, collapse the geometric invariant, and turn inserts into O(N^2) up to the boundary. Comparing update counts restores the pre-refactor O(log N) amortization.

Notes for reviewers

  • compute_correction_v2_chunk_size changed meaning: it now only sizes the in-memory Stage staging buffer and the ship cadence. On-chain chunk byte size is fixed by the builder's ~2 MiB boundary (SHIP_WORDS) and no longer follows this knob.
  • Correction-size metrics (mz_persist_sink_correction_*) now report serialized bytes (indexed::length_in_bytes) instead of columnation heap bytes. Balance-based assertions (correction-buffer-metrics.td) are unaffected, but absolute byte values shift — heads-up for anyone dashboarding them.
  • Added chain_builder_roundtrips_across_mint_boundary, exercising the Align encode/decode spill path (crossing at least one mint()), which previously had no coverage.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc, or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR.
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM.

Comment thread src/compute/src/sink/correction_v2.rs Outdated
Comment thread src/compute/src/sink/correction_v2.rs Outdated
Comment thread src/compute/src/sink/correction_v2.rs Outdated
Comment thread src/compute/src/sink/correction_v2.rs Outdated
@antiguru antiguru marked this pull request as ready for review May 18, 2026 09:58
@antiguru antiguru requested a review from a team as a code owner May 18, 2026 09:58
@antiguru antiguru changed the title Migrate correction_v2 from columnation to columnar compute: store correction v2 chunks in a columnar chunk region May 18, 2026
Copy link
Copy Markdown
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The update count is double counted:

diff --git a/src/compute/src/sink/correction_v2.rs b/src/compute/src/sink/correction_v2.rs
index 23ef00ebf9..a243558737 100644
--- a/src/compute/src/sink/correction_v2.rs
+++ b/src/compute/src/sink/correction_v2.rs
@@ -1519,3 +1519,20 @@ impl<D: Data> Ord for MergeCursor<D> {
         (t1, d1).cmp(&(t2, d2)).reverse()
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use mz_repr::{Diff, Timestamp};
+
+    use super::ChainBuilder;
+
+    #[mz_ore::test]
+    fn chain_builder_update_count_matches_items() {
+        let mut builder = ChainBuilder::<i64>::default();
+        for i in 0..10_i64 {
+            builder.push_owned(&(i, Timestamp::new(i as u64), Diff::ONE));
+        }
+        let chain = builder.finish();
+        assert_eq!(chain.update_count, chain.iter().count());
+    }
+}

fails:

---- sink::correction_v2::tests::chain_builder_update_count_matches_items stdout ----

thread 'sink::correction_v2::tests::chain_builder_update_count_matches_items' (1838738) panicked at src/compute/src/sink/correction_v2.rs:1536:9:
assertion `left == right` failed
  left: 20

@antiguru antiguru force-pushed the claude/fix-stackwrapper-spilling-uLluU branch from e91476b to f9c0ff8 Compare May 29, 2026 02:04
@antiguru antiguru requested a review from DAlperin May 29, 2026 18:18
@antiguru
Copy link
Copy Markdown
Member Author

Review

Refactor replacing columnation/ColumnationStack chunk storage with the columnar crate. Chunks are now byte-bounded (~2 MiB, SHIP_WORDS = 1<<18) and spill into an aligned Vec<u64> (ChunkData::Align) via indexed::encode. Merge/heap paths compare through columnar Refs instead of cloning via into_owned. errors.rs gains #[columnar(derive(Eq, PartialEq, Ord, PartialOrd))] to satisfy the new Ref: Eq + Ord bound on Data.

Blocking

  • The one new test never exercises the code this PR exists to fix. chain_builder_update_count_matches_items pushes 10 i64s, which stay a single ChunkData::Typed chunk — mint() only fires near the 2 MiB boundary, so the Align encode → from_bytes decode roundtrip (the actual spilling path, and the riskiest new code) gets zero coverage. Add a test that pushes enough updates to force at least one mint(), then asserts iter() roundtrips values/order/diffs correctly across the spill boundary. Without it the titular fix is untested.

Strong suggestions

  • CORRECTION_V2_CHUNK_SIZE quietly changed meaning. It now only sizes the staging Vec and the ship cadence; chain chunk byte-size is hardcoded to SHIP_WORDS. The knob no longer controls on-chain chunk size. If intended, note it in the PR body and check whether the config's documented effect is now stale.
  • Chain::len() semantics shifted. It returns chunk count, now byte-bounded rather than proportional to update_count. It feeds the chain-proportionality invariant in insert_inner (merge_needed). For sub-2 MiB chains, every chain is one Typed chunk regardless of update count, so the geometric invariant that bounds chain count / amortizes merge cost compares a weaker proxy than before. Confirm the amortization argument still holds, or compare on update_count.
  • Metrics report serialized bytes now. Chunk::get_size uses indexed::length_in_bytes / 8*len instead of columnation heap bytes. correction-buffer-metrics.td only checks balance, so it won't break — but the absolute mz_persist_sink_correction_* byte values shift. Heads-up if anyone dashboards them.

Nits

  • ChainBuilder::finish: if chunk.len() > 0 is redundant — pending only ever holds minted (non-empty) chunks. Harmless/defensive.
  • refs_eq and the ChunkData !Sync/Bytes doc comments are good — they explain the non-obvious lifetime-invariance and Send/Sync constraints. Keep.

Good

  • Dropping cached_size/invalidate_cached_size is a clean simplification — the cache existed because columnation heap_size chased pointers; the new length_in_bytes is O(columns), so caching is no longer worth it.
  • Removing into_owned clones from merge_2/merge_many/pop_equal is the right call for the hot path.
  • No as casts, no anyhow!, test uses #[mz_ore::test]. Single semantic change, single CODEOWNERS area.

Bottom line: solid refactor, but the spill path — the point of the PR — has no test. Add one minting-crossing test and I'd consider it mergeable.

🤖 Generated with Claude Code

@antiguru antiguru requested review from a team as code owners June 3, 2026 17:16
@antiguru antiguru requested review from Alphadelta14 and bobbyiliev and removed request for a team June 3, 2026 17:16
claude and others added 10 commits June 3, 2026 17:26
The MV sink's spillable correction buffer (PR #30083) stored each chunk in
a `StackWrapper`/`ColumnationStack`, which has two issues:

* the outer `Vec<T>` is heap-allocated (not lgalloc-backed), so it does
  not spill;
* the inner columnation regions over-allocate (no convenient way to
  pre-size the storage), so per-chunk memory is hard to predict.

Switch the chunk storage to a `ChunkData` enum (`Typed` + `Align`) backed
by the `columnar` crate, and add a `ChunkBuilder` that mints chunks at
the same ~2 MiB serialized-size boundary used by the merge-batcher's
`ColumnBuilder`. Each minted chunk is a single aligned `Vec<u64>`, so
chunks are fixed-size and per-chunk memory tracks usage closely.

The `Bytes` variant of `Column` is omitted because it wraps
`Arc<dyn Any>` and is `!Sync`, which would prevent `updates_before` from
returning a `+ Send` iterator.
- Drop the local TripleContainer/TripleBorrowed/UpdateRef aliases in favor
  of columnar's ContainerOf/BorrowedOf/Ref.
- Bound D::Container's Ref<'_> by Eq + Ord so the merge code (merge_2,
  merge_many's MergeCursor::cmp, Chain::can_accept) can compare refs
  directly instead of cloning through into_owned.
- Add #[columnar(derive(Eq, PartialEq, Ord, PartialOrd))] to
  DataflowErrorSer so its auto-generated Reference satisfies the new
  Ref-level Ord bound.
- MergeHeap::pop_equal still falls back to into_owned because the
  heap-top Ref (lifetime tied to self.0) and the parameter Ref have
  unrelated lifetimes that can't be unified through reborrow_ref; using
  into_owned both ends the immutable borrow before self.pop() and avoids
  the lifetime gymnastics. Comparison is guarded behind a time check so
  the clone only fires when timestamps match.
Use a `refs_eq` helper that explicitly reborrows both columnar refs to a
fresh local lifetime via `Columnar::reborrow` before comparing. The
projected `<C as Borrow>::Ref<'a>` is invariant in `'a`, so the compiler
won't shorten the inputs by variance on its own — the helper bridges that
gap and lets the `for<'a> Ref<'a>: Eq` bound on `Data` do the rest.
`ChainBuilder::push_{ref,owned}` were incrementing `chain.update_count`
by 1 per call, while `Chain::push_chunk` (invoked from `drain`/`finish`)
also adds `chunk.len()` when each minted chunk lands in the chain. Drop
the per-push increment and let `push_chunk` be the sole accounting
point. Add a regression test from the reviewer that asserts
`chain.update_count == chain.iter().count()`.
The `workflow_materialized_view_correction_pruning` test asserted that
the cumulative correction-buffer insertions metric exceeds 1000 after a
1000-row snapshot has been consolidated away. The metric is wired to
*net length deltas*, not raw record insertions, so the right floor is
`>= 1000`: the desired side grows the buffer by 1000 and the
persist-retraction side then shrinks it back to 0, contributing one
insertion-delta of 1000 and one deletion-delta of 1000 — never a brief
2000.

Before the columnar chunk-region refactor, each new chunk allocation
added a stray `+1` to `Chain::update_count` due to a double-count in
`Chain::push`, which inflated the cumulative insertions just enough for
`> 1000` to hold. The new chunk-builder counts accurately and lands at
exactly 1000, which is the correct value; tighten the test to that.
The correction v2 chain invariant kept each chain at least
`chain_proportionality` times as large as the next, comparing chains by
`Chain::len()` (chunk count). That was a faithful proxy under the old
columnation storage, where each chunk held a fixed number of updates.

With byte-bounded chunks, chunk count is no longer proportional to update
count: any chain below the ~2 MiB spill boundary is a single chunk
regardless of how many updates it holds. With the default 8 KiB
`chunk_size` and the 2 MiB boundary, that covers the common case, so the
comparison degenerated to 1-vs-1, the geometric invariant collapsed, and
every insert cascaded a full-buffer merge -- O(N^2) inserts up to the 2 MiB
boundary instead of O(log N) amortized.

Compare on `update_count` instead, which is what merge cost is actually
proportional to and which restores the pre-refactor amortization. Remove
the now-unused `Chain::len()`.

Also add a test that pushes enough updates to cross at least one `mint()`
boundary and asserts `iter()` roundtrips values, order, and diffs across
the `Align` encode/decode spill path, which previously had no coverage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`timely_bytes` 0.30 marked `Bytes` as `unsafe impl Send + Sync` (the
`Arc<dyn Any>` is only ever read by reference), so `Column<C>` no longer
needs the custom `ChunkData` enum that was carved out to dodge the
`!Sync` `Bytes` variant. Drop the wrapper and store the column directly;
the `ChunkBuilder` now delegates to the existing
`mz_timely_util::columnar::builder::ColumnBuilder`, which already
implements the ~2 MiB serialized-size minting logic this file was
duplicating.

Note in the `Chunk` doc that the `ColumnPager` from #36552 is the next
spill layer to wire in. The integration needs a deeper rework of the
cursor sharing model (`Cursor` holds `Rc<Chunk<D>>` and accesses chunks
through `&self`, while `pager.take` consumes its `PagedColumn`) so it
belongs in its own PR.
@antiguru antiguru force-pushed the claude/fix-stackwrapper-spilling-uLluU branch from e9b42f9 to a96d476 Compare June 3, 2026 17:30
Copy link
Copy Markdown
Contributor

@petrosagg petrosagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

/// [`Columnar::reborrow`] before letting the inner `==` pick up the `for<'a> Ref<'a>: Eq`
/// bound on [`Data`].
#[inline]
fn refs_eq<D: Data>(a: Ref<'_, D>, b: Ref<'_, D>) -> bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to change the Data trait to make equality work without this reborrow helper but it quickly turned into a monster

Copy link
Copy Markdown
Member

@DAlperin DAlperin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Probably wants a feature-benchmark run for memory usage

Copy link
Copy Markdown
Contributor

@Alphadelta14 Alphadelta14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

providing a stamp from console and cloud.
I didn't see obvious fallacies in my skim through.

@antiguru
Copy link
Copy Markdown
Member Author

antiguru commented Jun 4, 2026

Ran nightly, no performance regressions (https://buildkite.com/materialize/nightly/builds/16697)

@antiguru antiguru merged commit 72d1410 into main Jun 4, 2026
351 of 360 checks passed
@antiguru antiguru deleted the claude/fix-stackwrapper-spilling-uLluU branch June 4, 2026 15:17
maheshwarip pushed a commit that referenced this pull request Jun 5, 2026
### Motivation

#30083 added a spillable MV correction buffer that uses
`StackWrapper`/`ColumnationStack` for the chunk type. That storage has
two issues for the spilling use case:

1. The outer `Vec<T>` lives on the regular heap (not lgalloc-backed), so
it does not spill.
2. The columnation regions inside grow without a convenient way to
pre-size them, so per-chunk memory is hard to predict.

This PR replaces the chunk storage with a columnar chunk region: each
chunk becomes a single, fixed-size allocation, so the correction
buffer's memory behavior is predictable and spillable.

Part of MaterializeInc/database-issues#8464.

### Changes

**`Chunk` storage.** Each chunk now holds a `ChunkData<D>` enum with two
variants:

- `Typed(ContainerOf<(D, Timestamp, Diff)>)` — typed columnar container,
used while a chunk is still being filled.
- `Align(Vec<u64>)` — finished chunk encoded into a single aligned
allocation.

This mirrors `mz_timely_util::columnar::Column` minus its `Bytes`
variant (which wraps `Arc<dyn Any>` and is `!Sync`, breaking the `+
Send` iterator returned by `updates_before`).

**`ChunkBuilder`.** New helper that mints chunks at the same ~2 MiB
serialized-size boundary used by
`mz_timely_util::columnar::builder::ColumnBuilder` (`SHIP_WORDS = 1 <<
18`). When the in-progress container reaches that boundary, its bytes
are encoded into a `Vec<u64>` of exactly the required capacity and
emitted as a `ChunkData::Align`. Each minted chunk is therefore a single
predictably-sized allocation, and remainder data is flushed as a
`ChunkData::Typed` on finalize.

**`ChainBuilder`.** Wraps a `ChunkBuilder` and drains minted chunks into
a `Chain<D>`. All the chain-producing paths (`Stage::{insert, flush}`,
`Cursor::{into_chain, try_unwrap}`, `CorrectionV2::{merge_2,
merge_many}`) go through it; `Chain` itself no longer owns a partial
chunk or carries a `chunk_capacity`.

**`Data` trait.** Now requires `Columnar` (with `Container: Send + Sync
+ Clone`) instead of `Columnation`, plus `for<'a> Borrow<Ref<'a>: Eq +
Ord>` so the merge/heap code can compare update refs directly without
`into_owned` on the hot path. Both `Row` and `DataflowErrorSer` already
implement `Columnar`; the latter gained `#[columnar(derive(Eq,
PartialEq, Ord, PartialOrd))]` so its auto-generated `Reference`
satisfies the new ref-level bound.

**Lifetime helper for `pop_equal`.** `<C as Borrow>::Ref<'a>` is
invariant in `'a` through trait projection, so the compiler can't
shorten two refs with unrelated lifetimes by variance on its own. A
small `refs_eq` helper explicitly reborrows both inputs to a fresh local
lifetime via `Columnar::reborrow` before delegating to `Eq`.

**Chain invariant.** The `chain_proportionality` invariant now compares
chains by update count rather than chunk count. Chunk count was a
faithful proxy under columnation (fixed updates per chunk) but is not
anymore: any chain below the ~2 MiB boundary is a single chunk
regardless of update count, so the comparison would degenerate to
1-vs-1, collapse the geometric invariant, and turn inserts into O(N^2)
up to the boundary. Comparing update counts restores the pre-refactor
O(log N) amortization.

### Notes for reviewers

- `compute_correction_v2_chunk_size` changed meaning: it now only sizes
the in-memory `Stage` staging buffer and the ship cadence. On-chain
chunk byte size is fixed by the builder's ~2 MiB boundary (`SHIP_WORDS`)
and no longer follows this knob.
- Correction-size metrics (`mz_persist_sink_correction_*`) now report
serialized bytes (`indexed::length_in_bytes`) instead of columnation
heap bytes. Balance-based assertions (`correction-buffer-metrics.td`)
are unaffected, but absolute byte values shift — heads-up for anyone
dashboarding them.
- Added `chain_builder_roundtrips_across_mint_boundary`, exercising the
`Align` encode/decode spill path (crossing at least one `mint()`), which
previously had no coverage.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered.
- [ ] This PR has an associated up-to-date design doc, is a design doc,
or is sufficiently small to not require a design.
- [ ] If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in
a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR.
- [ ] If this PR includes major user-facing behavior changes, I have
pinged the relevant PM.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants