Vectorization POC #35464

Draft
frankmcsherry wants to merge 12 commits into MaterializeInc:main from frankmcsherry:vectorization_poc

Conversation

@frankmcsherry (Contributor)

This is a POC, written largely by Claude, for vectorization of `MirScalarExpr` evaluation using standard array-language / columnar idioms. The expressions are still interpreted, but they act on batches of 1024 elements at a time. The code currently does quite a lot of diagnostic printing as it works, which should all be tidied up; sharing now for discussion.

Here is a summary from Claude of the work, and the next steps:

Vectorized Batch Evaluation: Summary

What changed
Vectorized arithmetic fast path: replaced the previous function-pointer-based `checked_add` per-element evaluation with true SIMD-friendly loops:
  - Arithmetic uses `wrapping_add`/`wrapping_sub` via `.iter().zip().map().collect()`, which lets LLVM auto-vectorize (confirmed: ~167-208ns for 1024 i64 adds, near the NEON theoretical minimum of ~128ns).
  - Overflow detection uses a two-phase approach: a branchless OR-reduction (`fold` with `|`) detects whether any overflow occurred, followed by index collection only in the rare case that one did. This dropped overflow checking from ~670ns to ~250ns.
  - Multiplication retains `checked_mul` per element, since there is no cheap bit trick for multiply overflow.
  - Result containers are constructed in bulk (`vec![discriminant; len]` plus `(0..len).collect()` for offsets) rather than pushing `ColumnDatum` variants one at a time.
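The two-phase overflow check described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: `add_i64_batch` is a hypothetical name, and the sign-bit test `(a ^ s) & (b ^ s) < 0` is one standard bit trick for signed-addition overflow, assumed here to match the approach.

```rust
// Hypothetical sketch of two-phase batch addition with overflow detection.
fn add_i64_batch(lhs: &[i64], rhs: &[i64]) -> Result<Vec<i64>, Vec<usize>> {
    // Phase 1: wrapping adds in a tight loop that LLVM can auto-vectorize.
    let sums: Vec<i64> = lhs
        .iter()
        .zip(rhs.iter())
        .map(|(a, b)| a.wrapping_add(*b))
        .collect();
    // Branchless OR-reduction: signed overflow occurred iff the operands
    // share a sign that the sum does not, i.e. (a ^ s) & (b ^ s) < 0.
    let any_overflow = lhs
        .iter()
        .zip(rhs.iter())
        .zip(sums.iter())
        .fold(0i64, |acc, ((a, b), s)| acc | ((a ^ s) & (b ^ s)));
    if any_overflow >= 0 {
        Ok(sums)
    } else {
        // Rare slow path: collect the offending indices.
        let bad: Vec<usize> = (0..lhs.len())
            .filter(|&i| ((lhs[i] ^ sums[i]) & (rhs[i] ^ sums[i])) < 0)
            .collect();
        Err(bad)
    }
}
```

The fast path touches the data twice but never branches per element, which is what keeps it SIMD-friendly.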

Eliminated redundant column copies:
  - `evaluate_batch` previously copied all input columns element-by-element into an owned `Vec<DatumColumn>` before evaluation. Now it borrows the input columns directly.
  - `VectorScalarExpr::eval` for `CallBinary` detects when sub-expressions are `Column` references and passes the existing columns through without copying.
  - `rows_to_columns` now accepts an iterator of `&Row` instead of requiring a collected `Vec<Row>`, avoiding a clone of every input row.
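The shape of the `rows_to_columns` change can be sketched like this; `Vec<i64>` stands in for the real `Row` type, and the signature is illustrative rather than the actual one:

```rust
// Accept any iterator of borrowed rows rather than an owned Vec<Row>,
// so callers no longer clone every input row just to transpose it.
fn rows_to_columns<'a, I>(rows: I, arity: usize) -> Vec<Vec<i64>>
where
    I: IntoIterator<Item = &'a Vec<i64>>, // stand-in for &'a Row
{
    // One output column per attribute; each input row contributes one
    // element to every column (a row-to-column transpose).
    let mut cols: Vec<Vec<i64>> = vec![Vec::new(); arity];
    for row in rows {
        for (col, datum) in cols.iter_mut().zip(row.iter()) {
            col.push(*datum);
        }
    }
    cols
}
```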

  Observed performance (1024-element batches, release build, Apple Silicon)

  ┌───────────────────────────┬────────┬──────────────────────────────────────────────────┐
  │           Phase           │  Time  │                      Notes                       │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Arithmetic (wrapping add) │ ~200ns │ SIMD-vectorized, near theoretical limit          │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Overflow detection        │ ~250ns │ Branchless OR-reduction                          │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Container construction    │ ~200ns │ variant + offset arrays                          │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Compute total             │ ~0.7us │                                                  │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Row packing (col→row)     │ ~10us  │ Per-element index_as_datum + packer.push         │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Transpose (row→col)       │ ~58us  │ Per-element row.unpack() + datum_to_column_datum │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ End-to-end per batch      │ ~77us  │ Dominated by format conversion                   │
  └───────────────────────────┴────────┴──────────────────────────────────────────────────┘

Next steps

  - columnar crate changes: the columnar container currently requires variant (discriminant) and offset arrays even for homogeneous columns. Adding `Variants::Uniform(discriminant, len)` and `Offsets::Identity(len)` representations would make constructing a homogeneous column from a typed `Vec<T>` O(1) with zero auxiliary allocations, eliminating the ~200ns container overhead.
  - Keep data columnar across operators: the transpose (~58us) and row-packing (~10us) costs are ~99% of the per-batch time. The architectural fix is to keep data in columnar form through multiple operators (maps, filters, joins) and only convert back to rows at emission boundaries.
  - Vectorized float operations: float addition currently falls back to per-element scalar evaluation (~67us vs ~19us for integers). Adding vectorized float paths would bring parity.
  - Remove diagnostic timing: the `tracing::info!` calls with nanosecond breakdowns (`arith_ns`, `overflow_ns`, `container_ns`) were added for development and should be removed or gated behind a feature flag before merging.
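The proposed O(1) homogeneous-column representations could look roughly like the following. These enums only illustrate the suggestion; they are not the columnar crate's actual API, and the variant names beyond `Uniform`/`Identity` are invented for the sketch.

```rust
// Sketch of uniform/identity representations that avoid per-element
// auxiliary arrays for homogeneous columns.
enum Variants {
    PerElement(Vec<u8>), // today: one discriminant byte per element
    Uniform(u8, usize),  // proposed: (discriminant, len), no allocation
}

enum Offsets {
    Explicit(Vec<usize>), // today: offsets stored per element
    Identity(usize),      // proposed: offsets[i] == i for i in 0..len
}

impl Variants {
    fn get(&self, i: usize) -> u8 {
        match self {
            Variants::PerElement(v) => v[i],
            Variants::Uniform(d, len) => {
                assert!(i < *len);
                *d
            }
        }
    }
}

impl Offsets {
    fn get(&self, i: usize) -> usize {
        match self {
            Offsets::Explicit(v) => v[i],
            Offsets::Identity(len) => {
                assert!(i < *len);
                i
            }
        }
    }
}
```

Constructing `Variants::Uniform(d, len)` and `Offsets::Identity(len)` is O(1) regardless of column length, which is what eliminates the ~200ns bulk-construction overhead.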

@github-actions

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@antiguru (Member) left a comment

I think this checks out! We should aim at having two different MFP implementations, and a runtime dispatch between them so we only have to pay the price of specializing the MFP once. Then there's a bunch of code that we could integrate into the sqlfunc macro, but no need to block on this.

frankmcsherry and others added 10 commits March 25, 2026 08:48
Instead of converting MirScalarExpr to VectorScalarExpr on every
evaluate_batch call, do the conversion once at operator construction.
MfpEval enum dispatches between scalar (temporal) and vectorized
(nontemporal) evaluation, storing only the needed variant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
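The construction-time dispatch described in this commit might be sketched as below; `Expr` and the exact shape of `MfpEval` are stand-ins for the PR's `MirScalarExpr`/`VectorScalarExpr` types, not the actual definitions.

```rust
// Specialize the MFP once when the operator is built, then dispatch
// per batch, instead of converting on every evaluate_batch call.
enum Expr {
    Temporal,
    Nontemporal,
}

enum MfpEval {
    Scalar(Expr),     // temporal exprs: row-at-a-time evaluation
    Vectorized(Expr), // nontemporal exprs: batch evaluation on columns
}

impl MfpEval {
    // Decide once, at operator construction.
    fn new(expr: Expr) -> Self {
        match expr {
            Expr::Temporal => MfpEval::Scalar(Expr::Temporal),
            e => MfpEval::Vectorized(e),
        }
    }

    fn is_vectorized(&self) -> bool {
        matches!(self, MfpEval::Vectorized(_))
    }
}
```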
Replace hand-written vectorized binary function dispatch with a
declarative `vectorized = "..."` parameter on `#[sqlfunc]`. This keeps
vectorized implementations in sync with function definitions.

Introduces a `VectorizedBinaryFunc` trait with default no-op methods.
The sqlfunc macro generates a specialized impl when `vectorized` is set,
and `derive_binary!` delegates through the trait on `BinaryFunc`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded discriminant constants with values derived from
ColumnDatumKind, a companion enum generated by the enum_kinds crate.
The macros now take only the variant name and derive the discriminant
automatically, keeping them in sync with the enum definition order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Break long line in derive_binary! macro to stay within 100-char limit
* Replace `as u8` enum casts with a `discriminant()` helper to satisfy
  the `as_conversions` clippy lint
* Replace `len as u64` with `u64::cast_from(len)`
* Add Cargo.lock update for enum-kinds dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a criterion benchmark comparing vectorized vs scalar MFP evaluation
on integer columns (100 to 100k rows), plus a rows_to_columns conversion
benchmark.

Add `enable_compute_vectorized_mfp` dyncfg (default false) to gate
vectorized MFP evaluation. Thread the flag through persist_source,
persist_source_core, and decode_and_mfp so the compute layer can
enable it via worker config.

Add `SafeMfpPlan::to_vectorized()` to expose VectorizedSafeMfpPlan
construction without accessing private fields.

Convert the oneshot source (render_decode_chunk) to use vectorized
batch evaluation when the flag is enabled, with scalar fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a benchmark that reads from `Column<(Row, Timestamp, Diff)>`,
evaluates an MFP (vectorized vs scalar), and encodes results back
into a Column. This mirrors how persist_source processes batches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update insta snapshots for new VectorizedBinaryFunc impl generation.
Fix as_conversions, disallowed zip, and redundant closures in
vectorized evaluation code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add the flag to FlipFlagsAction in parallel workload and to
get_variable_system_parameters in mzcompose so CI exercises both
true and false values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@antiguru force-pushed the vectorization_poc branch from e715afa to e66d1a8 on March 25, 2026 07:44
antiguru and others added 2 commits March 25, 2026 09:50
The lockfile was regenerated during rebase conflict resolution, which
pulled in AWS SDK versions requiring rustc 1.91.1. Restore from
upstream/main and re-resolve only our dependency additions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix as_conversions and disallowed zip in persist_source.rs vectorized
path. Apply rustfmt and fix broken doc link in vectorized.rs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
antiguru pushed a commit to antiguru/materialize that referenced this pull request Mar 25, 2026
…0.1)

Research findings on feasibility of column-of-datums arrangement spines:
- Current DatumContainer stores rows as contiguous bytes with offset indexing
- Columnar spines require schema propagation, new BatchContainer impls,
  modified merge/cursor logic, and vectorized eval (PR MaterializeInc#35464) as prereq
- Recommended phased approach starting with vectorized MFP evaluation

https://claude.ai/code/session_01JHo5sTCSGPW5NavNE2b49d