Vectorization POC #35464

Draft
frankmcsherry wants to merge 12 commits into MaterializeInc:main from frankmcsherry:vectorization_poc

Conversation

@frankmcsherry (Contributor)

This is a POC, written largely by Claude, for vectorization of `MirScalarExpr` evaluation using standard array-language / columnar idioms. The expressions are still interpreted, but they act on batches of 1024 elements at a time. The code currently does quite a lot of diagnostic printing as it works, which should all be tidied up; sharing now for discussion.

Here is a summary from Claude of the work, and the next steps:

Vectorized Batch Evaluation: Summary

What changed
Vectorized arithmetic fast path: replaced the previous function-pointer-based `checked_add` per-element evaluation with true SIMD-friendly loops:
  - Arithmetic uses `wrapping_add`/`wrapping_sub` via `.iter().zip().map().collect()`, which lets LLVM auto-vectorize (confirmed: ~167-208ns for 1024 i64 adds, near the NEON theoretical minimum of ~128ns).
  - Overflow detection uses a two-phase approach: a branchless OR-reduction (`fold` with `|`) detects whether any overflow occurred, followed by index collection only in the rare case that one did. This dropped overflow checking from ~670ns to ~250ns.
  - Multiplication retains `checked_mul` per element, since there is no cheap bit trick for multiply overflow.
  - Result containers are constructed in bulk (`vec![discriminant; len]` plus `(0..len).collect()` for offsets) rather than pushing `ColumnDatum` variants one at a time.
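The two-phase overflow check described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: `add_i64_batch` is a hypothetical name, and the sign-bit test `(a ^ s) & (b ^ s) < 0` is one standard bit trick for signed-addition overflow, assumed here to match the approach.

```rust
// Hypothetical sketch of two-phase batch addition with overflow detection.
fn add_i64_batch(lhs: &[i64], rhs: &[i64]) -> Result<Vec<i64>, Vec<usize>> {
    // Phase 1: wrapping adds in a tight loop that LLVM can auto-vectorize.
    let sums: Vec<i64> = lhs
        .iter()
        .zip(rhs.iter())
        .map(|(a, b)| a.wrapping_add(*b))
        .collect();
    // Branchless OR-reduction: signed overflow occurred iff the operands
    // share a sign that the sum does not, i.e. (a ^ s) & (b ^ s) < 0.
    let any_overflow = lhs
        .iter()
        .zip(rhs.iter())
        .zip(sums.iter())
        .fold(0i64, |acc, ((a, b), s)| acc | ((a ^ s) & (b ^ s)));
    if any_overflow >= 0 {
        Ok(sums)
    } else {
        // Rare slow path: collect the offending indices.
        let bad: Vec<usize> = (0..lhs.len())
            .filter(|&i| ((lhs[i] ^ sums[i]) & (rhs[i] ^ sums[i])) < 0)
            .collect();
        Err(bad)
    }
}
```

The fast path touches the data twice but never branches per element, which is what keeps it SIMD-friendly.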

Eliminated redundant column copies:
  - `evaluate_batch` previously copied all input columns element-by-element into an owned `Vec<DatumColumn>` before evaluation. Now it borrows the input columns directly.
  - `VectorScalarExpr::eval` for `CallBinary` detects when sub-expressions are `Column` references and passes the existing columns through without copying.
  - `rows_to_columns` now accepts an iterator of `&Row` instead of requiring a collected `Vec<Row>`, avoiding a clone of every input row.
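The shape of the `rows_to_columns` change can be sketched like this; `Vec<i64>` stands in for the real `Row` type, and the signature is illustrative rather than the actual one:

```rust
// Accept any iterator of borrowed rows rather than an owned Vec<Row>,
// so callers no longer clone every input row just to transpose it.
fn rows_to_columns<'a, I>(rows: I, arity: usize) -> Vec<Vec<i64>>
where
    I: IntoIterator<Item = &'a Vec<i64>>, // stand-in for &'a Row
{
    // One output column per attribute; each input row contributes one
    // element to every column (a row-to-column transpose).
    let mut cols: Vec<Vec<i64>> = vec![Vec::new(); arity];
    for row in rows {
        for (col, datum) in cols.iter_mut().zip(row.iter()) {
            col.push(*datum);
        }
    }
    cols
}
```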

  Observed performance (1024-element batches, release build, Apple Silicon)

  ┌───────────────────────────┬────────┬──────────────────────────────────────────────────┐
  │           Phase           │  Time  │                      Notes                       │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Arithmetic (wrapping add) │ ~200ns │ SIMD-vectorized, near theoretical limit          │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Overflow detection        │ ~250ns │ Branchless OR-reduction                          │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Container construction    │ ~200ns │ variant + offset arrays                          │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Compute total             │ ~0.7us │                                                  │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Row packing (col→row)     │ ~10us  │ Per-element index_as_datum + packer.push         │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ Transpose (row→col)       │ ~58us  │ Per-element row.unpack() + datum_to_column_datum │
  ├───────────────────────────┼────────┼──────────────────────────────────────────────────┤
  │ End-to-end per batch      │ ~77us  │ Dominated by format conversion                   │
  └───────────────────────────┴────────┴──────────────────────────────────────────────────┘

Next steps

  - columnar crate changes: the columnar container currently requires variant (discriminant) and offset arrays even for homogeneous columns. Adding `Variants::Uniform(discriminant, len)` and `Offsets::Identity(len)` representations would make constructing a homogeneous column from a typed `Vec<T>` O(1) with zero auxiliary allocations, eliminating the ~200ns container overhead.
  - Keep data columnar across operators: the transpose (~58us) and row-packing (~10us) costs are ~99% of the per-batch time. The architectural fix is to keep data in columnar form through multiple operators (maps, filters, joins) and only convert back to rows at emission boundaries.
  - Vectorized float operations: float addition currently falls back to per-element scalar evaluation (~67us vs ~19us for integers). Adding vectorized float paths would bring parity.
  - Remove diagnostic timing: the `tracing::info!` calls with nanosecond breakdowns (`arith_ns`, `overflow_ns`, `container_ns`) were added for development and should be removed or gated behind a feature flag before merging.
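The proposed O(1) homogeneous-column representations could look roughly like the following. These enums only illustrate the suggestion; they are not the columnar crate's actual API, and the variant names beyond `Uniform`/`Identity` are invented for the sketch.

```rust
// Sketch of uniform/identity representations that avoid per-element
// auxiliary arrays for homogeneous columns.
enum Variants {
    PerElement(Vec<u8>), // today: one discriminant byte per element
    Uniform(u8, usize),  // proposed: (discriminant, len), no allocation
}

enum Offsets {
    Explicit(Vec<usize>), // today: offsets stored per element
    Identity(usize),      // proposed: offsets[i] == i for i in 0..len
}

impl Variants {
    fn get(&self, i: usize) -> u8 {
        match self {
            Variants::PerElement(v) => v[i],
            Variants::Uniform(d, len) => {
                assert!(i < *len);
                *d
            }
        }
    }
}

impl Offsets {
    fn get(&self, i: usize) -> usize {
        match self {
            Offsets::Explicit(v) => v[i],
            Offsets::Identity(len) => {
                assert!(i < *len);
                i
            }
        }
    }
}
```

Constructing `Variants::Uniform(d, len)` and `Offsets::Identity(len)` is O(1) regardless of column length, which is what eliminates the ~200ns bulk-construction overhead.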

@github-actions

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@antiguru (Member) left a comment

I think this checks out! We should aim at having two different MFP implementations, and a runtime dispatch between them so we only have to pay the price of specializing the MFP once. Then there's a bunch of code that we could integrate into the sqlfunc macro, but no need to block on this.

frankmcsherry and others added 10 commits March 25, 2026 08:48
Instead of converting MirScalarExpr to VectorScalarExpr on every
evaluate_batch call, do the conversion once at operator construction.
MfpEval enum dispatches between scalar (temporal) and vectorized
(nontemporal) evaluation, storing only the needed variant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
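The construction-time dispatch described in this commit might be sketched as below; `Expr` and the exact shape of `MfpEval` are stand-ins for the PR's `MirScalarExpr`/`VectorScalarExpr` types, not the actual definitions.

```rust
// Specialize the MFP once when the operator is built, then dispatch
// per batch, instead of converting on every evaluate_batch call.
enum Expr {
    Temporal,
    Nontemporal,
}

enum MfpEval {
    Scalar(Expr),     // temporal exprs: row-at-a-time evaluation
    Vectorized(Expr), // nontemporal exprs: batch evaluation on columns
}

impl MfpEval {
    // Decide once, at operator construction.
    fn new(expr: Expr) -> Self {
        match expr {
            Expr::Temporal => MfpEval::Scalar(Expr::Temporal),
            e => MfpEval::Vectorized(e),
        }
    }

    fn is_vectorized(&self) -> bool {
        matches!(self, MfpEval::Vectorized(_))
    }
}
```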
Replace hand-written vectorized binary function dispatch with a
declarative `vectorized = "..."` parameter on `#[sqlfunc]`. This keeps
vectorized implementations in sync with function definitions.

Introduces a `VectorizedBinaryFunc` trait with default no-op methods.
The sqlfunc macro generates a specialized impl when `vectorized` is set,
and `derive_binary!` delegates through the trait on `BinaryFunc`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded discriminant constants with values derived from
ColumnDatumKind, a companion enum generated by the enum_kinds crate.
The macros now take only the variant name and derive the discriminant
automatically, keeping them in sync with the enum definition order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Break long line in derive_binary! macro to stay within 100-char limit
* Replace `as u8` enum casts with a `discriminant()` helper to satisfy
  the `as_conversions` clippy lint
* Replace `len as u64` with `u64::cast_from(len)`
* Add Cargo.lock update for enum-kinds dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a criterion benchmark comparing vectorized vs scalar MFP evaluation
on integer columns (100 to 100k rows), plus a rows_to_columns conversion
benchmark.

Add `enable_compute_vectorized_mfp` dyncfg (default false) to gate
vectorized MFP evaluation. Thread the flag through persist_source,
persist_source_core, and decode_and_mfp so the compute layer can
enable it via worker config.

Add `SafeMfpPlan::to_vectorized()` to expose VectorizedSafeMfpPlan
construction without accessing private fields.

Convert the oneshot source (render_decode_chunk) to use vectorized
batch evaluation when the flag is enabled, with scalar fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a benchmark that reads from `Column<(Row, Timestamp, Diff)>`,
evaluates an MFP (vectorized vs scalar), and encodes results back
into a Column. This mirrors how persist_source processes batches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update insta snapshots for new VectorizedBinaryFunc impl generation.
Fix as_conversions, disallowed zip, and redundant closures in
vectorized evaluation code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add the flag to FlipFlagsAction in parallel workload and to
get_variable_system_parameters in mzcompose so CI exercises both
true and false values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@antiguru force-pushed the vectorization_poc branch from e715afa to e66d1a8 on March 25, 2026 07:44
antiguru and others added 2 commits March 25, 2026 09:50
The lockfile was regenerated during rebase conflict resolution, which
pulled in AWS SDK versions requiring rustc 1.91.1. Restore from
upstream/main and re-resolve only our dependency additions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix as_conversions and disallowed zip in persist_source.rs vectorized
path. Apply rustfmt and fix broken doc link in vectorized.rs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
antiguru pushed a commit to antiguru/materialize that referenced this pull request Mar 25, 2026
…0.1)

Research findings on feasibility of column-of-datums arrangement spines:
- Current DatumContainer stores rows as contiguous bytes with offset indexing
- Columnar spines require schema propagation, new BatchContainer impls,
  modified merge/cursor logic, and vectorized eval (PR MaterializeInc#35464) as prereq
- Recommended phased approach starting with vectorized MFP evaluation

https://claude.ai/code/session_01JHo5sTCSGPW5NavNE2b49d