Smarter dot products by HEnquist · Pull Request #128 · HEnquist/rubato

HEnquist · 2026-05-02T20:15:17Z

Algorithm: combined sinc for multi-channel resampling

The sinc interpolation inner loop has been restructured. Previously, each channel performed N separate SIMD dot products per output frame, one per nearest polyphase point (4 for Cubic, 3 for Quadratic, 2 for Linear). The new approach builds a single combined sinc filter per frame by linearly blending the nearest polyphase filters with their interpolation weights, then does one dot product per channel against that combined filter. Because the dot product is linear, both orderings give the same result up to floating-point rounding.
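
To make the restructuring concrete, here is a minimal scalar sketch of the two orderings for one output frame. It is not the code in this PR: `dot`, `direct_frame` and `combined_frame` are hypothetical names, each channel slice is assumed to be the input window already aligned with the filters, and a plain scalar loop stands in for the SIMD kernels.

```rust
/// Plain scalar dot product; stands in for the SIMD kernel.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Direct path: for every channel, one dot product per nearest filter,
/// then blend the N results with the interpolation weights.
fn direct_frame(channels: &[Vec<f32>], sincs: &[Vec<f32>], weights: &[f32]) -> Vec<f32> {
    channels
        .iter()
        .map(|ch| {
            sincs
                .iter()
                .zip(weights)
                .map(|(sinc, w)| w * dot(ch, sinc))
                .sum::<f32>()
        })
        .collect()
}

/// Combined path: blend the N filters once per frame,
/// then do a single dot product per channel.
fn combined_frame(channels: &[Vec<f32>], sincs: &[Vec<f32>], weights: &[f32]) -> Vec<f32> {
    let len = sincs.first().map_or(0, |s| s.len());
    let mut combined = vec![0.0f32; len];
    for (sinc, w) in sincs.iter().zip(weights) {
        for (c, s) in combined.iter_mut().zip(sinc) {
            *c += w * s;
        }
    }
    channels.iter().map(|ch| dot(ch, &combined)).collect()
}
```

Both functions return one sample per channel, and the two results should differ only by rounding, since the order of the floating-point operations changes.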

This trades a once-per-frame build cost for a cheaper per-channel evaluation. The build step is a SIMD AXPY update (combined += weight * sinc[k]), accelerated with FMA on AVX/NEON and a separate multiply and add on SSE.
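
The build step can be sketched in scalar form as follows. This is an illustration only, not the SIMD kernels from this PR: `axpy` and `build_combined` are hypothetical names, and `f32::mul_add` is used to mirror the fused multiply-add. Note that `mul_add` is only fast on targets with hardware FMA.

```rust
/// combined[i] += weight * sinc[i], written as one fused multiply-add per element.
fn axpy(combined: &mut [f32], weight: f32, sinc: &[f32]) {
    assert_eq!(combined.len(), sinc.len());
    for (c, s) in combined.iter_mut().zip(sinc) {
        // mul_add computes weight * s + c with a single rounding.
        *c = weight.mul_add(*s, *c);
    }
}

/// Blend the nearest polyphase filters into a single filter for this frame.
fn build_combined(sincs: &[Vec<f32>], weights: &[f32]) -> Vec<f32> {
    let len = sincs.first().map_or(0, |s| s.len());
    let mut combined = vec![0.0f32; len];
    for (sinc, &w) in sincs.iter().zip(weights) {
        axpy(&mut combined, w, sinc);
    }
    combined
}
```

The build runs once per output frame regardless of channel count, which is why its cost is amortised better as the number of channels grows.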

The combined path is selected adaptively based on channel count, since for few channels the build overhead outweighs the savings:

Cubic: combined path from ≥ 2 channels (4 nearest points, break-even at 2)
Quadratic / Linear: combined path from ≥ 3 channels (fewer nearest points, break-even later)
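
The selection rules above could be expressed roughly like this. The enum and function names are hypothetical and do not reflect the actual code in this PR. `kernel_passes` is a crude count of passes over the filter length (dot products for the direct path, AXPY updates plus dot products for the combined path); it ignores that an AXPY pass also stores its result, so it is only a rough guide to the thresholds used here.

```rust
/// Hypothetical stand-in for the interpolation type; not rubato's own enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Interp {
    Cubic,
    Quadratic,
    Linear,
    Nearest,
}

/// Number of nearest polyphase points blended per output frame.
fn nearest_points(interp: Interp) -> usize {
    match interp {
        Interp::Cubic => 4,
        Interp::Quadratic => 3,
        Interp::Linear => 2,
        Interp::Nearest => 1,
    }
}

/// Thresholds as listed in the PR description.
fn use_combined(interp: Interp, channels: usize) -> bool {
    match interp {
        Interp::Cubic => channels >= 2,
        Interp::Quadratic | Interp::Linear => channels >= 3,
        // Nearest uses a single filter, so there is nothing to combine.
        Interp::Nearest => false,
    }
}

/// Rough pass count per frame: (direct, combined).
fn kernel_passes(interp: Interp, channels: usize) -> (usize, usize) {
    let n = nearest_points(interp);
    (n * channels, n + channels)
}
```

For example, Cubic with 2 channels gives 8 passes on the direct path against 6 on the combined path, while Linear with 2 channels gives 4 against 4, consistent with Linear switching over one channel later.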

Performance (vs master, scalar and NEON, measured on Apple Silicon)

Configuration	1 ch	2 ch	4 ch
Cubic f32 (scalar/NEON)	−4% / −1%	−17% / −13%	−40% / −40%
Cubic f64 (scalar/NEON)	+1% / +2%	−3% / −3%	−40% / −35%
Linear f32 (scalar/NEON)	+3% / ~0%	−4% / −3%	−18% / −21%
Linear f64 (scalar/NEON)	+5% / +6%	~0% / +2%	−15% / −11%
Nearest (all)	+1–5%	~0–5%	~0–4%

Positive values in the table are slowdowns relative to master; negative values are speedups. Nearest mode does not use the combined path and shows small regressions of 1–5%, likely from minor overhead introduced by the architectural refactor (sinc storage changed from packed SIMD vectors to plain Vec<T>).

Other changes

Sinc storage simplified: polyphase filters are stored as Vec<Vec<T>> instead of packed SIMD register types (Vec<__m256> etc.). Unaligned-load intrinsics work directly on plain *const T pointers, so there is no functional change, but the code is significantly simpler.
New correctness tests: 16 parameterised tests verify that the 4-channel combined path produces output matching the 1-channel direct path (within 1e-10) across all four interpolation types, two resample ratios, and both fixed-input and fixed-output modes.
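
As an illustration of the storage point above, the sketch below computes a dot product straight from plain `&[f32]` rows using unaligned loads. It is an assumption-laden example rather than the kernel in this PR: `dot_unaligned` is a hypothetical name, only the SSE baseline of x86_64 is used, and other architectures fall through to the scalar loop.

```rust
/// Dot product reading directly from `&[f32]` slices with unaligned loads.
fn dot_unaligned(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut sum = 0.0f32;
    #[allow(unused_mut)]
    let mut done: usize = 0;

    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::*;
        // SSE is part of the x86_64 baseline, so no runtime feature check is needed.
        unsafe {
            let mut acc = _mm_setzero_ps();
            while done + 4 <= a.len() {
                // loadu accepts any *const f32; a Vec<f32> only guarantees 4-byte alignment.
                let va = _mm_loadu_ps(a.as_ptr().add(done));
                let vb = _mm_loadu_ps(b.as_ptr().add(done));
                acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
                done += 4;
            }
            // Horizontal sum of the four lanes.
            let mut lanes = [0.0f32; 4];
            _mm_storeu_ps(lanes.as_mut_ptr(), acc);
            sum += lanes.iter().sum::<f32>();
        }
    }

    // Scalar tail on x86_64, and the whole loop on other architectures.
    for i in done..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

Because the loads are unaligned, the function can be handed any sub-slice of a `Vec<Vec<f32>>` row, for example `&row[1..]`, without repacking the data into SIMD register types.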

HEnquist added 4 commits May 1, 2026 22:24

Smarter way to calculate the dot products

c27b76a

WIP more optimizations

ccde7e4

switch stragegy based on channel count

bf22b93

Tweak selection rules

f4a9e6f

HEnquist mentioned this pull request May 2, 2026

Performance improvements #127

Open

HEnquist added 2 commits May 3, 2026 22:40

Fix compile error on x86_64

2f60af9

Clippy warning

6a8b7b9

Smarter dot products #128
HEnquist wants to merge 6 commits into master from smarter-dot-products
