
Smarter dot products #128

Open
HEnquist wants to merge 6 commits into master from smarter-dot-products


HEnquist commented May 2, 2026

Algorithm: combined sinc for multi-channel resampling

The sinc interpolation inner loop has been restructured. Previously, each channel performed N separate SIMD dot products per output frame (one per nearest polyphase point — 4 for Cubic, 3 for Quadratic, 2 for Linear). The new approach builds a single combined sinc filter per frame by linearly blending the nearest polyphase filters with their interpolation weights, then does one dot product per channel against that combined filter.
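The equivalence between the two approaches follows from linearity of the dot product. A minimal scalar sketch, not the PR's SIMD code: the function names and the `Vec<Vec<f64>>` layout are assumptions made for illustration only.

```rust
/// Plain dot product of two equally long slices.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum::<f64>()
}

/// Direct path: one dot product per nearest polyphase filter, per channel.
/// `sincs` holds the nearest filters, `weights` their interpolation weights.
fn direct_frame(channels: &[Vec<f64>], sincs: &[Vec<f64>], weights: &[f64]) -> Vec<f64> {
    channels
        .iter()
        .map(|ch| {
            sincs
                .iter()
                .zip(weights)
                .map(|(s, w)| w * dot(ch, s))
                .sum::<f64>()
        })
        .collect()
}

/// Combined path: blend the filters once per frame, then do a single
/// dot product per channel against the blended filter.
fn combined_frame(channels: &[Vec<f64>], sincs: &[Vec<f64>], weights: &[f64]) -> Vec<f64> {
    let mut combined = vec![0.0; sincs[0].len()];
    for (s, w) in sincs.iter().zip(weights) {
        for (c, x) in combined.iter_mut().zip(s) {
            *c += w * x;
        }
    }
    channels.iter().map(|ch| dot(ch, &combined)).collect()
}
```

Both functions return one output sample per channel and agree up to floating-point rounding. The combined version touches the filter data once per frame instead of once per channel.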

This trades a one-time per-frame build cost for a cheaper per-channel evaluation. The build step is a SIMD SAXPY (combined += weight * sinc[k]), using fused multiply-add on AVX and NEON and a separate multiply and add on SSE.
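A scalar sketch of the build step, assuming f32 samples. `axpy` and `build_combined` are hypothetical names, and `f32::mul_add` stands in for the hardware FMA that the SIMD versions would use.

```rust
/// combined[i] += weight * sinc[i], one fused multiply-add per element.
fn axpy(combined: &mut [f32], weight: f32, sinc: &[f32]) {
    assert_eq!(combined.len(), sinc.len());
    for (c, s) in combined.iter_mut().zip(sinc) {
        // mul_add computes weight * s + c with a single rounding.
        *c = weight.mul_add(*s, *c);
    }
}

/// Build the combined filter by accumulating each nearest polyphase
/// filter scaled by its interpolation weight.
fn build_combined(sincs: &[&[f32]], weights: &[f32]) -> Vec<f32> {
    let mut combined = vec![0.0f32; sincs[0].len()];
    for (sinc, &w) in sincs.iter().zip(weights) {
        axpy(&mut combined, w, sinc);
    }
    combined
}
```

The build runs once per output frame regardless of channel count, which is why its cost is amortised better as the number of channels grows.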

The combined path is selected adaptively based on channel count, since for few channels the build overhead outweighs the savings:

  • Cubic: combined path from ≥ 2 channels (4 nearest points, break-even at 2)
  • Quadratic / Linear: combined path from ≥ 3 channels (fewer nearest points, break-even later)
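The selection rule above can be sketched as a small predicate. The enum and function names here are hypothetical and not taken from the PR; only the thresholds come from the description.

```rust
/// Interpolation types named in this PR (the enum itself is illustrative).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Interpolation {
    Cubic,
    Quadratic,
    Linear,
    Nearest,
}

/// Returns true when the combined-filter path should be used.
/// Thresholds follow the break-even points stated in the description.
fn use_combined_path(interp: Interpolation, channels: usize) -> bool {
    match interp {
        // 4 nearest points: the build cost pays off from 2 channels.
        Interpolation::Cubic => channels >= 2,
        // 3 or 2 nearest points: pays off from 3 channels.
        Interpolation::Quadratic | Interpolation::Linear => channels >= 3,
        // Nearest does not use the combined path.
        Interpolation::Nearest => false,
    }
}
```

Because the check depends only on the interpolation type and the channel count, it could be evaluated once at construction time rather than per frame, though whether the PR does so is not stated here.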

Performance (change vs master, scalar and NEON, measured on Apple Silicon; negative is faster)

| Configuration | 1 ch | 2 ch | 4 ch |
|---|---|---|---|
| Cubic f32 (scalar/NEON) | −4% / −1% | −17% / −13% | −40% / −40% |
| Cubic f64 (scalar/NEON) | +1% / +2% | −3% / −3% | −40% / −35% |
| Linear f32 (scalar/NEON) | +3% / ~0% | −4% / −3% | −18% / −21% |
| Linear f64 (scalar/NEON) | +5% / +6% | ~0% / +2% | −15% / −11% |
| Nearest (all) | +1–5% | ~0–5% | ~0–4% |

Nearest mode does not use the combined path and shows small regressions of up to about 5%, likely from minor overhead introduced by the architectural refactor (sinc storage changed from packed SIMD vectors to plain Vec<T>).

Other changes

  • Sinc storage simplified: polyphase filters are stored as Vec<Vec<T>> instead of packed SIMD register types (Vec<__m256> etc.). The unaligned load intrinsics read directly from plain pointers to T, so there is no functional change, and the code is significantly simpler.
  • New correctness tests: 16 parameterised tests verify that the 4-channel combined path produces output matching the 1-channel direct path (within 1e-10) across all four interpolation types, two resample ratios, and both fixed-input and fixed-output modes.
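The shape of the test matrix can be sketched as below. This is not the PR's test code: the type names, the helper `max_abs_diff`, and in particular the two ratio values are placeholders chosen for illustration, since the actual ratios are not given here.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Interpolation {
    Cubic,
    Quadratic,
    Linear,
    Nearest,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    FixedInput,
    FixedOutput,
}

/// Placeholder resample ratios; the values used by the real tests are not stated.
const RATIOS: [f64; 2] = [0.5, 2.0];

/// Enumerate every (interpolation, ratio, mode) combination: 4 * 2 * 2 = 16.
fn all_cases() -> Vec<(Interpolation, f64, Mode)> {
    let mut cases = Vec::new();
    for interp in [
        Interpolation::Cubic,
        Interpolation::Quadratic,
        Interpolation::Linear,
        Interpolation::Nearest,
    ] {
        for ratio in RATIOS {
            for mode in [Mode::FixedInput, Mode::FixedOutput] {
                cases.push((interp, ratio, mode));
            }
        }
    }
    cases
}

/// Largest absolute sample difference between two outputs, to be
/// compared against the 1e-10 tolerance.
fn max_abs_diff(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f64::max)
}
```

Each case would resample the same signal once through the 1-channel direct path and once as one channel of a 4-channel input, then assert that `max_abs_diff` of the two outputs stays below 1e-10.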
