
perf: per-block chain processing + batched NAM process_buffer (2.8×) #249

Merged: OpenSauce merged 3 commits into main from perf/nam-block-processing on May 31, 2026

@OpenSauce
Owner

Summary

process_block existed on the Stage trait and AmplifierChain but the engine never called it — both process paths looped chain.process(sample) one sample at a time. And even once wired in, the NAM stage left the big win on the table because NamStage inherited the default per-sample trait loop instead of calling nam-rs's batched Model::process_buffer.

This PR fixes both.
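
The gap described above can be illustrated with a minimal sketch. Only the names `Stage`, `process_block`, and `chain.process(sample)` come from this PR; the method signatures, the `Gain` stage, and the `Chain` struct below are hypothetical stand-ins, not the project's actual definitions.

```rust
/// Hypothetical stand-in for the project's `Stage` trait (signatures assumed).
trait Stage {
    /// Per-sample path.
    fn process(&mut self, sample: f32) -> f32;

    /// Default block path: falls back to the per-sample method.
    /// A stage only benefits from block processing if it overrides this.
    fn process_block(&mut self, block: &mut [f32]) {
        for s in block.iter_mut() {
            *s = self.process(*s);
        }
    }
}

/// Hypothetical trivial stage used only for illustration.
struct Gain(f32);

impl Stage for Gain {
    fn process(&mut self, sample: f32) -> f32 {
        sample * self.0
    }
}

/// Hypothetical stand-in for `AmplifierChain`.
struct Chain {
    stages: Vec<Box<dyn Stage>>,
    bypassed: bool,
}

impl Chain {
    /// Per-sample: the bypass branch and the stage-list walk run for every sample.
    fn process(&mut self, sample: f32) -> f32 {
        if self.bypassed {
            return sample;
        }
        self.stages.iter_mut().fold(sample, |s, st| st.process(s))
    }

    /// Per-block: the bypass branch and the stage-list walk run once per block,
    /// and each stage's own `process_block` (default or override) is used.
    fn process_block(&mut self, block: &mut [f32]) {
        if self.bypassed {
            return;
        }
        for st in self.stages.iter_mut() {
            st.process_block(block);
        }
    }
}
```

In this sketch an engine callback that loops `chain.process(sample)` never reaches any stage's `process_block`, which is the situation the summary describes. Calling `chain.process_block(&mut buffer)` once per audio callback is what lets batched overrides run.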

Changes

  • engine.rs — both process paths now call chain.process_block() instead of a per-sample loop. Per-stage work (bypass branch, stage-list walk) happens once per block, and stages with batched process_block overrides are actually exercised.
  • NamStage::process_block override — applies input gain → batched process_buffer → output gain + dry/wet mix, using a preallocated scratch buffer for the dry signal so steady-state processing never allocates on the RT thread.
  • Parity test — asserts the block path matches the per-sample path within 1e-5.
  • is_active() accessor on NamStage.
  • Vendored reference_standard.nam (MIT, from nam-rs tests/fixtures/) into rustortion-core/tests/fixtures/ with attribution, so the parity test and NAM benchmark groups run deterministically in CI rather than depending on a user's gitignored nam/ models.
  • Benchmarks — added a NAM chain sample-vs-block group and a raw process_buffer vs process_sample ceiling group to chain.rs.
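
The shape of the `NamStage::process_block` override and its parity check can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `ToyModel` is a made-up stand-in for a nam-rs model (the real `Model::process_buffer` signature is not shown here), and the field names and the dry/wet formula are assumed.

```rust
/// Made-up stand-in for a neural amp model with a batched entry point.
struct ToyModel {
    state: f32,
}

impl ToyModel {
    fn process_sample(&mut self, x: f32) -> f32 {
        // Simple stateful nonlinearity so sample order matters.
        self.state = 0.5 * self.state + 0.5 * x;
        self.state.tanh()
    }

    /// Batched path; a real model would amortise per-call overhead here.
    fn process_buffer(&mut self, buf: &mut [f32]) {
        for s in buf.iter_mut() {
            *s = self.process_sample(*s);
        }
    }
}

/// Hypothetical stage mirroring the steps listed for `NamStage`.
struct NamLikeStage {
    model: ToyModel,
    input_gain: f32,
    output_gain: f32,
    mix: f32,              // 0.0 = fully dry, 1.0 = fully wet (assumed convention)
    dry_scratch: Vec<f32>, // preallocated once, never resized afterwards
}

impl NamLikeStage {
    fn new(max_block: usize, input_gain: f32, output_gain: f32, mix: f32) -> Self {
        assert!(max_block > 0, "scratch buffer must hold at least one sample");
        Self {
            model: ToyModel { state: 0.0 },
            input_gain,
            output_gain,
            mix,
            dry_scratch: vec![0.0; max_block],
        }
    }

    /// Per-sample reference path.
    fn process(&mut self, dry: f32) -> f32 {
        let wet = self.model.process_sample(dry * self.input_gain) * self.output_gain;
        dry * (1.0 - self.mix) + wet * self.mix
    }

    /// Block path: input gain, batched model call, then output gain and dry/wet mix.
    /// Works in chunks no larger than the scratch buffer, so it never allocates.
    fn process_block(&mut self, block: &mut [f32]) {
        let cap = self.dry_scratch.len();
        for chunk in block.chunks_mut(cap) {
            let dry = &mut self.dry_scratch[..chunk.len()];
            dry.copy_from_slice(chunk); // keep the dry signal for the mix
            for s in chunk.iter_mut() {
                *s *= self.input_gain;
            }
            self.model.process_buffer(chunk);
            for (s, d) in chunk.iter_mut().zip(dry.iter()) {
                *s = d * (1.0 - self.mix) + *s * self.output_gain * self.mix;
            }
        }
    }
}
```

A parity check in the spirit of the one listed above would feed the same input through `process` on one instance and `process_block` on a second, freshly constructed instance, then assert that every pair of outputs differs by less than 1e-5. Two instances are needed because the model carries state between samples.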

Results

| Path | Per-sample | Block | Speedup |
|---|---|---|---|
| Analog chain, 1x | 10.5 µs | 7.0 µs | 1.5× |
| Analog chain, 16x | 114 µs | 112 µs | ~1.0× |
| NAM chain, 1x | 824 µs | 293 µs | 2.8× |

For NAM users this is roughly a 64% CPU cut on the dominant stage — landing on nam-rs's raw process_buffer ceiling (288 µs) — and it's live in the real engine path, not just the benchmark. The analog-chain win is loop-order/cache-locality, largest at low oversampling and converging to parity as real DSP work dominates.
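
The headline figures follow directly from the timings in the table. The small helpers below are illustrative only (the function names are made up); the 48 kHz sample rate used in the budget example is an assumption, since the PR does not state the rate used for the benchmarks.

```rust
/// Speedup factor: how many times faster the block path is.
fn speedup(per_sample_us: f64, block_us: f64) -> f64 {
    per_sample_us / block_us
}

/// Percentage of processing time removed by the block path.
fn cpu_cut_percent(per_sample_us: f64, block_us: f64) -> f64 {
    (1.0 - block_us / per_sample_us) * 100.0
}

/// Wall-clock time available to process one block, in microseconds.
/// The sample rate is a caller-supplied assumption.
fn block_budget_us(frames: u32, sample_rate_hz: f64) -> f64 {
    f64::from(frames) / sample_rate_hz * 1.0e6
}
```

With the NAM chain numbers, `speedup(824.0, 293.0)` is about 2.81 and `cpu_cut_percent(824.0, 293.0)` is about 64.4, matching the 2.8× and roughly 64% quoted above. The block path at 293 µs sits about 1.7% above the 288 µs raw `process_buffer` ceiling. If the 128-sample block ran at an assumed 48 kHz, `block_budget_us(128, 48000.0)` gives roughly 2667 µs of budget per block.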

Verification

  • make lint — clean
  • make test — all pass, including block_matches_per_sample_with_real_model

OpenSauce added 2 commits May 31, 2026 15:40
Both engine paths looped chain.process(sample) one sample at a time, even
though AmplifierChain::process_block and the Stage::process_block trait method
already existed. Call process_block instead so per-stage work (bypass branch,
stage-list walk) happens once per block rather than once per sample, and so
stages with batched process_block overrides are actually exercised.

Measured on the analog chain: ~31% faster at 1x oversampling, converging to
parity as oversampling rises and real DSP work dominates loop overhead.

NamStage inherited the default Stage::process_block (a per-sample process_sample
loop), so the engine's per-block path never reached nam-rs's batched
Model::process_buffer. Override process_block to apply input gain, run
process_buffer over the block, then apply output gain and the dry/wet mix,
using a preallocated scratch buffer for the dry signal so steady-state
processing never allocates on the RT thread.

On the standard WaveNet reference model this cuts the NAM chain block from
~824us to ~293us per 128-sample block (2.8x; ~64% less CPU), matching the raw
process_buffer ceiling. A parity test asserts the block path matches the
per-sample path within 1e-5.

Vendor reference_standard.nam (MIT, from nam-rs) into tests/fixtures so the
parity test and NAM benchmark groups run deterministically in CI rather than
depending on a user's gitignored nam/ models. Add is_active() to NamStage and
NAM benchmark groups (chain sample-vs-block + raw process_buffer ceiling).
Copilot AI review requested due to automatic review settings May 31, 2026 14:44
Contributor

Copilot AI left a comment

Pull request overview

This PR wires AmplifierChain::process_block into both engine processing paths (previously the engine looped per-sample even though process_block existed) and adds a NamStage::process_block override that uses nam-rs's batched Model::process_buffer with a preallocated scratch buffer for the dry signal. A vendored MIT reference NAM model is committed under tests/fixtures/ to back a new parity test and dedicated NAM benchmark groups.

Changes:

  • Replace the per-sample loops in process_without_upsampling and the oversampled path with chain.process_block().
  • Add NamStage::process_block (batched, allocation-free in steady state) and is_active() accessor; add parity test against the per-sample path.
  • Add bench_nam_sample_vs_block and bench_nam_buffer_vs_sample benchmark groups; vendor reference_standard.nam with attribution README.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| rustortion-core/src/audio/engine.rs | Switch both process paths to block processing |
| rustortion-core/src/amp/stages/nam.rs | Add batched process_block, dry-scratch buffer, is_active(), and parity test |
| rustortion-core/benches/chain.rs | New NAM chain + raw model bench groups using vendored fixture |
| rustortion-core/tests/fixtures/README.md | Attribution/license notes for vendored reference_standard.nam |

Comment thread rustortion-core/benches/chain.rs Outdated
The model loads from tests/fixtures/, not the workspace nam/ directory.
@OpenSauce OpenSauce merged commit 3e9ea24 into main May 31, 2026
7 checks passed
@OpenSauce OpenSauce deleted the perf/nam-block-processing branch May 31, 2026 14:51