n_workers kwarg for FilterRecording and CommonReferenceRecording #4564
Open

galenlynch wants to merge 1 commit into SpikeInterface:main
Adds opt-in intra-chunk thread-parallelism to two preprocessors: channel-split sosfilt/sosfiltfilt in FilterRecording, time-split median/mean in CommonReferenceRecording. Default n_workers=1 preserves existing behavior.

Per-caller-thread inner pools
-----------------------------

Each outer thread that calls ``get_traces()`` on a parallel-enabled segment gets its own inner ThreadPoolExecutor, stored in a ``WeakKeyDictionary`` keyed by the calling ``Thread`` object. Rationale:

* Avoids the shared-pool queueing pathology that would occur if N outer workers (e.g., TimeSeriesChunkExecutor with n_jobs=N) all submitted into a single shared pool with fewer max_workers than outer callers. Under a shared pool, ``n_workers=2`` with ``n_jobs=24`` thrashed at 3.36 s on the test pipeline; per-caller pools: 1.47 s.
* Keying by the Thread object (not the thread-id integer) avoids the thread-id-reuse hazard: thread IDs can be reused after a thread dies, which would cause a new thread to silently inherit a dead thread's pool.
* WeakKeyDictionary + weakref.finalize ensures automatic shutdown of the inner pool when the calling thread is garbage-collected. The finalizer calls ``pool.shutdown(wait=False)`` to avoid blocking the finalizer thread; in-flight tasks would be cancelled, but the owning thread submits and joins synchronously, so none exist when it exits.

When useful
-----------

* Direct ``get_traces()`` callers (interactive viewers, streaming consumers, mipmap-zarr tile builders) that don't use ``TimeSeriesChunkExecutor``.
* Default SI users who haven't tuned job_kwargs.
* RAM-constrained deployments that can't crank ``n_jobs`` to core count: on a 24-core host, ``n_jobs=6, n_workers=2`` gets within 8% of ``n_jobs=24, n_workers=1`` at ~1/4 the RAM.
Performance (1M × 384 float32 BP+CMR pipeline, 24-core host, thread engine)
---------------------------------------------------------------------------

=== Component-level (scipy/numpy only) ===
sosfiltfilt serial → 8 threads:  7.80 s → 2.67 s (2.92x)
np.median serial → 16 threads:   3.51 s → 0.33 s (10.58x)

=== Per-stage end-to-end (rec.get_traces) ===
Bandpass (5th-order, 300-6k Hz): 8.59 s → 3.20 s (2.69x)
CMR median (global):             4.01 s → 0.81 s (4.95x)

=== CRE outer × inner Pareto, per-caller pools ===
outer=24, inner=1 each: 1.54 s (100% of peak)
outer=24, inner=8 each: 1.42 s (108% of peak; oversubscribed)
outer=12, inner=1 each: 1.59 s (97%, ~1/2 RAM of outer=24)
outer=6,  inner=2 each: 1.75 s (92%, ~1/4 RAM of outer=24)
outer=4,  inner=6 each: 1.83 s (87%, ~1/6 RAM with 24 threads)

Tests
-----

New ``test_parallel_pool_semantics.py`` verifies the per-caller-thread contract: a single caller reuses one pool; concurrent callers get distinct pools. Existing bandpass + CMR tests still pass.

Independent of the companion FIR phase-shift PR (perf/phase-shift-fir); the two can land in either order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds opt-in intra-chunk thread-parallelism to two preprocessors:
- `FilterRecording(n_workers=N)` — channel-split `sosfilt`/`sosfiltfilt`
- `CommonReferenceRecording(n_workers=N)` — time-split `median`/`mean`

Default `n_workers=1` preserves existing behavior bit-for-bit. Each outer thread that calls `get_traces()` on a parallel-enabled segment gets its own inner `ThreadPoolExecutor` (per-caller-thread pool semantics), so the kwarg composes cleanly with `TimeSeriesChunkExecutor` outer parallelism — no shared-pool queueing pathology.

**Headline.** 1M × 384 float32, 24-core x86_64 host:
Per-stage (this PR alone):

| Stage | Serial | Parallel | Speedup |
| --- | --- | --- | --- |
| `FilterRecording(n_workers=8)` | 8.59 s | 3.20 s | 2.69× |
| `CommonReferenceRecording(n_workers=16)` | 4.01 s | 0.81 s | 4.95× |

Full pipeline (combined with the companion FIR phase-shift PR): the stock PS FFT dominates the pipeline, so this PR alone only moves full-pipeline wall-clock ~1.1× — per-stage gains are dwarfed by the unchanged PS cost. Full-pipeline `get_traces()` numbers (int16 and f32-propagated paths) require both PRs merged.
Motivation
SI's existing `TimeSeriesChunkExecutor` (formerly `ChunkRecordingExecutor`) uses outer chunk-parallelism: each worker pulls a time slice and processes it serially. This is efficient for batch workflows (`write_binary_recording`, sorters, node pipelines) that own the control flow end-to-end.

It doesn't serve direct `rec.get_traces(start, end)` callers: interactive viewers, streaming consumers, custom loops with their own prefetch scheduling. Those callers can't reach CRE's parallelism without adopting its batch-processing API. This PR adds a second axis — intra-chunk thread parallelism, applied within a single `get_traces()` call — that any caller can opt into by passing a kwarg.

Secondary motivation: default SI users (`n_jobs=1` is the default) get parallelism without needing to configure multi-process `job_kwargs`.

Changes
1. `FilterRecording(n_workers=N)` — channel-parallel SOS

File: `src/spikeinterface/preprocessing/filter.py`

- New `n_workers` kwarg (default `1`, preserving existing behavior).
- When `n_workers > 1`, `FilterRecordingSegment.get_traces` splits the channel axis into contiguous blocks and runs `scipy.signal.sosfiltfilt`/`sosfilt` on each block in a per-caller-thread `ThreadPoolExecutor`.
- The channel axis is split into `2 * n_workers` blocks.
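To make the channel-split idea concrete, here is a minimal standalone sketch (not the PR's actual code) of zero-phase SOS filtering with the channel axis divided into `2 * n_workers` contiguous blocks; the function name `sosfiltfilt_channel_split` is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from scipy import signal


def sosfiltfilt_channel_split(sos, traces, n_workers=4):
    """Zero-phase filter each contiguous channel block in its own thread.

    scipy's SOS filters release the GIL in their C inner loop, so plain
    threads give real speedup on (samples, channels) trace arrays.
    """
    if n_workers <= 1:
        return signal.sosfiltfilt(sos, traces, axis=0)
    # Two blocks per worker keeps the pool busy when blocks finish unevenly.
    blocks = [b for b in np.array_split(np.arange(traces.shape[1]), 2 * n_workers) if len(b)]
    out = np.empty(traces.shape, dtype=np.result_type(traces, np.float64))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(signal.sosfiltfilt, sos, traces[:, b], axis=0) for b in blocks]
        for b, fut in zip(blocks, futures):
            out[:, b] = fut.result()
    return out
```

Because `sosfiltfilt` treats each channel independently along `axis=0`, the block-wise result is numerically identical to the serial call.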
2. `CommonReferenceRecording(n_workers=N)` — time-parallel reduction

File: `src/spikeinterface/preprocessing/common_reference.py`

- New `n_workers` kwarg (default `1`).
- Only the global-reference hot path (`group_indices=None`, `reference="global"`, `ref_channel_ids=None`) is parallelized — every other configuration delegates to the existing logic unchanged.
- When `n_workers > 1`, `_parallel_reduce_axis1` splits the time axis into blocks and runs `np.median`/`np.mean` per block in a per-caller-thread pool.
- Below `min_block=8192` samples per thread the overhead dominates; falls back to serial automatically.

3. Per-caller-thread inner pool design
Each outer thread that calls `get_traces()` on a parallel-enabled segment gets its own lazy `ThreadPoolExecutor`, tracked in a `weakref.WeakKeyDictionary` keyed by the calling `Thread` object, with a `weakref.finalize(thread, pool.shutdown, wait=False)` cleanup hook. A single shared inner pool would bottleneck under CRE — at `n_jobs=24, n_workers=2`, 24 outer threads submitting 2 tasks each into a 2-worker pool measured 3.36 s; per-caller pools measured 1.47 s (2.3× faster at the same thread budget). Keying by `Thread` (not the thread-id integer) avoids thread-id reuse; the weakref + finalize pair ensures long-running processes don't accumulate zombie pools.

Correctness
- Filtered traces, `n_workers > 1` vs serial: `np.allclose(rtol=1e-5)`.
- CMR traces, `n_workers > 1` vs serial: `np.array_equal` (median) / `np.allclose(rtol=1e-5)` (mean).
- Single caller reuses one pool: `pool_a is pool_b` (`tests/test_parallel_pool_semantics.py`).
- Concurrent callers get distinct pools: `pool_a is not pool_b`.

All existing tests for both modules pass unchanged.
Performance (reproducible)
`benchmarks/preprocessing/bench_perf.py` — synthetic NumpyRecording, 1M × 384 float32, measured on a 24-core x86_64 host.

Component-level (hot kernel only)
No SI plumbing — just raw scipy/numpy calls. Shows the ceiling for each kernel on this hardware:
| Kernel | Serial | Threaded | Speedup |
| --- | --- | --- | --- |
| `scipy.signal.sosfiltfilt` (1M × 384 float32) | 7.80 s | 2.67 s (8 threads) | 2.92× |
| `np.median(axis=1)` (1M × 384 float32) | 3.51 s | 0.33 s (16 threads) | 10.58× |

Per-stage end-to-end (`rec.get_traces()`)

Full SI preprocessing class through `get_traces()`, including margin fetch, buffer copies, casts, and subtraction:

| Stage | Serial | Threaded | Speedup |
| --- | --- | --- | --- |
| Bandpass (5th-order, 300–6k Hz) | 8.59 s | 3.20 s | 2.69× |
| CMR median (global) | 4.01 s | 0.81 s | 4.95× |

End-to-end ratios are lower than component-level because the non-parallelizable glue (margin fetch, dtype cast, subtract) dilutes the speedup. Bandpass and CMR scale sub-linearly with thread count due to DRAM bandwidth saturation.
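The time-split reduction behind the CMR numbers can be sketched as follows — a simplified standalone version, not the PR's `_parallel_reduce_axis1` itself, including the serial fallback below `min_block=8192` samples per thread that the Changes section describes:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

MIN_BLOCK = 8192  # below this many samples per thread, overhead dominates


def parallel_reduce_axis1(traces, operator, n_workers=4):
    """Reduce across channels (axis=1) with the time axis split over threads.

    np.median/np.mean release the GIL while crunching, so threads scale;
    each output sample depends on one row only, so blocks are independent.
    """
    n_samples = traces.shape[0]
    if n_workers <= 1 or n_samples // n_workers < MIN_BLOCK:
        return operator(traces, axis=1)  # serial fallback for small inputs
    bounds = np.linspace(0, n_samples, n_workers + 1, dtype=int)
    spans = list(zip(bounds[:-1], bounds[1:]))
    out = np.empty(n_samples, dtype=traces.dtype)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(operator, traces[a:b], axis=1) for a, b in spans]
        for (a, b), fut in zip(spans, futures):
            out[a:b] = fut.result()
    return out
```

Since the reduction runs row-by-row, the parallel median is bit-identical to the serial one, matching the `np.array_equal` correctness check above.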
Pareto frontier under CRE: outer × inner
At `chunk_duration="1s"` (SI default), different splits of a ~24-thread compute budget on the BP+CMR pipeline, per-caller-thread pools:

| outer (`n_jobs`) | inner (`n_workers`) | wall-clock | vs. peak |
| --- | --- | --- | --- |
| 24 | 1 | 1.54 s | 100% |
| 24 | 8 | 1.42 s | 108% (oversubscribed) |
| 12 | 1 | 1.59 s | 97% (~1/2 RAM of outer=24) |
| 6 | 2 | 1.75 s | 92% (~1/4 RAM of outer=24) |
| 4 | 6 | 1.83 s | 87% (~1/6 RAM, 24 threads) |
Key observations:
- `outer=12, inner=1` is within 3% of `outer=24, inner=1`; doubling cores past that gives diminishing returns.
- `outer=6, inner=2` reaches 92% of peak at ~¼ the RAM; `outer=8, inner=3` matches `outer=12, inner=1` at ⅔ the RAM.

CRE interaction tables
For BP specifically (inner pool = 8, matching CRE n_jobs=8):
For CMR (inner pool = 16, exceeds CRE n_jobs=8):
Tuning guidance
Recommended configurations by caller posture:
- Direct `get_traces()` on large windows (viewer, streaming consumer): `n_workers=core_count // 8` (more gives diminishing returns).
- Default SI user (`n_jobs=1`): `n_workers=8–16` as a per-stage setting.
- CRE already at `n_jobs≈core_count`: `n_workers=1` (outer parallelism is already near the DRAM ceiling).
- RAM-constrained CRE: `n_workers = cores // n_jobs` plus a margin.
- Maximum throughput: `n_jobs=core_count` + `n_workers=4–8` (oversubscribed).
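The RAM-constrained rule of thumb (`n_workers = cores // n_jobs` plus a margin) could be wrapped in a small helper; this is a hypothetical utility for illustration, not part of the PR:

```python
import os


def suggest_n_workers(n_jobs, margin=1, cap=8):
    """Hypothetical helper (not in this PR): pick an inner thread count
    from the rule of thumb n_workers ~= cores // n_jobs + margin."""
    cores = os.cpu_count() or 1
    if n_jobs >= cores:
        # Outer parallelism alone is already near the DRAM ceiling.
        return 1
    return max(1, min(cap, cores // n_jobs + margin))
```

The `cap` reflects the diminishing returns past ~8 inner threads seen in the Pareto table.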
- `n_workers=1` preserves existing semantics exactly.
- `_kwargs` dicts updated on both preprocessors; `save()`/`load()` round-trip the new kwargs correctly.
- No new dependencies: only `concurrent.futures.ThreadPoolExecutor`, `threading`, and `weakref` from the stdlib.
- The kwarg reaches the `bandpass_filter`, `highpass_filter`, `filter`, `notch_filter`, and `common_reference` wrapper functions via `**filter_kwargs`.

Review guide
- `filter.py`: `_apply_sos` helper + `n_workers` kwarg plumbing + `WeakKeyDictionary` pool map + `weakref.finalize` cleanup.
- `common_reference.py`: `_parallel_reduce_axis1` helper + the same pool-ownership pattern. Parallelization guarded to the global-reference hot path only.
- `tests/test_parallel_pool_semantics.py`: single-caller reuse + concurrent-caller isolation contract tests for both preprocessors.

Companion PR
An independent companion PR #4563 adds a sinc-FIR alternative to `PhaseShiftRecording` with ~100× per-stage speedup plus a memory win. No code dependency between the two; either can land first. Combined, they give 13–20× on a typical `PhaseShiftRecording → HighpassFilterRecording → CommonReferenceRecording` chain for direct `get_traces()` callers, or ~3× on top of existing CRE parallelism.

Checklist
- `_kwargs` updated