The current noise generation machinery (both `NoiseGeneratorRecording` and the new `MockRecording`, name under discussion #4520) has a `strategy` argument that determines how noise is generated internally. `tile_pregenerated` was the first strategy, which I introduced in #1581, and `on_the_fly` was added later by @samuelgarcia in #1948 with the goal of reducing memory consumption even further by avoiding the upfront allocation of a noise block.
I think this argument should be removed. How the noise is generated internally is an implementation detail that we should not expose to the user. The choice of strategy does not affect the behavior of the class from the outside; we should just keep whichever one performs best. If `on_the_fly` consumed less memory without a significant cost to generation speed, we should use it. The point is that the internal generation method could change over time and we do not want to lock ourselves into supporting both as part of the public API. The central contract of the class is to behave like an array-like lazy generator of noise.
I profiled both strategies with memray to measure peak memory during `get_traces()`, which is the relevant metric for spikeinterface: in a preprocessing chain, each `get_traces` call's peak memory compounds through the stack. You can reproduce the results with `uv run profile_strategies.py` (gist).
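The memray runs in the gist are the authoritative numbers. As a rough illustration of the methodology, the peak allocation of a single call can also be captured with the stdlib `tracemalloc` (the helper below is a hypothetical sketch, not code from the gist; `tracemalloc` only sees allocations routed through Python's allocator, so it understates what memray reports):

```python
import tracemalloc

def peak_memory_mib(fn, *args, **kwargs):
    """Return (result, peak MiB allocated) for a single call.

    Sketch of the measurement idea: reset the peak counter,
    run the call, and read back the high-water mark.
    """
    tracemalloc.start()
    tracemalloc.reset_peak()
    result = fn(*args, **kwargs)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak / 2**20

# Example: peak memory of materializing a ~1 MB buffer.
buf, peak_mib = peak_memory_mib(bytearray, 1_000_000)
```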
### Peak memory during `get_traces()`

| Scenario | Output size | `tile_pregenerated` | `on_the_fly` | `on_the_fly` overhead |
| --- | --- | --- | --- | --- |
| 32ch, 1000 samples | 0.1 MB | 0.1 MB | 4.0 MB | 32x |
| 32ch, 30000 samples (1s) | 3.7 MB | 3.7 MB | 11.0 MB | 3x |
| 384ch, 1000 samples | 1.5 MB | 1.5 MB | 45.6 MB | 31x |
| 384ch, 30000 samples (1s) | 43.9 MB | 43.9 MB | 131.8 MB | 3x |
| 384ch, 90000 samples (3s) | 131.8 MB | 131.8 MB | 219.7 MB | 1.7x |
| 384ch, 1800000 samples (1min) | 2636.7 MB | 2636.7 MB | 2724.6 MB | 1.0x |
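As a sanity check on the table, the `tile_pregenerated` column is exactly the requested output size, channels × samples × itemsize, assuming the default float32 samples (the 4-byte constant below is that assumption, not something read from the code):

```python
BYTES_PER_SAMPLE = 4  # assumption: default float32 noise dtype

def output_size_mib(num_channels, num_samples):
    """Size in MiB of the array that get_traces() must return."""
    return num_channels * num_samples * BYTES_PER_SAMPLE / 2**20

# These reproduce the tile_pregenerated column above.
print(round(output_size_mib(384, 30_000), 1))     # 43.9
print(round(output_size_mib(384, 1_800_000), 1))  # 2636.7
```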
### Speed

| Scenario | `tile_pregenerated` | `on_the_fly` | Slowdown |
| --- | --- | --- | --- |
| 32ch, 1000 samples | 0.01 ms | 9.52 ms | 1087x |
| 32ch, 30000 samples (1s) | 0.57 ms | 17.63 ms | 31x |
| 384ch, 1000 samples | 0.09 ms | 111.43 ms | 1281x |
| 384ch, 30000 samples (1s) | 26.46 ms | 244.12 ms | 9x |
| 384ch, 90000 samples (3s) | 78.40 ms | 515.28 ms | 7x |
| 384ch, 1800000 samples (1min) | 1602.41 ms | 8144.07 ms | 5x |
`tile_pregenerated` is better overall. It allocates exactly the output array during `get_traces()`, nothing more, and is 5x to roughly 1300x faster depending on the scenario. There are two places where `on_the_fly` has an advantage:
- **Zero initialization cost.** `on_the_fly` allocates nothing at init, which is useful for serialization and dump/load. This has a simple solution: we can delay the generation of the tile until the first `get_traces()` call and cache it. That would give us zero init cost with the runtime performance of `tile_pregenerated`. We should do this regardless of which way we decide to go.
- **Non-repeating noise across blocks.** `on_the_fly` seeds each block with `(seed, block_index)`, so different blocks produce genuinely different noise. This might be necessary for simulated data. I don't see a simple fix: numpy's RNG doesn't support seeking, so reproducibility requires generating the full block even for small slices. The overhead is worst when the requested trace is small relative to the noise block (default 30000 samples, 1 second at 30kHz), which is the typical preprocessing chunk size (3x memory, 9x speed). For larger reads the overhead shrinks (see the 1-minute row) but the speed cost never disappears.
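The lazy-tile idea from the first point can be sketched as follows (class and attribute names are illustrative, not the actual `NoiseGeneratorRecording` internals):

```python
import numpy as np

class LazyTileNoise:
    """Sketch: defer tile generation to the first get_traces() call.

    Nothing is allocated at __init__, so construction and
    serialization stay free; the first read pays the one-time
    generation cost and later reads reuse the cached tile.
    """

    def __init__(self, num_channels, tile_samples=30_000, seed=0):
        self.num_channels = num_channels
        self.tile_samples = tile_samples
        self.seed = seed
        self._tile = None  # generated lazily, then cached

    @property
    def tile(self):
        if self._tile is None:
            rng = np.random.default_rng(self.seed)
            self._tile = rng.standard_normal(
                (self.tile_samples, self.num_channels), dtype="float32"
            )
        return self._tile

    def get_traces(self, start_frame, end_frame):
        # Wrap indices into the tile; the same block repeats forever.
        frames = np.arange(start_frame, end_frame) % self.tile_samples
        return self.tile[frames]
```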
Now that we are separating testing from simulation, tiling is the clear choice for the testing side. The open question is whether simulation needs non-repeating noise. @samuelgarcia @cwindolf @alejoe91 @chrishalcrow, does repeating the same noise block affect simulation quality or introduce artifacts in downstream analysis, or is a large enough block size indistinguishable from non-repeating noise in practice? I think we should leave the simulation side with `on_the_fly` as that is the current default.
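For reference in that discussion, the non-repeating behavior comes from deriving a fresh generator per block. A minimal sketch of that seeding scheme (not the actual `on_the_fly` code):

```python
import numpy as np

def block_noise(seed, block_index, num_samples, num_channels):
    """Reproducible, non-repeating noise for one block.

    Seeding with (seed, block_index) makes every block different
    while keeping the whole recording deterministic. The catch shown
    in the benchmarks: even a tiny slice inside a block forces the
    full block to be generated, because the RNG stream cannot seek.
    """
    rng = np.random.default_rng([seed, block_index])
    return rng.standard_normal((num_samples, num_channels), dtype="float32")
```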