preallocate decoder hot path scratch #9
Open
jl33-ai wants to merge 1 commit into
_process_lfp_timestamp runs at ~167 Hz (180-sample bin at 30 kHz spike clock) and was allocating per tick:
- two np.zeros(cred_int_bufsize) scratch arrays
- a logical_and mask
- two no-op atleast_2d wrappers around already-2d slices

that allocation churn is the kind of thing that lets generational gc accumulate enough pressure to produce 50-100 ms tail latency spikes, which would explain some of the worst case timing the lab has seen on the python side.

changes:
- enc_cred_intervals, enc_argmaxes, and the spike mask are now instance attrs allocated once in __init__ and reused via .fill(0) / out=
- dropped the two atleast_2d calls since boolean indexing a 2d array with a 1d mask already returns 2d
- hoisted self.p[...] lookups out of the inner loop
- swapped msg.tobytes() for [msg, MPI.BYTE] in the three send paths so MPI gets the numpy buffer directly. wire format unchanged so receivers don't need any change.

no algorithmic change. semantics preserved. same pass is worth doing in encoder_process and ripple_process but kept this commit scoped to the decoder.
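the atleast_2d claim can be checked in isolation. a minimal sketch, assuming a 2d buffer with one row per spike; the names `spikes`, `mask`, and `empty_mask` are hypothetical and not taken from the decoder code:

```python
import numpy as np

# Hypothetical stand-in for a 2-d spike buffer: 6 spikes x 2 columns.
spikes = np.arange(12, dtype=np.float64).reshape(6, 2)

# 1-d boolean masks over the rows (axis 0).
mask = np.array([True, False, True, True, False, False])
empty_mask = np.zeros(6, dtype=bool)

selected = spikes[mask]             # rows where mask is True
none_selected = spikes[empty_mask]  # no rows selected, still 2-d
wrapped = np.atleast_2d(selected)   # nothing to promote, input is already 2-d

print(selected.shape, none_selected.shape, wrapped.shape)
# (3, 2) (0, 2) (3, 2)
```

the result stays 2d even when the mask selects nothing, so removing the wrapper does not change the shapes seen downstream.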
This was referenced May 29, 2026
`_process_lfp_timestamp` runs at ~167 Hz (180-sample bin at 30 kHz spike clock) and was allocating per tick:

- two `np.zeros(cred_int_bufsize)` scratch arrays
- a `logical_and` mask
- two no-op `atleast_2d` wrappers around already-2d slices
- a `bytes` copy of the structured numpy buffer for every `MPI.Send` via `msg.tobytes()`

in aggregate that is hundreds of short-lived numpy/bytes objects per second, exactly the kind of churn that lets generational gc accumulate enough pressure to produce the 50-100 ms tail latency spikes the lab has seen on the python side. the math here is not the bottleneck, the allocator/gc interaction is.
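to make the `tobytes()` item concrete, here is a small sketch of the copy it makes. the message dtype and field names are assumptions for illustration only (the real layout is not shown in this PR), and no MPI call is involved:

```python
import numpy as np

# Hypothetical message layout; field names and types are illustrative only.
msg_dtype = np.dtype([("timestamp", "<i8"), ("value", "<f8")])
msg = np.zeros(1, dtype=msg_dtype)
msg["timestamp"] = 180
msg["value"] = 0.5

copied = msg.tobytes()     # new bytes object, independent of msg
raw = msg.view(np.uint8)   # byte-level view over msg's own memory

# Both paths expose exactly the same raw bytes.
same_wire_bytes = raw.tobytes() == copied

# Mutating msg afterwards shows which one was a copy.
msg["timestamp"] = 360
copied_ts = int(np.frombuffer(copied, dtype=msg_dtype)["timestamp"][0])
view_ts = int(raw.view(msg_dtype)["timestamp"][0])

print(same_wire_bytes, copied_ts, view_ts)
# True 180 360
```

the copy still holds the old timestamp while the view follows the array, which is the per-send allocation the list above refers to.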
changes:
- `enc_cred_intervals`, `enc_argmaxes`, and the spike-bin mask move to instance attrs allocated once in `__init__` and reused via `.fill(0)` / `out=` per tick.
- dropped the two `np.atleast_2d` calls. boolean indexing a 2d array with a 1d bool mask already returns 2d, so the wrapper was a no-op that only added call overhead per tick.
- hoisted `self.p[...]` lookups out of the inner loop.
- swapped `msg.tobytes()` for `[msg, MPI.BYTE]` in the three send paths so MPI hands the numpy buffer to the network layer directly without allocating a fresh bytes per send. wire format unchanged (raw bytes), so the `bytearray` + `np.frombuffer` receivers don't need any change.

no algorithmic change, semantics preserved. happy to add a microbenchmark if useful.
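for reviewers who want the reuse pattern in isolation, a minimal sketch follows. it is not the code in this PR: the class `DecoderScratch`, the method `begin_tick`, the `_tmp_mask` buffer, and the fixed-length `spike_times` input are all hypothetical, and only the attribute names `enc_cred_intervals` and `enc_argmaxes` come from the description above.

```python
import numpy as np


class DecoderScratch:
    """Hypothetical holder for per-tick scratch buffers (illustration only)."""

    def __init__(self, cred_int_bufsize: int, n_spikes: int):
        # Allocated once, then reused on every tick.
        self.enc_cred_intervals = np.zeros(cred_int_bufsize)
        self.enc_argmaxes = np.zeros(cred_int_bufsize)
        self.spike_mask = np.zeros(n_spikes, dtype=bool)
        self._tmp_mask = np.zeros(n_spikes, dtype=bool)

    def begin_tick(self, spike_times, bin_start, bin_end):
        # Reset in place instead of calling np.zeros again.
        self.enc_cred_intervals.fill(0)
        self.enc_argmaxes.fill(0)
        # Build bin_start <= t < bin_end without temporaries: each ufunc
        # writes its result into a preallocated buffer via out=.
        np.greater_equal(spike_times, bin_start, out=self.spike_mask)
        np.less(spike_times, bin_end, out=self._tmp_mask)
        np.logical_and(self.spike_mask, self._tmp_mask, out=self.spike_mask)
        return self.spike_mask


scratch = DecoderScratch(cred_int_bufsize=4, n_spikes=5)
times = np.array([10, 190, 200, 359, 360])

first = scratch.begin_tick(times, 180, 360)
first_result = first.tolist()  # [False, True, True, True, False]

second = scratch.begin_tick(times, 0, 180)
second_result = second.tolist()  # [True, False, False, False, False]

print(first is second)  # True: the same mask buffer is returned every tick
```

the tradeoff is that the returned mask is only valid until the next tick, so any caller that needs to keep it has to copy it explicitly.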
this is the first piece of a broader latency hygiene pass. the same approach is worth doing in `encoder_process` and `ripple_process` (both have similar per-tick allocations) but I kept this PR scoped to the decoder so it stays reviewable.