Fix spikeglx events memmap #1790
Conversation
```diff
 if len(dig_ch) > 0:
-    event_data = self._events_memmap
+    # Unpack bits on-demand - this is when memory allocation happens
+    event_data = np.unpackbits(self._events_memmap_digital_word.astype(np.uint8)[:, None], axis=1)
```
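For context on how much that one line allocates, here is a minimal sketch (toy values, not the neo source) of what `np.unpackbits` does to the packed digital word: each sample expands into eight uint8 values, one per digital line.

```python
import numpy as np

# a toy digital word channel; in the real reader this is a memmap over the NIDQ file
digital_word = np.array([0, 1, 3, 255, 256], dtype=np.int16)

# keep the low byte, add a column axis, then expand each word into 8 bit-channels
bits = np.unpackbits(digital_word.astype(np.uint8)[:, None], axis=1)
print(bits.shape)   # (5, 8): one row per sample, one column per digital line
print(bits[1])      # [0 0 0 0 0 0 0 1] -> bit 0 of word value 1 is high (MSB-first order)
```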
Should we cache this?
I think I agree with this. We were previously caching this, so if we are doing the unpacking we should cache, right?
We don't cache any other call to data. Why would this be different?
We were not caching before because we intended to; we were just loading the data at init because, I think, the PR was done by someone outside of the team.
I prefer my resources to be freed once the call to get_data returns. How are you thinking about it?
I think that's a fair point. How expensive is unpackbits? I don't know that well. It wouldn't be too hard to say:

```python
if self.event_data is not None:
    event_data = self.event_data
else:
    self.event_data = event_data = unpack
```

I guess the ultimate question is: if this is a cheap operation, then it doesn't hurt to do it in the init; it's just contrary to our strategy. If it is expensive (in time or memory) we should move it. If it is expensive in memory, then caching is expensive. If it is expensive in time, and possibly something a user would want multiple times, it would be nice to have a cache.
I'm curious to hear Sam on this one. I could be convinced not to cache it.
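For reference, a hedged sketch of that caching pattern as it might look in code; `_unpacked_event_bits` is a made-up attribute name here, only `_events_memmap_digital_word` comes from this PR.

```python
import numpy as np

def _get_unpacked_event_bits(self):
    # lazy cache: pay the unpackbits cost on first access, reuse afterwards
    if getattr(self, "_unpacked_event_bits", None) is None:
        self._unpacked_event_bits = np.unpackbits(
            self._events_memmap_digital_word.astype(np.uint8)[:, None], axis=1
        )
    return self._unpacked_event_bits
```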
Why the giant allocation? Is the memory actually being used to store the events? I just don't know what the events look like for this reader.
The question was more:
- Is the demultiplexing that happens with unpackbits (plus the memory allocation) relatively slow or relatively fast, and how does it scale with the length of the recording?
- Does it always demultiplex to the same n_channels, so that one could estimate the memory allocation from the length of the recording, or is this variable depending on what the user does? (A rough estimate is sketched below.)

I think moving this out of parse_header (sorry, I said init earlier) is smart, because we just need the memmap. My further question for caching is about speed and the memory allocation afterwards. I don't use events from the neo API at all, so it's hard for me to imagine the use cases. But maybe someone would try to get the events multiple times, in which case it might be nice to have a cache if the process is relatively slow. If it is fast, then absolutely no caching.
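On the second point, the allocation can be estimated directly from the recording length, since the unpacking always produces eight uint8 bit-columns per digital word sample. A rough back-of-the-envelope sketch (the sampling rate and duration below are made-up numbers):

```python
sampling_rate = 25_000        # Hz, illustrative NIDQ rate
duration_s = 3 * 60 * 60      # a hypothetical 3-hour recording
n_samples = sampling_rate * duration_s

# unpackbits yields an (n_samples, 8) uint8 array: 8 bytes per sample
unpacked_bytes = n_samples * 8
print(f"{unpacked_bytes / 1e9:.2f} GB")   # ~2.16 GB for this example
```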
The general question is whether we cache data requests in general. There is a trade-off between memory allocation (the allocation remains after the data request, which is bad) and making the next request faster. So this is the lens through which to discuss it, I think.
To my surprise, unpackbits turned out to be more expensive than I expected. However, it is not unusually costly compared to other operations we perform, such as demultiplexing signals (see Intan) or scaling them (see SpikeInterface return_scaled). In those cases we don't use caching, so I think we should be consistent: either apply caching across all of them or not use it at all.
Here are some performance comparisons:
echo "=== np.unpackbits ===" && python3 -m timeit -n 3 -r 3 "import numpy as np; d = np.random.randint(-32768, 32767, 216_000_000, dtype='int16'); np.unpackbits(d.astype(np.uint8)[:, None], axis=1)" && \
echo "=== bitwise_and ===" && python3 -m timeit -n 3 -r 3 "import numpy as np; d = np.random.randint(-32768, 32767, 216_000_000, dtype='int16'); np.bitwise_and(d, 1) > 0" && \
echo "=== stim decode ===" && python3 -m timeit -n 3 -r 3 "import numpy as np; d = np.random.randint(-32768, 32767, 216_000_000, dtype='int16'); mag = np.bitwise_and(d, 0xFF); sign = np.bitwise_and(np.right_shift(d, 8), 0x01); np.where(sign == 1, -mag.astype(np.int16), mag.astype(np.int16))" && \
echo "=== arithmetic ===" && python3 -m timeit -n 3 -r 3 "import numpy as np; d = np.random.randint(-32768, 32767, 216_000_000, dtype='int16'); d * 0.195"
=== np.unpackbits ===
3 loops, best of 3: 2.25 sec per loop
=== bitwise_and ===
3 loops, best of 3: 461 msec per loop
=== stim decode ===
3 loops, best of 3: 2.89 sec per loop
=== arithmetic ===
3 loops, best of 3: 1.27 sec per loop
I think we should merge this as it is, because it is consistent with the way the library works: there are similarly expensive, uncached data requests in other places. It is OK to change this, but then we should adopt a general policy and change it consistently in all the places.
Actually, I think that rawio should remain simple and without this type of complication. People can still cache the data on the spot if they want.
I'm fine with this argument in general. Let's wait until tomorrow in case Sam wants to strongly fight for the other direction and if not we merge tomorrow afternoon?
Ok.
OK for me.
Let's discuss the cache before merge
This PR fixes the memory inefficiency reported in #1610, where parsing NIDQ files with digital channels would allocate excessive memory during parse_header(). For a 2.8 GiB NIDQ file, initialization would allocate ~3.9 GiB of RSS, making it impossible to work with large recordings without running out of memory.

The fix implements lazy loading of event data by deferring the np.unpackbits() operation from _parse_header() to _get_event_timestamps(). During initialization we now only create a memmap view of the digital word channel (virtual memory, no RAM allocation). The actual unpacking into individual event bits happens only when events are accessed, which significantly reduces the memory footprint at initialization. Additionally, sample_length is now calculated from the actual file size instead of the metadata, to handle stub/modified files correctly (we did something similar before, I don't know if you remember).

I ran some benchmarks with memray on some work files (an 844 MB NIDQ file): peak memory dropped from 2.9 GB to 906 MB (a 68.7% reduction), and real RSS at initialization dropped from ~3.9 GB to ~20 MB (a 99.5% reduction).
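For readers skimming the thread, a minimal before/after sketch of the idea; the class and method names below are illustrative stand-ins, not the actual SpikeGLXRawIO code.

```python
import numpy as np

class EagerEvents:
    """Old behaviour: unpack every event bit while parsing the header."""
    def __init__(self, digital_word):
        # the full (n_samples, 8) uint8 array is allocated up front
        self._events_memmap = np.unpackbits(digital_word.astype(np.uint8)[:, None], axis=1)

    def get_event_bits(self):
        return self._events_memmap

class LazyEvents:
    """New behaviour: keep only the (mem)mapped digital word, unpack on demand."""
    def __init__(self, digital_word):
        self._events_memmap_digital_word = digital_word  # memmap view: virtual memory only

    def get_event_bits(self):
        # allocation happens here, only when events are actually requested
        return np.unpackbits(
            self._events_memmap_digital_word.astype(np.uint8)[:, None], axis=1
        )

# toy usage with an in-memory array standing in for the NIDQ memmap
word = np.array([0, 1, 255], dtype=np.int16)
assert np.array_equal(EagerEvents(word).get_event_bits(), LazyEvents(word).get_event_bits())
```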