Improve noise level machinery #3359
Conversation
Looks cool! @oliche's strategy could be implemented here now.
Two general questions:
- Will this not fail with formats that lock IO access to the same region if the chunks overlap?
- I am still confused about why computing noise requires so many samples. The methods we use assume normality (we use the MAD to estimate the std), but then we go and sample far more than the convergence criteria of normal distributions would naively suggest. What gives? Is there some empirical work on this? Now that we have a lot of open data available, estimating sampling requirements for a variety of neural data (species, areas, etc.) could be done. It appears to me that this could be a quick and informative paper that we could put out for the community if there is no previous work.
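For reference on the second question, here is a minimal illustrative sketch (plain numpy, not the SpikeInterface implementation) of the MAD-based std estimate and its normality assumption:

```python
import numpy as np

def mad_std(samples):
    # Scaled MAD: for Gaussian data, median(|x - median(x)|) ~= 0.6745 * std,
    # so multiplying by 1.4826 recovers the standard deviation.
    med = np.median(samples)
    return 1.4826 * np.median(np.abs(samples - med))

x = np.random.default_rng(0).normal(loc=0.0, scale=10.0, size=100_000)
print(mad_std(x))   # close to the true std of 10 for Gaussian data
print(np.std(x))    # classical estimator, for comparison
```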
    )
    if worker_ctx["method"] == "mad":
Should pre-allocate to reduce memory footprint.
we do not have a guarantee that chunks have the same size.
I think this will be handled more globally when adding out=... in get_traces()
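For illustration, a hypothetical sketch of the pre-allocation idea (the helper name and the equal-size check are assumptions, not code from this PR; it only assumes a SpikeInterface-like recording API):

```python
import numpy as np

def collect_random_data(recording, recording_slices):
    # recording_slices: list of (segment_index, start_frame, end_frame)
    sizes = [end - start for (_, start, end) in recording_slices]
    num_channels = recording.get_num_channels()
    if len(set(sizes)) == 1:
        # All chunks have the same size: pre-allocate once and fill in place,
        # avoiding a Python list of chunks plus a final concatenate copy.
        out = np.empty((sum(sizes), num_channels), dtype="float32")
        pos = 0
        for segment_index, start_frame, end_frame in recording_slices:
            chunk = recording.get_traces(segment_index=segment_index,
                                         start_frame=start_frame, end_frame=end_frame)
            out[pos:pos + chunk.shape[0]] = chunk
            pos += chunk.shape[0]
        return out
    # Chunk sizes are not guaranteed to match, so fall back to concatenation.
    return np.concatenate([
        recording.get_traces(segment_index=seg, start_frame=start, end_frame=end)
        for seg, start, end in recording_slices
    ], axis=0)
```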
    force_recompute: bool = False,
    **random_chunk_kwargs,
    **kwargs,
    # **random_chunk_kwargs,
Why not just pass different dicts? This is more complicated to read and document, and it has caused bugs before with the verbose and job_kwargs story.
ok
this was to keep backward compatibility I guess.
This is something we should discuss. I was in favor of what you propose, but this will break backward compatibility.
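As a sketch of the "separate dicts" style discussed in this thread (the pattern only; this is not the actual signature adopted by the PR):

```python
# Illustrative sketch: each option group is an explicit, documentable dict instead
# of being folded into one **kwargs, so chunking options and job options cannot collide.
def get_noise_levels_sketch(recording, method="mad", force_recompute=False,
                            random_slices_kwargs=None, job_kwargs=None):
    random_slices_kwargs = dict(random_slices_kwargs or {})  # e.g. num_chunks_per_segment, seed
    job_kwargs = dict(job_kwargs or {})                       # e.g. n_jobs, chunk_duration
    return random_slices_kwargs, job_kwargs

# Callers pass the two groups explicitly, which is easier to read and document:
print(get_noise_levels_sketch(None,
                              random_slices_kwargs={"num_chunks_per_segment": 40, "seed": 0},
                              job_kwargs={"n_jobs": 4}))
```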
Really good point! The access is read only.
Honestly, I was pretty sure that the number of samples was enough.
    recording_slices = []
    low = margin_frames
    size = num_chunks_per_segment
    for segment_index in range(num_segments):
There could be an option to avoid overlapping chunks. This was not really necessary as long as we did not have too many chunks given the size of the recording, but if we are taking more, maybe it is worth considering.
yes, and this would be a new method
for a future PR
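A possible shape for such a non-overlapping strategy (hypothetical sketch for that future PR, not code from this one):

```python
import numpy as np

def non_overlapping_chunk_starts(num_frames, chunk_size, num_chunks, margin_frames=0, seed=None):
    # Draw chunk positions on a regular grid inside the usable region and pick
    # grid cells without replacement, so no two chunks can overlap.
    rng = np.random.default_rng(seed)
    usable = num_frames - 2 * margin_frames
    num_cells = usable // chunk_size
    if num_chunks > num_cells:
        raise ValueError("not enough room for that many non-overlapping chunks")
    cells = rng.choice(num_cells, size=num_chunks, replace=False)
    return margin_frames + np.sort(cells) * chunk_size

print(non_overlapping_chunk_starts(num_frames=30_000 * 60, chunk_size=10_000,
                                   num_chunks=20, margin_frames=100, seed=0))
```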
Yeah... in my experience, using more blocks helps to stabilize the estimate (let's say we want numbers within x% of each other across runs with different seeds). The data certainly is not Gaussian: it has spikes, and spike activity can vary wildly across a recording. So using very few blocks, they will by chance disproportionately land in higher or lower activity regions (maybe in different ways across channels). You need a good number of blocks to reduce that effect -- for short or super consistent recordings maybe fewer blocks is fine. Also, it would be cool if
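To make that point concrete, a toy numpy sketch (synthetic data, not SpikeInterface code) where spike activity is confined to part of the recording; the estimate from a few blocks fluctuates much more across seeds than the one from many blocks:

```python
import numpy as np

fs = 30_000
noise = np.random.default_rng(0).normal(0.0, 10.0, size=fs * 120)  # 2 minutes, true noise std = 10
traces = noise.copy()

# Inject "spikes" only in the second half to mimic non-stationary activity.
spike_times = np.random.default_rng(1).integers(traces.size // 2, traces.size - 30, size=20_000)
for t in spike_times:
    traces[t:t + 30] += -80.0 * np.hanning(30)

def mad_std(x):
    return 1.4826 * np.median(np.abs(x - np.median(x)))

chunk_size = 10_000
for num_blocks in (3, 50):
    estimates = []
    for seed in range(20):
        rng = np.random.default_rng(seed)
        starts = rng.integers(0, traces.size - chunk_size, size=num_blocks)
        sample = np.concatenate([traces[s:s + chunk_size] for s in starts])
        estimates.append(mad_std(sample))
    estimates = np.asarray(estimates)
    spread = (estimates.max() - estimates.min()) / estimates.mean()
    print(f"{num_blocks:>3} blocks: relative spread across seeds = {spread:.1%}")
```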
But if the data is not Gaussian, would it mean that using the MAD to estimate the std is wrong? This assumes normality: https://en.wikipedia.org/wiki/Median_absolute_deviation Anyway, if your experience is that more samples stabilize the estimator, then I think that trumps these considerations.
Yeah, it's wrong! But I don't have any better ideas. Ideally one would be able to subtract away all of the spikes and then MAD the residuals (which would ideally be only Gaussian noise, but even that is not 100% true...), but that requires sorting, which requires some kind of standardization...
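As a hypothetical sketch of that "MAD on spike-free residuals" idea (assuming spike times are already known from some detection or sorting step; not part of this PR):

```python
import numpy as np

def mad_std_excluding_spikes(traces, spike_frames, half_window=30):
    # Mask a small window around each known spike and estimate the noise
    # from the remaining (hopefully spike-free) samples.
    keep = np.ones(traces.size, dtype=bool)
    for t in spike_frames:
        keep[max(0, t - half_window):t + half_window] = False
    residual = traces[keep]
    med = np.median(residual)
    return 1.4826 * np.median(np.abs(residual - med))
```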
Agree on the limitation. Thanks for answering my questions.
@samuelgarcia can you fix the test? There are some
    random_slices_kwargs : dict
        Options transmitted to get_random_recording_slices(), please read documentation from this
        function for more details.
    **job_kwargs:
I would actually copy/paste the kwargs in the docstring here, since it's a much higher-level API.
alejoe91 left a comment
Super @samuelgarcia!!!
Just a few suggestions and failing tests to fix
get_random_recording_slices() to implement more (future) strategies for random chunks (aka non-overlapping, regular, uniform...)
A very important change is that now seed=None (instead of seed=0) in the function, which I think is the right way: the seed must be explicit, not implicit. The consequence is:
calls to get_random_data_chunk() (sometimes this is hidden) are not guaranteed anymore to give the same results. The solution is to explicitly seed everywhere, which is good practice.
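A hedged usage sketch of the explicit-seeding recommendation (assuming the post-PR get_noise_levels exposes random_slices_kwargs as in the docstring hunk above; exact option names and keyword routing may differ across SpikeInterface versions):

```python
import spikeinterface.core as si

# Small synthetic recording just for illustration.
recording = si.generate_recording(durations=[10.0], num_channels=4, seed=0)

# Reproducible across runs only because the seed is passed explicitly.
noise_levels = si.get_noise_levels(
    recording, return_scaled=False, method="mad",
    random_slices_kwargs=dict(num_chunks_per_segment=20, seed=2205),
)
print(noise_levels)
```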
@yger
@cwindolf : have a look at this, it is a first step towards a better noise level estimate in SI.