
Conversation

@samuelgarcia (Member) commented Sep 3, 2024

  • Add machinery to compute noise levels in parallel
  • Add get_random_recording_slices() to implement more future strategies for random chunks (e.g. non-overlapping,
    regular, uniform, ...)

A very important change is that seed=None is now the default (instead of seed=0) in the function, which I think is the right way: the seed must be explicit, not implicit. The consequence is:

  • all tests that run get_random_data_chunk() twice (sometimes this is hidden) are no longer guaranteed
    to give the same results. The solution is to explicitly seed everywhere, which is good practice.
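
For illustration, a minimal sketch of what explicit seeding looks like with this machinery. The keyword names (num_chunks_per_segment, and passing the seed through random_slices_kwargs) are assumptions taken from the existing get_random_data_chunks() signature and from the docstring excerpt reviewed further down; treat the calls as a sketch, not the definitive API.

```python
from spikeinterface.core import (
    generate_recording,
    get_random_recording_slices,
    get_noise_levels,
)

# toy recording for the example; any BaseRecording works
recording = generate_recording(durations=[10.0], num_channels=4, seed=2205)

# Explicit seed -> the randomly drawn slices are reproducible across calls.
slices_a = get_random_recording_slices(recording, num_chunks_per_segment=20, seed=2205)
slices_b = get_random_recording_slices(recording, num_chunks_per_segment=20, seed=2205)
assert slices_a == slices_b

# With the new default seed=None this equality is no longer guaranteed,
# so tests (and any pipeline that caches results) should always pass a seed.
# The random_slices_kwargs name follows the docstring discussed below.
noise_levels = get_noise_levels(recording, random_slices_kwargs=dict(seed=2205))
```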

@yger @cwindolf: have a look at this, it is a first step toward a better noise level estimate in SI.

@cwindolf (Collaborator) commented Sep 3, 2024

Looks cool! @oliche's strategy could be implemented here now.

@h-mayorquin (Collaborator) left a comment

Two general questions:

  1. Will this not fail with formats that lock IO access to the same region if the chunks overlap?
  2. I am still confused about why computing noise requires so many samples. The methods we use assume normality (we use MAD to estimate the std), yet we sample far more than the convergence criteria of normal distributions would naively suggest. What gives? Is there some empirical work on this? Now that a lot of open data is available, estimating sampling requirements for a variety of neural data (species, areas, etc.) could be done. It appears to me that this could be a quick and informative paper that we could put out for the community if there is no previous work.
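
For context, the estimator under discussion is essentially the per-channel MAD rescaled to a standard deviation; a minimal sketch of the idea (an illustration, not the exact code in this PR):

```python
import numpy as np

def mad_std(chunks: np.ndarray) -> np.ndarray:
    """Noise std per channel from concatenated random chunks (samples x channels).

    The 1.4826 factor makes the MAD a consistent estimator of the standard
    deviation *under a Gaussian assumption*, which is exactly the assumption
    questioned here.
    """
    med = np.median(chunks, axis=0)
    mad = np.median(np.abs(chunks - med), axis=0)
    return 1.4826 * mad
```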

)


if worker_ctx["method"] == "mad":
Collaborator

Should pre-allocate to reduce memory footprint.

Member Author

We do not have a guarantee that chunks have the same size.
I think this will be handled more globally when adding out=... in get_traces().

force_recompute: bool = False,
**random_chunk_kwargs,
**kwargs,
# **random_chunk_kwargs,
@h-mayorquin (Collaborator) commented Sep 4, 2024

Why not just pass different dicts? This is more complicated to read and to document, and it has caused bugs before with the verbose and job_kwargs story.

Member Author

ok

Member Author

This was to keep backward compatibility, I guess.

Member Author

This is something we should discuss. I was in favor of what you propose, but this will break backward compatibility.

@samuelgarcia (Member Author)

  1. Will this not fail with formats that lock IO access to the same region if the chunks overlap?

Really good point! The access is read-only.
We will be able to add a non-overlapping option in the random slices.

  2. I am still confused about why computing noise requires so many samples. The methods we use assume normality (we use MAD to estimate the std), yet we sample far more than the convergence criteria of normal distributions would naively suggest. What gives? Is there some empirical work on this? Now that a lot of open data is available, estimating sampling requirements for a variety of neural data (species, areas, etc.) could be done. It appears to me that this could be a quick and informative paper that we could put out for the community if there is no previous work.

Honestly, I was pretty sure that the number of samples used was enough.
After discussion with @cwindolf I get the impression that we should have more...
Charlie, any comment?

recording_slices = []
low = margin_frames
size = num_chunks_per_segment
for segment_index in range(num_segments):
Collaborator

There could be an option to avoid overlapping chunks. This was not really necessary as long as we did not take too many chunks relative to the size of the recording, but if we are taking more, maybe it is worth considering.

Member Author

Yes, and this would be a new method.

Member Author

For a future PR.
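
For illustration, a sketch of what such a non-overlapping strategy could look like in that future PR. The helper name and arguments here are hypothetical, not part of this PR:

```python
import numpy as np

def non_overlapping_random_slices(num_frames, chunk_size, num_chunks, margin_frames=0, seed=None):
    """Draw num_chunks non-overlapping chunks of chunk_size frames within one
    segment, keeping margin_frames away from the borders (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    usable = num_frames - 2 * margin_frames
    max_chunks = usable // chunk_size
    if num_chunks > max_chunks:
        raise ValueError("segment too short for the requested number of non-overlapping chunks")
    # pick chunk positions on a grid of disjoint bins so chunks can never overlap
    bins = rng.choice(max_chunks, size=num_chunks, replace=False)
    starts = margin_frames + np.sort(bins) * chunk_size
    return [(int(start), int(start + chunk_size)) for start in starts]
```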

@alejoe91 added the core (Changes to core module) label on Sep 4, 2024
@cwindolf (Collaborator) commented Sep 4, 2024

Yeah... in my experience, more blocks help to stabilize the estimate (let's say we want numbers within x% of each other across runs with different seeds). The data is certainly not Gaussian: it has spikes, and spike activity can vary wildly across a recording. So with very few blocks, they will by chance land disproportionately in higher- or lower-activity regions (maybe in different ways across channels). You need a good number of blocks to reduce that effect -- for short or super consistent recordings, fewer blocks may be fine.

Also, it would be cool if si.zscore() and the other normalize_scale stuff could use these tools :)
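
A toy numpy experiment illustrating the stabilization effect described above (a synthetic trace with injected outliers standing in for spikes; all numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 30_000
trace = rng.normal(0.0, 10.0, size=60 * fs)                # 60 s of Gaussian noise, std = 10
spike_idx = rng.choice(trace.size, size=6_000, replace=False)
trace[spike_idx] += rng.normal(-80.0, 20.0, size=6_000)    # sparse large deflections ("spikes")

def mad_std(x):
    return 1.4826 * np.median(np.abs(x - np.median(x)))

chunk_size = 10_000
for num_chunks in (5, 20, 100):
    estimates = []
    for seed in range(20):                                  # rerun with 20 different seeds
        starts = np.random.default_rng(seed).integers(0, trace.size - chunk_size, size=num_chunks)
        chunks = np.concatenate([trace[s : s + chunk_size] for s in starts])
        estimates.append(mad_std(chunks))
    spread = 100 * (max(estimates) - min(estimates)) / np.mean(estimates)
    print(f"{num_chunks:4d} chunks -> spread across seeds ~{spread:.1f}%")
```

In a real recording the effect is stronger than in this stationary toy example, because spike activity varies across time and channels.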

@h-mayorquin (Collaborator)

But if the data is not Gaussian, would that mean that using MAD to estimate the std is wrong? This assumes normality:

https://en.wikipedia.org/wiki/Median_absolute_deviation
(see the "Relation to standard deviation" section there)

Anyway, if your experience is that more samples stabilize the estimator, then I think that trumps these considerations.

@cwindolf (Collaborator) commented Sep 4, 2024

Yeah, it's wrong! But I don't have any better ideas. Ideally one would be able to subtract away all of the spikes and then MAD the residuals (which would ideally be only Gaussian noise, but even that is not 100% true...), but that requires sorting, which requires some kind of standardization...

@h-mayorquin (Collaborator)

Agree on the limitation. Thanks for answering my questions.

@alejoe91 (Member)

@samuelgarcia can you fix the tests? There are some concatenated ones that trigger some errors.

Comment on lines 764 to 767
random_slices_kwargs : dict
Options transmited to get_random_recording_slices(), please read documentation from this
function for more details.
**job_kwargs:
Member

I would actually copy/paste the kwargs into the docstring here, since it's a much higher-level API.

@alejoe91 (Member) left a comment

Super @samuelgarcia!!!

Just a few suggestions and failing tests to fix

@samuelgarcia merged commit 0df1160 into SpikeInterface:main on Oct 25, 2024
15 checks passed
@samuelgarcia deleted the improve_noise_level_machinery branch on July 29, 2025 13:43