Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SpikeInterface Enhancement Proposal: SortingAnalyzer object #2282

Open
samuelgarcia opened this issue Dec 1, 2023 · 27 comments
Open

SpikeInterface Enhancement Proposal: SortingAnalyzer object #2282

samuelgarcia opened this issue Dec 1, 2023 · 27 comments
Labels
SEP SpikeInterface Enhancement Proposal

Comments

@samuelgarcia
Copy link
Member

samuelgarcia commented Dec 1, 2023

Rationale

Since SpikeInterface v0.90, the processing of a recording + a spike sorting output started from the creation of a WaveformExtractor and the subsequent addition of data from the postprocessing and qualitymetrics (QM) modules as WaveformExtension (from v0.93).
While some postprocessing and QM functions need waveforms, this is not the case for all (e.g., computing correlograms, spike amplitude, ISI histograms). In addition, the WaveformExtractor also ended up including a bunch of additional metadata handling for storing and retrieving metadata of the recording and sorting objects, so it doesn’t only handle waveforms.

With this proposal, we therefore suggest to implement a new SpikeSortingResult object which will only combine a recording and sorting object, define sparsity, handle the respective metadata, and implement loading and saving to multiple backends (memory, binary folders, zarr)

Features

  • Combines Recording + Sorting
  • Caches Sorting and recording info
  • Works recordingless
  • Supports extensions
  • Waveforms is also a ResultExtension living in core
  • Other ResultExtension live elsewhere (postprocessing, qualitymetrics, …)
  • Ability to compute several extensions one by one ssr.compute(“spike_amplitude”) or spikeinterface.postprocessing.compute_spike_amplitudes(ssr)
  • Ability to compute several extensions at once: ssr.compute([“spike_amplitude”, “unit_location”])
  • Automate compute logic: registration of an extension automatically generates the function / docstring. etc.
  • Native backends: memory + binary (npy) + zarr (TODO)
  • Export backends: other backends that are not core-supported
  • (run_sorter could output a SpikeSortingResult, to be discussed)

API changes

Old API

recording = read_openephys(...)
sorting = run_sorter(‘tridesclous’, recording, …)
# this create the object and also extractor waveforms
# we=WaveformExtactor
we = extract_waveforms(recording, sorting, ‘/waveorm_folder’)
# then some post processing
amplitudes = compute_spike_amplitude(we)
qm = compute_quality_metrics(we)
unit_locations = compute_unit_locations(we)

plot_waveforms(we)
# this used to be a strange syntax:
plot_quality_metrics(we)

This new API will be

recording = read_openephys(...)
sorting = run_sorter(‘tridesclous’, recording, …)
# this create the object and also extractor waveforms
# ssr=SpikeSortingResult
ssr = create_result(recording, sorting, ‘/result_folder’) # name to be discussed!
# then some post processing
ssr.compute(“spike_amplitude”)
ssr.compute(“quality_metrics”)
ssr.compute(“unit_locations”)
# or call by functions to get some backyard compatibility
# function call return directly the results
amplitudes = compute_spike_amplitude(ssr )
qm = compute_quality_metrics(we)
unit_locations = compute_unit_locations(we)
#or 
ssr.compute([“spike_amplitude”, “quality_metrics”, “unit_locations”])

plot_waveforms(ssr)
# this is now a better syntax
plot_quality_metrics(ssr)

Refactor needed

Internal:

  • Quality metrics
  • Postprocessing
  • Widgets
  • Exporters
  • Curation

External:

  • SI-GUI
  • Lussac
  • NeuroConv

Transition plan

For a transition phase, we will need to keep a DummyWaveformExtractor that mimics the behavior of the WaveformExtractor but internally use a SpikeSortingReult

  • All docs and examples will need to be updated.
  • Important: keep a load_sorting_result_from_waveforms() for old results
  • New tutorial on SpikeSortingResult

Agenda

December 2023 : first public discussion draft
January 2024 : start dev in a separate branch
February 2024 : last release 0.100.0 with WaveformExtractor
Day after the release : main go in 100_bug_fix branch and dev become main
June 2024 : release 0.101.0 with SpikeSortingResult

Advanced users : we need your feeback!

@magland @colehurwitz @mhhennig @khl02007 @h-mayorquin @zm711 @DradeAW @yger @ferchaure @bendichter @CodyCBakerPhD @JoeZiminski @TomBugnon @grahamfindlay @cwindolf @julienboussard

Let's discuss here and then we will open a project for sub tasks.

@samuelgarcia samuelgarcia added the SEP SpikeInterface Enhancement Proposal label Dec 1, 2023
@DradeAW
Copy link
Contributor

DradeAW commented Dec 1, 2023

I love the idea!
I don't really like the name SpikeSortingResult, but as of now I can't think of a better name. I'll try to think of one :)

Maybe the waveforms could be an extension? (like ssr.compute("waveforms / templates", **params)).

Also another subject: Would it be possible to think of a way to handle multiple filterings?
Let me explain: I like to apply a different filter when computing the PCA, or when computing the spikes amplitude, and when I look at the templates I like them unfiltered. The current implementation forces me a create a new WaveformExtractor for every operation, which is a bit annoying.
So my question is: am I the only one that likes to do this? Would it be possible to implement this easily and not make everything a mess?

@samuelgarcia
Copy link
Member Author

Maybe the waveforms could be an extension? (like ssr.compute("waveforms / templates", **params)).

Yes it is the plan.

Also another subject: Would it be possible to think of a way to handle multiple filterings? Let me explain: I like to apply a different filter when computing the PCA, or when computing the spikes amplitude, and when I look at the templates I like them unfiltered. The current implementation forces me a create a new WaveformExtractor for every operation, which is a bit annoying. So my question is: am I the only one that likes to do this? Would it be possible to implement this easily and not make everything a mess?

This is un interresting feature.
A bit challenging.
We could think about having a main recording and also a list of other recording and every extenssion could choose which recording.

This is particularlly relevant to have raw vs processed metrics.

@DradeAW
Copy link
Contributor

DradeAW commented Dec 1, 2023

I don't really like the name SpikeSortingResult, but as of now I can't think of a better name. I'll try to think of one :)

To expend on that, to me the Sorting class is the spike-sorting result.
The combination of recording and sorting is more like an analysis, but that's not a very good name.

Here are the suggestions from ChatGPT, maybe it can give us some ideas?

  1. ElectroSort
  2. NeuroRecorder
  3. SpikeTraceAnalyzer
  4. ElectroPhysSorter
  5. DataSpikeIntegrator
  6. NeuralWaveSorter
  7. VoltageSpikeMapper
  8. NeuroPhysioSort
  9. SpikeDataFusion
  10. ElectroSpikeSynthesizer

This is un interresting feature.
A bit challenging.

I hope you mean "an interesting feature" and not "un-interesting feature" 😅
Yes I know it's challenging and I don't want to complicate the new class too much. I'm emitting the idea because I think it's a cool feature, but I would understand if it's too niche and/or too complicated.

@h-mayorquin
Copy link
Collaborator

@DradeAW says:

Sorting class is the spike-sorting result.
The combination of recording and sorting is more like an analysis, but that's not a very good name.

If by Sorting you mean BaseSorting then Yes, +1 on both points.

This seems like a class that can be convenient for analysis and that is lighter (in expectation) than the waveform extractor. I don't think this should or need to be tied to the output of the sorters.

From the point of view of neurconv, I would opose having the proposed class as an output of run_sorter. This would complicate things for us and I think the heuristic of not mixing representations that are good for IO with representations that are good for analysis is a good one.

From the point of view of spikeinterface it would be good for me personally to understand what are the undesirable features of the WaveformExtractor as that seems to be the driving force of this: to have a class that is useful for analysis yet simpler in some sense than the current WaveformExtractor. That will allow me to be helpful in this discussion as I am not a power user of the processing features of the library.

@DradeAW
Copy link
Contributor

DradeAW commented Dec 1, 2023

From the point of view of spikeinterface it would be good for me personally to understand what are the undesirable features of the WaveformExtractor as that seems to be the driving force of this: to have a class that is useful for analysis yet simpler in some sense than the current WaveformExtractor. That will allow me to be helpful in this discussion as I am not a power user of the processing features of the library.

I believe that the main driving force is that to create a WaveformExtractor object, you need to extract the waveforms (which is a slow process). And in some cases, you need a waveform extractor object, but actually don't need the waveforms.
Yes you can technically do wvf_extractor = si.WaveformExtractor(recording, sorting) and the waveforms aren't extracted, but this is not very pretty, and the name doesn't correspond to the object.

@h-mayorquin
Copy link
Collaborator

@DradeAW Thanks.

@samuelgarcia
Copy link
Member Author

samuelgarcia commented Dec 1, 2023

@h-mayorquin:
the main motivations:

  • mainly: avoid extract waveforms when not necessary
  • avoid spaghetti code of WaveformExtactor that handle both : save/load + gather rec/sorting + waveforms handling
  • a logical semantic plot_spike_amplitudes(WaveformExtactor) is somehow very strange because you do not
    need waveforms for that. It is hard for a begenner that WaveformExtactor is the first step for post processing.
  • make several postprocessing at once
  • with zarr backend would be easy to make a high quality and interactive viewer (webbased or qt based) that stream
    on fly everything from this obect without any data duplication.

The last is long term but would be a game changer:

  • recording on cloud
  • sorting on computing machine
  • push SpikeSortingResult on cloud
  • visualize directly with a rich app given an url

@grahamfindlay
Copy link

As someone who often uses a WaveformExtractor for things that don't touch waveforms, this makes a lot of sense. Thinking of a good name is hard. SortingAnalyzer, SpikeAnalyzer, PostsortingExtractor, PostprocesssingExtractor?

@magland
Copy link
Member

magland commented Dec 1, 2023

This sounds great! Here's my name suggestion

RecordingSorting or SortingRecording

def some_function(X: si.BaseRecordingSorting):
    ...

@khl02007
Copy link
Contributor

khl02007 commented Dec 1, 2023

  • no strong opinion, can see how this might be useful
  • why is "work recordingless" a feature to aim for? isn't that what a sorting is? or does it mean a recording is optional (but the metadata about it is still required)?
  • suggested name: SortedRecording

@yger
Copy link
Collaborator

yger commented Dec 4, 2023

I think it would be a great feature, indeed. I think I would go for a name such as RecordingSorting, as proposed by @magland . A couple of ideas also:

  • most of the time, we don't need waveforms to compute metrics, so +1 to be able to bypass this computation. It would be great, however, if such an object could handle the Templates (added by @h-mayorquin ) (often available at the end of a spike sorting pipelines). Because with the Templates, some metrics like SNR could be also available directly, without the need to extract waveforms.
  • Should the RecordingSorting object keep track of the preprocessing chain that has been performed to obtain the sorting? By doing so, when asked to extract waveforms, the object would immedialy know what algorithmic steps should be re-applied to get the waveforms corresponding to the sorting. Then, maybe one option would be to play with such preprocessing pipelines (as requested by @DradeAW ), and be able to extract from preprocessed, from raw, or from custom chains (and thus exposing a dict of metrics, depending on the preprocessing steps?)

@DradeAW
Copy link
Contributor

DradeAW commented Dec 4, 2023

I like the names RecordingSorting and SortedRecording (I slightly prefer the latter).

@yger
Copy link
Collaborator

yger commented Dec 4, 2023

Yes, SortedRecording works also for me, I think this is even better

@JoeZiminski
Copy link
Contributor

This looks really nice, I like SortedRecording also

@h-mayorquin
Copy link
Collaborator

@samuelgarcia Thanks for explaining the advantages point by point to me. I hope they are useful for someone else as well!

I am +1 on @khl02007 suggestion of SortedRecording.

@zm711
Copy link
Collaborator

zm711 commented Dec 4, 2023

This all makes sense to me. The logic is pretty sound to make the base postprocessing feature not rely on waveform extraction. And I think it will be way less confusing for the plotting functions.

A couple tiny comments:

  1. Are you absolutely going to guarantee backward compatibility for a few versions? Or would it be better to make a bigger version jump than .100-.101 (ie you feel confident that the dummy waveform extractor will work for people who are used to the old api as this major change happens)
  2. what is the plan for deprecating WaveformExtractor or is the current plan to always keep it as backward compatibility so that there is no major break in version numbering (implications of maintaining code vs just doing a hard dep and making people adjust) -- you could imagine maybe a conversion function long term that at least partially converts a WaveformExtractor so you maintain one function rather than a giant class
  3. what are process/kwarg/memory implications of chaining qc/postprocessing (ie forcing them to be separate seems a bit safer for memory etc, but to be honest I don't know for sure if chaining them would be a problem or not)--plug for the kwarg optimizer :)
  4. SortedRecording works for me

Anyway like I said, seems super cool and I think it will be useful in the end (both from the user perspective and from the schema perspective for teaching new users).

@TomBugnon
Copy link
Contributor

TomBugnon commented Dec 4, 2023

Makes sense to me as well. I hope for easy conversion or backward compatibility of analysis run with waveform extractors.

Personnally this seems like more of a type of Sorting extension than a SortedRecording, particularly since the recording is optional (and will often not be available when loading the extractor) so I prefer SpikeSortingResult or PostProcessingExtractor or SortingExtension or the like, but all is fine

@grahamfindlay
Copy link

+1 to Tom's comment about the name. Seems weird to name something SortedRecording when an essential feature is that it can be recordingless. Plus correlograms, ISI histograms etc do not need or touch a recording object.

@DradeAW
Copy link
Contributor

DradeAW commented Dec 5, 2023

Seems weird to name something SortedRecording when an essential feature is that it can be recordingless

I understand the comment, but don't fully agree.
If there is a sorting object, then it had to have worked with a recording at some point. The feature just means that it is a possibility not to load said recording, but it exists somewhere (hopefully).

@DradeAW
Copy link
Contributor

DradeAW commented Dec 5, 2023

Also (stop me if this is too complex, but I just wanted to express the idea), coming back to @yger idea to handle the new Template class:

One interesting feature would be to compute the pre-processed templates from the raw templates, rather than pre-processing each individual waveforms (which is very slow when computing the mean from 10,000 spikes).
A good portion of the pre-processing is linear (e.g. temporal filtering, whitening), and can be applied after taking the mean of the raw waveforms (with the margin of course).

I don't know about other users, but I find myself needing a filtered template a lot of time, but not caring about the individual filtered waveforms.

@TomBugnon
Copy link
Contributor

TomBugnon commented Dec 5, 2023

@DradeAW Graham and I work with 48h neuropixels recording for which the recording is 3.5TB, and the sorting a few GBs. We can't keep the preprocessed recording so instantiating the recording requires re-extracting and re-processing the AP (computing whitening matrix, loading motion files etc). The raw AP files live on a different drive than the sorting that may be mounted at the time of running the sorting and postprocessing, but not necessarily later on when just loading (rather than computing) the waveforms, extensions etc.
So we very much appreciate the workflow allowed by the waveform extractor that you compute the waveforms, spike amplitudes etc (stuff relying on the recording) once and for all while the recording is accessible and then can compute/recompute/load most extension/metrics in recording-less mode.

More generally, considering that raw files are orders of magnitude larger than the sorting (and an order of magnitude larger than the waveforms) , and most (all?) operations rely on the spike times + waveforms rather than on the actual underlying recording, I think it makes sense in terms of data sharing and lifespan that as many operations as possible post-sorting rely only only on what's available in the sorting and postprocessing directories.

That's why we prefer PostprocessingExtractor or SortingExtension or the like.

@DradeAW
Copy link
Contributor

DradeAW commented Dec 5, 2023

I understand your limitations, but I don't quite understand the purpose or recording-less mode (could you enlighten me?)

Loading a recording object is supposed to be lazy, and especially with Neuropixels data the file is already in binary format. Isn't the load time supposed to be instantaneous?
And if the file is on another computer (which is my case), can't you mount it to your disk?

That said I understand that if the recording isn't going to be used, you might as well pass None as an argument.

@TomBugnon
Copy link
Contributor

TomBugnon commented Dec 5, 2023

the purpose of recordingless mode is not having to do all the following

  • ensure AP band data is accessible at the same location and in the same format as it was at the time of sorting (eg uncompressed)
  • Re-apply preprocessing (which is applied lazily but still has overhead and so is not instantiated instantly)
  • Cross fingers hoping that the spikeinterface implementation hasn't changed for all these steps since we ran the sorting and first extracted the waveforms

before instantiating an object to do operations which don't rely explicitly on the recording, even less so since it can use previously extracted waveforms

@DradeAW
Copy link
Contributor

DradeAW commented Dec 5, 2023

Ok I get it, thanks for the clarification!

Concerning the name, this is obviously opinion-based,
But to me, even if this object doesn't contains the recording object, it contains the recording that has been sorted into waveforms.
But if we chose not to use this name because it could be ambiguous, I understand :)

@TomBugnon
Copy link
Contributor

TomBugnon commented Dec 5, 2023

Yes I see your point but in the end even the waveforms are optional, you might want to create this object just to compute some quality metrics relying only on spike times, or the cross correlograms or whatnot! To me it feels like the only things ensured are that this object will be about the sorting and about some sort of postprocessing.

But yes no big deal in the end!

@mhhennig
Copy link
Member

Taking a step back...

  • The main motivation for the WaveformExtractor and this proposed object is that certain functions need both spikes and raw data.
  • Often these are snippets, expensive to gather so we want to cache them.
  • The WaveformExtractor is a slightly odd object as it is more then just data, it may contain PCA projections for example. Although useful and powerful, I find it a bit counter-intuitive to handle, it lacks the simplicity of the other Extractor objects.

It makes sense to have an object that holds both a sorting and a recording. However perhaps it would be good to separate out any methods/functions, the object should just hold data - snippets and references to the sources in this case - as do the other Extractors. Would this not increase flexibility?

On one extreme prostprocessing/metrics/widgets then could check the type and be object-aware. ISI violations would accept both a sorting or the new object, while PCA metrics would expect the new object (or make one, although this may not be great as users will be confused). On the other end the new object would do everything, but that's not feasible IMO. Somewhere between is tricky as it's unclear where to draw the line.

Just some thoughts, but in any case I think sorting should return a SortingExtractor to facilitate simple pipelining, which works great for many use cases.

@samuelgarcia
Copy link
Member Author

Hi all.
Thank you for all the interessting feedback.
I have a started an implementation of this object I finally choose SortingResult.

You can have a look here : #2398
I propose close this issue and to continue the discussing on the PR side.

@alejoe91 alejoe91 changed the title SpikeInterface Enhancement Proposal : SpikeSortingResult object SpikeInterface Enhancement Proposal : SortingAnalyzer object Mar 15, 2024
@alejoe91 alejoe91 changed the title SpikeInterface Enhancement Proposal : SortingAnalyzer object SpikeInterface Enhancement Proposal: SortingAnalyzer object Mar 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SEP SpikeInterface Enhancement Proposal
Projects
None yet
Development

No branches or pull requests