
Preprocessing of raw SpikeGLX data with spikeinterface #1018

Closed
TomBugnon opened this issue Oct 18, 2022 · 16 comments
Labels
question General question regarding SI

Comments

@TomBugnon
Contributor

Hi!

We (@grahamfindlay and I) are intending to sort very long (~48h) Neuropixels 1.0 recordings, for which we are now hitting serious bottlenecks in terms of (pre)processing time and disk usage.

So far, we have been using Bill Karsh's CatGT tool to preprocess the raw SpikeGLX files before feeding the preprocessed data into sorters via spikeinterface. CatGT takes care of concatenating the successive "trigger" files (e.g. run_name_g{gate_index}_t{trigger_index}.ap.bin), at the same time as it performs various preprocessing steps, the most crucial of which (for us) being sample alignment, artifact removal ("gfix") and common average referencing.

Since spikeinterface seems to be able to perform the crucial sample alignment step, and also promises to offer the tools to perform the full "destriping" preprocessing presented in the IBL preprocessing white paper and implemented here, we would like to follow Samuel's suggestion ( #1010 ) and perform the full preprocessing and sorting from raw SpikeGLX traces, bypassing CatGT.

For us, taking this step would be a serious gain in terms of disk usage (since we will avoid the useless step of writing the binary recording.dat file), and in terms of following the evolving preprocessing/sorting standards. In short, using spikeinterface from start to finish sounds amazing and we'd be really happy to make the switch.

So, here are a couple questions:

  • Has there been any evolutions regarding the preprocessing of neuropixels data presented here: https://spikeinterface.github.io/blog/spikeinterface-destripe/ ? And in particular the "kfilt" destriping?
  • Does spikeinterface offer gfix-type automatic detection of short, large-amplitude artifacts?
  • The trickiest (although not crucial for us in the short term): is spikeinterface/neo equipped to handle the concatenation of raw files across gates/triggers, since there may be overlaps or gaps (that should be zero-filled) across these files? As Bill Karsh put it,

"all files with the same base run-name share parameters and come from the same underlying SpikeGLX hardware run (a continuous stream of consecutive samples), so have a time relation that allows them to be sewn back together (but with possible gaps and/or overlaps that need to be trimmed). The metadata ‘firstSample’ item is the starting sample number of this file (in that common underlying stream). CatGT can sew g and t series files back together"
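The sewing logic Bill describes can be sketched in plain numpy (a hypothetical illustration, not CatGT or SpikeInterface code): place each file at its firstSample offset relative to the first file, trim overlaps, and zero-fill gaps.

```python
import numpy as np

def sew(segments):
    """segments: list of (first_sample, traces) sorted by first_sample,
    where first_sample is the metadata 'firstSample' item of each file."""
    first0, out = segments[0]
    out = out.copy()
    for first, traces in segments[1:]:
        start = first - first0          # where this file belongs in the stream
        n = len(out)
        if start < n:                   # overlap: trim the duplicated samples
            out = np.concatenate([out[:start], traces])
        elif start > n:                 # gap: zero-fill the missing samples
            pad = np.zeros((start - n,) + traces.shape[1:], dtype=traces.dtype)
            out = np.concatenate([out, pad, traces])
        else:                           # perfectly contiguous
            out = np.concatenate([out, traces])
    return out
```

(As Samuel notes below in the thread, spikeinterface itself handles this lazily as multi-segment recordings rather than by zero-filling.)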

Thank you so much for your continuous help, and let me say it is very exciting to see this project make such huge advances!
Tom

@samuelgarcia
Member

Has there been any evolutions regarding the preprocessing of neuropixels data presented
here: https://spikeinterface.github.io/blog/spikeinterface-destripe/ ? And in particular the "kfilt" destriping?

Not yet, but Joe Ziminski from London is working on it in coordination with Olivier Winter.

Does spikeinterface offer gfix-type automatic detection of short, large-amplitude artifacts?

No, we do not have this.
Depending on the complexity, this could be easy or hard to do.
All preprocessing has to be "lazy", so done on demand by get_traces().
Is this gfix essentially a simple threshold detector followed by zero masking, or is it more complex processing?
(We have remove_artifacts(), but the indices have to be provided externally.)
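For context, the idea behind remove_artifacts() is zero masking around externally supplied sample indices. An eager numpy sketch of that idea (the real SpikeInterface preprocessor is lazy, and the function name and arguments here are illustrative only):

```python
import numpy as np

def zero_mask(traces, triggers, ms_before, ms_after, fs):
    """Zero a window around each externally provided artifact index.

    traces: (n_samples, n_channels) array; triggers: artifact sample indices;
    ms_before/ms_after: window in milliseconds; fs: sampling rate in Hz.
    """
    out = traces.copy()
    before = int(ms_before * fs / 1000)
    after = int(ms_after * fs / 1000)
    for t in triggers:
        out[max(t - before, 0):t + after] = 0
    return out
```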

The trickiest (although not crucial for us in the short term): is spikeinterface/neo equipped to handle the concatenation of raw files across gates/triggers, since there may be overlap or gaps (that should be 0-filled) across these files?
See #628 (comment)

The "concatenation" of "multi segment" (aka multiple binary files) is also handled lazily, with 2 different flavors in spikeinterface ("append" = true multi-segment, or "concatenate" = virtual mono-segment).
We do not fill with zeros because we handle the multi-segment problem explicitly.

In spikeinterface, you can already do this:
rec = si.read_binary(file_paths=['file1.dat', 'file2.dat', 'file3.dat'], ...)
and rec will be a 3-segment recording.

Please have a deep look at this : https://spikeinterface.readthedocs.io/en/latest/modules/core/plot_5_append_concatenate_segments.html#sphx-glr-modules-core-plot-5-append-concatenate-segments-py

Important: at the moment, no sorter handles the "multi segment" problem correctly, except the experimental ones inside spikeinterface ("spykingcircus2" is quite advanced, "tridesclous" is not working yet). This is why in many cases you need to write everything to a single gigantic file with rec.save(), which is somehow the same as using CatGT, but more flexible.

@grahamfindlay

Thanks, Samuel.

Is this gfix essentially a simple threshold detector followed by zero masking, or is it more complex processing?

It is a very simple procedure (details). In short, it takes 3 parameters, (A, B, C): when it detects a peak with amplitude of at least ||A mV||, which rises at least as rapidly as ||B mV/sample-tick||, it calls this an artifact. All samples are zeroed until the amplitude settles back to ||C mV|| or less. Embarrassingly parallel, could probably be fast and easy using scipy.signal.find_peaks.
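A rough sketch of that (A, B, C) procedure with scipy.signal.find_peaks, as suggested above. This is an assumption-laden reading of the CatGT parameters, not its actual implementation; the exact gfix semantics may differ.

```python
import numpy as np
from scipy.signal import find_peaks

def gfix_like(x, A, B, C):
    """Zero-mask short, large-amplitude artifacts in a single-channel trace.

    A: minimum peak amplitude; B: minimum rise per sample-tick;
    C: amplitude below which the signal is considered settled.
    """
    out = x.copy()
    peaks, _ = find_peaks(np.abs(x), height=A)
    for p in peaks:
        rise = abs(x[p] - x[p - 1]) if p > 0 else np.inf
        if rise < B:
            continue                      # not steep enough to be an artifact
        i = p                             # zero forward until it settles below C
        while i < len(x) and abs(out[i]) > C:
            out[i] = 0.0
            i += 1
        i = p - 1                         # and backward over the rising edge
        while i >= 0 and abs(out[i]) > C:
            out[i] = 0.0
            i -= 1
    return out
```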

Is there a write-up anywhere of the ways in which SpyKING CIRCUS 2 differs from SpyKING CIRCUS? I couldn't find a document in Pierre's fork or on his website.

@JoeZiminski
Contributor

JoeZiminski commented Oct 18, 2022

Hi everyone, I'm working on porting some of the IBL tools to SpikeInterface with Olivier Winter. Currently I have been focusing on the bad channel detection / interpolation (will post a draft PR soon), with the kfilt averaging next.

Would also be happy to take a look at opening a PR for the gfix detection this week or next, if that would be useful?

Cheers,
Joe

@TomBugnon
Contributor Author

Thanks @samuelgarcia for the thorough response.

Hi @JoeZiminski Great to hear that, we'll be happy to keep in touch regarding the timeline for the kfilt averaging, thank you!
We've mostly used gfix to clean stimulation artifacts, which is something that spikeinterface can do with the remove_artifacts function, so it is not crucial for us; but in general it might be interesting for spikeinterface to offer CatGT's preprocessing options. As you see fit!

@alejoe91 alejoe91 added the question General question regarding SI label Oct 19, 2022
@alejoe91
Member

@yger is actively working on SC2 and solving some final caveats! When it's ready, we'll add some proper documentation :)

Pierre do you want to comment on this?

@TomBugnon
Contributor Author

Hi @samuelgarcia @alejoe91

I have been looking a bit into preprocessing raw SGLX data with SI, bypassing CatGT.
This plot shows traces (unscaled) for:
1. raw SGLX bin (SGLXExtractor)
2. CatGT-processed bin file (only sample alignment) (SGLXExtractor)
3. CatGT + KS preprocessed file (BinExtractor('temp_wh.dat'))
4. raw SGLX bin + SpikeInterface preprocessing similar to CatGT + kilosort
5. raw SGLX bin + SpikeInterface preprocessing similar to CatGT + kilosort, with local CMR

Using SI to do the sample alignment and the preprocessing steps usually done by kilosort seems to work as well, if not better (3 vs 4).
Also, local common median referencing in SI (last plot) seems to work very well for destriping! About as well as what we were expecting from the upcoming kfilt, actually! (3 vs 5)
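For the record, the local CMR in (5) amounts to subtracting, per channel, the median of its neighboring channels. A rough numpy sketch of the idea (SpikeInterface's common_reference(operator='median', reference='local') defines neighbors by distance annuli on the probe, which this toy channel-index version ignores):

```python
import numpy as np

def local_cmr(traces, n_neighbors=2):
    """traces: (n_samples, n_channels). Subtract, per channel, the median of
    its n_neighbors channels on each side (excluding the channel itself)."""
    n_ch = traces.shape[1]
    out = np.empty_like(traces, dtype=float)
    for ch in range(n_ch):
        lo, hi = max(0, ch - n_neighbors), min(n_ch, ch + n_neighbors + 1)
        neighbors = [c for c in range(lo, hi) if c != ch]
        out[:, ch] = traces[:, ch] - np.median(traces[:, neighbors], axis=1)
    return out
```

Because the reference is local, stripe noise shared by nearby channels cancels, while a spike confined to one channel passes through unchanged.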

Anyway, my question here regards the scaling of traces.
Since the range of values is much reduced after full SI preprocessing vs CatGT preprocessing (compare the scales for 4 vs 2), resolution is lost when the recording is cast to int16 and written to file by write_binary_recording during run_sorter('kilosort2_5', rec_prepro).
(This is what channel traces look like in the recording.dat intermediate file:)
(and this is the final temp_wh.dat file, further scaled *200 by kilosort:)

In the same way that values are scaled *200 after kilosort preprocessing, before being cast to int16 and written to temp_wh.dat, shouldn't spikeinterface scale recordings when necessary before running a sorter?

Or, what is the best way to scale trace values manually? Calling rec.set_channel_gains(200 * rec.get_channel_gains()) before run_sorter doesn't work, because write_binary_recording calls get_traces(return_scaled=False).
The Recording.save() function doesn't seem to take a use_scaled kwarg either.

Thanks for feedback!

@samuelgarcia
Member

Hi Tom,
this is super cool and positive feedback.
Any feedback about the speed (when you do rec.save(n_jobs=XXX))?

Regarding the scale: if you do not apply whitening (which forces float32), we keep the same gain as the original file. In short, the "bit to microvolt" gain is kept along your preprocessing chain (even if internally it is float at some steps, like the FFT sample alignment).

If you need gain at some step, you can use one of the scaler preprocessors: https://github.com/SpikeInterface/spikeinterface/blob/master/spikeinterface/preprocessing/normalize_scale.py

Note that you can also change the dtype to float32 at several steps of your pipeline if you need it. Just add dtype='float' in the kwargs.
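A tiny numpy illustration of the resolution issue Tom raised: casting a low-amplitude float signal straight to int16 loses essentially everything, while applying a gain first (as kilosort's *200 does, or as one of the scaler preprocessors linked above would) preserves the detail.

```python
import numpy as np

# A z-scored-ish signal whose amplitude is far below the int16 quantization step.
x = 0.01 * np.sin(np.linspace(0, 2 * np.pi, 1000))

naive = x.astype(np.int16)            # every sample rounds to 0: signal destroyed
scaled = (200 * x).astype(np.int16)   # kilosort-style gain before the cast
```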

@TomBugnon
Contributor Author

Ok great! I didn't see that there were dedicated scaling preprocessors.

Will keep you posted on speed when doing some tests on longer recordings. I tried write_binary_recording once (no actual processing, only casting to int) with n_jobs=100 and 500G total_memory, and it was quite bad: around 70h to copy a ~43h recording, while CatGT took around 30h to process/write the same amount of data.

@samuelgarcia
Member

I was looking at your figure. Your gain change is due to the whitening.
With whitening (which is not recommended, but some do it) you expect your signal to be z-scored. This explains the scale between -5 and 5.

@alejoe91
Member

@TomBugnon can you try using these job_kwargs?

n_jobs=100, chunk_duration="1s"

Personally, I found that smaller chunks make processing faster!

@samuelgarcia
Member

Oops, this is slow. But is 70h for the full chain: align/CMR/bandpass/whiten?
Note that CMR with median makes it quite a bit slower; average would be faster.

By the way, I would do CMR after the bandpass, to remove only the common noise in the high frequencies, something like this, no?
align/bandpass/CMR/whiten or bandpass/align/CMR/whiten

@samuelgarcia
Member

@alejoe91: with 100 jobs you also have to control the memory!!!!!
@TomBugnon: note that total_memory=500G means 500/100 = 5 GB of trace chunk per worker.
Some steps need internal copies, so the chain may consume much more. Did you check that you are not swapping?

@TomBugnon
Contributor Author

70h was with NO preprocessing steps: run_sorter directly on the raw recording (so only casting to int16, I think).
I didn't check swap, but we have 1.5TB of RAM (and 200+ CPUs), so I was expecting these params to perform OK. The main bottleneck is probably disk I/O in this case, so maybe fewer jobs would work better. I'll do more tests sometime.

@samuelgarcia The idea was to bypass kilosort's preprocessing completely (so that there's no extra step of writing temp_wh.dat) and do all the necessary steps in SI. KS performs whitening as well; are you suggesting that it's actually not recommended?

@alejoe91
Member

alejoe91 commented Apr 6, 2023

@TomBugnon we are almost done with a patch to allow one to skip KS preprocessing: #1418

@TomBugnon
Contributor Author

Cool! I did it directly in KS, but your version will probably be cleaner:
CSC-UW/Kilosort@b2e2dbf

@alejoe91
Member

You can now skip KS preprocessing in SI: #1418
