
Preprocessing of raw SpikeGLX data with spikeinterface #1018

Closed
TomBugnon opened this issue Oct 18, 2022 · 16 comments
Labels
question General question regarding SI

Comments

@TomBugnon
Contributor

Hi!

We (@grahamfindlay and I) are intending to sort very long (~48h) Neuropixels 1.0 recordings, for which we are now hitting serious bottlenecks in terms of (pre)processing time and disk usage.

So far, we have been using Bill Karsh's CatGT tool to preprocess the raw SpikeGLX files before feeding the preprocessed data into sorters via spikeinterface. CatGT takes care of concatenating the successive "trigger" files (e.g. run_name_g{gate_index}_t{trigger_index}.ap.bin), at the same time as it performs various preprocessing steps, the most crucial of which (for us) being sample alignment, artifact removal ("gfix") and common average referencing.

Since spikeinterface seems to be able to perform the crucial sample alignment step, and also promises to offer the tools to perform the full "destriping" preprocessing presented in the IBL preprocessing white paper and implemented here, we would like to follow Samuel's suggestion ( #1010 ) and perform the full preprocessing and sorting from raw SpikeGLX traces, bypassing CatGT.

For us, taking this step would be a serious gain in terms of disk usage (since we will avoid the useless step of writing the binary recording.dat file), and in terms of following the evolving preprocessing/sorting standards. In short, using spikeinterface from start to finish sounds amazing and we'd be really happy to make the switch.

So, here are a couple questions:

  • Has there been any evolutions regarding the preprocessing of neuropixels data presented here: https://spikeinterface.github.io/blog/spikeinterface-destripe/ ? And in particular the "kfilt" destriping?
  • Does spikeinterface offer gfix-type automatic detection of short, large-amplitude artifacts?
  • The trickiest (although not crucial for us in the short term): is spikeinterface/neo equipped to handle the concatenation of raw files across gates/triggers, since there may be overlaps or gaps (that should be zero-filled) across these files? As Bill Karsh put it,

"all files with the same base run-name share parameters and come from the same underlying SpikeGLX hardware run (a continuous stream of consecutive samples), so have a time relation that allows them to be sewn back together (but with possible gaps and/or overlaps that need to be trimmed). The metadata ‘firstSample’ item is the starting sample number of this file (in that common underlying stream). CatGT can sew g and t series files back together"
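The sewing logic Bill describes can be sketched in plain numpy (a hypothetical illustration, not CatGT or SpikeInterface code): place each file at its firstSample offset relative to the first file, trim overlaps, and zero-fill gaps.

```python
import numpy as np

def sew(segments):
    """segments: list of (first_sample, traces) sorted by first_sample,
    where first_sample is the metadata 'firstSample' item of each file."""
    first0, out = segments[0]
    out = out.copy()
    for first, traces in segments[1:]:
        start = first - first0          # where this file belongs in the stream
        n = len(out)
        if start < n:                   # overlap: trim the duplicated samples
            out = np.concatenate([out[:start], traces])
        elif start > n:                 # gap: zero-fill the missing samples
            pad = np.zeros((start - n,) + traces.shape[1:], dtype=traces.dtype)
            out = np.concatenate([out, pad, traces])
        else:                           # perfectly contiguous
            out = np.concatenate([out, traces])
    return out
```

(As Samuel notes below in the thread, spikeinterface itself handles this lazily as multi-segment recordings rather than by zero-filling.)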

Thank you so much for your continuous help, and let me say it is very exciting to see this project make such huge advances!
Tom

@samuelgarcia
Member

Has there been any evolutions regarding the preprocessing of neuropixels data presented
here: https://spikeinterface.github.io/blog/spikeinterface-destripe/ ? And in particular the "kfilt" destriping?

Not yet, but Joe Ziminski from London is working on it in coordination with Olivier Winter.

Does spikeinterface offer gfix-type automatic detection of short, large-amplitude artifacts?

No, we do not have this.
Depending on the complexity, this could be easy or hard to do.
All preprocessing has to be "lazy", so done on demand by get_traces().
Is this gfix essentially a simple threshold detector followed by zero masking, or is it more complex processing?
(We have remove_artifacts(), but the indices have to be provided externally.)
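For context, the idea behind remove_artifacts() is zero masking around externally supplied sample indices. An eager numpy sketch of that idea (the real SpikeInterface preprocessor is lazy, and the function name and arguments here are illustrative only):

```python
import numpy as np

def zero_mask(traces, triggers, ms_before, ms_after, fs):
    """Zero a window around each externally provided artifact index.

    traces: (n_samples, n_channels) array; triggers: artifact sample indices;
    ms_before/ms_after: window in milliseconds; fs: sampling rate in Hz.
    """
    out = traces.copy()
    before = int(ms_before * fs / 1000)
    after = int(ms_after * fs / 1000)
    for t in triggers:
        out[max(t - before, 0):t + after] = 0
    return out
```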

The trickiest (although not crucial for us in the short term): is spikeinterface/neo equipped to handle the concatenation of raw files across gates/triggers, since there may be overlap or gaps (that should be 0-filled) across these files?
See #628 (comment)

The "concatenation" of "multi segment" (aka multiple binary files) is also handled lazily, with 2 different flavors in spikeinterface ("append" = true multi-segment, or "concatenate" = virtual mono-segment).
We do not fill with zeros because we handle the multi-segment problem explicitly.

In spikeinterface, you can already do this:
rec = si.read_binary(file_paths=['file1.dat', 'file2.dat', 'file3.dat'], ...)
and rec will be a 3-segment recording.

Please have a deep look at this : https://spikeinterface.readthedocs.io/en/latest/modules/core/plot_5_append_concatenate_segments.html#sphx-glr-modules-core-plot-5-append-concatenate-segments-py

Important: at the moment, no sorter handles the "multi segment" problem correctly, except the experimental ones inside spikeinterface ("spykingcircus2" is quite advanced, "tridesclous" is not working yet). This is why in many cases you need to write everything to a single gigantic file with rec.save(), which is somehow the same as using CatGT, but more flexible.

@grahamfindlay

Thanks, Samuel.

Is this gfix essentially a simple threshold detector followed by zero masking, or is it more complex processing?

It is a very simple procedure (details). In short, it takes 3 parameters, (A, B, C): when it detects a peak with amplitude of at least ||A mV||, which rises at least as rapidly as ||B mV/sample-tick||, it calls this an artifact. All samples are zeroed until the amplitude settles back to ||C mV|| or less. Embarrassingly parallel, could probably be fast and easy using scipy.signal.find_peaks.
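A rough sketch of that (A, B, C) procedure with scipy.signal.find_peaks, as suggested above. This is an assumption-laden reading of the CatGT parameters, not its actual implementation; the exact gfix semantics may differ.

```python
import numpy as np
from scipy.signal import find_peaks

def gfix_like(x, A, B, C):
    """Zero-mask short, large-amplitude artifacts in a single-channel trace.

    A: minimum peak amplitude; B: minimum rise per sample-tick;
    C: amplitude below which the signal is considered settled.
    """
    out = x.copy()
    peaks, _ = find_peaks(np.abs(x), height=A)
    for p in peaks:
        rise = abs(x[p] - x[p - 1]) if p > 0 else np.inf
        if rise < B:
            continue                      # not steep enough to be an artifact
        i = p                             # zero forward until it settles below C
        while i < len(x) and abs(out[i]) > C:
            out[i] = 0.0
            i += 1
        i = p - 1                         # and backward over the rising edge
        while i >= 0 and abs(out[i]) > C:
            out[i] = 0.0
            i -= 1
    return out
```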

Is there a write-up anywhere of the ways in which SpyKING CIRCUS 2 differs from SpyKING CIRCUS? I couldn't find a document in Pierre's fork or on his website.

@JoeZiminski
Contributor

JoeZiminski commented Oct 18, 2022

Hi everyone, I'm working on porting some of the IBL tools to SpikeInterface with Olivier Winter. Currently I have been focusing on the bad channel detection / interpolation (will post a draft PR soon), with the kfilt averaging next.

Would also be happy to take a look at opening a PR for the gfix detection this week or next, if that would be useful?

Cheers,
Joe

@TomBugnon
Contributor Author

Thanks @samuelgarcia for the thorough response.

Hi @JoeZiminski Great to hear that, we'll be happy to keep in touch regarding the timeline for the kfilt averaging, thank you!
We've mostly used gfix to clean stimulation artifacts, which is something that spikeinterface can do with the remove_artifacts function, so it is not crucial for us; but in general it might be interesting for spikeinterface to offer CatGT's preprocessing options. As you see fit!

@alejoe91 alejoe91 added the question General question regarding SI label Oct 19, 2022
@alejoe91
Member

@yger is actively working on SC2 and solving some final caveats! When it's ready, we'll add some proper documentation :)

Pierre do you want to comment on this?

@TomBugnon
Contributor Author

Hi @samuelgarcia @alejoe91

I have been looking a bit into preprocessing raw SGLX data with SI, bypassing CatGT.
This plot shows traces (unscaled) for:
1. raw SGLX bin (SGLXExtractor)
2. CatGT-processed bin file (only sample alignment) (SGLXExtractor)
3. CatGT + KS preprocessed file (BinExtractor('temp_wh.dat'))
4. raw SGLX bin + SpikeInterface preprocessing similar to CatGT + kilosort
5. raw SGLX bin + SpikeInterface preprocessing similar to CatGT + kilosort, with local CMR

Using SI to do the sample alignment and the preprocessing steps usually done by kilosort seems to work as well, if not better (3 vs 4).
Also, local common median referencing in SI (last plot) seems to work very well for destriping! About as well as what we were expecting from the upcoming kfilt, actually! (3 vs 5)
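For the record, the local CMR in (5) amounts to subtracting, per channel, the median of its neighboring channels. A rough numpy sketch of the idea (SpikeInterface's common_reference(operator='median', reference='local') defines neighbors by distance annuli on the probe, which this toy channel-index version ignores):

```python
import numpy as np

def local_cmr(traces, n_neighbors=2):
    """traces: (n_samples, n_channels). Subtract, per channel, the median of
    its n_neighbors channels on each side (excluding the channel itself)."""
    n_ch = traces.shape[1]
    out = np.empty_like(traces, dtype=float)
    for ch in range(n_ch):
        lo, hi = max(0, ch - n_neighbors), min(n_ch, ch + n_neighbors + 1)
        neighbors = [c for c in range(lo, hi) if c != ch]
        out[:, ch] = traces[:, ch] - np.median(traces[:, neighbors], axis=1)
    return out
```

Because the reference is local, stripe noise shared by nearby channels cancels, while a spike confined to one channel passes through unchanged.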

Anyway, my question here regards the scaling of traces.
Since the range of values is much reduced after full SI preprocessing vs CatGT preprocessing (compare the scales for 4 vs 2), resolution is lost when the recording is cast to int16 and written to file by write_binary_recording during run_sorter('kilosort2_5', rec_prepro).
(This is what channel traces look like in the recording.dat intermediate file:)
(and this is the final temp_wh.dat file, further scaled *200 by kilosort:)

In the same way that values are scaled *200 after kilosort preprocessing, before being cast to int16 and written to temp_wh.dat, shouldn't spikeinterface scale recordings when necessary before running a sorter?

Or, what is the best way to scale trace values manually? Calling rec.set_channel_gains(200 * rec.get_channel_gains()) before run_sorter doesn't work, because write_binary_recording calls get_traces(return_scaled=False).
The Recording.save() function doesn't seem to take a use_scaled kwarg either.

Thanks for feedback!

@samuelgarcia
Member

Hi Tom,
this is super cool and positive feedback.
Any feedback about the speed (when you do rec.save(n_jobs=XXX))?

Regarding the scale: if you do not apply whitening (which forces float32), we keep the same gain as the original file. In short, the "bit to microvolt" gain is kept along your preprocessing chain (even if internally it is float at some steps, like the FFT sample alignment).

If you need gain at some step, you can use one of the scaler preprocessors: https://github.com/SpikeInterface/spikeinterface/blob/master/spikeinterface/preprocessing/normalize_scale.py

Note that you can also change the dtype to float32 at several steps of your pipeline if you need it. Just add dtype='float' in the kwargs.
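A tiny numpy illustration of the resolution issue Tom raised: casting a low-amplitude float signal straight to int16 loses essentially everything, while applying a gain first (as kilosort's *200 does, or as one of the scaler preprocessors linked above would) preserves the detail.

```python
import numpy as np

# A z-scored-ish signal whose amplitude is far below the int16 quantization step.
x = 0.01 * np.sin(np.linspace(0, 2 * np.pi, 1000))

naive = x.astype(np.int16)            # every sample rounds to 0: signal destroyed
scaled = (200 * x).astype(np.int16)   # kilosort-style gain before the cast
```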

@TomBugnon
Contributor Author

Ok great! I didn't see that there were dedicated scaling preprocessors.

Will keep you posted on speed when doing some tests on longer recordings. I tried write_binary_recording once (no actual processing, only casting to int) with n_jobs=100 and 500G total_memory, and it was quite bad: around 70h to copy a ~43h recording, while CatGT took around 30h to process/write the same amount of data.

@samuelgarcia
Member

I was looking at your figure. Your gain change is due to the whitening.
With whitening (which is not recommended, but some do it) you expect your signal to be z-scored. This explains the scale between -5 and 5.

@alejoe91
Member

@TomBugnon can you try using these job_kwargs?

n_jobs=100, chunk_duration="1s"

Personally, I found that smaller chunks make processing faster!

@samuelgarcia
Member

Oops, this is slow. But is 70h for the full chain: align/CMR/bandpass/whiten?
Note that CMR with median makes it quite a bit slower; average would be faster.

By the way, I would do CMR after the bandpass, to remove only the common noise in the high frequencies, something like this, no?
align/bandpass/CMR/whiten or bandpass/align/CMR/whiten

@samuelgarcia
Member

@alejoe91: with 100 jobs you also have to control the memory!!!!!
@TomBugnon: note that total_memory=500G means 500/100 = 5 GB of trace chunk per worker.
Some steps need internal copies, so the chain may consume much more. Did you check that you are not swapping?

@TomBugnon
Contributor Author

70h was with NO preprocessing steps: run_sorter directly on the raw recording (so only casting to int16, I think).
I didn't check swap, but we have 1.5TB of RAM (and 200+ CPUs), so I was expecting these params to perform OK. The main bottleneck is probably disk I/O in this case, so maybe fewer jobs would work better. I'll do more tests sometime.

@samuelgarcia The idea was to bypass kilosort's preprocessing completely (so that there's no extra step of writing temp_wh.dat) and do all the necessary steps in SI. KS performs whitening as well; are you suggesting that it's actually not recommended?

@alejoe91
Member

alejoe91 commented Apr 6, 2023

@TomBugnon we are almost done with a patch to allow one to skip KS preprocessing: #1418

@TomBugnon
Contributor Author

Cool! I did it directly in KS, but your version will probably be cleaner:
CSC-UW/Kilosort@b2e2dbf

@alejoe91
Member

You can now skip KS preprocessing in SI: #1418
