# DataJoint Workflow Guide for Creating a New Clustering Task


This notebook guides users through the process of adding a new `ClusteringTask` entry to the DataJoint pipeline.


**_Note:_**

- The examples in this notebook use a sample dataset. Replace these entries with your actual database entries to access and analyze your data.


### **Key Steps**


- **Setup**

- **Step 1: Select Session of Interest**

- **Step 2: Insert a New `ClusteringTask` Entry**


### **Setup**


First, import the packages and schemas needed to work with the data pipeline.


In [1]:
import os

if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")

In [2]:
import datajoint as dj

In [3]:
from workflow.pipeline import ephys

[2024-11-21 13:19:56,477][INFO]: Connecting milagros@db.datajoint.com:3306
[2024-11-21 13:19:59,505][INFO]: Connected milagros@db.datajoint.com:3306


#### **Step 1: Select Session of Interest**


Let's select one session as an example and create a key:


In [25]:
session_key = {
    "organoid_id": "MB07",
    "experiment_start_time": "2024-09-07 14:49:00",
    "insertion_number": 0,
    "start_time": "2024-09-07 14:57:00",
    "end_time": "2024-09-07 14:02:00",
}

session_key

{'organoid_id': 'MB07',
 'experiment_start_time': '2024-09-07 14:49:00',
 'insertion_number': 0,
 'start_time': '2024-09-07 14:57:00',
 'end_time': '2024-09-07 14:02:00'}

In [27]:
ephys.EphysSession * ephys.EphysSessionProbe & session_key

| organoid_id (e.g. O17) | experiment_start_time | insertion_number | start_time | end_time | session_type | probe (unique identifier for this model of probe, e.g. serial number) | port_id | used_electrodes (list of electrode IDs used in this session; if null, all electrodes are used) |
|---|---|---|---|---|---|---|---|---|
| MB07 | 2024-09-07 14:49:00 | 0 | 2024-09-07 14:57:00 | 2024-09-07 14:02:00 | | Q983 | C | =BLOB= |
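
Before using a key like this downstream, it can help to sanity-check its time fields. The sketch below is not part of the pipeline: `check_session_times` is a hypothetical helper, and it assumes the times are strings in the `YYYY-MM-DD HH:MM:SS` format shown above. Applied to the example key, it reports that `end_time` falls before `start_time`, which may simply reflect the sample data but is worth confirming for real sessions.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format, matching the key above


def check_session_times(key):
    """Return a list of problems found in the time fields of a session key."""
    experiment_start = datetime.strptime(key["experiment_start_time"], TIME_FORMAT)
    start = datetime.strptime(key["start_time"], TIME_FORMAT)
    end = datetime.strptime(key["end_time"], TIME_FORMAT)

    problems = []
    if end <= start:
        problems.append("end_time is not after start_time")
    if start < experiment_start:
        problems.append("start_time precedes experiment_start_time")
    return problems


# Same values as the example session_key above.
example_key = {
    "organoid_id": "MB07",
    "experiment_start_time": "2024-09-07 14:49:00",
    "insertion_number": 0,
    "start_time": "2024-09-07 14:57:00",
    "end_time": "2024-09-07 14:02:00",
}

print(check_session_times(example_key))
```

An empty list means both checks passed; any other result is a prompt to re-read the key before it is used in an insert.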


#### **Step 2: Insert a New `ClusteringTask` Entry**


In [None]:
ephys.ClusteringTask.heading

The `ephys.ClusteringTask` table pairs a specific `ephys.ClusteringParamSet` with a particular session. Each entry in the `ClusteringTask` table represents a clustering task awaiting execution.
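
As a minimal sketch of that pairing, the hypothetical helper `make_task_key` below merges a session key with a `paramset_idx` into a single dictionary of the kind that could be passed to `ephys.ClusteringTask.insert1`. The helper name and the example `paramset_idx` value are illustrative assumptions, not part of the pipeline.

```python
def make_task_key(session_key, paramset_idx):
    """Return a new dict pairing a session key with a parameter-set index."""
    existing = session_key.get("paramset_idx")
    if existing is not None and existing != paramset_idx:
        raise ValueError(
            f"session_key already carries paramset_idx={existing}, not {paramset_idx}"
        )
    # Build a copy so the caller's session_key is left unchanged.
    return {**session_key, "paramset_idx": paramset_idx}


# Illustrative session key; replace with a key from your own database.
example_session = {
    "organoid_id": "MB07",
    "experiment_start_time": "2024-09-07 14:49:00",
    "insertion_number": 0,
    "start_time": "2024-09-07 14:57:00",
    "end_time": "2024-09-07 14:02:00",
}

task_key = make_task_key(example_session, paramset_idx=2)
print(task_key)
```

Returning a copy rather than mutating the input keeps the original session key reusable for other parameter sets.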

Before creating a new `ClusteringTask`, let's inspect the existing parameter sets and sessions:


In [18]:
ephys.ClusteringParamSet()

| paramset_idx | clustering_method | paramset_desc | param_set_hash | params (dictionary of all applicable parameters) |
|---|---|---|---|---|
| 0 | spykingcircus2 | Default parameters for spyking circus2 using SpikeInterface v0.100.1 | b6fb9ec2-768c-66b0-2b71-9b8ac91e94da | =BLOB= |
| 1 | spykingcircus2 | Default parameter set for spyking circus2 using SpikeInterface v0.101.* | 434894d0-eb7b-db6c-80e6-638a1322c568 | =BLOB= |
| 2 | kilosort2 | kilosort2 with SpikeInterface version 0.101+ | 79a731f3-f1b6-c110-5f8a-e25227464de7 | =BLOB= |
| 5 | spykingcircus2 | Spyking circus2 with a detection threshold 5 (neg direction) | 4c895afd-a1b1-5d64-b747-e8489078e2e3 | =BLOB= |
| 11 | spykingcircus2 | waveform>threshold: .25->2 | 17d41d84-067d-791c-8706-8cab83020b84 | =BLOB= |
| 12 | spykingcircus2 | waveform>threshold: .25->2 attempt 2 | 2b28cf23-2456-8202-b70f-96871b837a26 | =BLOB= |
| 13 | spykingcircus2 | waveform>threshold: .25->2 attempt 2 | 1faf6aee-71d6-fe26-74ec-6bb7cdc0f30f | =BLOB= |
| 14 | spykingcircus2 | apply_preprocessing = False | ce720015-b59a-08d6-198e-def81c860f46 | =BLOB= |
| 15 | spykingcircus2 | apply_preprocessing, matched_filtering, and apply_motion_correction = False | 5f7a8362-c31c-061e-14b2-74ad55466546 | =BLOB= |
| 16 | spykingcircus2 | default parameters, different format | 0a3d0360-c0de-6c30-9c35-7c931a9a6f62 | =BLOB= |
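
When the table grows, it can be easier to narrow the parameter sets by sorter in plain Python. The sketch below assumes the rows were retrieved as a list of dictionaries, for example with `ephys.ClusteringParamSet.fetch("paramset_idx", "clustering_method", as_dict=True)`; the helper `paramsets_for_method` is hypothetical, and the sample rows are copied from the table above.

```python
def paramsets_for_method(rows, method):
    """Return the sorted paramset_idx values whose clustering_method matches."""
    return sorted(
        row["paramset_idx"] for row in rows if row["clustering_method"] == method
    )


# A few rows copied from the ClusteringParamSet table shown above.
rows = [
    {"paramset_idx": 0, "clustering_method": "spykingcircus2"},
    {"paramset_idx": 2, "clustering_method": "kilosort2"},
    {"paramset_idx": 5, "clustering_method": "spykingcircus2"},
    {"paramset_idx": 14, "clustering_method": "spykingcircus2"},
]

print(paramsets_for_method(rows, "kilosort2"))
print(paramsets_for_method(rows, "spykingcircus2"))
```

An empty result for the sorter you intend to use is a signal that a matching parameter set has to be created before a task can reference it.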


In [42]:
# Import the sorters submodule explicitly; `import spikeinterface` alone may not load it
import spikeinterface.sorters as ss

ss.Kilosort4Sorter.default_params()

{'batch_size': 60000,
 'nblocks': 1,
 'Th_universal': 9,
 'Th_learned': 8,
 'do_CAR': True,
 'invert_sign': False,
 'nt': 61,
 'shift': None,
 'scale': None,
 'artifact_threshold': None,
 'nskip': 25,
 'whitening_range': 32,
 'highpass_cutoff': 300,
 'binning_depth': 5,
 'sig_interp': 20,
 'drift_smoothing': [0.5, 0.5, 0.5],
 'nt0min': None,
 'dmin': None,
 'dminx': 32,
 'min_template_size': 10,
 'template_sizes': 5,
 'nearest_chans': 10,
 'nearest_templates': 100,
 'max_channel_distance': None,
 'templates_from_data': True,
 'n_templates': 6,
 'n_pcs': 6,
 'Th_single_ch': 6,
 'acg_threshold': 0.2,
 'ccg_threshold': 0.25,
 'cluster_downsampling': 20,
 'cluster_pcs': 64,
 'x_centers': None,
 'duplicate_spike_ms': 0.25,
 'scaleproc': None,
 'save_preprocessed_copy': False,
 'torch_device': 'auto',
 'bad_channels': None,
 'clear_cache': False,
 'save_extra_vars': False,
 'do_correction': True,
 'keep_good_only': False,
 'skip_kilosort_preprocessing': False,
 'use_binary_file': None,
 'del
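
A custom parameter set usually starts from these defaults and changes only a few values. The sketch below shows one way to do that safely in plain Python; `override_params` is a hypothetical helper, the default values are a small subset copied from the output above, and the override values are arbitrary illustrations rather than recommended settings.

```python
def override_params(defaults, **overrides):
    """Return a copy of defaults with overrides applied, rejecting unknown names."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        # A misspelled parameter name would otherwise be silently ignored.
        raise KeyError(f"Unknown parameter(s): {unknown}")
    return {**defaults, **overrides}


# Subset of the Kilosort4 defaults listed above.
defaults_subset = {
    "batch_size": 60000,
    "Th_universal": 9,
    "Th_learned": 8,
    "do_CAR": True,
}

# Arbitrary example overrides.
custom_params = override_params(defaults_subset, Th_universal=10, do_CAR=False)
print(custom_params)
```

Rejecting unknown names catches typos such as `th_universal` before the parameters are stored in a new parameter set.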

**Attention:**

- The next code cell will insert a new entry for this experiment.

- If connected to the cloud, this will trigger a series of computations in downstream tables. Please double-check the session attributes and `paramset_idx`.


In [7]:
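# Fetch the primary keys of the matching tasks.
# The unpacking below assumes the query returns exactly five rows.
MB03_key, MB05_key, MB06_key, MB07_key, MB08_key = (
    ephys.ClusteringTask & 'organoid_ID LIKE "%MB%"' & "paramset_idx=250"
).fetch("KEY")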
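
Unpacking the fetched keys into five names only works when the query returns exactly five rows in the expected order. A less brittle alternative, sketched below, is to index the keys by `organoid_id`. The helper `index_by_organoid` is hypothetical, and the sample keys stand in for the list returned by `.fetch("KEY")`.

```python
import datetime


def index_by_organoid(keys):
    """Map organoid_id to its key, failing loudly if an organoid appears twice."""
    indexed = {}
    for key in keys:
        organoid = key["organoid_id"]
        if organoid in indexed:
            raise ValueError(f"More than one key fetched for {organoid}")
        indexed[organoid] = key
    return indexed


# Stand-in for the list of dicts returned by .fetch("KEY").
fetched_keys = [
    {
        "organoid_id": name,
        "experiment_start_time": datetime.datetime(2024, 9, 7, 14, 49),
        "insertion_number": 0,
        "start_time": datetime.datetime(2024, 9, 7, 14, 49),
        "end_time": datetime.datetime(2024, 9, 7, 14, 54),
        "paramset_idx": 250,
    }
    for name in ("MB05", "MB08")
]

keys_by_organoid = index_by_organoid(fetched_keys)
print(sorted(keys_by_organoid))
```

With this mapping, `keys_by_organoid["MB08"]` plays the role of `MB08_key` regardless of how many rows the query returns.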

In [17]:
MB05_key

{'organoid_id': 'MB05',
 'experiment_start_time': datetime.datetime(2024, 9, 7, 14, 49),
 'insertion_number': 0,
 'start_time': datetime.datetime(2024, 9, 7, 14, 49),
 'end_time': datetime.datetime(2024, 9, 7, 14, 54),
 'paramset_idx': 400}

In [15]:
MB08_key["paramset_idx"] = 400
MB08_key

{'organoid_id': 'MB08',
 'experiment_start_time': datetime.datetime(2024, 9, 7, 14, 49),
 'insertion_number': 0,
 'start_time': datetime.datetime(2024, 9, 7, 14, 49),
 'end_time': datetime.datetime(2024, 9, 7, 14, 54),
 'paramset_idx': 400}

In [39]:
ephys.ClusteringTask & 'organoid_ID LIKE "%MB08%"'

| organoid_id (e.g. O17) | experiment_start_time | insertion_number | start_time | end_time | paramset_idx | clustering_output_dir (clustering output directory relative to the clustering root data directory) |
|---|---|---|---|---|---|---|
| MB08 | 2024-09-07 14:49:00 | 0 | 2024-09-07 14:49:00 | 2024-09-07 14:54:00 | 23 | |
| MB08 | 2024-09-07 14:49:00 | 0 | 2024-09-07 14:49:00 | 2024-09-07 14:54:00 | 250 | |
| MB08 | 2024-09-07 14:49:00 | 0 | 2024-09-07 14:49:00 | 2024-09-07 14:54:00 | 400 | |


In [23]:
from workflow.pipeline import ephys_sorter

In [31]:
ephys_sorter.schema.jobs & 'error_message LIKE "%kilosort4%"'

table_name  className of the table,key_hash  key hash,"status  if tuple is missing, the job is available",key  structure containing the key,error_message  error message returned if failed,error_stack  error stack if failed,user  database user,host  system hostname,pid  system process id,connection_id  connection_id(),timestamp  automatic timestamp
_s_i_clustering,50e897f1712935a4a0620c13d83043f4,error,=BLOB=,"SpikeSortingError: Spike sorting in docker failed with the following error: /mnt/efs/works/org/utah/proj/organoids/outbox/MB5-8_raw/202409071449_202409071454/MB05/kilosort4_400/kilosort4/in_container_sorter_script.py:23: DeprecationWarning: `output_folder` is deprecated and will be removed in version 0.103.0 Please use folder instead  sorting = run_sorter_local( /root/.local/lib/python3.11/site-packages/spikeinterface/core/baserecordingsnippets.py:271: UserWarning: There is no Probe attached to this recording. Creating a dummy one with contact positions  warn(""There is no Probe attached to this recording. Creating a dummy one with contact positions"") INFO:kilosort.io:======================================== INFO:kilosort.io:Loading recording with SpikeInterface... INFO:kilosort.io:number of samples: 6000000 INFO:kilosort.io:number of channels: 32 INFO:kilosort.io:numbef of segments: 1 INFO:kilosort.io:sampling rate: 20000.0 INFO:kilosort.io:dtype: int16 INFO:kilosort.io:======================================== INFO:kilosort.run_kilosort: INFO:kilosort.run_kilosort:Computing drift correction. INFO:kilosort.run_kilosort:---------------------------------------- INFO:kilosort.spikedetect:Re-computing universal templates from data. Skipping common average reference. Skipping kilosort preprocessing. 
Error running kilosort4 Traceback (most recent call last):  File ""/mnt/efs/works/org/utah/proj/organoids/outbox/MB5-8_raw/202409071449_202409071454/MB05/kilosort4_400/kilosort4/in_container_sorter_script.py"", line 23, in sorting = run_sorter_local(  ^^^^^^^^^^^^^^^^^  File ""/root/.local/lib/python3.11/site-packages/spikeinterface/sorters/runsorter.py"", line 261, in run_sorter_local  SorterClass.run_from_folder(folder, raise_error, verbose)  File ""/root/.local/lib/python3.11/site-packages/spikeinterface/sorters/basesorter.py"", line 302, in run_from_folder  raise SpikeSortingError( spikeinterface.sorters.utils.misc.SpikeSortingError: Spike sorting error trace: Traceback (most...truncated",=BLOB=,utah-worker@100.96.36.115,c0a9aaf87e81,7,602333,2024-11-21 13:56:34


In [None]:
ephys_sorter.schema.jobs & 'key_hash="50e897f1712935a4a0620c13d83043f4"'

table_name  className of the table,key_hash  key hash,"status  if tuple is missing, the job is available",key  structure containing the key,error_message  error message returned if failed,error_stack  error stack if failed,user  database user,host  system hostname,pid  system process id,connection_id  connection_id(),timestamp  automatic timestamp
_s_i_clustering,50e897f1712935a4a0620c13d83043f4,error,=BLOB=,"SpikeSortingError: Spike sorting in docker failed with the following error: /mnt/efs/works/org/utah/proj/organoids/outbox/MB5-8_raw/202409071449_202409071454/MB05/kilosort4_400/kilosort4/in_container_sorter_script.py:23: DeprecationWarning: `output_folder` is deprecated and will be removed in version 0.103.0 Please use folder instead  sorting = run_sorter_local( /root/.local/lib/python3.11/site-packages/spikeinterface/core/baserecordingsnippets.py:271: UserWarning: There is no Probe attached to this recording. Creating a dummy one with contact positions  warn(""There is no Probe attached to this recording. Creating a dummy one with contact positions"") INFO:kilosort.io:======================================== INFO:kilosort.io:Loading recording with SpikeInterface... INFO:kilosort.io:number of samples: 6000000 INFO:kilosort.io:number of channels: 32 INFO:kilosort.io:numbef of segments: 1 INFO:kilosort.io:sampling rate: 20000.0 INFO:kilosort.io:dtype: int16 INFO:kilosort.io:======================================== INFO:kilosort.run_kilosort: INFO:kilosort.run_kilosort:Computing drift correction. INFO:kilosort.run_kilosort:---------------------------------------- INFO:kilosort.spikedetect:Re-computing universal templates from data. Skipping common average reference. Skipping kilosort preprocessing. 
Error running kilosort4 Traceback (most recent call last):  File ""/mnt/efs/works/org/utah/proj/organoids/outbox/MB5-8_raw/202409071449_202409071454/MB05/kilosort4_400/kilosort4/in_container_sorter_script.py"", line 23, in sorting = run_sorter_local(  ^^^^^^^^^^^^^^^^^  File ""/root/.local/lib/python3.11/site-packages/spikeinterface/sorters/runsorter.py"", line 261, in run_sorter_local  SorterClass.run_from_folder(folder, raise_error, verbose)  File ""/root/.local/lib/python3.11/site-packages/spikeinterface/sorters/basesorter.py"", line 302, in run_from_folder  raise SpikeSortingError( spikeinterface.sorters.utils.misc.SpikeSortingError: Spike sorting error trace: Traceback (most...truncated",=BLOB=,utah-worker@100.96.36.115,c0a9aaf87e81,7,602333,2024-11-21 13:56:34
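
Long error messages like the one above are hard to scan when several jobs have failed. The sketch below groups failed jobs by the exception name at the start of `error_message`. It assumes the job rows were fetched as a list of dictionaries (for example with `fetch(as_dict=True)` on the jobs table); the helpers `error_type` and `count_error_types` are hypothetical, and the sample rows are shortened, made-up stand-ins rather than real job records.

```python
from collections import Counter


def error_type(message):
    """Return the text before the first colon, e.g. the exception class name."""
    head = (message or "").split(":", 1)[0].strip()
    return head or "unknown"


def count_error_types(job_rows):
    """Count failed jobs per leading exception name."""
    return Counter(
        error_type(row.get("error_message"))
        for row in job_rows
        if row.get("status") == "error"
    )


# Shortened, illustrative rows; real rows carry the full message.
job_rows = [
    {"status": "error", "error_message": "SpikeSortingError: Spike sorting in docker failed"},
    {"status": "error", "error_message": "SpikeSortingError: Spike sorting in docker failed"},
    {"status": "error", "error_message": "FileNotFoundError: missing recording"},
    {"status": "reserved", "error_message": None},
]

print(count_error_types(job_rows))
```

A summary like this shows at a glance whether failures share one cause or need to be investigated separately.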


In [40]:
ephys_sorter.PostProcessing & 'organoid_ID LIKE "%MB%"'

| organoid_id (e.g. O17) | experiment_start_time | insertion_number | start_time | end_time | paramset_idx | execution_time (datetime of the start of this step) | execution_duration (execution duration in hours) |
|---|---|---|---|---|---|---|---|
| MB03 | 2024-08-30 19:00:00 | 0 | 2024-08-30 19:15:00 | 2024-08-30 19:20:00 | 250 | 2024-11-14 17:14:10 | 0.0209988 |
| MB07 | 2024-09-07 14:49:00 | 0 | 2024-09-07 14:49:00 | 2024-09-07 14:54:00 | 250 | 2024-11-14 21:49:17 | 0.0109288 |


In [41]:
ephys.CuratedClustering & 'organoid_ID LIKE "%MB%"'

| organoid_id (e.g. O17) | experiment_start_time | insertion_number | start_time | end_time | paramset_idx |
|---|---|---|---|---|---|
| MB03 | 2024-08-30 19:00:00 | 0 | 2024-08-30 19:15:00 | 2024-08-30 19:20:00 | 250 |
| MB07 | 2024-09-07 14:49:00 | 0 | 2024-09-07 14:49:00 | 2024-09-07 14:54:00 | 250 |
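
Comparing the tasks with the curated results shows which tasks have not yet made it through the pipeline. The sketch below does this comparison in plain Python on lists of key dictionaries; `pending_keys` is a hypothetical helper and the sample keys are trimmed to two fields for brevity. On the database side, DataJoint's `-` operator (for example `ephys.ClusteringTask - ephys.CuratedClustering`) expresses a similar anti-join.

```python
def pending_keys(task_keys, done_keys, fields):
    """Return the task keys that have no matching entry among done_keys."""
    done = {tuple(key[name] for name in fields) for key in done_keys}
    return [
        key for key in task_keys if tuple(key[name] for name in fields) not in done
    ]


# Trimmed, illustrative keys; real keys carry every primary-key attribute.
FIELDS = ("organoid_id", "paramset_idx")
task_keys = [
    {"organoid_id": "MB03", "paramset_idx": 250},
    {"organoid_id": "MB07", "paramset_idx": 250},
    {"organoid_id": "MB08", "paramset_idx": 250},
    {"organoid_id": "MB08", "paramset_idx": 400},
]
done_keys = [
    {"organoid_id": "MB03", "paramset_idx": 250},
    {"organoid_id": "MB07", "paramset_idx": 250},
]

for key in pending_keys(task_keys, done_keys, FIELDS):
    print(key)
```

Tasks that stay in the pending list for a long time are good candidates for a look at the jobs table.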


Let's insert a new entry into the `ClusteringTask` table using `MB08_key`, which pairs the selected session with `paramset_idx=400`:


In [16]:
# Insert a new ClusteringTask entry; skip_duplicates=True makes re-running this cell safe
ephys.ClusteringTask.insert1(MB08_key, skip_duplicates=True)
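
Because an insert can trigger downstream computation, a quick completeness check on the row beforehand may be worthwhile. The sketch below is an assumption-laden illustration: `missing_fields` is a hypothetical helper, and `REQUIRED_FIELDS` is inferred from the `ClusteringTask` columns shown in this notebook rather than read from the table definition (the table's heading would be the authoritative source).

```python
import datetime

# Assumed primary-key attributes, inferred from the table output in this notebook.
REQUIRED_FIELDS = (
    "organoid_id",
    "experiment_start_time",
    "insertion_number",
    "start_time",
    "end_time",
    "paramset_idx",
)


def missing_fields(row, required=REQUIRED_FIELDS):
    """Return the required attribute names that are absent or None in row."""
    return [name for name in required if row.get(name) is None]


# Same values as MB08_key above.
candidate_row = {
    "organoid_id": "MB08",
    "experiment_start_time": datetime.datetime(2024, 9, 7, 14, 49),
    "insertion_number": 0,
    "start_time": datetime.datetime(2024, 9, 7, 14, 49),
    "end_time": datetime.datetime(2024, 9, 7, 14, 54),
    "paramset_idx": 400,
}

print(missing_fields(candidate_row))
```

An empty list means every assumed key attribute is present; note that `insertion_number=0` counts as present because the check tests for `None`, not for falsy values.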

In [8]:
ephys.ClusteringTask & session_key & "paramset_idx=23"

| organoid_id (e.g. O17) | experiment_start_time | insertion_number | start_time | end_time | paramset_idx | clustering_output_dir (clustering output directory relative to the clustering root data directory) |
|---|---|---|---|---|---|---|
| O09 | 2023-05-18 12:25:00 | 0 | 2023-05-18 12:29:00 | 2023-05-18 12:32:10 | 23 | |
