<a href="https://colab.research.google.com/github/IanQS/neuromatch_project/blob/main/steinmetz_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset Parameters

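Adjust these parameters to control how the dataset is constructed: which window of each spike train to keep, how large that window is, the train/test split, and whether to shuffle.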

In [1]:
import enum

class WindowChoice(enum.Enum):
    START = 0
    MID = 1
    END = 2


WINDOW_SIZE = 50
TRAIN_TEST_SPLIT = 0.8  # 80% is training, 20% test
SHUFFLE_DATASET = True

# A float split passed to sklearn's train_test_split must lie strictly between 0 and 1
assert 0 < TRAIN_TEST_SPLIT < 1.0


DATASET_PARAMETERS = dict()

DATASET_PARAMETERS["window_choice"] = WindowChoice.END
DATASET_PARAMETERS["window_size"] = WINDOW_SIZE
DATASET_PARAMETERS["train-test-split"] = TRAIN_TEST_SPLIT
DATASET_PARAMETERS["shuffle"] = SHUFFLE_DATASET

# Modeling of the Steinmetz dataset

- uses [Neuromatch Load Steinmetz Decisions](https://colab.research.google.com/github/NeuromatchAcademy/course-content/blob/main/projects/neurons/load_steinmetz_decisions.ipynb#scrollTo=DJ-jzsE5eLxX) as a base

In [72]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore
from sklearn.decomposition import PCA
import concurrent.futures
from multiprocessing import Pool
from typing import Dict, List, Any
from sklearn.utils import shuffle

import copy
from sklearn.model_selection import train_test_split

np.random.seed(42)

# !pip install -q ipython-autotime
# %load_ext autotime

time: 5.41 ms (started: 2023-07-21 04:51:34 +00:00)


In [3]:
# @title Data Downloading And Stacking
import os, requests

fname = []
for j in range(3):
  fname.append('steinmetz_part%d.npz'%j)
url = ["https://osf.io/agvxh/download"]
url.append("https://osf.io/uv3mw/download")
url.append("https://osf.io/ehmw2/download")

for j in range(len(url)):
  if not os.path.isfile(fname[j]):
    try:
      r = requests.get(url[j])
    except requests.ConnectionError:
      print("!!! Failed to download data !!!")
    else:
      if r.status_code != requests.codes.ok:
        print("!!! Failed to download data !!!")
      else:
        with open(fname[j], "wb") as fid:
          fid.write(r.content)

all_ds = np.array([])
for j in range(len(fname)):
  all_ds = np.hstack((all_ds,
                      np.load('steinmetz_part%d.npz'%j,
                              allow_pickle=True)['dat']))

time: 52.2 s (started: 2023-07-21 03:26:30 +00:00)


# Dataset Description

(taken and modified from the Neuromatch Load Steinmetz Decisions notebook)

## High-level

`all_ds` contains 39 sessions from 10 mice, with data from Steinmetz et al., 2019. Time bins for all measurements are 10 ms, starting 500 ms before stimulus onset. The mouse had to determine which side had the higher contrast. For each `curr_ds = all_ds[k]`, the fields below are available. For extra variables (LFP, waveforms, and exact non-binned spike times), see the extra notebook and extra data files.
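Because each bin is 10 ms wide and the recording starts 500 ms before stimulus onset, a bin index can be converted to a time relative to onset. A minimal sketch of that arithmetic follows; the helper name `bin_to_time_ms` and the two constants are illustrative and not part of the original notebook.

```python
import numpy as np

BIN_SIZE_MS = 10        # bin width stated in the description above
PRE_STIMULUS_MS = 500   # the recording starts this long before stimulus onset


def bin_to_time_ms(bin_idx):
    """Return the start time (ms, relative to stimulus onset) of each bin."""
    return np.asarray(bin_idx) * BIN_SIZE_MS - PRE_STIMULUS_MS


# Under these assumptions, stimulus onset falls at the start of bin 50.
print(bin_to_time_ms([0, 50, 249]))
```

With 250 bins per trial, this places the first bin at -500 ms and the last bin at 1990 ms.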

## Fields Used

* `curr_ds['spks']`: neurons by trials by time bins.    
* `curr_ds['brain_area']`: brain area for each neuron recorded.
* `curr_ds['response']`: which side the response was (`-1`, `0`, `1`). When the right-side stimulus had higher contrast, the correct choice was `-1`. `0` is a no go response.
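To make the layout of these fields concrete, the sketch below builds a tiny synthetic stand-in for `spks` and `response` and splits trials by response. The `toy_*` arrays are made up for illustration, and the left/right naming follows this notebook's labeling (`1` as left, `-1` as right), which is an assumption carried over from the code further down.

```python
import numpy as np

# Synthetic stand-in for one session: 3 neurons, 5 trials, 4 time bins.
rng = np.random.default_rng(0)
toy_spks = rng.integers(0, 3, size=(3, 5, 4))
toy_response = np.array([1, -1, 0, 1, -1])

# Boolean masks over the trial axis (axis 1) select trials by response.
left_trials = toy_spks[:, toy_response > 0]     # response == 1
right_trials = toy_spks[:, toy_response < 0]    # response == -1
no_go_trials = toy_spks[:, toy_response == 0]   # response == 0

print(left_trials.shape, right_trials.shape, no_go_trials.shape)
```

Each result keeps the neuron and time-bin axes and only shrinks the trial axis.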

## Fields present (not all are used)

* `curr_ds['mouse_name']`: mouse name
* `curr_ds['date_exp']`: when a session was performed
* `curr_ds['ccf']`: Allen Institute brain atlas coordinates for each neuron.
* `curr_ds['ccf_axes']`: axes names for the Allen CCF.
* `curr_ds['contrast_right']`: contrast level for the right stimulus, which is always contralateral to the recorded brain areas.
* `curr_ds['contrast_left']`: contrast level for left stimulus.
* `curr_ds['gocue']`: when the go cue sound was played.
* `curr_ds['response_time']`: when the response was registered, which has to be after the go cue. The mouse can turn the wheel before the go cue (and nearly always does!), but the stimulus on the screen won't move before the go cue.  
* `curr_ds['feedback_time']`: when feedback was provided.
* `curr_ds['feedback_type']`: if the feedback was positive (`+1`, reward) or negative (`-1`, white noise burst).  
* `curr_ds['wheel']`: turning speed of the wheel that the mouse uses to make a response, sampled at `10ms`.
* `curr_ds['pupil']`: pupil area  (noisy, because pupil is very small) + pupil horizontal and vertical position.
* `curr_ds['face']`: average face motion energy from a video camera.
* `curr_ds['licks']`: lick detections, 0 or 1.   
* `curr_ds['trough_to_peak']`: measures the width of the action potential waveform for each neuron. Widths `<=10` samples are "putative fast spiking neurons".
* `curr_ds['%X%_passive']`: same as above for `X` = {`spks`, `pupil`, `wheel`, `contrast_left`, `contrast_right`} but for passive trials at the end of the recording when the mouse was no longer engaged and stopped making responses.
* `curr_ds['prev_reward']`: time of the feedback (reward/white noise) on the previous trial in relation to the current stimulus time.
* `curr_ds['reaction_time']`: ntrials by 2. First column: reaction time computed from the wheel movement as the first sample above `5` ticks/10ms bin. Second column: direction of the wheel movement (`0` = no move detected).  
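As an example of using one of these fields, the threshold given for `trough_to_peak` can be turned into a boolean mask over neurons. This is only a sketch on made-up widths; `toy_trough_to_peak` is hypothetical data, not values from the dataset.

```python
import numpy as np

# Hypothetical trough-to-peak widths (in samples) for six neurons.
toy_trough_to_peak = np.array([6, 14, 10, 22, 9, 11])

# Widths <= 10 samples are described above as putative fast-spiking neurons.
is_fast_spiking = toy_trough_to_peak <= 10

print(is_fast_spiking.sum(), "of", is_fast_spiking.size, "neurons flagged")
```

The same mask could index the first axis of a `spks`-shaped array to keep only the flagged neurons.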


The original dataset is here: https://figshare.com/articles/dataset/Dataset_from_Steinmetz_et_al_2019/9598406

In [4]:
regions = ["vis ctx", "thal", "hipp", "other ctx", "midbrain", "basal ganglia", "cortical subplate", "other"]
region_colors = ['blue', 'red', 'green', 'darkblue', 'violet', 'lightblue', 'orange', 'gray']
brain_groups = [["VISa", "VISam", "VISl", "VISp", "VISpm", "VISrl"],  # visual cortex
                ["CL", "LD", "LGd", "LH", "LP", "MD", "MG", "PO", "POL", "PT", "RT", "SPF", "TH", "VAL", "VPL", "VPM"], # thalamus
                ["CA", "CA1", "CA2", "CA3", "DG", "SUB", "POST"],  # hippocampal
                ["ACA", "AUD", "COA", "DP", "ILA", "MOp", "MOs", "OLF", "ORB", "ORBm", "PIR", "PL", "SSp", "SSs", "RSP","TT"],  # non-visual cortex
                ["APN", "IC", "MB", "MRN", "NB", "PAG", "RN", "SCs", "SCm", "SCig", "SCsg", "ZI"],  # midbrain
                ["ACB", "CP", "GPe", "LS", "LSc", "LSr", "MS", "OT", "SNr", "SI"],  # basal ganglia
                ["BLA", "BMA", "EP", "EPd", "MEA"]  # cortical subplate
                ]

# Assign each area an index
area_to_index = dict(root=0)
counter = 1
for group in brain_groups:
    for area in group:
        area_to_index[area] = counter
        counter += 1

# Figure out which areas are in each dataset
areas_by_dataset = np.zeros((counter, len(all_ds)), dtype=bool)
for j, d in enumerate(all_ds):
    for area in np.unique(d['brain_area']):
        i = area_to_index[area]
        areas_by_dataset[i, j] = True


time: 12.9 ms (started: 2023-07-21 03:27:22 +00:00)


In [5]:
DATASET_IDX = 11
curr_ds = all_ds[DATASET_IDX]

dt = curr_ds["bin_size"]
NUM_NEURONS_RECORDED = curr_ds["spks"].shape[0]
NUM_TRIALS = curr_ds["spks"].shape[1]
NUM_BINNED_TIMES = curr_ds["spks"].shape[2]

if DATASET_IDX != 11:
    raise Exception("Code is only meant for DATASET_IDX=11")
else:
    NUM_REGIONS = 4
    NUM_NEURONS_RECORDED = len(curr_ds["brain_area"])  # The string idx version of

brain_subregions = NUM_REGIONS * np.ones(NUM_NEURONS_RECORDED, )  # last one is "other"
for j in range(NUM_REGIONS):
  brain_subregions[
      np.isin(curr_ds['brain_area'], brain_groups[j])
      ] = j  # assign a number to each region


time: 2.66 ms (started: 2023-07-21 03:27:22 +00:00)


# Creating the dataset

1) Create the labels

2) Create a dataset dictionary where the keys are brain areas (sub-regions) and the values are all the neuron readings that are in that area/sub-region

3) Let users configure how the data is prepared: whether to consider region interactions, which part of the spike train (start, middle, or end) to use, and so on.

In [6]:
LABELS = curr_ds["response"]  # RIGHT - NO_GO - LEFT (-1, 0, 1)
y = LABELS

time: 544 µs (started: 2023-07-21 03:27:22 +00:00)


In [7]:
def log_shapes(ds):
    _ds = ds['spks']
    print(f"All spikes shape: {_ds.shape}")
    _ds_brain_region = _ds[brain_subregions == 0]
    print(f"\t- Spike shape for sample brain region (0-th): {_ds_brain_region.shape}")

    # NOTE: `y >= 0` also selects no-go trials (y == 0); use `y > 0` for left responses only
    _ds_0th_left_response = _ds_brain_region[:, y >= 0]
    print(f"\t- Spike shape for sample brain region (0-th) left responses: {_ds_0th_left_response.shape}")

    averaged_over_left_response = _ds_0th_left_response.mean(axis=(0, 1))
    print(f"\t- Averaged brain region (0-th) left responses: {averaged_over_left_response.shape}")

log_shapes(curr_ds)


All spikes shape: (698, 340, 250)
	- Spike shape for sample brain region (0-th): (145, 340, 250)
	- Spike shape for sample brain region (0-th) left responses: (145, 199, 250)
	- Averaged brain region (0-th) left responses: (250,)
time: 27.2 ms (started: 2023-07-21 03:27:22 +00:00)


## Creating the fine-grained data dictionary

In [47]:
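# NOTE: `arr_of_subregions` must be a NumPy array (the `==` below builds a boolean mask),
# and the function reads the module-level labels `y` defined above.
def dataset_by_subregion(arr_of_subregions: np.ndarray, ds: Dict[str, Any]) -> Dict[str, List[np.ndarray]]: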
    spike_partitioned = {}  # brain region to spike mapping
    unique_subregions = set(arr_of_subregions)
    for subregion in unique_subregions:
        subregion_idxs = arr_of_subregions == subregion
        subregion_data = ds["spks"][subregion_idxs]


        # from the "Dataset Description" section above
        #       > which side the response was (-1, 0, 1)
        spikes_for_right_response = subregion_data[:, y < 0]
        spikes_for_left_response = subregion_data[:, y > 0]

        # spikes_for_no_response = subregion_data[:, y == 0]

        spike_partitioned[subregion] = [
            spikes_for_left_response,
            # spikes_for_no_response,
            spikes_for_right_response
        ]
    return spike_partitioned

subregion_data_dict = dataset_by_subregion(curr_ds["brain_area"], curr_ds)

time: 66.5 ms (started: 2023-07-21 04:11:44 +00:00)


In [48]:
print("Number of Neurons recorded in each subregion ")
running_sum = 0
for k, v in subregion_data_dict.items():
    print(f"\t{k}\t {v[0].shape[0]}")
    running_sum += v[0].shape[0]

assert running_sum == curr_ds["spks"].shape[0], "Our totaled neurons across all subregions are not equal to the number of neurons measured"
print(running_sum)

Number of Neurons recorded in each subregion 
	LH	 18
	MD	 126
	root	 100
	LGd	 11
	MOs	 6
	CA1	 50
	VISam	 79
	VISp	 66
	ACA	 16
	PL	 56
	DG	 65
	SUB	 105
698
time: 2.71 ms (started: 2023-07-21 04:11:46 +00:00)
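The per-area neuron counts can also be obtained directly from the area labels, without building the partitioned dictionary first. The sketch below uses `np.unique` with `return_counts=True` on a small made-up label array (`toy_brain_area` is illustrative, standing in for `curr_ds["brain_area"]`).

```python
import numpy as np

# Made-up area labels, one per neuron.
toy_brain_area = np.array(["MD", "CA1", "MD", "root", "CA1", "MD"])

# np.unique returns the sorted unique labels and how often each occurs.
areas, counts = np.unique(toy_brain_area, return_counts=True)
neurons_per_area = dict(zip(areas.tolist(), counts.tolist()))

print(neurons_per_area)
# The counts should always sum to the total number of neurons.
assert sum(neurons_per_area.values()) == toy_brain_area.size
```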


## Creating the Coarse-Grained Data Dictionary

- we do this manually since we do not have too many subregions

### All regions
```python
["VISa", "VISam", "VISl", "VISp", "VISpm", "VISrl"],  # visual cortex
["CL", "LD", "LGd", "LH", "LP", "MD", "MG", "PO", "POL", "PT", "RT", "SPF", "TH", "VAL", "VPL", "VPM"], # thalamus
["CA", "CA1", "CA2", "CA3", "DG", "SUB", "POST"],  # hippocampal
["ACA", "AUD", "COA", "DP", "ILA", "MOp", "MOs", "OLF", "ORB", "ORBm", "PIR", "PL", "SSp", "SSs", "RSP","TT"],  # non-visual cortex
["APN", "IC", "MB", "MRN", "NB", "PAG", "RN", "SCs", "SCm", "SCig", "SCsg", "ZI"],  # midbrain
["ACB", "CP", "GPe", "LS", "LSc", "LSr", "MS", "OT", "SNr", "SI"],  # basal ganglia
["BLA", "BMA", "EP", "EPd", "MEA"]  # cortical subplate
```


### Refined Regions

- only the ones relevant to our dataset

```
MD -> thalamus
ACA -> non-visual-cortex
SUB -> hippocampal
CA1 -> hippocampal
DG -> hippocampal
LGd -> thalamus
LH -> thalamus
PL -> non-visual-cortex
root ->
VISp -> visual-cortex
MOs -> non-visual-cortex
VISam -> visual-cortex
```
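The same area-to-region assignment could also be derived from the group lists rather than written by hand. The sketch below is one possible way to do that, assuming the coarse names used in this notebook; `coarse_groups`, `area_to_coarse`, and `session_areas` are illustrative names, and areas that belong to no group (such as `root`) map to `None`.

```python
# The first four groups from the list above, keyed by coarse region name.
coarse_groups = {
    "visual-cortex": ["VISa", "VISam", "VISl", "VISp", "VISpm", "VISrl"],
    "thalamus": ["CL", "LD", "LGd", "LH", "LP", "MD", "MG", "PO", "POL",
                 "PT", "RT", "SPF", "TH", "VAL", "VPL", "VPM"],
    "hippocampal": ["CA", "CA1", "CA2", "CA3", "DG", "SUB", "POST"],
    "non-visual-cortex": ["ACA", "AUD", "COA", "DP", "ILA", "MOp", "MOs", "OLF",
                          "ORB", "ORBm", "PIR", "PL", "SSp", "SSs", "RSP", "TT"],
}

# Invert the grouping: each area points to the coarse region that contains it.
area_to_coarse = {
    area: name for name, areas in coarse_groups.items() for area in areas
}

# Areas present in the session used in this notebook.
session_areas = ["MD", "ACA", "SUB", "CA1", "DG", "LGd", "LH",
                 "PL", "root", "VISp", "MOs", "VISam"]
refined = {area: area_to_coarse.get(area) for area in session_areas}

for area, region in refined.items():
    print(f"{area} -> {region}")
```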

In [49]:
len(subregion_data_dict["MD"])

2

time: 3.9 ms (started: 2023-07-21 04:11:48 +00:00)


#### Manual Insertion

In [50]:

def consolidate_fine_grained(subregion_dict):

    mapping = {
        "thalamus": ["MD", "LGd", "LH"],
        "non-visual-cortex": ["ACA", "PL", "MOs"],
        "hippocampal": ["SUB", "CA1", "DG"],
        "visual-cortex": ["VISp", "VISam"]
    }


    # Each coarse region maps to [left_response_spikes, right_response_spikes]
    coarse_region_data_dict: Dict[str, List[np.ndarray]] = dict()

    for coarse_region_name, subregion_name_arr in mapping.items():
        print("*" * 10)
        print(coarse_region_name)
        for subregion_name in subregion_name_arr:
            print(f"Subregion: {subregion_name}")
            if coarse_region_name not in coarse_region_data_dict:
                print(f"\tInit: Left and Right: {subregion_dict[subregion_name][0].shape}, {subregion_dict[subregion_name][1].shape}")
                coarse_region_data_dict[coarse_region_name] = copy.deepcopy(subregion_dict[subregion_name])
            else:
                subregion_left, subregion_right = subregion_dict[subregion_name]

                print(f"\tIncoming Shapes: Left and Right: {subregion_left.shape}, {subregion_right.shape}")
                # print(f"Container: {coarse_region_data_dict[coarse_region_name][1]}")
                coarse_region_data_dict[coarse_region_name][0] = np.vstack(
                    (coarse_region_data_dict[coarse_region_name][0],
                    subregion_left)
                )
                coarse_region_data_dict[coarse_region_name][1] = np.vstack(
                    (coarse_region_data_dict[coarse_region_name][1],
                    subregion_right)
                )
            print(f"\tPost-stack shapes: {coarse_region_data_dict[coarse_region_name][0].shape} {coarse_region_data_dict[coarse_region_name][1].shape}")
    return coarse_region_data_dict

coarse_region_data_dict = consolidate_fine_grained(subregion_data_dict)

**********
thalamus
Subregion: MD
	Init: Left and Right: (126, 135, 250), (126, 141, 250)
	Post-stack shapes: (126, 135, 250) (126, 141, 250)
Subregion: LGd
	Incoming Shapes: Left and Right: (11, 135, 250), (11, 141, 250)
	Post-stack shapes: (137, 135, 250) (137, 141, 250)
Subregion: LH
	Incoming Shapes: Left and Right: (18, 135, 250), (18, 141, 250)
	Post-stack shapes: (155, 135, 250) (155, 141, 250)
**********
non-visual-cortex
Subregion: ACA
	Init: Left and Right: (16, 135, 250), (16, 141, 250)
	Post-stack shapes: (16, 135, 250) (16, 141, 250)
Subregion: PL
	Incoming Shapes: Left and Right: (56, 135, 250), (56, 141, 250)
	Post-stack shapes: (72, 135, 250) (72, 141, 250)
Subregion: MOs
	Incoming Shapes: Left and Right: (6, 135, 250), (6, 141, 250)
	Post-stack shapes: (78, 135, 250) (78, 141, 250)
**********
hippocampal
Subregion: SUB
	Init: Left and Right: (105, 135, 250), (105, 141, 250)
	Post-stack shapes: (105, 135, 250) (105, 141, 250)
Subregion: CA1
	Incoming Shapes: Left and Ri

#### Dataset Construction

In [64]:
def populate_data(designed_matrix, is_left, X_container, y_container):
    """
    designed_matrix is of shape (a, b, c)
        a:= num_neurons in coarse_region
        b:= num_trials  (in this case either the left or right response trials)
        c:= spike_train
    """
    for trials_matrix in designed_matrix:
        for spike_train in trials_matrix:
            X_container.append(spike_train)
            y_container.append(1 if is_left else -1)
    return X_container, y_container

def encode_coarse_data(
    coarse_data_dict: Dict[str, List[np.ndarray]],
):
    one_hot_idx = 0  # position of the current coarse region in the one-hot vector

    X_container = []
    y_container = []
    for coarse_region_name, coarse_region_data in coarse_data_dict.items():
        print("*" * 20)
        print(coarse_region_name)
        # Each coarse region holds two arrays of shape (neurons, trials, bins),
        # already stacked across its subregions
        left = coarse_region_data[0]   # trials with a left response (label +1)
        right = coarse_region_data[1]  # trials with a right response (label -1)
        _l_shape = left.shape
        _r_shape = right.shape

        assert _l_shape[-1] == 250
        assert _r_shape[-1] == 250


        ##########################################
        # Add 1-hot encoded data
        #   For more information: https://en.wikipedia.org/wiki/One-hot

        # Size the one-hot vector from the function argument, not the global dict
        vec_one_hot = [0 for _ in range(len(coarse_data_dict))]
        vec_one_hot[one_hot_idx] = 1

        left_pad = np.tile(vec_one_hot, (_l_shape[0], _l_shape[1], 1))
        print(left_pad.shape)
        right_pad = np.tile(vec_one_hot, (_r_shape[0], _r_shape[1], 1))
        left_designed = np.concatenate((left_pad, left), axis=-1)
        right_designed = np.concatenate((right_pad, right), axis=-1)

        print(f"Shape B4 populating: {np.asarray(X_container).shape}, {np.asarray(y_container).shape}")
        X_container, y_container = populate_data(left_designed, True, X_container, y_container)
        X_container, y_container = populate_data(right_designed, False, X_container, y_container)
        print(f"Shape After populating: {np.asarray(X_container).shape}, {np.asarray(y_container).shape}")

        # We are now in a new region, so we increment the index for the one-hot
        one_hot_idx += 1

    return np.asarray(X_container), np.asarray(y_container)


Xs, ys = encode_coarse_data(
    coarse_region_data_dict,
)

********************
thalamus
(155, 135, 4)
Shape B4 populating: (0,), (0,)
Shape After populating: (42780, 254), (42780,)
********************
non-visual-cortex
(78, 135, 4)
Shape B4 populating: (42780, 254), (42780,)
Shape After populating: (64308, 254), (64308,)
********************
hippocampal
(220, 135, 4)
Shape B4 populating: (64308, 254), (64308,)
Shape After populating: (125028, 254), (125028,)
********************
visual-cortex
(145, 135, 4)
Shape B4 populating: (125028, 254), (125028,)
Shape After populating: (165048, 254), (165048,)
time: 1.66 s (started: 2023-07-21 04:33:53 +00:00)


In [66]:
Xs.shape

(165048, 254)

time: 22.6 ms (started: 2023-07-21 04:43:55 +00:00)
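The 254 columns are the 4-element one-hot region code followed by the 250 time bins, and every (neuron, trial) pair becomes one row. For example, the thalamus contributes 155 × (135 + 141) = 42780 rows, matching the log above. The sketch below reproduces the same construction on a tiny made-up array; the `toy_region` shape and the chosen region index are illustrative only.

```python
import numpy as np

n_regions = 4
# Made-up data for one region: 2 neurons, 3 trials, 5 time bins.
toy_region = np.arange(2 * 3 * 5).reshape(2, 3, 5)

# One-hot code for the region at index 1.
one_hot = np.zeros(n_regions, dtype=int)
one_hot[1] = 1

# Repeat the code once per (neuron, trial), then prepend it to each spike train.
prefix = np.tile(one_hot, (toy_region.shape[0], toy_region.shape[1], 1))
designed = np.concatenate((prefix, toy_region), axis=-1)

# Flatten (neuron, trial) pairs into rows, in the same neuron-major order
# as the nested loops in populate_data.
rows = designed.reshape(-1, designed.shape[-1])

print(prefix.shape, designed.shape, rows.shape)
```

Here each row has 4 + 5 = 9 columns, mirroring the 4 + 250 = 254 columns of `Xs`.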


# Dataset Configuration

Here we use the `DATASET_PARAMETERS` that was specified above

In [71]:

def create_dataset(dataset_parameters, X_data, y_data):
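    """
    Slice each row to the configured window, optionally shuffle, and split into train/test sets.

    Expected keys in `dataset_parameters`:
        "window_choice": a WindowChoice member (START, MID, or END)
        "window_size": number of columns to keep per row
        "train-test-split": fraction of rows used for training
        "shuffle": whether to shuffle rows before splitting
    """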

    ##############################################
    # First step is we extract the spike train window of interest based on whether we want the start, mid, or end
    window_choice = dataset_parameters["window_choice"]

    if window_choice == WindowChoice.START:
        start = 0
        end = start + dataset_parameters["window_size"]
    elif window_choice == WindowChoice.MID:
        start = X_data.shape[-1] // 2
        end = start + dataset_parameters["window_size"]
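    else:
        # WindowChoice.END: keep the last `window_size` columns, including the final bin
        start = X_data.shape[-1] - dataset_parameters["window_size"]
        end = X_data.shape[-1]

    # NOTE: the leading columns of X_data hold the one-hot region code, so a
    # START window overlaps it while MID and END windows drop it.
    X_data = X_data[:, start:end]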

    ##############################################
    # Optionally shuffle the dataset
    if dataset_parameters["shuffle"]:
        X_data, y_data = shuffle(X_data, y_data)


    ##############################################
    # Next step is to split into train-test.
    # "train-test-split" is the training fraction (0.8 -> 80% train, 20% test).

    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, train_size=dataset_parameters["train-test-split"], random_state=42)

    return X_train, X_test, y_train, y_test

create_dataset(DATASET_PARAMETERS, Xs, ys)

(array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]),
 array([[0, 0, 0, ..., 0, 0, 0],
        [1, 0, 1, ..., 0, 0, 1],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]),
 array([ 1,  1,  1, ...,  1, -1, -1]),
 array([ 1, -1, -1, ..., -1,  1, -1]))

time: 416 ms (started: 2023-07-21 04:49:34 +00:00)
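Since each row of `Xs` starts with the one-hot region code, one variation on the window selection is to compute the bounds over the spike bins only and shift them past the code. The sketch below is a hypothetical alternative, not the notebook's method: `Window` mirrors `WindowChoice`, `window_bounds` and its `offset` parameter are illustrative names, and the MID window is centered here, whereas `create_dataset` starts it at the midpoint.

```python
import enum


class Window(enum.Enum):
    """Mirrors WindowChoice from the parameters cell."""
    START = 0
    MID = 1
    END = 2


def window_bounds(n_bins, window_size, choice, offset=0):
    """Return (start, end) column indices of the window.

    n_bins counts spike bins only; offset is the number of leading
    non-spike columns (for example, the one-hot region code).
    """
    if not 0 < window_size <= n_bins:
        raise ValueError("window_size must be in (0, n_bins]")
    if choice is Window.START:
        start = 0
    elif choice is Window.MID:
        start = (n_bins - window_size) // 2  # centered window (an assumption)
    else:
        start = n_bins - window_size         # ends on the final bin
    return offset + start, offset + start + window_size


# 250 spike bins preceded by a 4-column one-hot code.
for choice in Window:
    print(choice.name, window_bounds(250, 50, choice, offset=4))
```

The one-hot columns could then be kept alongside the window, for instance by concatenating `X[:, :4]` with `X[:, start:end]`.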
