### Hands On Session: Dataset Engineering

### By Adi Nair, Anderson lab

In this section, we'll examine some common problems in datasets that hinder classification and other tasks, and think of ways to fix them. We'll use the same dataset as before, from [Remedios et al., 2017](https://www.nature.com/articles/nature23885).

### Table of contents

#### 1) How well does your model perform by chance?
Creating null distributions & chance performance for linear models

#### 2) Class imbalances
Using resampling and shuffling to overcome class imbalances

#### 3) Data quality checks: Interpolation & Smoothing
Defining metrics to evaluate the quality of recordings & handle missing entries in datasets



[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SaberaTalukder/Chen_Institute_DataSAI_for_Neuroscience/blob/main/07_05_22_day1_overview/code/diy_notebooks/dataset_engineering.ipynb)

In [9]:
# Import necessary modules and helper functions
# TODO: delete unused packages

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
from sklearn.decomposition import PCA
import matplotlib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from matplotlib import image
from sklearn.utils import shuffle
from scipy import interpolate
from imblearn.under_sampling import RandomUnderSampler

import imblearn


### Load data

Just as before, let's load this dataset.

In [2]:

!wget -O hypothalamus_calcium_imaging_remedios_et_al.mat "https://raw.githubusercontent.com/SaberaTalukder/Chen_Institute_DataSAI_for_Neuroscience/main/07_05_22_day1_overview/data/hypothalamus_calcium_imaging_remedios_et_al.mat?raw=true"

remedios_data = loadmat("hypothalamus_calcium_imaging_remedios_et_al.mat")
neural_data = remedios_data['neural_data']
attack_vector = remedios_data['attack_vector']
sex_vector = remedios_data['sex_vector']


### 1) How well does your model perform by chance?

We've previously seen that training a simple linear model to classify attack frames in this dataset fails due to the rarity of the attack label. But how badly did our models do?

We need to establish a baseline performance or chance-level performance as a lower bound for what we might expect our model to achieve. 

A simple way to do this is to shuffle our behavior vector independently of our neural activity matrix and use the shuffled vector for classification. Repeating this many times with different shuffles gives a null distribution, whose mean estimates chance performance.
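The shuffling procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, not the Remedios dataset: the array shapes (frames x neurons), the roughly 10% attack rate, the least-squares classifier thresholded at 0.5, and the helper names `fit_predict` and `classwise_accuracy` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (an assumption, not the Remedios dataset):
# 2000 frames x 20 neurons, with roughly 10% of frames labelled as attack.
n_frames, n_neurons = 2000, 20
labels = (rng.random(n_frames) < 0.1).astype(float)
activity = rng.normal(size=(n_frames, n_neurons))
activity[:, :5] += 2.0 * labels[:, None]  # first 5 neurons respond to attack


def fit_predict(X_train, y_train, X_test):
    """Least-squares linear model with an intercept, thresholded at 0.5."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    weights, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return (A_test @ weights > 0.5).astype(float)


def classwise_accuracy(y_true, y_pred):
    """Fraction of correctly classified frames for each class."""
    not_attack = np.mean(y_pred[y_true == 0] == 0)
    attack = np.mean(y_pred[y_true == 1] == 1)
    return not_attack, attack


# Fixed train/test split: 1500 randomly chosen frames for training, the rest for testing.
order = rng.permutation(n_frames)
train, test = order[:1500], order[1500:]

# Null distribution: shuffle labels relative to the activity, refit, re-score.
n_shuffles = 100
null_scores = np.empty((n_shuffles, 2))
for k in range(n_shuffles):
    shuffled = rng.permutation(labels)
    pred = fit_predict(activity[train], shuffled[train], activity[test])
    null_scores[k] = classwise_accuracy(shuffled[test], pred)

chance_not_attack, chance_attack = null_scores.mean(axis=0)
print(f"chance accuracy, not-attack: {chance_not_attack:.2f}")
print(f"chance accuracy, attack:     {chance_attack:.2f}")
```

Each row of `null_scores` is one draw from the null distribution, so a histogram of a column shows how much chance performance varies from shuffle to shuffle.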

#### Task: Shuffle attack_vector and use the shuffled vector to perform simple linear classification below:

In [None]:
# Enter code here:

#### Observe that even by chance, we classify 'not-attack' labels correctly about 90% of the time, while chance performance for 'attack' is around 0%!

If you ignored class-wise performance metrics, you can see why an average performance is misleading: it is dominated by the frequent 'not-attack' label and says little about performance on 'attack'.
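A small worked example makes the gap between average and class-wise performance concrete. The label counts below (90 not-attack frames, 10 attack frames) are made up for illustration, and the "classifier" simply predicts the majority class for every frame.

```python
import numpy as np

# Toy labels (an assumption): 90 not-attack frames (0) and 10 attack frames (1).
y_true = np.array([0] * 90 + [1] * 10)

# A degenerate classifier that always predicts the majority class.
y_pred = np.zeros_like(y_true)

overall = np.mean(y_pred == y_true)
not_attack_acc = np.mean(y_pred[y_true == 0] == 0)
attack_acc = np.mean(y_pred[y_true == 1] == 1)

print(overall, not_attack_acc, attack_acc)  # 0.9 1.0 0.0
```

The overall accuracy of 0.9 looks respectable even though not a single attack frame is detected, which is why the class-wise numbers are worth reporting separately.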

There are also other important considerations for a proper chance calculation with neural data that can be addressed at the level of experimental design. [Harris 2022](https://www.biorxiv.org/content/10.1101/2020.11.29.402719v3#) is a great read on this topic, and you'll encounter this tomorrow with Ann Kennedy.

So, how can we account for the sparsity of the attack label?

### 2) Class imbalance

One method that allows us to fix this problem is to resample our dataset so that there are equal numbers of attack and non-attack labels. Resampling techniques are a broad class of tools that allow either random oversampling of the sparse label ('attack') or random undersampling of the majority label ('not-attack').
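Random undersampling of the majority label can be sketched by hand in a few lines of NumPy. The function name `random_undersample`, the toy data, and the frames x neurons layout are assumptions for this example; the real `neural_data` array may need transposing first.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class is as small as the rarest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original temporal order of the kept frames
    return X[keep], y[keep]


# Toy data (an assumption): 1000 frames x 8 neurons, roughly 10% attack frames.
rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.1).astype(int)
X = rng.normal(size=(1000, 8))

X_res, y_res = random_undersample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))
```

Undersampling discards most of the not-attack frames, so the resampled dataset is much smaller than the original; random oversampling would instead draw attack frames with replacement until the classes match.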

Python once again has an easy method to implement this in the form of the [imbalanced-learn](https://imbalanced-learn.org/stable/user_guide.html) library.

#### Task: Let's first perform random undersampling using [RandomUnderSampler](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html) and then make a PCA plot using the resampled data below:

#### P.S. You can also implement this manually! Try that out too below:

In [None]:
# Enter code here using imbalanced-learn:

In [None]:
# Enter code for manual implementation here:

Great! As you can see, the resampled data contains equal numbers of attack and not-attack labels.

#### Task: Train a simple linear model below to separate data in the resampled case

In [None]:
# Enter code here

#### Clearly, resampling helps and our model performance has greatly improved for the attack label! 

But what about the baseline or chance level performance of our model after resampling? 

Would the chance performance of the attack label still be zero? 

#### Task: Calculate a shuffled performance for the resampled case above 

In [None]:
# Enter code here:

#### Thus, about 50% of the time, our model would accurately classify attack by chance. Since our model performance (>90%) is much greater than the chance performance, we can be confident that the linear model has learned to distinguish attack from not-attack labels.
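The comparison between observed and chance performance on balanced data can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the synthetic balanced dataset, the least-squares classifier thresholded at 0.5, the helper name `attack_accuracy`, and the use of an empirical p-value to summarise the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced toy data (an assumption): 200 not-attack and 200 attack frames.
labels = np.repeat([0.0, 1.0], 200)
activity = rng.normal(size=(400, 10))
activity[:, :3] += 2.0 * labels[:, None]  # 3 neurons carry attack signal


def attack_accuracy(X, y, rng):
    """Fit least squares on a random 75% split; return accuracy on test attack frames."""
    order = rng.permutation(len(y))
    train, test = order[:300], order[300:]
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    pred = A[test] @ w > 0.5
    return np.mean(pred[y[test] == 1])


# Performance with the true labels.
observed = attack_accuracy(activity, labels, rng)

# Null distribution: the same procedure with labels shuffled relative to activity.
null = np.array([
    attack_accuracy(activity, rng.permutation(labels), rng)
    for _ in range(200)
])

# Fraction of shuffles that do at least as well as the real labels.
p_value = np.mean(null >= observed)
print(f"observed: {observed:.2f}, chance: {null.mean():.2f}, p = {p_value:.3f}")
```

With balanced classes the null distribution is centred near 0.5 rather than 0, so the observed score has to be judged against that higher baseline.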

### 2 b) Other resampling approaches

Besides random undersampling, several other undersampling approaches can be easily implemented using imblearn. Find out more [here](https://imbalanced-learn.org/stable/references/under_sampling.html).

Most deep learning packages, such as PyTorch, contain functions to perform resampling on their own data structures, but packages like imblearn are general purpose and are useful in many non-deep-learning scenarios.

### 3) Data quality checks: Interpolation & Smoothing

When dealing with neural data, it's also important to assess data quality, such as the noise level in your recording. Many data pre-processing pipelines (such as Kilosort for electrophysiology data and CNMF-E for calcium imaging data) already perform checks on data quality, but in this exercise we'll think about how to identify corrupted or noisy data ourselves.


### Load corrupted dataset

In [21]:
!wget -O hypothalamus_corrupted_calcium_imaging_remedios_et_al.mat "https://raw.githubusercontent.com/SaberaTalukder/Chen_Institute_DataSAI_for_Neuroscience/main/07_05_22_day1_overview/data/hypothalamus_corrupted_calcium_imaging_remedios_et_al.mat?raw=true"

remedios_data = loadmat("hypothalamus_corrupted_calcium_imaging_remedios_et_al.mat")
neural_data = remedios_data['neural_data_corrupted']
attack_vector = remedios_data['attack_vector']
sex_vector = remedios_data['sex_vector']


### 3 a) Interpolation of missing data

Plot the activity of neurons in neural_data below and examine it carefully. Some neurons have been corrupted to have missing entries (entries replaced by 0). Can you identify which neurons they are?

#### Hint: Think of ways to identify continuous 'chunks' of zeros in the activity of individual neurons. For example, differentiate the signal and search for periods of time where the derivative is zero.
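One possible way to locate such chunks is sketched below on toy data. It is a variation on the hint: instead of thresholding the derivative of the raw signal, it differences a boolean "is zero" mask, so the run boundaries show up as +1 and -1. The function name `find_zero_runs`, the `min_length` threshold, and the neurons x frames toy array are assumptions for this example.

```python
import numpy as np


def find_zero_runs(trace, min_length=10):
    """Return (start, stop) index pairs for runs of zeros at least min_length long.

    stop is exclusive, so trace[start:stop] is the run.
    """
    is_zero = np.concatenate(([False], trace == 0, [False]))
    # Differencing the padded mask marks where runs begin (+1) and end (-1).
    edges = np.diff(is_zero.astype(int))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)
    long_enough = (stops - starts) >= min_length
    return list(zip(starts[long_enough].tolist(), stops[long_enough].tolist()))


# Toy data (an assumption): 3 neurons x 200 frames, strictly positive activity.
rng = np.random.default_rng(0)
data = rng.random((3, 200)) + 0.1
data[1, 40:70] = 0.0  # corrupt neuron 1 with a chunk of missing entries

runs = {i: find_zero_runs(row) for i, row in enumerate(data)}
corrupted = {i: r for i, r in runs.items() if r}
print(corrupted)  # {1: [(40, 70)]}
```

The `min_length` threshold keeps isolated zeros, which can occur naturally in some recordings, from being flagged as missing data.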

In [None]:
# Enter code here

#### Great! Now how do we deal with such missing values?

One simple method is [linear interpolation](https://www.cuemath.com/linear-interpolation-formula/), which can be implemented using scipy's [interp1d](https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html).
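To show what linear interpolation does to a gap, here is a minimal sketch on a toy trace. It deliberately uses NumPy's `np.interp` rather than `interp1d`; both draw straight lines between the known samples. The function name `fill_gap_linear` and the `(start, stop)` gap convention (stop exclusive) are assumptions for this example.

```python
import numpy as np


def fill_gap_linear(trace, start, stop):
    """Linearly interpolate trace[start:stop] from the samples outside the gap."""
    filled = trace.astype(float).copy()
    t = np.arange(len(trace))
    good = np.ones(len(trace), dtype=bool)
    good[start:stop] = False
    # np.interp evaluates the piecewise-linear curve through the good samples.
    filled[~good] = np.interp(t[~good], t[good], filled[good])
    return filled


trace = np.array([1.0, 2.0, 0.0, 0.0, 0.0, 6.0, 7.0])
print(fill_gap_linear(trace, 2, 5))  # [1. 2. 3. 4. 5. 6. 7.]
```

If a gap touches the start or end of the recording, `np.interp` holds the nearest known value constant there, since there is no sample on the other side to interpolate towards.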

#### Implement interp1d to fill in missing values of neurons identified above:

In [None]:
# Enter code here:

### 3 b) Smoothing:

In addition to missing entries, neural signals may be corrupted by noise emanating from various sources, some biological and some technical. We've intentionally corrupted some neural signals in this dataset with added Gaussian noise. Can you identify them?

#### Plot the 4 corrupted neurons below. Hint: calculate the [signal-to-noise](https://github.com/scipy/scipy/blob/v0.16.0/scipy/stats/stats.py#L1963) ratio of each neuron and look at the 4 neurons with the highest SNR.
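A per-neuron signal-to-noise ratio in the mean-over-standard-deviation sense used by the linked function can be computed by hand, as sketched below. The function name `snr_per_neuron`, the neurons x frames layout, and the toy data are assumptions. In this toy example the noisy rows end up with the smallest ratio, because zero-mean noise inflates the standard deviation; on the real dataset it is worth inspecting both ends of the ranking, since the outcome depends on the data and on how the noise was added.

```python
import numpy as np


def snr_per_neuron(data):
    """Mean divided by standard deviation along time, one value per neuron (row)."""
    mean = data.mean(axis=1)
    sd = data.std(axis=1)
    # Return 0 where the standard deviation is 0 to avoid dividing by zero.
    return np.divide(mean, sd, out=np.zeros_like(mean), where=sd > 0)


# Toy data (an assumption): 6 neurons x 1000 frames with a baseline of 1.
rng = np.random.default_rng(0)
data = 1.0 + 0.1 * rng.normal(size=(6, 1000))
noisy_rows = [1, 4]
data[noisy_rows] += rng.normal(scale=1.0, size=(2, 1000))  # add Gaussian noise

snr = snr_per_neuron(data)
ranking = np.argsort(snr)  # neuron indices from lowest to highest SNR
print("SNR per neuron:", np.round(snr, 2))
print("two lowest-SNR neurons:", ranking[:2])
```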

In [54]:
# Enter code here:

### Moving average filter

A simple method to smooth data is to average it over a sliding window of $M$ time bins. This can be written as:
$y(i) = \frac{1}{M} \sum_{j=0}^{M-1} x(i+j)$
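The formula maps directly onto a convolution with a uniform kernel. The sketch below is one possible 1-D implementation, with the function name `moving_average` chosen for this example; it only returns values of $i$ for which the whole window fits inside the signal, so the output is shorter than the input by $M-1$ samples.

```python
import numpy as np


def moving_average(x, M):
    """y(i) = (1/M) * sum_{j=0}^{M-1} x(i+j), for every i where the window fits."""
    kernel = np.ones(M) / M
    return np.convolve(x, kernel, mode="valid")


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(moving_average(x, 2))  # [1.5 2.5 3.5 4.5 5.5]
```

A centred window with an odd size, padded at the edges, would instead keep the output the same length as the input and avoid shifting features in time.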

#### Task: Implement moving average as a function below:

In [None]:
# Enter code here:

def smoothMA(a, WSZ):
    # a: NumPy 1-D or 2-D array containing the data to be smoothed (row-wise)
    # WSZ: smoothing window size, which must be an odd number
    # TODO: compute the smoothed array aSm here

    return aSm

How well does a moving average filter perform at smoothing data? 

#### Task: Use the moving average filter you defined to smooth neural activity for identified neurons in 3 b)

In [174]:
# Enter code here: