# Hands On Session: Dimensionality Reduction, Principal Components Analysis (PCA), and Singular Value Decomposition (SVD)
# By: Sabera Talukder

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SaberaTalukder/Chen_Institute_DataSAI_for_Neuroscience/blob/main/07_05_22_day1_overview/code/diy_notebooks/dimensionality_reduction.ipynb)

In [None]:
# All Imports - alphabetically ordered with shortcuts
import matplotlib.pyplot as plt
import numpy as np
import time

from mpl_toolkits import mplot3d
from numpy.linalg import svd
from scipy.io import loadmat
from sklearn.decomposition import PCA

# Data Exploration

## Hint: do not reinvent the wheel! If you want to do something, a preexisting package, library, or function probably already does it. Google & Stack Overflow are your friends 😃

## Load Data
### Dataset background: today we'll be working with calcium imaging data from one male mouse. We have already converted the calcium imaging videos into continuous neural signals, so you don't have to worry about it (you're welcome 😘). The male mouse has different visitors in his cage throughout the recording, and we'll explore dimensionality reduction by determining if it's 🐭❤️ or 🐭 😡!

#### Let's start by loading in our dataset!

In [1]:
!wget https://raw.githubusercontent.com/SaberaTalukder/Chen_Institute_DataSAI_for_Neuroscience/main/07_05_22_day1_overview/data/hypothalamus_calcium_imaging_remedios_et_al.mat?raw=true
!mv hypothalamus_calcium_imaging_remedios_et_al.mat\?raw\=true hypothalamus_calcium_imaging_remedios_et_al.mat

hypothalamus_data = loadmat('hypothalamus_calcium_imaging_remedios_et_al.mat')

## How many data arrays are contained in hypothalamus_data?
#### Hint: what happens if you type the variable name in a cell and run the cell?
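
As a hedged illustration of what `loadmat` hands back, the sketch below writes a small toy `.mat` file and reads it in again. The file name `toy_example.mat` and the array names `toy_signal` and `toy_labels` are made up for this example; they are not the names stored in the hypothalamus dataset.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Toy arrays with made-up names (NOT the names used in the real dataset).
toy = {"toy_signal": np.random.rand(4, 10), "toy_labels": np.arange(10)}

# Write the toy arrays to a temporary .mat file, then load them back.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_example.mat")
savemat(toy_path, toy)
loaded = loadmat(toy_path)

# loadmat returns a dict; keys wrapped in double underscores are file metadata.
array_names = sorted(k for k in loaded if not k.startswith("__"))
print(array_names)
for name in array_names:
    print(name, loaded[name].shape)
```

The same filtering of double-underscore keys should work on any dictionary returned by `loadmat`.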

In [None]:
# Enter code here:

## Extract the N data arrays into N separate variables.

In [None]:
# Enter code here:

## What is the dimensionality of each of the N data arrays?
## What do you think the dimensions represent?

In [None]:
# Enter code here:

## Visualize the distributions of each of the N data arrays as a histogram!
#### Hint: the answer to this question can be a picture!
#### Hint Hint: sometimes functions run faster if you transform a matrix into a vector first.
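
One way to read the flattening hint, sketched on synthetic data: `toy_matrix` below is a random stand-in, not one of the real arrays. When matplotlib's `hist` receives a 2D array it treats every column as its own dataset, so flattening first gives a single distribution.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(50, 200))  # synthetic stand-in, not the real data

# ravel() flattens the 2D matrix into a 1D vector, so hist draws one
# distribution instead of one histogram per column.
flat = toy_matrix.ravel()

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(flat, bins=30)
ax.set_xlabel("value")
ax.set_ylabel("count")
```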

In [None]:
# Enter code here:

## Plot the N data arrays.
#### Hint: sometimes the most expeditious way to visualize data is to treat it as an image!
#### Hint Hint: one visualization might give you something you don't expect, but is the problem the data?

In [None]:
# Enter code here:

## What do the values inside the arrays represent?

In [None]:
# Enter answer here (code can be used, but not required):

### Great job exploring the data! Now let's dive into what we can do with it!

# Dimensionality Reduction

## Principal Components Analysis (PCA)
#### We're going to dive into how PCA works, but first we're going to see what can be done with it! All you need to know for now is that PCA creates a lower dimensional representation of your data that preserves as much of the data's variance as possible.

#### By now you know you have a neural data array that is number_of_neurons by time; let's say its dimensionality is NxT. What we are going to do with PCA is compress every time step: each time step starts as an N dimensional vector (one value per neuron) and gets compressed into an S dimensional vector, where S < N. The result is an array that is SxT. Let's explore this with S = 3.
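
A minimal sketch of the compression on synthetic data, assuming toy sizes N = 20, T = 500, and S = 3 (the real array has its own sizes). scikit-learn's `PCA` expects samples in rows and features in columns, so the toy array is transposed before fitting.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_neurons, n_timesteps, n_components = 20, 500, 3  # toy N, T, and S
toy_neural = rng.normal(size=(n_neurons, n_timesteps))  # N x T, synthetic

# scikit-learn expects rows to be samples (time steps) and columns to be
# features (neurons), so the N x T array is transposed before fitting.
toy_pca = PCA(n_components=n_components)
toy_reduced = toy_pca.fit_transform(toy_neural.T)

print(toy_neural.T.shape)  # (500, 20)
print(toy_reduced.shape)   # (500, 3)
```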

In [None]:
# make a PCA model with S = 3
pca_model_s_3 = PCA(n_components=3)

# STOP & Check Yourself: Do you know why we can just call "PCA"?

# fit the PCA model instance we created to our neural data and transform the data
neural_pca_s_3 = pca_model_s_3.fit_transform(neural_data.T)

## What is the dimensionality of the PCAed neural data? What do these dimensions mean?

In [None]:
# Enter code here:

## Plot the Principal Components (aka PCs) in 3D!

In [None]:
%matplotlib notebook
fig = plt.figure()
ax = plt.axes(projection='3d')

# Enter code here:
pc1 = ...  # replace ... with your code
pc2 = ...  # replace ... with your code
pc3 = ...  # replace ... with your code

ax.scatter3D(pc1, pc2, pc3);

## Nice job! Now rotate your representation! What interesting things do you notice about your dimensionality reduced data?
#### Hint: why does this data look connected?
#### Hint Hint: Why are the axes so different from each other? What do they represent?

In [1]:
# Enter answer here:

In [2]:
# do this to switch out of movable 3d plotting (i.e. when you have 2d plots next)
%matplotlib inline 

## We're going to return to this visualization, but first you have to be thinking to yourself: we got rid of A LOT of dimensions, so how do we know this representation is still good? Great question! You tell me 👇🏻👇🏼👇🏽👇🏾👇🏿

## How much variance is explained by each of these top 3 principal components? What does this tell you about the data?
#### Hint: what is the first hint I gave you?
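
For reference, a hedged sketch on synthetic data: a fitted scikit-learn `PCA` object exposes the fraction of variance captured by each kept component as `explained_variance_ratio_`. The array `toy_data` and its per-feature spreads are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 300 samples, 5 features with very different spreads.
toy_data = rng.normal(size=(300, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

toy_pca = PCA(n_components=3).fit(toy_data)

# Fraction of the total variance captured by each kept component.
ratios = toy_pca.explained_variance_ratio_
print(ratios)
print(ratios.sum())  # total fraction of variance kept by the 3 components
```

The ratios come back sorted from the largest component to the smallest, and their sum says how much of the original variance survives the reduction.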

In [None]:
# Enter code here:

## Now that you know how much variance is explained by each of the top 3 PCs, let's explore the representation we built further!

#### Let's start by coloring each time point as a function of when it appears in the time series.
#### Hint: you're not changing the plot you're changing the color!
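
A small sketch of coloring points by their position in time, using a synthetic 2D trajectory in place of the principal components. The names `toy_x` and `toy_y` are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

n_timesteps = 200
t = np.arange(n_timesteps)  # one integer per time step
# Synthetic 2D trajectory standing in for two principal components.
toy_x = np.cos(t / 20.0)
toy_y = np.sin(t / 20.0)

fig, ax = plt.subplots()
# c= assigns one value per point; cmap maps those values to colors.
points = ax.scatter(toy_x, toy_y, c=t, cmap="viridis")
fig.colorbar(points, ax=ax, label="time step")
```

The same `c=` argument is accepted by `scatter3D`, so the idea should carry over to the 3D plot.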

In [None]:
# Enter code here:

## What does this tell you about the representation?

## Now make three separate plots colored by the attack variable! 😡🐭❓

### For plot 1: Plot only the attack data points
### For plot 2: Plot only the other data points
### For plot 3: Plot the attack data points on top of the other data points

#### Hint: it may be easier to separate your data by labels first!
#### Hint Hint: for plot 3 play with opacity (goes by a different name though!), and zorder.
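
A hedged sketch of layering two groups of points, on synthetic data with a hypothetical boolean label (`toy_labels` is random, not the attack variable). In matplotlib, opacity is controlled by the `alpha` argument and drawing order by `zorder`.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
toy_points = rng.normal(size=(300, 2))   # synthetic 2D points
toy_labels = rng.random(300) < 0.2       # hypothetical boolean label

# Boolean masks split the points into the two groups.
labeled = toy_points[toy_labels]
other = toy_points[~toy_labels]

fig, ax = plt.subplots()
# Low alpha and low zorder push the background group behind the labeled one.
bg = ax.scatter(other[:, 0], other[:, 1], color="gray", alpha=0.2, zorder=1, label="other")
fg = ax.scatter(labeled[:, 0], labeled[:, 1], color="red", alpha=0.9, zorder=2, label="labeled")
ax.legend()
```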

In [None]:
# Enter Code Here:

## Now build the same plot but color based on the mouse sex variable! ❤️🐭❓

In [None]:
# Enter Code Here:

## Great! Now that you know more about the data and PCA, I want you to repeat everything you just did, this time reducing the data to 2 PCs! More explicitly:

#### ✦ Train a model on the neural data with 2 PCs.
#### ✦ How much explained variance do these 2 PCs capture? Do you notice anything interesting about these 2 PCs? 😉
#### ✦ How is time visualized in these 2 PCs?
#### ✦ How is 🐭 😡 visualized in these 2 PCs?
#### ✦ How is 🐭 ❤️ visualized in these 2 PCs?

### Finally, if you needed to build a model to classify time, attack, or the visitor's sex, how many PCs would you use? Do you lose anything between 3 PCs and 2 PCs?

In [None]:
# Enter Code Here:

# Implement PCA yourself!!

## First, mean center your data!
##### We didn't have to do this before because the PCA function we called automatically did it for us 😱

In [None]:
# Enter code here:

#### Now verify that the PCA function returns the same thing in 2D for the data that is not mean centered and for the mean centered data. This proves that the function we called does the mean centering for us automatically. Color using the time steps!
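
A numeric version of that check, sketched on synthetic data with a deliberately large mean. `toy_raw` is invented, with samples in rows; scikit-learn's `PCA` subtracts the per-feature mean before decomposing, so the two outputs are expected to agree.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with a large non-zero mean (300 samples, 6 features).
toy_raw = rng.normal(size=(300, 6)) + 50.0
toy_centered = toy_raw - toy_raw.mean(axis=0)  # subtract each feature's mean

out_raw = PCA(n_components=2).fit_transform(toy_raw)
out_centered = PCA(n_components=2).fit_transform(toy_centered)

# PCA subtracts the per-feature mean internally, so both outputs should agree.
print(np.allclose(out_raw, out_centered))
```

Note the `axis=0`: with samples in rows, the mean is taken per column. For an array stored as neurons by time, the axis would be the time axis instead.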

In [None]:
# Enter Code Here:

## Now that we have mean centered data we can transform our data via two paths:
### (1) By stepping through the linear algebra (an eigendecomposition) ourselves.
### (2) By using singular value decomposition (a.k.a. SVD).

#### Let's start with path (1):
#### First calculate the covariance matrix of your mean centered data.
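
As a sketch under the assumption that samples are rows and features are columns, the covariance matrix can be written out from its definition and compared against `np.cov`. `toy_x` is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(300, 4))          # 300 samples (rows), 4 features
toy_xc = toy_x - toy_x.mean(axis=0)        # mean center each feature

# Covariance from its definition: (Xc^T Xc) / (n_samples - 1).
cov_manual = toy_xc.T @ toy_xc / (toy_xc.shape[0] - 1)

# np.cov treats rows as variables by default; rowvar=False treats columns as variables.
cov_numpy = np.cov(toy_xc, rowvar=False)

print(cov_manual.shape)  # (4, 4)
print(np.allclose(cov_manual, cov_numpy))  # True
```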

In [None]:
# Enter code here:

#### Compute the eigenvalues and eigenvectors of the covariance matrix.
##### Hint: Make sure you've sorted the eigenvalues and eigenvectors to be in either ascending or descending order!
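
One possible way to do the decomposition and the sorting, sketched on a synthetic covariance matrix. `np.linalg.eigh` is chosen here because a covariance matrix is symmetric; `np.linalg.eig` would also work but does not guarantee any ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
toy_cov = np.cov(toy_x, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending
# order, with the matching eigenvectors as the COLUMNS of the second output.
eigvals, eigvecs = np.linalg.eigh(toy_cov)

# Reorder to descending, applying the same order to values and vectors.
order = np.argsort(eigvals)[::-1]
eigvals_desc = eigvals[order]
eigvecs_desc = eigvecs[:, order]
```

Reordering the columns of the eigenvector matrix with the same index array keeps every eigenvector paired with its eigenvalue.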

In [None]:
# Enter code here:

#### Now project your mean centered data into a reduced space using the 2 eigenvectors with the largest eigenvalues.
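
The projection itself is a single matrix product. A minimal sketch on synthetic data, again assuming samples in rows:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
toy_xc = toy_x - toy_x.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(toy_xc, rowvar=False))
order = np.argsort(eigvals)[::-1]

# Keep the eigenvectors belonging to the 2 largest eigenvalues (4 x 2 matrix).
top2 = eigvecs[:, order[:2]]

# Project: each 4-dimensional sample becomes a 2-dimensional point.
toy_projected = toy_xc @ top2
print(toy_projected.shape)  # (300, 2)
```

A useful sanity check is that the variance of the first projected column equals the largest eigenvalue.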

In [None]:
# Enter code here:

#### Plot your transformed data using time as your color!

In [None]:
# Enter code here:

#### Ok but why is the representation flipped?!
#### PCA is sign invariant, meaning that we can multiply the axes by -1 and the interpretation of the dimensionality reduced space stays the same. 
#### Now that we know this is true, change your plot to look like the plot when we use the PCA library directly.
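
A hedged sketch of aligning signs on synthetic data: each manually computed axis is flipped whenever it points opposite to the corresponding library axis. This is one possible approach; simply multiplying the flipped axis by -1 by eye works just as well.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
toy_xc = toy_x - toy_x.mean(axis=0)

# Manual PCA via the covariance eigenvectors.
eigvals, eigvecs = np.linalg.eigh(np.cov(toy_xc, rowvar=False))
manual = toy_xc @ eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Library PCA on the same data.
library = PCA(n_components=2).fit_transform(toy_x)

# For each component, flip the manual axis if it points opposite to the library axis.
signs = np.sign(np.sum(manual * library, axis=0))
manual_aligned = manual * signs

print(np.allclose(manual_aligned, library))
```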

In [None]:
# Enter Code Here:

## Great Job!! Now you've calculated PCA all by yourself using matrix operations 🤩🤩🤩 Let's move on to implementing PCA using SVD (singular value decomposition).

#### Singular value decomposition is a method that decomposes a matrix into three matrices: U, S, and Vt. The columns of U are the left singular vectors. S holds the singular values. V is a matrix whose columns are the right singular vectors, and Vt is the transpose of V. Our input data (call it X) equals the product of the three ➡️ X = U\*S\*Vt.
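
The factorization can be checked numerically on a small synthetic matrix. Note that `np.linalg.svd` returns the singular values as a 1D array, so they are placed on a diagonal before multiplying.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(6, 4))  # small synthetic matrix

# np.linalg.svd returns the singular values as a 1D array, not a matrix.
U, S, Vt = np.linalg.svd(toy_x, full_matrices=False)

# Rebuild X = U * diag(S) * Vt.
toy_rebuilt = U @ np.diag(S) @ Vt
print(np.allclose(toy_x, toy_rebuilt))  # True
```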

#### We're not going to implement svd ourselves. Please run np.linalg's svd on our mean centered data.
##### Hint: have we already loaded svd?
##### Hint Hint: run with full_matrices = False otherwise it might take you a while!
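
To see why `full_matrices` matters, here is a sketch on a tall synthetic matrix (1000 rows, 5 columns; the sizes are arbitrary). With `full_matrices=True` the U matrix is square in the number of rows, which gets expensive when there are many rows.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(1000, 5))  # tall synthetic matrix: 1000 rows, 5 columns

U_full, S_full, Vt_full = np.linalg.svd(toy_x, full_matrices=True)
U_thin, S_thin, Vt_thin = np.linalg.svd(toy_x, full_matrices=False)

print(U_full.shape, S_full.shape, Vt_full.shape)  # (1000, 1000) (5,) (5, 5)
print(U_thin.shape, S_thin.shape, Vt_thin.shape)  # (1000, 5) (5,) (5, 5)
```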

In [2]:
# Enter code here:

#### What are the dimensions of U, S, and Vt? 

In [None]:
# Enter code here:

#### Because we calculated U, S, and Vt from the very data we want to transform, we can directly multiply our left singular vectors and singular values together to get our transformed data.

#### Hint: the singular values need to be converted into a diagonal matrix to make the matrix multiplication easier.
#### Hint Hint: We only want to transform our data to a reduced dimension of 2!
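
A minimal sketch of this shortcut on synthetic data, assuming samples in rows. Since Xc = U\*S\*Vt and the columns of V are orthonormal, projecting Xc onto the first two right singular vectors gives the same result as scaling the first two left singular vectors by their singular values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
toy_xc = toy_x - toy_x.mean(axis=0)

U, S, Vt = np.linalg.svd(toy_xc, full_matrices=False)

# Keep the first 2 left singular vectors and scale them by the first 2 singular values.
svd_2d = U[:, :2] @ np.diag(S[:2])

# Same projection written with the right singular vectors: Xc @ V[:, :2].
proj_2d = toy_xc @ Vt[:2].T
print(np.allclose(svd_2d, proj_2d))  # True

# Compare with the library output, ignoring the sign of each axis.
library = PCA(n_components=2).fit_transform(toy_x)
print(np.allclose(np.abs(svd_2d), np.abs(library)))
```

The comparison with the library uses absolute values because each axis is only defined up to a sign.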

In [None]:
# Enter Code Here:

#### Plot your transformed data to match our PCA library plot.

In [None]:
# Enter code here:

# To dive deeper into the math behind PCA & SVD stay tuned for day 3!!