# Tutorial 3: Dimensionality Reduction and Clustering
## Adapted from Neuromatch Academy: Week 1, Day 5, Tutorial 3 and BIPN162

The examples here are largely derived from Pascal Wallisch's *Neural Data Science* (Chapter 8) and Jake VanderPlas's *Python Data Science Handbook*.
Content creators: Ashley Juavinett, Gal Mishne

# Tutorial Objectives
**Dimensionality reduction** is an incredibly useful tool in data science as well as neuroscience. Here, we'll explore one main method of dimensionality reduction: **Principal Components Analysis**. We'll first perform PCA step-by-step on a simulated dataset, and then learn the very simple commands to do this in Python. Next, we'll use *k*-means clustering to find clusters in the two-dimensional PCA data. 

In this notebook we'll learn to apply PCA for dimensionality reduction.

Overview:

- Perform PCA on a simulated neuroscience dataset
- Calculate the variance explained
- Use *k*-means clustering to find clusters in your data
- Plot the results of your dimensionality reduction and clustering, coloring each point by either a known feature or by a cluster assignment

# Setup
Run these cells to get the tutorial started.

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
%matplotlib inline
import seaborn as sns # This is another plotting package, built really nicely for plotting these types of analyses!

In [None]:
#@title Helper Functions

def plot_covariance(covar):
    plt.imshow(covar)
    plt.colorbar()
    plt.title('Covariance Matrix (\u03A3)')
    plt.show()

    
def plot_variance_explained(variance_explained):
  """
  Plots the cumulative variance explained, with a dashed line at 90%.

  Args:
    variance_explained (numpy array of floats) : Vector of variance explained
                                                 for each PC

  Returns:
    Nothing.

  """

  plt.figure()
  plt.plot(np.arange(1, len(variance_explained) + 1), variance_explained,
           'k')
  plt.xlabel('Number of components')
  plt.ylabel('Variance explained')
  plt.axhline(.9,c='r',ls='--')
  plt.show()


def change_of_basis(X, W):
  """
  Projects data onto a new basis.

  Args:
    X (numpy array of floats) : Data matrix each column corresponding to a
                                different random variable
    W (numpy array of floats) : new orthonormal basis columns correspond to
                                basis vectors

  Returns:
    (numpy array of floats)   : Data matrix expressed in new basis
  """

  Y = np.matmul(X, W)

  return Y


def get_sample_cov_matrix(X):
  """
  Returns the sample covariance matrix of data X.

  Args:
    X (numpy array of floats) : Data matrix each column corresponds to a
                                different random variable

  Returns:
    (numpy array of floats)   : Covariance matrix
"""

  X = X - np.mean(X, 0)
  cov_matrix = 1 / X.shape[0] * np.matmul(X.T, X)
  return cov_matrix


def sort_evals_descending(evals, evectors):
  """
  Sorts eigenvalues and eigenvectors in decreasing order. Also aligns first two
  eigenvectors to be in first two quadrants (if 2D).

  Args:
    evals (numpy array of floats)    :   Vector of eigenvalues
    evectors (numpy array of floats) :   Corresponding matrix of eigenvectors
                                         each column corresponds to a different
                                         eigenvalue

  Returns:
    (numpy array of floats)          : Vector of eigenvalues after sorting
    (numpy array of floats)          : Matrix of eigenvectors after sorting
  """

  index = np.flip(np.argsort(evals))
  evals = evals[index]
  evectors = evectors[:, index]
  if evals.shape[0] == 2:
    if np.arccos(np.matmul(evectors[:, 0],
                           1 / np.sqrt(2) * np.array([1, 1]))) > np.pi / 2:
      evectors[:, 0] = -evectors[:, 0]
    if np.arccos(np.matmul(evectors[:, 1],
                           1 / np.sqrt(2)*np.array([-1, 1]))) > np.pi / 2:
      evectors[:, 1] = -evectors[:, 1]

  return evals, evectors


def pca(X):
  """
  Performs PCA on multivariate data. Eigenvalues are sorted in decreasing order

  Args:
     X (numpy array of floats) :   Data matrix each column corresponds to a
                                   different random variable

  Returns:
    (numpy array of floats)    : Data projected onto the new basis
    (numpy array of floats)    : Matrix of eigenvectors (one per column)
    (numpy array of floats)    : Vector of eigenvalues, in decreasing order

  """

  X = X - np.mean(X, 0)
  cov_matrix = get_sample_cov_matrix(X)
  evals, evectors = np.linalg.eigh(cov_matrix)
  evals, evectors = sort_evals_descending(evals, evectors)
  score = change_of_basis(X, evectors)

  return score, evectors, evals


def plot_eigenvalues(evals, limit=True):
  """
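  Plots eigenvalues (scree plot).

  Args:
     evals (numpy array of floats) : Vector of eigenvalues
     limit (bool)                  : If True, display the figure immediately
                                     with plt.show(); if False, leave it open
                                     so the caller can adjust it (e.g. plt.xlim)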

  Returns:
    Nothing.

  """

  plt.figure()
  plt.plot(np.arange(1, len(evals) + 1), evals, 'o-k')
  plt.xlabel('Component')
  plt.ylabel('Eigenvalue')
  plt.title('Scree plot')
  if limit:
    plt.show()
    
    
def plot_eigenvectors(eigenvectors, feature_names):
    plt.figure(figsize=(10,5))
    plt.imshow(eigenvectors,cmap='viridis',)
    plt.yticks([0,1,2],['PC1','PC2','PC3'],fontsize=10)
    plt.colorbar(orientation='horizontal')
    plt.tight_layout()
    plt.xticks(range(len(feature_names)),feature_names,rotation=65)
    plt.show()


def plot_pca_labels(score,labels):
    plt.figure()
    plt.scatter(score[:, 0], score[:, 1],c=labels,cmap='jet')
    plt.xlabel('PCA 1')
    plt.ylabel('PCA 2')
    plt.axis('equal')
    plt.colorbar()
    plt.show()
    
def plot_pca_labels_with_centroids(score,labels,centers):
    plt.figure()
    plt.scatter(score[:, 0], score[:, 1],c=labels,cmap='jet')
    plt.xlabel('PCA 1')
    plt.ylabel('PCA 2')
    plt.axis('equal')
    plt.colorbar()
    plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
    plt.show()

# Part 1: Load a dataset of 100 neurons with eight different features

Here, we will load a dataset with eight different features: response latency, somatic volume, cortical depth, maximum firing rate, spontaneous firing rate, spike width, axon length, and dendritic arborization area.

In [None]:
dataset = pd.read_csv("simulated_cells2.csv")

Run the next cell to check the size of the dataset and see its first few rows.

In [None]:
nrows,ncolumns = dataset.shape
dataset.head()

Let's take a look at these eight different factors, and how they compare. We will use *Seaborn*, a Python data visualization library based on matplotlib. It provides a high-level interface for drawing informative statistical graphics. We will use *pairplot* to plot pairwise relationships in a dataset. The diagonal plots show the univariate distribution of the data for the variable in that column. The off-diagonal plots are pairwise scatter plots of the features in the data. 



In [None]:
sns.pairplot(dataset, height=1.5)
plt.show()

Clearly, there's a lot going on here. We'll use dimensionality reduction to see whether this dataset can be summarized by a couple of underlying factors, and to highlight which features succinctly describe the data.

For now, we'll drop "transmission" from our dataframe -- it is a categorical label (excitatory or inhibitory), so we don't want to run PCA on it. Then, we'll normalize each column of data. Normalization isn't strictly *necessary*, but it is really useful for a dataset like this one, where different variables have wildly different ranges (e.g., 0.5 to 300).

To normalize the data, subtract the mean from each column and divide by the standard deviation.
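
As a minimal sketch of this kind of normalization (z-scoring), here is the idea on a small made-up table. The column names and values are hypothetical and are not taken from `simulated_cells2.csv`.

```python
import pandas as pd

# Toy data (made-up values, hypothetical column names)
toy = pd.DataFrame({
    "spike_width": [0.5, 0.7, 0.9, 1.1],
    "max_rate": [100.0, 200.0, 250.0, 300.0],
})

# z-score each column: subtract the column mean, divide by the column
# standard deviation. Note that pandas .std() uses ddof=1 by default.
toy_z = (toy - toy.mean()) / toy.std()

print(toy_z.mean())  # each column mean is (numerically) zero
print(toy_z.std())   # each column standard deviation is one
```

After z-scoring, both columns live on the same scale, so neither dominates the covariance simply because of its units.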

In [None]:
x_data = dataset.drop('transmission',axis=1)

#################################################
## TO DO for students: normalize the columns of the data
# Comment out the raise below once you've filled in x_data
raise NotImplementedError("Student exercise: normalize data!")
#################################################

x_data = ...

# Uncomment to see the first few rows of the dataset
# x_data.head()

# Part 2: Perform PCA

### 1. Compute the covariance matrix.

Here, we'll calculate the **covariance** between the columns of our dataset using our `get_sample_cov_matrix` function. Since we have normalized our columns, this is a correlation matrix (values are limited between -1 and 1). A positive correlation means two columns vary together, whereas a negative correlation means they vary in opposite directions.
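
To see why the covariance of normalized data is a correlation matrix, here is a small sketch on random toy data (not the tutorial dataset). It assumes the columns are standardized with the population standard deviation (`ddof=0`), which matches the `1/N` normalization used in `get_sample_cov_matrix`. If you standardize with pandas' default `ddof=1` instead, the diagonal comes out as (N-1)/N rather than exactly 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 3 features; feature 1 is built to correlate with feature 0
A = rng.normal(size=(200, 3))
A[:, 1] = 0.8 * A[:, 0] + 0.2 * A[:, 1]

# Standardize each column (population standard deviation, ddof=0)
Z = (A - A.mean(axis=0)) / A.std(axis=0)

# Covariance with 1/N normalization, as in get_sample_cov_matrix
cov_z = Z.T @ Z / Z.shape[0]

# Compare against the Pearson correlation matrix of the original data
corr = np.corrcoef(A, rowvar=False)
print(np.allclose(cov_z, corr))
```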

In [None]:
# convert pandas dataframe to numpy array so we can use functions we developed in previous tutorials 
X = x_data.to_numpy()

#################################################
## TO DO for students: calculate the sample covariance of the dataset
# Comment out the raise below once you've filled in cov_mat
raise NotImplementedError("Student exercise: calculate covariance!")
#################################################

cov_mat = ...

# Uncomment to plot the covariance matrix
# plot_covariance(cov_mat)

### 2. Scree plot of PCA

PCA will extract as many factors as there are variables, but we'd like to know how many of these factors are useful in explaining our data. 

There are a few different ways to do this (e.g., a Kaiser criterion), but here we'll use a "[Scree plot](https://en.wikipedia.org/wiki/Scree_plot)", and look for an "elbow" (very scientific, I know). Factors to the left of the elbow are considered meaningful, whereas factors to the right are considered noise.

**Steps:**
- Perform PCA on the dataset and examine the scree plot. 
- When do the eigenvalues appear (by eye) to reach zero? (**Hint:** use `plt.xlim` to zoom into a section of the plot).
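
For intuition about where the eigenvalues in a scree plot come from, here is a rough sketch of the same steps the `pca` helper takes, run on random toy data. All `toy_` names are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 500 samples, 4 features, with most variance along the first two axes
toy_X = rng.normal(size=(500, 4)) * np.array([3.0, 2.0, 0.5, 0.1])

# Center the data, then form the covariance matrix (1/N normalization)
Xc = toy_X - toy_X.mean(axis=0)
cov = Xc.T @ Xc / Xc.shape[0]

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
toy_evals, toy_evectors = np.linalg.eigh(cov)

# Reorder so that the largest eigenvalue (most variance) comes first
order = np.argsort(toy_evals)[::-1]
toy_evals = toy_evals[order]
toy_evectors = toy_evectors[:, order]

# Project onto the eigenvectors; the variance of each score column
# equals the corresponding eigenvalue
toy_score = Xc @ toy_evectors
print(toy_evals)
print(np.allclose(toy_score.var(axis=0), toy_evals))
```

Plotting `toy_evals` against component number would give the scree plot for this toy dataset, with a visible drop after the second component.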

In [None]:
help(pca)
help(plot_eigenvalues)

In [None]:
#################################################
## TO DO for students: perform PCA and plot the eigenvalues
# Comment out the raise below once you've filled in the call to pca
raise NotImplementedError("Student exercise: apply PCA to data!")
#################################################

# perform PCA
score, evectors, evals = ...
# Uncomment to plot the eigenvalues
# plot_eigenvalues(evals, limit=False)

# Section 2: Calculate the variance explained

The scree plot gives a visual impression of how many eigenvalues are large. Another common way to determine the intrinsic dimensionality is by considering the variance explained. This can be examined with a cumulative plot of the fraction of the total variance explained by the top $K$ components, where $\lambda_i$ is the $i$-th largest eigenvalue and $N$ is the total number of components, i.e.,

\begin{equation}
\text{var explained} = \frac{\sum_{i=1}^K \lambda_i}{\sum_{i=1}^N \lambda_i}
\end{equation}

The intrinsic dimensionality is often quantified by the $K$ necessary to explain a large proportion of the total variance of the data (often a defined threshold, e.g., 90%).
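
As a small worked example of the formula above, assume four made-up eigenvalues that are already sorted in decreasing order. The sketch computes the cumulative fraction of variance and the smallest $K$ that reaches a 90% threshold.

```python
import numpy as np

# Made-up eigenvalues, already sorted in decreasing order
toy_evals = np.array([5.0, 3.0, 1.5, 0.5])

# Cumulative fraction of the total variance explained by the top K components
toy_var_explained = np.cumsum(toy_evals) / np.sum(toy_evals)
print(toy_var_explained)  # fractions 0.5, 0.8, 0.95, 1.0

# Smallest K whose cumulative fraction reaches the 90% threshold
K = int(np.argmax(toy_var_explained >= 0.9)) + 1
print(K)  # 3
```

Here three of the four components are needed, because the first two together explain only 80% of the variance.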

## Exercise 2: Plot the explained variance

In this exercise you will plot the explained variance. 

**Steps:**
- Fill in the function below to calculate the fraction of variance explained as a function of the number of principal components. **Hint:** use `np.cumsum`.
- Plot the variance explained using `plot_variance_explained`.

**Questions:**
- How many principal components are required to explain 90% of the variance?
- How does the intrinsic dimensionality of this dataset compare to its extrinsic dimensionality?

In [None]:
help(plot_variance_explained)

In [None]:
def get_variance_explained(evals):
  """
  Calculates variance explained from the eigenvalues.

  Args:
    evals (numpy array of floats) : Vector of eigenvalues

  Returns:
    (numpy array of floats)       : Vector of variance explained

  """

  #################################################
  ## TO DO for students: calculate the explained variance using the equation
  ## from Section 2.
  # Comment out the raise below once you've filled in the function
  raise NotImplementedError("Student exercise: calculate explained variance!")
  #################################################

  # cumulatively sum the eigenvalues
  csum = ...
  # normalize by the sum of eigenvalues
  variance_explained = ...

  return variance_explained


#################################################
## TO DO for students: call the function and plot the variance explained
#################################################

# calculate the variance explained
variance_explained = get_variance_explained(evals)

# Uncomment to plot the variance explained and 90% of the variance explained
# plot_variance_explained(variance_explained)


As you can see, we would need 5-6 principal components to account for more than 90% of the variance in the dataset, which is a lot for a dataset with only eight features.

In [None]:
# Another thing we can do is look at the variance explained by each PC individually.
## TO DO for students: calculate the fraction of variance explained by each principal component (is there more than one way you can do this?)
#################################################

ex_variance_ratio = ...

# Uncomment to print variance explained of each component 
# print(ex_variance_ratio)

### 3. Interpreting the meaning of factors.
This part is really tricky for neural data -- what counts as a *meaningful* factor for brain activity? How easy it is depends on your data and how much you already know about it.

For spike sorting, these factors are typically something obvious like spike amplitude or spike width, but for behavioral measures or population dynamics, the meaning of factors could be less obvious. Below, we can see how much each feature contributes to the first PCs:

In [None]:
plot_eigenvectors(evectors[:,:3].T, x_data.columns)

### 4. Determining the factor values of the original variables

Here, we'll plot the first two dimensions of our PCA.

In [None]:
excitatory = dataset.transmission=='excitatory'
plot_pca_labels(score,excitatory)

Remember the PCA didn't have any information on whether our cells were excitatory or inhibitory, but clearly it's picking up on features that also divide those cell types. And, it looks like excitatory and inhibitory cells actually might fall into three classes. Let's see what happens if we use *k*-means clustering on our data.

In [None]:
from sklearn.cluster import KMeans #Import the KMeans model

#################################################
## TO DO for students: set up a kmeans model with 3 clusters
# Comment out the raise below once you've filled in n_clust
raise NotImplementedError("Student exercise: apply kmeans!")
#################################################
n_clust = ...
kmeans = KMeans(n_clusters=n_clust) 

kmeans.fit(score[:,:2]) # Fit our two dimensional data
labels_kmeans = kmeans.predict(score[:,:2]) 

# uncomment to plot the first 2PCs with points colored by kmeans labels and the corresponding centroids
# plot_pca_labels_with_centroids(score,labels_kmeans,kmeans.cluster_centers_)

If we tell *k*-means to use three clusters, it divides up the data into three clusters. What happens if we tell it there are two clusters?
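
Before trying this on the PCA scores, here is a hedged sketch of the same comparison on toy data: three made-up blobs in two dimensions, clustered with two and then three clusters. The blob centers, `n_init`, and `random_state` are choices made for this illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Toy 2D data: three well-separated blobs of 50 points each (made-up centers)
true_centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
toy_points = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centers]
)

inertia = {}
labels_by_k = {}
for k in (2, 3):
    # n_init and random_state are set explicitly so the run is reproducible
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_points)
    # inertia_ is the sum of squared distances from each point to its centroid
    inertia[k] = km.inertia_
    labels_by_k[k] = km.labels_
    print(k, km.cluster_centers_.shape, round(km.inertia_, 1))
```

With two clusters, *k*-means has to merge two of the blobs, so the within-cluster distances (inertia) are much larger than with three. Keep in mind that *k*-means always returns exactly the number of clusters you ask for, whether or not the data supports it.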

In [None]:
from sklearn.cluster import KMeans #Import the KMeans model

# Set up a kmeans model with 2 clusters
n_clust = ...
kmeans = KMeans(n_clusters=n_clust) 

kmeans.fit(score[:,:2]) # Fit our two dimensional data
labels_kmeans_2 = kmeans.predict(score[:,:2]) 
centers = kmeans.cluster_centers_

# uncomment to plot the first 2PCs with points colored by kmeans labels and the corresponding centroids
# plot_pca_labels_with_centroids(score,labels_kmeans_2,centers)

### Biplot

A loading plot shows how strongly each feature influences a principal component.
A biplot overlays this on top of the plotted principal components.

- Plot a biplot to figure out the contributions of each feature to the principal components.
- Based on the biplot, plot the first two PCs colored by meaningful features.
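
One way to read a biplot numerically is to look at the length of each feature's arrow in the PC1-PC2 plane: longer arrows mean the feature contributes more to the first two components. Here is a minimal sketch with hypothetical feature names and made-up loadings (these are not the loadings of the tutorial dataset).

```python
import numpy as np

# Hypothetical feature names and made-up loadings on the first two PCs
# (one row per feature; columns are PC1 and PC2)
toy_features = ["latency", "volume", "depth", "spike_width"]
toy_coeff = np.array([
    [0.70, 0.10],
    [0.05, 0.65],
    [0.10, 0.05],
    [-0.60, 0.30],
])

# Length of each feature's arrow in the PC1-PC2 plane
arrow_len = np.hypot(toy_coeff[:, 0], toy_coeff[:, 1])

# Rank features from the longest arrow to the shortest
ranking = [toy_features[i] for i in np.argsort(arrow_len)[::-1]]
print(ranking)
```

The sign of a loading still matters for interpretation: in this toy example, `latency` and `spike_width` both load strongly on PC1 but point in opposite directions.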

In [None]:
# The following function plots the first two principal components (scores) and the loading plot of the features
def biplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex,ys * scaley,s=5)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'green', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
 
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

In [None]:
biplot(score,evectors,list(x_data.columns))
plt.show()

Based on the biplot, choose three features that give a meaningful separation when used to color the cells.

In [None]:
# Fill in three features below, each as a column of values (for example, one column of x_data), then run the following three cells
feat1 = ...
feat2 = ...
feat3 = ...

In [None]:
plot_pca_labels(score,feat1)

In [None]:
plot_pca_labels(score,feat2)

In [None]:
plot_pca_labels(score,feat3)