# Neuromatch Academy: Week 1, Day 5, Tutorial 4
# Dimensionality Reduction: Nonlinear dimensionality reduction

__Content creators:__ Alex Cayco Gajic, John Murray

__Content reviewers:__ Roozbeh Farhoudi, Matt Krause, Spiros Chavlis, Richard Gao, Michael Waskom


---
# Tutorial Objectives

In this notebook we'll explore how dimensionality reduction can be useful for visualizing and inferring structure in your data. To do this, we will compare PCA with t-SNE, a nonlinear dimensionality reduction method.

Overview:
- Visualize MNIST in 2D using PCA.
- Visualize MNIST in 2D using t-SNE.

In [None]:
# @title Video 1: PCA Applications
from IPython.display import YouTubeVideo
video = YouTubeVideo(id="2Zb93aOWioM", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)
video

# Summary of Video 1:

- Nonlinear dimensionality reduction techniques.

The big picture of PCA is that it finds a low-dimensional basis that describes most of the variability in your data. We often call the subspace spanned by this basis the latent subspace, and we call the data after the transformation the latent representation.
Latent means hidden: these components are not directly observed in the data, but are inferred from its structure.
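
As a rough illustration of the latent representation described above, the sketch below runs PCA by hand on synthetic data. The data, the variable names, and the choice of `K = 1` are assumptions made for this example and are not part of the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples in 3D that mostly vary along one hidden direction
latent = rng.normal(size=(500, 1))            # the hidden (latent) 1D variable
mixing = np.array([[2.0, 1.0, 0.5]])          # how the latent variable appears in 3D
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 3))

# PCA via eigendecomposition of the sample covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
evals, evecs = np.linalg.eigh(cov)            # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]               # re-sort in descending order
evals, evecs = evals[order], evecs[:, order]

# Latent representation: scores on the top-K basis vectors
K = 1
scores = X_centered @ evecs[:, :K]
variance_explained = evals[:K].sum() / evals.sum()
print(f"Fraction of variance in the top {K} component(s): {variance_explained:.3f}")
```

Because the synthetic data were built from a single latent variable plus a little noise, one component should account for nearly all of the variance.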

PCA has a number of different applications:
1. PCA can be used for compression. This follows from an alternate formulation of PCA: it finds the low-dimensional basis that minimizes the reconstruction error. So instead of saving your full data matrix X, you can save the truncated scores matrix and the corresponding coefficients.

2. PCA can be useful for denoising. When you project onto the low-dimensional subspace, you're basically removing any noise that's orthogonal to that subspace.

3. PCA can be very useful for whitening, a procedure in which you decorrelate your data and standardize it, meaning that you rescale it so that the individual components have unit variance. PCA finds a transformation such that the new components are uncorrelated, and the variance of each component is given by the corresponding eigenvalue. So we can just divide each component by the square root of its eigenvalue to get new components that are uncorrelated and have unit variance.

These first three applications (compression, denoising, and whitening) are extremely useful in a very wide variety of settings. In fact, if you want to use a more complicated nonlinear dimensionality reduction technique, it's often useful to apply PCA as a pre-processing step for exactly these reasons.

4. PCA can be used for visualization, maybe one of its most common applications.
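
The first three applications in the list above can be sketched with plain NumPy. This is a minimal illustration on synthetic data, not the tutorial's own code: the data, the noise level, and names such as `X_hat` and `whitened` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a 2D latent signal embedded in 10D, plus small isotropic noise
n_samples, n_features, K = 1000, 10, 2
signal = rng.normal(size=(n_samples, K)) @ rng.normal(size=(K, n_features))
X_demo = signal + 0.05 * rng.normal(size=(n_samples, n_features))

mean = X_demo.mean(axis=0)
Xc = X_demo - mean
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
evals, evecs = evals[::-1], evecs[:, ::-1]     # sort in descending order

# Compression / denoising: keep only K scores per sample, then reconstruct
W = evecs[:, :K]                               # coefficients, shape (n_features, K)
scores = Xc @ W                                # truncated scores, shape (n_samples, K)
X_hat = scores @ W.T + mean                    # reconstruction from K numbers per sample
rel_error = np.sum((X_demo - X_hat) ** 2) / np.sum(Xc ** 2)
print(f"Relative reconstruction error with K={K}: {rel_error:.4f}")

# Whitening: divide each score by the square root of its eigenvalue
whitened = scores / np.sqrt(evals[:K])
whitened_cov = np.cov(whitened, rowvar=False)
print("Covariance of whitened scores:\n", np.round(whitened_cov, 3))
```

Storing `scores`, `W`, and `mean` takes far fewer numbers than storing the full data matrix, and the part of the noise that is orthogonal to the subspace is discarded in `X_hat`.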


If we have a sample that's a 0, then when you take the dot product with the first principal component you'll get a very high value for the corresponding score.

On the other hand, if you have a sample that's a 1, you'll get a very low value of the first principal component score.
So we already expect some kind of clustering of ones and zeros along the first principal component.

But this doesn't really say anything about what will happen for the digits between 2 and 9.

So you will use PCA to visualize the full MNIST dataset.
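
The sketch below checks that a score is the dot product of the mean-centered sample with the principal component. It uses scikit-learn's small built-in `load_digits` dataset (8x8 images) as a stand-in for MNIST so that nothing needs to be downloaded; that substitution and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small 8x8 digits dataset, used here only as a stand-in for MNIST
digits = load_digits()
X_small, y_small = digits.data, digits.target

pca = PCA(n_components=2).fit(X_small)
pc1 = pca.components_[0]                       # weights of the first principal component

# Score = dot product of the mean-centered sample with the component
manual_scores = (X_small - pca.mean_) @ pc1
sklearn_scores = pca.transform(X_small)[:, 0]
print("Manual and sklearn scores agree:", np.allclose(manual_scores, sklearn_scores))

# Average first-component score for the samples labelled 0 and 1
for digit in (0, 1):
    print(f"digit {digit}: mean PC1 score = {manual_scores[y_small == digit].mean():.2f}")
```

Note that the sign of a principal component is arbitrary, so which digit ends up with the high score and which with the low score can differ between datasets and implementations.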


---
# Setup
Run these cells to get the tutorial started.

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#@title Figure Settings
import ipywidgets as widgets       # interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle")

In [None]:
#@title Helper functions


def visualize_components(component1, component2, labels, show=True):
  """
  Plots a 2D representation of the data for visualization with categories
  labelled as different colors.

  Args:
    component1 (numpy array of floats) : Vector of component 1 scores
    component2 (numpy array of floats) : Vector of component 2 scores
    labels (numpy array of floats)     : Vector corresponding to categories of
                                         samples

  Returns:
    Nothing.

  """

  plt.figure()
  cmap = plt.get_cmap('tab10')  # plt.cm.get_cmap was removed in recent matplotlib
  plt.scatter(x=component1, y=component2, c=labels, cmap=cmap)
  plt.xlabel('Component 1')
  plt.ylabel('Component 2')
  plt.colorbar(ticks=range(10))
  plt.clim(-0.5, 9.5)
  if show:
    plt.show()

# Discussion of helper functions:

*visualize_components*:
Plots a 2D representation of the data for visualization with categories labelled as different colors with x-axis: component 1 and y-axis: component 2.

---
# Section 1: Visualize MNIST in 2D using PCA

In this exercise, we'll visualize the first few components of the MNIST dataset to look for evidence of structure in the data. In this tutorial, we will also be interested in the label of each image (i.e., which numeral it is from 0 to 9). Start by running the following cell to reload the MNIST dataset (this takes a few seconds).

In [None]:
from sklearn.datasets import fetch_openml
# as_frame=False returns NumPy arrays rather than a pandas DataFrame
mnist = fetch_openml(name='mnist_784', as_frame=False)
X = mnist.data
labels = np.array([int(k) for k in mnist.target])  # targets are strings

To perform PCA, we will now use the method implemented in sklearn. Run the following cell to set the parameters of PCA. We will only look at the top 2 components because we will be visualizing the data in 2D.

In [None]:
from sklearn.decomposition import PCA
pca_model = PCA(n_components=2) # Initializes PCA
pca_model.fit(X) # Performs PCA

## Exercise 1: Visualization of MNIST in 2D using PCA

Fill in the code below to perform PCA and visualize the top two components. For better visualization, take only the first 2,000 samples of the data (this will also make t-SNE much faster in the following section of the tutorial, so don't skip this step!).

**Suggestions:**
- Truncate the data matrix at 2,000 samples. You will also need to truncate the array of labels.
- Perform PCA on the truncated data.
- Use the function `visualize_components` to plot the labelled data.

In [None]:
help(visualize_components)
help(pca_model.transform)

In [None]:
#################################################
## TODO for students: take only 2,000 samples and perform PCA
#################################################

# Take only the first 2000 samples with the corresponding labels
# X, labels = ...
# Perform PCA
# scores = pca_model.transform(X)

# Plot the scores of the top two components, colored by label
# visualize_components(...)

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W1D5_DimensionalityReduction/solutions/W1D5_Tutorial4_Solution_e53bd4fb.py)

*Example output:*

<img alt='Solution hint' align='left' width=524 height=416 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W1D5_DimensionalityReduction/static/W1D5_Tutorial4_Solution_e53bd4fb_0.png>



## Think!
- What do you see? Are different samples corresponding to the same numeral clustered together? Is there much overlap?
- Do some pairs of numerals appear to be more distinguishable than others?

*Hint*:
1. Think in terms of structure.
2. Think in terms of distinguishability. It could help to start with a crude color-based analysis.

---
# Section 2: Visualize MNIST in 2D using t-SNE


In [None]:
# @title Video 2: Nonlinear Methods
video = YouTubeVideo(id="5Xpb0YaN5Ms", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)
video

# Summary of Video 2:

PCA is an example of linear dimensionality reduction. That means it finds a linear transformation to a low dimensional representation.
There are many such linear dimensionality reduction methods and they differ in terms of the assumptions that they make on the data.

1. Probabilistic PCA (PPCA) is similar to PCA, but it includes an explicit noise model. It assumes that the noise is Gaussian with the same variance in each direction.

2. Factor analysis is similar to PPCA, but it allows the noise variance to be different in different directions.
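
To make the difference in noise models concrete, here is a minimal sketch using scikit-learn's `PCA` and `FactorAnalysis` on synthetic data. The data, the loadings, and the noise levels are assumptions made for this example. In scikit-learn, `PCA.noise_variance_` is the single noise variance estimated under the probabilistic PCA model, while `FactorAnalysis.noise_variance_` holds one estimate per feature.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(2)

# Hypothetical data: one latent factor, with a different noise level on each feature
n_samples, n_features = 2000, 6
latent = rng.normal(size=(n_samples, 1))
loading = np.array([[1.0, -1.0, 0.8, 1.2, -0.9, 1.1]])
noise_std = np.linspace(0.1, 1.5, n_features)
X_demo = latent @ loading + noise_std * rng.normal(size=(n_samples, n_features))

ppca = PCA(n_components=1).fit(X_demo)
fa = FactorAnalysis(n_components=1, random_state=0).fit(X_demo)

print("PPCA noise variance (a single number):", np.round(ppca.noise_variance_, 2))
print("FA noise variances (one per feature): ", np.round(fa.noise_variance_, 2))
print("True noise variances:                 ", np.round(noise_std ** 2, 2))
```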

Different methods can also look for different features of the data.

1. Linear discriminant analysis (LDA) looks for a low-dimensional subspace that preserves class-discriminatory information.
This is useful when you have labeled data.

2. LDA is therefore a supervised dimensionality reduction method, in contrast to PCA, which is unsupervised.

For example, if you run PCA, you would find the direction that captures most of the variance in the full dataset, whereas if you run LDA, you would find the direction that captures information about the two stimuli.
So LDA can be very useful when looking for directions that represent information about different stimuli or different behaviors in your neural dataset.
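
The contrast between PCA and LDA can be sketched on synthetic data in which the direction of largest variance carries no information about the stimulus. The two-class data below and the names `stim`, `pca_dir`, and `lda_dir` are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)

# Hypothetical responses to two stimuli: large shared variability along x,
# but the stimuli only differ (slightly) along y
n_per_class = 500
stim = np.repeat([0, 1], n_per_class)
means = np.array([[0.0, -1.0], [0.0, 1.0]])
noise = rng.normal(size=(2 * n_per_class, 2)) * np.array([5.0, 0.3])
X_demo = means[stim] + noise

# PCA ignores the labels and finds the direction of largest variance
pca_dir = PCA(n_components=1).fit(X_demo).components_[0]

# LDA uses the labels and finds the direction that best separates the classes
lda = LinearDiscriminantAnalysis(n_components=1).fit(X_demo, stim)
lda_dir = lda.coef_[0] / np.linalg.norm(lda.coef_[0])

print("PCA direction:", np.round(pca_dir, 2))   # mostly along x
print("LDA direction:", np.round(lda_dir, 2))   # mostly along y
```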

These methods are related to another class of methods that are used to solve blind source separation problems. These are problems where your data are mixtures of different signals and you want to recover those signals by demixing your data.

For example, this can be useful in pre-processing neural data such as EEG.
The canonical example is independent component analysis (ICA), which finds components that are statistically independent. This is a stronger condition than being uncorrelated, which is what PCA finds.
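
A small blind source separation sketch using scikit-learn's `FastICA` is shown below. The two source signals, the mixing matrix, and the noise level are assumptions made for this example. ICA recovers sources only up to order, sign, and scale, so the comparison uses absolute correlations.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(4)

# Hypothetical sources: a sine wave and a square wave, with a little noise
t = np.linspace(0, 8, 2000)
sources = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
sources = sources + 0.05 * rng.normal(size=sources.shape)

# Each observed channel is a different linear mixture of the two sources
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])
mixed = sources @ mixing.T

# Demix the observations into statistically independent components
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mixed)

# Cross-correlation between true sources (rows) and recovered components (columns)
corr = np.corrcoef(sources.T, recovered.T)[:2, 2:]
best_match = np.abs(corr).max(axis=1)
print("Best |correlation| for each true source:", np.round(best_match, 3))
```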

Another example is non-negative matrix factorization (NMF), which is useful when you have non-negative data because it assumes that your weights and components are non-negative.
This can also make your results much more interpretable.
Geometrically speaking, it finds basis vectors that lie on the edges of a cone that contains all of your data points.

In contrast, if you run PCA, you'll find components that have both positive and negative values, which can be very difficult to interpret.
Again, the basis vectors found by NMF are not necessarily orthogonal.
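
The sign constraint can be checked directly with scikit-learn's `NMF`. In this sketch the non-negative data are synthetic and the settings (`init="nndsvda"`, `max_iter=1000`) are choices made for the example, not recommendations from the tutorial.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.default_rng(5)

# Hypothetical non-negative data built from 3 non-negative components
true_weights = rng.uniform(0, 1, size=(300, 3))
true_components = rng.uniform(0, 1, size=(3, 20))
X_demo = true_weights @ true_components

# NMF: both the weights and the components are constrained to be non-negative
nmf = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
weights = nmf.fit_transform(X_demo)
components = nmf.components_

# PCA on the same data: components mix positive and negative values
pca = PCA(n_components=3).fit(X_demo)

print("Smallest NMF weight:   ", weights.min())
print("Smallest NMF component:", components.min())
print("Smallest PCA component:", pca.components_.min())

# The NMF basis vectors are generally not orthogonal to one another
print("NMF component Gram matrix:\n", np.round(components @ components.T, 2))
```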

When are linear methods not enough?

Linear methods fall short when the low-dimensional structure in your data is curved. For example, if your data lie along an S-shaped curve, PCA will not find that curve because it's explicitly looking for a linear representation.
Instead, what you want to find is the S shape in which the data points are embedded.

In neuroscience we often call this a neural manifold. A neural manifold is a smooth low-dimensional structure in which your data points are embedded in your high-dimensional space.
This is where nonlinear methods kick in. They don't usually find the equations for the manifold, but they find a mapping to a low-dimensional embedding, and this mapping is chosen to preserve some information about the structure of the data, for example information about the locality of different data points. In particular, many of these methods try to map data points that are close together in your high-dimensional space to also be close together in your low-dimensional embedding.
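
To see why a linear method struggles with curved structure, the sketch below builds a hypothetical S-shaped curve in 2D. The points are fully described by one parameter `t`, so the structure is one-dimensional. The curve and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 1D manifold: an S-shaped curve in 2D, parametrized by t
t = np.linspace(-1.5 * np.pi, 1.5 * np.pi, 1000)
curve = np.column_stack([np.sin(t), np.sign(t) * (np.cos(t) - 1)])

pca = PCA(n_components=2).fit(curve)
ratios = pca.explained_variance_ratio_
print("Explained variance ratios:", np.round(ratios, 2))

# Project onto the first principal component and follow the score along the curve
pc1_scores = pca.transform(curve)[:, 0]
steps = np.diff(pc1_scores)
folds_back = (steps > 0).any() and (steps < 0).any()
print("PC1 score goes both up and down along the curve:", folds_back)
```

Even though position along the curve is a single number, one linear component does not capture all of the variance, and the first component's score is not a monotonic function of `t`: points that are far apart along the curve can land on the same score.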

There are many different types of nonlinear dimensionality reduction methods, all with their own positives and negatives.

In this final exercise, you'll use one called t-distributed stochastic neighbor embedding, or t-SNE.

t-SNE is very useful for visualizing high-dimensional data in two or three dimensions.
However, you can't reconstruct the data from the embedding in the same way that you can in PCA.

Finally, t-SNE has a free parameter called the perplexity which, roughly speaking, balances the local versus global information that's considered in finding the embedding.
The perplexity can have a large impact on the results that you see.
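
The point about reconstruction shows up in the scikit-learn API: `PCA` provides `inverse_transform`, while `TSNE` only provides `fit` and `fit_transform`. The sketch below uses the first 300 samples of the small built-in `load_digits` dataset as a stand-in for MNIST, which is an assumption made to keep the example fast and offline.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small 8x8 digits dataset, used here only as a stand-in for MNIST
X_small = load_digits().data[:300]

# PCA: project to 2D, then map back to the original 64-dimensional space
pca_demo = PCA(n_components=2).fit(X_small)
X_back = pca_demo.inverse_transform(pca_demo.transform(X_small))
print("PCA reconstruction shape:", X_back.shape)

# t-SNE: returns an embedding, but offers no mapping back to the data space
tsne_demo = TSNE(n_components=2, perplexity=30, random_state=2020)
embedding = tsne_demo.fit_transform(X_small)
print("t-SNE embedding shape:", embedding.shape)
print("t-SNE has inverse_transform:", hasattr(tsne_demo, "inverse_transform"))
print("t-SNE has transform:", hasattr(tsne_demo, "transform"))
```

The lack of a separate `transform` method also means the fitted embedding cannot be applied to new samples; the embedding is computed for the data passed to `fit_transform`.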


Next we will analyze the same data using t-SNE, a nonlinear dimensionality reduction method that is useful for visualizing high dimensional data in 2D or 3D. Run the cell below to get started. 

In [None]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, perplexity=30, random_state=2020)

## Exercise 2: Apply t-SNE on MNIST
First, we'll run t-SNE on the data to explore whether we can see more structure. The cell above defined the parameters that we will use to find our embedding (i.e., the low-dimensional representation of the data) and stored them in `tsne_model`. To run t-SNE on our data, use the function `tsne_model.fit_transform`.

**Suggestions:**
- Run t-SNE using the function `tsne_model.fit_transform`.
- Plot the resulting data using `visualize_components`.

In [None]:
help(tsne_model.fit_transform)

In [None]:
#################################################
## TODO for students: perform tSNE and visualize the data
#################################################

# perform t-SNE
embed = ...

# Visualize the data
# visualize_components(..., ..., labels)

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W1D5_DimensionalityReduction/solutions/W1D5_Tutorial4_Solution_a989b6ef.py)

*Example output:*

<img alt='Solution hint' align='left' width=522 height=416 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W1D5_DimensionalityReduction/static/W1D5_Tutorial4_Solution_a989b6ef_0.png>



## Exercise 3: Run t-SNE with different perplexities

Unlike PCA, t-SNE has a free parameter (the perplexity) that roughly determines how global vs. local information is weighted. Here we'll take a look at how the perplexity affects our interpretation of the results. 

**Steps:**
- Rerun t-SNE (don't forget to re-initialize using the function `TSNE` as above) with a perplexity of 50, 5 and 2.

In [None]:
def explore_perplexity(values):
  """
  Plots a 2D representation of the data for visualization with categories
  labelled as different colors using different perplexities.

  Args:
    values (list of floats) : list with perplexities to be visualized

  Returns:
    Nothing.

  """
  for perp in values:

    #################################################
    ## TO DO for students: Insert your code here to redefine the t-SNE "model"
    ## while setting the perplexity. Perform t-SNE on the data and plot the
    ## results for perplexity = 50, 5, and 2 (set random_state to 2020).
    # Comment out the line below when you complete the function
    raise NotImplementedError("Student Exercise! Explore t-SNE with different perplexity")
    #################################################

    # perform t-SNE
    tsne_model = ...

    embed = tsne_model.fit_transform(X)
    visualize_components(embed[:, 0], embed[:, 1], labels, show=False)
    plt.title(f"perplexity: {perp}")


# Uncomment when you complete the function
# values = [50, 5, 2]
# explore_perplexity(values)

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W1D5_DimensionalityReduction/solutions/W1D5_Tutorial4_Solution_e3519b37.py)

*Example output:*

<img alt='Solution hint' align='left' width=521 height=416 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W1D5_DimensionalityReduction/static/W1D5_Tutorial4_Solution_e3519b37_0.png>

<img alt='Solution hint' align='left' width=521 height=416 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W1D5_DimensionalityReduction/static/W1D5_Tutorial4_Solution_e3519b37_1.png>

<img alt='Solution hint' align='left' width=522 height=416 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W1D5_DimensionalityReduction/static/W1D5_Tutorial4_Solution_e3519b37_2.png>



## Think!

- What changes compared to your previous results (perplexity of 30) when you use a perplexity of 50? Do you see any clusters that have a different structure than before?
- What changes in the embedding structure for a perplexity of 5 or 2?

*Hint*:

1. Think about location and structure.
2. Think about overall structure.

---
# Summary

* We learned the difference between linear and nonlinear dimensionality reduction. While nonlinear methods can be more powerful, they can also be sensitive to noise. In contrast, linear methods are useful for their simplicity and robustness.
* We compared PCA and t-SNE for data visualization. Using t-SNE, we could visualize clusters in the data corresponding to different digits. While PCA was able to separate some clusters (e.g., 0 vs 1), it performed poorly overall.
* However, the results of t-SNE can change depending on the choice of perplexity. To learn more, we recommend this [Distill paper](https://distill.pub/2016/misread-tsne/).
