# Nonlinear dimensionality reduction

## Goals

* Visualize a single-cell dataset with t-SNE, UMAP and PHATE
* Understand how important parameter tuning is to visualization
* Understand how to compare the merits of different dimensionality reduction algorithms

In [None]:
!pip install --user scprep phate umap-learn

## 1. Loading the Retinal Bipolar dataset

In [None]:
import scprep
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Since we've done the preprocessing on this dataset before, we'll just download the preprocessed data from Google Drive.

In [None]:
scprep.io.download.download_google_drive("1pRYn62SOmmJxwVU0sSW7eBagRL2RJmx0", "shekhar_data.pkl")
scprep.io.download.download_google_drive("1FlNktWuJCka3pXOvNIFfRitGluZy2ftt", "shekhar_clusters.pkl")

In [None]:
data = pd.read_pickle("shekhar_data.pkl")
clusters = pd.read_pickle("shekhar_clusters.pkl")

## 2. t-SNE

#### What is t-SNE?
t-SNE is one of the most popular visualization methods for single-cell RNA-sequencing data. The method was introduced by Laurens van der Maaten and Geoffrey Hinton in 2008 in the aptly named article ["Visualizing Data using t-SNE"](http://jmlr.org/papers/v9/vandermaaten08a.html). The goal of t-SNE is to produce a two- or three-dimensional embedding of a dataset that exists in many dimensions, such that the embedding can be used for visualization.

By embedding, we mean a mapping of each data point from the high-dimensional space to a vector in a lower-dimensional space.

The way t-SNE does this is by minimizing the difference between neighborhood distances (i.e. distances from a cell to a set of close cells) in the original high-dimensional space and in the lower-dimensional embedding space. t-SNE poses this as an optimization problem: starting from an initial layout, the algorithm iteratively adjusts the positions of the points in the embedding so that each successive update further reduces the difference between the high- and low-dimensional neighborhood distances.
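For reference, the difference being minimized can be written out explicitly. The formulation below follows the original t-SNE paper; the symbols ($x_i$ for a cell in the original space, $y_i$ for its position in the embedding, $\sigma_i$ for a per-cell bandwidth, and $n$ for the number of cells) are notation introduced here for illustration.

Neighborhoods in the high-dimensional space are encoded as probabilities:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

Neighborhoods in the embedding use a heavy-tailed Student t kernel:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

The embedding is then found by gradient descent on the Kullback-Leibler divergence between the two:

$$C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$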

This approach preserves local structure in the data. Cells that are close in high-dimensional space (i.e. have small Euclidean distances) will also be close in low-dimensional space. However, it also means that global structure *will not* be preserved. This means that the distances between "clusters" in a t-SNE plot don't have any meaning. In other words, the white space on the graph has *no* interpretative value.
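As a rough illustration of the local versus global distinction, the sketch below runs t-SNE on synthetic data and looks at both scales. It uses `sklearn.manifold.trustworthiness` (a score between 0 and 1 for how well small neighborhoods are retained) for local structure, and compares distances between cluster centroids before and after embedding for global structure. The toy dataset from `make_blobs` and the helper `centroid_distances` are assumptions made for this example, not part of the workshop data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness

# Hypothetical toy data: 300 points in 20 dimensions, drawn from 4 clusters.
X, labels = make_blobs(n_samples=300, n_features=20, centers=4, random_state=0)

X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Local structure: are the neighbors in the embedding also neighbors in the original space?
local_score = trustworthiness(X, X_tsne, n_neighbors=10)
print("trustworthiness:", local_score)


def centroid_distances(points, labels):
    """Pairwise Euclidean distances between the centroids of each labelled cluster."""
    centroids = np.stack([points[labels == k].mean(axis=0) for k in np.unique(labels)])
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Global structure: compare between-cluster distances before and after embedding.
d_high = centroid_distances(X, labels)
d_low = centroid_distances(X_tsne, labels)
print(np.round(d_high, 1))
print(np.round(d_low, 1))
```

If the two printed distance matrices rank the cluster pairs differently, that is the loss of global structure described above showing up in practice.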


#### How to use t-SNE effectively

Unlike PCA, t-SNE has *hyperparameters*: user-specified options that determine the output of t-SNE. Having hyperparameters isn't bad, but it is essential to understand what the hyperparameters are, what effect hyperparameter choices have on the output, and how to select the best set of hyperparameters for a given research objective.

In 2016, a group from Google Brain published a great essay in Distill, ["How to Use t-SNE Effectively"](https://distill.pub/2016/misread-tsne/). In the article, they provide an interactive tool to explore the effect of t-SNE's hyperparameters on various datasets.

There are two main hyperparameters for t-SNE: **perplexity** and **learning rate** (sometimes called epsilon). Perplexity determines the "neighborhood size". Larger values of perplexity increase the number of points within the neighborhood. The recommended range of t-SNE perplexity is roughly 5-50. Learning rate affects how quickly the algorithm "stabilizes". You probably don't need to change this, but you should understand what it is.
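To make "neighborhood size" more concrete, here is a minimal sketch of how perplexity is defined for a single point: it is 2 raised to the entropy of that point's neighbor probabilities, so a point with k equally likely neighbors has perplexity k. In t-SNE, the Gaussian bandwidth of each point is searched for so that this quantity matches the perplexity you request. The function names and the toy distances below are assumptions made for illustration, not scikit-learn internals.

```python
import numpy as np


def perplexity(p):
    """Perplexity 2**H(p) of a discrete probability distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p == 0 contribute nothing to the entropy
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** entropy


def conditional_probabilities(sq_dists, sigma):
    """Gaussian neighbor probabilities for one point, given squared distances to all others."""
    weights = np.exp(-np.asarray(sq_dists, dtype=float) / (2.0 * sigma ** 2))
    return weights / weights.sum()


# Hypothetical squared distances from one cell to 100 other cells.
sq_dists = np.arange(1, 101, dtype=float)

# A wider bandwidth spreads probability over more neighbors, raising the perplexity.
for sigma in (1.0, 3.0, 10.0):
    p = conditional_probabilities(sq_dists, sigma)
    print(f"sigma={sigma}: perplexity={perplexity(p):.1f}")
```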

This dataset consists of many cell types, which were mostly identified as Amacrine cells, Muller Glia, Rod Bipolar cells, and many subtypes of Cone Bipolar cells in [Shekhar et al., 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5003425/). We can plot the data using t-SNE, as was done in the original paper.

#### Reducing dimensionality with PCA to speed up t-SNE

t-SNE gets very slow with high-dimensional data. We can speed it up substantially by first reducing the data to 100 dimensions with PCA.

In [None]:
data_pca = scprep.reduce.pca(data, n_components=100, method='dense')
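If you are curious how much information a given number of components retains, one option is to look at the cumulative explained variance. The sketch below uses `sklearn.decomposition.PCA` on synthetic data as a stand-in for `scprep.reduce.pca`; the toy matrix and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 "cells" x 200 "genes" driven by 10 latent factors plus noise.
latent = rng.normal(size=(500, 10))
loadings = rng.normal(size=(10, 200))
toy = latent @ loadings + 0.1 * rng.normal(size=(500, 200))

pca_op = PCA(n_components=100, random_state=0)
toy_pca = pca_op.fit_transform(toy)

# Fraction of total variance captured by the first k components, for k = 1..100.
cumulative_variance = np.cumsum(pca_op.explained_variance_ratio_)

# Smallest number of components that captures at least 90% of the variance.
n_components_90 = int(np.searchsorted(cumulative_variance, 0.9) + 1)
print(n_components_90)
```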

#### Subsampling to speed up t-SNE even more

t-SNE is still slow even after PCA, so let's speed things up by using fewer points.

In [None]:
data_pca_subsample, clusters_subsample = scprep.select.subsample(data_pca, clusters, n=3000)
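The key point when subsampling is that the data and the cluster labels must keep the same cells in the same order. The sketch below shows the idea with plain NumPy and pandas on toy tables; it illustrates the concept and is not how `scprep.select.subsample` is implemented. All names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical toy stand-ins for the PCA matrix and the cluster table.
cell_names = [f"cell_{i}" for i in range(1000)]
toy_pca = pd.DataFrame(rng.normal(size=(1000, 5)), index=cell_names)
toy_clusters = pd.DataFrame(
    {"CELLTYPE": rng.choice(["A", "B", "C"], size=1000)}, index=cell_names
)

# Draw row positions once, without replacement, and reuse them for both tables.
keep = rng.choice(toy_pca.shape[0], size=300, replace=False)
toy_pca_sub = toy_pca.iloc[keep]
toy_clusters_sub = toy_clusters.iloc[keep]
```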

#### Running t-SNE

t-SNE is implemented in `scikit-learn`. t-SNE is a manifold learning algorithm and you can find the t-SNE operator at [`sklearn.manifold.TSNE`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).

We create a t-SNE operator and run it on data with the following syntax:

```python
import sklearn.manifold
tsne_op = sklearn.manifold.TSNE(n_components=2, perplexity=30)
data_tsne = tsne_op.fit_transform(data)
```

**Note**: In the example below we are instantiating the `tsne_op` with default parameters.

In [None]:
import sklearn.manifold
tsne_op = sklearn.manifold.TSNE()
data_tsne = tsne_op.fit_transform(data_pca_subsample)

#### Plotting and interpreting t-SNE

In [None]:
scprep.plot.scatter2d(data_tsne, c=clusters_subsample['CELLTYPE'],
                      figsize=(8,4), legend_anchor=(1,1),
                      ticks=False, label_prefix='t-SNE')

What do you notice? Is your favorite cell type nicely separated in this plot? How obvious is the distinction between the macro-level cell types of cone bipolar, rod bipolar, and glial cells?

#### Exercise - run t-SNE with different `perplexity` parameters

t-SNE's `perplexity` parameter describes the size of the neighborhood around each point. The authors recommend values between 5 and 50. Try a range of different values inside and outside of this range and discuss the results with your group.

*Note: be sure to use `data_pca_subsample`, as t-SNE can take a long time.*

In [None]:
# ==============
# experiment with the perplexity parameter
tsne_op = sklearn.manifold.TSNE(perplexity= ) # <-
data_tsne =
# ==============

In [None]:
scprep.plot.scatter2d(data_tsne, c=clusters_subsample['CELLTYPE'],
                      figsize=(8,4), legend_anchor=(1,1), ticks=False, label_prefix='t-SNE')
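If you would like to compare several perplexity values in one go, one possible pattern is to loop over them and keep each embedding in a dictionary. This is a sketch on synthetic data from `make_blobs`, with `random_state` fixed so that repeated runs are comparable; the toy data and variable names are assumptions, and the values chosen are only examples.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Hypothetical toy data standing in for data_pca_subsample.
toy, toy_labels = make_blobs(n_samples=200, n_features=20, centers=4, random_state=0)

# One embedding per perplexity value (perplexity must be smaller than the number of points).
tsne_by_perplexity = {}
for perplexity in (5, 30, 50):
    tsne_op = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    tsne_by_perplexity[perplexity] = tsne_op.fit_transform(toy)
```

Each entry of `tsne_by_perplexity` can then be passed to `scprep.plot.scatter2d` in the same way as `data_tsne` above.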

#### _Breakpoint_  - once you get here, please help those around you!

## 3. UMAP

Even though UMAP is not a part of scikit-learn, the syntax for UMAP is identical to t-SNE: `umap.UMAP().fit_transform`. UMAP is relatively fast, so you won't need to use the subsampled data. We also don't need to do PCA beforehand, but since we've already done it we may as well use it.

In [None]:
import umap
umap_op = umap.UMAP()
data_umap = umap_op.fit_transform(data_pca)

In [None]:
scprep.plot.scatter2d(data_umap, c=clusters['CELLTYPE'],
                      figsize=(8,4), legend_anchor=(1,1), ticks=False, label_prefix='UMAP')

What do you notice? Is your favorite cell type nicely separated in this plot? How obvious is the distinction between the macro-level cell types of cone bipolar, rod bipolar, and glial cells? How does this plot compare to t-SNE?

### Exercise - run UMAP with different `n_neighbors` and `min_dist` parameters

UMAP's `n_neighbors` parameter describes the size of the neighborhood around each point. The `min_dist` parameter describes how tightly points can be packed together. The authors recommend values between 2 and 200 for `n_neighbors`, and between 0 and 0.99 for `min_dist`. Try a range of different values in and outside of these ranges and discuss the results with your group.

In [None]:
# ===============
# Choose different values for n_neighbors and min_dist, plotting with scprep
umap_op =
data_umap =
scprep.plot.scatter2d(
# ===============

#### _Breakpoint_  - once you get here, please help those around you!

## 4. PHATE

### Exercise - perform PHATE and plot the results

The syntax for PHATE is identical to UMAP and t-SNE: `phate.PHATE().fit_transform`. PHATE is relatively fast, so you won't need to use the subsampled data.

In [None]:
import phate
phate_op = phate.PHATE()
data_phate = phate_op.fit_transform(data_pca)

In [None]:
scprep.plot.scatter2d(data_phate, c=clusters['CELLTYPE'],
                      figsize=(8,4), legend_anchor=(1,1), ticks=False, label_prefix='PHATE')

What do you notice? Is your favorite cell type nicely separated in this plot? How obvious is the distinction between the macro-level cell types of cone bipolar, rod bipolar, and glial cells? How does this plot compare to t-SNE and UMAP?

### Exercise - run PHATE with different `knn` and `t` parameters

PHATE's `knn` parameter describes the size of the neighborhood around each point. The `t` parameter describes how much denoising is performed. We recommend values between 2 and 100 for `knn`, and between 2 and 150 for `t`. Try a range of different values inside and outside of these ranges and discuss the results with your group.
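To build some intuition for why larger `t` means more denoising, here is a simplified NumPy sketch of diffusion on a graph: a kernel is row-normalized into a Markov transition matrix and raised to the power `t`, which corresponds to taking `t` random-walk steps. This is only an illustration of the general idea. It assumes a fixed-bandwidth Gaussian kernel and is not PHATE's actual implementation, which uses an adaptive kernel and further steps. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 5))  # hypothetical toy data: 50 cells, 5 features

# Pairwise squared Euclidean distances.
diff = points[:, None, :] - points[None, :, :]
sq_dists = (diff ** 2).sum(axis=-1)

# Fixed-bandwidth Gaussian kernel (a simplifying assumption).
bandwidth = np.median(np.sqrt(sq_dists))
kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Row-normalize so that each row is a probability distribution over neighbors.
diffusion_op = kernel / kernel.sum(axis=1, keepdims=True)


def diffuse(operator, t):
    """Transition probabilities after t steps of the random walk."""
    return np.linalg.matrix_power(operator, t)


def row_spread(matrix):
    """How much the rows differ from one another (0 means all rows are identical)."""
    return float(np.mean(np.std(matrix, axis=0)))


# As t grows, every cell's t-step neighborhood looks more alike: fine detail is smoothed away.
spread_by_t = {t: row_spread(diffuse(diffusion_op, t)) for t in (1, 5, 20)}
print(spread_by_t)
```

The trade-off this suggests is that a small `t` keeps noise along with fine detail, while a very large `t` smooths away real structure as well.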

In [None]:
# ===============
# Choose different values for knn and t, plotting with scprep
phate_op =
data_phate =
scprep.plot.scatter2d(
# ===============

### Discussion 

In groups, discuss the following questions:
1. In a dataset with clusters, how well does each method perform?
2. How might you determine which method is closest to the ground truth?
3. Which parameters are the most similar between methods?
4. Which method is the most / least sensitive to parameter selection?
5. If you run the same method with the same parameters multiple times, do you always get the same result?

**Note Dan/Scott**: In one of the earlier clustering notebooks, we used a more formal measure of clustering efficiency. Can we ask the students to actually do that here as an exercise, if it is not too complicated?