In [None]:
!pip install --user scprep phate umap-learn

# Dimensionality reduction on the EB time course

<a id='loading'></a>
## 1. Loading preprocessed data

### Load EB Data (and download if needed)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import phate
import umap
import scprep
import os

In [None]:
file_path = os.path.expanduser('~/EBT_counts.pkl.gz')
if not os.path.exists(file_path):
    scprep.io.download.download_google_drive(id='1Xz0ONnRWp2MLC_R6r74MzNwaZ4DkQPcM',
                        destination=file_path)
data = scprep.utils.SparseDataFrame(pd.read_pickle(file_path))

In [None]:
metadata = pd.DataFrame([ix.split('_')[1] for ix in data.index], columns=['sample'], index=data.index)

## 2. Visualization using Principal Component Analysis (PCA)

Here we're going to use the simplest dimensionality reduction method first. We don't expect PCA to work well because the dataset is so complex, but it's a good place to start with any dataset.
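
To make concrete what PCA computes, here is a minimal sketch on synthetic data (not the EB dataset). It assumes only `numpy`; the toy matrix, the `loadings` vector, and the noise level are made up for illustration. PCA amounts to centering the data and taking a singular value decomposition, and the squared singular values tell you how much variance each component explains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 200 "cells" x 5 "genes", where most of the
# variance lies along a single hidden direction plus a little noise
latent = rng.normal(size=(200, 1))
loadings = np.array([[3.0, 2.0, 1.0, 0.5, 0.1]])
X = latent @ loadings + 0.3 * rng.normal(size=(200, 5))

# PCA: center each column, then take the singular value decomposition
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Coordinates of every cell on every principal component
pcs = U * S

# Fraction of the total variance captured by each component
explained_variance_ratio = S**2 / np.sum(S**2)
print(np.round(explained_variance_ratio, 3))
```

In this toy example almost all of the variance falls on the first component because the data were built around one direction. Real single-cell data usually spread variance over many more components.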

#### Running PCA on the EB data

In [None]:
data_pca = scprep.reduce.pca(data, n_components=50, method='dense')

In [None]:
data_pca.head()

#### Plotting PCs using `scprep.plot`

The scprep package has a number of handy plotting features that act as a wrapper to `matplotlib`. You should know how to use `matplotlib` for more complicated plotting, but you can make all the plots we need in this tutorial with some help from `scprep`.

The full documentation of `scprep.plot` can be found here:

https://scprep.readthedocs.io/en/stable/reference.html#module-scprep.plot

In [None]:
# Create a figure (the background) and a set of axes (the things we plot on)
fig, axes = plt.subplots(2,3, figsize=(12,8))
# This makes it easier to iterate through the axes
axes = axes.flatten()

for i, ax in enumerate(axes):
    # only plot a legend on one axis
    legend = (i == 2)
    # There are a lot of parameters here, you can find the full scatter documentation at
    # https://scprep.readthedocs.io/en/stable/reference.html#scprep.plot.scatter
    scprep.plot.scatter(data_pca.iloc[:,i], data_pca.iloc[:,i+1], c=metadata['sample'],
                        cmap='Spectral', ax=ax,
                        label_prefix="PC", legend=legend)
fig.tight_layout()

#### Plotting expression of a gene on the first two PCs

Now let's plot expression of some genes!


In [None]:
gene = 'SOX10'

expression = scprep.select.select_cols(data, starts_with=gene)

# we will sort cells by maximum expression so we can see where the gene is expressed
sort_index = expression.sort_values(by=expression.columns[0]).index

scprep.plot.scatter2d(data_pca.loc[sort_index], c=expression.loc[sort_index], shuffle=False,
                     title=gene, ticks=None, label_prefix='PC')
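
The sorting step matters because points are drawn in row order when `shuffle=False`, so the cells listed last end up on top. Here is a small sketch of the same `sort_values` pattern on a made-up table; the gene label and cell names are hypothetical.

```python
import pandas as pd

# Hypothetical expression values for four cells
expression = pd.DataFrame(
    {"GENE (hypothetical)": [0.0, 5.0, 1.0, 3.0]},
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

# Sort ascending: the highest-expressing cell comes last,
# so it would be drawn last (on top) in a scatter plot
sort_index = expression.sort_values(by=expression.columns[0]).index
print(list(sort_index))  # -> ['cell_a', 'cell_c', 'cell_d', 'cell_b']
```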

In [None]:
gene = 'ACTB'

# ================
# Sort cells by maximum expression of ACTB and plot the result on PCA
expression = 

sort_index = 

scprep.plot.scatter2d(
# ===============

In [None]:
gene = 'HAND1'

# ================
# Sort cells by maximum expression of HAND1 and plot the result on PCA
expression = 

sort_index = 

scprep.plot.scatter2d(
# ===============

### Discussion

What do you notice? What does the *first* principal component track with? What about the *second*? What do you think the higher PCs represent? What does that mean?

Why did we plot gene expression on the first two PCs?

Look up the function of these genes. What do you notice about where these genes are expressed? What does it mean when a gene is expressed everywhere vs. in one region?

#### _Breakpoint_  - once you get here, please help those around you!

## 3. t-SNE

#### How to use t-SNE effectively

Unlike PCA, t-SNE has *hyperparameters*: user-specified options that determine the output of t-SNE. Having hyperparameters isn't bad, but it is essential to understand what the hyperparameters are, what effect hyperparameter choices have on the output, and how to select the best set of hyperparameters for a given research objective.

In 2016, a group from Google Brain published a great essay in Distill called ["How to Use t-SNE Effectively"](https://distill.pub/2016/misread-tsne/). In the article, they provide an interactive tool to explore the effect of various hyperparameters of t-SNE on various datasets.

There are two main hyperparameters for t-SNE: **perplexity** and **learning rate** (sometimes called epsilon). Perplexity determines the "neighborhood size". Larger values of perplexity increase the number of points within the neighborhood. The recommended range of t-SNE perplexity is roughly 5-50. Learning rate affects how quickly the algorithm "stabilizes". You probably don't need to change this, but you should understand what it is.
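
To see why perplexity behaves like a neighborhood size, here is a minimal sketch using only `numpy` on made-up 2D points. For one point, t-SNE places a Gaussian over its distances to all other points and defines perplexity as 2 raised to the entropy of the resulting probabilities. The helper name `perplexity_of_neighborhood` and the `sigma` values are assumptions for illustration. In practice t-SNE works the other way around: it searches for the bandwidth of each point that matches the perplexity you request.

```python
import numpy as np

def perplexity_of_neighborhood(distances_sq, sigma):
    """Perplexity of the Gaussian neighbor distribution around one point."""
    logits = -distances_sq / (2.0 * sigma**2)
    logits = logits - logits.max()  # for numerical stability
    p = np.exp(logits)
    p = p / p.sum()
    nonzero = p > 0
    entropy = -np.sum(p[nonzero] * np.log2(p[nonzero]))
    return 2.0 ** entropy

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))  # hypothetical toy data

# Squared distances from the first point to every other point
d2 = np.sum((points[1:] - points[0]) ** 2, axis=1)

# A wider Gaussian spreads probability over more neighbors
perplexities = {
    sigma: perplexity_of_neighborhood(d2, sigma) for sigma in (0.05, 0.2, 1.0)
}
for sigma, value in perplexities.items():
    print(f"sigma={sigma}: perplexity={value:.1f}")
```

Read the printed perplexity as the effective number of neighbors that the point "sees" at that bandwidth.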

#### Running t-SNE on the embryoid body data

t-SNE is implemented in `scikit-learn`. It is a manifold learning algorithm, and you can find the t-SNE operator at [`sklearn.manifold.TSNE`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).

We create a t-SNE operator and run it on the data, just as we did with PCA:

```python
from sklearn.manifold import TSNE
tsne_op = TSNE(n_components=2, perplexity=30)
data_tsne = tsne_op.fit_transform(data)
```

### Exercise

In your groups, run t-SNE on the EB dataset. Each person should pick a different perplexity. Note that in the following code block we use only the first 20 principal components to speed up the run time (it should take 3-5 minutes to run). After the workshop, you can try changing the number of PCs to see how this affects the output. Think about why changing the number of PCs affects the output.

What are the differences you see?

Try running t-SNE with the same parameters twice. What happens? Why?

In [None]:
from sklearn.manifold import TSNE

import tasklogger
with tasklogger.log_task('tSNE on {} cells'.format(data_pca.shape[0])):

    # Fitting tSNE. Change the perplexity here.
    tsne_op = TSNE(n_components=2, perplexity=30)
    data_tsne = tsne_op.fit_transform(data_pca.iloc[:,:20])

    # Put output into a dataframe
    data_tsne = pd.DataFrame(data_tsne, index=data.index)

In [None]:
scprep.plot.scatter2d(data_tsne, c=metadata['sample'], cmap='Spectral', 
                      ticks=False, label_prefix='t-SNE',
                      legend_anchor=(1,1), figsize=(7,5))

#### Let's look at some marker genes!

In [None]:
fig, axes = plt.subplots(1,3, figsize=(14,4))
axes = axes.flatten()


genes_for_plotting = ['ACTB', 'SOX10', 'HAND1']

for gene, ax in zip(genes_for_plotting, axes):
    gene_full_name = scprep.select.get_gene_set(data, exact_word=gene)[0]
    expression = data[gene_full_name]
    
    sort_index = expression.sort_values().index
    
    scprep.plot.scatter2d(data_tsne.loc[sort_index], c=expression.loc[sort_index], shuffle=False,
                         title=gene, ticks=None, label_prefix='t-SNE', ax=ax)
    
fig.tight_layout()

### Discussion

Now, take some time in your groups to think of some pros and cons of using tSNE. What recommendations would you give to a new user who wants to know which parameters to try?

**Note Dan/Scott**: The Discussion sections worked beautifully IRL, because you could look around.  Online format, this probably needs to be more formalized.  Is there a way to force people to discuss these in the groups.  With a group of 4 it might just be possible!

#### _Breakpoint_  - once you get here, please help those around you!

## 3.3. Embedding Data Using UMAP

The syntax for UMAP is identical to t-SNE: `umap.UMAP().fit_transform`. UMAP is relatively fast, so you won't need to use the subsampled data.

UMAP's `n_neighbors` parameter describes the size of the neighborhood around each point. The `min_dist` parameter describes how tightly points can be packed together. The authors recommend values between 2 and 200 for `n_neighbors`, and between 0 and 0.99 for `min_dist`. Try a range of different values in and outside of these ranges and discuss the results with your group.

Play around with the `min_dist` and `n_neighbors` parameters.

In [None]:
import umap
data_umap = umap.UMAP().fit_transform(data_pca.iloc[:,:50])

In [None]:
data_umap = pd.DataFrame(data_umap, index = data.index)

In [None]:
# ================
# As you did with t-SNE, plot the UMAP coordinates
# colored by time point
scprep.plot.scatter2d(
# ================

#### Let's look at some marker genes!

In [None]:
genes_for_plotting = ['ACTB', 'SOX10', 'HAND1']
# ================
# As you did with t-SNE, plot three subplots of the UMAP coordinates
# coloring by ACTB, SOX10 and HAND1
fig, axes = 

for gene, ax in zip(genes_for_plotting, axes.flatten()):
    gene_full_name = 
    expression = 
    
    sort_index = 
    
    scprep.plot.scatter2d(
# ================

fig.tight_layout()

### Discussion

What are the similarities and differences between UMAP and t-SNE? Do you notice any parameter choices that seem to have similar effects between the algorithms?

**Note Dan/Scott**: This part is kind of repetitive between multiple notebooks under dimensionality reduction.  Can we move this or remove this to make the notebooks non-repetitive.

#### _Breakpoint_  - once you get here, please help those around you!

## 3.4. Embedding Data Using PHATE

#### How does PHATE work?

PHATE is a dimensionality reduction method developed by the Krishnaswamy lab for visualizing high-dimensional data. We use PHATE for *every* dataset that comes through the lab: scRNA-seq, CyTOF, gut microbiome profiles, simulated data, etc. PHATE was designed to handle noisy, non-linear relationships between data points. PHATE produces a low-dimensional representation that preserves both local and global structure in a dataset, so that you can generate hypotheses from the plot about the relationships between the cells in the dataset. Although PHATE is useful for analyzing many data modalities, we will focus on its application to scRNA-seq analysis.

PHATE is inspired by diffusion maps [(Coifman et al. 2008)](https://doi.org/10.1016/j.acha.2006.04.006), but includes several key innovations that make it possible to generate a two- or three-dimensional visualization that preserves continuous relationships between cells where they exist. For a full explanation of the PHATE algorithm, please consult [the PHATE manuscript](https://doi.org/10.1101/120378). **Note Dan/Scott Update the link to published paper**

#### Using the PHATE estimator

The PHATE API is modeled on that of scikit-learn. First, you instantiate a PHATE estimator object with the parameters for fitting the PHATE embedding to a given dataset. Next, you use the `fit` and `fit_transform` methods to generate an embedding. For more information, check out [**the PHATE readthedocs page**](http://phate.readthedocs.io/).

Like t-SNE, PHATE has its own set of hyperparameters. Changing the parameters can greatly change the output of the algorithm. We recommend starting with the defaults, then changing `knn` and `decay` according to the recommendations below. Generally, we won't select `t` ourselves, but if you're tuning the other hyperparameters, it's best to fix `t`.

* `knn` : Number of nearest neighbors (default: 5). Increase this (e.g. to 20) if your PHATE embedding appears very disconnected. You should also consider increasing `knn` if your dataset is extremely large (e.g. >100k cells).
* `decay` : Alpha decay (default: 15). Decreasing `decay` increases connectivity on the graph; increasing `decay` decreases connectivity. This rarely needs to be tuned. Set it to `None` for a k-nearest neighbors kernel.
* `t` : Number of times to power the operator (default: 'auto'). This is equivalent to the amount of smoothing done to the data. It is chosen automatically by default, but you can increase it if your embedding lacks structure, or decrease it if the structure looks too compact.
* `gamma` : Informational distance constant (default: 1). `gamma=1` gives the PHATE log potential, but other informational distances can be interesting. If most of the points seem concentrated in one section of the plot, you can try `gamma=0`.
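
To build intuition for `t`, here is a toy sketch of powering a diffusion operator, using only `numpy` on made-up data. This is not PHATE's implementation: it assumes a fixed-bandwidth Gaussian kernel, whereas PHATE uses an adaptive kernel controlled by `knn` and `decay`. The variable names are hypothetical. The idea is that each row of the powered operator describes where a random walk started at one cell ends up after `t` steps. As `t` grows, the rows all drift toward the same stationary distribution, which is what "more smoothing" means here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # hypothetical toy data: 50 points in 3D

# Pairwise squared distances and a fixed-bandwidth Gaussian kernel
# (a simplifying assumption for this sketch)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
kernel = np.exp(-sq_dists)

# Row-normalize to get a Markov transition matrix (the diffusion operator)
degree = kernel.sum(axis=1)
P = kernel / degree[:, None]

# Stationary distribution of a random walk on a symmetric kernel graph
stationary = degree / degree.sum()

# Largest total variation distance between any row of P^t and the
# stationary distribution: small values mean the rows look alike
tv_distance = {}
for t in (1, 4, 16):
    P_t = np.linalg.matrix_power(P, t)
    tv_distance[t] = float(np.max(0.5 * np.abs(P_t - stationary).sum(axis=1)))

for t, value in tv_distance.items():
    print(f"t={t}: max distance to stationary distribution = {value:.3f}")
```

With too few steps the rows stay dominated by each point's immediate neighbors. With too many, every point looks the same and the structure is lost, which is the trade-off that the automatic choice of `t` tries to balance.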

**Note Dan/Scott**: Haven't the names of some parameters changed in the final version? I only remember this vaguely, so please ignore this if it is not accurate.

Here's the simplest way to apply PHATE. Running this should take ~1-3 minutes.

In [None]:
phate_op = phate.PHATE(knn=5, n_jobs=-2)

data_phate = phate_op.fit_transform(data_pca.iloc[:,:50])
data_phate = pd.DataFrame(data_phate, index=data.index)

And then we plot using `scprep.plot.scatter2d`. For more advanced plotting, we recommend Matplotlib. If you want more help on using Matplotlib, they have [**extensive documentation**](https://matplotlib.org/tutorials/index.html) and [**many Stackoverflow threads**](https://stackoverflow.com/questions/tagged/matplotlib).

In [None]:
# ================
# As you did with t-SNE and UMAP, plot the PHATE coordinates
# colored by time point

# ================

#### Gene visualization

In [None]:
# ================
# As you did with t-SNE and UMAP, plot three subplots of the PHATE coordinates
# coloring by ACTB, SOX10 and HAND1

# ================

### Discussion

In groups, discuss the following questions:
1. In a dataset with trajectories, how well does each method perform?
2. Now that you've seen all the methods, how might you include them in a workflow?
3. What are the advantages of each method?
4. If you didn't know if your data contained clusters or trajectories, what would you do?