# Principal components analysis (PCA)

## Goals

* Understand how PCA is computed
* Visualize a single-cell dataset with PCA
* Understand how different biological axes of variation are separated into different components

In [None]:
!pip install --user scprep

## 1. Computing PCA on the UCI wine dataset

#### How does PCA work?

PCA and related eigendecomposition methods are some of the most fundamental dimensionality reduction tools in data science. Many methods, including t-SNE and PHATE, first reduce the data using PCA before performing further operations on it.

You can find many rigorous descriptions of the PCA algorithm online. Here, we will focus on the intuition. The goal of PCA is to identify a set of orthogonal dimensions (each of which is a linear combination of the input features) that explain the maximum variance in the data. These dimensions are called principal components. In the following figure, you can see data in two dimensions:

<img src="https://krishnaswamylab.github.io/img/how_to_single_cell/PCA_original_data.png" style="height: 25rem;"/>

This is a simple dataset where the data exists in two dimensions. The axis of maximum variance in this data is going to be some line that goes up and to the right. If you were to identify the first two principal components in this data, they would look like the dashed grey lines in the following figure:

<img src="https://krishnaswamylab.github.io/img/how_to_single_cell/PCA_PC1.png" style="height: 35.35rem;"/>

PCA then projects the points onto these new axes. Above, we see the projection onto PC1 (the longer dashed line) for a handful of points, denoted by the red arrows. Note that the arrows are orthogonal (perpendicular) to PC1. This is the definition of projection. Projecting the data onto the first principal component is the simplest possible dimensionality reduction: it takes the data from two dimensions to one. Notice how some information is lost here. Some points that are far apart in the original data space end up very close together on PC1. Some information loss is unavoidable when reducing dimensions. Notice that if we also considered the second PC, we would get that information back.
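The projection described above can be sketched in a few lines of NumPy. This is only an illustration on synthetic two-dimensional data (the random points, the seed, and all variable names are assumptions, not the data shown in the figures): projecting onto PC1 alone loses information, while keeping both PCs lets us recover every point exactly.

```python
import numpy as np

# Synthetic 2D data stretched along a diagonal (illustrative only)
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.5], [0.0, 0.5]])
centered = points - points.mean(axis=0)

# Principal directions are the eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

# Project onto PC1 only (2D -> 1D), then map back into the original space
pc1 = eigenvectors[:, :1]
coords_1d = centered @ pc1
reconstructed_1d = coords_1d @ pc1.T

# Project onto both PCs: nothing is lost
coords_2d = centered @ eigenvectors
reconstructed_2d = coords_2d @ eigenvectors.T

error_1d = np.mean(np.sum((centered - reconstructed_1d) ** 2, axis=1))
error_2d = np.mean(np.sum((centered - reconstructed_2d) ** 2, axis=1))
print("Mean squared error with 1 PC: ", error_1d)
print("Mean squared error with 2 PCs:", error_2d)
```

The difference `centered - reconstructed_1d` plays the role of the red arrows: it is perpendicular to PC1 for every point.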

Visualization is a game of deciding what information you want to keep, and what you're comfortable throwing away. Here, we're looking at two-dimensional data, but scRNA-seq data usually has 20-30K dimensions (one per gene). Some information will definitely be lost when considering only 1 or 2 principal components.

**Note:** There exist as many PCs as there are original dimensions of the data, but we usually only consider the first 50-500 for single-cell data.
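To make the note above concrete, here is a small sketch on synthetic data (the matrix, the seed, and the 90% threshold are assumptions chosen for illustration). It shows that we get one PC per original dimension, and that the first few PCs can account for most of the total variance.

```python
import numpy as np

# Synthetic data: 300 samples, 10 features, with variance concentrated
# in a few directions (illustrative only)
rng = np.random.default_rng(42)
scales = np.array([5.0, 3.0, 2.0, 0.5, 0.5, 0.4, 0.3, 0.2, 0.1, 0.1])
X = rng.normal(size=(300, 10)) * scales

# One eigenvalue (and one PC) per original dimension, largest first
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print("Number of PCs:", len(eigenvalues))

# Fraction of the total variance captured by the first k PCs
cumulative = np.cumsum(eigenvalues) / np.sum(eigenvalues)

# Smallest number of PCs whose cumulative fraction reaches 90%
n_keep = int(np.searchsorted(cumulative, 0.9) + 1)
print("PCs needed to explain 90% of the variance:", n_keep)
```

A fixed variance threshold is only one possible rule of thumb; looking for the "elbow" in a bar plot of the eigenvalues is another.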


In [None]:
import scprep
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn.preprocessing
import sklearn.datasets

#### Load the data

In [None]:
wine = sklearn.datasets.load_wine()

# Access the numerical data from the wine Bunch
data = wine['data']

# Load data about the rows and columns
feature_names = wine['feature_names']

# Load cultivar information about each wine
cultivars = np.array(['Cultivar{}'.format(cl) for cl in wine['target']])

# Create nice names for each row
wine_names = np.array(['Wine{}'.format(i) for i in range(data.shape[0])])

# use the sklearn StandardScaler to scale to mean 0, variance 1
data = sklearn.preprocessing.StandardScaler().fit_transform(data)

# Gather all of this information into a DataFrame
data = pd.DataFrame(data, columns=feature_names, index=wine_names)

# Print the first 5 rows of the data, equivalent to data[:5]
data.head()

#### Compute PCA manually

In [None]:
# compute the sample covariance matrix
Sigma = np.cov(np.transpose(data))

# compute the eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# sort the eigenvectors in order of decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1] # [::-1] reverses the order of the array
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:,order]

# plot the eigenvalues
plt.bar(np.arange(len(eigenvalues))+1, eigenvalues)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance')

In [None]:
# =============
# How many principal components do you think are meaningful in this dataset?
n = 
# =============

# take only the first n eigenvectors and eigenvalues
eigenvectors, eigenvalues = eigenvectors[:,:n], eigenvalues[:n]

# project the data onto the principal directions by matrix multiplication
data_pca = data @ eigenvectors

data_pca.head()

In [None]:
my_cultivar = "Cultivar0" # alternative: "Cultivar1", "Cultivar2"

In [None]:
scprep.plot.scatter(x=data_pca[0], y=data_pca[1],
                    c=cultivars==my_cultivar, ticks=False)

### Exercise 1 - pick through the first few principal components and see which best separate your chosen cultivar

In [None]:
# =====================
# Use scprep.plot.scatter to plot different principal components
scprep.plot.scatter(x=
                    y=
                    c=cultivars==my_cultivar, ticks=False)
# =====================

#### _Breakpoint_  - once you get here, please help those around you!

## 2. Downloading the Retinal Bipolar dataset

In [None]:
# download the data from Google Drive
scprep.io.download.download_google_drive("1bkOEkDJS1B8HeQUXtPHoo66qZiVK0ryC",
                                         "shekhar_data.zip")
scprep.io.download.unzip("shekhar_data.zip")

In [None]:
# read in the data
data = scprep.io.load_mtx("shekhar_data/matrix.mtx",
                         cell_names="shekhar_data/cell_names.tsv",
                         gene_names="shekhar_data/gene_names.tsv")
data.head()

In [None]:
# read in the cluster labels
clusters = scprep.io.load_tsv("shekhar_data/shekhar_clusters.tsv")
clusters.head()

In [None]:
data.shape

## 3. Preprocessing

You should be familiar with the preprocessing workflow from earlier, but we'll walk through it step by step anyway.

#### Library size filtering

In [None]:
scprep.plot.plot_library_size(data, percentile=(20,80))

Notice that there are no cells with library size smaller than ~500. This dataset has already been filtered for library size, so we don't _need_ to do anything, but to save time and memory we'll filter it a bit more aggressively.
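What a percentile-based library size filter does can be sketched with NumPy. This is not how `scprep` implements it, only an illustration on synthetic counts; the matrix, the seed, and the variable names are all assumptions.

```python
import numpy as np

# Synthetic count matrix: 1000 cells x 50 genes, with cells sequenced
# to different depths (illustrative only)
rng = np.random.default_rng(0)
cell_depth = rng.uniform(0.5, 5.0, size=(1000, 1))
counts = rng.poisson(cell_depth * 2.0, size=(1000, 50))

# Library size is the total number of counts observed in each cell
library_size = counts.sum(axis=1)

# Keep only cells between the 20th and 80th percentiles of library size
low, high = np.percentile(library_size, [20, 80])
keep = (library_size >= low) & (library_size <= high)
filtered = counts[keep]

print("Cells before:", counts.shape[0])
print("Cells after: ", filtered.shape[0])
```

The same boolean mask has to be applied to any per-cell metadata, which is why `clusters` is filtered together with `data` in this notebook.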

In [None]:
data, clusters = scprep.filter.filter_library_size(data, clusters, percentile=(20,80))

#### Library size normalization

In [None]:
data = scprep.normalize.library_size_normalize(data)

#### Mitochondrial DNA filtering

In [None]:
scprep.plot.plot_gene_set_expression(data, starts_with="mt-", percentile=80)

There is a long tail of high mitochondrial expression. Since we normalized library size to 10,000, a mitochondrial expression of 8,000 means that roughly 80% of the counts in that droplet were mitochondrial. We should remove these cells.
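The arithmetic behind that statement can be checked on a tiny example. This sketch assumes, as stated above, that each cell is rescaled to a total of 10,000; the count matrix and variable names are made up for illustration and this is not `scprep`'s implementation.

```python
import numpy as np

# Three synthetic cells x four genes; the last gene stands in for a
# mitochondrial gene (illustrative only)
counts = np.array([
    [10.0, 20.0, 60.0, 10.0],
    [5.0, 5.0, 5.0, 85.0],
    [100.0, 50.0, 40.0, 10.0],
])

# Divide each cell by its library size, then rescale to 10,000
library_size = counts.sum(axis=1, keepdims=True)
normalized = counts / library_size * 10_000

# After normalization every cell sums to 10,000, so each value can be
# read as "counts per 10,000"
mito_fraction = normalized[:, 3] / 10_000
print(normalized.sum(axis=1))
print(mito_fraction)
```

The second cell has a normalized mitochondrial expression of 8,500, which means 85% of its counts came from the mitochondrial gene.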

In [None]:
data, clusters = scprep.filter.filter_gene_set_expression(
    data, clusters, starts_with="mt-", keep_cells='below', percentile=80)
data.shape

#### Rare gene filtering

Now that we've removed some cells, it's likely that there are some genes with close to zero total counts. These are just a waste of space.
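As a rough sketch of the idea (again not `scprep`'s own code), a rare gene filter counts the number of cells in which each gene is detected and drops the genes that fall below a threshold. The toy matrix and the threshold of 3 are assumptions chosen to keep the example small.

```python
import numpy as np

# Synthetic counts: 6 cells x 4 genes (illustrative only)
counts = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 0],
    [5, 1, 2, 0],
    [1, 0, 4, 0],
    [4, 0, 1, 0],
    [2, 0, 3, 0],
])

# Count the cells in which each gene is detected at all
cells_per_gene = (counts > 0).sum(axis=0)

# Keep genes detected in at least `min_cells` cells
min_cells = 3
keep_genes = cells_per_gene >= min_cells
filtered = counts[:, keep_genes]

print(cells_per_gene)   # [6 1 5 0]
print(filtered.shape)   # (6, 2)
```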

In [None]:
data = scprep.filter.filter_rare_genes(data, min_cells=10)
data.shape

#### Square root transform

In [None]:
data = scprep.transform.sqrt(data)

## 4. PCA

This dataset consists of many cell types, most of which were identified as Amacrine cells, Muller Glia, Rod Bipolar cells, and many subtypes of Cone Bipolar cells in [Shekhar et al., 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5003425/).

#### Separating cell types by selecting appropriate plotting features

In [None]:
clusters['CELLTYPE'].unique()

First, let's try to separate out the Muller Glia cells from the rest of the dataset using a couple of known marker genes.

In [None]:
scprep.plot.scatter(data['Apoe'], data['Glul'], c=clusters['CELLTYPE'],
                    figsize=(10,4), legend_anchor=(1,1))

Notice that the Muller Glia cells are mostly separate from the rest, except for a smattering of cells labelled '-1'. These cells were not assigned a cluster in the original study, so let's see what the plot looks like without them.

In [None]:
scprep.plot.scatter(data['Apoe'], data['Glul'], c=clusters['CELLTYPE'],
                    mask=clusters['CELLTYPE'] != '-1',
                    figsize=(10,4), legend_anchor=(1,1))

Okay, so the Muller Glia cells are relatively easy to identify using this combination of genes. But how should we choose such combinations of genes? With 20,000 to choose from, it's no easy feat. This is where PCA comes in.

#### Computing PCA quickly

There's a faster way to do PCA, and fortunately it's already implemented for us in `scikit-learn` and `scprep`.
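One reason library implementations can be faster is that they typically work from a singular value decomposition (SVD) of the centered data rather than building the full covariance matrix first. The sketch below, on small synthetic data with assumed variable names, checks that the two routes give the same principal directions and explained variances.

```python
import numpy as np

# Synthetic data with clearly distinct variances per direction
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.2])
centered = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Route 2: singular value decomposition of the centered data
U, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
svd_variances = singular_values ** 2 / (X.shape[0] - 1)

# Same explained variances
same_variances = np.allclose(eigenvalues, svd_variances)
# Same principal directions, up to an arbitrary sign flip per component
same_directions = np.allclose(np.abs(eigenvectors.T @ Vt.T), np.eye(6),
                              atol=1e-6)
print(same_variances, same_directions)
```

Because the sign of each component is arbitrary, a PCA plot may appear mirrored between implementations without meaning anything has changed.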

In [None]:
# first, we'll filter out those unlabeled cells
data, clusters = scprep.select.select_rows(data, clusters, idx=clusters['CELLTYPE'] != '-1')

import sklearn.decomposition
pca_op = sklearn.decomposition.PCA(n_components=100) # we could also do scprep.reduce.pca(data, 100)
data_pca = pca_op.fit_transform(scprep.utils.toarray(data))
data_pca

Note that since we used `sklearn` here, `data_pca` is a numpy array, not a DataFrame. We could have avoided this conversion by using `scprep.reduce.pca`, but `sklearn` has some additional functionality that we will use later.
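If you would like to keep the cell names alongside the `sklearn` output, one option is to wrap the array in a DataFrame yourself. The sketch below uses a small synthetic matrix as a stand-in for the expression data; the names `toy`, `toy_pca`, and `toy_pca_df` are hypothetical.

```python
import numpy as np
import pandas as pd
import sklearn.decomposition

# Small synthetic stand-in for the expression matrix (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(20, 8)),
                   index=["cell{}".format(i) for i in range(20)],
                   columns=["gene{}".format(j) for j in range(8)])

# With sklearn's default settings the result is a plain numpy array,
# so the cell names are lost
toy_pca = sklearn.decomposition.PCA(n_components=3).fit_transform(toy)
print(type(toy_pca))

# Re-attach the cell names and give the components readable column names
toy_pca_df = pd.DataFrame(
    toy_pca, index=toy.index,
    columns=["PC{}".format(i + 1) for i in range(toy_pca.shape[1])])
print(toy_pca_df.head())
```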

#### Examining the first two principal components

Now that we have computed the PCA, we can plot the first two directions to see how well our glial cells separate.

In [None]:
scprep.plot.scatter2d(data_pca, c=clusters['CELLTYPE'], figsize=(10,4),
                      ticks=False, label_prefix='PC', legend_anchor=(1,1))

Wow, look at that! The glial cells separate perfectly from the Rod Bipolar cells (lime green) and the Cone Bipolar cells (most everything else).

#### _Breakpoint_  - once you get here, please help those around you!

### Exercise 2 - Examining principal components

Each principal component can be thought of as representing some latent state in the data. For example, we see that the first component largely separates glia from bipolar cells, and the second separates rod bipolar cells from cone bipolar cells. Now it's your turn - pick a cell type and try to find the best principal component to separate it from the rest of the cells.
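Besides looking at plots, you could also rank the components numerically. The sketch below shows one possible heuristic, which is an assumption of this example rather than part of the original workflow: for each component, measure the difference between the mean of your cell type and the mean of all other cells, in units of pooled standard deviation. It runs on a synthetic embedding; `separation_score` is a hypothetical helper.

```python
import numpy as np

# Synthetic embedding: 300 cells x 5 components, where the cells of
# interest are shifted along the third component only (illustrative)
rng = np.random.default_rng(0)
embedding = rng.normal(size=(300, 5))
is_my_type = np.arange(300) < 60
embedding[is_my_type, 2] += 6.0

def separation_score(values, mask):
    """Absolute difference in group means, in units of pooled std."""
    inside, outside = values[mask], values[~mask]
    pooled_std = np.sqrt((inside.var(ddof=1) + outside.var(ddof=1)) / 2)
    return np.abs(inside.mean() - outside.mean()) / pooled_std

scores = np.array([separation_score(embedding[:, i], is_my_type)
                   for i in range(embedding.shape[1])])
best_pc = int(np.argmax(scores)) + 1  # 1-based, to match PC numbering
print("Best-separating component: PC{}".format(best_pc))
```

In this notebook the same function could be applied to the columns of `data_pca` with the mask `clusters['CELLTYPE'] == my_celltype`. A score like this is only a starting point, so it is still worth plotting the components it suggests.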

In [None]:
print(clusters['CELLTYPE'].unique())

In [None]:
# ================
# pick any named cell type
my_celltype =
# pick a principal component (a number >=1) to put on the x axis
x_pc =
# pick a principal component to put on the y axis
y_pc =
# ===============
scprep.plot.scatter(data_pca[:,x_pc-1], data_pca[:,y_pc-1], c=clusters['CELLTYPE'] == my_celltype,
                   ticks=False, xlabel='PC{}'.format(x_pc), ylabel='PC{}'.format(y_pc))

#### Examining loadings associated with principal components

Each principal component is a linear combination of the original features, so we can use the coefficients of these principal directions (called "loadings") to understand which features are driving the separation. We'll do it here for the first component; the second is left as an exercise.
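To see what the loadings mean in a setting where we know the answer, here is a minimal sketch on synthetic data (the data, the seed, and the variable names are assumptions). Two features are built to carry most of the variance, and the loadings of the first component recover them.

```python
import numpy as np
import sklearn.decomposition

# Synthetic data: features 0 and 1 vary together (in opposite directions)
# and carry most of the variance; the rest is low-variance noise
rng = np.random.default_rng(0)
signal = rng.normal(scale=5.0, size=(200, 1))
X = rng.normal(scale=0.5, size=(200, 5))
X[:, 0] += signal[:, 0]
X[:, 1] -= signal[:, 0]

pca = sklearn.decomposition.PCA(n_components=2)
scores = pca.fit_transform(X)

# Each row of components_ holds the loadings of one principal component
loadings = pca.components_
print(loadings.shape)  # (n_components, n_features)

# The features with the largest absolute loadings drive the component
top_features = np.argsort(np.abs(loadings[0]))[::-1][:2]
print(sorted(top_features.tolist()))

# The PCA coordinates are the centered data times the loadings
manual_scores = (X - X.mean(axis=0)) @ loadings.T
print(np.allclose(scores, manual_scores))
```

Taking the absolute value matters: features 0 and 1 have loadings of opposite sign on the first component, but both contribute strongly to it.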

In [None]:
pc_loadings = pd.DataFrame(pca_op.components_, columns=data.columns)
pc_loadings.head()

In [None]:
# find the top genes associated with PC1
top_genes = np.abs(pc_loadings.loc[0]).sort_values(ascending=False)
top_genes.head(20)

In [None]:
scprep.plot.scatter(x=data_pca[:,0], y=data_pca[:,1], c=data['Apoe'])

### Exercise 3 - find the top genes associated with PC2 and plot some of them on the PCA

In [None]:
# ==================
# find the top genes associated with PC2
top_genes =
# ==================
top_genes.head()

In [None]:
# ==================
# plot the result with scprep
scprep.plot.scatter(x=
                    y=
                    c=
# ==================

#### _Breakpoint_  - once you get here, please help those around you!

### Exercise 4 - identify cell type markers with PCA

In [None]:
# =============
# examine the loadings of the principal component(s) that you used to identify your 
# cell type of choice and color the PCA plot by the top genes

# =============

## 5. Save data for later

We'll save the preprocessed data file for later use.

In [None]:
data.to_pickle("shekhar_data.pkl")

clusters.to_pickle("shekhar_clusters.pkl")