In this section, we're going to go over a few introductory techniques for visualizing and exploring a single cell dataset. This is an essential analysis step, and will tell us a lot about the nature of the data we're working with. We'll figure out things like:

* If the data lies along a trajectory, falls into clusters, or is a mix of both
* How many kinds of cells are likely present in a dataset
* If there are batch effects between samples
* If there are technical artifacts remaining after preprocessing

We're going to use two main tools for this analysis: PCA and PHATE. PCA is useful because it's quick and serves as a preliminary readout of what's going on in a sample. However, PCA has many limitations as a visualization method because it can only recover linear combinations of genes. To get a better sense of the underlying structure of our dataset, we'll use PHATE.

## 1.0 What is a visualization?

Before we get too deep into showing a bunch of plots, I want to spend a little time discussing visualizations. Skip ahead if you want, but I think it's important to understand what a visualization is, and what you can or cannot get from it.

#### A visualization is a reduction of dimensions

When we talk about data, we often consider the number of observations and the number of dimensions. In single cell RNA-seq, the number of observations is the number of cells in a dataset. In other words, this is the number of rows. The number of dimensions, or number of features, is the number of genes. These are the columns in a gene expression matrix.
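As a toy illustration of this layout, the sketch below builds a small synthetic count matrix with cells as rows and genes as columns. The sizes, the Poisson-distributed counts, and the variable names are all made up for the example; this is not the Datlinger data.

```python
import numpy as np

# Hypothetical sizes, chosen only to keep the example small
n_cells, n_genes = 500, 2000

rng = np.random.default_rng(0)
# Synthetic counts standing in for a real gene expression matrix
counts = rng.poisson(lam=1.0, size=(n_cells, n_genes))

# Rows are observations (cells), columns are dimensions (genes)
n_observations, n_dimensions = counts.shape
print("observations (cells):", n_observations)
print("dimensions (genes):", n_dimensions)
```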

In a typical experiment you might have 15,000-30,000 genes measured across 5,000-100,000 cells. This presents a problem: How do you visually inspect such a dataset? The key is to figure out how to draw the relationships between points on a 2-dimensional sheet of paper. If you add linear perspective, you can squeeze in a third dimension.

A visualization is simply a way of going from 30,000 dimensions down to 2 or 3.


#### Heatmaps allow you to look at all genes across all cells simultaneously

One way to do this is with a heatmap. Here I've created a clustered heatmap from the Datlinger data using `seaborn.clustermap`:

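```python
import seaborn as sns

# t_cell_data is assumed to be the cells x genes expression matrix
# Cluster rows and columns; hide tick labels, which are unreadable at this scale
cg = sns.clustermap(t_cell_data, cmap='inferno', xticklabels=False, yticklabels=False)
cg.ax_heatmap.set_xlabel('Genes ({})'.format(t_cell_data.shape[1]))
cg.ax_heatmap.set_ylabel('Cells ({})'.format(t_cell_data.shape[0]))
```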

{{< figure src="/img/how_to_single_cell/datlinger_heatmap.png" class="img-lg">}}

It's hard to draw any conclusions from this. How close together are any two cells? How do genes covary? We get some sense of this, and we are getting to look at all genes across all cells, but this representation of the data hinders hypothesis generation.

#### Biplots show gene-gene relationships

Another natural presentation is the biplot, commonly used for FACS analysis. Here each axis represents the expression of one of two genes and each dot is a cell. Let's look at a biplot for some genes from the Datlinger dataset.

As you can see, gene-gene relationships are much easier to identify, but the plot is already complex when we look at only a handful of genes. Now consider that a 25,000-gene genome has roughly 312 million pairwise combinations of genes.

{{< figure src="/img/how_to_single_cell/datlinger_pairplot.png" class="img-xl">}}
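The pair count quoted above is quick to check: it is the number of unordered pairs you can draw from 25,000 genes. A minimal sketch using only the standard library (the variable names are mine):

```python
import math

n_genes = 25_000

# Number of unordered gene pairs: n * (n - 1) / 2
n_pairs = math.comb(n_genes, 2)
print(f"{n_pairs:,}")  # 312,487,500
```

Even at one biplot per second, paging through all of them would take close to ten years.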

We need a better solution.

#### Why can we reduce dimensions?

In biological systems, we know that some genes are related to each other. These relationships are complex and nonlinear, but we do know that not all possible combinations of gene expression are valid.

{{< figure src="/img/how_to_single_cell/ambient_latent_dim.png" class="img-md">}}


On the left, points are uniformly distributed in the ambient 3-dimensional space. On the right, the points are randomly distributed on a 1-dimensional line that rolls in on itself. If we could unroll this line on the right, we would only need one or two dimensions to visualize it.
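To get a feel for the difference, here is a rough sketch that generates two point clouds like the ones described. The spiral shape, the sample size, and the variable names are my own choices for illustration, not the code behind the figure.

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 1000

# Left-panel analogue: points fill the whole 3-D cube, so each point
# needs three independent coordinates to describe it.
uniform_cloud = rng.uniform(-1, 1, size=(n_points, 3))

# Right-panel analogue: one latent coordinate t places each point on a
# spiral curve that happens to be embedded in 3-D.
t = rng.uniform(0, 4 * np.pi, size=n_points)
radius = t / (4 * np.pi)
rolled_line = np.column_stack([
    radius * np.cos(t),
    radius * np.sin(t),
    radius,
])

# Both arrays have three columns, but the second is fully determined by t.
print(uniform_cloud.shape, rolled_line.shape)
```

Both arrays are stored with three columns, yet every row of `rolled_line` can be rebuilt from the single number `t`. "Unrolling" the curve amounts to recovering that one coordinate.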

#### How can we reduce dimensions?

There are many, many ways to visualize data. The most common ones are PCA, t-SNE, and MDS. Each of these has its own assumptions and simplifications that it uses to figure out an optimal 2D representation of high-dimensional data.

PCA identifies linear combinations of genes, called principal components, such that each successive component explains the maximum remaining variance while staying orthogonal to the previous ones. t-SNE is a non-convex optimization algorithm that tries to minimize the divergence between the neighborhood similarities of points (how strongly points that are "close" are related) in the original data space and in the low-dimensional representation.
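To make the PCA description concrete, here is a minimal sketch that computes principal components with a plain SVD in NumPy. The data is synthetic (one hidden factor plus noise, an assumption made purely for illustration), and in practice you would more likely reach for a library implementation. t-SNE is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expression" matrix: 200 cells x 50 genes, where one hidden
# factor drives most of the variation (illustrative assumption).
n_cells, n_genes = 200, 50
hidden_factor = rng.normal(size=(n_cells, 1))
loadings = rng.normal(size=(1, n_genes))
data = 3.0 * hidden_factor @ loadings + rng.normal(size=(n_cells, n_genes))

# PCA: center each gene, then take the SVD of the centered matrix.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

components = Vt[:2]                  # each row is a linear combination of genes
embedding = centered @ components.T  # 2-D coordinates for every cell
explained = S**2 / np.sum(S**2)      # fraction of total variance per component

print(embedding.shape)
print(explained[:2])
```

Each row of `components` holds one weight per gene, which is exactly the "linear combination of genes" mentioned above. Because the components are sorted by singular value, the first one captures the most variance, the second the most of what remains, and so on.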

There are thousands of dimensionality reduction algorithms out there, and it's important to understand the drawbacks and benefits of each.
