### How to read this notebook

_Parts of this notebook require an active Python server, which is why we recommend it be viewed in_ [Binder](https://mybinder.org/v2/gh/KrishnaswamyLab/visualization_selection/master?filepath=main_tex.ipynb). _You can also run it on your own local Jupyter server._

_To advance through the notebook, press Shift+Return to run the code in each cell. Some plots will be missing if you read the text without running the code._

In [2]:
import warnings
warnings.simplefilter("ignore")
import matplotlib.pyplot as plt
import ipywidgets as widgets
import scprep
from blog_tools import data, embed, interact
%matplotlib inline



# Introduction 

#### What is good visualization?

Although humans can only interpret data in three dimensions, real-world data often exists in much higher dimensions, which makes understanding the structure of such data difficult. As a result, exploring high-dimensional data typically involves some form of dimensionality reduction to two or three dimensions for visualization. The goal of this technique is to gain insight into the structure of the data, specifically the relationships between points. From these observations, one can generate hypotheses about the dataset. In this way, visualization is an important tool for narrowing the scope of investigation and aids in the selection of tools for future analysis. The importance of this problem is reflected in the many visualization tools that have been developed for high-dimensional data. To name a few, we have: [Principal Components Analysis (PCA)](https://www.sciencedirect.com/science/article/pii/0169743987800849), [Multidimensional Scaling (MDS)](https://link.springer.com/article/10.1007/BF02289565), [Diffusion Maps](https://www.sciencedirect.com/science/article/pii/S1063520306000546), [Isometric Mapping (ISOMAP)](https://science.sciencemag.org/content/290/5500/2319.abstract), [Laplacian Eigenmaps](https://www.semanticscholar.org/paper/Laplacian-Eigenmaps-for-Dimensionality-Reduction-Belkin-Niyogi/0ce8879ea7fc0e96fd0e4e242a46002010f86e18), [Locally Linear Embedding (LLE)](https://science.sciencemag.org/content/290/5500/2323.full), [t-distributed Stochastic Neighbor Embedding (t-SNE)](http://www.jmlr.org/papers/v9/vandermaaten08a.html), [Uniform Manifold Approximation and Projection (UMAP)](https://arxiv.org/abs/1802.03426), and [Potential of Heat-diffusion for Affinity-based Transition Embedding (PHATE)](https://www.biorxiv.org/content/10.1101/120378v4). However, along with a diversity of tools comes a diversity of biases and assumptions that these methods introduce, and the dreaded problem of selecting the right tool for the job.

Take, for example, t-SNE. At the time of writing, [Visualizing Data using t-SNE (2008)](http://www.jmlr.org/papers/v9/vandermaaten08a.html) has over 8,400 citations. In some fields, like computational biology, it's difficult to find a published paper without a t-SNE visualization. However, t-SNE has a unique set of biases that lend it to misinterpretation; these limitations are explored in the Distill article [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/). Through a set of key examples, the authors show how parameter selection and intrinsic biases in the t-SNE algorithm can lead to misleading visualizations. This article has become a widely referenced teaching tool for the rational application of the method. 

Here, we seek to take the question of *How to Use t-SNE Effectively* and go one step further: How can you select among the many visualization tools effectively? How can you judge the results of each method? What benchmarks should one employ when considering different methods? Which methods are best suited for different data modalities? The remainder of this article will be divided into three sections. 

#### Structure of the article

1\. [Selecting toy datasets for comparing methods](#selecting-toy-datasets-for-comparing-methods)

First, we will talk about different kinds of data structures and introduce a few toy datasets that we will use to compare different visualization methods. Next, we will use these datasets to discuss the algorithms behind six popular dimensionality reduction methods: PCA, MDS, ISOMAP, t-SNE, UMAP, and PHATE.

2\. [Parameter selection](#parameter-selection)

We will then discuss sensitivity to parameter choices and methods for quantifying the accuracy of a visualization method.

**TODO: quantification?**

3\. [Real world data](#real-world-data)

Finally, we will apply our six comparison methods to large, real-world datasets. The emphasis of this section will be how a particular visualization tool biases the observer towards a set of conclusions about a given dataset.

#### Dimensionality reduction vs. visualization

Before we go further, it will be useful to clarify the distinction between dimensionality reduction algorithms and visualization algorithms. In fact, visualization algorithms are a subset of dimensionality reduction techniques. The important distinction here is that some dimensionality reduction algorithms are useful for reducing the computational complexity of machine learning tasks, but are not good for creating interpretable two- or three-dimensional visualizations. Take, for example, dimensionality reduction by random orthogonal projection, often used in [approximate nearest-neighbors search](https://dl.acm.org/citation.cfm?id=276877). This method projects the data onto a set of $n$ random orthogonal vectors in the original $d$-dimensional data space, with $n \ll d$. This approach preserves most of the structure of the data so that statistics (such as pairwise distances) can be calculated more quickly. However, a random projection spreads information evenly across all $n$ dimensions and will not produce features useful for visualizing the relationships between data points. To count as a visualization tool, an algorithm must prioritize information preservation in two or three dimensions.
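
To make the random projection idea concrete, here is a minimal sketch in plain Python. The helper names (`random_orthonormal_basis`, `project`) and the sizes (`d = 500`, `n = 50`, 20 points) are assumptions chosen for illustration and are not taken from the cited work. The sketch builds $n$ random orthonormal vectors with Gram-Schmidt, projects random data onto them, and compares pairwise distances before and after.

```python
import math
import random


def random_orthonormal_basis(n, d, rng):
    """Return n orthonormal vectors in R^d via Gram-Schmidt on Gaussian draws."""
    basis = []
    while len(basis) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along the vectors already chosen.
        for q in basis:
            dot = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - dot * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-12:  # practically always true for Gaussian draws
            basis.append([vi / norm for vi in v])
    return basis


def project(points, basis):
    """Coordinates of each point along each basis vector."""
    return [[sum(p * q for p, q in zip(point, vec)) for vec in basis]
            for point in points]


rng = random.Random(0)
d, n, n_points = 500, 50, 20  # illustrative sizes with n much smaller than d
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_points)]
Q = random_orthonormal_basis(n, d, rng)
Y = project(X, Q)

# Projecting onto n of d directions shrinks distances by roughly sqrt(n / d),
# so rescale before comparing projected distances with the originals.
scale = math.sqrt(d / n)
ratios = [
    scale * math.dist(Y[i], Y[j]) / math.dist(X[i], X[j])
    for i in range(n_points) for j in range(i + 1, n_points)
]
mean_ratio = sum(ratios) / len(ratios)
print(f"mean rescaled distance ratio: {mean_ratio:.3f}")
```

A mean ratio close to 1 indicates that pairwise distances survive the projection up to a constant factor. No single pair of the 50 output coordinates is special, though, which is why plotting any two of them is not a useful visualization.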

## Selecting toy datasets for comparing methods

Rigorous application of any computational technique starts with considering the desired application and how well a given method's biases fit that application. Generally, a good place to start is by creating *toy data*. A good toy dataset is small, easy to understand intuitively, and has a clear heuristic for a successful visualization. In *How to Use t-SNE Effectively*, the authors present several compelling toy datasets. Here, we will consider a few of these along with some extras: a graph, a small collection of handwritten 7's, and [2000 pictures of Brendan Frey's face](https://cs.nyu.edu/~roweis/data/frey_rawface.jpg).

![Ground truth images of six datasets](img/ground_truth.png)

**TODO: The tree image actually has another branch now. See the PHATE image below**

We selected these datasets because of their varying structures and modalities.

* **Swiss roll:** This dataset is adapted from `sklearn`'s `datasets.make_swiss_roll()` function. The data exists in three dimensions with one latent dimension: the distance of each point from the center. This is represented by the color of the points in the above plots. The first two dimensions are generated using sines and cosines of this latent dimension and the third dimension is generated as uniform random noise. In this dataset, the noise dimension is slightly larger than the two shown above. As we will see, this proves problematic for some algorithms. The "ideal" visualization of the Swiss roll is a single line that represents the latent dimension of that dataset.

  
* **Three blobs:** The three blobs here are Gaussian clouds that were first generated in just two dimensions and then linearly projected into 200 dimensions using random Gaussian noise. In the original space, the distance between the orange and blue blobs is 1/5th the distance between the orange and green blobs. The goal here is to preserve the relative distances between the blobs while rotating the data back into two dimensions.

  
* **Uneven circle:** Here, the data points are distributed along a circle in 20 dimensions, but the spacing between the points is irregular. The color of each point represents its angular position.


* **Digits:** This is the dataset returned by the `sklearn.datasets.load_digits()` function. According to the User Guide, the data is a copy of the test set of [the UCI ML hand-written digits dataset](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). We're just using the 179 sevens. Each image is 8x8 pixels. Because the digits are handwritten, there are natural variations between them. For example, some images have a cross-bar and others do not.

  
* **Frey faces:** This dataset consists of 200 images of Brendan Frey's face. In this dataset, Frey is making a series of facial expressions and many intermediate expressions are available between images. Here, we should be able to identify some continuous progressions in facial expression. For example, we should be able to see a smooth transition between a neutral face and a smile or frown.

  
* **Tree:** For our final dataset, we look at a collection of points arranged in a tree, generated using the tool [**Splatter**](https://bioconductor.org/packages/release/bioc/html/splatter.html). This data emulates smooth transitions between various states of biological systems. All of the branches of the tree are of equal length, but the points are not spaced evenly along the branches. Furthermore, some of the features change non-linearly along branches. The goal here is to recreate the topology of the six branches shown above.

  

With these toy datasets in hand, we can begin to compare methods.
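
As a rough illustration of how such toy data can be built, the sketch below generates a Swiss roll in plain Python. It follows the general recipe of scikit-learn's `make_swiss_roll` (a latent value rolled up with sine and cosine, plus a uniform noise axis). The function name, the `noise_width` parameter, and the constants are assumptions for illustration, not the settings used to produce the figures in this notebook.

```python
import math
import random


def make_swiss_roll(n_points, noise_width=21.0, seed=0):
    """Sketch of a Swiss roll: one latent dimension rolled up in 3D.

    The constants are illustrative assumptions, not this notebook's settings.
    """
    rng = random.Random(seed)
    points, latent = [], []
    for _ in range(n_points):
        # Latent dimension: the distance of the point from the center.
        t = 1.5 * math.pi * (1.0 + 2.0 * rng.random())
        x = t * math.cos(t)
        z = t * math.sin(t)
        # Uniform noise axis, unrelated to the latent dimension.
        y = noise_width * rng.random()
        points.append((x, y, z))
        latent.append(t)
    return points, latent


points, latent = make_swiss_roll(500)
print(points[0], latent[0])
```

The value `latent[i]` plays the role of the color in the ground-truth plots: a good embedding of `points` should order the data along this single dimension.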

### Introducing... _the algorithms_!

For this blog, we wanted to focus on a mix of classic and popular algorithms. Not many people use PCA for rigorous visualization of high-dimensional data (looking at you, [population geneticists](https://blog.insito.me/why-pca-and-genetics-are-a-match-made-in-heaven-6042ea027cf0)), but it is a common tool for preliminary analysis. Similarly, MDS, which is actually a collection of methods, is not as frequently applied for data analysis but serves as a foundation for non-linear dimensionality reduction. On the other hand, PHATE is a relatively new method for visualizing high dimensional data that we chose to include because we developed it.

In this section, we will introduce the algorithms one by one, give a brief overview of how each algorithm works, then show the results of that algorithm on our test cases.

_Run the code below to display the discussion._

In [1]:
algorithms = embed.__all__
tab_widget = interact.TabWidget()
for algorithm in algorithms:
    name = algorithm.__name__
    with tab_widget.new_tab(name):
        interact.display_markdown("md/discussion-{}.md".format(name),
                                  placeholder='Introduction to {}'.format(name))

tab_widget.display()


## Parameter selection

When running a visualization tool, the user is inevitably faced with the daunting choice of various parameters. Some parameters are robust and barely affect the result of the visualization; others drastically affect the results and need to be tuned with care. Here we use the Swiss Roll dataset presented above to give an introduction to the influence that parameter tuning has on a visualization.

In order to keep things relatively simple, we have selected a maximum of two parameters for each algorithm: the random seed, and the parameter which most strongly affects the result. Some algorithms have many more parameters, and it is worthwhile investigating these in more detail for the visualization algorithm of your choice.

On the left, you see a 3D projection of the Swiss Roll. Move the sliders below the plot to adjust the parameters for the visualization on the right. _If the points don't appear on your screen, try moving your mouse over the plot._

In [5]:
import pickle
with open("data/parameter_search.pickle", 'rb') as handle:
    results = pickle.load(handle)
dataset = data.swissroll()
tab_widget = interact.TabWidget()
for algorithm in results.keys():
    with tab_widget.new_tab(algorithm):
        interact.parameter_plot(algorithm, results, dataset.X_true, c=dataset.c)

tab_widget.display()


Feel free to explore this visualization and draw your own conclusions. Here are a few important points:

#### Random seed

Ideally, if you visualize a dataset with the same parameters, your visualization should be the same every time. Unfortunately, since many methods use randomization for speed improvements, the random seed (initialization for Python's pseudo-random number generator) can impact the result. All of the methods shown here accept a random seed as a parameter, but as you can see, it has a highly variable effect. 

Firstly, **PCA, ISOMAP and PHATE** show no noticeable change when you adjust the random seed slider. **MDS** shrinks and warps ever so slightly, but the overall structure of the visualization does not change. On the other hand, **UMAP and t-SNE** significantly change the layout of the visualization, relationships between groups, and some points even totally change position on the plot, depending on the choice of other parameters.
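
The mechanism behind this is simple to sketch. Methods that rely on random initialization or random sampling start from different numbers whenever the seed changes. The helper `random_init` below is hypothetical and is not part of any of the libraries used here; it only shows that a fixed seed makes the starting point reproducible.

```python
import random


def random_init(n_points, seed):
    """Hypothetical helper: random 2D starting positions for an iterative layout."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1e-4), rng.gauss(0.0, 1e-4)) for _ in range(n_points)]


same_a = random_init(5, seed=42)
same_b = random_init(5, seed=42)
other = random_init(5, seed=7)

print(same_a == same_b)  # identical seeds give identical starting points
print(same_a == other)   # a different seed gives a different start
```

Whether a different start leads to a visibly different final plot depends on the method, which is what the sliders above let you explore.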

#### Neighborhood size

Now consider the other parameters we have shown: `knn` (also known as `n_neighbors`, used in ISOMAP, UMAP and PHATE) and `perplexity` (used in t-SNE). All of these parameters modify the connectivity of the graph underlying the visualization. **TODO: discuss difference between knn and perplexity?**
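
One way to relate the two kinds of parameter: in the t-SNE paper, perplexity is defined as $2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy (in bits) of the neighbor distribution around point $i$. It can therefore be read as a smooth count of effective neighbors. The sketch below, with made-up probability vectors, shows that a uniform distribution over $k$ neighbors has perplexity $k$.

```python
import math


def perplexity(probabilities):
    """Perplexity 2**H(P), where H is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy


uniform_over_5 = [0.2] * 5
peaked = [0.7, 0.1, 0.1, 0.05, 0.05]

print(perplexity(uniform_over_5))  # approximately 5: five equally weighted neighbors
print(perplexity(peaked))          # below 5: weight concentrated on fewer neighbors
```

A hard `knn` cutoff treats the nearest `k` points equally and ignores the rest, whereas a perplexity target lets the weights fall off smoothly with distance.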

If connectivity is too low, the graph starts to become disconnected. This is very obvious in **t-SNE and UMAP** where the plot becomes an apparently random point cloud with `knn/perplexity` set to 2. A perhaps more interesting case can be observed in **ISOMAP** in the transition between `knn = 5`, which gives a perfectly reasonable embedding of the Swiss Roll as a plane, and `knn = 3`, when the top half of the plane becomes disconnected, leaving the rest to warp and distort in unexpected (and undesired) ways.

On the other hand, if connectivity is too high, we can start to incorporate long-range Euclidean distances, which tend not to be so meaningful. In **ISOMAP, t-SNE and PHATE** in particular, we observe that as connectivity becomes extremely high, the start and end of the Swiss Roll become connected and overlap on the plot. This is once again undesirable, as these parts of the Swiss Roll were disconnected in the original data, but thanks to our extremely large neighborhoods we have connected them, giving a misleading visualization.
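
Both failure modes can be reproduced on a k-nearest-neighbor graph in a few lines. This is a sketch on made-up data, not the code behind the widget above: the point sets, the helper names, and the "more than half a turn apart" criterion for a short-circuit are all assumptions chosen for illustration.

```python
import math


def knn_edges(points, k):
    """Undirected edges joining each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        for j in order[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def n_components(n_points, edges):
    """Count connected components with a small union-find."""
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n_points)})


# Too few neighbors: points along a line with uneven gaps fall apart.
line = [(0.0,), (1.0,), (3.0,), (4.0,), (6.0,), (7.0,)]
for k in (1, 2):
    print(f"line, k={k}: {n_components(len(line), knn_edges(line, k))} component(s)")

# Too many neighbors: a 2D spiral (a Swiss roll seen edge-on) gets short-circuited.
n = 200
t = [1.5 * math.pi * (1 + 2 * i / (n - 1)) for i in range(n)]
spiral = [(ti * math.cos(ti), ti * math.sin(ti)) for ti in t]


def short_circuits(k):
    """Edges whose endpoints are more than half a turn apart along the roll."""
    return sum(1 for i, j in knn_edges(spiral, k) if abs(t[i] - t[j]) > math.pi)


for k in (5, 60):
    print(f"spiral, k={k}: {short_circuits(k)} short-circuit edge(s)")
```

With these values, the line splits into three components at `k = 1` and joins into one at `k = 2`, while the spiral has no short-circuit edges at `k = 5` and acquires them at `k = 60`, when neighborhoods become large enough to reach across to the adjacent arm.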

#### Other parameters

Most of the methods here have many more parameters than what we have shown. **TODO: if there is time, provide a page with more?** For example, if you look at the [scikit-learn documentation for t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), you will see 13 parameters that can be set on the estimator. It is worth investigating which of these influence the embedding, and understanding why you get the visualization you do with the parameters you have chosen.

## Real world data

#### The role of real data in comparing visualization algorithms

Toy data is useful because it provides insights into the kinds of idiosyncrasies that a given algorithm is sensitive to. However, it is also important that a dimensionality reduction algorithm performs well on real world data. Although you lose access to ground truth, it is still possible to compare algorithms.

Generally speaking, it is useful to first consider how well existing knowledge is preserved by a given embedding. Are groups of data points that are known to be closely related displayed proximally in the visualization? Can longer-range relationships be observed across the plot?

Once these properties have been confirmed, we usually will generate some hypothesis about the data and further examine trends within the data. This might mean gathering additional information about the data by integrating another outside dataset or performing another experiment. Although this process is challenging, the real utility of a visualization is how successful one can be by making hypotheses from the visualization and testing them using an independent method.

Here, we will describe three real world datasets and compare each visualization algorithm on each dataset. Again, we're going to use default parameters for every method.

### Analysis of population genetics data

#### The 1000 Genomes Project

The [1000 Genomes Project](http://www.internationalgenome.org/) is an international effort to understand genetic variation across human populations. Originating in 2007, the project completed in 2013, describing variation in roughly 3000 individuals from 26 populations across more than 80,000 genetic loci. The project is a high quality resource for understanding differences between people from different regions across the globe.

This data is useful for visualization because we have an intuitive understanding of geographic relationships between people, but the data is so high dimensional that understanding if these structures are preserved in the genetic space is infeasible.

In the 1000 Genomes data we also have access to detailed population information about each individual in the study. We expect that geographically proximal populations are genotypically similar and should therefore be grouped together in the visualization.

#### What does the data look like?

As mentioned above, the 1000 Genomes dataset contains genetic information about 3000 individuals. The data we will embed is called a genotype matrix. The rows of this matrix correspond to each individual in the study, and each column corresponds to a [Single Nucleotide Polymorphism (SNP)](https://ghr.nlm.nih.gov/primer/genomicresearch/snp). Each entry in the matrix is `0` if the individual is homozygous for the reference allele, `1` if the individual is heterozygous for the reference and alternative alleles, and `2` if the individual is homozygous for the alternative allele. If these terms are unfamiliar to you, we suggest reading the NIH primer on SNPs. All that's really important to know is that each entry in the matrix indicates a specific DNA element of each individual in the dataset.
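
The encoding amounts to counting alternative alleles. The toy example below uses made-up alleles for three individuals and two SNPs; the helper `encode_genotype` is hypothetical and is not part of the preprocessing scripts mentioned in this section.

```python
def encode_genotype(alleles, reference):
    """Count alternative alleles: 0 = homozygous reference, 1 = heterozygous,
    2 = homozygous alternative."""
    return sum(1 for allele in alleles if allele != reference)


# Hypothetical toy data: two SNPs (reference alleles "A" and "C"), three individuals.
reference_alleles = ["A", "C"]
individuals = [
    [("A", "A"), ("C", "T")],
    [("A", "G"), ("T", "T")],
    [("G", "G"), ("C", "C")],
]

# Rows are individuals, columns are SNPs.
genotype_matrix = [
    [encode_genotype(pair, ref) for pair, ref in zip(person, reference_alleles)]
    for person in individuals
]
print(genotype_matrix)  # [[0, 1], [1, 2], [2, 0]]
```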

Thankfully, Alex Diaz-Papkovich from Simon Gravel's lab at McGill provides [scripts to load and preprocess this data on GitHub](https://github.com/diazale/1KGP_dimred). All we need to do is run the output of his scripts through the algorithms for our comparison.



#### Comparing visualization algorithms on the 1000 Genomes data

![1000 - Genomes comparison](./img/1000_genomes.comparison.2x3.png)

This data is complex with many different populations spanning six continents. Generally speaking, blue corresponds to East Asian, green is South Asian, yellow/orange is African, red represents European, and pink is Ad Mixed American. The exact breakdown of each population code is given by the following table.

| Population Code  | Population Description                                             | Super Population  | 
|------------------|--------------------------------------------------------------------|------------------------| 
| CHB              | Han Chinese in Beijing, China                                      | East Asian             | 
| JPT              | Japanese in Tokyo, Japan                                           | East Asian             | 
| CHS              | Southern Han Chinese                                               | East Asian             | 
| CDX              | Chinese Dai in Xishuangbanna, China                                | East Asian             | 
| KHV              | Kinh in Ho Chi Minh City, Vietnam                                  | East Asian             | 
| CEU              | Utah Residents (CEPH) with Northern and Western European Ancestry  | European               | 
| TSI              | Toscani in Italia                                                  | European               | 
| FIN              | Finnish in Finland                                                 | European               | 
| GBR              | British in England and Scotland                                    | European               | 
| IBS              | Iberian Population in Spain                                        | European               | 
| YRI              | Yoruba in Ibadan, Nigeria                                          | African                | 
| LWK              | Luhya in Webuye, Kenya                                             | African                | 
| GWD              | Gambian in Western Divisions in the Gambia                         | African                | 
| MSL              | Mende in Sierra Leone                                              | African                | 
| ESN              | Esan in Nigeria                                                    | African                | 
| ASW              | Americans of African Ancestry in SW USA                            | African                | 
| ACB              | African Caribbeans in Barbados                                     | African                | 
| MXL              | Mexican Ancestry from Los Angeles USA                              | Ad Mixed American      | 
| PUR              | Puerto Ricans from Puerto Rico                                     | Ad Mixed American      | 
| CLM              | Colombians from Medellin, Colombia                                 | Ad Mixed American      | 
| PEL              | Peruvians from Lima, Peru                                          | Ad Mixed American      | 
| GIH              | Gujarati Indian from Houston, Texas                                | South Asian            | 
| PJL              | Punjabi from Lahore, Pakistan                                      | South Asian            | 
| BEB              | Bengali from Bangladesh                                            | South Asian            | 
| STU              | Sri Lankan Tamil from the UK                                       | South Asian            | 
| ITU              | Indian Telugu from the UK                                          | South Asian            | 


#### What do we see?

**PCA** is the most common method for visualizing population genetics information, but limitations in this method mean that as larger datasets become available, researchers are often searching for other tools ([Diaz-Papkovich et al. 2019](https://www.biorxiv.org/content/10.1101/423632v2)). Examining the output of PCA on this dataset, we can see clear separation between the African populations, denoted by the yellow/orange points, and the rest of the populations. Similarly, the East Asian population is separated from the rest. However, it is difficult to make statements about the American, South Asian, or European populations.

**MDS** has some artistic appeal here as the visualization resembles a globe. However, we have a problem here: the pink Ad Mixed American population is divided in two by the South Asian population. We see this as well with PCA, but this does not reconcile with our understanding of genetic variation between these populations. Generally speaking, human history following the migration out of Africa has been characterized by increasing genetic isolation with limited examples of population mixing ([Hellenthal et al. 2014](https://www.ncbi.nlm.nih.gov/pubmed/24531965/); [Norris et al. 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6288849/)).

**ISOMAP** performs poorly here. See how the South Asian and East Asian populations are completely reduced to a very small region of the plot. This poorly represents the diversity of those populations. Also, ISOMAP creates ridge-like artifacts in the embedding which don't occur in any other embedding, but are a common occurrence in ISOMAP.

**t-SNE** can be very sensitive to outliers. Here, many individuals have essentially no neighbors in the high dimensional space and therefore they get evenly distributed around the visualization. Not so useful. Also, the size of each cluster in t-SNE reflects the number of points, not the amount of variance. This hides something we know from both the PCA plot and prior knowledge (see [Campbell and Tishkoff (2008)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2953791/)): the genetic diversity in Africa is significantly greater than on all other continents, so we would expect the African population to take up more space on the plot, indicating its large amount of heterogeneity.

**UMAP** was shown to work well for population genetics data by [Diaz-Papkovich et al. (2019)](https://www.biorxiv.org/content/10.1101/423632v2), and this was actually the inspiration for this real-world data application. There, the authors used some parameter tuning to generate the plots in the manuscript, so their visualization is slightly nicer than what we see here. However, the game of this blog post is to use default parameters in all comparisons. We see here that the defaults for UMAP produce a visualization that is very compressed, making it hard to see fine-grained relationships between points.

**PHATE** does a good job here of presenting each population as a separate group of points on the plots. The default parameters also compress a lot of the variation into a few smooth trajectories. One of the places where PHATE shines here is when you consider the way that orderings of each principal component are preserved within each group of points.

#### Examining PCs along each visualization

![1000 - Genomes comparison](./img/1000_genomes.comparison.PC1.png)

The first principal component distinguishes African populations.

![1000 - Genomes comparison](./img/1000_genomes.comparison.PC2.png)



![1000 - Genomes comparison](./img/1000_genomes.comparison.PC3.png)

In the above plots, we're looking at the first three principal component loadings on each individual in the dataset. If you look at PC1 in PHATE vs UMAP, PHATE preserves the ordering of individuals by increasing PC1 loadings, whereas UMAP splits the data points with high PC1 coordinates across two different clusters. Examining PC2 and PC3, we see that the embeddings produced by MDS and t-SNE order points non-monotonically by their coordinate loadings. This is a little frustrating because we generally think of these coordinates as representing axes of diversity throughout each population.
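
The comparison above is visual. One possible way to put a number on "preserves the ordering" is a rank correlation between a principal component and a coordinate of the embedding. This is an assumption about how one might quantify it, not the method used to produce the plots, and the values below are made up.

```python
def ranks(values):
    """Rank of each value (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    mean = (n - 1) / 2
    cov = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    var = sum((x - mean) ** 2 for x in ra)  # same for rb when there are no ties
    return cov / var


# Hypothetical values for six individuals within one group of points.
pc1 = [-2.0, -1.1, -0.3, 0.4, 1.5, 2.2]
embedding_monotone = [0.1, 0.3, 0.35, 0.6, 0.9, 1.4]   # ordering preserved
embedding_scrambled = [0.1, 0.9, 0.3, 1.4, 0.35, 0.6]  # ordering broken

print(spearman(pc1, embedding_monotone))
print(spearman(pc1, embedding_scrambled))
```

A value near 1 (or -1) means the embedding coordinate orders the points the same way the principal component does; values near 0 mean the ordering is lost.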

### Analysis of single cell RNA-sequencing data


#### What is single cell RNA-sequencing?

Every so often a biological discipline encounters a new technology that makes previously unthinkably complicated experiments suddenly accessible. In the field of genomics, single-cell RNA-sequencing (scRNA-seq) is the latest technology to change how biologists approach fundamental questions like "What makes a neuron different from a heart cell from a skin cell?" If you've never heard of this technology before, all you need to know is that it makes it possible to measure which genes are activated in tens of thousands of individual cells. The output of a scRNA-seq experiment is referred to as a gene expression matrix. Here, the rows represent individual cells and each column is one of the 20K-30K genes in the genome of human or mouse or other experimental system. Each entry in this matrix represents the number of molecules of each gene that were detected in that cell. There's a lot of nuance and detail that we've glossed over here, but this expression counts matrix is the primary focus in scRNA-seq analysis. A more thorough discussion can be found in [Yuan *et al.* 2017](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1218-y) and [Luecken and Theis, 2019](https://www.embopress.org/doi/10.15252/msb.20188746).

Because scRNA-seq produces measurements over tens of thousands of dimensions, it certainly qualifies as high-dimensional. scRNA-seq is also noisy. Single cells contain a very small amount of RNA (think picograms), and current chemistries for capturing that RNA are inefficient. Of the roughly 100K-500K RNA molecules in each cell, only 5-20% are captured by current technologies. This means a lot of lowly expressed genes, which are often important for determining cell state, are counted as $0$ in the gene counts matrix despite being expressed in a given cell.
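
A small simulation shows why inefficient capture produces zeros. The sketch below assumes, for illustration only, that each RNA molecule is captured independently with a fixed probability; the gene counts and the helper `capture` are made up and do not model any specific chemistry.

```python
import random


def capture(true_counts, capture_rate, seed=0):
    """Keep each RNA molecule independently with probability capture_rate."""
    rng = random.Random(seed)
    return [sum(1 for _ in range(count) if rng.random() < capture_rate)
            for count in true_counts]


# Hypothetical true molecule counts for six genes in one cell.
true_counts = [500, 120, 30, 8, 3, 1]
observed = capture(true_counts, capture_rate=0.1)
print("true:    ", true_counts)
print("observed:", observed)
```

With a capture rate of 10%, a gene present at three copies is missed entirely about 73% of the time ($0.9^3 \approx 0.73$), so many lowly expressed genes show up as zeros even though they are expressed.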

Currently, t-SNE is the most popular method for visualizing single cell data. We recently attended Single Cell Genomics 2018 in Boston and it was difficult to find a poster or talk that didn't use t-SNE. However, as shown above, t-SNE does not do a good job of representing continuous changes within a dataset. We specifically developed PHATE as a solution to this problem. To show why it's important to preserve such structure in a large dataset, we are going to focus on a time course of human stem cell development.

#### Embryoid Body Timecourse

While developing PHATE, we partnered with Natalia Ivanova's laboratory at Yale University to create a detailed timecourse study of stem cell development *in vitro*. The experiment used human embryonic stem cells grown in 3-dimensional culture. The resulting cell cultures are called embryoid bodies or EBs because they recapitulate the spatial component of embryological development. However, the EBs are really just a disorganized clump of cells that express markers similar to those of the cells found in a real embryo (hence the -oid suffix).

The dataset comprises 16,820 cells measured over 27 days of EB culture. After filtering, we find that 17,445 genes are expressed at some point during the time course. For the following visualizations of gene expression, we imputed missing gene expression values using MAGIC, a method developed in our lab for denoising single cell gene expression ([van Dijk *et al.* 2018](https://www.ncbi.nlm.nih.gov/pubmed/29961576)). Each visualization was generated using default parameters for each algorithm run on the square-root transformed normalized raw data. All preprocessing steps can be found in [this tutorial](https://nbviewer.jupyter.org/github/KrishnaswamyLab/PHATE/blob/master/Python/tutorial/EmbryoidBody.ipynb). 
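
For readers unfamiliar with the transformation, here is a plain-Python sketch of what "square-root transformed normalized" data could look like. It assumes that normalization means scaling each cell to a common total count; the target of 10,000 and the helper name are arbitrary choices for illustration, and the linked tutorial remains the reference for the actual preprocessing.

```python
import math


def normalize_and_sqrt(counts, rescale=10_000):
    """Scale each cell (row) to the same total count, then take square roots."""
    transformed = []
    for cell in counts:
        total = sum(cell)
        factor = rescale / total if total > 0 else 0.0
        transformed.append([math.sqrt(value * factor) for value in cell])
    return transformed


# Hypothetical counts: two cells (rows) by three genes (columns).
# The second cell has the same composition but ten times the sequencing depth.
counts = [
    [1, 3, 0],
    [10, 30, 0],
]
for row in normalize_and_sqrt(counts):
    print([round(value, 2) for value in row])
```

After normalization the two cells become identical, so differences in sequencing depth no longer dominate distances between cells, and the square root dampens the influence of very highly expressed genes.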

#### Visualizations of the EB dataset

![EB - comparison](img/EmbryoidBody.comparison.png)

**PCA** is designed to identify linear combinations of genes that explain maximal variance in a dataset. Here we see that PCA does a good job of identifying the largest axis of variation, which is the time point of collection. However, PCA is limited to identifying linear combinations of features, and variation is calculated globally. Hence, the fine-grained local structure of the data is lost.

**MDS** fails completely on this dataset. In trying to preserve all pairwise distances, MDS has embedded the data in a ball with later timepoints on the outside and earlier timepoints on the inside. Not very useful for data analysis.

**ISOMAP** performs better than PCA and MDS because we can see some more structure in the intermediate 12-15 and 18-21 time points while preserving the general ordering of the data. In some regions of the data we do see overlapping cells from different time points, but this is not entirely inconsistent with the biology of the system. To better judge this embedding, we will need to examine expression of marker genes in the next section.

**t-SNE** fails completely on this dataset. This is likely because of the high degree of noise across the 17K dimensions in the data.

**UMAP** does little better than t-SNE. It is possible that both algorithms would perform better with some parameter tuning, but one of the goals of this analysis was to compare each method using default parameters.

**PHATE** both preserves the global ordering of cells from each sample and preserves fine-scale distinctions between subpopulations of cells. In the interests of full disclosure, the default parameters for PHATE were decided upon with a few datasets in mind including this one. However, the default parameter selection for PHATE did not include any other dataset in this article.

#### Examination of marker genes

**MARKER GENE INTERACTIVE PLOT GOES HERE**