![GML Logo](../images/logo.png)

# Spatial Statistics with Voyager

Contact: Andrew Newman (andrew.newman@uq.edu.au)

The purpose of this tutorial is to introduce the Voyager package for exploratory spatial data analysis and 
visualization for spatial genomics data represented by SpatialFeatureExperiment objects.


To get started, copy and paste the following code into your R terminal session:

In [None]:
source("./startup.R")
show_plot(Voyager::plotImage(raw_sfe))

## Overview of Spatial Feature Experiment

<img src="images/spatial-feature.png">

A [Spatial Feature Experiment](https://bioconductor.org/packages/release/bioc/vignettes/SpatialFeatureExperiment/inst/doc/SFE.html) is a data structure built on top of [SpatialExperiment](https://www.bioconductor.org/packages/release/bioc/vignettes/SpatialExperiment/inst/doc/SpatialExperiment.html), which itself is based on the [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html).

[SingleCellExperiment](https://bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) is designed for querying and storage of single cell data. It includes:
* Tables of measurement values (assays) such as raw and transformed transcript counts, with rows corresponding to features/genes and columns to observations (spots/cells),
* rowData to store gene IDs and names,
* colData to store observation metadata such as barcode/cell IDs, and
* reducedDims to store reduced dimension representations of the measurements.

A [SpatialExperiment](https://www.bioconductor.org/packages/release/bioc/vignettes/SpatialExperiment/inst/doc/SpatialExperiment.html) is a data structure for storing and accessing spatial transcriptomic data. Its features include:
* Spatial coordinates (spatialCoords) associated with each observation, and
* Image files and related information (imgData) such as histology images and resolution.

[Spatial Feature Experiment](https://bioconductor.org/packages/release/bioc/vignettes/SpatialFeatureExperiment/inst/doc/SFE.html) incorporates geometries and geometry operations with the [sf](https://r-spatial.github.io/sf/) package. By using [sf](https://r-spatial.github.io/sf/), [Spatial Feature Experiment](https://bioconductor.org/packages/release/bioc/vignettes/SpatialFeatureExperiment/inst/doc/SFE.html) allows for efficient geometry operations such as intersection, buffering, aggregating polygons and finding all points within a polygon. It also adds:
* colGeometries and rowGeometries for storing geometries related to spots/cells (columns) and genes and their counts (rows),
* colGraph and rowGraph for storing graphs (nodes and edges) related to spots/cells and counts, and
* annotGeometries and annotGraphs for storing additional geometries and graphs such as tissue boundaries and their associated neighbourhood graph.
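
To make the geometry operations concrete, here is a minimal sketch that calls sf directly on made-up coordinates. The spot centres, the spot radius and the square tissue outline are hypothetical and are not taken from the tutorial data. It buffers points into spot polygons and checks which of them intersect the tissue polygon.

```R
library(sf)

# Hypothetical spot centres (arbitrary units, no coordinate reference system).
spot_centres <- st_sfc(
  st_point(c(1, 1)), st_point(c(2, 2)),
  st_point(c(5, 5)), st_point(c(3, 1))
)

# Hypothetical tissue boundary: a 4 x 4 square.
tissue <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(4, 0), c(4, 4), c(0, 4), c(0, 0)
))))

# Buffering turns each centre into an approximately circular spot polygon.
spot_polys <- st_buffer(spot_centres, dist = 0.5)

# Which spot polygons intersect the tissue polygon?
in_tissue <- st_intersects(spot_polys, tissue, sparse = FALSE)[, 1]

# Area of each spot polygon (close to pi * 0.5^2).
spot_area <- as.numeric(st_area(spot_polys))

data.frame(spot = seq_along(in_tissue), in_tissue = in_tissue, area = spot_area)
```

In this toy layout the third spot lies outside the square, so it is the only one flagged as out of tissue.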

# Perform QC

One of the pieces of information available from a Visium output is whether a spot is considered inside the tissue or not (a value of 1 or 0 in the "in_tissue" column).

We can plot the "sum" of the expression counts, the number of genes "detected" in a spot and the percentage of mitochondrial genes (a sum of a collection of genes).

We can observe the distribution of these through violin plots, both in tissue and out of tissue.

In [None]:
p1 <- scater::plotColData(transposed_raw_sfe, "sum", x = "in_tissue", color_by = "in_tissue") + custom_theme()
p2 <- scater::plotColData(transposed_raw_sfe, "detected", x = "in_tissue", color_by = "in_tissue") + custom_theme()
p3 <- scater::plotColData(transposed_raw_sfe, "subsets_mito_percent", x = "in_tissue", color_by = "in_tissue") + custom_theme()
p4 <- scater::plotColData(transposed_raw_sfe, x = "sum", y = "subsets_mito_percent", bins = 100) + custom_theme()
plot <- (p1 + p2 + p3 + p4) + patchwork::plot_layout(ncol = 3, guides = "collect")
show_plot(plot)

![Violin Plot 1](images/violin_1.png)

We can see here that the total expression count is low outside the tissue, which is what you should expect. However, the number of genes detected out of tissue is slightly higher, with the percentage of counts from mitochondrial genes being between 10% and 20%.

We can observe the effect on the distribution of the genes and their counts by keeping only those spots whose mitochondrial percentage is below 20%.

To plot this, execute the following code (the filtering itself has already been performed):

In [None]:
p1 <- scater::plotColData(processed_sfe, "sum", x = "in_tissue", color_by = "in_tissue") + custom_theme()
p2 <- scater::plotColData(processed_sfe, "detected", x = "in_tissue", color_by = "in_tissue") + custom_theme()
p3 <- scater::plotColData(processed_sfe, "subsets_mito_percent", x = "in_tissue", color_by = "in_tissue") + custom_theme()
p4 <- scater::plotColData(processed_sfe, x = "sum", y = "subsets_mito_percent", bins = 100) + custom_theme()
plot <- (p1 + p2 + p3 + p4) + patchwork::plot_layout(ncol = 3, guides = "collect")
show_plot(plot)

![Violin Plot 2](images/violin_2.png)

<img style="float: left; margin-right:20px; margin-bottom:0px;" src="images/code_icon.png">

The original (unfiltered/raw) data is stored in transposed_raw_sfe and the mitochondrial percentage of each spot is stored in the column qc_sfe$subsets_mito_percent.

Based on your observation of the data what percentage should you set your filter at? 30%, 10%, 5% or something else?
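
One way to compare candidate thresholds is to count how many spots each one would retain. The following is a minimal sketch on simulated percentages. The vector `mito_percent` is a hypothetical stand-in for the real column and its values are made up.

```R
set.seed(1)

# Simulated mitochondrial percentages for 1000 hypothetical spots:
# most spots sit around 12%, a minority have much higher values.
mito_percent <- c(rnorm(900, mean = 12, sd = 4), runif(100, min = 20, max = 60))

# Candidate thresholds (in percent) and the number of spots kept by each.
thresholds <- c(5, 10, 20, 30)
retained <- sapply(thresholds, function(th) sum(mito_percent < th))

data.frame(
  threshold = thresholds,
  spots_retained = retained,
  fraction_retained = retained / length(mito_percent)
)
```

A threshold that discards most of the spots is probably too strict, while one that keeps everything does no filtering at all.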

Here is some sample code. Modify it to subset the data at your chosen percentage of mitochondrial content:

```R
processed_30_percent <- transposed_raw_sfe[, qc_sfe$subsets_mito_percent < your_percentage_goes_here]
p1 <- scater::plotColData(processed_30_percent, "sum", x = "in_tissue", color_by = "in_tissue") + custom_theme()
p2 <- scater::plotColData(processed_30_percent, "detected", x = "in_tissue", color_by = "in_tissue") + custom_theme()
p3 <- scater::plotColData(processed_30_percent, "subsets_mito_percent", x = "in_tissue", color_by = "in_tissue") + custom_theme()
p4 <- scater::plotColData(processed_30_percent, x = "sum", y = "subsets_mito_percent", bins = 100) + custom_theme()
plot <- (p1 + p2 + p3 + p4) + patchwork::plot_layout(ncol = 3, guides = "collect")
show_plot(plot)
```

In [None]:
# Run the code here.

# Before and after QC by Percentage

We can also plot spatially the effect of removing spots with a high mitochondrial percentage.

In [None]:
p1 <- Voyager::plotSpatialFeature(qc_sfe, c("sum"), image_id = "lowres", maxcell = 5e4, ncol = 2) + custom_theme()
p2 <- Voyager::plotSpatialFeature(qc_sfe, c("detected"), image_id = "lowres", maxcell = 5e4, ncol = 2) + custom_theme()
p3 <- Voyager::plotSpatialFeature(qc_sfe, c("subsets_mito_percent"), image_id = "lowres", maxcell = 5e4, ncol = 2) + custom_theme()
p4 <- Voyager::plotSpatialFeature(processed_sfe, c("sum"), image_id = "lowres", maxcell = 5e4, ncol = 2) + custom_theme()
p5 <- Voyager::plotSpatialFeature(processed_sfe, c("detected"), image_id = "lowres", maxcell = 5e4, ncol = 2) + custom_theme()
p6 <- Voyager::plotSpatialFeature(processed_sfe, c("subsets_mito_percent"), image_id = "lowres", maxcell = 5e4, ncol = 2) + custom_theme()
plot <- (p1 + p2 + p3 + p4 + p5 + p6) + patchwork::plot_layout(ncol = 3, guides = "collect")
show_plot(plot, width = 1600, height = 800)

![Spatial Comparison of Mitochondria QC](images/spatial_effect_qc.png)

# Plotting Metrics Spatially

Instead of plotting a violin or scatter plot of the counts and genes, we can also view the gene expression directly on the tissue.

## Using In Tissue Spots

The following combines violin plots of the counts per spot and the number of genes with their spatial distribution on the tissue.

In [None]:
p1 <- scater::plotColData(processed_sfe, "nCounts", x = "in_tissue", colour_by = "in_tissue") +
    theme(legend.position = "top") + custom_theme()
p2 <- Voyager::plotSpatialFeature(processed_sfe, "nCounts", colGeometryName = "spotPoly",
                              annotGeometryName = "tissueBoundary", 
                              image_id = "lowres", maxcell = 5e4,
                              annot_fixed = list(fill = NA, color = "black")) + custom_theme()
p3 <- scater::plotColData(processed_sfe, "nGenes", x = "in_tissue", colour_by = "in_tissue") +
    theme(legend.position = "top") + custom_theme()
p4 <- Voyager::plotSpatialFeature(processed_sfe, "nGenes", colGeometryName = "spotPoly",
                              annotGeometryName = "tissueBoundary",
                              image_id = "lowres", maxcell = 5e4,
                              annot_fixed = list(fill = NA, color = "black")) + custom_theme()
plot <- (p1 + p2 + p3 + p4) + patchwork::plot_layout(ncol = 2, guides = "collect")
show_plot(plot, width = 1600, height = 800)

![Comparison of In Tissue and Out of Tissue Counts and Genes](images/violin_mito_comparison.png)

Next, plot the number of genes against the number of counts for each spot, coloured by whether the spot is in tissue. In data sets with distinct differences in tissue you can often perceive multiple arcs in the trend line, indicating that different tissue sections vary.

In [None]:
plot <- scater::plotColData(processed_sfe, x = "nCounts", y = "nGenes", colour_by = "in_tissue") + custom_theme()
show_plot(plot)

![mean variance comparison](images/counts_vs_genes.png)

The above plot shows the relationship between counts and genes for the in-tissue and out-of-tissue spots.

Also, in many cases, there is a correlation between the mean and variance of gene expression data: as the mean expression of a gene increases, so too does its variance. Genes above the diagonal line have higher variance than expected given their mean expression, indicating more variability or heterogeneity in their expression counts. Genes below the diagonal line have lower variance than expected, suggesting more consistent or homogeneous expression levels across the dataset.

Biological interpretation:
* Genes with high mean expression and low variance may indicate housekeeping genes, and
* Genes with high variance relative to mean expression may represent cell-type-specific gene expression, a condition or a response to stimuli.

It's also noteworthy that high variance may indicate technical factors such as noise, sequencing depth or batch effects.
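
The mean-variance relationship can be illustrated on simulated counts. This is a minimal sketch, not the tutorial data. The gene names and distribution parameters are made up. Poisson counts have variance equal to their mean (on the diagonal), while negative binomial counts are overdispersed (above the diagonal).

```R
set.seed(42)
n_spots <- 500
gene_means <- c(1, 5, 20)

# Each row is one simulated gene, each column one simulated spot.
pois <- t(sapply(gene_means, function(mu) rpois(n_spots, lambda = mu)))
nb <- t(sapply(gene_means, function(mu) rnbinom(n_spots, mu = mu, size = 2)))

counts <- rbind(pois, nb)
rownames(counts) <- c(paste0("pois_", gene_means), paste0("nb_", gene_means))

# Per-gene mean and variance, as plotted on a mean-variance plot.
means <- rowMeans(counts)
vars <- apply(counts, 1, var)

round(cbind(mean = means, variance = vars, ratio = vars / means), 2)
```

A variance-to-mean ratio near 1 corresponds to a gene lying on the diagonal, and a ratio well above 1 to a gene lying above it.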

In [None]:
plot <- scater::plotRowData(sfe_tissue, x = "means", y = "vars", bins = 50) +
    ggplot2::geom_abline(slope = 1, intercept = 0, color = "red") +
    ggplot2::scale_x_log10() + ggplot2::scale_y_log10() +
    ggplot2::scale_fill_distiller(palette = "Blues", direction = 1) +
    ggplot2::annotation_logticks() +
    ggplot2::coord_equal() + custom_theme()
show_plot(plot)

![mean variance comparison](images/means_variance.png)

# Comparing the Effects of In-Tissue Analysis

We can also compare how including or excluding out-of-tissue spots affects the variance captured by principal component analysis. Similarly, the top variable genes and clustering should also change when considering the whole sample versus only in-tissue spots.

Would you expect it to capture more or less variance if we only consider in tissue gene expression? Would you expect changes to the most variable genes detected? What changes are there to the clustering results?

In [None]:
p1 <- Voyager::ElbowPlot(processed_sfe, ndims = 30) + custom_theme()
p2 <- Voyager::ElbowPlot(sfe_tissue, ndims = 30) + custom_theme()
plot <- (p1 + p2) + patchwork::plot_layout(ncol = 2, guides = "collect")
show_plot(plot)

![comparison of in tissue and out tissue elbow plots](images/elbow_plot_comparison.png)

The change to the principal components is small but noticeable: with less variable data to consider, slightly more of the variance is captured by the first few components.
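
An elbow plot shows the proportion of variance explained by each principal component. The following is a minimal sketch of that calculation using base R's `prcomp` on a simulated expression matrix. The two latent factors and all sizes are assumptions chosen for illustration.

```R
set.seed(7)
n_spots <- 200
n_genes <- 20

# Two hypothetical latent factors drive the first ten genes; the rest is noise.
f1 <- rnorm(n_spots)
f2 <- rnorm(n_spots)
expr <- matrix(rnorm(n_spots * n_genes, sd = 0.5), nrow = n_spots, ncol = n_genes)
expr[, 1:6] <- expr[, 1:6] + 3 * f1
expr[, 7:10] <- expr[, 7:10] + 2 * f2

# PCA with spots as observations (rows) and genes as variables (columns).
pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# Proportion of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_explained[1:5], 1)
```

The "elbow" is the point after which further components add little variance. Here it falls after the second component because only two factors were simulated.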

In [None]:
p1 <- Voyager::plotDimLoadings(processed_sfe, dims = 1:3, swap_rownames = "symbol", ncol = 3) + custom_theme()
p2 <- Voyager::plotDimLoadings(sfe_tissue, dims = 1:3, swap_rownames = "symbol", ncol = 3) + custom_theme()
plot <- (p1 + p2) + patchwork::plot_layout(ncol = 1, guides = "collect")
show_plot(plot)

![comparison of in tissue and out tissue variable genes](images/genes_comparison.png)

In [None]:
p1 <- scater::plotUMAP(processed_sfe, colour_by = "cluster") + custom_theme()
p2 <- scater::plotUMAP(sfe_tissue, colour_by = "cluster") + custom_theme()
plot <- (p1 + p2) + patchwork::plot_layout(ncol = 1, guides = "collect")
show_plot(plot)

![comparison of in tissue and out tissue umaps](images/umap_comparison.png)

The change to the UMAP clusters is small but noticeable. The clusters are more distinct and fewer of them are detected.

# Other Plots Available

The following demonstrates some of the other comparisons you can make with the libraries used by Voyager, including scater and others. This includes comparing the variance captured by the [PCA plots and performing a pairwise comparison against each component](https://bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/overview.html#43_Visualizing_reduced_dimensions). The boxes on the diagonal show the density of each component (the distribution of cells along that particular principal component, with the peaks showing where most of the cells lie).

In [None]:
p1 <- scater::plotPCA(sfe_tissue, ncomponents = 5, colour_by = "cluster") + custom_theme()
plot <- (p1) + patchwork::plot_layout(ncol = 1, guides = "collect")
show_plot(plot)

![pairwise comparison of principal components](images/pca_results.png)

We can also spatially plot the clusters and see the various spots captured by the principal components.

In [None]:
p1 <- Voyager::plotSpatialFeature(sfe_tissue, features = "cluster", colGeometryName = "spotPoly", image_id = "lowres") + custom_theme()
p2 <- Voyager::spatialReducedDim(sfe_tissue, "PCA", ncomponents = 4, 
                  colGeometryName = "spotPoly", divergent = TRUE, 
                  diverge_center = 0, ncol = 2, 
                  image_id = "lowres", maxcell = 5e4) + custom_theme()
plot <- (p1 + p2) + patchwork::plot_layout(ncol = 1, guides = "collect") & custom_theme()
show_plot(plot)

![spatial clustering and view of PCAs](images/cluster_pca.png)

In [None]:
plots <- scater::plotExpression(sfe_tissue, rowData(sfe_tissue)[genes_use, "symbol"], x = "cluster",
               colour_by = "cluster", swap_rownames = "symbol")
plots <- wrap_plots(plots) & custom_theme()
show_plot(plots)

![top variable genes and cluster distribution](images/violin_plot_clusters.png)

# Spatial Statistics

The intuition behind spatial statistics is that nearer things are more closely related than more distant things. For example, the weather in Brisbane and the weather on the Sunshine Coast are more similar to each other than either is to the weather in Melbourne. Spatial autocorrelation has been used for decades in [geographical information systems (GIS)](https://dces.wisc.edu/wp-content/uploads/sites/128/2013/08/W5_Getis2008.pdf), applied to areas such as the analysis of air pollution, water quality, or soil properties. A good example of using geographical [spatial exploratory and confirmatory analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) to improve cancer screening can be found in "[Spatial evaluation of prevalence, pattern and predictors of cervical cancer screening in India](https://www.sciencedirect.com/science/article/pii/S003335061930294X#sec2.2)".

Univariate, bivariate and multivariate spatial correlation measure the degree of spatial dependence or clustering for a single variable, two variables or all variables across different locations. They quantify whether values at nearby locations are more similar or dissimilar than expected by chance.

For example, Moran’s I is similar to the Pearson correlation between the value at each location and the average value at its neighbors. Like Pearson correlation, Moran’s I is generally bounded between -1 and 1, where a positive value indicates positive spatial autocorrelation and a negative value indicates negative spatial autocorrelation.

To determine if the spatial autocorrelation is statistically significant, the [moran.test](https://r-spatial.github.io/spdep/reference/moran.test.html) function in [spdep](https://github.com/r-spatial/spdep) is used. It provides a p-value, but the p-value may not be accurate if the data is not normally distributed. Gene expression data is generally not normally distributed and data normalization doesn’t always work well. Instead, permutation testing is used to generate the significance of Moran’s I and Geary’s C.

Types of spatial correlation:
* Univariate
  * Global
    * Moran's I
    * Geary's C
    * Correlogram
    * Variogram
  * Local
    * Moran Scatter Plot
    * Local Moran's I
    * Local spatial heteroscedasticity
    * Getis-Ord Gi
* Bivariate
  * Lee's L
  * Cross Variogram
* Multivariate
  * MULTISPATI PCA
  * Multivariate local Geary's C

A list of spatial statistic functions available for use:

In [None]:
Voyager::listSFEMethods(variate = "uni", scope = "global")

The "moran.test" and "geary.test" refer to autocorrelation functions that provide a p-value. The test statistic, known as the standard deviate of Moran's I, is assumed to follow a standard normal distribution (z-distribution) when the null hypothesis is true. However, the p-value may not be accurate if the data is not normally distributed, which is often the case with gene expression data. 

The "moran.mc" and "geary.mc" perform permutation testing using a Monte Carlo simulation to calculate the p-value. Permutation testing is a robust approach for assessing the significance of spatial correlation, especially when the data is not normally distributed. It provides a reliable way to determine if the observed spatial patterns are likely to have arisen by chance or if they reflect meaningful spatial relationships.

Testing is performed by randomly shuffling the values of the variable across the locations multiple times (e.g. 999 times) and recalculating Moran's I for each permutation. This creates a reference distribution of Moran's I values under the null hypothesis of no spatial correlation. The observed Moran's I value is then compared to this reference distribution. If the observed value falls in the extreme tails of the distribution (e.g., top 5% or bottom 5%), it suggests that the spatial correlation is statistically significant and unlikely to have occurred by chance.
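
The permutation procedure described above can be sketched in base R. This is an illustrative implementation under stated assumptions, not the code used by spdep: `morans_i` is a hypothetical helper, the neighbourhood is a simple chain of 30 locations with binary weights, and the values are simulated with a spatial trend.

```R
# Moran's I for values x and a square weight matrix W (hypothetical helper).
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

set.seed(123)
n <- 30

# Chain neighbourhood: location i is a neighbour of i - 1 and i + 1.
W <- matrix(0, n, n)
W[cbind(1:(n - 1), 2:n)] <- 1
W <- W + t(W)

# Simulated values with a clear spatial trend plus noise.
x <- seq_len(n) + rnorm(n)

observed <- morans_i(x, W)

# Reference distribution: shuffle the values across locations and recompute.
nsim <- 999
permuted <- replicate(nsim, morans_i(sample(x), W))

# One-sided pseudo p-value for positive spatial autocorrelation.
p_value <- (sum(permuted >= observed) + 1) / (nsim + 1)

c(observed = observed, p_value = p_value)
```

With 999 permutations the smallest possible pseudo p-value is 1/1000, which is why the number of simulations limits the resolution of the test.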

## Generating a Spatial Neighbourhood Graph

A spatial neighbourhood graph must be generated before spatial correlation can be computed. For Visium, where the spots 
are in a hexagonal grid, the spatial neighbourhood graph is straightforward (we're using [tri2nb](https://r-spatial.github.io/spdep/reference/tri2nb.html)).

We'll compare generating a graph for the entire sample (with spots removed), the tissue of interest and the tissue of interest with spots removed.
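
To illustrate what a neighbourhood graph on a hexagonal grid looks like, here is a minimal base R sketch. It uses a distance cut-off rather than the triangulation performed by tri2nb, and the 5 x 5 grid with unit spacing is a made-up layout, not the Visium coordinates.

```R
n_row <- 5
n_col <- 5
grid <- expand.grid(row = 0:(n_row - 1), col = 0:(n_col - 1))

# Hexagonal layout: odd rows are shifted by half a spot, rows are sqrt(3)/2 apart,
# so every spot is exactly one unit away from each of its nearest neighbours.
x <- grid$col + 0.5 * (grid$row %% 2)
y <- grid$row * sqrt(3) / 2

# Two spots are neighbours if their centres are (about) one unit apart.
d <- as.matrix(dist(cbind(x, y)))
adjacency <- (d > 0 & d < 1.01) * 1

# Interior spots have six neighbours; edge and corner spots have fewer.
n_neighbours <- rowSums(adjacency)
table(n_neighbours)

# Row-normalised weights: each spot's neighbour weights sum to one.
weights <- adjacency / n_neighbours
```

Removing spots during QC deletes rows and columns from this matrix, which is why a graph built before filtering differs from one built afterwards.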

In [None]:
source("./stats.R")

## Visualising the Neighbourhood Graphs

The following shows the new neighbourhood graph generated from the original Visium spots, the tissue based spots and the Visium spots with QC applied to remove some of the spots.

In [None]:
p1 <- Voyager::plotColGraph(processed_sfe, "graph1") + custom_theme()
p2 <- Voyager::plotColGraph(sfe_tissue, "graph1") + custom_theme()
p3 <- Voyager::plotColGraph(sfe_tissue, "visium") + custom_theme()
plot <- (p1 + p2 + p3) + patchwork::plot_layout(ncol = 1, guides = "collect") & custom_theme()
show_plot(plot, width = 1600, height = 800)

![neighbourhood graphs](images/neighbourhood_meshes.png)

## Global Spatial Correlation

Global spatial correlation looks at the entire dataset to assess whether it is clustered, dispersed, or randomly assigned.

### Using Moran's I

Moran’s I ([Moran 1950](https://doi.org/10.2307/2332162)) is one of the most commonly used statistics of spatial autocorrelation, defined as:

$I = \frac{n}{\sum_{i=1}^n \sum_{j=1}^n w_{ij}} \frac{\sum_{i=1}^n
\sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^n (x_i -
\bar{x})^2},$

Where,

* $n$ is the number of observations (spots/locations),
* $x_i$ is the value at location $i$,
* $x_j$ is the value at another location $j$,
* $\bar{x}$ is the mean of the values over all locations, and
* $w_{ij}$ is a weight indexing the location of $i$ relative to $j$ (an edge as defined by a weighted neighbourhood graph).
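
The formula translates almost directly into base R. The sketch below is for illustration only. `morans_i` is a hypothetical helper, and the 6 x 6 grid with binary rook (up/down/left/right) weights is an assumption chosen to keep the example small.

```R
# Moran's I: (n / sum(w_ij)) * sum(w_ij * z_i * z_j) / sum(z_i^2), with z = x - mean(x).
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Hypothetical 6 x 6 grid of locations.
coords <- expand.grid(row = 1:6, col = 1:6)

# Binary weights: w_ij = 1 if locations i and j share an edge, otherwise 0.
W <- (as.matrix(dist(coords, method = "manhattan")) == 1) * 1

# Three made-up variables with different spatial structure.
gradient <- coords$row + coords$col          # smooth trend: similar neighbours
checker <- (coords$row + coords$col) %% 2    # checkerboard: opposite neighbours
set.seed(1)
noise <- rnorm(nrow(coords))                 # no spatial structure

results <- c(
  gradient = morans_i(gradient, W),
  checker = morans_i(checker, W),
  noise = morans_i(noise, W)
)
round(results, 3)
```

The smooth gradient gives a clearly positive value, the checkerboard gives exactly -1 under this weighting because every neighbour has the opposite value, and the noise gives a value much closer to zero.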

<img src="images/global_spatial_autocorrelation.png">

*From: [Chapter 4 Spatial Regression in R](http://www.geo.hunter.cuny.edu/~ssun/R-Spatial/spregression.html)*

Moran's I values indicate the following:
* a positive value indicates that observations with similar values cluster together,
* a value at or near 0 indicates that observations are randomly distributed, and
* a negative value indicates that observations are dispersed (neighbours have dissimilar values).

In [None]:
calculateMoransI(t(colData(sfe_tissue)[,c("nCounts", "nGenes")]), listw = colGraph(sfe_tissue, "graph1"))

### What do these results mean? 

As mentioned above, a positive "moran" value means that the values are positively spatially correlated (similar values cluster together) and a negative value means that they are negatively spatially correlated (dispersed). A value close to zero indicates that the values are randomly distributed.

<img src="images/kurtosis.png">

*Plot generated by kurtosis.py*

The "K" value or [kurtosis](https://en.wikipedia.org/wiki/Kurtosis) is a measure of the "tailedness" of a distribution. Higher values indicate a distribution with a higher chance of producing outliers; conversely, lower values indicate a smaller chance of producing outliers. A value of 3 is typical for normally distributed data.

In this case, both "nCounts" and "nGenes" have high positive Moran's I values (0.761749 and 0.793523, respectively), suggesting strong positive spatial autocorrelation. The kurtosis values for "nCounts" and "nGenes" are 1.84931 and 1.96607, respectively, which are less than 3, suggesting that both variables have a distribution with a smaller chance of producing outliers compared to a normal distribution.
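
To get a feel for these numbers, the following minimal sketch computes the moment-based (non-excess) kurtosis of three simulated samples. `kurtosis` is a hypothetical helper defined here and the samples are made up. Whether the reported "K" uses exactly this estimator is an assumption.

```R
# Moment-based kurtosis: fourth central moment divided by the squared variance.
kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2
}

set.seed(2024)
n <- 1e5

k_uniform <- kurtosis(runif(n))     # light tails, theoretical value 1.8
k_normal <- kurtosis(rnorm(n))      # reference, theoretical value 3
k_exponential <- kurtosis(rexp(n))  # heavy right tail, theoretical value 9

round(c(uniform = k_uniform, normal = k_normal, exponential = k_exponential), 2)
```

Values below 3, like those reported for "nCounts" and "nGenes", sit on the light-tailed side of this comparison.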

### Histogram of Values

In [None]:
p1 <- Voyager::plotRowDataHistogram(sfe_tissue, "moran_sample01") + custom_theme()
plots <- wrap_plots(p1) + patchwork::plot_layout(ncol = 1, guides = "collect")
show_plot(plots)

![histogram of moran](images/moran_i_histogram.png)

As an example, we'll calculate the top 2000 highly variable genes and see if they are spatially correlated.

```R
dec <- scran::modelGeneVar(sfe_tissue)
hvgs <- scran::getTopHVGs(dec, n = 2000)
```

In [None]:
df <- rowData(sfe_tissue)[hvgs,]
ord <- order(df$moran_sample01, decreasing = TRUE)
df[ord,c("symbol", "moran_sample01",  "K_sample01")]

In [None]:
p1 <- Voyager::plotSpatialFeature(sfe_tissue, pos_moran, 
                                  colGeometryName = "spotPoly", image_id = "lowres", maxcell = 5e4, 
                                  swap_rownames = "symbol", ncol = 2)
plots <- wrap_plots(p1) + patchwork::plot_layout(ncol = 1, guides = "collect") & custom_theme()
show_plot(plots)

![genes with the highest Moran's I](images/moran_i_top_genes.png)

In [None]:
p1 <- Voyager::plotSpatialFeature(sfe_tissue, neg_moran, 
                                  colGeometryName = "spotPoly", image_id = "lowres", maxcell = 5e4, 
                                  swap_rownames = "symbol", ncol = 2)
plots <- wrap_plots(p1) + patchwork::plot_layout(ncol = 1, guides = "collect") & custom_theme()
show_plot(plots)

![genes with the lowest Moran's I](images/moran_i_bottom_genes.png)

## Local Spatial Correlation

Local spatial correlation looks at the neighbours of each value. Instead of generating a single statistic characterising the dataset, local methods generate a value for each location. You can reuse methods such as Moran's I and Geary's C to process data locally; these will show clusters of similar values (low or high) as well as outliers. Getis-Ord Gi* is another method that generates clusters of cold (low) and hot (high) values and can be easier to interpret than other methods.
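
As an illustration of a local statistic, the sketch below computes local Moran's I in base R on a made-up 6 x 6 grid with a simulated hot spot in one corner. The grid, the row-normalised rook weights and the variable names are all assumptions; this is not the implementation used by Voyager or spdep.

```R
# Hypothetical 6 x 6 grid with row-normalised rook (shared edge) weights.
nr <- 6
nc <- 6
coords <- expand.grid(row = seq_len(nr), col = seq_len(nc))
A <- (as.matrix(dist(coords, method = "manhattan")) == 1) * 1
W <- A / rowSums(A)

# Simulated values: low background noise plus a 2 x 2 hot spot in one corner.
set.seed(3)
x <- rnorm(nrow(coords), sd = 0.2)
hot <- coords$row <= 2 & coords$col <= 2
x[hot] <- x[hot] + 5

# Local Moran's I: I_i = z_i * (weighted mean of neighbouring z) / m2.
z <- x - mean(x)
m2 <- sum(z^2) / length(z)
lag_z <- as.vector(W %*% z)
local_i <- z * lag_z / m2

# With row-normalised weights the global Moran's I is the mean of the local values.
global_i <- mean(local_i)

# Show the local values laid out on the grid (rows x columns).
matrix(round(local_i, 2), nrow = nr, ncol = nc)
```

The largest local values appear inside the hot spot, where high values are surrounded by other high values.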

In [None]:
plots <- Voyager::plotLocalResult(sfe_tissue, "localG_perm", features = "Pcp4", 
                colGeometryName = "spotPoly", divergent = TRUE,
                diverge_center = 0, image_id = "lowres", swap_rownames = "symbol", 
                color = "black", linewidth = 0.1) + custom_theme()
show_plot(plots)

![localG_perm for Pcp4](images/localG_perm_pcp4.png)

In [None]:
plots <- Voyager::plotLocalResult(sfe_tissue, "localG_perm", features = "Pcp4", 
                attribute = "-log10p_adj Sim",
                colGeometryName = "spotPoly", divergent = TRUE,
                diverge_center = -log10(0.05), swap_rownames = "symbol",
                image_id = "lowres") + custom_theme()
show_plot(plots)

![adjusted p-values of localG_perm for Pcp4](images/localG_perm_log_pcp4.png)

<img style="float: left; margin-right:20px; margin-bottom:0px;" src="images/code_icon.png">

To perform your own spatial analysis of interesting genes, pick another gene from the top 6 or bottom 6 and run the analysis. Some suggestions include "Apoe" or "Tcf12". The code to run the above:
```R
sfe_tissue <- Voyager::runUnivariate(sfe_tissue, type = "localG_perm", features = "Pcp4", colGraphName = "visium_B",
                                     swap_rownames = "symbol")
plots <- Voyager::plotLocalResult(sfe_tissue, "localG_perm", features = "Pcp4", 
                colGeometryName = "spotPoly", divergent = TRUE,
                diverge_center = 0, image_id = "lowres", swap_rownames = "symbol", 
                color = "black", linewidth = 0.1) + custom_theme()
show_plot(plots)
```

Create your own plot by entering the code in the next cell.

In [None]:
# Run the code here.

## MULTISPATI PCA (Multivariate Spatial Correlation)

Due to the large number of genes quantified in single cell and spatial transcriptomics, dimension reduction is part of the standard workflow to analyze such data, to visualize, to help interpreting the data, to distill relevant information and reduce noise, to facilitate downstream analyses such as clustering and pseudotime, to project different samples into a shared latent space for data integration, and so on.

Spatially informed dimension reduction is actually not new, and dates back to at least 1985, with Wartenberg’s crossover of Moran’s I and PCA (Wartenberg 1985), which was generalized and further developed as MULTISPATI PCA (Dray, Saïd, and Débias 2008).

In short, while PCA tries to maximize the variance explained by each PC, MULTISPATI maximizes the product of Moran’s I and variance explained. Also, while all the eigenvalues from PCA are non-negative, because the covariance matrix is positive semidefinite, MULTISPATI can give negative eigenvalues, which represent negative spatial autocorrelation, which can be present and interesting but is not as common as positive spatial autocorrelation and is often masked by the latter (Griffith 2019).

In [None]:
plots <- Voyager::ElbowPlot(sfe_tissue, nfnega = 20, reduction = "multispati") + custom_theme()
show_plot(plots)

![elbow plot of multispati](images/multispati_elbow_plot.png)

This shows the top genes by loading for the MULTISPATI PCA components.

In [None]:
p1 <- Voyager::plotDimLoadings(sfe_tissue, dims = c(1:4), swap_rownames = "symbol", reduction = "multispati")
plots <- wrap_plots(p1) & custom_theme()
show_plot(plots)

![top gene loadings of multispati](images/multispati_hvg.png)

In [None]:
p1 <- Voyager::spatialReducedDim(sfe_tissue, "multispati", ncomponents = 5, 
                  colGeometryName = "spotPoly", divergent = TRUE, 
                  diverge_center = 0, ncol = 2, 
                  image_id = "lowres", maxcell = 5e4)
plots <- wrap_plots(p1) & custom_theme()
show_plot(plots)

## Comparing Spatial and Non-Spatial PCA

In the context of spatial transcriptomics, clusters are groups of spots that share similar characteristics within 
each group. 

Non-spatial methods, such as principal component analysis (PCA), identify genes that are effective in 
distinguishing different cell types based solely on their expression patterns, without considering the spatial 
arrangement of the cells. However, gene expression may exhibit strong spatial structure, meaning that expression 
values are not randomly distributed across the tissue but rather form distinct spatial patterns.

MULTISPATI's components identify genes that define spatial regions in addition to differentiating between cell types. 
These genes not only distinguish cell types but also capture the spatial organization of cells within the tissue. 
The genes associated with each MULTISPATI component can provide valuable insights into the spatial patterns 
and the underlying biological processes. MULTISPATI diagonalizes the symmetric matrix:

$H = \frac 1 {2n} X(W^t+W)X^t$

Where,
* 𝑋 is a gene count matrix whose columns are cells or Visium spots and whose rows are genes, with 𝑛 columns, and
* 𝑊 is the row normalized 𝑛×𝑛 adjacency matrix of the spatial neighborhood graph of the cells or Visium spots, which does not have to be symmetric.
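
The matrix $H$ can be built and diagonalized in a few lines of base R. The following is a minimal sketch on made-up data, not the Voyager implementation. The 6 x 6 grid, the three toy "genes" and the centring of each gene are assumptions made for illustration.

```R
# Hypothetical 6 x 6 grid with a row-normalised adjacency matrix W.
coords <- expand.grid(row = 1:6, col = 1:6)
A <- (as.matrix(dist(coords, method = "manhattan")) == 1) * 1
W <- A / rowSums(A)
n <- nrow(coords)

# Toy matrix X: genes in rows, spots in columns.
set.seed(11)
X <- rbind(
  gradient = coords$row + coords$col,         # positively autocorrelated
  checker = (coords$row + coords$col) %% 2,   # negatively autocorrelated
  noise = rnorm(n)                            # no spatial structure
)
X <- X - rowMeans(X)  # centre each gene

# H = 1 / (2n) * X (W^t + W) X^t, which is symmetric by construction.
H <- X %*% (t(W) + W) %*% t(X) / (2 * n)

# Eigenvalues can be positive or negative, unlike those of a covariance matrix.
eig <- eigen(H, symmetric = TRUE)
round(eig$values, 3)
```

The positive eigenvalue is driven by the smooth gradient and the negative one by the checkerboard pattern, mirroring positive and negative spatial autocorrelation.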
 
The two analysis methods provide different perspectives and uncover different aspects of the data. Non-spatial 
clustering focuses on identifying cell types based on gene expression, while spatial clustering incorporates 
both gene expression and spatial information to identify spatially coherent regions and the genes that define 
them.

We will now perform clustering on both the original PCA components and the new MULTISPATI-generated components.

In [None]:
p1 <- Voyager::plotSpatialFeature(sfe_tissue, c("clusts_nonspatial", "clusts_multispati"), colGeometryName = "spotPoly", 
                            scattermore = TRUE, pointsize = 7) & guides(colour = guide_legend(override.aes = list(size=2), ncol = 2))
plots <- wrap_plots(p1) & custom_theme()
show_plot(plots)

![MULTISPATI_plot](images/multispati.png)

# More Information

The homepage for the Voyager R project is https://pachterlab.github.io/voyager/index.html

This tutorial was based on the following:
* https://pachterlab.github.io/voyager/articles/visium_10x.html
* https://pachterlab.github.io/voyager/articles/vig1_visium_basic.html
* https://pachterlab.github.io/voyager/articles/vig2_visium.html
* https://pachterlab.github.io/voyager/articles/visium_10x_spatial.html
* https://pachterlab.github.io/voyager/articles/multispati.html