# DAY 1: Visium Spatial Transcriptomics Data Analysis - Mouse Intestine

## Overview

This notebook will guide you through Day 1 content. We will cover:

1. Loading data
2. Quality control
3. Normalisation, feature selection and dimensionality reduction
4. Cluster analysis
5. Spatial clustering 

## How to use this notebook

This notebook is intended to be used as a reference for your own analysis.
All code chunks have an explanation detailing the analysis steps and their purpose, as well as key parameters.
Play around with these and see what they do, so that you are better equipped to adapt the workflow to your own data.

## Dataset

We will be using a Visium dataset from [Parigi et al, 2022](https://www.nature.com/articles/s41467-022-28497-0).
This dataset was generated using V1 3' polyA Visium chemistry and consists of four mouse intestine samples taken from healthy mice and mice subjected to a DSS colitis model, where the intestine is damaged.
Today, we will use a healthy mouse intestine section as an example.
Overall, the dataset quality is good but there are some quality issues, which will hopefully show you what to look out for in your own data. 


## Setting Up Your Environment

First we need to set up the environment and load the packages we will use for this workshop.

**os:** Provides access to the operating system for handling files, directories and paths used in the notebook.

**pandas:** Enables loading, cleaning and manipulating tabular data as dataframes.

**matplotlib:** Provides core plotting functions for creating various visualisations and plots.

**seaborn:** Builds on matplotlib to provide a higher-level interface for statistical plots.

**anndata:** AnnData is an established data structure for storing annotated single-cell datasets, which can also be used for spatial analysis.

**scanpy:** Toolkit for preprocessing, analysing and visualising single-cell RNA-seq data that extends to spatial data.

**squidpy:** Adds spatial analysis and visualization methods for spatial transcriptomics data, expanding on scanpy functionality.

**spotsweeper:** A spatial QC package for detecting spatial outliers in spatial transcriptomics data.

**banksy:** Performs spatially aware clustering of cells or spots using neighborhood information.

In [None]:
# Import os for working with system path
import os
# Import pandas for dataframe manipulation
import pandas as pd
# Import matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Import scanpy and anndata for single-cell RNA-seq
import anndata as ad
import scanpy as sc
# Import squidpy for spatial transcriptomics
import squidpy as sq
# Import spotsweeper for quality control
import spotsweeper.local_outliers as sw
# import banksy for spatially aware clustering
from banksy_utils.filter_utils import filter_hvg
from banksy.main import median_dist_to_nearest_neighbour
from banksy.initialize_banksy import initialize_banksy
from banksy.embed_banksy import generate_banksy_matrix
from banksy.main import concatenate_all
from banksy_utils.umap_pca import pca_umap
from banksy.cluster_methods import run_Leiden_partition

In [None]:
# Set Scanpy logging and plotting parameters
sc.set_figure_params(facecolor="white", figsize=(8, 8))
sc.settings.verbosity = 3

## 1. Loading Visium Spatial Transcriptomics Data

We will start by loading the Visium dataset.
As with single cell data, community data analysis tools provide a set of reader functions for ingesting various spatial transcriptomics data formats.
Unfortunately, the way platforms output and store their datasets is constantly evolving, so readers are not always up to date, and you might need workarounds if you have particularly new or particularly old data!

Here, we are using the `read.visium()` function from the [squidpy](https://squidpy.readthedocs.io/en/stable/) package to ingest the data.
[scanpy](https://scanpy.readthedocs.io/en/stable/) also has an equivalent `read_visium()` reader, although it is now deprecated.

The `sq.read.visium()` call loads 10x Genomics Visium spatial transcriptomics data into an AnnData object, including gene expression counts, H&E images and spatial metadata.

#### Key arguments:

**path:**
Path to the 10x Visium output directory. Must contain the "spatial" folder and count matrix.

**counts_file:**
Name of the count matrix file to load. This will be either `filtered_feature_bc_matrix.h5` or `raw_feature_bc_matrix.h5`. The filtered matrix contains only spots under tissue, while the raw matrix contains data for all spots on the slide. It is always good to inspect the raw matrix, even if we don't use the outside-tissue spots for the analysis.

**library_id:**
Identifier for the Visium section and needed for identifying samples when analysing more than one.

**load_images:** (default: True)
Whether to load histology images and scale factors into
adata.uns["spatial"][library_id].

**source_image_path:**
Optional path to the high-resolution tissue image if it is not in the standard location.


In [None]:
visium_dir = '/nvme/project/shared/python/5_python_spatial_omics/data/visium_v1_mouse_intestine/spaceranger/SRR14083626_HEALTHY_DAY0/outs'
adata = sq.read.visium(visium_dir, counts_file='raw_feature_bc_matrix.h5', library_id="Day0")

The warning above tells you that you have duplicate variable names (i.e. genes) in your data.
This is because the mapping between Ensembl IDs and gene names is not one-to-one.
For the purpose of the analysis, we should make the names unique. 
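
As a rough illustration of what this step does, the sketch below mimics the renaming on a toy list of gene names: the first occurrence is kept and later duplicates receive a numeric suffix. The helper `make_unique_sketch` and the toy names are hypothetical, and the logic is simplified (for instance, it does not guard against a suffixed name colliding with an existing one). In the notebook itself we simply call the AnnData method.

```python
def make_unique_sketch(names, join="-"):
    """Keep the first occurrence of each name; suffix repeats with -1, -2, ..."""
    seen = {}
    unique = []
    for name in names:
        if name in seen:
            # Repeat: bump the counter and append it as a suffix
            seen[name] += 1
            unique.append(f"{name}{join}{seen[name]}")
        else:
            # First time we see this name: keep it unchanged
            seen[name] = 0
            unique.append(name)
    return unique


# Toy gene names (invented for illustration)
toy_names = ["Actb", "Gm1", "Gm1", "Muc2", "Gm1"]
unique_names = make_unique_sketch(toy_names)
print(unique_names)
```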

In [None]:
adata.var_names_make_unique()

Inspect the data structure.

Below is a list of important "fields" of the AnnData structure, some of which are not immediately obvious from the summary view.

**adata.X** – expression matrix

**adata.obs** – spot metadata

**adata.var** – gene metadata

**adata.obs_names** - spot barcodes

**adata.obsm["spatial"]** – spot coordinates

**adata.uns["spatial"]["Day0"]** – images, scale factors, metadata

In [None]:
adata

Here we can see spot-level metadata.
At the moment, we only have the position of each spot (array_row & array_col), which is read in as part of the spot metadata.

We can also see whether each spot is 'in_tissue' based on 10X Spaceranger's automated tissue detection based on the H&E image.

Throughout the analysis, we will calculate additional metrics that will be stored here.
It's a good place to stash any spot-level information.

In [None]:
adata.X

In [None]:
adata.obs

Here we can see feature (gene) metadata. We can also see that the gene name is the primary index (which is why it had to be made unique).

Ensembl gene IDs are also included for the mm10 genome.

Throughout the analysis, we will calculate additional gene level metrics that will get stored here.

In [None]:
adata.var

> **Exercise:**
Try exploring other parts of the object.
Where can you find coordinates and images?

In [None]:
adata.obs_names

In [None]:
adata.obsm["spatial"]

In [None]:
adata.uns["spatial"]["Day0"]

### Reading in Automated Tissue Detection Information

10X Spaceranger software carries out automated tissue detection based on H&E images, which is typically reasonably good.
The output is stored in the `tissue_positions.csv` text file.
This is picked up automatically by squidpy, but could also be loaded manually using [pandas](https://pandas.pydata.org/). 

The provided code reads in this file and adds a metadata variable to the [AnnData](https://anndata.readthedocs.io/en/stable/) object.
We also load the spatial coordinates.

This can be useful if you have carried out tissue detection using a different software, for example, or if your data format is causing issues with default loaders. 
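
Older Space Ranger releases write this file as `tissue_positions_list.csv` without a header row, which is one reason default readers can struggle. A minimal sketch of a reader that copes with both layouts is shown here. The helper name `read_tissue_positions` is hypothetical, and the column names are assumed to match the header used in the newer file.

```python
import os
import pandas as pd

# Column order assumed for the header-less older file
POSITION_COLUMNS = [
    "barcode", "in_tissue", "array_row", "array_col",
    "pxl_row_in_fullres", "pxl_col_in_fullres",
]


def read_tissue_positions(spatial_dir):
    """Read spot positions from either the newer or the older file layout."""
    new_path = os.path.join(spatial_dir, "tissue_positions.csv")
    old_path = os.path.join(spatial_dir, "tissue_positions_list.csv")
    if os.path.exists(new_path):
        # Newer layout: header row is present
        positions = pd.read_csv(new_path)
    elif os.path.exists(old_path):
        # Older layout: no header, so supply the column names ourselves
        positions = pd.read_csv(old_path, header=None, names=POSITION_COLUMNS)
    else:
        raise FileNotFoundError(f"No tissue positions file found in {spatial_dir}")
    return positions.set_index("barcode")
```

With the paths used in this notebook, this could be called as `read_tissue_positions(os.path.join(visium_dir, "spatial"))`.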



In [None]:
# Load text file using pandas
tissue_coords = pd.read_csv(os.path.join(visium_dir, "spatial/tissue_positions.csv"))
tissue_coords = tissue_coords.set_index('barcode')
tissue_coords = tissue_coords.loc[adata.obs_names]

# Create new obs entry to indicate if each barcode is under tissue or not
# by matching barcodes in obs_names with those in pandas dataframe
adata.obs["under_tissue"] = [
    "Under Tissue"
    if tissue_coords.loc[barcode, "in_tissue"]
    else "Outside Tissue"
    for barcode in adata.obs_names
]

# Let's also update the spatial coordinates in the object manually -
# sometimes, depending on the spaceranger version,
# the spatial outputs may not be read in successfully by the default readers
adata.obsm["spatial"] = tissue_coords[["pxl_col_in_fullres", "pxl_row_in_fullres"]].to_numpy()
# Select single columns as Series so they align to adata.obs by barcode
adata.obs["array_row"] = tissue_coords["array_row"]
adata.obs["array_col"] = tissue_coords["array_col"]

adata.obs

Let's sanity check our data by quickly visualising the slide we loaded. 

The [squidpy](https://squidpy.readthedocs.io/en/stable/) `spatial_scatter()` function is used to create spatial plots of the spots on a tissue section.
When an image such as H&E is available, all spatial data will be plotted over the image, which always provides a lot of useful information.
Let's inspect the image first:

In [None]:
# Visualise H&E staining for tissue
# This can be done using both Scanpy and Squidpy
# sc.pl.spatial(adata, img_key="hires")
sq.pl.spatial_scatter(adata)

If we provide a `color=` argument to the plotting function, we can plot additional variables stored in the object.
Let's first check that the automated tissue detection that we read in matches the image. 

> **Exercise:**
Do you think the tissue detection here is working well?
Would you be happy with it for your own data?

In [None]:
# Visualization of tissue status in spatial coordinates
sq.pl.spatial_scatter(
    adata,
    color="under_tissue",
    size=1.5
)

## 2. Quality Control

The first thing we want to do after loading the data is quality control.
We want to see how well the experiment has worked and what potential quality issues the data might have.
As in single cell analysis, where we can have poor quality cells, in spatial experiments we can have lower quality spots, or lower quality regions (or the entire slide/section).
Generally, we consider very similar metrics as in single cell analysis:

- How many genes are detected in each spot?
- How many UMIs are detected in each spot?
- What's the complexity of each spot - do we have a lot of signal dominated by a handful of highly expressed genes?
- What's the mitochondrial/ribosomal gene content of each spot?

Unlike in single cell data, however, the distribution of these metrics in spatial data tends to be much less uniform, and no single threshold can be applied to every tissue.
Tissue composition and cellular density in different regions can result in a lot of variability across these metrics, so you need to evaluate them carefully the first time you work with any particular tissue. 

The code below calculates basic QC metrics using [scanpy](https://scanpy.readthedocs.io/en/stable/).
You can add your own metrics - for example, you can look at the prevalence of gene signatures associated with cellular stress. 
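
To make the meaning of the main metrics concrete, the sketch below recomputes three of them by hand with NumPy on a toy count matrix. The matrix values are invented for illustration; the column names mirror those that scanpy adds to `adata.obs`.

```python
import numpy as np
import pandas as pd

# Toy count matrix: 3 spots (rows) x 4 genes (columns), invented values
genes = ["mt-Co1", "Rpl3", "Actb", "Muc2"]
counts = np.array([
    [10, 5, 0, 5],
    [0, 0, 0, 0],
    [2, 2, 4, 12],
])

# Flag mitochondrial genes by prefix, as in the cell below
is_mt = np.array([g.startswith("mt-") for g in genes])

total_counts = counts.sum(axis=1)              # UMIs per spot
n_genes_by_counts = (counts > 0).sum(axis=1)   # genes with at least one count
# Guard against division by zero for empty spots
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / np.maximum(total_counts, 1)

qc = pd.DataFrame({
    "total_counts": total_counts,
    "n_genes_by_counts": n_genes_by_counts,
    "pct_counts_mt": pct_counts_mt,
})
print(qc)
```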

In [None]:
# Identify mitochondrial and ribosomal genes
adata.var["mt"] = adata.var_names.str.startswith("mt-")
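# Match Rps*/Rpl* only, so unrelated genes starting with "Rp" (e.g. Rpa1) are not counted
adata.var["rb"] = adata.var_names.str.contains(r"^Rp[sl]", regex=True)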

In [None]:
# Calculate QC metrics using scanpy
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=["mt", "rb"],
    inplace=True
)

In [None]:
# Inspect the updated object
adata

In [None]:
# Inspect the quality control data per spot
adata.obs

In [None]:
# Inspect the quality control data per gene
adata.var

Next, we want to visualise the QC metrics.
In this example, we visualise the distribution of the number of genes detected at each spot.
This is a good first-look metric for evaluating the quality of the data.
A high number of features may indicate areas with more complex or diverse cell types, while a low number might indicate poor-quality spots or regions with few active genes.

In this example, we can see that the distributions of gene detection rate under tissue and outside tissue overlap. In an ideal dataset, this should not happen!
This can happen for several reasons: 

1. Tissue detection has not worked well and under tissue areas are misclassified as outside tissue areas. This can happen if you have tissue that does not stain well.
2. The image is misaligned/rotated with respect to the data. Check the fiducial alignment in the Spaceranger outputs, especially the corner fiducials.
3. There are issues with tissue permeabilisation. Under-permeabilised tissue will have very low counts under tissue and higher counts outside tissue/at tissue edges, while over-permeabilised tissue can result in transcript diffusion/leakage outside tissue.

In [None]:
# Use seaborn histogram function to plot number of genes
sns.histplot(
    adata.obs,
    x="n_genes_by_counts",
    kde=True,
    bins=60,
    hue="under_tissue"
)

Next, we can visualise the spatial distribution.
In this case, we can see that counts outside the tissue area are observed mostly on one side of the tissue, suggesting a technical issue with that side of the slide.
This is a Visium experiment using older V1 kits, where the slide preparation is handled manually and we can end up with more technical variation (e.g. due to small pipetting errors) across the slide.
This is less likely to occur in [CytAssist](https://www.10xgenomics.com/instruments/visium-cytassist) experiments, however evaluating the slides for issues is still critical. 

In [None]:
sq.pl.spatial_scatter(
    adata,
    color=["n_genes_by_counts"],
    cmap="jet"
)

Similar to the number of genes detected, we can also plot the number of molecules/UMIs detected per spot.
The two metrics should generally correlate, but tissue regions with homogeneous cell types can show a lower number of detected genes while still having a high number of UMIs. 

Low molecule count and low gene count could mean that there are data quality issues in those regions.
This could be due to tissue quality, or non-optimal tissue permeabilisation or other technical issues.

But, it could also correspond to low cell density regions.
It is important to understand the structural composition of your samples before you throw any data away.

In spatial transcriptomics, there is generally much more variability in QC metrics between different regions than between different cell types in scRNA-Seq data.

In [None]:
# Seaborn histogram
sns.histplot(
    adata.obs,
    x="total_counts",
    hue="under_tissue",
    kde=True,
    bins=60
)

In [None]:
sq.pl.spatial_scatter(
    adata,
    color=["total_counts"],
    cmap="jet"
)

As in single cell data, we can also evaluate the percentage of mitochondrial and ribosomal genes per spot.
Do not expect the distributions (and thresholds for filtering) to mimic single cell datasets perfectly (e.g. 10% mitochondrial threshold). 

We can see in the plots that there is a bias in mitochondrial and ribosomal gene expression in parts of the slide.

The peanut-shaped region in the center of the slide with high ribosomal counts is an immune follicle.
As T-cells generally express high levels of ribosomal protein genes, this is the likely explanation.

We can also see higher percentage of mitochondrial genes in some parts of the slide, including non-tissue areas.
Generally, we would expect epithelial cells to have a higher mitochondrial gene count, but in this case it does not quite correspond to histology across all areas.
The left hand side and right hand side of the slide show shifts in these profiles, potentially indicating some technical artefacts.

In [None]:
# Seaborn violin plots for percentage mitochondrial and ribosomal RNA
fig, axs = plt.subplots(1, 2, figsize=(15, 4))
sns.violinplot(adata.obs["pct_counts_mt"], ax=axs[0])
sns.violinplot(adata.obs["pct_counts_rb"], ax=axs[1])

In [None]:
# Seaborn histogram of percentage mitochondrial RNA inside and outside tissue
sns.histplot(
    adata.obs,
    x="pct_counts_mt",
    hue="under_tissue",
    kde=True,
    bins=60
)

In [None]:
# Seaborn histogram of percentage ribosomal RNA inside and outside tissue
sns.histplot(
    adata.obs,
    x="pct_counts_rb",
    hue="under_tissue",
    kde=True,
    bins=60
)

In [None]:
# Spatial scatter of mitochondrial and ribosomal content
sq.pl.spatial_scatter(
    adata,
    color=["pct_counts_rb","pct_counts_mt"],
    cmap="jet"
)

### Spot complexity

The code below visualises a complexity metric, indicating the percentage of total unique molecules occupied by the top 50 most highly expressed genes per spot in the spatial transcriptomics dataset.
This can offer insights into the complexity and quality of each spot's transcriptome, particularly highlighting regions with very low diversity.

A high percentage indicates that a small number of genes dominate the transcriptome in that spot, suggesting low complexity.
Conversely, a lower percentage suggests a more diverse and complex transcriptome.

As we would expect, the complexity of (most) spots outside the tissue covered area is very low and under tissue (mostly) high.
So, although we often detect transcripts outside the tissue covered area, these tend to be dominated by few, highly abundant genes and are generally background.
In this slide, this is not always the case though!

We can use the values of QC metrics outside tissue as a guide to identify poor-quality spots under tissue.
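
As a hand-worked illustration of this metric, the sketch below computes the percentage of counts taken up by the top N genes per spot with NumPy. The helper `pct_counts_in_top_n` is hypothetical, and the toy matrix uses the top 2 genes rather than the top 50 so that the numbers are easy to follow.

```python
import numpy as np


def pct_counts_in_top_n(counts, n):
    """Percentage of each spot's raw counts captured by its n most abundant genes."""
    counts = np.asarray(counts, dtype=float)
    # Sort each spot's counts in descending order and sum the n largest
    top_n = np.sort(counts, axis=1)[:, ::-1][:, :n].sum(axis=1)
    totals = counts.sum(axis=1)
    # Guard against division by zero for empty spots
    return 100 * top_n / np.maximum(totals, 1)


toy = np.array([
    [90, 5, 3, 2],      # dominated by one gene: low complexity
    [25, 25, 25, 25],   # evenly spread: high complexity
])
pct_top2 = pct_counts_in_top_n(toy, n=2)
print(pct_top2)
```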

In [None]:
sq.pl.spatial_scatter(
    adata,
    color=["pct_counts_in_top_50_genes"],
    cmap="jet"
)

Here we can see more clearly how the side of the tissue that has "leaked" has spot complexity comparable to that of spots under tissue. We should be very careful when looking at that area of the sample going forwards, as diffusion outside the tissue also means diffusion within the tissue in that area, which is harder to see and filter out. 

In [None]:
# Seaborn histogram of RNA complexity inside and outside tissue
sns.histplot(
    adata.obs,
    x="pct_counts_in_top_50_genes",
    hue="under_tissue",
    kde=True,
    bins=60
)

Inspecting the data allows us to make some decisions on the appropriate QC metrics and thresholds to use to clean up poor quality spots.
However, unlike single cell data, a lot of the variation in spatial transcriptomics data is due to tissue architecture.
We can expect a more acellular region to have fewer detected transcripts than a structure with dense or very transcriptionally active cells, which means applying uniform filtering thresholds across the whole slide could result in removal of good quality data. 

One alternative is a tool called [SpotSweeper](https://pypi.org/project/spotsweeper/), which examines whether QC metrics for a spot are outliers within their local neighbourhood.
This can be useful to flag up any technical issues - for example, sometimes we see systematic slide printing errors where probes in spots near slide edges will have very low counts, or spots in a very particular pattern.  

We can examine individual QC metrics to flag local outliers. 

However, as the purpose of [SpotSweeper](https://pypi.org/project/spotsweeper/) is to flag local outliers, if there are quality issues with a large part/the entire slide, poor quality spots will not get flagged, so you cannot rely on it entirely. 
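
To build intuition for what "local outlier" means, here is a minimal NumPy sketch of the general idea: compare each spot's metric with the median of its nearest neighbours and flag spots that sit far above it. This is an illustration under simplifying assumptions (brute-force distances, a robust z-score, an arbitrary cutoff of 3), not SpotSweeper's actual implementation, and the helper `local_outlier_flags` is hypothetical.

```python
import numpy as np


def local_outlier_flags(coords, values, n_neighbors=8, cutoff=3.0):
    """Flag spots whose value is unusually high relative to their spatial neighbours."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Pairwise squared distances between all spots (fine for small toy data)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    # Indices of each spot itself (distance 0) plus its nearest neighbours
    idx = np.argsort(d2, axis=1)[:, : n_neighbors + 1]
    neigh = values[idx]
    med = np.median(neigh, axis=1)
    mad = np.median(np.abs(neigh - med[:, None]), axis=1)
    # 1.4826 rescales the MAD to be comparable to a standard deviation;
    # the small constant avoids division by zero
    z = (values - med) / (1.4826 * mad + 1e-9)
    return z > cutoff


# Toy 5 x 5 grid of spots with a mildly varying metric and one spike in the middle
xs, ys = np.meshgrid(np.arange(5), np.arange(5))
coords = np.column_stack([xs.ravel(), ys.ravel()])
values = 20.0 + (coords[:, 0] + coords[:, 1]) % 3
values[12] = 60.0  # spot at grid position (2, 2)

flags = local_outlier_flags(coords, values, n_neighbors=8)
print(np.flatnonzero(flags))
```

Because each spot is only compared with its own neighbourhood, a whole region of uniformly poor spots would not be flagged, which is the limitation described above.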

In [None]:
sw.local_outliers(
    adata,
    metric="pct_counts_in_top_50_genes",
    direction="higher",
    n_neighbors=36,
    sample_key="under_tissue",
    coord_key="spatial",
    log=False
)

Here we can see that we have a few low complexity spots within the tissue surrounded by high complexity spots, so these get flagged as local outliers. The spots outside the tissue, which are surrounded by other low complexity spots, are not flagged. 

In [None]:
sq.pl.spatial_scatter(
    adata,
    color="pct_counts_in_top_50_genes",
    cmap="jet"
)

In [None]:
adata.obs['pct_counts_in_top_50_genes_outliers_cat'] = adata.obs['pct_counts_in_top_50_genes_outliers'].astype('category')
sq.pl.spatial_scatter(
    adata,
    color="pct_counts_in_top_50_genes_outliers_cat"
)

> **Exercise:**
> 1. How would you detect whether other QC metrics are local spatial outliers?
> 2. Can you aggregate multiple metrics to flag spots that are outliers in more than one QC area? Can you include additional metrics in the below example?

In [None]:
sw.local_outliers(
    adata,
    metric="pct_counts_mt",
    direction="higher",
    n_neighbors=36,
    sample_key="under_tissue",
    coord_key="spatial",
    log=False
)
sq.pl.spatial_scatter(
    adata,
    color="pct_counts_mt",
    cmap="jet"
)

In [None]:
adata.obs['pct_counts_mt_outliers_cat'] = adata.obs['pct_counts_mt_outliers'].astype('category')
sq.pl.spatial_scatter(
    adata,
    color="pct_counts_mt_outliers_cat"
)

In [None]:
adata.obs['all_metrics_outlier'] = (
    adata.obs['pct_counts_mt_outliers'] | adata.obs['pct_counts_in_top_50_genes_outliers'] 
)

In [None]:
adata.obs['all_metrics_outlier_cat'] = adata.obs['all_metrics_outlier'].astype('category')
sq.pl.spatial_scatter(adata, color="all_metrics_outlier_cat")

### Filtering

Once you have investigated your samples and have a good idea of their quality and potential issues, the next step is to filter out areas of the slide that have technical artefacts, are of poor quality or are outside the tissue. 

You can also exclude genes that are expressed in only a few spots, but that's generally not necessary.

Below, we are removing spots that are:

1. Outside the tissue-covered area
2. Low complexity (30% or more of counts in the top 50 genes)
3. Low in detected genes (1,000 or fewer)
4. Local outliers

In this slide, you might also want to consider:

1. Removing spots in tissue where there is a hair. Small debris is not uncommon in ST. 
2. Removing spots in the left hand side of the slide which appears to be affected a lot by permeabilisation issues. We will keep this for now to see how this affects downstream analysis!
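
Before applying a combined mask, it can help to see how many spots each criterion would remove on its own. The sketch below does this on a toy metadata table with invented values; the column names and thresholds mirror those used in the filtering cell, and with real data `obs` would be `adata.obs`.

```python
import pandas as pd

# Toy spot metadata; values are invented for illustration
obs = pd.DataFrame({
    "in_tissue": [1, 1, 0, 1, 1],
    "all_metrics_outlier": [False, True, False, False, False],
    "pct_counts_in_top_50_genes": [22.0, 25.0, 60.0, 35.0, 20.0],
    "n_genes_by_counts": [3000, 2500, 200, 1500, 800],
})

# Each entry is True for spots that PASS that criterion
criteria = {
    "in_tissue": obs["in_tissue"] == 1,
    "not_local_outlier": ~obs["all_metrics_outlier"],
    "complexity": obs["pct_counts_in_top_50_genes"] < 30,
    "n_genes": obs["n_genes_by_counts"] > 1000,
}

# Number of spots failing each criterion individually
failed_per_criterion = pd.Series(
    {name: int((~keep).sum()) for name, keep in criteria.items()}
)
# A spot is kept only if it passes every criterion
keep_all = pd.concat(criteria, axis=1).all(axis=1)

print(failed_per_criterion)
print(f"Spots kept: {int(keep_all.sum())} of {len(obs)}")
```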

In [None]:
mask = (
    (adata.obs["in_tissue"] == True) &
    (adata.obs["all_metrics_outlier"] == False) &
    (adata.obs["pct_counts_in_top_50_genes"] < 30) &
    (adata.obs["n_genes_by_counts"] > 1000)
)

print(f"Barcodes before filtering: {adata.n_obs}")

adata = adata[mask].copy()

print(f"Barcodes after filtering: {adata.n_obs}")

A visual check on what we filtered out:

In [None]:
sq.pl.spatial_scatter(
    adata,
    color="under_tissue"
)

## 3. Normalisation, Feature Selection & Dimensionality Reduction

Just as in the standard single-cell RNAseq data analysis workflow we now perform normalisation to adjust for spot library size. 
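
For intuition, the sketch below reproduces library size normalisation followed by a log transform by hand on a toy matrix. It assumes the default behaviour of scaling every spot to the median total count and then applying a natural `log(1 + x)`; the count values are invented.

```python
import numpy as np

# Toy counts: 3 spots x 3 genes, with very different library sizes
counts = np.array([
    [10, 0, 30],     # total 40
    [100, 50, 50],   # total 200
    [5, 5, 10],      # total 20
], dtype=float)

totals = counts.sum(axis=1)
target = np.median(totals)                    # common library size to scale to
scaled = counts / totals[:, None] * target    # every spot now sums to `target`
log_norm = np.log1p(scaled)                   # natural log of (1 + x)

print(scaled.sum(axis=1))
print(log_norm.round(2))
```

After scaling, differences between spots reflect the relative share of each gene rather than how many UMIs the spot captured overall.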

Let's pick a gene and see what it looks like before normalisation.

In [None]:
sq.pl.spatial_scatter(
    adata,
    color="Myh11",
    title="Raw Counts Myh11"
)

Normalise the data

In [None]:
# Normalisation
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)

Now we can see that Myh11 values are more uniform across the muscularis layers, rather than being concentrated in the center where the high-count spots are.

In [None]:
sq.pl.spatial_scatter(
    adata,
    color="Myh11",
    use_raw=False,
    title="Normalised Myh11"
)

In unbiased ST experiments that measure the whole transcriptome, a lot of genes measured are not expressed, or are not differentially expressed or are generally uninformative.
For downstream analysis, they add noise without contributing much information.
Therefore, as in single cell data analysis, we want to select only a subset of genes which capture the most variability in the dataset for clustering analysis. 

NOTE - If you were working with targeted ST with smaller gene panels, this step is not required. 
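
The idea behind this kind of feature selection can be sketched with a simple dispersion score (variance divided by mean) per gene. This is a deliberately simplified illustration: scanpy's default method additionally bins genes by mean expression and normalises the dispersion within each bin, which is omitted here. The toy expression values and gene labels are invented.

```python
import numpy as np

# Toy expression: 8 spots x 3 genes, all with the same mean but different spread
expr = np.column_stack([
    np.full(8, 5.0),             # "flat": identical in every spot
    np.tile([4.0, 6.0], 4),      # "mild": small spot-to-spot variation
    np.repeat([0.0, 10.0], 4),   # "regional": off in one half, on in the other
])
gene_names = np.array(["flat", "mild", "regional"])

mean = expr.mean(axis=0)
var = expr.var(axis=0)
# Dispersion = variance relative to mean; guard against division by zero
dispersion = var / np.maximum(mean, 1e-12)

# Rank genes from most to least dispersed and keep the top 2
order = np.argsort(dispersion)[::-1]
top_genes = gene_names[order[:2]]
print(dict(zip(gene_names, dispersion)))
print(top_genes)
```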

In [None]:
# Feature selection
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=2000
)

We can now see that, for each gene, means and dispersions have been calculated and a `highly_variable` flag has been added to indicate whether it is among the top highly variable genes.

In [None]:
adata.var

The code below automatically identifies the most variable gene and plots its normalised expression:

In [None]:
gene = adata.var["dispersions_norm"].sort_values(ascending=False).index[0]

sq.pl.spatial_scatter(
    adata,
    color=gene,
    use_raw=False,
    title=f"Most variable: {gene}"
)

Reducing dimensionality is a crucial step in the analysis of high-dimensional spatial transcriptomics data.
It simplifies the data structure while retaining its most informative features. 
 
Here, we'll run Principal Component Analysis (PCA) for dimensionality reduction.
PCA reduces the dimensionality of the dataset by transforming it into a set of orthogonal axes, known as principal components (PCs), which capture the most variance in the data.
By default, we'll store the top 50 PCs, which are typically more than enough to capture all the major axes of variation in spatial data.

In [None]:
# dimensionality reduction using PCA
sc.pp.pca(adata)

After PCA, we need to decide how many PCs to use for clustering and further analyses.

An *ElbowPlot* is used to select the number of PCs to include in downstream analyses.

By plotting the percentage of variance explained by each PC, it helps identify the point at which additional PCs contribute minimal additional variance.
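
The quantity shown in an elbow plot can be computed directly from a singular value decomposition. The sketch below does this on simulated data with two strong latent factors, then counts how many PCs are needed to reach 90% cumulative variance. The simulated data and the 90% cutoff are arbitrary assumptions for illustration; a cumulative threshold is just one heuristic to use alongside visual inspection of the elbow.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: 100 spots x 20 genes, two strong latent factors plus noise
latent = rng.normal(size=(100, 2))
loadings = rng.normal(size=(2, 20))
X = (latent @ loadings) * 5 + rng.normal(size=(100, 20))

Xc = X - X.mean(axis=0)                               # centre each gene
singular_values = np.linalg.svd(Xc, compute_uv=False)
variance = singular_values ** 2 / (X.shape[0] - 1)    # variance along each PC
variance_ratio = variance / variance.sum()            # fraction explained per PC
cumulative = np.cumsum(variance_ratio)

# First PC count at which cumulative variance reaches 90%
n_pcs_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(variance_ratio[:5].round(3))
print(f"PCs needed for 90% of variance: {n_pcs_90}")
```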

In [None]:
# Elbow plot
sc.pl.pca_variance_ratio(
    adata,
    n_pcs=50,
    log=True
)

Biological relevance: Also consider biological knowledge and visual inspection of PCA plots to ensure that selected PCs capture relevant patterns.

We can achieve this by visualising gene loadings of each PC to understand what kind of biological variation they capture.

 - PC 1: Might highlight major tissue regions or dominant cell types OR very dominant technical effects.
 - PC 10: Could reflect variation related to specific cellular processes or less prominent tissue structures.
 - PC 30: Might capture noise or subtle patterns, like rare cell states or technical variations.

In [None]:
sc.pl.pca_loadings(adata, components=[1, 2, 3, 10, 20, 30])

We have spatial data, so we can also see which regions in the slide load highly on which PC, which can help interpret whether PCs are capturing true biological variability or something technical. 

For example, we see that PC1 seems to be capturing the difference between mucosa and muscularis/submucosal layers, PC4 seems to correspond to lymphoid structures. 

In [None]:
adata.obs["PC1"] = adata.obsm["X_pca"][:, 0]
sq.pl.spatial_scatter(adata, color="PC1", cmap="jet")

In [None]:
adata.obs["PC4"] = adata.obsm["X_pca"][:, 3]
sq.pl.spatial_scatter(adata, color="PC4", cmap="jet")

## 4. UMAP and Clustering

Next, as in single cell analysis, we want to embed the spots into two-dimensional space to visualise which spots are similar to each other and which are different.

Embedding spatial transcriptomics data using Uniform Manifold Approximation and Projection (UMAP) is a powerful way to visualise high-dimensional datasets in a two-dimensional space.

**Preserve Structure:**
UMAP is designed to maintain both local and global data structures, which helps in visualising the relationships between data points accurately, although distances should be interpreted carefully. 

**Cluster Separation:**
UMAP often provides better separation of clusters than other methods, making it useful for identifying distinct groups within data. 

**Scalability:**
UMAP is computationally efficient, allowing it to handle large datasets commonly found in spatial transcriptomics.

Here, the `n_pcs=` parameter is important - we want to select it based on the elbow plot.
Too few PCs will miss important structures in the data, and too many will add too much noise.

In [None]:
# UMAP
sc.pp.neighbors(
    adata,
    n_pcs=10
)
sc.tl.umap(adata)

Next, we want to find clusters in the data.
We will use Leiden clustering, using the graph created in the previous step.
The clustering algorithm partitions the spots into distinct groups.

- The resolution parameter determines the granularity of the clustering.
- *Resolution = 0.5* is a common starting point for many analyses, providing a balance between sensitivity and specificity.
- However, the optimal resolution varies based on dataset complexity and biological context.
- Higher resolution leads to more, smaller clusters. Useful for detailed analyses where subtle differences are biologically meaningful.
- Lower resolution results in fewer, larger clusters. Suitable when you expect broader, more general differences in the data.
- *n_iterations*: how many iterations of the Leiden clustering algorithm to perform. Positive values above 2 define the total number of iterations to perform, while -1 runs the algorithm until it reaches its optimal clustering. 2 is faster and is the default for the underlying packages.
- *key_added*: the `adata.obs` variable name under which to add the cluster labels.
- *flavor*: which package's implementation of the clustering algorithm to use. `igraph` is not the current default for compatibility reasons, but it is the preferred method, and scanpy warns that the default will switch from `leidenalg` to `igraph` in a future release. 

In [None]:
# Clustering
sc.tl.leiden(
    adata,
    key_added="clusters",
    flavor="igraph",
    n_iterations=2,
    resolution=0.5
)

We can now visualise detected clusters in UMAP space (based on transcriptional similarity) and tissue space

In [None]:
sc.pl.umap(
    adata,
    color="clusters"
)

In [None]:
sq.pl.spatial_scatter(
    adata,
    color="clusters"
)

##### The choice of cluster resolution significantly impacts the granularity of the resulting clusters.

##### Let's explore the concept of clustering resolution and how to determine the optimal value for the analysis by trying high and low values.

In [None]:
sc.tl.leiden(adata, key_added="clusters_res0.1", flavor="igraph", directed=False, n_iterations=2, resolution=0.1)
sc.pl.umap(adata, color="clusters_res0.1")
sq.pl.spatial_scatter(adata, color="clusters_res0.1")

In [None]:
sc.tl.leiden(adata, key_added="clusters_res1", flavor="igraph", directed=False, n_iterations=2, resolution=1)
sc.pl.umap(adata, color="clusters_res1")
sq.pl.spatial_scatter(adata, color="clusters_res1")
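
One way to see how clusters at a low resolution split at a higher one is to cross-tabulate the two sets of labels. The sketch below uses a toy table with invented labels; with the real object, the two columns would be `adata.obs["clusters_res0.1"]` and `adata.obs["clusters_res1"]`.

```python
import pandas as pd

# Toy cluster labels for six spots at two resolutions (invented values)
labels = pd.DataFrame({
    "clusters_res0.1": ["0", "0", "0", "1", "1", "1"],
    "clusters_res1":   ["0", "0", "2", "1", "1", "3"],
})

# Rows: low-resolution clusters; columns: high-resolution clusters
table = pd.crosstab(labels["clusters_res0.1"], labels["clusters_res1"])
print(table)
```

Rows with counts spread over several columns indicate a coarse cluster that has been subdivided at the higher resolution.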

## Marker Genes

Next, we want to figure out what various clusters represent.
To do so, we can find cluster marker genes. 

The `sc.tl.rank_genes_groups()` function is commonly used for this purpose, allowing you to perform differential expression analysis to identify genes that are significantly enriched in specific clusters compared to others.

#### Wilcoxon Rank Sum Test (method = "wilcoxon")

This non-parametric test is very commonly used, has been shown to perform well for detecting cluster marker genes and is suitable when no confounding variables need to be adjusted. 

NOTE: It is not suitable for detecting differentially expressed genes between biological groups with replicates, as it does not model individual sample effects in any way. For this, you want model-based or pseudobulk approaches which are more conservative.

#### Student's t-test (method = "t-test")

- **Parametric:** Assumes data is normally distributed.
- Use Case: Suitable for large datasets where normality can be assumed.
 
#### Student's t-test with over-estimated variance (method = "t-test_overestim_var")
 
Variant of t-test with overestimated variance that is more conservative

#### Logistic Regression (method = "logreg")

- Regression: Uses logistic regression to assess differential expression.
- Use Case: Useful for handling confounding variables.


In [None]:
# Identify cluster marker genes
sc.tl.rank_genes_groups(
    adata,
    "clusters",
    method="wilcoxon",
    use_raw=False
)

The results table is stored as part of the unstructured data in the `AnnData` object, but that's not very useful for browsing the genes list!

In [None]:
adata.uns["rank_genes_groups"]

We can view it as a table using the scanpy helper function `sc.get.rank_genes_groups_df()`. For example, to fetch the genes for cluster 1:

In [None]:
sc.get.rank_genes_groups_df(
    adata,
    group="1"
)
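
The returned table is a regular pandas dataframe, so it can be filtered like any other. The sketch below builds a toy table with the same column names and keeps only genes that are significantly and strongly enriched. The gene names, values and cutoffs (adjusted p-value below 0.05, log fold change above 1) are invented for illustration and should be tuned to your data.

```python
import pandas as pd

# Toy marker table mimicking the columns of the results dataframe
markers = pd.DataFrame({
    "names": ["GeneA", "GeneB", "GeneC", "GeneD"],
    "scores": [12.1, 8.4, 2.0, -5.3],
    "logfoldchanges": [3.2, 0.4, 1.5, -2.0],
    "pvals": [1e-30, 1e-12, 0.04, 1e-6],
    "pvals_adj": [1e-27, 1e-10, 0.2, 1e-5],
})

# Keep genes that are both significant and clearly up-regulated in the cluster
strong = markers[(markers["pvals_adj"] < 0.05) & (markers["logfoldchanges"] > 1)]
strong = strong.sort_values("scores", ascending=False)
print(strong)
```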

We can also use standard [scanpy](https://scanpy.readthedocs.io/en/stable/) plotting functions to plot heatmaps to visualise how gene groups differ between clusters.
For example, here we are visualising the top 5 genes per cluster.

In [None]:
sc.pl.rank_genes_groups_heatmap(
    adata,
    n_genes=5,
    groupby="clusters",
    standard_scale="var",
    show_gene_labels=True,
    use_raw=False
)
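
The `standard_scale="var"` argument rescales each gene to the 0 to 1 range (subtract the minimum, then divide by the maximum), so that highly and lowly expressed genes share one colour scale. A minimal numpy sketch of that rescaling on a made-up matrix:

```python
import numpy as np

# Made-up mean expression: rows are clusters, columns are genes
expr = np.array([
    [0.0, 10.0],
    [2.0, 30.0],
    [4.0, 20.0],
])

# Per gene (column): subtract the minimum, then divide by the maximum
shifted = expr - expr.min(axis=0)
scaled = shifted / shifted.max(axis=0)
print(scaled)
```

After scaling, every gene spans 0 to 1, so the heatmap shows where each gene is relatively highest rather than which gene has the largest absolute expression.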

In [None]:
# Visualization of cluster markers in spatial coordinates
sq.pl.spatial_scatter(
    adata,
    color=["clusters", "Muc2"],
    use_raw=False
)

## 5. Spatial Clustering

So far in our analysis, we have not used spatial information (outside QC). 

As you can appreciate, single-cell-style workflows can get you very far in spatial transcriptomics analysis, and transcriptome-only clustering appears to correspond to histology and tissue structure reasonably well.

However, there are several ways to incorporate spatial information that could yield better results here. 

Typically, spatial clustering approaches fall into a few broad categories:

1. Expression-only clustering -
   no spatial information is considered; this is the workflow we used above.
   Fast, with no additional bells and whistles.
2. Expression-only clustering with spatially informed feature selection.
3. Graph-based spatial clustering -
   build a graph using both expression similarity and spatial adjacency,
   then cluster the graph.
4. Spatially smoothed / regularized expression clustering -
   cluster on expression, but penalise spatial discontinuity.
5. Joint expression–spatial embeddings -
   learn a latent space combining expression and coordinates,
   typically with deep models or graph embeddings that ingest both.
6. Image-informed spatial clustering -
   incorporate additional features from the H&E image.

Next, we will explore some of these, but this is a very active field of development and many alternatives are available.

### Spatially variable genes

We will first try a simple extension to our previous approach: expression-only clustering with spatially informed feature selection.

Previously, we calculated variable features as input without considering spatial information. 
Here, we will modify the workflow and use spatially variable genes instead. 

Spatial variation can be caused by differences in cell-type composition, overall functional dependencies, or cell-cell communication events, and characterising it helps us understand the underlying tissue biology.
Methods for identifying spatially variable genes (SVGs) quantify whether a gene shows a significant spatial pattern, typically by decomposing the variation in the dataset into spatial and non-spatial components.
Several methods have been proposed for this task with varying complexity and different assumptions.
Currently there is no consensus on which method works best and how to define spatial variability in general. 

- SpatialDE (Svensson et al., 2018), SpatialDE2 (Kats et al., 2021) and SPARK (Zhu et al., 2021; Sun et al., 2020) use spatial correlation testing. 
- Sepal (Andersson and Lundeberg, 2021) leverages a Gaussian diffusion on spatial expression.
- scGCO (Zhang et al., 2022) utilizes a graph cut method.
- SpaGCN (Hu et al., 2021) identifies SVGs based on spatial domains identified through a graph convolutional neural network.

In this tutorial, we will use one of the most commonly used approaches, Moran's I. 

### Moran’s I 

- Moran's I is a measure of spatial autocorrelation.
- It quantifies the degree to which similar values occur near each other in a spatial dataset.
- It is widely used in spatial analysis to determine if there is a pattern in the spatial distribution of a particular variable, such as gene expression levels in spatial transcriptomics data.
- Moran's I helps identify spatially variable features that may be linked to biological structures or processes.
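
As a worked illustration, Moran's I can be written as $I = \frac{N}{W} \frac{\sum_{ij} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}$, where $w_{ij}$ are the spatial weights and $W$ is their sum. The sketch below computes it with numpy for six hypothetical spots in a row with binary neighbour weights. The helper `morans_i` is an illustrative assumption, and the weighting details in squidpy may differ.

```python
import numpy as np


def morans_i(values, weights):
    """Moran's I for one feature given a spot-by-spot weight matrix."""
    z = values - values.mean()
    numerator = (weights * np.outer(z, z)).sum()
    denominator = (z ** 2).sum()
    return len(values) / weights.sum() * numerator / denominator


# Six spots in a row; each spot neighbours the next one (binary, symmetric weights)
n = 6
weights = np.zeros((n, n))
for i in range(n - 1):
    weights[i, i + 1] = weights[i + 1, i] = 1.0

clustered = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
alternating = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

i_clustered = morans_i(clustered, weights)      # positive: similar values sit together
i_alternating = morans_i(alternating, weights)  # negative: neighbouring values differ
print(i_clustered, i_alternating)
```

Both toy genes have exactly the same values and therefore the same variance. Only their spatial arrangement differs, which is why a gene can be highly variable without being spatially variable.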

We can calculate Moran's I in [squidpy](https://squidpy.readthedocs.io/en/stable/) by first identifying spatial neighbours and then calling the `spatial_autocorr()` function.
The outputs are stored as part of the unstructured data (`adata.uns["moranI"]`) in [AnnData](https://anndata.readthedocs.io/en/stable/).

In [None]:
sq.gr.spatial_neighbors(adata)

In [None]:
sq.gr.spatial_autocorr(
    adata,
    mode="moran",
    genes=adata.var_names
)

In [None]:
adata.uns["moranI"].head()

We can visualise the spatial expression of the top spatially variable features to check that they do show clear spatial patterns.

In [None]:
sq.pl.spatial_scatter(
    adata,
    color=["mt-Co3", "Muc2"]
)

> **Exercise:**
What's the difference between variable genes and spatially variable genes? In this case we can see that a lot of genes with high dispersion we identified earlier have low spatial variability and vice versa. 

In [None]:
df = pd.DataFrame({
    "moranI": adata.uns["moranI"]["I"],
}).join(
    adata.var[["dispersions_norm"]]
)

plt.scatter(df["moranI"], df["dispersions_norm"], s=5)
plt.xlabel("Moran's I")
plt.ylabel("Normalised dispersion")

In [None]:
# select top 2000 genes by Moran’s I
gene_list = (
    adata.uns["moranI"]["I"]
    .sort_values(ascending=False)
    .head(2000)
    .index
)

# flag genes for PCA
adata.var["use_for_pca"] = adata.var_names.isin(gene_list)

# run PCA on those genes only
# (note: this overwrites the PCA and expression neighbour graph computed earlier)
sc.tl.pca(adata, mask_var="use_for_pca")

# run neighbour detection and clustering
sc.pp.neighbors(adata, n_pcs=10)
sc.tl.leiden(adata, key_added="clusters_morans", flavor="igraph", directed=False, n_iterations=2, resolution=0.5)

Let's visualise the result. The difference between the two clustering solutions is not very big, but spatially variable genes tend to give a more spatially contiguous solution.

In [None]:
sq.pl.spatial_scatter(
    adata,
    color=["clusters_morans", "clusters"]
)
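
Besides comparing the plots by eye, a contingency table shows how spots move between two clustering solutions. The sketch below uses a made-up table of six spots. With real data you would pass `adata.obs["clusters"]` and `adata.obs["clusters_morans"]` instead.

```python
import pandas as pd

# Made-up cluster labels for six spots under two clustering solutions
labels = pd.DataFrame({
    "clusters": ["0", "0", "1", "1", "2", "2"],
    "clusters_morans": ["0", "0", "1", "2", "2", "2"],
})

# Rows: original clusters; columns: clusters from spatially variable genes
overlap = pd.crosstab(labels["clusters"], labels["clusters_morans"])
print(overlap)
```

Counts concentrated in one cell per row indicate that the two solutions largely agree. Rows spread across several columns point to clusters that were split or merged.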

### Banksy

The second spatial clustering method we will try is the Banksy algorithm ([Banksy paper](https://www.nature.com/articles/s41588-024-01664-3)), which performs consistently well across multiple benchmarks.

Banksy is a spatial clustering algorithm that integrates both gene expression and spatial proximity. 

It works by augmenting each spot’s gene expression with information from its spatial neighbours, then applying standard dimensionality reduction and clustering.

Conceptually, it does the following:

1. **Start from expression features:** each spot has a vector of gene expression values

2. **Build a spatial neighbourhood graph:** for every spot, BANKSY identifies nearby spots using spatial coordinates and distance-based neighbours.

3. **Aggregate neighbour expression:** for each spot, it computes a weighted average of its neighbours’ expression, where weights decay with distance.

4. **Augment the feature space:** the original expression and the neighbour-aggregated expression are combined into a new feature representation, controlled by a mixing parameter lambda that sets how strongly spatial context influences each spot.

5. **Run standard embedding and clustering:** PCA is applied to the augmented features, followed by graph-based clustering.

Because each spot’s features already encode local spatial context, clusters tend to form contiguous tissue domains without explicitly enforcing spatial smoothness in the clustering objective.

In short, BANKSY turns expression + neighbourhood expression into a single feature matrix, so spatial structure is baked into the data before clustering.
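
The steps above can be sketched in a few lines of numpy. This is a deliberately simplified illustration and not the Banksy implementation: it uses an unweighted mean over the `k` nearest spots instead of distance-decayed weights, and omits any further neighbourhood terms the package computes. The function name `augment_with_neighbours`, the `sqrt(1 - lam)` / `sqrt(lam)` scaling and the toy data are assumptions made for illustration.

```python
import numpy as np


def augment_with_neighbours(expr, coords, k=2, lam=0.2):
    """Concatenate each spot's expression with the mean expression of its k nearest spots."""
    # Pairwise Euclidean distances between spots
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a spot is not its own neighbour
    nbr_idx = np.argsort(dists, axis=1)[:, :k]
    nbr_mean = expr[nbr_idx].mean(axis=1)  # simplified: unweighted neighbour mean
    # lam sets how much the neighbourhood contributes relative to the spot itself
    return np.hstack([np.sqrt(1 - lam) * expr, np.sqrt(lam) * nbr_mean])


# Toy data: four spots on a line, two genes
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
expr = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 5.0]])

augmented = augment_with_neighbours(expr, coords, k=2, lam=0.2)
print(augmented.shape)  # spots x (2 * genes)
```

The augmented matrix has twice as many columns as the input: the first half describes the spot itself and the second half its neighbourhood. Raising `lam` shifts weight towards the neighbourhood columns, so the downstream PCA and clustering become more spatially driven.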


Set Banksy parameters.

Two critical parameters to consider are `lambda` and `k_geom`.
`k_geom` controls how far around each spot we look (the number of spatial neighbours), and `lambda` controls how much weight is given to neighbourhood features relative to the spot's own features during clustering.

Conceptually, this means the difference between some spatial "smoothing" (low `k_geom`, low `lambda`) and detection of broader spatial structures (high `k_geom`, high `lambda`). 

In [None]:
k_geom = 15 
lambda_list = [0.2] 

Select top highly variable genes from your previous [AnnData](https://anndata.readthedocs.io/en/stable/) object.

In [None]:
adata_banksy = adata[:, adata.var["highly_variable"]].copy()

Find the median distance to the closest neighbours.

In [None]:
# from banksy.main import median_dist_to_nearest_neighbour
nbrs = median_dist_to_nearest_neighbour(
    adata_banksy,
    key = "spatial"
)
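
For intuition, the quantity computed here can be sketched with plain numpy: find each spot's distance to its nearest neighbour, then take the median. The coordinates below are made up and this is not the Banksy implementation.

```python
import numpy as np

# Made-up spot coordinates: four spots on a regular grid plus one isolated spot
coords = np.array([
    [0.0, 0.0],
    [100.0, 0.0],
    [200.0, 0.0],
    [0.0, 100.0],
    [500.0, 500.0],
])

# Pairwise Euclidean distances, ignoring each spot's distance to itself
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)

nearest = dists.min(axis=1)      # distance from each spot to its closest neighbour
median_nn = np.median(nearest)   # robust to the isolated spot
print(median_nn)
```

Using the median rather than the mean keeps the estimate close to the grid spacing even when a few spots are isolated, for example at tissue edges.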

Initialise Banksy.

In [None]:
# from banksy.initialize_banksy import initialize_banksy
banksy_dict = initialize_banksy(
    adata_banksy,
    ('xcoord', 'ycoord', 'spatial'),
    k_geom,
    nbr_weight_decay="scaled_gaussian",
    max_m=1,
    plt_edge_hist=False,
    plt_nbr_weights=False,
    plt_agf_angles=False,
    plt_theta=False,
)

Generate the Banksy matrix, which contains neighbourhood-weighted gene expression features for the selected parameters.

In [None]:
# from banksy.embed_banksy import generate_banksy_matrix
banksy_dict, banksy_matrix = generate_banksy_matrix(
    adata_banksy,
    banksy_dict,
    lambda_list,
    1,  # max_m, same value as passed to initialize_banksy above
)

We can now use this matrix as input for PCA, UMAP and clustering. 

In [None]:
# from banksy_utils.umap_pca import pca_umap
pca_umap(
    banksy_dict,
    pca_dims=[20],
    add_umap=True,
    plt_remaining_var=False,
)

In [None]:
# from banksy.cluster_methods import run_Leiden_partition
results_df, max_num_labels = run_Leiden_partition(
    banksy_dict,
    [0.7],
    num_nn=50,
    num_iterations=-1,
    partition_seed=1234,
    match_labels=True,
)

In [None]:
# Copy the Banksy labels from the first row of the results table back to the main object
adata.obs['clusters_banksy'] = results_df.relabeled.iloc[0].dense
adata.obs['clusters_banksy'] = adata.obs['clusters_banksy'].astype("category")
sq.pl.spatial_scatter(adata, color='clusters_banksy')

> **Exercise:**
How does the Banksy clustering solution compare to previous ones?
Try varying the parameter settings to understand how to focus the data partitioning more on space or more on the individual spot transcriptomes.

In [None]:
sq.pl.spatial_scatter(
    adata,
    color=['clusters_banksy', "clusters_morans", "clusters"]
)

### Saving the processed object for further analysis

Save your [AnnData](https://anndata.readthedocs.io/en/stable/) object as an [h5ad](https://anndata.readthedocs.io/en/latest/fileformat-prose.html) file.
You can load this back at any time to continue the analysis. 

In [None]:
# Set location to store analysis output
OUTPUT_FOLDERNAME = "/nvme/project/USERNAME/PATH_TO_DAY1/"
adata.write(os.path.join(OUTPUT_FOLDERNAME, 'day1.h5ad'))
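
Writing fails if the output folder does not exist yet, so it can help to create it first. A minimal sketch, using a temporary directory as a stand-in for your project folder (the `day1_output` folder name is hypothetical):

```python
import os
import tempfile

# Stand-in for your project folder; replace with your own path
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "day1_output")

# Create the folder if needed; exist_ok avoids an error when it is already there
os.makedirs(output_dir, exist_ok=True)

output_path = os.path.join(output_dir, "day1.h5ad")
print(output_path)
```

You can then call `adata.write(output_path)` and later reload the object with `sc.read_h5ad(output_path)`.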