# DAY 3: Imaging-based Spatial Transcriptomics Analysis - Xenium Human Colon Dataset

This notebook will guide you through Day 3 content. We will cover:

1. Setting up the environment
2. Loading the data
3. Spatial visualisation
4. Quality control
5. Data normalisation, Dimensionality reduction, and Clustering
6. Cell size normalisation
7. Cell type label prediction using single cell references
8. Improving cell segmentation

## How to use this notebook

This notebook is intended to be used as a reference for your own analysis.
All code chunks have an explanation detailing the analysis steps and their purpose, as well as key parameters.
Play around with these and see what they do so you are better equipped to adapt the workflow to your own data.

## Dataset

We will be using a Xenium human colon dataset, generated using V1 Xenium chemistry with a small, off-the-shelf, pre-made colon target gene panel and nuclei staining (no cell-boundary staining).
The full dataset is available here and contains one large tissue section from human colon and one large colorectal tissue section.
Typically, a full Xenium slide can generate data from over a million cells.
Therefore, several basic analysis steps can take a while to run.
For the purposes of this tutorial, we split off a small field of view from the full dataset.
If you're interested in working with the full-size dataset, you can find it in the data folder.

More information about today's dataset can be found on the 10x website.

https://www.10xgenomics.com/datasets/human-colon-preview-data-xenium-human-colon-gene-expression-panel-1-standard

### 1. Set up environment

First we need to set up the environment and load the packages we will use for this workshop.
As in the sequencing-based spatial transcriptomics analysis tutorial, we will be using largely the same base packages for analysis - [scanpy](https://scanpy.readthedocs.io/en/stable/) and [squidpy](https://squidpy.readthedocs.io/en/stable/). 

In [None]:
import os
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.io import mmread
from anndata import AnnData
import squidpy as sq
from spatialdata_io import xenium
import spatialdata as sd
import spatialdata_plot 
from spatialdata import SpatialData
from scipy import sparse
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist
import celltypist
from celltypist import models

Let's set the paths to where the datasets are stored and where we will keep any outputs we calculate ourselves.

In [None]:
# Input data
DATA_DIR = "/nvme/project/shared/python/5_python_spatial_omics/data/xenium_human_colon"
# Change this to where you want to save your outputs
OUTPUT_DIR = "/PATH/TO/YOUR/DIRECTORY"
# Path to precomputed data that you can use to skip lengthy steps
PRECOMPUTED_DIR = "/nvme/project/shared/python/5_python_spatial_omics/data/precomputed/imaging/"
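Writing outputs later on will fail if the output directory does not exist yet, so it can be worth creating it up front. The following is a minimal sketch that uses a temporary directory as a stand-in for your own `OUTPUT_DIR`; `ensure_dir` and `demo_output_dir` are hypothetical names, not part of the workshop code.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the directory (and any missing parents) and return its path."""
    os.makedirs(path, exist_ok=True)
    return path


# Stand-in for OUTPUT_DIR; replace with your own directory
demo_output_dir = ensure_dir(
    os.path.join(tempfile.gettempdir(), "xenium_demo_outputs")
)
print(os.path.isdir(demo_output_dir))
```
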

### 2. Loading the data

Typically, we can load Xenium data using the SpatialData package. SpatialData provides a nice data structure for Xenium data where we can load cell segmentation shapes, gene expression data, individual transcript coordinates and also any images.

In this case, as this is a Xenium V1 experiment from before cell boundary staining was released, we only have a DAPI image, which is not very exciting or informative, so we skip loading that for now.

In [None]:
xen = xenium(
    DATA_DIR,
    transcripts=True,
    aligned_images=False,
    morphology_mip=True,
    morphology_focus=True, 
    cells_as_circles=False, 
    nucleus_boundaries=False,
    n_jobs=20
)
xen

In [None]:
xen.write(
    os.path.join(OUTPUT_DIR, "xen_full.zarr")
)

In [None]:
xen = SpatialData.read(
    os.path.join(OUTPUT_DIR, "xen_full.zarr")
)

As the whole slide is quite a large piece of tissue with many cells, for the sake of this tutorial we will be working with a subset of the data.
Here, we subset the original spatial data object to a bounding box containing a smaller number of cells. 

In [None]:
xen_cropped = xen.query.bounding_box(
    axes=["x", "y"],
    min_coordinate=[35000, 10000],
    max_coordinate= [45000, 25000],
    target_coordinate_system="global"
)
xen_cropped

In [None]:
xen_cropped
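Conceptually, a bounding box query keeps only the elements whose coordinates fall inside a rectangle. The sketch below shows that core idea for a toy transcript table in pandas. The coordinates and gene names are made up for illustration, and the real query additionally handles shapes, labels and coordinate transformations.

```python
import pandas as pd

# Toy transcript table with hypothetical coordinates
points = pd.DataFrame({
    "x": [34000, 36000, 40000, 44000, 46000],
    "y": [12000, 9000, 20000, 24000, 15000],
    "feature_name": ["MS4A1", "MKI67", "MS4A1", "CEACAM5", "MKI67"],
})

# Same box as in the query above
min_x, min_y = 35000, 10000
max_x, max_y = 45000, 25000

# Keep only rows whose x and y both fall inside the box
inside = points["x"].between(min_x, max_x) & points["y"].between(min_y, max_y)
cropped_points = points[inside]
print(cropped_points)
```
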

We can save this subset to disk or reload it. 

In [None]:
xen_cropped.write(
    os.path.join(OUTPUT_DIR, "xen_cropped.zarr"),
    overwrite=True
)

In [None]:
xen_cropped = SpatialData.read(
    os.path.join(OUTPUT_DIR, "xen_cropped.zarr")
)

In [None]:
xen_cropped

Spatial data objects have an anndata object stored in the table slot, which we can examine and operate on by accessing it directly:

In [None]:
xen_cropped["table"]

### 3. Spatial Visualisation

Let’s start with some basic QC and visualisation of the data.

Spatial data has its own set of plotting functions, which are documented here:
https://spatialdata.scverse.org/projects/plot/en/latest/plotting.html

In summary:

**render_labels:**
We can use this to plot cell centroids.
This should be your default visualisation, especially for larger datasets as plotting is much quicker.

**render_images:**
We can plot image data, if there is any linked to the object.
In our case, there is not.

**render_points:**
We can use this to plot individual transcript coordinates.

**render_shapes:**
We can use this to plot cell segmentation boundaries - but this can be slow for large datasets. 

In [None]:
xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1
).pl.show(
    coordinate_systems="global",
    title="Cell centroids"
)

In [None]:
xen_cropped.pl.render_images(
    "morphology_focus", cmap="gray"
).pl.show(
    title="Morphology image"
)

In [None]:
xen_cropped.tables["table"].obs["region"] = "cell_boundaries"

xen_cropped.set_table_annotates_spatialelement(
    table_name="table",
    region="cell_boundaries",
    region_key="region",
    instance_key="cell_id",
)

xen_cropped.pl.render_shapes(
    "cell_boundaries",
    table_name="table",
    color="total_counts",
).pl.show(
    coordinate_systems="global",
    title="Total transcript counts per cell + boundaries"
)

We can't really easily visualise cell boundaries and shapes when we are zoomed out.
Let's create a small zoomed-in ROI to see how we can visualise cell shapes.
Here, we select a region focused on a lymphoid follicle in the tissue.
You can also use colour and outline arguments to customise the look of the plot - here, we switched from the default colour palette and added outlines to the cell boundary shapes.
We can also turn off axes. 

In [None]:
zoom_roi = xen_cropped.query.bounding_box(
    axes=["x", "y"],
    min_coordinate=[38000, 21000],
    max_coordinate=[41000, 24000],
    target_coordinate_system="global",
)

ax = zoom_roi.pl.render_shapes(
    "cell_boundaries",
    table_name="table",
    color="total_counts",
    outline_color="black",
    outline_alpha=1,
    outline_width=0.5,
    cmap="magma"

).pl.show(
    coordinate_systems="global",
    title="Total transcript counts per cell + boundaries",
    return_ax=True
)

ax.axis("off")

In [None]:
xen_cropped.pl.render_points(
    "transcripts",
    color="feature_name",
    groups=["MKI67", "MS4A1"],
    palette=["red", "black"]
).pl.show(
    title="MKI67 and MS4A1 transcripts",
    coordinate_systems="global"
)

We can also combine different types of plots - for example, we can use cell boundaries as a base and overlay individual transcript points over the image.
Here, we do it in the zoomed-in ROI over a follicle. 

Plotting individual transcripts can be useful as it allows us to see detection outside of segmented cell boundary areas.
This can be useful if, for example, you have custom probes for pathogens, which will not segment out as cells, or to diagnose any segmentation or quality issues with your slide.
For example, here we can see that there is some detection of *MS4A1*/*CD20* transcript outside the cells. 

In [None]:
ax = zoom_roi.pl.render_shapes(
    "cell_boundaries",
    table_name="table",
    color="total_counts",
    outline_color="black",
    outline_alpha=1,
    outline_width=0.5,
    cmap="magma"

).pl.render_points(
    "transcripts",
    color="feature_name",
    groups=["MS4A1"],
    palette=["red"],
    size=0.5
).pl.show(coordinate_systems="global", title="Total transcript counts per cell + boundaries", return_ax=True)

ax.axis("off")

While plotting cell boundaries can look nice, for larger tissue sections it can take a long time to generate each plot.
Therefore, for interactive sessions we recommend using cell centroids for quicker plotting.
Let's link our [AnnData](https://anndata.readthedocs.io/en/stable/) back to cell centroids for that. 

In [None]:
xen_cropped.tables["table"].obs["region"] = "cell_labels"

xen_cropped.set_table_annotates_spatialelement(
    table_name="table",
    region="cell_labels",
    region_key="region",
    instance_key="cell_labels"
)

### 4. Quality control

Next, we want to do some basic QC.
We can get a lot of information from the Xenium Ranger run summary document, but ultimately we still want to do cell-level QC of our dataset. 

In our [SpatialData](https://spatialdata.scverse.org/en/stable/) object, cell level data is stored in the [AnnData](https://anndata.readthedocs.io/en/stable/) table, which we can access via `["table"]`.
We can operate on this as on any [AnnData](https://anndata.readthedocs.io/en/stable/) object.
For example, let's use the [scanpy](https://scanpy.readthedocs.io/en/stable/) helper function to calculate automated QC metrics as we did before.
Because we are using a small, targeted gene panel, we skip the typical mitochondrial/ribosomal gene metrics, as we don't have any of these genes in the panel anyway.

In [None]:
sc.pp.calculate_qc_metrics(
    xen_cropped["table"],  inplace=True, percent_top=[10, 20], log1p=True
)
xen_cropped["table"]
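To make the QC columns less of a black box, here is a minimal sketch of how the main per-cell metrics could be computed by hand from a count matrix. The toy matrix is made up for illustration; the variable names mirror the `total_counts`, `n_genes_by_counts` and `log1p_total_counts` columns that scanpy adds to `.obs`.

```python
import numpy as np
from scipy import sparse

# Toy cells x genes count matrix (3 cells, 4 genes)
counts = sparse.csr_matrix(np.array([
    [5, 0, 2, 0],
    [0, 0, 0, 1],
    [3, 4, 0, 6],
]))

# Total transcripts per cell: sum across genes
total_counts = np.asarray(counts.sum(axis=1)).ravel()
# Genes detected per cell: number of genes with a non-zero count
n_genes_by_counts = np.asarray((counts > 0).sum(axis=1)).ravel()
# log1p-transformed totals, as added when log1p=True
log1p_total_counts = np.log1p(total_counts)

print(total_counts)
print(n_genes_by_counts)
```
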

First, let's visualise the total number of transcripts detected per cell.

As in scRNA-Seq data, this is the most basic measure of overall signal and of how good the data looks.

Unlike in scRNA-Seq data or unbiased sequencing-based ST, these measures are also very heavily dependent not only on the total RNA quantity of each cell and tissue quality, but also on the target panel used for the experiment.
Under-represented cell types will naturally yield fewer transcripts.
Finally, the quality of cell segmentation also plays a role.

In this case, we can see that there are areas with higher and lower total transcripts detected, but there are a few cells which have very high counts.

Understanding your tissue and target panel here is important to delineate where these differences are biological and where they may be technical.

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="transcript_counts",
    cmap="jet"
).pl.show(coordinate_systems="global", title="Transcripts per cell", return_ax=True)

ax.axis("off")

It can also help to visualise the distribution to see where the majority of your cells lie. 

In [None]:
sns.kdeplot(
    data=xen_cropped["table"].obs,
    x="transcript_counts",
    fill=True
)
plt.xlabel("Transcript Count Per Cell")
plt.ylabel("Density")
plt.tight_layout()
plt.show()

Similarly, we can visualise the total number of genes detected per cell.
You can see that this is a bit less variable across the tissue.

This can also suggest that there are cells at the top of the epithelial crypts in this sample with more genes detected than the rest of the tissue.

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="n_genes_by_counts",
    cmap="jet"
).pl.show(
    coordinate_systems="global",
    title="Genes per cell",
    return_ax=True
)

ax.axis("off")

This code examines the distribution of the number of genes detected per cell in the [AnnData](https://anndata.readthedocs.io/en/stable/) object using a density plot.

This is important for understanding the variability and distribution of detected features, which can help identify potential issues such as low-quality cells and determine any filtering thresholds that may need to be applied.

If you’re coming from scRNA-Seq work, these low numbers probably look very alarming.
How can you possibly work with a median of 31 genes per cell?

Unlike in scRNA-Seq data and sequencing-based ST, both gene dropouts and noise are much, much lower in in situ ST data. So, perhaps surprisingly, this is perfectly fine for cell identification and clustering.

We are also working with 100-fold fewer targeted genes.

In [None]:
sns.kdeplot(
    data=xen_cropped["table"].obs,
    x="n_genes_by_counts",
    fill=True
)
plt.xlabel("Gene Count Per Cell")
plt.ylabel("Density")
plt.tight_layout()
plt.show()

Next, we visualise cell area/size, which allows us to examine the spatial organization and potential heterogeneity of cell sizes within your tissue sample.

Why do we get such a difference in spatial distribution of cell sizes?

This could be due to biological differences between small and large cells - e.g. small cells like T-cells.

However, here the signal correlates with areas of low cellularisation.
Therefore, it is likely this is an artefact of nuclei expansion in cell segmentation.

**What is Nuclei Expansion?**

Nuclei expansion in cell segmentation refers to the process of enlarging the segmented nuclei regions to approximate the boundaries of the entire cells.
This technique is used to better represent the actual cell boundaries when only the nuclei have been explicitly segmented/we only have DAPI and no additional cell boundary staining.
The primary goal is to provide a more accurate estimation of the cellular area, which is crucial for various downstream analyses in spatial transcriptomics and single-cell studies.
In this case, nuclei expansion is constrained either by maximum distance or other nearby cells - so, where there are no other nearby cells to “bump into”, the expansion generates artificially bigger cells.
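The effect described above can be reproduced on a toy image. The sketch below is not the Xenium segmentation algorithm, only a simple distance-based stand-in: every background pixel is assigned to its nearest nucleus, provided it lies within a maximum distance. The layout and the `max_distance` value are assumptions chosen for illustration.

```python
import numpy as np
from scipy import ndimage

# Toy label image: 0 = background, 1-3 = segmented nuclei (hypothetical layout)
nuclei = np.zeros((40, 70), dtype=int)
nuclei[18:22, 8:12] = 1    # isolated nucleus
nuclei[18:22, 44:48] = 2   # two nuclei packed close together
nuclei[18:22, 50:54] = 3

max_distance = 8  # maximum expansion in pixels (assumed value)

# For every pixel: distance to, and coordinates of, the nearest nucleus pixel
dist, (rows, cols) = ndimage.distance_transform_edt(
    nuclei == 0, return_indices=True
)
# Assign each pixel to its nearest nucleus, but only within the maximum distance
cells = np.where(dist <= max_distance, nuclei[rows, cols], 0)

nucleus_area = np.bincount(nuclei.ravel(), minlength=4)[1:]
cell_area = np.bincount(cells.ravel(), minlength=4)[1:]
print("nucleus area:", nucleus_area)
print("cell area:   ", cell_area)
```

All three nuclei have the same area, but the isolated one ends up with a larger expanded cell because nothing stops its expansion before the maximum distance is reached.
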

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="cell_area",
    cmap="jet"
).pl.show(
    coordinate_systems="global",
    title="Cell size",
    return_ax=True
)

ax.axis("off")

We can see that nucleus size is much more uniform.

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="nucleus_area",
    cmap="jet"
).pl.show(
    coordinate_systems="global",
    title="Nucleus size",
    return_ax=True
)

ax.axis("off")

We can further check that this is likely the case by plotting the ratio between nucleus area and total cell area.
We can see that there is a very big decrease in the percentage of cell area occupied by the nucleus in areas of low cell density.

The nucleus-to-cell area ratio can also potentially provide insights into cell morphology, cell type and potential changes in cellular states or conditions.
For example, T-cells can often be identified quite well by this variable alone, as they have a small cytoplasm volume.
However, without a cell boundary stain, this metric mainly captures segmentation artefacts, so be careful about over-interpretation!

In [None]:
xen_cropped["table"].obs["cell_nucleus_ratio"] = xen_cropped["table"].obs["nucleus_area"] / xen_cropped["table"].obs["cell_area"]

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="cell_nucleus_ratio",
    cmap="jet"
).pl.show(
    coordinate_systems="global",
    title="Nucleus-Cell Ratio",
    return_ax=True
)

ax.axis("off")

In this case, we can see that as expected, there is generally a correlation between cell area and transcript detection rate.

However, we also have a group of cells where this is not the case - very large cells but relatively few transcripts.
These cells are mainly submucosal stromal cells which are very poorly covered by the panel 10x have used.

In [None]:
sns.scatterplot(
    data=xen_cropped["table"].obs,
    x="total_counts",
    y="cell_area",
    s=5
)
plt.xlabel("Transcript count")
plt.ylabel("Cell Size")
plt.tight_layout()
plt.show()

We can create a filter to remove the overly large cells from the analysis.
Here, we can flag cells which are above the 99th percentile in cell area.
We can see this would remove a lot of submucosal cells.
We might want to reconsider whether to apply this filter or not!

In [None]:
xen_cropped["table"].obs["size_filter_large"] = xen_cropped["table"].obs["cell_area"] < np.quantile(xen_cropped["table"].obs["cell_area"], 0.99)
xen_cropped["table"].obs["size_filter_large"] = xen_cropped["table"].obs["size_filter_large"].astype("category")

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="size_filter_large",
).pl.show(
    coordinate_systems="global",
    title="Large Cell Filter",
    return_ax=True
)

ax.axis("off")

Similarly, we can flag small cells for filtering - very small cells are often segmentation artefacts.
But, as we can see, this disproportionately removes cells within the densely packed immune region.

In [None]:
xen_cropped["table"].obs["size_filter_small"] = xen_cropped["table"].obs["cell_area"] > np.quantile(xen_cropped["table"].obs["cell_area"], 0.01)
xen_cropped["table"].obs["size_filter_small"] = xen_cropped["table"].obs["size_filter_small"].astype("category")

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="size_filter_small",
).pl.show(
    coordinate_systems="global",
    title="Small Cell Filter",
    return_ax=True
)

ax.axis("off")

We can check how these values correlate with gene detection rate.

If we filter out small cells, we will remove cells with low numbers of genes detected.

If we filter out large cells, the removed cells are not strongly biased towards very high counts, as we saw before.

In [None]:
sns.violinplot(
    data=xen_cropped["table"].obs,
    x="size_filter_small",
    y="n_genes_by_counts",
    palette="tab10"
)
plt.tight_layout()
plt.show()

In [None]:
sns.violinplot(
    data=xen_cropped["table"].obs,
    x="size_filter_large",
    y="n_genes_by_counts",
    palette="tab10"
)
plt.tight_layout()
plt.show()

Adjusting the threshold for what is considered a “small cell” can have significant implications for your analysis, especially in areas with specific cell types such as T-cells, which are small and densely packed in follicular regions.
This example demonstrates how changing the threshold to the 10th percentile affects the filtering.
In this case, we would probably filter out a lot of good cells that we don’t want to lose!
So, be careful when looking at these types of QC metrics!

In [None]:
xen_cropped["table"].obs["size_filter_small"] = xen_cropped["table"].obs["cell_area"] > np.quantile(xen_cropped["table"].obs["cell_area"], 0.1)
xen_cropped["table"].obs["size_filter_small"] = xen_cropped["table"].obs["size_filter_small"].astype("category")

ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="size_filter_small",
).pl.show(
    coordinate_systems="global",
    title="Small Cell Filter",
    return_ax=True
)

ax.axis("off")

Let's set this back to the original 1% threshold.

In [None]:
xen_cropped["table"].obs["size_filter_small"] = xen_cropped["table"].obs["cell_area"] > np.quantile(xen_cropped["table"].obs["cell_area"], 0.01)
xen_cropped["table"].obs["size_filter_small"] = xen_cropped["table"].obs["size_filter_small"].astype("category")
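One thing worth keeping in mind with percentile-based thresholds is that they always remove a fixed fraction of cells, whether or not those cells are actually poor quality. A minimal sketch on synthetic cell areas (the log-normal distribution and the `fraction_removed` helper are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, log-normally distributed cell areas (hypothetical values)
obs = pd.DataFrame({
    "cell_area": rng.lognormal(mean=4.0, sigma=0.5, size=10_000)
})


def fraction_removed(area, lower_q):
    """Fraction of cells at or below the given lower quantile of cell area."""
    keep = area > np.quantile(area, lower_q)
    return 1.0 - keep.mean()


for q in (0.01, 0.05, 0.10):
    removed = fraction_removed(obs["cell_area"], q)
    print(f"lower quantile {q:.2f}: removes {removed:.1%} of cells")
```

If a biologically meaningful cut-off is known for your tissue, an absolute area threshold may be easier to justify than a percentile.
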

The most important filter is the overall transcript detection.
Empty cells or cells with very low transcript count cannot be taken forward for clustering analysis and it is extremely difficult to identify what they may be.
Here, we set a threshold of minimum 15 transcripts.
This seems quite low, but for data from *in situ* platforms with low noise (Xenium, MERFISH, MERSCOPE), this is generally enough to cluster and identify cell types.
If your data has more noise (e.g. CosMx), a higher threshold is more appropriate.

In [None]:
xen_cropped["table"].obs["transcript_filter"] = xen_cropped["table"].obs["total_counts"] >= 15
xen_cropped["table"].obs["transcript_filter"] = xen_cropped["table"].obs["transcript_filter"].astype("category")

And we can visualise the cells that we would lose.

We see that we would disproportionately filter out more cells from some regions than others.
As pointed out previously, this is likely due to a combination of gene panel coverage in some regions and very small cells in densely packed regions like follicles.

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="transcript_filter",
).pl.show(
    coordinate_systems="global",
    title="Transcript Filter",
    return_ax=True
)

ax.axis("off")

Finally, visualizing the counts of negative control codewords, negative control probes, and unassigned codewords helps identify and understand technical artifacts and background noise in your spatial transcriptomics data.

Here, we can see that all control probes and codewords yield very little signal, suggesting our data is good quality!

In some cases, a high amount of autofluorescence in the cells/tissue can generate false positive signal, and this should be filtered out.

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="deprecated_codeword_counts",
    cmap="jet"
).pl.show(
    coordinate_systems="global",
    title="Deprecated Codeword Counts",
    return_ax=True
)

ax.axis("off")

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="unassigned_codeword_counts",
    cmap="jet"
).pl.show(
    coordinate_systems="global",
    title="Unassigned Codeword Counts",
    return_ax=True
)

ax.axis("off")

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="control_codeword_counts",
    cmap="jet"
).pl.show(
    coordinate_systems="global",
    title="Control Codeword Counts",
    return_ax=True
)

ax.axis("off")

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="control_probe_counts",
    cmap="jet"
).pl.show(
    coordinate_systems="global",
    title="Control Probe Counts",
    return_ax=True
)

ax.axis("off")

Although the negative control signal is low, we can nonetheless create a filter to remove cells which have any, although in this case it is probably unnecessary.

In [None]:
xen_cropped["table"].obs["probe_filter"] = (
    (xen_cropped["table"].obs["control_probe_counts"] == 0) &
    (xen_cropped["table"].obs["control_codeword_counts"] == 0) &
    (xen_cropped["table"].obs["unassigned_codeword_counts"] == 0) & 
    (xen_cropped["table"].obs["deprecated_codeword_counts"] == 0)
)

xen_cropped["table"].obs["probe_filter"] = xen_cropped["table"].obs["probe_filter"].astype("category")

ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="probe_filter"
).pl.show(
    coordinate_systems="global",
    title="Probe Filter",
    return_ax=True
)

ax.axis("off")

Finally, we can subset the [AnnData](https://anndata.readthedocs.io/en/stable/) object based on any/all of the filters we have created earlier.

By combining probe, size, and transcript filters, you can retain only the cells that meet all quality criteria, reducing the impact of technical artifacts and noise on your analysis.

In [None]:
mask = (
    (xen_cropped["table"].obs["probe_filter"].astype(bool)) &
    (xen_cropped["table"].obs["size_filter_large"].astype(bool)) &
    (xen_cropped["table"].obs["size_filter_small"].astype(bool)) &
    (xen_cropped["table"].obs["transcript_filter"].astype(bool))
)
xen_cropped["table"] = xen_cropped["table"][mask].copy()
xen_cropped["table"]
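Before applying a combined mask like the one above, it can help to tabulate how many cells each individual filter would remove, so that one overly strict filter does not go unnoticed. A minimal sketch with a toy stand-in for the filter columns (the values are made up; in the notebook these columns are categorical, hence the `.astype(bool)` calls above):

```python
import pandas as pd

# Toy stand-in for the boolean filter columns in .obs (True = cell passes)
filters = pd.DataFrame({
    "probe_filter":      [True, True, False, True, True, True],
    "size_filter_large": [True, True, True, False, True, True],
    "size_filter_small": [True, False, True, True, True, True],
    "transcript_filter": [True, False, True, True, True, False],
})

# Number of cells failing each individual filter
n_failed = (~filters).sum()
# Number of cells passing every filter at once
n_pass_all = int(filters.all(axis=1).sum())

print(n_failed)
print(f"{n_pass_all} of {len(filters)} cells pass all filters")
```
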

### 5. Data Normalisation, Dimensionality Reduction and Clustering

As we have quite variable cell sizes and gene counts, we want to normalise these as before prior to clustering analysis.

In [None]:
xen_cropped["table"].layers["counts"] = xen_cropped["table"].X.copy()
sc.pp.normalize_total(
    xen_cropped["table"],
    inplace=True
)
sc.pp.log1p(
    xen_cropped["table"]
)
xen_cropped["table"] 
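As a rough illustration of what these two steps do, the sketch below reproduces the arithmetic in plain NumPy on a toy matrix. It assumes the default behaviour of `sc.pp.normalize_total`, where each cell is scaled to the median of the per-cell totals; the counts are made up.

```python
import numpy as np

# Toy cells x genes counts (3 cells, 3 genes)
counts = np.array([
    [10.0, 0.0, 10.0],
    [1.0, 1.0, 2.0],
    [30.0, 10.0, 0.0],
])

totals = counts.sum(axis=1)             # per-cell totals: 20, 4, 40
target = np.median(totals)              # median total, used as the common target
normalised = counts / totals[:, None] * target
log_normalised = np.log1p(normalised)   # natural log of (1 + x)

print(normalised.sum(axis=1))
```

After scaling, every cell has the same total, so differences between cells reflect relative rather than absolute expression.
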

Principal Component Analysis (PCA) is a dimensionality reduction technique used to identify the primary axes of variation in high-dimensional data.
In the context of spatial transcriptomics, PCA helps to reduce the complexity of the data while preserving the most important patterns of variation.

**TIP:**
If your target panel is very small, you can skip this step and carry out clustering analysis directly on gene expression.
This can sometimes help with achieving better clustering results.

In [None]:
sc.pp.pca(xen_cropped["table"])
xen_cropped["table"]

As before, we can visualise how much variation is captured by each PC.

An elbow plot helps to determine the number of significant PCs to use for downstream analyses.

The plot typically shows the amount of variance explained by each PC, and the “elbow” point indicates a natural cutoff.

In [None]:
sc.pl.pca_variance_ratio(
    xen_cropped["table"],
    n_pcs=50,
    log=True
)
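If you prefer a number to an eyeballed elbow, the variance ratios can also be accumulated and compared against a cut-off. The sketch below computes them directly from a singular value decomposition on synthetic data; the 90% cut-off is an arbitrary assumption, not a recommendation from this workshop.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic cells x genes matrix: 3 strong latent signals plus noise
latent = rng.normal(size=(500, 3)) * np.array([10.0, 6.0, 3.0])
X = latent @ rng.normal(size=(3, 30)) + rng.normal(size=(500, 30))

Xc = X - X.mean(axis=0)                    # centre each gene
s = np.linalg.svd(Xc, compute_uv=False)    # singular values, largest first
variance_ratio = s**2 / np.sum(s**2)       # fraction of variance per PC
cumulative = np.cumsum(variance_ratio)

# Smallest number of PCs explaining at least 90% of the variance
n_pcs_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(variance_ratio[:5].round(3), n_pcs_90)
```
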

Plotting the top genes contributing to a specific principal component helps in understanding the biological factors driving the variation captured by that component.
This type of plot highlights the genes with the highest loadings, which are the most influential in the principal component analysis.

In [None]:
sc.pl.pca_loadings(
    xen_cropped["table"],
    components='1,2'
)

The `sc.pl.pca()` function in [scanpy](https://scanpy.readthedocs.io/en/stable/) is used to visualize the expression of a specific gene across cells in a given dimensionality reduction space (e.g., PCA).
This helps to understand how the expression of a gene varies across the principal components.

In [None]:
sc.pl.pca(
    xen_cropped["table"],
    color="CEACAM5"
)

We can also examine how various PCs are distributed spatially.

Here, we can see that cells with high PC1 scores are enriched among crypt top cells and cells with low PC1 scores are enriched in follicular structures.

In [None]:
xen_cropped["table"].obs["PC1"] = xen_cropped["table"].obsm["X_pca"][:, 0]

ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="PC1"
).pl.show(
    coordinate_systems="global",
    title="PC1",
    return_ax=True
)

ax.axis("off")

We can plot the expression of high (or low) loading genes to visualise how this correlates with our dimensionality reduction.

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="MS4A1"
).pl.show(
    coordinate_systems="global",
    title="MS4A1 Expression",
    return_ax=True
)

ax.axis("off")

Next, we will use the reduced dimensionality data for clustering and cluster visualisation.

`sc.pp.neighbors()` needs to be run prior to `sc.tl.umap()`.

In [None]:
sc.pp.neighbors(
    xen_cropped["table"],
    n_pcs=20
)
sc.tl.umap(
    xen_cropped["table"]
)
xen_cropped["table"]
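The neighbour step is needed first because both UMAP and Leiden clustering operate on a graph of similar cells rather than on the expression values directly. The sketch below shows only the first part of that idea, finding each cell's nearest neighbours in PC space with a k-d tree. The data are random stand-ins, `k = 15` mirrors scanpy's default `n_neighbors`, and scanpy additionally turns such a neighbour list into a weighted graph.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
# Stand-in for the first 20 PCs of 200 cells (hypothetical data)
pcs = rng.normal(size=(200, 20))

k = 15  # number of neighbours per cell
tree = cKDTree(pcs)
# Query k + 1 neighbours because each cell's nearest neighbour is itself
distances, indices = tree.query(pcs, k=k + 1)
neighbour_idx = indices[:, 1:]

print(neighbour_idx.shape)
```
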

Then we can also run `sc.tl.leiden()` to cluster cells using the Leiden algorithm.

In [None]:
sc.tl.leiden(
    xen_cropped["table"],
    resolution=0.7,
    key_added="cell_clusters"
)
xen_cropped["table"]

Next, let's visualise the clusters - firstly, based on the UMAP embedding.

In [None]:
sc.pl.umap(
    xen_cropped["table"],
    color="cell_clusters", 
    add_outline=True
)

And now let's plot the clusters in tissue space.

We can see that our clusters have quite nice correspondence to distinct spatial regions.

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="cell_clusters"
).pl.show(
    coordinate_systems="global",
    title="Cell Clusters",
    return_ax=True
)

ax.axis("off")

Sometimes the plots can get a bit busy if there are many clusters and the colours blend into each other.
We can choose to visualise just a few groups at a time - for example, just clusters 1 and 3.

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="cell_clusters",
    groups=["1", "3"]
).pl.show(
    coordinate_systems="global",
    title="Selected Cell Clusters",
    return_ax=True
)

ax.axis("off")

As before, now we can use [scanpy](https://scanpy.readthedocs.io/en/stable/) differential expression functions to identify marker genes for specific cell clusters.

In [None]:
sc.tl.rank_genes_groups(
    xen_cropped["table"],
    'cell_clusters',
    method='wilcoxon',
    groups = ['5']
)
sc.pl.rank_genes_groups(
    xen_cropped["table"],
    n_genes=25,
    sharey=False
)
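To give a feel for what the Wilcoxon method tests, the sketch below runs a rank-sum test for a single gene, comparing cells inside a cluster against all other cells. The expression values are synthetic, and scanpy's implementation differs in detail (for example, it tests all genes and corrects for multiple testing), so treat this only as an illustration of the principle.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
# Synthetic log-normalised expression of one gene (hypothetical values)
in_cluster = rng.normal(loc=2.0, scale=0.5, size=200)  # cluster of interest
rest = rng.normal(loc=0.5, scale=0.5, size=800)        # all other cells

# A positive statistic means expression tends to be higher in the cluster
statistic, p_value = ranksums(in_cluster, rest)
print(f"z = {statistic:.2f}, p = {p_value:.2e}")
```
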

We can visualise expression of cluster-specific markers using feature plots:

In [None]:
sc.pl.umap(
    xen_cropped["table"],
    color="CA4"
)

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="CA4"
).pl.show(
    coordinate_systems="global",
    title="CA4",
    return_ax=True
)

ax.axis("off")

`sc.pl.rank_genes_groups_dotplot()` provides a convenient and visually appealing way to display expression patterns of top marker genes across clusters using a dot plot.

At this point, you can use any of the core [scanpy](https://scanpy.readthedocs.io/en/stable/) plotting functions with `xen_cropped["table"]`. 

Explore various options and documentation here:

https://scanpy.readthedocs.io/en/stable/tutorials/plotting/core.html

In [None]:
sc.tl.rank_genes_groups(
    xen_cropped["table"],
    groupby="cell_clusters",
    method="wilcoxon"
)
sc.pl.rank_genes_groups_dotplot(
    xen_cropped["table"],
    n_genes=4
)

### 6. Cell Size Normalisation

It has been demonstrated ([Atta et al, 2024](https://link.springer.com/article/10.1186/s13059-024-03303-w)) that if you have a very skewed panel design in in situ data, cell area/size normalisation can be more robust.
A cell with low panel coverage would have lower counts, for example, so normalising to “library size” would inflate those counts.
Note that for large panels (e.g. Xenium 5K), it’s not such an issue.

Let’s try normalising by area instead - there’s no in-built method in [scanpy](https://scanpy.readthedocs.io/en/stable/) for this, so we add the normalised counts and log-transformed values (for visualisation) back to the object manually:

In [None]:
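# Keep a copy of the library-size-normalised, log-transformed values for comparison
xen_cropped["table"].layers['COUNT_NORM'] = xen_cropped["table"].X.copy()
raw_counts = xen_cropped["table"].layers['counts']
# Small offset avoids division by zero for cells with zero area
cell_area = xen_cropped["table"].obs['cell_area'].values.astype(float) + 1e-6
median_area = np.median(cell_area)
# Divide each cell's (row's) counts by its area using a diagonal matrix
inv = 1.0/cell_area
D = sparse.diags(inv)
norm_counts = D.dot(raw_counts)
# Rescale by the median area so values stay on a count-like scale
scaled_counts = norm_counts.multiply(median_area)
log_norm_counts = np.log1p(scaled_counts)

# Store the area-normalised, log-transformed values as a new layer
xen_cropped["table"].layers['XENIUM_SIZE_NORM'] = log_norm_counts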
xen_cropped["table"]
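A tiny worked example may help to see how the two normalisations diverge. Below, two toy cells have identical counts but different segmented areas; all values are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Two toy cells with identical counts but different segmented areas
raw_counts = sparse.csr_matrix(np.array([
    [4.0, 0.0, 6.0],
    [4.0, 0.0, 6.0],
]))
cell_area = np.array([50.0, 200.0])

# Area normalisation: counts per unit area, rescaled by the median area
area_norm = sparse.diags(1.0 / cell_area) @ raw_counts * np.median(cell_area)

# Library-size normalisation: counts per total count, rescaled by the median total
totals = np.asarray(raw_counts.sum(axis=1)).ravel()
lib_norm = sparse.diags(1.0 / totals) @ raw_counts * np.median(totals)

print(area_norm.toarray())
print(lib_norm.toarray())
```

Library-size normalisation treats the two cells as identical, whereas area normalisation assigns the larger cell a four-fold lower value. Whether that is desirable depends on how much you trust the segmented area.
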

We can see that for some genes, there’s little difference - e.g. B-Cell marker *MS4A1*/*CD20*:

In [None]:
plot_feature = "MS4A1"

fig, ax = plt.subplots(1, 2, figsize = (12, 5))

sc.pl.umap(
    xen_cropped["table"],
    color = plot_feature,
    layer = "XENIUM_SIZE_NORM",
    ax = ax[0],
    show = False,
    cmap="jet"
)

ax[0].set_title('Area Size - XENIUM_SIZE_NORM')

sc.pl.umap(
    xen_cropped["table"],
    color = plot_feature,
    layer = "COUNT_NORM",
    ax = ax[1],
    show = False,
    cmap="jet"
)

ax[1].set_title("Transcript count - COUNT_NORM")

plt.tight_layout()
plt.show()

But for others, standard normalisation and area-based normalisation give very different results:

In [None]:
plot_feature = "TUBB"

fig, ax = plt.subplots(1, 2, figsize = (12, 5))

sc.pl.umap(
    xen_cropped["table"],
    color = plot_feature,
    layer = "XENIUM_SIZE_NORM",
    ax = ax[0],
    show = False,
    cmap="jet"
)

ax[0].set_title('Area Size - XENIUM_SIZE_NORM')

sc.pl.umap(
    xen_cropped["table"],
    color = plot_feature,
    layer = "COUNT_NORM",
    ax = ax[1],
    show = False,
    cmap="jet"
)

ax[1].set_title("Transcript count - COUNT_NORM")

plt.tight_layout()
plt.show()

You can see that the largest differences coincide with the large stromal cells.
However, because cells in this dataset were segmented by nuclei expansion, the area of these cells is likely to be inflated. Area-based normalisation here is therefore probably strongly under-estimating the true expression and creating a false gene expression pattern!

In [None]:
sc.pl.umap(
    xen_cropped["table"],
    color = 'cell_area'
)

### 7. Cell type identification

You can manually annotate your cell clusters, or you can classify them using a reference single-cell dataset.
This process is simpler than for Visium data because our data is at the single-cell level, establishing a one-to-one relationship without the need for spot deconvolution.

However, our transcriptome is more limited here, and some cell types may not be well represented. Additionally, our single-cell reference might be missing some cell types that are not well captured by droplet-based technologies but are present in our tissue data.

In this example, we will use a single-cell reference dataset that we prepared earlier.

We will start by reading in the file.

In [None]:
adata_ref = sc.read_h5ad(os.path.join(PRECOMPUTED_DIR, "single_cell_colon_ref.h5ad"))
adata_ref

In [None]:
sc.pl.umap(
    adata_ref,
    color = "CellType"
)

Before we proceed further, we subset the single-cell dataset to the genes that the single-cell and spatial datasets have in common:

In [None]:
common_genes = xen_cropped["table"].var_names.intersection(adata_ref.var_names)
adata_ref = adata_ref[:, common_genes].copy()
len(common_genes)

We want to evaluate how much structural information is lost in the single-cell data when we limit ourselves to the targeted gene set.
Accurate cluster prediction is challenging if the gene set cannot adequately distinguish the clusters.
To do this, we will quickly re-embed the data using only the genes present in our spatial transcriptomics data, and keep the original cluster annotations derived from the unbiased data.

In [None]:
sc.pp.normalize_total(adata_ref, inplace=True)
sc.pp.log1p(adata_ref)

sc.pp.pca(
    adata_ref, 
    n_comps=50, 
    use_highly_variable=False, 
    svd_solver='arpack', 
    copy=False
)


sc.pp.neighbors(
    adata_ref, 
    n_neighbors=20, 
    use_rep='X_pca',
    n_pcs=20
)

sc.tl.umap(
    adata_ref,
    min_dist=0.3
)

adata_ref

In this example, we can observe that the limited gene set does a reasonably good job at distinguishing major cell populations.
However, it struggles to differentiate between similar cell types, such as myofibroblasts and fibroblasts, as effectively as before.

In [None]:
sc.pl.embedding(
    adata_ref,
    basis="X_umap",
    color="CellType"
)

If we visualise the specificity of the gene panel across our single cell reference clusters, we can see that the panel coverage is mainly concentrated across epithelial cells and T-Cells and other immune cells, with few specific markers expressed by stromal cells.

In [None]:
adata_ref.layers["scaled"] = sc.pp.scale(adata_ref, copy=True).X

col_link = linkage(
    pdist(
        adata_ref.layers["scaled"].T,
        metric="euclidean"
    ),
    method="average"
)
col_order = leaves_list(col_link)

mp = sc.pl.matrixplot(
    adata_ref,
    adata_ref.var_names[col_order],
    "CellType",
    dendrogram=True,
    colorbar_title="mean z-score",
    layer="scaled",
    vmin=-1,
    vmax=1,
    cmap="RdBu_r",
    figsize=(11, 4),
    show=False
)
mp["mainplot_ax"].tick_params(
    axis="x",
    labelsize=2,
    bottom=False
)

Next, we can use the standard `scanpy.tl.ingest()` integration and cross-classification workflow to transfer single-cell derived labels to our [SpatialData](https://spatialdata.scverse.org/en/stable/) object.

Briefly, the function integrates embeddings and annotations of a new `adata` object with a reference dataset `adata_ref` through projecting on a PCA (or alternate model) that has been fitted on the reference data.
The function uses a knn classifier for mapping labels and the UMAP package ([McInnes et al., 2018](https://arxiv.org/abs/1802.03426)) for mapping the embeddings.
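
The idea of projecting onto a reference PCA and then classifying with nearest neighbours can be sketched in plain NumPy. This is a simplified illustration on hypothetical toy data, not scanpy's actual implementation; the helper `knn_predict` and all values are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reference: two well-separated cell types measured on 5 genes
ref = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 5)),
    rng.normal(loc=4.0, scale=0.5, size=(30, 5)),
])
ref_labels = np.array(["A"] * 30 + ["B"] * 30)

# Two query cells, one close to each reference population
query = np.array([[0.1, -0.2, 0.0, 0.3, 0.1],
                  [3.9, 4.2, 4.0, 3.8, 4.1]])

# 1. Fit PCA on the reference only (SVD of the centred matrix)
ref_mean = ref.mean(axis=0)
_, _, vt = np.linalg.svd(ref - ref_mean, full_matrices=False)
components = vt[:2]  # top two principal axes

# 2. Project reference and query into the reference PCA space
ref_pcs = (ref - ref_mean) @ components.T
query_pcs = (query - ref_mean) @ components.T

# 3. kNN classification: majority label among the k nearest reference cells
def knn_predict(q, k=5):
    dist = np.linalg.norm(ref_pcs - q, axis=1)
    nearest = ref_labels[np.argsort(dist)[:k]]
    values, votes = np.unique(nearest, return_counts=True)
    return str(values[np.argmax(votes)])

predicted = [knn_predict(q) for q in query_pcs]
print(predicted)
```

The key design point is that the PCA is fitted on the reference alone, so the query cells are described in the reference's coordinate system rather than their own.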

In [None]:
sc.tl.ingest(
    xen_cropped["table"],
    adata_ref, obs='CellType',
    embedding_method=['pca']
)
xen_cropped["table"]

Unfortunately, the predicted labels and spatial clusters do not correspond clearly in all cases.
This discrepancy is particularly evident in the middle regions of the UMAP, where many cells are predicted as epithelial cells - probably incorrectly!

How to improve this?

**Ensure Good Representation of Cell Type Markers in *in situ* Target Panel:**
Most critically, before undertaking any experiments, you want to ensure that there is good representation of all cell types in your target panel.
In this case, there is not much to be done as the data has already been generated.

**Review and Refine Reference Data:**
Ensure that the reference single-cell dataset is comprehensive and accurately annotated. If certain cell types are not well represented or annotated in the reference dataset, it can lead to misclassification.

**Increase the Number of Dimensions:**
Increasing the number of dimensions used in the UMAP and PCA steps might capture more variance in the data, leading to better label transfer.

**Filter and Preprocess Data:**
Filtering out low-quality cells or genes and performing additional preprocessing steps can improve the quality of the shared embedding and, consequently, the label predictions.

**Manually Annotate or Correct Predictions:**
In cases where automatic label transfer is insufficient, consider manually annotating or correcting the predictions for critical regions to ensure accuracy.


In [None]:
sc.pl.umap(
    xen_cropped["table"],
    color = "CellType"
)

As before, we can also visualise the predicted cell labels in tissue space.

In [None]:
ax = xen_cropped.pl.render_labels(
    element="cell_labels",
    table_name="table",
    fill_alpha=1,
    color="CellType"
).pl.show(
    coordinate_systems="global",
    title="Predicted Cell Type",
    return_ax=True
)

ax.axis("off")

Unfortunately, `ingest()` does not provide any confidence score on its prediction.

When other methods provide such scores, it can be helpful to visualise and diagnose issues such as:

- lower confidence in certain spatial areas
- lower confidence in cells that embed "between" clusters (e.g. between B-cells and T-cells)

This is often the case where cell segmentation is imperfect and partitions transcripts in such a way that it generates “artificial” doublets by pulling in transcripts from an adjacent cell.
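
As a rough illustration of what such a score can look like, the sketch below derives a confidence from the fraction of nearest reference neighbours that agree on a label. The neighbour labels, the 0.7 cut-off and the helper `vote_summary` are all hypothetical; other methods define their scores differently:

```python
import pandas as pd

# Hypothetical labels of the 5 nearest reference cells for three query cells
neighbour_labels = pd.DataFrame(
    [
        ["T cell", "T cell", "T cell", "T cell", "T cell"],
        ["B cell", "B cell", "B cell", "B cell", "T cell"],
        ["T cell", "B cell", "T cell", "B cell", "T cell"],
    ],
    index=["cell_1", "cell_2", "cell_3"],
)

def vote_summary(row):
    # value_counts() sorts labels by frequency, most common first
    votes = row.value_counts()
    return pd.Series({
        "label": votes.index[0],
        "confidence": votes.iloc[0] / len(row),
    })

summary = neighbour_labels.apply(vote_summary, axis=1)
summary["confidence"] = summary["confidence"].astype(float)

# Flag cells whose neighbours disagree (illustrative cut-off)
summary["low_confidence"] = summary["confidence"] < 0.7
print(summary)
```

A cell sitting between the T-cell and B-cell populations, like `cell_3` here, receives mixed votes and is flagged, which is the pattern expected for segmentation-induced "doublets".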

As an alternative to using our own single cell reference, we can use one of the reference models provided by [CellTypist](https://www.celltypist.org/).
If you don't have a good reference of your own, this can be a quick and easy starting point.

https://www.celltypist.org/

First, let's download the available models and inspect them:

In [None]:
models.download_models(
    force_update = True
)

In [None]:
models.models_description()

To classify cells using [CellTypist](https://www.celltypist.org/) cell models, we simply need an [AnnData](https://anndata.readthedocs.io/en/stable/) object (normalised and log transformed) and a model name.

In [None]:
adata = xen_cropped["table"].copy()
adata.X = adata.layers["counts"]
sc.pp.normalize_total(
    adata,
    target_sum=10000,
    inplace = True
)
sc.pp.log1p(adata)
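
CellTypist expects log1p-transformed expression normalised to 10,000 counts per cell, so it can be worth checking that the input matches. Below is a minimal sketch of such a check on a toy matrix (the counts are hypothetical, and the NumPy arithmetic stands in for `normalize_total` followed by `log1p`):

```python
import numpy as np

# Toy raw counts (cells x genes), standing in for adata.layers["counts"]
counts = np.array([[2.0, 3.0, 5.0],
                   [0.0, 1.0, 1.0]])

# Same arithmetic as normalising to a target sum of 10,000, then log1p
target_sum = 10_000
normalised = counts / counts.sum(axis=1, keepdims=True) * target_sum
log_norm = np.log1p(normalised)

# Sanity check: undoing the log should recover ~10,000 counts per cell
recovered_totals = np.expm1(log_norm).sum(axis=1)
print(recovered_totals)
```

On a real object, the same check could be applied to `adata.X` after the normalisation cell; per-cell totals far from 10,000 would suggest that the matrix was not raw counts to begin with.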

Here, we run predictions using human intestinal tract cells, which is probably the closest reference to human adult colon.
You could also try the colorectal cancer reference here.

In [None]:
predictions = celltypist.annotate(
    adata,
    model = 'Cells_Intestinal_Tract.pkl',
    majority_voting = True
)
adata = predictions.to_adata()
predictions.predicted_labels

Now we can visualise predicted labels and also confidence scores.
As with our previous method, we seem to be struggling to predict cell types in certain clusters. 

As previously covered, the stromal cells are likely poorly predicted due to limited panel coverage.
What's interesting is that we often also see poor prediction confidence scores in-between clusters, for instance the division between T-cells and B-cells is not so clear, with low confidence cells forming at the cluster boundary.
We know we should not be getting anything here that could be a biological "intermediate" cell, and therefore this hints at poor cell segmentation. 

In [None]:
sc.pl.umap(
    adata,
    color = 'conf_score'
)

In [None]:
sc.pl.umap(
    adata,
    color = 'majority_voting'
)

In [None]:
celltypist.dotplot(
    predictions,
    use_as_reference = 'CellType',
    use_as_prediction = 'majority_voting'
)

For example, if we visualise the lineage markers for T-Cells and B-Cells, we can see that they are often “co-expressed” in the same cells when biologically, they should not be.

The `sc.pl.scatter()` function is used to create a scatter plot showing the relationship between the expression levels of two genes across all cells.
This visualization helps to identify potential correlations or patterns between the two genes.

In [None]:
sc.pl.scatter(
    xen_cropped["table"],
    x="MS4A1",
    y="CD3D",
    color="CellType",
    size=20
)

In [None]:
sc.pl.umap(
    xen_cropped["table"],
    color = ["MS4A1", "CD3D"],
)
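
One way to put a number on this "co-expression" is to count the cells in which both markers are detected. The sketch below does this on a hypothetical toy table; with real data, the two columns could be extracted from the counts layer instead:

```python
import pandas as pd

# Hypothetical per-cell counts for a B-cell marker and a T-cell marker
expr = pd.DataFrame(
    {"MS4A1": [5, 0, 3, 0, 2, 0],
     "CD3D":  [0, 4, 2, 0, 1, 6]},
    index=[f"cell_{i}" for i in range(6)],
)

# Cross-tabulate detection (count > 0) of the two markers
detected = expr > 0
table = pd.crosstab(detected["MS4A1"], detected["CD3D"])
print(table)

# Among cells positive for at least one marker, the fraction positive for both
both = (detected["MS4A1"] & detected["CD3D"]).sum()
either = (detected["MS4A1"] | detected["CD3D"]).sum()
coexpression_fraction = both / either
print(coexpression_fraction)
```

Comparing this fraction before and after re-segmentation gives a simple, marker-based indication of whether transcripts are being assigned to the wrong neighbouring cell less often.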

### 8. Cell Re-segmentation 

To reduce these artefacts, we can try alternative cell segmentation algorithms.
What works best is very tissue-dependent and there’s no easy one-stop solution to this.
Cell segmentation algorithms can be divided into a few groups.

**Nuclei-based Segmentation** algorithms primarily focus on identifying cell nuclei, which are usually more distinct and easier to detect than the cell boundaries.
Once the nuclei are identified, the cell boundaries are inferred by expanding around the nuclei.
This approach works well in tissues where the nuclei are clearly visible and distinct. In early versions of many in situ platforms it was the only available method, because DAPI was the only stain used.

**Cell Boundary-Based Segmentation** algorithms (e.g. Cellpose) segment cells directly by identifying their boundaries.
They are particularly effective for images with complex cell shapes and varying sizes, but require good cell boundary staining, which is not available for our test dataset.
Cell boundary staining can often be non-uniform across different tissues, adding further difficulties.
Cellpose version 3 incorporates user-guided model training, which can be very useful for difficult-to-segment cell types, but this requires a time investment to annotate training examples.

**Transcript-Density Based Segmentation** algorithms, like Baysor, segment cells based on the spatial distribution of transcripts.
Baysor uses Bayesian inference to assign transcripts to cells, considering both the density and distribution of RNA molecules. This can be very useful for improving cell segmentation where cell boundary stain is not available or not working well.

In this case, we will try re-segmenting our data with **Baysor**.
Here’s the run we prepared earlier - see supplementary material on how to process the data yourself.

In [None]:
baysor_dir = os.path.join(PRECOMPUTED_DIR, "baysor")

seg = pd.read_csv(
    os.path.join(baysor_dir, "segmentation.csv")
)

seg.head(5)

There will be some transcripts that cannot be assigned to a cell - about 10% in this case, which is a fairly normal level of noise.
This information is stored under the “is_noise” flag.

In [None]:
seg['is_noise'].value_counts()

In [None]:
plt.figure(figsize=(8, 5))
plt.hist(
    x=seg['assignment_confidence'],
    bins=50
)
plt.xlabel('Seg Assignment Confidence')
plt.show()

In [None]:
(seg['assignment_confidence'] > 0.9).value_counts()

We can also look at transcript confidence - the confidence that the molecule itself is real and not noise.

In [None]:
plt.figure(figsize=(8, 5))
plt.hist(
    x=seg['confidence'],
    bins=50
)
plt.xlabel('Seg Confidence')
plt.show()

We can filter out low confidence and low assignment confidence transcripts here from further analysis.
How stringent you want to be depends on whether you want to keep as much data as possible and accept some inaccuracies, or end up with the cleanest possible dataset.
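
To get a feel for this trade-off, you can tabulate how much data survives at different cut-offs. The sketch below uses a small hypothetical table with the same column names as the segmentation output; `seg_toy` and the thresholds are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for the Baysor segmentation table
seg_toy = pd.DataFrame({
    "confidence":            [0.99, 0.95, 0.80, 0.97, 0.60, 0.92],
    "assignment_confidence": [0.98, 0.85, 0.95, 0.93, 0.99, 0.70],
    "is_noise":              [False, False, False, False, True, False],
})

# Fraction of transcripts retained at increasingly strict thresholds
retained = {}
for threshold in (0.5, 0.75, 0.9):
    keep = (
        (seg_toy["confidence"] > threshold)
        & (seg_toy["assignment_confidence"] > threshold)
        & (~seg_toy["is_noise"])
    )
    retained[threshold] = keep.mean()
    print(f"threshold {threshold}: {keep.mean():.2f} of transcripts kept")
```

Running the same loop on the real `seg` table shows how quickly the retained fraction drops as the threshold rises, which helps when choosing between maximum data and maximum cleanliness.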

Here, we will filter out transcripts that have not been assigned to a cell, as well as those with a confidence or assignment confidence of 0.9 or below.

Then, we tabulate a cell by gene matrix from these data.

In [None]:
seg_filtered = seg[
    (seg['confidence']> 0.9)&
    (seg['assignment_confidence'] > 0.9) &
    (~seg['is_noise'])
]

mat = pd.crosstab(
    seg_filtered['gene'],
    seg_filtered['cell']
)
mat

[Baysor](https://kharchenkolab.github.io/Baysor/dev/) further provides diagnostic information about cells in the `segmentation_cell_stats.csv` file, which we will also read in here.

The following parameters can be used to filter low-quality cells:

**area:**
area of the convex hull around the cell molecules

**avg_confidence:**
average confidence of the cell molecules

**density:**
the number of molecules in a cell divided by the cell area

**elongation:**
ratio of the two eigenvalues of the cell covariance matrix

**n_transcripts:**
number of molecules per cell

**avg_assignment_confidence:**
average assignment confidence per cell.
Cells with low avg_assignment_confidence have a much higher chance of being an artifact.

**max_cluster_frac (only if n-clusters > 1):**
fraction of the molecules coming from the most popular cluster.
Cells with low max_cluster_frac are often doublets.

**lifespan:**
number of iterations the given component exists.
The maximal lifespan is clipped proportionally to the total number of iterations.
Components with a short lifespan likely correspond to noise.
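
As a sketch of how these statistics might be combined into a cell-level filter, the example below applies a few cut-offs to a hypothetical table. The values and thresholds are illustrative assumptions; sensible cut-offs depend on the tissue and the panel:

```python
import pandas as pd

# Hypothetical per-cell statistics using the column names described above
stats_toy = pd.DataFrame(
    {
        "n_transcripts":             [120, 8, 60, 45],
        "avg_assignment_confidence": [0.97, 0.95, 0.55, 0.92],
        "lifespan":                  [200, 180, 150, 10],
    },
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

# Illustrative cut-offs only: enough transcripts, confident assignment,
# and a component that persisted for a reasonable number of iterations
good_cells = stats_toy[
    (stats_toy["n_transcripts"] >= 15)
    & (stats_toy["avg_assignment_confidence"] >= 0.8)
    & (stats_toy["lifespan"] >= 50)
]
print(good_cells.index.tolist())
```

Here each of the three rejected cells fails a different criterion, which shows why it helps to inspect the metrics separately before combining them.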

In [None]:
stats = pd.read_csv(
    os.path.join(baysor_dir,"segmentation_cell_stats.csv")
)
stats.set_index('cell', inplace=True)

stats.head(5)

Now, we can assemble an AnnData object as before:

In [None]:
adata_reseg = AnnData(mat.values.T)
adata_reseg.var_names = mat.index
adata_reseg.obs_names = mat.columns
adata_reseg.obs = stats.loc[mat.columns]

centroids = adata_reseg.obs[['x', 'y']].to_numpy()[:,[1,0]]
centroids[:,1] = -centroids[:, 1]

adata_reseg.obsm['spatial'] = centroids

adata_reseg
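
The two centroid lines in the cell above swap the coordinate columns and then negate the second one, which maps each point (x, y) to (y, -x). A small sketch on hypothetical coordinates shows that this is the same as a 90 degree rotation (clockwise when the y axis points up), and that the original array is left untouched because fancy indexing returns a copy:

```python
import numpy as np

# Hypothetical (x, y) centroids for three cells
xy = np.array([[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]])

# Same two steps as in the notebook cell
centroids = xy[:, [1, 0]]           # copy with columns ordered (y, x)
centroids[:, 1] = -centroids[:, 1]  # negate the second column -> (y, -x)

# Equivalent rotation matrix: (x, y) -> (y, -x)
rotation = np.array([[0.0, 1.0],
                     [-1.0, 0.0]])
rotated = xy @ rotation.T

print(centroids)
print(np.allclose(centroids, rotated))
```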

From here, we can use the AnnData object to visualise various cell metadata - for example, average transcript assignment confidence per cell.

In [None]:
ax = sq.pl.spatial_scatter(
    adata_reseg,
    library_id="spatial",
    shape=None,
    size=2,
    color="avg_assignment_confidence",
    figsize=(12, 6),
    return_ax=True
)

Let's filter out low count cells and re-cluster the data as before.
In practice, you may also want to consider additional QC metrics here - e.g. removing overly small cells.

In [None]:
sc.pp.calculate_qc_metrics(
    adata_reseg,
    inplace=True,
    log1p=False
)

adata_reseg = adata_reseg[
    adata_reseg.obs["total_counts"] >= 15
].copy()

sc.pp.normalize_total(adata_reseg)
sc.pp.log1p(adata_reseg)
sc.pp.scale(adata_reseg)

sc.pp.pca(
    adata_reseg, 
    n_comps=25, 
    use_highly_variable=False, 
    svd_solver='arpack', 
    copy=False
)

sc.pp.neighbors(
    adata_reseg, 
    n_neighbors=20, 
    use_rep='X_pca'
)

sc.tl.umap(
    adata_reseg,
    min_dist=0.3
)

sc.tl.leiden(
    adata_reseg,
    resolution = 0.5
)

adata_reseg

Visualising clusters, we can see that we already obtain a better separation in the UMAP embedding than before.
Though of course, distances in the UMAP space can be very misleading and careful interpretation is required.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))

sc.pl.umap(
    adata_reseg,
    color='leiden',
    ax=ax,
)

Next we visualise the clusters in tissue space.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))

sc.pl.spatial(
    adata_reseg,
    color='leiden',
    spot_size=10,
    ax=ax,
)

As before, let's cross-classify our cells using the reference single-cell dataset:

In [None]:
adata_reseg = adata_reseg[:, common_genes].copy()
adata_ref = adata_ref[:, common_genes].copy()
sc.tl.ingest(
    adata_reseg,
    adata_ref,
    obs='CellType',
    embedding_method=['pca']
)

adata_reseg

Visualising the predictions, we’ve separated T-Cells from B-Cells much better.
The stromal clusters still predict poorly, but that is due to poor probe coverage.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))

sc.pl.umap(
    adata_reseg,
    color = 'CellType',
    ax=ax,
)

We can check the distribution in tissue space:

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 10))
sc.pl.spatial(
    adata_reseg,
    color = "CellType",
    spot_size=10,
    ax=ax
)

**Additional exercises**

Congratulations on reaching the end of this notebook!

Here are some additional prompts if you wish to practice more:

- How does this compare with other segmentation algorithms?
  For example, data resegmented with proseg has been made available here: 

```
/project/shared/python/5_python_spatial_omics/data/precomputed/imaging/proseg
```

- What metrics would you use to quantify cell segmentation accuracy?
