First we need to set up the environment and load the packages we will use for this workshop.

In [None]:
import os
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.io import mmread
from anndata import AnnData
import squidpy as sq
print(f"squidpy=={sq.__version__}")

Set the path to the directory containing the Xenium output data - this is the directory where all of the outputs are stored.

In [None]:
data_dir = '/nvme/project/shared/r/3_r_spatial/DATA/XENIUM_COLON_SUBSET'

In [None]:
os.listdir(data_dir)

Typically, you would use a loader for your specific data type to load data here - e.g. [spatialdata_io.xenium()](https://spatialdata.scverse.org/projects/io/en/latest/generated/spatialdata_io.xenium.html).

However, the data formats from different platforms are often still changing, so the readers frequently do not work or are not up to date.

Therefore, we will cover how to read in the individual pieces of data and assemble an `AnnData` object without the help of a loader.

The main components that we need are:

1. cell x gene matrix
2. cell metadata
3. cell coordinates - centroids and/or segmentation boundaries
4. (optional) transcript coordinates
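Before loading anything, it can help to confirm that the expected files are actually present. The sketch below is only an illustration: the file names are the ones read later in this section, while the helper `missing_xenium_files()` and the temporary demo directory are hypothetical and not part of any library.

```python
import os
import tempfile

# Relative paths of the files read in this section (taken from the cells below).
EXPECTED_FILES = [
    os.path.join("cell_feature_matrix", "features.tsv.gz"),
    os.path.join("cell_feature_matrix", "barcodes.tsv.gz"),
    os.path.join("cell_feature_matrix", "matrix.mtx.gz"),
    "cells.csv.gz",
    "cell_boundaries.csv.gz",
    "transcripts.csv.gz",
]

def missing_xenium_files(directory, expected=EXPECTED_FILES):
    """Return the expected files that are not present under `directory`."""
    return [f for f in expected if not os.path.isfile(os.path.join(directory, f))]

# Demonstration on a throwaway directory that contains only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "cells.csv.gz"), "w").close()
    demo_missing = missing_xenium_files(tmp)

print(demo_missing)
```

On real data you would call `missing_xenium_files(data_dir)` and expect an empty list.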

We start by loading the cell x gene matrix, including feature and barcode metadata:

In [None]:
features = pd.read_table(os.path.join(data_dir, "cell_feature_matrix", "features.tsv.gz"), header=None, names=["gene_ids", "gene_name", "feature_types"]).set_index("gene_name")
features

In [None]:
barcodes = pd.read_table(os.path.join(data_dir, "cell_feature_matrix", "barcodes.tsv.gz"), header=None,  names=["cell_id"])
barcodes

In [None]:
matrix = mmread(os.path.join(data_dir, "cell_feature_matrix", "matrix.mtx.gz"), spmatrix=True).tocsr().T
matrix

Read in additional information about the cells - this gives us pre-calculated information, for example segmented cell or nucleus size for each cell.

In [None]:
cells = pd.read_csv(os.path.join(data_dir, "cells.csv.gz"))
cells

In [None]:
cells[['x_centroid', 'y_centroid']].agg(['min', 'max'])

Next, we read in cell boundaries:



In [None]:
cell_polygons = pd.read_csv(os.path.join(data_dir, "cell_boundaries.csv.gz"))
cell_polygons

Next, we swap the x and y vertex columns so that the orientation of the cell boundaries matches the Seurat plots.

In [None]:
# rotate the coordinates to match the orientation of Seurat plots
# however, do not flip the Y-axis because we'll use a function that already orientates the Y-axis correctly
# remember this when you come across gdf.plot() later
cell_polygons.columns = ['cell_id', 'vertex_y', 'vertex_x']
cell_polygons

And individual transcript coordinates:

In [None]:
transcripts = pd.read_csv(os.path.join(data_dir, "transcripts.csv.gz"))
transcripts

In [None]:
# rotate the coordinates to match the orientation of Seurat plots
transcripts_rotated = pd.DataFrame({
    'x_location': transcripts["y_location"],
    'y_location': transcripts["x_location"]
})
transcripts[['x_location', 'y_location']] = transcripts_rotated
transcripts

Then, we will assemble gene expression matrices. The matrix file is further split into the gene expression matrix and various control probes and codewords. Different platforms and platform versions include different control probes. As this will vary, it’s important to check and understand what the specific controls in your own data are.

Here, negative control probes are probes that are added to the reaction but target non-biological sequences and should not bind any tissue RNA. Negative control codewords are valid codewords, but no probes with that codeword are added to the reaction. This effectively tells us how good the transcript calling algorithm is.

For convenience, we will split control probes and codewords into separate matrices.
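One quick way to check which control categories your own data contains is to tabulate the feature types. The sketch below uses a toy table in place of `features`; the gene and control names in it are made up, and only the `feature_types` labels are taken from this tutorial.

```python
import pandas as pd

# Toy stand-in for the `features` table; names are invented for illustration.
names = ["GENE_A", "GENE_B", "NegControlProbe_01",
         "NegControlCodeword_01", "UnassignedCodeword_01"]
toy_features = pd.DataFrame(
    {
        "gene_ids": names,
        "feature_types": [
            "Gene Expression",
            "Gene Expression",
            "Negative Control Probe",
            "Negative Control Codeword",
            "Unassigned Codeword",
        ],
    },
    index=pd.Index(names, name="gene_name"),
)

# Number of features in each category
type_counts = toy_features["feature_types"].value_counts()
print(type_counts)

# Every category other than "Gene Expression" is treated here as a control
control_types = sorted(set(toy_features["feature_types"]) - {"Gene Expression"})
print(control_types)
```

If a category appears that you do not recognise, it is worth looking it up in the platform documentation before deciding how to handle it.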

In [None]:
gex_matrix = matrix[:, (features['feature_types'] == "Gene Expression").values]
gex_matrix

In [None]:
ncp_matrix = matrix[:, (features['feature_types'] == "Negative Control Probe").values]
ncp_matrix

In [None]:
ncc_matrix = matrix[:, (features['feature_types'] == "Negative Control Codeword").values]
ncc_matrix

In [None]:
uc_matrix = matrix[:, (features['feature_types'] == "Unassigned Codeword").values]
uc_matrix

Then, we assemble a basic `AnnData` object with gene expression matrix and metadata.

In [None]:
adata = AnnData(X=gex_matrix, obs=cells)
adata

In [None]:
adata.obs_names = adata.obs["cell_id"]
adata

Add gene names

In [None]:
adata.var_names = features[features["feature_types"] == "Gene Expression"].index
adata

We add the spatial coordinates of cells as documented [here](https://squidpy.readthedocs.io/en/stable/notebooks/tutorials/tutorial_read_spatial.html#spatial-coordinates-in-anndata)

In [None]:
# rotate the coordinates to match the orientation of Seurat plots
centroids = cells[["x_centroid", "y_centroid"]].to_numpy()[:, [1, 0]]
centroids[:, 1] = -centroids[:, 1]
centroids

In [None]:
adata.obsm["spatial"] = centroids
adata
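To make the coordinate transform above concrete, here is a small sketch on two toy points (the values are illustrative only). Swapping the columns and negating the new second column maps `(x, y)` to `(y, -x)`, which is a 90-degree clockwise rotation in a coordinate system where y points up.

```python
import numpy as np

# Toy centroids as (x, y) pairs; the values are illustrative only.
toy_xy = np.array([[1.0, 2.0],
                   [3.0, 5.0]])

# Same two steps as above: swap the columns, then negate the new second column.
toy_rot = toy_xy[:, [1, 0]]       # fancy indexing returns a copy: (x, y) -> (y, x)
toy_rot[:, 1] = -toy_rot[:, 1]    # (y, x) -> (y, -x)

# Compare with an explicit 90-degree clockwise rotation matrix.
theta = -np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
matches_rotation = np.allclose(toy_rot, toy_xy @ rotation.T)

print(toy_rot)
print(matches_rotation)
```

Because the second spatial coordinate is the negated original x coordinate, any table whose coordinates are only swapped (not negated) needs its y-limits negated when it is compared against `adata.obsm["spatial"]`.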

Let’s start with some basic QC and visualisation of the data.

First, let’s visualise the total transcripts detected per cell.

As in scRNA-Seq data, this is the most basic measure of overall signal and of how good the data looks.

Unlike in scRNA-Seq data or unbiased sequencing-based ST, these measures are also very heavily dependent not only on the total RNA quantity of each cell and tissue quality, but also on the target panel used for the experiment. Under-represented cell types will naturally yield fewer transcripts. Finally, the quality of cell segmentation also plays a role.

In this case, we can see that there are areas with higher and lower total transcripts detected.

Understanding your tissue and target panel here is important to delineate where these differences are biological and where they may be technical.

In [None]:
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color=[
        "transcript_counts",
    ],
)

Similarly, we can visualise the total number of genes detected per cell. You can see that this is a bit less variable across the tissue.

This may also suggest that there are cells at the top of the epithelial crypts in this sample with genes detected at higher copy number than in the rest of the tissue.

In [None]:
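# count the genes with at least one transcript in each cell;
# np.asarray(...).ravel() flattens the (n_cells, 1) result into a 1-D array
adata.obs["n_features"] = np.asarray((adata.X > 0).sum(axis=1)).ravel()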

In [None]:
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color=[
        "n_features",
    ],
    wspace=0.4,
)

The code below examines the distribution of the number of features (genes) detected per cell in the AnnData object
using a density plot and calculates specific quantiles of this distribution.
This is important for understanding the variability and distribution of detected features,
which can help identify potential issues such as low-quality cells and determine any filtering thresholds
that may need to be applied.

If you’re coming from scRNA-Seq work, these low numbers probably look very alarming.
How can you possibly work with a median of 31 genes per cell?

Unlike in scRNA-Seq data and sequencing-based ST, both gene dropouts and noise are much, much lower in in situ ST data.

We are also working with 100-fold fewer targeted genes.

In [None]:
sns.kdeplot(data=adata.obs, x="n_features", fill=True)
plt.xlabel("n_features")
plt.ylabel("Density")
plt.tight_layout()
plt.show()

We can also report quantiles of that same variable

In [None]:
adata.obs["n_features"].quantile([0.01, 0.10, 0.50, 0.90, 0.99])

Using `sq.pl.spatial_scatter()` to visualise the cell area in spatial transcriptomics data allows us to examine the spatial organisation and potential heterogeneity of cell sizes within the tissue sample.

Why do we get such a difference in spatial distribution of cell sizes?

This could be due to biological differences between small and large cells - e.g. small cells like T-cells.

However, here the signal correlates with areas of low cell density. Therefore, it is likely that this is an artefact of nuclei expansion in cell segmentation.

What is Nuclei Expansion?

Nuclei expansion in cell segmentation refers to the process of enlarging the segmented nuclei regions to approximate the boundaries of the entire cells. This technique is used to better represent the actual cell boundaries when only the nuclei have been explicitly segmented, for example when we only have DAPI and no additional cell boundary staining. The primary goal is to provide a more accurate estimation of the cellular area, which is crucial for various downstream analyses in spatial transcriptomics and single-cell studies. In this case, nuclei expansion is constrained either by a maximum distance or by other nearby cells. So, where there are no other nearby cells to “bump into”, the expansion generates artificially bigger cells.
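The effect can be illustrated with a deliberately simplified toy model. This is not the segmentation algorithm used by the platform: it assumes circular nuclei of equal radius and lets each one grow until it reaches a maximum expansion distance or the midpoint to its nearest neighbour. The function name and all numbers are hypothetical.

```python
import numpy as np

def toy_expanded_areas(centres, nucleus_radius, max_expansion):
    """Toy model: grow each circular nucleus until it reaches `max_expansion`
    or meets the midpoint to its nearest neighbour, whichever comes first."""
    centres = np.asarray(centres, dtype=float)
    # Pairwise centre-to-centre distances; ignore each cell's distance to itself
    diff = centres[:, None, :] - centres[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nearest = dist.min(axis=1)
    # Room available before "bumping into" the neighbour's half of the gap
    room = np.clip(nearest / 2 - nucleus_radius, 0, None)
    expansion = np.minimum(room, max_expansion)
    return np.pi * (nucleus_radius + expansion) ** 2

# Three tightly packed cells plus one isolated cell (arbitrary units)
centres = [(0, 0), (8, 0), (4, 7), (100, 100)]
areas = toy_expanded_areas(centres, nucleus_radius=3.0, max_expansion=15.0)
print(np.round(areas, 1))
```

In this toy setting the isolated cell ends up many times larger than the packed ones even though all nuclei are identical, which is the pattern we expect to see in regions of low cell density.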

In [None]:
# Python counterpart of Seurat's ImageFeaturePlot(seurat, "cell_area") + scale_fill_viridis_c()
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color=[
        "cell_area",
    ],
    wspace=0.4,
)

We can further check that this is likely the case by plotting the ratio between nuclei and total cell area. We can see that there is a very big decrease in percentage of cell area occupied by nucleus in areas of low cell density.

The nucleus-to-cell area ratio can also potentially provide insights into cell morphology, cell type and potential changes in cellular states or conditions. For example, T-cells can often be quite well identified by this variable alone, as they have a small cytoplasm volume. However, without a cell boundary stain, this metric mainly captures segmentation artefacts, so be careful about over-interpretation!

In [None]:
adata.obs["cell_nucleus_ratio"] = adata.obs["nucleus_area"] / adata.obs["cell_area"]

In [None]:
# plot the fraction of each cell's area that is occupied by its nucleus
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color=[
        "cell_nucleus_ratio",
    ],
    wspace=0.4,
)

If we look at the distribution of cell areas, we see that it has a long tail of overly large cells.

In [None]:
sns.kdeplot(data=adata.obs, x="cell_area", fill=True)
plt.xlabel("cell_area")
plt.ylabel("Density")
plt.tight_layout()
plt.show()

In this case, we can see that as expected, there is generally a correlation between cell area and transcript detection rate.

However, we also have a group of cells where this is not the case - very large cells but relatively few transcripts. These cells are mainly submucosal stromal cells which are very poorly covered by the panel 10x have used.

In [None]:
sns.scatterplot(data=adata.obs, x="total_counts", y="cell_area", s=5)
plt.xlabel("total_counts")
plt.ylabel("cell_area")
plt.tight_layout()
plt.show()

We can create a filter to remove the overly large cells from the analysis.

In [None]:
adata.obs["size_filter_large"] = adata.obs["cell_area"] < np.quantile(adata.obs["cell_area"], 0.99)
# re-type as categorical (useful for plotting) 
adata.obs["size_filter_large"] = adata.obs["size_filter_large"].astype("category")

Now we can use `sq.pl.spatial_scatter()` to visualise the cells which have been flagged for removal.

In [None]:
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="size_filter_large",
)

We can use the same approach to create a filter for segmented cells which are very small and likely segmentation artefacts.

In [None]:
adata.obs["size_filter_small"] = adata.obs["cell_area"] > np.quantile(adata.obs["cell_area"], 0.01)
# re-type as categorical (useful for plotting) 
adata.obs["size_filter_small"] = adata.obs["size_filter_small"].astype("category")

In [None]:
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="size_filter_small",
)

We can check how these values correlate with gene detection rate.

If we filter out small cells, we will mostly remove cells with low numbers of genes detected.

Filtering out large cells, by contrast, is not strongly biased towards cells with very high numbers of genes detected, as we saw before.

In [None]:
sns.violinplot(data=adata.obs, x="size_filter_small", y="n_features", palette="tab10")
# sns.swarmplot(data=adata.obs, x="size_filter_small", y="n_features", color="black", size=3)
plt.tight_layout()
plt.show()

In [None]:
sns.violinplot(data=adata.obs, x="size_filter_large", y="n_features", palette="tab10")
# sns.swarmplot(data=adata.obs, x="size_filter_large", y="n_features", color="black", size=3)
plt.tight_layout()
plt.show()

Adjusting the threshold for what is considered a “small cell” can have significant implications for your analysis, especially in areas with specific cell types such as T-cells, which are small and densely packed in follicular regions. This example demonstrates how changing the threshold to the 10th percentile affects the filtering. In this case, we would probably filter out a lot of good cells that we don’t want to lose! So, be careful when looking at these types of QC metrics!

In [None]:
adata.obs["size_filter_small"] = adata.obs["cell_area"] > np.quantile(adata.obs["cell_area"], 0.1)
# re-type as categorical (useful for plotting) 
adata.obs["size_filter_small"] = adata.obs["size_filter_small"].astype("category")
# plot
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="size_filter_small",
)
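To quantify this effect rather than judge it by eye, one option is to compare the fraction of cells removed per region at different quantile thresholds. The sketch below uses simulated data: the region labels, area distributions and the helper `fraction_removed()` are all assumptions for illustration, since the real object has no region annotation at this point.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: a "follicle" of small cells and an "epithelium" of larger cells.
# Region labels and area values are invented for illustration.
toy_obs = pd.DataFrame({
    "region": ["follicle"] * 200 + ["epithelium"] * 800,
    "cell_area": np.concatenate([
        rng.normal(30, 5, 200),    # small, densely packed cells
        rng.normal(80, 15, 800),   # larger cells
    ]),
})

def fraction_removed(obs, q, group_col="region", area_col="cell_area"):
    """Fraction of cells per group at or below the q-th quantile of cell area."""
    cutoff = np.quantile(obs[area_col], q)
    removed = obs[area_col] <= cutoff
    return removed.groupby(obs[group_col]).mean()

comparison = pd.DataFrame({
    "q=0.01": fraction_removed(toy_obs, 0.01),
    "q=0.10": fraction_removed(toy_obs, 0.10),
})
print(comparison.round(3))
```

A global quantile removes a fixed share of all cells, but that share can fall almost entirely on one region, which is why a threshold that looks harmless overall can still wipe out a small-celled population.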

Let’s set this back to the original 1% threshold.

In [None]:
adata.obs["size_filter_small"] = adata.obs["cell_area"] > np.quantile(adata.obs["cell_area"], 0.01)
# re-type as categorical (useful for plotting) 
adata.obs["size_filter_small"] = adata.obs["size_filter_small"].astype("category")

The most important filter is the overall transcript detection. Empty cells or cells with very low transcript counts cannot be taken forward for clustering analysis, and it is extremely difficult to identify what they may be. Here, we set a threshold of a minimum of 15 transcripts. This may seem quite low, but for data from in situ platforms with low noise (Xenium, MERFISH, MERSCOPE), this is generally enough to cluster and identify cell types. If your data has more noise (e.g. CosMx), a higher threshold is more appropriate.

In [None]:
adata.obs["transcript_filter"] = adata.obs["total_counts"] >= 15
# re-type as categorical (useful for plotting) 
adata.obs["transcript_filter"] = adata.obs["transcript_filter"].astype("category")

And we can visualise the cells that we would lose.

We see that we would filter out disproportionately more cells from some regions than others. As pointed out previously, this is likely due to a combination of gene panel coverage in some regions and very small cells in densely packed regions like follicles.



In [None]:
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="transcript_filter",
)
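To judge how sensitive the result is to the choice of 15 transcripts, it can help to tabulate the fraction of cells retained at a few candidate thresholds. This is a sketch on simulated counts; the distribution and the list of thresholds are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy per-cell transcript totals; a negative binomial is used purely for illustration.
toy_total_counts = pd.Series(rng.negative_binomial(n=2, p=0.02, size=5000))

# Fraction of cells that would pass each candidate minimum-count threshold
thresholds = [5, 10, 15, 25, 50]
retained = pd.Series(
    {t: (toy_total_counts >= t).mean() for t in thresholds},
    name="fraction_retained",
)
print(retained.round(3))
```

On real data, `toy_total_counts` would be replaced by `adata.obs["total_counts"]`.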

Finally, visualizing the counts of negative control codewords, negative control probes, and unassigned codewords helps identify and understand technical artifacts and background noise in your spatial transcriptomics data.

Here, we can see that all control probes and codewords yield very little signal, suggesting our data is good quality!

In some cases, a high amount of autofluorescence in the cells/tissue can generate false positive signal, and this should be filtered out.

In [None]:
adata.obs["n_features_neg_ctrl_codeword"] = ncc_matrix.sum(axis=1)
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="n_features_neg_ctrl_codeword",
)

In [None]:
adata.obs["n_features_neg_ctrl_probe"] = ncp_matrix.sum(axis=1)
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="n_features_neg_ctrl_probe",
)

In [None]:
adata.obs["n_features_unassigned_codeword"] = uc_matrix.sum(axis=1)
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="n_features_unassigned_codeword",
)

Although the negative control signal is low, we can nonetheless create a filter to remove cells which have any, although in this case it is probably unnecessary.

In [None]:
adata.obs["probe_filter"] = (
    (adata.obs["n_features_unassigned_codeword"] == 0) &
    (adata.obs["n_features_neg_ctrl_codeword"] == 0) &
    (adata.obs["n_features_neg_ctrl_probe"] == 0)
)
# re-type as categorical (useful for plotting) 
adata.obs["probe_filter"] = adata.obs["probe_filter"].astype("category")
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="probe_filter",
)

Finally, we can subset the `AnnData` object based on any/all of the filters we have created earlier.

By combining probe, size, and transcript filters, you can retain only the cells that meet all quality criteria, reducing the impact of technical artifacts and noise on your analysis.
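Before applying the combined mask, it can be useful to tabulate how many cells each filter removes, and how many are removed by that filter alone. The sketch below uses a toy table with the same four filter column names; the values are invented, and plain booleans are used instead of the categorical columns in `adata.obs` (which would need `.astype(bool)` first, as in the next cell).

```python
import pandas as pd

# Toy stand-in for `adata.obs` with the four boolean filter columns (True = keep).
toy_obs = pd.DataFrame({
    "probe_filter":      [True, True, False, True, True, True],
    "size_filter_large": [True, True, True, False, True, True],
    "size_filter_small": [True, False, True, True, True, True],
    "transcript_filter": [True, False, True, True, True, False],
})

filter_cols = list(toy_obs.columns)
as_bool = toy_obs[filter_cols].astype(bool)

summary = pd.DataFrame({
    # cells failing each filter, regardless of the other filters
    "n_removed": (~as_bool).sum(),
    # cells failing this filter and no other one
    "n_removed_only_by_this": [
        int((~as_bool[c] & as_bool.drop(columns=c).all(axis=1)).sum())
        for c in filter_cols
    ],
})
n_kept = int(as_bool.all(axis=1).sum())
print(summary)
print(f"cells passing all filters: {n_kept} of {len(toy_obs)}")
```

A filter with a large `n_removed` but a small `n_removed_only_by_this` is mostly redundant with the others.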

In [None]:
# for readability, combine all the filters into one
mask = (
    (adata.obs["probe_filter"].astype(bool)) &
    (adata.obs["size_filter_large"].astype(bool)) &
    (adata.obs["size_filter_small"].astype(bool)) &
    (adata.obs["transcript_filter"].astype(bool))
)
adata = adata[mask].copy()
adata

## Data Normalisation

Normalize counts per cell using `scanpy.pp.normalize_total`.

This follows the [squidpy Xenium tutorial](https://squidpy.readthedocs.io/en/stable/notebooks/tutorials/tutorial_xenium.html).

In [None]:
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
adata
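To make explicit what these two calls do, here is the same arithmetic by hand on a tiny dense matrix. This is a sketch under the assumption that no target sum is given, in which case scanpy documents that each cell is scaled to the median total count; it is not a replacement for the scanpy functions.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns)
counts = np.array([
    [10, 0, 5, 5],    # total 20
    [ 1, 1, 1, 1],    # total 4
    [40, 0, 0, 0],    # total 40
], dtype=float)

totals = counts.sum(axis=1)
# With no target sum given, every cell is scaled to the median total count
target = np.median(totals)                       # 20.0
normalised = counts / totals[:, None] * target   # every row now sums to `target`
logged = np.log1p(normalised)                    # natural log of (1 + x)

print(normalised.sum(axis=1))
print(logged.round(3))
```

After normalisation, differences in total counts between cells (for example, due to cell size) no longer dominate the comparison, and the log transform compresses the range of highly expressed genes.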

Principal Component Analysis (PCA) is a dimensionality reduction technique used to identify the primary axes of variation in high-dimensional data. In the context of spatial transcriptomics, PCA helps to reduce the complexity of the data while preserving the most important patterns of variation.

TIP: If your target panel is very small, you can skip this step and carry out clustering analysis directly on gene expression. This can sometimes help with achieving better clustering results.

In [None]:
sc.pp.pca(adata)
adata

As before, we can visualise how much variation is captured by each PC.

The information is stored in `adata.uns['pca']['variance_ratio']`.

An elbow plot helps to determine the number of significant PCs to use for downstream analyses.
The plot typically shows the amount of variance explained by each PC, and the “elbow” point indicates a natural cutoff.

In [None]:
vr = adata.uns['pca']['variance_ratio']

plt.figure(figsize=(6,4))
plt.plot(range(1, len(vr)+1), vr, marker='o')
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained Ratio")
plt.title("Scree Plot")
plt.tight_layout()
plt.show()
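As a complement to reading the elbow by eye, one simple heuristic is to pick the smallest number of PCs that reaches a chosen share of the variance captured by all computed PCs. This is a different criterion from the elbow and only a rough guide; the helper `n_pcs_for_share()` and the toy variance ratios are hypothetical.

```python
import numpy as np

# Toy variance ratios standing in for adata.uns['pca']['variance_ratio']
toy_vr = np.array([0.30, 0.20, 0.10, 0.05, 0.03, 0.02, 0.01, 0.01])

def n_pcs_for_share(variance_ratio, share=0.9):
    """Smallest number of leading PCs whose variance reaches `share`
    of the variance captured by all computed PCs."""
    cumulative = np.cumsum(variance_ratio) / np.sum(variance_ratio)
    # small tolerance guards against floating-point round-off at the boundary
    return int(np.searchsorted(cumulative, share - 1e-12) + 1)

print(n_pcs_for_share(toy_vr, share=0.9))
```

Note that the share is relative to the PCs that were computed, not to the total variance in the data, so the answer depends on how many PCs were requested.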

Plotting the top genes contributing to a specific principal component helps in understanding the biological factors driving the variation captured by that component. This type of plot highlights the genes with the highest loadings, which are the most influential in the principal component analysis.

In [None]:
sc.pl.pca_loadings(adata, components='1,2')


The `sc.pl.pca()` function in Scanpy plots cells in PCA space and can colour them by the expression of a specific gene. This helps to understand how the expression of a gene varies across the principal components.

In [None]:
sc.pl.pca(adata, color="CEACAM5")

We can also examine how various PCs are distributed spatially.

Here, we can see that cells with high PC1 scores are enriched at the crypt tops, while cells with low PC1 scores are enriched in follicular structures.

In [None]:
adata.obs["PC1"] = adata.obsm["X_pca"][:, 0]
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="PC1",
)

We can plot the expression of high (or low) loading genes to visualise how this correlates with our dimensionality reduction.

In [None]:
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="MS4A1",
)

Next, we will use the reduced dimensionality data for clustering and cluster visualisation.

`sc.pp.neighbors()` needs to be run prior to `sc.tl.umap()`.

In [None]:
sc.pp.neighbors(adata)
sc.tl.umap(adata)
adata

Then we can run `sc.tl.leiden()` to cluster cells using the Leiden algorithm.

In [None]:
sc.tl.leiden(adata)
adata

Next, let’s visualise the clusters, firstly based on the transcriptome embedding.

In [None]:
sc.pl.umap(adata, color="leiden")

And now let’s plot the clusters in tissue space.

We can see that our clusters have quite nice correspondence to distinct spatial regions.

In [None]:
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="leiden",
)

As before, now we can use scanpy differential expression functions to identify marker genes for specific cell clusters.

In [None]:
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test', groups = ['0'])
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

We can visualise expression of cluster specific markers using feature plots

In [None]:
sc.pl.umap(adata, color="CD3E")

In [None]:
sc.pl.umap(adata, color="MS4A1")

In [None]:
sc.pl.umap(adata, color="CEACAM5")

In [None]:
sc.pl.umap(adata, color="KIT")

Or, as in our sequencing ST tutorial, detect and visualise top markers for every cluster.

In [None]:
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

In [None]:
top = []
# adata.obs["leiden"].cat.categories.size
for i in adata.obs["leiden"].cat.categories.tolist():
    top.extend(sc.get.rank_genes_groups_df(adata, i).head(5)["names"].to_list())
top = list(dict.fromkeys(top))

In [None]:
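# collect the top 5 marker genes of every cluster in one long-format table
top_frames = []
for i in adata.obs["leiden"].cat.categories.tolist():
    i_top = sc.get.rank_genes_groups_df(adata, i).head(5)["names"].to_list()
    # the scalar cluster label is broadcast to every gene row
    top_frames.append(pd.DataFrame({"gene": i_top, "cluster": i}))
# concatenate once at the end instead of growing an empty DataFrame in the loop
top = pd.concat(top_frames, ignore_index=True)
top.head(5)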

`sc.pl.dotplot()` provides a convenient and visually appealing way to display expression patterns of top marker genes across clusters using a dot plot.

Note: `swap_axes=True` rotates the plot by 90 degrees (genes on the y-axis),
similar to the layout of [Clustered_DotPlot()](https://samuel-marsh.github.io/scCustomize/reference/Clustered_DotPlot.html).

In [None]:
starts = np.arange(0, top.shape[0], 5)
ranges = [(s, s + 4) for s in starts]
sc.pl.dotplot(
    adata, top["gene"], "leiden",
    swap_axes=True,
    var_group_positions=ranges,
    var_group_labels=adata.obs["leiden"].cat.categories.tolist()
)

## Additional Spatial Visualisations

The resolution of in situ datasets is typically very high and so it can be difficult to visualise everything in one plot. Below, we will explore different visualisations that can help unpick and understand the data a bit better.

To better visualise the spatial distribution of clusters, it can sometimes be useful to subset only certain groups to reduce crowding. Here, we visualise only two selected clusters.

In [None]:
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="leiden",
    groups=["0", "1"]
)

Sometimes, it can be useful to create additional fields of view of the data - for example, zooms of specific regions. First, let’s look at the coordinate system by plotting the data and turning on the plotting of the axes, which are off by default to create nicer looking plots.

This gives us a rough idea on where in the coordinate system to create any subsets or zooms of the data.

For example, if we want to zoom in on the follicle in the top right corner, we can see that it lies roughly between 4000 and 5000 on the x-axis and between -9000 and -8000 on the y-axis.

In [None]:
x_min = round(adata.obsm["spatial"][:, 0].min() / 1000)
x_max = round(adata.obsm["spatial"][:, 0].max() / 1000) + 1
y_min = round(adata.obsm["spatial"][:, 1].min() / 1000)
y_max = round(adata.obsm["spatial"][:, 1].max() / 1000) + 1
ax = sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="leiden",
    return_ax=True,
)
ax.set_xticks(np.arange(x_min * 1000, x_max * 1000, 1000))
ax.set_yticks(np.arange(y_min * 1000, y_max * 1000, 1000))

We can then use the `crop_coord` argument, specifying the `(xmin, ymin, xmax, ymax)` coordinates of the region of interest.

In [None]:
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color="leiden",
    crop_coord=(4000, -9000, 5000, -8000)
)

As we are zooming in closer to the tissue, we can also switch from plotting cell centroids (i.e. dots), which is the default, to visualising cell segmentation boundaries.
Plotting cell boundary polygons for large FOVs can be quite time consuming, and doesn’t provide much more detail on a fully zoomed-out view.

For this, we need the coordinates of the cell boundaries and a fair amount of custom code to format and plot the cell boundaries.

First, we'll make a copy of the `adata` object for the region of interest, which we'll use to identify the cells to display in the segmentation plot.

In [None]:
spatial = adata.obsm['spatial']

xmin, xmax = 4200, 5000
ymin, ymax = -8800, -8000

mask = (
    (spatial[:,0] > xmin) & (spatial[:,0] < xmax) &
    (spatial[:,1] > ymin) & (spatial[:,1] < ymax)
)

adata_cropped = adata[mask].copy()
adata_cropped

Next, we'll create a dictionary where each key is a cell identifier and the corresponding value is a `Polygon` object.

In [None]:
from shapely import Polygon

def create_cell_polygons(
    cell_polygons_df: pd.DataFrame,
):
    cell_shapes = {}
    
    for cell_id, group in cell_polygons_df.groupby('cell_id'):
        cell_points = group[['vertex_x','vertex_y']].values
        if len(cell_points) > 2:
            cell_shapes[cell_id] = Polygon(cell_points)
    return cell_shapes
   
cell_shapes = create_cell_polygons(
    cell_polygons
)

For instance, here is how a given cell can be accessed, and how it looks.

In [None]:
cell_shapes['adcbkhka-1']

We can then create a `GeoDataFrame` that contains only the subset of cells within the selected field of view.

Note: the filtering of cells is based on the cell centroid. Some cells that partially overlap the field of view will not be visible due to that filter.

In [None]:
import geopandas as gpd

gdf = gpd.GeoDataFrame(
    {"cell_id": list(cell_shapes.keys()),
     "geometry": list(cell_shapes.values())}
).set_index("cell_id")

common_ids = gdf.index.intersection(adata_cropped.obs_names)
gdf = gdf.loc[common_ids]
gdf
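If cells that only partially overlap the field of view should also be shown, one option is to select cells by their boundary vertices instead of their centroids. The sketch below is an approximation under stated assumptions: it keeps any cell with at least one vertex inside the box, works on a toy table in the same long format as `cell_polygons`, and the helper name and cell IDs are made up.

```python
import pandas as pd

def cells_with_vertex_in_box(polygons_df, xmin, xmax, ymin, ymax):
    """IDs of cells with at least one boundary vertex inside the box.
    This approximates overlap: a polygon that fully encloses the box,
    or only crosses it between vertices, would be missed."""
    inside = (
        polygons_df["vertex_x"].between(xmin, xmax)
        & polygons_df["vertex_y"].between(ymin, ymax)
    )
    return polygons_df.loc[inside, "cell_id"].unique().tolist()

# Toy boundary table in the same long format as `cell_polygons`; IDs are made up.
toy_polygons = pd.DataFrame({
    "cell_id":  ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "vertex_x": [1, 2, 1.5,   9.5, 12, 11,   20, 21, 20],
    "vertex_y": [1, 1, 2,     5, 5, 6,       5, 5, 6],
})

# Cell "b" straddles the right edge (x = 10): its mean vertex position lies
# outside the box, but one of its vertices lies inside.
overlapping = cells_with_vertex_in_box(toy_polygons, xmin=0, xmax=10, ymin=0, ymax=10)
print(overlapping)
```

With the coordinates used in this notebook, the polygon y-values are not negated while the centroids in `adata.obsm["spatial"]` are, so the y-limits passed to such a helper would need to be `-ymax` and `-ymin`.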

We can then plot the segmented field of view with `gdf.plot()` and some more custom code to colour by cluster.

In [None]:
fig, ax = plt.subplots(figsize=(8,8))

gdf.plot(ax=ax, edgecolor='black', linewidth=1)
gdf2 = gdf.join(adata_cropped.obs[['leiden']])
gdf2.plot(ax=ax, column='leiden', cmap='hsv', alpha=0.6, legend=True)

ax.set_facecolor("black")
ax.set_aspect('equal')
plt.show()

We can visualise gene expression or other continuous variables on the new FOV as before.

For example, here we have MS4A1/CD20 expression, which is a B-Cell marker. We can see it quite nicely limited to the lymphoid follicle.

In [None]:
feature = "MS4A1"

feature_series = pd.Series(
    data=np.asarray(adata_cropped[:, feature].X.todense()).squeeze(),
    index=adata_cropped.obs_names.astype(str),
    name=feature
)

fig, ax = plt.subplots(figsize=(8,8))

gdf.plot(ax=ax, edgecolor='black', linewidth=1)
gdf2 = gdf.join(feature_series, how="left")
gdf2.plot(ax=ax, column=feature, cmap='viridis', alpha=0.6, legend=True)

ax.set_title(feature)
ax.set_facecolor("black")
ax.set_aspect('equal')
plt.show()

We can also overlay the coordinates of individual molecules on the plot. For example, here we have added some more T-cell and B-cell specific markers.

This visualisation can be useful because molecules are stored independently of cells and cell boundaries.
Therefore, if there are regions where cell segmentation is not good, or if cells were filtered out from clustering analysis due to their low quality, the molecules will remain and can still be visualised this way.

For example, here we can see there are a few molecules of CXCR5 detected outside of cellular boundaries.

In [None]:
from matplotlib import cm

feature = "MS4A1"
molecules = ["CXCR5", "FOXP3"]

feature_series = pd.Series(
    data=np.asarray(adata_cropped[:, feature].X.todense()).squeeze(),
    index=adata_cropped.obs_names.astype(str),
    name=feature
)

transcripts_df = transcripts[transcripts['feature_name'].isin(molecules)].copy()
# NOTE: we take the opposite of ymin and ymax due to the inverted y-axis
mask = (
    (transcripts_df["x_location"] > xmin) & (transcripts_df["x_location"] < xmax) &
    (transcripts_df["y_location"] > -ymax) & (transcripts_df["y_location"] < -ymin)
)
transcripts_df = transcripts_df[mask]

cmap_points = plt.colormaps.get_cmap("Set1")
gene_to_color = {g: cmap_points(i % 8) for i, g in enumerate(molecules)}

fig, ax = plt.subplots(figsize=(8,8))

gdf.plot(ax=ax, edgecolor='black', linewidth=1)
gdf2 = gdf.join(feature_series, how="left")
gdf2.plot(ax=ax, column=feature, cmap='viridis', alpha=0.6, legend=True)

for g in molecules:
    sub = transcripts_df[transcripts_df['feature_name'] == g]
    ax.scatter(sub['x_location'].values, sub['y_location'].values,
               s=15, alpha=0.8,
               c=[gene_to_color[g]], label=g, linewidths=0, zorder=3)

ax.legend(title="molecules", fontsize="small")
ax.set_title(feature)
ax.set_facecolor("black")
ax.set_aspect('equal')
plt.show()
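To put a number on how many molecules fall outside segmented cells, one could tabulate unassigned transcripts per gene. The sketch below runs on a toy table and rests on an assumption that is not confirmed in this tutorial: that transcripts outside any cell are marked with the label `"UNASSIGNED"` in a `cell_id` column. Check the values in your own `transcripts` table before relying on it.

```python
import pandas as pd

# Toy stand-in for the `transcripts` table. ASSUMPTION: transcripts outside any
# segmented cell carry the label "UNASSIGNED" in `cell_id`; check your own file.
toy_transcripts = pd.DataFrame({
    "feature_name": ["CXCR5", "CXCR5", "CXCR5", "FOXP3", "FOXP3", "MS4A1"],
    "cell_id": ["cell_1", "UNASSIGNED", "cell_2", "cell_2", "UNASSIGNED", "cell_1"],
})

UNASSIGNED_LABEL = "UNASSIGNED"  # assumed placeholder value, not confirmed here

outside = toy_transcripts["cell_id"] == UNASSIGNED_LABEL
per_gene = (
    outside.groupby(toy_transcripts["feature_name"])
    .agg(n_transcripts="size", n_outside_cells="sum")
)
per_gene["fraction_outside"] = per_gene["n_outside_cells"] / per_gene["n_transcripts"]
print(per_gene)
```

A gene with an unusually high fraction of molecules outside cells can point to regions where segmentation is poor.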