# Tutorial 'Unsupervised QC and segmentation-free analysis of spot-based transcriptomics data'

<span style="text-decoration: underline">Author: Sebastian Tiesmeyer (sebastian.tiesmeyer@bih-charite.de)</span>

*Affiliation: Computational Oncology group (Dr. Naveed Ishaque), Digital Health Center (Prof. Roland Eils), BIH @ Charité Hospital, Germany.*

This tutorial showcases the ISS (in situ sequencing) analysis workflow of our group, which follows a segmentation-free approach. It uses a Python package called 'plankton.py', in which I collected functionality from past analysis projects for others to use. It is inspired by the *squidpy* package, but targets the topographical analysis of spot-based data specifically.

## Learning objectives:

After completing this tutorial, you will be able to:

1) Load and investigate spot-based spatial transcriptomics data in plankton.py

2) Perform supervised cell type annotation based on the SSAM algorithm in plankton.py

## Configure python/jupyter

In [None]:

# widens the screen:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

# imports:
import sys
import os
import plankton.plankton as pl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# convenience function to create new figures:
def figure(width=8,height=8):
    plt.figure(figsize=(width,height))


## 1) Load and investigate spot-based spatial transcriptomics data in plankton

#### 1.0 load mRNA coordinate data (developing lung ISS)

Categorical spot-based data, like the ISS output, is typically stored as x/y coordinates alongside a class label.
We created such a data set during the session 'ISS_decoding'. We can now import the data for further analysis:

In [None]:
# Data folder location:
data_root = '../data/in_situ_sequencing'
assert os.path.exists(data_root)

# Define um_p_px parameter for the data set coordinates:
um_p_px = 0.325

# Read coordinate/gene data from .csv file
coordinates = pd.read_csv(os.path.join(data_root,'S2T1_pcw6.csv'))

# Extract x,y coordinates and gene labels 
x =  coordinates.Global_x_pos.values 
y =  coordinates.Global_y_pos.values 
g =  coordinates.Gene.values
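
To make the expected input format concrete, here is a minimal sketch with a tiny, invented table. Only the column names (`Gene`, `Global_x_pos`, `Global_y_pos`) and the value of `um_p_px` are taken from the cell above; the rows themselves are made up for illustration.

```python
import io
import pandas as pd

# Invented example rows; only the column names follow the tutorial's CSV.
csv_text = """Gene,Global_x_pos,Global_y_pos
HGF,1000,2000
WNT2,1500,400
HGF,200,800
"""

um_p_px = 0.325  # micrometres per pixel, as defined above

spots = pd.read_csv(io.StringIO(csv_text))

# Convert pixel coordinates to micrometres:
spots["x_um"] = spots.Global_x_pos * um_p_px
spots["y_um"] = spots.Global_y_pos * um_p_px

print(spots)
```

Each row is one detected molecule: a class label (the gene) plus its position.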


#### 1.1 create and investigate the sdata object

You can now create a `plankton.SpatialData` frame. It is a subclass of `pandas.DataFrame` and inherits all of its properties. It adds some functionality, though, and indexing works differently: it operates primarily along the vertical axis (per molecule).

The raw sdata object contains the columns 'x'/'y' for the spatial coordinates, 'g' for the corresponding gene labels, and 'gene_id', which holds an integer index identifying each molecule's gene:

In [None]:
# Create a plankton-SpatialData object with the coordinates:

sdata = pl.SpatialData(
                        coordinates.Gene,
                        coordinates.Global_x_pos*um_p_px,
                        coordinates.Global_y_pos*um_p_px,
                        )

# display data in the notebook:
sdata

`gene_id` is mirrored in the gene-centric `stats` property of the sdata set. `[sdata.gene_id]` can be used as an index to project gene-centric data onto the individual coordinates. 

In [None]:
# inspect basic statistics at gene level:

sdata.stats
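
The idea of using a per-molecule gene index to project gene-level values onto molecules can be illustrated with plain NumPy. This is a sketch of the indexing pattern only, not of plankton's internals; the gene labels are invented and the variable names are chosen for this example.

```python
import numpy as np

# Invented gene labels, one entry per detected molecule:
genes = np.array(["HGF", "WNT2", "HGF", "SOX2", "HGF", "WNT2"])

# unique_genes: sorted gene names (one entry per gene, like a gene-level table)
# gene_ids:     integer index into unique_genes for every molecule
# counts:       number of molecules per gene
unique_genes, gene_ids, counts = np.unique(
    genes, return_inverse=True, return_counts=True
)

# Project the gene-level counts back onto the individual molecules:
counts_per_molecule = counts[gene_ids]

print(unique_genes)         # ['HGF' 'SOX2' 'WNT2']
print(gene_ids)             # [0 2 0 1 0 2]
print(counts_per_molecule)  # [3 2 3 1 3 2]

# The same indexing turns a gene-level mask into a molecule-level mask:
rare_gene = counts < 2
print(genes[rare_gene[gene_ids]])  # ['SOX2']
```

Any array with one entry per gene (counts, colors, boolean masks) can be expanded to one entry per molecule this way.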

#### 1.2 create basic sdata plots

The individual columns are an instance of `pandas.Series` and inherit its body of functions.

Among those are sorting and plotting functions:

In [None]:
# Plot a bar graph of the gene counts in the data set:

figure(22,5)
sdata.counts.sort_values().plot.bar()

sdata also has a number of plotting functions to ease data interpretation. `sdata.plot_overview` shows a bar graph of the gene counts alongside the spatial distribution of genes at the 0th, 33rd, 66th and 100th percentiles.

In [None]:
# Plot data set overview/summary:

sdata.plot_overview()

plt.gcf().set_size_inches(17.5, 9.5)

Another useful visualization tool is the `scatter` function, which accepts the same arguments as `pyplot.scatter()`.

Here, the entire data set is plotted with an alpha (transparency) value of 0.2:

In [None]:
# use the 'scatter' function to get familiar with the data set:

figure(10,10)

sdata.scatter(alpha=0.2)

#### 1.3 adding a pixel map (e.g. DAPI stain image) as background image:

For reference during analysis and for plotting, it can be useful to have a background image. The next cell loads a DAPI image and converts it to grayscale.

We can then create a plankton.PixelMap object to integrate the spot-based molecule coordinates and the pixel/grid based image data:

In [None]:
# Load staining image as .jpg:

figure(7,14)
bg = -plt.imread(os.path.join(data_root,'background.jpg')).mean(-1)
bg = (bg-bg.min())/(bg.max()-bg.min())

plt.subplot(121)

plt.title('original')

plt.imshow(bg,cmap='Greys')

# Create PixelMap

bg_map = pl.PixelMap(pixel_data=bg,
                     cmap='Greys',
                     px_p_um = 0.504/um_p_px)

plt.subplot(122)

plt.title('PixelMap with affine transform (rescaled)')
bg_map.imshow()
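
The image preparation and the link between micrometre coordinates and image pixels can be sketched with NumPy alone. The tiny image, the scale of two pixels per micrometre and the helper `pixel_at` are all invented for this illustration; how `plankton.PixelMap` handles the transform internally is not shown here.

```python
import numpy as np

# Invented 2x3 RGB "image" with 8-bit intensities:
rgb = np.array([[[0, 0, 0], [128, 128, 128], [255, 255, 255]],
                [[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=float)

# Average the colour channels and invert, as in the cell above:
gray = -rgb.mean(-1)

# Min-max rescale to the range [0, 1]:
gray = (gray - gray.min()) / (gray.max() - gray.min())

# Hypothetical helper: look up the pixel under a point given in micrometres,
# assuming the image origin coincides with the coordinate origin.
def pixel_at(image, x_um, y_um, px_per_um):
    col = int(x_um * px_per_um)
    row = int(y_um * px_per_um)
    return image[row, col]

px_per_um = 2.0  # invented scale: two pixels per micrometre
print(gray)
print(pixel_at(gray, x_um=1.2, y_um=0.4, px_per_um=px_per_um))
```

A scale factor like this is what lets molecule coordinates and a pixel grid of different resolution share one plot.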


... and feed the pixel map to sdata during creation:

In [None]:
sdata = pl.SpatialData(
                        coordinates.Gene,
                        coordinates.Global_x_pos*um_p_px,
                        coordinates.Global_y_pos*um_p_px,
                        pixel_maps={'DAPI':bg_map}
                        )

The scatter plots now automatically contain the DAPI stain as a plot background:

In [None]:
plt.figure(figsize=(19,8))

plt.subplot(1,3,1)
plt.title('coordinates')
plt.scatter(*sdata.coordinates[:,:].T*np.array([[1],[-1]]),c=sdata.var.c_genes[sdata.gene_ids],marker='.',alpha=0.1)

plt.subplot(1,3,2)
plt.title('plankton')
sdata.scatter(alpha=0.1)

ax=plt.subplot(1,3,3)
plt.title('DAPI stain')
bg_map.imshow(axd=ax)

#### 1.4 basic data subsetting functionality:

sdata supports different ways of data subsetting. Data is subset along the vertical axis first (per molecule).

Standard Python/NumPy slicing is supported, as is masking with boolean arrays. `sdata.spatial` opens up a spatial view of the data, in which the same slicing notation crops the data in the spatial domain:

In [None]:
plt.figure(figsize=(20,7))


# Slice using array notation:
plt.subplot(1,4,1)
plt.title('subsampled by 200:')
sdata[::200].scatter()

# Subsample using boolean mask:
plt.subplot(1,4,2)
plt.title('subsampled for HGF,WNT2:')
sdata[sdata.g.isin(['HGF','WNT2'])].scatter(legend=True)

# Crop using spatial view:
plt.subplot(1,4,3)
plt.title('subsampled in space:')
sdata.spatial[100:2800,1000:].scatter(alpha=0.1)
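
For comparison, the three kinds of subsetting can be written out on a plain `pandas.DataFrame`. This is a sketch on invented data, not plankton code. In particular, the axis order of `sdata.spatial[...]` (x first, then y) and the treatment of the window borders are assumptions here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Invented molecule table with the same kind of columns as sdata:
molecules = pd.DataFrame({
    "g": rng.choice(["HGF", "WNT2", "SOX2"], size=n),
    "x": rng.uniform(0, 3000, size=n),
    "y": rng.uniform(0, 3000, size=n),
})

# 1) Positional slicing: keep every 200th molecule
every_200th = molecules.iloc[::200]

# 2) Boolean mask: keep molecules of selected genes
selected = molecules[molecules.g.isin(["HGF", "WNT2"])]

# 3) Spatial crop written as an explicit mask
#    (assumed window: 100 <= x <= 2800 and y >= 1000)
in_window = molecules.x.between(100, 2800) & (molecules.y >= 1000)
cropped = molecules[in_window]

print(len(every_200th), len(selected), len(cropped))
```

The spatial view saves writing the coordinate comparisons by hand.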


#### 1.5 advanced data subsetting functionality:

Use `sdata.counts` and `sdata.gene_ids` indexing to plot all genes that occur fewer than 200 times in the sample:

In [None]:
figure(9,9)

low_count_gene_mask = (sdata.counts<200)

sdata[low_count_gene_mask[sdata.gene_ids]].scatter(marker='x',legend=True)

## 2) Supervised analysis:

Larger projects often follow a multi-omics approach in which single-cell expression data is also available. We can use this external data for quality control and for supervised analysis. For this, we need a signature matrix that contains an affinity indicator for every gene-celltype combination. We will create these signatures from the available cell-wise molecule count data.

#### 2.0 load single-cell-RNAseq-derived gene count matrix

*Sergio Salas* provided me with an annotated celltype-gene count matrix for developing bronchial tissue. It should be similar to the data derived from tomorrow's scanpy/scRNAseq workshop, only after a biologist has looked at the data and assigned cell type labels to the detected clusters.

In [None]:
import anndata

# Read provided count table:
cellwise_counts = pd.read_csv(os.path.join(data_root,'S2T1_pcw6_complex_celltypes_formatted.csv'),index_col=0)

# Group cell-subtypes for ease of interpretability:
celltypes = cellwise_counts['cell type'].values
for i,c in enumerate(celltypes):
    if c[-1].isdigit():
        celltypes[i]=c[:-2]

# create scanpy/AnnData object from molecule count matrix, 
# containing 7997 cells and 141 genes:
adata = anndata.AnnData(X = cellwise_counts.iloc[:,78:],)

# add celltype labels to the individual cells
adata.obs['celltype'] = celltypes

AnnData-formatted single-cell expression data can be passed to the `sdata` object during initialization. This adds a property `sdata.scanpy` to sdata, which is synchronized automatically and enables supervised data analysis:

In [None]:
# create new sdata object with added single-cell data:
sdata = pl.SpatialData(
                        genes=g,
                        x_coordinates=x*um_p_px,
                        y_coordinates=y*um_p_px,
                        pixel_maps={'DAPI':bg_map},
                        scanpy=adata
                        ).clean()

# show signature matrix
sdata.scanpy

#### 2.1 modelling tissue with SSAM:

Our lab has developed a segmentation-free, unsupervised celltype calling algorithm called SSAM (Park et al., 2021, https://www.nature.com/articles/s41467-021-23807-4). It creates a celltype map by spatially integrating gene signal via kernel density estimation (KDE) and performing linear correlation analysis against the signatures.

As a first step, we generate a signature matrix that contains one gene expression signature per cell type. In our case, each signature is the mean expression across all cells of that type. The signature matrix is then normalized across columns and rows.

In [None]:
# Generate signature matrix
signatures = sdata.scanpy.generate_signatures()

signatures
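
The construction of such a signature matrix can be sketched with pandas. The count matrix and the cell type names below are invented, and the normalization shown (each gene column scaled to sum to 1, then each cell type row scaled to sum to 1) is only one possible reading of "normalized across columns and rows"; it is not necessarily what `generate_signatures()` does.

```python
import numpy as np
import pandas as pd

# Invented cell-by-gene count matrix with a cell type label per cell:
counts = pd.DataFrame(
    [[10, 0, 2], [8, 1, 0], [0, 6, 3], [1, 9, 5]],
    columns=["HGF", "WNT2", "SOX2"],
)
celltype = pd.Series(["Fibroblast", "Fibroblast", "Epithelial", "Epithelial"])

# Mean expression per cell type (rows: cell types, columns: genes):
mean_expression = counts.groupby(celltype.values).mean()

# One possible normalisation (an assumption, not necessarily plankton's):
# scale each gene column to sum to 1, then each cell type row to sum to 1.
signatures_demo = mean_expression / mean_expression.sum(axis=0)
signatures_demo = signatures_demo.div(signatures_demo.sum(axis=1), axis=0)

print(signatures_demo.round(3))
```

The result has one row per cell type and one column per gene, which is the layout the rest of this section relies on (`signatures.index` holds the cell type names).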

In [None]:
from plankton.utils import ssam

# Create a celltype map using the ssam algorithm:

kernel_bandwidth = 5   # Bandwidth for the Gaussian KDE smoothing kernel
patch_length = 2000     # length of the individual data batches 
threshold_corr = 0.1    # Threshold for expression-signature correlation 
threshold_exp = 0.3    # Threshold for total signal norm

ctmap = ssam(sdata,signatures=signatures,kernel_bandwidth=kernel_bandwidth,
            patch_length=patch_length,threshold_cor=threshold_corr,threshold_exp=threshold_exp)
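
To give an intuition for what happens inside this call, here is a strongly simplified NumPy sketch of the principle: smooth the molecule counts per gene with a Gaussian kernel, correlate each pixel's expression vector with every signature, and keep the best match where correlation and total signal exceed the thresholds. It is not the SSAM or plankton implementation (there is no patching, for example), and the function names and data are invented for this illustration.

```python
import numpy as np

def gaussian_kernel(bandwidth):
    """1-D Gaussian kernel, truncated at three bandwidths and normalised."""
    radius = int(3 * bandwidth)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / bandwidth) ** 2)
    return kernel / kernel.sum()

def smooth(grid, bandwidth):
    """Separable Gaussian smoothing; the grid must be larger than the kernel."""
    kernel = gaussian_kernel(bandwidth)
    grid = np.apply_along_axis(np.convolve, 0, grid, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 1, grid, kernel, mode="same")

def celltype_map_sketch(x, y, gene_ids, signatures, shape,
                        bandwidth, threshold_cor, threshold_exp):
    n_genes = signatures.shape[1]

    # 1) Smoothed molecule density per gene, shape (n_genes, height, width)
    density = np.zeros((n_genes,) + shape)
    np.add.at(density, (gene_ids, y.astype(int), x.astype(int)), 1)
    density = np.stack([smooth(layer, bandwidth) for layer in density])

    # 2) Pearson correlation between each pixel's expression vector
    #    and each signature
    pixels = density.reshape(n_genes, -1).T              # (n_pixels, n_genes)
    pixels_c = pixels - pixels.mean(axis=1, keepdims=True)
    sigs_c = signatures - signatures.mean(axis=1, keepdims=True)
    norms = (np.linalg.norm(pixels_c, axis=1)[:, None]
             * np.linalg.norm(sigs_c, axis=1)[None, :])
    corr = pixels_c @ sigs_c.T / np.maximum(norms, 1e-12)

    # 3) Best-matching signature per pixel; -1 where the match or the
    #    total signal is too weak
    labels = corr.argmax(axis=1)
    weak = (corr.max(axis=1) < threshold_cor) | (
        np.linalg.norm(pixels, axis=1) < threshold_exp)
    labels[weak] = -1
    return labels.reshape(shape)

# Invented data: gene 0 clusters around (10, 10), gene 1 around (30, 30)
rng = np.random.default_rng(0)
x = np.concatenate([10 + rng.uniform(-3, 3, 50), 30 + rng.uniform(-3, 3, 50)])
y = np.concatenate([10 + rng.uniform(-3, 3, 50), 30 + rng.uniform(-3, 3, 50)])
gene_ids = np.repeat([0, 1], 50)

# Invented signatures: rows are cell types, columns are three genes
signatures_demo = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0]])

ctmap_demo = celltype_map_sketch(x, y, gene_ids, signatures_demo, shape=(40, 40),
                                 bandwidth=2, threshold_cor=0.1, threshold_exp=0.1)
print(ctmap_demo[10, 10], ctmap_demo[30, 30], ctmap_demo[0, 39])
```

The two cluster centres receive the index of the matching signature, while the empty corner of the grid stays at -1.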

`ctmap.get_value()` can be used to extract the map's value at given coordinate points:

In [None]:
# sample the map's values at all molecule locations:
values_at_xy = ctmap.get_value(sdata.x,sdata.y)

# assign tissue label to sampled values:
celltype_labels = np.array(signatures.index)[values_at_xy]
# positions with map value -1 are labelled 'other':
celltype_labels[values_at_xy==-1]='other'

# add 'celltype' annotation to each molecule of the sdata frame:
sdata['celltype']= celltype_labels
sdata['celltype'] = sdata.celltype.astype('category')

sdata
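
The label lookup in the cell above relies on NumPy integer indexing, which has one subtlety worth spelling out: an index of -1 selects the last entry of the name array, so those positions have to be overwritten explicitly. A small sketch with invented map values (the three cell type names are ones that appear later in this tutorial):

```python
import numpy as np

celltype_names = np.array(["Airway fibroblast", "Erythrocyte", "Mesothelial"],
                          dtype=object)

# Invented map values sampled at five molecule positions; -1 = no cell type
values = np.array([0, 2, -1, 1, -1])

# Plain indexing maps -1 to the LAST name, so those entries must be
# overwritten afterwards:
names = celltype_names[values]
names[values == -1] = "other"

print(names)
```

Without the second step, the two molecules with value -1 would silently be labelled 'Mesothelial'.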

#### 2.2 explore cell type map

The celltype map is a plankton.PixelMap object and can be plotted accordingly:

In [None]:
# Colored scatter points to create the legend:
labels = sdata.celltype.cat.categories
# plt.get_cmap is used because matplotlib.cm.get_cmap was removed in matplotlib 3.9:
cm = plt.get_cmap('nipy_spectral')
tissue_colors = [cm((i+1)/(len(labels)-1)) for i in range(len(labels)-1)]


# Show celltype map:

figure(15,20)


ctmap.imshow(cmap='nipy_spectral',interpolation='none')

handles = [plt.scatter([],[],color=tissue_colors[i]) for i in range(len(labels)-1)]
plt.legend(handles,labels,)

#### 2.3 explore individual cell type distributions

In [None]:
figure(25,25)

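# 'celltype_name' avoids overwriting the gene label array 'g' defined earlier:
for i,celltype_name in enumerate(signatures.index):
    
    plt.subplot(7,7,i+1)
    
    plt.title(celltype_name)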
    
    (ctmap==i).imshow(cmap='Reds')

#### 2.4 perform neighborhood enrichment analysis through integration with squidpy

The excellent squidpy implementations of spatial statistics methods can be used to identify spatial neighborhood enrichment effects among the molecules:

In [None]:
import squidpy as sq

sq.gr.spatial_neighbors(sdata, key_added='spatial')
sq.gr.nhood_enrichment(sdata,'celltype')

In [None]:
sq.pl.nhood_enrichment(sdata,'celltype')
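
The principle behind a neighborhood enrichment score can be sketched in NumPy as a permutation test: count how often each pair of labels occurs among spatial neighbors, repeat the count with shuffled labels, and express the observed count as a z-score. This is a simplified illustration on invented data with a brute-force nearest-neighbor search; it is not squidpy's implementation, and the function name and defaults are made up.

```python
import numpy as np

def nhood_enrichment_demo(xy, labels, n_classes, k=6, n_perms=200, seed=0):
    rng = np.random.default_rng(seed)

    # k nearest neighbours by brute force (fine for small examples only)
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]        # (n_points, k)

    def pair_counts(lab):
        counts = np.zeros((n_classes, n_classes))
        # one count per (point label, neighbour label) edge
        np.add.at(counts, (lab[:, None].repeat(k, 1), lab[neighbours]), 1)
        return counts

    observed = pair_counts(labels)
    permuted = np.stack([pair_counts(rng.permutation(labels))
                         for _ in range(n_perms)])

    # z-score of the observed counts against the shuffled-label counts
    return (observed - permuted.mean(0)) / permuted.std(0)

# Invented data: two well-separated clusters, one label each
rng = np.random.default_rng(1)
xy = np.concatenate([rng.uniform(-5, 5, (30, 2)),
                     100 + rng.uniform(-5, 5, (30, 2))])
labels_demo = np.repeat([0, 1], 30)

zscores = nhood_enrichment_demo(xy, labels_demo, n_classes=2)
print(zscores.round(1))
```

Positive values on the diagonal indicate that molecules of a class sit next to their own kind more often than expected by chance; negative off-diagonal values indicate avoidance.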

#### 2.5 perform spatial co-occurrence analysis on cell type distributions

Also, sdata comes with a dedicated function to compute co-occurrence in space amongst the different molecule classes:

In [None]:
# compute co-occurrence indicator for each class-class-pair:
cooc = sdata.stats.co_occurrence(resolution=5,max_radius=200,linear_steps=40,category='celltype')
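
As a rough illustration of what a co-occurrence array holds, the sketch below counts, for every pair of classes, how many molecule pairs fall into each distance bin. The exact definition and normalization used by plankton may differ; the function, its parameters and the data are invented, and only the layout (class, class, distance step) follows the way `cooc` is indexed in the cells of this section.

```python
import numpy as np

def co_occurrence_demo(xy, labels, n_classes, max_radius, n_steps):
    """Count label pairs per distance bin: result[a, b, step]."""
    edges = np.linspace(0, max_radius, n_steps + 1)
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    result = np.zeros((n_classes, n_classes, n_steps))
    for a in range(n_classes):
        for b in range(n_classes):
            sub = dist[np.ix_(labels == a, labels == b)]
            if a == b:
                # drop the pairing of each molecule with itself
                sub = sub[~np.eye(len(sub), dtype=bool)]
            result[a, b], _ = np.histogram(sub, bins=edges)
    return result

# Invented data: class 0 forms a tight cluster, class 1 is dispersed
rng = np.random.default_rng(2)
xy = np.concatenate([50 + rng.uniform(-2, 2, (20, 2)),
                     rng.uniform(0, 100, (40, 2))])
labels_demo = np.repeat([0, 1], [20, 40])

cooc_demo = co_occurrence_demo(xy, labels_demo, n_classes=2,
                               max_radius=100, n_steps=10)

# .diagonal() moves the class axis last: shape (n_steps, n_classes)
autos_demo = cooc_demo.diagonal()
print(autos_demo.shape)
print(autos_demo[:, 0])
```

For the tight cluster, all same-class pairs land in the first distance bin, whereas the dispersed class spreads over many bins.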

As an example, the width of the auto-co-occurrence peak next to the center indicates the radius of the respective individual structures:

In [None]:
autos = cooc.diagonal()

figure(20,10)

for i,c in enumerate(tissue_colors):
    _=plt.plot(autos[:,i]/autos[0,i], c = c)

plt.legend(handles,labels,)

plt.xticks(np.arange(0,40,5),np.arange(0,40,5)*5)
plt.title('Auto-co-occurrence curves for all molecules, by SSAM-assigned cell types:')

These auto-co-occurrence curves reflect the different structural diameters of the different tissue type structures. For better interpretability, it helps to plot three prototypical cases individually:

In [None]:
figure(18,6)

genes_to_plot=['Airway fibroblast','Erythrocyte','Mesothelial']

plt.subplot(141)
plt.title('Auto co-occurrence')
plt.xticks(np.arange(0,40,5),np.arange(0,40,5)*5)

for i,c in enumerate(genes_to_plot):
    
    tissue_index = np.where(signatures.index==c)[0][0]
    color = tissue_colors[tissue_index]
    
    plt.subplot(1,4,1)
    plt.plot(autos[:,tissue_index]/autos[0,tissue_index],color=color)
    
    plt.subplot(1,4,i+2)
    sdata[sdata.celltype==c].scatter(color = color)
    plt.title(c)

Transcripts assigned to 'Airway fibroblast' seem to have a higher chance of occurring close to other transcripts of the same type than transcripts of other cell types do. Transcripts assigned to 'Erythrocyte' and 'Mesothelial' are more isolated and hence have steeper auto-co-occurrence profiles.

#### 2.6 cross-co-occurrence against 'other' transcripts

Co-occurrence curves can also be computed between different molecule types. Co-occurrence with the 'other' category (in empty space) shows that 'Ciliated epithelial' and 'Mesothelial' cells seem to occur closer to the sample edge than 'Airway fibroblast' cells do.

In [None]:
figure(18,6)

genes_to_plot=['Airway fibroblast','Ciliated epithelial','Mesothelial']

plt.subplot(141)
plt.title('Co-occurrence with "other":')
plt.xticks(np.arange(0,40,5),np.arange(0,40,5)*5)

for i,c in enumerate(genes_to_plot):
    
    tissue_index = np.where(signatures.index==c)[0][0]
    color = tissue_colors[tissue_index]
    
    plt.subplot(1,4,1)
    plt.plot(cooc[-1,tissue_index]/autos[0,tissue_index],color=color)
    
    plt.subplot(1,4,i+2)
    sdata[sdata.celltype==c].scatter(color = color)
    plt.title(c)