The purpose of this tutorial is to segment individual ISS cells, predict their cell types, and impute the expression of genes that were not included in the original gene panel. For this we'll use a pre-trained segmentation algorithm to identify the cells and Tangram to assign cell types and impute gene expression.

This notebook was created by Sergio Marco (sergiomarco.salas@scilifelab.se) and was partially adapted from the Tangram tutorials, created by Tommaso Biancalani <biancalt@gene.com>.

## Loading the packages needed

The first step is to import the packages that will be used throughout the tutorial.

In [None]:
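import ISS_postprocessing
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import scipy as scp
import scipy.sparse  # makes sure scp.sparse is loaded for saving the mask
import seaborn as sns
import tangram as tg  # used below by pp_adatas, before the mapping section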

# Segment individual cells 

In previous steps, we decoded an ISS dataset, so at this point we should have a .csv file containing the location and identity of every read decoded in the tissue. Together with this, we have a reference DAPI staining (stored as a .tif image) that can be used to identify the nuclei of the cells present in the analyzed section. The first step is therefore to load the image and segment the nuclei so that we can identify individual cells.

In [7]:
impath='/media/sergio/Torfajokull/RAW_DATA_EMBO_COURSE/expression_and_DAPI/6_Base_1_stitched-1.tif'

In [8]:
im=plt.imread(impath)

We segment individual cells using Cellpose (with a pretrained segmentation model) and save the segmentation mask as a .npz file, in which the identity of every segmented cell is stored. Since a nucleus represents only the central part of a cell, not the entire cell, we expand the segmentation masks so that an area of n pixels around every detected nucleus is included as part of the cell.

In [None]:
coo_matrix=ISS_postprocessing.segmentation.cell_pose_segemenation_to_coo(im, 10, 10)

processing 1 image(s)


In [None]:
segmentation_path='/media/sergio/Torfajokull/RAW_DATA_EMBO_COURSE/expression_and_DAPI/'
scp.sparse.save_npz(segmentation_path+'stardist_segmentation_expanded.npz',coo_matrix)
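To make the mask expansion step more concrete, here is a minimal sketch on a toy label image. It is not the code used inside `ISS_postprocessing`; the helper `expand_labels` and the toy arrays are hypothetical, and the expansion is implemented with SciPy's Euclidean distance transform as one possible approach.

```python
import numpy as np
from scipy import ndimage as ndi
from scipy import sparse

def expand_labels(labels, distance):
    """Grow each labelled region by up to `distance` pixels without overlaps."""
    background = labels == 0
    # For every pixel: distance to, and coordinates of, the nearest labelled pixel.
    dist, nearest = ndi.distance_transform_edt(background, return_indices=True)
    grow = background & (dist <= distance)
    expanded = labels.copy()
    # Background pixels that are close enough inherit the nearest label.
    expanded[grow] = labels[nearest[0][grow], nearest[1][grow]]
    return expanded

# Toy nuclei mask (hypothetical): two single-pixel "nuclei" labelled 1 and 2.
toy_nuclei = np.zeros((7, 7), dtype=np.int32)
toy_nuclei[1, 1] = 1
toy_nuclei[5, 5] = 2

toy_cells = expand_labels(toy_nuclei, distance=1)
# A sparse COO matrix stores only the non-background pixels of the mask.
toy_coo = sparse.coo_matrix(toy_cells)
```

Storing the mask in a sparse format is convenient because most pixels of a tissue image are background, and the result can be written to disk with `scipy.sparse.save_npz`.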

After this, we'll combine the segmentation mask we've just obtained with the .csv file that contains the decoded spots, so that every individual spot can be assigned to the cell it belongs to, based on the segmentation mask. We will store the output of the segmentation in an annotated data object (anndata), keeping the expression of every cell as well as the position of its centroid.

In [None]:
adata_sp = ISS_postprocessing.annotated_objects.create_anndata_obj(spots_file = sample_path+'/decoded.csv', 
            segmentation_mask = segmentation_path+'stardist_segmentation_expanded.npz',#'cell_segmentation/cellpose_segmentation_expanded_2.npz' 
            output_file = sample_path+'/anndata_stardist.h5ad',
            filter_data=False, 
            metric = 'distance', 
            write_h5ad = True,
            value=  0.4,
            convert_coords = True, 
            conversion_factor = 1)

In [None]:
adata_sp.write(path+'segmented_ISS.h5ad')
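The idea of assigning spots to cells can be illustrated with a small NumPy sketch. This is an assumption about the general principle (looking up the mask value under each spot), not the implementation of `create_anndata_obj`; the function `count_spots_per_cell` and all toy data are hypothetical.

```python
import numpy as np

def count_spots_per_cell(mask, xs, ys, genes):
    """Build a cell-by-gene count matrix from spot coordinates and a label mask."""
    cell_ids = mask[ys, xs]              # label under each spot (0 = background)
    keep = cell_ids > 0                  # discard spots that fall outside any cell
    gene_names, gene_idx = np.unique(genes, return_inverse=True)
    cells = np.unique(cell_ids[keep])
    cell_idx = np.searchsorted(cells, cell_ids[keep])
    counts = np.zeros((cells.size, gene_names.size), dtype=int)
    # Accumulate one count per (cell, gene) pair, handling repeated pairs.
    np.add.at(counts, (cell_idx, gene_idx[keep]), 1)
    return cells, gene_names, counts

# Toy mask with two cells and four decoded spots (hypothetical values).
toy_mask = np.zeros((4, 4), dtype=int)
toy_mask[:2, :2] = 1
toy_mask[2:, 2:] = 2
xs = np.array([0, 1, 3, 0])
ys = np.array([0, 1, 3, 3])
genes = np.array(['A', 'A', 'B', 'A'])

cells, gene_names, counts = count_spots_per_cell(toy_mask, xs, ys, genes)
```

In this toy case the last spot lands on background and is dropped, so cell 1 gets two counts of gene `A` and cell 2 gets one count of gene `B`.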

# Reading scRNAseq dataset 

We first read the scRNAseq dataset that we will integrate with our spatial (ISS) dataset. It's important that both datasets represent the same tissue/patient, since these integration methods assume that the cell type composition in the spatial dataset and the scRNAseq dataset are comparable.

In [None]:
adata_sc=sc.read()

Here, we only do some light pre-processing: library size correction (in scanpy, via `sc.pp.normalize_total`), which normalizes the number of counts within each cell to a fixed number. Sometimes we apply more sophisticated pre-processing methods, for example for batch correction, although mapping also works well with raw data. Ideally, the single cell and spatial datasets should exhibit signals that are as similar as possible, and the pre-processing pipeline should aim to harmonize them.

In [None]:
# Use the objects created above: adata_sc (scRNAseq) and adata_sp (ISS)
sc.pp.normalize_total(adata_sc)
sc.pp.normalize_total(adata_sp)
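As a rough illustration of what library size correction does, the sketch below rescales each cell so that its counts add up to a common target. It is a simplified stand-in written in plain NumPy, not scanpy's code; it assumes that, when no target is given, the median of the non-zero cell totals is used.

```python
import numpy as np

def normalize_total(counts, target_sum=None):
    """Scale every row (cell) so that its counts sum to `target_sum`."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    if target_sum is None:
        # Assumed default: median total over cells with at least one count.
        target_sum = np.median(totals[totals > 0])
    # Cells without counts keep a scale of 0 to avoid dividing by zero.
    scale = np.divide(target_sum, totals, out=np.zeros_like(totals), where=totals > 0)
    return counts * scale[:, None]

toy_counts = np.array([[1, 1], [2, 6], [0, 0]])
toy_norm = normalize_total(toy_counts)
```

After normalization, the two non-empty toy cells have the same total, so differences between them reflect relative expression rather than sequencing or detection depth.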

It is a good idea to have annotations in the single cell data, as they will be projected on space after we map. In this case, cell types are annotated in the `subclass_label` field, for which we plot cell counts. 

In [None]:
adata_sc.obs['subclass_label'].value_counts()

Tangram learns a spatial alignment of the single cell data so that the gene expression of the aligned single cell data is as similar as possible to that of the spatial data. In doing this, Tangram only looks at a subset of genes, specified by the user, called the training genes.
- The choice of the training genes is a delicate step for mapping: they need to bear interesting signals and to be measured with high quality.
- For untargeted methods such as ST, a good start is to choose 100-1000 top marker genes, evenly stratified across cell types. In ISS experiments like this one, where genes were selected for being good markers of different cell types, we can use all the genes that show reasonable expression.

## Preparing the datasets

We now need to prepare the datasets for mapping: the two `AnnData` structures need to be subset on the list of training genes. First, we build a list of the genes present in the spatial anndata object that are also present in the scRNAseq anndata object.


In [None]:
# Keep the spatial genes that are also measured in the scRNAseq dataset
markers=np.unique(adata_sp.var.index[adata_sp.var.index.isin(adata_sc.var.index)])
len(markers)

Since we want to keep information about the expression of other genes detected in scRNAseq, we save this information in .raw

In [None]:
adata_sc.raw=adata_sc

Also, the gene order needs to be the same in both datasets. This is because Tangram maps using only gene expression, so the $j$-th column in each matrix must correspond to the same gene. In addition, if all entries of a gene are zero, that gene is removed. Both tasks are performed by the helper `pp_adatas`.

In [None]:
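# Note: this call assumes a Tangram version in which pp_adatas returns the two
# processed objects; some releases may instead modify the inputs in place.
ad_sc, ad_sp = tg.pp_adatas(adata_sc, adata_sp, genes=markers)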

We now check that both objects have the same gene names in the same order.

In [None]:
assert ad_sc.var.index.equals(ad_sp.var.index)
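The gene alignment described above can be sketched with plain NumPy arrays. This is only an illustration of the idea (shared genes, identical column order, no all-zero genes), not the code of `pp_adatas`; the helper `align_on_genes` and the toy matrices are hypothetical.

```python
import numpy as np

def align_on_genes(X_sc, genes_sc, X_sp, genes_sp, training_genes):
    """Subset both matrices to the shared training genes, in the same order."""
    shared = np.intersect1d(np.intersect1d(genes_sc, genes_sp), training_genes)
    sc_cols = [list(genes_sc).index(g) for g in shared]
    sp_cols = [list(genes_sp).index(g) for g in shared]
    X_sc, X_sp = X_sc[:, sc_cols], X_sp[:, sp_cols]
    # Drop genes whose entries are all zero in either dataset.
    nonzero = (X_sc.sum(axis=0) > 0) & (X_sp.sum(axis=0) > 0)
    return shared[nonzero], X_sc[:, nonzero], X_sp[:, nonzero]

genes_sc = ['EPCAM', 'MYH11', 'SOX2', 'ACTB']
genes_sp = ['MYH11', 'EPCAM', 'SOX2']          # different order, smaller panel
X_sc = np.array([[1, 0, 3, 5], [0, 2, 1, 5]])
X_sp = np.array([[4, 7, 0], [0, 1, 0]])        # SOX2 is all zero in space

genes, A_sc, A_sp = align_on_genes(X_sc, genes_sc, X_sp, genes_sp,
                                   training_genes=['EPCAM', 'MYH11', 'SOX2'])
```

After alignment, column $j$ of both toy matrices refers to the same gene, which is the property the `assert` in the cell above verifies on the real objects.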

Ideally, at this point we'd like to save the adata objects and potentially restart the kernel

In [None]:
ad_sc.write_h5ad('J:/HDCA_LUNG_Test3/Decoded_files/ad_sc_readytomap_pcw13PCISEQ.h5ad')
ad_sp.write_h5ad('J:/HDCA_LUNG_Test3/Decoded_files/ad_sp_readytomap_pcw13PCISEQ.h5ad')

# Mapping cells to space

In [None]:
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import torch
import tangram as tg

We now read the anndata objects that we saved in the previous step (only needed if the kernel was restarted).

In [None]:
ad_sp = sc.read_h5ad('J:/HDCA_LUNG_Test3/Decoded_files/ad_sp_readytomap_pcw13PCISEQ.h5ad')
ad_sc = sc.read_h5ad('J:/HDCA_LUNG_Test3/Decoded_files/ad_sc_readytomap_pcw13PCISEQ.h5ad')

We can now train the model (_ie_ map the single cell data onto space).
- Mapping should be interrupted after the score plateaus, which can be controlled by passing the `num_epochs` parameter.
- The score measures the similarity between the gene expression of the mapped cells and the spatial data: a higher score means a better mapping.
- Note that an excellent mapping can be obtained even if Tangram converges to a low score (the typical case is when the spatial data are very sparse): we use the score merely to assess convergence.
- If you are running Tangram with a GPU, uncomment `device='cuda:0'` and comment out the line `device='cpu'`. On a MacBook Pro 2018, it takes ~1h to run. On a P100 GPU it should be done in a few minutes.
- For this basic mapping, we do not use regularizers. More sophisticated loss functions are available in the Tangram library (refer to the manuscript or dive into the code).
- We can map at cluster-level single cell data instead of cell level data (refer to manuscript or dive into the code), which is faster and requires less memory. In this notebook, we are mapping at individual cell level.

In [None]:
ad_map = tg.map_cells_to_space(
    adata_cells=ad_sc,
    adata_space=ad_sp,
    device='cpu',
    # device='cuda:0',
    num_epochs=500,
)
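To give an intuition for the score mentioned in the list above, here is a simplified NumPy sketch. It assumes that the mapping is parameterized by a cell-by-spot matrix passed through a softmax over spots, and that the score is the mean cosine similarity between predicted and measured expression for each gene. It omits the optimizer and any regularizers, and the function names are hypothetical.

```python
import numpy as np

def softmax_rows(logits):
    """Turn each row of logits into probabilities that sum to 1."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mapping_score(logits, S, G):
    """Mean per-gene cosine similarity between mapped and measured expression."""
    M = softmax_rows(logits)      # cells x spots mapping probabilities
    G_pred = M.T @ S              # spots x genes: expression projected into space
    num = (G_pred * G).sum(axis=0)
    den = np.linalg.norm(G_pred, axis=0) * np.linalg.norm(G, axis=0)
    return float(np.mean(num / den))

S = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy scRNAseq: 2 cells x 2 genes
G = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy spatial data: 2 spots x 2 genes

score_uniform = mapping_score(np.zeros((2, 2)), S, G)    # uninformative mapping
score_good = mapping_score(50 * np.eye(2), S, G)         # each cell in its own spot
```

In this toy example the uninformative mapping scores about 0.71, while placing each cell in its matching spot scores close to 1, which is the direction training pushes the mapping towards.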

The mapping results are stored in the returned `AnnData` structure, saved as `ad_map`, structured as follows:

  - The cell-by-spot matrix `X` contains the probability of cell $i$ to be in spot $j$.
  - The `obs` dataframe contains the metadata of the single cells.
  - The `var` dataframe contains the metadata of the spatial data.
  - The `uns` dictionary contains a dataframe with various information about the training genes (saved as `train_genes_df`).

We can also plot some quality scores that give us an overview of the quality of our integration.

In [None]:
tg.plot_training_scores(ad_map, bins=50, alpha=.5)

- Although the above plots give us a summary of scores at the single-gene level, we also need to know _which_ genes are mapped with low scores.
- This information can be accessed from the dataframe `.uns['train_genes_df']` in the mapping results; this is the dataframe used to build the four plots above.

In [None]:
ad_map.uns['train_genes_df']

We can now save the mapping results for post-analysis.

In [None]:
ad_map.write_h5ad('C:/Users/sergio.salas/Documents/Jupyter_notebooks/ad_mapALLSUBSET_pcw13_pciseq.h5ad')

# Exploring the cell type annotation

In [None]:
ad_map=sc.read_h5ad('C:/Users/sergio.salas/Documents/Jupyter_notebooks/ad_mapALLSUBSET_pcw13_pciseq.h5ad')

Tangram can be used to project cell type annotations (from scRNAseq) into space by assigning each cell captured with ISS a probability of belonging to each cell type (i.e. cell typing). We can then plot the cell type identities in space to explore their location.

In [None]:
tg.plot_cell_annotation(ad_map, annotation='subclass_label', nrows=6, ncols=5)
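Conceptually, projecting annotations amounts to summing, for each spatial cell, the mapping probabilities of all scRNAseq cells of a given type. The sketch below shows this with a one-hot encoding of the labels; it is an illustration under that assumption, with a hypothetical helper name and toy values, and it does not reproduce any normalization Tangram may apply.

```python
import numpy as np

def project_annotations(M, labels):
    """Sum mapping probabilities per cell type for every spatial cell."""
    types = np.unique(labels)
    one_hot = (labels[:, None] == types[None, :]).astype(float)  # cells x types
    return types, M.T @ one_hot                                  # spots x types

# Toy mapping matrix: 3 scRNAseq cells (rows) x 2 ISS cells (columns).
M = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
labels = np.array(['epithelial', 'muscle', 'muscle'])

types, type_scores = project_annotations(M, labels)
best_type = types[type_scores.argmax(axis=1)]   # most supported type per ISS cell
```

Taking the type with the highest score for each ISS cell gives a hard cell type call, while the full score matrix keeps the uncertainty of the assignment.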

# Imputing the expression of genes into space

With Tangram, since the datasets from both modalities are integrated, we can impute the expression of genes that are not included in the ISS panel but are present in the scRNAseq dataset. This is done by averaging the expression of the scRNAseq cells found to be similar to each ISS cell after training. To obtain these expression patterns, we first recover the full scRNAseq expression matrix stored in `.raw` and then project it into space:

In [None]:
# .raw is an attribute, not a method; to_adata() converts it back to an AnnData object
ad_sc_raw=ad_sc.raw.to_adata()

In [None]:
ad_ge = tg.project_genes(adata_map=ad_map, adata_sc=ad_sc_raw)

With `project_genes` we generate a new anndata object containing the imputed expression of each scRNAseq gene in space. Remember to use the `.raw` expression from the `ad_sc` object, since it contains the expression of all genes instead of only the common ones.

It is convenient to compute the similarity scores of all genes, which can be done with `compare_spatial_geneexp`. This function accepts two spatial `AnnData`s (_ie_ voxel-by-gene) and returns a dataframe with similarity scores for all genes. Training genes are flagged by the Boolean field `is_training`.
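The imputation itself can be thought of as a weighted combination of scRNAseq profiles, with weights given by the mapping matrix. The sketch below assumes exactly that (mapping matrix transposed, multiplied by the full expression matrix) and uses hypothetical toy values; any extra scaling done by the library is left out.

```python
import numpy as np

# Toy mapping matrix: 3 scRNAseq cells (rows) x 2 ISS cells (columns).
M = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])

# Full scRNAseq expression: 3 cells x 2 genes, including genes outside the panel.
S_full = np.array([[10.0, 0.0],
                   [0.0, 4.0],
                   [2.0, 2.0]])

# Weighted sum of scRNAseq profiles for every ISS cell (ISS cells x genes).
imputed = M.T @ S_full

# Dividing by the total weight per ISS cell turns the sum into a weighted average.
imputed_avg = imputed / M.sum(axis=0)[:, None]
```

Because the weights come from the mapping, genes that were never measured in space inherit the spatial pattern of the scRNAseq cells that express them.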

In [None]:
df_all_genes = tg.compare_spatial_geneexp(ad_ge, ad_sp)
df_all_genes
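For readers who want to see what such a comparison involves, the following sketch computes a score and a sparsity value for each gene from two spot-by-gene matrices. It assumes the score is a cosine similarity and that sparsity is the fraction of zero entries in the measured data; the function name and toy matrices are hypothetical.

```python
import numpy as np

def per_gene_scores(G_pred, G_meas):
    """Cosine similarity and sparsity for every gene (column)."""
    num = (G_pred * G_meas).sum(axis=0)
    den = np.linalg.norm(G_pred, axis=0) * np.linalg.norm(G_meas, axis=0)
    score = num / den
    sparsity = (G_meas == 0).mean(axis=0)   # fraction of spots with zero counts
    return score, sparsity

# Toy data: 3 spatial cells x 2 genes.
G_meas = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 5.0]])
G_pred = np.array([[2.0, 1.0], [4.0, 1.0], [6.0, 1.0]])

score, sparsity = per_gene_scores(G_pred, G_meas)
```

In this toy example the first gene is dense and predicted up to a scale factor, so its score is 1, while the second gene is mostly zero in the measured data and scores lower.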

We can plot the scores of the test genes and see how they compare to the training genes. Following the strategy in the previous plots, we visualize the scores as a function of the sparsity of the spatial data.

In [None]:
sns.scatterplot(data=df_all_genes, x='score', y='sparsity_2', hue='is_training', alpha=.5);

Usually, sparser genes in the spatial data are predicted with lower scores, which is due to the presence of dropouts in the spatial data.
Let's choose a few test genes with varied scores and compare the predicted vs measured gene expression.

In [None]:
genes=['EPCAM','MYH11']
tg.plot_genes(genes, adata_measured=ad_sp, adata_predicted=ad_ge,s=1)
