[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MeyerBender/spatialproteomics_workshop/blob/main/notebooks/workshop02_task.ipynb)

# Downstream analysis of spatial proteomics data

In the previous exercise, you saw the steps required to transform high-dimensional image data into something more workable, such as a table of cells with associated cell types.
This is the prerequisite for any meaningful downstream analysis, whose goal is to find spatial patterns associated with a readout of interest, such as patient survival.
Here, we explore how to go from individual cell types to neighborhood profiles.

In [None]:
# download the data
# if you have already run this cell once, there is no need to run it again

# use this when running on colab
! wget -O /content/spatialproteomics_workshop_data.tar.gz https://www.huber.embl.de/users/matthias/spatialproteomics_workshop_data.tar.gz
! tar -xzf /content/spatialproteomics_workshop_data.tar.gz -C /content
! pip install --quiet spatialproteomics==0.6.8 squidpy==1.2.2
data_dir = '/content/data'

# use this when running locally
#! wget -O spatialproteomics_workshop_data.tar.gz https://www.huber.embl.de/users/matthias/spatialproteomics_workshop_data.tar.gz
#! tar -xzf spatialproteomics_workshop_data.tar.gz
#data_dir = 'data'

In [None]:
import xarray as xr
import spatialproteomics as sp
import matplotlib.pyplot as plt
import scanpy as sc
from tqdm.auto import tqdm
from glob import glob
import seaborn as sns
import os
import pandas as pd

## Reading in files
Let's look at some lymph nodes. Segmentation and cell type prediction were already performed on this data, so we can simply read in the zarr files.

In [None]:
channels = ['PAX5', 'CD3', 'CD11b', 'CD11c', 'CD15', 'CD68', 'Podoplanin', 'CD31', 'CD34', 'CD90', 'CD56']
quantiles = [0.8, 0.5, 0.8, 0.8, 0.8, 0.8, 0.95, 0.95, 0.95, 0.95, 0.8]
colors = ['#e6194B', '#3cb44b', '#ffe119', '#4363d8', '#ffd8b1', '#f58231', '#911eb4', '#fffac8', '#469990', '#fabed4', '#9A6324']
ct_marker_dict = {'B': 'PAX5', 'T': 'CD3', 'Myeloid': 'CD11b', 'Dendritic': 'CD11c', 'Granulo': 'CD15', 'Macro': 'CD68', 'Stroma PDPN': 'Podoplanin', 'Stroma CD31': 'CD31', 'Stroma CD34': 'CD34', 'Stroma CD90': 'CD90', 'NK': 'CD56'}

In [None]:
# reading all of the images into a dictionary

sp_dict = {}
for sample_path in glob(os.path.join(data_dir, 'zarrs/*.zarr')):
    sample_id = os.path.basename(sample_path).replace(".zarr", "")
    sp_dict[sample_id] = xr.open_dataset(sample_path, engine='zarr')

In [None]:
# TODO: go through each sample and plot the intensities and the cell type predictions next to one another

## Exploration of marker profiles
Before we do any neighborhood analysis, let's verify that our cell type annotations are sensible. One way to do this is to plot the expression matrix or a UMAP for every sample. There are many tools for this, but `scanpy` has become one of the most widely used frameworks. `spatialproteomics` provides export functions to interact with other packages, so let's convert our `spatialproteomics` objects into `anndata` objects and use `scanpy` for a preliminary analysis.

In [None]:
adata_dict = {}
for sample_id, sp_obj in sp_dict.items():
    # converting to anndata and performing umap computation
    adata = sp_obj.tl.convert_to_anndata()
    sc.pp.neighbors(adata)  # Compute neighbors
    sc.tl.umap(adata)       # Compute UMAP
    adata_dict[sample_id] = adata

In [None]:
# TODO: look at the scanpy documentation and use their heatmap and umap functions to visualize the marker profiles

## Defining neighborhoods
Next up, we can define cellular neighborhoods. This boils down to two steps. 

In the first step, we look at all cells within a specified radius of a given cell. For example, for cell A, we collect all cells within a radius of 50 microns and record the relative frequency of each cell type among them. So instead of saying that a cell is a B cell, we can now say that it is in a neighborhood with 80% B cells and 20% T cells.
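
To make the first step concrete, here is a minimal sketch on toy data. It is not how `spatialproteomics` computes neighborhoods internally; the function `neighborhood_composition`, its `include_center` flag, and the toy coordinates are hypothetical and only serve as an illustration. The sketch assumes that a cell counts as part of its own neighborhood.

```python
import numpy as np

def neighborhood_composition(coords, labels, radius, include_center=True):
    """For every cell, return the fraction of each cell type among all
    cells within `radius` of it (hypothetical helper, illustration only)."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    cell_types, codes = np.unique(labels, return_inverse=True)

    # pairwise Euclidean distances (fine for toy data; real images with many
    # cells would need a spatial index such as a k-d tree)
    diff = coords[:, None, :] - coords[None, :, :]
    within = np.sqrt((diff ** 2).sum(axis=-1)) <= radius
    if not include_center:
        np.fill_diagonal(within, False)

    # one-hot encode the cell types and count the neighbors of each type
    one_hot = np.eye(len(cell_types))[codes]
    counts = within.astype(float) @ one_hot

    # turn counts into relative frequencies (cells without neighbors stay 0)
    totals = counts.sum(axis=1, keepdims=True)
    fractions = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return cell_types, fractions

# toy example: coordinates in microns, four B cells and one T cell close
# together, plus one isolated T cell far away
coords = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5), (200, 200)]
labels = ["B", "B", "B", "B", "T", "T"]
cell_types, fractions = neighborhood_composition(coords, labels, radius=50)
print(cell_types)    # column order of `fractions`
print(fractions[0])  # neighborhood profile of the first cell
```

In this toy example, the first cell has four B cells and one T cell within 50 microns (including itself), which gives the 80% B and 20% T profile described above.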

In the second step, once we have those neighborhood profiles, we cluster them across all samples. There are many ways to do this, but to keep things simple, we only look at k-means clustering here.
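
The clustering step can be sketched as follows. This is a bare-bones version of Lloyd's algorithm written with NumPy for illustration; the function `kmeans` and the toy profiles are hypothetical, and in practice you would typically rely on a library implementation rather than this sketch.

```python
import numpy as np

def kmeans(profiles, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centers)."""
    profiles = np.asarray(profiles, dtype=float)
    rng = np.random.default_rng(seed)
    # initialise the centers with k distinct, randomly chosen profiles
    centers = profiles[rng.choice(len(profiles), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every profile to its closest center
        dist = np.linalg.norm(profiles[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # move each center to the mean of the profiles assigned to it
        new_centers = centers.copy()
        for j in range(k):
            members = profiles[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# toy neighborhood profiles (columns: fraction of B cells, fraction of T cells)
profiles = np.array([
    [0.9, 0.1], [0.8, 0.2], [1.0, 0.0],  # B-cell-rich neighborhoods
    [0.1, 0.9], [0.2, 0.8], [0.0, 1.0],  # T-cell-rich neighborhoods
])
labels, centers = kmeans(profiles, k=2)
print(labels)   # one cluster label per profile
print(centers)  # mean composition of each neighborhood cluster
```

Each cluster center is itself a composition vector, so it can be read as the "typical" cell type mix of that neighborhood. Note that the cluster numbering is arbitrary and depends on the initialisation.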

In [None]:
# this line packages the dictionary into an ImageContainer, which provides some useful functions to compute neighborhoods across samples
image_container = sp.ImageContainer(sp_dict)
# this method returns a dict in the same format as we provided as input
sp_dict = image_container.compute_neighborhoods()
# obtaining a data frame containing the neighborhood compositions
nh_composition = image_container.get_neighborhood_composition()

In [None]:
nh_composition

In [None]:
# TODO: plot the neighborhood composition as a heatmap or clustermap (e. g. using seaborn)

In [None]:
# TODO: plot the neighborhoods in the spatial context. Set the colors so that each neighborhood is colored by its most abundant cell type.

In [None]:
# getting the neighborhood composition and label for each individual cell
neighborhood_composition_per_cell = []
neighborhood_label_per_cell = []

for sp_obj in sp_dict.values():
    # this df contains the neighborhood composition around each cell
    neighborhood_composition_per_cell.append(sp_obj.pp.get_layer_as_df('_neighborhoods'))
    # this line obtains the neighborhood label of each cell
    neighborhood_label_per_cell.append(sp_obj.pp.get_layer_as_df()['_neighborhoods'])

neighborhood_composition_per_cell = pd.concat(neighborhood_composition_per_cell, axis=0).reset_index(drop=True).fillna(0)
neighborhood_label_per_cell = pd.concat(neighborhood_label_per_cell, axis=0).reset_index(drop=True)

In [None]:
# TODO: how meaningful are our clusters? Try computing a PCA on the neighborhood_composition_per_cell and color the points according to the neighborhood_label_per_cell.

In [None]:
# Experiment with different k's, or different methods of constructing neighborhoods (refer to the documentation of the image container for details). What do you observe?

## Additional analysis with squidpy
There are plenty of tools for analysing spatial data these days. Let's briefly look at how to use `squidpy` to compute a neighborhood enrichment.
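
As background, a neighborhood enrichment asks whether two cell types are neighbors more or less often than expected by chance. The sketch below illustrates the idea with a simple permutation test in NumPy. It is not squidpy's implementation; the functions `pair_counts` and `neighborhood_enrichment` and the toy graph are hypothetical, and the sketch assumes the spatial neighbor graph is already given as a list of undirected edges.

```python
import numpy as np

def pair_counts(edges, codes, n_types):
    """Count how often each pair of cell types is connected by an edge."""
    counts = np.zeros((n_types, n_types))
    a, b = codes[edges[:, 0]], codes[edges[:, 1]]
    np.add.at(counts, (a, b), 1)
    np.add.at(counts, (b, a), 1)  # edges are undirected, so count both directions
    return counts

def neighborhood_enrichment(edges, labels, n_perms=500, seed=0):
    """Z-score of the observed pair counts against label permutations."""
    edges = np.asarray(edges)
    cell_types, codes = np.unique(np.asarray(labels), return_inverse=True)
    n_types = len(cell_types)
    observed = pair_counts(edges, codes, n_types)

    # null distribution: shuffle the labels while keeping the graph fixed
    rng = np.random.default_rng(seed)
    permuted = np.stack([
        pair_counts(edges, rng.permutation(codes), n_types)
        for _ in range(n_perms)
    ])
    std = permuted.std(axis=0)
    zscores = np.divide(observed - permuted.mean(axis=0), std,
                        out=np.zeros_like(observed), where=std > 0)
    return cell_types, zscores

# toy graph: cells 0-3 are B cells, cells 4-7 are T cells; each group is
# densely connected and only a single edge (3, 4) links the two groups
edges = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3),
         (4, 5), (5, 6), (6, 7), (4, 6), (5, 7),
         (3, 4)]
labels = ["B"] * 4 + ["T"] * 4
cell_types, zscores = neighborhood_enrichment(edges, labels)
print(cell_types)
print(zscores)  # positive: more contacts than expected, negative: fewer
```

In this toy graph, same-type pairs get positive scores and the B-T pair gets a negative score, because the two groups are spatially segregated.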

In [None]:
# this is only on a single sample, you could also concatenate the adata objects to get a more global view
import squidpy as sq

adata = adata_dict['166_1_H3_LK'].copy()

In [None]:
# formatting the anndata object. This is required for squidpy to work properly
adata.obsm['spatial'] = adata.obs[['centroid-0', 'centroid-1']].to_numpy()

In [None]:
# TODO: use squidpy to perform a neighborhood enrichment. Look into other possible downstream methods offered by squidpy.