# Differential abundance analysis with Milo

In this exercise, we will perform differential abundance (DA) analysis to identify changes in cell composition between healthy (PBMMC) and leukemia (ETV6-RUNX1) samples. Rather than performing the analysis per cell type, we will use Milo to perform the analysis at the neighbourhood level.

## Setup

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pertpy as pt
import scanpy as sc

# Default figure size for all plots
plt.rcParams["figure.figsize"] = (7, 7)

# Verbose scanpy logging
sc.settings.verbosity = 3

In [None]:
adata = sc.read_h5ad('../Data/Caron_clustered.PBMMCandETV6RUNX1.h5ad')

In [None]:
adata

In [None]:
sc.pl.embedding(adata, "X_umap_corrected", color=["label"], legend_loc="on data")

## Differential abundance analysis

We will now perform the differential abundance analysis using Milo. The analysis consists of the following steps:

1. Building a k-nearest neighbour (kNN) graph
2. Sampling representative neighbourhoods in the graph (for computational efficiency)
3. Testing for differential abundance of conditions in all neighbourhoods
4. Accounting for multiple hypothesis testing using a weighted FDR procedure that accounts for the overlap of neighbourhoods
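
Step 4 can be illustrated with a weighted Benjamini-Hochberg adjustment. The following is a minimal sketch, not pertpy's implementation: it assumes each neighbourhood's weight is the reciprocal of its distance to the k-th nearest neighbour (the `nhood_kth_distance` column used later), which is our reading of the Milo method, and the function name `weighted_bh` and the toy values are hypothetical.

```python
import numpy as np


def weighted_bh(pvalues, weights):
    """Weighted Benjamini-Hochberg adjustment (illustrative sketch only)."""
    pvalues = np.asarray(pvalues, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort p-values in ascending order and carry the weights along
    order = np.argsort(pvalues)
    p_sorted = pvalues[order]
    w_sorted = weights[order]

    # The usual rank is replaced by the cumulative weight up to each p-value
    adjusted = p_sorted * w_sorted.sum() / np.cumsum(w_sorted)

    # Enforce monotonicity starting from the largest p-value, then cap at 1
    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]
    adjusted = np.minimum(adjusted, 1.0)

    # Return the adjusted values in the original order
    out = np.empty_like(adjusted)
    out[order] = adjusted
    return out


# Toy p-values for four neighbourhoods (made-up numbers)
pvals = np.array([0.01, 0.04, 0.03, 0.5])

# With equal weights the procedure reduces to standard Benjamini-Hochberg
fdr_equal = weighted_bh(pvals, np.ones_like(pvals))

# Assumed weighting: reciprocal of the k-th nearest neighbour distance
kth_distance = np.array([1.0, 2.0, 0.5, 1.0])
fdr_weighted = weighted_bh(pvals, 1.0 / kth_distance)
```

Under this assumed weighting, neighbourhoods in dense regions of the graph (small k-th neighbour distance) receive larger weights than those in sparse regions.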

In [None]:
## Initialize object for Milo analysis
milo = pt.tl.Milo()
mdata = milo.load(adata)

In [None]:
mdata

In [None]:
# Create the kNN graph
sc.pp.neighbors(mdata["rna"], use_rep="X_corrected", n_neighbors=15)

In [None]:
# Sample representative neighbourhoods
milo.make_nhoods(mdata["rna"], prop=0.1)

In [None]:
mdata["rna"].obsm["nhoods"]

In [None]:
mdata["rna"][mdata["rna"].obs["nhood_ixs_refined"] != 0].obs[["nhood_ixs_refined", "nhood_kth_distance"]]

Let's check the neighbourhood sizes as a QC metric. Ideally, most neighbourhoods should be neither too small nor too big. According to the authors of Milo, the ideal result is an average neighbourhood size of 5 x N samples (5 x 7 = 35 in our case).

In [None]:
nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
plt.hist(nhood_size, bins=100)
plt.xlabel("# cells in nhood")
plt.ylabel("# nhoods");
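
To make the size calculation concrete, here is a small sketch on a made-up cell-by-neighbourhood membership matrix. The matrix values are hypothetical; only the 5 x N samples rule of thumb and the 7 samples come from this exercise.

```python
import numpy as np

# Toy membership matrix: rows are cells, columns are neighbourhoods.
# An entry of 1 means the cell belongs to that neighbourhood (made-up data).
membership = np.array(
    [
        [1, 0, 0],
        [1, 1, 0],
        [1, 1, 0],
        [0, 1, 1],
        [0, 1, 1],
        [0, 0, 1],
    ]
)

n_samples = 7  # number of samples in this exercise

# Column sums give the number of cells in each neighbourhood
toy_nhood_size = membership.sum(axis=0)

# Rule-of-thumb target for the average neighbourhood size
target_size = 5 * n_samples

print(f"sizes: {toy_nhood_size.tolist()}")
print(f"mean size: {toy_nhood_size.mean():.2f}, target: {target_size}")
```

On the real data, the same comparison can be made by taking the mean of `nhood_size` from the cell above.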

In [None]:
# Count cells within each neighbourhood
mdata = milo.count_nhoods(mdata, sample_col="SampleName")

In [None]:
mdata

In [None]:
mdata["milo"]

In [None]:
# Test for DA between conditions
milo.da_nhoods(mdata, design="~SampleGroup")

In [None]:
mdata["milo"].obs

In [None]:
mdata["milo"].var

Let's inspect the results of the DA analysis visually using some diagnostic plots. We first check that the distribution of the uncorrected P-values is well behaved. We then visualise the test results with a volcano plot to see how many neighbourhoods show differential abundance.

In [None]:
plt.hist(mdata["milo"].var.PValue, bins=50)
plt.xlabel("P-Vals")

In [None]:
plt.plot(mdata["milo"].var.logFC, -np.log10(mdata["milo"].var.SpatialFDR), ".")
plt.xlabel("log-Fold Change")
plt.ylabel("-log10(Spatial FDR)")
plt.axhline(y=1)
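
The horizontal line at y = 1 in the volcano plot corresponds to a Spatial FDR of 0.1, since -log10(0.1) = 1. As a rough sketch of how the points above that line could be counted, the example below uses made-up `logFC` and `SpatialFDR` values in plain NumPy arrays; on the real data these would come from `mdata["milo"].var`.

```python
import numpy as np

# Made-up test results for five neighbourhoods
log_fc = np.array([2.1, -1.5, 0.3, -0.2, 3.0])
spatial_fdr = np.array([0.01, 0.05, 0.60, 0.09, 0.20])

alpha = 0.1  # Spatial FDR threshold (10%)

# Height of the threshold line on the -log10 scale
threshold_line = -np.log10(alpha)

# Neighbourhoods passing the threshold, split by the sign of the log-fold change
significant = spatial_fdr < alpha
n_positive = int(np.sum(significant & (log_fc > 0)))
n_negative = int(np.sum(significant & (log_fc < 0)))

print(f"significant: {int(significant.sum())} "
      f"(logFC > 0: {n_positive}, logFC < 0: {n_negative})")
```

Which condition a positive log-fold change refers to depends on how the levels of the design variable are ordered, so check this before interpreting the direction.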

We can also visualise the results on the single-cell embedding, by first building a neighbourhood graph to superimpose on it. In this figure, each node represents a neighbourhood, coloured by its DA log-fold change.

In [None]:
milo.build_nhood_graph(mdata, 'X_umap_corrected')

In [None]:
plt.rcParams["figure.figsize"] = [7, 7]
milo.plot_nhood_graph(
    mdata,
    alpha=0.1,  ## SpatialFDR level (10%)
    min_size=1,  ## Size of smallest dot
)

In [None]:
# Assign cell type label to each neighbourhood by most common label
milo.annotate_nhoods(mdata, anno_col="label")

In [None]:
mdata["milo"].var

Let's check that the neighbourhoods are mostly homogeneous, and relabel as "Mixed" any neighbourhood that contains a mix of cell types.

In [None]:
plt.hist(mdata["milo"].var["nhood_annotation_frac"], bins=30)
plt.xlabel("celltype fraction")

In [None]:
mdata["milo"].var["nhood_annotation"] = mdata["milo"].var["nhood_annotation"].cat.add_categories('Mixed')

In [None]:
mdata["milo"].var.loc[mdata["milo"].var["nhood_annotation_frac"] < 0.7, "nhood_annotation"] = "Mixed"
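
The logic behind the annotation and the 0.7 cut-off can be sketched for a single neighbourhood in plain Python. This is an illustration of the idea rather than what `annotate_nhoods` does internally; the function name `annotate_nhood` and the cell type labels are hypothetical.

```python
from collections import Counter


def annotate_nhood(cell_labels, min_frac=0.7):
    """Return (annotation, fraction) for the cells of one neighbourhood."""
    counts = Counter(cell_labels)

    # Most common label and the fraction of cells that carry it
    label, n_cells = counts.most_common(1)[0]
    frac = n_cells / len(cell_labels)

    # Neighbourhoods below the cut-off are treated as a mix of cell types
    annotation = label if frac >= min_frac else "Mixed"
    return annotation, frac


# A mostly homogeneous neighbourhood keeps its majority label
homogeneous = annotate_nhood(["B"] * 8 + ["T"] * 2)

# A heterogeneous neighbourhood is relabelled as "Mixed"
heterogeneous = annotate_nhood(["B"] * 5 + ["T"] * 4 + ["NK"])

print(homogeneous)
print(heterogeneous)
```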

In [None]:
mdata["milo"].var

We can now visualise the distribution of DA log-fold changes across the different cell types.

In [None]:
milo.plot_da_beeswarm(mdata, alpha=0.1)

---

## Optional exercises

1. Explore the DA result further by following the guide from [pertpy](https://pertpy.readthedocs.io/en/stable/tutorials/notebooks/milo.html#Visualize-result-by-celltype).