## Rare cell detection after integration with 3 different methods

In this notebook we assess how well rare cells (ionocytes, neuroendocrine cells, and tuft cells) can be detected after integration of datasets with three different methods: scanVI, Seurat's RPCA, and Harmony. We do the same for the final integrated atlas.

### Import modules, set paths, choose integration to analyze:

In [1]:
import scanpy as sc
import pandas as pd
import sys
import os

For pretty code formatting (not necessary to run the code):

In [2]:
%load_ext lab_black

Set figures to high resolution (also not necessary to run the code):

In [3]:
sc.set_figure_params(dpi=140)

Initialize an empty dictionary in which to store figures:

In [4]:
FIGURES = dict()

set paths:

In [5]:
dir_benchmarking_res = (
    "../../results/integration_benchmarking/benchmarking_results/integration/"
)
dir_clustering = "../../results/integration_benchmarking/clustering/"
dir_results = "../../results/integration_benchmarking/rare_cell_recovery/"
path_HLCA = "../../data/HLCA_core_h5ads/HLCA_v1.h5ad"

Choose the integration to analyze (uncomment exactly one of the four lines below):

In [6]:
# dataset = "LCA_v1"  # full atlas
# dataset = "harmony"  # benchmarking
# dataset = "seuratrpca"  # benchmarking
dataset = "scanvi"  # benchmarking
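Because the loading step branches on the exact string stored in `dataset`, a misspelled name only fails later with a less informative error. A minimal sketch of an early check is shown here; the subdirectories are the ones used in the loading cell of this notebook, while the names `BEST_PREPROCESSING` and `benchmark_h5ad_path` are hypothetical and not part of the original code:

```python
import os

# Preprocessing subdirectory per benchmarked method (as used in the loading cell).
# The dictionary and the helper below are illustrative names, not original code.
BEST_PREPROCESSING = {
    "seuratrpca": "unscaled/hvg/R",
    "harmony": "scaled/hvg/R",
    "scanvi": "unscaled/hvg",
}


def benchmark_h5ad_path(base_dir, dataset):
    """Return the .h5ad path for a benchmarked integration, or raise on an unknown name."""
    if dataset not in BEST_PREPROCESSING:
        raise ValueError(
            f"Unknown dataset {dataset!r}; expected one of {sorted(BEST_PREPROCESSING)}"
        )
    return os.path.join(base_dir, BEST_PREPROCESSING[dataset], f"{dataset}.h5ad")


print(benchmark_h5ad_path("benchmarking_results/integration", "harmony"))
```

The full atlas (`"LCA_v1"`) is read from its own path and would need to be handled separately from this lookup.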

### Perform analysis:

Import the data and the matching nested clustering. For every method we use the integration with its optimal preprocessing, as assessed via benchmarking:

In [7]:
print("Dataset name:", dataset)
if dataset == "LCA_v1":
    print(f"importing {dataset} data")
    adata = sc.read(path_HLCA)
else:
    # import the integration with the best-performing preprocessing (i.e. either
    # with or without scaling, and hvg versus full feature)
    if dataset == "seuratrpca":
        print(f"importing {dataset} data")
        adata = sc.read(
            os.path.join(dir_benchmarking_res, f"unscaled/hvg/R/{dataset}.h5ad")
        )
    elif dataset == "harmony":
        print(f"importing {dataset} data")
        adata = sc.read(
            os.path.join(dir_benchmarking_res, f"scaled/hvg/R/{dataset}.h5ad")
        )
    elif dataset == "scanvi":
        print(f"importing {dataset} data")
        adata = sc.read(
            os.path.join(dir_benchmarking_res, f"unscaled/hvg/{dataset}.h5ad")
        )
    # normalize the brush/tuft label by dropping the slash
    # ("Brush Cell/Tuft" -> "Brush Cell Tuft"); in the paper we call
    # these cells tuft and not brush anymore
    ct_name_updater = {ct: ct for ct in adata.obs.ann_level_4.unique()}
    ct_name_updater["Brush Cell/Tuft"] = "Brush Cell Tuft"
    adata.obs.ann_level_4 = adata.obs.ann_level_4.map(ct_name_updater)
    # import cluster assignments:
    for cl_level in ["1", "2", "3"]:
        cl_ass = pd.read_csv(
            os.path.join(
                dir_clustering,
                f"{dataset}/{dataset}_leiden_{cl_level}_cluster_assignment.csv",
            ),
            index_col=0,
        )
        adata.obs[f"leiden_{cl_level}"] = cl_ass.loc[
            adata.obs.index, f"leiden_{cl_level}"
        ]

Dataset name: scanvi
importing scanvi data
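The cluster assignments are matched to cells by barcode, so the CSV index must cover every cell in `adata.obs`. The sketch below illustrates that alignment on toy data with an explicit check for missing barcodes. The barcodes and labels are made up, and reading the labels with `dtype=str` is an assumption added here so that coarse labels such as `0` or `0.7` are not parsed as numbers:

```python
import io

import pandas as pd

# Toy stand-ins for adata.obs and one cluster assignment file (hypothetical values)
obs = pd.DataFrame(index=["cell_b", "cell_a", "cell_c"])
csv_text = "barcode,leiden_3\ncell_a,0.1.0\ncell_b,0.1.1\ncell_c,0.1.0\n"
cl_ass = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={"leiden_3": str})

# Fail early with a clear message if any cell lacks a cluster assignment
missing = obs.index.difference(cl_ass.index)
if len(missing) > 0:
    raise ValueError(f"{len(missing)} cells have no cluster assignment")

# Label-based selection reorders the assignments to follow the order of obs
obs["leiden_3"] = cl_ass.loc[obs.index, "leiden_3"]
print(obs)
```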


Print percentage of cells annotated as each of the three rare cell types:

In [8]:
ct_counts = adata.obs.groupby("ann_level_4").agg({"ann_level_4": "count"})
ct_counts = ct_counts / ct_counts.sum() * 100
ct_counts.loc[["Ionocyte", "Neuroendocrine", "Brush Cell Tuft"], :]

| ann_level_4 | % of cells |
|---|---|
| Ionocyte | 0.065034 |
| Neuroendocrine | 0.027411 |
| Brush Cell Tuft | 0.023111 |
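The same percentages can be obtained more directly with `value_counts(normalize=True)`. The following is a small sketch on made-up labels (the toy cell counts are assumptions for illustration, not atlas data):

```python
import pandas as pd

# Toy annotation column: 6 basal cells, 3 ionocytes, 1 neuroendocrine cell
ann = pd.Series(
    ["Basal"] * 6 + ["Ionocyte"] * 3 + ["Neuroendocrine"] * 1, name="ann_level_4"
)

# Fraction of cells per label, converted to a percentage
pct = ann.value_counts(normalize=True) * 100
print(pct.loc[["Ionocyte", "Neuroendocrine"]])
```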


Calculate cluster sizes for leiden 3 clusters (i.e. finest clusters of nested clustering):

In [9]:
cluster_sizes = (
    adata.obs.groupby("leiden_3")
    .agg({"leiden_3": "count"})
    .rename(columns={"leiden_3": "n_cells"})
)

Calculate number of rare cells per cluster (i.e. annotated at level 3 as "Rare"):

In [10]:
leiden_3_Rare_count = (
    adata.obs.groupby(["ann_level_3", "leiden_3"])
    .agg({"leiden_3": "count"})
    .loc["Rare", :]
    .rename(columns={"leiden_3": "n_cells"})
    .sort_values(by="n_cells", ascending=False)
)

Calculate number of level-4 annotated rare cells per cluster (i.e. annotated at level 4 as "Ionocyte", "Brush Cell Tuft" or "Neuroendocrine"):

In [11]:
rare_cells_cluster_ass = (
    adata.obs.groupby(["ann_level_4", "leiden_3"])
    .agg({"leiden_3": "count"})
    .rename(columns={"leiden_3": "n_cells"})
)
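One detail worth keeping in mind with grouped counts like these: if the grouping columns are stored as pandas categoricals (an assumption here, although common for `adata.obs` columns), the `observed` argument of `groupby` decides whether annotation/cluster pairs that never occur are listed with a count of zero or omitted. A toy sketch of the difference:

```python
import pandas as pd

# Toy obs table with categorical columns (hypothetical labels)
obs = pd.DataFrame(
    {
        "ann_level_4": pd.Categorical(["Ionocyte", "Ionocyte", "Basal", "Basal"]),
        "leiden_3": pd.Categorical(["0.7.0", "0.7.0", "0.1.0", "0.1.1"]),
    }
)

# observed=False: every annotation x cluster combination, zeros included
all_pairs = obs.groupby(["ann_level_4", "leiden_3"], observed=False).size()
# observed=True: only the combinations that actually occur in the data
seen_pairs = obs.groupby(["ann_level_4", "leiden_3"], observed=True).size()

print(len(all_pairs), len(seen_pairs))
```

Either setting works for the recall and precision calculation, because zero counts contribute nothing to the sums; stating `observed` explicitly mainly avoids depending on the default of the installed pandas version.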

Now calculate, per cluster, recall (the fraction of all cells with a given rare annotation that fall in the cluster) and precision (the fraction of cells in the cluster that carry that annotation) for each of these annotations:

In [12]:
rare_cell_recall = pd.DataFrame(index=sorted(adata.obs.leiden_3.unique()))
# neuroendocrine
rare_cell_recall["n_ne"] = 0
rare_cell_recall.loc[
    rare_cells_cluster_ass.loc["Neuroendocrine"].index, "n_ne"
] = rare_cells_cluster_ass.loc["Neuroendocrine"].n_cells
rare_cell_recall["recall_ne"] = round(
    rare_cell_recall.n_ne / rare_cell_recall.n_ne.sum(), 3
)
rare_cell_recall["prec_ne"] = round(
    rare_cell_recall.n_ne / cluster_sizes.loc[rare_cell_recall.index, "n_cells"], 3
)
# ionocytes
rare_cell_recall["n_io"] = 0
rare_cell_recall.loc[
    rare_cells_cluster_ass.loc["Ionocyte"].index, "n_io"
] = rare_cells_cluster_ass.loc["Ionocyte"].n_cells

rare_cell_recall["recall_io"] = round(
    rare_cell_recall.n_io / rare_cell_recall.n_io.sum(), 3
)
rare_cell_recall["prec_io"] = round(
    rare_cell_recall.n_io / cluster_sizes.loc[rare_cell_recall.index, "n_cells"], 3
)
# brush/tuft
rare_cell_recall["n_brush"] = 0
rare_cell_recall.loc[
    rare_cells_cluster_ass.loc["Brush Cell Tuft"].index, "n_brush"
] = rare_cells_cluster_ass.loc["Brush Cell Tuft"].n_cells


rare_cell_recall["recall_brush"] = round(
    rare_cell_recall.n_brush / rare_cell_recall.n_brush.sum(), 3
)
rare_cell_recall["prec_brush"] = round(
    rare_cell_recall.n_brush / cluster_sizes.loc[rare_cell_recall.index, "n_cells"], 3
)
# total
rare_cell_recall["recall_Rare"] = 0.0  # float, since rounded fractions are assigned below

rare_cell_recall.loc[leiden_3_Rare_count.index, "recall_Rare"] = round(
    leiden_3_Rare_count.n_cells / leiden_3_Rare_count.n_cells.sum(),
    3,
)
rare_cell_recall.loc[leiden_3_Rare_count.index, "prec_Rare"] = round(
    leiden_3_Rare_count.n_cells
    / cluster_sizes.loc[leiden_3_Rare_count.index, "n_cells"],
    3,
)
# sort by total
rare_cell_recall.sort_values(by="recall_Rare", inplace=True, ascending=False)
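The three per-cell-type blocks above follow the same pattern: count cells per cluster and annotation, then normalize by the annotation total (recall) or by the cluster size (precision). As a compact illustration of that pattern, not a replacement for the code above, the sketch below computes both from a single `pd.crosstab` on toy data (the labels and counts are made up):

```python
import pandas as pd

# Toy data: 4 ionocytes, 2 neuroendocrine cells, 4 basal cells over 4 clusters
obs = pd.DataFrame(
    {
        "ann_level_4": ["Ionocyte"] * 4 + ["Neuroendocrine"] * 2 + ["Basal"] * 4,
        "leiden_3": [
            "0.7.0", "0.7.0", "0.7.0", "0.7.3",  # ionocytes
            "0.7.4", "0.7.4",                    # neuroendocrine
            "0.7.0", "0.7.3", "0.1.0", "0.1.0",  # basal
        ],
    }
)

# Rows: clusters, columns: annotations, values: number of cells
counts = pd.crosstab(obs["leiden_3"], obs["ann_level_4"])

# Recall: share of each annotation's cells that land in a cluster (columns sum to 1)
recall = counts.div(counts.sum(axis=0), axis=1)
# Precision: share of each cluster's cells with a given annotation (rows sum to 1)
precision = counts.div(counts.sum(axis=1), axis=0)

print(recall.round(3))
print(precision.round(3))
```

In this toy example, cluster `0.7.0` holds 3 of the 4 ionocytes (recall 0.75), and 3 of its 4 cells are ionocytes (precision 0.75).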

Keep only clusters with at least one rare cell:

In [13]:
recall = rare_cell_recall.loc[rare_cell_recall.recall_Rare > 0, :]

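Identify the top-recall cluster for ionocytes, brush/tuft cells, and neuroendocrine cells (if ionocytes and tuft cells share a top cluster, take the second-best cluster for tuft cells):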

In [14]:
io_top_cl = recall.sort_values(by="recall_io", ascending=False).index[0]
brush_top_cl = recall.sort_values(by="recall_brush", ascending=False).index[0]
if io_top_cl == brush_top_cl:
    brush_top_cl = recall.sort_values(by="recall_brush", ascending=False).index[1]
ne_top_cl = recall.sort_values(by="recall_ne", ascending=False).index[0]
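The selection above handles only a collision between the ionocyte and tuft top clusters. A slightly more general variant, sketched below under the assumption that each cell type should get its own cluster, walks through the cell types in order and skips clusters that are already taken. The helper name `top_distinct_clusters` and the toy recall values are hypothetical:

```python
import pandas as pd


def top_distinct_clusters(recall, columns):
    """Greedily pick, per recall column, the best cluster not chosen before."""
    chosen = {}
    for col in columns:
        ranked = recall[col].sort_values(ascending=False).index
        # next() raises StopIteration if there are fewer clusters than columns
        chosen[col] = next(cl for cl in ranked if cl not in chosen.values())
    return chosen


# Toy recall table in which ionocytes and tuft cells share the same top cluster
recall = pd.DataFrame(
    {
        "recall_io": [0.6, 0.3, 0.1],
        "recall_brush": [0.5, 0.4, 0.1],
        "recall_ne": [0.0, 0.1, 0.9],
    },
    index=["0.7.0", "0.7.3", "0.7.4"],
)

top = top_distinct_clusters(recall, ["recall_io", "recall_brush", "recall_ne"])
print(top)
```

The result depends on the order of the columns, since earlier cell types get first pick.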

Select these three clusters and convert the recall fractions to percentages:

In [15]:
recall_top3 = (
    recall.loc[
        [io_top_cl, brush_top_cl, ne_top_cl], ["recall_io", "recall_brush", "recall_ne"]
    ]
    * 100
)

Rename columns:

In [16]:
recall_top3.rename(
    columns={
        "recall_ne": "% of NE cell annotations",
        "recall_io": "% of ionocyte annotations",
        "recall_brush": "% of tuft annotations",
    },
    inplace=True,
)
recall_top3.index.name = "res. 3 cluster"
# recall_top3 = recall_top3.loc[sorted(recall_top3.index.tolist()), :]

Do the same for the precision results:

In [17]:
precision_top3 = (
    recall.loc[
        [io_top_cl, brush_top_cl, ne_top_cl], ["prec_io", "prec_brush", "prec_ne"]
    ]
    * 100
)

In [18]:
precision_top3.rename(
    columns={
        "prec_ne": "NE precision",
        "prec_io": "Ionocyte precision",
        "prec_brush": "Tuft precision",
    },
    inplace=True,
)
precision_top3.index.name = "res. 3 cluster"

Remove the name of the columns index:

In [19]:
recall_top3.columns.name = None
precision_top3.columns.name = None

Show results:

In [20]:
recall_top3

| res. 3 cluster | % of ionocyte annotations | % of tuft annotations | % of NE cell annotations |
|---|---|---|---|
| 0.7.0 | 90.9 | 9.3 | 0.0 |
| 0.7.3 | 1.7 | 79.1 | 0.0 |
| 0.7.4 | 1.2 | 1.2 | 94.1 |


In [21]:
precision_top3

| res. 3 cluster | Ionocyte precision | Tuft precision | NE precision |
|---|---|---|---|
| 0.7.0 | 59.9 | 2.2 | 0.0 |
| 0.7.3 | 2.6 | 44.4 | 0.0 |
| 0.7.4 | 2.1 | 0.7 | 68.6 |


Save result:

In [23]:
recall_top3.to_csv(os.path.join(dir_results, f"Rare_cell_recall_{dataset}.csv"))
precision_top3.to_csv(os.path.join(dir_results, f"Rare_cell_precision_{dataset}.csv"))
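`to_csv` does not create missing directories, so writing fails if the results folder does not exist yet. A minimal sketch of a safe write and read-back, using a temporary directory and a one-row toy table in place of the real results:

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for the recall table (value copied from the output above)
table = pd.DataFrame(
    {"% of ionocyte annotations": [90.9]},
    index=pd.Index(["0.7.0"], name="res. 3 cluster"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "rare_cell_recovery")
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    out_path = os.path.join(out_dir, "Rare_cell_recall_scanvi.csv")
    table.to_csv(out_path)
    # Read back with the cluster label as index to confirm the round trip
    reloaded = pd.read_csv(out_path, index_col=0)

print(reloaded)
```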