# Cell Type Annotation using CELLiD

This tutorial gives a quick guide to cell type annotation with discotoolkit. The process has four steps:

1. Visit the discotoolkit website to find a sample or cell type of interest.
2. Use the `dt.filter_disco_metadata` function to select the relevant samples from the database.
3. Use the `dt.download_disco_data` function to download the filtered samples in `h5ad` format.
4. Normalize the count matrix and average the expression per cluster; the result is the input for the `dt.CELLiD_cluster` function.

In [2]:
# import package
import discotoolkit as dt
import scanpy as sc
import pandas as pd
import numpy as np

%load_ext autoreload
%autoreload 2

To keep things simple, we filter on sample metadata only and extract a single sample from the `bone_marrow` tissue. We then download the data as described in the download data tutorial.

In [3]:
# filter to only one sample
# (named sample_filter to avoid shadowing Python's built-in filter)
sample_filter = dt.Filter(sample="AML003_3p")

# filter the database based on the metadata
metadata = dt.filter_disco_metadata(sample_filter)

# download the data; samples that already exist locally are skipped
dt.download_disco_data(metadata)

INFO:root:Retrieving metadata from DISCO database
INFO:root:Filtering sample
INFO:root:Retrieving cell type information of each sample from DISCO database
INFO:root:1 samples and 6086 cells were found
INFO:root: AML003_3p has been downloaded before. Ignore ...


We also provide a helper function, `dt.get_atlas`, that lists the names of the atlases available in the DISCO database.

In [4]:
# helper function that lists the atlases available in the DISCO database
print(dt.get_atlas())

['testis', 'liver', 'pancreas', 'gingiva', 'intestine', 'adipose', 'tonsil', 'PDAC', 'breast', 'bone_marrow', 'lung', 'placenta', 'adrenal_gland', 'brain', 'thymus', 'bladder', 'ovarian_cancer', 'heart', 'eye', 'blood', 'breast_milk', 'skin', 'kidney', 'skeletal_muscle', 'fibroblast', 'stomach', 'ovary']


The downloaded data is in `h5ad` format, which we read with `scanpy`.

<div class="admonition note">
  <p class="admonition-title">Note</p>
  <p>
    The input data for the cell type annotation function must be normalized and in non-log space. In the example below, the downloaded gene expression is a raw count matrix, so we only need to normalize it with `sc.pp.normalize_total`. If your data is in log space, exponentiate it first to convert it back into non-log space.
  </p>
</div>
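As a small illustration of the note above, the sketch below converts log-space values back to non-log space. It assumes the data was transformed with a natural-log `log1p` (the default behaviour of `sc.pp.log1p`); if a different base or pseudocount was used, the inverse has to be adapted. The toy values are made up for the example.

```python
import numpy as np

# Toy normalized expression in non-log space (cells x genes); values are made up.
normalized = np.array([[0.0, 1.5, 10.0],
                       [2.0, 0.0, 4.5]])

# Assumption: the data was log-transformed as log(1 + x), e.g. with np.log1p.
logged = np.log1p(normalized)

# np.expm1 computes exp(x) - 1, the inverse of log1p, restoring non-log values.
recovered = np.expm1(logged)

print(np.allclose(recovered, normalized))
```

For an `AnnData` object with a sparse `adata.X`, the same function can be applied to the stored values only, because `expm1(0)` is `0` and the sparsity pattern is therefore unchanged.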

The input data must be a (gene, cluster) table: one row per gene and one column per cluster, holding the average expression of each gene in each cluster.

In [5]:
# read the h5ad file, which holds the raw gene expression counts
adata = sc.read_h5ad("DISCOtmp/AML003_3p.h5ad")

# normalise the count data
### skip this step if the data has already been normalised
### exponentiate the data if it is in log space
sc.pp.normalize_total(adata, target_sum = 1e4)
norm_temp = adata.X

# convert into a DataFrame (cells x genes) so that metadata can be added
temp = pd.DataFrame(norm_temp.toarray(), columns=list(adata.var.index))

temp["cluster"] = np.array(adata.obs["seurat_clusters"]) # cluster label of each cell
integrated_data = temp.groupby("cluster").mean().transpose() # average expression for each cluster

# the result has genes as the index and clusters as the columns: (gene, cluster)
integrated_data.head()




cluster,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
MIR1302-2HG,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FAM138A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
OR4F5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AL627309.1,0.019681,0.010808,0.006906,0.006409,0.00191,0.010536,0.002414,0.024108,0.00715,0.008123,0.003571,0.01634,0.0,0.027671,0.007016,0.037748,0.0
AL627309.3,0.0,0.0,0.002456,0.001417,0.0,0.0,0.009509,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
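The averaging step above first converts the whole matrix to a dense array, which can use a lot of memory for large samples. As a possible alternative, the sketch below computes the same per-cluster means while keeping the expression matrix sparse, by multiplying a one-hot cluster indicator matrix with the cell-by-gene matrix. This is only an illustration on made-up toy data, not part of discotoolkit; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy data: 4 cells x 3 genes, stored sparse like a typical AnnData .X
X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [3.0, 0.0, 0.0],
                                [0.0, 4.0, 2.0],
                                [0.0, 2.0, 6.0]]))
genes = ["geneA", "geneB", "geneC"]
clusters = pd.Categorical(["0", "0", "1", "1"])  # one cluster label per cell

# Integer code of each cell's cluster and the number of clusters
codes = np.asarray(clusters.codes, dtype=np.int64)
n_cells = len(codes)
n_clusters = len(clusters.categories)

# One-hot indicator matrix (clusters x cells): entry (k, i) is 1 if cell i is in cluster k
indicator = sparse.csr_matrix(
    (np.ones(n_cells), (codes, np.arange(n_cells))),
    shape=(n_clusters, n_cells),
)

# Sum expression per cluster, then divide by the number of cells in each cluster
cluster_sums = (indicator @ X).toarray()            # clusters x genes
cells_per_cluster = np.bincount(codes, minlength=n_clusters)
cluster_means = cluster_sums / cells_per_cluster[:, None]

# Transpose into the (gene, cluster) layout described above
gene_by_cluster = pd.DataFrame(
    cluster_means, index=clusters.categories, columns=genes
).T
print(gene_by_cluster)
```

Only the small cluster-by-gene result is converted to a dense array here, so memory use depends on the number of clusters rather than the number of cells.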


Now that the data is ready, we can apply the `dt.CELLiD_cluster` function. Set the `n_predict` parameter to the number of candidate cell types to return for each cluster.

In [6]:
# apply cellid_cluster function to annotate the cluster
cell_type = dt.CELLiD_cluster(rna = integrated_data, n_predict = 3)

INFO: Pandarallel will run on 10 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   7 out of  17 | elapsed:    5.5s remaining:    7.8s
[Parallel(n_jobs=10)]: Done  17 out of  17 | elapsed:    7.8s finished


In [7]:
# the result is returned as a pandas DataFrame
cell_type

input_index,predicted_cell_type_1,predicted_cell_type_2,predicted_cell_type_3,source_atlas_1,source_atlas_2,source_atlas_3,score_1,score_2,score_3
0,MHCII low CD14 monocyte,MHCII high CD14 monocyte,Cycling S100A+ preNeutrophil,bone_marrow,bone_marrow,bone_marrow,0.828,0.82,0.698
1,MHCII high CD14 monocyte,MHCII low CD14 monocyte,CD16 monocyte,bone_marrow,bone_marrow,bone_marrow,0.818,0.754,0.744
2,MHCII high CD14 monocyte,MHCII low CD14 monocyte,Cycling S100A+ preNeutrophil,bone_marrow,bone_marrow,bone_marrow,0.761,0.744,0.721
3,Common myeloid progenitor,Granulocyte-monocyte progenitor,Promyelocyte,bone_marrow,bone_marrow,bone_marrow,0.681,0.665,0.665
4,Cycling S100A+ preNeutrophil,cDC2,MHCII high CD14 monocyte,bone_marrow,bone_marrow,bone_marrow,0.739,0.717,0.708
5,CD16 monocyte,MHCII high CD14 monocyte,MHCII low CD14 monocyte,ovarian_cancer,bone_marrow,bone_marrow,0.831,0.821,0.772
6,pDC,Myeloid pre-pDC,cDC2,bone_marrow,bone_marrow,bone_marrow,0.701,0.675,0.652
7,cDC2,MHCII high CD14 monocyte,Cycling S100A+ preNeutrophil,bone_marrow,bone_marrow,bone_marrow,0.796,0.766,0.723
8,Myelocyte,Cycling cDC2,S100A+ preNeutrophil,bone_marrow,bone_marrow,bone_marrow,0.761,0.746,0.74
9,MHCII high CD14 monocyte,MHCII low CD14 monocyte,Cycling S100A+ preNeutrophil,bone_marrow,bone_marrow,bone_marrow,0.796,0.78,0.74
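
Several clusters in the table above have a top score that is very close to the second one (for example cluster 0, with 0.828 versus 0.82). The sketch below shows one possible way to flag such clusters and to copy the top prediction back to individual cells. It uses a small hand-built DataFrame that mimics the column names shown above; the threshold `MIN_MARGIN`, the string index, and the per-cell labels are assumptions made for illustration and are not part of discotoolkit.

```python
import pandas as pd

# Toy result mimicking three rows of the table above (clusters 0, 1 and 3)
annotation = pd.DataFrame(
    {
        "predicted_cell_type_1": [
            "MHCII low CD14 monocyte",
            "MHCII high CD14 monocyte",
            "Common myeloid progenitor",
        ],
        "predicted_cell_type_2": [
            "MHCII high CD14 monocyte",
            "MHCII low CD14 monocyte",
            "Granulocyte-monocyte progenitor",
        ],
        "score_1": [0.828, 0.818, 0.681],
        "score_2": [0.82, 0.754, 0.665],
    },
    index=pd.Index(["0", "1", "3"], name="input_index"),
)

# Hypothetical threshold: treat a cluster as ambiguous when the two best
# scores differ by less than this margin. The value is arbitrary.
MIN_MARGIN = 0.02
annotation["margin"] = annotation["score_1"] - annotation["score_2"]
annotation["ambiguous"] = annotation["margin"] < MIN_MARGIN

# Toy per-cell cluster labels (in the tutorial these would come from
# adata.obs["seurat_clusters"]); map each cell to its cluster's top prediction
cell_clusters = pd.Series(
    ["0", "0", "1", "3"], index=["cell_a", "cell_b", "cell_c", "cell_d"]
)
cell_labels = cell_clusters.map(annotation["predicted_cell_type_1"])

print(annotation[["predicted_cell_type_1", "margin", "ambiguous"]])
print(cell_labels)
```

Ambiguous clusters are good candidates for manual inspection with marker genes before the predicted label is accepted.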
