# Annotate ToF data using CellTypist
In this notebook we will:
- read in pre-made models trained on atlas fine/mid/coarse labels
- read in ToF data
- run model on ToF data to get predicted labels
- save these barcode+label files as .csv for later use

## Import CellTypist and dependencies

In [1]:
import scanpy as sc
import celltypist
import time
import numpy as np
import os
import pickle
import pandas as pd


## Paths and variables

In [2]:
#set relevant label key
compartment='AllCompartments'
cell_type_key = 'MidGrainModified'
batch_key='batch_key'
job_name=f'AtlasT21ToF_WithContCovariates_{cell_type_key}_{compartment}'
job_name

'AtlasT21ToF_WithContCovariates_MidGrainModified_AllCompartments'

In [3]:
#paths
base_path='/lustre/scratch126/cellgen/team205/jc48/jupyter/scArches'
data_dir=os.path.join(base_path,'data')
results_dir=os.path.join(base_path,f'results/{job_name}')
adata_path=os.path.join(results_dir,'anndata/')
models_path=os.path.join(results_dir,'models/')

In [4]:
dataset='ToF'

## Read in adata

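Data format:
- all genes
- raw counts in adata.X on load (see the printed values below); these are normalised to 10,000 counts per cell and log1p-transformed before prediction, which is the input CellTypist expects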

In [5]:
file_path=os.path.join(adata_path,f'{job_name}_adata_que_T21ToF.h5ad')
adata_que=sc.read(file_path)

print(adata_que.X.data[:10])

[ 1. 24.  7.  4.  1. 14.  3.  1.  3.  2.]
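The printed values are whole numbers, which suggests `adata.X` holds raw counts at this point rather than log-normalised expression. A minimal sketch of that check, using only NumPy on the ten values printed above; the helper name `looks_like_raw_counts` is hypothetical and is not part of scanpy or CellTypist:

```python
import numpy as np

def looks_like_raw_counts(values, tol=1e-8):
    """Return True if all values are non-negative and (near-)integer."""
    values = np.asarray(values, dtype=float)
    is_non_negative = np.all(values >= 0)
    is_integer_like = np.all(np.abs(values - np.round(values)) < tol)
    return bool(is_non_negative and is_integer_like)

# The ten stored values printed by the cell above
printed = np.array([1., 24., 7., 4., 1., 14., 3., 1., 3., 2.])
is_raw = looks_like_raw_counts(printed)

# The same values after log1p no longer look like counts
is_raw_after_log = looks_like_raw_counts(np.log1p(printed))
```

On the real object the same helper could be applied to a slice such as `adata_que.X.data[:1000]`, assuming `X` is stored as a sparse matrix as the `.data` access above implies.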


In [6]:
print(adata_que.shape)
adata = adata_que[adata_que.obs.diagnosis=='TOF'].copy()  # copy so later edits act on a real object, not a view
print(adata.shape)

(92194, 18642)
(15836, 18642)


In [7]:
adata.layers["counts"]=adata.X.copy()

sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
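A quick way to confirm this step did what is intended: after scaling each cell to 10,000 counts and applying log1p, undoing the log should give per-cell totals of 10,000 again. The sketch below reproduces the arithmetic on a toy matrix with plain NumPy. It is an illustration under that assumption, not scanpy's implementation, and `normalize_log1p` is a hypothetical helper name.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each row (cell) to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / totals * target_sum)

# Toy matrix: 2 cells x 3 genes (made-up values, not the notebook's data)
toy_counts = np.array([[1, 24, 7],
                       [4, 1, 14]])
toy_norm = normalize_log1p(toy_counts)

# Undo the log and sum per cell: each total should be ~10,000
row_totals = np.expm1(toy_norm).sum(axis=1)
```

The equivalent check on the notebook's object would be to apply `np.expm1` to `adata.X` and sum per cell, keeping in mind that the toy helper does not handle cells with zero total counts.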

## Apply celltypist model

In [8]:
saved_models = os.listdir(os.path.join(data_dir, 'CellTypistModels'))

annotation_levels = ['fine', 'mid', 'coarse']

list_of_prediction_dfs=[]

for annotation_level in annotation_levels:

    # Pick the pre-generated model whose file name contains this annotation level
    saved_model = [m for m in saved_models if annotation_level in m][0]
    model_file_path = os.path.join(data_dir, 'CellTypistModels', saved_model)

    # CellTypist prediction with majority voting (timed)
    t_start = time.time()
    predictions = celltypist.annotate(adata, model = model_file_path, majority_voting = True)
    t_end = time.time()

    # Collate predictions: keep the last four .obs columns, i.e. the ones added by to_adata()
    list_of_prediction_dfs.append(predictions.to_adata(prefix=annotation_level+'_CellTypist_').obs.iloc[:,-4:])
    print(f"{annotation_level} completed in : {t_end - t_start} seconds")

predictions_df=pd.concat(list_of_prediction_dfs, axis=1)
predictions_df

🔬 Input data has 15836 cells and 18642 genes
🔗 Matching reference genes in the model
🧬 17260 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 10
🗳️ Majority voting the predictions
✅ Majority voting done!
🔬 Input data has 15836 cells and 18642 genes
🔗 Matching reference genes in the model


fine completed in : 74.83157539367676 seconds


🧬 17260 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 10
🗳️ Majority voting the predictions
✅ Majority voting done!
🔬 Input data has 15836 cells and 18642 genes
🔗 Matching reference genes in the model


mid completed in : 29.85644555091858 seconds


🧬 17260 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 10
🗳️ Majority voting the predictions
✅ Majority voting done!


coarse completed in : 31.639684438705444 seconds


| | fine_CellTypist_predicted_labels | fine_CellTypist_over_clustering | fine_CellTypist_majority_voting | fine_CellTypist_conf_score | mid_CellTypist_predicted_labels | mid_CellTypist_over_clustering | mid_CellTypist_majority_voting | mid_CellTypist_conf_score | coarse_CellTypist_predicted_labels | coarse_CellTypist_over_clustering | coarse_CellTypist_majority_voting | coarse_CellTypist_conf_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P26_1_AGTGCCGAGATAACGT-1 | VentricularCardiomyocytesRightCompact | 12 | VentricularCardiomyocytesRightCompact | 0.008608 | VentricularCardiomyocytes | 12 | VentricularCardiomyocytes | 0.951256 | Cardiomyocytes | 12 | Cardiomyocytes | 0.999995 |
| P26_1_TCGCACTCATATCGGT-1 | VentricularCardiomyocytesRightTrabeculated | 1 | VentricularCardiomyocytesRightCompact | 0.003694 | VentricularCardiomyocytes | 1 | VentricularCardiomyocytes | 0.952763 | Cardiomyocytes | 1 | Cardiomyocytes | 1.000000 |
| P33_10k_CATTGCCAGCGTCAAG-1 | VentricularCardiomyocytesRightCompact | 11 | VentricularCardiomyocytesRightCompact | 0.001944 | VentricularCardiomyocytes | 11 | VentricularCardiomyocytes | 0.246197 | Cardiomyocytes | 11 | Cardiomyocytes | 0.999964 |
| P33_10k_GAGAAATGTTCTCCCA-1 | VentricularCardiomyocytesRightCompact | 11 | VentricularCardiomyocytesRightCompact | 0.004305 | CardiacConductionSystem | 11 | VentricularCardiomyocytes | 0.053509 | Cardiomyocytes | 11 | Cardiomyocytes | 0.999967 |
| P33_10k_CAGCAGCCACCCTGAG-1 | VentricularCardiomyocytesRightCompact | 11 | VentricularCardiomyocytesRightCompact | 0.001500 | VentricularCardiomyocytes | 11 | VentricularCardiomyocytes | 0.052847 | Cardiomyocytes | 11 | Cardiomyocytes | 0.999892 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| P28_1_CCACTTGAGCGTCTCG-1 | MyocardialInterstitialFibroblasts1 | 3 | MyocardialInterstitialFibroblasts1 | 0.001371 | Fibroblasts | 3 | Fibroblasts | 0.682194 | Mesenchymal | 3 | Mesenchymal | 0.824592 |
| P28_2_TGCGACGTCCTAGCGG-1 | MyocardialInterstitialFibroblasts1 | 22 | MyocardialInterstitialFibroblasts1 | 0.000941 | Fibroblasts | 22 | Fibroblasts | 0.683048 | Mesenchymal | 22 | Mesenchymal | 0.932603 |
| P28_2_CGGAGAAAGCTGGCCT-1 | MyocardialInterstitialFibroblasts1 | 22 | MyocardialInterstitialFibroblasts1 | 0.000496 | Fibroblasts | 22 | Fibroblasts | 0.087808 | Mesenchymal | 22 | Mesenchymal | 0.744801 |
| P33_10k_AAACGCTAGTATAGGT-1 | MyocardialInterstitialFibroblasts1 | 5 | MyocardialInterstitialFibroblasts1 | 0.007601 | Fibroblasts | 5 | Fibroblasts | 0.076164 | Mesenchymal | 5 | Mesenchymal | 0.799071 |


By default (`majority_voting = False`), CellTypist will infer the identity of each query cell independently. This leads to raw predicted cell type labels, and usually finishes within seconds. You can also turn on the majority-voting classifier (`majority_voting = True`), which refines cell identities within local subclusters after an over-clustering approach at the cost of increased runtime.
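The idea behind majority voting can be sketched in a few lines of pandas: within each over-clustering subcluster, every cell is reassigned the label that is most common among the per-cell predictions in that subcluster. This is a simplified illustration with made-up toy data. CellTypist's own implementation has additional options that this sketch omits, and `majority_vote` is a hypothetical helper name.

```python
import pandas as pd

def majority_vote(predicted_labels, over_clustering):
    """Assign each cell the most frequent predicted label in its subcluster."""
    df = pd.DataFrame({"label": list(predicted_labels),
                       "cluster": list(over_clustering)})
    # Most frequent label per subcluster
    winners = df.groupby("cluster")["label"].agg(lambda s: s.value_counts().idxmax())
    # Broadcast the winning label back to every cell in the subcluster
    return df["cluster"].map(winners)

# Toy example: one cell in subcluster 11 has a dissenting per-cell prediction
toy_labels = ["VentricularCardiomyocytes", "CardiacConductionSystem",
              "VentricularCardiomyocytes", "Fibroblasts", "Fibroblasts"]
toy_clusters = [11, 11, 11, 3, 3]
voted = majority_vote(toy_labels, toy_clusters)
```

In the toy data the `CardiacConductionSystem` cell is relabelled `VentricularCardiomyocytes`, mirroring the fourth row of the table above, where `mid_CellTypist_predicted_labels` and `mid_CellTypist_majority_voting` differ.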

The results include the predicted cell type labels (`predicted_labels`), the over-clustering result (`over_clustering`), the predicted labels after majority voting in local subclusters (`majority_voting`), and a confidence score (`conf_score`). In `predicted_labels`, each query cell gets its inferred label by choosing the most probable cell type among all cell types in the given model.
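Choosing the most probable cell type amounts to a row-wise argmax over a cells-by-cell-types probability matrix. The sketch below shows this on a made-up matrix. The label set and probabilities are hypothetical, and treating the confidence score as the probability of the winning label is an assumption here rather than something this notebook confirms.

```python
import numpy as np

# Hypothetical label set and per-cell probabilities (rows: cells, columns: cell types)
cell_types = np.array(["Cardiomyocytes", "Mesenchymal", "Endothelial"])
prob = np.array([[0.95, 0.03, 0.02],
                 [0.10, 0.82, 0.08]])

# Index of the most probable cell type for each cell
best = prob.argmax(axis=1)
predicted = cell_types[best]

# Assumed confidence: the probability of the chosen label
conf = prob[np.arange(len(prob)), best]
```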

In [9]:
#save
file_path=os.path.join(results_dir,dataset+'_predictions_df.csv')
predictions_df.to_csv(file_path)
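When the saved file is read back later, the barcodes sit in the first, unnamed column. Passing `index_col=0` to `pd.read_csv` restores them as the index instead of producing an `Unnamed: 0` column. A minimal round-trip sketch on a toy frame written to a temporary directory (the file name and values are illustrative only):

```python
import os
import tempfile

import pandas as pd

# Toy predictions frame indexed by barcode
toy = pd.DataFrame(
    {"coarse_CellTypist_majority_voting": ["Cardiomyocytes", "Mesenchymal"]},
    index=["P26_1_AGTGCCGAGATAACGT-1", "P28_1_CCACTTGAGCGTCTCG-1"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_predictions_df.csv")
    toy.to_csv(toy_path)
    # index_col=0 restores the barcodes as the index
    reloaded = pd.read_csv(toy_path, index_col=0)
```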