# Annotate T21 data using CellTypist
In this notebook we will:
- read in pre-made CellTypist models trained on the atlas fine/mid/coarse labels
- read in the T21 data
- run each model on the T21 data to get predicted labels
- save the resulting barcode + label table as a .csv for later use

## Import CellTypist and dependencies

In [1]:
import scanpy as sc
import celltypist
import time
import numpy as np
import os
import pickle
import pandas as pd


## Paths and variables

In [2]:
#set relevant label key
compartment='AllCompartments'
cell_type_key = 'MidGrainModified'
batch_key='batch_key'
job_name=f'AtlasT21_WithContCovariates_{cell_type_key}_{compartment}'
job_name

'AtlasT21_WithContCovariates_MidGrainModified_AllCompartments'

In [3]:
#paths
base_path='/lustre/scratch126/cellgen/team205/jc48/jupyter/scArches'
data_dir=os.path.join(base_path,'data')
results_dir=os.path.join(base_path,f'results/{job_name}')
adata_path=os.path.join(results_dir,'anndata/')
models_path=os.path.join(results_dir,'models/')

In [4]:
dataset='T21'

## Read in adata

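Data format:
- all genes
- raw counts in adata.X when read in; these are normalised to 10,000 counts per cell and log1p-transformed below, which is the input CellTypist expects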

In [5]:
file_path=os.path.join(data_dir,'T21HeartsExtracardiacRemovedRaw.h5ad')
adata_que=sc.read(file_path)

print(adata_que.X.data[:10])

[ 3.  1.  1.  2. 24.  7.  4.  1.  1. 14.]


In [6]:
print(adata_que.shape)
adata = adata_que.copy()
print(adata.shape)

(76358, 36601)
(76358, 36601)


In [7]:
adata.layers["counts"]=adata.X.copy()

sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
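
A quick sanity check can confirm that the normalisation above produced the format CellTypist expects (log1p of expression scaled to 10,000 counts per cell). The following is a minimal sketch on a small dense toy matrix using NumPy only; `is_log1p_normalised`, `toy_counts` and the tolerance are hypothetical names and values chosen for illustration, not part of scanpy or CellTypist.

```python
import numpy as np

def is_log1p_normalised(x, target_sum=1e4, rtol=1e-3):
    """Return True if every row of expm1(x) sums to roughly target_sum."""
    row_sums = np.expm1(np.asarray(x, dtype=float)).sum(axis=1)
    return bool(np.allclose(row_sums, target_sum, rtol=rtol))

# Toy raw counts: 3 cells x 4 genes (made-up values for illustration)
toy_counts = np.array([[3, 1, 0, 2],
                       [24, 7, 4, 1],
                       [1, 14, 0, 5]], dtype=float)

# The same two steps as normalize_total(target_sum=1e4) followed by log1p
toy_norm = toy_counts / toy_counts.sum(axis=1, keepdims=True) * 1e4
toy_log = np.log1p(toy_norm)

raw_ok = is_log1p_normalised(toy_counts)  # raw counts should fail the check
log_ok = is_log1p_normalised(toy_log)     # normalised data should pass
```

On the real object, `adata.X` is likely a sparse matrix, so the helper would need a densified subset of cells (for example a few rows converted with `.toarray()`) rather than the full matrix.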

## Apply CellTypist models

In [None]:
saved_models = os.listdir(os.path.join(data_dir, 'CellTypistModels'))

annotation_levels = ['fine', 'mid', 'coarse']

list_of_prediction_dfs=[]

for annotation_level in annotation_levels:

    # Read in models (pre-generated)
    saved_model = [m for m in saved_models if annotation_level in m][0]
    model_file_path = os.path.join(data_dir, 'CellTypistModels', saved_model)

    # CellTypist prediction
    t_start = time.time()
    predictions = celltypist.annotate(adata, model = model_file_path, majority_voting = True)
    t_end = time.time()

    # Collate predictions (last four .obs columns are the ones added by CellTypist)
    list_of_prediction_dfs.append(predictions.to_adata(prefix=annotation_level+'_CellTypist_').obs.iloc[:,-4:])
    print(f"{annotation_level} completed in : {t_end - t_start} seconds")

predictions_df=pd.concat(list_of_prediction_dfs, axis=1)
predictions_df

🔬 Input data has 76358 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 30003 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 20


By default (`majority_voting = False`), CellTypist infers the identity of each query cell independently. This yields raw predicted cell type labels and usually finishes within seconds. Turning on the majority-voting classifier (`majority_voting = True`), as done above, refines cell identities within local subclusters after an over-clustering step, at the cost of increased runtime.
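
To illustrate the idea behind majority voting, here is a simplified sketch in plain Python: each cell's label is replaced by the most common label in its subcluster. This is an illustration only and not CellTypist's implementation, which performs its own over-clustering and has its own handling of mixed clusters. The function name, labels and cluster IDs are all hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(predicted_labels, clusters):
    """Replace each cell's label with the most common label in its cluster."""
    labels_by_cluster = defaultdict(list)
    for label, cluster in zip(predicted_labels, clusters):
        labels_by_cluster[cluster].append(label)
    # Pick the dominant label per cluster
    winner = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }
    return [winner[cluster] for cluster in clusters]

# Six toy cells in two subclusters (made-up labels)
toy_labels = ["TypeA", "TypeA", "TypeB", "TypeB", "TypeB", "TypeA"]
toy_clusters = [0, 0, 0, 1, 1, 1]

voted = majority_vote(toy_labels, toy_clusters)
```

In this toy case the single `TypeB` cell in cluster 0 and the single `TypeA` cell in cluster 1 are overruled by their neighbours, which is the smoothing effect majority voting is meant to provide.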

The results include the predicted cell type labels (`predicted_labels`), the over-clustering result (`over_clustering`), and the predicted labels after majority voting in local subclusters (`majority_voting`). In `predicted_labels`, each query cell gets its inferred label by choosing the most probable cell type among all cell types in the given model.
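
Choosing the most probable cell type per cell amounts to a row-wise argmax over a cells x cell-types score matrix. A minimal NumPy sketch with made-up scores and hypothetical cell type names:

```python
import numpy as np

# Hypothetical cell types in a model, and made-up per-cell scores
cell_types = np.array(["TypeA", "TypeB", "TypeC"])
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.4, 0.5, 0.1]])

# For each cell (row), pick the column with the highest score
best = toy_probs.argmax(axis=1)
toy_predicted = cell_types[best]

# The winning score itself can serve as a per-cell confidence value
toy_conf = toy_probs.max(axis=1)
```

Cells whose top score is only marginally higher than the runner-up (the third row here) are the ones where majority voting is most likely to change the label.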

In [None]:
#save
file_path=os.path.join(results_dir,dataset+'_predictions_df.csv')
predictions_df.to_csv(file_path)
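
For later use, the saved table can be read back with the barcodes restored as the index. A minimal sketch using a toy table and a temporary directory; the barcodes and labels are made up, and the column names assume the `<level>_CellTypist_` prefix used above.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for predictions_df: barcodes as index, prefixed label columns
toy_df = pd.DataFrame(
    {
        "mid_CellTypist_predicted_labels": ["TypeA", "TypeB"],
        "mid_CellTypist_majority_voting": ["TypeA", "TypeA"],
    },
    index=["barcode_1", "barcode_2"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_predictions_df.csv")
    toy_df.to_csv(toy_path)
    # index_col=0 turns the first CSV column back into the barcode index
    reloaded = pd.read_csv(toy_path, index_col=0)
```

Without `index_col=0`, the barcodes would come back as an ordinary unnamed column, which makes joining the labels onto `adata.obs` by barcode more awkward.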