# Zero-shot cell type annotation

Assigning cell type annotations is an important and time-consuming part of single-cell analysis. This tutorial uses `biomed-multi-omic` for cell type annotation. BMFM-RNA simplifies the process by performing not only the cell type annotation but also the preprocessing and, through the embeddings created by the model, the visualisation.

In this tutorial we inspect the results of the zero-shot prediction created in tutorial 1. We do this by loading the results and then using the helper functions packaged in the `evaluation` module to extract and interpret the model's output.

In [5]:
from pathlib import Path

import scanpy as sc


## Load Example Data

To demonstrate BMFM-RNA's abilities, we use the PBMC data created by 10X Genomics (the dataset can be downloaded [here](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)). This dataset consists of 3k PBMCs from a healthy donor. The raw data will be used as the input, but we will also extract the cell type annotations from the legacy scanpy workflow to compare BMFM with a classical scRNA-seq analysis.

For more information about how the data was preprocessed, please see scanpy's tutorial [here](https://scanpy.readthedocs.io/en/1.11.x/tutorials/basics/clustering-2017.html).
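The comparison between the reference Louvain labels and the model's predictions can be summarised in many ways. The sketch below shows one simple, dependency-free option: a contingency count plus a cluster "purity" score. It is not the tutorial's own evaluation method; the helper names `contingency` and `purity` are hypothetical, and the predicted labels are toy values invented for illustration.

```python
from collections import Counter


def contingency(reference, predicted):
    """Count how often each (reference label, predicted label) pair occurs."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    return Counter(zip(reference, predicted))


def purity(reference, predicted):
    """Fraction of cells that carry the most common predicted label
    within their reference cluster (1.0 means every cluster is 'pure')."""
    if not reference:
        raise ValueError("no cells to compare")
    best_per_cluster = {}
    for (ref_label, _pred_label), count in contingency(reference, predicted).items():
        best_per_cluster[ref_label] = max(best_per_cluster.get(ref_label, 0), count)
    return sum(best_per_cluster.values()) / len(reference)


# Toy example: the predicted labels are made up, not model output.
reference = ["B cells", "B cells", "CD4 T cells", "CD4 T cells", "CD4 T cells", "NK cells"]
predicted = ["B cell", "B cell", "T cell", "T cell", "NK cell", "NK cell"]

for (ref_label, pred_label), count in sorted(contingency(reference, predicted).items()):
    print(f"{ref_label:12s} -> {pred_label:8s} {count}")
print(f"purity = {purity(reference, predicted):.3f}")
```

A purity score does not require the two label vocabularies to match, which is convenient when the model's cell type names differ from the cluster names in the reference.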

In [6]:
# Create the data directory
data_dir = Path("data")
data_dir.mkdir(parents=True, exist_ok=True)

# Get the raw PBMC3k dataset
adata = sc.datasets.pbmc3k()

# Extract reference data for later downstream comparison
reference_adata = sc.datasets.pbmc3k_processed()
reference_labels = reference_adata.obs[["louvain"]]
reference_obs_index = reference_adata.obs.index.tolist()
reference_vars_index = reference_adata.var.index.tolist()

# Restrict the raw data to the cells and genes kept in the processed reference
adata = adata[reference_obs_index, reference_vars_index].copy()
adata.write(data_dir / "pbmc3k_raw.h5ad")

In [7]:
# Make results directory
results_dir = Path("results/pbmc3k")
results_dir.mkdir(parents=True, exist_ok=True)

The model's weights can be acquired from [IBM's HuggingFace collection](https://huggingface.co/ibm-research). The following scRNA models are available:

- MLM+RDA: [ibm-research/biomed.rna.bert.110m.mlm.rda.v1](https://huggingface.co/ibm-research/biomed.rna.bert.110m.mlm.rda.v1)
- MLM+Multitask: [ibm-research/biomed.rna.bert.110m.mlm.multitask.v1](https://huggingface.co/ibm-research/biomed.rna.bert.110m.mlm.multitask.v1)
- WCED+Multitask: [ibm-research/biomed.rna.bert.110m.wced.multitask.v1](https://huggingface.co/ibm-research/biomed.rna.bert.110m.wced.multitask.v1)
- WCED 10 pct: [ibm-research/biomed.rna.bert.110m.wced.v1](https://huggingface.co/ibm-research/biomed.rna.bert.110m.wced.v1)

When using `bmfm-targets-run`, you only need to provide the name of the model under the `checkpoint` flag, e.g. `checkpoint=ibm-research/biomed.rna.bert.110m.wced.multitask.v1`. Checkpoints are downloaded automatically from HuggingFace.
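To avoid retyping repository names, the models listed above can be kept in a small lookup table. The following is a minimal sketch: the dictionary and the `checkpoint_override` helper are hypothetical conveniences, not part of the package, and only the repository IDs and the `checkpoint=` form come from this tutorial.

```python
# Short names as listed above, mapped to their HuggingFace repository IDs.
BMFM_RNA_CHECKPOINTS = {
    "MLM+RDA": "ibm-research/biomed.rna.bert.110m.mlm.rda.v1",
    "MLM+Multitask": "ibm-research/biomed.rna.bert.110m.mlm.multitask.v1",
    "WCED+Multitask": "ibm-research/biomed.rna.bert.110m.wced.multitask.v1",
    "WCED 10 pct": "ibm-research/biomed.rna.bert.110m.wced.v1",
}


def checkpoint_override(model_name: str) -> str:
    """Return the `checkpoint=...` argument for one of the listed models.

    Hypothetical helper for illustration; it only formats a string.
    """
    try:
        repo_id = BMFM_RNA_CHECKPOINTS[model_name]
    except KeyError:
        known = ", ".join(sorted(BMFM_RNA_CHECKPOINTS))
        raise ValueError(f"Unknown model {model_name!r}; choose one of: {known}") from None
    return f"checkpoint={repo_id}"


print(checkpoint_override("WCED+Multitask"))
```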

To get embeddings for an h5ad file from the checkpoints discussed in the manuscript (https://arxiv.org/abs/2506.14861), run the following command after installing the package.

The only things you need to run inference are an `h5ad` file with raw gene counts and a writable directory, `working_dir`, for output. For convenience, this tutorial uses the pbmc3k dataset created in the code cells above, but you can also provide your own `h5ad` file (note that for WCED the data in `.X` is expected to be raw counts).
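If you prefer to launch the run from Python rather than from a shell cell, the invocation can be assembled as an argument list. This is a sketch under stated assumptions: `build_predict_command` and `run_predict` are hypothetical helpers, and the only parts taken from this tutorial are the `bmfm-targets-run -cn predict` entry point and the `input_file`, `working_dir` and `checkpoint` arguments. Setting `TOKENIZERS_PARALLELISM` follows the suggestion that the tokenizers library prints in its own fork warning; whether you need it depends on your setup.

```python
import os
import shlex
import subprocess
from pathlib import Path


def build_predict_command(input_file, working_dir, checkpoint):
    """Assemble the argument list for a prediction run (hypothetical helper)."""
    return [
        "bmfm-targets-run",
        "-cn", "predict",
        f"input_file={Path(input_file).as_posix()}",
        f"working_dir={Path(working_dir).as_posix()}",
        f"checkpoint={checkpoint}",
    ]


def run_predict(input_file, working_dir, checkpoint):
    """Run the command and raise CalledProcessError if it exits with an error."""
    if not Path(input_file).is_file():
        raise FileNotFoundError(f"input file not found: {input_file}")
    # The output directory must be writable, so create it up front.
    Path(working_dir).mkdir(parents=True, exist_ok=True)
    # Explicitly disable tokenizer parallelism, as the tokenizers warning suggests.
    env = {**os.environ, "TOKENIZERS_PARALLELISM": "false"}
    cmd = build_predict_command(input_file, working_dir, checkpoint)
    return subprocess.run(cmd, check=True, env=env)


# Build (but do not run) the command for the PBMC3k file written earlier.
cmd = build_predict_command(
    "data/pbmc3k_raw.h5ad",
    "results/pbmc3k",
    "ibm-research/biomed.rna.bert.110m.wced.multitask.v1",
)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` turns a failed run into a Python exception instead of a silently ignored exit code.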

In [None]:
%%bash
bmfm-targets-run -cn predict input_file=data/subset_hvg.h5ad working_dir=results/hvg checkpoint=ibm-research/biomed.rna.bert.110m.wced.multitask.v1

Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 9647.30it/s]


[2025-07-18 14:24:46,014][bmfm_targets.models.model_utils][INFO] - Downloaded checkpoint from HuggingFace: ibm-research/biomed.rna.bert.110m.wced.multitask.v1 - Local path: /Users/mattmadgwick/.cache/huggingface/hub/models--ibm-research--biomed.rna.bert.110m.wced.multitask.v1/snapshots/01008c4765be7d165c3b227b6fc5111c6dd68f22
[2025-07-18 14:24:46,015][bmfm_targets.models.model_utils][INFO] - Downloaded HF checkpoint to: /Users/mattmadgwick/.cache/huggingface/hub/models--ibm-research--biomed.rna.bert.110m.wced.multitask.v1/snapshots/01008c4765be7d165c3b227b6fc5111c6dd68f22/last.ckpt


Fetching 16 files: 100%|██████████| 16/16 [00:00<00:00, 27081.87it/s]


[2025-07-18 14:24:46,919][bmfm_targets.models.model_utils][INFO] - Downloaded tokenizer from HuggingFace: ibm-research/biomed.rna.bert.110m.wced.multitask.v1 - Local path: /Users/mattmadgwick/.cache/huggingface/hub/models--ibm-research--biomed.rna.bert.110m.wced.multitask.v1/snapshots/01008c4765be7d165c3b227b6fc5111c6dd68f22
[2025-07-18 14:24:47,025][bmfm_targets.datasets.base_rna_dataset][INFO] - Reduced dataset genes from 3000 to 1953 which overlap with the 19283 in `limit_genes`
[2025-07-18 14:24:47,141][bmfm_targets.datasets.base_rna_dataset][INFO] - Removed 0 cells which no longer have counts.


Seed set to 1234


[2025-07-18 14:24:47,214][bmfm_targets.tasks.task_utils][INFO] - seed: 1234


Using 16bit Automatic Mixed Precision (AMP)
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 22102.66it/s]


[2025-07-18 14:24:47,361][bmfm_targets.models.model_utils][INFO] - Downloaded checkpoint from HuggingFace: ibm-research/biomed.rna.bert.110m.wced.multitask.v1 - Local path: /Users/mattmadgwick/.cache/huggingface/hub/models--ibm-research--biomed.rna.bert.110m.wced.multitask.v1/snapshots/01008c4765be7d165c3b227b6fc5111c6dd68f22
[2025-07-18 14:24:47,362][bmfm_targets.models.model_utils][INFO] - Downloaded HF checkpoint to: /Users/mattmadgwick/.cache/huggingface/hub/models--ibm-research--biomed.rna.bert.110m.wced.multitask.v1/snapshots/01008c4765be7d165c3b227b6fc5111c6dd68f22/last.ckpt
[2025-07-18 14:24:47,418][bmfm_targets.tasks.task_utils][INFO] - Model config is none then loading model from checkpoint /Users/mattmadgwick/.cache/huggingface/hub/models--ibm-research--biomed.rna.bert.110m.wced.multitask.v1/snapshots/01008c4765be7d165c3b227b6fc5111c6dd68f22/last.ckpt
[2025-07-18 14:24:51,236][bmfm_targets.datasets.base_rna_dataset][INFO] - Reduced dataset genes from 3000 to 1953 which overl

/Users/mattmadgwick/miniforge3/envs/bmfm-tutorial/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'predict_dataloader' to speed up the dataloader worker initialization.


Predicting: |          | 0/? [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Predicting DataLoader 0:   0%|          | 0/3512 [00:00<?, ?it/s]



Predicting DataLoader 0:   2%|▏         | 79/3512 [05:04<3:40:23,  0.26it/s]


Detected KeyboardInterrupt, attempting graceful shutdown ...


TypeError: %d format: a real number is required, not NoneType

libc++abi: terminating due to uncaught exception of type std::__1::system_error: Broken pipe
bash: line 1: 10244 Abort trap: 6           bmfm-targets-run -cn predict input_file=data/subset_hvg.h5ad working_dir=results/hvg checkpoint=ibm-research/biomed.rna.bert.110m.wced.multitask.v1


In [8]:
import scanpy as sc
adata = sc.read_h5ad("data/hvgs.h5ad")

adata


AnnData object with n_obs × n_vars = 70299 × 3000
    obs: 'sample_id', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier_log1p_total_counts', 'outlier_log1p_n_genes_by_counts', 'outlier_pct_counts_in_top_20_genes', 'source_name_ch1', 'characteristics_ch1.0.tissue', 'characteristics_ch1.1.cell type', 'extract_protocol_ch1', 'data_processing', 'series_id'

In [None]:
# AnnData objects have no `.iloc`; slice them positionally with `adata[rows, cols]`
adata[:5, :]

In [12]:
# Drop the last 60 cells and save the subset that is passed to `bmfm-targets-run` above
adata[:-60, :].write_h5ad("data/subset_hvg.h5ad")

