# Zero-shot cell type annotation

Assigning cell type annotations is an important and time-consuming part of single-cell analysis. This tutorial uses `biomed-multi-omic` for cell type annotation. BMFM-RNA simplifies the process by handling not only the cell type annotation itself but also the preprocessing, and by supporting visualisation through the embeddings the model creates.

In this tutorial we inspect the results of the zero-shot prediction created in tutorial 1. We load the prediction results and then use the helper functions packaged in the `evaluation` module to extract and interpret the model's output.

In [1]:
from pathlib import Path

import scanpy as sc

## Load Example Data

To demonstrate BMFM-RNA's abilities, we use the PBMC data created by 10X Genomics (the dataset can be downloaded [here](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)). The dataset consists of 3k PBMCs from a healthy donor. The raw data is used as the input, and we also extract the cell type annotations from the legacy scanpy workflow so that BMFM can be compared with a classical scRNA-seq analysis.

For more information about how the data was preprocessed, see scanpy's tutorial [here](https://scanpy.readthedocs.io/en/1.11.x/tutorials/basics/clustering-2017.html).

In [None]:
# Create a directory for the input data
data_dir = Path("data")
data_dir.mkdir(parents=True, exist_ok=True)

# Download the raw PBMC3k dataset
adata = sc.datasets.pbmc3k()

# Extract reference data for later downstream comparison
reference_adata = sc.datasets.pbmc3k_processed()
reference_labels = reference_adata.obs[["louvain"]]
reference_obs_index = reference_adata.obs.index.tolist()
reference_vars_index = reference_adata.var.index.tolist()

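# Keep only the cells and genes present in the reference, then save the raw counts.
# .copy() turns the subset from a view into a standalone AnnData object before writing.
adata = adata[reference_obs_index, reference_vars_index].copy()
adata.write(data_dir / "pbmc3k_raw.h5ad")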
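The subsetting step above relies on every cell and gene name in the reference being present in the raw data. A minimal sketch of how that assumption could be checked before subsetting is shown below. The helper names `missing_names` and `duplicated_names` are hypothetical and not part of scanpy or `biomed-multi-omic`, and the toy lists stand in for the real index values.

```python
from collections import Counter


def missing_names(available, requested):
    """Return the requested names that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


def duplicated_names(names):
    """Return the names that occur more than once, sorted for stable output."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Toy stand-ins for the raw and reference cell names (hypothetical values).
raw_cells = ["cell_a", "cell_b", "cell_c", "cell_d"]
reference_cells = ["cell_b", "cell_d"]

# Both checks should come back empty when the reference is a clean subset.
assert missing_names(raw_cells, reference_cells) == []
assert duplicated_names(raw_cells) == []
```

With the real data, the same helpers could be called on `adata.obs_names` and `reference_obs_index` (and on `adata.var_names` and `reference_vars_index`) before `adata` is reassigned to the subset.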


In [4]:
# Optional: make results directory
results_dir = Path("results/pbmc3k")
results_dir.mkdir(parents=True, exist_ok=True)

To get embeddings for an h5ad file from the checkpoints discussed in the manuscript (https://arxiv.org/abs/2506.14861), install the package and then run the following code snippet.

All you need to run inference is an h5ad file with raw gene counts and a writable directory, `working_dir`, for the output. The command below passes the file written above (`data/pbmc3k_raw.h5ad`) and the results directory (`results/pbmc3k`) directly. Checkpoints are downloaded automatically from HuggingFace.
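If you would rather keep the paths in environment variables, the `bmfm-targets-run` call could be assembled from Python along the following lines. This is a sketch only: the variable names `BMFM_INPUT_FILE` and `BMFM_WORKING_DIR` and the helper `build_predict_command` are hypothetical, and the arguments are limited to the three used in this tutorial.

```python
import os
import shlex


def build_predict_command(input_file, working_dir, checkpoint):
    """Assemble the bmfm-targets-run argument list used in this tutorial."""
    return [
        "bmfm-targets-run",
        "-cn",
        "predict",
        f"input_file={input_file}",
        f"working_dir={working_dir}",
        f"checkpoint={checkpoint}",
    ]


# BMFM_INPUT_FILE and BMFM_WORKING_DIR are hypothetical variable names;
# the fallbacks are the paths used elsewhere in this tutorial.
command = build_predict_command(
    input_file=os.environ.get("BMFM_INPUT_FILE", "data/pbmc3k_raw.h5ad"),
    working_dir=os.environ.get("BMFM_WORKING_DIR", "results/pbmc3k"),
    checkpoint="ibm-research/biomed.rna.bert.110m.wced.multitask.v1",
)

# Print a shell-ready command line without executing anything.
print(shlex.join(command))
```

To execute the command from Python rather than print it, `subprocess.run(command, check=True)` is one option. Passing a list avoids shell quoting issues with paths that contain spaces.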

In [None]:
%%bash
bmfm-targets-run -cn predict \
    input_file=data/pbmc3k_raw.h5ad \
    working_dir=results/pbmc3k \
    checkpoint=ibm-research/biomed.rna.bert.110m.wced.multitask.v1

Fetching 17 files: 100%|█████████████████████| 17/17 [00:00<00:00, 11805.16it/s]
[2025-07-17 18:47:56,181][bmfm_targets.models.model_utils][INFO] - Downloaded checkpoint from HuggingFace: ibm-research/biomed.rna.bert.110m.wced.multitask.v1 - Local path: /Users/mattmadgwick/.cache/huggingface/hub/models--ibm-research--biomed.rna.bert.110m.wced.multitask.v1/snapshots/01008c4765be7d165c3b227b6fc5111c6dd68f22
[2025-07-17 18:47:56,182][bmfm_targets.models.model_utils][INFO] - Downloaded HF checkpoint to: /Users/mattmadgwick/.cache/huggingface/hub/models--ibm-research--biomed.rna.bert.110m.wced.multitask.v1/snapshots/01008c4765be7d165c3b227b6fc5111c6dd68f22/last.ckpt
Fetching 16 files: 100%|██████████████████████| 16/16 [00:00<00:00, 7870.16it/s]
[2025-07-17 18:47:57,834][bmfm_targets.models.model_utils][INFO] - Downloaded tokenizer from HuggingFace: ibm-research/biomed.rna.bert.110m.wced.multitask.v1 - Local path: /Users/mattmadgwick/.cache/huggingface/hub/models--ibm-research--biomed.rna.b