In [1]:
import sklearn
import pickle as pkl

# Geneformer Cell Type Classification Benchmark
Here we benchmark four models, two of which are baselines. Each model is tasked with cell type classification on the Crohn's disease small intestine dataset from Elmentaite et al. (2020), Developmental Cell. The dataset contains approximately 22,500 single cells from healthy children aged 4-13 and from children with Crohn's disease. It is annotated with 31 unique cell types, and we assume these annotations are accurate. The dataset was held out of our pre-training data because all diseased samples were removed from the pre-training data.

* Baseline (1), scRNA workflow: PCA with 10 components followed by a random forest classifier, applied to normalized and log-transformed expression counts.
* Baseline (2), geneformer-qa: a model trained for only about 100 steps, so its weights are approximately random. We expect this model to perform no differently than working on counts directly.
* geneformer-10M and geneformer-106M, as described in the model cards.
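As a rough illustration of the first baseline, the sketch below chains normalization, a log transform, a 10-component PCA, and a random forest using scikit-learn. It is only a sketch under assumptions: the per-cell target sum of 10,000, the default forest settings, the 5-fold cross-validation, and the synthetic Poisson counts are all placeholders chosen for the example and are not taken from the benchmark itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def normalize_and_log(counts, target_sum=1e4):
    """Scale each cell to the same total count, then apply log1p.

    The target sum of 1e4 is an assumption, not a value from the benchmark.
    """
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def make_baseline(n_components=10, random_state=0):
    """Normalize + log transform, PCA, then a random forest classifier."""
    return make_pipeline(
        FunctionTransformer(normalize_and_log),
        PCA(n_components=n_components, random_state=random_state),
        RandomForestClassifier(random_state=random_state),
    )


# Synthetic stand-in data: each "cell type" has its own expression rates.
rng = np.random.default_rng(0)
n_cells, n_genes, n_types = 300, 50, 3
labels = rng.integers(0, n_types, size=n_cells)
rates = rng.gamma(shape=2.0, scale=2.0, size=(n_types, n_genes))
counts = rng.poisson(rates[labels])

# 5-fold cross-validated accuracy of the baseline on the synthetic data.
scores = cross_val_score(make_baseline(), counts, labels, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```

On real data, `counts` would be the cell-by-gene expression matrix and `labels` the `cell_type` annotations.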

First, we download the dataset of interest from the CZI CELLxGENE Census, and then create the requisite sc_memmap dataset object.

In [6]:
#NBVAL_CHECK_OUTPUT
import cellxgene_census
CENSUS_VERSION = "2023-12-15"
with cellxgene_census.open_soma(census_version=CENSUS_VERSION) as census:
    # Fetch only the cells that belong to the benchmark dataset.
    adata = cellxgene_census.get_anndata(
        census,
        "Homo sapiens",
        obs_value_filter='dataset_id=="8e47ed12-c658-4252-b126-381df8d52a3d"',
    )
adata.obs['cell_type'].unique()

Updating data from 'https://datasets.cellxgene.cziscience.com/8e47ed12-c658-4252-b126-381df8d52a3d.h5ad' to file '/workspaces/bionemo-framework/docs/docs/user-guide/examples/bionemo-geneformer/celltype-bench-dataset/hs-celltype-bench.h5ad'.


HTTPError: 403 Client Error: Forbidden for url: https://datasets.cellxgene.cziscience.com/8e47ed12-c658-4252-b126-381df8d52a3d.h5ad

In [2]:
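from pathlib import Path
data_dir = Path("celltype-bench-dataset")
# Create the output directory if it does not exist yet.
data_dir.mkdir(parents=True, exist_ok=True)
h5ad_outfile = data_dir / "hs-celltype-bench.h5ad"
adata.write_h5ad(h5ad_outfile)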

In [4]:
#NBVAL_CHECK_OUTPUT
adata.shape

(22502, 60664)

## Create the sc_memmap object and check the outputs

In [3]:
!sc_memmap --data-path ./celltype-bench-dataset/ --save-path ./celltype-bench-dataset/ --obs-cols cell_type --strict-metadata

Found 1 files
Starting to create memmap files...
Creating metadata...: 100%|███████████████████████| 1/1 [00:00<00:00,  2.49it/s]
Done creating `metadata.json`
Writing data into memmaps to celltype-bench-dataset...
Merging AnnData into numpy memaps...: 100%|███████| 1/1 [00:00<00:00,  1.25it/s]
Saving dataframe ...
Done creating dataset ...


In [5]:
#NBVAL_CHECK_OUTPUT
!ls ./celltype-bench-dataset/

features.csv		  gene_expression_ind.npy  hs-celltype-bench.h5ad
gene_expression_data.npy  gene_expression_ptr.npy  metadata.json
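The directory listing above can also be checked programmatically. The following is a minimal sketch using only the standard library; the helper name `missing_outputs` is hypothetical, and the expected file names are copied from the listing rather than from any sc_memmap specification.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = (
    "features.csv",
    "gene_expression_data.npy",
    "gene_expression_ind.npy",
    "gene_expression_ptr.npy",
    "metadata.json",
)


def missing_outputs(directory):
    """Return the expected file names that are absent from `directory`."""
    directory = Path(directory)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]


# Report on the dataset directory used in this notebook.
missing = missing_outputs("celltype-bench-dataset")
print("all outputs present" if not missing else f"missing: {missing}")
```

An empty result means every file from the listing is present; anything else names the files that were not written.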
