# Tutorial: Creating an AnnData Object from the Tahoe-100M Dataset
This notebook is intended for users who are familiar with the AnnData format for single-cell data. We'll walk through how to parse records in the Hugging Face dataset format and convert them to an AnnData object.

## Install Required Libraries

In [1]:
!pip install datasets anndata scipy pandas pubchempy

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting anndata
  Downloading anndata-0.11.4-py3-none-any.whl.metadata (9.3 kB)
Collecting scipy
  Downloading scipy-1.15.3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (61 kB)
Collecting pandas
  Downloading pandas-2.2.3-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.whl.metadata (89 kB)
Collecting pubchempy
  Downloading PubChemPy-1.0.4.tar.gz (29 kB)
  Preparing metadata (setup.py) ... done
Collecting filelock (from datasets)
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting numpy>=1.17 (from datasets)
  Downloading numpy-2.2.6-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (63 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-20.0.0-cp312-cp312-manylinux_2_28_aarch64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata 

## Import Libraries

In [3]:
from datasets import load_dataset
from scipy.sparse import csr_matrix
import anndata
import pandas as pd
import pubchempy as pcp

## Mapping records to anndata

This function takes a generator that emits records from the Tahoe-100M Hugging Face dataset and returns an AnnData object. Use the `sample_size` argument to specify the number of records you need. You can also create a new generator with the `dataset.filter` function so that it only emits records matching a given condition (e.g., a specific drug, plate, or sample).
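
As a minimal sketch of the filtering idea, the helper below keeps only records whose fields equal the requested values. It works on any iterable of dict-like records, so it can be tried without downloading data. The name `filter_records` and the toy records are illustrative and not part of the dataset's API. With the streaming dataset itself, a call such as `tahoe_100m_ds.filter(lambda cell: cell["drug"] == "8-Hydroxyquinoline")` should produce an equivalent lazy generator.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_records(records: Iterable[Dict[str, Any]], **criteria: Any) -> Iterator[Dict[str, Any]]:
    """Lazily yield records whose fields equal every value in `criteria`."""
    for record in records:
        if all(record.get(field) == value for field, value in criteria.items()):
            yield record


# Toy records that mimic the layout of a Tahoe-100M row (values are made up).
toy_records = [
    {"drug": "8-Hydroxyquinoline", "plate": "plate4", "genes": [1, 5], "expressions": [-2.0, 3.0]},
    {"drug": "drug_B", "plate": "plate4", "genes": [1, 7], "expressions": [-2.0, 1.0]},
    {"drug": "8-Hydroxyquinoline", "plate": "plate9", "genes": [1, 9], "expressions": [-2.0, 4.0]},
]

matches = list(filter_records(toy_records, drug="8-Hydroxyquinoline", plate="plate4"))
print(len(matches))
```

Because the helper is a generator, its result can be passed straight to `create_anndata_from_generator` in place of the full dataset.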

If you'd like to create a DataLoader for an ML training application, it is likely best to use the data in its native format without going through AnnData.
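
One way to consume the native format is to pad each cell's variable-length `genes` and `expressions` lists to a common length per batch. The sketch below does this in plain Python. The function name `collate_cells` and the padding values (`pad_token_id=0`, `pad_value=0.0`) are assumptions chosen for illustration; check that the padding token does not collide with a real token ID before using it.

```python
from typing import Any, Dict, List, Sequence


def collate_cells(
    cells: Sequence[Dict[str, Any]],
    pad_token_id: int = 0,   # assumed padding token, not defined by the dataset
    pad_value: float = 0.0,  # assumed padding value for expressions
) -> Dict[str, List[list]]:
    """Pad per-cell `genes`/`expressions` lists to the longest cell in the batch."""
    max_len = max(len(cell["genes"]) for cell in cells)
    genes, expressions, mask = [], [], []
    for cell in cells:
        n_real = len(cell["genes"])
        n_pad = max_len - n_real
        genes.append(list(cell["genes"]) + [pad_token_id] * n_pad)
        expressions.append(list(cell["expressions"]) + [pad_value] * n_pad)
        # 1 marks a real entry, 0 marks padding.
        mask.append([1] * n_real + [0] * n_pad)
    return {"genes": genes, "expressions": expressions, "mask": mask}


batch = collate_cells([
    {"genes": [3, 8, 11], "expressions": [2.0, 1.0, 5.0]},
    {"genes": [4], "expressions": [7.0]},
])
print(batch["genes"])
```

A function of this shape could be passed as `collate_fn` to `torch.utils.data.DataLoader`, with the nested lists converted to tensors inside the function.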

In [4]:

def create_anndata_from_generator(generator, gene_vocab, sample_size=None):
    """Build an AnnData object from an iterable of Tahoe-100M records.

    gene_vocab maps integer token IDs to Ensembl IDs. sample_size caps the
    number of records read (None reads the whole generator).
    """
    # Sort by token ID so the column order of the matrix is deterministic.
    sorted_vocab_items = sorted(gene_vocab.items())
    token_ids, gene_names = zip(*sorted_vocab_items)
    token_id_to_col_idx = {token_id: idx for idx, token_id in enumerate(token_ids)}

    # Components of the CSR matrix, filled one row (cell) at a time.
    data, indices, indptr = [], [], [0]
    obs_data = []

    for i, cell in enumerate(generator):
        if sample_size is not None and i >= sample_size:
            break
        genes = cell['genes']
        expressions = cell['expressions']
        # Drop the leading entry when its expression value is negative.
        if len(expressions) > 0 and expressions[0] < 0:
            genes = genes[1:]
            expressions = expressions[1:]

        # Keep only genes present in the vocabulary. A single pass keeps
        # column indices and values aligned.
        for gene, expr in zip(genes, expressions):
            col_idx = token_id_to_col_idx.get(gene)
            if col_idx is not None:
                indices.append(col_idx)
                data.append(expr)
        indptr.append(len(data))

        # Every field other than the expression payload becomes cell metadata.
        obs_entry = {k: v for k, v in cell.items() if k not in ['genes', 'expressions']}
        obs_data.append(obs_entry)

    expr_matrix = csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, len(gene_names)))
    obs_df = pd.DataFrame(obs_data)

    adata = anndata.AnnData(X=expr_matrix, obs=obs_df)
    adata.var.index = pd.Index(gene_names, name='ensembl_id')

    return adata

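
To make the sparse-matrix bookkeeping in the function above concrete, here is a small pure-Python sketch of how the three CSR arrays (`data`, `indices`, `indptr`) grow as cells are read. The vocabulary, token IDs, and expression values are made up for illustration, and `dense_row` is a hypothetical helper used only to inspect the result.

```python
# Toy vocabulary: token ID -> gene ID (all values are made up).
toy_vocab = {3: "ENSG_A", 5: "ENSG_B", 9: "ENSG_C"}
token_id_to_col = {token_id: col for col, token_id in enumerate(sorted(toy_vocab))}

# Two toy cells in the dataset's record layout. The first entry of each cell
# has a negative expression value, so it is skipped as in the function above.
toy_cells = [
    {"genes": [1, 9, 3], "expressions": [-2.0, 4.0, 1.0]},
    {"genes": [1, 5], "expressions": [-2.0, 7.0]},
]

data, indices, indptr = [], [], [0]
for cell in toy_cells:
    genes, expressions = cell["genes"], cell["expressions"]
    if expressions and expressions[0] < 0:
        genes, expressions = genes[1:], expressions[1:]
    for gene, expr in zip(genes, expressions):
        if gene in token_id_to_col:
            indices.append(token_id_to_col[gene])
            data.append(expr)
    # Row i occupies data[indptr[i]:indptr[i + 1]].
    indptr.append(len(data))


def dense_row(i, n_cols):
    """Expand row i of the CSR arrays into a dense list."""
    row = [0.0] * n_cols
    for pos in range(indptr[i], indptr[i + 1]):
        row[indices[pos]] = data[pos]
    return row


dense = [dense_row(i, len(toy_vocab)) for i in range(len(toy_cells))]
print(data, indices, indptr)
print(dense)
```

`scipy.sparse.csr_matrix` accepts the same `(data, indices, indptr)` triple, which is why the function can append to plain lists and build the matrix once at the end.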

## Load Tahoe-100M Dataset

In [6]:

tahoe_100m_ds = load_dataset('vevotx/Tahoe-100M', streaming=True, split='train')


## Load Gene Metadata

The gene metadata contains the mapping between the integer token IDs used in the dataset and standard gene identifiers (Ensembl IDs and HGNC gene symbols).

In [12]:

gene_metadata = load_dataset("vevotx/Tahoe-100M", name="gene_metadata", split="train")
gene_vocab = {entry["token_id"]: entry["ensembl_id"] for entry in gene_metadata}


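The sketch below shows, on two toy rows, how the token-to-Ensembl mapping determines the column order of the matrix and how a second lookup could translate Ensembl IDs to gene symbols. The token IDs are made up, and `gene_symbol` is an assumed column name; inspect `gene_metadata.column_names` for the actual field.

```python
# Toy rows in the shape of the gene metadata table. `token_id` and `ensembl_id`
# are the fields used in this notebook; `gene_symbol` is an assumed column name.
toy_gene_metadata = [
    {"token_id": 12, "ensembl_id": "ENSG00000141510", "gene_symbol": "TP53"},
    {"token_id": 7, "ensembl_id": "ENSG00000133703", "gene_symbol": "KRAS"},
]

token_to_ensembl = {row["token_id"]: row["ensembl_id"] for row in toy_gene_metadata}
ensembl_to_symbol = {row["ensembl_id"]: row["gene_symbol"] for row in toy_gene_metadata}

# Sorting by token ID gives the same deterministic column order that
# create_anndata_from_generator uses for adata.var.
column_order = [token_to_ensembl[token_id] for token_id in sorted(token_to_ensembl)]
symbols = [ensembl_to_symbol[ensembl_id] for ensembl_id in column_order]
print(column_order, symbols)
```

A list built this way lines up with `adata.var`, so it could be stored as an extra column there if gene symbols are needed downstream.
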
## Create AnnData Object

In [32]:

adata = create_anndata_from_generator(tahoe_100m_ds, gene_vocab, sample_size=1000000)
adata



AnnData object with n_obs × n_vars = 1000000 × 62710
    obs: 'drug', 'sample', 'BARCODE_SUB_LIB_ID', 'cell_line_id', 'moa-fine', 'canonical_smiles', 'pubchem_cid', 'plate'

## Inspect Metadata (`adata.obs`)

In [37]:
adata.obs.head()

| | drug | sample | BARCODE_SUB_LIB_ID | cell_line_id | moa-fine | canonical_smiles | pubchem_cid | plate | mean_gene_count | mean_tscp_count | mean_mread_count | mean_pcnt_mito | drugname_drugconc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8-Hydroxyquinoline | smp_1783 | 01_001_052-lib_1105 | CVCL_0480 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |
| 1 | 8-Hydroxyquinoline | smp_1783 | 01_001_105-lib_1105 | CVCL_0546 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |
| 2 | 8-Hydroxyquinoline | smp_1783 | 01_001_165-lib_1105 | CVCL_1717 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |
| 3 | 8-Hydroxyquinoline | smp_1783 | 01_003_094-lib_1105 | CVCL_1717 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |
| 4 | 8-Hydroxyquinoline | smp_1783 | 01_003_164-lib_1105 | CVCL_1056 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |


## Enrich with Sample Metadata

Although the main data contains several metadata fields, some additional columns (such as drug concentration) are omitted to reduce the size of the data. If needed, they can be fetched from the `sample_metadata` configuration and merged on the `sample` column.

In [38]:

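sample_metadata = load_dataset("vevotx/Tahoe-100M", "sample_metadata", split="train").to_pandas()
# A left join keeps every cell in adata.obs, in its original order.
merged_obs = pd.merge(adata.obs, sample_metadata.drop(columns=["drug", "plate"]), on="sample", how="left")
merged_obs.index = adata.obs_names  # pd.merge resets the index
adata.obs = merged_obs
adata.obs.head()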

| | drug | sample | BARCODE_SUB_LIB_ID | cell_line_id | moa-fine | canonical_smiles | pubchem_cid | plate | mean_gene_count_x | mean_tscp_count_x | mean_mread_count_x | mean_pcnt_mito_x | drugname_drugconc_x | mean_gene_count_y | mean_tscp_count_y | mean_mread_count_y | mean_pcnt_mito_y | drugname_drugconc_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8-Hydroxyquinoline | smp_1783 | 01_001_052-lib_1105 | CVCL_0480 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |
| 1 | 8-Hydroxyquinoline | smp_1783 | 01_001_105-lib_1105 | CVCL_0546 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |
| 2 | 8-Hydroxyquinoline | smp_1783 | 01_001_165-lib_1105 | CVCL_1717 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |
| 3 | 8-Hydroxyquinoline | smp_1783 | 01_003_094-lib_1105 | CVCL_1717 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |
| 4 | 8-Hydroxyquinoline | smp_1783 | 01_003_164-lib_1105 | CVCL_1056 | unclear | `C1=CC2=C(C(=C1)O)N=CC=C2` | 1923.0 | plate4 | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` | 1478.268171 | 2341.339094 | 2738.463797 | 0.023783 | `[('8-Hydroxyquinoline', 0.05, 'uM')]` |

The `_x`/`_y` suffixes show that `adata.obs` already contained these columns when the merge ran, i.e. the cell was executed more than once. On a freshly created `adata`, the merge adds each sample-level column once, without suffixes.

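
The `_x`/`_y` suffixes in the output above come from merging columns that were already present. One way to make such a cell safe to re-run is to join only the columns that `obs` does not have yet. The helper below is a sketch on toy data: `add_metadata` is a hypothetical name, and the sample IDs and values are made up.

```python
import pandas as pd


def add_metadata(obs: pd.DataFrame, metadata: pd.DataFrame, key: str) -> pd.DataFrame:
    """Left-join only the metadata columns that `obs` does not have yet."""
    new_cols = [col for col in metadata.columns if col not in obs.columns]
    if not new_cols:
        return obs  # nothing to add, e.g. the cell was already run
    # validate="many_to_one" raises if `key` is duplicated in `metadata`,
    # which would otherwise silently multiply rows.
    merged = obs.merge(metadata[[key] + new_cols], on=key, how="left", validate="many_to_one")
    merged.index = obs.index  # merge resets the index; restore the cell labels
    return merged


obs = pd.DataFrame({"sample": ["smp_1", "smp_2", "smp_1"]}, index=["cell_a", "cell_b", "cell_c"])
sample_meta = pd.DataFrame({"sample": ["smp_1", "smp_2"], "mean_gene_count": [1478.3, 1502.1]})

once = add_metadata(obs, sample_meta, key="sample")
twice = add_metadata(once, sample_meta, key="sample")  # second call is a no-op
print(list(twice.columns))
```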

## Add Drug Metadata

The drug metadata contains additional information for the compounds used in Tahoe-100M. See the dataset card and our [paper](https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1) for more information about how this information was generated.

In [39]:
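drug_metadata = load_dataset("vevotx/Tahoe-100M", "drug_metadata", split="train").to_pandas()

# A left join keeps cells whose drug has no entry in drug_metadata (their new columns are NaN).
merged_obs = pd.merge(adata.obs, drug_metadata.drop(columns=["canonical_smiles", "pubchem_cid", "moa-fine"]), on="drug", how="left")
merged_obs.index = adata.obs_names  # pd.merge resets the index
adata.obs = merged_obs
adata.obs.head()

With pandas' default inner join (no `how` argument), this cell fails: rows of `adata.obs` without a matching `drug` entry are dropped, so the merged table no longer has one row per cell.

ValueError: Length of passed value for obs_names is 977906, but this AnnData has shape: (1000000, 62710)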

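The length mismatch reported in the `ValueError` above can be reproduced on toy data. The sketch below compares the default inner join with a left join and lists the keys that have no match. The drug names and the `targets` column are made up for illustration.

```python
import pandas as pd

obs = pd.DataFrame({"drug": ["drug_A", "drug_B", "drug_C", "drug_A"]})
drug_meta = pd.DataFrame({"drug": ["drug_A", "drug_B"], "targets": ["T1", "T2"]})

inner = pd.merge(obs, drug_meta, on="drug")             # default how="inner"
left = pd.merge(obs, drug_meta, on="drug", how="left")  # keeps every row of obs

# Keys present in obs but missing from the metadata table.
unmatched = sorted(set(obs["drug"]) - set(drug_meta["drug"]))

print(len(obs), len(inner), len(left))
print(unmatched, int(left["targets"].isna().sum()))
```

Running the same set difference on `adata.obs["drug"]` and `drug_metadata["drug"]` is a quick way to see which drug names lack metadata.
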
## Drug Info from PubChem

We also provide the PubChem IDs for the compounds in Tahoe, which can be used to query additional information as needed.

In [40]:

drug_name = adata.obs["drug"].values[0]
cid = int(float(adata.obs["pubchem_cid"].values[0]))
compound = pcp.Compound.from_cid(cid)

print(f"Name: {drug_name}")
print(f"Synonyms: {compound.synonyms[:10]}")
print(f"Formula: {compound.molecular_formula}")
print(f"SMILES: {compound.isomeric_smiles}")
print(f"Mass: {compound.exact_mass}")

Name: 8-Hydroxyquinoline
Synonyms: ['8-HYDROXYQUINOLINE', 'quinolin-8-ol', '148-24-3', 'Oxyquinoline', 'Oxine', 'Quinophenol', 'Oxychinolin', '8-Quinol', '8-Oxyquinoline', 'Phenopyridine']
Formula: C9H7NO
SMILES: C1=CC2=C(C(=C1)O)N=CC=C2
Mass: 145.052763847
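
The `pubchem_cid` column holds floats (1923.0 in the table above), which in pandas often indicates that some rows have no value. Assuming that may be the case here, a small guard like the one below avoids passing `NaN` to `int()`. The function name `parse_cid` is hypothetical, and the sketch does not call PubChem.

```python
import math
from typing import Any, Optional


def parse_cid(value: Any) -> Optional[int]:
    """Convert a CID stored as float or str (e.g. 1923.0) to int; None if missing."""
    if value is None:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # NaN and infinities cannot be converted to a meaningful integer ID.
    if not math.isfinite(number):
        return None
    return int(number)


parsed = [parse_cid(1923.0), parse_cid("1923.0"), parse_cid(float("nan")), parse_cid(None)]
print(parsed)
```

With such a guard, `pcp.Compound.from_cid` would only be called when `parse_cid` returns an integer.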


## Load Cell Line Metadata
The cell line metadata contains additional identifiers for the cell lines used in Tahoe (e.g., DepMap IDs), as well as a curated list of driver mutations for each cell line. This information can be used, for instance, to train genotype-aware models on the Tahoe data.

In [41]:

cell_line_metadata = load_dataset("vevotx/Tahoe-100M","cell_line_metadata", split="train").to_pandas()
cell_line_metadata.head()


| | cell_name | Cell_ID_DepMap | Cell_ID_Cellosaur | Organ | Driver_Gene_Symbol | Driver_VarZyg | Driver_VarType | Driver_ProtEffect_or_CdnaEffect | Driver_Mech_InferDM | Driver_GeneType_DM |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A549 | ACH-000681 | CVCL_0023 | Lung | CDKN2A | Hom | Deletion | DEL | LoF | Suppressor |
| 1 | A549 | ACH-000681 | CVCL_0023 | Lung | CDKN2B | Hom | Deletion | DEL | LoF | Suppressor |
| 2 | A549 | ACH-000681 | CVCL_0023 | Lung | KRAS | Hom | Missense | p.G12S | GoF | Oncogene |
| 3 | A549 | ACH-000681 | CVCL_0023 | Lung | SMARCA4 | Hom | Frameshift | p.Q729fs | LoF | Suppressor |
| 4 | A549 | ACH-000681 | CVCL_0023 | Lung | STK11 | Hom | Stopgain | `p.Q37*` | LoF | Suppressor |

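
The table has one row per cell line and driver gene, so a per-cell-line summary needs a grouping step. The sketch below groups the five rows shown above by their `Cell_ID_Cellosaur` value. It assumes that this column uses the same `CVCL_` identifiers as `cell_line_id` in `adata.obs`; verify that against the data before joining on it.

```python
from collections import defaultdict

# Rows copied from the cell_line_metadata.head() output above (subset of columns).
rows = [
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "CDKN2A", "Driver_Mech_InferDM": "LoF"},
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "CDKN2B", "Driver_Mech_InferDM": "LoF"},
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "KRAS", "Driver_Mech_InferDM": "GoF"},
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "SMARCA4", "Driver_Mech_InferDM": "LoF"},
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "STK11", "Driver_Mech_InferDM": "LoF"},
]

# Map each cell line ID to its list of (driver gene, inferred mechanism) pairs.
drivers_by_cell_line = defaultdict(list)
for row in rows:
    drivers_by_cell_line[row["Cell_ID_Cellosaur"]].append(
        (row["Driver_Gene_Symbol"], row["Driver_Mech_InferDM"])
    )

a549_genes = [gene for gene, _ in drivers_by_cell_line["CVCL_0023"]]
gain_of_function = [gene for gene, mech in drivers_by_cell_line["CVCL_0023"] if mech == "GoF"]
print(a549_genes, gain_of_function)
```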

## Save the AnnData Object

In [46]:

# Write the AnnData object to an .h5ad file
adata.write_h5ad("/home/ubuntu/anatoly-tahoe-100/datatahoe-100m.h5ad")
