<a href="https://colab.research.google.com/github/abuchin/tahoe-100m/blob/main/loading_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Creating an AnnData Object from Tahoe-100M Dataset
This notebook is intended for users who are familiar with the AnnData format for single-cell data. We'll walk through how to parse records in the Hugging Face dataset format and convert them to AnnData.

**Mount Google Drive**


In [1]:

# mount google drive
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**Setup: clone the repository in Google Colab**

In [2]:

# setup the repo

# Replace with your details. Do not commit a real token; paste it only at run time.
GITHUB_TOKEN = '<YOUR_GITHUB_TOKEN>'
GITHUB_USER = 'abuchin'
GITHUB_REPO = 'tahoe-100m'

# Clone the repository
!git clone https://{GITHUB_TOKEN}@github.com/{GITHUB_USER}/{GITHUB_REPO}.git

# Navigate into the cloned repository
%cd {GITHUB_REPO}



Cloning into 'tahoe-100m'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 35 (delta 13), reused 26 (delta 8), pack-reused 0 (from 0)
Receiving objects: 100% (35/35), 33.73 KiB | 11.24 MiB/s, done.
Resolving deltas: 100% (13/13), done.
/content/tahoe-100m


**Import libraries**

In [5]:
!pip install datasets anndata scipy pandas pubchempy

Collecting anndata
  Downloading anndata-0.11.4-py3-none-any.whl.metadata (9.3 kB)
Collecting pubchempy
  Downloading PubChemPy-1.0.4.tar.gz (29 kB)
  Preparing metadata (setup.py) ... done
Collecting array-api-compat!=1.5,>1.4 (from anndata)
  Downloading array_api_compat-1.12.0-py3-none-any.whl.metadata (2.5 kB)
Downloading anndata-0.11.4-py3-none-any.whl (144 kB)
Downloading array_api_compat-1.12.0-py3-none-any.whl (58 kB)
Building wheels for collected packages: pubchempy
  Building wheel for pubchempy (setup.py) ... done
  Created wheel for pubchempy: filename=PubChemPy-1.0.4-py3-none-any.whl size=13818 sha256=7e87d588782ccde831359b9a88ad18a4d457d39b71d45d7ccb89ef3aae71a493
  Stored in directory: /root/.cache/pip

In [66]:
from datasets import load_dataset
from scipy.sparse import csr_matrix
import anndata
import pandas as pd
import pubchempy as pcp
import numpy as np

**Add huggingface access token**

In [2]:
from huggingface_hub import login

# Paste your token here. Do not commit a real token to a shared notebook.
login(token="<YOUR_HF_TOKEN>")
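
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the token has been exported as an environment variable beforehand; `read_token` is a hypothetical helper written for this example, not part of `huggingface_hub`:

```python
import os

def read_token(var_name="HF_TOKEN"):
    """Return the token stored in an environment variable, or raise if it is unset."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before running this cell.")
    return token

# Usage (commented out so the sketch runs without credentials):
# from huggingface_hub import login
# login(token=read_token())
```

The same helper could be used for the GitHub token with a different variable name.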


## Install Required Libraries

In [3]:

#!pip install --upgrade datasets huggingface_hub


## Mapping records to anndata

This function takes a generator that emits records from the Tahoe-100M Hugging Face dataset and returns an AnnData object. Use the `sample_size` argument to cap the number of records read. You can also create a new generator with the `dataset.filter` function so that it only emits records matching a condition (e.g., a specific drug, plate, or sample).

If you'd like to create a DataLoader for an ML training application, it is likely best to use the data in its native format without going through anndata.
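
The filtering idea can be sketched without downloading anything. The example below uses made-up records and a hypothetical helper, `filter_records`, that plays the role `dataset.filter` would play on the real dataset; `islice` stands in for the `sample_size` cap.

```python
from itertools import islice

def filter_records(records, **criteria):
    """Yield only the records whose fields equal every given criterion."""
    for record in records:
        if all(record.get(key) == value for key, value in criteria.items()):
            yield record

# Made-up records that reuse metadata keys seen in the dataset rows.
toy_records = [
    {"drug": "Tadalafil", "plate": "plate8", "genes": [3], "expressions": [2.0]},
    {"drug": "Baicalin", "plate": "plate5", "genes": [4], "expressions": [1.0]},
    {"drug": "Tadalafil", "plate": "plate8", "genes": [5], "expressions": [6.0]},
]

# A lazy generator: nothing is evaluated until records are requested.
tadalafil_only = filter_records(toy_records, drug="Tadalafil")

# Take at most one matching record, similar to passing sample_size=1.
first_match = list(islice(tadalafil_only, 1))
```

Because the generator is lazy, a filtered stream can be passed straight to `create_anndata_from_generator` without materializing the matching records first.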

In [4]:
def create_anndata_from_generator(generator, gene_vocab, sample_size=None):
    sorted_vocab_items = sorted(gene_vocab.items())
    token_ids, gene_names = zip(*sorted_vocab_items)
    token_id_to_col_idx = {token_id: idx for idx, token_id in enumerate(token_ids)}

    data, indices, indptr = [], [], [0]
    obs_data = []

    for i, cell in enumerate(generator):
        if sample_size is not None and i >= sample_size:
            break
        # Fetch the parallel lists of gene token IDs and expression values
        genes = cell.get('genes')
        expressions = cell.get('expressions')

        # Check for None and length explicitly: the values may be NumPy arrays,
        # whose truth value is ambiguous
        if genes is None or expressions is None or len(genes) == 0 or len(expressions) == 0:
            continue  # skip cells whose genes or expressions are missing or empty

        # Handle potential negative values at the beginning of expressions
        if expressions[0] < 0:
            genes = genes[1:]
            expressions = expressions[1:]

        col_indices = [token_id_to_col_idx[gene] for gene in genes if gene in token_id_to_col_idx]
        valid_expressions = [expr for gene, expr in zip(genes, expressions) if gene in token_id_to_col_idx]

        data.extend(valid_expressions)
        indices.extend(col_indices)
        indptr.append(len(data))

        obs_entry = {k: v for k, v in cell.items() if k not in ['genes', 'expressions']}
        obs_data.append(obs_entry)

    expr_matrix = csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, len(gene_names)))
    obs_df = pd.DataFrame(obs_data)

    adata = anndata.AnnData(X=expr_matrix, obs=obs_df)
    adata.var.index = pd.Index(gene_names, name='ensembl_id')

    return adata
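
To make the three CSR lists built by the function above concrete, here is a small pure-Python sketch on made-up records. The vocabulary, token IDs, and the `csr_to_dense` helper are hypothetical and exist only for illustration; the loop mirrors the logic of `create_anndata_from_generator` (drop a leading negative entry, skip tokens missing from the vocabulary).

```python
# Hypothetical vocabulary: token_id -> gene name.
toy_vocab = {3: "GENE_A", 4: "GENE_B", 5: "GENE_C"}
toy_cells = [
    {"genes": [1, 3, 5], "expressions": [-2.0, 4.0, 1.0]},  # first entry is a negative marker
    {"genes": [4, 99], "expressions": [7.0, 2.0]},          # token 99 is not in the vocabulary
]

# Column order follows the sorted token IDs.
col_of = {tok: col for col, tok in enumerate(sorted(toy_vocab))}

data, indices, indptr = [], [], [0]
for cell in toy_cells:
    genes, exprs = cell["genes"], cell["expressions"]
    if exprs[0] < 0:  # drop the leading marker entry
        genes, exprs = genes[1:], exprs[1:]
    for gene, expr in zip(genes, exprs):
        if gene in col_of:  # ignore tokens without a column
            indices.append(col_of[gene])
            data.append(expr)
    indptr.append(len(data))  # row i spans data[indptr[i]:indptr[i + 1]]

def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR lists into a list of dense rows."""
    rows = []
    for start, end in zip(indptr[:-1], indptr[1:]):
        row = [0.0] * n_cols
        for value, col in zip(data[start:end], indices[start:end]):
            row[col] = value
        rows.append(row)
    return rows

dense = csr_to_dense(data, indices, indptr, len(toy_vocab))
```

Each cell contributes one entry to `indptr`, so the number of rows in the matrix is `len(indptr) - 1`, which is the shape passed to `csr_matrix` in the function.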

## Load Tahoe-100M Dataset

In [5]:
!pip install datasets fsspec s3fs huggingface_hub pandas pyarrow

Collecting s3fs
  Downloading s3fs-2025.5.1-py3-none-any.whl.metadata (1.9 kB)
Collecting aiobotocore<3.0.0,>=2.5.4 (from s3fs)
  Downloading aiobotocore-2.22.0-py3-none-any.whl.metadata (24 kB)
INFO: pip is looking at multiple versions of s3fs to determine which version is compatible with other requirements. This could take a while.
Collecting s3fs
  Downloading s3fs-2025.5.0-py3-none-any.whl.metadata (1.9 kB)
  Downloading s3fs-2025.3.2-py3-none-any.whl.metadata (1.9 kB)
  Downloading s3fs-2025.3.1-py3-none-any.whl.metadata (1.9 kB)
  Downloading s3fs-2025.3.0-py3-none-any.whl.metadata (1.9 kB)
Collecting aioitertools<1.0.0,>=0.5.1 (from aiobotocore<3.0.0,>=2.5.4->s3fs)
  Downloading aioitertools-0.12.0-py3-none-any.whl.metadata (3.8 kB)
Collecting botocore<1.37.4,>=1.37.2 (from aiobotocore<3.0.0,>=2.5.4->s3fs)
  Downloading botocore-1.37.3-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from aiobotocore<3.0.0,>=2.5.4->s3fs)
  Downloading jmespath-1.0.1-py3-none

In [16]:
!pip install -q pandas pyarrow fsspec huggingface_hub

In [6]:

from huggingface_hub import login, HfApi
import fsspec
import pyarrow.parquet as pq
import pandas as pd
import random


In [7]:

from huggingface_hub import HfApi
import random

api = HfApi()
repo_id = "vevotx/Tahoe-100M"

# ✅ Set repo_type="dataset"
all_files = api.list_repo_files(repo_id=repo_id, repo_type="dataset")

# Filter for train parquet files
parquet_files = [f for f in all_files if f.startswith("data/train") and f.endswith(".parquet")]
print(f"Found {len(parquet_files)} parquet files.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Found 3388 parquet files.


In [8]:

# Set up repo and get file list
repo_id = "vevotx/Tahoe-100M"
api = HfApi()
all_files = api.list_repo_files(repo_id=repo_id, repo_type="dataset")

# Filter for parquet files
parquet_files = [f for f in all_files if f.startswith("data/train") and f.endswith(".parquet")]

# Pick 30 random files
random.seed(42)
sample_files = random.sample(parquet_files, 30)

# Download and load
base_url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/"
dfs = []

# Read each sampled file over HTTPS and load it into a DataFrame
for file in sample_files:
    full_url = base_url + file
    with fsspec.open(full_url, mode="rb") as f:
        table = pq.read_table(f)
        df = table.to_pandas()
        dfs.append(df)
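
The file-selection step can be tried offline. This sketch assumes made-up file names in place of the real listing and a hypothetical helper, `pick_files`; it uses a local `random.Random` instance and sorts the listing first so that the same seed yields the same subset regardless of the order in which the Hub returns the files.

```python
import random

repo_id = "vevotx/Tahoe-100M"
base_url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/"

# Hypothetical file names standing in for the real repository listing.
fake_files = [f"data/train-{i:05d}.parquet" for i in range(100)]

def pick_files(files, k, seed=42):
    """Return k files chosen reproducibly, independent of the input order."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(sorted(files), k)

sample_a = pick_files(fake_files, 30)
sample_b = pick_files(reversed(fake_files), 30)  # same result despite a different input order

# Build one download URL per selected file.
urls = [base_url + name for name in sample_a]
```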


**Concatenate subsampled files**

In [30]:

# process sampled files
df_all = pd.concat(dfs, ignore_index=True)
df_sampled = df_all.sample(n=100_000, random_state=42)
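
`DataFrame.sample` raises a `ValueError` when `n` exceeds the number of rows and sampling is without replacement, which could happen if fewer files are loaded. A minimal sketch of a guard, using a toy DataFrame and a hypothetical helper named `sample_rows`:

```python
import pandas as pd

def sample_rows(df, n, seed=42):
    """Sample up to n rows without replacement; return all rows (shuffled) if df is smaller."""
    n = min(n, len(df))
    return df.sample(n=n, random_state=seed)

toy = pd.DataFrame({"drug": ["A", "B", "C"], "plate": ["p1", "p1", "p2"]})

small = sample_rows(toy, 2)         # two of the three rows
capped = sample_rows(toy, 100_000)  # request exceeds the row count, so all rows come back
```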


## Load Gene Metadata

The gene metadata contains the mapping between the integer token IDs used in the dataset and standard gene identifiers (Ensembl IDs and HGNC gene symbols).

In [31]:

gene_metadata = load_dataset("vevotx/Tahoe-100M", name="gene_metadata", split="train")
gene_vocab = {entry["token_id"]: entry["ensembl_id"] for entry in gene_metadata}


Resolving data files:   0%|          | 0/3388 [00:00<?, ?it/s]

Using the latest cached version of the dataset since vevotx/Tahoe-100M couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'gene_metadata' at /root/.cache/huggingface/datasets/vevotx___tahoe-100_m/gene_metadata/0.0.0/affe86a848ac896240aa75fe1a2b568051f3b850 (last modified on Tue Jun 10 19:07:23 2025).
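
The same records can also be turned into a symbol lookup. The sketch below uses two toy entries with made-up token IDs; the `token_id` and `ensembl_id` field names come from the cell above, while `gene_symbol` is an assumed column name that should be checked against the actual `gene_metadata` columns.

```python
# Toy entries; token IDs are made up, and the gene_symbol field name is an assumption.
toy_gene_metadata = [
    {"token_id": 3, "ensembl_id": "ENSG00000000003", "gene_symbol": "TSPAN6"},
    {"token_id": 4, "ensembl_id": "ENSG00000000005", "gene_symbol": "TNMD"},
]

# token ID -> Ensembl ID, built the same way as gene_vocab above.
id_to_ensembl = {entry["token_id"]: entry["ensembl_id"] for entry in toy_gene_metadata}

# token ID -> gene symbol, and Ensembl ID -> gene symbol for relabelling adata.var.
id_to_symbol = {entry["token_id"]: entry["gene_symbol"] for entry in toy_gene_metadata}
ensembl_to_symbol = {entry["ensembl_id"]: entry["gene_symbol"] for entry in toy_gene_metadata}

# Decode the token IDs of one cell into readable symbols.
decoded = [id_to_symbol[token] for token in [4, 3]]
```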


## Create AnnData Object

In [34]:
# perform garbage collection
import gc
gc.collect()

83

In [35]:

# Flatten the per-cell gene and expression lists (for inspection only; not used below)
all_genes = [gene for genes_list in df_sampled['genes'] for gene in genes_list]
all_expressions = [expr for expr_list in df_sampled['expressions'] for expr in expr_list]


In [48]:

# Create a generator from the sampled dataframe
df_generator = (row.to_dict() for index, row in df_sampled.iterrows())

adata = create_anndata_from_generator(df_generator, gene_vocab, sample_size=100_000)
adata




AnnData object with n_obs × n_vars = 100000 × 62710
    obs: 'drug', 'sample', 'BARCODE_SUB_LIB_ID', 'cell_line_id', 'moa-fine', 'canonical_smiles', 'pubchem_cid', 'plate'

## Inspect Metadata (`adata.obs`)

In [55]:
adata.obs.head()

Unnamed: 0,drug,sample,BARCODE_SUB_LIB_ID,cell_line_id,moa-fine,canonical_smiles,pubchem_cid,plate,mean_gene_count,mean_tscp_count,mean_mread_count,mean_pcnt_mito,drugname_drugconc,targets,moa-broad,human-approved,clinical-trials,gpt-notes-approval
0,Trametinib (DMSO_TF solvate),smp_2192,26_086_176-lib_1556,CVCL_0504,MEK inhibitor,CC1=C2C(=C(N(C1=O)C)NC3=C(C=C(C=C3)I)F)C(=O)N...,11707110.0,plate8,1497.752143,2478.844807,2930.563781,0.068635,"[('Trametinib (DMSO_TF solvate)', 0.5, 'uM')]","MAP2K1, MAP2K2",inhibitor/antagonist,yes,yes,"Approved for use in cancer treatment (e.g., me..."
1,Tadalafil,smp_2184,18_048_117-lib_1564,CVCL_0459,unclear,CN1CC(=O)N2C(C1=O)CC3=C(C2C4=CC5=C(C=C4)OCO5)N...,110635.0,plate8,1317.310079,2050.241791,2424.619547,0.040039,"[('Tadalafil', 0.5, 'uM')]",PDE5A,inhibitor/antagonist,yes,yes,Used for erectile dysfunction and pulmonary ar...
2,Baicalin,smp_1953,75_126_080-lib_2168,CVCL_0399,unclear,C1=CC=C(C=C1)C2=CC(=O)C3=C(C(=C(C=C3O2)OC4C(C(...,64982.0,plate5,1291.030341,1929.945687,2266.821814,0.028184,"[('Baicalin', 0.5, 'uM')]",,activator/agonist,no,yes,"Used in traditional medicine, researched for v..."
3,Aprepitant,smp_2576,26_111_121-lib_1903,CVCL_0504,unclear,CC(C1=CC(=CC(=C1)C(F)(F)F)C(F)(F)F)OC2C(N(CCO2...,135413536.0,plate12,1454.146598,2375.016512,2774.666314,0.068693,"[('Aprepitant', 5.0, 'uM')]",TACR1,inhibitor/antagonist,yes,yes,Aprepitant is approved for preventing chemothe...
4,Glasdegib,smp_2732,86_046_121-lib_2052,CVCL_1693,Sonic inhibitor,CN1CCC(CC1C2=NC3=CC=CC=C3N2)NC(=O)NC4=CC=C(C=C...,25166913.0,plate13,1680.394311,2854.130915,3321.564969,0.063781,"[('Glasdegib', 0.05, 'uM')]",SMO,inhibitor/antagonist,yes,yes,Glasdegib is approved for acute myeloid leukemia.


## Enrich with Sample Metadata

Although the main data contains several metadata fields, some additional columns (such as drug concentration) are omitted to reduce the size of the data. If you need them, fetch them from the `sample_metadata` configuration and merge on the `sample` column.

In [50]:
sample_metadata = load_dataset("vevotx/Tahoe-100M","sample_metadata", split="train").to_pandas()
# Use a left merge so every row of adata.obs is kept, in its original order
adata.obs = pd.merge(adata.obs, sample_metadata.drop(columns=["drug","plate"]), on="sample", how="left")
adata.obs.head()

Resolving data files:   0%|          | 0/3388 [00:00<?, ?it/s]

Unnamed: 0,drug,sample,BARCODE_SUB_LIB_ID,cell_line_id,moa-fine,canonical_smiles,pubchem_cid,plate,mean_gene_count,mean_tscp_count,mean_mread_count,mean_pcnt_mito,drugname_drugconc
0,Trametinib (DMSO_TF solvate),smp_2192,26_086_176-lib_1556,CVCL_0504,MEK inhibitor,CC1=C2C(=C(N(C1=O)C)NC3=C(C=C(C=C3)I)F)C(=O)N...,11707110.0,plate8,1497.752143,2478.844807,2930.563781,0.068635,"[('Trametinib (DMSO_TF solvate)', 0.5, 'uM')]"
1,Tadalafil,smp_2184,18_048_117-lib_1564,CVCL_0459,unclear,CN1CC(=O)N2C(C1=O)CC3=C(C2C4=CC5=C(C=C4)OCO5)N...,110635.0,plate8,1317.310079,2050.241791,2424.619547,0.040039,"[('Tadalafil', 0.5, 'uM')]"
2,Baicalin,smp_1953,75_126_080-lib_2168,CVCL_0399,unclear,C1=CC=C(C=C1)C2=CC(=O)C3=C(C(=C(C=C3O2)OC4C(C(...,64982.0,plate5,1291.030341,1929.945687,2266.821814,0.028184,"[('Baicalin', 0.5, 'uM')]"
3,Aprepitant,smp_2576,26_111_121-lib_1903,CVCL_0504,unclear,CC(C1=CC(=CC(=C1)C(F)(F)F)C(F)(F)F)OC2C(N(CCO2...,135413536.0,plate12,1454.146598,2375.016512,2774.666314,0.068693,"[('Aprepitant', 5.0, 'uM')]"
4,Glasdegib,smp_2732,86_046_121-lib_2052,CVCL_1693,Sonic inhibitor,CN1CCC(CC1C2=NC3=CC=CC=C3N2)NC(=O)NC4=CC=C(C=C...,25166913.0,plate13,1680.394311,2854.130915,3321.564969,0.063781,"[('Glasdegib', 0.05, 'uM')]"
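
Because `adata.obs` must keep exactly one row per cell, it can help to check the merge on a toy example first. This sketch uses made-up frames; `how="left"` keeps every observation and `validate="many_to_one"` makes pandas raise if a sample appears more than once in the metadata, which would otherwise duplicate cells.

```python
import pandas as pd

# Made-up per-cell observations: several cells can share one sample.
obs = pd.DataFrame({"sample": ["smp_1", "smp_2", "smp_1"], "drug": ["A", "B", "A"]})

# Made-up per-sample metadata: one row per sample.
sample_meta = pd.DataFrame({
    "sample": ["smp_1", "smp_2"],
    "drug": ["A", "B"],
    "mean_gene_count": [1500.0, 1300.0],
})

merged = pd.merge(
    obs,
    sample_meta.drop(columns=["drug"]),  # avoid duplicate drug_x / drug_y columns
    on="sample",
    how="left",                          # keep all cells, in their original order
    validate="many_to_one",              # fail loudly if sample IDs repeat in the metadata
)
```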


## Add Drug Metadata

The drug metadata contains additional information for the compounds used in Tahoe-100M. See the dataset card and our [paper](https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1) for more information about how this information was generated.

In [51]:

drug_metadata = load_dataset("vevotx/Tahoe-100M","drug_metadata", split="train").to_pandas()
# Use a left merge to keep all observations from adata.obs
adata.obs = pd.merge(adata.obs, drug_metadata.drop(columns=["canonical_smiles","pubchem_cid","moa-fine"]), on="drug", how="left")
adata.obs.head()


Resolving data files:   0%|          | 0/3388 [00:00<?, ?it/s]

Unnamed: 0,drug,sample,BARCODE_SUB_LIB_ID,cell_line_id,moa-fine,canonical_smiles,pubchem_cid,plate,mean_gene_count,mean_tscp_count,mean_mread_count,mean_pcnt_mito,drugname_drugconc,targets,moa-broad,human-approved,clinical-trials,gpt-notes-approval
0,Trametinib (DMSO_TF solvate),smp_2192,26_086_176-lib_1556,CVCL_0504,MEK inhibitor,CC1=C2C(=C(N(C1=O)C)NC3=C(C=C(C=C3)I)F)C(=O)N...,11707110.0,plate8,1497.752143,2478.844807,2930.563781,0.068635,"[('Trametinib (DMSO_TF solvate)', 0.5, 'uM')]","MAP2K1, MAP2K2",inhibitor/antagonist,yes,yes,"Approved for use in cancer treatment (e.g., me..."
1,Tadalafil,smp_2184,18_048_117-lib_1564,CVCL_0459,unclear,CN1CC(=O)N2C(C1=O)CC3=C(C2C4=CC5=C(C=C4)OCO5)N...,110635.0,plate8,1317.310079,2050.241791,2424.619547,0.040039,"[('Tadalafil', 0.5, 'uM')]",PDE5A,inhibitor/antagonist,yes,yes,Used for erectile dysfunction and pulmonary ar...
2,Baicalin,smp_1953,75_126_080-lib_2168,CVCL_0399,unclear,C1=CC=C(C=C1)C2=CC(=O)C3=C(C(=C(C=C3O2)OC4C(C(...,64982.0,plate5,1291.030341,1929.945687,2266.821814,0.028184,"[('Baicalin', 0.5, 'uM')]",,activator/agonist,no,yes,"Used in traditional medicine, researched for v..."
3,Aprepitant,smp_2576,26_111_121-lib_1903,CVCL_0504,unclear,CC(C1=CC(=CC(=C1)C(F)(F)F)C(F)(F)F)OC2C(N(CCO2...,135413536.0,plate12,1454.146598,2375.016512,2774.666314,0.068693,"[('Aprepitant', 5.0, 'uM')]",TACR1,inhibitor/antagonist,yes,yes,Aprepitant is approved for preventing chemothe...
4,Glasdegib,smp_2732,86_046_121-lib_2052,CVCL_1693,Sonic inhibitor,CN1CCC(CC1C2=NC3=CC=CC=C3N2)NC(=O)NC4=CC=C(C=C...,25166913.0,plate13,1680.394311,2854.130915,3321.564969,0.063781,"[('Glasdegib', 0.05, 'uM')]",SMO,inhibitor/antagonist,yes,yes,Glasdegib is approved for acute myeloid leukemia.


## Drug Info from PubChem

We also provide the PubChem IDs for the compounds in Tahoe. These can be used to query additional information as needed.

In [52]:

drug_name = adata.obs["drug"].values[0]
cid = int(float(adata.obs["pubchem_cid"].values[0]))
compound = pcp.Compound.from_cid(cid)

print(f"Name: {drug_name}")
print(f"Synonyms: {compound.synonyms[:10]}")
print(f"Formula: {compound.molecular_formula}")
print(f"SMILES: {compound.isomeric_smiles}")
print(f"Mass: {compound.exact_mass}")


Name: Trametinib (DMSO_TF solvate)
Synonyms: ['Trametinib', '871700-17-3', 'TMT212', 'trametinibum', 'UNII-33E86K87QN', 'CHEBI:75998', 'TMT-212', '33E86K87QN', 'MEK Inhibitor GSK1120212', 'L01XE25']
Formula: C26H23FIN5O4
SMILES: CC1=C2C(=C(N(C1=O)C)NC3=C(C=C(C=C3)I)F)C(=O)N(C(=O)N2C4=CC=CC(=C4)NC(=O)C)C5CC5
Mass: 615.07788
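
The `pubchem_cid` column is stored as a float (for example `11707110.0`), and `int(float(...))` fails on NaN, so rows without a CID would raise. A minimal sketch of a tolerant conversion; `parse_cid` is a hypothetical helper, not part of `pubchempy`:

```python
import math

def parse_cid(value):
    """Convert a pubchem_cid value such as 11707110.0 or '110635.0' to int; return None if missing."""
    if value is None:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(number):  # covers NaN and infinities
        return None
    return int(number)

# Usage (commented out because it needs network access):
# cid = parse_cid(adata.obs["pubchem_cid"].values[0])
# compound = pcp.Compound.from_cid(cid) if cid is not None else None
```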


## Load Cell Line Metadata
The cell-line metadata contains additional identifiers for the cell lines used in Tahoe (e.g., DepMap IDs) as well as a curated list of driver mutations for each cell line. This information can be used, for instance, to train genotype-aware models on the Tahoe data.

In [53]:

cell_line_metadata = load_dataset("vevotx/Tahoe-100M","cell_line_metadata", split="train").to_pandas()
cell_line_metadata.head()


Resolving data files:   0%|          | 0/3388 [00:00<?, ?it/s]

Unnamed: 0,cell_name,Cell_ID_DepMap,Cell_ID_Cellosaur,Organ,Driver_Gene_Symbol,Driver_VarZyg,Driver_VarType,Driver_ProtEffect_or_CdnaEffect,Driver_Mech_InferDM,Driver_GeneType_DM
0,A549,ACH-000681,CVCL_0023,Lung,CDKN2A,Hom,Deletion,DEL,LoF,Suppressor
1,A549,ACH-000681,CVCL_0023,Lung,CDKN2B,Hom,Deletion,DEL,LoF,Suppressor
2,A549,ACH-000681,CVCL_0023,Lung,KRAS,Hom,Missense,p.G12S,GoF,Oncogene
3,A549,ACH-000681,CVCL_0023,Lung,SMARCA4,Hom,Frameshift,p.Q729fs,LoF,Suppressor
4,A549,ACH-000681,CVCL_0023,Lung,STK11,Hom,Stopgain,p.Q37*,LoF,Suppressor
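
As an illustration of how this table could feed a genotype-aware model, the sketch below groups the driver genes per cell line and builds a binary genotype vector. It copies the five A549 rows shown above and assumes that `Cell_ID_Cellosaur` corresponds to `cell_line_id` in `adata.obs` (both hold CVCL identifiers in the outputs shown); the encoding itself is only one possible choice.

```python
from collections import defaultdict

# The five rows shown above, reduced to the columns used here.
rows = [
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "CDKN2A"},
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "CDKN2B"},
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "KRAS"},
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "SMARCA4"},
    {"Cell_ID_Cellosaur": "CVCL_0023", "Driver_Gene_Symbol": "STK11"},
]

# Group driver genes by cell-line identifier.
drivers_by_cell = defaultdict(list)
for row in rows:
    drivers_by_cell[row["Cell_ID_Cellosaur"]].append(row["Driver_Gene_Symbol"])

# Binary genotype vector over a fixed gene order (TP53 added to show a zero entry).
gene_order = ["CDKN2A", "CDKN2B", "KRAS", "SMARCA4", "STK11", "TP53"]
genotype = [1 if gene in drivers_by_cell["CVCL_0023"] else 0 for gene in gene_order]
```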


**Save data to Google drive**

In [57]:

# save the data
adata.write_h5ad('/content/drive/Othercomputers/My MacBook Pro (2)/SCIENCE/Projects/Tahoe_100M_2025/data/datatahoe-100m_100K_sample.h5ad')
