# Model development

The goal is to train a conditional model that predicts gene expression from drug, dose, cell line, and mutation profile. We will explore generative approaches such as variational inference and GANs, among others, to build a model that takes expression in a control state as input and outputs gene expression under perturbed conditions.
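To make the conditioning concrete, one option is to assemble a single input vector per cell from the control expression and the metadata. The sketch below is a minimal NumPy illustration, not the project's model: the vocabularies, the `encode_condition` helper, the log-dose transform, and the binary mutation flags are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical vocabularies; real ones would come from the dataset's metadata.
DRUGS = ["drug_a", "drug_b", "vehicle"]
CELL_LINES = ["line_1", "line_2"]


def one_hot(value, vocabulary):
    """Return a one-hot vector for `value` within `vocabulary`."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    vec[vocabulary.index(value)] = 1.0
    return vec


def encode_condition(control_expr, drug, dose, cell_line, mutations):
    """Concatenate log1p control expression with encoded metadata.

    Assumed layout:
    [expression | drug one-hot | log1p(dose) | cell line one-hot | mutation flags]
    """
    return np.concatenate([
        np.log1p(np.asarray(control_expr, dtype=np.float32)),
        one_hot(drug, DRUGS),
        np.array([np.log1p(dose)], dtype=np.float32),
        one_hot(cell_line, CELL_LINES),
        np.asarray(mutations, dtype=np.float32),
    ])


# Toy example: 4 genes and 3 tracked mutations (illustrative values only).
x = encode_condition(
    control_expr=[0, 3, 10, 1],
    drug="drug_b",
    dose=5.0,
    cell_line="line_1",
    mutations=[1, 0, 0],
)
print(x.shape)
```

A generative model (for example a conditional VAE) would consume a vector like `x` and be trained to reconstruct the perturbed expression profile. Learned embeddings for drugs and cell lines are a common alternative to one-hot codes when the vocabularies are large.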

The core idea is:

1. Train a model to generate gene expression under perturbation (conditioned on metadata and baseline/control expression).
2. Fine-tune the model to adapt to unseen contexts (e.g., normal lung cell types, brain cell types from atlases) using transfer learning.
3. Use the adapted model to predict gene expression changes in these new settings.
4. Check the predictions for pathways expected to be up- or down-regulated by a given drug (GO analysis and comparison to the training set).

Looking for teammates interested in generative modeling, transfer learning, and gene expression prediction to shape this project together.
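The final check in the workflow above needs a list of up- and down-regulated genes per drug before any GO analysis can run. The sketch below shows one simple way to get such lists, assuming predicted and control expression are available as (cells x genes) arrays; the `top_changed_genes` helper, the pseudocount, and the toy values are illustrative assumptions, and the GO enrichment step itself is not shown.

```python
import numpy as np


def top_changed_genes(control, predicted, gene_names, k=2, pseudocount=1.0):
    """Rank genes by log2 fold change of mean predicted vs. mean control expression.

    `control` and `predicted` are (cells x genes) arrays of non-negative values.
    Returns (up, down, log2fc): the k most up-regulated gene names, the k most
    down-regulated gene names, and the per-gene log2 fold changes.
    """
    control_mean = np.asarray(control, dtype=float).mean(axis=0)
    predicted_mean = np.asarray(predicted, dtype=float).mean(axis=0)
    log2fc = np.log2((predicted_mean + pseudocount) / (control_mean + pseudocount))
    order = np.argsort(log2fc)  # ascending: most down-regulated first
    down = [gene_names[i] for i in order[:k]]
    up = [gene_names[i] for i in order[::-1][:k]]
    return up, down, log2fc


# Toy data: 2 cells x 4 genes (illustrative values only).
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
control = np.array([[1.0, 7.0, 3.0, 3.0], [1.0, 7.0, 3.0, 3.0]])
predicted = np.array([[7.0, 1.0, 3.0, 1.0], [7.0, 1.0, 3.0, 1.0]])

up, down, log2fc = top_changed_genes(control, predicted, genes, k=1)
print(up, down)
```

Running the same function on the observed perturbed cells from the training set gives a reference ranking to compare the predicted one against.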



### Import libraries

In [1]:

import scanpy as sc
import pandas as pd
import numpy as np
import scvi.hub




In [3]:

# check which Python interpreter is used
import sys
print(sys.executable)


/home/ubuntu/anatoly-tahoe-100/code/tahoe-100m/my_model_env/bin/python


### Download the scVI model for Tahoe-100M

In [6]:

# download from HuggingFace
from huggingface_hub import snapshot_download

In [7]:

# download the scVI model
# Specify the repository ID
repo_id = "tahoebio/Tahoe-100M-SCVI-v1"

# Specify the local directory where files will be downloaded
local_dir = "/home/ubuntu/anatoly-tahoe-100/data/scvi_tahoe"  # Replace with your desired path

# Download the repository
# (recent huggingface_hub releases ignore local_dir_use_symlinks, so it is omitted)
snapshot_download(repo_id=repo_id, local_dir=local_dir)

Fetching 5 files: 100%|████████████████████████████████████████████████████| 5/5 [11:10<00:00, 134.09s/it]


'/home/ubuntu/anatoly-tahoe-100/data/scvi_tahoe'

In [None]:

# download the original Tahoe-100M data
# Specify the repository ID
repo_id = "tahoebio/Tahoe-100M"

# Specify the local directory where files will be downloaded
local_dir = "/home/ubuntu/anatoly-tahoe-100/data/tahoe_100m_data"  # Replace with your desired path

# Download the repository (a dataset repo, hence repo_type="dataset")
# (recent huggingface_hub releases ignore local_dir_use_symlinks, so it is omitted)
snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)


Fetching 3401 files:   4%|█▋                                           | 127/3401 [00:09<05:40,  9.63it/s]

### Load the data

In [4]:

# read the data
adata_tahoe = sc.read_h5ad('/home/ubuntu/anatoly-tahoe-100/data/datatahoe-100m.h5ad')




In [10]:

# show the unique canonical SMILES values
adata_tahoe.obs['canonical_smiles'].unique()


['C1=CC2=C(C(=C1)O)N=CC=C2', 'CC(=O)O.CC(=O)O.C1=CC(=CC=C1NC(=NC(=NCCCCCCN=..., 'C1C(C(OC1N2C=C(C(=O)NC2=O)C(F)(F)F)CO)O', 'CN1CCC2=CC(=C3C=C2C1CC4=CC=C(C=C4)OC5=C(C=CC(..., 'CC1=C(C(CCC1)(C)C)C=CC(=CC=CC(=CC(=O)O)C)C', ..., 'COC1=C(C=CC(=C1)NS(=O)(=O)C)NC2=C3C=CC=CC3=NC..., 'COC1=CC=C(C=C1)C2=CC(=S)SS2', 'C1CN=C(N1)NC2=C(C=CC=C2Cl)Cl.Cl', 'CN1CCCC1COC2=NC3=C(CCN(C3)C4=CC=CC5=C4C(=CC=C..., '']
Length: 95
Categories (95, object): ['', 'C1(C(=O)NC(=O)N1)NC(=O)N', 'C1=C(C=C(C(=C1I)OC2=CC(=C(C(=C2)I)O)I)I)CC(C(..., 'C1=C(C=C(C(=C1O)O)O)C(=O)O', ..., 'COCCCCC(=NOCCN)C1=CC=C(C=C1)C(F)(F)F.C(=CC(=O..., 'CS(=O)(=O)OCCCCOS(=O)(=O)C', 'C[As](C)SCC(C(=O)NCC(=O)O)NC(=O)CCC(C(=O)O)N', 'C[C@]12CC[C@](C[C@H]1C3=CC(=O)[C@@H]4[C@]5(CC...]
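The output above reports 95 categories, one of which is the empty string. The notebook does not say what the empty entries represent, so the sketch below only shows how to count distinct values and separate empty from non-empty SMILES. It uses a small toy Series as a stand-in for `adata_tahoe.obs['canonical_smiles']`; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for adata_tahoe.obs['canonical_smiles']; values are illustrative.
smiles = pd.Series(
    ["C1=CC2=C(C(=C1)O)N=CC=C2", "", "COC1=CC=C(C=C1)C2=CC(=S)SS2", ""],
    dtype="category",
)

n_unique = smiles.nunique()                  # distinct values, including ''
has_smiles = smiles.astype(str) != ""        # False where the SMILES string is empty
n_compounds = smiles[has_smiles].nunique()   # distinct non-empty SMILES
n_empty = int((~has_smiles).sum())           # rows with an empty SMILES string

print(n_unique, n_compounds, n_empty)
```

Checking which rows carry an empty SMILES string against the drug metadata is worth doing before encoding, since any structure-based drug encoder needs a defined fallback for those rows.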

### Encode the data

### Train the models

### Save results