## Generating UniPert Representations for Perturbation AnnData

This tutorial shows how to generate perturbagen embeddings with UniPert for a given perturbation `AnnData` object:

* The UniPert representations are formatted as a `dict` and stored in the AnnData object under the key `adata.uns['UniPert_reps']`.
  
* Invalid or unretrieved perturbagens are collected in a `list` and stored in the AnnData object under the key `adata.uns['invalid_ptbgs']`.
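
As a rough illustration of the two containers described above, the sketch below uses a plain Python dict as a stand-in for `adata.uns`. The perturbagen names, the vectors, and the helper `lookup_rep` are all hypothetical placeholders for illustration, not UniPert's actual values or API.

```python
# Stand-in for adata.uns after encoding; every value here is a made-up placeholder.
uns = {
    "UniPert_reps": {                 # perturbagen name -> embedding vector
        "GENE_A": [0.1, 0.2, 0.3],
        "GENE_B": [0.4, 0.5, 0.6],
    },
    "invalid_ptbgs": ["UNKNOWN_X"],   # names that could not be encoded
}

def lookup_rep(uns, name):
    """Return the stored vector for `name`, or None if it was not encoded."""
    if name in uns["invalid_ptbgs"]:
        return None
    return uns["UniPert_reps"].get(name)

print(lookup_rep(uns, "GENE_A"))      # [0.1, 0.2, 0.3]
print(lookup_rep(uns, "UNKNOWN_X"))   # None
```

Labels that appear in neither container, such as the control label, also return `None` here.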
  
We use two example perturbation datasets from the [scPerturb database](https://www.sanderlab.org/scPerturb/datavzrd/scPerturb_vzrd_v2/dataset_info/index_1.html) to demonstrate the process:

  1. [Example 1: Genetic Perturbation Adata](#Genetic-Perturbation-Adata)

  2. [Example 2: Chemical Perturbation Adata](#Chemical-Perturbation-Adata)




## Prepare example perturbation adata

Define a download function to fetch perturbation adata files from the [scPerturb database](https://www.sanderlab.org/scPerturb/datavzrd/scPerturb_vzrd_v2/dataset_info/index_1.html).

In [1]:
import os
import requests
from lamin_utils import logger

def download_file(url: str, folder_path: str):
    """
    Download the file at `url` into `folder_path` and return its local path.
    Returns None if the download fails.
    """
    os.makedirs(folder_path, exist_ok=True)
    file_name = url.split('/')[-1]
    file_path = os.path.join(folder_path, file_name)
    # Skip the download if the file is already present
    if os.path.exists(file_path):
        logger.info(f"{file_name} already exists.")
        return file_path
    # Stream to a temporary file so an interrupted download
    # is not mistaken for a complete one on the next call
    tmp_path = file_path + '.part'
    try:
        response = requests.get(url, stream=True, timeout=60)
        response.raise_for_status()
        with open(tmp_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        os.replace(tmp_path, file_path)
        logger.download(f"{file_name} download.")
        return file_path
    except requests.exceptions.RequestException as e:
        logger.error(f"Download failed: {e}")
        return None
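
`download_file` takes the file name from the last `/`-separated segment of the URL, which works for the Zenodo links used in this tutorial. If a link carried a query string (for example a hypothetical `?download=1` suffix), that suffix would end up in the file name. A minimal sketch of a more defensive helper, using only the standard library; `file_name_from_url` is an illustrative name and not part of UniPert:

```python
import posixpath
from urllib.parse import urlsplit, unquote

def file_name_from_url(url: str) -> str:
    """Return the last path segment of `url`, ignoring query and fragment."""
    path = urlsplit(url).path            # drops '?query' and '#fragment'
    return unquote(posixpath.basename(path))

url = 'https://zenodo.org/record/10044268/files/PapalexiSatija2021_eccite_arrayed_RNA.h5ad'
print(file_name_from_url(url))
print(file_name_from_url(url + '?download=1'))  # hypothetical query suffix
```

Both calls print the same file name, so the "already exists" check would still find a file saved under either form of the link.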

## Prepare UniPert model

In [3]:
import sys
sys.path.append('../')
from unipert import UniPert

unipert = UniPert()

💡 CUDA is available. Using CUDA.
💡 Constructing UniPert model...
✅ ESM2 model loaded.
✅ Reference ESM2 embedding file loaded.
✅ ESM2 embedder created.
✅ ECFP4 embedder created.
✅ UniPert model constructed.
✅ Pretrained model file loaded.
✅ Reference target graph prepared.
✅ Model loaded and initialized.


## Examples

### Genetic Perturbation Adata

In [4]:
scperturb_url = 'https://zenodo.org/record/10044268/files/PapalexiSatija2021_eccite_arrayed_RNA.h5ad'
demo_data_path = '../demo_data/'  
file_path = download_file(scperturb_url, demo_data_path)

✅ PapalexiSatija2021_eccite_arrayed_RNA.h5ad download.


In [5]:
import scanpy as sc
# Open in backed mode ('r') so the expression matrix stays on disk
adata = sc.read(file_path, backed='r')
adata

AnnData object with n_obs × n_vars = 8984 × 16826 backed at '../demo_data/PapalexiSatija2021_eccite_arrayed_RNA.h5ad'
    obs: 'perturbation', 'hto', 'guide_id', 'hto_barcode', 'gdo_barcode', 'tissue_type', 'cell_line', 'cancer', 'disease', 'perturbation_type', 'celltype', 'organism', 'nperts', 'ngenes', 'ncounts', 'percent_mito', 'percent_ribo'
    var: 'ensembl_id', 'ncounts', 'ncells'

In [6]:
adata.obs['perturbation'].value_counts()

perturbation
control     2009
ETV7        1789
IRF1         994
ATF2         794
IRF7         750
MARCH8       723
IFNGR1       701
STAT2        576
CAV1         409
PDL1         235
IFNGR2         4
CMTM6          0
CD86           0
CUL3           0
BRD4           0
PDCD1LG2       0
POU2F2         0
NFKBIA         0
JAK2           0
SPI1           0
SMAD4          0
STAT3          0
STAT1          0
STAT5A         0
TNFRSF14       0
UBE2L6         0
eGFP           0
Name: count, dtype: int64
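
The entries with a count of 0 appear because `perturbation` is a categorical column that still carries labels with no cells in this dataset. The sketch below redoes the bookkeeping on a hand-copied subset of the counts above. Treating zero-count categories and the control label as skipped is an assumption that is consistent with the 10 perturbagens reported in the encoding log later in this example, not something confirmed from the UniPert source.

```python
# Counts copied from the value_counts() output above
# (only three of the zero-count entries are kept, for brevity).
counts = {
    'control': 2009, 'ETV7': 1789, 'IRF1': 994, 'ATF2': 794, 'IRF7': 750,
    'MARCH8': 723, 'IFNGR1': 701, 'STAT2': 576, 'CAV1': 409, 'PDL1': 235,
    'IFNGR2': 4, 'CMTM6': 0, 'CD86': 0, 'eGFP': 0,
}
control_key = 'control'

# Keep labels that have at least one cell and are not the control.
observed = [name for name, n in counts.items() if n > 0 and name != control_key]
print(len(observed))   # 10
print(sorted(observed))
```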

In [7]:
unipert.enc_ptbgs_for_pert_adata(
    adata=adata,
    ptbg_cols=['perturbation'],
    ptbg_types=['genetic'],
    control_key='control',
    return_results=False
)

💡 Retrieving sequence for genetic perturbagens...


100%|██████████| 10/10 [00:08<00:00,  1.12it/s]

💡 Constructing reference-custom target graph from /data0/lsn1/lisn/VCC/UniPert/data/custom_target_seq.fasta...
💡 Preparing MMseqs and creating reference database...
[04:08:00 +08:00] [mmseqs] [---I---] [thread 2865455] Converting sequences
✅ MMseqs reference database created.
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] 
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Time for merging to ref_h: 0h 0m 0s 0ms
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Time for merging to ref: 0h 0m 0s 0ms
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Database type: Aminoacid
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Time for processing: 0h 0m 0s 87ms
💡 Calculating similarity between /data0/lsn1/lisn/VCC/UniPert/data/custom_target_seq.fasta and reference fasta file...





[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Temporary path /data0/lsn1/lisn/VCC/UniPert/mmseqs_storage/workdir/tmp_ygblIkaz1mDOiqncORkH04MYN does not exist or is not a directory. It will be created.
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Created temporary directory /data0/lsn1/lisn/VCC/UniPert/mmseqs_storage/workdir/tmp_ygblIkaz1mDOiqncORkH04MYN
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Converting sequences
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] 
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Time for merging to query_h: 0h 0m 0s 0ms
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Time for merging to query: 0h 0m 0s 0ms
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Database type: Aminoacid
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Time for processing: 0h 0m 0s 0ms
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Call search (subcall): search
[04:08:01 +08:00] [mmseqs] [---I---] [thread 2865455] Se

100%|██████████| 10/10 [00:00<00:00, 12.59it/s]

✅ ESM2 embeddings with 20 querys generated.





✅ UniPert representations generated!
💡 10 perturbagens' UniPert representations saved to adata.uns['UniPert_reps']


In [8]:
adata.uns['UniPert_reps'].keys(), adata.uns['invalid_ptbgs']

(dict_keys(['IRF1', 'STAT2', 'IRF7', 'MARCH8', 'ETV7', 'IFNGR2', 'PDL1', 'IFNGR1', 'ATF2', 'CAV1']),
 [])
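
A quick consistency check after encoding is to confirm that every non-control label is accounted for, either in `UniPert_reps` or in `invalid_ptbgs`. The sketch below runs on plain Python containers standing in for `adata.obs['perturbation']` and `adata.uns`; `check_coverage` is a hypothetical helper, not a UniPert function, and the vectors are placeholders.

```python
def check_coverage(labels, reps, invalid, control_key='control'):
    """Return labels that are neither encoded nor reported as invalid."""
    expected = set(labels) - {control_key}
    return sorted(expected - set(reps) - set(invalid))

# Stand-in data; the vectors are placeholders, not real embeddings.
labels = ['control', 'IRF1', 'IRF1', 'STAT2', 'CAV1']
reps = {'IRF1': [0.0], 'STAT2': [0.0]}
invalid = []

print(check_coverage(labels, reps, invalid))  # ['CAV1']
```

With the real object, `labels` would be the observed values of `adata.obs['perturbation']`, and an empty result means no perturbagen was silently dropped.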

### Chemical Perturbation Adata

In [3]:
scperturb_url = 'https://zenodo.org/record/10044268/files/SrivatsanTrapnell2020_sciplex3.h5ad'
demo_data_path = '../demo_data/'  
file_path = download_file(scperturb_url, demo_data_path)

💡 SrivatsanTrapnell2020_sciplex3.h5ad already exists.


In [4]:
import scanpy as sc

# Open in backed mode ('r') so the large expression matrix stays on disk
adata = sc.read(file_path, backed='r')
adata

AnnData object with n_obs × n_vars = 799317 × 110984 backed at '../demo_data/SrivatsanTrapnell2020_sciplex3.h5ad'
    obs: 'ncounts', 'well', 'plate', 'cell_line', 'replicate', 'time', 'dose_value', 'pathway_level_1', 'pathway_level_2', 'perturbation', 'target', 'pathway', 'dose_unit', 'celltype', 'disease', 'cancer', 'tissue_type', 'organism', 'perturbation_type', 'ngenes', 'percent_mito', 'percent_ribo', 'nperts', 'chembl-ID'
    var: 'ensembl_id', 'ncounts', 'ncells'

In [5]:
adata.obs['perturbation'].value_counts()

perturbation
control                              17578
Ellagic acid                          6257
Divalproex Sodium                     6203
Ruxolitinib (INCB018424)              6143
MC1568                                6126
                                     ...  
Alvespimycin (17-DMAG) HCl            2089
Patupilone (EPO906, Epothilone B)     1822
Flavopiridol HCl                      1729
Epothilone A                          1426
YM155 (Sepantronium Bromide)          1007
Name: count, Length: 189, dtype: int64

In [6]:
unipert.enc_ptbgs_for_pert_adata(
    adata=adata,
    ptbg_cols=['perturbation'],
    ptbg_types=['chemical'],
    control_key='control',
    return_results=False
)

✅ chemspider server connected successfully.
💡 Retrievaling SMILES for chemical perturbagens...


  2%|▏         | 4/188 [00:12<09:14,  3.01s/it]

Unable to retrieve SMILES for query compound name: Fedratinib (SAR302503, TG101348)


  3%|▎         | 6/188 [00:17<07:52,  2.59s/it]

Unable to retrieve SMILES for query compound name: Disulfiram 


  5%|▌         | 10/188 [00:27<07:58,  2.69s/it]

Unable to retrieve SMILES for query compound name: Busulfan 


  8%|▊         | 15/188 [00:41<08:37,  2.99s/it]

Unable to retrieve SMILES for query compound name: INO-1001 (3-Aminobenzamide)


 13%|█▎        | 24/188 [01:07<07:59,  2.92s/it]

Unable to retrieve SMILES for query compound name: Glesatinib?(MGCD265)


 19%|█▊        | 35/188 [01:37<07:30,  2.94s/it]

Unable to retrieve SMILES for query compound name: Aminoglutethimide 


 26%|██▌       | 48/188 [02:14<06:50,  2.94s/it]

Unable to retrieve SMILES for query compound name: Prednisone 


 28%|██▊       | 53/188 [02:28<06:33,  2.92s/it]

Unable to retrieve SMILES for query compound name: Cimetidine 


 29%|██▉       | 55/188 [02:32<05:44,  2.59s/it]

Unable to retrieve SMILES for query compound name: ENMD-2076 L-(+)-Tartaric acid 


 35%|███▌      | 66/188 [03:03<06:02,  2.97s/it]

Unable to retrieve SMILES for query compound name: Tazemetostat (EPZ-6438)


 39%|███▉      | 74/188 [03:25<05:33,  2.93s/it]

Unable to retrieve SMILES for query compound name: Mocetinostat (MGCD0103)


 49%|████▉     | 93/188 [04:23<05:10,  3.27s/it]

Unable to retrieve SMILES for query compound name: PD173074


 51%|█████     | 96/188 [04:31<04:15,  2.78s/it]

Unable to retrieve SMILES for query compound name: Valproic acid sodium salt (Sodium valproate)


 53%|█████▎    | 100/188 [04:41<03:57,  2.70s/it]

Unable to retrieve SMILES for query compound name: Streptozotocin (STZ)


 54%|█████▎    | 101/188 [04:43<03:28,  2.40s/it]

Unable to retrieve SMILES for query compound name: Luminespib (AUY-922, NVP-AUY922)


 56%|█████▌    | 105/188 [04:53<03:39,  2.65s/it]

Unable to retrieve SMILES for query compound name: Cediranib (AZD2171)


 60%|█████▉    | 112/188 [05:11<03:25,  2.71s/it]

Unable to retrieve SMILES for query compound name: Lomustine 


 62%|██████▏   | 117/188 [05:25<03:19,  2.81s/it]

Unable to retrieve SMILES for query compound name: Nilotinib (AMN-107)


 64%|██████▍   | 121/188 [05:35<03:09,  2.83s/it]

Unable to retrieve SMILES for query compound name: Iniparib (BSI-201)


 74%|███████▍  | 139/188 [06:27<02:25,  2.97s/it]

Unable to retrieve SMILES for query compound name: Mesna 


 77%|███████▋  | 145/188 [06:43<02:04,  2.90s/it]

Unable to retrieve SMILES for query compound name: Azacitidine 


 78%|███████▊  | 146/188 [06:45<01:44,  2.49s/it]

Unable to retrieve SMILES for query compound name: Capecitabine 


 82%|████████▏ | 154/188 [07:07<01:37,  2.87s/it]

Unable to retrieve SMILES for query compound name: Tacedinaline (CI994)


 86%|████████▌ | 162/188 [07:29<01:15,  2.91s/it]

Unable to retrieve SMILES for query compound name: Rucaparib (AG-014699,PF-01367338) phosphate


 87%|████████▋ | 164/188 [07:34<01:03,  2.64s/it]

Unable to retrieve SMILES for query compound name: Fluorouracil (5-Fluoracil, 5-FU)


 88%|████████▊ | 166/188 [07:38<00:56,  2.56s/it]

Unable to retrieve SMILES for query compound name: Fasudil (HA-1077) HCl


 89%|████████▉ | 167/188 [07:40<00:47,  2.25s/it]

Unable to retrieve SMILES for query compound name: Lenalidomide (CC-5013)


 93%|█████████▎| 174/188 [07:59<00:39,  2.80s/it]

Unable to retrieve SMILES for query compound name: Bisindolylmaleimide IX (Ro 31-8220 Mesylate)


 97%|█████████▋| 183/188 [08:24<00:14,  2.85s/it]

Unable to retrieve SMILES for query compound name: Clevudine 


 99%|█████████▉| 186/188 [08:31<00:05,  2.69s/it]

Unable to retrieve SMILES for query compound name: Regorafenib (BAY 73-4506)


100%|██████████| 188/188 [08:36<00:00,  2.75s/it]
100%|██████████| 158/158 [00:00<00:00, 473.44it/s]


✅ UniPert representations generated!
💡 158 perturbagens' UniPert representations saved to adata.uns['UniPert_reps']
❗ 30 perturbagens can not be repersentated and saved to adata.uns['invalid_ptbgs']: 
['Fedratinib (SAR302503, TG101348)', 'Disulfiram ', 'Busulfan ', 'INO-1001 (3-Aminobenzamide)', 'Glesatinib?(MGCD265)', 'Aminoglutethimide ', 'Prednisone ', 'Cimetidine ', 'ENMD-2076 L-(+)-Tartaric acid ', 'Tazemetostat (EPZ-6438)', 'Mocetinostat (MGCD0103)', 'PD173074', 'Valproic acid sodium salt (Sodium valproate)', 'Streptozotocin (STZ)', 'Luminespib (AUY-922, NVP-AUY922)', 'Cediranib (AZD2171)', 'Lomustine ', 'Nilotinib (AMN-107)', 'Iniparib (BSI-201)', 'Mesna ', 'Azacitidine ', 'Capecitabine ', 'Tacedinaline (CI994)', 'Rucaparib (AG-014699,PF-01367338) phosphate', 'Fluorouracil (5-Fluoracil, 5-FU)', 'Fasudil (HA-1077) HCl', 'Lenalidomide (CC-5013)', 'Bisindolylmaleimide IX (Ro 31-8220 Mesylate)', 'Clevudine ', 'Regorafenib (BAY 73-4506)']


In [7]:
# Try again to retrieve perturbagens not successfully retrieved before
unipert.enc_ptbgs_for_pert_adata(
    adata=adata,
    ptbg_cols=['perturbation'],
    ptbg_types=['chemical'],
    control_key='control',
    return_results=False
)

💡 Retrievaling SMILES for chemical perturbagens...


  0%|          | 0/188 [00:00<?, ?it/s]

Unable to retrieve SMILES for query compound name: Fedratinib (SAR302503, TG101348)


  3%|▎         | 5/188 [00:01<00:56,  3.24it/s]

Unable to retrieve SMILES for query compound name: Disulfiram 


  4%|▎         | 7/188 [00:03<01:26,  2.09it/s]

Unable to retrieve SMILES for query compound name: Busulfan 


  6%|▌         | 11/188 [00:04<01:16,  2.32it/s]

Unable to retrieve SMILES for query compound name: INO-1001 (3-Aminobenzamide)


  9%|▊         | 16/188 [00:06<01:04,  2.67it/s]

Unable to retrieve SMILES for query compound name: Glesatinib?(MGCD265)


 13%|█▎        | 25/188 [00:07<00:42,  3.81it/s]

Unable to retrieve SMILES for query compound name: Aminoglutethimide 


 19%|█▉        | 36/188 [00:09<00:30,  4.92it/s]

Unable to retrieve SMILES for query compound name: Prednisone 


 26%|██▌       | 49/188 [00:10<00:22,  6.04it/s]

Unable to retrieve SMILES for query compound name: Cimetidine 


 29%|██▊       | 54/188 [00:12<00:25,  5.17it/s]

Unable to retrieve SMILES for query compound name: ENMD-2076 L-(+)-Tartaric acid 


 30%|██▉       | 56/188 [00:14<00:33,  3.89it/s]

Unable to retrieve SMILES for query compound name: Tazemetostat (EPZ-6438)


 36%|███▌      | 67/188 [00:15<00:24,  4.87it/s]

Unable to retrieve SMILES for query compound name: Mocetinostat (MGCD0103)


 40%|███▉      | 75/188 [00:17<00:22,  4.96it/s]

Unable to retrieve SMILES for query compound name: PD173074


 50%|█████     | 94/188 [00:18<00:13,  7.16it/s]

Unable to retrieve SMILES for query compound name: Valproic acid sodium salt (Sodium valproate)


 52%|█████▏    | 97/188 [00:20<00:16,  5.50it/s]

Unable to retrieve SMILES for query compound name: Streptozotocin (STZ)


 54%|█████▎    | 101/188 [00:21<00:18,  4.61it/s]

Unable to retrieve SMILES for query compound name: Luminespib (AUY-922, NVP-AUY922)


 54%|█████▍    | 102/188 [00:23<00:25,  3.44it/s]

Unable to retrieve SMILES for query compound name: Cediranib (AZD2171)


 56%|█████▋    | 106/188 [00:24<00:25,  3.19it/s]

Unable to retrieve SMILES for query compound name: Lomustine 


 60%|██████    | 113/188 [00:26<00:20,  3.59it/s]

Unable to retrieve SMILES for query compound name: Nilotinib (AMN-107)


 63%|██████▎   | 118/188 [00:28<00:20,  3.49it/s]

Unable to retrieve SMILES for query compound name: Iniparib (BSI-201)


 65%|██████▍   | 122/188 [00:29<00:20,  3.23it/s]

Unable to retrieve SMILES for query compound name: Mesna 


 74%|███████▍  | 140/188 [00:31<00:08,  5.75it/s]

Unable to retrieve SMILES for query compound name: Azacitidine 


 78%|███████▊  | 146/188 [00:32<00:08,  5.19it/s]

Unable to retrieve SMILES for query compound name: Capecitabine 


 78%|███████▊  | 147/188 [00:34<00:10,  3.76it/s]

Unable to retrieve SMILES for query compound name: Tacedinaline (CI994)


 82%|████████▏ | 155/188 [00:35<00:07,  4.18it/s]

Unable to retrieve SMILES for query compound name: Rucaparib (AG-014699,PF-01367338) phosphate


 87%|████████▋ | 163/188 [00:37<00:05,  4.47it/s]

Unable to retrieve SMILES for query compound name: Fluorouracil (5-Fluoracil, 5-FU)


 88%|████████▊ | 165/188 [00:38<00:06,  3.73it/s]

Unable to retrieve SMILES for query compound name: Fasudil (HA-1077) HCl


 89%|████████▉ | 167/188 [00:40<00:07,  2.97it/s]

Unable to retrieve SMILES for query compound name: Lenalidomide (CC-5013)


 89%|████████▉ | 168/188 [00:41<00:08,  2.26it/s]

Unable to retrieve SMILES for query compound name: Bisindolylmaleimide IX (Ro 31-8220 Mesylate)


 93%|█████████▎| 175/188 [00:43<00:04,  2.95it/s]

Unable to retrieve SMILES for query compound name: Clevudine 


 98%|█████████▊| 184/188 [00:44<00:01,  3.82it/s]

Unable to retrieve SMILES for query compound name: Regorafenib (BAY 73-4506)


100%|██████████| 188/188 [00:46<00:00,  4.05it/s]

✅ UniPert representations generated!
💡 158 perturbagens' UniPert representations saved to adata.uns['UniPert_reps']
❗ 30 perturbagens can not be repersentated and saved to adata.uns['invalid_ptbgs']: 
['Fedratinib (SAR302503, TG101348)', 'Disulfiram ', 'Busulfan ', 'INO-1001 (3-Aminobenzamide)', 'Glesatinib?(MGCD265)', 'Aminoglutethimide ', 'Prednisone ', 'Cimetidine ', 'ENMD-2076 L-(+)-Tartaric acid ', 'Tazemetostat (EPZ-6438)', 'Mocetinostat (MGCD0103)', 'PD173074', 'Valproic acid sodium salt (Sodium valproate)', 'Streptozotocin (STZ)', 'Luminespib (AUY-922, NVP-AUY922)', 'Cediranib (AZD2171)', 'Lomustine ', 'Nilotinib (AMN-107)', 'Iniparib (BSI-201)', 'Mesna ', 'Azacitidine ', 'Capecitabine ', 'Tacedinaline (CI994)', 'Rucaparib (AG-014699,PF-01367338) phosphate', 'Fluorouracil (5-Fluoracil, 5-FU)', 'Fasudil (HA-1077) HCl', 'Lenalidomide (CC-5013)', 'Bisindolylmaleimide IX (Ro 31-8220 Mesylate)', 'Clevudine ', 'Regorafenib (BAY 73-4506)']





In [8]:
len(adata.uns['UniPert_reps'])

158

In [9]:
adata.uns['invalid_ptbgs']

['Fedratinib (SAR302503, TG101348)',
 'Disulfiram ',
 'Busulfan ',
 'INO-1001 (3-Aminobenzamide)',
 'Glesatinib?(MGCD265)',
 'Aminoglutethimide ',
 'Prednisone ',
 'Cimetidine ',
 'ENMD-2076 L-(+)-Tartaric acid ',
 'Tazemetostat (EPZ-6438)',
 'Mocetinostat (MGCD0103)',
 'PD173074',
 'Valproic acid sodium salt (Sodium valproate)',
 'Streptozotocin (STZ)',
 'Luminespib (AUY-922, NVP-AUY922)',
 'Cediranib (AZD2171)',
 'Lomustine ',
 'Nilotinib (AMN-107)',
 'Iniparib (BSI-201)',
 'Mesna ',
 'Azacitidine ',
 'Capecitabine ',
 'Tacedinaline (CI994)',
 'Rucaparib (AG-014699,PF-01367338) phosphate',
 'Fluorouracil (5-Fluoracil, 5-FU)',
 'Fasudil (HA-1077) HCl',
 'Lenalidomide (CC-5013)',
 'Bisindolylmaleimide IX (Ro 31-8220 Mesylate)',
 'Clevudine ',
 'Regorafenib (BAY 73-4506)']
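
Several of the names above carry trailing spaces (`'Disulfiram '`), a stray `?` (`'Glesatinib?(MGCD265)'`), or parenthetical synonyms (`'Nilotinib (AMN-107)'`), any of which may interfere with a name-based lookup. One possible workaround is to derive cleaner candidate names before querying again. The helper below is only a sketch: `candidate_names` is a hypothetical function, and whether the cleaned names actually resolve to SMILES has not been verified here.

```python
import re

def candidate_names(raw: str) -> list:
    """Return de-duplicated candidate query names derived from a raw label."""
    cleaned = raw.replace('?', ' ').strip()
    candidates = [cleaned]
    # Split "Name (syn1, syn2) suffix" into the leading name and its synonyms.
    match = re.match(r'^(.*?)\s*\((.*)\)\s*(.*)$', cleaned)
    if match:
        head, inner, _tail = match.groups()
        if head:
            candidates.append(head)
        candidates.extend(s.strip() for s in inner.split(',') if s.strip())
    # Preserve order while dropping duplicates.
    return list(dict.fromkeys(candidates))

for raw in ['Disulfiram ', 'Glesatinib?(MGCD265)', 'Fedratinib (SAR302503, TG101348)']:
    print(repr(raw), '->', candidate_names(raw))
```

Labels that use parentheses as chemical notation, such as `'ENMD-2076 L-(+)-Tartaric acid '`, would be split incorrectly by this pattern and need separate handling.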