## Tutorial: Encode Chemical Perturbagens (Small Molecules) with UniPert


In this tutorial, we introduce two ways to obtain `chemical perturbagen embeddings` with `UniPert`:


1. **Generating from a Compound-SMILES File**: By providing a file containing query compounds and their `SMILES`, we can directly generate UniPert embeddings of the queries. This method suits cases where you already have the SMILES of the query perturbagens, and it is `faster` because it requires no additional data retrieval.

2. **Generating from a Compound Name List**: If you have a list of `compound names`, UniPert can generate the embeddings by first looking up the corresponding SMILES through the `PubChem API` or the `ChemSpider API`. This method is useful when you have compound names but not their SMILES; it requires `internet access` and is `slower` because of the SMILES retrieval step.

Please follow the steps and choose the appropriate method based on your needs:

1. [Prepare UniPert model](#prepare-unipert-model)
2. [Generate perturbagen embeddings](#generate-perturbagen-embeddings)
   
   * [From Compound-SMILES File](#from-compound-smiles-file)
   * [From Compound Name List](#from-compound-name-list)
  
3. [Save embeddings](#save-embedddings)

## Prepare UniPert model

In [1]:
import sys
sys.path.append('../')
from unipert import UniPert

unipert = UniPert()

💡 Constructing UniPert model...
✅ ESM2 model loaded.
✅ Reference ESM2 embedding file loaded.
✅ ESM2 embedder created.
✅ ECFP4 embedder loaded.
✅ UniPert model constructed.
✅ Pretrained model file loaded.
✅ Reference target graph prepared.
✅ Model loaded and initialized.


## Generate perturbagen embeddings

### From Compound-SMILES File


#### Load compound-SMILES file and convert to dict as input

##### load and convert .csv file

In [2]:
# Assuming the CSV has columns 'cmpdname' and 'canonicalsmiles' corresponding to query compounds and their SMILES

import pandas as pd

df = pd.read_csv('../demo_data/PubChem_compound_text_asprin.csv')
compound_dict = df.set_index('cmpdname')['canonicalsmiles'].to_dict()
compound_dict


{'Aspirin': 'CC(=O)OC1=CC=CC=C1C(=O)O'}
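
Note that `Series.to_dict()` keeps only the last value when a name appears more than once, and rows with a missing SMILES field end up as `NaN` values in the dictionary. The sketch below shows one way to build the same kind of name-to-SMILES dictionary with the standard-library `csv` module while reporting such rows. The helper name `read_compound_csv` and the inline sample data are illustrative assumptions, not part of UniPert.

```python
import csv
import io

def read_compound_csv(handle, name_col, smiles_col, delimiter=","):
    """Build {name: SMILES}; also return names skipped as empty or duplicated."""
    compound_dict, skipped = {}, []
    for row in csv.DictReader(handle, delimiter=delimiter):
        name = (row.get(name_col) or "").strip()
        smiles = (row.get(smiles_col) or "").strip()
        # Keep the first occurrence of each name; report everything else.
        if not name or not smiles or name in compound_dict:
            skipped.append(name)
            continue
        compound_dict[name] = smiles
    return compound_dict, skipped

# Inline stand-in for a file on disk (illustrative data only).
sample = io.StringIO(
    "cmpdname,canonicalsmiles\n"
    "Aspirin,CC(=O)OC1=CC=CC=C1C(=O)O\n"
    "Hydrazine,NN\n"
    "Hydrazine,NN\n"
    "Unknown,\n"
)
compound_dict, skipped = read_compound_csv(sample, "cmpdname", "canonicalsmiles")
print(compound_dict)
print(skipped)
```

For a semicolon-delimited export, pass `delimiter=";"` together with the matching column names.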

In [5]:
# Assuming the CSV has columns 'Name' and 'Smiles' corresponding to query compounds and their SMILES

import pandas as pd

df = pd.read_csv('../demo_data/ChEMBL_compounds.csv', delimiter=';')
compound_dict = df.set_index('Name')['Smiles'].to_dict()
compound_dict


{'HYDRAZINE': 'NN',
 'ADENOSINE DIPHOSPHATE': 'Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)(O)OP(=O)(O)O)[C@@H](O)[C@H]1O',
 'BENZO[DEF]CHRYSENE': 'c1ccc2c(c1)cc1ccc3cccc4ccc2c1c34',
 'EUGENOL': 'C=CCc1ccc(O)c(OC)c1',
 'DCFBC F-18': 'O=C(O)CC[C@H](NC(=O)N[C@@H](CSCc1ccc([18F])cc1)C(=O)O)C(=O)O',
 'DCFBC': 'O=C(O)CC[C@H](NC(=O)N[C@@H](CSCc1ccc(F)cc1)C(=O)O)C(=O)O',
 'PARAXANTHINE': 'Cn1c(=O)[nH]c2ncn(C)c2c1=O',
 'RHODAMINE 6G': 'CC/N=c1\\cc2oc3cc(NCC)c(C)cc3c(-c3ccccc3C(=O)OCC)c-2cc1C',
 'BULOXIBUTID': 'CCCCOC(=O)NS(=O)(=O)c1sc(CC(C)C)cc1-c1ccc(Cn2ccnc2)cc1',
 'METHYL-D9-CHOLINE': '[2H]C([2H])([2H])[N+](CCO)(C([2H])([2H])[2H])C([2H])([2H])[2H]',
 'HESPERETIN': 'COc1ccc([C@@H]2CC(=O)c3c(O)cc(O)cc3O2)cc1O',
 'CHOLINE': 'C[N+](C)(C)CCO',
 'JHU-75528 C-11': '[11CH3]Oc1ccc(-c2c(C#N)c(C(=O)NN3CCCCC3)nn2-c2ccc(Cl)cc2Cl)cc1',
 'DSP-0390': 'CC1(C)Oc2cc(Cl)ccc2[C@@H]2OCC3(CCN(CCn4ccnc4)CC3)C[C@H]21',
 'CITRULLINE MALATE': 'NC(=O)NCCC[C@H](N)C(=O)O.O=C(O)CC(O)C(=O)O',
 'NOP-1A': 'CNC(=O)[C@@H](Cc1ccccc1F)C

##### load and convert .xlsx file

In [6]:
# The .xlsx file should have columns 'name' and 'smiles'.

import pandas as pd

df = pd.read_excel('../demo_data/test_compounds.xlsx')
compound_dict = df.set_index('name')['smiles'].to_dict()

compound_dict

{'Aspirin': 'CC(=O)OC1=CC=CC=C1C(=O)O',
 'Hydrazine': 'N2H4',
 'Adrenaline': 'C9H13NO3',
 'Caffeine': 'C8H10N4O2'}
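
The last three values in this output (`N2H4`, `C9H13NO3`, `C8H10N4O2`) look like molecular formulas rather than SMILES strings, so it may be worth screening a spreadsheet before encoding it. The sketch below is a rough standard-library heuristic, not a SMILES parser: it flags strings in which a ring-closure digit is opened but never closed, which is what a formula such as `C8H10N4O2` looks like when read as SMILES. The function name `has_unclosed_ring` is an assumption made for illustration; a cheminformatics toolkit such as RDKit (`Chem.MolFromSmiles` returns `None` for unparseable input) gives a more reliable check.

```python
import re

def has_unclosed_ring(smiles):
    """Heuristic: True if a ring-closure label is opened but never closed."""
    # Digits inside bracket atoms (e.g. [18F], [2H]) are isotopes or counts, not ring labels.
    outside_brackets = re.sub(r"\[[^\]]*\]", "", smiles)
    open_labels = set()
    for label in re.findall(r"%\d{2}|\d", outside_brackets):
        # A label opens a ring the first time it appears and closes it the next time.
        open_labels ^= {label}
    return bool(open_labels)

# Values copied from the spreadsheet output above.
example_dict = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Hydrazine": "N2H4",
    "Adrenaline": "C9H13NO3",
    "Caffeine": "C8H10N4O2",
}
suspicious = [name for name, smi in example_dict.items() if has_unclosed_ring(smi)]
print(suspicious)
```

The check only catches one class of malformed input; a string can pass it and still be invalid SMILES.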

##### load and convert .txt file

In [8]:
# The .txt file should have two columns: compound name and SMILES, separated by a tab.

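compound_dict = {}
with open('../demo_data/test_compounds.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:  # skip blank lines, e.g. a trailing newline at the end of the file
            continue
        name, smiles = line.split('\t')  # tab-separated: compound name, then SMILES
        compound_dict[name] = smiles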

compound_dict

{'Aspirin': 'CC(=O)OC1=CC=CC=C1C(=O)O',
 'Hydrazine': 'NN',
 'Adrenaline': 'CNC[C@@H](C1=CC(=C(C=C1)O)O)O',
 'Caffeine': 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'}

#### Generate UniPert embeddings

In [3]:
out_embs, invalid_inputs = unipert.enc_chem_ptbgs_from_dict(compound_dict)

100%|██████████| 4/4 [00:00<00:00, 538.96it/s]


In [4]:
out_embs.keys(), invalid_inputs

(dict_keys(['Aspirin', 'Hydrazine', 'Adrenaline', 'Caffeine']), [])

In [5]:
import numpy as np

combined_embs = np.concatenate([emb.reshape(1, -1) for emb in out_embs.values()], axis=0)
combined_embs.shape

(4, 256)
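
Because `combined_embs` drops the dictionary keys, it can help to keep a parallel list of names so that each row can be traced back to its compound. The following is a minimal sketch that uses random placeholder vectors instead of real UniPert outputs. The 256-dimensional size is taken from the shape printed above; `fake_embs`, `names`, and `row_of` are illustrative names, and the sketch assumes each embedding can be converted to a NumPy array.

```python
import numpy as np

# Placeholder embeddings standing in for the dictionary returned by UniPert.
rng = np.random.default_rng(0)
fake_embs = {
    name: rng.normal(size=256)
    for name in ["Aspirin", "Hydrazine", "Adrenaline", "Caffeine"]
}

names = list(fake_embs)  # row order follows dict insertion order
matrix = np.stack([np.asarray(fake_embs[n]).reshape(-1) for n in names])
row_of = {name: i for i, name in enumerate(names)}  # name -> row index

print(matrix.shape)
print(np.allclose(matrix[row_of["Caffeine"]], fake_embs["Caffeine"]))
```

Saving `names` alongside the matrix keeps the mapping available after the dictionary is gone.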

### From Compound Name List

#### Generate UniPert embeddings for compound name list

UniPert will retrieve the corresponding SMILES via the PubChem or ChemSpider API and generate the embeddings.

In [3]:
cp_list = ['Lepirudin', 'Cetuximab', 'Bivalirudin', 'Aspirin']

In [4]:
out_embs, invalid_inputs = unipert.enc_chem_ptbgs_from_compound_names(cp_list)

✅ chemspider server connected successfully.
Unable to retrieve SMILES for query compound name: Lepirudin
Unable to retrieve SMILES for query compound name: Cetuximab
Unable to retrieve SMILES for query compound name: Bivalirudin
Unable to retrieve SMILES for query compound name: Aspirin


0it [00:00, ?it/s]
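
When lookups fail as in the log above, it is worth checking which queries actually received an embedding before moving on. This is a minimal sketch, assuming the returned dictionary is keyed by the query names as in the earlier examples. `returned_embs` is a hand-written placeholder here, not real UniPert output.

```python
queries = ["Lepirudin", "Cetuximab", "Bivalirudin", "Aspirin"]

# Placeholder for the dictionary returned by the encoder (illustrative only).
returned_embs = {"Aspirin": [0.0] * 256}

# Split the queries by whether an embedding came back for them.
found = [q for q in queries if q in returned_embs]
missing = [q for q in queries if q not in returned_embs]

print(found)
print(missing)
```

Names that end up in `missing` can still be encoded through the Compound-SMILES route if you can obtain their SMILES from another source.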


In [12]:
out_embs.keys(), invalid_inputs

(dict_keys(['Lepirudin', 'Cetuximab', 'Bivalirudin', 'Aspirin']), [])

In [13]:
import numpy as np

combined_embs = np.concatenate([emb.reshape(1, -1) for emb in out_embs.values()], axis=0)
combined_embs.shape

(4, 256)

## Save embedddings

In [14]:
import pickle

with open('../demo_data/embeddings_output.pkl', 'wb') as f:
    pickle.dump(out_embs, f) 
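
To reuse the saved embeddings later, the file can be read back with `pickle.load`. The sketch below shows the round trip on a small placeholder dictionary written to a temporary directory. The values are plain lists standing in for real embeddings, and the temporary path is used only so the example does not touch `demo_data`.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the embedding dictionary (illustrative only).
embs = {"Aspirin": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings_output.pkl")
    with open(path, "wb") as f:
        pickle.dump(embs, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)  # only unpickle files from a trusted source

print(loaded == embs)
```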
