## Tutorial: Encode Genetic Perturbagens (Genes/Proteins) with UniPert


In this tutorial, we introduce two ways to obtain `genetic perturbagen embeddings` using `UniPert`:


1. **Generating from a FASTA File**: Given a `FASTA file` containing `amino acid sequences`, UniPert directly generates embeddings for the entries in the file. This method suits cases where you have already downloaded the sequences of the query perturbagens, and it is `faster` because no sequence retrieval is needed.

2. **Generating from a Gene Name List**: Given a list of `gene names`, UniPert retrieves the corresponding amino acid sequences through the `UniProt API` and then generates the embeddings. This method is useful when you have gene names but no sequence data; it requires `internet access` and is `slower` because of the retrieval step.

Please follow the steps and choose the appropriate method based on your needs:

1. [Prepare UniPert model](#prepare-unipert-model)
2. [Generate perturbagen embeddings](#generate-perturbagen-embeddings)
   
   * [From FASTA File](#from-fasta-file)
   * [From Gene Name List](#from-gene-name-list)
  
3. [Save embeddings](#save-embedddings)


## Prepare UniPert model

In [4]:
import sys
sys.path.append('../')
from unipert import UniPert

unipert = UniPert()

💡 CUDA is available. Using CUDA.
💡 Constructing UniPert model...
✅ ESM2 model loaded.
✅ Reference ESM2 embedding file loaded.
✅ ESM2 embedder created.
✅ ECFP4 embedder created.
✅ UniPert model constructed.
✅ Pretrained model file loaded.
✅ Reference target graph prepared.
✅ Model loaded and initialized.


## Generate perturbagen embeddings

### From FASTA File

#### Generate UniPert Embeddings for FASTA file

The example FASTA file used here was downloaded from the UniProt website.


In [5]:
out_embs = unipert.enc_gene_ptbgs_from_fasta(
    custom_seq_fasta='../demo_data/UniProt_target_sequence.fasta', 
    save=False
    )

💡 Constructing reference-custom target graph from ../demo_data/UniProt_target_sequence.fasta...
💡 Preparing MMseqs and creating reference database...
[01:57:09 +08:00] [mmseqs] [---I---] [thread 1558178] Converting sequences
✅ MMseqs reference database created.
[01:57:09 +08:00] [mmseqs] [---I---] [thread 1558178] 
[01:57:09 +08:00] [mmseqs] [---I---] [thread 1558178] Time for merging to ref_h: 0h 0m 0s 0ms
[01:57:09 +08:00] [mmseqs] [---I---] [thread 1558178] Time for merging to ref: 0h 0m 0s 0ms
[01:57:09 +08:00] [mmseqs] [---I---] [thread 1558178] Database type: Aminoacid
[01:57:09 +08:00] [mmseqs] [---I---] [thread 1558178] Time for processing: 0h 0m 0s 86ms
💡 Calculating similarity between ../demo_data/UniProt_target_sequence.fasta and reference fasta file...
[01:57:09 +08:00] [mmseqs] [---I---] [thread 1558178] Temporary path /data0/lsn1/lisn/VCC/UniPert/mmseqs_storage/workdir/tmp_Ha4DAqccW1NyPdYraKfTbg8Xy does not exist or is not a directory. It will be created.
[01:57:09 +08:00

100%|██████████| 2/2 [00:00<00:00,  4.01it/s]

✅ ESM2 embeddings with 4 querys generated.





In [6]:
out_embs.keys()

dict_keys(['Q9NZQ7', 'Q5T0T0'])
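The keys above look like UniProt accessions taken from the FASTA headers. As a rough sketch for checking which entries a FASTA file contains before encoding it, the snippet below parses UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME description`. The helper `read_fasta_accessions` is hypothetical and not part of UniPert, and the entry names and sequences in the demo are made-up placeholders.

```python
from io import StringIO


def read_fasta_accessions(handle):
    """Return {accession: sequence} for FASTA records read from a text handle.

    Hypothetical helper (not part of UniPert). Assumes UniProt-style headers
    such as '>sp|ACCESSION|ENTRY_NAME description'; for any other header the
    first whitespace-delimited token is used as the key.
    """
    records = {}
    current = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts = header.split("|")
            if len(parts) >= 3:
                current = parts[1]
            else:
                current = (header.split() or [""])[0]
            records[current] = []
        elif current is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[current].append(line)
    return {acc: "".join(chunks) for acc, chunks in records.items()}


# Placeholder entry names and toy sequences, for illustration only.
demo = StringIO(
    ">sp|Q9NZQ7|EXAMPLE_1 placeholder description\n"
    "MRIFAVFIFM\n"
    "TYWHLLNA\n"
    ">sp|Q5T0T0|EXAMPLE_2 placeholder description\n"
    "MSMPLHQISA\n"
)
seqs = read_fasta_accessions(demo)
print(list(seqs))
```

For a real file, pass an open file handle instead of the `StringIO` object, for example `open('../demo_data/UniProt_target_sequence.fasta')`.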

### From Gene Name List

#### Generate UniPert embeddings for gene name list

UniPert retrieves the corresponding canonical amino acid sequences via the UniProt website API and generates the embeddings.
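To give a feel for what such a retrieval step can look like, here is a minimal sketch that builds a query against the public UniProt REST search endpoint. It is not necessarily how UniPert performs the lookup internally: the helper names `build_query_url` and `fetch_fasta`, the restriction to human (`organism_id:9606`), and the restriction to reviewed entries are all assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_query_url(gene_name, organism_id=9606, reviewed=True):
    """Build a UniProtKB search URL for one gene name (hypothetical helper).

    Assumptions: human by default, and only reviewed (Swiss-Prot) entries.
    """
    clauses = [f"gene_exact:{gene_name}", f"organism_id:{organism_id}"]
    if reviewed:
        clauses.append("reviewed:true")
    params = {"query": " AND ".join(clauses), "format": "fasta", "size": 1}
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


def fetch_fasta(gene_name, timeout=30):
    """Download the FASTA text for one gene name. Requires internet access."""
    with urlopen(build_query_url(gene_name), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Only the URL is built here, so this runs without network access.
url = build_query_url("IRF7")
print(url)
```

Calling `fetch_fasta("IRF7")` would perform the actual request. An empty response means the query matched no entry, which is one way a gene name could end up being reported as invalid.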

In [7]:
gn_list = ['ETV7', 'IFNGR1', 'IRF7', 'PDL1', 'MARCH8', 'IRF1', 'IFNGR2', 'STAT2', 'ATF2', 'CAV1']

In [8]:
out_embs, invalid_inputs = unipert.enc_gene_ptbgs_from_gene_names(gene_names=gn_list)

✅ 19187 reference targets encoded.
💡 Encoding 10 genetic perturbagens with UniPert...


100%|██████████| 10/10 [00:02<00:00,  3.95it/s]

💡 Constructing reference-custom target graph from ../demo_data/UniProt_target_sequence.fasta...
💡 Calculating similarity between ../demo_data/UniProt_target_sequence.fasta and reference fasta file...





[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] Temporary path /data0/lsn1/lisn/VCC/UniPert/mmseqs_storage/workdir/tmp_L7H9uSFVJDkR4PAAmiDU3wMt6 does not exist or is not a directory. It will be created.
[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] Created temporary directory /data0/lsn1/lisn/VCC/UniPert/mmseqs_storage/workdir/tmp_L7H9uSFVJDkR4PAAmiDU3wMt6
[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] Converting sequences
[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] 
[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] Time for merging to query_h: 0h 0m 0s 0ms
[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] Time for merging to query: 0h 0m 0s 0ms
[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] Database type: Aminoacid
[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] Time for processing: 0h 0m 0s 0ms
[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] Call search (subcall): search
[01:57:26 +08:00] [mmseqs] [---I---] [thread 1558178] Se

100%|██████████| 2/2 [00:00<00:00, 32.34it/s]

✅ ESM2 embeddings with 4 querys generated.





✅ 10 encoded succesfully, 0 failed.


In [9]:
out_embs.keys(), invalid_inputs

(dict_keys(['IFNGR1', 'STAT2', 'IRF7', 'ETV7', 'ATF2', 'CAV1', 'IFNGR2', 'IRF1', 'PDL1', 'MARCH8']),
 [])

In [10]:
import numpy as np

combined_embs = np.concatenate([emb.reshape(1, -1) for emb in out_embs.values()], axis=0)
combined_embs.shape

(10, 256)
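Once the embeddings are stacked into a matrix, it helps to keep the gene names in the same order as the rows. The sketch below does this and computes pairwise cosine similarities. It uses random 256-dimensional vectors as stand-ins for `out_embs`, and it assumes that each embedding is a NumPy array.

```python
import numpy as np

# Random stand-ins for UniPert embeddings (assumption: one 256-d array per gene).
rng = np.random.default_rng(0)
toy_embs = {name: rng.normal(size=256) for name in ["IRF1", "IRF7", "STAT2"]}

# Fix the row order once so names[i] always describes matrix[i].
names = list(toy_embs)
matrix = np.stack([np.asarray(toy_embs[name]).reshape(-1) for name in names])

# Cosine similarity: normalise each row to unit length, then take dot products.
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
cosine = unit @ unit.T

print(matrix.shape)
print(cosine.shape)
```

With the real dictionary, replace `toy_embs` with `out_embs`; `cosine[i, j]` then compares `names[i]` with `names[j]`.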

## Save embedddings

In [8]:
import pickle

with open('../demo_data/embeddings_output.pkl', 'wb') as f:
    pickle.dump(out_embs, f) 
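
To reuse the saved embeddings in a later session, the pickle file can be loaded back into a dictionary. The sketch below shows the round trip with a toy dictionary and a temporary directory; the gene names and values are placeholders, and the real path would be `'../demo_data/embeddings_output.pkl'`.

```python
import os
import pickle
import tempfile

import numpy as np

# Placeholder embeddings standing in for out_embs.
toy_embs = {
    "GENE_A": np.zeros(256, dtype=np.float32),
    "GENE_B": np.ones(256, dtype=np.float32),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings_output.pkl")

    # Save, mirroring the cell above.
    with open(path, "wb") as f:
        pickle.dump(toy_embs, f)

    # Load the dictionary back.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(sorted(loaded))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.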
