# 🚀 Step-by-Step Guide

## ⚙️ Installation 

**Please make sure you have installed [Anaconda3](https://www.anaconda.com/download) or [Miniconda3](https://www.anaconda.com/docs/getting-started/miniconda/install#quickstart-install-instructions).**

**Download VenusFactory and install dependencies**

```
# Clone repo
git clone https://github.com/tyang816/VenusFactory.git
cd VenusFactory

# Install dependencies
conda create -n venus python==3.10
conda activate venus # For Windows
# source activate venus # For Linux
pip install -r ./requirements.txt
```

## ✨ Key Features 

### 💻 Supported Methods

VenusFactory supports the following fine-tuning methods:

| Fine-tuning | Description | Type |
|---------|------|------------|
| **Freeze** | Freeze the pre-trained model and fine-tune only the pooling head | Sequence |
| **Full** | Fine-tune all parameters | Sequence |
| **[LoRA](https://arxiv.org/abs/2106.09685)** | Use LoRA (Low-Rank Adaptation) fine-tuning | Sequence |
| **[DoRA](https://arxiv.org/abs/2402.09353)** | Use DoRA (Weight-Decomposed Low-Rank Adaptation) fine-tuning | Sequence |
| **[AdaLoRA](https://arxiv.org/abs/2303.10512)** | Use AdaLoRA (Adaptive Low-Rank Adaptation) fine-tuning | Sequence |
| **[IA3](https://arxiv.org/abs/2205.05638)** | Use IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) fine-tuning | Sequence |
| **[QLoRA](https://arxiv.org/abs/2305.14314)** | Use QLoRA (Quantized Low-Rank Adaptation) fine-tuning | Sequence |
| **[SES-Adapter](https://arxiv.org/abs/2404.14850)** | Use structural adapters to fuse sequence and structural information | Sequence & Structure |
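
To give a rough sense of why the low-rank methods in the table are cheaper than full fine-tuning, the sketch below counts trainable parameters for a single weight matrix. It follows the LoRA formulation (a frozen `d × k` weight plus a trainable rank-`r` update `B·A`). The function names and the example sizes are illustrative assumptions, not VenusFactory code.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r update B (d x r) @ A (r x k)."""
    return r * (d + k)


# Illustrative sizes only: a 320 x 320 projection adapted with rank 8.
d, k, r = 320, 320, 8
print(full_params(d, k))     # 102400
print(lora_params(d, k, r))  # 5120
```

The gap widens as the matrix grows, which is why these methods are usually preferred for the larger models listed later in this guide.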


### 📂 Supported Datasets

<details><summary>Pre-training datasets</summary>


- [CATH_V43_S40](https://huggingface.co/datasets/tyang816/cath) | structures

</details>

<details><summary>Supervised fine-tuning datasets (amino acid sequences / foldseek sequences / ss8 sequences)</summary>

- DeepLocBinary | protein-wise | single_label_classification
    - [DeepLocBinary_AlphaFold2](https://huggingface.co/datasets/tyang816/DeepLocBinary_AlphaFold2)
    - [DeepLocBinary_ESMFold](https://huggingface.co/datasets/tyang816/DeepLocBinary_ESMFold)
- DeepLocMulti | protein-wise | single_label_classification
    - [DeepLocMulti_AlphaFold2](https://huggingface.co/datasets/tyang816/DeepLocMulti_AlphaFold2)
    - [DeepLocMulti_ESMFold](https://huggingface.co/datasets/tyang816/DeepLocMulti_ESMFold)
- DeepLoc2Multi | protein-wise | single_label_classification
    - [DeepLoc2Multi_AlphaFold2](https://huggingface.co/datasets/tyang816/DeepLoc2Multi_AlphaFold2)
    - [DeepLoc2Multi_ESMFold](https://huggingface.co/datasets/tyang816/DeepLoc2Multi_ESMFold)
- DeepSol | protein-wise | single_label_classification
    - [DeepSol_ESMFold](https://huggingface.co/datasets/tyang816/DeepSol_ESMFold)
- DeepSoluE | protein-wise | single_label_classification
    - [DeepSoluE_ESMFold](https://huggingface.co/datasets/tyang816/DeepSoluE_ESMFold)
- ProtSolM | protein-wise | single_label_classification
    - [ProtSolM_ESMFold](https://huggingface.co/datasets/tyang816/ProtSolM_ESMFold)
- eSOL | protein-wise | regression
    - [eSOL_AlphaFold2](https://huggingface.co/datasets/tyang816/eSOL_AlphaFold2)
    - [eSOL_ESMFold](https://huggingface.co/datasets/tyang816/eSOL_ESMFold)
- DeepET_Topt | protein-wise | regression
    - [DeepET_Topt_AlphaFold2](https://huggingface.co/datasets/tyang816/DeepET_Topt_AlphaFold2)
    - [DeepET_Topt_ESMFold](https://huggingface.co/datasets/tyang816/DeepET_Topt_ESMFold)
- EC | protein-wise | multi_label_classification
    - [EC_AlphaFold2](https://huggingface.co/datasets/tyang816/EC_AlphaFold2)
    - [EC_ESMFold](https://huggingface.co/datasets/tyang816/EC_ESMFold)
- GO_BP | protein-wise | multi_label_classification
    - [GO_BP_AlphaFold2](https://huggingface.co/datasets/tyang816/GO_BP_AlphaFold2)
    - [GO_BP_ESMFold](https://huggingface.co/datasets/tyang816/GO_BP_ESMFold)
- GO_CC | protein-wise | multi_label_classification
    - [GO_CC_AlphaFold2](https://huggingface.co/datasets/tyang816/GO_CC_AlphaFold2)
    - [GO_CC_ESMFold](https://huggingface.co/datasets/tyang816/GO_CC_ESMFold)
- GO_MF | protein-wise | multi_label_classification
    - [GO_MF_AlphaFold2](https://huggingface.co/datasets/tyang816/GO_MF_AlphaFold2)
    - [GO_MF_ESMFold](https://huggingface.co/datasets/tyang816/GO_MF_ESMFold)
- MetalIonBinding | protein-wise | single_label_classification
    - [MetalIonBinding_AlphaFold2](https://huggingface.co/datasets/tyang816/MetalIonBinding_AlphaFold2)
    - [MetalIonBinding_ESMFold](https://huggingface.co/datasets/tyang816/MetalIonBinding_ESMFold)
- Thermostability | protein-wise | regression
    - [Thermostability_AlphaFold2](https://huggingface.co/datasets/tyang816/Thermostability_AlphaFold2)
    - [Thermostability_ESMFold](https://huggingface.co/datasets/tyang816/Thermostability_ESMFold)

> ✨ For a given dataset, only the structural sequences differ between variants. For example, ``DeepLocBinary_ESMFold`` and ``DeepLocBinary_AlphaFold2`` share the same amino acid sequences, so if you only need ``aa_seqs``, either one works.

</details>
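
The structure-aware datasets above follow a `{dataset}_{pdb_type}` naming pattern, and the training commands later in this guide point `--dataset_config` at `data/{dataset}/{dataset}_{pdb_type}_HF.json`. The helper below is a small sketch that builds both names. `dataset_paths` is a hypothetical function written for illustration. It does not check that a given variant actually exists (for example, DeepSol is listed with ESMFold only).

```python
from typing import Optional


def dataset_paths(dataset: str, pdb_type: Optional[str] = None) -> tuple[str, str]:
    """Return (Hugging Face repo id, local config path) for a dataset variant."""
    # Sequence-only datasets have no structure suffix.
    name = f"{dataset}_{pdb_type}" if pdb_type else dataset
    return f"tyang816/{name}", f"data/{dataset}/{name}_HF.json"


print(dataset_paths("eSOL", "AlphaFold2"))
# ('tyang816/eSOL_AlphaFold2', 'data/eSOL/eSOL_AlphaFold2_HF.json')
print(dataset_paths("eSOL"))
# ('tyang816/eSOL', 'data/eSOL/eSOL_HF.json')
```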

<details><summary>Supervised fine-tuning datasets (amino acid sequences)</summary>

- [Demo_Solubility](https://huggingface.co/datasets/tyang816/Demo_Solubility) | protein-wise | single_label_classification
- [DeepLocBinary](https://huggingface.co/datasets/tyang816/DeepLocBinary) | protein-wise | single_label_classification
- [DeepLocMulti](https://huggingface.co/datasets/tyang816/DeepLocMulti) | protein-wise | single_label_classification
- [DeepLoc2Multi](https://huggingface.co/datasets/tyang816/DeepLoc2Multi) | protein-wise | single_label_classification
- [DeepSol](https://huggingface.co/datasets/tyang816/DeepSol) | protein-wise | single_label_classification
- [DeepSoluE](https://huggingface.co/datasets/tyang816/DeepSoluE) | protein-wise | single_label_classification
- [ProtSolM](https://huggingface.co/datasets/tyang816/ProtSolM) | protein-wise | single_label_classification
- [eSOL](https://huggingface.co/datasets/tyang816/eSOL) | protein-wise | regression
- [DeepET_Topt](https://huggingface.co/datasets/tyang816/DeepET_Topt) | protein-wise | regression
- [EC](https://huggingface.co/datasets/tyang816/EC) | protein-wise | multi_label_classification
- [GO_BP](https://huggingface.co/datasets/tyang816/GO_BP) | protein-wise | multi_label_classification
- [GO_CC](https://huggingface.co/datasets/tyang816/GO_CC) | protein-wise | multi_label_classification
- [GO_MF](https://huggingface.co/datasets/tyang816/GO_MF) | protein-wise | multi_label_classification
- [MetalIonBinding](https://huggingface.co/datasets/tyang816/MetalIonBinding) | protein-wise | single_label_classification
- [Thermostability](https://huggingface.co/datasets/tyang816/Thermostability) | protein-wise | regression
- [PaCRISPR](https://huggingface.co/datasets/tyang816/PaCRISPR) | protein-wise
- [PETA_CHS_Sol](https://huggingface.co/datasets/tyang816/PETA_CHS_Sol) | protein-wise
- [PETA_LGK_Sol](https://huggingface.co/datasets/tyang816/PETA_LGK_Sol) | protein-wise
- [PETA_TEM_Sol](https://huggingface.co/datasets/tyang816/PETA_TEM_Sol) | protein-wise
- [SortingSignal](https://huggingface.co/datasets/tyang816/SortingSignal) | protein-wise
- FLIP_AAV | protein-site | regression
    - [FLIP_AAV_one-vs-rest](https://huggingface.co/datasets/tyang816/FLIP_AAV_one-vs-rest), [FLIP_AAV_two-vs-rest](https://huggingface.co/datasets/tyang816/FLIP_AAV_two-vs-rest), [FLIP_AAV_mut-des](https://huggingface.co/datasets/tyang816/FLIP_AAV_mut-des), [FLIP_AAV_des-mut](https://huggingface.co/datasets/tyang816/FLIP_AAV_des-mut), [FLIP_AAV_seven-vs-rest](https://huggingface.co/datasets/tyang816/FLIP_AAV_seven-vs-rest), [FLIP_AAV_low-vs-high](https://huggingface.co/datasets/tyang816/FLIP_AAV_low-vs-high), [FLIP_AAV_sampled](https://huggingface.co/datasets/tyang816/FLIP_AAV_sampled)
- FLIP_GB1 | protein-site | regression
    - [FLIP_GB1_one-vs-rest](https://huggingface.co/datasets/tyang816/FLIP_GB1_one-vs-rest), [FLIP_GB1_two-vs-rest](https://huggingface.co/datasets/tyang816/FLIP_GB1_two-vs-rest), [FLIP_GB1_three-vs-rest](https://huggingface.co/datasets/tyang816/FLIP_GB1_three-vs-rest), [FLIP_GB1_low-vs-high](https://huggingface.co/datasets/tyang816/FLIP_GB1_low-vs-high), [FLIP_GB1_sampled](https://huggingface.co/datasets/tyang816/FLIP_GB1_sampled)
- [TAPE_Fluorescence](https://huggingface.co/datasets/tyang816/TAPE_Fluorescence) | protein-site | regression
- [TAPE_Stability](https://huggingface.co/datasets/tyang816/TAPE_Stability) | protein-site | regression

</details>


### 📈 Supported Metrics


| Name          | Torchmetrics     | Problem Type                                            |
| ------------- | ---------------- | ------------------------------------------------------- |
| accuracy      | Accuracy         | single_label_classification/ multi_label_classification |
| recall        | Recall           | single_label_classification/ multi_label_classification |
| precision     | Precision        | single_label_classification/ multi_label_classification |
| f1            | F1Score          | single_label_classification/ multi_label_classification |
| mcc           | MatthewsCorrCoef | single_label_classification/ multi_label_classification |
| auc           | AUROC            | single_label_classification/ multi_label_classification |
| f1_max        | F1ScoreMax       | multi_label_classification                              |
| spearman_corr | SpearmanCorrCoef | regression                                              |
| mse           | MeanSquaredError | regression                                              |

### 🧠 Supported Models

<details>
<summary>ESM Series Models: Meta AI's protein language models</summary>

| Model | Size | Parameters | GPU Memory | Training Data | Template |
|-------|------|------------|------------|---------------|----------|
| ESM2-8M | 8M | 8M | 2GB+ | UR50/D | [facebook/esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D) |
| ESM2-35M | 35M | 35M | 4GB+ | UR50/D | [facebook/esm2_t12_35M_UR50D](https://huggingface.co/facebook/esm2_t12_35M_UR50D) |
| ESM2-150M | 150M | 150M | 8GB+ | UR50/D | [facebook/esm2_t30_150M_UR50D](https://huggingface.co/facebook/esm2_t30_150M_UR50D) |
| ESM2-650M | 650M | 650M | 16GB+ | UR50/D | [facebook/esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D) |
| ESM2-3B | 3B | 3B | 24GB+ | UR50/D | [facebook/esm2_t36_3B_UR50D](https://huggingface.co/facebook/esm2_t36_3B_UR50D) |
| ESM2-15B | 15B | 15B | 40GB+ | UR50/D | [facebook/esm2_t48_15B_UR50D](https://huggingface.co/facebook/esm2_t48_15B_UR50D) |
| ESM-1b | 650M | 650M | 16GB+ | UR50/S | [facebook/esm1b_t33_650M_UR50S](https://huggingface.co/facebook/esm1b_t33_650M_UR50S) |
| ESM-1v-1 | 650M | 650M | 16GB+ | UR90/S | [facebook/esm1v_t33_650M_UR90S_1](https://huggingface.co/facebook/esm1v_t33_650M_UR90S_1) |
| ESM-1v-2 | 650M | 650M | 16GB+ | UR90/S | [facebook/esm1v_t33_650M_UR90S_2](https://huggingface.co/facebook/esm1v_t33_650M_UR90S_2) |
| ESM-1v-3 | 650M | 650M | 16GB+ | UR90/S | [facebook/esm1v_t33_650M_UR90S_3](https://huggingface.co/facebook/esm1v_t33_650M_UR90S_3) |
| ESM-1v-4 | 650M | 650M | 16GB+ | UR90/S | [facebook/esm1v_t33_650M_UR90S_4](https://huggingface.co/facebook/esm1v_t33_650M_UR90S_4) |
| ESM-1v-5 | 650M | 650M | 16GB+ | UR90/S | [facebook/esm1v_t33_650M_UR90S_5](https://huggingface.co/facebook/esm1v_t33_650M_UR90S_5) |

> 💡 ESM2 models are the latest generation, offering better performance than ESM-1b/1v
</details>

<details>
<summary>BERT-based Models: Transformer encoder architecture</summary>

| Model | Size | Parameters | GPU Memory | Training Data | Template |
|-------|------|------------|------------|---------------|----------|
| ProtBert-Uniref100 | 420M | 420M | 12GB+ | UniRef100 | [Rostlab/prot_bert](https://huggingface.co/Rostlab/prot_bert) |
| ProtBert-BFD | 420M | 420M | 12GB+ | BFD100 | [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd) |
| IgBert | 420M | 420M | 12GB+ | Antibody | [Exscientia/IgBert](https://huggingface.co/Exscientia/IgBert) |
| IgBert-unpaired | 420M | 420M | 12GB+ | Antibody | [Exscientia/IgBert_unpaired](https://huggingface.co/Exscientia/IgBert_unpaired) |

> 💡 BFD-trained models generally show better performance on structure-related tasks
</details>

<details>
<summary>T5-based Models: Encoder-decoder architecture</summary>

| Model | Size | Parameters | GPU Memory | Training Data | Template |
|-------|------|------------|------------|---------------|----------|
| ProtT5-XL-UniRef50 | 3B | 3B | 24GB+ | UniRef50 | [Rostlab/prot_t5_xl_uniref50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) |
| ProtT5-XXL-UniRef50 | 11B | 11B | 40GB+ | UniRef50 | [Rostlab/prot_t5_xxl_uniref50](https://huggingface.co/Rostlab/prot_t5_xxl_uniref50) |
| ProtT5-XL-BFD | 3B | 3B | 24GB+ | BFD100 | [Rostlab/prot_t5_xl_bfd](https://huggingface.co/Rostlab/prot_t5_xl_bfd) |
| ProtT5-XXL-BFD | 11B | 11B | 40GB+ | BFD100 | [Rostlab/prot_t5_xxl_bfd](https://huggingface.co/Rostlab/prot_t5_xxl_bfd) |
| IgT5 | 3B | 3B | 24GB+ | Antibody | [Exscientia/IgT5](https://huggingface.co/Exscientia/IgT5) |
| IgT5-unpaired | 3B | 3B | 24GB+ | Antibody | [Exscientia/IgT5_unpaired](https://huggingface.co/Exscientia/IgT5_unpaired) |

> 💡 T5 models can be used for both encoding and generation tasks
</details>

<details>
<summary>Specialized Models: Task-specific architectures</summary>

| Model | Size | Parameters | GPU Memory | Features | Template |
|-------|------|------------|------------|----------|----------|
| Ankh-base | 450M | 450M | 12GB+ | Encoder-decoder | [ElnaggarLab/ankh-base](https://huggingface.co/ElnaggarLab/ankh-base) |
| Ankh-large | 1.2B | 1.2B | 20GB+ | Encoder-decoder | [ElnaggarLab/ankh-large](https://huggingface.co/ElnaggarLab/ankh-large) |
| ProSST-20 | 20 | 110M | 4GB+ | Mutation | [AI4Protein/ProSST-20](https://huggingface.co/AI4Protein/ProSST-20) |
| ProSST-128 | 128 | 110M | 4GB+ | Mutation | [AI4Protein/ProSST-128](https://huggingface.co/AI4Protein/ProSST-128) |
| ProSST-512 | 512 | 110M | 4GB+ | Mutation | [AI4Protein/ProSST-512](https://huggingface.co/AI4Protein/ProSST-512) |
| ProSST-2048 | 2048 | 110M | 4GB+ | Mutation | [AI4Protein/ProSST-2048](https://huggingface.co/AI4Protein/ProSST-2048) |
| ProSST-4096 | 4096 | 110M | 4GB+ | Mutation | [AI4Protein/ProSST-4096](https://huggingface.co/AI4Protein/ProSST-4096) |
| ProPrime-690M | 690M | 690M | 16GB+ | OGT-prediction | [AI4Protein/Prime_690M](https://huggingface.co/AI4Protein/Prime_690M) |

> 💡 These models often excel in specific tasks or offer unique architectural benefits
</details>

<details>
<summary>PETA Models: Tokenization variants</summary>

#### BPE Tokenization Series
| Model | Vocab Size | Parameters | GPU Memory | Template |
|-------|------------|------------|------------|----------|
| PETA-base | base | 35M | 4GB+ | [AI4Protein/deep_base](https://huggingface.co/AI4Protein/deep_base) |
| PETA-bpe-50 | 50 | 35M | 4GB+ | [AI4Protein/deep_bpe_50](https://huggingface.co/AI4Protein/deep_bpe_50) |
| PETA-bpe-200 | 200 | 35M | 4GB+ | [AI4Protein/deep_bpe_200](https://huggingface.co/AI4Protein/deep_bpe_200) |
| PETA-bpe-400 | 400 | 35M | 4GB+ | [AI4Protein/deep_bpe_400](https://huggingface.co/AI4Protein/deep_bpe_400) |
| PETA-bpe-800 | 800 | 35M | 4GB+ | [AI4Protein/deep_bpe_800](https://huggingface.co/AI4Protein/deep_bpe_800) |
| PETA-bpe-1600 | 1600 | 35M | 4GB+ | [AI4Protein/deep_bpe_1600](https://huggingface.co/AI4Protein/deep_bpe_1600) |
| PETA-bpe-3200 | 3200 | 35M | 4GB+ | [AI4Protein/deep_bpe_3200](https://huggingface.co/AI4Protein/deep_bpe_3200) |

#### Unigram Tokenization Series
| Model | Vocab Size | Parameters | GPU Memory | Template |
|-------|------------|------------|------------|----------|
| PETA-unigram-50 | 50 | 35M | 4GB+ | [AI4Protein/deep_unigram_50](https://huggingface.co/AI4Protein/deep_unigram_50) |
| PETA-unigram-100 | 100 | 35M | 4GB+ | [AI4Protein/deep_unigram_100](https://huggingface.co/AI4Protein/deep_unigram_100) |
| PETA-unigram-200 | 200 | 35M | 4GB+ | [AI4Protein/deep_unigram_200](https://huggingface.co/AI4Protein/deep_unigram_200) |
| PETA-unigram-400 | 400 | 35M | 4GB+ | [AI4Protein/deep_unigram_400](https://huggingface.co/AI4Protein/deep_unigram_400) |
| PETA-unigram-800 | 800 | 35M | 4GB+ | [AI4Protein/deep_unigram_800](https://huggingface.co/AI4Protein/deep_unigram_800) |
| PETA-unigram-1600 | 1600 | 35M | 4GB+ | [AI4Protein/deep_unigram_1600](https://huggingface.co/AI4Protein/deep_unigram_1600) |
| PETA-unigram-3200 | 3200 | 35M | 4GB+ | [AI4Protein/deep_unigram_3200](https://huggingface.co/AI4Protein/deep_unigram_3200) |

> 💡 Different tokenization strategies may be better suited for specific tasks
</details>


### 📚 Model Selection Guide

<details>
<summary>How to choose the right model?</summary>

1. **Based on Hardware Constraints:**
   - Limited GPU (<8GB): ESM2-8M, ESM2-35M, ProSST
   - Medium GPU (8-16GB): ESM2-150M, ESM2-650M, ProtBert series
   - High-end GPU (24GB+): ESM2-3B, ProtT5-XL, Ankh-large
   - Multiple GPUs: ESM2-15B, ProtT5-XXL

2. **Based on Task Type:**
   - Sequence classification: ESM2, ProtBert
   - Structure prediction: ESM2, Ankh
   - Generation tasks: ProtT5
   - Antibody design: IgBert, IgT5
   - Lightweight deployment: ProSST, PETA-base

3. **Based on Training Data:**
   - General protein tasks: ESM2, ProtBert
   - Structure-aware tasks: Ankh
   - Antibody-specific: IgBert, IgT5
   - Custom tokenization needs: PETA series

</details>
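
The hardware tiers in the guide can be expressed as a small lookup, sketched below. `suggest_models` is a hypothetical helper, not part of VenusFactory. The guide does not cover the 16-24GB range, so treating it like the medium tier is an assumption made here.

```python
def suggest_models(gpu_memory_gb: float, multi_gpu: bool = False) -> list[str]:
    """Map available GPU memory to the model suggestions from the guide."""
    if multi_gpu:
        return ["ESM2-15B", "ProtT5-XXL"]
    if gpu_memory_gb < 8:
        return ["ESM2-8M", "ESM2-35M", "ProSST"]
    if gpu_memory_gb < 24:
        # The guide lists 8-16GB; extending this tier up to 24GB is an assumption.
        return ["ESM2-150M", "ESM2-650M", "ProtBert series"]
    return ["ESM2-3B", "ProtT5-XL", "Ankh-large"]


print(suggest_models(6))   # ['ESM2-8M', 'ESM2-35M', 'ProSST']
print(suggest_models(24))  # ['ESM2-3B', 'ProtT5-XL', 'Ankh-large']
```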

## 🔧 Core Workflow

### 1. Fine-tuning Methods
**Use ```--training_method``` to select the fine-tuning method.**

**Use ```--plm_model``` to select the pre-trained model.**

**Use ```--dataset``` to select the dataset.**

VenusFactory supports two batching modes:

**```--batch_size```: fixed batch size; sets the number of sequences processed per batch.**

**```--batch_token```: dynamic token-based batching; limits the total number of tokens per batch.**
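
To illustrate the idea behind token-based batching, the sketch below greedily groups sequences so that each batch stays within a token budget. It is a simplified illustration, not VenusFactory's sampler: one residue is counted as one token, special tokens and padding are ignored, and the function name `batch_by_tokens` is made up for this example.

```python
def batch_by_tokens(sequences, batch_token):
    """Greedily group sequences so each batch holds at most `batch_token` residues.

    A sequence longer than the budget is placed in a batch of its own.
    """
    batches, current, current_tokens = [], [], 0
    for seq in sequences:
        n = len(seq)
        # Close the current batch if adding this sequence would exceed the budget.
        if current and current_tokens + n > batch_token:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(seq)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


seqs = ["A" * 300, "A" * 500, "A" * 400, "A" * 900]
for batch in batch_by_tokens(seqs, batch_token=1000):
    print([len(s) for s in batch])
# [300, 500]
# [400]
# [900]
```

With a fixed `--batch_size`, memory use varies with sequence length; a token budget keeps it roughly constant, at the cost of a variable number of sequences per batch.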

#### Full-tuning

In [1]:
# Optional: use the HF mirror if needed (%env persists across cells; !export only affects its own subshell)
%env HF_ENDPOINT=https://hf-mirror.com
dataset="eSOL"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
lr=5e-4
training_method="full"
sh=f"""
python src/train.py \
    --plm_model {plm_source}/{plm_model} \
    --dataset_config data/{dataset}/{dataset}_HF.json \
    --learning_rate {lr} \
    --gradient_accumulation_steps 8 \
    --num_epochs 10 \
    --batch_token 8000 \
    --patience 3 \
    --output_dir test_res/{dataset}/{plm_model} \
    --output_model_name {training_method}_lr_{lr}_8k_ga8.pt \
    --training_method {training_method}
"""
!{sh}

2025-03-24 18:54:07 - training - INFO - Starting training with configuration:
2025-03-24 18:54:07 - training - INFO - hidden_size: None
2025-03-24 18:54:07 - training - INFO - num_attention_head: 8
2025-03-24 18:54:07 - training - INFO - attention_probs_dropout: 0.1
2025-03-24 18:54:07 - training - INFO - plm_model: facebook/esm2_t6_8M_UR50D
2025-03-24 18:54:07 - training - INFO - pooling_method: mean
2025-03-24 18:54:07 - training - INFO - pooling_dropout: 0.1
2025-03-24 18:54:07 - training - INFO - dataset: tyang816/eSOL
2025-03-24 18:54:07 - training - INFO - dataset_config: data/eSOL/eSOL_HF.json
2025-03-24 18:54:07 - training - INFO - normalize: standard
2025-03-24 18:54:07 - training - INFO - num_labels: 1
2025-03-24 18:54:07 - training - INFO - problem_type: regression
2025-03-24 18:54:07 - training - INFO - pdb_type: None
2025-03-24 18:54:07 - training - INFO - train_file: None
2025-03-24 18:54:07 - training - INFO - valid_file: None
2025-03-24 18:54:07 - training - INFO - test

In [None]:
# Use bash script
!cp ./script/train/train_plm_full.sh ./train_plm_full.sh
!bash ./train_plm_full.sh
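
The command above combines `--batch_token 8000` with `--gradient_accumulation_steps 8`. Assuming gradient accumulation has its usual meaning (gradients from several batches are summed before a single optimizer step), the upper bound on tokens contributing to each update can be worked out as follows. The helper name is hypothetical.

```python
def max_tokens_per_update(batch_token: int, gradient_accumulation_steps: int) -> int:
    """Upper bound on tokens that contribute to one optimizer step."""
    return batch_token * gradient_accumulation_steps


print(max_tokens_per_update(8000, 8))  # 64000
```

If memory is tight, halving `--batch_token` and doubling `--gradient_accumulation_steps` keeps this bound unchanged.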

#### Freeze-tuning

In [3]:
# Optional: use the HF mirror if needed (%env persists across cells; !export only affects its own subshell)
%env HF_ENDPOINT=https://hf-mirror.com
dataset="eSOL"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
lr=5e-4
training_method="freeze"
sh=f"""
python src/train.py \
    --plm_model {plm_source}/{plm_model} \
    --dataset_config data/{dataset}/{dataset}_HF.json \
    --learning_rate {lr} \
    --gradient_accumulation_steps 8 \
    --num_epochs 10 \
    --batch_token 8000 \
    --patience 3 \
    --output_dir test_res/{dataset}/{plm_model} \
    --output_model_name {training_method}_lr_{lr}_8k_ga8.pt \
    --training_method {training_method}
"""
!{sh}

2025-03-24 20:41:53 - training - INFO - Starting training with configuration:
2025-03-24 20:41:53 - training - INFO - hidden_size: None
2025-03-24 20:41:53 - training - INFO - num_attention_head: 8
2025-03-24 20:41:53 - training - INFO - attention_probs_dropout: 0.1
2025-03-24 20:41:53 - training - INFO - plm_model: facebook/esm2_t6_8M_UR50D
2025-03-24 20:41:53 - training - INFO - pooling_method: mean
2025-03-24 20:41:53 - training - INFO - pooling_dropout: 0.1
2025-03-24 20:41:53 - training - INFO - dataset: tyang816/eSOL
2025-03-24 20:41:53 - training - INFO - dataset_config: data/eSOL/eSOL_HF.json
2025-03-24 20:41:53 - training - INFO - normalize: standard
2025-03-24 20:41:53 - training - INFO - num_labels: 1
2025-03-24 20:41:53 - training - INFO - problem_type: regression
2025-03-24 20:41:53 - training - INFO - pdb_type: None
2025-03-24 20:41:53 - training - INFO - train_file: None
2025-03-24 20:41:53 - training - INFO - valid_file: None
2025-03-24 20:41:53 - training - INFO - test

In [None]:
# Use bash script
!cp ./script/train/train_plm_freeze.sh ./train_plm_freeze.sh
!bash ./train_plm_freeze.sh
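
The training commands in this guide pass `--patience 3`. A common reading of such a flag is early stopping: training ends once the validation metric has not improved for that many consecutive epochs. The sketch below shows that logic for a loss-style metric (lower is better). Whether VenusFactory monitors a loss or another metric is not stated here, so treat both the function and the monitored quantity as assumptions.

```python
def should_stop(val_losses, patience):
    """Return True once the last `patience` epochs failed to beat the best earlier loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])


print(should_stop([1.0, 0.8, 0.9, 0.85, 0.81], patience=3))  # True
print(should_stop([1.0, 0.8, 0.9, 0.7], patience=3))         # False
```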

#### [SES-Adapter](https://arxiv.org/abs/2404.14850)

In [4]:
# Optional: use the HF mirror if needed (%env persists across cells; !export only affects its own subshell)
%env HF_ENDPOINT=https://hf-mirror.com
dataset="eSOL"
pdb_type="AlphaFold2"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
lr=5e-4
training_method="ses-adapter"
sh=f"""
python src/train.py \
    --plm_model {plm_source}/{plm_model} \
    --dataset_config data/{dataset}/{dataset}_{pdb_type}_HF.json \
    --learning_rate {lr} \
    --num_epochs 10 \
    --batch_token 8000 \
    --gradient_accumulation_steps 8 \
    --patience 3 \
    --structure_seq foldseek_seq,ss8_seq \
    --output_dir test_res/{dataset}/{plm_model} \
    --training_method {training_method} \
    --output_model_name ses-adapter_{pdb_type}_lr_{lr}_bt8k_ga8.pt
"""
!{sh}

2025-03-24 21:05:15 - training - INFO - Starting training with configuration:
2025-03-24 21:05:15 - training - INFO - hidden_size: None
2025-03-24 21:05:15 - training - INFO - num_attention_head: 8
2025-03-24 21:05:15 - training - INFO - attention_probs_dropout: 0.1
2025-03-24 21:05:15 - training - INFO - plm_model: facebook/esm2_t6_8M_UR50D
2025-03-24 21:05:15 - training - INFO - pooling_method: mean
2025-03-24 21:05:15 - training - INFO - pooling_dropout: 0.1
2025-03-24 21:05:15 - training - INFO - dataset: tyang816/eSOL_AlphaFold2
2025-03-24 21:05:15 - training - INFO - dataset_config: data/eSOL/eSOL_AlphaFold2_HF.json
2025-03-24 21:05:15 - training - INFO - normalize: standard
2025-03-24 21:05:15 - training - INFO - num_labels: 1
2025-03-24 21:05:15 - training - INFO - problem_type: regression
2025-03-24 21:05:15 - training - INFO - pdb_type: AlphaFold2
2025-03-24 21:05:15 - training - INFO - train_file: None
2025-03-24 21:05:15 - training - INFO - valid_file: None
2025-03-24 21:05

In [None]:
# Use bash script
!cp ./script/train/train_plm_ses-adapter.sh ./train_plm_ses-adapter.sh
!bash ./train_plm_ses-adapter.sh

#### [LoRA](https://arxiv.org/abs/2106.09685)

In [5]:
# ESM model target_modules name: query key value
# Bert_base(prot_bert) model target_modules name: query key value
# T5_base(ankh, t5) model target_modules name: q k v

# Optional: use the HF mirror if needed (%env persists across cells; !export only affects its own subshell)
%env HF_ENDPOINT=https://hf-mirror.com
dataset="eSOL"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
lr=5e-4
training_method="plm-lora"
sh=f"""
python src/train.py \
    --plm_model {plm_source}/{plm_model} \
    --dataset_config data/{dataset}/{dataset}_HF.json \
    --learning_rate {lr} \
    --gradient_accumulation_steps 8 \
    --num_epochs 10 \
    --batch_token 8000 \
    --patience 3 \
    --output_dir test_res/{dataset}/{plm_model} \
    --output_model_name {training_method}_lr_{lr}_8k_ga8.pt \
    --training_method {training_method} \
    --lora_target_modules query key value
"""
!{sh}

2025-03-24 21:17:16 - training - INFO - Starting training with configuration:
2025-03-24 21:17:16 - training - INFO - hidden_size: None
2025-03-24 21:17:16 - training - INFO - num_attention_head: 8
2025-03-24 21:17:16 - training - INFO - attention_probs_dropout: 0.1
2025-03-24 21:17:16 - training - INFO - plm_model: facebook/esm2_t6_8M_UR50D
2025-03-24 21:17:16 - training - INFO - pooling_method: mean
2025-03-24 21:17:16 - training - INFO - pooling_dropout: 0.1
2025-03-24 21:17:16 - training - INFO - dataset: tyang816/eSOL
2025-03-24 21:17:16 - training - INFO - dataset_config: data/eSOL/eSOL_HF.json
2025-03-24 21:17:16 - training - INFO - normalize: standard
2025-03-24 21:17:16 - training - INFO - num_labels: 1
2025-03-24 21:17:16 - training - INFO - problem_type: regression
2025-03-24 21:17:16 - training - INFO - pdb_type: None
2025-03-24 21:17:16 - training - INFO - train_file: None
2025-03-24 21:17:16 - training - INFO - valid_file: None
2025-03-24 21:17:16 - training - INFO - test

In [None]:
# Use bash script
!cp ./script/train/train_plm_lora.sh ./train_plm_lora.sh
!bash ./train_plm_lora.sh
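
The comments in the LoRA cell list which `--lora_target_modules` names go with which model family. The helper below is one way to encode that mapping so the right value can be chosen from the model name. `lora_target_modules` is a hypothetical function, and matching on substrings of the model name is a heuristic assumed here, so check the module names of any model not covered by those comments.

```python
def lora_target_modules(plm_model: str) -> list[str]:
    """Pick attention projection names for a model, following the comments above."""
    name = plm_model.lower()
    # T5-style models (ProtT5, Ankh) name their projections q, k, v.
    if "t5" in name or "ankh" in name:
        return ["q", "k", "v"]
    # ESM and BERT-style models use query, key, value.
    if "esm" in name or "bert" in name:
        return ["query", "key", "value"]
    raise ValueError(f"No known target modules for {plm_model!r}")


print(" ".join(lora_target_modules("facebook/esm2_t6_8M_UR50D")))  # query key value
print(" ".join(lora_target_modules("ElnaggarLab/ankh-base")))      # q k v
```

The joined string can be passed directly after `--lora_target_modules` in the training command.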

#### [AdaLoRA](https://arxiv.org/abs/2303.10512)

In [6]:
# ESM model target_modules name: query key value
# Bert_base(prot_bert) model target_modules name: query key value
# T5_base(ankh, t5) model target_modules name: q k v

# Optional: use the HF mirror if needed (%env persists across cells; !export only affects its own subshell)
%env HF_ENDPOINT=https://hf-mirror.com
dataset="eSOL"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
lr=5e-4
training_method="plm-adalora"
sh=f"""
python src/train.py \
    --plm_model {plm_source}/{plm_model} \
    --dataset_config data/{dataset}/{dataset}_HF.json \
    --learning_rate {lr} \
    --gradient_accumulation_steps 8 \
    --num_epochs 10 \
    --batch_token 8000 \
    --patience 3 \
    --output_dir test_res/{dataset}/{plm_model} \
    --output_model_name {training_method}_lr_{lr}_8k_ga8.pt \
    --training_method {training_method} \
    --lora_target_modules query key value
"""
!{sh}

2025-03-24 21:34:14 - training - INFO - Starting training with configuration:
2025-03-24 21:34:14 - training - INFO - hidden_size: None
2025-03-24 21:34:14 - training - INFO - num_attention_head: 8
2025-03-24 21:34:14 - training - INFO - attention_probs_dropout: 0.1
2025-03-24 21:34:14 - training - INFO - plm_model: facebook/esm2_t6_8M_UR50D
2025-03-24 21:34:14 - training - INFO - pooling_method: mean
2025-03-24 21:34:14 - training - INFO - pooling_dropout: 0.1
2025-03-24 21:34:14 - training - INFO - dataset: tyang816/eSOL
2025-03-24 21:34:14 - training - INFO - dataset_config: data/eSOL/eSOL_HF.json
2025-03-24 21:34:14 - training - INFO - normalize: standard
2025-03-24 21:34:14 - training - INFO - num_labels: 1
2025-03-24 21:34:14 - training - INFO - problem_type: regression
2025-03-24 21:34:14 - training - INFO - pdb_type: None
2025-03-24 21:34:14 - training - INFO - train_file: None
2025-03-24 21:34:14 - training - INFO - valid_file: None
2025-03-24 21:34:14 - training - INFO - test

In [None]:
# Use bash script
!cp ./script/train/train_plm_adalora.sh ./train_plm_adalora.sh
!bash ./train_plm_adalora.sh

#### [QLoRA](https://arxiv.org/abs/2305.14314)

In [7]:
# ESM model target_modules name: query key value
# Bert_base(prot_bert) model target_modules name: query key value
# T5_base(ankh, t5) model target_modules name: q k v

# Optional: use the HF mirror if needed (%env persists across cells; !export only affects its own subshell)
%env HF_ENDPOINT=https://hf-mirror.com
dataset="eSOL"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
lr=5e-4
training_method="plm-qlora"
sh=f"""
python src/train.py \
    --plm_model {plm_source}/{plm_model} \
    --dataset_config data/{dataset}/{dataset}_HF.json \
    --learning_rate {lr} \
    --gradient_accumulation_steps 8 \
    --num_epochs 10 \
    --batch_token 8000 \
    --patience 3 \
    --output_dir test_res/{dataset}/{plm_model} \
    --output_model_name {training_method}_lr_{lr}_8k_ga8.pt \
    --training_method {training_method} \
    --lora_target_modules query key value
"""
!{sh}

2025-03-24 21:52:56 - training - INFO - Starting training with configuration:
2025-03-24 21:52:56 - training - INFO - hidden_size: None
2025-03-24 21:52:56 - training - INFO - num_attention_head: 8
2025-03-24 21:52:56 - training - INFO - attention_probs_dropout: 0.1
2025-03-24 21:52:56 - training - INFO - plm_model: facebook/esm2_t6_8M_UR50D
2025-03-24 21:52:56 - training - INFO - pooling_method: mean
2025-03-24 21:52:56 - training - INFO - pooling_dropout: 0.1
2025-03-24 21:52:56 - training - INFO - dataset: tyang816/eSOL
2025-03-24 21:52:56 - training - INFO - dataset_config: data/eSOL/eSOL_HF.json
2025-03-24 21:52:56 - training - INFO - normalize: standard
2025-03-24 21:52:56 - training - INFO - num_labels: 1
2025-03-24 21:52:56 - training - INFO - problem_type: regression
2025-03-24 21:52:56 - training - INFO - pdb_type: None
2025-03-24 21:52:56 - training - INFO - train_file: None
2025-03-24 21:52:56 - training - INFO - valid_file: None
2025-03-24 21:52:56 - training - INFO - test

In [None]:
# Use bash script
!cp ./script/train/train_plm_qlora.sh ./train_plm_qlora.sh
!bash ./train_plm_qlora.sh

#### [DoRA](https://arxiv.org/abs/2402.09353)

In [8]:
# ESM model target_modules name: query key value
# Bert_base(prot_bert) model target_modules name: query key value
# T5_base(ankh, t5) model target_modules name: q k v

# Optional: use the HF mirror if needed (%env persists across cells; !export only affects its own subshell)
%env HF_ENDPOINT=https://hf-mirror.com
dataset="eSOL"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
lr=5e-4
training_method="plm-dora"
sh=f"""
python src/train.py \
    --plm_model {plm_source}/{plm_model} \
    --dataset_config data/{dataset}/{dataset}_HF.json \
    --learning_rate {lr} \
    --gradient_accumulation_steps 8 \
    --num_epochs 10 \
    --batch_token 8000 \
    --patience 3 \
    --output_dir test_res/{dataset}/{plm_model} \
    --output_model_name {training_method}_lr_{lr}_8k_ga8.pt \
    --training_method {training_method} \
    --lora_target_modules query key value
"""
!{sh}

2025-03-24 22:12:08 - training - INFO - Starting training with configuration:
2025-03-24 22:12:08 - training - INFO - hidden_size: None
2025-03-24 22:12:08 - training - INFO - num_attention_head: 8
2025-03-24 22:12:08 - training - INFO - attention_probs_dropout: 0.1
2025-03-24 22:12:08 - training - INFO - plm_model: facebook/esm2_t6_8M_UR50D
2025-03-24 22:12:08 - training - INFO - pooling_method: mean
2025-03-24 22:12:08 - training - INFO - pooling_dropout: 0.1
2025-03-24 22:12:08 - training - INFO - dataset: tyang816/eSOL
2025-03-24 22:12:08 - training - INFO - dataset_config: data/eSOL/eSOL_HF.json
2025-03-24 22:12:08 - training - INFO - normalize: standard
2025-03-24 22:12:08 - training - INFO - num_labels: 1
2025-03-24 22:12:08 - training - INFO - problem_type: regression
2025-03-24 22:12:08 - training - INFO - pdb_type: None
2025-03-24 22:12:08 - training - INFO - train_file: None
2025-03-24 22:12:08 - training - INFO - valid_file: None
2025-03-24 22:12:08 - training - INFO - test

In [None]:
# Use bash script
!cp ./script/train/train_plm_dora.sh ./train_plm_dora.sh
!bash ./train_plm_dora.sh

#### [IA3](https://arxiv.org/abs/2205.05638)

In [9]:
# ESM model target_modules name: query key value
# Bert_base(prot_bert) model target_modules name: query key value
# T5_base(ankh, t5) model target_modules name: q k v

# Optional: use the HF mirror if needed (%env persists across cells; !export only affects its own subshell)
%env HF_ENDPOINT=https://hf-mirror.com
dataset="eSOL"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
lr=5e-4
training_method="plm-ia3"
sh=f"""
python src/train.py \
    --plm_model {plm_source}/{plm_model} \
    --dataset_config data/{dataset}/{dataset}_HF.json \
    --learning_rate {lr} \
    --gradient_accumulation_steps 8 \
    --num_epochs 10 \
    --batch_token 8000 \
    --patience 3 \
    --output_dir test_res/{dataset}/{plm_model} \
    --output_model_name {training_method}_lr_{lr}_8k_ga8.pt \
    --training_method {training_method} \
    --lora_target_modules query key value
"""
!{sh}

2025-03-25 00:09:03 - training - INFO - Starting training with configuration:
2025-03-25 00:09:03 - training - INFO - hidden_size: None
2025-03-25 00:09:03 - training - INFO - num_attention_head: 8
2025-03-25 00:09:03 - training - INFO - attention_probs_dropout: 0.1
2025-03-25 00:09:03 - training - INFO - plm_model: facebook/esm2_t6_8M_UR50D
2025-03-25 00:09:03 - training - INFO - pooling_method: mean
2025-03-25 00:09:03 - training - INFO - pooling_dropout: 0.1
2025-03-25 00:09:03 - training - INFO - dataset: tyang816/eSOL
2025-03-25 00:09:03 - training - INFO - dataset_config: data/eSOL/eSOL_HF.json
2025-03-25 00:09:03 - training - INFO - normalize: standard
2025-03-25 00:09:03 - training - INFO - num_labels: 1
2025-03-25 00:09:03 - training - INFO - problem_type: regression
2025-03-25 00:09:03 - training - INFO - pdb_type: None
2025-03-25 00:09:03 - training - INFO - train_file: None
2025-03-25 00:09:03 - training - INFO - valid_file: None
2025-03-25 00:09:03 - training - INFO - test

In [None]:
# Use bash script
!cp ./script/train/train_plm_ia3.sh ./train_plm_ia3.sh
!bash ./train_plm_ia3.sh
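
The comments at the top of the training cells list which target-module names go with which backbone family. As a rough sketch, that mapping can be kept in a small lookup so the right value for ```--lora_target_modules``` is chosen by model name. The helper `guess_target_modules` and its substring matching are hypothetical and not part of VenusFactory; only the module names come from the comments above.

```python
# Target-module names per backbone family, copied from the cell comments above.
TARGET_MODULES = {
    "esm": ["query", "key", "value"],
    "bert": ["query", "key", "value"],
    "t5": ["q", "k", "v"],
}


def guess_target_modules(plm_model: str) -> list[str]:
    """Hypothetical helper: pick target modules by substring match on the model name."""
    name = plm_model.lower()
    if "t5" in name or "ankh" in name:
        return TARGET_MODULES["t5"]
    if "bert" in name:
        return TARGET_MODULES["bert"]
    if "esm" in name:
        return TARGET_MODULES["esm"]
    raise ValueError(f"Unknown model family for {plm_model!r}; set the modules by hand")


# Join with spaces to get the value passed to --lora_target_modules.
print(" ".join(guess_target_modules("facebook/esm2_t6_8M_UR50D")))
```

If a model name does not match any family, the helper raises instead of guessing, since a wrong module name would leave the adapter attached to nothing.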

### 2. Model Evaluation
**```--eval_method``` must match the ```--training_method``` used for training, so that the evaluation protocol fits your training strategy.**

**```--test_file``` specifies the evaluation dataset; it accepts both predefined datasets and local custom datasets. Replace it with your own dataset path.**

**```--model_path``` is the path to the trained model weights. Replace it with your own model path.**
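
The evaluation cells in this guide point ```--model_path``` at a file name containing `lr_0.0005`, while the training cells set `lr=5e-4`. The sketch below shows how the two line up: a Python f-string renders the float `5e-4` as `0.0005`. The helper `checkpoint_path` is hypothetical, and the `ckpt/` prefix is an assumption inferred from the paths used in the evaluation cells.

```python
def checkpoint_path(dataset: str, plm_model: str, method: str, lr: float,
                    root: str = "ckpt/test_res") -> str:
    """Rebuild the checkpoint path pattern used by the cells in this guide.

    Assumption: checkpoints end up under `ckpt/`, as the --model_path
    values in the evaluation cells suggest.
    """
    # The float 5e-4 is formatted as "0.0005" inside an f-string.
    return f"{root}/{dataset}/{plm_model}/{method}_lr_{lr}_8k_ga8.pt"


print(checkpoint_path("eSOL", "esm2_t6_8M_UR50D", "plm-lora", 5e-4))
```

Passing the same method name to both the training and the evaluation command keeps the file name and ```--eval_method``` in sync.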

#### LoRA Model Evaluation

In [13]:
# Set only if you need the HF mirror; %env (unlike !export) persists for later ! commands
%env HF_ENDPOINT=https://hf-mirror.com
problem_type="regression"
num_labels="1"
dataset="eSOL"
eval_method="plm-lora"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
# for the predefined data
sh=f"""
python src/eval.py \
    --plm_model {plm_source}/{plm_model} \
    --model_path ckpt/test_res/{dataset}/{plm_model}/{eval_method}_lr_0.0005_8k_ga8.pt \
    --eval_method {eval_method} \
    --dataset {dataset} \
    --test_file tyang816/{dataset} \
    --test_result_dir ckpt/debug_result/{dataset}/{eval_method}_{plm_model} \
    --num_labels {num_labels} \
    --problem_type {problem_type} \
    --batch_size 16 \
    --metrics spearman_corr
"""
!{sh}

---------- Load Model ----------
Number of parameter: 0.10M
---------- Start Eval ----------
Total samples: 310
100%|███████████████████████████| 20/20 [00:04<00:00,  4.12it/s, eval_loss=1.82]
spearman_corr: 0.7341052889823914


In [None]:
# Use bash script
!cp ./script/eval/eval_plm_lora.sh ./eval_plm_lora.sh
!bash ./eval_plm_lora.sh

In [15]:
# Set only if you need the HF mirror; %env (unlike !export) persists for later ! commands
%env HF_ENDPOINT=https://hf-mirror.com
problem_type="regression"
num_labels="1"
dataset="eSOL"
eval_method="plm-lora"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"

# for local data, make sure the test_file path exists
sh=f"""
python src/eval.py \
    --plm_model {plm_source}/{plm_model} \
    --model_path ckpt/test_res/{dataset}/{plm_model}/{eval_method}_lr_0.0005_8k_ga8.pt \
    --eval_method {eval_method} \
    --dataset {dataset} \
    --test_file data/eSOL_local_data/{dataset} \
    --test_result_dir ckpt/debug_result/{dataset}/{eval_method}_{plm_model} \
    --num_labels {num_labels} \
    --problem_type {problem_type} \
    --batch_size 16 \
    --metrics spearman_corr
"""
!{sh}

---------- Load Model ----------
Number of parameter: 0.10M
---------- Start Eval ----------
Total samples: 310
100%|███████████████████████████| 20/20 [00:04<00:00,  4.06it/s, eval_loss=1.82]
spearman_corr: 0.7341052889823914


In [None]:
# Use bash script
!cp ./script/eval/eval_plm_lora.sh ./eval_plm_lora_local.sh
!bash ./eval_plm_lora_local.sh

#### SES-Adapter Model Evaluation

In [20]:
# Set only if you need the HF mirror; %env (unlike !export) persists for later ! commands
%env HF_ENDPOINT=https://hf-mirror.com
problem_type="regression"
num_labels=1
dataset="eSOL"
pdb_type="AlphaFold2" # note: ses-adapter needs structure sequences
eval_method="ses-adapter"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"

# for predefined data
sh=f"""
python src/eval.py \
    --plm_model {plm_source}/{plm_model} \
    --model_path ckpt/test_res/{dataset}/{plm_model}/{eval_method}_{pdb_type}_lr_0.0005_bt8k_ga8.pt \
    --eval_method {eval_method} \
    --dataset {dataset} \
    --test_file tyang816/{dataset}_{pdb_type} \
    --test_result_dir ckpt/debug_result/{dataset}/{eval_method}_{plm_model} \
    --num_labels {num_labels} \
    --problem_type {problem_type} \
    --batch_size 16 \
    --structure_seq foldseek_seq,ss8_seq \
    --metrics spearman_corr
"""
!{sh}

Enabled foldseek_seq based on structure_seq parameter
Enabled ss8_seq based on structure_seq parameter
---------- Load Model ----------
Number of parameter: 0.95M
---------- Start Eval ----------
Total samples: 310
100%|███████████████████████████| 20/20 [00:05<00:00,  3.87it/s, eval_loss=1.78]
spearman_corr: 0.6919500231742859


In [None]:
# Use bash script
!cp ./script/eval/eval_plm_ses-adapter.sh ./eval_plm_ses-adapter.sh
!bash ./eval_plm_ses-adapter.sh

In [23]:
# Set only if you need the HF mirror; %env (unlike !export) persists for later ! commands
%env HF_ENDPOINT=https://hf-mirror.com
problem_type="regression"
num_labels=1
dataset="eSOL"
pdb_type="AlphaFold2" # note: ses-adapter needs structure sequences
eval_method="ses-adapter"
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"

# for local data, make sure the test_file path exists
sh=f"""
python src/eval.py \
    --plm_model {plm_source}/{plm_model} \
    --model_path ckpt/test_res/{dataset}/{plm_model}/{eval_method}_{pdb_type}_lr_0.0005_bt8k_ga8.pt \
    --eval_method {eval_method} \
    --dataset {dataset} \
    --test_file data/eSOL_local_data/{dataset}_{pdb_type} \
    --test_result_dir ckpt/debug_result/{dataset}/{eval_method}_{plm_model} \
    --num_labels {num_labels} \
    --problem_type {problem_type} \
    --batch_size 16 \
    --structure_seq foldseek_seq,ss8_seq \
    --metrics spearman_corr
"""
!{sh}

Enabled foldseek_seq based on structure_seq parameter
Enabled ss8_seq based on structure_seq parameter
---------- Load Model ----------
Number of parameter: 0.95M
---------- Start Eval ----------
Total samples: 310
100%|███████████████████████████| 20/20 [00:05<00:00,  3.84it/s, eval_loss=1.78]
spearman_corr: 0.6919500231742859


In [None]:
# Use bash script
!cp ./script/eval/eval_plm_ses-adapter_local.sh ./eval_plm_ses-adapter_local.sh
!bash ./eval_plm_ses-adapter_local.sh

#### For more evaluation scripts, see the dedicated scripts in ```VenusFactory/script/eval/```.

### 3. Model Prediction
VenusFactory provides two prediction workflows: single and batch.

For single mode, you provide one input (amino acid sequence, Foldseek sequence, secondary structure sequence).

For batch mode, you provide a test file (CSV format).

**```--problem_type``` specifies the current problem type in ["single_label_classification", "multi_label_classification", "regression"].**

**```--aa_seq``` amino acid sequence.**

**```--foldseek_seq``` foldseek sequence (optional).**

**```--ss8_seq``` secondary structure sequence (optional).**

**```--structure_seq``` structure sequence types to use (comma-separated).**

**```--input_file``` path to input CSV file with sequences.**

**```--output_file``` path to output CSV file for predictions.**
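
For batch mode, the input CSV can be assembled with the standard library. This is a minimal sketch: the header names `aa_seq`, `foldseek_seq`, and `ss8_seq` are an assumption that mirrors the command-line flags above, so check them against a `test.csv` from one of the provided datasets before relying on them. The helper `write_input_csv` and the per-row length check are illustrative, and the toy sequences are short prefixes used only to keep the example small.

```python
import csv
import tempfile
from pathlib import Path

COLUMNS = ("aa_seq", "foldseek_seq", "ss8_seq")  # assumed header names


def write_input_csv(path, records):
    """Write records (dicts keyed by COLUMNS) to a CSV file and return the row count."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for record in records:
            # Structure sequences are per-residue, so all columns should align.
            lengths = {len(record[column]) for column in COLUMNS}
            if len(lengths) != 1:
                raise ValueError("sequence columns must have the same length per row")
            writer.writerow(record)
    return len(records)


records = [{"aa_seq": "MAKEDNIE", "foldseek_seq": "DDPQPFDK", "ss8_seq": "LLLLLLEE"}]
out_path = Path(tempfile.mkdtemp()) / "input.csv"
print(write_input_csv(out_path, records), "row(s) written to", out_path)
```

The resulting file path is what you would pass to ```--input_file```.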

#### LoRA Model Prediction

In [24]:
# For the single prediction
# Set only if you need the HF mirror; %env (unlike !export) persists for later ! commands
%env HF_ENDPOINT=https://hf-mirror.com
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
eval_method="plm-lora"
problem_type="regression"
num_labels=1
aa_seq="MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR"
#
sh=f"""
python src/predict.py \
    --eval_method {eval_method} \
    --plm_model {plm_source}/{plm_model} \
    --model_path ckpt/test_res/eSOL/{plm_model}/{eval_method}_lr_0.0005_8k_ga8.pt \
    --aa_seq {aa_seq} \
    --num_labels {num_labels} \
    --problem_type {problem_type}
"""
!{sh}

---------- Loading Model and Tokenizer ----------
Model config not found at ckpt/test_res/eSOL/esm2_t6_8M_UR50D/config.json. Using command line arguments.
Training method: plm-lora
Structure sequence: 
Use foldseek: False
Use ss8: False
Problem type: regression
Number of labels: 1
Number of attention heads: 8
---------- Processing Input Sequences ----------
Processed input sequences with keys: dict_keys(['aa_seq_input_ids', 'aa_seq_attention_mask'])
---------- Running Prediction ----------
Prediction result: 1.4336968660354614

---------- Prediction Results ----------
{
  "prediction": 1.4336968660354614
}


In [None]:
# use bash script
!cp ./script/predict/predict_plm_lora.sh ./predict_plm_lora.sh
!bash ./predict_plm_lora.sh

In [26]:
# For the batch prediction
# Set only if you need the HF mirror; %env (unlike !export) persists for later ! commands
%env HF_ENDPOINT=https://hf-mirror.com
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
eval_method="plm-lora"
problem_type="regression"
num_labels=1
input_file="data/eSOL_local_data/eSOL/test.csv"
sh=f"""
python src/predict_batch.py \
    --eval_method {eval_method} \
    --plm_model {plm_source}/{plm_model} \
    --model_path ckpt/test_res/eSOL/{plm_model}/{eval_method}_lr_0.0005_8k_ga8.pt \
    --num_labels {num_labels} \
    --problem_type {problem_type} \
    --input_file  {input_file} \
    --output_dir ckpt/debug_result/eSOL/{plm_model}/prediction_batch/{eval_method} \
    --output_file result.csv
"""
!{sh}

---------- Loading Model and Tokenizer ----------
Model config not found at ckpt/test_res/eSOL/esm2_t6_8M_UR50D/config.json. Using command line arguments.
Training method: plm-lora
Structure sequence: 
Use foldseek: False
Use ss8: False
Problem type: regression
Number of labels: 1
Number of attention heads: 8
---------- Reading input file: data/eSOL_local_data/eSOL/test.csv ----------
Found 310 sequences in input file
---------- Processing sequences ----------
Predicting: 100%|█████████████████████████████| 310/310 [00:34<00:00,  9.07it/s]
---------- Saving results to ckpt/debug_result/eSOL/esm2_t6_8M_UR50D/prediction_batch/plm-lora/result.csv ----------
Saved 310 prediction results
---------- Batch prediction completed successfully ----------


In [None]:
# use bash script
!cp ./script/predict/predict_batch_plm_lora.sh ./predict_batch_plm_lora.sh
!bash ./predict_batch_plm_lora.sh

#### SES-Adapter Model Prediction

In [27]:
# For the single prediction
# Set only if you need the HF mirror; %env (unlike !export) persists for later ! commands
%env HF_ENDPOINT=https://hf-mirror.com
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
eval_method="ses-adapter"
problem_type="regression"
num_labels=1
aa_seq="MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR"
ss8_seq="LLLLLLEEEEEEEEEEETTTEEEEEETTSLEEEEEELHHHHHTTLLLLTTLEEEEEEETTEEEEEEEEEELL"
foldseek_seq="DDPQPFDKFKWFFADADPPQWTFTQTPVRDTAIEHEDPVCVVVVDDDDGGWMFIWGHHPVDNRYTYTDDTDD"
sh=f"""
python src/predict.py \
    --eval_method {eval_method} \
    --plm_model {plm_source}/{plm_model} \
    --model_path ckpt/test_res/eSOL/{plm_model}/{eval_method}_AlphaFold2_lr_0.0005_bt8k_ga8.pt \
    --aa_seq {aa_seq} \
    --foldseek_seq {foldseek_seq} \
    --ss8_seq {ss8_seq} \
    --num_labels {num_labels} \
    --problem_type {problem_type} \
    --structure_seq foldseek_seq,ss8_seq
"""
!{sh}

---------- Loading Model and Tokenizer ----------
Model config not found at ckpt/test_res/eSOL/esm2_t6_8M_UR50D/config.json. Using command line arguments.
Enabled foldseek_seq based on structure_seq parameter
Enabled ss8_seq based on structure_seq parameter
Training method: ses-adapter
Structure sequence: foldseek_seq,ss8_seq
Use foldseek: True
Use ss8: True
Problem type: regression
Number of labels: 1
Number of attention heads: 8
---------- Processing Input Sequences ----------
Processed input sequences with keys: dict_keys(['aa_seq_input_ids', 'aa_seq_attention_mask', 'foldseek_seq_input_ids', 'ss8_seq_input_ids'])
---------- Running Prediction ----------
Prediction result: 1.5001569986343384

---------- Prediction Results ----------
{
  "prediction": 1.5001569986343384
}


In [None]:
# use bash script
!cp ./script/predict/predict_plm_ses-adapter.sh ./predict_plm_ses-adapter.sh
!bash ./predict_plm_ses-adapter.sh

In [28]:
# for the batch prediction
# Set only if you need the HF mirror; %env (unlike !export) persists for later ! commands
%env HF_ENDPOINT=https://hf-mirror.com
plm_source="facebook"
plm_model="esm2_t6_8M_UR50D"
eval_method="ses-adapter"
problem_type="regression"
num_labels=1
input_file="data/eSOL_local_data/eSOL_AlphaFold2/test.csv"
sh=f"""
python src/predict_batch.py \
    --eval_method {eval_method} \
    --plm_model {plm_source}/{plm_model} \
    --model_path ckpt/test_res/eSOL/{plm_model}/{eval_method}_AlphaFold2_lr_0.0005_bt8k_ga8.pt \
    --num_labels {num_labels} \
    --problem_type {problem_type} \
    --input_file  {input_file} \
    --output_dir ckpt/debug_result/eSOL/{plm_model}/prediction_batch/{eval_method} \
    --output_file result.csv \
    --structure_seq foldseek_seq,ss8_seq
"""
!{sh}

---------- Loading Model and Tokenizer ----------
Model config not found at ckpt/test_res/eSOL/esm2_t6_8M_UR50D/config.json. Using command line arguments.
Enabled foldseek_seq based on structure_seq parameter
Enabled ss8_seq based on structure_seq parameter
Training method: ses-adapter
Structure sequence: foldseek_seq,ss8_seq
Use foldseek: True
Use ss8: True
Problem type: regression
Number of labels: 1
Number of attention heads: 8
---------- Reading input file: data/eSOL_local_data/eSOL_AlphaFold2/test.csv ----------
Found 310 sequences in input file
---------- Processing sequences ----------
Predicting: 100%|█████████████████████████████| 310/310 [00:34<00:00,  9.06it/s]
---------- Saving results to ckpt/debug_result/eSOL/esm2_t6_8M_UR50D/prediction_batch/ses-adapter/result.csv ----------
Saved 310 prediction results
---------- Batch prediction completed successfully ----------


In [None]:
# use bash script
!cp ./script/predict/predict_batch_plm_ses-adapter.sh ./predict_batch_plm_ses-adapter.sh
!bash ./predict_batch_plm_ses-adapter.sh

#### For more prediction scripts, see the dedicated scripts in ```VenusFactory/script/predict/```.

## 🛠 Data Collection Tools: Multi-source protein data acquisition

### Download Components Help Guide

<details>
<summary>InterPro Metadata</summary>

- **Description**: Downloads protein domain information from InterPro database.

- **Source**: [InterPro Database](https://www.ebi.ac.uk/interpro/)

- **Download Options**:
    - ```--interpro_id```: Download data for a specific InterPro domain (e.g., IPR000001)
    - ```--interpro_json```: Batch download using a JSON file containing multiple InterPro entries

- **Output Format**:

  ```
    download/interpro_domain/
    └── IPR000001/
        ├── detail.json    # Detailed protein information
        ├── meta.json      # Metadata including accession and protein count
        └── uids.txt       # List of UniProt IDs associated with this domain
  ```
</details>

<details>
<summary>RCSB Metadata</summary>

- **Description**: Downloads structural metadata from the RCSB Protein Data Bank.

- **Source**: [RCSB PDB](https://www.rcsb.org/)

- **Download Options**:
    - ```--pdb_id```: Download metadata for a specific PDB entry (e.g., 1a0j)
    - ```--pdb_id_file```: Batch download using a text file containing PDB IDs

- **Output Format**:
    ```
    download/rcsb_metadata/
    └── 1a0j.json         # Contains structure metadata including:
                         # - Resolution
                         # - Experimental method
                         # - Publication info
                         # - Chain information
    ```
</details>

<details>
<summary>UniProt Sequences</summary>

- **Description**: Downloads protein sequences from UniProt database.

- **Source**: [UniProt](https://www.uniprot.org/)

- **Download Options**:
    - ```--uniprot_id```: Download sequence for a specific UniProt entry (e.g., P00734)
    - ```--file```: Batch download using a text file containing UniProt IDs
    - ```--merge```: Combine all sequences into a single FASTA file (optional)

- **Output Format**:
    ```
    download/uniprot_sequences/
    ├── P00734.fasta      # Individual FASTA files (when not merged)
    └── merged.fasta      # Combined sequences (when merge option is selected)
    ```
</details>

<details>
<summary>RCSB Structures</summary>
    
- **Description**: Downloads 3D structure files from RCSB Protein Data Bank.

- **Source**: [RCSB PDB](https://www.rcsb.org/)

- **Download Options**:
    - ```--pdb_id```: Download structure for a specific PDB entry
    - ```--pdb_id_file```: Batch download using a text file containing PDB IDs
    - ```--type```: File type to download
        * cif: mmCIF format (recommended)
        * pdb: Legacy PDB format
        * xml: PDBML/XML format
        * sf: Structure factors
        * mr: NMR restraints
    - ```--unzip```: Automatically decompress downloaded files

- **Output Format**:
    ```
    download/rcsb_structures/
    ├── 1a0j.pdb          # Uncompressed structure file (with unzip)
    └── 1a0j.pdb.gz       # Compressed structure file (without unzip)
    ```
</details>

<details>
<summary>AlphaFold2 Structures</summary>
    
- **Description**: Downloads predicted protein structures from AlphaFold Protein Structure Database.

- **Source**: [AlphaFold DB](https://alphafold.ebi.ac.uk/)

- **Download Options**:
    - ```--uniprot_id```: Download structure for a specific UniProt entry
    - ```--uniprot_id_file```: Batch download using a text file containing UniProt IDs
    - ```--index_level```: Organize files in subdirectories based on ID prefix

- **Output Format**:
    ```
    download/alphafold2_structures/
    └── P/               # With index_level=1
        └── P0/          # With index_level=2
            └── P00734.pdb  # AlphaFold predicted structure
    ```
</details>
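
The directory tree shown for ```--index_level``` suggests that each level adds one more character of the ID prefix. A minimal sketch of that layout follows; `indexed_path` is a hypothetical helper that reproduces the tree above, not the downloader's own code.

```python
from pathlib import Path


def indexed_path(out_dir: str, uniprot_id: str, index_level: int = 0) -> Path:
    """Nest a structure file under growing ID prefixes: P/, then P/P0/, and so on."""
    path = Path(out_dir)
    for depth in range(1, index_level + 1):
        path = path / uniprot_id[:depth]
    return path / f"{uniprot_id}.pdb"


print(indexed_path("download/alphafold2_structures", "P00734", 2).as_posix())
# download/alphafold2_structures/P/P0/P00734.pdb
```

With `index_level=0` the file sits directly in the output directory, which is fine for small batches; prefix nesting mainly helps when one directory would otherwise hold a very large number of files.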

<details>
<summary>Common Features</summary>

- **Error Handling**: All components support error file generation
- **Output Directory**: Customizable output paths
- **Batch Processing**: Support for multiple IDs via file input
- **Progress Tracking**: Real-time download progress and status updates
</details>

<details>
<summary>Input File Formats</summary>
    
- **PDB ID List** (for RCSB downloads):
    ```
    1a0j
    4hhb
    1hho
    ```

- **UniProt ID List** (for UniProt and AlphaFold):
    ```
    P00734
    P61823
    Q8WZ42
    ```

- **InterPro JSON** (for batch InterPro downloads):
    ```json
    [
        {
            "metadata": {
                "accession": "IPR000001"
            }
        },
        {
            "metadata": {
                "accession": "IPR000002"
            }
        }
    ]
    ```
</details>
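
The input files described above can be generated with a few lines of Python. This sketch writes the three formats into a temporary directory; the helper names and the file names `rcsb.txt`, `uniprot.txt`, and `batch.json` are chosen for illustration only.

```python
import json
import tempfile
from pathlib import Path


def write_id_list(path, ids):
    """Write one identifier per line (PDB or UniProt IDs)."""
    Path(path).write_text("\n".join(ids) + "\n")


def write_interpro_json(path, accessions):
    """Write the list-of-metadata layout shown above for batch InterPro downloads."""
    entries = [{"metadata": {"accession": accession}} for accession in accessions]
    Path(path).write_text(json.dumps(entries, indent=4))


workdir = Path(tempfile.mkdtemp())
write_id_list(workdir / "rcsb.txt", ["1a0j", "4hhb", "1hho"])
write_id_list(workdir / "uniprot.txt", ["P00734", "P61823", "Q8WZ42"])
write_interpro_json(workdir / "batch.json", ["IPR000001", "IPR000002"])
print(sorted(p.name for p in workdir.iterdir()))
```

The resulting paths can then be passed to ```--pdb_id_file```, ```--uniprot_id_file``` (or ```--file```), and ```--interpro_json``` respectively.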

<details>
<summary>Error Files</summary>
    
- When enabled, failed downloads are logged to `failed.txt` in the output directory:
    ```
    P00734 - Download failed: 404 Not Found
    1a0j - Connection timeout
    ```
</details>
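
A failure log in the format above is easy to turn back into a retry list. The sketch below assumes each line has the form `ID - reason`, as in the sample; `parse_failed_log` is a hypothetical helper, and the real file may differ in detail.

```python
def parse_failed_log(text: str) -> dict[str, str]:
    """Map each failed identifier to its error message."""
    failures = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first " - " only, so reasons may contain dashes.
        identifier, _, reason = line.partition(" - ")
        failures[identifier] = reason
    return failures


sample = "P00734 - Download failed: 404 Not Found\n1a0j - Connection timeout\n"
failed = parse_failed_log(sample)
print(list(failed))  # identifiers to write into a new batch file for a retry
```

Writing the keys of `failed` to a text file, one per line, gives an ID list suitable for a second batch run.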

### Download InterPro Metadata

In [33]:
# download single data
!python src/crawler/metadata/download_interpro.py \
    --interpro_id IPR000003 \
    --out_dir data/interpro/meta_single \
    --error_file data/interpro/meta_single_error.csv

Successfully downloaded IPR000003


In [31]:
# download batch data 
# a JSON file template is provided in download/interpro_json.customization; modify it to specify the InterPro IDs
!python src/crawler/metadata/download_interpro.py \
    --interpro_json data/interpro/batch.json \
    --out_dir data/interpro/meta_batch \
    --error_file data/interpro/meta_batch_error.csv

100%|█████████████████████████████████████████████| 6/6 [06:47<00:00, 67.94s/it]


### Download RCSB Metadata

In [2]:
# download single data
!python src/crawler/metadata/download_rcsb.py \
    --pdb_id 1A00 \
    --out_dir data/rcsb/meta_single \
    --error_file data/rcsb/meta_single_error.csv

1A00 successfully downloaded


In [3]:
# download batch data
!python src/crawler/metadata/download_rcsb.py \
    --pdb_id_file  download/rcsb.txt \
    --out_dir data/rcsb/meta_batch \
    --error_file data/rcsb/meta_batch_error.csv

1A03 successfully downloaded: 100%|███████████████| 4/4 [00:00<00:00,  6.20it/s]


### Download UniProt Sequences

In [4]:
# download single data
!python src/crawler/sequence/download_uniprot_seq.py \
    --uniprot_id A0A0C5B5G6 \
    --out_dir data/uniprot/uniprot_single \
    --error_file data/uniprot/uniprot_single_error.csv

A0A0C5B5G6.fasta successfully downloaded


In [5]:
# download batch data
!python src/crawler/sequence/download_uniprot_seq.py \
    --file download/uniprot.txt \
    --out_dir data/uniprot/uniprot_batch \
    --error_file data/uniprot/uniprot_batch_error.csv

A0JNW5.fasta successfully downloaded: 100%|███████| 5/5 [00:01<00:00,  3.20it/s]


### Download RCSB Structures

In [6]:
# download single data
!python src/crawler/structure/download_rcsb.py \
    --pdb_id 1A00 \
    --out_dir data/structure/rcsb_single \
    --error_file data/structure/rcsb_single_error.csv

1A00.pdb.gz successfully downloaded


In [7]:
# download batch data
!python src/crawler/structure/download_rcsb.py \
    --pdb_id_file download/rcsb.txt \
    --out_dir data/structure/rcsb_batch \
    --error_file data/structure/rcsb_batch_error.csv \
    --unzip

1A03.pdb.gz successfully downloaded: 100%|████████| 4/4 [00:01<00:00,  2.06it/s]


### Download AlphaFold2 Structures

In [9]:
# download single data
!python src/crawler/structure/download_alphafold.py \
    --uniprot_id A0A0C5B5G6 \
    --out_dir data/structure/af2_single \
    --error_file data/structure/af2_single_error.csv

A0A0C5B5G6 successfully downloaded


In [10]:
# download batch data
!python src/crawler/structure/download_alphafold.py \
    --uniprot_id_file download/uniprot.txt \
    --out_dir data/structure/af2_batch \
    --error_file data/structure/af2_batch_error.csv \
    --index_level 1

A0A1B0GTW7 successfully downloaded: 100%|█████████| 5/5 [00:03<00:00,  1.43it/s]


### Structure Sequence Tools

#### ESM3 Structure Sequence
Generate structure sequences using ESM-3. You can download ```esm3_structure_encoder_v0.pth``` from [Hugging Face](https://huggingface.co/EvolutionaryScale/esm3-sm-open-v1/tree/main/data/weights).

```--pdb_file```: Get a specific pdb structure sequence

```--pdb_dir```: Get batch pdb structure sequences

In [2]:
# get a specific pdb structure sequence
!python src/data/get_esm3_structure_seq.py \
    --pdb_file download/alphafold2_structures/A0PK11.pdb \
    --out_file data/structure/esm2_ss.json

  state_dict = torch.load(
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=False):  # type: ignore


In [3]:
# get batch pdb structure sequence
!python src/data/get_esm3_structure_seq.py \
    --pdb_dir download/alphafold2_structures \
    --out_file data/structure/esm2_ss_batch.json

  state_dict = torch.load(
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=False):  # type: ignore
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=False):  # type: ignore
100%|█████████████████████████████████████████████| 5/5 [00:04<00:00,  1.20it/s]


#### FoldSeek Structure Sequence
Generate structure sequences with Foldseek. You can install Foldseek with ```conda install -c conda-forge -c bioconda foldseek```.

```--pdb_dir```: Get batch pdb structure sequences

In [1]:
!python src/data/get_foldseek_structure_seq.py \
    --pdb_dir download/alphafold2_structures \
    --out_file data/structure/foldseek_batch.json

createdb download/alphafold2_structures tmp_db/tmp_db 

MMseqs Version:  	1.3c64211
Chain name mode  	0
Write lookup file	1
Threads          	96
Verbosity        	3

Output file: tmp_db/tmp_db
Time for merging to tmp_db_ss: 0h 0m 0s 139ms
Time for merging to tmp_db_h: 0h 0m 0s 124ms
Time for merging to tmp_db_ca: 0h 0m 0s 140ms
Time for merging to tmp_db: 0h 0m 0s 111ms
Ignore 0 out of 5.
Too short: 0, incorrect  0.
Time for processing: 0h 0m 1s 652ms
lndb tmp_db/tmp_db_h tmp_db/tmp_db_ss_h 

MMseqs Version:	1.3c64211
Verbosity	3

Time for processing: 0h 0m 0s 2ms
convert2fasta tmp_db/tmp_db_ss tmp_db/tmp_db_ss.fasta 

MMseqs Version:	1.3c64211
Use header DB	false
Verbosity    	3

Start writing file to tmp_db/tmp_db_ss.fasta
Time for processing: 0h 0m 0s 3ms
5it [00:00, 56375.05it/s]
