# BAHD-ESP usage tutorial

## Environment setup

### Create the runtime environment

```bash
mamba env create -f env.yaml
```
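
The command above creates the environment but does not activate it. The environment name is set by the top-level `name:` key in `env.yaml`, which this tutorial does not show, so the sketch below reads it from the file rather than guessing. `env_name_from_yaml` is a hypothetical helper introduced here, and it assumes the key sits on a single unindented line.

```bash
# Hypothetical helper: print the value of the top-level `name:` key of a
# conda/mamba environment file. Assumes the key is on one unindented line.
env_name_from_yaml() {
    sed -n 's/^name:[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p' "$1" | head -n 1
}

# Usage (run from the repository root after creating the environment):
#   mamba activate "$(env_name_from_yaml env.yaml)"
```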

### Download the model weights

- Download link: https://zenodo.org/records/14635915

```bash
# Extract the pretrained model archive
7z x pretrained_model.7z
# Arrange the folders in the following hierarchy
.
├── code
├── data
├── env.yaml
├── final_saved_predictions
├── GB_saved_predictions
├── logs
├── model_metrics
├── prediction_code
├── pretrained_model
├── README.md
├── saved_model
└── saved_predictions

# Extract BAHD_ESP_1gpus_bs72_1e-05_layers6.txt.pkl.7z
7z x BAHD_ESP_1gpus_bs72_1e-05_layers6.txt.pkl.7z
# Place the extracted PKL file in the `saved_model` folder as shown below
saved_model
├── BAHD_ESP_1gpus_bs72_1e-05_layers6.txt.pkl
└── BAHD_ESP_xgb
    ├── cls.pkl
    ├── ESM1b_ChemBERTa_cls.pkl
    ├── ESM1b_ChemBERTa.pkl
    ├── y_val_pred_all_cls.pkl
    ├── y_val_pred_all.pkl
    └── y_val_pred_cls.pkl
```
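
A quick way to confirm the layout before training is to check that each expected entry exists. The sketch below is one possible check, not part of the repository. `check_layout` is a hypothetical helper, and its list of entries is copied from the hierarchy shown above.

```bash
# Hypothetical helper: report entries from the hierarchy above that are
# missing under the given root (default: current directory).
# Returns 0 if everything is present, 1 otherwise.
check_layout() {
    local root="${1:-.}"
    local missing=0
    local entry
    for entry in \
        code data env.yaml final_saved_predictions GB_saved_predictions \
        logs model_metrics prediction_code pretrained_model README.md \
        saved_model saved_predictions \
        saved_model/BAHD_ESP_1gpus_bs72_1e-05_layers6.txt.pkl \
        saved_model/BAHD_ESP_xgb
    do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (from the repository root):
#   check_layout . && echo "layout OK"
```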

## Training process

```{important}
Before training or testing, make sure every sample has been preprocessed and its embeddings have been extracted (see the preprocessing step below).
```

### Preprocessing: Extracting embedding information

```bash
# CUDA_VISIBLE_DEVICES=1 selects the second GPU; use 0 on a single-GPU machine
CUDA_VISIBLE_DEVICES=1 python code/preprocessing/preprocessing.py \
    --train_val_path data/training_data/train_val_BAHD \
    --outpath data/training_data/embeddings \
    --smiles_emb_no 2000 --prot_emb_no 2000
```

### Train a large model

```bash
CUDA_VISIBLE_DEVICES=1 python code/training/training.py \
    --train_dir data/training_data/train_val_BAHD/BAHD_ESP_train_df.csv \
    --val_dir data/training_data/train_val_BAHD/BAHD_ESP_val_df.csv \
    --model_prefix BAHD_ESP_ \
    --embed_path data/training_data/embeddings \
    --save_model_path saved_model \
    --pretrained_model pretrained_model/pretraining_IC50_6gpus_bs144_1.5e-05_layers6.txt.pkl \
    --learning_rate 1e-5 --num_hidden_layers 6 --batch_size 72 --binary_task True \
    --num_train_epochs 100 --port 12558
```                               
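
The file names used in this tutorial (`pretraining_IC50_6gpus_bs144_1.5e-05_layers6.txt.pkl` and `BAHD_ESP_1gpus_bs72_1e-05_layers6.txt.pkl`) suggest that the training script composes the checkpoint name from the model prefix, GPU count, batch size, learning rate, and layer count. That pattern is inferred from the names alone and is not confirmed from `training.py`, so treat the sketch below as a guess for locating the file that the gradient boosting step expects. `checkpoint_name` is a hypothetical helper.

```bash
# Hypothetical helper: build the checkpoint file name that the tutorial's
# file names appear to follow (an inferred pattern, not taken from training.py).
# Arguments: prefix, number of GPUs, batch size, learning rate, hidden layers.
checkpoint_name() {
    local lr
    # %g renders 1e-5 as 1e-05, matching the names above.
    # LC_ALL=C (set only inside the subshell) keeps "." as the decimal point.
    lr=$(export LC_ALL=C; printf '%g' "$4")
    printf '%s%sgpus_bs%s_%s_layers%s.txt.pkl\n' "$1" "$2" "$3" "$lr" "$5"
}

# Usage: check that the expected checkpoint exists before training the
# gradient boosting model.
#   ls "saved_model/$(checkpoint_name BAHD_ESP_ 1 72 1e-5 6)"
```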

### Train a gradient boosting tree

```bash
CUDA_VISIBLE_DEVICES=1 python code/training/training_GB.py \
    --train_dir data/training_data/train_val_BAHD/BAHD_ESP_train_df.csv \
    --val_dir data/training_data/train_val_BAHD/BAHD_ESP_val_df.csv \
    --test_dir data/training_data/train_val_BAHD/T1_ESP_test_df.csv \
    --pretrained_model saved_model/BAHD_ESP_1gpus_bs72_1e-05_layers6.txt.pkl \
    --embed_path data/training_data/embeddings \
    --save_xgb_path saved_model/BAHD_ESP_xgb \
    --save_pred_path GB_saved_predictions \
    --num_hidden_layers 6 --num_iter 500 --binary_task True
```

## Independent test

```{important}
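Before running, manually change the path of the test CSV and the path where results are saved. The command below takes no arguments, so these are presumably set inside `prediction_code/predictionForBAHD.py`.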
```

```bash
CUDA_VISIBLE_DEVICES=1 python prediction_code/predictionForBAHD.py
```