PepOpt is an interpretable machine learning framework for peptide lead optimization in ultralarge design spaces.
Authors: Yixiang Mao (MSD) · Ruibo Zhang (MSD)
Contact: ruibo.zhang@msd.com
PepOpt implements a fingerprint-based Free-Wilson analysis (FP-FWA) pipeline that:
- Learns interpretable per-monomer contributions to a target activity using ridge regression on molecular fingerprints
- Visualizes the most important monomers and their key substructures
- Enumerates and ranks novel peptide candidates from an ultralarge design space
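
The idea behind the first step can be illustrated with a toy model. The sketch below is not PepOpt's implementation: it uses NumPy's closed-form ridge solution, made-up 4-bit fingerprints, and hypothetical monomer names (`A`, `B`, `C`). It also assumes that a monomer's contribution at a position is the dot product of its fingerprint with the weights learned for that position.

```python
import numpy as np

# Hypothetical 4-bit "fingerprints" (a real run would use e.g. 2048-bit ECFP vectors).
MONOMER_FP = {
    "A": np.array([1, 0, 0, 1], dtype=float),
    "B": np.array([0, 1, 0, 1], dtype=float),
    "C": np.array([0, 0, 1, 0], dtype=float),
}

def featurize(peptide):
    """Concatenate per-position monomer fingerprints into one feature vector."""
    return np.concatenate([MONOMER_FP[m] for m in peptide])

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def monomer_contribution(w, position, monomer, n_bits=4):
    """Contribution of `monomer` at `position`: its fingerprint dotted with that position's weights."""
    block = w[position * n_bits:(position + 1) * n_bits]
    return float(MONOMER_FP[monomer] @ block)

# Toy two-position peptides with made-up activities.
peptides = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("C", "C")]
activity = np.array([6.0, 6.5, 7.0, 7.2, 5.5, 5.1])

X = np.stack([featurize(p) for p in peptides])
w, b = fit_ridge(X, activity, alpha=0.1)

for pos in range(2):
    for name in MONOMER_FP:
        print(pos, name, round(monomer_contribution(w, pos, name), 3))
```

Because the model is linear, a peptide's predicted activity is the intercept plus the sum of its per-position monomer contributions, which is what makes the coefficients directly interpretable.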
Install dependencies via:
pip install -r requirements.txt

Two types of input files are required to run PepOpt.
The peptide input must be pre-aligned, with each position represented as a separate column following the naming convention:
PEPTIDE1_0, PEPTIDE1_1, PEPTIDE1_2, ...
Each cell should contain the HELM monomer name for that position.
The CSV must also include:
| Column | Description | Example |
|---|---|---|
| ID column | Unique compound identifier in the format letters + ("-" or "_") + numbers | PEP_00123, ABC-45 |
| Target column | The numerical property to be predicted (e.g., pIC50) | pIC50 |
| HELM column | HELM string of each peptide. Required for peptide enumeration, even though all monomers are already provided in the pre-aligned position columns | |
📄 Example file:
test_data/example_data_aligned.csv
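
A quick pre-flight check of the layout described above might look like the following sketch. It uses only the Python standard library. The function name `check_aligned_csv` and the inline sample rows are illustrative, the ID pattern is one reading of "letters + ("-" or "_") + numbers", and the requirement that position columns run contiguously from 0 is an assumption based on the naming convention.

```python
import csv
import io
import re

POSITION_COL = re.compile(r"^PEPTIDE1_(\d+)$")
# One reading of "letters + ('-' or '_') + numbers", e.g. PEP_00123 or ABC-45.
ID_PATTERN = re.compile(r"^[A-Za-z]+[-_][0-9]+$")

def check_aligned_csv(handle, id_col, target_col):
    """Return a list of human-readable problems found in an aligned peptide CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = []

    for col in (id_col, target_col):
        if col not in header:
            problems.append(f"missing column: {col}")

    # Assumption: position columns are numbered 0..n-1 with no gaps.
    positions = sorted(int(m.group(1)) for m in map(POSITION_COL.match, header) if m)
    if not positions or positions != list(range(len(positions))):
        problems.append(f"position columns are not contiguous from 0: {positions}")

    for line_no, row in enumerate(reader, start=2):
        rid = row.get(id_col) or ""
        if id_col in header and not ID_PATTERN.match(rid):
            problems.append(f"line {line_no}: bad ID {rid!r}")
        try:
            float(row.get(target_col) or "")
        except ValueError:
            if target_col in header:
                problems.append(f"line {line_no}: non-numeric {target_col}")
    return problems

# Made-up sample data, not taken from test_data/example_data_aligned.csv.
sample = """ID,pIC50,PEPTIDE1_0,PEPTIDE1_1,PEPTIDE1_2
PEP_00123,6.8,A,dF,K
ABC-45,7.1,G,dF,K
bad id,n/a,A,A,A
"""
for problem in check_aligned_csv(io.StringIO(sample), "ID", "pIC50"):
    print(problem)
```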
Monomer lookup tables map HELM monomer names to their structural SMILES fragments. You must provide at least two tables:
- One for monomers present in the input dataset
- One for monomers available for enumeration / prediction
Table paths are configured in src/config.py.
Each table must contain the following columns:
| Column | Description |
|---|---|
| symbol | HELM monomer name |
| Partial_SMILES | Clipped SMILES fragment (reaction handles removed) |
📄 Example file:
test_data/example_monomer_db.csv
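
As a rough illustration of how such a table can be consumed, the sketch below loads the two required columns into a dictionary and reports monomers that have no entry. The helper names `load_monomer_table` and `unknown_monomers` are hypothetical, and the SMILES strings in the sample are placeholders rather than real clipped fragments from the example database.

```python
import csv
import io

def load_monomer_table(handle):
    """Map each HELM monomer name (symbol) to its clipped SMILES fragment."""
    reader = csv.DictReader(handle)
    missing = {"symbol", "Partial_SMILES"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"monomer table lacks columns: {sorted(missing)}")
    return {row["symbol"]: row["Partial_SMILES"] for row in reader}

def unknown_monomers(peptides, table):
    """Return the monomer names used in `peptides` that have no entry in `table`."""
    return sorted({m for pep in peptides for m in pep if m not in table})

# Made-up rows; the SMILES values are placeholders for illustration only.
sample_table = """symbol,Partial_SMILES
G,NCC=O
A,N[C@@H](C)C=O
"""
table = load_monomer_table(io.StringIO(sample_table))
print(unknown_monomers([("G", "A", "K")], table))
```

Running a check like this against both tables before launching the pipeline catches monomer names that appear in the dataset but are absent from the lookup.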
Run the FP-FWA pipeline from the Pepopt_Public/ directory:
python local_run_fpfw.py test_data/example_data_aligned.csv pIC50 ID M-0001 \
--model ridge \
--desc_type ecfp --desc_norm False \
--fp_len 2048 --fp_radius 2 --fp_sel_n 500 --fp_min_freq 0 \
--fpbit_plot True --fpbit_top_n 5 \
--evaluation True \
--max_enum_out 1000 \
--enum_positions 0 1 2 3 4 5

The four positional arguments are:
| Position | Argument | Description |
|---|---|---|
| 1 | incsv | Path to the aligned input CSV file |
| 2 | actcol | Name of the activity/target column |
| 3 | idcol | Name of the compound ID column |
| 4 | refid | ID of the reference peptide |
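
For readers unfamiliar with this calling convention, the sketch below shows how four positional arguments followed by optional flags can be parsed with `argparse`. It is not the parser in `local_run_fpfw.py`: only a few of the flags from the command above are mirrored, and the defaults and the `str2bool` helper are assumptions made for illustration.

```python
import argparse

def str2bool(text):
    """Parse the literal strings passed on the command line ('True' / 'False')."""
    lowered = text.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {text!r}")

def build_parser():
    """Illustrative parser; argument names follow the README, defaults are assumed."""
    parser = argparse.ArgumentParser(description="Illustrative FP-FWA style CLI")
    parser.add_argument("incsv", help="path to the aligned input CSV file")
    parser.add_argument("actcol", help="name of the activity/target column")
    parser.add_argument("idcol", help="name of the compound ID column")
    parser.add_argument("refid", help="ID of the reference peptide")
    parser.add_argument("--evaluation", type=str2bool, default=False)
    parser.add_argument("--max_enum_out", type=int, default=1000)
    parser.add_argument("--enum_positions", type=int, nargs="+", default=None)
    return parser

args = build_parser().parse_args(
    ["test_data/example_data_aligned.csv", "pIC50", "ID", "M-0001",
     "--evaluation", "True", "--enum_positions", "0", "1", "2"]
)
print(args.actcol, args.refid, args.evaluation, args.enum_positions)
```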
For a complete list of options, run:
python local_run_fpfw.py -h

All outputs are saved to ./results/ (or the directory specified by --results_dir):
results/
├── fpfwa-coefficients.csv # Per-monomer coefficients from the model
├── fpfwa-monomer_db_coefficients.csv # Extrapolated coefficients for monomers in the DB
├── fpfwa_top10_monomers.png # Heatmap of top monomer coefficients (training set)
├── fpfwa_top10_monomers_freq.png # Same heatmap, weighted by monomer frequency
├── fpfwa_top10_monomer_db.png # Heatmap of extrapolated DB monomer coefficients
├── fpfwa_best_1000.csv # Top enumerated peptides (training set monomers)
├── fpfwa_best_1000-monomer_db.csv # Top enumerated peptides (monomer DB)
└── substructure_analysis/ # Per-position Morgan bit substructure figures
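
To give a feel for how ranked lists such as `fpfwa_best_1000.csv` can arise from an additive model, here is a minimal brute-force sketch. It assumes per-position monomer coefficients are already available as nested dictionaries. The function name `enumerate_best`, the toy coefficients, and the reference peptide are all hypothetical, and exhaustive enumeration like this only works for small spaces; how PepOpt handles ultralarge spaces is not shown here.

```python
import heapq
from itertools import product

def enumerate_best(reference, coefficients, enum_positions, max_out):
    """Rank peptides made by swapping monomers at `enum_positions` of `reference`.

    `coefficients[pos][monomer]` is the additive contribution of `monomer` at `pos`.
    Only the varied positions are scored: fixed positions add a constant that
    does not change the ranking.
    """
    choices = [sorted(coefficients[pos]) for pos in enum_positions]

    def candidates():
        for combo in product(*choices):
            peptide = list(reference)
            for pos, monomer in zip(enum_positions, combo):
                peptide[pos] = monomer
            score = sum(coefficients[pos][m] for pos, m in zip(enum_positions, combo))
            yield score, tuple(peptide)

    # Keep only the `max_out` highest-scoring candidates, best first.
    return heapq.nlargest(max_out, candidates())

# Toy coefficients for positions 0 and 2 of a three-residue reference peptide.
coefficients = {
    0: {"A": 0.0, "G": 0.4, "K": -0.2},
    2: {"F": 0.1, "W": 0.6},
}
reference = ("A", "P", "F")

for score, peptide in enumerate_best(reference, coefficients, [0, 2], max_out=3):
    print(round(score, 2), "-".join(peptide))
```

Using `heapq.nlargest` on a generator keeps memory proportional to `max_out` rather than to the number of enumerated candidates.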