Geometric-aware Peptide-Protein Binding Site Prediction
GeoPep predicts which residues of a protein bind a given peptide. It combines the ESM3 protein foundation model with Kolmogorov-Arnold Networks (KANs) in a per-residue classification head, trained with a differentiable distance loss that injects 3D geometric information.
- Quick Start (one command)
- Installation
- Download the Trained Model
- Project Structure
- HuggingFace Token
- Configuration
- End-to-End Pipelines
- Step-by-Step Usage
- Data Format
- Architecture
- Troubleshooting
After installation, HF token setup, and downloading the trained model:
cd scripts
# Train a model from a folder of PDBs (folder must contain complex/ and interface/ subdirs)
python train_pipeline.py --pdb-dir /path/to/pdb
# Run inference on a folder of PDBs using a trained checkpoint
python inference_pipeline.py \
--pdb-dir /path/to/pdb \
--checkpoint ../model_weights/model_distanceLoss.ckpt

The training pipeline saves a checkpoint to model_weights/. The inference pipeline writes per-residue binding probabilities to result/predictions.json.
The trained checkpoint (~16 GB) is hosted on Hugging Face Hub at dchenqwer/geopep. Pull it into the model_weights/ directory:
pip install -U huggingface_hub
hf download dchenqwer/geopep model_distanceLoss.ckpt \
--local-dir model_weights/

After this you should have model_weights/model_distanceLoss.ckpt, ready to use with inference_pipeline.py.

If `hf` is not on PATH, use the deprecated alias `huggingface-cli download …` with the same arguments, or run `python -m huggingface_hub` to confirm the package is installed.
git clone https://github.com/Dian0212/GeoPep.git
cd GeoPep
conda create -n geopep python=3.10
conda activate geopep
pip install -r requirements.txt

Copy the config template:
cp configs/config.yaml.template configs/config.yaml
# Edit configs/config.yaml; only the huggingface.token field is strictly required

GeoPep/
├── geopep/ # Library code
│ ├── data/
│ │ ├── __init__.py
│ │ └── dataset.py # PeptideComplexDataset (used by train.py)
│ ├── models/
│ │ ├── __init__.py
│ │ └── esm3_kan.py # ESM3-KAN per-residue model
│ └── hf_auth.py # HuggingFace token resolver (env > config > prompt)
│
├── scripts/ # Runnable entry points
│ ├── preprocess.py # PDB -> JSON
│ ├── train.py # Single-step trainer
│ ├── predict_esm3.py # Single-step inference
│ ├── postprocess.py # Softmax -> binary + result JSON / CSV
│ ├── train_pipeline.py # ONE COMMAND: PDB folder -> trained model
│ └── inference_pipeline.py # ONE COMMAND: PDB folder + ckpt -> predictions.json
│
├── configs/
│ ├── config.yaml.template # Tracked: copy this to config.yaml
│ └── config.yaml # GITIGNORED: your local copy with your token
│
├── model_weights/
│ ├── README.md
│ └── model_distanceLoss.ckpt # Drop your trained checkpoint here (~16 GB)
│
├── pdb/ # Input PDBs
│ ├── complex/
│ └── interface/
├── json/ # Preprocessed JSON outputs (gitignored)
├── result/ # Inference results (gitignored)
├── requirements.txt
└── README.md
ESM3 is a gated model. You need a HuggingFace token with access to the EvolutionaryScale/esm3 repository.
- Create a token at https://huggingface.co/settings/tokens (Read access is enough).
- On the ESM3 model page, click "Agree and access repository" while logged in.
- Provide the token to GeoPep via one of these methods (the scripts try them in order):

a) Environment variable (recommended for CI / shared machines):

export HF_TOKEN="hf_xxx..."      # Linux/macOS
$env:HF_TOKEN="hf_xxx..."        # PowerShell

b) Config file (local development):

# configs/config.yaml
huggingface:
  token: "hf_xxx..."

`configs/config.yaml` is gitignored, so you don't need to worry about leaking it. But never paste a real token into `config.yaml.template` (which IS tracked).

c) Interactive prompt: if neither of the above is set, the script will pause and ask for your token on stdin.
⚠️ If a token has ever appeared in a public git commit, treat it as compromised — revoke it at https://huggingface.co/settings/tokens and generate a new one.
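The lookup order described above (environment, then config, then prompt) can be sketched roughly as follows. This is not the code in geopep/hf_auth.py; the function name `resolve_hf_token`, its parameters, and the placeholder check are assumptions made for illustration.

```python
import os
from typing import Callable, Mapping, Optional

# Assumption: an unedited template token looks like the "hf_xxx..." example above.
PLACEHOLDER_PREFIX = "hf_xxx"


def resolve_hf_token(
    config: Mapping,
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = input,
) -> str:
    """Return the first usable token: environment, then config, then prompt."""
    env = os.environ if env is None else env

    # 1. The HF_TOKEN environment variable takes priority.
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token

    # 2. Fall back to huggingface.token from the loaded config.
    token = str((config.get("huggingface") or {}).get("token") or "").strip()
    if token and not token.startswith(PLACEHOLDER_PREFIX):
        return token

    # 3. Last resort: ask the user on stdin.
    return prompt("HuggingFace token: ").strip()
```

Passing `env` and `prompt` as arguments keeps the sketch testable without touching the real environment or stdin.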
All settings live in configs/config.yaml (copy from config.yaml.template). Both the one-command pipelines and the individual scripts read from the same file. CLI flags override the relevant fields at runtime.
Key sections:
| Section | Purpose |
|---|---|
| `preprocess.*` | Paths for PDB → JSON conversion |
| `data.*` | train/val JSON shards |
| `model.*` | peptide_len (50), protein_len (500), num_label_types (3) |
| `training.*` | batch_size, learning_rate, max_epochs, distance loss toggle, checkpoint_dir |
| `prediction.*` | checkpoint_path, input_json, device (cuda/cpu) |
| `hardware.*` | GPU ids, mixed-precision setting |
| `huggingface.token` | HF token (or leave placeholder and use env var / prompt) |
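One way to picture "CLI flags override the relevant fields at runtime" is a merge of dotted keys into the loaded config. The sketch below is only an illustration under that assumption; `apply_overrides` and the dotted-key convention are hypothetical and may not match how the scripts map flags to fields.

```python
import copy
from typing import Any, Dict, Optional


def apply_overrides(
    config: Dict[str, Any], overrides: Dict[str, Optional[Any]]
) -> Dict[str, Any]:
    """Return a copy of config with dotted-key overrides applied.

    A value of None means "flag not given" and leaves the config value alone.
    """
    merged = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        if value is None:
            continue
        node = merged
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return merged


base = {"prediction": {"device": "cuda", "checkpoint_path": "a.ckpt"}}
effective = apply_overrides(
    base, {"prediction.device": "cpu", "prediction.checkpoint_path": None}
)
print(effective["prediction"]["device"])
```

Working on a deep copy leaves the file-backed config untouched, so the same base config can be reused across runs.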
cd scripts
python train_pipeline.py --pdb-dir /path/to/pdb

This runs preprocessing → training → checkpoint save. Required layout:
/path/to/pdb/
├── complex/ # Full peptide-protein complexes
│ ├── 1abc_A_B.pdb # Naming: PDBID_PeptideChain_ProteinChain.pdb
│ └── ...
└── interface/ # Interface-only PDBs (same filenames)
    ├── 1abc_A_B.pdb
    └── ...
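Before launching a long training run, it can help to check the folder against this layout. The following is a minimal sketch, assuming the naming rule above is strict (exactly three underscore-separated parts); `parse_complex_name` and `missing_interface_files` are hypothetical helpers, not part of GeoPep.

```python
from pathlib import Path
from typing import List, NamedTuple


class ComplexName(NamedTuple):
    pdb_id: str
    peptide_chain: str
    protein_chain: str


def parse_complex_name(filename: str) -> ComplexName:
    """Split PDBID_PeptideChain_ProteinChain.pdb into its three parts."""
    parts = Path(filename).stem.split("_")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected PDBID_PeptideChain_ProteinChain.pdb, got {filename!r}"
        )
    return ComplexName(*parts)


def missing_interface_files(pdb_root: Path) -> List[str]:
    """List files in complex/ that have no same-named file in interface/."""
    complex_names = {p.name for p in (pdb_root / "complex").glob("*.pdb")}
    interface_names = {p.name for p in (pdb_root / "interface").glob("*.pdb")}
    return sorted(complex_names - interface_names)


print(parse_complex_name("1abc_A_B.pdb"))
```

An empty list from `missing_interface_files(Path("/path/to/pdb"))` would indicate that every complex has a matching interface file.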
CLI flags:
| Flag | Default | Purpose |
|---|---|---|
| `--pdb-dir` | required | PDB root folder |
| `--output-dir` | `../model_weights` | Where to save the trained .ckpt |
| `--work-dir` | `../json/preprocessed` | Intermediate preprocessed JSONs |
| `--config` | `../configs/config.yaml` | Base config (hyperparameters, HF token) |
| `--val-ratio` | `0.2` | Fraction of JSON shards used as val |
| `--skip-preprocess` | off | Reuse existing preprocessed JSONs |
cd scripts
python inference_pipeline.py \
--pdb-dir /path/to/pdb \
--checkpoint ../model_weights/model_distanceLoss.ckpt

Runs preprocessing (inference-only mode; no interface labels needed) → prediction → result JSON. The PDB folder can be flat OR have a complex/ subdir.
CLI flags:
| Flag | Default | Purpose |
|---|---|---|
| `--pdb-dir` | required | PDB folder |
| `--checkpoint` | required | Trained model .ckpt |
| `--result-dir` | `../result` | Where to write predictions.json |
| `--work-dir` | `../json/inference` | Intermediate preprocessed JSON |
| `--config` | `../configs/config.yaml` | Base config |
| `--device` | from config (cuda) | cuda or cpu |
| `--skip-preprocess` | off | Reuse intermediate JSON |
The output is a single result/predictions.json keyed by {pdb_id}_{chain_key}:
{
  "1a1r_C_A": {
    "peptide_chain": "GSVVIVGRIVLSGKPA",
    "protein_chain": "VEGEVQIVSTATQTFLAT...",
    "peptide_bindingProbability": "0.99 0.99 0.99 ...",
    "protein_bindingProbability": "0.99 0.19 0.01 ..."
  }
}

Probability counts match residue counts exactly (padding is stripped). Each value is the raw class-1 (interface) probability from the model's softmax.
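Because the probabilities are stored as space-separated strings, downstream code has to split and convert them. Here is a small sketch of reading one entry and listing the protein residues at or above a cutoff. The sample entry is made up for illustration, and the helper names and the `>=` comparison at 0.5 are assumptions.

```python
import json
from typing import Dict, List


def parse_probabilities(field: str) -> List[float]:
    """Convert a space-separated probability string into floats."""
    return [float(x) for x in field.split()]


def binding_residues(entry: Dict[str, str], threshold: float = 0.5) -> List[int]:
    """Return 0-based protein residue indices with probability >= threshold."""
    probs = parse_probabilities(entry["protein_bindingProbability"])
    return [i for i, p in enumerate(probs) if p >= threshold]


# Illustrative entry in the documented format (sequences and values are made up).
sample = json.loads("""
{
  "demo_C_A": {
    "peptide_chain": "GSV",
    "protein_chain": "VEGE",
    "peptide_bindingProbability": "0.99 0.98 0.40",
    "protein_bindingProbability": "0.99 0.19 0.01 0.75"
  }
}
""")

for key, entry in sample.items():
    print(key, binding_residues(entry))  # prints: demo_C_A [0, 3]
```

For a real run, replace the inline sample with `json.load` on result/predictions.json.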
If you prefer running the individual stages (debugging, custom pipelines, etc.):
cd scripts
# 1. Preprocess PDB files to JSON (per-shard splits)
python preprocess.py --config ../configs/config.yaml
# 2. Train
python train.py --config ../configs/config.yaml
# 3. Predict (writes model_out_argmax / model_out_softmax back into the input JSON)
python predict_esm3.py --config ../configs/config.yaml
# 4. Postprocess (optional: CSV per-residue + final result JSON)
python postprocess.py --input ../json/preprocessed/data_part_5.json --result-dir ../result
# Or: --output-dir ../csv_output --threshold 0.5 for per-PDB CSVs

PEPTIDESEQ<pad><pad>...|PROTEINSEQ<pad><pad>...
|<------- 50 ------->|<-------- 500 ----------->|
Input sequence layout (551 positions):

| Position | Content |
|---|---|
| 0..49 | Peptide (left-aligned, padded with `<pad>` to 50) |
| 50 | Separator `\|` |
| 51..550 | Protein (left-aligned, padded with `<pad>` to 500) |

Per-position labels:

| Value | Meaning |
|---|---|
| 0 | Non-interface residue |
| 1 | Interface (binding site) residue |
| 2 | Padding |
| 3 | Separator |

Per-position distance values:

| Value | Meaning |
|---|---|
| 0 | Interface residue itself |
| >0 | Normalized distance (0–10) to the nearest interface residue |
| -1 | Separator |
| -2 | Padding |
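To make the layout concrete, the sketch below builds the 551-position token list and the matching label list for one peptide-protein pair. It assumes the padding follows each sequence, as the layout diagram shows; `encode_pair` and the interface index sets are hypothetical and are not taken from preprocess.py.

```python
from typing import List, Set, Tuple

PEPTIDE_LEN, PROTEIN_LEN = 50, 500
PAD, SEP = "<pad>", "|"
NON_INTERFACE, INTERFACE, PAD_LABEL, SEP_LABEL = 0, 1, 2, 3


def _pad_tokens(seq: str, size: int) -> List[str]:
    if len(seq) > size:
        raise ValueError(f"sequence of length {len(seq)} exceeds {size}")
    return list(seq) + [PAD] * (size - len(seq))


def _labels(seq: str, interface: Set[int], size: int) -> List[int]:
    real = [INTERFACE if i in interface else NON_INTERFACE for i in range(len(seq))]
    return real + [PAD_LABEL] * (size - len(seq))


def encode_pair(
    peptide: str,
    protein: str,
    peptide_interface: Set[int],
    protein_interface: Set[int],
) -> Tuple[List[str], List[int]]:
    """Return (tokens, labels), each of length 50 + 1 + 500 = 551."""
    tokens = _pad_tokens(peptide, PEPTIDE_LEN) + [SEP] + _pad_tokens(protein, PROTEIN_LEN)
    labels = (
        _labels(peptide, peptide_interface, PEPTIDE_LEN)
        + [SEP_LABEL]
        + _labels(protein, protein_interface, PROTEIN_LEN)
    )
    return tokens, labels


# Made-up sequences; interface indices are 0-based within each chain.
tokens, labels = encode_pair("ACDEFGHIKL", "MNPQRSTVWYAC", {0}, {1, 2})
print(len(tokens), tokens[50], labels[50])  # prints: 551 | 3
```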
input tokens [B, 553] ← BOS + 551 residue tokens + EOS
│
ESM3 encoder
↓ embeddings[:, 1:552, :] drop BOS / EOS
[B, 551, 1536] per-residue 1536-d embeddings
↓ reshape
[B * 551, 1536] each residue is an independent sample
↓
KAN_model1: 1536 → 1153
KAN_model2: 1153 → 770
KAN_model3: 770 → 387
KAN_model4: 387 → 3
KAN_model5: 3 → 3
↓ reshape + permute
logits [B, 3, 551] 3-class logits per residue
↓ softmax(dim=1)
binding probability = softmax[:, 1, :]
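The last two steps of the diagram reduce to a softmax over the three class logits at each position, keeping the class-1 entry. A dependency-free sketch for a single sequence follows; the real model operates on batched tensors, and the class order [non-interface, interface, padding] is assumed from the label values in the data format.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binding_probabilities(logits_per_residue: Sequence[Sequence[float]]) -> List[float]:
    """Take one [non-interface, interface, padding] triple per position
    and return the class-1 (interface) probability for each."""
    return [softmax(triple)[1] for triple in logits_per_residue]


example_logits = [[2.0, -1.0, -4.0], [-3.0, 4.0, -5.0], [0.0, 0.0, 0.0]]
print([round(p, 3) for p in binding_probabilities(example_logits)])
```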
For each batch, the loss has two terms:
- Weighted CE: per-half cross-entropy with class weights `[0.2, 0.8, 0.0]` (padding ignored).
- Distance loss (when `use_distance_loss: true`): `L_dist = (Σ_i P_binding(i) · dist(i)) / num_valid_residues`. The model is penalized for predicting "binding" at residues far from the true interface.

Negative distances (padding/separator) are clamped to 0 before this term, and class-2 weight is 0 in CE, so padding positions never contribute gradient.
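As a worked illustration of the distance term, the sketch below evaluates it for one sequence in plain Python. It assumes `num_valid_residues` counts positions whose distance is non-negative (neither separator nor padding); the training code may define that count differently, and it works on tensors rather than lists.

```python
from typing import Sequence


def distance_loss(p_binding: Sequence[float], dist: Sequence[float]) -> float:
    """Mean of P_binding(i) * dist(i) over valid residues, sentinels clamped to 0."""
    if len(p_binding) != len(dist):
        raise ValueError("p_binding and dist must have the same length")

    # Assumption: a position is valid when its distance is not a sentinel
    # (-1 for the separator, -2 for padding).
    num_valid = sum(1 for d in dist if d >= 0)
    if num_valid == 0:
        return 0.0

    # Clamping negatives to 0 removes any contribution from sentinel positions.
    total = sum(p * max(d, 0.0) for p, d in zip(p_binding, dist))
    return total / num_valid


# Two real residues (distances 0 and 2), one padding, one separator.
print(distance_loss([0.9, 0.5, 0.2, 0.7], [0.0, 2.0, -2.0, -1.0]))  # prints: 0.5
```

In the example, only the second residue contributes (0.5 × 2 = 1.0), and dividing by the two valid residues gives 0.5. A confident prediction on the interface itself (distance 0) costs nothing.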
| Symptom | Fix |
|---|---|
| `401 Unauthorized` when downloading ESM3 | Token missing/invalid; re-check the HF token and that you accepted the model's TOS |
| `RuntimeError: CUDA out of memory` | Reduce training.batch_size or use precision: 16 |
| `UnicodeEncodeError: 'gbk' codec ...` on Windows | Already mitigated in scripts; if it recurs, set `$env:PYTHONIOENCODING="utf-8"` |
| `IndexError: list index out of range` in train | Preprocessing produced 0 valid entries; check length filters (peptide ∈ [10, 50], protein ∈ [10, 500]) |
| Strict `load_state_dict` fails on inference | The checkpoint's KAN layer sizes don't match the inference module; the script falls back to strict=False and prints missing/unexpected keys |
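When the IndexError above appears, a quick way to see whether the length filters are the cause is to count how many pairs fall inside the documented ranges. This is a sketch only; `passes_length_filter` and `count_valid` are hypothetical helpers, and the ranges are assumed to be inclusive as the bracket notation suggests.

```python
from typing import Iterable, Tuple

PEPTIDE_RANGE = (10, 50)   # inclusive bounds, from the table above
PROTEIN_RANGE = (10, 500)  # inclusive bounds, from the table above


def passes_length_filter(peptide: str, protein: str) -> bool:
    """True when both chain lengths fall inside the documented ranges."""
    return (
        PEPTIDE_RANGE[0] <= len(peptide) <= PEPTIDE_RANGE[1]
        and PROTEIN_RANGE[0] <= len(protein) <= PROTEIN_RANGE[1]
    )


def count_valid(pairs: Iterable[Tuple[str, str]]) -> int:
    """Count (peptide, protein) pairs that would survive the filter."""
    return sum(1 for peptide, protein in pairs if passes_length_filter(peptide, protein))


# Made-up sequences: only the first pair is within both ranges.
pairs = [("A" * 16, "G" * 120), ("A" * 9, "G" * 120), ("A" * 20, "G" * 501)]
print(count_valid(pairs))  # prints: 1
```

A count of 0 over your own data would point to the filters, rather than the trainer, as the source of the empty dataset.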
MIT