GeoPep

Geometric-aware Peptide-Protein Binding Site Prediction

GeoPep predicts which residues of a protein bind a given peptide. It combines the ESM3 protein foundation model with Kolmogorov-Arnold Networks (KANs) in a per-residue classification head, trained with a differentiable distance loss that injects 3D geometric information.

Quick Start

After installation, HF token setup, and downloading the trained model:

cd scripts

# Train a model from a folder of PDBs (folder must contain complex/ and interface/ subdirs)
python train_pipeline.py --pdb-dir /path/to/pdb

# Run inference on a folder of PDBs using a trained checkpoint
python inference_pipeline.py \
    --pdb-dir /path/to/pdb \
    --checkpoint ../model_weights/model_distanceLoss.ckpt

The training pipeline saves a checkpoint to model_weights/. The inference pipeline writes per-residue binding probabilities to result/predictions.json.

Download the Trained Model

The trained checkpoint (~16 GB) is hosted on Hugging Face Hub at dchenqwer/geopep. Pull it into the model_weights/ directory:

pip install -U huggingface_hub
hf download dchenqwer/geopep model_distanceLoss.ckpt \
    --local-dir model_weights/

After this you should have model_weights/model_distanceLoss.ckpt, ready to use with inference_pipeline.py.

If hf is not on PATH, use the deprecated alias huggingface-cli download … with the same arguments, or run python -m huggingface_hub to confirm the package is installed.

Installation

git clone https://github.com/Dian0212/GeoPep.git
cd GeoPep
conda create -n geopep python=3.10
conda activate geopep
pip install -r requirements.txt

Copy the config template:

cp configs/config.yaml.template configs/config.yaml
# Edit configs/config.yaml — only the huggingface.token field is strictly required

Project Structure

GeoPep/
├── geopep/                       # Library code
│   ├── data/
│   │   ├── __init__.py
│   │   └── dataset.py            # PeptideComplexDataset (used by train.py)
│   ├── models/
│   │   ├── __init__.py
│   │   └── esm3_kan.py           # ESM3-KAN per-residue model
│   └── hf_auth.py                # HuggingFace token resolver (env > config > prompt)
│
├── scripts/                      # Runnable entry points
│   ├── preprocess.py             # PDB -> JSON
│   ├── train.py                  # Single-step trainer
│   ├── predict_esm3.py           # Single-step inference
│   ├── postprocess.py            # Softmax -> binary + result JSON / CSV
│   ├── train_pipeline.py         # ONE COMMAND: PDB folder -> trained model
│   └── inference_pipeline.py     # ONE COMMAND: PDB folder + ckpt -> predictions.json
│
├── configs/
│   ├── config.yaml.template      # Tracked: copy this to config.yaml
│   └── config.yaml               # GITIGNORED: your local copy with your token
│
├── model_weights/
│   ├── README.md
│   └── model_distanceLoss.ckpt   # Drop your trained checkpoint here (~16 GB)
│
├── pdb/                          # Input PDBs
│   ├── complex/
│   └── interface/
├── json/                         # Preprocessed JSON outputs (gitignored)
├── result/                       # Inference results (gitignored)
├── requirements.txt
└── README.md

HuggingFace Token

ESM3 is a gated model. You need a HuggingFace token with access to the EvolutionaryScale/esm3 repository.

1. Create a token at https://huggingface.co/settings/tokens (Read access is enough).
2. On the ESM3 model page, click "Agree and access repository" while logged in.
3. Provide the token to GeoPep via one of these methods (the scripts try them in order):

a) Environment variable (recommended for CI / shared machines):
```
export HF_TOKEN="hf_xxx..."          # Linux/macOS
$env:HF_TOKEN="hf_xxx..."            # PowerShell
```
b) Config file (local development):
```
# configs/config.yaml
huggingface:
  token: "hf_xxx..."
```
configs/config.yaml is gitignored, so a token stored there stays out of version control. Never paste a real token into config.yaml.template, which is tracked.

c) Interactive prompt: if neither of the above is set, the script will pause and ask for your token on stdin.
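The lookup order described in a) to c) can be sketched as below. This is a minimal illustration, not the code in geopep/hf_auth.py: the function name `resolve_hf_token`, the `env` and `prompt` parameters, and the rule that an empty value counts as unset are all assumptions. Detection of a placeholder value left in the config is not attempted here.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_hf_token(
    config: Mapping,
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = input,
) -> str:
    """Return the first non-empty token from env var, config, or prompt."""
    env = os.environ if env is None else env

    # 1) The HF_TOKEN environment variable takes priority.
    token = (env.get("HF_TOKEN") or "").strip()
    if token:
        return token

    # 2) Fall back to huggingface.token from the already-loaded config.
    section = config.get("huggingface") or {}
    token = str(section.get("token") or "").strip()
    if token:
        return token

    # 3) Last resort: ask on stdin (hypothetical prompt text).
    token = prompt("HuggingFace token: ").strip()
    if not token:
        raise ValueError("No HuggingFace token provided")
    return token
```

Passing `env` and `prompt` explicitly keeps the function easy to exercise without touching the real environment or blocking on stdin.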

⚠️ If a token has ever appeared in a public git commit, treat it as compromised — revoke it at https://huggingface.co/settings/tokens and generate a new one.

Configuration

All settings live in configs/config.yaml (copy from config.yaml.template). Both the one-command pipelines and the individual scripts read from the same file. CLI flags override the relevant fields at runtime.

Key sections:

| Section | Purpose |
| --- | --- |
| `preprocess.*` | Paths for PDB → JSON conversion |
| `data.*` | train/val JSON shards |
| `model.*` | peptide_len (50), protein_len (500), num_label_types (3) |
| `training.*` | batch_size, learning_rate, max_epochs, distance loss toggle, `checkpoint_dir` |
| `prediction.*` | checkpoint_path, input_json, device (cuda/cpu) |
| `hardware.*` | GPU ids, mixed-precision setting |
| `huggingface.token` | HF token (or leave placeholder and use env var / prompt) |

End-to-End Pipelines

Train pipeline

cd scripts
python train_pipeline.py --pdb-dir /path/to/pdb

This runs preprocessing → training → checkpoint save. Required layout:

/path/to/pdb/
├── complex/                # Full peptide-protein complexes
│   ├── 1abc_A_B.pdb        # Naming: PDBID_PeptideChain_ProteinChain.pdb
│   └── ...
└── interface/              # Interface-only PDBs (same filenames)
    ├── 1abc_A_B.pdb
    └── ...
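Before launching a long training run, it can help to check the naming convention and the complex/interface pairing up front. The sketch below is not part of the GeoPep scripts; `parse_complex_filename` and `missing_interface_files` are hypothetical helper names, and the check assumes every file uses exactly three underscore-separated fields.

```python
from pathlib import Path
from typing import NamedTuple


class ComplexName(NamedTuple):
    pdb_id: str
    peptide_chain: str
    protein_chain: str


def parse_complex_filename(path: str) -> ComplexName:
    """Split 'PDBID_PeptideChain_ProteinChain.pdb' into its three fields."""
    parts = Path(path).stem.split("_")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Unexpected PDB filename: {path}")
    return ComplexName(*parts)


def missing_interface_files(pdb_dir: str) -> list[str]:
    """List filenames present in complex/ but absent from interface/."""
    root = Path(pdb_dir)
    complex_names = {p.name for p in (root / "complex").glob("*.pdb")}
    interface_names = {p.name for p in (root / "interface").glob("*.pdb")}
    return sorted(complex_names - interface_names)
```

An empty list from `missing_interface_files("/path/to/pdb")` means every complex has an interface file with the same name.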

CLI flags:

| Flag | Default | Purpose |
| --- | --- | --- |
| `--pdb-dir` | required | PDB root folder |
| `--output-dir` | `../model_weights` | Where to save the trained `.ckpt` |
| `--work-dir` | `../json/preprocessed` | Intermediate preprocessed JSONs |
| `--config` | `../configs/config.yaml` | Base config (hyperparameters, HF token) |
| `--val-ratio` | `0.2` | Fraction of JSON shards used as val |
| `--skip-preprocess` | off | Reuse existing preprocessed JSONs |

Inference pipeline

cd scripts
python inference_pipeline.py \
    --pdb-dir /path/to/pdb \
    --checkpoint ../model_weights/model_distanceLoss.ckpt

Runs preprocessing (inference-only mode — no interface labels needed) → prediction → result JSON. The PDB folder can be flat OR have a complex/ subdir.

CLI flags:

| Flag | Default | Purpose |
| --- | --- | --- |
| `--pdb-dir` | required | PDB folder |
| `--checkpoint` | required | Trained model `.ckpt` |
| `--result-dir` | `../result` | Where to write `predictions.json` |
| `--work-dir` | `../json/inference` | Intermediate preprocessed JSON |
| `--config` | `../configs/config.yaml` | Base config |
| `--device` | from config (cuda) | `cuda` or `cpu` |
| `--skip-preprocess` | off | Reuse intermediate JSON |

The output is a single result/predictions.json keyed by {pdb_id}_{chain_key}:

{
  "1a1r_C_A": {
    "peptide_chain": "GSVVIVGRIVLSGKPA",
    "protein_chain": "VEGEVQIVSTATQTFLAT...",
    "peptide_bindingProbability": "0.99 0.99 0.99 ...",
    "protein_bindingProbability": "0.99 0.19 0.01 ..."
  }
}

Probability counts match residue counts exactly (padding is stripped). Each value is the raw class-1 (interface) probability from the model's softmax.
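Because the probabilities are stored as space-separated strings, downstream code has to split them before use. The following is a minimal sketch of reading the file and listing likely binding residues; the helper names are hypothetical, the 0.5 cutoff is simply the example threshold used elsewhere in this README, and the reported index is 0-based within the chain string (not the PDB residue number).

```python
import json


def load_predictions(path: str) -> dict:
    """Load predictions.json into a dict keyed by {pdb_id}_{chain_key}."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def parse_probabilities(field: str) -> list[float]:
    """Turn '0.99 0.19 0.01' into [0.99, 0.19, 0.01]."""
    return [float(value) for value in field.split()]


def binding_residues(entry: dict, threshold: float = 0.5) -> list[tuple[int, str, float]]:
    """Return (index, residue, probability) for protein residues at or above threshold."""
    sequence = entry["protein_chain"]
    probs = parse_probabilities(entry["protein_bindingProbability"])
    if len(probs) != len(sequence):
        raise ValueError("probability count does not match residue count")
    return [
        (index, residue, prob)
        for index, (residue, prob) in enumerate(zip(sequence, probs))
        if prob >= threshold
    ]
```

The same pattern applies to the peptide side by swapping in `peptide_chain` and `peptide_bindingProbability`.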

Step-by-Step Usage

If you prefer running the individual stages (debugging, custom pipelines, etc.):

cd scripts

# 1. Preprocess PDB files to JSON (per-shard splits)
python preprocess.py --config ../configs/config.yaml

# 2. Train
python train.py --config ../configs/config.yaml

# 3. Predict (writes model_out_argmax / model_out_softmax back into the input JSON)
python predict_esm3.py --config ../configs/config.yaml

# 4. Postprocess (optional: CSV per-residue + final result JSON)
python postprocess.py --input ../json/preprocessed/data_part_5.json --result-dir ../result
# Or: --output-dir ../csv_output --threshold 0.5  for per-PDB CSVs

Data Format

Input sequence layout

PEPTIDESEQ<pad><pad>...|PROTEINSEQ<pad><pad>...
|<------- 50 ------->|<-------- 500 ----------->|

| Position | Content |
| --- | --- |
| 0..49 | Peptide (right-padded with `<pad>` to 50) |
| 50 | Separator `\|` |
| 51..550 | Protein (right-padded with `<pad>` to 500) |
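As a rough illustration of this layout, the sketch below assembles the 551-position token list for one peptide/protein pair. It follows the layout line above, with the sequence first and `<pad>` tokens after it; the function name `build_layout` is hypothetical, and the real preprocessing in scripts/preprocess.py may differ in detail.

```python
PAD = "<pad>"
SEP = "|"
PEPTIDE_LEN = 50
PROTEIN_LEN = 500


def build_layout(peptide: str, protein: str) -> list[str]:
    """Return peptide + padding, separator, protein + padding (551 tokens)."""
    if len(peptide) > PEPTIDE_LEN:
        raise ValueError("peptide longer than 50 residues")
    if len(protein) > PROTEIN_LEN:
        raise ValueError("protein longer than 500 residues")

    # Sequence first, then <pad> tokens up to the fixed width.
    peptide_part = list(peptide) + [PAD] * (PEPTIDE_LEN - len(peptide))
    protein_part = list(protein) + [PAD] * (PROTEIN_LEN - len(protein))
    return peptide_part + [SEP] + protein_part
```

With this layout the separator always sits at index 50 and the first protein residue at index 51, regardless of the actual sequence lengths.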

Label encoding (per residue)

| Value | Meaning |
| --- | --- |
| 0 | Non-interface residue |
| 1 | Interface (binding site) residue |
| 2 | Padding |
| 3 | Separator |

Distance map

| Value | Meaning |
| --- | --- |
| `0` | Interface residue itself |
| `>0` | Normalized distance (0–10) to the nearest interface residue |
| `-1` | Separator |
| `-2` | Padding |
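Read together, the label and distance tables imply a one-to-one correspondence at each position, which makes a cheap sanity check possible on preprocessed data. The sketch below is an assumption-laden illustration, not code from the repository: the function name is hypothetical, and it assumes non-interface residues always carry a strictly positive distance of at most 10.

```python
def check_label_distance_pairs(labels, distances):
    """Return (position, message) for entries that contradict the two tables."""
    problems = []
    for pos, (label, dist) in enumerate(zip(labels, distances)):
        if label not in (0, 1, 2, 3):
            problems.append((pos, "unknown label"))
        elif label == 1 and dist != 0:
            problems.append((pos, "interface residue should have distance 0"))
        elif label == 0 and not (0 < dist <= 10):
            problems.append((pos, "non-interface distance should be in (0, 10]"))
        elif label == 2 and dist != -2:
            problems.append((pos, "padding should have distance -2"))
        elif label == 3 and dist != -1:
            problems.append((pos, "separator should have distance -1"))
    return problems
```

An empty return value means the two arrays agree position by position under these assumptions.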

Architecture

input tokens [B, 553]                ← BOS + 551 residue tokens + EOS
       │
   ESM3 encoder
       ↓ embeddings[:, 1:552, :]      drop BOS / EOS
   [B, 551, 1536]                     per-residue 1536-d embeddings
       ↓ reshape
   [B * 551, 1536]                    each residue is an independent sample
       ↓
   KAN_model1: 1536 → 1153
   KAN_model2: 1153 → 770
   KAN_model3: 770  → 387
   KAN_model4: 387  → 3
   KAN_model5: 3    → 3
       ↓ reshape + permute
   logits [B, 3, 551]                 3-class logits per residue
       ↓ softmax(dim=1)
   binding probability = softmax[:, 1, :]
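The final step of the diagram, a softmax over the class dimension followed by selecting class 1, can be sketched in plain Python for a single sample. This is only an illustration of that last step using nested lists laid out as [class][residue]; the ESM3 encoder and the KAN layers are not reproduced, and the function names are hypothetical.

```python
import math


def softmax_over_classes(logits: list[list[float]]) -> list[list[float]]:
    """Softmax along the class axis of a [num_classes][num_residues] array."""
    num_classes = len(logits)
    num_residues = len(logits[0])
    probs = [[0.0] * num_residues for _ in range(num_classes)]
    for i in range(num_residues):
        column = [logits[c][i] for c in range(num_classes)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in column]
        total = sum(exps)
        for c in range(num_classes):
            probs[c][i] = exps[c] / total
    return probs


def binding_probability(logits: list[list[float]]) -> list[float]:
    """Class-1 (interface) probability per residue, i.e. softmax[1, :]."""
    return softmax_over_classes(logits)[1]
```

For a full sample, `logits` would hold 3 rows of 551 values (50 peptide positions, 1 separator, 500 protein positions).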

Loss = Cross-Entropy + Differentiable Distance Loss

For each batch, the loss has two terms:

1. Weighted CE: per-half cross-entropy with class weights [0.2, 0.8, 0.0] (padding ignored).
2. Distance loss (when use_distance_loss: true): L_dist = (Σ_i P_binding(i) · dist(i)) / num_valid_residues. The model is penalized for predicting "binding" at residues far from the true interface.

Negative distances (padding/separator) are clamped to 0 before this term, and class-2 weight is 0 in CE — so padding positions never contribute gradient.
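To make the distance term concrete, here is a small pure-Python sketch for one sequence. It is not the training code: the function name is hypothetical, and it assumes that num_valid_residues counts the positions whose distance-map entry is not a negative sentinel.

```python
def distance_loss(p_binding: list[float], distances: list[float]) -> float:
    """Sum of P_binding(i) * dist(i), divided by the number of valid residues."""
    if len(p_binding) != len(distances):
        raise ValueError("p_binding and distances must have the same length")

    # Assumption: entries of -1 (separator) and -2 (padding) are not valid residues.
    num_valid = sum(1 for d in distances if d >= 0)
    if num_valid == 0:
        return 0.0

    # Clamp negative sentinels to 0 so those positions add nothing to the sum.
    clamped = [max(d, 0.0) for d in distances]
    total = sum(p * d for p, d in zip(p_binding, clamped))
    return total / num_valid
```

With probabilities `[0.9, 0.1, 0.5, 0.7]` and distances `[0.0, 4.0, -1, -2]`, only the second position contributes (0.1 × 4.0), and dividing by the two valid residues gives 0.2. Shifting probability mass onto the residue at distance 4.0 raises the loss, which is the intended penalty.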

Troubleshooting

| Symptom | Fix |
| --- | --- |
| `401 Unauthorized` when downloading ESM3 | Token missing/invalid; re-check the HF token and that you accepted the model's TOS |
| `RuntimeError: CUDA out of memory` | Reduce `training.batch_size` or use `precision: 16` |
| `UnicodeEncodeError: 'gbk' codec ...` on Windows | Already mitigated in scripts; if it recurs, set `$env:PYTHONIOENCODING="utf-8"` |
| `IndexError: list index out of range` in train | Preprocessing produced 0 valid entries; check length filters (peptide ∈ [10, 50], protein ∈ [10, 500]) |
| Strict `load_state_dict` fails on inference | The checkpoint's KAN layer sizes don't match the inference module; the script falls back to `strict=False` and prints missing/unexpected keys |
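When diagnosing the "0 valid entries" case, a quick way to see how many sequence pairs survive the length filters is to apply the same bounds by hand. The sketch below assumes both ranges are inclusive, as the bracket notation suggests; the helper names are hypothetical and the actual filter lives in the preprocessing script.

```python
PEPTIDE_RANGE = (10, 50)    # inclusive bounds, as listed in the table above
PROTEIN_RANGE = (10, 500)


def passes_length_filter(peptide: str, protein: str) -> bool:
    """True when both sequences fall inside their allowed length range."""
    peptide_ok = PEPTIDE_RANGE[0] <= len(peptide) <= PEPTIDE_RANGE[1]
    protein_ok = PROTEIN_RANGE[0] <= len(protein) <= PROTEIN_RANGE[1]
    return peptide_ok and protein_ok


def count_valid(pairs) -> int:
    """Count (peptide, protein) pairs that pass the filter."""
    return sum(1 for peptide, protein in pairs if passes_length_filter(peptide, protein))
```

If `count_valid` returns 0 for your dataset, training has nothing to index into, which matches the `IndexError` symptom.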

License

MIT
