Release v2.10.6 · Novel-Therapeutics/boltz-community

Highlights

Fixed training-data preprocessing so records produced by scripts/process/rcsb.py actually carry the cluster_id, msa_id, and entity_id that downstream training expects.

This is a training-only fix and does not affect inference. If you only run boltz predict against the released model weights, nothing here changes for you. If you used scripts/process/rcsb.py to build a custom training set, the cluster_id, msa_id, and entity_id fields in those records were silently wrong; re-run preprocessing and retrain any models that were trained on them.
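To check whether an existing custom training set is affected, one option is to count how many chains show the symptoms described in this release. The following is a minimal sketch, not part of the repository. It assumes each record is a JSON file with a top-level chains list whose entries carry cluster_id, msa_id, and entity_id; that layout and the helper names symptoms and scan are assumptions for illustration.

```python
import json
from pathlib import Path


def symptoms(record: dict) -> dict:
    """Count chains in one record that show each symptom.

    The "chains" key and the per-chain field names are assumed, not
    taken from the repository's record schema.
    """
    chains = record.get("chains", [])
    return {
        # -1 is the fall-through value described in the release notes.
        "cluster_id_unset": sum(c.get("cluster_id") in (-1, "-1") for c in chains),
        "entity_id_missing": sum(c.get("entity_id") is None for c in chains),
        # An empty msa_id is expected for DNA/RNA/ligand chains, so this
        # count is only suspicious when it covers protein chains too.
        "msa_id_empty": sum(c.get("msa_id") in ("", None) for c in chains),
    }


def scan(records_dir: str) -> dict:
    """Sum the symptom counts over every *.json file in a directory."""
    totals = {"cluster_id_unset": 0, "entity_id_missing": 0, "msa_id_empty": 0}
    for path in Path(records_dir).glob("*.json"):
        for key, count in symptoms(json.loads(path.read_text())).items():
            totals[key] += count
    return totals
```

If cluster_id_unset and entity_id_missing equal the total number of chains, the records were very likely produced by the pre-2.10.6 script.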

Details

Upstream issue jwohlwend/boltz#686 reported two concrete problems in the training preprocessing pipeline; both reproduced unchanged in this fork.

cluster_id was always -1. scripts/process/cluster.py keys its clustering.json output by hash_sequence(seq) for polymers and by CCD code for ligands. scripts/process/rcsb.py was looking up f"{pdb_id}_{entity_id}", a key that never exists in clustering.json, so every chain fell through to cluster_id=-1 and ClusterSampler weighted everything uniformly instead of by cluster size.
msa_id was always "". The # FIX comment in upstream rcsb.py was meant literally: the field was never populated, so training records pointed at no MSA file and training silently ran on single-sequence inputs with no MSA features whenever those records were loaded.
entity_id was dropped. Records ended up with entity_id: null.
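The cluster_id failure comes down to a dictionary lookup with the wrong key shape. The sketch below reproduces it on toy data. The body of hash_sequence here is a stand-in (SHA-256 of the sequence is an assumption), and the clustering contents, the PDB ID, and the function names old_lookup and new_lookup are hypothetical.

```python
import hashlib


def hash_sequence(seq: str) -> str:
    """Stand-in for the shared helper; SHA-256 is an assumption here."""
    return hashlib.sha256(seq.encode()).hexdigest()


# Toy clustering.json contents, keyed the way the release notes describe:
# sequence hash for polymers, CCD code for ligands.
seq = "MKTAYIAKQR"
clustering = {hash_sequence(seq): 0, "atp": 1}


def old_lookup(pdb_id: str, entity_id: int) -> int:
    # Pre-2.10.6 behaviour: this key shape never appears in the mapping,
    # so the default of -1 is always returned.
    return clustering.get(f"{pdb_id}_{entity_id}", -1)


def new_lookup(sequence: str) -> int:
    # Fixed behaviour: build the key the same way the clustering step did.
    return clustering.get(hash_sequence(sequence), -1)


old_id = old_lookup("1abc", 1)
new_id = new_lookup(seq)
```

Because the miss returns a default rather than raising, nothing in the old pipeline signalled that the lookup had failed.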

What changed

scripts/process/rcsb.py now computes the per-chain cluster key the same way cluster.py emits it: hash_sequence(seq) for polymers, lowercased CCD code for ligands. The two scripts now share hash_sequence instead of carrying parallel definitions.
msa_id is populated with hash_sequence(seq) for protein chains (and stays empty for DNA/RNA/ligands, which src/boltz/data/module/training.py treats as "no MSA").
entity_id is propagated from the parsed chain through to ChainInfo.
scripts/process/mmcif.py exposes a new chain_to_seq: dict[str, str] field on ParsedStructure so rcsb.py can hash polymer sequences without re-parsing the mmCIF.
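The key derivation described above can be sketched as follows. This is an illustration under assumptions, not the code in rcsb.py: the function names cluster_key and msa_id_for, the mol_type strings, and the SHA-256 body of hash_sequence are all hypothetical, and chain_to_seq is a toy stand-in for the new ParsedStructure field.

```python
import hashlib
from typing import List, Optional


def hash_sequence(seq: str) -> str:
    """Stand-in for the shared helper; SHA-256 is an assumption here."""
    return hashlib.sha256(seq.encode()).hexdigest()


POLYMER_TYPES = {"protein", "dna", "rna"}  # assumed type labels


def cluster_key(mol_type: str, seq: Optional[str], ccd_codes: List[str]) -> str:
    """Polymers key on the sequence hash, ligands on a lowercased CCD code."""
    if mol_type in POLYMER_TYPES:
        if seq is None:
            raise ValueError("polymer chains need a sequence")
        return hash_sequence(seq)
    # For a branched ligand only the first residue's CCD code is used.
    return ccd_codes[0].lower()


def msa_id_for(mol_type: str, seq: Optional[str]) -> str:
    """Only protein chains get an MSA id; everything else stays empty."""
    if mol_type == "protein" and seq is not None:
        return hash_sequence(seq)
    return ""


# Toy chain-to-sequence mapping in the shape dict[str, str].
chain_to_seq = {"A": "MKTAYIAKQR", "B": "ACGU"}
protein_key = cluster_key("protein", chain_to_seq["A"], [])
ligand_key = cluster_key("ligand", None, ["NAG", "BMA"])
rna_msa = msa_id_for("rna", chain_to_seq["B"])
```

In this sketch a protein chain's msa_id and cluster key are the same hash, which follows from both being described as hash_sequence(seq).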

What did not change

cluster.py output format is preserved. Keying clusters by sequence content is the right design — a sequence shared across PDB entries should cluster together regardless of which pdb_id_entity_id it appears under.
Branched-ligand handling uses the first residue's CCD as the cluster key, matching the upstream simplification.
The published Boltz-1 / Boltz-2 weights are unaffected — upstream presumably trained with a private pipeline (the # FIX comment in the open-sourced rcsb.py suggests the released script was never the one they actually used).

Commits since `v2.10.5`

1b6f361 Release 2.10.6
