v2.10.6

@volgin volgin released this 28 May 10:00
· 5 commits to main since this release

Highlights

  • Fixed training-data preprocessing so records produced by scripts/process/rcsb.py actually carry the cluster_id, msa_id, and entity_id that downstream training expects.

This is a training-only fix and does not affect inference users. If you only run boltz predict against the released model weights, nothing here changes for you. If you used scripts/process/rcsb.py to build a custom training set, the cluster_id, msa_id, and entity_id fields in your records were silently wrong, and any model trained on them should be retrained on re-preprocessed records.
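To check whether an existing custom training set is affected, a rough audit along the following lines may help. It is a sketch under stated assumptions, not part of the release: it assumes each record is a JSON file with a top-level "chains" list whose entries carry cluster_id and entity_id keys, and the names is_suspect and audit_records are hypothetical. An empty msa_id is deliberately not flagged on its own, because it is legitimate for non-protein chains.

```python
import json
from pathlib import Path


def is_suspect(chain: dict) -> bool:
    """Heuristic: True when a chain shows the pre-fix symptoms.

    Assumes the chain dict uses the keys `cluster_id` and `entity_id`;
    adjust if your record schema differs.
    """
    return chain.get("cluster_id") == -1 or chain.get("entity_id") is None


def audit_records(records_dir: Path) -> dict[str, int]:
    """Map each record file name to its number of suspect chains.

    Files with no suspect chains are omitted from the result.
    """
    suspect: dict[str, int] = {}
    for path in sorted(records_dir.glob("*.json")):
        record = json.loads(path.read_text())
        count = sum(is_suspect(chain) for chain in record.get("chains", []))
        if count:
            suspect[path.name] = count
    return suspect
```

A non-empty result from audit_records(Path("path/to/records")) suggests the records were produced before this fix and should be regenerated.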

Details

Upstream issue jwohlwend/boltz#686 reported concrete problems in the training preprocessing pipeline; the three listed below reproduced unchanged in this fork.

  • cluster_id was always -1. scripts/process/cluster.py keys its clustering.json output by hash_sequence(seq) for polymers and by CCD code for ligands. scripts/process/rcsb.py was looking up f"{pdb_id}_{entity_id}", which never hits, so every chain fell through to cluster_id=-1 and ClusterSampler weighted everything uniformly instead of by cluster size.
  • msa_id was always "". The # FIX comment in upstream rcsb.py was meant literally: the field was never populated, so training records pointed at no MSA file and training silently ran on single-sequence inputs with no MSA features whenever those records were loaded.
  • entity_id was dropped. Records ended up with entity_id: null.
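The cluster_id miss is easy to reproduce in isolation. The sketch below is illustrative only: hash_sequence is assumed here to be a SHA-256 hex digest of the sequence (a stand-in for the project's helper), and the sequence, identifiers, and cluster number are made up.

```python
import hashlib


def hash_sequence(seq: str) -> str:
    # Assumed stand-in for the project's helper: SHA-256 hex digest.
    return hashlib.sha256(seq.encode()).hexdigest()


seq = "MKTAYIAKQR"  # made-up polymer sequence
pdb_id, entity_id = "1abc", 1  # made-up identifiers

# clustering.json as described above: polymers are keyed by sequence hash.
# The cluster number 7 is arbitrary.
clustering = {hash_sequence(seq): 7}

# Pre-fix lookup: the key format does not exist in the mapping,
# so the default of -1 is always returned.
old_cluster_id = clustering.get(f"{pdb_id}_{entity_id}", -1)

# Lookup with the same key that the clustering step wrote.
new_cluster_id = clustering.get(hash_sequence(seq), -1)

print(old_cluster_id, new_cluster_id)
```

Because the fallback value is a valid-looking integer, the mismatch raises no error, which is why the problem stayed silent.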

What changed

  • scripts/process/rcsb.py now computes the per-chain cluster key the same way cluster.py emits it: hash_sequence(seq) for polymers, lowercased CCD code for ligands. The two scripts now share hash_sequence instead of carrying parallel definitions.
  • msa_id is populated with hash_sequence(seq) for protein chains. It stays empty for DNA, RNA, and ligands; src/boltz/data/module/training.py treats an empty msa_id as "no MSA".
  • entity_id is propagated from the parsed chain through to ChainInfo.
  • scripts/process/mmcif.py exposes a new chain_to_seq: dict[str, str] field on ParsedStructure so rcsb.py can hash polymer sequences without re-parsing the mmCIF.
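The per-chain rules above can be sketched as a single function. This is a minimal illustration, not the code in rcsb.py: ChainKeys, derive_chain_keys, and the mol_type strings are hypothetical names, and hash_sequence is again assumed to be a SHA-256 hex digest.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def hash_sequence(seq: str) -> str:
    # Assumed stand-in for the shared helper: SHA-256 hex digest.
    return hashlib.sha256(seq.encode()).hexdigest()


@dataclass
class ChainKeys:
    cluster_key: str  # key to look up in clustering.json
    msa_id: str  # empty string means "no MSA"
    entity_id: Optional[str]  # carried through from the parsed chain


POLYMER_TYPES = {"protein", "dna", "rna"}  # hypothetical type labels


def derive_chain_keys(
    mol_type: str,
    entity_id: Optional[str],
    seq: Optional[str] = None,
    ccd_code: Optional[str] = None,
) -> ChainKeys:
    """Derive cluster key and msa_id for one chain per the rules above."""
    if mol_type in POLYMER_TYPES:
        if seq is None:
            raise ValueError("polymer chains need a sequence")
        cluster_key = hash_sequence(seq)
    else:
        if ccd_code is None:
            raise ValueError("ligand chains need a CCD code")
        cluster_key = ccd_code.lower()

    # Only protein chains get an MSA id; everything else stays empty.
    msa_id = hash_sequence(seq) if mol_type == "protein" else ""
    return ChainKeys(cluster_key=cluster_key, msa_id=msa_id, entity_id=entity_id)
```

With keys derived this way, a lookup such as clustering.get(keys.cluster_key, -1) uses the same key format that cluster.py writes, so -1 is left for chains that are genuinely absent from the clustering output.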

What did not change

  • cluster.py output format is preserved. Keying clusters by sequence content is the right design — a sequence shared across PDB entries should cluster together regardless of which pdb_id_entity_id it appears under.
  • Branched-ligand handling uses the first residue's CCD as the cluster key, matching the upstream simplification.
  • The published Boltz-1 / Boltz-2 weights are unaffected — upstream presumably trained with a private pipeline (the # FIX comment in the open-sourced rcsb.py suggests the released script was never the one they actually used).

Commits since v2.10.5

  • 1b6f361 Release 2.10.6