v2.10.6
Highlights
- Fixed training-data preprocessing so records produced by scripts/process/rcsb.py actually carry the cluster_id, msa_id, and entity_id that downstream training expects.
This is a training-only fix and does not affect inference users. If you have run boltz predict against the released model weights, nothing here changes for you. If you used scripts/process/rcsb.py to build a custom training set, the fields below were silently wrong and your trained models should be re-trained against re-preprocessed records.
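If you are unsure whether an existing custom training set was built with the broken script, a quick scan of its records can tell you. The sketch below is a minimal check, assuming each record is a JSON object with a chains list whose entries carry cluster_id and entity_id keys. That layout, the function name looks_stale, and the sample records are illustrative assumptions, not taken from the repository.

```python
from typing import Any


def looks_stale(record: dict[str, Any]) -> bool:
    """Return True if every chain shows the pre-2.10.6 symptoms.

    Assumed (hypothetical) layout:
        {"chains": [{"cluster_id": ..., "msa_id": ..., "entity_id": ...}, ...]}
    """
    chains = record.get("chains", [])
    if not chains:
        return False  # no chains, nothing to judge
    return all(
        chain.get("cluster_id", -1) == -1 and chain.get("entity_id") is None
        for chain in chains
    )


# Made-up records for illustration only.
old_record = {"chains": [{"cluster_id": -1, "msa_id": "", "entity_id": None}]}
new_record = {"chains": [{"cluster_id": 42, "msa_id": "ab12", "entity_id": 0}]}

print(looks_stale(old_record))  # True
print(looks_stale(new_record))  # False
```

msa_id is deliberately left out of the check, because it legitimately stays empty for DNA, RNA, and ligand chains after the fix.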
Details
Upstream issue jwohlwend/boltz#686 reported two concrete problems in the training preprocessing pipeline; both reproduced unchanged in this fork.
- cluster_id was always -1. scripts/process/cluster.py keys its clustering.json output by hash_sequence(seq) for polymers and by CCD code for ligands. scripts/process/rcsb.py was looking up f"{pdb_id}_{entity_id}", which never hits, so every chain fell through to cluster_id=-1 and ClusterSampler weighted everything uniformly instead of by cluster size.
- msa_id was always "". The # FIX comment in upstream rcsb.py was literal: training records pointed at no MSA file, so training silently ran on single-sequence inputs with no MSA features whenever those records were loaded.
- entity_id was dropped. Records ended up with entity_id: null.
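The cluster_id miss is easy to reproduce in isolation. The following is a minimal sketch, not the repository code: hash_sequence here is a stand-in that assumes a SHA-256 hex digest, and the sequence, PDB id, and cluster numbers are made up. It only shows why a lookup keyed by pdb_id and entity_id can never hit a mapping keyed by sequence hash.

```python
import hashlib


def hash_sequence(seq: str) -> str:
    # Stand-in for the shared helper; SHA-256 is an assumption of this sketch.
    return hashlib.sha256(seq.encode()).hexdigest()


seq = "MKTAYIAKQR"  # made-up protein sequence

# Shape of clustering.json as described above: polymers keyed by sequence
# hash, ligands keyed by CCD code. Cluster numbers are made up.
clustering = {hash_sequence(seq): 7, "atp": 12}


def old_lookup(pdb_id: str, entity_id: int) -> int:
    # Pre-fix behaviour: the key format never appears in the mapping.
    return clustering.get(f"{pdb_id}_{entity_id}", -1)


def new_lookup(sequence: str) -> int:
    # Post-fix behaviour: use the same key the clustering step emitted.
    return clustering.get(hash_sequence(sequence), -1)


old_cluster = old_lookup("1abc", 1)
new_cluster = new_lookup(seq)
print(old_cluster, new_cluster)  # -1 7
```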
What changed
- scripts/process/rcsb.py now computes the per-chain cluster key the same way cluster.py emits it: hash_sequence(seq) for polymers, lowercased CCD code for ligands. The two scripts now share hash_sequence instead of carrying parallel definitions.
- msa_id is populated with hash_sequence(seq) for protein chains (and stays empty for DNA/RNA/ligands, which src/boltz/data/module/training.py treats as "no MSA").
- entity_id is propagated from the parsed chain through to ChainInfo.
- scripts/process/mmcif.py exposes a new chain_to_seq: dict[str, str] field on ParsedStructure so rcsb.py can hash polymer sequences without re-parsing the mmCIF.
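The per-chain rules above can be sketched as a single function. This is an illustration under stated assumptions rather than the code in rcsb.py: chain_keys, its parameters, and the mol_type strings are hypothetical names, and hash_sequence is again a SHA-256 stand-in.

```python
import hashlib
from typing import Optional


def hash_sequence(seq: str) -> str:
    # Stand-in for the shared helper; SHA-256 is an assumption of this sketch.
    return hashlib.sha256(seq.encode()).hexdigest()


def chain_keys(
    mol_type: str,
    seq: Optional[str] = None,
    ccd_codes: Optional[list[str]] = None,
) -> tuple[str, str]:
    """Return (cluster_key, msa_id) for one chain.

    mol_type is one of "protein", "dna", "rna", "ligand" (hypothetical labels).
    """
    if mol_type == "ligand":
        if not ccd_codes:
            raise ValueError("ligand chains need at least one CCD code")
        # Branched ligands: only the first residue's CCD code is used.
        return ccd_codes[0].lower(), ""
    if seq is None:
        raise ValueError("polymer chains need a sequence")
    key = hash_sequence(seq)
    # Only protein chains get an MSA id; DNA/RNA keep the empty string.
    msa_id = key if mol_type == "protein" else ""
    return key, msa_id


print(chain_keys("ligand", ccd_codes=["NAG", "BMA"]))  # ('nag', '')
```

For a protein chain the two returned values are identical, which mirrors the description above: the same sequence hash serves as both the cluster key and the msa_id.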
What did not change
- cluster.py output format is preserved. Keying clusters by sequence content is the right design: a sequence shared across PDB entries should cluster together regardless of which pdb_id_entity_id it appears under.
- Branched-ligand handling uses the first residue's CCD as the cluster key, matching the upstream simplification.
- The published Boltz-1 / Boltz-2 weights are unaffected; upstream presumably trained with a private pipeline (the # FIX comment in the open-sourced rcsb.py suggests the released script was never the one they actually used).
Commits since v2.10.5
1b6f361 Release 2.10.6