Update to ChEBI 2.0 & SDF data #147
Open
+412
−1,470
Starting with version 245, ChEBI has changed (a) where files are located and (b) the annotation property for SMILES (see #123).
This PR makes it possible to load newer ChEBI versions. Also, I removed some outdated files and introduced SDF data to the pipeline.
Why SDF files?
Because this is what ChEBI does internally: molecules are stored in SDF format, and the SMILES strings are generated from the SDF files. Using the SMILES is a problem for us because RDKit cannot parse many of them, although it can parse the SDF entries. ChEBI generates its SMILES with a specialised pipeline (https://github.com/chembl/libRDChEBI/blob/main/libRDChEBI/formats.py) that is RDKit-based but apparently changes the mol object enough that the molecule cannot be trivially reconstructed from the SMILES string.
If we use the same source (SDF) and the same pipeline (the chembl_structure_pipeline), we can maximise the number of molecules available for training.
What changes with SDF files?
chebi.sdf file) gets downloaded
data.pkl contains a new column, mol (containing RDKit mol objects)
ChemDataReader now accepts both SMILES strings and mol objects
Result
tokens.txt and molecules like CHEBI:33481 - diphosphooctadecatungstate(6−)). This could either broaden our coverage or confuse the model.
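For reviewers less familiar with the SDF layout discussed above, here is a minimal, standard-library-only sketch of how an SDF file is organised: each record is a mol block ending in `M  END`, followed by `> <tag>` data items, and closed by a `$$$$` line. It is not the code in this PR. The helper names (`iter_sdf_records`, `split_record`), the tag names `ChEBI ID` and `ChEBI Name`, and the two hand-written single-atom records are assumptions for illustration only. The real file may use different tags, and the pipeline would more plausibly hand the mol block to RDKit (for example via `Chem.SDMolSupplier`) than parse it by hand.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, Dict[str, str]]  # (mol block text, data fields)


def split_record(lines: List[str]) -> Record:
    """Split one SDF record into its mol block and its data fields."""
    mol_lines: List[str] = []
    fields: Dict[str, List[str]] = {}
    current: Optional[str] = None  # data field currently being read
    in_mol_block = True
    for line in lines:
        if in_mol_block:
            mol_lines.append(line)
            if line.startswith("M  END"):
                in_mol_block = False
        elif line.startswith(">"):
            # Header looks like "> <Some Tag>"; the name sits between < and >.
            current = line[line.find("<") + 1 : line.rfind(">")]
            fields[current] = []
        elif line.strip() and current is not None:
            fields[current].append(line)
        else:
            current = None  # a blank line terminates the data item
    return "\n".join(mol_lines), {k: "\n".join(v) for k, v in fields.items()}


def iter_sdf_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one (mol_block, fields) pair per '$$$$'-terminated record."""
    record: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):
            if record:
                yield split_record(record)
            record = []
        else:
            record.append(line)
    # Anything after the last '$$$$' is ignored in this sketch.


# Hand-written sample with hypothetical tag names; not taken from chebi.sdf.
SAMPLE = "\n".join([
    "",
    "  hand-written example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <ChEBI ID>",
    "CHEBI:15377",
    "",
    "> <ChEBI Name>",
    "water",
    "",
    "$$$$",
    "",
    "  hand-written example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <ChEBI ID>",
    "CHEBI:16183",
    "",
    "> <ChEBI Name>",
    "methane",
    "",
    "$$$$",
])

for mol_block, fields in iter_sdf_records(SAMPLE.splitlines()):
    print(fields.get("ChEBI ID"), fields.get("ChEBI Name"))
```

Keeping the mol block as an opaque string means a structure parser can be plugged in later without changing how records are split.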