@sfluegel05
Collaborator

Starting with version 245, ChEBI has changed (a) where files are located and (b) the annotation property for SMILES (see #123).

This PR makes it possible to load newer ChEBI versions. Also, I removed some outdated files and introduced SDF data to the pipeline.

Why SDF files?

Because this is what ChEBI does internally. Molecules are stored in SDF format, and the SMILES strings are generated from the SDF files. Relying on the SMILES is a problem for us because RDKit cannot parse many of them, although it can parse the corresponding SDF entries. ChEBI generates its SMILES with a specialised pipeline (https://github.com/chembl/libRDChEBI/blob/main/libRDChEBI/formats.py) that is RDKit-based but apparently changes the mol object enough that the molecule cannot be trivially reconstructed from the SMILES string.

If we use the same source (SDF) and the same pipeline (the chembl_structure_pipeline) we can maximise the number of molecules available for training.
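
For readers unfamiliar with the format: an SDF file is a sequence of records, each consisting of a mol block (ending in `M  END`), optional data fields (`> <NAME>` followed by the value and a blank line), and a `$$$$` terminator. The following is a minimal, standard-library-only sketch of that layout, not the code used in this PR. The function names, the sample record, and the `ChEBI ID` field name are assumptions for illustration; the field names in the actual chebi.sdf may differ.

```python
import io
import re
from typing import Dict, Iterator, List, Optional, TextIO, Tuple

# Matches SDF data headers such as "> <ChEBI ID>" and captures the field name.
_DATA_HEADER = re.compile(r"^>.*<([^>]+)>")


def _split_record(lines: List[str]) -> Tuple[str, Dict[str, str]]:
    """Split one record into its mol block and its data fields."""
    end = lines.index("M  END") + 1 if "M  END" in lines else len(lines)
    mol_block = "\n".join(lines[:end])
    fields: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in lines[end:]:
        match = _DATA_HEADER.match(line)
        if match:
            current = match.group(1)
            fields[current] = []
        elif not line.strip():
            current = None  # a blank line closes the current field
        elif current is not None:
            fields[current].append(line)
    return mol_block, {name: "\n".join(value) for name, value in fields.items()}


def iter_sdf_records(handle: TextIO) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (mol_block, fields) for each "$$$$"-terminated record."""
    lines: List[str] = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.strip() == "$$$$":
            if lines:
                yield _split_record(lines)
            lines = []
        else:
            lines.append(line)


# Hypothetical one-atom record, written by hand for this sketch.
SAMPLE = """water
  hypothetical-sample

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <ChEBI ID>
CHEBI:15377

$$$$
"""

for mol_block, fields in iter_sdf_records(io.StringIO(SAMPLE)):
    print(fields["ChEBI ID"], len(mol_block.splitlines()), "mol block lines")
```

In practice there is no need to split the file by hand: RDKit reads SDF files directly via `Chem.SDMolSupplier`, and a single mol block can be parsed with `Chem.MolFromMolBlock`.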

What changes with SDF files?

  • A new raw file (the chebi.sdf file) gets downloaded
  • The data.pkl contains a new column, mol (containing RDKit mol objects)
  • The ChemDataReader now accepts both SMILES strings and mol objects
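
The last point amounts to a small dispatch on the input type. The sketch below illustrates the idea only; it is not the actual ChemDataReader code, and `as_mol` and the stub parser are hypothetical names. The SMILES parser is passed in as a parameter so the example runs without RDKit.

```python
from typing import Any, Callable, Optional


def as_mol(entry: Any, parse_smiles: Callable[[str], Optional[Any]]) -> Optional[Any]:
    """Return a mol object for either kind of input.

    Strings are treated as SMILES and handed to `parse_smiles`, which is
    expected to return None when parsing fails. Anything else is assumed
    to already be a mol object and is passed through unchanged.
    """
    if isinstance(entry, str):
        return parse_smiles(entry)
    return entry


def stub_parser(smiles: str) -> Optional[tuple]:
    """Stand-in for a real parser: rejects one marker string, wraps the rest."""
    return None if smiles == "not-a-smiles" else ("mol", smiles)


already_parsed = ("mol", "from-sdf")
print(as_mol("CCO", stub_parser))
print(as_mol("not-a-smiles", stub_parser))
print(as_mol(already_parsed, stub_parser) is already_parsed)
```

With RDKit, `parse_smiles` could be `Chem.MolFromSmiles`, which returns `None` for strings it cannot parse.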

Result

  • I tested this on ChEBI v246. From the SDF file, I was able to generate training instances for 190,633 molecules (out of 190,634). Parsing the SMILES strings instead yielded only 178,523 instances.
  • Open question: Is this actually helpful? I assume many of the newly included molecules are unusual in some way (as exemplified by the new tokens added to tokens.txt and by molecules like CHEBI:33481 - diphosphooctadecatungstate(6−)). This could either broaden our coverage or confuse the model.
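
To put the counts above in proportion, here is a quick calculation using only the three numbers reported for v246 (the variable names are mine):

```python
# Counts reported above for ChEBI v246.
total = 190_634        # molecules considered
from_sdf = 190_633     # training instances generated from the SDF file
from_smiles = 178_523  # training instances generated from the SMILES strings

gained = from_sdf - from_smiles
coverage_sdf = from_sdf / total
coverage_smiles = from_smiles / total

print(f"SDF coverage:    {coverage_sdf:.4%}")
print(f"SMILES coverage: {coverage_smiles:.2%}")
print(f"Gained via SDF:  {gained} molecules ({gained / total:.2%} of the total)")
```

So the SDF route adds 12,110 molecules, roughly 6% of the total, and these are the molecules the open question is about.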

@sfluegel05
Collaborator Author

@aditya0by0 Could you please check how this affects the GNNs? Since they use the same dataset, I assume the readers have to be adapted to accept mol objects instead of SMILES strings.
Also, could you run a comparison between two ELECTRA models, one trained on the old dataset (i.e., canonicalised SMILES based on the ChEBI SMILES) and one trained on the new dataset (i.e., canonicalised SMILES based on the ChEBI SDF file)?
