Update to ChEBI 2.0 & SDF data #147

sfluegel05 · 2026-01-28T17:38:40Z

Starting with version 245, ChEBI has changed (a) where files are located and (b) the annotation property for SMILES (see #123).

This PR makes it possible to load newer ChEBI versions. Also, I removed some outdated files and introduced SDF data to the pipeline.

Why SDF files?

Because this is what ChEBI does internally. Molecules are stored in SDF format, and the SMILES strings are generated from the SDF files. Relying on the SMILES is a problem for us because RDKit cannot parse many of them, whereas it can parse the SDF records. ChEBI generates SMILES with a specialised pipeline (https://github.com/chembl/libRDChEBI/blob/main/libRDChEBI/formats.py) that is RDKit-based but apparently changes the mol object enough that the molecule cannot be trivially reconstructed from the SMILES string.

If we use the same source (SDF) and the same pipeline (the chembl_structure_pipeline), we can maximise the number of molecules available for training.
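For anyone unfamiliar with the format, here is a minimal, dependency-free sketch of how an SDF file is laid out: each record is a molblock ending in `M  END`, followed by `> <FIELD>` data items, and records are separated by `$$$$` lines. The field names (`ChEBI ID`, `SMILES`), the placeholder IDs and the helper names are assumptions for illustration; this is not the loader used in this PR.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Data header lines look like "> <FIELD NAME>"; capture the name between < and >.
FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def split_record(record: List[str]) -> Tuple[str, Dict[str, str]]:
    """Split one SDF record into its molblock and its data fields."""
    # The molblock ends at the "M  END" line; everything after it is data.
    end = next(
        (i for i, line in enumerate(record) if line.startswith("M  END")),
        len(record) - 1,
    )
    molblock = "\n".join(record[: end + 1])
    fields: Dict[str, List[str]] = {}
    current = None
    for line in record[end + 1:]:
        header = FIELD_HEADER.match(line)
        if header:
            current = header.group(1)
            fields[current] = []
        elif not line.strip():
            current = None  # a blank line closes the current field
        elif current is not None:
            fields[current].append(line)
    return molblock, {name: "\n".join(value) for name, value in fields.items()}


def iter_sdf_records(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (molblock, fields) for every record terminated by a '$$$$' line."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip() == "$$$$":
            if record:
                yield split_record(record)
            record = []
        else:
            record.append(line)


def make_record(chebi_id: str, smiles: str, symbol: str) -> List[str]:
    """Build a one-atom V2000 record (hypothetical sample data)."""
    return [
        "",          # header line 1: molecule name
        "  sketch",  # header line 2: program / timestamp
        "",          # header line 3: comment
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        f"    0.0000    0.0000    0.0000 {symbol:<3} 0  0  0  0  0  0  0  0  0  0  0  0",
        "M  END",
        "> <ChEBI ID>",
        chebi_id,
        "",
        "> <SMILES>",
        smiles,
        "",
        "$$$$",
    ]


# Placeholder IDs, not real ChEBI entries.
sample_lines = make_record("CHEBI:EXAMPLE1", "C", "C") + make_record("CHEBI:EXAMPLE2", "O", "O")
records = list(iter_sdf_records(sample_lines))
for molblock, fields in records:
    print(fields["ChEBI ID"], fields["SMILES"], len(molblock.splitlines()), "molblock lines")
```

With RDKit installed, each molblock could then be handed to `Chem.MolFromMolBlock`, so the molecule is built from the same source ChEBI uses rather than from the derived SMILES.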

What changes with SDF files?

A new raw file (chebi.sdf) is downloaded
The data.pkl contains a new column, mol (containing RDKit mol objects)
The ChemDataReader now accepts both SMILES strings and mol objects
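To illustrate the idea of a reader that takes either input type, here is a rough sketch. It is not the actual ChemDataReader code: `entry_to_smiles`, `FakeMol` and `fake_mol_to_smiles` are hypothetical names, and the converter is injected so the example runs without RDKit. In the real pipeline the converter would presumably be something like RDKit's `Chem.MolToSmiles`.

```python
from typing import Any, Callable, Optional


def entry_to_smiles(entry: Any, mol_to_smiles: Callable[[Any], str]) -> Optional[str]:
    """Return a SMILES string for a str or mol-like entry, or None if unusable."""
    if entry is None:
        return None
    if isinstance(entry, str):
        # Already a SMILES string: pass it through, treating blanks as missing.
        return entry.strip() or None
    try:
        # Anything else is assumed to be a mol object the converter understands.
        return mol_to_smiles(entry)
    except Exception:
        # A molecule that cannot be converted is skipped instead of aborting.
        return None


class FakeMol:
    """Stand-in for an RDKit mol object (assumption, for demonstration only)."""

    def __init__(self, smiles: str) -> None:
        self.smiles = smiles


def fake_mol_to_smiles(mol: FakeMol) -> str:
    """Stand-in converter; raises AttributeError for non-FakeMol inputs."""
    return mol.smiles


inputs = ["CCO", FakeMol("c1ccccc1"), None, "   ", object()]
results = [entry_to_smiles(entry, fake_mol_to_smiles) for entry in inputs]
print(results)
```

Keeping the string path unchanged means datasets that only provide SMILES would continue to work, while rows with a mol column can bypass SMILES parsing.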

Result

I tested this on ChEBI v246. I was able to generate training instances for 190,633 molecules (out of 190,634). Parsing the SMILES strings instead gave only 178,523 instances.
Open question: Is this actually helpful? I assume many of the new molecules are unusual in some way (as exemplified by the new tokens added to tokens.txt and by molecules like CHEBI:33481 - diphosphooctadecatungstate(6−)). This could either broaden our coverage or confuse the model.
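To put the counts reported above into proportion, a quick calculation (only the three numbers from the v246 test are used; the variable names are illustrative):

```python
# Counts reported for ChEBI v246 in this PR.
total = 190_634
from_sdf = 190_633
from_smiles = 178_523

gained = from_sdf - from_smiles          # molecules only usable via SDF
sdf_coverage = from_sdf / total
smiles_coverage = from_smiles / total

print(f"SDF:    {from_sdf:,} / {total:,} = {sdf_coverage:.4%}")
print(f"SMILES: {from_smiles:,} / {total:,} = {smiles_coverage:.4%}")
print(f"Gained: {gained:,} molecules ({gained / total:.2%} of all entries)")
```

So the SDF route adds 12,110 molecules, roughly 6.4% of the dataset, which is the group the open question is about.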

sfluegel05 · 2026-01-28T17:42:36Z

@aditya0by0 Could you please check how this affects the GNNs? Since they use the same dataset, I assume the readers have to be adapted to accept mol objects instead of SMILES strings.
Also, could you run a comparison between two ELECTRA models, one trained on the old dataset (i.e., canonicalised SMILES based on the ChEBI SMILES) and one trained on the new dataset (i.e., canonicalised SMILES based on the ChEBI SDF file)?

sfluegel05 added 4 commits January 28, 2026 11:11

remove outdated JCI files

ee6b994

get molecule data from SDF file

1c12972

add new tokens

8966533

add chembl dependency

a0e74dd

update tests for SDF files

e713104
