Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

differences in processing V2000/V3000 tag data #971

Open
adalke opened this issue Apr 21, 2023 · 2 comments
Open

differences in processing V2000/V3000 tag data #971

adalke opened this issue Apr 21, 2023 · 2 comments

Comments

@adalke
Copy link

adalke commented Apr 21, 2023

I'm finishing up support for using CDK to read V3000 SDF records in chemfp, and found a problem during testing.

Everything is fine when I use IteratingSDFReader to read a file. I am able to get the molecule and its associated tag data with both V2000 and V3000 records.

I have a problem when reading a single record, where the cdk.io.MDLV2000Reader().read() parses the tag data while the cdk.io.MDLV3000Reader().read() does not.

When I parse a single SDF record, I figure out if it's V2000 or V3000 and dispatch to the appropriate MDL Reader. This replicates what IteratingSDFReader does (see #952). I decided to use this approach as I find it gives better error reporting than IteratingSDFReader - the latter catches and logs errors, because it's designed to skip records it can't handle rather than raise an exception (see #951)

I discovered that while MDLV2000Reader parses the tag data, MDLV3000Reader does not. More specifically, MDLV2000Reader.readAtomContainer has readNonStructuralData() to read the tag data:

            // read potential SD file data between M  END and $$$$
            readNonStructuralData(input, outputContainer);

and that in turn sets container.setProperty(header, data.toString()).

On the other hand, MDLV3000Reader does not parse the tag data, which you can verify by failing to find the relevant setProperty() call.

This is not an issue when using IteratingSDFReader because the iterating reader will only have the MDL reader process a record up to the M_END (ie, MDLV2000Reader will never call readNonStructuralData), then the iterating reader will parse the tag data itself, using readDataBlockInto():

                if (currentLine.startsWith(M_END)) {
                     ...
                        molecule = reader.read(builder.newAtomContainer());
                      ...
                    if (molecule != null) {
                        readDataBlockInto(molecule);

(Also, it uses a different algorithm; MDLV2000Reader requires the SDF end with "$$$$" exactly while IteratingSDFReader says the line only needs to start with "$$$$".)

My work-around is to parse the tag data myself, if it's a V3000 record, and set the corresponding properties.

@johnmay
Copy link
Member

johnmay commented Apr 21, 2023

Yes everything's out of sync, If you have a patch I can review apply otherwise you'll have to wait for my plan to rewrite both readers on a common base. At the moment I'm too busy with NextMove bits.

@adalke
Copy link
Author

adalke commented Apr 21, 2023

I don't have a patch. I don't have the Java skills, nor the ability to decide which patch is most appropriate for the project, especially given the known long-term plans for a rewrite.

I have a workaround that I'm comfortable with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants