-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct atom configuration prior fingerprint calculation #94
Comments
@bachi55 this is a great point and it's a problem we are encountering beyond the fingerprints (look at e.g. #73 and the massfuncs branch where we are trying to capture this for the mass calculations). Historically in RMassBank we patched this by doing the typing etc our side in wrapper functions to ensure consistency. What concerns me is that they are also order dependent (change the order of the commands and you get different results) and this is should not happen (or if it does, then users need guidance what order is "right"). We've had conversation (email) with @rajarshi about updating documentation to roxygen anyway and not only to they need roxygenising but also a revamp to deliver additional clarity. If we start to sum up the time I'm investing into clarifying these aspects to colleagues via email, it'd start adding up to the time it would take to upgrade the docs and the unit tests to ensure we're doing it right. Just need this magic thing called time ;-) @rajarshi any idea how best to coordinate? I am hiring and hope to soon have either own capacity or team member capacity to potentially contribute to a joint effort as it's impacting what we need to achieve. My logical start point would be on the masses but we're still stuck with the NPEs ... @bachi55 my approach for the masses was to build unit tests for all the "edge cases" I would expect and I would use those also to build the documentation to clarify (in examples) what to do and what not to do ... once we have them as unit tests we can then run them and interact with the CDK guys to check rcdk is doing what it should ... |
Hei Emma, thanks for your comments. It's "good" to hear, that the problem is somehow known. I am at the moment using your approach to wrap rcdk, when calculating the fingerprints to have at least consistent computations. And, I will update my code so that it can test for some "edge cases". Do you have a list of those for fingerprints as well? Actually, I think it would be good, if the CDK framework would directly take care about the correct configurations, depending on the fingerprint used. Best regards, Eric |
I haven’t looked into this for fingerprints, but was answering some queries about them only a few days ago ;-) and some of the edge cases I pulled out for the mass functions may apply – I’d say you need cases that cover aromaticity, explicit/implicit Hs, tautomers and probably also isotopes and I have examples that cover several bits of that in massfuncs:
https://github.com/CDK-R/cdkr/blob/massfuncs/rcdk/inst/unitTests/runit.mass.R
Make sure you also capture some cases that should distinguish the different features of the fingerprints .. @rajarshi may have better ideas there …
|
So, yes, this is a pain, but also a bit surprising. I had thought that the SMARTS parser would have performed the necessary configuration, but apparently not.
The last equality check will fail. A quick test suggests that the sequence
will appropriately configure the molecule for all fingerprints. AFAICT, "over configuration" is not an issue. So applying the above sequence should not create an issue But not that this doesn't change implicit
I agree with both of you that updating the docs would go far to clarifying usage and best practice. I had started on I would agree that if a fingerprinter requires config, it could do the configuration but at the cost of redoing something that has already been done (but the latest updates may avoid the performance hit by skipping config if already done). Probably a good idea to check on FWIW, the CDKDescUI tool explicitly performs all configuration when computing fingerprints. |
@rajarshi thank you for your comments! I played around with your example an I think for PubChem fingerprints one still needs to apply In your example I believe m1 <- parse.smiles("c1ccccc1C")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("[CH]1=[CH][CH]=[CH][CH]=C1C([H])([H])([H])")[[1]] # m2 will have no atoms aromatic
do.aromaticity(m1)
do.typing(m1)
do.isotopes(m1)
do.aromaticity(m2)
do.typing(m2)
do.isotopes(m2)
fp1 <- get.fingerprint(m1, type='pubchem')
fp2 <- get.fingerprint(m2, type='pubchem')
print(paste("m1: >= 4 H:", 1 %in% fp1@bits)) # PubChem bits are zero based but R not
print(paste("m2: >= 4 H:", 1 %in% fp2@bits))
print(paste("m1: >= 8 H:", 2 %in% fp1@bits))
print(paste("m2: >= 8 H:", 2 %in% fp2@bits)) Output: [1] "m1: >= 4 H: FALSE"
[1] "m2: >= 4 H: FALSE"
[1] "m1: >= 8 H: FALSE"
[1] "m2: >= 8 H: FALSE" If I add m1 <- parse.smiles("c1ccccc1C")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("[CH]1=[CH][CH]=[CH][CH]=C1C([H])([H])([H])")[[1]] # m2 will have no atoms aromatic
do.aromaticity(m1)
do.typing(m1)
do.isotopes(m1)
convert.implicit.to.explicit(m1)
do.aromaticity(m2)
do.typing(m2)
do.isotopes(m2)
convert.implicit.to.explicit(m2)
fp1 <- get.fingerprint(m1, type='pubchem')
fp2 <- get.fingerprint(m2, type='pubchem')
print(paste("m1: >= 4 H:", 1 %in% fp1@bits)) # PubChem bits are zero based but R not
print(paste("m2: >= 4 H:", 1 %in% fp2@bits))
print(paste("m1: >= 8 H:", 2 %in% fp1@bits))
print(paste("m2: >= 8 H:", 2 %in% fp2@bits)) We get the "desired" output: [1] "m1: >= 4 H: TRUE"
[1] "m2: >= 4 H: TRUE"
[1] "m1: >= 8 H: TRUE"
[1] "m2: >= 8 H: TRUE" At the moment I use the CDK unittests and the source code as information-source to decide about the correct configuration. However, as you said, if over-configuration is not an issue, it is more "save" to simply apply it to all molecules by default. Best regards, Eric |
Hello,
I was wondering about the correct usage of
do.typing()
,do.aromaticity()
anddo.isotopes()
before calculating the fingerprints of a molecules parsed from its SMILES usingparse.smiles()
.Let me go through a few examples.
MACCS
The documentation of
get.fingerprints
contains:There is no information, that the aromaticity might need to be perceived first as otherwise some SMARTS are not properly matched (?). In the CDK tests for the MACCSFingerprinter at least there is atom-typing and aromaticity-detection done.
Pubchem
For this fingerprinter the CDK tests indicate that implicit hydrogens should be converted to explicit ones, i.e.
convert.implicit.to.explicit
.Klekota and Roth
Here the CDK tests indicate, that no "additional" function, e.g. typing or aromaticity detection, needs to be applied.
So I could continue the list of different fingerprinters, which seem to expect different "modifications" done to the parsed molecular structure. Some fingerprinters seems to take care about it internally (within the class).
I would like to know, how others are dealing with this issue? Which transformations you perform, when you calculate different fingerprints (I could believe that problem continues for descriptors as well)? Am I overthinking the problem here? Can a "to much modifications" (I called
do.typing
if it is not needed, e.g.) harm my calculation, i.e. wrong fingerprints?I believe the problem should somehow be solved in CDK, but how can we maybe in the rcdk side make the documentation more precise?
Best regards,
Eric
The text was updated successfully, but these errors were encountered: