Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An "accurate and reliable" RDKitConverter #3044

Merged
merged 55 commits into from
Apr 2, 2022

Conversation

cbouy
Copy link
Member

@cbouy cbouy commented Nov 25, 2020

Hi @IAlibay @orbeckst @richardjgowers

I made a few changes to the RDKitConverter so that it more accurately infers bond orders and charges in small molecules.
I tested this new version on the ChEMBL27 dataset and it managed to retrieve the original molecule or a resonance structure in 99.1% of the 1.7M compounds (the GSoC version was around 90%).
I can share the notebook that I used to run this ChEMBL benchmark, either in one of you other repos or I create a new one, as you prefer.

Changes made in this Pull Request:

  • Fixes monatomic ion charges that were always negative
  • Fixes more cases of conjugated systems
  • Fixes tetravalent phosphorus
  • Fixes some edge cases with sulfur and nitrogen
  • Modifies the bond inferring tests to try matching with resonance structures
  • Adds tests for ions, all standard amino-acids and all standard nucleic-acids
  • Adds tests on molecules encountered in the ChEMBL 27 dataset
  • Fixes atoms in the RDKit molecule that weren't always in the same order as in the atomgroup
  • Fixes atom name formatting when writing a PDB file with RDKit
  • Fixes sanitization errors that occurred when using NoImplicit=False
  • The names property is now available through atom.GetProp("_MDAnalysis_name"), while atom.GetMonomerInfo().GetName() returns the PDB-formatted version (with spaces).

All of these changes do not take "efficiency" in consideration, so it probably slows down the code a bit, but I think they are all necessary. If you identify things that could be optimized please let me know.

Their are additional changes that I could make that on one hand would slow down the code even more but on the other hand make it even more reliable. See #3339 for examples of failing molecules.

I could fix them by adding more SMIRKS/reaction SMARTS, but considering the fact that these patterns are rare and would slow down the conversion I don't think they are worth adding to the code.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks
Copy link

pep8speaks commented Nov 25, 2020

Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 81:80: E501 line too long (86 > 79 characters)
Line 278:80: E501 line too long (129 > 79 characters)
Line 710:80: E501 line too long (100 > 79 characters)
Line 712:80: E501 line too long (100 > 79 characters)
Line 714:80: E501 line too long (100 > 79 characters)
Line 716:80: E501 line too long (100 > 79 characters)
Line 718:80: E501 line too long (100 > 79 characters)
Line 720:80: E501 line too long (100 > 79 characters)
Line 724:80: E501 line too long (100 > 79 characters)
Line 728:80: E501 line too long (100 > 79 characters)
Line 730:80: E501 line too long (100 > 79 characters)

Line 441:80: E501 line too long (80 > 79 characters)
Line 466:80: E501 line too long (88 > 79 characters)
Line 506:80: E501 line too long (88 > 79 characters)
Line 627:80: E501 line too long (81 > 79 characters)
Line 628:80: E501 line too long (80 > 79 characters)
Line 630:80: E501 line too long (84 > 79 characters)

Comment last updated at 2022-03-30 21:49:43 UTC

@codecov
Copy link

codecov bot commented Nov 25, 2020

Codecov Report

Merging #3044 (f019d28) into develop (bf95f9c) will increase coverage by 0.01%.
The diff coverage is 97.79%.

@@             Coverage Diff             @@
##           develop    #3044      +/-   ##
===========================================
+ Coverage    94.20%   94.21%   +0.01%     
===========================================
  Files          190      190              
  Lines        24668    24725      +57     
  Branches      3313     3321       +8     
===========================================
+ Hits         23238    23295      +57     
+ Misses        1384     1382       -2     
- Partials        46       48       +2     
Impacted Files Coverage Δ
package/MDAnalysis/converters/RDKit.py 98.42% <97.72%> (+0.35%) ⬆️
package/MDAnalysis/core/selection.py 98.80% <100.00%> (ø)
package/MDAnalysis/topology/guessers.py 99.21% <100.00%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf95f9c...f019d28. Read the comment docs.

@richardjgowers
Copy link
Member

Hey @cbouy I think we're more interested in correctness rather than speed. Speed is something we can tweak on later once we've established what the code should do.

@IAlibay IAlibay added this to the 2.0 milestone Mar 14, 2021
Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cbouy, this looks awesome. Could you please add some tests for the lines that codecov points out, and address the comment by the PEP8 bot?

package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved

for atom in sorted(mol.GetAtoms(), reverse=True,
key=lambda a: _get_nb_unpaired_electrons(a)[0]):
for atom in atoms:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add tests for case where an atom degree is 0?

Comment on lines 566 to 568
nue = _get_nb_unpaired_electrons(atom)
if any([n == 0 for n in nue]):
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test for this case where common_nue != 0?


# if atom valence is still not filled
nue = _get_nb_unpaired_electrons(atom)
if not any([n == 0 for n in nue]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this please?

nue = _get_nb_unpaired_electrons(atom)
if not any([n == 0 for n in nue]):
# transform nue to charge
atom.SetFormalCharge(-nue[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this please?

package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
elif n_matches % 2 == 0:
end_pattern = Chem.MolFromSmarts("[*-;!O]-[*+0]=[*+0]-[*-]")
else:
elif n_matches == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this please?

end_pattern = Chem.MolFromSmarts(
"[*-;!O]-[*+0]=[*+0]-[$([#7;X3;v3]),$([#6+0]=O)]")
end_pattern = odd_end_pattern
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this please?

package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
@cbouy
Copy link
Member Author

cbouy commented Mar 17, 2022

Okay I should be done with this one now!

I realized that there was a better way to test if the mol matches one of the resonance structure, so I implemented this change.

I added the XFAIL tests as @orbeckst suggested

Thanks again for reviewing this!!

@cbouy
Copy link
Member Author

cbouy commented Mar 17, 2022

I'm also preparing the benchmark on ChEMBL for the converter, should I create a new RDKitConverter-benchmark repository inside MDAnalysis then?

@orbeckst
Copy link
Member

Yes, create the repo under the MDAnalysis org. (I think you have permission to do so but if not say something.)

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very impressive. I could only comment on documentation aspects and one question about cations.

package/MDAnalysis/converters/RDKit.py Outdated Show resolved Hide resolved
package/CHANGELOG Show resolved Hide resolved
package/MDAnalysis/converters/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/converters/RDKit.py Outdated Show resolved Hide resolved
Comment on lines 308 to 309
obj : :class:`~MDAnalysis.core.groups.AtomGroup` or
:class:`~MDAnalysis.core.universe.Universe`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might work if you just say

obj : `AtomGroup` or `Universe`

Try it and see if it links correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The back-tic notation did not automagically generate links so your previous solution was better. I'd revert to it.

@lilyminium
Copy link
Member

Aside from @orbeckst's comments the code looks great -- thank you for putting up with the super slow reviews! Agreed with Oliver, the guessing code is very impressive and it'll be a fantastic solution for so many people trying to create molecular graphs out of lossy input like xyz files.

@orbeckst
Copy link
Member

@cbouy I don't know how busy you are at the moment but if you could have a look at my minor comments then we could merge this big PR really soon™.

@cbouy
Copy link
Member Author

cbouy commented Mar 24, 2022

@orbeckst I'm in the middle of moving out but I should be able to find some time after Monday!
Thanks again to both of you!

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for answering my questions — looks good.

(There's one change that I would revert so that links are generated in the docs but that's not blocking.)

Great contribution!!

@orbeckst
Copy link
Member

@lilyminium or @IAlibay can you please be in charge of eventually merging? Thanks!

@cbouy
Copy link
Member Author

cbouy commented Mar 30, 2022

Reverted the docstring thingy (and fixed a typo)!

Thanks a ton @orbeckst @lilyminium and @IAlibay for your help!!

supernatural

@orbeckst
Copy link
Member

orbeckst commented Apr 2, 2022

@lilyminium @IAlibay is something holding up the merge? I just don't know, otherwise I'd press the shiny green button myself.

(Sorry to be onerous, I just want to see this code in before anything else happened.)

@IAlibay
Copy link
Member

IAlibay commented Apr 2, 2022

Sorry just really busy atm, I'll have a quick glance through over the weekend and then merge.

@lilyminium
Copy link
Member

No -- just had another quick glance and everything still LGTM! @IAlibay did you still want to have a look? Otherwise I can merge.

@IAlibay
Copy link
Member

IAlibay commented Apr 2, 2022

No -- just had another quick glance and everything still LGTM! @IAlibay did you still want to have a look? Otherwise I can merge.

Go for it!

@lilyminium lilyminium merged commit 009c467 into MDAnalysis:develop Apr 2, 2022
@orbeckst
Copy link
Member

orbeckst commented Apr 2, 2022

Hooray!!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants