Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RDKit Descriptors and fingerprints wrapper #2912

Open
wants to merge 14 commits into
base: develop
Choose a base branch
from

Conversation

cbouy
Copy link
Member

@cbouy cbouy commented Aug 17, 2020

Part of the fixes for #2468
Depends on #2775

Changes made in this Pull Request:

  • RDKit descriptors can be calculated from the RDKitDescriptors class, as a subclass of AnalysisBase
  • Fingerprints are available through a function

Quick example
image

I'm not sure if a dict for the results is ideal, as you can see passing several lambda functions will be problematic so maybe a list of (function_name, value) tuples is better. Or we just don't care about lambdas ?

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Ping @fiona-naughton @IAlibay @richardjgowers

@pep8speaks
Copy link

pep8speaks commented Aug 17, 2020

Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 25:80: E501 line too long (83 > 79 characters)
Line 29:12: W291 trailing whitespace
Line 30:57: W291 trailing whitespace
Line 60:80: E501 line too long (80 > 79 characters)
Line 131:80: E501 line too long (92 > 79 characters)
Line 133:80: E501 line too long (88 > 79 characters)
Line 134:80: E501 line too long (99 > 79 characters)
Line 135:80: E501 line too long (112 > 79 characters)
Line 138:80: E501 line too long (81 > 79 characters)
Line 214:70: W291 trailing whitespace
Line 223:68: W291 trailing whitespace
Line 225:1: W293 blank line contains whitespace

Line 123:1: E302 expected 2 blank lines, found 1
Line 141:80: E501 line too long (83 > 79 characters)
Line 144:80: E501 line too long (93 > 79 characters)
Line 148:80: E501 line too long (87 > 79 characters)
Line 171:80: E501 line too long (81 > 79 characters)
Line 182:80: E501 line too long (80 > 79 characters)
Line 189:80: E501 line too long (87 > 79 characters)
Line 190:80: E501 line too long (92 > 79 characters)

Comment last updated at 2021-04-28 18:40:44 UTC

@codecov
Copy link

codecov bot commented Aug 17, 2020

Codecov Report

Merging #2912 (37755d7) into develop (a23b1e9) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2912      +/-   ##
===========================================
+ Coverage    92.83%   92.85%   +0.02%     
===========================================
  Files          170      171       +1     
  Lines        22809    22882      +73     
  Branches      3242     3260      +18     
===========================================
+ Hits         21174    21247      +73     
  Misses        1587     1587              
  Partials        48       48              
Impacted Files Coverage Δ
package/MDAnalysis/analysis/__init__.py 100.00% <ø> (ø)
package/MDAnalysis/analysis/RDKit.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a23b1e9...37755d7. Read the comment docs.

@cbouy
Copy link
Member Author

cbouy commented Aug 18, 2020

I changed the results attribute to store an array of dict for each frame, and I added a static method to list available descriptors

@cbouy
Copy link
Member Author

cbouy commented Aug 20, 2020

Regarding the output given to users, I'm still scratching my head over 2 things:

For fingerprints, apart from hashed fp which have a predefined number of bits in length, and the MACCSKeys which is 167 bits long, it's usually a bad idea to convert them to an array as it will likely crash or hang forever because of memory consumption.
RDKit typically stores them as a sparse vector which only knows which bit is activated and how many times.
I'm wondering if a more appropriate output format could be a dict with only the bits that were set ?

For descriptors, maybe I can simply output an array of descriptor values instead of a dict or list of tuples with descriptors names ? The calculation will raise an error if a descriptor is not found and the descriptors are calculated in the order given by the user (if I change self._functions to a list instead of a dict). This way there's no risk with using lambdas or accidentally naming your function the same way as an RDKit descriptor.

@IAlibay
Copy link
Member

IAlibay commented Aug 20, 2020

Regarding the output given to users, I'm still scratching my head over 2 things:

For fingerprints, apart from hashed fp which have a predefined number of bits in length, and the MACCSKeys which is 167 bits long, it's usually a bad idea to convert them to an array as it will likely crash or hang forever because of memory consumption.
RDKit typically stores them as a sparse vector which only knows which bit is activated and how many times.
I'm wondering if a more appropriate output format could be a dict with only the bits that were set ?

For descriptors, maybe I can simply output an array of descriptor values instead of a dict or list of tuples with descriptors names ? The calculation will raise an error if a descriptor is not found and the descriptors are calculated in the order given by the user (if I change self._functions to a list instead of a dict). This way there's no risk with using lambdas or accidentally naming your function the same way as an RDKit descriptor.

Maybe for simplicity's sake, we could just return fingerprints and descriptors in the exact same way RDKIT does (i.e. just return the object) for each frame (as a list)?

@cbouy
Copy link
Member Author

cbouy commented Aug 20, 2020

I think the point of the wrapper is to make things easy for people who are not necessarily familiar with RDKit, and the fingerprint object can be a bit intimidating and depends on the type of fingerprint requested (some will output an ExplicitBitVect object, some different subtypes of SparseIntVect which behaves differently) so that's why I was proposing an optional general purpose output (dict or array), but if users want the RDKit object they can still have it.
Also I don't think any of the fingerprints I list here use 3D information so there's only one fp per atomgroup in the output (that's why it's in a function and not in an AnalysisBase object).

For descriptors, yeah it's basically just the value for each frame so I guess I'll simply make a 2D array out of that.

Copy link
Member

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just read through some of the docstrings for copy editing.

package/MDAnalysis/analysis/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/RDKit.py Outdated Show resolved Hide resolved
@IAlibay
Copy link
Member

IAlibay commented Aug 22, 2020

I think the point of the wrapper is to make things easy for people who are not necessarily familiar with RDKit, and the fingerprint object can be a bit intimidating and depends on the type of fingerprint requested (some will output an ExplicitBitVect object, some different subtypes of SparseIntVect which behaves differently) so that's why I was proposing an optional general purpose output (dict or array), but if users want the RDKit object they can still have it.

I'm not sure, although a general purpose would be more "user friendly", at the end of the day I feel like users should be sufficiently versed in RDKit if they are looking to be using these options. To me, offering an object conversion layer not only leads to more work, but eventually more confusion for users. I'll ping @fiona-naughton and @richardjgowers though, who might have different views here.

Also I don't think any of the fingerprints I list here use 3D information so there's only one fp per atomgroup in the output (that's why it's in a function and not in an AnalysisBase object).

Yep, that makes sense.

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks really good, got a few comments & suggestions.

The main point of discussion I think may need to be had, is if user friendliness is worth the extra complexity.

package/MDAnalysis/analysis/RDKit.py Show resolved Hide resolved
package/MDAnalysis/analysis/RDKit.py Outdated Show resolved Hide resolved

Notes
-----
To generate a Morgan fingerprint, don't forget to specify the radius::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe here we should link to the API docs for each of the fingerprints supported? We also should have a usage example for each case too (it adds to the docs but then users are clear on what should work).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe just the one like to a page that lists all the individual docs for the fingerprints. Re: usage, if the usage is all similar/the same then we can just say that in the docs.


get_fingerprint(ag, 'Morgan', radius=2)

``dtype="array"`` is not recommended for non-hashed fingerprints
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, I personally think we should try to avoid these cases completely, but I'll let one of the other @MDAnalysis/gsoc-mentors make a judgement call here too.

testsuite/MDAnalysisTests/analysis/test_rdkit.py Outdated Show resolved Hide resolved

Example
-------
Here's an example with a custom function for a trajectory with 3 frames::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this example, however in the final version we will need something a little bit more descriptive, probably detailing what's in _RDKIT_DESCRIPTORS, any edge cases (are there any cases of descriptors that can be given optional parameters?), etc...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I said in list_available docstring, the list of descriptors is by no means curated, some of the things here aren't even descriptors but helper functions that are inside the module.
Some of these functions take parameters like includeHs or onlyHeavy, but most of them don't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer to the other @MDAnalysis/gsoc-mentors here to pitch in on what they think might be a good compromise here.

package/MDAnalysis/analysis/RDKit.py Outdated Show resolved Hide resolved
self.results[self._frame_index][i] = func(mol)

@staticmethod
def list_available(flat=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs documenting in the class docstring, so that users are aware of it's existence (including an example use).

I do wonder if it might be simpler (at the risk of making things harder for users), to not offer this list but rather just have the option of passing an RDKit function to this analysis method. Then you can just have this blank-ish AnalysisBase method that essentially is a pure trajectory analysis wrapper around existing RDKit function (if this makes any sense)? Again one of those things worth discussing further.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it's definitely simpler for us but harder for users

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Porque no los dos, perhaps it could be a separate more flexible class and this can be the easy-peasy one.

package/MDAnalysis/analysis/__init__.py Show resolved Hide resolved
@cbouy cbouy marked this pull request as ready for review August 27, 2020 12:15
@IAlibay IAlibay added this to the 2.0 milestone Apr 6, 2021
@IAlibay
Copy link
Member

IAlibay commented Apr 22, 2021

@cbouy if you can update this against develop I'll put this next up on my review list.

@cbouy
Copy link
Member Author

cbouy commented Apr 29, 2021

@IAlibay just updated, looks like the error is related to #2958 "as usual" 😅

@orbeckst
Copy link
Member

@IAlibay this is still listed for 2.0 — is this realistic and essential?

@cbouy how much needs to be done here to complete it?

@IAlibay
Copy link
Member

IAlibay commented Jun 1, 2021

@IAlibay this is still listed for 2.0 — is this realistic and essential?

Essential - probably not. I added in all the RDKits because it would be really great to get them out there. I'll let @cbouy speak on how realistic this is (for this and the other opened RDKit PRs).

If it helps, I'd be happy with a relatively rapid 2.1.0 release that adds new RDKit features.

@cbouy
Copy link
Member Author

cbouy commented Jun 1, 2021

I think this is the RDKit PR with the lowest priority and I don't mind if it's not in 2.0
In terms of remaining work it's mostly an API question for the rdkit fingerprint wrapper: do we bother keeping options to convert the rdkit fingerprints to more "traditional" formats (dict and numpy array) or do we only return the rdkit object ? There are different formats for the rdkit object depending on the fingerprint (explicit bit, sparse int, and others i think) that's why I implemented the conversion to dict and array in the first place to make it simpler to use.

@IAlibay
Copy link
Member

IAlibay commented Jun 1, 2021

I think this is the RDKit PR with the lowest priority and I don't mind if it's not in 2.0
In terms of remaining work it's mostly an API question for the rdkit fingerprint wrapper: do we bother keeping options to convert the rdkit fingerprints to more "traditional" formats (dict and numpy array) or do we only return the rdkit object ? There are different formats for the rdkit object depending on the fingerprint (explicit bit, sparse int, and others i think) that's why I implemented the conversion to dict and array in the first place to make it simpler to use.

@cbouy for the PRs that are opened, which ones (if any) are critical for 2.0?

@cbouy
Copy link
Member Author

cbouy commented Jun 1, 2021

Critical PRs would be #3044 and #3324
#3325 would be nice to have in 2.0 as well
#2912 (this one) can come later
and finally #2900 which is very optional (and possibly abandoned?)

@IAlibay IAlibay modified the milestones: 2.0, 2.1.0 Aug 17, 2021
@IAlibay IAlibay modified the milestones: 2.1.0, 3.0 Jun 2, 2022
@hmacdope hmacdope added the stale label Nov 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GSoC GSoC project stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants