RDKit Descriptors and fingerprints wrapper #2912

cbouy · 2020-08-17T18:03:23Z

Part of the fixes for #2468
Depends on #2775

Changes made in this Pull Request:

RDKit descriptors can be calculated with the RDKitDescriptors class, a subclass of AnalysisBase
Fingerprints are available through a function

Quick example

I'm not sure a dict is ideal for the results: as you can see, passing several lambda functions is problematic, so maybe a list of (function_name, value) tuples would be better. Or do we just not care about lambdas?
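To make the lambda concern concrete, here is a minimal sketch in plain Python. It assumes results are keyed by each function's `__name__`, which is a guess at how the names are derived and not a quote from the PR; `results_as_dict` and `results_as_pairs` are hypothetical helpers written only for this illustration.

```python
def results_as_dict(functions, value):
    """Key each result by the function's __name__ (collides for lambdas)."""
    return {func.__name__: func(value) for func in functions}


def results_as_pairs(functions, value):
    """Keep one (name, result) tuple per function, in call order."""
    return [(func.__name__, func(value)) for func in functions]


# Two different anonymous functions share the same name, "<lambda>".
funcs = [lambda x: x + 1, lambda x: x * 10]

as_dict = results_as_dict(funcs, 2)
as_pairs = results_as_pairs(funcs, 2)

print(as_dict)   # one key only: the second lambda overwrote the first
print(as_pairs)  # both values are kept, though the names are identical
```

The list of tuples keeps every value but still cannot tell the two lambdas apart by name, which is the trade-off raised above.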

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

Ping @fiona-naughton @IAlibay @richardjgowers

pep8speaks · 2020-08-17T18:03:27Z

Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file package/MDAnalysis/analysis/RDKit.py:

Line 25:80: E501 line too long (83 > 79 characters)
Line 29:12: W291 trailing whitespace
Line 30:57: W291 trailing whitespace
Line 60:80: E501 line too long (80 > 79 characters)
Line 131:80: E501 line too long (92 > 79 characters)
Line 133:80: E501 line too long (88 > 79 characters)
Line 134:80: E501 line too long (99 > 79 characters)
Line 135:80: E501 line too long (112 > 79 characters)
Line 138:80: E501 line too long (81 > 79 characters)
Line 214:70: W291 trailing whitespace
Line 223:68: W291 trailing whitespace
Line 225:1: W293 blank line contains whitespace

In the file testsuite/MDAnalysisTests/analysis/test_rdkit.py:

Line 123:1: E302 expected 2 blank lines, found 1
Line 141:80: E501 line too long (83 > 79 characters)
Line 144:80: E501 line too long (93 > 79 characters)
Line 148:80: E501 line too long (87 > 79 characters)
Line 171:80: E501 line too long (81 > 79 characters)
Line 182:80: E501 line too long (80 > 79 characters)
Line 189:80: E501 line too long (87 > 79 characters)
Line 190:80: E501 line too long (92 > 79 characters)

Comment last updated at 2021-04-28 18:40:44 UTC

codecov · 2020-08-17T19:07:03Z

Codecov Report

Merging #2912 (37755d7) into develop (a23b1e9) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #2912      +/-   ##
===========================================
+ Coverage    92.83%   92.85%   +0.02%     
===========================================
  Files          170      171       +1     
  Lines        22809    22882      +73     
  Branches      3242     3260      +18     
===========================================
+ Hits         21174    21247      +73     
  Misses        1587     1587              
  Partials        48       48

Impacted Files	Coverage Δ
package/MDAnalysis/analysis/__init__.py	`100.00% <ø> (ø)`
package/MDAnalysis/analysis/RDKit.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a23b1e9...37755d7.

cbouy · 2020-08-18T16:25:00Z

I changed the results attribute to store an array with one dict per frame, and I added a static method to list the available descriptors.

cbouy · 2020-08-20T13:45:17Z

Regarding the output given to users, I'm still scratching my head over 2 things:

For fingerprints, apart from hashed fingerprints, which have a predefined length in bits, and MACCSKeys, which is 167 bits long, it's usually a bad idea to convert them to an array, as it will likely crash or hang forever because of memory consumption.
RDKit typically stores them as a sparse vector that only records which bits are set and how many times.
I'm wondering if a more appropriate output format could be a dict with only the bits that were set?

For descriptors, maybe I can simply output an array of descriptor values instead of a dict or a list of tuples with descriptor names? The calculation will raise an error if a descriptor is not found, and the descriptors are calculated in the order given by the user (if I change self._functions to a list instead of a dict). This way there's no risk in using lambdas or in accidentally giving your function the same name as an RDKit descriptor.
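The memory argument for fingerprints can be illustrated with a rough sketch in plain Python. The feature identifiers below are made up, and the 32-bit identifier space for unfolded fingerprints is an assumption used only to give an order of magnitude; `dense_size_bytes` is a hypothetical helper.

```python
from collections import Counter

# Made-up feature identifiers, standing in for the values an unfolded
# (non-hashed) fingerprint would report for one molecule.
feature_ids = [98513984, 2245384272, 98513984, 3217380708]

# Sparse form: only the features that are present, with their counts.
sparse_fp = dict(Counter(feature_ids))


def dense_size_bytes(n_positions, bytes_per_entry=1):
    """Memory needed to hold every position of a dense array."""
    return n_positions * bytes_per_entry


maccs_bytes = dense_size_bytes(167)        # MACCS keys: small, fixed length
unfolded_bytes = dense_size_bytes(2**32)   # assumed 32-bit identifier space

print(sparse_fp)
print(maccs_bytes, unfolded_bytes)
```

Under these assumptions the dense form needs about 4 GiB at one byte per position, while the sparse dict holds three entries.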

IAlibay · 2020-08-20T13:59:08Z

> Regarding the output given to users, I'm still scratching my head over 2 things:
>
> For fingerprints, apart from hashed fp which have a predefined number of bits in length, and the MACCSKeys which is 167 bits long, it's usually a bad idea to convert them to an array as it will likely crash or hang forever because of memory consumption.
> RDKit typically stores them as a sparse vector which only knows which bit is activated and how many times.
> I'm wondering if a more appropriate output format could be a dict with only the bits that were set ?
>
> For descriptors, maybe I can simply output an array of descriptor values instead of a dict or list of tuples with descriptors names ? The calculation will raise an error if a descriptor is not found and the descriptors are calculated in the order given by the user (if I change self._functions to a list instead of a dict). This way there's no risk with using lambdas or accidentally naming your function the same way as an RDKit descriptor.

Maybe for simplicity's sake, we could just return fingerprints and descriptors exactly the way RDKit does (i.e. just return the object) for each frame (as a list)?

cbouy · 2020-08-20T14:24:53Z

I think the point of the wrapper is to make things easy for people who are not necessarily familiar with RDKit. The fingerprint object can be a bit intimidating, and its type depends on the fingerprint requested: some output an ExplicitBitVect object, others different subtypes of SparseIntVect, which behave differently. That's why I was proposing an optional general-purpose output (dict or array), but users who want the RDKit object can still have it.
Also, I don't think any of the fingerprints I list here use 3D information, so there's only one fp per atomgroup in the output (that's why it's in a function and not in an AnalysisBase object).
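A conversion layer of the kind described here could look roughly like the sketch below. `FakeBitVect` and `FakeSparseIntVect` are hypothetical stand-ins so the snippet runs without RDKit; the method names `GetOnBits` and `GetNonzeroElements` are meant to mirror the ones RDKit exposes on its bit-vector and sparse-vector objects, and `fingerprint_to_dict` is not the function from this PR.

```python
class FakeBitVect:
    """Stand-in for a bit-vector fingerprint (hypothetical test double)."""

    def __init__(self, on_bits):
        self._on_bits = sorted(on_bits)

    def GetOnBits(self):
        return list(self._on_bits)


class FakeSparseIntVect:
    """Stand-in for a count-based sparse fingerprint (hypothetical)."""

    def __init__(self, counts):
        self._counts = dict(counts)

    def GetNonzeroElements(self):
        return dict(self._counts)


def fingerprint_to_dict(fp):
    """Return {index: count} for whichever fingerprint flavour is passed."""
    if hasattr(fp, "GetNonzeroElements"):
        return dict(fp.GetNonzeroElements())
    if hasattr(fp, "GetOnBits"):
        # A bit vector only knows on/off, so every set bit counts once.
        return {bit: 1 for bit in fp.GetOnBits()}
    raise TypeError(f"unsupported fingerprint type: {type(fp).__name__}")


bits = fingerprint_to_dict(FakeBitVect([5, 42]))
counts = fingerprint_to_dict(FakeSparseIntVect({7: 3, 99: 1}))
print(bits, counts)
```

Duck typing keeps the converter independent of the exact fingerprint class, at the cost of one more layer for maintainers to document and test.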

For descriptors, yeah it's basically just the value for each frame so I guess I'll simply make a 2D array out of that.
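As a sketch of that layout, the snippet below builds one row per frame and one column per function, in the order the functions were given. Plain nested lists stand in for a NumPy array, and the frames and functions are placeholders, not RDKit molecules or descriptors.

```python
def compute_descriptors(frames, functions):
    """Return one row per frame and one column per function, in order."""
    return [[func(frame) for func in functions] for frame in frames]


# Placeholder "frames": each is just a list of numbers.
frames = [[1.0, 2.0], [3.0, 5.0], [0.5, 0.5]]

# Columns follow this order, so lambdas need no unique name.
functions = [len, sum, lambda values: max(values) - min(values)]

table = compute_descriptors(frames, functions)
for row in table:
    print(row)
```

Because columns are identified by position, the caller has to remember the order of the functions passed in.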

tylerjereddy

Just read through some of the docstrings for copy editing.

package/MDAnalysis/analysis/RDKit.py

IAlibay · 2020-08-22T15:38:30Z

> I think the point of the wrapper is to make things easy for people who are not necessarily familiar with RDKit, and the fingerprint object can be a bit intimidating and depends on the type of fingerprint requested (some will output an ExplicitBitVect object, some different subtypes of SparseIntVect which behaves differently) so that's why I was proposing an optional general purpose output (dict or array), but if users want the RDKit object they can still have it.

I'm not sure. Although a general-purpose output would be more "user friendly", at the end of the day I feel that users should be sufficiently versed in RDKit if they are looking to use these options. To me, offering an object conversion layer leads not only to more work but eventually to more confusion for users. I'll ping @fiona-naughton and @richardjgowers though, who might have different views here.

> Also I don't think any of the fingerprints I list here use 3D information so there's only one fp per atomgroup in the output (that's why it's in a function and not in an AnalysisBase object).

Yep, that makes sense.

IAlibay

Overall looks really good, got a few comments & suggestions.

The main point I think we need to discuss is whether user-friendliness is worth the extra complexity.

package/MDAnalysis/analysis/RDKit.py

IAlibay · 2020-08-22T15:44:34Z

package/MDAnalysis/analysis/RDKit.py

+
+    Notes
+    -----
+    To generate a Morgan fingerprint, don't forget to specify the radius::


Maybe here we should link to the API docs for each of the supported fingerprints? We should also have a usage example for each case (it adds to the docs, but then users are clear on what should work).

IAlibay · 2020-08-26

Or maybe just one link to a page that lists all the individual docs for the fingerprints. Re: usage, if the usage is similar or the same for all of them, then we can just say that in the docs.

IAlibay · 2020-08-22T15:45:36Z

package/MDAnalysis/analysis/RDKit.py

+
+        get_fingerprint(ag, 'Morgan', radius=2)
+
+    ``dtype="array"`` is not recommended for non-hashed fingerprints


As discussed, I personally think we should try to avoid these cases completely, but I'll let one of the other @MDAnalysis/gsoc-mentors make a judgement call here too.

testsuite/MDAnalysisTests/analysis/test_rdkit.py

IAlibay · 2020-08-22T16:11:43Z

package/MDAnalysis/analysis/RDKit.py

+
+    Example
+    -------
+    Here's an example with a custom function for a trajectory with 3 frames::


I like this example; however, in the final version we will need something a little more descriptive, probably detailing what's in _RDKIT_DESCRIPTORS, any edge cases (are there descriptors that can be given optional parameters?), etc.

cbouy · 2020-08-25

As I said in the list_available docstring, the list of descriptors is by no means curated; some of the things here aren't even descriptors but helper functions that live inside the module.
Some of these functions take parameters like includeHs or onlyHeavy, but most of them don't.

IAlibay · 2020-08-26

I'll defer to the other @MDAnalysis/gsoc-mentors to pitch in on what they think might be a good compromise here.
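One way to handle functions with optional parameters without changing the analysis class is to bind the keyword arguments beforehand with `functools.partial`. In the sketch below, `count_atoms` is a hypothetical stand-in for such a function (it is not an RDKit call), and a list of element symbols stands in for a molecule.

```python
from functools import partial


def count_atoms(mol, onlyHeavy=False):
    """Hypothetical descriptor: count atoms, optionally skipping hydrogens."""
    if onlyHeavy:
        return sum(1 for symbol in mol if symbol != "H")
    return len(mol)


# Bind the optional parameter once; the result takes only the molecule.
heavy_atom_count = partial(count_atoms, onlyHeavy=True)

mol = ["C", "H", "H", "H", "O", "H"]  # methanol, as element symbols
all_atoms = count_atoms(mol)
heavy_atoms = heavy_atom_count(mol)
print(all_atoms, heavy_atoms)
```

Note that a `partial` object has no `__name__` attribute of its own, so any scheme that labels results by function name would need to look at `heavy_atom_count.func.__name__` or accept an explicit label.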

package/MDAnalysis/analysis/RDKit.py

IAlibay · 2020-08-22T16:18:27Z

package/MDAnalysis/analysis/RDKit.py

+            self.results[self._frame_index][i] = func(mol)
+
+    @staticmethod
+    def list_available(flat=False):


This needs documenting in the class docstring, so that users are aware of its existence (including an example of use).

I do wonder if it might be simpler (at the risk of making things harder for users) not to offer this list, and instead just allow passing an RDKit function to this analysis method. Then you could have a blank-ish AnalysisBase class that is essentially a pure trajectory-analysis wrapper around an existing RDKit function (if this makes any sense)? Again, one of those things worth discussing further.

cbouy · 2020-08-25

Well, it's definitely simpler for us but harder for users.

lilyminium · 2021-04-23

Why not both? Perhaps it could be a separate, more flexible class, and this can be the easy-peasy one.
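The "pure wrapper" idea could be sketched as follows. `FrameFunctionAnalysis` is a hypothetical class that only loosely mirrors the prepare / single-frame / run pattern of AnalysisBase; it does not subclass it, and plain lists stand in for trajectory frames so the snippet runs without MDAnalysis or RDKit.

```python
class FrameFunctionAnalysis:
    """Hypothetical minimal wrapper: apply one callable to every frame."""

    def __init__(self, frames, func, **kwargs):
        self._frames = frames
        self._func = func
        self._kwargs = kwargs  # forwarded unchanged to func on each frame
        self.results = []

    def _prepare(self):
        # Reset results so that run() can be called more than once.
        self.results = []

    def _single_frame(self, frame):
        self.results.append(self._func(frame, **self._kwargs))

    def run(self):
        self._prepare()
        for frame in self._frames:
            self._single_frame(frame)
        return self


def scaled_total(values, factor=1.0):
    """Placeholder per-frame function with an optional parameter."""
    return factor * sum(values)


frames = [[1, 2], [3, 4, 5]]
plain = FrameFunctionAnalysis(frames, sum).run().results
scaled = FrameFunctionAnalysis(frames, scaled_total, factor=2.0).run().results
print(plain, scaled)
```

A class like this leaves the choice of function, and the interpretation of its return value, entirely to the user, which is the simplicity versus convenience trade-off discussed in this thread.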

package/MDAnalysis/analysis/__init__.py

testsuite/MDAnalysisTests/analysis/test_rdkit.py

IAlibay · 2021-04-22T20:10:35Z

@cbouy if you can update this against develop I'll put this next up on my review list.

cbouy · 2021-04-29T08:47:06Z

@IAlibay just updated, looks like the error is related to #2958 "as usual" 😅

orbeckst · 2021-05-30T02:54:30Z

@IAlibay this is still listed for 2.0 — is this realistic and essential?

@cbouy how much needs to be done here to complete it?

IAlibay · 2021-06-01T11:52:10Z

> @IAlibay this is still listed for 2.0 — is this realistic and essential?

Essential: probably not. I added all the RDKit PRs to the milestone because it would be really great to get them out there. I'll let @cbouy speak on how realistic this is (for this and the other open RDKit PRs).

If it helps, I'd be happy with a relatively rapid 2.1.0 release that adds new RDKit features.

cbouy · 2021-06-01T12:07:58Z

I think this is the RDKit PR with the lowest priority, and I don't mind if it's not in 2.0.
In terms of remaining work, it's mostly an API question for the RDKit fingerprint wrapper: do we bother keeping options to convert the RDKit fingerprints to more "traditional" formats (dict and numpy array), or do we only return the RDKit object? The RDKit object comes in different formats depending on the fingerprint (explicit bit, sparse int, and others I think), which is why I implemented the conversion to dict and array in the first place, to make it simpler to use.

IAlibay · 2021-06-01T12:11:30Z

> I think this is the RDKit PR with the lowest priority and I don't mind if it's not in 2.0
> In terms of remaining work it's mostly an API question for the rdkit fingerprint wrapper: do we bother keeping options to convert the rdkit fingerprints to more "traditional" formats (dict and numpy array) or do we only return the rdkit object ? There are different formats for the rdkit object depending on the fingerprint (explicit bit, sparse int, and others i think) that's why I implemented the conversion to dict and array in the first place to make it simpler to use.

@cbouy for the PRs that are opened, which ones (if any) are critical for 2.0?

cbouy · 2021-06-01T12:48:26Z

Critical PRs would be #3044 and #3324
#3325 would be nice to have in 2.0 as well
#2912 (this one) can come later
and finally #2900 which is very optional (and possibly abandoned?)

IAlibay added the GSoC label Aug 17, 2020

IAlibay assigned richardjgowers, fiona-naughton and IAlibay Aug 17, 2020

tylerjereddy reviewed Aug 21, 2020

IAlibay requested changes Aug 22, 2020

cbouy marked this pull request as ready for review August 27, 2020 12:15

IAlibay added this to the 2.0 milestone Apr 6, 2021

Cédric Bouysset added 14 commits April 28, 2021 20:38

fc7ddcb  descriptors as analysis and fingerprint as function
fd8529c  add method to list descriptors available
aa6e686  documentation
9f11ccc  fix docs
5403e87  fix bullet points and pep8
17cff1c  separate dict key from fp name
9fce129  add tests
7cadb00  pep8
844ff93  descriptors as array
14f4302  typo
09a4de0  add dtype dict and changed default for fp
41be3ad  lowercase for test filename
9958770  code review
37755d7  fix tests and docs

cbouy force-pushed the rdkit-desc branch from d014e7c to 37755d7 Compare April 28, 2021 18:40

IAlibay modified the milestones: 2.0, 2.1.0 Aug 17, 2021

IAlibay modified the milestones: 2.1.0, 3.0 Jun 2, 2022

hmacdope added the stale label Nov 5, 2023


IAlibay Aug 22, 2020

IAlibay Aug 26, 2020

IAlibay Aug 22, 2020

IAlibay Aug 22, 2020

cbouy Aug 25, 2020

IAlibay Aug 26, 2020

IAlibay Aug 22, 2020

cbouy Aug 25, 2020

lilyminium Apr 23, 2021
