Skip to content

Molecule featurizer and molecule graph#484

Merged
sveccham merged 43 commits into
mainfrom
sveccham/molecule_featurizer
Nov 28, 2024
Merged

Molecule featurizer and molecule graph#484
sveccham merged 43 commits into
mainfrom
sveccham/molecule_featurizer

Conversation

@sveccham
Copy link
Copy Markdown
Collaborator


Implements RDkit 2D descriptor molecule featurizer. Implements MoleculeGraph, a data object for storing and processing molecular graphs.

  • Implemented molecule featurizer class and molecule graph
  • RDkit2DDescriptorFeaturizer and MoleculeGraph
  • The RDkit 2D descriptor class can be used to featurize molecules with ~210 molecular descriptors. The MoleculeGraph data object is used to store all molecular graph related attibutes.
  • from bionemo.geometric.molecule_featurizers import RDkit2DDescriptorFeaturizer

Usage

from bionemo.geometric.molecule_featurizers import RDkit2DDescriptorFeaturizer
    rdf = RDkit2DDescriptorFeaturizer()
    mol_feats = rdf(sample_mol)

Testing

Tests for these changes can be run via:

pytest tests/bionemo/geometric/test_molecule_featurizers.py

@sveccham
Copy link
Copy Markdown
Collaborator Author

/build-ci

1 similar comment
@sveccham
Copy link
Copy Markdown
Collaborator Author

/build-ci

@sveccham sveccham self-assigned this Nov 27, 2024
@sveccham sveccham added the enhancement New feature or request label Nov 27, 2024
@sveccham sveccham requested a review from scal444 November 27, 2024 19:17
@sveccham sveccham marked this pull request as ready for review November 27, 2024 21:59
Copy link
Copy Markdown
Collaborator

@trvachov trvachov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me, except I think the tests (especially the golden values) need a little more description. I worry we'll end up in a situation where the tests fail and some other engineer will be digging through the tests trying to understand what the golden values mean. For example, one of them appears to be an embedding? (float array)...maybe it's ok if this one changes a little bit? Others maybe not so much?

@sveccham
Copy link
Copy Markdown
Collaborator Author

Looks fine to me, except I think the tests (especially the golden values) need a little more description. I worry we'll end up in a situation where the tests fail and some other engineer will be digging through the tests trying to understand what the golden values mean. For example, one of them appears to be an embedding? (float array)...maybe it's ok if this one changes a little bit? Others maybe not so much?

I have added descriptions for the golden values tests. One golden value test had fixed data types: int and float. I separated them out and checked for exact match for int vals and torch.allclose match for float vals

@trvachov could I have an approval if everything looks good to you?

Copy link
Copy Markdown
Collaborator

@DejunL DejunL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have some comments related to the design of the MoleculeGraph class and minor ones on the tests. As Guoqing pointed out during the sync, couldn't the graph data be of just Data? so our method can enjoy a flat design, e.g., via function calls rather than relying on graph classes?

Comment thread sub-packages/bionemo-geometric/src/bionemo/geometric/molecule_graph.py Outdated
Comment thread sub-packages/bionemo-geometric/src/bionemo/geometric/molecule_graph.py Outdated
Comment thread sub-packages/bionemo-geometric/src/bionemo/geometric/molecule_graph.py Outdated
Copy link
Copy Markdown
Collaborator

@trvachov trvachov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt changes.

@sveccham
Copy link
Copy Markdown
Collaborator Author

Have some comments related to the design of the MoleculeGraph class and minor ones on the tests. As Guoqing pointed out during the sync, couldn't the graph data be of just Data? so our method can enjoy a flat design, e.g., via function calls rather than relying on graph classes?

Agreed. I have removed the MoleculeGraph class altogether.

Copy link
Copy Markdown
Collaborator

@DejunL DejunL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Have some non-blocking comments about adding docstring and example

)


class RDkit2DDescriptorFeaturizer(BaseMoleculeFeaturizer):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would recommend adding docstring on the get_molecule_features() with more details about what the requirements are for the input and what to expect in the output. Would be great if an example is give in the class's docstring

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a useful link to the list of RDKit features we're computing here? Or at least the Descriptors RDKit docstring. If not, at least a hint for the user to say, print(Descriptors.DescList) for the list

@sveccham
Copy link
Copy Markdown
Collaborator Author

/build-ci

@sveccham sveccham enabled auto-merge (squash) November 28, 2024 03:51
@sveccham sveccham merged commit 550aab9 into main Nov 28, 2024
@sveccham sveccham deleted the sveccham/molecule_featurizer branch November 28, 2024 04:41
rf_feats = rf(test_mol2)

# Reference is a list of tuples
# Each tuple contains the sizes of the rings the bond is present it
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick present in :)

)


class RDkit2DDescriptorFeaturizer(BaseMoleculeFeaturizer):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a useful link to the list of RDKit features we're computing here? Or at least the Descriptors RDKit docstring. If not, at least a hint for the user to say, print(Descriptors.DescList) for the list

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants