How to correctly load PaRoutes models? #1

Closed
AustinT opened this issue Mar 15, 2022 · 1 comment

AustinT commented Mar 15, 2022

Your benchmark is very interesting and I would like to do some experiments with it, but I haven't found instructions on how to use your pre-trained models.
Would you mind telling me whether the following code correctly loads and uses your models?

import numpy as np
import pandas as pd
import h5py
from tensorflow import keras
from rdkit.Chem import AllChem
from rdchiral.main import rdchiralRunText

def get_fingerprint(smiles: str) -> np.ndarray:
    mol = AllChem.MolFromSmiles(smiles)
    assert mol is not None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)  # QUESTION: is this the right fingerprint?
    return np.array(fp, dtype=float)

# Load templates
df_templates = pd.read_hdf("./data/uspto_rxn_n5_unique_templates.hdf5", key="table")

# Load model, defining custom metrics because without these it gave an error...
model = keras.models.load_model(
    "./data/uspto_rxn_n5_keras_model.hdf5", 
    custom_objects={
        "top10_acc": keras.metrics.TopKCategoricalAccuracy(k=10, name="top10_acc"),
        "top50_acc": keras.metrics.TopKCategoricalAccuracy(k=50, name="top50_acc"),
    }
)

# Example use case: run the best reaction for the first 2 targets
test_smiles = ["O=C(O)COCCOCCOCCOCCOCCOCCOCC(F)(F)F", "COc1cc(N)c(Cl)cc1C(=O)NCCCC1CN(Cc2ccccc2)CCO1"]
x = np.stack([get_fingerprint(s) for s in test_smiles])
template_probs = model(x).numpy()
most_likely_reactions = template_probs.argmax(axis=1)

for i, sm in enumerate(test_smiles):
    reactants = rdchiralRunText(df_templates["retro_template"].values[most_likely_reactions[i]], sm)
    print(f"{i}: {reactants} >> {sm}")

This code runs and produces the following output (in particular, the second reaction fails). Is this the output that you would expect?

0: ['CCOC(=O)CBr.OCCOCCOCCOCCOCCOCCOCC(F)(F)F'] >> O=C(O)COCCOCCOCCOCCOCCOCCOCC(F)(F)F
1: [] >> COc1cc(N)c(Cl)cc1C(=O)NCCCC1CN(Cc2ccccc2)CCO1

Thank you in advance for answering my question. Great manuscript and keep up the good open source work! 💯

@SGenheden SGenheden self-assigned this Mar 16, 2022
@SGenheden SGenheden added the documentation Improvements or additions to documentation label Mar 16, 2022
SGenheden (Contributor) commented

Hello, and thanks for trying out PaRoutes and for the feedback.

We chose not to provide extensive documentation at this time because we cannot foresee all the possible use cases that might come up. This is an excellent question, and I will write up some notes on it.

I believe your code reproduces my own procedure, which is to use the models together with the aizynthfinder package. I will explain that procedure below, but first I would like to emphasize that yes, the predicted reactants are what you would obtain from the first predicted template. However, we typically look at the top-20 or maybe even top-50 templates, precisely to avoid situations like your second example, where the first predicted template is not applicable. The second predicted template would produce these reactants: COc1cc(N)c(Cl)cc1C(=O)Cl.NCCCC1CN(Cc2ccccc2)CCO1.
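The top-k fallback described above can be sketched roughly as follows. This is only an illustration, not code from PaRoutes or aizynthfinder: the function `first_applicable_template` and the stub `fake_apply` are hypothetical names, and the stub stands in for `rdchiralRunText` so that the sketch runs without RDKit.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def first_applicable_template(
    probs: Sequence[float],
    templates: Sequence[str],
    apply_template: Callable[[str, str], List[str]],
    target_smiles: str,
    top_k: int = 20,
) -> Optional[Tuple[int, List[str]]]:
    """Return (template_index, reactants) for the best-ranked template that applies.

    Returns None if none of the top_k templates produces any reactants.
    """
    # Rank template indices by predicted probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    for idx in ranked[:top_k]:
        reactants = apply_template(templates[idx], target_smiles)
        if reactants:  # an empty list means the template did not match
            return idx, reactants
    return None


# Hypothetical stand-in for rdchiralRunText(template, smiles); placeholder strings only.
def fake_apply(template: str, smiles: str) -> List[str]:
    return ["A.B"] if template == "works" else []


result = first_applicable_template(
    probs=[0.6, 0.3, 0.1],
    templates=["fails", "works", "also_fails"],
    apply_template=fake_apply,
    target_smiles="CCO",
)
print(result)
```

With the code from the question, one would presumably pass `template_probs[i]`, `df_templates["retro_template"].values`, and `rdchiralRunText` in place of the toy arguments.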

My solution, as mentioned, is to use the aizynthfinder package and its Python interface to the one-step model, documented here: https://molecularai.github.io/aizynthfinder/python_interface.html

This is my code (after having installed the package):

from aizynthfinder.aizynthfinder import AiZynthExpander
expander = AiZynthExpander(
    configdict={
        "policy": {
            "files": {
                "paroutes": [
                    "./data/uspto_rxn_n5_keras_model.hdf5", 
                    "./data/uspto_rxn_n5_unique_templates.hdf5"
                ]
            }
        }
    }
)
expander.expansion_policy.select("paroutes")
test_smiles = ["O=C(O)COCCOCCOCCOCCOCCOCCOCC(F)(F)F", "COc1cc(N)c(Cl)cc1C(=O)NCCCC1CN(Cc2ccccc2)CCO1"]
for smi in test_smiles:
    predictions = expander.do_expansion(smi)
    print(predictions[0][0].reaction_smiles())

The do_expansion method returns a list of tuples of reactions. Each tuple represents one unique set of precursors, which may have arisen from several different templates. So what the example prints is the first predicted set of precursors, taken from the first template.
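To look at every predicted precursor set rather than only the first, the list-of-tuples structure described above could be walked as in the sketch below. The helper `unique_precursor_sets` and the `FakeReaction` class are hypothetical: the stub exposes only a `reaction_smiles()` method and returns placeholder strings, so the sketch runs without aizynthfinder installed.

```python
from typing import Iterable, List, Sequence


def unique_precursor_sets(predictions: Iterable[Sequence]) -> List[str]:
    """Collect one reaction string per tuple, i.e. one per unique precursor set."""
    collected = []
    for reaction_tuple in predictions:
        if reaction_tuple:
            # All reactions in a tuple share the same precursors,
            # so the first entry is enough to represent the set.
            collected.append(reaction_tuple[0].reaction_smiles())
    return collected


class FakeReaction:
    """Hypothetical stand-in for a reaction object (placeholder strings, not real SMILES)."""

    def __init__(self, rxn: str) -> None:
        self._rxn = rxn

    def reaction_smiles(self) -> str:
        return self._rxn


predictions = [
    (FakeReaction("set_1"), FakeReaction("set_1")),  # same precursors from two templates
    (FakeReaction("set_2"),),
]
print(unique_precursor_sets(predictions))
```

Under the assumption that `expander.do_expansion(smi)` returns the structure described, its result could be passed to the helper directly.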

Hopefully this helps and you will be able to use the trained model from PaRoutes.

@AustinT AustinT closed this as completed Mar 16, 2022