How to use MOSES train/test/testSF dataset in Torchdrug #19

Open
scintiller opened this issue Aug 28, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@scintiller

scintiller commented Aug 28, 2021

TorchDrug implements the MOSES dataset but doesn't distinguish between the train / test / testSF splits that MOSES provides. To train GCPN on MOSES, I think the correct order is to pretrain the model on the train set first, then train it on the test / testSF sets, and finally generate molecules. But how do I do this in TorchDrug? There is only one dataset, named MOSES.

I ask because when I generate molecules with a model trained on MOSES, the statistics don't look right compared with other models on MOSES, especially the Scaf/Test metric in the table, which checks whether the generated molecules share scaffolds with the test set. It is 0 for the GCPN model after training with TorchDrug, following the tutorial. I think the problem is that TorchDrug only uses the train set and not the test set. How can I use the test set explicitly? Thanks in advance!
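To make the Scaf/Test value of 0 concrete, here is a minimal sketch of a scaffold similarity score, assuming (as in the MOSES benchmark description) that it is a cosine similarity between scaffold frequency counts. The function name `scaffold_cosine` and the scaffold strings are hypothetical placeholders, and the scaffolds are assumed to have been extracted already; this is not the MOSES implementation.

```python
from collections import Counter
from math import sqrt


def scaffold_cosine(ref_scaffolds, gen_scaffolds):
    """Cosine similarity between two scaffold frequency vectors."""
    ref_counts = Counter(ref_scaffolds)
    gen_counts = Counter(gen_scaffolds)
    # Only scaffolds that occur in both sets contribute to the dot product.
    shared = ref_counts.keys() & gen_counts.keys()
    dot = sum(ref_counts[s] * gen_counts[s] for s in shared)
    ref_norm = sqrt(sum(c * c for c in ref_counts.values()))
    gen_norm = sqrt(sum(c * c for c in gen_counts.values()))
    if ref_norm == 0 or gen_norm == 0:
        return 0.0
    return dot / (ref_norm * gen_norm)


# Hypothetical scaffold SMILES, for illustration only.
test_scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccncc1"]
disjoint = ["C1CCCCC1", "C1CCNCC1"]      # shares nothing with the test set
overlapping = ["c1ccccc1", "C1CCCCC1"]   # shares one scaffold

no_overlap_score = scaffold_cosine(test_scaffolds, disjoint)
overlap_score = scaffold_cosine(test_scaffolds, overlapping)
print(no_overlap_score)  # 0.0
print(overlap_score)
```

Under this assumption, a score of exactly 0 means the generated molecules and the reference set have no scaffold in common at all.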

[Attached images: MOSES, MOSES2]

@KiddoZhu KiddoZhu added the enhancement New feature or request label Aug 28, 2021
@KiddoZhu KiddoZhu changed the title How to use MOSES train/test/testSF dataset in Torchdrug How to use MOSES train/test/testSF dataset in Torchdrug Aug 28, 2021
@KiddoZhu
Contributor

Hi! There is a predefined split for MOSES implemented in TorchDrug. I am not sure if this is what you want. You can get it by

dataset = datasets.MOSES("/path/to/dataset")
train_set, valid_set, test_set = dataset.split()

Sorry I am not an expert in molecule generation. Maybe @shichence knows more about the dataset and evaluation setting on MOSES?

@scintiller
Author

Thank you for your quick response! I have two questions:

  1. Following the Molecular Generation tutorial, I know how to train a GCPN model on the train_set. But how do I load the trained model and train it again on the test_set?
    I don't know if the following code is correct:
solver = core.Engine(task, dataset, None, None, optimizer,
                     gpus=(0,), batch_size=128, log_interval=10)
solver.load("path_to_dump/graphgeneration/gcpn_zinc250k_1epoch.pkl")
solver.train(num_epoch=1)
  2. Do train_set, valid_set, and test_set correspond to train, test, and scaffold test in MOSES? Thank you!

@KiddoZhu
Contributor

KiddoZhu commented Aug 28, 2021

  1. In your code snippet, the model is trained on the whole dataset (including train, valid, test). If you only want to train on test_set, simply create an Engine like this.
solver = core.Engine(task, test_set, None, None, optimizer, ...)
  2. I am not confident, but I think they might be different. What we use in MOSES is the split recorded in data/dataset_v1.csv, while the MOSES library loads from the compressed files in moses/dataset/data/*.gz. We will check what the standard or most popular evaluation setting for MOSES is and update it.

@scintiller
Author

scintiller commented Aug 28, 2021

Thank you for the quick response! I have the following update:

  1. Actually, I want to know how to apply the test_set to a trained model. More specifically, suppose I first train the model on the train_set and save the result as gcpn_10epoch.pkl. How can I then load this model, gcpn_10epoch.pkl, and train it further on the test_set?
    A similar question: after training a model for 3 epochs and saving it as model_3epoch.pkl, how do I load model_3epoch.pkl and train it for more epochs?

  2. I checked how TorchDrug builds the MOSES dataset and found that we should make the following change in the `split` function:

    def split(self):
        # Group sample indexes by the value of the SPLIT column.
        indexes = defaultdict(list)
        for i, split in enumerate(self.targets["SPLIT"]):
            indexes[split].append(i)
        train_set = torch_data.Subset(self, indexes["train"])
        test_scaffolds_set = torch_data.Subset(self, indexes["test_scaffolds"])   # change happens here: replaces valid_set
        test_set = torch_data.Subset(self, indexes["test"])
        return train_set, test_scaffolds_set, test_set

This is because the SPLIT column in the MOSES dataset has no valid entry; it has test_scaffolds instead. Here are the first 10 lines of data/dataset_v1.csv:

    SMILES,SPLIT
    CCCS(=O)c1ccc2[nH]c(=NC(=O)OC)[nH]c2c1,train
    CC(C)(C)C(=O)C(Oc1ccc(Cl)cc1)n1ccnc1,train
    CC1C2CCC(C2)C1CN(CCO)C(=O)c1ccc(Cl)cc1,test
    Cc1c(Cl)cccc1Nc1ncccc1C(=O)OCC(O)CO,train
    Cn1cnc2c1c(=O)n(CC(O)CO)c(=O)n2C,train
    CC1Oc2ccc(Cl)cc2N(CC(O)CO)C1=O,train
    O=C(C1CCCCC1)N1CC(=O)N2CCCc3ccccc3C2C1,test_scaffolds
    CCOC(=O)c1cncn1C1CCCc2ccccc21,train
    COc1ccccc1OC(=O)c1ccccc1OC(C)=O,test_scaffolds
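One way to double-check which split names the file really contains, without going through TorchDrug, is to group row indexes by the SPLIT column using only the standard library. This is a rough sketch: the helper name `indexes_by_split` is made up for illustration, and the sample holds only the rows quoted above, not the full file.

```python
import csv
import io
from collections import defaultdict

# The first rows of data/dataset_v1.csv, as quoted above.
SAMPLE_CSV = """SMILES,SPLIT
CCCS(=O)c1ccc2[nH]c(=NC(=O)OC)[nH]c2c1,train
CC(C)(C)C(=O)C(Oc1ccc(Cl)cc1)n1ccnc1,train
CC1C2CCC(C2)C1CN(CCO)C(=O)c1ccc(Cl)cc1,test
Cc1c(Cl)cccc1Nc1ncccc1C(=O)OCC(O)CO,train
Cn1cnc2c1c(=O)n(CC(O)CO)c(=O)n2C,train
CC1Oc2ccc(Cl)cc2N(CC(O)CO)C1=O,train
O=C(C1CCCCC1)N1CC(=O)N2CCCc3ccccc3C2C1,test_scaffolds
CCOC(=O)c1cncn1C1CCCc2ccccc21,train
COc1ccccc1OC(=O)c1ccccc1OC(C)=O,test_scaffolds
"""


def indexes_by_split(csv_file):
    """Map each SPLIT value to the 0-based indexes of its data rows."""
    indexes = defaultdict(list)
    for i, row in enumerate(csv.DictReader(csv_file)):
        indexes[row["SPLIT"]].append(i)
    return dict(indexes)


indexes = indexes_by_split(io.StringIO(SAMPLE_CSV))
for name, rows in indexes.items():
    print(name, rows)
```

To run it on the real file, pass `open("data/dataset_v1.csv", newline="")` instead of the `io.StringIO` sample. In the quoted rows, the only split names are train, test, and test_scaffolds, so looking up "valid" would give an empty subset.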

@KiddoZhu
Contributor

  1. I think you can just create another solver wrapping the original model with test_set, load the checkpoint and finetune the model on test_set.

Pretrain:

solver = core.Engine(task, train_set, None, None, optimizer, ...)
solver.train(num_epoch=10)
solver.save("gcpn_10epoch.pkl")

Finetune:

solver = core.Engine(task, test_set, None, None, optimizer, ...)
solver.load("gcpn_10epoch.pkl")
solver.train(num_epoch=1)

The same procedure can be applied to resume training.

  2. That's great! I will follow your code and check the dataset.
