How to use MOSES train/test/testSF dataset in Torchdrug #19

scintiller · 2021-08-28T04:37:33Z

TorchDrug implements the MOSES dataset but doesn't distinguish between the train / test / testSF splits that MOSES provides. To train GCPN on MOSES, I think the correct order is to first pretrain the model on the train set, then train it on the test / testSF set, and finally generate molecules. But how can I do this in TorchDrug? There's only one dataset named MOSES.

I ask because when I generate molecules with a model trained on MOSES, the statistics don't look right compared with other models on MOSES, especially the Scaf/Test property in the table, which checks whether the generated molecules share scaffolds with the test dataset. It is 0 for the GCPN model after training in TorchDrug following the tutorial. I think the problem is that TorchDrug only uses the train dataset and not the test dataset. How can I use the test dataset explicitly? Thanks in advance!
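To make the Scaf/Test value of 0 concrete: as far as I understand the MOSES benchmark, the scaffold metric is a cosine similarity between scaffold frequency vectors of the generated and reference sets. The sketch below is a simplified stand-in for that idea, not the MOSES implementation. It assumes the scaffolds have already been extracted as strings (for example with RDKit, which is not used here), and the function name `scaffold_cosine` and the toy scaffold lists are hypothetical.

```python
import math
from collections import Counter


def scaffold_cosine(generated_scaffolds, reference_scaffolds):
    """Cosine similarity between two scaffold frequency vectors.

    Both arguments are lists of scaffold strings (assumed precomputed).
    Returns 0.0 when either list is empty.
    """
    gen = Counter(generated_scaffolds)
    ref = Counter(reference_scaffolds)
    # Only scaffolds present in both sets contribute to the dot product.
    dot = sum(count * ref[scaffold] for scaffold, count in gen.items())
    gen_sq = sum(c * c for c in gen.values())
    ref_sq = sum(c * c for c in ref.values())
    norm = math.sqrt(gen_sq * ref_sq)
    return dot / norm if norm else 0.0


# Hypothetical toy inputs: no shared scaffold vs. identical scaffold counts.
disjoint = scaffold_cosine(["c1ccccc1", "c1ccccc1"], ["C1CCCCC1"])
identical = scaffold_cosine(["c1ccccc1", "c1ccncc1"], ["c1ccncc1", "c1ccccc1"])
print(disjoint, identical)
```

Under this definition, a score of exactly 0 means the generated molecules share no scaffold at all with the reference set.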

KiddoZhu · 2021-08-28T05:15:32Z

Hi! There is a predefined split for MOSES implemented in TorchDrug. I am not sure if this is what you want. You can get it by

from torchdrug import datasets

dataset = datasets.MOSES("/path/to/dataset")
train_set, valid_set, test_set = dataset.split()

Sorry I am not an expert in molecule generation. Maybe @shichence knows more about the dataset and evaluation setting on MOSES?

scintiller · 2021-08-28T07:07:32Z

Thank you for your quick response! I have three questions, as follows:

Following the Molecule Generation tutorial, I know how to train a GCPN model on the train_set. But how do I load the trained model and train it again with the test_set?
I don't know if the following code is correct:

solver = core.Engine(task, dataset, None, None, optimizer,
                     gpus=(0,), batch_size=128, log_interval=10)
solver.load("path_to_dump/graphgeneration/gcpn_zinc250k_1epoch.pkl")
solver.train(num_epoch=1)

Do train_set, valid_set, test_set correspond to train, test, scaffold test in MOSES? Thank you!

KiddoZhu · 2021-08-28T07:22:29Z

In your code snippet, the model is trained on the whole dataset (including train, valid, test). If you only want to train on test_set, simply create an Engine like this.

solver = core.Engine(task, test_set, None, None, optimizer, ...)

As for whether the splits correspond, I am not confident, but I think they might be different. What we use in MOSES is the split recorded in data/dataset_v1.csv, while the MOSES library loads from the zip files in moses/dataset/data/*.gz. We will check what the standard or most popular evaluation setting for MOSES is and update it.

scintiller · 2021-08-28T18:54:22Z

Thank you for the quick response! I have the following update:

Actually, I want to know how to apply the test_set to a trained model. To be more specific, suppose I first train the model on the train_set and save the result as gcpn_10epoch.pkl. How can I then load this model, gcpn_10epoch.pkl, and further train it with the test_set?
There is also a similar question: after training a model for 3 epochs and saving it as model_3epoch.pkl, how do I load model_3epoch.pkl and train it for more epochs?
I checked how TorchDrug builds the MOSES dataset and found that we should make the following change in the split function:

    def split(self):
        # Group example indices by the value of the SPLIT column.
        indexes = defaultdict(list)
        for i, split in enumerate(self.targets["SPLIT"]):
            indexes[split].append(i)
        train_set = torch_data.Subset(self, indexes["train"])
        test_scaffolds_set = torch_data.Subset(self, indexes["test_scaffolds"])   # change happens here!
        test_set = torch_data.Subset(self, indexes["test"])
        # Return the scaffold test set in place of the nonexistent valid set.
        return train_set, test_scaffolds_set, test_set

This is because the split in the MOSES dataset has no valid label; it has test_scaffolds instead. Here are the first 10 lines of data/dataset_v1.csv:

SMILES,SPLIT
CCCS(=O)c1ccc2[nH]c(=NC(=O)OC)[nH]c2c1,train
CC(C)(C)C(=O)C(Oc1ccc(Cl)cc1)n1ccnc1,train
CC1C2CCC(C2)C1CN(CCO)C(=O)c1ccc(Cl)cc1,test
Cc1c(Cl)cccc1Nc1ncccc1C(=O)OCC(O)CO,train
Cn1cnc2c1c(=O)n(CC(O)CO)c(=O)n2C,train
CC1Oc2ccc(Cl)cc2N(CC(O)CO)C1=O,train
O=C(C1CCCCC1)N1CC(=O)N2CCCc3ccccc3C2C1,test_scaffolds
CCOC(=O)c1cncn1C1CCCc2ccccc21,train
COc1ccccc1OC(=O)c1ccccc1OC(C)=O,test_scaffolds
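The grouping can be checked without TorchDrug. The following is a minimal sketch using only the Python standard library on the rows quoted above. The helper name `group_by_split` is made up for illustration, and the sample covers only these first rows, not the whole file.

```python
import csv
import io
from collections import defaultdict

# First rows of data/dataset_v1.csv as quoted in this thread.
SAMPLE = """SMILES,SPLIT
CCCS(=O)c1ccc2[nH]c(=NC(=O)OC)[nH]c2c1,train
CC(C)(C)C(=O)C(Oc1ccc(Cl)cc1)n1ccnc1,train
CC1C2CCC(C2)C1CN(CCO)C(=O)c1ccc(Cl)cc1,test
Cc1c(Cl)cccc1Nc1ncccc1C(=O)OCC(O)CO,train
Cn1cnc2c1c(=O)n(CC(O)CO)c(=O)n2C,train
CC1Oc2ccc(Cl)cc2N(CC(O)CO)C1=O,train
O=C(C1CCCCC1)N1CC(=O)N2CCCc3ccccc3C2C1,test_scaffolds
CCOC(=O)c1cncn1C1CCCc2ccccc21,train
COc1ccccc1OC(=O)c1ccccc1OC(C)=O,test_scaffolds
"""


def group_by_split(csv_text):
    """Map each SPLIT label to the list of row indices that carry it."""
    indexes = defaultdict(list)
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        indexes[row["SPLIT"]].append(i)
    return dict(indexes)


indexes = group_by_split(SAMPLE)
print(sorted(indexes))            # split labels present in the sample
print(indexes.get("valid", []))   # empty: no row is labelled "valid"
```

In this sample the only labels are train, test and test_scaffolds, so a Subset built from indexes["valid"] would be empty.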

KiddoZhu · 2021-08-29T03:07:45Z

I think you can just create another solver wrapping the original model with test_set, load the checkpoint and finetune the model on test_set.

Pretrain:

solver = core.Engine(task, train_set, None, None, optimizer, ...)
solver.train(num_epoch=10)
solver.save("gcpn_10epoch.pkl")

Finetune:

solver = core.Engine(task, test_set, None, None, optimizer, ...)
solver.load("gcpn_10epoch.pkl")
solver.train(num_epoch=1)

The same procedure can be applied to resume training.

That's great! I will follow your code and check the dataset.

KiddoZhu added the enhancement New feature or request label Aug 28, 2021

scintiller mentioned this issue Aug 29, 2021

How to load an existing checkpoint and train it for more epochs #20

Open
