# Generative molecules

In this tutorial, we walk through training a sequence VAE that generates molecules represented as SMILES strings. In particular, we demonstrate how to train the VAE and how to sample generated molecules from the trained model.

## Sequence VAE

![title](seq_VAE.png)

## Part I: Train a seq-VAE

### Load the data

In [105]:
import sys
import os
seq_VAE_path = '../apps/molecular_generation/seq_VAE/'
sys.path.insert(0, os.getcwd() + "/..")
sys.path.append(seq_VAE_path)
from utils import *

In [106]:
data_path = seq_VAE_path + 'data/zinc_moses/train.csv'
train_data = load_zinc_dataset(data_path)
# get the toy data
train_data = train_data[0:1000]

In [107]:
len(train_data)

1000

In [108]:
train_data[0:10]

['CCCS(=O)c1ccc2[nH]c(=NC(=O)OC)[nH]c2c1',
 'CC(C)(C)C(=O)C(Oc1ccc(Cl)cc1)n1ccnc1',
 'Cc1c(Cl)cccc1Nc1ncccc1C(=O)OCC(O)CO',
 'Cn1cnc2c1c(=O)n(CC(O)CO)c(=O)n2C',
 'CC1Oc2ccc(Cl)cc2N(CC(O)CO)C1=O',
 'CCOC(=O)c1cncn1C1CCCc2ccccc21',
 'COc1ccccc1OC(=O)Oc1ccccc1OC',
 'O=C1Nc2ccc(Cl)cc2C(c2ccccc2Cl)=NC1O',
 'CN1C(=O)C(O)N=C(c2ccccc2Cl)c2cc(Cl)ccc21',
 'CCC(=O)c1ccc(OCC(O)CO)c(OC)c1']

### Define the vocabulary

In [109]:
# define the sequence vocabulary based on the dataset
vocab = OneHotVocab.from_data(train_data)
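
To make the idea of a sequence vocabulary concrete, here is a minimal sketch of a character-level vocabulary built from SMILES strings. It is not the `OneHotVocab` implementation: the special tokens and the helper names `build_char_vocab`, `encode`, and `decode` are assumptions made for illustration only.

```python
# Hypothetical character-level vocabulary; NOT the OneHotVocab implementation.
# The special tokens below are an assumption made for this sketch.
SPECIAL_TOKENS = ["<bos>", "<eos>", "<pad>", "<unk>"]


def build_char_vocab(smiles_list):
    """Map every character seen in the data (plus special tokens) to an id."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    tokens = chars + SPECIAL_TOKENS
    c2i = {tok: i for i, tok in enumerate(tokens)}
    i2c = {i: tok for tok, i in c2i.items()}
    return c2i, i2c


def encode(smiles, c2i):
    """Turn a SMILES string into ids, wrapped in <bos> ... <eos>."""
    ids = [c2i.get(ch, c2i["<unk>"]) for ch in smiles]
    return [c2i["<bos>"]] + ids + [c2i["<eos>"]]


def decode(ids, i2c):
    """Turn ids back into a string, dropping the special tokens."""
    specials = set(SPECIAL_TOKENS)
    return "".join(i2c[i] for i in ids if i2c[i] not in specials)


toy_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
c2i, i2c = build_char_vocab(toy_smiles)
ids = encode("CCO", c2i)
roundtrip = decode(ids, i2c)
```

Note that a purely character-level split treats a two-letter atom such as `Cl` as two tokens, `C` and `l`. Whether `OneHotVocab` does the same is not stated here, so check its source if the distinction matters for your data.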

### Model Configuration Settings

The network is built from the hyperparameters in `model_config`.

In [110]:
model_config = {
    # encoder settings
    "max_length": 80,        # maximum sequence length
    "q_cell": "gru",         # encoder RNN cell
    "q_bidir": 1,            # whether the encoder is bidirectional
    "q_d_h": 256,            # hidden size of the encoder
    "q_n_layers": 1,         # number of encoder RNN layers
    "q_dropout": 0.5,        # encoder dropout rate

    # decoder settings
    "d_cell": "gru",         # decoder RNN cell
    "d_n_layers": 3,         # number of decoder layers
    "d_dropout": 0.2,        # decoder dropout rate
    "d_z": 128,              # latent space size
    "d_d_h": 512,            # hidden size of the decoder
    "freeze_embeddings": 0   # whether to freeze the embeddings
}

### Define the model

In [111]:
# build the model
from pahelix.model_zoo.seq_vae_model import VAE
model = VAE(vocab, model_config)

### Train the model

In [112]:
# import explicitly rather than relying on `from utils import *`
import numpy as np
import paddle

# define the training settings
batch_size = 64
learning_rate = 0.001
n_epoch = 1
kl_weight = 0.1

# define the optimizer
optimizer = paddle.optimizer.Adam(parameters=model.parameters(),
                                  learning_rate=learning_rate)

# build the dataset and the data loader
max_length = model_config["max_length"]
train_dataset = StringDataset(vocab, train_data, max_length)
train_dataloader = paddle.io.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

In [113]:
# start training
for epoch in range(n_epoch):
    print('#######################')
    kl_loss_values = []
    recon_loss_values = []
    loss_values = []

    for batch_id, data_batch in enumerate(train_dataloader()):
        # forward pass: the model returns the KL and reconstruction losses
        kl_loss, recon_loss = model(data_batch)
        loss = kl_weight * kl_loss + recon_loss

        # backward pass
        loss.backward()
        # update the parameters
        optimizer.step()
        # clear the gradients
        optimizer.clear_grad()

        # collect the loss values of each batch
        kl_loss_values.append(kl_loss.numpy())
        recon_loss_values.append(recon_loss.numpy())
        loss_values.append(loss.numpy())

        # running means over the batches seen so far in this epoch
        print('batch:%s, kl_loss:%f, recon_loss:%f' % (
            batch_id,
            float(np.mean(kl_loss_values)),
            float(np.mean(recon_loss_values))))

    print('epoch:%d loss:%f kl_loss:%f recon_loss:%f' % (
        epoch,
        float(np.mean(loss_values)),
        float(np.mean(kl_loss_values)),
        float(np.mean(recon_loss_values))), flush=True)

  

#######################
batch:0, kl_loss:0.377259, recon_loss:3.379486
batch:1, kl_loss:0.259201, recon_loss:3.264177
batch:2, kl_loss:0.210570, recon_loss:3.144137
batch:3, kl_loss:0.205814, recon_loss:3.053869
batch:4, kl_loss:0.204681, recon_loss:2.960207
batch:5, kl_loss:0.205177, recon_loss:2.892930
batch:6, kl_loss:0.203757, recon_loss:2.838837
batch:7, kl_loss:0.201053, recon_loss:2.782497
batch:8, kl_loss:0.197671, recon_loss:2.751050
batch:9, kl_loss:0.192766, recon_loss:2.715708
batch:10, kl_loss:0.186594, recon_loss:2.684680
batch:11, kl_loss:0.179440, recon_loss:2.664472
batch:12, kl_loss:0.171974, recon_loss:2.641148
batch:13, kl_loss:0.164508, recon_loss:2.620756
batch:14, kl_loss:0.157552, recon_loss:2.605232
batch:15, kl_loss:0.151044, recon_loss:2.586791
epoch:0 loss:2.601895 kl_loss:0.151044 recon_loss:2.586791
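
The training objective above is `kl_weight * kl_loss + recon_loss`. The following NumPy sketch illustrates the two VAE ingredients behind it: the closed-form KL divergence between a diagonal Gaussian and the standard normal prior, and the reparameterization trick. The function names and the reduction (sum over latent dimensions, mean over the batch) are assumptions for illustration; the `VAE` class in `pahelix` may reduce its losses differently.

```python
import numpy as np


def kl_to_standard_normal(mu, logvar):
    """Batch mean of KL(N(mu, diag(exp(logvar))) || N(0, I)).

    Assumed reduction: sum over latent dimensions, mean over the batch.
    """
    kl_per_sample = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)
    return float(np.mean(kl_per_sample))


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


rng = np.random.default_rng(0)
latent_size = 128  # matches "d_z" in model_config

# A posterior equal to the prior has zero KL divergence.
mu = np.zeros((4, latent_size))
logvar = np.zeros((4, latent_size))
kl_zero = kl_to_standard_normal(mu, logvar)

# Shifting every mean by 1 costs 0.5 per latent dimension.
kl_shifted = kl_to_standard_normal(np.ones((4, latent_size)), logvar)

z = reparameterize(mu, logvar, rng)

# The weighted objective, checked against the epoch summary printed above.
kl_weight = 0.1
total_loss = kl_weight * 0.151044 + 2.586791
```

With the epoch means from the log, `total_loss` comes out at about 2.601895, which matches the reported `loss` value.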


## Part II: Sample from the prior

In [114]:
from pahelix.utils.metrics.molecular_generation.metrics_ import get_all_metrics
N_samples = 1000  # number of samples
max_len = 80      # maximum length of each sample
current_samples = model.sample(N_samples, max_len)  # draw samples from the trained model

metrics = get_all_metrics(gen=current_samples, k=[3])  # evaluate the samples
print(metrics)

{'valid': 0.013000000000000012, 'unique@3': 0.6666666666666666, 'IntDiv': 0.7307692307692307, 'IntDiv2': 0.5181166128686162, 'Filters': 0.9230769230769231}
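
The `valid` score of 0.013 means that 13 of the 1000 samples were judged valid, which is plausible for a toy model trained for one epoch on 1000 molecules. As a rough sketch of what `valid` and `unique@k` measure, assuming `get_all_metrics` follows the common definitions (the fraction of valid samples, and the fraction of distinct molecules among the first `k` valid ones), the helpers below compute both on a toy list. The names `fraction_valid`, `unique_at_k`, and `toy_is_valid` are hypothetical. The validity check is only a stand-in that tests for balanced parentheses; a real check would parse each string with a cheminformatics toolkit.

```python
def toy_is_valid(smiles):
    """Stand-in validity check (assumption): non-empty, balanced parentheses.

    This is NOT a chemical validity test; it only keeps the sketch
    free of external dependencies.
    """
    if not smiles:
        return False
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def fraction_valid(samples, is_valid):
    """Fraction of generated strings accepted by the validity check."""
    return sum(1 for s in samples if is_valid(s)) / len(samples)


def unique_at_k(valid_samples, k):
    """Fraction of distinct strings among the first k valid samples."""
    head = valid_samples[:k]
    return len(set(head)) / len(head)


toy_samples = ["CCO", "CC(", "CCO", "c1ccccc1", ")C("]
valid_samples = [s for s in toy_samples if toy_is_valid(s)]

valid_score = fraction_valid(toy_samples, toy_is_valid)
unique_score = unique_at_k(valid_samples, 3)
```

In this toy list, three of the five strings pass the check and two of the first three valid strings are distinct, so `valid_score` is 0.6 and `unique_score` is 2/3.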
