# Molecular Generation

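This tutorial shows how to train a sequence-based VAE model that generates molecules as SMILES strings. It covers both training the model and sampling new molecules from the trained model.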

## Sequence-based VAE

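Molecular generation is a popular approach that produces new molecules by training deep generative models on large datasets. Generative models can be used to design new molecules, explore chemical space, and so on. The generated molecules can then be passed to virtual screening or other downstream tasks.
In this tutorial, we introduce a generative model based on the variational autoencoder (VAE).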

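A VAE consists of two neural networks: an encoder and a decoder. With this structure, the model maps the high-dimensional input space to a low-dimensional latent space through the encoder, and maps latent vectors back to the original input space through the decoder. The latent space is a continuous vector space with a normal prior. Training minimizes a Kullback-Leibler (KL) divergence loss together with a reconstruction loss. Because the latent space is continuous, new molecules can be sampled from the trained VAE by drawing latent vectors and decoding them.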
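
To make the two ingredients of the latent space concrete, here is a minimal NumPy sketch of the reparameterization trick and the closed-form KL divergence between a diagonal Gaussian posterior and the standard normal prior. The function names `reparameterize` and `kl_to_standard_normal`, and the choice to sum over latent dimensions and average over the batch, are illustrative assumptions rather than the PaddleHelix implementation.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch."""
    kl_per_sample = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)
    return float(np.mean(kl_per_sample))

rng = np.random.default_rng(0)
mu = np.zeros((4, 128))       # batch of 4 with latent size 128 (assumed for illustration)
logvar = np.zeros((4, 128))   # log-variance 0 means unit variance
z = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
print(z.shape, kl)
```

When the posterior already equals the prior (zero mean, unit variance), the KL term is zero; it grows as the encoder's outputs move away from the prior.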

Molecules are represented as SMILES strings. Putting the two together, the sequence VAE takes a SMILES string as input and reconstructs it.

![title](./figures/seq_VAE.png)

## Part I: Training a sequence VAE model

### Load the data

In [5]:
import sys
import os
seq_VAE_path = '../apps/molecular_generation/seq_VAE/'
sys.path.insert(0, os.getcwd() + "/..")
sys.path.append(seq_VAE_path)
from utils import *

In [6]:
# download and decompress the data
!wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/molecular_generation/zinc_moses.tgz
!tar -zxvf "zinc_moses.tgz"

--2021-05-13 14:38:50--  https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/molecular_generation/zinc_moses.tgz
Resolving baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)... 10.70.0.165
Connecting to baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)|10.70.0.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8708409 (8.3M) [application/gzip]
Saving to: ‘zinc_moses.tgz.1’


2021-05-13 14:38:54 (2.34 MB/s) - ‘zinc_moses.tgz.1’ saved [8708409/8708409]

x zinc_moses/
x zinc_moses/.DS_Store
x zinc_moses/test.csv
x zinc_moses/train.csv


In [7]:
data_path = './zinc_moses/train.csv'
train_data = load_zinc_dataset(data_path)
# get the toy data
train_data = train_data[0:1000]

In [8]:
len(train_data)

1000

In [9]:
train_data[0:10]

['CCCS(=O)c1ccc2[nH]c(=NC(=O)OC)[nH]c2c1',
 'CC(C)(C)C(=O)C(Oc1ccc(Cl)cc1)n1ccnc1',
 'Cc1c(Cl)cccc1Nc1ncccc1C(=O)OCC(O)CO',
 'Cn1cnc2c1c(=O)n(CC(O)CO)c(=O)n2C',
 'CC1Oc2ccc(Cl)cc2N(CC(O)CO)C1=O',
 'CCOC(=O)c1cncn1C1CCCc2ccccc21',
 'COc1ccccc1OC(=O)Oc1ccccc1OC',
 'O=C1Nc2ccc(Cl)cc2C(c2ccccc2Cl)=NC1O',
 'CN1C(=O)C(O)N=C(c2ccccc2Cl)c2cc(Cl)ccc21',
 'CCC(=O)c1ccc(OCC(O)CO)c(OC)c1']

### Define the vocabulary

In [10]:
# build the sequence vocabulary from the training set
vocab = OneHotVocab.from_data(train_data)
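
A vocabulary maps each SMILES token to an integer id. The sketch below shows the general idea with a character-level vocabulary and a few special tokens. The helpers `build_char_vocab` and `encode`, and the token names `<pad>`, `<bos>`, `<eos>`, `<unk>`, are hypothetical and do not describe how `OneHotVocab` is actually implemented.

```python
def build_char_vocab(smiles_list):
    """Collect every character seen in the data and assign ids after the special tokens."""
    specials = ["<pad>", "<bos>", "<eos>", "<unk>"]
    chars = sorted({ch for s in smiles_list for ch in s})
    return {tok: i for i, tok in enumerate(specials + chars)}

def encode(smiles, vocab, max_length):
    """Convert a SMILES string to ids: <bos> + characters + <eos>, padded to max_length."""
    ids = [vocab["<bos>"]]
    ids += [vocab.get(ch, vocab["<unk>"]) for ch in smiles]
    ids.append(vocab["<eos>"])
    ids = ids[:max_length]  # truncate overly long sequences
    return ids + [vocab["<pad>"]] * (max_length - len(ids))

toy_vocab = build_char_vocab(["CCO", "c1ccccc1"])
encoded = encode("CCO", toy_vocab, max_length=8)
print(encoded)
```

Note that this simple scheme splits two-character atoms such as `Cl` and `Br` into separate characters; the model then has to learn to emit them together.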

### Model configuration

The hyperparameters of the neural network are stored in `model_config`.

In [110]:
model_config = \
{
    "max_length": 80,      # maximum sequence length
    "q_cell": "gru",       # encoder RNN cell type
    "q_bidir": 1,          # whether the encoder RNN is bidirectional
    "q_d_h": 256,          # encoder hidden size
    "q_n_layers": 1,       # number of encoder RNN layers
    "q_dropout": 0.5,      # encoder dropout rate

    "d_cell": "gru",       # decoder RNN cell type
    "d_n_layers": 3,       # number of decoder RNN layers
    "d_dropout": 0.2,      # decoder dropout rate
    "d_z": 128,            # VAE latent size
    "d_d_h": 512,          # decoder hidden size
    "freeze_embeddings": 0 # whether to freeze the one-hot embeddings
}

### Define the model

In [111]:
# build the model
from pahelix.model_zoo.seq_vae_model import VAE
model = VAE(vocab, model_config)

### Train the model

In [112]:
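# explicit imports for the names used below (harmless if utils already exports them)
import numpy as np
import paddle

# define the training settings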
batch_size = 64
learning_rate = 0.001
n_epoch = 1
kl_weight = 0.1

# define optimizer
optimizer = paddle.optimizer.Adam(parameters=model.parameters(),
                            learning_rate=learning_rate)

# build the dataset and data loader
max_length = model_config["max_length"]
train_dataset = StringDataset(vocab, train_data, max_length)
train_dataloader = paddle.io.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)    

In [113]:
# start to train 
for epoch in range(n_epoch):
    print('#######################')
    kl_loss_values = []
    recon_loss_values = []
    loss_values = []
    
    for batch_id, data in enumerate(train_dataloader()):
        # read batch data
        data_batch = data

        # forward
        kl_loss, recon_loss  = model(data_batch)
        loss = kl_weight * kl_loss + recon_loss


        # backward
        loss.backward()
        # optimize
        optimizer.step()
        # clear gradients
        optimizer.clear_grad()
        
        # gathering values from each batch
        kl_loss_values.append(kl_loss.numpy())
        recon_loss_values.append(recon_loss.numpy())
        loss_values.append(loss.numpy())

        
        # report the running means over the batches seen so far in this epoch
        print('batch:%s, kl_loss:%f, recon_loss:%f' % (
            batch_id, float(np.mean(kl_loss_values)), float(np.mean(recon_loss_values))))

    print('epoch:%d loss:%f kl_loss:%f recon_loss:%f' % (
        epoch, float(np.mean(loss_values)), float(np.mean(kl_loss_values)),
        float(np.mean(recon_loss_values))), flush=True)

  

#######################
batch:0, kl_loss:0.377259, recon_loss:3.379486
batch:1, kl_loss:0.259201, recon_loss:3.264177
batch:2, kl_loss:0.210570, recon_loss:3.144137
batch:3, kl_loss:0.205814, recon_loss:3.053869
batch:4, kl_loss:0.204681, recon_loss:2.960207
batch:5, kl_loss:0.205177, recon_loss:2.892930
batch:6, kl_loss:0.203757, recon_loss:2.838837
batch:7, kl_loss:0.201053, recon_loss:2.782497
batch:8, kl_loss:0.197671, recon_loss:2.751050
batch:9, kl_loss:0.192766, recon_loss:2.715708
batch:10, kl_loss:0.186594, recon_loss:2.684680
batch:11, kl_loss:0.179440, recon_loss:2.664472
batch:12, kl_loss:0.171974, recon_loss:2.641148
batch:13, kl_loss:0.164508, recon_loss:2.620756
batch:14, kl_loss:0.157552, recon_loss:2.605232
batch:15, kl_loss:0.151044, recon_loss:2.586791
epoch:0 loss:2.601895 kl_loss:0.151044 recon_loss:2.586791
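
The log above can be sanity-checked with a little arithmetic. With 1000 training strings and a batch size of 64 there are 16 batches (ids 0 to 15), and because the total loss is linear in its two terms, the epoch-level loss should equal `kl_weight` times the mean KL loss plus the mean reconstruction loss. The snippet uses only values printed in this notebook.

```python
import math

n_samples, batch_size, kl_weight = 1000, 64, 0.1
n_batches = math.ceil(n_samples / batch_size)   # the last batch holds the remaining 40 strings

mean_kl, mean_recon = 0.151044, 2.586791        # epoch-level means from the log
total = kl_weight * mean_kl + mean_recon

print(n_batches, "%f" % total)                  # prints: 16 2.601895
```

Both numbers agree with the log: the last batch id is 15 and the reported epoch loss is 2.601895.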


## Part II: Sampling from the normal prior

In [114]:
from pahelix.utils.metrics.molecular_generation.metrics_ import get_all_metrics
N_samples = 1000  # number of samples to draw
max_len = 80      # maximum length of each sample
current_samples = model.sample(N_samples, max_len)  # sample from the model trained above

metrics = get_all_metrics(gen=current_samples, k=[3])  # evaluate the generated samples
print(metrics)

{'valid': 0.013000000000000012, 'unique@3': 0.6666666666666666, 'IntDiv': 0.7307692307692307, 'IntDiv2': 0.5181166128686162, 'Filters': 0.9230769230769231}
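
After a single epoch on 1000 molecules, only 1.3% of the 1000 samples (13 strings) are valid SMILES, so the other metrics rest on a very small set; the `Filters` value, for example, is consistent with 12 of 13 valid molecules passing. As a rough illustration of how validity and uniqueness@k are commonly defined in MOSES-style benchmarks, here is a pure-Python sketch. The helper names and the toy `balanced_parens` check are hypothetical; a real pipeline would parse and canonicalize each string with a cheminformatics toolkit, and `get_all_metrics` may differ in detail.

```python
def fraction_valid(samples, is_valid):
    """Share of generated strings that the validity check accepts."""
    return sum(1 for s in samples if is_valid(s)) / len(samples)

def fraction_unique_at_k(valid_samples, k):
    """Share of distinct strings among the first k valid samples."""
    head = valid_samples[:k]
    return len(set(head)) / len(head)

def balanced_parens(s):
    """Toy stand-in for a real SMILES parser: only checks that parentheses balance."""
    depth = 0
    for ch in s:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

samples = ["CCO", "CCO", "CC(=O)O", "CC(C", "C)C("]
valid = [s for s in samples if balanced_parens(s)]
valid_fraction = fraction_valid(samples, balanced_parens)
unique_at_3 = fraction_unique_at_k(valid, 3)
print(valid_fraction, unique_at_3)
```

Training for more epochs on the full dataset would be expected to raise the valid fraction well above what this toy run achieves.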
