# Description

This notebook trains an RNN on the known universe of SMILES strings so that it learns to generate novel small molecules. We then sample from this trained network to produce our generation 0 (gen0) candidate molecules.

## Train the Network

In [1]:
import tensorflow
tensorflow.test.is_gpu_available()

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True
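
The deprecation notice above points to `tf.config.list_physical_devices('GPU')` as the replacement call. A minimal sketch of the same check with that call, assuming TensorFlow 2.1 or later; the helper name `gpu_available` is hypothetical, and it simply reports `False` when TensorFlow cannot be imported.

```python
def gpu_available() -> bool:
    """Return True if TensorFlow can see at least one GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed, so it cannot use a GPU.
        return False
    # The call returns a list of visible GPU devices (empty if none).
    return len(tf.config.list_physical_devices('GPU')) > 0


print(gpu_available())
```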

In [2]:
import numpy as np
from copy import copy

import keras

from lstm_chem.utils.config import process_config
from lstm_chem.model import LSTMChem
from lstm_chem.generator import LSTMChemGenerator
from lstm_chem.trainer import LSTMChemTrainer
from lstm_chem.data_loader import DataLoader

Using TensorFlow backend.


In [3]:
CONFIG_FILE = 'experiments/2020-04-21/LSTM_Chem/config.json'
config = process_config(CONFIG_FILE)

In [4]:
modeler = LSTMChem(config, session='train')

In [5]:
train_dl = DataLoader(config, data_type='train')

loading SMILES...
done.
tokenizing SMILES...
100%|██████████| 438552/438552 [02:40<00:00, 2740.89it/s]

done.
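
The loader log above reports a tokenizing step. For readers unfamiliar with it, here is a minimal sketch of regex-based SMILES tokenization. This is not the tokenizer used inside `lstm_chem`; the pattern and the name `tokenize_smiles` are illustrative assumptions. The idea is that bracket atoms, the two-letter halogens `Cl` and `Br`, and two-digit ring closures (`%nn`) should each become a single token rather than being split into characters.

```python
import re

# Hypothetical pattern: bracket atoms, Cl/Br, two-digit ring closures (%nn),
# then any remaining single character (atoms, bonds, branches, ring digits).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into tokens without losing any characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble into the input string.
    assert "".join(tokens) == smiles
    return tokens


print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(tokenize_smiles("C[C@H](N)C(=O)O"))        # bracket atom stays whole
print(tokenize_smiles("Clc1ccccc1Br"))           # Cl and Br stay whole
```

A character-level split would break `Cl` into `C` and `l`, which is why the multi-character alternatives come first in the pattern.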





In [6]:
valid_dl = copy(train_dl)
valid_dl.data_type = 'valid'

In [7]:
trainer = LSTMChemTrainer(modeler, train_dl, valid_dl)

In [8]:
trainer.train()

Instructions for updating:
Please use Model.fit, which supports generators.
Train for 1542 steps, validate for 172 steps
Epoch 1/22

Epoch 00001: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-01-0.73.hdf5
Epoch 2/22

Epoch 00002: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-02-0.64.hdf5
Epoch 3/22

Epoch 00003: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-03-0.60.hdf5
Epoch 4/22

Epoch 00004: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-04-0.57.hdf5
Epoch 5/22

Epoch 00005: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-05-0.55.hdf5
Epoch 6/22

Epoch 00006: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-06-0.54.hdf5
Epoch 7/22

Epoch 00007: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-07-0.52.hdf5
Epoch 8/22

Epoch 00008: saving model to experiments/2020-04-21/LS

Epoch 14/22

Epoch 00014: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-14-0.48.hdf5
Epoch 15/22

Epoch 00015: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-15-0.48.hdf5
Epoch 16/22

Epoch 00016: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-16-0.48.hdf5
Epoch 17/22

Epoch 00017: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-17-0.48.hdf5
Epoch 18/22

Epoch 00018: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-18-0.47.hdf5
Epoch 19/22

Epoch 00019: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-19-0.47.hdf5
Epoch 20/22

Epoch 00020: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-20-0.47.hdf5
Epoch 21/22

Epoch 00021: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-21-0.47.hdf5
Epoch 22/22

Epoch 00022: saving model to experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-22-0.46.hdf5
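
The line "Train for 1542 steps, validate for 172 steps" in the log above is consistent with the dataset size (438552 SMILES), the configured `validation_split` of 0.1, and the configured `batch_size` of 256. A quick check of the arithmetic, assuming the validation share is rounded down and a partial final batch counts as one step (how `lstm_chem` actually splits the data is an assumption here):

```python
import math

n_molecules = 438552      # total from the data loader's progress bar
validation_split = 0.1    # from the experiment config
batch_size = 256          # from the experiment config

# Assumption: the validation set takes the rounded-down share and the
# training set takes the rest; a partial final batch counts as one step.
n_valid = int(n_molecules * validation_split)
n_train = n_molecules - n_valid

train_steps = math.ceil(n_train / batch_size)
valid_steps = math.ceil(n_valid / batch_size)
print(train_steps, valid_steps)  # prints: 1542 172
```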


In [12]:
# Save weights of the trained model
trainer.model.save_weights('experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-baseline-model-full.hdf5')

## Load the Model and Generate New Molecules

In [13]:
config['model_weight_filename'] = 'experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-baseline-model-full.hdf5'
print(config)

batch_size: 256
checkpoint_dir: experiments/2020-04-21/LSTM_Chem/checkpoints/
checkpoint_mode: min
checkpoint_monitor: val_loss
checkpoint_save_best_only: false
checkpoint_save_weights_only: true
checkpoint_verbose: 1
config_file: experiments/2020-04-21/LSTM_Chem/config.json
data_filename: ./datasets/dataset_cleansed.smi
data_length: 0
exp_dir: experiments/2020-04-21/LSTM_Chem
exp_name: LSTM_Chem
finetune_batch_size: 1
finetune_data_filename: ./datasets/TRPM8_inhibitors_for_fine-tune.smi
finetune_epochs: 12
model_arch_filename: experiments/2020-04-21/LSTM_Chem/model_arch.json
model_weight_filename: experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-baseline-model-full.hdf5
num_epochs: 22
optimizer: adam
sampling_temp: 0.75
seed: 71
smiles_max_length: 128
tensorboard_log_dir: experiments/2020-04-21/LSTM_Chem/logs/
tensorboard_write_graph: true
train_smi_max_len: 74
units: 256
validation_split: 0.1
verbose_training: true



In [14]:
modeler = LSTMChem(config, session='generate')
generator = LSTMChemGenerator(modeler)

Loading model architecture from experiments/2020-04-21/LSTM_Chem/model_arch.json ...
Loading model checkpoint from experiments/2020-04-21/LSTM_Chem/checkpoints/LSTM_Chem-baseline-model-full.hdf5 ...
Loaded the Model.

In [15]:
sample_number = 10000
sampled_smiles = generator.sample(num=sample_number)

100%|██████████| 10000/10000 [1:18:34<00:00,  2.12it/s]
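
The config sets `sampling_temp: 0.75`. As a rough illustration of what a sampling temperature usually does, the sketch below rescales a next-token distribution by `1/T` in log space, renormalizes it, and draws one token. This is not the `LSTMChemGenerator` code; the function names and the example probabilities are made up, and only the standard library is used.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: T < 1 sharpens it, T > 1 flattens it."""
    # Clamp to avoid log(0), divide log-probabilities by the temperature.
    scaled = [math.log(max(p, 1e-12)) / temperature for p in probs]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(probs, temperature, rng=random):
    """Draw one token index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


next_token_probs = [0.6, 0.3, 0.1]   # made-up model output for three tokens
print(apply_temperature(next_token_probs, 0.75))
print(sample_token(next_token_probs, 0.75, random.Random(71)))
```

With a temperature below 1, the most likely token gains probability mass, which tends to trade some diversity for a higher share of well-formed strings.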


In [16]:
from rdkit import RDLogger, Chem, DataStructs
from rdkit.Chem import AllChem, Draw, Descriptors
from rdkit.Chem.Draw import IPythonConsole
RDLogger.DisableLog('rdApp.*')

In [17]:
import pandas as pd

# Keep only the samples that RDKit can parse into a molecule
valid_mols = []
for smi in sampled_smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        valid_mols.append(mol)
# low validity
print('Validity: ', f'{len(valid_mols) / sample_number:.2%}')

# Convert back to canonical SMILES so duplicates compare equal
valid_smiles = [Chem.MolToSmiles(mol) for mol in valid_mols]
# high uniqueness
print('Uniqueness: ', f'{len(set(valid_smiles)) / len(valid_smiles):.2%}')

# Of the valid SMILES generated, how many are truly original vs. occurring in the training data?
training_data = pd.read_csv('./datasets/all_smiles_clean.smi', header=None)
training_set = set(training_data[0])
original = [smile for smile in valid_smiles if smile not in training_set]
print('Originality: ', f'{len(set(original)) / len(set(valid_smiles)):.2%}')

Validity:  65.76%
Uniqueness:  99.80%
Originality:  98.87%
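
The three metrics above use different denominators: validity is measured over all samples, uniqueness over the valid ones, and originality over the unique valid ones. The sketch below packages that logic as a function. The names `generation_metrics` and `toy_canonicalize` are hypothetical, and `canonicalize` is assumed to be any callable that returns a canonical SMILES string or `None` for an invalid input (with RDKit it could wrap `Chem.MolFromSmiles` and `Chem.MolToSmiles`). Unlike the cell above, this sketch also canonicalizes the training set, since comparing canonical SMILES against a raw file may overstate originality if that file is not in the same canonical form.

```python
def generation_metrics(sampled, training, canonicalize):
    """Return (validity, uniqueness, originality) as fractions in [0, 1]."""
    canonical = [canonicalize(s) for s in sampled]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    # Canonicalize the training set too so both sides use the same form.
    known = {c for c in map(canonicalize, training) if c is not None}
    novel = unique - known
    validity = len(valid) / len(sampled) if sampled else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    originality = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, originality


def toy_canonicalize(smiles):
    """Stand-in for an RDKit round trip: reject strings with unbalanced
    parentheses and treat everything else as already canonical."""
    return smiles if smiles.count("(") == smiles.count(")") else None


print(generation_metrics(["CCO", "CCO", "CC(C", "CCN"], ["CCO"], toy_canonicalize))
```

In the toy call, three of four samples are valid, two of those three are distinct, and one of the two distinct molecules is absent from the training list.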


In [18]:
with open('./generations/gen0.smi', 'w') as f:
    for item in valid_smiles:
        f.write("%s\n" % item)
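
The cell above writes every valid SMILES, so the few duplicates implied by the 99.80% uniqueness figure end up in `gen0.smi` as well. If gen0 should list each molecule once, an order-preserving deduplication before writing is one option. This is a sketch of that alternative, not what the notebook does; the helper name `write_unique_smiles` is hypothetical, and the demo writes to a temporary directory rather than `./generations/`.

```python
import tempfile
from pathlib import Path


def write_unique_smiles(smiles, path):
    """Write each distinct SMILES once, in first-seen order; return the count."""
    unique = list(dict.fromkeys(smiles))  # dicts preserve insertion order
    Path(path).write_text("".join(f"{s}\n" for s in unique))
    return len(unique)


# Demo with made-up input in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "gen0_demo.smi"
    print(write_unique_smiles(["CCO", "CCN", "CCO"], out))
    print(out.read_text().splitlines())
```

In the notebook, the call would look like `write_unique_smiles(valid_smiles, './generations/gen0.smi')`.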