# Drug Explorer (DrugEx v3): Scaffold-Constrained Drug Design with Graph Transformer-based Reinforcement Learning
https://github.com/XuhanLiu/DrugEx




Due to the large drug-like chemical space available to search for feasible drug-like molecules, rational drug design often starts from specific scaffolds to which side chains/substituents are added or modified. With the rapid growth of the application of deep learning in drug discovery, a variety of effective approaches have been developed for de novo drug design. In previous work, we proposed a method named DrugEx, which can be applied in polypharmacology based on multi-objective deep reinforcement learning. However, the previous version is trained under fixed objectives similar to other known methods and does not allow users to input any prior information (i.e. a desired scaffold). In order to improve the general applicability, we updated DrugEx to design drug molecules based on scaffolds which consist of multiple fragments provided by users. In this work, the Transformer model was employed to generate molecular structures. The Transformer is a multi-head self-attention deep learning model containing an encoder to receive scaffolds as input and a decoder to generate molecules as output. In order to deal with the graph representation of molecules we proposed a novel positional encoding for each atom and bond based on an adjacency matrix to extend the architecture of the Transformer. Each molecule was generated by growing and connecting procedures for the fragments in the given scaffold that were unified into one model. Moreover, we trained this generator under a reinforcement learning framework to increase the number of desired ligands. As a proof of concept, our proposed method was applied to design ligands for the adenosine A2A receptor (A2AAR) and compared with SMILES-based methods. The results demonstrated the effectiveness of our method in that 100% of the generated molecules are valid and most of them had a high predicted affinity value towards A2AAR with given scaffolds.

In [16]:
# Dependencies
# Python >= 3.7
# Numpy (version >= 1.19)
# Scikit-Learn (version >= 0.23)
# Pandas (version >= 1.2.2)
# PyTorch (version == 1.7)
# Matplotlib (version >= 2.0)
# RDKit (version >= 2020.03)

# You have
# !python -V
# 3.7.7
# import numpy as np
# print(np.__version__)
# 1.16.5
# import sklearn as sklearn
# print(sklearn.__version__)
# 0.24.1
# import pandas as pd
# print(pd.__version__)
# 0.24.2
# import torch
# print(torch.__version__)
# not available at jupyter60
# import matplotlib
# print(matplotlib.__version__)
# import rdkit as rdkit
# print(rdkit.__version__)
# 2018.09.1
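
The comparison between the required versions and the installed versions listed above can be automated. The following is a minimal sketch and not part of DrugEx: `parse_version` and `meets_minimum` are hypothetical helper names, and the parser assumes plain dotted numeric versions such as `1.19` or `2018.09.1`.

```python
import re


def parse_version(text):
    """Turn a dotted version string such as '2018.09.1' into a tuple of ints."""
    parts = []
    for piece in text.split('.'):
        # Keep the leading digits of each component; stop at the first non-numeric one.
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    have, need = parse_version(installed), parse_version(required)
    width = max(len(have), len(need))
    # Pad with zeros so that '1.19' and '1.19.0' compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# (installed, required) pairs copied from the dependency cell above.
report = {
    'numpy': ('1.16.5', '1.19'),
    'scikit-learn': ('0.24.1', '0.23'),
    'pandas': ('0.24.2', '1.2.2'),
    'rdkit': ('2018.09.1', '2020.03'),
}
outdated = sorted(name for name, (have, need) in report.items()
                  if not meets_minimum(have, need))
print(outdated)  # → ['numpy', 'pandas', 'rdkit']
```

With the versions reported above, NumPy, Pandas and RDKit fall below the stated minimums, so they would need upgrading before the scripts are run.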


In [None]:
#!python -m pip install --user numpy --upgrade
#!python -m pip install --user tensorflow --upgrade
#!python -m pip install --user numpy pycocotools==2.0.0


## <ins>Let's start</ins> 

We'll start with the required imports. These include the [Keras](https://keras.io/) and [Tensorflow](https://www.tensorflow.org/) libraries for the neural network models, [Pandas](https://pandas.pydata.org/) and [Numpy](https://numpy.org/) to process data, as well as other relevant Python libraries.

In [1]:
from __future__ import print_function
# general imports
%matplotlib inline
import tensorflow as tf
#import tensorflow.compat.v1 as tf
#tf.disable_v2_behavior() 
import keras
from keras import initializers
from keras.layers import Dense
from keras.models import Sequential
from keras import optimizers
from keras import regularizers
import pandas as pd
import seaborn as sns
#from matplotlib import pyplot as plt
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import plotly.express as px
import numpy as np
import csv
import copy
import random
import rdkit as rdkit
print(rdkit.__version__)
from rdkit import Chem
#from rdkit.Chem import AllChem as Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem import Crippen
from rdkit.Chem import Descriptors, Descriptors3D 
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Chem import Lipinski, rdDepictor, rdMolDescriptors
from rdkit.Chem import MolSurf
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
from rdkit.Chem import MACCSkeys
from rdkit.Chem.Fingerprints import FingerprintMols
from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
#https://github.com/bp-kelley/descriptastorus
from mordred import Calculator, descriptors
import time
rdDepictor.SetPreferCoordGen(True)


#from util import partTypeNum
#import util
# from tqdm import tqdm
# from sklearn.decomposition import PCA
# from sklearn.manifold import TSNE
# from sklearn.cluster import KMeans
# from sklearn.preprocessing import StandardScaler as Scaler

import pandas as pd
import numpy as np

#import pymatgen as pymat
#import mendeleev as mendel
from subprocess import call
import gzip

from scipy.stats import norm
from IPython.display import HTML

# keras imports
from keras.layers import (Input, Dense, Conv1D, MaxPool1D, Dropout, GRU, LSTM, TimeDistributed, Add, Flatten, RepeatVector, Lambda, Concatenate)
from keras.models import Model, load_model
from keras.metrics import binary_crossentropy
from keras import initializers, regularizers
from keras.callbacks import EarlyStopping
import keras.backend as K

# Visualization
from keras_sequential_ascii import keras2ascii


# from utils import label_map_util
# from utils import visualization_utils as vis_util

#from object_detection.utils import label_map_util
#from object_detection.utils import visualization_utils as vis_util

# utils functions
#from python_utils import *
from utils import *

# Hacky MacOS fix for Tensorflow runtimes... (You won't need this unless you are on MacOS)
# This fixes a display bug with progress bars that can pop up on MacOS sometimes.
#import sys
#import os
#sys.path.insert(0, '../src/')

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

# Remove warnings from output
import warnings
warnings.filterwarnings('ignore')


#!python -V
#import tensorflow as tf
#print(tf.__version__)
#import numpy as np
#print(np.__version__)

Using TensorFlow backend.


2018.09.1


## To design novel drug molecules with the SMILES representation, run the following scripts in sequence:



### 1 dataset.py:

Prepare your dataset for pre-training and fine-tuning the RNN model, which serves as the initial state of the exploitation network and the exploration network.
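
The `corpus` function in the script keeps a SMILES string only if its token sequence contains carbon, contains none of the listed metal tokens, and is longer than 10 and at most 100 tokens (including the `EOS` marker). The sketch below reproduces that filter in isolation. It rests on one assumption: `simple_tokenize` is a hypothetical regex tokenizer standing in for `utils.VocSmiles.split`, whose actual behaviour is not shown in this notebook.

```python
import re

# Simplified tokenizer (an assumption, not the DrugEx vocabulary):
# bracket atoms first, then two-letter halogens, then single characters.
TOKEN_PATTERN = re.compile(r'\[[^\]]+\]|Br|Cl|.')


def simple_tokenize(smiles):
    """Split a SMILES string into tokens and append the end-of-sequence marker."""
    return TOKEN_PATTERN.findall(smiles) + ['EOS']


def keep_smiles(smiles, metals=('[Na]', '[Zn]'), min_len=10, max_len=100):
    """Apply the same three filters as `corpus`: carbon, metals, token length."""
    tokens = simple_tokenize(smiles)
    if {'C', 'c'}.isdisjoint(tokens):
        return False  # no carbon atom at all
    if not set(metals).isdisjoint(tokens):
        return False  # contains an excluded metal token
    return min_len < len(tokens) <= max_len


for smi in ['Cn1cnc2c1c(=O)n(C)c(=O)n2C', 'CCO', '[Na]OC(=O)c1ccccc1', 'O=S(=O)(O)O']:
    print(smi, keep_smiles(smi))
```

Only the first string (caffeine) passes: ethanol is too short, the third contains a sodium token, and sulfuric acid has no carbon.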

In [None]:
import pandas as pd
from rdkit import Chem
from rdkit import rdBase
from rdkit.Chem import Recap, BRICS
from rdkit.Chem.MolStandardize import rdMolStandardize
from tqdm import tqdm
from utils import VocSmiles as Voc
import utils
import re
import numpy as np
from itertools import combinations
import gzip
import getopt, sys
rdBase.DisableLog('rdApp.info')
rdBase.DisableLog('rdApp.warning')


def corpus(input, output, suffix='sdf'):
    if suffix =='sdf':
        inf = gzip.open(input)
        mols = Chem.ForwardSDMolSupplier(inf)
        # mols = [mol for mol in suppl]
    else:
        df = pd.read_table(input).Smiles.dropna()
        mols = [Chem.MolFromSmiles(s) for s in df]
    voc = Voc('data/voc_smiles.txt')
    charger = rdMolStandardize.Uncharger()
    chooser = rdMolStandardize.LargestFragmentChooser()
    disconnector = rdMolStandardize.MetalDisconnector()
    normalizer = rdMolStandardize.Normalizer()
    words = set()
    canons = []
    tokens = []
    smiles = set()
    for mol in tqdm(mols):
        try:
            mol = disconnector.Disconnect(mol)
            mol = normalizer.normalize(mol)
            mol = chooser.choose(mol)
            mol = charger.uncharge(mol)
            mol = disconnector.Disconnect(mol)
            mol = normalizer.normalize(mol)
            smileR = Chem.MolToSmiles(mol, 0)
            smiles.add(Chem.CanonSmiles(smileR))
        except:
            print('Parsing Error:') #, Chem.MolToSmiles(mol))

    for smile in tqdm(smiles):
        token = voc.split(smile) + ['EOS']
        if {'C', 'c'}.isdisjoint(token):
            print('Warning:', smile)
            continue
        if not {'[Na]', '[Zn]'}.isdisjoint(token):
            print('Redundant', smile)
            continue
        if 10 < len(token) <= 100:
            words.update(token)
            canons.append(smile)
            tokens.append(' '.join(token))
    log = open(output + '_voc.txt', 'w')
    log.write('\n'.join(sorted(words)))
    log.close()

    log = pd.DataFrame()
    log['Smiles'] = canons
    log['Token'] = tokens
    log = log.drop_duplicates(subset='Smiles')
    log.to_csv(output + '_corpus.txt', sep='\t', index=False)


def graph_corpus(input, output, suffix='sdf'):
    metals = {'Na', 'Zn', 'Li', 'K', 'Ca', 'Mg', 'Ag', 'Cs', 'Ra', 'Rb', 'Al', 'Sr', 'Ba', 'Bi'}
    voc = utils.VocGraph('data/voc_atom.txt')
    inf = gzip.open(input)
    if suffix == 'sdf':
        mols = Chem.ForwardSDMolSupplier(inf)
        total = 2e6
    else:
        mols = pd.read_table(input).drop_duplicates(subset=['Smiles']).dropna(subset=['Smiles'])
        total = len(mols)
        mols = mols.iterrows()
    vals = {}
    exps = {}
    codes, ids = [], []
    chooser = rdMolStandardize.LargestFragmentChooser()
    disconnector = rdMolStandardize.MetalDisconnector()
    normalizer = rdMolStandardize.Normalizer()
    for i, mol in enumerate(tqdm(mols, total=total)):
        if mol is None: continue
        if suffix != 'sdf':
            idx = mol[1]['Molecule ChEMBL ID']

            mol = Chem.MolFromSmiles(mol[1].Smiles)
        else:
            idx = mol.GetPropsAsDict()
            idx = idx['chembl_id']
        try:
            mol = disconnector.Disconnect(mol)
            mol = normalizer.normalize(mol)
            mol = chooser.choose(mol)
            mol = disconnector.Disconnect(mol)
            mol = normalizer.normalize(mol)
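        except Exception:
            # Skip molecules that cannot be standardized (e.g. an unparsable SMILES gives mol = None)
            print(idx)
            continue
        symb = [a.GetSymbol() for a in mol.GetAtoms()]
        # Bonds of the molecule; their count is used as a size filter below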
        bonds = mol.GetBonds()
        if len(bonds) < 4 or len(bonds) >= 63: continue
        if {'C'}.isdisjoint(symb): continue
        if not metals.isdisjoint(symb): continue

        smile = Chem.MolToSmiles(mol)
        try:
            s0 = smile.replace('[O]', 'O').replace('[C]', 'C') \
                 .replace('[N]', 'N').replace('[B]', 'B') \
                 .replace('[2H]', '[H]').replace('[3H]', '[H]')
            s0 = Chem.CanonSmiles(s0, 0)
            code = voc.encode([smile])
            s1 = voc.decode(code)[0]
            assert s0 == s1
            codes.append(code[0].reshape(-1).tolist())
            ids.append(idx)
        except Exception as ex:
            print(ex)
            print('Parse Error:', idx)
    df = pd.DataFrame(codes, index=ids, columns=['C%d' % i for i in range(64*4)])
    df.to_csv(output, sep='\t', index=True)
    print(vals)
    print(exps)


def pair_frags(fname, out, method='Recap', is_mf=True):
    smiles = pd.read_table(fname).Smiles.dropna()
    pairs = []
    for i, smile in enumerate(tqdm(smiles)):
        smile = utils.clean_mol(smile)
        mol = Chem.MolFromSmiles(smile)
        if method.lower() == 'recap':
            frags = np.array(sorted(Recap.RecapDecompose(mol).GetLeaves().keys()))
        else:
            frags = BRICS.BRICSDecompose(mol)
            frags = np.array(sorted({re.sub(r'\[\d+\*\]', '*', f) for f in frags}))
        if len(frags) == 1: continue
        du, hy = Chem.MolFromSmiles('*'), Chem.MolFromSmiles('[H]')
        subs = np.array([Chem.MolFromSmiles(f) for f in frags])
        subs = np.array([Chem.RemoveHs(Chem.ReplaceSubstructs(f, du, hy, replaceAll=True)[0]) for f in subs])
        subs = np.array([m for m in subs if m.GetNumAtoms() > 1])
        match = np.array([[m.HasSubstructMatch(f) for f in subs] for m in subs])
        frags = subs[match.sum(axis=0) == 1]
        # NOTE: `voc` is the module-level VocGraph created in the __main__ block below
        frags = sorted(frags, key=lambda x:-x.GetNumAtoms())[:voc.n_frags]
        frags = [Chem.MolToSmiles(Chem.RemoveHs(f)) for f in frags]

        max_comb = len(frags) if is_mf else 1
        for ix in range(1, max_comb+1):
            combs = combinations(frags, ix)
            for comb in combs:
                input = '.'.join(comb)
                if len(input) > len(smile): continue
                if mol.HasSubstructMatch(Chem.MolFromSmarts(input)):
                    pairs.append([input, smile])
    df = pd.DataFrame(pairs, columns=['Frags', 'Smiles'])
    df.to_csv(out, sep='\t',  index=False)


def pair_graph_encode(fname, voc, out):
    df = pd.read_table(fname)
    col = ['C%d' % d for d in range(voc.max_len*5)]
    codes = []
    for i, row in tqdm(df.iterrows(), total=len(df)):
        frag, smile = row.Frags, row.Smiles
        # smile = voc_smi.decode(row.Output.split(' '))
        # frag = voc_smi.decode(row.Input.split(' '))
        mol = Chem.MolFromSmiles(smile)
        total = mol.GetNumBonds()
        if total >= 75 or smile == frag:
            continue
        try:
            # s = utils.clean_mol(smile)
            # f = utils.clean_mol(frag, is_deep=False)
            output = voc.encode([smile], [frag])
            f, s = voc.decode(output)

            assert smile == s[0]
            # assert f == frag[0]
            code = output[0].reshape(-1).tolist()
            codes.append(code)
        except:
            print(i, frag, smile)
    codes = pd.DataFrame(codes, columns=col)
    codes.to_csv(out, sep='\t', index=False)


def pair_smiles_encode(fname, voc, out):
    df = pd.read_table(fname)
    col = ['Input', 'Output']
    codes = []
    for i, row in tqdm(df.iterrows(), total=len(df)):
        frag, smile = row.Frags, row.Smiles
        mol = voc.split(smile)
        if len(mol) > 100: continue
        sub = voc.split(frag)
        codes.append([' '.join(sub), ' '.join(mol)])
    codes = pd.DataFrame(codes, columns=col)
    codes.to_csv(out, sep='\t', index=False)


def pos_neg_split():
    pair = ['Target ChEMBL ID', 'Smiles', 'pChEMBL Value', 'Comment',
            'Standard Type', 'Standard Relation']
    obj = pd.read_table('data/LIGAND.tsv').dropna(subset=pair[1:2])
    df = obj[obj[pair[0]] == 'CHEMBL251']
    df = df[pair].set_index(pair[1])
    numery = df[pair[2]].groupby(pair[1]).mean().dropna()

    comments = df[(df.Comment.str.contains('Not Active') == True)]
    inhibits = df[(df['Standard Type'] == 'Inhibition') & df['Standard Relation'].isin(['<', '<='])]
    relations = df[df['Standard Type'].isin(['EC50', 'IC50', 'Kd', 'Ki']) & df['Standard Relation'].isin(['>', '>='])]
    binary = pd.concat([comments, inhibits, relations], axis=0)
    binary = binary[~binary.index.isin(numery.index)]
    binary[pair[2]] = 3.99
    binary = binary[pair[2]].groupby(binary.index).first()
    df = pd.concat([numery, binary])  # Series.append was removed in pandas 2.0
    pos = {utils.clean_mol(s) for s in df[df >=6.5].index}
    neg = {utils.clean_mol(s) for s in df[df < 6.5].index}.difference(pos)
    oth = obj[~obj.Smiles.isin(df.index)].Smiles
    oth = {utils.clean_mol(s) for s in oth}.difference(pos).difference(neg)
    for name, data in [('pos', pos), ('neg', neg), ('oth', oth)]:
        with open('data/ligand_%s.tsv' % name, 'w') as file:
            file.write('Smiles\n')
            file.write('\n'.join(data))


def train_test_split(fname, out):
    df = pd.read_table(fname)
    frags = set(df.Frags)
    test_in = df.Frags.drop_duplicates().sample(len(frags) // 10)
    test = df[df.Frags.isin(test_in)]
    train = df[~df.Frags.isin(test_in)]
    test.to_csv(out + '_test.txt', sep='\t', index=False)
    train.to_csv(out + '_train.txt', sep='\t', index=False)


if __name__ == '__main__':
    opts, args = getopt.getopt(sys.argv[1:], "d:m:f:")
    OPT = dict(opts)
    method = OPT.get('-m', 'brics')
    dataset = OPT.get('-d', 'chembl')
    is_mf = bool(int(OPT.get('-f', 1)))  # parse '-f 0' as False; bool('0') would be True
    BATCH_SIZE = 256

    corpus('data/LIGAND_RAW.tsv', 'data/ligand', suffix='tsv')
    corpus('data/chembl_27.sdf.gz', 'data/chembl')

    voc = utils.VocGraph('data/voc_graph.txt', n_frags=4)
    voc_smi = utils.VocSmiles('data/voc_smiles.txt')
    out = 'data/%s_%s_%s' % (dataset, 'mf' if is_mf else 'sf', method)
    # NOTE: both calls write to the same file (out + '.txt'), so the second overwrites the first
    pair_frags('data/chembl_corpus.txt', out + '.txt', method=method, is_mf=is_mf)
    pair_frags('data/ligand_corpus.txt', out + '.txt', method=method, is_mf=is_mf)
    train_test_split('data/chembl_mf_brics.txt', 'data/chembl_mf_brics')
    train_test_split('data/ligand_mf_brics.txt', 'data/ligand_mf_brics')
    for ds in ['train']:
        pair_graph_encode(out + '_%s.txt' % ds, voc, out + '_%s_code.txt' % ds)
        pair_smiles_encode(out + '_%s.txt' % ds, voc_smi, out + '_%s_smi.txt' % ds)
    pos_neg_split()
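
The `train_test_split` function above does not split rows at random. It holds out one tenth of the unique fragment inputs and sends every pair built from a held-out fragment to the test set, so no fragment input appears in both sets. A minimal pure-Python sketch of that idea follows; `split_by_fragment` and its `seed` argument are hypothetical names introduced here, and the toy pairs are made-up placeholders rather than real molecules.

```python
import random


def split_by_fragment(pairs, holdout_divisor=10, seed=0):
    """Split (fragment, smiles) pairs so that no fragment appears in both sets."""
    rng = random.Random(seed)
    unique_frags = sorted({frag for frag, _ in pairs})
    # Hold out len // 10 of the unique fragments, as in train_test_split.
    n_test = len(unique_frags) // holdout_divisor
    held_out = set(rng.sample(unique_frags, n_test))
    train = [pair for pair in pairs if pair[0] not in held_out]
    test = [pair for pair in pairs if pair[0] in held_out]
    return train, test


# Toy data: 20 distinct fragment labels, each paired with 5 molecule labels.
pairs = [('frag%d' % (i % 20), 'mol%d' % i) for i in range(100)]
train, test = split_by_fragment(pairs)
print(len(train), len(test))  # → 90 10
```

Splitting on fragments rather than on rows means the test set measures how the generator handles scaffolds it has never been given as input.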

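
The labelling rule in `pos_neg_split` can be summarised as follows: each molecule gets the mean of its pChEMBL values, molecules that are only reported as inactive receive the placeholder value 3.99, and a molecule counts as positive when its score is at least 6.5. The sketch below shows that rule on plain dictionaries. `label_ligands` is a hypothetical helper, and the single-letter keys stand in for SMILES strings.

```python
from statistics import mean

INACTIVE_VALUE = 3.99    # placeholder score for qualitative inactives, as in pos_neg_split
ACTIVE_THRESHOLD = 6.5   # pChEMBL cut-off used in pos_neg_split


def label_ligands(measured, inactive_flags):
    """Label molecules as positive or negative.

    measured: dict mapping a molecule to its list of pChEMBL values.
    inactive_flags: molecules reported as inactive without a usable value.
    """
    scores = {mol: mean(values) for mol, values in measured.items() if values}
    for mol in inactive_flags:
        # A numeric measurement takes precedence over the inactive flag.
        scores.setdefault(mol, INACTIVE_VALUE)
    pos = {mol for mol, score in scores.items() if score >= ACTIVE_THRESHOLD}
    neg = set(scores) - pos
    return pos, neg


measured = {'A': [7.0, 6.5], 'B': [5.0], 'C': [6.0, 7.0]}
pos, neg = label_ligands(measured, inactive_flags=['B', 'D'])
print(sorted(pos), sorted(neg))  # → ['A', 'C'] ['B', 'D']
```

Molecule `C` has a mean of exactly 6.5 and is therefore positive, while `D` has no measurement and is labelled negative through the 3.99 placeholder.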

### 2 environ.py:

Train your predictor, which acts as the environment and provides the final reward for the agent's actions. Its performance can also be evaluated through n-fold cross-validation and an independent test.
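
All model functions in this script share one evaluation pattern: the data are split into five folds, each sample receives an out-of-fold prediction, and the predictions of the five fold models on the independent set are averaged. The sketch below isolates that pattern without any machine-learning library. `kfold_indices`, `MeanModel` and `cross_validate` are hypothetical names, and the toy regressor only predicts the mean of its training labels.

```python
def kfold_indices(n_samples, n_folds=5):
    """Yield (train, valid) index lists for contiguous, unshuffled folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid
        start = stop


class MeanModel:
    """Toy regressor that predicts the mean of its training labels."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


def cross_validate(X, y, X_ind, n_folds=5):
    cvs = [0.0] * len(y)        # out-of-fold prediction for every sample
    inds = [0.0] * len(X_ind)   # running sum of independent-set predictions
    for train, valid in kfold_indices(len(y), n_folds):
        model = MeanModel().fit([X[i] for i in train], [y[i] for i in train])
        for i, pred in zip(valid, model.predict([X[i] for i in valid])):
            cvs[i] = pred
        for j, pred in enumerate(model.predict(X_ind)):
            inds[j] += pred
    # Average the independent-set predictions over the fold models.
    return cvs, [total / n_folds for total in inds]


X = [[i] for i in range(10)]
y = [float(i) for i in range(10)]
cvs, inds = cross_validate(X, y, X_ind=[[0], [1]])
print(cvs[0], cvs[-1], inds)  # → 5.5 3.5 [4.5, 4.5]
```

Because every entry of `cvs` comes from a model that never saw that sample, it gives an estimate of generalisation, while `inds` behaves like a small ensemble of the fold models.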

In [None]:
#!/usr/bin/env python
import numpy as np
import pandas as pd
import torch
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler as Scaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC, SVR
from sklearn.model_selection import StratifiedKFold, KFold
from torch.utils.data import DataLoader, TensorDataset
import models
import os
import utils
import joblib
from copy import deepcopy
from rdkit import Chem


def SVM(X, y, X_ind, y_ind, reg=False):
    """ Cross validation and independent test for SVM classification/regression model.
        Arguments:
            X (np.ndarray): m x n feature matrix for cross validation, where m is the number of samples
                and n is the number of features.
            y (np.ndarray): m-d label array for cross validation, where m is the number of samples and
                equals the number of rows of X.
            X_ind (np.ndarray): k x n feature matrix for the independent set, where k is the number of
                independent samples and n is the number of features.
            y_ind (np.ndarray): k-d label array for the independent set, where k equals the number of
                rows of X_ind.
            reg (bool): if True, the training is for regression, otherwise for classification.
        Returns:
            cvs (np.ndarray): cross-validation predictions with the same shape as y; each sample is
                predicted by the fold model that did not see it during training.
            inds (np.ndarray): independent-test predictions with the same shape as y_ind, averaged
                over the 5 fold models.
    """
    if reg:
        folds = KFold(5).split(X)
        alg = SVR()
    else:
        folds = StratifiedKFold(5).split(X, y)
        alg = SVC(probability=True)
    cvs = np.zeros(y.shape)
    inds = np.zeros(y_ind.shape)
    gs = GridSearchCV(deepcopy(alg), {'C': 2.0 ** np.array([-15, 15]), 'gamma': 2.0 ** np.array([-15, 15])}, n_jobs=10)
    gs.fit(X, y)
    params = gs.best_params_
    print(params)
    for i, (trained, valided) in enumerate(folds):
        model = deepcopy(alg)
        model.C = params['C']
        model.gamma = params['gamma']
        if not reg:
            model.probability=True
        model.fit(X[trained], y[trained], sample_weight=[1 if v >= 4 else 0.1 for v in y[trained]])
        if reg:
            cvs[valided] = model.predict(X[valided])
            inds += model.predict(X_ind)
        else:
            cvs[valided] = model.predict_proba(X[valided])[:, 1]
            inds += model.predict_proba(X_ind)[:, 1]
    return cvs, inds / 5


def RF(X, y, X_ind, y_ind, reg=False):
    """ Cross validation and independent test for RF classification/regression model.
        Arguments:
            X (np.ndarray): m x n feature matrix for cross validation, where m is the number of samples
                and n is the number of features.
            y (np.ndarray): m-d label array for cross validation, where m is the number of samples and
                equals the number of rows of X.
            X_ind (np.ndarray): k x n feature matrix for the independent set, where k is the number of
                independent samples and n is the number of features.
            y_ind (np.ndarray): k-d label array for the independent set, where k equals the number of
                rows of X_ind.
            reg (bool): if True, the training is for regression, otherwise for classification.
        Returns:
            cvs (np.ndarray): cross-validation predictions with the same shape as y; each sample is
                predicted by the fold model that did not see it during training.
            inds (np.ndarray): independent-test predictions with the same shape as y_ind, averaged
                over the 5 fold models.
    """
    if reg:
        folds = KFold(5).split(X)
        alg = RandomForestRegressor
    else:
        folds = StratifiedKFold(5).split(X, y)
        alg = RandomForestClassifier
    cvs = np.zeros(y.shape)
    inds = np.zeros(y_ind.shape)
    for i, (trained, valided) in enumerate(folds):
        model = alg(n_estimators=1000, n_jobs=10)
        model.fit(X[trained], y[trained], sample_weight=[1 if v >= 4 else 0.1 for v in y[trained]])
        if reg:
            cvs[valided] = model.predict(X[valided])
            inds += model.predict(X_ind)
        else:
            cvs[valided] = model.predict_proba(X[valided])[:, 1]
            inds += model.predict_proba(X_ind)[:, 1]
    return cvs, inds / 5


def KNN(X, y, X_ind, y_ind, reg=False):
    """ Cross validation and independent test for KNN classification/regression model.
        Arguments:
            X (np.ndarray): m x n feature matrix for cross validation, where m is the number of samples
                and n is the number of features.
            y (np.ndarray): m-d label array for cross validation, where m is the number of samples and
                equals the number of rows of X.
            X_ind (np.ndarray): k x n feature matrix for the independent set, where k is the number of
                independent samples and n is the number of features.
            y_ind (np.ndarray): k-d label array for the independent set, where k equals the number of
                rows of X_ind.
            reg (bool): if True, the training is for regression, otherwise for classification.
        Returns:
            cvs (np.ndarray): cross-validation predictions with the same shape as y; each sample is
                predicted by the fold model that did not see it during training.
            inds (np.ndarray): independent-test predictions with the same shape as y_ind, averaged
                over the 5 fold models.
    """
    if reg:
        folds = KFold(5).split(X)
        alg = KNeighborsRegressor
    else:
        folds = StratifiedKFold(5).split(X, y)
        alg = KNeighborsClassifier
    cvs = np.zeros(y.shape)
    inds = np.zeros(y_ind.shape)
    for i, (trained, valided) in enumerate(folds):
        model = alg(n_jobs=10)
        model.fit(X[trained], y[trained])
        if reg:
            cvs[valided] = model.predict(X[valided])
            inds += model.predict(X_ind)
        else:
            cvs[valided] = model.predict_proba(X[valided])[:, 1]
            inds += model.predict_proba(X_ind)[:, 1]
    return cvs, inds / 5


def NB(X, y, X_ind, y_ind):
    """ Cross validation and independent test for Naive Bayes classification model.
        Arguments:
            X (np.ndarray): m x n feature matrix for cross validation, where m is the number of samples
                and n is the number of features.
            y (np.ndarray): m-d label array for cross validation, where m is the number of samples and
                equals the number of rows of X.
            X_ind (np.ndarray): k x n feature matrix for the independent set, where k is the number of
                independent samples and n is the number of features.
            y_ind (np.ndarray): k-d label array for the independent set, where k equals the number of
                rows of X_ind.
        Returns:
            cvs (np.ndarray): cross-validation probabilities of the positive class with the same
                shape as y; each sample is predicted by the fold model that did not see it.
            inds (np.ndarray): independent-test probabilities of the positive class with the same
                shape as y_ind, averaged over the 5 fold models.
    """
    folds = KFold(5).split(X)
    cvs = np.zeros(y.shape)
    inds = np.zeros(y_ind.shape)
    for i, (trained, valided) in enumerate(folds):
        model = GaussianNB()
        model.fit(X[trained], y[trained], sample_weight=[1 if v >= 4 else 0.1 for v in y[trained]])
        cvs[valided] = model.predict_proba(X[valided])[:, 1]
        inds += model.predict_proba(X_ind)[:, 1]
    return cvs, inds / 5


def PLS(X, y, X_ind, y_ind):
    """ Cross validation and independent test for PLS regression model.
        Arguments:
            X (np.ndarray): m x n feature matrix for cross validation, where m is the number of samples
                and n is the number of features.
            y (np.ndarray): m-d label array for cross validation, where m is the number of samples and
                equals the number of rows of X.
            X_ind (np.ndarray): k x n feature matrix for the independent set, where k is the number of
                independent samples and n is the number of features.
            y_ind (np.ndarray): k-d label array for the independent set, where k equals the number of
                rows of X_ind.
        Returns:
            cvs (np.ndarray): cross-validation predictions with the same shape as y; each sample is
                predicted by the fold model that did not see it during training.
            inds (np.ndarray): independent-test predictions with the same shape as y_ind, averaged
                over the 5 fold models.
    """
    folds = KFold(5).split(X)
    cvs = np.zeros(y.shape)
    inds = np.zeros(y_ind.shape)
    for i, (trained, valided) in enumerate(folds):
        model = PLSRegression()
        model.fit(X[trained], y[trained])
        cvs[valided] = model.predict(X[valided])[:, 0]
        inds += model.predict(X_ind)[:, 0]
    return cvs, inds / 5


def DNN(X, y, X_ind, y_ind, out, reg=False):
    """ Cross validation and independent test for DNN classification/regression model.
        Arguments:
            X (np.ndarray): m x n feature matrix for cross validation, where m is the number of samples
                and n is the number of features.
            y (np.ndarray): m x l label matrix for cross validation, where m is the number of samples and
                equals the number of rows of X, and l is the number of types.
            X_ind (np.ndarray): k x n feature matrix for the independent set, where k is the number of
                independent samples and n is the number of features.
            y_ind (np.ndarray): k x l label matrix for the independent set, where k equals the number of
                rows of X_ind, and l is the number of types.
            out (str): path prefix passed to `net.fit` for each fold, suffixed with the fold index.
            reg (bool): if True, the training is for regression, otherwise for classification.
        Returns:
            cvs (np.ndarray): m x l matrix of cross-validation predictions; each sample is predicted
                by the fold model that did not see it during training.
            inds (np.ndarray): k x l matrix of independent-test predictions, averaged over the
                5 fold models.
    """
    if y.shape[1] > 1 or reg:
        folds = KFold(5).split(X)
    else:
        folds = StratifiedKFold(5).split(X, y[:, 0])
    NET = models.STFullyConnected if y.shape[1] == 1 else models.MTFullyConnected
    indep_set = TensorDataset(torch.Tensor(X_ind), torch.Tensor(y_ind))
    indep_loader = DataLoader(indep_set, batch_size=BATCH_SIZE)
    cvs = np.zeros(y.shape)
    inds = np.zeros(y_ind.shape)
    for i, (trained, valided) in enumerate(folds):
        train_set = TensorDataset(torch.Tensor(X[trained]), torch.Tensor(y[trained]))
        train_loader = DataLoader(train_set, batch_size=BATCH_SIZE)
        valid_set = TensorDataset(torch.Tensor(X[valided]), torch.Tensor(y[valided]))
        valid_loader = DataLoader(valid_set, batch_size=BATCH_SIZE)
        net = NET(X.shape[1], y.shape[1], is_reg=reg)
        net.fit(train_loader, valid_loader, out='%s_%d' % (out, i), epochs=N_EPOCH, lr=LR)
        cvs[valided] = net.predict(valid_loader)
        inds += net.predict(indep_loader)
    return cvs, inds / 5


def Train_RF(X, y, out, reg=False):
    if reg:
        model = RandomForestRegressor(n_estimators=1000, n_jobs=10)
    else:
        model = RandomForestClassifier(n_estimators=1000, n_jobs=10)
    model.fit(X, y, sample_weight=[1 if v >= 4 else 0.1 for v in y])
    joblib.dump(model, out, compress=3)


def mt_task(fname, out, reg=False, is_extra=True, time_split=False):
    df = pd.read_table(fname)[pair].dropna(subset=pair[1:2])
    df = df[df.Target_ChEMBL_ID.isin(trgs)]
    year = df.groupby(pair[1])[pair[-1:]].min().dropna()
    year = year[year.Document_Year > 2015].index
    df = df[pair].set_index(pair[0:2])
    numery = df[pair[2]].groupby(pair[0:2]).mean().dropna()

    comments = df[(df.Comment.str.contains('Not Active') == True)]
    inhibits = df[(df.Standard_Type == 'Inhibition') & df.Standard_Relation.isin(['<', '<='])]
    relations = df[df.Standard_Type.isin(['EC50', 'IC50', 'Kd', 'Ki']) & df.Standard_Relation.isin(['>', '>='])]
    binary = pd.concat([comments, inhibits, relations], axis=0)
    binary = binary[~binary.index.isin(numery.index)]
    binary[pair[2]] = 3.99
    binary = binary[pair[2]].groupby(pair[0:2]).first()
    df = pd.concat([numery, binary]) if is_extra else numery
    if not reg:
        df[pair[2]] = (df[pair[2]] > th).astype(float)
    df = df.unstack(pair[0])
    test_ix = set(df.index).intersection(year)

    df_test = df.loc[list(test_ix)] if time_split else df.sample(len(test_ix))
    df_data = df.drop(df_test.index)
    df_data = df_data.sample(len(df_data))
    for alg in ['RF', 'MT_DNN', 'SVM', 'PLS', 'KNN', 'DNN']:
        if alg == 'MT_DNN':
            test_x = utils.Predictor.calc_fp([Chem.MolFromSmiles(mol) for mol in df_test.index])
            data_x = utils.Predictor.calc_fp([Chem.MolFromSmiles(mol) for mol in df_data.index])
            scaler = Scaler(); scaler.fit(data_x)
            test_x = scaler.transform(test_x)
            data_x = scaler.transform(data_x)

            data = df_data.stack().to_frame(name='Label')
            test = df_test.stack().to_frame(name='Label')
            data_p, test_p = DNN(data_x, df_data.values, test_x, df_test.values, out=out, reg=reg)
            data['Score'] = pd.DataFrame(data_p, index=df_data.index, columns=df_data.columns).stack()
            test['Score'] = pd.DataFrame(test_p, index=df_test.index, columns=df_test.columns).stack()
            data.to_csv(out + alg + '_LIGAND.cv.tsv', sep='\t')
            test.to_csv(out + alg + '_LIGAND.ind.tsv', sep='\t')
        else:
            for trg in trgs:
                test_y = df_test[trg].dropna()
                data_y = df_data[trg].dropna()
                test_x = utils.Predictor.calc_fp([Chem.MolFromSmiles(mol) for mol in test_y.index])
                data_x = utils.Predictor.calc_fp([Chem.MolFromSmiles(mol) for mol in data_y.index])
                if alg != 'RF':
                    scaler = Scaler(); scaler.fit(data_x)
                    test_x = scaler.transform(test_x)
                    data_x = scaler.transform(data_x)
                else:
                    X = np.concatenate([data_x, test_x], axis=0)
                    y = np.concatenate([data_y.values, test_y.values], axis=0)
                    Train_RF(X, y, out=out + '%s_%s.pkg' % (alg, trg), reg=reg)
                data, test = data_y.to_frame(name='Label'), test_y.to_frame(name='Label')
                a, b = cross_validation(data_x, data.values, test_x, test.values,
                                        alg, out + '%s_%s' % (alg, trg), reg=reg)
                data['Score'], test['Score'] = a, b
                data.to_csv(out + '%s_%s.cv.tsv' % (alg, trg), sep='\t')
                test.to_csv(out + '%s_%s.ind.tsv' % (alg, trg), sep='\t')


def single_task(feat, alg='RF', reg=False, is_extra=True):
    df = pd.read_table('data/LIGAND_RAW.tsv').dropna(subset=pair[1:2])
    df = df[df[pair[0]] == feat]
    df = df[pair].set_index(pair[1])
    year = df[pair[-1:]].groupby(pair[1]).min().dropna()
    test = year[year[pair[-1]] > 2015].index
    numery = df[pair[2]].groupby(pair[1]).mean().dropna()

    comments = df[(df.Comment.str.contains('Not Active') == True)]
    inhibits = df[(df.Standard_Type == 'Inhibition') & df.Standard_Relation.isin(['<', '<='])]
    relations = df[df.Standard_Type.isin(['EC50', 'IC50', 'Kd', 'Ki']) & df.Standard_Relation.isin(['>', '>='])]
    binary = pd.concat([comments, inhibits, relations], axis=0)
    binary = binary[~binary.index.isin(numery.index)]
    binary[pair[2]] = 3.99
    binary = binary[pair[2]].groupby(binary.index).first()
    df = pd.concat([numery, binary]) if is_extra else numery  # Series.append was removed in pandas 2.0
    if not reg:
        df = (df > th).astype(float)
    df = df.sample(len(df))
    print(feat, len(numery[numery >= th]), len(numery[numery < th]), len(binary))

    test_ix = list(set(df.index).intersection(test))  # .loc needs a list, not a set, in recent pandas
    test = df.loc[test_ix].dropna()
    data = df.drop(test.index)

    test_x = utils.Predictor.calc_fp([Chem.MolFromSmiles(mol) for mol in test.index])
    data_x = utils.Predictor.calc_fp([Chem.MolFromSmiles(mol) for mol in data.index])
    out = 'output/single/%s_%s_%s' % (alg, 'REG' if reg else 'CLS', feat)
    if alg != 'RF':
        scaler = Scaler(); scaler.fit(data_x)
        test_x = scaler.transform(test_x)
        data_x = scaler.transform(data_x)
    else:
        X = np.concatenate([data_x, test_x], axis=0)
        y = np.concatenate([data.values, test.values], axis=0)
        Train_RF(X, y[:, 0], out=out + '.pkg', reg=reg)
    data, test = data.to_frame(name='Label'), test.to_frame(name='Label')
    data['Score'], test['Score'] = cross_validation(data_x, data.values, test_x, test.values, alg, out, reg=reg)
    data.to_csv(out + '.cv.tsv', sep='\t')
    test.to_csv(out + '.ind.tsv', sep='\t')


def cross_validation(X, y, X_ind, y_ind, alg='DNN', out=None, reg=False):
    if alg == 'RF':
        cv, ind = RF(X, y[:, 0], X_ind, y_ind[:, 0], reg=reg)
    elif alg == 'SVM':
        cv, ind = SVM(X, y[:, 0], X_ind, y_ind[:, 0], reg=reg)
    elif alg == 'KNN':
        cv, ind = KNN(X, y[:, 0], X_ind, y_ind[:, 0], reg=reg)
    elif alg == 'NB':
        cv, ind = NB(X, y[:, 0], X_ind, y_ind[:, 0])
    elif alg == 'PLS':
        cv, ind = PLS(X, y[:, 0], X_ind, y_ind[:, 0])
    elif alg == 'DNN':
        cv, ind = DNN(X, y, X_ind, y_ind, out=out, reg=reg)
    return cv, ind


if __name__ == '__main__':
    pair = ['Target_ChEMBL_ID', 'Smiles', 'pChEMBL_Value', 'Comment',
            'Standard_Type', 'Standard_Relation', 'Document_Year']
    BATCH_SIZE = int(2 ** 11)
    N_EPOCH = 1000
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    th = 6.5
    trgs = ['CHEMBL226', 'CHEMBL251', 'CHEMBL240']

    for reg in [False, True]:
        LR = 1e-4 if reg else 1e-5
        for chembl in trgs:
            single_task(chembl, 'DNN', reg=reg)
            single_task(chembl, 'RF', reg=reg)
            single_task(chembl, 'SVM', reg=reg)
            if reg:
                single_task(chembl, 'PLS', reg=reg)
            else:
                single_task(chembl, 'NB', reg=reg)
            single_task(chembl, 'KNN', reg=reg)

    mt_task('data/LIGAND_RAW.tsv', 'output/random_split/', reg=reg, time_split=False)
    mt_task('data/LIGAND_RAW.tsv', 'output/time_split/', reg=reg, time_split=True)

### 3. train_graph.py:

Pre-training and training the graph Transformer model on the graph representation under supervised learning and reinforcement learning frameworks, respectively.

In [None]:
#!/usr/bin/env python
import torch
from rdkit import rdBase
from models.explorer import GraphExplorer
import utils
import pandas as pd
from models import GraphModel
from torch.utils.data import DataLoader
import getopt
import sys
import os
import numpy as np
import time
from shutil import copy2

np.random.seed(2)
torch.manual_seed(2)
rdBase.DisableLog('rdApp.error')
torch.set_num_threads(1)


def pretrain():
    out = 'output/%s_graph_%d' % (dataset, BATCH_SIZE)
    agent.fit(train_loader, valid_loader, epochs=1000, out=out)


def train_ex():
    agent.load_state_dict(torch.load(params['pr_path'] + '.pkg', map_location=utils.dev))

    prior = GraphModel(voc)
    prior.load_state_dict(torch.load(params['ft_path'] + '.pkg', map_location=utils.dev))

    evolver = GraphExplorer(agent, mutate=prior)

    evolver.batch_size = BATCH_SIZE
    evolver.epsilon = float(OPT.get('-e', '1e-2'))
    evolver.sigma = float(OPT.get('-b', '0.00'))
    evolver.scheme = OPT.get('-s', 'WS')
    evolver.repeat = 1

    keys = ['A2A', 'QED']
    A2A = utils.Predictor('output/env/RF_%s_CHEMBL251.pkg' % z, type=z)
    QED = utils.Property('QED')

    # Chose the desirability function
    objs = [A2A, QED]

    if evolver.scheme == 'WS':
        mod1 = utils.ClippedScore(lower_x=3, upper_x=10)
        mod2 = utils.ClippedScore(lower_x=0, upper_x=1.0)
        ths = [0.5, 0]
    else:
        mod1 = utils.ClippedScore(lower_x=3, upper_x=6.5)
        mod2 = utils.ClippedScore(lower_x=0, upper_x=1.0)
        ths = [0.99, 0]
    mods = [mod1, mod2]
    evolver.env = utils.Env(objs=objs, mods=mods, keys=keys, ths=ths)

    # import evolve as agent
    evolver.out = root + '/%s_%s_%.0e' % (alg, evolver.scheme, evolver.epsilon)
    evolver.fit(train_loader, test_loader=valid_loader)


if __name__ == "__main__":
    params = {'pr_path': 'output/ligand_mf_brics_graph_256', 'ft_path': 'output/ligand_mf_brics_graph_256'}
    opts, args = getopt.getopt(sys.argv[1:], "a:e:b:d:g:s:z:")
    OPT = dict(opts)
    z = OPT.get('-z', 'REG')
    alg = OPT.get('-a', 'graph')
    devs = OPT.get('-g', "0")
    utils.devices = eval(devs) if ',' in devs else [eval(devs)]
    torch.cuda.set_device(utils.devices[0])
    os.environ["CUDA_VISIBLE_DEVICES"] = devs

    BATCH_SIZE = int(OPT.get('-b', '128'))
    dataset = OPT.get('-d', 'ligand_mf_brics')

    voc = utils.VocGraph('data/voc_atom.txt', max_len=80, n_frags=4)
    data = pd.read_table('data/%s_train_code.txt' % dataset)
    data = torch.from_numpy(data.values).long().view(len(data), voc.max_len, -1)
    train_loader = DataLoader(data, batch_size=BATCH_SIZE * 4, drop_last=True, shuffle=True)

    test = pd.read_table('data/%s_test_code.txt' % dataset)
    # test = test.sample(int(1e4))
    test = torch.from_numpy(test.values).long().view(len(test), voc.max_len, -1)
    valid_loader = DataLoader(test, batch_size=BATCH_SIZE * 10, drop_last=True, shuffle=True)

    agent = GraphModel(voc).to(utils.dev)
    root = 'output/%s_%s' % (alg, time.strftime('%y%m%d_%H%M%S', time.localtime()))

    os.mkdir(root)
    copy2(alg + '_ex.py', root)
    copy2(alg + '.py', root)

    pretrain()
    train_ex()


### 4. train_smiles.py:

Pre-training and training the SMILES-based deep learning models on the SMILES representation under supervised learning and reinforcement learning frameworks, respectively.

In [None]:
#!/usr/bin/env python
import os
import pandas as pd
from shutil import copy2
import utils
import getopt
import sys
import time
import torch
from models import GPT2Model
from models import generator
from models.explorer import SmilesExplorer
from torch.utils.data import DataLoader, TensorDataset


def pretrain(method='gpt'):
    if method == 'ved':
        agent = generator.EncDec(voc, voc).to(utils.dev)
    elif method == 'attn':
        agent = generator.Seq2Seq(voc, voc).to(utils.dev)
    else:
        agent = GPT2Model(voc, n_layer=12).to(utils.dev)

    out = 'output/%s_%s_%d' % (dataset, method, BATCH_SIZE)
    agent.fit(data_loader, test_loader, epochs=1000, out=out)


def rl_train():
    opts, args = getopt.getopt(sys.argv[1:], "a:e:b:g:c:s:z:")
    OPT = dict(opts)
    case = OPT['-c'] if '-c' in OPT else 'OBJ1'
    z = OPT['-z'] if '-z' in OPT else 'REG'
    alg = OPT['-a'] if '-a' in OPT else 'smile'
    os.environ["CUDA_VISIBLE_DEVICES"] = OPT['-g'] if '-g' in OPT else "0,1,2,3"

    voc = utils.VocSmiles(init_from_file="data/chembl_voc.txt", max_len=100)
    agent = GPT2Model(voc, n_layer=12)
    agent.load_state_dict(torch.load(params['pr_path'] + '.pkg', map_location=utils.dev))

    prior = GPT2Model(voc, n_layer=12)
    prior.load_state_dict(torch.load(params['ft_path'] + '.pkg', map_location=utils.dev))

    evolver = SmilesExplorer(agent, mutate=prior)

    evolver.batch_size = BATCH_SIZE
    evolver.epsilon = float(OPT.get('-e', '1e-2'))
    evolver.sigma = float(OPT.get('-b', '0.00'))
    evolver.scheme = OPT.get('-s', 'WS')
    evolver.repeat = 1

    keys = ['A2A', 'QED']
    A2A = utils.Predictor('output/env/RF_%s_CHEMBL251.pkg' % z, type=z)
    QED = utils.Property('QED')

    # Chose the desirability function
    objs = [A2A, QED]

    if evolver.scheme == 'WS':
        mod1 = utils.ClippedScore(lower_x=3, upper_x=10)
        mod2 = utils.ClippedScore(lower_x=0, upper_x=1)
        ths = [0.5, 0]
    else:
        mod1 = utils.ClippedScore(lower_x=3, upper_x=6.5)
        mod2 = utils.ClippedScore(lower_x=0, upper_x=0.5)
        ths = [0.99, 0]
    mods = [mod1, mod2]
    evolver.env = utils.Env(objs=objs, mods=mods, keys=keys, ths=ths)

    root = 'output/%s_%s' % (alg, time.strftime('%y%m%d_%H%M%S', time.localtime()))

    os.mkdir(root)
    copy2(alg + '_ex.py', root)
    copy2(alg + '.py', root)

    # import evolve as agent
    evolver.out = root + '/%s_%s_%s_%s_%.0e' % (alg, evolver.scheme, z, case, evolver.epsilon)
    evolver.fit(data_loader, test_loader=test_loader)


if __name__ == "__main__":
    params = {'pr_path': 'output/ligand_mf_brics_gpt_256', 'ft_path': 'output/ligand_mf_brics_gpt_256'}
    opts, args = getopt.getopt(sys.argv[1:], "m:g:b:d:")
    OPT = dict(opts)
    torch.cuda.set_device(0)
    os.environ["CUDA_VISIBLE_DEVICES"] = OPT.get('-g', "0,1,2,3")
    method = OPT.get('-m', 'gpt')
    step = OPT.get('-s')  # '-s' is not in the option string above, so OPT['-s'] would raise KeyError
    BATCH_SIZE = int(OPT.get('-b', '256'))
    dataset = OPT.get('-d', 'ligand_mf_brics')

    data = pd.read_table('data/%s_train_smi.txt' % dataset)
    test = pd.read_table('data/%s_test_smi.txt' % dataset)
    test = test.Input.drop_duplicates().sample(BATCH_SIZE * 10).values
    if method in ['gpt']:
        voc = utils.Voc('data/voc_smiles.txt', src_len=100, trg_len=100)
    else:
        voc = utils.VocSmiles('data/voc_smiles.txt', max_len=100)
    data_in = voc.encode([seq.split(' ') for seq in data.Input.values])
    data_out = voc.encode([seq.split(' ') for seq in data.Output.values])
    data_set = TensorDataset(data_in, data_out)
    data_loader = DataLoader(data_set, batch_size=BATCH_SIZE, shuffle=True)

    test_set = voc.encode([seq.split(' ') for seq in test])
    test_set = utils.TgtData(test_set, ix=[voc.decode(seq, is_tk=False) for seq in test_set])
    test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, collate_fn=test_set.collate_fn)

    pretrain(method=method)
    rl_train()

### 5. designer.py:

Finally, molecules are generated with a well-trained deep learning model, using either the graph or the SMILES representation.

In [None]:
#!/usr/bin/env python
import torch
from rdkit import rdBase
from models import generator
import utils
import pandas as pd
from models import GPT2Model, GraphModel
from torch.utils.data import DataLoader
import getopt
import sys
import os


rdBase.DisableLog('rdApp.error')
torch.set_num_threads(1)
BATCH_SIZE = 1024


if __name__ == "__main__":
    opts, args = getopt.getopt(sys.argv[1:], "m:d:g:p:")
    OPT = dict(opts)
    # torch.cuda.set_device(0)
    os.environ["CUDA_VISIBLE_DEVICES"] = OPT['-g'] if '-g' in OPT else "0, 1, 2, 3"
    method = OPT['-m'] if '-m' in OPT else 'atom'
    dataset = OPT['-d'] if '-d' in OPT else 'ligand_mf_brics'
    path = OPT['-p'] if '-p' in OPT else dataset
    utils.devices = [0]

    if method in ['gpt']:
        voc = utils.Voc('data/chembl_voc.txt', src_len=100, trg_len=100)
    else:
        voc = utils.VocSmiles('data/chembl_voc.txt', max_len=100)
    if method == 'ved':
        agent = generator.EncDec(voc, voc).to(utils.dev)
    elif method == 'attn':
        agent = generator.Seq2Seq(voc, voc).to(utils.dev)
    elif method == 'gpt':
        agent = GPT2Model(voc, n_layer=12).to(utils.dev)
    else:
        voc = utils.VocGraph('data/voc_atom.txt')
        agent = GraphModel(voc_trg=voc)

    for agent_path in ['benchmark/graph_PR_REG_OBJ1_0e+00.pkg', 'benchmark/graph_PR_REG_OBJ1_1e-01.pkg',
                       'benchmark/graph_PR_REG_OBJ1_1e-02.pkg', 'benchmark/graph_PR_REG_OBJ1_1e-03.pkg',
                       'benchmark/graph_PR_REG_OBJ1_1e-04.pkg', 'benchmark/graph_PR_REG_OBJ1_1e-05.pkg']:
        # agent_path = 'output/%s_%s_256.pkg' % (path, method)
        print(agent_path)
        agent.load_state_dict(torch.load(agent_path))

        z = 'REG'
        keys = ['A2A']
        A2A = utils.Predictor('output/env/RF_%s_CHEMBL251.pkg' % z, type=z)
        QED = utils.Property('QED')

        # Chose the desirability function
        objs = [A2A, QED]

        ths = [6.5, 0.0]

        env = utils.Env(objs=objs, mods=None, keys=keys, ths=ths)
        if method in ['atom']:
            data = pd.read_table('data/ligand_mf_brics_test.txt')
            # data = data.sample(BATCH_SIZE * 10)
            data = torch.from_numpy(data.values).long().view(len(data), voc.max_len, -1)
            loader = DataLoader(data, batch_size=BATCH_SIZE)

            out = '%s.txt' % agent_path
        else:
            data = pd.read_table('data/%s_test.txt' % dataset).Input.drop_duplicates()
            # data = data.sample(BATCH_SIZE * 10)
            data = voc.encode([seq.split(' ')[:-1] for seq in data.values])
            loader = DataLoader(data, batch_size=BATCH_SIZE)

            out = agent_path + '.txt'
        frags, smiles, scores = agent.evaluate(loader, repeat=10, method=env)
        scores['Frags'] = frags
        scores['Smiles'] = smiles
        scores.to_csv(out, index=False, sep='\t')


### 6. plot.py:

It provides a variety of methods to measure the performance of every step in the training process of DrugEx, and produces the figures used to visualize the results.

In [None]:
from rdkit import Chem
import pandas as pd
import numpy as np
from rdkit.Chem import Draw
from utils.metric import logP_mw, dimension
import seaborn as sns
from matplotlib_venn import venn3
from scipy import stats
from matplotlib import pyplot as plt


def figure3(out='Figure_3.tif'):
    fig = plt.figure(figsize=(8, 8))
    dataset = ['LIGAND+', 'LIGAND-', 'LIGAND0', 'ChEMBL']
    ax1 = fig.add_subplot(221)
    num = pd.DataFrame(columns=['Num', 'Set'])
    for ds in dataset:
        sub = pd.read_table('figures/%s_num.txt' % ds, dtype=float)
        sub['Set'] = ds
        num = pd.concat([num, sub])  # DataFrame.append was removed in pandas 2.0
    num = num.dropna()
    sns.set(style="white", palette="pastel", color_codes=True)
    sns.violinplot(x='Set', y='Num', data=num, order=dataset, linewidth=1.5, bw=0.8)
    plt.text(0.02, 0.95, chr(ord('A')), fontweight="bold", transform=ax1.transAxes)
    ax1.set(ylim=[0.0, 15.0], xlabel='Dataset', ylabel='Number of Fragments per Molecule')

    frags = []
    ax2 = fig.add_subplot(222)
    for ds in dataset:
        sub = pd.read_table('figures/%s_frag.txt' % ds)
        frag = set(sub['Frags'])
        frags.append(frag)
        sns.kdeplot(sub['MW'], shade=True, linewidth=1.5, label=ds)
    plt.text(0.02, 0.95, chr(ord('B')), fontweight="bold", transform=ax2.transAxes)
    ax2.set(xlabel='Molecular Weight', ylabel='Value')

    ax3 = fig.add_subplot(223)
    for ds in dataset:
        sub = np.loadtxt('figures/%s_div.txt' % ds)
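        np.fill_diagonal(sub, np.nan)  # mask the self-similarity on the diagonal (np.NaN was removed in NumPy 2.0)
        sub = sub[~np.isnan(sub)]      # keep the off-diagonal values as a flat array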
        sns.kdeplot(sub, shade=True, linewidth=1.5, label=ds)
    plt.text(0.02, 0.95, chr(ord('C')), fontweight="bold", transform=ax3.transAxes)
    ax3.set(xlim=[0.0, 0.5], xlabel='Tanimoto Similarity', ylabel='Value')

    ax4 = fig.add_subplot(224)
    venn3(frags[:-1], set_labels=dataset[:-1])
    plt.text(0.02, 0.95, chr(ord('D')), fontweight="bold", transform=ax4.transAxes)
    fig.subplots_adjust(wspace=0.5, hspace=0.5)
    # plt.tight_layout()
    if out is None:
        plt.show()
    else:
        plt.savefig(out, dpi=600, bbox_inches = "tight", pil_kwargs={"compression": "tiff_lzw"})


def figure4():
    fnames = ['data/chembl_mf_brics_test.txt', 'benchmark/chembl_mix.txt']
    labels, keys = [], []
    fig = plt.figure(figsize=(12, 8))
    lab = ['ChEMBL Set', 'Pre-trained Model']

    ax1 = fig.add_subplot(231)
    df = logP_mw(fnames)
    group0, group1 = df[df.LABEL == 0], df[df.LABEL == 1]
    plt.text(0.05, 0.9, chr(ord('A')), fontweight="bold", transform=ax1.transAxes)
    ax1.scatter(group0.MWT, group0.LOGP, s=1, marker='o', label=lab[0], c='', edgecolor=colors[0])
    ax1.scatter(group1.MWT, group1.LOGP, s=10, marker='o', label=lab[1], c='', edgecolor=colors[1])
    ax1.set(ylabel='LogP', xlabel='Molecular Weight', xlim=[0, 1000], ylim=[-5, 10])
    handle, label = ax1.get_legend_handles_labels()
    labels.extend(handle)
    keys.extend(label)

    ax2 = fig.add_subplot(232)
    df, ratio = dimension(fnames, fp='physchem')
    group0, group1 = df[df.LABEL == 0], df[df.LABEL == 1]
    plt.text(0.05, 0.9, chr(ord('C')), fontweight="bold", transform=ax2.transAxes)
    ax2.scatter(group0.X, group0.Y, s=1, marker='o', label=lab[0], c='', edgecolor=colors[0])
    ax2.scatter(group1.X, group1.Y, s=10, marker='o', label=lab[1], c='', edgecolor=colors[1])
    ax2.set(ylabel='Principal Component 2 (%.2f%%)' % (ratio[1] * 100),
            xlabel='Principal Component 1 (%.2f%%)' % (ratio[0] * 100))

    ax3 = fig.add_subplot(233)
    # df, ratio = dimension(fnames, alg='TSNE')
    df = pd.read_table('t-SNE_pr.txt')
    group0, group1 = df[df.LABEL == 0], df[df.LABEL == 1]
    plt.text(0.05, 0.9, chr(ord('E')), fontweight="bold", transform=ax3.transAxes)
    ax3.scatter(group0.X, group0.Y, s=1, marker='o', label=lab[0], c='', edgecolor=colors[0])
    ax3.scatter(group1.X, group1.Y, s=10, marker='o', label=lab[1], c='', edgecolor=colors[1])
    ax3.set(ylabel='Component 2', xlabel='Component 1')

    fnames = ['data/ligand_mf_brics_test.txt', 'benchmark/ligand_mix.txt']
    lab = ['LIGAND Set', 'Fine-tuned Model']
    ax4 = fig.add_subplot(234)
    df = logP_mw(fnames)
    group0, group1 = df[df.LABEL == 0], df[df.LABEL == 1]
    plt.text(0.05, 0.9, chr(ord('B')), fontweight="bold", transform=ax4.transAxes)
    ax4.scatter(group0.MWT, group0.LOGP, s=10, marker='o', label=lab[0], c='', edgecolor=colors[2])
    ax4.scatter(group1.MWT, group1.LOGP, s=1, marker='o', label=lab[1], c='', edgecolor=colors[3])
    ax4.set(ylabel='LogP', xlabel='Molecular Weight', xlim=[0, 1000], ylim=[-5, 10])
    handle, label = ax4.get_legend_handles_labels()
    labels.extend(handle)
    keys.extend(label)

    ax5 = fig.add_subplot(235)
    df, ratio = dimension(fnames, fp='physchem')
    group0, group1 = df[df.LABEL == 0], df[df.LABEL == 1]
    plt.text(0.05, 0.9, chr(ord('D')), fontweight="bold", transform=ax5.transAxes)
    ax5.scatter(group0.X, group0.Y, s=10, marker='o', label=lab[0], c='', edgecolor=colors[2])
    ax5.scatter(group1.X, group1.Y, s=1, marker='o', label=lab[1], c='', edgecolor=colors[3])
    ax5.set(ylabel='Principal Component 2 (%.2f%%)' % (ratio[1] * 100),
            xlabel='Principal Component 1 (%.2f%%)' % (ratio[0] * 100))

    ax6 = fig.add_subplot(236)
    # df, ratio = dimension(fnames, alg='TSNE')
    df = pd.read_table('t-SNE_ft.txt')
    group0, group1 = df[df.LABEL == 0], df[df.LABEL == 1]
    plt.text(0.05, 0.9, chr(ord('F')), fontweight="bold", transform=ax6.transAxes)
    ax6.scatter(group0.X, group0.Y, s=10, marker='o', label=lab[0], c='', edgecolor=colors[2])
    ax6.scatter(group1.X, group1.Y, s=1, marker='o', label=lab[1], c='', edgecolor=colors[3])
    ax6.set(ylabel='Component 2', xlabel='Component 1')

    fig.legend(labels, keys, loc="lower center", ncol=len(keys), bbox_to_anchor=(0.45, 0.00))
    fig.subplots_adjust(wspace=0.5, hspace=0.5)
    # plt.tight_layout()
    plt.savefig('Figure_4.tif', dpi=600, bbox_inches = "tight", pil_kwargs={"compression": "tiff_lzw"})


def figure5():
    fig = plt.figure(figsize=(8, 8))
    objs = ['QED', 'SA']
    ix = 0
    keys, labels = [], []
    methods = ['ved', 'attn', 'gpt', 'graph']
    for i, d in enumerate(['chembl', 'ligand']):
        labs = ['ChEMBL Set' if d == 'chembl' else 'LIGAND Set',
                  'LSTM-BASE', 'LSTM+ATTN', 'Sequence Transformer', 'Graph Transformer']
        dfs = {}
        fnames = ['benchmark/%s_set_qed_sa.txt' % d] + ['benchmark/%s_%s_qed_sa.txt' % (d, m) for m in methods]
        for j, fname in enumerate(fnames):
            dfs[labs[j]] = pd.read_table(fname)
        for k, obj in enumerate(objs):
            ix += 1
            ax = plt.subplot(220 + ix)
            plt.text(0.02, 0.9, chr(ord('A') + ix - 1), fontweight="bold", transform=ax.transAxes)
            for l, (key, df) in enumerate(dfs.items()):
                if obj in ['SA']:
                    xx = np.linspace(0, 10, 1000)
                else:
                    xx = np.linspace(0, 1, 1000)
                data = df[obj].values
                density = stats.gaussian_kde(data)(xx)
                if key in ['ChEMBL Set']:
                    color = colors[0]
                elif key in ['LIGAND Set']:
                    color = colors[1]
                else:
                    color = colors[l+1]
                label = plt.plot(xx, density, c=color)[0]
                if (i == 0 and k == 0) or (key in ['LIGAND Set'] and k == 0):
                    keys.append(key)
                    labels.append(label)
            # ax.title(obj + ' Score')
    fig.legend(labels, keys, loc="upper center", ncol=3, bbox_to_anchor=(0.45, 0.08))
    fig.subplots_adjust(wspace=0.35, hspace=0.35)
    fig.savefig('figure_5.tif', dpi=600, bbox_inches="tight", pil_kwargs={"compression": "tiff_lzw"})


def figure6():
    fig = plt.figure(figsize=(12, 8))
    er = ['0e+00', '1e-01', '2e-01', '3e-01', '4e-01', '5e-01']
    ers = {'0e+00': '0.0', '1e-01': '0.1', '2e-01': '0.2', '3e-01': '0.3', '4e-01': '0.4', '5e-01': '0.5'}
    # df = dimension(['benchmark/ligand_rl_%s.txt' % e for e in ers], alg='TSNE')
    # df.to_csv('t-SNE.txt', index=False, sep='\t')
    df = pd.read_table('t-SNE.txt')
    for i, e in enumerate(er):
        group0 = df[df.LABEL == 0]
        # group0 = group0[group0['QED'] > 0.4]
        group1 = df[df.LABEL == i + 1]
        ax = fig.add_subplot(231 + i)
        plt.text(0.02, 0.9, chr(ord('A') + i), fontweight="bold", transform=ax.transAxes)
        ax.scatter(group1.X, group1.Y, s=10, marker='.', label='ε = %s' % ers[e], c='', edgecolor=colors[2])
        ax.scatter(group0.X, group0.Y, s=10, marker='o', label='LIGAND set', c='', edgecolor=colors[1])
        ax.set(ylabel='Component 2', xlabel='Component 1')
        ax.legend(loc='upper right')
    plt.savefig('Figure_6.tif', dpi=600, bbox_inches = "tight", pil_kwargs={"compression": "tiff_lzw"})


def figure7():
    df = pd.read_table('benchmark/ligand_rl_2e-01.txt')
    df = df[df.DESIRE == 1]
    subs = ['c1cocc1.n1cncnc1.n1c[nH]nc1',
            'c1cocc1.O=c1[nH]c(=O)c2nc[nH]c2[nH]1', 'c1cocc1.Nc1ncc2[nH]nnc2n1',
            'c1cocc1.n1cncnc1', 'c1cocc1.n1c[nH]nc1','n1cncnc1.n1c[nH]nc1']

    subset = {sub: [] for sub in subs}
    submol = {sub: Chem.MolFromSmiles(sub) for sub in subs}
    for smile in df.Smiles:
        if smile != smile: continue
        mol = Chem.MolFromSmiles(smile)
        for sub in subs:
            s = submol[sub]
            match = mol.HasSubstructMatch(s)
            if match:
                subset[sub].append(smile)
                break
    for i, sub in enumerate(subs[::-1]):
        if len(subset[sub]) > 120:
            mol = list(subset[sub])[:60]
        else:
            mol = list(subset[sub])
        mols = [Chem.MolFromSmiles(m) for m in mol]
        img = Draw.MolsToGridImage(mols, molsPerRow=6, subImgSize=(400, 300))
        img.save('figures/figure_6_%d.tif' % i)
        print(mol)


if __name__ == '__main__':
    colors = ['#ff7f0e', '#1f77b4', '#d62728', '#2ca02c', '#9467bd', 'cyan']  # orange, blue, red, green, purple, cyan
    figure3()
    figure4()
    figure5()
    figure6()
    figure7()

In addition, this toolkit provides further scripts that define special data structures, model architectures, coefficient measurements, etc.

#### 1 models/*.py:

It contains all of the deep learning models that may be used in this project, including single/multiple fully-connected regression/classification models, RNN generative models and highway CNN classification models.

#### 2 utils/vocab.py:

It defines some special data structures, such as the vocabulary of SMILES tokens and of elements in the graph, the molecule dataset and the environment, together with some methods for checking SMILES and graphs.

#### 3 utils/metric.py:

It contains the statistical methods for extracting properties from generated molecules.

#### 4 utils/fingerprints.py:

It provides a variety of chemical fingerprint calculations, such as ECFP and MACCS.
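Fingerprints such as ECFP and MACCS are bit vectors, and two molecules are usually compared through the Tanimoto similarity of their fingerprints. The following is a minimal pure-Python sketch, assuming each fingerprint is given as the set of its on-bit indices; the function name `tanimoto` and the example bit sets are illustrative and are not taken from `utils/fingerprints.py`.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical on-bit indices for two molecules.
fp1 = {1, 5, 9, 42}
fp2 = {1, 9, 17, 42, 63}

similarity = tanimoto(fp1, fp2)   # 3 shared bits out of 6 distinct bits
distance = 1.0 - similarity       # the corresponding Tanimoto distance
print(similarity, distance)
```

The Tanimoto distance is simply one minus the similarity, which is the quantity a diversity measure would typically work with.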

#### 5 utils/modifier.py

It provides a variety of desirability functions to normalize the scoring functions. For more details, please check the GuacaMol benchmark.
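To illustrate what a desirability function does, here is a minimal sketch of a clipped linear score. It assumes that a modifier such as `ClippedScore(lower_x, upper_x)` maps raw values linearly onto [0, 1] and clips everything outside the interval; the function `clipped_score` below is a stand-in written for this example, and the actual implementation in `utils/modifier.py` may differ.

```python
def clipped_score(x, lower_x, upper_x):
    """Rescale x linearly to [0, 1], clipping values outside [lower_x, upper_x]."""
    if upper_x == lower_x:
        raise ValueError("lower_x and upper_x must differ")
    scaled = (x - lower_x) / (upper_x - lower_x)
    return min(1.0, max(0.0, scaled))


# A predicted pChEMBL value of 6.5 with the interval [3, 10]:
print(clipped_score(6.5, lower_x=3, upper_x=10))   # 0.5
# Values outside the interval are clipped to 0 or 1:
print(clipped_score(2.0, lower_x=3, upper_x=10))   # 0.0
print(clipped_score(12.0, lower_x=3, upper_x=10))  # 1.0
```

Under this assumed linear form, an affinity of 6.5 on the interval [3, 10] lands exactly at 0.5, which is consistent with the `ths = [0.5, 0]` setting used alongside `ClippedScore(lower_x=3, upper_x=10)` in the training scripts.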

#### 6 utils/objective.py

It provides the construction of different scoring functions, including similarity scores, chemical properties, QSAR modelling, etc. Moreover, it can also integrate multiple objectives into an environment to calculate the reward for the agent.
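As a rough illustration of how several normalized objectives can be combined into one reward, the sketch below uses a weighted sum and a per-objective threshold check. The helper names `weighted_sum_reward` and `is_desired`, the equal default weights, and the `>=` comparison are all assumptions made for this example; they are not the interface of `utils.Env`.

```python
def weighted_sum_reward(scores, weights=None):
    """Combine normalized objective scores (each in [0, 1]) into a single reward."""
    if weights is None:
        weights = [1.0] * len(scores)   # assumption: equal weights by default
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def is_desired(scores, thresholds):
    """A molecule counts as desired when every score reaches its threshold."""
    return all(s >= t for s, t in zip(scores, thresholds))


# Hypothetical normalized scores for one molecule: [affinity, QED]
scores = [0.5, 0.75]
reward = weighted_sum_reward(scores)
print(reward)                          # 0.625
print(is_desired(scores, [0.5, 0.0]))  # True
```

A Pareto-based scheme would replace the weighted sum with a ranking of the molecules, but the threshold check on the individual objectives stays the same in this sketch.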

#### 7 utils/nsgaii.py

The implementation of the non-dominated sorting and crowding distance algorithm (NSGA-II). Importantly, we employ PyTorch to accelerate its performance and also modify the calculation of the crowding distance to use the Tanimoto distance.
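The idea behind non-dominated sorting can be sketched in a few lines of plain Python. This version peels off one Pareto front at a time, which is simpler but slower than the bookkeeping used in NSGA-II, and it is not the PyTorch implementation in `utils/nsgaii.py`; the function names and the example scores are illustrative, and all objectives are assumed to be maximized.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(points):
    """Group point indices into Pareto fronts, best front first."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        # A point belongs to the current front if no remaining point dominates it.
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts


# Hypothetical (affinity, QED) scores for five molecules.
scores = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
fronts = non_dominated_sort(scores)
print(fronts)  # [[0, 1, 2], [3], [4]]
```

Molecules within the same front are not comparable by dominance alone, which is where a crowding or diversity measure comes in to rank them.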

#### 8 utils/sacorer.py

The implementation of the SA score, which measures the synthetic accessibility of each molecule. More details about the SA score can be found at:

https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-1-8

Liu X, IJzerman AP, van Westen GJP. Computational Approaches for De Novo Drug Design: Past, Present, and Future. Methods Mol Biol. 2021;2190:139-65.
https://link.springer.com/protocol/10.1007%2F978-1-0716-0826-5_6

Liu X, Ye K, van Vlijmen HWT, IJzerman AP, van Westen GJP. DrugEx v3: Scaffold-Constrained Drug Design with Graph Transformer-based Reinforcement Learning. Preprint
https://chemrxiv.org/engage/chemrxiv/article-details/61aa8b58bc299c0b30887f80

Liu X, Ye K, van Vlijmen HWT, Emmerich MTM, IJzerman AP, van Westen GJP. DrugEx v2: De Novo Design of Drug Molecule by Pareto-based Multi-Objective Reinforcement Learning in Polypharmacology. Journal of Cheminformatics. 2021;13(1):85.
https://doi.org/10.1186/s13321-021-00561-9

Liu X, Ye K, van Vlijmen HWT, IJzerman AP, van Westen GJP. An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine A2A receptor. Journal of Cheminformatics. 2019;11(1):35.
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0355-6


# <ins>Load, view, and preprocess dataset</ins> 


We will use the [ESOL dataset](http://moleculenet.ai/datasets-1) to train our models. The ESOL dataset contains the solubility of various small organic molecules. We will begin by loading the dataset as a dataframe and then inspecting some basic metadata. We'll also preprocess the dataset and create train/test splits for the Convolutional Neural Network (CNN) and Variational AutoEncoder (VAE) models. 

In [2]:
#!ls /home/nanohub/bbishnoi/data/results/vae/qm9.csv
#dataset = pd.read_csv("/home/nanohub/bbishnoi/data/results/vae/qm9.csv")
# read dataset as a dataframe
#dataset = pd.read_csv("../data/ESOL_delaney-processed.csv")

from random import shuffle
dataset = pd.read_csv("./SMILES_RDKit_2D.csv")
#dataset = pd.read_csv("./SMILES_feature.csv")
#dataset = pd.read_csv("gdrive/MyDrive/Colab Notebooks/data/qm9.csv")

# This function randomly arranges the elements so we can have representation for all groups both in the training and testing set
#shuffle(dataset) 

# print column names in dataset
print(f"Columns in dataset: {list(dataset.columns)}")

# print number of rows in dataset
print(f"\nLength of dataset: {len(dataset)}")

# shuffle rows of the dataset (we could do this later as well when doing train/test splits)
dataset = dataset.sample(frac=1, random_state=0)

# show first 20 rows of dataframe
dataset.head(20)

Columns in dataset: ['smiles', 'MaxEStateIndex', 'MinEStateIndex', 'MaxAbsEStateIndex', 'MinAbsEStateIndex', 'qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge', 'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'BalabanJ', 'BertzCT', 'Chi0', 'Chi0n', 'Chi0v', 'Chi1', 'Chi1n', 'Chi1v', 'Chi2n', 'Chi2v', 'Chi3n', 'Chi3v', 'Chi4n', 'Chi4v', 'HallKierAlpha', 'Ipc', 'Kappa1', 'Kappa2', 'Kappa3', 'LabuteASA', 'PEOE_VSA1', 'PEOE_VSA10', 'PEOE_VSA11', 'PEOE_VSA12', 'PEOE_VSA13', 'PEOE_VSA14', 'PEOE_VSA2', 'PEOE_VSA3', 'PEOE_VSA4', 'PEOE_VSA5', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'PEOE_VSA9', 'SMR_VSA1', 'SMR_VSA10', 'SMR_VSA2', 'SMR_VSA3', 'SMR_VSA4', 'SMR_VSA5', 'SMR_VSA6', 'SMR_VSA7', 'SMR_VSA8', 'SMR_VSA9', 'SlogP_VSA1', 'SlogP_VSA10', 'SlogP_VSA11', 'SlogP_VSA12', 'SlogP_VSA2', 'SlogP_VSA3', 'SlogP_VSA4', 'SlogP_VSA5', 'SlogP_VSA6', 'SlogP_VSA7', 'Slo

Unnamed: 0,smiles,MaxEStateIndex,MinEStateIndex,MaxAbsEStateIndex,MinAbsEStateIndex,qed,MolWt,HeavyAtomMolWt,ExactMolWt,NumValenceElectrons,...,fr_sulfide,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_tetrazole,fr_thiazole,fr_thiocyan,fr_thiophene,fr_unbrch_alkane,fr_urea
993,N#CC(Oc1ccccc1)C(c1ccccc1)c1ccc(-c2ccccc2)cc1,9.937998,-0.63669,9.937998,0.188676,0.387502,375.471,354.303,375.162314,140,...,0,0,0,0,0,0,0,0,0,0
859,CCc1ccc(C(C)C(C)OCOC)cc1,5.541674,0.173543,5.541674,0.173543,0.687059,222.328,200.152,222.16198,90,...,0,0,0,0,0,0,0,0,0,0
298,COCOC(c1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1,11.013109,-0.391463,11.013109,0.069104,0.317703,363.413,342.245,363.147058,138,...,0,0,0,0,0,0,0,0,0,0
553,COCOC(OC)C(C)(C)c1ccc(C(=O)OC)cc1,11.417669,-0.452383,11.417669,0.154941,0.567332,282.336,260.16,282.146724,112,...,0,0,0,0,0,0,0,0,0,0
672,COCOCC(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1,11.596376,-0.991628,11.596376,0.093758,0.444664,325.364,306.212,325.131408,124,...,0,0,0,0,0,0,0,0,1,0
971,COC(Br)Cc1ccccc1,5.070751,0.129537,5.070751,0.129537,0.703875,215.09,204.002,213.999327,60,...,0,0,0,0,0,0,0,0,0,0
27,CCc1ccc(C(C#N)C(C#N)OCOC)cc1,9.184892,-0.817198,9.184892,0.003423,0.720398,244.294,228.166,244.121178,94,...,0,0,0,0,0,0,0,0,0,0
231,COCC(C)(C)c1ccc(C#N)cc1,8.659967,0.002714,8.659967,0.002714,0.731433,189.258,174.138,189.115364,74,...,0,0,0,0,0,0,0,0,0,0
306,c1ccc(OC(c2ccccc2)C(c2ccccc2)c2ccccc2)cc1,6.571502,-0.126481,6.571502,0.092598,0.378653,350.461,328.285,350.167065,132,...,0,0,0,0,0,0,0,0,0,0
706,CCc1ccc(C(C)(C)C(OC)c2ccccc2)cc1,5.812137,-0.067089,5.812137,0.052015,0.748005,268.4,244.208,268.182715,106,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# To calculate all the descriptors exposed by rdMolDescriptors.Properties, you can use the following code:
from rdkit.Chem import rdMolDescriptors
descriptor_names = list(rdMolDescriptors.Properties.GetAvailableProperties())
get_descriptors = rdMolDescriptors.Properties(descriptor_names)

print(descriptor_names)
# print(DescriptorSummaries())

In [None]:
# Calculate descriptors from a SMILES string
import numpy as np
from rdkit import Chem

def smi_to_descriptors(smile):
    mol = Chem.MolFromSmiles(smile)
    descriptors = []  # stays empty when the SMILES cannot be parsed
    if mol is not None:
        descriptors = np.array(get_descriptors.ComputeProperties(mol))
    return descriptors


In [None]:
# If the SMILES are in a pandas DataFrame
dataset['descriptors'] = dataset.SMILES.apply(smi_to_descriptors)
#dataset= dataset.SMILES.apply(smi_to_descriptors)
dataset.head()

In [None]:
full_dataset = dataset
full_dataset.head()

In [None]:
from rdkit import Chem    # make sure to import it if you haven't done so
from rdkit.Chem import Descriptors    # make sure to import it if you haven't done so
descriptors_list = [x[0] for x in Descriptors._descList]
print(descriptors_list)

In [None]:
from rdkit.ML.Descriptors import MoleculeDescriptors    # needed for MolecularDescriptorCalculator
calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
type(calc)
mol = Chem.MolFromSmiles('CC1=CC(=C(C=C1NC(=O)C2=C(C(=CC(=C2)I)I)O)Cl)C(C#N)C3=CC=C(C=C3)Cl')
ds = calc.CalcDescriptors(mol)
print(ds)

In [None]:
#!pip install pandas==0.21
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt



#qm9 = pd.read_csv("./SMILES_RDKit_2D.csv")
qm9 = pd.read_csv("./x_df_SMILES_RDKit_2D.csv")

#couple_columns = qm9[['gap','zpve', 'mu']].head(10)
#print(couple_columns.shape)
plt.figure(figsize=(30,30))

# calculate the correlation matrix (numeric columns only; the 'smiles' column is text)
corr = qm9.select_dtypes(include='number').corr()
# plot the heatmap
sns.heatmap(corr,
        xticklabels=corr.columns,
        yticklabels=corr.columns, cmap="YlGnBu")  # alternatives: viridis_r, Spectral_r
# savefig returns None, so its result is not assigned
plt.savefig('./corr_x_df_SMILES_RDKit_2D_YlGnBu.png', dpi=900, facecolor='w', edgecolor='w')

#sns.heatmap(corr, cmap="Blues", annot=True)

#Heat Map using Seaborn
#import numpy as np;
#import seaborn as sns; 

# To translate into Excel Terms for those familiar with Excel
# string 1 is row labels 'helix1 phase'
# string 2 is column labels 'helix 2 phase'
# string 3 is values 'Energy'
# Official pivot documentation
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html

#homo_lumo_mix.pivot('zpve', 'mu','gap').head()
#homo_lumo_mix.pivot('zpve', 'mu')['gap'].head()

#!pip install pandas

In [None]:
# from https://proxy.nanohub.org/weber/2004336/GBdSjVSdDDS3NYpl/4/notebooks/LLZO_MachineLearning.ipynb
# Drop all columns whose standard deviation is zero: they contribute no new information to the models.
#x_df = pd.DataFrame(X)
x_df = pd.read_csv("./features.csv")
x_df = x_df.loc[:, x_df.std() != 0]
print(x_df.shape) # This shape is (#Entries, #Descriptors per entry)
x_df.head()
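
The zero-standard-deviation filter can be illustrated without pandas. This is a minimal sketch, not the notebook's method: `drop_constant_columns` and the small `features` table are hypothetical, and the table is assumed to hold only numeric columns.

```python
import statistics


def drop_constant_columns(table):
    """Keep only the columns whose standard deviation is nonzero.

    table maps a column name to a list of numbers.
    """
    return {name: values
            for name, values in table.items()
            if statistics.pstdev(values) != 0}


# Hypothetical descriptor table: the second column never varies
features = {
    "HeavyAtomCount": [3, 6, 4],
    "NumRadicalElectrons": [0, 0, 0],
}
kept = drop_constant_columns(features)
print(list(kept))
```

A constant column has the same value in every row, so no model can use it to tell molecules apart; dropping it only shrinks the feature matrix.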


In [None]:
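# index=False keeps the row index out of the file, so no 'Unnamed: 0' column appears when it is read back
x_df.to_csv('./x_df_features.csv', index=False)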

In [None]:

#!pip install pandas==0.21
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt



#qm9 = pd.read_csv("./SMILES_RDKit_2D.csv")
df1 = pd.read_csv("./x_df_features.csv")
df2 = pd.read_csv("./x_df_SMILES_RDKit_2D.csv")

# keep numeric columns only, then take the df2-vs-df1 block of the correlation matrix
qm9 = pd.concat([df1, df2], axis=1, keys=['df1', 'df2']).select_dtypes(include='number').corr().loc['df2', 'df1']

#couple_columns = qm9[['gap','zpve', 'mu']].head(10)
#print(couple_columns.shape)
plt.figure(figsize=(30,30))

# qm9 already is the cross-correlation matrix (rows: df2 columns, columns: df1 columns),
# so it is plotted directly instead of being correlated a second time
corr = qm9
# plot the heatmap
sns.heatmap(corr,
        xticklabels=corr.columns,
        yticklabels=corr.index, cmap="YlGnBu")  # alternatives: viridis_r, Spectral_r
# note: this file name is the same as in the previous heatmap cell, so that figure is overwritten
plt.savefig('./corr_x_df_SMILES_RDKit_2D_YlGnBu.png', dpi=900, facecolor='w', edgecolor='w')


In [None]:

##  Read Data  ##
import csv
import numpy as np

#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/Data.csv
#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/Data_norm.csv

#ifile  = open('Data.csv', "rt")
ifile  = open('x_df_features.csv', "rt")
# df1 = pd.read_csv("./x_df_features.csv")
# df2 = pd.read_csv("./x_df_SMILES_RDKit_2D.csv")
reader = csv.reader(ifile)
csvdata=[]
for row in reader:
        csvdata.append(row)   
ifile.close()
numrow=len(csvdata)
numcol=len(csvdata[0]) 
csvdata = np.array(csvdata).reshape(numrow,numcol)
dopant = csvdata[:,0]
CdX = csvdata[:,1]
doping_site = csvdata[:,2]

prop  = csvdata[:,5]  ## Cd-rich Delta_H
#prop  = csvdata[:,4]  ## Mod. Delta_H
#prop  = csvdata[:,5]  ## X-rich Delta_H

#X = csvdata[:,6:20]       ##  Elemental Properties
#X = csvdata[:,20:25]       ##  Unit Cell Defect Properties
X = csvdata[:,6:]       ##  Elemental + Unit Cell Defect Properties

n = prop.size



    # Read CdX alloy data: CdTe_0.5Se_0.5 and CdSe_0.5S_0.5

#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/Outside.csv
#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/Outside_norm.csv

#ifile2  = open('Outside.csv', "rt")
ifile2  = open('x_df_SMILES_RDKit_2D.csv', "rt")
reader2 = csv.reader(ifile2)
csvdata2=[]
for row2 in reader2:
        csvdata2.append(row2)
ifile2.close()
numrow2=len(csvdata2)
numcol2=len(csvdata2[0])
csvdata2 = np.array(csvdata2).reshape(numrow2,numcol2)
dopant_out = csvdata2[:,0]
CdX_out = csvdata2[:,1]
doping_site_out = csvdata2[:,2]
prop_out  = csvdata2[:,3]
#prop_out  = csvdata2[:,4]
#prop_out  = csvdata2[:,5]
#X_out = csvdata2[:,6:20]
#X_out = csvdata2[:,20:25]
X_out = csvdata2[:,6:]

n_out = prop_out.size


    # Read Entire Dataset

#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/X.csv
#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/X_norm.csv

#ifile3  = open('X.csv', "rt")
ifile3  = open('x_df_SMILES_RDKit_2D.csv', "rt")
reader3 = csv.reader(ifile3)
csvdata3=[]
for row3 in reader3:
        csvdata3.append(row3)
ifile3.close()
numrow3=len(csvdata3)
numcol3=len(csvdata3[0])
csvdata3 = np.array(csvdata3).reshape(numrow3,numcol3)
dopant_all = csvdata3[:,0]
CdX_all = csvdata3[:,1]
doping_site_all = csvdata3[:,2]
X_all = csvdata3[:,3:17]
#X_all = csvdata3[:,17:22]
#X_all = csvdata3[:,3:]

n_all = dopant_all.size
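
The read loops above append every row of the file, so a header line written by pandas ends up in the data as well. The following is a minimal sketch of separating the header from the rows with the standard `csv` module; `read_table`, `column_as_floats` and the in-memory sample are hypothetical names and data used only for illustration, and the file is assumed to start with one header row.

```python
import csv
import io


def read_table(handle):
    """Split a CSV stream into (header, rows), keeping the header out of the data."""
    reader = csv.reader(handle)
    header = next(reader)                    # first line holds the column names
    rows = [row for row in reader if row]    # skip blank lines
    return header, rows


def column_as_floats(header, rows, name):
    """Return one named column converted to float."""
    idx = header.index(name)
    return [float(row[idx]) for row in rows]


# Hypothetical two-row sample standing in for a descriptor file
sample = io.StringIO("smiles,MolWt,HeavyAtomCount\nCCO,46.069,3\nc1ccccc1,78.114,6\n")
header, rows = read_table(sample)
mol_wt = column_as_floats(header, rows, "MolWt")
print(header, mol_wt)
```

Selecting columns by name rather than by position also avoids having to recount indices when a column is added to or removed from the file.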




In [None]:
##   Visualize Data   ##
##   Visualize data: plot desired descriptor dimension vs property.

plt.figure(figsize=(6,6))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')

plt.ylabel('Property', fontname='Arial Narrow', size=32)
plt.xlabel('Descriptor', fontname='Arial Narrow', size=32)
plt.rc('xtick', labelsize=32)
plt.rc('ytick', labelsize=32)

yy = [0.0]*n
xx = [0.0]*n

for i in range(0,n):
    yy[i] = float(prop[i])    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    xx[i] = float(X[i,12])

plt.scatter(xx[:], yy[:], c='k', marker='*', s=200, edgecolors='dimgrey', alpha=1.0)



In [None]:
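# https://github.com/zinph/Cheminformatics/blob/master/compute_descriptors/RDKit_2D.py
# RDKit 2D descriptors
import pandas as pd
from molvs import standardize_smiles
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
#from RDKit_2D import *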

class RDKit_2D:
    def __init__(self, smiles):
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles
        
    def compute_2Drdkit(self, name):
        rdkit_2d_desc = []
        calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
        header = calc.GetDescriptorNames()
        for i in range(len(self.mols)):
            ds = calc.CalcDescriptors(self.mols[i])
            rdkit_2d_desc.append(ds)
        df = pd.DataFrame(rdkit_2d_desc,columns=header)
        df.insert(loc=0, column='smiles', value=self.smiles)
        df.to_csv(name[:-4]+'_RDKit_2D.csv', index=False)

def main():
    filename = './qm9.csv'  # path to your csv file
    #filename = './SMILES.csv'
    df = pd.read_csv(filename)               # read the csv file as pandas data frame
    smiles = [standardize_smiles(i) for i in df['SMILES'].values]  

    ## Compute RDKit 2D descriptors and export a csv file.
    RDKit_descriptor = RDKit_2D(smiles)        # create the RDKit_2D object from the SMILES list
    RDKit_descriptor.compute_2Drdkit(filename) # the input file name can be reused: the class drops its last four characters (".csv") and appends "_RDKit_2D.csv"

if __name__ == '__main__':
    main()
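
The class above builds its output name with `name[:-4]`, which assumes a four-character extension such as `.csv`. As a hedged alternative, the sketch below derives the name with `pathlib`; `descriptor_output_path` is a hypothetical helper, not part of the original script.

```python
from pathlib import Path


def descriptor_output_path(input_path, suffix):
    """Build '<stem><suffix>.csv' next to the input file, whatever its extension."""
    p = Path(input_path)
    return p.with_name(p.stem + suffix + ".csv")


print(descriptor_output_path("./qm9.csv", "_RDKit_2D"))
print(descriptor_output_path("data/SMILES.smi", "_ECFP6"))
```

Because `Path.stem` removes whatever extension is present, the same helper works for `.csv`, `.smi` or `.txt` inputs without miscounting characters.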

In [None]:
# https://github.com/zinph/Cheminformatics/blob/master/compute_descriptors/ECFP6.py
# ECFP6 Fingerprint
import numpy as np
import pandas as pd
from molvs import standardize_smiles    # used in main()
from rdkit.Chem import AllChem
from rdkit import Chem, DataStructs

class ECFP6:
    def __init__(self, smiles):
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles

    def mol2fp(self, mol, radius = 3):
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius = radius)
        array = np.zeros((1,))
        DataStructs.ConvertToNumpyArray(fp, array)
        return array

    def compute_ECFP6(self, name):
        bit_headers = ['bit' + str(i) for i in range(2048)]
        arr = np.empty((0,2048), int).astype(int)
        for i in self.mols:
            fp = self.mol2fp(i)
            arr = np.vstack((arr, fp))
        df_ecfp6 = pd.DataFrame(np.asarray(arr).astype(int),columns=bit_headers)
        df_ecfp6.insert(loc=0, column='smiles', value=self.smiles)
        df_ecfp6.to_csv(name[:-4]+'_ECFP6.csv', index=False)

def main():
    #filename = './qm9.csv'  # path to your csv file
    filename = './SMILES.csv'
    df = pd.read_csv(filename)               # read the csv file as pandas data frame
    smiles = [standardize_smiles(i) for i in df['SMILES'].values]  

    ## Compute ECFP6 fingerprints and export a csv file.
    ECFP6_descriptor = ECFP6(smiles)        # create the ECFP6 object from the SMILES list
    ECFP6_descriptor.compute_ECFP6(filename) # the input file name can be reused: the class drops its last four characters (".csv") and appends "_ECFP6.csv"

if __name__ == '__main__':
    main()

In [None]:
# https://github.com/zinph/Cheminformatics/blob/master/compute_descriptors/MACCS.py
# MACCS Fingerprint

import pandas as pd
from molvs import standardize_smiles    # used in main()
from rdkit import Chem
from rdkit.Chem import MACCSkeys

class MACCS:
    def __init__(self, smiles):
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles

    def compute_MACCS(self, name):
        MACCS_list = []
        header = ['bit' + str(i) for i in range(167)]
        for i in range(len(self.mols)):
            ds = list(MACCSkeys.GenMACCSKeys(self.mols[i]).ToBitString())
            MACCS_list.append(ds)
        df = pd.DataFrame(MACCS_list,columns=header)
        df.insert(loc=0, column='smiles', value=self.smiles)
        df.to_csv(name[:-4]+'_MACCS.csv', index=False)

def main():
    #filename = './qm9.csv'  # path to your csv file
    filename = './SMILES.csv'
    df = pd.read_csv(filename)               # read the csv file as pandas data frame
    smiles = [standardize_smiles(i) for i in df['SMILES'].values]  

    ## Compute MACCS keys and export a csv file.
    MACCS_descriptor = MACCS(smiles)        # create the MACCS object from the SMILES list
    MACCS_descriptor.compute_MACCS(filename) # the input file name can be reused: the class drops its last four characters (".csv") and appends "_MACCS.csv"

if __name__ == '__main__':
    main()

In [None]:
#https://drzinph.com/mordred_mrc_descriptors-in-python-part-5/
#https://github.com/zinph/Cheminformatics/blob/master/compute_descriptors/Macrocycle_Descriptors.py
# Macrocycle_Descriptors

import itertools
import pandas as pd
from molvs import standardize_smiles    # used in main()
from rdkit import Chem
from mordred import Calculator, descriptors
from mordred.RingCount import RingCount

class Macrocycle_Descriptors:

    def __init__(self, smiles):
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles
        self.mordred = None


    def compute_ringsize(self, mol):
        '''
        check for macrolides of RS 3 to 99, return a  list of ring counts.
        [RS3,RS4,.....,RS99]
        [0,0,0,...,1,...,0]
        '''
        RS_3_99 = [i+3 for i in range(97)]
        RS_count = []
        for j in RS_3_99:
            RS = RingCount(order=j)(mol)
            RS_count.append(RS)
        return RS_count

    def macrolide_ring_info(self):
        headers = ['n'+str(i+13)+'Ring' for i in range(87)]+['SmallestRS','LargestRS']
        # up to nR12 is already with mordred, start with nR13 to nR99
        ring_sizes = []
        for i in range(len(self.mols)):
            RS = self.compute_ringsize(self.mols[i])  # nR3 to nR99
            RS_12_99 = RS[9:]    # start with nR12 up to nR99
            ring_indices = [i for i,x in enumerate(RS_12_99) if x!=0]  # get index if item isn't equal to 0
            # if there is a particular ring present, the frequency won't be zero. Find those indexes. 
            if ring_indices:
                # Add 12 (starting ring count) to get up to the actual ring size
                smallest_RS = ring_indices[0]+12     # Retrieve the first index (for the smallest core RS - note the list is in ascending order)
                largest_RS = ring_indices[-1]+12     # Retrieve the last index (for the largest core RS)
                RS_12_99.append(smallest_RS)  # Smallest RS
                RS_12_99.append(largest_RS)  # Largest RS
            else:
                RS_12_99.extend(['',''])
            ring_sizes.append(RS_12_99[1:]) # up to nR12 is already with mordred, start with nR13 to nR99
        df = pd.DataFrame(ring_sizes, columns=headers)
        return df

    def sugar_count(self):
        sugar_patterns = [
        '[OX2;$([r5]1@C@C@C(O)@C1),$([r6]1@C@C@C(O)@C(O)@C1)]',
        '[OX2;$([r5]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
        '[OX2;$([r5]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C(O)@C1),$([r6]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C(O)@C(O)@C1)]',
        '[OX2;$([r5]1@C(!@[OX2H1])@C@C@C1),$([r6]1@C(!@[OX2H1])@C@C@C@C1)]',
        '[OX2;$([r5]1@[C@@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@[C@@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
        '[OX2;$([r5]1@[C@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@[C@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
        ]
        sugar_mols = [Chem.MolFromSmarts(i) for i in sugar_patterns]
        sugar_counts = []
        for i in self.mols:
            matches_total = []
            for s_mol in sugar_mols:
                raw_matches = i.GetSubstructMatches(s_mol)
                matches = list(sum(raw_matches, ()))
                if matches not in matches_total and len(matches) !=0:
                    matches_total.append(matches)
            sugar_indices = set((list(itertools.chain(*matches_total))))
            count = len(sugar_indices)
            sugar_counts.append(count)
        df = pd.DataFrame(sugar_counts, columns=['nSugars'])
        return df

    def core_ester_count(self):
        '''
        Returns pandas frame containing the count of esters in core rings of >=12 membered macrocycles.
        '''
        ester_smarts = '[CX3](=[OX1])O@[r;!r3;!r4;!r5;!r6;!r7;!r8;!r9;!r10;!r11]'
        core_ester = []
        ester_mol = Chem.MolFromSmarts(ester_smarts)
        for i in self.mols:
            ester_count = len(i.GetSubstructMatches(ester_mol))
            core_ester.append(ester_count)
        df = pd.DataFrame(core_ester, columns=['core_ester'])
        return df

    def mordred_compute(self, name):
        calc = Calculator(descriptors, ignore_3D=True)
        df = calc.pandas(self.mols)
        self.mordred = df
        df.insert(loc=0, column='smiles', value=self.smiles)
        df.to_csv(name[:-4]+'_mordred.csv', index=False)
        return df  # returned because compute_mordred_macrocycle assigns the result to self.mordred

    def compute_mordred_macrocycle(self, name):
        if not isinstance(self.mordred, pd.DataFrame):
            self.mordred = self.mordred_compute(name)
        ring_df = self.macrolide_ring_info()
        sugar_df = self.sugar_count()
        ester_df = self.core_ester_count()
#        self.mrc = pd.concat([ring_df,sugar_df, ester_df], axis=1)
        mordred_mrc = pd.concat([self.mordred, ring_df,sugar_df, ester_df], axis=1)
        mordred_mrc.to_csv(name[:-4]+'_mordred_mrc.csv', index=False)

def main():
    #filename = './qm9.csv'  # path to your csv file
    filename = './SMILES.csv'
    df = pd.read_csv(filename)               # read the csv file as pandas data frame
    smiles = [standardize_smiles(i) for i in df['SMILES'].values]  

    ## Compute mordred and macrocycle descriptors and export csv files.
    Macrocycle_descriptor = Macrocycle_Descriptors(smiles)        # create the Macrocycle_Descriptors object from the SMILES list
    Macrocycle_descriptor.compute_mordred_macrocycle(filename) # the input file name can be reused: the class drops its last four characters (".csv") and appends "_mordred.csv" and "_mordred_mrc.csv"

if __name__ == '__main__':
    main()
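
The index arithmetic in `macrolide_ring_info` (a list of counts for ring sizes 3 to 99, sliced from size 12 upward) can be checked in isolation. This is a minimal sketch under that layout; `macrocycle_ring_summary` and the sample counts are hypothetical, and it returns `None` where the class above stores an empty string.

```python
def macrocycle_ring_summary(ring_counts, first_size=3, core_min=12):
    """Return (smallest, largest) ring size >= core_min, or (None, None).

    ring_counts[k] is the number of rings of size first_size + k.
    """
    offset = core_min - first_size          # index of the first core ring size
    present = [core_min + k
               for k, count in enumerate(ring_counts[offset:])
               if count != 0]
    if not present:
        return None, None
    return present[0], present[-1]


# Hypothetical counts for ring sizes 3..99: one 6-ring, one 14-ring, one 16-ring
counts = [0] * 97
counts[6 - 3] = 1
counts[14 - 3] = 1
counts[16 - 3] = 1
print(macrocycle_ring_summary(counts))
```

The 6-membered ring is ignored because it lies below the 12-membered threshold used for macrocycle cores.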





In [None]:
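#https://greglandrum.github.io/rdkit-blog/page2/
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, Crippen, Descriptors, Lipinski, MolSurf, PandasTools
dataset = pd.read_csv("./SMILES.csv")[['SMILES']]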
#dataset = pd.read_csv("./qm9.csv")[['SMILES']]
#print(list(dataset))
#list(dataset)
#dataset
#dataset.head()
PandasTools.AddMoleculeColumnToFrame(dataset, 'SMILES', 'Molecules')
#Descriptors2D
dataset['MolWt'] = [Descriptors.MolWt(mol) for mol in dataset['Molecules']]
dataset['exactmw'] = [Descriptors.ExactMolWt(mol) for mol in dataset['Molecules']]
dataset['FpDensityMorgan1'] = [Descriptors.FpDensityMorgan1(mol) for mol in dataset['Molecules']]
dataset['FpDensityMorgan2'] = [Descriptors.FpDensityMorgan2(mol) for mol in dataset['Molecules']]
dataset['FpDensityMorgan3'] = [Descriptors.FpDensityMorgan3(mol) for mol in dataset['Molecules']]
dataset['HeavyAtomMolWt'] = [Descriptors.HeavyAtomMolWt(mol) for mol in dataset['Molecules']]
dataset['MaxAbsPartialCharge'] = [Descriptors.MaxAbsPartialCharge(mol) for mol in dataset['Molecules']]
dataset['MaxPartialCharge'] = [Descriptors.MaxPartialCharge(mol) for mol in dataset['Molecules']]
dataset['MinAbsPartialCharge'] = [Descriptors.MinAbsPartialCharge(mol) for mol in dataset['Molecules']]
dataset['NumRadicalElectrons'] = [Descriptors.NumRadicalElectrons(mol) for mol in dataset['Molecules']]
dataset['NumValenceElectrons'] = [Descriptors.NumValenceElectrons(mol) for mol in dataset['Molecules']]
#dataset['setupAUTOCorrDescriptors'] = [Descriptors.setupAUTOCorrDescriptors(mol) for mol in dataset['Molecules']]

#Descriptors3D
#dataset['Asphericity'] = [Chem.Descriptors3D.PMI1(mol) for mol in dataset['Molecules']]

dataset['LOGP'] = [Crippen.MolLogP(mol) for mol in dataset['Molecules']]
dataset['HBA'] = [Lipinski.NumHAcceptors(mol) for mol in dataset['Molecules']]
dataset['HBD'] = [Lipinski.NumHDonors(mol) for mol in dataset['Molecules']]
dataset['rotable'] = [Lipinski.NumRotatableBonds(mol) for mol in dataset['Molecules']]
dataset['amide'] = [AllChem.CalcNumAmideBonds(mol) for mol in dataset['Molecules']]
dataset['bridge'] = [AllChem.CalcNumBridgeheadAtoms(mol) for mol in dataset['Molecules']]
dataset['heteroA'] = [Lipinski.NumHeteroatoms(mol) for mol in dataset['Molecules']]
dataset['heavy'] = [Lipinski.HeavyAtomCount(mol) for mol in dataset['Molecules']]
dataset['spiro'] = [AllChem.CalcNumSpiroAtoms(mol) for mol in dataset['Molecules']]
dataset['FCSP3'] = [AllChem.CalcFractionCSP3(mol) for mol in dataset['Molecules']]
dataset['ring'] = [Lipinski.RingCount(mol) for mol in dataset['Molecules']]
dataset['Aliphatic'] = [AllChem.CalcNumAliphaticRings(mol) for mol in dataset['Molecules']]
dataset['aromatic'] = [AllChem.CalcNumAromaticRings(mol) for mol in dataset['Molecules']]
dataset['saturated'] = [AllChem.CalcNumSaturatedRings(mol) for mol in dataset['Molecules']]
dataset['heteroR'] = [AllChem.CalcNumHeterocycles(mol) for mol in dataset['Molecules']]
dataset['TPSA'] = [MolSurf.TPSA(mol) for mol in dataset['Molecules']]
dataset['valence'] = [Descriptors.NumValenceElectrons(mol) for mol in dataset['Molecules']]
dataset['mr'] = [Crippen.MolMR(mol) for mol in dataset['Molecules']]
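# ComputeGasteigerCharges returns None and stores the charges on the atoms as the '_GasteigerCharge' property,
# so it cannot fill a column directly; the line is kept for reference only.
#dataset['charge'] = [AllChem.ComputeGasteigerCharges(mol) for mol in dataset['Molecules']]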

# 'lipinskiHBA' 
# 'lipinskiHBD' 
# 'NumRotatableBonds' 
# 'NumHBD' 
# 'NumHBA' 
# 'NumHeteroatoms' 
# 'NumAmideBonds' 
# 'FractionCSP3' 
# 'NumRings' 
# 'NumAromaticRings' 
# 'NumAliphaticRings' 
# 'NumSaturatedRings' 
# 'NumHeterocycles' 
# 'NumAromaticHeterocycles' 
# 'NumSaturatedHeterocycles' 
# 'NumAliphaticHeterocycles' 
# 'NumSpiroAtoms' 
# 'NumBridgeheadAtoms' 
# 'NumAtomStereoCenters' 
# 'NumUnspecifiedAtomStereoCenters' 
# 'labuteASA' 
# 'tpsa'
# 'CrippenClogP' 
# 'CrippenMR'

dataset.head()
# dataset['fps-SmilesMolSupplier'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(mol,2,2048) for mol in dataset['Molecules']]
# #dataset.head()
# len(fpsdataset)
fpsdataset = dataset.drop(columns='Molecules')    # drop the RDKit molecule objects before export
fpsdataset.head()
# fpsdataset.to_csv("./fpsdatasetqm9.csv")

In [None]:
smiles = 'CC1=C(C(O)=O)C2=CC(=CC=C2N=C1C3=CC=C(C=C3)C4=CC=CC=C4F)F'
mol = Chem.MolFromSmiles(smiles)
mol

In [None]:
smi = Chem.MolToSmiles(mol)
print(smi)
print(Chem.MolToInchiKey(mol))
mol_block = Chem.MolToMolBlock(mol)
print(mol_block)

In [None]:
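#https://github.com/XuhanLiu/DrugEx
# https://www.programcreek.com/python/example/124114/rdkit.Chem.Descriptors.MolWt
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Crippen, Descriptors, Lipinski, MolSurf
# Assumption: Scaler is not defined anywhere in this notebook. scikit-learn's StandardScaler
# is used here because it matches the Gaussian normalization named in the docstring.
from sklearn.preprocessing import StandardScaler as Scaler

def PhyChem(smiles):
    """ Calculate the 19D physicochemical descriptors for each molecule;
    the values are normalized with a Gaussian distribution.

    Arguments:
        smiles (list): list of SMILES strings.
    Returns:
        props (ndarray): m X 19 matrix of normalized PhysChem descriptors.
            m is the No. of samples
    """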
    props = []
    for smile in smiles:
        mol = Chem.MolFromSmiles(smile)
        try:
            MW = Descriptors.MolWt(mol)
            LOGP = Crippen.MolLogP(mol)
            HBA = Lipinski.NumHAcceptors(mol)
            HBD = Lipinski.NumHDonors(mol)
            rotable = Lipinski.NumRotatableBonds(mol)
            amide = AllChem.CalcNumAmideBonds(mol)
            bridge = AllChem.CalcNumBridgeheadAtoms(mol)
            heteroA = Lipinski.NumHeteroatoms(mol)
            heavy = Lipinski.HeavyAtomCount(mol)
            spiro = AllChem.CalcNumSpiroAtoms(mol)
            FCSP3 = AllChem.CalcFractionCSP3(mol)
            ring = Lipinski.RingCount(mol)
            Aliphatic = AllChem.CalcNumAliphaticRings(mol)
            aromatic = AllChem.CalcNumAromaticRings(mol)
            saturated = AllChem.CalcNumSaturatedRings(mol)
            heteroR = AllChem.CalcNumHeterocycles(mol)
            TPSA = MolSurf.TPSA(mol)
            valence = Descriptors.NumValenceElectrons(mol)
            mr = Crippen.MolMR(mol)
            # charge = AllChem.ComputeGasteigerCharges(mol)
            prop = [MW, LOGP, HBA, HBD, rotable, amide, bridge, heteroA, heavy, spiro,
                    FCSP3, ring, Aliphatic, aromatic, saturated, heteroR, TPSA, valence, mr]
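        except Exception:
            # descriptor calculation failed (for example the SMILES did not parse): report it and use zeros
            print(smile)
            prop = [0] * 19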
        props.append(prop)
    props = np.array(props)
    props = Scaler().fit_transform(props)
    return props 
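
The final scaling step in `PhyChem` can be written out by hand. This is a minimal, dependency-free sketch of per-column standardization (zero mean, unit population standard deviation), which is an assumption about what `Scaler` does, since the notebook does not define it; `zscore_columns` and the sample matrix are hypothetical.

```python
import math


def zscore_columns(rows):
    """Scale each column to zero mean and unit (population) standard deviation.

    Constant columns are mapped to 0.0 instead of dividing by zero.
    """
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(n_cols)]
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(n_cols)]
            for r in rows]


# Hypothetical 3 x 2 property matrix (the second column is constant)
props = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
scaled = zscore_columns(props)
print(scaled)
```

Note that the rows of zeros written by the `except` branch take part in the column means and standard deviations, so failed molecules shift the scaling of the valid ones.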

In [None]:
# Computing Morgan fingerprint vectors
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, rdFingerprintGenerator
mol = Chem.MolFromSmiles('C/C1=C\\C[C@H]([C+](C)C)CC/C(C)=C/CC1')
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, useChirality=True, radius=2, nBits=124)
vec1 = np.array(fp1)
print(vec1)
# The count-simulation keyword is left at its default (False); its name may differ between RDKit releases
morgan_fp_gen = rdFingerprintGenerator.GetMorganGenerator(includeChirality=True, radius=2, fpSize=124)
fp2 = morgan_fp_gen.GetFingerprint(mol)
vec2 = np.array(fp2)
print(vec2)
assert np.all(vec1 == vec2) 

In [None]:
#https://stackoverflow.com/questions/67302261/cant-convert-molecule-to-fingerprint-with-rdkit
from rdkit.Chem import AllChem as Chem

fragment = Chem.MolFromSmiles('Nc1cccc(N)n1')

smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1', 'CC1=CC=Cc2c(N)nc(N)cc12']

for smi in smiles:
    try:
        mol = Chem.MolFromSmiles(smi)
        f1 = Chem.DeleteSubstructs(mol, fragment)
        f2 = Chem.MolFromSmiles(Chem.MolToSmiles(f1))
        fp = Chem.GetMorganFingerprintAsBitVect(f2, 2)
    except Exception:
        print('SMILES:', smi)
        f = Chem.DeleteSubstructs(mol, fragment)
        # print the freshly computed fragment; f1 may be stale or undefined at this point
        print('smiles_frag:', Chem.MolToSmiles(f))

In [None]:
file_name = 'somedata.smi'

with open(file_name, "r") as ins:
    smiles = []
    for line in ins:
        smiles.append(line.split('\n')[0])
print('# of SMILES:', len(smiles))
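
`.smi` files often carry a molecule name after the SMILES and may contain blank lines, neither of which the loop above handles. The sketch below is one hedged way to read such a file; `read_smi` is a hypothetical helper, and the whitespace-separated "SMILES name" layout is an assumption about the input.

```python
import os
import tempfile


def read_smi(path):
    """Read a .smi file: one record per line, SMILES first, optional name after whitespace."""
    records = []
    with open(path, "r") as handle:
        for line in handle:
            fields = line.split()
            if not fields:                 # skip blank lines
                continue
            smiles = fields[0]
            name = fields[1] if len(fields) > 1 else ""
            records.append((smiles, name))
    return records


# Write a small hypothetical file so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".smi", delete=False) as tmp:
    tmp.write("CCO ethanol\n\nc1ccccc1\n")
    tmp_path = tmp.name

records = read_smi(tmp_path)
os.remove(tmp_path)
print("# of SMILES:", len(records))
```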

In [None]:
# Feed SMILES stored in a pandas DataFrame directly into RDKit to calculate molecular fingerprints and their Tanimoto similarity to a target molecule

from rdkit import DataStructs
df = pd.read_csv("./SMILES_feature.csv")

target = Chem.RDKFingerprint(Chem.MolFromSmiles('CC1=C(C(O)=O)C2=CC(=CC=C2N=C1C3=CC=C(C=C3)C4=CC=CC=C4F)F'))
df_smiles = pd.DataFrame(df.smiles)
#display(df_smiles)
df1 = pd.DataFrame(data=df.smiles)
df1['Tanimoto'] = DataStructs.BulkTanimotoSimilarity(target, [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in df['smiles']])

print(df1)
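
The Tanimoto similarity computed above is the number of bits two fingerprints share divided by the number of bits set in either. This is a minimal pure-Python sketch of that definition on sets of on-bit indices; `tanimoto`, `bulk_tanimoto` and the toy bit sets are hypothetical and do not reproduce RDKit's fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def bulk_tanimoto(target, others):
    """Similarity of one target fingerprint to each fingerprint in a list."""
    return [tanimoto(target, fp) for fp in others]


# Toy fingerprints given as sets of on-bit positions
target_bits = {1, 4, 7, 9}
library = [{1, 4, 7, 9}, {1, 4}, {2, 3}]
print(bulk_tanimoto(target_bits, library))
```

An identical fingerprint scores 1.0 and a fingerprint with no shared bits scores 0.0, which gives a quick sanity check for a `Tanimoto` column.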


In [None]:
#Export pandas data frame with mol image
import pandas as pd
from rdkit import Chem
from rdkit.Chem import PandasTools
#DataFrame = pd.read_csv("./SMILES_feature.csv")
#smiles = [pd.DataFrame(df.smiles)]
smiles = ['N#CC(c1ccccc1)C(Br)Oc1ccccc1','O=[N+]([O-])c1ccc(C(c2ccccc2)C(Br)Oc2ccccc2)cc1','CC(Oc1ccccc1)C(C#N)c1ccc(C#N)cc1 ','COC(C#N)C(C)(C)c1ccc(-c2ccccc2)cc1 ','COC(Oc1ccccc1)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCC(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)C(OC)OCOC)cc1 ','COC(C#N)C(C)(C)c1ccccc1 ','CCc1ccc(C(C)(c2ccccc2)C(Br)OC)cc1 ','COCOC(C)C(C)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)COc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(OC)OC)cc1 ','CCc1ccc(C(C)(C#N)C(C)OC)cc1 ','COC(C#N)C(C#N)c1ccc(C#N)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc(C#N)cc1 ','COCC(C)(C#N)c1ccc(C#N)cc1 ','CC(C)(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C#N)(C#N)C(C)OC)cc1 ','COC(Br)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(c1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(c1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(CCOC)cc1 ','COCOCC(C#N)(C#N)c1ccccc1 ','COCOCC(C)c1ccccc1 ','COC(C#N)C(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(C#N)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(OC)OC)cc1 ','CCc1ccc(C(C#N)C(C#N)OCOC)cc1 ','COC(=O)c1ccc(C(C#N)C(Br)Oc2ccccc2)cc1 ','COCOC(Br)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C)(C)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(C#N)c1ccccc1 ','N#Cc1ccc(C(c2ccccc2)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COC(c1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCC(c1ccccc1)(c1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)C(OC)OC)cc1 ','COCOC(Br)C(c1ccccc1)c1ccccc1 ','CCc1ccc(CCOCOC)cc1 ','COCC(C)(C#N)c1ccccc1 ','COCOC(C)C(c1ccccc1)c1ccccc1 ','N#Cc1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COCOCC(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)C(Br)OCOC)cc1 ','COCOC(Br)C(C)(C#N)c1ccc(C#N)cc1 ','O=[N+]([O-])c1ccc(CC(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)C(Br)OCOC)cc1 ','COCC(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)(C#N)COCOC)cc1 ','N#CC(Oc1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','COC(C#N)C(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOCC(C)(C)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C)(C)C(OCOC)c2ccccc2)cc1 
','COCOC(Br)C(C)c1ccc([N+](=O)[O-])cc1 ','COCOCC(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)C(C)OC)cc1 ','COc1ccc(C(C#N)(C#N)C(Br)OC)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(Br)Oc2ccccc2)cc1 ','COC(c1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(C)C(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(C#N)OC)cc1 ','COC(c1ccccc1)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COC(C)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C)C(C)c1ccc(C#N)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(C#N)COCOC)cc1 ','COCC(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)Cc1ccc(C#N)cc1 ','COCOC(OC)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C)C(C#N)OCOC)cc1 ','COC(Br)C(C)c1ccccc1 ','COC(C#N)Cc1ccc([N+](=O)[O-])cc1 ','CC(c1ccccc1)(c1ccc(-c2ccccc2)cc1)C(C#N)Oc1ccccc1 ','CC(c1ccc([N+](=O)[O-])cc1)C(C#N)Oc1ccccc1 ','COCOC(C)C(C)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COC(C)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(OC)C(C#N)c1ccccc1 ','COC(C)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCC(C)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(OCOC)c2ccccc2)cc1 ','COCC(C#N)(C#N)c1ccc(OC)cc1 ','CCc1ccc(C(C)(C#N)C(OC)OCOC)cc1 ','c1ccc(OCC(c2ccccc2)c2ccccc2)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(C#N)OC)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(OC)OC)cc1 ','COC(C#N)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(CC(OCOC)c2ccccc2)cc1 ','N#CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COCOC(C#N)C(C#N)(C#N)c1ccccc1 ','COC(=O)c1ccc(CC(C#N)OC)cc1 ','CCc1ccc(CC(C#N)OC)cc1 ','CC(Oc1ccccc1)C(C)(C#N)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(C#N)(C#N)c1ccccc1 ','COc1ccc(C(C#N)COc2ccccc2)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(Br)OCOC)cc1 ','CCc1ccc(C(C)(COCOC)c2ccccc2)cc1 ','CCc1ccc(C(C#N)C(Br)OC)cc1 ','COCOCC(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOC(C)Cc1ccccc1 ','CC(c1ccccc1)C(Br)Oc1ccccc1 
','CC(C#N)(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','COC(c1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(C#N)OCOC)cc1 ','N#CC(COc1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)(C)C(C#N)Oc2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(C)c1ccccc1 ','COC(C)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CC(c1ccc(C#N)cc1)C(Oc1ccccc1)c1ccccc1 ',
          'COC(=O)c1ccc(C(C)C(C)OC)cc1 ','COC(=O)c1ccc(C(C)(C)C(C#N)Oc2ccccc2)cc1 ','COC(Oc1ccccc1)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(C)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)C(C#N)OC)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(C)Oc2ccccc2)cc1 ','COCC(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc(OC)cc1 ','COC(C#N)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)C(OCOC)c2ccccc2)cc1 ','COC(=O)c1ccc(CCOc2ccccc2)cc1 ','BrC(Cc1ccccc1)Oc1ccccc1 ','CCc1ccc(C(COCOC)c2ccccc2)cc1 ','COC(C)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOCC(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Cc1ccc(C#N)cc1)c1ccccc1 ','CCc1ccc(C(C)(c2ccccc2)C(OC)OC)cc1 ','COC(=O)c1ccc(C(C)C(Br)Oc2ccccc2)cc1 ','COC(Br)Cc1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C)(C#N)C(Br)OC)cc1 ','COCOC(C)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(Br)C(C#N)c1ccc(C#N)cc1 ','BrC(Oc1ccccc1)C(c1ccccc1)c1ccccc1 ','COC(c1ccccc1)C(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C)C(C)OCOC)cc1 ','COCOC(OC)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOCC(c1ccccc1)c1ccccc1 ','COC(Cc1ccccc1)OC ','N#CC(c1ccccc1)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(COC)c2ccccc2)cc1 ','COC(C#N)C(C)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C)C(C#N)Oc2ccccc2)cc1 ','COC(C)C(c1ccccc1)c1ccc(C#N)cc1 ','COC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(Oc1ccccc1)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C)C(C#N)OC)cc1 ','N#Cc1ccc(C(C#N)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','O=[N+]([O-])c1ccc(C(COc2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)(C#N)C(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C#N)C(C)OCOC)cc1 ','COC(=O)c1ccc(C(COc2ccccc2)c2ccccc2)cc1 ','COC(C)C(C#N)c1ccccc1 ','COC(=O)c1ccc(C(C#N)C(C#N)OC)cc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COC(Br)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C#N)c1ccccc1 
','CCc1ccc(C(C)(c2ccccc2)C(Br)OCOC)cc1 ','CC(c1ccccc1)(c1ccc(C#N)cc1)C(C#N)Oc1ccccc1 ','COCOC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)C(OC)Oc2ccccc2)cc1 ','COCOC(Br)C(C)c1ccc(C(=O)OC)cc1 ','N#CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','N#Cc1ccc(C(C#N)C(Br)Oc2ccccc2)cc1 ','COCC(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','O=[N+]([O-])c1ccc(CCOc2ccccc2)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(Br)OC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccccc1 ','COCOC(Br)C(c1ccccc1)c1ccc(C#N)cc1 ','N#CC(C#N)(c1ccc(-c2ccccc2)cc1)C(Oc1ccccc1)c1ccccc1 ','COCOC(C#N)C(C)(C)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C)(C)C(C#N)OC)cc1 ','COC(Br)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)C(OC)OC)cc1 ','N#Cc1ccc(C(C#N)(COc2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(C)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(Br)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C)(C)C(Br)OC)cc1 ','COC(C#N)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C)(C)C(Br)OCOC)cc1 ','COC(=O)c1ccc(C(C)(C)C(C#N)OC)cc1 ','COC(OC)C(c1ccccc1)c1ccccc1 ','COC(C)C(C#N)c1ccc(C#N)cc1 ','COCOC(Br)C(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','CC(Cc1ccc([N+](=O)[O-])cc1)Oc1ccccc1 ','COCOC(C)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C)(C#N)c1ccccc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(C#N)OCOC)cc1 ','N#Cc1ccc(C(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COCOC(C)Cc1ccc([N+](=O)[O-])cc1 ','COC(Br)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(C#N)(c1ccc(-c2ccccc2)cc1)C(Oc1ccccc1)c1ccccc1 ','CC(Oc1ccccc1)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C)(C#N)c1ccccc1 ','COc1ccc(C(C)(C#N)COc2ccccc2)cc1 ','COC(c1ccccc1)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(Cc1ccc([N+](=O)[O-])cc1)OC ','COC(c1ccccc1)C(C#N)c1ccccc1 
','COCOCCc1ccc(C(=O)OC)cc1 ','COC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCC(C#N)(c1ccccc1)c1ccccc1 ','CC(C)(c1ccc(C#N)cc1)C(C#N)Oc1ccccc1 ','O=[N+]([O-])c1ccc(C(c2ccccc2)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(C)C(C#N)(C#N)c1ccc(OC)cc1 ','COC(Oc1ccccc1)C(C)c1ccc(C#N)cc1 ','COC(=O)c1ccc(CC(Br)OC)cc1 ','COCOC(Cc1ccc(C(=O)OC)cc1)c1ccccc1 ','COCC(C)(c1ccccc1)c1ccc(C#N)cc1 ',
          'COCOC(OC)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(Cc1ccc([N+](=O)[O-])cc1)Oc1ccccc1 ','COCC(C)(C)c1ccc(C#N)cc1 ','COCOC(C)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','COC(c1ccccc1)C(c1ccccc1)c1ccccc1 ','COC(Cc1ccc(C#N)cc1)Oc1ccccc1 ','COc1ccc(C(C#N)(C#N)C(OC)c2ccccc2)cc1 ','N#CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COc1ccc(C(C#N)C(C#N)Oc2ccccc2)cc1 ','N#Cc1ccc(C(C#N)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COCOC(Br)C(C)(C)c1ccc(C(=O)OC)cc1 ','COC(Br)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','N#CC(Oc1ccccc1)C(c1ccccc1)c1ccccc1 ','COC(C)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C#N)C(C#N)OCOC)cc1 ','COCOC(C#N)C(C)(C)c1ccc(C#N)cc1 ','COCOC(C#N)C(C)c1ccccc1 ','BrC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOC(C)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)C(OC)OCOC)cc1 ','CC(C)(c1ccc(C#N)cc1)C(Br)Oc1ccccc1 ','COC(=O)c1ccc(CC(OC)c2ccccc2)cc1 ','CCc1ccc(C(C)(C#N)COc2ccccc2)cc1 ','CCc1ccc(C(C)C(C#N)OCOC)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(Cc1ccccc1)OC ','COCOCC(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)C(C)OC)cc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccccc1 ','COC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOCC(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ', 'COCOC(C#N)Cc1ccc(OC)cc1','COC(=O)c1ccc(C(C#N)C(Br)OC)cc1 ','CCc1ccc(C(C)(C#N)COC)cc1 ','COC(Br)C(C#N)c1ccccc1 ','COC(=O)c1ccc(CC(C)OC)cc1 ','COC(C)C(C)(c1ccccc1)c1ccccc1 ','COc1ccc(C(C#N)(C#N)COc2ccccc2)cc1 ','COc1ccc(C(C#N)(C#N)C(Br)Oc2ccccc2)cc1 ','COC(C)Cc1ccccc1 ','COC(Br)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(C#N)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(OC)OC)cc1 ','CCc1ccc(C(C)(COC)c2ccccc2)cc1 ','COC(Br)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(OCOC)c2ccccc2)cc1 ','COC(C#N)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOC(Cc1ccccc1)c1ccccc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc(C#N)cc1 ','COC(C)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOCC(C#N)c1ccc(OC)cc1 
','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','N#CC(C#N)(COc1ccccc1)c1ccccc1 ','COCOC(OC)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(OC)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(OC)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','N#CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C)c1ccccc1 ','COC(=O)c1ccc(C(C)C(Br)OC)cc1 ','COCC(C)(c1ccccc1)c1ccccc1 ','N#Cc1ccc(C(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COCOC(OC)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(c1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(C)OC)cc1 ','COCOC(Cc1ccc([N+](=O)[O-])cc1)OC ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(C#N)C(C#N)c1ccc(C(=O)OC)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(OC)OC)cc1 ','COCOC(OC)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','c1ccc(OC(c2ccccc2)C(c2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(C)Oc2ccccc2)cc1 ','COCOCC(C#N)(c1ccccc1)c1ccccc1 ','COC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','COCOCC(C)(C)c1ccccc1 ','CCc1ccc(C(c2ccccc2)C(OC)OCOC)cc1 ','CC(Oc1ccccc1)C(C#N)(C#N)c1ccccc1 ','COC(Br)C(C#N)(C#N)c1ccccc1 ','COCC(C)(C#N)c1ccc(OC)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(c1ccccc1)c1ccccc1 ','COCOC(C)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCC(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(c1ccccc1)c1ccccc1 ','COCOC(Cc1ccc([N+](=O)[O-])cc1)c1ccccc1 ','COC(C)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','CC(Oc1ccccc1)C(C#N)c1ccccc1 ','COCCc1ccc(C(=O)OC)cc1 ','CC(c1ccccc1)(c1ccc(C#N)cc1)C(Br)Oc1ccccc1 ','CCc1ccc(CC(C)OC)cc1 
','CCc1ccc(C(C)(C#N)C(C#N)OC)cc1 ','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCC(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C#N)(C#N)C(C#N)OC)cc1 ','CCc1ccc(C(c2ccccc2)C(Br)OCOC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)COc2ccccc2)cc1 ',
		  'COCOCC(C)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(CC(C#N)Oc2ccccc2)cc1 ','COC(Br)C(C)(c1ccccc1)c1ccccc1 ','COCOCC(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C#N)c1ccc(-c2ccccc2)cc1 ','COC(OC)C(C)(C)c1ccccc1 ','COCOC(OC)C(C)(C#N)c1ccc(OC)cc1 ','COC(OC)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOCC(C)(C)c1ccc(C(=O)OC)cc1 ','COCOC(c1ccccc1)C(C)c1ccc(C#N)cc1 ','COC(OC)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(Br)OC)cc1 ','CCc1ccc(C(C)C(Br)OC)cc1 ','CCc1ccc(C(C#N)C(C)Oc2ccccc2)cc1 ','COC(c1ccccc1)C(C#N)(C#N)c1ccccc1 ','COC(=O)c1ccc(C(C)C(OC)Oc2ccccc2)cc1 ','N#CC(Oc1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCC(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COC(c1ccccc1)C(C)(C#N)c1ccc(C#N)cc1 ','CC(C)(c1ccccc1)C(C#N)Oc1ccccc1 ','COCOCCc1ccccc1 ','CC(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCC(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C)(c1ccccc1)c1ccccc1 ','COCOC(OC)C(C)(C#N)c1ccccc1 ','COC(OC)C(C)(C#N)c1ccc(C#N)cc1 ','CC(C#N)(c1ccc(C#N)cc1)C(C#N)Oc1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)C(OC)c2ccccc2)cc1 ',
          'COCOC(c1ccccc1)C(C#N)(C#N)c1ccc(OC)cc1 ','COCOC(OC)C(C)c1ccc(C(=O)OC)cc1 ','CC(COc1ccccc1)(c1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(C)OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COCOC(C#N)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(C#N)C(C)OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','N#Cc1ccc(C(C#N)C(C#N)Oc2ccccc2)cc1 ','COCOCC(C#N)(C#N)c1ccc(C#N)cc1 ','COc1ccc(C(C#N)C(C#N)OC)cc1 ','N#CC(C#N)(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(C#N)(COc1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C)COC)cc1 ','CC(C)(c1ccc(-c2ccccc2)cc1)C(C#N)Oc1ccccc1 ','CCc1ccc(C(C#N)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(Cc1ccc(C#N)cc1)OC ','COCOC(c1ccccc1)C(c1ccccc1)c1ccccc1 ','COCOC(C#N)C(C)c1ccc(C(=O)OC)cc1 ','N#CC(C#N)(COc1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(C#N)C(C#N)(C#N)c1ccccc1 ','CCc1ccc(C(COC)c2ccccc2)cc1 ','CCc1ccc(C(C)(C)COCOC)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccccc1 ','N#CC(COc1ccccc1)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','CC(C#N)(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(C#N)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C)(C)c1ccccc1 ','COCOC(OC)C(C)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','N#CC(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COC(=O)c1ccc(C(C#N)(COc2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','N#Cc1ccc(CC(Oc2ccccc2)c2ccccc2)cc1 ','COC(Br)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C#N)OC)cc1 ','COC(Oc1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C)OCOC)cc1 ','COC(OC)C(C)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(C)OCOC)cc1 ','COCOC(C)Cc1ccc(C#N)cc1 ','CC(Oc1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(OC)c2ccccc2)cc1 ','COC(C#N)C(C)(C#N)c1ccccc1 ','COCOC(OC)C(C#N)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C)(C#N)c1ccccc1 ','COC(C)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOC(Br)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 
','COCCc1ccc([N+](=O)[O-])cc1 ','N#CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C)c1ccc([N+](=O)[O-])cc1 ','N#CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COCOC(OC)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','COCC(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(C)Oc2ccccc2)cc1 ','N#CC(Oc1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','O=[N+]([O-])c1ccc(C(COc2ccccc2)c2ccccc2)cc1 ','CC(C#N)(c1ccc(C#N)cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','N#CC(C#N)(c1ccccc1)C(Br)Oc1ccccc1 ','COC(c1ccccc1)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(c1ccccc1)C(C)(C)c1ccccc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CC(c1ccccc1)(c1ccccc1)C(C#N)Oc1ccccc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COCOCC(C#N)(C#N)c1ccc(OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OCOC)c2ccccc2)cc1 ','COCOC(C)C(C#N)c1ccc(C(=O)OC)cc1 ','COC(OC)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#CC(C#N)(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','CCc1ccc(C(C)COC)cc1 ','N#Cc1ccc(C(c2ccccc2)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(C#N)Oc1ccccc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(COC)(c2ccccc2)c2ccccc2)cc1 ','COCOC(C#N)C(C#N)c1ccccc1 ','CCc1ccc(C(C#N)(C#N)C(C#N)OC)cc1 ','COC(c1ccccc1)C(C)(C#N)c1ccccc1 ','CC(C#N)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COCOC(C)Cc1ccc(C(=O)OC)cc1 ','N#Cc1ccc(CCOc2ccccc2)cc1 ','COC(C#N)Cc1ccc(C#N)cc1 ','COC(C#N)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(COc2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(c2ccccc2)C(OCOC)c2ccccc2)cc1 ','CCc1ccc(C(C)(C#N)C(OC)Oc2ccccc2)cc1 ','COc1ccc(C(C#N)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)C(Br)OCOC)cc1 ','COC(C)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 
','COC(Br)C(C)(C#N)c1ccccc1 ','COCC(c1ccccc1)c1ccccc1 ','COCOC(Br)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(Br)OC)cc1 ','c1ccc(CC(Oc2ccccc2)c2ccccc2)cc1 ','N#CC(c1ccccc1)(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','CC(COc1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccc(C#N)cc1 ','CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COCOC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','N#CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ',
          'CC(C)(c1ccc([N+](=O)[O-])cc1)C(C#N)Oc1ccccc1 ','CCc1ccc(C(COc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OC)OCOC)cc1 ','CC(C#N)(c1ccc(-c2ccccc2)cc1)C(C#N)Oc1ccccc1 ','COC(C)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)(C#N)C(C#N)Oc2ccccc2)cc1 ','COc1ccc(C(C#N)C(C)OC)cc1 ','CC(Oc1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(CC(OC)c2ccccc2)cc1 ','COC(C#N)C(C)c1ccc(C#N)cc1 ','COCOC(C)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C)C(C)(C#N)c1ccc(OC)cc1 ','N#CC(Cc1ccccc1)Oc1ccccc1 ','COCOC(C)C(C)(C#N)c1ccc(C#N)cc1 ','COC(C)C(c1ccccc1)c1ccccc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','COC(C#N)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(C#N)C(C)(C)c1ccc(-c2ccccc2)cc1 ','COCC(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COC(Oc1ccccc1)C(C)(C#N)c1ccccc1 ','COCC(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C)(C)c1ccccc1 ','COCOCCc1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CC(C)(COc1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(C)c1ccc(-c2ccccc2)cc1 ','COC(c1ccccc1)C(C)(C)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C)C(OC)c2ccccc2)cc1 ','COC(c1ccccc1)C(C)(c1ccccc1)c1ccccc1 ','COC(c1ccccc1)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C#N)(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)C(OC)OC)cc1 ','CCc1ccc(C(C)(C)C(C)OC)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccc(C#N)cc1 ','COC(C#N)Cc1ccccc1 ','COCC(C)c1ccc(C(=O)OC)cc1 ','COc1ccc(C(C)(C#N)C(OC)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)c1ccc(C(=O)OC)cc1 ','COCOCC(C#N)c1ccc(-c2ccccc2)cc1 ','CC(c1ccccc1)(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(OC)C(C)c1ccccc1 ','COC(=O)c1ccc(C(C)(C)C(C)OC)cc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)(COc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)C(OC)c2ccccc2)cc1 ','CC(c1ccccc1)C(C#N)Oc1ccccc1 ','COC(=O)c1ccc(C(C)(C)C(Br)OC)cc1 ','COC(=O)c1ccc(C(C)(C)C(Oc2ccccc2)c2ccccc2)cc1 ','COC(Oc1ccccc1)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','N#Cc1ccc(C(C#N)(C#N)C(Br)Oc2ccccc2)cc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 
','CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(C#N)Oc1ccccc1 ','COCOCC(C)(c1ccccc1)c1ccccc1 ','COc1ccc(C(C#N)(c2ccccc2)C(Br)OC)cc1 ','COC(Br)C(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(C#N)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOCC(C)(C#N)c1ccc(OC)cc1 ','CCc1ccc(C(C#N)C(OCOC)c2ccccc2)cc1 ','COCOC(Br)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COCC(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(CC(C#N)Oc2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(c2ccccc2)C(C)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COC(C#N)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C)OC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(Br)OC)cc1 ','COCOC(OC)C(C)(C)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(OC)c2ccccc2)cc1 ','COCOC(Br)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(C#N)OC)cc1 ','COCOC(OC)C(C)(C#N)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C#N)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)C(OC)OC)cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(c1ccccc1)C(c1ccccc1)c1ccc(C#N)cc1 ','COc1ccc(C(C#N)C(Br)Oc2ccccc2)cc1 ','COCC(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(OC)OC)cc1 ','CC(c1ccc(-c2ccccc2)cc1)C(C#N)Oc1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)C(OC)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(C)OC)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CC(c1ccc(C#N)cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(CC(C)OCOC)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(C)OC)cc1 ','O=[N+]([O-])c1ccc(CC(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(CC(C#N)Oc2ccccc2)cc1 ','COCOC(OC)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 
','COCC(C)(C)c1ccccc1 ','COC(c1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','COCC(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(C#N)C(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C#N)COCOC)cc1 ','COC(C#N)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(C#N)C(C#N)Oc2ccccc2)cc1 ','COCOC(Br)Cc1ccc(C(=O)OC)cc1 ','COCC(C#N)c1ccc(-c2ccccc2)cc1 ','COCOCC(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COCCc1ccc(C#N)cc1 ','COC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ',
		  'COc1ccc(C(C)(C#N)C(OC)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(OC)OCOC)cc1 ','COC(Oc1ccccc1)C(C)(c1ccccc1)c1ccccc1 ','COCOCC(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C)(c1ccccc1)c1ccccc1 ','CCc1ccc(CC(Br)OCOC)cc1 ','COC(OC)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(Br)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(C)OC)cc1 ','COC(c1ccccc1)C(C)c1ccc([N+](=O)[O-])cc1 ','CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)Cc1ccc(C#N)cc1 ','COC(=O)c1ccc(C(COc2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','COC(Br)C(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)C(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C)(C#N)C(C#N)Oc2ccccc2)cc1 ','COC(OC)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(c2ccccc2)C(Br)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(C)OC)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C)OC)cc1 ','CCc1ccc(C(C)(C#N)C(OC)c2ccccc2)cc1 ','COC(OC)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOC(Br)Cc1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(OC)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(COCOC)c2ccccc2)cc1 ',
          'COCC(C)(C)c1ccc(C(=O)OC)cc1 ','COCOC(c1ccccc1)C(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)C(C)Oc2ccccc2)cc1 ','CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COC(Br)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C#N)C(OC)OC)cc1 ','COC(=O)c1ccc(CC(Oc2ccccc2)c2ccccc2)cc1 ','CC(Oc1ccccc1)C(C#N)(C#N)c1ccc(C#N)cc1 ','BrC(Oc1ccccc1)C(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CC(C#N)(c1ccccc1)C(Br)Oc1ccccc1 ','COCOC(OC)C(C#N)(C#N)c1ccccc1 ','COC(C)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#CC(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)COc2ccccc2)cc1 ','COC(Oc1ccccc1)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(C#N)Oc2ccccc2)cc1 ','COCC(C)(C)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C)(C#N)C(C#N)OC)cc1 ','COc1ccc(C(C#N)(C#N)C(OC)OC)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COCOC(C#N)Cc1ccccc1 ','COCCc1ccccc1 ','COCOC(OC)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(C)(C#N)c1ccc(C#N)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','BrC(Cc1ccc(-c2ccccc2)cc1)Oc1ccccc1 ','CCc1ccc(CC(OC)OCOC)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C)OCOC)cc1 ','COC(OC)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(C#N)OC)cc1 ','O=[N+]([O-])c1ccc(C(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)COC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(C)OC)cc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','CC(Oc1ccccc1)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(COc2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','COCOCC(C)(C#N)c1ccc(C#N)cc1 ','COCOCC(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(Br)OCOC)cc1 ','COC(=O)c1ccc(C(C)(COc2ccccc2)c2ccccc2)cc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COC(Br)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)(c2ccccc2)C(OC)OCOC)cc1 ','COCC(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COCOCC(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 
','CCc1ccc(C(COCOC)(c2ccccc2)c2ccccc2)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','COC(OC)C(C#N)(C#N)c1ccccc1 ','COCOC(C#N)C(C)(C)c1ccccc1 ','COC(C#N)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C)Cc1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(C#N)Cc1ccc(-c2ccccc2)cc1 ','COc1ccc(C(C#N)(C#N)C(C)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(C)COc2ccccc2)cc1 ','N#CC(c1ccccc1)(c1ccccc1)C(Br)Oc1ccccc1 ','COCC(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CC(C#N)(COc1ccccc1)c1ccccc1 ','CCc1ccc(C(C#N)(C#N)C(C)Oc2ccccc2)cc1 ','COC(OC)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C#N)(C#N)C(C#N)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)c1ccc(OC)cc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc(OC)cc1 ','COCC(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(Oc1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(C#N)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(Br)OC)cc1 ','N#CC(Oc1ccccc1)C(C#N)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(Br)OCOC)cc1 ','COC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCC(C#N)c1ccc(OC)cc1 ','COC(=O)c1ccc(C(C)COc2ccccc2)cc1 ','COC(C#N)C(C#N)(c1ccccc1)c1ccccc1 ','COCC(C#N)c1ccc(C(=O)OC)cc1 ','COCOC(OC)C(C)(c1ccccc1)c1ccccc1 ','c1ccc(OC(c2ccccc2)C(c2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C)(C)C(OC)c2ccccc2)cc1 ','COC(C)C(C)(C)c1ccccc1 ','CCc1ccc(C(C#N)C(Br)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc(OC)cc1 ','COC(Cc1ccccc1)Oc1ccccc1 ','COc1ccc(C(C#N)(COc2ccccc2)c2ccccc2)cc1 ','COC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C#N)(c1ccccc1)c1ccccc1 ','CC(Oc1ccccc1)C(C)(C#N)c1ccccc1 ','COCOC(Br)C(C#N)c1ccccc1 ','COCOC(Br)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C)(C)c1ccccc1 ','COCOC(C)C(C#N)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOCC(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(c1ccccc1)C(C)c1ccccc1 ','COCOC(OC)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOCC(C)c1ccc(C#N)cc1 ','COCOCCc1ccc(C#N)cc1 
','N#CC(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)COc2ccccc2)cc1 ','CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COC(Br)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)C(OC)c2ccccc2)cc1 ','CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COC(Cc1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C#N)(C#N)COc2ccccc2)cc1 ','COCOC(C#N)Cc1ccc(C#N)cc1 ','COC(OC)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ',
          'N#CC(C#N)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(Br)OC)cc1 ','COC(c1ccccc1)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COCC(C#N)c1ccccc1 ','COc1ccc(CC(C#N)OC)cc1 ','CC(COc1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(OC)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','N#CC(Oc1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOCC(C)(C#N)c1ccccc1 ','COC(C)C(C)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)COc2ccccc2)cc1 ','CCc1ccc(C(c2ccccc2)C(C)OCOC)cc1 ','CCc1ccc(CC(Br)OC)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(=O)c1ccc(CC(OC)OC)cc1 ','N#CC(c1ccccc1)(c1ccc(-c2ccccc2)cc1)C(Oc1ccccc1)c1ccccc1 ','CC(c1ccccc1)(c1ccccc1)C(Br)Oc1ccccc1 ','COCOC(OC)C(C#N)c1ccc(OC)cc1 ','COCOC(C#N)Cc1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(C)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COC(OC)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)C(C#N)OC)cc1 ','COC(Oc1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(C)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccc(OC)cc1 ','COCOC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','N#CC(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(Br)OC)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccc(C(=O)OC)cc1 ','COC(Oc1ccccc1)C(C)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)(C#N)COc2ccccc2)cc1 ','CC(c1ccc(C#N)cc1)C(C#N)Oc1ccccc1 ','COC(C#N)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOCC(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(OC)C(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(COc2ccccc2)c2ccccc2)cc1 ','CC(C)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COCOC(C)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(C#N)C(OC)OCOC)cc1 ','COCOC(Br)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','COCOC(OC)C(C)(C)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(Br)OC)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(OC)Oc2ccccc2)cc1 
','COCOC(OC)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOCC(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COc1ccc(C(C#N)C(OC)Oc2ccccc2)cc1 ','COC(Oc1ccccc1)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)C(C#N)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CC(C)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COC(c1ccccc1)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C#N)c1ccc(OC)cc1 ','CCc1ccc(CC(Br)Oc2ccccc2)cc1 ','N#CC(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)C(OC)Oc2ccccc2)cc1 ','COCOC(C#N)C(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)C(C)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(C#N)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','N#CC(Oc1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CC(C#N)(c1ccccc1)C(C#N)Oc1ccccc1 ','COCOC(Br)C(C)c1ccccc1 ','COCOC(C)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(C#N)(c1ccc(C#N)cc1)C(Oc1ccccc1)c1ccccc1 ','CCc1ccc(C(C)(C)C(OC)OC)cc1 ','COC(Br)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(C)(C)c1ccc(C#N)cc1 ','COCOC(Cc1ccc(C#N)cc1)c1ccccc1 ','O=[N+]([O-])c1ccc(C(c2ccccc2)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COc1ccc(C(C)(C#N)C(Br)OC)cc1 ','COCOC(OC)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CC(COc1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#CC(COc1ccccc1)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(OC)C(C#N)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(C#N)Oc2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','CC(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COCC(C)(C#N)c1ccc(C(=O)OC)cc1 ','CC(Oc1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C)c1ccc(C#N)cc1 ','COCOC(Br)Cc1ccccc1 ','COCOC(Br)C(C)c1ccc(C#N)cc1 ','COC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COCOCC(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','CC(c1ccccc1)(c1ccc(C#N)cc1)C(Oc1ccccc1)c1ccccc1 ','COC(C#N)C(C)(c1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C)(C)C(OC)c2ccccc2)cc1 
','COC(=O)c1ccc(C(C#N)C(C#N)Oc2ccccc2)cc1 ','CCc1ccc(CC(OC)OC)cc1 ','COCOC(C#N)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COC(C#N)C(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C)(c2ccccc2)C(C)OCOC)cc1 ','N#CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COC(c1ccccc1)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ',
		  'CCc1ccc(C(C)(C#N)C(OC)OC)cc1 ','COC(OC)C(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(C)OC)cc1 ','COCOC(C#N)Cc1ccc(C(=O)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','CC(c1ccccc1)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COc1ccc(C(C)(C#N)C(C#N)Oc2ccccc2)cc1 ','CC(C)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','N#Cc1ccc(C(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COC(C#N)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C)(C)c1ccc(-c2ccccc2)cc1 ','COc1ccc(C(C#N)(C#N)C(OC)Oc2ccccc2)cc1 ','CCc1ccc(C(C)C(C)OCOC)cc1 ','COC(C#N)C(C)(C#N)c1ccc(C#N)cc1 ','COCOC(C)C(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OC)OC)cc1 ','CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(C)c1ccc(-c2ccccc2)cc1 ','COC(Br)C(C)(C#N)c1ccc(C#N)cc1 ','N#CC(Oc1ccccc1)C(C#N)(C#N)c1ccccc1 ','CCc1ccc(CC(C#N)OCOC)cc1 ','COC(Oc1ccccc1)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOCC(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)C(OC)Oc2ccccc2)cc1 ','COCOCC(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C#N)c1ccccc1 ','COC(C)Cc1ccc(C#N)cc1 ','N#Cc1ccc(CC(Br)Oc2ccccc2)cc1 ','N#CC(COc1ccccc1)c1ccccc1 ','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CC(C#N)(COc1ccccc1)c1ccc(-c2ccccc2)cc1 ',
          'CC(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOCC(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(Cc1ccc(C#N)cc1)OC ','COCOC(c1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccccc1 ','N#CC(Cc1ccc([N+](=O)[O-])cc1)Oc1ccccc1 ','COCOCC(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','COC(c1ccccc1)C(C)c1ccc(C#N)cc1 ','COC(OC)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)(C#N)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','CCc1ccc(C(C)(C)C(OC)OCOC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(C)C(C)(C#N)c1ccc(C#N)cc1 ','N#CC(c1ccc(-c2ccccc2)cc1)C(Oc1ccccc1)c1ccccc1 ','COCOC(Br)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COCC(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCC(C#N)(C#N)c1ccccc1 ','COCOC(C)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOC(Br)C(C)(c1ccccc1)c1ccccc1 ','COCOCC(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(OC)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COc1ccc(C(C)(C#N)C(C)OC)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(OC)c2ccccc2)cc1 ','CC(C)(c1ccccc1)C(Br)Oc1ccccc1 ','CCc1ccc(C(C)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COCOCC(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(C#N)C(C)c1ccccc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','CCc1ccc(C(C)C(C#N)Oc2ccccc2)cc1 ','COC(C)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(C)C(C)(C)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(OC)OC)cc1 ','CCc1ccc(C(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(OC)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COCOC(Br)Cc1ccc(-c2ccccc2)cc1 ','COCOC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C)C(C)c1ccccc1 ','COCC(C)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(Br)OC)cc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccccc1 ','COCC(C#N)c1ccc(C#N)cc1 ','COCOC(C#N)Cc1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)COC)cc1 
','COC(=O)c1ccc(C(C#N)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','N#CC(COc1ccccc1)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)OCOC)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COCOCC(C#N)c1ccccc1 ','CCc1ccc(C(C)C(C#N)OC)cc1 ','COC(Br)C(C)c1ccc(C#N)cc1 ','COCOC(Br)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(OC)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(C)C(C#N)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C)COCOC)cc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COC(C)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COc1ccc(C(C#N)C(Br)OC)cc1 ','COCOCC(C)(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)C(OC)OC)cc1 ','CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','COc1ccc(C(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(C#N)C(C#N)c1ccc(OC)cc1 ','COCOC(OC)C(C#N)(C#N)c1ccc(OC)cc1 ','COC(=O)c1ccc(C(C)C(OC)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(C)OCOC)cc1 ','COCOCC(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C#N)OCOC)cc1 ','COCOC(C#N)C(C)(C#N)c1ccccc1 ','COC(OC)C(C#N)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(OC)OC)cc1 ','COC(C#N)C(C#N)c1ccccc1 ','COC(C)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(OC)C(C)c1ccc(C#N)cc1 ','COCOC(C#N)C(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(Oc1ccccc1)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(C)C(C#N)(C#N)c1ccccc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)OC)cc1 ','CC(Oc1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(C)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(Oc1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)C(OCOC)c2ccccc2)cc1 ','COc1ccc(C(C)(C#N)C(Br)Oc2ccccc2)cc1 ','COC(Br)Cc1ccccc1 ','N#CC(Cc1ccc(-c2ccccc2)cc1)Oc1ccccc1 ','N#CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)C(C#N)OCOC)cc1 ','COC(=O)c1ccc(CC(OC)Oc2ccccc2)cc1 ','COC(Cc1ccc([N+](=O)[O-])cc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(C#N)C(C)c1ccc(-c2ccccc2)cc1 ','N#Cc1ccc(C(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 
','c1ccc(OCC(c2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','COC(=O)c1ccc(CC(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C)(C#N)C(C)Oc2ccccc2)cc1 ','COC(C)C(C#N)(C#N)c1ccccc1 ','COc1ccc(C(C#N)(c2ccccc2)C(C#N)OC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)OC)cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccccc1 ','COCOCC(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccccc1 ','COCC(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','COC(C)C(C)(C#N)c1ccccc1 ',
		  'COCOC(Cc1ccc(C(=O)OC)cc1)OC ','N#CC(Oc1ccccc1)C(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','N#Cc1ccc(C(COc2ccccc2)c2ccccc2)cc1 ','COCOC(OC)C(c1ccccc1)c1ccccc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COCOCC(C)c1ccc(C(=O)OC)cc1 ','COCOC(Br)C(C)(C#N)c1ccc(OC)cc1 ','COCOCC(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1']
#print(smiles)
df = pd.DataFrame({'SMILES':smiles})
#df = pd.DataFrame['ID':clogp, 'SMILES':smiles]# ({'ID':clogp, 'SMILES':smiles})
df['Mol Image'] = [Chem.MolFromSmiles(s) for s in df['SMILES']]
#ChangeMoleculeRendering(renderer='PNG')
PandasTools.SaveXlsxFromFrame(df, 'smile_Mol_Image.xlsx', molCol='Mol Image')

mols = [Chem.MolFromSmiles(smi) for smi in smiles]
#Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(200, 200))


In [None]:
import pandas as pd
from rdkit.Chem import PandasTools
esol_data = pd.read_csv("./SMILES_feature.csv")
#esol_data.head(1)
#Add ROMol to data
PandasTools.AddMoleculeColumnToFrame(esol_data, smilesCol='smiles')
esol_data.head(1)


In [None]:
print(type(esol_data.ROMol[0]))
PandasTools.FrameToGridImage(esol_data.head(8), legendsCol="clogp", molsPerRow=4)

In [None]:
# Add new property columns using the pandas map method
esol_data["n_Atoms"] = esol_data['ROMol'].map(lambda x: x.GetNumAtoms())
esol_data.head(1)

In [None]:
# Before saving the DataFrame as a CSV file, it is recommended to drop the ROMol column.
esol_data = esol_data.drop(['ROMol'], axis=1)
esol_data.head(1)



In [None]:
# RDKit has a variety of built-in functionality for generating molecular fingerprints/descriptors
#url = 'https://raw.githubusercontent.com/XinhaoLi74/molds/master/clean_data/ESOL.csv'
#esol_data = pd.read_csv(url)
esol_data = pd.read_csv("./SMILES_feature.csv")
PandasTools.AddMoleculeColumnToFrame(esol_data, smilesCol='smiles')
esol_data.head(1)


In [None]:
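from rdkit import Chem
from rdkit.Chem import AllChem  # Chem.AllChem is only available once this submodule is imported

radius = 3  # Morgan radius 3 corresponds to ECFP6
nBits = 2048
ECFP6 = [AllChem.GetMorganFingerprintAsBitVect(x, radius=radius, nBits=nBits) for x in esol_data['ROMol']]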
print(ECFP6[0])
print(len(ECFP6[0]))

In [None]:
ecfp6_name = [f'Bit_{i}' for i in range(nBits)]
ecfp6_bits = [list(l) for l in ECFP6]
df_morgan = pd.DataFrame(ecfp6_bits, index = esol_data.smiles, columns=ecfp6_name)
df_morgan.head(1)
df_morgan.to_csv("./SMILES_ecfp6_feature_add.csv")
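
The cell above turns each bit vector into a row of 0/1 values with one `Bit_i` column per position. As a minimal pure-Python sketch of that expansion (not RDKit's own conversion), the hypothetical helper `dense_row` below builds such a row from a set of on-bit indices; the indices and the 8-bit size are made-up illustrative values.

```python
def dense_row(on_bits, n_bits):
    """Expand a collection of on-bit indices into a list of 0/1 values."""
    row = [0] * n_bits
    for i in on_bits:
        row[i] = 1
    return row

n_bits = 8  # tiny illustrative size; the notebook uses 2048
columns = [f"Bit_{i}" for i in range(n_bits)]  # same naming scheme as ecfp6_name
row = dense_row({0, 3, 7}, n_bits)  # hypothetical on-bits

# Pair each column name with its bit value, as one DataFrame row would.
print(dict(zip(columns, row)))
```

Each molecule contributes one such row, so the resulting table has as many columns as the fingerprint has bits.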

In [None]:
#Similarity Search
ref_smiles = 'N#CC(c1ccccc1)C(Br)Oc1ccccc1'
ref_mol = Chem.MolFromSmiles(ref_smiles)
ref_ECFP4_fps = Chem.AllChem.GetMorganFingerprintAsBitVect(ref_mol,2)
ref_mol

In [None]:
bulk_ECFP4_fps = [Chem.AllChem.GetMorganFingerprintAsBitVect(x,2) for x in esol_data['ROMol']]

In [None]:
from rdkit import DataStructs

similarity_ecfp4 = [DataStructs.FingerprintSimilarity(ref_ECFP4_fps,x) for x in bulk_ECFP4_fps]

In [None]:
esol_data['Tanimoto_Similarity (ECFP4)'] = similarity_ecfp4
PandasTools.FrameToGridImage(esol_data.head(8), legendsCol="Tanimoto_Similarity (ECFP4)", molsPerRow=4)
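
`DataStructs.FingerprintSimilarity` uses the Tanimoto coefficient by default. As a minimal sketch of that metric (not RDKit's implementation), the hypothetical function `tanimoto` below computes it from sets of on-bit indices. The bit sets are made-up values rather than real fingerprints, and returning 0.0 for two empty sets is an assumption of this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Assumption of this sketch: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0

ref_bits = {1, 5, 9, 42}          # hypothetical on-bits of the reference molecule
query_bits = {1, 5, 10, 42, 77}   # hypothetical on-bits of a library molecule

score = tanimoto(ref_bits, query_bits)
print(score)
```

Here three bits are shared out of six distinct bits overall, so the two hypothetical fingerprints are 50% similar.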


In [None]:
esol_data = esol_data.sort_values(['Tanimoto_Similarity (ECFP4)'], ascending=False)
PandasTools.FrameToGridImage(esol_data.head(8), legendsCol="Tanimoto_Similarity (ECFP4)", molsPerRow=4)

In [None]:
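# RDKit 2D descriptors
# https://github.com/XinhaoLi74/Hierarchical-QSAR-Modeling/blob/master/notebooks/descriptors.ipynb
from rdkit.Chem import AllChem
# Assumption: MakeGenerator comes from the descriptastorus package, as in the notebook linked above
from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator

generator = MakeGenerator(("RDKit2D",))
train = pd.read_csv("./SMILES_feature.csv")
PandasTools.AddMoleculeColumnToFrame(train, smilesCol='smiles')
# process() returns a success flag followed by the descriptor values; [1:] drops the flag
train_rdkit2d = [generator.process(x)[1:] for x in train['smiles']]
# Morgan fingerprint (radius 3, i.e. ECFP6); the function lives in AllChem, not Chem
train_ECFP6 = [AllChem.GetMorganFingerprintAsBitVect(x, 3) for x in train['ROMol']]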

In [None]:
rdkit2d_name = []
for name, numpy_type in generator.GetColumns():
    rdkit2d_name.append(name)

In [None]:
train_rdkit2d_df = pd.DataFrame(train_rdkit2d, index = train.index, columns=rdkit2d_name[1:])

In [None]:
train_rdkit2d_df.shape

In [None]:
train_rdkit2d_df.to_csv('./train_rdkit2d.csv')

In [None]:
# Use RDKit to calculate molecular fingerprints and pairwise similarity for a list of SMILES structures
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd

# read and concatenate the CSVs
#df_1 = pd.read_csv('first.csv')
#df_2 = pd.read_csv('second.csv')
df_3 = pd.read_csv("./SMILES_feature.csv")

# validate the SMILES and make a list of canonical SMILES
df_smiles = df_3['smiles']
c_smiles = []
for ds in df_smiles:
    try:
        cs = Chem.CanonSmiles(ds)
        c_smiles.append(cs)
    except Exception:  # CanonSmiles raises when the SMILES cannot be parsed
        print('Invalid SMILES:', ds)
print()

# make a list of mols
ms = [Chem.MolFromSmiles(x) for x in c_smiles]

# make a list of fingerprints (fp)
fps = [FingerprintMols.FingerprintMol(x) for x in ms]

# the list for the dataframe
qu, ta, sim = [], [], []

# compare all fp pairwise without duplicates
for n in range(len(fps)-1): # -1 because the last fp has no later fp to compare with
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:]) # compare fp n with every later fp
    print(c_smiles[n], c_smiles[n+1:]) # which mol is compared with which group
    # collect the SMILES and values
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1+m])
        sim.append(s[m])
print()

# build the dataframe and sort it
d = {'query':qu, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
#print(df_final)

# save as csv
df_final.to_csv('third.csv', index=False, sep=',')
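
The nested loop above visits every unordered pair of molecules exactly once. As a minimal sketch of the same enumeration, `itertools.combinations` yields the pairs in the same order. The hypothetical helper `unique_pairs` and the placeholder names below stand in for the canonical SMILES and are not taken from the data.

```python
from itertools import combinations

def unique_pairs(items):
    """Return every unordered pair (query, target) once, in list order."""
    return list(combinations(items, 2))

names = ["mol_a", "mol_b", "mol_c", "mol_d"]  # hypothetical placeholders for SMILES
pairs = unique_pairs(names)

# n items give n * (n - 1) / 2 pairs, which is the row count of the similarity table.
print(len(pairs))
print(pairs[0], pairs[-1])
```

Because the pair count grows quadratically with the number of molecules, printing every comparison group, as the cell above does, can become very verbose for a large CSV.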

In [None]:
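from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem import AllChem  # Chem.AllChem is only available once this submodule is imported
from rdkit.Chem.Fingerprints import FingerprintMols

# template molecule with 2D coordinates
template = Chem.MolFromSmiles('CC(C)(c1ccc([N+](=O)[O-])cc1)C(C#N)Oc1ccccc1')
AllChem.Compute2DCoords(template)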

ms = [Chem.MolFromSmiles(smi) for smi in ('N#CC(c1ccccc1)C(Br)Oc1ccccc1','O=[N+]([O-])c1ccc(C(c2ccccc2)C(Br)Oc2ccccc2)cc1','CC(Oc1ccccc1)C(C#N)c1ccc(C#N)cc1 ','COC(C#N)C(C)(C)c1ccc(-c2ccccc2)cc1 ','COC(Oc1ccccc1)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCC(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)C(OC)OCOC)cc1 ','COC(C#N)C(C)(C)c1ccccc1 ','CCc1ccc(C(C)(c2ccccc2)C(Br)OC)cc1 ','COCOC(C)C(C)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)COc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(OC)OC)cc1 ','CCc1ccc(C(C)(C#N)C(C)OC)cc1 ','COC(C#N)C(C#N)c1ccc(C#N)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc(C#N)cc1 ','COCC(C)(C#N)c1ccc(C#N)cc1 ','CC(C)(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C#N)(C#N)C(C)OC)cc1 ','COC(Br)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(c1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(c1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(CCOC)cc1 ','COCOCC(C#N)(C#N)c1ccccc1 ','COCOCC(C)c1ccccc1 ','COC(C#N)C(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(C#N)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(OC)OC)cc1 ','CCc1ccc(C(C#N)C(C#N)OCOC)cc1 ','COC(=O)c1ccc(C(C#N)C(Br)Oc2ccccc2)cc1 ','COCOC(Br)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C)(C)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(C#N)c1ccccc1 ','N#Cc1ccc(C(c2ccccc2)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COC(c1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCC(c1ccccc1)(c1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)C(OC)OC)cc1 ','COCOC(Br)C(c1ccccc1)c1ccccc1 ','CCc1ccc(CCOCOC)cc1 ','COCC(C)(C#N)c1ccccc1 ','COCOC(C)C(c1ccccc1)c1ccccc1 ','N#Cc1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COCOCC(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)C(Br)OCOC)cc1 ','COCOC(Br)C(C)(C#N)c1ccc(C#N)cc1 ','O=[N+]([O-])c1ccc(CC(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)C(Br)OCOC)cc1 ','COCC(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)(C#N)COCOC)cc1 ','N#CC(Oc1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','COC(C#N)C(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOCC(C)(C)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(C#N)c1ccc(-c2ccccc2)cc1 
','CCc1ccc(C(C)(C)C(OCOC)c2ccccc2)cc1 ','COCOC(Br)C(C)c1ccc([N+](=O)[O-])cc1 ','COCOCC(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)C(C)OC)cc1 ','COc1ccc(C(C#N)(C#N)C(Br)OC)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(Br)Oc2ccccc2)cc1 ','COC(c1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(C)C(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(C#N)OC)cc1 ','COC(c1ccccc1)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COC(C)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C)C(C)c1ccc(C#N)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(C#N)COCOC)cc1 ','COCC(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)Cc1ccc(C#N)cc1 ','COCOC(OC)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C)C(C#N)OCOC)cc1 ','COC(Br)C(C)c1ccccc1 ','COC(C#N)Cc1ccc([N+](=O)[O-])cc1 ','CC(c1ccccc1)(c1ccc(-c2ccccc2)cc1)C(C#N)Oc1ccccc1 ','CC(c1ccc([N+](=O)[O-])cc1)C(C#N)Oc1ccccc1 ','COCOC(C)C(C)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COC(C)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(OC)C(C#N)c1ccccc1 ','COC(C)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCC(C)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(OCOC)c2ccccc2)cc1 ','COCC(C#N)(C#N)c1ccc(OC)cc1 ','CCc1ccc(C(C)(C#N)C(OC)OCOC)cc1 ','c1ccc(OCC(c2ccccc2)c2ccccc2)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(C#N)OC)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(OC)OC)cc1 ','COC(C#N)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(CC(OCOC)c2ccccc2)cc1 ','N#CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COCOC(C#N)C(C#N)(C#N)c1ccccc1 ','COC(=O)c1ccc(CC(C#N)OC)cc1 ','CCc1ccc(CC(C#N)OC)cc1 ','CC(Oc1ccccc1)C(C)(C#N)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(C#N)(C#N)c1ccccc1 ','COc1ccc(C(C#N)COc2ccccc2)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(Br)OCOC)cc1 ','CCc1ccc(C(C)(COCOC)c2ccccc2)cc1 ','CCc1ccc(C(C#N)C(Br)OC)cc1 ','COCOCC(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOC(C)Cc1ccccc1 
','CC(c1ccccc1)C(Br)Oc1ccccc1 ','CC(C#N)(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','COC(c1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(C#N)OCOC)cc1 ','N#CC(COc1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)(C)C(C#N)Oc2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(C)c1ccccc1 ','COC(C)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CC(c1ccc(C#N)cc1)C(Oc1ccccc1)c1ccccc1 ',
          'COC(=O)c1ccc(C(C)C(C)OC)cc1 ','COC(=O)c1ccc(C(C)(C)C(C#N)Oc2ccccc2)cc1 ','COC(Oc1ccccc1)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(C)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)C(C#N)OC)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(C)Oc2ccccc2)cc1 ','COCC(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc(OC)cc1 ','COC(C#N)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)C(OCOC)c2ccccc2)cc1 ','COC(=O)c1ccc(CCOc2ccccc2)cc1 ','BrC(Cc1ccccc1)Oc1ccccc1 ','CCc1ccc(C(COCOC)c2ccccc2)cc1 ','COC(C)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOCC(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Cc1ccc(C#N)cc1)c1ccccc1 ','CCc1ccc(C(C)(c2ccccc2)C(OC)OC)cc1 ','COC(=O)c1ccc(C(C)C(Br)Oc2ccccc2)cc1 ','COC(Br)Cc1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C)(C#N)C(Br)OC)cc1 ','COCOC(C)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(Br)C(C#N)c1ccc(C#N)cc1 ','BrC(Oc1ccccc1)C(c1ccccc1)c1ccccc1 ','COC(c1ccccc1)C(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C)C(C)OCOC)cc1 ','COCOC(OC)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOCC(c1ccccc1)c1ccccc1 ','COC(Cc1ccccc1)OC ','N#CC(c1ccccc1)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(COC)c2ccccc2)cc1 ','COC(C#N)C(C)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C)C(C#N)Oc2ccccc2)cc1 ','COC(C)C(c1ccccc1)c1ccc(C#N)cc1 ','COC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(Oc1ccccc1)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C)C(C#N)OC)cc1 ','N#Cc1ccc(C(C#N)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','O=[N+]([O-])c1ccc(C(COc2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)(C#N)C(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C#N)C(C)OCOC)cc1 ','COC(=O)c1ccc(C(COc2ccccc2)c2ccccc2)cc1 ','COC(C)C(C#N)c1ccccc1 ','COC(=O)c1ccc(C(C#N)C(C#N)OC)cc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COC(Br)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C#N)c1ccccc1 
','CCc1ccc(C(C)(c2ccccc2)C(Br)OCOC)cc1 ','CC(c1ccccc1)(c1ccc(C#N)cc1)C(C#N)Oc1ccccc1 ','COCOC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)C(OC)Oc2ccccc2)cc1 ','COCOC(Br)C(C)c1ccc(C(=O)OC)cc1 ','N#CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','N#Cc1ccc(C(C#N)C(Br)Oc2ccccc2)cc1 ','COCC(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','O=[N+]([O-])c1ccc(CCOc2ccccc2)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(Br)OC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccccc1 ','COCOC(Br)C(c1ccccc1)c1ccc(C#N)cc1 ','N#CC(C#N)(c1ccc(-c2ccccc2)cc1)C(Oc1ccccc1)c1ccccc1 ','COCOC(C#N)C(C)(C)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C)(C)C(C#N)OC)cc1 ','COC(Br)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)C(OC)OC)cc1 ','N#Cc1ccc(C(C#N)(COc2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(C)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(Br)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C)(C)C(Br)OC)cc1 ','COC(C#N)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C)(C)C(Br)OCOC)cc1 ','COC(=O)c1ccc(C(C)(C)C(C#N)OC)cc1 ','COC(OC)C(c1ccccc1)c1ccccc1 ','COC(C)C(C#N)c1ccc(C#N)cc1 ','COCOC(Br)C(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','CC(Cc1ccc([N+](=O)[O-])cc1)Oc1ccccc1 ','COCOC(C)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C)(C#N)c1ccccc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(C#N)OCOC)cc1 ','N#Cc1ccc(C(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COCOC(C)Cc1ccc([N+](=O)[O-])cc1 ','COC(Br)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(C#N)(c1ccc(-c2ccccc2)cc1)C(Oc1ccccc1)c1ccccc1 ','CC(Oc1ccccc1)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C)(C#N)c1ccccc1 ','COc1ccc(C(C)(C#N)COc2ccccc2)cc1 ','COC(c1ccccc1)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(Cc1ccc([N+](=O)[O-])cc1)OC ','COC(c1ccccc1)C(C#N)c1ccccc1 
','COCOCCc1ccc(C(=O)OC)cc1 ','COC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCC(C#N)(c1ccccc1)c1ccccc1 ','CC(C)(c1ccc(C#N)cc1)C(C#N)Oc1ccccc1 ','O=[N+]([O-])c1ccc(C(c2ccccc2)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(C)C(C#N)(C#N)c1ccc(OC)cc1 ','COC(Oc1ccccc1)C(C)c1ccc(C#N)cc1 ','COC(=O)c1ccc(CC(Br)OC)cc1 ','COCOC(Cc1ccc(C(=O)OC)cc1)c1ccccc1 ','COCC(C)(c1ccccc1)c1ccc(C#N)cc1 ',
          'COCOC(OC)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(Cc1ccc([N+](=O)[O-])cc1)Oc1ccccc1 ','COCC(C)(C)c1ccc(C#N)cc1 ','COCOC(C)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','COC(c1ccccc1)C(c1ccccc1)c1ccccc1 ','COC(Cc1ccc(C#N)cc1)Oc1ccccc1 ','COc1ccc(C(C#N)(C#N)C(OC)c2ccccc2)cc1 ','N#CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COc1ccc(C(C#N)C(C#N)Oc2ccccc2)cc1 ','N#Cc1ccc(C(C#N)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COCOC(Br)C(C)(C)c1ccc(C(=O)OC)cc1 ','COC(Br)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','N#CC(Oc1ccccc1)C(c1ccccc1)c1ccccc1 ','COC(C)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C#N)C(C#N)OCOC)cc1 ','COCOC(C#N)C(C)(C)c1ccc(C#N)cc1 ','COCOC(C#N)C(C)c1ccccc1 ','BrC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOC(C)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)C(OC)OCOC)cc1 ','CC(C)(c1ccc(C#N)cc1)C(Br)Oc1ccccc1 ','COC(=O)c1ccc(CC(OC)c2ccccc2)cc1 ','CCc1ccc(C(C)(C#N)COc2ccccc2)cc1 ','CCc1ccc(C(C)C(C#N)OCOC)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(Cc1ccccc1)OC ','COCOCC(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)C(C)OC)cc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccccc1 ','COC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOCC(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ', 'COCOC(C#N)Cc1ccc(OC)cc1','COC(=O)c1ccc(C(C#N)C(Br)OC)cc1 ','CCc1ccc(C(C)(C#N)COC)cc1 ','COC(Br)C(C#N)c1ccccc1 ','COC(=O)c1ccc(CC(C)OC)cc1 ','COC(C)C(C)(c1ccccc1)c1ccccc1 ','COc1ccc(C(C#N)(C#N)COc2ccccc2)cc1 ','COc1ccc(C(C#N)(C#N)C(Br)Oc2ccccc2)cc1 ','COC(C)Cc1ccccc1 ','COC(Br)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(C#N)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(OC)OC)cc1 ','CCc1ccc(C(C)(COC)c2ccccc2)cc1 ','COC(Br)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(OCOC)c2ccccc2)cc1 ','COC(C#N)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOC(Cc1ccccc1)c1ccccc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc(C#N)cc1 ','COC(C)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOCC(C#N)c1ccc(OC)cc1 
','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(C#N)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','N#CC(C#N)(COc1ccccc1)c1ccccc1 ','COCOC(OC)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(OC)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(OC)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','N#CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C)c1ccccc1 ','COC(=O)c1ccc(C(C)C(Br)OC)cc1 ','COCC(C)(c1ccccc1)c1ccccc1 ','N#Cc1ccc(C(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COCOC(OC)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(c1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(C)OC)cc1 ','COCOC(Cc1ccc([N+](=O)[O-])cc1)OC ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(C#N)C(C#N)c1ccc(C(=O)OC)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(OC)OC)cc1 ','COCOC(OC)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','c1ccc(OC(c2ccccc2)C(c2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(C)Oc2ccccc2)cc1 ','COCOCC(C#N)(c1ccccc1)c1ccccc1 ','COC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','COCOCC(C)(C)c1ccccc1 ','CCc1ccc(C(c2ccccc2)C(OC)OCOC)cc1 ','CC(Oc1ccccc1)C(C#N)(C#N)c1ccccc1 ','COC(Br)C(C#N)(C#N)c1ccccc1 ','COCC(C)(C#N)c1ccc(OC)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(c1ccccc1)c1ccccc1 ','COCOC(C)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCC(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(c1ccccc1)c1ccccc1 ','COCOC(Cc1ccc([N+](=O)[O-])cc1)c1ccccc1 ','COC(C)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','CC(Oc1ccccc1)C(C#N)c1ccccc1 ','COCCc1ccc(C(=O)OC)cc1 ','CC(c1ccccc1)(c1ccc(C#N)cc1)C(Br)Oc1ccccc1 ','CCc1ccc(CC(C)OC)cc1 
','CCc1ccc(C(C)(C#N)C(C#N)OC)cc1 ','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCC(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C#N)(C#N)C(C#N)OC)cc1 ','CCc1ccc(C(c2ccccc2)C(Br)OCOC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)COc2ccccc2)cc1 ',
		  'COCOCC(C)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(CC(C#N)Oc2ccccc2)cc1 ','COC(Br)C(C)(c1ccccc1)c1ccccc1 ','COCOCC(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C#N)c1ccc(-c2ccccc2)cc1 ','COC(OC)C(C)(C)c1ccccc1 ','COCOC(OC)C(C)(C#N)c1ccc(OC)cc1 ','COC(OC)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOCC(C)(C)c1ccc(C(=O)OC)cc1 ','COCOC(c1ccccc1)C(C)c1ccc(C#N)cc1 ','COC(OC)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(Br)OC)cc1 ','CCc1ccc(C(C)C(Br)OC)cc1 ','CCc1ccc(C(C#N)C(C)Oc2ccccc2)cc1 ','COC(c1ccccc1)C(C#N)(C#N)c1ccccc1 ','COC(=O)c1ccc(C(C)C(OC)Oc2ccccc2)cc1 ','N#CC(Oc1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCC(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COC(c1ccccc1)C(C)(C#N)c1ccc(C#N)cc1 ','CC(C)(c1ccccc1)C(C#N)Oc1ccccc1 ','COCOCCc1ccccc1 ','CC(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCC(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C)(c1ccccc1)c1ccccc1 ','COCOC(OC)C(C)(C#N)c1ccccc1 ','COC(OC)C(C)(C#N)c1ccc(C#N)cc1 ','CC(C#N)(c1ccc(C#N)cc1)C(C#N)Oc1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)C(OC)c2ccccc2)cc1 ',
          'COCOC(c1ccccc1)C(C#N)(C#N)c1ccc(OC)cc1 ','COCOC(OC)C(C)c1ccc(C(=O)OC)cc1 ','CC(COc1ccccc1)(c1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(C)OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COCOC(C#N)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(C#N)C(C)OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','N#Cc1ccc(C(C#N)C(C#N)Oc2ccccc2)cc1 ','COCOCC(C#N)(C#N)c1ccc(C#N)cc1 ','COc1ccc(C(C#N)C(C#N)OC)cc1 ','N#CC(C#N)(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(C#N)(COc1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C)COC)cc1 ','CC(C)(c1ccc(-c2ccccc2)cc1)C(C#N)Oc1ccccc1 ','CCc1ccc(C(C#N)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(Cc1ccc(C#N)cc1)OC ','COCOC(c1ccccc1)C(c1ccccc1)c1ccccc1 ','COCOC(C#N)C(C)c1ccc(C(=O)OC)cc1 ','N#CC(C#N)(COc1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(C#N)C(C#N)(C#N)c1ccccc1 ','CCc1ccc(C(COC)c2ccccc2)cc1 ','CCc1ccc(C(C)(C)COCOC)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccccc1 ','N#CC(COc1ccccc1)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','CC(C#N)(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(C#N)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C)(C)c1ccccc1 ','COCOC(OC)C(C)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','N#CC(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COC(=O)c1ccc(C(C#N)(COc2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','N#Cc1ccc(CC(Oc2ccccc2)c2ccccc2)cc1 ','COC(Br)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C#N)OC)cc1 ','COC(Oc1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C)OCOC)cc1 ','COC(OC)C(C)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(C)OCOC)cc1 ','COCOC(C)Cc1ccc(C#N)cc1 ','CC(Oc1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(OC)c2ccccc2)cc1 ','COC(C#N)C(C)(C#N)c1ccccc1 ','COCOC(OC)C(C#N)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C)(C#N)c1ccccc1 ','COC(C)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOC(Br)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 
','COCCc1ccc([N+](=O)[O-])cc1 ','N#CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C)c1ccc([N+](=O)[O-])cc1 ','N#CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COCOC(OC)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','COCC(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(C)Oc2ccccc2)cc1 ','N#CC(Oc1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','O=[N+]([O-])c1ccc(C(COc2ccccc2)c2ccccc2)cc1 ','CC(C#N)(c1ccc(C#N)cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','N#CC(C#N)(c1ccccc1)C(Br)Oc1ccccc1 ','COC(c1ccccc1)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(c1ccccc1)C(C)(C)c1ccccc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CC(c1ccccc1)(c1ccccc1)C(C#N)Oc1ccccc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COCOCC(C#N)(C#N)c1ccc(OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OCOC)c2ccccc2)cc1 ','COCOC(C)C(C#N)c1ccc(C(=O)OC)cc1 ','COC(OC)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#CC(C#N)(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','CCc1ccc(C(C)COC)cc1 ','N#Cc1ccc(C(c2ccccc2)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(C#N)Oc1ccccc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(COC)(c2ccccc2)c2ccccc2)cc1 ','COCOC(C#N)C(C#N)c1ccccc1 ','CCc1ccc(C(C#N)(C#N)C(C#N)OC)cc1 ','COC(c1ccccc1)C(C)(C#N)c1ccccc1 ','CC(C#N)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COCOC(C)Cc1ccc(C(=O)OC)cc1 ','N#Cc1ccc(CCOc2ccccc2)cc1 ','COC(C#N)Cc1ccc(C#N)cc1 ','COC(C#N)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(COc2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(c2ccccc2)C(OCOC)c2ccccc2)cc1 ','CCc1ccc(C(C)(C#N)C(OC)Oc2ccccc2)cc1 ','COc1ccc(C(C#N)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)C(Br)OCOC)cc1 ','COC(C)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 
','COC(Br)C(C)(C#N)c1ccccc1 ','COCC(c1ccccc1)c1ccccc1 ','COCOC(Br)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(Br)OC)cc1 ','c1ccc(CC(Oc2ccccc2)c2ccccc2)cc1 ','N#CC(c1ccccc1)(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','CC(COc1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccc(C#N)cc1 ','CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COCOC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','N#CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ',
          'CC(C)(c1ccc([N+](=O)[O-])cc1)C(C#N)Oc1ccccc1 ','CCc1ccc(C(COc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OC)OCOC)cc1 ','CC(C#N)(c1ccc(-c2ccccc2)cc1)C(C#N)Oc1ccccc1 ','COC(C)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)(C#N)C(C#N)Oc2ccccc2)cc1 ','COc1ccc(C(C#N)C(C)OC)cc1 ','CC(Oc1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(CC(OC)c2ccccc2)cc1 ','COC(C#N)C(C)c1ccc(C#N)cc1 ','COCOC(C)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C)C(C)(C#N)c1ccc(OC)cc1 ','N#CC(Cc1ccccc1)Oc1ccccc1 ','COCOC(C)C(C)(C#N)c1ccc(C#N)cc1 ','COC(C)C(c1ccccc1)c1ccccc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','COC(C#N)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(C#N)C(C)(C)c1ccc(-c2ccccc2)cc1 ','COCC(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COC(Oc1ccccc1)C(C)(C#N)c1ccccc1 ','COCC(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C)(C)c1ccccc1 ','COCOCCc1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CC(C)(COc1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(C)c1ccc(-c2ccccc2)cc1 ','COC(c1ccccc1)C(C)(C)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C)C(OC)c2ccccc2)cc1 ','COC(c1ccccc1)C(C)(c1ccccc1)c1ccccc1 ','COC(c1ccccc1)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C#N)(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)C(OC)OC)cc1 ','CCc1ccc(C(C)(C)C(C)OC)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccc(C#N)cc1 ','COC(C#N)Cc1ccccc1 ','COCC(C)c1ccc(C(=O)OC)cc1 ','COc1ccc(C(C)(C#N)C(OC)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)c1ccc(C(=O)OC)cc1 ','COCOCC(C#N)c1ccc(-c2ccccc2)cc1 ','CC(c1ccccc1)(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(OC)C(C)c1ccccc1 ','COC(=O)c1ccc(C(C)(C)C(C)OC)cc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)(COc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)C(OC)c2ccccc2)cc1 ','CC(c1ccccc1)C(C#N)Oc1ccccc1 ','COC(=O)c1ccc(C(C)(C)C(Br)OC)cc1 ','COC(=O)c1ccc(C(C)(C)C(Oc2ccccc2)c2ccccc2)cc1 ','COC(Oc1ccccc1)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','N#Cc1ccc(C(C#N)(C#N)C(Br)Oc2ccccc2)cc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 
','CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(C#N)Oc1ccccc1 ','COCOCC(C)(c1ccccc1)c1ccccc1 ','COc1ccc(C(C#N)(c2ccccc2)C(Br)OC)cc1 ','COC(Br)C(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(C#N)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOCC(C)(C#N)c1ccc(OC)cc1 ','CCc1ccc(C(C#N)C(OCOC)c2ccccc2)cc1 ','COCOC(Br)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COCC(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(CC(C#N)Oc2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(c2ccccc2)C(C)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COC(C#N)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C)OC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(Br)OC)cc1 ','COCOC(OC)C(C)(C)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(OC)c2ccccc2)cc1 ','COCOC(Br)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(C#N)OC)cc1 ','COCOC(OC)C(C)(C#N)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C#N)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)C(OC)OC)cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(c1ccccc1)C(c1ccccc1)c1ccc(C#N)cc1 ','COc1ccc(C(C#N)C(Br)Oc2ccccc2)cc1 ','COCC(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(OC)OC)cc1 ','CC(c1ccc(-c2ccccc2)cc1)C(C#N)Oc1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)C(OC)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(C)OC)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CC(c1ccc(C#N)cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(CC(C)OCOC)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(C)OC)cc1 ','O=[N+]([O-])c1ccc(CC(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(CC(C#N)Oc2ccccc2)cc1 ','COCOC(OC)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 
','COCC(C)(C)c1ccccc1 ','COC(c1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','COCC(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(C#N)C(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C#N)COCOC)cc1 ','COC(C#N)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(C#N)C(C#N)Oc2ccccc2)cc1 ','COCOC(Br)Cc1ccc(C(=O)OC)cc1 ','COCC(C#N)c1ccc(-c2ccccc2)cc1 ','COCOCC(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COCCc1ccc(C#N)cc1 ','COC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ',
		  'COc1ccc(C(C)(C#N)C(OC)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(OC)OCOC)cc1 ','COC(Oc1ccccc1)C(C)(c1ccccc1)c1ccccc1 ','COCOCC(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(OC)C(C)(c1ccccc1)c1ccccc1 ','CCc1ccc(CC(Br)OCOC)cc1 ','COC(OC)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(Br)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(C)OC)cc1 ','COC(c1ccccc1)C(C)c1ccc([N+](=O)[O-])cc1 ','CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)Cc1ccc(C#N)cc1 ','COC(=O)c1ccc(C(COc2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','COC(Br)C(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)C(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C)(C#N)C(C#N)Oc2ccccc2)cc1 ','COC(OC)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(c2ccccc2)C(Br)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(C)OC)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C)OC)cc1 ','CCc1ccc(C(C)(C#N)C(OC)c2ccccc2)cc1 ','COC(OC)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOC(Br)Cc1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(OC)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(COCOC)c2ccccc2)cc1 ',
          'COCC(C)(C)c1ccc(C(=O)OC)cc1 ','COCOC(c1ccccc1)C(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)C(C)Oc2ccccc2)cc1 ','CC(c1ccccc1)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COC(Br)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C#N)C(OC)OC)cc1 ','COC(=O)c1ccc(CC(Oc2ccccc2)c2ccccc2)cc1 ','CC(Oc1ccccc1)C(C#N)(C#N)c1ccc(C#N)cc1 ','BrC(Oc1ccccc1)C(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CC(C#N)(c1ccccc1)C(Br)Oc1ccccc1 ','COCOC(OC)C(C#N)(C#N)c1ccccc1 ','COC(C)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#CC(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)COc2ccccc2)cc1 ','COC(Oc1ccccc1)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(C)C(Br)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(C#N)C(C#N)Oc2ccccc2)cc1 ','COCC(C)(C)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C)(C#N)C(C#N)OC)cc1 ','COc1ccc(C(C#N)(C#N)C(OC)OC)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COCOC(C#N)Cc1ccccc1 ','COCCc1ccccc1 ','COCOC(OC)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Oc1ccccc1)C(C)(C#N)c1ccc(C#N)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','BrC(Cc1ccc(-c2ccccc2)cc1)Oc1ccccc1 ','CCc1ccc(CC(OC)OCOC)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C)OCOC)cc1 ','COC(OC)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(C#N)OC)cc1 ','O=[N+]([O-])c1ccc(C(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C#N)(C#N)COC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(C)OC)cc1 ','COCOC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','CC(Oc1ccccc1)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(COc2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','COCOCC(C)(C#N)c1ccc(C#N)cc1 ','COCOCC(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(Br)OCOC)cc1 ','COC(=O)c1ccc(C(C)(COc2ccccc2)c2ccccc2)cc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COC(Br)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)(c2ccccc2)C(OC)OCOC)cc1 ','COCC(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COCOCC(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 
','CCc1ccc(C(COCOC)(c2ccccc2)c2ccccc2)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','COC(OC)C(C#N)(C#N)c1ccccc1 ','COCOC(C#N)C(C)(C)c1ccccc1 ','COC(C#N)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C)Cc1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(C#N)Cc1ccc(-c2ccccc2)cc1 ','COc1ccc(C(C#N)(C#N)C(C)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(C)COc2ccccc2)cc1 ','N#CC(c1ccccc1)(c1ccccc1)C(Br)Oc1ccccc1 ','COCC(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CC(C#N)(COc1ccccc1)c1ccccc1 ','CCc1ccc(C(C#N)(C#N)C(C)Oc2ccccc2)cc1 ','COC(OC)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COc1ccc(C(C#N)(C#N)C(C#N)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)c1ccc(OC)cc1 ','COCOC(C#N)C(C#N)(C#N)c1ccc(OC)cc1 ','COCC(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(Oc1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(C#N)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(Br)OC)cc1 ','N#CC(Oc1ccccc1)C(C#N)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(Br)OCOC)cc1 ','COC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCC(C#N)c1ccc(OC)cc1 ','COC(=O)c1ccc(C(C)COc2ccccc2)cc1 ','COC(C#N)C(C#N)(c1ccccc1)c1ccccc1 ','COCC(C#N)c1ccc(C(=O)OC)cc1 ','COCOC(OC)C(C)(c1ccccc1)c1ccccc1 ','c1ccc(OC(c2ccccc2)C(c2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','CCc1ccc(C(C)(C)C(OC)c2ccccc2)cc1 ','COC(C)C(C)(C)c1ccccc1 ','CCc1ccc(C(C#N)C(Br)Oc2ccccc2)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccc(OC)cc1 ','COC(Cc1ccccc1)Oc1ccccc1 ','COc1ccc(C(C#N)(COc2ccccc2)c2ccccc2)cc1 ','COC(c1ccccc1)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C#N)(c1ccccc1)c1ccccc1 ','CC(Oc1ccccc1)C(C)(C#N)c1ccccc1 ','COCOC(Br)C(C#N)c1ccccc1 ','COCOC(Br)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C)(C)c1ccccc1 ','COCOC(C)C(C#N)c1ccc(C#N)cc1 ','COC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOCC(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(c1ccccc1)C(C)c1ccccc1 ','COCOC(OC)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOCC(C)c1ccc(C#N)cc1 ','COCOCCc1ccc(C#N)cc1 
','N#CC(COc1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C#N)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)COc2ccccc2)cc1 ','CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COC(Br)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)C(OC)c2ccccc2)cc1 ','CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COC(Cc1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C#N)(C#N)COc2ccccc2)cc1 ','COCOC(C#N)Cc1ccc(C#N)cc1 ','COC(OC)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ',
          'N#CC(C#N)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(Br)OC)cc1 ','COC(c1ccccc1)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COCC(C#N)c1ccccc1 ','COc1ccc(CC(C#N)OC)cc1 ','CC(COc1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(OC)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','N#CC(Oc1ccccc1)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOCC(C)(C#N)c1ccccc1 ','COC(C)C(C)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)COc2ccccc2)cc1 ','CCc1ccc(C(c2ccccc2)C(C)OCOC)cc1 ','CCc1ccc(CC(Br)OC)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(=O)c1ccc(CC(OC)OC)cc1 ','N#CC(c1ccccc1)(c1ccc(-c2ccccc2)cc1)C(Oc1ccccc1)c1ccccc1 ','CC(c1ccccc1)(c1ccccc1)C(Br)Oc1ccccc1 ','COCOC(OC)C(C#N)c1ccc(OC)cc1 ','COCOC(C#N)Cc1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(C)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COC(OC)C(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)C(C#N)OC)cc1 ','COC(Oc1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(C)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccc(OC)cc1 ','COCOC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','N#CC(c1ccccc1)C(Oc1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(Br)OC)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccc(C(=O)OC)cc1 ','COC(Oc1ccccc1)C(C)c1ccc([N+](=O)[O-])cc1 ','N#Cc1ccc(C(C#N)(C#N)COc2ccccc2)cc1 ','CC(c1ccc(C#N)cc1)C(C#N)Oc1ccccc1 ','COC(C#N)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOCC(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(OC)C(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(COc2ccccc2)c2ccccc2)cc1 ','CC(C)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COCOC(C)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C#N)(C#N)C(OC)OCOC)cc1 ','COCOC(Br)C(C)(C#N)c1ccc(C(=O)OC)cc1 ','COCOC(OC)C(C)(C)c1ccccc1 ','CCc1ccc(C(C)(C#N)C(Br)OC)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(OC)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(OC)Oc2ccccc2)cc1 
','COCOC(OC)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOCC(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COc1ccc(C(C#N)C(OC)Oc2ccccc2)cc1 ','COC(Oc1ccccc1)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)C(C#N)Oc2ccccc2)cc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','CC(C)(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COC(c1ccccc1)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C#N)c1ccc(OC)cc1 ','CCc1ccc(CC(Br)Oc2ccccc2)cc1 ','N#CC(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COC(=O)c1ccc(C(C)(C#N)C(OC)Oc2ccccc2)cc1 ','COCOC(C#N)C(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)C(C)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(C#N)C(c1ccccc1)c1ccc(C(=O)OC)cc1 ','N#CC(Oc1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CC(C#N)(c1ccccc1)C(C#N)Oc1ccccc1 ','COCOC(Br)C(C)c1ccccc1 ','COCOC(C)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(C#N)(c1ccc(C#N)cc1)C(Oc1ccccc1)c1ccccc1 ','CCc1ccc(C(C)(C)C(OC)OC)cc1 ','COC(Br)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(C)(C)c1ccc(C#N)cc1 ','COCOC(Cc1ccc(C#N)cc1)c1ccccc1 ','O=[N+]([O-])c1ccc(C(c2ccccc2)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COc1ccc(C(C)(C#N)C(Br)OC)cc1 ','COCOC(OC)C(C#N)c1ccc([N+](=O)[O-])cc1 ','CC(COc1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','N#CC(COc1ccccc1)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COCOC(OC)C(C#N)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)(C#N)C(C#N)Oc2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','CC(c1ccc([N+](=O)[O-])cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C)(C)c1ccc([N+](=O)[O-])cc1 ','COCOC(Br)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COCC(C)(C#N)c1ccc(C(=O)OC)cc1 ','CC(Oc1ccccc1)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C)c1ccc(C#N)cc1 ','COCOC(Br)Cc1ccccc1 ','COCOC(Br)C(C)c1ccc(C#N)cc1 ','COC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(C#N)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','COCOCC(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','CC(c1ccccc1)(c1ccc(C#N)cc1)C(Oc1ccccc1)c1ccccc1 ','COC(C#N)C(C)(c1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C)(C)C(OC)c2ccccc2)cc1 
','COC(=O)c1ccc(C(C#N)C(C#N)Oc2ccccc2)cc1 ','CCc1ccc(CC(OC)OC)cc1 ','COCOC(C#N)C(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CC(C#N)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','COC(C#N)C(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C)(c2ccccc2)C(C)OCOC)cc1 ','N#CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','CC(Oc1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc(C#N)cc1 ','COC(c1ccccc1)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ',
		  'CCc1ccc(C(C)(C#N)C(OC)OC)cc1 ','COC(OC)C(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(C)OC)cc1 ','COCOC(C#N)Cc1ccc(C(=O)OC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(Br)Oc2ccccc2)cc1 ','CC(c1ccccc1)(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COc1ccc(C(C)(C#N)C(C#N)Oc2ccccc2)cc1 ','CC(C)(c1ccc([N+](=O)[O-])cc1)C(Oc1ccccc1)c1ccccc1 ','N#Cc1ccc(C(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(Br)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COC(C#N)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(Br)C(C)(C)c1ccc(-c2ccccc2)cc1 ','COc1ccc(C(C#N)(C#N)C(OC)Oc2ccccc2)cc1 ','CCc1ccc(C(C)C(C)OCOC)cc1 ','COC(C#N)C(C)(C#N)c1ccc(C#N)cc1 ','COCOC(C)C(C)(C)c1ccc(C#N)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(OC)OC)cc1 ','CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(C)c1ccc(-c2ccccc2)cc1 ','COC(Br)C(C)(C#N)c1ccc(C#N)cc1 ','N#CC(Oc1ccccc1)C(C#N)(C#N)c1ccccc1 ','CCc1ccc(CC(C#N)OCOC)cc1 ','COC(Oc1ccccc1)C(C#N)(C#N)c1ccc(C#N)cc1 ','COCOCC(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(=O)c1ccc(C(C#N)C(OC)Oc2ccccc2)cc1 ','COCOCC(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C#N)c1ccccc1 ','COC(C)Cc1ccc(C#N)cc1 ','N#Cc1ccc(CC(Br)Oc2ccccc2)cc1 ','N#CC(COc1ccccc1)c1ccccc1 ','COCOC(C#N)C(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CC(C#N)(COc1ccccc1)c1ccc(-c2ccccc2)cc1 ',
          'CC(c1ccc(-c2ccccc2)cc1)C(Br)Oc1ccccc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOCC(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C#N)C(C)(c1ccccc1)c1ccc(C#N)cc1 ','COC(Cc1ccc(C#N)cc1)OC ','COCOC(c1ccccc1)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccccc1 ','N#CC(Cc1ccc([N+](=O)[O-])cc1)Oc1ccccc1 ','COCOCC(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','COC(c1ccccc1)C(C)c1ccc(C#N)cc1 ','COC(OC)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)(C#N)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','CCc1ccc(C(C)(C)C(OC)OCOC)cc1 ','COc1ccc(C(C#N)(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(C)C(C)(C#N)c1ccc(C#N)cc1 ','N#CC(c1ccc(-c2ccccc2)cc1)C(Oc1ccccc1)c1ccccc1 ','COCOC(Br)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COCC(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCC(C#N)(C#N)c1ccccc1 ','COCOC(C)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COCOC(Br)C(C)(c1ccccc1)c1ccccc1 ','COCOCC(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(OC)C(c1ccccc1)(c1ccccc1)c1ccccc1 ','COc1ccc(C(C)(C#N)C(C)OC)cc1 ','CCc1ccc(C(C)(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(c1ccccc1)C(C#N)c1ccc(C#N)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(OC)c2ccccc2)cc1 ','CC(C)(c1ccccc1)C(Br)Oc1ccccc1 ','CCc1ccc(C(C)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COCOCC(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(C#N)C(C)c1ccccc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccccc1 ','COC(=O)c1ccc(C(C)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','CCc1ccc(C(C)C(C#N)Oc2ccccc2)cc1 ','COC(C)C(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(C)C(C)(C)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(c2ccccc2)C(OC)OC)cc1 ','CCc1ccc(C(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COC(OC)C(C)(C#N)c1ccc([N+](=O)[O-])cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COCOC(Br)Cc1ccc(-c2ccccc2)cc1 ','COCOC(c1ccccc1)C(c1ccccc1)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COC(C)C(C)c1ccccc1 ','COCC(C)c1ccc(C#N)cc1 ','COC(=O)c1ccc(C(C#N)(c2ccccc2)C(Br)OC)cc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(Br)C(C#N)(C#N)c1ccccc1 ','COCC(C#N)c1ccc(C#N)cc1 ','COCOC(C#N)Cc1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(C#N)COC)cc1 
','COC(=O)c1ccc(C(C#N)(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','N#CC(COc1ccccc1)c1ccc(-c2ccccc2)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)OCOC)cc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)Oc2ccccc2)cc1 ','COCOCC(C#N)c1ccccc1 ','CCc1ccc(C(C)C(C#N)OC)cc1 ','COC(Br)C(C)c1ccc(C#N)cc1 ','COCOC(Br)C(C)(C#N)c1ccc(-c2ccccc2)cc1 ','COC(OC)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','COC(C)C(C#N)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(C)COCOC)cc1 ','COCOC(C#N)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COC(C)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COc1ccc(C(C#N)C(Br)OC)cc1 ','COCOCC(C)(C#N)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C)C(OC)OC)cc1 ','CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','COc1ccc(C(C#N)C(Oc2ccccc2)c2ccccc2)cc1 ','COCOC(C#N)C(C#N)c1ccc(OC)cc1 ','COCOC(OC)C(C#N)(C#N)c1ccc(OC)cc1 ','COC(=O)c1ccc(C(C)C(OC)OC)cc1 ','CCc1ccc(C(C#N)(C#N)C(C)OCOC)cc1 ','COCOCC(C)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','CCc1ccc(C(C#N)(c2ccccc2)C(C#N)OCOC)cc1 ','COCOC(C#N)C(C)(C#N)c1ccccc1 ','COC(OC)C(C#N)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(OC)OC)cc1 ','COC(C#N)C(C#N)c1ccccc1 ','COC(C)C(C#N)(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(OC)C(C)c1ccc(C#N)cc1 ','COCOC(C#N)C(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','COC(Oc1ccccc1)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(C)C(C#N)(C#N)c1ccccc1 ','CCc1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)OC)cc1 ','CC(Oc1ccccc1)C(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','COCOC(OC)C(C#N)(C#N)c1ccc(-c2ccccc2)cc1 ','COCOC(C)C(C)(c1ccccc1)c1ccc([N+](=O)[O-])cc1 ','CC(Oc1ccccc1)C(C#N)(C#N)c1ccc([N+](=O)[O-])cc1 ','CCc1ccc(C(C)C(OCOC)c2ccccc2)cc1 ','COc1ccc(C(C)(C#N)C(Br)Oc2ccccc2)cc1 ','COC(Br)Cc1ccccc1 ','N#CC(Cc1ccc(-c2ccccc2)cc1)Oc1ccccc1 ','N#CC(Oc1ccccc1)C(C#N)(c1ccccc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)C(C#N)OCOC)cc1 ','COC(=O)c1ccc(CC(OC)Oc2ccccc2)cc1 ','COC(Cc1ccc([N+](=O)[O-])cc1)c1ccccc1 ','CCc1ccc(C(c2ccccc2)C(OC)c2ccccc2)cc1 ','COC(C#N)C(C)c1ccc(-c2ccccc2)cc1 ','N#Cc1ccc(C(c2ccccc2)C(Oc2ccccc2)c2ccccc2)cc1 
','c1ccc(OCC(c2ccccc2)(c2ccccc2)c2ccccc2)cc1 ','COC(=O)c1ccc(CC(Br)Oc2ccccc2)cc1 ','CCc1ccc(C(C)(C#N)C(C)Oc2ccccc2)cc1 ','COC(C)C(C#N)(C#N)c1ccccc1 ','COc1ccc(C(C#N)(c2ccccc2)C(C#N)OC)cc1 ','COC(=O)c1ccc(C(c2ccccc2)(c2ccccc2)C(C#N)OC)cc1 ','COCOC(C)C(C#N)(c1ccccc1)c1ccccc1 ','COCOCC(c1ccccc1)c1ccc(C#N)cc1 ','COCOC(c1ccccc1)C(C)(C#N)c1ccccc1 ','COCC(c1ccccc1)(c1ccccc1)c1ccc(C(=O)OC)cc1 ','COCOC(C)C(C#N)(C#N)c1ccc(C(=O)OC)cc1 ','COC(C)C(C)(C#N)c1ccccc1 ',
		  'COCOC(Cc1ccc(C(=O)OC)cc1)OC ','N#CC(Oc1ccccc1)C(c1ccccc1)c1ccc(-c2ccccc2)cc1 ','N#Cc1ccc(C(COc2ccccc2)c2ccccc2)cc1 ','COCOC(OC)C(c1ccccc1)c1ccccc1 ','COCOC(OC)C(C#N)(c1ccccc1)c1ccc(OC)cc1 ','COCOCC(C)c1ccc(C(=O)OC)cc1 ','COCOC(Br)C(C)(C#N)c1ccc(OC)cc1 ','COCOCC(C#N)(c1ccccc1)c1ccc([N+](=O)[O-])cc1')]
from rdkit.Chem import AllChem, Draw

# Align each molecule's 2D depiction to the shared template so the common core is drawn consistently.
for m in ms:
    AllChem.GenerateDepictionMatching2DStructure(m, template)

# Draw the first eight molecules in a grid (four per row), labelled with their names.
img = Draw.MolsToGridImage(ms[:8], molsPerRow=4, subImgSize=(200,200), legends=[x.GetProp("_Name") for x in ms[:8]])
img.save('images/SMILES_mol2.o.png')

In [None]:
#Generating Similarity Maps Using Fingerprints
from rdkit import Chem
mol = Chem.MolFromSmiles('N#CC(c1ccccc1)C(Br)Oc1ccccc1')
refmol = Chem.MolFromSmiles('N#CC(Oc1ccccc1)C(c1ccccc1)c1ccc(-c2ccccc2)cc1')

In [None]:
from rdkit.Chem import Draw
from rdkit.Chem.Draw import SimilarityMaps
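# Three fingerprint types supported by SimilarityMaps; each call overwrites fp, so only the last one is kept.
fp = SimilarityMaps.GetAPFingerprint(mol, fpType='normal')   # atom pairs
fp = SimilarityMaps.GetTTFingerprint(mol, fpType='normal')   # topological torsions
fp = SimilarityMaps.GetMorganFingerprint(mol, fpType='bv')   # Morgan (circular) bit vector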

In [None]:
fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(refmol, mol, SimilarityMaps.GetMorganFingerprint)

In [None]:
from rdkit import DataStructs
fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(refmol, mol, lambda m,idx: SimilarityMaps.GetMorganFingerprint(m, atomId=idx, radius=1, fpType='count'), metric=DataStructs.TanimotoSimilarity)
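
The `metric=DataStructs.TanimotoSimilarity` argument in the cell above selects the Tanimoto coefficient for comparing fingerprints. As a rough illustration of what that coefficient measures for bit-vector fingerprints, the sketch below computes it on plain Python sets of "on" bit indices. The function name `tanimoto` and the bit sets are made up for this example; RDKit's own implementation works on its fingerprint objects, and count-based fingerprints use a generalized form that is not shown here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen in this sketch for two empty fingerprints
    return len(a & b) / len(union)

# Hypothetical bit indices, not taken from a real molecule.
fp_ref = {3, 17, 42, 99}
fp_probe = {3, 42, 57}

similarity = tanimoto(fp_ref, fp_probe)
print(f"Tanimoto similarity: {similarity:.2f}")
```

The coefficient is the number of shared bits divided by the number of bits set in either fingerprint, so it ranges from 0 (nothing in common) to 1 (identical bit sets).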

In [None]:
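from rdkit.Chem import AllChem, Descriptors
m = Chem.MolFromSmiles('N#CC(c1ccccc1)C(Br)Oc1ccccc1')
print(Descriptors.TPSA(m))     # topological polar surface area
print(Descriptors.MolLogP(m))  # Crippen logP estimate
# Gasteiger partial charges are stored on each atom as the '_GasteigerCharge' property.
AllChem.ComputeGasteigerCharges(m)
m.GetAtomWithIdx(0).GetDoubleProp('_GasteigerCharge')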


In [None]:
#Visualization of Descriptors
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import SimilarityMaps
mol = Chem.MolFromSmiles('N#CC(c1ccccc1)C(Br)Oc1ccccc1')
AllChem.ComputeGasteigerCharges(mol)
# One partial charge per atom, used as the per-atom weight in the map.
contribs = [mol.GetAtomWithIdx(i).GetDoubleProp('_GasteigerCharge') for i in range(mol.GetNumAtoms())]
fig = SimilarityMaps.GetSimilarityMapFromWeights(mol, contribs, colorMap='jet', contourLines=10)


In [None]:
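from rdkit.Chem import rdMolDescriptors
# Each entry is a (logP contribution, molar refractivity contribution) pair for one atom.
contribs = rdMolDescriptors._CalcCrippenContribs(mol)
# Map only the logP contributions (the first element of each pair).
fig = SimilarityMaps.GetSimilarityMapFromWeights(mol,[x for x,y in contribs], colorMap='jet', contourLines=10)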

In [None]:
#Chemical Features
from rdkit import Chem
from rdkit.Chem import ChemicalFeatures
from rdkit import RDConfig
import os
fdefName = os.path.join(RDConfig.RDDataDir,'BaseFeatures.fdef')
factory = ChemicalFeatures.BuildFeatureFactory(fdefName)
m = Chem.MolFromSmiles('N#CC(c1ccccc1)C(Br)Oc1ccccc1')
feats = factory.GetFeaturesForMol(m)
print(len(feats))
print(feats[0].GetFamily())
print(feats[0].GetType())
print(feats[0].GetAtomIds())
print(feats[4].GetFamily())
print(feats[4].GetAtomIds())
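# Feature positions are only available once the molecule has coordinates.
from rdkit.Chem import AllChem
AllChem.Compute2DCoords(m)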
print(feats[0].GetPos())
print(list(feats[0].GetPos()))


In [None]:
# Molecular Fragments
fName=os.path.join(RDConfig.RDDataDir,'FunctionalGroups.txt')
from rdkit.Chem import FragmentCatalog
fparams = FragmentCatalog.FragCatParams(1,6,fName)
print(fparams.GetNumFuncGroups())
fcat=FragmentCatalog.FragCatalog(fparams)
fcgen=FragmentCatalog.FragCatGenerator()
m = Chem.MolFromSmiles('N#CC(c1ccccc1)C(Br)Oc1ccccc1')
print(fcgen.AddFragsFromMol(m,fcat))
print(fcat.GetEntryDescription(0))
print(fcat.GetEntryDescription(1))
print(fcat.GetEntryDescription(2))
list(fcat.GetEntryFuncGroupIds(2))
fparams.GetFuncGroup(1)
print(Chem.MolToSmarts(fparams.GetFuncGroup(1)))
print(Chem.MolToSmarts(fparams.GetFuncGroup(34)))
print(fparams.GetFuncGroup(1).GetProp('_Name'))
print(fparams.GetFuncGroup(34).GetProp('_Name'))


In [None]:
m = Chem.MolFromSmiles('N#CC(c1ccccc1)C(Br)Oc1ccccc1')
m.GetNumAtoms()
help(m.GetNumAtoms)
m.GetNumAtoms(onlyExplicit=False)

In [None]:
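#Advanced Topics/Warnings: Editing Molecules
m = Chem.MolFromSmiles('N#CC(c1ccccc1)C(Br)Oc1ccccc1')
# Atom 0 is the nitrile nitrogen, so setting atomic number 7 leaves this molecule unchanged.
m.GetAtomWithIdx(0).SetAtomicNum(7)
# SanitizeMol returns SanitizeFlags.SANITIZE_NONE when every sanitization step succeeds.
Chem.SanitizeMol(m)
print(Chem.MolToSmiles(m))
#Do not forget the sanitization step; without it one can end up with results that look OK (so long as you don't think):
m = Chem.MolFromSmiles('N#CC(c1ccccc1)C(Br)Oc1ccccc1')
# Atom 0 becomes an oxygen with a triple bond, which is chemically unreasonable; nothing checks this until the molecule is sanitized.
m.GetAtomWithIdx(0).SetAtomicNum(8)
print(Chem.MolToSmiles(m))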



We can explore the range of solubilities found in the dataset by plotting a histogram of solubility values from the dataset. Our machine learning models will aim to predict these solubilities.

In [None]:
#sns.distplot(dataset["measured log solubility in mols per litre"])
df = pd.DataFrame(dataset)
display(df)
df_condition = df[(df['sssr'] < 10) & (df["clogp"] > 0.25)]
# https://buildmedia.readthedocs.org/media/pdf/rdkit/latest/rdkit.pdf

# 'sssr', -- smallest set of smallest rings
# 'clogp', --
# 'mr', --
# 'mw', --
# 'tpsa', -- topological polar surface area (TPSA) descriptor
# 'chi0n', 'chi1n', 'chi2n', 'chi3n', 'chi4n', --  Connectivity Descriptors returns the ChiXn value for a molecule for X=0-4 Rev. Comput. Chem. 2:367-422 (1991)
# 'chi0v', 'chi1v', 'chi2v', 'chi3v', 'chi4v', -- returns the ChiXv value for a molecule for X=0-4 Rev. Comput. Chem. 2:367-422 (1991)
# 'fracsp3', -- 
# 'hall_kier_alpha', -- Rev. Comput. Chem. 2:367-422 (1991)
# 'kappa1', 'kappa2', 'kappa3', -- Rev. Comput. Chem. 2:367-422 (1991)
# 'labuteasa', -- J. Mol. Graph. Mod. 18:464-77 (2000)
# 'number_aliphatic_rings', --
# 'number_aromatic_rings', --
# 'number_amide_bonds', --
# 'number_atom_stereocenters', -- 
# 'number_bridgehead_atoms', --
# 'number_HBA', --
# 'number_HBD', --
# 'number_hetero_atoms', -- 
# 'number_hetero_cycles', --
# 'number_rings', --
# 'number_rotatable_bonds', --
# 'number_spiro', -- Number of spiro atoms (atoms shared between rings that share exactly one atom)
# 'number_saturated_rings', --
# 'number_heavy_atoms', --
# 'number_nh_oh', --
# 'number_n_o', --
# 'number_valence_electrons', --
# 'max_partial_charge', --
# 'min_partial_charge',-- 
# 'fr_C_O', --
# 'fr_C_O_noCOO', --
# 'fr_Al_OH', --
# 'fr_Ar_OH', --
# 'fr_methoxy', --
# 'fr_oxime', --
# 'fr_ester', --
# 'fr_Al_COO', --
# 'fr_Ar_COO',-- 
# 'fr_COO', --
# 'fr_COO2', --
# 'fr_ketone', --
# 'fr_ether', --
# 'fr_phenol', --
# 'fr_aldehyde',-- 
# 'fr_quatN', --
# 'fr_NH2', --
# 'fr_NH1', --
# 'fr_NH0', --
# 'fr_Ar_N', --
# 'fr_Ar_NH', --
# 'fr_aniline', --
# 'fr_Imine', --
# 'fr_nitrile', --
# 'fr_hdrzine', --
# 'fr_hdrzone', --
# 'fr_nitroso', --
# 'fr_N_O', --
# 'fr_nitro', --
# 'fr_azo', --
# 'fr_diazo', --
# 'fr_azide', --
# 'fr_amide', --
# 'fr_priamide',-- 
# 'fr_amidine', --
# 'fr_guanido', --
# 'fr_Nhpyrrole', --
# 'fr_imide', --
# 'fr_isocyan', --
# 'fr_isothiocyan',-- 
# 'fr_thiocyan',-- 
# 'fr_halogen', --
# 'fr_alkyl_halide',-- 
# 'fr_sulfide',-- 
# 'fr_SH', --
# 'fr_C_S', --
# 'fr_sulfone', --
# 'fr_sulfonamd', --
# 'fr_prisulfonamd',-- 
# 'fr_barbitur', --
# 'fr_urea', --
# 'fr_term_acetylene', -- 
# 'fr_imidazole',-- 
# 'fr_furan', --
# 'fr_thiophene', --
# 'fr_thiazole', --
# 'fr_oxazole', --
# 'fr_pyridine', --
# 'fr_piperdine', --
# 'fr_piperzine', --
# 'fr_morpholine', --
# 'fr_lactam', --
# 'fr_lactone', --
# 'fr_tetrazole', --
# 'fr_epoxide', --
# 'fr_unbrch_alkane',-- 
# 'fr_bicyclic', --
# 'fr_benzene', --
# 'fr_phos_acid', --
# 'fr_phos_ester', --
# 'fr_nitro_arom', --
# 'fr_nitro_arom_nonortho', --
# 'fr_dihydropyridine', --
# 'fr_phenol_noOrthoHbond', --
# 'fr_Al_OH_noTert', --
# 'fr_benzodiazepine', --
# 'fr_para_hydroxylation',-- 
# 'fr_allylic_oxid', --
# 'fr_aryl_methyl', --
# 'fr_Ndealkylation1',-- 
# 'fr_Ndealkylation2', --
# 'fr_alkyl_carbamate', --
# 'fr_ketone_Topliss', --
# 'fr_ArN', --
# 'fr_HOCCN',--

display(df_condition)
df_clogp = df[df.clogp.eq(0.26)]
display(df_clogp)

In [None]:
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
smiles_fig = Chem.MolFromSmiles('N#CC(Oc1ccccc1)C(c1ccccc1)c1ccc(-c2ccccc2)cc1')
smiles_fig

In [None]:
#Generate a Morgan fingerprint and save information about the bits that are set using the bitInfo argument:
bi = {}
fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(smiles_fig, radius=2, bitInfo=bi)
# show 10 of the set bits:
list(fp.GetOnBits())[:10]

In [None]:
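#In its simplest form, the new code lets you display the atomic environment that sets a particular bit. Here we look at bit 674 if this molecule sets it, and otherwise at the first bit that is set:
bit = 674 if 674 in bi else next(iter(bi))
Draw.DrawMorganBit(smiles_fig, bit, bi)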

In [None]:
#DrawMorganBits(), for drawing multiple bits at once (thanks to Pat Walters for suggesting this one):
tpls = [(smiles_fig,x,bi) for x in fp.GetOnBits()]
Draw.DrawMorganBits(tpls[:12],molsPerRow=4,legends=[str(x) for x in fp.GetOnBits()][:12])

In [None]:
from ipywidgets import interact,fixed,IntSlider
def renderFpBit(mol,bitIdx,bitInfo,fn):
    bid = bitIdx
    return(display(fn(mol,bid,bitInfo)))

In [None]:
interact(renderFpBit, bitIdx=list(bi.keys()),mol=fixed(smiles_fig),
         bitInfo=fixed(bi),fn=fixed(Draw.DrawMorganBit));

In the next cell we will plot a histogram of the SMILES string lengths in the dataset. These lengths determine the length of the inputs for our CNN and VAE models. Below are examples of the SMILES representation:
1. Methane: 'C'
2. Pentane: 'CCCCC'
3. Methanol and Ethanol: 'CO' and 'CCO'
4. Pyridine: 'C1:C:C:N:C:C:1' (more commonly written in lowercase aromatic form as 'c1ccncc1')

To learn more about the SMILES representation, click [here](https://chem.libretexts.org/Courses/University_of_Arkansas_Little_Rock/ChemInformatics_(2017)%3A_Chem_4399%2F%2F5399/2.3%3A_Chemical_Representations_on_Computer%3A_Part_III).
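
To make the notion of "SMILES string length" concrete, here is a minimal sketch that measures the character length of the example strings above. It uses only the standard library; the dictionary `examples` and the variable names are illustrative and are not part of the notebook's dataset. Pyridine is written in its lowercase aromatic form.

```python
from collections import Counter

# Example SMILES strings from the list above (pyridine in lowercase aromatic form).
examples = {
    "methane": "C",
    "pentane": "CCCCC",
    "methanol": "CO",
    "ethanol": "CCO",
    "pyridine": "c1ccncc1",
}

# Character length of each SMILES string; this is the quantity the histogram bins.
lengths = {name: len(smi) for name, smi in examples.items()}

# Number of molecules observed at each length.
length_counts = Counter(lengths.values())

# The longest string sets the minimum fixed input length a model would need.
longest = max(lengths.values())

for name, n in lengths.items():
    print(f"{name:9s} {examples[name]:10s} length={n}")
print("longest:", longest)
```

Note that the length counts characters, not atoms: ring-closure digits, brackets, and two-letter element symbols such as `Br` all add to it.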

In [None]:
smiles_lengths = map(len, dataset.smiles.values)
#sns.distplot(list(smiles_lengths), bins=20, kde=False)
plt.rcParams.update({'font.size': 20})
plt.figure(figsize=(10,10))
plt.title('SMILES string lengths Histogram')
plt.ylabel('Density')
plt.xlabel('SMILES string lengths')
# Note: sns.distplot is deprecated in recent seaborn releases; histplot/displot are the suggested replacements.
ax = sns.distplot(list(smiles_lengths), color="b", bins=20, rug=True, rug_kws={"color": "k"}, kde=True, kde_kws={"color": "r", "label": "Gaussian Kernel Density Estimate (KDE)"}, hist_kws={"histtype": "bar", "linewidth": 3, "alpha": 1, "color": "b"} )
# savefig returns None, so its result is not assigned. The papertype and frameon arguments were removed from recent Matplotlib releases.
plt.savefig('./gap_smiles_lengths.png', dpi=600, facecolor='w', edgecolor='w', orientation='portrait', transparent=False, pad_inches=0.1)
#ax=plt.savefig('gdrive/MyDrive/Colab Notebooks/data/fig_smiles_lengths.png', dpi=600, facecolor='w', edgecolor='w',orientation='portrait', papertype=None, format=None,transparent=False, bbox_inches=None, pad_inches=0.1,frameon=None, metadata=None)
# ax=plt.savefig('../data/fig_smiles_lengths.png', dpi=600, facecolor='w', edgecolor='w',orientation='portrait', papertype=None, format=None,transparent=False, bbox_inches=None, pad_inches=0.1,frameon=None, metadata=None)


In [None]:
dataset.head()

In [None]:
# Renumber the rows from 0 and discard the old index.
dataset = dataset.reset_index(drop=True)
dataset.head()

In [None]:
x_df = dataset.drop(columns = 'smiles')
print(x_df.shape)
x_df.head()

In [None]:
# from https://proxy.nanohub.org/weber/2004336/GBdSjVSdDDS3NYpl/4/notebooks/LLZO_MachineLearning.ipynb
# This code is to drop columns with std = 0. 
#x_df = pd.DataFrame(X)
#All columns that have a standard deviation of zero are dropped, as they don't contribute new information to the models.
x_df = x_df.loc[:, x_df.std() != 0]
print(x_df.shape) # This shape is (#Entries, #Descriptors per entry)
x_df.head()
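
The filter `x_df.loc[:, x_df.std() != 0]` above keeps only the descriptor columns that vary across molecules. The sketch below shows the same idea on a small hand-made table, using a plain dictionary and the standard library instead of pandas so that it runs without the dataset. The column values are invented for illustration.

```python
import statistics

# Hypothetical descriptor table: column name -> values for four molecules.
table = {
    "tpsa": [20.2, 46.5, 9.2, 33.1],
    "fr_azide": [0, 0, 0, 0],       # constant column, carries no information
    "number_rings": [1, 2, 1, 3],
}

# Keep only the columns whose sample standard deviation is non-zero.
kept = {
    name: values
    for name, values in table.items()
    if statistics.stdev(values) != 0
}

print(sorted(kept))
```

A constant column has the same value for every molecule, so no model can use it to distinguish one molecule from another; dropping it reduces the input size without losing information.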

In [None]:
x_df.to_csv('./x_df_SMILES_RDKit_2D.csv') 

In [None]:
plt.figure(figsize=(10,10))
plt.rcParams.update({'font.size': 20})
plt.title('Topological polar surface area (TPSA) Distribution Histogram')
plt.ylabel('Density')
plt.xlabel('Topological polar surface area (TPSA)')

# Histogram, rug plot, and kernel density estimate (KDE) of the TPSA column.
ax = sns.distplot(dataset["tpsa"], rug=True, rug_kws={"color": "g"},kde_kws={"color": "k", "lw": 3, "label": "KDE"},hist_kws={"histtype":"step", "linewidth": 3,"alpha": 1, "color": "r"})
# savefig returns None, so its result is not assigned. The papertype and frameon arguments were removed from recent Matplotlib releases.
plt.savefig('./tpsa.png', dpi=600, facecolor='w', edgecolor='w', orientation='landscape', transparent=False)


### Data preparation

Now we will pre-process the dataset for the CNN and VAE models. First, we'll get the unique character set from all SMILES strings in the dataset. Then we will use the unique character set to convert our SMILES strings to a one-hot representation, which is a representation that converts raw strings of text to numerical inputs for our models.

In a one-hot representation, each character of a SMILES string is encoded as a vector of zeros with a single entry set to one. The length of this vector equals the number of unique characters in the dataset (`charset_length` in the code below). For instance, with a character set of 31 characters, the character 'C' is converted to a vector of length 31, consisting of 30 zeros and a single one.

Given a string of 5 characters (say pentane, which is represented as 'CCCCC'), we would thus get 5 vectors, each of length `charset_length`. Since different molecules have different SMILES string lengths, we fix the length of every string to a common maximum and pad shorter strings with an extra padding character, which has its own one-hot vector. In the code below, this maximum is set by `max_smiles_chars` (70), so each molecule is represented as a set of 70 vectors, each of length `charset_length`, that is, as a 70 x `charset_length` matrix.

One-hot encoding is commonly used in natural language processing, and you can learn more about one-hot encoding [here](https://en.wikipedia.org/wiki/One-hot). 

Finally, we will define our input and output and create test/train splits in the dataset.
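
The encoding described above can be sketched in a few lines of plain Python. This is an illustrative version under stated assumptions, not the notebook's own `generate_charset` / `smiles_to_onehots` helpers, whose details may differ: the names `build_charset` and `one_hot_encode`, the choice of a space as padding character, and padding on the right are all assumptions made for this example.

```python
def build_charset(smiles_list, pad_char=" "):
    """Sorted list of unique characters, with the padding character first."""
    chars = set("".join(smiles_list))
    chars.discard(pad_char)
    return [pad_char] + sorted(chars)


def one_hot_encode(smiles, charset, max_len):
    """Return max_len rows, each a one-hot vector of length len(charset)."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    index = {ch: i for i, ch in enumerate(charset)}
    padded = smiles.ljust(max_len, charset[0])  # pad on the right
    matrix = []
    for ch in padded:
        row = [0] * len(charset)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix


smiles_list = ["CCCCC", "CO", "c1ccncc1"]
charset = build_charset(smiles_list)
matrix = one_hot_encode("CCCCC", charset, max_len=8)

print(charset)
print(len(matrix), "rows x", len(matrix[0]), "columns")
```

Every row contains exactly one 1. The first five rows mark the column for 'C', and the remaining rows mark the padding column.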

In [None]:
charset = generate_charset(
    dataset["smiles"].values.ravel()
)
# get the number of unique characters
charset_length = len(charset)
# maximum number of characters per SMILES string in the model input
max_smiles_chars = 70
# dimension of the flattened input vector
input_dim = charset_length * max_smiles_chars
# get one-hot representation of the SMILES strings
one_hots = smiles_to_onehots(dataset["smiles"].values, charset, max_smiles_chars)
# split input into train and test sets
X_train = one_hots[:-100] # all but the last 100 entries form the training set
X_test = one_hots[-100:] # the last 100 entries form the testing set

# split output to train and test sets
output = dataset["tpsa"].values
#output = dataset["homo"].values
#output = dataset["cv"].values
#output = dataset["r2"].values

# "alpha" - Isotropic polarizability (unit: Bohr^3)
# "gap" - Gap between HOMO and LUMO (unit: Hartree)
#"mol_id" - Molecule ID (gdb9 index) mapping to the .sdf file
#"A" - Rotational constant (unit: GHz)
#"B" - Rotational constant (unit: GHz)
#"C" - Rotational constant (unit: GHz)
#"mu" - Dipole moment (unit: D)
#"alpha" - Isotropic polarizability (unit: Bohr^3)
#"homo" - Highest occupied molecular orbital energy (unit: Hartree)
#"lumo" - Lowest unoccupied molecular orbital energy (unit: Hartree)
#"gap" - Gap between HOMO and LUMO (unit: Hartree)
#"r2" - Electronic spatial extent (unit: Bohr^2)
#"zpve" - Zero point vibrational energy (unit: Hartree)
#"u0" - Internal energy at 0K (unit: Hartree)
#"u298" - Internal energy at 298.15K (unit: Hartree)
#"h298" - Enthalpy at 298.15K (unit: Hartree)
#"g298" - Free energy at 298.15K (unit: Hartree)
#"cv" - Heat capavity at 298.15K (unit: cal/(mol*K))
#"u0_atom" - Atomization energy at 0K (unit: kcal/mol)
#"u298_atom" - Atomization energy at 298.15K (unit: kcal/mol)
#"h298_atom" - Atomization enthalpy at 298.15K (unit: kcal/mol)
Y_train = output[:-100] #This takes the first 133885-100=133785 entries to be the Training Set
Y_test = output[-100:] # This takes the last 100 entries to be the Testing Set

# This Reshape function in the next two lines, turns each of the horizontal lists [ x, y, z] into a
# vertical NumPy array [[x]
#                       [y]
#                       [z]]
# This Step is required to work with the Sklearn Linear Model
#Y_train = np.array(melt_train).reshape(-1,1) 
#Y_test  = np.array(melt_test).reshape(-1,1)
print(len(X_train),len(X_test),len(Y_train),len(Y_test))
# print(X_train[0]) # print a sample entry from the training set
# print(X_test[0]) # print a sample entry from the testing set
# print(order)


##  Train-Test Split  ##
# https://proxy.nanohub.org/weber/1914019/IVqSH6gE0f3W6g9X/5/notebooks/mldefect.ipynb?
# XX = copy.deepcopy(X)
# n = dopant.size
# m = np.int(X.size/n)

# print(n)
# print(m)

# t = 0.20

# X_train, X_test, Prop_train, Prop_test, dop_train, dop_test, sc_train, sc_test, ds_train, ds_test = train_test_split(XX, prop, dopant, CdX, doping_site, test_size=t)

# n_tr = Prop_train.size
# n_te = Prop_test.size

# print(n_tr)
# print(n_te)

# Prop_train_fl = np.zeros(n_tr)
# for i in range(0,n_tr):
#     Prop_train_fl[i] = copy.deepcopy(float(Prop_train[i]))
    
# print(Prop_train_fl)

# Prop_test_fl = np.zeros(n_te)
# for i in range(0,n_te):
#     Prop_test_fl[i] = copy.deepcopy(float(Prop_test[i]))
    
# print(Prop_test_fl)
    
# X_train_fl = [[0.0 for a in range(m)] for b in range(n_tr)]
# for i in range(0,n_tr):
#     for j in range(0,m):
#         X_train_fl[i][j] = np.float(X_train[i][j])

# print(X_train_fl)

# X_test_fl = [[0.0 for a in range(m)] for b in range(n_te)]
# for i in range(0,n_te):
#     for j in range(0,m):
#         X_test_fl[i][j] = np.float(X_test[i][j])

# print(X_test_fl)

# X_out_fl = [[0.0 for a in range(m)] for b in range(n_out)]
# for i in range(0,n_out):
#     for j in range(0,m):
#         X_out_fl[i][j] = np.float(X_out[i][j])

# print(X_out_fl)

# X_all_fl = [[0.0 for a in range(m)] for b in range(n_all)]
# for i in range(0,n_all):
#     for j in range(0,m):
#         X_all_fl[i][j] = np.float(X_all[i][j])

# print(X_all_fl)
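
The split used in this cell is positional: the last 100 entries are held out for testing and everything before them is used for training. The sketch below isolates that slicing pattern on stand-in data. The helper name `holdout_split` and the toy values are hypothetical and only illustrate the indexing.

```python
def holdout_split(items, n_test):
    """Split a sequence into (train, test), holding out the last n_test items."""
    if not 0 < n_test < len(items):
        raise ValueError("n_test must be between 1 and len(items) - 1")
    return items[:-n_test], items[-n_test:]


inputs = list(range(10))              # stand-in for the one-hot matrices
targets = [x * 0.5 for x in inputs]   # stand-in for the TPSA values

X_tr, X_te = holdout_split(inputs, n_test=3)
Y_tr, Y_te = holdout_split(targets, n_test=3)

print(len(X_tr), len(X_te), len(Y_tr), len(Y_te))
```

Because inputs and targets are sliced at the same position, each input stays paired with its target. A positional split does not shuffle, so if the dataset is ordered in some way the held-out entries may not be representative of the whole.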

Let's briefly visualize what our input data looks like using heatmaps that show the position of each character in the SMILES string. The cell below plots the first 16 molecules of the training set in a 4x4 grid. Each molecule is represented by a sparse matrix with one row per position in the SMILES string and one column per character in the character set; the bright spots in the heatmap indicate the positions at which a one is found in the matrix. For instance, if 'C' has index 18 in the character set and 'O' has index 23, a bright spot at index 18 in the first row indicates that the first character is 'C', and a bright spot at index 23 in the second row indicates that the second character is 'O'. For the compound dimethoxymethane, with the SMILES string 'COCOC', we then expect the matrix to have alternating bright spots at index 18 and index 23 for the first five rows. Beyond that, all rows have a bright spot at the index of the padding character (index 1 in this example), which stands for the extra characters padded onto the string to make all SMILES strings the same length. The heatmaps are plotted using the [Seaborn](https://seaborn.pydata.org/) library.
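
Reading a heatmap row by row amounts to decoding the one-hot matrix: the column of the bright spot in each row is the index of that character in the character set. The following sketch shows this on a small, hypothetical character set. The indices here are not the ones used by the notebook, and `decode_one_hot` and `row_for` are illustrative helper names.

```python
# Hypothetical character set; index 0 is the padding character.
charset = [" ", "(", ")", "1", "=", "C", "N", "O"]


def row_for(ch):
    """One-hot row for a single character of the hypothetical charset."""
    row = [0] * len(charset)
    row[charset.index(ch)] = 1
    return row


def decode_one_hot(matrix, charset, pad_char=" "):
    """Map each row to the character at its largest (brightest) entry."""
    chars = [charset[row.index(max(row))] for row in matrix]
    return "".join(chars).rstrip(pad_char)


# Dimethoxymethane, 'COCOC', padded with spaces to eight positions.
matrix = [row_for(ch) for ch in "COCOC".ljust(8)]

# Column of the bright spot in each row.
hot_columns = [row.index(1) for row in matrix]

print(hot_columns)
print(decode_one_hot(matrix, charset))
```

The alternating column indices in the first five rows correspond to the alternating 'C' and 'O' characters, and the trailing rows all point at the padding column.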

In [None]:
num_rows = 4
num_cols = 4
num_images = num_rows*num_cols
plt.figure(figsize=(6*num_cols, 6*num_rows))
import matplotlib
matplotlib.rcParams.update(matplotlib.rcParamsDefault)  # reset the font size changed for the histograms
for i in range(num_images):
    plt.subplot(num_rows, num_cols, i+1)
    # One heatmap per molecule: rows are positions in the SMILES string, columns are characters.
    sns.heatmap(X_train[i])
    # Set ticks, labels, and title after the heatmap call, because sns.heatmap sets its own ticks and axis labels when it draws.
    plt.xticks([])
    plt.yticks([])
    plt.xlabel('Character', fontsize=16)
    plt.ylabel('Position in SMILES String', fontsize=16)
    plt.title(f"SMILES: {dataset.iloc[i]['smiles']}", fontsize=16)

    #plt.imshow(X_train[i], cmap=plt.cm.binary)
    #plt.xlabel(class_names[int(trainLabels[i])])
    #print(dataset.iloc[i]['smiles'])

    
#plt.imshow(X_train[index]) # By altering 'index' you will see another of the pictures imported
#plt.colorbar()
#plt.grid(False)
#print("Train Images Array shape:", trainImages.shape)
#print("Train Labels Array shape:", trainLabels.shape)
#print("Test Images Array shape:", testImages.shape)
#print("Test Labels Array shape:", testLabels.shape)

#index = 6986 #index runs from 0 to 138388
#sns.heatmap(X_train[index]) # This is a single training example -- note that it is a matrix, not a single vector!
#plt.xlabel('Character')
#plt.ylabel('Position in SMILES String')
#print(dataset.iloc[index]['smiles'])
#ax=plt.savefig('gdrive/MyDrive/Colab Notebooks/data/fig_smiles_character.png', dpi=600, facecolor='w', edgecolor='w',orientation='portrait', papertype=None, format=None,transparent=False, bbox_inches=None, pad_inches=0.1,frameon=None, metadata=None)
#ax=plt.savefig('./homo_fig_smiles_character.png', dpi=600, facecolor='w', edgecolor='w',orientation='portrait', papertype=None, format=None,transparent=False, bbox_inches=None, pad_inches=0.1,frameon=None, metadata=None)

#ax = sns.distplot(dataset["r2"], rug=True, rug_kws={"color": "g"},kde_kws={"color": "k", "lw": 3, "label": "KDE"},hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "r"})
#ax=plt.savefig('./homo_X_test.png', dpi=600, facecolor='w', edgecolor='w',orientation='landscape', papertype='a4', format=None, transparent=False, bbox_inches=None, pad_inches=None, frameon=None, metadata=None, annot=True, fmt="d")
plt.savefig('./tpsa_X_train.png', dpi=600, facecolor='w', edgecolor='w', orientation='landscape')


# <ins>Supervised CNN model for predicting solubility</ins>

In this section, we will set up a convolutional neural network to predict solubility using one-hot SMILES as input. A convolutional neural network is a machine learning model that is commonly used to classify images, and you can learn more about them [here](https://en.wikipedia.org/wiki/Convolutional_neural_network).

### Define model structure

First, we will create the model structure, starting with the input layer. As described above, each training example is a 40x31 matrix, which is the shape we pass to the Input layer in Keras.

In [None]:
# Define the input layer
# NOTE: We feed in a sequence here! We're inputting up to max_smiles_chars characters, 
# and each character is an array of length charset_length


smiles_input = Input(shape=(max_smiles_chars, charset_length), name="SMILES-Input")

Next we will define the convolution layers, where each layer attempts to learn certain features of its input (for images, these would be edges and corners). The input to each layer (a matrix) is transformed via convolution operations: a filter matrix slides along the input, and at each position the overlapping entries are multiplied element by element and summed. The convolutional layer learns the filter matrices that best identify distinctive features of the input. You can learn more about convolution operations and the math behind convolutional neural networks [here](https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9).
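To make the multiply-and-sum step concrete, here is a minimal pure-Python sketch of a single 1D convolution filter followed by ReLU. It is not how Keras implements `Conv1D`; the function name `conv1d_valid`, the two-channel toy input, and the hand-picked filter are all assumptions made for illustration.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence without padding and apply ReLU.

    sequence: list of positions, each a list of channel values
    kernel:   list of kernel_size rows, each a list of channel weights
    """
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias
        for w_row, x_row in zip(kernel, window):
            # Element-by-element products, summed over channels and positions
            total += sum(w * x for w, x in zip(w_row, x_row))
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


# Toy one-hot input with two channels: column 0 = 'C', column 1 = 'O' ("COCOC")
sequence = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
# Hand-picked filter that responds to the pattern "C", "O", "C"
kernel = [[1, 0], [0, 1], [1, 0]]

print(conv1d_valid(sequence, kernel))  # [3.0, 0.0, 3.0]
```

The filter responds strongly wherever the window matches "COC" and not at all where it is offset by one character. In the trained network the filter weights are learned rather than chosen by hand.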

In [None]:
# Set parameters for convolutional layers 
num_conv_filters = 16
kernel_size = 3
#kernel_init = initializers.RandomNormal(seed=0)
#bias_init = initializers.Zeros()
init_weights = initializers.glorot_normal(seed=0)

# Define the convolutional layers
# Multiple convolutions in a row is a common architecture (but there are many "right" choices here)
conv_1_func = Conv1D(
    filters=num_conv_filters, # What is the "depth" of the convolution? How many times do you look at the same spot?
    kernel_size=kernel_size, # How "wide" of a spot does each filter look at?
    name="Convolution-1",
    activation="relu", # This is a common activation function: Rectified Linear Unit (ReLU)
    kernel_initializer=init_weights #This defines the initial values for the weights
)
conv_2_func = Conv1D(
    filters=num_conv_filters, 
    kernel_size=kernel_size, 
    name="Convolution-2",
    activation="relu",
    kernel_initializer=init_weights
)
conv_3_func = Conv1D(
    filters=num_conv_filters, 
    kernel_size=kernel_size, 
    name="Convolution-3",
    activation="relu",
    kernel_initializer=init_weights
)
conv_4_func = Conv1D(
    filters=num_conv_filters, 
    kernel_size=kernel_size,
    name="Convolution-4",
    activation="relu",
    kernel_initializer=init_weights
)

# strides and padding can also be set on the convolution layers, e.g.
# strides=2, padding="same"

The four convolution layers defined above will attempt to learn features of the SMILES string (represented as a 40x31 matrix) that are relevant to predicting the solubility. To get a numerical prediction, we now flatten the output of the convolution and pass it to a set of regular `Dense` layers, the last layer predicting one value for the solubility.
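As a quick sanity check on the shapes involved, the sketch below works out how long the sequence is after each convolution and how many values reach the `Dense` layers. It assumes the Keras `Conv1D` defaults that the layers above rely on (`strides=1`, `padding="valid"`) and the 40-position input described in the text; `conv1d_output_length` is a hypothetical helper, not a Keras function.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding="valid"):
    """Output length of a 1D convolution along the sequence axis."""
    if padding == "same":
        return -(-length // stride)  # ceil(length / stride)
    return (length - kernel_size) // stride + 1


max_smiles_chars = 40   # sequence length stated in the text
num_conv_filters = 16
kernel_size = 3

length = max_smiles_chars
lengths = [length]
for _ in range(4):  # four stacked convolution layers
    length = conv1d_output_length(length, kernel_size)
    lengths.append(length)

flattened_size = length * num_conv_filters
print(lengths)         # [40, 38, 36, 34, 32]
print(flattened_size)  # 512
```

Each unpadded convolution with a kernel of width 3 trims two positions, so the flattened vector has 32 x 16 = 512 entries under these assumptions. `mobility_model.summary()` shows the actual shapes.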

In [None]:
# Define layer to flatten convolutions
flatten_func = Flatten(name="Flattened-Convolutions")

# Define the fully connected hidden layer
hidden_size = 32
dense_1_func = Dense(hidden_size, activation="relu", name="Fully-Connected", kernel_initializer=init_weights)

# Add a Dense layer with a L1 activity regularizer
#dense_1_func = Dense(hidden_size, activation="relu", name="Fully-Connected", activity_regularizer=regularizers.l1(10e-5), kernel_initializer=init_weights)

# Define output layer -- it's only one dimension since it is regression
output_size = 1
output_mobility_func = Dense(output_size, activation="linear", name="Log-lumo", kernel_initializer=init_weights)




Now that we have defined all the layers, we will connect them together to make a graph:

In [None]:
# connect the CNN graph together
conv_1_fwd = conv_1_func(smiles_input)
conv_2_fwd = conv_2_func(conv_1_fwd)
conv_3_fwd = conv_3_func(conv_2_fwd)
conv_4_fwd = conv_4_func(conv_3_fwd)
flattened_convs = flatten_func(conv_4_fwd)
dense_1_fwd = dense_1_func(flattened_convs)
output_mobility_fwd = output_mobility_func(dense_1_fwd)  # feed the hidden Dense layer, not the flattened convolutions

### View model structure and metadata

Now the model is ready to train! But first we will define the model as `mobility_model` and compile it, then view some information on the model using the [keras2ascii](https://github.com/stared/keras-sequential-ascii) tool, which visually represents the layers in our model.

In [None]:
# create model
mobility_model = Model(
            inputs=[smiles_input],
            outputs=[output_mobility_fwd]
)
mae_st = []
# compile model
#optimizer = optimizers.RMSprop(0.002) # Root Mean Squared Propagation
# This line matches the optimizer to the model and states which metrics will evaluate the model's accuracy

# loss= mse, mae
# loss= categorical_crossentropy
#loss='sparse_categorical_crossentropy'
#loss='binary_crossentropy'
#metrics=['accuracy', 'binary_crossentropy']
#metrics=['accuracy']
mobility_model.compile(
    optimizer="adam",
    loss="mse",
    metrics=["mae"]
)
mobility_model.summary()

In [None]:
#!pip install keras_sequential_ascii
from keras_sequential_ascii import keras2ascii
# view model as a graph
keras2ascii(mobility_model)

### Train CNN

Now we will fit our CNN solubility model to the training data! During training, we will see metrics printed after each epoch, such as the train and test loss (Mean Squared Error, MSE) and the Mean Absolute Error (MAE).
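For reference, the two quantities reported during training can be computed by hand as follows. This is a minimal sketch with made-up numbers, not output from the model; the function names are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, for illustration only
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"MSE = {mse:.4f}, MAE = {mae:.4f}")
```

Because MSE squares each error, the single prediction that is off by 1.0 dominates it, while MAE weights all errors linearly.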

In [None]:
#logdir="mobility_logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
#tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)
mae_st = []
history = mobility_model.fit(
    X_train, # Inputs
    Y_train, # Outputs
    epochs=20, # How many times to pass over the data
    batch_size=64, # How many data rows to compute at once
    verbose=1,
    validation_data=(X_test, Y_test),
    #callbacks=[tensorboard_callback] # You would usually use more splits of the data if you plan to tune hyperparams
)
#print('mse')
#print('mae')
mobility_model.save(os.path.expanduser('./tpsa_cnn_model.h5'))

Let's view the learning curve for the trained model.

This code will generate a plot where we show the test and train errors (MSE) as a function of epoch (one pass of all training examples through the NN).

The learning curve will tell us if the model is overfitting or underfitting.
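One rough way to read such a curve programmatically is sketched below. The loss values are invented for illustration and the checks are simple heuristics, not a formal test for overfitting; `summarize_learning_curve` is a hypothetical helper.

```python
def summarize_learning_curve(train_loss, val_loss):
    """Return a few simple indicators from per-epoch loss lists."""
    best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])
    return {
        # Epoch (0-based) with the lowest validation loss
        "best_epoch": best_epoch,
        # A large positive gap at the end suggests overfitting
        "final_gap": val_loss[-1] - train_loss[-1],
        # Validation loss has worsened since its minimum
        "val_rising_after_best": val_loss[-1] > val_loss[best_epoch],
    }


# Invented curves: training loss keeps falling while validation loss turns upward
train_loss = [1.00, 0.60, 0.40, 0.30, 0.22, 0.18]
val_loss = [1.10, 0.70, 0.55, 0.52, 0.58, 0.66]

summary = summarize_learning_curve(train_loss, val_loss)
print(summary)
```

With a real run, the same lists are available as `history.history['loss']` and `history.history['val_loss']`. If both curves are still falling together at the last epoch, the model is more likely underfit and could be trained longer.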

In [None]:
# plot the learning curve 
plt.rcParams.update({'font.size': 18})
plt.figure(figsize=(10,10))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('CNN Model loss tpsa', fontname='Arial Narrow', size=18) #pad=12
plt.ylabel('Error',fontname='Arial Narrow', size=18)
plt.xlabel('Epoch',fontname='Arial Narrow', size=18)
#plt.xlim(0,20)
#plt.ylim(0,20)
plt.legend(['Train', 'Validation',], loc='upper right')
# te = '%.2f' % mean_absolute_error
# tr = '%.2f' % mse_X_test
# plt.text(4.3, 0.8, 'Test_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.8, te, c='r', fontsize=16)
# plt.text(8.5, 0.8, 'eV', c='r', fontsize=16)
# plt.text(4.2, 0.1, 'Train_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.1, tr, c='r', fontsize=16)
# plt.text(8.5, 0.1, 'eV', c='r', fontsize=16)
plt.savefig('./tpsa_cnn_X_training_loss.png', dpi=600, facecolor='w', edgecolor='w')
plt.show()


# plot the learning curve 
plt.rcParams.update({'font.size': 18})
plt.figure(figsize=(10,10))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')
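# The history key is 'mae' in recent Keras releases and 'mean_absolute_error' in older ones
mae_key = 'mae' if 'mae' in history.history else 'mean_absolute_error'
plt.plot(history.history[mae_key])
plt.plot(history.history['val_' + mae_key])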
plt.title('CNN Model MAE tpsa', fontname='Arial Narrow', size=18) #pad=12
plt.ylabel('Mean Absolute Error',fontname='Arial Narrow', size=18)
plt.xlabel('Epoch',fontname='Arial Narrow', size=18)
#plt.xlim(0,20)
#plt.ylim(0,2)
plt.legend(['Train', 'Validation',], loc='upper right')
# te = '%.2f' % mean_absolute_error
# tr = '%.2f' % mse_X_test
# plt.text(4.3, 0.8, 'Test_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.8, te, c='r', fontsize=16)
# plt.text(8.5, 0.8, 'eV', c='r', fontsize=16)
# plt.text(4.2, 0.1, 'Train_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.1, tr, c='r', fontsize=16)
# plt.text(8.5, 0.1, 'eV', c='r', fontsize=16)
plt.savefig('./tpsa_cnn_X_training_mae.png', dpi=600, facecolor='w', edgecolor='w')
plt.show()

### Use CNN to make solubility predictions
Now that we've trained our model, we can use it to make solubility predictions for any SMILES string! We just have to convert the SMILES string to its one-hot representation, then feed it to `mobility_model`.

In [None]:
example_smiles = ['CC(C)CCCCO(C)N','CCC(C)CCC(C)OC','CC=CC1CCC1=O','CCOC()CCC','CC1(CC1OC)C#C'  ]
#'CC(C)CCCCO(C)N','CCC(C)CCC(C)OC','CC=CC1CCC1=O','CCOC()CCC','CC1(CC1OC)C#C'
#'Cc1cc(c1CCO)C#N','CCCCCCCCCC#C', 'CCC(C)CCC(C)C#C' ,'OCCCCC', 'CCC(C)(=O)C#C#N' , 'CCCCCCC#CCC' 'CC(=O)C=C(N)F', 'CCC'                  
for smiles in example_smiles:
    predict_test_input = smiles_to_onehots([smiles], charset, max_smiles_chars)
    mobility_prediction = mobility_model.predict(predict_test_input)[0][0]
    print(f'The predicted tpsa for SMILES {smiles} is {mobility_prediction}')

We can now make a parity plot comparing the CNN model predictions to the ground truth data for the training set.

In [None]:
preds = mobility_model.predict(X_train)
x_y_line = np.linspace(min(Y_train.flatten()), max(Y_train.flatten()), 500)
plt.figure(figsize=(8,8))
#plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
#plt.rc('font', family='Arial narrow')

plt.plot(Y_train.flatten(), preds.flatten(), 'o', label='predictions')
plt.plot(x_y_line, x_y_line, label='y=x')
plt.xlabel("tpsa (ground truth)", fontname='Arial Narrow', size=16)
plt.ylabel("tpsa (predicted)", fontname='Arial Narrow', size=16)
plt.title('Parity plot: predictions vs ground truth data', fontsize=16, pad=12)
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)
#a  = [-175,0,125]
#b = [-175,0,125]
#plt.plot(b, a, c='k', ls='-')
#plt.legend(loc='upper left',ncol=1, frameon=True, prop={'family':'Arial narrow','size':16})
plt.savefig('./tpsa_cnn_X_predict.png', dpi=600, facecolor='w', edgecolor='w')

### Save model
We can save/load this model for future use, using the `save()` and `load_model()` functions from Keras.

In [None]:
# Save the model
mobility_model.save("tpsa_model.hdf5")

# Load it back
loaded_model = load_model("tpsa_model.hdf5")

# <ins>VAE model for generating SMILES strings</ins>
In this section, we will set up a variational autoencoder to encode and decode SMILES strings. An autoencoder is a model that encodes its input into a set of variables (known as encoded or 'latent' variables), which are then decoded to recover the original input. A variational autoencoder is an extension of the autoencoder in which the latent variables are learnt as probability distributions rather than single fixed values. You can learn more about autoencoders and variational autoencoders [here](https://www.jeremyjordan.me/variational-autoencoders/) and [here](https://www.jeremyjordan.me/autoencoders/).

### Define model structure

We'll need to define some new layers for this model, but we can also reuse old ones! (You will see this when we connect the model together.)

In [None]:
# hidden activation layer
hidden_size = 16
dense_1_func = Dense(hidden_size, activation="relu", name="Fully-Connected-Latent", kernel_initializer=init_weights)

Now we'll define the layers that map to the latent space. We then define a sampling function that draws from a Gaussian distribution and returns the sampled latent variables.
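The sampling step relies on the reparameterization `z = mean + exp(0.5 * log_var) * epsilon`, where `epsilon` is drawn from a standard Gaussian. The pure-Python sketch below checks numerically that this produces samples with the intended mean and standard deviation. It is an illustration only: `sample_latent` is a hypothetical stand-in for the Keras function, and the mean and variance are made-up values.

```python
import math
import random


def sample_latent(z_mean, z_log_var, rng):
    """Draw one latent vector: mean + sigma * epsilon, with epsilon ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(z_mean, z_log_var)
    ]


rng = random.Random(0)          # fixed seed so the run is repeatable
z_mean = [1.0]                  # made-up mean
z_log_var = [math.log(0.25)]    # variance 0.25, i.e. sigma = 0.5

samples = [sample_latent(z_mean, z_log_var, rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))

print(f"mean ~ {sample_mean:.3f}, std ~ {sample_std:.3f}")
```

Writing the sample this way keeps the randomness in `epsilon`, so gradients can flow through `z_mean` and `z_log_var` during training. Predicting the log-variance rather than the variance keeps the standard deviation positive without any constraint on the layer output.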

In [None]:
# VAE sampling
# K is assumed to be the Keras backend module (e.g. K.shape returns a tensor's shape)
def sampling(args):
    z_mean, z_log_var = args
    batch = K.shape(z_mean)[0]
    dim = K.int_shape(z_mean)[1]
    epsilon = K.random_normal((batch, dim), mean=0.0, stddev=1.0)
    return z_mean + K.exp(0.5 * z_log_var) * epsilon # mu + sigma*epsilon yields a shifted, rescaled gaussian, 
                                                     # if epsilon is the standard gaussian
# map the last hidden layer (hidden_size = 16) to the latent space (latent_dim = 32)
# encode to latent space
latent_dim = 32 
z_mean_func = Dense(latent_dim, name='z_mean')
log_z_func = Dense(latent_dim, name='z_log_var')
z_func = Lambda(sampling, name='z_sample')
#print(z_mean_func)
#print(log_z_func)
#print(z_func)
#z = Lambda(sampling)([z_mean, z_log_var])

Now we'll define the RNN (Recurrent Neural Network) layers for decoding SMILES from latent space values. Recurrent neural networks are known to perform well for learning a time series of data, where each cell of the recurrent network can learn from the previous cells, thus learning time dependencies in the data. This RNN uses Gated Recurrent Units as cells and you can learn more about recurrent neural networks and Gated Recurrent Units [here](https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be).
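To give a feel for what a single Gated Recurrent Unit does at each step, here is a scalar sketch of one common formulation of the GRU update. The parameter names and values are made up, and real layers use weight matrices and may differ in details such as bias handling and gate conventions. The input is held constant across steps, mirroring the way the decoder in this notebook receives the same latent vector at every position via `RepeatVector`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds hypothetical weights and biases."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])   # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Blend the previous state with the candidate state
    return z * h_prev + (1.0 - z) * h_cand


# Made-up parameters, for illustration only
params = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
          "wr": 0.3, "ur": 0.3, "br": 0.0,
          "wh": 1.0, "uh": 0.5, "bh": 0.0}

h = 0.0
states = []
for _ in range(4):           # the same input is presented at every step
    h = gru_step(1.0, h, params)
    states.append(h)

print([round(s, 4) for s in states])
```

Even though the input never changes, the hidden state evolves from step to step. That is what lets the decoder emit a different character at each position of the SMILES string.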

In [None]:
# this repeat vector just repeats the input `max_smiles_chars` times 
# so that we get a value for each character of the SMILES string
repeat_1_func = RepeatVector(max_smiles_chars, name="Repeat-Latent-1")

# RNN decoder
rnn_size = 32
gru_1_func = GRU(rnn_size, name="RNN-decoder-1", return_sequences=True, kernel_initializer=init_weights)
gru_2_func = GRU(rnn_size, name="RNN-decoder-2", return_sequences=True, kernel_initializer=init_weights)
gru_3_func = GRU(rnn_size, name="RNN-decoder-3", return_sequences=True, kernel_initializer=init_weights)

Finally we'll define the output, which should map to the original SMILES input:

In [None]:
output_func = TimeDistributed(
    Dense(charset_length, activation="softmax", name="SMILES-Output", kernel_initializer=init_weights), 
    name="Time-Distributed"
)

Now that we have defined all the layers, we will connect them together to make a graph:

In [None]:
# connecting the VAE model as a graph

# cnn encoder layers
conv_1_fwd = conv_1_func(smiles_input)
conv_2_fwd = conv_2_func(conv_1_fwd)
conv_3_fwd = conv_3_func(conv_2_fwd)
conv_4_fwd = conv_4_func(conv_3_fwd)

# flattening
flattened_convs = flatten_func(conv_4_fwd)
dense_1_fwd = dense_1_func(flattened_convs)

# latent space
z_mean = z_mean_func(dense_1_fwd)
z_log_var = log_z_func(dense_1_fwd)
z = z_func([z_mean, z_log_var])

# rnn decoder layers
repeat_1_fwd = repeat_1_func(z)
gru_1_fwd = gru_1_func(repeat_1_fwd)
gru_2_fwd = gru_2_func(gru_1_fwd)
gru_3_fwd = gru_3_func(gru_2_fwd)
smiles_output = output_func(gru_3_fwd)

### View model structure and metadata
Now the model is ready to train! But first we will compile the VAE model, then view model metadata, again using the [keras2ascii](https://github.com/stared/keras-sequential-ascii) tool. To compile the model, we will need to define our own VAE loss function.
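The KL term in such a loss has a closed form when the encoder outputs a diagonal Gaussian and the prior is a standard Gaussian: `-0.5 * sum(1 + log_var - mean^2 - exp(log_var))`. The sketch below evaluates that expression for a few made-up latent vectors. It uses plain Python floats rather than Keras tensors, and `kl_to_standard_normal` is a hypothetical name.

```python
import math


def kl_to_standard_normal(z_mean, z_log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv)
        for m, lv in zip(z_mean, z_log_var)
    )


# Made-up encoder outputs, for illustration only
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])          # matches the prior
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])           # mean moved away
kl_wide = kl_to_standard_normal([0.0, 0.0], [math.log(4.0), 0.0])    # variance too large

print(kl_at_prior, kl_shifted, kl_wide)
```

The penalty is zero only when the encoded distribution equals the prior and grows as the mean drifts or the variance departs from one. This is the "stay close to a Gaussian" part of the loss, balanced against the reconstruction term.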

In [None]:
# vae loss function -- reconstruction loss (cross entropy) plus KL divergence loss against a Gaussian prior
# Intuitive meaning for this loss function: "Reconstruct the data but stay close to a Gaussian"
def vae_loss(x_input, x_predicted):
    reconstruction_loss = K.sum(binary_crossentropy(x_input, x_predicted), axis=-1)
    reconstruction_loss *= input_dim
    kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
    kl_loss = K.sum(kl_loss, axis=-1)
    kl_loss *= -0.5
    return K.mean(reconstruction_loss + kl_loss)

# create model
vae_model = Model(
            inputs=[smiles_input],
            outputs=[smiles_output]
)

# compile model
vae_model.compile(
    optimizer="adam",
    loss=vae_loss,
    metrics=["accuracy"]
)
vae_model.summary()

In [None]:
# view model as a graph
keras2ascii(vae_model)

### Train VAE

When training our VAE, we will see metrics printed after each epoch such as test/train loss and accuracy values.

In [None]:
# Make sure all layers are trainable
# Note: reset_states() only clears the state of stateful RNN layers; it does not reinitialize weights
vae_model.reset_states()
for layer in vae_model.layers:
    layer.trainable = True

# fit model to training data
history = vae_model.fit(
    x=X_train,
    y=X_train,
    epochs=20,
    validation_data=(X_test, X_test),
    batch_size=64,
    verbose=1
)

Let's view the learning curve for the trained model. 

This code will generate a plot where we show the test and train errors as a function of epoch (one forward pass and one backward pass of all training examples through the NN).

The learning curve will tell us if the model is overfitting or underfitting. 

In [None]:
# plot the learning curve 
plt.rcParams.update({'font.size': 18})
plt.figure(figsize=(8,8))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('VAE Model loss', fontname='Arial Narrow', size=18) #pad=12
plt.ylabel('Error',fontname='Arial Narrow', size=18)
plt.xlabel('Epoch',fontname='Arial Narrow', size=18)
#plt.xlim(0,20)
#plt.ylim(800,2000)
plt.legend(['Train', 'Validation',], loc='upper right')
# te = '%.2f' % mean_absolute_error
# tr = '%.2f' % mse_X_test
# plt.text(4.3, 0.8, 'Test_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.8, te, c='r', fontsize=16)
# plt.text(8.5, 0.8, 'eV', c='r', fontsize=16)
# plt.text(4.2, 0.1, 'Train_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.1, tr, c='r', fontsize=16)
# plt.text(8.5, 0.1, 'eV', c='r', fontsize=16)
plt.savefig('./tpsa_vae_X_training_loss.png', dpi=600, facecolor='w', edgecolor='w')
plt.show()


# plot the learning curve 
plt.rcParams.update({'font.size': 18})
plt.figure(figsize=(8,8))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')
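# The history key is 'accuracy' in recent Keras releases and 'acc' in older ones
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])
plt.title('VAE Model accuracy', fontname='Arial Narrow', size=18) #pad=12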
plt.ylabel('Accuracy',fontname='Arial Narrow', size=18)
plt.xlabel('Epoch',fontname='Arial Narrow', size=18)
#plt.xlim(0,20)
#plt.ylim(800,2000)
plt.legend(['Train', 'Validation',], loc='lower right')
# te = '%.2f' % mean_absolute_error
# tr = '%.2f' % mse_X_test
# plt.text(4.3, 0.8, 'Test_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.8, te, c='r', fontsize=16)
# plt.text(8.5, 0.8, 'eV', c='r', fontsize=16)
# plt.text(4.2, 0.1, 'Train_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.1, tr, c='r', fontsize=16)
# plt.text(8.5, 0.1, 'eV', c='r', fontsize=16)
plt.savefig('./tpsa_vae_X_training_acc.png', dpi=600, facecolor='w', edgecolor='w')
plt.show()


### Create a decoder model and use to generate SMILES from noise

Now that we have trained our VAE, we can use the decoding part of the VAE to generate SMILES strings! Let's start by defining our decoder model. Note that this model doesn't need to be compiled since we are not training this model.

In [None]:
# connect the decoder graph
decoder_input = Input(shape=(latent_dim,), name="decoder_input")
decoder_repeat_1_fwd = repeat_1_func(decoder_input)
decoder_gru_1_fwd = gru_1_func(decoder_repeat_1_fwd)
decoder_gru_2_fwd = gru_2_func(decoder_gru_1_fwd)
decoder_gru_3_fwd = gru_3_func(decoder_gru_2_fwd)
decoder_smiles_output = output_func(decoder_gru_3_fwd)

# define decoder model
decoder_model = Model(
    inputs=[decoder_input],
    outputs=[decoder_smiles_output]
)
decoder_model.summary()

In [None]:
# view decoder graph. this should look like a subset of the VAE graph.
keras2ascii(decoder_model)

Now let's generate SMILES strings! First we will randomly sample from a unit gaussian distribution, feed the random samples into the decoder model, and take the output of the decoder model and convert it back into SMILES characters. Don't be surprised to see strange SMILES strings! We used a very small dataset, and did not train for very long.
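The conversion from decoder output back to characters can be tried out on its own with a hand-written probability table. In the sketch below the five-entry `charset` and the probabilities are invented for illustration and `decode_smiles` is a hypothetical helper; the logic (take the most probable character at each position, stop at padding or an unknown character) follows the description above.

```python
# Hypothetical, shortened character set; the notebook's charset is larger.
charset = ["NULL", "PAD", "C", "O", "N"]


def decode_smiles(probabilities, charset):
    """Greedy decoding: pick the most probable character at each position."""
    chars = []
    for row in probabilities:
        idx = max(range(len(row)), key=lambda i: row[i])  # argmax over characters
        if charset[idx] in ("PAD", "NULL"):
            break  # stop at padding or an out-of-vocabulary marker
        chars.append(charset[idx])
    return "".join(chars)


# Invented softmax outputs for five positions
probs = [
    [0.0, 0.1, 0.7, 0.1, 0.1],  # 'C'
    [0.0, 0.1, 0.2, 0.6, 0.1],  # 'O'
    [0.0, 0.2, 0.5, 0.2, 0.1],  # 'C'
    [0.0, 0.8, 0.1, 0.1, 0.0],  # 'PAD' -> decoding stops here
    [0.0, 0.1, 0.1, 0.1, 0.7],  # 'N', never reached
]

decoded = decode_smiles(probs, charset)
print(decoded)  # COC
```

Greedy decoding gives no guarantee that the result is a chemically valid SMILES string, so generated strings are usually checked with a cheminformatics toolkit before further use.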

In [None]:
for x in range(20):
    
    # draw from a unit gaussian 
    decoder_test_input = np.random.normal(0, 1, latent_dim).reshape(1, latent_dim)
    decoder_test_output = decoder_model.predict(decoder_test_input)
    
    decoded_one_hots = np.argmax(decoder_test_output, axis = 2)

    SMILES = ''
    for char_idx in decoded_one_hots[0]:
        if charset[char_idx] in ["PAD", "NULL"]: 
            break # Stop decoding if you hit padding or an out-of-vocab character (NULL)
        
        SMILES = SMILES + charset[char_idx]

    print(SMILES)

### Save VAE and decoder models
We can save/load these models for future use, again using the `save()` and `load_model()` functions from Keras.

In [None]:
# save and load the decoder model 
decoder_model.save("tpsa_decoder_model.hdf5")
loaded_decoder_model = load_model("tpsa_decoder_model.hdf5")

# for VAEs, we must instantiate a model with the same architecture, then load the weights onto it
vae_model.save_weights("tpsa_vae.hdf5")
vae_model.load_weights("tpsa_vae.hdf5")  # loads in place; the return value is not a model