# Hands-on Deep Learning for Materials 

## *Using deep learning to estimate solubility for organic molecules*

The solubility of materials is crucial to pharmaceutical applications such as formulating novel drugs. In this notebook, you will learn how to train deep learning models to predict the aqueous solubility of organic materials given their composition. 

The composition will be specified as SMILES strings, which are a convenient way to represent the structure of organic materials. You can learn more about SMILES strings [here](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system). We will use these SMILES strings as inputs to a [convolutional neural network](https://en.wikipedia.org/wiki/Convolutional_neural_network) and predict the solubility of organic materials. We will also learn how to train [variational autoencoders](https://www.jeremyjordan.me/variational-autoencoders/) to learn SMILES string representations. Variational autoencoders are models used to learn low-dimensional representations of a high-dimensional dataset, and in our case these models will give us a low-dimensional numerical representation of a SMILES string, which can replace the sparse matrices often used to represent data when using convolutional neural networks.
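As a rough illustration of the "sparse matrix" representation mentioned above, the sketch below one-hot encodes a few SMILES strings character by character. The function name `one_hot_encode_smiles`, the toy molecules, and the character-level tokenization are assumptions made for this example; treating every character as a token is a simplification (two-letter atoms such as `Cl` or `Br` would need a proper tokenizer), and it is not necessarily the encoding used later in this notebook.

```python
import numpy as np

def one_hot_encode_smiles(smiles, charset, max_len):
    """Return a (max_len, len(charset)) 0/1 matrix for one SMILES string."""
    char_to_index = {ch: i for i, ch in enumerate(charset)}
    matrix = np.zeros((max_len, len(charset)), dtype=np.float32)
    # Pad with spaces so that every molecule has the same number of rows.
    padded = smiles.ljust(max_len)
    for position, ch in enumerate(padded[:max_len]):
        matrix[position, char_to_index[ch]] = 1.0
    return matrix

# Hypothetical toy inputs; a real charset would be built from the full dataset.
toy_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
charset = sorted(set("".join(toy_smiles)) | {" "})
max_len = max(len(s) for s in toy_smiles)

encoded = np.stack([one_hot_encode_smiles(s, charset, max_len) for s in toy_smiles])
print(encoded.shape)  # (number of molecules, max_len, charset size)
```

Each row of a molecule's matrix contains a single 1, which is why this representation is sparse and why a learned low-dimensional encoding can be attractive.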


### Outline of this notebook:
#### _Load and pre-process training data_ 
- Load solubility dataset containing many organic molecules and their associated solubilities
- Pre-process the data and split it into train/test sets

#### _Train a Convolutional neural network (CNN)_ 
- Train a CNN to predict solubility
- Predict solubility from any given SMILES representation of a molecule 

#### _Train a Variational autoencoder (VAE)_
- Train a VAE to take an encoded SMILES as input and learn a mapping from encoded SMILES to latent space and back to the input
- Use a portion of the VAE to generate SMILES by sampling from a unit Gaussian
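
To make the "sampling from a unit Gaussian" step more concrete, here is a minimal NumPy sketch of the standard VAE sampling step (the reparameterization trick) and of the KL term that pulls the latent distribution towards a unit Gaussian. The names `sample_latent` and `kl_to_unit_gaussian`, the latent size, and the zero-valued encoder outputs are placeholders chosen for this example; the Keras model trained later may organize this differently.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 2   # hypothetical latent size
batch_size = 4

# Made-up encoder outputs standing in for a trained encoder.
z_mean = np.zeros((batch_size, latent_dim))
z_log_var = np.zeros((batch_size, latent_dim))

def sample_latent(z_mean, z_log_var, rng):
    """Reparameterization: z = mean + sigma * epsilon, with epsilon ~ N(0, I)."""
    epsilon = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * epsilon

def kl_to_unit_gaussian(z_mean, z_log_var):
    """KL(N(mean, sigma^2) || N(0, I)) per sample, summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=1)

z = sample_latent(z_mean, z_log_var, rng)
kl = kl_to_unit_gaussian(z_mean, z_log_var)
print(z.shape, kl)
```

Because training penalizes the KL term, latent codes of real molecules end up distributed roughly like a unit Gaussian, so drawing `z` from N(0, I) and passing it through the decoder is a reasonable way to generate new SMILES.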


This notebook is a hands-on demo of *Deep learning for materials and chemicals*. The tutorial uses Python; some familiarity with programming is helpful but not required. Run each code cell in order by pressing Shift + Enter. Feel free to modify the code or change the queries to familiarize yourself with it.

https://www.rdkit.org/docs/GettingStartedInPython.html#chemical-features

In [None]:
# !pip install git+https://github.com/bp-kelley/descriptastorus
# !pip install --user CairoSVG

In [None]:
import rdkit as rdkit
print(rdkit.__version__)

In [None]:
!pip list

In [None]:
# !pip install --user dataframe_image
# !pip install --user keras-sequential-ascii

In [None]:
# !pip install --user mordred
# !pip install --user python-utils
# !pip install --user MolVS

## <ins>Let's start</ins> 

We'll start with the required imports. These include the [Keras](https://keras.io/) and [Tensorflow](https://www.tensorflow.org/) libraries for the neural network models, [Pandas](https://pandas.pydata.org/) and [Numpy](https://numpy.org/) to process data, as well as other relevant Python libraries.

In [2]:
from __future__ import print_function
# general imports
%matplotlib inline
import tensorflow as tf
#import tensorflow.compat.v1 as tf
#tf.disable_v2_behavior() 
import keras
from keras import initializers
from keras.layers import Dense
from keras.models import Sequential
from keras import optimizers
from keras import regularizers
import pandas as pd
import seaborn as sns
#from matplotlib import pyplot as plt
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import plotly.express as px
import numpy as np
import csv
import copy
import random
import rdkit as rdkit
print(rdkit.__version__)
from rdkit import Chem
#from rdkit.Chem import AllChem as Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem import Crippen
from rdkit.Chem import Descriptors, Descriptors3D 
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Chem import Lipinski, rdDepictor, rdMolDescriptors
from rdkit.Chem import MolSurf
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
from rdkit.Chem import MACCSkeys
from rdkit.Chem.Fingerprints import FingerprintMols
from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
#https://github.com/bp-kelley/descriptastorus
from mordred import Calculator, descriptors
import time
rdDepictor.SetPreferCoordGen(True)


#from util import partTypeNum
#import util
# from tqdm import tqdm
# from sklearn.decomposition import PCA
# from sklearn.manifold import TSNE
# from sklearn.cluster import KMeans
# from sklearn.preprocessing import StandardScaler as Scaler

import pandas as pd
import numpy as np
import io
#import pymatgen as pymat
#import mendeleev as mendel
from subprocess import call
import gzip

from scipy.stats import norm
from IPython.display import HTML

# keras imports
from keras.layers import (Input, Dense, Conv1D, MaxPool1D, Dropout, GRU, LSTM, TimeDistributed, Add, Flatten, RepeatVector, Lambda, Concatenate)
from keras.models import Model, load_model
from keras.metrics import binary_crossentropy
from keras import initializers, regularizers
from keras.callbacks import EarlyStopping
import keras.backend as K

# Visualization
from keras_sequential_ascii import keras2ascii


# from utils import label_map_util
# from utils import visualization_utils as vis_util

#from object_detection.utils import label_map_util
#from object_detection.utils import visualization_utils as vis_util

# utils functions
#from python_utils import *
from utils import *

# Hacky MacOS fix for Tensorflow runtimes... (You won't need this unless you are on MacOS)
# This fixes a display bug with progress bars that can pop up on MacOS sometimes.
#import sys
#import os
#sys.path.insert(0, '../src/')

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

# Remove warnings from output
import warnings
warnings.filterwarnings('ignore')

#!python -m pip install --user numpy --upgrade
#!python -m pip install --user tensorflow --upgrade
#!python -m pip install --user numpy pycocotools==2.0.0

#!python -V
#import tensorflow as tf
#print(tf.__version__)
#import numpy as np
#print(np.__version__)

Using TensorFlow backend.


2018.09.1


# <ins>Load, view, and preprocess dataset</ins> 


We will use the [ESOL dataset](http://moleculenet.ai/datasets-1) to train our models. The ESOL dataset contains the solubility of various small organic molecules. We will begin by loading the dataset as a dataframe and then inspecting some basic metadata. We'll also preprocess the dataset and create train/test splits for the Convolutional Neural Network (CNN) and Variational AutoEncoder (VAE) models. 
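
As a small illustration of what a train/test split involves, the sketch below shuffles row indices with a fixed seed and holds out a fraction of them for testing. The helper name `train_test_split_indices`, the 20% test fraction, and the seed are assumptions made for this example, not settings taken from this notebook.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = train_test_split_indices(10, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

The resulting index arrays can be used with `DataFrame.iloc` or NumPy indexing to select the same rows from both the features and the targets.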

In [1]:
import pandas as pd  # imported here so that this cell also runs on its own

# Read the dataset as a dataframe. Alternative datasets are kept commented out.
#!ls /home/nanohub/bbishnoi/data/results/vae/qm9.csv
#dataset = pd.read_csv("/home/nanohub/bbishnoi/data/results/vae/qm9.csv")
#dataset = pd.read_csv("../data/ESOL_delaney-processed.csv")
# dataset = pd.read_csv("./Redox_Flow_Battery/data/SMILES_feature.csv")
# dataset = pd.read_csv("./Redox_Flow_Battery/data/SMILES_ECFP6.csv") 
# dataset = pd.read_csv("./OrganicLED/data/nmat4717_patent_feature.csv") 
dataset = pd.read_csv("./OrganicLED/data/nmat4717_patent_ECFP6_std0.csv") 
# dataset = pd.read_csv("./OrganicLED/data/qm9_no_smiles_feature_RDKit_2D.csv.csv") 
# dataset = pd.read_csv("./Redox_Flow_Battery/data/SMILES_MACCS.csv")
#dataset = pd.read_csv("./SMILES_feature.csv")
#dataset = pd.read_csv("gdrive/MyDrive/Colab Notebooks/data/qm9.csv")

# print column names in dataset
print(f"Columns in dataset: {list(dataset.columns)}")

# print number of rows in dataset
print(f"\nLength of dataset: {len(dataset)}")

# Shuffle the rows so that all groups are represented in both the training and test sets
# (this could also be done later, when creating the train/test splits)
dataset = dataset.sample(frac=1, random_state=0)

# show first 5 rows of dataframe
dataset.head()

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('./OrganicLED/data/nmat4717_patent_smile_feature.csv', delimiter=',', index_col='SMILES')
df = df.head()

# Render the first rows of the dataframe as a table
fig, ax = plt.subplots(1, 1)
ax.table(cellText=df.values, colLabels=df.keys(), loc='center')
# Save before plt.show(): after the figure has been shown, plt.savefig can write a blank image
plt.savefig('./OrganicLED/nmat4717_patent_smile_feature_dataframe.png', dpi=300, facecolor='w', edgecolor='w', format=None, transparent=False, bbox_inches=None, pad_inches=None, metadata=None)
plt.show()


In [None]:
#To calculate all the rdkit descriptors, you can use the following code:
descriptor_names = list(rdMolDescriptors.Properties.GetAvailableProperties())
get_descriptors = rdMolDescriptors.Properties(descriptor_names)

print(descriptor_names)
# print(DescriptorSummaries())

In [None]:
# Calculate descriptors from a SMILES string (returns an empty list if the SMILES cannot be parsed)
def smi_to_descriptors(smile):
    mol = Chem.MolFromSmiles(smile)
    descriptors = []
    if mol:
        descriptors = np.array(get_descriptors.ComputeProperties(mol))
    return descriptors


In [None]:
# if the SMILES are in a pandas dataframe
dataset['descriptors'] = dataset.SMILES.apply(smi_to_descriptors)
#dataset= dataset.SMILES.apply(smi_to_descriptors)
dataset.head()

In [None]:
full_dataset = dataset
full_dataset.head()

In [None]:
from rdkit import Chem    # make sure to import it if you haven't done so
from rdkit.Chem import Descriptors    # make sure to import it if you haven't done so
descriptors_list = [x[0] for x in Descriptors._descList]
print(descriptors_list)

In [None]:
calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
type(calc)
mol = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
ds = calc.CalcDescriptors(mol)
print(ds)

In [64]:
qm9 = pd.read_csv("./QM9_dataset/data/qm9_no_smiles_feature_RDKit_2D_std0.csv")
qm9.head()

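| | Unnamed: 0 | A | B | C | mu | alpha | homo | lumo | gap | r2 | ... | fr_phenol_noOrthoHbond | fr_piperdine | fr_piperzine | fr_priamide | fr_pyridine | fr_quatN | fr_term_acetylene | fr_tetrazole | fr_unbrch_alkane | fr_urea |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 157.7118 | 157.70997 | 157.70699 | 0.0 | 13.21 | -0.3877 | 0.1171 | 0.5048 | 35.3641 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 293.60975 | 293.54111 | 191.39397 | 1.6256 | 9.46 | -0.257 | 0.0829 | 0.3399 | 26.1563 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 799.58812 | 437.90386 | 282.94545 | 1.8511 | 6.31 | -0.2928 | 0.0687 | 0.3615 | 19.0002 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0.0 | 35.610036 | 35.610036 | 0.0 | 16.28 | -0.2845 | 0.0506 | 0.3351 | 59.5248 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 4 | 0.0 | 44.593883 | 44.593883 | 2.8937 | 12.99 | -0.3604 | 0.0191 | 0.3796 | 48.7476 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |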


In [None]:
#!pip install pandas==0.21
%matplotlib inline
import pandas as pd
import numpy as np;
import seaborn as sns; 
import matplotlib.pyplot as plt



#qm9 = pd.read_csv("./SMILES_RDKit_2D.csv")
# qm9 = pd.read_csv("./x_df_SMILES_RDKit_2D.csv")
# qm9 = pd.read_csv("./QM9_dataset/data/qm9_no_smiles_MACCS_std0_1.csv")
qm9 = pd.read_csv("./QM9_dataset/data/qm9_no_smiles_feature_RDKit_2D_std0.csv")

# qm9  = pd.read_csv("./OrganicLED/data/nmat4717_patent_RDKit_2D_std0.csv")
# qm9 = pd.read_csv("./OrganicLED/data/nmat4717_patent_ECFP6_std0.csv") 
# qm9 =pd.read_csv("./OrganicLED/data/joint/nmat4717_patent_paper_joint_smile_feature_ECFP6_std0.csv")
# qm9 = pd.read_csv("./Redox_Flow_Battery/data/SMILES_ECFP6.csv")

# qm9 = qm9.drop('Unnamed: 0', 1)
# qm9.head()
# qm9.to_csv('./QM9_dataset/data/qm9_no_smiles_feature_RDKit_2D_std0_0.csv')

#couple_columns = qm9[['gap','zpve', 'mu']].head(10)
#print(couple_columns.shape)
plt.figure(figsize=(100,80))
# plt.figure(figsize=(160,140))
# plt.figure(figsize=(50,50))
# calculate the correlation matrix

# corr = qm9.corr()
corr = qm9.corr(method="pearson") #.abs()
# corr.iloc[:5, :5] # Preview the first 5 rows/columns of the correlation matrix

# plot the heatmap
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns, cmap="YlGnBu")#YlGnBu viridis_r Spectral_r
# plt.savefig returns None, so its result is not assigned
plt.savefig('./QM9_dataset/corr_qm9_no_smiles_feature_RDKit_2D_std0_YlGnBu_1.png', dpi=250, facecolor='w', edgecolor='w', format=None, transparent=False, bbox_inches=None, pad_inches=None, metadata=None)

#sns.heatmap(corr, cmap="Blues", annot=True)

#Heat Map using Seaborn
#import numpy as np;
#import seaborn as sns; 

# To translate into Excel Terms for those familiar with Excel
# string 1 is row labels 'helix1 phase'
# string 2 is column labels 'helix 2 phase'
# string 3 is values 'Energy'
# Official pivot documentation
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html

#homo_lumo_mix.pivot('zpve', 'mu','gap').head()
#homo_lumo_mix.pivot('zpve', 'mu')['gap'].head()

#!pip install pandas

In [4]:
len(corr.columns)

184

In [None]:
#!pip install pandas==0.21
%matplotlib inline
import pandas as pd
import numpy as np;
import seaborn as sns; 
import matplotlib.pyplot as plt

plt.figure(figsize=(100,80))
# plt.figure(figsize=(160,140))

# Drop features whose correlation coefficient with an earlier feature exceeds 0.90
# (corr is signed here, so only strong positive correlations are removed)
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]
features_df_lowcorr = qm9.drop(columns=to_drop)

# recalculate the correlation matrix so we can compare
corr_update = features_df_lowcorr.corr(method="pearson")
# corr_update = features_df_lowcorr.corr(method="pearson").abs()
len(features_df_lowcorr.columns)
features_df_lowcorr.head()
# features_df_lowcorr.to_csv('./QM9_dataset/data/qm9_no_smiles_feature_RDKit_2D_std0_lowcorr_90.csv')

# plot the heatmap
sns.heatmap(corr_update, 
        xticklabels=corr_update.columns,
        yticklabels=corr_update.columns, cmap="YlGnBu")#YlGnBu viridis_r Spectral_r
# ax.xlabel('Feature Numbers')
# ax.ylabel('Feature Numbers')
# # ax.aspect('equal')

# plt.savefig returns None, so its result is not assigned
plt.savefig('./QM9_dataset/corr_qm9_no_smiles_feature_RDKit_2D_std0_YlGnBu_pearson_90_1.png', dpi=250, facecolor='w', edgecolor='w', format=None, transparent=False, bbox_inches=None, pad_inches=None, metadata=None)


In [6]:
len(features_df_lowcorr.columns)

143

In [10]:
# x_df = pd.read_csv("./Redox_Flow_Battery/data/SMILES_MACCS1.csv")
# x_df = pd.read_csv("./OrganicLED/data/nmat4717_patent_RDKit_2D.csv") 
# x_df = pd.read_csv("./QM9_dataset/data/qm9_no_smiles_feature_RDKit_2D.csv")
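# Load into x_df, the name used by the rest of this cell and by the feature-engineering cells below
x_df = pd.read_csv("./QM9_dataset/data/qm9_no_smiles_feature_RDKit_2D_std0.csv")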
x_df.head()
# print(len(x_df))
x_df.shape

(133885, 184)

In [11]:
# Feature Engineering Steps:
# 1) Remove Constant Columns

# Remove Constant Columns
x_df_noconstant = x_df.loc[:, (x_df != x_df.iloc[0]).any()] 

# report number of columns
len(x_df_noconstant.columns)

184

In [26]:
# Feature Engineering Steps:
# 2) Remove Highly Correlated Columns
# using notes here for methodology: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

features_corr_df = x_df_noconstant.corr(method="pearson").abs()
features_corr_df.iloc[:5, :5] # Preview the first 5 rows/columns of the correlation matrix

| | A | B | C | mu | alpha |
|---|---|---|---|---|---|
| A | 1.0 | 0.00058 | 0.001205 | 0.006165 | 0.004054 |
| B | 0.00058 | 1.0 | 0.980884 | 0.02598 | 0.174341 |
| C | 0.001205 | 0.980884 | 1.0 | 0.049102 | 0.177211 |
| mu | 0.006165 | 0.02598 | 0.049102 | 1.0 | 0.241122 |
| alpha | 0.004054 | 0.174341 | 0.177211 | 0.241122 | 1.0 |


In [None]:
# Filter the features with correlation coefficients above 0.95
# Keep only the upper triangle (k=1 excludes the diagonal) so that each feature pair is checked once
upper = features_corr_df.where(np.triu(np.ones(features_corr_df.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
features_df_lowcorr = x_df_noconstant.drop(columns=to_drop)
# recalculate the correlation matrix so we can compare
features_corr_df_update = features_df_lowcorr.corr(method="pearson").abs()
len(features_df_lowcorr.columns)
features_df_lowcorr
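
The correlation filter used in the cell above can be tried on a small synthetic table to see which column gets dropped. This is only a sketch: the helper name `drop_correlated`, the toy columns, and the random data are made up for the example, while the upper-triangle masking follows the same idea as the notebook code.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop each column whose |Pearson r| with an earlier column exceeds threshold."""
    corr = df.corr(method="pearson").abs()
    # Boolean mask of the strict upper triangle: every pair appears exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical toy data: "a_scaled" is a linear function of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 1.0,
    "b": rng.normal(size=200),
})

reduced, dropped = drop_correlated(toy)
print(dropped, list(reduced.columns))
```

Of two highly correlated columns, the one that appears later in the dataframe is removed, so the column order affects which features survive.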

In [None]:
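# Feature Engineering Steps:
# 3) Normalize Features
from sklearn.preprocessing import MinMaxScaler

minmax_features = MinMaxScaler().fit_transform(features_df_lowcorr)
minmax_features_df = pd.DataFrame(minmax_features,columns=features_df_lowcorr.columns)
minmax_features_df.iloc[:5, :5]
# Define the inputs and targets for the train/test split
# NOTE: target_data_df is not created in the cells shown here; it must be a dataframe that holds the target column
X = minmax_features_df      # inputs/features 
y = target_data_df["clogp"] # outputs/targets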
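
For reference, min-max normalization rescales each column to the range [0, 1] via (x - min) / (max - min). The NumPy sketch below mirrors that column-wise formula on a made-up array; `min_max_scale` is a hypothetical helper written for this example and is not part of scikit-learn.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to [0, 1] using (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Avoid division by zero: constant columns are mapped to 0.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Made-up feature matrix with two columns on different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 40.0]])
scaled = min_max_scale(toy)
print(scaled)
```

In a full workflow the minimum and range would normally be computed on the training split only and then applied to the test split, so that no information from the test data leaks into the preprocessing.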

In [12]:
# from https://proxy.nanohub.org/weber/2004336/GBdSjVSdDDS3NYpl/4/notebooks/LLZO_MachineLearning.ipynb
# This code is to drop columns with std = 0. 
#x_df = pd.DataFrame(X)
#All columns that have a standard deviation of zero are dropped, as they don't contribute new information to the models.
# x_df = pd.read_csv("./Redox_Flow_Battery/data/SMILES_MACCS1.csv")
# x_df = pd.read_csv("./OrganicLED/data/nmat4717_patent_RDKit_2D.csv") 
# x_df = pd.read_csv("./OrganicLED/data/joint/nmat4717_patent_paper_joint_smile_ECFP6.csv") 
# x_df = pd.read_csv("./QM9_dataset/data/qm9_no_smiles_MACCS_std0.csv")
x_df = pd.read_csv("./QM9_dataset/data/qm9_no_smiles_feature_RDKit_2D_std0.csv")

# Remove Constant Columns
x_df = x_df.loc[:, (x_df != x_df.iloc[0]).any()] 
# report number of columns
len(x_df.columns)
# This code is to drop columns with std = 0. 
x_df = x_df.loc[:, x_df.std() != 0]
print(x_df.shape) # This shape is (#Entries, #Descriptors per entry)
len(x_df.columns)
x_df.head()


(133885, 184)


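| | A | B | C | mu | alpha | homo | lumo | gap | r2 | zpve | ... | fr_phenol_noOrthoHbond | fr_piperdine | fr_piperzine | fr_priamide | fr_pyridine | fr_quatN | fr_term_acetylene | fr_tetrazole | fr_unbrch_alkane | fr_urea |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 157.7118 | 157.70997 | 157.70699 | 0.0 | 13.21 | -0.3877 | 0.1171 | 0.5048 | 35.3641 | 0.044749 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 293.60975 | 293.54111 | 191.39397 | 1.6256 | 9.46 | -0.257 | 0.0829 | 0.3399 | 26.1563 | 0.034358 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 799.58812 | 437.90386 | 282.94545 | 1.8511 | 6.31 | -0.2928 | 0.0687 | 0.3615 | 19.0002 | 0.021375 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.0 | 35.610036 | 35.610036 | 0.0 | 16.28 | -0.2845 | 0.0506 | 0.3351 | 59.5248 | 0.026841 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0.0 | 44.593883 | 44.593883 | 2.8937 | 12.99 | -0.3604 | 0.0191 | 0.3796 | 48.7476 | 0.016601 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |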


In [30]:
# x_df.to_csv('./OrganicLED/data/nmat4717_patent_RDKit_2D_std0.csv')
x_df.to_csv('./QM9_dataset/data/qm9_no_smiles_feature_RDKit_2D_std0.csv')


In [None]:
# x_df = pd.read_csv('./Redox_Flow_Battery/data/SMILES_ECFP6_std0.csv')
# x_df = pd.read_csv("./OrganicLED/data/nmat4717_patent_RDKit_2D_std0.csv") 
x_df.to_csv('./OrganicLED/data/nmat4717_patent_ECFP6_std0.csv')
print(x_df.columns.tolist())
x_df.head().shape

In [None]:
df3 = pd.read_csv("./Redox_Flow_Battery/data/SMILES_new_feature_std0.csv")
df4 = df3.T.drop_duplicates().T
df4.head()
df4.to_csv('./Redox_Flow_Battery/data/SMILES_new_feature_std0_drop_duplicate.csv')

In [None]:
print(df4.shape) # This shape is (#Entries, #Descriptors per entry)
df4.head()

In [None]:
df2 = pd.read_csv("./Redox_Flow_Battery/SMILES_all_less_smiles.csv")
df1 = pd.read_csv("./Redox_Flow_Battery/features.csv")
df3 = pd.concat([df2, df1], axis=1)
df3.head()
df3.to_csv('./Redox_Flow_Battery/SMILES_all_less_smiles_std0_con_feature.csv')

In [None]:
df4 = df3.T.drop_duplicates().T
df4.head()
df4.to_csv('./Redox_Flow_Battery/SMILES_all_less_smiles_std0_con_feature_drop_duplicate.csv')

In [None]:
df2 = pd.read_csv("./Redox_Flow_Battery/SMILES.csv")
df1 = pd.read_csv("./Redox_Flow_Battery/SMILES_all_less_smiles_std0_con_feature_drop_duplicate.csv")
df5 = pd.concat([df2, df1], axis=1)
df5.to_csv('./Redox_Flow_Battery/SMILES_all_less_smiles_std0_con_feature_addback_smiles.csv')
df5.head()

In [None]:
# The 'Unnamed: 0' column was removed from the xlsx file
df5 = pd.read_csv("./Redox_Flow_Battery/SMILES_all_less_smiles_std0_con_feature_addback_smiles.csv")
df5.head()

In [None]:
df5 = pd.read_csv("./Redox_Flow_Battery/SMILES_all_less_smiles_std0_con_feature_drop_duplicate.csv")
df5 = df5.loc[:, df5.std() != 0]
print(df5.shape) # This shape is (#Entries, #Descriptors per entry)
df5.head()

In [None]:

#!pip install pandas==0.21
# %matplotlib inline
import pandas as pd
import numpy as np;
import seaborn as sns; 
import matplotlib.pyplot as plt


qm9 = pd.read_csv("./Redox_Flow_Battery/data/SMILES_all_less_smiles_std0_con_feature_addback_smiles_reduce.csv")

# qm9 = pd.read_csv("./Redox_Flow_Battery/data/SMILES_all_less_smiles_std0_con_feature_addback_smiles.csv")

#qm9 = pd.read_csv("./SMILES_RDKit_2D.csv")
# df1 = pd.read_csv("./x_df_features.csv")
# df2 = pd.read_csv("./x_df_SMILES_RDKit_2D.csv")

# qm9= pd.concat([df1, df2], axis=1, keys=['df1', 'df2']).corr().loc['df2', 'df1']

#couple_columns = qm9[['gap','zpve', 'mu']].head(10)
#print(couple_columns.shape)
plt.figure(figsize=(30,30))

# calculate the correlation matrix
corr = qm9.corr()
# plot the heatmap
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns, cmap="YlGnBu")#YlGnBu viridis_r Spectral_r
# plt.savefig returns None, so its result is not assigned
plt.savefig('./Redox_Flow_Battery/corr_df_SMILES_all_less_smiles_std0_con_feature_addback_smiles_reduce_YlGnBu.png', dpi=600, facecolor='w', edgecolor='w', format=None, transparent=False, bbox_inches=None, pad_inches=None, metadata=None)

#sns.heatmap(corr, cmap="Blues", annot=True)

#Heat Map using Seaborn
#import numpy as np;
#import seaborn as sns; 

# To translate into Excel Terms for those familiar with Excel
# string 1 is row labels 'helix1 phase'
# string 2 is column labels 'helix 2 phase'
# string 3 is values 'Energy'
# Official pivot documentation
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html

#homo_lumo_mix.pivot('zpve', 'mu','gap').head()
#homo_lumo_mix.pivot('zpve', 'mu')['gap'].head()

#!pip install pandas

In [None]:

##  Read Data  ##

#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/Data.csv
#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/Data_norm.csv

#ifile  = open('Data.csv', "rt")
ifile  = open('x_df_features.csv', "rt")
# df1 = pd.read_csv("./x_df_features.csv")
# df2 = pd.read_csv("./x_df_SMILES_RDKit_2D.csv")
reader = csv.reader(ifile)
csvdata=[]
for row in reader:
        csvdata.append(row)   
ifile.close()
numrow=len(csvdata)
numcol=len(csvdata[0]) 
csvdata = np.array(csvdata).reshape(numrow,numcol)
dopant = csvdata[:,0]
CdX = csvdata[:,1]
doping_site = csvdata[:,2]

prop  = csvdata[:,5]  ## Cd-rich Delta_H
#prop  = csvdata[:,4]  ## Mod. Delta_H
#prop  = csvdata[:,5]  ## X-rich Delta_H

#X = csvdata[:,6:20]       ##  Elemental Properties
#X = csvdata[:,20:25]       ##  Unit Cell Defect Properties
X = csvdata[:,6:]       ##  Elemental + Unit Cell Defect Properties

n = prop.size



    # Read CdX alloy data: CdTe_0.5Se_0.5 and CdSe_0.5S_0.5

#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/Outside.csv
#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/Outside_norm.csv

#ifile2  = open('Outside.csv', "rt")
ifile2  = open('x_df_SMILES_RDKit_2D.csv', "rt")
reader2 = csv.reader(ifile2)
csvdata2=[]
for row2 in reader2:
        csvdata2.append(row2)
ifile2.close()
numrow2=len(csvdata2)
numcol2=len(csvdata2[0])
csvdata2 = np.array(csvdata2).reshape(numrow2,numcol2)
dopant_out = csvdata2[:,0]
CdX_out = csvdata2[:,1]
doping_site_out = csvdata2[:,2]
prop_out  = csvdata2[:,3]
#prop_out  = csvdata2[:,4]
#prop_out  = csvdata2[:,5]
#X_out = csvdata2[:,6:20]
#X_out = csvdata2[:,20:25]
X_out = csvdata2[:,6:]

n_out = prop_out.size


    # Read Entire Dataset

#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/X.csv
#!wget https://raw.githubusercontent.com/AIScienceTutorial/Material_Science/master/Formation_Energies/X_norm.csv

#ifile3  = open('X.csv', "rt")
ifile3  = open('x_df_SMILES_RDKit_2D.csv', "rt")
reader3 = csv.reader(ifile3)
csvdata3=[]
for row3 in reader3:
        csvdata3.append(row3)
ifile3.close()
numrow3=len(csvdata3)
numcol3=len(csvdata3[0])
csvdata3 = np.array(csvdata3).reshape(numrow3,numcol3)
dopant_all = csvdata3[:,0]
CdX_all = csvdata3[:,1]
doping_site_all = csvdata3[:,2]
X_all = csvdata3[:,3:17]
#X_all = csvdata3[:,17:22]
#X_all = csvdata3[:,3:]

n_all = dopant_all.size




In [None]:
##   Visualize Data   ##
##   Visualize data: plot desired descriptor dimension vs property.

plt.figure(figsize=(6,6))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')

plt.ylabel('Property', fontname='Arial Narrow', size=32)
plt.xlabel('Descriptor', fontname='Arial Narrow', size=32)
plt.rc('xtick', labelsize=32)
plt.rc('ytick', labelsize=32)

yy = [0.0]*n
xx = [0.0]*n

for i in range(0,n):
    # use the builtin float: the np.float alias is no longer available in recent NumPy releases
    yy[i] = float(prop[i])
    xx[i] = float(X[i,12])

plt.scatter(xx[:], yy[:], c='k', marker='*', s=200, edgecolors='dimgrey', alpha=1.0)



In [19]:
# https://github.com/zinph/Cheminformatics/blob/master/compute_descriptors/RDKit_2D.py
# RDKit 2D Fingerprint
import pandas as pd
from molvs import standardize_smiles
#from RDKit_2D import *

class RDKit_2D:
    def __init__(self, smiles):
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles
        
    def compute_2Drdkit(self, name):
        rdkit_2d_desc = []
        calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
        header = calc.GetDescriptorNames()
        for i in range(len(self.mols)):
            ds = calc.CalcDescriptors(self.mols[i])
            rdkit_2d_desc.append(ds)
        df = pd.DataFrame(rdkit_2d_desc,columns=header)
        df.insert(loc=0, column='smiles', value=self.smiles)
        df.to_csv(name[:-4]+'_RDKit_2D.csv', index=False)

def main():
    filename = './OrganicLED/data/joint/nmat4717_patent_paper_joint_smile.csv'  # path to your csv file
    #filename = './SMILES.csv'
    df = pd.read_csv(filename)               # read the csv file as pandas data frame
    smiles = [standardize_smiles(i) for i in df['SMILES'].values]  

    ## Compute RDKit 2D descriptors and export a csv file.
    RDKit_descriptor = RDKit_2D(smiles)        # create the RDKit_2D object from the list of SMILES
    RDKit_descriptor.compute_2Drdkit(filename) # the input filename can be reused: compute_2Drdkit replaces the ".csv" suffix with "_RDKit_2D.csv" for the output file

if __name__ == '__main__':
    main()

In [None]:
# https://github.com/zinph/Cheminformatics/blob/master/compute_descriptors/ECFP6.py
# ECFP6 Fingerprint
import numpy as np
import pandas as pd
from molvs import standardize_smiles
from rdkit.Chem import AllChem
from rdkit import Chem, DataStructs

class ECFP6:
    def __init__(self, smiles):
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles

    def mol2fp(self, mol, radius = 3):
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius = radius)
        array = np.zeros((1,))
        DataStructs.ConvertToNumpyArray(fp, array)
        return array

    def compute_ECFP6(self, name):
        bit_headers = ['bit' + str(i) for i in range(2048)]
        arr = np.empty((0,2048), int).astype(int)
        for i in self.mols:
            fp = self.mol2fp(i)
            arr = np.vstack((arr, fp))
        df_ecfp6 = pd.DataFrame(np.asarray(arr).astype(int),columns=bit_headers)
        df_ecfp6.insert(loc=0, column='smiles', value=self.smiles)
        df_ecfp6.to_csv(name[:-4]+'_ECFP6.csv', index=False)

def main():
    #filename = './qm9.csv'  # path to your csv file
    filename = './QM9_dataset/data/qm9.csv'
    df = pd.read_csv(filename)               # read the csv file as pandas data frame
    smiles = [standardize_smiles(i) for i in df['SMILES'].values]  

    ## Compute ECFP6 fingerprints and export a csv file.
    ECFP6_descriptor = ECFP6(smiles)        # create the ECFP6 object from the list of SMILES
    ECFP6_descriptor.compute_ECFP6(filename) # the input filename can be reused: compute_ECFP6 replaces the ".csv" suffix with "_ECFP6.csv" for the output file

if __name__ == '__main__':
    main()

RDKit ERROR: [00:54:00] Can't kekulize mol.  Unkekulized atoms: 2 6
RDKit ERROR: 
RDKit ERROR: [00:54:00] Can't kekulize mol.  Unkekulized atoms: 3 5
RDKit ERROR: 
RDKit ERROR: [00:54:00] Can't kekulize mol.  Unkekulized atoms: 3 7
RDKit ERROR: 


In [21]:
# https://github.com/zinph/Cheminformatics/blob/master/compute_descriptors/MACCS.py
# MACCS Fingerprint
from molvs import standardize_smiles
import pandas as pd
from rdkit import Chem
from rdkit.Chem import MACCSkeys

class MACCS:
    def __init__(self, smiles):
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles

    def compute_MACCS(self, name):
        MACCS_list = []
        header = ['bit' + str(i) for i in range(167)]
        for i in range(len(self.mols)):
            ds = list(MACCSkeys.GenMACCSKeys(self.mols[i]).ToBitString())
            MACCS_list.append(ds)
        df = pd.DataFrame(MACCS_list,columns=header)
        df.insert(loc=0, column='smiles', value=self.smiles)
        df.to_csv(name[:-4]+'_MACCS.csv', index=False)

def main():
    #filename = './qm9.csv'  # path to your csv file
    filename = './OrganicLED/data/joint/nmat4717_patent_paper_joint_smile.csv'
    df = pd.read_csv(filename)               # read the csv file as pandas data frame
    smiles = [standardize_smiles(i) for i in df['SMILES'].values]  

    ## Compute MACCS keys and export a csv file.
    MACCS_descriptor = MACCS(smiles)        # create the MACCS object and provide smiles
    MACCS_descriptor.compute_MACCS(filename) # compute MACCS keys and write the output file. You can pass the input file name because compute_MACCS replaces the extension with "_MACCS.csv".

if __name__ == '__main__':
    main()
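
The descriptor classes above build the output file name with `name[:-4]`, which silently assumes a four-character extension such as `.csv`. A minimal sketch of a more general variant follows; the helper name `descriptor_output_path` is hypothetical and not part of the original code.

```python
from pathlib import Path

def descriptor_output_path(input_path, suffix):
    """Return input_path with its extension replaced by '_<suffix>.csv'.

    Hypothetical helper (not in the original notebook): unlike name[:-4],
    it does not assume the extension is exactly four characters long.
    """
    p = Path(input_path)
    return str(p.with_name(f"{p.stem}_{suffix}.csv"))

# Example: derive the MACCS output name from an input file name
print(descriptor_output_path("qm9.csv", "MACCS"))
```

Passing the result to `df.to_csv(...)` would behave like the original code for `.csv` inputs while also handling inputs such as `.smiles` or `.txt`.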

In [None]:
#https://drzinph.com/mordred_mrc_descriptors-in-python-part-5/
#https://github.com/zinph/Cheminformatics/blob/master/compute_descriptors/Macrocycle_Descriptors.py
# Macrocycle_Descriptors
from molvs import standardize_smiles
import itertools
import pandas as pd
from rdkit import Chem
from mordred import Calculator, descriptors
from mordred.RingCount import RingCount

class Macrocycle_Descriptors:

    def __init__(self, smiles):
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles
        self.mordred = None


    def compute_ringsize(self, mol):
        '''
        check for macrolides of RS 3 to 99, return a  list of ring counts.
        [RS3,RS4,.....,RS99]
        [0,0,0,...,1,...,0]
        '''
        RS_3_99 = [i+3 for i in range(97)]
        RS_count = []
        for j in RS_3_99:
            RS = RingCount(order=j)(mol)
            RS_count.append(RS)
        return RS_count

    def macrolide_ring_info(self):
        headers = ['n'+str(i+13)+'Ring' for i in range(87)]+['SmallestRS','LargestRS']
        # up to nR12 is already with mordred, start with nR13 to nR99
        ring_sizes = []
        for i in range(len(self.mols)):
            RS = self.compute_ringsize(self.mols[i])  # nR3 to nR99
            RS_12_99 = RS[9:]    # start with nR12 up to nR99
            ring_indices = [i for i,x in enumerate(RS_12_99) if x!=0]  # get index if item isn't equal to 0
            # if there is a particular ring present, the frequency won't be zero. Find those indexes. 
            if ring_indices:
                # Add 12 (starting ring count) to get up to the actual ring size
                smallest_RS = ring_indices[0]+12     # Retrieve the first index (for the smallest core RS - note the list is in ascending order)
                largest_RS = ring_indices[-1]+12     # Retrieve the last index (for the largest core RS)
                RS_12_99.append(smallest_RS)  # Smallest RS
                RS_12_99.append(largest_RS)  # Largest RS
            else:
                RS_12_99.extend(['',''])
            ring_sizes.append(RS_12_99[1:]) # up to nR12 is already with mordred, start with nR13 to nR99
        df = pd.DataFrame(ring_sizes, columns=headers)
        return df

    def sugar_count(self):
        sugar_patterns = [
        '[OX2;$([r5]1@C@C@C(O)@C1),$([r6]1@C@C@C(O)@C(O)@C1)]',
        '[OX2;$([r5]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
        '[OX2;$([r5]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C(O)@C1),$([r6]1@C(!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C(O)@C(O)@C1)]',
        '[OX2;$([r5]1@C(!@[OX2H1])@C@C@C1),$([r6]1@C(!@[OX2H1])@C@C@C@C1)]',
        '[OX2;$([r5]1@[C@@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@[C@@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
        '[OX2;$([r5]1@[C@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C1),$([r6]1@[C@](!@[OX2,NX3,SX2,FX1,ClX1,BrX1,IX1])@C@C@C@C1)]',
        ]
        sugar_mols = [Chem.MolFromSmarts(i) for i in sugar_patterns]
        sugar_counts = []
        for i in self.mols:
            matches_total = []
            for s_mol in sugar_mols:
                raw_matches = i.GetSubstructMatches(s_mol)
                matches = list(sum(raw_matches, ()))
                if matches not in matches_total and len(matches) !=0:
                    matches_total.append(matches)
            sugar_indices = set((list(itertools.chain(*matches_total))))
            count = len(sugar_indices)
            sugar_counts.append(count)
        df = pd.DataFrame(sugar_counts, columns=['nSugars'])
        return df

    def core_ester_count(self):
        '''
        Returns pandas frame containing the count of esters in core rings of >=12 membered macrocycles.
        '''
        ester_smarts = '[CX3](=[OX1])O@[r;!r3;!r4;!r5;!r6;!r7;!r8;!r9;!r10;!r11]'
        core_ester = []
        ester_mol = Chem.MolFromSmarts(ester_smarts)
        for i in self.mols:
            ester_count = len(i.GetSubstructMatches(ester_mol))
            core_ester.append(ester_count)
        df = pd.DataFrame(core_ester, columns=['core_ester'])
        return df

    def mordred_compute(self, name):
        calc = Calculator(descriptors, ignore_3D=True)
        df = calc.pandas(self.mols)
        df.insert(loc=0, column='smiles', value=self.smiles)
        df.to_csv(name[:-4]+'_mordred.csv', index=False)
        self.mordred = df
        return df  # return the frame so that callers can assign the result

    def compute_mordred_macrocycle(self, name):
        if not isinstance(self.mordred, pd.DataFrame):
            self.mordred = self.mordred_compute(name)
        ring_df = self.macrolide_ring_info()
        sugar_df = self.sugar_count()
        ester_df = self.core_ester_count()
#        self.mrc = pd.concat([ring_df,sugar_df, ester_df], axis=1)
        mordred_mrc = pd.concat([self.mordred, ring_df,sugar_df, ester_df], axis=1)
        mordred_mrc.to_csv(name[:-4]+'_mordred_mrc.csv', index=False)

def main():
    #filename = './qm9.csv'  # path to your csv file
    filename = './Redox_Flow_Battery/data/SMILES.csv'
    df = pd.read_csv(filename)               # read the csv file as pandas data frame
    smiles = [standardize_smiles(i) for i in df['SMILES'].values]  

    ## Compute mordred and macrocycle descriptors and export csv files.
    Macrocycle_descriptor = Macrocycle_Descriptors(smiles)        # create the Macrocycle_Descriptors object and provide smiles
    Macrocycle_descriptor.compute_mordred_macrocycle(filename) # compute the descriptors and write the output files. You can pass the input file name because the class replaces the extension with "_mordred.csv" and "_mordred_mrc.csv".

if __name__ == '__main__':
    main()





In [25]:
#https://greglandrum.github.io/rdkit-blog/page2/
dataset = pd.read_csv("./OrganicLED/data/joint/nmat4717_patent_paper_joint_smile.csv")[['SMILES']]

#dataset = pd.read_csv("./qm9.csv")[['SMILES']]
#print(list(dataset))
#list(dataset)
#dataset
#dataset.head()
PandasTools.AddMoleculeColumnToFrame(dataset,'SMILES', 'Molecules' )
dataset = dataset
#Descriptors2D
# dataset['MolWt'] = [Descriptors.MolWt(mol) for mol in dataset['Molecules']]
# dataset['exactmw'] = [Descriptors.ExactMolWt(mol) for mol in dataset['Molecules']]
# dataset['FpDensityMorgan1'] = [Descriptors.FpDensityMorgan1(mol) for mol in dataset['Molecules']]
# dataset['FpDensityMorgan2'] = [Descriptors.FpDensityMorgan2(mol) for mol in dataset['Molecules']]
# dataset['FpDensityMorgan3'] = [Descriptors.FpDensityMorgan3(mol) for mol in dataset['Molecules']]
# dataset['HeavyAtomMolWt'] = [Descriptors.HeavyAtomMolWt(mol) for mol in dataset['Molecules']]
# dataset['MaxAbsPartialCharge'] = [Descriptors.MaxAbsPartialCharge(mol) for mol in dataset['Molecules']]
# dataset['MaxPartialCharge'] = [Descriptors.MaxPartialCharge(mol) for mol in dataset['Molecules']]
# dataset['MinAbsPartialCharge'] = [Descriptors.MinAbsPartialCharge(mol) for mol in dataset['Molecules']]
# dataset['NumRadicalElectrons'] = [Descriptors.NumRadicalElectrons(mol) for mol in dataset['Molecules']]
# dataset['NumValenceElectrons'] = [Descriptors.NumValenceElectrons(mol) for mol in dataset['Molecules']]
#dataset['setupAUTOCorrDescriptors'] = [Descriptors.setupAUTOCorrDescriptors(mol) for mol in dataset['Molecules']]

#Descriptors3D
#dataset['Asphericity'] = [Chem.Descriptors3D.PMI1(mol) for mol in dataset['Molecules']]

dataset['LOGP'] = [Crippen.MolLogP(mol) for mol in dataset['Molecules']]
dataset['HBA'] = [Lipinski.NumHAcceptors(mol) for mol in dataset['Molecules']]
dataset['HBD'] = [Lipinski.NumHDonors(mol) for mol in dataset['Molecules']]
dataset['rotable'] = [Lipinski.NumRotatableBonds(mol) for mol in dataset['Molecules']]
dataset['amide'] = [AllChem.CalcNumAmideBonds(mol) for mol in dataset['Molecules']]
dataset['bridge'] = [AllChem.CalcNumBridgeheadAtoms(mol) for mol in dataset['Molecules']]
dataset['heteroA'] = [Lipinski.NumHeteroatoms(mol) for mol in dataset['Molecules']]
dataset['heavy'] = [Lipinski.HeavyAtomCount(mol) for mol in dataset['Molecules']]
dataset['spiro'] = [AllChem.CalcNumSpiroAtoms(mol) for mol in dataset['Molecules']]
dataset['FCSP3'] = [AllChem.CalcFractionCSP3(mol) for mol in dataset['Molecules']]
dataset['ring'] = [Lipinski.RingCount(mol) for mol in dataset['Molecules']]
dataset['Aliphatic'] = [AllChem.CalcNumAliphaticRings(mol) for mol in dataset['Molecules']]
dataset['aromatic'] = [AllChem.CalcNumAromaticRings(mol) for mol in dataset['Molecules']]
dataset['saturated'] = [AllChem.CalcNumSaturatedRings(mol) for mol in dataset['Molecules']]
dataset['heteroR'] = [AllChem.CalcNumHeterocycles(mol) for mol in dataset['Molecules']]
dataset['TPSA'] = [MolSurf.TPSA(mol) for mol in dataset['Molecules']]
dataset['valence'] = [Descriptors.NumValenceElectrons(mol) for mol in dataset['Molecules']]
dataset['mr'] = [Crippen.MolMR(mol) for mol in dataset['Molecules']]
dataset['charge'] = [AllChem.ComputeGasteigerCharges(mol) for mol in dataset['Molecules']]
dataset['lipinskiHBA'] = [Chem.rdMolDescriptors.CalcNumLipinskiHBA(mol) for mol in dataset['Molecules']]
dataset['lipinskiHBD'] = [Chem.rdMolDescriptors.CalcNumLipinskiHBD(mol) for mol in dataset['Molecules']]
dataset['NumRotatableBonds'] = [Chem.Lipinski.NumRotatableBonds(mol) for mol in dataset['Molecules']]
dataset['NumHBD'] = [Chem.rdMolDescriptors.CalcNumHBD(mol) for mol in dataset['Molecules']]
dataset['NumHBA'] = [Chem.rdMolDescriptors.CalcNumHBA(mol) for mol in dataset['Molecules']]
dataset['NumHeteroatoms'] = [Chem.Lipinski.NumHeteroatoms(mol) for mol in dataset['Molecules']]
dataset['NumAmideBonds'] = [Chem.rdMolDescriptors.CalcNumAmideBonds(mol) for mol in dataset['Molecules']]
dataset['FractionCSP3'] = [Chem.rdMolDescriptors.CalcFractionCSP3(mol) for mol in dataset['Molecules']]
dataset['NumRings'] = [Chem.rdMolDescriptors.CalcNumRings(mol) for mol in dataset['Molecules']]
dataset['NumAromaticRings'] = [Chem.rdMolDescriptors.CalcNumAromaticRings(mol) for mol in dataset['Molecules']]
dataset['NumAliphaticRings'] = [Chem.rdMolDescriptors.CalcNumAliphaticRings(mol) for mol in dataset['Molecules']]
dataset['NumSaturatedRings'] = [Chem.rdMolDescriptors.CalcNumSaturatedRings(mol) for mol in dataset['Molecules']]
dataset['NumHeterocycles'] = [Chem.rdMolDescriptors.CalcNumHeterocycles(mol) for mol in dataset['Molecules']]
dataset['NumAromaticHeterocycles'] = [Chem.rdMolDescriptors.CalcNumAromaticHeterocycles(mol) for mol in dataset['Molecules']]
dataset['NumSaturatedHeterocycles'] = [Chem.rdMolDescriptors.CalcNumSaturatedHeterocycles(mol) for mol in dataset['Molecules']]
dataset['NumAliphaticHeterocycles'] = [Chem.rdMolDescriptors.CalcNumAliphaticHeterocycles(mol) for mol in dataset['Molecules']]
dataset['NumSpiroAtoms'] = [Chem.rdMolDescriptors.CalcNumSpiroAtoms(mol) for mol in dataset['Molecules']]
dataset['NumBridgeheadAtoms'] = [Chem.rdMolDescriptors.CalcNumBridgeheadAtoms(mol) for mol in dataset['Molecules']]
dataset['NumAtomStereoCenters'] = [Chem.rdMolDescriptors.CalcNumAtomStereoCenters(mol) for mol in dataset['Molecules']]
dataset['NumUnspecifiedAtomStereoCenters'] = [Chem.rdMolDescriptors.CalcNumUnspecifiedAtomStereoCenters(mol) for mol in dataset['Molecules']]
dataset['tpsa'] = [Chem.rdMolDescriptors.CalcTPSA(mol) for mol in dataset['Molecules']]

dataset.head()
dataset = dataset
# dataset['fps-SmilesMolSupplier'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(mol,2,2048) for mol in dataset['Molecules']]
# #dataset.head()
fpsdataset = dataset
# len(fpsdataset)
fpsdataset = fpsdataset.drop(columns = 'Molecules')
fpsdataset.head()
fpsdataset.to_csv("./OrganicLED/data/joint/nmat4717_patent_paper_joint_smile_3D.csv")
# fpsdataset.to_csv("./fpsdatasetqm9.csv")

In [None]:
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
smiles = 'c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1'
# smiles = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile.csv")[['SMILES']]
mol = Chem.MolFromSmiles(smiles)

d = Draw.MolDraw2DCairo(1200, 1200)
d.DrawMolecule(mol)  # without this call the canvas stays empty
d.FinishDrawing()
d.WriteDrawingText('./OrganicLED/SimilarityMapFromWeights.png')  # write the PNG directly from the drawer
show_png(d.GetDrawingText())

In [None]:
smi = Chem.MolToSmiles(mol)
print(smi)
print(Chem.MolToInchiKey(mol))
mol_block = Chem.MolToMolBlock(mol)
print(mol_block)

In [None]:
#https://github.com/XuhanLiu/DrugEx
# https://www.programcreek.com/python/example/124114/rdkit.Chem.Descriptors.MolWt

def PhyChem(smiles):
    """ Calculate the 19 physicochemical descriptors for each molecule;
    the values are then standardized column-wise.

    Arguments:
        smiles (list): list of SMILES strings.
    Returns:
        props (ndarray): m x 19 matrix of normalized PhysChem descriptors,
            where m is the number of samples.
    """
    props = []
    for smile in smiles:
        mol = Chem.MolFromSmiles(smile)
        try:
            MW = Descriptors.MolWt(mol)
            LOGP = Crippen.MolLogP(mol)
            HBA = Lipinski.NumHAcceptors(mol)
            HBD = Lipinski.NumHDonors(mol)
            rotable = Lipinski.NumRotatableBonds(mol)
            amide = AllChem.CalcNumAmideBonds(mol)
            bridge = AllChem.CalcNumBridgeheadAtoms(mol)
            heteroA = Lipinski.NumHeteroatoms(mol)
            heavy = Lipinski.HeavyAtomCount(mol)
            spiro = AllChem.CalcNumSpiroAtoms(mol)
            FCSP3 = AllChem.CalcFractionCSP3(mol)
            ring = Lipinski.RingCount(mol)
            Aliphatic = AllChem.CalcNumAliphaticRings(mol)
            aromatic = AllChem.CalcNumAromaticRings(mol)
            saturated = AllChem.CalcNumSaturatedRings(mol)
            heteroR = AllChem.CalcNumHeterocycles(mol)
            TPSA = MolSurf.TPSA(mol)
            valence = Descriptors.NumValenceElectrons(mol)
            mr = Crippen.MolMR(mol)
            # charge = AllChem.ComputeGasteigerCharges(mol)
            prop = [MW, LOGP, HBA, HBD, rotable, amide, bridge, heteroA, heavy, spiro,
                    FCSP3, ring, Aliphatic, aromatic, saturated, heteroR, TPSA, valence, mr]
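        except Exception:
            # Descriptor calculation failed (for example, MolFromSmiles returned None): report the SMILES and use zeros
            print(smile)
            prop = [0] * 19
        props.append(prop)
    props = np.array(props)
    # Scaler is not imported in this cell; it is assumed to be a scikit-learn scaler such as StandardScaler
    props = Scaler().fit_transform(props)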
    return props 
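
The "normalized" in the docstring most likely means column-wise standardization (zero mean, unit variance), which is what scikit-learn's `StandardScaler` performs. That `Scaler` refers to it is an assumption, since the import is not shown. A dependency-free sketch of the same transformation, with `standardize_columns` as a hypothetical name and made-up descriptor values:

```python
from statistics import fmean, pstdev

def standardize_columns(rows):
    """Column-wise z-score: subtract each column's mean and divide by its
    population standard deviation. Constant columns are mapped to 0.0."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Made-up rows of (MW, LOGP, HBA), for illustration only
props = [[180.2, 1.3, 4], [46.1, -0.1, 1], [78.1, 1.7, 0]]
scaled = standardize_columns(props)
for row in scaled:
    print([round(v, 3) for v in row])
```

Note that failed molecules enter `PhyChem` as rows of zeros before scaling, so they shift the column means; filtering them out first may be preferable.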

In [None]:
#computing morgan fingerprints  vectors
mol = Chem.MolFromSmiles('C/C1=C\\C[C@H]([C+](C)C)CC/C(C)=C/CC1')
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, useChirality=True, radius=2, nBits=124)
vec1 = np.array(fp1)
print(vec1)
morgan_fp_gen = rdFingerprintGenerator.GetMorganGenerator(includeChirality=True, radius=2, fpSize=124, useCountSimulation=False)
fp2 = morgan_fp_gen.GetFingerprint(mol)
vec2 = np.array(fp2)
print(vec2)
assert np.all(vec1 == vec2) 

In [None]:
#https://stackoverflow.com/questions/67302261/cant-convert-molecule-to-fingerprint-with-rdkit
from rdkit.Chem import AllChem as Chem

fragment = Chem.MolFromSmiles('Nc1cccc(N)n1')

smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1', 'CC1=CC=Cc2c(N)nc(N)cc12']

for smi in smiles:
    try:
        mol = Chem.MolFromSmiles(smi)
        f1 = Chem.DeleteSubstructs(mol, fragment)
        f2 = Chem.MolFromSmiles(Chem.MolToSmiles(f1))
        fp = Chem.GetMorganFingerprintAsBitVect(f2, 2)
    except Exception:
        # Report the molecule whose fragment could not be fingerprinted
        print('SMILES:', smi)
        f = Chem.DeleteSubstructs(mol, fragment)
        print('smiles_frag:', Chem.MolToSmiles(f))

In [None]:
file_name = 'somedata.smi'

with open(file_name, "r") as ins:
    smiles = []
    for line in ins:
        smiles.append(line.strip())  # drop the trailing newline and surrounding whitespace
print('# of SMILES:', len(smiles))
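
`.smi` files often carry a molecule name after the SMILES and may contain blank lines, both of which the loop above would keep. A minimal sketch of a more defensive reader, assuming one record per line and no header row; `read_smiles_file` is a hypothetical helper, and the temporary file exists only to make the example self-contained.

```python
import os
import tempfile

def read_smiles_file(path):
    """Read one SMILES per line, keeping only the first whitespace-separated
    token and skipping blank lines. Assumes the file has no header row."""
    smiles = []
    with open(path, "r") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                smiles.append(tokens[0])
    return smiles

# Write a small example file, read it back, then clean up
with tempfile.NamedTemporaryFile("w", suffix=".smi", delete=False) as tmp:
    tmp.write("CCO ethanol\n\nc1ccccc1 benzene\n")
example = read_smiles_file(tmp.name)
os.remove(tmp.name)
print('# of SMILES:', len(example))
```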

In [None]:
# directly feed SMILES structures stored in a pandas dataframe into RDKit to calculate molecular fingerprints and Tanimoto similarity
df = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile.csv")[['SMILES']]
# df = pd.read_csv("./SMILES_feature.csv")
from rdkit import DataStructs
# 	CC1=C(C(O)=O)C2=CC(=CC=C2N=C1C3=CC=C(C=C3)C4=CC=CC=C4F)F
target = Chem.RDKFingerprint(Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1'))
df_smiles = pd.DataFrame(df.SMILES)
#display(df_smiles)
df = pd.DataFrame(data=df.SMILES)
df['Tanimoto'] = DataStructs.BulkTanimotoSimilarity(target, [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in df['SMILES']])


df.to_csv('./OrganicLED/data/nmat4717_patent_smile_Tanimoto.csv')
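
The Tanimoto coefficient used above is the number of bits set in both fingerprints divided by the number of bits set in either. A minimal pure-Python sketch on sets of on-bit indices illustrates the calculation; the function names and the toy fingerprints are illustrative only, and this is not how RDKit implements it internally.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

def bulk_tanimoto(target, candidates):
    """Compare one target fingerprint against a list of candidates."""
    return [tanimoto(target, c) for c in candidates]

# Toy fingerprints: identical, partially overlapping, and disjoint
target = {1, 5, 9, 12}
library = [{1, 5, 9, 12}, {1, 5, 20}, {30, 31}]
scores = bulk_tanimoto(target, library)
print(scores)
```

A score of 1.0 means identical bit sets and 0.0 means no shared bits, which is why sorting the `Tanimoto` column in descending order ranks the most similar molecules first.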



In [None]:
import matplotlib.pyplot as plt
import pandas as pd
# from pandas.table.plotting import table # EDIT: see deprecation warnings below
ax = plt.subplot(111, frame_on=False) # no visible frame
ax.xaxis.set_visible(False)  # hide the x axis
ax.yaxis.set_visible(False)  # hide the y axis
pd.plotting.table(ax, df.head(10), rowLabels=None, colLabels=None)# where df is your data frame

plt.savefig('./OrganicLED/nmat4717_patent_smile_Tanimoto.png', dpi=2400, facecolor='w', edgecolor='w', transparent=False)

In [None]:
from bokeh.io import export_png, export_svgs
from bokeh.models import ColumnDataSource, DataTable, TableColumn

def save_df_as_image(df, path):
    source = ColumnDataSource(df)
    # ColumnDataSource stores an unnamed index under the key "index"
    df_columns = [df.index.name or 'index']
    df_columns.extend(df.columns.values)
    columns_for_table=[]
    for column in df_columns:
        columns_for_table.append(TableColumn(field=str(column), title=str(column)))
    # Build the table from the ColumnDataSource (not the raw frame) and export it
    data_table = DataTable(source=source, columns=columns_for_table,height_policy="auto",width_policy="auto",index_position=None)
    export_png(data_table, filename=path)

save_df_as_image(df, './OrganicLED/nmat4717_patent_smile_Tanimoto1.png')


In [None]:
df2 = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile.csv")[['SMILES']]
df2.head()
# DataFrame.shape


In [None]:
import dataframe_image as dfi
df_styled = df.style.background_gradient()
dfi.export(df_styled, './OrganicLED/df_styled.png')


In [None]:
#Export pandas data frame with mol image
import pandas as pd
from rdkit import Chem
from rdkit.Chem import PandasTools
# DataFrame = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile.csv")[['SMILES']]
DataFrame = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile_feature.csv")

SMILES = DataFrame['SMILES'].tolist()

# df = pd.DataFrame({'S0_splitting_(eV)':[S0_splitting_(eV)], 'SMILES':[SMILES]})
# ChangeMoleculeRendering(renderer='PNG')
df = pd.DataFrame({'SMILES':SMILES})
df['Mol Image'] = [Chem.MolFromSmiles(s) for s in df['SMILES']]

PandasTools.SaveXlsxFromFrame(df, './OrganicLED/nmat4717_patent_smile_Mol_Image.xlsx', molCol='Mol Image', size=(2400, 2400))


In [None]:
mols = [Chem.MolFromSmiles(smi) for smi in SMILES]
img = Draw.MolsToGridImage(mols, molsPerRow=8, subImgSize=(300, 300))
img.save('./OrganicLED/Chem_MolFromSmiles.png')
# img1 = Draw.MolToImage(mols, subImgSize=(300, 300), kekulize=True, wedgeBonds=True, fitImage=False, options=None, canvas=None)
# img1.save('./OrganicLED/Chem_MolFromSmiles_kekulize.png')
# mols=plt.savefig('./OrganicLED/Chem_MolFromSmiles.png', dpi=450, facecolor='w', edgecolor='w', format=None, transparent=False, bbox_inches=None, pad_inches=None, metadata=None)


In [None]:
import pandas as pd
from rdkit.Chem import PandasTools
DataFrame = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile_feature.csv")
#esol_data.head(1)
#Add ROMol to data
PandasTools.AddMoleculeColumnToFrame(DataFrame, smilesCol='SMILES')
# DataFrame.head()
PandasTools.SaveXlsxFromFrame(DataFrame, './OrganicLED/nmat4717_patent_smile_feature_Mol_Image.xlsx', size=(300, 300))


In [None]:
print(type(DataFrame.ROMol[1]))
# PandasTools.FrameToGridImage(DataFrame.head(8), legendsCol="S0_splitting_(eV)", molsPerRow=4, subImgSize=(300, 300))
img = PandasTools.FrameToGridImage(DataFrame.head(24), legendsCol="HOMO_(eV)", molsPerRow=6, subImgSize=(300, 300))
img.save('./OrganicLED/Chem_MolFromSmiles_DataFrame.png')

In [None]:
# Add new property columns using the pandas map method
DataFrame["n_Atoms"] = DataFrame['ROMol'].map(lambda x: x.GetNumAtoms())
DataFrame.head(1)
PandasTools.SaveXlsxFromFrame(DataFrame, './OrganicLED/nmat4717_patent_smile_feature_Mol_Image_add_n_Atoms.xlsx', size=(300, 300))

In [None]:
#Before saving the dataframe as a csv file, it is recommended to drop the ROMol column.
DataFrame = DataFrame.drop(['ROMol'], axis=1)
DataFrame.head(1)



In [None]:
#RDKit has a variety of built-in functionality for generating molecular fingerprints/descriptors
#url = 'https://raw.githubusercontent.com/XinhaoLi74/molds/master/clean_data/ESOL.csv'
#esol_data = pd.read_csv(url)
DataFrame = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile_feature.csv")
PandasTools.AddMoleculeColumnToFrame(DataFrame, smilesCol='SMILES')
DataFrame.head(1)


In [None]:
# ECFP6 corresponds to a Morgan fingerprint with radius 3 (diameter 6)
radius=3
nBits=2048
ECFP6 = [Chem.AllChem.GetMorganFingerprintAsBitVect(x,radius=radius, nBits=nBits) for x in DataFrame['ROMol']]
print(ECFP6[0])
print(len(ECFP6[0]))

In [None]:
ecfp6_name = [f'Bit_{i}' for i in range(nBits)]
ecfp6_bits = [list(l) for l in ECFP6]
df_morgan = pd.DataFrame(ecfp6_bits, index = DataFrame.SMILES, columns=ecfp6_name)
df_morgan.head(1)
df_morgan.to_csv("./OrganicLED/data/nmat4717_patent_smile_ecfp6_feature_add.csv")


In [None]:
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
#Similarity Search
ref_smiles = 'c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1'
ref_mol = Chem.MolFromSmiles(ref_smiles)
ref_ECFP4_fps = Chem.AllChem.GetMorganFingerprintAsBitVect(ref_mol,2)
# ref_mol=img.save('./OrganicLED/ref_mol.png')

d = Draw.MolDraw2DCairo(1200, 1200)
d.DrawMolecule(ref_mol)  # draw the reference molecule; without this call the canvas stays empty
d.FinishDrawing()
d.WriteDrawingText('./OrganicLED/SimilarityMapFromWeights.png')  # write the PNG directly from the drawer
show_png(d.GetDrawingText())

In [None]:
bulk_ECFP4_fps = [Chem.AllChem.GetMorganFingerprintAsBitVect(x,2) for x in DataFrame['ROMol']]

In [None]:
from rdkit import DataStructs

similarity_efcp4 = [DataStructs.FingerprintSimilarity(ref_ECFP4_fps,x) for x in bulk_ECFP4_fps]

In [None]:
DataFrame['Tanimoto_Similarity (ECFP4)'] = similarity_efcp4
img = PandasTools.FrameToGridImage(DataFrame.head(8), legendsCol="Tanimoto_Similarity (ECFP4)", molsPerRow=6, subImgSize=(300, 300))
img.save('./OrganicLED/ref_mol_DataFrame.png', size=(300, 300))


In [None]:
DataFrame = DataFrame.sort_values(['Tanimoto_Similarity (ECFP4)'], ascending=False)
img = PandasTools.FrameToGridImage(DataFrame.head(8), legendsCol="Tanimoto_Similarity (ECFP4)", molsPerRow=6, subImgSize=(300, 300))
img.save('./OrganicLED/ref_mol_Tanimoto_Similarity_DataFrame.png', size=(300, 300))

In [None]:
# RDKit
# https://github.com/XinhaoLi74/Hierarchical-QSAR-Modeling/blob/master/notebooks/descriptors.ipynb
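from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
generator = MakeGenerator(("RDKit2D",))
train = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile_feature.csv")
PandasTools.AddMoleculeColumnToFrame(train,smilesCol='SMILES')
# generator.process returns a list whose first element flags success; drop it
train_rdkit2d = [generator.process(x)[1:] for x in train['SMILES']]
# morgan fingerprint (radius 3, i.e. ECFP6)
train_ECFP6 = [Chem.AllChem.GetMorganFingerprintAsBitVect(x,3) for x in train['ROMol']]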

In [None]:
rdkit2d_name = []
for name, numpy_type in generator.GetColumns():
    rdkit2d_name.append(name)

In [None]:
train_rdkit2d_df = pd.DataFrame(train_rdkit2d, index = train.index, columns=rdkit2d_name[1:])

In [None]:
train_rdkit2d_df.shape

In [None]:
train_rdkit2d_df.to_csv('./train_rdkit2d.csv')

In [None]:
#Use RDKit to calculate molecular fingerprints and pairwise similarity for a list of SMILES structures
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd

# read and concatenate the csv files
#df_1 = pd.read_csv('first.csv')
#df_2 = pd.read_csv('second.csv')
df_3 = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile_feature.csv")

# validate the SMILES and make a list of canonical SMILES
df_smiles = df_3['SMILES']
c_smiles = []
for ds in df_smiles:
    try:
        cs = Chem.CanonSmiles(ds)
        c_smiles.append(cs)
    except Exception:
        print('Invalid SMILES:', ds)
print()

# make a list of mols
ms = [Chem.MolFromSmiles(x) for x in c_smiles]

# make a list of fingerprints (fp)
fps = [FingerprintMols.FingerprintMol(x) for x in ms]

# the list for the dataframe
qu, ta, sim = [], [], []

# compare all fp pairwise without duplicates
for n in range(len(fps)-1): # -1 so the last fp will not be used
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:]) # compare fp n with every later fp
    print(c_smiles[n], c_smiles[n+1:]) # which mol is compared with which group
    # collect the SMILES and values
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
print()

# build the dataframe and sort it
d = {'query':qu, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
#print(df_final)

# save as csv
df_final.to_csv('./OrganicLED/data/nmat4717_patent_smile_feature_Similarity.csv', index=False, sep=',')
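
The nested loop above visits every unordered pair exactly once, which is the upper triangle of the similarity matrix without its diagonal. The same enumeration can be sketched with `itertools.combinations`; the names below are placeholders rather than molecules from the dataset.

```python
from itertools import combinations

def unique_pairs(items):
    """All unordered pairs (i < j): the upper triangle without the diagonal."""
    return list(combinations(items, 2))

names = ["A", "B", "C", "D"]
pairs = unique_pairs(names)
print(len(pairs), pairs)
```

For n molecules this yields n(n-1)/2 pairs, so the output table grows quadratically with the size of the SMILES list.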


In [None]:
df_final.head(10)
# PandasTools.SaveXlsxFromFrame(df_final, './OrganicLED/data/nmat4717_patent_smile_feature_Similarity.xlsx')


In [None]:
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.Fingerprints import FingerprintMols

template = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
AllChem.Compute2DCoords(template)

smiles_list = ['N#CC(c1ccccc1)C(Br)Oc1ccccc1','O=[N+]([O-])c1ccc(C(c2ccccc2)C(Br)Oc2ccccc2)cc1']
ms = [Chem.MolFromSmiles(smi) for smi in smiles_list]
for m in ms:
    # The aligned depiction requires the template to be a substructure of m;
    # otherwise fall back to default 2D coordinates.
    if m.HasSubstructMatch(template):
        AllChem.GenerateDepictionMatching2DStructure(m,template)
    else:
        AllChem.Compute2DCoords(m)
    #Draw.MolToFile(ms[0],'./SMILES_mol1.o.png')
    #Draw.MolToFile(ms[1],'./SMILES_mol2.o.png')
# Molecules built from SMILES have no "_Name" property, so the SMILES serve as legends
img = Draw.MolsToGridImage(ms[:8],molsPerRow=4,subImgSize=(200,200),legends=smiles_list[:8])
img.save('./OrganicLED/SimilarityMapForFingerprint.png')


In [None]:
#Generating Similarity Maps Using Fingerprints
from rdkit import Chem
# DataFrame = pd.read_csv("./OrganicLED/data/nmat4717_patent_smile_feature.csv")
# SMILES = DataFrame['SMILES'].tolist()
# SMILES
# mol = Chem.MolFromSmiles({'SMILES'})
# refmol = Chem.MolFromSmiles('SMILES')
mol = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
Chem.SanitizeMol(mol)
print(Chem.MolToSmiles(mol))
refmol = Chem.MolFromSmiles('O=C1c2ccncc2C2(c3ccccc3Sc3ccncc32)c2cnccc21')
Chem.SanitizeMol(refmol)
print(Chem.MolToSmiles(refmol))


In [None]:
from rdkit.Chem import Draw
from rdkit.Chem.Draw import SimilarityMaps
fp = SimilarityMaps.GetAPFingerprint(mol, fpType='normal')
fp = SimilarityMaps.GetTTFingerprint(mol, fpType='normal')
fp = SimilarityMaps.GetMorganFingerprint(mol, fpType='bv')

In [None]:
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
d = Draw.MolDraw2DCairo(300, 300)

# Pass the drawer via draw2d so the map is rendered onto it (requires an RDKit version that accepts draw2d)
_, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(refmol, mol, SimilarityMaps.GetMorganFingerprint, draw2d=d)

d.FinishDrawing()
d.WriteDrawingText('./OrganicLED/SimilarityMapForFingerprint.png')  # write the PNG directly from the drawer
show_png(d.GetDrawingText())


In [None]:
#Visualization of Descriptors
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
d = Draw.MolDraw2DCairo(1200, 1200)

from rdkit import DataStructs
# Pass the drawer via draw2d so the map is rendered onto it (requires an RDKit version that accepts draw2d)
_, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(refmol, mol, lambda m,idx: SimilarityMaps.GetMorganFingerprint(m, atomId=idx, radius=1, fpType='count'), metric=DataStructs.TanimotoSimilarity, draw2d=d)

d.FinishDrawing()
d.WriteDrawingText('./OrganicLED/SimilarityMapForFingerprint.png')  # write the PNG directly from the drawer
show_png(d.GetDrawingText())

In [None]:
from rdkit.Chem import Descriptors
m = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
print (Descriptors.TPSA(m))
print (Descriptors.MolLogP(m))
Chem.AllChem.ComputeGasteigerCharges(m)
m.GetAtomWithIdx(0).GetDoubleProp('_GasteigerCharge')


In [None]:
#Visualization of Descriptors
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
d = Draw.MolDraw2DCairo(1200, 1200)

from rdkit.Chem.Draw import SimilarityMaps
mol = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
Chem.AllChem.ComputeGasteigerCharges(mol)
contribs = [mol.GetAtomWithIdx(i).GetDoubleProp('_GasteigerCharge') for i in range(mol.GetNumAtoms())]
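from matplotlib import cm
# Pass the drawer via draw2d and a matplotlib colormap object (requires an RDKit version that accepts draw2d)
SimilarityMaps.GetSimilarityMapFromWeights(mol, contribs, colorMap=cm.jet, contourLines=10, draw2d=d)

d.FinishDrawing()
d.WriteDrawingText('./OrganicLED/SimilarityMapFromWeights_GasteigerCharge.png')  # write the PNG directly from the drawer
show_png(d.GetDrawingText())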


In [None]:
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
d = Draw.MolDraw2DCairo(1200, 1200)

from rdkit.Chem import rdMolDescriptors
contribs = rdMolDescriptors._CalcCrippenContribs(mol)
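from matplotlib import cm
# x is the per-atom logP contribution, y the molar refractivity contribution; only logP is mapped here
SimilarityMaps.GetSimilarityMapFromWeights(mol,[x for x,y in contribs], colorMap=cm.jet, contourLines=10, draw2d=d)

d.FinishDrawing()
d.WriteDrawingText('./OrganicLED/SimilarityMapFromWeights.png')  # write the PNG directly from the drawer
show_png(d.GetDrawingText())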

In [None]:
#Chemical Features
from rdkit import Chem
from rdkit.Chem import ChemicalFeatures
from rdkit import RDConfig
import os
fdefName = os.path.join(RDConfig.RDDataDir,'BaseFeatures.fdef')
factory = ChemicalFeatures.BuildFeatureFactory(fdefName)
m = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
feats = factory.GetFeaturesForMol(m)
len(feats)
print(feats[0].GetFamily())
print(feats[0].GetType())
print(feats[0].GetAtomIds())
print(feats[4].GetFamily())
print(feats[4].GetAtomIds())
Chem.AllChem.Compute2DCoords(m)
print(feats[0].GetPos())
print(list(feats[0].GetPos()))


In [None]:
# Molecular Fragments
fName=os.path.join(RDConfig.RDDataDir,'FunctionalGroups.txt')
from rdkit.Chem import FragmentCatalog
fparams = FragmentCatalog.FragCatParams(1,6,fName)
print(fparams.GetNumFuncGroups())
fcat=FragmentCatalog.FragCatalog(fparams)
fcgen=FragmentCatalog.FragCatGenerator()
m = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
print(fcgen.AddFragsFromMol(m,fcat))
print(fcat.GetEntryDescription(0))
print(fcat.GetEntryDescription(1))
print(fcat.GetEntryDescription(2))
list(fcat.GetEntryFuncGroupIds(2))
fparams.GetFuncGroup(1)
print(Chem.MolToSmarts(fparams.GetFuncGroup(1)))
print(Chem.MolToSmarts(fparams.GetFuncGroup(34)))
print(fparams.GetFuncGroup(1).GetProp('_Name'))
print(fparams.GetFuncGroup(34).GetProp('_Name'))


In [None]:
m = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
m.GetNumAtoms()
# help(m.GetNumAtoms)
m.GetNumAtoms(onlyExplicit=False)

In [None]:
#Advanced Topics/Warnings Editing Molecules
m = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
m.GetAtomWithIdx(0).SetAtomicNum(7)
Chem.SanitizeMol(m)
print(Chem.MolToSmiles(m))
#Do not forget the sanitization step, without it one can end up with results that look ok (so long as you don’t think):
m = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
m.GetAtomWithIdx(0).SetAtomicNum(8)
print(Chem.MolToSmiles(m))



We can explore the range of solubilities found in the dataset by plotting a histogram of solubility values from the dataset. Our machine learning models will aim to predict these solubilities.

In [None]:
#sns.distplot(dataset["measured log solubility in mols per litre"])
df = pd.DataFrame(dataset)
display(df)
df_condition = df[(df['sssr'] < 10) & (df["clogp"] > 0.25)]
# https://buildmedia.readthedocs.org/media/pdf/rdkit/latest/rdkit.pdf

# 'sssr', -- smallest set of smallest rings
# 'clogp', --
# 'mr', --
# 'mw', --
# 'tpsa', -- topological polar surface area (TPSA) descriptor
# 'chi0n', 'chi1n', 'chi2n', 'chi3n', 'chi4n', --  Connectivity Descriptors returns the ChiXn value for a molecule for X=0-4 Rev. Comput. Chem. 2:367-422 (1991)
# 'chi0v', 'chi1v', 'chi2v', 'chi3v', 'chi4v', -- returns the ChiXv value for a molecule for X=0-4 Rev. Comput. Chem. 2:367-422 (1991)
# 'fracsp3', -- 
# 'hall_kier_alpha', -- Rev. Comput. Chem. 2:367-422 (1991)
# 'kappa1', 'kappa2', 'kappa3', -- Rev. Comput. Chem. 2:367-422 (1991)
# 'labuteasa', -- J. Mol. Graph. Mod. 18:464-77 (2000)
# 'number_aliphatic_rings', --
# 'number_aromatic_rings', --
# 'number_amide_bonds', --
# 'number_atom_stereocenters', -- 
# 'number_bridgehead_atoms', --
# 'number_HBA', --
# 'number_HBD', --
# 'number_hetero_atoms', -- 
# 'number_hetero_cycles', --
# 'number_rings', --
# 'number_rotatable_bonds', --
# 'number_spiro', -- Number of spiro atoms (atoms shared between rings that share exactly one atom)
# 'number_saturated_rings', --
# 'number_heavy_atoms', --
# 'number_nh_oh', --
# 'number_n_o', --
# 'number_valence_electrons', --
# 'max_partial_charge', --
# 'min_partial_charge',-- 
# 'fr_C_O', --
# 'fr_C_O_noCOO', --
# 'fr_Al_OH', --
# 'fr_Ar_OH', --
# 'fr_methoxy', --
# 'fr_oxime', --
# 'fr_ester', --
# 'fr_Al_COO', --
# 'fr_Ar_COO',-- 
# 'fr_COO', --
# 'fr_COO2', --
# 'fr_ketone', --
# 'fr_ether', --
# 'fr_phenol', --
# 'fr_aldehyde',-- 
# 'fr_quatN', --
# 'fr_NH2', --
# 'fr_NH1', --
# 'fr_NH0', --
# 'fr_Ar_N', --
# 'fr_Ar_NH', --
# 'fr_aniline', --
# 'fr_Imine', --
# 'fr_nitrile', --
# 'fr_hdrzine', --
# 'fr_hdrzone', --
# 'fr_nitroso', --
# 'fr_N_O', --
# 'fr_nitro', --
# 'fr_azo', --
# 'fr_diazo', --
# 'fr_azide', --
# 'fr_amide', --
# 'fr_priamide',-- 
# 'fr_amidine', --
# 'fr_guanido', --
# 'fr_Nhpyrrole', --
# 'fr_imide', --
# 'fr_isocyan', --
# 'fr_isothiocyan',-- 
# 'fr_thiocyan',-- 
# 'fr_halogen', --
# 'fr_alkyl_halide',-- 
# 'fr_sulfide',-- 
# 'fr_SH', --
# 'fr_C_S', --
# 'fr_sulfone', --
# 'fr_sulfonamd', --
# 'fr_prisulfonamd',-- 
# 'fr_barbitur', --
# 'fr_urea', --
# 'fr_term_acetylene', -- 
# 'fr_imidazole',-- 
# 'fr_furan', --
# 'fr_thiophene', --
# 'fr_thiazole', --
# 'fr_oxazole', --
# 'fr_pyridine', --
# 'fr_piperdine', --
# 'fr_piperzine', --
# 'fr_morpholine', --
# 'fr_lactam', --
# 'fr_lactone', --
# 'fr_tetrazole', --
# 'fr_epoxide', --
# 'fr_unbrch_alkane',-- 
# 'fr_bicyclic', --
# 'fr_benzene', --
# 'fr_phos_acid', --
# 'fr_phos_ester', --
# 'fr_nitro_arom', --
# 'fr_nitro_arom_nonortho', --
# 'fr_dihydropyridine', --
# 'fr_phenol_noOrthoHbond', --
# 'fr_Al_OH_noTert', --
# 'fr_benzodiazepine', --
# 'fr_para_hydroxylation',-- 
# 'fr_allylic_oxid', --
# 'fr_aryl_methyl', --
# 'fr_Ndealkylation1',-- 
# 'fr_Ndealkylation2', --
# 'fr_alkyl_carbamate', --
# 'fr_ketone_Topliss', --
# 'fr_ArN', --
# 'fr_HOCCN',--

display(df_condition)
df_clogp = df[df.clogp.eq(0.26)]
display(df_clogp)

In [None]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import rdChemReactions as Reactions
from rdkit.Chem.Draw import IPythonConsole
from PIL import Image
 
Draw.DrawingOptions.atomLabelFontSize = 30
print(Draw.DrawingOptions.atomLabelFontSize)

mol = Chem.MolFromSmiles('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
# rxn = Reactions.ReactionFromSmarts('CC(=O)C>>CC(O)C', useSmiles=True)
 
img = Draw.MolToImage(mol)  # MolToImage returns a PIL image
# Draw.ReactionToImage(rxn)
# drawer = rdMolDraw2D.MolDraw2DCairo(800, 200)
# drawer.DrawMolToImage(mol)
# drawer.DrawReaction(rxn)
# drawer.FinishDrawing()
# A PIL image is written with save(); this assumes the ./OrganicLED directory already exists
img.save('./OrganicLED/ref_mol_MolDraw2DSVG.png')
im = Image.open('./OrganicLED/ref_mol_MolDraw2DSVG.png')
im

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from IPython.display import SVG
from rdkit.Chem import Draw

import cairosvg
import tempfile

smiles = ('c1ccc(B2c3ccncc3C3(c4ccccc4Oc4cccnc43)c3cnccc32)cc1')
m = Chem.MolFromSmiles(smiles)

def moltosvg(mol, molSize = (300,300), kekulize = True):
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
        try:
            Chem.Kekulize(mc)
        except Exception:  # kekulization failed; fall back to a fresh copy of the molecule
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
        rdDepictor.Compute2DCoords(mc)
    drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    return svg.replace('svg:','')

SVG(moltosvg(m))




In [None]:
#Generate a Morgan fingerprint and save information about the bits that are set using the bitInfo argument:
bi = {}
fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(smiles_fig, radius=2, bitInfo=bi)
# show 10 of the set bits:
list(fp.GetOnBits())[:10]

In [None]:
#In its simplest form, the new code lets you display the atomic environment that sets a particular bit. Here we will look at bit 674:
# Draw.DrawMorganBit(smiles_fig,656,bi)
list_bits = []
legends = []
for x in fp.GetOnBits():
    for i in range(len(bi[x])):
        list_bits.append((mol,x,bi,i))
        legends.append(str(x))
Draw.DrawMorganBits(list_bits, molsPerRow=4,legends=legends)  

In [None]:
#DrawMorganBits(), for drawing multiple bits at once (thanks to Pat Walters for suggesting this one):
tpls = [(smiles_fig,x,bi) for x in fp.GetOnBits()]
Draw.DrawMorganBits(tpls[:12],molsPerRow=4,legends=[str(x) for x in fp.GetOnBits()][:12])

In [None]:
from ipywidgets import interact,fixed,IntSlider
def renderFpBit(mol,bitIdx,bitInfo,fn):
    bid = bitIdx
    return(display(fn(mol,bid,bitInfo)))

In [None]:
interact(renderFpBit, bitIdx=list(bi.keys()),mol=fixed(smiles_fig),
         bitInfo=fixed(bi),fn=fixed(Draw.DrawMorganBit));

In the next cell we will plot a histogram of the SMILES string lengths in the dataset. These lengths will be used to determine the length of the inputs for our CNN and VAE models. Below are some examples of SMILES representations:
1. Methane: 'C'
2. Pentane: 'CCCCC'
3. Methanol and Ethanol: 'CO' and 'CCO'
4. Pyridine: 'C1:C:C:N:C:C:1'

To learn more about the SMILES representation, click [here](https://chem.libretexts.org/Courses/University_of_Arkansas_Little_Rock/ChemInformatics_(2017)%3A_Chem_4399%2F%2F5399/2.3%3A_Chemical_Representations_on_Computer%3A_Part_III).

In [None]:
smiles_lengths = map(len, dataset.smiles.values)
#sns.distplot(list(smiles_lengths), bins=20, kde=False)
plt.rcParams.update({'font.size': 20})
plt.figure(figsize=(10,10))
plt.title('SMILES string lengths Histogram')
plt.ylabel('Density')
plt.xlabel('SMILES string lengths')
ax = sns.distplot(list(smiles_lengths), color="b", bins=20, rug=True, rug_kws={"color": "k"}, kde=True, kde_kws={"color": "r", "label": "Gaussian Kernel Density Estimate (KDE)"}, hist_kws={"histtype": "bar", "linewidth": 3, "alpha": 1, "color": "b"} )
plt.savefig('./gap_smiles_lengths.png', dpi=600, facecolor='w', edgecolor='w', orientation='portrait', transparent=False, pad_inches=0.1)  # papertype and frameon are not accepted by recent Matplotlib releases
#ax=plt.savefig('gdrive/MyDrive/Colab Notebooks/data/fig_smiles_lengths.png', dpi=600, facecolor='w', edgecolor='w',orientation='portrait', papertype=None, format=None,transparent=False, bbox_inches=None, pad_inches=0.1,frameon=None, metadata=None)
# ax=plt.savefig('../data/fig_smiles_lengths.png', dpi=600, facecolor='w', edgecolor='w',orientation='portrait', papertype=None, format=None,transparent=False, bbox_inches=None, pad_inches=0.1,frameon=None, metadata=None)


In [None]:
dataset.head()

In [None]:
dataset = dataset.reset_index()
dataset = dataset.drop(['index'], axis = 1)
dataset.head()

In [None]:
x_df = dataset.drop(columns = 'smiles')
print(x_df.shape)
x_df.head()

In [None]:
# from https://proxy.nanohub.org/weber/2004336/GBdSjVSdDDS3NYpl/4/notebooks/LLZO_MachineLearning.ipynb
# This code is to drop columns with std = 0. 
#x_df = pd.DataFrame(X)
#All columns that have a standard deviation of zero are dropped, as they don't contribute new information to the models.
x_df = x_df.loc[:, x_df.std() != 0]
print(x_df.shape) # This shape is (#Entries, #Descriptors per entry)
x_df.head()

In [None]:
x_df.to_csv('./x_df_SMILES_RDKit_2D.csv') 

In [None]:
plt.figure(figsize=(10,10))
plt.rcParams.update({'font.size': 20})
smiles_lengths = map(len, dataset.smiles.values)
#sns.distplot(list(smiles_lengths), bins=20, kde=False)
plt.title('Topological polar surface area (TPSA) Distribution Histogram')
plt.ylabel('Density')
plt.xlabel('Topological polar surface area (TPSA)')

# sns.displot(list(smiles_lengths), bins=20, kde=False)
#ax = sns.distplot(dataset["lumo"], rug=True, rug_kws={"color": "g"}, kde_kws={"color": "k", "lw": 3, "label": "KDE"}, hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "g"})
ax = sns.distplot(dataset["tpsa"], rug=True, rug_kws={"color": "g"},kde_kws={"color": "k", "lw": 3, "label": "KDE"},hist_kws={"histtype":"step", "linewidth": 3,"alpha": 1, "color": "r"})
plt.savefig('./tpsa.png', dpi=600, facecolor='w', edgecolor='w', orientation='landscape', transparent=False)  # papertype and frameon are not accepted by recent Matplotlib releases


### Data preparation

Now we will pre-process the dataset for the CNN and VAE models. First, we'll get the unique character set from all SMILES strings in the dataset. Then we will use the unique character set to convert our SMILES strings to a one-hot representation, which is a representation that converts raw strings of text to numerical inputs for our models.

In a one-hot representation, each character of our SMILES string is encoded as a vector of zeros, except for one non-zero value. For instance, the character 'C' in the SMILES string is converted to a vector of length 31, consisting of 30 zeros and one non-zero entry of one. The length of this vector (31 in our case) is the total number of unique characters in the dataset.

Given a string of 5 characters (say pentane, which is represented as 'CCCCC'), we would thus get 5 vectors, each of length 31. Since different molecules have different SMILES string lengths, we fix the length of every string to the maximum length found in the dataset and pad shorter strings with an extra padding character, which has its own pre-defined one-hot vector. With a maximum length of 40, each molecule is represented as a set of 40 vectors, each of length 31, which we can write as a 40x31 matrix. Note that the code cell below sets `max_smiles_chars = 70`, so the matrices built there have 70 rows and `charset_length` columns.

One-hot encoding is commonly used in natural language processing, and you can learn more about one-hot encoding [here](https://en.wikipedia.org/wiki/One-hot). 
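
The encoding described above can be sketched in plain Python. This is only an illustration: `build_charset`, `one_hot_encode` and the space padding character are hypothetical choices made for this example, and the notebook's own `generate_charset` and `smiles_to_onehots` helpers may differ in detail.

```python
# Hypothetical helpers for illustration; not the notebook's own implementation.
PAD = " "  # assumed padding character


def build_charset(smiles_list, pad=PAD):
    """Return a sorted list of the unique characters, including the padding character."""
    chars = set(pad)
    for smiles in smiles_list:
        chars.update(smiles)
    return sorted(chars)


def one_hot_encode(smiles, charset, max_len, pad=PAD):
    """Encode one SMILES string as a max_len x len(charset) matrix of zeros and ones."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    index = {ch: i for i, ch in enumerate(charset)}
    matrix = []
    for ch in smiles.ljust(max_len, pad):  # pad short strings on the right
        row = [0] * len(charset)
        row[index[ch]] = 1  # exactly one non-zero entry per character
        matrix.append(row)
    return matrix


demo_charset = build_charset(["C", "CCCCC", "CO", "CCO"])
demo_encoded = one_hot_encode("CCO", demo_charset, max_len=5)
for row in demo_encoded:
    print(row)
```

With this toy character set (padding, 'C', 'O'), ethanol becomes a 5x3 matrix: two rows for 'C', one for 'O', and two padding rows.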

Finally, we will define our input and output and create test/train splits in the dataset.
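
The cell below holds out the last 100 rows of the dataset as the test set. If the rows happen to be ordered in some systematic way, a shuffled split may give a more representative test set. The following is a minimal sketch of that alternative, not the method used in this notebook; `shuffled_split` is a hypothetical helper and the target values are made up.

```python
import random


def shuffled_split(inputs, targets, n_test, seed=0):
    """Shuffle the row indices with a fixed seed, then hold out n_test examples."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    if not 0 < n_test < len(inputs):
        raise ValueError("n_test must be between 1 and len(inputs) - 1")
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(inputs, train_idx), pick(inputs, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))


demo_x = ["C", "CC", "CCC", "CO", "CCO", "CCCO"]
demo_y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]  # made-up target values
x_tr, x_te, y_tr, y_te = shuffled_split(demo_x, demo_y, n_test=2)
print(len(x_tr), len(x_te))
```

Inputs and targets are indexed with the same shuffled indices, so each molecule stays paired with its own target value.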

In [None]:
charset = generate_charset(
    dataset["smiles"].values.ravel()
)
# get the number of unique characters
charset_length = len(charset)
# define the maximum number of SMILES characters for the model input
max_smiles_chars = 70
# dimension of input vector
input_dim = charset_length * max_smiles_chars
# get one-hot representation of the SMILES strings 
one_hots = smiles_to_onehots(dataset["smiles"].values, charset, max_smiles_chars)
# split input into train and test sets
X_train = one_hots[:-100] # All entries except the last 100 form the Training Set
X_test = one_hots[-100:] # The last 100 entries form the Testing Set

# split output to train and test sets
output = dataset["tpsa"].values
#output = dataset["homo"].values
#output = dataset["cv"].values
#output = dataset["r2"].values

# "alpha" - Isotropic polarizability (unit: Bohr^3)
# "gap" - Gap between HOMO and LUMO (unit: Hartree)
#"mol_id" - Molecule ID (gdb9 index) mapping to the .sdf file
#"A" - Rotational constant (unit: GHz)
#"B" - Rotational constant (unit: GHz)
#"C" - Rotational constant (unit: GHz)
#"mu" - Dipole moment (unit: D)
#"alpha" - Isotropic polarizability (unit: Bohr^3)
#"homo" - Highest occupied molecular orbital energy (unit: Hartree)
#"lumo" - Lowest unoccupied molecular orbital energy (unit: Hartree)
#"gap" - Gap between HOMO and LUMO (unit: Hartree)
#"r2" - Electronic spatial extent (unit: Bohr^2)
#"zpve" - Zero point vibrational energy (unit: Hartree)
#"u0" - Internal energy at 0K (unit: Hartree)
#"u298" - Internal energy at 298.15K (unit: Hartree)
#"h298" - Enthalpy at 298.15K (unit: Hartree)
#"g298" - Free energy at 298.15K (unit: Hartree)
#"cv" - Heat capacity at 298.15K (unit: cal/(mol*K))
#"u0_atom" - Atomization energy at 0K (unit: kcal/mol)
#"u298_atom" - Atomization energy at 298.15K (unit: kcal/mol)
#"h298_atom" - Atomization enthalpy at 298.15K (unit: kcal/mol)
Y_train = output[:-100] #This takes the first 133885-100=133785 entries to be the Training Set
Y_test = output[-100:] # This takes the last 100 entries to be the Testing Set

# This Reshape function in the next two lines, turns each of the horizontal lists [ x, y, z] into a
# vertical NumPy array [[x]
#                       [y]
#                       [z]]
# This Step is required to work with the Sklearn Linear Model
#Y_train = np.array(melt_train).reshape(-1,1) 
#Y_test  = np.array(melt_test).reshape(-1,1)
print(len(X_train),len(X_test),len(Y_train),len(Y_test))
# print(X_train[0]) # print a sample entry from the training set
# print(X_test[0]) # print a sample entry from the test set
# print(order)



Let's briefly visualize what our input data looks like using heatmaps that show the position of each character in the SMILES string. The cell below plots the first 16 training examples; you can change the index to see other molecules. Each molecule is represented by a sparse matrix with one row per character position and one column per character in the character set (40x31 in the description above). The bright spots in the heatmap mark the positions at which a one is found in the matrix. For instance, a bright spot at index 18 in the first row indicates that the first character is 'C', and a bright spot at index 23 in the second row indicates that the second character is 'O'. For the compound dimethoxymethane, with the SMILES string 'COCOC', we expect the matrix to have alternating bright spots at index 18 and index 23 for the first five rows. Beyond that, the rows all have a bright spot at index 1, which stands for the extra characters padded onto the string to make all SMILES strings the same length. The heatmaps below are plotted using the [Seaborn](https://seaborn.pydata.org/) library.
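
Reading a heatmap row by row amounts to taking the column of the bright spot in each row and looking up the matching character. A minimal sketch of that decoding step follows; the three-character set and `decode_one_hot` are assumptions made for illustration, so the column indices differ from the 18 and 23 quoted above.

```python
def decode_one_hot(matrix, charset, pad=" "):
    """Recover a string by taking the column of the largest value in each row."""
    chars = []
    for row in matrix:
        col = max(range(len(row)), key=row.__getitem__)  # column of the "bright spot"
        chars.append(charset[col])
    return "".join(chars).rstrip(pad)  # drop the trailing padding characters


toy_charset = [" ", "C", "O"]  # assumed toy character set: padding, carbon, oxygen
toy_matrix = [
    [0, 1, 0],  # C
    [0, 0, 1],  # O
    [0, 1, 0],  # C
    [0, 0, 1],  # O
    [0, 1, 0],  # C
    [1, 0, 0],  # padding
    [1, 0, 0],  # padding
]
print(decode_one_hot(toy_matrix, toy_charset))  # prints COCOC
```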

In [None]:
num_rows = 4
num_cols = 4
num_images = num_rows*num_cols
plt.figure(figsize=(6*num_cols, 6*num_rows))
import matplotlib
matplotlib.rcParams.update(matplotlib.rcParamsDefault)
for i in range(num_images):
    plt.subplot(num_rows, num_cols, i+1)
    #plot_image(i, predictions, testLabels, testImages)
    #plt.figure(figsize=(30,30))
    #for i in range(25): #133785 
    #plt.subplot(5,5,i+1)
    plt.xticks([],fontsize=8)
    plt.yticks([],fontsize=8)
    plt.grid(True)
    #plt.xlabel(X_test(dataset.iloc[i])
    plt.xlabel('Character', fontsize=16)
    #plt.ylabel(X_test(dataset.iloc[i])
    plt.ylabel('Position in SMILES String', fontsize=16)
    #X_test[i] = X_test[i]("Position in SMILES String", "Character")
    plt.title(f"SMILES: {dataset.iloc[i]['smiles']}", fontsize=16)
    #plt.plot(range(num_images), label=f"SMILES: {dataset.iloc[i]['smiles']}")
    #plt.legend()
    #sns.heatmap(X_test[i])
    sns.heatmap(X_train[i])

    #plt.imshow(X_train[i], cmap=plt.cm.binary)
    #plt.xlabel(class_names[int(trainLabels[i])])
    #print(dataset.iloc[i]['smiles'])

    
#plt.imshow(X_train[index]) # By altering 'index' you will see another of the pictures imported
#plt.colorbar()
#plt.grid(False)
#print("Train Images Array shape:", trainImages.shape)
#print("Train Labels Array shape:", trainLabels.shape)
#print("Test Images Array shape:", testImages.shape)
#print("Test Labels Array shape:", testLabels.shape)

#index = 6986 #index runs from 0 to 138388
#sns.heatmap(X_train[index]) # This is a single training example -- note that it is a matrix, not a single vector!
#plt.xlabel('Character')
#plt.ylabel('Position in SMILES String')
#print(dataset.iloc[index]['smiles'])
#ax=plt.savefig('gdrive/MyDrive/Colab Notebooks/data/fig_smiles_character.png', dpi=600, facecolor='w', edgecolor='w',orientation='portrait', papertype=None, format=None,transparent=False, bbox_inches=None, pad_inches=0.1,frameon=None, metadata=None)
#ax=plt.savefig('./homo_fig_smiles_character.png', dpi=600, facecolor='w', edgecolor='w',orientation='portrait', papertype=None, format=None,transparent=False, bbox_inches=None, pad_inches=0.1,frameon=None, metadata=None)

#ax = sns.distplot(dataset["r2"], rug=True, rug_kws={"color": "g"},kde_kws={"color": "k", "lw": 3, "label": "KDE"},hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "r"})
#ax=plt.savefig('./homo_X_test.png', dpi=600, facecolor='w', edgecolor='w',orientation='landscape', papertype='a4', format=None, transparent=False, bbox_inches=None, pad_inches=None, frameon=None, metadata=None, annot=True, fmt="d")
plt.savefig('./tpsa_X_train.png', dpi=600, facecolor='w', edgecolor='w', orientation='landscape', transparent=False)  # annot and fmt are sns.heatmap options, not savefig options


# <ins>Supervised CNN model for predicting solubility</ins>

In this section, we will set up a convolutional neural network to predict solubility using one-hot SMILES as input. A convolutional neural network is a machine learning model that is commonly used to classify images, and you can learn more about them [here](https://en.wikipedia.org/wiki/Convolutional_neural_network).

### Define model structure

First, we will create the model structure, starting with the input layer. As described above, each training example is a matrix with `max_smiles_chars` rows and `charset_length` columns (40x31 in the description above), which is the shape we pass to the Input layer in Keras.

In [None]:
# Define the input layer
# NOTE: We feed in a sequence here! We're inputting up to max_smiles_chars characters, 
# and each character is an array of length charset_length


smiles_input = Input(shape=(max_smiles_chars, charset_length), name="SMILES-Input")

Next we will define the convolution layers. In image models, each convolutional layer attempts to learn certain features of the images, such as edges and corners; here, each layer learns local patterns of neighbouring characters in the SMILES string. The input to each layer (a matrix) is transformed via convolution operations: a small filter matrix slides along the input, and at each position the overlapping entries are multiplied element by element and summed to give one output value. The convolutional layer learns the filter matrices that best identify features of the input. You can learn more about convolution operations and the math behind convolutional neural networks [here](https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9).
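
To make the sliding-filter idea concrete, here is a minimal pure-Python sketch of a one-dimensional convolution with a single filter, no padding, a stride of one and a ReLU activation. The function name `conv1d_valid`, the toy input and the hand-picked filter weights are all assumptions made for illustration; a real layer learns its weights during training and applies many filters at once.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide one filter along a sequence of feature vectors (no padding, stride 1)."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        total = bias
        for offset in range(k):
            # multiply the overlapping entries element by element and sum them
            for x, w in zip(sequence[start + offset], kernel[offset]):
                total += x * w
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


# Toy input: "CCOC" one-hot encoded over the two-character set ["C", "O"].
toy_sequence = [[1, 0], [1, 0], [0, 1], [1, 0]]
# Hand-picked filter that responds to a 'C' followed by an 'O'.
toy_kernel = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(conv1d_valid(toy_sequence, toy_kernel))  # prints [1.0, 2.0]
```

The second output is larger because the window starting at the second character contains the 'C', 'O' pattern that the filter matches. A sequence of length 4 and a filter of width 3 give 4 - 3 + 1 = 2 outputs.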

In [None]:
# Set parameters for convolutional layers 
num_conv_filters = 16
kernel_size = 3
#kernel_init = initializers.RandomNormal(seed=0)
#bias_init = initializers.Zeros()
init_weights = initializers.glorot_normal(seed=0)

# Define the convolutional layers
# Multiple convolutions in a row is a common architecture (but there are many "right" choices here)
conv_1_func = Conv1D(
    filters=num_conv_filters, # What is the "depth" of the convolution? How many times do you look at the same spot?
    kernel_size=kernel_size, # How "wide" of a spot does each filter look at?
    name="Convolution-1",
    activation="relu", # This is a common activation function: Rectified Linear Unit (ReLU)
    kernel_initializer=init_weights #This defines the initial values for the weights
)
conv_2_func = Conv1D(
    filters=num_conv_filters, 
    kernel_size=kernel_size, 
    name="Convolution-2",
    activation="relu",
    kernel_initializer=init_weights
)
conv_3_func = Conv1D(
    filters=num_conv_filters, 
    kernel_size=kernel_size, 
    name="Convolution-3",
    activation="relu",
    kernel_initializer=init_weights
)
conv_4_func = Conv1D(
    filters=num_conv_filters, 
    kernel_size=kernel_size,
    name="Convolution-4",
    activation="relu",
    kernel_initializer=init_weights
)

# strides and padding can also be set in the convolutional layers, for example:
# strides=2, padding="same"

The four convolution layers defined above will attempt to learn features of the SMILES string (represented as a 40x31 matrix) that are relevant to predicting the solubility. To get a numerical prediction, we now flatten the output of the convolution and pass it to a set of regular `Dense` layers, the last layer predicting one value for the solubility.
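
It can help to work out how the shape of the data changes on its way to the `Flatten` layer. The sketch below assumes the settings used in this notebook's code (input length 70, kernel size 3, 16 filters, four layers) together with no padding and a stride of one, which are the Keras `Conv1D` defaults. The helper names are hypothetical.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a convolution without padding."""
    return (length - kernel_size) // stride + 1


def cnn_shapes(seq_len, n_layers, kernel_size, n_filters):
    """Return the (length, channels) shape after each layer and the flattened size."""
    shapes = []
    length = seq_len
    for _ in range(n_layers):
        length = conv1d_output_length(length, kernel_size)
        shapes.append((length, n_filters))
    return shapes, length * n_filters


shapes, flattened = cnn_shapes(seq_len=70, n_layers=4, kernel_size=3, n_filters=16)
print(shapes)     # prints [(68, 16), (66, 16), (64, 16), (62, 16)]
print(flattened)  # prints 992
```

Each layer shortens the sequence by two positions, so the fully connected layer receives 62 x 16 = 992 values per molecule. With the 40-character length used in the description, the same arithmetic gives 32 x 16 = 512.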

In [None]:
# Define layer to flatten convolutions
flatten_func = Flatten(name="Flattened-Convolutions")

# Define the fully connected hidden layer (with a ReLU activation)
hidden_size = 32
dense_1_func = Dense(hidden_size, activation="relu", name="Fully-Connected", kernel_initializer=init_weights)

# Add a Dense layer with a L1 activity regularizer
#dense_1_func = Dense(hidden_size, activation="relu", name="Fully-Connected", activity_regularizer=regularizers.l1(10e-5), kernel_initializer=init_weights)

# Define output layer -- it's only one dimension since it is regression
output_size = 1
output_mobility_func = Dense(output_size, activation="linear", name="Log-lumo", kernel_initializer=init_weights)




Now that we have defined all the layers, we will connect them together to make a graph:

In [None]:
# connect the CNN graph together
conv_1_fwd = conv_1_func(smiles_input)
conv_2_fwd = conv_2_func(conv_1_fwd)
conv_3_fwd = conv_3_func(conv_2_fwd)
conv_4_fwd = conv_4_func(conv_3_fwd)
flattened_convs = flatten_func(conv_4_fwd)
dense_1_fwd = dense_1_func(flattened_convs)
output_mobility_fwd = output_mobility_func(dense_1_fwd)  # feed the hidden Dense layer (not the flattened convolutions) into the output layer

### View model structure and metadata

Now the model is ready to train! But first we will define the model as `mobility_model` and compile it, then view some information on the model using the Keras `summary()` method and the [keras2ascii](https://github.com/stared/keras-sequential-ascii) tool, which visually represents the layers in our model.

In [None]:
# create model
mobility_model = Model(
            inputs=[smiles_input],
            outputs=[output_mobility_fwd]
)
mae_st = []
# compile model
#optimizer = optimizers.RMSprop(0.002) # Root Mean Squared Propagation
# This line matches the optimizer to the model and states which metrics will evaluate the model's accuracy

# loss= mse, mae
# loss= categorical_crossentropy
#loss='sparse_categorical_crossentropy'
#loss='binary_crossentropy'
#metrics=['accuracy', 'binary_crossentropy']
#metrics=['accuracy']
mobility_model.compile(
    optimizer="adam",
    loss="mse",
    metrics=["mae"]
)
mobility_model.summary()

In [None]:
#!pip install keras_sequential_ascii
from keras_sequential_ascii import keras2ascii
# view model as a graph
keras2ascii(mobility_model)

### Train CNN

Now we will train our CNN model on the training data! During training, Keras prints metrics after each epoch: the loss (Mean Squared Error, MSE) and the Mean Absolute Error (MAE), each reported for both the training data and the validation data (here, the test set).
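
For reference, the two error measures can be written out in a few lines of plain Python. The numbers below are made-up values used only to show the arithmetic.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences; large errors are penalized more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


toy_true = [10.0, 20.0, 30.0, 40.0]  # made-up ground truth values
toy_pred = [12.0, 18.0, 30.0, 44.0]  # made-up predictions
print(mean_squared_error(toy_true, toy_pred))   # prints 6.0
print(mean_absolute_error(toy_true, toy_pred))  # prints 2.0
```

The errors here are 2, 2, 0 and 4, so the MSE is (4 + 4 + 0 + 16) / 4 = 6 and the MAE is (2 + 2 + 0 + 4) / 4 = 2.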

In [None]:
#logdir="mobility_logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
#tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)
mae_st = []
history = mobility_model.fit(
    X_train, # Inputs
    Y_train, # Outputs
    epochs=20, # How many times to pass over the data
    batch_size=64, # How many data rows to compute at once
    verbose=1,
    validation_data=(X_test, Y_test),
    #callbacks=[tensorboard_callback] # You would usually use more splits of the data if you plan to tune hyperparams
)
#print('mse')
#print('mae')
mobility_model.save(os.path.expanduser('./tpsa_cnn_model.h5'))

Let's view the learning curve for the trained model.

This code will generate a plot where we show the test and train errors (MSE) as a function of epoch (one pass of all training examples through the NN).

The learning curve will tell us if the model is overfitting or underfitting.
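
As a rough guide, a training loss that keeps falling while the validation loss starts to rise suggests overfitting, whereas two curves that both stay high suggest underfitting. The sketch below summarizes a pair of loss curves along those lines. `summarize_learning_curve` is a hypothetical helper and the loss values are made up; with a trained Keras model, the lists would come from `history.history['loss']` and `history.history['val_loss']`.

```python
def summarize_learning_curve(train_loss, val_loss):
    """Report the best validation epoch and the final train/validation gap."""
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best + 1,                         # epochs are counted from 1
        "best_val_loss": val_loss[best],
        "final_gap": val_loss[-1] - train_loss[-1],     # large positive gap hints at overfitting
        "val_rising_after_best": val_loss[-1] > val_loss[best],
    }


toy_train = [9.0, 5.0, 3.0, 2.0, 1.5, 1.0]    # made-up training losses
toy_val = [9.5, 6.0, 4.0, 3.5, 3.75, 4.25]    # made-up validation losses
summary = summarize_learning_curve(toy_train, toy_val)
print(summary)
```

In this toy example the validation loss is lowest at epoch 4 and rises afterwards while the training loss keeps falling, which is the typical signature of overfitting.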

In [None]:
# plot the learning curve 
plt.rcParams.update({'font.size': 18})
plt.figure(figsize=(10,10))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('CNN Model loss tpsa', fontname='Arial Narrow', size=18) #pad=12
plt.ylabel('Error',fontname='Arial Narrow', size=18)
plt.xlabel('Epoch',fontname='Arial Narrow', size=18)
#plt.xlim(0,20)
#plt.ylim(0,20)
plt.legend(['Train', 'Validation',], loc='upper right')
# te = '%.2f' % mean_absolute_error
# tr = '%.2f' % mse_X_test
# plt.text(4.3, 0.8, 'Test_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.8, te, c='r', fontsize=16)
# plt.text(8.5, 0.8, 'eV', c='r', fontsize=16)
# plt.text(4.2, 0.1, 'Train_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.1, tr, c='r', fontsize=16)
# plt.text(8.5, 0.1, 'eV', c='r', fontsize=16)
plt.savefig('./tpsa_cnn_X_training_loss.png', dpi=600, facecolor='w', edgecolor='w')  # scale, width and height are not savefig options
plt.show()


# plot the learning curve 
plt.rcParams.update({'font.size': 18})
plt.figure(figsize=(10,10))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')
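# Depending on the Keras version, this metric is stored as 'mae' or 'mean_absolute_error'
mae_key = 'mae' if 'mae' in history.history else 'mean_absolute_error'
plt.plot(history.history[mae_key])
plt.plot(history.history['val_' + mae_key])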
plt.title('CNN Model MAE tpsa', fontname='Arial Narrow', size=18) #pad=12
plt.ylabel('Mean Absolute Error',fontname='Arial Narrow', size=18)
plt.xlabel('Epoch',fontname='Arial Narrow', size=18)
#plt.xlim(0,20)
#plt.ylim(0,2)
plt.legend(['Train', 'Validation',], loc='upper right')
# te = '%.2f' % mean_absolute_error
# tr = '%.2f' % mse_X_test
# plt.text(4.3, 0.8, 'Test_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.8, te, c='r', fontsize=16)
# plt.text(8.5, 0.8, 'eV', c='r', fontsize=16)
# plt.text(4.2, 0.1, 'Train_rmse = ', c='r', fontsize=16)
# plt.text(7.4, 0.1, tr, c='r', fontsize=16)
# plt.text(8.5, 0.1, 'eV', c='r', fontsize=16)
plt.savefig('./tpsa_cnn_X_training_mae.png', dpi=600, facecolor='w', edgecolor='w')  # scale, width and height are not savefig options
plt.show()

### Use CNN to make solubility predictions
Now that we've trained our model, we can use it to make predictions (here of TPSA, the target selected above) for any SMILES string! We just have to convert the SMILES string to a one-hot representation, then feed it to `mobility_model`.

In [None]:
example_smiles = ['CC(C)CCCCO(C)N','CCC(C)CCC(C)OC','CC=CC1CCC1=O','CCOC()CCC','CC1(CC1OC)C#C'  ]
# Note: the model only sees characters, so it returns a number even for strings that are not valid SMILES (e.g. 'CCOC()CCC')
#'Cc1cc(c1CCO)C#N','CCCCCCCCCC#C', 'CCC(C)CCC(C)C#C' ,'OCCCCC', 'CCC(C)(=O)C#C#N' , 'CCCCCCC#CCC' 'CC(=O)C=C(N)F', 'CCC'                  
for smiles in example_smiles:
    predict_test_input = smiles_to_onehots([smiles], charset, max_smiles_chars)
    mobility_prediction = mobility_model.predict(predict_test_input)[0][0]
    print(f'The predicted tpsa for SMILES {smiles} is {mobility_prediction}')

We can now make a parity plot comparing the CNN model predictions on the training set to the ground truth data.
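
A parity plot can be complemented by a single summary number. One common choice is the coefficient of determination (R^2), which equals 1 when every point lies on the y=x line. The sketch below is an optional addition rather than part of the original workflow, and the values are made up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


parity_true = [10.0, 20.0, 30.0, 40.0]  # made-up ground truth values
parity_pred = [12.0, 18.0, 30.0, 44.0]  # made-up predictions
print(round(r_squared(parity_true, parity_pred), 3))  # prints 0.952
```

With the notebook's arrays, the same function could be called as `r_squared(Y_train.flatten(), preds.flatten())`.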

In [None]:
preds = mobility_model.predict(X_train)
x_y_line = np.linspace(min(Y_train.flatten()), max(Y_train.flatten()), 500)
plt.figure(figsize=(8,8))
#plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
#plt.rc('font', family='Arial narrow')

plt.plot(Y_train.flatten(), preds.flatten(), 'o', label='predictions')
plt.plot(x_y_line, x_y_line, label='y=x')
plt.xlabel("tpsa (ground truth)", fontname='Arial Narrow', size=16)
plt.ylabel("tpsa (predicted)", fontname='Arial Narrow', size=16)
plt.title('Parity plot: predictions vs ground truth data', fontsize=16, pad=12)
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)
#a  = [-175,0,125]
#b = [-175,0,125]
#plt.plot(b, a, c='k', ls='-')
#plt.legend(loc='upper left',ncol=1, frameon=True, prop={'family':'Arial narrow','size':16})
plt.savefig('./tpsa_cnn_X_predict.png', dpi=600, facecolor='w', edgecolor='w')  # scale is not a savefig option

### Save model
We can save/load this model for future use, using the `save()` and `load_model()` functions from Keras.

In [None]:
# Save the model
mobility_model.save("tpsa_model.hdf5")

# Load it back
loaded_model = load_model("tpsa_model.hdf5")

# <ins>VAE model for generating SMILES strings</ins>
In this section, we will set up a variational autoencoder to encode and decode SMILES strings. An autoencoder is a model that encodes its input into a set of variables (known as encoded or 'latent' variables), which are then decoded to recover the original input. A variational autoencoder is an extension of the autoencoder in which each latent variable is learnt as a probability distribution (here a Gaussian described by a mean and a log-variance) rather than as a single fixed value. You can learn more about autoencoders and variational autoencoders [here](https://www.jeremyjordan.me/variational-autoencoders/) and [here](https://www.jeremyjordan.me/autoencoders/).

### Define model structure

We'll need to define some new layers for this model, but we can also reuse old ones! (You will see this when we connect the model together.)

In [None]:
# hidden activation layer
hidden_size = 16
dense_1_func = Dense(hidden_size, activation="relu", name="Fully-Connected-Latent", kernel_initializer=init_weights)

Now we'll define the layers to map to the latent space. We then define a sampling function that samples from a gaussian distribution to return the sampled latent variables.
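
The sampling step can be sketched without Keras. Each latent variable is drawn as the mean plus the standard deviation times a standard Gaussian noise term, where the standard deviation is recovered from the log-variance. The function name `sample_latent` and the toy numbers below are assumptions made for illustration.

```python
import math
import random


def sample_latent(z_mean, z_log_var, rng):
    """Draw z = mu + sigma * epsilon, with sigma = exp(0.5 * log_var) and epsilon ~ N(0, 1)."""
    return [
        mu + math.exp(0.5 * log_var) * rng.gauss(0.0, 1.0)
        for mu, log_var in zip(z_mean, z_log_var)
    ]


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
toy_mean = [1.0, -2.0]
toy_log_var = [0.0, math.log(0.25)]    # variances 1.0 and 0.25, so sigma is 1.0 and 0.5

draws = [sample_latent(toy_mean, toy_log_var, rng) for _ in range(20000)]
for dim in range(2):
    values = [d[dim] for d in draws]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    print(f"dimension {dim}: mean ~ {mean:.2f}, std ~ {std:.2f}")
```

Over many draws, the sample mean and standard deviation of each dimension approach the values encoded by `toy_mean` and `toy_log_var`. Writing the sample this way keeps the random part separate from the mean and log-variance, which is what lets the network learn them by gradient descent.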

In [None]:
# VAE sampling 
# K.shape= Keras.shape
def sampling(args):
    z_mean, z_log_var = args
    batch = K.shape(z_mean)[0]
    dim = K.int_shape(z_mean)[1]
    epsilon = K.random_normal((batch, dim), mean=0.0, stddev=1.0)
    return z_mean + K.exp(0.5 * z_log_var) * epsilon # mu + sigma*epsilon yields a shifted, rescaled gaussian, 
                                                     # if epsilon is the standard gaussian
# Latent space: map the last hidden layer (hidden_size = 16) to the latent variables (latent_dim = 32)
# encode to latent space
latent_dim = 32 
z_mean_func = Dense(latent_dim, name='z_mean')
log_z_func = Dense(latent_dim, name='z_log_var')
z_func = Lambda(sampling, name='z_sample')
#print(z_mean_func)
#print(log_z_func)
#print(z_func)
#z = Lambda(sampling)([z_mean, z_log_var])
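
To see what the sampling function does outside of Keras, here is a minimal NumPy sketch of the same `mu + sigma * epsilon` step. The helper name `sample_latent` and the toy values are assumptions made for illustration; they are not part of the notebook's model.

```python
import numpy as np

def sample_latent(z_mean, z_log_var, rng):
    """Draw z = mu + sigma * epsilon with epsilon ~ N(0, 1)."""
    sigma = np.exp(0.5 * z_log_var)              # log-variance -> standard deviation
    epsilon = rng.standard_normal(z_mean.shape)  # noise from the standard gaussian
    return z_mean + sigma * epsilon

rng = np.random.default_rng(0)
z_mean = np.full((100_000, 2), 3.0)              # mu = 3 for every sample
z_log_var = np.full((100_000, 2), np.log(0.25))  # variance 0.25, so sigma = 0.5
z = sample_latent(z_mean, z_log_var, rng)

print(z.mean(axis=0))  # close to 3.0 in both dimensions
print(z.std(axis=0))   # close to 0.5 in both dimensions
```

Because the randomness enters only through `epsilon`, the output stays a differentiable function of `z_mean` and `z_log_var`, which is what lets the encoder be trained by backpropagation.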

Now we'll define the RNN (Recurrent Neural Network) layers for decoding SMILES from latent space values. Recurrent neural networks are well suited to sequential data: each cell receives the state of the previous cell, so the network can learn dependencies between positions in a sequence (here, between successive characters of a SMILES string). This RNN uses Gated Recurrent Units (GRUs) as cells. You can learn more about recurrent neural networks and Gated Recurrent Units [here](https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be).

In [None]:
# this repeat vector just repeats the input `max_smiles_chars` times 
# so that we get a value for each character of the SMILES string
repeat_1_func = RepeatVector(max_smiles_chars, name="Repeat-Latent-1")

# RNN decoder
rnn_size = 32
gru_1_func = GRU(rnn_size, name="RNN-decoder-1", return_sequences=True, kernel_initializer=init_weights)
gru_2_func = GRU(rnn_size, name="RNN-decoder-2", return_sequences=True, kernel_initializer=init_weights)
gru_3_func = GRU(rnn_size, name="RNN-decoder-3", return_sequences=True, kernel_initializer=init_weights)
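
The following NumPy sketch illustrates, under simplifying assumptions, what the repeat vector and a single GRU layer compute. Biases are omitted, the sizes are toy values, and the gate-blending convention shown is one common formulation; the Keras `GRU` layer may differ in details.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update for a single timestep (biases omitted for brevity)."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x_t @ Wz + h_prev @ Uz)                  # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur)                  # reset gate
    h_candidate = np.tanh(x_t @ Wh + (r * h_prev) @ Uh)  # proposed new state
    return z * h_prev + (1.0 - z) * h_candidate          # blend old state and candidate

rng = np.random.default_rng(0)
latent_dim, rnn_size, n_steps = 4, 3, 5   # toy sizes, not the notebook's

# Same idea as RepeatVector: copy the latent vector once per timestep
z_latent = rng.standard_normal((1, latent_dim))
repeated = np.repeat(z_latent[:, np.newaxis, :], n_steps, axis=1)  # (1, n_steps, latent_dim)

# Random weights in the order Wz, Uz, Wr, Ur, Wh, Uh
params = [rng.standard_normal(shape) * 0.1
          for shape in [(latent_dim, rnn_size), (rnn_size, rnn_size)] * 3]

h = np.zeros((1, rnn_size))
outputs = []
for t in range(n_steps):
    h = gru_step(repeated[:, t, :], h, params)  # state carries over between timesteps
    outputs.append(h)
outputs = np.stack(outputs, axis=1)  # (1, n_steps, rnn_size), like return_sequences=True

print(repeated.shape, outputs.shape)
```

Even though every timestep receives the same input, the outputs differ from step to step because the hidden state is carried forward.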

Finally we'll define the output, which should map to the original SMILES input:

In [None]:
output_func = TimeDistributed(
    Dense(charset_length, activation="softmax", name="SMILES-Output", kernel_initializer=init_weights), 
    name="Time-Distributed"
)
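
As a rough illustration of the output layer, the NumPy sketch below applies one shared dense layer with a softmax to every timestep, which is the idea behind wrapping `Dense` in `TimeDistributed`. The sizes and random weights are toy assumptions, not the notebook's values.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_steps, rnn_size, n_chars = 5, 3, 6   # toy sizes, not the notebook's

decoder_states = rng.standard_normal((1, n_steps, rnn_size))
W = rng.standard_normal((rnn_size, n_chars))  # one weight matrix shared by all timesteps
b = np.zeros(n_chars)

char_probs = softmax(decoder_states @ W + b)  # (1, n_steps, n_chars)
print(char_probs.shape)
print(char_probs.sum(axis=-1))                # each timestep sums to 1
```

Each timestep therefore yields a probability distribution over the character set, which can be compared with the one-hot encoded input character at that position.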

Now that we have defined all the layers, we will connect them together to make a graph:

In [None]:
# connecting the VAE model as a graph

# cnn encoder layers
conv_1_fwd = conv_1_func(smiles_input)
conv_2_fwd = conv_2_func(conv_1_fwd)
conv_3_fwd = conv_3_func(conv_2_fwd)
conv_4_fwd = conv_4_func(conv_3_fwd)

# flattening
flattened_convs = flatten_func(conv_4_fwd)
dense_1_fwd = dense_1_func(flattened_convs)

# latent space
z_mean = z_mean_func(dense_1_fwd)
z_log_var = log_z_func(dense_1_fwd)
z = z_func([z_mean, z_log_var])

# rnn decoder layers
repeat_1_fwd = repeat_1_func(z)
gru_1_fwd = gru_1_func(repeat_1_fwd)
gru_2_fwd = gru_2_func(gru_1_fwd)
gru_3_fwd = gru_3_func(gru_2_fwd)
smiles_output = output_func(gru_3_fwd)

### View model structure and metadata
Now the model is ready to train! But first we will compile the VAE model, then view model metadata, again using the [keras2ascii](https://github.com/stared/keras-sequential-ascii) tool. To compile the model, we will need to define our own VAE loss function.

In [None]:
# vae loss function -- reconstruction loss (cross entropy) plus KL divergence loss against a Gaussian prior
# Intuitive meaning for this loss function: "Reconstruct the data but stay close to a Gaussian"
def vae_loss(x_input, x_predicted):
    reconstruction_loss = K.sum(binary_crossentropy(x_input, x_predicted), axis=-1)
    reconstruction_loss *= input_dim
    kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
    kl_loss = K.sum(kl_loss, axis=-1)
    kl_loss *= -0.5
    return K.mean(reconstruction_loss + kl_loss)

# create model
vae_model = Model(
            inputs=[smiles_input],
            outputs=[smiles_output]
)

# compile model
vae_model.compile(
    optimizer="adam",
    loss=vae_loss,
    metrics=["accuracy"]
)
vae_model.summary()
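
To build some intuition for the KL term in `vae_loss`, here is a small NumPy sketch of the same expression evaluated on hand-picked values. The function name `kl_to_unit_gaussian` is an assumption made for this illustration.

```python
import numpy as np

def kl_to_unit_gaussian(z_mean, z_log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1)

at_prior = kl_to_unit_gaussian(np.zeros((1, 32)), np.zeros((1, 32)))
shifted = kl_to_unit_gaussian(np.ones((1, 32)), np.zeros((1, 32)))
narrow = kl_to_unit_gaussian(np.zeros((1, 32)), np.full((1, 32), np.log(0.25)))

print(at_prior)  # zero: the encoder output already matches the prior
print(shifted)   # 0.5 per dimension when mu = 1 and sigma = 1
print(narrow)    # positive: a narrower distribution is also penalized
```

The penalty vanishes only when the encoded distribution equals the unit gaussian, so this term pulls the latent space toward the distribution we later sample from when generating SMILES.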

In [None]:
# view model as a graph
keras2ascii(vae_model)

### Train VAE

While the VAE trains, metrics such as the training and validation loss and accuracy are printed after each epoch.

In [None]:
# reset_states() only clears the state of stateful recurrent layers; it does not re-initialize weights
vae_model.reset_states()
# make sure all layers are trainable
for layer in vae_model.layers:
    layer.trainable = True

# fit model to training data
history = vae_model.fit(
    x=X_train,
    y=X_train,
    epochs=20,
    validation_data=(X_test, X_test),
    batch_size=64,
    verbose=1
)

Let's view the learning curve for the trained model. 

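This code will generate plots of the training and validation loss, and then accuracy, as a function of epoch (one forward pass and one backward pass of all training examples through the NN).

The learning curve will tell us if the model is overfitting or underfitting. A validation loss that rises while the training loss keeps falling suggests overfitting, whereas two curves that are both still decreasing at the last epoch suggest the model would benefit from more training.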

In [None]:
# plot the learning curve (loss)
plt.rcParams.update({'font.size': 18})
plt.figure(figsize=(8,8))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('VAE model loss', fontname='Arial Narrow', size=18)
plt.ylabel('Loss',fontname='Arial Narrow', size=18)
plt.xlabel('Epoch',fontname='Arial Narrow', size=18)
plt.legend(['Train', 'Validation'], loc='upper right')
# output size is controlled by figsize and dpi
plt.savefig('./tpsa_vae_X_training_loss.png', dpi=600, facecolor='w', edgecolor='w')
plt.show()


# plot the accuracy curve
plt.rcParams.update({'font.size': 18})
plt.figure(figsize=(8,8))
plt.subplots_adjust(left=0.16, bottom=0.16, right=0.95, top=0.90)
plt.rc('font', family='Arial narrow')
# the history key is 'acc' in older Keras versions and 'accuracy' in newer ones
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])
plt.title('VAE model accuracy', fontname='Arial Narrow', size=18)
plt.ylabel('Accuracy',fontname='Arial Narrow', size=18)
plt.xlabel('Epoch',fontname='Arial Narrow', size=18)
plt.legend(['Train', 'Validation'], loc='lower right')
# output size is controlled by figsize and dpi
plt.savefig('./tpsa_vae_X_training_acc.png', dpi=600, facecolor='w', edgecolor='w')
plt.show()
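
Besides inspecting the plots by eye, a learning curve can be summarized with a simple numerical check. The sketch below is a rough heuristic only: the function name, the `tolerance` threshold, and the loss values are all made up for illustration and are not results from this notebook. It expects at least two epochs and non-zero losses.

```python
def describe_learning_curve(train_loss, val_loss, tolerance=0.05):
    """Rough label for a learning curve; `tolerance` is an arbitrary relative threshold."""
    # zero-based index of the epoch with the lowest validation loss
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    final_gap = (val_loss[-1] - train_loss[-1]) / abs(train_loss[-1])
    val_rebound = (val_loss[-1] - val_loss[best_epoch]) / abs(val_loss[best_epoch])

    if val_rebound > tolerance and final_gap > tolerance:
        return "possible overfitting: validation loss rose after epoch %d" % best_epoch
    still_improving = (train_loss[-2] - train_loss[-1]) / abs(train_loss[-2]) > tolerance
    if still_improving:
        return "possibly undertrained: training loss is still falling"
    return "no obvious problem"

# Made-up loss values, for illustration only
print(describe_learning_curve([10, 6, 4, 3, 2], [10, 7, 6, 7, 8]))
print(describe_learning_curve([10, 8, 6.5, 5.5, 4.8], [10, 8.2, 6.8, 5.9, 5.2]))
```

With a real run, the two lists would come from `history.history['loss']` and `history.history['val_loss']`.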


### Create a decoder model and use it to generate SMILES from noise

Now that we have trained our VAE, we can use the decoding part of the VAE to generate SMILES strings! Let's start by defining our decoder model. Note that this model doesn't need to be compiled since we are not training this model.

In [None]:
# connect the decoder graph
decoder_input = Input(shape=(latent_dim,), name="decoder_input")
decoder_repeat_1_fwd = repeat_1_func(decoder_input)
decoder_gru_1_fwd = gru_1_func(decoder_repeat_1_fwd)
decoder_gru_2_fwd = gru_2_func(decoder_gru_1_fwd)
decoder_gru_3_fwd = gru_3_func(decoder_gru_2_fwd)
decoder_smiles_output = output_func(decoder_gru_3_fwd)

# define decoder model
decoder_model = Model(
    inputs=[decoder_input],
    outputs=[decoder_smiles_output]
)
decoder_model.summary()

In [None]:
# view decoder graph. this should look like a subset of the VAE graph.
keras2ascii(decoder_model)

Now let's generate SMILES strings! First we will randomly sample from a unit gaussian distribution, feed the random samples into the decoder model, and take the output of the decoder model and convert it back into SMILES characters. Don't be surprised to see strange SMILES strings! We used a very small dataset, and did not train for very long.

In [None]:
for x in range(20):
    
    # draw from a unit gaussian 
    decoder_test_input = np.random.normal(0, 1, latent_dim).reshape(1, latent_dim)
    decoder_test_output = decoder_model.predict(decoder_test_input)
    
    decoded_one_hots = np.argmax(decoder_test_output, axis = 2)

    SMILES = ''
    for char_idx in decoded_one_hots[0]:
        if charset[char_idx] in ["PAD", "NULL"]: 
            break # Stop decoding if you hit padding or an out-of-vocab character (NULL)
        
        SMILES = SMILES + charset[char_idx]

    print(SMILES)
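
The decoding step in the loop above can be checked in isolation on a hand-built example. In this sketch, `toy_charset` and the probability matrix are hypothetical stand-ins for the notebook's `charset` and the decoder output.

```python
import numpy as np

# Hypothetical toy vocabulary; the notebook's `charset` is built from the dataset
toy_charset = ["PAD", "NULL", "C", "O", "N", "(", ")", "="]

def decode_smiles(char_probs, charset):
    """Greedy decoding: pick the most likely character at each position."""
    smiles = []
    for char_idx in np.argmax(char_probs, axis=-1):
        if charset[char_idx] in ("PAD", "NULL"):
            break  # treat padding / out-of-vocab as the end of the string
        smiles.append(charset[char_idx])
    return "".join(smiles)

# Hand-built probabilities for a 6-position output: "CC=O", then padding
indices = [2, 2, 7, 3, 0, 2]
char_probs = np.full((len(indices), len(toy_charset)), 0.01)
char_probs[np.arange(len(indices)), indices] = 0.93

print(decode_smiles(char_probs, toy_charset))  # CC=O
```

Note that decoding stops at the first `PAD`, so the character after it is ignored. Greedy argmax decoding does not guarantee a chemically valid SMILES string; a cheminformatics toolkit would be needed to check validity.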

### Save VAE and decoder models
We can save/load these models for future use, again using the `save()` and `load_model()` functions from Keras.

In [None]:
# save and load the decoder model 
decoder_model.save("tpsa_decoder_model.hdf5")
loaded_decoder_model = load_model("tpsa_decoder_model.hdf5")

# for the VAE, we save only the weights; to restore it, instantiate a model
# with the same architecture and then load the weights onto that model
vae_model.save_weights("tpsa_vae.hdf5")
# load_weights() updates the model in place; its return value is not a model
vae_model.load_weights("tpsa_vae.hdf5")