In [6]:
import abc
import numpy as np
import os
import json
from openbabel import pybel
from rdkit import Chem
from rdkit.Chem import AllChem
import pdb
import itertools
from tqdm import tqdm
from Features import DualMol
from Features import Featurizer
import shutil

# Configuration
The first step is to set up the configuration file.

The code below does this for you, assuming this Jupyter notebook is run in the directory that contains the reactions.json file and the dft_results directory, since every path is built from os.getcwd().

In [2]:
config_values = {
    "training_set_directory": os.path.join(os.getcwd(), "training_sets"),  # directory where all new training sets will be made
    "searches_directory": os.getcwd(),  # directory where the reactions.json file is found
    "species_db": os.path.join(os.getcwd(), "dft_results", "dft_data.json"),  # path to the file with the post-processed DFT data
    "dft_results": os.path.join(os.getcwd(), "dft_results"),  # directory with the relaxed xyz files of the molecules
    "molecular_dict": os.path.join(os.getcwd(), "molec_descriptor.dict"),  # JSON database of all features
}

# Write the configuration; the context manager closes the file handle.
with open("config.json", "w") as config_file:
    json.dump(config_values, config_file, indent=2)
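Before instantiating the Featurizer it can help to confirm that the configuration points at real locations. The following is a minimal sketch, not part of the Features module: `check_config` is a hypothetical helper, and it assumes that only the input locations ("searches_directory", "dft_results" and "species_db") have to exist beforehand. Whether the other two paths are created automatically is not confirmed here, so they are only checked for presence in the file.

```python
import json
import os

# Keys written by the configuration cell above.
EXPECTED_KEYS = (
    "training_set_directory",
    "searches_directory",
    "species_db",
    "dft_results",
    "molecular_dict",
)


def check_config(config_path="config.json"):
    """Return a list of human-readable problems found in the config file.

    Hypothetical helper for illustration; an empty list means no problem was found.
    """
    with open(config_path) as config_file:
        config = json.load(config_file)

    problems = []
    for key in EXPECTED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")

    # Assumption: these two entries are input directories and must already exist.
    for key in ("searches_directory", "dft_results"):
        path = config.get(key)
        if path is not None and not os.path.isdir(path):
            problems.append(f"{key} is not a directory: {path}")

    # Assumption: the DFT data file is an input and must already exist.
    species_db = config.get("species_db")
    if species_db is not None and not os.path.isfile(species_db):
        problems.append(f"species_db file not found: {species_db}")

    return problems
```

Calling `check_config()` right after the configuration cell and printing the returned list gives a quick overview of anything that still has to be put in place.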

The way the code works is as follows:

The Featurizer object is instantiated with all the necessary paths. It first updates the molecular dictionary, which contains the molecular features/descriptors. The data for this update comes from two separate sources.

The first is the relaxed structure of the molecule, which should be located inside a directory within the "dft_results" directory. Note that the file containing the relaxed/optimized species should be an XYZ file and have the "relaxed_" prefix. Currently XYZ files are the only ones supported, but it should not be difficult to make the code robust to other file types.

        Example: dft_results/XXXXXXXXXXXXXX-UHFFFAOYSA-N/relaxed_XXXXXXXXXXXXXX-UHFFFAOYSA-N.xyz
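The naming convention in the example can be checked programmatically before running the Featurizer. This is a small sketch under the assumption that each species lives in a sub-directory named after its InChIKey; `relaxed_xyz_path` and `missing_structures` are hypothetical helpers, not functions of the Features module.

```python
import os


def relaxed_xyz_path(dft_results, inchikey):
    """Build the expected path <dft_results>/<InChIKey>/relaxed_<InChIKey>.xyz."""
    return os.path.join(dft_results, inchikey, f"relaxed_{inchikey}.xyz")


def missing_structures(dft_results, inchikeys):
    """Return the InChIKeys for which no relaxed XYZ file is found."""
    return [
        key
        for key in inchikeys
        if not os.path.isfile(relaxed_xyz_path(dft_results, key))
    ]
```

Passing the "dft_results" path from the configuration and a list of InChIKeys returns the species whose relaxed structure still has to be added.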

The second source of data is the dft_data.json file, which has 4 values for each species:

    - Gibbs free energy at 0 K (G0) [J/mol]
    
    - Difference between the Gibbs free energy at 0 K and the Gibbs free energy at 300 K (dG300) [J/mol]
    
    - Highest Occupied Molecular Orbital (HOMO) energy value [eV]
    
    - Lowest Unoccupied Molecular Orbital (LUMO) energy value [eV]
    
    We have also added spin dependent 

With all of this in place we can create a training set with all of the available features. Features can be included or omitted to create training sets with different feature combinations. The created training values and feature vectors will be placed in the training_sets directory, unless otherwise specified in the config file.

In [3]:
my_featurizer = Featurizer()
molecdict = my_featurizer.update_molecular_dict(out=True)
my_featurizer.trainingsetgenerator(features = [
    "gibbs",
    "entropy",
    "topo",
    "morgan",
    "hom-lum",
    "homo",
    "lumo",
    'min_lumo_reactants',
    'max_lumo_reactants',
    'max_h-l_reactants',
    'min_homo_reactants',
    'max_homo_reactants',
    'min_lumo_products',
    'max_lumo_products',
    'max_h-l_products',
    'min_homo_products',
    'max_homo_products',
],out=True)

  Failed to kekulize aromatic bonds in OBMol::PerceiveBondOrders

[17:37:54] Explicit valence for atom # 1 N, 4, is greater than permitted
[17:38:09] Explicit valence for atom # 0 N, 4, is greater than permitted

couldn't add Hydrogens, implicit hydrogens might be missing

CAJHOJGXDDYSFM-UHFFFAOYSA-N is none
CEIUUIFCFNILSM-UHFFFAOYSA-N is none
DOIDPUHPNPZXJO-UHFFFAOYSA-N is none
DSAYAFZWRDYBQY-UHFFFAOYSA-N is none
DSUDAZJJNJYSJI-UHFFFAOYSA-N is none
DXGYDYIBDDJALB-UHFFFAOYSA-N is none
DXMLCDOYYZUWGQ-UHFFFAOYSA-N is none
DZBVCQDCGXMXSD-UHFFFAOYSA-N is none
DZJCILSVHGABME-UHFFFAOYSA-N is none
FFWSICBKRCICMR-UHFFFAOYSA-N is none
FHEPZBIUHGLJMP-UHFFFAOYSA-N is none
FYUZFGQCEXHZQV-UHFFFAOYSA-N is none
GDIBOAXSCRIQSP-UHFFFAOYSA-N is none
GKVDXUXIAHWQIK-UHFFFAOYSA-N is none
GRTHDOCSFFMOHK-UHFFFAOYSA-N is none
JOKLIZXAUFTLPB-UHFFFAOYSA-N is none
KSVCWNYKIAZROO-UHFFFAOYSA-N is none
KWQAATFBQGNPIN-UHFFFAOYSA-N is none
MCOPJQNXAFJCHU-UHFFFAOYSA-N is none
NHKKSWQCIBBXRI-UHFFFAOYSA-N is none
NINIDFKCEFEMDL-UHFFFAOYSA-N is none
OKTJSMMVPCPJKN-UHFFFAOYSA-N is none
PSUDWVOCELYWEJ-UHFFFAOYSA-N is none
QJGQUHMNIGDVPM-UHFFFAOYSA-N is none
QKEJCZRHWHPTSH-UHFFFAOYSA-N is none
QMTGDJHXYKQMFE-UHFFFAOYSA-N is none
QVGXLLKOCUKJST-UHFFFAOYSA-N is none
results is none
RGCNEDLMLAQWMV-UHFFFAOYSA-N is none
RKYCOGPXXQUQMH-UHFFFAOYSA-N is none
SWQJXJOGLNCZEY-UHFFFAOYSA-N is none
SYOVUQYBFHUDCP-UHFFFAOYSA-N is none
TWPDUNYUPXVPTM-UHFFFAOYSA-N is none
VOKAIBSFVGDOKS-UHFFFAOYSA-N is none
WOZZBQKVSNYYSM-UHFFFAOYSA-N is none
XKRFYHLGVUSROY-UHFFFAOYSA-N is none
XPGFERQQLIGTRR-UHFFFAOYSA-N is none
XQKHFRBXPZGCOX-UHFFFAOYSA-N is none
YBSDNOGHLUKFQJ-UHFFFAOYSA-N is none
YEXWOGKLXXTUCJ-UHFFFAOYSA-N is none
YNJHMFXRHDPWCD-UHFFFAOYSA-N is none
YZCKVEUIGOORGS-UHFFFAOYSA-N is none
ZAMOUSCENKQFHK-UHFFFAOYSA-N is none
ZDIIMEPYWRZOSI-UHFFFAOYSA-N is none
ZGOHZDSOGSCHIE-UHFFFAOYSA-N is none

100%|██████████████████████████████████████████████████████████████████████████████| 6060/6060 [00:23<00:00, 253.42it/s]


full info molecules 6014


28952it [01:43, 280.28it/s]


{'gibbs': 0, 'entropy': 1, 'topo': 7, 'morgan': 107, 'hom-lum': 108, 'homo': 109, 'lumo': 110, 'min_lumo_reactants': 111, 'max_lumo_reactants': 112, 'max_h-l_reactants': 113, 'min_homo_reactants': 114, 'max_homo_reactants': 115, 'min_lumo_products': 116, 'max_lumo_products': 117, 'max_h-l_products': 118, 'min_homo_products': 119, 'max_homo_products': 120}


[array([[-4.86499346e+02,  4.94403605e+01, -2.92528740e-01, ...,
          8.27305000e+00, -6.99058000e+00, -6.99058000e+00],
        [-4.55574540e+02,  4.44152860e+01, -9.61331711e-01, ...,
          5.08333000e+00, -5.96429000e+00, -5.96429000e+00],
        [-4.51263156e+02,  4.77067035e+01, -1.16350396e+00, ...,
          5.07604000e+00, -5.57917000e+00, -5.57917000e+00],
        ...,
        [-9.26849995e+01,  1.40414899e+00,  3.77925361e-01, ...,
          1.61130000e+00, -3.98093000e+00, -3.98093000e+00],
        [-2.37001211e+01,  1.80066977e+00, -7.53336385e-01, ...,
          1.19714000e+00, -3.78326000e+00, -3.78326000e+00],
        [-2.74155212e+02,  5.55656932e+01,  3.37644960e-01, ...,
          3.69964000e+00, -5.15203000e+00, -5.15203000e+00]]),
 array([22133.4  ,   157.737,  -111.713, ..., 71175.566, 71175.566,
        -2093.399]),
 array([    0,     6,     7, ..., 28948, 28949, 28951])]

Once the cell above has finished running, the "training_sets" directory will contain a '.trainingvalues' and a '.trainingfeatures' file, which can be read with the np.load function to obtain an x array of descriptors and a y array of activation energy values. Additionally, a 'reaction_indices_out.json' file is written, which gives the index of each reaction in the reactions.json file.

For example:

    if the first element of reaction_indices_out.json is 6, the first element of xxx.trainvalues and xxx.trainfeatures refers to reaction no. 6 in reactions.json.
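The index lookup described above can be sketched in a few lines. This is only an illustration: `reactions_for_rows` is a hypothetical helper, the data is made up, and it assumes that the content of reactions.json can be indexed by integer position (for example a top-level list).

```python
def reactions_for_rows(reactions, reaction_indices):
    """Align reactions with the rows of the training arrays.

    Row i of the feature and value arrays corresponds to
    reactions[reaction_indices[i]].
    """
    return [reactions[index] for index in reaction_indices]


# Tiny in-memory illustration (made-up data, not taken from the dataset).
reactions = [{"id": n} for n in range(10)]
reaction_indices = [0, 6, 7]
aligned = reactions_for_rows(reactions, reaction_indices)
print(aligned[1])  # the second training row maps back to reaction no. 6
```

With the real files, `reactions` and `reaction_indices` would come from `json.load` on reactions.json and reaction_indices_out.json respectively.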

Lastly, the features_explained.json file maps each feature name to an index in the reaction descriptor, indicating which property each part of the descriptor corresponds to.
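The mapping printed by the cell above ('gibbs': 0, 'entropy': 1, 'topo': 7, 'morgan': 107, ...) can be turned into column slices. The sketch below rests on an unverified assumption: each value is read as the last column (inclusive) of that feature's block, so that 'topo' would span columns 2 to 7 and 'morgan' columns 8 to 107. Please check this reading against the Features module before relying on it; `feature_slices` is a hypothetical helper.

```python
def feature_slices(last_index_by_feature):
    """Turn {feature: last column index} into {feature: slice}.

    ASSUMPTION: every value is the last (inclusive) column of its block
    and the blocks are contiguous, starting at column 0.
    """
    slices = {}
    start = 0
    # Walk the features in column order; each block starts right after
    # the previous block ends.
    ordered = sorted(last_index_by_feature.items(), key=lambda item: item[1])
    for name, last in ordered:
        slices[name] = slice(start, last + 1)
        start = last + 1
    return slices


# First entries of the mapping printed above.
index_map = {"gibbs": 0, "entropy": 1, "topo": 7, "morgan": 107, "hom-lum": 108}
slices = feature_slices(index_map)
```

If the assumption holds, `x[:, slices["morgan"]]` would select the fingerprint columns of a feature array `x` loaded with np.load.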

# How to add new molecules and reactions

To add new data to the training set you need the following:

    - Activation energy and stoichiometry for the reactions to be added.

    - Geometry-optimized/relaxed molecular structure.
    
    - Data from the geometry relaxation/thermodynamic calculation as mentioned above (Gibbs energies at 0 K and 300 K, HOMO and LUMO).
    
    NB: The current thermodynamic data was calculated using PBE with def2-SVP basis sets in TURBOMOLE. To add new species to this specific dataset, the same XC functional and basis set must be used (ideally with the same DFT package).
    Link to the TURBOMOLE manual: https://www.turbomole.org/wp-content/uploads/2019/10/Turbomole_Manual_7-4.pdf
    (page 44 of 270 has instructions on how it can be used)
    
    In all other cases the thermodynamic data for all molecules must be recalculated and included.

The geometry-optimized molecule has to be added in the format mentioned above:
         
         Example: dft_results/XXXXXXXXXXXXXX-UHFFFAOYSA-N/relaxed_XXXXXXXXXXXXXX-UHFFFAOYSA-N.xyz
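Placing a new structure in this layout can be scripted. The following is a minimal sketch using only the standard library; `add_relaxed_structure` is a hypothetical helper, and it assumes the source file is already a relaxed geometry in XYZ format.

```python
import os
import shutil


def add_relaxed_structure(src_xyz, inchikey, dft_results):
    """Copy a relaxed XYZ file to <dft_results>/<InChIKey>/relaxed_<InChIKey>.xyz.

    Returns the path of the copied file.
    """
    target_dir = os.path.join(dft_results, inchikey)
    # Create the per-species directory if it does not exist yet.
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, f"relaxed_{inchikey}.xyz")
    shutil.copyfile(src_xyz, target)
    return target
```

The original file is left untouched, so the copy can be repeated safely if the structure is re-optimized later.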

The DFT information has to be updated in the dft_data.json file as mentioned above. Note that this notebook will not calculate the necessary DFT information; this must be done with an external package. In principle any DFT package can calculate the necessary values. We recommend TURBOMOLE for its "freeh" property calculation package, which provides the necessary free energy values.

The reaction then has to be included in the reactions.json file, where the InChIKeys of the products and reactants must be filled in as follows:


    "ProdInChI": [
      "XXXXXXXXXXXXXX-UHFFFAOYSA-N",
      "XXXXXXXXXXXXXX-UHFFFAOYSA-N"
    ],
    "ReacInChI": [
      "XXXXXXXXXXXXXX-UHFFFAOYSA-N",
      "XXXXXXXXXXXXXX-UHFFFAOYSA-N"
    ]


and the activation energy must be filled in as follows:
    
    'Ea [kJ/mol]': float(XXX.XXX)


with supported units being:
    
    '[J/mol]'
    
    '[cal/mol]'
    
    '[kJ/mol]'
    
    '[kcal/mol]'
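A reaction entry with these fields can be assembled and unit-checked as sketched below. `make_reaction_entry` and `ea_in_kj_per_mol` are hypothetical helpers, the conversion assumes the thermochemical calorie (1 cal = 4.184 J), and a real entry in the reactions file may need further fields that are not covered here.

```python
# Conversion factors to kJ/mol for the unit tags listed above.
# ASSUMPTION: thermochemical calorie, 1 cal = 4.184 J.
TO_KJ_PER_MOL = {
    "[J/mol]": 1e-3,
    "[cal/mol]": 4.184e-3,
    "[kJ/mol]": 1.0,
    "[kcal/mol]": 4.184,
}


def make_reaction_entry(reactants, products, ea, unit="[kJ/mol]"):
    """Build a dict with the InChIKey lists and the activation energy."""
    if unit not in TO_KJ_PER_MOL:
        raise ValueError(f"unsupported unit: {unit}")
    return {
        "ProdInChI": list(products),
        "ReacInChI": list(reactants),
        f"Ea {unit}": float(ea),
    }


def ea_in_kj_per_mol(entry):
    """Read the activation energy of an entry and convert it to kJ/mol."""
    for unit, factor in TO_KJ_PER_MOL.items():
        key = f"Ea {unit}"
        if key in entry:
            return entry[key] * factor
    raise KeyError("entry has no activation energy in a supported unit")
```

Converting everything to one unit before comparing entries avoids mixing values that were reported in different units.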
