In [6]:
import sys
new_paths = ['../datasets']
for path in new_paths:
    sys.path.insert(0, path)
import warnings
warnings.filterwarnings("ignore")

#Basics
import numpy as np
from Datasets_Class import Dataset

# 1. Download, Preprocess and Read Datasets
The 'Dataset' class downloads and reads the datasets. It takes as input the path to the folder where the downloaded datasets and the preprocessed feature vectors are stored. Once you have created a 'Dataset' object, you can call its QM7_dataset, QM8_dataset, and QM9_dataset methods to access the respective datasets. Each method downloads its dataset, preprocesses it as described in the paper, and returns a feature matrix 'x' and the associated labels 'y'.

### 1.1 QM7 
The feature vector for a molecule is the (reshaped) Coulomb matrix. The label value to predict is the atomization energy, a scalar value describing the amount of energy required to completely separate all the atoms in a molecule into individual gas-phase atoms.
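
For reference, the 529 features per molecule correspond to a 23 × 23 Coulomb matrix flattened into a vector. The sketch below illustrates the standard Coulomb matrix definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j / |R_i − R_j|) with zero padding. The function name `coulomb_matrix` and the toy H2 geometry are assumptions made for illustration; this is not the code inside 'Dataset', which may simply read precomputed matrices from the downloaded file.

```python
import numpy as np

def coulomb_matrix(charges, coords, size=23):
    """Return a zero-padded Coulomb matrix flattened to a vector of length size**2.

    Hypothetical helper for illustration; charges are nuclear charges and
    coords are Cartesian positions in atomic units (Bohr).
    """
    charges = np.asarray(charges, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(charges)

    # Pairwise distances between nuclei
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Off-diagonal entries: Z_i * Z_j / |R_i - R_j|
    # (the diagonal divides by zero here and is overwritten just below)
    with np.errstate(divide="ignore", invalid="ignore"):
        block = np.outer(charges, charges) / dist

    # Diagonal entries: 0.5 * Z_i ** 2.4
    block[np.diag_indices(n)] = 0.5 * charges ** 2.4

    # Pad with zeros up to the largest molecule size, then flatten
    cm = np.zeros((size, size))
    cm[:n, :n] = block
    return cm.reshape(-1)

# Toy example: two hydrogen nuclei one Bohr apart
features = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(features.shape)
```

Padding to a fixed size is what lets molecules with different numbers of atoms share one feature matrix.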

In [7]:
datasets = Dataset('../datasets')
x_qm7, y_qm7 = datasets.QM7_dataset()
print('Shape data matrix:', x_qm7.shape)

Shape data matrix: (7165, 529)


Note that in the downloaded file the labels are given in kcal/mol. To convert them to eV, divide the label values by 23.060900.
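
A minimal sketch of this rescaling, assuming a label array that is still in kcal/mol. The values in `y_kcal` are made up for illustration and are not taken from QM7.

```python
import numpy as np

KCAL_PER_MOL_PER_EV = 23.060900  # conversion factor quoted in the text

# Hypothetical atomization energies in kcal/mol (illustrative values only)
y_kcal = np.array([-417.96, -712.42, -1213.03])

# Divide by the factor to express the same energies in eV
y_ev = y_kcal / KCAL_PER_MOL_PER_EV
print(y_ev)
```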

### 1.2 QM8
The feature vector for a molecule is the preprocessed Mordred descriptor. The preprocessing consists of normalizing the original Mordred values and removing the entries with zero variance over the data points. It can be turned off by setting 'preprocessing = False' in the call to the QM8_dataset method. The label value to be predicted is the lowest singlet transition energy (E1). Other label values can be selected by changing the target_label argument. The available options are listed in the header of the downloaded qm8.csv file, which is obtained after running the QM8_dataset method for the first time.
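
The preprocessing described above can be sketched as follows. This is an illustration only: it assumes that "normalizing" means standardizing each column to zero mean and unit variance, which may differ from what 'Dataset' does internally, and the function name `preprocess_descriptors` is hypothetical.

```python
import numpy as np

def preprocess_descriptors(x):
    """Drop zero-variance columns, then standardize the remaining ones.

    Hypothetical helper; assumes z-score normalization.
    Returns the processed matrix and the boolean mask of kept columns.
    """
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)

    # Columns that never change carry no information and would divide by zero
    keep = std > 0
    x_kept = x[:, keep]

    # Standardize each remaining column to zero mean and unit variance
    x_norm = (x_kept - x_kept.mean(axis=0)) / std[keep]
    return x_norm, keep

# Toy matrix: the middle column is constant and should be removed
x_toy = np.array([[1.0, 5.0, 2.0],
                  [3.0, 5.0, 4.0],
                  [5.0, 5.0, 6.0]])
x_proc, kept = preprocess_descriptors(x_toy)
print(x_proc.shape, kept)
```

Returning the mask makes it possible to apply the same column selection to new molecules later.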

In [8]:
datasets = Dataset('../datasets')
x_qm8, y_qm8 = datasets.QM8_dataset(target_label='E1-PBE0', preprocessing=True)
print('Shape data matrix:', x_qm8.shape)

Shape data matrix: (21766, 1296)


Note that in the downloaded file the labels are given in atomic units (Hartree). To convert them to eV, multiply the label values by 27.21139.
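
As a short sketch, assuming a label array that is still in atomic units. The values in `y_hartree` are invented for illustration and do not come from qm8.csv.

```python
import numpy as np

EV_PER_HARTREE = 27.21139  # conversion factor quoted in the text

# Hypothetical transition energies in Hartree (illustrative values only)
y_hartree = np.array([0.2207, 0.1851, 0.2543])

# Multiply by the factor to express the same energies in eV
y_ev = y_hartree * EV_PER_HARTREE
print(y_ev)
```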

### 1.3 QM9
The feature vector for a molecule is the preprocessed Mordred descriptor. The preprocessing consists of removing the entries with zero variance over the data points. It can be turned off by setting 'preprocessing = False' in the call to the QM9_dataset method. The label value to be predicted is the HOMO-LUMO gap, measured in eV, which is the difference between the energies of the lowest unoccupied (LUMO) and highest occupied (HOMO) molecular orbitals. It is a useful quantity for examining the kinetic stability of the molecule.

In [9]:
# Computing the Mordred descriptors may take a few hours
datasets = Dataset('../datasets')
x_qm9, y_qm9 = datasets.QM9_dataset(preprocessing=True)
print('Shape data matrix:', x_qm9.shape)

Shape data matrix: (130202, 1307)
