# ML Potentials Exercises July 19 
## QML package and QM7 learning curves

In this exercise, we will work with QM7 again, but this time we will focus on learning curves. A learning curve shows how a model's performance changes as the training set size increases. It tells us about the generalization error and can also be used to estimate how large a training set is needed to reach a desired accuracy. To speed up and simplify the work, we will use the QML package for this task.
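
As a rough illustration of how a learning curve can be used to estimate the required training set size, the sketch below assumes the error follows a power law, MAE ≈ a·N^(-b), which appears as a straight line on a log-log plot. The helper names `fit_power_law` and `required_size` and the synthetic numbers are made up for this example; real learning curves only follow this form approximately.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b*log(N); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

def required_size(a, b, target_error):
    """Invert error = a * N**(-b) to get the training set size N."""
    return (a / target_error) ** (1.0 / b)

# Synthetic (made-up) learning curve that follows 50 * N**-0.5 exactly
sizes = [32, 100, 316, 1000]
errors = [50.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, errors)
n_needed = required_size(a, b, target_error=1.0)
print(f"a = {a:.2f}, b = {b:.2f}, N for MAE 1.0 = {n_needed:.0f}")
```

Extrapolating far beyond the largest measured training size should be treated with caution, since the slope can change.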

QML is a Python2/3-compatible toolkit for representation learning of properties of molecules and solids. The goal is to provide usable and efficient implementations of concepts such as representations and kernels. QML supplies the building blocks to carry out efficient and accurate machine learning on chemical compounds.

Documentation: https://www.qmlcode.org/index.html#<br /> 
GitHub repository: https://github.com/qmlcode/qml<br /> 
GPU version: https://github.com/nickjbrowning/QMLightning


### Load libraries

In [None]:
import numpy as np
import pandas as pd
import qml
import time
from qml.kernels import gaussian_kernel
from qml.math import cho_solve
from qml.representations import get_slatm_mbtypes
from sklearn.model_selection import train_test_split
from glob import glob
import matplotlib.pyplot as plt

### Inspect QML Compound
The QML package uses `Compound` objects that store all information from xyz files:

In [None]:
mol = qml.Compound(xyz="qm7_files/qm7/0001.xyz")
print(mol.coordinates)
print(mol.atomtypes)
print(mol.nuclear_charges)
print(mol.name)
print(mol.unit_cell)

## 1. Load QM7 dataset

In [None]:
# Load molecules into Compounds
xyzs = sorted(glob("qm7_files/qm7/*.xyz"))
mols = [qml.Compound(x) for x in xyzs]
# Load energies
energies = pd.read_csv('qm7_files/hof_qm7.txt', header=None, sep=r'\s+', engine='python')
energies.columns = ['filename', 'PBE0', 'DFTB']
y = energies.PBE0

## 2. Coulomb matrix
We will start with the Coulomb matrix as a molecular descriptor because of its simplicity. We will again use KRR with an RBF (Gaussian) kernel, but this time with the QML package. Let us generate Coulomb matrices for the whole dataset. This may take a few minutes.

In [None]:
# Generate Coulomb matrix for all molecules. 
cm = np.array([np.array(qml.representations.generate_coulomb_matrix(
                                    mol.nuclear_charges,                               
                                    mol.coordinates,
                                    size=23,
                                    sorting="row-norm"))
       for mol in mols],dtype=object)
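
For reference, the Coulomb matrix has diagonal elements 0.5·Z_i^2.4 and off-diagonal elements Z_i·Z_j/|R_i − R_j|. The following is a minimal pure-Python sketch of that definition with zero padding and row-norm sorting. It is not QML's implementation: `coulomb_matrix_sketch` is a hypothetical helper, the geometry is made up, and QML's output layout and unit handling may differ.

```python
import math

def coulomb_matrix_sketch(charges, coords, size):
    """Zero-padded Coulomb matrix, rows/columns sorted by descending row norm."""
    n = len(charges)
    m = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i][j] = 0.5 * charges[i] ** 2.4
            else:
                m[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    # Sort atoms by row norm so the result does not depend on the input atom order
    norms = [math.sqrt(sum(v * v for v in row)) for row in m]
    order = sorted(range(size), key=lambda k: -norms[k])
    return [[m[i][j] for j in order] for i in order]

# Toy water-like geometry (made-up coordinates), hydrogens listed first
charges = [1, 1, 8]
coords = [(0.96, 0.0, 0.0), (-0.24, 0.93, 0.0), (0.0, 0.0, 0.0)]
cm_toy = coulomb_matrix_sketch(charges, coords, size=4)

# After sorting, the oxygen row comes first and the padding row comes last
print(cm_toy[0][0], cm_toy[3])
```

Padding to a fixed size (23 in the cell above) gives every molecule a descriptor of the same length, which a kernel needs in order to compare molecules with different numbers of atoms.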

**Task 1:** Use the optimized hyperparameters that you have found in the previous exercise. 

In [None]:
####################################
####YOUR TURN: fill in your optimal alpha and gamma for the 'rbf' kernel from the previous exercise
alpha = 
gamma = 
####################################

A learning curve shows how a test error, such as the mean absolute error (MAE), depends on the size of the training set. All molecules that are not included in the training set belong to the test set. To plot this dependence, we need to know the MAE for different training set sizes.

In [None]:
# Define desired train set sizes
train_ratio = [0.004506408, 0.014082524, 0.044500775, 0.14082524, 0.445289396]
total_mol = y.shape[0]
train_size = [x*total_mol for x in train_ratio]
test_error = []

for ratio in train_ratio:
    # Split data to train and test sets
    X_train, X_test, Y_train, Y_test = train_test_split(cm, y, train_size = ratio, test_size = 1 - ratio)

    # Calculate kernel matrix
    K = gaussian_kernel(X_train, X_train, 1/np.sqrt(2*gamma))

    # Add the regularization strength alpha to the diagonal of the kernel matrix
    K[np.diag_indices_from(K)] += alpha

    # Use the built-in Cholesky-decomposition to find c
    c = cho_solve(K, Y_train)

    # Calculate a kernel matrix between test and training data, using the same gamma
    K_test = gaussian_kernel(X_test, X_train, 1/np.sqrt(2*gamma))

    # Make the predictions
    Y_predicted = np.dot(K_test, c)
    
    # Calculate mean absolute-error (MAE)
    mae = np.mean(np.abs(Y_predicted - Y_test))
    
    # Calculate training size
    size = ratio * total_mol
    
    test_error.append(mae)
    print("Training size: %i\t MAE: %.2f kcal/mol" % (size, mae))
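
The loop above can be summarized in a few equations. The kernel width is written as `1/np.sqrt(2*gamma)` because `gaussian_kernel` is parameterized by a width σ, whereas the `gamma` found in the previous exercise is assumed here to follow the scikit-learn RBF convention exp(−γ‖x − x'‖²). The symbols below are introduced only for this summary:

```latex
K_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right),
\qquad
\sigma = \frac{1}{\sqrt{2\gamma}}
\;\Longrightarrow\;
K_{ij} = \exp\!\left(-\gamma \lVert x_i - x_j \rVert_2^2\right)

c = (K + \alpha I)^{-1}\, y_{\text{train}},
\qquad
\hat{y}_{\text{test}} = K_{\text{test}}\, c,
\qquad
\text{MAE} = \frac{1}{N_{\text{test}}} \sum_{k=1}^{N_{\text{test}}} \lvert \hat{y}_k - y_k \rvert
```

Here K is the train-train kernel matrix, K_test is the test-train kernel matrix, and α is the regularization added to the diagonal before the Cholesky solve.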

Now we can plot the learning curve.

In [None]:
fig = plt.figure()
ax = plt.axes()
ax.plot(train_size, test_error, color="red", ls="-", marker ="o", label=r'Coulomb matrix')
plt.xlabel('Training set size')
plt.ylabel('MAE (kcal/mol)')
ax.set_xscale('log')
ax.set_yscale('log')
plt.legend()
plt.show()

On the log-log plot, the learning curve should be approximately linear, although this may not always be the case. Repeat the training and observe how the errors and the plot change slightly. This variation comes from the random split of the molecules into train and test sets, which differs on every run, and from the fact that not all molecules carry the same amount of information. We should therefore not rely on a single model but average over several models trained on different train/test splits. Implement such an average.

**Task 2:** Fill in the code so that, for each training ratio, the MAE is averaged over an ensemble of 10 models trained on different random splits.

In [None]:
train_ratio_cm = [0.004506408, 0.014082524, 0.044500775, 0.14082524, 0.445289396]
train_size_cm = [x*y.shape[0] for x in train_ratio_cm]
test_error_cm = []
ensemble = 10

for ratio in train_ratio_cm:
    sum_mae = 0
    for x in range(ensemble):
        ####################################
        ####YOUR TURN: Train KRR model and find MAE. Store MAE in mae variable.

        
        
        
        
        
        
        
        
        ####################################
        sum_mae = sum_mae + mae
    test_error_cm.append(sum_mae/ensemble)
    size = ratio * total_mol
    print("Training size: %i\t Averaged MAE: %.2f kcal/mol" % (size, sum_mae/ensemble))
    
fig = plt.figure()
ax = plt.axes()
ax.plot(train_size_cm, test_error_cm, color="red", ls="-", marker ="o", label=r'Coulomb matrix')
plt.xlabel('Training set size')
plt.ylabel('MAE (kcal/mol)')
ax.set_xscale('log')
ax.set_yscale('log')
plt.legend()
plt.show()
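
Besides the mean, the spread of the MAE across the repeated splits indicates how reliable each point of the learning curve is. The sketch below computes the mean and the standard error of the mean for one training size. `summarize_runs` is a hypothetical helper and the MAE values are made up for illustration.

```python
import statistics

def summarize_runs(maes):
    """Return (mean, standard error of the mean) for MAEs of repeated splits."""
    mean = statistics.fmean(maes)
    sem = statistics.stdev(maes) / len(maes) ** 0.5
    return mean, sem

# Made-up MAEs (kcal/mol) from five hypothetical random splits at one training size
example_maes = [10.0, 12.0, 11.0, 9.0, 13.0]
mean_mae, sem_mae = summarize_runs(example_maes)
print(f"MAE = {mean_mae:.2f} +/- {sem_mae:.2f} kcal/mol")
```

If the individual MAEs are stored per training size, the resulting standard errors could be drawn as error bars, for example with `ax.errorbar` instead of `ax.plot`.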

## 3. FCHL19

Using the Coulomb matrix, we have demonstrated how a learning curve is constructed. However, because it is so simple, this descriptor reaches only limited accuracy. We now move on to the FCHL19 descriptor, which is widely used because of its high accuracy.

In brief, the FCHL19 representation is a vector that encodes the atomic environment of an atom in a chemical compound. It consists of a two-body term that encodes radial distributions between the central atom and neighboring atoms of a given element type. Additionally, the representation contains a three-body term that encodes the mean distances and angles between the atom and neighboring pairs of atoms of given element types. This representation does not include an explicit one-body term. Please refer to the paper for more details: https://aip.scitation.org/doi/10.1063/1.5126701

The higher accuracy comes at the price of a higher computational cost. Therefore, in this part of the exercise, we will not average over several models, and we will use at most 1000 molecules in the training set.

Your task is to plot a learning curve for KRR with FCHL19 descriptors for training set ratios of 0.004506408, 0.014082524, 0.044500775, and 0.14082524 without averaging. Below we provide an example of how to train the FCHL19 model. 


In [None]:
# Generate FCHL19 representation for all molecules.
fchl = np.array([qml.fchl.generate_representation(mol.coordinates,
                                                  mol.nuclear_charges)
        for mol in mols],dtype=object)

**Task 3:** For the FCHL19 representation, get MAEs for multiple training ratios (0.004506408, 0.014082524, 0.044500775, and 0.14082524) without averaging.

In [None]:
train_ratio_fchl = [0.004506408, 0.014082524, 0.044500775, 0.14082524]
train_size_fchl = [x*y.shape[0] for x in train_ratio_fchl]
test_error_fchl = []

####################################
####YOUR TURN: Split data to train and test sets



####################################

# Hyperparameters. You do not need to tune them.
alpha = 1e-7
sigmas = [1.0]

# Calculate kernel matrix for train set
K = qml.fchl.get_local_kernels(X_train, X_train, sigmas, cut_distance=10.0)[0]

# Add the regularization strength alpha to the diagonal of the kernel matrix
K[np.diag_indices_from(K)] += alpha

# Use the built-in Cholesky-decomposition to find c
c = cho_solve(K,Y_train)

# Calculate a kernel matrix between test and training data
K_test = qml.fchl.get_local_kernels(X_test, X_train, sigmas, cut_distance=10.0)[0]


####################################
####YOUR TURN: Make the predictions, calculate and print the mean absolute-error (MAE)



####################################

**Task 4:** Plot test_error_cm against train_size_cm together with test_error_fchl against train_size_fchl in a single figure. This allows us to compare the Coulomb matrix with the FCHL19 representation.

In [None]:
####################################
####YOUR TURN: Plot learning curves for the CM descriptor together with FCHL19 for comparison








####################################

**Task 5:** Using time.time(), compare the computational cost of the Coulomb matrix and FCHL19 models. Also determine whether training takes longer than prediction or vice versa. Empirically determine how the time for training and for prediction scales with N, the dataset size, for example by estimating the exponent p in O(N^p).

For an example of how to use time.time(), see the previous exercise in scikit_KRR_SVR.ipynb.
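
One possible way to estimate the scaling exponent is to time the same step at two dataset sizes and compare the ratio of the timings with the ratio of the sizes. The sketch below uses a toy double loop as a stand-in for a kernel evaluation. `time_call`, `scaling_exponent`, and `toy_pairwise_work` are hypothetical helpers, not part of QML.

```python
import math
import time

def time_call(func, *args):
    """Return the wall-clock time in seconds of a single call to func(*args)."""
    start = time.time()
    func(*args)
    return time.time() - start

def scaling_exponent(n_small, t_small, n_large, t_large):
    """Estimate p in t ~ N**p from two (size, time) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)

def toy_pairwise_work(n):
    """Stand-in for a kernel evaluation: touches all n*n pairs."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total

t_small = time_call(toy_pairwise_work, 400)
t_large = time_call(toy_pairwise_work, 1600)
p = scaling_exponent(400, t_small, 1600, t_large)
print(f"measured exponent: {p:.2f} (a pairwise loop is expected to be close to 2)")
```

Single timings are noisy, so repeating each measurement and using more than two sizes gives a more trustworthy estimate.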