# Benchmarking <a name="head"></a>

In the following, we train a few Gaussian process (GP) models and compare the performance of AtoML against scikit-learn (sklearn).

## Table of Contents
[(Back to top)](#head)

-   [Requirements](#requirements)
-   [Setup](#setup)
-   [1D-Model](#1d-model)
-   [Real Data](#real-data)
-   [Feature Dimensionality](#feature-dimensionality)

## Requirements <a name="requirements"></a>
[(Back to top)](#head)

-   [AtoML](https://gitlab.com/atoml/AtoML)
-   [ASE](https://wiki.fysik.dtu.dk/ase/)
-   [numpy](http://www.numpy.org/)
-   [scikit-learn](http://scikit-learn.org/stable/)

## Setup <a name="setup"></a>
[(Back to top)](#head)


In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import time

from ase.ga.data import DataConnection
from ase.io import write

from atoml.api.ase_data_setup import get_unique, get_train
from atoml.fingerprint.setup import FeatureGenerator
from atoml.regression import GaussianProcess
from atoml.regression.cost_function import get_error

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF
from sklearn.preprocessing import StandardScaler

## 1D-Model <a name="1d-model"></a>
[(Back to top)](#head)

To start with, we consider a simple 1D model. This is perhaps not an ideal benchmarking case, as we would typically be interested in a feature space with far more dimensions. However, with this kind of toy data we can easily generate varying numbers of data points. In particular, note that the following generates 10,000 test data points but only 200 training points.

### scikit-learn

In [None]:
rng = np.random.RandomState(0)

# Generate sample data
X = 15 * rng.rand(200, 1)
y = np.sin(X).ravel()
y += 3 * (0.5 - rng.rand(X.shape[0]))  # add noise

gp_kernel = 1. * RBF(length_scale=0.5) + WhiteKernel(1e-1)
gpr = GaussianProcessRegressor(kernel=gp_kernel)
stime = time.time()
gpr.fit(X, y)
print("Time for sklearn fitting: %.3f" % (time.time() - stime))

X_plot = np.linspace(0, 20, 10000)[:, None]
stime = time.time()
y_gpr = gpr.predict(X_plot, return_std=False)
print("Time for sklearn prediction: %.3f" % (time.time() - stime))

With sklearn, both training and testing the GP are very fast. The optimized hyperparameters can then be read from the fitted model.

In [None]:
print(gpr.kernel_)
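
The printed kernel is a composite object, so it can help to pull out the individual hyperparameters by name. The following is a minimal sketch on a small synthetic data set (the `*_demo` names and the data are illustrative assumptions, not part of the benchmark). It relies on the fact that `1. * RBF(...) + WhiteKernel(...)` is stored by sklearn as a sum of a product (constant times RBF) and a white-noise kernel.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Small synthetic data set (illustrative only, not the benchmark data).
rng = np.random.RandomState(0)
X_demo = 15 * rng.rand(50, 1)
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.randn(50)

kernel = 1.0 * RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-1)
gpr_demo = GaussianProcessRegressor(kernel=kernel).fit(X_demo, y_demo)

# The fitted kernel has the structure Sum(Product(Constant, RBF), White).
fitted = gpr_demo.kernel_
signal_variance = fitted.k1.k1.constant_value  # scaling of the RBF term
length_scale = fitted.k1.k2.length_scale       # RBF width
noise_level = fitted.k2.noise_level            # white-noise variance

print("signal variance: %.3f" % signal_variance)
print("length scale:    %.3f" % length_scale)
print("noise level:     %.3f" % noise_level)

# The same values are also available as a flat, log-transformed vector.
print("log-hyperparameters:", fitted.theta)
```

The `theta` vector is what the optimizer actually works on, which is why it holds the logarithm of each hyperparameter.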

### AtoML

In [None]:
kdict = {
    'k1': {'type': 'gaussian', 'width': [0.5], 'scaling': 1.},
}

stime = time.time()
gp = GaussianProcess(
    kernel_dict=kdict, regularization=1e-1, train_fp=X, train_target=y,
    optimize_hyperparameters=True, scale_data=False)
print("Time for atoml fitting: %.3f" % (time.time() - stime))

stime = time.time()
y_atoml = gp.predict(test_fp=X_plot, uncertainty=True)
print("Time for atoml prediction: %.3f" % (time.time() - stime))
y_atoml = y_atoml['prediction']

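AtoML trains this simple model about as quickly as sklearn. However, it is a lot slower at making predictions on the unseen test data. Note that the AtoML call above also requests uncertainties (`uncertainty=True`), whereas the sklearn call does not (`return_std=False`), so the two prediction timings are not strictly like-for-like.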
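
When comparing prediction timings, it matters whether uncertainties are requested along with the mean. As a rough sketch of how this could be checked on the sklearn side, the snippet below times `predict` with and without the standard deviation. The helper `time_call` and the `*_demo` data are hypothetical and only for illustration; no particular timing outcome is implied.

```python
import time

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def time_call(func, *args, **kwargs):
    """Return (result, elapsed seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Small synthetic data set (illustrative only, not the benchmark data).
rng = np.random.RandomState(0)
X_demo = 15 * rng.rand(100, 1)
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.randn(100)
X_test_demo = np.linspace(0, 20, 2000)[:, None]

gpr_demo = GaussianProcessRegressor(
    kernel=1.0 * RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-1))
gpr_demo.fit(X_demo, y_demo)

# Mean prediction only.
mean_only, t_mean = time_call(
    gpr_demo.predict, X_test_demo, return_std=False)

# Mean prediction plus predictive standard deviation.
(mean, std), t_std = time_call(
    gpr_demo.predict, X_test_demo, return_std=True)

print("predict without std: %.4f s" % t_mean)
print("predict with std:    %.4f s" % t_std)
```

A single call gives a noisy timing; repeating each call a few times and taking the minimum would give a more stable number.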

In [None]:
print(gp.kernel_dict, 'regularization:', gp.regularization)

### Comparison

In [None]:
# Plot results
plt.figure(figsize=(10, 5))
lw = 2
plt.scatter(X, y, c='k', label='data')
plt.plot(X_plot, np.sin(X_plot), color='navy', lw=lw, label='True')
plt.plot(X_plot, y_gpr, color='turquoise', lw=lw,
         label='sklearn')
plt.plot(X_plot, y_atoml, color='darkorange', lw=lw,
         label='atoml')
plt.xlabel('data')
plt.ylabel('target')
plt.xlim(0, 20)
plt.ylim(-4, 4)
plt.title('scikit-learn vs AtoML')
plt.legend(loc="best", scatterpoints=1, prop={'size': 8})

There are some small differences between the functions predicted by sklearn and AtoML, though agreement is generally good. The differences are likely due to the way the two codes optimize hyperparameters.
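
To put a number on the agreement rather than judging it from the plot, one option is to compute the root-mean-square and maximum absolute difference between the two prediction arrays. The helper below is a hypothetical sketch (the name `prediction_difference` is not part of AtoML or sklearn) and is shown on toy numbers.

```python
import numpy as np


def prediction_difference(pred_a, pred_b):
    """Return RMSE and max absolute difference between two predictions."""
    a = np.asarray(pred_a, dtype=float).ravel()
    b = np.asarray(pred_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("predictions must have the same number of points")
    diff = a - b
    return {
        'rmse': float(np.sqrt(np.mean(diff ** 2))),
        'max_abs': float(np.max(np.abs(diff))),
    }


# Toy example: the two arrays differ only in the last point.
demo = prediction_difference([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
print(demo)
```

In the notebook this could be called as `prediction_difference(y_gpr, y_atoml)`, assuming both hold one prediction per test point.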

## Real Data <a name="real-data"></a>
[(Back to top)](#head)

We can run the same comparison with a more realistic data set. In the following, we import approximately 1300 nanoparticle atoms objects and generate feature vectors approximately 150 in length. 800 data points are used to train the model and a separate set of 272 data points is used to test.

In [None]:
# Connect ase atoms database.
gadb = DataConnection('../data/gadb.db')

# Get all relaxed candidates from the db file.
all_cand = gadb.get_all_relaxed_candidates(use_extinct=False)

testset = get_unique(atoms=all_cand, size=272, key='raw_score')

trainset = get_train(atoms=all_cand, size=800, taken=testset['taken'],
                     key='raw_score')

generator = FeatureGenerator(atom_types=[78, 79], nprocs=1)

train_features = generator.return_vec(trainset['atoms'], [generator.eigenspectrum_vec])
train_targets = trainset['target']

test_features = generator.return_vec(testset['atoms'], [generator.eigenspectrum_vec])
test_targets = testset['target']

vec_names = generator.return_names([generator.eigenspectrum_vec])

### scikit-learn

In [None]:
gp_kernel = 1. * RBF(length_scale=1.) + WhiteKernel(1e-1)
gpr = GaussianProcessRegressor(kernel=gp_kernel)
stime = time.time()
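# Standardize features using statistics from the training set only.
# Note: the scaling is inside the timed region, and the scaled features
# overwrite the originals for all later cells.
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)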
gpr.fit(train_features, train_targets)
print("Time for sklearn fitting: %.3f" % (time.time() - stime))

stime = time.time()
y_gpr = gpr.predict(test_features, return_std=False)
print("Time for sklearn prediction: %.3f" % (time.time() - stime))

In [None]:
print(gpr.kernel_)

### AtoML

In [None]:
kdict = {
    'k1': {'type': 'gaussian', 'width': 1., 'scaling': 1., 'dimension': 'single'},
}

stime = time.time()
gp = GaussianProcess(
    kernel_dict=kdict, regularization=1e-1, train_fp=train_features, train_target=train_targets,
    optimize_hyperparameters=True, scale_data=True)
print("Time for atoml fitting: %.3f" % (time.time() - stime))

stime = time.time()
y_atoml = gp.predict(test_fp=test_features, uncertainty=True)
print("Time for atoml prediction: %.3f" % (time.time() - stime))
y_atoml = y_atoml['prediction']

In [None]:
print(gp.kernel_dict, 'regularization:', gp.regularization)

### Comparison

In [None]:
# Parity plot: predicted vs. true targets
plt.figure(figsize=(10, 5))
plt.scatter(test_targets, test_targets, color='navy', label='True')
plt.scatter(test_targets, y_gpr, color='turquoise',
            label='sklearn', alpha=0.8)
plt.scatter(test_targets, y_atoml, color='darkorange',
            label='atoml', alpha=0.8)
plt.xlabel('true target')
plt.ylabel('predicted target')
plt.title('scikit-learn vs AtoML')
plt.legend(loc="best", scatterpoints=1, prop={'size': 8})

## Feature Dimensionality <a name="feature-dimensionality"></a>
[(Back to top)](#head)

Optimizing a single width parameter shared across all dimensions of the feature space is much faster than optimizing one per feature. For the squared exponential kernel, the width (length scale) defines the balance between local and global influence over the feature space. In sklearn, the RBF kernel defaults to a single length scale for all features. In AtoML, the default is to optimize a length scale for each feature within the feature space. The choice is a trade-off: per-feature length scales give a more flexible model, but there are more hyperparameters to optimize, so training is slower. The following provides a benchmark between the two situations.
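
For reference, the same distinction can be expressed in sklearn by passing either a scalar or an array as `length_scale` to `RBF`. The sketch below uses a small synthetic data set (the `*_demo` names and the data are illustrative assumptions) and only counts the hyperparameters that are optimized in each case. It is not a timing benchmark.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic data with three features, of which only the first is relevant.
rng = np.random.RandomState(0)
n_features = 3
X_demo = rng.rand(60, n_features)
y_demo = np.sin(6 * X_demo[:, 0]) + 0.05 * rng.randn(60)

# A scalar length scale is shared by all features (isotropic kernel).
iso_kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)
# An array gives one length scale per feature (anisotropic kernel).
ard_kernel = (RBF(length_scale=np.ones(n_features))
              + WhiteKernel(noise_level=1e-2))

gpr_iso = GaussianProcessRegressor(kernel=iso_kernel).fit(X_demo, y_demo)
gpr_ard = GaussianProcessRegressor(kernel=ard_kernel).fit(X_demo, y_demo)

# Number of optimized hyperparameters: length scale(s) plus the noise level.
n_iso = gpr_iso.kernel_.theta.size
n_ard = gpr_ard.kernel_.theta.size

print("isotropic hyperparameters:  ", n_iso)
print("anisotropic hyperparameters:", n_ard)
print("per-feature length scales:  ", gpr_ard.kernel_.k1.length_scale)
```

With roughly 150 features, as in the real data above, the per-feature variant has about 150 length scales to optimize instead of one, which is where the extra training cost comes from.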

In [None]:
kdict = {
    'k1': {
        'type': 'gaussian', 'width': 1., 'scaling': 1., 'dimension': 'single'},
    }

stime = time.time()
gp = GaussianProcess(train_fp=train_features, train_target=train_targets,
                     kernel_dict=kdict, regularization=1e-2,
                     optimize_hyperparameters=True, scale_data=True)
print('training single: {0:.2f}s'.format(time.time() - stime))

pred_single = gp.predict(test_fp=test_features)

error = get_error(pred_single['prediction'],
                  test_targets)['rmse_average']

print('Error from single dimension: {0:.3f}'.format(error))

In [None]:
kdict = {
    'k1': {
        'type': 'gaussian', 'width': 1., 'scaling': 1., 'dimension': 'features'},
    }

stime = time.time()
gp = GaussianProcess(train_fp=train_features, train_target=train_targets,
                     kernel_dict=kdict, regularization=1e-2,
                     optimize_hyperparameters=True, scale_data=True)
print('training features: {0:.2f}s'.format(time.time() - stime))

pred_features = gp.predict(test_fp=test_features)

error = get_error(pred_features['prediction'],
                  test_targets)['rmse_average']

print('Error from features dimension: {0:.3f}'.format(error))

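# Parity plot comparing the two length-scale settings
plt.figure(figsize=(10, 10))
plt.plot(test_targets, pred_single['prediction'], 'o', c='b', alpha=0.5,
         label='single')
plt.plot(test_targets, pred_features['prediction'], 'o', c='r', alpha=0.5,
         label='features')
plt.xlabel('true target')
plt.ylabel('predicted target')
plt.legend(loc='best')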