# Dataset roughness computation

In this notebook we show how to use TopSearch to compute the roughness of a given dataset. We use the energy landscape framework to compute the topographical mapping of the dataset for each pair of its features. From these weighted graphs we compute the average frustration metric, which is a good approximation for dataset roughness. We perform this analysis for several example molecular datasets and illustrate the correlation between frustration and model error. This shows that we can estimate the modellability of a particular representation before fitting a machine learning model.

## Imports

First we import topsearch, which contains all the functionality we need for this example.

In [None]:
import topsearch

## Initialise classes

Landscape generation requires several classes to be instantiated. We list each of them in turn here, and describe their functionality.
First, we generate the potential class, which calculates the function value and its derivatives for a given set of coordinates. In this example the class extracts a molecular dataset for the property of microsomal clearance, as generated by AstraZeneca. We interpolate this dataset with the smoothness parameter smoothness, which is set to almost zero in order to interpolate the known dataset exactly. We extract n_data data points and normalise both the response and all the features of the training data. Furthermore, we flip the data so that the problem becomes a minimisation, for convenience. Finally, we set feature_subset and pass chosen_features to specify the subset of features to consider.

In [None]:
mol_interpolation = topsearch.potential.MolecularInterpolation(mol_property='Clearance_Microsome_AZ',
                                                               norm_response=True,
                                                               norm_training=True,
                                                               flip_data=True,
                                                               n_data=500,
                                                               smoothness=1e-5,
                                                               feature_subset=True,
                                                               chosen_features=i)

Next we initialise the comparison class, which is responsible for computing the distance between points in feature space and deciding whether they are the same. We may generate repeated minima and transition states, and the functions of this class allow us to filter out repeats and keep only unique stationary points. We are required to specify a distance_criterion and an energy_criterion within which two minima or transition states are considered the same. The proportional_distance option scales the distance_criterion to reflect the total range of the feature space. In this case it is active, so any two points within 5% of the feature space range and within the energy_criterion of each other are treated as the same.
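
To make the comparison concrete, here is a minimal sketch of such a test written with NumPy. It is not the TopSearch implementation: the helper name `is_same_point`, the explicit `bounds` argument, and the choice to scale the criterion by the diagonal of the bounding box are all assumptions made for illustration.

```python
import numpy as np


def is_same_point(coords_a, coords_b, energy_a, energy_b, bounds,
                  distance_criterion=0.05, energy_criterion=1e-2,
                  proportional_distance=True):
    """Hypothetical helper: decide whether two stationary points coincide."""
    coords_a = np.asarray(coords_a, dtype=float)
    coords_b = np.asarray(coords_b, dtype=float)
    lower, upper = np.asarray(bounds, dtype=float).T
    threshold = distance_criterion
    if proportional_distance:
        # Assumption: scale the criterion by the diagonal of the bounding box
        threshold *= np.linalg.norm(upper - lower)
    close_in_space = np.linalg.norm(coords_a - coords_b) < threshold
    close_in_energy = abs(energy_a - energy_b) < energy_criterion
    return bool(close_in_space and close_in_energy)


bounds = [(0.0, 1.0), (0.0, 1.0)]
# Nearby points with similar function values are treated as repeats
print(is_same_point([0.10, 0.10], [0.12, 0.11], -0.500, -0.495, bounds))
# Distant points are kept as distinct stationary points
print(is_same_point([0.10, 0.10], [0.60, 0.70], -0.500, -0.495, bounds))
```

Both criteria must be satisfied, so two points that are close in feature space but differ in function value by more than the energy_criterion remain distinct.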

In [None]:
comparer = topsearch.similarity.NonAtomicSimilarity(potential=mol_interpolation,
                                                    distance_criterion=0.05,
                                                    energy_criterion=1e-2,
                                                    proportional_distance=True)

We provide these two instances to the KineticTransitionNetwork class. This contains the networkx object in which all the minima and transition states are stored as a weighted graph. This class controls all access to the network and performs operations to extract information and analyse it.

In [None]:
ktn = topsearch.kinetic_transition_network.KineticTransitionNetwork(potential=mol_interpolation,
                                                                    similarity=comparer)

Additionally, we need a local minimiser, as all these methods rely on locating local minima. Our local minimiser class provides a wrapper to the scipy box-constrained LBFGS implementation that is adapted to the different optimisation tasks we perform. We need to provide it with the function to minimise, the maximum number of steps it can take (n_steps), the LBFGS history size (history_size) and the gradient threshold below which the minimisation is considered converged (conv_crit).
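
For orientation, the underlying scipy routine can be called directly as sketched below. The quadratic test function is made up, and the mapping of n_steps, history_size and conv_crit onto the scipy options `maxiter`, `maxcor` and `gtol` is our assumption about how a wrapper might forward them, not a statement about the TopSearch code.

```python
import numpy as np
from scipy.optimize import minimize


def quadratic(x):
    """Toy function with a single minimum at (0.3, 0.3)."""
    return float(np.sum((x - 0.3) ** 2))


def quadratic_grad(x):
    """Analytical gradient of the toy function."""
    return 2.0 * (x - 0.3)


result = minimize(quadratic,
                  x0=np.array([0.9, 0.1]),
                  jac=quadratic_grad,
                  method="L-BFGS-B",
                  bounds=[(0.0, 1.0), (0.0, 1.0)],   # box constraints
                  options={"maxiter": 100,   # assumed analogue of n_steps
                           "maxcor": 5,      # assumed analogue of history_size
                           "gtol": 1e-6})    # assumed analogue of conv_crit
print(result.x, result.fun, result.success)
```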

In [None]:
minimiser = topsearch.minimisation.LBFGS(potential=mol_interpolation,
                                         conv_crit=1e-6,
                                         history_size=5,
                                         n_steps=100)

Locating transition states requires a combination of single-ended and double-ended searches. Double-ended searches take two minima as input and attempt to find the lowest-valued path between them. Single-ended searches start from a single point and follow the eigenvector corresponding to the most negative eigenvalue until they converge to a transition state. These two searches are used in tandem, with an initial double-ended search followed by a single-ended search applied to each local maximum on the path. Here, we use the nudged elastic band as the double-ended search method. NudgedElasticBand contains the methods to produce an initial path and optimise it to minimise its overall value. We specify it using three parameters: force_constant determines the tightness of the elastic band and is updated within the computation; image_density determines how many points the path is composed of per unit distance, where more points usually give a better path at a higher computational cost; max_images caps the number of images to limit the computational cost.
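
The interplay of image_density and max_images can be illustrated with a short sketch that builds an initial straight-line band between two minima. The function `initial_band` is hypothetical, and both the rounding rule and the decision to count the end points as images are assumptions.

```python
import numpy as np


def initial_band(start, end, image_density=5.0, max_images=50):
    """Hypothetical helper: linear interpolation between two minima."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    distance = np.linalg.norm(end - start)
    # Images per unit distance, rounded up (assumed rounding rule)
    n_images = int(np.ceil(image_density * distance))
    # Keep at least the two end points and never exceed the cap
    n_images = max(2, min(n_images, max_images))
    return np.linspace(start, end, n_images)


band = initial_band([0.0, 0.0], [1.0, 1.0])
dense_band = initial_band([0.0, 0.0], [1.0, 1.0], image_density=100.0)
print(band.shape, dense_band.shape)
```

In the second call the density would request more images than allowed, so the cap takes effect.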

In [None]:
nudged_elastic_band = \
    topsearch.double_ended_search.NudgedElasticBand(potential=mol_interpolation,
                                                    minimiser=minimiser,
                                                    force_constant=50.0,
                                                    image_density=5.0,
                                                    max_images=50)

The single-ended search method is hybrid eigenvector-following. This class takes a single point and converges it to the nearest transition state. We provide it with a convergence criterion, conv_crit: the gradient must fall below this value for the point to be considered converged to a stationary point. We also provide ts_steps, the number of mode-following steps allowed before the search is considered to have failed. Each transition state is connected to two minima by following the steepest-descent paths along the unique downhill direction, and pushoff specifies the distance we move in the downhill direction before beginning a local minimisation. Finally, max_uphill_step_size and positive_eigenvalue_step limit the step sizes used in the mode-following to prevent the steps being too large.
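
The role of the downhill direction and of pushoff can be seen on a toy double well. The sketch below is not hybrid eigenvector-following itself: for simplicity it diagonalises a finite-difference Hessian at a known saddle point, whereas hybrid methods typically find the lowest eigenvector iteratively. The function `double_well` and the helper `numerical_hessian` are made up for this illustration.

```python
import numpy as np


def double_well(p):
    """Toy surface: minima at (-1, 0) and (1, 0), saddle at (0, 0)."""
    x, y = p
    return (x ** 2 - 1.0) ** 2 + y ** 2


def numerical_hessian(func, point, step=1e-4):
    """Central finite-difference Hessian (illustrative helper)."""
    point = np.asarray(point, dtype=float)
    dim = point.size
    hessian = np.zeros((dim, dim))
    for i in range(dim):
        for j in range(dim):
            e_i = np.zeros(dim)
            e_j = np.zeros(dim)
            e_i[i] = step
            e_j[j] = step
            hessian[i, j] = (func(point + e_i + e_j) - func(point + e_i - e_j)
                             - func(point - e_i + e_j)
                             + func(point - e_i - e_j)) / (4.0 * step ** 2)
    return hessian


saddle = np.array([0.0, 0.0])
eigenvalues, eigenvectors = np.linalg.eigh(numerical_hessian(double_well, saddle))
downhill = eigenvectors[:, 0]   # eigenvector of the single negative eigenvalue

pushoff = 1e-1
# Step off the saddle in both senses of the downhill direction; a local
# minimisation from each point would then locate the two connected minima.
start_plus = saddle + pushoff * downhill
start_minus = saddle - pushoff * downhill
print(eigenvalues, start_plus, start_minus)
```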

In [None]:
hybrid_eigenvector_following = \
    topsearch.single_ended_search.HybridEigenvectorFollowing(potential=mol_interpolation,
                                                             minimiser=minimiser,
                                                             conv_crit=1e-4,
                                                             ts_steps=100,
                                                             pushoff=1e-1,
                                                             max_uphill_step_size=0.5,
                                                             positive_eigenvalue_step=1.0)

For global optimisation algorithms we need to be able to propose new structures from the existing ones. The efficiency of global optimisation relies on the proposal of good candidate positions. For molecular systems this is an involved problem with a lot of research invested into it. For the non-atomic system here the step-taking is much simpler, as it can just be random perturbations. This class manages the perturbations that propose new states. It is given the maximum step size, max_displacement, and with proportional_distance we specify that this distance should be measured as a proportion of the bounds range.
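
A random perturbation of this kind might look like the following sketch, which uses only the standard library. The function `perturb`, the uniform distribution of the step, and the clipping to the bounds are assumptions for illustration rather than the behaviour of NonAtomicPerturbation.

```python
import random


def perturb(coords, bounds, max_displacement=1.0,
            proportional_distance=True, rng=random):
    """Hypothetical helper: propose a new point by a bounded random step."""
    new_coords = []
    for value, (lower, upper) in zip(coords, bounds):
        # Measure the step as a fraction of the range when requested
        scale = max_displacement
        if proportional_distance:
            scale *= (upper - lower)
        step = rng.uniform(-scale, scale)
        # Assumption: proposals outside the box are clipped back to it
        new_coords.append(min(max(value + step, lower), upper))
    return new_coords


rng = random.Random(0)
bounds = [(0.0, 1.0), (-2.0, 2.0)]
proposal = perturb([0.5, 0.0], bounds, rng=rng)
small_step = perturb([0.5, 0.0], bounds, max_displacement=0.01, rng=rng)
print(proposal, small_step)
```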

In [None]:
step_taking = topsearch.perturbations.NonAtomicPerturbation(potential=mol_interpolation,
                                                            max_displacement=1.0,
                                                            proportional_distance=True)

Next we initialise the global optimisation class. The global optimisation algorithm is basin-hopping, which is provided with the step-taking class previously created. Basin-hopping steps around the surface performing local minimisations, and subsequently accepts or rejects each new local minimum based on a Metropolis-like criterion. The BasinHopping class performs basin-hopping runs consisting of n_steps random perturbations and local minimisations, with a temperature specified to control the acceptance of new minima.
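
The acceptance step can be written in a few lines. This sketch shows the standard Metropolis rule; whether BasinHopping uses exactly this form is an assumption, and the function name `metropolis_accept` is ours.

```python
import math
import random


def metropolis_accept(current_value, new_value, temperature, rng):
    """Accept downhill moves always, uphill moves with Boltzmann probability."""
    if new_value <= current_value:
        return True
    probability = math.exp(-(new_value - current_value) / temperature)
    return rng.random() < probability


rng = random.Random(42)
n_trials = 10000
# Fraction of accepted moves for an uphill step of 1.0 at temperature 1.0
accepted = sum(metropolis_accept(0.0, 1.0, 1.0, rng) for _ in range(n_trials))
print(accepted / n_trials)
```

A higher temperature accepts more uphill moves and so explores more widely, while a temperature close to zero reduces the run to repeated downhill acceptance.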

In [None]:
optimiser = topsearch.global_optimisation.BasinHopping(ktn=ktn,
                                                       minimiser=minimiser,
                                                       n_steps=750,
                                                       temperature=1.0,
                                                       step_taking=step_taking)

Finally, we feed many of these objects into a NetworkSampling object that controls all the landscape generation. This object allows simple calls to be made that perform the combination of algorithms for landscape generation. We pass it the global_optimiser and the transition state location methods. Transition state location simply requires two minima, and therefore the sampling of the landscape is embarrassingly parallel. We have the option to use multiple processes to accelerate the landscape calculation. Here we enable multiprocessing and run on four processes.
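
Why pairwise searches parallelise so easily can be seen in a toy sketch: each pair of minima is an independent task. The example uses a thread pool from the standard library purely to stay self-contained, whereas a real calculation would use separate processes. The function `highest_point_between` is a made-up stand-in for a connection attempt, not a TopSearch call.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_function(x):
    """One-dimensional toy surface with minima at -1, 0 and 1."""
    return x ** 2 * (x ** 2 - 1.0) ** 2


def highest_point_between(pair, n_points=21):
    """Stand-in task: highest value on the straight line between two minima."""
    a, b = pair
    points = [a + (b - a) * k / (n_points - 1) for k in range(n_points)]
    return max(toy_function(p) for p in points)


minima = [-1.0, 0.0, 1.0]
pairs = list(combinations(minima, 2))   # every pair is an independent task

with ThreadPoolExecutor(max_workers=4) as executor:
    barriers = list(executor.map(highest_point_between, pairs))

for pair, barrier in zip(pairs, barriers):
    print(pair, barrier)
```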

In [None]:
explorer = topsearch.exploration.NetworkSampling(ktn=ktn,
                                                 minimiser=minimiser,
                                                 global_optimiser=optimiser,
                                                 single_ended_search=hybrid_eigenvector_following,
                                                 double_ended_search=nudged_elastic_band,
                                                 multiprocessing=True,
                                                 n_processes=4)

## 1. Compute the topographical representation

First, we will compute the topographical representation for each pair of features. We construct the interpolation for each 

In [None]:
for i in 

We will then look through some of the data.

## 2. Compute the frustration metric for the structure-property relationship

In [None]:
ktn.read_network()

## 3. Compute the regression error

We perform regression of the same dataset using a simple neural network.