In [None]:
import matplotlib.pyplot as plt
%matplotlib widget
import numpy as np
import scipy as sp
import sklearn
import matplotlib as mpl
import chemiscope
from widget_code_input import WidgetCodeInput
from ipywidgets import Textarea, Layout
from iam_utils import *
import ase
from ase.io import read, write
from ase.calculators import lj, eam
from tqdm.notebook import tqdm

In [None]:
#### AVOID folding of output cell 

In [None]:
%%html

<style>
.output_wrapper, .output {
    height:auto !important;
    max-height:5000px;  /* your desired max-height here */
}
.output_scroll {
    box-shadow:none !important;
    webkit-box-shadow:none !important;
}
</style>

In [None]:
data_dump = WidgetDataDumper(prefix="module_07")
display(data_dump)

In [None]:
module_summary = Textarea("general comments on this module", layout=Layout(width="100%"))
data_dump.register_field("module-summary", module_summary, "value")
display(module_summary)

_References: [Nature 559, 547–555 (2018)](https://www.nature.com/articles/s41586-018-0337-2)
[J. Chem. Phys. 150, 150901 (2019)](https://doi.org/10.1063/1.5091842)_

# Data-driven modeling 

This module provides a very brief and over-simplified primer on "data-driven" modeling. 
In abstract terms, a data-driven approach attempts to establish a relationship between _input_ data and _target_ properties (or to recognize patterns in the data itself) without using deductive reasoning, i.e. without proceeding through a series of logical steps starting from a hypothesis on the physical behavior of a system. 

Instead, the empirical association between inputs and targets is taken as the only basis to establish an _inductive_ relationship between them. The traditional scientific method proceeds through a combination of induction and deduction, while data-driven approaches are intended to be entirely inductive. On the risks of purely inductive reasoning, see [Bertrand Russell's inductivist chicken story](http://www.ditext.com/russell/rus6.html). In practice, _inductive biases_ are often included in the modeling, by means of the choices that are made in the construction and the tuning of the model itself: this is how a component of physics-inspired (deductive) concepts can make it back into machine learning. 

As the most primitive example of data-driven modeling, consider the case of _linear regression_. 
A set of $n_\mathrm{train}$ data points and targets $\{x_i, y_i\} $ are assumed to follow a linear relationship of the form $y(x)=a x$, where the slope $a$ is an adjustable parameter. 
For a given value of $a$, one can compute the _loss_, i.e. the mean squared error between the true value of the targets and the predictions of the model,

$$
\ell = \frac{1}{n_\mathrm{train}} \sum_i |y(x_i)-y_i|^2 
$$

This widget allows you to play around with the core idea of linear regression: by adjusting the value of $a$ you can minimize the discrepancy between predictions and targets, and find the best model within the class chosen to represent the input-target relationship.

In [None]:
np.random.seed(1234)
lr_x = (np.random.uniform(size=20)-0.5)*10
lr_y = 2.33*lr_x+(np.random.uniform(size=20)-0.5)*2
def lr_plot(ax, a):    
    ax.plot(lr_x, lr_y, 'b.')
    ax.plot([-5,5],[-5*a,5*a], 'r--')
    l = np.mean((lr_y-a*lr_x)**2)
    ax.text(-4,8,rf'$\ell = ${l:.3f}')  # raw f-string: keeps the LaTeX backslash literal
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
wp_lr = WidgetPlot(lr_plot, WidgetParbox(a=(1.0, -5.0, 5.0, 0.1, r'$a$')))
display(wp_lr)

<span style="color:blue">**01** What is (roughly) the best value of $a$ that minimizes the loss in the linear regression model? </span>

In [None]:
ex1_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex1-answer", ex1_txt, "value")
display(ex1_txt)

In a linear regression model, the loss can be minimized in closed form: given that $\partial \ell/\partial a \propto \sum_i x_i(a x_i -y_i)$, setting the derivative to zero shows that the minimum occurs for $a = \sum_i x_i y_i / \sum_i x_i^2.$
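
As a quick numerical check of the closed-form expression above, here is a minimal sketch. It assumes synthetic data generated with the same recipe as the widget; the names `a_best` and `mse` are illustrative and not part of the module.

```python
import numpy as np

# Synthetic data, generated with the same recipe as the widget above
rng = np.random.RandomState(1234)
x = (rng.uniform(size=20) - 0.5) * 10
y = 2.33 * x + (rng.uniform(size=20) - 0.5) * 2

# Closed-form minimizer of the loss for the model y(x) = a x
a_best = np.sum(x * y) / np.sum(x * x)


def mse(a):
    """Mean squared error of the model y(x) = a x on the data above."""
    return np.mean((a * x - y) ** 2)


print(f"a_best = {a_best:.3f}, loss = {mse(a_best):.3f}")
# Moving away from a_best in either direction can only increase the loss
print(f"loss at a_best +/- 0.1: {mse(a_best + 0.1):.3f}, {mse(a_best - 0.1):.3f}")
```

Because the loss is a parabola in $a$, the value found this way is the global minimum, which is what one approaches by hand with the slider.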

This approach can be easily generalized to more complex models: in the most general terms, $\ell$ can be minimized numerically, by computing the derivatives of $y(x)$ with respect to the model parameters. 
Here we consider the simpler case of a polynomial model, in which $y(x)=\sum_k w_k x^k$. This can actually be seen as a special case of multi-dimensional linear regression, where each sample is described by several _features_ (or _descriptors_), in this case $x_{ik}=x_i^k$. 

_NB: this is a very bad choice of a polynomial basis to expand the function (most notably, because the different polynomials are not orthogonal). We are just doing this as a simple example, never try this for a real problem!_

Play around with the widget below. It is _really_ difficult to fit the model manually!

In [None]:
npoly = 5
np.random.seed(12345)
pr_x = (np.random.uniform(size=20)-0.5)*10
pr_y = (np.random.uniform(size=20)-0.5)*3
pr_w = [3, 1, 1, -0.3, -0.05, 0.01]
for k in range(len(pr_w)):
    pr_y += pr_w[k]*pr_x**k
    
def pr_plot(ax, **w):    
    
    xx = np.linspace(-5, 5, 60)
    yy = np.zeros(len(xx))
    lw = list(w.values())
    pr_X = np.vstack( [pr_x**k for k in range(6)]).T
    fit_w = np.linalg.lstsq(pr_X, pr_y, rcond=None)[0]
    my = pr_x*0
    ty = np.zeros(len(xx))
    fy = np.zeros(len(xx))
    for k in range(len(lw)):
        yy += lw[k]*xx**k
        my += lw[k]*pr_x**k
        ty += pr_w[k]*xx**k
        fy += fit_w[k]*xx**k
        
    
    
    l = np.mean((pr_y-my)**2)
    ax.plot(pr_x, pr_y, 'b.', label="train data")
    ax.plot(xx, yy, 'r--', label="manual fit")
    ax.text(-4,-1,rf'$\ell = ${l:.3f}')  # raw f-string: keeps the LaTeX backslash literal
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.set_ylim(min(pr_y)-1, max(pr_y)+1)
wp_pr = WidgetPlot(pr_plot, WidgetParbox(
    w_0=(1.0, -5.0, 5.0, 0.01,  r'$w_0$'),
    w_1=(0.01, -2.0, 2.0, 0.01, r'$w_1$'),
    w_2=(0.01, -1.0, 1.0, 0.01, r'$w_2$'),
    w_3=(-0.2, -1.0, 1.0, 0.01, r'$w_3$'),
    w_4=(0.01, -0.1, 0.1, 0.01, r'$w_4$'),
    w_5=(0.01, -0.1, 0.1, 0.01, r'$w_5$')
))
display(wp_pr)

The loss can be written in vector form, $\ell \propto \sum_i | \mathbf{w}\cdot\mathbf{x}_i - y_i |^2$, where $\mathbf{x}_i$ collects the features $x_{ik}=x_i^k$ of the $i$-th sample. If $\mathbf{X}$ is the matrix collecting the $\mathbf{x}_i$ as rows and $\mathbf{y}$ is the vector collecting the targets, a closed-form solution for the weight vector can be derived as

$$
\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.
$$
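
The closed-form solution can be sketched in a few lines of NumPy. The data and the coefficients `w_true` below are hypothetical and chosen only for illustration. Rather than forming the inverse explicitly, the sketch solves the linear system $(\mathbf{X}^T\mathbf{X})\mathbf{w} = \mathbf{X}^T\mathbf{y}$ and compares the result with `np.linalg.lstsq`, which is usually the numerically safer route.

```python
import numpy as np

rng = np.random.RandomState(0)
x = (rng.uniform(size=30) - 0.5) * 4
w_true = np.array([1.0, -2.0, 0.5])  # hypothetical polynomial coefficients

# Feature matrix: X[i, k] = x_i ** k
X = np.vstack([x**k for k in range(len(w_true))]).T
y = X @ w_true + 0.1 * rng.normal(size=len(x))

# Normal equations: solve (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same least-squares problem, solved with an SVD-based routine
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print("normal equations:", w_normal)
print("lstsq:           ", w_lstsq)
```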

We can now start looking at more realistic issues that arise in the context of regression models. First, data can contain a certain level of _noise_. This can be actual noise, or (often) hidden input features or relationships that cannot be captured by the simplified model. Second, a model that predicts the targets only for the data it has been trained on is of very little use: we want to be able to make real predictions!
For this reason, it is customary to set aside a fraction of the available data that is not used to determine the weights that minimize $\ell$. The error on this _test set_ is an indication of how good the model would be when predicting a new point. 
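
A minimal sketch of the train/test idea, assuming a hypothetical noisy cubic target and an even/odd split of the samples (any split that keeps the test points out of the fit would do):

```python
import numpy as np

rng = np.random.RandomState(42)
n = 40
x = (rng.uniform(size=n) - 0.5) * 10
# Hypothetical target: a cubic polynomial plus uniform noise in [-2.5, 2.5]
y = 3 + x + x**2 - 0.3 * x**3 + (rng.uniform(size=n) - 0.5) * 5

# Fifth-order polynomial features, X[i, k] = x_i ** k
X = np.vstack([x**k for k in range(6)]).T

# Even-indexed samples are used for training, odd-indexed ones are held out
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

# The weights are determined using the train set only
w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]


def rmse(X, y, w):
    """Root mean square error of the linear model X @ w against targets y."""
    return np.sqrt(np.mean((X @ w - y) ** 2))


rmse_train = rmse(X_train, y_train, w)
rmse_test = rmse(X_test, y_test, w)
print(f"RMSE train = {rmse_train:.3f}, RMSE test = {rmse_test:.3f}")
```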

In [None]:
npoly = 5
pr_w = [3, 1, 1, -0.3, -0.05, 0.01]

def pr2_plot(ax, tgt, fit, noise, hidden, npoints):    
    
    xx = np.linspace(-5, 5, 60)
    yy = np.zeros(len(xx))
    np.random.seed(12345)
    pr_x = (np.random.uniform(size=2*npoints)-0.5)*10
    pr_X = np.vstack( [pr_x**k for k in range(6)]).T
    pr_y = (np.random.uniform(size=len(pr_x))-0.5)*noise
    for k in range(len(pr_w)):
        pr_y += pr_w[k]*pr_x**k
    pr_y += hidden*np.sin(pr_x*4)
    
    fit_w = np.linalg.lstsq(pr_X[::2], pr_y[::2], rcond=None)[0]    
    my = pr_x*0
    ty = np.zeros(len(xx))
    fy = np.zeros(len(xx))
    pr_fy = np.zeros(len(pr_x))
    for k in range(len(pr_w)):                
        ty += pr_w[k]*xx**k
        fy += fit_w[k]*xx**k
        pr_fy += fit_w[k]*pr_x**k
    ty += hidden*np.sin(xx*4)
    
    
    l = np.mean((pr_y-pr_fy)[::2]**2)
    lte = np.mean((pr_y-pr_fy)[1::2]**2)
    ax.plot(pr_x[::2], pr_y[::2], 'b.', label="train data")
    ax.plot(pr_x[1::2], pr_y[1::2], 'kx', label="test data")    
    
    if tgt:
        ax.plot(xx, ty, 'b:', label="true target")
    if fit:
        ax.plot(xx, fy, 'b--', label="best fit")
    
    ax.set_ylim(min(ty)-1, max(ty)+1+noise/2)    
    ax.text(0.1,0.15,rf'$\mathrm{{RMSE}}_\mathrm{{train}} = ${np.sqrt(l):.3f}', transform=ax.transAxes, c='r')
    ax.text(0.1,0.05,rf'$\mathrm{{RMSE}}_\mathrm{{test}} = ${np.sqrt(lte):.3f}', transform=ax.transAxes, c='r')
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.legend(loc="upper right")
wp_pr2 = WidgetPlot(pr2_plot, WidgetParbox(    
    noise=(5.0, 0.1,10,0.1, 'Noise'),
    hidden=(0.0, 0, 5,0.01, 'Hidden', {"readout_format" : ".2f"}),
    npoints=(20, 5, 100, 1, r'$n_\mathrm{train}$'),
    tgt=(False, r'Show true target'),
    fit=(True, r'Show best fit'),
))
display(wp_pr2)

<span style="color:blue">**02a** Compare the error on the train and the test sets. Which is typically higher? How do train and test errors change when the number of training points is changed from the lowest to the highest level?  </span>

In [None]:
ex2a_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex2a-answer", ex2a_txt, "value")
display(ex2a_txt)

<span style="color:blue">**02b** How do the train and test loss change when the level of noise is increased? And how do they change when the level of hidden relationships is increased? Can you clearly distinguish the effect of noise and hidden terms? </span>

In [None]:
ex2b_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex2b-answer", ex2b_txt, "value")
display(ex2b_txt)

The tendency to achieve a very low loss on the train set but a much larger error on the test set is a general phenomenon known as [_overfitting_](https://en.wikipedia.org/wiki/Overfitting). Overfitting is usually particularly bad when the train set size is small, and/or the model contains many parameters. Polynomial regression is notorious for overfitting. 
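
The effect of the number of parameters can be sketched by fitting polynomials of increasing order to a small train set. The target function, noise level and set sizes below are assumptions made for illustration. The train error can only decrease as terms are added, since each model contains the previous one, while the test error typically stops improving and then grows.

```python
import numpy as np

rng = np.random.RandomState(7)
n_train, n_test = 12, 200
x_train = rng.uniform(-1, 1, size=n_train)
x_test = rng.uniform(-1, 1, size=n_test)


def target(x):
    """Hypothetical smooth target function."""
    return np.sin(3 * x)


y_train = target(x_train) + 0.2 * rng.normal(size=n_train)
y_test = target(x_test) + 0.2 * rng.normal(size=n_test)


def features(x, order):
    """Polynomial features x**0 ... x**order, one row per sample."""
    return np.vstack([x**k for k in range(order + 1)]).T


orders = list(range(1, 10))
rmse_train, rmse_test = [], []
for order in orders:
    # Fit on the train set only, then evaluate on both sets
    w = np.linalg.lstsq(features(x_train, order), y_train, rcond=None)[0]
    rmse_train.append(np.sqrt(np.mean((features(x_train, order) @ w - y_train) ** 2)))
    rmse_test.append(np.sqrt(np.mean((features(x_test, order) @ w - y_test) ** 2)))
    print(f"order {order}: train {rmse_train[-1]:.3f}  test {rmse_test[-1]:.3f}")
```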

A common strategy to avoid overfitting is known as _regularization_. In broad terms, regularization implies penalizing solutions of the model that are too rapidly varying, and favouring those that are smoother, even at the cost of a slight increase of the train set error. In linear regression, the most common approach is to introduce a [Tikhonov regularization](https://en.wikipedia.org/wiki/Tikhonov_regularization) term, that is, to write the loss as 

$$
\ell = \frac{1}{n_\mathrm{train}} \sum_i | \mathbf{w}\cdot \mathbf{x}_i - y_i | ^2 + \lambda |\mathbf{w}|^2.
$$

This expression (often referred to as an $L^2$-regularized least-squares fit, or ridge regression) yields a closed-form solution for the weights (with the factor $n_\mathrm{train}$ absorbed into the definition of $\lambda$), 

$$
\mathbf{w} = (\mathbf{X}^T\mathbf{X}+\lambda \mathbf{1})^{-1}\mathbf{X}^T\mathbf{y}.
$$
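
A minimal sketch of this closed-form solution, implemented literally as written above (so any factor of $n_\mathrm{train}$ is taken to be part of `lam`). The function name `ridge_fit` and the synthetic data are hypothetical, and the per-weight scaling used by the widget is deliberately left out.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * 1) w = X^T y for the weights w."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.RandomState(3)
x = rng.uniform(-1, 1, size=15)
X = np.vstack([x**k for k in range(6)]).T  # fifth-order polynomial features
y = np.sin(3 * x) + 0.2 * rng.normal(size=len(x))  # hypothetical noisy target

for lam in [1e-6, 1e-3, 1e0]:
    w = ridge_fit(X, y, lam)
    rmse = np.sqrt(np.mean((X @ w - y) ** 2))
    print(f"lambda = {lam:g}: |w| = {np.linalg.norm(w):.3f}, train RMSE = {rmse:.3f}")
```

Increasing `lam` shrinks the norm of the weights at the price of a larger train error, which is the trade-off that regularization is meant to control.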

This widget allows you to experiment with the effect of ridge regularization on the same polynomial fitting exercise. 

_NB: given that we are using a very poor basis, and different features span widely different scales, the underlying implementation is slightly more complicated, in that different weights are scaled differently before computing the Tikhonov term. This scaling is done so that a single parameter can be meaningfully used to control the regularity of the fit._ 

In [None]:
npoly = 5
pr_w = [3, 1, 1, -0.3, -0.05, 0.01]

def pr3_plot(ax, tgt, fit, noise, hidden, npoints, lam):    
    
    xx = np.linspace(-5, 5, 60)
    yy = np.zeros(len(xx))
    np.random.seed(54321)
    xsz = 10
    pr_x = (np.random.uniform(size=2*npoints)-0.5)*xsz
    pr_X = np.vstack( [pr_x**k for k in range(6)]).T
    pr_y = (np.random.uniform(size=len(pr_x))-0.5)*noise
    for k in range(len(pr_w)):
        pr_y += pr_w[k]*pr_x**k
    pr_y += hidden*np.sin(pr_x*4)
    wscale = np.asarray([1/(xsz*0.5)**k for k in range(npoly+1)])
    fit_w = np.linalg.solve(
        pr_X[::2].T@pr_X[::2]+
        10**lam*(npoints//2)*np.diag(wscale**2), 
        pr_X[::2].T@pr_y[::2])
    my = pr_x*0
    ty = np.zeros(len(xx))
    fy = np.zeros(len(xx))
    pr_fy = np.zeros(len(pr_x))
    for k in range(len(pr_w)):                
        ty += pr_w[k]*xx**k
        fy += fit_w[k]*xx**k
        pr_fy += fit_w[k]*pr_x**k
    ty += hidden*np.sin(xx*4)
    
    l = np.mean((pr_y-pr_fy)[::2]**2)
    lte = np.mean((pr_y-pr_fy)[1::2]**2)
    ax.plot(pr_x[::2], pr_y[::2], 'b.', label="train data")
    ax.plot(pr_x[1::2], pr_y[1::2], 'kx', label="test data")    
    
    if tgt:
        ax.plot(xx, ty, 'b:', label="true target")
    if fit:
        ax.plot(xx, fy, 'b--', label="best fit")
    
    ax.set_ylim(min(ty)-1, max(ty)+1+noise/2)    
    ax.text(0.1,0.25,rf'$\mathrm{{RMSE}}_\mathrm{{train}} = ${np.sqrt(l):.3f}', transform=ax.transAxes, c='r')
    ax.text(0.1,0.15,rf'$|\mathbf{{w}}| = ${np.linalg.norm(wscale*fit_w):.3f}', transform=ax.transAxes, c='r')
    ax.text(0.1,0.05,rf'$\mathrm{{RMSE}}_\mathrm{{test}} = ${np.sqrt(lte):.3f}', transform=ax.transAxes, c='r')
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.legend(loc="upper right")
wp_pr3 = WidgetPlot(pr3_plot, WidgetParbox(    
    noise=(5.0, 0.1,10,0.1, 'Noise'),
    hidden=(0.0, 0, 5,0.01, 'Hidden', {"readout_format" : ".2f"}),
    npoints=(10, 5, 100, 1, r'$n_\mathrm{train}$'),
    tgt=(True, r'Show true target'),
    fit=(True, r'Show best fit'),
    lam = (-5.0,-5,5,0.1, r'$\log_{10} \lambda$')
))
display(wp_pr3)

<span style="color:blue">**03a** Work with (noise, hidden, ntrain) = (5,0,10). What is the value of $\lambda$ that minimizes the _test_ error?  </span>

In [None]:
ex3a_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex3a-answer", ex3a_txt, "value")
display(ex3a_txt)

<span style="color:blue">**03b** Working with the same number of parameters, comment on the behavior of the best fit function and the various diagnostics as you vary the regularization away from the optimum value.  </span>

In [None]:
ex3b_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex3b-answer", ex3b_txt, "value")
display(ex3b_txt)

<span style="color:blue">**03c** Increase the number of training points to 100. How does the behavior of ridge regression change? Is the same value of $\lambda$ still optimal? </span>

In [None]:
ex3c_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex3c-answer", ex3c_txt, "value")
display(ex3c_txt)

The regularization $\lambda$ is one of the so-called _hyperparameters_ ("hypers"), that tune the behavior of the model but are not directly optimized on the train set. In this case, the number of polynomial terms is another hyperparameter. Optimizing the hyperparameters on the test set is bad practice, because this amounts to _data leakage_, and makes the test error less representative of the true generalization capabilities of the model. 

We won't get into details, but common strategies to optimize the hyperparameters are to set aside a _validation_ set that is used neither for training nor for testing, but only to tune the hyperparameter values, or to perform _cross validation_, which involves splitting the training set into train/validation parts and repeating the exercise over multiple _folds_. 
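
As an illustration only, here is a sketch of $k$-fold cross validation used to pick $\lambda$ for a ridge model. The helper names (`ridge_fit`, `kfold_rmse`), the synthetic data and the grid of $\lambda$ values are all assumptions. The samples are drawn at random, so contiguous folds are used without shuffling; no test set is touched at any point.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge weights from (X^T X + lam * 1) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def kfold_rmse(X, y, lam, n_folds=5):
    """Validation RMSE accumulated over n_folds train/validation splits."""
    idx = np.arange(len(y))
    sq_err = 0.0
    for val in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, val)  # all samples not in this fold
        w = ridge_fit(X[train], y[train], lam)
        sq_err += np.sum((X[val] @ w - y[val]) ** 2)
    return np.sqrt(sq_err / len(y))


rng = np.random.RandomState(11)
x = rng.uniform(-1, 1, size=30)
X = np.vstack([x**k for k in range(6)]).T
y = 2 + x - 2 * x**2 + 0.1 * rng.normal(size=len(x))  # hypothetical target

log_lams = np.linspace(-6, 2, 9)  # log10(lambda) from -6 to 2
cv_rmse = [kfold_rmse(X, y, 10.0**ll) for ll in log_lams]
best_log_lam = log_lams[int(np.argmin(cv_rmse))]
print(f"best log10(lambda) = {best_log_lam:.0f}, CV RMSE = {min(cv_rmse):.3f}")
```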

# Fingerprints and descriptors

The first step in any data-driven study of materials involves codifying the structure and composition of the materials being studied into a mathematical form that is suitable to be used as the input of the subsequent steps. Here we focus in particular on the definition of _fingerprints_, or _descriptors_ - a vector of numbers that are associated with each structure, assembled into a _feature vector_ $\mathbf{x}_i$. 

In this module we are going to use a dataset of materials from the [Materials Project](https://materialsproject.org/). The dataset has been reformatted as an extended XYZ file, which can be read, as usual, with the ASE `read` function. The target properties (mostly related to the elastic behavior) can be read from the `info` dictionary of each frame.

```
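from ase.io import read

# ":" selects all the frames in the file
data = read("filename.xyz", ":")
X = []  # feature vectors, one per structure
y = []  # target property, one per structure
for structure in data:
    X.append(get_features(structure.symbols, structure.cell, structure.positions))
    y.append(structure.info['property_name'])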
```

where `get_features` is a function that processes the structural information of each frame to convert it into a set of fingerprints. 

Descriptors can be either precise [_representations_](https://pubs.acs.org/doi/10.1021/acs.chemrev.1c00021) of the coordinates and chemical nature of all atoms in a structure (which are commonplace in the construction of machine-learning interatomic potentials) or fingerprints based on a combination of structural parameters and properties that can be computed easily. For instance, one could take the electronegativity of elements in combination with the point group of the crystal structure. 

Here we take a very simple (and somewhat crude) approach, describing each structure by its chemical composition - that is, the feature $x_{iZ}$ contains the fraction of the atoms of structure $i$ that have atomic number $Z$. For example, for Al$_2$O$_3$ the only non-zero features are $x_{i,13}=0.4$ (Al) and $x_{i,8}=0.6$ (O). 

<span style="color:blue">**04** Write a function that takes structural information for each frame and returns a vector containing the fractional composition of each compound, to be used as descriptors. </span>

In [None]:
# DIVYA: get them to write a function that loads the structures and 
# computes the vectors with the fractional composition. Demo should be a table with the symbols string and the vector

In [None]:
# DIVYA: get them to write a function that loads the structures, computes the vectors with the fractional composition.
# it should take as an argument the maximum polynomial order, and also compute the powers of the fractional composition,
# stacking them to form a larger feature vector

In [None]:
# this should be just a "custom" function where they can play around and compute their own fingerprints based on the structure

# Dimensionality reduction

We be

# Regression