In [None]:
import matplotlib.pyplot as plt
%matplotlib widget
import numpy as np
import scipy as sp
import sklearn
import matplotlib as mpl
import chemiscope
from widget_code_input import WidgetCodeInput
from ipywidgets import Textarea
from iam_utils import *
import ase
import functools
import copy
from ase.io import read, write
from ase.calculators import lj, eam
from tqdm.notebook import tqdm
from sklearn.linear_model import Ridge
from sklearn.decomposition import PCA

In [None]:
#### AVOID folding of output cell 

In [None]:
%%html

<style>
.output_wrapper, .output {
    height:auto !important;
    max-height:5000px;  /* your desired max-height here */
}
.output_scroll {
    box-shadow:none !important;
    webkit-box-shadow:none !important;
}
</style>

In [None]:
data_dump = WidgetDataDumper(prefix="module_07")
display(data_dump)

In [None]:
module_summary = Textarea("general comments on this module", layout=Layout(width="100%"))
data_dump.register_field("module-summary", module_summary, "value")
display(module_summary)

_References: [Nature 559, 547–555 (2018)](https://www.nature.com/articles/s41586-018-0337-2)
[J. Chem. Phys. 150, 150901 (2019)](https://doi.org/10.1063/1.5091842)_

# Data-driven modeling 

This module provides a very brief and over-simplified primer on "data-driven" modeling. 
In abstract terms, a data-driven approach attempts to establish a relationship between _input_ data and _target_ properties (or to recognize patterns in the data itself) without using deductive reasoning, i.e. without proceeding through a series of logical steps starting from a hypothesis on the physical behavior of a system. 

Instead, the empirical association between inputs and targets is taken as the only basis to establish an _inductive_ relationship between them: we only look at the correlations that the data reveal, without reasoning about causal links or seeking a coherent, elegant theory. 
The traditional scientific method proceeds through a combination of induction and deduction, while data-driven approaches are intended to be entirely inductive. On the risks of purely inductive reasoning, see [Bertrand Russell's inductivist chicken story](http://www.ditext.com/russell/rus6.html). 
In practice, _inductive biases_ are often included in the modeling, by means of the choices that are made in the construction and the tuning of the model itself: this is how a component of physics-inspired (deductive) concepts can make it back into machine learning. 

As the most primitive example of data-driven modeling, consider the case of _linear regression_. 
A set of $n_\mathrm{train}$ data points and targets $\{x_i, y_i\} $ are assumed to follow a linear relationship of the form $y(x)=a x$, where the slope $a$ is an adjustable parameter. 
For a given value of $a$, one can compute the _loss_, i.e. the mean squared error between the true values of the targets and the predictions of the model,

$$
\ell = \frac{1}{n_\mathrm{train}} \sum_i |y(x_i)-y_i|^2 
$$

This widget allows you to play around with the core idea of linear regression: by adjusting the value of $a$ you can minimize the discrepancy between predictions and targets, and find the best model within the class chosen to represent the input-target relationship.

In [None]:
np.random.seed(1234)
lr_x = (np.random.uniform(size=20)-0.5)*10
lr_y = 2.33*lr_x+(np.random.uniform(size=20)-0.5)*2
def lr_plot(ax, a):    
    ax.plot(lr_x, lr_y, 'b.')
    ax.plot([-5,5],[-5*a,5*a], 'r--')
    l = np.mean((lr_y-a*lr_x)**2)
    ax.text(-4,8,rf'$\ell = ${l:.3f}')
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
wp_lr = WidgetPlot(lr_plot, WidgetParbox(a=(1.0, -5.0, 5.0, 0.1, r'$a$')))
display(wp_lr)

<span style="color:blue">**01** What is (roughly) the best value of $a$ that minimizes the loss in the linear regression model? </span>

In [None]:
ex1_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex1-answer", ex1_txt, "value")
display(ex1_txt)

In a linear regression model, the loss can be minimized in closed form, by setting $\partial \ell/\partial a = 0$ and solving for $a$.

<span style="color:blue">**02** Write the expression for the optimal $a$ for a one-dimensional linear regression problem where the loss is optimized on pairs of inputs and targets $(x_i, y_i)$ </span>

In [None]:
ex2_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex2-answer", ex2_txt, "value")
display(ex2_txt)

This approach can be easily generalized to more complex models: in the most general terms, $\ell$ can be minimized numerically, using the derivatives of the loss (and hence of $y(x)$) with respect to the model parameters. 
Here we consider the simpler case of a polynomial model, in which $y(x)=\sum_k w_k x^k$. This can actually be seen as a special case of multi-dimensional linear regression, where each sample is described by several _features_ (or _descriptors_), in this case $x_{ik}=x_i^k$. 

_NB: this is a very bad choice of a polynomial basis to expand the function (most notably, because the different polynomials are not orthogonal). We are just doing this as a simple example, never try this for a real problem!_
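
As a minimal sketch of how the polynomial model maps onto multi-dimensional linear regression, the snippet below builds the feature matrix $x_{ik}=x_i^k$ and evaluates the model as a matrix-vector product. The sample points and weights are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical inputs: five sample points and the weights of a cubic model
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
w = np.array([3.0, 1.0, 1.0, -0.3])   # w_0 ... w_3

# Feature matrix with entries X[i, k] = x_i**k (one row per sample)
X = np.vander(x, N=len(w), increasing=True)

# Evaluating the polynomial is now a plain matrix-vector product
y_matrix = X @ w

# The same values obtained from the explicit sum over powers
y_sum = sum(w[k] * x**k for k in range(len(w)))

print(X.shape)
print(np.allclose(y_matrix, y_sum))
```

Writing the model as `X @ w` is what makes it possible to reuse the machinery of linear regression for any choice of features.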

Play around with the widget below. It is _really_ difficult to fit the model manually!

In [None]:
npoly = 5
np.random.seed(12345)
pr_x = (np.random.uniform(size=20)-0.5)*10
pr_y = (np.random.uniform(size=20)-0.5)*3
pr_w = [3, 1, 1, -0.3, -0.05, 0.01]
for k in range(len(pr_w)):
    pr_y += pr_w[k]*pr_x**k
    
def pr_plot(ax, **w):    
    
    xx = np.linspace(-5, 5, 60)
    yy = np.zeros(len(xx))
    lw = list(w.values())
    pr_X = np.vstack( [pr_x**k for k in range(6)]).T
    fit_w = np.linalg.lstsq(pr_X, pr_y, rcond=None)[0]
    my = pr_x*0
    ty = np.zeros(len(xx))
    fy = np.zeros(len(xx))
    for k in range(len(lw)):
        yy += lw[k]*xx**k
        my += lw[k]*pr_x**k
        ty += pr_w[k]*xx**k
        fy += fit_w[k]*xx**k
        
    
    
    l = np.mean((pr_y-my)**2)
    ax.plot(pr_x, pr_y, 'b.', label="train data")
    ax.plot(xx, yy, 'r--', label="manual fit")
    ax.text(-4,-1,rf'$\ell = ${l:.3f}')
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.set_ylim(min(pr_y)-1, max(pr_y)+1)
wp_pr = WidgetPlot(pr_plot, WidgetParbox(
    w_0=(1.0, -5.0, 5.0, 0.01,  r'$w_0$'),
    w_1=(0.01, -2.0, 2.0, 0.01, r'$w_1$'),
    w_2=(0.01, -1.0, 1.0, 0.01, r'$w_2$'),
    w_3=(-0.2, -1.0, 1.0, 0.01, r'$w_3$'),
    w_4=(0.01, -0.1, 0.1, 0.01, r'$w_4$'),
    w_5=(0.01, -0.1, 0.1, 0.01, r'$w_5$')
))
display(wp_pr)

The loss can be written in vector form, $\ell \propto \sum_i | \mathbf{w}\cdot\mathbf{x}_i - y_i |^2$. If $\mathbf{X}$ is the matrix whose rows are the feature vectors $\mathbf{x}_i$ (with entries $x_i^k$) and $\mathbf{y}$ is the vector collecting the targets, a closed-form solution for the weight vector can be derived as

$$
\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.
$$

You can compare to the solution for the 1D case, and you will see immediately how this expression generalizes that for the slope $a$. 
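
The closed-form expression above can be checked numerically. The following is a sketch on synthetic data (the quadratic target and the noise level are assumptions made for the example): it solves the normal equations directly and compares the result with `np.linalg.lstsq`, the solver used in the widgets of this module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy quadratic
x = rng.uniform(-1.0, 1.0, size=30)
y = 0.5 - 1.0 * x + 2.0 * x**2 + 0.1 * rng.normal(size=30)

X = np.vander(x, N=3, increasing=True)   # columns: 1, x, x^2

# Closed-form solution: solve (X^T X) w = X^T y rather than forming the inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: numpy's least-squares solver
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# At the optimum the gradient of the loss, proportional to X^T (X w - y), vanishes
gradient = X.T @ (X @ w_normal - y)

print(w_normal)
print(w_lstsq)
```

Solving the linear system is usually preferable to computing $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly, since it is cheaper and numerically more stable.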

We can now start looking at more realistic issues that arise in the context of regression models. For starters, data can contain a certain level of _noise_. This can be actual random noise, or (often) hidden input features or relationships that cannot be captured by the simplified model. Second, a model that predicts the targets only for the data it was trained on is of very little use: we want to be able to make real predictions!
For this reason, it is customary to set aside a fraction of the available data that is not used to determine the weights that minimize $\ell$. The error on this _test set_ is an indication of how good the model would be when predicting a new point. 
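
A minimal sketch of this procedure, on synthetic data assumed for illustration: the weights are fitted on the even-indexed points only, and the error is then evaluated separately on the train and test subsets.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed for illustration): cubic trend plus uniform noise
x = rng.uniform(-1.0, 1.0, size=40)
y = 1.0 + x - 2.0 * x**3 + (rng.uniform(size=40) - 0.5)
X = np.vander(x, N=4, increasing=True)

# Even-indexed points are used for training, odd-indexed ones for testing
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

# The weights are determined using the training points only
w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

def rmse(X_part, y_part):
    """Root mean square error of the fitted model on a subset of the data."""
    return np.sqrt(np.mean((X_part @ w - y_part) ** 2))

rmse_train = rmse(X_train, y_train)
rmse_test = rmse(X_test, y_test)
print(f"RMSE train = {rmse_train:.3f}, RMSE test = {rmse_test:.3f}")
```

The essential point is that the test points never enter the call that determines `w`.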

In [None]:
npoly = 5
pr_w = [3, 1, 1, -0.3, -0.05, 0.01]

def pr2_plot(ax, tgt, fit, noise, hidden, npoints):    
    
    xx = np.linspace(-5, 5, 60)
    yy = np.zeros(len(xx))
    np.random.seed(12345)
    pr_x = (np.random.uniform(size=2*npoints)-0.5)*10
    pr_X = np.vstack( [pr_x**k for k in range(6)]).T
    pr_y = (np.random.uniform(size=len(pr_x))-0.5)*noise
    for k in range(len(pr_w)):
        pr_y += pr_w[k]*pr_x**k
    pr_y += hidden*np.sin(pr_x*4)
    
    fit_w = np.linalg.lstsq(pr_X[::2], pr_y[::2], rcond=None)[0]    
    my = pr_x*0
    ty = np.zeros(len(xx))
    fy = np.zeros(len(xx))
    pr_fy = np.zeros(len(pr_x))
    for k in range(len(pr_w)):                
        ty += pr_w[k]*xx**k
        fy += fit_w[k]*xx**k
        pr_fy += fit_w[k]*pr_x**k
    ty += hidden*np.sin(xx*4)
    
    
    l = np.mean((pr_y-pr_fy)[::2]**2)
    lte = np.mean((pr_y-pr_fy)[1::2]**2)
    ax.plot(pr_x[::2], pr_y[::2], 'b.', label="train data")
    ax.plot(pr_x[1::2], pr_y[1::2], 'kx', label="test data")    
    
    if tgt:
        ax.plot(xx, ty, 'b:', label="true target")
    if fit:
        ax.plot(xx, fy, 'b--', label="best fit")
    
    ax.set_ylim(min(ty)-1, max(ty)+1+noise/2)    
    ax.text(0.1,0.15,rf'$\mathrm{{RMSE}}_\mathrm{{train}} = ${np.sqrt(l):.3f}', transform=ax.transAxes, c='r')
    ax.text(0.1,0.05,rf'$\mathrm{{RMSE}}_\mathrm{{test}} = ${np.sqrt(lte):.3f}', transform=ax.transAxes, c='r')
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.legend(loc="upper right")
wp_pr2 = WidgetPlot(pr2_plot, WidgetParbox(    
    noise=(5.0, 0.1,10,0.1, 'Noise'),
    hidden=(0.0, 0, 5,0.01, 'Hidden', {"readout_format" : ".2f"}),
    npoints=(20, 5, 100, 1, r'$n_\mathrm{train}$'),
    tgt=(False, r'Show true target'),
    fit=(True, r'Show best fit'),
))
display(wp_pr2)

<span style="color:blue">**03a** Compare the error on the train and the test sets. Which is typically higher? How do train and test errors change when the number of training points is changed from the lowest to the highest level?  </span>

In [None]:
ex3a_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex3a-answer", ex3a_txt, "value")
display(ex3a_txt)

<span style="color:blue">**03b** How do the train and test loss change when the level of noise is increased? And how do they change when the level of hidden relationships is increased or decreased? Can you clearly distinguish the effect of noise and hidden terms? </span>

In [None]:
ex3b_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex3b-answer", ex3b_txt, "value")
display(ex3b_txt)

The tendency to achieve a very low loss on the train set but a much larger error on the test set is a general phenomenon known as [_overfitting_](https://en.wikipedia.org/wiki/Overfitting). Overfitting is usually particularly severe when the train set is small and/or the model contains many parameters. Polynomial regression is notorious for overfitting. 
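
The role of the number of parameters can be explored with a short experiment. This is a sketch on synthetic data (the sine target, noise level, and range of degrees are assumptions): polynomial models of increasing degree are fitted to the same small train set. Because each model contains the previous one as a special case, the train error can only decrease as terms are added, whereas the test error has no such guarantee.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small synthetic data set (assumed for illustration): smooth target plus noise
x = rng.uniform(-1.0, 1.0, size=24)
y = np.sin(3.0 * x) + 0.3 * rng.normal(size=24)
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

train_mse, test_mse = [], []
degrees = range(1, 10)
for degree in degrees:
    # Polynomial features up to the current degree
    X_train = np.vander(x_train, N=degree + 1, increasing=True)
    X_test = np.vander(x_test, N=degree + 1, increasing=True)
    # Fit on the train set only, then evaluate on both subsets
    w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    train_mse.append(np.mean((X_train @ w - y_train) ** 2))
    test_mse.append(np.mean((X_test @ w - y_test) ** 2))

for degree, ltr, lte in zip(degrees, train_mse, test_mse):
    print(f"degree {degree}: train MSE = {ltr:.4f}, test MSE = {lte:.4f}")
```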

A common strategy to avoid overfitting is known as _regularization_. In broad terms, regularization implies penalizing solutions of the model that are too rapidly varying, and favouring those that are smoother, even at the cost of a slight increase of the train set error. In linear regression, the most common approach is to introduce a [Tikhonov regularization](https://en.wikipedia.org/wiki/Tikhonov_regularization) term, that is to write the loss as 

$$
\ell = \frac{1}{n_\mathrm{train}} \sum_i | \mathbf{w}\cdot \mathbf{x}_i - y_i | ^2 + \lambda |\mathbf{w}|^2.
$$

This expression (often referred to as $L^2$ regularized least-squares fit, or ridge regression) yields a closed-form solution for the weights, 

$$
\mathbf{w} = (\mathbf{X}^T\mathbf{X}+n_\mathrm{train}\lambda \mathbf{1})^{-1}\mathbf{X}^T\mathbf{y}.
$$
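
The closed-form ridge solution takes only a few lines of code. In the sketch below, run on synthetic data assumed for illustration, `alpha` stands for the whole coefficient that multiplies the identity matrix in the expression above. The loop shows the two generic effects of regularization: the norm of the weights shrinks and the train error grows as the regularization gets stronger.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumed for illustration): few points, many polynomial terms
x = rng.uniform(-1.0, 1.0, size=12)
y = np.sin(3.0 * x) + 0.3 * rng.normal(size=12)
X = np.vander(x, N=8, increasing=True)

def ridge_weights(X, y, alpha):
    """Solve (X^T X + alpha * 1) w = X^T y; alpha is the full prefactor of the identity."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [1e-6, 1e-3, 1e0, 1e3]
norms = [np.linalg.norm(ridge_weights(X, y, a)) for a in alphas]
errors = [np.mean((X @ ridge_weights(X, y, a) - y) ** 2) for a in alphas]

for a, nw, err in zip(alphas, norms, errors):
    print(f"alpha = {a:g}: |w| = {nw:.3f}, train MSE = {err:.4f}")
```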

This widget allows you to experiment with the effect of ridge regularization on the same polynomial fitting exercise. 

_NB: given that we are using a very poor basis, and different features span widely different scales, the underlying implementation is slightly more complicated, in that different weights are scaled differently before computing the Tikhonov term. This scaling is done so that a single parameter can be meaningfully used to control the regularity of the fit._ 

In [None]:
npoly = 5
pr_w = [3, 1, 1, -0.3, -0.05, 0.01]

def pr3_plot(ax, tgt, fit, noise, hidden, npoints, lam):    
    
    xx = np.linspace(-5, 5, 60)
    yy = np.zeros(len(xx))
    np.random.seed(54321)
    xsz = 10
    pr_x = (np.random.uniform(size=2*npoints)-0.5)*xsz
    pr_X = np.vstack( [pr_x**k for k in range(6)]).T
    pr_y = (np.random.uniform(size=len(pr_x))-0.5)*noise
    for k in range(len(pr_w)):
        pr_y += pr_w[k]*pr_x**k
    pr_y += hidden*np.sin(pr_x*4)
    wscale = np.asarray([1/(xsz*0.5)**k for k in range(npoly+1)])
    fit_w = np.linalg.solve(
        pr_X[::2].T@pr_X[::2]+
        10**lam*(npoints//2)*np.diag(wscale**2), 
        pr_X[::2].T@pr_y[::2])
    my = pr_x*0
    ty = np.zeros(len(xx))
    fy = np.zeros(len(xx))
    pr_fy = np.zeros(len(pr_x))
    for k in range(len(pr_w)):                
        ty += pr_w[k]*xx**k
        fy += fit_w[k]*xx**k
        pr_fy += fit_w[k]*pr_x**k
    ty += hidden*np.sin(xx*4)
    
    l = np.mean((pr_y-pr_fy)[::2]**2)
    lte = np.mean((pr_y-pr_fy)[1::2]**2)
    ax.plot(pr_x[::2], pr_y[::2], 'b.', label="train data")
    ax.plot(pr_x[1::2], pr_y[1::2], 'kx', label="test data")    
    
    if tgt:
        ax.plot(xx, ty, 'b:', label="true target")
    if fit:
        ax.plot(xx, fy, 'b--', label="best fit")
    
    ax.set_ylim(min(ty)-1, max(ty)+1+noise/2)    
    ax.text(0.1,0.25,rf'$\mathrm{{RMSE}}_\mathrm{{train}} = ${np.sqrt(l):.3f}', transform=ax.transAxes, c='r')
    ax.text(0.1,0.15,rf'$|\mathbf{{w}}| = ${np.linalg.norm(wscale*fit_w):.3f}', transform=ax.transAxes, c='r')
    ax.text(0.1,0.05,rf'$\mathrm{{RMSE}}_\mathrm{{test}} = ${np.sqrt(lte):.3f}', transform=ax.transAxes, c='r')
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.legend(loc="upper right")
wp_pr3 = WidgetPlot(pr3_plot, WidgetParbox(    
    noise=(5.0, 0.1,10,0.1, 'Noise'),
    hidden=(0.0, 0, 5,0.01, 'Hidden', {"readout_format" : ".2f"}),
    npoints=(10, 5, 100, 1, r'$n_\mathrm{train}$'),
    tgt=(True, r'Show true target'),
    fit=(True, r'Show best fit'),
    lam = (-5.0,-5,5,0.1, r'$\log_{10} \lambda$')
))
display(wp_pr3)

<span style="color:blue">**04a** Work with (noise, hidden, ntrain) = (5,0,10). What is the value of $\lambda$ that minimizes the _test_ error?  </span>

In [None]:
ex4a_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex4a-answer", ex4a_txt, "value")
display(ex4a_txt)

<span style="color:blue">**04b** Working with the same parameters (most importantly, the number of train points), comment on the behavior of the best fit function and the various diagnostics as you vary the regularization away from the optimum value.  </span>

In [None]:
ex4b_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex4b-answer", ex4b_txt, "value")
display(ex4b_txt)

<span style="color:blue">**04c** Increase the number of training points to 100. How does the behavior of ridge regression change? Is the same value of $\lambda$ still optimal? </span>

In [None]:
ex4c_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex4c-answer", ex4c_txt, "value")
display(ex4c_txt)

The regularization $\lambda$ is one of the so-called _hyperparameters_ ("hypers"), that tune the behavior of the model but are not directly optimized on the train set. In this case, the number of polynomial terms is another hyperparameter. Optimizing the hyperparameters on the test set is bad practice, because this amounts to _data leakage_, and makes the test error less representative of the true generalization capabilities of the model. 

We won't get into details, but common strategies to optimize the hyperparameters are to set aside a _validation_ set that is used neither for training nor for testing, but only to tune the hyperparameter values, or to perform _cross validation_, which involves splitting the training set into train/validation parts and repeating the exercise over multiple _folds_. 
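
To make the idea of cross validation concrete, here is a minimal hand-written sketch that uses only NumPy and synthetic data assumed for illustration. The helper names `ridge_weights` and `cv_error` are hypothetical, introduced for this example. The training data are split into five folds; each fold in turn plays the role of the validation set, and the regularization strength with the lowest average validation error is selected. No test data are involved at any point.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic training data (assumed for illustration)
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(3.0 * x) + 0.3 * rng.normal(size=30)
X = np.vander(x, N=8, increasing=True)

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution; alpha multiplies the identity matrix."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, alpha, n_folds=5):
    """Mean squared validation error averaged over n_folds folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    fold_errors = []
    for validation_idx in folds:
        # All points outside the current fold are used for fitting
        train_idx = np.setdiff1d(np.arange(len(y)), validation_idx)
        w = ridge_weights(X[train_idx], y[train_idx], alpha)
        residual = X[validation_idx] @ w - y[validation_idx]
        fold_errors.append(np.mean(residual**2))
    return np.mean(fold_errors)

# Scan a logarithmic grid of regularization strengths
alphas = 10.0 ** np.arange(-6, 4)
cv_errors = np.array([cv_error(X, y, a) for a in alphas])
best_alpha = alphas[np.argmin(cv_errors)]
print(f"selected alpha = {best_alpha:g}")
```

The samples here are generated in random order, so contiguous folds are adequate; for ordered data the indices should be shuffled before splitting.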

# Fingerprints and descriptors

The first step in any data-driven study of materials involves codifying the structure and composition of the materials being studied into a mathematical form that is suitable to be used as the input of the subsequent steps. Here we focus in particular on the definition of _fingerprints_, or _descriptors_ - a vector of numbers that are associated with each structure, assembled into a _feature vector_ $\mathbf{x}_i$. 

In this module we are going to use a dataset of materials from the [Materials Project](https://materialsproject.org/). The dataset has been reformatted as an extended XYZ file, which can be read, as usual, with the ASE `read` function. The target properties (mostly related to the elastic behavior) can be read from the `info` dictionary of each frame.

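```python
from ase.io import read

# "filename.xyz", `get_features` and 'property_name' are placeholders
data = read("filename.xyz", ":")   # ":" reads all the frames in the file
X = []
y = []
for structure in data:
    X.append(get_features(structure.symbols, structure.cell, structure.positions))
    y.append(structure.info['property_name'])
```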

where `get_features` is a function that processes the structural information of each frame to convert it into a set of fingerprints. 

Descriptors can be either precise [_representations_](https://pubs.acs.org/doi/10.1021/acs.chemrev.1c00021) of the coordinates and chemical nature of all atoms in a structure (which are commonplace in the construction of machine-learning interatomic potentials) or fingerprints based on a combination of structural parameters and properties that can be computed easily. For instance, one could take the electronegativity of elements in combination with the point group of the crystal structure. 

Here we take a very simple (and somewhat crude) approach, describing each structure by its chemical composition - that is, the feature $x_{iZ}$ contains the fraction of the atoms of structure $i$ that have atomic number $Z$. 

<span style="color:blue">**05** Write a function that takes structural information for each frame and returns a vector containing the fractional composition of each compound, to be used as descriptors. </span>

In [None]:
ex05_wci = WidgetCodeInput(
        function_name="descriptor_base", 
        function_parameters="structures",
        docstring="""computes a feature matrix for the structures given in input,
        which is a vector of the fractional composition of each structure in the dataset, 
        e.g. given [H2, He2, HHe, LiH] returns something like 
        [[1,0,0],[0,1,0],[0.5,0.5,0],[0.5,0,0.5]]. 
        The total vector size depends on how many elements are present in the data set,
        but it's OK if there are zeros. 
""",
        function_body="""

import numpy as np
from ase.io import read

descriptor = np.zeros((len(structures), 100))
for i in range(len(structures)):
    for z in structures[i].numbers:
        descriptor[i,z] += 1
    descriptor[i]/=len(structures[i])
    
# drops elements that are zero throughout (optional)
nonzero = np.where(descriptor.sum(axis=0)>0)[0]
return descriptor[:, nonzero]
"""
        )

data_dump.register_field("ex05-function", ex05_wci, "function_body")

In [None]:
def array_to_html_table(numpy_array, header):
    rows = ""
    for i in range(len(numpy_array)):
        rows += f"<tr><td>{numpy_array[i][0]}</td>" + functools.reduce(lambda x,y: x+y,
                             map(lambda x: f"<td>{x:.2f}</td>",
                                 numpy_array[i][1:])
                            ) + "</tr>"

    return "<table>" + header + rows + "</table>"

def mk_table_05():
    structures=read('data/mp_elastic.extxyz',':')
    l = ex05_wci.get_function_object()(structures)

    x = []   
    for a,b in enumerate(l):
        s = structures[a].symbols
        x.append([s]+list(b))

    header = """<tr>
                  <th>Symbols <span style="padding-left:150px"></span></th>
                  <th>Fractional composition up to order n <span style="padding-left:150px"></span></th>
                </tr>"""
    demo_table_html = HTML(
        value=f"Table")
    demo_table_html.value = array_to_html_table(x[::100], header)

    demo_table = HBox(layout=Layout(height='250px', overflow_y='auto'))
    demo_table.children += (demo_table_html,) 
    display(demo_table)
    
feat_table = WidgetUpdater(mk_table_05)

In [None]:
#DIVYA todo: make the table look prettier, with some spacing between columns and the first line spanning multiple columns
def ex05_chk(a,b):
    # checks if the Gram matrix is the same, so we avoid differences in order and zeros
    return np.allclose(a@a.T, b@b.T)
ex05_wcc = WidgetCodeCheck(ex05_wci, ref_match = ex05_chk,
                           ref_values = [ ( (read('data/mp_elastic.extxyz','::100'),), np.loadtxt('data/mp_elastic_05ref.txt') ) ]                           
                           , demo = feat_table)
display(ex05_wcc)

In [None]:
ex06_wci = WidgetCodeInput(
        function_name="descriptor_poly", 
        function_parameters="structures, nmax",
        docstring="""compute the powers of the fractional composition and stack them 
        to form a larger feature vector
        
        structures: a list of structures in `ase.Atoms` format
        nmax : maximum order of the polynomial
""",
        function_body="""

import numpy as np
from ase.io import read

descriptor = np.zeros((len(structures), 100))
for i in range(len(structures)):
    for z in structures[i].numbers:
        descriptor[i,z] += 1
    descriptor[i]/=len(structures[i])

descriptor = np.hstack([descriptor**k for k in range(1,nmax+1)])

# drops elements that are zero throughout (optional)
nonzero = np.where(descriptor.sum(axis=0)>0)[0]
return descriptor[:, nonzero]
"""
        )

data_dump.register_field("ex06-function", ex06_wci, "function_body")

In [None]:
def mk_table2():
    structures=read('data/mp_elastic.extxyz',':')
    l = ex06_wci.get_function_object()(structures, ex06_wp.value['n'])

    x = []   
    for a,b in enumerate(l):
        s = structures[a].symbols
        x.append([s]+list(b))

    header = """<tr>
                  <th>Symbols <span style="padding-left:150px"></span></th>
                  <th>Fractional composition up to order n <span style="padding-left:150px"></span></th>
                </tr>"""
    demo_table_html = HTML(
        value=f"Table")
    demo_table_html.value = array_to_html_table(x[::100], header)

    demo_table = HBox(layout=Layout(height='250px', overflow_y='auto'))
    demo_table.children += (demo_table_html,) 
    display(demo_table)
    
ex06_wp =  WidgetParbox(n = (2,1,8,1,r'$n_{max}$'))    
feat_table2 = WidgetUpdater(mk_table2)

In [None]:
structures=read('data/mp_elastic.extxyz','::100')
l = ex06_wci.get_function_object()(structures, 3)
np.savetxt('data/mp_elastic_06ref.txt',l)

In [None]:
ex06_wcc = WidgetCodeCheck(ex06_wci, 
                           ref_match = ex05_chk,
                           ref_values = [ ( (read('data/mp_elastic.extxyz','::100'),3,), np.loadtxt('data/mp_elastic_06ref.txt') ) ],                         
                           demo=(ex06_wp,feat_table2))
display(ex06_wcc)

**OPTIONAL** 

Here you can define your own fingerprints. Number density? Mass? Element electronegativity? Use your imagination. You will be able to test the performance of these descriptors in the next sections. 

In [None]:
custom_wci = WidgetCodeInput(
        function_name="descriptor_custom", 
        function_parameters="structures",
        docstring="""computes a custom feature matrix for the structures given in input.
""",
        function_body="""

import numpy as np
from ase.io import read

descriptor = np.zeros((len(structures), 10))

return descriptor
"""
        )

data_dump.register_field("custom-function", custom_wci, "function_body")

In [None]:
def array_to_html_table(numpy_array, header):
    rows = ""
    for i in range(len(numpy_array)):
        rows += f"<tr><td>{numpy_array[i][0]}</td>" + functools.reduce(lambda x,y: x+y,
                             map(lambda x: f"<td>{x:.2f}</td>",
                                 numpy_array[i][1:])
                            ) + "</tr>"

    return "<table>" + header + rows + "</table>"

def mk_table_custom():
    structures=read('data/mp_elastic.extxyz',':')
    l = custom_wci.get_function_object()(structures)

    x = []   
    for a,b in enumerate(l):
        s = structures[a].symbols
        x.append([s]+list(b))

    header = """<tr>
                  <th>Symbols <span style="padding-left:150px"></span></th>
                  <th>Custom descriptor values <span style="padding-left:150px"></span></th>
                </tr>"""
    demo_table_html = HTML(
        value=f"Table")
    demo_table_html.value = array_to_html_table(x[::100], header)

    demo_table = HBox(layout=Layout(height='250px', overflow_y='auto'))
    demo_table.children += (demo_table_html,) 
    display(demo_table)
    
custom_table = WidgetUpdater(mk_table_custom)

In [None]:
custom_wcc = WidgetCodeCheck(custom_wci, ref_values=(), demo = custom_table)
display(custom_wcc)

# Dimensionality reduction

We begin with a quick example of principal component analysis (PCA), one of the simplest _unsupervised_ learning algorithms - actually, one that can hardly be called machine learning. 
PCA involves processing a high-dimensional feature matrix $\mathbf{X}$ and projecting it _linearly_ into a _latent space_ $\mathbf{T}$ of reduced dimensionality. The problem can be formulated in different terms: as the identification of the directions with maximum variance in feature space, as the maximisation of the variance retained in the latent space or as the low-rank orthogonal projection of the feature matrix that minimizes the information loss. 

The figure below demonstrates the functioning of PCA on a simple 2D dataset: the principal axes of the data distribution are identified, making it possible to reduce the description to just 1D while losing the smallest possible amount of information on the relative position of the points.

<img src="figures/pca.png" width="500"/>

In rigorous terms, PCA corresponds to determining the orthogonal projection matrix $\mathbf{P}_{XT}$ that minimizes the loss

$$
\ell = |\mathbf{X}-\mathbf{X}\mathbf{P}_{XT} \mathbf{P}_{XT}^T|^2
$$

To understand what this does, consider that $\mathbf{P}_{XT}$ is an $n_X\times n_T$ matrix; $\mathbf{X}\mathbf{P}_{XT}$ transforms the feature vector of each point into an $n_T$-dimensional compressed version, forming a latent-space matrix $\mathbf{T}$. 
Then $\mathbf{T}\mathbf{P}_{XT}^T$ lifts this to the $n_X$-dimensional space. Given the compression, however, information has been lost and the data lie in a subspace (the blue points in the figure above). 

The projector $\mathbf{P}_{XT}$ can be built using a [singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) of $\mathbf{X}$ or, equivalently, by computing the eigenvalue decomposition of the covariance matrix $\mathbf{X}^T\mathbf{X}$. Note that usually the feature matrix is _centered_ before identifying the principal components, that is the column-wise average of the features is subtracted from each row. 
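
As a sketch of the manual route, the snippet below builds the projector from the SVD of a centered feature matrix and checks it against the eigenvalues of $\mathbf{X}^T\mathbf{X}$. The feature matrix is synthetic (random numbers with assumed variances), and the names `Xc`, `P_XT`, and `n_T` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic feature matrix (assumed for illustration): 200 samples, 5 features
# with widely different spreads along each direction
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

# Center the features: subtract the column-wise mean from every row
Xc = X - X.mean(axis=0)

# Route 1: SVD of the centered feature matrix, Xc = U S V^T
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_T = 2
P_XT = Vt[:n_T].T            # n_X x n_T projector with orthonormal columns
T = Xc @ P_XT                # latent-space coordinates, n_samples x n_T

# Route 2: eigenvalues of Xc^T Xc (eigh returns them in ascending order)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
eigvals = eigvals[::-1]

# The reconstruction loss |Xc - Xc P P^T|^2 equals the sum of the
# discarded eigenvalues, i.e. of the discarded squared singular values
loss = np.sum((Xc - T @ P_XT.T) ** 2)
print(loss, np.sum(S[n_T:] ** 2))
```

The squared singular values of the centered matrix coincide with the eigenvalues of $\mathbf{X}^T\mathbf{X}$, which is why the two constructions yield the same principal directions (up to a sign).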

While it's really easy to implement PCA manually, it is even simpler to use one of the many open implementations available, that also take care of centering. We use the implementation in `scikit-learn`. You are encouraged to read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), but the key workflow is simple:

```python
from sklearn.decomposition import PCA
pca = PCA(n_components=3)   # n_components: dimension of the latent space
itrain = range(0,ntrain)  # list of indices used for training
pca.fit(x[itrain])      # x[itrain] is a ntrain x nx feature matrix
t = pca.transform(x)    # applies compression to the full feature vector. 
                        # t is nsamples x n_components 
```

In [None]:
ex08_wci = WidgetCodeInput(
        function_name="PCA_analysis", 
        function_parameters="structures, f_fingerprint, f_train",
        docstring="""takes the structures and a function that can compute fingerprints, and computes
        the principal component analysis of the dataset. also computes a train/test split assuming that 
        the first len(structures)*f_train configurations are used for training. 
        
        returns the latent-space coordinates for ALL structures (use 4 components), 
        and a list of the indices of the train structures.

""",
        function_body="""

import numpy as np
from ase.io import read
from sklearn.decomposition import PCA


X = f_fingerprint(structures)
#Split the dataset into train and test set
itrain = list(range(0,int(len(X)*f_train)))

pca = PCA(n_components=4)
pca.fit(X[itrain])

X_pca = pca.transform(X)
return X_pca, itrain
"""
        )

data_dump.register_field("ex08-function", ex08_wci, "function_body")

In [None]:
def fun_ex08(change={'type':'change'}):
    ex08_out.clear_output()
    structures = read('data/mp_elastic.extxyz',':')
    feats = ex08_wp.value['feats']
    with ex08_out:
        if feats == "composition":
            f_descriptor = lambda s: ex05_wci.get_function_object()(s)
        elif feats == "polynomial ($n_{max}$=2)":
            f_descriptor = lambda s: ex06_wci.get_function_object()(s, 2)
        elif feats == "polynomial ($n_{max}$=4)":
            f_descriptor = lambda s: ex06_wci.get_function_object()(s, 4)
        elif feats == "polynomial ($n_{max}$=8)":
            f_descriptor = lambda s: ex06_wci.get_function_object()(s, 8)
        elif feats == "custom":
            f_descriptor = lambda s: custom_wci.get_function_object()(s)
        xlatent, itrain = ex08_wci.get_function_object()(structures, f_descriptor,  ex08_wp.value['ftrain'])
    
    ftype = np.asarray([ "test " ] * len(structures)); ftype[itrain] = "train"
    fname = [ str(s.symbols) for s in structures]
    frames=structures    
    properties={"pca[1]": xlatent[:,0], "pca[2]" : xlatent[:,1],
                 "pca[3]" : xlatent[:,2],  "pca[4]" : xlatent[:,3],
                "type": ftype , "name": fname
               }
    settings={'map': {'x': {'property': 'pca[1]'},
  'y': { 'property': 'pca[2]'},
  #'color': {'max': 1, 'min': 0, 'property': 'K_error', 'scale': 'linear'},
  'symbol': 'type',
  'palette': 'inferno',
  'size': {'factor': 40}},
 'structure': [{'bonds': True,
   'spaceFilling': False,
   'atomLabels': False,
   'unitCell': True,
   'rotation': False,
   'supercell': {'0': 2, '1': 2, '2': 2},}]}
    
    
    chemiscope.write_input("module_07-pca-analysis.chemiscope.json.gz", 
                           frames=frames, properties=properties)
    with ex08_up:
        display(chemiscope.show(frames, properties, settings=settings
                               ))
    
ex08_up = WidgetUpdater(updater=fun_ex08)
ex08_out = Output(layout=Layout(width='100%', height='100%', max_height='200px', overflow_y='scroll'))
ex08_wp =  WidgetParbox(feats = (
    "composition", ["composition", "polynomial ($n_{max}$=2)", "polynomial ($n_{max}$=4)", "polynomial ($n_{max}$=8)", "custom"], r"fingerprints"),
    ftrain = (0.5,0.05,0.9,0.05,r'$f_{train}$'))    


ex08_wcc = WidgetCodeCheck(ex08_wci, ref_values = {}, demo = (ex08_wp, ex08_out, ex08_up))

display(ex08_wcc)

<span style="color:blue">**09a** Run the PCA for a "composition" fingerprint and $f_{train}=0.5$. 
 What does the latent-space projection look like? What do the axes roughly correspond to? How can you explain the appearance of the map, given the way the fingerprint is constructed and the structures in the training set?
</span>

In [None]:
ex9a_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex9a-answer", ex9a_txt, "value")
display(ex9a_txt)

<span style="color:blue">**09b** Change the fraction of training structures down to 0.05; does the qualitative appearance of the map change much? Does the actual meaning of the axes change (look in particular at the most "extreme" structures)?
</span>

In [None]:
ex9b_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex9b-answer", ex9b_txt, "value")
display(ex9b_txt)

<span style="color:blue">**09c** What happens if you use another set of features (say, polynomial features up to $n_\mathrm{max}=8$)? Go back to 50% training.
</span>

In [None]:
ex9c_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex9c-answer", ex9c_txt, "value")
display(ex9c_txt)

# Regression
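
The exercises in this section rely on `sklearn.linear_model.Ridge`. As a brief reminder, and as a sketch rather than a full derivation, ridge regression looks for the weights $\mathbf{w}$ that minimize a squared error plus a penalty on the size of the weights. The notation is introduced here for illustration only: $\mathbf{X}$ collects the fingerprints of the training structures (one row per structure), $\mathbf{y}$ the corresponding target values, and $\alpha$ is the regularization strength. The intercept that `Ridge` fits by default is left out, which amounts to assuming centered features and targets.

$$
\ell(\mathbf{w}) = \left\| \mathbf{y} - \mathbf{X}\mathbf{w} \right\|^2 + \alpha \left\| \mathbf{w} \right\|^2,
\qquad
\mathbf{w} = \left( \mathbf{X}^T \mathbf{X} + \alpha \mathbf{I} \right)^{-1} \mathbf{X}^T \mathbf{y}
$$

For $\alpha \to 0$ this approaches ordinary least squares, while a large $\alpha$ shrinks the weights towards zero and makes the model less flexible.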

In [None]:
ex10_wci = WidgetCodeInput(
        function_name="ridge_regression", 
        function_parameters="structures, target, f_fingerprint, f_train, alpha",
        docstring="""Takes the structures and a function that computes fingerprints, and fits a
        ridge (regularized linear) regression for the target. The target is given as a string,
        matching the name used in the `info` field of the structures.
        The train/test split assumes that the first int(len(structures)*f_train) configurations are used for training.
        The ridge regularization magnitude is given by alpha.
        
        Returns the predicted property for ALL structures,
        and a list of the indices of the train structures.

""",
        function_body="""

import numpy as np
from sklearn.linear_model import Ridge


X = f_fingerprint(structures)
Y = np.asarray([f.info[target] for f in structures])

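# Split the dataset: the first int(len(X)*f_train) structures form the training set
itrain = list(range(0, int(len(X)*f_train)))

# Fit on the training structures only, then predict for ALL structures
ridge = Ridge(alpha=alpha)
ridge.fit(X[itrain], Y[itrain])
Y_pred = ridge.predict(X)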

return Y_pred, itrain
"""
        )

data_dump.register_field("ex10-function", ex10_wci, "function_body")

In [None]:
def fun_ex10(change={'type':'change'}):
    ex10_out.clear_output()
    
    structures = read('data/mp_elastic.extxyz',':')
    tgt = ex10_wp.value['target']
    y = np.asarray([f.info[tgt] for f in structures])
    feats = ex10_wp.value['feats']
    with ex10_out:
        if feats == "composition":
            f_descriptor = lambda s: ex05_wci.get_function_object()(s)
        elif feats == "polynomial ($n_{max}$=2)":
            f_descriptor = lambda s: ex06_wci.get_function_object()(s, 2)
        elif feats == "polynomial ($n_{max}$=4)":
            f_descriptor = lambda s: ex06_wci.get_function_object()(s, 4)
        elif feats == "polynomial ($n_{max}$=8)":
            f_descriptor = lambda s: ex06_wci.get_function_object()(s, 8)
        elif feats == "custom":
            f_descriptor = lambda s: custom_wci.get_function_object()(s)
        
        yp, itrain = ex10_wci.get_function_object()(structures, tgt, f_descriptor,  
                                                    ex10_wp.value['ftrain'], 10**ex10_wp.value['log10alpha'])
        
        abs_err = np.abs(y - yp)
        print("MAE train: ", np.mean(abs_err[itrain]))
        # test structures are all those whose index is not in itrain
        print("MAE test: ", np.mean(np.delete(abs_err, itrain)))
    
    ftype = np.asarray([ "test " ] * len(structures)); ftype[itrain] = "train"
    fname = [ str(s.symbols) for s in structures]
    frames=structures
    properties={tgt: y, tgt+"_predicted" : yp, tgt+"_error": np.abs(y-yp),
                "type": ftype , "name": fname}
    
    settings = {
        'map': {
            'x': {'property': tgt},
            'y': {'property': tgt+'_predicted'},
            'color': {'max': 1, 'min': 0, 'property': tgt+'_error', 'scale': 'linear'},
            'symbol': 'type',
            'palette': 'inferno',
            'size': {'factor': 40},
        },
        'structure': [{
            'bonds': True,
            'spaceFilling': False,
            'atomLabels': False,
            'unitCell': True,
            'rotation': False,
            'supercell': {'0': 2, '1': 2, '2': 2},
        }],
    }
                  
    chemiscope.write_input("module_07-ridge-regression.chemiscope.json.gz", frames=frames, properties=properties)
                           
    with ex10_up:
        display(chemiscope.show(frames=structures, 
                   properties=properties, settings=settings
                  ) )
        
ex10_up = WidgetUpdater(updater=fun_ex10)
ex10_out = Output(layout=Layout(width='100%', height='100%', max_height='200px', overflow_y='scroll'))
ex10_wp =  WidgetParbox(
    target=("K", ["K", "G", "nu"], r"target"),
    feats=("composition", ["composition", "polynomial ($n_{max}$=2)", "polynomial ($n_{max}$=4)", "polynomial ($n_{max}$=8)", "custom"], r"fingerprints"),
    ftrain = (0.5,0.05,0.9,0.05,r'$f_{train}$'),
    log10alpha=(-3., -8, 2, 0.2, r"$\log_{10}(\alpha)$")
                       )


ex10_wcc = WidgetCodeCheck(ex10_wci, ref_values = {}, demo = (ex10_wp, ex10_out, ex10_up))

display(ex10_wcc)

<span style="color:blue">**11a** Run the regression for the bulk modulus $K$, using `composition` fingerprints, 50% training structures and a regularization of $10^{-3}$. What are the train and test errors? Which structure has the largest error?
</span>

In [None]:
ex11a_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex11a-answer", ex11a_txt, "value")
display(ex11a_txt)

<span style="color:blue">**11b** Change the train size to the minimum and maximum values allowed. How do test and train mean absolute errors (MAEs) change? Can you explain the trend? Repeat with very small and very large regularization. What do you observe, and how can you explain it?
</span>

In [None]:
ex11b_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex11b-answer", ex11b_txt, "value")
display(ex11b_txt)

<span style="color:blue">**11c** Use polynomial ($n_{max}=8$) descriptors and repeat the experiments in the previous question. What do you observe, and how can you explain it? What is the best test set error you can obtain by adjusting the regularization with $f_{train}=0.85$?
</span>

_Hint: consider the number of features used in the description, and the flexibility of the model. Think also about what you observed for the 1D regression. NB: adjusting the regularization based on test set accuracy as we do here is not good practice, and we only do it to get a sense of the role of regularization._
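
If you want to explore the interplay between number of features, training set size and regularization outside the widget, the following minimal sketch sets up a synthetic 1D analogue. It does not use the elastic-constants dataset or `sklearn`: the target function, the noise level, the number of points and the helper names `ridge_fit` and `poly_features` are all assumptions made for illustration, and the ridge weights are obtained from the closed-form expression without an intercept.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge weights: solve (X^T X + alpha I) w = X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def poly_features(x, n_max):
    """Columns x^0, x^1, ..., x^n_max for a 1D input array x."""
    return np.vander(x, n_max + 1, increasing=True)


# Synthetic data (an assumption for illustration, not the module's dataset)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)

# Few training points and many features: a very flexible model
n_train = 12  # the first n_train points are used for training
X = poly_features(x, 10)
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Scan the same range of regularization strengths as the slider above
results = []
for log10_alpha in (-8, -6, -4, -2, 0, 2):
    w = ridge_fit(X_train, y_train, 10.0 ** log10_alpha)
    mae_train = np.mean(np.abs(X_train @ w - y_train))
    mae_test = np.mean(np.abs(X_test @ w - y_test))
    results.append((log10_alpha, mae_train, mae_test))
    print(f"log10(alpha)={log10_alpha:3d}  MAE train={mae_train:.3f}  MAE test={mae_test:.3f}")
```

The exact numbers depend on the random seed and on the choices above; what is worth comparing is how the train and test errors move relative to each other as $\alpha$ is varied, and how this changes when you modify `n_train` or the polynomial degree.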

In [None]:
ex11c_txt = Textarea("Write the answer here", layout=Layout(width="100%"))
data_dump.register_field("ex11c-answer", ex11c_txt, "value")
display(ex11c_txt)