# Scikit-learn: The biggest machine-learning library
<img src="images/scikit-learn-logo-notext.png"></img>

### has become the machine-learning ingredient of the "data-science triad": `jupyter notebooks`, `pandas`, `scikit-learn`
### has "made machine learning boring"
### history: 
- 2007 David Cournapeau Google Summer of Code (with Jarrod Millman)
- 2010 Parietal Team (Inria Saclay) takes over, first release Feb 1st, 2010
- Fabian Pedregosa full-time engineer 2010-2012
- Currently Olivier Grisel, Andreas Mueller and several others work on it full-time as open-source developers

<a href="http://scikit-learn.org" style="font-size: 20pt">Scikit-learn.org</a>

## Let's jump right in and predict some molecule properties!

We will fit an estimator to be able to predict a molecule property from its structure.

Let's start by loading some data.

In [None]:
import numpy as np

In [None]:
qm7 = np.load("./qm7.npz")

In [None]:
positions, charges, energies = qm7['positions'], qm7['charges'], qm7['energies']

These data represent atom positions and types for 7165 small organic molecules. From `positions` and `charges` we'd like to be able to predict `energies` using machine learning.

### Feature extraction: To be able to use scikit-learn we need `X` and `y`

Every supervised algorithm in scikit-learn requires input of `X` and `y` (unsupervised algorithms only need `X`).

`X` is "the data" and `y` is the prediction target. The goal of the estimator is to predict `y` from `X` as well as possible.

`X` is of shape `(n_samples, n_features)` and `y` is of shape `(n_samples,)`.
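To make the shape convention concrete, here is a minimal sketch on made-up data. The names `X_toy` and `y_toy` and the sizes are illustrative assumptions and have nothing to do with the molecule data.

```python
import numpy as np

# Hypothetical toy data: 5 samples described by 3 features each.
rng = np.random.RandomState(0)
X_toy = rng.rand(5, 3)   # shape (n_samples, n_features)
y_toy = rng.rand(5)      # shape (n_samples,): one target value per sample

n_samples, n_features = X_toy.shape
# The first axis of X and the only axis of y must have the same length.
assert y_toy.shape == (n_samples,)
```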

### Features for Chemistry: Coulomb matrices predict molecule properties

Our `y` is `energies`. Let's set it to be that:

In [None]:
y = energies
y.shape

Our `X` is more complicated to obtain. We need to construct it from `positions` and `charges`. We can construct Coulomb matrices from them and order their rows and columns with the following two functions:

In [None]:
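def compute_coulomb_matrices(positions, charges):
    # Pairwise distances between all atoms of each molecule: (n_molecules, n_atoms, n_atoms)
    all_atom_distances = np.linalg.norm(positions[:, :, np.newaxis] -
                                        positions[:, np.newaxis], axis=-1)
    # Pairwise charge products Z_i * Z_j (zero wherever one of the charges is zero)
    all_charge_products = charges[:, :, np.newaxis] * charges[:, np.newaxis]
    # Skip zero-charge entries and zero distances (the diagonal) to avoid dividing by zero
    non_zero_mask = (all_charge_products != 0) & (all_atom_distances > 0)
    coulomb_matrices = np.zeros_like(all_atom_distances)
    coulomb_matrices[non_zero_mask] = (all_charge_products[non_zero_mask] /
                                       all_atom_distances[non_zero_mask])
    # Fill the diagonal of every matrix through a flattened view with stride n_atoms + 1
    n_molecules, n_atoms = coulomb_matrices.shape[:2]
    coulomb_matrices.reshape(n_molecules, -1)[:, ::n_atoms + 1] = charges ** 2.4
    return coulomb_matrices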

from sklearn.utils import check_random_state
def sort_coulomb_matrices(coulomb_matrices, jitter=0., random_state=0):

    rng = check_random_state(random_state)
    row_norms = np.linalg.norm(coulomb_matrices, axis=2)
    jitters = rng.rand(*row_norms.shape) * jitter
    row_norms += jitters
    
    row_order = row_norms.argsort(axis=1)[:, ::-1]
    sorted_coulomb_matrices = coulomb_matrices[np.arange(len(coulomb_matrices))[:, np.newaxis, np.newaxis],
                                               row_order[:, :, np.newaxis],
                                               row_order[:, np.newaxis]]
    return sorted_coulomb_matrices

In [None]:
scm = sort_coulomb_matrices(compute_coulomb_matrices(positions, charges), jitter=0.01)

In [None]:
scm.shape

We observe that these functions give us 7165 matrices of shape 23x23. But what we need is `X` of shape `(7165, something)`. We can obtain this from `scm` by reshaping.

In [None]:
X = scm.reshape(scm.shape[0], -1)
X.shape

### Simple data splitting: train-test-split

To evaluate a model, we need it to issue predictions on data it did not see during training. To ensure this, we can use scikit-learn's data-splitting functionality. A simple, straightforward way of doing this is the function `train_test_split`.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=.2, shuffle=True)

In [None]:
Xtrain.shape, ytrain.shape

In [None]:
Xtest.shape, ytest.shape

### A simple regression estimator: Kernel Ridge Regression
Now we create an estimator object, in this case a `KernelRidge` regressor.

#### Fitting the regression to data
The procedure of fitting it to data and testing it is the same for all estimators.

In [None]:
from sklearn.kernel_ridge import KernelRidge

In [None]:
kr = KernelRidge(kernel='laplacian', alpha=1e-9)

In [None]:
kr.fit(Xtrain, ytrain)

#### Obtaining predictions

In [None]:
predictions = kr.predict(Xtest)

In [None]:
predictions

In [None]:
predictions.shape

In [None]:
ytest

#### Evaluating an error

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
mean_absolute_error(ytest, predictions)

In [None]:
np.sqrt(mean_squared_error(ytest, predictions))

In [None]:
r2_score(ytest, predictions)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.figure(figsize=(10, 10))
plt.plot(predictions, ytest, "x")

Those seem like pretty solid scores. For many applications, an `r2` value of `.997` is absolutely great. It turns out, though, that the state-of-the-art error on this data set is an order of magnitude lower than our result.

### Assessing the robustness of an estimator: cross-validation

Cross-validation is a procedure in which the data are split into training and test parts in multiple ways, and the estimator is fit and scored on each split. The scores are gathered in an array.
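Written out by hand, the procedure is a loop over splits. The following is a minimal sketch under stated assumptions: it uses synthetic toy data and a `Ridge` model rather than the molecule data, and the names `X_toy`, `y_toy`, `splitter` and `fold_errors` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Hypothetical toy regression problem: a noisy linear target.
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 4)
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.01 * rng.randn(60)

base_model = Ridge(alpha=1e-3)
splitter = KFold(n_splits=3, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in splitter.split(X_toy):
    # Start from a fresh, unfitted copy so that no state leaks between folds.
    model = clone(base_model)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    fold_predictions = model.predict(X_toy[test_idx])
    fold_errors.append(mean_absolute_error(y_toy[test_idx], fold_predictions))

fold_errors = np.array(fold_errors)  # one held-out error per split
```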

In [None]:
from sklearn.model_selection import cross_val_score

We would like to control how the splits are made. We can do this with the `cv` argument of this function, to which we pass a cross-validation splitter object:

In [None]:
from sklearn.model_selection import ShuffleSplit

In [None]:
cv = ShuffleSplit(n_splits=6, train_size=0.8, test_size=0.2)

In [None]:
scores = cross_val_score(kr, X, y, cv=cv, scoring="neg_mean_absolute_error", n_jobs=3)

By specifying `n_jobs=3` we ask that 3 of these estimations be run in parallel if memory and the number of CPUs allow it.

In [None]:
scores

The values are negative because we chose an error measure as the score. Since model-selection procedures assume that *higher is better* for a score, errors are negated so that a higher value means a lower error.
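The sign convention can be checked directly. This is a small sketch on made-up data, with a `Ridge` model standing in for the estimator; `X_toy`, `y_toy`, `mae` and `neg_mae` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import get_scorer, mean_absolute_error

# Hypothetical toy data, unrelated to the molecule data.
rng = np.random.RandomState(0)
X_toy = rng.rand(40, 3)
y_toy = X_toy.sum(axis=1) + 0.1 * rng.randn(40)

model = Ridge(alpha=1.0).fit(X_toy, y_toy)

# The plain metric: lower is better.
mae = mean_absolute_error(y_toy, model.predict(X_toy))

# The scorer behind scoring="neg_mean_absolute_error": higher is better.
scorer = get_scorer("neg_mean_absolute_error")
neg_mae = scorer(model, X_toy, y_toy)

# The scorer returns exactly the negated error.
assert np.isclose(neg_mae, -mae)
```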

In [None]:
print(f"We have a score summary of MAE={-np.mean(scores):0.2f}+-{np.std(scores):0.2f}")

That seems like a pretty consistent estimator.

### Concatenating estimators with `pipelines`

Sometimes there are preprocessing steps that depend on the data. To keep these steps from using testing data, and to obtain a single global estimator object, it is often useful to create a `pipeline` that concatenates them all. In our case, the input data is probably quite large in norm. This is generally not good for estimators.
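The point about testing data can be illustrated with a small sketch. It assumes synthetic, deliberately large-valued toy data and a `Ridge` model in place of the estimator used in this notebook; all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with large feature values.
rng = np.random.RandomState(0)
X_toy = 1000.0 * rng.rand(50, 3)
y_toy = X_toy[:, 0] - X_toy[:, 1]
X_tr, X_te = X_toy[:40], X_toy[40:]
y_tr = y_toy[:40]

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)  # the scaler only ever sees the training rows

# make_pipeline names each step after its lowercased class name.
fitted_scaler = pipe.named_steps["standardscaler"]
assert np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))

# At prediction time the stored training statistics are reused on new data.
toy_predictions = pipe.predict(X_te)
```

Because the scaler lives inside the pipeline, any cross-validation routine that refits the pipeline also refits the scaler on each training split.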

In [None]:
plt.plot(X.mean(0))

In [None]:
plt.plot(X.std(0))

In [None]:
plt.plot(np.linalg.norm(X, axis=0))

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
pipeline = make_pipeline(scaler, KernelRidge(kernel='laplacian', gamma=1e-10, alpha=1e-9))

In [None]:
pipeline.fit(Xtrain, ytrain)

In [None]:
p = pipeline.predict(Xtest)

In [None]:
mean_absolute_error(ytest, p)

## Integrating hyperparameter optimization with `GridSearchCV`

Our pipeline currently has two hyperparameters that we have set by hand. If we play around with these hyperparameters a lot and evaluate a cross-validation every time, we might end up overfitting the dataset.

While there is nothing wrong with setting hyperparameters based on how well they make the estimator predict, the data used for final evaluation should not be used for hyperparameter selection.
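One way to respect this rule by hand is to carve a validation set out of the training data. The sketch below assumes synthetic toy data, a `Ridge` model and an arbitrary list of candidate values; none of these come from the molecule example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical toy regression problem.
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 5)
y_toy = X_toy @ rng.rand(5) + 0.1 * rng.randn(200)

# Set the final evaluation data aside first and do not touch it during selection.
X_dev, X_final, y_dev, y_final = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)
# Split the remaining data again: one part for fitting, one for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0)

candidate_alphas = [1e-3, 1e-1, 1e1]  # illustrative grid
val_errors = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_fit, y_fit)
    val_errors[alpha] = mean_absolute_error(y_val, model.predict(X_val))

# Select on validation error only, then refit and evaluate once on the held-out data.
best_alpha = min(val_errors, key=val_errors.get)
final_model = Ridge(alpha=best_alpha).fit(X_dev, y_dev)
final_error = mean_absolute_error(y_final, final_model.predict(X_final))
```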

Scikit-learn provides tools for parameter selection by splitting the training set, for example `GridSearchCV`.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
pipeline.steps

In [None]:
gsc = GridSearchCV(pipeline, param_grid=dict(
                                kernelridge__gamma=[1e-10, 5e-10, 1e-9],
                                kernelridge__alpha=[1e-8, 1e-9, 1e-10],),
                   scoring="neg_mean_absolute_error",
                   cv=3
                  )

In [None]:
gsc.fit(Xtrain, ytrain)

In [None]:
gsc.best_params_

In [None]:
gsc.best_score_

In [None]:
gsc.cv_results_

In [None]:
test_pred = gsc.predict(Xtest)

In [None]:
mean_absolute_error(ytest, test_pred)

**Exercise 01:** Use a `GridSearchCV`-wrapped estimator in a cross-validation loop using `cross_val_score`. What do the resulting scores mean with respect to the estimator without `GridSearchCV`-wrapping?

Watch out: This may take a while to run.

In [None]:
# %load solutions/scikit-learn/solution01.txt
# %load https://raw.githubusercontent.com/SFdS-atelier-3/block-1/master/solutions/scikit-learn/solution01.txt

## API

### Estimator Objects
- An estimator is a class that implements the methods `fit(X, y)` and `predict(X)`. `fit` makes the estimator learn something from the data, `predict` issues predictions for new data.

- `X` is always a 2D matrix of shape `(n_samples, n_features)`.
- `y` is either a 1D vector of length `n_samples` or a 2D matrix of shape `(n_samples, n_targets)`.

- The `__init__` method must only store its input parameters. All work that depends on data belongs in `fit`.
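The conventions above can be seen in a deliberately trivial estimator. This is a sketch, not part of scikit-learn: `MeanRegressor` and its `offset` parameter are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone


class MeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical estimator: always predicts the training mean plus an offset."""

    def __init__(self, offset=0.0):
        # Only store the parameter; no computation happens here.
        self.offset = offset

    def fit(self, X, y):
        # Everything learned from data is computed in fit.
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_ + self.offset)


est = MeanRegressor(offset=1.0)
params = est.get_params()  # provided by BaseEstimator from the __init__ signature

est.fit(np.zeros((4, 2)), np.array([1.0, 2.0, 3.0, 4.0]))
toy_predictions = est.predict(np.zeros((2, 2)))

# clone builds an unfitted copy from the stored parameters alone.
fresh = clone(est)
```

Because `__init__` does nothing but store `offset`, `get_params` and `clone` work without extra code, which is what tools such as `GridSearchCV` and `cross_val_score` rely on.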


Let's make our own `scikit-learn` estimator which runs a pytorch neural network!

In [None]:
import torch
import torch.utils.data

In [None]:
def create_neural_network(*layer_sizes):
    layers = []
    for input_size, output_size in zip(layer_sizes[:-1], layer_sizes[1:]):
        layers.append(torch.nn.Linear(input_size, output_size))
        layers.append(torch.nn.ReLU())
    return torch.nn.Sequential(*layers[:-1])

In [None]:
def train_epoch(network, data, criterion, optimizer):
    losses = []
    for x, y in data:
        p = network(x)
        loss = criterion(p, y)
        losses.append(loss.detach().cpu().numpy())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return np.array(losses)


In [None]:
def train_nll_adam(network, data, n_epochs=10):
    all_losses = []
    optimizer = torch.optim.Adam(network.parameters())
    criterion = torch.nn.CrossEntropyLoss()
    for e in range(n_epochs):
        losses = train_epoch(network, data, criterion, optimizer)
        all_losses.append(losses)
    return np.array(all_losses)

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin

In [None]:
class MyNeuralNetwork(ClassifierMixin, BaseEstimator):
    def __init__(self, layer_sizes=(), n_epochs=10, batch_size=32):
        self.layer_sizes = layer_sizes
        self.n_epochs = n_epochs
        self.batch_size = batch_size
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        u_label, i_label = np.unique(y, return_inverse=True)
        self.u_label = u_label
        
        layers = (n_features,) + self.layer_sizes + (len(u_label),)
        
        self.network = create_neural_network(*layers)
        
        data = torch.utils.data.DataLoader(list(zip(X.astype('float32'), i_label)), batch_size=self.batch_size)
        train_nll_adam(self.network, data, n_epochs=self.n_epochs)
        
        return self

    def predict(self, X):
        Xtorch = torch.from_numpy(X.astype('float32'))
        predicted_scalars = self.network(Xtorch).detach().cpu().numpy()
        predicted_labels = predicted_scalars.argmax(1)
        return self.u_label[predicted_labels]

**Exercise 02:** Fit our sklearn-pytorch estimator to some digits data

**1\.** Create a scikit-learn-pytorch neural network using the class above, with `layer_sizes=(100,)` and `n_epochs=100`. Call it `my_nn`.

In [None]:
# %load solutions/scikit-learn/solution02.1.txt
# %load https://raw.githubusercontent.com/SFdS-atelier-3/block-1/master/solutions/scikit-learn/solution02.1.txt

**2\.** Import `load_digits` from `sklearn.datasets`. Run this function and store the output in `digits`. Store `digits.data` in `X` and `digits.target` in `y`.

In [None]:
# %load solutions/scikit-learn/solution02.2.txt
# %load https://raw.githubusercontent.com/SFdS-atelier-3/block-1/master/solutions/scikit-learn/solution02.2.txt

**3\.** Use the `fit` method to fit the neural network to the first 1000 data points.

In [None]:
# %load solutions/scikit-learn/solution02.3.txt
# %load https://raw.githubusercontent.com/SFdS-atelier-3/block-1/master/solutions/scikit-learn/solution02.3.txt

**4\.** Use the `predict` method to predict on the remaining data points, and check how many digits were correctly predicted.

In [None]:
# %load solutions/scikit-learn/solution02.4.txt
# %load https://raw.githubusercontent.com/SFdS-atelier-3/block-1/master/solutions/scikit-learn/solution02.4.txt