# Cross-validation methods

In this notebook you will use K-Fold cross-validation to estimate the quality of a classifier.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import numpy.testing as np_testing
import matplotlib.pyplot as plt

# Load MAGIC Data Set

<center><img src="img/magic1.jpg" width="1000"></center>

Source: https://magic.mpp.mpg.de/

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data

Features description:
- **Length:** continuous # major axis of ellipse [mm]
- **Width:** continuous # minor axis of ellipse [mm]
- **Size:** continuous # 10-log of sum of content of all pixels [in #phot]
- **Conc:** continuous # ratio of sum of two highest pixels over fSize [ratio]
- **Conc1:** continuous # ratio of highest pixel over fSize [ratio]
- **Asym:** continuous # distance from highest pixel to center, projected onto major axis [mm]
- **M3Long:** continuous # 3rd root of third moment along major axis [mm]
- **M3Trans:** continuous # 3rd root of third moment along minor axis [mm]
- **Alpha:** continuous # angle of major axis with vector to origin [deg]
- **Dist:** continuous # distance from origin to center of ellipse [mm]
- **Label:** g,h # gamma (signal), hadron (background)

g = gamma (signal): 12332 \
h = hadron (background): 6688

In [None]:
f_names = np.array(["Length", "Width", "Size", "Conc", "Conc1", "Asym", "M3Long", "M3Trans", "Alpha", "Dist"])

data = pd.read_csv("magic04.data", header=None, names=list(f_names)+["Label"])
data.head()

# Data preparation

In [None]:
# prepare a matrix of input features
X = data[f_names].values

# prepare a vector of true labels
y = 1 * (data['Label'].values == "g")

In [None]:
# print sizes of X and y
X.shape, y.shape

In [None]:
X[:2]

In [None]:
y[:5]

# Preprocessing

Scale input data using StandardScaler:
$$
X_{new} = \frac{X - \mu}{\sigma}
$$
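
As a rough illustration of the formula, the sketch below standardizes one made-up feature column in plain Python. The `standardize` helper and the sample values are hypothetical. It uses the population standard deviation, which is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return (x - mu) / sigma for every value of one feature column."""
    mu = fmean(column)      # column mean
    sigma = pstdev(column)  # population standard deviation (ddof=0)
    return [(value - mu) / sigma for value in column]

# made-up values for a single feature, for illustration only
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(column)
print(scaled)
```

After scaling, the column has zero mean and unit standard deviation, so all features end up on a comparable scale.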

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create object of the class and set up its parameters
ss = StandardScaler()

# Estimate mean and sigma values
ss.fit(X)

# Scale the sample
X = ss.transform(X)

# Define a model


Now let's create a neural network and fit it.

In [None]:
#!pip install pytorch-lightning 

In [None]:
import torch
from torch.nn import functional as F
from torch import nn
import pytorch_lightning as pl

class Model(pl.LightningModule):

    def __init__(self):
        super().__init__()
        
        # define all layers of the network
        self.net = nn.Sequential(
                                nn.Linear(10, 10), 
                                nn.Tanh(), 
                                nn.Linear(10, 1), 
                                nn.Sigmoid())

    
    def forward(self, x):
        # make a prediction for x
        return self.net(x)

    # calculate loss function values
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.binary_cross_entropy(y_hat, y)
        return loss

    # define optimizer to fit the network
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

In [None]:
model = Model()
model
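
The `training_step` above minimizes the binary cross-entropy between the predicted probabilities and the true labels. A minimal plain-Python sketch of that loss follows, assuming the mean reduction that `F.binary_cross_entropy` applies by default. The helper below and its sample values are illustrative only and are not part of the notebook.

```python
import math

def binary_cross_entropy(y_prob, y_true):
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) over all samples."""
    losses = [
        -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
        for p, t in zip(y_prob, y_true)
    ]
    return sum(losses) / len(losses)

# made-up predictions for two samples with true labels 1 and 0
good = binary_cross_entropy([0.9, 0.2], [1.0, 0.0])  # confident and correct
bad = binary_cross_entropy([0.1, 0.8], [1.0, 0.0])   # confident and wrong
print(good, bad)
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily.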

# Data loader creation

We will define a helper function that converts `X_train` and `y_train` into a PyTorch `DataLoader`.

In [None]:
from torch.utils.data import TensorDataset, DataLoader

def create_data_loader(X_train, y_train, batch_size=128):
    # combine X and y into one pytorch tensor dataset
    dataset_train = TensorDataset(torch.tensor(X_train, dtype=torch.float), 
                                  torch.tensor(y_train.reshape(-1, 1), dtype=torch.float))
    # loader divides our train data into batches
    train_loader = DataLoader(dataset_train, batch_size=batch_size, num_workers=4)
    return train_loader

In [None]:
# example of usage
create_data_loader(X[:5], y[:5], 1)
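
To see what the loader does with `batch_size`, here is a plain-Python sketch of sequential batching. It assumes the `DataLoader` defaults used above (no shuffling, last incomplete batch kept). The `iter_batches` helper is hypothetical and only mimics the slicing, not the tensor handling.

```python
def iter_batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# five samples split into batches of two: the last batch is smaller
batches = list(iter_batches([0, 1, 2, 3, 4], batch_size=2))
print(batches)
```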

# Quality metrics

We will use the function below to calculate quality metrics for our model.

In [None]:
from sklearn import metrics

def quality_metrics_report(y_true, y_pred, y_proba):
    """
    Parameters
    ----------
    y_true: array-like of shape (n_samples,)
        Ground truth (correct) target values.
    y_pred: array-like of shape (n_samples,)
        Estimated targets as returned by a classifier.
    y_proba: array-like of shape (n_samples,)
        Target scores, e.g. probability estimates of the positive
        class.
        
    Returns
    -------
    List of metric values: [accuracy, precision, recall, f1, roc_auc]
    """
    
    accuracy  = metrics.accuracy_score(y_true, y_pred)
    precision = metrics.precision_score(y_true, y_pred)
    recall    = metrics.recall_score(y_true, y_pred)
    f1        = metrics.f1_score(y_true, y_pred)
    roc_auc   = metrics.roc_auc_score(y_true, y_proba)
    
    return [accuracy, precision, recall, f1, roc_auc]

In [None]:
# example of usage
quality_metrics_report(y_true=[0, 0, 1, 1], 
                       y_pred=[0, 1, 1, 1], 
                       y_proba=[0.1, 0.6, 0.8, 0.9])
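
To make the five metrics less of a black box, the sketch below recomputes them for the same toy inputs in plain Python. The helper names are hypothetical. ROC AUC is computed here as the fraction of (positive, negative) pairs ranked correctly, which is one standard way to define it and not necessarily how scikit-learn implements it.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
y_proba = [0.1, 0.6, 0.8, 0.9]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
roc_auc = pairwise_auc(y_true, y_proba)
print(accuracy, precision, recall, f1, roc_auc)
```

Here one of the two negatives is misclassified, so precision drops to 2/3 while recall stays at 1. ROC AUC is 1 because both positives receive higher scores than both negatives.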

# K-Fold cross-validation

We will measure the quality of our model using the K-Fold CV method.

<center><img src="img/kfold.png" width="600"></center>

K-Fold:
    
1. Split the data into 𝐾 folds
2. For 𝑖=1,…,𝐾 do: \
    2.1 Keep 𝑖-th fold for validation \
    2.2 Use other 𝐾−1 folds to fit a model \
    2.3 Measure its quality on the validation fold
3. Estimate mean and standard deviation of the quality metrics
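
The splitting in steps 1 and 2 can be sketched in plain Python as follows. This is a simplified stand-in for `sklearn.model_selection.KFold` without shuffling. The `kfold_indices` generator is a hypothetical helper, not part of the notebook.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, val_indices) for each of the n_splits folds."""
    # the first n_samples % n_splits folds get one extra sample
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, val
        start = stop

folds = list(kfold_indices(n_samples=10, n_splits=3))
for train, val in folds:
    print("train:", train, "validation:", val)
```

Every sample lands in the validation fold exactly once, so each model is scored on data it never saw during fitting.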


# Task 1
Using K-Fold cross-validation, estimate the means and standard deviations of the quality metrics for the classifier above.

**Hint:** use `model(torch.tensor(X, dtype=torch.float))[:, 0].detach().numpy()` to make predictions for our model. Use the `quality_metrics_report` function above to compute the quality metrics.
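
The model outputs probabilities, while `quality_metrics_report` also expects hard class labels. One common way to obtain them is to threshold the probabilities. The sketch below assumes a threshold of 0.5, which the task itself does not fix, and `to_labels` is a hypothetical helper.

```python
def to_labels(y_proba, threshold=0.5):
    """Map probabilities of the positive class to labels in {0, 1}."""
    return [1 if p > threshold else 0 for p in y_proba]

# made-up probabilities standing in for the model output on a test fold
y_test_proba = [0.1, 0.6, 0.8, 0.9]
y_test_pred = to_labels(y_test_proba)
print(y_test_pred)
```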

In [None]:
from sklearn.model_selection import KFold

def kfold_uncertainties(X, y, n_splits=10):
    
    metrics = []
    
    # init KFold class
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    # go through each iteration of KFold
    for train_index, test_index in kf.split(X):
        
        # init model and trainer
        model = Model()
        trainer = pl.Trainer(max_epochs=10)
        
        # take train folds
        X_train = X[train_index]
        y_train = y[train_index]
        
        # create pytorch dataloader
        train_loader = create_data_loader(X_train, y_train, batch_size=128)
        
        # fit the model on the train folds, then take the test fold (X_test, y_test)
        # and get y_test_proba and y_test_pred predictions on it
        
        ### BEGIN SOLUTION

        ### END SOLUTION
        
        # compute quality metrics
        metrics_iter = quality_metrics_report(y_test, y_test_pred, y_test_proba)
        metrics.append(metrics_iter)
        
    metrics = np.array(metrics)
    df = pd.DataFrame()
    df['Metrics'] = ['Accuracy', 'Precision', 'Recall', 'F1', 'ROC AUC']
    df['Mean']    = metrics.mean(axis=0)
    df['Std']     = metrics.std(axis=0)
    
    return df

In [None]:
# run KFold CV
report = kfold_uncertainties(X, y, n_splits=10)
report

Expected approximate output:

<center>   
    
```python
   Metrics    Mean      Std
0  Accuracy   0.659621  0.031680
1  Precision  0.927239  0.015468
2  Recall     0.516701  0.057044
3  F1         0.661184  0.046793
4  ROC AUC    0.778499  0.023092
``` 
    
</center>

In [None]:
### BEGIN HIDDEN TESTS
actual  = report.values[-1, 1:].astype(float)
desired = np.array([0.778499, 0.023092])
np_testing.assert_allclose(actual, desired, atol=0.05)
### END HIDDEN TESTS