Here we will learn to:
- Assess the performance of machine learning models;
- Diagnose the common problems of machine learning algorithms;
- Fine-tune machine learning models;
- Evaluate predictive models using different performance metrics.

In [2]:
import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=None)

In [3]:
from sklearn.preprocessing import LabelEncoder
X = df.loc[:,2:].values
y = df.loc[:,1].values
le = LabelEncoder()
y = le.fit_transform(y)
le.classes_

array(['B', 'M'], dtype=object)

In [4]:
le.transform(["M","B"]) # Malignant tumors are encoded as class 1 and benign tumors as class 0

array([1, 0])

In [6]:
# Dataset split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=1)

Now we will chain the scaling, the dimensionality reduction and the classifier training in a pipeline, as a best practice.

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
pipe_lr = make_pipeline(StandardScaler(),PCA(n_components=2),LogisticRegression())
pipe_lr.fit(X_train,y_train)
y_pred = pipe_lr.predict(X_test)
test_acc = pipe_lr.score(X_test,y_test)
print(f"Test accuracy: {test_acc:.3f}")

Test accuracy: 0.956


The `make_pipeline` function comes in handy: it takes an arbitrary number of transformers (objects that support both the `fit` and `transform` methods), followed by an estimator (an object that supports the `fit` and `predict` methods).
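To make the chaining concrete, the sketch below compares a pipeline with the same steps applied by hand. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the breast cancer data), and the names `X_demo`, `pipe_demo` and `manual_pred` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any numeric classification data would do)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_demo_tr, X_demo_te = X_demo[:150], X_demo[150:]
y_demo_tr, y_demo_te = y_demo[:150], y_demo[150:]

# Pipeline version: a single fit call and a single predict call
pipe_demo = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe_demo.fit(X_demo_tr, y_demo_tr)
pipe_pred = pipe_demo.predict(X_demo_te)

# Manual version: fit_transform each transformer on the training data,
# fit the classifier on the transformed data ...
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression()
X_tr_reduced = pca.fit_transform(scaler.fit_transform(X_demo_tr))
clf.fit(X_tr_reduced, y_demo_tr)
# ... and only transform (never refit) the test data before predicting
manual_pred = clf.predict(pca.transform(scaler.transform(X_demo_te)))

agreement = np.mean(pipe_pred == manual_pred)
print(f"Fraction of identical predictions: {agreement:.3f}")
```

The point of the pipeline is that the test data is only ever transformed with parameters learned from the training data, without having to repeat each step by hand.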

# K-fold cross-validation to assess model performance

## The holdout method

The holdout method simply splits the data into a training set and a test set: the former is used to train the model, the latter to estimate its generalization performance.

To select optimal values of the tuning parameters (hyperparameters), we go through a process called **model selection**.
However, if we reuse the same test set every time during model selection, the test set effectively becomes part of the training data, and the resulting performance estimate is optimistically biased (so it's not a good practice).

A better way of using the holdout method for model selection is to separate the data into three parts: a training set, a validation set and a test set. The training set is used to fit the candidate models, the validation set is used for model selection, and the test set is used only for the final performance estimate (after the model selection process is complete).
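One possible way to obtain such a three-way split is to call `train_test_split` twice. The sketch below uses toy data and a 60/20/20 ratio; both the data and the ratio are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 2 features, 2 balanced classes
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# Step 1: set aside 20% of the data as the final test set
X_rest, X_toy_test, y_rest, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1)

# Step 2: split the remaining 80% into training (60% of total) and validation (20% of total)
X_toy_train, X_toy_val, y_toy_train, y_toy_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

print(len(X_toy_train), len(X_toy_val), len(X_toy_test))
```

Hyperparameters would then be compared on `X_toy_val`, and `X_toy_test` would be touched only once, at the very end.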

## K-fold cross-validation
In k-fold cross-validation we randomly split the training dataset into $k$ folds without replacement. In each iteration, $k-1$ folds are used for training and the remaining $1$ fold is used as the test fold for performance evaluation.
The procedure is repeated $k$ times, each time with a different test fold, so that we obtain $k$ models and $k$ performance estimates.
Typically, we use k-fold cross-validation for model tuning, that is, finding the hyperparameter values that yield a satisfactory generalization performance on the test folds. Once we have found satisfactory hyperparameter values, we can retrain the model on the complete training dataset and obtain a final performance estimate using the independent test dataset.

Another cool property of this approach is that the estimated performances $E_i$ of the individual models (one per iteration of the k-fold procedure) can then be used to calculate the estimated average performance $E = \frac{1}{k}\sum_{i=1}^{k}E_i$.
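The splitting and averaging described above can be sketched in plain Python. The helper `kfold_indices` is hypothetical (it is not a scikit-learn function), and the per-fold scores are made-up placeholders standing in for the $E_i$.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random split, without replacement
    # Spread the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=10, k=3))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Iteration {i}: {len(train_idx)} training samples, test fold = {sorted(test_idx)}")

# Placeholder per-fold scores E_i (made up for illustration, not measured)
fold_scores = [0.90, 0.95, 1.00]
mean_score = sum(fold_scores) / len(fold_scores)  # E = (1/k) * sum of E_i
print(f"Average performance E: {mean_score:.3f}")
```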

A standard value for $k$ in k-fold cross-validation is 10, which empirical evidence suggests is a good default. With relatively small training sets it can be useful to increase the number of folds, at the cost of a longer cross-validation runtime; if the training set is big, we can choose a lower $k$ (for example $k=5$) and still obtain an accurate estimate.

With extremely small datasets there is also the **leave-one-out approach**, where $k=n$ (the number of training samples), so each test fold contains only 1 sample.
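scikit-learn provides a `LeaveOneOut` splitter for this case. A minimal sketch on five hypothetical samples (the array `X_tiny` is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_tiny = np.arange(10).reshape(5, 2)  # 5 hypothetical samples with 2 features each

loo = LeaveOneOut()
loo_splits = list(loo.split(X_tiny))
for train_idx, test_idx in loo_splits:
    # Each iteration holds out exactly one sample and trains on the other n-1
    print(f"train: {train_idx}, test: {test_idx}")

n_iterations = loo.get_n_splits(X_tiny)  # equals the number of samples
```

Since the number of iterations equals the number of samples, this gets expensive quickly as the dataset grows.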

In case of class imbalances, however, a better approach is stratified k-fold cross-validation, where the class label proportions are preserved in each fold, so that each fold is representative of the class proportions in the training dataset.

In [8]:
import numpy as np
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10).split(X_train,y_train)
scores = []
for k, (train,test) in enumerate(kfold):
    pipe_lr.fit(X_train[train],y_train[train])
    score = pipe_lr.score(X_train[test],y_train[test])
    scores.append(score)
    print(f"Fold: {k+1:02d}, Class distr.: {np.bincount(y_train[train])}, Acc.: {score:.3f}")
mean_acc = np.mean(scores)
std_acc = np.std(scores)
print(f"\nCV accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")

Fold: 01, Class distr.: [256 153], Acc.: 0.935
Fold: 02, Class distr.: [256 153], Acc.: 0.935
Fold: 03, Class distr.: [256 153], Acc.: 0.957
Fold: 04, Class distr.: [256 153], Acc.: 0.957
Fold: 05, Class distr.: [256 153], Acc.: 0.935
Fold: 06, Class distr.: [257 153], Acc.: 0.956
Fold: 07, Class distr.: [257 153], Acc.: 0.978
Fold: 08, Class distr.: [257 153], Acc.: 0.933
Fold: 09, Class distr.: [257 153], Acc.: 0.956
Fold: 10, Class distr.: [257 153], Acc.: 0.956

CV accuracy: 0.950 +/- 0.014


In [9]:
# The same evaluation can be written less verbosely with `cross_val_score`.
from sklearn.model_selection import cross_val_score
# `n_jobs` distributes the fold evaluations across CPUs (-1 means use them all)
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10, n_jobs=1)
print(f"CV accuracy scores: {scores}")

CV accuracy scores: [0.93478261 0.93478261 0.95652174 0.95652174 0.93478261 0.95555556
 0.97777778 0.93333333 0.95555556 0.95555556]
