# Class 5: Model Evaluation 2 -- Cross-Validation for Model Evaluation

In [1]:
import numpy as np
import matplotlib.pyplot as plt

<p style="margin-bottom:5cm;"></p>

## K-fold Cross-Validation in Scikit-Learn

- Simple demonstration of using a cross-validation iterator in scikit-learn

In [16]:
from sklearn.model_selection import KFold

# Seed the random number generator so that we get the same results when we re-run the code
rng = np.random.RandomState(123)

# We will use a small random dataset for simplicity
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # class labels for 10 examples
X = rng.random_sample((y.shape[0], 4))  # example inputs: 10 x 4 dimensional dataset

# Print dataset
print("Features:\n", X)
print("Labels:\n", y)


cv = KFold(n_splits=5)

print("\n")
for k in cv.split(X, y):
    print("Example indices in fold (training/validation):", k)

Features:
 [[0.69646919 0.28613933 0.22685145 0.55131477]
 [0.71946897 0.42310646 0.9807642  0.68482974]
 [0.4809319  0.39211752 0.34317802 0.72904971]
 [0.43857224 0.0596779  0.39804426 0.73799541]
 [0.18249173 0.17545176 0.53155137 0.53182759]
 [0.63440096 0.84943179 0.72445532 0.61102351]
 [0.72244338 0.32295891 0.36178866 0.22826323]
 [0.29371405 0.63097612 0.09210494 0.43370117]
 [0.43086276 0.4936851  0.42583029 0.31226122]
 [0.42635131 0.89338916 0.94416002 0.50183668]]
Labels:
 [0 0 0 0 0 1 1 1 1 1]


Example indices in fold (training/validation): (array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1]))
Example indices in fold (training/validation): (array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3]))
Example indices in fold (training/validation): (array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5]))
Example indices in fold (training/validation): (array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7]))
Example indices in fold (training/validation): (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))


In [17]:
# training examples first fold
X[[2, 3, 4, 5, 6, 7, 8, 9]]

array([[0.4809319 , 0.39211752, 0.34317802, 0.72904971],
       [0.43857224, 0.0596779 , 0.39804426, 0.73799541],
       [0.18249173, 0.17545176, 0.53155137, 0.53182759],
       [0.63440096, 0.84943179, 0.72445532, 0.61102351],
       [0.72244338, 0.32295891, 0.36178866, 0.22826323],
       [0.29371405, 0.63097612, 0.09210494, 0.43370117],
       [0.43086276, 0.4936851 , 0.42583029, 0.31226122],
       [0.42635131, 0.89338916, 0.94416002, 0.50183668]])

In [18]:
# training labels first fold
y[[2, 3, 4, 5, 6, 7, 8, 9]]

array([0, 0, 0, 1, 1, 1, 1, 1])

In [19]:
# validation examples first fold
X[[0,1]]

array([[0.69646919, 0.28613933, 0.22685145, 0.55131477],
       [0.71946897, 0.42310646, 0.9807642 , 0.68482974]])

In [20]:
# validation labels first fold
y[[0,1]]

array([0, 0])
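Without shuffling, the folds shown above are simply consecutive blocks of indices. The following is a minimal NumPy-only sketch of that idea, not scikit-learn's actual implementation; the helper name `kfold_indices` is made up for illustration.

```python
import numpy as np


def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs for unshuffled k-fold splitting.

    Hypothetical helper for illustration only.
    """
    indices = np.arange(n_samples)
    # Split the index range into n_splits consecutive, nearly equal blocks
    for valid_idx in np.array_split(indices, n_splits):
        # Everything outside the validation block is used for training
        train_idx = np.setdiff1d(indices, valid_idx)
        yield train_idx, valid_idx


folds = list(kfold_indices(n_samples=10, n_splits=5))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "validation:", valid_idx)
```

Each example lands in exactly one validation block, so every example is used for validation once and for training in the remaining iterations.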

<p style="margin-bottom:5cm;"></p>

- In practice, we usually want to shuffle the dataset before splitting. If the data records are ordered by class label, unshuffled folds can leave some classes poorly represented (or missing entirely) in the training or validation part of a fold

In [21]:
cv = KFold(n_splits=5, random_state=123, shuffle=True)

for k in cv.split(X, y):
    print("Example indices in fold (training/validation):", k)

Example indices in fold (training/validation): (array([1, 2, 3, 5, 6, 7, 8, 9]), array([0, 4]))
Example indices in fold (training/validation): (array([0, 1, 2, 3, 4, 6, 8, 9]), array([5, 7]))
Example indices in fold (training/validation): (array([0, 1, 2, 4, 5, 6, 7, 9]), array([3, 8]))
Example indices in fold (training/validation): (array([0, 2, 3, 4, 5, 7, 8, 9]), array([1, 6]))
Example indices in fold (training/validation): (array([0, 1, 3, 4, 5, 6, 7, 8]), array([2, 9]))
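The problem with class-ordered data is most extreme when there are few folds. As a small sketch (the choice of `n_splits=2` and the placeholder feature array `X_dummy` are assumptions made for illustration), without shuffling each model would be trained on one class and validated on the other:

```python
import numpy as np
from sklearn.model_selection import KFold

# Labels sorted by class, as in the toy dataset above
y_sorted = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X_dummy = np.zeros((y_sorted.shape[0], 1))  # placeholder features (assumption)

# Collect the training and validation labels of each unshuffled fold
unshuffled = [
    (y_sorted[train_idx], y_sorted[valid_idx])
    for train_idx, valid_idx in KFold(n_splits=2).split(X_dummy)
]
for train_labels, valid_labels in unshuffled:
    print("train:", train_labels, "validation:", valid_labels)
```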


<p style="margin-bottom:5cm;"></p>

- Note that the `KFold` iterator only provides us with the array indices; in practice, we are actually interested in the array values (feature values and class labels)

In [22]:
cv = KFold(n_splits=5, random_state=123, shuffle=True)

for train_idx, valid_idx in cv.split(X, y):
    print('train labels with shuffling', y[train_idx])

for train_idx, valid_idx in cv.split(X, y):
    print('validation labels with shuffling', y[valid_idx])

train labels with shuffling [0 0 0 1 1 1 1 1]
train labels with shuffling [0 0 0 0 0 1 1 1]
train labels with shuffling [0 0 0 0 1 1 1 1]
train labels with shuffling [0 0 0 0 1 1 1 1]
train labels with shuffling [0 0 0 0 1 1 1 1]
validation labels with shuffling [0 0]
validation labels with shuffling [1 1]
validation labels with shuffling [0 1]
validation labels with shuffling [0 1]
validation labels with shuffling [0 1]


<p style="margin-bottom:5cm;"></p>

- It is important to stratify the splits, especially for small datasets. With stratification, each fold preserves the class proportions of the full label array; observe in the output below that the labels in the training and validation parts of every fold are well balanced.

In [23]:
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, random_state=123, shuffle=True)

for train_idx, valid_idx in cv.split(X, y):
    print('train labels', y[train_idx])
for train_idx, valid_idx in cv.split(X, y):
    print('validation labels', y[valid_idx])

    

train labels [0 0 0 0 1 1 1 1]
train labels [0 0 0 0 1 1 1 1]
train labels [0 0 0 0 1 1 1 1]
train labels [0 0 0 0 1 1 1 1]
train labels [0 0 0 0 1 1 1 1]
validation labels [0 1]
validation labels [0 1]
validation labels [0 1]
validation labels [0 1]
validation labels [0 1]
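Stratification matters even more when the classes are imbalanced. The sketch below compares the class counts per validation fold for `KFold` and `StratifiedKFold`; the 40/10 label array and the placeholder features are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 40 negatives and 10 positives
y_imb = np.array([0] * 40 + [1] * 10)
X_imb = np.zeros((y_imb.shape[0], 1))  # placeholder features (assumption)


def validation_class_counts(splitter):
    """Return [n_negatives, n_positives] for each validation fold."""
    return [
        np.bincount(y_imb[valid_idx], minlength=2).tolist()
        for _, valid_idx in splitter.split(X_imb, y_imb)
    ]


plain_counts = validation_class_counts(
    KFold(n_splits=5, shuffle=True, random_state=123)
)
strat_counts = validation_class_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
)
print("KFold:          ", plain_counts)
print("StratifiedKFold:", strat_counts)
```

With stratification, every validation fold receives 8 negatives and 2 positives, whereas plain shuffled `KFold` gives no such guarantee.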


<p style="margin-bottom:5cm;"></p>

## Cross-validation for model evaluation: Logistic regression

- After the illustrations of cross-validation above, we now demonstrate how to use the iterators provided by scikit-learn to fit and evaluate a learning algorithm. We start by using cross-validation to evaluate a logistic regression model.
- Recall that feature scaling matters for many machine learning algorithms, including logistic regression.
- It is especially important when we use regularisation, because the penalty is applied to all coefficients alike and its effect therefore depends on the scale of the features.
- To avoid introducing bias, we have to compute the scaling parameters (e.g., the mean and standard deviation in the context of z-score normalisation) on the training fold only, and then use them to scale both the training AND the validation fold in a given iteration.
- To make this more convenient, scikit-learn's `Pipeline` class (or the `make_pipeline` function) comes in handy, as the `Pipeline` cell below demonstrates.
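Done by hand, the per-fold scaling described in the bullets above could look like the following sketch. For brevity it runs on the full breast cancer dataset rather than on a training split, and the variable names (`X_bc`, `cv_manual`, `manual_acc`) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
cv_manual = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

manual_acc = []
for train_idx, valid_idx in cv_manual.split(X_bc, y_bc):
    # Learn the mean and standard deviation from the training fold only
    scaler = StandardScaler().fit(X_bc[train_idx])
    X_tr = scaler.transform(X_bc[train_idx])
    # Reuse the training-fold statistics to scale the validation fold
    X_va = scaler.transform(X_bc[valid_idx])

    model = LogisticRegression(max_iter=5000).fit(X_tr, y_bc[train_idx])
    # score() returns the accuracy on the scaled validation fold
    manual_acc.append(model.score(X_va, y_bc[valid_idx]))

manual_acc = np.array(manual_acc)
print(f"Manual CV accuracy: {manual_acc.mean():.4f} ± {manual_acc.std():.4f}")
```

A `Pipeline` performs the same fit-on-training, transform-on-validation sequence inside each cross-validation iteration, which removes the need to write this loop.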

In [25]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# Load data and split into training and test set
X, y = load_breast_cancer(return_X_y=True)

# Hold-out test split for the final, unbiased performance report.
# stratify=y (where y holds the class labels) tells train_test_split to preserve the class
# proportions of y in both the training and the test set.
# This is helpful for imbalanced datasets, so that neither set is skewed.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Pipeline: scale -> logistic regression
# We should not scale the data before cross-validation: fitting the scaler on the whole
# dataset lets information from the validation folds leak into training (data leakage).
# Wrapping the scaler and the model in a Pipeline avoids this automatically.
# In each CV iteration, the scaler is fitted on the training fold only, learning its μ and σ,
# and then transforms the validation fold using those same μ and σ.
# No information from the validation fold leaks into the scaler or the model.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(solver="lbfgs", max_iter=5000, random_state=42))
])

# Cross-validation on the training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "acc": "accuracy",
    "nll": "neg_log_loss"
}
cv_res = cross_validate(pipe, X_train, y_train, cv=cv, scoring=scoring, return_train_score=True)

# Summarise CV results
for k in ["train_acc", "test_acc", "train_nll", "test_nll"]:
    mean = cv_res[k].mean()
    std  = cv_res[k].std()
    # For log loss, scikit-learn reports the negated value; flip the sign of the mean when printing
    # (the standard deviation is unaffected by the sign flip)
    if "nll" in k:
        mean = -mean
    print(f"{k:>10}: {mean:.4f} ± {std:.4f}")

# Fit on the whole training set because we need a "single" model and evaluate once on the hold-out test set
# Step B in the slide "k-fold cross-validation for model evaluation"
pipe.fit(X_train, y_train)
p_test = pipe.predict_proba(X_test)[:, 1]
y_pred = (p_test >= 0.5).astype(int)

print("\nFinal hold-out test metrics:")
print(f"Accuracy : {accuracy_score(y_test, y_pred):.4f}")
print(f"Log loss : {log_loss(y_test, p_test):.4f}")


 train_acc: 0.9890 ± 0.0000
  test_acc: 0.9780 ± 0.0098
 train_nll: 0.0509 ± 0.0063
  test_nll: 0.0721 ± 0.0287

Final hold-out test metrics:
Accuracy : 0.9825
Log loss : 0.0779


<p style="margin-bottom:2cm;"></p>

- In the example above, we set the following hyperparameters for logistic regression:
    - `solver="lbfgs"`: Optimisation algorithm. Supports multinomial logistic regression and works well for medium-sized datasets.
    - `max_iter=5000`: Maximum number of iterations the solver may run.
- Keep in mind that `cross_validate` **does not tune hyperparameters**. It simply evaluates the model with the hyperparameters you give it.
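A related point is that `cross_validate` fits clones of the estimator, so the object we pass in remains unfitted afterwards; this is why a separate `fit` call is needed to obtain a single final model. The sketch below checks this and lists the keys of the returned dict. The names `demo_pipe` and `demo_res` are illustrative, and the full dataset is used only to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

demo_res = cross_validate(
    demo_pipe, X_bc, y_bc, cv=5,
    scoring={"acc": "accuracy"}, return_train_score=True,
)

# The dict holds one array of length n_splits per key
result_keys = sorted(demo_res.keys())
print(result_keys)

# make_pipeline names each step after its lowercased class name
logreg = demo_pipe.named_steps["logisticregression"]

# cross_validate worked on clones, so our pipeline has no learned coefficients yet
fitted_before = hasattr(logreg, "coef_")
print("fitted after cross_validate:", fitted_before)

# An explicit fit is required to obtain a usable model
demo_pipe.fit(X_bc, y_bc)
fitted_after = hasattr(logreg, "coef_")
print("fitted after pipe.fit:", fitted_after)
```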


<p style="margin-bottom:3cm;"></p>

- Usually, a more convenient way to use cross-validation through scikit-learn is the `cross_val_score` function (note that for classifiers it performs stratified splitting by default, i.e., when `cv` is an integer or `None`)
- We use `cross_val_score` when we only need a single metric (e.g., accuracy) for cross-validation. We use `cross_validate` (as in the logistic regression example above) when we require multiple metrics and/or more detailed evaluation and diagnostics. The latter returns a dict of results, which can include training scores, multiple metrics, fit times and score times.

Below we use `cross_val_score` to do cross-validation for our example considering logistic regression.

In [27]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data + hold-out split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Pipeline: scale -> logistic regression
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(solver="lbfgs", max_iter=5000, random_state=42)),
])

# 5-fold CV accuracy on the training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_acc = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="accuracy")
print(f"CV accuracy: {cv_acc.mean():.4f} ± {cv_acc.std():.4f}")

# Fit on all training data and evaluate on hold-out test set
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")


CV accuracy: 0.9780 ± 0.0098
Test accuracy: 0.9825
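The relationship between the two helpers can be checked directly: for a single metric, `cross_val_score` returns the same values that `cross_validate` stores under the `"test_score"` key. The following is a small sketch on the full breast cancer dataset; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
check_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
# A fixed integer seed makes the splitter produce the same folds on every call
check_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores_a = cross_val_score(check_pipe, X_bc, y_bc, cv=check_cv, scoring="accuracy")
scores_b = cross_validate(check_pipe, X_bc, y_bc, cv=check_cv, scoring="accuracy")["test_score"]

same_scores = np.allclose(scores_a, scores_b)
print("Identical fold scores:", same_scores)
```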


## Cross-validation for model evaluation: Decision trees

- Consider now cross-validation computed by hand for a decision tree on the Iris dataset, which is not binary: it has three classes (`'setosa'`, `'versicolor'`, `'virginica'`).

In [26]:
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split, StratifiedKFold


X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.15, 
                                                    shuffle=True, stratify=y)


# Step A in the slide "k-fold cross-validation for model evaluation"
# k = 10 as recommended in: Kohavi, R. (1995, August). A study of cross-validation and
# bootstrap for accuracy estimation and model selection. In IJCAI (Vol. 14, No. 2, pp. 1137-1145).
cv = StratifiedKFold(n_splits=10, random_state=123, shuffle=True)

kfold_acc = 0.
# For each training, validation set in each fold
for train_idx, valid_idx in cv.split(X_train, y_train):
    # Train a decision tree in the training fold
    clf = DecisionTreeClassifier(random_state=123, max_depth=3).fit(X_train[train_idx], y_train[train_idx])
    # Step C in the slide "k-fold cross-validation for model evaluation"
    # Predict on the validation fold
    y_pred = clf.predict(X_train[valid_idx])
    # Compute accuracy of the current fold
    acc = np.mean(y_pred == y_train[valid_idx])*100
    # Accumulate fold performances (accuracies)
    kfold_acc += acc
# Compute the estimate of the generalisation performance (mean accuracy over the folds)
kfold_acc /= cv.get_n_splits()

# Fit a new decision tree with all training data because we need a "single" model
# Step B in the slide "k-fold cross-validation for model evaluation"
clf = DecisionTreeClassifier(random_state=123, max_depth=3).fit(X_train, y_train)
y_pred = clf.predict(X_test)
test_acc = np.mean(y_pred == y_test)*100
    
print('Kfold Accuracy: %.2f%%' % kfold_acc)
print('Test Accuracy: %.2f%%' % test_acc)



Kfold Accuracy: 95.26%
Test Accuracy: 95.65%
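The manual loop above accumulates only the mean. Keeping the per-fold accuracies in a list also gives their spread, as in the sketch below. It is a variation for illustration: it loads Iris through scikit-learn's `load_iris` instead of `mlxtend` and runs on the full dataset, so its numbers are not expected to match the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X_ir, y_ir = load_iris(return_X_y=True)
cv_ir = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

fold_acc = []
for train_idx, valid_idx in cv_ir.split(X_ir, y_ir):
    # Train a fresh tree on the training part of the fold
    tree = DecisionTreeClassifier(random_state=123, max_depth=3)
    tree.fit(X_ir[train_idx], y_ir[train_idx])
    # Store the validation accuracy of this fold in percent
    fold_acc.append(tree.score(X_ir[valid_idx], y_ir[valid_idx]) * 100)

fold_acc = np.array(fold_acc)
print("Kfold Accuracy: %.2f%% +/- %.2f%%" % (fold_acc.mean(), fold_acc.std()))
```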


Below, we use `cross_val_score` to do cross-validation for our decision tree.

In [12]:
from sklearn.model_selection import cross_val_score


cv_acc = cross_val_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=-1)  # use all available processors to run the folds in parallel

print('Kfold Accuracy: %.2f%%' % (np.mean(cv_acc)*100))
# The result differs from the manual loop above because an integer cv uses StratifiedKFold
# without shuffling, so the folds are not the same as with shuffle=True and random_state=123

Kfold Accuracy: 96.09%
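For a classifier, passing an integer `cv` is equivalent to passing an unshuffled `StratifiedKFold` with that many splits, so the folds (and the scores) are the same on every run. The sketch below checks this on Iris loaded through scikit-learn's `load_iris`; the data source and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_ir, y_ir = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=123, max_depth=3)

# Integer cv: stratified, unshuffled folds are created internally
acc_int = cross_val_score(tree, X_ir, y_ir, cv=10)
# Explicit splitter with the default shuffle=False
acc_splitter = cross_val_score(tree, X_ir, y_ir, cv=StratifiedKFold(n_splits=10))
# Running the integer version again reproduces the same scores
acc_again = cross_val_score(tree, X_ir, y_ir, cv=10)

print(np.allclose(acc_int, acc_splitter), np.allclose(acc_int, acc_again))
```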


<p style="margin-bottom:5cm;"></p>

- `cross_val_score` itself has no `random_state` argument. With an integer `cv`, the folds are created without shuffling, so the result is reproducible but we cannot control how the data are shuffled. This is not an issue in regular use cases, but it is limiting if we want to do "repeated cross-validation", which runs a cross-validation scheme multiple times with different random splits and then aggregates the scores (mean/SD).
- The next cell illustrates how we can instead provide our own cross-validation iterator, which does accept a random seed (note that the result matches the "manual" `StratifiedKFold` approach we performed earlier)

In [17]:
from sklearn.model_selection import cross_val_score


cv_acc = cross_val_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                         X=X_train,
                         y=y_train,
                         cv=StratifiedKFold(n_splits=10, random_state=123, shuffle=True),
                         n_jobs=-1)

print('Kfold Accuracy: %.2f%%' % (np.mean(cv_acc)*100))

Kfold Accuracy: 95.26%
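Because `cv` accepts any splitter object, repeated cross-validation can be run by passing scikit-learn's `RepeatedStratifiedKFold`. The following is a minimal sketch; the choice of 5 repeats, the use of `load_iris`, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_ir, y_ir = load_iris(return_X_y=True)

# 10-fold stratified CV, repeated 5 times with different random splits
rcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=123)
rep_acc = cross_val_score(
    DecisionTreeClassifier(random_state=123, max_depth=3),
    X_ir, y_ir, cv=rcv,
)

# One score per fold and repeat: n_splits * n_repeats values in total
print("Scores collected:", rep_acc.shape[0])
print("Repeated Kfold Accuracy: %.2f%% +/- %.2f%%"
      % (rep_acc.mean() * 100, rep_acc.std() * 100))
```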
