# Model evaluation using cross-validation
In this notebook, we will still use only numerical features.

We will discuss the practical aspects of assessing the generalization performance of our model via cross-validation instead of a single train-test split.

# Data preparation

Let's load the full census data set:

In [63]:
import pandas as pd
import numpy as np

adult_census = pd.read_csv("/Users/russconte/Adult_Census.csv")

We will now drop the target from the data we will use to train our predictive model.

In [64]:
target_name = "Class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

Then, we select only the numerical columns, as seen in the previous notebook.

In [65]:
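# Select the numerical columns from `data` (the frame without the target),
# not from the original `adult_census` frame.
data_numeric = data.select_dtypes(include=np.number)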

data_numeric.head(5)

Unnamed: 0,Age,fnlwgt,Education-num,Capital-gain,Capital-loss,Hours-per-week
0,25,226802,7,0,0,40
1,38,89814,9,0,0,50
2,28,336951,12,0,0,40
3,44,160323,10,7688,0,40
4,18,103497,10,0,0,30


Instead of a single train-test split, we will use cross-validation. Cross-validation consists of repeating the split, fit, and score procedure such that the training and testing sets are different each time. Generalization performance metrics are collected for each repetition and then aggregated. As a result, we can assess the variability of our measure of the model’s generalization performance.

We can now create a model using the make_pipeline tool to chain the preprocessing and the estimator, so that both steps are refit on the training data in every iteration of the cross-validation.

In [66]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())
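To make the repetition described above concrete, here is a minimal sketch of a cross-validation loop written by hand. It runs on a synthetic data set from make_classification (an assumption, used only so the cell is self-contained) and uses a plain shuffled KFold splitter, which is not necessarily the exact strategy scikit-learn picks by default for classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data (assumption, for illustration only).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    # A fresh pipeline per fold: the scaler only ever sees the training rows.
    fold_model = make_pipeline(StandardScaler(), LogisticRegression())
    fold_model.fit(X[train_idx], y[train_idx])
    # Accuracy on the held-out fold.
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
print(
    "Manual 5-fold accuracy: "
    f"{manual_scores.mean():.3f} ± {manual_scores.std():.3f}"
)
```

Each fold produces one score, and the aggregation step is the mean and standard deviation computed at the end.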

In scikit-learn, the function cross_validate performs cross-validation: you need to pass it the model, the data, and the target. Since several cross-validation strategies exist, cross_validate takes a parameter cv which defines the splitting strategy.

In [67]:
%%time
from sklearn.model_selection import cross_validate

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_result = cross_validate(model, data_numeric, target, cv=5)
cv_result

CPU times: user 555 ms, sys: 486 ms, total: 1.04 s
Wall time: 520 ms


{'fit_time': array([0.06561303, 0.06578612, 0.0674181 , 0.10535502, 0.08446193]),
 'score_time': array([0.01601005, 0.01569486, 0.01535892, 0.01601887, 0.01655197]),
 'test_score': array([0.81287747, 0.811342  , 0.81439394, 0.81367731, 0.82258395])}
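The cv parameter accepts either an integer or a splitter object. As a hedged illustration on synthetic data (an assumption, not the census data), the sketch below passes an explicit StratifiedKFold, which is what an integer cv resolves to for a classifier, and then a ShuffleSplit as an example of a different strategy. The names pipeline, res_int, res_skf and res_shuffle are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, for illustration only).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Integer cv: for a classifier this means unshuffled stratified 5-fold.
res_int = cross_validate(pipeline, X, y, cv=5)
# The same strategy, spelled out with an explicit splitter object.
res_skf = cross_validate(pipeline, X, y, cv=StratifiedKFold(n_splits=5))
same_scores = np.allclose(res_int["test_score"], res_skf["test_score"])
print("cv=5 matches StratifiedKFold(5):", same_scores)

# A different strategy: 10 independent random 80/20 splits.
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
res_shuffle = cross_validate(pipeline, X, y, cv=splitter)
print("Number of ShuffleSplit scores:", len(res_shuffle["test_score"]))
```

Passing a splitter object is mostly useful when you need control over shuffling, the number of repetitions, or the test set size.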

Note that by default the cross_validate function discards the K models that were trained on the different, overlapping subsets of the dataset. The goal of cross-validation is not to train a model, but rather to estimate the generalization performance of a model that would have been trained on the full training set, along with an estimate of the variability (uncertainty on the generalization accuracy).
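If you do want to inspect the fitted models, cross_validate has a return_estimator parameter that keeps them. The following is a small sketch on synthetic data (an assumption, used to keep the cell self-contained); the step name "logisticregression" is the one make_pipeline generates automatically from the class name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, for illustration only).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# return_estimator=True adds an "estimator" entry with one fitted
# pipeline per fold.
cv_with_models = cross_validate(pipeline, X, y, cv=5, return_estimator=True)
fitted_pipelines = cv_with_models["estimator"]

for fold, fitted in enumerate(fitted_pipelines):
    # Coefficients of the logistic regression fitted on this fold.
    coefs = fitted.named_steps["logisticregression"].coef_
    print(f"fold {fold}: coefficient array shape {coefs.shape}")
```

These per-fold models are useful for checking how stable the learned coefficients are, not for making final predictions.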

Let’s extract the scores computed on the test fold of each cross-validation round from the cv_result dictionary and compute the mean accuracy and the variation of the accuracy across folds.

In [68]:
scores = cv_result["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

The mean cross-validation accuracy is: 0.815 ± 0.004
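One detail about the ± value: scores.std() uses NumPy's default of ddof=0, which divides by the number of folds. With only five folds, the sample standard deviation (ddof=1) is slightly larger. The sketch below recomputes both from the test scores printed earlier; the variable names are chosen here for illustration.

```python
import numpy as np

# Test scores copied from the cross_validate output on the census data.
census_scores = np.array(
    [0.81287747, 0.811342, 0.81439394, 0.81367731, 0.82258395]
)

mean_score = census_scores.mean()
std_population = census_scores.std()      # NumPy default, ddof=0
std_sample = census_scores.std(ddof=1)    # divides by (n_folds - 1)

print(f"mean accuracy:     {mean_score:.3f}")
print(f"std with ddof=0:   {std_population:.4f}")
print(f"std with ddof=1:   {std_sample:.4f}")
```

At three decimal places both conventions round to the same value here, so the choice does not change the reported result.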


# Test cross-validation on the Heart data set

In [69]:
import pandas as pd
import numpy as np

Heart = pd.read_csv("/Users/russconte/Heart.csv")

As before, we separate the target column, chd, from the rest of the data set:

In [70]:
target = Heart["chd"]
Heart = Heart.drop(columns="chd")

In [71]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())

In [72]:
%%time
from sklearn.model_selection import cross_validate

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_result = cross_validate(model, Heart, target, cv=5)
cv_result

CPU times: user 27.9 ms, sys: 17.2 ms, total: 45.1 ms
Wall time: 25.4 ms


{'fit_time': array([0.00453186, 0.00319719, 0.00361514, 0.00311303, 0.00295115]),
 'score_time': array([0.00091696, 0.0008719 , 0.00087595, 0.00081706, 0.00077486]),
 'test_score': array([0.74193548, 0.70967742, 0.69565217, 0.7173913 , 0.79347826])}

In [73]:
scores = cv_result["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

The mean cross-validation accuracy is: 0.732 ± 0.034
