# Model evaluation using cross-validation

## Data preparation

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

adult_census = pd.read_csv("./csv_result-phpMawTba.csv")

# Separate the target column from the features.
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

# Keep only the numerical features.
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data_numeric = data[numerical_columns]

# Standardize the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())

## Cross-validation

Cross-validation consists of repeating the `fit`/`score` procedure several times, using a different split of the data into training and testing sets each time.

Generalization performance metrics are collected for each repetition and then aggregated.

As a result, we can assess the variability of our estimate of the model's generalization performance.
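
As a minimal sketch of this idea, the `fit`/`score` procedure can be repeated by hand on several random train/test splits. The synthetic dataset from `make_classification`, the number of repetitions, and the `test_size` value are assumptions chosen for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the adult census dataset).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

manual_scores = []
for seed in range(5):
    # A different random_state gives a different train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    manual_model = make_pipeline(StandardScaler(), LogisticRegression())
    manual_model.fit(X_train, y_train)
    manual_scores.append(manual_model.score(X_test, y_test))

# Aggregate the per-repetition scores into a mean and a standard deviation.
manual_scores = np.array(manual_scores)
print(f"{manual_scores.mean():.3f} ± {manual_scores.std():.3f}")
```

The standard deviation across repetitions gives a rough idea of how much the score depends on the particular split.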

## K-fold strategy

The entire dataset is split into `K` partitions. The
`fit`/`score` procedure is repeated `K` times where at each iteration `K - 1`
partitions are used to fit the model and `1` partition is used to score.

![Cross-validation diagram](./cross_validation_diagram.png)
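
To make the partitioning concrete, here is a small sketch that prints the indices produced by scikit-learn's `KFold` splitter. The 10-sample toy array and the choice of `K = 5` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples (illustrative only, not the census data).
X_toy = np.arange(10).reshape(-1, 1)

# K = 5 partitions, without shuffling.
kfold = KFold(n_splits=5)
folds = list(kfold.split(X_toy))

for i, (train_idx, test_idx) in enumerate(folds):
    # K - 1 partitions are used for training, 1 partition for testing.
    print(f"Fold {i}: train={train_idx}, test={test_idx}")
```

Each sample ends up in the testing partition exactly once, and in the training partitions for the other `K - 1` iterations.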

## `cross_validate`

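The `cross_validate` function runs the cross-validation procedure. You need to pass it the model, the data, and the target.

The `cv` parameter defines the splitting strategy; an integer value sets the number of folds.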

Setting `cv=5` or `cv=10` is common practice, as it offers a good
trade-off between computation time and stability of the estimated variability.
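
Besides an integer, `cv` also accepts a splitter object, which gives explicit control over the strategy. The sketch below passes a shuffled `KFold`; the synthetic data and the `random_state` values are assumptions for illustration. Note that with an integer `cv` and a classifier, scikit-learn uses a stratified K-fold, so the two settings are not strictly equivalent.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the adult census dataset).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression())

# An explicit splitter: 5 folds, with the samples shuffled before splitting.
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=0)
cv_result_kfold = cross_validate(clf, X, y, cv=cv_strategy)

print(cv_result_kfold["test_score"])
```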

In [4]:
%%time
from sklearn.model_selection import cross_validate

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_result = cross_validate(model, data_numeric, target, cv=5)
cv_result

CPU times: user 1.2 s, sys: 3.29 s, total: 4.49 s
Wall time: 537 ms


{'fit_time': array([0.05187631, 0.08103013, 0.09559274, 0.08064389, 0.07813406]),
 'score_time': array([0.02058291, 0.02056813, 0.02045441, 0.02075958, 0.02087069]),
 'test_score': array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80456593])}

The output of `cross_validate` is a Python dictionary, which by default
contains three entries:
- `fit_time`: the time to train the model on the training data for each fold;
- `score_time`: the time to predict with the model on the testing data for
  each fold;
- `test_score`: the default score on the testing data for each fold.
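
As an illustrative sketch, the dictionary can be extended with training scores through the `return_train_score` parameter and loaded into a pandas `DataFrame` for easier inspection. The synthetic data and variable names here are assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the adult census dataset).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression())

# return_train_score=True adds a "train_score" entry to the result.
cv_result_full = cross_validate(clf, X, y, cv=5, return_train_score=True)

# One row per fold, one column per dictionary entry.
cv_frame = pd.DataFrame(cv_result_full)
print(cv_frame)
```

Comparing `train_score` with `test_score` fold by fold is one way to spot a model that scores much better on the data it was trained on than on held-out data.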

In [5]:
scores = cv_result["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

The mean cross-validation accuracy is: 0.800 ± 0.003
