# Model Evaluation & Selection

![](images/justice-icon.jpg)

## Objective

This module is going to introduce you...

1. TBD
2. TBD
3. TBD

## Quick refresher

But first, let's review a few things that we learned in the previous modules.

### Data prep

In [3]:
# packages used
import pandas as pd
from sklearn.model_selection import train_test_split

# import data
adult_census = pd.read_csv('../data/adult-census.csv')

# separate feature & target data
target = adult_census['class']
features = adult_census.drop(columns='class')

# drop the duplicated column `"education-num"` as stated in the data exploration notebook
features = features.drop(columns='education-num')

# split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=123)

### Feature engineering

In [5]:
# packages used
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)

# preprocessors to handle numeric and categorical features
numerical_preprocessor = StandardScaler()
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")

# transformer to associate each of these preprocessors with their respective columns
preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)
])

### Modeling

In [6]:
# packages used
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Pipeline object to chain together modeling processes
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model

# fit our model
_ = model.fit(X_train, y_train)

# score on test set
model.score(X_test, y_test)

0.8503808041929408

## Resampling & cross-validation

In our "03-first_model.ipynb" notebook, we split our data into training and testing sets and assessed the performance of our model on the test set. Unfortunately, there are a few pitfalls to this approach:

1. If our dataset is small, a single test set may not provide realistic expectations of our model's performance on unseen data.
2. A single test set does not give us any insight into the variability of our model's performance.
3. Using our test set to drive our model building process can bias our results via _data leakage_.
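To make the second pitfall concrete, here is a minimal sketch that repeats a single train/test split with different random seeds and looks at how much the test score moves. It uses a small synthetic dataset from `make_classification` as a stand-in (an assumption made so the snippet runs on its own; it is not the adult census data), and the names `X_demo`, `y_demo`, and `split_scores` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the adult census dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Repeat the single train/test split with ten different seeds
split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=seed)
    clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

split_scores = np.array(split_scores)

# The spread across seeds is what a single split cannot show us
print(f"min={split_scores.min():.3f}, max={split_scores.max():.3f}, "
      f"std={split_scores.std():.3f}")
```

Each seed yields a different test set and therefore a different score. With a single split, we would only ever see one of these numbers.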

Resampling methods provide an alternative approach by allowing us to repeatedly fit a model of interest to parts of the training data and test its performance on other parts of the training data.

![](images/resampling.svg)

<div class="admonition note alert alert-info">
    <p class="first admonition-title" style="font-weight: bold;"><b>Note</b></p>
<p class="last">This allows us to train and validate our model entirely on the training data and not touch the test data until we have selected a final "optimal" model.</p>
</div>

The two most commonly used resampling methods are ___k-fold cross-validation___ and ___bootstrap sampling___. This module focuses on k-fold cross-validation.

## K-fold cross-validation

Cross-validation consists of repeating the split, fit, and score procedure such that the training and testing sets are different each time. Generalization performance metrics are collected for each repetition and then aggregated. As a result, we get an estimate of both the model’s generalization performance and its variability.

_k_-fold cross-validation (aka _k_-fold CV) is a resampling method that randomly divides the training data into _k_ groups (aka folds) of approximately equal size. 

![](images/cross_validation_diagram.png)

The model is fit on $k-1$ folds and then the remaining fold is used to compute model performance.  This procedure is repeated _k_ times; each time, a different fold is treated as the validation set. 

This process results in _k_ estimates of the generalization error (say $\epsilon_1, \epsilon_2, \dots, \epsilon_k$). The _k_-fold CV estimate is then computed by averaging these _k_ errors, $\frac{1}{k}\sum_{i=1}^{k}\epsilon_i$, providing us with an approximation of the error we might expect on unseen data.

![](images/cv.png)
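The procedure described above can be sketched by hand with scikit-learn's `KFold` splitter. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the adult census data, it measures error as `1 - accuracy`, and the names `X_demo`, `y_demo`, `fold_errors`, and `cv_error` are made up for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: NOT the adult census dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=123)
estimator = LogisticRegression(max_iter=500)

fold_errors = []
for train_idx, val_idx in kfold.split(X_demo):
    # Fit a fresh copy of the model on the k-1 training folds
    fold_model = clone(estimator)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Evaluate on the single held-out fold; error = 1 - accuracy
    accuracy = fold_model.score(X_demo[val_idx], y_demo[val_idx])
    fold_errors.append(1 - accuracy)

# The k-fold CV estimate is the average of the k fold errors
cv_error = np.mean(fold_errors)
print(f"fold errors: {np.round(fold_errors, 3)}")
print(f"k-fold CV error estimate: {cv_error:.3f}")
```

Every observation lands in the held-out fold exactly once, so each of the _k_ errors is computed on data the corresponding model never saw during fitting.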

In scikit-learn, the function [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) performs cross-validation; you need to pass it the model, the data, and the target. Since several cross-validation strategies exist, `cross_validate` takes a parameter `cv` which defines the splitting strategy.

<div class="admonition tip alert alert-warning">
    <p class="first admonition-title" style="font-weight: bold;"><b>Tip</b></p>
    <p class="last">In practice, one typically uses k=5 or k=10. There is no formal rule as to the size of k; however, as k gets larger, the difference between the estimated performance and the true performance to be seen on the test set tends to decrease, at the cost of fitting more models.</p>
</div>

In [9]:
%%time
from sklearn.model_selection import cross_validate

cv_result = cross_validate(model, X_train, y_train, cv=5)
cv_result

CPU times: user 3.21 s, sys: 48.5 ms, total: 3.26 s
Wall time: 3.26 s


{'fit_time': array([0.68133974, 0.61591721, 0.58582687, 0.61716914, 0.56861997]),
 'score_time': array([0.0254302 , 0.02883387, 0.02762032, 0.02966881, 0.02553988]),
 'test_score': array([0.85191757, 0.84548185, 0.85790336, 0.85094185, 0.85558286])}

The output of `cross_validate` is a Python dictionary, which by default contains three entries:

- `fit_time`: the time to train the model on the training folds for each split,
- `score_time`: the time to score the model on the held-out fold for each split, and
- `test_score`: the default score (accuracy for a classifier) on the held-out fold for each split.

In [10]:
scores = cv_result["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.852 +/- 0.004
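The `cv` argument does not have to be an integer. As a rough sketch, we can also pass a splitter object to control the strategy explicitly, and ask for training scores with `return_train_score=True`. The example below assumes synthetic data from `make_classification` (not the adult census data), and the names `splitter` and `cv_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption: NOT the adult census dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Explicit splitter: 10 stratified folds, shuffled with a fixed seed
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

cv_demo = cross_validate(
    LogisticRegression(max_iter=500),
    X_demo,
    y_demo,
    cv=splitter,
    return_train_score=True,  # adds a 'train_score' entry to the result
)

print(sorted(cv_demo.keys()))
print(f"train: {cv_demo['train_score'].mean():.3f}, "
      f"test: {cv_demo['test_score'].mean():.3f}")
```

For a classifier, passing an integer such as `cv=5` already uses stratified folds without shuffling. An explicit splitter is mainly useful when we want shuffling, a fixed seed, or a different strategy altogether.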


## Evaluation metrics

* Discuss the different scoring APIs available in Scikit-learn [https://scikit-learn.org/stable/modules/model_evaluation.html](https://scikit-learn.org/stable/modules/model_evaluation.html)
* Don't spend too much time on the different metrics; rather, point them to resources to learn about the different metrics
* Focus the attention here on how to apply the different scoring APIs
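As one possible illustration of applying the scoring API, `cross_validate` accepts a `scoring` argument that can be a single metric name or a list of names. The sketch below is an assumption-laden example rather than a recommendation: the data is synthetic (`make_classification`), the choice of metrics is arbitrary, and `metrics` and `cv_metrics` are names invented for the snippet.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: NOT the adult census dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Metric names come from scikit-learn's predefined scoring strings
metrics = ["accuracy", "balanced_accuracy", "roc_auc"]

cv_metrics = cross_validate(
    LogisticRegression(max_iter=500),
    X_demo,
    y_demo,
    cv=5,
    scoring=metrics,
)

# With several metrics, results are stored under 'test_<metric name>'
for name in metrics:
    metric_scores = cv_metrics[f"test_{name}"]
    print(f"{name}: {metric_scores.mean():.3f} +/- {metric_scores.std():.3f}")
```

When a list of metrics is passed, the single `test_score` entry is replaced by one `test_<name>` entry per metric.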

## Hyperparameter tuning

## Wrapping up