# Testing on the training data

**DO NOT DO IT**, since it is methodologically wrong! You will almost surely get a biased (i.e., optimistic) estimate of the generalization performance.

In [1]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()

X = iris.data
y = iris.target

clf = RandomForestClassifier(n_estimators=2, random_state=0)

# X is our training data
clf.fit(X, y)

# This will result in an overly optimistic estimation since we are using X again!
y_pred = clf.predict(X)

acc = accuracy_score(y, y_pred)
print(f'Accuracy: {acc:.2f}')

Accuracy: 0.97


## Two-way holdout

The **two-way hold-out method** is a straightforward technique used in model selection where the available dataset is split into two separate subsets: one for **training** the model and another for **testing** its performance. Typically, the data is divided in a fixed ratio (e.g., 70% training and 30% testing), so that the model learns patterns from the training set and is then evaluated on the unseen test set to estimate its generalization ability. This method helps detect overfitting, because the test set shows how the model performs on data it has not seen during training. The estimate can, however, be sensitive to how the data is split, especially if the dataset is small or unbalanced.

In [18]:
from sklearn.model_selection import train_test_split

# split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=2, random_state=0)
clf.fit(X_train, y_train)

# test with unseen data
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print(f'Accuracy: {acc:.2f}')

Accuracy: 0.91
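
The sensitivity to the split mentioned above can be made visible by repeating the hold-out with different seeds for the split only. The following is a minimal sketch, not part of the original notebook: the names `X_iris`, `y_iris`, and `holdout_accuracies` are introduced here for illustration, and `stratify=` is an added assumption that keeps the class proportions equal in both parts.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Separate names so the notebook's global X, y are left untouched
X_iris, y_iris = load_iris(return_X_y=True)

# Repeat the 70/30 hold-out with ten different split seeds;
# the classifier's own seed stays fixed, so only the split changes.
holdout_accuracies = []
for split_seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_iris, y_iris, test_size=0.3, random_state=split_seed,
        stratify=y_iris)  # assumption: stratified split
    model = RandomForestClassifier(n_estimators=2, random_state=0)
    model.fit(X_tr, y_tr)
    holdout_accuracies.append(accuracy_score(y_te, model.predict(X_te)))

print('Accuracy per split:', [f'{a:.2f}' for a in holdout_accuracies])
print(f'min = {min(holdout_accuracies):.2f}, max = {max(holdout_accuracies):.2f}')
```

The gap between the minimum and the maximum gives a rough idea of how much a single hold-out estimate depends on the particular split.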


# $k$-fold cross-validation

In [5]:
from sklearn.model_selection import cross_validate
import timeit

def do_cross_validation(clf, print_model=False, print_duration=False):
    # Runs 3-fold cross-validation on the global X, y defined above
    start = timeit.default_timer()
    cv = cross_validate(clf, X, y, scoring='accuracy', cv=3)
    n_folds = len(cv["test_score"])
    scores = ' + '.join(f'{s:.2f}' for s in cv["test_score"])
    mean_ = cv["test_score"].mean()
    msg = f'Cross-validated accuracy: ({scores}) / {n_folds} = {mean_:.2f}'

    if print_model:
        msg = f'\nClassifier: {clf}\n{msg}'
    # Print the scores even when the model description is not requested
    print(msg)

    if print_duration:
        duration = timeit.default_timer() - start
        print(f"Duration: {duration:.2f} seconds")

In [3]:
clf = RandomForestClassifier(n_estimators=2, random_state=0)
do_cross_validation(clf, True, True)


Classifier: RandomForestClassifier(n_estimators=2, random_state=0)
Cross-validated accuracy: (0.98 + 0.92 + 0.96) / 3 = 0.95

Duration: 0.03 seconds
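
To make explicit what `cross_validate` does with `cv=3`, here is a minimal sketch of the same procedure written as a manual loop. It assumes scikit-learn's documented behavior that an integer `cv` with a classifier results in stratified folds without shuffling. The names `X_iris`, `y_iris`, `manual_scores`, and `library_scores` are introduced here for illustration only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Separate names so the notebook's global X, y are left untouched
X_iris, y_iris = load_iris(return_X_y=True)
base_clf = RandomForestClassifier(n_estimators=2, random_state=0)

# Three stratified folds, no shuffling
folds = StratifiedKFold(n_splits=3)

manual_scores = []
for train_idx, test_idx in folds.split(X_iris, y_iris):
    fold_clf = clone(base_clf)  # fresh, unfitted copy for every fold
    fold_clf.fit(X_iris[train_idx], y_iris[train_idx])
    # score() of a classifier returns the accuracy on the held-out fold
    manual_scores.append(fold_clf.score(X_iris[test_idx], y_iris[test_idx]))
manual_scores = np.array(manual_scores)

# The library call used in do_cross_validation, for comparison
library_scores = cross_validate(
    base_clf, X_iris, y_iris, scoring='accuracy', cv=3)['test_score']

print('Manual loop:   ', np.round(manual_scores, 2))
print('cross_validate:', np.round(library_scores, 2))
```

Every sample is used for testing exactly once and for training in the other two folds, which is why the averaged score is less dependent on a single split than the hold-out estimate.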


## Applying $k$-fold cross-validation for model selection

In [6]:
from sklearn.svm import SVC

svc = SVC(random_state=0)
print('Default value for kernel: ', svc.kernel)
do_cross_validation(svc, True, True)

Default value for kernel:  rbf

Classifier: SVC(random_state=0)
Cross-validated accuracy: (0.96 + 0.98 + 0.94) / 3 = 0.96
Duration: 0.01 seconds


In [22]:
do_cross_validation(SVC(kernel='linear', random_state=0), print_model=True)
do_cross_validation(SVC(kernel='poly', random_state=0), print_model=True)
do_cross_validation(RandomForestClassifier(n_estimators=2, random_state=0), print_model=True)
do_cross_validation(RandomForestClassifier(n_estimators=5, random_state=0), print_model=True)


Classifier: SVC(kernel='linear', random_state=0)
Cross-validated accuracy: (1.00 + 1.00 + 0.98) / 3 = 0.99


Classifier: SVC(kernel='poly', random_state=0)
Cross-validated accuracy: (0.98 + 0.94 + 0.98) / 3 = 0.97


Classifier: RandomForestClassifier(n_estimators=2, random_state=0)
Cross-validated accuracy: (0.98 + 0.92 + 0.96) / 3 = 0.95


Classifier: RandomForestClassifier(n_estimators=5, random_state=0)
Cross-validated accuracy: (0.98 + 0.94 + 0.94) / 3 = 0.95



# Nested cross-validation

Nested cross-validation is a technique that provides an unbiased estimate of a model's generalization performance while also allowing for hyperparameter tuning. It involves two loops: 
- an outer loop, which splits the data into training and testing folds to assess model performance, 
- an inner loop, which further splits the training data for model selection and hyperparameter tuning (typically using cross-validation again). 

This structure ensures that the test data in the outer loop remains completely unseen during model selection, preventing information leakage and leading to a more reliable performance estimate, especially when comparing models or tuning hyperparameters.

- GridSearchCV = Inner loop (model selection via cross-validation).
- do_cross_validation() = Outer loop (model evaluation).

In [23]:
from sklearn.model_selection import GridSearchCV

# random forest inner loop
clf_grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid={'n_estimators': [2, 5]})
# random forest outer loop (timing is handled inside do_cross_validation)
do_cross_validation(clf_grid, print_model=True, print_duration=True)

# svc inner loop
svc_grid = GridSearchCV(SVC(random_state=0), param_grid={'kernel': ['linear', 'poly']})
# svc outer loop
do_cross_validation(svc_grid, print_model=True, print_duration=True)

Duration: 0.17338370000015857
Classifier: GridSearchCV(estimator=RandomForestClassifier(random_state=0),
             param_grid={'n_estimators': [2, 5]})
Cross-validated accuracy: (0.98 + 0.92 + 0.96) / 3 = 0.95


Duration: 0.055051600000297185
Classifier: GridSearchCV(estimator=SVC(random_state=0),
             param_grid={'kernel': ['linear', 'poly']})
Cross-validated accuracy: (1.00 + 0.94 + 0.98) / 3 = 0.97




## Nested CV - getting the final model

Nested cross-validation itself doesn't directly produce a final model. Rather, it is a technique to get an unbiased estimate of the generalization error.

There are three alternative approaches to produce the final model **after** using nested CV. 
1. The final model is produced by training on the entire dataset, and using the best hyperparameters found during the inner loop.
2. The final model is produced using the algorithm selected by nested CV, but rerunning the hyperparameter search on the whole dataset.
3. (Ensemble Model) The final model is built as an ensemble model by combining predictions from the multiple models trained in the inner loop.

Approaches 1 and 2 are the most common ones. Both involve using the entire dataset to refit a model AFTER the generalization error has been estimated.

Notice that in all of the three approaches described above, the estimate of the generalization error to be reported is the one resulting from the nested CV procedure. 

The nested CV method ensures a rigorous evaluation of model performance, with independent hyperparameter tuning in each outer fold, which is critical in avoiding overfitting and data leakage.
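
The third (ensemble) approach can be sketched as follows. This is one possible reading of it, stated as an assumption: the tuned model from each outer fold is kept, and their class probabilities are averaged (soft voting). The helper `ensemble_predict` and the names `X_iris`, `y_iris`, and `fold_models` are hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate

# Separate names so the notebook's global X, y are left untouched
X_iris, y_iris = load_iris(return_X_y=True)

# Inner loop: hyperparameter search
tuned_clf = GridSearchCV(RandomForestClassifier(random_state=0),
                         param_grid={'n_estimators': [2, 5]}, cv=3)

# Outer loop: return_estimator=True keeps the GridSearchCV object
# that was fitted on the training part of each outer fold
outer = cross_validate(tuned_clf, X_iris, y_iris, cv=3,
                       scoring='accuracy', return_estimator=True)
fold_models = outer['estimator']

for i, fold_model in enumerate(fold_models):
    print(f'Outer fold {i}: best params = {fold_model.best_params_}')


def ensemble_predict(models, X_new):
    """Soft voting: average the class probabilities of the per-fold models."""
    mean_proba = np.mean([m.predict_proba(X_new) for m in models], axis=0)
    return models[0].classes_[np.argmax(mean_proba, axis=1)]


# Only demonstrates the call; these samples were seen during training
print(ensemble_predict(fold_models, X_iris[:5]))
```

Note that the nested CV estimate describes the individual tuned models, so it is only an approximation of the ensemble's performance.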

The code blocks below provide examples of using the second approach.

## Nested CV - Classification example

In [24]:
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import numpy as np

# `outer_cv` creates 3 folds for estimating generalization error
outer_cv = KFold(3)

# when we train on a certain fold, we use a second cross-validation
# split in order to choose hyperparameters
inner_cv = KFold(3)

# create some classification data
X, y = make_classification(n_samples=1000, n_features=10)

# give shorthand names to models and use those as dictionary keys mapping
# to models and parameter grids for that model
models_and_parameters = {
    'svc': (SVC(),
            {'C': [0.01, 0.05, 0.1, 1]}),
    'rfc': (RandomForestClassifier(),
           {'max_depth': [5, 10, 50, 100, 200, 500]})}

# we will collect the average of the scores on the 3 outer folds in this dictionary
# with keys given by the names of the models in `models_and_parameters`
average_scores_across_outer_folds_for_each_model = dict()

# find the model with the best generalization error
for name, (model, params) in models_and_parameters.items():
    # this object is a classifier that also happens to choose
    # its hyperparameters automatically using `inner_cv`
    classifier_that_optimizes_its_hyperparams = GridSearchCV(
        estimator=model, param_grid=params,
        cv=inner_cv, scoring='accuracy')

    # estimate generalization error on the 3-fold splits of the data
    scores_across_outer_folds = cross_val_score(
        classifier_that_optimizes_its_hyperparams,
        X, y, cv=outer_cv, scoring='accuracy')

    # get the mean accuracy across each of outer_cv's 3 folds
    average_scores_across_outer_folds_for_each_model[name] = np.mean(scores_across_outer_folds)
    error_summary = 'Model: {name}\nAccuracy in the 3 outer folds: {scores}.\nAverage acc: {avg}'
    print(error_summary.format(
        name=name, scores=scores_across_outer_folds,
        avg=np.mean(scores_across_outer_folds)))
    print()

print('Average score across the outer folds: ',
      average_scores_across_outer_folds_for_each_model)

many_stars = '\n' + '*' * 100 + '\n'
print(many_stars + 'Now we choose the best model and refit on the whole dataset' + many_stars)

best_model_name, best_model_avg_score = max(
    average_scores_across_outer_folds_for_each_model.items(),
    key=(lambda name_averagescore: name_averagescore[1]))

# get the best model and its associated parameter grid
best_model, best_model_params = models_and_parameters[best_model_name]

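# Approach 1 (not used here): fix the hyperparameters to the values found in
# the inner loop, e.g. with `best_model.set_params(...)`, then fit once on the
# whole dataset
#best_model.fit(X, y)

# Approach 2: rerun the hyperparameter search for this best model on the whole
# dataset and refit it, so that we can start making predictions on other data.
# We already have a reliable estimate of this model's generalization error,
# and we are confident this is the best model among the ones we have tried
final_classifier = GridSearchCV(best_model, best_model_params, cv=inner_cv, scoring='accuracy')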
final_classifier.fit(X, y)

print('Best model: \n\t{}'.format(best_model), end='\n\n')
print('Estimation of its generalization error (accuracy):\n\t{}'.format(
    best_model_avg_score), end='\n\n')
print('Best parameter choice for this model: \n\t{params}'
      '\n(according to cross-validation `{cv}` on the whole dataset).'.format(
      params=final_classifier.best_params_, cv=inner_cv))

Model: svc
Accuracy in the 3 outer folds: [0.92814371 0.8978979  0.87387387].
Average acc: 0.8999718281155408

Model: rfc
Accuracy in the 3 outer folds: [0.9491018  0.94594595 0.89489489].
Average acc: 0.9299808790826756

Average score across the outer folds:  {'svc': 0.8999718281155408, 'rfc': 0.9299808790826756}

****************************************************************************************************
Now we choose the best model and refit on the whole dataset
****************************************************************************************************

Best model: 
	RandomForestClassifier()

Estimation of its generalization error (accuracy):
	0.9299808790826756

Best parameter choice for this model: 
	{'max_depth': 500}
(according to cross-validation `KFold(n_splits=3, random_state=None, shuffle=False)` on the whole dataset).


## Nested CV - Regression example I

In [25]:
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import numpy as np

# `outer_cv` creates 3 folds for estimating generalization error
outer_cv = KFold(3)

# when we train on a certain fold, we use a second cross-validation
# split in order to choose hyperparameters
inner_cv = KFold(3)

# create some regression data
X, y = make_regression(n_samples=1000, n_features=10)

# give shorthand names to models and use those as dictionary keys mapping
# to models and parameter grids for that model
models_and_parameters = {
    'svr': (SVR(),
            {'C': [0.01, 0.05, 0.1, 1]}),
    'rfr': (RandomForestRegressor(),
           {'max_depth': [5, 10, 50, 100, 200, 500]})}

# we will collect the average of the scores on the 3 outer folds in this dictionary
# with keys given by the names of the models in `models_and_parameters`
average_scores_across_outer_folds_for_each_model = dict()

# find the model with the best generalization error
for name, (model, params) in models_and_parameters.items():
    # this object is a regressor that also happens to choose
    # its hyperparameters automatically using `inner_cv`
    regressor_that_optimizes_its_hyperparams = GridSearchCV(estimator = model, 
                                                            param_grid = params,
                                                            cv = inner_cv, 
                                                            scoring = 'neg_mean_squared_error')

    # estimate generalization error on the 3-fold splits of the data
    scores_across_outer_folds = cross_val_score(regressor_that_optimizes_its_hyperparams,
                                                X, 
                                                y, 
                                                cv = outer_cv, 
                                                scoring='neg_mean_squared_error')

    # get the mean negative MSE across each of outer_cv's 3 folds (scikit-learn
    # negates the MSE so that higher scores are always better)
    average_scores_across_outer_folds_for_each_model[name] = np.mean(scores_across_outer_folds)
    error_summary = 'Model: {name}\nMSE in the 3 outer folds: {scores}.\nAverage error: {avg}'
    print(error_summary.format(
        name=name, scores=scores_across_outer_folds,
        avg=np.mean(scores_across_outer_folds)))
    print()

print('Average score across the outer folds: ',
      average_scores_across_outer_folds_for_each_model)

many_stars = '\n' + '*' * 100 + '\n'
print(many_stars + 'Now we choose the best model and refit on the whole dataset' + many_stars)

best_model_name, best_model_avg_score = max(
    average_scores_across_outer_folds_for_each_model.items(),
    key=(lambda name_averagescore: name_averagescore[1]))

# get the best model and its associated parameter grid
best_model, best_model_params = models_and_parameters[best_model_name]

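# now we rerun the hyperparameter search for this best model on the whole
# dataset and refit it, so that we can start making predictions on other data.
# We already have a reliable estimate of this model's generalization error,
# and we are confident this is the best model among the ones we have tried.
# The scoring matches the one used inside the nested CV; without it,
# GridSearchCV would fall back to the regressor's default score (R^2)
final_regressor = GridSearchCV(best_model, best_model_params, cv=inner_cv,
                               scoring='neg_mean_squared_error')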
final_regressor.fit(X, y)

print('Best model: \n\t{}'.format(best_model), end='\n\n')
print('Estimation of its generalization error (negative mean squared error):\n\t{}'.format(
    best_model_avg_score), end='\n\n')
print('Best parameter choice for this model: \n\t{params}'
      '\n(according to cross-validation `{cv}` on the whole dataset).'.format(
      params=final_regressor.best_params_, cv=inner_cv))

Model: svr
MSE in the 3 outer folds: [-22100.00522677 -23269.74391012 -25705.88313907].
Average error: -23691.877425318682

Model: rfr
MSE in the 3 outer folds: [-3751.12083033 -3976.42008502 -4629.63215186].
Average error: -4119.05768906673

Average score across the outer folds:  {'svr': -23691.877425318682, 'rfr': -4119.05768906673}

****************************************************************************************************
Now we choose the best model and refit on the whole dataset
****************************************************************************************************

Best model: 
	RandomForestRegressor()

Estimation of its generalization error (negative mean squared error):
	-4119.05768906673

Best parameter choice for this model: 
	{'max_depth': 50}
(according to cross-validation `KFold(n_splits=3, random_state=None, shuffle=False)` on the whole dataset).


In [26]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.5.2.


## Nested CV - Regression example II

In this second example of using nested cross-validation for a regression task, [GridSearchCV](https://scikit-learn.org/1.5/modules/grid_search.html) is applied for hyperparameter tuning in the inner cross-validation loop and `cross_val_score` with `KFold` for the outer loop. In this example, the Ridge regression learning algorithm is used. This algorithm has a regularization parameter `alpha`.

In [29]:
import sys
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np

# Load dataset (using California Housing dataset as an example)
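try:
    data = fetch_california_housing(as_frame=True)
    X = data.data
    y = data.target
except OSError:
    # Download failures (e.g., no network access) surface as OSError subclasses
    print("Error loading dataset. Please ensure you have internet access and try again.")
    sys.exit(1)

# Convert to NumPy arrays so that the fold indices below select rows by position
X = X.to_numpy()
y = y.to_numpy()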

# Define the parameter grid for the inner cross-validation (for Ridge Regression)
param_grid = {
    'ridge__alpha': [0.01, 0.1, 1, 10, 100]
}

# Outer cross-validation (5-fold)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# List to store outer CV scores
outer_scores = []

# Nested Cross-Validation
for train_idx, test_idx in outer_cv.split(X, y):
    # Split data into training and test sets for the outer fold
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    # Define a pipeline to scale features and apply Ridge Regression
    pipeline = make_pipeline(StandardScaler(), Ridge())
    
    # Inner cross-validation with GridSearchCV for hyperparameter tuning
    grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
    
    # Fit the model on the training data of the outer fold
    grid_search.fit(X_train, y_train)
    
    # Use the best model to predict the outer test set
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    
    # Calculate the Mean Squared Error on the outer test set
    outer_score = mean_squared_error(y_test, y_pred)
    
    # Append the outer score
    outer_scores.append(outer_score)

# Display the average and standard deviation of outer CV scores
print(f"Nested CV Mean Squared Error: {np.mean(outer_scores):.4f} ± {np.std(outer_scores):.4f}")


Nested CV Mean Squared Error: 0.5306 ± 0.0217
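
The loop above does not record which `alpha` was selected in each outer fold. A minimal sketch of how to keep that information, using `cross_validate` with `return_estimator=True`, is shown here. To stay fast and offline it runs on synthetic stand-in data from `make_regression`, which is an assumption and not the California Housing dataset; the names `X_demo`, `y_demo`, `selected_alphas`, and `outer_mse` are introduced for illustration.

```python
from collections import Counter

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the housing data used above)
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)

alpha_grid = [0.01, 0.1, 1, 10, 100]

# Inner loop: grid search over alpha inside a scaling pipeline
search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={'ridge__alpha': alpha_grid},
    cv=5, scoring='neg_mean_squared_error')

# Outer loop: keep the fitted search object of every outer fold
outer_cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)
result = cross_validate(search, X_demo, y_demo, cv=outer_cv_demo,
                        scoring='neg_mean_squared_error',
                        return_estimator=True)

selected_alphas = [est.best_params_['ridge__alpha']
                   for est in result['estimator']]
outer_mse = -result['test_score']  # flip the sign back to a plain MSE

print('alpha selected in each outer fold:', selected_alphas)
print('most frequently selected:', Counter(selected_alphas).most_common(1)[0][0])
print(f'Nested CV MSE: {outer_mse.mean():.4f} ± {outer_mse.std():.4f}')
```

If the selected values differ widely between folds, the hyperparameter search is unstable on this data, which is worth knowing before committing to a single value.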


After running nested CV, we obtain an unbiased estimate of model performance with the optimal hyperparameters. However, nested CV doesn't directly provide a final, production-ready model. To train the final model with the best hyperparameters for production, follow these steps:

1. Identify the best hyperparameters: Nested CV may select different hyperparameters in each outer fold, so there is not necessarily a single winner to read off. You can inspect the `best_params_` attribute of the `GridSearchCV` object fitted in each outer fold and choose the configuration that was selected most often, or rerun the grid search on the entire dataset and use the configuration with the lowest mean cross-validated error.

2. Train the final model on the entire dataset: Use the entire dataset (without splitting) to fit the model using the optimal hyperparameters. This will ensure the model has learned from all available data, maximizing its predictive power.

3. Save the model: After training the final model, save it to disk for future use in production.

Here's the Python code to build the final model using the best hyperparameters:

In [28]:
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import joblib  # For saving the model

# Step 1: Define the best hyperparameters
best_alpha = 1  # Replace with the best 'alpha' found in nested CV

# Step 2: Train the final model on the entire dataset
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=best_alpha))
pipeline.fit(X, y)

# Step 3: Save the model
joblib.dump(pipeline, "final_model.joblib")
print("Final model trained and saved as 'final_model.joblib'")

Final model trained and saved as 'final_model.joblib'


In the code provided above, notice the following:

- Best Hyperparameters: Set `best_alpha` to the optimal value found during nested CV. This may vary depending on the dataset, so replace 1 with the actual optimal value obtained.

- Training on Full Dataset: Training on the entire dataset with the chosen hyperparameters provides the final model that has learned from all data points.

- Saving the Model: `joblib` is useful for saving Scikit-Learn models efficiently. The saved model can then be loaded in a production environment for predictions.
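
Loading the saved model for predictions can be sketched as follows. To keep the example self-contained, it trains a small pipeline on synthetic data (an assumption) and writes it to a temporary directory instead of reusing `final_model.joblib`; the names `X_demo`, `y_demo`, `trained`, and `restored` are introduced for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and model (assumption, for a self-contained example)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=5.0, random_state=0)
trained = make_pipeline(StandardScaler(), Ridge(alpha=1)).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'final_model.joblib')
    joblib.dump(trained, model_path)    # what the training side does
    restored = joblib.load(model_path)  # what the serving side does

# The pipeline bundles the fitted scaler, so raw (unscaled) features are passed in
new_samples = X_demo[:3]
restored_pred = restored.predict(new_samples)
print(restored_pred)
print('Same predictions as the in-memory model:',
      np.allclose(restored_pred, trained.predict(new_samples)))
```

Two practical caveats: `joblib` files are pickle-based, so they should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that was used for saving.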

# Choosing between 2-way holdout and 3-way holdout

When choosing the split proportions in two-way holdout and three-way holdout methods, several aspects influence the optimal choice. These aspects include dataset size, model complexity, and the need for robust model evaluation. Choosing split proportions that fit the data characteristics, model complexity, and evaluation priorities ensures robust model selection and reliable performance estimation. The discussion below provides a breakdown of considerations for each method.

## Two-Way Holdout Method
In the two-way holdout method, the dataset is split into two parts: typically a training set and a test set.

**Split Proportions**
Common choices are:
- 80/20: 80% training, 20% testing.
- 70/30: 70% training, 30% testing.
- 90/10: 90% training, 10% testing (often used for very large datasets).

**Aspects to Consider**

1. **Dataset Size:**
    - Small Datasets: A larger portion (e.g., 80–90%) should be dedicated to training to allow the model to learn enough patterns from limited data. However, this limits the amount of data left for testing, potentially making performance estimates less reliable.
    - Large Datasets: You can afford to allocate a larger portion to testing (e.g., 30%) since there is usually enough data in the training set to achieve stable model training.

2. **Model Complexity:**
    - Simple Models (e.g., linear regression): Simple models with fewer parameters may need less training data to achieve a stable fit. In this case, a split like 70/30 may work well.
    - Complex Models (e.g., deep learning): These models require substantial data to generalize well, so an 80/20 or 90/10 split is often more appropriate, especially if the dataset is small.

3. **Evaluation Goals:**

    - If generalization is a top priority (e.g., avoiding overfitting in a complex model), a larger test set (e.g., 70/30) provides a more reliable estimate of performance.
    - If the primary focus is model performance with limited data, using a smaller test set (e.g., 90/10) allows more training data, which can enhance model performance but may slightly reduce confidence in the test results.

## Three-Way Holdout Method
In the three-way holdout method, the dataset is split into three parts: a training set, a validation set, and a test set.

**Split Proportions**

Typical splits include:
- 60/20/20: 60% training, 20% validation, 20% test.
- 70/15/15: 70% training, 15% validation, 15% test.
- 80/10/10: 80% training, 10% validation, 10% test.

**Aspects to Consider**

1. **Dataset Size:**
    - Small Datasets: A 60/20/20 split might be used to ensure adequate data in the validation and test sets for reliable model evaluation, but it reduces training data. Alternatively, a 70/15/15 split may strike a better balance, giving slightly more data for training.
    - Large Datasets: With more data, an 80/10/10 split can allocate a larger training set to boost model performance while still leaving enough data in validation and test sets for reliable evaluation.

2. **Purpose of Validation Set:**
    - Hyperparameter Tuning: The validation set is crucial for hyperparameter tuning and model selection. If fine-tuning the model is critical, having 15–20% of the data as validation ensures more stable results, especially if the dataset is small.
    - Early Stopping in Training: If using the validation set for early stopping (common in deep learning), a larger training set is beneficial (e.g., 70/15/15) to provide stability in validation metrics while maximizing training data.

3. **Final Evaluation Goals:**
    - Minimizing Overfitting: A larger test set (e.g., 20%) helps give a more robust final evaluation of model generalization, especially if it will be used in critical applications where overfitting could be harmful.
    - Ensuring Training Robustness: In cases where training stability is prioritized (e.g., complex models with a large number of parameters), an 80/10/10 split can maximize training data while still providing reasonable validation and test insights.
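
A three-way split such as 60/20/20 can be obtained with two consecutive calls to `train_test_split`. The sketch below is illustrative only: the variable names with the `_3w` suffix are introduced here, and `stratify=` is an added assumption that keeps the class proportions equal across the three parts.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)  # 150 samples

# Step 1: set aside 20% of all data as the final test set
X_rest, X_test_3w, y_rest, y_test_3w = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=0, stratify=y_iris)

# Step 2: split the remaining 80% into training and validation sets.
# 25% of the remaining 80% corresponds to 20% of the full dataset.
X_train_3w, X_val_3w, y_train_3w, y_val_3w = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# 90 training, 30 validation, and 30 test samples (60/20/20)
print(len(X_train_3w), len(X_val_3w), len(X_test_3w))
```

The validation set is then used for hyperparameter tuning or early stopping, and the test set is touched only once, for the final evaluation.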

## General Recommendations
- Small Datasets: 70/30 for two-way, 70/15/15 for three-way splits, so that the held-out sets remain large enough for a meaningful estimate while most of the data still goes to training.
- Medium to Large Datasets: 80/20 or 90/10 for two-way, and 80/10/10 for three-way splits. With larger datasets, even a small held-out fraction contains enough samples for reliable validation and testing.
- Complex Models and Hyperparameter Tuning: A three-way split with a larger training set and moderate validation set (e.g., 70/15/15) helps in fine-tuning while balancing generalization.

# Choosing the right evaluation metric 

We saw that model selection encompasses *model evaluation*, which is the process of assessing how well a model generalizes. Each evaluation metric has strengths and weaknesses. Therefore, understanding the context and priorities of the task is essential for selecting the best evaluation metric to be used during model selection.

## Classification

When evaluating binary classification models, it’s common to encounter situations where certain types of predictions carry different levels of importance or risk. Understanding these scenarios is crucial for choosing and tuning the appropriate model to meet real-world needs. 

Concretely, there are four main scenarios that can influence the choice of evaluation metric during model selection: **bad positives**, **harmless negatives**, **unbalanced classes**, and **unequal costs for predictions**. By choosing the appropriate evaluation metric, thresholds, and techniques during model selection, practitioners can tailor models to meet the specific needs and constraints of each application.

1. **Bad Positives**. Bad positives refer to instances where the model predicts a positive outcome (1) incorrectly. This type of error is often called a **false positive**. For example, in medical diagnosis, a false positive for a severe disease could cause unnecessary anxiety for a patient and lead to costly, invasive follow-up tests.
> When false positives are highly detrimental, precision should be prioritized, meaning the model should avoid predicting positives unless it’s very confident.

2. **Harmless Negatives**. Harmless negatives refer to cases where incorrectly predicting a negative outcome (0), i.e., a **false negative**, has a minimal or acceptable impact. For example, in spam detection with spam as the positive class, letting a spam email through to the inbox (false negative) is usually harmless, because the user can simply delete it.
> When false negatives are acceptable or pose minimal risk, the model can favor precision over sensitivity (recall): it predicts a positive only when it is confident, accepting that some true positives will be missed.

3. **Unbalanced Classes**. Unbalanced classes occur when the number of examples in each class differs significantly, with one class being much more frequent than the other. For example, in fraud detection, fraudulent transactions (positive class) are typically much rarer than legitimate transactions (negative class).
> Standard accuracy as a metric may be misleading in such cases because the model could achieve high accuracy by mostly predicting the majority class. Techniques such as **resampling** (oversampling the minority class or undersampling the majority class) and using metrics like **F1 score** or **AUC-ROC** are more effective for evaluating performance in these cases.

4. **Unequal Costs for Predictions**. In many real-world applications, the cost of different types of errors (false positives vs. false negatives) varies. For example, in a credit risk assessment where "approve" is the positive class, incorrectly granting a loan to a risky applicant (false positive) might have a high financial cost, while denying a loan to a low-risk applicant (false negative) might incur only a minor loss in potential business.
  > For cases with unequal costs, it’s essential to balance the prediction threshold to account for these cost disparities. **Cost-sensitive learning** and setting a **custom decision threshold** are common strategies, and models may be evaluated using **weighted cost metrics** that reflect the different impacts of each error type.
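
The scenarios above can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the synthetic dataset with roughly 5% positives, the logistic regression model, the two thresholds, and the hypothetical costs `COST_FP` and `COST_FN`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, unbalanced data: roughly 5% positives (assumption)
X_imb, y_imb = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.3, stratify=y_imb, random_state=0)

# A baseline that always predicts the majority (negative) class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
base_pred = baseline.predict(X_te)
baseline_accuracy = accuracy_score(y_te, base_pred)
baseline_f1 = f1_score(y_te, base_pred, zero_division=0)
print(f'Baseline: accuracy = {baseline_accuracy:.2f}, F1 = {baseline_f1:.2f}')

# A real model, first scored with a threshold-independent metric
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
print(f'Logistic regression ROC AUC: {roc_auc_score(y_te, proba):.2f}')

# Hypothetical costs: a missed positive is five times as costly as a false alarm
COST_FP, COST_FN = 1.0, 5.0

summary = {}
for threshold in (0.5, 0.2):
    # Custom decision threshold instead of the default 0.5
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    summary[threshold] = {
        'precision': precision_score(y_te, pred, zero_division=0),
        'recall': recall_score(y_te, pred, zero_division=0),
        'cost': COST_FP * fp + COST_FN * fn,
    }
    print(f'threshold = {threshold}: {summary[threshold]}')
```

The baseline reaches a high accuracy while its F1 score is zero, which is why accuracy alone is misleading on unbalanced data. Lowering the threshold can only keep or increase recall, usually at the expense of precision; which threshold gives the lower total cost depends on the cost values chosen.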

## Regression

When choosing an evaluation metric for a regression task, it’s essential to consider the specific goals and characteristics of the problem. Here are some key aspects to guide the choice of an appropriate evaluation metric:

1. **Type and Scale of Errors**. If large errors are more detrimental than smaller ones, metrics like **Mean Squared Error (MSE)** or **Root Mean Squared Error (RMSE)**, which penalize larger errors more, may be suitable. For cases where each error should be weighted equally (regardless of size), **Mean Absolute Error (MAE)** is often a good choice, as it gives a linear penalty to errors.

2. **Interpretability and Communication**. **RMSE** and **MAE** are in the same units as the target variable, making them interpretable and easy to communicate. [Mean Absolute Percentage Error](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error) (**MAPE**) provides an error as a percentage of actual values, which can be useful for interpreting results across different scales or for communicating model performance to stakeholders.

3. **Sensitivity to Outliers**. If the dataset contains outliers, **MAE** might be a better choice because it is less sensitive to extreme values than **MSE** or **RMSE**, which emphasize large errors due to their squared terms.

4. **Balance Between Positive and Negative Errors**. In cases where overestimations and underestimations have different costs, an asymmetric error metric (like **Quantile Loss**) may be appropriate, as it allows you to weigh over- and under-predictions differently.

5. **Model Selection and Comparison**. Metrics like **RMSE** and **MAE** are popular because they provide consistent rankings across models and datasets. However, **Adjusted R-squared** can be more suitable when comparing models with different numbers of predictors, as it adjusts for model complexity.

6. **Variance of Errors**. If the distribution of errors is a concern (e.g., for quality control in industrial processes), **R-squared (Coefficient of Determination)** may be useful, as it measures the proportion of variance in the target variable explained by the model.

7. **Goal of the Application**. In applications like medical prediction or finance, large errors may be unacceptable, making **RMSE** a better choice because it penalizes them more heavily. For other applications with more flexibility, **MAE** or **MAPE** may suffice.

The table below presents a summary of the most common regression metrics.

| Metric                | Description                                           | Use Case Example                                  |
|-----------------------|-------------------------------------------------------|---------------------------------------------------|
| **MAE**               | Average of absolute errors                            | Suitable for interpretable, low-outlier tasks     |
| **MSE**               | Average of squared errors                             | Useful when penalizing large errors               |
| **RMSE**              | Square root of MSE, interpretable in original units   | Sensitive to large errors, often used in finance  |
| **R-squared**         | Proportion of explained variance                      | Ideal for comparing different models              |
| **MAPE**              | Mean absolute percentage error                        | Useful when predicting across multiple scales     |
| **Quantile Loss**     | Asymmetric error measure based on quantiles           | Good when over-/under-prediction costs differ     |
