# Overfit-generalization-underfit

In the previous notebook, we presented the general cross-validation framework
and how it helps quantify the empirical and generalization errors, as well
as their fluctuations.

In this notebook, we will put these two errors into perspective and show how
they can help us know whether our model generalizes, overfits, or underfits.

Let's first load the data and create the same model as in the previous
notebook.

In [None]:
from sklearn.datasets import fetch_california_housing

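housing = fetch_california_housing(as_frame=True)
# The raw target is expressed in units of 100 k$; rescale it to k$ so that
# the axis labels and the error values quoted below are consistent.
X, y = housing.data, housing.target * 100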

In [None]:
from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor()

## Overfitting vs. underfitting

To better understand the performance of our model and maybe find insights on
how to improve it, we will compare the generalization error with the empirical
error. Thus, we need to compute the error on the training set, which is
possible by passing `return_train_score=True` to the `cross_validate`
function.

In [None]:
import pandas as pd
from sklearn.model_selection import cross_validate, ShuffleSplit

cv = ShuffleSplit(n_splits=30, test_size=0.2)
cv_results = cross_validate(regressor, X, y,
                            cv=cv, scoring="neg_mean_absolute_error",
                            return_train_score=True, n_jobs=2)
cv_results = pd.DataFrame(cv_results)

The scores are negative mean absolute errors, so we will select the train and
test scores and negate them to obtain the errors.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("talk")

scores = pd.DataFrame()
scores[["train error", "test error"]] = -cv_results[
    ["train_score", "test_score"]]
sns.histplot(scores, bins=50)
_ = plt.xlabel("Mean absolute error (k$)")

By plotting the distribution of the empirical and generalization errors, we
get information about whether our model is over-fitting, under-fitting, or
both at the same time.

Here, we observe a **small empirical error** (actually zero), meaning that
the model is **not under-fitting**: it is flexible enough to capture any
variations present in the training set.

However, the **significantly larger generalization error** tells us that the
model is **over-fitting**: the model has memorized many variations of the
training set that could be considered "noisy" because they do not generalize
to help us make good predictions on the test set.
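
The same diagnosis can be read from a few summary numbers rather than from
the histogram. The cell below is a minimal sketch of that idea. It assumes a
synthetic dataset from `make_regression` as a stand-in for the housing data,
and the names `X_demo`, `y_demo` and `cv_demo` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the housing dataset).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=20.0, random_state=0)

cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
results = cross_validate(
    DecisionTreeRegressor(random_state=0), X_demo, y_demo, cv=cv_demo,
    scoring="neg_mean_absolute_error", return_train_score=True)

# Scores are negated errors, so flip the sign to recover the errors.
train_error = -results["train_score"]
test_error = -results["test_score"]

print(f"Empirical error:      "
      f"{train_error.mean():.2f} +/- {train_error.std():.2f}")
print(f"Generalization error: "
      f"{test_error.mean():.2f} +/- {test_error.std():.2f}")
print(f"Gap (test - train):   "
      f"{test_error.mean() - train_error.mean():.2f}")
```

A near-zero empirical error combined with a large gap is the over-fitting
signature described above.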

## Validation curve

Some model hyperparameters are usually the key to going from a model that
underfits to a model that overfits, hopefully passing through a region where
we can get a good balance between the two. We can study this transition by
plotting a curve called the validation curve. This curve repeats the above
experiment while varying the value of a hyperparameter.

For the decision tree, `max_depth` is the main parameter that controls the
trade-off between under-fitting and over-fitting.
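
To see why `max_depth` acts as a flexibility knob, one can look at how large
the fitted tree is allowed to grow. The following is a small sketch on
synthetic data (an assumption, not the housing dataset); `X_demo`, `y_demo`
and `tree_sizes` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the housing dataset).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=20.0, random_state=0)

tree_sizes = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_demo, y_demo)
    # Record the depth actually reached and the number of leaves.
    tree_sizes[depth] = (tree.get_depth(), tree.get_n_leaves())
    print(f"max_depth={depth}: depth={tree.get_depth()}, "
          f"leaves={tree.get_n_leaves()}")
```

A tree of depth `d` has at most `2 ** d` leaves, so a small `max_depth` can
only produce a handful of distinct predictions, while an unconstrained tree
can isolate almost every training sample in its own leaf.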

In [None]:
%%time
from sklearn.model_selection import validation_curve

max_depth = [1, 5, 10, 15, 20, 25]
train_scores, test_scores = validation_curve(
    regressor, X, y, param_name="max_depth", param_range=max_depth,
    cv=cv, scoring="neg_mean_absolute_error", n_jobs=2)
train_errors, test_errors = -train_scores, -test_scores

Now that we have collected the results, we will show the validation curve by
plotting the empirical and generalization errors (as well as their
deviations).

In [None]:
_, ax = plt.subplots()

error_type = ["Empirical error", "Generalization error"]
errors = [train_errors, test_errors]

for name, err in zip(error_type, errors):
    ax.plot(max_depth, err.mean(axis=1), linestyle="-.", label=name,
            alpha=0.8)
    ax.fill_between(max_depth, err.mean(axis=1) - err.std(axis=1),
                    err.mean(axis=1) + err.std(axis=1), alpha=0.5,
                    label=f"std. dev. {name.lower()}")

ax.set_xticks(max_depth)
ax.set_xlabel("Maximum depth of decision tree")
ax.set_ylabel("Mean absolute error (k$)")
ax.set_title("Validation curve for decision tree")
_ = plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")

The validation curve can be divided into three areas:

- For `max_depth < 10`, the decision tree underfits. The empirical error and
  therefore also the generalization error are both high. The model is too
  constrained and cannot capture much of the variability of the target
  variable.

- The region around `max_depth = 10` corresponds to the parameter for which
  the decision tree generalizes the best. It is flexible enough to capture a
  fraction of the variability of the target that generalizes, while not
  memorizing all of the noise in the target.

- For `max_depth > 10`, the decision tree overfits. The empirical error
  becomes very small, while the generalization error increases. In this
  region, the model captures too much of the noisy part of the variations of
  the target and this harms its ability to generalize well to test data.

Note that for `max_depth = 10`, the model overfits a bit, as there is a gap
between the empirical error and the generalization error. It can also
potentially underfit a bit at the same time, because the empirical error
is still far from zero (more than 30 k\$), meaning that the model might
still be too constrained to model interesting parts of the data. However, the
generalization error is minimal, and this is what really matters. This is the
best compromise we could reach by just tuning this parameter.
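
Reading the best value off the plot can also be done programmatically. The
cell below is a hedged sketch: it recomputes a validation curve on synthetic
data (an assumption, not the housing dataset) and picks the depth with the
lowest mean generalization error. The names `depths`, `best_depth` and `gap`
are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the housing dataset).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=20.0, random_state=0)

depths = [1, 2, 4, 6, 8, 12]
cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=cv_demo,
    scoring="neg_mean_absolute_error")

# Average over the cross-validation splits and flip the sign to get errors.
mean_train_error = -train_scores.mean(axis=1)
mean_test_error = -test_scores.mean(axis=1)

# The best depth is the one minimizing the mean generalization error.
best_index = int(np.argmin(mean_test_error))
best_depth = depths[best_index]
gap = mean_test_error[best_index] - mean_train_error[best_index]

print(f"Best max_depth: {best_depth}")
print(f"Generalization error at best depth: {mean_test_error[best_index]:.2f}")
print(f"Gap to empirical error at best depth: {gap:.2f}")
```

The remaining gap at the selected depth quantifies the residual over-fitting
mentioned above, even though the generalization error is at its minimum.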

We were lucky that the variance of the errors was small compared to their
respective values, and therefore the conclusions above are quite clear. This
is not necessarily always the case.
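
One way to check whether the fluctuations are small relative to the errors is
to compute the ratio of the standard deviation to the mean for each
hyperparameter value. The helper below is a minimal sketch;
`relative_spread` is a hypothetical name, not a scikit-learn function, and
the input array contains made-up numbers used only to illustrate the
computation.

```python
import numpy as np


def relative_spread(errors):
    """Return std / mean of the errors for each hyperparameter value.

    `errors` has shape (n_param_values, n_cv_splits), like the arrays
    returned by `validation_curve` once the sign has been flipped.
    The ratio is not meaningful when the mean error is zero.
    """
    errors = np.asarray(errors, dtype=float)
    return errors.std(axis=1) / errors.mean(axis=1)


# Made-up numbers, for illustration only: 2 parameter values, 4 splits each.
toy_errors = np.array([[10.0, 10.0, 10.0, 10.0],
                       [5.0, 15.0, 5.0, 15.0]])
print(relative_spread(toy_errors))
```

In this notebook, the helper could be applied to `test_errors`. Ratios well
below 1 suggest that differences between neighboring `max_depth` values are
not dominated by the split-to-split noise.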

## Summary

In this notebook, we saw:

* how to identify whether a model is generalizing, overfitting, or
  underfitting;
* how to check the influence of a hyperparameter on the underfit/overfit
  trade-off.