# Overfitting and Hyperparameter Tuning



### Key Take-Aways
- Generalization _(How well the model applies to observations not seen during learning)_
- Overfitting
- Underfitting
- How to measure classification methods, _score as a normal variable_
    - Resubstitution - never
    - Hold-Out - for large datasets
    - k-fold Cross validation - best option
- Hyperparameter Tuning and Leakage
    - How to automatically find good hyperparameters
    - Beware of Data Leakage
- Nested Hold-Out & Nested Cross Validation
    - For Hyperparameter Tuning without Data Leakage


---

#### Resubstitution
Use the whole data set for both training and testing. The resulting score is far too optimistic because the model has already seen every test observation, so overfitting goes undetected. Usage: never.

#### Hold-out
Split the data into a training set and a test set; a common split is 80/20. Hold-out only works if there is enough labelled data, and because it yields a single score there is no way to measure the variance of that score. Train the model on the training set only, compute the metrics on both sets, and compare them. The model is
- **Underfitting**, if both are bad
- **Overfitting**, if good on train and bad on test
- Well **generalizing**, if good on both  
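
The comparison above can be sketched in plain Python. This is only a minimal illustration: the synthetic one-dimensional data, the 1-nearest-neighbour classifier and the helper names (`make_data`, `predict_1nn`, `accuracy`) are assumptions made for this example, not part of the original notes. A 1-nearest-neighbour model memorizes its training data, so scoring it on the training set (resubstitution) looks perfect, while the hold-out test set gives a more honest picture.

```python
import random

def make_data(n, seed=0):
    """Synthetic data: two overlapping 1-D classes (assumption for this sketch)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(label * 1.0, 1.0)  # class 1 is shifted to the right
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, data):
    """Fraction of points in `data` classified correctly by 1-NN on `train`."""
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)

data = make_data(200)
random.Random(1).shuffle(data)

# 80/20 hold-out split
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

train_acc = accuracy(train, train)  # score on data the model has seen
test_acc = accuracy(train, test)    # score on unseen data

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

A large gap between the two numbers (good on train, clearly worse on test) is the overfitting pattern from the list above.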

#### k-fold cross-validation
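Split the dataset into $k$ smaller _dataset-folds_. For each fold, train a model on the other $k-1$ folds and score it on the held-out fold, then take the average of the $k$ scores. The spread of the scores also shows how much the estimate varies. This method is better suited for small datasets.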
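
A minimal sketch of the procedure, using only the Python standard library. The nearest-centroid classifier, the synthetic data and the helper names (`k_fold_splits`, `fit_centroids`, `predict`) are assumptions chosen to keep the example short; any model with a fit and a predict step could take their place.

```python
import random
import statistics

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

def fit_centroids(train):
    """'Train' the model: store the mean x value of each class."""
    labels = {y for _, y in train}
    return {c: statistics.mean(x for x, y in train if y == c) for c in labels}

def predict(means, x):
    """Predict the class whose centroid is closest to x."""
    return min(means, key=lambda c: abs(means[c] - x))

# Synthetic, balanced two-class data (assumption for this sketch)
rng = random.Random(42)
data = [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(100)]

scores = []
for train_idx, test_idx in k_fold_splits(len(data), k=5):
    model = fit_centroids([data[j] for j in train_idx])
    correct = sum(predict(model, data[j][0]) == data[j][1] for j in test_idx)
    scores.append(correct / len(test_idx))

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)
print(f"accuracy: {mean_score:.2f} +/- {std_score:.2f} over {len(scores)} folds")
```

Every observation is used for testing exactly once, and the standard deviation across folds gives the variance information that a single hold-out split cannot provide.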

#### Fitting Graph to avoid Overfitting
Used to find the _optimal_ model complexity visually. The graph here shows the error on the y-axis; alternatively, it can show the accuracy. The ideal point is the minimum of the test error curve: beyond it, the training error keeps falling while the test error rises again, which is overfitting.

![Fitting Graph](img/Fitting_Graph.png)
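
The numbers behind such a graph can be computed by training the same kind of model at several complexity levels and recording both errors. The sketch below is an illustration under assumptions: it uses a k-nearest-neighbour classifier on synthetic data, where a smaller `k` means a more flexible (more complex) model, and the helper names are hypothetical.

```python
import random

def make_data(n, seed):
    """Synthetic, balanced two-class 1-D data (assumption for this sketch)."""
    rng = random.Random(seed)
    return [(rng.gauss(1.5 * (i % 2), 1.0), i % 2) for i in range(n)]

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes_for_1 = sum(y for _, y in neighbours)
    return 1 if 2 * votes_for_1 > k else 0

def error_rate(train, data, k):
    """Fraction of points in `data` that k-NN on `train` misclassifies."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)

train = make_data(120, seed=0)
test = make_data(60, seed=1)

# Ordered from simple (large k) to complex (small k)
ks = [41, 21, 11, 5, 3, 1]
curve = [(k, error_rate(train, train, k), error_rate(train, test, k)) for k in ks]

for k, train_err, test_err in curve:
    print(f"k={k:2d}  train error={train_err:.2f}  test error={test_err:.2f}")

# Complexity level with the lowest test error
best_k = min(curve, key=lambda row: row[2])[0]
print("lowest test error at k =", best_k)
```

Plotting the two error columns against complexity gives the fitting graph. At `k=1` the training error drops to zero because each training point is its own nearest neighbour, which is the overfitting end of the curve.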

## Hyperparameter Tuning
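The challenge with hyperparameter tuning is evaluating the quality of the hyperparameters without causing **data leakage**. Data leakage means that data used for the final test has already influenced training or tuning, so the test score is too optimistic. The solution is to split the data further into a `train dataset`, a `validation dataset` and a `test dataset`: train on the first, compare hyperparameters on the second, and report the final score on the third.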

**Advantages**
- Test dataset is never used to train/tune → no data leakage
- Good runtime performance, since each hyperparameter candidate is trained only once
- Easy to implement

**Disadvantages**
- Need lots of labelled data, not suited for small datasets
- No way to measure variance
- Often a pessimistic estimate of the true score, since the model is trained on only part of the data
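
The three-way split described in this section could look roughly like the following. This is a sketch under assumptions: the 60/20/20 split, the k-nearest-neighbour model, the candidate values for the hyperparameter `k` and the synthetic data are all chosen only for illustration.

```python
import random

def make_data(n, seed):
    """Synthetic, balanced two-class 1-D data (assumption for this sketch)."""
    rng = random.Random(seed)
    return [(rng.gauss(1.5 * (i % 2), 1.0), i % 2) for i in range(n)]

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes_for_1 = sum(y for _, y in neighbours)
    return 1 if 2 * votes_for_1 > k else 0

def accuracy(train, data, k):
    """Fraction of points in `data` classified correctly by k-NN on `train`."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

data = make_data(300, seed=0)
random.Random(1).shuffle(data)

# 60/20/20 split into train, validation and test datasets
train, val, test = data[:180], data[180:240], data[240:]

# Tune: compare hyperparameter candidates on the validation dataset only
candidates = [1, 3, 5, 11, 21]
val_scores = {k: accuracy(train, val, k) for k in candidates}
best_k = max(val_scores, key=val_scores.get)

# Evaluate: the test dataset is touched exactly once, after tuning is done
test_score = accuracy(train, test, best_k)

print("validation scores:", val_scores)
print(f"best k = {best_k}, test accuracy = {test_score:.2f}")
```

Because `test` plays no role in choosing `best_k`, the final score is not affected by the tuning step. The price is that only 60% of the labelled data is used for training, which matches the disadvantages listed above.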