# Lecture 4
Jan 19, 2024
***

## Model Validation

### Training and Test Sets

- **Training Set:** <br/>
The training set is the largest part of the dataset and the foundation for model building. Machine learning algorithms use this segment to learn the patterns in the data.
<br/><br/>
- **Test Set:**<br/>
Unlike the training set, the test set is held out from training and serves as an unbiased measure of the model’s performance on completely new, unseen data.
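
A minimal sketch of such a split, assuming scikit-learn's `train_test_split` and the Iris dataset. The 80/20 ratio, the `random_state`, and the use of `stratify` are illustrative choices, not something these notes prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the 150-sample Iris dataset as a feature matrix X and a label vector y.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples as the test set (the ratio is an assumption).
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

The model is fit only on `X_train` and `y_train`; `X_test` and `y_test` are touched once, at evaluation time.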

### Cross-Validation
A technique to `evaluate the performance` of a machine learning model on unseen data.

- It involves `splitting the data` into `multiple subsets` and using each
`subset as a test set` while `training on the rest` of the data.
- This way, the model is `tested on different` parts of the data, and
the `average score` is used as an estimate of how well the model `generalizes`.

- With cross-validation, every sample in the full dataset is used for testing exactly once, so no data is permanently set aside.
- Repeat the procedure to validate both the model and the selected
hyperparameters.
- To evaluate the model on five folds (`cv=5`), use the following code:

```python
from sklearn.model_selection import cross_val_score

# model is any scikit-learn estimator; X and y are the features and labels.
cross_val_score(model, X, y, cv=5)
```
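
For a version that runs on its own, here is a sketch that fills in `model`, `X`, and `y`. The Iris dataset and the k-nearest-neighbors classifier are assumptions made for illustration; any scikit-learn estimator could take their place.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)  # illustrative choice of model

# One accuracy score per fold; the model is refit from scratch for each fold.
scores = cross_val_score(model, X, y, cv=5)

print(scores)
print("mean accuracy:", scores.mean())
```

For a classifier, an integer `cv` makes `cross_val_score` use stratified folds, so each fold keeps the class balance of the full dataset.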

#### Validating Your Model

- The `quality` of your model is influenced by how you 
`divide your data` into `training and testing sets`.
- In the basic machine learning example that we worked on, we
only measured the accuracy score for one data split.
- How would the accuracy change if the data split was different,
either for better or worse?
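
One way to explore that question is to repeat the split with different random seeds and compare the scores. This is only a sketch: the dataset, the classifier, the 70/30 ratio, and the ten seeds are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(10):
    # A different random_state gives a different train/test partition.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

print("lowest accuracy: ", min(accuracies))
print("highest accuracy:", max(accuracies))
```

The gap between the lowest and highest accuracy shows how much a single split can flatter or penalize the same model.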

#### Stratified K-Fold
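When working with a `small dataset` like the Iris dataset, which has only 150
samples, the `choice of cross-validation strategy is crucial`.
Using `StratifiedKFold` to split a small dataset into train and test sets offers
several benefits:
1. Preserving data distribution: each fold keeps roughly the same class proportions as the full dataset.
2. More reliable performance estimates: no fold is missing a class or dominated by one.
3. Preventing overfitting to a single split: evaluating on several folds makes it harder to favor a model that only looks good on one lucky split.
4. Robustness to variability: averaging across folds reduces the effect of any one unusual split.
5. Comparable results: every model is scored on folds with the same class balance, so scores can be compared fairly.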
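
The first benefit can be checked directly. The sketch below counts the classes in each test fold of the Iris dataset; `n_splits=5`, `shuffle=True`, and the `random_state` are assumed settings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)  # 3 classes, 50 samples each

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many samples of each class land in this test fold.
    counts = np.bincount(y[test_idx], minlength=3)
    fold_class_counts.append(counts.tolist())

print(fold_class_counts)
# [[10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10]]
```

Every test fold has 10 samples per class, mirroring the 50/50/50 balance of the full dataset. A plain unshuffled `KFold` on the same data would not do this, because the Iris samples are stored ordered by class.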

## Model Selection

### Bias and Variance

Bias and variance are two `types of errors` that affect the performance of
machine learning models. <br/>According to the *Python Data Science Handbook*, they
are defined as follows:

**Bias:**<br/> The systematic difference between the model’s predicted values and the actual
values.
- Bias is caused by wrong assumptions or simplifications made by the model to
approximate the target function.
- High bias models tend to underfit the data, meaning they cannot capture the
complexity or patterns in the data.

**Variance:**<br/> The sensitivity of the model to changes in the training
data.
- Variance is caused by the model being too flexible or complex, and
fitting the noise as well as the signal in the data.
- High variance models tend to overfit the data, meaning they cannot
generalize well to new or unseen data.
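
Both quantities can be estimated by simulation. The sketch below is an illustration under stated assumptions, not part of the lecture material: the true function is taken to be a sine wave, each simulated training set shares the same x locations but gets fresh noise, and the polynomial degrees (1 and 9) are arbitrary examples of a rigid and a flexible model.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth function for the simulation.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)  # fixed training inputs
x_grid = np.linspace(0, 1, 50)   # points where predictions are compared

def bias_variance(degree, n_datasets=200, noise=0.3):
    """Fit one polynomial per simulated training set and summarize the fits."""
    preds = np.empty((n_datasets, x_grid.size))
    for i in range(n_datasets):
        # New noise each time stands in for "a change in the training data".
        y_train = true_fn(x_train) + rng.normal(0, noise, x_train.size)
        poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        preds[i] = poly(x_grid)
    # Bias^2: how far the average fit is from the true function.
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_grid)) ** 2)
    # Variance: how much individual fits scatter around their average.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {}
for degree in (1, 9):
    results[degree] = bias_variance(degree)
    b, v = results[degree]
    print(f"degree={degree}: bias^2={b:.4f}, variance={v:.4f}")
```

The straight line shows the larger squared bias, while the degree-9 polynomial shows the larger variance.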

### The Bias-Variance Trade-Off

- The left model tries to use a straight line to fit the data.
- However, the data is too complex for a straight line, so the left model cannot capture the data well.
- This is called underfitting the data, which means the model is not flexible enough to handle all the data points.
- In other words, the model has a high bias.

- The right model uses a complex polynomial to fit the data.
- This model can capture the small details in the data, but its exact shape is more influenced by the noise in the data than by the true nature of the data source.
- This is called overfitting the data: the model is too flexible and fits the noise as well as the data pattern.
- This means that the model has a high variance.
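
The two behaviors can be reproduced on synthetic data. In this sketch the data-generating function, the noise level, and the degrees (1 for the straight line, 15 for the complex polynomial) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a sine wave plus Gaussian noise (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

results = {}
for degree in (1, 15):
    # Polynomial regression: expand x into powers, then fit a linear model.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = {
        "train": model.score(X_train, y_train),  # R^2 on data the model saw
        "test": model.score(X_test, y_test),     # R^2 on held-out data
    }
    print(degree, results[degree])
```

An underfit model scores poorly on both sets. An overfit model scores well on the training set, and the gap to its test score indicates how much of that fit is noise.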

### Validation Curve

Image Source: https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html

- The training score is generally higher than the validation score.
- The best validation score marks the optimal balance of bias and variance.
- Hyperparameters may have to be adjusted to get the best model.
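
A validation curve like this can be computed with scikit-learn's `validation_curve`. The sketch below uses polynomial degree as the hyperparameter, in the spirit of the handbook page cited above; the synthetic data, the degree range, and `cv=5` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a sine wave plus Gaussian noise (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(80, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=80)

degrees = list(range(1, 11))
model = make_pipeline(PolynomialFeatures(), LinearRegression())

# Each row of the result corresponds to one degree, each column to one fold.
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",  # step name assigned by make_pipeline
    param_range=degrees,
    cv=5,
)

mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
best_degree = degrees[int(np.argmax(mean_val))]

for d, tr, va in zip(degrees, mean_train, mean_val):
    print(f"degree={d:2d}  train={tr:.3f}  validation={va:.3f}")
print("best degree by validation score:", best_degree)
```

Plotting `mean_train` and `mean_val` against `degrees` gives the two curves; the degree at the peak of the validation curve is the candidate hyperparameter value.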