Supervised learning can be described mathematically as learning a mapping from inputs x to a target y: **y=f(x)**, where the function f is unknown.

The goals:
- find a model that best approximates f
- discard as much of the noise in the data as possible
- achieve a low predictive error on unseen data

### Challenges

#### Overfitting

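- The model fits the noise in the training set rather than the underlying pattern, so it does well on the training data but poorly on unseen data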

#### Underfitting

- The model is not flexible enough to approximate f, so its error is high on both the training data and unseen data
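
To make the two failure modes concrete, here is a minimal sketch on synthetic data. The sine target, the noise level, and the two depth settings are assumptions chosen for illustration only; they are not part of the original notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for illustration): y = sin(x) + Gaussian noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

def train_test_mse(max_depth):
    """Fit a tree of the given depth and return (train MSE, test MSE)."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    return (mean_squared_error(y_train, tree.predict(X_train)),
            mean_squared_error(y_test, tree.predict(X_test)))

# A depth-1 tree is too rigid; a tree with no depth limit can memorize the noise
underfit_train, underfit_test = train_test_mse(max_depth=1)
overfit_train, overfit_test = train_test_mse(max_depth=None)

print('depth=1    train MSE: {:.3f}  test MSE: {:.3f}'.format(underfit_train, underfit_test))
print('unlimited  train MSE: {:.3f}  test MSE: {:.3f}'.format(overfit_train, overfit_test))
```

The unlimited-depth tree should reach a training MSE near zero while its test MSE stays well above it (overfitting), whereas the depth-1 tree shows a high error on both splits (underfitting).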

### Generalization Error

- Is composed of:
    - **bias^2**: how far, on average, the model's estimate of f is from the true f.
    > HIGH BIAS models --> lead to --> **UNDERFITTING**
        - **this is when the CV error is approximately equal to the training error, but both are much greater than the desired error**
    - **variance**: how inconsistent the model's estimate of f is across different training sets.
    > HIGH VARIANCE models --> lead to --> **OVERFITTING**
        - **this is when the CV error is noticeably higher than the training set error**

    - **irreducible error**: the error contributed by noise in the data; no choice of model can remove it.
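
For squared-error loss, the three components above can be written as one decomposition. This is the standard textbook form, stated under assumptions that the notes do not spell out: the observed target is y = f(x) + ε with zero-mean noise of variance σ², and the expectation is taken over random training sets (and the noise), with f-hat denoting the fitted model.

```latex
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Only the first two terms depend on the model, which is why the trade-off is stated between bias and variance.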
    

### Bias - Variance Trade-off

![image.png](attachment:image.png)

As model complexity:
- INCREASES: variance increases + bias decreases
- DECREASES: variance decreases + bias increases
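
The trade-off can also be estimated empirically by refitting a model on many simulated training sets. The sketch below is illustrative only: the "true" function `f`, the noise level, the sample sizes, and the helper `bias2_and_variance` are all hypothetical choices, and this is only possible because the data are simulated (with real data f is unknown).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def f(x):
    # Hypothetical "true" function, chosen only for this illustration
    return np.sin(x)

rng = np.random.RandomState(42)
x_grid = np.linspace(0, 6, 100).reshape(-1, 1)  # fixed evaluation points

def bias2_and_variance(max_depth, n_sets=200, n_samples=100, noise=0.3):
    """Estimate bias^2 and variance by refitting on many simulated training sets."""
    preds = np.empty((n_sets, len(x_grid)))
    for i in range(n_sets):
        # Draw a fresh training set from the same (assumed) data-generating process
        X = rng.uniform(0, 6, size=(n_samples, 1))
        y = f(X).ravel() + rng.normal(scale=noise, size=n_samples)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, y)
        preds[i] = tree.predict(x_grid)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - f(x_grid).ravel()) ** 2)  # average squared bias
    variance = np.mean(preds.var(axis=0))                  # average prediction variance
    return bias2, variance

results = {depth: bias2_and_variance(depth) for depth in (1, 3, 10)}
for depth, (bias2, variance) in results.items():
    print('max_depth={:>2}  bias^2={:.4f}  variance={:.4f}'.format(depth, bias2, variance))
```

Going from the shallowest to the deepest tree, the estimated bias^2 should shrink while the estimated variance grows.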

![image.png](attachment:image.png)

### Diagnosing Bias and Variance Problems

#### [1] Train/Test Splitting for Estimating Generalization Error

#### [2] Cross-Validation (CV), Leaving the Test Set Untouched until Confident in Model Performance
- K-fold CV 
- or Hold-out CV

K-Fold Diagram example

![image.png](attachment:image.png)
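
The mechanics behind the diagram can be sketched in plain Python: the training data are cut into k folds, and each fold takes one turn as the validation set while the remaining folds are used for training. The helper `k_fold_indices` is a hypothetical name written for this illustration; in practice scikit-learn's `KFold` or `cross_val_score` would do this for you.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(n_samples=10, k=5), start=1):
    print('fold {}: validate on {}, train on {}'.format(fold, val_idx, train_idx))
```

The CV error is then the mean of the k validation errors. This sketch does not shuffle the data first, which is a simplification.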

### To remedy OVERFITTING

- decrease model complexity (e.g. decrease max_depth, increase min_samples_leaf)
- gather more training data

### To remedy UNDERFITTING

- increase model complexity (e.g. increase max_depth, decrease min_samples_leaf)
- gather more relevant features
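
The diagnostic rules of thumb (CV error versus training error versus the desired error) and the remedies listed here can be combined into a small helper. This is only a rough sketch: the function name `diagnose`, the `tolerance` threshold, and the wording of the messages are hypothetical, and what counts as "roughly equal" depends on the problem.

```python
def diagnose(train_error, cv_error, desired_error, tolerance=0.1):
    """Return a rough diagnosis from training, CV, and desired errors.

    `tolerance` is the relative gap below which the CV and training errors
    are treated as roughly equal; 0.1 is an arbitrary illustrative default.
    """
    gap = (cv_error - train_error) / max(cv_error, 1e-12)
    if gap > tolerance:
        # CV error well above training error
        return 'high variance (overfitting): reduce complexity or gather more data'
    if cv_error > desired_error:
        # Errors roughly equal, but both above the target
        return 'high bias (underfitting): increase complexity or add relevant features'
    return 'ok: CV error is close to training error and within the target'

print(diagnose(train_error=1.0, cv_error=5.0, desired_error=2.0))
print(diagnose(train_error=4.8, cv_error=5.0, desired_error=2.0))
```

The inputs could be, for example, the train MSE and mean CV MSE computed for a fitted model.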

### Performing CV with DecisionTreeRegressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score

#define seed for reproducibility
SEED = 123

#split the dataset into train/test sets
#(X and y are assumed to be already loaded: feature matrix and target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                   random_state=SEED)

#instantiate the regressor
dtr = DecisionTreeRegressor(max_depth=4,
                             min_samples_leaf=0.14,
                            random_state=SEED) #reproducibility

#compute the array of MSEs obtained by 10-fold CV
#(the scorer returns negative MSE, hence the leading minus sign)
MSE_CV = - cross_val_score(dtr, X_train, y_train, cv=10,
                          scoring='neg_mean_squared_error',
                          n_jobs=-1)

#fit on the training split
dtr.fit(X_train, y_train)

#predict the labels of the training and test sets
y_pred_train = dtr.predict(X_train)
y_pred_test = dtr.predict(X_test)

#print CV MSE
print('CV MSE: {:.2f}'.format(MSE_CV.mean()))

#print the train and test MSE
mse_train = MSE(y_train, y_pred_train)
mse_dtr = MSE(y_test, y_pred_test)
print('Train MSE: {:.2f}'.format(mse_train))
print('Test MSE: {:.2f}'.format(mse_dtr))

#compute the test-set RMSE from the test MSE
rmse_dtr = mse_dtr**(1/2)
print('Test RMSE: {:.2f}'.format(rmse_dtr))