# The Bias-Variance Tradeoff
- **Overfitting**: The function fits the noise in the training data (*high variance*).
    - To reduce:
        - Add more data
        - Decrease the model complexity (for example, max depth on a tree-based model)
        - Include regularization
- **Underfitting**: The function is not flexible enough to capture the underlying pattern (*high bias*).
    - To reduce:
        - Gather more relevant features
        - Increase model complexity
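- **Generalization error**: How well does the function generalize to unseen data? It decomposes into bias² + variance + irreducible error (noise). Our goal is to reduce it.
    - Bias: error term that tells you, on average, how far our function is from the real function. High bias leads to underfitting.
    - Variance: error term that tells you how inconsistent our function is across different training sets. High variance leads to overfitting.
    - Irreducible error (noise): error that no model can remove.
- **Model Complexity**: Sets the flexibility of the fitted function. As complexity increases, bias typically falls and variance rises, so the lowest generalization error sits somewhere in between.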
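
The decomposition of the generalization error can be written out explicitly. The notation here is an assumption, not taken from the notes: $f$ is the real function, $\hat{f}$ is the function fitted on a random training set, and $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. For squared error at a point $x$, with the expectation taken over training sets and noise:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

Only the first two terms depend on the model, which is why the noise term is called irreducible.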

![title](https://drive.google.com/uc?export=view&id=1ds_94hqDXU2238kV3Ye2zDzehjHiPjdY)
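
To make the two error terms concrete, here is a minimal sketch that estimates squared bias and variance empirically by refitting a tree on many synthetic training sets. Everything in it is an assumption for illustration: the "real" function `true_f`, the noise level, the sample sizes, and the helper name `bias_variance` are not part of the notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "real" function, chosen only for illustration
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0, 1, 200).reshape(-1, 1)  # fixed points where predictions are compared
noise_sd = 0.3                                   # assumed standard deviation of the noise

def bias_variance(max_depth, n_sets=200, n_samples=50):
    """Estimate squared bias and variance by refitting on many training sets."""
    preds = np.empty((n_sets, len(x_grid)))
    for i in range(n_sets):
        # Draw a fresh noisy training set from the same process
        x = rng.uniform(0, 1, size=(n_samples, 1))
        y = true_f(x).ravel() + rng.normal(0, noise_sd, size=n_samples)
        model = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(x, y)
        preds[i] = model.predict(x_grid)
    mean_pred = preds.mean(axis=0)
    # Bias^2: gap between the average prediction and the real function
    bias_sq = np.mean((mean_pred - true_f(x_grid).ravel()) ** 2)
    # Variance: spread of the predictions across training sets
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {depth: bias_variance(depth) for depth in (1, 3, 10)}
for depth, (b, v) in results.items():
    print(f"max_depth={depth:>2}  bias^2={b:.3f}  variance={v:.3f}")
```

On this synthetic setup, the shallow tree should show the larger bias term and the deep tree the larger variance term, which is the tradeoff described in the list above.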

## Diagnose bias and variance problems
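- **Cross Validation**: split the training data into K folds; on each iteration, hold out a different fold for evaluation and fit on the remaining K-1 folds (**K-Fold**)
    - The CV error of the model is the average of the errors on the K held-out folds
    - If the CV error is clearly greater than the training error, the model suffers from high variance
    - If the CV error is roughly equal to the training error but both are higher than the desired error, the model suffers from high bias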
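
The K-Fold procedure can be sketched by hand to show that the CV error is just the mean of the per-fold errors. The data below is synthetic and the choice of 5 folds and `max_depth=3` is arbitrary; only `KFold`, `cross_val_score`, and `mean_squared_error` are real scikit-learn names.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only so the example runs on its own
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, val_idx in kf.split(X):
    # Fit on K-1 folds, evaluate on the held-out fold
    model = DecisionTreeRegressor(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

cv_mse = np.mean(fold_mse)  # CV error = average of the K fold errors

# cross_val_score does the same loop; it returns negated MSE, so flip the sign
auto_mse = -cross_val_score(
    DecisionTreeRegressor(max_depth=3, random_state=0),
    X, y, cv=kf, scoring="neg_mean_squared_error",
)

print(f"Manual CV MSE: {cv_mse:.4f}")
print(f"cross_val_score CV MSE: {auto_mse.mean():.4f}")
```

Passing the same `kf` object to `cross_val_score` keeps the splits identical, so both numbers should agree.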

<br>

***Most of the time, models are black boxes that suck in data and spit out accuracy numbers.*** - Mikian Musser






```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error as MSE

# Seed for reproducible results
seed = 123

# Import data
file1 = 'https://raw.githubusercontent.com/prince381/car_mpg_predict/master/cars1.csv'
file2 = 'https://raw.githubusercontent.com/prince381/car_mpg_predict/master/cars2.csv'
cars1 = pd.read_csv(file1).dropna(how='all', axis=1)  # drop columns that are entirely empty
cars2 = pd.read_csv(file2)
df = pd.concat([cars1, cars2], ignore_index=True, sort=False)

# Split data: a single feature (displacement) predicts the target (mpg)
X = df['displacement'].to_numpy().reshape(-1, 1)
y = df['mpg'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

# Instantiate model
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.14, random_state=seed)

# 10-fold CV on the training set; scikit-learn returns negated MSE, so flip the sign
mse_cv = -cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error')

# Fit the model on the full training set
dt.fit(X_train, y_train)

# Predict on the training and test sets
y_pred_train = dt.predict(X_train)
y_pred_test = dt.predict(X_test)

# Compare errors
print(f"CV MSE: {mse_cv.mean()}")
print(f"Train MSE: {MSE(y_train, y_pred_train)}")
print(f"Test MSE: {MSE(y_test, y_pred_test)}")
```

```
CV MSE: 18.78476949191765
Train MSE: 17.548320360015627
Test MSE: 20.21680313601966
```

**Conclusion**: The training error (17.55) is lower than the CV error (18.78), and the test error (20.22) is higher still. A CV error above the training error points to high variance, so the model looks somewhat overfit, although the gap here is modest.
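
The same train-versus-CV comparison can be repeated across several values of `max_depth` to see where the model moves from underfitting to overfitting. This is a sketch on synthetic data, not on the cars dataset, and the list of depths is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only so the example runs without a download
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=300)

rows = []
for depth in (1, 2, 4, 8, None):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # 10-fold CV error (sign flipped because the scorer returns negated MSE)
    cv_mse = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error").mean()
    # Training error after fitting on all of the data
    train_mse = mean_squared_error(y, model.fit(X, y).predict(X))
    rows.append((depth, train_mse, cv_mse))

for depth, train_mse, cv_mse in rows:
    print(f"max_depth={str(depth):>4}  train MSE={train_mse:.3f}  "
          f"CV MSE={cv_mse:.3f}  gap={cv_mse - train_mse:.3f}")
```

A training error and CV error that are both high suggest bias; a training error near zero with a much larger CV error suggests variance. Reducing `max_depth` is one of the complexity controls listed at the top of these notes.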
