# <a id='toc1_'></a>[Bias-variance trade-off](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Bias-variance trade-off](#toc1_)    
  - [What is bias?](#toc1_1_)    
  - [What is variance?](#toc1_2_)    
  - [Extra: The maths of it all  ](#toc1_3_)    
    - [What is noise?](#toc1_3_1_)    
  - [What do we want and how do we get there?](#toc1_4_)    
    - [Regularization](#toc1_4_1_)    
      - [Train Test Split](#toc1_4_1_1_)    
    - [Baseline](#toc1_4_2_)    
    - [Revisit Train-Test split](#toc1_4_3_)    
    - [Review overfitting across multiple parameters](#toc1_4_4_)    
- [Resources](#toc2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

![](https://miro.medium.com/v2/resize:fit:828/format:webp/1*9hPX9pAO3jqLrzt0IE3JzA.png)  
(Source: [Understanding the Bias-Variance Tradeoff, Medium](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229))

## <a id='toc1_1_'></a>[What is bias?](#toc0_)

**Bias is how wrong our model is on average.** 

In maths terms, bias is the difference between our model's average prediction and the correct answer. A model has **high bias** if its predictions are systematically far from the truth, typically because the model is too simple or because the data seen by the model is not representative of the real data. 

If the issue is the model, we say that the model is **underfit**: it is so simple that it still can't learn the pattern, even if we give it more data. That's sad, but don't worry, simple models are still good for a variety of tasks!
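
To make underfitting concrete, here is a minimal sketch on made-up data (the quadratic relationship, the noise level and the helper `r2` are assumptions for illustration, not part of the housing example). A straight line is too simple for a curved pattern, so it scores badly on both the train and the test set, while a degree-2 polynomial does fine on both.

```python
import numpy as np
from numpy.polynomial import Polynomial


def r2(y_true, y_pred):
    """Coefficient of determination (R2), the metric regressors report via .score()."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


rng = np.random.default_rng(0)

# Synthetic data (assumption): a quadratic relationship plus a little noise.
x = rng.uniform(-3, 3, 400)
y = x**2 + rng.normal(0, 0.5, 400)
x_train, y_train = x[:300], y[:300]
x_test, y_test = x[300:], y[300:]

# A straight line is too simple for a curve: high bias, underfit.
line = Polynomial.fit(x_train, y_train, deg=1)
line_train = r2(y_train, line(x_train))
line_test = r2(y_test, line(x_test))

# A degree-2 polynomial matches the shape of the data.
curve = Polynomial.fit(x_train, y_train, deg=2)
curve_train = r2(y_train, curve(x_train))
curve_test = r2(y_test, curve(x_test))

print(f"Line : train R2 = {line_train:.2f}, test R2 = {line_test:.2f}")
print(f"Curve: train R2 = {curve_train:.2f}, test R2 = {curve_test:.2f}")
```

Note that the underfit line is bad on *both* sets: a low score everywhere, not a gap between train and test, is the signature of high bias.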

## <a id='toc1_2_'></a>[What is variance?](#toc0_)

**Variance is how much the predictions of our model change for the SAME type of data when the training data changes a little.** 

Imagine our house price predictor. If we train a model which has a high variance, it could tell us that a house with 3 rooms is \$100,000 and a very similar one with 4 rooms is \$500,000. Yikes! 

You notice the variance of a model when you compare the same evaluation metric on 2 different sets, usually the train and test set (later on, train and validation...). 
* If the difference is high, the model has **high variance**, i.e. it memorized the training data and didn't learn much, like a student who memorized the answers to a practice test, then sits the real exam and gets a bad score. The model is **overfit**.
* If the difference is low, the model has **low variance**, i.e. it learnt the main patterns in the data, like a student who studied the lessons and not the test answers! The model is not overfit (it can still be underfit if both scores are low).
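
The train/test gap can be sketched on made-up data (the sine function, the noise level and the polynomial degrees below are assumptions chosen for illustration). A very flexible model trained on only 20 points chases the noise and shows a large gap; a more modest model shows a much smaller one.

```python
import numpy as np
from numpy.polynomial import Polynomial


def r2(y_true, y_pred):
    """Coefficient of determination (R2)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


rng = np.random.default_rng(1)


def make_data(n):
    """Noisy samples of sin(3x); function and noise level are illustrative assumptions."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y


x_train, y_train = make_data(20)   # small training set
x_test, y_test = make_data(500)    # unseen data from the same process

# Degree 15 on 20 points: flexible enough to memorize the noise.
wiggly = Polynomial.fit(x_train, y_train, deg=15)
# Degree 3: flexible enough for the signal, not for the noise.
smooth = Polynomial.fit(x_train, y_train, deg=3)

wiggly_train, wiggly_test = r2(y_train, wiggly(x_train)), r2(y_test, wiggly(x_test))
smooth_train, smooth_test = r2(y_train, smooth(x_train)), r2(y_test, smooth(x_test))
gap_wiggly = wiggly_train - wiggly_test
gap_smooth = smooth_train - smooth_test

print(f"degree 15: train R2 = {wiggly_train:.2f}, test R2 = {wiggly_test:.2f}, gap = {gap_wiggly:.2f}")
print(f"degree 3 : train R2 = {smooth_train:.2f}, test R2 = {smooth_test:.2f}, gap = {gap_smooth:.2f}")
```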

## <a id='toc1_3_'></a>[Extra: The maths of it all](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote12.html)   [&#8593;](#toc0_)

$\underbrace{E_{\mathbf{x}, y, D} \left[\left(h_{D}(\mathbf{x}) - y\right)^{2}\right]}_\mathrm{Expected\;Test\;Error} = \underbrace{E_{\mathbf{x}, D}\left[\left(h_{D}(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^{2}\right]}_\mathrm{Variance} + \underbrace{E_{\mathbf{x}, y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^{2}\right]}_\mathrm{Noise} + \underbrace{E_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^{2}\right]}_\mathrm{Bias^2}$

### <a id='toc1_3_1_'></a>[What is noise?](#toc0_)

It's the stuff you can't predict. As you already know, models are not perfect representations of reality. If we were able to perfectly model the universe, we wouldn't need Machine Learning and our noise or error would be 0. 
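
The decomposition above can be checked numerically. The sketch below is a simulation under stated assumptions: the "true" function `true_fn` plays the role of $\bar{y}(\mathbf{x})$, `SIGMA` is a made-up noise level, and polynomial fits of different degrees stand in for $h_D$. It trains many models on different datasets $D$, averages them to get $\bar{h}$, and compares variance + bias² + noise with the measured test error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
SIGMA = 0.3                        # noise standard deviation (assumption)
x_grid = np.linspace(-1, 1, 100)   # fixed test inputs x


def true_fn(x):
    """Plays the role of y-bar(x), the expected label at x (assumption)."""
    return np.sin(3 * x)


def decompose(degree, n_datasets=300, n_points=30):
    """Estimate variance, bias^2, noise and expected test error by simulation."""
    preds = np.empty((n_datasets, x_grid.size))
    sq_errors = np.empty((n_datasets, x_grid.size))
    for d in range(n_datasets):
        # Draw a fresh training set D and fit h_D on it.
        x = rng.uniform(-1, 1, n_points)
        y = true_fn(x) + rng.normal(0, SIGMA, n_points)
        h_d = Polynomial.fit(x, y, deg=degree, domain=[-1, 1])
        preds[d] = h_d(x_grid)
        # Fresh noisy test labels y for the same inputs.
        y_test = true_fn(x_grid) + rng.normal(0, SIGMA, x_grid.size)
        sq_errors[d] = (preds[d] - y_test) ** 2

    h_bar = preds.mean(axis=0)                         # average model h-bar(x)
    variance = np.mean((preds - h_bar) ** 2)           # spread of h_D around h-bar
    bias_sq = np.mean((h_bar - true_fn(x_grid)) ** 2)  # h-bar vs. the truth
    noise = SIGMA**2                                   # irreducible error
    return variance, bias_sq, noise, sq_errors.mean()


results = {}
for degree in (1, 3, 9):
    results[degree] = decompose(degree)
    variance, bias_sq, noise, test_error = results[degree]
    print(
        f"degree {degree}: variance={variance:.4f} bias^2={bias_sq:.4f} "
        f"noise={noise:.4f} sum={variance + bias_sq + noise:.4f} test error={test_error:.4f}"
    )
```

The simple model (degree 1) should be dominated by bias² and the flexible one (degree 9) by variance, while the noise term stays the same no matter which model we pick.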

## <a id='toc1_4_'></a>[What do we want and how do we get there?](#toc0_)

A model complex enough to understand the data but not so complex that it memorizes the data.

![](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)  

(Source: [Understanding the Bias-Variance Tradeoff, Scott Fortmann-Roe](https://scott.fortmann-roe.com/docs/BiasVariance.html))

When we spot overfitting, we can deal with it in a number of ways:
- Reduce model complexity
    - Tweak hyperparameters, e.g. choose a lower `max_depth`, a higher `min_samples_split`, etc. (for boosted models, also a lower `n_estimators`)
    - Choose a simpler model, e.g. Linear/Logistic Regression
- Apply regularization, e.g. L1 or L2 regularization in Logistic Regression
- Remove redundant features
- **Last resort**: Use more training data
    - readily available solution: move from an 80/20 split to an 85/15 or 90/10 split. Do NOT go beyond a 90/10 split.   
    **Caveat**: If you have a lot of data, e.g. >10,000 rows, you can start off with a 90/10 split or go beyond the 90/10 split.
    - more costly: gather more data
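
To see why more training data helps against overfitting, here is a small sketch on made-up data (the sine function, the noise level and the degree-9 polynomial are assumptions for illustration). The same flexible model is trained on more and more rows, and the gap between train and test score shrinks.

```python
import numpy as np
from numpy.polynomial import Polynomial


def r2(y_true, y_pred):
    """Coefficient of determination (R2)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples of sin(3x); function and noise level are illustrative assumptions."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y


x_test, y_test = make_data(2000)  # one fixed test set for every run

gaps = {}
for n_train in (15, 50, 200, 1000):
    x_train, y_train = make_data(n_train)
    # Same model complexity every time; only the amount of training data changes.
    model = Polynomial.fit(x_train, y_train, deg=9, domain=[-1, 1])
    train_score = r2(y_train, model(x_train))
    test_score = r2(y_test, model(x_test))
    gaps[n_train] = train_score - test_score
    print(f"n_train={n_train:>4}: train R2 = {train_score:.2f}, test R2 = {test_score:.2f}")
```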

In [41]:
from sklearn.datasets import  fetch_california_housing
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [42]:
california = fetch_california_housing()
print(california["DESCR"])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [43]:
df_cali = pd.DataFrame(california["data"], columns = california["feature_names"])
df_cali["median_house_value"] = california["target"]

df_cali.head()

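|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | median_house_value |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |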


#### <a id='toc1_4_1_1_'></a>[Train Test Split](#toc0_)

In [44]:
features = df_cali.drop(columns = ["median_house_value","AveOccup", "Population", "AveBedrms"])
target = df_cali["median_house_value"]

In [45]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

### <a id='toc1_4_2_'></a>[Baseline](#toc0_)

In [46]:
forest = RandomForestRegressor(n_estimators=20, max_depth=10, random_state=42)
forest.fit(X_train, y_train)
print("Train Score", forest.score(X_train, y_train))
print("Test Score", forest.score(X_test, y_test))

Train Score 0.8490484607565002
Test Score 0.7699138134768644


### <a id='toc1_4_3_'></a>[Revisit Train-Test split](#toc0_)

In [53]:
df_cali.shape

(20640, 9)

In [54]:
20640 * 0.1 # 2K samples to test on

2064.0

In [55]:
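# NOTE: test_size is the fraction held out for TESTING. With test_size=0.9 the model trains
# on only ~2K rows and is tested on the other 90%. A 90/10 split (train on 90%, test on ~2K rows)
# needs test_size=0.1. The scores printed below come from test_size=0.9 as written.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.9, random_state=0)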

In [56]:
forest = RandomForestRegressor(n_estimators=20, max_depth=10, random_state=42)
forest.fit(X_train, y_train)
print("Train Score", forest.score(X_train, y_train))
print("Test Score", forest.score(X_test, y_test))

Train Score 0.900919145503502
Test Score 0.676219719160182


### <a id='toc1_4_4_'></a>[Review overfitting across multiple parameters](#toc0_)

In [47]:
rows = []

for n_estimators in range(10, 200, 50):
    forest = RandomForestRegressor(n_estimators=n_estimators, max_depth=10, random_state=42)
    forest.fit(X_train, y_train)

    # One row per (parameter value, data set), so the plot can draw one line per set
    rows.append({"Score": forest.score(X_train, y_train), "Set": "Train Set", "Parameter": n_estimators})
    rows.append({"Score": forest.score(X_test, y_test), "Set": "Test Set", "Parameter": n_estimators})

results = pd.DataFrame(rows)

In [48]:
px.line(results, y="Score", x="Parameter", color="Set")

In [49]:
X_train.columns

Index(['MedInc', 'HouseAge', 'AveRooms', 'Latitude', 'Longitude'], dtype='object')

In [51]:
rows = []

for max_depth in range(1, 20, 5):
    forest = RandomForestRegressor(n_estimators=60, max_depth=max_depth, random_state=42)
    forest.fit(X_train, y_train)

    # One row per (parameter value, data set), so the plot can draw one line per set
    rows.append({"Score": forest.score(X_train, y_train), "Set": "Train Set", "Parameter": max_depth})
    rows.append({"Score": forest.score(X_test, y_test), "Set": "Test Set", "Parameter": max_depth})

results = pd.DataFrame(rows)

In [52]:
px.line(results, y="Score", x="Parameter", color="Set")

In [39]:
rows = []

for min_samples_split in range(10, 100, 10):
    forest = RandomForestRegressor(n_estimators=20, max_depth=7, random_state=42, min_samples_split=min_samples_split)
    forest.fit(X_train, y_train)

    # One row per (parameter value, data set), so the plot can draw one line per set
    rows.append({"Score": forest.score(X_train, y_train), "Set": "Train Set", "Parameter": min_samples_split})
    rows.append({"Score": forest.score(X_test, y_test), "Set": "Test Set", "Parameter": min_samples_split})

results = pd.DataFrame(rows)

In [None]:
px.line(results, y="Score", x="Parameter", color="Set")
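
The three sweeps above repeat the same loop with a different hyperparameter each time. One possible way to avoid the copy-paste is a small helper; `sweep_parameter` is a hypothetical name, and the synthetic data from `make_regression` is only a stand-in so the sketch runs without downloading anything.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def sweep_parameter(param_name, values, X_train, X_test, y_train, y_test, **fixed_params):
    """Fit one forest per value of `param_name` and record the train and test R2."""
    rows = []
    for value in values:
        params = {**fixed_params, param_name: value}
        forest = RandomForestRegressor(random_state=42, **params)
        forest.fit(X_train, y_train)
        rows.append({"Score": forest.score(X_train, y_train), "Set": "Train Set", "Parameter": value})
        rows.append({"Score": forest.score(X_test, y_test), "Set": "Test Set", "Parameter": value})
    return pd.DataFrame(rows)


# Synthetic stand-in data (assumption) so the sketch is self-contained.
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

results = sweep_parameter(
    "max_depth", range(1, 20, 5), X_train, X_test, y_train, y_test, n_estimators=20
)
print(results)
```

The returned frame has the same `Score`, `Set` and `Parameter` columns as before, so `px.line(results, y="Score", x="Parameter", color="Set")` works unchanged.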

### Extra: [Regularization](https://www.geeksforgeeks.org/regularization-in-machine-learning/)
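
As a rough sketch of what regularization does, the example below compares a plain linear model with `Ridge` (L2 regularization) on made-up data. The sine data, the degree-10 polynomial features and `alpha=1.0` are assumptions for illustration. `Lasso` is the L1 counterpart for regression, and `LogisticRegression` exposes the same idea through its `penalty` and `C` parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption): a noisy sine curve with few samples.
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.3, 60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)


def polynomial_model(regressor):
    """Degree-10 polynomial features, scaled, followed by the given linear regressor."""
    return make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False), StandardScaler(), regressor
    )


plain = polynomial_model(LinearRegression()).fit(X_train, y_train)
ridge = polynomial_model(Ridge(alpha=1.0)).fit(X_train, y_train)  # alpha = L2 penalty strength

for name, model in [("No regularization", plain), ("Ridge (L2)", ridge)]:
    coef_norm = np.linalg.norm(model[-1].coef_)  # size of the learnt coefficients
    print(
        f"{name}: train R2 = {model.score(X_train, y_train):.2f}, "
        f"test R2 = {model.score(X_test, y_test):.2f}, coefficient norm = {coef_norm:.2f}"
    )
```

The penalty shrinks the coefficients, so the regularized model fits the training set a little less tightly; the hope is that it generalizes better in exchange, which is worth checking on your own data rather than taking for granted.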

# <a id='toc2_'></a>[Resources](#toc0_)

* StatQuest:
    * [Bias & Variance](https://www.youtube.com/watch?v=EuBBz3bI-aA) - 6 min