# The Bias-Variance Tradeoff

The bias-variance tradeoff is one of the fundamental concepts in supervised machine learning. In this chapter, you'll understand how to diagnose the problems of overfitting and underfitting. You'll also be introduced to the concept of ensembling where the predictions of several models are aggregated to produce predictions that are more robust.

## Generalization Error

* Find a model $\hat f$ that best approximates $f$: $\hat f \approx f$
* $\hat f$ can be Logistic Regression, Decision Tree, Neural Network.
* Discard noise as much as possible.
* End goal: $\hat f$ should achieve a low predictive error on unseen datasets.

The Generalization Error of $\hat f$: Does $\hat f$ generalize well on unseen data?
* It can be decomposed as follows: $\text{Generalization Error of } \hat f = \text{bias}^2 + \text{variance} + \text{irreducible error}$
* irreducible error is the error contribution of noise.
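
The decomposition above can be written out for squared-error loss at a fixed input $x$. This is a sketch under stated assumptions: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ (the symbols $\varepsilon$ and $\sigma^2$ are introduced here for illustration and do not appear elsewhere in this chapter), and the expectation is taken over training sets and noise.

$$
E\big[(y - \hat f(x))^2\big]
= \underbrace{\big(E[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{E\big[(\hat f(x) - E[\hat f(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

Only the first two terms depend on the choice of $\hat f$; the last term is fixed by the noise in the data.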

**Bias**

![Bias](Bias.png)

* Bias is an error term that tells you how much, on average, $\hat f$ differs from $f$.
* High bias models lead to underfitting.
  
**Variance**

![Variance](Variance.png)

* Variance tells you how much $\hat f$ is inconsistent over different training sets.
* High variance models lead to overfitting.

**Model Complexity**

* Model complexity sets the flexibility of $\hat f$.
* Example: maximum tree depth and minimum samples per leaf control the complexity of a decision tree. Increasing the maximum depth or decreasing the minimum samples per leaf makes the tree more complex.

![Bias Variance Trade-off](Bias_Variance_Trade_off.png)

The diagram on the left here shows how the best model complexity corresponds to the lowest generalization error. When the model complexity increases, the variance increases while the bias decreases. Conversely, when model complexity decreases, variance decreases and bias increases. Your goal is to find the model complexity that achieves the lowest generalization error. Since this error is the sum of three terms with the irreducible error being constant, you need to find a balance between bias and variance because as one increases the other decreases. This is known as the bias-variance trade-off.

Visually, in the diagram on the right, you can imagine approximating $f$ with $\hat f$ as aiming at the center of a shooting target, where the center is the true function $f$. If $\hat f$ has low bias and low variance, your shots will be closely clustered around the center. If $\hat f$ has high variance and high bias, not only will your shots miss the target, but they will also be spread all around the shooting target.

* As the complexity of $\hat f$ increases, the bias term decreases while the variance term increases.
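
The trade-off can be checked numerically with a small simulation. The sketch below is not part of the original exercises: it assumes a hypothetical true function `true_f`, Gaussian noise, and a toy piecewise-constant model whose number of bins (`n_bins`) plays the role of model complexity. It refits the model on many resampled training sets and estimates the squared bias and the variance of its predictions.

```python
import math
import random
from statistics import mean, pvariance


def true_f(x):
    """Hypothetical ground-truth function, used only for this simulation."""
    return math.sin(2 * math.pi * x)


def fit_bin_model(xs, ys, n_bins):
    """Fit a piecewise-constant model: one average per equal-width bin on [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = mean(ys)  # fallback prediction for bins with no training points
    return [sums[b] / counts[b] if counts[b] else overall for b in range(n_bins)]


def predict(model, x):
    """Return the stored average of the bin that x falls into."""
    n_bins = len(model)
    return model[min(int(x * n_bins), n_bins - 1)]


def bias_variance(n_bins, n_train=50, n_repeats=200, noise_sd=0.3, seed=0):
    """Estimate (bias^2, variance), averaged over a grid of test points."""
    rng = random.Random(seed)
    grid = [(i + 0.5) / 100 for i in range(100)]
    preds = [[] for _ in grid]
    for _ in range(n_repeats):
        # Draw a fresh noisy training set and refit the model
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        model = fit_bin_model(xs, ys, n_bins)
        for i, x in enumerate(grid):
            preds[i].append(predict(model, x))
    bias_sq = mean((mean(p) - true_f(x)) ** 2 for p, x in zip(preds, grid))
    variance = mean(pvariance(p) for p in preds)
    return bias_sq, variance


for n_bins in (1, 5, 25):
    b2, var = bias_variance(n_bins)
    print(f"n_bins={n_bins:2d}  bias^2={b2:.3f}  variance={var:.3f}")
```

In this toy setup, the single-bin model has the largest squared bias and the smallest variance, while the 25-bin model shows the opposite pattern.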

**Overfitting and Underfitting**

In this exercise, you'll visually diagnose whether a model is overfitting or underfitting the training set.

For this purpose, we have trained two different models $A$ and $B$ on the auto dataset to predict the mpg consumption of a car using only the car's displacement (displ) as a feature.

The following figure shows you scatterplots of mpg versus displ along with lines corresponding to the training set predictions of models $A$ and $B$ in red.

![Overfitting and Underfitting](diagnose-problems.jpg)

* $A$ suffers from high variance and overfits the training set.
* $B$ suffers from high bias and underfits the training set.

Model B is not able to capture the nonlinear dependence of mpg on displ.

## Diagnose bias and variance problems

**Diagnose Variance Problems**

If the CV error of $\hat f$ > training set error of $\hat f$, then $\hat f$ suffers from high variance.

**Diagnose Bias Problems**

If the CV error of $\hat f$ ≈ training set error of $\hat f$ >> desired error, then $\hat f$ suffers from high bias.

In scikit-learn functions such as cross_val_score, n_jobs=-1 means using all processors.

Given that **training_error < CV_error**, we can deduce that DT overfits the training set and suffers from high variance.
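
The two diagnostic rules in this section can be sketched as a small helper. The function name `diagnose` and the tolerance `rel_tol`, which decides when two errors count as roughly equal, are hypothetical choices made for illustration. The helper also simplifies ">> desired error" to a plain comparison, and the numbers in the usage lines are made up.

```python
def diagnose(train_error, cv_error, desired_error, rel_tol=0.05):
    """Return a rough diagnosis from three errors measured with the same metric.

    rel_tol is a hypothetical threshold: the CV error counts as clearly larger
    than the training error when the gap exceeds rel_tol * cv_error.
    """
    if cv_error - train_error > rel_tol * cv_error:
        # CV error clearly above training error: the model overfits
        return "high variance"
    if cv_error > desired_error:
        # CV error is close to training error, and both are too large
        return "high bias"
    return "ok"


# Illustrative values only
print(diagnose(train_error=2.0, cv_error=5.0, desired_error=4.0))
print(diagnose(train_error=5.2, cv_error=5.3, desired_error=4.0))
print(diagnose(train_error=3.0, cv_error=3.1, desired_error=4.0))
```

In practice the threshold for "roughly equal" depends on the problem and on the spread of the CV scores, so the default tolerance here should not be read as a recommended value.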

## Instantiate the model

In the following set of exercises, you'll diagnose the bias and variance problems of a regression tree. The regression tree you'll define in this exercise will be used to predict the mpg consumption of cars from the auto dataset using all available features.

We have already processed the data and loaded the features matrix X and the array y in your workspace. In addition, the DecisionTreeRegressor class was imported from sklearn.tree.

In [1]:
import numpy as np
import pandas as pd

mpg = pd.read_csv('auto.csv')
mpg = pd.get_dummies(mpg)
X = mpg.drop('mpg', axis='columns')
y = mpg['mpg']

**Instructions**

* Import train_test_split from sklearn.model_selection.
* Split the data into 70% train and 30% test.
* Instantiate a DecisionTreeRegressor with max_depth set to 4 and min_samples_leaf set to 0.26.

In [5]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)

You'll now evaluate dt's CV error.

In [3]:
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score

## Evaluate the 10-fold CV error

In this exercise, you'll evaluate the 10-fold CV Root Mean Squared Error (RMSE) achieved by the regression tree dt that you instantiated in the previous exercise.

In addition to dt, the training data including X_train and y_train are available in your workspace. We also imported cross_val_score from sklearn.model_selection.

Note that scikit-learn scorers follow the convention that higher values are better, so cross_val_score with scoring='neg_mean_squared_error' returns negative MSEs. Its output should be multiplied by negative one to obtain the MSEs. The CV RMSE can then be obtained by computing the square root of the average MSE.
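
The arithmetic behind this note can be sketched with a few made-up fold scores. The values in `neg_mse_scores` are hypothetical stand-ins for what cross_val_score would return. The sketch also shows that the square root of the average MSE is not the same quantity as the average of the per-fold RMSEs.

```python
import math

# Hypothetical negative MSEs, one per fold (illustrative values only)
neg_mse_scores = [-25.0, -16.0, -36.0, -9.0]

# Flip the sign to recover the per-fold MSEs
mse_scores = [-s for s in neg_mse_scores]

# CV RMSE as described above: square root of the average MSE
rmse_cv = math.sqrt(sum(mse_scores) / len(mse_scores))

# A different quantity: the average of the per-fold RMSEs
mean_fold_rmse = sum(math.sqrt(m) for m in mse_scores) / len(mse_scores)

print('CV RMSE (sqrt of mean MSE): {:.3f}'.format(rmse_cv))
print('Mean of per-fold RMSEs:     {:.3f}'.format(mean_fold_rmse))
```

The first quantity is never smaller than the second, so it matters which one is reported when comparing results.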

In [6]:
# Set n_jobs to -1 in order to exploit all CPU cores in computation
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, 
                       scoring='neg_mean_squared_error',
                       n_jobs=-1)

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(0.5)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14


A very good practice is to keep the test set untouched until you are confident about your model's performance. CV is a great technique to get an estimate of a model's performance without affecting the test set.
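
To make the idea concrete, here is a minimal sketch of how K-fold CV partitions the training set only. The helper `k_fold_indices` is a hypothetical illustration with contiguous, unshuffled folds, and it is not scikit-learn's implementation. Each sample is used for validation exactly once, and the test set never appears.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds, without shuffling."""
    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Example: 10 training samples split into 3 folds
for train_idx, val_idx in k_fold_indices(10, 3):
    print('train:', train_idx, 'validate:', val_idx)
```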

## Evaluate the training error

You'll now evaluate the training set RMSE achieved by the regression tree dt that you instantiated in a previous exercise.

In addition to dt, X_train and y_train are available in your workspace.

Note that in scikit-learn, the MSE of a model can be computed as follows:

```python
MSE_model = mean_squared_error(y_true, y_predicted)
```

where we use the function mean_squared_error from the metrics module and pass it the true labels y_true as a first argument, and the predicted labels from the model y_predicted as a second argument.
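
For intuition, the quantity itself can be computed by hand. The sketch below uses a hypothetical helper `mse` and made-up labels. It averages the squared differences between true and predicted values, then takes the square root to get the RMSE.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels and predictions, for illustration only
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

mse_value = mse(y_true, y_pred)   # squared errors are 1, 0 and 4
rmse_value = mse_value ** 0.5

print('MSE: {:.3f}'.format(mse_value))
print('RMSE: {:.3f}'.format(rmse_value))
```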

In [8]:
# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(0.5)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15


Notice how the training error is roughly equal to the 10-folds CV error.

## Baseline RMSE

In [13]:
pd.DataFrame(X['displ'].values.reshape(-1,1))

Unnamed: 0,0
0,250.0
1,304.0
2,91.0
3,250.0
4,97.0
...,...
387,250.0
388,151.0
389,98.0
390,250.0


## High bias or high variance?

In this exercise you'll diagnose whether the regression tree dt you trained in the previous exercise suffers from a bias or a variance problem.

The training set RMSE (RMSE_train) and the CV RMSE (RMSE_CV) achieved by dt are available in your workspace. In addition, we have also loaded a variable called baseline_RMSE, which corresponds to the root mean-squared error achieved by the regression tree trained with the displ feature only (it is the RMSE achieved by the regression tree trained in chapter 1, lesson 3). Here baseline_RMSE serves as the baseline RMSE above which a model is considered to be underfitting and below which the model is considered 'good enough'.

Does dt suffer from a high bias or a high variance problem?

In [15]:
baseline_RMSE = 5.1


Train RMSE: 3.91


In [18]:
print("RMSE CV:",RMSE_CV)
print("RMSE Train:",RMSE_train)

RMSE CV: 5.143255076652255
RMSE Train: 5.151299302408305


dt suffers from high bias because RMSE_CV ≈ RMSE_train and both scores are greater than baseline_RMSE.


dt is indeed underfitting the training set as the model is too constrained to capture the nonlinear dependencies between features and labels.

## Ensemble Learning

### Define the ensemble
In the following set of exercises, you'll work with the Indian Liver Patient Dataset from the UCI Machine Learning Repository.

In this exercise, you'll instantiate three classifiers to predict whether a patient suffers from a liver disease using all the features present in the dataset.

The classes LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier under the alias KNN are available in your workspace.


In [22]:
data=pd.read_csv("indian_liver_patient_preprocessed.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,Age_std,Total_Bilirubin_std,Direct_Bilirubin_std,Alkaline_Phosphotase_std,Alamine_Aminotransferase_std,Aspartate_Aminotransferase_std,Total_Protiens_std,Albumin_std,Albumin_and_Globulin_Ratio_std,Is_male_std,Liver_disease
0,0,1.247403,-0.42032,-0.495414,-0.42887,-0.355832,-0.319111,0.293722,0.203446,-0.14739,0,1
1,1,1.062306,1.218936,1.423518,1.675083,-0.093573,-0.035962,0.939655,0.077462,-0.648461,1,1
2,2,1.062306,0.640375,0.926017,0.816243,-0.115428,-0.146459,0.478274,0.203446,-0.178707,1,1
3,3,0.815511,-0.372106,-0.388807,-0.449416,-0.36676,-0.312205,0.293722,0.329431,0.16578,1,1
4,4,1.679294,0.093956,0.179766,-0.395996,-0.295731,-0.177537,0.755102,-0.930414,-1.713237,1,1


In [25]:
# Set seed for reproducibility
SEED=1
X=pd.DataFrame(data.drop('Liver_disease',axis=1).values.reshape(-1,11))
y=data['Liver_disease']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30, random_state=SEED)

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier

# Set seed for reproducibility
SEED=1

# Instantiate lr
lr = LogisticRegression(solver='liblinear',random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

### Evaluate individual classifiers

In this exercise you'll evaluate the performance of the models in the list classifiers that we defined in the previous exercise. You'll do so by fitting each classifier on the training set and evaluating its test set accuracy.

The dataset is already loaded and preprocessed for you (numerical features are standardized) and it is split into 70% train and 30% test. The features matrices X_train and X_test, as well as the arrays of labels y_train and y_test are available in your workspace. In addition, we have loaded the list classifiers from the previous exercise, as well as the function accuracy_score() from sklearn.metrics.

In [29]:
from sklearn.metrics import accuracy_score

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred) 
   
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.747
K Nearest Neighbours : 0.724
Classification Tree : 0.730


Notice how Logistic Regression achieved the highest accuracy of 74.7%.

### Better performance with a Voting Classifier

Finally, you'll evaluate the performance of a voting classifier that takes the outputs of the models defined in the list classifiers and assigns labels by majority voting.

X_train, X_test,y_train, y_test, the list classifiers defined in a previous exercise, as well as the function accuracy_score from sklearn.metrics are available in your workspace.
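
Majority voting itself is simple to sketch. The helper `majority_vote` below is a hypothetical illustration, not scikit-learn's code: for each sample it picks the label predicted by most models. With three classifiers and two classes a tie cannot occur. In other settings this sketch falls back on whichever tied label appears first, which may differ from how a library breaks ties.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions by majority vote.

    predictions_per_model: list of equal-length label lists, one list per model.
    """
    voted = []
    # zip(*...) groups the predictions of all models for the same sample
    for labels in zip(*predictions_per_model):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Made-up predictions from three models on three samples
preds_a = [1, 0, 1]
preds_b = [1, 1, 0]
preds_c = [0, 1, 0]
print(majority_vote([preds_a, preds_b, preds_c]))
```

VotingClassifier uses this kind of hard voting by default.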

In [30]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.753


Notice how the voting classifier achieves a test set accuracy of 75.3%. This value is greater than that achieved by LogisticRegression.