The Bias-Variance Tradeoff
----
The bias-variance tradeoff is a fundamental concept in supervised machine learning. This chapter shows how to diagnose overfitting and underfitting, and introduces ensembling, in which the predictions of several models are aggregated to produce more robust predictions.
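
To make the diagnosis concrete, here is a minimal sketch on synthetic data. The noisy sine curve, the chosen depths, and all variable names are illustrative assumptions, not part of this chapter's dataset. It compares training and test error for trees of different depths: high error on both sets suggests underfitting (high bias), while a very low training error paired with a clearly higher test error suggests overfitting (high variance).

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, illustrative data: a noisy sine curve (not the auto dataset).
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

def train_test_rmse(max_depth):
    """Fit a tree of the given depth and return (train RMSE, test RMSE)."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_tr, y_tr)
    rmse_train = mean_squared_error(y_tr, tree.predict(X_tr)) ** 0.5
    rmse_test = mean_squared_error(y_te, tree.predict(X_te)) ** 0.5
    return rmse_train, rmse_test

# max_depth=1 is a single split; max_depth=None grows the tree without a depth limit.
for depth in (1, 4, None):
    rmse_train, rmse_test = train_test_rmse(depth)
    print(f"max_depth={depth}: train RMSE={rmse_train:.3f}, test RMSE={rmse_test:.3f}")
```

The unconstrained tree memorizes the training set, so its training error drops to essentially zero while its test error does not. That gap is the signature of high variance.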

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

Instantiating a Regression Tree Model
----

Create a regression tree to predict the miles per gallon (mpg) of cars from all features in the dataset. The cells below load the data, one-hot encode the categorical origin column, build the feature matrix X and target array y, and split them into training and test sets. The model is then instantiated from the DecisionTreeRegressor class imported above, so that it can be trained and later evaluated for bias and variance.

In [7]:
mpg_df = pd.read_csv(r"C:\Users\Emigb\Documents\Data Science\datasets\auto.csv")
mpg_df.head()

,mpg,displ,hp,weight,accel,origin,size
0,18.0,250.0,88,3139,14.5,US,15.0
1,9.0,304.0,193,4732,18.5,US,20.0
2,36.1,91.0,60,1800,16.4,Asia,10.0
3,18.5,250.0,98,3525,19.0,US,15.0
4,34.3,97.0,78,2188,15.8,Europe,10.0


In [8]:
mpg_dum = pd.get_dummies(mpg_df, drop_first=True).astype(float)
mpg_dum.head()

,mpg,displ,hp,weight,accel,size,origin_Europe,origin_US
0,18.0,250.0,88.0,3139.0,14.5,15.0,0.0,1.0
1,9.0,304.0,193.0,4732.0,18.5,20.0,0.0,1.0
2,36.1,91.0,60.0,1800.0,16.4,10.0,0.0,0.0
3,18.5,250.0,98.0,3525.0,19.0,15.0,0.0,1.0
4,34.3,97.0,78.0,2188.0,15.8,10.0,1.0,0.0


In [9]:
X = mpg_dum.drop('mpg', axis=1).values
y = mpg_dum['mpg'].values

print(X.shape)
print(y.shape)

(392, 7)
(392,)


In [11]:
# Split the data into 70% train and 30% test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)

In [13]:
# Instantiate a DecisionTreeRegressor with max_depth=4 and min_samples_leaf=0.26.
# A float min_samples_leaf is a fraction: each leaf must hold at least 26% of the training samples.
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=42)
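
Once instantiated, a model like `dt` can be diagnosed by comparing its cross-validation error with its training error. The sketch below is one way to do that, not necessarily the procedure this chapter goes on to use. The helper name `cv_and_train_rmse`, the choice of 10 folds, and the synthetic stand-in data are all assumptions made so the block runs on its own. In the notebook, `dt`, `X_train`, and `y_train` from the cells above would be passed instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def cv_and_train_rmse(model, X_train, y_train, cv=10):
    """Return (cross-validation RMSE, training RMSE) for a regressor.

    Hypothetical helper for illustration; the name and the 10-fold default
    are assumptions, not taken from the original notebook.
    """
    # scikit-learn maximizes scores, so MSE is exposed as its negative.
    neg_mse = cross_val_score(
        model, X_train, y_train, cv=cv, scoring="neg_mean_squared_error"
    )
    rmse_cv = (-neg_mse.mean()) ** 0.5

    # Refit on the full training set to measure the training error.
    model.fit(X_train, y_train)
    rmse_train = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
    return rmse_cv, rmse_train

# Synthetic stand-in data (assumption): 274 rows and 7 features, roughly the
# shape of a 70% split of the (392, 7) feature matrix.
rng = np.random.RandomState(21)
X_demo = rng.normal(size=(274, 7))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=274)

dt_demo = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=42)
rmse_cv, rmse_train = cv_and_train_rmse(dt_demo, X_demo, y_demo)
print(f"CV RMSE: {rmse_cv:.2f}, train RMSE: {rmse_train:.2f}")
```

A CV error well above the training error points to high variance. CV and training errors that are close to each other but both high point to high bias. Because `min_samples_leaf=0.26` requires each leaf to hold at least 26% of the training samples, this tree can have at most three leaves, so it is heavily constrained regardless of `max_depth`.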