The Bias-Variance Tradeoff
----
The bias-variance tradeoff is one of the fundamental concepts in supervised machine learning. In this chapter, you'll learn how to diagnose overfitting and underfitting. You'll also be introduced to ensembling, where the predictions of several models are aggregated to produce more robust predictions.

In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier

Instantiating a Regression Tree Model
----

Create a regression tree model to predict the miles per gallon (mpg) of cars using all features in the dataset. Load the data, one-hot encode the categorical `origin` column, and build the feature matrix `X` and target array `y`. After splitting the data into training and test sets, instantiate the model from the `DecisionTreeRegressor` class, which has already been imported, so it can be trained and later evaluated for bias and variance behavior.

In [26]:
mpg_df = pd.read_csv(r"C:\Users\Emigb\Documents\Data Science\datasets\auto.csv")
mpg_df.head()

Unnamed: 0,mpg,displ,hp,weight,accel,origin,size
0,18.0,250.0,88,3139,14.5,US,15.0
1,9.0,304.0,193,4732,18.5,US,20.0
2,36.1,91.0,60,1800,16.4,Asia,10.0
3,18.5,250.0,98,3525,19.0,US,15.0
4,34.3,97.0,78,2188,15.8,Europe,10.0


In [27]:
mpg_dum = pd.get_dummies(mpg_df, drop_first=True).astype(float)
mpg_dum.head()

Unnamed: 0,mpg,displ,hp,weight,accel,size,origin_Europe,origin_US
0,18.0,250.0,88.0,3139.0,14.5,15.0,0.0,1.0
1,9.0,304.0,193.0,4732.0,18.5,20.0,0.0,1.0
2,36.1,91.0,60.0,1800.0,16.4,10.0,0.0,0.0
3,18.5,250.0,98.0,3525.0,19.0,15.0,0.0,1.0
4,34.3,97.0,78.0,2188.0,15.8,10.0,1.0,0.0


In [28]:
X = mpg_dum.drop('mpg', axis=1).values
y = mpg_dum['mpg'].values

print(X.shape)
print(y.shape)

(392, 7)
(392,)


In [29]:
#2. Split the data into 70% train and 30% test.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=21)

In [30]:
#3. Instantiate a DecisionTreeRegressor with max depth 4 and min_samples_leaf set to 0.26.
dt = DecisionTreeRegressor(max_depth = 4, min_samples_leaf = 0.26, random_state = 42)
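
The effect of a fractional `min_samples_leaf` can be checked on a small example. The sketch below assumes synthetic data from `make_regression` as a stand-in for the auto dataset (274 rows, roughly the size of a 70% split of 392 rows); the names `X_demo`, `y_demo`, and `tree` are illustrative only. When `min_samples_leaf` is a float, scikit-learn treats it as a fraction of the training rows, so every leaf must hold at least `ceil(0.26 * n_samples)` samples.

```python
import math

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): this is not the auto dataset.
X_demo, y_demo = make_regression(n_samples=274, n_features=7, noise=10.0, random_state=0)

tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=42)
tree.fit(X_demo, y_demo)

# A float min_samples_leaf is a fraction: each leaf needs ceil(fraction * n_samples) rows.
min_rows = math.ceil(0.26 * len(X_demo))

# Leaves are the nodes without children (children_left == -1).
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print('Minimum rows per leaf:', min_rows)
print('Leaf sizes:', leaf_sizes)
print('Number of leaves:', tree.get_n_leaves())
```

With this setting the leaf-size constraint, not `max_depth`, is what limits the tree: at most three leaves can each hold 26% of the rows, which makes the model quite rigid.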

Evaluating Regression Tree with 10-Fold Cross-Validation RMSE
----

Use 10-fold cross-validation to assess the performance of the regression tree `dt` on the training data (`X_train`, `y_train`). Call `cross_val_score` with the `scoring` parameter set to `'neg_mean_squared_error'` and multiply the results by -1 to obtain positive MSE values, one per fold. Taking the square root of each value gives the per-fold RMSE values printed below. Averaging the MSE values first and then taking the square root gives a single cross-validated RMSE that summarizes the model’s prediction error across folds.

In [37]:
#1. Compute dt's 10-fold cross-validated MSE by setting the scoring argument to 'neg_mean_squared_error'.
MSE_cv = - cross_val_score(dt, X_train, y_train, cv=10, scoring = 'neg_mean_squared_error', n_jobs = -1)

#2. Compute the per-fold RMSE from the obtained MSE scores.
RMSE_cv = MSE_cv ** 0.5

print('CV RMSE:', np.round(RMSE_cv, 2))

CV RMSE: [4.   4.25 4.05 4.14 4.72 5.15 4.62 4.81 4.32 5.47]
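
The two ways of summarizing the fold scores can be compared side by side. This is a minimal sketch on synthetic data from `make_regression` (an assumption, not the auto dataset); `X_demo`, `y_demo`, `tree`, `rmse_per_fold`, and `rmse_cv` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): this is not the auto dataset.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=42)

# scikit-learn reports negated MSE so that higher scores are always better; flip the sign.
mse_folds = -cross_val_score(tree, X_demo, y_demo, cv=10,
                             scoring='neg_mean_squared_error')

rmse_per_fold = np.sqrt(mse_folds)      # one RMSE per fold
rmse_cv = np.sqrt(mse_folds.mean())     # single RMSE from the averaged MSE

print('Per-fold RMSE:', np.round(rmse_per_fold, 2))
print('CV RMSE: {:.2f}'.format(rmse_cv))
```

The per-fold values show how much the error varies between folds, while the single number is easier to compare against a training-set RMSE.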


Calculating Training Set RMSE for the Regression Tree
----

Evaluate the performance of the regression tree `dt` on the training set (`X_train`, `y_train`). Use the model to make predictions on `X_train`, then compute the Mean Squared Error (MSE) between the predictions and the actual values with `mean_squared_error`. Take the square root of the MSE to obtain the RMSE, which represents the average prediction error of the model on the training data.

In [40]:
#2. Fit dt to the training set.
dt.fit(X_train, y_train)

#3. Predict dt's training set labels and assign the result to y_pred_train.
y_pred_train = dt.predict(X_train)

#4. Evaluate dt's training set RMSE and assign it to RMSE_train.
RMSE_train = MSE(y_train, y_pred_train) ** 0.5

print('RMSE_train: {:.2f}'.format(RMSE_train))
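
Comparing the training RMSE with the cross-validated RMSE is one way to tell bias from variance. The sketch below assumes synthetic data and a hypothetical helper, `train_and_cv_rmse`, that is not part of scikit-learn. It contrasts a heavily constrained tree with a fully grown one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): this is not the auto dataset.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

def train_and_cv_rmse(model, X, y, cv=10):
    """Hypothetical helper: return (training RMSE, cross-validated RMSE)."""
    mse_folds = -cross_val_score(model, X, y, cv=cv,
                                 scoring='neg_mean_squared_error')
    rmse_cv = np.sqrt(mse_folds.mean())
    model.fit(X, y)
    rmse_train = np.sqrt(mean_squared_error(y, model.predict(X)))
    return rmse_train, rmse_cv

models = {
    'constrained': DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=42),
    'unconstrained': DecisionTreeRegressor(random_state=42),
}

results = {name: train_and_cv_rmse(model, X_demo, y_demo) for name, model in models.items()}

for name, (rmse_train, rmse_cv) in results.items():
    print('{:>13}: train RMSE = {:.2f}, CV RMSE = {:.2f}'.format(name, rmse_train, rmse_cv))
```

A CV RMSE far above the training RMSE, as with the fully grown tree, points to high variance (overfitting). Training and CV RMSE that are close to each other but both high point to high bias (underfitting).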


Instantiating Classifiers for Liver Disease Prediction
----

Create three classification models to predict liver disease using all dataset features. Instantiate:

1. A `LogisticRegression` model
2. A `DecisionTreeClassifier` model
3. A `KNeighborsClassifier` model (imported as KNN)

These models will later be used individually and as part of an ensemble for performance comparison.

In [46]:
#1. Instantiate a Logistic Regression classifier and assign it to lr.
lr = LogisticRegression(random_state = 21)

#2. Instantiate a KNN classifier that considers 27 nearest neighbors and assign it to knn.
knn = KNN(n_neighbors = 27)

#3. Instantiate a Decision Tree Classifier with the parameter min_samples_leaf set to 0.13 and assign it to dt1.
dt1 = DecisionTreeClassifier(min_samples_leaf = 0.13, random_state = 21)

# Define the list of (name, classifier) tuples
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbors', knn), ('Classification Tree', dt1)]
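
A list of `(name, classifier)` tuples like this can be evaluated model by model and then passed to an ensemble. The sketch below is illustrative only: the liver disease data is not shown here, so it uses synthetic data from `make_classification`, and it assumes hard voting with scikit-learn's `VotingClassifier` as the aggregation method, which is one common choice rather than necessarily the one used later. `max_iter=1000` and the `demo_` names are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): this is not the liver disease dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=21)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=21)

demo_classifiers = [
    ('Logistic Regression', LogisticRegression(random_state=21, max_iter=1000)),
    ('K Nearest Neighbors', KNeighborsClassifier(n_neighbors=27)),
    ('Classification Tree', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=21)),
]

# Fit and score each model on its own.
accuracies = {}
for name, clf in demo_classifiers:
    clf.fit(X_tr, y_tr)
    accuracies[name] = accuracy_score(y_te, clf.predict(X_te))

# Hard voting: each model casts one vote and the majority class wins.
vc = VotingClassifier(estimators=demo_classifiers)
vc.fit(X_tr, y_tr)
accuracies['Voting Classifier'] = accuracy_score(y_te, vc.predict(X_te))

for name, acc in accuracies.items():
    print('{:s}: {:.3f}'.format(name, acc))
```

`VotingClassifier` fits its own clones of the listed estimators, so the individually fitted models are left unchanged.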