The Bias-Variance Tradeoff
----
The bias-variance tradeoff is one of the fundamental concepts in supervised machine learning. In this chapter, you'll learn how to diagnose overfitting and underfitting. You'll also be introduced to ensembling, where the predictions of several models are aggregated to produce more robust predictions.
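
For reference, the tradeoff is usually stated through the standard decomposition of expected squared error. The form below is a sketch under assumptions that are not part of this notebook: the target is generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the fitted model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

High bias corresponds to underfitting and high variance to overfitting. The noise term cannot be reduced by any choice of model.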

In [30]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier

Instantiating a Regression Tree Model
----

Create a regression tree model to predict the miles per gallon (mpg) of cars using all features in the dataset. The cells below load the data, one-hot encode the `origin` column, and build the feature matrix `X` and target array `y`. The model is then instantiated from the `DecisionTreeRegressor` class, imported above, so it can be trained and later evaluated for bias and variance behavior.

In [18]:
mpg_df = pd.read_csv(r"C:\Users\Emigb\Documents\Data Science\datasets\auto.csv")
mpg_df.head()

Unnamed: 0,mpg,displ,hp,weight,accel,origin,size
0,18.0,250.0,88,3139,14.5,US,15.0
1,9.0,304.0,193,4732,18.5,US,20.0
2,36.1,91.0,60,1800,16.4,Asia,10.0
3,18.5,250.0,98,3525,19.0,US,15.0
4,34.3,97.0,78,2188,15.8,Europe,10.0


In [19]:
mpg_dum = pd.get_dummies(mpg_df, drop_first=True).astype(float)
mpg_dum.head()

Unnamed: 0,mpg,displ,hp,weight,accel,size,origin_Europe,origin_US
0,18.0,250.0,88.0,3139.0,14.5,15.0,0.0,1.0
1,9.0,304.0,193.0,4732.0,18.5,20.0,0.0,1.0
2,36.1,91.0,60.0,1800.0,16.4,10.0,0.0,0.0
3,18.5,250.0,98.0,3525.0,19.0,15.0,0.0,1.0
4,34.3,97.0,78.0,2188.0,15.8,10.0,1.0,0.0


In [20]:
X = mpg_dum.drop('mpg', axis=1).values
y = mpg_dum['mpg'].values

print(X.shape)
print(y.shape)

(392, 7)
(392,)


In [21]:
#2. Split the data into 70% train and 30% test.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=21)

In [22]:
#3. Instantiate a DecisionTreeRegressor with max depth 4 and min_samples_leaf set to 0.26.
dt = DecisionTreeRegressor(max_depth = 4, min_samples_leaf = 0.26, random_state = 42)
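
When `min_samples_leaf` is a float, scikit-learn treats it as a fraction of the training samples, so 0.26 requires every leaf to hold at least 26% of the data. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, `frac` and `demo_tree` are illustrative names and are not part of the notebook.

```python
import math

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

frac = 0.26
demo_tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=frac, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Leaves are the nodes without children (children_left == -1).
is_leaf = demo_tree.tree_.children_left == -1
leaf_sizes = demo_tree.tree_.n_node_samples[is_leaf]

# A float min_samples_leaf means ceil(frac * n_samples) samples per leaf.
min_required = math.ceil(frac * len(X_demo))
print('minimum samples per leaf:', min_required)
print('leaf sizes:', leaf_sizes)
```

With such a large leaf requirement the tree can have at most three leaves, so `max_depth = 4` is never reached. The result is a deliberately simple model.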

Evaluating Regression Tree with 10-Fold Cross-Validation RMSE
----

Use 10-fold cross-validation to assess the performance of the regression tree `dt` on the training data (`X_train`, `y_train`). Call `cross_val_score` with `scoring='neg_mean_squared_error'`. scikit-learn reports the negated MSE so that higher scores are always better, so multiply the results by -1 to recover one positive MSE per fold. Taking the square root of each value gives the per-fold RMSE, which is what the cell below prints. For a single cross-validated RMSE, average the ten MSE values first and then take the square root.
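
The difference between per-fold RMSE values and a single aggregated RMSE can be sketched as follows. The data here is synthetic, and `X_demo`, `y_demo` and `demo_tree` are illustrative names rather than the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_tree = DecisionTreeRegressor(max_depth=4, random_state=0)

# cross_val_score returns negated MSE values, one per fold.
neg_mse = cross_val_score(demo_tree, X_demo, y_demo, cv=10,
                          scoring='neg_mean_squared_error')
mse_folds = -neg_mse

# Per-fold RMSE: one value for each of the 10 folds.
rmse_folds = np.sqrt(mse_folds)

# Aggregated CV RMSE: average the MSEs first, then take the square root.
rmse_cv = np.sqrt(mse_folds.mean())

print('Per-fold RMSE:', np.round(rmse_folds, 2))
print('CV RMSE: {:.2f}'.format(rmse_cv))
```

The aggregated value is never smaller than the mean of the per-fold RMSEs, because the square root of an average is at least the average of the square roots.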

In [23]:
#1. Compute dt's 10-fold cross-validated MSE by setting the scoring argument to 'neg_mean_squared_error'.
MSE_cv = - cross_val_score(dt, X_train, y_train, cv=10, scoring = 'neg_mean_squared_error', n_jobs = -1)

#2. Compute RMSE from the obtained MSE scores.
RMSE_cv = MSE_cv ** (1/2)

print('CV RMSE:', np.round(RMSE_cv, 2))

CV RMSE: [4.   4.25 4.05 4.14 4.72 5.15 4.62 4.81 4.32 5.47]


Calculating Training Set RMSE for the Regression Tree
----

Evaluate the performance of the regression tree `dt` on the training set (`X_train`, `y_train`). Fit the model, use it to make predictions on `X_train`, then compute the Mean Squared Error (MSE) between the predictions and the actual values with `mean_squared_error` (imported as `MSE`). Take the square root of the MSE to obtain the RMSE, which expresses the typical size of the model's prediction error on the training data in the same units as mpg.
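
The training RMSE is most informative when set beside the cross-validated RMSE. A commonly used reading is that a CV error well above the training error points to high variance (overfitting), while training and CV errors that are close but both high point to high bias (underfitting). The sketch below illustrates this on a synthetic noisy sine curve by varying `max_depth`. The data and the names `X_demo`, `y_demo` and `results` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 6, size=200).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=200)

results = {}
for depth in (1, 3, None):  # None grows the tree until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)

    # Cross-validated RMSE: average the fold MSEs, then take the square root.
    cv_mse = -cross_val_score(tree, X_demo, y_demo, cv=10,
                              scoring='neg_mean_squared_error')
    cv_rmse = np.sqrt(cv_mse.mean())

    # Training RMSE: fit and predict on the same data.
    tree.fit(X_demo, y_demo)
    train_rmse = np.sqrt(np.mean((y_demo - tree.predict(X_demo)) ** 2))

    results[depth] = (train_rmse, cv_rmse)
    print('max_depth={}: train RMSE {:.2f}, CV RMSE {:.2f}'.format(
        depth, train_rmse, cv_rmse))
```

The unrestricted tree memorizes the training data, so its training RMSE drops to zero while its CV RMSE does not. That gap is the signature of overfitting.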

In [24]:
#2. Fit dt to the training set.
dt.fit(X_train, y_train)

#3. Predict dt's training set labels and assign the result to y_pred_train.
#   Training error is measured on the same data the model was fit on.
y_pred_train = dt.predict(X_train)

#4. Evaluate dt's training set RMSE against y_train and assign it to RMSE_train.
RMSE_train = MSE(y_train, y_pred_train) ** 0.5

print('RMSE_train: {:.2f}'.format(RMSE_train))


Instantiating Classifiers for Liver Disease Prediction
----

Create three classification models to predict liver disease using all dataset features. Instantiate:

1. A `LogisticRegression` model
2. A `DecisionTreeClassifier` model
3. A `KNeighborsClassifier` model (imported as KNN)

These models will later be used individually and as part of an ensemble for performance comparison.

In [25]:
liver = pd.read_csv(r"C:\Users\Emigb\Documents\Data Science\datasets\Indian liver\indian_liver_patient_preprocessed.csv")
liver.head()

Unnamed: 0.1,Unnamed: 0,Age_std,Total_Bilirubin_std,Direct_Bilirubin_std,Alkaline_Phosphotase_std,Alamine_Aminotransferase_std,Aspartate_Aminotransferase_std,Total_Protiens_std,Albumin_std,Albumin_and_Globulin_Ratio_std,Is_male_std,Liver_disease
0,0,1.247403,-0.42032,-0.495414,-0.42887,-0.355832,-0.319111,0.293722,0.203446,-0.14739,0,1
1,1,1.062306,1.218936,1.423518,1.675083,-0.093573,-0.035962,0.939655,0.077462,-0.648461,1,1
2,2,1.062306,0.640375,0.926017,0.816243,-0.115428,-0.146459,0.478274,0.203446,-0.178707,1,1
3,3,0.815511,-0.372106,-0.388807,-0.449416,-0.36676,-0.312205,0.293722,0.329431,0.16578,1,1
4,4,1.679294,0.093956,0.179766,-0.395996,-0.295731,-0.177537,0.755102,-0.930414,-1.713237,1,1


In [26]:
X_lv = liver.drop('Liver_disease', axis =1).values
y_lv = liver['Liver_disease'].values


X_train, X_test, y_train, y_test = train_test_split(X_lv, y_lv, test_size = 0.3, random_state=21)

In [27]:
#1. Instantiate a Logistic Regression classifier and assign it to lr.
lr = LogisticRegression(random_state = 21)

#2. Instantiate a KNN classifier that considers 27 nearest neighbors and assign it to knn.
knn = KNN(n_neighbors = 27)

#3. Instantiate a Decision Tree Classifier with the parameter min_samples_leaf set to 0.13 and assign it to dt1.
dt1 = DecisionTreeClassifier(min_samples_leaf = 0.13, random_state = 21)

# Define the list of (name, classifier) tuples
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbors', knn), ('Classification Tree', dt1)]

Evaluating Test Accuracy of Individual Classifiers
----
Fit each model in the `classifiers` list on the training data (`X_train`, `y_train`).
Use each fitted model to predict labels for `X_test`.
Calculate the accuracy of each model using `accuracy_score()` by comparing predicted labels with `y_test`.
Record and compare the accuracy values to determine the best-performing individual classifier.


In [29]:
#1. Iterate over the tuples in classifiers. Use clf_name and clf as the for loop variables:
for clf_name,clf in classifiers:
    #2. Fit clf to the training set.
    clf.fit(X_train, y_train)

    #3. Predict clf's test set labels and assign the results to y_pred1.
    y_pred1 = clf.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred1)

    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.764
K Nearest Neighbors : 0.753
Classification Tree : 0.730


Evaluating a Voting Classifier
----

Create a voting classifier that combines the models in the `classifiers` list.
Fit the voting classifier on `X_train` and `y_train`.
Use it to predict labels for `X_test`.
Compute accuracy with `accuracy_score()` by comparing predictions with `y_test`.
Compare this accuracy to the results from the individual classifiers to see if majority voting improves performance.
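
By default `VotingClassifier` uses hard voting: each fitted model casts one vote per sample and the most frequent label wins. The sketch below reproduces that rule by hand on synthetic binary data and checks it against the ensemble's own predictions. The data and the `*_demo` names are assumptions for illustration, not the liver dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

vc_demo = VotingClassifier(estimators=[
    ('lr', LogisticRegression(random_state=0)),
    ('knn', KNeighborsClassifier(n_neighbors=27)),
    ('dt', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=0)),
])  # voting='hard' is the default
vc_demo.fit(X_tr, y_tr)

# One row of 0/1 predictions per fitted base model.
individual = np.array([est.predict(X_te) for est in vc_demo.estimators_])

# With three binary voters, the majority label is 1 when at least two vote 1.
manual_vote = (individual.sum(axis=0) >= 2).astype(int)

print('Manual vote matches VotingClassifier:',
      np.array_equal(manual_vote, vc_demo.predict(X_te)))
```

Passing `voting='soft'` instead averages the predicted class probabilities, which requires every base model to implement `predict_proba`.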


In [34]:
#2. Instantiate a VotingClassifier by setting the parameter estimators to classifiers and assign it to vc.
vc = VotingClassifier(estimators = classifiers)

#3. Fit vc to the training set.
vc.fit(X_train, y_train)

#4. Predict vc's test set labels and assign the result to y_pred_vc.
y_pred_vc = vc.predict(X_test)

#5. Evaluate vc's test set accuracy using the test set predictions y_pred_vc.
accuracy_vc = accuracy_score(y_test, y_pred_vc)

print('Voting Classifier Accuracy Score: {:.2f}'.format(accuracy_vc))

Voting Classifier Accuracy Score: 0.78
