The bias-variance tradeoff is one of the fundamental concepts in supervised machine learning. In this chapter, you'll understand how to diagnose the problems of overfitting and underfitting. You'll also be introduced to the concept of ensembling where the predictions of several models are aggregated to produce predictions that are more robust.

# 1- Generalization Error


video

# 2- Complexity, bias and variance


<p>In the video, you saw how the complexity of a model labeled \(\hat{f}\) influences the bias and variance terms of its generalization error.<br>
Which of the following correctly describes the relationship between \(\hat{f}\)&apos;s complexity and \(\hat{f}\)&apos;s bias and variance terms?</p>

- <p>As the complexity of \(\hat{f}\) decreases, the bias term decreases while the variance term increases.</p>
- <p>As the complexity of \(\hat{f}\) decreases, both the bias and the variance terms increase.</p>
- <p>As the complexity of \(\hat{f}\) increases, the bias term increases while the variance term decreases.</p>
- <p>As the complexity of \(\hat{f}\) increases, the bias term decreases while the variance term increases.</p> (Answer)
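
For reference, the relationship in this question follows from the standard decomposition of the expected squared error of \(\hat{f}\) at a point \(x\). The notation below is an assumption made for illustration and is not taken from the video: the target is written as \(y = f(x) + \varepsilon\), where \(f\) is the true function and \(\varepsilon\) is zero-mean noise with variance \(\sigma^2\), and the expectation is taken over training sets and noise.

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
\]

A more complex \(\hat{f}\) can track \(f\) more closely, which shrinks the bias term, but it also changes more from one training set to another, which inflates the variance term.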

# 3- Overfitting and underfitting


<p>In this exercise, you&apos;ll visually diagnose whether a model is overfitting or underfitting the training set.</p>
<p>For this purpose, we have trained two different models \(A\) and \(B\) on the auto dataset to predict the <code>mpg</code> consumption of a car using only the car&apos;s displacement (<code>displ</code>) as a feature.</p>
<p>The following figure shows you scatterplots of <code>mpg</code> versus <code>displ</code> along with lines corresponding to the training set predictions of models \(A\) and \(B\) in red.</p>
<p><img src="https://assets.datacamp.com/production/repositories/1796/datasets/f905399bc06da86c2a3af27b20717de5a777e6e1/diagnose-problems.jpg" alt="diagnose"></p>
<p>Which of the following statements is true?</p>

- <p>\(A\) suffers from high bias and overfits the training set.</p>
- <p>\(A\) suffers from high variance and underfits the training set.</p>
- <p>\(B\) suffers from high bias and underfits the training set. (Answer) </p>
- <p>\(B\) suffers from high variance and underfits the training set.</p>
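
The same diagnosis can be made numerically instead of visually. The sketch below is a rough illustration and does not use the auto dataset: the data is synthetic, the curve is only loosely shaped like an mpg-versus-displacement trend, and the names `simple`, `flexible`, and `rmse` are hypothetical. It fits one very constrained tree and one unconstrained tree and prints their training and test RMSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the auto dataset)
rng = np.random.RandomState(1)
X = rng.uniform(70, 450, size=(300, 1))
y = 45 - 0.12 * X[:, 0] + 0.00012 * X[:, 0] ** 2 + rng.normal(0, 3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)


def rmse(model, X, y):
    """Root mean squared error of model's predictions on (X, y)."""
    return mean_squared_error(y, model.predict(X)) ** 0.5


# A tree limited to a single split: too rigid to follow the curve
simple = DecisionTreeRegressor(max_depth=1, random_state=1).fit(X_train, y_train)

# A tree with no depth limit: free to memorize the training points
flexible = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)

for name, model in [('simple', simple), ('flexible', flexible)]:
    print('{:s}: train RMSE {:.2f}, test RMSE {:.2f}'.format(
        name, rmse(model, X_train, y_train), rmse(model, X_test, y_test)))
```

A high-bias model shows a large error on both sets, while a high-variance model shows a training error close to zero and a noticeably larger test error.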

# 4- Diagnose bias and variance problems


video

# 5- Instantiate the model

<p>In the following set of exercises, you&apos;ll diagnose the bias and variance problems of a regression tree. The regression tree you&apos;ll define in this exercise will be used to predict the mpg consumption of cars from the auto dataset using all available features.</p>
<p>We have already processed the data and loaded the features matrix <code>X</code> and the array <code>y</code> in your workspace. In addition, the <code>DecisionTreeRegressor</code> class was imported from <code>sklearn.tree</code>.</p>

In [9]:
#preprocessing
import pandas as pd
auto=pd.read_csv('datasets/auto.csv')

auto.head()


Unnamed: 0,mpg,displ,hp,weight,accel,origin,size
0,18.0,250.0,88,3139,14.5,US,15.0
1,9.0,304.0,193,4732,18.5,US,20.0
2,36.1,91.0,60,1800,16.4,Asia,10.0
3,18.5,250.0,98,3525,19.0,US,15.0
4,34.3,97.0,78,2188,15.8,Europe,10.0


In [10]:
# Create dummy variables for the origin column

# Create dummy variables: origins
origins = pd.get_dummies(auto)

# Print the columns of origins
print(origins.columns)

# Create dummy variables with drop_first=True: auto_origins
auto_origins = pd.get_dummies(auto, drop_first=True)

# Print the columns of auto_origins
print(auto_origins.columns)

auto_origins.head()

Index(['mpg', 'displ', 'hp', 'weight', 'accel', 'size', 'origin_Asia',
       'origin_Europe', 'origin_US'],
      dtype='object')
Index(['mpg', 'displ', 'hp', 'weight', 'accel', 'size', 'origin_Europe',
       'origin_US'],
      dtype='object')


Unnamed: 0,mpg,displ,hp,weight,accel,size,origin_Europe,origin_US
0,18.0,250.0,88,3139,14.5,15.0,0,1
1,9.0,304.0,193,4732,18.5,20.0,0,1
2,36.1,91.0,60,1800,16.4,10.0,0,0
3,18.5,250.0,98,3525,19.0,15.0,0,1
4,34.3,97.0,78,2188,15.8,10.0,1,0


In [11]:
X=auto_origins.drop('mpg', axis=1)
y=auto_origins['mpg']

----------

<ul>
<li>Import <code>train_test_split</code> from <code>sklearn.model_selection</code>.</li>
<li>Split the data into 70% train and 30% test. </li>
<li>Instantiate a <code>DecisionTreeRegressor</code> with <code>max_depth</code> set to 4 and <code>min_samples_leaf</code> set to 0.26.</li>
</ul>

In [12]:
from sklearn.tree import DecisionTreeRegressor

#Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4,
                           min_samples_leaf=0.26,
                           random_state=SEED)
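
When `min_samples_leaf` is a float, scikit-learn treats it as a fraction of the training samples, so 0.26 means every leaf must hold at least 26% of them. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `dt_demo` are hypothetical names and the data is not the auto dataset.

```python
import math

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] + rng.normal(size=200)

dt_demo = DecisionTreeRegressor(max_depth=4,
                                min_samples_leaf=0.26,
                                random_state=1)
dt_demo.fit(X_demo, y_demo)

# Smallest leaf size allowed by the fractional setting
min_leaf_size = math.ceil(0.26 * len(X_demo))

# Leaves are the nodes without a left child (marked with -1)
tree = dt_demo.tree_
leaf_sizes = tree.n_node_samples[tree.children_left == -1]

print('Minimum allowed leaf size:', min_leaf_size)
print('Actual leaf sizes:', leaf_sizes)
print('Number of leaves:', dt_demo.get_n_leaves())
```

Because each leaf needs at least 26% of the samples, the tree can have at most three leaves regardless of `max_depth`, which makes it a very constrained model.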

# 6- Evaluate the 10-fold CV error


<p>In this exercise, you&apos;ll evaluate the 10-fold CV Root Mean Squared Error (RMSE) achieved by the regression tree <code>dt</code> that you instantiated in the previous exercise.</p>
<p>In addition to <code>dt</code>, the training data including <code>X_train</code> and <code>y_train</code> are available in your workspace. We also imported <code>cross_val_score</code> from <code>sklearn.model_selection</code>.</p>
<p>Note that <code>cross_val_score</code> follows scikit-learn&apos;s convention that higher scores are better, so the MSE is only available as the negated score <code>&apos;neg_mean_squared_error&apos;</code>. Multiply its output by negative one to obtain the MSEs.</p>

<ul>
<li><p>Compute <code>dt</code>&apos;s 10-fold cross-validated MSE by setting the <code>scoring</code> argument to <code>&apos;neg_mean_squared_error&apos;</code>.</p></li>
<li><p>Compute RMSE from the obtained MSE scores.</p></li>
</ul>

In [15]:
#import cross_val_score from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Compute the array containing the 10-fold CV MSEs
MSE_CV_scores = -1 * cross_val_score(dt, X_train, y_train,
                       cv=10,
                       scoring='neg_mean_squared_error',
                       n_jobs=-1)

# Compute the 10-fold CV RMSE as the square root of the mean MSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14
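
To see what `cross_val_score` computes, the folds can be looped over by hand. This is a minimal sketch on synthetic data, assuming the default unshuffled `KFold` splitting that an integer `cv` gives for a regressor; `X_demo`, `y_demo`, `model`, and `manual_mse` are hypothetical names.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

model = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26,
                              random_state=1)

# Manual 10-fold CV: fit a fresh copy on 9 folds, score on the held-out fold
manual_mse = []
for train_idx, val_idx in KFold(n_splits=10).split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    y_val_pred = fold_model.predict(X_demo[val_idx])
    manual_mse.append(mean_squared_error(y_demo[val_idx], y_val_pred))
manual_mse = np.array(manual_mse)

# The same quantity via cross_val_score, with the sign flipped back
sklearn_mse = -cross_val_score(model, X_demo, y_demo, cv=10,
                               scoring='neg_mean_squared_error')

print('Fold MSEs match:', np.allclose(manual_mse, sklearn_mse))
print('CV RMSE: {:.2f}'.format(manual_mse.mean() ** 0.5))
```

Each fold's model never sees its validation fold during fitting, which is why the CV error is a fairer estimate of the generalization error than the training error.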


# 7- Evaluate the training error


<p>You&apos;ll now evaluate the training set RMSE achieved by the regression tree <code>dt</code> that you instantiated in a previous exercise.</p>
<p>In addition to <code>dt</code>, <code>X_train</code> and <code>y_train</code> are available in your workspace.</p>

<ul>
<li>Import <code>mean_squared_error</code> as <code>MSE</code> from <code>sklearn.metrics</code>.</li>
<li>Fit <code>dt</code> to the training set.</li>
<li>Predict <code>dt</code>&apos;s training set labels and assign the result to <code>y_pred_train</code>. </li>
<li>Evaluate <code>dt</code>&apos;s training set RMSE and assign it to <code>RMSE_train</code>.</li>
</ul>

In [17]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15


# 8- High bias or high variance?


<p>In this exercise you&apos;ll diagnose whether the regression tree <code>dt</code> you trained in the previous exercise suffers from a bias or a variance problem. </p>
<p>The training set RMSE (<code>RMSE_train</code>) and the CV RMSE (<code>RMSE_CV</code>) achieved by <code>dt</code> are available in your workspace. In addition, we have loaded a variable called <code>baseline_RMSE</code>, which corresponds to the root mean squared error achieved by the regression tree trained with the <code>displ</code> feature only (it is the RMSE achieved by the regression tree trained in chapter 1, lesson 3). Here <code>baseline_RMSE</code> serves as the baseline RMSE above which a model is considered to be underfitting and below which the model is considered &apos;good enough&apos;.</p>
baseline_RMSE=5.1
<p>Does <code>dt</code> suffer from a high bias or a high variance problem?</p>

- <p><code>dt</code> suffers from high variance because <code>RMSE_CV</code> is far less than <code>RMSE_train</code>.</p>
- <p><code>dt</code> suffers from high bias because <code>RMSE_CV</code> \(\approx\) <code>RMSE_train</code> and both scores are greater than <code>baseline_RMSE</code>.</p> (Answer)
- <p><code>dt</code> is a good fit because <code>RMSE_CV</code> \(\approx\) <code>RMSE_train</code> and both scores are smaller than <code>baseline_RMSE</code>.</p>
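
The reasoning behind these options can be written as a small rule of thumb. The helper below is a hypothetical sketch, not part of the course material: the name `diagnose` and the 5% tolerance `rel_tol` used to decide when the two errors are "approximately equal" are assumptions chosen for illustration.

```python
def diagnose(rmse_train, rmse_cv, baseline_rmse, rel_tol=0.05):
    """Return a rough label for a model from its training and CV RMSE.

    rel_tol is an assumed tolerance: the CV error counts as clearly
    larger than the training error when the gap exceeds rel_tol * rmse_cv.
    """
    # CV error well above training error: the model does not generalize
    if rmse_cv - rmse_train > rel_tol * rmse_cv:
        return 'high variance'
    # Errors roughly equal but both above the baseline: the model is too rigid
    if rmse_train > baseline_rmse and rmse_cv > baseline_rmse:
        return 'high bias'
    return 'good fit'


# Values reported in the exercises above
print(diagnose(rmse_train=5.15, rmse_cv=5.14, baseline_rmse=5.1))
```

With the values from the previous exercises, the two errors are nearly identical and both sit above the baseline, so the helper reports high bias.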

# 9- Ensemble Learning


video

# 10- Define the ensemble


<p>In the following set of exercises, you&apos;ll work with the <a href="https://www.kaggle.com/jeevannagaraj/indian-liver-patient-dataset">Indian Liver Patient Dataset</a> from the UCI Machine Learning Repository.</p>
<p>In this exercise, you&apos;ll instantiate three classifiers to predict whether a patient suffers from a liver disease using all the features present in the dataset. </p>
<p>The classes <code>LogisticRegression</code>, <code>DecisionTreeClassifier</code>, and <code>KNeighborsClassifier</code> under the alias <code>KNN</code> are available in your workspace.</p>

<ul>
<li><p>Instantiate a Logistic Regression classifier and assign it to <code>lr</code>. </p></li>
<li><p>Instantiate a KNN classifier that considers 27 nearest neighbors and assign it to <code>knn</code>. </p></li>
<li><p>Instantiate a Decision Tree Classifier with the parameter <code>min_samples_leaf</code> set to 0.13 and assign it to <code>dt</code>.</p></li>
</ul>

In [23]:
# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Import models, including VotingClassifier meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

# Set seed for reproducibility
SEED=1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list of (name, classifier) tuples
classifiers = [('Logistic Regression', lr),
               ('K Nearest Neighbours', knn),
               ('Classification Tree', dt)]

# 11- Evaluate individual classifiers


<p>In this exercise you&apos;ll evaluate the performance of the models in the list <code>classifiers</code> that we defined in the previous exercise. You&apos;ll do so by fitting each classifier on the training set and evaluating its test set accuracy.</p>
<p>The dataset is already loaded and preprocessed for you (numerical features are standardized) and it is split into 70% train and 30% test. The features matrices <code>X_train</code> and <code>X_test</code>, as well as the arrays of labels <code>y_train</code> and <code>y_test</code> are available in your workspace. In addition, we have loaded the list <code>classifiers</code> from the previous exercise, as well as the function <code>accuracy_score()</code> from <code>sklearn.metrics</code>.</p>

In [24]:
#preprocessing, by me

#Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split



liver=pd.read_csv('datasets/indian_liver_patient/indian_liver_patient_preprocessed.csv')
liver.head()

X=liver.drop('Liver_disease', axis=1)
y=liver['Liver_disease']

SEED=1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=SEED)

---------

<ul>
<li>Iterate over the tuples in <code>classifiers</code>. Use <code>clf_name</code> and <code>clf</code> as the <code>for</code> loop variables:<ul>
<li>Fit <code>clf</code> to the training set.</li>
<li>Predict <code>clf</code>&apos;s test set labels and assign the results to <code>y_pred</code>.</li>
<li>Evaluate the test set accuracy of <code>clf</code> and print the result.</li></ul></li>
</ul>

In [25]:
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate the test set accuracy (true labels first, then predictions)
    accuracy = accuracy_score(y_test, y_pred)

    # Print clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))



Logistic Regression : 0.747
K Nearest Neighbours : 0.724
Classification Tree : 0.730


# 12- Better performance with a Voting Classifier

<p>Finally, you&apos;ll evaluate the performance of a voting classifier that takes the outputs of the models defined in the list <code>classifiers</code> and assigns labels by majority voting.  </p>
<p><code>X_train</code>, <code>X_test</code>, <code>y_train</code>, <code>y_test</code>, the list <code>classifiers</code> defined in a previous exercise, as well as the function <code>accuracy_score</code> from <code>sklearn.metrics</code> are available in your workspace.</p>

<ul>
<li>Import <code>VotingClassifier</code> from <code>sklearn.ensemble</code>.</li>
<li>Instantiate a <code>VotingClassifier</code> by setting the parameter <code>estimators</code> to <code>classifiers</code> and assign it to <code>vc</code>. </li>
<li>Fit <code>vc</code> to the training set.</li>
<li>Evaluate <code>vc</code>&apos;s test set accuracy using the test set predictions <code>y_pred</code>.</li>
</ul>

In [26]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Predict the test set labels
y_pred = vc.predict(X_test)

# Calculate the accuracy score (true labels first, then predictions)
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.753
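
Majority (hard) voting can also be reproduced by hand, which shows what `VotingClassifier` does with its default settings. The sketch below uses synthetic binary data from `make_classification` rather than the liver dataset, and names such as `estimators`, `votes`, and `manual_vote` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with labels 0 and 1 (illustration only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=1)

estimators = [('lr', LogisticRegression(random_state=1)),
              ('knn', KNeighborsClassifier(n_neighbors=27)),
              ('dt', DecisionTreeClassifier(min_samples_leaf=0.13,
                                            random_state=1))]

# Fit each classifier and stack its test predictions: one row per classifier
votes = np.array([clf.fit(X_tr, y_tr).predict(X_te) for _, clf in estimators])

# With three voters and labels 0/1, class 1 wins when it gets at least 2 votes
manual_vote = (votes.sum(axis=0) >= 2).astype(int)

# VotingClassifier fits its own clones of the estimators and votes the same way
vc_demo = VotingClassifier(estimators=estimators).fit(X_tr, y_tr)
vc_pred = vc_demo.predict(X_te)

print('Manual vote matches VotingClassifier:',
      np.array_equal(manual_vote, vc_pred))
```

Using an odd number of classifiers on a binary problem avoids tied votes, which keeps the majority rule unambiguous.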


