Boosting refers to an ensemble method in which several models are trained sequentially, with each model learning from the errors of its predecessors. This chapter introduces two boosting methods: AdaBoost and Gradient Boosting.
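
One way to picture "learning from the errors of its predecessors" is the toy sketch below. It is plain Python rather than scikit-learn, and it assumes a one-feature dataset, one-split "stumps" as weak learners, and a made-up `learning_rate` of 0.5. Each round fits a stump to the errors left by the models before it, so the training error shrinks as rounds are added.

```python
# Toy sketch of sequential error correction; not scikit-learn's implementation.

def fit_stump(xs, targets):
    """Find the one-split rule (threshold, left_mean, right_mean) with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical toy data (one feature, one target)
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.0, 3.1, 6.0, 6.2]

learning_rate = 0.5                # hypothetical shrinkage factor
predictions = [0.0] * len(xs)      # the ensemble starts by predicting 0 everywhere
errors = [mean_squared_error(ys, predictions)]

for _ in range(5):
    # Each new stump is trained on what the ensemble still gets wrong
    residuals = [y - p for y, p in zip(ys, predictions)]
    stump = fit_stump(xs, residuals)
    predictions = [p + learning_rate * predict_stump(stump, x)
                   for p, x in zip(predictions, xs)]
    errors.append(mean_squared_error(ys, predictions))

# Training error after 0, 1, ..., 5 rounds
print([round(e, 3) for e in errors])
```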

# 1- AdaBoost


video

# 2- Define the AdaBoost classifier


<p>In the following exercises you&apos;ll revisit the <a href="https://www.kaggle.com/uciml/indian-liver-patient-records">Indian Liver Patient</a> dataset which was introduced in a previous chapter. Your task is to predict whether a patient suffers from a liver disease using 10 features including Albumin, age and gender. However, this time, you&apos;ll be training an AdaBoost ensemble to perform the classification task. In addition, given that this dataset is imbalanced, you&apos;ll be using the ROC AUC score as a metric instead of accuracy.</p>
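
To see why accuracy can mislead on an imbalanced dataset, consider the sketch below. It assumes a hypothetical 70/30 class split (illustrative numbers, not the actual proportions of the liver data). A model that always predicts the majority class reaches 70% accuracy without ever identifying a minority-class sample.

```python
# Hypothetical imbalanced labels: 70 samples of class 1, 30 of class 0
labels = [1] * 70 + [0] * 30

# A "model" that ignores the features and always predicts the majority class
majority_class = max(set(labels), key=labels.count)
predictions = [majority_class] * len(labels)

# Accuracy looks respectable...
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# ...but no minority-class sample is ever predicted correctly
minority_hits = sum(p == y for p, y in zip(predictions, labels) if y != majority_class)
minority_recall = minority_hits / (len(labels) - labels.count(majority_class))

print('Accuracy: {:.2f}'.format(accuracy))
print('Minority-class recall: {:.2f}'.format(minority_recall))
```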
<p>As a first step, you&apos;ll start by instantiating an AdaBoost classifier.</p>

<ul>
<li><p>Import <code>AdaBoostClassifier</code> from <code>sklearn.ensemble</code>.</p></li>
<li><p>Instantiate a <code>DecisionTreeClassifier</code> with <code>max_depth</code> set to 2.</p></li>
<li><p>Instantiate an <code>AdaBoostClassifier</code> consisting of 180 trees, setting <code>base_estimator</code> to <code>dt</code>.</p></li>
</ul>

In [1]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

# Instantiate dt
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# Instantiate ada
# Note: scikit-learn 1.2 renamed base_estimator to estimator (base_estimator was removed in 1.4)
ada = AdaBoostClassifier(base_estimator=dt, n_estimators=180, random_state=1)

# 3- Train the AdaBoost classifier


<p>Now that you&apos;ve instantiated the AdaBoost classifier <code>ada</code>, it&apos;s time to train it. You will also predict the probabilities of obtaining the positive class in the test set.</p>
<p>Once the classifier <code>ada</code> is trained, call the <code>.predict_proba()</code> method with <code>X_test</code> as its argument and extract these probabilities by selecting all the values in the second column:</p>
<pre><code>ada.predict_proba(X_test)[:,1]
</code></pre>
<p>The Indian Liver dataset is processed for you and split into 80% train and 20% test. Feature matrices <code>X_train</code> and <code>X_test</code>, as well as the arrays of labels <code>y_train</code> and <code>y_test</code> are available in your workspace. In addition, we have also loaded the instantiated model <code>ada</code> from the previous exercise.</p>
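
The columns returned by `.predict_proba()` follow the order of the fitted estimator's `classes_` attribute, so `[:, 1]` selects the probability of the second class listed there. The sketch below illustrates this on a tiny made-up dataset (not the liver data) with a single decision tree. The toy labels 1 and 2 are assumptions chosen only for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one feature, labels 1 and 2
X_toy = [[0], [1], [2], [3]]
y_toy = [1, 1, 2, 2]

clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_toy, y_toy)

# Column j of predict_proba corresponds to clf.classes_[j]
print(clf.classes_)

proba = clf.predict_proba([[0], [3]])

# The second column holds the probability of clf.classes_[1]
second_class_proba = proba[:, 1]
print(second_class_proba)
```

Checking `classes_` before slicing helps confirm which label the second column refers to.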

In [2]:
# First I will import data and do some preprocessing


import pandas as pd

liver=pd.read_csv('datasets/indian_liver_patient/indian_liver_patient.csv')

liver=liver.dropna()

# More preprocessing:
# create dummy variables for the Gender column
liver = pd.get_dummies(liver)

# Print the columns of liver
print(liver.columns)

# Calling get_dummies again with drop_first=True changes nothing here:
# Gender is already encoded, so no categorical columns remain
liver_preprocessed = pd.get_dummies(liver, drop_first=True)

# Print the columns of liver_preprocessed
print(liver_preprocessed.columns)

# Drop the redundant Gender_Female column manually
liver_preprocessed=liver_preprocessed.drop('Gender_Female', axis=1)
liver_preprocessed.head()

Index(['Age', 'Total_Bilirubin', 'Direct_Bilirubin', 'Alkaline_Phosphotase',
       'Alamine_Aminotransferase', 'Aspartate_Aminotransferase',
       'Total_Protiens', 'Albumin', 'Albumin_and_Globulin_Ratio', 'Dataset',
       'Gender_Female', 'Gender_Male'],
      dtype='object')
Index(['Age', 'Total_Bilirubin', 'Direct_Bilirubin', 'Alkaline_Phosphotase',
       'Alamine_Aminotransferase', 'Aspartate_Aminotransferase',
       'Total_Protiens', 'Albumin', 'Albumin_and_Globulin_Ratio', 'Dataset',
       'Gender_Female', 'Gender_Male'],
      dtype='object')


Unnamed: 0,Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset,Gender_Male
0,65,0.7,0.1,187,16,18,6.8,3.3,0.9,1,0
1,62,10.9,5.5,699,64,100,7.5,3.2,0.74,1,1
2,62,7.3,4.1,490,60,68,7.0,3.3,0.89,1,1
3,58,1.0,0.4,182,14,20,6.8,3.4,1.0,1,1
4,72,3.9,2.0,195,27,59,7.3,2.4,0.4,1,1


In [3]:
# Note: the Dataset column is the target (the Liver_disease label)
X=liver_preprocessed.drop('Dataset', axis=1)
y=liver_preprocessed['Dataset']

In [4]:
#Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

SEED=1

# Split the data into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=SEED)

-----------

<ul>
<li><p>Fit <code>ada</code> to the training set.</p></li>
<li><p>Evaluate the probabilities of obtaining the positive class in the test set.</p></li>
</ul>

In [5]:
# Fit ada to the training set
ada.fit(X_train, y_train)

# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:,1]

# 4- Evaluate the AdaBoost classifier


<p>Now that you&apos;re done training <code>ada</code> and predicting the probabilities of obtaining the positive class in the test set, it&apos;s time to evaluate <code>ada</code>&apos;s ROC AUC score. Recall that the ROC AUC score of a binary classifier can be determined using the <code>roc_auc_score()</code> function from <code>sklearn.metrics</code>.</p>
<p>The arrays <code>y_test</code> and <code>y_pred_proba</code> that you computed in the previous exercise are available in your workspace.</p>

<ul>
<li><p>Import <code>roc_auc_score</code> from <code>sklearn.metrics</code>.</p></li>
<li><p>Compute <code>ada</code>&apos;s test set ROC AUC score, assign it to <code>ada_roc_auc</code>, and print it out.</p></li>
</ul>

In [7]:
# Import roc_auc_score
from sklearn.metrics import roc_auc_score

# Evaluate test-set roc_auc_score
ada_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print roc_auc_score
print('ROC AUC score: {:.2f}'.format(ada_roc_auc))

ROC AUC score: 0.64
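
One way to interpret a ROC AUC value is as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below computes that quantity in plain Python on made-up labels and scores. It is an illustration of the definition, not the code path `roc_auc_score()` uses internally.

```python
def pairwise_roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities of the positive class
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2]

# 5 of the 6 positive/negative pairs are ordered correctly
print('ROC AUC: {:.3f}'.format(pairwise_roc_auc(labels, scores)))

# A model that gives every sample the same score cannot rank at all
print('ROC AUC (constant scores): {:.3f}'.format(pairwise_roc_auc(labels, [0.5] * 5)))
```

A score of 0.5 corresponds to ranking no better than chance, and 1.0 to a perfect ranking.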


# 5- Gradient Boosting (GB)


video

# 6- Define the GB regressor


<p>You&apos;ll now revisit the <a href="https://www.kaggle.com/c/bike-sharing-demand">Bike Sharing Demand</a> dataset that was introduced in the previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you&apos;ll be using a gradient boosting regressor.</p>
<p>As a first step, you&apos;ll start by instantiating a gradient boosting regressor which you will train in the next exercise.</p>

<ul>
<li><p>Import <code>GradientBoostingRegressor</code> from <code>sklearn.ensemble</code>.</p></li>
<li><p>Instantiate a gradient boosting regressor by setting the parameters:</p>
<ul>
<li><p><code>max_depth</code> to 4</p></li>
<li><p><code>n_estimators</code> to 200</p></li></ul></li>
</ul>

In [9]:
# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate gb
gb = GradientBoostingRegressor(max_depth=4, 
            n_estimators=200,
            random_state=2)

# 7- Train the GB regressor


<p>You&apos;ll now train the gradient boosting regressor <code>gb</code> that you instantiated in the previous exercise and predict test set labels.</p>
<p>The dataset is split into 80% train and 20% test. Feature matrices <code>X_train</code> and <code>X_test</code>, as well as the arrays <code>y_train</code> and <code>y_test</code> are available in your workspace. In addition, we have also loaded the model instance <code>gb</code> that you defined in the previous exercise.</p>

In [29]:
#importing data and preprocessing 

bikes=pd.read_csv('datasets/bikes.csv')
print(bikes.info())

X=bikes.drop('cnt', axis=1)
y=bikes['cnt']

X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.20,
                                                 random_state=2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1488 entries, 0 to 1487
Data columns (total 13 columns):
hr                        1488 non-null int64
holiday                   1488 non-null int64
workingday                1488 non-null int64
temp                      1488 non-null float64
hum                       1488 non-null float64
windspeed                 1488 non-null float64
cnt                       1488 non-null int64
instant                   1488 non-null int64
mnth                      1488 non-null int64
yr                        1488 non-null int64
Clear to partly cloudy    1488 non-null int64
Light Precipitation       1488 non-null int64
Misty                     1488 non-null int64
dtypes: float64(3), int64(10)
memory usage: 151.2 KB
None


-------------

<ul>
<li>Fit <code>gb</code> to the training set. </li>
<li>Predict the test set labels and assign the result to <code>y_pred</code>.</li>
</ul>

In [30]:
# Fit gb to the training set
gb.fit(X_train, y_train)

# Predict test set labels
y_pred = gb.predict(X_test)

# 8- Evaluate the GB regressor


<p>Now that the test set predictions are available, you can use them to evaluate the test set Root Mean Squared Error (RMSE) of <code>gb</code>. </p>
<p><code>y_test</code> and predictions <code>y_pred</code> are available in your workspace.</p>

<ul>
<li><p>Import <code>mean_squared_error</code> from <code>sklearn.metrics</code> as <code>MSE</code>.</p></li>
<li><p>Compute the test set MSE and assign it to <code>mse_test</code>. </p></li>
<li><p>Compute the test set RMSE and assign it to <code>rmse_test</code>.</p></li>
</ul>

In [31]:
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute MSE
mse_test = MSE(y_test, y_pred)

# Compute RMSE
rmse_test = mse_test**(1/2)

# Print RMSE
print('Test set RMSE of gb: {:.3f}'.format(rmse_test))

Test set RMSE of gb: 49.796
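
RMSE is expressed in the same units as the target, so the value above can be read as a typical error in rental counts. The sketch below spells out the computation in plain Python on made-up numbers and compares it with the mean absolute error. Because errors are squared before averaging, larger misses weigh more heavily in RMSE.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of squared differences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean of absolute differences, shown for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions; the errors are 2, 2, 3 and 5
toy_true = [10, 20, 30, 40]
toy_pred = [12, 18, 33, 35]

print('RMSE: {:.3f}'.format(rmse(toy_true, toy_pred)))
print('MAE:  {:.3f}'.format(mae(toy_true, toy_pred)))
```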


# 9- Stochastic Gradient Boosting (SGB)


video

# 10- Regression with SGB


<p>As in the exercises from the previous lesson, you&apos;ll be working with the <a href="https://www.kaggle.com/c/bike-sharing-demand">Bike Sharing Demand</a> dataset. In the following set of exercises, you&apos;ll solve this bike count regression problem using stochastic gradient boosting.</p>

<ul>
<li><p>Instantiate a Stochastic Gradient Boosting Regressor (SGBR) and set: </p>
<ul>
<li><p><code>max_depth</code> to 4 and <code>n_estimators</code> to 200,</p></li>
<li><p><code>subsample</code> to 0.9, and </p></li>
<li><p><code>max_features</code> to 0.75.</p></li></ul></li>
</ul>
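
As a rough illustration of what these two parameters control, the sketch below draws the random subsets in plain Python. It is a sketch of the idea, not scikit-learn's internals. The sizes are assumptions: 1190 training rows (80% of the 1488 rows reported by `bikes.info()`) and 12 feature columns (13 columns minus `cnt`). With `subsample=0.9`, each tree is fit on a random 90% of the training rows, and with `max_features=0.75`, each split considers a random 75% of the features.

```python
import random

# Assumed sizes: 80% of 1488 rows, and 12 feature columns
n_train_rows, n_features = 1190, 12
subsample, max_features = 0.9, 0.75

rng = random.Random(2)  # seeded so the sketch is reproducible

# Number of rows drawn for each tree and features considered at each split
rows_per_tree = int(subsample * n_train_rows)
features_per_split = max(1, int(max_features * n_features))

# Rows are drawn without replacement for one tree
row_sample = rng.sample(range(n_train_rows), rows_per_tree)

# Candidate features are drawn without replacement for one split
feature_sample = rng.sample(range(n_features), features_per_split)

print('Rows used per tree:', rows_per_tree)
print('Features considered per split:', features_per_split)
```

The randomness means successive trees see slightly different data, which adds diversity to the ensemble.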

In [32]:
# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate sgbr
sgbr = GradientBoostingRegressor(max_depth=4, 
            subsample=0.9,
            max_features=0.75,
            n_estimators=200,                                
            random_state=2)

# 11- Train the SGB regressor


<p>In this exercise, you&apos;ll train the SGBR <code>sgbr</code> instantiated in the previous exercise and predict the test set labels.</p>
<p>The bike sharing demand dataset is already loaded and processed for you; it is split into 80% train and 20% test. The feature matrices <code>X_train</code> and <code>X_test</code>, the arrays of labels <code>y_train</code> and <code>y_test</code>, and the model instance <code>sgbr</code> that you defined in the previous exercise are available in your workspace.</p>

<ul>
<li>Fit <code>sgbr</code> to the training set. </li>
<li>Predict the test set labels and assign the results to <code>y_pred</code>.</li>
</ul>

In [33]:
# Fit sgbr to the training set
sgbr.fit(X_train, y_train)

# Predict test set labels
y_pred = sgbr.predict(X_test)

# 12- Evaluate the SGB regressor


<p>You have prepared the ground to determine the test set RMSE of <code>sgbr</code>, which you will evaluate in this exercise.</p>
<p><code>y_pred</code> and <code>y_test</code> are available in your workspace.</p>

<ul>
<li><p>Import <code>mean_squared_error</code> as <code>MSE</code> from <code>sklearn.metrics</code>.  </p></li>
<li><p>Compute test set MSE and assign the result to <code>mse_test</code>. </p></li>
<li><p>Compute test set RMSE and assign the result to <code>rmse_test</code>.</p></li>
</ul>

In [34]:
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute test set MSE
mse_test = MSE(y_test, y_pred)

# Compute test set RMSE
rmse_test = mse_test**(1/2)

# Print rmse_test
print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test))

Test set RMSE of sgbr: 47.944
