# Bagging and Boosting

This notebook illustrates the advantage that bagging and boosting can bring to a machine learning problem. Both techniques aggregate several estimators to achieve better results than a single one.

We consider a binary classification problem: predicting whether a household earns more or less than 50K. We first fit a linear classifier as a baseline, then compare it with a bagging method (Random Forest) and a boosting method (AdaBoost).

## Data cleaning

In [1]:
import timeit
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier

In [2]:
df_train = pd.read_csv('adult-new.data',header=None)
df_test= pd.read_csv('adult-new.test',header=None)

In [3]:
columns=["age","workclass","fnlwgt","education","education_num","marital_status","occupation","relationship","race",
         "sex","capital_gain","capital_loss","hours_per_week","native_country","target"]
df_train=df_train.rename({i:columns[i] for i in range(len(columns))},axis=1).drop("education_num",axis=1)
df_test=df_test.rename({i:columns[i] for i in range(len(columns))},axis=1).drop("education_num",axis=1)
df_train.target=np.array(df_train.target==' >50K').astype('uint8')
df_test.target=np.array(df_test.target==' >50K').astype('uint8')

The categorical variables are workclass, education, marital_status, occupation, relationship, race, sex and native_country. We encode them as dummy (one-hot) variables.

In [4]:
columns_categorical=["workclass","education","marital_status","occupation","relationship","race",
                     "sex","native_country"]
columns_numeric=["age","capital_gain","capital_loss","hours_per_week"]
df_train_dummies=pd.get_dummies(df_train[columns_categorical]).join(df_train[[i for i in df_train.columns
                                                                              if i not in columns_categorical]])

For the test set, after creating the dummies, we must ensure that it has the same columns as the train set. We also do not want information from the test set to leak into the train set, so the train columns serve as the reference. Therefore, we proceed as follows on the test set:
* Drop all the dummy features that are on the test set but not on the train set
* Create zero columns for features that are on the train set but not on the test set. 

In [5]:
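# One-hot encode the test set the same way as the train set
df_test_dummies=pd.get_dummies(df_test[columns_categorical]).join(df_test[[i for i in df_test.columns
                                                                              if i not in columns_categorical]])
# Keep only the columns that also exist in the train set
df_test_dummies=df_test_dummies[[i for i in df_train_dummies.columns if i in df_test_dummies.columns]]
# Add an all-zero column for every train column that is missing from the test set
for i in set(df_train_dummies)-set(df_test_dummies):
    df_test_dummies[i]=np.zeros(len(df_test_dummies))
# Put the columns in the same order as in the train set
df_test_dummies=df_test_dummies[df_train_dummies.columns]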
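
The drop, zero-fill and reorder steps can also be written as a single `reindex` call. The snippet below is a minimal sketch on toy data; the `color` column and the `toy_*` names are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy data (hypothetical): the test set lacks "blue" and contains an unseen "green".
toy_train = pd.DataFrame({"color": ["red", "blue", "red"]})
toy_test = pd.DataFrame({"color": ["red", "green"]})

# dtype=int keeps the dummies as integers regardless of the pandas version.
train_dummies = pd.get_dummies(toy_train, dtype=int)
test_dummies = pd.get_dummies(toy_test, dtype=int)

# reindex keeps the train columns in the train order, drops "color_green",
# and creates the missing "color_blue" column filled with zeros.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```

Because the train columns are the only reference, nothing about the test set influences the feature space.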

Both data sets now have the same 108 columns, in the same order. This count includes the target and the fnlwgt column, which we use as a sample weight rather than as a feature.

In [6]:
print(list(df_train_dummies.columns)==list(df_test_dummies.columns))
print(df_train_dummies.shape,df_test_dummies.shape)

True
(32561, 108) (16281, 108)


We also scale the numeric variables. We use the min-max scaler, fitted on the train set only, because it preserves the zero entries of the sparse columns capital_gain and capital_loss (their minimum is 0).

In [7]:
scaler=MinMaxScaler()
df_train_dummies[columns_numeric]=scaler.fit_transform(df_train_dummies[columns_numeric])
df_test_dummies[columns_numeric]=scaler.transform(df_test_dummies[columns_numeric])
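
To see why min-max scaling keeps zeros in place while standardization does not, here is a small sketch with plain NumPy. The values in `x` are invented and only mimic a mostly-zero column such as capital_gain.

```python
import numpy as np

# Hypothetical sparse column: mostly zeros, a few large values.
x = np.array([0.0, 0.0, 0.0, 5000.0, 20000.0])

# Min-max scaling: (x - min) / (max - min). The minimum is 0, so zeros stay at 0.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std. Zeros are shifted to a negative value,
# so the column is no longer sparse.
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard)
```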

## Linear classifier

We use a linear support vector classifier (LinearSVC) with the hinge loss and an l2 penalty. There are three parameters to tune: the regularization parameter C, the maximum number of iterations, and whether to weight the classes.
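
As a reminder of what C trades off, the sketch below evaluates an l2-penalized hinge objective of the form 0.5·‖w‖² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)) on a toy data set. It is an illustration under simplifying assumptions (labels in {-1, +1}, no sample weights, unpenalized intercept), not a reproduction of the solver's internals; `svm_objective` and the toy arrays are hypothetical.

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """l2 penalty plus C times the summed hinge loss; y holds labels in {-1, +1}."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Toy data: the second point lies exactly on the decision boundary.
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])
w = np.array([1.0, 0.0])
b = 0.0

value = svm_objective(w, b, X, y, C=4.0)
print(value)
```

A larger C penalizes margin violations more heavily relative to the norm of w, which means weaker regularization.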

To tune these parameters, we run a grid search on the train set. Each combination of parameters is evaluated using cross-validation with 3 folds.
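
To get a sense of the cost of this search, the following sketch enumerates the grid used in the next cell and counts the model fits, assuming 3 folds as stated above. It uses only the standard library and does not train anything.

```python
from itertools import product

# Same grid as in the notebook, written with plain Python lists.
param_grid = {
    "C": list(range(1, 10)),
    "max_iter": [1000, 10000, 100000],
    "class_weight": ["balanced", None],
}

# Every combination of one value per parameter.
combinations = list(product(*param_grid.values()))

n_folds = 3
n_fits = len(combinations) * n_folds  # one fit per combination and per fold

print(len(combinations), n_fits)
```

One additional fit is performed at the end, when the best combination is refitted on the whole train set.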

In [8]:
classifier_1=LinearSVC(loss="hinge",penalty='l2')
param_grid={"C":np.arange(1,10),"max_iter":[1000,10000,100000],"class_weight":["balanced",None]}
grid=GridSearchCV(classifier_1,param_grid)
grid.fit(df_train_dummies.drop(["target","fnlwgt"],axis=1),df_train_dummies.target,sample_weight=df_train_dummies.fnlwgt)

GridSearchCV(cv=None, error_score='raise',
       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([1, 2, 3, 4, 5, 6, 7, 8, 9]), 'max_iter': [1000, 10000, 100000], 'class_weight': ['balanced', None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [9]:
print(grid.best_params_)
model=grid.best_estimator_
print("0/1 train error : ",1-model.score(df_train_dummies.drop(["target","fnlwgt"],axis=1),df_train_dummies.target))
print("0/1 test error : ",1-model.score(df_test_dummies.drop(["target","fnlwgt"],axis=1),df_test_dummies.target))

{'C': 4, 'class_weight': None, 'max_iter': 100000}
0/1 train error :  0.14799914007555048
0/1 test error :  0.14870093974571585


We reach a 0/1 test error of **0.149** with the linear classifier.

Now, let's look at the false positive and false negative rates for females and males.

In [10]:
y_test_predict=model.predict(df_test_dummies.drop(["target","fnlwgt"],axis=1))
y_test_predict_f=y_test_predict[df_test_dummies["sex_ Female"]==1]
y_test_predict_m=y_test_predict[df_test_dummies["sex_ Male"]==1]
y_test_f=df_test_dummies.target[df_test_dummies["sex_ Female"]==1]
y_test_m=df_test_dummies.target[df_test_dummies["sex_ Male"]==1]
print("Rates for the first classifier:")
print("FP rate Female:", sum((y_test_predict_f==1)&(y_test_f==0))/sum(y_test_f==0))
print("FN rate Female:", sum((y_test_predict_f==0)&(y_test_f==1))/sum(y_test_f==1))
print("FP rate Male:", sum((y_test_predict_m==1)&(y_test_m==0))/sum(y_test_m==0))
print("FN rate Male:", sum((y_test_predict_m==0)&(y_test_m==1))/sum(y_test_m==1))

Rates for the first classifier:
FP rate Female: 0.018204385601985933
FN rate Female: 0.5201401050788091
FP rate Male: 0.09252434851276652
FN rate Male: 0.4066503965832825


## Bagging with Random Forest

For our second classifier, we use a random forest. Random Forest relies on bagging: it trains multiple decision trees, each on a bootstrap sample of the training data. It also restricts each split to a random subset of the features, which decorrelates the trees and makes it an improved version of bagging. Finally, all the decision trees vote to make a prediction.
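
The two ingredients of bagging, bootstrap sampling and voting, can be sketched in a few lines of NumPy. This is a simplified illustration: the predictions are hard-coded instead of coming from trained trees, and a plain majority vote is used, whereas scikit-learn's random forest averages the predicted class probabilities of the trees.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models = 8, 5

# Bootstrap: each model draws n_samples row indices with replacement,
# so some rows appear several times and others are left out.
bootstrap_indices = [rng.integers(0, n_samples, size=n_samples)
                     for _ in range(n_models)]

# Hypothetical 0/1 predictions of the 5 models on 4 test points
# (one row per model, one column per test point).
predictions = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
])

# Majority vote: predict 1 when more than half of the models predict 1.
vote_share = predictions.mean(axis=0)
majority = (vote_share > 0.5).astype(int)

print(vote_share, majority)
```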


We tune the following parameters, which control the size of each tree: max_depth and min_samples_leaf. The number of trees (n_estimators) is left at its default value.

In [11]:
classifier_2=RandomForestClassifier()
param_grid={"max_depth":np.arange(5,40),"min_samples_leaf":np.arange(1,6)}
grid=GridSearchCV(classifier_2,param_grid)
grid.fit(df_train_dummies.drop(["target","fnlwgt"],axis=1),df_train_dummies.target,sample_weight=df_train_dummies.fnlwgt)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
       39]), 'min_samples_leaf': array([1, 2, 3, 4, 5])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [12]:
print(grid.best_params_)
model=grid.best_estimator_
print("0/1 train error : ",1-model.score(df_train_dummies.drop(["target","fnlwgt"],axis=1),df_train_dummies.target))
print("0/1 test error : ",1-model.score(df_test_dummies.drop(["target","fnlwgt"],axis=1),df_test_dummies.target))

{'max_depth': 38, 'min_samples_leaf': 3}
0/1 train error :  0.12235496452811645
0/1 test error :  0.13795221423745474


In [13]:
y_test_predict=model.predict(df_test_dummies.drop(["target","fnlwgt"],axis=1))
y_test_predict_f=y_test_predict[df_test_dummies["sex_ Female"]==1]
y_test_predict_m=y_test_predict[df_test_dummies["sex_ Male"]==1]
y_test_f=df_test_dummies.target[df_test_dummies["sex_ Female"]==1]
y_test_m=df_test_dummies.target[df_test_dummies["sex_ Male"]==1]
print("Rates for the second classifier:")
print("FP rate Female:", sum((y_test_predict_f==1)&(y_test_f==0))/sum(y_test_f==0))
print("FN rate Female:", sum((y_test_predict_f==0)&(y_test_f==1))/sum(y_test_f==1))
print("FP rate Male:", sum((y_test_predict_m==1)&(y_test_m==0))/sum(y_test_m==0))
print("FN rate Male:", sum((y_test_predict_m==0)&(y_test_m==1))/sum(y_test_m==1))

Rates for the second classifier:
FP rate Female: 0.018204385601985933
FN rate Female: 0.46935201401050786
FP rate Male: 0.09226112134772309
FN rate Male: 0.3627211714460037


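The 0/1 test error is lower with Random Forest (0.138 versus 0.149): bagging helped! The gain comes mostly from fewer false negatives, for both females and males.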

## Boosting

To illustrate boosting, we use the AdaBoost algorithm. It trains a sequence of weak learners (by default in scikit-learn, decision trees of depth 1), where each new learner puts more weight on the observations misclassified in the previous steps. Finally, the predictions of all the learners are combined in a weighted vote.
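
The reweighting step can be illustrated with one round of the classic discrete AdaBoost update. This is a sketch of the textbook rule with labels in {-1, +1}, not of the SAMME.R variant that the classifier below uses by default; the arrays `y` and `h` are made up.

```python
import numpy as np

# True labels and the predictions of one weak learner (wrong on index 1).
y = np.array([1, 1, -1, -1, 1])
h = np.array([1, -1, -1, -1, 1])

# Start from uniform observation weights.
w = np.full(len(y), 1 / len(y))

# Weighted error of the weak learner and its weight in the final vote.
err = np.sum(w[h != y])
alpha = 0.5 * np.log((1 - err) / err)

# Increase the weight of misclassified points, decrease the others, renormalize.
w_new = w * np.exp(-alpha * y * h)
w_new /= w_new.sum()

print(err, alpha, w_new)
```

After the update, the misclassified observation carries half of the total weight, so the next learner is pushed to classify it correctly.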

In [14]:
classifier_3=AdaBoostClassifier()
param_grid={"n_estimators":[20,50,100,150],"learning_rate":np.arange(1,21)/10}
grid=GridSearchCV(classifier_3,param_grid)
grid.fit(df_train_dummies.drop(["target","fnlwgt"],axis=1),df_train_dummies.target,sample_weight=df_train_dummies.fnlwgt)

GridSearchCV(cv=None, error_score='raise',
       estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [20, 50, 100, 150], 'learning_rate': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3,
       1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [15]:
print(grid.best_params_)
model=grid.best_estimator_
print("0/1 train error : ",1-model.score(df_train_dummies.drop(["target","fnlwgt"],axis=1),df_train_dummies.target))
print("0/1 test error : ",1-model.score(df_test_dummies.drop(["target","fnlwgt"],axis=1),df_test_dummies.target))

{'learning_rate': 1.6, 'n_estimators': 150}
0/1 train error :  0.13150701759773964
0/1 test error :  0.13359130274553155


In [16]:
y_test_predict=model.predict(df_test_dummies.drop(["target","fnlwgt"],axis=1))
y_test_predict_f=y_test_predict[df_test_dummies["sex_ Female"]==1]
y_test_predict_m=y_test_predict[df_test_dummies["sex_ Male"]==1]
y_test_f=df_test_dummies.target[df_test_dummies["sex_ Female"]==1]
y_test_m=df_test_dummies.target[df_test_dummies["sex_ Male"]==1]
print("Rates for the third classifier:")
print("FP rate Female:", sum((y_test_predict_f==1)&(y_test_f==0))/sum(y_test_f==0))
print("FN rate Female:", sum((y_test_predict_f==0)&(y_test_f==1))/sum(y_test_f==1))
print("FP rate Male:", sum((y_test_predict_m==1)&(y_test_m==0))/sum(y_test_m==0))
print("FN rate Male:", sum((y_test_predict_m==0)&(y_test_m==1))/sum(y_test_m==1))

Rates for the third classifier:
FP rate Female: 0.01985932974762102
FN rate Female: 0.46059544658493873
FP rate Male: 0.10028954988154777
FN rate Male: 0.32153752287980475


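In our example, AdaBoost yields the best test error (0.134, versus 0.138 for Random Forest and 0.149 for the linear classifier). Interestingly, it also overfits less than Random Forest: its train and test errors are very close (0.132 and 0.134), whereas Random Forest goes from 0.122 on the train set to 0.138 on the test set.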
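
The comparison can be summarized by computing the gap between test and train error for each model. The numbers below are copied from the outputs above; the dictionary keys are just labels chosen for this sketch.

```python
# (train error, test error) as printed earlier in the notebook.
errors = {
    "LinearSVC": (0.14799914007555048, 0.14870093974571585),
    "RandomForest": (0.12235496452811645, 0.13795221423745474),
    "AdaBoost": (0.13150701759773964, 0.13359130274553155),
}

# Gap between test and train error: a rough indicator of overfitting.
gaps = {name: test - train for name, (train, test) in errors.items()}

for name, (train, test) in errors.items():
    print(f"{name}: test error = {test:.4f}, test - train = {gaps[name]:.4f}")

best_model = min(errors, key=lambda name: errors[name][1])
print("Lowest test error:", best_model)
```

The gap is largest for Random Forest, while the linear classifier and AdaBoost both stay within a fraction of a percentage point of their train error.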