# Introduction to Gradient Boosting

Gradient boosting builds an ensemble of models sequentially, with each new model trained on the errors (residuals) of the models before it. Using the same HMEQ data, we will show that a second model trained on the first model's errors can lower the error on a test set.

## Import Relevant Libraries and Data

In [190]:
import numpy as np
import pandas as pd
from sklearn import tree

from sklearn.model_selection import train_test_split
from pprint import pprint

In [191]:
filename = 'https://github.com/Humboldt-WI/bads/blob/master/data/hmeq_modeling.csv?raw=true'
df = pd.read_csv(filename, header = 0, index_col = 0)

In [192]:
X = df.drop(['BAD'], axis=1) #features: every column except the target
y = df[['BAD']] #target, kept as a one-column DataFrame

X.head() #inspect that variables were correctly separated

index,LOAN,MORTDUE,VALUE,YOJ,CLAGE,NINQ,CLNO,DEBTINC,DEROGzero,REASON_HomeImp,REASON_IsMissing,JOB_Office,JOB_Other,JOB_ProfExe,JOB_Sales,JOB_Self,DELINQcat_1,DELINQcat_1+
0,-1.832283,-1.295882,-1.335526,0.266788,-1.075278,-0.065054,-1.297476,0.137456,True,1,0,0,1,0,0,0,0,0
1,-1.810666,-0.013474,-0.672699,-0.236615,-0.723092,-0.826792,-0.756608,0.137456,True,1,0,0,1,0,0,0,0,1
2,-1.789048,-1.654549,-1.839275,-0.668103,-0.368769,-0.065054,-1.189302,0.137456,True,1,0,0,1,0,0,0,0,0
3,-1.789048,-0.159552,-0.202559,-0.236615,-0.061033,-0.065054,-0.107566,0.137456,True,0,1,0,1,0,0,0,0,0
4,-1.767431,0.791699,0.311107,-0.811933,-1.088528,-0.826792,-0.756608,0.137456,True,1,0,1,0,0,0,0,0,0


In [193]:
y.head()

index,BAD
0,True
1,True
2,True
3,True
4,False


In [194]:
# train test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

In [195]:
print(type(X_train), type(y_train)) # double check that types and dimensions are correct before proceeding

<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>


In [196]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(4768, 18) (4768, 1) (1192, 18) (1192, 1)


## Gradient Boosting from Scratch Example: Training Models

Here we show how a corrective model works on the principle of boosting: training on errors. We train two models. The first makes the regular predictions; the second predicts which observations the first model is likely to get wrong. On the test data, we first predict with the first model and then correct those predictions using the second.


In [197]:
estimators = []

In [198]:
clf = tree.DecisionTreeClassifier(criterion="entropy", min_samples_split=2, max_depth=2) #first classifier

dt = clf.fit(X_train, y_train) #fit the classifier

estimators.append(('first model', dt))

In [199]:
predictions = dt.predict(X_train) #predict using first classifier

In [200]:
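residuals = predictions != y_train.iloc[:,0] #True where the first classifier misclassified a training observation
residuals.mean() #share of misclassified observations, i.e. the training error rate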

0.15121644295302014

In [201]:
residuals.sum() #total errors of this classifier

721

In [202]:
clf2 = tree.DecisionTreeClassifier(criterion="gini") #second classifier with different specs (no depth limit)

#fit the second classifier on the residuals: its target is the binary flag
#"the first classifier was wrong", so it learns to predict the first classifier's errors
dt_residuals = clf2.fit(X_train, residuals)
estimators.append(('second model', dt_residuals))
dt_residuals

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [203]:
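likely_misclassifications = dt_residuals.predict(X_train) #predictions of the second classifier are the likely errors of the first classifier
likely_misclassifications.sum() #the count equals the 721 actual training errors, consistent with an unpruned tree fitting its training data closely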

721

## Gradient Boosting from Scratch Example: Testing Models

Now that we have our two models, we use the test data to see whether the second model reduces the error rate. We first predict y using X_test.

In [204]:
predictions_test = dt.predict(X_test)

In [205]:
residuals_test = predictions_test != y_test.iloc[:,0]
residuals_test.mean()

0.14093959731543623

Next, we predict which test observations the first classifier is likely to have misclassified.

In [206]:
likely_misclassifications_test = dt_residuals.predict(X_test)
likely_misclassifications_test

array([False, False, False, ...,  True, False, False])

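Lastly, we correct the likely misclassifications by classifying them the opposite way. Because the target is binary, flipping a prediction also flips its residual flag, so the next cell flips the residual flags directly.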

In [207]:
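residuals_corrected = pd.Series(residuals_test, copy=True) #work on a copy so residuals_test stays unchanged
#flip the residual flag for every observation the second classifier marks as a likely error
residuals_corrected[likely_misclassifications_test] = ~ residuals_corrected[likely_misclassifications_test]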
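The claim that flipping a residual flag is the same as flipping the underlying binary prediction can be checked on a few made-up values. The following is a minimal sketch; the lists `y_true`, `first_pred`, and `flagged` are invented for illustration and are not taken from the HMEQ data.

```python
# Toy values, invented for illustration only.
y_true     = [True, False, True, True, False, False]
first_pred = [True, True, False, True, False, True]    # first model's predictions
flagged    = [False, True, True, False, True, False]   # second model: "likely wrong"

# Residual flags of the first model: True where it misclassified.
residuals = [p != t for p, t in zip(first_pred, y_true)]

# Route 1 (as in the notebook): flip the residual flag where flagged.
residuals_flipped = [(not r) if f else r for r, f in zip(residuals, flagged)]

# Route 2: flip the prediction where flagged, then recompute the residuals.
corrected_pred = [(not p) if f else p for p, f in zip(first_pred, flagged)]
residuals_recomputed = [p != t for p, t in zip(corrected_pred, y_true)]

# Both routes give the same residual flags for a binary target.
assert residuals_flipped == residuals_recomputed

error_before = sum(residuals) / len(residuals)
error_after = sum(residuals_flipped) / len(residuals_flipped)
print(error_before, error_after)
```

In this toy case two flagged observations were real errors and are fixed, while one flagged observation was originally correct and becomes an error. A false alarm from the second model therefore has a cost.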

In [208]:
residuals_test[likely_misclassifications_test]

index
1554    False
4727     True
4883     True
263     False
4618     True
        ...  
4014     True
1723     True
2990    False
1673     True
2589    False
Name: BAD, Length: 155, dtype: bool

In [209]:
residuals_corrected[likely_misclassifications_test]

index
1554     True
4727    False
4883    False
263      True
4618    False
        ...  
4014    False
1723    False
2990     True
1673    False
2589     True
Name: BAD, Length: 155, dtype: object

In [210]:
residuals_corrected.mean() #test error rate after correction (residuals_corrected already holds error flags, so no further comparison with y_test is needed)

0.12164429530201343
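The two error rates and the size of the flagged subset are enough to work out how many flags were helpful. The sketch below assumes the printed outputs above are accurate and that the 155 rows shown for the flagged subset are all the observations the second model flagged. The variable names are introduced here for illustration.

```python
n_test = 1192                        # rows in X_test, from the printed shapes
error_before = 0.14093959731543623   # residuals_test.mean()
error_after = 0.12164429530201343    # residuals_corrected.mean()
n_flagged = 155                      # length of the flagged subset

errors_before = round(error_before * n_test)
errors_after = round(error_after * n_test)

# Flipping fixes every flagged true error and breaks every flagged correct prediction:
#   fixed + broken = n_flagged
#   fixed - broken = errors_before - errors_after
net_gain = errors_before - errors_after
fixed = (n_flagged + net_gain) // 2
broken = n_flagged - fixed

print(errors_before, errors_after)    # 168 145
print(fixed, broken)                  # 89 66
print(round(fixed / n_flagged, 3))    # 0.574
```

Under these assumptions, about 57% of the flagged observations were real errors. Flipping only lowers the error rate when more than half of the flagged observations are real errors, so the gain here is modest and depends on the quality of the second model.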

## Conclusion

The error rate on the test set fell from about 14.1% to about 12.2% after the corrections were applied, so the process worked here: a second model that focused only on identifying the first model's errors lowered the test error.

Gradient boosting implements this idea with more sophisticated methods: models are added sequentially, and each new model is fit to the residuals (more generally, the negative gradient of the loss) of the current ensemble. AdaBoost and XGBoost are both popular boosting algorithms built on this principle of learning from earlier errors.
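To make the sequential idea concrete, here is a minimal sketch of gradient boosting for regression with squared-error loss, where the negative gradient is simply the residual. It is not the procedure used in this notebook and not how any particular library implements boosting. The helper names (`fit_stump`, `predict_stump`, `gradient_boost`), the toy data, and the choices of 20 rounds and a learning rate of 0.5 are all assumptions made for illustration.

```python
def fit_stump(xs, targets):
    """Least-squares decision stump on one feature: (threshold, left_value, right_value)."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        thr = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((t - lmean) ** 2 for t in left)
               + sum((t - rmean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    return best[1:]


def predict_stump(stump, x):
    thr, lval, rval = stump
    return lval if x <= thr else rval


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)              # start from a constant prediction
    preds = [base] * len(ys)
    stumps = []
    history = [mse(ys, preds)]
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # new model is trained on the residuals
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return base, stumps, history


# Toy data, invented for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]
base, stumps, history = gradient_boost(xs, ys)
print(history[0], history[-1])  # training MSE before and after boosting
```

Unlike the flip-based correction above, each new stump here adds a small adjustment to the running prediction. With a least-squares stump and this learning rate, the training error does not increase from one round to the next.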