## Ensemble Learning and Random Forests

If you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

For example, you can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you obtain the predictions of all the individual trees, then predict the class that gets the most votes. Such an ensemble of Decision Trees is called a Random Forest.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use(['seaborn-bright','dark_background'])

In [3]:
data = pd.read_csv('churn_prediction_simple.csv')
data.head()

Unnamed: 0,customer_id,vintage,age,gender,dependents,occupation,city,customer_nw_category,branch_code,days_since_last_transaction,...,previous_month_end_balance,average_monthly_balance_prevQ,average_monthly_balance_prevQ2,current_month_credit,previous_month_credit,current_month_debit,previous_month_debit,current_month_balance,previous_month_balance,churn
0,1,3135,66,0,0.0,0,187.0,2,755,224.0,...,1458.71,1458.71,1449.07,0.2,0.2,0.2,0.2,1458.71,1458.71,0
1,6,2531,42,0,2.0,0,1494.0,3,388,58.0,...,1401.72,1643.31,1871.12,0.33,714.61,588.62,1538.06,1157.15,1677.16,1
2,7,263,42,1,0.0,0,1096.0,2,1666,60.0,...,16059.34,15211.29,13798.82,0.36,0.36,857.5,286.07,15719.44,15349.75,0
3,8,5922,72,0,0.0,1,1020.0,1,1,98.0,...,7714.19,7859.74,11232.37,0.64,0.64,1299.64,439.26,7076.06,7755.98,0
4,9,1145,46,0,0.0,0,623.0,2,317,172.0,...,8519.53,6511.82,16314.17,0.27,0.27,443.13,5688.44,8563.84,5317.04,0


In [4]:
X = data.drop(columns = ['churn','customer_id'])
Y = data['churn']

In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)

In [6]:
from sklearn.model_selection import train_test_split as tts
x_train, x_test, y_train, y_test = tts(scaled_X, Y, train_size = 0.80,stratify = Y)
x_train.shape, x_test.shape, y_train.shape, y_test.shape 

((17653, 19), (4414, 19), (17653,), (4414,))

## Voting Classifiers

Suppose you have trained a few classifiers, each achieving about 80% accuracy. You may have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and perhaps a few more.

A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier. If every classifier can estimate class probabilities, you can instead average those probabilities and predict the class with the highest average. This is called soft voting, and it is what voting = 'soft' selects in the code below (which is also why the SVC is created with probability = True).

A voting classifier often achieves a higher accuracy than the best classifier in the ensemble, although this is not guaranteed.

Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms.
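
To see why independence matters, here is a minimal sketch, assuming a binary task and classifiers whose errors are fully independent (an idealisation that real classifiers trained on the same data will not meet). The function name majority_vote_accuracy is made up for this illustration. It computes the probability that a strict majority of the classifiers is correct:

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of independent classifiers is correct.

    Assumes a binary task and fully independent errors (an idealisation).
    """
    min_votes = n_classifiers // 2 + 1  # smallest number of votes that forms a majority
    return sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct) ** (n_classifiers - k)
        for k in range(min_votes, n_classifiers + 1)
    )


# Each classifier is right 80% of the time, as in the text above.
for n in (1, 3, 5, 11):
    print(n, round(majority_vote_accuracy(n, 0.8), 4))
```

Under these assumptions the majority vote improves as classifiers are added. Correlated errors shrink that gain, which is why diversity among the classifiers is worth pursuing.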

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [25]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability = True)

In [26]:
voting_clf = VotingClassifier(
        estimators =[('lr',log_clf),('rf',rnd_clf),('svc',svm_clf)],
        voting = 'soft',verbose = True)
voting_clf.fit(x_train, y_train)

[Voting] ....................... (1 of 3) Processing lr, total=   0.2s
[Voting] ....................... (2 of 3) Processing rf, total=  13.6s
[Voting] ...................... (3 of 3) Processing svc, total= 1.7min


VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()),
                             ('svc', SVC(probability=True))],
                 verbose=True, voting='soft')

In [27]:
voting_clf.__class__.__name__

'VotingClassifier'

In [28]:
from sklearn.metrics import accuracy_score
for clf in (log_clf,rnd_clf,svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred ))

LogisticRegression 0.8158133212505664
RandomForestClassifier 0.8606705935659266
SVC 0.8149071137290439
[Voting] ....................... (1 of 3) Processing lr, total=   0.2s
[Voting] ....................... (2 of 3) Processing rf, total=  13.6s
[Voting] ...................... (3 of 3) Processing svc, total= 1.7min
VotingClassifier 0.8316719528772089
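
The ensemble fitted above uses voting = 'soft'. The difference from hard voting can be sketched with plain NumPy on made-up numbers. The probabilities below are hypothetical and are not taken from the churn data:

```python
import numpy as np

# Hypothetical class-1 probabilities: one row per classifier, one column per sample.
proba_class1 = np.array([
    [0.45, 0.80, 0.30, 0.55],
    [0.45, 0.60, 0.20, 0.40],
    [0.90, 0.70, 0.10, 0.45],
])
n_classifiers = proba_class1.shape[0]

# Hard voting: each classifier casts a 0/1 vote, and the majority wins.
hard_votes = (proba_class1 >= 0.5).astype(int)
hard_pred = (hard_votes.sum(axis=0) > n_classifiers / 2).astype(int)

# Soft voting: average the probabilities first, then threshold the mean.
soft_pred = (proba_class1.mean(axis=0) >= 0.5).astype(int)

print("hard:", hard_pred)
print("soft:", soft_pred)
```

The two rules disagree on the first sample: two classifiers lean slightly towards class 0, but the third is confident enough in class 1 to pull the average above 0.5. Soft voting gives more weight to confident votes, which helps only if the probability estimates are reasonably well calibrated.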


## Bagging and Pasting

Another approach to getting a diverse set of classifiers is to use the same training algorithm for every predictor, but to train each one on a different random subset of the training set.

When sampling is done with replacement, this method is called bagging. When sampling is performed without replacement, it is called pasting.
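
The distinction comes down to how row indices are drawn for each predictor. The following is a small sketch with NumPy. The sizes are arbitrary, chosen only so the output is easy to read:

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 10  # hypothetical training-set size
n_samples = 8     # rows drawn for each predictor

# Bagging: sampling with replacement, so the same row may appear more than once.
bagging_idx = rng.choice(n_instances, size=n_samples, replace=True)

# Pasting: sampling without replacement, so every drawn row is distinct.
pasting_idx = rng.choice(n_instances, size=n_samples, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

In scikit-learn's BaggingClassifier, bootstrap = True gives bagging and bootstrap = False gives pasting.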

In [10]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [13]:
bag_clf = BaggingClassifier(
            DecisionTreeClassifier(), n_estimators = 500,
            max_samples = 100, n_jobs = -1, bootstrap = True, verbose = 1)

bag_clf.fit(x_train, y_train)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    5.4s remaining:   16.5s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    5.6s finished


BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1, verbose=1)

In [12]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    0.1s remaining:    0.6s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    0.3s finished


0.8429995468962392

## Out-of-Bag Evaluation

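With bagging, if each predictor draws as many instances as the training set contains, only about 63% of the training instances are sampled on average for each predictor. The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. With a smaller max_samples, as in the code below, each predictor sees far fewer instances and the out-of-bag share is correspondingly larger.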

Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set.
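
The 63% figure follows from a short calculation. When drawing with replacement from m instances, a given instance is missed by a single draw with probability 1 - 1/m, so after n draws it has been sampled at least once with probability 1 - (1 - 1/m)^n. The helper name below is made up for this sketch, and m is the training-set size reported earlier in the notebook:

```python
import math


def expected_sampled_fraction(m: int, n_draws: int) -> float:
    """Expected fraction of m instances hit at least once in n_draws draws with replacement."""
    return 1.0 - (1.0 - 1.0 / m) ** n_draws


m = 17653  # rows in x_train, as reported earlier in the notebook

full_bootstrap = expected_sampled_fraction(m, m)     # as many draws as instances
small_bootstrap = expected_sampled_fraction(m, 100)  # max_samples = 100, as in the code here

print(f"n_draws = m:   {full_bootstrap:.4f} (limit 1 - 1/e = {1 - math.exp(-1):.4f})")
print(f"n_draws = 100: {small_bootstrap:.4f}")
```

With max_samples = 100, each tree sees well under 1% of the training set, so almost every instance is out-of-bag for any given tree.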

In [15]:
bag_clf = BaggingClassifier(
            DecisionTreeClassifier(), n_estimators = 500,
            max_samples = 100, n_jobs = -1, bootstrap = True, verbose = 1,
            oob_score = True)

bag_clf.fit(x_train, y_train)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    1.4s remaining:    4.3s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    1.4s finished


BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1, oob_score=True, verbose=1)

In [16]:
bag_clf.oob_score_

0.8440491701127287

In [17]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    0.2s remaining:    0.7s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    0.4s finished


0.842546443135478

In [18]:
bag_clf.oob_decision_function_

array([[0.63360324, 0.36639676],
       [0.861167  , 0.138833  ],
       [0.93522267, 0.06477733],
       ...,
       [0.83534137, 0.16465863],
       [0.71919192, 0.28080808],
       [0.85110664, 0.14889336]])

## Random Forests

Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees.

In [15]:
from sklearn.ensemble import RandomForestClassifier as rfc
rnd_clf = rfc(n_estimators = 500, max_leaf_nodes = 16, n_jobs = -1 , verbose = 1)
rnd_clf.fit(x_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:    2.8s finished


RandomForestClassifier(max_leaf_nodes=16, n_estimators=500, n_jobs=-1,
                       verbose=1)

In [16]:
# Evaluate the forest on the held-out test set
from sklearn.metrics import accuracy_score
y_pred = rnd_clf.predict(x_test)
accuracy_score(y_test, y_pred)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    0.1s finished


0.8448119619392841
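
To make the relationship between the two classes concrete, here is a sketch on synthetic data (make_classification, not the churn dataset). It compares a RandomForestClassifier with a BaggingClassifier of trees that consider a random subset of features at each split. The two are only roughly equivalent. The variable names and sizes are chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_leaf_nodes=16, random_state=0)

# Bagged trees that look at sqrt(n_features) candidate features per split,
# which is also what the forest does by default for classification.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=50,
    random_state=0,
)

forest_acc = forest.fit(Xd_train, yd_train).score(Xd_test, yd_test)
bagging_acc = bagged_trees.fit(Xd_train, yd_train).score(Xd_test, yd_test)
print(f"forest: {forest_acc:.3f}  bagged trees: {bagging_acc:.3f}")
```

The two scores should be close but not identical, since the random draws differ between the two implementations.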

## Boosting

Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

## AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights, it again makes predictions on the training set, the weights are updated once more, and so on.
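
One round of this reweighting can be sketched in NumPy. This follows the classic discrete AdaBoost update for a binary problem, which is an assumption of the sketch and differs in detail from the SAMME.R variant requested in the code below. The function name is made up, and the edge cases of a weighted error of 0 or of at least 0.5 are not handled:

```python
import numpy as np


def adaboost_reweight(weights, y_true, y_pred, learning_rate=1.0):
    """One discrete-AdaBoost round: boost misclassified instances, then renormalise."""
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(y_true) != np.asarray(y_pred)

    # Weighted error rate of the current predictor.
    error_rate = weights[misclassified].sum() / weights.sum()

    # Predictor weight: large when the predictor is accurate.
    alpha = learning_rate * np.log((1.0 - error_rate) / error_rate)

    # Only misclassified instances are boosted.
    boosted = np.where(misclassified, weights * np.exp(alpha), weights)
    return boosted / boosted.sum(), alpha


y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])  # the third instance is misclassified
weights = np.full(5, 0.2)           # start from uniform weights

new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
print(new_weights, alpha)
```

After this round the single misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.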

In [8]:
from sklearn.ensemble import AdaBoostClassifier

In [11]:
ada_clf = AdaBoostClassifier(
                DecisionTreeClassifier(max_depth = 1), n_estimators = 100,
                algorithm = 'SAMME.R', learning_rate = 0.5)

ada_clf.fit(x_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=100)

In [13]:
from sklearn.metrics import accuracy_score
y_pred = ada_clf.predict(x_test)
accuracy_score(y_test, y_pred)

0.8509288627095605

## Gradient Boosting

Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. But instead of tweaking the instance weights at every iteration as AdaBoost does, this method fits each new predictor to the residual errors made by the previous predictor.

In [32]:
from sklearn.tree import DecisionTreeRegressor
tree_reg1 = DecisionTreeRegressor(max_depth = 2)
tree_reg1.fit(x_train, y_train)

DecisionTreeRegressor(max_depth=2)

In [33]:
y2 = y_train - tree_reg1.predict(x_train)
tree_reg2 = DecisionTreeRegressor(max_depth = 2)
tree_reg2.fit(x_train, y2)

DecisionTreeRegressor(max_depth=2)

In [34]:
y3 = y2 - tree_reg2.predict(x_train)
tree_reg3 = DecisionTreeRegressor(max_depth = 2)
tree_reg3.fit(x_train, y3)

DecisionTreeRegressor(max_depth=2)

Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:

In [35]:
y_pred = sum(tree.predict(x_test) for tree in (tree_reg1, tree_reg2, tree_reg3))

In [37]:
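# y_pred is a continuous sum of regression outputs, so threshold it at 0.5 before scoring
accuracy_score(y_test, (y_pred >= 0.5).astype(int))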

In [38]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth = 2, n_estimators = 3, learning_rate = 1.0)
gbrt.fit(x_train,y_train)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3)

In order to find the optimal number of trees you can use early stopping.

In [39]:
import numpy as np
from sklearn.metrics import mean_squared_error

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(x_train, y_train)

# staged_predict yields predictions after 1, 2, ..., 120 trees, so index i
# corresponds to i + 1 trees; hence the + 1 below.
# Note: ideally a separate validation set is used here rather than the test set.
errors = [mean_squared_error(y_test, y_pred) for y_pred in gbrt.staged_predict(x_test)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(x_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=120)
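
The approach above trains the full ensemble first and then looks back for the best size. As an alternative sketch, GradientBoostingRegressor can stop training by itself through its n_iter_no_change and validation_fraction parameters. The data here is synthetic (make_regression), and the parameter values are illustrative assumptions rather than tuned settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data, so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

gbrt_es = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,         # upper bound on the number of trees
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=5,       # stop once 5 consecutive rounds bring no improvement
    random_state=0,
)
gbrt_es.fit(X_demo, y_demo)

# Number of trees actually fitted before training stopped.
print(gbrt_es.n_estimators_)
```

If the validation score keeps improving, training simply runs to the n_estimators limit. A best size that sits right at the limit, as in the run above, suggests trying a larger limit.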