# Building different tree-based models

I will now show you how to build a random forest and an AdaBoost model for the churn dataset.

Let's start by importing the data.

## Dataset

We use a churn dataset again:

In [1]:
##### added line to ensure plots are showing
%matplotlib inline
#####

import pandas as pd
import numpy as np

df = pd.read_csv('churn_ibm.csv')
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


You already know the pre-processing steps from before:

In [2]:
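# Separate the target from the features; customerID is an identifier, not a feature
y = df['Churn']
X = df.drop(['Churn','customerID'],axis=1)

# One-hot encode every text column, dropping the first level of each variable
for column in X.columns:
    # np.object was removed in NumPy 1.24; the builtin object is equivalent
    if X[column].dtype == object:
        X = pd.concat([X,pd.get_dummies(X[column], prefix=column, drop_first=True)],axis=1).drop([column],axis=1)

# Encode the target as a single indicator column (churn_Yes)
y = pd.get_dummies(y, prefix='churn', drop_first=True)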
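
As a minimal sketch of what the loop above does, the following toy example one-hot encodes all text columns in a single call. The three-row frame and its values are made up for illustration and are not taken from `churn_ibm.csv`. When `pd.get_dummies` is applied to a whole DataFrame, it encodes the text and categorical columns and leaves numeric columns untouched; `drop_first=True` drops one level per variable, as in the loop.

```python
import pandas as pd

# Hypothetical toy data, not taken from churn_ibm.csv
toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'gender': ['Female', 'Male', 'Male'],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
})

# Encode all text columns at once; numeric columns pass through unchanged
toy_encoded = pd.get_dummies(toy, drop_first=True)
print(list(toy_encoded.columns))
# ['tenure', 'gender_Male', 'Contract_One year']
```

The numeric column stays first and the dummy columns follow, which matches the column order that the loop produces.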

## Random forest

Fitting a random forest takes only a few lines:

In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

rf = RandomForestClassifier(n_estimators=20)

# scikit-learn expects the target as a 1-D array, but y is a single-column DataFrame
# For this purpose, its values are flattened using ravel()
rf.fit(X_train,y_train.values.ravel())
prediction = rf.predict(X_test)
prediction_proba = rf.predict_proba(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_proba[:,1]))

Accuracy: 0.781042654028436
AUC: 0.8055200818617732
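
Both `train_test_split` and `RandomForestClassifier` draw random numbers, so the accuracy and AUC above will differ from run to run. The following is a minimal sketch of how the `random_state` parameter can pin both down. It uses synthetic data from `make_classification` as a stand-in for the churn data, and the helper `run_once` is a hypothetical name introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (assumption: any binary problem will do)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

def run_once(seed):
    """Split, fit and score with every source of randomness tied to `seed`."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed)
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

first = run_once(42)
second = run_once(42)
print(first == second)  # True: same seed, same split and same forest
```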


We can print the instance of ```RandomForestClassifier``` to inspect its settings. Recent versions of scikit-learn only show the parameters that differ from their defaults:

In [4]:
print(rf)

RandomForestClassifier(n_estimators=20)


We can also change the parameters. Let's build a larger forest with more trees using the parameter ```n_estimators```:

In [5]:
rf2 = RandomForestClassifier(n_estimators=100)
rf2.fit(X_train,y_train.values.ravel())
prediction = rf2.predict(X_test)
prediction_proba = rf2.predict_proba(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_proba[:,1]))

Accuracy: 0.7876777251184834
AUC: 0.8346795572369342


The results are pretty much the same: accuracy barely changes (0.781 vs. 0.788), and the AUC is only slightly higher (0.806 vs. 0.835).
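
With a single train/test split, it is hard to tell whether such a small difference is real or just noise. One possible way to check, sketched here on synthetic data rather than the churn data, is to compare the two forest sizes with cross-validation and look at the spread of the scores. The variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

cv_auc = {}
for n_trees in (20, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # Five AUC values, one per fold
    scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='roc_auc')
    cv_auc[n_trees] = (scores.mean(), scores.std())
    print(n_trees, 'trees: AUC %.3f +/- %.3f' % cv_auc[n_trees])
```

If the difference between the two means is smaller than the standard deviations, the extra trees probably do not matter much for this dataset.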

## AdaBoost

The implementation of AdaBoost shouldn't pose any difficulties anymore:

In [6]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier()
ada.fit(X_train,y_train.values.ravel())
prediction = ada.predict(X_test)
prediction_proba = ada.predict_proba(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_proba[:,1]))

Accuracy: 0.8042654028436019
AUC: 0.8535659240577274


Apparently, 50 trees are used:

In [16]:
# n_estimators=50 is the default value, i.e. 50 trees
ada.get_params(deep=True)

{'algorithm': 'SAMME.R',
 'base_estimator': None,
 'learning_rate': 1.0,
 'n_estimators': 50,
 'random_state': None}
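
The `None` under `base_estimator` means that the default weak learner is used, which in scikit-learn is a decision tree of depth 1 (a stump). Newer scikit-learn versions call this parameter `estimator`. The sketch below checks the depth through the fitted `estimators_` attribute, using synthetic data as an assumption in place of the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the churn data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

ada_demo = AdaBoostClassifier(random_state=0)
ada_demo.fit(X_demo, y_demo)

# Boosting may stop early, so there can be fewer than n_estimators learners
print('Number of weak learners:', len(ada_demo.estimators_))
print('Depth of the first weak learner:', ada_demo.estimators_[0].get_depth())
```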

Let's increase that as well:

In [21]:
ada2 = AdaBoostClassifier(n_estimators=100)
ada2.fit(X_train,y_train.values.ravel())
prediction = ada2.predict(X_test)
prediction_proba = ada2.predict_proba(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_proba[:,1]))

Accuracy: 0.8033175355450237
AUC: 0.8521737424688244


Neither accuracy nor AUC improves; both are marginally lower with 100 trees. Still, AdaBoost performs better than the random forests we built.

## Grid search

A lot of these efforts can be streamlined using ```GridSearchCV```. Below, you can find code that tests different parameter combinations for random forests using cross-validation:

In [18]:
from sklearn.model_selection import GridSearchCV

# First we create a dictionary containing the parameters we want to test
# We include the values we want to test as lists 
parameters = {'min_samples_leaf':[1,5],'max_depth':[None,10]}

# Then, we bring together a classifier, the parameters, and set the number of folds for the CV
grid_search = GridSearchCV(RandomForestClassifier(n_estimators=20), parameters, cv=10)
grid_search.fit(X_train, y_train.values.ravel())

# The best predictor will be used for the prediction
prediction = grid_search.predict(X_test)
prediction_proba = grid_search.predict_proba(X_test)
    
best_classifier = grid_search.best_estimator_

print('Best classifier:',best_classifier)
print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_proba[:,1]))

Best classifier: RandomForestClassifier(max_depth=10, min_samples_leaf=5, n_estimators=20)
Accuracy: 0.7933649289099526
AUC: 0.8538803418803419


It seems that a minimum of 5 samples per leaf and a maximum depth of 10 are preferable.
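
By default, `GridSearchCV` ranks the candidates by the estimator's own score, which is accuracy for a classifier. The sketch below, run on synthetic data as an assumption, shows how a different metric could be chosen with `scoring` and how the results for every combination can be inspected through `best_params_`, `best_score_` and `cv_results_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {'min_samples_leaf': [1, 5], 'max_depth': [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid, cv=5, scoring='roc_auc')  # rank candidates by AUC instead of accuracy
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best mean CV AUC:', search.best_score_)

# Mean and spread of the CV score for every combination that was tried
for params, mean, std in zip(search.cv_results_['params'],
                             search.cv_results_['mean_test_score'],
                             search.cv_results_['std_test_score']):
    print(params, round(mean, 3), round(std, 3))
```

Looking at all four combinations, rather than only the winner, shows whether the best setting is clearly ahead or whether the candidates are practically tied.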

## Feature importance

For all our models, we can calculate the feature importances. An importance is the (average) reduction in Gini impurity that a feature achieves across all trees. We first list the features together with their column index:

In [28]:
for c, column in enumerate(X_test.columns):
    print(c, column)

0 SeniorCitizen
1 tenure
2 MonthlyCharges
3 TotalCharges
4 gender_Male
5 Partner_Yes
6 Dependents_Yes
7 PhoneService_Yes
8 MultipleLines_No phone service
9 MultipleLines_Yes
10 InternetService_Fiber optic
11 InternetService_No
12 OnlineSecurity_No internet service
13 OnlineSecurity_Yes
14 OnlineBackup_No internet service
15 OnlineBackup_Yes
16 DeviceProtection_No internet service
17 DeviceProtection_Yes
18 TechSupport_No internet service
19 TechSupport_Yes
20 StreamingTV_No internet service
21 StreamingTV_Yes
22 StreamingMovies_No internet service
23 StreamingMovies_Yes
24 Contract_One year
25 Contract_Two year
26 PaperlessBilling_Yes
27 PaymentMethod_Credit card (automatic)
28 PaymentMethod_Electronic check
29 PaymentMethod_Mailed check


In [31]:
sorted(rf.feature_importances_)

[0.0029211587029609716,
 0.003121307143629995,
 0.0035960331956236113,
 0.004233785552117677,
 0.004767867137516242,
 0.005234881759897994,
 0.007303173172980759,
 0.0074637810481313314,
 0.00786160846394388,
 0.012870060478184772,
 0.014590237734661837,
 0.016677972524215674,
 0.01802319909614288,
 0.019035431154214088,
 0.01991357425870868,
 0.02025059628335638,
 0.02077165822623538,
 0.02171323815137691,
 0.021730791215065052,
 0.022323413082756597,
 0.02443811996817083,
 0.024738495625735223,
 0.02524633516088912,
 0.02885042176088932,
 0.031155169706422375,
 0.03272281505654534,
 0.04235304803801822,
 0.1674968642828305,
 0.1736215697498719,
 0.1949733922689063]
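
To make the notion of "reduction in Gini impurity" concrete, here is a small worked example in plain Python. The node contents are made up, and the two helper functions are illustrative assumptions rather than scikit-learn's internals. scikit-learn additionally weights each split by the share of samples that reach the node, sums the reductions per feature, and normalizes the importances so that they sum to 1.

```python
def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))

def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity when `parent` is split into two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical node: 4 churners (1) and 4 non-churners (0)
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]    # e.g. customers with a short tenure
right = [1, 0, 0, 0]   # e.g. customers with a long tenure

print(gini(parent))                            # 0.5
print(impurity_decrease(parent, left, right))  # 0.125
```

A feature that is used for many splits with large decreases, as tenure and the two charge variables are here, ends up with a high importance.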

In [25]:
# Random forest - 5 most important features
for c, column in enumerate(X_test.columns):
    if rf.feature_importances_[c] in sorted(rf.feature_importances_)[-5:]:
        print('Variable',column,rf.feature_importances_[c])

Variable tenure 0.1736215697498719
Variable MonthlyCharges 0.1674968642828305
Variable TotalCharges 0.1949733922689063
Variable InternetService_Fiber optic 0.04235304803801822
Variable PaymentMethod_Electronic check 0.03272281505654534


In [26]:
# AdaBoost - 5 most important features
for c, column in enumerate(X_test.columns):
    if ada.feature_importances_[c] in sorted(ada.feature_importances_)[-5:]:
        print('Variable',column,ada.feature_importances_[c])

Variable tenure 0.18
Variable MonthlyCharges 0.28
Variable TotalCharges 0.24
Variable InternetService_Fiber optic 0.04
Variable Contract_Two year 0.04
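
The two loops above test whether a float is contained in the list of the five largest values. This prints the features in column order and can return more than five rows when importances are tied. A possible alternative, sketched here with made-up importances and feature names, is to wrap the values in a pandas Series and use `nlargest`, which returns the top entries already sorted.

```python
import numpy as np
import pandas as pd

# Hypothetical importances and feature names, for illustration only
importances = np.array([0.05, 0.30, 0.25, 0.10, 0.20, 0.10])
feature_names = ['a', 'b', 'c', 'd', 'e', 'f']

# Label each importance with its feature name and keep the three largest
top3 = pd.Series(importances, index=feature_names).nlargest(3)
print(top3)
```

With the fitted models above, the same pattern would read `pd.Series(rf.feature_importances_, index=X_test.columns).nlargest(5)`.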


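It seems that random forests and AdaBoost have largely the same variables driving their Gini impurity down. The top four coincide: the length of tenure, monthly charges, total charges, and being connected through fiber. They differ only in the fifth variable, which is paying by electronic check for the random forest and having a two-year contract for AdaBoost.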