# Building different tree-based models

This notebook shows how to build random forest and AdaBoost models for the churn dataset.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np

df = pd.read_csv('churn_ibm.csv')
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Pre-processing

In [3]:
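# Target and features; customerID is an identifier, not a predictor
y = df['Churn']
X = df.drop(['Churn', 'customerID'], axis=1)

# One-hot encode every categorical (object) column, dropping the first level
for column in X.columns:
    if X[column].dtype == object:
        dummies = pd.get_dummies(X[column], prefix=column, drop_first=True)
        X = pd.concat([X, dummies], axis=1).drop([column], axis=1)

# Encode the target as a single indicator column (churn_Yes)
y = pd.get_dummies(y, prefix='churn', drop_first=True)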
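
The column-by-column encoding above can usually be written as a single `pd.get_dummies` call on the whole frame. The following is a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the churn data); on a frame with plain string columns it should produce the same columns as the loop.

```python
import pandas as pd

# Toy frame standing in for the churn features (values are made up)
toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
    'PaperlessBilling': ['Yes', 'No', 'Yes'],
})

# get_dummies encodes the string columns and leaves numeric ones untouched;
# drop_first=True drops the first (sorted) level of each encoded column
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Dropping the first level keeps one indicator per binary column, which avoids carrying two perfectly redundant columns for every Yes/No feature.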

## Random forest

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# No random_state is set, so the split and the scores below change from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# n_estimators : number of trees
rf = RandomForestClassifier(n_estimators=20)

# scikit-learn expects a 1-D target array, but y_train is a one-column DataFrame,
# so ravel() flattens its values before fitting
rf.fit(X_train, y_train.values.ravel())
prediction = rf.predict(X_test)
prediction_proba = rf.predict_proba(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:', roc_auc_score(y_test,prediction_proba[:,1]))


Accuracy: 0.7829383886255924
AUC: 0.806619929020932


We can also change the hyperparameters. Let's build a larger forest by raising the number of trees with the n_estimators parameter:

In [16]:
rf2 = RandomForestClassifier(n_estimators=100)
rf2.fit(X_train, y_train.values.ravel())
prediction = rf2.predict(X_test)
prediction_proba = rf2.predict_proba(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:', roc_auc_score(y_test,prediction_proba[:,1]))

Accuracy: 0.7872037914691943
AUC: 0.8167001737161594
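
The two forests above score within about one point of each other on a single random split, which is too little to tell whether 100 trees really beat 20. One way to check is to compare cross-validated scores with fixed seeds. The sketch below assumes synthetic data from `make_classification` as a stand-in for the churn features, and the names `X_demo`, `y_demo` and `mean_auc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

mean_auc = {}
for n_trees in (20, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # 5-fold cross-validated AUC; each fold gives one score
    scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='roc_auc')
    mean_auc[n_trees] = scores.mean()
    print(n_trees, 'trees: mean AUC', scores.mean(), '+/-', scores.std())
```

If the gap between the two means is smaller than the spread across folds, the extra trees are probably not buying much.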


## AdaBoost

In [18]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier()
ada.fit(X_train,y_train.values.ravel())
prediction = ada.predict(X_test)
prediction_proba = ada.predict_proba(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_proba[:,1]))

Accuracy: 0.8052132701421801
AUC: 0.8351143162552629


In [20]:
ada2 = AdaBoostClassifier(n_estimators=100)
ada2.fit(X_train, y_train.values.ravel())
prediction = ada2.predict(X_test)
prediction_proba = ada2.predict_proba(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:', roc_auc_score(y_test, prediction_proba[:,1]))

Accuracy: 0.8004739336492891
AUC: 0.835252095099551
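
The two AdaBoost runs above differ only in the number of boosting rounds (the default is 50) and reach almost the same AUC. Rather than refitting for each value, `staged_predict_proba` lets one fitted model report its predictions after every round. This is a sketch on synthetic data, not the churn set, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

booster = AdaBoostClassifier(n_estimators=100, random_state=0)
booster.fit(X_tr, y_tr)

# One AUC value per boosting round, computed from the staged probabilities
staged_auc = [roc_auc_score(y_te, proba[:, 1])
              for proba in booster.staged_predict_proba(X_te)]
best_round = int(np.argmax(staged_auc)) + 1

print('Rounds fitted:', len(staged_auc))
print('AUC after first round:', staged_auc[0])
print('AUC after last round:', staged_auc[-1])
print('Best round:', best_round, 'AUC:', staged_auc[best_round - 1])
```

Picking the best round on the test set would leak information, so in practice this curve is better computed on a separate validation split.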


## Grid search

In [29]:
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.model_selection import GridSearchCV

# First we create a dictionary containing the parameters we want to test
# We include the values we want to test as lists 
parameters = {'min_samples_leaf':[1,5],'max_depth':[None,10]}

# Then, we bring together a classifier, the parameters, and set the number of folds for the CV
grid_search = GridSearchCV(RandomForestClassifier(n_estimators=20), parameters, cv=10)
grid_search.fit(X_train, y_train.values.ravel())

# With refit=True (the default), the best parameter combination is refit on the
# whole training set, and predict() / predict_proba() use that model
prediction = grid_search.predict(X_test)
prediction_proba = grid_search.predict_proba(X_test)

best_classifier = grid_search.best_estimator_

print('Best classifier:',best_classifier)
print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_proba[:,1]))

Best classifier: RandomForestClassifier(max_depth=10, min_samples_leaf=5, n_estimators=20)
Accuracy: 0.790521327014218
AUC: 0.8334724516941631


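The grid search, which ranks parameter combinations by cross-validated accuracy by default, selected a minimum of 5 samples per leaf and a maximum depth of 10. Constraining the trees therefore seems preferable to growing them fully on this data.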
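
`best_estimator_` only reports the winning combination. To see how close the other combinations came, the fitted search also exposes `cv_results_`. The sketch below assumes synthetic data, 5 folds and AUC scoring, all of which differ from the cell above and are chosen only to keep the example small.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

parameters = {'min_samples_leaf': [1, 5], 'max_depth': [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    parameters,
    cv=5,
    scoring='roc_auc',
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ['param_min_samples_leaf', 'param_max_depth',
           'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
print('Best parameters:', search.best_params_)
```

When the mean scores of several combinations lie within one standard deviation of each other, the choice between them is largely arbitrary.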

## Feature importance

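For all our models, we can calculate the feature importances. Within a single tree, a feature's importance is the reduction in Gini impurity achieved by the splits on that feature; the ensemble reports the average across all trees (weighted by estimator weight for AdaBoost). The cells below list the largest values for each model.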
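
To make the impurity reduction concrete, here is a small worked example for one split. The node contents are hypothetical and the helper `gini` is written for illustration; scikit-learn additionally weights each decrease by the share of samples reaching the node and normalizes the importances to sum to 1.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Hypothetical node with 10 customers, split into two children
parent = ['Yes'] * 4 + ['No'] * 6
left = ['Yes'] * 3 + ['No'] * 1
right = ['Yes'] * 1 + ['No'] * 5

# Impurity after the split is the size-weighted average of the children
weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
decrease = gini(parent) - weighted_children

print('Parent impurity:', round(gini(parent), 4))
print('Impurity after split:', round(weighted_children, 4))
print('Impurity decrease:', round(decrease, 4))
```

A feature that is used for many splits with large decreases, near the top of many trees, ends up with a high importance.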

In [31]:
# Random forest - 5 most important features
for c,column in enumerate(X_test.columns):
    if rf.feature_importances_[c] in sorted(rf.feature_importances_)[-5:]:
        print('Variable', column, rf.feature_importances_[c])

Variable tenure 0.1705105295812672
Variable MonthlyCharges 0.1635992999773788
Variable TotalCharges 0.19036006795736943
Variable InternetService_Fiber optic 0.03542084065850979
Variable PaymentMethod_Electronic check 0.03558628498959499


In [32]:
# AdaBoost - 5 most important features
# (features tied with the fifth-largest value are all printed, so more than 5 rows can appear)
for c, column in enumerate(X_test.columns):
    if ada.feature_importances_[c] in sorted(ada.feature_importances_)[-5:]:
        print('Variable',column,ada.feature_importances_[c])

Variable tenure 0.2
Variable MonthlyCharges 0.18
Variable TotalCharges 0.26
Variable MultipleLines_Yes 0.04
Variable InternetService_Fiber optic 0.04
Variable Contract_Two year 0.04


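Random forests and AdaBoost largely agree on which variables drive the Gini impurity down: tenure, monthly charges, total charges and a fiber optic connection rank among the most important features in both models. They differ in the remaining entries. The random forest also ranks paying by electronic check, while AdaBoost ranks having multiple lines and a two-year contract. AdaBoost lists six variables because three of them are tied at 0.04.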
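
The membership test used in the loops above prints every feature tied with the fifth-largest value and keeps the original column order. A sketch of an alternative that returns exactly k features, sorted by importance, is given below; `top_k_features` is a hypothetical helper, and the sample values are copied from the AdaBoost output above.

```python
import numpy as np

def top_k_features(names, importances, k=5):
    """Return the k (name, importance) pairs with the highest importance."""
    importances = np.asarray(importances)
    # Stable sort on the negated values: descending order, ties keep column order
    order = np.argsort(-importances, kind='stable')[:k]
    return [(names[i], float(importances[i])) for i in order]

# Values copied from the AdaBoost output above
names = ['tenure', 'MonthlyCharges', 'TotalCharges', 'MultipleLines_Yes',
         'InternetService_Fiber optic', 'Contract_Two year']
importances = [0.2, 0.18, 0.26, 0.04, 0.04, 0.04]

top5 = top_k_features(names, importances)
for name, value in top5:
    print('Variable', name, value)
```

With the fitted models from this notebook, the same helper could be called as `top_k_features(list(X_test.columns), rf.feature_importances_)`.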