# Ensemble classifiers using trees on the loans dataset

In this notebook we apply several tree-based ensemble methods to the loans dataset and compare their cross-validated accuracy. This notebook is based on the material from http://scikit-learn.org/stable/modules/ensemble.html

First we load all the required libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, BaggingClassifier)
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline

Next we define the parameters required to run the experiments: the number of estimators used in each ensemble and the random seed that makes the results reproducible.

In [None]:
# Number of estimators used in each ensemble
n_estimators = 30

# Set the random seed to be able to repeat the experiment
random_seed = 1234

Next we load the loans dataset from a CSV file.

In [None]:
df = pd.read_csv("LoansNumerical.csv")

In [None]:
# Name of the target column; every other column is used as an input variable
target = 'safe_loans'
variables = df.columns[df.columns != target]

X = df[variables].values
y = df[target].values
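
Since the models are compared by accuracy, it can help to check the shape of the feature matrix and the share of each class first: the majority-class share is the accuracy a trivial classifier would reach. The sketch below uses a small made-up DataFrame, not the real `LoansNumerical.csv`; the column names `grade` and `income` and the `+1`/`-1` label encoding are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loans data (hypothetical columns and values)
toy = pd.DataFrame({
    'grade': [1, 2, 3, 1, 2, 3],
    'income': [40.0, 55.0, 32.0, 80.0, 61.0, 45.0],
    'safe_loans': [1, 1, -1, 1, -1, 1],
})
toy_target = 'safe_loans'

# Same split into inputs and target as in the cell above
toy_X = toy.drop(columns=toy_target).values
toy_y = toy[toy_target].values

# Share of each class; the largest share is the majority-class baseline
labels, counts = np.unique(toy_y, return_counts=True)
class_share = dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))
majority_baseline = max(class_share.values())

print(toy_X.shape, class_share)
```

Running the same check on `X` and `y` gives a reference point for reading the accuracies reported later.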

Next we define the models to be compared:

- a single decision tree
- bagging
- random forest
- extremely randomized trees
- AdaBoost

In [None]:
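# Every model receives the same random seed so that repeated runs give the same results
models = {'Decision Tree': DecisionTreeClassifier(max_depth=None, random_state=random_seed),
          'Bagging': BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                                       n_estimators=n_estimators, random_state=random_seed),
          'Random Forest': RandomForestClassifier(n_estimators=n_estimators, random_state=random_seed),
          'Extremely Randomized Trees': ExtraTreesClassifier(n_estimators=n_estimators,
                                                             random_state=random_seed),
          'Ada Boost': AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                                          n_estimators=n_estimators, random_state=random_seed)}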
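
To make the difference between a single tree and a bagged ensemble more concrete, here is a minimal hand-rolled sketch of bagging: each tree is trained on a bootstrap sample and the predictions are combined by majority vote. This is a simplification and not how the notebook's models are built; scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides them. The synthetic data from `make_classification` and all names ending in `_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (labels 0/1), used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
rng = np.random.default_rng(1234)

trees = []
for _ in range(30):
    # Bootstrap sample: draw n rows with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X_demo[idx], y_demo[idx]))

# One row of predictions per tree, then a majority vote per sample
votes = np.stack([tree.predict(X_demo) for tree in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

train_accuracy = (bagged_pred == y_demo).mean()
print('training accuracy of the hand-rolled ensemble:', train_accuracy)
```

The accuracy printed here is measured on the training data, so it only shows that the voting works; it is not an estimate of generalization performance.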

For each model, we apply 10-fold stratified cross-validation and compute the average accuracy and its standard deviation.

In [None]:
# The same shuffled, stratified folds are used for every model
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_seed)

scores = {}
for model_name, clf in models.items():
    score = cross_val_score(clf, X, y, cv=cv)
    # Store (mean accuracy, standard deviation) over the 10 folds
    scores[model_name] = (np.mean(score), np.std(score))
    print(f'{model_name:>26s} {100.0 * np.mean(score):3.1f} {100.0 * np.std(score):3.1f}')
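
To show what `cross_val_score` does with a `StratifiedKFold` splitter, the following sketch runs the folds by hand: for each split it clones the estimator, fits it on the training indices, and scores it on the held-out indices. It uses synthetic data from `make_classification` as a stand-in for the loans data, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loans data (an assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=1234)
clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0)

manual_scores = []
for train_idx, test_idx in cv_demo.split(X_demo, y_demo):
    fold_clf = clone(clf_demo)  # fresh, unfitted copy for every fold
    fold_clf.fit(X_demo[train_idx], y_demo[train_idx])
    # For classifiers, score() returns the accuracy on the held-out fold
    manual_scores.append(fold_clf.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The library call with the same splitter should give the same per-fold scores
library_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv_demo)
print(np.allclose(manual_scores, library_scores))
```

Because the splitter has a fixed integer `random_state`, it produces the same folds each time `split` is called, which is what allows the manual loop and the library call to be compared fold by fold.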

Finally, we print a summary with the mean accuracy and standard deviation of every model.

In [None]:
for model_name, (mean_accuracy, std_accuracy) in scores.items():
    print('\t%26s\t%3.1f +/- %3.1f' % (model_name, 100.0 * mean_accuracy, 100.0 * std_accuracy))
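
As an alternative to formatted printing, a dictionary of `(mean, std)` pairs can be turned into a sorted pandas table. The sketch below builds such a dictionary on synthetic data with two models and five folds so that it runs on its own; `demo_models`, `demo_scores`, and the column names are hypothetical, and the numbers it prints say nothing about the loans dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a reduced model set, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
demo_models = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=10, random_state=0),
}

# Same structure as the scores dictionary: name -> (mean, std)
demo_scores = {}
for name, model in demo_models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    demo_scores[name] = (fold_scores.mean(), fold_scores.std())

# One row per model, values in percent, best model first
summary = (pd.DataFrame.from_dict(demo_scores, orient='index',
                                  columns=['mean_accuracy', 'std_accuracy'])
           .mul(100.0)
           .round(1)
           .sort_values('mean_accuracy', ascending=False))
print(summary)
```

Applied to the notebook's own `scores` dictionary, the same `from_dict` call would list all five models ordered by mean accuracy.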