# Activity 2: Solution

In this case study, we predict whether someone responds to a marketing campaign by subscribing to a term deposit. To this end, we train various tree-based models. First, we load the dataset and the libraries needed for data manipulation:

In [1]:
import pandas as pd
import numpy as np

np.random.seed(10)
df = pd.read_csv('bank-full.csv')
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,subscribed
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


We can see a variety of variables, including whether someone subscribed (yes/no) and mostly demographic information. There are also variables describing previous interactions with the customer: pdays (days since the last contact, with -1 meaning no earlier contact), previous (number of earlier contacts), and poutcome (outcome of the previous campaign).

## Data pre-processing

We are going to separate the dependent variable, subscribed, first:

In [2]:
y = df['subscribed']
X = df.drop(['subscribed'],axis=1)

In [3]:
y.value_counts()

no     39922
yes     5289
Name: subscribed, dtype: int64
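
To put the skew in numbers, the counts above can be turned into class shares and into the accuracy of a trivial model that always predicts "no". This is a minimal sketch that uses only the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {"no": 39922, "yes": 5289}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(counts.values()) / total

print(f"total rows: {total}")
for label, share in shares.items():
    print(f"share of '{label}': {share:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Accuracy figures reported later are best read against this baseline of roughly 0.88.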

A sizeable number of people have subscribed, but the classes are clearly skewed: only 5,289 of the 45,211 customers (about 12%) said yes. Now, let's look at the independent variables. Since the decision tree implementations need numeric input data, we might have to convert some of them:

In [4]:
X.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
dtype: object

It seems there are a few categorical variables that need to be converted:

In [5]:
for column in X.columns:
    # np.object was removed from recent NumPy releases; the builtin object works everywhere
    if X[column].dtype == object:
        print('Converting ', column)
        X = pd.concat([X,pd.get_dummies(X[column], prefix=column, drop_first=True)],axis=1).drop([column],axis=1)

Converting  job
Converting  marital
Converting  education
Converting  default
Converting  housing
Converting  loan
Converting  contact
Converting  month
Converting  poutcome
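
As an aside, `pd.get_dummies` can also be applied to a whole DataFrame with the columns to encode listed explicitly. The sketch below uses a small made-up frame (the `toy` data is hypothetical, not taken from bank-full.csv) to show that `drop_first=True` leaves k-1 indicator columns for a variable with k categories.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the encoding
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "divorced"],
    "housing": ["yes", "yes", "no"],
})

# Encode the two text columns; the alphabetically first category of
# each variable is dropped and acts as the reference level
encoded = pd.get_dummies(
    toy, columns=["marital", "housing"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
print(encoded)
```

Here `marital` has three categories and yields two columns, while `housing` has two categories and yields one.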


Finally, our data looks like this:

In [6]:
X.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,58,2143,5,261,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
1,44,29,5,151,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
2,33,2,5,76,1,-1,0,0,1,0,...,0,0,0,1,0,0,0,0,0,1
3,47,1506,5,92,1,-1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
4,33,1,5,198,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


We have a lot more columns now. We also convert the dependent variable to a 0/1 indicator so that the AUC can be calculated:

In [7]:
y = pd.get_dummies(y, prefix='subscribed', drop_first=True)
y.head()

Unnamed: 0,subscribed_yes
0,0
1,0
2,0
3,0
4,0
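
An equivalent way to obtain the 0/1 target, sketched here on a few made-up labels, is to compare against 'yes' directly. The result stays a one-dimensional Series, so no reshape would be needed when fitting. The `labels` values are hypothetical.

```python
import pandas as pd

# Hypothetical labels standing in for df['subscribed']
labels = pd.Series(["no", "yes", "no", "no"], name="subscribed")

# True/False comparison cast to 1/0; the result is a 1-D Series
y_binary = (labels == "yes").astype(int)

print(y_binary.tolist())
print(y_binary.ndim)
```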


## Modelling

Now it's up to you to create the various models. You have to create a cross-validated (10-fold) grid search that tests at least two parameters for each of decision trees, random forests, and AdaBoost.

In [8]:
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import AdaBoostClassifier as AdaBoost

from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, f1_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# The classifier input below is a pair of the name and the classifier instance
classifiers = [('DecisionTree',DT()),('RandomForest',RF()),('AdaBoost',AdaBoost())]

for classifier in classifiers:
    
    print('Treating ', classifier[0])
    
    # We add the appropriate parameters depending on the classifier
    if classifier[0] == 'DecisionTree':
        parameters = {'min_samples_leaf':[1,5,10],'max_depth':[None,10,20]}
    elif classifier[0] == 'RandomForest':
        parameters = {'min_samples_leaf':[5,10],'n_estimators':[10,30,50,100,200]}
    elif classifier[0] == 'AdaBoost':
        parameters = {'learning_rate':[0.5,1],'n_estimators':[10,30,50,100,200]}
    
    # We perform 10-fold cross-validation
    grid_search = GridSearchCV(classifier[1], parameters, cv=10)
    
    # We have to transform our dependent variable in order to fit
    grid_search.fit(X_train, y_train.values.reshape(-1,))
    
    # Finding the best estimator's parameters
    print(grid_search.best_estimator_)
    
    # Finding the accuracy/auc on the training set
    pred = grid_search.predict(X_train)
    
    accuracy = accuracy_score(y_train, pred)
    auc = roc_auc_score(y_train, pred)
    
    print('Accuracy: ', accuracy)
    print('AUC: ', auc)
    
    print('\nTest:')
    # Using the best estimator of the grid search to get our test metrics
    pred = grid_search.predict(X_test)
    
    accuracy = accuracy_score(y_test, pred)
    auc = roc_auc_score(y_test, pred)
    print('Accuracy: ', accuracy)
    print('AUC: ', auc)

Treating  DecisionTree
DecisionTreeClassifier(max_depth=10)
Accuracy:  0.9253325749676115
AUC:  0.7492744919292855

Test:
Accuracy:  0.903347095252138
AUC:  0.6974030635516645
Treating  RandomForest
RandomForestClassifier(min_samples_leaf=5, n_estimators=50)
Accuracy:  0.9352229279236578
AUC:  0.7508170450778764

Test:
Accuracy:  0.9065909761132409
AUC:  0.6612536158108365
Treating  AdaBoost
AdaBoostClassifier(learning_rate=1, n_estimators=200)
Accuracy:  0.9027079975985085
AUC:  0.6786317811649906

Test:
Accuracy:  0.9036419935122383
AUC:  0.679137271330103
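
One point worth keeping in mind when reading these AUC values: the loop passes hard class predictions from `predict` to `roc_auc_score`. With 0/1 predictions the ROC curve has a single operating point, and the resulting AUC equals the mean of sensitivity and specificity (the balanced accuracy). Passing scores instead ranks all observations and typically gives a different value. A small sketch with made-up numbers (the `y_true` and `scores` arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.9])

# Hard predictions at a 0.5 threshold, similar to what predict() returns
hard = (scores >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard)
auc_from_scores = roc_auc_score(y_true, scores)
balanced_acc = balanced_accuracy_score(y_true, hard)

print("AUC from hard predictions:", auc_from_labels)
print("AUC from scores:", auc_from_scores)
print("Balanced accuracy:", balanced_acc)
```

In the loop above, a score-based AUC could be obtained, under the same assumptions, from `grid_search.predict_proba(X_test)[:, 1]` rather than `grid_search.predict(X_test)`.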


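Next, we repeat the grid search with SMOTE oversampling to address the skew. Because SMOTE is a step in the pipeline, it is applied only when fitting, so the test data are never oversampled: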

In [9]:
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import AdaBoostClassifier as AdaBoost
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, f1_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# The classifier input below is a pair of the name and the classifier instance
classifiers = [('DecisionTree',DT()),('RandomForest',RF()),('AdaBoost',AdaBoost())]

for classifier in classifiers:
    
    print('\nTreating ', classifier[0])
    
    # We create a pipeline of performing SMOTE, then the classification
    pipe = Pipeline([('oversample',SMOTE()),('classify',classifier[1])])
       
    
    if classifier[0] == 'DecisionTree':
        # Note how we have to add classify__ so the GridSearchCV instance
        # knows to which step in the pipeline to send the parameter values
        parameters = {'classify__min_samples_leaf':[1,5,10],'classify__max_depth':[None,10,20]}
    elif classifier[0] == 'RandomForest':
        parameters = {'classify__min_samples_leaf':[5,10],'classify__n_estimators':[10,30,50,100,200]}
    elif classifier[0] == 'AdaBoost':
        parameters = {'classify__learning_rate':[0.5,1],'classify__n_estimators':[10,30,50,100,200]}
    
    grid_search = GridSearchCV(pipe, parameters, cv=10)
    grid_search.fit(X_train, y_train.values.reshape(-1,))
    
    pred = grid_search.predict(X_train)
    
    accuracy = accuracy_score(y_train, pred)
    auc = roc_auc_score(y_train, pred)
    
    print('Accuracy: ', accuracy)
    print('AUC: ', auc)
    
    print('\nTest:')
    pred = grid_search.predict(X_test)
    
    # The estimator is the second element in the pipe
    # Which has 2 elements as well: the name, and the classifier instance
    estimator = grid_search.best_estimator_.steps[1][1]
    
    # We loop over the features and print those whose importance is in the top 5
    for c, column in enumerate(X_test.columns):
        if estimator.feature_importances_[c] in sorted(estimator.feature_importances_)[-5:]:
            print(column, ' is important ', estimator.feature_importances_[c])
    
    accuracy = accuracy_score(y_test, pred)
    auc = roc_auc_score(y_test, pred)
    print('Accuracy: ', accuracy)
    print('AUC: ', auc)


Treating  DecisionTree
Accuracy:  0.9214775492147755
AUC:  0.816350265167552

Test:
duration  is important  0.37537720432302935
pdays  is important  0.061650100177118056
housing_yes  is important  0.09771086056575834
contact_unknown  is important  0.11783873062689065
month_jul  is important  0.04541876964169708
Accuracy:  0.8772485992332645
AUC:  0.7308212551221733

Treating  RandomForest
Accuracy:  0.9360444907890163
AUC:  0.840402400924761

Test:
duration  is important  0.2622017453195258
marital_married  is important  0.04216166426432098
housing_yes  is important  0.08970024641716605
contact_unknown  is important  0.0656855489057276
month_may  is important  0.05071529999229664
Accuracy:  0.8985549985255087
AUC:  0.7415134230451427

Treating  AdaBoost
Accuracy:  0.8851075931367902
AUC:  0.7181442067533181

Test:
age  is important  0.055
balance  is important  0.105
day  is important  0.085
duration  is important  0.175
pdays  is important  0.095
Accuracy:  0.8847685048658213
AUC:  0

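With oversampling, the test AUC is higher than before for the decision tree (0.731 versus 0.697) and the random forest (0.742 versus 0.661), while test accuracy is slightly lower for all three models. The two runs use different random train/test splits, so the comparison is indicative rather than exact.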
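
The top-5 selection in the loop above prints features in column order and relies on membership in a sorted list, so tied importances could print more than five features. A ranked alternative, sketched here with made-up names and importances (both hypothetical), sorts once with `numpy.argsort`:

```python
import numpy as np

# Hypothetical feature names and importances, standing in for
# X_test.columns and estimator.feature_importances_
names = np.array(
    ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
)
importances = np.array([0.06, 0.10, 0.08, 0.40, 0.02, 0.30, 0.04])

# Indices of the five largest importances, from most to least important
top = np.argsort(importances)[::-1][:5]

for name, value in zip(names[top], importances[top]):
    print(f"{name}: {value:.2f}")
```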