### Model Prep

 
#### Full Pipeline and Model Deployment Summary:
* We trained 7 models with accuracies ranging from roughly 85% to 90%; the Random Forest scored highest.
  - Decision Tree
  - Random Forest
  - SVM
  - XGBoost
  - Naive Bayes
  - KNN
  - Logistic Regression
* Features we used for the models:
  - age
  - balance
  - day 
  - duration 
  - pdays
  - housing
  - month
  - poutcome
  - contact
  - marital
  - default
  - job
  - education

Based on the random forest, duration is the strongest predictor of whether a customer will subscribe, with a feature importance score of ~34%. One plausible explanation is that customers who are willing to stay on the phone longer with a bank representative are more likely to be persuaded to subscribe to a bank deposit.

In [1]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import joblib
import seaborn as sns
plt.style.use('seaborn-v0_8')  # change the default style (this style was named 'seaborn' before matplotlib 3.6)

In [2]:
# read csv data into pandas dataframe
df = pd.read_csv('projectdataset-1.csv')

In [3]:
# Prepare the data by separating the features (X) from the target (y)
# X is everything except the target column 'Class'

# axis=1 below means dropping a column; axis=0 would drop rows
X = df.drop(['Class'], axis=1)
y = df['Class']
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
dtypes: int64(7), object(9)
memory usage: 5.5+ MB


In [4]:
# Split the data into a training set and a test set.
# Any number for the random_state is fine, see 42: https://en.wikipedia.org/wiki/42_(number)
# We use 20% (test_size=0.2) of the data set as the test set;
# stratify=y keeps the class proportions the same in both splits.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)

##added stratify option above



(36168, 16)
(9043, 16)
(36168,)
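To illustrate what `stratify=y` does, here is a minimal sketch on made-up labels. The 88/12 class mix and the names `X_toy` and `y_toy` are assumptions for illustration only, not values taken from the bank dataset. It compares the minority-class share in the full data, the training split, and the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 880 rows of class 1 and 120 rows of class 2
y_toy = np.array([1] * 880 + [2] * 120)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

def minority_share(labels):
    """Fraction of rows that belong to class 2."""
    return float(np.mean(labels == 2))

share_full = minority_share(y_toy)
share_train = minority_share(y_tr)
share_test = minority_share(y_te)
print(share_full, share_train, share_test)
```

Without `stratify`, the shares in the two splits would only match the full data approximately, which matters more when one class is rare.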


### Models

In [5]:
# We will train our decision tree classifier with the following features:

num_features = ['age', 'balance', 'day', 'duration', 'pdays' ]
cat_features = ['housing','month','poutcome', 'contact']

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Create the preprocessing pipeline for numerical features
# There are two steps in this pipeline
# Pipeline(steps=[(name1, transform1), (name2, transform2), ...]) 
# NOTE the step names can be arbitrary

# Step 1, as discussed before, fills in missing values (if any) using the mean
# Step 2 is feature scaling via standardization: each feature is rescaled to zero mean and unit variance
# (see standardization: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
num_pipeline = Pipeline(
    steps=[
        ('num_imputer', SimpleImputer()),  # we will tune different strategies later
        ('scaler', StandardScaler()),
        ]
)

# Create the preprocessing pipelines for the categorical features
# There are two steps in this pipeline:
# Step 1: filling the missing values if any using the most frequent value
# Step 2: one hot encoding

cat_pipeline = Pipeline(
    steps=[
        ('cat_imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown = 'ignore')),
    ]
)

# Assign features to the pipelines and combine the two pipelines to form the preprocessor
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num_pipeline', num_pipeline, num_features),
        ('cat_pipeline', cat_pipeline, cat_features),
    ]
)
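As a rough check of what a preprocessor like this produces, the sketch below fits the same kind of `ColumnTransformer` on a tiny made-up frame. The frame `toy`, the step names, and the unseen category `'email'` are assumptions for illustration. It shows the output width (scaled numeric columns plus one column per category), the zero-centered numeric columns, and how `handle_unknown='ignore'` encodes a category that was not present during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns (one value missing) and one categorical
toy = pd.DataFrame({
    'age': [30.0, 40.0, np.nan, 50.0],
    'balance': [100.0, 200.0, 300.0, 400.0],
    'contact': ['cellular', 'telephone', 'cellular', 'unknown'],
})

toy_pre = ColumnTransformer(transformers=[
    ('num', Pipeline([('imp', SimpleImputer()), ('sc', StandardScaler())]),
     ['age', 'balance']),
    ('cat', Pipeline([('imp', SimpleImputer(strategy='most_frequent')),
                      ('oh', OneHotEncoder(handle_unknown='ignore'))]),
     ['contact']),
])

def to_dense(matrix):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, 'toarray') else np.asarray(matrix)

out = to_dense(toy_pre.fit_transform(toy))
print(out.shape)                # numeric columns + one column per category seen
print(out[:, :2].mean(axis=0))  # standardized columns are centered on zero

# A category never seen during fit is encoded as all zeros instead of raising
new_row = pd.DataFrame({'age': [35.0], 'balance': [150.0], 'contact': ['email']})
new_out = to_dense(toy_pre.transform(new_row))
print(new_out[0, 2:])
```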

In [7]:
# Specify the model to use, which is DecisionTreeClassifier
# Make a full pipeline by combining preprocessor and the model
from sklearn.tree import DecisionTreeClassifier

pipeline_dt = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('clf_dt', DecisionTreeClassifier()),
    ]
)

In [8]:
# Use GridSearchCV with K-fold cross-validation (K=10) to fine-tune the model
# Accuracy is the scoring metric (pass return_train_score=True to also record training scores)
from sklearn.model_selection import GridSearchCV

# Set up the values of the hyperparameters you want to evaluate
# Each parameter name is the "full path" of step names joined by two underscores, ending with the parameter itself

# We try 2 different imputer strategies
# and 2x5 different decision tree settings (2 criteria x 5 depths),
# so in total we try 2x2x5 = 20 different combinations

param_grid_dt = [
    {
        'preprocessor__num_pipeline__num_imputer__strategy': ['mean', 'median'],
        'clf_dt__criterion': ['gini', 'entropy'], 
        'clf_dt__max_depth': [3, 4, 5, 6, 7],
   
    }
]

# set up the grid search 
grid_search_dt = GridSearchCV(pipeline_dt, param_grid_dt, cv=10, scoring='accuracy')
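
The double-underscore names can be checked before running a search. The sketch below uses a reduced stand-in pipeline (the step names `imputer` and `clf` are hypothetical, chosen to keep the example short) to list the tunable names with `get_params()` and to count the candidates with `ParameterGrid`, which performs the same expansion that `GridSearchCV` applies to a parameter grid.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# A reduced stand-in for the full pipeline (hypothetical step names)
toy_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('clf', DecisionTreeClassifier()),
])

# Every tunable name, written as <step>__<parameter>
tunable = sorted(toy_pipeline.get_params().keys())
print([name for name in tunable if '__' in name])

toy_grid = {
    'imputer__strategy': ['mean', 'median'],
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [3, 4, 5, 6, 7],
}
n_candidates = len(ParameterGrid(toy_grid))
print(n_candidates)  # number of combinations the search would evaluate
```

With 10-fold cross-validation, each candidate is fitted 10 times, so the number of fits is ten times the candidate count (plus one final refit on the full training set).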

In [9]:
# train the model using the full pipeline
grid_search_dt.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num_pipeline',
                                                                         Pipeline(steps=[('num_imputer',
                                                                                          SimpleImputer()),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                                                                         ['age',
                                                                          'balance',
                                                                          'day',
                                                                          'duration',
                                                                          'pda

In [10]:
# check the best performing parameter combination
grid_search_dt.best_params_

{'clf_dt__criterion': 'gini',
 'clf_dt__max_depth': 7,
 'preprocessor__num_pipeline__num_imputer__strategy': 'mean'}

In [11]:
# built-in CV results keys
sorted(grid_search_dt.cv_results_.keys())

['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'param_clf_dt__criterion',
 'param_clf_dt__max_depth',
 'param_preprocessor__num_pipeline__num_imputer__strategy',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'split5_test_score',
 'split6_test_score',
 'split7_test_score',
 'split8_test_score',
 'split9_test_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score']

In [12]:
# mean cross-validation score for each of the 20 decision tree candidates
grid_search_dt.cv_results_['mean_test_score']

array([0.90054734, 0.90054734, 0.90035379, 0.90035379, 0.90226151,
       0.90223387, 0.90096224, 0.90090695, 0.90239997, 0.90239996,
       0.89935853, 0.89935853, 0.89982857, 0.89982857, 0.90118329,
       0.90121094, 0.90110038, 0.90110038, 0.90137699, 0.90132171])
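
A bare array of scores is hard to match to parameter settings. One option is to load `cv_results_` into a DataFrame and sort by rank. The sketch below does this on synthetic stand-in data with a small grid (both are assumptions for illustration; it does not use the bank dataset or the grid above).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [2, 3, 4], 'criterion': ['gini', 'entropy']},
    cv=5,
    scoring='accuracy',
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = results[columns].sort_values('rank_test_score').reset_index(drop=True)
print(results)
```

Looking at `std_test_score` next to `mean_test_score` also helps judge whether small differences between candidates are larger than the fold-to-fold variation.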

In [13]:
# best decision tree mean cross-validation score
grid_search_dt.best_score_

0.9023999714964486

In [14]:
# best cross-validation score
print('best dt score is: ', grid_search_dt.best_score_)


best dt score is:  0.9023999714964486


In [15]:
# select the best model
# the best parameters are shown; note that SimpleImputer() with no arguments means the mean strategy is used
clf_best = grid_search_dt.best_estimator_
clf_best

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num_pipeline',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'balance', 'day',
                                                   'duration', 'pdays']),
                                                 ('cat_pipeline',
                                                  Pipeline(steps=[('cat_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(hand

In [16]:
# final test on the testing set
# To predict on new data, simply call the predict method:
# the full pipeline steps are applied to the testing set, followed by the prediction
y_pred = clf_best.predict(X_test)

In [17]:
clf_best.named_steps

{'preprocessor': ColumnTransformer(transformers=[('num_pipeline',
                                  Pipeline(steps=[('num_imputer',
                                                   SimpleImputer()),
                                                  ('scaler', StandardScaler())]),
                                  ['age', 'balance', 'day', 'duration',
                                   'pdays']),
                                 ('cat_pipeline',
                                  Pipeline(steps=[('cat_imputer',
                                                   SimpleImputer(strategy='most_frequent')),
                                                  ('onehot',
                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                  ['housing', 'month', 'poutcome', 'contact'])]),
 'clf_dt': DecisionTreeClassifier(max_depth=7)}

In [18]:
clf_best.named_steps['preprocessor']

ColumnTransformer(transformers=[('num_pipeline',
                                 Pipeline(steps=[('num_imputer',
                                                  SimpleImputer()),
                                                 ('scaler', StandardScaler())]),
                                 ['age', 'balance', 'day', 'duration',
                                  'pdays']),
                                ('cat_pipeline',
                                 Pipeline(steps=[('cat_imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['housing', 'month', 'poutcome', 'contact'])])

In [19]:
# get_feature_names_out requires scikit-learn >= 1.0 (older versions used get_feature_names)
onehot_columns = list(clf_best.named_steps['preprocessor'].named_transformers_['cat_pipeline'].named_steps['onehot'].get_feature_names_out(cat_features))



In [20]:
i = clf_best.named_steps["clf_dt"].feature_importances_
i

array([0.04082185, 0.01429301, 0.02205914, 0.49049963, 0.04750922,
       0.00112339, 0.04394799, 0.02161304, 0.0008162 , 0.        ,
       0.00159478, 0.00061306, 0.00073547, 0.00411406, 0.02231354,
       0.00542562, 0.00164968, 0.00903698, 0.00302178, 0.        ,
       0.        , 0.25458985, 0.        , 0.00375056, 0.0005592 ,
       0.00991195])

In [21]:
numeric_features_list = list(num_features)
numeric_features_list.extend(onehot_columns)

In [22]:
print(numeric_features_list)


['age', 'balance', 'day', 'duration', 'pdays', 'housing_no', 'housing_yes', 'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep', 'poutcome_failure', 'poutcome_other', 'poutcome_success', 'poutcome_unknown', 'contact_cellular', 'contact_telephone', 'contact_unknown']


In [23]:
import eli5
eli5.explain_weights(clf_best.named_steps["clf_dt"], top=50, feature_names=numeric_features_list, feature_filter=lambda x: x != '<BIAS>')

Weight,Feature
0.4905,duration
0.2546,poutcome_success
0.0475,pdays
0.0439,housing_yes
0.0408,age
0.0223,month_mar
0.0221,day
0.0216,month_apr
0.0143,balance
0.0099,contact_unknown


In [24]:
# Tabulate the importances by feature name, largest first
r = pd.DataFrame(i, index=numeric_features_list, columns=['importance'])
print(r.sort_values('importance', ascending=False))

                   importance
duration             0.490500
poutcome_success     0.254590
pdays                0.047509
housing_yes          0.043948
age                  0.040822
month_mar            0.022314
day                  0.022059
month_apr            0.021613
balance              0.014293
contact_unknown      0.009912
month_oct            0.009037
month_may            0.005426
month_jun            0.004114
contact_cellular     0.003751
month_sep            0.003022
month_nov            0.001650
month_feb            0.001595
housing_no           0.001123
month_aug            0.000816
month_jul            0.000735
month_jan            0.000613
contact_telephone    0.000559
month_dec            0.000000
poutcome_failure     0.000000
poutcome_other       0.000000
poutcome_unknown     0.000000
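
Because one-hot encoding splits each categorical feature into several columns, its importance is spread across them. A minimal sketch of summing the columns back to the original feature follows. The numbers are copied from the table above; the grouping rule (take the text before the first underscore) is an assumption that happens to fit these column names.

```python
import pandas as pd

# Importances copied from the decision tree table above
importance = pd.Series({
    'duration': 0.490500, 'poutcome_success': 0.254590, 'pdays': 0.047509,
    'housing_yes': 0.043948, 'age': 0.040822, 'month_mar': 0.022314,
    'day': 0.022059, 'month_apr': 0.021613, 'balance': 0.014293,
    'contact_unknown': 0.009912, 'month_oct': 0.009037, 'month_may': 0.005426,
    'month_jun': 0.004114, 'contact_cellular': 0.003751, 'month_sep': 0.003022,
    'month_nov': 0.001650, 'month_feb': 0.001595, 'housing_no': 0.001123,
    'month_aug': 0.000816, 'month_jul': 0.000735, 'month_jan': 0.000613,
    'contact_telephone': 0.000559, 'month_dec': 0.0, 'poutcome_failure': 0.0,
    'poutcome_other': 0.0, 'poutcome_unknown': 0.0,
})

# 'month_mar' -> 'month'; plain numeric names such as 'age' are left unchanged
original_feature = [name.split('_')[0] for name in importance.index]
by_feature = importance.groupby(original_feature).sum().sort_values(ascending=False)
print(by_feature)
```

Grouped this way, the twelve month indicators together rank higher than any single month column does on its own.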


### Persist the Model
The cells below first train further classifiers (random forest and SVM) and compare them, then save the best pipeline as a pickle file, which can be loaded later to make predictions.

In [25]:
# try random forest classifier
from sklearn.ensemble import RandomForestClassifier


# rf pipeline
pipeline_rf = Pipeline([
    ('preprocessor', preprocessor),
    ('clf_rf', RandomForestClassifier()),
])

# here we are trying 2x3 different rf models
param_grid_rf = [
    {
        'clf_rf__criterion': ['gini', 'entropy'], 
        'clf_rf__n_estimators': [50, 100, 150],  
    }
]

# set up the grid search 
grid_search_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=10, scoring='accuracy')

In [26]:
%%time
# train the model using the full pipeline
grid_search_rf.fit(X_train, y_train)

Wall time: 5min 8s


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num_pipeline',
                                                                         Pipeline(steps=[('num_imputer',
                                                                                          SimpleImputer()),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                                                                         ['age',
                                                                          'balance',
                                                                          'day',
                                                                          'duration',
                                                                          'pda

In [27]:
clf_best = grid_search_dt.best_estimator_
y_pred = clf_best.predict(X_test)
y_pred

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [28]:
banker = X_test.iloc[7].to_frame().T
banker.T

Unnamed: 0,43965
age,64
job,retired
marital,divorced
education,primary
default,no
balance,109
housing,no
loan,no
contact,cellular
day,23


In [29]:
banker.shape, X_test.shape

((1, 16), (9043, 16))
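
One detail about pulling out a single row: `iloc[7].to_frame().T` goes through a Series, which holds the mixed numeric and string values as `object`, so the resulting one-row frame loses the numeric dtypes. The pipeline here handled that row without complaint, but selecting with a list of positions keeps the original dtypes. A small sketch on a made-up frame (the frame `people` is hypothetical):

```python
import pandas as pd

# Hypothetical frame mixing numeric and string columns, similar in spirit to X_test
people = pd.DataFrame({
    'age': [64, 35, 41],
    'balance': [109, 2500, -30],
    'job': ['retired', 'admin.', 'services'],
})

# Route used above: the row becomes a Series first, so every column ends up as object
via_series = people.iloc[1].to_frame().T

# Alternative: a list of positions returns a one-row DataFrame with the original dtypes
via_list = people.iloc[[1]]

print(via_series.dtypes)
print(via_list.dtypes)
```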

In [30]:
test = clf_best.predict(banker)
test

array([2], dtype=int64)

In [31]:
# try SVM classifier
from sklearn.svm import SVC

# SVC pipeline
pipeline_svc = Pipeline([
    ('preprocessor', preprocessor),
    ('clf_svc', SVC()),
])

# here we try three different kernels, with three degree values for the polynomial kernel
# degree is ignored by the other kernels, so it is listed only with 'poly'
# in total 5 different combinations (2 + 3)
param_grid_svc = [
    {'clf_svc__kernel': ['linear', 'rbf']},
    {
        'clf_svc__kernel': ['poly'],
        'clf_svc__degree': [3, 4, 5],  # only for poly kernel
    },
]

# set up the grid search 
grid_search_svc = GridSearchCV(pipeline_svc, param_grid_svc, cv=10, scoring='accuracy')
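
`GridSearchCV` expands each dictionary in `param_grid` into a full Cartesian product, and a list of dictionaries is searched as the union of those products. Since `SVC` ignores `degree` for every kernel except `poly`, pairing `degree` with all kernels produces redundant candidates. The sketch below uses `ParameterGrid` to count both layouts without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

# One dictionary: every kernel is paired with every degree
single_dict = {
    'clf_svc__kernel': ['linear', 'poly', 'rbf'],
    'clf_svc__degree': [3, 4, 5],
}

# A list of dictionaries: degree is varied only for the poly kernel
split_dicts = [
    {'clf_svc__kernel': ['linear', 'rbf']},
    {'clf_svc__kernel': ['poly'], 'clf_svc__degree': [3, 4, 5]},
]

n_single = len(ParameterGrid(single_dict))
n_split = len(ParameterGrid(split_dicts))
print(n_single, n_split)
```

With 10-fold cross-validation, each redundant candidate costs ten extra SVM fits, which adds up on a training set of this size.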

In [32]:
# train the model using the full pipeline
grid_search_svc.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num_pipeline',
                                                                         Pipeline(steps=[('num_imputer',
                                                                                          SimpleImputer()),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                                                                         ['age',
                                                                          'balance',
                                                                          'day',
                                                                          'duration',
                                                                          'pda

In [33]:
# best cross-validation score (SVC)
grid_search_svc.best_score_

0.9053584306287175

In [34]:
# best cross-validation scores
print('best dt score is: ', grid_search_dt.best_score_)
#print('best svc score is: ', grid_search_svc.best_score_)
print('best rf score is: ', grid_search_rf.best_score_)

best dt score is:  0.9023999714964486
best rf score is:  0.9069343834180286


In [35]:
#print(f"Best xgboost Score: {best_score_}")

In [36]:
# select the best model
# the best parameters are shown; note that SimpleImputer() with no arguments means the mean strategy is used
clf_best = grid_search_rf.best_estimator_
clf_best

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num_pipeline',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'balance', 'day',
                                                   'duration', 'pdays']),
                                                 ('cat_pipeline',
                                                  Pipeline(steps=[('cat_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(hand

In [37]:
# Save the model as a pickle file
import joblib
joblib.dump(clf_best, "clf-best.pickle")

['clf-best.pickle']

In [38]:
# Load the model from a pickle file
saved_tree_clf = joblib.load("clf-best.pickle")
saved_tree_clf

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num_pipeline',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'balance', 'day',
                                                   'duration', 'pdays']),
                                                 ('cat_pipeline',
                                                  Pipeline(steps=[('cat_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(hand
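
A quick way to confirm that a saved model survived the round trip is to compare predictions before and after loading. The sketch below does this with a small random forest on synthetic data, writing to a temporary directory (the data, the model size, and the file name `demo-model.joblib` are assumptions for illustration).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model (not the tuned pipeline above)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo-model.joblib')  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = bool(np.array_equal(model.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

Pickle-based files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.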

In [39]:
banker1 = pd.DataFrame(
    {
        'age': 59,
        'job': 'admin.',
        'marital': 'married',
        'education': 'secondary',
        'default': 'no',
        'balance': 600000,
        'housing': 'yes',
        'loan': 'yes',
        'contact': 'cellular',
        'day': 18,
        'month': 'aug',
        'duration': 73,
        'campaign': 7,
        'pdays': -1,
        'previous': 0,
        # the one-element list sets the row count; the scalars above are broadcast to it
        'poutcome': ['unknown'],
    }
)

clf_best.predict(banker1)

array([1], dtype=int64)

In [40]:
clf_best.predict(banker)

array([2], dtype=int64)

In [41]:
# final test on the testing set
# To predict on new data, simply call the predict method:
# the full pipeline steps are applied to the testing set, followed by the prediction
y_pred = clf_best.predict(X_test)

# calculate accuracy; note: y_test is the ground truth for the testing set
# we have a similar score for the testing set as the cross-validation score - good
# (the line below requires: from sklearn.metrics import accuracy_score)

#print(f'Accuracy Score : {accuracy_score(y_test, y_pred)}')
y_pred[7]

2
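
Since the predictions shown earlier are mostly class 1, accuracy alone may hide how the model does on class 2. A minimal sketch of computing accuracy together with a confusion matrix and per-class recall follows. The label lists are made up for illustration and only reuse the 1/2 coding seen in the outputs above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels using the same 1/2 coding as the predictions above
y_true_demo = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred_demo = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

acc = accuracy_score(y_true_demo, y_pred_demo)
# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2])
# Share of true class-2 rows that were predicted as class 2
recall_2 = recall_score(y_true_demo, y_pred_demo, pos_label=2)

print(acc)
print(cm)
print(recall_2)
```

In this toy case the accuracy is high even though only half of the class-2 rows are found, which is the pattern worth checking for on the real test set.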

In [42]:
clf_best.named_steps['preprocessor']

ColumnTransformer(transformers=[('num_pipeline',
                                 Pipeline(steps=[('num_imputer',
                                                  SimpleImputer()),
                                                 ('scaler', StandardScaler())]),
                                 ['age', 'balance', 'day', 'duration',
                                  'pdays']),
                                ('cat_pipeline',
                                 Pipeline(steps=[('cat_imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['housing', 'month', 'poutcome', 'contact'])])