## Classification Problem : Bank Marketing

### Problem Description - Bank Marketing Decision

Our goal is to find the clients before call whether they would subscribe to the product (bank term deposit), ('yes') or not ('no').

    The data is related with direct marketing campaigns of a banking institution
    The marketing campaigns were based on phone calls
    Often, more than one contact to the same client was required

#### Data

    age: age of the Client (numeric)
    
    job: type of job (categorical: 'admin.','blue collar','entrepreneur','housemaid','management','retired','self employed','services','student','technician','unemployed','unknown')
    
    marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means                       divorced or widowed)
    
    education:   (categorical:'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
    
    default: has credit in default? (categorical: 'no','yes','unknown')
    
    housing: has housing loan? (categorical: 'no','yes','unknown')
    
    loan: has personal loan? (categorical: 'no','yes','unknown')
    
    contact: contact communication type (categorical: 'cellular','telephone')
    
    month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
    
    day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
    
    duration: last contact duration, in seconds (numeric)
    
    campaign: number of contacts performed during this campaign and for this client (numeric, includes last                     contact)
    
    pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric;              999 means client was not previously contacted)
    
    previous: number of contacts performed before this campaign and for this client (numeric)
    
    poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')social                 and economic context attributes
    
    emp.var.rate: employment variation rate quarterly indicator (numeric)
    
    cons.price.idx: consumer price index monthly indicator (numeric)
    
    cons.conf.idx: consumer confidence index monthly indicator (numeric)
    
    euribor3m: euribor 3 month rate - daily indicator (numeric)
    
    nr.employed: number of employees quarterly indicator (numeric)

#### Objective

Predict whether a customer will subscribe to the product or not. 

        Supervised learning --> Classification --> Binary Classification. 

### Import all required libraries

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.metrics import confusion_matrix, accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

#### Set the Current working directory

In [2]:
PATH = os.getcwd()
DATA_FILE = os.path.join(PATH, "bank_data.csv")

### Load the data

In [3]:
data = 

### Understanding the data

#### Number of rows and columns

(41188, 21)

#### Column or Attribute names

Index(['age', 'job', 'marital', 'education', 'credit_default', 'housing',
       'loan', 'contact', 'contacted_month', 'day_of_week', 'duration',
       'compaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate',
       'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employees', 'y'],
      dtype='object')

#### Display first 5 and last 5 records

Unnamed: 0,age,job,marital,education,credit_default,housing,loan,contact,contacted_month,day_of_week,...,compaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


Unnamed: 0,age,job,marital,education,credit_default,housing,loan,contact,contacted_month,day_of_week,...,compaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees,y
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41187,74,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,3,999,1,failure,-1.1,94.767,-50.8,1.028,4963.6,no


#### Summary Statistics

Unnamed: 0,age,duration,compaign,pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees
count,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0
mean,40.02406,258.28501,2.567593,962.475454,0.172963,0.081886,93.575664,-40.5026,3.621291,5167.035911
std,10.42125,259.279249,2.770014,186.910907,0.494901,1.57096,0.57884,4.628198,1.734447,72.251528
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,38.0,180.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,319.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1


Unnamed: 0,age,job,marital,education,credit_default,housing,loan,contact,contacted_month,day_of_week,...,compaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees,y
count,41188.0,41188,41188,41188,41188,41188,41188,41188,41188,41188,...,41188.0,41188.0,41188.0,41188,41188.0,41188.0,41188.0,41188.0,41188.0,41188
unique,,12,4,8,3,3,3,2,10,5,...,,,,3,,,,,,2
top,,admin.,married,university.degree,no,yes,no,cellular,may,thu,...,,,,nonexistent,,,,,,no
freq,,10422,24928,12168,32588,21576,33950,26144,13769,8623,...,,,,35563,,,,,,36548
mean,40.02406,,,,,,,,,,...,2.567593,962.475454,0.172963,,0.081886,93.575664,-40.5026,3.621291,5167.035911,
std,10.42125,,,,,,,,,,...,2.770014,186.910907,0.494901,,1.57096,0.57884,4.628198,1.734447,72.251528,
min,17.0,,,,,,,,,,...,1.0,0.0,0.0,,-3.4,92.201,-50.8,0.634,4963.6,
25%,32.0,,,,,,,,,,...,1.0,999.0,0.0,,-1.8,93.075,-42.7,1.344,5099.1,
50%,38.0,,,,,,,,,,...,2.0,999.0,0.0,,1.1,93.749,-41.8,4.857,5191.0,
75%,47.0,,,,,,,,,,...,3.0,999.0,0.0,,1.4,93.994,-36.4,4.961,5228.1,


age                  int64
job                 object
marital             object
education           object
credit_default      object
housing             object
loan                object
contact             object
contacted_month     object
day_of_week         object
duration             int64
compaign             int64
pdays                int64
previous             int64
poutcome            object
emp_var_rate       float64
cons_price_idx     float64
cons_conf_idx      float64
euribor3m          float64
nr_employees       float64
y                   object
dtype: object

#### Observations

Few attributes such as job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome and y are categorical but are interpreted as object type. 

#### TypeCasting - Convert the attribute in to appropriate type

Using astype('category') to convert job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome and y attributes to categorical attributes from existing object datatype

In [11]:
cat_Attr_Names =  ['job', 'marital', 'education', 'credit_default', 'housing', 'loan', 
                   'contact', 'contacted_month', 'day_of_week', 'poutcome', 'y']

num_Attr_Names = list(set(data.columns) - set(cat_Attr_Names))

In [12]:
data[cat_Attr_Names] = 
data[num_Attr_Names] = 

In [13]:
data.dtypes

age                 float64
job                category
marital            category
education          category
credit_default     category
housing            category
loan               category
contact            category
contacted_month    category
day_of_week        category
duration            float64
compaign            float64
pdays               float64
previous            float64
poutcome           category
emp_var_rate        float64
cons_price_idx      float64
cons_conf_idx       float64
euribor3m           float64
nr_employees        float64
y                  category
dtype: object

#### Summary Statistics

In [14]:
data.describe(include='all')

Unnamed: 0,age,job,marital,education,credit_default,housing,loan,contact,contacted_month,day_of_week,...,compaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees,y
count,41188.0,41188,41188,41188,41188,41188,41188,41188,41188,41188,...,41188.0,41188.0,41188.0,41188,41188.0,41188.0,41188.0,41188.0,41188.0,41188
unique,,12,4,8,3,3,3,2,10,5,...,,,,3,,,,,,2
top,,admin.,married,university.degree,no,yes,no,cellular,may,thu,...,,,,nonexistent,,,,,,no
freq,,10422,24928,12168,32588,21576,33950,26144,13769,8623,...,,,,35563,,,,,,36548
mean,40.02406,,,,,,,,,,...,2.567593,962.475454,0.172963,,0.081886,93.575664,-40.5026,3.621291,5167.035911,
std,10.42125,,,,,,,,,,...,2.770014,186.910907,0.494901,,1.57096,0.57884,4.628198,1.734447,72.251528,
min,17.0,,,,,,,,,,...,1.0,0.0,0.0,,-3.4,92.201,-50.8,0.634,4963.6,
25%,32.0,,,,,,,,,,...,1.0,999.0,0.0,,-1.8,93.075,-42.7,1.344,5099.1,
50%,38.0,,,,,,,,,,...,2.0,999.0,0.0,,1.1,93.749,-41.8,4.857,5191.0,
75%,47.0,,,,,,,,,,...,3.0,999.0,0.0,,1.4,93.994,-36.4,4.961,5228.1,


#### Handling of missing data

In [15]:
data.isnull().sum()

age                0
job                0
marital            0
education          0
credit_default     0
housing            0
loan               0
contact            0
contacted_month    0
day_of_week        0
duration           0
compaign           0
pdays              0
previous           0
poutcome           0
emp_var_rate       0
cons_price_idx     0
cons_conf_idx      0
euribor3m          0
nr_employees       0
y                  0
dtype: int64

#### Categorial attributes distribution

In [16]:
for attr in cat_Attr_Names:
    print(attr)
    print(data[attr].value_counts(), '\n')

job
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: job, dtype: int64 

marital
married     24928
single      11568
divorced     4612
unknown        80
Name: marital, dtype: int64 

education
university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
unknown                 1731
illiterate                18
Name: education, dtype: int64 

credit_default
no         32588
unknown     8597
yes            3
Name: credit_default, dtype: int64 

housing
yes        21576
no         18622
unknown      990
Name: housing, dtype: int64 

loan
no         33950
yes         6248
unknown      990
Name: loan, dtype: int64 

contact
cellular     26144
telephon

In [17]:
pd.value_counts(data['y'])/data['y'].count() * 100

no     88.734583
yes    11.265417
Name: y, dtype: float64

### Train-Test Split

Using sklearn.model_selection.train_test_split

    Split the data into train and test subsets

In [18]:
X = data.drop()
y = data['y']

cat_Attr_Names = list(set(cat_Attr_Names) - set('y'))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

In [19]:
print(X_train.shape)
print(X_test.shape)

(28831, 20)
(12357, 20)


#### Standardize Numeric Attributes:

In [20]:
scaler = StandardScaler()
scaler.fit(X_train[num_Attr_Names])

X_train_num = pd.DataFrame(scaler.transform(X_train[num_Attr_Names]), columns=num_Attr_Names)
X_test_num = pd.DataFrame(scaler.transform(X_test[num_Attr_Names]), columns=num_Attr_Names)

#### OneHotEncode Categorial Attributes:

In [21]:
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train[cat_Attr_Names])

columns_ohe = list(ohe.get_feature_names(cat_Attr_Names))

X_train_cat = 
X_test_cat = 

X_train_cat = 
X_test_cat = 

In [22]:
X_train = 
X_test = 

In [23]:
X_train.shape

(28831, 63)

In [24]:
new_Ind_Attr_Names = X_train.columns
new_Ind_Attr_Names

Index(['age', 'euribor3m', 'cons_price_idx', 'duration', 'pdays', 'compaign',
       'cons_conf_idx', 'nr_employees', 'emp_var_rate', 'previous',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'contact_cellular',
       'contact_telephone', 'housing_no', 'housing_unknown', 'housing_yes',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'contacted_month_apr', 'contacted_month_aug',
       'contacted_month_dec', 'contacted_month_jul', 'contacted_month_jun',
       'contacted_month_mar', 'contacted_month_may', 'contacted_month_nov',
       'contacted_month_oct', 'contacted_month_sep', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success', 'education_basic.4y',
       'education_basic.6y', 'education_basic.9y', 'education_high.school',
       'education_illiterate', 'education_professional.course',
       'education_university.degree', 'education_unknown', 'credit_default_

In [25]:
X_train.head()

Unnamed: 0,age,euribor3m,cons_price_idx,duration,pdays,compaign,cons_conf_idx,nr_employees,emp_var_rate,previous,...,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,loan_no,loan_unknown,loan_yes
0,0.479897,0.703768,0.716223,0.233079,0.195559,-0.573111,0.879444,0.324164,0.640729,-0.348328,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,-1.058517,0.767418,-0.233354,7.050209,0.195559,1.268589,0.944309,0.840284,0.832074,-0.348328,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.576048,0.706082,0.716223,-0.475595,0.195559,0.163569,0.879444,0.324164,0.640729,-0.348328,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
3,-0.28931,0.704346,0.716223,0.999524,0.195559,-0.573111,0.879444,0.324164,0.640729,-0.348328,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1.345255,0.705504,0.716223,-0.356199,0.195559,0.163569,0.879444,0.324164,0.640729,-0.348328,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


#### Using LabelEncoder to convert target attribute 'y' to Numerical

In [26]:
le = LabelEncoder()

y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

#### Target attribute distribution

In [27]:
print(pd.value_counts(y_train))
print(pd.value_counts(y_test))

0    25587
1     3244
dtype: int64
0    10961
1     1396
dtype: int64


## Model Building

In [28]:
def build_Cls_Model(model, hyp_Params=None, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test):

    if hyp_Params != None:     
        
        model_Grid = GridSearchCV(model, param_grid=hyp_Params, cv=3)
        model_Grid.fit(X_train, y_train)
        
        print(f"The best parameters are {model_Grid.best_params_}")
        model = model_Grid.best_estimator_
    else: 
        model.fit(X_train, y_train)
    
    y_train_Pred = model.predict(X_train)
    y_test_Pred = model.predict(X_test)

    print('========Train=======')
    print(f"Confusion Matrix \n{confusion_matrix(y_train, y_train_Pred)}")
    print(f"Accuracy \n{accuracy_score(y_test, y_test_Pred)}")

    print('========Test=======')
    print(f"Confusion Matrix \n{confusion_matrix(y_test, y_test_Pred)}")
    print(f"Accuracy \n{accuracy_score(y_test, y_test_Pred)}")
  
    return model

### Logistic Regression

In [29]:
lr_Cls_Model = build_Cls_Model(LogisticRegression(solver='liblinear'),
                               {'penalty': ['l1', 'l2']})

The best parameters are {'penalty': 'l2'}
Confusion Matrix 
[[24888   699]
 [ 1867  1377]]
Accuracy 
0.9122764425022255
Confusion Matrix 
[[10675   286]
 [  798   598]]
Accuracy 
0.9122764425022255


### Support Vector Classifier (SVC)

In [30]:
svc_Cls_Model = build_Cls_Model(SVC(),
                                {'C': [50, 90], 
                                 'gamma': [0.008, 0.001], 
                                 'kernel':['rbf']})

The best parameters are {'C': 50, 'gamma': 0.008, 'kernel': 'rbf'}
Confusion Matrix 
[[25062   525]
 [ 1669  1575]]
Accuracy 
0.9141377356963665
Confusion Matrix 
[[10665   296]
 [  765   631]]
Accuracy 
0.9141377356963665


### Decision Tree Classifier

In [31]:
dtc_Cls_Model = build_Cls_Model(DecisionTreeClassifier(),
                                {'criterion': ['gini', 'entropy'], 
                                 'max_depth': [6, 8, 10, 12], 
                                 'max_features':['log2']})

The best parameters are {'criterion': 'entropy', 'max_depth': 6, 'max_features': 'log2'}
Confusion Matrix 
[[25107   480]
 [ 2452   792]]
Accuracy 
0.8981144290685441
Confusion Matrix 
[[10749   212]
 [ 1047   349]]
Accuracy 
0.8981144290685441


### Random Forest Classifier

In [32]:
rfc_Cls_Model = build_Cls_Model(RandomForestClassifier(),
                                {"n_estimators" : [150, 250, 300],
                                 "max_depth" : [5, 8, 10],
                                 "max_features" : [3, 5, 7],
                                 "min_samples_leaf" : [4, 6, 8, 10]})

The best parameters are {'max_depth': 10, 'max_features': 7, 'min_samples_leaf': 4, 'n_estimators': 300}
Confusion Matrix 
[[25369   218]
 [ 2081  1163]]
Accuracy 
0.9117908877559278
Confusion Matrix 
[[10818   143]
 [  947   449]]
Accuracy 
0.9117908877559278


### AdaBoost Classifier

In [33]:
abc_Cls_Model = build_Cls_Model(AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='gini', 
                                                                                         max_depth=10)),
                                {'n_estimators': [20, 40, 60], 
                                'learning_rate':[0.5, 1.0]})

The best parameters are {'learning_rate': 0.5, 'n_estimators': 60}
Confusion Matrix 
[[25587     0]
 [    0  3244]]
Accuracy 
0.9053977502630088
Confusion Matrix 
[[10478   483]
 [  686   710]]
Accuracy 
0.9053977502630088


### Gradient Boost Classifier

In [34]:
gbc_Cls_Model = build_Cls_Model(GradientBoostingClassifier(),
                                {'max_depth': [8, 10, 12, 14], 
                                 'subsample': [0.8, 0.6,], 
                                 'max_features':[0.2, 0.3], 
                                 'n_estimators': [10, 20, 30]})

The best parameters are {'max_depth': 8, 'max_features': 0.3, 'n_estimators': 30, 'subsample': 0.8}
Confusion Matrix 
[[25176   411]
 [ 1055  2189]]
Accuracy 
0.9155943999352594
Confusion Matrix 
[[10545   416]
 [  627   769]]
Accuracy 
0.9155943999352594


### XGBoost 

In [35]:
xgbc_Cls_Model = build_Cls_Model(XGBClassifier(),
                                 {'learning_rate':[0.1,0.5],
                                  'n_estimators': [20], 
                                  'subsample': [0.3, 0.9]})

The best parameters are {'learning_rate': 0.1, 'n_estimators': 20, 'subsample': 0.9}
Confusion Matrix 
[[24798   789]
 [ 1328  1916]]
Accuracy 
0.9164036578457554
Confusion Matrix 
[[10524   437]
 [  596   800]]
Accuracy 
0.9164036578457554


### Best XGB Model

In [36]:
best_Cls_Model = build_Cls_Model(XGBClassifier(learning_rate=0.1, n_estimators=20, subsample=0.9))

Confusion Matrix 
[[24798   789]
 [ 1328  1916]]
Accuracy 
0.9164036578457554
Confusion Matrix 
[[10524   437]
 [  596   800]]
Accuracy 
0.9164036578457554


In [38]:
dtc_Cls_Model_1 = build_Cls_Model(DecisionTreeClassifier(class_weight='balanced'),
                                {'criterion': ['gini', 'entropy'], 
                                 'max_depth': [6, 8, 10, 12], 
                                 'max_features':['log2']})

The best parameters are {'criterion': 'entropy', 'max_depth': 6, 'max_features': 'log2'}
Confusion Matrix 
[[22101  3486]
 [ 1263  1981]]
Accuracy 
0.8299749130047747
Confusion Matrix 
[[9401 1560]
 [ 541  855]]
Accuracy 
0.8299749130047747


In [39]:
abc_Cls_Model_1 = build_Cls_Model(AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='gini', 
                                                                                         max_depth=10,class_weight='balanced')),
                                {'n_estimators': [20, 40, 60], 
                                'learning_rate':[0.5, 1.0]})

The best parameters are {'learning_rate': 0.5, 'n_estimators': 60}
Confusion Matrix 
[[25587     0]
 [    0  3244]]
Accuracy 
0.9065307113377034
Confusion Matrix 
[[10488   473]
 [  682   714]]
Accuracy 
0.9065307113377034
