<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# The Bank Marketing Campaign (Modeling & Evaluation)

- **Objective:** Build a classification model and evaluate its results.

### Author: Abdullah Al-Qithmi
---



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from imblearn.over_sampling import SMOTE
%matplotlib inline

Using TensorFlow backend.


In [2]:
ds = pd.read_csv('bankAfCls.csv', sep=';')

In [3]:
ds.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,previous,poutcome,cons.conf.idx,nr.employed,y
0,56,housemaid,married,basic,0.0,0.0,0.0,telephone,may,mon,1,0,nonexistent,-36.4,5191.0,0
1,57,services,married,high.school,1.0,0.0,0.0,telephone,may,mon,1,0,nonexistent,-36.4,5191.0,0
2,37,services,married,high.school,0.0,1.0,0.0,telephone,may,mon,1,0,nonexistent,-36.4,5191.0,0
3,40,admin.,married,basic,0.0,0.0,0.0,telephone,may,mon,1,0,nonexistent,-36.4,5191.0,0
4,56,services,married,high.school,0.0,0.0,1.0,telephone,may,mon,1,0,nonexistent,-36.4,5191.0,0


In [4]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41108 entries, 0 to 41107
Data columns (total 16 columns):
age              41108 non-null int64
job              41108 non-null object
marital          41108 non-null object
education        41108 non-null object
default          41108 non-null float64
housing          41108 non-null float64
loan             41108 non-null float64
contact          41108 non-null object
month            41108 non-null object
day_of_week      41108 non-null object
campaign         41108 non-null int64
previous         41108 non-null int64
poutcome         41108 non-null object
cons.conf.idx    41108 non-null float64
nr.employed      41108 non-null float64
y                41108 non-null int64
dtypes: float64(5), int64(4), object(7)
memory usage: 5.0+ MB


In [5]:

# Create a sub-DataFrame with only the features used for modeling (plus the target y)
features = ['age','campaign','previous','job','default','loan','month','poutcome','y']
Model = ds[features]

In [6]:
Model.head()

Unnamed: 0,age,campaign,previous,job,default,loan,month,poutcome,y
0,56,1,0,housemaid,0.0,0.0,may,nonexistent,0
1,57,1,0,services,1.0,0.0,may,nonexistent,0
2,37,1,0,services,0.0,0.0,may,nonexistent,0
3,40,1,0,admin.,0.0,0.0,may,nonexistent,0
4,56,1,0,services,0.0,1.0,may,nonexistent,0


### Create Dummy Variables

In [7]:
# One-hot encode the categorical columns; drop_first avoids a redundant dummy per category
Model = pd.get_dummies(Model, drop_first=True)


### Splitting Data
- Split the dataset into training and testing sets with the **train_test_split** function, which assigns rows to each set at random.
- The training set contains **70%** of the original data, and the testing set contains the remaining **30%**.

In [8]:
X = Model.loc[:, Model.columns != 'y']
y = Model.loc[:, Model.columns == 'y']

print('Shape of X: {}'.format(X.shape))
print('Shape of y: {}'.format(y.shape))

Shape of X: (41108, 27)
Shape of y: (41108, 1)


In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)

Shape of X_train:  (28775, 27)
Shape of y_train:  (28775, 1)
Shape of X_test:  (12333, 27)
Shape of y_test:  (12333, 1)
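
Because the positive class is rare in this data, a purely random split can leave slightly different class ratios in the training and testing sets. The following is a minimal, pure-Python sketch of a stratified 70/30 split. The helper `stratified_split` and the toy labels are hypothetical and not part of the original notebook; in scikit-learn a similar effect is usually obtained by passing `stratify=y` to `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test part.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 90 negatives and 10 positives (illustrative, not the bank data).
toy_labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels)

print(len(train_idx), len(test_idx))
print(sum(toy_labels[i] for i in test_idx))
```

Both parts keep the 10% positive share of the toy labels, which makes metrics computed on the test part easier to compare with the training distribution.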


### SMOTE

In [10]:
print("Before OverSampling")
print(y_train['y'].value_counts(),"\n")
print(y_train['y'].value_counts()/y_train['y'].count()*100,"\n")



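# Oversample the minority class in the training set only; the test set stays untouched
sm = SMOTE()
# fit_resample replaces fit_sample, which was deprecated and later removed from imbalanced-learn
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.values.ravel())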

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))

Before OverSampling
0    25485
1     3290
Name: y, dtype: int64 

0    88.566464
1    11.433536
Name: y, dtype: float64 

After OverSampling, the shape of train_X: (50970, 27)
After OverSampling, the shape of train_y: (50970,) 

After OverSampling, counts of label '1': 25485
After OverSampling, counts of label '0': 25485
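
SMOTE balances the classes by creating synthetic minority samples on the line segment between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step in plain Python. `smote_like_sample` is a hypothetical helper, and the real `imblearn` implementation differs in detail (neighbour search, sampling strategy, number of samples generated).

```python
import random

def smote_like_sample(minority, k=2, seed=0):
    """Create one synthetic point between a random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)

    # Squared Euclidean distance from the base point to another minority point.
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(base, p))

    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=sq_dist)[:k]
    neighbour = rng.choice(neighbours)

    # Interpolate: base + lam * (neighbour - base), with lam drawn uniformly from [0, 1).
    lam = rng.random()
    synthetic = [a + lam * (b - a) for a, b in zip(base, neighbour)]
    return base, neighbour, synthetic

# Toy minority-class points with two numeric features (illustrative values only).
minority_points = [[30.0, 1.0], [35.0, 2.0], [50.0, 1.0], [52.0, 3.0]]
base, neighbour, synthetic = smote_like_sample(minority_points)
print(base, neighbour, synthetic)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two real points it was built from.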


### Modeling

In [11]:
lr = LogisticRegression()

In [12]:
lr.fit(X_train_res, y_train_res.ravel())


#lr.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [13]:
y_pred = lr.predict(X_test)

### Evaluation

In [17]:
print("The baseline in y_test?\n", 1 - y_test.mean())

The baseline in y_test?
 y    0.891511
dtype: float64


#### Accuracy Score

In [18]:
metrics.accuracy_score(y_test,y_pred)

0.7751560852996027


#### Confusion Matrix

In [19]:
cm = metrics.confusion_matrix(y_test, y_pred)

# Rows are actual classes, columns are predicted classes: cm[actual, predicted]
TN = cm[0, 0]; print("True Negatives:", TN)
TP = cm[1, 1]; print("True Positives:", TP)
FP = cm[0, 1]; print("False Positives:", FP)
FN = cm[1, 0]; print("False Negatives:", FN)

True Negatives: 8753
True Positives: 807
False Positives: 2242
False Negatives: 531


In [25]:
cm

array([[8753, 2242],
       [ 531,  807]])

In [20]:
# Misclassification rate: the share of test rows the model labelled incorrectly
misclassification = (FP + FN) / (TP + TN + FP + FN)
print(misclassification)

0.2248439147003973
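
The misclassification rate is the complement of the accuracy, and both are best read against the majority-class baseline reported above (about 0.89). The short check below recomputes these figures from the confusion-matrix counts printed by the notebook. It is only an arithmetic sketch, and the lowercase variable names are chosen for illustration.

```python
# Counts taken from the confusion matrix printed above.
tn, fp, fn, tp = 8753, 2242, 531, 807
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
misclassification = (fp + fn) / total
# Accuracy of always predicting the majority class (0).
baseline = (tn + fp) / total

print(f"accuracy:          {accuracy:.4f}")
print(f"misclassification: {misclassification:.4f}")
print(f"baseline:          {baseline:.4f}")
print("beats baseline:", accuracy > baseline)
```

On these numbers the accuracy falls below the baseline, which can happen when a model is trained on a rebalanced set; the per-class metrics give a fuller picture than accuracy alone.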


#### Classification Report

In [21]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.80      0.86     10995
           1       0.26      0.60      0.37      1338

    accuracy                           0.78     12333
   macro avg       0.60      0.70      0.62     12333
weighted avg       0.87      0.78      0.81     12333
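
The per-class figures in the report follow directly from the confusion matrix. As a sanity check, the sketch below recomputes precision, recall and F1 for both classes from the counts printed earlier. The helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Positive class (1): TP = 807, FP = 2242, FN = 531.
p1, r1, f1_1 = prf(tp=807, fp=2242, fn=531)
# Negative class (0): the roles swap, so its hits are the 8753 true negatives.
p0, r0, f1_0 = prf(tp=8753, fp=531, fn=2242)

print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
```

A precision of roughly 0.26 for class 1 means that about three out of four rows predicted as positive were actually negative, while a recall of about 0.60 means the model still finds most of the true positives.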



#### Logarithmic Loss

In [22]:
# Note: y_pred holds hard 0/1 labels; log_loss is normally computed on predicted probabilities
metrics.log_loss(y_test,y_pred)

7.765979051368347
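
`log_loss` is designed for predicted probabilities. When it receives hard 0/1 labels, every wrong prediction is treated as a fully confident mistake and clipped to a very large penalty, so the result mostly restates the misclassification rate. The sketch below shows the difference with a small pure-Python implementation. `binary_log_loss` and the toy values are illustrative assumptions; in the notebook the probabilities would presumably come from `lr.predict_proba(X_test)[:, 1]`.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [0, 0, 1, 1]

# Hard labels: one of four predictions is wrong and is penalised as a certain mistake.
hard_labels = [0, 1, 1, 1]
# Probabilities: the same mistake, but made with moderate confidence.
probabilities = [0.1, 0.6, 0.7, 0.9]

loss_hard = binary_log_loss(y_true, hard_labels)
loss_proba = binary_log_loss(y_true, probabilities)
print(loss_hard)
print(loss_proba)
```

The hard-label loss is dominated by the clipping penalty, whereas the probability-based loss reflects how confident each prediction was.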

#### ROC AUC

In [23]:
# Note: with hard 0/1 labels this equals the mean of the two class recalls, not the full-curve AUC
metrics.roc_auc_score(y_test,y_pred)

0.6996140724381446
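
Passing hard predictions to `roc_auc_score` yields a ROC curve with a single operating point, and the area under it reduces to the mean of the true-positive and true-negative rates (the balanced accuracy). The check below reproduces the reported value from the confusion-matrix counts. Scores such as `lr.predict_proba(X_test)[:, 1]` would be needed to evaluate the full curve. The variable names are illustrative.

```python
# Counts taken from the confusion matrix printed earlier.
tn, fp, fn, tp = 8753, 2242, 531, 807

tpr = tp / (tp + fn)  # recall of class 1
tnr = tn / (tn + fp)  # recall of class 0

# With a single threshold the ROC curve runs (0, 0) -> (FPR, TPR) -> (1, 1),
# and its trapezoidal area equals (TPR + TNR) / 2.
auc_from_labels = (tpr + tnr) / 2
print(round(auc_from_labels, 4))
```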