# Dataset Link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
# Attribute Information:

## Input variables:
### bank client data:

### 1 - age (numeric)

### 2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

### 3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

### 4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

### 5 - default: has credit in default? (categorical: 'no','yes','unknown')

### 6 - housing: has housing loan? (categorical: 'no','yes','unknown')

### 7 - loan: has personal loan? (categorical: 'no','yes','unknown')

## Related with the last contact of the current campaign:
### 8 - contact: contact communication type (categorical: 'cellular','telephone') 

### 9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

### 10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

### 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

## Other attributes:
### 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

### 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

### 14 - previous: number of contacts performed before this campaign and for this client (numeric)

### 15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

## Social and economic context attributes
### 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

### 17 - cons.price.idx: consumer price index - monthly indicator (numeric) 

### 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 

### 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

### 20 - nr.employed: number of employees - quarterly indicator (numeric)

## Output variable (desired target):
### 21 - y - has the client subscribed to a term deposit? (binary: 'yes','no')

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import optimizers
from sklearn.metrics import confusion_matrix
import tensorflow as tf
import keras
from tensorflow.keras.callbacks import TensorBoard
from sklearn.utils import class_weight
from collections import Counter
%matplotlib notebook
import base64
import sys
sys.path.append('./python')
from generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
from IPython.display import display, HTML
import time

Using TensorFlow backend.


## Import data

In [2]:
df = pd.read_csv('bank-additional-full.csv', delimiter=';')

In [3]:
df = df.drop('duration', axis=1)

In [4]:
target = df['y'].replace(['yes', 'no'], (1,0))

In [5]:
df_overview = df.copy()

In [6]:
df_overview['y']= target

In [7]:
df_overview.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


## Check data using Facets (https://github.com/PAIR-code/facets)

In [8]:
# Temporary 50/50 split, used only to feed the Facets overview below
X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(df_overview, target, test_size=0.5)

In [9]:
gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'train', 'table': X_train_temp},
                                  {'name': 'test', 'table': X_test_temp}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")

In [10]:
HTML_TEMPLATE = """<link rel="import" 
                href="https://raw.githubusercontent.com/PAIR-code/facets/master/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

# Notes:
### Column 'previous': ~86% of the values are zero.

### Column 'pdays': the value 999 (client not previously contacted) occurs 39673 times.

### Column 'default': {'no': 32588, 'unknown': 8597, 'yes': 3}; 'yes' occurs only 3 times.

### Column 'poutcome': the value 'nonexistent' occurs 35563 times.

### Column 'y' (target/label): the dataset is imbalanced (0: 36548, 1: 4640). Optimize metrics such as F1 or recall rather than accuracy.
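
To see why accuracy is a poor target here, consider a model that always predicts the majority class. The following is a minimal sketch that uses only the class counts listed in the notes; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the notes above (0 = no subscription, 1 = subscription).
class_counts = {0: 36548, 1: 4640}

total = sum(class_counts.values())

# A trivial classifier that always predicts the majority class (0).
baseline_accuracy = class_counts[0] / total

# It never predicts a positive, so it finds none of the 4640 subscribers.
baseline_recall_pos = 0 / class_counts[1]

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall (class 1): {baseline_recall_pos:.4f}")
```

A model has to beat this accuracy baseline while also recovering positives, which is why recall and F1 on class 1 are the more informative numbers.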

In [11]:
print('Class imbalance, Positives (1):',(100*(target.value_counts()[1] / (sum(Counter(target).values())))),'Percent')

Class imbalance, Positives (1): 11.265417111780131 Percent


In [12]:
Counter(target)

Counter({0: 36548, 1: 4640})

In [13]:
Counter(df['pdays'])

Counter({999: 39673,
         6: 412,
         4: 118,
         3: 439,
         5: 46,
         1: 26,
         0: 15,
         10: 52,
         7: 60,
         8: 18,
         9: 64,
         11: 28,
         2: 61,
         12: 58,
         13: 36,
         14: 20,
         15: 24,
         16: 11,
         21: 2,
         17: 8,
         18: 7,
         22: 3,
         25: 1,
         26: 1,
         19: 3,
         27: 1,
         20: 1})

In [14]:
Counter(df['poutcome'])

Counter({'nonexistent': 35563, 'failure': 4252, 'success': 1373})

In [15]:
Counter(df['default'])

Counter({'no': 32588, 'unknown': 8597, 'yes': 3})

## Based on the above, drop the dominated columns 'default', 'previous', 'pdays' and 'poutcome':

In [16]:
x_data = df.drop(['y','default', 'previous','pdays','poutcome'], axis=1)
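
The decision to drop these columns rests on how strongly a single value dominates each of them. A small sketch of that check, using the counts printed by the `Counter` cells (for `pdays`, all values other than 999 are lumped into one hypothetical `"other"` bucket; `dominant_share` is an illustrative helper, not a library function):

```python
from collections import Counter

# Value counts copied from the Counter outputs in the notebook.
counts = {
    "pdays": Counter({999: 39673, "other": 1515}),  # "other" = all values except 999
    "poutcome": Counter({"nonexistent": 35563, "failure": 4252, "success": 1373}),
    "default": Counter({"no": 32588, "unknown": 8597, "yes": 3}),
}


def dominant_share(counter):
    """Return the most frequent value and the fraction of rows it covers."""
    value, n = counter.most_common(1)[0]
    return value, n / sum(counter.values())


shares = {col: dominant_share(c) for col, c in counts.items()}
for col, (value, share) in shares.items():
    print(f"{col}: {value!r} covers {share:.1%} of rows")
```

Note that `default` is dominated less by `'no'` than by the near absence of `'yes'` (3 rows), which leaves the column with almost no usable signal for the positive case.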

In [17]:
target = df['y'].replace(['yes', 'no'], (1,0))

## get_dummies()

In [18]:
# Collect the names of all object-typed (categorical) columns.
cat_cols = x_data.select_dtypes(include='object').columns.tolist()

In [19]:
cat_cols

['job',
 'marital',
 'education',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week']

In [20]:
# One-hot encode every categorical column; drop_first removes one redundant level per column.
x_data = pd.get_dummies(x_data, columns=cat_cols, drop_first=True)
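
What `drop_first=True` does can be illustrated without pandas. The sketch below is a simplified stand-in (the helper `one_hot_drop_first` is hypothetical): categories are sorted, the first one is dropped, and a row belonging to the dropped category is encoded as all zeros.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode category labels, dropping the first (sorted) category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the all-zeros row
    columns = [f"{prefix}_{cat}" for cat in kept]
    rows = [[int(v == cat) for cat in kept] for v in values]
    return columns, rows


days = ["mon", "tue", "wed", "thu", "fri"]
columns, rows = one_hot_drop_first(days, "day_of_week")

print(columns)
for day, row in zip(days, rows):
    print(day, row)
```

This matches the column names seen in the encoded frame: `'fri'` sorts first, so there is no `day_of_week_fri` column and Fridays are represented by zeros in the four remaining columns.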

In [21]:
x_data.head()

Unnamed: 0,age,campaign,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed
0,56,1,1.1,93.994,-36.4,4.857,5191.0,0,0,1,...,0,0,1,0,0,0,1,0,0,0
1,57,1,1.1,93.994,-36.4,4.857,5191.0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
2,37,1,1.1,93.994,-36.4,4.857,5191.0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
3,40,1,1.1,93.994,-36.4,4.857,5191.0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
4,56,1,1.1,93.994,-36.4,4.857,5191.0,0,0,0,...,0,0,1,0,0,0,1,0,0,0


## Data split 50/50

In [22]:
X_train, X_test, y_train, y_test = train_test_split(x_data, target, test_size=0.5)
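
The split above is purely random, so the share of positives can differ slightly between the two halves. A stratified split keeps the class ratio identical in both; `train_test_split` offers this through its `stratify` parameter. The dependency-free sketch below shows the idea on toy labels (the function `stratified_split` and the data are illustrative assumptions, not the notebook's code):

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.5, seed=0):
    """Split indices so that each class keeps the same proportion in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with 10% positives, roughly like the imbalance in this dataset.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)

train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), train_pos, len(test_idx), test_pos)
```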

## Normalization using StandardScaler()

In [23]:
scaler = StandardScaler()

In [24]:
scaler.fit(X_train)



StandardScaler(copy=True, with_mean=True, with_std=True)

In [25]:
X_train = pd.DataFrame(scaler.transform(X_train),
                              columns=X_train.columns,
                              index=X_train.index)


In [26]:
X_train.head()

Unnamed: 0,age,campaign,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed
7591,-0.290596,-0.204672,0.644636,0.717202,0.870195,0.713409,0.329873,-0.536516,-0.195348,-0.16174,...,-0.389106,-0.105807,1.406778,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,-0.49789,-0.495763
34911,-0.769046,0.149501,-1.194173,-1.17492,-1.239625,-1.371713,-0.945589,-0.536516,-0.195348,-0.16174,...,-0.389106,-0.105807,1.406778,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,-0.49789,-0.495763
20668,0.187853,-0.558846,0.834857,-0.228,0.934781,0.771681,0.844776,-0.536516,-0.195348,-0.16174,...,-0.389106,-0.105807,-0.710844,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,-0.49789,2.017094
37038,-0.673356,-0.558846,-1.891652,-1.903585,1.473,-1.490566,-1.263413,-0.536516,-0.195348,-0.16174,...,-0.389106,-0.105807,-0.710844,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,2.008476,-0.495763
19267,-0.386286,-0.558846,0.834857,-0.228,0.934781,0.772835,0.844776,-0.536516,-0.195348,-0.16174,...,-0.389106,-0.105807,-0.710844,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,-0.49789,2.017094


In [27]:
X_test = pd.DataFrame(scaler.transform(X_test),
                             columns=X_test.columns,
                             index=X_test.index)


In [28]:
X_test.head()

Unnamed: 0,age,campaign,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed
28996,0.570613,-0.558846,-1.194173,-0.862144,-1.433384,-1.282284,-0.945589,-0.536516,-0.195348,-0.16174,...,-0.389106,-0.105807,-0.710844,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,-0.49789,-0.495763
32422,-0.769046,-0.204672,-1.194173,-1.17492,-1.239625,-1.335364,-0.945589,-0.536516,-0.195348,-0.16174,...,-0.389106,-0.105807,1.406778,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,-0.49789,-0.495763
30974,-0.769046,2.274542,-1.194173,-1.17492,-1.239625,-1.317479,-0.945589,-0.536516,-0.195348,-0.16174,...,-0.389106,-0.105807,1.406778,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,2.008476,-0.495763
574,0.474923,-0.204672,0.644636,0.717202,0.870195,0.70937,0.329873,-0.536516,-0.195348,6.182772,...,-0.389106,-0.105807,1.406778,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,2.008476,-0.495763
5919,-0.003527,-0.204672,0.644636,0.717202,0.870195,0.70937,0.329873,1.863877,-0.195348,-0.16174,...,-0.389106,-0.105807,1.406778,-0.330503,-0.138936,-0.119092,-0.507743,-0.51839,2.008476,-0.495763
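
The important detail in the scaling step is that the mean and standard deviation come from the training split only and are then reused for the test split. A minimal pure-Python sketch of that pattern on made-up numbers (the values are illustrative; `StandardScaler` uses the population standard deviation, which `statistics.pstdev` also computes):

```python
from statistics import fmean, pstdev

# Toy feature values, not taken from the dataset.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

# "fit": statistics are estimated on the training data only.
mean = fmean(train)
std = pstdev(train)  # population standard deviation (ddof=0)

# "transform": the same statistics are applied to both splits.
train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]

print([round(z, 4) for z in train_scaled])
print([round(z, 4) for z in test_scaled])
```

The scaled training values have zero mean and unit variance by construction; the test values generally do not, and that is expected.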


# Sequential Model

In [29]:
model = Sequential()
model.add(Dense(46, input_dim=46, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))

In [30]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 46)                2162      
_________________________________________________________________
dropout_1 (Dropout)          (None, 46)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                1504      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 32)                1056      
_________________________________________________________________
dropout_3 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33        
Total para
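
The per-layer parameter counts in the summary can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and `Dropout` layers add none. A short sketch (the helper name `dense_params` is illustrative):

```python
# Input dimension followed by the number of units in each Dense layer.
layer_sizes = [46, 46, 32, 32, 1]


def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(params)

print(params)  # [2162, 1504, 1056, 33]
print(total)
```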

In [31]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.Adam(lr=0.01),
              metrics=['accuracy'])

## Calculate the weights

In [32]:
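# Keyword arguments are required by recent scikit-learn releases and also work on older ones.
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                 classes=np.unique(y_train),
                                                 y=y_train)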

In [33]:
class_weights

array([0.5627391, 4.4847561])

In [34]:
wght= {0:class_weights[0],1:class_weights[1]}

In [35]:
wght

{0: 0.5627390971690895, 1: 4.484756097560975}
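
The `'balanced'` heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces the weights shown above. The per-class training counts are an assumption: they are not printed in the notebook and were chosen to be consistent with the 20594 training rows and the reported weights.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * count_of_that_class)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# ASSUMPTION: class counts in y_train, inferred from the printed weights.
train_counts = {0: 18298, 1: 2296}

weights = balanced_class_weights(train_counts)
print(weights)
```

With these weights, each misclassified positive contributes roughly eight times as much to the loss as a misclassified negative, which is the ratio of the two class counts.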

## Train

In [37]:
keras.backend.get_session().run(tf.global_variables_initializer())

In [48]:
NAME = 'test-{}'.format(int(time.time()))
tensorboard = TensorBoard(log_dir='.logs/10_3232_{}'.format(NAME))
model.fit(X_train, y_train,
          shuffle=True,
          class_weight=wght,
          validation_split=0.3,
          epochs=10,
          batch_size=128,
          verbose=2,
          callbacks=[tensorboard])

Train on 14415 samples, validate on 6179 samples
Epoch 1/10
 - 0s - loss: 0.5531 - acc: 0.8257 - val_loss: 0.5299 - val_acc: 0.8461
Epoch 2/10
 - 0s - loss: 0.5552 - acc: 0.8180 - val_loss: 0.5247 - val_acc: 0.8411
Epoch 3/10
 - 0s - loss: 0.5565 - acc: 0.8181 - val_loss: 0.5399 - val_acc: 0.8388
Epoch 4/10
 - 0s - loss: 0.5553 - acc: 0.8300 - val_loss: 0.5330 - val_acc: 0.8395
Epoch 5/10
 - 0s - loss: 0.5561 - acc: 0.8148 - val_loss: 0.5294 - val_acc: 0.8217
Epoch 6/10
 - 0s - loss: 0.5500 - acc: 0.8148 - val_loss: 0.5437 - val_acc: 0.8162
Epoch 7/10
 - 0s - loss: 0.5564 - acc: 0.8231 - val_loss: 0.5354 - val_acc: 0.8501
Epoch 8/10
 - 0s - loss: 0.5493 - acc: 0.8238 - val_loss: 0.5389 - val_acc: 0.8283
Epoch 9/10
 - 0s - loss: 0.5487 - acc: 0.8251 - val_loss: 0.5339 - val_acc: 0.8335
Epoch 10/10
 - 0s - loss: 0.5504 - acc: 0.8239 - val_loss: 0.5304 - val_acc: 0.8391


<keras.callbacks.History at 0x27cab160dd8>
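
The log suggests that validation loss is fluctuating rather than improving. One quick way to confirm this is to look at the best epoch and the overall spread; the sketch below does so with the `val_loss` values copied from the output above (it is an after-the-fact check, not part of the original run).

```python
# val_loss per epoch, copied from the training log above.
val_loss = [0.5299, 0.5247, 0.5399, 0.5330, 0.5294,
            0.5437, 0.5354, 0.5389, 0.5339, 0.5304]

# Epochs are 1-based in the Keras log.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
spread = max(val_loss) - min(val_loss)

print(f"best epoch: {best_epoch} (val_loss={min(val_loss):.4f})")
print(f"spread over all epochs: {spread:.4f}")
```

When the best value occurs this early and the spread is this small, additional epochs at the same learning rate are unlikely to help; Keras provides an `EarlyStopping` callback that monitors `val_loss` for this situation.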

## Model evaluation (accuracy alone is of limited use on this imbalanced data)

In [49]:
score = model.evaluate(X_test, y_test, batch_size=128)



In [50]:
print('Test accuracy:', score[1])

Test accuracy: 0.8367000096247383


## Predictions

In [51]:
final_preds = pd.DataFrame(model.predict(X_test), index=y_test.index)
# model.predict returns probabilities; a custom threshold is applied in the next cells.
# To use the default 0.5 threshold instead:
#         final_preds = pd.DataFrame(model.predict_classes(X_test), index=y_test.index)

In [52]:
final_preds[final_preds<0.4]=0

In [53]:
final_preds[final_preds>=0.4]=1

In [54]:
final_preds=final_preds.astype('int64')

In [55]:
cm = confusion_matrix(y_true=y_test, y_pred=final_preds)

In [56]:
cm

array([[15090,  3160],
       [  843,  1501]], dtype=int64)

In [57]:
print(classification_report(y_test, final_preds))

              precision    recall  f1-score   support

           0       0.95      0.83      0.88     18250
           1       0.32      0.64      0.43      2344

   micro avg       0.81      0.81      0.81     20594
   macro avg       0.63      0.73      0.66     20594
weighted avg       0.88      0.81      0.83     20594
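
The class 1 row of the report follows directly from the confusion matrix, whose rows are true labels and whose columns are predicted labels. A short sketch that recomputes the figures from the matrix printed above:

```python
# Confusion matrix from the notebook: rows = true class, columns = predicted class.
cm = [[15090, 3160],
      [843, 1501]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)  # of all predicted positives, how many are correct
recall = tp / (tp + fn)     # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Roughly two out of three predicted subscribers are false alarms, while about a third of the real subscribers are still missed.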



## Metrics need to improve (prioritize either precision or recall)
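
Choosing between precision and recall comes down to where the decision threshold is placed. The sketch below sweeps a few thresholds over made-up labels and scores (both are hypothetical toy data, and `precision_recall_at` is an illustrative helper) to show the trade-off.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall for class 1 when predicting 1 for score >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data, not model output.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.30]

results = {t: precision_recall_at(labels, scores, t) for t in (0.3, 0.4, 0.5)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.3f} recall={r:.3f}")
```

Lowering the threshold raises recall at the cost of precision and vice versa, so the 0.4 used above is one point on this curve rather than a fixed choice.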