## **Project 2 Introduction**


Jump-Start for the Bank Marketing Study<br>
as described in Marketing Data Science: Modeling Techniques<br>
for Predictive Analytics with R and Python (Miller 2015)

jump-start code revised by Thomas W. Miller (2018/10/07)

scikit-learn documentation for this assignment:<br>
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html<br>
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB.score<br>
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html<br>
http://scikit-learn.org/stable/modules/model_evaluation.html<br>
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

the following imports prepare Python 2 code for Python 3.x features and functions;<br>
comment them out when running under Python 3.x<br>
`from __future__ import division, print_function`<br>
`from future_builtins import ascii, filter, hex, map, oct, zip`

# **Ingest & EDA**

seed value for random number generators to obtain reproducible results

In [None]:
RANDOM_SEED = 1

import base packages into the namespace for this program

In [None]:
import numpy as np
import pandas as pd

initial work with the smaller data set

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')
bank=pd.read_csv('gdrive/My Drive/bank.csv', sep=";")  # start with smaller data set
# examine the shape of original input data
print(bank.shape)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
(4521, 17)


drop observations with missing data, if any

In [None]:
bank = bank.dropna()  # dropna returns a new DataFrame, so reassign the result
# examine the shape of input data after dropping missing data
print(bank.shape)

(4521, 17)


look at the list of column names, note that the response column is the target variable

In [None]:
list(bank.columns.values)

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'day',
 'month',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'response']

look at the beginning of the DataFrame

In [None]:
bank.head()

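|   | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |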


mapping function to convert text no/yes to integer 0/1

In [None]:
convert_to_binary = {'no' : 0, 'yes' : 1}

define binary variable for having credit in default

In [None]:
default = bank['default'].map(convert_to_binary)

define binary variable for having a mortgage or housing loan

In [None]:
housing = bank['housing'].map(convert_to_binary)

define binary variable for having a personal loan

In [None]:
loan = bank['loan'].map(convert_to_binary)

define response variable to use in the model

In [None]:
response = bank['response'].map(convert_to_binary)

gather three explanatory variables and response into a numpy array <br>
here we use .T to obtain the transpose for the structure we want

In [None]:
model_data = np.array([np.array(default), np.array(housing), np.array(loan), 
    np.array(response)]).T

examine the shape of model_data, which we will use in subsequent modeling

In [None]:
print(model_data.shape)

(4521, 4)
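
The same array could also be assembled in one step with pandas. The sketch below is an alternative, not the notebook's method: it applies the `convert_to_binary` mapping to all four yes/no columns at once. The names `bank_demo`, `binary_cols`, and `model_data_demo` are hypothetical, and the five rows are copied from the `bank.head()` output above.

```python
import pandas as pd

# First five rows of the four yes/no columns, copied from the bank.head() output.
bank_demo = pd.DataFrame({
    "default": ["no", "no", "no", "no", "no"],
    "housing": ["no", "yes", "yes", "yes", "yes"],
    "loan": ["no", "yes", "no", "yes", "no"],
    "response": ["no", "no", "no", "no", "no"],
})

convert_to_binary = {"no": 0, "yes": 1}
binary_cols = ["default", "housing", "loan", "response"]  # hypothetical helper name

# Map every column with the same dictionary; no transpose is needed because
# the DataFrame already has one row per observation.
model_data_demo = (
    bank_demo[binary_cols]
    .apply(lambda col: col.map(convert_to_binary))
    .to_numpy()
)
print(model_data_demo.shape)
```

On the full `bank` DataFrame, the same expression would give an array with one row per client and the response in the last column.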


In [None]:
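# explanatory variables: default, housing, loan (columns 0 through 2)
x = model_data[:, 0:3]
# response variable (column 3)
y = model_data[:, 3]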

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
     x, y, test_size=0.3, random_state=42)

the rest of the program should set up the modeling methods<br>
and evaluation within a cross-validation design
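
A cross-validation design of this kind can be written out by hand with `KFold`, which makes the per-fold fit and scoring steps explicit. The sketch below is a minimal illustration: `x_demo` and `y_demo` are synthetic stand-ins for the bank data (an assumption, not the real file), and the fold count of 7 simply mirrors the `cv = 7` used later in this notebook. For classifiers, `cross_val_score` with an integer `cv` uses stratified folds, so its fold boundaries would differ from plain `KFold`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

RANDOM_SEED = 1

# Synthetic stand-in data: three binary features and a binary response.
rng = np.random.RandomState(RANDOM_SEED)
x_demo = rng.randint(0, 2, size=(700, 3))
# Tie the response loosely to the first feature so both classes occur often.
y_demo = (rng.rand(700) < 0.3 + 0.3 * x_demo[:, 0]).astype(int)

kf = KFold(n_splits=7, shuffle=True, random_state=RANDOM_SEED)
fold_auc = []
for train_idx, test_idx in kf.split(x_demo):
    model = LogisticRegression()
    model.fit(x_demo[train_idx], y_demo[train_idx])
    # AUC is computed from predicted probabilities of the positive class,
    # not from hard 0/1 predictions.
    prob_yes = model.predict_proba(x_demo[test_idx])[:, 1]
    fold_auc.append(roc_auc_score(y_demo[test_idx], prob_yes))

print(np.round(fold_auc, 3))
print(np.mean(fold_auc))
```

Swapping `LogisticRegression()` for another classifier with a `predict_proba` method keeps the rest of the loop unchanged.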

# **Model & Evaluate**

1. Logistic Regression Model and Evaluation

In [None]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
y_predict1=log_reg.predict(x_test)

In [None]:
from sklearn.model_selection import cross_val_score
from numpy import mean
score1=cross_val_score(log_reg, x_train, y_train, scoring="roc_auc", cv = 7)
print(score1)
print(mean(score1))

[0.55658654 0.63052885 0.64801154 0.63765546 0.52477893 0.58840497
 0.61713245]
0.6004426770326189


2. Naïve Bayes Classification Model and Evaluation

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb=GaussianNB()
gnb.fit(x_train, y_train)
y_predict2=gnb.predict(x_test)

In [None]:
score2=cross_val_score(gnb, x_train, y_train, scoring="roc_auc", cv = 7)
print(score2)
print(mean(score2))

[0.56413462 0.60038462 0.64966662 0.63566936 0.55367192 0.58485837
 0.62701565]
0.6022001646506896


3. K-Neighbors Classification Model and Evaluation

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knc=KNeighborsClassifier(n_neighbors=5)
knc.fit(x_train, y_train)
y_predict3=knc.predict(x_test)

In [None]:
score3=cross_val_score(knc, x_train, y_train, scoring="roc_auc", cv = 7)
print(score3)
print(mean(score3))

[0.57682692 0.62413462 0.39750319 0.51636166 0.54622405 0.53012248
 0.62458032]
0.5451076049547463


# **Exposition, Problem Description, and Management Recommendations**

We built three classification models (a logistic regression classifier, a Naïve Bayes classifier, and a k-neighbors classifier) to predict whether a client subscribes to a term deposit following a marketing phone call. Each model uses three binary yes/no features: credit default, housing loan, and personal loan.


There are no missing values in this dataset, so we treat it as already clean and skip further data cleaning.

We evaluated all three classification models using AUC (area under the ROC curve) with 7-fold cross-validation on the training set. The mean scores are listed below:

*   Logistic Regression Classifier: 0.6004426770326189
*   Naïve Bayes Classifier: 0.6022001646506896
*   K-Neighbors Classifier: 0.5451076049547463

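The closer the AUC is to 1, the better the model separates the two classes; a value of 0.5 corresponds to random guessing. Based on this comparison, my recommendation is to use the Naïve Bayes classifier for prediction, although its margin over logistic regression (about 0.002) is small and all three scores are only modestly above 0.5.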
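
To put the mean AUC values in context, it helps to look at how much the scores vary from fold to fold. The sketch below recomputes the means from the per-fold scores printed earlier in this notebook and adds the sample standard deviation across folds. The dictionary `fold_scores` and the name `summary` are hypothetical; the numbers themselves are copied from the outputs above.

```python
import numpy as np

# Per-fold AUC values copied from the cross_val_score outputs above.
fold_scores = {
    "Logistic regression": [0.55658654, 0.63052885, 0.64801154, 0.63765546,
                            0.52477893, 0.58840497, 0.61713245],
    "Naive Bayes": [0.56413462, 0.60038462, 0.64966662, 0.63566936,
                    0.55367192, 0.58485837, 0.62701565],
    "K-neighbors": [0.57682692, 0.62413462, 0.39750319, 0.51636166,
                    0.54622405, 0.53012248, 0.62458032],
}

# Mean and sample standard deviation (ddof=1) across the seven folds.
summary = {
    name: (np.mean(scores), np.std(scores, ddof=1))
    for name, scores in fold_scores.items()
}

for name, (mean_auc, std_auc) in summary.items():
    print(f"{name}: mean AUC = {mean_auc:.4f}, fold std = {std_auc:.4f}")
```

If the fold-to-fold spread is much larger than the gap between two models' means, the ranking between them should be read as tentative.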

In [None]:
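# enumerate all eight combinations of (default, housing, loan)
x_com=np.array([[1,1,1],[1,1,0],[1,0,1],[0,1,1],[0,0,1],[0,1,0],[1,0,0],[0,0,0]])
a = gnb.predict(x_com)
# print the combinations the model classifies as subscribers
for i in range(len(a)):
  if a[i] == 1:
    print(x_com[i])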

[1 1 1]
[1 1 0]
[1 0 1]
[1 0 0]


According to the Naïve Bayes classifier, the only feature combinations predicted to subscribe to a term deposit are those that include a credit default. Given the modest AUC of about 0.60, this pattern should be treated as tentative. The four predicted-positive combinations are:

*   Has credit default, has housing loan, has personal loan
*   Has credit default, has housing loan, no personal loan
*   Has credit default, no housing loan, has personal loan
*   Has credit default, no housing loan, no personal loan
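
Hard 0/1 predictions hide how confident the model is about each combination. The sketch below shows one way to list the predicted probability of a "yes" response for every combination of the three features, using `itertools.product` instead of typing the combinations by hand. It is fitted on synthetic stand-in data (`x_demo`, `y_demo`, and `gnb_demo` are hypothetical names), so the printed probabilities illustrate the technique only and say nothing about the bank data.

```python
from itertools import product

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: three binary features and a binary response.
rng = np.random.RandomState(1)
x_demo = rng.randint(0, 2, size=(500, 3))
y_demo = (rng.rand(500) < 0.2 + 0.3 * x_demo[:, 0]).astype(int)

gnb_demo = GaussianNB().fit(x_demo, y_demo)

# All 2**3 = 8 combinations of (default, housing, loan).
x_com = np.array(list(product([0, 1], repeat=3)))

# Column 1 of predict_proba holds the probability of class 1 ("yes").
prob_yes = gnb_demo.predict_proba(x_com)[:, 1]
for combo, p in zip(x_com, prob_yes):
    print(combo, round(float(p), 3))
```

With the notebook's fitted `gnb` in place of `gnb_demo`, the same loop would show how far each combination sits from the 0.5 decision threshold.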
