# Market Campaign Prediction Notebook

## Summary
Market campaign prediction aims to predict the success of telemarketing calls.

## Description

### Use Case Description
A company, such as a bank, wants to do market campaign prediction. The bank collects customer demographic data, bank account information, and historical telemarketing activity records from various data sources. The task is to build a pipeline that automatically analyzes the bank market dataset to predict the success of telemarketing calls for selling the bank's long-term deposits. The aim is to provide market intelligence for the bank so that it can better target valuable customers and hence reduce marketing cost.

#### Use Case Data
The data used in this use case is the [BankMarket dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing), a publicly available data set from the UCI Machine Learning Repository. The data contains 17 variables and 4521 rows.

We shared the market data as a blob in a public Windows Azure Storage account. You can use this shared data to follow the steps in this template, or you can access the full dataset from the UCI website.

Each instance in the data set has 17 fields:

* 1 - age (numeric)
* 2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services") 
* 3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
* 4 - education (categorical: "unknown","secondary","primary","tertiary")
* 5 - default: has credit in default? (binary: "yes","no")
* 6 - balance: average yearly balance, in euros (numeric) 
* 7 - housing: has housing loan? (binary: "yes","no")
* 8 - loan: has personal loan? (binary: "yes","no")
* 9 - contact: contact communication type (categorical: "unknown","telephone","cellular") 
* 10 - day: last contact day of the month (numeric)
* 11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
* 12 - duration: last contact duration, in seconds (numeric)
* 13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
* 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
* 15 - previous: number of contacts performed before this campaign and for this client (numeric)
* 16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Target variable:
* 17 - y: has the client subscribed to a term deposit? (binary: "yes","no")

### Set Up

We import the Python modules used in this solution.

In [2]:
# This code snippet will load the referenced package and return a DataFrame.
# If the code is run in a PySpark environment, then the code will return a
# Spark DataFrame. If not, the code will return a Pandas DataFrame. You can
# copy this code snippet to another code file as needed.

# Import python module

import pickle
import sys
import os

import dataprep
from dataprep.Package import Package

import pandas as pd
import numpy as np
import csv

from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from sklearn import grid_search
from sklearn import metrics

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.cross_validation import train_test_split
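The `sklearn.cross_validation` and `sklearn.grid_search` modules imported above were deprecated in scikit-learn 0.18 and removed in 0.20; their contents moved to `sklearn.model_selection`. If you run this notebook on a newer release, a guarded import along the following lines is one possible workaround. It is a sketch rather than part of the original notebook, and the `grid_search` alias is an assumption made only so that later cells calling `grid_search.GridSearchCV` can stay unchanged.

```python
try:
    # scikit-learn 0.18 and later
    from sklearn import model_selection as grid_search
    from sklearn.model_selection import train_test_split
except ImportError:
    # Older releases, as used when this notebook was written
    from sklearn import grid_search
    from sklearn.cross_validation import train_test_split
```

Other defaults have also changed between releases (for example, the default number of cross-validation folds and the default `LogisticRegression` solver, which may need `solver='liblinear'` to accept the `l1` penalty), so results on a newer version may differ from the outputs shown in this notebook.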

### Step 1 - Data Preparation

We first retrieve the data as a data frame from the .dprep package that we created with the data source wizard, then print the first few rows using head().

In [3]:
# Create the outputs folder

os.makedirs('./outputs', exist_ok=True)

In [4]:
# Load the bank dataset

with Package.open_package('BankMarketCampaignTrainingSample.dprep') as pkg:
    df = pkg.dataflows[0].get_dataframe()
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30.0,unemployed,married,primary,no,1787.0,no,no,cellular,19.0,oct,79.0,1.0,-1.0,0.0,unknown,no
1,33.0,services,married,secondary,no,4789.0,yes,yes,cellular,11.0,may,220.0,1.0,339.0,4.0,failure,no
2,35.0,management,single,tertiary,no,1350.0,yes,no,cellular,16.0,apr,185.0,1.0,330.0,1.0,failure,no
3,30.0,management,married,tertiary,no,1476.0,yes,yes,unknown,3.0,jun,199.0,4.0,-1.0,0.0,unknown,no
4,59.0,blue-collar,married,secondary,no,0.0,yes,no,unknown,5.0,may,226.0,1.0,-1.0,0.0,unknown,no


### Step 2 - Feature Engineering
### Column encoding

Convert the categorical variables into dummy/indicator variables using pandas.get_dummies. We also prefix each new column with the name of its source column so that no two columns share a name.

In [5]:
# List the columns to encode

columns_to_encode = list(df.select_dtypes(include=['category','object']))
columns_to_encode

['job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'poutcome',
 'y']

In [6]:
# Show unique values for columns to encode

for column_to_encode in columns_to_encode:
    g = df.groupby(column_to_encode)
    print(g[column_to_encode].nunique())

job
admin.           1
blue-collar      1
entrepreneur     1
housemaid        1
management       1
retired          1
self-employed    1
services         1
student          1
technician       1
unemployed       1
unknown          1
Name: job, dtype: int64
marital
divorced    1
married     1
single      1
Name: marital, dtype: int64
education
primary      1
secondary    1
tertiary     1
unknown      1
Name: education, dtype: int64
default
no     1
yes    1
Name: default, dtype: int64
housing
no     1
yes    1
Name: housing, dtype: int64
loan
no     1
yes    1
Name: loan, dtype: int64
contact
cellular     1
telephone    1
unknown      1
Name: contact, dtype: int64
month
apr    1
aug    1
dec    1
feb    1
jan    1
jul    1
jun    1
mar    1
may    1
nov    1
oct    1
sep    1
Name: month, dtype: int64
poutcome
failure    1
other      1
success    1
unknown    1
Name: poutcome, dtype: int64
y
no     1
yes    1
Name: y, dtype: int64


In [7]:
# Encode columns

for column_to_encode in columns_to_encode:
    dummies = pd.get_dummies(df[column_to_encode])
    one_hot_col_names = []
    for col_name in list(dummies.columns):
        one_hot_col_names.append(column_to_encode + '_' + col_name)
    dummies.columns = one_hot_col_names
    df = df.drop(column_to_encode, axis=1)
    df = df.join(dummies)
    
df.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,...,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,y_no,y_yes
0,30.0,1787.0,19.0,79.0,1.0,-1.0,0.0,0,0,0,...,0,0,1,0,0,0,0,1,1,0
1,33.0,4789.0,11.0,220.0,1.0,339.0,4.0,0,0,0,...,1,0,0,0,1,0,0,0,1,0
2,35.0,1350.0,16.0,185.0,1.0,330.0,1.0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
3,30.0,1476.0,3.0,199.0,4.0,-1.0,0.0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
4,59.0,0.0,5.0,226.0,1.0,-1.0,0.0,0,1,0,...,1,0,0,0,0,0,0,1,1,0


In [8]:
# Keep only one response variable

df = df.drop('y_no', axis=1)
df.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,...,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,y_yes
0,30.0,1787.0,19.0,79.0,1.0,-1.0,0.0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,33.0,4789.0,11.0,220.0,1.0,339.0,4.0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
2,35.0,1350.0,16.0,185.0,1.0,330.0,1.0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,30.0,1476.0,3.0,199.0,4.0,-1.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,59.0,0.0,5.0,226.0,1.0,-1.0,0.0,0,1,0,...,0,1,0,0,0,0,0,0,1,0
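The encoding loop above can also be written more compactly, because `pandas.get_dummies` accepts a `columns` argument and prefixes each new column with its source column name by default. The sketch below uses a small toy frame with made-up values; only the column names are taken from the dataset, and the explicit `categorical` list is an assumption standing in for `columns_to_encode`.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "marital": ["married", "married", "single"],
    "y": ["no", "no", "yes"],
})

# Columns to encode, listed explicitly for this sketch.
categorical = ["marital", "y"]

# Each new column is named <source column>_<value>, so names stay unique.
encoded = pd.get_dummies(toy, columns=categorical)

# Keep a single response column, as in the cell above.
encoded = encoded.drop("y_no", axis=1)
print(list(encoded.columns))
```

Depending on the pandas version, the indicator columns may be stored as integers or booleans; either works as model input.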


### Step 3 - Model Training and Evaluation
We train Logistic Regression, Support Vector Machine, and Decision Tree classifiers, then validate each model on the testing dataset.

### Split Data for Training and Testing
First, we split the data into training and testing datasets.

In [9]:
# Split the data into training and testing datasets

train, test = train_test_split(df, test_size = 0.2, random_state=99)

In [10]:
# Specify the values of label and features in training and testing datasets.

train_y = train['y_yes'].values

train_x = train.drop('y_yes', axis=1)
train_x = train_x.values

test_y = test['y_yes'].values
test_x = test.drop('y_yes', axis=1).values
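The positive class is rare in this data, so a purely random split can leave the training and testing sets with slightly different class ratios. As an optional variation (not what the notebook does), `train_test_split` in `sklearn.model_selection` accepts a `stratify` argument that preserves the label proportions. The sketch below uses synthetic stand-in data with hypothetical values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 10% positive labels (hypothetical values).
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 10 + [0] * 90)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=99, stratify=y
)

# Compare the positive rate in each part with the 10% rate of the full data.
print(y_train.mean(), y_test.mean())
```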

### Initialize Sweep Parameter
To find the optimal parameters, we specify the parameter dictionary for each classifier. Usually a classifier will have several parameters to tune in order to get the best performance. 

In [11]:
# Make sweep parameter dictionary

# Logistic Regression

def make_sweep_parameter_lr_dict():
    '''
    :return: parameters of logistic regression model to tune
    '''
    return {"penalty": ("l1", "l2"), "C": [0.1, 1, 10]}

In [12]:
# Support Vector Machine

def make_sweep_parameter_svm_dict():
    '''
    :return: parameters of support vector machine model to tune
    '''
    return {"gamma": [0.0001, 0.01, 1, 100], "C": [1]}

In [13]:
# Decision Tree

def make_sweep_parameter_dt_dict():
    '''
    :return: parameters of decision tree model to tune
    '''
    return {'max_depth': [4, 8, 16, 32]}
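A grid search evaluates every combination of the listed values, so the size of each dictionary determines how many candidate models are trained. The plain-Python sketch below enumerates the combinations for the three grids defined above. The `expand` helper is hypothetical and only illustrates the idea; it is not how scikit-learn implements the search.

```python
from itertools import product

# The three grids returned by the functions above.
grids = {
    "lr": {"penalty": ("l1", "l2"), "C": [0.1, 1, 10]},
    "svm": {"gamma": [0.0001, 0.01, 1, 100], "C": [1]},
    "dt": {"max_depth": [4, 8, 16, 32]},
}

def expand(grid):
    """Return every parameter combination in a grid as a list of dicts."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

for name, grid in grids.items():
    combos = expand(grid)
    print(name, len(combos), "combinations, first:", combos[0])
```

With k-fold cross validation, each combination is fitted k times, so small grids like these keep the search inexpensive.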

### Sweep Parameters with Each Classifier using Cross Validation
Use 3-fold cross validation to find the optimal parameters for each algorithm, then train the algorithm with all training data and the optimal parameters. 3-fold cross validation first splits the training set into 3 equal-sized subsamples. One subsample is retained as the validation set for testing the model, and the remaining 2 subsamples are used as training data. The process is repeated 3 times, with each of the 3 subsamples used exactly once as the validation data.

In [14]:
def sweep_classifier_cross_val(train_feature, train_label, model_name, parameter_dict):
    '''
    :param train_feature: feature of training set
    :param train_label: label of training set
    :param model_name: estimator instance to tune
    :param parameter_dict: parameters of the model to tune
    :return: trained model
    '''
    classifier = grid_search.GridSearchCV(model_name, parameter_dict)
    classifier.fit(train_feature, train_label)
    return classifier
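To make the 3-fold procedure described above concrete, the following plain-Python sketch partitions sample indices into folds and shows that each sample is used for validation exactly once. The function name `kfold_indices` is hypothetical, and the sketch deliberately omits shuffling and stratification; for classifiers, scikit-learn uses a stratified variant by default.

```python
def kfold_indices(n_samples, n_folds=3):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in kfold_indices(9):
    print("train:", train_idx, "validate:", val_idx)
```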

### Validate Model using Testing Data
The trained model computes a scored label for each record in the test set. These scored labels are then used to calculate the evaluation metrics, including the confusion matrix, accuracy, precision, recall, and F-score.

In [15]:
def prediction(classifier, test_feature):
    '''
    :param classifier: the trained classifier
    :param test_feature: feature of test set
    :return: scored label of test set
    '''
    predicted_label = classifier.predict(test_feature)
    return predicted_label

### Compute Metrics
Compute the evaluation metrics based on the label and the scored label of the test set.

In [16]:
def calc_metrics(test_label, predicted_label):
    '''
    :param test_label: label of test set
    :param predicted_label: scored label of test set
    :return: evaluation metrics
    '''
    cm = confusion_matrix(test_label, predicted_label)
    acc = accuracy_score(test_label, predicted_label)
    pre = precision_score(test_label, predicted_label)
    rec = recall_score(test_label, predicted_label)
    f1 = f1_score(test_label, predicted_label)
    return cm, acc, pre, rec, f1

### Put it all together
We use 3-fold cross validation to find the optimal parameters for each algorithm, train the model with those parameters, test its performance on the test set, and finally save the optimal models.

In [17]:
# Train and evaluate logistic regression

model_name = LogisticRegression()
parameter_dict = make_sweep_parameter_lr_dict()
lrf = sweep_classifier_cross_val(train_feature=train_x, 
                                 train_label=train_y, 
                                 model_name=model_name, 
                                 parameter_dict=parameter_dict)
print(lrf)

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10], 'penalty': ('l1', 'l2')},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


In [18]:
predicted = prediction(lrf, test_x)

In [19]:
cm, acc, pre, rec, f1 = calc_metrics(test_y, predicted)

In [20]:
print('Confusion Matrix:')
print(cm)
print('Accuracy: {:.2f}'.format(acc))
print('Precision: {:.2f}'.format(pre))
print('Recall: {:.2f}'.format(rec))
print('F1: {:.2f}'.format(f1))

Confusion Matrix:
[[778  24]
 [ 61  42]]
Accuracy: 0.91
Precision: 0.64
Recall: 0.41
F1: 0.50
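The reported scores can be reproduced by hand from the confusion matrix, which scikit-learn lays out with true labels as rows and predicted labels as columns. The short sketch below recomputes them from the four counts printed above; the variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 778, 24
fn, tp = 61, 42

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted "yes" that were correct
recall = tp / (tp + fn)      # share of actual "yes" that were found
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy: {:.2f}'.format(accuracy))
print('Precision: {:.2f}'.format(precision))
print('Recall: {:.2f}'.format(recall))
print('F1: {:.2f}'.format(f1))
```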


In [21]:
# Train and evaluate support vector machine

model_name = SVC(kernel="rbf")
parameter_dict = make_sweep_parameter_svm_dict()
svf = sweep_classifier_cross_val(train_feature=train_x, 
                                 train_label=train_y, 
                                 model_name=model_name, 
                                 parameter_dict=parameter_dict)
print(svf)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [1], 'gamma': [0.0001, 0.01, 1, 100]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


In [23]:
predicted = prediction(svf, test_x)

In [24]:
cm, acc, pre, rec, f1 = calc_metrics(test_y, predicted)

print('Confusion Matrix:')
print(cm)
print('Accuracy: {:.2f}'.format(acc))
print('Precision: {:.2f}'.format(pre))
print('Recall: {:.2f}'.format(rec))
print('F1: {:.2f}'.format(f1))

Confusion Matrix:
[[794   8]
 [ 99   4]]
Accuracy: 0.88
Precision: 0.33
Recall: 0.04
F1: 0.07


In [26]:
# Train and evaluate decision tree

model_name = DecisionTreeClassifier(min_samples_split=20, random_state=0)
parameter_dict = make_sweep_parameter_dt_dict()
dtf = sweep_classifier_cross_val(train_feature=train_x, 
                                 train_label=train_y, 
                                 model_name=model_name, 
                                 parameter_dict=parameter_dict)
print(dtf)

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=20, min_weight_fraction_leaf=0.0,
            presort=False, random_state=0, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_depth': [4, 8, 16, 32]}, pre_dispatch='2*n_jobs',
       refit=True, scoring=None, verbose=0)


In [27]:
predicted = prediction(dtf, test_x)

cm, acc, pre, rec, f1 = calc_metrics(test_y, predicted)

print('Confusion Matrix:')
print(cm)
print('Accuracy: {:.2f}'.format(acc))
print('Precision: {:.2f}'.format(pre))
print('Recall: {:.2f}'.format(rec))
print('F1: {:.2f}'.format(f1))

Confusion Matrix:
[[785  17]
 [ 74  29]]
Accuracy: 0.90
Precision: 0.63
Recall: 0.28
F1: 0.39
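Because most test records are negative, it helps to compare each accuracy with a trivial baseline that always predicts "no". The sketch below derives that baseline and the three accuracies from the confusion matrices reported above; the dictionary keys and the `accuracy` helper are illustrative names, not part of the original notebook.

```python
# Confusion matrices reported above, laid out as [[TN, FP], [FN, TP]].
results = {
    "logistic regression": [[778, 24], [61, 42]],
    "support vector machine": [[794, 8], [99, 4]],
    "decision tree": [[785, 17], [74, 29]],
}

def accuracy(cm):
    """Fraction of test records on the diagonal of the confusion matrix."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)

# A model that always predicts "no" is correct on every negative test record.
(tn, fp), (fn, tp) = results["logistic regression"]
baseline = (tn + fp) / (tn + fp + fn + tp)

print('Always-"no" baseline: {:.3f}'.format(baseline))
for name, cm in results.items():
    print('{}: {:.3f}'.format(name, accuracy(cm)))
```

On these numbers the support vector machine does not beat the baseline on accuracy, which is why precision, recall, and F1 are the more informative metrics for this dataset.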


In [30]:
# Serialize the models to disk in the special 'outputs' folder

print("Export the lr model to lr.pkl")
with open('./outputs/lr.pkl', 'wb') as f:
    pickle.dump(lrf, f)

print("Export the svf model to sv.pkl")
with open('./outputs/sv.pkl', 'wb') as f:
    pickle.dump(svf, f)

print("Export the dtf model to dt.pkl")
with open('./outputs/dt.pkl', 'wb') as f:
    pickle.dump(dtf, f)

# Load the models back from the 'outputs' folder into memory

print("Import the model from lr.pkl")
with open('./outputs/lr.pkl', 'rb') as f2:
    lrf2 = pickle.load(f2)

print("Import the model from sv.pkl")
with open('./outputs/sv.pkl', 'rb') as f2:
    svf2 = pickle.load(f2)

print("Import the model from dt.pkl")
with open('./outputs/dt.pkl', 'rb') as f2:
    dtf2 = pickle.load(f2)

Export the lr model to lr.pkl
Export the svf model to sv.pkl
Export the dtf model to dt.pkl
Import the model from lr.pkl
Import the model from sv.pkl
Import the model from dt.pkl
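After reloading a model, it is worth confirming that it scores data exactly like the original. The sketch below shows one way to check a pickle round trip, using a small decision tree trained on synthetic values as a stand-in for the notebook's models; the file name and data are assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic training set (hypothetical values) standing in for train_x / train_y.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Write the model to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should score rows exactly like the original.
same = list(model.predict(X)) == list(restored.predict(X))
print('Predictions match:', same)
```

Pickled scikit-learn models are generally tied to the library version that created them, so it is safest to load them with the same version used for training.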
