# Bank Marketing Campaign: Predict Term Deposit

## Classification Project - Feature Engineering and Model Building

### Author: Andrew McNall - mcnallanalytics@protonmail.com

In [26]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, make_scorer, fbeta_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [27]:
data = pd.read_csv(r'Data\bank-project-data.csv')

## Feature Engineering

During the exploratory data analysis (see the EDA notebook) we saw that the 'age' distribution is right-skewed. We'll apply a log transformation to this feature to reduce the effect of outliers.

In [28]:
# log transform age

data['log_age'] = np.log(data['age'])

data = data.drop(['age'], axis = 1)

data.head()

Unnamed: 0.1,Unnamed: 0,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit,log_age
0,0,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,1,4.077537
1,1,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,1,4.025352
2,2,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,1,3.713572
3,3,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,1,4.007333
4,4,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,1,3.988984
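
To make the effect of the transformation concrete, the following sketch compares skewness before and after `np.log`. The `ages` series is a hypothetical log-normal sample, not the campaign data, so the numbers only illustrate the direction of the change.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for 'age' (not the real data)
rng = np.random.default_rng(0)
ages = pd.Series(rng.lognormal(mean=3.7, sigma=0.3, size=10_000))

skew_raw = ages.skew()          # positive for a right-skewed distribution
skew_log = np.log(ages).skew()  # close to zero for log-normal data

print(f"skew before log: {skew_raw:.2f}")
print(f"skew after log:  {skew_log:.2f}")
```

Running the same two `.skew()` calls on `data['age']` before it is dropped would show whether the real feature behaves the same way.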


Earlier we noted that 'duration' has the potential to affect our models, but the duration of a call cannot be known before the customer is contacted. We'll therefore drop this feature from the modeling data.

In [29]:
# drop 'duration' column

data = data.drop(['duration'], axis = 1)

data.head()

Unnamed: 0.1,Unnamed: 0,job,marital,education,default,balance,housing,loan,contact,day,month,campaign,pdays,previous,poutcome,deposit,log_age
0,0,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1,-1,0,unknown,1,4.077537
1,1,admin.,married,secondary,no,45,no,no,unknown,5,may,1,-1,0,unknown,1,4.025352
2,2,technician,married,secondary,no,1270,yes,no,unknown,5,may,1,-1,0,unknown,1,3.713572
3,3,services,married,secondary,no,2476,yes,no,unknown,5,may,1,-1,0,unknown,1,4.007333
4,4,admin.,married,tertiary,no,184,no,no,unknown,5,may,2,-1,0,unknown,1,3.988984


In [30]:
# Update numerical variables after dropping 'duration'

numerical_vars = ['log_age', 'balance', 'day', 'campaign', 'pdays', 'previous']

# Normalize our numerical features
# Initialize a scaler, then apply it to the features

scaler = MinMaxScaler() # default = (0, 1)

# Applying MinMax transformation to the numerical variables

model_data = pd.DataFrame(data = data)
model_data[numerical_vars] = scaler.fit_transform(model_data[numerical_vars])

# Show an example of a record with scaling applied
model_data.head()

Unnamed: 0.1,Unnamed: 0,job,marital,education,default,balance,housing,loan,contact,day,month,campaign,pdays,previous,poutcome,deposit,log_age
0,0,admin.,married,secondary,no,0.104371,yes,no,unknown,0.133333,may,0.0,0.0,0.0,unknown,1,0.713653
1,1,admin.,married,secondary,no,0.078273,no,no,unknown,0.133333,may,0.0,0.0,0.0,unknown,1,0.682282
2,2,technician,married,secondary,no,0.092185,yes,no,unknown,0.133333,may,0.0,0.0,0.0,unknown,1,0.494859
3,3,services,married,secondary,no,0.105882,yes,no,unknown,0.133333,may,0.0,0.0,0.0,unknown,1,0.671451
4,4,admin.,married,tertiary,no,0.079851,no,no,unknown,0.133333,may,0.016129,0.0,0.0,unknown,1,0.66042
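
With its default range of (0, 1), `MinMaxScaler` maps each value to (x - min) / (max - min), column by column. The sketch below reproduces two scaled values from the table above by hand. The column ranges used (1 to 31 for 'day', 1 to 63 for 'campaign') are assumptions chosen because they are consistent with the printed output; the notebook does not state them.

```python
def min_max_scale(x, x_min, x_max):
    """Map x from [x_min, x_max] onto [0, 1], the default MinMaxScaler range."""
    return (x - x_min) / (x_max - x_min)

# Assumed column ranges (not stated in the notebook)
day_scaled = min_max_scale(5, 1, 31)       # day = 5 in the first rows
campaign_scaled = min_max_scale(2, 1, 63)  # campaign = 2 in row 4

print(round(day_scaled, 6))       # 0.133333
print(round(campaign_scaled, 6))  # 0.016129
```

Note that the scaler here is fit on the full data set before the train/test split; fitting it on the training rows only is a common alternative when the test set should stay fully unseen.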


In [31]:
# one-hot encode the categorical variables

model_df = pd.get_dummies(model_data)

model_df.head()

Unnamed: 0.1,Unnamed: 0,balance,day,campaign,pdays,previous,deposit,log_age,job_admin.,job_blue-collar,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,0,0.104371,0.133333,0.0,0.0,0.0,1,0.713653,1,0,...,0,0,1,0,0,0,0,0,0,1
1,1,0.078273,0.133333,0.0,0.0,0.0,1,0.682282,1,0,...,0,0,1,0,0,0,0,0,0,1
2,2,0.092185,0.133333,0.0,0.0,0.0,1,0.494859,0,0,...,0,0,1,0,0,0,0,0,0,1
3,3,0.105882,0.133333,0.0,0.0,0.0,1,0.671451,0,0,...,0,0,1,0,0,0,0,0,0,1
4,4,0.079851,0.133333,0.016129,0.0,0.0,1,0.66042,1,0,...,0,0,1,0,0,0,0,0,0,1
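
As a small illustration of what `pd.get_dummies` does, the sketch below encodes a hypothetical three-row frame (`toy` is not part of the project data). Numeric columns pass through unchanged, while each category of a string column becomes its own indicator column.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    "balance": [0.10, 0.08, 0.09],
    "marital": ["married", "single", "married"],
})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # ['balance', 'marital_married', 'marital_single']
```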


## Model Selection

In [32]:
# Create training and testing sets 

X, y = model_df.drop('deposit', axis = 1), model_df.deposit

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                  test_size = 0.20,
                                                  random_state = 0,
                                                  stratify = y)


In [33]:
# verify training and testing sets

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)


(8929, 51)
(8929,)
(2233, 51)
(2233,)
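
The `stratify = y` argument asks `train_test_split` to keep the class proportions the same in both splits. As a quick arithmetic check, the positive-class rates can be computed from the class counts that appear in the confusion matrices later in this notebook.

```python
# Positive-class counts and totals read from the confusion matrices
# reported later in this notebook (row sums of the second row)
train_pos, train_total = 4231, 8929
test_pos, test_total = 1058, 2233

train_rate = train_pos / train_total
test_rate = test_pos / test_total

print(f"positive rate, train: {train_rate:.4f}")  # 0.4738
print(f"positive rate, test:  {test_rate:.4f}")   # 0.4738
```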


### Establish Baseline with Logistic Regression Model

We'll start by building a Logistic Regression model to establish a baseline for performance.

In [34]:
from sklearn.model_selection import GridSearchCV

parameters = {"C": [0.001, 0.01, 0.1, 1, 10 , 100, 1000]}

model = LogisticRegression(solver = 'liblinear', max_iter = 500, random_state = None)

# fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta = 0.5)

# Grid Search on the classifier using 'scorer' as the scoring method
grid = GridSearchCV(model, param_grid = parameters, scoring = scorer)

# Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid.fit(X_train, y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

# Make predictions on the training data using the optimized model
best_predictions = best_clf.predict(X_train)

# Report the scores (computed on the training data)
print("\nOptimized Model\n------")
print("Final accuracy score on the training data: {:.4f}".format(accuracy_score(y_train, best_predictions)))
print("Final F-score on the training data: {:.4f}".format(fbeta_score(y_train, best_predictions, beta = 0.5)))
print(best_clf)


Optimized Model
------
Final accuracy score on the training data: 0.9733
Final F-score on the training data: 0.9820
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


In [35]:
# Confusion matrix
confmat = confusion_matrix(y_train, best_predictions)
print("The Confusion matrix:\n", confmat)
print("Precision Score:", round(precision_score(y_train, best_predictions), 2))
print("Recall Score:", round(recall_score(y_train, best_predictions), 2))

The Confusion matrix:
 [[4654   44]
 [ 194 4037]]
Precision Score: 0.99
Recall Score: 0.95
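
The reported scores can be recomputed by hand from the confusion matrix above, which also shows how beta = 0.5 weights precision more heavily than recall. This is a plain-Python sketch using the standard definitions; the variable names are illustrative.

```python
# Counts from the training confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 4654, 44, 194, 4037

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# F-beta: beta < 1 favors precision, beta > 1 favors recall
beta = 0.5
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(round(precision, 2), round(recall, 2))  # 0.99 0.95
print(round(accuracy, 4), round(f_beta, 4))   # 0.9733 0.982
```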


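Our first grid search selects C = 0.01. Next we search a much narrower range around that value to refine the parameter. Note that this second model does not set a solver, so it uses the default 'lbfgs' rather than 'liblinear', and the two runs differ in more than the C grid.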

In [36]:
parameters = {"C": [0.002, 0.004, 0.006, 0.008, 0.01, 0.03, 0.05, 0.07, 0.09]}

model = LogisticRegression(random_state = 25, penalty = 'l2', max_iter = 500)

# fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta = 0.5)

# Grid Search on the classifier using 'scorer' as the scoring method
grid = GridSearchCV(model, param_grid = parameters, scoring = scorer)

# Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid.fit(X_train, y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

# Make predictions on the training data using the optimized model
best_predictions = best_clf.predict(X_train)

# Report the scores (computed on the training data)
print("\nOptimized Model\n------")
print("Final accuracy score on the training data: {:.4f}".format(accuracy_score(y_train, best_predictions)))
print("Final F-score on the training data: {:.4f}".format(fbeta_score(y_train, best_predictions, beta = 0.5)))
print(best_clf)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Optimized Model
------
Final accuracy score on the training data: 1.0000
Final F-score on the training data: 1.0000
LogisticRegression(C=0.004, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=25, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
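
The convergence warning above suggests scaling the data. One way to make sure every feature is scaled, and that the scaler is fit only on each cross-validation training fold, is to put the scaler and the classifier in a pipeline and grid-search over that. The sketch below uses synthetic data from `make_classification` in place of `X_train` and `y_train`; `X_demo`, `y_demo`, `pipe`, and `search` are illustrative names, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data (assumption, not the campaign data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Scaling happens inside the pipeline, so it is refit on each CV training fold
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=500))

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10]}
scorer = make_scorer(fbeta_score, beta=0.5)

search = GridSearchCV(pipe, param_grid=param_grid, scoring=scorer, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
```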


With the narrower grid, the search selects C = 0.004, and the refit model reaches an accuracy and F-score of 1.0000 on the training data. Let's check the confusion matrix, precision, and recall.

In [37]:
# Confusion matrix
confmat = confusion_matrix(y_train, best_predictions)
print("The Confusion matrix:\n", confmat)
print("Precision Score:", round(precision_score(y_train, best_predictions), 2))
print("Recall Score:", round(recall_score(y_train, best_predictions), 2))

The Confusion matrix:
 [[4698    0]
 [   0 4231]]
Precision Score: 1.0
Recall Score: 1.0


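The optimized model shows perfect precision and recall on the training data: no false positives and no false negatives.

Given our business problem, identifying prospective customers who will respond to a marketing campaign by making a term deposit, this looks like the optimal outcome. Perfect scores are unusual on real data, though, and the feature matrix still contains the 'Unnamed: 0.1' and 'Unnamed: 0' columns, which look like saved row indices and could encode the target if the file is ordered by 'deposit'. Let's see how the model performs on the test data.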

In [38]:
test_predictions = best_clf.predict(X_test)

c_matrix = confusion_matrix(y_test, test_predictions)
print("The Confusion matrix:\n", c_matrix)
print("Precision Score:", round(precision_score(y_test, test_predictions), 2))
print("Recall Score:", round(recall_score(y_test, test_predictions), 2))

The Confusion matrix:
 [[1175    0]
 [   0 1058]]
Precision Score: 1.0
Recall Score: 1.0
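
Identical perfect scores on the training and test data are worth a leakage check before drawing conclusions. The feature matrix still includes 'Unnamed: 0.1' and 'Unnamed: 0', which look like row indices written out when the file was saved. If the rows are ordered by 'deposit' (an assumption; the first five rows shown earlier all have deposit = 1), those columns alone would separate the classes. The sketch below runs the check on a hypothetical toy frame and then drops such columns; `toy` and its sizes are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows ordered by target, plus a saved row-index column
n_pos, n_neg = 50, 60
toy = pd.DataFrame({
    "Unnamed: 0": np.arange(n_pos + n_neg),
    "deposit": [1] * n_pos + [0] * n_neg,
})

# If every positive row has a smaller index than every negative row,
# a single threshold on the index classifies the data perfectly.
max_pos = toy.loc[toy["deposit"] == 1, "Unnamed: 0"].max()
min_neg = toy.loc[toy["deposit"] == 0, "Unnamed: 0"].min()
index_separates_classes = bool(max_pos < min_neg)
print(index_separates_classes)  # True

# Drop index-like columns along with the target before building features
index_cols = [c for c in toy.columns if c.startswith("Unnamed")]
features = toy.drop(columns=index_cols + ["deposit"])
print(list(features.columns))   # []
```

Applying the same `max_pos < min_neg` check to `data` would show whether the real file has this ordering. If it does, rebuilding `X` without the index columns and rerunning the grid search would give a more trustworthy baseline.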
