Direct marketing campaigns by businesses like banks and other financial institutions can be much more effective than mass campaigns. But businesses have very limited resources to perform such direct marketing campaign and using BI and Data Mining techniques to enhance the quality of such campains is very important. This was a study done to look at BI and DM tools/techniques to increase the effectiveness of direct marketing campaigns.
- To find a model that can explain success of a contact
- Success is when the client subscribes the deposit
- To find what the main characteristics that affect success.
- Help in a better management of the available resource (Human Effort, phone calls, time)
- Selection of high quality and affordable set of potential buying customers.
- Offer using direct marketing campaigns, attractive long-term deposit applications
- Good interest rates
- Lesser contacts
- No of success
The dataset comes from the UCI Machine Learning repository here. The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaings.
This project attempts to use the data from the source above and applies various ML classifiers and compares and contrasts them.
Link to Jupyter notebook : https://github.com/1kit/module_17/blob/main/prompt_III.ipynb
The following input variables are present in the data provided
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
The output variable (target) is
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
The data has a number of categorical features
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 41188 non-null int64
1 job 41188 non-null object
2 marital 41188 non-null object
3 education 41188 non-null object
4 default 41188 non-null object
5 housing 41188 non-null object
6 loan 41188 non-null object
7 contact 41188 non-null object
8 month 41188 non-null object
9 day_of_week 41188 non-null object
10 duration 41188 non-null int64
11 campaign 41188 non-null int64
12 pdays 41188 non-null int64
13 previous 41188 non-null int64
14 poutcome 41188 non-null object
15 emp.var.rate 41188 non-null float64
16 cons.price.idx 41188 non-null float64
17 cons.conf.idx 41188 non-null float64
18 euribor3m 41188 non-null float64
19 nr.employed 41188 non-null float64
20 y 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB
There were no missing data. A very few number of duplicate entries were dropped.
Here is the spread of the input data with respect to the output classes
NO - 88.7%
YES - 11.2%
NOTE : There is a significant amout of mismatch with only 11% of the data resulted in the customer accepting the product.
Before we use the data, here are some encoding done on the input features
Here are some unique values for the categorical features
job => ['housemaid' 'services' 'admin.' 'blue-collar' 'technician' 'retired'
'management' 'unemployed' 'self-employed' 'unknown' 'entrepreneur'
'student']
marital => ['married' 'single' 'divorced' 'unknown']
education => ['basic.4y' 'high.school' 'basic.6y' 'basic.9y' 'professional.course'
'unknown' 'university.degree' 'illiterate']
default => ['no' 'unknown' 'yes']
housing => ['no' 'yes' 'unknown']
loan => ['no' 'yes' 'unknown']
contact => ['telephone' 'cellular']
month => ['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'mar' 'apr' 'sep']
day_of_week => ['mon' 'tue' 'wed' 'thu' 'fri']
poutcome => ['nonexistent' 'failure' 'success']
y => ['no' 'yes']
Perform the following simple transformations
- Column
default: Treat no and unknown as 0 and yes as 1 - Column
housing: Treat no and unknown as 0 and yes as 1 - Column
loan: Treat no and unknown as 0 and yes as 1 - Drop column
contact: It does not seem very interesting as a predictor - Drop column
poutcome: This does not seem very interesting as a predictor - Column
y: Treat no as 0 and yes as 1 - Other Categorical Features : Do a 1-hot encoding for the rest of the categorical features
- Column
pdays: 96% of the pdays column is at 999. We can drop this - Column
duration: As specified it highly affects the output target and cannot be known apriori.
The input data was split into a train/test for holdout testing. The test data represents 25% of the overall data. The stratify option was used to ensure that the baseline model accuracy is same for both test and training data
A baseline model is one which predicts the majority class for all data points.
The accuracy score for the baseline (on both test/train data) = 88.7%
Initially four simple classifier models were trained and here are the results
To improve the models, we need to consider what is important for this use-case.
As part of grid-search, we need to optimize on reducing both FalsePositives (FP) and FalseNegatives (FN). This is because of the following:
- Cost of FP : This could be the marketing spend wasted on targeting customers who dont take the product
- Cost of FN : This could be the potential revenue lost from missing out on customers who would take the product.
The above requires that we tradeoff between precision-recall. The alternate way this can be achieved is by comparing area under the ROC curve. Hence the roc_auc option of scoring is used in GridSearch.
-
One way to do feature engineering is to use SelectFromModel() with LogisticRegression and L1 regularization to affect weights for features. This can be incorporated into the pipeline
-
The correlation matrix can also give us a good way to identify features that really matter to the output variable.
Here is a correlation heatmap
From the correlation matrix, the following features are in decreasing order of correlation with y
nr.employed 0.354669
euribor3m 0.307740
emp.var.rate 0.298289
previous 0.230202
month_mar 0.144027
month_oct 0.137538
cons.price.idx 0.136134
month_sep 0.126079
month_may 0.108278
job_student 0.093962
job_retired 0.092364
month_dec 0.079311
job_blue-collar 0.074431
campaign 0.066361
cons.conf.idx 0.054802
marital_single 0.054209
education_university.degree 0.050267
education_basic.9y 0.045152
marital_married 0.043476
month_jul 0.032344
job_services 0.032262
age 0.030381
education_basic.6y 0.023493
education_unknown 0.021476
day_of_week_mon 0.021241
job_entrepreneur 0.016651
job_unemployed 0.014749
day_of_week_thu 0.013797
housing 0.011804
month_nov 0.011779
month_jun 0.009193
month_aug 0.008778
day_of_week_tue 0.008123
education_high.school 0.007408
education_illiterate 0.007246
job_housemaid 0.006510
day_of_week_wed 0.006290
job_technician 0.006069
marital_unknown 0.005210
job_self-employed 0.004668
loan 0.004478
default 0.003042
education_professional.course 0.001071
job_management 0.000426
job_unknown 0.000154
NOTE-1: If performance is a concern, we could eliminate features that dont show too much correlation with y. I did not do that in this project.
NOTE-2: I also created a feature extractor that would run a L1 regularization to select features. But I ended up not using it because it made the GridSearch very slow
extractor = SelectFromModel(LogisticRegression(penalty='l1', solver = 'liblinear' ,random_state = 42))
A GridSearch was done over various parameters for the four models compared earlier. The GridSearch attempted to maximize on roc_auc (area under the ROC curve) as a way to tradeoff between recall and precision.
Classification Report: KNN
precision recall f1-score support
0 0.90 0.99 0.94 9134
1 0.59 0.10 0.17 1160
accuracy 0.89 10294
macro avg 0.74 0.55 0.56 10294
weighted avg 0.86 0.89 0.86 10294
Classification Report: DecisionTree
precision recall f1-score support
0 0.91 0.98 0.94 9134
1 0.56 0.24 0.34 1160
accuracy 0.89 10294
macro avg 0.74 0.61 0.64 10294
weighted avg 0.87 0.89 0.87 10294
Classification Report: Logistic Regression
precision recall f1-score support
0 0.90 0.98 0.94 9134
1 0.53 0.14 0.22 1160
accuracy 0.89 10294
macro avg 0.72 0.56 0.58 10294
weighted avg 0.86 0.89 0.86 10294
Classification Report: SVM
precision recall f1-score support
0 0.90 0.98 0.94 9134
1 0.57 0.18 0.28 1160
accuracy 0.89 10294
macro avg 0.74 0.58 0.61 10294
weighted avg 0.87 0.89 0.87 10294
Here are the ROC curves for the models evaluated above
Looking at the precision/recall tradeoff (using the area under the ROC curve)
- Both
Decision TreeandLogistic Regressionperform quite well. - The time taken to GridSearch and train for
Logistic Regressionwas lower KNNdid really great in terms of GridSearch+Train time and is not too bad in terms ofarea under the curveSVMdid poorly in all respects
Using one of the better performing models from above to predict new customers to target for direct marketing campaigns will lead to a much higher quality of customers. For future studies, the recommendation is to use models that improve the area under the ROC curve, which gives a good tradeoff.
It is believed that using these models would lead to:
- Effective campaigns
- Effective utilization of resources.
- Higher user engagement and conversion rates
- Improve business reputation and helps attract and retain customers
- Improve revenue









