# **BANK MARKETING CAMPAIGN**

# PROBLEM UNDERSTANDING

Context
The variety of financial products available to the public continues to grow, with term deposits being one of the most widely recognized options. Term deposits allow customers to place a specific amount of money in a bank or financial institution, with the condition that the funds can only be withdrawn after a predetermined period. In return, customers receive a fixed interest rate based on the amount deposited.

However, in a competitive financial market, banks must actively work to retain their customers and attract new ones. One effective strategy for gaining new customers is by implementing targeted marketing campaigns.

Target:

- 0: Did not open a term deposit.
- 1: Opened a term deposit.


**Problem Statement:**

Marketing campaigns for term deposit products can be time-consuming and resource-intensive if the bank targets all potential customers without proper filtering. To increase efficiency and effectiveness, the bank needs to identify customers who are most likely to open a term deposit.

If campaigns are conducted indiscriminately, they risk wasting resources on uninterested customers, reducing the overall return on investment.

**Goals:**
- Develop the ability to predict which customers are likely to open a term deposit.
- Focus marketing efforts on customers with a high probability of interest in term deposits to optimize resource allocation.
- Identify key factors or variables influencing a customer's decision to open a term deposit, enabling the bank to design more targeted and effective marketing strategies.

**Analytic Approach:**
- Analyze customer data to identify patterns and behaviors that distinguish customers who open term deposits from those who do not.
- Build a classification model to predict the likelihood of a customer opening a term deposit based on available data.
- Interpret the model to understand the significant factors influencing customer decisions and provide actionable insights for the marketing team.

# DATA UNDERSTANDING

## Attribute Information

[Customer Profile]

| Attribute       | Data Type, Length | Description                                                |
|------------------|-------------------|------------------------------------------------------------|
| age              | Integer           | Age of the customer.                                       |
| job              | Text              | Type of job the customer has.                              |
| balance          | Integer           | Customer's account balance.                                |
| housing          | Text          | Whether the customer has a housing loan (Yes/No).      |
| loan             | Text          | Whether the customer has a personal loan (Yes/No).     |



[Marketing Data]

| Attribute       | Data Type, Length | Description                                                |
|------------------|-------------------|------------------------------------------------------------|
| contact          | Text              | Contact communication type.                                |
| month            | Text              | Last contact month of the year.                           |
| campaign         | Integer           | Number of contacts performed during this campaign.         |
| pdays            | Integer           | Number of days since the client was last contacted.        |
| poutcome         | Text              | Outcome of the previous marketing campaign.               |
| deposit          | Text          | Whether the customer deposits or not (Yes/No).         |



## Data Ingestion

In [None]:
# Library

import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno
from IPython.display import display

# Feature Engineering
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

!pip install category_encoders
import category_encoders as ce

# Model Selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV,StratifiedKFold,train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Imbalance Dataset
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler

# Ignore Warning
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

# Set max columns
pd.set_option('display.max_columns', None)

Collecting category_encoders
  Downloading category_encoders-2.6.4-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading category_encoders-2.6.4-py2.py3-none-any.whl (82 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m82.0/82.0 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: category_encoders
Successfully installed category_encoders-2.6.4


Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [None]:
import gdown
import pandas as pd

# Correct Google Drive URL for CSV file
download_url = "https://drive.google.com/uc?id=1oOC2FSCPdiOc-qqiwEKzDq5H6m0Na0pf"

# Define the local filename
output_file = "data_bank_marketing_campaign.csv"

# Download and load the file into df
gdown.download(download_url, output_file, quiet=False)
df = pd.read_csv(output_file, sep=',', on_bad_lines='skip')

df.head(10)

Downloading...
From: https://drive.google.com/uc?id=1oOC2FSCPdiOc-qqiwEKzDq5H6m0Na0pf
To: /content/data_bank_marketing_campaign.csv
100%|██████████| 426k/426k [00:00<00:00, 7.26MB/s]


Unnamed: 0,age,job,balance,housing,loan,contact,month,campaign,pdays,poutcome,deposit
0,55,admin.,1662,no,no,cellular,jun,2,-1,unknown,yes
1,39,self-employed,-3058,yes,yes,cellular,apr,3,-1,unknown,yes
2,51,admin.,3025,no,no,cellular,may,1,352,other,yes
3,38,services,-87,yes,no,cellular,may,1,-1,unknown,no
4,36,housemaid,205,yes,no,telephone,nov,4,-1,unknown,no
5,41,admin.,-76,yes,no,cellular,apr,1,-1,unknown,no
6,37,admin.,4803,no,no,cellular,jan,2,-1,unknown,yes
7,36,technician,911,yes,yes,cellular,may,2,21,failure,yes
8,35,management,805,no,no,cellular,sep,1,-1,unknown,no
9,57,housemaid,0,no,no,unknown,jun,1,-1,unknown,no


## Data Inspection

### Check the data types and label

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7813 entries, 0 to 7812
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   age       7813 non-null   int64 
 1   job       7813 non-null   object
 2   balance   7813 non-null   int64 
 3   housing   7813 non-null   object
 4   loan      7813 non-null   object
 5   contact   7813 non-null   object
 6   month     7813 non-null   object
 7   campaign  7813 non-null   int64 
 8   pdays     7813 non-null   int64 
 9   poutcome  7813 non-null   object
 10  deposit   7813 non-null   object
dtypes: int64(4), object(7)
memory usage: 671.6+ KB


the data were suitable

### Check the typo (Unique Values)

In [None]:
for column_name in df:
  print(df[column_name].value_counts())
  print('\n')

age
31    351
32    344
30    329
35    324
33    317
     ... 
93      2
86      2
90      2
92      1
95      1
Name: count, Length: 75, dtype: int64


job
management       1792
blue-collar      1346
technician       1291
admin.            936
services          658
retired           540
self-employed     280
unemployed        249
student           247
entrepreneur      236
housemaid         184
unknown            54
Name: count, dtype: int64


balance
0       546
1        28
3        21
2        20
5        17
       ... 
1920      1
4101      1
824       1
4654      1
5473      1
Name: count, Length: 3153, dtype: int64


housing
no     4140
yes    3673
Name: count, dtype: int64


loan
no     6789
yes    1024
Name: count, dtype: int64


contact
cellular     5628
unknown      1639
telephone     546
Name: count, dtype: int64


month
may    1976
aug    1085
jul    1050
jun     857
apr     662
nov     657
feb     534
oct     286
jan     227
sep     212
mar     199
dec      68
Name: count

In [None]:
pd.set_option('display.max_colwidth', None)
# show the unique values in each column
listItem = []
for col in df.columns :
    listItem.append( [col, df[col].nunique(), df[col].unique()])

tabel1Desc = pd.DataFrame(columns=['Column Name', 'Number of Unique', 'Unique Sample'],
                     data=listItem)
tabel1Desc

Unnamed: 0,Column Name,Number of Unique,Unique Sample
0,age,75,"[55, 39, 51, 38, 36, 41, 37, 35, 57, 23, 33, 31, 53, 30, 46, 48, 25, 29, 28, 52, 49, 44, 42, 27, 47, 64, 26, 34, 56, 32, 58, 45, 54, 50, 79, 65, 40, 24, 60, 43, 61, 59, 62, 68, 82, 71, 73, 76, 69, 20, 72, 22, 67, 19, 70, 75, 63, 93, 77, 80, 66, 21, 87, 81, 92, 88, 84, 83, 78, 74, 18, 85, 95, 86, 90]"
1,job,12,"[admin., self-employed, services, housemaid, technician, management, student, blue-collar, entrepreneur, retired, unemployed, unknown]"
2,balance,3153,"[1662, -3058, 3025, -87, 205, -76, 4803, 911, 805, 0, 1234, 1107, 1170, 341, 4808, 88, 169, 863, 242, 2597, 4929, 277, 1438, 15, 3733, 204, 1684, 1025, 55, 19, 348, 785, 742, 511, 6651, 1612, 555, 54, 1185, 110, 950, 412, 228, 367, 3993, 2599, 3528, 32, 551, 3161, 533, 8725, 349, 514, 2688, -194, 154, 874, 2, 5953, 1269, -327, 235, 7, 2661, 1948, 20, 502, 193, 13658, 1716, 172, 1667, 157, 8, 951, 427, 241, 469, 2060, 7177, 655, -114, 588, -971, 4570, 250, 131, 93, 22, 15341, 356, 190, -124, 2228, -60, 376, 1567, 855, 4151, ...]"
3,housing,2,"[no, yes]"
4,loan,2,"[no, yes]"
5,contact,3,"[cellular, telephone, unknown]"
6,month,12,"[jun, apr, may, nov, jan, sep, feb, mar, aug, jul, oct, dec]"
7,campaign,32,"[2, 3, 1, 4, 5, 6, 7, 30, 8, 9, 11, 14, 10, 28, 63, 12, 24, 17, 15, 18, 19, 13, 21, 23, 22, 33, 16, 25, 26, 20, 29, 43]"
8,pdays,422,"[-1, 352, 21, 91, 186, 263, 96, 355, 294, 412, 89, 114, 276, 93, 175, 57, 323, 156, 86, 95, 271, 182, 289, 334, 269, 309, 144, 183, 417, 138, 254, 337, 171, 389, 87, 170, 165, 372, 247, 98, 196, 469, 272, 104, 63, 587, 336, 145, 130, 28, 202, 324, 147, 94, 328, 420, 179, 90, 81, 160, 298, 356, 357, 267, 430, 52, 181, 365, 237, 330, 103, 374, 75, 133, 321, 204, 782, 266, 197, 270, 318, 349, 187, 359, 490, 192, 227, 100, 168, 177, 251, 301, 350, 92, 184, 345, 290, 199, 333, 169, ...]"
9,poutcome,4,"[unknown, other, failure, success]"


In [None]:
df['age'].max()

95

[Plan to do]
- 'month' column: expand the month value (e.g., 'jan' -> 'January')
- 'deposit' column: convert 'yes' to '1' and 'no' to '0'

### Check the Missing Values

In [None]:
df.isna().sum()

Unnamed: 0,0
age,0
job,0
balance,0
housing,0
loan,0
contact,0
month,0
campaign,0
pdays,0
poutcome,0


No missing values were found

### Check the Duplicates

In [None]:
df.duplicated().sum()

8

In [None]:
df[df.duplicated()]

Unnamed: 0,age,job,balance,housing,loan,contact,month,campaign,pdays,poutcome,deposit
2944,40,blue-collar,0,yes,no,unknown,may,2,-1,unknown,no
4368,60,management,0,no,no,cellular,aug,3,-1,unknown,yes
4874,41,management,0,no,no,cellular,aug,2,-1,unknown,no
5326,44,blue-collar,0,yes,no,cellular,jul,1,-1,unknown,no
5609,39,technician,0,yes,no,unknown,may,1,-1,unknown,no
5681,38,technician,0,no,no,cellular,aug,2,-1,unknown,no
5905,34,management,0,no,no,cellular,aug,2,-1,unknown,no
7077,30,blue-collar,239,yes,no,unknown,may,1,-1,unknown,yes


8 duplicate values were found. we will remove it later.

## DATA CLEANING

[To-Do]

1) change the values of the column below:
- 'month' column: expand the month value (e.g., 'jan' -> 'January')
- 'deposit' column: convert 'yes' to '1' and 'no' to '0'

2) Remove duplicates

Before performing data cleaning, it's recommended to create a copy of the dataframe to avoid altering the original format.

In [None]:
df_clean = df.copy()

### Change the Column label

In [None]:
df_clean['month'] = df_clean['month'].replace(
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
    ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August',
     'September', 'October', 'November', 'December'])

df_clean['month'].value_counts()

Unnamed: 0_level_0,count
month,Unnamed: 1_level_1
May,1976
August,1085
July,1050
June,857
April,662
November,657
February,534
October,286
January,227
September,212


In [None]:
df_clean['deposit'] = df_clean['deposit'].replace(['yes', 'no'], [1, 0])
df_clean['deposit'].value_counts()

  df_clean['deposit'] = df_clean['deposit'].replace(['yes', 'no'], [1, 0])


Unnamed: 0_level_0,count
deposit,Unnamed: 1_level_1
0,4081
1,3732


In [None]:
df_clean['housing'] = df_clean['housing'].replace(['yes', 'no'], [1, 0])
df_clean['housing'].value_counts()

  df_clean['housing'] = df_clean['housing'].replace(['yes', 'no'], [1, 0])


Unnamed: 0_level_0,count
housing,Unnamed: 1_level_1
0,4140
1,3673


In [None]:
df_clean['loan'] = df_clean['loan'].replace(['yes', 'no'], [1, 0])
df_clean['loan'].value_counts()

  df_clean['loan'] = df_clean['loan'].replace(['yes', 'no'], [1, 0])


Unnamed: 0_level_0,count
loan,Unnamed: 1_level_1
0,6789
1,1024


The values of 'month' column have been changed

### Remove the Duplicates

In [None]:
df_clean.drop_duplicates(inplace=True)

In [None]:
df_clean.duplicated().sum()

0

The duplicates data have been removed

## DATA PREPARATION

In [None]:
df_clean.head()

Unnamed: 0,age,job,balance,housing,loan,contact,month,campaign,pdays,poutcome,deposit
0,55,admin.,1662,0,0,cellular,June,2,-1,unknown,1
1,39,self-employed,-3058,1,1,cellular,April,3,-1,unknown,1
2,51,admin.,3025,0,0,cellular,May,1,352,other,1
3,38,services,-87,1,0,cellular,May,1,-1,unknown,0
4,36,housemaid,205,1,0,telephone,November,4,-1,unknown,0


### Feature Engineering: Encoding

Purpose: improve the model's ability to learn patterns

Using Encoding for categorical features. Here are the list of categorkical data and to-dos:

1) `Job`: use One Hot Encoding. this data is nominal data (no roder) with a small number of unique values.

2) `contact`: use One Hot Encoding. this data is nominal data (no roder) with a small number of unique values.

3) `month`: use Ordinal Encoding because 'month' consists of ordinal data (has order). 1 represents Janaury, 2 represents February, and so on.

4) `poutcome`:  use One Hot Encoding. this data is nominal data (no roder) with a small number of unique values.

In [None]:
# Define the mappings for Ordinal Encoding
ordinal_mapping = [
    {'col': 'month', 'mapping': {1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May',
                                6: 'June', 7: 'July', 8: 'August', 9: 'September', 10: 'October',
                                11: 'November', 12: 'December'}}
]

# Define the ColumnTransformer
transformer = ColumnTransformer([
    # One-Hot Encoding for 'Job', 'contact', and 'poutcome'
    ('onehot', OneHotEncoder(drop='first'), ['job', 'contact', 'poutcome']),

    # Ordinal Encoding for 'month' (use OrdinalEncoder for ordered months)
    ('ordinal', ce.OrdinalEncoder(mapping=ordinal_mapping), ['month'])],
    remainder='passthrough')

In [None]:
x = df_clean.drop(columns=['deposit'])
y = df_clean['deposit']

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,stratify=y,test_size=0.2,random_state=2021)

In [None]:
testing = pd.DataFrame(transformer.fit_transform(x_train),columns=transformer.get_feature_names_out())
testing.head()

Unnamed: 0,onehot__job_blue-collar,onehot__job_entrepreneur,onehot__job_housemaid,onehot__job_management,onehot__job_retired,onehot__job_self-employed,onehot__job_services,onehot__job_student,onehot__job_technician,onehot__job_unemployed,onehot__job_unknown,onehot__contact_telephone,onehot__contact_unknown,onehot__poutcome_other,onehot__poutcome_success,onehot__poutcome_unknown,ordinal__month,remainder__age,remainder__balance,remainder__housing,remainder__loan,remainder__campaign,remainder__pdays
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.0,34.0,223.0,0.0,1.0,6.0,-1.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.0,34.0,479.0,0.0,0.0,1.0,-1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.0,60.0,414.0,0.0,0.0,1.0,-1.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,-1.0,51.0,0.0,0.0,0.0,3.0,-1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.0,44.0,2999.0,1.0,0.0,1.0,-1.0


## Evaluation

In [None]:
logreg = LogisticRegression()
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
xgb = XGBClassifier()
lgbm = lgb.LGBMClassifier()

### Model Benchmarking : Test Data

In [None]:
models = [logreg,knn,dt,rf,xgb,lgbm]
score_roc_auc = []

def y_pred_func(i):
    estimator=Pipeline([
        ('preprocess',transformer),
        ('model',i)])
    x_train,x_test

    estimator.fit(x_train,y_train)
    return(estimator,estimator.predict(x_test),x_test)

for i,j in zip(models, ['Logistic Regression', 'KNN', 'Decision Tree', 'Random Forest', 'XGBoost','LightGBM']):
    estimator,y_pred,x_test = y_pred_func(i)
    y_predict_proba = estimator.predict_proba(x_test)[:,1]
    score_roc_auc.append(roc_auc_score(y_test,y_predict_proba))
    print(j,'\n', classification_report(y_test,y_pred))

pd.DataFrame({'model':['Logistic Regression', 'KNN', 'Decision Tree', 'Random Forest', 'XGBoost','LightGBM'],
             'roc_auc score':score_roc_auc}).set_index('model').sort_values(by='roc_auc score',ascending=False)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression 
               precision    recall  f1-score   support

           0       0.69      0.68      0.69       815
           1       0.66      0.68      0.67       746

    accuracy                           0.68      1561
   macro avg       0.68      0.68      0.68      1561
weighted avg       0.68      0.68      0.68      1561

KNN 
               precision    recall  f1-score   support

           0       0.58      0.63      0.60       815
           1       0.55      0.51      0.53       746

    accuracy                           0.57      1561
   macro avg       0.57      0.57      0.57      1561
weighted avg       0.57      0.57      0.57      1561

Decision Tree 
               precision    recall  f1-score   support

           0       0.63      0.63      0.63       815
           1       0.59      0.59      0.59       746

    accuracy                           0.61      1561
   macro avg       0.61      0.61      0.61      1561
weighted avg       0.61      0

Unnamed: 0_level_0,roc_auc score
model,Unnamed: 1_level_1
Logistic Regression,0.747463
LightGBM,0.731447
Random Forest,0.717808
XGBoost,0.715823
Decision Tree,0.609129
KNN,0.602753


Based on Test Data, `Logistic Regression` model demonstrates the best performance.

Reason: Logistic Regression performs consistently well comapred to other models.
- Precision (0): 69% of predicted 0s were correct.
- Precision (1): 66% of predicted 1s were correct.
- Recall (0): 68% of actual 0s were identified correctly.
- Recall (1): 68% of actual 1s were identified correctly.
- F1-Score (0 and 1): Balanced at ~67–69%, showing a good balance between precision and recall.
- Overall Accuracy: 68%, the highest among all models.

### Feature Selection

In [None]:
# Assuming 'logreg' is your logistic regression model
from sklearn.pipeline import Pipeline

# Re-train the pipeline on the full training data
final_pipeline = Pipeline([
    ('preprocess', transformer),  # Assuming transformer is defined earlier for preprocessing
    ('model', logreg)
])
final_pipeline.fit(x_train, y_train)

# Extract feature names from the transformer
feature_names = transformer.get_feature_names_out()

# Extract coefficients from the trained logistic regression model
coefficients = final_pipeline.named_steps['model'].coef_[0]

# Create a DataFrame to view feature importance
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': coefficients
}).sort_values(by='Importance', ascending=False)

print(importance_df)

                      Feature  Importance
14   onehot__poutcome_success    1.285136
7         onehot__job_student    0.347628
4         onehot__job_retired    0.176564
3      onehot__job_management    0.176056
9      onehot__job_unemployed    0.137933
8      onehot__job_technician    0.015079
22           remainder__pdays    0.000351
18         remainder__balance    0.000045
17             remainder__age   -0.004916
10        onehot__job_unknown   -0.014324
13     onehot__poutcome_other   -0.047354
5   onehot__job_self-employed   -0.056624
6        onehot__job_services   -0.081597
1    onehot__job_entrepreneur   -0.083361
21        remainder__campaign   -0.112774
2       onehot__job_housemaid   -0.129632
0     onehot__job_blue-collar   -0.151611
11  onehot__contact_telephone   -0.172982
15   onehot__poutcome_unknown   -0.222168
16             ordinal__month   -0.454542
20            remainder__loan   -0.475350
19         remainder__housing   -0.569861
12    onehot__contact_unknown   -1

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Takeaway:

`poutcome`: This feature represents the outcome of the previous marketing campaign -> this will be included because it contains information about customers who were successful or not in past campaigns
- onehot__poutcome_success (1.285136)

`Job`: Customer occupation can play a role in determining their financial behavior -> the likelihood of investing in term deposits.
- onehot__job_student    0.347628
- onehot__job_retired    0.176564
- onehot__job_management    0.176056
- onehot__job_technician    0.015079

`pdays`: Represents days since last contact, indicating customer engagement. Shorter times may show higher interest, improving conversion chances.

`balance`: Reflects a customer's financial situation, showing their savings or wealth. Higher balances may indicate greater financial stability, making customers more likely to invest in term deposits.

`Campaign`: it might reflects the persistent engagement with customers, which can increase the likelihood of conversion.

### Encoding

Before define the x feature, we need to encode categorical data (job & poutcome)

in logistic regression, we need to choose reference category for categorical features as a baseline for comaprison. Here's the baseline in our case:
- poutcome: failure -> represents non ideal state that might help to measure imrpvoements.
- job: unemployed -> represents a status with no direct income source and might help to get insights about how employment types impact outcomes.



---
**How to select baseline in python?**

By using `OneHotEncoder from Scikit-learn`, the drop='first' code, by default, will select the first category as the reference. Hence, we need to reorder cetegory before encoding.



In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Define categorical features with specified baseline
df_clean['job'] = pd.Categorical(df_clean['job'],
                                 categories=['unemployed', 'admin.', 'self-employed', 'services', 'housemaid',
                                             'technician', 'management', 'student', 'blue-collar',
                                             'entrepreneur', 'retired', 'unknown'],
                                 ordered=True)

df_clean['poutcome'] = pd.Categorical(df_clean['poutcome'],
                                      categories=['failure', 'unknown', 'other', 'success'],
                                      ordered=True)

# Define categorical and numerical features
categorical_features = ['job', 'poutcome']
numerical_features = ['balance', 'campaign', 'pdays']

# One-hot encode categorical features
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_categorical = encoder.fit_transform(df_clean[categorical_features])
encoded_feature_names = encoder.get_feature_names_out(categorical_features)

# Create a DataFrame for encoded features
encoded_df = pd.DataFrame(encoded_categorical, columns=encoded_feature_names)

# Combine encoded features with numerical features
final_encoded_df = pd.concat([df_clean[numerical_features], encoded_df], axis=1)
print(final_encoded_df.head())

   balance  campaign  pdays  job_blue-collar  job_entrepreneur  job_housemaid  \
0   1662.0       2.0   -1.0              0.0               0.0            0.0   
1  -3058.0       3.0   -1.0              0.0               0.0            0.0   
2   3025.0       1.0  352.0              0.0               0.0            0.0   
3    -87.0       1.0   -1.0              0.0               0.0            0.0   
4    205.0       4.0   -1.0              0.0               0.0            1.0   

   job_management  job_retired  job_self-employed  job_services  job_student  \
0             0.0          0.0                0.0           0.0          0.0   
1             0.0          0.0                1.0           0.0          0.0   
2             0.0          0.0                0.0           0.0          0.0   
3             0.0          0.0                0.0           1.0          0.0   
4             0.0          0.0                0.0           0.0          0.0   

   job_technician  job_unemploye

### Define Feature (X) and Target (y)

In [None]:
import statsmodels.api as sm
feature_names = ['balance', 'campaign', 'pdays', 'poutcome_success',
                   'job_student', 'job_retired', 'job_management', 'job_technician']

# feature
X = final_encoded_df[feature_names]
X = sm.add_constant(X)

# target
y = df_clean['deposit']

In [None]:
X.replace([np.inf, -np.inf], np.nan, inplace=True)
X.dropna(inplace=True)
X.reset_index(drop=True, inplace=True)
y = y.reset_index(drop=True)

# Get common indices
common_index = X.index.intersection(y.index)

# Filter both X and y using the common indices
X = X.loc[common_index]
y = y.loc[common_index]

In [None]:
X.head()

Unnamed: 0,const,balance,campaign,pdays,poutcome_success,job_student,job_retired,job_management,job_technician
0,1.0,1662.0,2.0,-1.0,0.0,0.0,0.0,0.0,0.0
1,1.0,-3058.0,3.0,-1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,3025.0,1.0,352.0,0.0,0.0,0.0,0.0,0.0
3,1.0,-87.0,1.0,-1.0,0.0,0.0,0.0,0.0,0.0
4,1.0,205.0,4.0,-1.0,0.0,0.0,0.0,0.0,0.0


### Check Multicollinearity

 In logistic regression, multicollinearity can cause the model to produce unstable estimates for the coefficients.When predictor variables are highly correlated, interpreting the effect of each individual predictor on the outcome becomes challenging. The coefficients can be misleading, making it difficult to draw accurate conclusions from the model.

 source: Field, A. (2013). Discovering statistics using IBM SPSS statistics.

In [None]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Function to calculate VIF
def calc_vif(x):
    # Drop rows with infinite or NaN values
    x = x.replace([np.inf, -np.inf], np.nan).dropna()

    vif = pd.DataFrame()
    vif['variables'] = x.columns
    vif['VIF'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]

    return (vif)

In [None]:
calc_vif(X.drop(columns='const'))

Unnamed: 0,variables,VIF
0,balance,1.164187
1,campaign,1.331227
2,pdays,1.160954
3,poutcome_success,1.126897
4,job_student,1.031282
5,job_retired,1.074695
6,job_management,1.20785
7,job_technician,1.148277


Interpretation of VIF Values:
- VIF=1: No multicollinearity.
- VIF<5: Acceptable multicollinearity.
- VIF>5: High multicollinearity; consider removing or combining variables.

Our result:
All of VIF scores above show < 5. Hence, we can consider as No Multicollinearity.

## Modeling

In [None]:
# define model
model_logit = sm.Logit(y, X)

# fitting model
model_result = model_logit.fit()

# summary
print(model_result.summary())

Optimization terminated successfully.
         Current function value: 0.663149
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                deposit   No. Observations:                 7797
Model:                          Logit   Df Residuals:                     7788
Method:                           MLE   Df Model:                            8
Date:                Mon, 16 Dec 2024   Pseudo R-squ.:                 0.04193
Time:                        11:40:50   Log-Likelihood:                -5170.6
converged:                       True   LL-Null:                       -5396.8
Covariance Type:            nonrobust   LLR p-value:                 1.063e-92
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
const               -0.1652      0.046     -3.594      0.000      -0.255      -0.075
balance    

In [None]:
model_result.pvalues.round(4)

Unnamed: 0,0
const,0.0003
balance,0.0
campaign,0.0
pdays,0.0
poutcome_success,0.0
job_student,0.003
job_retired,0.0881
job_management,0.0151
job_technician,0.1496


P-value interpretation:
- H₀: There is no effect of this variable on the outcome (term deposit success).
- H₁: This variable has a significant effect on the outcome (term deposit success).

The Result:

`P-value < 0.05`: const (0.0005), balance (0.0000), campaign (0.0000), pdays (0.0000), poutcome_success (0.0000), job_student (0.0036), job_management (0.0198), job_retired (0.0976) -> marginally close to 0.05.
- significantly impact the likelihood of success in the term deposit campaign.

`Pvalue > 0.05`: job_technician (0.1707), job_entrepreneur (0.6380)
- insufficient evidence to conclude that these variables significantly impact the likelihood of success in the term deposit campaign.

### Model Interpretation

1. LLR p-value:  7.390e-92 --> Reject Ho
    - Ho: Semua Beta = 0. Semua feature tidak signifikan terhadap target
    - Ha: Salah satu Beta ada yg ≠ 0. Minimal ada 1 feature yg berpengaruh signifikan terhadap target.

    
2. Wald Test (P>|z|)
    - Ho: Beta = 0. Featurenya tidak berpengaruh signifikan terhadap target.
    - Ha: Beta ≠ 0. Featurenya berpengaruh signifikan terhadap target.
    <br><br>
    - B₀ (const): p-value = 0.001 → Reject H₀ → The model requires an intercept for better prediction.
    - B₁ (balance): p-value = 0.0000 → Reject H₀ → Balance significantly impacts the likelihood of subscribing to a term deposit.
  - B₂ (campaign): p-value = 0.0000 → Reject H₀ → Campaign significantly impacts the likelihood of subscribing to a term deposit.
  - B₃ (pdays): p-value = 0.0000 → Reject H₀ → Pdays significantly impacts the likelihood of subscribing to a term deposit.
  - B₄ (poutcome_success): p-value = 0.0000 → Reject H₀ → Poutcome_success significantly impacts the likelihood of subscribing to a term deposit.
  - B₅ (job_student): p-value = 0.0033 → Reject H₀ → Job (student) significantly impacts the likelihood of subscribing to a term deposit.
  - B₆ (job_retired): p-value = 0.0973 → Fail to Reject H₀ → Job (retired) does not significantly impact the likelihood of subscribing to a term deposit.
  - B₇ (job_management): p-value = 0.0193 → Reject H₀ → Job (management) significantly impacts the likelihood of subscribing to a term deposit.
  - B₈ (job_technician): p-value = 0.1706 → Fail to Reject H₀ → Job (technician) does not significantly impact the likelihood of subscribing to a term deposit.


3. Logistic Regression Coefficient:

In [None]:
model_result.params

Unnamed: 0,0
const,-0.165238
balance,6.8e-05
campaign,-0.114107
pdays,0.002641
poutcome_success,0.490525
job_student,0.403439
job_retired,0.16198
job_management,0.142832
job_technician,0.095255


###  interpretation of the odds ratio (OR) from logistic regression.

#### Balance

In [None]:
Beta = 0.000068   # coef
c = 50000
d = 40000

OR_balance = np.exp(Beta * (c-d))
OR_balance

# Interpretation
# the higher the Customer's account balance, the higher the likelihood of a customer opening a term deposit.
# For every additional 10,000 IDR increase in balance, the odds of opening a term deposit increase by a factor of 1.974

1.9738777322304477

#### campaign


In [None]:
Beta = -0.114156  # coef
c = 5
d = 4

OR_campaign = np.exp(Beta * (c-d))
OR_campaign

# Interpretation
# The higher the Number of contacts performed during this campaign, the lower the likelihood of a customer opening a term deposit.
# For every additional contact performed during the campaign, the odds of opening a term deposit decrease by a factor of 0.891

0.8921187744977209

#### pdays

In [None]:
Beta = 0.002640  # coef
c = 5
d = 4

OR_pdays = np.exp(Beta * (c-d))
OR_pdays

# Interpretation
# the higher the Number of days since the client was last contacted, the higher the likelihood of a customer opening a term deposit.
# For every additional day increase in the number of days since the client was last contacted, the odds of opening a term deposit increase by a factor of 1.00264.

1.002643487868649

#### poutcome_success

In [None]:
Beta = 0.489801  # coef
c = 5
d = 4

OR_poutcome_success = np.exp(Beta * (c-d))
OR_poutcome_success

# Baseline Category: poutcome_failure
# Interpretation:
# If the outcome of the previous campaign was successful (poutcome_success),
# the odds of opening a term deposit are 1.63 times higher than if the outcome was a failure

1.6319914213461413

#### job_student

In [None]:
Beta = 0.399523  # coef
c = 5
d = 4

OR_job_student = np.exp(Beta * (c-d))
OR_job_student

# Baseline Category: job_unemployed
# Interpretation:
# If the customer is a student, the odds of opening a term deposit are 1.49 times higher than if the customer is an umployed

1.4911132669502045

#### job_management

In [None]:
Beta = 0.138875  # coef
c = 5
d = 4

OR_job_management = np.exp(Beta * (c-d))
OR_job_management

# Baseline Category: job_unemployed
# Interpretation:
# If the customer job is management, the odds of opening a term deposit are 1.148 times higher than if the customer is an umployed

1.1489804684682625

## Predict

In [None]:
common_index = X.index.intersection(y.index)
X = X.loc[common_index]
y = y.loc[common_index]

# define model
model_logit = sm.Logit(y, X)

# fitting
model_result = model_logit.fit()

# predict
y_pred_proba = model_result.predict(X)
y_pred_proba

Optimization terminated successfully.
         Current function value: 0.663149
         Iterations 5


Unnamed: 0,0
0,0.429862
1,0.327532
2,0.702082
3,0.428509
4,0.351994
...,...
7792,0.448987
7793,0.592768
7794,0.740869
7795,0.431292


In [None]:
# np.where(condition, True, False)
y_pred_class = np.where(y_pred_proba > 0.50, 1, 0)
y_pred_class

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
df_result = pd.DataFrame({'deposit': y})    # y_actual
df_result['y_pred_class'] = y_pred_class    # y_pred

df_result

Unnamed: 0,deposit,y_pred_class
0,1,0
1,1,0
2,1,1
3,0,0
4,0,0
...,...,...
7792,1,0
7793,1,1
7794,1,1
7795,0,0


# CONCLUSION & RECOMMENDATION

## Conclusion

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=1000)

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Generate the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, digits=2))

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Classification Report:
              precision    recall  f1-score   support

           0       0.58      0.75      0.65      1025
           1       0.58      0.38      0.46       925

    accuracy                           0.58      1950
   macro avg       0.58      0.57      0.56      1950
weighted avg       0.58      0.58      0.56      1950

Accuracy: 0.58


Model Effectiveness (recall):
- For class 0 (not interested in opening a term deposit), the model achieves a recall of 75%, meaning it correctly identifies 75% of uninterested candidates, effectively helping to filter them out.
- For class 1 (interested in opening a term deposit), the model achieves a recall of 38%, meaning it only captures 38% of interested candidates.

Precision Insights:
- Precision for both 0 and 1 is 58%, meaning that when the model predicts either class, it is correct only 58% of the time.

Overall Accuracy:
- The model achieves an accuracy of 58%, meaning 58% of all predictions (both 0 and 1) are correct.

F1-Score:
- The F1-score for class 0 is 65%, indicating moderately good performance in identifying uninterested candidates.
- The F1-score for class 1 is 46%, suggesting weaker performance in identifying interested candidates due to the imbalance between precision and recall.



## Recommendation

- Algorithm and Model Optimization: Experiment with different machine learning algorithms, such as Random Forest, Gradient Boosting (e.g., XGBoost, LightGBM), or Neural Networks, to see if they outperform the current model.
- Data Completeness and Quality: Encourage data collection policies to ensure all necessary fields are filled in. For example, if certain fields like occupation or financial status are missing, provide options such as “unemployed” or “not applicable” rather than leaving them blank.
- Handling Class Imbalance: use more advanced oversampling techniques to improve the model's ability to identify class 1 (interested).
- Iteration and Evaluation: regularly evaluate the model with updated data to ensure its relevance and performance.

# Model to Pickle

In [None]:
import pickle

In [None]:
from sklearn.linear_model import LogisticRegression
import pickle

# Create and tune the Logistic Regression model
logistic_model = LogisticRegression(random_state=42, max_iter=1000)  # Adjust hyperparameters as needed

# Fit the model
logistic_model.fit(X, y)

# Save the model with pickle
pickle.dump(logistic_model, open('model_logistic.pkl', 'wb'))

In [None]:
from google.colab import files

# Download the saved model
files.download('model_logistic.pkl')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>