# Bank Marketing Subscription Prediction Project

### Business Understanding

#### **Problem Statement** 
The goal of this project is to build a predictive model that estimates how likely a client is to subscribe to a term deposit, based on customer features. The bank's marketing campaigns depend on targeting the customers who are most likely to subscribe. By applying machine learning, the project seeks to make these campaigns more efficient and to increase the conversion rate.

#### **Stakeholders**
- Executive Officers (CEO)
- Marketing Officers (CMO)
- Data Administrators (CDA)


The background information supplied with the data is printed by the following code:

In [1]:
# Load names file
with open(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank-names.txt", "r") as file:
    lines = file.readlines()
    for line in lines:
        print(line.strip())

1. Relevant Information:

The data is related with direct marketing campaigns of a banking institution.
The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required,
in order to access if the product (bank term deposit) would be (or not) subscribed.

There are two datasets:
1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010).
2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv.
The smallest dataset is provided to test more computationally demanding machine learning algorithms (e.g. SVM).

2. Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)

3. Number of Attributes: 16 + output attribute.

4. Attribute information:

Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")
3 - mari

- We can deduce that this is a classification problem. This informs the flow of the project: the feature engineering, the evaluation metrics, and the models that are likely to work best.
- As the project progresses, **key insights** are recorded below each task for better project flow.

#### **Key Metrics and Success Criteria**
1. Accuracy: the model should reach an accuracy score of 85% (on balanced data). Good models are expected to score above 0.80 (80%).
2. Precision and recall threshold: the model should achieve precision and recall of at least 80%, so that its positive predictions are reliable and most subscribers are found.
3. Minimum F1 score: the F1 score should be at least 0.75. F1 balances the trade-off between precision and recall, so it stays informative even if the class distribution is imbalanced.
4. AUC-ROC score: this should be at least 0.85. A high AUC-ROC score indicates that the model separates subscribers from non-subscribers effectively.
5. Confusion matrix: the number of false negatives (FN) should be low, so that most subscription cases are identified.
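
The criteria above can be checked once a model produces predictions. The following is a minimal sketch in plain Python, not the project's evaluation code: the function names `classification_metrics` and `meets_criteria` and the toy labels are hypothetical. AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive="yes"):
    """Return confusion-matrix counts and derived metrics for a binary target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1": f1,
    }


def meets_criteria(metrics):
    """Compare each metric with the thresholds listed in the success criteria."""
    return {
        "accuracy": metrics["accuracy"] >= 0.85,
        "precision": metrics["precision"] >= 0.80,
        "recall": metrics["recall"] >= 0.80,
        "f1": metrics["f1"] >= 0.75,
    }


# Toy labels for illustration only; these are not project results.
y_true = ["yes", "no", "no", "yes", "no", "yes", "no", "no"]
y_pred = ["yes", "no", "no", "no", "no", "yes", "yes", "no"]

metrics = classification_metrics(y_true, y_pred)
checks = meets_criteria(metrics)
print(metrics)
print(checks)
```

Keeping the false-negative count (`fn`) in the returned dictionary makes it easy to track criterion 5 alongside the threshold-based checks.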
     
    
#### **Hypothesis**

#### Null Hypothesis

#### Alternative Hypothesis

#### Analytical Questions
    

### **Data Understanding**

#### **Importations**

In [2]:
import numpy as np
import pandas as pd

#### **Load Datasets**

In [3]:
# Load csv data
bank_additional_full = pd.read_csv(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank-additional-full.csv", delimiter=";")
bank_additional = pd.read_csv(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank-additional.csv", delimiter=";")
bank_full = pd.read_csv(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank-full.csv", delimiter=";")
bank = pd.read_csv(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank.csv", delimiter=";")
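
The absolute Windows paths above only work on the original machine. As a hedged alternative, the sketch below builds paths relative to a data folder with `pathlib`. The folder name `data` next to the notebook and the helper name `load_bank_csv` are assumptions, not part of the original project.

```python
from pathlib import Path

import pandas as pd

# Assumption: the CSV files sit in a "data" folder next to the notebook.
DATA_DIR = Path("data")


def load_bank_csv(name, data_dir=DATA_DIR):
    """Read one of the semicolon-delimited bank CSV files from data_dir."""
    return pd.read_csv(Path(data_dir) / name, sep=";")
```

With this helper, a call such as `load_bank_csv("bank-full.csv")` would replace each hard-coded `pd.read_csv(...)` line, provided the files really are in that folder.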

In [4]:
bank_additional_full.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [5]:
bank_additional.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


In [6]:
bank.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [7]:
bank_full.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


#### **Data Exploration**

In [8]:
shapes = f"""
Bank Additional Full:
{bank_additional_full.shape}
-----------------------------------------------------------------------------------------
Bank Additional:
{bank_additional.shape}
-----------------------------------------------------------------------------------------
Bank:
{bank.shape}
----------------------------------------------------------------------------------------- 
Bank Full:
{bank_full.shape}
-----------------------------------------------------------------------------------------
"""
print(shapes)



Bank Additional Full:
(41188, 21)
-----------------------------------------------------------------------------------------
Bank Additional:
(4119, 21)
-----------------------------------------------------------------------------------------
Bank:
(4521, 17)
----------------------------------------------------------------------------------------- 
Bank Full:
(45211, 17)
-----------------------------------------------------------------------------------------
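
The names file describes bank.csv as a 10% sample of bank-full.csv, and the row counts above suggest the same relationship for the two "additional" files. A small arithmetic sketch, using only the row counts printed above (the variable names are illustrative):

```python
# Row counts copied from the shapes printed above.
row_counts = {
    "bank_additional_full": 41188,
    "bank_additional": 4119,
    "bank_full": 45211,
    "bank": 4521,
}

# Fraction of the full dataset that each sample represents.
additional_ratio = row_counts["bank_additional"] / row_counts["bank_additional_full"]
bank_ratio = row_counts["bank"] / row_counts["bank_full"]

print(f"bank_additional / bank_additional_full = {additional_ratio:.4f}")
print(f"bank / bank_full = {bank_ratio:.4f}")
```

Matching row-count ratios are consistent with a 10% sample, but they do not by themselves prove that every sampled row appears in the full file.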



In [9]:
# DataFrame.info() prints its report and returns None, so call it directly
# instead of interpolating it into an f-string (which would print "None").
datasets = {
    "Bank Additional Full": bank_additional_full,
    "Bank Additional": bank_additional,
    "Bank Full": bank_full,
    "Bank": bank,
}
for name, df in datasets.items():
    print(f"{name}:")
    df.info()
    print("-" * 89)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [10]:
columns = f"""
Bank Additional Full:
{bank_additional_full.columns}
-----------------------------------------------------------------------------------------
Bank Additional:
{bank_additional.columns}
-----------------------------------------------------------------------------------------
Bank Full:
{bank_full.columns}
----------------------------------------------------------------------------------------- 
Bank :
{bank.columns}
-----------------------------------------------------------------------------------------
"""
print(columns)


Bank Additional Full:
Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')
-----------------------------------------------------------------------------------------
Bank Additional:
Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')
-----------------------------------------------------------------------------------------
Bank Full:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutco

Insights
- Bank_Additional and Bank are 10% samples taken from Bank_Additional_Full and Bank_Full respectively, so we will merge only the two full datasets, for two reasons:

-- First, merging in the samples would duplicate rows and compromise data integrity.

-- Second, we do not know the criteria used to pick the samples, so using only the full data avoids building in assumptions.

- For data consistency, the columns 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m' and 'nr.employed' in Bank_Additional_Full will be dropped, as we will not use them in this analysis.

- The date columns differ: Bank_Additional_Full records day_of_week, while Bank_Full records day (the day of the month). *If we knew which year each record is from*, we could convert one to the other and make them uniform. Bank_Full is known to span May 2008 to November 2010.

- The 'balance' column exists only in Bank_Full, so it will be null for every row that comes from Bank_Additional_Full once the two are merged. We will handle this by replacing null values with the mean balance of the respective job category, a common imputation practice that can help the model's performance.

- The 'pdays' column in Bank_Full has -1 values, which mean the customer was not previously contacted by the bank. Bank_Additional_Full appears to use 999 for the same case (see the rows above, where previous is 0 and poutcome is 'nonexistent'). We will create a new feature to flag whether a customer was previously contacted or not.

- We still need to run statistical tests on the features to decide what to set as the hypotheses.
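
The "previously contacted" feature described above could be derived as in the sketch below. It assumes that both -1 (Bank_Full) and 999 (Bank_Additional_Full) mean "never contacted"; the helper name `add_previously_contacted` and the new column name are hypothetical.

```python
import pandas as pd

# Assumption: both sentinel values mean the client was never contacted before.
NOT_CONTACTED_CODES = [-1, 999]


def add_previously_contacted(df, pdays_col="pdays"):
    """Return a copy of df with a 0/1 'previously_contacted' column."""
    result = df.copy()
    never_contacted = result[pdays_col].isin(NOT_CONTACTED_CODES)
    result["previously_contacted"] = (~never_contacted).astype(int)
    return result


# Toy rows for illustration only.
toy = pd.DataFrame({"pdays": [-1, 339, 999, 330]})
flagged = add_previously_contacted(toy)
print(flagged)
```

Working on a copy leaves the input frame unchanged, so the cell can be re-run safely.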

#### **Data Preparation**

In [11]:
# Drop the 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m' and 'nr.employed' columns

columns_to_drop = ['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
bank_additional_full = bank_additional_full.drop(columns=columns_to_drop)

# Confirm the columns have been dropped
print("Remaining columns:")
print(bank_additional_full.columns)

Remaining columns:
Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')


In [12]:
# Get unique values in the 'day_of_week' column
unique_days = bank_additional_full['day_of_week'].unique()

# Display the unique values
unique_days

array(['mon', 'tue', 'wed', 'thu', 'fri'], dtype=object)

In [13]:
# Get unique values in the 'day' column
days = bank_full['day'].unique()

# Display the unique values
days

array([ 5,  6,  7,  8,  9, 12, 13, 14, 15, 16, 19, 20, 21, 23, 26, 27, 28,
       29, 30,  2,  3,  4, 11, 17, 18, 24, 25,  1, 10, 22, 31])
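
If the year of each record were known, the day of the month in Bank_Full could be turned into the weekday abbreviations used by Bank_Additional_Full. The sketch below shows the conversion with the standard library. The year is not a column in the data, so the value 2008 in the example is an assumption, and `weekday_abbrev` is a hypothetical helper name.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def weekday_abbrev(year, month_abbr, day):
    """Map (year, month abbreviation, day of month) to a weekday abbreviation."""
    month = MONTHS.index(month_abbr.lower()) + 1
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return WEEKDAYS[date(year, month, day).weekday()]


# Hypothetical example: the year 2008 is assumed, not read from the data.
example = weekday_abbrev(2008, "may", 5)
print(example)
```

Applied row by row, for example with `DataFrame.apply`, this would yield a day_of_week column for Bank_Full, but only after a year has been assigned to every row.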

In [14]:
# Get unique values in the 'balance' column
balance_values = bank_full['balance'].unique()

# Display the unique values
balance_values

array([ 2143,    29,     2, ...,  8205, 14204, 16353])
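
The insights propose filling missing balance values with the mean balance of the client's job category. A minimal sketch of that imputation with `groupby` and `transform` follows. The helper name `impute_balance_by_job` is hypothetical, and the fallback to the overall mean for a job with no observed balance is an added assumption.

```python
import numpy as np
import pandas as pd


def impute_balance_by_job(df, value_col="balance", group_col="job"):
    """Fill missing values in value_col with the mean of the row's group."""
    result = df.copy()
    # Overall mean of the observed values, used only as a fallback (assumption).
    overall_mean = result[value_col].mean()
    # Per-row group mean; NaN when a group has no observed value at all.
    group_means = result.groupby(group_col)[value_col].transform("mean")
    result[value_col] = result[value_col].fillna(group_means).fillna(overall_mean)
    return result


# Toy rows for illustration only.
toy = pd.DataFrame({
    "job": ["services", "services", "management", "management", "student"],
    "balance": [100.0, np.nan, 300.0, 500.0, np.nan],
})
filled = impute_balance_by_job(toy)
print(filled)
```

Because every Bank_Additional_Full row lacks a balance, the imputed values for those rows come entirely from Bank_Full clients with the same job label, which is worth keeping in mind when interpreting the feature.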

#### **Merge the Train Datasets**

In [15]:
# Combine DataFrames
Bank_anaysis_data = pd.concat([bank_additional_full, bank_full], ignore_index=True)

Bank_anaysis_data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,y,balance,day
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,no,,
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,no,,
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,no,,
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,no,,
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,no,,


In [16]:
# Checking for duplicates 
Bank_anaysis_data.duplicated().sum() 

np.int64(13)
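
The count above reports rows that repeat an earlier row in every column. One common way to handle them is to keep the first occurrence and drop the rest, sketched below on toy data. Whether dropping is appropriate here is a judgment call, since two different clients can share identical attribute values. The helper name `drop_exact_duplicates` is hypothetical.

```python
import pandas as pd


def drop_exact_duplicates(df):
    """Drop rows identical to an earlier row; return the result and the count removed."""
    deduplicated = df.drop_duplicates(keep="first").reset_index(drop=True)
    removed = len(df) - len(deduplicated)
    return deduplicated, removed


# Toy rows for illustration only.
toy = pd.DataFrame({
    "age": [30, 30, 41],
    "job": ["services", "services", "admin."],
    "y": ["no", "no", "yes"],
})
clean, removed = drop_exact_duplicates(toy)
print(f"Removed {removed} duplicate row(s); {len(clean)} remain.")
```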

In [17]:
# Missing values with their percentages 
Bank_anaysis_data.isnull().sum().to_frame('Null Count').assign(Percentage=lambda x: (x['Null Count'] / len(Bank_anaysis_data)) * 100)

Unnamed: 0,Null Count,Percentage
age,0,0.0
job,0,0.0
marital,0,0.0
education,0,0.0
default,0,0.0
housing,0,0.0
loan,0,0.0
contact,0,0.0
month,0,0.0
day_of_week,45211,52.328152
