<a href="https://colab.research.google.com/github/Chathuwa99/Bank-Deposit-Prediction-Project/blob/main/Bank_Feature_Engineering_New.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import RandomizedSearchCV

In [2]:
# Loading the data set
data_set_2 = pd.read_csv('/content/cleaned_bank_dataset.csv')

In [3]:
data_set_2.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


# **Exploratory Data Analysis - Numerical data**

In [4]:
# Identify the range of each numerical feature
data_set_2.describe().loc[['min','max']]

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
min,18.0,-8019.0,1.0,0.0,1.0,-1.0,0.0
max,95.0,102127.0,31.0,4918.0,63.0,871.0,275.0


In [5]:
# Replace positive values in pdays with 'yes' and -1 with 'no'
data_set_2['pdays']=np.where(data_set_2['pdays']==-1,'no','yes')
# Rename pdays to pcontact
data_set_2.rename(columns={'pdays': 'pcontact'}, inplace=True)



*   The pdays feature mixes two kinds of information in a single column, which makes analysis difficult.

*   A value of -1 means the customer was not contacted previously. Any positive value is the number of days since the last contact.

*   To make the feature easier to interpret, pdays is converted into a categorical feature with two values: "yes" (contacted before) and "no" (not contacted).

*   The new name, pcontact, better reflects the feature's categorical meaning.
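
The conversion can be sketched on a small hypothetical frame. The `toy` values below are made up for illustration and are not rows from the data set:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: -1 marks "never contacted", positive values are days since the last contact
toy = pd.DataFrame({'pdays': [-1, 151, -1, 92, 871]})

# Flag rows with any previous contact, then rename the column to reflect its new meaning
toy['pdays'] = np.where(toy['pdays'] == -1, 'no', 'yes')
toy = toy.rename(columns={'pdays': 'pcontact'})

print(toy['pcontact'].tolist())
```

Note that this keeps only whether a previous contact happened; the number of days since that contact is discarded.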



# **Exploratory Data Analysis - Categorical data**

In [6]:
# Value counts for each categorical feature

value_counts_dict = {}

for col in data_set_2.columns:  # Collect the value counts of each categorical column in a dictionary
    if data_set_2[col].dtype == 'object' or data_set_2[col].dtype == 'category':
        value_counts_dict[col] = data_set_2[col].value_counts()

for col, value_counts in value_counts_dict.items():  # Print the value counts of each categorical column
    print(f"Value counts for column '{col}':\n{value_counts}\n")

Value counts for column 'job':
job
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: count, dtype: int64

Value counts for column 'marital':
marital
married     27214
single      12790
divorced     5207
Name: count, dtype: int64

Value counts for column 'education':
education
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: count, dtype: int64

Value counts for column 'default':
default
no     44396
yes      815
Name: count, dtype: int64

Value counts for column 'housing':
housing
yes    25130
no     20081
Name: count, dtype: int64

Value counts for column 'loan':
loan
no     37967
yes     7244
Name: count, dtype: int64

Value counts for column 'contact':
contact
cellular     29285
unknown      13020
telephone     2906
Name: count, dtype

In [7]:

# Removing any customer with 'unknown' job
data_set_2 = data_set_2[~data_set_2['job'].isin(['unknown'])].reset_index(drop=True)

Of the 45211 customers, 288 have an "unknown" job value, which leaves 44923 after removal. Since these values represent missing data and make up well under 1% of the rows, excluding them will have minimal impact on the analysis.
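
As a quick arithmetic check of that claim, the share of removed rows can be computed from the value counts printed above, taking 45211 (the sum of the marital counts) as the row count before removal:

```python
# Counts taken from the value counts printed earlier in the notebook
unknown_jobs = 288
total_rows = 45211

share = unknown_jobs / total_rows
remaining = total_rows - unknown_jobs

print(f"{share:.2%} of rows removed, {remaining} rows remain")
```

The remaining row count matches the 44923 entries reported by `data_set_2.info()` later in the notebook.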


In [8]:
# Replace 'unknown' with 'other' in contact
# Assign the result back instead of calling an inplace method on a column selection,
# which pandas warns will stop working in pandas 3.0
data_set_2['contact'] = data_set_2['contact'].replace('unknown', 'other')


The "unknown" value in "contact" means that the contact communication type was not recorded. This affects 13020 of the 45211 customers, which is a large share of the data. Instead of dropping these records or imputing the mode, it is better to keep them as a separate category named "other."
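
A minimal sketch of the relabeling on a hypothetical frame (the `toy` values are illustrative only). Assigning the result back to the column is one way to avoid calling an inplace method on a column selection:

```python
import pandas as pd

# Hypothetical sample of contact types
toy = pd.DataFrame({'contact': ['unknown', 'cellular', 'telephone', 'unknown']})

# Relabel 'unknown' as 'other' and assign the result back to the column
toy['contact'] = toy['contact'].replace('unknown', 'other')

# Every row is kept; only the label changes
print(len(toy))
print(toy['contact'].value_counts().to_dict())
```

Unlike dropping rows or mode imputation, this keeps every record and lets a model treat the missing contact type as its own signal.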

In [9]:
# Replace 'other' with 'unknown' in poutcome
# Assign the result back instead of calling an inplace method on a column selection,
# which pandas warns will stop working in pandas 3.0
data_set_2['poutcome'] = data_set_2['poutcome'].replace('other', 'unknown')


The 'poutcome' variable, which records the outcome of the previous marketing campaign, should take only three values: 'success,' 'failure,' and 'unknown.' However, the data also includes 'other,' which was likely entered when the result was not known. To fix this, 'other' is replaced with 'unknown.'
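
One way to confirm that only the accepted values remain is a small set comparison. This is a sketch on a hypothetical frame; the name `allowed` is introduced here for illustration:

```python
import pandas as pd

# Accepted outcome labels, as listed in the text above
allowed = {'success', 'failure', 'unknown'}

# Hypothetical sample containing the stray 'other' label
toy = pd.DataFrame({'poutcome': ['unknown', 'other', 'failure', 'success']})
toy['poutcome'] = toy['poutcome'].replace('other', 'unknown')

# Any label outside the accepted set would show up here
unexpected = set(toy['poutcome'].unique()) - allowed
print(unexpected)
```

An empty set means every remaining label is one of the three accepted values.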


# **Data Preprocessing**

In [10]:
data_set_2['y'].unique()

array(['no', 'yes'], dtype=object)

In [11]:
# Changing y (target) into binary numbers [1 for yes] and [0 for no]
data_set_2['y'] = np.where(data_set_2['y']=='yes',1,0).astype('int64')

In [12]:
data_set_2.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pcontact,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,other,5,may,261,1,no,0,unknown,0
1,44,technician,single,secondary,no,29,yes,no,other,5,may,151,1,no,0,unknown,0
2,33,entrepreneur,married,secondary,no,2,yes,yes,other,5,may,76,1,no,0,unknown,0
3,47,blue-collar,married,unknown,no,1506,yes,no,other,5,may,92,1,no,0,unknown,0
4,35,management,married,tertiary,no,231,yes,no,other,5,may,139,1,no,0,unknown,0



# **Data Preprocessing - Data Type**

In [13]:

data_set_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44923 entries, 0 to 44922
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        44923 non-null  int64 
 1   job        44923 non-null  object
 2   marital    44923 non-null  object
 3   education  44923 non-null  object
 4   default    44923 non-null  object
 5   balance    44923 non-null  int64 
 6   housing    44923 non-null  object
 7   loan       44923 non-null  object
 8   contact    44923 non-null  object
 9   day        44923 non-null  int64 
 10  month      44923 non-null  object
 11  duration   44923 non-null  int64 
 12  campaign   44923 non-null  int64 
 13  pcontact   44923 non-null  object
 14  previous   44923 non-null  int64 
 15  poutcome   44923 non-null  object
 16  y          44923 non-null  int64 
dtypes: int64(7), object(10)
memory usage: 5.8+ MB


# **Data Preprocessing - Features with categorical data types**

In [14]:
# Print the unique values of each categorical feature

unique_dict = {}

for col in data_set_2.columns:  # Collect the unique values of each categorical column in a dictionary
    if data_set_2[col].dtype == 'object' or data_set_2[col].dtype == 'category':
        unique_dict[col] = data_set_2[col].unique()

for col, unique in unique_dict.items():  # Print the unique values of each categorical column
    print(f"'{col}' unique:\n{unique}\n")

'job' unique:
['management' 'technician' 'entrepreneur' 'blue-collar' 'retired' 'admin.'
 'services' 'self-employed' 'unemployed' 'housemaid' 'student']

'marital' unique:
['married' 'single' 'divorced']

'education' unique:
['tertiary' 'secondary' 'unknown' 'primary']

'default' unique:
['no' 'yes']

'housing' unique:
['yes' 'no']

'loan' unique:
['no' 'yes']

'contact' unique:
['other' 'cellular' 'telephone']

'month' unique:
['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']

'pcontact' unique:
['no' 'yes']

'poutcome' unique:
['unknown' 'failure' 'success']



In [15]:
# Count the duplicate rows

data_set_2.duplicated().sum()

0

# **Binary Encoding**

In [16]:

# Map the two-valued yes/no features to 1/0
data_set_2['default'] = np.where(data_set_2['default']=='yes',1,0).astype('int64')
data_set_2['housing'] = np.where(data_set_2['housing']=='yes',1,0).astype('int64')
data_set_2['loan'] = np.where(data_set_2['loan']=='yes',1,0).astype('int64')
data_set_2['pcontact'] = np.where(data_set_2['pcontact']=='yes',1,0).astype('int64')

# **One-Hot Encoding**

In [17]:
cat_columns = ['job', 'marital', 'education', 'contact', 'month', 'poutcome']

for col in cat_columns:
    # Perform one-hot encoding and convert to int64
    dummies = pd.get_dummies(data_set_2[col], prefix=col, prefix_sep='_', drop_first=True, dummy_na=False).astype('int64')
    # Concatenate the encoded columns back to the DataFrame
    data_set_2 = pd.concat([data_set_2.drop(col, axis=1), dummies], axis=1)
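
The effect of `drop_first=True` can be seen on a small hypothetical frame. The sketch below uses the `columns` and `dtype` arguments of `pd.get_dummies` as a compact alternative to the loop above; the `toy` values are illustrative only:

```python
import pandas as pd

# Hypothetical sample with one categorical and one numerical column
toy = pd.DataFrame({
    'marital': ['married', 'single', 'divorced', 'married'],
    'age': [58, 44, 33, 47],
})

# One indicator column per category except the first (alphabetically), which is dropped
encoded = pd.get_dummies(toy, columns=['marital'], prefix_sep='_',
                         drop_first=True, dtype='int64')

print(encoded.columns.tolist())
print(encoded)
```

With `drop_first=True`, a column with k categories yields k-1 indicators, and the dropped category is represented by all zeros. Applied to the six columns in `cat_columns` (11, 3, 4, 3, 12 and 3 categories), this gives 30 indicator columns, which together with the 11 remaining columns is consistent with the 41 columns reported by `data_set_2.info()`.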


In [18]:
data_set_2.head()

Unnamed: 0,age,default,balance,housing,loan,day,duration,campaign,pcontact,previous,...,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_success,poutcome_unknown
0,58,0,2143,1,0,5,261,1,0,0,...,0,0,0,0,1,0,0,0,0,1
1,44,0,29,1,0,5,151,1,0,0,...,0,0,0,0,1,0,0,0,0,1
2,33,0,2,1,1,5,76,1,0,0,...,0,0,0,0,1,0,0,0,0,1
3,47,0,1506,1,0,5,92,1,0,0,...,0,0,0,0,1,0,0,0,0,1
4,35,0,231,1,0,5,139,1,0,0,...,0,0,0,0,1,0,0,0,0,1


In [19]:
data_set_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44923 entries, 0 to 44922
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   age                  44923 non-null  int64
 1   default              44923 non-null  int64
 2   balance              44923 non-null  int64
 3   housing              44923 non-null  int64
 4   loan                 44923 non-null  int64
 5   day                  44923 non-null  int64
 6   duration             44923 non-null  int64
 7   campaign             44923 non-null  int64
 8   pcontact             44923 non-null  int64
 9   previous             44923 non-null  int64
 10  y                    44923 non-null  int64
 11  job_blue-collar      44923 non-null  int64
 12  job_entrepreneur     44923 non-null  int64
 13  job_housemaid        44923 non-null  int64
 14  job_management       44923 non-null  int64
 15  job_retired          44923 non-null  int64
 16  job_self-employed    4

# **Saving the feature engineered data set**

In [20]:
# Save data_set_2 to a CSV file
data_set_2.to_csv('Feature-Engineered Bank Data Set.csv', index=False)

print("The dataset has been saved as 'Feature-Engineered Bank Data Set.csv'.")

The dataset has been saved as 'Feature-Engineered Bank Data Set.csv'.
