In [4]:
# Import required libraries
import pandas as pd  # data loading and manipulation
import numpy as np   # numerical operations
import random        # random number generation
import os            # file-system paths

In [5]:
# Read the datasets
# Use lowercase 'data' to match your professional structure
# df = pd.read_csv('../Data/raw/bank-full.csv', sep=',', encoding='latin1')
df = pd.read_excel('../Data/raw/bank-full.xlsx')  # full bank dataset
print(df.head())
print(df.shape)
# Read the sampled subset of the full bank dataset
df_sub = pd.read_csv('../Data/raw/bank.csv', sep=';')  # semicolon-delimited subset
print(df_sub.head())
print(df_sub.shape)


   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married    unknown      no     1506     yes   no   
4   33       unknown   single    unknown      no        1      no   no   

   contact  day month  duration  campaign  pdays  previous poutcome   y  
0  unknown    5   may       261         1     -1         0  unknown  no  
1  unknown    5   may       151         1     -1         0  unknown  no  
2  unknown    5   may        76         1     -1         0  unknown  no  
3  unknown    5   may        92         1     -1         0  unknown  no  
4  unknown    5   may       198         1     -1         0  unknown  no  
(45211, 17)
   age          job  marital  education default  balance housing loan  \
0   30   unemployed  marri

In [6]:
# Exploring the data completeness and data type of our variables
df.info()


<class 'pandas.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        45211 non-null  int64
 1   job        45211 non-null  str  
 2   marital    45211 non-null  str  
 3   education  45211 non-null  str  
 4   default    45211 non-null  str  
 5   balance    45211 non-null  int64
 6   housing    45211 non-null  str  
 7   loan       45211 non-null  str  
 8   contact    45211 non-null  str  
 9   day        45211 non-null  int64
 10  month      45211 non-null  str  
 11  duration   45211 non-null  int64
 12  campaign   45211 non-null  int64
 13  pdays      45211 non-null  int64
 14  previous   45211 non-null  int64
 15  poutcome   45211 non-null  str  
 16  y          45211 non-null  str  
dtypes: int64(7), str(10)
memory usage: 5.9 MB


**Initial Observations:**

The datasets contain customer information, including demographics, account details, and outcomes of bank marketing campaigns.

All 17 columns contain 45,211 non-null values, so no column has missing (null) values.
The data is therefore complete at this level, and no immediate imputation or row removal is necessary.

Data types are consistent: a mix of integer (int64) and string (str) columns.
The target variable (y) indicates whether a customer subscribed to a term deposit.

7 columns are of data type int64 (integer).
These are numerical columns representing continuous or discrete variables.

10 columns are of data type str (string).
These are categorical variables and will need to be encoded as numerical values before we use them as features in a model.
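
As an illustration of the encoding step mentioned above, here is a minimal sketch using one-hot encoding. The toy frame and its values are hypothetical (only the column names come from the bank dataset), and one-hot encoding is just one of several possible encodings, not necessarily the one used later in this analysis.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the bank dataset
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "married"],
    "housing": ["yes", "yes", "no"],
})

# Select the non-numeric columns, whatever dtype label pandas reports for them
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One-hot encode: each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=int)
print(encoded.columns.tolist())
```

Selecting columns with `exclude="number"` avoids depending on whether the installed pandas version labels text columns as `object` or `str`.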



In [7]:
# Get descriptive statistics for numerical columns
df.describe()


             age      balance        day    duration  campaign       pdays  previous
count    45211.0      45211.0    45211.0     45211.0   45211.0     45211.0   45211.0
mean    40.93621  1362.272058  15.806419   258.16308  2.763841   40.197828  0.580323
std    10.618762  3044.765829   8.322476  257.527812  3.098021  100.128746  2.303441
min         18.0      -8019.0        1.0         0.0       1.0        -1.0       0.0
25%         33.0         72.0        8.0       103.0       1.0        -1.0       0.0
50%         39.0        448.0       16.0       180.0       2.0        -1.0       0.0
75%         48.0       1428.0       21.0       319.0       3.0        -1.0       0.0
max         95.0     102127.0       31.0      4918.0      63.0       871.0     275.0


**Insights from the Data (statistics output)**

Age: The average client age is about 41 years (mean 40.94), ranging from 18 (min) to 95 (max).

Balance: The average balance is €1,362 with large variability (std = €3,045); the extremes of -€8,019 and €102,127 are possible outliers.

Duration: The average call duration is 258 seconds (about 4.3 minutes); the median is 180 seconds (3 minutes).

Campaign: Clients were contacted 2.76 times on average during this campaign. At least 75% of clients were contacted between 1 and 3 times; the maximum of 63 contacts is a possible outlier.

Previous: At least 75% of clients had no contacts before this campaign (the 75th percentile is 0), so most were contacted for the first time. The maximum of 275 previous contacts is a possible outlier.

Possible outliers to investigate:

Balance: A minimum balance of -€8,019 and a maximum balance of €102,127 indicate a few extreme values.

Duration: Calls range from 0 seconds to 4,918 seconds (about 82 minutes), suggesting potential outliers at both ends.

Campaign: At least one client was contacted 63 times, which is unusually high.

Previous: At least one client was contacted 275 times before this campaign.
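
One common way to investigate such candidates is the interquartile-range (IQR) rule. The sketch below is an assumption about how the investigation could proceed, not a method taken from this notebook; the helper name `iqr_outlier_mask`, the multiplier `k = 1.5`, and the toy values are all hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical contact counts with one extreme value
toy_campaign = pd.Series([1, 1, 2, 2, 3, 3, 63], name="campaign")
mask = iqr_outlier_mask(toy_campaign)
print(toy_campaign[mask].tolist())  # [63]
```

On the real data the same helper could be applied column by column, for example `iqr_outlier_mask(df['balance'])`. A flagged value is only a candidate for review, since long calls or large balances may be genuine.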


# Step 2: Data Cleaning and Transformation

This step involves cleaning and preparing the raw data to remove inconsistencies, handle missing values, and make the dataset ready for analysis.

1. Check for Missing Values: Ensure there are no null values.
2. Handle Duplicates: Identify and remove any duplicate rows.
3. Consistent Data for Analysis: Ensure consistent formats for categorical data (e.g., month values).
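
The three steps above can be sketched on a toy frame as follows. The values are hypothetical, and the normalization shown (trimming whitespace and lowercasing) is one plausible reading of "consistent formats", not a rule stated in this notebook.

```python
import pandas as pd

# Hypothetical rows with inconsistent month labels
toy = pd.DataFrame({
    "month": ["may", "May ", "JUN", "jun"],
    "y": ["no", "no", "yes", "yes"],
})

# Step 3 first: strip whitespace and lowercase the labels
toy["month"] = toy["month"].str.strip().str.lower()

# Step 1: count null values across the whole frame
n_missing = int(toy.isnull().sum().sum())

# Step 2: count, then drop, duplicate rows
n_duplicates = int(toy.duplicated().sum())
toy = toy.drop_duplicates().reset_index(drop=True)

print(n_missing, n_duplicates, toy["month"].tolist())
```

Normalizing labels before the duplicate check matters here: rows that differ only in case or trailing spaces would otherwise not be detected as duplicates.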


In [8]:
# Check for missing values in both datasets
missing = df.isnull().sum()
missing_sub = df_sub.isnull().sum()

# Identify duplicate rows: Check for duplicates in both datasets
duplicates = df.duplicated().sum()
duplicates_sub = df_sub.duplicated().sum()

missing, missing_sub, duplicates, duplicates_sub


(age          0
 job          0
 marital      0
 education    0
 default      0
 balance      0
 housing      0
 loan         0
 contact      0
 day          0
 month        0
 duration     0
 campaign     0
 pdays        0
 previous     0
 poutcome     0
 y            0
 dtype: int64,
 age          0
 job          0
 marital      0
 education    0
 default      0
 balance      0
 housing      0
 loan         0
 contact      0
 day          0
 month        0
 duration     0
 campaign     0
 pdays        0
 previous     0
 poutcome     0
 y            0
 dtype: int64,
 np.int64(0),
 np.int64(0))

There are no null values in either dataset. However, several categorical columns contain the label 'unknown', which we can treat as missing data.

There are also no duplicate rows in either dataset.
We will take the following steps to handle 'unknown' values in the 'job', 'education', 'contact' and 'poutcome' columns.
First, calculate the percentage of 'unknown' values in each column to understand the extent of the issue.



In [9]:
# Select only the categorical columns

categorical_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y']

# Count 'unknown' values in each column

unknown_counts = df[categorical_columns].apply(lambda col: (col == "unknown").sum())

print(unknown_counts)  # number of 'unknown' values per categorical column

job            288
marital          0
education     1857
default          0
housing          0
loan             0
contact      13020
month            0
poutcome     36959
y                0
dtype: int64


In [10]:
# Calculate the percentage of 'unknown' values in each relevant column
columns_with_unknowns = ['job', 'education', 'poutcome', 'contact']
for col in columns_with_unknowns:
    unknown_count = df[df[col] == 'unknown'].shape[0]
    total_count = df.shape[0]
    percentage = (unknown_count / total_count) * 100
    print(f"Column: {col}, Unknown Values: {unknown_count}, Percentage: {percentage:.2f}%")


Column: job, Unknown Values: 288, Percentage: 0.64%
Column: education, Unknown Values: 1857, Percentage: 4.11%
Column: poutcome, Unknown Values: 36959, Percentage: 81.75%
Column: contact, Unknown Values: 13020, Percentage: 28.80%
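
The same percentages can also be computed without an explicit loop. This is a minimal sketch on a hypothetical toy frame; it relies on the fact that the mean of a boolean column equals the share of `True` values.

```python
import pandas as pd

# Hypothetical values; column names follow the bank dataset
toy = pd.DataFrame({
    "job": ["management", "unknown", "technician", "retired"],
    "contact": ["unknown", "unknown", "cellular", "telephone"],
})

# Share of 'unknown' per column, expressed as a percentage
unknown_pct = toy.eq("unknown").mean().mul(100).round(2)
print(unknown_pct)
```

Applied to the real data, the equivalent expression would be `df[columns_with_unknowns].eq('unknown').mean().mul(100)`.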


**job and education:** These are likely important features for predicting the target.
Only 0.64% of 'job' values and 4.11% of 'education' values are 'unknown', which is very low.

We will replace 'unknown' values with the mode of each column.

**contact and poutcome:**

These columns have a high proportion of 'unknown' values (28.80% for 'contact' and 81.75% for 'poutcome'). If their **impact** on the target variable (y) appears minimal, we can remove them entirely.

Let's check the impact of 'contact' and 'poutcome' on the target variable (y).


**Perform Chi-Square Test (Categorical Association)**

A chi-square test can help determine whether there is a statistically significant association between the column (contact or poutcome) and the target variable (y).

Interpretation:
A p-value < 0.05 indicates a statistically significant association between the column and the target variable (y).
A p-value >= 0.05 means the test found no evidence of an association, which suggests the column has little impact on the target.
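
To make the mechanics of the test concrete, here is a small sketch on a hypothetical 3x2 contingency table. The counts are invented for illustration and are not taken from the bank data. `chi2_contingency` returns the test statistic, the p-value, the degrees of freedom, and the frequencies expected if the two variables were independent.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = three categories, columns = y ('no', 'yes')
observed = np.array([
    [90, 10],
    [80, 20],
    [60, 40],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

# Degrees of freedom are (rows - 1) * (columns - 1) = 2 for this table
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
print(expected)  # counts expected under independence
```

The statistic grows as the observed counts move away from the expected ones, so a large statistic yields a small p-value.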


In [11]:
from scipy.stats import chi2_contingency

# Create contingency tables for 'contact' and 'poutcome'
contact_table = pd.crosstab(df['contact'], df['y'])
poutcome_table = pd.crosstab(df['poutcome'], df['y'])

# Perform chi-square test
contact_chi2, contact_p, _, _ = chi2_contingency(contact_table)
poutcome_chi2, poutcome_p, _, _ = chi2_contingency(poutcome_table)

print(f"Contact - Chi-square p-value: {contact_p}")
print(f"Poutcome - Chi-square p-value: {poutcome_p}")


Contact - Chi-square p-value: 1.251738325340638e-225
Poutcome - Chi-square p-value: 0.0



In [12]:
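# Map each column name to its chi-square p-value so the message names the column
check = {'contact': contact_p, 'poutcome': poutcome_p}

for col, p_value in check.items():
    if p_value >= 0.05:
        print(f"there is no significant association between {col} and y (p = {p_value})")
    else:
        print(f"there is a significant association between {col} and y (p = {p_value})")

there is a significant association between contact and y (p = 1.251738325340638e-225)
there is a significant association between poutcome and y (p = 0.0)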


Conclusion: both columns are significantly associated with y, so we will retain 'contact' and 'poutcome' in df.
We have two options for their 'unknown' values:
1. Keep the 'unknown' values as they are.
2. Replace the 'unknown' values with the mode of that column.
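
One caveat for option 2: when 'unknown' is itself the most frequent label, which is likely for a column such as 'poutcome' where 81.75% of values are 'unknown', replacing 'unknown' with the mode changes nothing. The sketch below uses hypothetical toy values to show this, along with one possible alternative (an assumption, not the approach taken in this notebook): computing the mode over the known values only.

```python
import pandas as pd

# Hypothetical values where 'unknown' is the most frequent label
toy = pd.Series(
    ["unknown", "unknown", "unknown", "failure", "failure", "success"],
    name="poutcome",
)

mode_all = toy.mode()[0]                      # mode including 'unknown'
mode_known = toy[toy != "unknown"].mode()[0]  # mode of the known values only

replaced_with_mode_all = toy.replace("unknown", mode_all)
replaced_with_mode_known = toy.replace("unknown", mode_known)

print(mode_all, mode_known)
print(replaced_with_mode_all.equals(toy))  # True: nothing was changed
```

Imputing most of a column with a single known label would strongly distort its distribution, so keeping 'unknown' as its own category (option 1) may be the safer choice for such columns.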

In [13]:
# Find the mode of 'contact' and 'poutcome'
mode_contact = df['contact'].mode()[0]
mode_poutcome = df['poutcome'].mode()[0]

print("mode value of contact : ", mode_contact)
print("mode value of poutcome: ", mode_poutcome)

# finding mode for 'job' and 'education'
mode_job = df['job'].mode()[0]
mode_education = df['education'].mode()[0]
print("mode value of job : ", mode_job)
print("mode value of education: ", mode_education)


mode value of contact :  cellular
mode value of poutcome:  unknown
mode value of job :  blue-collar
mode value of education:  secondary


In [14]:
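# Replace 'unknown' values with the mode of the relevant column
df_cleaned = df.copy()  # work on a copy so the original df is not modified
for col in ['job', 'education', 'contact', 'poutcome']:
    mode_value = df[col].mode()[0]
    # Note: the mode of 'poutcome' is 'unknown', so that column is left unchanged
    df_cleaned[col] = df_cleaned[col].replace('unknown', mode_value)
    print(df_cleaned[col])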
    

0          management
1          technician
2        entrepreneur
3         blue-collar
4         blue-collar
             ...     
45206      technician
45207         retired
45208         retired
45209     blue-collar
45210    entrepreneur
Name: job, Length: 45211, dtype: str
0         tertiary
1        secondary
2        secondary
3        secondary
4        secondary
           ...    
45206     tertiary
45207      primary
45208    secondary
45209    secondary
45210    secondary
Name: education, Length: 45211, dtype: str
0         cellular
1         cellular
2         cellular
3         cellular
4         cellular
           ...    
45206     cellular
45207     cellular
45208     cellular
45209    telephone
45210     cellular
Name: contact, Length: 45211, dtype: str
0        unknown
1        unknown
2        unknown
3        unknown
4        unknown
          ...   
45206    unknown
45207    unknown
45208    success
45209    unknown
45210      other
Name: poutcome, Length: 45211, d