### Step 2. The Split (The Firewall): Split the data 

First we split the data; then we handle the 'unknown' values in the 'job', 'education', 'contact' and 'poutcome' columns.
As the first step of that cleanup, we calculate the percentage of "unknown" values in each column to understand the extent of the issue.

"Pro-Tip" for Banking
When performing a Train-Test Split, always use Stratification.

In our bank-full dataset, "Success" (y=1) is likely rare (around 11-12%). A random split without stratification does not guarantee that Train and Test keep the same success rate, and the gap tends to grow as the dataset or the positive class gets smaller. When the two rates differ, evaluation metrics (like Precision/Recall) measured on Test become unreliable.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the training data we processed earlier
df = pd.read_csv('../data/processed/df_processed.csv')

In [8]:
from sklearn.model_selection import train_test_split

# Always stratify by the target variable 'y' in banking
X = df.drop('y', axis=1)
y = df['y']
# 80/20 split with stratification (keeps the class ratio of y consistent in both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

Training set size: (36168, 20)
Test set size: (9043, 20)
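
A quick way to confirm that stratification worked is to compare the share of positives in each split. The sketch below uses a small synthetic frame rather than the bank data; the names `demo`, `X_tr`, `y_te` and the 12% positive rate are illustrative assumptions, and the labels are assumed to be 'yes'/'no'.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data: roughly 12% positives (assumed rate)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "balance": rng.normal(1000, 500, size=5000),
    "y": rng.choice(["no", "yes"], size=5000, p=[0.88, 0.12]),
})

X_demo = demo.drop(columns="y")
y_demo = demo["y"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of 'yes' in each split; with stratify=y these should be almost identical
train_rate = (y_tr == "yes").mean()
test_rate = (y_te == "yes").mean()
print(f"Train 'yes' rate: {train_rate:.4f}")
print(f"Test  'yes' rate: {test_rate:.4f}")
```

The same two lines of rate checking can be run on `y_train` and `y_test` from the cell above.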


Now, we have

Train data = X_train, y_train 
Test data = X_test, y_test

Feature Engineering

From here on, data cleaning, feature engineering and EDA are performed only on the train data: X_train, y_train

In [15]:
# Combine X and y back together for cleaning, featurization, EDA and Modeling storage
df_train = pd.concat([X_train, y_train], axis=1)

# Save to processed directory
# train_full.to_csv('../data/processed/train_processed.csv', index=False)
# test_full.to_csv('../data/processed/test_processed.csv', index=False)

# print("Data cleaning and feature engineering complete. Files saved to data/processed/.")

df_train.info()


<class 'pandas.DataFrame'>
Index: 36168 entries, 24001 to 44229
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              36168 non-null  int64  
 1   job              36168 non-null  str    
 2   marital          36168 non-null  str    
 3   education        36168 non-null  str    
 4   default          36168 non-null  str    
 5   balance          36168 non-null  int64  
 6   housing          36168 non-null  str    
 7   loan             36168 non-null  str    
 8   contact          36168 non-null  str    
 9   day              36168 non-null  int64  
 10  month            36168 non-null  str    
 11  duration         36168 non-null  int64  
 12  campaign         36168 non-null  int64  
 13  pdays            36168 non-null  int64  
 14  previous         36168 non-null  int64  
 15  poutcome         36168 non-null  str    
 16  total_contacts   36168 non-null  int64  
 17  call_efficiency  36168 n

### Part B: Data cleaning on the train data only, after splitting the data into train and test

1. Impute: Replace unknown in job or education with the mode.
2. Handle Outliers: Cap the extreme balance values (Capping/Winsorization).

**Step 3. Statistical Modeling Cleaning: Post-Split**

Now in the Lab.

- Imputation/Scaling: Calculate the reference statistic (mean/mode) from the Train set. If the Test set has a missing value, fill it with the Train mode. This mimics the real world, where you use past knowledge to handle new, incomplete information.

- Handling Outliers: Define the capping thresholds from the Train distribution only.

- Bivariate EDA & Heatmaps: By doing this only on the training set, you ensure that your decision to keep or drop a feature is based only on the data the model is allowed to learn from.

**Do these AFTER splitting:**

1. Handling Outliers: Defining what an outlier is based on the training distribution.
2. Imputation: Filling missing values using the mean, median, or mode of the training set only.
3. Scaling/Normalization: If you normalize based on the global maximum, you've leaked the range of the test set into your training process.
4. EDA: Visualizing correlations and distributions to decide which features to keep.
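
The scaling point can be illustrated with a minimal sketch, assuming scikit-learn's `MinMaxScaler` is the scaler of choice (the notebook has not picked one yet). The toy arrays `train_balance` and `test_balance` are made up for illustration; the test array deliberately contains a value larger than anything in train.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy 'balance' values; the test set contains a larger value than any in train
train_balance = np.array([[0.0], [500.0], [1000.0], [2000.0]])
test_balance = np.array([[250.0], [4000.0]])

scaler = MinMaxScaler()
scaler.fit(train_balance)                      # min and max are learned from train only
train_scaled = scaler.transform(train_balance)
test_scaled = scaler.transform(test_balance)   # reuses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

A scaled test value above 1 is expected here: it shows that the scaler never saw the test range, which is exactly the property we want.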

In [16]:
# 4. Impute: Replace unknown in job or education with the mode.
# selected only categorical columns

categorical_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y']

# count unknown values in each column

unknown_counts = df_train[categorical_columns].apply(lambda col: (col == "unknown").sum())

print(unknown_counts) # the output will give unknown counts for each categorical columns

job            234
marital          0
education     1482
default          0
housing          0
loan             0
contact      10386
month            0
poutcome     29589
y                0
dtype: int64


In [18]:
# Calculate the percentage of 'unknown' values in each relevant column of the train set
columns_with_unknowns = ['job', 'education', 'poutcome', 'contact']
for col in columns_with_unknowns:
    unknown_count = df_train[df_train[col] == 'unknown'].shape[0]
    total_count = df_train.shape[0]  # use df_train so counts and total come from the same set
    percentage = (unknown_count / total_count) * 100
    print(f"Column: {col}, Unknown Values: {unknown_count}, Percentage: {percentage:.2f}%")


Column: job, Unknown Values: 234, Percentage: 0.65%
Column: education, Unknown Values: 1482, Percentage: 4.10%
Column: poutcome, Unknown Values: 29589, Percentage: 81.81%
Column: contact, Unknown Values: 10386, Percentage: 28.72%


**job and education:** These are likely important features for predicting the target.
The percentages of unknown values in 'job' and 'education' are 0.65% and 4.10%, which are very low.

We will replace 'unknown' rows with the mode value of that column.

**contact and poutcome:**

The percentages of unknown values in 'contact' and 'poutcome' are 28.72% and 81.81%, which are high.

The 'poutcome' column has a high proportion (>30%) of "unknown" values. If the **impact** of such a column on the target variable (y) seems minimal, we can remove it entirely.

Let's check the impact of 'contact' and 'poutcome' on the target variable (y).


**Perform Chi-Square Test (Categorical Association)**

A chi-square test can help determine whether there is a statistically significant association between the column (contact or poutcome) and the target variable (y).

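Interpretation:
A p-value < 0.05 indicates a statistically significant association between the column and the target variable (y).
A high p-value (>0.05) means the test found no evidence of an association. The p-value does not measure how strong the association is; with roughly 36,000 training rows, even a weak association can be statistically significant.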


In [19]:
from scipy.stats import chi2_contingency

# Create contingency tables for 'contact' and 'poutcome'
contact_table = pd.crosstab(df_train['contact'], df_train['y'])
poutcome_table = pd.crosstab(df_train['poutcome'], df_train['y'])

# Perform chi-square test
contact_chi2, contact_p, _, _ = chi2_contingency(contact_table)
poutcome_chi2, poutcome_p, _, _ = chi2_contingency(poutcome_table)

print(f"Contact - Chi-square p-value: {contact_p}")
print(f"Poutcome - Chi-square p-value: {poutcome_p}")


Contact - Chi-square p-value: 2.2216425534984923e-177
Poutcome - Chi-square p-value: 0.0
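
Because p-values this small mostly reflect the sample size, an effect-size measure helps judge how strong the association is. One common choice (not used in the original notebook) is Cramér's V, sketched below on toy data. The helper name `cramers_v` and the `toy` frame are hypothetical; note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, which slightly lowers the statistic.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table (0 = no association, 1 = perfect)."""
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()          # total number of observations
    min_dim = min(table.shape) - 1      # min(rows, cols) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy data: 'cellular' contacts subscribe more often than 'unknown' ones
toy = pd.DataFrame({
    "contact": ["cellular"] * 60 + ["unknown"] * 40,
    "y": ["yes"] * 30 + ["no"] * 30 + ["yes"] * 5 + ["no"] * 35,
})
toy_table = pd.crosstab(toy["contact"], toy["y"])
v = cramers_v(toy_table)
print(f"Cramér's V: {v:.3f}")
```

On the real data the same helper could be called with `contact_table` and `poutcome_table` from the cell above.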




In [20]:
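# Map each column name to its p-value so the message reports the column, not the number
check = {'contact': contact_p, 'poutcome': poutcome_p}

for name, p_value in check.items():
    if p_value >= 0.05:
        print("there is no significant association with", name)
    else:
        print("there is significant association with", name)

there is significant association with contact
there is significant association with poutcome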


Conclusion: we will retain the 'contact' and 'poutcome' columns in df_train.
We have two options:
1. continue as is with the 'unknown' values
2. replace the 'unknown' values with the mode value of that column.

In [21]:
# finding mode for 'contact' and 'poutcome'
mode_contact = df_train['contact'].mode()[0]
mode_poutcome = df_train['poutcome'].mode()[0]

print("mode value of contact : ", mode_contact)
print("mode value of poutcome: ", mode_poutcome)

# finding mode for 'job' and 'education'
mode_job = df_train['job'].mode()[0]
mode_education = df_train['education'].mode()[0]
print("mode value of job : ", mode_job)
print("mode value of education: ", mode_education)


mode value of contact :  cellular
mode value of poutcome:  unknown
mode value of job :  blue-collar
mode value of education:  secondary


In [28]:
# Replace 'unknown' values with the relevant column's mode value.
# 'poutcome' is left out of this loop: its "unknown" level is kept as its own category
# because the model can learn that "unknown" actually means "new customer."
df_cleaned = df_train  # alias, not a copy: the replacements below also modify df_train
for col in ['job', 'education', 'contact']:
    mode_value = df_train[col].mode()[0]
    df_cleaned[col] = df_cleaned[col].replace('unknown', mode_value)
    #print(df_cleaned[col])


print("Cleaning 'unknown' values imputed with relevant mode value.")   

Cleaning 'unknown' values imputed with relevant mode value.


In [31]:
# # 1. Identify columns for Mode Imputation
# impute_cols = ['job', 'education']

# for col in impute_cols:
#     # Calculate mode ONLY on Training set
#     train_mode = X_train[col].mode()[0]
    
#     # Replace 'unknown' in both sets
#     X_train[col] = X_train[col].replace('unknown', train_mode)
#     X_test[col] = X_test[col].replace('unknown', train_mode)

# 2. Rename 'unknown' in poutcome to keep the signal
# This prevents 'unknown' from being treated as a missing value
df_train['poutcome'] = df_train['poutcome'].replace('unknown', 'other_outcome')
X_test['poutcome'] = X_test['poutcome'].replace('unknown', 'other_outcome')

print("Imputation complete using Training Set statistics.")

Imputation complete using Training Set statistics.


**Professional Outlier Handling: Capping**

Handle Outliers: Cap the extreme balance values (Capping/Winsorization).

Logic: Don't just delete the outliers (that way we may lose too much data). Instead, cap them at the 99th percentile. This keeps the "High Value" signal without letting the €102,127 balance break your model.

Capping vs. Deleting: In banking, we rarely delete outliers like a €102k balance. That person is a "High Net Worth" client! By capping (for example with np.where or .clip()), we keep the customer in the dataset but prevent their high balance from "pulling" the average too far away from the typical customer.

We also want our "Bivariate" plots to be accurate. If we don't cap outliers first, the scale will squeeze most points of a scatter plot into what looks like a single dot.

In [32]:
# 5. Handle Outliers (Capping / Winsorization)
# Cap extreme values of balance, duration and campaign at the 99th percentile.
for col in ['balance', 'duration', 'campaign']:
    upper_limit = df_train[col].quantile(0.99)  # threshold computed on the train set only
    # Values above the 99th percentile are replaced by the threshold
    df_train[col] = np.where(df_train[col] > upper_limit, upper_limit, df_train[col])
    # or
    # df_train[col] = df_train[col].clip(upper=upper_limit)


print("Outliers capped at 99th percentile for balance, duration, and campaign.")
print("Cleaning Complete: outliers are capped and unknown values handled. \nData is ready for Featurization (Feature Engineering) and EDA")

Outliers capped at 99th percentile for balance, duration, and campaign.
Cleaning Complete: outliers are capped and unknown values handled. 
Data is ready for Featurization (Feature Engineering) and EDA
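
The cell above caps only the train set. To stay consistent with the "train statistics only" rule, the test set would later be capped with the same train-derived threshold. A minimal sketch on toy data, where `train_demo` and `test_demo` are hypothetical stand-ins for the real frames:

```python
import pandas as pd

# Toy balances; both sets contain one extreme value
train_demo = pd.DataFrame(
    {"balance": [0.0, 100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 100000.0]}
)
test_demo = pd.DataFrame({"balance": [50.0, 250.0, 250000.0]})

# The threshold is learned from the train set only
upper_limit = train_demo["balance"].quantile(0.99)

# The same threshold is applied to both sets
train_demo["balance"] = train_demo["balance"].clip(upper=upper_limit)
test_demo["balance"] = test_demo["balance"].clip(upper=upper_limit)

print(f"Cap learned from train: {upper_limit:.1f}")
print(test_demo)
```

The extreme test value is pulled down to the train cap, while the test set itself never influences where the cap sits.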


In [35]:
print(df_train.head())
df_train.info()

       age         job   marital  education default  balance housing loan  \
24001   36  technician  divorced  secondary      no    861.0      no   no   
43409   24     student    single  secondary      no   4126.0      no   no   
20669   44  technician    single  secondary      no    244.0     yes   no   
18810   48  unemployed   married  secondary      no      0.0      no   no   
23130   38  technician   married  secondary      no    257.0      no   no   

         contact  day  ... durati

# Feature Engineering

In a professional banking workflow, the order of Feature Engineering versus Bivariate EDA is a bit of a "chicken and egg" situation, but there is a preferred path for efficiency.

The industry-standard approach is to do Feature Engineering before Bivariate EDA.

Why this order?

If you create a feature like balance_per_campaign or age_groups first, you can include them in your Bivariate EDA and Heatmaps. This allows you to see if your newly created features actually have a stronger relationship with the target (y) than the raw data did.

Standard: Create new columns that are derived from row-level logic (e.g., balance_per_campaign or age_groups).

Why: These are based on business rules, not statistical distributions.

Status: Correct. You did this before the split.
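
To make "row-level logic" concrete, here is a small sketch of the two example features named above. The toy frame `demo`, the age cut points and the band labels are illustrative assumptions, not the rules used in the earlier SQL step. Each new value depends only on its own row, so nothing is learned from the rest of the dataset.

```python
import pandas as pd

demo = pd.DataFrame({
    "age": [23, 35, 47, 61],
    "balance": [500.0, 1200.0, -300.0, 8000.0],
    "campaign": [1, 3, 2, 4],   # number of contacts, assumed to be at least 1
})

# Ratio feature: uses only values from the same row
demo["balance_per_campaign"] = demo["balance"] / demo["campaign"]

# Binned feature: fixed, business-defined age bands (cut points are illustrative)
demo["age_groups"] = pd.cut(
    demo["age"],
    bins=[0, 30, 45, 60, 120],
    labels=["<=30", "31-45", "46-60", "60+"],
)

print(demo)
```

Because the bins are fixed in advance rather than derived from quantiles of the data, applying this before or after the split gives the same result.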


**Summary Checklist for our Notebook:**

Done: Split: Separate your data (Step 2).

Done: Cleaning: Fix the data (Mode imputation, Outlier capping).

Current: Feature Engineering: Add new columns (derived from your SQL logic).

Next: EDA (Part B): Compare variables to y (Bivariate) and check correlations.

Later: Scale: Normalize the numbers, fitting the scaler on the train set only.

# Part B Exploratory Data Analysis 

Goal: Gain insights into the dataset and understand relationships between variables.

We are now in the "Discovery Phase." In a banking context, Bivariate EDA isn't just about pretty pictures; it's about finding Profitability Signals: identifying which types of customers are most likely to say "Yes."

Here is the industrial suite of Bivariate EDA, focusing on your engineered features and core banking variables.

Actions:
- Summarize numerical and categorical variables.
- Visualize distributions, correlations, and trends.
- Identify outliers.

1. Numerical Correlation (The "Opportunity Map")
First, we look at the Heatmap. This tells us which variables move in sync with the target y.

In [33]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
# 'y' is not numeric in df_train, so select_dtypes(include=['number']) would drop it
# and correlation_matrix[['y']] would raise a KeyError. Encode it as 1/0 first
# (assumes the labels are 'yes'/'no').
numeric_df = df_train.select_dtypes(include=['number']).copy()
numeric_df['y'] = (df_train['y'] == 'yes').astype(int)
# Only correlate numerical columns with the target
correlation_matrix = numeric_df.corr()
sns.heatmap(correlation_matrix[['y']].sort_values(by='y', ascending=False), 
            annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation of Features with Subscription (y)")
plt.show()

2. Categorical Analysis (The "Customer Profile")
We want to see the Subscription Rate across different categories. This is where your structural features (like job_type or quarter) will shine.

In [37]:
# Function to plot subscription rates
def plot_sub_rate(column, dataset):
    # Encode the target as 1/0 so its mean is the subscription rate
    # (assumes the labels are 'yes'/'no')
    plot_data = dataset[[column]].copy()
    plot_data['y'] = dataset['y'].map({'yes': 1, 'no': 0})
    plt.figure(figsize=(10, 5))
    # Bar height = proportion of 'yes' per category
    sns.barplot(x=column, y='y', data=plot_data, palette='viridis', errorbar=None)
    plt.axhline(plot_data['y'].mean(), color='red', linestyle='--', label='Avg Subscription Rate')
    plt.title(f'Subscription Rate by {column}')
    plt.ylabel('Proportion of "Yes"')
    plt.legend()
    plt.show()

# Plot only the columns that exist in df_train; engineered columns such as
# job_type must be created before they can be plotted
for col in ['job_type', 'quarter', 'multiple_loans', 'is_new_customer']:
    if col in df_train.columns:
        plot_sub_rate(col, df_train)
    else:
        print(f"Skipping '{col}': not a column of df_train")