In [1]:
import pandas as pd

# Load the cleaned dataset (post 2.3)
df = pd.read_csv("../data/processed/bank_cleaned.csv")

# Quick check
df.head(5)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        11162 non-null  int64 
 1   job        11162 non-null  object
 2   marital    11162 non-null  object
 3   education  11162 non-null  object
 4   default    11162 non-null  int64 
 5   balance    11162 non-null  int64 
 6   housing    11162 non-null  int64 
 7   loan       11162 non-null  int64 
 8   contact    11162 non-null  object
 9   day        11162 non-null  int64 
 10  month      11162 non-null  object
 11  duration   11162 non-null  int64 
 12  campaign   11162 non-null  int64 
 13  pdays      11162 non-null  int64 
 14  previous   11162 non-null  int64 
 15  poutcome   11162 non-null  object
 16  deposit    11162 non-null  int64 
dtypes: int64(11), object(6)
memory usage: 1.4+ MB


### 📥 Loading the Cleaned Dataset

This step loads the **final cleaned dataset** that was saved after completing:
- Missing value checks  
- Outlier handling  
- Data type corrections  

**What the code does:**
- Imports pandas for data handling
- Reads the cleaned CSV file from `data/processed/`
- Loads it into a DataFrame called `df`
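- Inspects:
  - The first 5 rows via `head()` to verify structure (a notebook cell echoes only its last expression, so this preview does not appear in the output above; wrap it in `print()` or `display()` to see it)
  - The dataset schema via `info()` to confirm:
    - Correct data types
    - No unexpected null values
    - Column readiness for modeling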
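
The checks that `info()` supports by eye can also be turned into explicit assertions. The following is a minimal sketch, not the notebook's own code: the helper name `validate_reloaded` and the `EXPECTED_DTYPES` mapping are hypothetical, and the small frame stands in for the reloaded CSV.

```python
import pandas as pd

# Hypothetical expectations for a few numeric columns; extend to match your schema.
EXPECTED_DTYPES = {"age": "int64", "balance": "int64", "deposit": "int64"}


def validate_reloaded(df, expected):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for column, dtype in expected.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    # Report every column that still contains nulls after reloading.
    null_counts = df.isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        problems.append(f"{column}: {count} null value(s)")
    return problems


# Toy frame standing in for the reloaded dataset (illustrative values only).
sample = pd.DataFrame(
    {"age": [59, 41], "balance": [2343, 45], "deposit": [1, 0]}, dtype="int64"
)
print(validate_reloaded(sample, EXPECTED_DTYPES))
```

Returning a list of problems rather than raising on the first one makes it easier to see every schema drift in a single run.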

**Why this step is critical:**
- Confirms that cleaning was **successfully persisted to disk**
- Ensures the dataset can be reused across:
  - Feature engineering notebooks
  - Model training scripts
  - Streamlit deployment
- Prevents accidental dependency on in-memory transformations

**ML mindset checkpoint:**
> If your cleaned data doesn’t reload cleanly, your pipeline isn’t real.

✔️ Dataset verified  
➡️ Ready for feature engineering and encoding


In [2]:
def age_group(age):
    if age <= 30:
        return 'Young Adult'
    elif age <= 45:
        return 'Adult'
    elif age <= 60:
        return 'Middle-Aged'
    else:
        return 'Senior'

df['age_group'] = df['age'].apply(age_group)
df['age_group'] = df['age_group'].astype('category')


### 🧠 Feature Engineering: Age Group

This step transforms the raw numerical **age** feature into a more meaningful **categorical variable** called `age_group`.

**What the code does:**
- Defines a custom function `age_group(age)` that:
  - Segments customers into life-stage buckets:
    - ≤ 30 → *Young Adult*
    - 31–45 → *Adult*
    - 46–60 → *Middle-Aged*
    - > 60 → *Senior*
- Applies this function to the `age` column using `.apply()`
- Creates a new column `age_group`
- Converts `age_group` to the `category` data type
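
The same bucketing can be expressed without a row-wise `.apply()`. The sketch below is an alternative, not the notebook's method: it uses `pd.cut` with the cut points from the function above, and the sample ages are made up to exercise each boundary.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on and around each bucket boundary.
ages = pd.Series([18, 30, 31, 45, 46, 60, 61, 95], name="age")

# right=True makes each bin include its upper edge, matching the `<=` checks.
age_group = pd.cut(
    ages,
    bins=[-np.inf, 30, 45, 60, np.inf],
    labels=["Young Adult", "Adult", "Middle-Aged", "Senior"],
    right=True,
)
print(age_group.tolist())
```

`pd.cut` returns an ordered categorical directly, so the separate `astype('category')` step is not needed and the life-stage order is preserved for later sorting or plotting.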

**Why this matters:**
- Raw age is continuous and noisy
- Grouping ages:
  - Captures **behavioral patterns** more effectively
  - Reduces sensitivity to small numeric changes
  - Improves interpretability for business and stakeholders

**Modeling advantage:**
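- Once encoded, linear models can fit a separate effect per group instead of a single slope across all ages
- Tree-based models (like Random Forest) can already split on raw age, so for them the gain is mainly interpretability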
- Easier feature importance analysis later

**Real-world logic check:**
> Banks don’t market by exact age; they market by life stage.

✔️ Feature created  
➡️ Ready for encoding and modeling


In [3]:
def balance_category(balance):
    if balance < 1000:
        return 'Low'
    elif balance <= 5000:
        return 'Medium'
    else:
        return 'High'

df['balance_category'] = df['balance'].apply(balance_category)
df['balance_category'] = df['balance_category'].astype('category')


### 🧠 Feature Engineering: Balance Category

This step converts the continuous **balance** feature into a categorical variable named `balance_category`, representing a customer’s financial standing.

**What the code does:**
- Defines a function `balance_category(balance)` that:
  - Classifies account balance into three levels:
    - < 1000 → *Low*
    - 1000–5000 → *Medium*
    - > 5000 → *High*
- Applies this logic to the `balance` column
- Creates a new feature `balance_category`
- Casts it to the `category` data type
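
Because the lower boundary is strict (`< 1000`) while the upper one is inclusive (`<= 5000`), a single `pd.cut` call cannot reproduce these rules exactly. The sketch below is one vectorized alternative, assuming `np.select` is acceptable here; the sample balances are invented to probe each edge.

```python
import numpy as np
import pandas as pd

# Invented balances that probe each boundary, including a negative value.
balances = pd.Series([-200, 999, 1000, 5000, 5001], name="balance")

# Conditions are evaluated in order, mirroring the if/elif/else chain.
labels = np.select(
    [balances < 1000, balances <= 5000],
    ["Low", "Medium"],
    default="High",
)

# Declaring all three levels keeps "High" as a category even when no row has it.
balance_category = pd.Series(
    pd.Categorical(labels, categories=["Low", "Medium", "High"], ordered=True),
    index=balances.index,
    name="balance_category",
)
print(balance_category.tolist())
```

Declaring the categories explicitly means `value_counts()` reports a count of 0 for a level with no rows instead of omitting it, which makes an empty bucket visible.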

**Why this matters:**
- Exact balance values are volatile and noisy
- Categorization:
  - Highlights spending/saving behavior
  - Reduces the impact of extreme values
  - Aligns better with real banking segmentation

**Modeling advantage:**
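- Helps linear models capture **non-linear financial patterns** by fitting one effect per category
- Bounds the influence of extreme balances, which would otherwise stretch the range of a scaled numeric feature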
- Makes feature importance more interpretable

**Business intuition:**
> Marketing doesn’t care if a customer has 5123 or 5180; it cares if they’re *high balance*.

✔️ Feature engineered  
➡️ Ready for encoding and predictive modeling


In [4]:
def contact_intensity(campaign):
    if campaign <= 2:
        return 'Low'
    elif campaign <= 5:
        return 'Medium'
    else:
        return 'High'

df['contact_intensity'] = df['campaign'].apply(contact_intensity)
df['contact_intensity'] = df['contact_intensity'].astype('category')


### 🧠 Feature Engineering: Contact Intensity

This step creates a new categorical feature called **`contact_intensity`**, derived from the `campaign` variable (number of contacts made during the marketing campaign).

**What the code does:**
- Defines a function `contact_intensity(campaign)` that:
  - Converts raw contact counts into meaningful levels:
    - ≤ 2 → *Low*
    - 3–5 → *Medium*
    - > 5 → *High*
- Applies this function to the `campaign` column
- Stores the result in a new feature `contact_intensity`
- Converts it to the `category` data type

**Why this matters:**
- Raw contact counts don’t tell the full story
- Categorization captures **marketing pressure levels**
- Prevents large campaign values from dominating the model

**Modeling advantage:**
- Helps the model learn diminishing returns of repeated contact
- Reduces noise caused by extreme campaign values
- Improves interpretability of feature importance
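
Whether repeated contact really shows diminishing returns is something to check rather than assume. One way to look, sketched here on toy data that is not drawn from the bank dataset, is to compare the share of positive `deposit` outcomes per intensity level.

```python
import pandas as pd

# Toy data for illustration only; these are not values from the bank dataset.
toy = pd.DataFrame(
    {
        "contact_intensity": ["Low", "Low", "Medium", "Medium", "High", "High"],
        "deposit": [1, 0, 1, 1, 0, 0],
    }
)
toy["contact_intensity"] = pd.Categorical(
    toy["contact_intensity"], categories=["Low", "Medium", "High"], ordered=True
)

# The mean of a 0/1 target is the share of positive outcomes in each bucket.
rate_by_intensity = toy.groupby("contact_intensity", observed=False)["deposit"].mean()
print(rate_by_intensity)
```

Running the same `groupby` on the real `df` would show whether the subscription rate actually falls as contact intensity rises.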

**Business intuition:**
> Call too little, nobody listens.  
> Call too much, people hang up.  
> The sweet spot matters.

✔️ Feature engineered  
➡️ Ready for encoding and supervised learning


In [5]:
# Quick check of new features
print(df[['age_group','balance_category','contact_intensity']].head(5))

# Value counts
print(df['age_group'].value_counts())
print(df['balance_category'].value_counts())
print(df['contact_intensity'].value_counts())


     age_group balance_category contact_intensity
0  Middle-Aged           Medium               Low
1  Middle-Aged              Low               Low
2        Adult           Medium               Low
3  Middle-Aged           Medium               Low
4  Middle-Aged              Low               Low
age_group
Adult          5522
Middle-Aged    3022
Young Adult    2007
Senior          611
Name: count, dtype: int64
balance_category
Low       7115
Medium    4047
Name: count, dtype: int64
contact_intensity
Low       7826
Medium    2470
High       866
Name: count, dtype: int64


### 🔍 Verification of Engineered Features

After creating the new categorical features, this step performs a **sanity check** to ensure they were generated correctly.

**What the code does:**

1. **Preview engineered features**
   - Displays the first 5 rows of:
     - `age_group`
     - `balance_category`
     - `contact_intensity`
   - Confirms correct mapping from numerical values to categories

2. **Distribution analysis**
   - Uses `value_counts()` to show:
     - How customers are distributed across age groups
     - Balance level segmentation
     - Marketing contact intensity levels

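**Why this step is important:**
- Confirms that no null or unexpected categories were created
- Shows how evenly rows are spread across the categories of each engineered feature
- Surfaces skewed, underrepresented, or empty groups early: in the output above, `balance_category` contains only *Low* and *Medium*, so no row in the cleaned data has a balance above 5000 and the *High* bucket is empty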
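
These checks can be made explicit so that an empty or unexpected level is reported rather than spotted by eye. The sketch below assumes a hypothetical helper `check_engineered` and a hypothetical `EXPECTED_LEVELS` mapping; the toy frame only mimics the shape of the engineered columns.

```python
import pandas as pd

# Hypothetical mapping of engineered columns to the labels they should contain.
EXPECTED_LEVELS = {
    "age_group": ["Young Adult", "Adult", "Middle-Aged", "Senior"],
    "balance_category": ["Low", "Medium", "High"],
}


def check_engineered(df, expected_levels):
    """Return {column: [issues]} covering nulls, unexpected labels, and empty levels."""
    report = {}
    for column, levels in expected_levels.items():
        values = df[column]
        present = set(values.dropna().unique())
        issues = []
        if values.isna().any():
            issues.append("contains nulls")
        unexpected = present - set(levels)
        if unexpected:
            issues.append(f"unexpected labels: {sorted(unexpected)}")
        empty = [level for level in levels if level not in present]
        if empty:
            issues.append(f"empty levels: {empty}")
        if issues:
            report[column] = issues
    return report


# Toy frame (illustrative values only) in which no row falls into "High".
toy = pd.DataFrame(
    {
        "age_group": ["Adult", "Senior", "Young Adult", "Middle-Aged"],
        "balance_category": ["Low", "Medium", "Low", "Low"],
    }
)
print(check_engineered(toy, EXPECTED_LEVELS))
```

An empty level is not necessarily a bug, but flagging it prompts a decision about whether the thresholds still suit the cleaned data.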

**Modeling benefit:**
- Prevents silent feature engineering bugs
- Supports informed decisions before encoding
- Builds confidence before moving into EDA and modeling

**Data science rule of thumb:**
> If you don’t inspect it, you don’t control it.

✔️ Feature engineering validated  
➡️ Safe to proceed with encoding & model training


In [6]:
# Save final dataset for EDA & modeling
df.to_csv("../data/processed/bank_final.csv", index=False)


### 💾 Final Dataset Export (Post Cleaning & Feature Engineering)

This step saves the **fully prepared dataset** after completing:
- Missing/unknown value handling  
- Outlier treatment  
- Data type corrections  
- Feature engineering (age group, balance category, contact intensity)

**What this code does:**
- Writes the processed DataFrame to `data/processed/bank_final.csv`
- Excludes the index to keep the dataset clean and reusable

**Why this is important:**
- Establishes a stable input for:
  - Exploratory Data Analysis (EDA)
  - Model training and evaluation
  - Deployment (Streamlit app)
- Ensures reproducibility and clean pipeline separation
- Prevents re-running heavy preprocessing steps repeatedly
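
One caveat when reusing the exported file: CSV stores plain text, so the `category` dtype assigned during feature engineering does not survive a round trip on its own. The sketch below uses an in-memory buffer and a two-row toy frame in place of the real file to show how the dtype could be restored at load time.

```python
import io

import pandas as pd

# Toy frame standing in for the engineered dataset (illustrative values only).
frame = pd.DataFrame({"age": [25, 70]})
frame["age_group"] = pd.Categorical(["Young Adult", "Senior"])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# A plain reload infers a string-like dtype for the column.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype= restores the categorical type while reading.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"age_group": "category"})

print(plain["age_group"].dtype)
print(restored["age_group"].dtype)
```

Downstream notebooks that rely on categorical behavior would need a `dtype` mapping like this for `age_group`, `balance_category`, and `contact_intensity` when reading `bank_final.csv`.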

**Best practice principle:**
> Raw data is sacred.  
> Final data is intentional.

✔️ Data preparation completed  
➡️ Ready for EDA and machine learning modeling