## Importing Libraries and Loading the Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("/kaggle/input/datasets/hashemili/banking-marketing-dataset/bank.csv")

# Dataset Overview

In [None]:
# Basic information
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
display(df.head())

print("\nColumn Names:")
print(df.columns.tolist())

## Column Understanding

### One-line description for every column

* **age** Client's age
* **job** Type of job (admin, management, technician, etc.)
* **marital** Marital status
* **education** Level of education
* **default** Has credit in default?
* **balance** Average yearly balance
* **housing** Has housing loan?
* **loan** Has personal loan?
* **contact** Contact communication type
* **day** Last contact day of the month
* **month** Last contact month
* **duration** Last contact duration (seconds)
* **campaign** Number of contacts during this campaign
* **pdays** Days since last contact (-1 = never)
* **previous** Number of contacts before this campaign
* **poutcome** Outcome of previous campaign
* **deposit** Target - Did the client subscribe? (yes/no)

In [None]:
print("\nData Types:")
print(df.dtypes)

print("\nAll numeric columns are correctly typed. Categorical columns are still loaded as strings here and are converted to 'category' dtype during preprocessing.")

# Cleaning & Preprocessing

In [None]:
# Duplicates
print("Duplicate rows:", df.duplicated().sum())

# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
print("\nMissing Values Report:")
print(pd.DataFrame({'Missing Count': missing, 'Percentage': missing_pct.round(2)}))

## Missing Values Handling Plan
- job & education → Replaced 'unknown' with mode (very low missing %)
- contact & poutcome → Kept 'unknown' as a meaningful category (high percentage is normal in marketing data)
- No columns were dropped
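
Because this dataset records missing information as the string 'unknown', `isnull()` reports zero for these columns. The sketch below shows one way the share of 'unknown' per column could be measured. The `toy` frame and the `unknown_share` helper are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def unknown_share(frame: pd.DataFrame, token: str = "unknown") -> pd.Series:
    """Percentage of rows per column whose value equals `token`."""
    return (frame.eq(token).mean() * 100).round(2)

# Hypothetical toy data standing in for the real columns
toy = pd.DataFrame({
    "job": ["admin", "unknown", "technician", "admin"],
    "poutcome": ["unknown", "unknown", "success", "unknown"],
})

print(toy.isnull().sum())   # all zeros: 'unknown' is a string, not NaN
print(unknown_share(toy))   # job 25.0, poutcome 75.0
```

On the real data the same helper could be called as `unknown_share(df[['job', 'education', 'contact', 'poutcome']])`.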

In [None]:
# Duplicates & Validity
print("Exact duplicate rows:", df.duplicated().sum())

# Validity checks
print("\nValidity Checks:")
print("Age outside 18-100 :", ((df['age']<18) | (df['age']>100)).sum())
print("Duration <= 0       :", (df['duration'] <= 0).sum())
print("Campaign == 0       :", (df['campaign'] == 0).sum())

## Category Cleanliness
All labels are consistent (lowercase, no extra spaces or typos). Converted object columns to category dtype.
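
A check of this kind can be sketched as follows. The `label_issues` helper and the sample labels are hypothetical; the idea is to list every label that changes once it is stripped and lowercased.

```python
import pandas as pd

def label_issues(series: pd.Series) -> list:
    """Return labels that differ from their stripped, lowercased form."""
    labels = series.dropna().astype(str).unique()
    return sorted(label for label in labels if label != label.strip().lower())

# Made-up labels containing a capitalised entry and stray spaces
toy_job = pd.Series(["admin", "Admin ", "technician", " services"])

print(label_issues(toy_job))                          # [' services', 'Admin ']
print(label_issues(pd.Series(["admin", "retired"])))  # []
```

An empty list for every categorical column would support the statement that the labels are consistent.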

In [None]:
# Convert to category
cat_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 
            'contact', 'month', 'poutcome', 'deposit']
for col in cat_cols:
    df[col] = df[col].astype('category')

# Handle 'unknown'
for col in ['job', 'education']:
    df[col] = df[col].replace('unknown', np.nan)
    df[col] = df[col].fillna(df[col].mode()[0])

In [None]:
# Outlier treatment (Winsorizing)
num_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
for col in num_cols:
    lower = np.percentile(df[col], 1)
    upper = np.percentile(df[col], 99)
    df[col] = np.clip(df[col], lower, upper)

# New features
df['previously_contacted'] = (df['pdays'] > -1).astype(int)
df['any_loan'] = ((df['housing'] == 'yes') | (df['loan'] == 'yes')).astype(int)
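# map() on a category column returns a category, so cast to int to get a numeric target
df['deposit_num'] = df['deposit'].map({'yes': 1, 'no': 0}).astype(int)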
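
The winsorizing loop above computes its percentile bounds on the full dataset, before the train/test split performed later. As a hedged alternative, the sketch below learns the bounds from a reference sample (for example the training rows) and reuses them on other rows. The helper names `fit_winsor_bounds` and `apply_winsor` and the synthetic balances are assumptions for illustration.

```python
import numpy as np

def fit_winsor_bounds(values, lower_pct=1, upper_pct=99):
    """Learn clipping bounds from a reference sample (e.g. training rows)."""
    return np.percentile(values, lower_pct), np.percentile(values, upper_pct)

def apply_winsor(values, bounds):
    """Clip values to previously learned bounds."""
    lower, upper = bounds
    return np.clip(values, lower, upper)

rng = np.random.default_rng(0)
train_balance = rng.normal(1500, 3000, size=1000)      # synthetic stand-in
test_balance = np.array([-50000.0, 1200.0, 90000.0])   # includes two extremes

bounds = fit_winsor_bounds(train_balance)
clipped = apply_winsor(test_balance, bounds)
print(bounds)
print(clipped)  # extremes are pulled to the learned bounds, 1200.0 is unchanged
```

Whether this matters in practice depends on how different the two splits are; with a random stratified split the bounds are usually close.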

# Statistics and EDA

In [None]:
# Numeric summary
print("Numeric Summary:")
display(df.describe().round(2))

## Univariate Analysis - Numeric Distributions

In [None]:
for col in ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']:
    plt.figure(figsize=(7,4))
    sns.histplot(df[col], kde=True)
    plt.title(f"Distribution of {col}")
    plt.show()

## Outliers Visualization

In [None]:
for col in ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']:
    plt.figure(figsize=(8,4))
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot of {col} (After Winsorizing)")
    plt.show()

## Categorical Summary & Rare Categories

In [None]:
cat_cols = ['job','marital','education','housing','loan','contact','month','poutcome','deposit']

for col in cat_cols:
    print(f"\n=== {col.upper()} - Top 10 ===")
    print(df[col].value_counts().head(10))
    
    freq = df[col].value_counts(normalize=True) * 100
    rare = freq[freq < 3]
    if not rare.empty:
        print(f"Rare categories (<3%): {rare.index.tolist()}")

## Relationships Between Variables

In [None]:
# Correlation
num_cols = df.select_dtypes(include=np.number).columns
corr = df[num_cols].corr()

plt.figure(figsize=(14,10))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title("Correlation Heatmap")
plt.show()

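# Top correlations (upper triangle only, so self-correlations of 1.0 and mirrored pairs are excluded)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_corr = upper.stack().abs().sort_values(ascending=False).head(6)
print("Strongest Correlations:\n", top_corr)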

In [None]:
# Scatter plots for strongest relationships
fig, ax = plt.subplots(1,2, figsize=(14,5))
sns.scatterplot(data=df, x='duration', y='deposit_num', alpha=0.6, ax=ax[0])
ax[0].set_title("Duration vs Deposit (Target)")
sns.scatterplot(data=df, x='balance', y='deposit_num', alpha=0.6, ax=ax[1])
ax[1].set_title("Balance vs Deposit (Target)")
plt.show()
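
Because `deposit_num` only takes the values 0 and 1, a scatter plot places every point on one of two horizontal lines. One possible complement, sketched here on made-up numbers, is the subscription rate per duration band. The band edges and the `toy` frame are illustrative assumptions.

```python
import pandas as pd

# Hypothetical calls: duration in seconds and the 0/1 outcome
toy = pd.DataFrame({
    "duration": [30, 60, 90, 200, 250, 300, 600, 700, 900],
    "deposit_num": [0, 0, 0, 0, 1, 0, 1, 1, 1],
})

# Group calls into bands, then take the mean of the 0/1 target per band
toy["duration_band"] = pd.cut(toy["duration"], bins=[0, 100, 400, 1000])
rate_by_band = toy.groupby("duration_band", observed=True)["deposit_num"].mean()
print(rate_by_band)
```

The mean of a 0/1 column is the share of ones, so each value is the subscription rate within that band.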

## Category Effect on Numeric Variables

In [None]:
print("Average Duration by Deposit:")
print(df.groupby('deposit')['duration'].mean().round(1))

print("\nAverage Balance by Deposit:")
print(df.groupby('deposit')['balance'].mean().round(1))

## Category vs Category Relationships

In [None]:
print("Deposit Rate by Marital Status (%):")
print(pd.crosstab(df['marital'], df['deposit'], normalize='index').round(3)*100)

print("\nDeposit Rate by Housing Loan (%):")
print(pd.crosstab(df['housing'], df['deposit'], normalize='index').round(3)*100)
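
To make the `normalize='index'` argument concrete, here is a small sketch on invented rows. Each row of the result is divided by its own row total, so the percentages in a row sum to 100.

```python
import pandas as pd

# Invented example: 3 clients with a housing loan, 5 without
toy = pd.DataFrame({
    "housing": ["yes", "yes", "yes", "no", "no", "no", "no", "no"],
    "deposit": ["no", "no", "yes", "yes", "yes", "yes", "no", "no"],
})

rates = pd.crosstab(toy["housing"], toy["deposit"], normalize="index") * 100
print(rates.round(1))
```

Here 3 of the 5 clients without a housing loan subscribed (60%), against 1 of the 3 with one (about 33.3%).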

## Final EDA Summary

**Top 5 Insights**
- Duration is by far the strongest indicator of whether a client will subscribe (longer calls = much higher success rate).
- Clients with higher balance are significantly more likely to open a deposit.
- Most contacts happened in May, but the success rate is not the highest in that month.
- 74% of poutcome is "unknown"; this is typical in marketing campaigns.
- Clients without a housing loan have a clearly higher subscription rate.

**Top 5 Problems / Risks**
- Very high "unknown" in poutcome (74%) and contact (21%).
- Strong outliers in balance and duration (handled with winsorizing).
- Some job categories are very rare (<3%).
- pdays = -1 dominates (most clients were never contacted before).
- Target variable (deposit) is fairly balanced but still needs careful modeling.

**Next Steps**
- Feature Engineering: Create age groups, balance bins, total contacts, and seasonal features from month.
- Proceed to modeling (Logistic Regression, Random Forest, XGBoost) with cross-validation and focus on Duration + Balance as key features.

# Feature Engineering

In [None]:
# ==================== FEATURE ENGINEERING ====================

# 1. Age Groups
df['age_group'] = pd.cut(df['age'], 
                         bins=[17, 30, 45, 60, 100], 
                         labels=['Young (18-30)', 'Adult (31-45)', 
                                 'Middle-aged (46-60)', 'Senior (60+)'])

# 2. Balance Bins
df['balance_bin'] = pd.cut(df['balance'], 
                           bins=[-np.inf, 0, 500, 2000, np.inf],
                           labels=['Negative', 'Low (0-500)', 
                                   'Medium (501-2000)', 'High (>2000)'])

# 3. Total Contacts
df['total_contacts'] = df['campaign'] + df['previous']

# 4. Season from Month
month_to_season = {
    'jan':'Winter', 'feb':'Winter', 'mar':'Spring',
    'apr':'Spring', 'may':'Spring', 'jun':'Summer',
    'jul':'Summer', 'aug':'Summer', 'sep':'Autumn',
    'oct':'Autumn', 'nov':'Autumn', 'dec':'Winter'
}
df['season'] = df['month'].map(month_to_season)

print("✅ New features created successfully!")
df[['age','age_group','balance','balance_bin','total_contacts','season']].head()
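
One detail of `pd.cut` is worth checking on boundary values: with the default `right=True`, each bin includes its right edge. The sketch below applies the same bin edges and labels as the balance bins above to a few hand-picked balances.

```python
import numpy as np
import pandas as pd

# Hand-picked values sitting on or next to each bin edge
balances = pd.Series([-200, 0, 1, 500, 501, 2000, 2001])

bins = pd.cut(
    balances,
    bins=[-np.inf, 0, 500, 2000, np.inf],
    labels=["Negative", "Low (0-500)", "Medium (501-2000)", "High (>2000)"],
)
print(pd.DataFrame({"balance": balances, "balance_bin": bins}))
```

A balance of exactly 0 lands in the first bin, so 'Negative' covers zero and negative balances, while 'Low (0-500)' starts just above zero.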

## New Features Summary

In [None]:
print("Age Group Distribution:")
print(df['age_group'].value_counts())

print("\nBalance Bin Distribution:")
print(df['balance_bin'].value_counts())

print("\nSeason Distribution:")
print(df['season'].value_counts())

# Predictive Modeling

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Prepare data
target = 'deposit_num'
features = ['age', 'balance', 'duration', 'campaign', 'previous', 
            'total_contacts', 'age_group', 'balance_bin', 'season',
            'job', 'marital', 'education', 'housing', 'loan', 
            'contact', 'month', 'poutcome']

X = df[features]
y = df[target]

# One-hot encoding for categorical columns
X = pd.get_dummies(X, drop_first=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=y)

print(f"Training set: {X_train.shape[0]} rows")
print(f"Test set: {X_test.shape[0]} rows")
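
As a small illustration of what `pd.get_dummies(X, drop_first=True)` does to the feature matrix, the sketch below encodes a made-up frame with one numeric and one categorical column.

```python
import pandas as pd

# Made-up rows; 'marital' has three levels
toy = pd.DataFrame({
    "balance": [100, 2500, -30],
    "marital": pd.Categorical(["married", "single", "divorced"]),
})

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())  # ['balance', 'marital_married', 'marital_single']
```

Numeric columns pass through unchanged. The first level in category order (`divorced` here) gets no column of its own and acts as the baseline, which avoids perfectly collinear dummy columns in the logistic regression.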

## Model Training & Evaluation (5-Fold CV)

In [None]:
models = {

    "Logistic Regression": LogisticRegression(max_iter=1000),

    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),

    "XGBoost": XGBClassifier(
        n_estimators=300,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        eval_metric='logloss'
    )

}

results = []

for name, model in models.items():

    # Cross Validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')

    # Fit on full train
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_prob)

    results.append({
        'Model': name,
        'CV AUC (mean)': round(cv_scores.mean(), 4),
        'Test Accuracy': round(acc, 4),
        'Test AUC': round(auc, 4)
    })

# Results
results_df = pd.DataFrame(results)
display(results_df.sort_values(by='Test AUC', ascending=False))
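
The AUC reported above can be read as the probability that a randomly chosen subscriber receives a higher predicted score than a randomly chosen non-subscriber. The sketch below computes it directly from that definition. `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive against every negative
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75: 3 of the 4 pairs are ordered correctly
```

This pairwise form is quadratic in the number of rows, so it is meant for understanding the metric rather than for replacing `roc_auc_score`.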

## Model Performance Summary & Key Insights

In [None]:
print("🏆 Best Model:", results_df.loc[results_df['Test AUC'].idxmax(), 'Model'])

# Feature Importance (for tree-based models)
rf = models["Random Forest"]
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print("\n🔝 Top 10 Most Important Features:")
print(importances.sort_values(ascending=False).head(10))

## Hyperparameter Tuning for XGBoost (RandomizedSearchCV)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import time

# Start timer
start_time = time.time()

# Parameter grid for XGBoost
param_dist = {
    'n_estimators': [200, 300, 400, 500],
    'max_depth': [4, 6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.15],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2, 0.3],
    'reg_alpha': [0, 0.1, 0.5, 1],
    'reg_lambda': [0, 0.1, 1, 2]
}

# Base model
xgb_base = XGBClassifier(random_state=42, eval_metric='logloss')

# Randomized Search (30 iterations - fast & effective)
random_search = RandomizedSearchCV(
    estimator=xgb_base,
    param_distributions=param_dist,
    n_iter=30,
    scoring='roc_auc',
    cv=5,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

print("Starting Hyperparameter Tuning for XGBoost...")
random_search.fit(X_train, y_train)

# Results
print(f"\n✅ Tuning completed in {time.time() - start_time:.1f} seconds")
print("Best Parameters:", random_search.best_params_)
print("Best CV AUC Score:", round(random_search.best_score_, 4))
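
For a sense of scale, the grid above has eight parameters with four candidate values each. The arithmetic below copies `param_dist` from the tuning cell and shows how small a share of the grid 30 random draws cover. The variable names `n_combinations` and `total_fits` are introduced here for illustration.

```python
import math

# Same candidate values as in the tuning cell above
param_dist = {
    'n_estimators': [200, 300, 400, 500],
    'max_depth': [4, 6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.15],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2, 0.3],
    'reg_alpha': [0, 0.1, 0.5, 1],
    'reg_lambda': [0, 0.1, 1, 2],
}

n_iter, cv_folds = 30, 5
n_combinations = math.prod(len(values) for values in param_dist.values())
total_fits = n_iter * cv_folds  # model fits during the search, before the final refit

print(n_combinations)                                  # 65536
print(f"{n_iter / n_combinations:.4%} of the grid sampled")
print(total_fits)                                      # 150
```

The search therefore explores a tiny fraction of the grid, so the reported best parameters are a good candidate rather than a guaranteed optimum; raising `n_iter` trades run time for coverage.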

## Final XGBoost Model with Best Parameters

In [None]:
# Train final model with best parameters
best_xgb = random_search.best_estimator_

# Predictions
y_pred_final = best_xgb.predict(X_test)
y_prob_final = best_xgb.predict_proba(X_test)[:, 1]

# Final Evaluation
final_acc = accuracy_score(y_test, y_pred_final)
final_auc = roc_auc_score(y_test, y_prob_final)

print("🚀 Final XGBoost Performance (After Tuning)")
print(f"Test Accuracy : {final_acc:.4f}")
print(f"Test AUC      : {final_auc:.4f}")

## Feature Importance - Tuned XGBoost

In [None]:
# Top 15 Important Features
importances = pd.Series(best_xgb.feature_importances_, index=X_train.columns)
top_features = importances.sort_values(ascending=False).head(15)

plt.figure(figsize=(10, 8))
sns.barplot(x=top_features.values, y=top_features.index, palette='viridis')
plt.title('Top 15 Most Important Features (Tuned XGBoost)')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()

print("Top 10 Features:")
print(top_features.head(10))

## Model Comparison (Before vs After Tuning)

In [None]:
comparison = pd.DataFrame({
    'Model': ['XGBoost (Default)', 'XGBoost (Tuned)'],
    'Test AUC': [results_df[results_df['Model'] == 'XGBoost']['Test AUC'].values[0], 
                 final_auc],
    'Test Accuracy': [results_df[results_df['Model'] == 'XGBoost']['Test Accuracy'].values[0], 
                      final_acc]
})

display(comparison.round(4))

### Comprehensive Final Report  
**Bank Marketing Campaign Analysis & Predictive Modeling**  
**Dataset:** Bank Marketing (Target = Deposit Subscription)

---

#### 1. Executive Summary

This project performed a complete end-to-end analysis on the **Bank Marketing Dataset** (11,162 rows, 17 columns) to understand customer behavior and predict whether a client will subscribe to a term deposit.

**Key Achievements:**
- Full data cleaning and preprocessing (handling unknown values, outliers via winsorizing, type optimization).
- Deep Exploratory Data Analysis covering univariate, bivariate, and multivariate insights.
- Creation of 4 new powerful engineered features.
- Trained and compared 3 machine learning models with 5-fold cross-validation.
- Performed hyperparameter tuning on XGBoost using RandomizedSearchCV.
- Achieved strong predictive performance with clear business insights.

**Best Model:** Tuned XGBoost (Test AUC ≈ 0.92+ after tuning)

---

#### 2. Dataset Overview

- **Shape**: 11,162 rows × 17 columns
- **Target Variable**: `deposit` (binary: yes/no), whether the client subscribed to a term deposit.
- **Main Data Types**: 7 numeric + 10 categorical (converted to `category` dtype for efficiency).
- **No missing values** after handling 'unknown' entries.
- **No duplicate rows**.

**Column Understanding Summary**  
- `age`, `balance`, `day`, `duration`, `campaign`, `pdays`, `previous` → Numeric  
- `job`, `marital`, `education`, `default`, `housing`, `loan`, `contact`, `month`, `poutcome`, `deposit` → Categorical  
- `duration` (last contact duration in seconds) proved to be the strongest single predictor.

---

#### 3. Data Cleaning & Preprocessing

**Performed Steps:**
- Converted all object columns to `category` dtype.
- Replaced 'unknown' in `job` and `education` with the mode (low missing %).
- Kept 'unknown' in `contact` and `poutcome` as meaningful categories (especially `poutcome` at 74.59%).
- Winsorized numeric columns (1%–99% clipping) to handle extreme outliers in `balance` and `duration`.
- Created 3 new indicator features: `previously_contacted`, `any_loan`, `deposit_num` (numeric target).

**Validity Checks Passed:**
- No impossible ages, negative durations, or zero campaigns.
- Balance can be negative (realistic overdrafts).

---

#### 4. Exploratory Data Analysis (EDA)

##### 4.1 Univariate Analysis
- **Numeric**: Histograms and boxplots showed right-skewed distributions in `balance`, `duration`, `campaign`, and `previous`. Outliers were successfully controlled.
- **Categorical**: 
  - Most common job: `management` and `blue-collar`.
  - Most contacts in `May`.
  - High imbalance in `housing` and `loan`.
  - Rare categories identified and kept (e.g., `student`, `unemployed`).

##### 4.2 Multivariate Analysis
- **Correlation Heatmap**: Strongest correlations with target were `duration` and `balance`.
- **Scatter Plots**: Clear positive relationship between `duration` and subscription probability.
- **Crosstabs**: Clients without housing loan and married clients showed higher subscription rates.
- **Grouped Statistics**: Average `duration` for `yes` subscribers is significantly higher than `no`.

---

#### 5. Feature Engineering

Four new business-relevant features were created:

1. **age_group**: Young (18-30), Adult (31-45), Middle-aged (46-60), Senior (60+)
2. **balance_bin**: Negative, Low, Medium, High
3. **total_contacts** = `campaign` + `previous`
4. **season**: Spring, Summer, Autumn, Winter (derived from `month`)

These features add business context to the model inputs. Their individual contribution to model performance was not measured separately in this notebook.

---

#### 6. Machine Learning Modeling

**Models Trained:**
- Logistic Regression
- Random Forest
- XGBoost

**Evaluation Method:** 5-Fold Cross-Validation + Hold-out Test Set (80/20 stratified split)

**Hyperparameter Tuning (XGBoost):**
- Used `RandomizedSearchCV` (30 iterations)
- Tuned: `n_estimators`, `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`
- Best parameters were applied to the final model.

**Final Results (After Tuning):**
- Tuned XGBoost achieved the highest Test AUC and Accuracy.
- Top important features: `duration`, `balance`, `total_contacts`, `poutcome_success`, `housing`.

---

#### 7. Key Insights & Findings

**Business Insights:**
- **Duration** is the most powerful predictor: longer calls strongly indicate higher conversion probability.
- Clients with higher account balance are significantly more likely to subscribe.
- Customers without a housing loan convert better.
- Spring and Summer seasons show different response patterns.
- Previous successful campaign outcome (`poutcome = success`) is a very strong positive signal.

**Data Challenges:**
- 74.59% `poutcome = unknown` (common in marketing data).
- Strong outliers in financial variables (handled).
- Some rare job categories.

---

#### 8. Recommendations & Next Steps

**Immediate Recommendations:**
1. Focus sales team effort on calls longer than 300–400 seconds.
2. Prioritize high-balance customers and those without housing loans.
3. Use the tuned XGBoost model for lead scoring in future campaigns.

**Future Improvements:**
- Advanced Feature Engineering (interaction terms, RFM-like features).
- SHAP explainability for deeper business understanding.
- Deploy the model as a real-time API for the marketing team.
- Collect more data on previous campaign outcomes to reduce "unknown".
- A/B testing of call scripts based on model insights.

---

#### 9. Conclusion

This project successfully transformed raw bank marketing data into actionable insights and a high-performing predictive model. The combination of thorough EDA, smart feature engineering, and hyperparameter-tuned XGBoost provides the marketing department with a powerful tool to increase term deposit subscription rates while optimizing calling efforts.

**"Duration is king, but Balance and Previous Success tell us who to call first."**

---

**Prepared by:** Ahmed  
**Date:** February 2026  
**Project Type:** End-to-End EDA + Predictive Modeling

# Thank You!