# Bank Marketing Dataset – EDA & Feature Engineering Project

This project performs a full Exploratory Data Analysis (EDA), data cleaning, transformation, and feature engineering pipeline on the **Bank Marketing Dataset**. It includes data visualization, handling of class imbalance, and generation of a comprehensive PDF report.

---
## Project Attribution

This project is part of the **Skillfied Mentor Internship** program.

- **Name:** Costas Antony Pinto  
- **Program:** MCA – Artificial Intelligence & Machine Learning  
- **University:** Manipal University Jaipur  
- **Role:** Data Analyst Intern 
- **Project Title:** Bank Marketing Dataset – End-to-End EDA and Feature Engineering  
- **Duration:** June-July 2025  
- **Tools & Technologies:** Python, Pandas, Scikit-learn, SMOTE, ReportLab, Matplotlib, Seaborn, Missingno

All tasks were self-coded and verified, demonstrating data cleaning, visualization, transformation, and feature engineering skills in a professional, production-ready format.


## Step 1: Environment & Directory Setup

- Imported essential libraries for data processing, visualization, modeling, and report generation.
- Created structured folders:
  - `datasets/`: for raw and cleaned CSV files
  - `plots/`: for generated graphs
  - `data/`: for intermediate reports and analysis

---

In [None]:
# Step 1: Environment & Directory Setup
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
import warnings

# Turn off unnecessary warnings
warnings.filterwarnings('ignore')

# Set visual style for consistency
sns.set(style='whitegrid', palette='Set2')

# Define base and output directories
BASE_DIR = r'G:\My Drive\MUJ MCA\SKILLFIED MENTOR INTERNSHIP\Banking Data Analysis'
DATASET_DIR = os.path.join(BASE_DIR, 'datasets')
PLOTS_DIR = os.path.join(BASE_DIR, 'plots')
DATA_DIR = os.path.join(BASE_DIR, 'data')

# Create necessary folders
for directory in [DATASET_DIR, PLOTS_DIR, DATA_DIR]:
    os.makedirs(directory, exist_ok=True)

# Define input file path
FILE_PATH = os.path.join(DATASET_DIR, 'bankmarketingdata.csv')

# Define plot saving function
def save_and_display(fig, plot_name, show=True):
    save_path = os.path.join(PLOTS_DIR, f'{plot_name}.png')
    fig.savefig(save_path, dpi=300, bbox_inches='tight')
    if show:
        plt.show()
    plt.close(fig)
    print(f'[SAVED] {plot_name}.png to plots/')

print("Step 1 complete: Environment and folders are set up.")


## Step 2: Load Raw Dataset & Initial Inspection

- Used fallback logic to load dataset with `,` or `;` as delimiters.
- Previewed:
  - Dataset shape and data types
  - Statistical summary
  - Unique value count per column
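
The fallback idea can be illustrated on a small in-memory sample. This is a minimal sketch rather than the notebook's loader: `SAMPLE` and `load_with_fallback` are hypothetical names, and the three-column layout is assumed purely for illustration.

```python
import io

import pandas as pd

# Hypothetical sample in a semicolon-delimited layout (assumed for illustration)
SAMPLE = "age;job;y\n56;housemaid;no\n37;services;yes\n"


def load_with_fallback(text, delimiters=(",", ";")):
    """Return the first parse that yields more than one column."""
    for sep in delimiters:
        frame = pd.read_csv(io.StringIO(text), sep=sep)
        if frame.shape[1] > 1:
            return frame, sep
    raise ValueError(f"No delimiter in {delimiters} produced multiple columns")


frame, used_sep = load_with_fallback(SAMPLE)
print(used_sep, frame.shape)
```

Parsing the sample with `,` yields a single column, so the helper moves on to `;`.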

---

In [None]:
# Step 2: Load Raw Dataset & Initial Inspection (Safe Fallback with Validation)

def try_loading_csv(path, delimiter):
    """Attempts to load CSV with a given delimiter and checks if multiple columns are detected."""
    try:
        df_temp = pd.read_csv(path, sep=delimiter)
        if df_temp.shape[1] < 2:
            return None  # Likely wrong delimiter
        return df_temp
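    except FileNotFoundError:
        raise  # Let the caller report a missing file instead of a delimiter error
    except Exception:
        return None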

try:
    df = try_loading_csv(FILE_PATH, ',')  # Try comma first
    if df is None:
        df = try_loading_csv(FILE_PATH, ';')  # Fallback to semicolon

    if df is None:
        raise ValueError("Could not load dataset with common delimiters (',' or ';')")

    df_raw = df.copy()  # Backup

    print(f"[INFO] Dataset loaded. Shape: {df.shape}\n")

    print("[PREVIEW] First 5 rows:")
    display(df.head())

    print("\n[INFO] Dataset Info:")
    df.info()

    print("\n[INFO] Statistical Summary:")
    display(df.describe(include='all'))

    print("\n[INFO] Unique values in each column:")
    for col in df.columns:
        print(f" - {col}: {df[col].nunique()} unique values")

except FileNotFoundError:
    print(f"[❌ ERROR] File not found: {FILE_PATH}")
except Exception as e:
    print(f"[❌ ERROR] Failed to load or inspect dataset: {e}")


## Step 3: Missing Values Visualization

- Replaced `'unknown'` entries with `NaN`
- Used `missingno` to visualize missing data:
  - Bar plot
  - Matrix plot
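
As a rough illustration of why the placeholder has to be converted first, the sketch below uses a made-up three-column frame (`toy` is hypothetical, not the project data) and reports the share of missing entries per column once `'unknown'` becomes `NaN`.

```python
import numpy as np
import pandas as pd

# Toy frame with the 'unknown' placeholder (values invented for illustration)
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "unknown"],
    "education": ["basic.4y", "high.school", "unknown", "university.degree"],
    "age": [41, 35, 29, 52],
})

# Before the replacement, pandas sees no missing values at all
missing_before = int(toy.isna().sum().sum())

# After the replacement, the placeholder is counted as missing
toy_clean = toy.replace("unknown", np.nan)
missing_share = toy_clean.isna().mean().mul(100).round(1)

print(missing_before)
print(missing_share)
```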

---

In [None]:
# Step 3: Diagnose & Visualize Missing Values using missingno

try:
    # Replace 'unknown' with np.nan for accurate missing analysis
    df.replace('unknown', np.nan, inplace=True)

    # Plot 1: Missing Values Bar Plot
    # missingno may open its own figure, so pass figsize to it and fetch the active figure afterwards
    msno.bar(df, figsize=(10, 4), fontsize=12, color='royalblue')
    fig1 = plt.gcf()
    plt.title('Missing Values - Bar Plot', fontsize=14)
    save_and_display(fig1, 'missing_values_bar')

    # Plot 2: Missing Values Matrix
    msno.matrix(df, figsize=(10, 4), fontsize=12)
    fig2 = plt.gcf()
    plt.title('Missing Values - Matrix Plot', fontsize=14)
    save_and_display(fig2, 'missing_values_matrix')

    # Print missing value count per column
    print("\n[INFO] Missing values per column:")
    print(df.isnull().sum())

except Exception as e:
    print(f"[ERROR] Failed to analyze or plot missing values: {e}")


## Step 4: Impute Missing Values

- Categorical columns: filled with mode (most frequent value)
- Numeric columns: filled with median
- Verified no missing values remain
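
The imputation rule can be sketched on a toy frame. The column `balance_demo` and the helper `impute_simple` are hypothetical names used only for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numeric gap (values invented)
toy = pd.DataFrame({
    "marital": ["married", np.nan, "single", "married"],
    "balance_demo": [10.0, np.nan, 30.0, 200.0],
})


def impute_simple(frame):
    """Fill numeric gaps with the median and other gaps with the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                fill_value = out[col].median()
            else:
                fill_value = out[col].mode()[0]
            # Assign the result back rather than filling in place
            out[col] = out[col].fillna(fill_value)
    return out


imputed = impute_simple(toy)
print(imputed)
```

The median is used for numeric columns because a single large value (200 here) would pull the mean away from the typical value.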

---

In [None]:
# Step 4: Impute Missing Values

try:
    # Iterate over all columns
    for col in df.columns:
        if df[col].isnull().sum() > 0:
            if df[col].dtype == 'object':
                # Impute with most frequent category
                mode_val = df[col].mode()[0]
                # Assign back instead of chained inplace fillna, which newer pandas deprecates
                df[col] = df[col].fillna(mode_val)
                print(f"[IMPUTED] Categorical column '{col}' with mode: {mode_val}")
            else:
                # Impute numeric with median
                median_val = df[col].median()
                df[col] = df[col].fillna(median_val)
                print(f"[IMPUTED] Numeric column '{col}' with median: {median_val}")

    # Confirm no nulls remain
    if df.isnull().sum().sum() == 0:
        print("\n[✅] Step 4 complete: All missing values handled.")
    else:
        print("[⚠️] Some missing values remain.")

except Exception as e:
    print(f"[ERROR] Failed during missing value imputation: {e}")


## Step 5: Drop Duplicates

- Checked for duplicates
- Removed duplicate rows and printed the dataset shape before and after

---

In [None]:
# Step 5: Drop Duplicates

try:
    # Check for duplicates
    initial_shape = df.shape
    duplicate_count = df.duplicated().sum()

    if duplicate_count > 0:
        df.drop_duplicates(inplace=True)
        print(f"[CLEANED] {duplicate_count} duplicate rows removed.")
    else:
        print("[INFO] No duplicate rows found.")

    final_shape = df.shape
    print(f"[INFO] Dataset shape changed from {initial_shape} to {final_shape}")

except Exception as e:
    print(f"[ERROR] Failed to check or remove duplicates: {e}")


## Step 6: Target Variable Distribution

- Plotted the class balance of the target column `y`
- Saved chart to `target_distribution.png`

---

In [None]:
# Step 6: Target Variable Distribution

try:
    fig = plt.figure(figsize=(6, 4))
    sns.countplot(data=df, x='y', palette='Set2')
    plt.title('Target Variable Distribution (Subscription)', fontsize=14)
    plt.xlabel('Subscribed', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    for p in plt.gca().patches:
        plt.gca().annotate(f'{int(p.get_height())}', 
                           (p.get_x() + p.get_width() / 2., p.get_height()), 
                           ha='center', va='bottom', fontsize=10)
    save_and_display(fig, 'target_distribution')

    print("\n[INFO] Target Variable Value Counts:")
    print(df['y'].value_counts())

except Exception as e:
    print(f"[ERROR] Failed to plot target distribution: {e}")


## Step 7: Categorical Features vs Target

- Barplots for:
  - `job`, `education`, and `marital` vs `y`
- Shows how subscription counts differ across the categories of each feature

---

In [None]:
# Step 7: Barplots of Categorical Features vs Target Variable

try:
    cat_features = ['job', 'marital', 'education']

    for col in cat_features:
        fig = plt.figure(figsize=(8, 5))
        sns.countplot(data=df, x=col, hue='y', palette='Set3')
        plt.title(f'{col.title()} vs Subscription Outcome', fontsize=14)
        plt.xlabel(col.title(), fontsize=12)
        plt.ylabel('Count', fontsize=12)
        plt.xticks(rotation=45, ha='right')
        plt.legend(title='Subscribed', loc='upper right')
        save_and_display(fig, f'barplot_{col}_y')

except Exception as e:
    print(f"[ERROR] Failed to create barplots for {col}: {e}")


## Step 8: Encode Categorical Variables

- Encoded features using `LabelEncoder`
- Target encoded to:
  - `no` → 0
  - `yes` → 1
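
`LabelEncoder` assigns integer codes in the sorted order of the class labels, so the resulting numbers do not express any meaningful ranking. A small sketch with invented `marital` values shows the mapping and how to reverse it.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values (invented for illustration)
marital = ["married", "single", "divorced", "married"]

encoder = LabelEncoder()
codes = encoder.fit_transform(marital)

# classes_ holds the labels in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# The fitted encoder can translate codes back to the original labels
decoded = encoder.inverse_transform(codes)

print(mapping)
print(codes.tolist())
```

Keeping the fitted encoders, as the `label_encoders` dictionary in this step does, makes it possible to decode the columns later.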

---

In [None]:
# Step 8: Encode Categorical Variables + Target

try:
    df_encoded = df.copy()
    label_encoders = {}

    # Encode categorical features (excluding 'y')
    for col in df_encoded.select_dtypes(include='object').columns:
        if col != 'y':
            le = LabelEncoder()
            df_encoded[col] = le.fit_transform(df_encoded[col])
            label_encoders[col] = le
            print(f"[ENCODED] {col} using LabelEncoder.")

    # Encode target variable
    df_encoded['y'] = df_encoded['y'].map({'no': 0, 'yes': 1})
    print("[ENCODED] Target column 'y' mapped to 0 (no), 1 (yes).")

except Exception as e:
    print(f"[ERROR] Failed to encode variables: {e}")


## Step 9: Class Imbalance Before SMOTE

- Plotted class distribution
- Showed imbalance in `y` before applying SMOTE

---

In [None]:
# Step 9: Class Imbalance Visualization (Before SMOTE)

try:
    fig = plt.figure(figsize=(6, 4))
    df_encoded['y'].value_counts().plot(kind='bar', color=['tomato', 'lightblue'])
    plt.title('Target Class Distribution (Before SMOTE)', fontsize=14)
    plt.xlabel('Class (0 = No, 1 = Yes)', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    for idx, val in enumerate(df_encoded['y'].value_counts()):
        plt.text(idx, val + 200, str(val), ha='center', fontsize=10)
    save_and_display(fig, 'class_distribution_before_smote')

    print("\n[INFO] Class counts before SMOTE:")
    print(df_encoded['y'].value_counts())

except Exception as e:
    print(f"[ERROR] Failed to plot class distribution: {e}")


## Step 10: Scaling + SMOTE Balancing

- Standardized features using `StandardScaler`
- Used `SMOTE` to balance class distribution
- Visualized balanced classes
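
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea, not `imblearn`'s implementation; `smote_like_oversample` and the toy points are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=3, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic row at a random position on the connecting segment
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_oversample(minority, n_new=6)
print(synthetic.shape)
```

Because every synthetic row lies between two existing minority rows, the new points stay inside the region the minority class already occupies. In a modelling workflow, resampling is usually applied to the training split only, so that synthetic rows do not leak into the evaluation data.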

---

In [None]:
# Step 10: Scaling Features + Balancing with SMOTE

from sklearn.model_selection import train_test_split

try:
    # Separate features and target
    X = df_encoded.drop('y', axis=1)
    y = df_encoded['y']

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    print("[INFO] Features scaled using StandardScaler.")

    # Apply SMOTE
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X_scaled, y)
    print(f"[BALANCED] SMOTE applied. New class counts:\n{pd.Series(y_resampled).value_counts()}")

    # Plot class distribution after SMOTE
    fig = plt.figure(figsize=(6, 4))
    pd.Series(y_resampled).value_counts().plot(kind='bar', color=['lightgreen', 'salmon'])
    plt.title('Target Class Distribution (After SMOTE)', fontsize=14)
    plt.xlabel('Class (0 = No, 1 = Yes)', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    for idx, val in enumerate(pd.Series(y_resampled).value_counts()):
        plt.text(idx, val + 200, str(val), ha='center', fontsize=10)
    save_and_display(fig, 'class_distribution_after_smote')

except Exception as e:
    print(f"[ERROR] Failed during scaling or SMOTE application: {e}")


## Step 11: Correlation Heatmap

- Created correlation matrix
- Saved heatmap to `correlation_heatmap.png`
- Saved the 5 features with the highest absolute correlation with `y` to `top_corr_features.csv`
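
Ranking by absolute correlation discards the sign. One possible variant, shown here on an invented frame, ranks by magnitude but reports the signed values; `top_target_correlations` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy data (invented): 'a' moves with y, 'b' moves against it
toy = pd.DataFrame({
    "a": [0, 0, 1, 1, 1, 0],
    "b": [5, 4, 1, 2, 1, 5],
    "y": [0, 0, 1, 1, 1, 0],
})


def top_target_correlations(frame, target, k=5):
    """Rank features by |correlation| with the target, keeping the sign."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index[:k]
    return corr.loc[order]


ranked = top_target_correlations(toy, "y")
print(ranked)
```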

---

In [None]:
# Step 11: Correlation Heatmap of Features

try:
    # Compute correlation matrix
    corr_matrix = df_encoded.corr(numeric_only=True)

    # Create the heatmap
    fig = plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5, square=True,
                cbar_kws={"shrink": .8}, annot_kws={"size": 8})
    plt.title('Correlation Heatmap of Features', fontsize=16)
    save_and_display(fig, 'correlation_heatmap')

    # Show top 5 correlations with target
    top_corr = corr_matrix['y'].drop('y').abs().sort_values(ascending=False).head(5)
    print("\n[INFO] Top 5 features most correlated with target:")
    print(top_corr)

except Exception as e:
    print(f"[ERROR] Failed to generate correlation heatmap: {e}")

# Save top correlated features
try:
    corr_path = os.path.join(DATA_DIR, 'top_corr_features.csv')
    top_corr.to_frame(name='Correlation_with_Target').to_csv(corr_path)
    print(f"[SAVED] Top correlated features → {corr_path}")
except Exception as e:
    print(f"[❌ ERROR] Could not save top correlations: {e}")



## Step 12: Feature Importance using Random Forest

- Trained `RandomForestClassifier` to get feature importances
- Visualized and saved results
- Saved rankings in `feature_importance.csv`
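
To show what the importance scores express, the sketch below fits a forest on synthetic data where only one column determines the label. The column names and data are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends on 'signal' only
rng = np.random.default_rng(0)
n_rows = 500
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n_rows),
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
})
y_demo = (X_demo["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_demo, y_demo)

# Impurity-based importances, one score per column
ranking = pd.Series(model.feature_importances_, index=X_demo.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

The scores are relative shares that sum to 1, so they rank features against each other rather than measure an absolute effect.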

---

In [None]:
# Step 12: Feature Importance with Random Forest

try:
    # Train a Random Forest classifier on the original (non-resampled) data
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(X_scaled, y)

    # Get feature importances
    importances = rf_model.feature_importances_
    features = df_encoded.drop('y', axis=1).columns
    importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
    importance_df.sort_values(by='Importance', ascending=False, inplace=True)

    # Plot feature importance
    fig = plt.figure(figsize=(10, 6))
    sns.barplot(data=importance_df, x='Importance', y='Feature', palette='viridis')
    plt.title('Feature Importance - Random Forest', fontsize=14)
    plt.xlabel('Importance Score', fontsize=12)
    plt.ylabel('Feature', fontsize=12)
    save_and_display(fig, 'feature_importance_rf')

    # Print top 10 features
    print("[INFO] Top 10 Features by Importance:")
    print(importance_df.head(10))

except Exception as e:
    print(f"[ERROR] Failed to compute or plot feature importance: {e}")

# Save feature importance CSV
try:
    importance_path = os.path.join(DATA_DIR, 'feature_importance.csv')
    importance_df.to_csv(importance_path, index=False)
    print(f"[SAVED] Feature importance → {importance_path}")
except Exception as e:
    print(f"[❌ ERROR] Failed to save feature importance: {e}")



## Step 13: Numeric Feature Distributions

- Plotted:
  - Histograms
  - Boxplots
  - Boxplots grouped by target class
- Focused on `age`, `duration`, `campaign`, `pdays`, `previous`

---

In [None]:
# Step 13: Boxplots & Histograms for Key Numerical Features

try:
    numeric_cols = ['age', 'duration', 'campaign', 'pdays', 'previous']

    for col in numeric_cols:
        # Histogram (Distribution)
        fig1 = plt.figure(figsize=(7, 4))
        sns.histplot(df[col], kde=True, bins=30, color='skyblue')
        plt.title(f'Distribution of {col.title()}', fontsize=14)
        plt.xlabel(col.title(), fontsize=12)
        plt.ylabel('Frequency', fontsize=12)
        save_and_display(fig1, f'distribution_{col}')

        # Boxplot (Outliers)
        fig2 = plt.figure(figsize=(7, 4))
        sns.boxplot(x=df[col], color='lightcoral')
        plt.title(f'Boxplot of {col.title()}', fontsize=14)
        plt.xlabel(col.title(), fontsize=12)
        save_and_display(fig2, f'boxplot_{col}')

        # Boxplot by target class (if relevant)
        fig3 = plt.figure(figsize=(7, 4))
        sns.boxplot(data=df, x='y', y=col, palette='pastel')
        plt.title(f'{col.title()} by Subscription Status', fontsize=14)
        plt.xlabel('Subscribed (y)', fontsize=12)
        plt.ylabel(col.title(), fontsize=12)
        save_and_display(fig3, f'boxplot_{col}_by_y')

except Exception as e:
    print(f"[ERROR] Failed to generate plots for numeric column '{col}': {e}")


## Step 14: Pairplot of Features

- Sampled 400 rows for efficiency
- Created pairplot of selected features colored by target class
- Saved to `pairwise_relationships.png`

---

In [None]:
# Step 14: Pairplot of Selected Features (Sampled)

try:
    # Sample to speed up plotting
    df_sample = df_encoded.sample(n=400, random_state=42)

    # Select key features to compare
    selected_features = ['age', 'duration', 'campaign', 'pdays', 'previous', 'y']

    # Create pairplot
    fig = sns.pairplot(df_sample[selected_features], hue='y', palette='husl', diag_kind='kde', corner=True)
    pairplot_path = os.path.join(PLOTS_DIR, 'pairwise_relationships.png')
    fig.savefig(pairplot_path, dpi=300, bbox_inches='tight')
    plt.show()
    print(f"[SAVED] pairwise_relationships.png to plots/")

except Exception as e:
    print(f"[ERROR] Failed to generate pairplot: {e}")


## Step 15: Feature Engineering

- Added:
  - `age_group`: (young/adult/senior)
  - `contacted_before`: binary
  - `effective_contact`: binary, set when `contacted_before` is 1 and `previous` > 0
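
A detail worth checking with `pd.cut` is the lower edge: intervals are open on the left by default, so a value equal to the first bin edge gets no label unless `include_lowest=True` is set. The sketch uses invented ages and `pdays` values, with 999 as the "never contacted" code that this step relies on.

```python
import numpy as np
import pandas as pd

# Toy values (invented); 999 marks clients with no earlier contact
toy = pd.DataFrame({
    "age": [17, 30, 31, 50, 51, 98],
    "pdays": [999, 3, 999, 6, 999, 999],
})
bins = [17, 30, 50, 100]
labels = ["young", "adult", "senior"]

# Default behaviour: the interval (17, 30] excludes age 17
without_lowest = pd.cut(toy["age"], bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, include_lowest=True)
toy["contacted_before"] = np.where(toy["pdays"] == 999, 0, 1)

print(int(without_lowest.isna().sum()))
print(toy)
```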

---

In [None]:
# Step 15: Feature Engineering – Derived Columns

try:
    df_fe = df_encoded.copy()

    # Feature 1: Age Grouping (include_lowest keeps age 17 in the first bin instead of NaN)
    df_fe['age_group'] = pd.cut(df_fe['age'],
                                bins=[17, 30, 50, 100],
                                labels=['young', 'adult', 'senior'],
                                include_lowest=True)

    # Feature 2: Was Contacted Before?
    df_fe['contacted_before'] = np.where(df_fe['pdays'] == 999, 0, 1)

    # Feature 3: Effective Contact (contacted before and at least one previous contact recorded)
    # Note: poutcome is not checked, so this flags prior contact rather than a successful outcome
    df_fe['effective_contact'] = np.where((df_fe['previous'] > 0) & (df_fe['contacted_before'] == 1), 1, 0)

    print("[ADDED] Feature 'age_group' (young/adult/senior)")
    print("[ADDED] Feature 'contacted_before' (0/1)")
    print("[ADDED] Feature 'effective_contact' (0/1)")

    # Check for new features
    print("\n[INFO] New Features Preview:")
    display(df_fe[['age', 'age_group', 'pdays', 'contacted_before', 'effective_contact']].head())

except Exception as e:
    print(f"[ERROR] Failed during feature engineering: {e}")


## Step 16: Skewness Detection & Log Transformation

- Identified highly skewed numeric columns
- Applied `np.log1p` transformation
- Plotted the most skewed feature before and after transformation
- Saved the skewness values to `skewness_report.csv`
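
The effect of `np.log1p` on a right-skewed column can be seen on a handful of invented values (`campaign_demo` is a hypothetical name). The sketch compares the sample skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values (invented): most are small, a few are very large
toy = pd.Series([0, 0, 1, 1, 2, 3, 5, 8, 40, 200], name="campaign_demo")

skew_before = toy.skew()
skew_after = np.log1p(toy).skew()  # log1p(x) = log(1 + x), so zeros are allowed

print(f"before: {skew_before:.2f}, after: {skew_after:.2f}")
```

`log1p` is only defined for values greater than -1, so columns that can go below that need a different treatment or a shift before transforming.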

---

In [None]:
# Step 16: Skewness Detection & Log Transformation

try:
    df_skewed = df_fe.copy()
    num_cols = df_skewed.select_dtypes(include=np.number).columns.drop('y')

    print("[INFO] Skewness of Numerical Features:")
    skew_vals = df_skewed[num_cols].skew().sort_values(ascending=False)
    print(skew_vals)

    # Apply log1p (log(x + 1)) to highly skewed features (threshold > 1)
    skewed_cols = skew_vals[skew_vals > 1].index.tolist()

    for col in skewed_cols:
        df_skewed[f'{col}_log'] = np.log1p(df_skewed[col])
        print(f"[TRANSFORMED] Applied log1p to '{col}' → '{col}_log'")

    # Plot one example before vs after
    if skewed_cols:
        col = skewed_cols[0]
        fig, axes = plt.subplots(1, 2, figsize=(10, 4))
        sns.histplot(df_skewed[col], ax=axes[0], kde=True, color='orange')
        axes[0].set_title(f'{col} - Original')

        sns.histplot(df_skewed[f'{col}_log'], ax=axes[1], kde=True, color='green')
        axes[1].set_title(f'{col}_log - After log1p')
        plt.suptitle(f'Distribution Before vs After Log Transform for {col}', fontsize=14)

        save_and_display(fig, f'log_transform_{col}')

except Exception as e:
    print(f"[ERROR] Failed during skewness check or log transformation: {e}")

# Save skewness report
try:
    skew_path = os.path.join(DATA_DIR, 'skewness_report.csv')
    skew_vals.to_frame(name='Skewness').to_csv(skew_path)
    print(f"[SAVED] Skewness report → {skew_path}")
except Exception as e:
    print(f"[❌ ERROR] Could not save skewness report: {e}")



## Step 17: Multicollinearity Check using VIF

- Calculated the Variance Inflation Factor (VIF) for numeric features
- Reported features with VIF > 5
- Saved to `vif_report.csv`
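
By definition, the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. An equivalent way to compute it is to take the diagonal of the inverse correlation matrix. The sketch below does this with NumPy on synthetic data; it illustrates the definition and is not the `statsmodels` routine used in this step. `vif_from_correlation` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(frame):
    """VIF per column, read from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(frame.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns, name="VIF")


# Synthetic data: 'a_copy' is almost a duplicate of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

vif_scores = vif_from_correlation(toy)
print(vif_scores.round(2))
```

A column and its near-duplicate inflate each other's VIF, which is also what happens when a feature and its `_log` version are scored together.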

---

In [None]:
# Step 17: Multicollinearity Check using VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

try:
    df_vif = df_skewed.copy()

    # Select numeric features only (excluding 'y')
    vif_features = df_vif.select_dtypes(include=np.number).drop(columns=['y'], errors='ignore')

    # Standardize the features and keep the column names
    X_vif = pd.DataFrame(StandardScaler().fit_transform(vif_features), columns=vif_features.columns)

    # Compute VIF scores
    vif_data = pd.DataFrame()
    vif_data['Feature'] = X_vif.columns
    vif_data['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
    vif_data.sort_values(by='VIF', ascending=False, inplace=True)

    # Display top VIFs
    print("\n[INFO] Variance Inflation Factor (VIF) Scores:")
    display(vif_data[vif_data['VIF'] > 5])

    # Optional: Drop high VIF features (manually, based on domain)
    # Example: drop_features = vif_data[vif_data['VIF'] > 10]['Feature'].tolist()

except Exception as e:
    print(f"[ERROR] Failed to compute VIF: {e}")

# Save VIF report
try:
    vif_path = os.path.join(DATA_DIR, 'vif_report.csv')
    vif_data.to_csv(vif_path, index=False)
    print(f"[SAVED] VIF report → {vif_path}")
except Exception as e:
    print(f"[❌ ERROR] Could not save VIF report: {e}")



## Step 18: Save Cleaned Dataset

- Final dataset saved to:
  - `datasets/bankmarketing_cleaned.csv`

---

In [None]:
# Step 18: Save Final Cleaned Dataset to CSV

try:
    cleaned_data_path = os.path.join(DATASET_DIR, 'bankmarketing_cleaned.csv')
    df_skewed.to_csv(cleaned_data_path, index=False)
    print(f"[✅] Cleaned dataset saved to: {cleaned_data_path}")

except Exception as e:
    print(f"[ERROR] Failed to save cleaned dataset: {e}")


## Step 19: EDA Summary Text Report

- Included:
  - Shape, column names, missing values
  - Class distribution
  - Feature importances
  - Skewness
  - VIF
- Saved as `eda_summary.txt`

---

In [None]:
# Step 19: Save EDA Summary Report to TXT

try:
    summary_path = os.path.join(DATA_DIR, 'eda_summary.txt')
    with open(summary_path, 'w', encoding='utf-8') as f:
        f.write("🔍 EDA SUMMARY REPORT - BANK MARKETING DATASET\n")
        f.write("=============================================\n\n")

        # Shape and columns
        f.write(f"[SHAPE] Final Dataset Shape: {df_skewed.shape}\n\n")
        f.write("[COLUMNS]\n")
        f.write(", ".join(df_skewed.columns) + "\n\n")

        # Null check
        f.write("[MISSING VALUES]\n")
        nulls = df_skewed.isnull().sum()
        f.write(nulls[nulls > 0].to_string() + "\n\n" if nulls.any() else "No missing values.\n\n")

        # Class distribution
        f.write("[TARGET DISTRIBUTION]\n")
        f.write(df_skewed['y'].value_counts().to_string() + "\n\n")

        # Feature importance (Top 10)
        f.write("[FEATURE IMPORTANCE - Random Forest]\n")
        f.write(importance_df.head(10).to_string(index=False) + "\n\n")

        # Skewness summary
        f.write("[SKEWNESS - Numeric Features]\n")
        f.write(skew_vals.to_string() + "\n\n")

        # VIF summary (features with VIF > 5)
        f.write("[MULTICOLLINEARITY - VIF > 5]\n")
        f.write(vif_data[vif_data['VIF'] > 5].to_string(index=False) + "\n\n")

        f.write("Report auto-generated by EDA pipeline.\n")

    print(f"[✅] EDA summary saved to: {summary_path}")

except Exception as e:
    print(f"[ERROR] Failed to write EDA summary report: {e}")


## Step 20: Auto-Generated PDF Report

- Built a report using `reportlab`:
  - Title, author, date
  - Key visualizations
  - Summary tables: correlation, feature importance, skewness, VIF
- Saved to:
  - `data/Bank_Marketing_EDA_Report.pdf`

---

In [None]:
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image, PageBreak
from reportlab.lib.units import inch
import os
import pandas as pd
from datetime import datetime

# === Define Paths ===
BASE_DIR = r'G:\My Drive\MUJ MCA\SKILLFIED MENTOR INTERNSHIP\Banking Data Analysis'
DATA_DIR = os.path.join(BASE_DIR, 'data')
PLOTS_DIR = os.path.join(BASE_DIR, 'plots')
DATASET_DIR = os.path.join(BASE_DIR, 'datasets')
PDF_REPORT = os.path.join(DATA_DIR, 'Bank_Marketing_EDA_Report.pdf')

# === Load Data ===
df = pd.read_csv(os.path.join(DATASET_DIR, 'bankmarketing_cleaned.csv'))
# These two files were written with index=False, so 'Feature' must stay a regular column
feature_importance = pd.read_csv(os.path.join(DATA_DIR, 'feature_importance.csv'))
vif_data = pd.read_csv(os.path.join(DATA_DIR, 'vif_report.csv'))
skew_vals = pd.read_csv(os.path.join(DATA_DIR, 'skewness_report.csv'), index_col=0)
top_corr = pd.read_csv(os.path.join(DATA_DIR, 'top_corr_features.csv'), index_col=0)

# === Setup PDF Elements ===
styles = getSampleStyleSheet()
elements = []
doc = SimpleDocTemplate(PDF_REPORT, pagesize=A4)

# === Header ===
# The built-in PDF fonts have no emoji glyphs, so the title is kept as plain text
elements.append(Paragraph("Bank Marketing Dataset – EDA Report", styles["Title"]))
elements.append(Paragraph("Author: Costas Pinto", styles["Normal"]))
elements.append(Paragraph(f"Date: {datetime.now().strftime('%Y-%m-%d')}", styles["Normal"]))
elements.append(Spacer(1, 0.2 * inch))

# === Section 1: Dataset Info ===
elements.append(Paragraph("1. Dataset Overview", styles["Heading2"]))
elements.append(Paragraph(f"Shape: {df.shape}", styles["Normal"]))
elements.append(Paragraph("First 5 Columns: " + ', '.join(df.columns[:5]), styles["Normal"]))
elements.append(PageBreak())

# === Section 2: Plots ===
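def add_image(file_name, caption):
    image_path = os.path.join(PLOTS_DIR, file_name)
    if os.path.exists(image_path):
        elements.append(Paragraph(caption, styles["Heading3"]))
        elements.append(Image(image_path, width=6*inch, height=3.5*inch))
        elements.append(Spacer(1, 0.2 * inch))
    else:
        # Step 16 names its plot after the most skewed column, so some files may be absent
        print(f"[WARN] Plot not found, skipped: {file_name}")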

add_image("target_distribution.png", "2. Target Distribution")
add_image("barplot_job_y.png", "Job vs Target")
add_image("barplot_education_y.png", "Education vs Target")
add_image("barplot_marital_y.png", "Marital vs Target")
add_image("correlation_heatmap.png", "Correlation Heatmap")
add_image("feature_importance_rf.png", "Feature Importance (Random Forest)")
add_image("class_distribution_before_smote.png", "Before SMOTE")
add_image("class_distribution_after_smote.png", "After SMOTE")
add_image("log_transform_campaign.png", "Log Transformation on Campaign")

elements.append(PageBreak())

# === Section 3: Feature Tables ===
# Preformatted keeps the line breaks and column alignment of to_string(),
# which Paragraph would collapse into a single run of text
from reportlab.platypus import Preformatted

elements.append(Paragraph("3. Top Correlated Features", styles["Heading2"]))
elements.append(Preformatted(top_corr.to_string(), styles["Code"]))
elements.append(Spacer(1, 0.2 * inch))

elements.append(Paragraph("4. Top 10 Feature Importances", styles["Heading2"]))
elements.append(Preformatted(feature_importance.head(10).to_string(index=False), styles["Code"]))
elements.append(Spacer(1, 0.2 * inch))

elements.append(Paragraph("5. Skewness Values", styles["Heading2"]))
elements.append(Preformatted(skew_vals.to_string(), styles["Code"]))
elements.append(Spacer(1, 0.2 * inch))

elements.append(Paragraph("6. Features with VIF > 5", styles["Heading2"]))
vif_filtered = vif_data[vif_data['VIF'] > 5]
if not vif_filtered.empty:
    elements.append(Preformatted(vif_filtered.to_string(index=False), styles["Code"]))
else:
    elements.append(Paragraph("No features with VIF > 5", styles["Normal"]))

elements.append(PageBreak())

# === Footer ===
elements.append(Paragraph("Report auto-generated using Python", styles["Italic"]))

# === Build PDF ===
doc.build(elements)
print(f"[PDF GENERATED using reportlab] → {PDF_REPORT}")


### Final Project Conclusion

* **Project Title:** *Exploratory Data Analysis (EDA) on Bank Marketing Dataset*
* **Internship:** Skillfied Mentor Internship
* **Author:** *Costas Antony Pinto – MCA (AI & ML), Manipal University Jaipur*

---

### Project Goal

To conduct a detailed exploratory data analysis (EDA) on the **Bank Marketing Dataset** to understand customer behavior and identify the most influential features that drive subscription to a term deposit.

---

### Summary of Key Findings

1. **Dataset Overview**:

   * **Records:** 41,174
   * **Features:** 31 (after feature engineering and transformations)
   * Target variable: `y` (term deposit subscription)

2. **Missing Values**:

   * Placeholder `unknown` values were replaced with `NaN`.
   * Imputation was done using mode (categorical) and median (numeric) strategies.
   * Result: **No missing values remaining.**

3. **Duplicates**:

   * Removed **completely duplicated rows**, ensuring dataset integrity.

4. **Target Distribution**:

   * Highly imbalanced:

     * **No:** \~88.7%
     * **Yes:** \~11.3%

5. **Class Balancing**:

   * Applied **SMOTE** (Synthetic Minority Oversampling Technique) to balance the dataset before modeling.

6. **Categorical Analysis**:

   * Jobs like `student` and `retired` had a higher tendency to subscribe.
   * Education and marital status also influenced subscription rates.

7. **Feature Importance** (Random Forest):

   * Top 5 important features:

     1. `duration`
     2. `nr.employed`
     3. `pdays`
     4. `euribor3m`
     5. `emp.var.rate`

8. **Correlation with Target**:

   * Strongest correlation with `duration` (0.405), followed by `nr.employed` and `pdays`.

9. **Skewness**:

   * Features like `default`, `effective_contact`, and `campaign` showed heavy skew.
   * Applied `log1p` transformation to reduce skewness.

10. **Multicollinearity**:

    * Very high VIF values observed for multiple features (e.g., `pdays`, `nr.employed`, etc.)
    * Indicates **potential multicollinearity** which may require attention before modeling.

11. **Feature Engineering**:

    * Created new features:

      * `age_group` (young, adult, senior)
      * `contacted_before` (based on `pdays`)
      * `effective_contact` (interaction between `pdays` and `previous`)

---

### Deliverables Generated

* Cleaned dataset: `bankmarketing_cleaned.csv`
* HTML report: `Bank_Marketing_EDA_Report.html`
* PDF report: `Bank_Marketing_EDA_Report.pdf`
* Supporting CSV files:

  * `feature_importance.csv`
  * `top_corr_features.csv`
  * `vif_report.csv`
  * `skewness_report.csv`
* Summary text report: `eda_summary.txt`

---

### Conclusion

This EDA provided **a comprehensive understanding of the factors influencing customer subscription** in a banking marketing campaign. Key features such as `duration`, `pdays`, and economic indicators proved crucial. The project demonstrated the **importance of preprocessing**, **handling imbalanced data**, **feature engineering**, and **data visualization** in preparing for predictive modeling.

> This structured, reproducible EDA pipeline can serve as a **solid foundation for machine learning models**, business insights, or further research into customer behavior analytics in financial domains.

---
