# Bank Marketing Dataset: Vertical Federated Learning Setup

This notebook explores and explains the dataset created by `prepare_data.py` for vertical federated learning (VFL) experiments.

## Table of Contents
1. [Introduction to Vertical Federated Learning](#intro)
2. [Dataset Overview](#overview) 
3. [Data Structure Analysis](#structure)
4. [Client Partitions Exploration](#partitions)
5. [Server Data Analysis](#server)
6. [Mock vs Private Data Comparison](#mock-vs-private)
7. [VFL Workflow Demonstration](#workflow)
8. [Summary and Insights](#summary)

In [None]:
# Import required libraries
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

# Set up plotting style
plt.style.use("default")
sns.set_palette("husl")
plt.rcParams["figure.figsize"] = (12, 8)

# Define paths
PROJECT_PATH = Path("../")
DATASET_PATH = PROJECT_PATH.resolve() / "dataset" / "marketing" / "processed"

if not DATASET_PATH.exists():
    !uv run {str(PROJECT_PATH / "scripts/prepare_data.py")} --active

print(f"Project Path: {PROJECT_PATH}")
print(f"Dataset Path: {DATASET_PATH}")
print(f"Dataset exists: {DATASET_PATH.exists()}")

## 1. Introduction to Vertical Federated Learning {#intro}

**Vertical Federated Learning (VFL)** is a machine learning paradigm in which participants hold different features for the same set of samples. Unlike horizontal federated learning, where participants hold the same features for different samples, VFL involves:

- **Different feature sets**: Each client has a subset of features for the same samples
- **Same sample space**: All participants work with the same individuals/entities
- **Label holder**: Typically one party (server) holds the ground truth labels
- **Privacy preservation**: Raw features are never shared between participants

### Our VFL Setup:
- **Client 0 (Bank)**: Holds bank-related customer features
- **Client 1 (Marketing)**: Holds marketing campaign features  
- **Server**: Holds the target labels (subscription outcomes)
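
To make the vertical/horizontal distinction concrete, here is a minimal sketch on a made-up table. The column names and values are illustrative assumptions, not the output of `prepare_data.py`.

```python
import numpy as np

# Toy table: 4 customers (rows) x 4 numeric features (columns), one label each.
# Column names and values are made up for illustration.
columns = ["age", "duration", "campaign", "previous"]
X = np.array(
    [
        [34, 120, 1, 0],
        [51, 300, 3, 1],
        [27, 45, 2, 0],
        [45, 210, 1, 2],
    ]
)
y = np.array([0, 1, 0, 1])

# Vertical split: each party keeps ALL rows but only SOME columns.
bank_X = X[:, :2]       # "age", "duration"
marketing_X = X[:, 2:]  # "campaign", "previous"
server_y = y            # labels stay with the server

# Horizontal split (for contrast): each party keeps ALL columns but only SOME rows.
site_a_X, site_b_X = X[:2], X[2:]

print("Vertical:  ", bank_X.shape, marketing_X.shape, server_y.shape)
print("Horizontal:", site_a_X.shape, site_b_X.shape)

# Row i refers to the same customer in every vertical view, so stacking the
# views column-wise recovers the full table.
assert np.array_equal(np.hstack([bank_X, marketing_X]), X)
```

In the vertical case the row order is the only link between parties, which is why the sample alignment checks later in this notebook matter.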

## 2. Dataset Overview {#overview}

In [None]:
# Explore the dataset structure
def explore_directory_structure(path, level=0, max_level=3):
    """Recursively explore directory structure."""
    if level > max_level:
        return

    indent = "  " * level
    if path.is_dir():
        print(f"{indent}{path.name}/")
        try:
            for item in sorted(path.iterdir()):
                explore_directory_structure(item, level + 1, max_level)
        except PermissionError:
            print(f"{indent}  [Permission Denied]")
    else:
        size = path.stat().st_size
        if size > 1024 * 1024:  # > 1MB
            size_str = f"{size/(1024*1024):.1f}MB"
        elif size > 1024:  # > 1KB
            size_str = f"{size/1024:.1f}KB"
        else:
            size_str = f"{size}B"
        print(f"{indent}{path.name} ({size_str})")


print("Dataset Directory Structure:")
print("===========================")
explore_directory_structure(DATASET_PATH)

In [None]:
# Load original dataset for reference
original_data_path = PROJECT_PATH / "dataset" / "bank-additional-full.csv"
if original_data_path.exists():
    df_original = pd.read_csv(original_data_path, sep=";")
    print(f"Original Dataset Shape: {df_original.shape}")
    print(f"Features: {list(df_original.columns)}")
    print("\nTarget distribution:")
    print(df_original["y"].value_counts())
    print(f"\nClass balance: {df_original['y'].value_counts(normalize=True) * 100}")
else:
    print("Original dataset not found. Run prepare_data.py first.")

## 3. Data Structure Analysis {#structure}

In [None]:
# Define the feature partitions as in prepare_data.py
BANK_COLS = [
    "age",
    "job",
    "marital",
    "education",
    "default",
    "housing",
    "loan",
    "duration",
]

MARKETING_COLS = [
    "contact",
    "month",
    "day_of_week",
    "campaign",
    "pdays",
    "previous",
    "poutcome",
]

CLIENT_FOLDER_NAMES = ["bank_features", "marketing_features"]
PARTITION_COLS = [BANK_COLS, MARKETING_COLS]

# Create summary table
partition_summary = pd.DataFrame(
    {
        "Client": ["Bank (Client 0)", "Marketing (Client 1)"],
        "Folder": CLIENT_FOLDER_NAMES,
        "Features": [", ".join(cols) for cols in PARTITION_COLS],
        "Feature Count": [len(cols) for cols in PARTITION_COLS],
    }
)

print("Feature Partitioning Summary:")
print("============================\n")
for i, row in partition_summary.iterrows():
    print(f"**{row['Client']}** ({row['Folder']})")
    print(f"  Features ({row['Feature Count']}): {row['Features']}")
    print()

In [None]:
# Visualize the feature partitioning
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Pie chart of feature distribution
feature_counts = [len(BANK_COLS), len(MARKETING_COLS)]
labels = ["Bank Features", "Marketing Features"]
colors = ["skyblue", "lightcoral"]

ax1.pie(feature_counts, labels=labels, autopct="%1.1f%%", colors=colors, startangle=90)
ax1.set_title("Feature Distribution Across Clients", fontsize=14, fontweight="bold")

# Bar chart of feature counts
ax2.bar(labels, feature_counts, color=colors, alpha=0.7)
ax2.set_ylabel("Number of Features")
ax2.set_title("Feature Count by Client", fontsize=14, fontweight="bold")
ax2.grid(axis="y", alpha=0.3)

# Add value labels on bars
for i, v in enumerate(feature_counts):
    ax2.text(i, v + 0.1, str(v), ha="center", va="bottom", fontweight="bold")

plt.tight_layout()
plt.show()

## 4. Client Partitions Exploration {#partitions}

In [None]:
def load_partition_data(client_folder, data_type="private"):
    """Load partition data for a specific client."""
    data_path = DATASET_PATH / client_folder / data_type

    if not data_path.exists():
        print(f"Warning: {data_path} does not exist")
        return None, None

    X = np.load(data_path / "X_train.npy")
    y = np.load(data_path / "y_train.npy")

    return X, y


# Load data for both clients
client_data = {}
for i, folder in enumerate(CLIENT_FOLDER_NAMES):
    X_private, y_private = load_partition_data(folder, "private")
    X_mock, y_mock = load_partition_data(folder, "mock")

    client_data[f"client_{i}"] = {
        "name": folder.replace("_", " ").title(),
        "features": PARTITION_COLS[i],
        "private": {"X": X_private, "y": y_private},
        "mock": {"X": X_mock, "y": y_mock},
    }

# Display basic statistics
print("Client Data Summary:")
print("===================\n")
for client_id, data in client_data.items():
    print(f"**{data['name']}** ({client_id})")

    if data["private"]["X"] is not None:
        print(
            f"  Private Data: {data['private']['X'].shape[0]} samples, {data['private']['X'].shape[1]} features"
        )
        print(
            f"  Mock Data: {data['mock']['X'].shape[0]} samples, {data['mock']['X'].shape[1]} features"
        )

        # Check label consistency
        unique_labels_private = np.unique(data["private"]["y"])
        unique_labels_mock = np.unique(data["mock"]["y"])
        print(
            f"  Private Labels: {unique_labels_private} (distribution: {np.bincount(data['private']['y'])})"
        )
        print(
            f"  Mock Labels: {unique_labels_mock} (distribution: {np.bincount(data['mock']['y'])})"
        )
    else:
        print("  Data not available")
    print()

In [None]:
# Visualize data distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle("Client Data Analysis", fontsize=16, fontweight="bold")

# Sample sizes comparison
ax1 = axes[0, 0]
clients = list(client_data.keys())
private_sizes = [
    client_data[c]["private"]["X"].shape[0]
    for c in clients
    if client_data[c]["private"]["X"] is not None
]
mock_sizes = [
    client_data[c]["mock"]["X"].shape[0]
    for c in clients
    if client_data[c]["mock"]["X"] is not None
]
client_names = [
    client_data[c]["name"]
    for c in clients
    if client_data[c]["private"]["X"] is not None
]

x = np.arange(len(client_names))
width = 0.35

ax1.bar(x - width / 2, private_sizes, width, label="Private Data", alpha=0.8)
ax1.bar(x + width / 2, mock_sizes, width, label="Mock Data", alpha=0.8)
ax1.set_xlabel("Clients")
ax1.set_ylabel("Number of Samples")
ax1.set_title("Sample Sizes by Client")
ax1.set_xticks(x)
ax1.set_xticklabels(client_names, rotation=45)
ax1.legend()
ax1.grid(axis="y", alpha=0.3)

# Feature counts
ax2 = axes[0, 1]
feature_counts = [
    client_data[c]["private"]["X"].shape[1]
    for c in clients
    if client_data[c]["private"]["X"] is not None
]
ax2.bar(client_names, feature_counts, color=["skyblue", "lightcoral"], alpha=0.8)
ax2.set_xlabel("Clients")
ax2.set_ylabel("Number of Features")
ax2.set_title("Feature Counts by Client")
ax2.grid(axis="y", alpha=0.3)
for i, v in enumerate(feature_counts):
    ax2.text(i, v + 0.5, str(v), ha="center", va="bottom", fontweight="bold")

# Label distribution for private data
ax3 = axes[1, 0]
for i, client_id in enumerate(clients):
    if client_data[client_id]["private"]["y"] is not None:
        y_data = client_data[client_id]["private"]["y"]
        labels, counts = np.unique(y_data, return_counts=True)
        ax3.bar(
            [f"{client_data[client_id]['name']}\n(Class {l})" for l in labels],
            counts,
            alpha=0.7,
            label=client_data[client_id]["name"],
        )

ax3.set_xlabel("Client and Class")
ax3.set_ylabel("Sample Count")
ax3.set_title("Label Distribution (Private Data)")
ax3.grid(axis="y", alpha=0.3)

# Label distribution for mock data
ax4 = axes[1, 1]
mock_distributions = []
for client_id in clients:
    if client_data[client_id]["mock"]["y"] is not None:
        y_data = client_data[client_id]["mock"]["y"]
        labels, counts = np.unique(y_data, return_counts=True)
        mock_distributions.append(counts)

if mock_distributions:
    x = np.arange(len(labels))
    width = 0.35

    for i, (client_id, counts) in enumerate(zip(clients, mock_distributions)):
        ax4.bar(
            x + i * width,
            counts,
            width,
            label=client_data[client_id]["name"],
            alpha=0.7,
        )

    ax4.set_xlabel("Class Label")
    ax4.set_ylabel("Sample Count")
    ax4.set_title("Label Distribution (Mock Data)")
    ax4.set_xticks(x + width / 2)
    ax4.set_xticklabels([f"Class {l}" for l in labels])
    ax4.legend()
    ax4.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Server Data Analysis {#server}

In [None]:
# Load server training labels
server_path = DATASET_PATH / "server_train"

if server_path.exists():
    # Load both full and mock server labels
    y_server_full = np.load(server_path / "y_train.npy")
    y_server_mock = np.load(server_path / "y_mock.npy")

    print("Server Training Labels:")
    print("======================\n")

    print(f"Full Labels: {len(y_server_full)} samples")
    print(f"  Distribution: {dict(zip(*np.unique(y_server_full, return_counts=True)))}")
    print(
        f"  Class balance: {np.mean(y_server_full == 1) * 100:.1f}% positive class"
    )
    print()

    print(f"Mock Labels: {len(y_server_mock)} samples")
    print(f"  Distribution: {dict(zip(*np.unique(y_server_mock, return_counts=True)))}")
    print(
        f"  Class balance: {np.mean(y_server_mock == 1) * 100:.1f}% positive class"
    )

    # Verify alignment with client data
    print("\nData Alignment Verification:")
    print("===========================\n")

    for client_id in client_data.keys():
        client_labels_full = client_data[client_id]["private"]["y"]
        client_labels_mock = client_data[client_id]["mock"]["y"]

        if client_labels_full is not None:
            # Check if server and client labels match
            full_match = np.array_equal(y_server_full, client_labels_full)
            mock_match = np.array_equal(y_server_mock, client_labels_mock)

            print(f"{client_data[client_id]['name']}:")
            print(f"  Full data labels match server: {full_match}")
            print(f"  Mock data labels match server: {mock_match}")
            print()
else:
    print("Server training labels not found. Run prepare_data.py first.")

In [None]:
# Visualize server label distribution
if server_path.exists():
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))

    # Full data distribution
    ax1 = axes[0]
    labels_full, counts_full = np.unique(y_server_full, return_counts=True)
    colors = ["lightcoral", "lightblue"]
    ax1.pie(
        counts_full,
        labels=[f"Class {l}" for l in labels_full],
        autopct="%1.1f%%",
        colors=colors,
        startangle=90,
    )
    ax1.set_title(
        f"Server Labels Distribution\n(Full Data - {len(y_server_full)} samples)",
        fontsize=12,
        fontweight="bold",
    )

    # Mock data distribution
    ax2 = axes[1]
    labels_mock, counts_mock = np.unique(y_server_mock, return_counts=True)
    ax2.pie(
        counts_mock,
        labels=[f"Class {l}" for l in labels_mock],
        autopct="%1.1f%%",
        colors=colors,
        startangle=90,
    )
    ax2.set_title(
        f"Server Labels Distribution\n(Mock Data - {len(y_server_mock)} samples)",
        fontsize=12,
        fontweight="bold",
    )

    plt.tight_layout()
    plt.show()

## 6. Mock vs Private Data Comparison {#mock-vs-private}

In [None]:
# Compare mock vs private data characteristics
print("Mock vs Private Data Comparison:")
print("================================\n")

comparison_data = []

for client_id in client_data.keys():
    client_name = client_data[client_id]["name"]

    # Get data
    X_private = client_data[client_id]["private"]["X"]
    y_private = client_data[client_id]["private"]["y"]
    X_mock = client_data[client_id]["mock"]["X"]
    y_mock = client_data[client_id]["mock"]["y"]

    if X_private is not None and X_mock is not None:
        # Calculate statistics
        private_stats = {
            "Client": client_name,
            "Data Type": "Private",
            "Samples": len(X_private),
            "Features": X_private.shape[1],
            # np.mean avoids an IndexError when only one class is present
            "Positive Class %": np.mean(y_private == 1) * 100,
            "Feature Mean": np.mean(X_private),
            "Feature Std": np.std(X_private),
        }

        mock_stats = {
            "Client": client_name,
            "Data Type": "Mock",
            "Samples": len(X_mock),
            "Features": X_mock.shape[1],
            "Positive Class %": np.bincount(y_mock)[1] / len(y_mock) * 100
            if len(np.bincount(y_mock)) > 1
            else 0,
            "Feature Mean": np.mean(X_mock),
            "Feature Std": np.std(X_mock),
        }

        comparison_data.extend([private_stats, mock_stats])

# Create comparison DataFrame
df_comparison = pd.DataFrame(comparison_data)
print(df_comparison.round(2))

In [None]:
# Visualize mock vs private data comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle("Mock vs Private Data Comparison", fontsize=16, fontweight="bold")

# Sample sizes
ax1 = axes[0, 0]
private_df = df_comparison[df_comparison["Data Type"] == "Private"]
mock_df = df_comparison[df_comparison["Data Type"] == "Mock"]

x = np.arange(len(private_df))
width = 0.35

ax1.bar(x - width / 2, private_df["Samples"], width, label="Private", alpha=0.8)
ax1.bar(x + width / 2, mock_df["Samples"], width, label="Mock", alpha=0.8)
ax1.set_xlabel("Clients")
ax1.set_ylabel("Number of Samples")
ax1.set_title("Sample Sizes")
ax1.set_xticks(x)
ax1.set_xticklabels(private_df["Client"], rotation=45)
ax1.legend()
ax1.grid(axis="y", alpha=0.3)

# Class balance
ax2 = axes[0, 1]
ax2.bar(
    x - width / 2, private_df["Positive Class %"], width, label="Private", alpha=0.8
)
ax2.bar(x + width / 2, mock_df["Positive Class %"], width, label="Mock", alpha=0.8)
ax2.set_xlabel("Clients")
ax2.set_ylabel("Positive Class %")
ax2.set_title("Class Balance")
ax2.set_xticks(x)
ax2.set_xticklabels(private_df["Client"], rotation=45)
ax2.legend()
ax2.grid(axis="y", alpha=0.3)

# Feature statistics - means
ax3 = axes[1, 0]
ax3.bar(x - width / 2, private_df["Feature Mean"], width, label="Private", alpha=0.8)
ax3.bar(x + width / 2, mock_df["Feature Mean"], width, label="Mock", alpha=0.8)
ax3.set_xlabel("Clients")
ax3.set_ylabel("Feature Mean")
ax3.set_title("Feature Value Means")
ax3.set_xticks(x)
ax3.set_xticklabels(private_df["Client"], rotation=45)
ax3.legend()
ax3.grid(axis="y", alpha=0.3)

# Feature statistics - standard deviations
ax4 = axes[1, 1]
ax4.bar(x - width / 2, private_df["Feature Std"], width, label="Private", alpha=0.8)
ax4.bar(x + width / 2, mock_df["Feature Std"], width, label="Mock", alpha=0.8)
ax4.set_xlabel("Clients")
ax4.set_ylabel("Feature Std Dev")
ax4.set_title("Feature Value Standard Deviations")
ax4.set_xticks(x)
ax4.set_xticklabels(private_df["Client"], rotation=45)
ax4.legend()
ax4.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

## 7. VFL Workflow Demonstration {#workflow}

This section demonstrates how the data would be used in a typical VFL training workflow.
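
As a rough orientation before the walkthrough, the sketch below runs a split-learning style training loop on random stand-in data. Everything in it is an assumption made for illustration: the single linear layer per client, the embedding size of 4, the logistic top model, and the learning rate. None of it is taken from this project's training code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 5 aligned samples, 3 bank features, 2 marketing features.
X_bank = rng.normal(size=(5, 3))
X_marketing = rng.normal(size=(5, 2))
y = np.array([0, 1, 0, 1, 1])  # held by the server only

# Each client owns a bottom model: one linear layer producing 4-dim embeddings.
W_bank = rng.normal(scale=0.1, size=(3, 4))
W_marketing = rng.normal(scale=0.1, size=(2, 4))
# The server owns the top model: logistic regression on concatenated embeddings.
w_top = rng.normal(scale=0.1, size=8)

lr = 0.1
losses = []
for _ in range(50):
    # Clients: compute embeddings locally; only these leave the client.
    h_bank = X_bank @ W_bank
    h_marketing = X_marketing @ W_marketing

    # Server: concatenate embeddings, predict, compute binary cross-entropy.
    h = np.hstack([h_bank, h_marketing])
    p = 1.0 / (1.0 + np.exp(-(h @ w_top)))
    losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    # Server: gradients w.r.t. its own weights and w.r.t. the embeddings.
    d_logit = (p - y) / len(y)
    grad_w_top = h.T @ d_logit
    grad_h = np.outer(d_logit, w_top)

    # Server sends each client only the gradient slice for its own embedding.
    grad_h_bank, grad_h_marketing = grad_h[:, :4], grad_h[:, 4:]

    # Clients backpropagate through their own layer; the server updates its own.
    W_bank -= lr * X_bank.T @ grad_h_bank
    W_marketing -= lr * X_marketing.T @ grad_h_marketing
    w_top -= lr * grad_w_top

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Only embeddings and embedding gradients cross party boundaries here. Raw features and labels stay where they are held. The cells below print the same flow step by step, using the mock partitions.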

In [None]:
# Simulate VFL data flow
print("Vertical Federated Learning Workflow:")
print("====================================\n")

print("1. DATA INITIALIZATION")
print("-" * 50)

# Each client loads their partition
for i, client_id in enumerate(client_data.keys()):
    client_name = client_data[client_id]["name"]
    features = client_data[client_id]["features"]
    X_shape = client_data[client_id]["mock"]["X"].shape

    print(f"🏢 {client_name} loads:")
    print(f"   • Features: {features}")
    print(f"   • Data shape: {X_shape}")
    print(f"   • Sample indices: [0, 1, 2, ..., {X_shape[0]-1}]")
    print()

print("🖥️  Server loads:")
print(f"   • Training labels: {len(y_server_mock)} samples")
print(f"   • Sample indices: [0, 1, 2, ..., {len(y_server_mock)-1}]")
print(f"   • Labels: {y_server_mock}")
print()

In [None]:
print("2. SAMPLE ALIGNMENT VERIFICATION")
print("-" * 50)

# Verify that all parties have the same samples (same indices)
alignment_check = True
server_sample_count = len(y_server_mock)

for client_id in client_data.keys():
    client_name = client_data[client_id]["name"]
    client_sample_count = len(client_data[client_id]["mock"]["y"])
    client_labels = client_data[client_id]["mock"]["y"]

    samples_match = server_sample_count == client_sample_count
    labels_match = np.array_equal(y_server_mock, client_labels)

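    # Show a failure icon when either check fails instead of always printing ✅
    status_icon = "✅" if samples_match and labels_match else "❌"
    print(f"{status_icon} {client_name}:")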
    print(
        f"   • Sample count matches server: {samples_match} ({client_sample_count} samples)"
    )
    print(f"   • Labels match server: {labels_match}")

    if not (samples_match and labels_match):
        alignment_check = False
    print()

print(f"🔍 Overall alignment: {'✅ PASSED' if alignment_check else '❌ FAILED'}")
print()

In [None]:
print("3. VFL TRAINING SIMULATION (Mock Data)")
print("-" * 50)

# Simulate a training round
batch_size = 5
sample_indices = list(range(min(batch_size, len(y_server_mock))))

print(f"Training batch with {len(sample_indices)} samples: {sample_indices}")
print()

for i, client_id in enumerate(client_data.keys()):
    client_name = client_data[client_id]["name"]
    X_batch = client_data[client_id]["mock"]["X"][sample_indices]

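    # No client model exists in this notebook, so the raw mock batch stands in
    # for the embeddings a real client would compute and send.
    print(f"📤 {client_name} would send embeddings (raw mock batch as stand-in):")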
    print(f"   • Shape: {X_batch.shape}")
    print(
        f"   • Sample features for index 0: {X_batch[0][:3]}... (showing first 3 features)"
    )
    print()

# Server processes
y_batch = y_server_mock[sample_indices]
print("🖥️  Server processes:")
print(f"   • Receives embeddings from {len(client_data)} clients")
print(f"   • Uses ground truth labels: {y_batch}")
print("   • Computes loss and gradients")
print("   • Sends gradients back to clients")
print()

In [None]:
print("4. PRIVACY PRESERVATION")
print("-" * 50)

print("🔒 Privacy-preserving aspects:")
print("   • Clients never share raw features with each other")
print("   • Server only sees embeddings/representations, not raw features")
print("   • Each client only knows their own feature subset")
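print("   • Labels are used only by the server during training")
print("     (each client folder here also stores y_train.npy; clients should not read it)")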
print()

print("📊 What each party knows:")
for i, client_id in enumerate(client_data.keys()):
    client_name = client_data[client_id]["name"]
    features = client_data[client_id]["features"]
    print(f"   • {client_name}: Only {features} for samples {sample_indices}")

print(f"   • Server: Only labels {y_batch} and received embeddings")
print("   • No party has complete feature vectors for any sample")
print()

## 8. Summary and Insights {#summary}

In [None]:
print("Dataset Summary:")
print("===============\n")

# Calculate total statistics
total_features = sum(len(cols) for cols in PARTITION_COLS)
total_samples = len(y_server_full) if "y_server_full" in locals() else "Unknown"
mock_samples = len(y_server_mock) if "y_server_mock" in locals() else "Unknown"

print("📊 **Dataset Overview:**")
print(f"   • Total features across all clients: {total_features}")
print(f"   • Total training samples: {total_samples}")
print(f"   • Mock samples for testing: {mock_samples}")
print(f"   • Number of clients: {len(CLIENT_FOLDER_NAMES)}")
print("   • Target classes: 2 (binary classification)")
print()

print("🏗️  **VFL Setup:**")
print("   • Feature partitioning: Vertical (different features per client)")
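# Report the outcome of the alignment check instead of asserting it unconditionally
alignment_status = (
    ("verified on mock data" if alignment_check else "FAILED on mock data")
    if "alignment_check" in locals()
    else "not checked in this session"
)
print(f"   • Sample alignment: {alignment_status}")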
print("   • Label holder: Server")
print("   • Privacy preservation: Raw features never shared")
print()

print("💡 **Key Insights:**")
if "df_comparison" in locals():
    private_balance = df_comparison[df_comparison["Data Type"] == "Private"][
        "Positive Class %"
    ].mean()
    mock_balance = df_comparison[df_comparison["Data Type"] == "Mock"][
        "Positive Class %"
    ].mean()
    print(f"   • Class balance (private): {private_balance:.1f}% positive class")
    print(f"   • Class balance (mock): {mock_balance:.1f}% positive class")
    print(
        f"   • Mock vs private positive-rate gap: {abs(mock_balance - private_balance):.1f} percentage points"
    )

print("   • Bank features focus on customer demographics and financial status")
print("   • Marketing features focus on campaign interaction history")
print("   • Complementary feature sets provide comprehensive customer view")
print()

print("🔧 **Usage Recommendations:**")
print("   • Use mock data for development and quick testing")
print("   • Use private data for full training experiments")
print("   • Ensure all clients use same data type (mock/private) in experiments")
print("   • Monitor sample alignment during federated training")
print("   • Consider feature importance analysis within each partition")

## Conclusion

This notebook explored the bank marketing dataset prepared for vertical federated learning. Key takeaways:

1. **Ideal VFL Setup**: The data is correctly partitioned with different feature sets per client while maintaining sample alignment.

2. **Realistic Scenario**: The bank-marketing partition mimics real-world scenarios where different organizations hold complementary data about the same entities.

3. **Development Support**: Mock data enables rapid prototyping and testing of VFL algorithms.

4. **Privacy Preservation**: The setup ensures no client sees another's raw features, maintaining data privacy.

The dataset is ready for VFL experiments and provides a solid foundation for exploring vertical federated learning techniques in the context of financial marketing data.
