# Laboratory Exercise: Bank Marketing Campaign Outcome Prediction

## Introduction

In this laboratory exercise, you will build a **binary classification model** that predicts whether a client will **subscribe to a term deposit** as a result of a **direct marketing campaign** conducted by a Portuguese banking institution.

Each data sample represents information about a single client and their interaction with a marketing campaign (mainly phone calls). The goal is to predict the final campaign outcome based on **demographic**, **financial**, and **campaign-related** features.

You will implement a **complete machine learning pipeline using PyTorch**, including data preprocessing, dataset definition, model design, training, evaluation, visualization, and final testing.

This exercise focuses on **tabular data**, **mixed categorical and numerical features**, and **binary classification** using **Binary Cross-Entropy loss**.

---

## Problem Definition

- **Task:** Binary classification
- **Target column:** `y`
- **Target values:**
  - `yes` → client subscribed to a term deposit
  - `no` → client did not subscribe
- **Goal:** Predict whether a client will subscribe to a term deposit after the marketing campaign

You will work with a provided dataset (`dataset.csv`) derived from real-world banking marketing data.

---

## Tasks Overview

You are required to implement the following components:

### 1. Data Preparation
- Load the `dataset.csv` file
- Identify **numerical** and **categorical** features
- Encode categorical variables appropriately
- Normalize or scale numerical features if needed
- Separate features (`X`) and target (`y`)
- Split the dataset into:
  - Training set
  - Validation set
  - Test set

---

### 2. Dataset Class
- Implement a `BankMarketingDataset` class compatible with PyTorch’s `Dataset`
- Ensure:
  - Features are stored as `float32` tensors
  - Targets are stored as binary labels with shape `(N, 1)`
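
As a rough illustration of the two storage requirements above, the following sketch wraps arrays in a PyTorch `Dataset`. The class name `ToyTabularDataset` and the sample data are hypothetical and are not part of the exercise skeleton.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class ToyTabularDataset(Dataset):
    """Hypothetical example class; not the BankMarketingDataset required by the lab."""

    def __init__(self, X, y):
        # Features become a float32 tensor of shape (N, num_features).
        self.X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
        # Labels become a float32 column vector of shape (N, 1), as BCELoss expects.
        self.y = torch.as_tensor(np.asarray(y), dtype=torch.float32).reshape(-1, 1)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


ds = ToyTabularDataset(np.zeros((4, 3)), [0, 1, 1, 0])
features, label = ds[1]
print(len(ds), features.dtype, ds.y.shape)
```

Storing the labels as `float32` with shape `(N, 1)` lets them line up with a model that outputs one value per sample.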

---

### 3. Model Building
- Implement the `build_model_#` functions (`build_model_1` and `build_model_2`), each of which returns a neural network suitable for **binary classification**
- The models should:
  - Accept tabular input features
  - Use appropriate activation functions
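  - Output a single value per sample: a probability (apply `sigmoid` at the output layer) when training with `BCELoss`, or a raw logit (no `sigmoid`) when training with `BCEWithLogitsLoss`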
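
The relationship between the output layer and the loss can be checked numerically. The sketch below assumes a hypothetical `input_dim` of 5 and random data. It shows that applying `sigmoid` and then `BCELoss` gives the same value as feeding raw logits to `BCEWithLogitsLoss`, which is why the sigmoid must not be applied twice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim = 5  # hypothetical number of input features
body = nn.Sequential(
    nn.Linear(input_dim, 16),
    nn.ReLU(),
    nn.Linear(16, 1),  # one raw score (logit) per sample
)

x = torch.randn(8, input_dim)
target = torch.randint(0, 2, (8, 1)).float()

logits = body(x)
probs = torch.sigmoid(logits)  # what a final nn.Sigmoid layer would produce

# Option A: probabilities + BCELoss. Option B: logits + BCEWithLogitsLoss.
loss_a = nn.BCELoss()(probs, target)
loss_b = nn.BCEWithLogitsLoss()(logits, target)

print(loss_a.item(), loss_b.item())
```

Both options compute the same quantity; `BCEWithLogitsLoss` is generally the more numerically stable of the two.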

---

### 4. Training and Evaluation
- Implement the following functions:
  - `train_one_epoch`
  - `evaluate`
  - `test`
- Use:
  - `BCELoss` (or `BCEWithLogitsLoss` if the model outputs raw logits)
- Train the model for a fixed number of epochs
- Track:
  - Training loss
  - Validation loss
  - Validation accuracy
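
A compact sketch of the training and evaluation loop on synthetic data follows. Everything here is an assumption made for illustration: the random features, the rule that generates the labels, the tiny model, and the Adam optimizer with learning rate `1e-2`. It is not the required `train_one_epoch` or `evaluate` implementation, only a demonstration of the order of operations.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic, linearly separable toy problem (made up for illustration).
X = torch.randn(64, 4)
y = (X[:, :1] > 0).float()  # shape (64, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

history = []
for epoch in range(20):
    model.train()
    total = 0.0
    for xb, yb in loader:
        loss = criterion(model(xb), yb)  # forward pass and loss
        optimizer.zero_grad()            # clear gradients from the previous step
        loss.backward()                  # backpropagation
        optimizer.step()                 # parameter update
        total += loss.item()
    history.append(total / len(loader))  # average loss over batches

# Evaluation: no gradients, threshold probabilities at 0.5.
model.eval()
with torch.no_grad():
    preds = (model(X) >= 0.5).float()
    accuracy = (preds == y).float().mean().item()

print(f"first loss {history[0]:.4f}, last loss {history[-1]:.4f}, acc {accuracy:.4f}")
```

In the actual exercise, accuracy should be computed on the validation loader rather than on the training data as done in this toy example.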

---

### 5. Visualization
- Plot the following curves:
  - Training loss vs. epochs
  - Validation loss vs. epochs
  - Validation accuracy vs. epochs

All plots must be clearly labeled and interpretable.
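
One possible layout for these plots is sketched below. The metric values are placeholders invented solely to exercise the plotting calls; in the exercise they would come from the lists filled during training.

```python
import matplotlib.pyplot as plt

# Placeholder histories, made up purely to demonstrate the plotting calls.
train_losses = [0.69, 0.55, 0.47, 0.42, 0.40]
val_losses = [0.68, 0.57, 0.50, 0.47, 0.46]
val_accuracies = [0.62, 0.74, 0.80, 0.82, 0.83]
epoch_axis = range(1, len(train_losses) + 1)

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: training and validation loss on shared axes.
ax_loss.plot(epoch_axis, train_losses, label="Training loss")
ax_loss.plot(epoch_axis, val_losses, label="Validation loss")
ax_loss.set_xlabel("Epoch")
ax_loss.set_ylabel("BCE loss")
ax_loss.set_title("Loss vs. epochs")
ax_loss.legend()

# Right panel: validation accuracy.
ax_acc.plot(epoch_axis, val_accuracies, label="Validation accuracy")
ax_acc.set_xlabel("Epoch")
ax_acc.set_ylabel("Accuracy")
ax_acc.set_title("Validation accuracy vs. epochs")
ax_acc.legend()

fig.tight_layout()
```

Drawing both loss curves on the same axes makes it easier to spot the point where validation loss stops following training loss.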

---

### 6. Testing and Reporting
- Evaluate the final trained model on the **test dataset**
- Generate and display a **classification report**, including:
  - Precision
  - Recall
  - F1-score
  - Accuracy
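
The report can be produced with `classification_report` from scikit-learn. The label lists below are made up for illustration; in the exercise they would be the true and predicted labels returned by `test`, converted to NumPy arrays or lists.

```python
from sklearn.metrics import classification_report

# Made-up labels, standing in for the outputs of the test function.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Text report with per-class precision, recall, F1-score, and overall accuracy.
report_text = classification_report(
    y_true, y_pred, target_names=["no", "yes"], digits=3
)
print(report_text)

# The same metrics as a dictionary, convenient for comparing two models.
report_dict = classification_report(
    y_true, y_pred, target_names=["no", "yes"], output_dict=True
)
```

The dictionary form gives direct access to individual values, for example `report_dict["yes"]["recall"]`.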

---

## Model Comparison Requirement

You must design and train **two different model configurations**, for example:
- Different number of hidden layers
- Different number of neurons per layer
- Different activation functions
- Use of dropout or other regularization techniques

For **each model**, you must:
- Train it for the same number of epochs
- Plot training and validation metrics
- Evaluate it on the test dataset
- Compare the results and briefly discuss which model performs better and why
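
To make the idea of "different configurations" concrete, the sketch below defines two example architectures and counts their trainable parameters. The layer sizes, the dropout rate, the `input_dim` of 10, and the helper `count_parameters` are all hypothetical choices for illustration, not prescribed configurations.

```python
import torch.nn as nn

input_dim = 10  # hypothetical number of features after encoding

# Configuration A: one small hidden layer.
model_a = nn.Sequential(
    nn.Linear(input_dim, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

# Configuration B: deeper and wider, with dropout and a different activation.
model_b = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 32),
    nn.Tanh(),
    nn.Linear(32, 1),
)


def count_parameters(model: nn.Module) -> int:
    """Hypothetical helper: number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


params_a = count_parameters(model_a)
params_b = count_parameters(model_b)
print(params_a, params_b)
```

Reporting the parameter count alongside the test metrics can help explain differences in performance or overfitting between the two models.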

## Dataset Description

Each row in the dataset corresponds to one client contact and includes the following feature groups:

### Client Information
- `age` – age of the client (numeric)
- `job` – type of job (categorical)
- `marital` – marital status (categorical)
- `education` – education level (categorical)
- `default` – has credit in default? (yes/no)

### Financial Information
- `balance` – average yearly balance (numeric)
- `housing` – has housing loan? (yes/no)
- `loan` – has personal loan? (yes/no)

### Campaign Contact Information
- `contact` – communication type (categorical)
- `day` – last contact day of the month (numeric)
- `month` – last contact month (categorical)
- `duration` – last contact duration in seconds (numeric)

### Campaign History
- `campaign` – number of contacts performed during this campaign (numeric)
- `pdays` – number of days since the client was last contacted (numeric)
- `previous` – number of contacts before this campaign (numeric)
- `poutcome` – outcome of the previous marketing campaign (categorical)

### Target Variable
- `y` – whether the client subscribed to a term deposit (`yes` / `no`)

In [17]:
from typing import Tuple

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
from torch import Tensor
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt

In [26]:
def prepare_data(df: pd.DataFrame) -> Tuple[
    pd.DataFrame, pd.DataFrame, pd.DataFrame,
    pd.DataFrame, pd.DataFrame, pd.DataFrame,
    ColumnTransformer
]:
    """
    Prepare the bank marketing dataset for training and evaluation.

    The input DataFrame contains demographic, financial, and campaign-related
    information about clients contacted during a direct marketing campaign.
    The target column `y` indicates whether the client subscribed to a term
    deposit (`yes` or `no`).

    Steps (you MUST follow these steps):
    1. Identify feature columns and the target column `y`.
    2. Separate features (X) and target (y).
    3. Encode the target labels:
       - `yes` → 1
       - `no` → 0
    4. Identify categorical and numerical feature columns.
    5. Use ColumnTransformer with OneHotEncoder to encode all categorical columns:
       - use OneHotEncoder(drop="first", sparse_output=False)
       - Pass numerical features without modification
    6. Split the data in TWO stages (keep stratification):
       - First split into train and test:
            * test_size = 0.2
            * random_state = 42
            * stratify = y
       - Then split the training part into train and validation:
            * test_size = 0.2   (20% of the training set)
            * random_state = 42
            * stratify = y_train
    7. Fit the preprocessor on the training data only, then use it to
       transform the training, validation, and test features.
    8. Return:
         X_train, X_val, X_test, y_train, y_val, y_test, preprocessor

    Notes:
    - The returned X arrays must be fully numeric.
    - The returned y arrays must contain binary labels with shape (N, 1) or (N,).
    - The data must be suitable for PyTorch binary classification.
    """
    raise NotImplementedError()

In [9]:
class BankMarketingDataset(Dataset):
    """
    A PyTorch Dataset for bank marketing binary classification.

    Each sample consists of:
    - a numeric feature vector representing a client's profile and campaign data
    - a binary label indicating whether the client subscribed to a term deposit

    Requirements:
    - __init__(self, X, y):
        * X: numpy array of numeric features
        * y: array-like of binary labels (0 or 1)
        * Store:
            - X as a float32 tensor
            - y as a float32 tensor with shape (N, 1)
    - __len__(self):
        * Return the number of samples
    - __getitem__(self, idx):
        * Return (X[idx], y[idx])
    """
    pass

In [10]:
def train_one_epoch(model: nn.Module,
                    train_loader: DataLoader,
                    criterion: nn.Module,
                    optimizer: torch.optim.Optimizer) -> float:
    """
    Train the model for ONE epoch on the training dataset.

    This is a binary classification task for predicting whether a client
    subscribes to a term deposit.

    Requirements:
    - Set the model to training mode using model.train()
    - Iterate over batches from train_loader
    - For each batch:
        * Compute model outputs (logits or probabilities)
        * Compute the loss using Binary Cross-Entropy loss
          (BCELoss or BCEWithLogitsLoss)
        * Zero the gradients
        * Perform backpropagation
        * Update model parameters using the optimizer
    - Accumulate the training loss over all batches
    - Return the average training loss as a float
      (total loss divided by the number of batches)
    """
    raise NotImplementedError()

In [11]:
def evaluate(model: nn.Module,
             val_loader: DataLoader,
             criterion: nn.Module) -> Tuple[float, float]:
    """
    Evaluate the model on the validation dataset.

    This is a binary classification task for bank marketing outcome prediction.

    Requirements:
    - Set the model to evaluation mode using model.eval()
    - Disable gradient computation using torch.no_grad()
    - Iterate over batches from val_loader
    - For each batch:
        * Compute model outputs
        * Compute and accumulate validation loss
        * Convert outputs to predicted labels using threshold 0.5
        * Collect predicted labels and true labels
    - Compute validation accuracy over the entire validation set
    - Return:
        - validation accuracy (float)
        - validation loss (float)
    """
    raise NotImplementedError()

In [12]:
def test(model: nn.Module,
         test_loader: DataLoader) -> Tuple[Tensor, Tensor]:
    """
    Evaluate the trained model on the test dataset.

    This function performs inference for binary classification of
    bank marketing campaign outcomes.

    Requirements:
    - Set the model to evaluation mode using model.eval()
    - Disable gradient computation using torch.no_grad()
    - Iterate over batches from test_loader
    - For each batch:
        * Compute model outputs
        * Convert outputs to predicted labels using threshold 0.5
        * Collect all predicted labels and true labels
    - Return:
        - Tensor of true labels (shape: N,)
        - Tensor of predicted labels (shape: N,)

    These outputs will be used to compute a classification report.
    """
    raise NotImplementedError()

In [13]:
def build_model_1(input_dim: int) -> nn.Module:
    """
    Build and return a PyTorch neural network for bank marketing
    binary classification.

    Requirements:
    - Use nn.Sequential to define the model
    - The model must accept input vectors of size input_dim
    - The final layer must output a single value
    - Apply nn.Sigmoid as the last layer if using BCELoss;
      do NOT apply Sigmoid if using BCEWithLogitsLoss

    Note:
    - Use Binary Cross-Entropy loss during training
    - This model will serve as the baseline architecture
    """
    raise NotImplementedError()

In [14]:
def build_model_2(input_dim: int) -> nn.Module:
    """
    Build and return a second PyTorch neural network for bank marketing
    binary classification.

    This model should differ from build_model_1
    (e.g. more layers, more neurons, dropout, different activations).

    Requirements:
    - Use nn.Sequential to define the model
    - The model must accept input vectors of size input_dim
    - The final layer must output a single value
    - Apply nn.Sigmoid as the last layer if using BCELoss;
      do NOT apply Sigmoid if using BCEWithLogitsLoss

    Note:
    - Use Binary Cross-Entropy loss during training
    - This model will be compared against build_model_1
    """
    raise NotImplementedError()

### Build the models

In [15]:
# Call the build functions

### Train model 1

In [16]:
epochs = 0  # TODO: set the number of training epochs (use the same value for both models)
train_losses_1 = []
val_losses_1 = []
val_accuracies_1 = []

for epoch in range(epochs):

    # Call all required functions and store the computed metrics
    # (training loss, validation loss, and validation accuracy).

    train_loss = 0
    val_acc = 0

    print(f"Epoch {epoch + 1}/{epochs} | Train loss: {train_loss:.4f} | Val acc: {val_acc:.4f}")

### Train model 2

In [17]:
epochs = 0  # TODO: set the number of training epochs (use the same value for both models)
train_losses_2 = []
val_losses_2 = []
val_accuracies_2 = []

for epoch in range(epochs):

    # Call all required functions and store the computed metrics
    # (training loss, validation loss, and validation accuracy).

    train_loss = 0
    val_acc = 0

    print(f"Epoch {epoch + 1}/{epochs} | Train loss: {train_loss:.4f} | Val acc: {val_acc:.4f}")

### Visualize

In [18]:
# Visualize training and validation loss on the same plot, and visualize the validation accuracy across epochs.

### Evaluate

In [19]:
# Evaluate on the test dataset