# **Multi-Layer Perceptron (MLP) Classification**

## **Project Overview:**

This notebook implements a Multi-Layer Perceptron (MLP) neural network for a real-world 
binary classification task. We will handle all aspects of the machine learning pipeline 
including data preparation, model implementation, training, and evaluation.

Authors: Rodrigo Medeiros, Matheus Castellucci and João Pedro Rodrigues

## **Dataset Selection**

### **Binary Classification with a Bank Dataset**

**Dataset:** [Binary Classification with a Bank Dataset](https://www.kaggle.com/competitions/playground-series-s5e8)

This dataset comes from Kaggle's Playground Series, which provides synthetic datasets 
generated from real-world data to allow practitioners to explore machine learning 
techniques in a competition-style format.

The dataset focuses on a binary classification problem related to banking data. The goal is to predict whether a client will subscribe to a bank term deposit.

**Why this dataset:**

- Tabular bank dataset suitable for an MLP (mix of categorical and numerical features).
- Good practice for preprocessing (categorical encoding, scaling), class imbalance checks, feature engineering, and standard model evaluation.

## **Dataset Explanation**

In [None]:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

In [None]:
DATA_DIR = Path("data")
train_path = DATA_DIR / "train.csv"
test_path = DATA_DIR / "test.csv"
target_col = 'y'
id_col = 'id'

# Load train
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
print("\nTrain shape:", train.shape)
print("Test shape:", test.shape)

# Print columns and types
numerical_features = train.select_dtypes(include=['number']).columns.tolist()
if id_col in numerical_features:
    numerical_features.remove(id_col)
if target_col in numerical_features:
    numerical_features.remove(target_col)
categorical_features = train.select_dtypes(include=['object']).columns.tolist()
print("\nNumerical features:", numerical_features)
print("Categorical features:", categorical_features)

print("\nFirst 5 rows:")
display(train.head())

# Basic checks
print("\nMissing values per column:")
print(train.isna().sum())

# Target distribution
print("\nTarget distribution:")
print(train[target_col].value_counts())
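# The target may be encoded as 1/0 or as "yes"/"no"; check both labels explicitly
target_share = train[target_col].value_counts(normalize=True)
positive_share = next((target_share[k] for k in (1, "yes") if k in target_share.index), "Unknown format")
print("\nTarget proportion (positive):", positive_share)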


### **Dataset overview**

| Property | Value |
|-----------|-------|
| **Samples (train)** | ≈ 750 000 rows |
| **Features** | 17 columns (16 predictors + 1 target), plus the `id` column |
| **Target** | `y` (binary: 0 = no deposit, 1 = deposit) |
| **Positive class proportion** | ≈ 12 % |
| **Missing values** | None |

### **Feature types**

| Category | Columns |
|-----------|----------|
| **Numeric** | `age`, `balance`, `day`, `duration`, `campaign`, `pdays`, `previous` |
| **Categorical** | `job`, `marital`, `education`, `default`, `housing`, `loan`, `contact`, `month`, `poutcome` |
| **ID column** | `id` (unique identifier, to be excluded from training) |

### **Domain Knowledge**

- `age`: Age of the client (numeric)

- `job`: Type of job (categorical: "admin.", "blue-collar", "entrepreneur", etc.)

- `marital`: Marital status (categorical: "married", "single", "divorced")

- `education`: Level of education (categorical: "primary", "secondary", "tertiary", "unknown")

- `default`: Has credit in default? (categorical: "yes", "no")

- `balance`: Average yearly balance in euros (numeric)

- `housing`: Has a housing loan? (categorical: "yes", "no")

- `loan`: Has a personal loan? (categorical: "yes", "no")

- `contact`: Type of communication contact (categorical: "unknown", "telephone", "cellular")

- `day`: Last contact day of the month (numeric, 1-31)

- `month`: Last contact month of the year (categorical: "jan", "feb", "mar", …, "dec")

- `duration`: Last contact duration in seconds (numeric)

- `campaign`: Number of contacts performed during this campaign (numeric)

- `pdays`: Number of days since the client was last contacted from a previous campaign (numeric; -1 means the client was not previously contacted)

- `previous`: Number of contacts performed before this campaign (numeric)

- `poutcome`: Outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")

- `y`: The target variable, whether the client subscribed to a term deposit (binary: 1 = yes, 0 = no)


### **Observations**

- No missing data.
- Strong class imbalance (roughly 1 positive for every 7 negatives, given ≈ 12 % positives). This will require balancing strategies during training.
- Mix of categorical and numerical features. We need to implement encoding and scaling steps before feeding into the MLP.
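
To put the imbalance in numbers, the short sketch below starts from the approximate 12 % positive rate in the overview table (an approximation, not a value recomputed from the data here). It also shows why accuracy alone is a weak metric for this task: a model that always predicts `0` already reaches about 88 % accuracy.

```python
positive_rate = 0.12  # approximate share of y = 1, taken from the overview table

# How many negative samples there are for each positive one
negatives_per_positive = (1 - positive_rate) / positive_rate

# Accuracy of a trivial model that always predicts the majority class
majority_baseline_accuracy = max(positive_rate, 1 - positive_rate)

print(f"negatives per positive: {negatives_per_positive:.2f}")
print(f"majority-class accuracy: {majority_baseline_accuracy:.2f}")
```

This is why the evaluation later reports ROC-AUC, precision, and recall next to accuracy.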

In [None]:
# Categorical summary
for col in categorical_features:
    print(f"\nUnique values in {col}: {train[col].nunique()}")
    print(train[col].unique())

In [None]:
TOP_K = len(train.columns)
ncols = 3
nrows = int(np.ceil(len(categorical_features) / ncols))
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*5, nrows*3.5))
axes = axes.flatten()

for i, col in enumerate(categorical_features):
    vc = train[col].value_counts(normalize=True).head(TOP_K)
    axes[i].barh(vc.index[::-1], vc.values[::-1])  # reverse to have largest on top
    axes[i].set_title(f"{col} (unique={train[col].nunique()})")
    axes[i].set_xlabel("Proportion")
    axes[i].xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"{x:.0%}"))

plt.tight_layout()
plt.suptitle("Categorical feature proportions (top categories)", y=1.02, fontsize=20)
plt.show()

In [None]:
# Numeric summary
train[numerical_features].describe()

In [None]:
# Histograms for numeric features
ncols = 3
nrows = int(np.ceil(len(numerical_features) / ncols))
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*5, nrows*3.5))
axes = axes.flatten()

for i, col in enumerate(numerical_features):
    sns.histplot(train[col], kde=False, bins=50, ax=axes[i])
    axes[i].set_title(f"{col} (mean={train[col].mean():.2f}, std={train[col].std():.2f})")
for j in range(i+1, len(axes)):
    axes[j].axis("off")
plt.tight_layout()
plt.suptitle("Numeric feature histograms", y=1.02, fontsize=20)
plt.show()


In [None]:
# Correlation matrix
corr = train[numerical_features + [target_col]].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Matrix", fontsize=20)
plt.show()

### **Categorical Feature Analysis**
The categorical feature proportion plots show that:

- `job`: The most common occupations are *management*, *blue-collar*, and *technician*, together covering over 60% of clients. Some categories like *student* and *unknown* are rare.

- `marital`: Most clients are *married* (~60%), followed by *single* and *divorced*.

- `education`: The *secondary* education level dominates (~50%), followed by *tertiary* and *primary*.

- `default`, `housing`, and `loan`: Almost all clients have *no default*; roughly half have *housing loans*, and most have *no personal loan*.

- `contact`: The majority were contacted via *cellular*, with a small fraction through *telephone* or *unknown*.

- `month`: Campaigns are concentrated in *May, August, and July*, suggesting seasonal marketing efforts.

- `poutcome`: The outcome of previous campaigns is *unknown* for most clients, meaning they were not previously contacted or the outcome was not recorded.

Overall, categorical data is clean and complete, though some variables are highly imbalanced (*poutcome*, *default*). A variable dominated by a single category carries little information for most rows, which can limit its predictive power.

### **Numerical Feature Analysis**
Numeric histograms reveal several important patterns:

- `age`: Slightly right-skewed distribution centered around 40 years old.

- `balance`: Highly skewed with many clients near zero balance and few with very high balances (outliers present).

- `day`: Fairly uniform distribution across days of the month, suggesting no bias in call scheduling.

- `duration`: Strongly right-skewed; most calls are short, but a few are very long. Longer calls correlate with positive responses (as seen in the correlation matrix).

- `campaign`, `previous`, and `pdays`: Skewed towards low values, indicating most clients were contacted only once or twice, and many had never been contacted before (`pdays = -1`).

### **Correlation Analysis**

The correlation matrix highlights:

- `duration` shows the highest positive correlation with the target variable `y` (~0.52). Longer calls tend to result in subscriptions — likely because interested clients stay on the line longer.

- `previous` and `pdays` are moderately correlated (~0.56), reflecting related campaign tracking features.

- Other numeric features show weak linear correlations with the target. Linear correlation does not capture non-linear effects or feature interactions, which is where an MLP may add value.


### **Potential Data Issues**
| Issue | Observation | Impact | Planned Action |
|--------|--------------|---------|----------------|
| **Class imbalance** | Only ~12% of samples are positive (`y=1`) | May bias model towards predicting `0` | Use class-weighted loss or sampling |
| **Outliers** | `balance` and `duration` have extreme values | Could distort scaling and gradients | Apply scaling and possibly log-transform or clip |
| **Skewed distributions** | Most numeric features are heavily right-skewed | Standardization rescales values but does not remove skew | Try `StandardScaler` or `RobustScaler` |
| **Categorical imbalance** | Some categories (e.g. `unknown`, `default=yes`) are rare | Minimal contribution to learning | Consider grouping rare categories |
| **Sentinel value** | `pdays = -1` means “never contacted before” | Misleading if treated as numeric | Add binary flag `was_contacted_before` or treat -1 as missing |
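
The sentinel-value action in the table can be sketched with a small helper. This is illustrative only and is not part of the pipeline in this notebook: the function name `add_contact_flag` is hypothetical, and the sentinel is assumed to be `-1`, as stated in the `pdays` feature description above.

```python
import numpy as np

def add_contact_flag(pdays, sentinel=-1):
    """Split a sentinel-coded column into a binary flag and a cleaned numeric column.

    `sentinel` is an assumption: the value that marks "never contacted before".
    """
    pdays = np.asarray(pdays, dtype=np.float64)
    was_contacted_before = (pdays != sentinel).astype(np.int64)
    # Replace the sentinel with 0 so it no longer acts as a real number of days;
    # the flag carries the "never contacted" information instead.
    pdays_clean = np.where(pdays == sentinel, 0.0, pdays)
    return was_contacted_before, pdays_clean

flag, days = add_contact_flag([-1, 5, 180, -1])
print(flag, days)
```

With a DataFrame, the two returned arrays could be assigned to new columns before the preprocessing step.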

In [None]:
# Copy original data
clean_df = train.copy()

# Winsorize (cap) extreme outliers for skewed features
skewed_features = ["balance", "duration", "campaign", "pdays", "previous"]
for col in skewed_features:
    upper_limit = clean_df[col].quantile(0.99)
    clean_df[col] = np.where(clean_df[col] > upper_limit, upper_limit, clean_df[col])

print("Outliers capped at 99th percentile for:", skewed_features)

# Feature separation
numeric_features = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
categorical_features = ["job", "marital", "education", "default", "housing", 
                        "loan", "contact", "month", "poutcome"]

X = clean_df.drop(columns=["id", "y"])
y = clean_df["y"]

# Build preprocessing pipeline
ohe = OneHotEncoder(handle_unknown="ignore")
scaler = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ("num", scaler, numeric_features),
        ("cat", ohe, categorical_features)
    ],
    sparse_threshold=0  # always return a dense array; NumPyMLP expects dense NumPy input
)

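# Fit preprocessor
X_processed = preprocessor.fit_transform(X)

# Persist the fitted preprocessor; the submission cell loads it from this path
import joblib
ARTIFACTS_DIR = Path("artifacts")
ARTIFACTS_DIR.mkdir(exist_ok=True)
joblib.dump(preprocessor, ARTIFACTS_DIR / "preprocessor_cleaned.joblib")

print("Transformed feature matrix shape:", X_processed.shape)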

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_processed, y, test_size=0.1, random_state=42, stratify=y
)

print("Train size:", X_train.shape, "Validation size:", X_val.shape)

In [None]:
# Compare original vs scaled numeric distributions
scaled_numeric = preprocessor.named_transformers_['num'].transform(clean_df[numeric_features])
scaled_df = pd.DataFrame(scaled_numeric, columns=numeric_features)

fig, axes = plt.subplots(2, len(numeric_features)//2 + 1, figsize=(15, 6))
axes = axes.flatten()

for i, col in enumerate(numeric_features):
    sns.kdeplot(clean_df[col], ax=axes[i], label="Before", color="gray")
    sns.kdeplot(scaled_df[col], ax=axes[i], label="After", color="steelblue")
    axes[i].set_title(col)
    axes[i].legend()

plt.tight_layout()
plt.suptitle("Numeric feature scaling: Before vs After Standardization", y=1.05, fontsize=20)
plt.show()

### **Cleaning and Preprocessing Summary**

| Step | Action | Justification |
|------|---------|----------------|
| Missing values | None found | Dataset is complete |
| Duplicates | No explicit deduplication step; each row has a unique `id` | Duplicates would introduce bias and redundancy |
| Outliers | Capped 99th percentile of highly skewed variables | Prevents extreme values from dominating gradients |
| Encoding | One-Hot Encoding for categorical variables | Converts text to numeric safely for neural networks |
| Scaling | StandardScaler (Z-score) for numeric features | Normalizes input scale for stable training |
| Split | 90% train / 10% validation | Ensures model generalization assessment |

After preprocessing:

- All inputs are numeric and standardized.

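- Numeric features share a common scale after standardization, though the target remains imbalanced (to be handled during training).

- The fitted `preprocessor` must be reused unchanged on the test set so that test features receive the same encoding and scaling; the submission cell loads it from `artifacts/preprocessor_cleaned.joblib`.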
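
The outlier step caps values at quantiles computed from the training data. For test rows to be treated the same way, those cap values need to be stored and reused rather than recomputed on the test set. A minimal sketch of that idea with plain NumPy follows; the helper names `fit_caps` and `apply_caps` are hypothetical and do not appear in the notebook, and the data is synthetic.

```python
import numpy as np

def fit_caps(columns, q=0.99):
    """Compute the upper cap (the q-th quantile) for each named column."""
    return {name: float(np.quantile(values, q)) for name, values in columns.items()}

def apply_caps(columns, caps):
    """Clip each column at its stored cap; columns without a cap pass through."""
    return {
        name: np.minimum(values, caps[name]) if name in caps else values
        for name, values in columns.items()
    }

rng = np.random.default_rng(0)
train_cols = {"balance": rng.exponential(1000.0, size=10_000)}  # synthetic stand-in
test_cols = {"balance": np.array([10.0, 1e9])}                  # one extreme value

caps = fit_caps(train_cols)                 # learned on training data only
test_capped = apply_caps(test_cols, caps)   # reused unchanged on new data
print(caps, test_capped)
```

The same dictionary of caps could be stored next to the preprocessor so that training and inference stay consistent.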

In [None]:
# Ensure labels are numpy arrays of shape (n,)
y_train = np.asarray(y_train).astype(np.int64).ravel()
y_val = np.asarray(y_val).astype(np.int64).ravel()

class NumPyMLP:
    def __init__(self, input_dim, hidden_sizes=(128,), lr=0.01, weight_decay=1e-4, seed=42):
        """
        Simple fully-connected MLP with ReLU hidden activations and sigmoid output.
        hidden_sizes: sequence of ints (e.g., [128] or [256, 128]).
        """
        self.rng = np.random.RandomState(seed)
        # list(...) accepts tuples or lists and avoids sharing a mutable default argument
        self.sizes = [input_dim] + list(hidden_sizes) + [1]  # last layer is scalar output
        self.L = len(self.sizes) - 1  # number of weight layers
        self.params = {}
        # He initialization for ReLU
        for i in range(self.L):
            in_dim = self.sizes[i]
            out_dim = self.sizes[i+1]
            # weights: shape (out_dim, in_dim)
            # use He initialization for hidden layers, small std for output
            std = np.sqrt(2.0 / in_dim) if i < self.L - 1 else np.sqrt(1.0 / in_dim)
            self.params[f"W{i+1}"] = self.rng.randn(out_dim, in_dim) * std
            self.params[f"b{i+1}"] = np.zeros((out_dim, 1), dtype=np.float32)
        self.lr = lr
        self.weight_decay = weight_decay

    @staticmethod
    def relu(x):
        return np.maximum(0, x)
    @staticmethod
    def relu_grad(x):
        return (x > 0).astype(np.float32)

    @staticmethod
    def sigmoid(x):
        # stable sigmoid
        pos = x >= 0
        neg = ~pos
        out = np.empty_like(x, dtype=np.float64)
        out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
        ex = np.exp(x[neg])
        out[neg] = ex / (1.0 + ex)
        return out

    @staticmethod
    def bce_loss(y_true, y_pred_prob, eps=1e-12):
        # y_true: (batch,), y_pred_prob: (batch,)
        y_pred_prob = np.clip(y_pred_prob, eps, 1 - eps)
        loss = - (y_true * np.log(y_pred_prob) + (1 - y_true) * np.log(1 - y_pred_prob))
        return loss.mean()

    def forward(self, X):
        """
        X: shape (batch, input_dim)
        returns: output probabilities shape (batch,), and cache for backprop
        """
        cache = {}
        A = X.T  # shape (input_dim, batch)
        cache["A0"] = A
        for i in range(1, self.L + 1):
            W = self.params[f"W{i}"]      # (out, in)
            b = self.params[f"b{i}"]      # (out, 1)
            Z = W.dot(A) + b              # (out, batch)
            cache[f"Z{i}"] = Z
            if i < self.L:
                A = self.relu(Z)
            else:
                # output layer (linear -> sigmoid later)
                A = Z
            cache[f"A{i}"] = A
        logits = cache[f"A{self.L}"]    # shape (1, batch)
        probs = self.sigmoid(logits.ravel())
        cache["probs"] = probs
        return probs, cache

    def backward(self, cache, y_true):
        """
        Compute gradients and return a grads dict matching params shapes.
        y_true: shape (batch,)
        """
        grads = {}
        m = y_true.shape[0]
        # output layer gradient
        probs = cache["probs"]         # (batch,)
        dA = (probs - y_true) / m      # derivative of BCE wrt logits for sigmoid output
        dA = dA.reshape(1, -1)         # (1, batch)
        for i in range(self.L, 0, -1):
            A_prev = cache[f"A{i-1}"]     # (in, batch)
            Z_i = cache[f"Z{i}"]          # (out, batch)
            W_i = self.params[f"W{i}"]    # (out, in)
            # dW = dZ dot A_prev^T
            dW = dA.dot(A_prev.T)         # (out, in)
            db = dA.sum(axis=1, keepdims=True)  # (out,1)
            # propagate to previous layer if not input
            if i > 1:
                dA_prev = W_i.T.dot(dA)          # (in, batch)
                dZ_prev = dA_prev * self.relu_grad(cache[f"Z{i-1}"])  # (in, batch)
                dA = dZ_prev
            # include L2 regularization gradient
            dW += self.weight_decay * W_i
            grads[f"dW{i}"] = dW
            grads[f"db{i}"] = db
        return grads

    def step(self, grads, lr):
        # SGD update for each parameter
        for i in range(1, self.L+1):
            self.params[f"W{i}"] -= lr * grads[f"dW{i}"]
            self.params[f"b{i}"] -= lr * grads[f"db{i}"]

    def predict_proba(self, X):
        probs, _ = self.forward(X)
        return probs

    def predict(self, X, threshold=0.5):
        probs = self.predict_proba(X)
        return (probs >= threshold).astype(np.int64)


In [None]:
def train_mlp(model, X_train, y_train, X_val, y_val,
              n_epochs=25, batch_size=1024, lr=0.01, verbose=True):
    n_samples = X_train.shape[0]
    history = {"train_loss": [], "val_loss": [], "val_auc": [], "val_acc": []}
    best_auc = -np.inf
    best_params = None

    for epoch in range(1, n_epochs+1):
        # Shuffle
        perm = np.random.permutation(n_samples)
        X_shuff = X_train[perm]
        y_shuff = y_train[perm]
        # mini-batches
        epoch_losses = []
        for start in range(0, n_samples, batch_size):
            xb = X_shuff[start:start+batch_size]
            yb = y_shuff[start:start+batch_size]
            probs, cache = model.forward(xb)
            loss = model.bce_loss(yb, probs)
            epoch_losses.append(loss)
            grads = model.backward(cache, yb)
            model.step(grads, lr)

        # epoch train loss (mean of batch losses)
        train_loss = float(np.mean(epoch_losses))
        history["train_loss"].append(train_loss)

        # validation metrics
        val_probs, _ = model.forward(X_val)
        val_loss = float(model.bce_loss(y_val, val_probs))
        val_preds = (val_probs >= 0.5).astype(np.int64)
        val_auc = roc_auc_score(y_val, val_probs)
        val_acc = accuracy_score(y_val, val_preds)
        history["val_loss"].append(val_loss)
        history["val_auc"].append(val_auc)
        history["val_acc"].append(val_acc)

        if verbose:
            print(f"Epoch {epoch:02d}/{n_epochs} — train_loss: {train_loss:.4f} — val_loss: {val_loss:.4f} — val_auc: {val_auc:.4f} — val_acc: {val_acc:.4f}")

        # checkpoint best
        if val_auc > best_auc:
            best_auc = val_auc
            # deep copy params (simple dict copy of arrays)
            best_params = {k: v.copy() for k, v in model.params.items()}

    # restore best params
    if best_params is not None:
        model.params = best_params
    print("Training complete. Best val AUC:", best_auc)
    return history

# Configure and run
input_dim = X_train.shape[1]
hidden_sizes = [128]          # try [256,128] for more capacity
lr = 0.01
batch_size = 1024
n_epochs = 20
weight_decay = 1e-4

mlp = NumPyMLP(input_dim=input_dim, hidden_sizes=hidden_sizes, lr=lr, weight_decay=weight_decay, seed=42)
history = train_mlp(mlp, X_train, y_train, X_val, y_val, n_epochs=n_epochs, batch_size=batch_size, lr=lr)

# Plot training curves
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(history["train_loss"], label="train_loss")
plt.plot(history["val_loss"], label="val_loss")
plt.xlabel("Epoch")
plt.ylabel("BCE Loss")
plt.legend()
plt.title("Loss curves")

plt.subplot(1,2,2)
plt.plot(history["val_auc"], label="val_auc")
plt.plot(history["val_acc"], label="val_acc")
plt.xlabel("Epoch")
plt.legend()
plt.title("Validation AUC / Acc")
plt.show()


### Implementation details and hyperparameter justification

**Network**
- `NumPyMLP` implements a multi-layer perceptron with a configurable number of hidden layers.
- Weights `W` are stored as NumPy arrays with shape `(out_dim, in_dim)` and biases `b` as `(out_dim, 1)`.
- Hidden activations: **ReLU** (chosen for simplicity and strong empirical performance).
- Output activation: **sigmoid**, producing probabilities for binary classification.

**Initialization**
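- He initialization for hidden layers: weights are drawn from a zero-mean normal distribution with standard deviation `sqrt(2/in_dim)`, which helps gradients flow through ReLU units.
- Output layer initialized with a smaller standard deviation, `sqrt(1/in_dim)`.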
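
The effect of the He scaling can be checked empirically. The sketch below is standalone (it does not use `NumPyMLP`), and the layer sizes are hypothetical. It draws weights with standard deviation `sqrt(2/in_dim)`, pushes standard-normal inputs through one ReLU layer, and compares the mean squared activation with that of the input, which He scaling is designed to keep roughly equal.

```python
import numpy as np

rng = np.random.RandomState(42)
in_dim, out_dim, batch = 512, 256, 1000   # hypothetical sizes for illustration

he_std = np.sqrt(2.0 / in_dim)
W_hidden = rng.randn(out_dim, in_dim) * he_std   # same scheme as the hidden layers

X = rng.randn(in_dim, batch)          # inputs with mean square close to 1
A = np.maximum(0, W_hidden.dot(X))    # one ReLU layer, shape (out_dim, batch)

empirical_std = float(W_hidden.std())
input_mean_sq = float(np.mean(X ** 2))
output_mean_sq = float(np.mean(A ** 2))

print("target std:", he_std, "empirical std:", empirical_std)
print("mean square in:", input_mean_sq, "mean square out:", output_mean_sq)
```

If the factor 2 is dropped from the standard deviation, the output mean square comes out at roughly half the input value, and the signal shrinks layer by layer.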

**Loss and gradients**
- Binary cross-entropy (BCE) averaged over the batch:

  $$
  \mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right]
  $$
- Backpropagation computes exact gradients for all parameters using cached pre-activations.
- L2 regularization (weight decay) added to weight gradients.
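
For a sigmoid output with BCE loss, the gradient with respect to logit $z_i$ simplifies to $(p_i - y_i)/m$, which is the expression `backward` starts from. The sketch below checks that identity numerically with central finite differences on random logits. It is standalone and does not use `NumPyMLP`; the helper names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, z):
    """Mean binary cross-entropy as a function of the logits z."""
    p = sigmoid(z)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

rng = np.random.RandomState(0)
z = rng.randn(8)                               # logits for a batch of 8
y = rng.randint(0, 2, size=8).astype(float)    # binary labels

analytic = (sigmoid(z) - y) / len(y)           # closed-form gradient w.r.t. the logits

# Central finite differences, perturbing one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (bce(y, z + step) - bce(y, z - step)) / (2 * eps)

max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print("max |analytic - numeric| =", max_abs_diff)
```

The same finite-difference approach can be applied to individual weights of the network as a sanity check on the full backward pass.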

**Optimizer**
- Mini-batch SGD: chosen for simplicity and to meet the "from-scratch" requirement.
- Learning rate `lr=0.01` — a typical starting point; increase/decrease depending on loss curves.
- `weight_decay=1e-4` acts as L2 regularization to reduce overfitting.
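
A minimal sketch of the update these bullets describe, run on a toy one-parameter problem rather than the network itself. The momentum variant is included only for comparison; it is not used in this notebook, and `beta=0.9`, `velocity`, and the function names are illustrative assumptions.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01, weight_decay=1e-4):
    """Plain SGD with an L2 term added to the gradient."""
    return w - lr * (grad + weight_decay * w)

def momentum_step(w, grad, velocity, lr=0.01, weight_decay=1e-4, beta=0.9):
    """Classical momentum; `velocity` is the extra optimizer state."""
    velocity = beta * velocity + (grad + weight_decay * w)
    return w - lr * velocity, velocity

# Toy problem: minimize f(w) = 0.5 * w**2, whose gradient is simply w.
w_sgd = np.array([5.0])
w_mom = np.array([5.0])
velocity = np.zeros_like(w_mom)
for _ in range(50):
    w_sgd = sgd_step(w_sgd, grad=w_sgd)
    w_mom, velocity = momentum_step(w_mom, grad=w_mom, velocity=velocity)

print("plain SGD:", w_sgd, "momentum:", w_mom)
```

On this toy problem the momentum iterate ends closer to the minimum after the same number of steps, at the cost of keeping one extra array per parameter.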

**Hyperparameters we used**
- `hidden_sizes=[128]`: a single hidden layer; increase the width or add layers (e.g., `[256,128]`) for more capacity.
- `batch_size=1024`: large batches speed up matrix operations; reduce if memory is tight.
- `n_epochs=20`: enough to observe convergence; extend if validation AUC is still improving.
- `lr=0.01`: base learning rate for SGD.
- `weight_decay=1e-4`: small L2 regularization.
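
To make "capacity" concrete, the sketch below counts trainable parameters for the two `hidden_sizes` configurations mentioned. The helper `count_params` is hypothetical, and `input_dim = 51` is an assumed value for illustration; the real one is `X_train.shape[1]` after one-hot encoding.

```python
def count_params(sizes):
    """Number of weights and biases in a fully connected net with these layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

input_dim = 51  # assumed for illustration; use X_train.shape[1] in the notebook

small = count_params([input_dim, 128, 1])        # hidden_sizes=[128]
large = count_params([input_dim, 256, 128, 1])   # hidden_sizes=[256, 128]

print("hidden_sizes=[128]:     ", small)
print("hidden_sizes=[256, 128]:", large)
```

Under that assumption the deeper configuration has several times more parameters, most of them in the 256-to-128 layer.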

**Notes & next experiments**
- Try deeper architectures or different widths if underfitting.
- Add momentum or Adam-like updates if SGD convergence is slow (requires adding optimizer state).
- Consider class-weighted loss or sampling if the class imbalance reduces recall for positive class.
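
The class-weighted loss mentioned in the last note can be sketched as follows. This is one common formulation (scaling the positive-class term by a factor `pos_weight`), not something implemented in this notebook; the function names are hypothetical. With `pos_weight = 1` the gradient reduces to the unweighted `(probs - y_true) / m`.

```python
import numpy as np

def weighted_bce(y_true, probs, pos_weight, eps=1e-12):
    """Mean BCE where each positive sample's term is scaled by pos_weight."""
    probs = np.clip(probs, eps, 1 - eps)
    per_sample = -(pos_weight * y_true * np.log(probs)
                   + (1 - y_true) * np.log(1 - probs))
    return float(per_sample.mean())

def weighted_bce_grad_logits(y_true, probs, pos_weight):
    """Gradient of weighted_bce with respect to the logits of a sigmoid output."""
    m = y_true.shape[0]
    return (pos_weight * y_true * (probs - 1) + (1 - y_true) * probs) / m

# Toy batch: 1 positive and 7 negatives, all predicted at p = 0.2
y = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=np.float64)
probs = np.full(8, 0.2)
pos_weight = (y == 0).sum() / (y == 1).sum()   # negatives per positive

loss = weighted_bce(y, probs, pos_weight)
grad = weighted_bce_grad_logits(y, probs, pos_weight)
print("pos_weight:", pos_weight, "loss:", loss)
print("gradient w.r.t. logits:", grad)
```

To try this in `NumPyMLP`, the first gradient line of `backward` would be replaced by `weighted_bce_grad_logits`; that wiring is left out here.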

In [None]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score, roc_curve, precision_recall_curve
)
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Evaluate on the held-out validation split
y_true = y_val
y_pred_proba = mlp.predict_proba(X_val)
y_pred = (y_pred_proba >= 0.5).astype(int)

# Metrics
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_pred_proba)

print("=== Validation Metrics ===")
print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"ROC-AUC:   {auc:.4f}")

# Baseline (majority class predictor)
majority_class = np.bincount(y_true).argmax()
baseline_preds = np.full_like(y_true, fill_value=majority_class)
baseline_acc = accuracy_score(y_true, baseline_preds)
baseline_f1 = f1_score(y_true, baseline_preds, zero_division=0)
print("\n=== Baseline (Majority Class) ===")
print(f"Majority class: {majority_class}")
print(f"Baseline Accuracy: {baseline_acc:.4f}")
print(f"Baseline F1:       {baseline_f1:.4f}")

# Table summary
import pandas as pd
results_df = pd.DataFrame({
    "Metric": ["Accuracy", "Precision", "Recall", "F1-score", "ROC-AUC"],
    "MLP Model": [acc, prec, rec, f1, auc],
    "Baseline": [baseline_acc, np.nan, np.nan, baseline_f1, np.nan]
})
display(results_df)


In [None]:
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap="Blues", cbar=False)
plt.title("Confusion Matrix")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()

# ROC curve
fpr, tpr, _ = roc_curve(y_true, y_pred_proba)
plt.figure(figsize=(5,4))
plt.plot(fpr, tpr, label=f"MLP (AUC={auc:.3f})")
plt.plot([0,1], [0,1], '--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

# Precision–Recall curve
precisions, recalls, _ = precision_recall_curve(y_true, y_pred_proba)
plt.figure(figsize=(5,4))
plt.plot(recalls, precisions, label="MLP")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision–Recall Curve")
plt.legend()
plt.show()


In [None]:
# Cell: create submission CSV using NumPyMLP predictions
import pandas as pd
import numpy as np
import joblib
from pathlib import Path

DATA_DIR = Path("data")
TEST_CSV = DATA_DIR / "test.csv"
SAMPLE_SUB = DATA_DIR / "sample_submission.csv"
PREPROCESSOR_PATH = Path("artifacts") / "preprocessor_cleaned.joblib"
SUBMISSION_PATH = Path("submission.csv")

# 1. Load sample submission to determine format
sample = pd.read_csv(SAMPLE_SUB)
print("Sample submission columns:", sample.columns.tolist())
# Assume sample has columns like ['id','y'] or similar

# 2. Load test data
test_df = pd.read_csv(TEST_CSV)
print("Test shape:", test_df.shape)
if "id" not in test_df.columns and "Id" in test_df.columns:
    test_df.rename(columns={"Id":"id"}, inplace=True)

# 3. Load preprocessor and transform test features
preprocessor = joblib.load(PREPROCESSOR_PATH)
# Determine the feature columns the preprocessor expects.
# If you built the pipeline with ColumnTransformer on a DataFrame, pass DataFrame slices.
# We assume the same columns used for training: drop id and target.
if "id" in test_df.columns:
    X_test_df = test_df.drop(columns=["id"])
else:
    X_test_df = test_df.copy()

# If sample submission contains other columns, just transform test_df as we used for training
X_test_transformed = preprocessor.transform(X_test_df)  # numpy array

# 4. Predict with the NumPy MLP (mlp must be in memory)
# If you saved model parameters in files, load them into a new NumPyMLP instance before this step.
try:
    probs = mlp.predict_proba(X_test_transformed)  # shape (n_samples,)
except NameError:
    raise RuntimeError("Trained NumPyMLP instance `mlp` not found in memory. "
                       "Either run training cells or load saved params into a NumPyMLP instance.")

# 5. Build submission dataframe based on sample submission columns
# Find the name of the target column in sample_submission (commonly 'y' or 'target' or 'label')
target_col = [c for c in sample.columns if c.lower() not in ("id", "index")][0]
print("Detected target column name in sample submission:", target_col)

submission = pd.DataFrame()
# Keep the id column exactly as in test (or as in sample submission if id is in sample)
if "id" in test_df.columns:
    submission["id"] = test_df["id"]
elif "Id" in test_df.columns:
    submission["id"] = test_df["Id"]
else:
    # fallback: take the ids from the sample submission
    submission["id"] = sample["id"]

submission[target_col] = probs  # probabilities between 0 and 1

# Ensure column order matches sample
submission = submission[sample.columns]

# Save CSV (no index)
submission.to_csv(SUBMISSION_PATH, index=False)
print("Wrote submission to:", SUBMISSION_PATH)
