# Water Quality Safety Classifier (PyTorch MLP)

A compact end-to-end workflow for **tabular binary classification** using a multi-layer perceptron (MLP) in **PyTorch**.
The notebook covers: data cleaning, normalization, train/val/test splits, model variants with BatchNorm + Dropout,
training with SGD vs Adam, and evaluation with classification metrics.


## Setup

If you're running this locally and don't have the libraries installed, uncomment the next cell.
In most notebook environments, these are already available.


In [None]:
# Uncomment if needed:
# !pip install -q pandas numpy matplotlib scikit-learn seaborn torch


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
from torch.optim import SGD, Adam
from torch.utils.data import TensorDataset, DataLoader

# Reproducibility
np.random.seed(42)
torch.manual_seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


## 1) Load and clean the dataset

This dataset is expected to be a CSV file with a binary target column named `is_safe` (1 = safe, 0 = not safe).

**How to run this notebook**
- Put `waterQuality1.csv` in the same directory as this notebook, or
- Set `file_path` to the correct location.


In [None]:
file_path = "waterQuality1.csv"  # update if needed

df = pd.read_csv(file_path)
df.head()


In [None]:
# Basic cleaning: replace placeholder strings with NaN, then drop rows with missing values
df.replace("#NUM!", np.nan, inplace=True)
df.dropna(inplace=True)

# Ensure numeric dtype for every column, including the target.
# Values that cannot be parsed become NaN and are dropped below.
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")

df.dropna(inplace=True)  # drop any rows that became NaN after coercion

df.info()


### Quick EDA
- Histograms for numeric features
- Target distribution (often imbalanced)
- Correlation heatmap (optional, just for intuition)


In [None]:
df.hist(figsize=(15, 10))
plt.tight_layout()
plt.show()

df["is_safe"].hist(figsize=(5, 4))
plt.title("Target distribution: is_safe")
plt.show()


In [None]:
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(numeric_only=True), annot=False, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()


## 2) Train/Validation/Test split + normalization

We split first, then fit the scaler **only on the training set** to avoid leakage.


In [None]:
X = df.drop(columns=["is_safe"])
y = df["is_safe"].astype(int)

# Split: 80% train, 10% val, 10% test (stratified)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, train_size=0.5, stratify=y_temp, random_state=42
)

scaler = StandardScaler()
X_train_s = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_val_s   = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns, index=X_val.index)
X_test_s  = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

X_train_s.describe().T.head()
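
As a sanity check on the split-then-scale recipe above, the sketch below repeats it on a small synthetic frame. The column names, row count, and 10% positive rate are made-up assumptions, not properties of the real dataset. It checks that the stratified splits keep the class ratio and that only the training split ends up exactly centered after scaling.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 1000 rows, 10% positives (assumed values)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(5.0, 2.0, size=(1000, 3)), columns=["a", "b", "c"])
y_demo = pd.Series((np.arange(1000) % 10 == 0).astype(int))

# Same 80/10/10 stratified recipe as in the notebook
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, train_size=0.5, stratify=y_tmp, random_state=42
)

# Scaling statistics come from the training split only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_va_s = scaler_demo.transform(X_va)

split_sizes = (len(X_tr), len(X_va), len(X_te))
pos_rates = tuple(round(float(s.mean()), 3) for s in (y_tr, y_va, y_te))

print("sizes:", split_sizes)
print("positive rate per split:", pos_rates)
print("train mean after scaling:", X_tr_s.mean(axis=0).round(6))
print("val mean after scaling:  ", X_va_s.mean(axis=0).round(3))
```

The validation means are close to zero but not exactly zero, which is expected: the scaler never saw those rows.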


## 3) DataLoaders

We convert pandas dataframes into PyTorch tensors and build DataLoaders.
For binary classification with `BCEWithLogitsLoss`, targets are floats with shape `(N, 1)`.
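
To make the shape requirement concrete, here is a minimal sketch on a tiny made-up frame (the `f1`/`f2` columns and the one-layer model are illustrative assumptions). It shows that `unsqueeze(1)` turns the `(N,)` label vector into `(N, 1)`, matching the `(batch, 1)` logits that a one-output model produces.

```python
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Tiny made-up frame standing in for the scaled features and labels
X_small = pd.DataFrame({"f1": [0.1, -0.4, 1.2, 0.0], "f2": [1.0, 0.5, -0.3, 2.0]})
y_small = pd.Series([0, 1, 0, 1])

X_t = torch.tensor(X_small.values.astype(np.float32))
y_t = torch.tensor(y_small.values.astype(np.float32)).unsqueeze(1)  # (N,) -> (N, 1)

dl_small = DataLoader(TensorDataset(X_t, y_t), batch_size=2, shuffle=False)
xb, yb = next(iter(dl_small))

# A one-output model yields logits of shape (batch, 1)
logits = nn.Linear(2, 1)(xb)

# Float targets with the same shape as the logits are what the loss expects
loss = nn.BCEWithLogitsLoss()(logits, yb)

print(xb.shape, yb.shape, yb.dtype, logits.shape)
```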


In [None]:
def to_dataset(X_df, y_series):
    X_tensor = torch.tensor(X_df.values.astype(np.float32))
    y_tensor = torch.tensor(y_series.values.astype(np.float32)).unsqueeze(1)
    return TensorDataset(X_tensor, y_tensor)

batch_size = 128

train_ds = to_dataset(X_train_s, y_train)
val_ds   = to_dataset(X_val_s, y_val)
test_ds  = to_dataset(X_test_s, y_test)

train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_dl   = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
test_dl  = DataLoader(test_ds, batch_size=batch_size, shuffle=False)


## 4) Model definitions (MLP variants)

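The cell below defines two classes that cover three common variants for tabular data:
1) **BatchNorm + ReLU**: `MLP_BN` with its default `hidden` sizes
2) **BatchNorm + ReLU (deeper)**: `MLP_BN` with a longer `hidden` tuple
3) **BatchNorm + ReLU + Dropout** (regularization): `MLP_BN_Dropout`

All models output **logits** (no sigmoid at the end). The sigmoid is applied inside the loss during training and explicitly when computing predictions.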
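
The sketch below illustrates why the models stop at logits. On moderate logits, `BCEWithLogitsLoss` matches an explicit sigmoid followed by `BCELoss`. On an extreme logit the two-step version saturates, because the sigmoid rounds to exactly 1.0 in float32. The input values are made up for the demonstration.

```python
import torch
import torch.nn as nn

# Moderate logits: both routes agree
logits = torch.tensor([[-3.0], [0.0], [2.5]])
targets = torch.tensor([[0.0], [1.0], [1.0]])

fused = nn.BCEWithLogitsLoss()(logits, targets)            # takes raw logits
two_step = nn.BCELoss()(torch.sigmoid(logits), targets)    # explicit sigmoid, then BCE

# Extreme logit with the "wrong" label: the true loss is about 40
big_logit = torch.tensor([[40.0]])
neg_target = torch.tensor([[0.0]])

fused_extreme = nn.BCEWithLogitsLoss()(big_logit, neg_target)
# sigmoid(40) rounds to 1.0 in float32, so log(1 - p) is log(0);
# BCELoss clamps that term instead of returning the true value
two_step_extreme = nn.BCELoss()(torch.sigmoid(big_logit), neg_target)

print("moderate:", fused.item(), two_step.item())
print("extreme: ", fused_extreme.item(), two_step_extreme.item())
```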


In [None]:
class MLP_BN(nn.Module):
    def __init__(self, input_size, hidden=(128, 64, 32, 16), output_size=1):
        super().__init__()
        layers = []
        prev = input_size
        for h in hidden:
            layers += [
                nn.Linear(prev, h),
                nn.BatchNorm1d(h),
                nn.ReLU(),
            ]
            prev = h
        layers += [nn.Linear(prev, output_size)]  # logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class MLP_BN_Dropout(nn.Module):
    def __init__(self, input_size, hidden=(256, 128, 64, 32), p_drop=0.4, output_size=1):
        super().__init__()
        layers = []
        prev = input_size
        for h in hidden:
            layers += [
                nn.Linear(prev, h),
                nn.BatchNorm1d(h),
                nn.ReLU(),
                nn.Dropout(p_drop),
            ]
            prev = h
        layers += [nn.Linear(prev, output_size)]  # logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


## 5) Training + evaluation utilities

- We use **BCEWithLogitsLoss** for numerical stability (it combines sigmoid + BCE).
- We compute `pos_weight` to compensate for class imbalance: `pos_weight = (#neg / #pos)`, which scales the loss of every positive example by that ratio.
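
As a small worked check of what `pos_weight` does, the sketch below assumes hypothetical counts of 900 negatives and 100 positives. It compares the per-sample loss with and without the weight: the positive sample's loss is scaled by `#neg / #pos` and the negative sample's loss is unchanged.

```python
import torch
import torch.nn as nn

# Hypothetical label counts, not taken from the real dataset
n_neg, n_pos = 900, 100
pos_weight = torch.tensor([n_neg / n_pos])

# One positive and one negative sample, both with an uninformative logit of 0
logits = torch.tensor([[0.0], [0.0]])
targets = torch.tensor([[1.0], [0.0]])

# reduction="none" keeps the per-sample losses so they can be compared
plain = nn.BCEWithLogitsLoss(reduction="none")(logits, targets)
weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction="none")(logits, targets)

# Ratio of weighted to unweighted loss, per sample
ratio = (weighted / plain).squeeze(1)
print("plain:   ", plain.squeeze(1))
print("weighted:", weighted.squeeze(1))
print("ratio:   ", ratio)
```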


In [None]:
def compute_pos_weight(y_series, device):
    # y is 0/1 ints
    counts = y_series.value_counts().to_dict()
    n_pos = counts.get(1, 0)
    n_neg = counts.get(0, 0)
    if n_pos == 0:
        return torch.tensor([1.0], device=device)
    return torch.tensor([n_neg / n_pos], device=device)

def batch_accuracy_from_logits(logits, y_true):
    probs = torch.sigmoid(logits)
    preds = (probs >= 0.5).float()
    return (preds == y_true).float().mean().item()

def run_epoch(model, dl, loss_fn, optimizer=None):
    training = optimizer is not None
    model.train(training)  # train mode if an optimizer is given, eval mode otherwise

    total_loss = 0.0
    total_acc = 0.0
    n_samples = 0

    for Xb, yb in dl:
        Xb = Xb.to(device)
        yb = yb.to(device)

        # Build the autograd graph only while training
        with torch.set_grad_enabled(training):
            logits = model(Xb)
            loss = loss_fn(logits, yb)

        if training:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Weight by batch size so a smaller final batch does not skew the averages
        bs = yb.size(0)
        total_loss += loss.item() * bs
        total_acc += batch_accuracy_from_logits(logits, yb) * bs
        n_samples += bs

    return total_loss / n_samples, total_acc / n_samples

def train_model(model, train_dl, val_dl, loss_fn, optimizer, epochs=40, print_every=10):
    train_losses, val_losses = [], []
    train_accs, val_accs = [], []

    for epoch in range(1, epochs + 1):
        tr_loss, tr_acc = run_epoch(model, train_dl, loss_fn, optimizer=optimizer)
        va_loss, va_acc = run_epoch(model, val_dl, loss_fn, optimizer=None)

        train_losses.append(tr_loss); val_losses.append(va_loss)
        train_accs.append(tr_acc);   val_accs.append(va_acc)

        if epoch % print_every == 0 or epoch == 1 or epoch == epochs:
            print(f"Epoch {epoch:>3} | train loss {tr_loss:.4f} acc {tr_acc:.3f} | val loss {va_loss:.4f} acc {va_acc:.3f}")

    return train_losses, val_losses, train_accs, val_accs

def plot_curves(train_vals, val_vals, title, ylabel):
    plt.figure(figsize=(10, 5))
    plt.plot(train_vals, label="Train")
    plt.plot(val_vals, label="Validation")
    plt.title(title)
    plt.xlabel("Epoch")
    plt.ylabel(ylabel)
    plt.legend()
    plt.grid(True)
    plt.show()

def get_predictions(model, dl):
    model.eval()
    all_y = []
    all_pred = []
    with torch.no_grad():
        for Xb, yb in dl:
            Xb = Xb.to(device)
            logits = model(Xb)
            probs = torch.sigmoid(logits)
            preds = (probs >= 0.5).long().cpu().numpy().reshape(-1)
            all_pred.extend(list(preds))
            all_y.extend(list(yb.cpu().numpy().reshape(-1).astype(int)))
    return np.array(all_y), np.array(all_pred)


## 6) Train with SGD

This run uses SGD with momentum 0.9 and a learning rate of 0.01. You can tune:
- `lr`
- `momentum`
- architecture depth/width


In [None]:
input_size = X_train_s.shape[1]

pos_weight = compute_pos_weight(y_train, device)
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

model_sgd = MLP_BN_Dropout(input_size=input_size, p_drop=0.4).to(device)
optimizer_sgd = SGD(model_sgd.parameters(), lr=0.01, momentum=0.9)

epochs = 60
train_losses, val_losses, train_accs, val_accs = train_model(
    model_sgd, train_dl, val_dl, loss_fn, optimizer_sgd, epochs=epochs, print_every=10
)

plot_curves(train_losses, val_losses, "Loss (SGD)", "BCEWithLogitsLoss")
plot_curves(train_accs, val_accs, "Accuracy (SGD)", "Accuracy")


In [None]:
y_true, y_pred = get_predictions(model_sgd, test_dl)
print(classification_report(y_true, y_pred, digits=4))


## 7) Train with Adam

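Adam adapts the step size per parameter and often converges in fewer epochs than SGD with less learning-rate tuning, though this is not guaranteed. This run uses `lr=0.001` for 40 epochs and reuses the same `loss_fn`.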


In [None]:
model_adam = MLP_BN_Dropout(input_size=input_size, p_drop=0.4).to(device)
optimizer_adam = Adam(model_adam.parameters(), lr=0.001)

epochs = 40
train_losses_a, val_losses_a, train_accs_a, val_accs_a = train_model(
    model_adam, train_dl, val_dl, loss_fn, optimizer_adam, epochs=epochs, print_every=10
)

plot_curves(train_losses_a, val_losses_a, "Loss (Adam)", "BCEWithLogitsLoss")
plot_curves(train_accs_a, val_accs_a, "Accuracy (Adam)", "Accuracy")


In [None]:
y_true_a, y_pred_a = get_predictions(model_adam, test_dl)
print(classification_report(y_true_a, y_pred_a, digits=4))
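
The reports above are computed at a fixed 0.5 threshold. A threshold-free metric such as ROC AUC can complement them, but it needs predicted probabilities rather than hard labels. Since `get_predictions` returns thresholded labels, the probabilities would have to be collected separately. The sketch below uses made-up labels and probabilities purely to show the two calls side by side.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# Made-up labels and predicted probabilities, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob_demo = np.array([0.05, 0.20, 0.30, 0.45, 0.40, 0.60, 0.70, 0.90])

# ROC AUC ranks the scores directly, so no threshold is involved
auc = roc_auc_score(y_true_demo, y_prob_demo)

# F1 depends on the chosen cut-off (0.5 here)
y_pred_demo = (y_prob_demo >= 0.5).astype(int)
f1_at_half = f1_score(y_true_demo, y_pred_demo)

print(f"ROC AUC: {auc:.4f}")
print(f"F1 at threshold 0.5: {f1_at_half:.4f}")
```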


## Notes and next steps
- If the dataset is highly imbalanced, accuracy alone can be misleading. Use **F1**, **recall**, and **AUC** when possible.
- Try:
  - different `pos_weight`
  - wider/deeper networks
  - different dropout (`p_drop`)
  - early stopping based on validation loss
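
Early stopping on validation loss could be sketched as below. `EarlyStopper`, `patience`, and `min_delta` are hypothetical names introduced for illustration, and the loss sequence is simulated rather than taken from a real run.

```python
import copy


class EarlyStopper:
    """Hypothetical helper: stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_state = None
        self.bad_epochs = 0

    def step(self, val_loss, state=None):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            # Deep copy so later training steps cannot mutate the saved snapshot
            self.best_state = copy.deepcopy(state)
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated validation losses (made up): improvement stalls after epoch 4
fake_val_losses = [0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.51, 0.47]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(fake_val_losses, start=1):
    if stopper.step(val_loss, state={"epoch": epoch}):
        stopped_at = epoch
        break

print("stopped at epoch:", stopped_at)
print("best loss:", stopper.best_loss, "from", stopper.best_state)
```

In this notebook, one option would be to call `stopper.step(va_loss, model.state_dict())` inside the epoch loop of `train_model` and restore the best weights afterwards with `model.load_state_dict(stopper.best_state)`.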
