# Training Robust Neural Networks — Notes

## Notebook Overview
- Goal: Binary classification of water potability with a small feedforward neural network in PyTorch, trained and evaluated with accuracy.
- Data: Custom `WaterDataset` reads a CSV; features are all columns except the last, and the label is the last column.

## Cell-by-Cell Notes
- **Cell 1 — Imports:** Loads `pandas`, `torch`, `Dataset`, `DataLoader`, `nn`, `optim`, and `torchmetrics.Accuracy` for data, modeling, optimization, and evaluation.
- **Cell 2 — `WaterDataset`:** Reads CSV to NumPy; returns `torch.Tensor` features/labels (`float32`). Label reshaped later with `view(-1, 1)` to match model output.
- **Cell 3 — Train DataLoader + Preview:** Wraps the dataset (`batch_size=2`, `shuffle=True`); prints a sample batch to verify shapes/dtypes.
- **Cell 4 — `Net` model:** `fc1(9→16)` → ReLU → `fc2(16→8)` → ReLU → `fc(8→1)` → Sigmoid to output a probability for binary classification.
- **Cell 5 — Training:** `train_model(...)` runs a standard loop with `BCELoss`. Casts tensors to float and reshapes labels to `(-1, 1)`. The cell then creates `SGD(lr=0.001)` and calls `train_model` for 10 epochs. Prints epoch progress.
- **Cell 6 — Evaluation:** Builds test `DataLoader`, sets `net.eval()`, uses `torch.no_grad()`, thresholds predictions at 0.5, computes `Accuracy(task="binary")`, prints final accuracy.

## Key Flow
- CSV → `WaterDataset` → `DataLoader` (train/test) → model → train with BCE → evaluate with accuracy.

## Pitfalls Avoided
- Dtype/shape mismatches: Explicit `.float()` casts and `labels.view(-1, 1)` ensure `BCELoss` matches model output.
- Correct modes: `net.eval()` puts layers such as dropout and batch norm into inference behavior, and `torch.no_grad()` disables gradient tracking during evaluation.
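
The shape point above can be illustrated with a minimal sketch. The tensors `probs` and `labels` are made-up stand-ins for one batch, not values from the notebook; the only thing shown is that reshaping the labels to `[batch, 1]` makes them line up with the model output before calling `BCELoss`.

```python
import torch
import torch.nn as nn

# Made-up stand-ins for one batch (assumption, not notebook data):
# the model emits shape [batch, 1], the DataLoader yields labels of shape [batch].
probs = torch.tensor([[0.8], [0.3]])   # hypothetical sigmoid outputs
labels = torch.tensor([1.0, 0.0])      # hypothetical targets

criterion = nn.BCELoss()

# Cast to float and reshape to [batch, 1] so input and target sizes agree.
targets = labels.float().view(-1, 1)
loss = criterion(probs, targets)

print(probs.shape, targets.shape)  # both are [2, 1]
print(loss.item())
```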

## Recommended Improvements
- Numerical stability: Prefer `BCEWithLogitsLoss` and remove the model's final `sigmoid`, so the model outputs logits; in eval, apply `torch.sigmoid` to the logits before thresholding at 0.5.
- Feature scaling: Normalize/standardize inputs to improve convergence.
- Optimizer/params: Try `Adam(lr=1e-3)`, tune `batch_size` (e.g., 32) and `num_epochs`.
- Metrics: Add precision/recall/F1 if class imbalance exists (`torchmetrics`).
- Reproducibility: Set seeds for `torch` and `numpy`; control `DataLoader` randomness.
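
As one possible way to act on the reproducibility item, the sketch below seeds `torch` and `numpy` and passes a seeded `torch.Generator` to the `DataLoader` so the shuffle order repeats. The helper `make_loader`, the seed values, and the toy data are assumptions for illustration only.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(seed):
    # Hypothetical helper: a shuffled loader whose order depends only on `seed`.
    data = TensorDataset(torch.arange(10).float().view(-1, 1))
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(data, batch_size=2, shuffle=True, generator=g)

# Global seeds (affect weight initialization and any NumPy randomness).
np.random.seed(0)
torch.manual_seed(0)

# Two loaders built with the same seed should yield batches in the same order.
order_a = [batch[0].flatten().tolist() for batch in make_loader(seed=42)]
order_b = [batch[0].flatten().tolist() for batch in make_loader(seed=42)]
print(order_a == order_b)
```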

## Run Order
1. Imports
2. Dataset
3. Train DataLoader (preview)
4. Model
5. Training
6. Evaluation

In [36]:
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchmetrics import Accuracy

### Notes — Imports
- pandas: CSV loading and data manipulation.
- torch: Core tensor and autograd library.
- torch.utils.data `Dataset`, `DataLoader`: Custom dataset wrapper and batch iteration.
- torch.nn, torch.nn.functional: Neural network layers and functional activations (ReLU, Sigmoid).
- torch.optim: Optimizers (SGD, Adam) for parameter updates.
- torchmetrics `Accuracy(task="binary")`: Computes binary classification accuracy from predictions and targets.

Run this cell first to make all symbols available.

In [37]:
class WaterDataset(Dataset):

    def __init__(self, csv_file):
        super().__init__()

        df = pd.read_csv(csv_file)
        self.data = df.to_numpy()

    def __len__(self):
        return self.data.shape[0]
    
    def __getitem__(self, idx):
        features = self.data[idx, :-1].astype('float32')
        labels = self.data[idx, -1].astype('float32')
        return torch.from_numpy(features), torch.tensor(labels)

### Notes — WaterDataset
- Reads a CSV into a NumPy array and stores it in `self.data`.
- `__len__`: returns number of rows.
- `__getitem__`: splits features (all columns except last) and label (last column).
- Casts to `float32` and returns PyTorch tensors for seamless training.
- Label is a scalar; later reshaped to `(-1, 1)` to match model output shape.
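
The feature/label split described above can be tried on a tiny in-memory CSV. This is only a sketch: the column names and values are invented, and `io.StringIO` stands in for the real file path.

```python
import io
import pandas as pd
import torch

# Invented two-row CSV (assumption): three feature columns, then the label column.
csv_text = "a,b,c,label\n0.1,0.2,0.3,1\n0.4,0.5,0.6,0\n"
data = pd.read_csv(io.StringIO(csv_text)).to_numpy()

idx = 0
# Same slicing as in __getitem__: everything but the last column, then the last column.
features = torch.from_numpy(data[idx, :-1].astype("float32"))
label = torch.tensor(data[idx, -1].astype("float32"))

print(features.shape, features.dtype)  # 3 features, float32
print(label.shape)                     # scalar label (empty shape)
```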

In [38]:
# Create an instance of the WaterDataset

dataset_train = WaterDataset(csv_file='./water_potability/water_train.csv')

data_loader_train = DataLoader(
    dataset_train,
    batch_size = 2,
    shuffle=True
)

features, labels = next(iter(data_loader_train))
print(features, labels)

tensor([[0.2634, 0.2375, 0.3439, 0.5421, 0.5010, 0.4620, 0.5912, 0.6320, 0.5630],
        [0.7376, 0.5225, 0.3144, 0.3796, 0.5965, 0.2441, 0.4886, 0.4642, 0.6881]]) tensor([0., 0.])


### Notes — Train DataLoader & Preview
- Wraps the training dataset in a `DataLoader` with `batch_size=2` and `shuffle=True` for stochastic minibatches.
- The preview (`next(iter(...))`) quickly checks shapes/dtypes of a sample batch.
- Ensure features are tensors of shape `[batch, 9]` and labels are `[batch]` (reshaped later).
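
The shape and dtype check could also be written as explicit assertions rather than read off a printout. The sketch below uses a synthetic `TensorDataset` (an assumption, standing in for `WaterDataset`) with 9 feature columns.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the water data (assumption): 6 rows, 9 features, binary labels.
X = torch.rand(6, 9)
y = torch.randint(0, 2, (6,)).float()
loader = DataLoader(TensorDataset(X, y), batch_size=2, shuffle=True)

features, labels = next(iter(loader))

# Fail fast if the batch does not look the way the model expects.
assert features.shape == (2, 9)
assert labels.shape == (2,)
assert features.dtype == torch.float32
assert labels.dtype == torch.float32
```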

In [39]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()

        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc = nn.Linear(8, 1)
    

    def forward(self, x):

        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        x = nn.functional.sigmoid(self.fc(x))
        return x

### Notes — Model (`Net`)
- Architecture: `fc1(9→16)` → ReLU → `fc2(16→8)` → ReLU → `fc(8→1)` → Sigmoid.
- Final Sigmoid outputs a probability for binary classification.
- Pairing with `BCELoss` is consistent; alternatively remove Sigmoid and use `BCEWithLogitsLoss` for better numerical stability.
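
A minimal sketch of the logits-based alternative follows. `NetLogits` is a hypothetical variant of `Net` (same layers, no final Sigmoid), and the inputs are random stand-ins. It shows that `BCEWithLogitsLoss` on logits gives essentially the same value as `BCELoss` on `sigmoid(logits)`, while computing it in a more numerically stable way.

```python
import torch
import torch.nn as nn

class NetLogits(nn.Module):
    # Hypothetical variant of Net: same layers, but forward returns raw logits.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc = nn.Linear(8, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc(x)  # no sigmoid here

torch.manual_seed(0)
net_logits = NetLogits()
features = torch.rand(4, 9)                          # random stand-in inputs
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

logits = net_logits(features)
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
print(loss_logits.item(), loss_probs.item())

# At evaluation time, convert logits to probabilities before thresholding.
preds = (torch.sigmoid(logits) > 0.5).float()
```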

In [None]:
# Training Loop.

# for epoch in range(1000):
#     for features, labels in data_loader_train:
#         optimizer.zero_grad()
#         loss = criterion(net(features), labels.view(-1, 1))
#         loss.backward()
#         optimizer.step()

def train_model(optimizer, net, num_epochs, criterion=None, data_loader=None):

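    # Fall back to the defaults only when the caller does not supply them.
    if criterion is None:
        criterion = nn.BCELoss()
    if data_loader is None:
        data_loader = data_loader_train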
    
    for epoch in range(num_epochs):
        for features, labels in data_loader:
            features = features.float()
            labels = labels.float()
            optimizer.zero_grad()
            loss = criterion(net(features), labels.view(-1, 1))
            loss.backward()
            optimizer.step()
        print(f"Epoch {epoch+1}/{num_epochs} completed")
        
###### Optimizers ######
net = Net()

optimizer = optim.SGD(net.parameters(), lr=0.001)
train_model(
    optimizer=optimizer,
    net=net,
    num_epochs=10,
)

Epoch 1/10 completed
Epoch 2/10 completed
Epoch 3/10 completed
Epoch 4/10 completed
Epoch 5/10 completed
Epoch 6/10 completed
Epoch 7/10 completed
Epoch 8/10 completed
Epoch 9/10 completed
Epoch 10/10 completed


### Notes — Training Loop
- `train_model(...)`: Standard epoch/batch loop.
- Loss: `BCELoss` expects probabilities; model applies Sigmoid.
- Casting: `features.float()` and `labels.float()` avoid dtype issues.
- Shape: `labels.view(-1, 1)` matches model output shape `[batch, 1]`.
- Optimizer: SGD with learning rate `0.001` updates parameters each step.
- Prints epoch completion for progress.
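
Printing only epoch completion says nothing about whether the loss is going down. One possible extension, sketched below, keeps the same loop structure but returns the mean loss per epoch. The helper `train_with_loss`, the synthetic data, and the small `nn.Sequential` model are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train_with_loss(net, optimizer, criterion, data_loader, num_epochs):
    # Hypothetical helper: same loop shape as train_model, but returns mean loss per epoch.
    history = []
    for _ in range(num_epochs):
        running, seen = 0.0, 0
        for features, labels in data_loader:
            optimizer.zero_grad()
            loss = criterion(net(features.float()), labels.float().view(-1, 1))
            loss.backward()
            optimizer.step()
            # Weight each batch loss by its size so the epoch mean is per-sample.
            running += loss.item() * features.size(0)
            seen += features.size(0)
        history.append(running / seen)
    return history

torch.manual_seed(0)
X = torch.rand(32, 9)                      # synthetic features
y = (X.sum(dim=1) > 4.5).float()           # synthetic binary labels
model = nn.Sequential(nn.Linear(9, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
opt = optim.SGD(model.parameters(), lr=0.001)
loader = DataLoader(TensorDataset(X, y), batch_size=2, shuffle=True)

history = train_with_loss(model, opt, nn.BCELoss(), loader, num_epochs=3)
print(history)
```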

In [41]:
dataset_test = WaterDataset(csv_file='./water_potability/water_test.csv')

data_loader_test = DataLoader(
    dataset_test,
    batch_size = 2,
    shuffle=True
)
# Ignore the code above (it's just for context)
########### Model evaluation ###########
acc = Accuracy(task="binary")

net.eval()  
with torch.no_grad():
    for features, labels in data_loader_test:

        output = net(features)
        preds = (output > 0.5).float()
        acc(preds, labels.view(-1, 1))

test_accuracy = acc.compute()
print(f"Test Accuracy: {test_accuracy}")

Test Accuracy: 0.5904572606086731


### Notes — Evaluation
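- Test `DataLoader` mirrors the training setup (same batch size and `shuffle=True`); shuffling is not needed for evaluation and does not change the accuracy.
- `net.eval()` switches layers such as dropout to inference behavior (this model has none), and `torch.no_grad()` disables gradient tracking.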
- Forward pass: `output` is probability; threshold at `0.5` to get binary predictions.
- Metric: `Accuracy(task="binary")` computes final accuracy; `acc.compute()` returns a scalar.
- Prints test accuracy for quick performance assessment.
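
To make the accumulate-then-compute pattern concrete, here is a plain-PyTorch sketch that counts correct predictions per batch and divides at the end. It illustrates the idea only and is not a description of `torchmetrics` internals; the helper `batch_accuracy_counts` and the two batches are invented.

```python
import torch

def batch_accuracy_counts(probs, labels, threshold=0.5):
    # Hypothetical helper: threshold probabilities and count matches with the labels.
    preds = (probs > threshold).float()
    correct = (preds == labels.view(-1, 1)).sum().item()
    return correct, labels.numel()

# Invented batches of (model probabilities, true labels).
batches = [
    (torch.tensor([[0.9], [0.2]]), torch.tensor([1.0, 0.0])),  # both correct
    (torch.tensor([[0.6], [0.7]]), torch.tensor([0.0, 1.0])),  # one correct
]

correct, total = 0, 0
for probs, labels in batches:
    c, n = batch_accuracy_counts(probs, labels)
    correct += c
    total += n

accuracy = correct / total
print(accuracy)  # 3 correct out of 4 -> 0.75
```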