# Baseline Evaluations for Damage Segmentation

This notebook establishes simple, non-learning baseline models for binary damage segmentation.  
Each baseline is evaluated per image using standard pixel-level classification metrics: IoU, F1-score, Precision, and Recall.

The baselines serve as reference points to assess whether a trained segmentation model is learning meaningful patterns beyond naive guessing or memorization.

## Included Baselines

- **All-Zeros Prediction**  
  Predicts all pixels as background (no damage). Serves as a conservative baseline that avoids false positives but misses all actual damage.

- **All-Ones Prediction**  
  Predicts all pixels as damage. Captures all damaged areas (high recall) but performs poorly on precision due to overwhelming false positives.

- **Mean Mask Prediction**  
  Averages all training masks pixel-wise, thresholds the result at 0.5, and uses that binary mask as a fixed prediction for every test image. Simulates a model that learns only where damage tends to occur, without seeing the input image.

Each strategy is evaluated on the preprocessed `.npz` files used by the training pipeline. Metrics are computed per mask and then averaged, so every image carries equal weight regardless of its size.
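The four metrics can all be derived from the true-positive, false-positive, and false-negative pixel counts of a single mask. The following is a minimal NumPy sketch of that computation, not the notebook's own code (which calls scikit-learn); `binary_metrics` is a hypothetical helper, and its `zero_division` argument is intended to mirror the `zero_division=0` setting used in the evaluation cells.

```python
import numpy as np


def binary_metrics(y_true, y_pred, zero_division=0.0):
    """Return IoU, F1, precision and recall for two binary masks."""
    y_true = np.asarray(y_true).astype(bool).ravel()
    y_pred = np.asarray(y_pred).astype(bool).ravel()

    tp = int(np.count_nonzero(y_true & y_pred))   # damage predicted as damage
    fp = int(np.count_nonzero(~y_true & y_pred))  # background predicted as damage
    fn = int(np.count_nonzero(y_true & ~y_pred))  # damage predicted as background

    def ratio(num, den):
        # Undefined ratios (empty denominator) fall back to zero_division.
        return num / den if den > 0 else zero_division

    return {
        "IoU": ratio(tp, tp + fp + fn),
        "F1-score": ratio(2 * tp, 2 * tp + fp + fn),
        "Precision": ratio(tp, tp + fp),
        "Recall": ratio(tp, tp + fn),
    }


# Toy 2x3 masks (illustrative values only): 2 TP, 1 FP, 1 FN.
truth = np.array([[0, 1, 1], [0, 0, 1]])
pred = np.array([[0, 1, 0], [1, 0, 1]])
example = binary_metrics(truth, pred)
```

For these toy masks the IoU is 2/4 and precision and recall are both 2/3. Working from raw counts also avoids several passes over the same arrays, which may matter for large masks.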


## Setup and Imports

In [1]:
import numpy as np
from glob import glob
from tqdm import tqdm
from sklearn.metrics import f1_score, precision_score, recall_score, jaccard_score

## Load Test File Paths

In [2]:
# Path to preprocessed test data
test_file_paths = sorted(glob("../../data/v4-preprocessed-npz/test/*.npz"))

## Compute Mean Training Mask

In [3]:
# Load training mask paths for mean mask computation
train_file_paths = sorted(glob("../../data/v4-preprocessed-npz/train/*.npz"))

# Compute mean mask across all training masks
accumulator = None
num_masks = 0

for path in tqdm(train_file_paths, desc="Computing mean training mask"):
    with np.load(path) as data:
        mask = data["mask"].astype(np.float32)
        if accumulator is None:
            accumulator = np.zeros_like(mask)
        accumulator += mask
        num_masks += 1

mean_mask = accumulator / num_masks
binary_mean_mask = (mean_mask > 0.5).astype(np.uint8)  # threshold at 50%

Computing mean training mask: 100%|██████████| 24205/24205 [07:41<00:00, 52.47it/s]


## Initialize Metric Lists

In [4]:
# For all-zeros prediction
ious_zeros, f1s_zeros, precisions_zeros, recalls_zeros = [], [], [], []

# For all-ones prediction
ious_ones, f1s_ones, precisions_ones, recalls_ones = [], [], [], []

# For mean-mask prediction
ious_mean, f1s_mean, precisions_mean, recalls_mean = [], [], [], []

## Evaluate Baselines Per File

In [5]:
# all_zeros and all_ones
for path in tqdm(test_file_paths, desc="Evaluating baselines per mask"):
    with np.load(path) as data:
        mask = data["mask"].astype(np.uint8).flatten()

    # All-zeros and all-ones predictions
    zeros = np.zeros_like(mask)
    ones = np.ones_like(mask)

    # All-zeros metrics
    ious_zeros.append(jaccard_score(mask, zeros, zero_division=0))
    f1s_zeros.append(f1_score(mask, zeros, zero_division=0))
    precisions_zeros.append(precision_score(mask, zeros, zero_division=0))
    recalls_zeros.append(recall_score(mask, zeros, zero_division=0))

    # All-ones metrics
    ious_ones.append(jaccard_score(mask, ones, zero_division=0))
    f1s_ones.append(f1_score(mask, ones, zero_division=0))
    precisions_ones.append(precision_score(mask, ones, zero_division=0))
    recalls_ones.append(recall_score(mask, ones, zero_division=0))

Evaluating baselines per mask: 100%|██████████| 5188/5188 [07:28<00:00, 11.56it/s]


In [6]:
# mean_mask
for path in tqdm(test_file_paths, desc="Evaluating mean mask baseline"):
    with np.load(path) as data:
        mask = data["mask"].astype(np.uint8)

    if mask.shape != binary_mean_mask.shape:
        continue  # skip mismatched shape cases

    y_true_flat = mask.flatten()
    y_pred_flat = binary_mean_mask.flatten()

    ious_mean.append(jaccard_score(y_true_flat, y_pred_flat, zero_division=0))
    f1s_mean.append(f1_score(y_true_flat, y_pred_flat, zero_division=0))
    precisions_mean.append(precision_score(y_true_flat, y_pred_flat, zero_division=0))
    recalls_mean.append(recall_score(y_true_flat, y_pred_flat, zero_division=0))

Evaluating mean mask baseline: 100%|██████████| 5188/5188 [03:30<00:00, 24.64it/s]
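The loop above silently skips any test mask whose shape differs from the mean mask's, so the number of masks behind the mean-mask averages is not recorded. A minimal sketch for keeping count, assuming in-memory arrays stand in for the `.npz` files; `split_by_shape` is a hypothetical helper that is not part of the notebook.

```python
import numpy as np


def split_by_shape(masks, reference_shape):
    """Return indices of masks that match reference_shape and of those that do not."""
    matched, skipped = [], []
    for index, mask in enumerate(masks):
        (matched if mask.shape == reference_shape else skipped).append(index)
    return matched, skipped


# Toy stand-ins for loaded test masks; the middle one has a different width.
masks = [
    np.zeros((4, 4), dtype=np.uint8),
    np.zeros((4, 6), dtype=np.uint8),
    np.ones((4, 4), dtype=np.uint8),
]
matched, skipped = split_by_shape(masks, reference_shape=(4, 4))
print(f"evaluated {len(matched)} of {len(masks)} masks, skipped {len(skipped)}")
```

Reporting the skipped count next to the metrics makes it clear whether the mean-mask averages cover the same images as the other two baselines.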


## Aggregate Metrics

In [7]:
metrics_all_zeros = {
    "IoU": np.mean(ious_zeros),
    "F1-score": np.mean(f1s_zeros),
    "Precision": np.mean(precisions_zeros),
    "Recall": np.mean(recalls_zeros),
}

metrics_all_ones = {
    "IoU": np.mean(ious_ones),
    "F1-score": np.mean(f1s_ones),
    "Precision": np.mean(precisions_ones),
    "Recall": np.mean(recalls_ones),
}

metrics_mean_mask = {
    "IoU": np.mean(ious_mean),
    "F1-score": np.mean(f1s_mean),
    "Precision": np.mean(precisions_mean),
    "Recall": np.mean(recalls_mean),
}
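Averaging per-mask scores gives every image equal weight, which is not the same as pooling pixel counts over the whole test set. The sketch below contrasts the two on toy masks for IoU only; the masks and the helper `iou_counts` are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def iou_counts(y_true, y_pred):
    """Return (intersection, union) pixel counts for two binary masks."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    intersection = int(np.count_nonzero(y_true & y_pred))
    union = int(np.count_nonzero(y_true | y_pred))
    return intersection, union


# Two toy test masks of different sizes: one lightly damaged, one heavily damaged.
truths = [np.array([1, 0, 0, 0]), np.array([1, 1, 1, 0, 1, 1, 1, 1])]
preds = [np.ones_like(t) for t in truths]  # all-ones baseline

counts = [iou_counts(t, p) for t, p in zip(truths, preds)]

# Per-mask average: each image contributes equally.
per_mask_iou = float(np.mean([i / u if u else 0.0 for i, u in counts]))

# Pooled value: each pixel contributes equally, so larger masks dominate.
pooled_iou = sum(i for i, _ in counts) / sum(u for _, u in counts)
```

Here the per-mask average is (0.25 + 0.875) / 2 = 0.5625, while the pooled value is 8/12. The notebook reports the per-mask form.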

## Display Comparison Results

In [8]:
print("=== Baseline Evaluation (Per-Mask Averaged) ===\n")

print("All-Zeros Prediction:")
for key, value in metrics_all_zeros.items():
    print(f"{key:>10}: {value:.4f}")

print("\nAll-Ones Prediction:")
for key, value in metrics_all_ones.items():
    print(f"{key:>10}: {value:.4f}")

print("\nMean Mask Prediction:")
for key, value in metrics_mean_mask.items():
    print(f"{key:>10}: {value:.4f}")

=== Baseline Evaluation (Per-Mask Averaged) ===

All-Zeros Prediction:
       IoU: 0.0000
  F1-score: 0.0000
 Precision: 0.0000
    Recall: 0.0000

All-Ones Prediction:
       IoU: 0.1418
  F1-score: 0.2340
 Precision: 0.1418
    Recall: 0.9030

Mean Mask Prediction:
       IoU: 0.0000
  F1-score: 0.0000
 Precision: 0.0000
    Recall: 0.0000


### Interpretation of Results

To assess the baseline performance of trivial segmentation strategies, we evaluated three naive prediction masks against the ground truth:

#### 1. All-Zeros Prediction
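The baseline predicts no damage at all.
- **IoU, F1-score, Precision, and Recall are all 0.0000.** With no predicted positives there are no true positives, and `zero_division=0` sets every undefined ratio to 0. Test masks that contain no damage therefore also score 0, even though the prediction is correct for them.
- This serves as a **lower bound** baseline for model performance.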

#### 2. All-Ones Prediction
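The baseline marks the entire image as damaged.
- **Recall (0.9030)** is 1 for every mask that contains damage and 0 (via `zero_division=0`) for every mask that contains none, so the average implies that about 90.3% of test masks contain at least one damaged pixel.
- **Precision (0.1418)** equals the damaged-pixel fraction of each mask, so on average about 14% of an image's pixels are damaged and the rest are false positives.
- **IoU (0.1418) matches precision** because an all-ones prediction produces no false negatives. **F1-score (0.2340)** stays low for the same reason: the prediction has no selectivity.
- This is the extreme end of the **tradeoff between recall and precision** when damage is overpredicted, and it gives a floor that any trained model should clearly exceed.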
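The all-ones case can be checked by hand on toy masks. A minimal sketch, in which `all_ones_scores` is a hypothetical helper and the masks are illustrative rather than taken from the dataset:

```python
import numpy as np


def all_ones_scores(mask, zero_division=0.0):
    """Precision, recall and IoU of an all-ones prediction against a binary mask."""
    mask = np.asarray(mask).astype(bool)
    tp = int(mask.sum())      # every damaged pixel is predicted as damage
    fp = int(mask.size - tp)  # every background pixel is a false positive
    fn = 0                    # nothing is ever predicted as background
    precision = tp / (tp + fp) if (tp + fp) else zero_division
    recall = tp / (tp + fn) if (tp + fn) else zero_division
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else zero_division
    return precision, recall, iou


damaged = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])  # 2 of 8 pixels damaged
empty = np.zeros((2, 4), dtype=np.uint8)          # no damage at all

p_d, r_d, iou_d = all_ones_scores(damaged)
p_e, r_e, iou_e = all_ones_scores(empty)

# Averaging recall over both masks gives the share of masks that contain damage.
mean_recall = (r_d + r_e) / 2
```

For the damaged mask, precision and IoU both equal 2/8 and recall is 1. For the empty mask, recall is undefined and falls back to 0, which is why the averaged recall sits below 1.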

#### 3. Mean Mask Prediction
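Every test image receives the same binary mask: the pixel-wise mean of the training masks, thresholded at 0.5.
- All metrics are **0.0000**, identical to the all-zeros baseline. The most likely explanation is that no pixel location is damaged in more than half of the training masks, so the thresholded mean mask is empty. Checking `binary_mean_mask.sum()` would confirm this.
- Test masks whose shape differs from the mean mask are skipped, so these averages may cover only part of the test set.
- If the mean mask is indeed empty, the result suggests that damage location varies too much across images for a fixed spatial prior to help at this threshold. A lower threshold would give a non-empty prediction and a more informative baseline.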
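A thresholded mean mask comes out empty whenever damage is sparse and not concentrated at fixed pixel locations. The toy construction below illustrates this under a deliberately simple assumption: each of 16 synthetic masks marks a different 2x2 block of an 8x8 grid. The grid size and layout are hypothetical, not properties of the dataset.

```python
import numpy as np

size, block = 8, 2

# Build one mask per 2x2 block so that damage never overlaps between masks.
masks = []
for row in range(0, size, block):
    for col in range(0, size, block):
        mask = np.zeros((size, size), dtype=np.float32)
        mask[row:row + block, col:col + block] = 1.0
        masks.append(mask)

# Same recipe as the notebook: average the masks, then threshold at 0.5.
mean_mask = np.mean(masks, axis=0)
binary_mean_mask = (mean_mask > 0.5).astype(np.uint8)

print("masks:", len(masks))
print("max of mean mask:", float(mean_mask.max()))
print("pixels above threshold:", int(binary_mean_mask.sum()))
```

Each pixel is damaged in exactly one of the 16 masks, so the mean is 1/16 everywhere and no pixel passes the 0.5 threshold. The resulting prediction is indistinguishable from the all-zeros baseline.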