# FORSMITH Roof Image Classifier —  YOLOv8m

**Objective:** Train a model that takes a roof image and predicts the correct `observation_id` (class) from the `forsmith_roof_labels.json` taxonomy.  
**Artifacts:** `model_best.pt`, `label_map.json`, `calibration.json`, `metrics.json`, ONNX export (optional), and an inference wrapper with optional **sheet-aware** masking.

> Dataset: 1,616 images. CSV columns required: `image_file`, `label`, `observation_id`. The filename contains `report_id` as `<report_id>_pageXX_imgY.png`, enabling **GroupKFold** by report to avoid leakage.

**Section 0 – Dependency Installs**

In [1]:
# %% [markdown]
# # Section 0 - Dependency Installs
# Ensures all required Python packages are available inside the environment.
# Torch 2.5.1 (CUDA 12.1) + Ultralytics YOLOv8 + Scikit-learn + Pandas + OpenCV.

# %% [code]
!pip uninstall -y numpy || true
!pip install --index-url https://download.pytorch.org/whl/cu121 torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
!pip install -q numpy<2.2 ultralytics==8.2.103 scikit-learn opencv-python tqdm matplotlib pandas

import torch, torchvision, torchaudio, numpy as np
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("numpy:", np.__version__)
print("CUDA available:", torch.cuda.is_available())


Found existing installation: numpy 2.1.2
Uninstalling numpy-2.1.2:
  Successfully uninstalled numpy-2.1.2
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting numpy (from torchvision==0.20.1)
  Using cached https://download.pytorch.org/whl/numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached https://download.pytorch.org/whl/numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
Installing collected packages: numpy
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ultralytics 8.2.103 requires numpy<2.0.0,>=1.23.0, but you have numpy 2.1.2 which is incompatible.
ydata-profiling 4.16.1 requires matplotlib<=3.10,>=3.5, but you have matplotlib 3.10.7 which is incompatible.[0m[31m
[0mSuccessfully installed numpy-2.1.2
/bin/bash: line 1: 2.2: No such file or directory
torc

**Section 1 – Dataset Load + Preprocessing**

In [5]:
from pathlib import Path

DATA_ROOT = Path("/home/jupyter/forsmith_roof_data")

CONFIG = {
    "DATA_ROOT": str(DATA_ROOT),
    "IMAGES_DIR": str(DATA_ROOT / "images"),
    "CSV_PATH": str(DATA_ROOT / "labels.csv"),
    "LABELS_JSON": str(DATA_ROOT / "forsmith_roof_labels.json"),

    "MODEL_NAME": "yolov8m-cls.pt",
    "EPOCHS": 60,
    "BATCH_SIZE": 32,
    "IMAGE_SIZE": 512,
    "LEARNING_RATE": 5e-4,
    "PATIENCE": 10,
    "OPTIMIZER": "Adam",
    "LOSS_TYPE": "cross_entropy",
    "DROPOUT": 0.3,

    "N_SPLITS": 5,
    "FOLD_INDEX": 0,
    "SEED": 1337,

    "OUT_DIR": str(DATA_ROOT / "outputs" / "yolov8m"),
    "SAVE_ON_BEST": "val_acc",
}

print("config ready")

config ready


**Section 2 – Dataset Load + Preprocessing**

In [6]:
# ===============================
# 2) DATASET LOAD + PREPROCESSING
# ===============================
import os, json, shutil
import pandas as pd
from sklearn.model_selection import GroupKFold

# Load labels CSV
df = pd.read_csv(CONFIG["CSV_PATH"])
df["image_path"] = df["image_file"].apply(
    lambda x: os.path.join(CONFIG["IMAGES_DIR"], x)
)

# Encode observation_id → numeric class
obs_to_id = {obs: i for i, obs in enumerate(sorted(df["observation_id"].unique()))}
df["obs_id"] = df["observation_id"].map(obs_to_id)

# Print summary
print("Top 5 label counts:")
print(df["observation_id"].value_counts().head())

# GroupKFold by report prefix (e.g., 18-053)
groups = df["image_file"].str.split("_").str[0]
gkf = GroupKFold(n_splits=CONFIG["N_SPLITS"])
train_idx, val_idx = next(gkf.split(df, groups=groups))
train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]

print(f"\nTrain: {len(train_df)} | Val: {len(val_df)} | Classes: {len(obs_to_id)}")

# ===== PREVIEW DATASETS =====
print("\nFull dataframe sample:")
print(df.head(5))

print("\nTraining set sample:")
print(train_df.head(5))

print("\nValidation set sample:")
print(val_df.head(5))

# Optional sanity check — ensure image paths exist
missing_imgs = df[~df["image_path"].apply(os.path.exists)]
print(f"\nMissing images: {len(missing_imgs)}")
if len(missing_imgs) > 0:
    print(missing_imgs.head())


Top 5 label counts:
observation_id
1.01.05    126
1.10.03     89
1.06.04     78
1.10.04     69
1.07.01     66
Name: count, dtype: int64

Train: 1292 | Val: 324 | Classes: 121

Full dataframe sample:
                  image_file                                   label  \
0  18-053-12_page12_img2.png                    Unprotected Openings   
1  18-053-12_page17_img3.png             Redundant roof penetrations   
2   23-023R1_page20_img2.png                     Subsurface Moisture   
3    23-023R1_page5_img3.png  Conduit Penetration Through Mech. Unit   
4     21-009_page18_img2.png  Conduit Penetration Through Mech. Unit   

  observation_id  confidence  \
0        2.11.02       0.502   
1        2.06.01       0.504   
2        2.12.01       0.505   
3        2.04.04       0.506   
4        2.04.04       0.506   

                                          image_path  obs_id  
0  /home/jupyter/forsmith_roof_data/images/18-053...      70  
1  /home/jupyter/forsmith_roof_data/images/18-053

**Section 3 – YOLO Folder Structure & YAML**

In [7]:
# ==============================
# 3) YOLO DATASET STRUCTURE (FIXED)
# ==============================
from pathlib import Path
import shutil, os

# Root paths
root = Path(CONFIG["DATA_ROOT"])
split_root = root / "yolo_split"
train_dir, val_dir = split_root / "train", split_root / "val"

# 🧹 Clean up old folders (if they exist)
if split_root.exists():
    shutil.rmtree(split_root)
train_dir.mkdir(parents=True, exist_ok=True)
val_dir.mkdir(parents=True, exist_ok=True)

# 🏗️ Create class folders and copy images
for split_name, split_df in [("train", train_df), ("val", val_df)]:
    split_path = train_dir if split_name == "train" else val_dir
    for _, row in split_df.iterrows():
        cls_dir = split_path / str(row["obs_id"])
        cls_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(row["image_path"], cls_dir)

print("✅ YOLO directory structure created!")
print("Train samples:", sum(len(files) for _, _, files in os.walk(train_dir)))
print("Val samples:", sum(len(files) for _, _, files in os.walk(val_dir)))

# 🧾 Create YAML file OUTSIDE yolo_split folder
yaml_path = root / "roof_yolov8.yaml"

with open(yaml_path, "w") as f:
    f.write(f"train: {train_dir.resolve()}\n")
    f.write(f"val: {val_dir.resolve()}\n")
    f.write("names:\n")
    for i, obs in enumerate(sorted(obs_to_id.keys())):
        f.write(f"  {i}: '{obs}'\n")

# ✅ Verify YAML and paths
print("\n✅ YAML file written at:", yaml_path)
print("\n--- YAML CONTENTS ---")
print(open(yaml_path).read())

print("\nTrain folder exists:", os.path.isdir(train_dir))
print("Val folder exists:", os.path.isdir(val_dir))
print("Example classes:", os.listdir(train_dir)[:5])


✅ YOLO directory structure created!
Train samples: 1292
Val samples: 324

✅ YAML file written at: /home/jupyter/forsmith_roof_data/roof_yolov8.yaml

--- YAML CONTENTS ---
train: /home/jupyter/forsmith_roof_data/yolo_split/train
val: /home/jupyter/forsmith_roof_data/yolo_split/val
names:
  0: '1.01.01'
  1: '1.01.02'
  2: '1.01.03'
  3: '1.01.05'
  4: '1.02.01'
  5: '1.02.03'
  6: '1.02.04'
  7: '1.02.05'
  8: '1.02.06'
  9: '1.02.07'
  10: '1.03.01'
  11: '1.03.02'
  12: '1.03.03'
  13: '1.03.04'
  14: '1.04.02'
  15: '1.04.03'
  16: '1.04.04'
  17: '1.04.05'
  18: '1.04.06'
  19: '1.05.01'
  20: '1.05.02'
  21: '1.05.03'
  22: '1.06.01'
  23: '1.06.02'
  24: '1.06.03'
  25: '1.06.04'
  26: '1.06.06'
  27: '1.07.01'
  28: '1.07.02'
  29: '1.07.03'
  30: '1.07.04'
  31: '1.07.05'
  32: '1.07.06'
  33: '1.07.07'
  34: '1.08.01'
  35: '1.08.02'
  36: '1.08.03'
  37: '1.08.04'
  38: '1.09.01'
  39: '1.09.02'
  40: '1.09.03'
  41: '1.09.04'
  42: '1.09.05'
  43: '1.10.01'
  44: '1.10.02'
  

In [8]:
# ==============================
# 3.5) PyTorch Dataset & Dataloaders (Fixed mapping + Strong Augmentation for Small Data)
# ==============================
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.datasets.folder import default_loader, IMG_EXTENSIONS
from pathlib import Path

train_root = "/home/jupyter/forsmith_roof_data/yolo_split/train"
val_root   = "/home/jupyter/forsmith_roof_data/yolo_split/val"

# 1️⃣ Build a fixed, global class mapping from the training set
all_classes = sorted([d.name for d in Path(train_root).iterdir() if d.is_dir()])
class_to_idx = {c: i for i, c in enumerate(all_classes)}

def make_fixed_samples(root, class_to_idx):
    samples = []
    for cls in all_classes:  # enforce fixed order
        d = Path(root) / cls
        if not d.is_dir():
            continue
        for p in d.rglob("*"):
            if p.suffix.lower() in IMG_EXTENSIONS:
                samples.append((str(p), class_to_idx[cls]))
    return samples

# 2️⃣ Dataset subclass that reuses the fixed mapping
class FixedImageFolder(datasets.ImageFolder):
    def __init__(self, root, transform, samples, class_to_idx, classes):
        super().__init__(root, transform=transform)
        self.class_to_idx = class_to_idx
        self.classes = classes
        self.samples = samples
        self.targets = [s[1] for s in samples]

# 3️⃣ Strong augmentations (tuned for small roof datasets)
# Includes geometric + color + erasing + perspective
train_transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.RandomApply([
        transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
        transforms.RandomGrayscale(p=0.1),
    ], p=0.9),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(25),
    transforms.RandomResizedCrop(512, scale=(0.6, 1.0), ratio=(0.8, 1.25)),
    transforms.RandomPerspective(distortion_scale=0.5, p=0.5),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=(3, 3), sigma=(0.1, 2.0))], p=0.3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.15))
])

# Validation transform (no augmentation)
val_transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# 4️⃣ Build datasets
train_samples = make_fixed_samples(train_root, class_to_idx)
val_samples   = make_fixed_samples(val_root, class_to_idx)

train_data = FixedImageFolder(train_root, train_transform, train_samples, class_to_idx, all_classes)
val_data   = FixedImageFolder(val_root, val_transform, val_samples, class_to_idx, all_classes)

# 5️⃣ DataLoaders
train_loader = DataLoader(train_data, batch_size=32, shuffle=True,  num_workers=4, pin_memory=True)
val_loader   = DataLoader(val_data,   batch_size=32, shuffle=False, num_workers=4, pin_memory=True)

# 6️⃣ Verify setup
print("✅ DataLoaders ready with strong augmentations and fixed class mapping.")
print(f"Train images: {len(train_data)} | Val images: {len(val_data)}")
print(f"Classes detected: {len(train_data.classes)}")
print("Example classes:", train_data.classes[:5])


✅ DataLoaders ready with strong augmentations and fixed class mapping.
Train images: 1292 | Val images: 315
Classes detected: 114
Example classes: ['0', '1', '10', '100', '101']


In [9]:
print(train_data.classes == val_data.classes)

print(train_data.class_to_idx == val_data.class_to_idx)



True
True


**Section 4 – Train YOLOv8m**

In [None]:
# ================================
# 4) YOLOv8 BACKBONE + TRAIN LOOP
#    (Dual-Axis Graph + Full Epoch Logging, EMA, Cosine LR)
# ================================
from ultralytics import YOLO
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm
import matplotlib.pyplot as plt
from IPython.display import display
import yaml, os, copy
from math import cos, pi

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Load pretrained YOLOv8 backbone
yolo_model = YOLO(CONFIG["MODEL_NAME"])
backbone = yolo_model.model.model[:-1]  # strip YOLO detection head

# 2) Freeze backbone first
for p in backbone.parameters():
    p.requires_grad = False

# 2.5) Partially unfreeze top layers (tune as needed)
for name, p in backbone.named_parameters():
    if any(k in name.lower() for k in ["stage4", "stage5", "sppf", "head", "neck", "c5", "p4", "p5"]):
        p.requires_grad = True

# Reactivate BatchNorm stats updating
for m in backbone.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.train()

# 3) Classification head
yaml_path = "/home/jupyter/forsmith_roof_data/roof_yolov8.yaml"
with open(yaml_path, "r") as f:
    names_yaml = yaml.safe_load(f)
num_classes = len(names_yaml["names"])
print(f"Detected {num_classes} classes from YAML.")

classifier_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(768, num_classes)  # 768 works for yolov8m-cls backbone output
)

model = nn.Sequential(backbone, classifier_head).to(device)

# 4) Optimizer with two parameter groups
optimizer = optim.Adam([
    {"params": [p for p in backbone.parameters() if p.requires_grad], "lr": CONFIG["LEARNING_RATE"] * 0.2},
    {"params": classifier_head.parameters(), "lr": CONFIG["LEARNING_RATE"]}
])
epochs = CONFIG["EPOCHS"]

# 5) Cosine warmup scheduler (step per batch)
def cosine_warmup_scheduler(optimizer, total_steps, warmup_steps):
    def lr_lambda(step):
        if step < warmup_steps:
            return float(step + 1) / float(max(1, warmup_steps))
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + cos(pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

total_steps = len(train_loader) * epochs
warmup_steps = int(0.03 * total_steps)
scheduler = cosine_warmup_scheduler(optimizer, total_steps, warmup_steps)

# 6) EMA
ema_decay = 0.999
ema_model = copy.deepcopy(model).to(device)
for p in ema_model.parameters():
    p.requires_grad = False

def update_ema(ema_m, m, decay):
    with torch.no_grad():
        for (k, v_ema) in ema_m.state_dict().items():
            v = m.state_dict()[k]
            v_ema.copy_(decay * v_ema + (1.0 - decay) * v)

# Early stopping & checkpoints
patience = CONFIG.get("PATIENCE", 10)
best_acc, best_epoch = 0.0, 0
save_dir = CONFIG["OUT_DIR"]
os.makedirs(save_dir, exist_ok=True)
best_path = os.path.join(save_dir, "best_yolo_csv.pt")
ckpt_path = os.path.join(save_dir, "checkpoint_latest.pt")

# Sanity check: trainable vs frozen
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"Trainable params: {trainable:,} | Frozen params: {frozen:,}")

# Metric history
history = {"epoch": [], "train_loss": [], "train_acc": [], "val_acc": []}

# Prepare a persistent display handle for the live chart (won't clear text logs)
plot_handle = display(None, display_id=True)
print("\n📊 Epoch Progress Log\n" + "-" * 72)

# ======================
# TRAINING LOOP
# ======================
for epoch in range(epochs):
    model.train()
    total_loss, correct, total = 0.0, 0, 0

    for imgs, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}"):
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(imgs)

        # optional: label smoothing to reduce overconfidence on noisy labels
        loss = F.cross_entropy(outputs, labels, label_smoothing=0.05)
        loss.backward()

        # gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)

        optimizer.step()
        scheduler.step()
        update_ema(ema_model, model, ema_decay)

        total_loss += loss.item()
        correct += (outputs.argmax(1) == labels).sum().item()
        total += labels.size(0)

    train_acc = correct / max(1, total)
    avg_loss = total_loss / max(1, len(train_loader))

    # Validation (use EMA weights)
    ema_model.eval()
    val_correct, val_total = 0, 0
    with torch.no_grad():
        for imgs, labels in val_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            preds = ema_model(imgs)
            val_correct += (preds.argmax(1) == labels).sum().item()
            val_total += labels.size(0)
    val_acc = val_correct / max(1, val_total)

    # Checkpoints / early stopping
    if val_acc > best_acc:
        best_acc = val_acc
        best_epoch = epoch
        torch.save(ema_model.state_dict(), best_path)
    elif epoch - best_epoch >= patience:
        print(f"⏹️ Early stopping at epoch {epoch+1}")
        break

    torch.save({
        "epoch": epoch + 1,
        "model": ema_model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "best_acc": best_acc
    }, ckpt_path)

    # Log history
    history["epoch"].append(epoch + 1)
    history["train_loss"].append(avg_loss)
    history["train_acc"].append(train_acc)
    history["val_acc"].append(val_acc)

    # ---- Print line for this epoch (kept; not cleared) ----
    current_lr = scheduler.get_last_lr()[0]
    print(f"Epoch {epoch+1:03d}/{epochs} | "
          f"LR: {current_lr:.2e} | "
          f"Train Loss: {avg_loss:.4f} | "
          f"Train Acc: {train_acc*100:.2f}% | "
          f"Val Acc: {val_acc*100:.2f}% | "
          f"Best: {best_acc*100:.2f}%")

    # ---- Update live dual-axis plot without clearing logs ----
    fig, ax1 = plt.subplots(figsize=(8, 4))

    ax1.set_xlabel("Epoch")
    ax1.set_ylabel("Loss", color="tab:red")
    ax1.plot(history["epoch"], history["train_loss"], color="tab:red", linewidth=2, label="Train Loss")
    ax1.tick_params(axis="y", labelcolor="tab:red")

    ax2 = ax1.twinx()
    ax2.set_ylabel("Accuracy (%)", color="tab:blue")
    ax2.plot(history["epoch"], [a * 100 for a in history["train_acc"]],
             color="tab:blue", linewidth=2, label="Train Acc")
    ax2.plot(history["epoch"], [a * 100 for a in history["val_acc"]],
             color="tab:green", linewidth=2, label="Val Acc")
    ax2.tick_params(axis="y", labelcolor="tab:blue")

    plt.title(f"Training Progress (Best Val Acc: {best_acc*100:.2f}%)")
    # combine legends from both axes
    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    fig.legend(lines1 + lines2, labels1 + labels2,
               loc="upper center", bbox_to_anchor=(0.5, -0.10), ncol=3)
    fig.tight_layout()

    # show/update plot in place, but keep text output intact
    plot_handle.update(fig)
    plt.close(fig)

# ======================
# FINAL RESULTS
# ======================
plot_path = os.path.join(save_dir, "training_curve.png")

fig, ax1 = plt.subplots(figsize=(8, 4))
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss", color="tab:red")
ax1.plot(history["epoch"], history["train_loss"], color="tab:red", linewidth=2, label="Train Loss")
ax1.tick_params(axis="y", labelcolor="tab:red")

ax2 = ax1.twinx()
ax2.set_ylabel("Accuracy (%)", color="tab:blue")
ax2.plot(history["epoch"], [a * 100 for a in history["train_acc"]],
         color="tab:blue", linewidth=2, label="Train Acc")
ax2.plot(history["epoch"], [a * 100 for a in history["val_acc"]],
         color="tab:green", linewidth=2, label="Val Acc")
ax2.tick_params(axis="y", labelcolor="tab:blue")

plt.title(f"Final Training Curve (Best Val Acc: {best_acc*100:.2f}%)")
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
fig.legend(lines1 + lines2, labels1 + labels2,
           loc="upper center", bbox_to_anchor=(0.5, -0.10), ncol=3)
fig.tight_layout()
plt.savefig(plot_path, bbox_inches="tight")
plt.show()

print(f"\n📈 Saved final training curve to: {plot_path}")
print(f"💾 Latest checkpoint: {ckpt_path}")
print(f"🎯 Training complete! Best Validation Accuracy: {best_acc:.4f}")
print("✅ Best model saved to:", best_path)


Detected 121 classes from YAML.
Trainable params: 93,049 | Frozen params: 14,786,736


None


📊 Epoch Progress Log
------------------------------------------------------------------------


Epoch 1/60:  37%|███▋      | 15/41 [03:24<05:28, 12.63s/it]