
feat(notebooks): add 04_model_validation benchmarking notebook#48

Merged: Goldokpa merged 1 commit into develop from feature/notebook-04-model-validation on May 5, 2026


@franchaise (Collaborator)

Summary

  • Adds notebooks/04_model_validation.ipynb — validates segmentation predictions against reference masks and the biomass regressor against held-out labels across Amazon, Congo, and Southeast Asia.
  • Computes per-tile IoU, F1, precision, recall, and accuracy via a confusion-matrix helper, plus per-region regression metrics (RMSE, MAE, R^2, MAPE).
  • Aggregates per-region and mean values into outputs/validation/benchmark_report.json — the single artifact consumed by scripts/governance_ci_gate.py and governance.model_card.build_model_card.
  • Falls back to synthetic tiles when real ground-truth data is not present, so the notebook stays runnable in CI.
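
For reviewers who want to sanity-check the segmentation numbers, the per-tile metrics listed above can be derived from four confusion-matrix counts. The following is a minimal sketch, not the notebook's actual helper: the function names `confusion_counts` and `segmentation_metrics` are hypothetical, masks are assumed to be flat binary sequences (1 = positive class), and empty denominators are assumed to map to 0.0.

```python
def confusion_counts(pred, ref):
    """Count (tp, fp, fn, tn) for two equal-length binary sequences."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    tp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(pred, ref) if p == 0 and r == 1)
    tn = sum(1 for p, r in zip(pred, ref) if p == 0 and r == 0)
    return tp, fp, fn, tn


def _safe_div(num, den):
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    return num / den if den else 0.0


def segmentation_metrics(tp, fp, fn, tn):
    """Derive IoU, F1, precision, recall, and accuracy from the counts."""
    return {
        "iou": _safe_div(tp, tp + fp + fn),
        "f1": _safe_div(2 * tp, 2 * tp + fp + fn),
        "precision": _safe_div(tp, tp + fp),
        "recall": _safe_div(tp, tp + fn),
        "accuracy": _safe_div(tp + tn, tp + fp + fn + tn),
    }


if __name__ == "__main__":
    counts = confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
    print(segmentation_metrics(*counts))
```

Working from counts rather than raw masks keeps the metric formulas independent of how a tile is loaded, so the same function would serve real and synthetic tiles alike.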

Why

Sprint deliverable: "Write 04_model_validation.ipynb with baseline validation results."

Test plan

  • Valid notebook JSON.
  • Runs top-to-bottom on synthetic data via papermill without errors.

Notes for reviewers

Validates segmentation predictions against reference masks and the
biomass regressor against held-out labels across Amazon, Congo, and
Southeast Asia. Computes IoU, F1, precision, recall, accuracy, and
the regression metrics RMSE/MAE/R^2/MAPE. Aggregates per-region and
mean values into a single benchmark_report.json that the governance
CI gate and the model-card generator consume directly.
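
As an illustration of the regression side and the aggregation step, here is a rough sketch under stated assumptions. The layout of benchmark_report.json is not shown in this PR, so the `{"regions": ..., "mean": ...}` shape, the function names `regression_metrics` and `build_report`, and the choice to skip zero-valued labels in MAPE are all hypothetical.

```python
import json
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R^2, and MAPE (in percent) for paired values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Assumption: labels equal to zero are skipped in MAPE to avoid division by zero.
    ratios = [abs(e / t) for t, e in zip(y_true, errors) if t != 0]
    r2 = (1.0 - ss_res / ss_tot) if ss_tot else 0.0
    mape = (100.0 * sum(ratios) / len(ratios)) if ratios else 0.0
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": r2,
        "mape": mape,
    }


def build_report(per_region):
    """Attach an unweighted mean over regions (hypothetical report layout)."""
    keys = next(iter(per_region.values())).keys()
    mean = {
        k: sum(m[k] for m in per_region.values()) / len(per_region) for k in keys
    }
    return {"regions": per_region, "mean": mean}


if __name__ == "__main__":
    report = build_report({
        "amazon": regression_metrics([100, 200, 300], [110, 190, 300]),
        "congo": regression_metrics([100, 200, 300], [120, 180, 300]),
    })
    print(json.dumps(report, indent=2))
```

The mean here is an unweighted average over regions; a tile-weighted mean would be an equally plausible choice, and the PR text does not say which one the notebook uses.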

@obielin (Collaborator) left a comment


Happy with this from the governance angle — the per-region table is exactly the shape build_model_card consumes for the Fairness section once we attach a region-level disparity metric on top.

One thought for a follow-up (not for this PR): the synthetic confusion matrix is built with a uniform 8% disagreement rate, so the per-region metrics look identical. When real ground-truth tiles land we'll see actual divergence. Approving.
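
To make the point about the uniform rate concrete, the sketch below builds a synthetic confusion matrix from a disagreement rate and shows that the same rate yields the same metrics for every region. It is only an illustration: `synthetic_confusion`, the 50% positive fraction, the tile size, and the even split of disagreements between false positives and false negatives are assumptions, not what the notebook does.

```python
REGIONS = ["Amazon", "Congo", "Southeast Asia"]


def synthetic_confusion(n_pixels, positive_fraction, disagreement_rate):
    """Hypothetical generator: split disagreements evenly between FN and FP."""
    n_pos = round(n_pixels * positive_fraction)
    n_neg = n_pixels - n_pos
    n_wrong = round(n_pixels * disagreement_rate)
    fn = n_wrong // 2
    fp = n_wrong - fn
    return {"tp": n_pos - fn, "fp": fp, "fn": fn, "tn": n_neg - fp}


def iou(cm):
    return cm["tp"] / (cm["tp"] + cm["fp"] + cm["fn"])


def accuracy(cm):
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


def per_region_metrics(rates):
    """Map region name to (IoU, accuracy) for a given per-region rate."""
    out = {}
    for region, rate in rates.items():
        cm = synthetic_confusion(10_000, 0.5, rate)
        out[region] = (iou(cm), accuracy(cm))
    return out


if __name__ == "__main__":
    # Uniform 8% rate: every region gets the same numbers.
    print(per_region_metrics({r: 0.08 for r in REGIONS}))
    # Illustrative per-region rates: the metrics now diverge.
    print(per_region_metrics(dict(zip(REGIONS, [0.06, 0.08, 0.11]))))
```

Under these assumptions accuracy is simply one minus the disagreement rate, which is why a single shared rate cannot produce any regional spread.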

@Goldokpa Goldokpa merged commit 833c623 into develop May 5, 2026