feat(notebooks): add 04_model_validation benchmarking notebook by franchaise · Pull Request #48 · Climate-Vision/ClimateVision

franchaise · 2026-05-05T13:05:17Z

Summary

Adds notebooks/04_model_validation.ipynb — validates segmentation predictions against reference masks and the biomass regressor against held-out labels across Amazon, Congo, and Southeast Asia.
Computes per-tile IoU, F1, precision, recall, accuracy via a confusion-matrix helper, plus per-region regression metrics (RMSE / MAE / R^2 / MAPE).
Aggregates per-region and mean values into outputs/validation/benchmark_report.json — the single artifact consumed by scripts/governance_ci_gate.py and governance.model_card.build_model_card.
Falls back to synthetic tiles when real ground-truth data is not present, so the notebook stays runnable in CI.
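
For reviewers who have not opened the notebook, the per-tile segmentation metrics listed above can be sketched as follows. This is a minimal illustration, not the notebook's actual helper: the names `Confusion`, `confusion_from_masks`, and `tile_metrics` are hypothetical, masks are assumed to be flattened binary sequences, and plain Python is used only to keep the sketch dependency-free.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Confusion:
    """Pixel counts for one binary tile (hypothetical container)."""
    tp: int
    fp: int
    fn: int
    tn: int


def confusion_from_masks(pred: Sequence[int], ref: Sequence[int]) -> Confusion:
    """Count pixel-wise agreement between two flattened binary masks."""
    if len(pred) != len(ref):
        raise ValueError("prediction and reference masks differ in size")
    tp = fp = fn = tn = 0
    for p, r in zip(pred, ref):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif r:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def _ratio(num: float, den: float) -> float:
    # Assumption: an empty denominator yields 0.0 rather than raising.
    return num / den if den else 0.0


def tile_metrics(c: Confusion) -> Dict[str, float]:
    """Derive IoU, F1, precision, recall and accuracy from the counts."""
    return {
        "iou": _ratio(c.tp, c.tp + c.fp + c.fn),
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "recall": _ratio(c.tp, c.tp + c.fn),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn),
    }
```

Deriving every metric from the same four counts keeps them mutually consistent for a tile; how the notebook handles empty denominators is not stated in this PR, so the 0.0 fallback is only one possible choice.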

Why

Sprint deliverable: "Write 04_model_validation.ipynb with baseline validation results."

Test plan

Valid notebook JSON.
Runs top-to-bottom on synthetic data via papermill without errors.
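
The two checks in the test plan could be reproduced locally along these lines. This is a sketch under assumptions: `looks_like_notebook` is a hypothetical, shallow check of the top-level nbformat v4 keys (not full schema validation), and the output path passed to `run_notebook` is whatever the caller chooses. `papermill.execute_notebook` is the library's documented entry point and is imported lazily so the structural check works without it installed.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Top-level keys required by the nbformat v4 schema.
REQUIRED_KEYS = ("cells", "metadata", "nbformat", "nbformat_minor")


def looks_like_notebook(doc: Any) -> bool:
    """Shallow structural check; does not validate individual cells."""
    return (
        isinstance(doc, dict)
        and all(key in doc for key in REQUIRED_KEYS)
        and isinstance(doc["cells"], list)
    )


def load_notebook(path: str) -> Dict[str, Any]:
    """Parse the .ipynb file as JSON and fail loudly if it is malformed."""
    doc = json.loads(Path(path).read_text(encoding="utf-8"))
    if not looks_like_notebook(doc):
        raise ValueError(f"{path} is missing required notebook keys")
    return doc


def run_notebook(src: str, dst: str) -> None:
    """Execute the notebook top to bottom; raises if any cell errors."""
    import papermill as pm  # third-party, only needed for execution

    pm.execute_notebook(src, dst)
```

For example, `load_notebook("notebooks/04_model_validation.ipynb")` followed by `run_notebook(...)` with an output path of your choice covers both items.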

Notes for reviewers

Imports validate_predictions from analytics.validation (already on develop) and BiomassRegressor from #46 (feat(models): add biomass and carbon-stock regression module).
The benchmark report shape is documented in the closing cell so downstream consumers know what to expect.
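
The aggregation into a single report could look roughly like the sketch below. The actual report shape is documented in the notebook's closing cell and is not reproduced in this PR description, so the `"regions"` and `"mean"` keys and the helper names `build_report` and `write_report` are purely illustrative assumptions.

```python
import json
from pathlib import Path
from statistics import fmean
from typing import Any, Dict

Metrics = Dict[str, float]


def build_report(per_region: Dict[str, Metrics]) -> Dict[str, Any]:
    """Combine per-region metrics with their unweighted mean.

    The {"regions": ..., "mean": ...} layout is a hypothetical shape.
    """
    if not per_region:
        raise ValueError("at least one region is required")
    metric_names = sorted({name for m in per_region.values() for name in m})
    mean = {
        name: fmean(m[name] for m in per_region.values() if name in m)
        for name in metric_names
    }
    return {"regions": per_region, "mean": mean}


def write_report(report: Dict[str, Any], path: str) -> None:
    """Write the report as stable, human-readable JSON."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(report, indent=2, sort_keys=True), encoding="utf-8")
```

An unweighted mean treats every region equally regardless of tile count; whether the notebook weights by tile count is not stated here.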

Validates segmentation predictions against reference masks and the biomass regressor against held-out labels across Amazon, Congo, and Southeast Asia. Computes IoU, F1, precision, recall, accuracy, and the regression metrics RMSE/MAE/R^2/MAPE. Aggregates per-region and mean values into a single benchmark_report.json that the governance CI gate and the model-card generator consume directly.
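
The four regression metrics named above follow their standard definitions, sketched here for reference. `regression_metrics` is a hypothetical helper name, and two conventions are assumptions rather than confirmed notebook behaviour: zero-valued labels are skipped when computing MAPE, and R^2 is reported as NaN when the labels have no variance.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Dict[str, float]:
    """Compute RMSE, MAE, R^2 and MAPE (in percent) for one region."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    # Assumption: labels equal to zero are excluded from MAPE.
    pct = [abs(e / t) for t, e in zip(y_true, errors) if t != 0]

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
        "mape": 100.0 * sum(pct) / len(pct) if pct else float("nan"),
    }
```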

obielin

Happy with this from the governance angle — the per-region table is exactly the shape build_model_card consumes for the Fairness section once we attach a region-level disparity metric on top.

One thought for a follow-up (not for this PR): the synthetic confusion matrix is built with a uniform 8% disagreement rate, so the per-region metrics look identical. When real ground-truth tiles land we'll see actual divergence. Approving.
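
To illustrate the point about identical per-region numbers, here is a small sketch of why a uniform disagreement rate flattens the table. Only the 8% rate comes from the comment above; the tile size, the 30% positive fraction, the even split of disagreements between false positives and false negatives, and the function names are assumptions made for the example.

```python
from typing import Dict, Tuple


def synthetic_confusion(
    n_pixels: int, positive_fraction: float, disagreement: float
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for a synthetic tile.

    Assumption: disagreements are split evenly between fp and fn.
    """
    wrong = round(n_pixels * disagreement)
    fp = wrong // 2
    fn = wrong - fp
    positives = round(n_pixels * positive_fraction)
    tp = positives - fn
    tn = n_pixels - positives - fp
    return tp, fp, fn, tn


def iou(tp: int, fp: int, fn: int) -> float:
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


REGIONS = ("amazon", "congo", "southeast_asia")

# Every region receives the same synthetic inputs, so the metric cannot differ.
per_region_iou: Dict[str, float] = {
    region: iou(*synthetic_confusion(10_000, 0.30, 0.08)[:3])
    for region in REGIONS
}
```

Because the inputs do not depend on the region, every entry in `per_region_iou` is the same value; varying the rate or positive fraction per region would be one way to exercise a disparity metric before real tiles land.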

franchaise requested a review from Goldokpa as a code owner May 5, 2026 13:05

franchaise mentioned this pull request May 5, 2026

feat(notebooks): add 05_impact_reporting stakeholder template #49

Merged

1 task

obielin approved these changes May 5, 2026

Goldokpa merged commit 833c623 into develop May 5, 2026

Goldokpa merged 1 commit into develop from feature/notebook-04-model-validation

3 participants