feat(governance): add calibration metrics for segmentation confidence #51
Merged
Hopelynconsult merged 1 commit into develop on May 7, 2026
Conversation
ECE, MCE, Brier score, and reliability-bin computation for binary segmentation outputs. Threshold-driven NGO alerts depend on calibrated confidence: a model that says 0.9 should be right 90% of the time, and miscalibration translates directly into missed events or false alarms. The `CalibrationReport` dataclass slots into the existing model card generator and release CI gate. Pure numpy at evaluation time, no torch.

- `ReliabilityBin` / `CalibrationReport` dataclasses with JSON serialisation
- `evaluate_calibration()` one-shot entrypoint
- `write_calibration_report()` for persistence alongside model cards
- 12 tests covering perfect/overconfident calibration, edge cases, input validation, and round-trip JSON
Summary
- `governance/calibration.py` with ECE, MCE, Brier score, and reliability-bin computation for binary segmentation outputs.
- `CalibrationReport` dataclass that serialises to JSON, ready to be attached alongside the existing Mitchell-style model cards (feat(governance): add automated model card generator #37) and consumed by the release CI gate (feat(governance): add release CI gate for metrics, fairness, and security #39).

Why calibration matters here
ClimateVision dispatches NGO alerts based on confidence thresholds. A model that reports 0.9 confidence should be correct ~90% of the time; if it isn't, every threshold downstream is silently wrong, producing either missed events (deforestation, flood) or false alarms that erode NGO trust. We currently track accuracy/IoU/F1 but never measure whether the confidence we surface is meaningful. This closes that gap with the standard reliability-diagram metrics.
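Concretely, ECE here is the support-weighted mean gap between per-bin confidence and accuracy: ECE = Σ_b (n_b / N) · |acc(b) − conf(b)|. A minimal numpy sketch of that bin computation, for reviewers who want the shape of the metric (illustrative only; the module's actual `expected_calibration_error()` internals aren't shown in this PR):

```python
import numpy as np

def ece_sketch(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """Support-weighted ECE over equal-width bins (sketch, not the shipped code)."""
    # Equal-width bins on [0, 1]; a probability of exactly 1.0 folds into the last bin.
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue                          # empty bins contribute nothing
        conf = probs[mask].mean()             # mean predicted confidence in the bin
        acc = labels[mask].mean()             # empirical accuracy in the bin
        ece += mask.mean() * abs(acc - conf)  # gap weighted by bin support n_b / N
    return float(ece)
```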
What's in the PR
- `ReliabilityBin` and `CalibrationReport` dataclasses (JSON-serialisable, mirroring the style of `anomaly_detector.PredictionFeatures` / `model_card.ModelCard`).
- `evaluate_calibration()` one-shot entrypoint returning a populated report.
- `expected_calibration_error()` (support-weighted), `maximum_calibration_error()` (worst-bin), `brier_score()`.
- `write_calibration_report()` for persistence next to model cards in `outputs/`.
- `is_well_calibrated()` helper with a 5% ECE default that wires straight into the release CI gate's threshold pattern; see the usage sketch after this list.
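A hedged usage sketch of how these pieces fit together. Only the function names, the 5% ECE default, and `n_bins=15` come from this PR; the exact input shapes, keyword names, and report field names below are assumptions:

```python
import numpy as np
from governance.calibration import evaluate_calibration, is_well_calibrated

# Toy validation data: per-pixel confidences and binary ground truth, flattened
# to 1-D. Roughly calibrated by construction, so the gate should pass.
rng = np.random.default_rng(0)
probs = rng.random(10_000)
labels = (rng.random(10_000) < probs).astype(int)

report = evaluate_calibration(probs, labels, n_bins=15)
print(report.ece, report.mce, report.brier_score)  # field names assumed

# 5% ECE default per the PR description; keyword name assumed.
if not is_well_calibrated(report, ece_threshold=0.05):
    raise SystemExit("model is miscalibrated; do not release")
```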
Follow-ups (separate PRs, not in scope here)

- Wire `evaluate_calibration` into `scripts/generate_model_card.py` so every model card includes a calibration block.
- Add an `ece` threshold to `scripts/governance_ci_gate.py` so miscalibrated releases fail the CI gate; a sketch of that check follows below.
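For the second follow-up, the gate check could look something like this. This is a sketch only: the real structure of `scripts/governance_ci_gate.py` isn't shown in this PR, and the `report.ece` field name is an assumption:

```python
# Hypothetical follow-up wiring for scripts/governance_ci_gate.py; not in this PR.
from governance.calibration import evaluate_calibration

ECE_THRESHOLD = 0.05  # mirrors the is_well_calibrated() default

def calibration_gate(probs, labels) -> bool:
    """Return False (fail the gate) when validation ECE exceeds the threshold."""
    report = evaluate_calibration(probs, labels, n_bins=15)
    if report.ece > ECE_THRESHOLD:  # field name assumed
        print(f"FAIL calibration: ECE {report.ece:.4f} > {ECE_THRESHOLD}")
        return False
    return True
```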
Test plan

- `pytest tests/test_calibration.py -q` → 12 passed.
- Defaults (`ece_threshold=0.05`, `n_bins=15`) match what we want for the release gate.

🤖 Generated with Claude Code