
feat(governance): add calibration metrics for segmentation confidence #51

Merged
Hopelynconsult merged 1 commit into develop from feature/governance-calibration on May 7, 2026

Conversation

@Hopelynconsult
Collaborator

Summary

Why calibration matters here

ClimateVision dispatches NGO alerts based on confidence thresholds. A model that reports 0.9 confidence should be correct ~90% of the time; if it isn't, every threshold downstream is silently wrong, producing either missed events (deforestation, flooding) or false alarms that erode NGO trust. We currently track accuracy/IoU/F1 but never measure whether the confidence we surface is meaningful. This PR closes that gap with the standard reliability-diagram metrics.
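For context, expected calibration error (ECE) is the support-weighted gap between claimed confidence and observed accuracy across reliability bins. A minimal sketch of that computation follows; the function name, binning scheme, and weighting here are illustrative, not the exact implementation in this PR:

```python
import numpy as np

def ece_sketch(confidences: np.ndarray, targets: np.ndarray, n_bins: int = 15) -> float:
    """Support-weighted expected calibration error for binary predictions (illustrative)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Assign each prediction to a confidence bin (last bin is right-closed).
        in_bin = (confidences >= lo) & (confidences < hi) if hi < 1.0 else (confidences >= lo)
        if not in_bin.any():
            continue
        avg_conf = confidences[in_bin].mean()  # what the model claims
        accuracy = targets[in_bin].mean()      # what actually happened
        weight = in_bin.mean()                 # fraction of all samples in this bin
        ece += weight * abs(avg_conf - accuracy)
    return float(ece)
```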

What's in the PR

  • ReliabilityBin and CalibrationReport dataclasses (JSON-serialisable, mirroring the style of anomaly_detector.PredictionFeatures / model_card.ModelCard).
  • evaluate_calibration() one-shot entrypoint returning a populated report (see the sketch after this list).
  • expected_calibration_error() (support-weighted), maximum_calibration_error() (worst-bin), brier_score().
  • write_calibration_report() for persistence next to model cards in outputs/.
  • is_well_calibrated() helper with a 5% ECE default — wires straight into the release CI gate's threshold pattern.
  • 12 unit tests covering perfectly-calibrated and overconfident synthetic distributions, MCE ≥ ECE invariant, Brier extremes, full input validation (probability range, binary targets, shape match), and round-trip JSON serialisation.
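A rough sketch of how these pieces are expected to compose; the field names, module layout, and signatures below are assumptions inferred from this list, not the actual implementation:

```python
from dataclasses import dataclass, asdict, field
from typing import List
import json

@dataclass
class ReliabilityBin:
    lower: float           # inclusive lower edge of the confidence bin
    upper: float           # exclusive upper edge (inclusive for the last bin)
    avg_confidence: float  # mean predicted confidence inside the bin
    accuracy: float        # observed fraction of positives inside the bin
    count: int             # number of predictions that landed in the bin

@dataclass
class CalibrationReport:
    expected_calibration_error: float
    maximum_calibration_error: float
    brier_score: float
    bins: List[ReliabilityBin] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so the report
        # round-trips through JSON without custom encoders.
        return json.dumps(asdict(self), indent=2)
```

With names like these, the one-shot flow would look roughly like `report = evaluate_calibration(confidences, targets, n_bins=15)`, then `write_calibration_report(report, output_dir)` to persist it next to the model card, and `is_well_calibrated(report, ece_threshold=0.05)` as the boolean gate.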

Follow-ups (separate PRs, not in scope here)

  • Wire evaluate_calibration into scripts/generate_model_card.py so every model card includes a calibration block.
  • Add an ece threshold to scripts/governance_ci_gate.py so miscalibrated releases fail the CI gate (a minimal sketch of that check follows).
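A minimal sketch of what that gate check could look like, assuming the gate script loads the persisted JSON report; the function name, field name, and flag handling are hypothetical:

```python
import json
import sys

def check_calibration_gate(report_path: str, ece_threshold: float = 0.05) -> None:
    """Fail the release gate when ECE exceeds the configured threshold.

    Hypothetical helper: governance_ci_gate.py's real structure may differ.
    """
    with open(report_path) as fh:
        report = json.load(fh)
    ece = report["expected_calibration_error"]  # assumed report field name
    if ece > ece_threshold:
        print(f"Calibration gate failed: ECE {ece:.3f} > threshold {ece_threshold:.3f}")
        sys.exit(1)
    print(f"Calibration gate passed: ECE {ece:.3f} <= threshold {ece_threshold:.3f}")
```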

Test plan

  • pytest tests/test_calibration.py -q → 12 passed (one representative test shape is sketched after this list).
  • Reviewer: confirm threshold defaults (ece_threshold=0.05, n_bins=15) match what we want for the release gate.
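For reviewers skimming the tests, one representative shape; the import path, data sizes, and tolerances here are illustrative, not copied from tests/test_calibration.py:

```python
import numpy as np
from governance.calibration import evaluate_calibration  # hypothetical import path

def test_perfectly_calibrated_synthetic():
    # Synthetic perfectly calibrated predictor: for a prediction with
    # confidence p, the target is positive with probability p, so claimed
    # confidence and observed accuracy agree and ECE should be near zero.
    rng = np.random.default_rng(0)
    confidences = rng.uniform(0.05, 0.95, size=10_000)
    targets = (rng.uniform(size=10_000) < confidences).astype(int)

    report = evaluate_calibration(confidences, targets, n_bins=15)

    assert report.expected_calibration_error < 0.02
    # MCE takes the worst bin while ECE averages over bins, so MCE >= ECE.
    assert report.maximum_calibration_error >= report.expected_calibration_error
```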

🤖 Generated with Claude Code

ECE, MCE, Brier score, and reliability-bin computation for binary
segmentation outputs. Threshold-driven NGO alerts depend on calibrated
confidence: a model that says 0.9 should be right 90% of the time, and
miscalibration translates directly into missed events or false alarms.

The CalibrationReport dataclass slots into the existing model card
generator and release CI gate. Pure numpy at evaluation time, no torch.

- ReliabilityBin / CalibrationReport dataclasses with JSON serialisation
- evaluate_calibration() one-shot entrypoint
- write_calibration_report() for persistence alongside model cards
- 12 tests covering perfect/overconfident calibration, edge cases,
  input validation, and round-trip JSON
@Hopelynconsult Hopelynconsult requested a review from Goldokpa as a code owner May 7, 2026 16:17
@Hopelynconsult Hopelynconsult merged commit 2d7f271 into develop May 7, 2026