
Implementation of calibration error metrics #394

Merged
51 commits merged into Lightning-AI:master on Aug 3, 2021

Conversation

edwardclem
Contributor

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Adds the calibration error metrics described in #218.

Implements L1, L2, and max-norm classification calibration errors as described here and here. Calibration errors are computed by binning predictions by confidence and comparing the empirical probability of correctness (i.e. accuracy) in each bin to the average confidence in that bin; in a frequentist sense, a model is "calibrated" if predictions made with 60% confidence are correct 60% of the time. Note that these probabilities are currently computed only for the top-1 prediction, as in the traditional calibration error definition. Variants that take all predicted classes into account would be worth including in a future PR, but are out of scope for this one.
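For intuition, here is a minimal, self-contained sketch of the binned top-1 calibration error described above. It is not the PR's implementation; the function name `binned_calibration_error` and its arguments are illustrative only.

```python
import torch


def binned_calibration_error(probs: torch.Tensor, target: torch.Tensor,
                             n_bins: int = 15, norm: str = "l1") -> torch.Tensor:
    """Binned top-1 calibration error: 'l1' (ECE), 'l2' (RMS), or 'max' (MCE)."""
    confidences, predictions = probs.max(dim=1)   # top-1 confidence and predicted class
    correct = predictions.eq(target).float()      # 1.0 where the top-1 class is right

    bin_edges = torch.linspace(0.0, 1.0, n_bins + 1)
    error = torch.zeros(())
    max_gap = torch.zeros(())
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        prop = in_bin.float().mean()              # fraction of samples falling in this bin
        if prop == 0:
            continue
        acc = correct[in_bin].mean()              # empirical accuracy within the bin
        conf = confidences[in_bin].mean()         # average confidence within the bin
        gap = (acc - conf).abs()                  # calibration gap for this bin
        max_gap = torch.max(max_gap, gap)
        if norm == "l1":
            error += prop * gap                   # expected calibration error (ECE)
        elif norm == "l2":
            error += prop * gap ** 2              # squared gaps, rooted at the end

    if norm == "max":
        return max_gap
    return error.sqrt() if norm == "l2" else error
```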

Tests are written against a local copy of the calibration code from this scikit-learn pull request and should be rewritten to use the master branch once it is merged. The debiasing term described in Verified Uncertainty Calibration is currently not supported by this PR; I am checking with the scikit-learn developers in the linked PR to determine what the correct implementation should be.

NOTE: DDP is currently broken in this PR; I am working on a fix.
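As context (not part of the original description), a rough sketch of how the new modular metric might be used once this lands; the CalibrationError class is referenced later in this thread, but the top-level import and the n_bins and norm argument names shown here are assumptions.

```python
import torch
from torchmetrics import CalibrationError  # modular metric added by this PR (import path assumed)

# toy softmax outputs for 4 samples over 3 classes, plus integer targets
preds = torch.tensor([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.6, 0.3, 0.1]])
target = torch.tensor([0, 1, 2, 1])

# assumed arguments: number of confidence bins and the norm ("l1", "l2", or "max")
ece = CalibrationError(n_bins=10, norm="l1")
ece.update(preds, target)
print(ece.compute())
```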

@pep8speaks

pep8speaks commented Jul 23, 2021

Hello @edwardclem! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-03 07:34:08 UTC

@edwardclem
Contributor Author

It looks like pep8speaks doesn't like the math sections of docstrings - is this expected?

@SkafteNicki linked an issue on Jul 23, 2021 that may be closed by this pull request
@SkafteNicki added the enhancement (New feature or request) and New metric labels on Jul 23, 2021
@Borda
Member

Borda commented Jul 26, 2021

@edwardclem how is it going, still draft? 🐰

@codecov

codecov bot commented Jul 26, 2021

Codecov Report

Merging #394 (984b879) into master (79c966d) will increase coverage by 20.44%.
The diff coverage is 93.82%.


@@             Coverage Diff             @@
##           master     #394       +/-   ##
===========================================
+ Coverage   75.56%   96.00%   +20.44%     
===========================================
  Files         124      126        +2     
  Lines        4002     4083       +81     
===========================================
+ Hits         3024     3920      +896     
+ Misses        978      163      -815     
Flag Coverage Δ
Linux 74.65% <29.62%> (-0.92%) ⬇️
Windows 74.65% <29.62%> (-0.92%) ⬇️
cpu 74.65% <29.62%> (-0.92%) ⬇️
gpu 96.00% <93.82%> (?)
macOS 74.65% <29.62%> (-0.92%) ⬇️
pytest 96.00% <93.82%> (+20.44%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
torchmetrics/__init__.py 100.00% <ø> (ø)
torchmetrics/functional/classification/iou.py 100.00% <ø> (+10.00%) ⬆️
torchmetrics/regression/cosine_similarity.py 96.42% <ø> (+3.57%) ⬆️
torchmetrics/utilities/distributed.py 98.27% <ø> (+81.03%) ⬆️
torchmetrics/classification/calibration_error.py 93.10% <93.10%> (ø)
...ics/functional/classification/calibration_error.py 93.87% <93.87%> (ø)
torchmetrics/classification/__init__.py 100.00% <100.00%> (ø)
torchmetrics/functional/__init__.py 100.00% <100.00%> (ø)
torchmetrics/functional/classification/__init__.py 100.00% <100.00%> (ø)
... and 83 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 79c966d...984b879.

@SkafteNicki
Member

@edwardclem really great job with this one. Even though I have some comments, they are all minor and we can probably get this merged pretty quickly.
Note that the comments on the docstring of the functional implementation also apply to the modular implementation.
Please also add a changelog entry :]

Review threads (outdated, resolved):
tests/classification/_sklearn_calibration.py (2)
tests/classification/test_calibration_error.py (3)
torchmetrics/classification/calibration_error.py (3)
@edwardclem
Contributor Author

@SkafteNicki I've fixed the DDP issues and I believe I've resolved all of your comments! Let me know if there's anything else I should take a look at. All tests pass on my MacBook; I'll wait until the Ubuntu tests run before moving this out of draft.

@Borda
Member

Borda commented Aug 2, 2021

@SkafteNicki seems most test cases are failing...

Review thread (outdated, resolved): CHANGELOG.md
auto-merge was automatically disabled August 3, 2021 04:07

Head branch was pushed to by a user without write access

@edwardclem
Contributor Author

edwardclem commented Aug 3, 2021

@SkafteNicki seems most test cases are failing...

I think I fixed it; it was just a small issue with the type signatures in the CalibrationError class.

Review threads (outdated, resolved):
tests/helpers/non_sklearn_metrics.py
torchmetrics/classification/calibration_error.py
@mergify bot added the ready label Aug 3, 2021
@SkafteNicki enabled auto-merge (squash) August 3, 2021 07:57
@SkafteNicki merged commit 2aaf27f into Lightning-AI:master Aug 3, 2021
@Borda added this to the v0.5 milestone Aug 18, 2021
Successfully merging this pull request may close the following issue: Add Expected Calibration Error

4 participants