
Add std to show for PerformanceEvaluation #766

Merged
merged 7 commits into dev from rh/std on May 17, 2022

Conversation

rikhuijzer
Member

@rikhuijzer rikhuijzer commented May 10, 2022

Suggestion to add the standard deviation to the PerformanceEvaluation output:

┌────────────────────────────────┬───────────┬─────────────┬────────┬──────────────────────────────────────┐
│ measure                        │ operation │ measurement │ std    │ per_fold                             │
├────────────────────────────────┼───────────┼─────────────┼────────┼──────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.704       │ 0.0135 │ [0.732, 0.7, 0.7, 0.7, 0.696, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │        │                                      │
└────────────────────────────────┴───────────┴─────────────┴────────┴──────────────────────────────────────┘

This makes it easier to spot how much variation there is in the reported measurement.
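
For reference, roughly the same number can be computed by hand; a minimal sketch using the (rounded) per_fold values displayed above:

```julia
using Statistics  # mean, std

# Fold scores as displayed in the per_fold column (rounded for display).
per_fold = [0.732, 0.7, 0.7, 0.7, 0.696, 0.696]

mean(per_fold)  # ≈ 0.704, the reported measurement
std(per_fold)   # ≈ 0.014, roughly the displayed 0.0135 (the difference comes from rounding)
```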

EDIT: End result after reviewer comments

┌────────────────────────────────┬───────────┬─────────────┬─────────┬────────────────────────────────────────────┐
│ measure                        │ operation │ measurement │ 1.96*SE │ per_fold                                   │
├────────────────────────────────┼───────────┼─────────────┼─────────┼────────────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.719       │ 0.0189  │ [0.732, 0.713, 0.713, 0.757, 0.706, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │         │                                            │
└────────────────────────────────┴───────────┴─────────────┴─────────┴────────────────────────────────────────────┘

@rikhuijzer rikhuijzer requested a review from ablaom May 10, 2022 11:28
@codecov-commenter

codecov-commenter commented May 10, 2022

Codecov Report

Merging #766 (5dbd187) into dev (4d1ed14) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #766      +/-   ##
==========================================
+ Coverage   85.85%   85.92%   +0.06%     
==========================================
  Files          36       36              
  Lines        3451     3460       +9     
==========================================
+ Hits         2963     2973      +10     
+ Misses        488      487       -1     
Impacted Files Coverage Δ
src/resampling.jl 91.60% <100.00%> (+0.44%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@OkonSamuel
Member

Suggestion to add the standard deviation to the PerformanceEvaluation output:

┌────────────────────────────────┬───────────┬─────────────┬────────┬──────────────────────────────────────┐
│ measure                        │ operation │ measurement │ std    │ per_fold                             │
├────────────────────────────────┼───────────┼─────────────┼────────┼──────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.704       │ 0.0135 │ [0.732, 0.7, 0.7, 0.7, 0.696, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │        │                                      │
└────────────────────────────────┴───────────┴─────────────┴────────┴──────────────────────────────────────┘

This makes it easier to spot how much variation there is in the reported measurement.

Nice idea. The only issue I can think of is that std would be undefined for resampling strategies with a single fold, e.g. Holdout. But maybe we could add a note in the docstring to address this.
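
To illustrate (this snippet is just an illustration, not code from the PR), with Julia's default corrected estimator:

```julia
using Statistics

std([0.702])        # NaN: the corrected estimator divides by n - 1 = 0
std([0.702, 0.433]) # defined once there is more than one score
```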

@rikhuijzer
Member Author

rikhuijzer commented May 10, 2022

Nice idea. The only issue I can think of is that std would be undefined for resampling strategies with a single fold, e.g. Holdout. But maybe we could add a note in the docstring to address this.

Well spotted! Your suggestion got me thinking about the best solution for usability. I have now implemented a conditional standard deviation column in 9cdd7ef; it is only shown when there is more than one fold. From the second `print(show_text)`:

┌────────────────────────────────┬──────────────┬─────────────┬──────────┐
│ measure                        │ operation    │ measurement │ per_fold │
├────────────────────────────────┼──────────────┼─────────────┼──────────┤
│ LogLoss(                       │ predict      │ 0.702       │ [0.702]  │
│   tol = 2.220446049250313e-16) │              │             │          │
│ Accuracy()                     │ predict_mode │ 0.433       │ [0.433]  │
└────────────────────────────────┴──────────────┴─────────────┴──────────┘
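
Roughly, the idea is the following (an illustrative sketch only, not the actual change in 9cdd7ef; the helper name is made up):

```julia
# Only show an std column when there is more than one fold (e.g. not for
# Holdout); all measures share the same folds, so checking one entry suffices.
function _column_names(per_fold::Vector{<:Vector})
    nfolds = length(first(per_fold))
    names = ["measure", "operation", "measurement"]
    nfolds > 1 && push!(names, "std")
    push!(names, "per_fold")
    return names
end

_column_names([[0.702]])                        # std column omitted
_column_names([[0.732, 0.7, 0.7, 0.7, 0.696]])  # std column included
```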

@ablaom
Member

ablaom commented May 11, 2022

@rikhuijzer Thanks for this work 👍🏾

  1. Correct me if I'm wrong, but you want these standard deviations to get a poor man's confidence interval for the performance estimate, right? Then shouldn't you be reporting standard error instead of standard deviation? That is, you want std/sqrt(nfolds - 1) (unbiased case). See, for example, here (and the sketch at the end of this comment).

  2. There was some discussion in the early development of MLJ about reporting standard errors of the CV scores. Objections were raised, as it was thought this would encourage a practice with dubious theoretical justification. See, for example, https://arxiv.org/abs/2104.00673. I'm not opposed to including standard errors, but I think we should at least add a warning in the doc-string for PerformanceEvaluation quoting that paper, for example. Something like: "Warning. While cross-validation standard errors are commonly used to construct confidence intervals for model performance estimates, it is known that this practice is not reliable. See, e.g., ...."

  3. I think it is a bit weird to display the standard errors, but not include them in the PerformanceEvaluation object itself. I don't think adding the extra field would be breaking, but even if it is, I would personally prefer that. The main thing to test is that MLJTuning.jl is not affected.

  4. If the number of folds is zero (or one, if we're using the unbiased estimator), I'd declare the std error as Inf, and dropping the column in the display makes sense.
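
To make the distinction in point 1 concrete, a small sketch using the fold scores from the opening example:

```julia
using Statistics

per_fold = [0.732, 0.7, 0.7, 0.7, 0.696, 0.696]
nfolds = length(per_fold)

std(per_fold)                     # standard deviation ≈ 0.014
std(per_fold) / sqrt(nfolds - 1)  # standard error (unbiased case) ≈ 0.006
```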

@rikhuijzer
Member Author

Thanks for the reviews, both. Good comments.

I have now implemented some of your suggestions, @ablaom. It now looks as follows:

PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows

Note. The sterr column gives the standard error
  over the folds with a 95% confidence interval
  `1.96 * std / sqrt(nfolds - 1)`. This number can be useful
  to get an idea of the variability of the scores, but
  beware that the estimate shouldn't be used for hypothesis
  testing (e.g., https://arxiv.org/abs/2104.00673).

Extract:
┌────────────────────────────────┬───────────┬─────────────┬────────┬──────────────────────────────────────────┐
│ measure                        │ operation │ measurement │ sterr  │ per_fold                                 │
├────────────────────────────────┼───────────┼─────────────┼────────┼──────────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.706       │ 0.0177 │ [0.694, 0.694, 0.7, 0.694, 0.745, 0.706] │
│   tol = 2.220446049250313e-16) │           │             │        │                                          │
└────────────────────────────────┴───────────┴─────────────┴────────┴──────────────────────────────────────────┘

I'm not super happy (yet) with the lengthy note though.

Correct me if I'm wrong, but you want these standard deviations to get a poor man's confidence interval for the performance estimate, right? Then shouldn't you be reporting standard error instead of standard deviation? That is, you want std/sqrt(nfolds - 1) (unbiased case). See, for example, here.

For me, the most important thing is to see whether something is wrong, that is, whether the reported cross-validation average makes sense or whether the scores fluctuate enormously.

There was some discussion in the early development of MLJ about reporting standard errors of the CV scores. Objections were raised, as it was thought this would encourage a practice with dubious theoretical justification. See, for example, https://arxiv.org/abs/2104.00673. I'm not opposed to including standard errors, but I think we should at least add a warning in the doc-string for PerformanceEvaluation quoting that paper, for example. Something like: "Warning. While cross-validation standard errors are commonly used to construct confidence intervals for model performance estimates, it is known that this practice is not reliable. See, e.g., ...."

If people only use the score to check whether something is wrong, then it should be fine. I personally understand the objections that others have raised, but also find them a bit pedantic. There are many ways to shoot yourself in the foot with resampling techniques, so adding an explicit note may lead the reader to conclude that this is the only worry! Maybe we should link to a dedicated resampling page which gives some guidelines such as "you can overfit a CV if you manually tune your model to your data" and "the variability estimate of CV is unreliable due to ...".

I think it is a bit weird to display the standard errors, but not include them in the PerformanceEvaluation object itself. I don't think adding the extra field would be breaking, but even if it is, I would personally prefer that. The main thing to test is that MLJTuning.jl is not affected.

Could you tell me why this is the case? If we keep it simple with std, or make a function available via MLJBase.jl, then the docs can tell users to call std(e.per_fold) if they really want the numbers. And in case people complain about the reported standard deviation/error, not having it baked into the object allows us to change or revert it more easily.

If the number of folds is zero (or one, if we're using the unbiased estimator), I'd declare the std error as Inf, and dropping the column in the display makes sense.

That's what I did again 👍

@ablaom
Member

ablaom commented May 15, 2022

@rikhuijzer thanks for taking my suggestions and comments on board. I think we're on the same page. And you've changed my mind about adding the standard error as a (redundant) field.

I think I'd prefer not to have a usage warning in the display - that seems like overkill. What if we change the heading "sterr" in the table to "1.96*SE" or "1.96 * std err" (which would be more consistent with the standard meaning of "standard error") and relegate the warning to the PerformanceEvaluation doc-string? Adding more detailed advice in the manual sounds like a good idea, but this would be a new PR.

@rikhuijzer
Member Author

It now looks like this:

PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌────────────────────────────────┬───────────┬─────────────┬─────────┬────────────────────────────────────────────┐
│ measure                        │ operation │ measurement │ 1.96*SE │ per_fold                                   │
├────────────────────────────────┼───────────┼─────────────┼─────────┼────────────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.719       │ 0.0189  │ [0.732, 0.713, 0.713, 0.757, 0.706, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │         │                                            │
└────────────────────────────────┴───────────┴─────────────┴─────────┴────────────────────────────────────────────┘
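
As a quick check (not part of the diff), the 1.96*SE entry can be reproduced from the fold scores shown above:

```julia
using Statistics

per_fold = [0.732, 0.713, 0.713, 0.757, 0.706, 0.696]
nfolds = length(per_fold)

1.96 * std(per_fold) / sqrt(nfolds - 1)  # ≈ 0.019, consistent with the displayed 0.0189
```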

I'm not so sure what to put where in the PerformanceEvaluation docstring. Apart from that, I think this can be merged

@OkonSamuel
Member

It now looks like this:

PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌────────────────────────────────┬───────────┬─────────────┬─────────┬────────────────────────────────────────────┐
│ measure                        │ operation │ measurement │ 1.96*SE │ per_fold                                   │
├────────────────────────────────┼───────────┼─────────────┼─────────┼────────────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.719       │ 0.0189  │ [0.732, 0.713, 0.713, 0.757, 0.706, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │         │                                            │
└────────────────────────────────┴───────────┴─────────────┴─────────┴────────────────────────────────────────────┘

I'm not so sure what to put where in the PerformanceEvaluation docstring. Apart from that, I think this can be merged

@rikhuijzer, @ablaom, shouldn't we also add the calculated standard errors as a field of the PerformanceEvaluation object?

@ablaom
Member

ablaom commented May 16, 2022

@OkonSamuel I agree with @rikhuijzer that we should leave this out as a field, for the reasons he gives in his comment.

@ablaom
Member

ablaom commented May 16, 2022

I'm not so sure what to put where in the PerformanceEvaluation docstring. Apart from that, I think this can be merged

I suggest adding something here. Maybe something along these lines (as a separate paragraph):

When displayed, a `PerformanceEvaluation` object includes a value under the heading `1.96*SE`, 
derived from the standard error of the `per_fold` entries, 
suitable for constructing a formal 95% confidence
interval for the given `measurement`. Such intervals should be interpreted with caution. 
See, for example, S. Bates et al. (2021): [Cross-validation: what does it 
estimate and how well does it do it?](https://arxiv.org/abs/2104.00673) *arXiv preprint*, 
arXiv:2104.00673.

@rikhuijzer
Member Author

@ablaom Done. I've added the reference as

Bates et al. (2021).

since the suggested reference didn't parse correctly. The arXiv link should be stable enough to find the document even in 10 years, because other papers will list the full reference, including the title.

@rikhuijzer
Member Author

As always: If any of you spots mistakes in this PR, feel free to modify it. You should both have the right permissions to do so 😄

@ablaom
Member

ablaom commented May 17, 2022

@rikhuijzer Thanks for this contribution! 🚀

@ablaom ablaom merged commit 84ff503 into dev May 17, 2022
@ablaom ablaom deleted the rh/std branch May 17, 2022 10:09
Member

@ablaom ablaom left a comment

👍🏾

@rikhuijzer
Member Author

Sure. Thanks for the reviews, @ablaom and @OkonSamuel. Much appreciated again!

@ablaom ablaom mentioned this pull request May 17, 2022