Add std to show for PerformanceEvaluation #766

Merged: 7 commits merged into dev from rh/std, May 17, 2022
Conversation

@rikhuijzer (Member) commented May 10, 2022

Suggestion to add the standard deviation to the PerformanceEvaluation output:

┌────────────────────────────────┬───────────┬─────────────┬────────┬──────────────────────────────────────┐
│ measure                        │ operation │ measurement │ std    │ per_fold                             │
├────────────────────────────────┼───────────┼─────────────┼────────┼──────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.704       │ 0.0135 │ [0.732, 0.7, 0.7, 0.7, 0.696, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │        │                                      │
└────────────────────────────────┴───────────┴─────────────┴────────┴──────────────────────────────────────┘

This is useful for spotting more easily how much variation there is in the reported measurement.

EDIT: End result after reviewer comments:

┌────────────────────────────────┬───────────┬─────────────┬─────────┬────────────────────────────────────────────┐
│ measure                        │ operation │ measurement │ 1.96*SE │ per_fold                                   │
├────────────────────────────────┼───────────┼─────────────┼─────────┼────────────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.719       │ 0.0189  │ [0.732, 0.713, 0.713, 0.757, 0.706, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │         │                                            │
└────────────────────────────────┴───────────┴─────────────┴─────────┴────────────────────────────────────────────┘
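(For reference: a minimal sketch, using only Julia's standard library and the rounded `per_fold` values above, of how the displayed `measurement` and `1.96*SE` columns relate; this is an illustration, not the MLJBase implementation.)

```julia
using Statistics

# Illustration only: reproduce the displayed columns from the (rounded) per-fold scores.
per_fold = [0.732, 0.713, 0.713, 0.757, 0.706, 0.696]

measurement = mean(per_fold)                                     # ≈ 0.719
half_width  = 1.96 * std(per_fold) / sqrt(length(per_fold) - 1)  # ≈ 0.019
# The table shows 0.0189; the small difference comes from rounding of the displayed per_fold values.
```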

@rikhuijzer requested a review from ablaom, May 10, 2022 11:28
@codecov-commenter commented May 10, 2022

Codecov Report

Merging #766 (5dbd187) into dev (4d1ed14) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #766      +/-   ##
==========================================
+ Coverage   85.85%   85.92%   +0.06%     
==========================================
  Files          36       36              
  Lines        3451     3460       +9     
==========================================
+ Hits         2963     2973      +10     
+ Misses        488      487       -1     
Impacted Files Coverage Δ
src/resampling.jl 91.60% <100.00%> (+0.44%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d1ed14...5dbd187.

@OkonSamuel (Member) commented:

Suggestion to add the standard deviation to the PerformanceEvaluation output:

┌────────────────────────────────┬───────────┬─────────────┬────────┬──────────────────────────────────────┐
│ measure                        │ operation │ measurement │ std    │ per_fold                             │
├────────────────────────────────┼───────────┼─────────────┼────────┼──────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.704       │ 0.0135 │ [0.732, 0.7, 0.7, 0.7, 0.696, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │        │                                      │
└────────────────────────────────┴───────────┴─────────────┴────────┴──────────────────────────────────────┘

This is useful for spotting more easily how much variation there is in the reported measurement.

Nice idea. The only issue I can think of is that std would be undefined for resampling strategies with only one fold, e.g. HoldOut. But maybe we could add a note in the docstring to address this.

@rikhuijzer (Member, Author) commented May 10, 2022

Nice idea. The only issue I can think of is that std would be undefined for resampling strategies with only one fold, e.g. HoldOut. But maybe we could add a note in the docstring to address this.

Well spotted! Based on your suggestion, I got to thinking about what the best solution for usability would be. I have now implemented a conditional standard deviation column in 9cdd7ef; it is only shown when there is more than one fold. From the second `print(show_text)`:

┌────────────────────────────────┬──────────────┬─────────────┬──────────┐
│ measure                        │ operation    │ measurement │ per_fold │
├────────────────────────────────┼──────────────┼─────────────┼──────────┤
│ LogLoss(                       │ predict      │ 0.702       │ [0.702]  │
│   tol = 2.220446049250313e-16) │              │             │          │
│ Accuracy()                     │ predict_mode │ 0.433       │ [0.433]  │
└────────────────────────────────┴──────────────┴─────────────┴──────────┘
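(A rough sketch of the conditional-column idea, again using only the standard library; `spread_column` is a hypothetical name, not the actual MLJBase internals.)

```julia
using Statistics

# Hypothetical sketch: `std` over a single score is undefined, so return `nothing`
# (i.e. omit the column) when each measure has only one fold.
function spread_column(per_fold::AbstractVector{<:AbstractVector{<:Real}})
    nfolds = length(first(per_fold))
    return nfolds > 1 ? map(std, per_fold) : nothing
end

spread_column([[0.702], [0.433]])                      # nothing, as in the one-fold case above
spread_column([[0.732, 0.7, 0.7, 0.7, 0.696, 0.696]])  # ≈ [0.014] (0.0135 in the first table, before rounding)
```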

Review comment on src/resampling.jl (outdated, resolved)
@ablaom (Member) commented May 11, 2022

@rikhuijzer Thanks for this work 👍🏾

  1. Correct me if I'm wrong, but you want these standard deviations to get a poor man's confidence interval for the performance estimate, right? Then shouldn't you be reporting the standard error instead of the standard deviation? That is, you want `std/sqrt(nfolds - 1)` (unbiased case); see, for example, here, and the sketch after this list.

  2. There was some discussion in the early development of MLJ about reporting standard errors of the CV scores. Objections were raised, as it was thought this would encourage a practice with dubious theoretical justification. See, for example, https://arxiv.org/abs/2104.00673. I'm not opposed to including standard errors, but I think we should at least add a warning in the doc-string for PerformanceEvaluation quoting that paper, for example. Something like: "Warning. While cross-validation standard errors are commonly used to construct confidence intervals for model performance estimates, it is known that this practice is not reliable. See, e.g., ...."

  3. I think it is a bit weird to display the standard errors but not include them in the PerformanceEvaluation object itself. I don't think adding the extra field would be breaking, but even if it is, I would personally prefer that. The main thing to test is that MLJTuning.jl is not affected.

  4. If the number of folds is zero (or one, if we're using the unbiased estimator), I'd declare the standard error to be Inf, and dropping the column from the display makes sense.
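(A minimal sketch of points 1 and 4, assuming the unbiased convention above; `fold_standard_error` is a hypothetical helper, not an MLJBase function.)

```julia
using Statistics

# Hypothetical helper: report the standard error of the fold scores rather than
# their standard deviation, declaring it Inf when it is undefined (0 or 1 folds).
function fold_standard_error(scores::AbstractVector{<:Real})
    nfolds = length(scores)
    nfolds > 1 || return Inf                 # e.g. HoldOut: a single fold
    return std(scores) / sqrt(nfolds - 1)
end

scores = [0.732, 0.713, 0.713, 0.757, 0.706, 0.696]
std(scores)                  # ≈ 0.022 (standard deviation)
fold_standard_error(scores)  # ≈ 0.010 (standard error)
```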

@rikhuijzer (Member, Author) commented:

Thanks for the reviews, both. Good comments.

I now implemented some of your suggestions @ablaom. It now looks as follows:

PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows

Note. The sterr column gives the standard error
  over the folds with a 95% confidence interval
  (`1.96 * std / sqrt(nfolds - 1)`). This number can be useful
  to get an idea of the variability of the scores, but
  beware that the estimate shouldn't be used for hypothesis
  testing (e.g., https://arxiv.org/abs/2104.00673).

Extract:
┌────────────────────────────────┬───────────┬─────────────┬────────┬──────────────────────────────────────────┐
│ measure                        │ operation │ measurement │ sterr  │ per_fold                                 │
├────────────────────────────────┼───────────┼─────────────┼────────┼──────────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.706       │ 0.0177 │ [0.694, 0.694, 0.7, 0.694, 0.745, 0.706] │
│   tol = 2.220446049250313e-16) │           │             │        │                                          │
└────────────────────────────────┴───────────┴─────────────┴────────┴──────────────────────────────────────────┘

I'm not super happy (yet) with the lengthy note though.

Correct me if I'm wrong, but you want these standard deviations to get a poor man's confidence interval for the performance estimate, right? Then shouldn't you be reporting the standard error instead of the standard deviation? That is, you want `std/sqrt(nfolds - 1)` (unbiased case); see, for example, here.

For me, the most important thing is to see whether something is wrong, that is, whether the reported cross-validation average makes sense or whether the scores fluctuate enormously.

There was some discussion in the early development of MLJ about reporting standard errors of the CV scores. Objections were raised, as it was thought this would encourage a practice with dubious theoretical justification. See, for example, https://arxiv.org/abs/2104.00673. I'm not opposed to including standard errors, but I think we should at least add a warning in the doc-string for PerformanceEvaluation quoting that paper, for example. Something like: "Warning. While cross-validation standard errors are commonly used to construct confidence intervals for model performance estimates, it is known that this practice is not reliable. See, e.g., ...."

If people only use the score to check whether something is wrong, then it should be fine. I personally get the objections that others have raised, but also find them a bit pedantic. There are many ways to shoot yourself in the foot with resampling techniques, so adding an explicit note may lead the reader to conclude that that is the only worry! Maybe we should link to a special resampling page which gives some guidelines such as "you can overfit a CV if you manually tune your model to your data" and "the variability estimate of CV is unreliable due to ...".

I think it is a bit weird to display the standard errors but not include them in the PerformanceEvaluation object itself. I don't think adding the extra field would be breaking, but even if it is, I would personally prefer that. The main thing to test is that MLJTuning.jl is not affected.

Could you tell me why this is the case? If we keep it simple with a std, or make a function available via MLJBase.jl, then the docs can tell users to call `std(e.per_fold)` if they really want the numbers. And in case people complain about the reported standard deviation/error, not having it baked into the object allows us to change/revert it more easily.
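For illustration, that usage might look like the following hypothetical snippet, assuming `e` is a `PerformanceEvaluation` already returned by `evaluate(...)`:

```julia
using Statistics

# Hypothetical usage: `e` is assumed to be an existing PerformanceEvaluation;
# `e.per_fold` holds one vector of fold scores per measure.
scores  = e.per_fold[1]                                  # scores for the first measure
spread  = std(scores)                                    # plain standard deviation
half_ci = 1.96 * std(scores) / sqrt(length(scores) - 1)  # the quantity shown in the sterr column above
```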

If the number of folds is zero (or one, if we're using the unbiased estimator), I'd declare the standard error to be Inf, and dropping the column from the display makes sense.

That's what I did again 👍

@ablaom (Member) commented May 15, 2022

@rikhuijzer thanks for taking my suggestions and comments on board. I think we're on the same page. And you've changed my mind about adding the standard error as a (redundant) field.

I think I'd prefer not to have a usage warning in the display - that seems like overkill. What if we change the heading "sterr" in the table to "1.96*SE" or "1.96 * std err" (which would be more consistent with the standard meaning of "standard error") and relegate the warning to the PerformanceEvaluation doc-string? Adding more detailed advice in the manual sounds like a good idea, but this would be a new PR.

@rikhuijzer (Member, Author) commented:

It now looks like this:

PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌────────────────────────────────┬───────────┬─────────────┬─────────┬────────────────────────────────────────────┐
│ measure                        │ operation │ measurement │ 1.96*SE │ per_fold                                   │
├────────────────────────────────┼───────────┼─────────────┼─────────┼────────────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.719       │ 0.0189  │ [0.732, 0.713, 0.713, 0.757, 0.706, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │         │                                            │
└────────────────────────────────┴───────────┴─────────────┴─────────┴────────────────────────────────────────────┘

I'm not so sure what to put where in the PerformanceEvaluation docstring. Apart from that, I think this can be merged.

@OkonSamuel (Member) commented:

It now looks like this:

PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌────────────────────────────────┬───────────┬─────────────┬─────────┬────────────────────────────────────────────┐
│ measure                        │ operation │ measurement │ 1.96*SE │ per_fold                                   │
├────────────────────────────────┼───────────┼─────────────┼─────────┼────────────────────────────────────────────┤
│ LogLoss(                       │ predict   │ 0.719       │ 0.0189  │ [0.732, 0.713, 0.713, 0.757, 0.706, 0.696] │
│   tol = 2.220446049250313e-16) │           │             │         │                                            │
└────────────────────────────────┴───────────┴─────────────┴─────────┴────────────────────────────────────────────┘

I'm not so sure what to put where in the PerformanceEvaluation docstring. Apart from that, I think this can be merged.

@rikhuijzer, @ablaom shouldn't we also add the calculated standard errors as a field of the PerformanceEvaluation object?

@ablaom (Member) commented May 16, 2022

@OkonSamuel I agree with @rikhuijzer that we should leave this out as a field, for the reasons he gives in his comment.

@ablaom (Member) commented May 16, 2022

I'm not so sure what to put where in the PerformanceEvaluation docstring. Apart from that, I think this can be merged.

I suggest adding something here. Maybe something along these lines (as a separate paragraph):

When displayed, a `PerformanceEvaluation` object includes a value under the heading `1.96*SE`, 
derived from the standard error of the `per_fold` entries, 
suitable for constructing a formal 95% confidence
interval for the given `measurement`. Such intervals should be interpreted with caution. 
See, for example, S. Bates et al. (2021): [Cross-validation: what does it 
estimate and how well does it do it?](https://arxiv.org/abs/2104.00673) *arXiv preprint*, 
arXiv:2104.00673.
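For concreteness, a hypothetical reading of such a display (not part of the suggested docstring text):

```julia
# Hypothetical: forming the approximate 95% interval discussed above from the
# displayed quantities. Interpret with caution (Bates et al., 2021).
measurement = 0.719
half_width  = 0.0189                                                # the `1.96*SE` column
interval    = (measurement - half_width, measurement + half_width)  # (0.7001, 0.7379)
```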

@rikhuijzer (Member, Author) commented:

@ablaom Done. I've added the reference as

Bates et al. (2021).

since the suggested ref didn't parse correctly. The arXiv link should be stable enough to find the document even in 10 years, because other papers will list the full reference, including the title.

@rikhuijzer (Member, Author) commented:

As always: If any of you spots mistakes in this PR, feel free to modify it. You should both have the right permissions to do so 😄

@ablaom (Member) commented May 17, 2022

@rikhuijzer Thanks for this contribution! 🚀

@ablaom merged commit 84ff503 into dev, May 17, 2022
@ablaom deleted the rh/std branch, May 17, 2022 10:09
@ablaom (Member) left a review comment:

👍🏾

@ablaom mentioned this pull request, May 17, 2022
@rikhuijzer (Member, Author) commented:

Sure. Thanks for the review @ablaom and @OkonSamuel. Much appreciated again!
