Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column name '.pred' for prediction intervals? #921

Closed
bwiernik opened this issue Aug 31, 2020 · 13 comments · Fixed by #925
Closed

Column name '.pred' for prediction intervals? #921

bwiernik opened this issue Aug 31, 2020 · 13 comments · Fixed by #925

Comments

@bwiernik
Copy link
Contributor

The interval column names in augment() are always called .conf.low and .conf.high, regardless of whether "confidence" or "prediction" is specified for interval. I think this is potentially confusing and may be misleading in reports if the user doesn't rename the columns. I suggest changing the column names to .pred.low and .pred.high when interval = "prediction" (cf. in the metafor package, confidence intervals are labeled ci and prediction intervals labeled pi: https://github.com/wviechtb/metafor/blob/acacdcaa56ce6ed9c7c3239c15007b14d03012c8/man/predict.rma.Rd#L47).

broom::augment(lm(mpg ~ hp + disp, data = mtcars), interval = "prediction")
#> # A tibble: 32 x 12
#>    .rownames   mpg    hp  disp .fitted .conf.low .conf.high .resid .std.resid
#>    <chr>     <dbl> <dbl> <dbl>   <dbl>     <dbl>      <dbl>  <dbl>      <dbl>
#>  1 Mazda RX4  21     110  160     23.1     16.6        29.7 -2.15      -0.702
#>  2 Mazda RX~  21     110  160     23.1     16.6        29.7 -2.15      -0.702
#>  3 Datsun 7~  22.8    93  108     25.1     18.6        31.7 -2.35      -0.776
#>  4 Hornet 4~  21.4   110  258     20.2     13.5        26.8  1.23       0.408
#>  5 Hornet S~  18.7   175  360     15.5      8.82       22.1  3.24       1.08 
#>  6 Valiant    18.1   105  225     21.3     14.7        27.9 -3.20      -1.06 
#>  7 Duster 3~  14.3   245  360     13.7      7.02       20.4  0.575      0.194
#>  8 Merc 240D  24.4    62  147.    24.7     18.1        31.4 -0.344     -0.115
#>  9 Merc 230   22.8    95  141.    24.1     17.5        30.7 -1.30      -0.428
#> 10 Merc 280   19.2   123  168.    22.6     16.1        29.1 -3.39      -1.11 
#> # ... with 22 more rows, and 3 more variables: .hat <dbl>, .sigma <dbl>,
#> #   .cooksd <dbl>

Created on 2020-08-31 by the reprex package (v0.3.0)

@simonpcouch
Copy link
Collaborator

Thanks for the suggestion! I think this is a great discussion to have now, before prediction intervals make it into a CRAN version.

I think I agree here, except that the interval argument is just passed to the relevant predict() method in the current implementation. I'm not sure about the best way to generalize this column renaming to accommodate an arbitrary interval argument other than "confidence" or "prediction".

@grantmcdermott Do you have a sense for how common the interval argument is among predict methods? Have you come across any options other than confidence/prediction/none?

Related to #908.

@bwiernik
Copy link
Contributor Author

bwiernik commented Aug 31, 2020

Edited with more information.

In the stats predict methods, only lm and nls have interval arguments, both taking c("none", "confidence", "prediction") as choices.

In terms of other possible types that might exist, the other one that comes to mind is credibility/credible ("cred") used in Bayesian analyses (but I'm not aware of any Bayesian package using a predict method). Perhaps a good general solution would be to take the first four characters of the interval argument?

@grantmcdermott
Copy link
Contributor

grantmcdermott commented Aug 31, 2020

Yeah, I tend to agree.

As it stands, it should be fairly straightforward to implement @bwiernik's suggestion and take the first four letters of the interval argument.

However, a potential concern I have is that you'd need to check compatibility with — and adjust output for — any other augment.x methods that accept a "interval" argument. For example, augment.rq() (see here). EDIT: My phrasing here was a little confusing, so let me say "consistency" instead of "compatibility".

(It's something of an aside, but this seems as good a place as any to mention... the augment.lm() method that I created in #908 differs slightly in its implementation to augment.rq() above. I think the former is cleaner and, while I may not be an unbiased spectator, it's probably a good idea to choose one approach and stick with it going forward.)

Another, even simpler option, is to follow with the predict() naming conventions and use .upr and .lwr as the column names regardless of the interval argument.

@bwiernik
Copy link
Contributor Author

bwiernik commented Sep 1, 2020

I like .upr and .lwr or .lo and .hi a lot versus "high" and "low".

@alexpghayes
Copy link
Collaborator

alexpghayes commented Sep 1, 2020 via email

@bwiernik
Copy link
Contributor Author

bwiernik commented Sep 1, 2020

Backwards compatibility with what though? Most other packages use something like .lwr and .upr; cl.lb and cl.ub; or ci.lb and ci.ub. I think broom might be one of the only packages using “conf”.

@simonpcouch
Copy link
Collaborator

Thanks for the note on previous discussions, @alexpghayes.

@bwiernik Backwards compatibility with tidiers from previous versions of the broom package. See the tidy.boot method, for example. :-)

Closing for now--thanks again for bringing this up.

@bwiernik
Copy link
Contributor Author

bwiernik commented Sep 3, 2020

I understand that backwards compatibility can be important, just wanted to chime in that I had several students who were confused in my office hour this morning asking "why does it always compute a confidence interval when I ask for a prediction interval". They didn't notice that the numbers were changing, only that the column labels were the same.

@grantmcdermott
Copy link
Contributor

My $0.02:

I'm not sure if backwards compatibility is quite the issue here because we are dealing with different tidiers (i.e. tidy vs augment). So the interval column names attached to these two tidiers could, in principle, differ. Just like we have "estimate" (tidy) vs ".fitted" (augment).

Of course, it doesn't make sense to change "conf.low" and "conf.high" for tidy. But AFAIK there are very few augment methods that currently support interval columns. In fact, augment.rq and brand new (dev version only) augment.lm are the only two that I can think of off the top of my head.

TL;DR The more I think about this, the more .lwr and .upr seem like the most sensible and consistent names for augment intervals.

@simonpcouch
Copy link
Collaborator

Thanks for yalls input. I'd be glad to review a PR, and agree that .lwr and .upr are sensible names for interval columns in augment() output. :-)

@simonpcouch simonpcouch reopened this Sep 3, 2020
@alexpghayes
Copy link
Collaborator

If we're going to revisit this I'd say I'm also strongly opposed to abbreviations and would prefer .lower and .upper here. @simonpcouch Let's make sure to sync before a final decision happens.

@bwiernik
Copy link
Contributor Author

bwiernik commented Sep 4, 2020

I've opened a pull request changing the columns to .lower and .upper

@github-actions
Copy link

github-actions bot commented Mar 6, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants