Data frame output of do() * resample() #641

Closed
rudeboybert opened this issue Apr 16, 2017 · 8 comments
@rudeboybert

Some questions:

  1. In snippet 1, would it be hard to have the simulation output be a tibble instead of a plain data frame?
  2. In snippet 1, the column names of simulation are V1 thru V5, which aren't informative. Would it be worth the real estate to rename these outcome1 thru outcome5 instead?
  3. Snippet 2 doesn't work because the pipe applies only to resample(), not to the whole do() * resample() expression. Snippet 3 fixes this by wrapping the statement in parentheses. Is there a more elegant solution than this? Inevitably all students attempt Snippet 2.

library(mosaic)
library(dplyr)

coin <- c(0, 1)

# Snippet 1: Simulate 5 coin flips 1000 times
simulation <- do(1000) * resample(coin, size=5, replace=TRUE) 
simulation

# Snippet 2: Doesn't work
simulation <- do(1000) * resample(coin, size=5, replace=TRUE) %>% 
  mutate(num_heads = V1+V2+V3+V4+V5)

# Snippet 3: Works
simulation <- (do(1000) * resample(coin, size=5, replace=TRUE)) %>% 
  mutate(num_heads = V1+V2+V3+V4+V5)
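
For reference, here is a minimal base-R sketch of why Snippet 2 parses the way it does. It only inspects the parse tree with quote(), so nothing is evaluated and neither mosaic nor dplyr needs to be loaded. The `%plus%` operator is a hypothetical helper defined just for this illustration; any `%op%` operator shares the precedence of `%>%`.

```r
# Capture the expression without evaluating it.
e <- quote(do(1000) * resample(coin, size = 5, replace = TRUE) %>% mutate(num_heads = V1 + V2))

top_level_call <- as.character(e[[1]])       # outermost operator of the whole expression
right_hand_call <- as.character(e[[3]][[1]]) # operator inside the right-hand side of `*`

# Hypothetical operator, used only to show the same precedence rule numerically.
`%plus%` <- function(a, b) a + b

without_parens <- 2 * 3 %plus% 4    # parsed as 2 * (3 %plus% 4)
with_parens <- (2 * 3) %plus% 4     # parentheses force the multiplication first

print(top_level_call)
print(right_hand_call)
```

The outermost call is `*`, and the pipe sits inside its right-hand argument, which is why mutate() is applied to the resample() result alone unless parentheses are added as in Snippet 3.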
@rpruim
Contributor

rpruim commented Apr 17, 2017

  1. I recommend using rflip() rather than sampling from your coin object. That eliminates the need for your V1 ... V5 and the mutate().

  2. It isn't hard to make it into a tibble, but I'm not sure it is necessary either. Why do you want it to be a tibble?

  3. If you want different names you have to do something with different names. do() can't possibly guess what names you want. (How does it know whether what you are creating is an "outcome"?) These names are coming from R's default when you create a data frame from an object without names.

  4. Operator precedence isn't up to us. Parens are required in any expression where operator precedence doesn't give the result you want. (See https://stat.ethz.ch/R-manual/R-devel/library/base/html/Syntax.html.) In fact, I think R's order (%>% binds more tightly than *) is the one that is generally preferred, and what you are doing would work if you created a data frame rather than a vector along the way (but mutate() would be called each time through, not once at the end). Unfortunately (for this purpose), R puts vectors into columns rather than rows when you convert from vector to data frame with as.data.frame().
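
Points 3 and 4 can be illustrated with a small base-R sketch. It shows only how as.data.frame() treats a bare vector, a transposed vector, and a transposed named vector; it makes no claim about how do() builds its result internally. The names flip1 ... flip5 are made up for this illustration.

```r
flips <- c(0, 1, 1, 0, 1)

# A bare vector becomes a single column with one row per element.
as_column <- as.data.frame(flips)

# Transposing first gives one row; with no names available,
# R falls back to its default column names V1, V2, ...
as_row <- as.data.frame(t(flips))

# If the vector carries names, those names become the column names.
named_flips <- c(flip1 = 0, flip2 = 1, flip3 = 1, flip4 = 0, flip5 = 1)
as_named_row <- as.data.frame(t(named_flips))

print(dim(as_column))
print(names(as_row))
print(names(as_named_row))
```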

Again, the elegant solution here is rflip(), which is designed to work nicely with do().

do(3) * rflip(20)
##    n heads tails prop
## 1 20    11     9 0.55
## 2 20     9    11 0.45
## 3 20     6    14 0.30
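
For readers without mosaic installed, the same kind of simulation (count the heads in 5 flips, repeated 1000 times) can be sketched in base R with replicate(). This is an alternative for comparison, not how rflip() or do() are implemented, and the seed value is arbitrary.

```r
set.seed(641)  # arbitrary seed, only so the run is reproducible

coin <- c(0, 1)

# Each replication flips the coin 5 times and counts the heads (the 1s).
num_heads <- replicate(1000, sum(sample(coin, size = 5, replace = TRUE)))

# Collect the results in a data frame with an informative column name.
simulation <- data.frame(num_heads = num_heads)

print(table(simulation$num_heads))
```

Because the sum is taken inside each replication, there are no V1 ... V5 columns to add up afterwards.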

@rpruim rpruim closed this as completed Apr 17, 2017
@dtkaplan
Contributor

dtkaplan commented Apr 17, 2017 via email

@beanumber
Contributor

Yes! +1 for a version of favstats() that plays well with dplyr!

@rpruim
Contributor

rpruim commented Apr 17, 2017

I think that the %repeat% operator is overkill -- can't we just write a repeat() function and do

foo %>%
  bar() %>%
  foobar() %>%
  repeat(5000)

That would turn into

repeat(foobar(bar(foo)), 5000)

This should be relatively easy to do, but I think I would hold off on creating this until rlang hits CRAN (projected to be in a couple weeks).
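
A rough sketch of the idea, under several assumptions: the helper is called repeat_n() here (a hypothetical name, since `repeat` is a reserved word in R), it uses base substitute() and eval() rather than rlang, and it relies on the native pipe `|>` from R 4.1 or later, which rewrites `x |> f(n)` into `f(x, n)` at parse time so the whole upstream expression can be captured unevaluated. It is not the mosaic implementation.

```r
# Hypothetical helper: re-evaluate the captured expression n times.
repeat_n <- function(expr, n) {
  expr <- substitute(expr)   # capture the unevaluated call, not its value
  env <- parent.frame()      # evaluate where the caller's variables live
  vapply(seq_len(n), function(i) eval(expr, env), numeric(1))
}

coin <- c(0, 1)

# The native pipe turns this into
# repeat_n(sum(sample(coin, size = 5, replace = TRUE)), 1000)
num_heads <- sample(coin, size = 5, replace = TRUE) |>
  sum() |>
  repeat_n(1000)

print(table(num_heads))
```

Each of the 1000 evaluations draws a fresh sample, because the expression is captured before it is ever run. This sketch handles only expressions that return a single number.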

@rpruim
Contributor

rpruim commented Apr 17, 2017

The problem with dplyr and favstats() is that dplyr doesn't support adding multiple variables at once with mutate() or summarise(). This is a drag for lots of applications. Here's a particular example: for plotting confidence bands, it is nice to add ribbons showing upper and lower extents. Functions that compute intervals compute both limits at once and return two values. So essentially, you have to run the code twice, once extracting the upper limit and once the lower limit.

This has been discussed since at least 2013, but I hear the current status is: "That's on the shortlist for the next release: tidyverse/dplyr#2326". See mine-cetinkaya-rundel/datafest#2.

I think that issue is better left with dplyr as it is rather orthogonal to mosaic (but would be useful in combination with favstats()).
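
To make the "both limits at once" point concrete, here is a base-R sketch that calls an interval function once per group and keeps both returned values as columns. The data, the grouping variable g, and the helper limits() are all made up for this illustration; it sidesteps dplyr entirely rather than showing anything dplyr provides.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Made-up data: two groups of 50 measurements each.
d <- data.frame(g = rep(c("a", "b"), each = 50), y = rnorm(100))

# Hypothetical helper that returns both interval limits from one computation.
limits <- function(y) {
  ci <- t.test(y)$conf.int
  c(lower = ci[1], upper = ci[2])
}

# One call per group; each call contributes a full row (lower and upper).
per_group <- lapply(split(d$y, d$g), limits)
bands <- data.frame(g = names(per_group), do.call(rbind, per_group), row.names = NULL)

print(bands)
```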

@rpruim
Contributor

rpruim commented Apr 17, 2017

I opened the "chaining do" part of this as a separate issue, #642, but note that we can't use "repeat" as the function name since it is a reserved keyword in R.

@dtkaplan
Contributor

dtkaplan commented Apr 17, 2017 via email

@rpruim
Contributor

rpruim commented Apr 17, 2017

Please move discussion of this to #642
