Data frame output of do() * resample() #641

rudeboybert · 2017-04-16T18:43:59Z

Some questions:

1. In snippet 1, would it be hard to have the simulation output be in tibble format instead of just a raw data frame?
2. In snippet 1, the column names of simulation are V1 thru V5, which aren't informative. Would it be worth the real estate to rename these outcome1 thru outcome5 instead?
3. Snippet 2 doesn't work, as the pipe only recognizes the resample() and not the do(). Snippet 3 rectifies this by putting parentheses around the statement. Is there a more elegant solution than this? Inevitably all students attempt Snippet 2.

library(mosaic)
library(dplyr)

coin <- c(0, 1)

# Snippet 1: Simulate 5 coin flips 1000 times
simulation <- do(1000) * resample(coin, size=5, replace=TRUE) 
simulation

# Snippet 2: Doesn't work
simulation <- do(1000) * resample(coin, size=5, replace=TRUE) %>% 
  mutate(num_heads = V1+V2+V3+V4+V5)

# Snippet 3: Works
simulation <- (do(1000) * resample(coin, size=5, replace=TRUE)) %>% 
  mutate(num_heads = V1+V2+V3+V4+V5)


rpruim · 2017-04-17T00:31:53Z

I recommend using rflip() rather than sampling from your coin object. That eliminates the need for your V1 ... V5 and the mutate()
It isn't hard to make it into a tibble, but I'm not sure it is necessary either. Why do you want it to be a tibble?
If you want different names you have to do something with different names. do() can't possibly guess what names you want. (How does it know whether what you are creating is an "outcome"?) These names are coming from R's default when you create a data frame from an object without names.
Operator precedence isn't up to us. Parens are required in any expression where operator precedence doesn't give the result you want. (See https://stat.ethz.ch/R-manual/R-devel/library/base/html/Syntax.html.) In fact, I think the order is the one that is generally preferred, and what you are doing would work if you created a data frame rather than a vector along the way (but mutate would be called each time through, not once at the end). Unfortunately (for this purpose), R puts vectors into columns rather than rows when you convert from vector to data frame with as.data.frame().
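The precedence point can be checked without mosaic or magrittr. The following is a minimal sketch in base R; `%then%` is a hypothetical stand-in for `%>%` (all `%xyz%` operators share the same precedence, which is higher than that of `*`), and the numbers are arbitrary.

```r
# `%then%` is a hypothetical stand-in for magrittr's `%>%`: it simply applies
# the function on its right to the value on its left.
`%then%` <- function(lhs, f) f(lhs)

# Ask the parser which call is outermost when no parentheses are used.
# Because %xyz% operators bind tighter than `*`, the outermost call is `*`.
outer_call <- as.character(quote(2 * 3 %then% sqrt)[[1]])
outer_call

unparenthesized <- 2 * 3 %then% sqrt     # parsed as 2 * (3 %then% sqrt)
parenthesized   <- (2 * 3) %then% sqrt   # parsed as sqrt(2 * 3)
```

This is the same shape as Snippet 2 versus Snippet 3: without parentheses, only the right operand of `*` is piped.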

Again, the elegant solution here is rflip(), which is designed to work nicely with do().

do(3) * rflip(20)
##    n heads tails prop
## 1 20    11     9 0.55
## 2 20     9    11 0.45
## 3 20     6    14 0.30
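On the naming and column-versus-row points above, here is a small base R sketch. It only mimics the situation (an unnamed vector being turned into a data frame) and is not a description of how `do()` is implemented; the seed and the `outcome` names are arbitrary choices for illustration.

```r
set.seed(641)                          # arbitrary seed, for reproducibility only
coin  <- c(0, 1)
flips <- sample(coin, size = 5, replace = TRUE)

as_column <- as.data.frame(flips)      # vector becomes one column with 5 rows
as_row    <- as.data.frame(t(flips))   # transposing first gives 1 row, 5 columns

# Columns of an unnamed matrix get R's default names V1, V2, ...
default_names <- names(as_row)

# Renaming is explicit: the user decides what the columns mean.
names(as_row) <- paste0("outcome", seq_along(flips))
as_row$num_heads <- sum(flips)
```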

dtkaplan · 2017-04-17T02:40:19Z

As you say, Snippet 3 works but students will want to write along the lines of Snippet 2. The whole point of `do()` is to make things easier for students and to minimize what they need to know about language internals. In general, `mosaic` doesn't interoperate fluently with `dplyr`. Not surprising, because `mosaic` was designed several years before `dplyr`.

My intuition is that it would be too intrusive to change `mosaic`. While I do think that `dplyr` is the way things are going to go, I don't know yet how many `mosaic` instructors are eager to adopt that way of doing things. While we wait to find out where the need will be, I've been experimenting with a mosaic-inspired system designed from the beginning to work well with pipe notation and `dplyr`. This is in an undocumented state just now, but available as

```r
devtools::install_github("dtkaplan/statPREP")
```

One example is

```{r}
KidsFeet %>% qstats(width ~ sex, mean)
```

`qstats()` is quite a lot like `mosaic::favstats()`, but lets one specify which statistics to calculate. (The default is just like `favstats()`.) Wrapping up statistics like the mean inside a function like `qstats()` avoids the issue of overwriting base functions like `mean()` and provides a ready way around the `na.rm = TRUE` craziness.

As regards pipes and `do()`, there's the problem with Snippet 3 vs Snippet 2. But `do()` in general produces data frames, so the only remaining incompatibility is with the `*` notation rather than `%>%`. Unfortunately, `%>%` wasn't written with `do()`-type behavior in mind.

In thinking about a possible pipe-friendly operator, we could do something like the following:

```r
`%repeat%` <- function(lhs, rhs) {
  parent <- parent.frame()
  env <- new.env(parent = parent)
  chain_parts <- magrittr:::split_chain(match.call(), env = env)
  expression <- chain_parts$lhs[[2]]
  ntimes <- chain_parts$lhs[[3]]
  tibble::as.tibble(do(ntimes) * expression)
}
```

This would operate like:

```
foo <- resample(KidsFeet) %>%
  mutate(length = width/length) %>%
  lm(width ~ length, data = .) %repeat% 5

foo

# A tibble: 5 × 9
  Intercept    length     sigma r.squared         F numdf
      <dbl>     <dbl>     <dbl>     <dbl>     <dbl> <dbl>
1  5.448168  9.983311 0.4099097 0.1625396  7.181196     1
2  1.667029 20.188234 0.4215703 0.4192923 26.715361     1
3  2.930282 16.662907 0.5027827 0.2609576 13.064785     1
4  4.412130 12.417105 0.5196286 0.1459229  6.321614     1
5  4.824011 11.454969 0.4588435 0.1197619  5.034079     1
# ... with 3 more variables: dendf <dbl>, .row <int>,
#   .index <dbl>
```

Or,

```
resample(KidsFeet) %>%
  mutate(length = width/length) %>%
  lm(width ~ length, data = .) %repeat% 150 %>%
  filter(F > 30)

# A tibble: 7 × 9
  Intercept   length     sigma r.squared        F numdf dendf
      <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <dbl> <dbl>
1 1.6673867 20.02156 0.4000387 0.4755288 33.54724     1    37
2 1.2889138 21.03878 0.4453342 0.4660632 32.29659     1    37
3 0.3386881 23.35332 0.4594237 0.4762086 33.63881     1    37
4 3.0682568 16.12929 0.3333917 0.4508715 30.37950     1    37
5 2.2112310 18.46778 0.3234404 0.5081533 38.22669     1    37
6 2.0131434 18.72017 0.3895339 0.4940942 36.13615     1    37
7 1.9626714 19.13557 0.4006306 0.4846271 34.79268     1    37
# ... with 2 more variables: .row <int>, .index <dbl>
```

I don't know whether this `%repeat%` operator is a good idea, but perhaps it's something to think about.


-- ... DeWitt Wallace Professor of Mathematics, Statistics, and Computer Science Macalester College

beanumber · 2017-04-17T03:03:07Z

Yes! +1 for a version of favstats() that plays well with dplyr!

rpruim · 2017-04-17T13:34:06Z

I think that the %repeat% operator is overkill -- can't we just write a repeat() function and do

foo %>%
  bar() %>%
  foobar() %>%
  repeat(5000)

That would turn into

repeat(foobar(bar(foo)), 5000)

This should be relatively easy to do, but I think I would hold off on creating this until rlang hits CRAN (projected to be in a couple weeks).

rpruim · 2017-04-17T13:46:22Z

The problem with dplyr and favstats() is that dplyr doesn't support adding multiple variables at once with mutate() or summarise(). This is a drag for lots of applications. Here's a particular example: for plotting confidence bands, it is nice to add ribbons showing the upper and lower extents. Functions that compute intervals compute both limits at once and return two values, so you essentially have to run the code twice, once extracting the upper limit and once the lower limit.
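The "run the code twice" workaround described above might look like the sketch below. `mean_ci()` is a hypothetical helper written for this illustration (not part of mosaic or dplyr), and `mtcars` is used only because it ships with R.

```r
library(dplyr)

# Hypothetical helper: returns BOTH limits of a 95% t-interval for the mean.
mean_ci <- function(x) as.numeric(t.test(x)$conf.int)

limits <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    lower = mean_ci(mpg)[1],   # interval computed once for the lower limit ...
    upper = mean_ci(mpg)[2]    # ... and computed again for the upper limit
  )
```

Each summary column has to be a single value per group here, so the two-element result of `mean_ci()` is indexed separately for each column.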

This has been discussed since at least 2013, but I hear it is now "That's on the shortlist for the next release: tidyverse/dplyr#2326". See mine-cetinkaya-rundel/datafest#2

I think that issue is better left with dplyr, as it is rather orthogonal to mosaic (but would be useful in combination with favstats()).

rpruim · 2017-04-17T13:51:36Z

I opened the "chaining do" part of this as a separate issue #642 -- but note that we can't use "repeat" since it is a keyword.

dtkaplan · 2017-04-17T14:29:39Z

I certainly agree that we should wait for `rlang`. (Also, that gets me past a very busy work period!) I don't know enough about `magrittr` to write a function such as the `repeat(5000)` imagined here. The problem for me is getting the unevaluated LHS of the pipe --- everything to the left of `repeat(5000)`. That's the only reason I put things in a `magrittr`-like operator. Nomination for name: `trials()`.
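For the nested-call form mentioned earlier in the thread, capturing the unevaluated expression is possible with base R's `substitute()`. The following is only a rough sketch under that assumption: `trials()` borrows the name nominated above but is hypothetical, not an existing mosaic function, and it stacks results with `rbind()` rather than using `do()`.

```r
# Hypothetical sketch of a function-style repeater (not mosaic code).
trials <- function(expr, n) {
  captured <- substitute(expr)   # the argument as written, not yet evaluated
  env <- parent.frame()          # evaluate it where the caller's objects live
  rows <- lapply(seq_len(n), function(i) eval(captured, env))
  do.call(rbind, rows)           # stack the n results, one per row
}

coin <- c(0, 1)
sims <- trials(sample(coin, size = 5, replace = TRUE), 1000)
dim(sims)
```

This works when `trials()` is called directly. Whether the same trick works at the end of a pipe depends on whether the pipe hands over its left-hand side unevaluated, which is the open question raised in this comment.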


rpruim · 2017-04-17T14:33:58Z

Please move discussion of this to #642

rudeboybert changed the title ~~Data frame output of do() * resample()~~ Data frame output of do() * resample() Apr 16, 2017

rpruim closed this as completed Apr 17, 2017

rpruim mentioned this issue Apr 17, 2017

Add "repeat()" function for chaining version of do() #642

Closed
