Data frame output of do() * resample() #641
`do() * resample()`

Again, the elegant solution here is `do(3) * rflip(20)`:

    ##    n heads tails prop
    ## 1 20    11     9 0.55
    ## 2 20     9    11 0.45
    ## 3 20     6    14 0.30

|
As you say, Snippet 3 works but students will want to write along the lines
of Snippet 2. The whole point of `do()` is to make things easier for
students and to minimize what they need to know about language internals.
In general, `mosaic` doesn't interoperate fluently with `dplyr`. Not
surprising, because `mosaic` was designed several years before `dplyr`.
My intuition is that it would be too intrusive to change `mosaic`. While I
do think that `dplyr` is the way things are going to go, I don't know yet
how many `mosaic` instructors are eager to adopt that way of doing things.
While we wait to find out where the need will be, I've been experimenting
with a mosaic-inspired system designed from the beginning to work well with
pipe notation and `dplyr`. This is in an undocumented state just now, but
available as
```r
devtools::install_github("dtkaplan/statPREP")
```
One example is
```{r}
KidsFeet %>% qstats(width ~ sex, mean)
```
`qstats()` is quite a lot like `mosaic::favstats()`, but lets one specify
which statistics to calculate. (The default is just like `favstats()`.)
Wrapping up statistics like the mean inside a function like `qstats()`
avoids the issue of overwriting base functions like `mean()` and provides a
ready way around the `na.rm = TRUE` craziness.
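To make the wrapping idea concrete, here is a minimal base-R sketch. It is not the `statPREP` implementation: `group_stats()` and the small `feet` data frame are hypothetical names invented for illustration. It assumes a two-variable formula of the form `response ~ group`, with the statistics passed as named functions.

```r
# Hypothetical helper, NOT the statPREP qstats() implementation.
# Assumes a formula of the form response ~ group.
group_stats <- function(data, formula, ...) {
  stats <- list(...)                 # named list of summary functions
  vars <- all.vars(formula)          # c(response, grouping variable)
  groups <- split(data[[vars[1]]], data[[vars[2]]])
  rows <- lapply(groups, function(x) {
    x <- x[!is.na(x)]                # drop missing values once, in one place
    vapply(stats, function(f) f(x), numeric(1))
  })
  out <- data.frame(names(groups), do.call(rbind, rows))
  names(out) <- c(vars[2], names(stats))
  rownames(out) <- NULL
  out
}

# Made-up data for illustration only.
feet <- data.frame(
  sex   = c("B", "B", "G", "G", "G"),
  width = c(9.0, NA, 8.5, 9.5, NA)
)
result <- group_stats(feet, width ~ sex, mean = mean, n = length)
print(result)
```

Because the missing values are removed inside the wrapper, the caller never passes `na.rm = TRUE`, and base functions such as `mean()` are used as-is rather than overwritten.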
As regards pipes and `do()`, there's the problem with Snippet 3 vs Snippet
2. But `do()` in general produces data frames, so the only remaining
incompatibility is with the `*` notation rather than `%>%`. Unfortunately,
`%>%` wasn't written with `do()`-type behavior in mind.
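The difference between Snippet 2 and Snippet 3 comes down to R's operator precedence: special `%...%` operators bind more tightly than `*`. The sketch below shows this in base R, using a hypothetical `%then%` operator as a stand-in for `%>%` so that no packages are needed.

```r
# Hypothetical stand-in for a pipe, defined locally so no packages are needed.
`%then%` <- function(lhs, f) f(lhs)
add_one <- function(x) x + 1

# Special %...% operators bind more tightly than `*`, so this is
# 10 * (1 %then% add_one): only the right-hand operand is piped.
unparenthesized <- 10 * 1 %then% add_one

# Parentheses force the product to be computed first, as in Snippet 3.
parenthesized <- (10 * 1) %then% add_one

# The parse tree shows the same thing: the top-level call is `*`.
top_level <- as.character(quote(a * b %then% f)[[1]])

print(c(unparenthesized = unparenthesized, parenthesized = parenthesized))
print(top_level)
```

By the same rule, in Snippet 2 the pipe receives only the output of `resample()`, not the data frame built by `do(1000) * ...`.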
In thinking about a possible pipe-friendly operator, we could do something
like the following:
```r
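`%repeat%` <- function(lhs, rhs) {
  # A fresh environment, child of the caller's frame, handed to split_chain().
  parent <- parent.frame()
  env <- new.env(parent = parent)
  # split_chain() is an internal, unexported magrittr helper, so this
  # may break in magrittr versions that do not provide it.
  chain_parts <- magrittr:::split_chain(match.call(), env = env)
  expression <- chain_parts$lhs[[2]]  # taken here to be the pipeline expression
  ntimes <- chain_parts$lhs[[3]]      # taken here to be the repetition count
  # as_tibble() replaces the deprecated as.tibble() alias.
  tibble::as_tibble(do(ntimes) * expression)
}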
```
This would operate like:
```
foo <-
  resample(KidsFeet) %>%
  mutate(length = width/length) %>%
  lm(width ~ length, data = .) %repeat% 5
foo
# A tibble: 5 × 9
Intercept length sigma r.squared F numdf
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5.448168 9.983311 0.4099097 0.1625396 7.181196 1
2 1.667029 20.188234 0.4215703 0.4192923 26.715361 1
3 2.930282 16.662907 0.5027827 0.2609576 13.064785 1
4 4.412130 12.417105 0.5196286 0.1459229 6.321614 1
5 4.824011 11.454969 0.4588435 0.1197619 5.034079 1
# ... with 3 more variables: dendf <dbl>, .row <int>,
# .index <dbl>
```
Or,
```
resample(KidsFeet) %>%
  mutate(length = width/length) %>%
  lm(width ~ length, data = .) %repeat% 150 %>%
  filter(F > 30)
# A tibble: 7 × 9
Intercept length sigma r.squared F numdf dendf
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.6673867 20.02156 0.4000387 0.4755288 33.54724 1 37
2 1.2889138 21.03878 0.4453342 0.4660632 32.29659 1 37
3 0.3386881 23.35332 0.4594237 0.4762086 33.63881 1 37
4 3.0682568 16.12929 0.3333917 0.4508715 30.37950 1 37
5 2.2112310 18.46778 0.3234404 0.5081533 38.22669 1 37
6 2.0131434 18.72017 0.3895339 0.4940942 36.13615 1 37
7 1.9626714 19.13557 0.4006306 0.4846271 34.79268 1 37
# ... with 2 more variables: .row <int>, .index <dbl>
```
I don't know whether this `%repeat%` operator is a good idea, but perhaps
it's something to think about.
…On Sun, Apr 16, 2017 at 1:43 PM, Albert Y. Kim ***@***.***> wrote:
Some questions:
1. In snippet 1, would it be hard to have the simulation output be in
tibble format instead of just raw data frame?
2. In snippet 1, the column names of simulation are V1 thru V5, which
aren't informative. Would it be worth the real estate to rename these
outcome1 thru outcome5 instead?
3. Snippet 2 doesn't work as the pipe only recognizes the resample()
and not the do(). Snippet 3 rectifies this by having parentheses
around the statement. Is there a more elegant solution than this?
Inevitably all students attempt Snippet 2.

    library(mosaic)
    library(dplyr)
    coin <- c(0, 1)

    # Snippet 1: Simulate 5 coin flips 1000 times
    simulation <- do(1000) * resample(coin, size=5, replace=TRUE)
    simulation

    # Snippet 2: Doesn't work
    simulation <- do(1000) * resample(coin, size=5, replace=TRUE) %>%
      mutate(num_heads = V1+V2+V3+V4+V5)

    # Snippet 3: Works
    simulation <- (do(1000) * resample(coin, size=5, replace=TRUE)) %>%
      mutate(num_heads = V1+V2+V3+V4+V5)
--
...
DeWitt Wallace Professor of Mathematics, Statistics, and Computer Science
Macalester College
|
Yes! +1 for a version of |
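I think that the `%repeat%` operator is overkill -- can't we just write a `repeat()` function and do

    foo %>%
      bar() %>%
      foobar() %>%
      repeat(5000)

That would turn into

    repeat(foobar(bar(foo)), 5000)

This should be relatively easy to do, but I think I would hold off on creating this until `rlang` hits CRAN (projected to be in a couple weeks). |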
The problem with This has been discussed since at least 2013, but I hear it is now "That's on the shortlist for the next release: tidyverse/dplyr#2326". See mine-cetinkaya-rundel/datafest#2 I think that issue is better left with |
I opened the "chaining do" part of this as a separate issue #642 -- but note that we can't use "repeat" since it is a keyword. |
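A short base-R sketch of why the name is unavailable: `repeat` is a reserved word that introduces a loop, so `repeat(5000)` parses as a loop whose body is `(5000)` rather than as a function call with the argument `5000`. Nothing below is evaluated as a loop; the code only inspects the parse tree.

```r
# Capture the code without running it (running it would loop forever).
loop_call <- quote(repeat(5000))

# The head of the expression is the `repeat` keyword itself ...
is_loop <- identical(loop_call[[1]], as.name("repeat"))

# ... and its single operand is the parenthesized body `(5000)`.
body_is_parens <- identical(loop_call[[2]], quote((5000)))

print(c(is_loop = is_loop, body_is_parens = body_is_parens))
```

A function could still be defined and called with backticks, but needing backticks would defeat the purpose of a student-friendly name.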
I certainly agree that we should wait for `rlang`. (Also, that gets me past
a very busy work period!)
I don't know enough about `magrittr` to write a function such as the
`repeat(5000)` imagined here. The problem for me is getting the
unevaluated LHS of the pipe: everything to the left of `repeat(5000)`.
That's the only reason I put things in a `magrittr`-like operator.
Nomination for name: `trials()`.
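As a rough illustration of the "unevaluated expression" problem outside of any pipe, here is a minimal base-R sketch of what a `trials()` function could look like when called directly. This is an assumption-laden sketch, not a proposed `mosaic` API: it only handles expressions that return a single number, and it sidesteps the pipe question entirely by taking the expression as its first argument.

```r
# Hypothetical sketch of trials(); not part of mosaic or magrittr.
# Assumes the expression returns a single numeric value each time.
trials <- function(expr, n) {
  expr <- substitute(expr)   # capture the code itself, not its value
  env <- parent.frame()      # evaluate where the caller's variables live
  vapply(seq_len(n), function(i) eval(expr, env), numeric(1))
}

coin <- c(0, 1)
set.seed(1)
# Number of heads in 5 flips, repeated 1000 times.
heads <- trials(sum(sample(coin, size = 5, replace = TRUE)), 1000)
print(table(heads))
```

The key step is `substitute()`, which keeps the expression unevaluated so that it can be re-run on every trial. Whether the same trick works on the left-hand side of a pipe depends on how the pipe evaluates its arguments, which is the difficulty described above.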
…On Mon, Apr 17, 2017 at 8:34 AM, Randall Pruim ***@***.***> wrote:
I think that the %repeat% operator is overkill -- can't we just write a
repeat() function and do
foo %>%
bar() %>%
foobar() %>%
repeat(5000)
That would turn into
repeat(foobar(bar(foo)), 5000)
This should be relatively easy to do, but I think I would hold off on
creating this until rlang hits CRAN (projected to be in a couple weeks).
|
Please move discussion of this to #642 |
Some questions:
1. In Snippet 1, would it be hard to have the `simulation` output be in `tibble` format instead of just raw data frame?
2. In Snippet 1, the column names of `simulation` are `V1` thru `V5`, which aren't informative. Would it be worth the real estate to rename these `outcome1` thru `outcome5` instead?
3. Snippet 2 doesn't work as the pipe only recognizes the `resample()` and not the `do()`. Snippet 3 rectifies this by having parentheses around the statement. Is there a more elegant solution than this? Inevitably all students attempt Snippet 2.