
removed some dollar signs in intro to r lab #9

Closed
wants to merge 21 commits

Conversation

@AmeliaMN (Contributor) commented Sep 8, 2015

I've made some minor changes that may have major philosophical repercussions! In particular, I have replaced all the usages of the dollar sign, e.g. arbuthnot$boys, with dplyr commands, like select(arbuthnot, boys). This means taking out the references to vectors in this lab, as well as the expository text about what the wrapped display of vectors means.

@rpruim commented Sep 9, 2015

From a quick glance, it looks like you are only doing this to view the data, is that correct? This matters because you are now getting a data frame rather than a vector (that's why the display is different). Any computations done with these objects might break, but if you are only using this for the sake of viewing, that doesn't matter.
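
For illustration, a minimal sketch of that difference (assuming dplyr and the lab's arbuthnot data frame are loaded; the exact classes may vary a bit by dplyr version):

library(dplyr)
class(arbuthnot$boys)           # a bare vector ("integer" or "numeric")
class(select(arbuthnot, boys))  # a one-column data frame (data.frame or tbl_df, depending on the input)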

Have you considered using

arbuthnot %>% 
  select(boys)

instead? This makes it easier to read and to perform multiple operations.

@AmeliaMN (Contributor Author)

I like that! It probably means adding some expository text about the pipe operator, but nothing wrong with getting students thinking about it in the first week.

@rudeboybert (Collaborator)

+1 on teaching piping!

@rpruim commented Sep 10, 2015

If you do it, I suggest the following code formatting:

NewData <-
  OriginalData %>%
  first_operation( ... ) %>%
  second_operation(...) %>%
  etc., etc., etc.

The first line can be omitted if there is no need to store the result.
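
For instance, a minimal sketch of this layout with real dplyr verbs (names and numbers are illustrative; assumes dplyr and the lab's arbuthnot data frame are loaded):

library(dplyr)

# with assignment: the new name sits alone on the first line
arbuthnot_totals <-
  arbuthnot %>%
  mutate(total = boys + girls) %>%
  filter(total > 11000)

# just looking at the result: same chain, first line omitted
arbuthnot %>%
  mutate(total = boys + girls) %>%
  filter(total > 11000)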

@rudeboybert (Collaborator)

Also the use of the . operator when piping to an argument other than the first:

NewData <-
  OriginalData %>%
  first_operation( argument1, . ) %>%
  second_operation( argument1, argument2, . )
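
A concrete sketch of the same idea (assumes dplyr and the arbuthnot data frame; lm() is just an example of a function whose first argument is not the data, so the piped data frame has to be passed explicitly via .):

library(dplyr)

fit <-
  arbuthnot %>%
  filter(year >= 1700) %>%
  lm(boys ~ year, data = .)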

@andrewpbray (Collaborator)

Though we haven't made the commits to this repo, Mine and I both taught the Intro to Data lab in a way that introduces the pipe operator. @rpruim, do you have a reason for putting nothing but the assignment on the first line? I really like this code block that you provided, though, with the general function names. I think I'll bring it up in class tomorrow to show a general form of what we've been doing.

One thing that I've realized needs to be addressed when teaching chaining to new users is spacing/line breaks. New users can develop all sorts of superstitions surrounding why code runs or doesn't run, so it can be useful to explicitly address what will and will not break the code in terms of formatting. This is another topic for tomorrow's class =)
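
For example, a sketch of that formatting point (assumes dplyr and the arbuthnot data frame; the broken variant is left commented out):

library(dplyr)

# Works: the line ends with %>%, so R knows the command continues
arbuthnot %>%
  mutate(total = boys + girls)

# Would NOT work: the first line below is already a complete command,
# so a leading %>% on the next line is a syntax error
# arbuthnot
#   %>% mutate(total = boys + girls)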

@andrewpbray (Collaborator)

Btw, what's the best practice for handling a pull request like this that will likely be updated (in this case for the piping format of select())? If @AmeliaMN submits a new pull request, will it overwrite this one? Or would I "close and comment" this thread/pull request and then we start afresh with the new request?

@rpruim commented Sep 10, 2015

@andrewpbray,

The assignment (in the sense of naming the variable that will contain the results) is on the first line. And it is the only thing on that line. Here's why I like it:

  • If you chain more than one operation together and put the start of the chain on the assignment line, then the rhs of that first line would not be the value actually being assigned, so it's confusing. (For short, one-operation things, I sometimes collapse to a single line.)
  • I like to have all of the rhs operations vertically aligned and on an equal footing. This makes it very easy to read off the sequence of operations.
  • To span multiple lines, no line may complete a command syntactically, so emphasizing a return after the assignment arrow and after each %>% reduces the chance that a student will end a line in a way that changes the meaning.
  • It fights against rightward drift and overly wide lines of code.
  • If you start out experimenting without the assignment (just looking at output), it is easy to go back and add the assignment by inserting one additional line.
  • I like how it looks.

By the way, when I use commands like mutate(), if there is more than one variable, I like to put each of those on a separate line as well:

NewData <-
  OldData %>%
  mutate( 
    x = foo(a, b, pretty = TRUE),
    y = bar(a, b, pretty = FALSE)) %>%
  additional_operation( ... )

Last note: I have gotten more and more into the habit of inserting a line break after the assignment arrow whenever what is being assigned takes more than one line of code.

x <-
  multiline_stuff( ...,
    ...)

I'm torn about whether the final closing paren should be on a line by itself or at the end of the last "action" line. I like the former for style, but I don't like using up the extra vertical line.

@andrewpbray (Collaborator)

@rpruim
All compelling points - I'm convinced!

@rpruim commented Sep 10, 2015

@andrewpbray,

Regarding spaces: I'm trying to retrain myself to always put spaces around operators like +, -, *, =, etc. It's not quite in my fingers yet, so I often have to do a consistency edit of my code. (@hadley requires this of any code contributed to his projects, and I think I've decided it adds enough by way of readability to adopt it.)

I'm a bit inconsistent about spaces inside parens. Most people don't put a space after ( or before ), but I find that sometimes it is very useful. I judge that by eye mostly and often use alternation to help clarify nested parens or just to help things like ~ or - be more easily seen:

  far( boo(foo( bar(x, y) )) )     # spaces help match up parens -- emphasize boo(foo())
  far(boo( foo(bar(x, y)) ))       # spaces help match up parens -- emphasize far(boo()) and foo(bar())
  # in the above examples, I might just use chaining now

  histogram( ~ x,  data = mydata)  # torn about whether there should be a space 
                                   # before closing paren here

@rpruim commented Sep 10, 2015

@andrewpbray,

Regarding pull requests, I'm not sure I understand what "like this" means or what you are concerned about, so I can't answer what the best practice would be.

It's probably best for people who are working on different things to keep them on separate branches and issue a pull request from each branch when they are ready. On your end, from the commandline you can choose to fetch rather than pull (pull = fetch + merge) if you want to investigate the pull request without merging it into your current work. At that point, you can decide whether or not to merge, cherrypick, discard, etc. the work that was done in that branch.

If you want a visual way to "see" how git works, I recommend this visualization tool and tutorial:

http://pcottle.github.io/learnGitBranching/

@andrewpbray (Collaborator)

@rpruim ,

By "like this", I mean a pull request where the subsequent discussion has made it clear that further modifications will be made. With this one, for example, I could either go ahead and merge the change (using select(arbuthnot, boys)) and then expect another pull request from Amelia that would change it over to the piping syntax. Or I could wait to merge until that newer version arrives. Maybe another way to phrase the question is whether it makes more sense to merge files that have all of the commits, as opposed to spreading those commits across multiple merges.

The main reason I'm wondering is that in this GitHub interface, it seems like this discussion is tied to the first pull request. I'd kinda like it if a new pull request (including the subsequent commits) could happen in the context of this same thread, but I don't think that's possible.

@AmeliaMN (Contributor Author)

I think that can indeed happen (I recently submitted a rather stupid pull request somewhere else, had some discussion, made another pull request and everything got tied together). Don't have time to try right now, but in the next 24 hours I'll attempt it.

~Amelia


@rpruim commented Sep 10, 2015

That sounds right. Until the pull requester gets rid of the pull request, it hangs around as a branch; more work can be done along the branch, and that work can (I think) be fetched at any time. We'll have to see how Amelia's experiment goes in terms of the GitHub features. But if it is just a branch, you can probably pull it and then pull it again if more work is done.

But best to keep pull requests fairly limited in scope anyway, else they get harder to deal with -- especially if you want some, but not all, of the changes in the pull request.

@AmeliaMN (Contributor Author)

Okay, as expected my new commits just got added into this long thread. I briefly explained the %>% and . but I think a more thoughtful job could be done.

@hadley commented Sep 10, 2015

It's easier to understand once you realise that pull request = branch. If you push to the same branch, the PR gets updated.

devtools::install_github("andrewpbray/oilabs")
install.packages(c("dplyr", "ggplot2"))

You could simplify this a bit by making dplyr and ggplot2 dependencies (even if slightly spurious) of the oilabs package.

@AmeliaMN (Contributor Author)

The big picture here seems the same as what we've been running into in many places -- just changing from base R to dplyr/ggplot2 is not the way to go. We need to think more deeply about what should really be in an introductory lab.

@beanumber and I are doing dplyr-flavored mosaic labs (with lattice graphics) in our intro courses this semester. Like I said above, I've done the first lab, which included arbuthnot$boys, with(arbuthnot, boys > girls), mutate(arbuthnot, total = boys + girls), and xyplot(boys + girls ~ year, data = arbuthnot). This is TOO much syntax for students, particularly for day one. I don't have a strong philosophical stance about which direction we go, but I think that there needs to be a much smaller range of "recipes" for code that students see.

@nicholasjhorton
I heartily concur.

I'd suggest taking another look at the Minimal R vignette that Randy crafted: there's a lot to be said about the "Less is More" philosophy that he espouses for intro students.

I'm working on an even more minimal vignette, which could be more useful for folks in community colleges or high schools.

Just my $0.02,

Nick


@AmeliaMN (Contributor Author)

There was a short offline conversation about with() on email, which I wanted to open to the broader audience. Many things which can be done with dplyr functions could also be done using with() and base R functions. It might depend on the branch, but I think that at least in the context of this dplyr/ggplot2 version of oiLabs it probably makes sense not to use with().

@beanumber and I are talking about a dplyr/mosaic/lattice version (we've started this a bit this semester), and in that branch it might make sense to use with(), although I'm not crazy about it. I know that this thread isn't exactly the right place for this particular discussion, but ah well.
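
To make the trade-off concrete, a minimal sketch of the two styles for the same task (assumes dplyr and the lab's arbuthnot data frame are loaded; the column name is illustrative):

library(dplyr)

# base R with(): evaluates an expression inside the data frame, returns a bare vector
with(arbuthnot, boys > girls)

# dplyr mutate(): appends the result as a new column of the data frame
arbuthnot %>% mutate(more_boys = boys > girls)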

@rpruim commented Sep 11, 2015

I concur with @nicholasjhorton that you don't want to merely "translate" from one toolkit to another at a micro level. And big thumbs up to doing a "Less Volume, More Creativity / Minimal R" exercise. In this exercise, you cull out all of the R code for a course and organize it to see what you are using, with the goal of reducing it to a small, coherent set that gets the job done.

@AmeliaMN, I'd like to see the use cases where you think you need with(). I almost never use it. I don't think I have ever showed it to students.

@hadley commented Sep 11, 2015

Also, if you think there are important use cases that are hard to do in dplyr, I'd really like to hear about them!

@beanumber (Collaborator)

The injection of with() was simply to avoid the dollar sign at a point in time before dplyr was moved in mosaic. The issue was that the lab as originally written was creating new vectors outside the data frame. I think the right way to do it now is to use mutate() and append new columns to the data frame.

Also, no one has mentioned the %<>% operator from magrittr. Why does this not come with dplyr? Is it too much to introduce another operator? I like the way it obviates the need to type the name of the data frame twice.
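
A quick sketch of what that looks like (assumes dplyr, magrittr, and the arbuthnot data frame; the new column name is just for illustration):

library(dplyr)
library(magrittr)   # %<>% comes from magrittr, not dplyr

# without %<>%: the data frame name is typed twice
arbuthnot <- arbuthnot %>% mutate(total = boys + girls)

# with %<>%: pipe and re-assign in one step
arbuthnot %<>% mutate(total = boys + girls)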

@rpruim commented Sep 11, 2015

I'm not a fan of %<>%.

  • doesn't conjure up the notion of assignment to me
  • easily confused with %>%
  • a symmetric symbol for a non-symmetric operation (I'll grant you that there is some flow in each direction, but it is still not a symmetric operation).
  • it doesn't add any use cases or reduce any overhead since we still need %>% and assignment (<- or =). (More Volume, Equal Creativity)

I don't think saving a few characters of typing is worth this. Does anyone else use it?

@hadley commented Sep 11, 2015

I'm not a fan of %<>% either - it just feels a bit too magical to me. Assignment is sufficiently important (you're changing the state of the world!) that it's worth a little extra typing.

@andrewpbray (Collaborator)

This afternoon in class I reviewed the chaining syntax that we covered in the Intro to Data lab on Tuesday. I found myself talking about a data analysis pipeline and actually drawing a Super Mario Brothers style pipe with "Raw Data" at one end and at the other end a list of possible "Data Products". Those included

  1. Processed data
  2. Numerical summaries
  3. Graphical summaries
    (By the end of the course, maybe could add 4. Statistical model.)

I find this to be a useful format to help students think through a statistical question. For a given question you can ask them: a) what sort of data product (I'm open to a better term here) would be well-suited to answer the question, and b) what would a data set look like that could create that product? From there, it's a matter of thinking through the twists and turns in the pipe - the various lines of the chain - that get from here to there.

It strikes me that dplyr is fantastic for creating chains for 1 and 2, but 3 (syncing with ggplot2) is more difficult. You run into the issue of qplot() having a different argument configuration than the dplyr functions, which necessitates the . placeholder. There's the ggplot() alternative, but then you run into the +, which I think is too much for the new user. It also seems like a bit of a conceptual disconnect, in that the whole systematic building up of layers in a ggplot() is a different process than the bending and pliering of a data set with dplyr.

So I'm inclined to use the . in qplot() for the chains that end in a plot, but I'm open to suggestions. I guess another alternative would be to say that these aren't data analysis pipelines but data processing pipelines; basically, only build chains for 1 and 2. If you want to make a plot, call qplot() on a separate line. This feels like a shame, though, because the step from numerical summaries to graphical summaries seems so natural to me.
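
As a sketch of what such a plot-ending chain might look like (assumes dplyr, ggplot2, and the arbuthnot data frame; the new column name is illustrative):

library(dplyr)
library(ggplot2)

# qplot() does not take the data frame as its first argument,
# so the piped data frame is passed explicitly via the . placeholder
arbuthnot %>%
  mutate(total = boys + girls) %>%
  qplot(x = year, y = total, data = ., geom = "line")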

@hadley commented Sep 12, 2015

Yeah, part of the point of ggvis was to provide a pipe-friendly syntax. But I definitely wouldn't advise using ggvis in an intro course yet.

Alternatively you could just teach that %>% and + are basically equivalent - they're both ways of building up a pipeline. But that's likely to cause some confusion.
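
A side-by-side sketch of that analogy (assumes dplyr, ggplot2, and the arbuthnot data frame):

library(dplyr)
library(ggplot2)

# %>% chains data operations ...
arbuthnot %>%
  mutate(total = boys + girls) %>%
  summarise(biggest_year = max(total))

# ... while + chains plot layers
ggplot(arbuthnot, aes(x = year, y = boys)) +
  geom_line()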

@mine-cetinkaya-rundel (Collaborator)

@hadley you asked "Also, if you think there are important use cases that are hard to do in dplyr, I'd really like to hear about them!" -- one such case I came across is the use of things like summary() from base or favstats() from mosaic not working nicely with summarise() in dplyr.

Here is an MWE:

library(dplyr)
library(mosaic)             # for favstats example below
load(url("https://stat.duke.edu/~mc301/data/cdc.RData"))
cdc %>% 
     group_by(gender) %>%
     summarise(mean(weight))

behaves as expected

but neither

cdc %>% 
     group_by(gender) %>%
     summarise(summary(weight))

nor

cdc %>% 
     group_by(gender) %>%
     summarise(favstats(weight))

does (due to Error: expecting a single value). It seems like these should behave the same way as

by(cdc$weight, cdc$gender, summary)

or

by(cdc$weight, cdc$gender, favstats)

but they don't. I think this is related to this. I also think that this would be a welcome addition, but correct me if I'm missing something important about why summarise() should only report a single value for each function.

On a different note, I haven't had any issues with convincing newbies that + and %>% are equivalent, but this piece of info is more an "anecdote" than "data", so I can't confidently assert that it works just fine.

@rpruim commented Sep 12, 2015

I would second this for functions that produce either (possibly named) vectors or 1-row data frames/tbls.

You can get the desired result like this:

HELPrct %>% group_by(substance) %>% do( favstats(.$age))
## Source: local data frame [3 x 10]
## Groups: substance [3]
## 
##   substance   min    Q1 median    Q3   max     mean       sd     n missing
##      (fctr) (dbl) (dbl)  (dbl) (dbl) (dbl)    (dbl)    (dbl) (int)   (int)
## 1   alcohol    20    33   38.0 43.00    58 38.19774 7.652272   177       0
## 2   cocaine    23    30   33.5 37.25    60 34.49342 6.692881   152       0
## 3    heroin    19    27   33.0 39.00    55 33.44355 7.986068   124       0

But this requires both a . and a $.

@AmeliaMN (Contributor Author) commented Oct 1, 2015

Ugh, forgot to make a new branch AGAIN.
