
removed some dollar signs in intro to r lab #9

Closed
wants to merge 21 commits

Conversation

@AmeliaMN (Contributor) commented Sep 8, 2015

I've made some minor changes that may have major philosophical repercussions! In particular, I have replaced all the usages of the dollar sign, e.g. arbuthnot$boys, with dplyr commands, like select(arbuthnot, boys). This means taking out the references to vectors in this lab, as well as the expository text about what the wrapped display of vectors means.

@rpruim commented Sep 9, 2015

From a quick glance, it looks like you are only doing this to view the data, is that correct? This matters because you are now getting a data frame rather than a vector (that's why the display is different). Any computations done with these objects might break, but if you are only using this for the sake of viewing, that doesn't matter.
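
For illustration, a minimal sketch of that difference (assuming dplyr and the lab's arbuthnot data frame are loaded; the exact classes may vary a bit by dplyr version):

library(dplyr)
class(arbuthnot$boys)           # a bare vector ("integer" or "numeric")
class(select(arbuthnot, boys))  # a one-column data frame (data.frame or tbl_df, depending on the input)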

Have you considered using

arbuthnot %>% 
  select(boys)

instead? This makes it easier to read and to perform multiple operations.

@AmeliaMN (Contributor Author)

I like that! It probably means adding some expository text about the pipe operator, but nothing wrong with getting students thinking about it in the first week.

@rudeboybert (Collaborator)

+1 on teaching piping!

@rpruim commented Sep 10, 2015

If you do it, I suggest the following code formatting:

NewData <-
  OriginalData %>%
  first_operation( ... ) %>%
  second_operation(...) %>%
  etc., etc., etc.

The first line can be omitted if there is no need to store the result.
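
For instance, a minimal sketch of this layout with real dplyr verbs (names and numbers are illustrative; assumes dplyr and the lab's arbuthnot data frame are loaded):

library(dplyr)

# with assignment: the new name sits alone on the first line
arbuthnot_totals <-
  arbuthnot %>%
  mutate(total = boys + girls) %>%
  filter(total > 11000)

# just looking at the result: same chain, first line omitted
arbuthnot %>%
  mutate(total = boys + girls) %>%
  filter(total > 11000)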

@rudeboybert (Collaborator)

Also the use of the . operator when piping to an argument other than the first:

NewData <-
  OriginalData %>%
  first_operation( argument1, . ) %>%
  second_operation( argument1, argument2, . )
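
A concrete sketch of the same idea (assumes dplyr and the arbuthnot data frame; lm() is just an example of a function whose first argument is not the data, so the piped data frame has to be passed explicitly via .):

library(dplyr)

fit <-
  arbuthnot %>%
  filter(year >= 1700) %>%
  lm(boys ~ year, data = .)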

@andrewpbray (Collaborator)

Though we haven't made the commits to this repo, Mine and I both taught the Intro to Data lab in a way that introduces the pipe operator. @rpruim, do you have a reason for putting nothing but the assignment on the first line? I really like this code block that you provided, though, with the general function names. I think I'll bring it up in class tomorrow to show a general form of what we've been doing.

One thing that I've realized needs to be addressed when teaching chaining to new users is spacing/line breaks. New users can develop all sorts of superstitions surrounding why code runs or doesn't run, so it can be useful to explicitly address what will and will not break the code in terms of formatting. This is another topic for tomorrow's class =)
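
For example, a sketch of that formatting point (assumes dplyr and the arbuthnot data frame; the broken variant is left commented out):

library(dplyr)

# Works: the line ends with %>%, so R knows the command continues
arbuthnot %>%
  mutate(total = boys + girls)

# Would NOT work: the first line below is already a complete command,
# so a leading %>% on the next line is a syntax error
# arbuthnot
#   %>% mutate(total = boys + girls)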

@andrewpbray (Collaborator)

Btw, what's the best practice for handling a pull request like this that will likely be updated (in this case for the piping format of select())? If @AmeliaMN submits a new pull request, will it overwrite this one? Or would I "close and comment" this thread/pull request and then we start afresh with the new request?

@rpruim commented Sep 10, 2015

@andrewpbray,

The assignment (in the sense of naming the variable that will contain the results) is on the first line. And it is the only thing on that line. Here's why I like it:

  • If you chain more than one operation together and put the start of the chain on the assignment line, then the rhs of that first line would not be the value actually being assigned, so it's confusing. (For short, one-operation things, I sometimes collapse to a single line.)
  • I like to have all of the rhs operations vertically aligned and on an equal footing. This makes it very easy to read off the sequence of operations.
  • To span multiple lines, no line may complete a command syntactically, so emphasizing a return after the assignment arrow and after each %>% reduces the chance that a student will end a line in a way that changes the meaning.
  • It fights against rightward drift and overly wide lines of code.
  • If you start out experimenting without the assignment (just looking at output), it is easy to go back and add the assignment by inserting one additional line.
  • I like how it looks.

By the way, when I use commands like mutate(), if there is more than one variable, I like to put each of those on a separate line as well:

NewData <-
  OldData %>%
  mutate( 
    x = foo(a, b, pretty = TRUE),
    y = bar(a, b, pretty = FALSE)) %>%
  additional_operation( ... )

Last note: I have gotten more and more into the habit of inserting a line break after the assignment arrow whenever what is being assigned takes more than one line of code.

x <-
  multiline_stuff( ...,
    ...)

I'm torn about whether the final closing paren should be on a line by itself or at the end of the last "action" line. I like the former for style, but I don't like using up the extra vertical line.

@andrewpbray (Collaborator)

@rpruim
All compelling points - I'm convinced!

@rpruim commented Sep 10, 2015

@andrewpbray,

Regarding spaces: I'm trying to retrain myself to always put spaces around operators like +, -, *, =, etc. It's not quite in my fingers yet, so I often have to do a consistency edit of my code. (@hadley requires this of any code contributed to his projects, and I think I've decided it adds enough by way of readability to adopt it.)

I'm a bit inconsistent about spaces inside parens. Most people don't put a space after ( or before ), but I find that sometimes it is very useful. I judge that by eye mostly and often use alternation to help clarify nested parens or just to help things like ~ or - be more easily seen:

  far( boo(foo( bar(x, y) )) )     # spaces help match up parens -- emphasize boo(foo())
  far(boo( foo(bar(x, y)) ))       # spaces help match up parens -- emphasize far(boo()) and foo(bar())
  # in the above examples, I might just use chaining now

  histogram( ~ x,  data = mydata)  # torn about whether there should be a space 
                                   # before closing paren here

@rpruim commented Sep 10, 2015

@andrewpbray,

Regarding pull requests, I'm not sure I understand what "like this" means or what you are concerned about, so I can't answer what the best practice would be.

It's probably best for people who are working on different things to keep them on separate branches and issue a pull request from each branch when they are ready. On your end, from the commandline you can choose to fetch rather than pull (pull = fetch + merge) if you want to investigate the pull request without merging it into your current work. At that point, you can decide whether or not to merge, cherrypick, discard, etc. the work that was done in that branch.

If you want a visual way to "see" how git works, I recommend this visualization tool and tutorial:

http://pcottle.github.io/learnGitBranching/

@andrewpbray (Collaborator)

@rpruim ,

By "like this", I mean a pull request where the subsequent discussion has made it clear that further modifications will be made. With this one, for example, I could either go ahead and merge the change (using select(arbuthnot, boys)) and then expect another pull request from Amelia that would change it over to the piping syntax. Or I could wait to merge until that newer version arrives. Maybe another way to phrase the question is whether it makes more sense to merge files that have all of the commits, as opposed to spreading those commits across multiple merges.

The main reason I'm wondering is that in this GitHub interface, it seems like this discussion is tied to the first pull request. I'd kinda like it if a new pull request (including the subsequent commits) could happen in the context of this same thread, but I don't think that's possible.

@AmeliaMN (Contributor Author)

I think that can indeed happen (I recently submitted a rather stupid pull request somewhere else, had some discussion, made another pull request and everything got tied together). Don't have time to try right now, but in the next 24 hours I'll attempt it.

~Amelia


@rpruim commented Sep 10, 2015

That sounds right. Until the pull requester gets rid of the pull request, it hangs around as a branch; more work can be done along the branch, and that work can (I think) be fetched at any time. We'll have to see how Amelia's experiment goes in terms of the GitHub features. But if it is just a branch, you can probably pull it and then pull it again if more work is done.

But best to keep pull requests fairly limited in scope anyway, else they get harder to deal with -- especially if you want some, but not all, of the changes in the pull request.

@AmeliaMN (Contributor Author)

Okay, as expected my new commits just got added into this long thread. I briefly explained the %>% and . but I think a more thoughtful job could be done.

@hadley commented Sep 10, 2015

It's easier to understand once you realise that pull request = branch. If you push to the same branch, the PR gets updated.

devtools::install_github("andrewpbray/oilabs")
install.packages(c("dplyr", "ggplot2"))

You could simplify this a bit by making dplyr and ggplot2 dependencies (even if slightly spurious) of the oilabs package.

@AmeliaMN (Contributor Author)

The big picture here seems the same as what we've been running into in many places -- just changing from base R to dplyr/ggplot2 is not the way to go. We need to think more deeply about what should really be in an introductory lab.

@beanumber and I are doing dplyr-flavored mosaic labs (with lattice graphics) in our intro courses this semester. Like I said above, I've done the first lab, which included arbuthnot$boys, with(arbuthnot, boys > girls), mutate(arbuthnot, total = boys + girls), and xyplot(boys + girls ~ year, data = arbuthnot). This is TOO much syntax for students, particularly for day one. I don't have a strong philosophical stance about which direction we go, but I think that there needs to be a much smaller range of "recipes" for code that students see.

@nicholasjhorton
I heartily concur.

I'd suggest taking another look at the Minimal R vignette that Randy crafted: there's a lot to be said about the "Less is More" philosophy that he espouses for intro students.

I'm working on an even more minimal vignette, which could be more useful for folks in community colleges or high schools.

Just my $0.02,

Nick


@AmeliaMN (Contributor Author)

There was a short offline conversation about with() on email, which I wanted to open to the broader audience. Many things which can be done with dplyr functions could also be done using with() and base R functions. It might depend on the branch, but I think that at least in the context of this dplyr/ggplot2 version of oiLabs it probably makes sense not to use with().

@beanumber and I are talking about a dplyr/mosaic/lattice version (we've started this a bit this semester), and in that branch it might make sense to use with(), although I'm not crazy about it. I know that this thread isn't exactly the right place for this particular discussion, but ah well.
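
To make the trade-off concrete, a minimal sketch of the two styles for the same task (assumes dplyr and the lab's arbuthnot data frame are loaded; the column name is illustrative):

library(dplyr)

# base R with(): evaluates an expression inside the data frame, returns a bare vector
with(arbuthnot, boys > girls)

# dplyr mutate(): appends the result as a new column of the data frame
arbuthnot %>% mutate(more_boys = boys > girls)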

@rpruim commented Sep 11, 2015

I concur with @nicholasjhorton that you don't want to merely "translate" from one toolkit to another at a micro level. And big thumbs up to doing a "Less Volume, More Creativity / Minimal R" exercise. In this exercise, you cull out all of the R code for a course and organize it to see what you are using, with the goal of reducing it to a small, coherent set that gets the job done.

@AmeliaMN, I'd like to see the use cases where you think you need with(). I almost never use it. I don't think I have ever showed it to students.

@hadley commented Sep 11, 2015

Also, if you think there are important use cases that are hard to do in dplyr, I'd really like to hear about them!

@beanumber (Collaborator)

The injection of with() was simply to avoid the dollar sign at a point in time before dplyr was moved in mosaic. The issue was that the lab as originally written was creating new vectors outside the data frame. I think the right way to do it now is to use mutate() and append new columns to the data frame.

Also, no one has mentioned the %<>% operator from magrittr. Why does this not come with dplyr? Is it too much to introduce another operator? I like the way it obviates the need to type the name of the data frame twice.
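
A quick sketch of what that looks like (assumes dplyr, magrittr, and the arbuthnot data frame; the new column name is just for illustration):

library(dplyr)
library(magrittr)   # %<>% comes from magrittr, not dplyr

# without %<>%: the data frame name is typed twice
arbuthnot <- arbuthnot %>% mutate(total = boys + girls)

# with %<>%: pipe and re-assign in one step
arbuthnot %<>% mutate(total = boys + girls)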

@rpruim commented Sep 11, 2015

I'm not a fan of %<>%.

  • doesn't conjure up the notion of assignment to me
  • easily confused with %>%
  • a symmetric symbol for a non-symmetric operation (I'll grant you that there is some flow in each direction, but it is still not a symmetric operation).
  • it doesn't add any use cases or reduce any overhead since we still need %>% and assignment (<- or =). (More Volume, Equal Creativity)

I don't think saving a few characters of typing is worth this. Does anyone else use it?

@hadley commented Sep 11, 2015

I'm not a fan of %<>% either - it just feels a bit too magical to me. Assignment is sufficiently important (you're changing the state of the world!) that it's worth a little extra typing.

@andrewpbray (Collaborator)

This afternoon in class I reviewed the chaining syntax that we covered in the Intro to Data lab on Tuesday. I found myself talking about a data analysis pipeline and actually drawing a Super Mario Brothers style pipe with "Raw Data" at one end and at the other end a list of possible "Data Products". Those included

  1. Processed data
  2. Numerical summaries
  3. Graphical summaries
    (By the end of the course, maybe could add 4. Statistical model.)

I find this to be a useful format to help students think through a statistical question. For a given question you can ask them: a) what sort of data product (I'm open to a better term here) would be well-suited to answer the question, and b) what would a data set look like that could create that product? From there, it's a matter of thinking through the twists and turns in the pipe - the various lines of the chain - that get from here to there.

It strikes me that dplyr is fantastic for creating chains for 1 and 2, but 3 (syncing with ggplot2) is more difficult. You run into the issue of qplot() having a different argument configuration than the dplyr functions, which necessitates the . placeholder. There's the ggplot() alternative, but then you run into the +, which I think is too much for the new user. It also seems like a bit of a conceptual disconnect, in that the whole systematic building up of layers in a ggplot() is a different process than the bending and pliering of a data set with dplyr.

So I'm inclined to use the . in qplot() for the chains that end in a plot, but I'm open to suggestions. I guess another alternative would be to say that these aren't data analysis pipelines but data processing pipelines; basically, only build chains for 1 and 2. If you want to make a plot, call qplot() on a separate line. This feels like a shame, though, because the step from numerical summaries to graphical summaries seems so natural to me.
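
As a sketch of what such a plot-ending chain might look like (assumes dplyr, ggplot2, and the arbuthnot data frame; the new column name is illustrative):

library(dplyr)
library(ggplot2)

# qplot() does not take the data frame as its first argument,
# so the piped data frame is passed explicitly via the . placeholder
arbuthnot %>%
  mutate(total = boys + girls) %>%
  qplot(x = year, y = total, data = ., geom = "line")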

@hadley commented Sep 12, 2015

Yeah, part of the point of ggvis was to provide a pipe-friendly syntax. But I definitely wouldn't advise using ggvis in an intro course yet.

Alternatively you could just teach that %>% and + are basically equivalent - they're both ways of building up a pipeline. But that's likely to cause some confusion.
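
A side-by-side sketch of that analogy (assumes dplyr, ggplot2, and the arbuthnot data frame):

library(dplyr)
library(ggplot2)

# %>% chains data operations ...
arbuthnot %>%
  mutate(total = boys + girls) %>%
  summarise(biggest_year = max(total))

# ... while + chains plot layers
ggplot(arbuthnot, aes(x = year, y = boys)) +
  geom_line()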

@mine-cetinkaya-rundel (Collaborator)

@hadley you asked "Also, if you think there are important use cases that are hard to do in dplyr, I'd really like to hear about them!" -- one such case I came across is the use of things like summary() from base or favstats() from mosaic not working nicely with summarise() in dplyr.

Here is an MWE:

library(dplyr)
library(mosaic)             # for favstats example below
load(url("https://stat.duke.edu/~mc301/data/cdc.RData"))
cdc %>% 
     group_by(gender) %>%
     summarise(mean(weight))

behaves as expected

but neither

cdc %>% 
     group_by(gender) %>%
     summarise(summary(weight))

nor

cdc %>% 
     group_by(gender) %>%
     summarise(favstats(weight))

does (due to Error: expecting a single value). It seems like these should behave the same way as

by(cdc$weight, cdc$gender, summary)

or

by(cdc$weight, cdc$gender, favstats)

but they don't. I think this is related to this. I also think that this would be a welcome addition, but correct me if I'm missing something important about why summarise() should only report a single value for each function.

On a different note, I haven't had any issues with convincing newbies that + and %>% are equivalent, but this piece of info is more an "anecdote" than "data", so I can't confidently assert that it works just fine.

@rpruim commented Sep 12, 2015

I would second this for functions that produce either (possibly named) vectors or 1-row data frames/tbls.

You can get the desired result like this:

HELPrct %>% group_by(substance) %>% do( favstats(.$age))
## Source: local data frame [3 x 10]
## Groups: substance [3]
## 
##   substance   min    Q1 median    Q3   max     mean       sd     n missing
##      (fctr) (dbl) (dbl)  (dbl) (dbl) (dbl)    (dbl)    (dbl) (int)   (int)
## 1   alcohol    20    33   38.0 43.00    58 38.19774 7.652272   177       0
## 2   cocaine    23    30   33.5 37.25    60 34.49342 6.692881   152       0
## 3    heroin    19    27   33.0 39.00    55 33.44355 7.986068   124       0

But this requires both a . and a $.

@AmeliaMN (Contributor Author) commented Oct 1, 2015

Ugh, forgot to make a new branch AGAIN.
