---
layout: post
categories: blog
title: "Non-standard evaluation, how tidy eval builds on base R"
base-url: https://EdwinTh.github.io
date: "2017-09-10 14:30:00"
output: html_document
tags: [R, tidy evaluation, programming, non-standard evaluation, base R]
---
As with many aspects of the tidyverse, its non-standard evaluation (NSE) implementation is not entirely new, but built on top of base R. What makes it so challenging to get your mind around is that the [Honorable Doctor Sir Lord General](http://maraaverick.rbind.io/2017/08/tidyeval-resource-roundup/) and friends brought concepts to the realm of the mortals that many of us had no, or only a vague, understanding of. Earlier, I gave an overview of [the most common actions in tidy eval](https://edwinth.github.io/blog/dplyr-recipes/). Although appreciated by many, it left me unsatisfied, because it made clear to me that I did not really understand NSE, neither in base R nor in tidy eval. Therefore, I bit the bullet and really studied it for a few evenings, starting with base R NSE and later learning what tidy eval actually adds to it. I decided to share the things I learned in this rather lengthy blog post. I think it captures the essentials of NSE, although it surely is incomplete and might even be erroneous in places. Still, I hope you find it worthwhile and that it will help you understand NSE better and apply it with more confidence.
My approach was to list a number of terms and study them one by one, mainly consulting [Advanced R](http://adv-r.hadley.nz/) and the [R Language Definition](ftp://cran.r-project.org/pub/R/doc/manuals/r-release/R-lang.html). For tidy eval I leaned heavily on the [Programming with dplyr vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html) and the function documentation. This is also how this blog post is built: we hop from term to term to see how they relate. You will find references to the sources in the text, in case you want to read more about a topic.
*Note this is v3, in which the overscope gets mentioned and quosures are elaborated on further. To read earlier versions see [here](https://github.com/EdwinTh/EdwinTh.github.io/releases).*
# Base R, non-standard evaluation
## `expression`
In standard evaluation R is like a child that receives candy from his grandmother and puts it in his mouth immediately. Every input is evaluated right away. An **expression** is some R code that is ready to be evaluated, but is not evaluated yet. Rather, it is captured and saved for later. Think of it as the child's father telling him he can't have the candy until they get home. Expressions break down into four categories: constants (like 4, TRUE), names (variable names referring to an object), calls (letting functions do a calculation) and pairlists. We will only look at `names` and `calls` here. Constants are equal to their value when captured in an expression; if you want to learn about pairlists, see [Adv-R](http://adv-r.hadley.nz/metaprogramming.html#pairlists). The act of capturing the expression instead of evaluating it is called quoting. This is done by the `quote()` function.
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
quote(x) %>% class()
quote(mean(1:10)) %>% class()
```
Confusingly, there is also the class `expression`, created by the `parse()` and `expression()` functions. You should think of objects of this class as lists of names and calls. They will not be discussed further here. When I use the term *expression* I am referring to the concept of R code that still needs to be evaluated, not objects of class `expression`.
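To make the categories a bit more tangible, here is a small sketch that uses only base R predicates. The quoted code is arbitrary and purely illustrative.

```{r}
# A quoted variable name is a name (symbol); a quoted function call is a call.
is.name(quote(x))
is.call(quote(mean(1:10)))

# A constant quotes to itself: quote(4) is simply the number 4.
identical(quote(4), 4)

# An object of class `expression` behaves like a list of names and calls.
ex <- expression(x, mean(1:10))
length(ex)
class(ex[[1]])
class(ex[[2]])
```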
## `name`
When creating an object in R, you are binding the name of the object to a value. This binding of value and name is done in an **environment**. In the following, the name `x` gets associated with the value 50. Since we did not create a specific environment, this is a binding in the global environment.
```{r}
x <- 50
```
Normally, we give R the object name to retrieve the corresponding value. However, when we save the object's name as an expression, it is not evaluated but stored as a `name` object.
```{r}
exp_x <- quote(x)
eval(exp_x)
```
This way we can build a request for later. The requested variable doesn't even have to exist at creation time. (Like granny having no candy herself, but telling the kid that he can have candy when he gets back to his parents' place.)
```{r error=TRUE}
eval_me <- quote(y)
eval(eval_me)
y <- "I am ready"
eval(eval_me)
```
If we want to create a `name` from a string, for instance because the string is the output of another function, we can use `as.name()`.
```{r}
as.name("x")
```
And finally, to make matters nice and unclear, a name is also called a `symbol` and the function `as.symbol()` does the same as `as.name()`. Perfect, we have a good idea about quoting variable names and how to retrieve their value later. Now, let's call some functions.
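As a quick check of the above, the following sketch builds a name from a string that is only known at run time and evaluates it afterwards. The variable `candy` is a made-up example.

```{r}
# as.name() and as.symbol() build the same object that quote() returns.
identical(as.name("candy"), as.symbol("candy"))
identical(as.name("candy"), quote(candy))

# Build the name from a string assembled at run time ...
candy <- "lollipop"
nm <- as.name(paste0("can", "dy"))

# ... and retrieve the value bound to it later.
eval(nm)
```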
## `call`
When we delay the evaluation of a function call, we arrive at the second subcategory of expressions: the `call`. The function to be called, together with the names of the objects used for the arguments, is stored until further notice.
```{r}
wait_for_it <- quote(x + y)
x <- 3; y <- 8
eval(wait_for_it)
```
Note that `+` is a function, like every action that happens in R. If we want to return an expression as a string, we can apply `deparse()` on it. This allows us to do stuff like:
```{r}
print_func <- function(expr) {
  paste("The value of", deparse(expr), "is", eval(expr))
}
print_func(wait_for_it)
print_func(quote(log(42) %>% round(1)))
print_func(quote(x))
```
When the expression is a name, we print the name and the value of the object associated with the name. When it is a call, we print the function call and the evaluation of it.
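Because a call stores the function and its arguments separately, it can be inspected and even modified before evaluation. The sketch below is only an illustration. It supplies values for `x` and `y` through a list, so it does not depend on objects defined earlier.

```{r}
cl <- quote(x + y)

# The first element of a call is the function, the rest are its arguments.
cl[[1]]
as.list(cl)

# Swap the function for another one and evaluate the modified call.
cl[[1]] <- as.name("*")
cl
result <- eval(cl, list(x = 3, y = 8))
result
```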
## `environment` and `closure`
In the last block we used a function in which we applied NSE. No coincidence, NSE and functions are a strong and natural pair. With NSE we can create powerful and user-friendly functions, like the ones in `ggplot2` and `dplyr`. We need to elaborate on **environments** and **closures** here. I told you that an object is the binding of a name and a value in an environment. When starting an R session, you are in the global environment [Adv-R](http://adv-r.hadley.nz/environments.html). All objects you create there live happily in the global environment.
```{r}
z <- 25
```
A function creates a new environment. Objects with the same name as objects in the global environment can live here with different values bound to them.
```{r}
z_func <- function() {
  z <- 12
  z
}
z_func()
z
```
`z_func()` did not change the global environment, but created an object in its own environment. Now, functions are of a type called a `closure`.
```{r}
typeof(z_func)
```
They are called this way because they *enclose* their environment. At creation, a function keeps a reference to the environment in which it is created, so it can later find all the names and values that are available there. Functions don't just know the names of the objects in their own environment, but also those in the environment in which they were created [Adv-R](http://adv-r.hadley.nz/functional-programming.html#closures).
Keep the concept of a closure in mind; we will revisit it.
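A function that returns a function makes the enclosing visible. This is a minimal sketch; `make_adder()` and `add_two()` are hypothetical names, not part of any package.

```{r}
make_adder <- function(n) {
  # The returned function is created inside make_adder(), so it encloses
  # the environment in which `n` lives.
  function(v) v + n
}

add_two <- make_adder(2)
add_two(5)

# `n` is found in the environment that add_two() encloses.
exists("n", envir = environment(add_two), inherits = FALSE)
get("n", envir = environment(add_two))
```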
## `substitute` and `promise`
With the knowledge gained above we can start writing our own NSE functions. Let's make a function that adds a column to a data frame that is the square of a column it already contains.
```{r}
add_squared <- function(x, col_name) {
  new_colname <- paste0(deparse(col_name), "_sq")
  x[, new_colname] <- x[, deparse(col_name)]^2
  x
}
add_squared(mtcars, quote(cyl)) %>% head(1)
```
You might say, "that is not too convenient, I still need to quote the `col_name` myself". Well, you are very right, it would be more helpful if the function did the quoting for you. Unfortunately placing `quote(col_name)` inside the function body is of no use. `quote()` makes a literal quote of its input. So it would make the `name` *col_name* here each time it was called, no matter the value that was given to the argument, rather than quoting the value that was provided to this argument.
Here we need `substitute()`. This will look up all the object names provided to it, and if it finds a value for that name, it will substitute the name for its value [Adv-R](http://adv-r.hadley.nz/nse.html#substitute). Let's write a filter function to demonstrate.
```{r}
my_filt <- function(x, filt_cond) {
  filt_cond_q <- substitute(filt_cond)
  rows_to_keep <- eval(filt_cond_q, x)
  x[rows_to_keep, ]
}
my_filt(mtcars, mpg == 21)
```
Yeah, that works. But, wait a minute. How does `eval()` now know that *mpg* is a column in `x`? We provided `x` to the `eval` function, but how does this work? Well, the data frame `x` was provided to the `envir` argument of `eval()`. A data frame, thus, is a valid environment in which we can evaluate expressions. *mpg* lives in `x`, so the evaluation of `filt_cond_q` here gives the desired result.
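The same mechanism can be tried outside a function. In this sketch, names that are not columns of the data frame are looked up in the calling environment (the `enclos` argument of `eval()`). `threshold` is just an illustrative variable.

```{r}
threshold <- 30

# `mpg` is found in the data frame, `threshold` in the calling environment.
high_mpg <- eval(quote(mpg > threshold), mtcars)
sum(high_mpg)

# A plain list works the same way as a data frame.
eval(quote(a + b), list(a = 1, b = 2))
```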
When you think about it a little longer, NSE is only possible when function arguments are not evaluated directly. If the function were the impatient kid that wanted to put `filt_cond` in its mouth right away, it would have failed to find an object with the name *mpg* in the global environment. When the function is called, each provided argument is stored in a **promise**. The promise holds an expression for the argument and, once it has been evaluated, its value. The function does not bother about the value of the promise until the argument is actually used in the function. The `substitute()` function only accesses the expression part of the promise. In the `my_filt()` example, the promise associated with the `x` argument will have the actual data frame belonging to the object `mtcars` as its value, and the name *mtcars* as its expression. In the second and third line of the function, the value of this argument is accessed. The promise associated with the `filt_cond` argument, however, never gets a value. But it does have a call as its expression. If we used this argument directly, the function would fail. But we don't. With `substitute()` we only access the expression of the promise [R lang](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Promise-objects).
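The difference between `quote()` and `substitute()` inside a function, and the fact that the promise is never forced, can be checked with two tiny helper functions. Both are made up for this illustration.

```{r}
quote_inside <- function(arg) quote(arg)
subst_inside <- function(arg) substitute(arg)

# quote() captures the literal argument name ...
quote_inside(mpg == 21)
# ... substitute() captures the expression stored in the promise.
subst_inside(mpg == 21)

# The promise is never forced, so even an argument that would throw an
# error when evaluated is harmless here.
never_run <- subst_inside(stop("I am never evaluated"))
deparse(never_run)
```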
## `formula` and `overscoping`
Before we move to tidy eval there is one more concept we have to elaborate on, the `formula`. You have probably used formulas a lot, but did you ever think about how odd they are? Take the following example:
```{r}
mod <- lm(vs ~ mpg + cyl, data = mtcars)
```
No R user would have trouble reading the above, but picture yourself coming from another programming language and stumbling upon it. It is an example of a domain specific language (DSL). DSLs exploit R's NSE possibilities by giving alternative meaning to the language in specific contexts. Other examples are `ggplot2` and `dplyr`. Just like functions, formulas enclose the environment they are created in. This means that when the formula is evaluated later in a different environment, it can still access all the objects that lived in its original environment.
Formulas, thus, can find variables in multiple environments. Like so:
```{r}
not_in_df <- rnorm(32)
lm(disp ~ not_in_df, data = mtcars)
```
But what if the name exists in both environments, which one prevails?
```{r}
cyl <- "this would throw an error"
lm(disp ~ cyl, data = mtcars)
```
Thus, names are looked up in the data first and only then in the enclosed environment. We say the data environment **overscopes** the enclosed environment.
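The lookup order can be mimicked by hand with `eval()`. This is a simplified sketch of what modelling functions do internally, not their actual implementation. It evaluates the right-hand side of a formula with the data first and the formula's own environment as fallback.

```{r}
cyl <- "I live in the enclosing environment"
f <- disp ~ cyl

# The right-hand side of the formula is the name `cyl`.
rhs <- f[[3]]

# With a `cyl` column in the data, the column wins.
from_data <- eval(rhs, mtcars, environment(f))
head(from_data, 3)

# Without a matching column, the enclosed environment is consulted.
from_env <- eval(rhs, mtcars["disp"], environment(f))
from_env
```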
These are, to my understanding, the core elements of NSE in base R. If you don't care about tidy eval you can stop reading here and try to build your own NSE functions. Thanks for making it this far.
# `tidy evaluation`
There are two key additions of tidy eval to base R NSE. It uses **quasiquotation** and it introduces a new type of quoted object, called a **quosure**. Let's find out about them one by one.
## `quasiquotation`
We now know that in normal quotation the expression is captured to be evaluated later, rather than swallowed right away. Quasiquotation enables the user to swallow parts of the expression right away, while quoting the rest. We can quote the following simple call.
```{r}
quote(z - x + 4)
```
Say we already know the value of `x` at the moment of quoting. How can we have the second part evaluated right away and quote `z -` the result of this evaluation? In other words, how do we **unquote** the `x + 4` part? With `quote()` this is not going to happen (base R's `bquote()` offers a limited form of it), but with tidy eval this can be done.
```{r}
x <- 4
# The parentheses make sure the whole `x + 4` is unquoted, not just `x`.
rlang::expr(z - !!(x + 4))
rlang::expr(z - !!(x + 4)) %>% class()
```
The expression directly after the `!!` (bang bang) is unquoted. In current versions of rlang `!!` binds as tightly as a unary minus, so `!!x + 4` would unquote only `x`; hence the parentheses. If we do not use unquoting, there is no reason to use `rlang::expr()` instead of `quote()`. They have the exact same result. There is also a tidy eval equivalent for `substitute()`, namely `enexpr()`.
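Unquoting works for plain values as well as for expressions that were quoted earlier. The following sketch shows both and, for comparison, base R's `bquote()`, which marks the part to evaluate with `.()`. The names `inner`, `a` and `b` are only illustrations.

```{r}
x <- 4

# Unquote a value: the number is inlined into the captured call.
rlang::expr(z - !!x)

# Unquote a quoted expression: it is inserted into the larger call.
inner <- quote(a * b)
rlang::expr(z - !!inner)

# Base R counterpart for the first case.
bquote(z - .(x))
```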
Now the appeal of functions that have implemented quasiquotation is that all the advantages of easy-to-use NSE interfaces remain. At the same time, they enable the user to wrap functions that already quote their input in custom-made wrappers. Example please! Something I do often is creating a frequency table of the values of a variable in a data frame. I want this in a function with the data frame and column name as arguments, wrapping `dplyr` functions in the following way:
```{r, message=FALSE, warning=FALSE}
freq_table <- function(x, col) {
  col_q <- rlang::enexpr(col)
  total_n <- x %>% nrow()
  x %>% group_by(!!col_q) %>% summarise(freq = n() / total_n)
}
mtcars %>% freq_table(cyl)
mtcars %>% freq_table(vs)
```
So the functions that use tidy eval, like those in `dplyr`, automatically quote their input. That is what enables you to type away and get results as quickly as you can when doing data analysis. However, if you want to write programs around them, you have to take care of two steps. First, quote the argument that is going to be evaluated by the functions used. If we don't do this, our wrapper function would fail, because we have provided a name or call that cannot be found in the environment the function is called from. Second, since the `dplyr` functions quote their input themselves, we have to unquote the quoted arguments in these functions. If we don't do this, the `dplyr` function will quote the variable name rather than its content.
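The two steps apply just as well when a wrapper takes more than one column. Below is a minimal sketch; the function `mean_by()` and its argument names are hypothetical, chosen for this illustration only.

```{r, message=FALSE, warning=FALSE}
library(dplyr)

mean_by <- function(df, group_col, value_col) {
  # Step 1: quote what the user typed.
  group_q <- rlang::enexpr(group_col)
  value_q <- rlang::enexpr(value_col)
  # Step 2: unquote inside the dplyr verbs, which quote their input themselves.
  df %>%
    group_by(!!group_q) %>%
    summarise(mean_value = mean(!!value_q))
}

result <- mean_by(mtcars, cyl, mpg)
result
```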
## `quosure`
Very nice, that quasiquoting. Now what's up with quosures? From their name you might guess they are hybrids of `quotes` and `closures`. We have seen that combination before when we looked at formulas. If we look at quosures, we will see that they are a subclass of the formula, besides being a class of their own.
```{r}
quo(z) %>% class()
quo(z) %>% rlang::is_quosure()
```
Quosures are one-sided formulas, capturing their environment, but not indicating a modelling relationship. By the way, we have just seen the `quo()` function in action. It literally quotes its input, just like `quote()` and `rlang::expr()` do. The quosure equivalent of `substitute()` and `enexpr()` is `enquo()`.
Just like names, calls can be converted to a quosure too.
```{r}
quo(2 + 2) %>% class()
```
Note that quosures don't make a lower level distinction between calls and names. Every expression becomes a quosure.
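The two parts of a quosure can be taken apart with the rlang accessors `quo_get_expr()` and `quo_get_env()`. This is a small sketch, assuming a reasonably recent rlang version. The variable `a` is illustrative.

```{r}
a <- 1
q <- rlang::quo(a + 1)

# The expression part is an ordinary call ...
rlang::quo_get_expr(q)
class(rlang::quo_get_expr(q))

# ... and the environment part is the environment the quosure was created in.
rlang::quo_get_env(q)
```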
When is capturing of the environment by the expression actually useful? When the quosure is created in one environment and evaluated in another. This typically happens when it is created in a function and evaluated in the global environment or another function.
In base R NSE a function can evaluate a quoted argument, it can quote a bare statement, and it can even return an expression. What it cannot do, however, is give the returned expression memory of the variables that were present at creation.
```{r error = TRUE}
base_NSE_example <- function(some_arg) {
  some_var <- 10
  quote(some_var + some_arg)
}
base_NSE_example(4) %>% eval()
```
The quosure is not memoryless, it will retrieve the values that were present at creation.
```{r}
tidy_eval_example <- function(some_arg) {
  some_var <- 10
  quo(some_var + some_arg)
}
tidy_eval_example(4) %>% rlang::eval_tidy()
```
Note that we do need to apply `eval_tidy()` instead of `eval()` to make use of the memory of the quosure.
Now you might say: if we just use `substitute()` instead of `quote()` in `base_NSE_example()`, it would return 14 too. You are right. If all we ever do is look up the values that were present at creation, there is not much point in carrying the whole environment along to evaluate later. However, we don't have to use the values in the enclosed environment when applying `eval_tidy()` to a quosure. With the `data` argument we can provide values for names in the quosure, which overscope the enclosed environment for the provided name(s). An example: calling `tidy_eval_example()` with *some_arg* = 4 gives *some_arg* = 4 and *some_var* = 10 in the enclosed environment. Now we can overrule one or both values in `eval_tidy()`.
```{r}
tidy_eval_example(4) %>%
  rlang::eval_tidy(list(some_var = 42))
tidy_eval_example(4) %>%
  rlang::eval_tidy(list(some_var = 42, some_arg = -1))
```
The values in the enclosed environment are thus carried with the quosure wherever it goes, but they do not have to be used. We can also evaluate (a part of) the quosure in a different environment.
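The `data` argument also accepts a data frame, which is how the enclosed environment and the data typically work together. In this sketch, `make_condition()` is a hypothetical helper. The column comes from the data and the threshold from the environment the quosure remembers.

```{r}
make_condition <- function(threshold) {
  # `threshold` lives in this function's environment, which the quosure keeps.
  rlang::quo(mpg > threshold)
}

cond <- make_condition(30)

# `mpg` comes from the data, `threshold` from the enclosed environment.
keep <- rlang::eval_tidy(cond, data = mtcars)
mtcars[keep, c("mpg", "cyl")]
```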
So, we can provide a data argument to overscope the enclosed environment. But what about the enclosed environment versus the environment where the quosure is evaluated? Which one overscopes? The answer is:
```{r}
x <- "the enclosed env"
x_quoted <- quo(x)
eval_x <- function(q) {
  x <- "the calling env"
  rlang::eval_tidy(q)
}
eval_x(x_quoted)
```
# How do base R NSE and tidy eval play together?
So tidy eval is built on top of base R NSE and the two can even work together. We have seen that in quasiquotation the parts to be unquoted don't have to be quosures; we can also unquote base objects like calls and names.
```{r}
using_base_r_in_tidy_eval <- function(x, col) {
  col_q <- substitute(col)
  x %>% select(!!col_q)
}
mtcars %>% using_base_r_in_tidy_eval(cyl) %>% head(1)
```
If you want to use the quasiquotation of tidy eval, but prefer base R quotation, you can combine the two. It does not work the other way around. Since quosures are the new kid on the block, `eval()` does not know how to evaluate them; use `rlang::eval_tidy()` for those. Familiar expression objects created with tidy eval can be evaluated with `eval()`, since the objects do not differ from the ones created with base R functions.
```{r}
all.equal(quote(some_name), rlang::expr(some_name))
all.equal(quote(x + 5), rlang::expr(x + 5))
```
The only difference between these functions is in how they capture. After capture, the objects are of base types.
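A short sketch of this interplay: an expression captured with `rlang::expr()` goes straight into base `eval()`, while a quosure goes through `rlang::eval_tidy()`, or has its expression extracted first with `rlang::quo_get_expr()`. The value for `x` is supplied through a list for illustration.

```{r}
# Captured with tidy eval, evaluated with base R.
captured <- rlang::expr(x + 5)
eval(captured, list(x = 1))

# A quosure is evaluated with eval_tidy() ...
q <- rlang::quo(x + 5)
rlang::eval_tidy(q, data = list(x = 1))

# ... or unpacked into a base call first.
eval(rlang::quo_get_expr(q), list(x = 1))
```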
# Thank You
I took you along my NSE learning path, thank you for making it all the way through. If there is anything you think is incomplete or incorrect, let me know! This document is a living thing. You would do me and everybody who uses it as a reference a great favor by correcting it. The blog is maintained [here](https://github.com/EdwinTh/EdwinTh.github.io/blob/master/_source/2017-09-10-nse.Rmd), do a PR or send an email.