About FAQs
There are foundational concepts that are helpful to understand before working with the functions in this wiki. Below is a brief summary of each. Full disclosure: I am not an expert and am still learning every day. It's totally possible that my interpretations are not always accurate. See links in each section for complete information.
Most of the examples in this wiki are tidyverse
focused. That is what I am most comfortable with using and it allows all of your data manipulation to have an underlying framework which I find really helpful in keeping your code cohesive. When you install.packages("tidyverse")
you will install a collection of packages that make up the tidyverse
framework. While many packages will be installed simply by installing the tidyverse
, some affiliated packages may not automatically install and will need to be added separately.
The tidyverse
version I use in my examples is 1.3.1
https://www.tidyverse.org/
https://dplyr.tidyverse.org/articles/base.html
Many of the examples in this resource will include the use of the pipe %>% from the magrittr package. This package is part of the tidyverse. The pipe passes the output of one function into the next function as its first argument.
You can think of code like this:
data %>%
function %>%
function %>%
function
as:
Use this data -->
THEN do this -->
THEN do this -->
THEN do this
I want to note that you do not have to use the %>% operator. It is totally optional. These two ways of writing code are equivalent, yet the first may be considered more succinct:
1.
df_new <- df %>%
select(Var1) %>%
filter(Var1 == 2)
2.
df_new <- select(df, Var1)
df_new <- filter(df_new, Var1 == 2)
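If you want to try this yourself, here is a small sketch of the same comparison. I am assuming the built-in mtcars data in place of df, and the columns cyl and mpg in place of Var1, so those names are my choice and not part of the example above.

```r
library(dplyr)

# Piped version: each step hands its result to the next function
piped <- mtcars %>%
  select(cyl, mpg) %>%
  filter(cyl == 4)

# Step-by-step version without the pipe
stepwise <- select(mtcars, cyl, mpg)
stepwise <- filter(stepwise, cyl == 4)

# Check whether both approaches return the same data frame
identical(piped, stepwise)
```

Both versions run the same two functions in the same order, so the choice between them is about readability.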
https://r4ds.had.co.nz/pipes.html
https://magrittr.tidyverse.org/reference/pipe.html
https://bioconnector.github.io/workshops/r-dplyr-yeast.html#the_pipe:_%%
Consider the word variable. In R it has two meanings:
- env-variables (environment variables) are programming variables that live in an environment
  These variables are usually created with the assignment operator <- or, less commonly, with the super assignment operator <<-
- data-variables are variables that live in a data frame (columns), they usually come from files like a .csv
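A minimal sketch of the two kinds of variables side by side. The names threshold and people are made up for illustration.

```r
# An env-variable: created with <- and stored in the current environment
threshold <- 30

# A data frame with two data-variables (columns): name and age
people <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  age = c(25, 41, 33),
  stringsAsFactors = FALSE
)

# The data-variables live inside the data frame
names(people)

# Base R needs the data frame prefix to reach the data-variable age,
# while the env-variable threshold is found directly
older <- people[people$age > threshold, ]
older
```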
https://dplyr.tidyverse.org/articles/programming.html
https://saral.club/qna?id=0cde7703-d341-4fbb-ac93-5899f1eee73d&lang=r
In base R, you can extract a data-variable (column) using [[ or $.
Standard Evaluation
In standard evaluation (value oriented evaluation), you use [[
to select data-variables and you must pass the variable as a string, in quotes, inside the bracket.
Ex: To call the variable Species
from the data frame iris
would look like this:
iris[["Species"]]
or iris[,"Species"]
You can also create an env-variable and pass that as a name/symbol:
var <- "Species"
iris[[var]]
Non-Standard Evaluation
In non-standard evaluation (NSE), you use $
to select data-variables and names/symbols are treated as literal string values, so quotes are no longer needed.
Ex: To call the variable Species
from the data frame iris
would look like this:
iris$Species
You cannot, however, call Species using iris$var. Although we created the env-variable var above, which has the value Species, NSE looks for a column literally named var.
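The difference is easy to check in the console. This sketch uses only base R and the built-in iris data.

```r
var <- "Species"

# Standard evaluation: the value stored in var ("Species") is looked up
se_result <- iris[[var]]

# Non-standard evaluation: $ looks for a column literally named "var"
nse_result <- iris$var

# iris has no column called var, so the NSE lookup finds nothing
is.null(nse_result)

# The standard evaluation result matches the column selected by name
identical(se_result, iris$Species)
```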
https://win-vector.com/2019/04/02/standard-evaluation-versus-non-standard-evaluation-in-r/
https://thomasadventure.blog/posts/understanding-nse-part1/
Tidy evaluation, used in the tidyverse, uses two forms of non-standard evaluation: data masking and tidy selection. The information below is a very simplified summary of the "need to know to get by" points from the programming with dplyr vignette. See the vignette for more detailed information.
Data masking
Data masking is a type of NSE that simplifies your code even more and allows you to only call your data once, rather than multiple times like you would in base R. With data masking you can refer to your data-variables as is, rather than attaching a prefix to your variable name (data$
).
base R
starwars[starwars$homeworld == "Naboo" & starwars$species == "Human", ]
tidyverse
starwars %>%
filter(homeworld == "Naboo", species == "Human")
Examples of functions that use data masking include: arrange()
, count()
, filter()
, group_by()
, mutate()
, and summarise()
.
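Data masking also lets you mix data-variables and env-variables in the same expression. Here is a small sketch, assuming the built-in mtcars data and an env-variable I have named min_cyl for illustration.

```r
library(dplyr)

# An env-variable that lives outside the data frame
min_cyl <- 6

# Inside filter(), cyl is found in the data and min_cyl in the environment
six_plus <- mtcars %>%
  filter(cyl >= min_cyl)

# The base R equivalent has to repeat the data frame name
six_plus_base <- mtcars[mtcars$cyl >= min_cyl, ]

nrow(six_plus) == nrow(six_plus_base)
```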
Tidy selection
Tidy selection is a complementary tool to data masking that makes it even easier to work with dataset columns.
The tidyselect
package is referred to throughout this wiki. This package allows selection of variables based on their name or properties. This package underlies all functions that use tidy selection.
Selection helpers and operators in the tidyselect
package can be found here: https://tidyselect.r-lib.org/reference/language.html and here https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html
Ex:
select(df, starts_with("a"))
selects all columns from df that start with "a"
Examples of functions that use tidy selection include: across()
, relocate()
, rename()
, select()
, and pull()
.
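A short sketch of tidy selection by name and by property, assuming the built-in iris data. The object names are mine.

```r
library(dplyr)

# Select by name: columns whose names start with "Sepal"
sepal_cols <- iris %>%
  select(starts_with("Sepal")) %>%
  names()

# Select by property: columns that hold numeric values
numeric_cols <- iris %>%
  select(where(is.numeric)) %>%
  names()

sepal_cols
numeric_cols
```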
https://dplyr.tidyverse.org/articles/programming.html
https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html
https://rlang.r-lib.org/reference/topic-data-mask.html
Quotes are used around strings in R. You can typically use either double quotes (") or single quotes ('). Here are a few examples.
- If you are providing a character vector of names. Ex: c("Var1", "Var2", "Var3")
- If you are referring to a string in a function.
Ex: The function stringr::str_replace(string, pattern, replacement).
str_replace() itself does not use tidy evaluation, but when we call it inside a data-masking function such as dplyr::mutate(), we can refer to our existing variable for the string argument without using quotes. However, you will need to put quotes around the string pattern and the string replacement arguments. The pattern is a regular expression, so the period is escaped as \\. to match a literal period.
df %>%
dplyr::mutate(Name = stringr::str_replace(Name, "Ms\\.", ""))
There are also backticks, which are used to refer to names that are not syntactically valid in R, such as names that contain spaces or other special symbols. For example, variable names with spaces are not allowed. So you should use backticks to call those variables (or better yet, rename them with no spaces).
df %>%
select(`Student Name`)
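Here is a runnable sketch that combines quotes and backticks. The data frame and its values are made up, and I use check.names = FALSE so that data.frame() keeps the space in the column name.

```r
library(dplyr)

# A made-up data frame with a column name that contains a space
df <- data.frame(
  `Student Name` = c("Ms. Lee", "Ms. Cruz"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

df_clean <- df %>%
  # Backticks refer to the existing name; the new name needs no backticks
  rename(student_name = `Student Name`) %>%
  # Quotes go around the pattern and replacement strings
  mutate(student_name = stringr::str_replace(student_name, "Ms\\. ", ""))

df_clean

# Single and double quotes create the same string
identical('Var1', "Var1")
```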
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Quotes
https://stackoverflow.com/questions/62094504/when-are-backticks-used-compared-to-double-quotes
Throughout the examples you will see the base
function c()
used, which stands for combine. Oftentimes we are combining variables (column names) into a vector to do some sort of selection/manipulation on.
Ex: c("Var1", "Var2", "Var3") or c(Var1, Var2, Var3), depending on whether you are using tidy selection or not.
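To see both forms in action, here is a sketch assuming the built-in mtcars data. The columns mpg and cyl stand in for Var1 and Var2.

```r
library(dplyr)

# Quoted names: an ordinary character vector
keep <- c("mpg", "cyl")
base_subset <- mtcars[, keep]

# Unquoted names: only valid inside a tidy selection function such as select()
tidy_subset <- select(mtcars, c(mpg, cyl))

# Check that both select the same columns
identical(names(base_subset), names(tidy_subset))
```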
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/c
https://libraryguides.mcgill.ca/c.php?g=699776&p=4968546
https://monashbioinformaticsplatform.github.io/r-intro/start.html
You will see the tilde (~) used in some of the examples as well. You can read it as “as a function of”. It may be used any time a function has an argument .fn (as in dplyr::rename_with()), .f (as in purrr::map()) or .fns (as in dplyr::across()). You use ~ to create an anonymous (unnamed) function written as a formula. In R 4.1 or newer, the ~ can be replaced with the \(x) shorthand for function(x). The argument can be given any valid name, including a dot, as in \(.).
Consider this example:
dplyr::rename_with(df, ~ stringr::str_replace(., "s_", "t_"), .cols = everything())
In this example, dplyr::rename_with()
has the arguments of rename_with(.data, .fn, .cols = everything(), ...)
For the .fn argument we are using stringr::str_replace()
and we use ~
to call the function and its arguments.
If we are using R 4.1 or newer we could use \(.)
instead.
dplyr::rename_with(df, \(.) stringr::str_replace(., "s_", "t_"), .cols = everything())
If, however, the .fn function you call needs no arguments other than the one it is applied to, such as base::toupper(), you can simply name the function and you do not need to use the ~ to create an anonymous function.
For example:
dplyr::rename_with(df, toupper, .cols = everything())
Last, if we pull the data frame out of the function in order to use our pipe operator (see pipe operator
above), we can use a .
as a place holder for that argument (to reference the data we previously called). See dot notation
below.
df %>%
dplyr::rename_with(., toupper, .cols = everything())
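The three ways of writing the anonymous function can be compared directly. This is a sketch with a made-up data frame, and the \(x) line assumes R 4.1 or newer.

```r
library(dplyr)

# A made-up data frame whose column names start with "s_"
df <- data.frame(s_one = 1:2, s_two = 3:4)

# Formula shorthand created with ~
with_tilde <- rename_with(df, ~ stringr::str_replace(.x, "s_", "t_"))

# Backslash shorthand (will not parse in R versions older than 4.1)
with_lambda <- rename_with(df, \(x) stringr::str_replace(x, "s_", "t_"))

# Fully spelled out anonymous function
with_function <- rename_with(df, function(x) stringr::str_replace(x, "s_", "t_"))

names(with_tilde)
```

All three pass the column names to str_replace(), so they should produce the same renamed data frame.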
https://www.quora.com/What-does-mean-in-R-7
https://stackoverflow.com/questions/14976331/use-of-tilde-in-r-programming-language
https://coolbutuseless.github.io/2019/03/13/anonymous-functions-in-r-part-1/
https://purrr.tidyverse.org/articles/other-langs.html
https://bitcoden.com/answers/why-use-purrrmap-instead-of-lapply
https://towardsdatascience.com/the-new-pipe-and-anonymous-function-syntax-in-r-54d98861014c
https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf
You may see the dot notation .
used within a function as a place holder. There are two types of dots.
There is .
and there is .x
.
The .
is a way to denote that we are referring to the value coming from the previous pipe (such as a data frame).
The .x is the value of something used in an anonymous function (see Tilde ~ above). The first parameter is .x and, if there is a second, it is .y. A shortcut for the first parameter is simply a single dot (.).
Consider the nested function calls below.
df %>%
dplyr::rename_with(., ~ stringr::str_replace(.x, "s_", "t_"), .cols = everything())
In this example, dplyr::rename_with()
has the arguments of rename_with(.data, .fn, .cols = everything(), ...)
Since we put df %>%
before the function, we can simply add .
to refer to the outside data df for the .data argument.
Otherwise, if we kept the data within the function, our formula would look like this:
dplyr::rename_with(df, ~ stringr::str_replace(.x, "s_", "t_"), .cols = everything())
Next, stringr::str_replace()
has the arguments of str_replace(string, pattern, replacement)
We also use the dot notation .x (or simply .) as the string argument of str_replace(). Here .x stands for the column names that rename_with() passes to the function. Which names are passed is determined by the outside dplyr::rename_with() .cols argument, where we are currently using the default, .cols = everything().
We could also modify the .cols default and change our code to select certain columns:
df %>%
dplyr::rename_with(., ~ stringr::str_replace(.x, "s_", "t_"), .cols = tidyselect::contains("var"))
When the piped value is the first argument of the next function, you do not need to explicitly write the dot placeholder. See this article for more information: https://magrittr.tidyverse.org/reference/pipe.html#arguments
The simplified version of the original code above would be:
df %>%
dplyr::rename_with(~ stringr::str_replace(., "s_", "t_"))
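Here is a small sketch of the different dots outside of rename_with(). It assumes the magrittr and purrr packages are installed, and the values are made up.

```r
library(magrittr)

# . as a pipe placeholder: the piped value goes where the dot is,
# instead of into the first argument
greeting <- "world" %>% paste("hello", .)

# .x and .y are the first and second parameters of an anonymous function
sums <- purrr::map2_dbl(c(1, 2, 3), c(4, 5, 6), ~ .x + .y)

# . used as a shortcut for .x
doubled <- purrr::map_dbl(c(1, 2, 3), ~ . * 2)

greeting
sums
doubled
```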
https://stackoverflow.com/questions/9652943/usage-of-dot-period-in-r-functions
https://stackoverflow.com/questions/51562671/what-do-the-dot-and-the-tilde-represent-in-r
https://stackoverflow.com/questions/56532119/dplyr-piping-data-difference-between-and-x
You will see the double colon, also called the namespace operator, used throughout the examples in this wiki. When you start a new R session, you need to load the packages you plan to use in that session through library("nameofpackage"). This is still good practice. However, for the sake of these examples I also use this notation for every function, package::functionname, both so that the function can be called even if the package has not been loaded with library() (it must still be installed) and to explicitly show which package I am calling the function from. There are times when multiple packages use the same function name, and if you load several of them, the most recently loaded package masks the others, which can cause errors or unexpected results.
Ex: psych::describe()
vs Hmisc::describe()
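A sketch of the same idea with a different pair of functions. I am swapping in stats::filter() and dplyr::filter() here, since stats ships with R and psych or Hmisc may not be installed on your machine.

```r
# stats::filter() applies a moving filter to a series of numbers
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that meet a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)

# The package prefix makes it clear which filter() is meant,
# and neither package had to be loaded with library() first
as.numeric(moving_avg)
nrow(four_cyl)
```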
https://stackoverflow.com/questions/35240971/what-are-the-double-colons-in-r
https://r-pkgs.org/namespace.html
"Arguments are the parameters provided to a function to perform operations in a programming language" (GeeksforGeeks). The number of arguments within a function vary, and they are separated by commas. Arguments can have a default value. If you do not specify an argument in your function, the default will be used. You can see all the documentation for a function by typing ?nameoffunction
or just the arguments by typing args(nameoffunction)
in your console. For this to work, the package that contains the function must be loaded in your session, or you can prefix the function with its package name, as in ?dplyr::rename_with.
Ex:
library("tidyverse")
?rename_with
Or
args(rename_with)
Would both show you the arguments and defaults of dplyr::rename_with()
which are:
rename_with(.data, .fn, .cols = everything(), ...)
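If you want to work with the arguments in code rather than read them in the console, base R's formals() returns them as a named list. A minimal sketch, assuming dplyr is installed:

```r
# args() prints the argument list of a function
args(dplyr::rename_with)

# formals() returns the same arguments as a named list
arg_names <- names(formals(dplyr::rename_with))
arg_names

# A default value is stored as an unevaluated expression
cols_default <- formals(dplyr::rename_with)$.cols
cols_default
```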
Ellipsis ...
You'll notice that the arguments above include a special argument ..., called an ellipsis, dots, dot-dot-dot, or three-dots argument. The dots argument allows an arbitrary number of additional arguments in a function and allows passing arguments on to other functions called within your function (Burns Statistics). So for example, when we supply a function to the .fn argument above, we can pass any further arguments that function needs, even though they aren't explicitly listed in our dplyr::rename_with() arguments. Tom Mock talks about this in his "Building R packages" presentation where he discusses "passing the dots". https://www.youtube.com/watch?v=EpTkT6Rkgbs
Using the function and arguments from above, if we wanted to replace a pattern in all of our variable names, our code may look like this:
dplyr::rename_with(df, stringr::str_replace, .cols = everything(), pattern = "\\.", replacement = "_")
You can see here that even though the pattern and replacement arguments of str_replace() are not listed in the rename_with() arguments, we can pass them through the dots. If we instead wrap the call in an anonymous function, as in ~ stringr::str_replace(.x, "\\.", "_"), the arguments are supplied inside that function and the dots are not involved.
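To see what "passing the dots" looks like from the inside, here is a sketch of a tiny function of my own. The name label_values is hypothetical and only there for illustration.

```r
# Hypothetical helper: anything supplied after x is forwarded to paste()
label_values <- function(x, ...) {
  paste("value", x, ...)
}

# No extra arguments: paste() uses its default separator, a space
default_labels <- label_values(1:2)

# sep is not an argument of label_values(), but it reaches paste() through ...
custom_labels <- label_values(1:2, sep = "_")

default_labels
custom_labels
```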
https://adv-r.hadley.nz/functions.html#fun-dot-dot-dot
https://data-flair.training/blogs/r-arguments-introduction/
https://www.r-bloggers.com/2015/02/r-three-dots-ellipsis/
https://rlang.r-lib.org/reference/topic-data-mask-programming.html
The dplyr::across()
function will be used frequently throughout these examples. It is a function that allows us to select multiple variables and apply the same transformation across all variables.
Its 2 main arguments (there are more) are .cols and .fns.
The .cols argument allows us to select columns using tidy-selection (see tidy-selection under Tidy Evaluation above).
The .fns argument takes the function to apply to each selected column. It can be a named function or an anonymous function created using the ~ (see Tilde ~ above).
Here is dplyr::across()
used in an example.
df %>%
dplyr::mutate(dplyr::across(Var1:Var2, ~stringr::str_replace(., "x", "y")))
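Here is the same pattern on a made-up data frame so you can see the result. The values are invented, and note that str_replace() only replaces the first match in each string.

```r
library(dplyr)

# A made-up data frame with two character columns to transform
df <- data.frame(
  id = 1:2,
  Var1 = c("xa", "bx"),
  Var2 = c("xx", "c"),
  stringsAsFactors = FALSE
)

# Apply the same replacement to every column from Var1 through Var2
df_new <- df %>%
  mutate(across(Var1:Var2, ~ stringr::str_replace(.x, "x", "y")))

df_new
```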
https://dplyr.tidyverse.org/reference/across.html
Using dplyr::across() in a filter statement is deprecated. dplyr::if_any() and dplyr::if_all() are predicate functions used to select columns within a filtering or logic statement.
These functions are available as of version 1.0.4 of dplyr.
dplyr::if_any() returns TRUE when the statement is true for any of the variables. dplyr::if_all() returns TRUE when the statement is true for all of the variables.
Like dplyr::across(), the 2 main arguments are .cols and .fns and they are used in the same manner as mentioned above.
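A sketch of both functions inside filter(), using a made-up data frame of test scores. The cutoffs are arbitrary.

```r
library(dplyr)

# Made-up scores for three students
scores <- data.frame(
  id = 1:3,
  test1 = c(90, 40, 80),
  test2 = c(85, 95, 30)
)

# Keep rows where at least one test score is above 88
any_high <- scores %>%
  filter(if_any(c(test1, test2), ~ .x > 88))

# Keep rows where every test score is 50 or higher
all_pass <- scores %>%
  filter(if_all(c(test1, test2), ~ .x >= 50))

any_high
all_pass
```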
https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
There are 2 selection helpers from the tidyselect package: any_of() and all_of().
They are used in conjunction with functions that use tidy selection, such as dplyr::select(), when you want to select columns named in an external character vector (one stored outside of your data frame). all_of() is strict and throws an error if any of the names is missing from the data, while any_of() silently ignores names that do not exist.
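The difference between the two helpers shows up when the vector contains a name that is not in the data. A sketch, assuming the built-in mtcars data and a vector I have named wanted:

```r
library(dplyr)

# An external character vector; the last name is not a column in mtcars
wanted <- c("mpg", "cyl", "not_a_column")

# any_of() keeps the names that exist and ignores the rest
lenient <- select(mtcars, any_of(wanted))

# all_of() requires every name to exist, so this call fails;
# tryCatch() captures the failure so the script keeps running
strict <- tryCatch(
  select(mtcars, all_of(wanted)),
  error = function(e) "error"
)

names(lenient)
strict
```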
https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
https://tidyselect.r-lib.org/reference/all_of.html
The embrace operator is often used when writing functions to forward an argument in a data-masked context or in tidy-selection. You can use {{}} to denote that you are referring to a data frame variable, not an environment variable.
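A minimal sketch of the embrace operator inside a function. The function mean_by and its argument names are hypothetical, and I am assuming the built-in mtcars data.

```r
library(dplyr)

# Hypothetical helper: group_var and value_var are supplied as bare column names
mean_by <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean_value = mean({{ value_var }}), .groups = "drop")
}

# cyl and mpg are data-variables in mtcars, passed without quotes
result <- mean_by(mtcars, cyl, mpg)
result
```

Without the {{ }}, group_by() would look for a column literally named group_var instead of the column the caller passed in.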
https://rlang.r-lib.org/reference/topic-data-mask-programming.html
https://rlang.r-lib.org/reference/topic-data-mask.html
https://rlang.r-lib.org/reference/topic-inject.html
https://dplyr.tidyverse.org/articles/programming.html
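The walrus operator, :=, allows us to dynamically create and name new variables in functions. In tidyverse functions it takes the place of = when the name on the left-hand side is built dynamically, for example from a string or an embraced argument. In data.table the same operator is called assignment by reference. You can see an example here: https://cghlewis.github.io/data-wrangling-functions/create-variables/add-new-column-indicator.html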
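Here is a sketch of the tidyverse use of := inside a function. The function add_flag and the cutoff are hypothetical, and I am assuming the built-in mtcars data and dplyr 1.0 or newer.

```r
library(dplyr)

# Hypothetical helper: creates a new column named after the column passed in
add_flag <- function(data, var, cutoff) {
  data %>%
    # The new name is built from the embraced argument, so := is used instead of =
    mutate("{{ var }}_high" := {{ var }} > cutoff)
}

flagged <- add_flag(mtcars, mpg, 25)

# The new column takes its name from the column that was supplied
"mpg_high" %in% names(flagged)
```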
https://rdatatable.gitlab.io/data.table/reference/assign.html