About FAQs

Cghlewis edited this page Apr 3, 2024 · 133 revisions

There are foundational concepts that are helpful to understand before working with the functions in this wiki. Below is a brief summary of each. Full disclosure: I am not an expert and am still learning every day. It's totally possible that my interpretations are not always accurate. See links in each section for complete information.

Tidyverse

Most of the examples in this wiki are tidyverse focused. That is what I am most comfortable using, and it gives all of your data manipulation a shared underlying framework, which I find really helpful for keeping code cohesive. Running install.packages("tidyverse") installs the collection of packages that make up the tidyverse. Most packages are installed this way, but some affiliated packages are not and need to be installed separately.

The tidyverse version I use in my examples is 1.3.1

https://www.tidyverse.org/
https://dplyr.tidyverse.org/articles/base.html

Pipe operator %>%

Many of the examples in this resource use the pipe %>% from the magrittr package, which is part of the tidyverse. The pipe passes the output of one function into the next function, by default as its first argument.

You can think of code like this:
data %>%
  function %>%
    function %>%
     function

as:
Use this data -->
  THEN do this -->
    THEN do this -->
     THEN do this

I want to note that you do not have to use the %>% operator. It is totally optional. These two ways of writing code are equivalent, yet the first may be considered more succinct:

1.
df_new <- df %>%
  select(Var1) %>%
  filter(Var1 == 2)

2.
df_new <- select(df, Var1)

df_new <- filter(df_new, Var1 == 2)
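
As a runnable sketch of the comparison above, the code below builds a small made-up data frame (df and its values are assumptions for illustration only) and checks that the piped and step-by-step versions give the same result.

```r
library(dplyr)

# Hypothetical example data, not taken from the wiki
df <- data.frame(Var1 = c(1, 2, 2, 3), Var2 = c("a", "b", "c", "d"))

# 1. Piped version: each step feeds the next
piped <- df %>%
  select(Var1) %>%
  filter(Var1 == 2)

# 2. Step-by-step version: the intermediate result is saved and reused
stepwise <- select(df, Var1)
stepwise <- filter(stepwise, Var1 == 2)

# Compare the two results
identical(piped, stepwise)
```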

https://r4ds.had.co.nz/pipes.html
https://magrittr.tidyverse.org/reference/pipe.html
https://bioconnector.github.io/workshops/r-dplyr-yeast.html#the_pipe:_%%

Non-standard evaluation (NSE)

Consider the word variable. In R it has two meanings:

  • env-variables (environment variables) are programming variables that live in an environment

These variables are usually created with the assignment operator <- or less commonly with the super assignment operator <<-

  • data-variables are variables that live in a data frame (columns), they usually come from files like a .csv

https://dplyr.tidyverse.org/articles/programming.html
https://saral.club/qna?id=0cde7703-d341-4fbb-ac93-5899f1eee73d&lang=r

In base R, you can extract a data-variable (column) using [[ or $.

Standard Evaluation

In standard evaluation (value oriented evaluation), you use [[ to select data-variables and you must pass the variable as a string, in quotes, inside the bracket.

Ex: To call the variable Species from the data frame iris would look like this:

iris[["Species"]] or iris[,"Species"]

You can also create an env-variable and pass that as a name/symbol:

var <- "Species"

iris[[var]]

Non-Standard Evaluation

In non-standard evaluation (NSE), you use $ to select data-variables and names/symbols are treated as literal string values, so quotes are no longer needed.

Ex: To call the variable Species from the data frame iris would look like this:

iris$Species

You cannot however call Species using iris$var, because while above we created the env-variable var which has the value Species, NSE looks for a column named var.
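
The difference can be checked directly with a minimal sketch that uses only base R and the built-in iris data. The object names se_result and nse_result are just illustrative.

```r
# iris ships with base R, so no packages are needed
var <- "Species"

# Standard evaluation: the value stored in `var` is used as the column name
se_result <- iris[[var]]

# Non-standard evaluation: `$` looks for a column literally named "var"
nse_result <- iris$var

identical(se_result, iris$Species)  # TRUE
is.null(nse_result)                 # TRUE, iris has no column called "var"
```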

https://win-vector.com/2019/04/02/standard-evaluation-versus-non-standard-evaluation-in-r/
https://thomasadventure.blog/posts/understanding-nse-part1/

Tidy Evaluation

Tidy evaluation, used in the tidyverse, relies on two forms of non-standard evaluation: data masking and tidy selection. The information below is a very simplified summary of the "need to know to get by" points from the programming with dplyr vignette. See the vignette for more detailed information.

Data masking

Data masking is a type of NSE that simplifies your code even more and allows you to only call your data once, rather than multiple times like you would in base R. With data masking you can refer to your data-variables as is, rather than attaching a prefix to your variable name (data$).

base R

starwars[starwars$homeworld == "Naboo" & starwars$species == "Human", ]

tidyverse

starwars %>%
  filter(homeworld == "Naboo", species == "Human")

Examples of functions that use data masking include: arrange(), count(), filter(), group_by(), mutate(), and summarise().
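
Here is a small sketch of the same comparison on made-up data (the people data frame is an assumption, chosen so that it contains no missing values).

```r
library(dplyr)

# Hypothetical data with no missing values
people <- data.frame(
  name = c("Ann", "Bo", "Cy"),
  homeworld = c("Naboo", "Naboo", "Tatooine"),
  species = c("Human", "Gungan", "Human")
)

# Base R: the data frame name is repeated for every column
base_rows <- people[people$homeworld == "Naboo" & people$species == "Human", ]

# dplyr: data masking lets the columns be referenced by bare name
masked_rows <- people %>%
  filter(homeworld == "Naboo", species == "Human")

base_rows$name
masked_rows$name
```

One caveat when the data does contain missing values: filter() drops rows where the condition evaluates to NA, while base subsetting with [ returns rows filled with NA for them, so the two approaches can differ.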

Tidy selection

Tidy selection is a complementary tool to data masking that makes it even easier to work with dataset columns.

The tidyselect package is referred to throughout this wiki. This package allows selection of variables based on their name or properties. This package underlies all functions that use tidy selection.

Selection helpers and operators in the tidyselect package can be found here: https://tidyselect.r-lib.org/reference/language.html and here https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html

Ex:

select(df, starts_with("a")) selects all columns from df that start with "a"

Examples of functions that use tidy selection include: across(), relocate(), rename(), select(), and pull().
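
To make "name or properties" concrete, here is a minimal sketch on a made-up data frame (df and its columns are assumptions). It selects once by name pattern and once by column type.

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(age = c(10, 11), area = c("N", "S"), score = c(90, 85))

# Select by name: columns whose names start with "a"
a_cols <- select(df, starts_with("a"))

# Select by property: keep only numeric columns
num_cols <- select(df, where(is.numeric))

names(a_cols)    # "age" "area"
names(num_cols)  # "age" "score"
```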

https://dplyr.tidyverse.org/articles/programming.html
https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html
https://rlang.r-lib.org/reference/topic-data-mask.html

Quotes

Quotes are used around strings in R. You can typically use either double quotes (") or single quotes ('). Here are a few examples.

  1. If you are providing a character vector of names. Ex: c("Var1", "Var2", "Var3")
  2. If you are referring to a string in a function.

Ex: The function stringr::str_replace(string, pattern, replacement).
When this function is called inside a data-masking function such as dplyr::mutate(), we do not need quotes for the string argument because data masking lets us refer to an existing variable by its bare name. However, you will need to put quotes around the pattern and replacement arguments. Note that pattern is a regular expression, so the period is escaped here to match a literal ".".

df %>%
  dplyr::mutate(Name = stringr::str_replace(Name, "Ms\\.", ""))

There are also backticks, which are used to refer to non-syntactic names, meaning names that contain spaces or symbols R does not otherwise allow in a name. For example, variable names with spaces are not allowed, so you should use backticks to call those variables (or better yet, rename them with no spaces).

df %>%
  select(`Student Name`)
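
A short sketch of both options, using a made-up data frame (the names and values are assumptions). Note that data.frame() only keeps the space in the column name when check.names = FALSE is set.

```r
library(dplyr)

# check.names = FALSE keeps the space in the column name
df <- data.frame(`Student Name` = c("Ana", "Ben"), check.names = FALSE)

# Backticks are required to refer to the non-syntactic name
df %>%
  select(`Student Name`)

# Renaming removes the need for backticks afterwards
df_clean <- df %>%
  rename(student_name = `Student Name`)

names(df_clean)  # "student_name"
```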

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Quotes
https://stackoverflow.com/questions/62094504/when-are-backticks-used-compared-to-double-quotes

c()

Throughout the examples you will see the base function c() used, which stands for combine. Oftentimes we are combining variables (column names) into a vector to do some sort of selection/manipulation on.

Ex: c("Var1", "Var2", "Var3") or c(Var1, Var2, Var3) depending if you are using tidy selection or not.
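
As a quick illustration of the two forms, here is a sketch on a hypothetical data frame. Quoted names are used with base R subsetting and bare names with tidy selection.

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(Var1 = 1:2, Var2 = 3:4, Var3 = 5:6)

# Base R: column names are combined as quoted strings
base_sel <- df[, c("Var1", "Var2")]

# Tidy selection: bare column names can be combined instead
tidy_sel <- select(df, c(Var1, Var2))

names(base_sel)
names(tidy_sel)
```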

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/c
https://libraryguides.mcgill.ca/c.php?g=699776&p=4968546
https://monashbioinformaticsplatform.github.io/r-intro/start.html

Tilde ~

You will see the tilde (~) used in some of the examples as well. You can read it as “as a function of”. It may be used any time a function has an argument .fn (as in dplyr::rename_with()), .f (as in purrr::map()) or .fns (as in dplyr::across()). You use ~ to create an anonymous (unnamed) function written as a formula. In R 4.1 or newer, the ~ can be replaced with \(.), which is base R shorthand for function(.).

Consider this example:
dplyr::rename_with(df, ~ stringr::str_replace(., "s_", "t_"), .cols = everything())

In this example, dplyr::rename_with() has the arguments of rename_with(.data, .fn, .cols = everything(), ...)

For the .fn argument we are using stringr::str_replace() and we use ~ to call the function and its arguments.

If we are using R 4.1 or newer we could use \(.) instead.

dplyr::rename_with(df, \(.) stringr::str_replace(., "s_", "t_"), .cols = everything())

If, however, the .fn function you call needs no arguments beyond the values being passed to it, such as base::toupper(), you can simply name the function and you do not need to use the ~ to create an anonymous function.

For example:
dplyr::rename_with(df, toupper, .cols = everything())

Last, if we pull the data frame out of the function in order to use our pipe operator (see pipe operator above), we can use a . as a place holder for that argument (to reference the data we previously called). See dot notation below.

df %>%
  dplyr::rename_with(., toupper, .cols = everything())
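
The different ways of writing the anonymous function can be compared side by side. This is a sketch on a made-up data frame (df and its column names are assumptions), and the \(x) form requires R 4.1 or newer.

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(s_id = 1:2, s_name = c("a", "b"))

# Formula shorthand with ~
f1 <- rename_with(df, ~ stringr::str_replace(.x, "s_", "t_"))

# Lambda shorthand (R >= 4.1)
f2 <- rename_with(df, \(x) stringr::str_replace(x, "s_", "t_"))

# Fully written-out anonymous function
f3 <- rename_with(df, function(x) stringr::str_replace(x, "s_", "t_"))

names(f1)  # "t_id" "t_name"
```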

https://www.quora.com/What-does-mean-in-R-7
https://stackoverflow.com/questions/14976331/use-of-tilde-in-r-programming-language
https://coolbutuseless.github.io/2019/03/13/anonymous-functions-in-r-part-1/
https://purrr.tidyverse.org/articles/other-langs.html
https://bitcoden.com/answers/why-use-purrrmap-instead-of-lapply
https://towardsdatascience.com/the-new-pipe-and-anonymous-function-syntax-in-r-54d98861014c
https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf

Dot notation .

You may see the dot notation . used within a function as a place holder. There are two types of dots.

There is . and there is .x.

The . is a way to denote that we are referring to the value coming from the previous pipe (such as a data frame).

The .x is the value of something used in an anonymous function (see Tilde ~ above). The first parameter is .x and if there is a second, it is .y. A shortcut for the first parameter is a single dot (.).

Consider the nested function calls below.

df %>%
  dplyr::rename_with(., ~ stringr::str_replace(.x, "s_", "t_"), .cols = everything())


In this example, dplyr::rename_with() has the arguments of rename_with(.data, .fn, .cols = everything(), ...)

Since we put df %>% before the function, we can simply add . to refer to the outside data df for the .data argument.

Otherwise, if we kept the data within the function, our formula would look like this:

dplyr::rename_with(df, ~ stringr::str_replace(.x, "s_", "t_"), .cols = everything())


Next, stringr::str_replace() has the arguments of str_replace(string, pattern, replacement)

We also use the dot notation .x (or simply .) as the string argument of str_replace() to stand for the column names being renamed. Which names are passed in is decided by the .cols argument of the outer dplyr::rename_with(), where we are currently using the default, .cols = everything().

We could also modify the .cols default and change our code to select certain columns:

df %>%
  dplyr::rename_with(., ~ stringr::str_replace(.x, "s_", "t_"), .cols = tidyselect::contains("var"))


There are cases where you do not need to explicitly write the . placeholder: by default the pipe places the left-hand side as the first argument of the function on the right. See this article for more information: https://magrittr.tidyverse.org/reference/pipe.html#arguments

The simplified version of the original code above would be:

df %>%
  dplyr::rename_with(~ stringr::str_replace(., "s_", "t_"))
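
To see the different dots in isolation, here is a small sketch using purrr and magrittr on made-up numeric vectors (the values are assumptions for illustration).

```r
library(purrr)
library(magrittr)

# Pipe placeholder: . stands for the value coming from the left-hand side
roots <- c(1, 4, 9) %>% sqrt(.)                           # 1 2 3

# One input: .x (or just .) is the current element
times_ten <- map_dbl(c(1, 2, 3), ~ .x * 10)               # 10 20 30

# Two inputs: .x comes from the first vector, .y from the second
sums <- map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)    # 11 22 33
```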

https://stackoverflow.com/questions/9652943/usage-of-dot-period-in-r-functions
https://stackoverflow.com/questions/51562671/what-do-the-dot-and-the-tilde-represent-in-r
https://stackoverflow.com/questions/56532119/dplyr-piping-data-difference-between-and-x

Double colon (::)

You will see the double colon, also called the namespace operator, used throughout the examples in this wiki. When you start a new R session, you always need to load the packages you plan to use in that session through library("nameofpackage"). This is still good practice. However, for the sake of these examples I also use the notation package::functionname for every function. This lets a function be called without attaching its package with library(), and it explicitly shows which package the function comes from. There are times when multiple packages use the same function name, and if you have several such packages loaded, one function masks the other, which can cause errors or unexpected results.

Ex: psych::describe() vs Hmisc::describe()
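
Another common pair is stats::filter() and dplyr::filter(). The sketch below (with made-up inputs) calls each one explicitly, so there is no ambiguity about which function runs, and neither package needs to be attached with library().

```r
# stats::filter() applies a linear filter (here a 3-point moving average)
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that meet a condition
kept_rows <- dplyr::filter(data.frame(x = 1:5), x > 3)

class(moving_avg)
nrow(kept_rows)
```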

https://stackoverflow.com/questions/35240971/what-are-the-double-colons-in-r
https://r-pkgs.org/namespace.html

Arguments

"Arguments are the parameters provided to a function to perform operations in a programming language" (GeeksforGeeks). The number of arguments within a function vary, and they are separated by commas. Arguments can have a default value. If you do not specify an argument in your function, the default will be used. You can see all the documentation for a function by typing ?nameoffunction or just the arguments by typing args(nameoffunction) in your console. The package that contains the function must be loaded in your session for this to work.

Ex:

library("tidyverse")
?rename_with

Or

args(rename_with)

Would both show you the arguments and defaults of dplyr::rename_with() which are:
rename_with(.data, .fn, .cols = everything(), ...)
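
If you want the argument names programmatically rather than printed, base R's formals() is one option. This is only a sketch, and the names arg_names and cols_default are illustrative.

```r
# formals() returns a function's arguments and their defaults as a list
arg_names <- names(formals(dplyr::rename_with))
arg_names

# The default for .cols is stored as an unevaluated call
cols_default <- deparse(formals(dplyr::rename_with)[[".cols"]])
cols_default
```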


Ellipsis ...

You'll notice that the arguments above include a special argument ..., called an ellipsis, dots, dot-dot-dot, or three-dots argument. The dots argument allows an arbitrary number of unnamed arguments in a function and allows passing arguments on to other functions nested within your function (Burns Statistics). So for example, when we add a function into the argument .fn above, we can pass all necessary arguments into that function, even though they aren't explicitly listed in our dplyr::rename_with() arguments. Tom Mock talks about this in his "Building R packages" presentation, where he discusses "passing the dots". https://www.youtube.com/watch?v=EpTkT6Rkgbs

Using the function and arguments from above, if we wanted to replace a pattern in all of our variable names, our code may look like this:

dplyr::rename_with(df, ~ stringr::str_replace(string = .x, pattern = "\\.", replacement = "_"), .cols = everything())

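You can see here that the str_replace() arguments are supplied inside the anonymous function, even though they are not listed among the rename_with() arguments. Because rename_with() forwards ... on to .fn, the same arguments could instead be passed directly after a bare function name, without the ~.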
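
Both routes can be sketched on a made-up data frame (df, its column names, and the wrapper my_mean() are assumptions for illustration). The last part shows a hypothetical function that simply passes its own dots along.

```r
library(dplyr)

# Hypothetical data with periods in the column names
df <- data.frame(a.1 = 1:2, b.1 = 3:4)

# Arguments supplied inside an anonymous function
via_lambda <- rename_with(df, ~ stringr::str_replace(.x, "\\.", "_"))

# Same arguments forwarded through ... to the bare function
via_dots <- rename_with(df, stringr::str_replace, pattern = "\\.", replacement = "_")

names(via_lambda)  # "a_1" "b_1"
names(via_dots)    # "a_1" "b_1"

# A hypothetical wrapper that passes its dots on to mean()
my_mean <- function(x, ...) mean(x, ...)
my_mean(c(1, NA, 3), na.rm = TRUE)  # 2
```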

https://adv-r.hadley.nz/functions.html#fun-dot-dot-dot
https://data-flair.training/blogs/r-arguments-introduction/
https://www.r-bloggers.com/2015/02/r-three-dots-ellipsis/
https://rlang.r-lib.org/reference/topic-data-mask-programming.html

across

The dplyr::across() function will be used frequently throughout these examples. It is a function that allows us to select multiple variables and apply the same transformation across all variables.

Its two main arguments (there are more) are .cols and .fns.

The .cols argument allows us to select columns using tidy selection (see Tidy selection under Tidy Evaluation above). The .fns argument allows us to create an anonymous function using the ~ (see Tilde ~ above).

Here is dplyr::across() used in an example.

df %>%
  dplyr::mutate(dplyr::across(Var1:Var2, ~ stringr::str_replace(., "x", "y")))
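
Here is a self-contained sketch of that pattern on made-up data (df and its values are assumptions). Only Var1 and Var2 are selected, so Var3 is left untouched.

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(
  Var1 = c("x1", "x2"),
  Var2 = c("ax", "bx"),
  Var3 = c("x", "x")
)

# Replace the first "x" with "y" in Var1 through Var2 only
df_new <- df %>%
  mutate(across(Var1:Var2, ~ stringr::str_replace(.x, "x", "y")))

df_new
```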

https://dplyr.tidyverse.org/reference/across.html

if_any and if_all

Using dplyr::across() in a filter statement is deprecated. dplyr::if_any() and dplyr::if_all() are predicate functions used to select columns within a filtering or logic statement.

These functions are available in dplyr 1.0.4 and later.

dplyr::if_any() returns a true when the statement is true for any of the variables. dplyr::if_all() returns a true when the statement is true for all of the variables.

Like dplyr::across(), the two main arguments are .cols and .fns, and they are used in the same manner as mentioned above.
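
A minimal sketch of the difference, using a made-up scores data frame (the column names and the cutoff of 4 are assumptions for illustration):

```r
library(dplyr)

# Hypothetical example data
scores <- data.frame(id = 1:3, q1 = c(5, 1, 4), q2 = c(5, 2, 1))

# Keep rows where at least one q column is 4 or higher
any_high <- scores %>%
  filter(if_any(q1:q2, ~ .x >= 4))

# Keep rows where every q column is 4 or higher
all_high <- scores %>%
  filter(if_all(q1:q2, ~ .x >= 4))

any_high$id  # 1 3
all_high$id  # 1
```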

https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/

any_of and all_of

There are two selection helpers, any_of() and all_of(), from the tidyselect package.

Both are used inside tidy-selection functions such as dplyr::select() when the names you want to select are stored in an external character vector (an env-variable outside of your data frame). all_of() requires every name in the vector to exist in the data, while any_of() ignores names that are missing.
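
The two helpers behave differently when a name in the vector is not in the data. Here is a sketch with a made-up data frame and a vector that deliberately includes a non-existent column ("Var9" is an assumption for illustration).

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(Var1 = 1:2, Var2 = 3:4)

# External character vector; "Var9" does not exist in df
wanted <- c("Var1", "Var9")

# any_of() silently skips names that are missing
any_names <- names(select(df, any_of(wanted)))
any_names  # "Var1"

# all_of() needs every name to be present, so this call errors
all_result <- tryCatch(
  select(df, all_of(wanted)),
  error = function(e) "error"
)
all_result  # "error"
```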

https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
https://tidyselect.r-lib.org/reference/all_of.html

{{}}

The embrace operator {{}} is often used when writing functions to forward an argument in a data-masked context or in tidy-selection. Wrapping a function argument in {{}} denotes that it refers to a data frame variable (data-variable), not an environment variable.
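
As a sketch of how this looks in practice, the hypothetical helper below (mean_by() and its argument names are assumptions, not part of any package) forwards two column arguments into dplyr verbs using the built-in mtcars data.

```r
library(dplyr)

# Hypothetical helper: group_var and value_var are data-variables
# supplied by the caller as bare column names
mean_by <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(avg = mean({{ value_var }}), .groups = "drop")
}

# One row per value of cyl, with the mean of mpg in avg
out <- mean_by(mtcars, cyl, mpg)
out
```

Without the embrace, group_by(group_var) would look for a column literally named group_var.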

https://rlang.r-lib.org/reference/topic-data-mask-programming.html
https://rlang.r-lib.org/reference/topic-data-mask.html
https://rlang.r-lib.org/reference/topic-inject.html
https://dplyr.tidyverse.org/articles/programming.html

:=

The := operator, sometimes called the walrus operator, allows us to dynamically create and name new variables in functions. (In the data.table package the same symbol is used for assignment by reference.) You can see an example here: https://cghlewis.github.io/data-wrangling-functions/create-variables/add-new-column-indicator.html
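
A minimal sketch of dynamic naming inside a function, assuming dplyr 1.0 or newer. The helper add_flag() and its arguments are hypothetical, and the cutoff of 25 is arbitrary.

```r
library(dplyr)

# Hypothetical helper: var is a data-variable, new_name is a string
# that becomes the name of the new column
add_flag <- function(data, var, new_name, cutoff) {
  data %>%
    mutate("{new_name}" := as.integer({{ var }} > cutoff))
}

# Adds a 0/1 column called high_mpg to the built-in mtcars data
flagged <- add_flag(mtcars, mpg, "high_mpg", cutoff = 25)
table(flagged$high_mpg)
```

A regular = cannot be used here because the left-hand side is built from a string at run time rather than typed as a fixed name.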

https://rdatatable.gitlab.io/data.table/reference/assign.html
