-
Notifications
You must be signed in to change notification settings - Fork 16
Filter
Here I cover filtering rows (subsetting data).
- Note: I have moved filtering columns to the Select Variables section as it is better suited there.
Filtering rows can be complex. There is some terminology/functions to understand first.
Filtering operators: >
, <
, >=
, <=
, ==
, !=
, %in%
, between
Operators used to filter multiple columns:
Operator | Meaning |
---|---|
| | AND/OR |
& | AND |
, | AND |
xor | OR (not both) |
There are also filtering joins (semi_join
and anti_join
) that can be used to filter your dataset based on the values in another dataset.
- I think a great example of this is using a filtering join to see who does not exist in both of two datasets (i.e. someone took a pre-test but not a post-test or vice-versa).
- Another common example is using a filtering join to narrow a dataset down to just those in your study using a roster dataset.
A warning:
It is important to be aware that when working with the dplyr::filter()
function, if you filter on a specific variable, it will keep rows that produce a value of TRUE for your condition/s but if any value is NA and not explicitly evaluated, the function will drop any row that has NA for that variable.
The reason is that most of the filtering operators used are logical operators (==
, !=
) that produce a TRUE/FALSE value. NA cannot be evaluated and is therefore dropped. We may not necessarily want that. For example, for the data frame below,
df %>% filter(var1 !=5)
would return only the third row, not both the 2nd and 3rd row.
Ways of working around this are discussed in the examples below (specifically in Filter rows with NA values
).
var1 | var2 |
---|---|
5 | 2 |
NA | 1 |
3 | 4 |
- Filter rows based on one criteria for one variable
- Filter rows based on multiple criteria for one variable
- Filter rows using if_any or if_all
- Filter rows based on one criteria that applies to multiple variables
- Filter rows on multiple criteria that applies to multiple variables
- Filter rows on variable type
- Filter rows on based on criteria that applies to all variables
- Filter rows based on matching variables values
- Conditionally filter rows
Main functions used in examples
Package | Functions |
---|---|
dplyr | filter(); semi_join(); anti_join() |
Other functions used in examples
Package | Functions |
---|---|
tidyselect | everything(); starts_with() |
lubridate | as_date(); year() |
stringr | str_detect(); |
dplyr | count(); case_when(); rowwise(); if_any(); if_all(); between(); across() |
base | class(); rowSums(); rowMeans(); is.na() |
Resources:
- https://cran.r-project.org/web/packages/dplyr/vignettes/colwise.html
- https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
- https://blog.exploratory.io/filter-data-with-dplyr-76cf5f1a258e
- https://suzan.rbind.io/2018/02/dplyr-tutorial-3/
- https://dplyr.tidyverse.org/articles/colwise.html#-if-at-all
- https://dplyr.tidyverse.org/reference/across.html
- https://dplyr.tidyverse.org/reference/filter.html
- https://statisticsglobe.com/remove-rows-with-na-using-dplyr-package-in-r
- https://kyleueyama.com/blog/2021-02-21-filtering-observations-with-dplyr/
- https://www.marsja.se/how-to-use-in-in-r/
- https://www.statmethods.net/management/operators.html
- https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti
- https://dplyr.tidyverse.org/reference/mutate-joins.html
- https://dplyr.tidyverse.org/reference/filter-joins.html
- https://r4ds.hadley.nz/logicals#introduction
Thank you to TJ Mahr for suggesting to add examples of using filtering joins!