Skip to content
Cghlewis edited this page Apr 26, 2024 · 61 revisions

Here I cover filtering rows (subsetting data).

  • Note: I have moved filtering columns to the Select Variables section as it is better suited there.

Filtering rows can be complex. There is some terminology/functions to understand first.

Filtering operators: >, <, >=, <=, ==, !=, %in%, between

Operators used to filter multiple columns:

Operator Meaning
| AND/OR
& AND
, AND
xor OR (not both)

There are also filtering joins (semi_join and anti_join) that can be used to filter your dataset based on the values in another dataset.

  1. I think a great example of this is using a filtering join to see who does not exist in both of two datasets (i.e. someone took a pre-test but not a post-test or vice-versa).
  2. Another common example is using a filtering join to narrow a dataset down to just those in your study using a roster dataset.

A warning:

It is important to be aware that when working with the dplyr::filter() function, if you filter on a specific variable, it will keep rows that produce a value of TRUE for your condition/s but if any value is NA and not explicitly evaluated, the function will drop any row that has NA for that variable.

The reason is that most of the filtering operators used are logical operators (==, !=) that produce a TRUE/FALSE value. NA cannot be evaluated and is therefore dropped. We may not necessarily want that. For example, for the data frame below, df %>% filter(var1 !=5) would return only the third row, not both the 2nd and 3rd row.

Ways of working around this are discussed in the examples below (specifically in Filter rows with NA values).

var1 var2
5 2
NA 1
3 4

Filter rows using one variable

Filter rows using multiple variables

Filter based on row sums or means

Filter rows using another data frame

Filter rows on NA values


Main functions used in examples

Package Functions
dplyr filter(); semi_join(); anti_join()

Other functions used in examples

Package Functions
tidyselect everything(); starts_with()
lubridate as_date(); year()
stringr str_detect();
dplyr count(); case_when(); rowwise(); if_any(); if_all(); between(); across()
base class(); rowSums(); rowMeans(); is.na()

Resources:

Thank you to TJ Mahr for suggesting to add examples of using filtering joins!

Clone this wiki locally