## R for Data Science

[book link](https://r4ds.had.co.nz/index.html)

Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.

In this chapter we’re going to focus on how to use the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.



In [43]:
library(nycflights13)
library(tidyverse)

To explore the basic data manipulation verbs of dplyr, we’ll use nycflights13::flights. This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.



In [44]:
?flights

In [45]:
head(flights)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00


### dplyr basics

In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:

- Pick observations by their values (filter()).
- Reorder the rows (arrange()).
- Pick variables by their names (select()).
- Create new variables with functions of existing variables (mutate()).
- Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.

All verbs work similarly:

- The first argument is a data frame.

- The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

- The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.

### Filter rows with filter()

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:



In [46]:
dplyr::filter(flights, month == 1, day == 1)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,709,723,-14,EV,5708,N829AS,LGA,IAD,53,229,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,838,846,-8,B6,79,N593JB,JFK,MCO,140,944,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00


When you run that line of code, dplyr executes the filtering operation and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-:



In [47]:
jan1 <- dplyr::filter(flights, month == 1, day == 1)

R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses:



In [48]:
(dec25 <- filter(flights, month == 12, day == 25))


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,12,25,456,500,-4,649,651,-2,US,1895,N156UW,EWR,CLT,98,529,5,0,2013-12-25 05:00:00
2013,12,25,524,515,9,805,814,-9,UA,1016,N32404,EWR,IAH,203,1400,5,15,2013-12-25 05:00:00
2013,12,25,542,540,2,832,850,-18,AA,2243,N5EBAA,JFK,MIA,146,1089,5,40,2013-12-25 05:00:00
2013,12,25,546,550,-4,1022,1027,-5,B6,939,N665JB,JFK,BQN,191,1576,5,50,2013-12-25 05:00:00
2013,12,25,556,600,-4,730,745,-15,AA,301,N3JLAA,LGA,ORD,123,733,6,0,2013-12-25 06:00:00
2013,12,25,557,600,-3,743,752,-9,DL,731,N369NB,LGA,DTW,88,502,6,0,2013-12-25 06:00:00
2013,12,25,557,600,-3,818,831,-13,DL,904,N397DA,LGA,ATL,118,762,6,0,2013-12-25 06:00:00
2013,12,25,559,600,-1,855,856,-1,B6,371,N608JB,LGA,FLL,147,1076,6,0,2013-12-25 06:00:00
2013,12,25,559,600,-1,849,855,-6,B6,605,N536JB,EWR,FLL,149,1065,6,0,2013-12-25 06:00:00
2013,12,25,600,600,0,850,846,4,B6,583,N746JB,JFK,MCO,137,944,6,0,2013-12-25 06:00:00


There’s another common problem you might encounter when using ==: floating point numbers. These results might surprise you!

``` R

sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE

```

Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near():

``` R

near(sqrt(2) ^ 2,  2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE

```

### Logical operators

The following code finds all flights that departed in November or December:

Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ! is “not”

In [49]:
dplyr::filter(flights, month == 11 | month == 12)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,11,1,5,2359,6,352,345,7,B6,745,N568JB,JFK,PSE,205,1617,23,59,2013-11-01 23:00:00
2013,11,1,35,2250,105,123,2356,87,B6,1816,N353JB,JFK,SYR,36,209,22,50,2013-11-01 22:00:00
2013,11,1,455,500,-5,641,651,-10,US,1895,N192UW,EWR,CLT,88,529,5,0,2013-11-01 05:00:00
2013,11,1,539,545,-6,856,827,29,UA,1714,N38727,LGA,IAH,229,1416,5,45,2013-11-01 05:00:00
2013,11,1,542,545,-3,831,855,-24,AA,2243,N5CLAA,JFK,MIA,147,1089,5,45,2013-11-01 05:00:00
2013,11,1,549,600,-11,912,923,-11,UA,303,N595UA,JFK,SFO,359,2586,6,0,2013-11-01 06:00:00
2013,11,1,550,600,-10,705,659,6,US,2167,N748UW,LGA,DCA,57,214,6,0,2013-11-01 06:00:00
2013,11,1,554,600,-6,659,701,-2,US,2134,N742PS,LGA,BOS,40,184,6,0,2013-11-01 06:00:00
2013,11,1,554,600,-6,826,827,-1,DL,563,N912DE,LGA,ATL,126,762,6,0,2013-11-01 06:00:00
2013,11,1,554,600,-6,749,751,-2,DL,731,N315NB,LGA,DTW,93,502,6,0,2013-11-01 06:00:00


The order of operations doesn’t work like English. You can’t write filter(flights, month == 11 | 12), which you might literally translate into “finds all flights that departed in November or December”. Instead it finds all months that equal 11 | 12, an expression that evaluates to TRUE. In a numeric context (like here), TRUE becomes one, so this finds all flights in January, not November or December. This is quite confusing!

A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code above:



In [50]:
nov_dec <- dplyr::filter(flights, month %in% c(11, 12))

nov_dec

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,11,1,5,2359,6,352,345,7,B6,745,N568JB,JFK,PSE,205,1617,23,59,2013-11-01 23:00:00
2013,11,1,35,2250,105,123,2356,87,B6,1816,N353JB,JFK,SYR,36,209,22,50,2013-11-01 22:00:00
2013,11,1,455,500,-5,641,651,-10,US,1895,N192UW,EWR,CLT,88,529,5,0,2013-11-01 05:00:00
2013,11,1,539,545,-6,856,827,29,UA,1714,N38727,LGA,IAH,229,1416,5,45,2013-11-01 05:00:00
2013,11,1,542,545,-3,831,855,-24,AA,2243,N5CLAA,JFK,MIA,147,1089,5,45,2013-11-01 05:00:00
2013,11,1,549,600,-11,912,923,-11,UA,303,N595UA,JFK,SFO,359,2586,6,0,2013-11-01 06:00:00
2013,11,1,550,600,-10,705,659,6,US,2167,N748UW,LGA,DCA,57,214,6,0,2013-11-01 06:00:00
2013,11,1,554,600,-6,659,701,-2,US,2134,N742PS,LGA,BOS,40,184,6,0,2013-11-01 06:00:00
2013,11,1,554,600,-6,826,827,-1,DL,563,N912DE,LGA,ATL,126,762,6,0,2013-11-01 06:00:00
2013,11,1,554,600,-6,749,751,-2,DL,731,N315NB,LGA,DTW,93,502,6,0,2013-11-01 06:00:00


Sometimes you can simplify complicated subsetting by remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:



In [51]:
dplyr::filter(flights, !(arr_delay > 120 | dep_delay > 120))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,709,723,-14,EV,5708,N829AS,LGA,IAD,53,229,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,838,846,-8,B6,79,N593JB,JFK,MCO,140,944,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00


In [52]:
dplyr::filter(flights, arr_delay <= 120, dep_delay <= 120)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,709,723,-14,EV,5708,N829AS,LGA,IAD,53,229,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,838,846,-8,B6,79,N593JB,JFK,MCO,140,944,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00


As well as & and |, R also has && and ||. Don’t use them here! You’ll learn when you should use them in conditional execution.

Whenever you start using complicated, multipart expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work. You’ll learn how to create new variables shortly.

### Missing values

``` R

NA > 5
#> [1] NA
10 == NA
#> [1] NA
NA + 10
#> [1] NA
NA / 2
#> [1] NA

```

The most confusing result is this one:

``` R

NA == NA
#> [1] NA

```

It’s easiest to understand why this is true with a bit more context:

``` R

# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y
#> [1] NA
# We don't know!

```

If you want to determine if a value is missing, use is.na():

``` R

is.na(x)
#> [1] TRUE

```

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

``` R

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
#> # A tibble: 1 x 1
#>       x
#>   <dbl>
#> 1     3
filter(df, is.na(x) | x > 1)
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1    NA
#> 2     3

```

### Exercises

Find all flights that

- Had an arrival delay of two or more hours
- Flew to Houston (IAH or HOU)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
- Departed between midnight and 6am (inclusive)

In [53]:
dplyr::filter(flights, arr_delay >= 2)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,924,917,7,UA,194,N29129,JFK,LAX,345,2475,6,0,2013-01-01 06:00:00
2013,1,1,559,600,-1,941,910,31,AA,707,N3DUAA,LGA,DFW,257,1389,6,0,2013-01-01 06:00:00
2013,1,1,600,600,0,837,825,12,MQ,4650,N542MQ,LGA,ATL,134,762,6,0,2013-01-01 06:00:00
2013,1,1,602,605,-3,821,805,16,MQ,4401,N730MQ,LGA,DTW,105,502,6,5,2013-01-01 06:00:00


In [54]:
dplyr::filter(flights, dest == "HOU" | dest == "IAH")

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,623,627,-4,933,932,1,UA,496,N459UA,LGA,IAH,229,1416,6,27,2013-01-01 06:00:00
2013,1,1,728,732,-4,1041,1038,3,UA,473,N488UA,LGA,IAH,238,1416,7,32,2013-01-01 07:00:00
2013,1,1,739,739,0,1104,1038,26,UA,1479,N37408,EWR,IAH,249,1400,7,39,2013-01-01 07:00:00
2013,1,1,908,908,0,1228,1219,9,UA,1220,N12216,EWR,IAH,233,1400,9,8,2013-01-01 09:00:00
2013,1,1,1028,1026,2,1350,1339,11,UA,1004,N76508,LGA,IAH,237,1416,10,26,2013-01-01 10:00:00
2013,1,1,1044,1045,-1,1352,1351,1,UA,455,N667UA,EWR,IAH,229,1400,10,45,2013-01-01 10:00:00
2013,1,1,1114,900,134,1447,1222,145,UA,1086,N76502,LGA,IAH,248,1416,9,0,2013-01-01 09:00:00
2013,1,1,1205,1200,5,1503,1505,-2,UA,1461,N39418,EWR,IAH,221,1400,12,0,2013-01-01 12:00:00


In [55]:
airlines

# UA, DL, AA

carrier,name
9E,Endeavor Air Inc.
AA,American Airlines Inc.
AS,Alaska Airlines Inc.
B6,JetBlue Airways
DL,Delta Air Lines Inc.
EV,ExpressJet Airlines Inc.
F9,Frontier Airlines Inc.
FL,AirTran Airways Corporation
HA,Hawaiian Airlines Inc.
MQ,Envoy Air


In [56]:
dplyr::filter(flights, carrier %in% c("UA", "DL", "AA"))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,924,917,7,UA,194,N29129,JFK,LAX,345,2475,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,923,937,-14,UA,1124,N53441,EWR,SFO,361,2565,6,0,2013-01-01 06:00:00
2013,1,1,559,600,-1,941,910,31,AA,707,N3DUAA,LGA,DFW,257,1389,6,0,2013-01-01 06:00:00
2013,1,1,559,600,-1,854,902,-8,UA,1187,N76515,EWR,LAS,337,2227,6,0,2013-01-01 06:00:00


In [57]:
dplyr::filter(flights, month %in% c(7, 8, 9))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,7,1,1,2029,212,236,2359,157,B6,915,N653JB,JFK,SFO,315,2586,20,29,2013-07-01 20:00:00
2013,7,1,2,2359,3,344,344,0,B6,1503,N805JB,JFK,SJU,200,1598,23,59,2013-07-01 23:00:00
2013,7,1,29,2245,104,151,1,110,B6,234,N348JB,JFK,BTV,66,266,22,45,2013-07-01 22:00:00
2013,7,1,43,2130,193,322,14,188,B6,1371,N794JB,LGA,FLL,143,1076,21,30,2013-07-01 21:00:00
2013,7,1,44,2150,174,300,100,120,AA,185,N324AA,JFK,LAX,297,2475,21,50,2013-07-01 21:00:00
2013,7,1,46,2051,235,304,2358,186,B6,165,N640JB,JFK,PDX,304,2454,20,51,2013-07-01 20:00:00
2013,7,1,48,2001,287,308,2305,243,VX,415,N627VA,JFK,LAX,298,2475,20,1,2013-07-01 20:00:00
2013,7,1,58,2155,183,335,43,172,B6,425,N535JB,JFK,TPA,140,1005,21,55,2013-07-01 21:00:00
2013,7,1,100,2146,194,327,30,177,B6,1183,N531JB,JFK,MCO,126,944,21,46,2013-07-01 21:00:00
2013,7,1,100,2245,135,337,135,122,B6,623,N663JB,JFK,LAX,304,2475,22,45,2013-07-01 22:00:00


In [58]:
# Arrived more than two hours late, but didn’t leave late

head(dplyr::filter(flights, arr_delay > 120, dep_delay <= 0))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,27,1419,1420,-1,1754,1550,124,MQ,3728,N1EAMQ,EWR,ORD,135,719,14,20,2013-01-27 14:00:00
2013,10,7,1350,1350,0,1736,1526,130,EV,5181,N611QX,LGA,MSN,117,812,13,50,2013-10-07 13:00:00
2013,10,7,1357,1359,-2,1858,1654,124,AA,1151,N3CMAA,LGA,DFW,192,1389,13,59,2013-10-07 13:00:00
2013,10,16,657,700,-3,1258,1056,122,B6,3,N703JB,JFK,SJU,225,1598,7,0,2013-10-16 07:00:00
2013,11,1,658,700,-2,1329,1015,194,VX,399,N629VA,JFK,LAX,336,2475,7,0,2013-11-01 07:00:00
2013,3,18,1844,1847,-3,39,2219,140,UA,389,N560UA,JFK,SFO,386,2586,18,47,2013-03-18 18:00:00


In [59]:
# Were delayed by at least an hour, but made up over 30 minutes in flight

head(dplyr::filter(flights, dep_delay >= 60, (dep_delay - arr_delay) >= 30))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,1716,1545,91,2140,2039,61,B6,703,N651JB,JFK,SJU,183,1598,15,45,2013-01-01 15:00:00
2013,1,1,2205,1720,285,46,2040,246,AA,1999,N5DNAA,EWR,MIA,146,1085,17,20,2013-01-01 17:00:00
2013,1,1,2326,2130,116,131,18,73,B6,199,N594JB,JFK,LAS,290,2248,21,30,2013-01-01 21:00:00
2013,1,3,1503,1221,162,1803,1555,128,UA,551,N835UA,EWR,SFO,320,2565,12,21,2013-01-03 12:00:00
2013,1,3,1821,1530,171,2131,1910,141,AA,85,N357AA,JFK,SFO,328,2586,15,30,2013-01-03 15:00:00
2013,1,3,1839,1700,99,2056,1950,66,AA,575,N631AA,JFK,EGE,239,1747,17,0,2013-01-03 17:00:00


In [71]:
# Departed between midnight and 6am (inclusive)

dplyr::filter(flights, flights$dep_time > 600) 

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,601,600,1,844,850,-6,B6,343,N644JB,EWR,PBI,147,1023,6,0,2013-01-01 06:00:00
2013,1,1,602,610,-8,812,820,-8,DL,1919,N971DL,LGA,MSP,170,1020,6,10,2013-01-01 06:00:00
2013,1,1,602,605,-3,821,805,16,MQ,4401,N730MQ,LGA,DTW,105,502,6,5,2013-01-01 06:00:00
2013,1,1,606,610,-4,858,910,-12,AA,1895,N633AA,EWR,MIA,152,1085,6,10,2013-01-01 06:00:00
2013,1,1,606,610,-4,837,845,-8,DL,1743,N3739P,JFK,ATL,128,760,6,10,2013-01-01 06:00:00
2013,1,1,607,607,0,858,915,-17,UA,1077,N53442,EWR,MIA,157,1085,6,7,2013-01-01 06:00:00
2013,1,1,608,600,8,807,735,32,MQ,3768,N9EAMQ,EWR,ORD,139,719,6,0,2013-01-01 06:00:00
2013,1,1,611,600,11,945,931,14,UA,303,N532UA,JFK,SFO,366,2586,6,0,2013-01-01 06:00:00
2013,1,1,613,610,3,925,921,4,B6,135,N635JB,JFK,RSW,175,1074,6,10,2013-01-01 06:00:00
2013,1,1,615,615,0,1039,1100,-21,B6,709,N794JB,JFK,SJU,182,1598,6,15,2013-01-01 06:00:00


Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?



In [74]:
# flights departed BETWEEN 6am and 7 am

dplyr::filter(flights, between(flights$dep_time, 600, 700)) 

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,600,600,0,851,858,-7,B6,371,N595JB,LGA,FLL,152,1076,6,0,2013-01-01 06:00:00
2013,1,1,600,600,0,837,825,12,MQ,4650,N542MQ,LGA,ATL,134,762,6,0,2013-01-01 06:00:00
2013,1,1,601,600,1,844,850,-6,B6,343,N644JB,EWR,PBI,147,1023,6,0,2013-01-01 06:00:00
2013,1,1,602,610,-8,812,820,-8,DL,1919,N971DL,LGA,MSP,170,1020,6,10,2013-01-01 06:00:00
2013,1,1,602,605,-3,821,805,16,MQ,4401,N730MQ,LGA,DTW,105,502,6,5,2013-01-01 06:00:00
2013,1,1,606,610,-4,858,910,-12,AA,1895,N633AA,EWR,MIA,152,1085,6,10,2013-01-01 06:00:00
2013,1,1,606,610,-4,837,845,-8,DL,1743,N3739P,JFK,ATL,128,760,6,10,2013-01-01 06:00:00
2013,1,1,607,607,0,858,915,-17,UA,1077,N53442,EWR,MIA,157,1085,6,7,2013-01-01 06:00:00
2013,1,1,608,600,8,807,735,32,MQ,3768,N9EAMQ,EWR,ORD,139,719,6,0,2013-01-01 06:00:00
2013,1,1,611,600,11,945,931,14,UA,303,N532UA,JFK,SFO,366,2586,6,0,2013-01-01 06:00:00


How many flights have a missing dep_time? What other variables are missing? What might these rows represent?



In [76]:
nrow(dplyr::filter(flights, is.na(flights$dep_time)))

Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

### Arrange rows with arrange()


arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:



In [79]:
tail(arrange(flights, year, month, day))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,12,31,,855,,,1142,,UA,1506,,EWR,JAC,,1874,8,55,2013-12-31 08:00:00
2013,12,31,,705,,,931,,UA,1729,,EWR,DEN,,1605,7,5,2013-12-31 07:00:00
2013,12,31,,825,,,1029,,US,1831,,JFK,CLT,,541,8,25,2013-12-31 08:00:00
2013,12,31,,1615,,,1800,,MQ,3301,N844MQ,LGA,RDU,,431,16,15,2013-12-31 16:00:00
2013,12,31,,600,,,735,,UA,219,,EWR,ORD,,719,6,0,2013-12-31 06:00:00
2013,12,31,,830,,,1154,,UA,443,,JFK,LAX,,2475,8,30,2013-12-31 08:00:00


Use desc() to re-order by a column in descending order:


In [83]:
head(arrange(flights, desc(dep_delay)))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,9,641,900,1301,1242,1530,1272,HA,51,N384HA,JFK,HNL,640,4983,9,0,2013-01-09 09:00:00
2013,6,15,1432,1935,1137,1607,2120,1127,MQ,3535,N504MQ,JFK,CMH,74,483,19,35,2013-06-15 19:00:00
2013,1,10,1121,1635,1126,1239,1810,1109,MQ,3695,N517MQ,EWR,ORD,111,719,16,35,2013-01-10 16:00:00
2013,9,20,1139,1845,1014,1457,2210,1007,AA,177,N338AA,JFK,SFO,354,2586,18,45,2013-09-20 18:00:00
2013,7,22,845,1600,1005,1044,1815,989,MQ,3075,N665MQ,JFK,CVG,96,589,16,0,2013-07-22 16:00:00
2013,4,10,1100,1900,960,1342,2211,931,DL,2391,N959DL,JFK,TPA,139,1005,19,0,2013-04-10 19:00:00


Missing values are always sorted at the end:

``` R

df <- tibble(x = c(5, 2, NA))
arrange(df, x)
#> # A tibble: 3 x 1
#>       x
#>   <dbl>
#> 1     2
#> 2     5
#> 3    NA
arrange(df, desc(x))
#> # A tibble: 3 x 1
#>       x
#>   <dbl>
#> 1     5
#> 2     2
#> 3    NA

```

### Exercises

How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).



In [84]:
df <- tibble(x = c(5, 2, NA))

df

x
5.0
2.0
""


In [86]:
arrange(df, desc(is.na(x)))

x
""
5.0
2.0


Sort flights to find the most delayed flights. Find the flights that left earliest.



In [87]:
arrange(flights, desc(arr_delay))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,9,641,900,1301,1242,1530,1272,HA,51,N384HA,JFK,HNL,640,4983,9,0,2013-01-09 09:00:00
2013,6,15,1432,1935,1137,1607,2120,1127,MQ,3535,N504MQ,JFK,CMH,74,483,19,35,2013-06-15 19:00:00
2013,1,10,1121,1635,1126,1239,1810,1109,MQ,3695,N517MQ,EWR,ORD,111,719,16,35,2013-01-10 16:00:00
2013,9,20,1139,1845,1014,1457,2210,1007,AA,177,N338AA,JFK,SFO,354,2586,18,45,2013-09-20 18:00:00
2013,7,22,845,1600,1005,1044,1815,989,MQ,3075,N665MQ,JFK,CVG,96,589,16,0,2013-07-22 16:00:00
2013,4,10,1100,1900,960,1342,2211,931,DL,2391,N959DL,JFK,TPA,139,1005,19,0,2013-04-10 19:00:00
2013,3,17,2321,810,911,135,1020,915,DL,2119,N927DA,LGA,MSP,167,1020,8,10,2013-03-17 08:00:00
2013,7,22,2257,759,898,121,1026,895,DL,2047,N6716C,LGA,ATL,109,762,7,59,2013-07-22 07:00:00
2013,12,5,756,1700,896,1058,2020,878,AA,172,N5DMAA,EWR,MIA,149,1085,17,0,2013-12-05 17:00:00
2013,5,3,1133,2055,878,1250,2215,875,MQ,3744,N523MQ,EWR,ORD,112,719,20,55,2013-05-03 20:00:00


In [89]:
arrange(flights, dep_time)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,13,1,2249,72,108,2357,71,B6,22,N206JB,JFK,SYR,41,209,22,49,2013-01-13 22:00:00
2013,1,31,1,2100,181,124,2225,179,WN,530,N550WN,LGA,MDW,127,725,21,0,2013-01-31 21:00:00
2013,11,13,1,2359,2,442,440,2,B6,1503,N627JB,JFK,SJU,194,1598,23,59,2013-11-13 23:00:00
2013,12,16,1,2359,2,447,437,10,B6,839,N607JB,JFK,BQN,202,1576,23,59,2013-12-16 23:00:00
2013,12,20,1,2359,2,430,440,-10,B6,1503,N608JB,JFK,SJU,182,1598,23,59,2013-12-20 23:00:00
2013,12,26,1,2359,2,437,440,-3,B6,1503,N527JB,JFK,SJU,197,1598,23,59,2013-12-26 23:00:00
2013,12,30,1,2359,2,441,437,4,B6,839,N508JB,JFK,BQN,198,1576,23,59,2013-12-30 23:00:00
2013,2,11,1,2100,181,111,2225,166,WN,530,N231WN,LGA,MDW,117,725,21,0,2013-02-11 21:00:00
2013,2,24,1,2245,76,121,2354,87,B6,608,N216JB,JFK,PWM,56,273,22,45,2013-02-24 22:00:00
2013,3,8,1,2355,6,431,440,-9,B6,739,N586JB,JFK,PSE,189,1617,23,55,2013-03-08 23:00:00


Sort flights to find the fastest flights.



In [91]:
arrange(flights, distance/air_time)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,28,1917,1825,52,2118,1935,103,US,1860,N755US,LGA,PHL,75,96,18,25,2013-01-28 18:00:00
2013,6,29,755,800,-5,1035,909,86,B6,1491,N328JB,JFK,ACK,141,199,8,0,2013-06-29 08:00:00
2013,8,28,932,940,-8,1116,1051,25,9E,3608,N8932C,JFK,PHL,61,94,9,40,2013-08-28 09:00:00
2013,1,30,1037,955,42,1221,1100,81,9E,3667,N832AY,JFK,PHL,59,94,9,55,2013-01-30 09:00:00
2013,11,27,556,600,-4,727,658,29,US,1909,N951UW,LGA,PHL,60,96,6,0,2013-11-27 06:00:00
2013,5,21,558,600,-2,721,657,24,US,1289,N956UW,LGA,PHL,60,96,6,0,2013-05-21 06:00:00
2013,12,9,1540,1535,5,1720,1656,24,US,1775,N945UW,LGA,PHL,59,96,15,35,2013-12-09 15:00:00
2013,6,10,1356,1300,56,1646,1414,152,US,2175,N745VJ,LGA,DCA,131,214,13,0,2013-06-10 13:00:00
2013,7,28,1322,1325,-3,1612,1432,100,US,1279,N953UW,LGA,PHL,57,96,13,25,2013-07-28 13:00:00
2013,4,11,1349,1345,4,1542,1453,49,9E,3638,N8631E,JFK,PHL,55,94,13,45,2013-04-11 13:00:00
