# Appendix C Data 

## C.2 Data manipulation

### C.2.4 Select columns

`select()` keeps only the variables that matter for the current analysis. It is most useful when a data frame has a large number of variables. It is commonly used with the pipe operator `%>%`. For now, it suffices to know that `x %>% f(y)` means `f(x, y)`. We will see this operator a lot in the next section.
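
As a minimal sketch of the pipe rule just stated, the snippet below compares piped and nested calls on the `flights` data. The variable names `piped`, `direct`, `chained`, and `nested` are illustrative only.

```r
library(dplyr)
library(nycflights13)

# `x %>% f(y)` is rewritten as `f(x, y)`: the left-hand side becomes
# the first argument of the call on the right-hand side.
piped  <- flights %>% head(3)
direct <- head(flights, 3)
identical(piped, direct)

# Pipes chain left to right, which replaces nested calls read inside out.
chained <- flights %>% select(year, month, day) %>% head(3)
nested  <- head(select(flights, year, month, day), 3)
identical(chained, nested)
```

Both comparisons should report that the piped and nested forms give the same result.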

In [1]:
library(tidyverse)
library(nycflights13)

-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --

v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.4     v dplyr   1.0.2
v tidyr   1.1.2     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.0

-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

In [3]:
select(flights, year, month, day, dep_time, arr_time) %>% head()

year,month,day,dep_time,arr_time
<int>,<int>,<int>,<int>,<int>
2013,1,1,517,830
2013,1,1,533,850
2013,1,1,542,923
2013,1,1,544,1004
2013,1,1,554,812
2013,1,1,554,740


We can rename variables while selecting them, using the form `new_name = old_name`.

In [4]:
select(flights, year, month, day, departure_time = dep_time, arr_time) %>% head()

year,month,day,departure_time,arr_time
<int>,<int>,<int>,<int>,<int>
2013,1,1,517,830
2013,1,1,533,850
2013,1,1,542,923
2013,1,1,544,1004
2013,1,1,554,812
2013,1,1,554,740


Note that `select` drops any variables not explicitly mentioned. To just rename some variables while keeping all others, use `rename`.

In [5]:
rename(flights, departure_time=dep_time) %>% head()

year,month,day,departure_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
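
The difference between the two verbs can be checked directly by counting columns. This is a small sketch; `selected` and `renamed` are illustrative names.

```r
library(dplyr)
library(nycflights13)

# select() keeps only the listed columns, renaming one of them on the way.
selected <- select(flights, year, month, day, departure_time = dep_time, arr_time)

# rename() changes the name and keeps every other column.
renamed <- rename(flights, departure_time = dep_time)

ncol(selected)                   # 5 columns survive
ncol(renamed)                    # all 19 columns survive
"dep_time" %in% names(renamed)   # FALSE: the old name is gone
```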


If there are a lot of variables, you can save yourself some typing by using `:` and `-` in combination with select. The colon operator selects a range of variables.

In [6]:
select(flights, year:day) %>% head()

year,month,day
<int>,<int>,<int>
2013,1,1
2013,1,1
2013,1,1
2013,1,1
2013,1,1
2013,1,1


The negative sign lets you select everything but certain columns.

In [8]:
select(flights, -day) %>% print()

# A tibble: 336,776 x 18
    year month dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1      517            515         2      830            819
 2  2013     1      533            529         4      850            830
 3  2013     1      542            540         2      923            850
 4  2013     1      544            545        -1     1004           1022
 5  2013     1      554            600        -6      812            837
 6  2013     1      554            558        -4      740            728
 7  2013     1      555            600        -5      913

We can use `-` and `:` together, for example:

In [9]:
select(flights, -(year:day)) %>% print()

# A tibble: 336,776 x 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>
 1      517            515         2      830            819        11 UA
 2      533            529         4      850            830        20 UA
 3      542            540         2      923            850        33 AA
 4      544            545        -1     1004           1022       -18 B6
 5      554            600        -6      812            837       -25 DL
 6      554            558        -4      740            728        12 UA
 7      555            600        -5      913

To move a few variables to the front, list them first and use `everything()` to refer to the remaining variables.

In [12]:
select(flights, dep_time, arr_time, everything()) %>% head()

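dep_time,arr_time,year,month,day,sched_dep_time,dep_delay,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
517,830,2013,1,1,515,2,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
533,850,2013,1,1,529,4,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
542,923,2013,1,1,540,2,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
544,1004,2013,1,1,545,-1,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
554,812,2013,1,1,600,-6,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
554,740,2013,1,1,558,-4,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00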
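
As a hedged aside, dplyr 1.0.0 and later (the session above loaded 1.0.2) also provide `relocate()`, which moves columns to the front by default. The sketch below checks that it gives the same column order as the `everything()` idiom; `reordered` and `relocated` are illustrative names.

```r
library(dplyr)
library(nycflights13)

# The idiom from the text: named columns first, then all the rest.
reordered <- select(flights, dep_time, arr_time, everything())

# relocate() moves the named columns to the front unless told otherwise.
relocated <- relocate(flights, dep_time, arr_time)

# No column is lost, and both approaches agree on the order.
ncol(reordered)
identical(names(reordered), names(relocated))
head(names(reordered), 5)
```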


In addition, there are some helper functions that are meant to be used inside selecting functions such as `select()`.

* `starts_with()`, `ends_with()`, `contains()`
* `matches()`
* `num_range()`

We can use `?select` to learn more about these. Here's just one example of their use.

In [14]:
select(flights, contains("time")) %>% head()

dep_time,sched_dep_time,arr_time,sched_arr_time,air_time,time_hour
<int>,<int>,<int>,<int>,<dbl>,<dttm>
517,515,830,819,227,2013-01-01 05:00:00
533,529,850,830,227,2013-01-01 05:00:00
542,540,923,850,160,2013-01-01 05:00:00
544,545,1004,1022,183,2013-01-01 05:00:00
554,600,812,837,116,2013-01-01 06:00:00
554,558,740,728,150,2013-01-01 05:00:00


This selects every column whose name contains the string "time". By default the match ignores case.
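
The other helpers listed above follow the same pattern. The sketch below tries each one; since `flights` has no numbered columns, `num_range()` is shown on a small made-up tibble called `toy`, which is purely hypothetical.

```r
library(dplyr)
library(nycflights13)

# Columns whose names begin with a prefix.
sched <- select(flights, starts_with("sched"))
names(sched)    # "sched_dep_time" "sched_arr_time"

# Columns whose names end with a suffix.
delays <- select(flights, ends_with("delay"))
names(delays)   # "dep_delay" "arr_delay"

# matches() takes a regular expression instead of a fixed string.
dep_arr <- select(flights, matches("^(dep|arr)_"))
names(dep_arr)  # "dep_time" "dep_delay" "arr_time" "arr_delay"

# num_range() picks columns named prefix + number, e.g. x1 and x2.
toy <- tibble::tibble(x1 = 1, x2 = 2, x3 = 3, y = 4)  # hypothetical data
picked <- select(toy, num_range("x", 1:2))
names(picked)   # "x1" "x2"
```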