% Managing Data Frames with dplyr
%
The data frame is a key data structure in statistics and in R.
-
There is one observation per row
-
Each column represents a variable or measure or characteristic
-
Primary implementation that you will use is the default R implementation
-
Other implementations, particularly relational databases systems
-
Developed by Hadley Wickham of RStudio
-
An optimized and distilled version of
plyr
package (also by Hadley) -
Does not provide any "new" functionality per se, but greatly simplifies existing functionality in R
-
Provides a "grammar" (in particular, verbs) for data manipulation
-
Is very fast, as many key operations are coded in C++
-
select
: return a subset of the columns of a data frame -
filter
: extract a subset of rows from a data frame based on logical conditions -
arrange
: reorder rows of a data frame -
rename
: rename variables in a data frame -
mutate
: add new variables/columns or transform existing variables -
summarise
/summarize
: generate summary statistics of different variables in the data frame, possibly within strata
There is also a handy print
method that prevents you from printing a
lot of data to the console.
-
The first argument is a data frame.
-
The subsequent arguments describe what to do with it, and you can refer to columns in the data frame directly without using the $ operator (just use the names).
-
The result is a new data frame
-
Data frames must be properly formatted and annotated for this to all be useful
This step is important!
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
chicago <- readRDS("chicago.rds")
dim(chicago)
## [1] 6940 8
head(select(chicago, 1:5))
## city tmpd dptp date pm25tmean2
## 1 chic 31.5 31.500 1987-01-01 NA
## 2 chic 33.0 29.875 1987-01-02 NA
## 3 chic 33.0 27.375 1987-01-03 NA
## 4 chic 29.0 28.625 1987-01-04 NA
## 5 chic 32.0 28.875 1987-01-05 NA
## 6 chic 40.0 35.125 1987-01-06 NA
names(chicago)[1:3]
## [1] "city" "tmpd" "dptp"
head(select(chicago, city:dptp))
## city tmpd dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125
In dplyr you can do
head(select(chicago, -(city:dptp)))
Equivalent base R
i <- match("city", names(chicago))
j <- match("dptp", names(chicago))
head(chicago[, -(i:j)])
chic.f <- filter(chicago, pm25tmean2 > 30)
head(select(chic.f, 1:3, pm25tmean2), 10)
## city tmpd dptp pm25tmean2
## 1 chic 23 21.9 38.10
## 2 chic 28 25.8 33.95
## 3 chic 55 51.3 39.40
## 4 chic 59 53.7 35.40
## 5 chic 57 52.0 33.30
## 6 chic 57 56.0 32.10
## 7 chic 75 65.8 56.50
## 8 chic 61 59.0 33.80
## 9 chic 73 60.3 30.30
## 10 chic 78 67.1 41.40
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
head(select(chic.f, 1:3, pm25tmean2, tmpd), 10)
## city tmpd dptp pm25tmean2
## 1 chic 81 71.2 39.6000
## 2 chic 81 70.4 31.5000
## 3 chic 82 72.2 32.3000
## 4 chic 84 72.9 43.7000
## 5 chic 85 72.6 38.8375
## 6 chic 84 72.6 38.2000
## 7 chic 82 67.4 33.0000
## 8 chic 82 63.5 42.5000
## 9 chic 81 70.4 33.1000
## 10 chic 82 66.2 38.8500
Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R.
chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 1 1987-01-01 NA
## 2 1987-01-02 NA
## 3 1987-01-03 NA
tail(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 6938 2005-12-29 7.45000
## 6939 2005-12-30 15.05714
## 6940 2005-12-31 15.00000
Columns can be arranged in descending order too.
chicago <- arrange(chicago, desc(date))
head(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 1 2005-12-31 15.00000
## 2 2005-12-30 15.05714
## 3 2005-12-29 7.45000
tail(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 6938 1987-01-03 NA
## 6939 1987-01-02 NA
## 6940 1987-01-01 NA
Renaming a variable in a data frame in R is surprising hard to do!
head(chicago[, 1:5], 3)
## city tmpd dptp date pm25tmean2
## 1 chic 35 30.1 2005-12-31 15.00000
## 2 chic 36 31.0 2005-12-30 15.05714
## 3 chic 35 29.4 2005-12-29 7.45000
chicago <- rename(chicago, dewpoint = dptp,
pm25 = pm25tmean2)
head(chicago[, 1:5], 3)
## city tmpd dewpoint date pm25
## 1 chic 35 30.1 2005-12-31 15.00000
## 2 chic 36 31.0 2005-12-30 15.05714
## 3 chic 35 29.4 2005-12-29 7.45000
chicago <- mutate(chicago,
pm25detrend=pm25-mean(pm25, na.rm=TRUE))
head(select(chicago, pm25, pm25detrend))
## pm25 pm25detrend
## 1 15.00000 -1.230958
## 2 15.05714 -1.173815
## 3 7.45000 -8.780958
## 4 17.75000 1.519042
## 5 23.56000 7.329042
## 6 8.40000 -7.830958
Generating summary statistics by stratum
chicago <- mutate(chicago,
tempcat = factor(1 * (tmpd > 80),
labels = c("cold", "hot")))
hotcold <- group_by(chicago, tempcat)
summarize(hotcold, pm25 = mean(pm25, na.rm = TRUE),
o3 = max(o3tmean2),
no2 = median(no2tmean2))
## Source: local data frame [3 x 4]
##
## tempcat pm25 o3 no2
## 1 cold 15.97807 66.587500 24.54924
## 2 hot 26.48118 62.969656 24.93870
## 3 NA 47.73750 9.416667 37.44444
Generating summary statistics by stratum
chicago <- mutate(chicago,
year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
## Source: local data frame [19 x 4]
##
## year pm25 o3 no2
## 1 1987 NaN 62.96966 23.49369
## 2 1988 NaN 61.67708 24.52296
## 3 1989 NaN 59.72727 26.14062
## 4 1990 NaN 52.22917 22.59583
## 5 1991 NaN 63.10417 21.38194
## 6 1992 NaN 50.82870 24.78921
## 7 1993 NaN 44.30093 25.76993
## 8 1994 NaN 52.17844 28.47500
## 9 1995 NaN 66.58750 27.26042
## 10 1996 NaN 58.39583 26.38715
## 11 1997 NaN 56.54167 25.48143
## 12 1998 18.26467 50.66250 24.58649
## 13 1999 18.49646 57.48864 24.66667
## 14 2000 16.93806 55.76103 23.46082
## 15 2001 16.92632 51.81984 25.06522
## 16 2002 15.27335 54.88043 22.73750
## 17 2003 15.23183 56.16608 24.62500
## 18 2004 14.62864 44.48240 23.39130
## 19 2005 16.18556 58.84126 22.62387
chicago %>% mutate(month = as.POSIXlt(date)$mon + 1)
%>% group_by(month)
%>% summarize(pm25 = mean(pm25, na.rm = TRUE),
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
## Source: local data frame [12 x 4]
##
## month pm25 o3 no2
## 1 1 17.76996 28.22222 25.35417
## 2 2 20.37513 37.37500 26.78034
## 3 3 17.40818 39.05000 26.76984
## 4 4 13.85879 47.94907 25.03125
## 5 5 14.07420 52.75000 24.22222
## 6 6 15.86461 66.58750 25.01140
## 7 7 16.57087 59.54167 22.38442
## 8 8 16.93380 53.96701 22.98333
## 9 9 15.91279 57.48864 24.47917
## 10 10 14.23557 47.09275 24.15217
## 11 11 15.15794 29.45833 23.56537
## 12 12 17.52221 27.70833 24.45773
Once you learn the dplyr "grammar" there are a few additional benefits
-
dplyr can work with other data frame "backends"
-
data.table
for large fast tables -
SQL interface for relational databases via the DBI package