Panel Maneuvers in dplyr (pmdplyr)
The pmdplyr package is an extension to dplyr designed for cleaning and
managing panel and hierarchical data. It contains variations on the
_join functions that address common panel data
needs, and contains functions for managing and cleaning panel data. The
goal is to get you a nice tidy
pibble panel data object, which you can
panel_convert() for use in one of the many packages that help you
analyze panel data.
Unlike other panel data packages, functions in pmdplyr are all designed to work even if there is more than one observation per individual per period. This comes in handy if each individual is observed multiple times per period - for example, multiple classes per student per term; or if you have hierarchical data - for example, multiple companies per country.
Examples of pmdplyr use are below. These examples only cover some of the
functionality of the package. See the
pkgdown page for a full list of functions, or
You can install pmdplyr from CRAN:
The development version can be installed from GitHub:
# install.packages("devtools") devtools::install_github("NickCH-K/pmdplyr")
College Scorecard Example
Let’s start with the fairly straightforward
Scorecard data, which
describes how well students who attended that college are doing years
Scorecard observations are uniquely identified by
unitid and year
# Note that pmdplyr automatically loads dplyr as well library(pmdplyr) data(Scorecard) # Let's declare it as a pibble panel-data tibble Scorecard <- as_pibble(Scorecard, .i = unitid, .t = year) # We also have this data on the December unemployment rate for US college grads nationally # but only every other year unemp_data <- data.frame( unemp_year = c(2006, 2008, 2010, 2012, 2014, 2016, 2018), unemp = c(.017, .036, .048, .040, .028, .025, .020) )
I am interested in measuring the differences in ex-student earnings
earnings_med between two-year and four-year colleges
pred_degree_awarded_ipeds == 2 or
3, respectively). But before we
can do that we need to clean the data.
Scorecard %>% # We want pred_degree_awarded_ipeds to be consistent within college. No changers! # So let's drop them by using fixed_check with .resolve = "drop" to lose inconsistencies fixed_force( .var = pred_degree_awarded_ipeds, .within = unitid, .resolve = "drop" ) %>% # Then, get rid of pred_degree_awarded_ipeds == 1 # And simplify our terms filter(pred_degree_awarded_ipeds %in% c(2, 3)) %>% mutate(FourYear = pred_degree_awarded_ipeds == 3) %>% # earnings_med has some missing values - let's fill them in with # the most recent nonmissing observations we have # - panel_locf respects the panel structure declared above with as_pibble() mutate(earnings_med = panel_locf(earnings_med)) %>% # Now let's bring in that unemployment data! # Since it's every other year, it won't line up properly # in the join. So let's use inexact_join to get the MOST RECENT # year to join with inexact_left_join(unemp_data, var = year, jvar = unemp_year, method = "last") %>% # To adjust for state-level trends, let's also control for a tlag of # average earnings within state. # The lag is at the state level, and state-year doesn't uniquely identify, # But that's okay! We just pick a .resolve function to handle disagreements. # (We could also do this straight in the regression model itself) mutate(lag_state_earnings = tlag(earnings_med, .i = state_abbr, .t = year, .resolve = mean )) -> scorecard_clean # Now we can run a basic regression. lm( earnings_med ~ FourYear + unemp + lag_state_earnings, data = scorecard_clean ) %>% summary() #> #> Call: #> lm(formula = earnings_med ~ FourYear + unemp + lag_state_earnings, #> data = scorecard_clean) #> #> Residuals: #> Min 1Q Median 3Q Max #> -25341 -4917 -528 4091 54587 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 5.933e+03 1.772e+03 3.348 0.000826 *** #> FourYearTRUE 9.088e+03 3.511e+02 25.886 < 2e-16 *** #> unemp -4.564e+04 2.529e+04 -1.805 0.071215 . #> lag_state_earnings 7.348e-01 4.027e-02 18.244 < 2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 8209 on 2474 degrees of freedom #> (25161 observations deleted due to missingness) #> Multiple R-squared: 0.3431, Adjusted R-squared: 0.3423 #> F-statistic: 430.6 on 3 and 2474 DF, p-value: < 2.2e-16
We could even improve that code - why not run the
safe_join()? When we do the
inexact_left_join(), for example, we’re assuming that
uniquely identified by
unemp_year—is it really?
check for us and minimize error.
Spanish Rail Example
Let’s shift focus to the
SPrail data set, which is not nearly as well
Scorecard! This data contains the
price of 2,000 train
trips taken on the Spanish rail system. At any given time period, there
are multiple tickets sold per route, which is identified by
destination. And the
insert_date of sale is down to the minute - far
more precise than we might actually want to use.
Let’s say we are interested in how much of a premium you’ll have to pay
for different ticket types relative to the cheapo option - a tourist
ticket with a transfer (
train_class == "Turista con enlace"),
accounting for differences in prices between routes.
We have some difficulties to cover: making the ID and time variables behave, accounting for the between-route differences, and figuring out how to compare each price to the cheapo price.
data(SPrail) SPrail %>% # We have two ID variables - origin and destination. # pmdplyr has no problem with this, but maybe we want to export # to something like plm later, which can't handle it. # So let's use id_variable to combine them into one mutate(route_ID = id_variable(origin, destination)) %>% # We have a time variable down to the minute. Too fine-grained! # Let's back things up to the daily level, and # create a nice integer time variable that's easy to use mutate(day = time_variable(insert_date, .method = "day")) %>% # Now we can declare a pibble as_pibble(.i = route_ID, .t = day) %>% # We want to account for between-route differences in price, # so let's isolate the within variation mutate(price_w = within_i(price)) %>% # We want to compare to the cheapo option, so let's use # mutate_subset to get the average price of the cheapo option # and propogate that to the other options for comparison mutate_subset( cheapo_price = mean(price, na.rm = TRUE), .filter = train_class == "Turista con enlace" ) %>% mutate(premium = price - cheapo_price) %>% filter(train_class %in% c("Preferente", "Turista", "Turista Plus")) %>% # Now let's compare premia group_by(train_class) %>% summarize(premium = mean(premium, na.rm = TRUE)) -> sprail_compare_premia sprail_compare_premia #> # A tibble: 3 x 2 #> train_class premium #> <fct> <dbl> #> 1 Preferente 29.9 #> 2 Turista 8.36 #> 3 Turista Plus 14.4
And so there we have it—
Preferente will really set you back relative
to the cheapo ticket on the same route.
Please note that the ‘pmdplyr’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.