R package for generating a set of counterfactual unions from a set of actual unions and alternate partners

fakeunion - An R package for generating counterfactual unions

This R package contains functions that will generate a set of "fake" unions (e.g. marriages, cohabitations), given a set of real unions and alternate partners. The key function generateCouples will sample from the list of alternate partners to create a dataset that contains both real and fake unions. This dataset can then be used to run a fixed effect conditional logit model to predict what partner characteristics are likely to be associated with a real match.

Installation

The package can be installed in R from GitHub with the devtools package:

library(devtools)
install_github("AaronGullickson/fakeunion")

Usage

The examples section of the help file for generateCouples contains a simple example showing how to use the program, but I provide some further details here.

There are three core data frame objects that are required:

  • actual: This data frame contains information about the real unions and the characteristics of the spouses in those unions. It should contain a variable that identifies the geographic unit within which unions are sampled (e.g. state, city, county). It should also contain unique id variables that identify the two partners and end with "h" and "w" (e.g. "idh" and "idw"). Further data on the couple or on partner characteristics can be included here. All partner characteristic variables should end with either an "h" or a "w" to identify which partner they belong to.
  • men: This data frame contains information about potential alternate men that could have been selected as partners. It should contain the same geographic unit variable and a unique id that is labeled the same way as in actual (e.g. "idh"). It should also include characteristics that match the "h" partner characteristics in actual. It can also contain a weight variable that is used to sample individuals with different probabilities. The weight variable should not end in "h".
  • women: This data frame contains information about potential alternate women that could have been selected as partners. It should have the same format as men, but its variables should end in "w".

Here are examples of what the format should look like based on the example acs data supplied in the package:

Actual:

   state              idw   agew racew             idh ageh raceh
  New York  2011.1.856963.2   53     W 2011.1.856963.1   69     W
  New York  2013.1.815005.1   30     W 2013.1.815005.2   33     W
  New York  2014.1.877381.2   38     W 2014.1.877381.1   39     W
California  2015.1.100895.1   52     W 2015.1.100895.2   46     W
California  2011.1.199641.2   34     W 2011.1.199641.1   31     W
  New York  2011.1.870767.1   48     W 2011.1.870767.2   39     W

Men:

    state perwt             idh ageh raceh
California    58 2012.1.193121.1   41     B
California    73 2015.1.160865.3   23     W
  New York    72 2013.1.872930.1   70     W
  New York   112 2011.1.859134.3   30     W
California    57 2011.1.199178.1   47     W
  New York    56 2012.1.873801.3   33     W

Women:

     state perwt             idw agew racew
California    40 2015.1.192456.3   18     W
  New York   466 2015.1.850982.2   26     W
  New York    79 2013.1.879650.1   61     W
  New York   103 2014.1.816232.1   63     B
California   110 2011.1.140627.1   71     W
  New York    79 2014.1.809120.1   85     W
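The naming conventions above can be checked before the data frames are passed to generateCouples. The following is a minimal sketch in base R. The helper check_union_inputs is hypothetical and is not part of the package, and the toy data frames are invented for illustration. It assumes the id variables are named "idh" and "idw", as in the example data.

```r
# Hypothetical helper (not part of fakeunion): returns a character vector of
# problems found in the three input data frames, or character(0) if none.
check_union_inputs <- function(actual, men, women, geo, weight = NULL,
                               id_h = "idh", id_w = "idw") {
  inputs <- list(actual = actual, men = men, women = women)
  problems <- character(0)

  # The geographic unit variable must appear in all three data frames
  for (nm in names(inputs)) {
    if (!geo %in% names(inputs[[nm]])) {
      problems <- c(problems, sprintf("%s: missing geo variable '%s'", nm, geo))
    }
  }

  # Partner ids must be present and labeled the same way as in actual
  if (!all(c(id_h, id_w) %in% names(actual))) {
    problems <- c(problems, "actual: missing one or both partner id variables")
  }
  if (!id_h %in% names(men)) problems <- c(problems, "men: missing id variable")
  if (!id_w %in% names(women)) problems <- c(problems, "women: missing id variable")

  # Variables ending in "h" (men) or "w" (women) must also exist in actual
  h_vars <- grep("h$", names(men), value = TRUE)
  w_vars <- grep("w$", names(women), value = TRUE)
  unmatched <- c(setdiff(h_vars, names(actual)), setdiff(w_vars, names(actual)))
  if (length(unmatched) > 0) {
    problems <- c(problems,
                  paste("not found in actual:", paste(unmatched, collapse = ", ")))
  }

  # The weight variable must exist in men and women and carry no partner suffix
  if (!is.null(weight)) {
    if (!(weight %in% names(men) && weight %in% names(women))) {
      problems <- c(problems, "weight variable missing from men or women")
    }
    if (grepl("[hw]$", weight)) {
      problems <- c(problems, "weight variable should not end in 'h' or 'w'")
    }
  }
  problems
}

# Toy data in the same layout as the examples above (values are made up)
actual <- data.frame(state = "New York", idw = "a2", agew = 53, racew = "W",
                     idh = "a1", ageh = 69, raceh = "W")
men <- data.frame(state = "New York", perwt = 72, idh = "m1",
                  ageh = 70, raceh = "W")
women <- data.frame(state = "New York", perwt = 79, idw = "w1",
                    agew = 61, racew = "W")

check_union_inputs(actual, men, women, geo = "state", weight = "perwt")
```

An empty result means that no naming problems were detected. It does not guarantee that generateCouples will accept the data.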

The main function generateCouples can then be called by specifying these datasets, the geographic cluster for sampling, and the number of partners to be sampled. Here is an example using the ACS data provided in the package:

markets <- generateCouples(3,acs.couples,acs.men,acs.women,
                           geo="state",weight="perwt")

This call randomly chooses either the husband or the wife of each actual couple and samples three alternate partners from within the same state, without replacement. The sampling uses the weights provided by the perwt variable.
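To make the sampling step concrete, here is a minimal base R sketch of a weighted draw without replacement within one geographic unit. It illustrates the idea only and is not the package's internal code. The function sample_alternates is hypothetical, the toy data are invented, and excluding the real partner from the candidate pool is an assumption of this sketch. generateCouples may handle these details differently.

```r
# Hypothetical illustration (not part of fakeunion): draw up to n alternate
# partners for one couple from the same geographic unit, using weights.
sample_alternates <- function(couple, alternates, n, geo, weight, id) {
  same_geo <- as.character(alternates[[geo]]) == as.character(couple[[geo]])
  # Assumption of this sketch: the real partner cannot be drawn as an alternate
  not_real <- as.character(alternates[[id]]) != as.character(couple[[id]])
  pool <- alternates[same_geo & not_real, , drop = FALSE]
  if (nrow(pool) == 0) return(pool)

  # Weighted draw without replacement, capped at the size of the pool
  n <- min(n, nrow(pool))
  rows <- sample.int(nrow(pool), size = n, replace = FALSE,
                     prob = pool[[weight]])
  pool[rows, , drop = FALSE]
}

set.seed(1)
men <- data.frame(state = c("CA", "CA", "NY", "NY", "NY", "NY"),
                  perwt = c(58, 73, 72, 112, 56, 90),
                  idh   = paste0("m", 1:6),
                  ageh  = c(41, 23, 70, 30, 33, 45))
couple <- data.frame(state = "NY", idw = "w0", agew = 38,
                     idh = "m3", ageh = 70)

# Three alternate husbands for this wife, all from New York
fake_husbands <- sample_alternates(couple, men, n = 3, geo = "state",
                                   weight = "perwt", id = "idh")
```

Men with larger perwt values are more likely to be drawn first, and no man is drawn twice for the same couple.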

Processing time

I have tested the processing time of generateCouples under a variety of scenarios. The American Community Survey data I use has 73,536 actual couples, 2.35 million alternate male partners, and 2.76 million alternate female partners. I tested clustering at three different levels: metro area, state, and country. Since all respondents were resident in the USA, the "country" level is effectively a single cluster.

I also created a smaller dataset by sampling 10% of the data from my actual couples and alternates. The results are presented in the table below, with processing time recorded in seconds. The computations were run on a high-end Linux computer.

 size cluster     N3     N5    N10    N30    N50
Small   metro    2.7    2.7    3.0    4.8    5.9
Small   state    1.6    1.2    1.5    2.8    4.1
Small country    6.9    6.8    7.0    8.0    9.3
Large   metro   20.7   22.0   27.1   52.7   76.8
Large   state   36.0   37.2   41.7   55.3   69.2
Large country  626.9  627.4  630.9  643.4  659.1

Conditional Logit Models

The data produced by generateCouples can be used directly in conditional logit models where the group variable is used as the fixed effect. These models can be estimated in R using the clogit function in the survival package, with the group specified by the strata function, like so:

library(survival)
clogit(choice~I(ageh-agew)+I((ageh-agew)^2)+I(raceh!=racew)
              +strata(group), data=markets)

This model can also be estimated in Stata using clogit, or equivalently xtlogit with the fe option.

Convenience functions

This package also contains two convenience functions for working with conditional logit models. First, the function poolChoiceModel can be used to run the same conditional logit model on a list of sampled datasets. Because the dataset generated by generateCouples is a sample that will vary each time it is drawn, it may be useful to use a technique similar to multiple imputation and conduct m parallel analyses that can then be pooled together to account for the additional between-sample variation. This function will do that for the user based on a formula and a list of generated datasets. An example is given in the help file.
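As an illustration of the pooling idea, the sketch below combines coefficients and standard errors from m analyses using Rubin's rules, the standard combining rules from multiple imputation. It assumes this is the kind of pooling meant here, and the exact calculations in poolChoiceModel may differ. The function pool_rubin is hypothetical and the numbers are invented for illustration.

```r
# Hypothetical illustration (not part of fakeunion): pool m sets of
# coefficient estimates and standard errors using Rubin's rules.
pool_rubin <- function(estimates, std_errors) {
  # estimates, std_errors: m x k matrices, one row per sampled dataset
  m <- nrow(estimates)
  pooled  <- colMeans(estimates)        # pooled point estimates
  within  <- colMeans(std_errors^2)     # average within-sample variance
  between <- apply(estimates, 2, var)   # between-sample variance
  total   <- within + (1 + 1 / m) * between
  data.frame(term = colnames(estimates), estimate = pooled,
             se = sqrt(total), row.names = NULL)
}

# Made-up coefficients from m = 3 hypothetical model fits
est <- matrix(c(-0.10, -0.12, -0.11,
                -1.90, -2.10, -2.00),
              ncol = 2, dimnames = list(NULL, c("age_diff", "race_exog")))
se <- matrix(c(0.02, 0.02, 0.02,
               0.30, 0.30, 0.30), ncol = 2)

pooled <- pool_rubin(est, se)
```

The pooled standard error is larger than the average standard error of the individual fits whenever the estimates differ across samples, which is how the between-sample variation enters the result.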

Second, the function createExogamyTerms will create a set of interaction terms between the partners' characteristics. It also creates an "endogamous" category for cases where the partners share the same characteristic and sets this category as the reference. The user can also specify that the exogamy terms should be symmetric, so that it does not matter which spouse has which characteristic.
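The idea behind these terms can be sketched in base R as follows. This is a hypothetical illustration and not the package's createExogamyTerms, whose arguments and output labels may differ. The function name make_exogamy_factor, the "Endog" label, and the "." separator are all assumptions of this sketch, and missing values are not handled.

```r
# Hypothetical illustration (not part of fakeunion): build a factor of
# husband/wife combinations with an endogamous reference category.
make_exogamy_factor <- function(h, w, symmetric = FALSE,
                                endog_label = "Endog") {
  h <- as.character(h)
  w <- as.character(w)
  if (symmetric) {
    # Order each pair alphabetically so that B/W and W/B share one term
    first  <- ifelse(h <= w, h, w)
    second <- ifelse(h <= w, w, h)
  } else {
    first  <- h
    second <- w
  }
  term <- ifelse(h == w, endog_label, paste(first, second, sep = "."))
  # Putting the endogamous category first makes it the reference level
  lev <- c(endog_label, sort(setdiff(unique(term), endog_label)))
  factor(term, levels = lev)
}

raceh <- c("W", "B", "W", "B")
racew <- c("W", "W", "B", "B")

asym <- make_exogamy_factor(raceh, racew)
sym  <- make_exogamy_factor(raceh, racew, symmetric = TRUE)
```

In the asymmetric version a black husband with a white wife and a white husband with a black wife get separate terms, while the symmetric version collapses them into one. Because the endogamous category is the first level, coefficients on such a factor in a model formula are read relative to endogamous unions.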