Improve age variable on PUF #333

Open
Tracked by #336
MattHJensen opened this issue Jun 18, 2020 · 1 comment

Comments

@MattHJensen

MattHJensen commented Jun 18, 2020

In a recent PSL call, we discussed improving the age variable on the PUF, which is currently brought over in the CPS match. This discussion followed a comment from @jdebacker about needing to use the CPS as a source of primary taxfiler data rather than the PUF in a recent report using OG-USA.

@MaxGhenis suggested the possibility of imputing age from the CPS rather than obtaining it during the match.

I recently found a snippet on TPC's approach in this report at pg 180:

TPC uses cross-tabulations by age, filing status, and income provided by SOI to impute the ages of taxpayers and dependents to the LAPUF. TPC then performs a constrained statistical match between the LAPUF and the 2012 CPS.

Another snippet is available in the TPC model FAQ:

We use cross-tabulations of age, filing status, and income sources we obtained from SOI to implement a raking algorithm to impute the ages of taxpayers and their dependents on to the LAPUF.

The closest published cross-tabulations I could find from SOI are in Individual Complete Report (Publication 1304), Table 1.6, and the latest data is for 2017. But "provided by" and "obtained from" suggest that TPC may be using non-public data from SOI.

@MaxGhenis

MaxGhenis commented Jul 15, 2020

I applied synthimpute to impute age in the CPS here. Rather than imputing to the PUF, it imputes on a holdout set of the CPS for evaluation (I haven't yet checked that the x's are in the PUF).

Average age is 0.33 years too high, and standard deviation is 3.2 years too low. If comparison procedures like matching could predict quantiles, I think quantile loss would be the ideal evaluation metric; at that point, it's just selecting a uniform random quantile. Could the matching return the values for the nearest k records, and consider those the quantile range?
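To make the evaluation idea concrete, here is a minimal sketch of quantile (pinball) loss, plus one way matching could return a quantile from the nearest k records. Both function names, the Euclidean distance, and the toy inputs are assumptions for illustration; this is not synthimpute's API or the matching code used for the PUF.

```python
import numpy as np


def quantile_loss(y_true, y_pred, q):
    """Pinball loss for predictions of the q-th quantile (0 < q < 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-predictions are weighted by q, over-predictions by (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


def knn_quantile(x_donor, y_donor, x_new, k, q):
    """Hypothetical helper: q-th quantile of y among the k nearest donors."""
    x_donor = np.asarray(x_donor, dtype=float)
    y_donor = np.asarray(y_donor, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from every new record to every donor record.
    dists = np.linalg.norm(x_new[:, None, :] - x_donor[None, :, :], axis=2)
    # Indices of the k closest donors for each new record.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Empirical quantile of the donors' ages, one value per new record.
    return np.quantile(y_donor[nearest], q, axis=1)
```

Drawing `q` uniformly at random for each record and calling `knn_quantile` would then give an imputed age, and `quantile_loss` on a holdout could score matching and random forests on the same footing.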

In general, my hunch is that matching will understate the conditional variance (may overfit too), and this will probably result in lower total variance too, but that'll only be part of the full picture. Random forests also understated variance in this experiment, so we'll have to compare, but I'd expect it to do better. We can also add variance manually based on performance on holdouts (Tetlock has recommended this for forecasting in general, though I can't find his quote atm).
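One rough way to add variance manually is sketched below. It assumes the missing variance can be restored with Gaussian noise that is independent of the predictions, so the variances simply add; the function name and that noise model are illustrative assumptions, not something tested in the experiment above.

```python
import numpy as np


def add_holdout_noise(pred_new, y_holdout, pred_holdout, seed=0):
    """Add Gaussian noise sized to the variance shortfall seen on a holdout."""
    # Variance the model failed to reproduce on the holdout (floored at zero).
    shortfall = max(float(np.var(y_holdout) - np.var(pred_holdout)), 0.0)
    rng = np.random.default_rng(seed)
    # Independent noise, so Var(pred + noise) = Var(pred) + shortfall.
    noise = rng.normal(0.0, np.sqrt(shortfall), size=len(pred_new))
    return np.asarray(pred_new, dtype=float) + noise
```

For age, the result would probably still need clipping to a plausible range, since symmetric noise can push values below zero.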

Raking would be a good complement to this, and it's implemented in Python at https://github.com/Dirguis/ipfn. I'm not sure whether it's better to rake before imputing as TPC did, or vice versa, given we only have age ranges to rake on, but either way we'd need to impute also.
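For reference, raking (iterative proportional fitting) on a two-way table can be sketched in plain NumPy as below. This is not the ipfn package's API. The function name is hypothetical, and it assumes strictly positive row and column sums and row and column targets that add up to the same total.

```python
import numpy as np


def rake_2d(seed_table, row_targets, col_targets, max_iter=1000, tol=1e-10):
    """Rescale a 2-D table until its margins match the given targets."""
    table = np.asarray(seed_table, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row to hit its target total.
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Scale each column to hit its target total.
        table *= (col_targets / table.sum(axis=0))[None, :]
        # Columns match exactly after the last step; stop once rows do too.
        if np.allclose(table.sum(axis=1), row_targets, rtol=0, atol=tol):
            break
    return table
```

Here rows could be age ranges and columns filing statuses, with the seed table coming from imputed records and the targets from published SOI totals; raking only fixes counts per age range, so ages within each range would still need imputing.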
