Why are FLPDYR+h_seq+ffpos values not unique in cps.csv file? #1658
Another question about the new cps.csv file.
Maybe I'm confused, but that was what I was expecting.
Why is it that there are 60 records in the cps.csv file with the same FLPDYR+h_seq+ffpos values?
Don't we know which CPS sample a record came from based on FLPDYR? To aid with CPS merging, I created a simple database with just the identifier variables.
@evtedeschi3 said:
Yes, good point. But any Tax-Calculator output file will have all the filing units with the same FLPDYR value.
But when I combine FLPDYR, h_seq, and ffpos, I get the following tabulation:
Notice the duplicate counts. Again, maybe I'm doing the calculations wrong, but it seems as if FLPDYR/h_seq/ffpos values are not unique. Have you checked your results?
Thanks for pointing this out @martinholmer. @evtedeschi3 is correct that you would also need to account for the year of the CPS in the tabulations, which it appears you do in your most recent comment. For my own clarification, when you converted cps.csv to an SQLite database and produced the numbers above, that is still the input file, correct? It's possible that some families may have been split into multiple tax units (think dependent filers particularly), in which case there would be multiple tax units with identical h_seq and ffpos values. I will also read through the CPS documentation to see if there are any identification variables which would be better suited for what we're trying to use h_seq and ffpos for.
Just to add one quick thought -- imputations to remove top-coding could be a source of the duplicates.
One thought: Are you trying to merge individual ASEC variables? Remember, there will be several individuals in each year/h_seq/ffpos combination.
Also, shifting to the cps.csv file, it is absolutely the case that in some instances it splits a family into two or more tax units.
To get true unique matching at the individual level, you would need to add a variable for the PULINENO of the head, and another one for the spouse.
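A minimal sketch of what such an individual-level merge might look like, assuming pandas, hypothetical file names, a cpsyear column identifying the CPS sample year, and a head_lineno column holding the head's person line number; only h_seq, ffpos, and PULINENO come from the comments above:

```python
# Hedged sketch of family-level vs. person-level merging; file names, cpsyear,
# and head_lineno are illustrative assumptions, not actual taxdata columns.
import pandas as pd

units = pd.read_csv("tax_units.csv")    # hypothetical tax-unit-level file
asec = pd.read_csv("asec_persons.csv")  # hypothetical person-level ASEC extract

# Family-level merge: several persons share each (cpsyear, h_seq, ffpos) key,
# so this produces one row per person in the family, not one row per tax unit.
fam = units.merge(asec, on=["cpsyear", "h_seq", "ffpos"], how="left")

# Person-level merge for the tax-unit head: match the head's line number
# (assumed stored as head_lineno) against the ASEC PULINENO variable.
head = units.merge(
    asec.rename(columns={"PULINENO": "head_lineno"}),
    on=["cpsyear", "h_seq", "ffpos", "head_lineno"],
    how="left",
)
```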
@andersonfrailey, here is how I converted cps.csv into an SQLite database:
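The exact commands are not reproduced in the thread; a minimal sketch of one way to do the conversion in Python, assuming pandas plus the standard-library sqlite3 module (database and table names are illustrative):

```python
# Hedged sketch: load cps.csv into an SQLite database for ad hoc SQL queries.
import sqlite3
import pandas as pd

cps = pd.read_csv("cps.csv")        # the Tax-Calculator input file
conn = sqlite3.connect("cps.db")    # illustrative database file name
cps.to_sql("cps", conn, if_exists="replace", index=False)
conn.close()
```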
Thanks for all the comments on #1658. I can see that, for several reasons, CPS families are split into separate tax filing units. But consider the following tabulation and my question below the tabulation results:
What kind of family is split into more than thirty filing units? @MattHJensen @Amy-Xu @andersonfrailey @hdoupe @evtedeschi3
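The tabulation itself is not reproduced in the thread. As a rough illustration, a count of filing units per FLPDYR+h_seq+ffpos combination could be produced like this (only the three column names come from the discussion; everything else is an assumption):

```python
# Hedged sketch: count how many cps.csv records share each
# FLPDYR + h_seq + ffpos combination and show the largest groups.
import pandas as pd

cps = pd.read_csv("cps.csv")
counts = (
    cps.groupby(["FLPDYR", "h_seq", "ffpos"])
       .size()
       .sort_values(ascending=False)
)
print(counts.head(10))  # families split into the most filing units appear first
```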
@evtedeschi3, Thanks for the extra information in issue #1658.
The characteristics of this household, as you point out, are very unusual. First of all, there are 45 people (some single and some married) living at the same address and they are all considered by Census to be in the same family. And, as you point out, most have very high incomes. So, is this a mistake in the preparation of the cps.csv file?
It’s particularly strange because the CPS by design doesn’t sample group quarters. So it’s not e.g. a dormitory or a barracks.
This might relate to the top-coding imputation I mentioned earlier. The algorithm is documented like so: "This top-coding imputation is implemented for records with high income." Source: http://www.quantria.com/assets/img/TechnicalDocumentationV4-2.pdf
@Amy-Xu beat me to it. We also have that noted in our documentation for the CPS file. The one change between what we do and what she posted is that we repeat the process 15 times rather than 10.
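For readers without access to the Quantria or taxdata documents, the general replicate idea being described is roughly: replace each high-income record with 15 replicates whose weights sum to the original weight, so weighted totals are preserved while a spread of values replaces the single top-coded amount. The sketch below is purely illustrative and is not the actual taxdata or Quantria procedure; the threshold, the replacement-income distribution, and the column names are all assumptions:

```python
# Rough illustration of a replicate-style top-code adjustment (NOT the actual
# taxdata algorithm): each high-income record becomes N replicates, each with
# 1/N of the original weight and an imputed income above the threshold.
import numpy as np
import pandas as pd

N_REPLICATES = 15      # the thread says the process is repeated 15 times
TOP_CODE = 250_000     # illustrative; the 2013 ASEC wage swap cutoff noted above

def replicate_topcoded(df, income_col="income", weight_col="weight", seed=0):
    rng = np.random.default_rng(seed)
    high = df[df[income_col] >= TOP_CODE]
    rest = df[df[income_col] < TOP_CODE]
    replicates = []
    for _, row in high.iterrows():
        for _ in range(N_REPLICATES):
            new = row.copy()
            # Replacement draw above the threshold; the distribution here is
            # purely illustrative, not the documented imputation.
            new[income_col] = TOP_CODE * (1.0 + rng.pareto(3.0))
            new[weight_col] = row[weight_col] / N_REPLICATES
            replicates.append(new)
    return pd.concat([rest, pd.DataFrame(replicates)], ignore_index=True)
```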
So that makes perfect sense, but that still means we have families of 30+ people who, evidently in this case, all have top-coded incomes! (The swap cutoff for the ASEC in 2013 was a personal wage of $250,000.)
@evtedeschi3 Why do you think we have families of 30+ people? Looking at this post by Martin and the 15 replicates highlighted by Anderson, I'm guessing we most likely have a dozen high-income families of 3-4 people.
@andersonfrailey pointed to the taxdata documentation of the top-coding imputation:
Thanks for this helpful information. I still have two questions: (1) The above documentation explains why there are duplicate cpsyear+h_seq+ffpos values in multiples of 15, 30 and 45. What's the story for the many values that are duplicated only two or three times?
There's no top-coding there, so what explains the fact that several filing units have the same cpsyear+h_seq+ffpos value? (2) This top-coding documentation raises a completely different issue. A variable being top-coded means the actual value of that variable is larger than the top-coded amount, right? If so, then when the
@MattHJensen @Amy-Xu @hdoupe @codykallen @evtedeschi3
@martinholmer wrote
My impression is that the ASEC has not been strictly "top-coded" post-2011. Rather, the top-codes really act more as thresholds. Above them, values are swapped with other super-top-coded values to protect identities. IPUMS has a very useful description of how this changed over time:
Thanks, @Amy-Xu, for citing John's documentation of the CPS tax file. The discussion before the passage you quoted describes the Census CPS top-coding procedure, as @evtedeschi3 did in his helpful comment. That information has cleared up several misconceptions I had about how Census handles high income amounts. But I have a couple of remaining questions about how the top-coded amounts are handled in preparing the cps.csv file.
Question (1): Exactly why do we need to replace the top-coded amount with 15 replicates? Exactly why are the Census top-coded amounts not "suitable for tax analysis"? John continues:
Question (2): What "regression equation" is being referred to in the second item?
Now that I understand that 15 (or 30 or 45) replicates have been used to replace a single CPS family with high income, the range of my questions is much narrower. What's the story behind the many cpsyear+h_seq+ffpos values that are duplicated only two or three times?
There's no top-coding involved (because there are only two or three duplicates among the first nine of many), so what explains the fact that these filing units share the same cpsyear+h_seq+ffpos value?
@martinholmer my intuition is that these 2-3 replicates are from families with more than one filing unit, but not necessarily high income. Say a 20-year-old daughter has a part-time job and files a separate return from her parents, while all three of them are just 'normal'-income people.
@martinholmer also asked:
My memory is a bit rusty, but I roughly remember that those averages are predictions from regressions of income on gender, race and work experience (take a look at the 2010 doc). If that's true, then it seems there's another problem here. Census only used this top-coding method for the CPS ASEC prior to 2011, which has nothing to do with our current file that includes the 2013-2015 CPS. The old top-coding method isn't good for tax analysis (I guess) because distributions are collapsed into averages. Restoring the distribution was a doable and sensible option. But now the top coding has been revised to a swapping method. I don't know whether restoring the distribution through the old method still makes sense.
Thanks for the explanation, which seems perfectly sensible.
Chart 2 simply shows the average value above the top-coding threshold. There are no regression results in either Chart 1 or Chart 2, as you can see from this reproduction of those tables:
That's why I asked why all the replicates were being generated. Unless I'm missing something, there is no reason to go through all this replicates business for our more recent CPS files. And it's worse than just being inefficient; it's wrong, because the regression imputation scheme used to construct the cps.csv file is based on the old, pre-2011 top-coding method. I'm starting to wonder if this is a contributing factor to the understatement of income tax revenue when using the cps.csv file. @MattHJensen @andersonfrailey @hdoupe @codykallen @evtedeschi3
@martinholmer said:
That seems possible. Should a new issue be opened to describe the change that should be made to the file prep?
Martin said:
I agree that it seems we should turn off this imputation for top-coding removal, given that the top-coding method has been updated in more recent CPS ASEC files.
@MattHJensen said:
I'm not sure where in the sequence of processing the three raw CPS files into the cps.csv file this top-coding imputation occurs.
@martinholmer asked:
This occurs in the TopCodingV1.sas scripts. This is after tax units have been created and adjustments are being made to the final file. And
I talked with John about this a few months ago. His opinion was that there wasn't a huge need to revise/remove the top-coding scripts, though he hadn't run the numbers to verify this. I'm in favor of at least comparing the results that we would get from removing this part of the creation process. There are two problems preventing this from happening immediately, though. First, our SAS license has expired and, I just checked, the general AEI one has as well. Second, John hasn't sent me the script to recreate the weights file yet, and I haven't completed my work on a Python version, so we can't create a new weights file at this time.
@andersonfrailey said almost three weeks ago in Tax-Calculator issue #1658:
@andersonfrailey, Is this a formal issue in the taxdata repository?
@martinholmer asked:
@andersonfrailey, Is the issue in Tax-Calculator #1658 covered in the PSLmodels/taxdata#125 issue?
@martinholmer asked:
In my opinion it is not. The two are related, and the issue in #1658 (removing/revamping our top-coding routine) is dependent on TaxData issue #125, but it will not be completely addressed in that issue. I would prefer that a separate issue be opened when we begin work on reviewing our top-coding routine.
Closing issue #1658 in favor of open taxdata issue 174.
Merged pull request #1635 added two new variables to the cps.csv file. Here is what was said in the taxdata repo about these two new variables: "Combining the h_seq and ffpos values for each family will produce a unique identifier within a CPS sample."

However, the cps.csv sample contains records from three different CPS samples (for different years). So, without knowing which of the three CPS samples each record is from, it is impossible to match a record in the cps.csv file with its complete information from a Census CPS file. Wasn't the whole idea behind adding these variables to allow users to get extra information from the Census CPS file to exact-match with the records in the cps.csv file? If so, then it would seem as if they cannot do that.

The following tabulations illustrate the problem. So, for example, the three records with h_seq=2 and ffpos=1 have unique RECID values, but that doesn't help a user figure out which Census CPS file those three cps.csv records were drawn from.

@MattHJensen @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun
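A sketch of the kind of lookup described in the example above (assuming only the RECID, FLPDYR, h_seq, and ffpos columns already named in this issue):

```python
# Hedged sketch: the records sharing h_seq=2 and ffpos=1 have distinct RECIDs,
# but nothing in cps.csv identifies which of the three CPS samples each is from.
import pandas as pd

cps = pd.read_csv("cps.csv")
dups = cps[(cps["h_seq"] == 2) & (cps["ffpos"] == 1)]
print(dups[["RECID", "FLPDYR", "h_seq", "ffpos"]])
```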