-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extrapolation routine testing and bug-fixes #133
Conversation
👍 for the tests and bug-fixes. I'll add my part of the tests for checking distribution/tabs to this folder today or tomorrow. |
Great, thanks @Amy-Xu |
@Amy-Xu will those tests be added on in this PR or a separate one? |
@andersonfrailey I'm thinking a separate PR but make sure it's compatible with this one. Any preference? |
@Amy-Xu not particularly. Whichever is easiest. |
What is the status of this PR? |
@hdoupe's latest commit removed the few lines of code that drop the rows representing people who receive no benefits in the 10-year window. The resulting By not dropping those who receive no benefits, If there are no objections to this or any other concerns raised, I'll merge this PR tomorrow morning and update my PR in Tax-Calculator accordingly. |
Sounds good to me. Thanks for all the investigation work @hdoupe @andersonfrailey! |
@hdoupe said:
Can you explain what “resulted in the benefits being improperly reassigned in Tax-Calculator” means? |
@martinholmer asked:
Basically that benefits were being assigned to the wrong tax-unit. We first noticed that something was off when looking at the participation rates by AGI percentile. Under the original method, we got this chart: When we removed the code that dropped those who did not receive benefits and simply replaced all of the benefit data in the CPS with the new values produced in the extrapolation routine, the participation rates looked like this: |
@andersonfrailey said in taxdata #133:
I have an objection. You've glossed over the enormous increase in file size from doing this. When removing those with zero benefits, the @andersonfrailey also said:
I assume what you are referring to when you say "how we assigned the extrapolated benefits to tax units" is the logic in Tax-Calculator pull request 1719. Right? When I look at the logic changes in the Records class in 1719, I don't see any place where you join (to use the SQL-like method available in Pandas) or merge (to use the alternative Pandas method) the extrapolated benefit data to the basic CPS input data. Now maybe I've missed where you are doing this. If that's so, please point it out to me so that I can review it in an attempt to figure out what's going wrong. But under the assumption that you haven't done a join or merge of the basic CPS data and the extrapolated data, I suggest you do that in 1719 to see if extrapolated benefits will then be assigned to the correct CPS filing units. I don't think this will take much time (relative to how long both this taxdata and the associated Tax-Calculator pull requests have been pending) and, if doing the join or merge is successful, the size of the taxcalc package will be almost half the size than it would be if you merge #133 now. I don't see any advantage in merging #133 now and "then come back and open subsequent PR's to address that issue" later. Why not fix this problem now? Doubling the size of the taxcalc package has an impact on many Tax-Calculator users, most of whom have no interest in the benefits data. It was a clever idea to include only those with positive benefits in the |
@andersonfrailey, see this Tax-Calculator 1719 comment on how to assign positive benefits to the correct CPS filing unit. So, it seems as if you can go back to the approach of including in
With respect to item 3, the version of the file in 1719 has values like these:
Rounding each benefit to the nearest dollar is sufficient precision given all the assumptions and imputations that been used in the construction of these benefit amounts. And using integer data will significantly reduce the size of the |
The latest commits revert back to only saving units who receive benefits at some point in the budget window and save the dataframe values as integers as suggested. @martinholmer Thanks for the advice on merging the benefit data and the base CPS data and saving the data as integers. The latter advice reduced the file size from 30 MB to 10 MB! |
Thanks for your feedback @martinholmer and for working on making the needed changes @hdoupe. I'll leave this open the rest of today for review and merge tomorrow if there are no objections. |
Thanks for the recent changes. The first thing I did in my review was to run
While the two test did pass, the two warnings are worrisome. The above message is suggesting a better coding style. Don't you think we should be taking this advice? |
cps_stage3/extrapolation.py
Outdated
assert candidates.I.sum() == len(candidates) | ||
noncandidates = extrap_df.loc[extrap_df.I == 0, ] | ||
noncandidates = extrap_df.loc[extrap_df.I == 0, ].copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martinholmer For some reason, I had to use the copy()
method here. I thought that you were safe if you used the loc
accessor method instead of doing something like candidates = extrap_df[extrap_df.I == 1]
.
Does any one have any thoughts on why I had to use the copy method here?
The next thing I did in my review of #133 was to confirm the changes in item 2 of this comment had been made. Item 2 said this:
But when I look at the variable names in the new
Why can't we standardize on the names in item 2? Tax-Calculator pull request is using I don't care one bit about how C-TAM and taxdata did its internal work (as long as it is accurate), but I do care about the data variable names you are proposing to be used in Tax-Calculator. If you don't like the six benefit names I suggested in my item 2, I would appreciate a conversation in which you suggest a better set of standardized names. |
@martinholmer Whoops, I looked over that part. Sorry about that. I'll rename the columns. |
@hdoupe, I'm not sure what happened in commit fa599e1, but nothing has changed with the variable names in the In Tax-Calculator, the first thing that will be done with the
Then the
But this is verbose; it would be much cleaner to write this:
So, when the
|
@andersonfrailey said yesterday:
Please wait to merge until the variables in the |
@martinholmer Please review this and let me know if this is what you are looking for. |
@hdoupe, Thanks for all the work renaming the variables in the
I think it is fine to not have the underscore character between the If you bring the 2014 variable names into line with the name format for the other years, everything would be perfect. Thanks for all this work; it's going to make working with the benefit data much easier in Tax-Calculator. |
@martinholmer The most recent commit should solve this problem. My apologies for submitting work for review that I had not thoroughly reviewed myself. Thanks for reviewing. |
@hdoupe said in pull request #133:
No problem; I know you're incredibly busy. Our policy of having other people review a pull request is meant to catch things like this. The format of this version of the |
Thanks for working on this @hdoupe. This looks good to me as well. I'll merge in the morning. |
Adds tests for the extrapolation routine and fixes bugs. Thanks to @andersonfrailey and @Amy-Xu for noticing the strange looking results. Hopefully, this resolves some of the distribution issues that we are having.
Bug-fixes:
The indexing scheme was wrong. The idea was convert the N x 15 matrix of benefits data into an array, do the extrapolation, and convert it back into the same matrix. To do this, you needed to keep track of the position of each element in the matrix. The column indexing was created incorrectly and in result the indices were not unique. Thus, the matrix was effectively shuffled when it was flattened and un-flattened. See commit
Update indexing scheme
for the bug-fix.The average benefit assignment was incorrect. The average benefit was calculated as the target participation divided by the population receiving benefits. This is obviously incorrect. The average benefit is now calculated as total benefits dividided by population receiving benefits. See commit
Add unit tests and fix avg benefit assignment bug
for fix among other changes.