Extrapolation routine testing and bug-fixes #133

Merged 34 commits on Jan 12, 2018

hdoupe
Collaborator

@hdoupe hdoupe commented Dec 6, 2017

Adds tests for the extrapolation routine and fixes bugs. Thanks to @andersonfrailey and @Amy-Xu for noticing the strange-looking results. Hopefully, this resolves some of the distribution issues that we have been having.

Bug-fixes:

  1. The indexing scheme was wrong. The idea was to convert the N x 15 matrix of benefits data into an array, do the extrapolation, and convert it back into the same matrix. To do this, you need to keep track of the position of each element in the matrix. The column indexing was created incorrectly and, as a result, the indices were not unique. Thus, the matrix was effectively shuffled when it was flattened and un-flattened. See commit "Update indexing scheme" for the bug-fix.

  2. The average benefit assignment was incorrect. The average benefit was calculated as the target participation divided by the population receiving benefits. This is obviously incorrect. The average benefit is now calculated as total benefits divided by the population receiving benefits. See commit "Add unit tests and fix avg benefit assignment bug" for the fix, among other changes.
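
To make the two fixes above concrete, here is a minimal sketch. It is not the code in extrapolation.py: it assumes the benefits are held in a NumPy matrix, and the helper names (`flatten_with_index`, `unflatten`, `average_benefit`) are hypothetical. The point is that every flattened element carries a unique (row, column) pair, so the matrix can be rebuilt exactly even if the flat values are reordered in between.

```python
import numpy as np


def flatten_with_index(benefits):
    """Flatten an N x K matrix and return a unique (row, col) pair per element."""
    n_rows, n_cols = benefits.shape
    # Row-major order: element (i, j) lands at flat position i * n_cols + j.
    rows = np.repeat(np.arange(n_rows), n_cols)
    cols = np.tile(np.arange(n_cols), n_rows)
    return benefits.ravel(), rows, cols


def unflatten(values, rows, cols, shape):
    """Put each value back at the position recorded by its (row, col) pair."""
    out = np.empty(shape, dtype=values.dtype)
    out[rows, cols] = values
    return out


def average_benefit(total_benefits, recipients):
    """Average benefit = total benefits / population receiving benefits."""
    return total_benefits / recipients


# Illustrative data only: 4 units by 3 benefit columns.
matrix = np.arange(12.0).reshape(4, 3)
values, rows, cols = flatten_with_index(matrix)

# Reorder the flat arrays together, as an extrapolation step might do.
order = np.argsort(-values)
rebuilt = unflatten(values[order], rows[order], cols[order], matrix.shape)
```

If the (row, col) pairs were not unique, two values would compete for the same cell on the way back and the matrix would come out shuffled, which matches the symptom described in item 1.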

@hdoupe
Collaborator Author

hdoupe commented Dec 6, 2017

Output from running the tests in test_extrapolation.py

[Screenshot of the test output, taken 2017-12-06 at 11:04:49 AM]

@Amy-Xu
Member

Amy-Xu commented Dec 6, 2017

👍 for the tests and bug-fixes. I'll add my part of the tests for checking distribution/tabs to this folder today or tomorrow.

@hdoupe
Collaborator Author

hdoupe commented Dec 6, 2017

Great, thanks @Amy-Xu

@andersonfrailey
Collaborator

@Amy-Xu will those tests be added on in this PR or a separate one?

@Amy-Xu
Member

Amy-Xu commented Dec 6, 2017

@andersonfrailey I'm thinking a separate PR but make sure it's compatible with this one. Any preference?

@andersonfrailey
Collaborator

@Amy-Xu not particularly. Whichever is easiest.

@MattHJensen
Contributor

MattHJensen commented Jan 4, 2018

What is the status of this PR?

@andersonfrailey
Collaborator

@hdoupe's latest commit removed the few lines of code that drop the rows representing people who receive no benefits in the 10-year window. The resulting cps_benefits_extrap.csv.gz file therefore contains a row for each unit in the CPS file. The original plan was to drop those who don't receive any benefits, but we found that doing that resulted in the benefits being improperly reassigned in Tax-Calculator. We would like to merge this PR now and use the included cps_benefits_extrap.csv.gz in Tax-Calculator, then come back and open subsequent PR's to address that issue.

By not dropping those who receive no benefits, cps_benefits_extrap.csv.gz becomes 36.5MB, about 10MB larger than cps.csv.gz and about 2MB larger than cps.csv.gz and cps_weights.csv.gz combined.

If there are no objections to this or any other concerns raised, I'll merge this PR tomorrow morning and update my PR in Tax-Calculator accordingly.

cc @Amy-Xu @MattHJensen @martinholmer

@Amy-Xu
Member

Amy-Xu commented Jan 8, 2018

Sounds good to me. Thanks for all the investigation work @hdoupe @andersonfrailey!

@martinholmer
Contributor

@andersonfrailey said:

The resulting cps_benefits_extrap.csv.gz file therefore contains a row for each unit in the CPS file. The original plan was to drop those who don't receive any benefits, but we found that doing that resulted in the benefits being improperly reassigned in Tax-Calculator. We would like to merge this PR now and use the included cps_benefits_extrap.csv.gz in Tax-Calculator, then come back and open subsequent PR's to address that issue.

Can you explain what “resulted in the benefits being improperly reassigned in Tax-Calculator” means?

@andersonfrailey
Collaborator

@martinholmer asked:

Can you explain what “resulted in the benefits being improperly reassigned in Tax-Calculator” means?

Basically that benefits were being assigned to the wrong tax-unit. We first noticed that something was off when looking at the participation rates by AGI percentile. Under the original method, we got this chart:
[Image: chart of participation rates by AGI percentile under the original method]
Based on talks with @Amy-Xu and previous work with the data, we knew that we should see participation rates fall as you move into higher AGI percentiles, rather than hold relatively steady as was happening with all but SSI.

When we removed the code that dropped those who did not receive benefits and simply replaced all of the benefit data in the CPS with the new values produced in the extrapolation routine, the participation rates looked like this:
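[Image: chart of participation rates by AGI percentile after removing the code that dropped those who did not receive benefits]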
Given that the only change was in how we assigned the extrapolated benefits to tax units, we came to the conclusion that somewhere in the process benefits were being given to the wrong tax-unit. We haven't been able to determine what is causing the improper assignment at this time.
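
For readers who want to reproduce this kind of diagnostic, here is a rough sketch of a weighted participation rate by AGI bin. It is not the code used to make the charts in this thread. The column names (`agi`, `weight`) and the function name are assumptions, and the bins are simple unweighted quantiles from `pd.qcut`, which is a simplification.

```python
import pandas as pd


def participation_by_agi_bin(df, benefit_col, agi_col="agi",
                             weight_col="weight", n_bins=10):
    """Weighted share of units with a positive benefit, by AGI quantile bin."""
    # Unweighted quantile bins, labelled 0 .. n_bins-1 (a simplification).
    bins = pd.qcut(df[agi_col], q=n_bins, labels=False, duplicates="drop")
    # Weight of each unit counts toward participation only if benefit > 0.
    participating = (df[benefit_col] > 0).astype(float) * df[weight_col]
    table = pd.DataFrame({
        "bin": bins,
        "participating": participating,
        "weight": df[weight_col],
    })
    sums = table.groupby("bin").sum()
    return sums["participating"] / sums["weight"]


# Illustrative data only.
units = pd.DataFrame({
    "agi": [10, 20, 30, 40],
    "weight": [1.0, 1.0, 2.0, 2.0],
    "ssi": [5, 0, 0, 0],
})
rates = participation_by_agi_bin(units, "ssi", n_bins=2)
```

A rate series that stays flat across bins, when it is expected to fall with AGI, is the kind of signal described above.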

@martinholmer
Contributor

martinholmer commented Jan 9, 2018

@andersonfrailey said in taxdata #133:

@hdoupe's latest commit removed the few lines of code that drop the rows representing people who receive no benefits in the 10-year window. The resulting cps_benefits_extrap.csv.gz file therefore contains a row for each unit in the CPS file. The original plan was to drop those who don't receive any benefits, but we found that doing that resulted in the benefits being improperly reassigned in Tax-Calculator. We would like to merge this PR now and use the included cps_benefits_extrap.csv.gz in Tax-Calculator, then come back and open subsequent PR's to address that issue.

By not dropping those who receive no benefits, cps_benefits_extrap.csv.gz becomes 36.5MB, about 10MB larger than cps.csv.gz and about 2MB larger than cps.csv.gz and cps_weights.csv.gz combined.

If there are no objections to this or any other concerns raised, I'll merge this PR tomorrow morning and update my PR [1719] in Tax-Calculator accordingly.

I have an objection. You've glossed over the enormous increase in file size from doing this. When removing those with zero benefits, the cps_benefits_extrap.csv.gz file was less than 3MB according to what I see in Tax-Calculator pull request 1719. So, increasing the size of this file from 3MB to 36.5MB almost doubles the size of the taxcalc package.

@andersonfrailey also said:

Given that the only change was in how we assigned the extrapolated benefits to tax units, we came to the conclusion that somewhere in the process benefits were being given to the wrong tax-unit. We haven't been able to determine what is causing the improper assignment at this time.

I assume what you are referring to when you say "how we assigned the extrapolated benefits to tax units" is the logic in Tax-Calculator pull request 1719. Right?

When I look at the logic changes in the Records class in 1719, I don't see any place where you join (to use the SQL-like method available in Pandas) or merge (to use the alternative Pandas method) the extrapolated benefit data to the basic CPS input data. Now maybe I've missed where you are doing this. If that's so, please point it out to me so that I can review it in an attempt to figure out what's going wrong.

But under the assumption that you haven't done a join or merge of the basic CPS data and the extrapolated data, I suggest you do that in 1719 to see if extrapolated benefits will then be assigned to the correct CPS filing units.
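
A minimal sketch of the suggested merge, assuming both tables carry a RECID column. This is not the logic in the Records class; the `wages` column and the small example frames are made up for illustration. A left merge keeps every CPS unit, and units absent from the benefits file get zero benefits.

```python
import pandas as pd

# Hypothetical base CPS data: one row per filing unit.
cps = pd.DataFrame({
    "RECID": [1, 2, 3, 4, 5],
    "wages": [40000, 0, 12000, 85000, 0],
})

# Hypothetical benefits file: only units with positive benefits.
benefits = pd.DataFrame({
    "RECID": [3, 5],
    "ssi_2014": [3373, 0],
    "snap_2014": [1590, 1147],
})

# Match on RECID rather than on row position; validate guards against
# duplicate RECID values on either side.
merged = cps.merge(benefits, on="RECID", how="left", validate="one_to_one")

# Units that were not in the benefits file receive zero benefits.
ben_cols = [col for col in benefits.columns if col != "RECID"]
merged = merged.fillna({col: 0 for col in ben_cols})
merged = merged.astype({col: "int32" for col in ben_cols})
```

Matching on the key means the row order of the two files no longer matters, which is the property the positional approach lacks.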

I don't think this will take much time (relative to how long both this taxdata and the associated Tax-Calculator pull requests have been pending) and, if doing the join or merge is successful, the size of the taxcalc package will be almost half of what it would be if you merge #133 now. I don't see any advantage in merging #133 now and "then come back and open subsequent PR's to address that issue" later. Why not fix this problem now?

Doubling the size of the taxcalc package has an impact on many Tax-Calculator users, most of whom have no interest in the benefits data. It was a clever idea to include only those with positive benefits in the cps_benefits_extrap.csv.gz file, so let's make that clever idea work.

@MattHJensen @Amy-Xu

@martinholmer
Contributor

martinholmer commented Jan 9, 2018

@andersonfrailey, see this Tax-Calculator 1719 comment on how to assign positive benefits to the correct CPS filing unit.

So, it seems as if you can go back to the approach of including in cps_benefits.csv.gz only filing units with positive benefits. And when you do go back to that approach, you need to do several other things to make the cps_benefits.csv.gz file easier to use:

  1. Call the file cps_benefits.csv.gz rather than cps_benefits_extrap.csv.gz

  2. Standardize the benefit type names as ssi, snap, vet, mcaid, mcare, oasdi

  3. Make all the data in the cps_benefits.csv.gz file be np.int32

With respect to item 3, the version of the file in 1719 has values like these:

   ssi_benefits_2014  medicaid_benefits_2014  medicare_benefits_2014  \
0        3373.058113            49958.448369             1716.598077   
1           0.000000              678.734729                0.000000   
2           0.000000                0.000000             5225.789459   
3           0.000000                0.000000            43071.412692   
4           0.000000                0.000000                0.000000   

   vb_benefits_2014  snap_benefits_2014  ss_benefits_2014  RECID  \
0               0.0         1589.818258      12294.725105      3   
1               0.0            0.000000          0.000000      4   
2               0.0            0.000000      33052.120146      5   
3               0.0            0.000000      22932.972160      6   
4               0.0         1146.853629          0.000000      9   

Rounding each benefit to the nearest dollar is sufficient precision given all the assumptions and imputations that have been used in the construction of these benefit amounts.

And using integer data will significantly reduce the size of the cps_benefits.csv.gz file.
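
A small sketch of what item 3 could look like in pandas. The helper name `to_int32` is hypothetical and the three rows are taken from the sample output above; this is not the code in extrapolation.py.

```python
import numpy as np
import pandas as pd


def to_int32(benefits):
    """Round every column to the nearest dollar and store it as np.int32."""
    return benefits.round(0).astype(np.int32)


# A few values copied from the sample output above.
float_df = pd.DataFrame({
    "ssi_benefits_2014": [3373.058113, 0.0, 0.0],
    "snap_benefits_2014": [1589.818258, 0.0, 0.0],
    "RECID": [3, 4, 5],
})
int_df = to_int32(float_df)

# The integer CSV text is shorter before compression even starts.
float_csv = float_df.to_csv(index=False)
int_csv = int_df.to_csv(index=False)

# Writing the compressed file would then be, for example:
# int_df.to_csv("cps_benefits.csv.gz", index=False, compression="gzip")
```

Each float such as 3373.058113 becomes 3373 in the file, so every nonzero cell drops most of its characters.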

@hdoupe @Amy-Xu @MattHJensen

@hdoupe
Collaborator Author

hdoupe commented Jan 9, 2018

The latest commits revert to saving only units that receive benefits at some point in the budget window, and they save the dataframe values as integers as suggested.

@martinholmer Thanks for the advice on merging the benefit data and the base CPS data and saving the data as integers. The latter advice reduced the file size from 30 MB to 10 MB!

@andersonfrailey
Collaborator

Thanks for your feedback @martinholmer and for working on making the needed changes @hdoupe. I'll leave this open the rest of today for review and merge tomorrow if there are no objections.

@martinholmer
Contributor

Thanks for the recent changes. The first thing I did in my review was to run py.test test_extrapolation.py.
Here is what I got:

iMac2:cps_stage3 mrh$ py.test test_extrapolation.py 
============================= test session starts ==============================
platform darwin -- Python 2.7.14, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/mrh/work/OSPC/taxdata/cps_stage3, inifile:
plugins: xdist-1.17.1
collected 2 items                                                               

test_extrapolation.py ..

=============================== warnings summary ===============================
test_extrapolation.py::test_add_participants
  /Users/mrh/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py:537: SettingWithCopyWarning: 
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead
  
  See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    self.obj[item] = s
  /Users/mrh/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py:621: SettingWithCopyWarning: 
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead
  
  See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    self.obj[item_labels[indexer[info_axis]]] = value

-- Docs: http://doc.pytest.org/en/latest/warnings.html
==================== 2 passed, 2 warnings in 167.49 seconds ====================

While the two tests did pass, the two warnings are worrisome. The above message is suggesting a better coding style. Don't you think we should be taking this advice?

  assert candidates.I.sum() == len(candidates)
- noncandidates = extrap_df.loc[extrap_df.I == 0, ]
+ noncandidates = extrap_df.loc[extrap_df.I == 0, ].copy()
Collaborator Author

@martinholmer For some reason, I had to use the copy() method here. I thought that you were safe if you used the loc accessor method instead of doing something like candidates = extrap_df[extrap_df.I == 1].

Does anyone have any thoughts on why I had to use the copy method here?
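
One possible explanation, offered as a sketch rather than a diagnosis of test_extrapolation.py: in the pandas versions of that period, selecting rows with a boolean mask through `.loc` returns a new object, but pandas still remembers that it was derived from the parent DataFrame, so a later assignment into it can raise SettingWithCopyWarning. Calling `.copy()` makes the subset explicitly independent. The frame below is made up for illustration.

```python
import pandas as pd

# Illustrative stand-in for the extrapolation DataFrame.
extrap_df = pd.DataFrame({
    "I": [1, 0, 1, 0],
    "benefit": [100.0, 0.0, 250.0, 0.0],
})

# An explicit copy: later writes clearly target the subset only.
noncandidates = extrap_df.loc[extrap_df["I"] == 0].copy()
noncandidates.loc[:, "benefit"] = 50.0

# Writing through the parent with .loc is the other warning-free option
# when the goal is to change extrap_df itself, for example:
# extrap_df.loc[extrap_df["I"] == 0, "benefit"] = 50.0
```

The choice between the two depends on intent: copy when the subset is a separate working table, write through the parent when the original should change.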

@martinholmer
Contributor

The next thing I did in my review of #133 was to confirm the changes in item 2 of this comment had been made. Item 2 said this:

  2. Standardize the benefit type names as ssi, snap, vet, mcaid, mcare, oasdi

But when I look at the variable names in the new cps_benefits.csv.gz file, I see this:

cps_stage3 mrh$ gunzip -k cps_benefits.csv.gz 

cps_stage3 mrh$ ../../csvvars cps_benefits.csv
1 ssi_benefits_2014
2 medicaid_benefits_2014
3 medicare_benefits_2014
4 vb_benefits_2014
5 snap_benefits_2014
6 ss_benefits_2014
7 RECID
8 ssi_benefits_2015
9 medicaid_benefits_2015
10 medicare_benefits_2015
11 vb_benefits_2015
12 snap_benefits_2015
13 ss_benefits_2015
...
80 ssi_benefits_2027
81 medicaid_benefits_2027
82 medicare_benefits_2027
83 vb_benefits_2027
84 snap_benefits_2027
85 ss_benefits_2027

Why can't we standardize on the names in item 2? Tax-Calculator pull request is using vet, mcare, mcaid, etc. Why do we have to have multiple sets of names for the six benefits in Tax-Calculator data and code?

I don't care one bit about how C-TAM and taxdata did their internal work (as long as it is accurate), but I do care about the data variable names you are proposing to be used in Tax-Calculator.

If you don't like the six benefit names I suggested in my item 2, I would appreciate a conversation in which you suggest a better set of standardized names.

@andersonfrailey @hdoupe @Amy-Xu @MattHJensen

@hdoupe
Collaborator Author

hdoupe commented Jan 9, 2018

@martinholmer Whoops, I overlooked that part. Sorry about that. I'll rename the columns.

@martinholmer
Contributor

@hdoupe said:

Whoops, I overlooked that part. Sorry about that. I'll rename the columns.

Thanks very much!
I really appreciate your efforts on #133, at the same time you're juggling many balls over at PolicyBrain.

@martinholmer
Contributor

martinholmer commented Jan 10, 2018

@hdoupe, I'm not sure what happened in commit fa599e1, but nothing has changed with the variable names in the cps_benefits.csv.gz file being proposed for inclusion in Tax-Calculator.

In Tax-Calculator, the first thing that will be done with the cps_benefits.csv.gz file is to read it into a Pandas DataFrame using the Pandas read_csv function. Something like this:

cps_benefits = pd.read_csv(...)

Then the cps_benefits DataFrame will be used like this:

ssi_benefits = cps_benefits['ssi_benefits_2014']

But this is verbose; it would be much cleaner to write this:

ssi_benefits = cps_benefits['ssi_2014']

So, when the cps_benefits.csv.gz file is written in the taxdata repo, please make the variable names (other than RECID) be of the form <bentype>_<year> where <year> is 2014, 2015, etc., and <bentype> is one of the following six strings:

ssi
snap
vet
mcare
mcaid
oasdi
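
The renaming requested above could be sketched as follows. This is not the rename_columns.py script in the PR. The mapping from old prefixes to new ones (for example vb to vet and ss to oasdi) is inferred from the column listings in this thread, and the names `BENEFIT_NAMES` and `standardize` are hypothetical.

```python
import re

import pandas as pd

# Old prefix -> standardized benefit type (inferred from this thread).
BENEFIT_NAMES = {
    "ssi": "ssi",
    "medicaid": "mcaid",
    "medicare": "mcare",
    "vb": "vet",
    "snap": "snap",
    "ss": "oasdi",
}
OLD_NAME = re.compile(r"^(?P<old>[a-z]+)_benefits_(?P<year>\d{4})$")


def standardize(name, sep="_"):
    """Map '<old>_benefits_<year>' to '<bentype><sep><year>'; leave others alone."""
    match = OLD_NAME.match(name)
    if match is None:
        return name  # e.g. RECID
    return BENEFIT_NAMES[match.group("old")] + sep + match.group("year")


# Illustrative frame with three of the old column names.
old = pd.DataFrame(columns=["ssi_benefits_2014", "vb_benefits_2014", "RECID"])
new = old.rename(columns=standardize)
```

Applying one function to every column, including the first year, keeps all years in the same format by construction.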

@martinholmer
Contributor

@andersonfrailey said yesterday:

I'll leave this open the rest of today for review and merge tomorrow if there are no objections.

Please wait to merge until the variables in the cps_benefits.csv.gz file are renamed as requested in this comment.

@hdoupe
Collaborator Author

hdoupe commented Jan 10, 2018

@martinholmer Please review this and let me know if this is what you are looking for.

@martinholmer
Contributor

@hdoupe, Thanks for all the work renaming the variables in the cps_benefits.csv.gz file being proposed for inclusion in Tax-Calculator. I think we are almost there. Here is what I see:

taxdata$ ./gitpr 133
remote: Counting objects: 144, done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 144 (delta 71), reused 78 (delta 69), pack-reused 64
Receiving objects: 100% (144/144), 86.78 MiB | 14.17 MiB/s, done.
Resolving deltas: 100% (98/98), completed with 7 local objects.
From https://github.com/open-source-economics/taxdata
 * [new ref]         refs/pull/133/head -> pr-133
Switched to branch 'pr-133'
On branch pr-133

taxdata$ cd cps_stage3

cps_stage3$ ls -l
total 20936
-rw-r--r--  1 mrh  staff       287 Feb 12  2017 README.md
-rw-r--r--  1 mrh  staff  10676192 Jan 10 16:55 cps_benefits.csv.gz
-rw-r--r--  1 mrh  staff     13224 Jan 10 16:55 extrapolation.py
-rw-r--r--  1 mrh  staff      2075 Jan 10 16:55 growth_rates.csv
-rw-r--r--  1 mrh  staff      4457 Jan 10 16:55 rename_columns.py
-rw-r--r--  1 mrh  staff      7601 Jan 10 16:55 test_extrapolation.py

cps_stage3$ gunzip -k cps_benefits.csv.gz 

cps_stage3$ ../../csvvars  cps_benefits.csv
1 ssi_benefits_2014
2 mcaid_benefits_2014
3 mcare_benefits_2014
4 vet_benefits_2014
5 snap_benefits_2014
6 oasdi_benefits_2014
7 RECID
8 ssi2015
9 mcaid2015
10 mcare2015
11 vet2015
12 snap2015
13 oasdi2015
14 ssi2016
15 mcaid2016
16 mcare2016
17 vet2016
18 snap2016
19 oasdi2016
20 ssi2017
21 mcaid2017
22 mcare2017
23 vet2017
24 snap2017
25 oasdi2017
26 ssi2018
27 mcaid2018
28 mcare2018
29 vet2018
30 snap2018
31 oasdi2018
32 ssi2019
33 mcaid2019
34 mcare2019
35 vet2019
36 snap2019
37 oasdi2019
38 ssi2020
39 mcaid2020
40 mcare2020
41 vet2020
42 snap2020
43 oasdi2020
44 ssi2021
45 mcaid2021
46 mcare2021
47 vet2021
48 snap2021
49 oasdi2021
50 ssi2022
51 mcaid2022
52 mcare2022
53 vet2022
54 snap2022
55 oasdi2022
56 ssi2023
57 mcaid2023
58 mcare2023
59 vet2023
60 snap2023
61 oasdi2023
62 ssi2024
63 mcaid2024
64 mcare2024
65 vet2024
66 snap2024
67 oasdi2024
68 ssi2025
69 mcaid2025
70 mcare2025
71 vet2025
72 snap2025
73 oasdi2025
74 ssi2026
75 mcaid2026
76 mcare2026
77 vet2026
78 snap2026
79 oasdi2026
80 ssi2027
81 mcaid2027
82 mcare2027
83 vet2027
84 snap2027
85 oasdi2027

I think it is fine to not have the underscore character between the <bentype> and the <year>, but I was hoping the 2014 variable names would be in the same format as the variable names for the subsequent years.

If you bring the 2014 variable names into line with the name format for the other years, everything would be perfect. Thanks for all this work; it's going to make working with the benefit data much easier in Tax-Calculator.
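
A check like the following would catch this kind of mismatch automatically. It is only a sketch: the function name `inconsistent_columns` is hypothetical, and the pattern assumes the no-underscore `<bentype><year>` format accepted above.

```python
import re

# Accepted format: one of the six benefit types followed by a 4-digit year.
EXPECTED = re.compile(r"^(ssi|snap|vet|mcare|mcaid|oasdi)\d{4}$")


def inconsistent_columns(columns, key="RECID"):
    """Return the column names that are neither the key nor '<bentype><year>'."""
    return [name for name in columns
            if name != key and EXPECTED.match(name) is None]


# A few names copied from the listing above.
listing = [
    "ssi_benefits_2014", "mcaid_benefits_2014", "RECID",
    "ssi2015", "mcaid2015", "oasdi2027",
]
bad = inconsistent_columns(listing)
```

Asserting that this list is empty in test_extrapolation.py, or wherever the file is written, would be one way to keep the format from drifting.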

@hdoupe
Collaborator Author

hdoupe commented Jan 11, 2018

@martinholmer The most recent commit should solve this problem. My apologies for submitting work for review that I had not thoroughly reviewed myself.

Thanks for reviewing.

@martinholmer
Contributor

@hdoupe said in pull request #133:

The most recent commit should solve this problem. My apologies for submitting work for review that I had not thoroughly reviewed myself.

No problem; I know you're incredibly busy. Our policy of having other people review a pull request is meant to catch things like this.

The format of this version of the cps_benefits.csv.gz file looks perfect. Thanks for helping out on this.

@andersonfrailey
Collaborator

Thanks for working on this @hdoupe. This looks good to me as well. I'll merge in the morning.

@andersonfrailey andersonfrailey merged commit bc147d2 into PSLmodels:master Jan 12, 2018