Extrapolation routine testing and bug-fixes #133

hdoupe · 2017-12-06T16:00:41Z

Adds tests for the extrapolation routine and fixes bugs. Thanks to @andersonfrailey and @Amy-Xu for noticing the strange-looking results. Hopefully, this resolves some of the distribution issues that we are having.

Bug-fixes:

The indexing scheme was wrong. The idea was to convert the N x 15 matrix of benefits data into an array, do the extrapolation, and convert it back into the same matrix. To do this, you need to keep track of the position of each element in the matrix. The column indexing was created incorrectly and, as a result, the indices were not unique. Thus, the matrix was effectively shuffled when it was flattened and un-flattened. See commit "Update indexing scheme" for the bug-fix.
The average benefit assignment was incorrect. The average benefit was calculated as the target participation divided by the population receiving benefits, which is wrong. The average benefit is now calculated as total benefits divided by the population receiving benefits. See commit "Add unit tests and fix avg benefit assignment bug" for the fix, among other changes.
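
To make the two fixes concrete, here is a minimal sketch, not the code in extrapolation.py. It assumes NumPy, and the function names `ravel_with_index`, `unravel_with_index`, and `average_benefit`, as well as the `weights` argument, are hypothetical. The first two functions show a flatten/restore round trip that works only because every (row, column) pair is unique; the last one computes the average benefit as total benefits divided by the population receiving benefits.

```python
import numpy as np


def ravel_with_index(matrix):
    """Flatten a 2-D array and return the row and column index of each value."""
    n_rows, n_cols = matrix.shape
    # Row-major order: row 0 repeated n_cols times, then row 1, and so on.
    rows = np.repeat(np.arange(n_rows), n_cols)
    # Columns cycle 0..n_cols-1 within each row, so (row, col) pairs are unique.
    cols = np.tile(np.arange(n_cols), n_rows)
    return rows, cols, matrix.ravel()


def unravel_with_index(rows, cols, values, shape):
    """Rebuild the matrix from flattened values and their (row, col) positions."""
    restored = np.empty(shape, dtype=values.dtype)
    restored[rows, cols] = values
    return restored


def average_benefit(benefits, weights):
    """Total (weighted) benefits divided by the (weighted) number of recipients."""
    total_benefits = (benefits * weights).sum()
    recipients = weights[benefits > 0].sum()
    return total_benefits / recipients


# Illustrative data: 4 units by 3 benefit columns (the real matrix is N x 15).
matrix = np.arange(12, dtype=float).reshape(4, 3)
rows, cols, values = ravel_with_index(matrix)

# Reordering the flattened data is harmless as long as the index travels with it.
order = np.random.default_rng(0).permutation(values.size)
restored = unravel_with_index(rows[order], cols[order], values[order], matrix.shape)
round_trip_ok = np.array_equal(restored, matrix)

avg = average_benefit(np.array([100.0, 0.0, 300.0]), np.array([1.0, 5.0, 1.0]))
```

If the column index were not unique, two values would be written to the same cell in `unravel_with_index` and another cell would never be filled, which is one way to end up with the shuffled-looking output described above.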

hdoupe · 2017-12-06T16:05:48Z

Output from running the tests in test_extrapolation.py

Amy-Xu · 2017-12-06T16:11:08Z

👍 for the tests and bug-fixes. I'll add my part of the tests for checking distribution/tabs to this folder today or tomorrow.

hdoupe · 2017-12-06T16:13:46Z

Great, thanks @Amy-Xu

andersonfrailey · 2017-12-06T19:59:32Z

@Amy-Xu will those tests be added on in this PR or a separate one?

Amy-Xu · 2017-12-06T20:02:44Z

@andersonfrailey I'm thinking a separate PR but make sure it's compatible with this one. Any preference?

andersonfrailey · 2017-12-06T20:19:43Z

@Amy-Xu not particularly. Whichever is easiest.

MattHJensen · 2018-01-04T18:25:14Z

What is the status of this PR?

andersonfrailey · 2018-01-08T18:47:55Z

@hdoupe's latest commit removed the few lines of code that drop the rows representing people who receive no benefits in the 10-year window. The resulting cps_benefits_extrap.csv.gz file therefore contains a row for each unit in the CPS file. The original plan was to drop those who don't receive any benefits, but we found that doing that resulted in the benefits being improperly reassigned in Tax-Calculator. We would like to merge this PR now and use the included cps_benefits_extrap.csv.gz in Tax-Calculator, then come back and open subsequent PR's to address that issue.

By not dropping those who receive no benefits, cps_benefits_extra.csv.gz becomes 36.5MB, about 10MB larger than cps.csv.gz and about 2MB larger than cps.csv.gz and cps_weights.csv.gz combined.

If there are no objections to this or any other concerns raised, I'll merge this PR tomorrow morning and update my PR in Tax-Calculator accordingly.

cc @Amy-Xu @MattHJensen @martinholmer

Amy-Xu · 2018-01-08T18:51:05Z

Sounds good to me. Thanks for all the investigation work @hdoupe @andersonfrailey!

martinholmer · 2018-01-08T19:28:26Z

@andersonfrailey said:

The resulting cps_benefits_extrap.csv.gz file therefore contains a row for each unit in the CPS file. The original plan was to drop those who don't receive any benefits, but we found that doing that resulted in the benefits being improperly reassigned in Tax-Calculator. We would like to merge this PR now and use the included cps_benefits_extrap.csv.gz in Tax-Calculator, then come back and open subsequent PR's to address that issue.

Can you explain what “resulted in the benefits being improperly reassigned in Tax-Calculator” means?

andersonfrailey · 2018-01-08T20:07:28Z

@martinholmer asked:

Can you explain what “resulted in the benefits being improperly reassigned in Tax-Calculator” means?

Basically that benefits were being assigned to the wrong tax-unit. We first noticed that something was off when looking at the participation rates by AGI percentile. Under the original method, we got this chart:

Based on talks with @Amy-Xu and previous work with the data, we knew that we should see participation rates fall as you move into higher AGI percentiles, rather than hold relatively steady as was happening with all but SSI.

When we removed the code that dropped those who did not receive benefits and simply replaced all of the benefit data in the CPS with the new values produced in the extrapolation routine, the participation rates looked like this:

Given that the only change was in how we assigned the extrapolated benefits to tax units, we came to the conclusion that somewhere in the process benefits were being given to the wrong tax-unit. We haven't been able to conclude what is causing the improper assignment at this time.
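
For readers who want to reproduce this kind of check, the following is a rough sketch of a participation-rate tabulation by AGI bin. It is not the code used to make the charts; the column names `agi` and `weight`, the function name, and the choice of unweighted quantile bins via `pandas.qcut` are all assumptions made for illustration.

```python
import pandas as pd


def participation_by_agi_bin(df, benefit_col, n_bins=10):
    """Weighted share of units with a positive benefit in each AGI bin."""
    # Assign each unit to an AGI quantile bin (0 = lowest AGI).
    agi_bin = pd.qcut(df["agi"], n_bins, labels=False, duplicates="drop")
    # Weight of each unit that receives the benefit, zero otherwise.
    recipient_weight = df["weight"].where(df[benefit_col] > 0, 0.0)
    return recipient_weight.groupby(agi_bin).sum() / df["weight"].groupby(agi_bin).sum()


# Toy data: only the five lowest-AGI units receive the benefit.
toy = pd.DataFrame({
    "agi": range(10),
    "weight": [1.0] * 10,
    "ssi": [500] * 5 + [0] * 5,
})
rates = participation_by_agi_bin(toy, "ssi", n_bins=2)
```

In the toy data the rate falls from the lower bin to the upper bin, which is the qualitative pattern expected from correctly assigned benefits.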

martinholmer · 2018-01-09T02:14:48Z

@andersonfrailey said in taxdata #133:

@hdoupe's latest commit removed the few lines of code that drop the rows representing people who receive no benefits in the 10-year window. The resulting cps_benefits_extrap.csv.gz file therefore contains a row for each unit in the CPS file. The original plan was to drop those who don't receive any benefits, but we found that doing that resulted in the benefits being improperly reassigned in Tax-Calculator. We would like to merge this PR now and use the included cps_benefits_extrap.csv.gz in Tax-Calculator, then come back and open subsequent PR's to address that issue.

By not dropping those who receive no benefits, cps_benefits_extra.csv.gz becomes 36.5MB, about 10MB larger than cps.csv.gz and about 2MB larger than cps.csv.gz and cps_weights.csv.gz combined.

If there are no objections to this or any other concerns raised, I'll merge this PR tomorrow morning and update my PR [1719] in Tax-Calculator accordingly.

I have an objection. You've glossed over the enormous increase in file size from doing this. When removing those with zero benefits, the cps_benefits_extra.csv.gz file was less than 3MB according to what I see in Tax-Calculator pull request 1719. So, increasing the size of this file from 3MB to 36.5MB almost doubles the size of the taxcalc package.
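
The file sizes quoted in this thread can be put side by side with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it counts just the data files mentioned above, treats "less than 3MB" as 3MB, and the variable names are made up for the example.

```python
# Sizes in MB, taken from the figures quoted in this thread.
benefits_all_units = 36.5        # file with a row for every CPS unit
benefits_recipients_only = 3.0   # upper bound: "less than 3MB"

# The all-units file is about 2MB larger than cps.csv.gz and cps_weights.csv.gz combined.
cps_and_weights = benefits_all_units - 2.0

data_small = cps_and_weights + benefits_recipients_only
data_large = cps_and_weights + benefits_all_units
ratio = data_large / data_small

print(f"recipients only: {data_small:.1f} MB")
print(f"all units:       {data_large:.1f} MB")
print(f"ratio:           {ratio:.2f}")
```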

@andersonfrailey also said:

Given that the only change was in how we assigned the extrapolated benefits to tax units, we came to the conclusion that somewhere in the process benefits were being given to the wrong tax-unit. We haven't been able to conclude what is causing the improper assignment at this time.

I assume what you are referring to when you say "how we assigned the extrapolated benefits to tax units" is the logic in Tax-Calculator pull request 1719. Right?

When I look at the logic changes in the Records class in 1719, I don't see any place where you join (to use the SQL-like method available in Pandas) or merge (to use the alternative Pandas method) the extrapolated benefit data to the basic CPS input data. Now maybe I've missed where you are doing this. If that's so, please point it out to me so that I can review it in an attempt to figure out what's going wrong.

But under the assumption that you haven't done a join or merge of the basic CPS data and the extrapolated data, I suggest you do that in 1719 to see if extrapolated benefits will then be assigned to the correct CPS filing units.
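
As a rough illustration of what such a merge could look like, assuming both files carry a `RECID` column that identifies the filing unit (the benefits file shown in this thread does), here is a minimal pandas sketch. The `agi` column and the toy values are hypothetical, and this is not the logic in pull request 1719.

```python
import pandas as pd

# Basic CPS input data: one row per filing unit (toy values).
cps = pd.DataFrame({"RECID": [1, 2, 3, 4, 5],
                    "agi": [20000, 85000, 12000, 40000, 9000]})

# Benefits file containing only units with positive benefits.
benefits = pd.DataFrame({"RECID": [3, 5],
                         "ssi_2014": [3373, 0],
                         "snap_2014": [1590, 1147]})

# Left merge keeps every CPS unit; validate guards against duplicate RECIDs.
merged = cps.merge(benefits, on="RECID", how="left", validate="one_to_one")

# Units absent from the benefits file receive zero benefits.
ben_cols = [col for col in benefits.columns if col != "RECID"]
merged = merged.fillna({col: 0 for col in ben_cols})
merged = merged.astype({col: "int32" for col in ben_cols})
```

Matching on the key rather than on row position means the result does not depend on the order of rows in either file.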

I don't think this will take much time (relative to how long both this taxdata and the associated Tax-Calculator pull requests have been pending) and, if doing the join or merge is successful, the taxcalc package will be almost half the size it would be if you merge #133 now. I don't see any advantage in merging #133 now and "then come back and open subsequent PR's to address that issue" later. Why not fix this problem now?

Doubling the size of the taxcalc package has an impact on many Tax-Calculator users, most of whom have no interest in the benefits data. It was a clever idea to include only those with positive benefits in the cps_benefits_extra.csv.gz file, so let's make that clever idea work.

@MattHJensen @Amy-Xu

martinholmer · 2018-01-09T15:03:06Z

@andersonfrailey, see this Tax-Calculator 1719 comment on how to assign positive benefits to the correct CPS filing unit.

So, it seems as if you can go back to the approach of including in cps_benefits.csv.gz only filing units with positive benefits. And when you do go back to that approach, you need to do several other things to make the cps_benefits.csv.gz file easier to use:

1. Call the file cps_benefits.csv.gz rather than cps_benefits_extrap.csv.gz
2. Standardize the benefit type names as ssi, snap, vet, mcaid, mcare, oasdi
3. Make all the data in the cps_benefits.csv.gz file be np.int32

With respect to item 3, the version of the file in 1719 has values like these:

   ssi_benefits_2014  medicaid_benefits_2014  medicare_benefits_2014  \
0        3373.058113            49958.448369             1716.598077   
1           0.000000              678.734729                0.000000   
2           0.000000                0.000000             5225.789459   
3           0.000000                0.000000            43071.412692   
4           0.000000                0.000000                0.000000   

   vb_benefits_2014  snap_benefits_2014  ss_benefits_2014  RECID  \
0               0.0         1589.818258      12294.725105      3   
1               0.0            0.000000          0.000000      4   
2               0.0            0.000000      33052.120146      5   
3               0.0            0.000000      22932.972160      6   
4               0.0         1146.853629          0.000000      9

Rounding each benefit to the nearest dollar is sufficient precision given all the assumptions and imputations that have been used in the construction of these benefit amounts.

And using integer data will significantly reduce the size of the cps_benefits.csv.gz file.
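
A small sketch of the conversion, using two of the benefit columns and the first three rows printed above. It assumes pandas and NumPy, and it compares the length of the uncompressed CSV text as a rough proxy for file size.

```python
import numpy as np
import pandas as pd

# First three rows of two benefit columns from the 1719 version of the file.
benefits = pd.DataFrame({
    "ssi_benefits_2014": [3373.058113, 0.0, 0.0],
    "snap_benefits_2014": [1589.818258, 0.0, 0.0],
    "RECID": [3, 4, 5],
})

# Round to the nearest dollar, then store everything as 32-bit integers.
benefits_int = benefits.round(0).astype(np.int32)

# Integer values need fewer characters per cell in the CSV text.
float_chars = len(benefits.to_csv(index=False))
int_chars = len(benefits_int.to_csv(index=False))
```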

@hdoupe @Amy-Xu @MattHJensen

hdoupe · 2018-01-09T20:10:41Z

The latest commits revert back to only saving units who receive benefits at some point in the budget window and save the dataframe values as integers as suggested.

@martinholmer Thanks for the advice on merging the benefit data and the base CPS data and saving the data as integers. The latter advice reduced the file size from 30 MB to 10 MB!

andersonfrailey · 2018-01-09T20:45:29Z

Thanks for your feedback @martinholmer and for working on making the needed changes @hdoupe. I'll leave this open the rest of today for review and merge tomorrow if there are no objections.

martinholmer · 2018-01-09T21:52:32Z

Thanks for the recent changes. The first thing I did in my review was to run py.test test_extrapolation.py.
Here is what I got:

iMac2:cps_stage3 mrh$ py.test test_extrapolation.py 
============================= test session starts ==============================
platform darwin -- Python 2.7.14, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/mrh/work/OSPC/taxdata/cps_stage3, inifile:
plugins: xdist-1.17.1
collected 2 items                                                               

test_extrapolation.py ..

=============================== warnings summary ===============================
test_extrapolation.py::test_add_participants
  /Users/mrh/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py:537: SettingWithCopyWarning: 
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead
  
  See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    self.obj[item] = s
  /Users/mrh/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py:621: SettingWithCopyWarning: 
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead
  
  See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    self.obj[item_labels[indexer[info_axis]]] = value

-- Docs: http://doc.pytest.org/en/latest/warnings.html
==================== 2 passed, 2 warnings in 167.49 seconds ====================

While the two tests did pass, the two warnings are worrisome. The above message is suggesting a better coding style. Don't you think we should be taking this advice?

hdoupe · 2018-01-09T22:07:54Z

cps_stage3/extrapolation.py

            assert candidates.I.sum() == len(candidates)
-            noncandidates = extrap_df.loc[extrap_df.I == 0, ]
+            noncandidates = extrap_df.loc[extrap_df.I == 0, ].copy()


@martinholmer For some reason, I had to use the copy() method here. I thought that you were safe if you used the loc accessor method instead of doing something like candidates = extrap_df[extrap_df.I == 1].

Does anyone have any thoughts on why I had to use the copy method here?
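
One possible explanation, offered as a general description of pandas behavior rather than a confirmed diagnosis of this code: selecting rows with a boolean mask, even through `.loc`, returns a new DataFrame, and pandas versions without copy-on-write still track it as possibly derived from the parent, so a later assignment into it can raise `SettingWithCopyWarning`. Calling `.copy()` makes the independence explicit. A minimal sketch with toy data (the `benefit` column is hypothetical):

```python
import pandas as pd

extrap_df = pd.DataFrame({"I": [1, 0, 1, 0],
                          "benefit": [100, 0, 250, 0]})

# Boolean-mask selection returns new data; .copy() states that intent explicitly,
# so later assignments cannot be mistaken for writes to extrap_df.
noncandidates = extrap_df.loc[extrap_df.I == 0, :].copy()
noncandidates["benefit"] = 50

# The parent frame is unchanged by the assignment above.
parent_benefits = extrap_df["benefit"].tolist()
```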

martinholmer · 2018-01-09T22:15:54Z

The next thing I did in my review of #133 was to confirm the changes in item 2 of this comment had been made. Item 2 said this:

Standardize the benefit type names as ssi, snap, vet, mcaid, mcare, oasdi

But when I look at the variable names in the new cps_benefits.csv.gz file, I see this:

cps_stage3 mrh$ gunzip -k cps_benefits.csv.gz 

cps_stage3 mrh$ ../../csvvars cps_benefits.csv
1 ssi_benefits_2014
2 medicaid_benefits_2014
3 medicare_benefits_2014
4 vb_benefits_2014
5 snap_benefits_2014
6 ss_benefits_2014
7 RECID
8 ssi_benefits_2015
9 medicaid_benefits_2015
10 medicare_benefits_2015
11 vb_benefits_2015
12 snap_benefits_2015
13 ss_benefits_2015
...
80 ssi_benefits_2027
81 medicaid_benefits_2027
82 medicare_benefits_2027
83 vb_benefits_2027
84 snap_benefits_2027
85 ss_benefits_2027

Why can't we standardize on the names in item 2? Tax-Calculator pull request is using vet, mcare, mcaid, etc. Why do we have to have multiple sets of names for the six benefits in Tax-Calculator data and code?

I don't care one bit about how C-TAM and taxdata do their internal work (as long as it is accurate), but I do care about the data variable names you are proposing to be used in Tax-Calculator.

If you don't like the six benefit names I suggested in my item 2, I would appreciate a conversation in which you suggest a better set of standardized names.

@andersonfrailey @hdoupe @Amy-Xu @MattHJensen

hdoupe · 2018-01-09T22:26:16Z

@martinholmer Whoops, I overlooked that part. Sorry about that. I'll rename the columns.

martinholmer · 2018-01-09T22:39:32Z

@hdoupe said:

Whoops, I overlooked that part. Sorry about that. I'll rename the columns.

Thanks very much!
I really appreciate your efforts on #133, at the same time you're juggling many balls over at PolicyBrain.

martinholmer · 2018-01-10T12:40:57Z

@hdoupe, I'm not sure what happened in commit fa599e1, but nothing has changed with the variable names in the cps_benefits.csv.gz file being proposed for inclusion in Tax-Calculator.

In Tax-Calculator, the first thing that will be done with the cps_benefits.csv.gz file is to read it into a Pandas DataFrame using the Pandas read_csv function. Something like this:

cps_benefits = pd.read_csv(...)

Then the cps_benefits DataFrame will be used like this:

ssi_benefits = cps_benefits['ssi_benefits_2014']

But this is verbose; it would be much cleaner to write this:

ssi_benefits = cps_benefits['ssi_2014']

So, when the cps_benefits.csv.gz file is written in the taxdata repo, please make the variable names (other than RECID) be of the form <bentype>_<year> where <year> is 2014, 2015, etc., and <bentype> is one of the following six strings:

ssi
snap
vet
mcare
mcaid
oasdi
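
A minimal sketch of such a renaming, in plain Python. The mapping from the old prefixes to the six requested names is inferred from the column listings in this thread (for example, `vb` to `vet` and `ss` to `oasdi`), and the function name is hypothetical; it is not the contents of rename_columns.py.

```python
import re

# Old prefix -> standardized benefit type (inferred, see the note above).
BENEFIT_NAMES = {
    "ssi": "ssi",
    "snap": "snap",
    "vb": "vet",
    "medicare": "mcare",
    "medicaid": "mcaid",
    "ss": "oasdi",
}

OLD_NAME = re.compile(r"^(?P<prefix>[a-z]+)_benefits_(?P<year>\d{4})$")


def standardize_name(column):
    """Return <bentype>_<year> for an old-style name; leave other names alone."""
    match = OLD_NAME.match(column)
    if match is None or match["prefix"] not in BENEFIT_NAMES:
        return column
    return f"{BENEFIT_NAMES[match['prefix']]}_{match['year']}"


old_columns = ["ssi_benefits_2014", "vb_benefits_2014", "ss_benefits_2027", "RECID"]
new_columns = [standardize_name(col) for col in old_columns]
```

With pandas, the same function can be passed to `DataFrame.rename(columns=standardize_name)` before the file is written.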

martinholmer · 2018-01-10T12:44:13Z

@andersonfrailey said yesterday:

I'll leave this open the rest of today for review and merge tomorrow if there are no objections.

Please wait to merge until the variables in the cps_benefits.csv.gz file are renamed as requested in this comment.

hdoupe · 2018-01-10T21:59:23Z

@martinholmer Please review this and let me know if this is what you are looking for.

martinholmer · 2018-01-10T22:07:06Z

@hdoupe, Thanks for all the work renaming the variables in the cps_benefits.csv.gz file being proposed for inclusion in Tax-Calculator. I think we are almost there. Here is what I see:

taxdata$ ./gitpr 133
remote: Counting objects: 144, done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 144 (delta 71), reused 78 (delta 69), pack-reused 64
Receiving objects: 100% (144/144), 86.78 MiB | 14.17 MiB/s, done.
Resolving deltas: 100% (98/98), completed with 7 local objects.
From https://github.com/open-source-economics/taxdata
 * [new ref]         refs/pull/133/head -> pr-133
Switched to branch 'pr-133'
On branch pr-133

taxdata$ cd cps_stage3

cps_stage3$ ls -l
total 20936
-rw-r--r--  1 mrh  staff       287 Feb 12  2017 README.md
-rw-r--r--  1 mrh  staff  10676192 Jan 10 16:55 cps_benefits.csv.gz
-rw-r--r--  1 mrh  staff     13224 Jan 10 16:55 extrapolation.py
-rw-r--r--  1 mrh  staff      2075 Jan 10 16:55 growth_rates.csv
-rw-r--r--  1 mrh  staff      4457 Jan 10 16:55 rename_columns.py
-rw-r--r--  1 mrh  staff      7601 Jan 10 16:55 test_extrapolation.py

cps_stage3$ gunzip -k cps_benefits.csv.gz 

cps_stage3$ ../../csvvars  cps_benefits.csv
1 ssi_benefits_2014
2 mcaid_benefits_2014
3 mcare_benefits_2014
4 vet_benefits_2014
5 snap_benefits_2014
6 oasdi_benefits_2014
7 RECID
8 ssi2015
9 mcaid2015
10 mcare2015
11 vet2015
12 snap2015
13 oasdi2015
14 ssi2016
15 mcaid2016
16 mcare2016
17 vet2016
18 snap2016
19 oasdi2016
20 ssi2017
21 mcaid2017
22 mcare2017
23 vet2017
24 snap2017
25 oasdi2017
26 ssi2018
27 mcaid2018
28 mcare2018
29 vet2018
30 snap2018
31 oasdi2018
32 ssi2019
33 mcaid2019
34 mcare2019
35 vet2019
36 snap2019
37 oasdi2019
38 ssi2020
39 mcaid2020
40 mcare2020
41 vet2020
42 snap2020
43 oasdi2020
44 ssi2021
45 mcaid2021
46 mcare2021
47 vet2021
48 snap2021
49 oasdi2021
50 ssi2022
51 mcaid2022
52 mcare2022
53 vet2022
54 snap2022
55 oasdi2022
56 ssi2023
57 mcaid2023
58 mcare2023
59 vet2023
60 snap2023
61 oasdi2023
62 ssi2024
63 mcaid2024
64 mcare2024
65 vet2024
66 snap2024
67 oasdi2024
68 ssi2025
69 mcaid2025
70 mcare2025
71 vet2025
72 snap2025
73 oasdi2025
74 ssi2026
75 mcaid2026
76 mcare2026
77 vet2026
78 snap2026
79 oasdi2026
80 ssi2027
81 mcaid2027
82 mcare2027
83 vet2027
84 snap2027
85 oasdi2027

I think it is fine to not have the underscore character between the <bentype> and the <year>, but I was hoping the 2014 variable names would be in the same format as the variable names for the subsequent years.

If you bring the 2014 variable names into line with the name format for the other years, everything would be perfect. Thanks for all this work; it's going to make working with the benefit data much easier in Tax-Calculator.
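
A check along these lines could catch mixed formats before a file is committed. This is only a sketch: the function name is hypothetical, and the pattern assumes the no-underscore `<bentype><year>` form seen in the listing above for 2015 onward.

```python
import re

BENEFIT_TYPES = ("ssi", "snap", "vet", "mcare", "mcaid", "oasdi")
EXPECTED = re.compile(r"^(%s)\d{4}$" % "|".join(BENEFIT_TYPES))


def nonconforming_columns(columns):
    """Return the column names (other than RECID) that do not match <bentype><year>."""
    return [col for col in columns
            if col != "RECID" and EXPECTED.match(col) is None]


columns = ["ssi_benefits_2014", "oasdi_benefits_2014", "RECID", "ssi2015", "mcaid2015"]
bad = nonconforming_columns(columns)
```

An empty result means every benefit column follows the same naming format.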

hdoupe · 2018-01-11T16:00:43Z

@martinholmer The most recent commit should solve this problem. My apologies for submitting work for review that I had not thoroughly reviewed myself.

Thanks for reviewing.

martinholmer · 2018-01-11T22:10:35Z

@hdoupe said in pull request #133:

The most recent commit should solve this problem. My apologies for submitting work for review that I had not thoroughly reviewed myself.

No problem; I know you're incredibly busy. Our policy of having other people review a pull request is meant to catch things like this.

The format of this version of the cps_benefits.csv.gz file looks perfect. Thanks for helping out on this.

andersonfrailey · 2018-01-11T22:58:11Z

Thanks for working on this @hdoupe. This looks good to me as well. I'll merge in the morning.

Henry Doupe added 8 commits December 5, 2017 18:57

Add unit tests and fix avg benefit assignment bug

d1dd81b

Unit test for _repeating_ravel

e86a71a

Add unit test for _unravel_data

8c59370

Fix indexing bug and make test_ravel more robust

cfe7638

Add unit test for ravel_data

89379fa

Remove commented line of code

6b79183

Update indexing scheme

6ae0400

Add comments

0fe617a

andersonfrailey mentioned this pull request Dec 6, 2017

Add Benefit Data and Associated Capabilities PSLmodels/Tax-Calculator#1719

Merged

Amy-Xu mentioned this pull request Dec 6, 2017

Testing CPS tax-unit benefit #135

Open

Henry Doupe added 13 commits December 7, 2017 13:58

Improve pandas dataframe filter method

8c2207d

Improve indexing and index testing

010ab0f

Don't change probability setting

dd39f04

Run to 2027

6704ea1

Add assertion statements on dataframe expectations

76ec1af

Add more rigorous tests to check adding and removing of particpants

50ce2cf

Fix indexing

a833c10

Use pandas stack instead of diy methods

1cc58eb

Refactor extrapolate fn

5f0feeb

Update tests for pd stack usage

7f22455

Decrease loop size

0e2344e

Remove old add/remove tests

1e5ec3d

Add comments and first test at testing non-candidates

caec50a

Henry Doupe added 2 commits January 9, 2018 14:12

Change ouput file name and only keep those who receive benefits

54f2896

Save dataframe as np.int32

43149d3

Copy dataframes

fdabaf9

hdoupe commented Jan 9, 2018

Map ssi to oasdi

fa599e1

Henry Doupe added 3 commits January 10, 2018 16:48

Fix naming scheme

0d3bf85

Fix columns

ed6cd64

Add cps_benefits.csv.gz with updated names

d0bda17

Update base year name and name formatting

743e494

andersonfrailey merged commit bc147d2 into PSLmodels:master Jan 12, 2018

andersonfrailey mentioned this pull request Jan 16, 2018

Final updates to add benefit data to CPS #132

Closed
