
Include Dependent Filers in PUF #261

Merged
13 commits merged on Aug 10, 2018

Conversation

andersonfrailey
Collaborator

This PR is in response to issue #259. It modifies the phase one scripts so that dependent filers are not dropped during the matching process. This bug was introduced in PR #188, when we added the person variable to our partitioning. Once this PR is merged, we will have the same number of dependent filers as the pre-#188 PUF and a similar number of tax units.

I've also reduced the minimum value of XTOT to 0 because dependent filers don't get any exemptions.

One other odd thing I noticed is that the max value for age_spouse ticked up to 99, despite no unit with DSI == 1 having an age_spouse value equal to 99. This makes me think PR #188 affected the matching process in more ways than just dropping dependent filers. This PR should take care of any such issues that have so far gone undiscovered.

@martinholmer

@martinholmer
Contributor

@andersonfrailey, I'm currently testing #261 on my computer after incorporating the recent changes on the master branch. I'll let you know how things go.

@martinholmer
Contributor

@andersonfrailey, I re-created pr-261 on my computer, removed the puf_data/cps-matched-puf.csv file, and executed make puf_data/puf.csv. Then I ran just the test_pufcsv test and got different results from what you have at the tip of your PR #261. I'm not sure at which step the differences are arising.

Here is the raw file info on my computer:

iMac2:Matching mrh$ ls -l *csv
-rw-r--r--@ 1 mrh  staff  296097633 Jul 24 09:20 cpsmar2016.csv
-rw-r--r--@ 1 mrh  staff   91305290 Jul 23 17:44 puf2011.csv
iMac2:Matching mrh$ md5 *csv
MD5 (cpsmar2016.csv) = e15da5b3fed60db7ec9ebae4fe59178d
MD5 (puf2011.csv) = 75f71a76baf2dedaa647c1baeee1b9ff

And here is the info on the derived files:

iMac2:puf_data mrh$ ls -l cps-matched-puf.csv
-rw-r--r--  1 mrh  staff  363711700 Jul 25 15:15 cps-matched-puf.csv
iMac2:puf_data mrh$ md5 cps-matched-puf.csv 
MD5 (cps-matched-puf.csv) = a36418c7a626b06df52c0a4b23878ceb

iMac2:puf_data mrh$ ls -l puf.csv
-rw-r--r--@ 1 mrh  staff  54343524 Jul 25 15:16 puf.csv
iMac2:puf_data mrh$ md5 puf.csv
MD5 (puf.csv) = 33797d7ae7fc098fc8df468de7c17ce5

Assuming the raw info matches what you have on your computer, which of the derived files is different on your computer? Is the cps-matched-puf.csv info the same?

And finally, what version of python, numpy, and pandas are you using?
I'm using python 2.7.15, numpy 1.14.5, and pandas 0.23.3
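
For the record, here is a quick Python snippet (just a convenience sketch, not part of the taxdata scripts) for printing the relevant versions:

import sys
import numpy
import pandas

# Print the interpreter and package versions so runs can be compared
# across computers.
print(sys.version)
print("numpy", numpy.__version__)
print("pandas", pandas.__version__)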

@martinholmer
Contributor

martinholmer commented Jul 25, 2018

This comment reports on RECID==1 differences between the puf.csv file derived on @andersonfrailey's computer under PR#261 and the puf.csv file derived on @martinholmer's computer under PR#261.

The following displays of non-zero variables for the first filing unit were generated in the puf_data subdirectory with this command pipeline:

$ ~/work/OSPC/tax-calculator/csv_show.sh puf-AF.csv 1 | awk '{print $2,$3}' | sort > AF-1
$ ~/work/OSPC/tax-calculator/csv_show.sh puf.csv 1 | awk '{print $2,$3}' | sort > MH-1
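
Roughly the same display can be produced directly with pandas (a sketch, assuming puf.csv is in the current directory):

import pandas as pd

# Print the non-zero variables for RECID 1, sorted by variable name,
# which is roughly what the csv_show.sh | awk | sort pipeline produces.
puf = pd.read_csv("puf.csv")
unit = puf.loc[puf["RECID"] == 1].squeeze()
for name, value in unit[unit != 0].sort_index().items():
    print(name, value)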

Here are all the non-zero variables from the AF-1 file:

DSI 1
FLPDYR 2009
MARS 1
RECID 1
age_head 17
agi_bin 1
e00200 3040
e00200p 3040
filer 1
fips 39
nu05 1
nu13 1
nu18 1
s006 147096

And now the non-zero variables from the MH-1 file:

DSI 1
FLPDYR 2009
MARS 1
RECID 1
age_head 20
agi_bin 1
e00200 3040
e00200p 3040
filer 1
fips 39
n1820 1
s006 147096

And here are the differences (with AF-1 values on the left) as shown in a graphical diff utility:

[screenshot: graphical diff of the AF-1 and MH-1 values]

It looks as if the PUF variables are the same, but the CPS-imputed variables (the ages) are different.
In particular, I don't see how (in the AF-1 results) a filing unit with DSI==1 and MARS==1 who is age 17 can have nu05==1 and nu13==1 and nu18==1.
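
A quick way to scan a puf.csv for units like that is a pandas filter along these lines (a sketch using the same variable names):

import pandas as pd

puf = pd.read_csv("puf.csv")
# Single dependent filers aged 17 or under who nonetheless have young
# dependent counts of their own -- the pattern questioned above.
odd = puf[(puf["DSI"] == 1) & (puf["MARS"] == 1) & (puf["age_head"] <= 17) &
          ((puf["nu05"] > 0) | (puf["nu13"] > 0) | (puf["nu18"] > 0))]
print(len(odd))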

I'm tabulating a puf.csv recently sent by @andersonfrailey that has this info:

iMac2:puf_data mrh$ ls -l puf-AF.csv 
-rw-r--r--@ 1 mrh  staff  54341907 Jul 24 09:27 puf-AF.csv
iMac2:puf_data mrh$ md5 puf-AF.csv 
MD5 (puf-AF.csv) = b0b38d8c419f13c51338ca858b260e65

@andersonfrailey
Collaborator Author

@martinholmer I ran everything with the same specs you're using and my results didn't change. I checked the MD5 on the puf.csv I generated and it's different from the last one I sent you, but consistent with previously generated PUFs. These are the numbers I got for the variables you're comparing:

DSI - 1
FLPDYR - 2009
MARS - 1
RECID - 1
age_head - 20
agi_bin - 1
e00200 - 3040
e00200p - 3040
filer - 1
fips - 39
n1820 - 1
nu05 - 0
nu13 - 0
nu18 - 0
s006 - 147096

I will send you this file over email.

@martinholmer
Contributor

@andersonfrailey, Thanks for your newest puf.csv file, which I've named puf-AF.csv on my computer.
I compared that file with what I generate using the code in your PR #261, which I call puf-MH.csv.

These two files are different even though they have the same number of rows:

iMac2:puf_data mrh$ ls -l puf-??.csv
-rw-r--r--@ 1 mrh  staff  54343335 Jul 25 19:28 puf-AF.csv
-rw-r--r--  1 mrh  staff  54343524 Jul 25 20:01 puf-MH.csv

iMac2:puf_data mrh$ awk -F, 'NR>1{n++}END{print n}' puf-AF.csv 
248591
iMac2:puf_data mrh$ awk -F, 'NR>1{n++}END{print n}' puf-MH.csv 
248591

The first record with RECID==1 is exactly the same in the two files, but many other records are different.
Here is a bash script --- called extract.sh --- that uses the Tax-Calculator csv_show.sh utility to quickly compare the filing units with the same RECID in the two files (by comparing non-zero variable values that are sorted by variable name):

#!/bin/bash
~/work/OSPC/tax-calculator/csv_show.sh puf-AF.csv $1 | awk '{print $2,$3}' | sort > AF-$1
~/work/OSPC/tax-calculator/csv_show.sh puf-MH.csv $1 | awk '{print $2,$3}' | sort > MH-$1
diff AF-$1 MH-$1

There is no output from the command ./extract.sh 1 because the variable values for RECID==1 are the same in the two files. But there is difference output for ./extract.sh 2, as shown below:

iMac2:puf_data mrh$ ./extract.sh 2
5c5
< age_head 78
---
> age_head 75
17c17
< fips 6
---
> fips 55
19c19
< s006 16546
---
> s006 127191

In the above output, the < values are from the puf-AF.csv file and the > values are from the puf-MH.csv file.

More interesting is the filing unit with RECID==28.

iMac2:puf_data mrh$ ./extract.sh 28
6c6
< age_head 37
---
> age_head 61
11c11,12
< fips 47
---
> fips 48
> n1820 1
14,16c15
< nu13 1
< nu18 1
< s006 20987
---
> s006 111837

Using a graphical diff program makes it easier to see the differences in context. Here is the screen shot with the AF values on the left and the MH values on the right.

[screenshot: graphical diff with the AF values on the left and the MH values on the right]

As you can see, the ages (of both the taxpayer and the dependent), the fips code, and the sampling weight all differ between the two files, the weight by quite a bit. All the other variable values are the same. Is it possible that the matching of a CPS record to each PUF record is being done differently on the two computers? The fact that the puf.csv file changed when you ran it today (with the output for RECID==1 being different from what it was yesterday) is suggestive. Perhaps something in the Matching logic is unstable, with output changing from run to run on the same computer and changing across computers. Although maybe not, because my run yesterday produced a puf.csv file with the same MD5 checksum as the puf.csv file I generated today.

I don't see any way to solve this troubling problem without debugging the Matching logic, which contains plenty of concat calls that have no sort= parameter. Do you see a simpler way to proceed?
(Let me know if you want the puf-MH.csv file on your computer.)
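
As an aside, pinning the sort= argument in those concat calls would at least make the column handling explicit, although it would not by itself fix any row-order instability. A minimal sketch of what that looks like:

import pandas as pd

a = pd.DataFrame({"x": [1], "z": [2]})
b = pd.DataFrame({"x": [3], "y": [4]})
# sort=False keeps the non-concatenation axis in first-seen order instead of
# relying on the version-dependent default (and avoids the pandas 0.23
# FutureWarning about unaligned columns).
combined = pd.concat([a, b], sort=False)
print(combined)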

@martinholmer
Contributor

@andersonfrailey, I'm puzzled by the values of age_head and age_spouse in the puf.csv file you sent me yesterday. Maybe the results below are to be expected, but given my meager knowledge of the PUF Matching logic, I find these results strange. Why are there so many old spouses relative to the age range of heads?

Here are my tabulations:

iMac2:puf_data mrh$ ~/work/OSPC/tax-calculator/csv_vars.sh puf-AF.csv | grep -e age -e RECID -e MARS
2 age_head
3 age_spouse
15 MARS
74 RECID

iMac2:puf_data mrh$ awk -F, '{print $74,$15,$2,$3}' puf-AF.csv > ages

iMac2:puf_data mrh$ head ages
RECID MARS age_head age_spouse
1 1 20 0
2 1 78 0
3 1 75 0
4 3 50 0
5 3 44 0
6 3 59 0
7 4 38 0
8 4 37 0
9 2 48 51

iMac2:puf_data mrh$ tail ages
248582 2 80 81
248583 2 31 35
248584 1 80 0
248585 2 78 79
248586 2 66 60
248587 1 76 0
248588 1 62 0
248589 1 61 0
248590 1 24 0
248591 1 85 0

iMac2:puf_data mrh$ awk '$2==2&&$4==0' ages

iMac2:puf_data mrh$ awk 'NR>1{n++}END{print n}' ages
248591

iMac2:puf_data mrh$ awk 'NR>1{n[$3]++}END{for(a=80;a<=120;a++)if(n[a]>0)print a,n[a]}' ages
80 4607
85 4502

iMac2:puf_data mrh$ awk 'NR>1{n[$4]++}END{for(a=80;a<=120;a++)if(n[a]>0)print a,n[a]}' ages
80 571
81 533
82 447
83 326
84 309
85 239
86 205
87 162
88 137
89 104
90 69
91 34
92 23
93 22
94 8
95 7
96 2
97 3
100 1

Why the stark difference in the distribution of older ages? Why are there spouse ages in the 90s when the largest head age is either 80 or 85?

@martinholmer
Contributor

@andersonfrailey, I don't understand the details of the PUF Matching logic, but I find it strange that there are so many records in the cps-matched-puf.csv file you sent me yesterday that have the same recid value. There are over sixty thousand recid values that appear more than once in that file, with at least one appearing five times. Is this to be expected? If so, can you explain why that is to be expected?

Here are the results from my tabulations:

iMac2:puf_data mrh$ awk -F, 'NR>1{n++}END{print n}' cps-matched-puf-AF.csv 
248591

iMac2:puf_data mrh$ ~/work/OSPC/tax-calculator/csv_vars.sh cps-matched-puf-AF.csv | grep recid
328 recid

iMac2:puf_data mrh$ awk -F, 'NR>1{n[$328]++}END{for(i in n)if(n[i]>1){t++;print i,n[i]};print "num_duplicate_recids=",t}' cps-matched-puf-AF.csv | tail
45156 2
45159 2
146879 2
107829 3
133410 3
133412 2
133414 3
26553 2
129660 5
num_duplicate_recids= 62094
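
Here is the same tabulation done in pandas as a cross-check (a sketch against the same file):

import pandas as pd

# Count how many recid values appear more than once in the matched file.
matched = pd.read_csv("cps-matched-puf-AF.csv")
counts = matched["recid"].value_counts()
dups = counts[counts > 1]
print("num_duplicate_recids =", len(dups))
print(dups.sort_values(ascending=False).head())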

Have I done something wrong? Or can you confirm these results?

@andersonfrailey
Collaborator Author

@martinholmer I'll try to recreate your results. What version of statsmodels are you using? That's the only other package I could think of that affects the match. I've also asked @hdoupe to run the matching using PR 261 to see how his results compare to what we're getting.

@hdoupe
Collaborator

hdoupe commented Jul 26, 2018

I've also asked @hdoupe to run the matching using PR 261 to see how his results compare to what we're getting.

I'm setting things up now. I'll get back to you when it's run.

@martinholmer
Contributor

@andersonfrailey said:

I've also asked @hdoupe to run the matching using PR 261 to see how his results compare to what we're getting.

And @hdoupe said:

I'm setting things up now. I'll get back to you when it's run.

This is an excellent idea.
It could be that I'm doing something wrong or my computer is broken or something else.
Good to have a third result.

@martinholmer
Contributor

martinholmer commented Jul 26, 2018

@andersonfrailey asked:

What version of statsmodels are you using?
That's the only other package I could think of that affects the match.

Excellent point!
I didn't know the Matching logic used that package and it isn't listed in the environment.yml file.

I'm using (what I understand to be the newest) version 0.9.0 as you can see from this:

iMac2:taxdata mrh$ conda list statsmodels
# packages in environment at /Users/mrh/anaconda:
#
# Name                    Version                   Build  Channel
statsmodels               0.9.0            py27h917ab60_0  

@andersonfrailey
Collaborator Author

Ah, I didn't realize statsmodels wasn't in the environment file. I'll add it to the list in this PR. I've also updated from version 0.8.0 to 0.9.0 and will re-run my scripts. At this point we should be working in the exact same environment.

@andersonfrailey
Collaborator Author

Update. I updated to statsmodels 0.9.0 and my resulting PUF and cps-matched-puf files did not change.

@martinholmer
Contributor

@andersonfrailey said:

Update. I updated to statsmodels 0.9.0 and my resulting PUF and cps-matched-puf files did not change.

OK, thanks for the update. What sort of results is @hdoupe getting?

@andersonfrailey
Collaborator Author

It looks like @hdoupe got different results than both of us. @martinholmer can you send me your files just so I can be sure?

@martinholmer
Contributor

@andersonfrailey said:

It looks like @hdoupe got different results than both of us.

That suggests my hunch that the Matching logic is (for some reason) unstable might be correct.

@andersonfrailey asked:

Can you send me your files just so I can be sure?

Yes, but let me "make puf_data/puf.csv" under PR#261 once more from scratch, just to make sure.

@hdoupe
Collaborator

hdoupe commented Jul 26, 2018

I'm re-creating everything in a docker container to see if there is some kind of pre-existing environment issue in our systems that is throwing things off.

@martinholmer
Contributor

@hdoupe said:

I'm re-creating everything in a docker container to see if there is some kind of pre-existing environment issue in our systems that is throwing things off.

Thanks. At this stage we have no clue about what's causing different results on different computers, so your docker work might be able to help us determine the cause(s) of the cross-computer differences.

@martinholmer
Contributor

@andersonfrailey, Thanks for the updates to pull request #261. Now what I get on my Toronto iMac running Python 2.7 is much closer to what you get on your DC computer running Python 3.6 (which is what's in the latest version of the pull request, right?).

The sizes of the two generated files in the puf_data subdirectory (versus the sizes of the AF261 files you sent me via email) are as follows:

iMac:puf_data mrh$ ls -l puf*csv
-rw-r--r--@ 1 mrh  staff  54341314 Jul 31 09:21 puf-AF261.csv
-rw-r--r--@ 1 mrh  staff  54341511 Jul 31 10:06 puf.csv

iMac:puf_data mrh$ ls -l cps-matched-puf*csv
-rw-r--r--@ 1 mrh  staff  363698425 Jul 31 09:20 cps-matched-puf-AF261.csv
-rw-r--r--  1 mrh  staff  363698578 Jul 31 10:06 cps-matched-puf.csv

Using the PUF-related files (the puf_weights.csv.gz and puf_ratios.csv files are also different) generated under PR #261 on my computer causes just one test failure:

AssertionError: Number of records where n24 > nu18 has changed
E           assert 14910 == 14911

But many of the records are different. So, for example, six of the first ten records are different (but only in the ages, weights, and FIPS code):

iMac:puf_data mrh$ ./extract.sh 1
iMac:puf_data mrh$ ./extract.sh 2
5c5
< age_head 71
---
> age_head 85
17c17
< fips 12
---
> fips 21
19c19
< s006 126255
---
> s006 17482
iMac:puf_data mrh$ ./extract.sh 3
5c5
< age_head 85
---
> age_head 71
17c17
< fips 21
---
> fips 12
19c19
< s006 17482
---
> s006 126255
iMac:puf_data mrh$ ./extract.sh 4
5c5
< age_head 53
---
> age_head 23
12c12
< fips 54
---
> fips 49
14c14
< s006 27872
---
> s006 139730
iMac:puf_data mrh$ ./extract.sh 5
5c5
< age_head 23
---
> age_head 53
12c12
< fips 49
---
> fips 54
14c14
< s006 139730
---
> s006 27872
iMac:puf_data mrh$ ./extract.sh 6
iMac:puf_data mrh$ ./extract.sh 7
iMac:puf_data mrh$ ./extract.sh 8
iMac:puf_data mrh$ ./extract.sh 9
5c5
< age_head 19
---
> age_head 17
10,12c10,12
< fips 4
< n1820 1
< s006 63827
---
> fips 29
> nu18 1
> s006 12242
iMac:puf_data mrh$ ./extract.sh 10
5c5
< age_head 17
---
> age_head 19
10,12c10,12
< fips 29
< nu18 1
< s006 12242
---
> fips 4
> n1820 1
> s006 63827
iMac:puf_data mrh$ 

@martinholmer
Contributor

martinholmer commented Jul 31, 2018

@andersonfrailey, Here is the BASH script I used to generate the differences in my recent #261 comment:

#!/bin/bash
~/work/OSPC/tax-calculator/csv_show.sh puf-AF261.csv $1 | awk '{print $2,$3}' | sort > AF-$1
~/work/OSPC/tax-calculator/csv_show.sh puf.csv $1 | awk '{print $2,$3}' | sort > MH-$1
diff AF-$1 MH-$1

So, you can see that the < results are from the puf-AF261.csv file you sent me the other day
and the > results are from the puf.csv file I generated on my computer today.

@martinholmer
Contributor

@andersonfrailey, I upgraded all the versioned packages in the taxdata/environment.yml file, so that this is my taxdata-relevant environment:

iMac:~ mrh$ conda list | grep -e python -e numpy -e pandas -e scipy -e statsmodels -e pulp | grep -v ipython | grep -v numpydoc | grep -v python-dateutil | grep -v python.app | grep -v msgpack
numpy                     1.15.0           py27h648b28d_0  
numpy-base                1.15.0           py27h8a80b8c_0  
pandas                    0.23.3           py27h6440ff4_0  
pulp                      1.6.8                    py27_0    conda-forge
python                    2.7.15               h138c1fe_0  
scipy                     1.1.0            py27hf1f7d93_0  
statsmodels               0.9.0            py27h1d22016_0  

Then on my local pr-261 branch, I removed puf_data/cps-matched-puf.csv and ran make all. Here are the puf files I get:

iMac:puf_data mrh$ ls -l puf*csv
-rw-r--r--@ 1 mrh  staff  54341314 Jul 31 09:21 puf-AF261.csv
-rw-r--r--@ 1 mrh  staff  54341237 Jul 31 15:14 puf.csv
iMac:puf_data mrh$ ls -l cps-matched-puf*csv
-rw-r--r--@ 1 mrh  staff  363698425 Jul 31 09:20 cps-matched-puf-AF261.csv
-rw-r--r--  1 mrh  staff  363698173 Jul 31 15:14 cps-matched-puf.csv

Notice that the byte sizes of the cps-matched-puf.csv and the puf.csv files are different than they were before I did all the package upgrades. The python version was unchanged, but the numpy and pandas versions are definitely higher.

So, it would seem that the results generated (even on the same computer) differ depending on the taxdata-relevant package versions. What versions of these packages are you using?

python
numpy
pandas
scipy
statsmodels

The pulp package is not used to derive the puf.csv file, right?

When I run the taxdata tests in the presence of the newly-generated-on-my-computer PUF-related files, I get a test failure with these differences:

iMac:taxdata mrh$ cd tests
iMac:tests mrh$ diff puf_agg_actual.txt puf_agg_expected.txt 
3,4c3,4
< EIC                           40224              0              3
< FLPDYR                    499892412           2008           2011
---
> EIC                           40222              0              3
> FLPDYR                    499892408           2008           2011
9,11c9,11
< age_head                   11692950              1             85
< age_spouse                  6248316              0             98
< agi_bin                     1890338              0             18
---
> age_head                   11692928              1             85
> age_spouse                  6247198              0             98
> agi_bin                     1890352              0             18
13,14c13,14
< e00200                  39424567221              0       56530000
< e00200p                 23631758728              0       28265000
---
> e00200                  39424617851              0       56530000
> e00200p                 23631809358              0       28265000
16c16
< e00300                   3551349664              0       29460000
---
> e00300                   3551349956              0       29460000
18,23c18,23
< e00600                   5858835742              0       37050000
< e00650                   4446289654              0       37050000
< e00700                    462667508              0        2874000
< e00800                     10567957              0         174000
< e00900                   2866075750      -29990000       18990000
< e00900p                  2428856622      -29990000       18990000
---
> e00600                   5858835862              0       37050000
> e00650                   4446289674              0       37050000
> e00700                    462666508              0        2874000
> e00800                     10563157              0         174000
> e00900                   2866015720      -29990000       18990000
> e00900p                  2428796592      -29990000       18990000
25c25
< e01100                      5588110              0         282700
---
> e01100                      5588112              0         282700
27,30c27,30
< e01400                   1716723702              0        6424000
< e01500                   7721466127              0       47380000
< e01700                   1437052591              0        3528000
< e02000                  19723922872      -40760000       65100000
---
> e01400                   1716794302              0        6424000
> e01500                   7721465937              0       47380000
> e01700                   1437052401              0        3528000
> e02000                  19723942222      -40760000       65100000
34c34
< e02300                    131634418              0         100000
---
> e02300                    131573688              0         100000
37c37
< e03210                     12875268              0           2500
---
> e03210                     12874538              0           2500
47c47
< e07260                      9691984              0         280500
---
> e07260                      9692134              0         280500
53c53
< e09900                     15261706              0         339900
---
> e09900                     15268146              0         339900
55,61c55,61
< e17500                    250918650              0         772900
< e18400                   5897742014              0       15160000
< e18500                   1174858020              0         578400
< e19200                   1883726248              0        5127000
< e19800                   2428214004              0       30100000
< e20100                   1074711810              0       29580000
< e20400                   1466074810              0       10850000
---
> e17500                    250933860              0         772900
> e18400                   5897744654              0       15160000
> e18500                   1174881500              0         578400
> e19200                   1883732478              0        5127000
> e19800                   2428214644              0       30100000
> e20100                   1074725410              0       29580000
> e20400                   1466075290              0       10850000
64c64
< e26270                  17410580974      -40760000       64980000
---
> e26270                  17410583224      -40760000       64980000
69c69
< e87521                     39943678              0          10000
---
> e87521                     39946938              0          10000
75c75
< fips                        6757964              1             56
---
> fips                        6758114              1             56
77c77
< k1bx14p                 -2198650066      -18990000        2341800
---
> k1bx14p                 -2198649896      -18990000        2341800
86,88c86,88
< p22250                   -601721904     -124900000       39410000
< p23250                  23321572662      -28160000       91220000
< s006                    16355017356              0        1043269
---
> p22250                   -601721906     -124900000       39410000
> p23250                  23321563542      -28160000       91220000
> s006                    16355017354              0        1043269

@andersonfrailey
Collaborator Author

@martinholmer, yes. I'll try the rounding on my machine in both 2.7 and 3.6 environments, and if the results are the same I'll push the changes so you can try it.

@andersonfrailey
Collaborator Author

andersonfrailey commented Aug 8, 2018

After converting all yhat values to integers, I was able to produce the same file using both Python 2.7 and 3.6. I've pushed my work so you can test it on your machine, @martinholmer. I'll send the resulting PUF to you in an email.
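
For context, the change amounts to something like this (a sketch; the actual frame and column handling in the phase-two script differs a bit):

import pandas as pd

def integerize_yhat(df):
    # Round the predicted values and cast to int so that tiny floating-point
    # differences across platforms cannot change the subsequent sort order.
    out = df.copy()
    out["yhat"] = out["yhat"].round().astype(int)
    return out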

Also, I haven't run all of the make scripts yet. I'll wait for you to have a chance to create the PUF before I do that, and then I'll push all the weights up.

@martinholmer
Contributor

@andersonfrailey, Thanks for the change in commit 4cf9bc9. This change eliminated most, but not all, of the differences between what you get on your computer and what I get on my computer. Now the puf.csv files have the same number of bytes and only 36 records are different (out of 248,591 records in total).

Below I show what I did on my computer, which concludes by showing the first set of differences for RECID=211976 and RECID=211977. This pair of records seems to exhibit the same sorting error as before (except that the spouse_age seems non-symmetrical). So, it would seem the rounding to the nearest integer dollar eliminated most of the differences, but a few remain. Perhaps rounding to the nearest ten dollars and converting that rounded float to an integer would eliminate the remaining 36 differences.

iMac:taxdata mrh$ git branch
  add-benefit-factors
* master

iMac:taxdata mrh$ ./gitpr 261
remote: Counting objects: 68, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 68 (delta 35), reused 38 (delta 35), pack-reused 29
Unpacking objects: 100% (68/68), done.
From https://github.com/open-source-economics/taxdata
 * [new ref]         refs/pull/261/head -> pr-261
Switched to branch 'pr-261'
On branch pr-261

iMac:taxdata mrh$ rm puf_data/cps-matched-puf.csv

iMac:taxdata mrh$ make puf_data/puf.csv
cd ./puf_data/StatMatch/Matching ; python runmatch.py
Reading CPS Data from .CSV
Reading PUF Data
Creating CPS Tax Units
100%|████████████████████████████████████| 69484/69484 [06:48<00:00, 169.94it/s]
CPS Tax Units Created
Adjustment Complete
Start Phase One
Start Phase Two
100%|███████████████████████████████████████████| 23/23 [00:08<00:00,  2.73it/s]
Creating final file
cd puf_data ; python finalprep.py

iMac:taxdata mrh$ cd puf_data/

iMac:puf_data mrh$ ls -l puf.csv puf-AFnewest.csv 
-rw-r--r--@ 1 mrh  staff  54339793 Aug  8 19:07 puf-AFnewest.csv
-rw-r--r--  1 mrh  staff  54339793 Aug  8 19:24 puf.csv

iMac:puf_data mrh$ md5 puf.csv
MD5 (puf.csv) = 9af97ee9e70ce674867e836599878ce4

iMac:puf_data mrh$ md5 puf-AFnewest.csv
MD5 (puf-AFnewest.csv) = 8ba210d7ac30ad5bc8a23dd99f18fd13

iMac:puf_data mrh$ diff puf.csv puf-AFnewest.csv | awk '$1~/</{n++}END{print n}'
36

iMac:puf_data mrh$ ./extract.sh 212089
5,6c5,6
< age_head 69
< age_spouse 68
---
> age_head 62
> age_spouse 61
23c23
< fips 37
---
> fips 53
26c26
< s006 101546
---
> s006 45325

iMac:puf_data mrh$ ./extract.sh 212090
5,6c5,6
< age_head 62
< age_spouse 59
---
> age_head 69
> age_spouse 66
23c23
< fips 53
---
> fips 37
26c26
< s006 45325
---
> s006 101546

@andersonfrailey
Collaborator Author

@martinholmer, rounding further might help. I also recently came across this in the sort_values documentation:

kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort
is the only stable algorithm. For DataFrames, this option is only applied when sorting
on a single column or label.

I did a little reading on what it means for a sorting algorithm to be stable. The gist is that with a stable sorting algorithm, two elements with the same value end up in the same order in which they appear in the input. With an unstable algorithm, that may not be the case. I haven't found anything yet on how an unstable algorithm handles elements with the same value, but do you think it's possible that, because we're using an unstable sorting algorithm, a few elements with equal yhat values end up being sorted differently on our computers?
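
Here is a tiny numpy illustration of the difference (ties keep their input order only under the stable mergesort):

import numpy as np

yhat = np.array([0.5, 0.3, 0.5, 0.3])
# With quicksort the relative order of the tied values is not guaranteed.
print(np.argsort(yhat, kind="quicksort"))
# With mergesort the result is always [1 3 0 2]: ties keep their input order.
print(np.argsort(yhat, kind="mergesort"))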

@martinholmer
Contributor

@andersonfrailey said:

rounding further might help. I also recently came across this in the DataFrame.sort_values documentation:

kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort
is the only stable algorithm. For DataFrames, this option is only applied when sorting
on a single column or label.

I did a little reading on what it means for a sorting algorithm to be stable. The gist is that with a stable sorting algorithm, two elements with the same value end up in the same order in which they appear in the input. With an unstable algorithm, that may not be the case. I haven't found anything yet on how an unstable algorithm handles elements with the same value, but do you think it's possible that, because we're using an unstable sorting algorithm, a few elements with equal yhat values end up being sorted differently on our computers?

Yes, I think it is quite possible. (Excellent research to find this documentation!)
And more to the point, it is very easy to experiment with this hypothesis.
Just switch all the sort_values calls to use mergesort and see if we get the same results on our two computers.

Here are the several places in the Matching subdirectory that sort_values is being called:

grep  -nH -e sort_values *py
add_cps_vars.py:27:    match.sort_values(['cpsseq'], inplace=True)
add_cps_vars.py:29:    merge_1.sort_values(['soiseq'], inplace=True)
cps_rets.py:79:            household = household.sort_values('a_lineno')
phase2.py:33:            soi = soi.sort_values('yhat')
phase2.py:34:            cps = cps.sort_values('yhat')
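
The change itself is minimal, something like this for each of those calls (sketch):

import pandas as pd

df = pd.DataFrame({"yhat": [0.5, 0.3, 0.5]})
# mergesort is the only stable option, so rows with equal yhat values keep
# their original relative order on every platform.
df = df.sort_values("yhat", kind="mergesort")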

@andersonfrailey
Collaborator Author

Thanks for finding all the places we use sorting, @martinholmer. I'm re-running the matching scripts now.

@andersonfrailey
Collaborator Author

I've pushed the edits needed to use mergesort each time we call sort_values.

@martinholmer
Contributor

@andersonfrailey, Thanks for commit b3fc924 that switches to using the stable mergesort algorithm in all five sort_values calls. Here is what I get using the newest version of PR #261:

iMac:taxdata mrh$ git branch
  add-benefit-factors
  master
* pr-261

iMac:taxdata mrh$ rm puf_data/cps-matched-puf.csv 

iMac:taxdata mrh$ make puf_data/puf.csv
cd ./puf_data/StatMatch/Matching ; python runmatch.py
Reading CPS Data from .CSV
Reading PUF Data
Creating CPS Tax Units
100%|████████████████████████████████████| 69484/69484 [06:39<00:00, 173.92it/s]
CPS Tax Units Created
Adjustment Complete
Start Phase One
Start Phase Two
100%|███████████████████████████████████████████| 23/23 [00:08<00:00,  2.69it/s]
Creating final file
cd puf_data ; python finalprep.py

iMac:taxdata mrh$ cd puf_data

iMac:puf_data mrh$ ls -l puf.csv puf-AFmsort.csv 
-rw-r--r--@ 1 mrh  staff  54341028 Aug  9 12:03 puf-AFmsort.csv
-rw-r--r--  1 mrh  staff  54341028 Aug  9 12:17 puf.csv

iMac:puf_data mrh$ md5 puf.csv puf-AFmsort.csv 
MD5 (puf.csv) = b64b90884406dfcff85f2ac9ba79f9bf
MD5 (puf-AFmsort.csv) = b64b90884406dfcff85f2ac9ba79f9bf

iMac:puf_data mrh$ diff puf.csv puf-AFmsort.csv 

iMac:puf_data mrh$ BINGO!

So, we're in complete agreement. Great!
Thanks for all your hard work on this thorny problem.

@MattHJensen

@andersonfrailey
Collaborator Author

@martinholmer fantastic! I'll run the make files overnight to get the weights and ratios updated. Will push in the morning.

@martinholmer
Contributor

@andersonfrailey, Now that you've solved the different-computer-replication problem in PR #261, I wonder if you can somehow eliminate these pycodestyle (nee pep8) warnings:

iMac:taxdata mrh$ make cstest
pycodestyle .
./puf_data/StatMatch/Matching/add_cps_vars.py:13:8: W605 invalid escape sequence '\d'
./puf_data/StatMatch/Matching/add_cps_vars.py:13:10: W605 invalid escape sequence '\d'
./puf_data/StatMatch/Matching/add_cps_vars.py:13:23: W605 invalid escape sequence '\d'
./puf_data/StatMatch/Matching/add_nonfilers.py:33:8: W605 invalid escape sequence '\d'
./puf_data/StatMatch/Matching/add_nonfilers.py:33:10: W605 invalid escape sequence '\d'
./puf_data/StatMatch/Matching/add_nonfilers.py:33:23: W605 invalid escape sequence '\d'
make: *** [cstest] Error 1

Maybe you can do some Google searching for this W605 invalid escape sequence warning to find out how others have changed their code to eliminate it.

@andersonfrailey
Collaborator Author

After the latest commit this PR should be good to go. Will merge later this morning if no questions come up.

@andersonfrailey
Collaborator Author

Just noticed your comment, @martinholmer. I'll see what I can do about the warnings.

@martinholmer
Contributor

@andersonfrailey said:

Just noticed your comment, @martinholmer. I'll see what I can do about the warnings.

OK, but let's not slow up the merge of #261 for this, which can be addressed later.
Do you want to merge #261 now before 11:00am on Friday, August 10th?

@andersonfrailey
Collaborator Author

@martinholmer, just pushed up the fixes. Running pycodestyle . returns no errors. I just needed to convert the strings to raw string literals. As expected, the results do not change.
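
The fix is of this form (illustrative pattern only; the actual regexes in add_cps_vars.py and add_nonfilers.py are different):

import re

# Before (triggers W605 because '\d' is not a valid string escape):
#   pattern = '(\d+)'
# After (raw string literal, identical regex, no warning):
pattern = r'(\d+)'
print(re.findall(pattern, "fips 39"))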

I'm good with merging as soon as these tests finish running.

@martinholmer
Contributor

@andersonfrailey said:

just pushed up the fixes. Running pycodestyle . returns no errors. I just needed to convert the strings to raw string literals. As expected, the results do not change.

I'm good with merging as soon as these tests finish running.

GREAT!

@andersonfrailey
Collaborator Author

Going to go ahead and merge this now. Will send out an email to distribute the PUF this afternoon.
