-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include Dependent Filers in PUF #261
Conversation
@andersonfrailey, I'm currently testing #261 on my computer after incorporating the recent changes on the master branch. I'll let you know how things go. |
@andersonfrailey, I re-created pr-261 on my computer, removed the Here is the raw file info on my computer:
And here is the info on the derived files:
Assuming the raw info matches what you have on your computer, which of the derived files is different on your computer? Is the And finally, what version of python, numpy, and pandas are you using? |
This comment reports on RECID==1 differences between the The following displays of non-zero variables for the first filing unit were generated in the
Here are all the non-zero variables from the
And now the non-zero variables from the
And here are the differences (with AF-1 values on the left) as shown in a graphical diff utility: It looks as if the PUF variables are the same, but the CPS-imputed variables (the ages) are different. I'm tabulating a
|
@martinholmer I ran everything with the same specs you're using and my results didn't change. I checked the MD5 on the puf.csv I generated and it's different from the last one I sent you, but consistent with previously generated PUFs. These are the numbers I got for the variables you're comparing:
I will send you this file over email. |
@andersonfrailey, Thanks for your newest These two files are different even though they have the same number of rows:
The first record with RECID==1 is exactly the same in the two files, but many other records are different.
There is no output from this command:
In the above output, the More interesting is the filing unit with RECID==28.
Using a graphical diff program makes it easier to see the differences in context. Here is the screen shot with the AF values on the left and the MH values on the right. As you can see the ages differ in the two files (of both the taxpayer and the dependent) and the fips code varies and the sampling weight differs by quite a bit. All the other variable values are the same. Is it possible that the matching of a CPS record to each PUF record is being done differently on the two computers? The fact that the I don't see any way to solve this troubling problem without debugging the Matching logic, which contains plenty of |
@andersonfrailey, I'm puzzled by the values of Here are my tabulations:
Why the stark difference in the distribution of older ages? Why are there spouse ages in the 90s when the largest head age is either 80 or 85? |
@andersonfrailey, I don't understand the details of the PUF Matching logic, but I find it strange that there are so many records in the Here are the results from my tabulations:
Have I done something wrong? Or can you confirm these results? |
@martinholmer I'll try and recreate your results. What version of stats models are you using? That's the only other package I could think of that affects the match. I've also asked @hdoupe to run the matching using PR 261 to see how his results compare to what we're getting. |
I'm setting things up now. I'll get back to you when it's run. |
@andersonfrailey said:
And @hdoupe said:
This is an excellent idea. |
@andersonfrailey asked:
Excellent point! I'm using (what I understand to be the newest) version 0.9.0 as you can see from this:
|
Ah I didn't realize statsmodels wasn't in the environment file. I'll add it to the list in this PR. I've also updated from version 0.8.0 to 0.9.0 and will re-run my scripts. At this point we should be working in the exact same environment. |
Update. I updated to statsmodels 0.9.0 and my resulting PUF and cps-matched-puf files did not change. |
@andersonfrailey said:
OK, thanks for the update. What sort of results is @hdoupe getting? |
It looks like @hdoupe got different results than both of us. @martinholmer can you send me your files just so I can be sure? |
@andersonfrailey said:
Suggests that my hunch that the Matching logic is (for some reason) unstable might be correct. @andersonfrailey asked:
Yes, but let me "make puf_data/puf.csv" under PR#261 once more from scratch, just to make sure. |
I'm re-creating everything in a docker container to see if there is some kind of pre-existing environment issue in our systems that is throwing things off. |
@hdoupe said:
Thanks. At this stage we have no clue about what's causing different results on different computers, so your docker work might be able to help us determine the cause(s) of the cross-computer differences. |
@andersonfrailey, Thanks for the updates to pull request #261. Now what I get on my Toronto iMac running Python 2.7 is much closer to what you get on your DC computer running Python 3.6 (which is what's in the latest version of the pull request, right?). The size on the two generated files in the
Using the PUF-related files (the
But many of the records are different. So, for example, six of the first ten records are different (but only in the ages, weights, and FIPS code):
|
@andersonfrailey, Here is the BASH script I used to generate the differences in my recent #261 comment:
So, you can see that the |
@andersonfrailey, I upgraded all the versioned packages in the
Then on my local pr-261 branch, I removed
Notice that the byte sizes of the So, it would seem that what results are generated (on the same computer) differ depending on the taxdata-relevant package versions. What versions of these packages are you using?
The When I run the taxdata tests in the presence of the newly-generated-on-my-computer PUF-related files, I get a test failure with these differences:
|
@martinholmer, yes. I'll try the rounding on my machine in both 2.7 and 3.6 environments and if the results are the same I'll push the changes so you can try it. |
After converting all Also, I haven't run all of the make scripts yet. I'll wait for you to have a chance to create the PUF before I do then push all the weights up. |
@andersonfrailey, Thanks for the change in commit 4cf9bc9. This change eliminated most, but not all, of the differences between what you get on your computer and what I get on my computer. Now the Below I show what I did on my computer, which concludes by showing the first set of differences for RECID=211976 and RECID=211977. This pair of records seems to exhibit the same sorting error as before (except that the spouse_age seems non-symmetrical). So, it would seem the rounding to the nearest integer dollar eliminated most of the differences, but a few remain. Perhaps rounding to the nearest ten dollars and converting that rounded float to an integer would eliminate the remaining 36 differences.
|
@martinholmer, rounding further might help. I also recently came across this in the
I did a little reading on what it means for a search algorithm to be stable. The gist is that in a stable sorting algorithm, two elements with the same value will be sorted in the same order in which they appear in the input. When using an unstable algorithm, this may not be the case. I haven't found anything yet on how an unstable algorithm would handle elements with the same value, but do you think it's possible that because we're using an unstable algorithm for sorting there are a few elements with equal |
@andersonfrailey said:
Yes, I think it is quite possible. (Excellent research to find this documentation!) Here are the several places in the Matching subdirectory that
|
Thanks for finding all the places we use sorting, @martinholmer. I'm re-running the matching scripts now. |
I've pushed the edits needed to use |
@andersonfrailey, Thanks for commit b3fc924 that switches to using the stable mergesort algorithm in all five
So, we're in complete agreement. Great! |
@martinholmer fantastic! I'll run the make files overnight to get the weights and ratios updated. Will push in the morning. |
@andersonfrailey, Now that you've solved the different-computer-replication problem in PR #261, I wonder if you can somehow eliminate these pycodestyle (nee pep8) warnings:
Maybe you can do some Google searching for this |
After the latest commit this PR should be good to go. Will merge later this morning if no questions come up. |
Just noticed your comment, @martinholmer. I'll see what I can do about the warnings. |
@andersonfrailey said:
OK, but let's not slow up the merge of #261 for this, which can be addressed later. |
@martinholmer, just pushed up the fixes. Running I'm good with merging as soon as these tests finish running. |
@andersonfrailey said:
GREAT! |
Going to go ahead and merge this now. Will send out an email to distribute the PUF this afternoon. |
This PR is in response to issue #259. It modifies the phase one scripts so that dependent filers are not dropped during the matching process. This bug was introduced in PR #188, when we added the
person
variable to our partitioning. Upon the merge of this PR, we will have the same number of dependent filers as the pre-188 PUF and a similar number of tax units.I've also reduced the minimum value of XTOT to 0 because dependent filers don't get any exemptions.
One other odd thing I noticed was that the max value for
age_spouse
ticked up to 99, despite no unit whereDSI
== 1 having anage_spouse
value equal to 99. This makes me think PR #188 affected the matching process in more ways than just dropping dependent filers. This PR should take care of any of those thus far undiscovered issues.@martinholmer