Add Adult Census dataset description #659
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
I think it's problematic to discuss and draw conclusions from the distributions of (gender- or age-based) sub-groups without taking `fnlwgt` properly into account. Furthermore, I am not certain how we would do this: I tried to use `fnlwgt` to recover the expected approximately 50%/50% relative representation of `Male` and `Female` adults in the US population, but I failed.
We would have to dig into the origin of this dataset to find out how it was built and how to properly use `fnlwgt` to draw conclusions on such sub-group distributions, but I think this goes beyond what we want to achieve with this MOOC.
For the record, here is the quick check I made:
>>> from sklearn.datasets import fetch_openml
>>> X, y = fetch_openml("adult", return_X_y=True)
>>> (X["sex"] == "Male").mean()
0.6684820441423365
>>> (X["sex"] != "Male").mean()
0.33151795585766347
>>> ((X["sex"] == "Male") * X['fnlwgt']).sum() / X['fnlwgt'].sum()
0.6757528069510574
>>> ((X["sex"] != "Male") * X['fnlwgt']).sum() / X['fnlwgt'].sum()
0.32424719304894256
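The same unweighted-versus-weighted comparison can be written as a small reusable helper. This is only a minimal sketch: the function name `group_shares` is hypothetical, and the toy frame below is made-up data rather than rows from the Adult Census dataset. It shows the arithmetic of the check above, not how `fnlwgt` is meant to be used.

```python
import pandas as pd


def group_shares(df, group_col, weight_col=None):
    """Return each group's share of the rows, or of the total weight
    when `weight_col` is given (hypothetical helper for illustration)."""
    if weight_col is None:
        # Plain relative frequency of each group among the rows.
        return df[group_col].value_counts(normalize=True).sort_index()
    # Sum the weights per group, then normalize by the overall weight.
    totals = df.groupby(group_col, observed=True)[weight_col].sum()
    return (totals / totals.sum()).sort_index()


# Made-up toy data, NOT taken from the Adult Census dataset.
toy = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Male", "Female"],
        "fnlwgt": [100, 100, 100, 300],
    }
)

unweighted = group_shares(toy, "sex")
weighted = group_shares(toy, "sex", weight_col="fnlwgt")
print(unweighted)
print(weighted)
```

On this toy frame the weights rebalance the groups by construction; on the real data, as the session above shows, they do not.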
I have the feeling that we can close this PR once #663 is accepted and merged, if others agree.
Fixes #657.
A potentially controversial PR on several counts: it adds a new dependency, the wording may need correcting, and the interpretation given may not be accurate.
All feedback is welcome!