Add Adult Census dataset description #659

Closed
wants to merge 28 commits into from

@ArturoAmorQ ArturoAmorQ commented Sep 8, 2022

Fixes #657.

This PR is potentially controversial on several counts: it adds a new dependency, the wording may need correcting, and the interpretation given may not be accurate.

All feedback is welcome!

ArturoAmorQ and others added 26 commits March 14, 2022 10:37
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel ogrisel left a comment

I think it's problematic to discuss and draw conclusions from the distributions of (gender- or age-based) sub-groups without properly taking fnlwgt into account. Furthermore, I am not certain how we would do this: I tried to use fnlwgt to recover the expected approximately 50%/50% relative representation of Male and Female adults in the US population, but I failed.

We would have to dig into the origin of this dataset to find out how it was built and how to properly use fnlwgt to draw conclusions about such sub-group distributions, but I think this goes beyond what we want to achieve with this MOOC.

ogrisel commented Oct 7, 2022

For the record, here is the quick check I made:

>>> from sklearn.datasets import fetch_openml
>>> X, y = fetch_openml("adult", return_X_y=True)
>>> (X["sex"] == "Male").mean()
0.6684820441423365
>>> (X["sex"] != "Male").mean()
0.33151795585766347
>>> ((X["sex"] == "Male") * X['fnlwgt']).sum() / X['fnlwgt'].sum()
0.6757528069510574
>>> ((X["sex"] != "Male") * X['fnlwgt']).sum() / X['fnlwgt'].sum()
0.32424719304894256
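
The check above compares an unweighted share of rows with a share of the total fnlwgt. As a minimal sketch of that same computation, here it is on a tiny hand-made frame. The values and the helper name `weighted_share` are hypothetical and are not taken from the Adult Census data; they only illustrate how the two quantities can differ:

```python
import pandas as pd

# Hypothetical toy data, NOT values from the Adult Census dataset.
toy = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Male", "Female"],
        "fnlwgt": [100, 100, 100, 300],
    }
)


def weighted_share(frame, column, value, weight="fnlwgt"):
    """Fraction of the total weight carried by rows where frame[column] == value."""
    mask = frame[column] == value
    return (mask * frame[weight]).sum() / frame[weight].sum()


# Unweighted: fraction of rows labelled "Male" (3 of 4 rows).
unweighted_male = (toy["sex"] == "Male").mean()

# Weighted: fraction of the summed weights carried by "Male" rows (300 of 600).
weighted_male = weighted_share(toy, "sex", "Male")

print(unweighted_male, weighted_male)
```

In this toy frame the weights happen to rebalance the sample to an even split. In the check on the actual dataset they leave the Male share close to two thirds, which is why it remains unclear how fnlwgt should be used here.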

ogrisel commented Oct 7, 2022

I have the feeling that we can close this PR once #663 is accepted and merged, if others agree.
