Dataset Improvements #278

Merged
merged 28 commits into master from datasets on Aug 25, 2022

hoffmansc
Collaborator

Major improvements:

  • Changed how prot_attr arguments are handled. When processing a dataset or running metrics, callers may now pass an explicit array (or list of arrays) of per-sample protected attribute values instead of an index name.
  • Added MEPS and COMPAS violent recidivism datasets (Port miscellaneous items to sklearn-compatible API #150)
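
To illustrate the first item, here is a minimal sketch of accepting either an index name or an explicit array. It is not the AIF360 implementation: `resolve_prot_attr` and `selection_rate_gap` are hypothetical names made up for this example, the data is invented, and the real code also handles lists of names, which this sketch does not.

```python
import numpy as np
import pandas as pd


def resolve_prot_attr(y, prot_attr):
    """Hypothetical helper: return protected attribute values, one per sample.

    A string is looked up as an index level name on `y`; anything else is
    treated as an explicit array of values and only checked for length.
    """
    if isinstance(prot_attr, str):
        return np.asarray(y.index.get_level_values(prot_attr))
    values = np.asarray(prot_attr)
    if len(values) != len(y):
        raise ValueError("prot_attr must have one value per sample")
    return values


def selection_rate_gap(y_pred, prot_attr, priv_group=1):
    """Toy metric: P(pred=1 | unprivileged) - P(pred=1 | privileged)."""
    groups = resolve_prot_attr(y_pred, prot_attr)
    preds = np.asarray(y_pred)
    priv = groups == priv_group
    return preds[~priv].mean() - preds[priv].mean()


# Toy predictions indexed by a protected attribute named 'sex'.
y_pred = pd.Series([1, 0, 1, 1, 0, 0],
                   index=pd.Index([1, 1, 1, 0, 0, 0], name="sex"))

by_name = selection_rate_gap(y_pred, "sex")                 # index name
by_array = selection_rate_gap(y_pred, [1, 1, 1, 0, 0, 0])   # explicit array
```

Both calls resolve to the same groups, so a caller whose data has no suitably named index can still compute the metric.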

Small changes:

  • Added cache=False option to dataset fetching functions to skip caching
  • Removed unused categories from DataFrame resulting from dropped rows
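
The last item above can be sketched with plain pandas. This is only an illustration of the behavior, using invented toy data, and not the code from this PR: dropping rows does not shrink a categorical column's category list, so the stale categories have to be removed explicitly.

```python
import pandas as pd

# Toy frame: 'race' is categorical, and each column has one missing value.
df = pd.DataFrame({
    "race": pd.Categorical(["White", "Black", None, "Asian"]),
    "age": [34, 51, 29, None],
})

# Dropping rows with missing values removes the only 'Asian' row...
dropped = df.dropna()
stale = list(dropped["race"].cat.categories)

# ...but the category itself lingers until it is explicitly removed.
cleaned = dropped.copy()
for col in cleaned.select_dtypes("category"):
    cleaned[col] = cleaned[col].cat.remove_unused_categories()
fresh = list(cleaned["race"].cat.categories)
```

Leftover categories matter downstream, for example when one-hot encoding would otherwise produce an all-zero column for a value that no longer occurs.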

hoffmansc added this to the sklearn-compat milestone Nov 23, 2021
Signed-off-by: Samuel Hoffman <hoffman.sc@gmail.com>
* minor change to usecols/dropcols usage ([] -> None)
* use fetch_openml `as_frame=True` option
* binary_race only affects protected attribute unless numeric_only
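
As a rough illustration of the `[] -> None` change in the first item: the commit does not state its motivation, so this is only a sketch of one common reason to prefer `None`. The helper `select_columns` is hypothetical and is not the library's function. With `None` as the sentinel, "no restriction" is unambiguous, whereas an empty list passed to pandas selection literally means "no columns".

```python
import pandas as pd


def select_columns(df, usecols=None, dropcols=None):
    """Hypothetical helper: keep `usecols` (all columns if None),
    then drop `dropcols` (nothing if None)."""
    if usecols is not None:
        df = df[usecols]
    if dropcols is not None:
        df = df.drop(columns=dropcols)
    return df


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

everything = select_columns(df)                      # no restriction
only_a = select_columns(df, usecols=["a"])           # explicit subset
without_c = select_columns(df, dropcols=["c"])       # explicit removal

# For contrast: selecting with an empty list yields a frame with no columns.
empty = df[[]]
```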

lgtm-com bot commented Nov 24, 2021

This pull request introduces 2 alerts and fixes 3 when merging 37f3345 into 963df2e - view on LGTM.com

new alerts:

  • 1 for Unused local variable
  • 1 for Unused import

fixed alerts:

  • 3 for 'import *' may pollute namespace

lgtm-com bot commented Nov 24, 2021

This pull request fixes 3 alerts when merging 78a5c3d into 963df2e - view on LGTM.com

fixed alerts:

  • 3 for 'import *' may pollute namespace

nrkarthikeyan
Collaborator

@monindersingh - FYI, this may address some of the questions you had.

lgtm-com bot commented Dec 3, 2021

This pull request fixes 3 alerts when merging da6d549 into 963df2e - view on LGTM.com

fixed alerts:

  • 3 for 'import *' may pollute namespace

lgtm-com bot commented Dec 3, 2021

This pull request fixes 3 alerts when merging cf0c6c3 into 963df2e - view on LGTM.com

fixed alerts:

  • 3 for 'import *' may pollute namespace

hoffmansc
Collaborator Author

@monindersingh I have a question about MEPS. In the pre-processing, you included 'EDUCYR' and 'HIDEG':

df = df[(df[['FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX','EDUCYR','HIDEG',

but they are absent from the features_to_keep argument of __init__:
features_to_keep=['REGION','AGE','SEX','RACE','MARRY',
'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX',
'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX',
'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM',
'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42','PCS42',
'MCS42','K6SUM42','PHQ242','EMPST','POVCAT','INSCOV','UTILIZATION','PERWT15F'],

In effect, those two columns are considered when we drop NAs, so rows missing either value are removed, yet the columns themselves are not in the final dataset. This only affects a few rows, but what is the reasoning for it? Was it intentional?
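
To make the question concrete, here is a toy sketch of the effect. The values are invented, only four of the real column names are used, and `dropna(subset=...)` stands in for whatever filter the actual MEPS pre-processing applies (the line quoted above is truncated, so this is an assumption).

```python
import numpy as np
import pandas as pd

# Invented toy rows; EDUCYR and HIDEG each have one missing value.
df = pd.DataFrame({
    "AGE":    [53, 40, 27, 61],
    "SEX":    [1, 2, 2, 1],
    "EDUCYR": [12, np.nan, 16, 14],
    "HIDEG":  [3, 2, np.nan, 4],
})

filter_cols = ["AGE", "SEX", "EDUCYR", "HIDEG"]  # columns checked for NAs
features_to_keep = ["AGE", "SEX"]                # columns in the final dataset

# Current behavior: filter on all four columns, then keep only two of them.
final = df.dropna(subset=filter_cols)[features_to_keep]

# Alternative: filter only on the columns that are actually kept.
alternative = df.dropna(subset=features_to_keep)[features_to_keep]
```

In this toy case `final` loses two rows whose kept features are complete, purely because of missing values in columns that are then discarded.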

lgtm-com bot commented Jan 6, 2022

This pull request fixes 3 alerts when merging 6a7bbef into 963df2e - view on LGTM.com

fixed alerts:

  • 3 for 'import *' may pollute namespace

lgtm-com bot commented Jan 14, 2022

This pull request fixes 3 alerts when merging 88d2e1c into 963df2e - view on LGTM.com

fixed alerts:

  • 3 for 'import *' may pollute namespace

hoffmansc marked this pull request as ready for review February 15, 2022 21:21

lgtm-com bot commented Jul 20, 2022

This pull request fixes 3 alerts when merging 9d5a8dd into faa75ee - view on LGTM.com

fixed alerts:

  • 3 for 'import *' may pollute namespace

lgtm-com bot commented Jul 21, 2022

This pull request fixes 3 alerts when merging 2e93e9c into faa75ee - view on LGTM.com

fixed alerts:

  • 3 for 'import *' may pollute namespace

lgtm-com bot commented Jul 27, 2022

This pull request fixes 3 alerts when merging 28986da into faa75ee - view on LGTM.com

fixed alerts:

  • 3 for 'import *' may pollute namespace

lgtm-com bot commented Jul 28, 2022

This pull request fixes 3 alerts when merging 230a93b into faa75ee - view on LGTM.com

fixed alerts:

  • 3 for 'import *' may pollute namespace

hoffmansc merged commit 3df4fa9 into master Aug 25, 2022
hoffmansc deleted the datasets branch August 25, 2022 20:43
Illia-Kryvoviaz pushed a commit to Illia-Kryvoviaz/AIF360 that referenced this pull request Jun 7, 2023
* allow explicit arrays for prot_attr, target
* add MEPS and violent recidivism datasets
* option to skip cache
* binary_race only affects protected attribute unless numeric_only
* remove unused categories after dropping
* minimum python version >= 3.7; scikit-learn >= 1.0
Successfully merging this pull request may close these issues.

assert statement fails in exponentiated gradient reduction (sklearn compatible) notebook