Categorical (binary) column incorrectly treated as continuous for Univariate Drift Detection #171

Closed
nikml opened this issue Dec 8, 2022 · 3 comments
Labels
documentation Improvements or additions to documentation

nikml commented Dec 8, 2022

Describe the bug
The binary predictions (y_pred) from the synthetic binary classification dataset are treated as continuous rather than categorical.

To Reproduce
Steps to reproduce the behavior:
Run the Univariate Drift example notebook that is used to generate the documentation.
y_pred is treated as continuous instead of categorical.

Expected behavior
The y_pred column would be treated as categorical.

Screenshots & scripts
The variable is present in the continuous drift results for v0.8.1:
https://nannyml.readthedocs.io/en/v0.8.1/_images/drift-guide-continuous.svg

nikml added the bug (Something isn't working) and triage (Needs to be assessed) labels on Dec 8, 2022
nnansters added the documentation (Improvements or additions to documentation) label and removed the bug and triage labels on Dec 14, 2022
nnansters commented

Hey Nikos,

This behavior is correct: NannyML designates each column as continuous or categorical in the base module.

You are right, however, that this is not the expected behavior given the example in the docs. It can be fixed by explicitly setting the y_pred column as categorical before creating the calculator. I'll update the documentation accordingly.

import nannyml as nml

# reference_df and analysis_df are the reference and analysis DataFrames
# loaded earlier in the tutorial.
# Cast the binary predictions to a categorical dtype so they are treated as categorical.
reference_df['y_pred'] = reference_df['y_pred'].astype("category")
analysis_df['y_pred'] = analysis_df['y_pred'].astype("category")

column_names = [
    'distance_from_office', 'salary_range', 'gas_price_per_litre',
    'public_transportation_cost', 'wfh_prev_workday', 'workday', 'tenure',
    'y_pred_proba', 'y_pred',
]
calc = nml.UnivariateDriftCalculator(
    column_names=column_names,
    timestamp_column_name='timestamp',
    continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
    categorical_methods=['chi2', 'jensen_shannon'],
)
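
To illustrate why a 0/1 prediction column can end up on the continuous side, here is a minimal sketch of type-based column designation. It is a simplified stand-in written in plain Python, not NannyML's actual code: the helper `designate_column` is hypothetical, and it inspects Python value types where a real implementation would presumably inspect pandas dtypes.

```python
from numbers import Number


def designate_column(values):
    """Hypothetical helper: label a column by the type of its values.

    Numeric values (excluding bool) are labelled 'continuous'; anything
    else is labelled 'categorical'. This only mirrors the idea of
    type-based designation and is an assumption, not NannyML's code.
    """
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "continuous" if is_numeric else "categorical"


# Binary predictions stored as integers look numeric, so they are
# designated continuous even though they only take two values.
y_pred = [0, 1, 1, 0, 1]
label_as_int = designate_column(y_pred)

# Re-encoding the same values as a non-numeric type (the role played by
# astype("category") in the pandas snippet) switches the designation.
y_pred_as_labels = [str(v) for v in y_pred]
label_as_category = designate_column(y_pred_as_labels)

print(label_as_int, label_as_category)
```

The point of the sketch is that the designation depends on how the values are stored, not on how many distinct values they take, which is why an explicit cast is needed for binary integer columns.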

nnansters added a commit that referenced this issue Dec 14, 2022

nikml commented Dec 14, 2022

I looked a bit further into this. The quickstart is also affected, and the issue was actually introduced in version 0.7.0, when we removed the StatisticalOutputDriftCalculator.

So we should fix the quickstart as well and check whether the documentation needs more polishing.

nikml added a commit that referenced this issue Dec 15, 2022
* Many Updates to Univariate Drift Comparison
* Update Univariate Drift Tutorial
* Update Readme, fixing incorrect images for drift
* Remove unneeded drift images
* Fix PCA How it works page showing outdated code.
* Fix realized regression performance docs and relevant readme plot
* Remove unneeded realized performance images
* Fix quickstart re #171

Co-authored-by: cartgr <carterblair@uvic.ca>
Co-authored-by: Jakub Bialek <jakub@nannyml.com>

nikml commented Dec 15, 2022

Closing, as the quickstart also received a hotfix. We can polish the docs later.

nikml closed this as completed on Dec 15, 2022