Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicit data type conversion for univariate drift #404

Merged
merged 5 commits into from
Jul 5, 2024

Conversation

nnansters
Copy link
Contributor

This PR addresses the implicit data conversion occurring in the hellinger and jensen_shannon univariate drift methods.
Even when a column is explicitly marked as categorical, these methods will still use the dtype to heuristically decide on treating them as continuous or categorical, thereby completely ignoring any explicitly given information.

In this PR I've split up the implementation of the JensenShannonDistance and HellingerDistance classes into two separate classes, one for continuous features and one for categorical features. They are then both added to the registry using the same name. This ensures that "routing" will still just work. I.e. the name jensen_shannon or hellinger will yield one of these implementations, depending on the provided FeatureType.

@MethodFactory.register(key='jensen_shannon', feature_type=FeatureType.CONTINUOUS)
class ContinuousJensenShannonDistance(Method):

By splitting these implementations we can remove all of the logic related to determining the feature type, as it is now just a given. To be honest, this is how the implementation should've been from the beginning. Combining continuous and categorical calculation into a single class was a bad idea.

These Method instances are being created in the UnivariateDriftCalculator during fitting. For continuous features, the continuous version of the Method is instantiated and the same goes for categorical features. To allow further control over this, I've added a treat_as_continuous parameter to the UnivariateDriftCalculator. It allows you to pass a list of columns that should always be treated as continuous, similar to the already existing treat_as_categorical parameter.

So the entire flow is now as follows:

  1. Create a calculator, explicitly marking some columns as continuous and some as categorical using the treat_as_continuous or treat_as_categorical parameters.
  2. During fitting (when we have data available), we'll heuristically determine the feature type of any features that were not explicitly set.
  3. We then create the lists of drift methods, by instantiating them using the MethodFactory.create method.

The logic for splitting up the column names into a list of continuous and categorical ones has grown even larger, so I've refactored that into a separate helper method.

This should aid in some of the points raised in #398. What do you think @Duncan-Hunter?

@nnansters nnansters marked this pull request as ready for review July 5, 2024 09:23
@nnansters nnansters requested a review from nikml as a code owner July 5, 2024 09:23
Copy link

codecov bot commented Jul 5, 2024

Codecov Report

Attention: Patch coverage is 96.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 77.01%. Comparing base (8bffbcd) to head (336a056).

Files Patch % Lines
nannyml/drift/univariate/methods.py 94.59% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
+ Coverage   76.98%   77.01%   +0.03%     
==========================================
  Files         108      108              
  Lines        9267     9297      +30     
  Branches     1656     1652       -4     
==========================================
+ Hits         7134     7160      +26     
- Misses       1674     1676       +2     
- Partials      459      461       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Duncan-Hunter
Copy link
Contributor

Thanks for this! Yes that looks like a better way of doing things. Happy to close the other PR.

@nnansters
Copy link
Contributor Author

Thanks for this! Yes that looks like a better way of doing things. Happy to close the other PR.

Thanks once again for bringing this to our attention, and putting in the effort. Much appreciated!

@nnansters nnansters merged commit 6517553 into main Jul 5, 2024
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants