Explicit data type conversion for univariate drift #404

nnansters · 2024-07-05T09:14:06Z

This PR addresses the implicit data type conversion that occurs in the hellinger and jensen_shannon univariate drift methods.
Even when a column is explicitly marked as categorical, these methods still use its dtype to decide heuristically whether to treat it as continuous or categorical, completely ignoring the explicitly given information.

In this PR I've split the implementation of the JensenShannonDistance and HellingerDistance classes into two separate classes each, one for continuous features and one for categorical features. Both are then added to the registry under the same name, so "routing" still just works: the name jensen_shannon or hellinger yields one of these implementations, depending on the provided FeatureType.

@MethodFactory.register(key='jensen_shannon', feature_type=FeatureType.CONTINUOUS)
class ContinuousJensenShannonDistance(Method):
    ...  # implementation omitted

By splitting these implementations we can remove all of the logic related to determining the feature type, as it is now just a given. To be honest, this is how the implementation should've been from the beginning. Combining continuous and categorical calculation into a single class was a bad idea.
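To illustrate the routing idea, here is a minimal, self-contained sketch of a registry keyed on both the method name and the feature type. It is not the NannyML code: the internals of `MethodFactory` shown here, the empty `Method` base class, and the `CategoricalJensenShannonDistance` class name are assumptions made for the example.

```python
from enum import Enum
from typing import Callable, Dict, Tuple, Type


class FeatureType(Enum):
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"


class Method:
    """Illustrative stand-in for the drift method base class."""


class MethodFactory:
    # Hypothetical registry: one entry per (method name, feature type) pair.
    _registry: Dict[Tuple[str, FeatureType], Type[Method]] = {}

    @classmethod
    def register(cls, key: str, feature_type: FeatureType) -> Callable[[Type[Method]], Type[Method]]:
        def decorator(method_cls: Type[Method]) -> Type[Method]:
            cls._registry[(key, feature_type)] = method_cls
            return method_cls

        return decorator

    @classmethod
    def create(cls, key: str, feature_type: FeatureType) -> Method:
        try:
            method_cls = cls._registry[(key, feature_type)]
        except KeyError:
            raise ValueError(
                f"no method '{key}' registered for {feature_type.value} features"
            ) from None
        return method_cls()


@MethodFactory.register(key="jensen_shannon", feature_type=FeatureType.CONTINUOUS)
class ContinuousJensenShannonDistance(Method):
    pass  # continuous-only calculation would live here


# Assumed name for the categorical counterpart.
@MethodFactory.register(key="jensen_shannon", feature_type=FeatureType.CATEGORICAL)
class CategoricalJensenShannonDistance(Method):
    pass  # categorical-only calculation would live here


continuous_method = MethodFactory.create("jensen_shannon", FeatureType.CONTINUOUS)
categorical_method = MethodFactory.create("jensen_shannon", FeatureType.CATEGORICAL)
```

Because the feature type is part of the lookup key, neither class needs to inspect the data to work out which calculation to run.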

These Method instances are being created in the UnivariateDriftCalculator during fitting. For continuous features, the continuous version of the Method is instantiated and the same goes for categorical features. To allow further control over this, I've added a treat_as_continuous parameter to the UnivariateDriftCalculator. It allows you to pass a list of columns that should always be treated as continuous, similar to the already existing treat_as_categorical parameter.

So the entire flow is now as follows:

1. Create a calculator, explicitly marking some columns as continuous and some as categorical using the treat_as_continuous or treat_as_categorical parameters.
2. During fitting (when we have data available), we'll heuristically determine the feature type of any features that were not explicitly set.
3. We then create the lists of drift methods, by instantiating them using the MethodFactory.create method.

The logic for splitting up the column names into a list of continuous and categorical ones has grown even larger, so I've refactored that into a separate helper method.
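As a rough sketch of what such a helper might do, the function below lets explicit overrides win and falls back to a heuristic for everything else. The names `split_columns` and `looks_continuous`, the dict-of-lists input, the numeric-values heuristic, and the error raised on conflicting overrides are all assumptions for illustration; the helper in this PR works on the calculator's own data structures and may behave differently.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def looks_continuous(values: Sequence[object]) -> bool:
    """Heuristic stand-in: numeric, non-boolean values count as continuous."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)


def split_columns(
    data: Dict[str, Sequence[object]],
    treat_as_continuous: Optional[List[str]] = None,
    treat_as_categorical: Optional[List[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Return (continuous, categorical) column names; explicit choices win."""
    explicit_continuous = set(treat_as_continuous or [])
    explicit_categorical = set(treat_as_categorical or [])

    # Assumed behaviour: a column marked as both is treated as a user error.
    overlap = explicit_continuous & explicit_categorical
    if overlap:
        raise ValueError(f"columns marked as both types: {sorted(overlap)}")

    continuous: List[str] = []
    categorical: List[str] = []
    for name, values in data.items():
        if name in explicit_continuous:
            continuous.append(name)
        elif name in explicit_categorical:
            categorical.append(name)
        elif looks_continuous(values):  # heuristic only for unmarked columns
            continuous.append(name)
        else:
            categorical.append(name)
    return continuous, categorical


data = {
    "age": [31, 45, 27],
    "zip_code": [1000, 2000, 3000],  # numeric, but explicitly categorical
    "colour": ["red", "blue", "red"],
}
continuous, categorical = split_columns(data, treat_as_categorical=["zip_code"])
```

Here `zip_code` ends up in the categorical list despite holding numbers, which is the kind of explicit choice the dtype-based heuristic used to override.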

This should help with some of the points raised in #398. What do you think @Duncan-Hunter?

codecov · 2024-07-05T09:23:50Z

Codecov Report

Attention: Patch coverage is 96.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 77.01%. Comparing base (8bffbcd) to head (336a056).

Files	Patch %	Lines
nannyml/drift/univariate/methods.py	94.59%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
+ Coverage   76.98%   77.01%   +0.03%     
==========================================
  Files         108      108              
  Lines        9267     9297      +30     
  Branches     1656     1652       -4     
==========================================
+ Hits         7134     7160      +26     
- Misses       1674     1676       +2     
- Partials      459      461       +2

Duncan-Hunter · 2024-07-05T10:14:37Z

Thanks for this! Yes that looks like a better way of doing things. Happy to close the other PR.

nnansters · 2024-07-05T10:45:37Z

> Thanks for this! Yes that looks like a better way of doing things. Happy to close the other PR.

Thanks once again for bringing this to our attention, and putting in the effort. Much appreciated!

nnansters added 5 commits July 3, 2024 18:55

Add "treat_as_continuous" parameter to univariate drift calculator

1a364ce

Made _split_by_features more generic and predictable

5493e7c

Split up JS implementation into proper continuous and categorical imp…

3963a62

…lementations

Split up Hellinger implementation into continuous and categorical imp…

3ee722f

…lementation

Linting

336a056

nnansters marked this pull request as ready for review July 5, 2024 09:23

nnansters requested a review from nikml as a code owner July 5, 2024 09:23

Duncan-Hunter mentioned this pull request Jul 5, 2024

change assumed treat_as_categorical #397

Closed

nnansters merged commit 6517553 into main Jul 5, 2024
7 of 8 checks passed
