Explicit data type conversion for univariate drift #404
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR addresses the implicit data conversion occurring in the
hellinger
andjensen_shannon
univariate drift methods.Even when a column is explicitly marked as categorical, these methods will still use the
dtype
to heuristically decide on treating them as continuous or categorical, thereby completely ignoring any explicitly given information.In this PR I've split up the implementation of the
JensenShannonDistance
andHellingerDistance
classes into two separate classes, one for continuous features and one for categorical features. They are then both added to the registry using the same name. This ensures that "routing" will still just work. I.e. the namejensen_shannon
orhellinger
will yield one of these implementations, depending on the providedFeatureType
.By splitting these implementations we can remove all of the logic related to determining the feature type, as it is now just a given. To be honest, this is how the implementation should've been from the beginning. Combining continuous and categorical calculation into a single class was a bad idea.
These Method instances are being created in the
UnivariateDriftCalculator
during fitting. For continuous features, the continuous version of the Method is instantiated and the same goes for categorical features. To allow further control over this, I've added atreat_as_continuous
parameter to theUnivariateDriftCalculator
. It allows you to pass a list of columns that should always be treated as continuous, similar to the already existingtreat_as_categorical
parameter.So the entire flow is now as follows:
treat_as_continuous
ortreat_as_categorical
parameters.MethodFactory.create
method.The logic for splitting up the column names into a list of continuous and categorical ones has grown even larger, so I've refactored that into a separate helper method.
This should aid in some of the points raised in #398. What do you think @Duncan-Hunter?