fix(STEF-3054): exclude __-prefixed columns from feature_names #870
Merged
MvLieshout merged 3 commits on May 7, 2026
TimeSeriesDataset now treats columns starting with __ as internal columns and excludes them from feature_names. This prevents feature-aware transforms (e.g. Scaler) from fitting on sentinel columns emitted by OutlierHandler. It fixes a production crash in which the Scaler expected sentinel columns at predict time that had only been present during training, when outliers were detected.

Signed-off-by: Marnix van Lieshout <marnix.van.lieshout@alliander.com>
…umns-from-feature-names



Problem
The OutlierHandler emits sentinel columns (e.g. __outlier_nan_load_lag_P7D__) when it replaces values with NaN. These columns were included in TimeSeriesDataset.feature_names, so feature-aware transforms (Scaler) were fitted on them during training. At predict time, when no outliers are detected, the sentinel columns are absent and sklearn's feature-name validation crashes.

Fix
TimeSeriesDataset now treats any column whose name starts with __ as an internal column (same mechanism as the horizon/available_at columns). These are:
- kept in data, so transforms can pass them through the pipeline
- excluded from feature_names, so feature-aware transforms ignore them

This makes the sentinel column approach work correctly: the columns flow through the pipeline untouched, are consumed by restore_target at the end, and never interfere with the Scaler or model.

Changes
_internal_columns with any __-prefixed columns; also apply in the non-versioned branch
feature_names
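
The mechanism described under Fix can be illustrated with a minimal, stdlib-only sketch. It is not the code from this PR: `feature_names` here is a free function standing in for the TimeSeriesDataset property, `ToyScaler` is a hypothetical stand-in for a feature-aware transform, and the 0/1 values in the sentinel column are an assumption made purely for illustration. The `exclude_internal=False` path imitates the behaviour before the fix.

```python
INTERNAL_PREFIX = "__"  # prefix named in the PR description


def feature_names(columns, exclude_internal=True):
    """Return the columns a feature-aware transform should see.

    exclude_internal=False imitates the behaviour before the fix.
    """
    if not exclude_internal:
        return list(columns)
    return [c for c in columns if not c.startswith(INTERNAL_PREFIX)]


class ToyScaler:
    """Hypothetical feature-aware transform (not the real Scaler)."""

    def fit(self, data, names):
        # Remember which features were seen, like sklearn's feature-name check.
        self.names_ = list(names)
        self.means_ = {n: sum(data[n]) / len(data[n]) for n in self.names_}
        return self

    def transform(self, data, names):
        if list(names) != self.names_:
            mismatch = sorted(set(self.names_) ^ set(names))
            raise ValueError(f"feature names differ from fit: {mismatch}")
        out = dict(data)  # non-feature columns pass through untouched
        for n in self.names_:
            out[n] = [v - self.means_[n] for v in data[n]]
        return out


# Training data: an outlier was detected, so a sentinel column is present.
# The 0/1 flag values are an assumption for illustration only.
train = {
    "load_lag_P7D": [1.0, 2.0, 3.0],
    "__outlier_nan_load_lag_P7D__": [0.0, 1.0, 0.0],
}
# Predict data: no outliers, so no sentinel column.
predict = {"load_lag_P7D": [4.0]}

# Before the fix: the sentinel column counts as a feature and predict fails.
try:
    old = ToyScaler().fit(train, feature_names(train, exclude_internal=False))
    old.transform(predict, feature_names(predict, exclude_internal=False))
    old_behaviour_failed = False
except ValueError:
    old_behaviour_failed = True

# After the fix: the sentinel column is ignored by the transform.
scaler = ToyScaler().fit(train, feature_names(train))
scaled_train = scaler.transform(train, feature_names(train))
scaled_predict = scaler.transform(predict, feature_names(predict))
```

In this sketch the sentinel column stays in the data dictionary and reaches the end of the pipeline unchanged, while the fitted feature list is the same at train and predict time regardless of whether any outliers were flagged.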