
Data preprocessing and information leakage #48

Open

HectorBarrio opened this issue Dec 2, 2020 · 14 comments
@HectorBarrio

Hello, first of all thanks for the package; it is very useful, and the overall approach is innovative and efficient. I have a comment regarding the "state" of the data the pps analysis is run on: it seems (I may be mistaken) that any transformation applied to the data beforehand (standardization, for example) will leak information into the k-fold cross-validation. Is that correct? The module could use sklearn's pipelining and standard transforms to possibly increase the information generated. Would this be of value to the module?

@FlorianWetschoreck
Collaborator

Hi Hector, thank you for reaching out and for sharing your suggestions.
I agree that transformations to the data can lead to data leakage.
What is your proposal for adding sklearn pipelining and standard transforms to ppscore?

@HectorBarrio
Author

Let me try, over the week, to replace the models (regressor/classifier) with a pipeline model that includes one standardization step. If this works, it can be made a kwarg in predictors.

@FlorianWetschoreck
Collaborator

I would like to protect your time, so before you start implementing the proposal, please provide a concept (aka some examples) for the API first. This way, we can first discuss the new API (aka user experience) and when we agree on a suitable API, we can talk about the implementation.

@HectorBarrio
Author

Yes, Florian. Minimum changes, if it works: it could be a keyword argument added to the predictors function for a list of transformations, which reaches the VALID_CALCULATIONS dictionaries and replaces tree.DecisionTree*() with a pipeline that preprocesses using the input list of transformations, along the lines of:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline
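For illustration only (not ppscore code), make_pipeline from that link would wrap the transformations and the existing estimator like this:

from sklearn import tree
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline names the steps automatically; the estimator stays the last step
model = make_pipeline(StandardScaler(), tree.DecisionTreeRegressor())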

@FlorianWetschoreck
Collaborator

I think I got it - can you still please give one detailed example with the actual syntax? I would love to have a look at how the full code would look.

@HectorBarrio
Author

As an example only, in the calculation module:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Change the models to (TO_BE_CALCULATED, tree and the normalizers are the
# names already defined in the calculation module):

VALID_CALCULATIONS = {
    "regression": {
        "type": "regression",
        "is_valid_score": True,
        "model_score": TO_BE_CALCULATED,
        "baseline_score": TO_BE_CALCULATED,
        "ppscore": TO_BE_CALCULATED,
        "metric_name": "mean absolute error",
        "metric_key": "neg_mean_absolute_error",
        "model": Pipeline([('scaler', StandardScaler()), ('tree', tree.DecisionTreeRegressor())]),
        "score_normalizer": _mae_normalizer,
    },
    "classification": {
        "type": "classification",
        "is_valid_score": True,
        "model_score": TO_BE_CALCULATED,
        "baseline_score": TO_BE_CALCULATED,
        "ppscore": TO_BE_CALCULATED,
        "metric_name": "weighted F1",
        "metric_key": "f1_weighted",
        "model": Pipeline([('scaler', StandardScaler()), ('tree', tree.DecisionTreeClassifier())]),
        "score_normalizer": _f1_normalizer,
    },
}

This produces slightly different results for some data sets. The idea is to enable the "predictors" function to replace the model keys with a constructed pipeline, whose constructor is a little awkward since each step is a (name, transformer) tuple. The pipeline should take care of keeping the cross-validation scores leak-free.

A call to predictors would look like:

transformers = [StandardScaler(), MinMaxScaler()]
predictors(df, 'column', transformers=transformers)

Here predictors (or another function) would have to build the list of pipeline steps.
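A minimal sketch of that construction (the helper name _build_model is hypothetical and not part of ppscore):

from sklearn import tree
from sklearn.pipeline import Pipeline

def _build_model(transformers, task="regression"):
    # name each (name, transformer) step after its class for readability
    steps = [(type(t).__name__.lower(), t) for t in transformers]
    estimator = (
        tree.DecisionTreeRegressor()
        if task == "regression"
        else tree.DecisionTreeClassifier()
    )
    steps.append(("tree", estimator))  # the estimator must be the final step
    return Pipeline(steps)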

@FlorianWetschoreck
Collaborator

Hi Hector, thank you for the example, and I like the transformers API

When I thought about this proposal, I was unsure which problem it should solve exactly. What is the scenario the user is in, and why does the user use ppscore in that scenario?
When did you last have this scenario yourself? How did you solve it then?
Maybe you can explain this a little more - that would help my understanding.

@HectorBarrio
Author

HectorBarrio commented Dec 10, 2020

Hello Florian,
the use case is having feature data that may exhibit outliers, skewed distributions or other anomalies that can be improved by transformation instead of dropping the offenders. In this specific case I was looking for the best predictors among thousands of time series with several anomalies and had to run transformations; I transformed them and then ran PPS, contaminating the internal cross-validation. I manually changed the cross-validation PPS uses to a time-series split and pipelined the data. Users may also want to min-max scale the data, or perform more complex transformations that they could pipeline if they are looking for quick comparisons. There were changes in the PPS ranking with and without the transformations that may be significant.

As a sideline, the cv object could also be exposed as a kwarg in the predictors function to accept other splits; stratified k-fold comes to mind for very unbalanced datasets.

These are the two operations I had to perform manually in this case; exposing the transformations and the cv object as kwargs would automate this and make PPS more flexible.
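As a rough sketch only (a cross_validation keyword accepting an sklearn splitter object is an assumption here, not the current ppscore API):

import numpy as np
import pandas as pd
import ppscore as pps
from sklearn.model_selection import TimeSeriesSplit

df = pd.DataFrame({"feature": np.arange(200.0), "target": np.arange(200.0) ** 2})

# hypothetical: keep the folds in chronological order for time series data
pps.predictors(df, y="target", cross_validation=TimeSeriesSplit(n_splits=4))

# hypothetical: a StratifiedKFold splitter could likewise be passed
# for a very unbalanced classification target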

This allows quick checks; PPS_standard is pps with the pipeline added:

import PPS as pps
import PPS_standard as pps_s  # local copy of ppscore with the scaling pipeline added
import pandas as pd
import numpy as np
import sklearn.datasets as ds

diabetes = ds.load_diabetes()
df = pd.DataFrame(data=np.c_[diabetes['data'], diabetes['target']],
                  columns=diabetes['feature_names'] + ['target'])

# compare the predictor rankings with and without the standardization step
print(pps_s.predictors(df, y='target')[['x', 'ppscore']].head())
print(pps.predictors(df, y='target')[['x', 'ppscore']].head())

@FlorianWetschoreck
Collaborator

Thank you for the explanation.
Wouldn't it make more sense, then, to just pipe the cross-validation object into ppscore?
Because in the end you are concerned about an invalid cross-validation.

Did you generate a cross-validation object at the end of your pipeline?

@HectorBarrio
Author

Hello Florian, an sklearn pipeline requires its last element to be the estimator, which in PPS is the automatically chosen regressor or classifier, so I have not found any way to feed it in other than overwriting the whole model with a pipeline that has the original estimator as its last element.

The whole pipeline could be an input to PPS; the user would have to decide between regression and classification in that case, or the logic of _determine_case_and_prepare_df would have to be made more complex so that it selects from multiple models that are either classifiers or regressors. On the other hand, this would allow comparing PPS using multiple different models, not only a tree.
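For reference, the overwrite described above looks roughly like this (VALID_CALCULATIONS is an internal dict rather than a public API, and the import path is an assumption):

from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from ppscore import calculation  # internal module; path is an assumption

calculation.VALID_CALCULATIONS["regression"]["model"] = Pipeline(
    [("scaler", StandardScaler()), ("tree", tree.DecisionTreeRegressor())]
)
calculation.VALID_CALCULATIONS["classification"]["model"] = Pipeline(
    [("scaler", StandardScaler()), ("tree", tree.DecisionTreeClassifier())]
)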

@FlorianWetschoreck
Collaborator

Hi Hector, I wish you a good start to the new year and sorry for the late reply - I have been on vacation.

Thank you for clarifying that the model has to be the last step of the pipeline and that it is therefore not possible to pass a full cross-validation object.

If you want, you can go ahead and open a PR.

@HectorBarrio
Author

Happy New Year Florian. I will open the PR and propose the changes.

@HectorBarrio
Author

HectorBarrio commented Jan 7, 2021

Hello Florian,
The changes I made require the model (the tree regressor or classifier) within VALID_CALCULATIONS to be re-initialized every time the API is called so that it includes the pipeline object.

This adds no noticeable computational cost, but it cannot pass this test:

line 156 of the tests:

assert pps.score(df, "x", "y", random_seed=1) == pps.score(
    df, "x", "y", random_seed=1
)

The model object at the 'model' entry of the dictionary is a different instance of a model with the same parameters. The contents are the same in every other entry of the dict, but the model is not, so the assertion fails. This test (and subsequent result comparisons) could be modified to compare the dict excluding the 'model' entry, just as a suggestion.
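A minimal sketch of that suggestion, reusing the names from the test above (the _without_model helper is hypothetical):

def _without_model(result):
    # drop the 'model' entry, which holds a freshly constructed pipeline instance
    return {k: v for k, v in result.items() if k != "model"}

assert _without_model(pps.score(df, "x", "y", random_seed=1)) == _without_model(
    pps.score(df, "x", "y", random_seed=1)
)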

@FlorianWetschoreck
Collaborator

Thank you for the heads-up. We can easily adjust that test.
