Using feature selector as part of sklearn pipeline #960

Open
jtlz2 opened this issue Aug 3, 2022 Discussed in #959 · 1 comment

jtlz2 commented Aug 3, 2022

Discussed in #959

Originally posted by jtlz2 August 3, 2022
Awesome package, thanks!

I'm trying to use the feature-selector transformer within a sklearn pipeline but keep getting errors like

AssertionError: The index of X and y need to be the same

Now, this raises a few questions for me:

  1. Although https://github.com/blue-yonder/tsfresh/blob/main/notebooks/examples/02%20sklearn%20Pipeline.ipynb mentions the feature augmenter, it does not give the exact syntax for using the feature selector in a pipeline step. What is the syntax precisely?
  2. Presumably feature selection should only be carried out on the CV dataset - how do I ensure this in the context of a sklearn pipeline?
  3. This raises another spectre - I want to perform (tsfresh's) feature selection on tsfresh features, while simultaneously fitting non-tsfresh-derived features. Is this even possible and if so how can we make it work?

Thanks again!

kempa-liehr (Collaborator) commented

Hi @jtlz2,
Thanks for pointing out that we are missing an example of how to use FeatureSelector() in an sklearn pipeline.

Let's assume that you have already extracted the time-series features using the extract_features() function. You can join the resulting DataFrame of time-series features with another feature matrix, provided both have the same index.
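
As a minimal sketch of that join, assuming both feature matrices are pandas DataFrames indexed by the same ids: the frames below are made-up stand-ins (the column names `value__mean`, `value__maximum`, and `age` are hypothetical), not real extract_features() output. The target is reindexed so that its index matches X, which is what the AssertionError in the question complains about.

```python
import pandas as pd

ids = pd.Index(["a", "b", "c"], name="id")

# Stand-in for the output of extract_features(): one row per time-series id.
ts_features = pd.DataFrame(
    {"value__mean": [0.1, 0.5, 0.9], "value__maximum": [1.0, 2.0, 3.0]},
    index=ids,
)

# Hypothetical non-time-series features, deliberately in a different row order.
other_features = pd.DataFrame(
    {"age": [41, 35, 29]},
    index=pd.Index(["c", "a", "b"], name="id"),
)

# join() aligns rows on the index, so the row order of other_features does not matter.
X = ts_features.join(other_features, how="inner")

# Target values, reindexed so that y has exactly the same index as X.
y = pd.Series({"b": 1, "c": 0, "a": 1}, name="target").reindex(X.index)

assert X.index.equals(y.index)
```

Using `how="inner"` keeps only ids present in both frames, so no NaN rows are introduced by the join.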

The sklearn pipeline can be built as follows:

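from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from tsfresh.transformers import FeatureSelector

# X: pandas DataFrame of extracted features; y: pandas Series with the same index as X
clf = make_pipeline(FeatureSelector(),
                    RandomForestClassifier())
# The whole pipeline, including the feature selection step, is refit on each training fold
scores = cross_val_score(clf, X, y)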

You can then fit the model with clf.fit(X, y) or use clf with the tools provided in sklearn.model_selection.
