FeatureSetSelector does not work when not set as the first item in a template. #1250

Open
perib opened this issue May 17, 2022 · 1 comment

Comments

perib commented May 17, 2022

FeatureSetSelector (FSS) only works when it is the first step of a template. When no template is used, or when FSS sits in the middle of a template, the behavior is not well defined. It would be helpful if FSS could be used without a template. For example, TPOT could set FSS as the first step of the pipeline and leave the rest of the pipeline unrestricted.

For example, the following template works normally:
template='FeatureSetSelector-Transformer-Classifier'
However, "Transformer-FeatureSetSelector-Classifier" does not work, nor does the base model without a template. There are two issues:
1. When using string column names: the other transformations do not preserve them, so when FSS is not first it cannot look up the feature names and crashes.
2. When using column indexes: the column ordering is not guaranteed to be preserved by the transformations in the other steps. FSS then picks a different subset than intended, while also discarding the rest of the data up to that point.
For example, let's say subset 1 is indexes 0 and 1, and our pipeline is Some Transformer-FSS-Classifier.
Our data is [0, 1, 8, 9].
The first transformation adds two columns.
The data is now [7, 7, 0, 1, 8, 9].
FSS will now select [7, 7] and discard the rest (including the added transformation in the last step).
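
To make the two failure modes concrete, here is a minimal plain-Python sketch. It does not use TPOT or scikit-learn; `select_by_index`, `select_by_name`, and `add_two_columns` are hypothetical stand-ins for FSS and for a transformer that returns a bare array with two extra leading columns.

```python
# Plain-Python stand-ins; none of these names come from TPOT or scikit-learn.

def select_by_index(row, indexes):
    """Keep only the values at the given column positions."""
    return [row[i] for i in indexes]

def select_by_name(row, names):
    """Keep only the named columns; needs a mapping from name to value."""
    return [row[name] for name in names]

def add_two_columns(values):
    """Hypothetical transformer: returns a bare list with two new leading columns."""
    return [7, 7] + list(values)

named_row = {"a": 0, "b": 1, "c": 8, "d": 9}
plain_row = list(named_row.values())  # [0, 1, 8, 9]

# Selector first: both styles return the intended subset.
by_name_first = select_by_name(named_row, ["a", "b"])
by_index_first = select_by_index(plain_row, [0, 1])

# Selector after the transformer: names are gone and positions have shifted.
transformed = add_two_columns(plain_row)  # [7, 7, 0, 1, 8, 9]
by_index_after = select_by_index(transformed, [0, 1])
try:
    select_by_name(transformed, ["a", "b"])
    name_lookup_failed = False
except TypeError:
    # A bare list cannot be indexed by column name.
    name_lookup_failed = True
```

With the selector first, both lookups give [0, 1]. After the transformer, the index lookup silently returns [7, 7] and the name lookup raises, which matches the two issues listed above.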

perib commented May 17, 2022

An additional useful feature, though possibly more difficult to implement, would be to have FSS pass different data to different "branches". For example:

    FSS -> classifier
                      \
                        > classifier
                      /
    FSS -> classifier

     (height, age, weight) -> Regression
                                         \
                                          > regression forest
                                         /
      Genes/proteins - > KNN

This could be possible by using FeatureUnion to group the outputs of two branches. TPOT could be initialized with a FeatureUnion containing a user-specified number of items that each begin with FSS, the rest being determined through GP.
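
As a rough illustration of the branching idea, here is a plain-Python sketch. It is not TPOT's or scikit-learn's API: `make_branch`, `union`, `mean_step`, and `max_step` are hypothetical names, and the two steps stand in for whatever GP would choose for each branch.

```python
# Plain-Python stand-ins for the idea; not TPOT's or scikit-learn's API.

def make_branch(indexes, step):
    """A branch is a feature-set selection followed by one step on that subset."""
    def run(row):
        subset = [row[i] for i in indexes]
        return step(subset)
    return run

def union(branches):
    """Concatenate branch outputs side by side, as a FeatureUnion-style grouping would."""
    def run(row):
        combined = []
        for branch in branches:
            combined.extend(branch(row))
        return combined
    return run

# Hypothetical steps; in the proposal these would be determined through GP.
def mean_step(values):
    return [sum(values) / len(values)]

def max_step(values):
    return [max(values)]

clinical = make_branch([0, 1, 2], mean_step)  # e.g. height, age, weight
omics = make_branch([3, 4], max_step)         # e.g. gene/protein columns

combined = union([clinical, omics])
final_input = combined([2.0, 4.0, 6.0, 10.0, 30.0])
```

Each branch only ever sees its own columns, and `final_input` is what the last estimator in the diagram would receive. Because every branch starts from the raw input, the selector is always "first" within its branch, which sidesteps the ordering problem described in the first comment.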
