Additional Feature Selection operators #50

Merged (6 commits) on Dec 8, 2015

Conversation

@bartleyn bartleyn commented Dec 7, 2015

Per #45, this is a first stab at the four additional feature selection operators.

Things to note:

  • Could use some optimization (caching?)
  • Could use some toying around with out-of-bounds parameters (e.g., if num_features < 0 in RFE)
  • Terminals for both estimators & scoring functions that plug into the new operators
    • i.e., RFE takes a supervised estimator so I hard-coded in a linear-kernel SVC; likewise Select* takes a scoring function, so I used chi2.
    • I think this will be of particular importance for generalizing this to regression tasks.

@rhiever rhiever self-assigned this Dec 7, 2015
rhiever commented Dec 7, 2015

Reviewing this now. Is RFE the one that's quite slow?

I may also drop dt_feature_selection() as a part of this upgrade since it's not supported in sklearn, and exporting it is quite ugly. Perhaps I'll contact the sklearn folks about merging a variant of dt_feature_selection() into sklearn.

Probably going to drop subset_df() as well, as I can't imagine it being useful at this point.

bartleyn commented Dec 7, 2015

Yeah, RFE's the slowest, perhaps because I instantiate the estimator every time.

if '_select_kbest' in operators_used:
    pipeline_text += 'from sklearn.feature_selection import SelectKBest'
if '_select_percentile' in operators_used:
    pipeline_text += 'from sklearn.feature_selection import SelectPercentile'
if '_select_percentile' or '_select_kbest' in operators_used:
    pipeline_text += 'from sklearn.feature_selection import chi2'
if '_rfe' in operators_used:
    pipeline_text += 'from sklearn.feature_selection import RFE'
These imports need newlines \n at the end of them.

_rfe should also import SVC.
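Folding both inline comments into the snippet, plus one more fix: the chi2 condition as written is always true, because the non-empty string '_select_percentile' is truthy on its own regardless of operators_used. A corrected sketch (the operators_used value is just an example):

```python
pipeline_text = ''
operators_used = ['_select_kbest', '_rfe']  # example value

if '_select_kbest' in operators_used:
    pipeline_text += 'from sklearn.feature_selection import SelectKBest\n'
if '_select_percentile' in operators_used:
    pipeline_text += 'from sklearn.feature_selection import SelectPercentile\n'
# Fixed precedence: test membership for each string explicitly
if '_select_percentile' in operators_used or '_select_kbest' in operators_used:
    pipeline_text += 'from sklearn.feature_selection import chi2\n'
if '_rfe' in operators_used:
    pipeline_text += 'from sklearn.feature_selection import RFE\n'
    pipeline_text += 'from sklearn.svm import SVC\n'
```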

rhiever commented Dec 7, 2015

In addition to the comments I've made directly inline: as you mentioned, the export code needs to ensure that the parameters stay within their proper limits.

try:
    selector.fit(training_features)
except ValueError:
    return input_df.copy()
I'm thinking _variance_threshold() should return an "empty" DF (with only class, group, and guess) in the case where none of the columns are above the threshold. Otherwise it may become an issue where these _variance_threshold() operators with high thresholds are in the pipeline but not actually doing anything.

rhiever commented Dec 7, 2015

Alrighty, I think that's all of the comments for now. In all, very nice work on this PR! Thank you for putting this together. Let me know if you want to hack at these comments, else I can merge the PR and clean it up later this week.

> Could use some optimization (caching?)

At least with large steps (>= 0.1), I found RFE to run pretty quickly on my test data sets. Maybe we could limit the RFE steps to >= 0.05 or so and it'll be fine?

> Could use some toying around with out-of-bounds parameters (e.g., if num_features < 0 in RFE)

Except in the export function, everything looks good to me. A great way to test it is to run TPOT with the new operators for several hundred generations. If it doesn't crash by the end of the run, your code has run the gauntlet and probably caught all of the possible parameter edge cases. :-)

> Terminals for both estimators & scoring functions that plug into the new operators

Great idea! I'd imagine we can easily encode various estimators and scoring functions with integer values. It may be tricky to have actual estimators and scoring functions as terminals.

Let's file an issue for this to work on after this is merged.
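One hedged sketch of the integer-encoding idea (none of these names exist in TPOT; this only illustrates the lookup-table approach):

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import chi2, f_classif

# GP terminals stay plain integers; each operator decodes them into a
# concrete estimator or scoring function via a lookup table.
ESTIMATORS = {0: lambda: SVC(kernel='linear'),
              1: lambda: DecisionTreeClassifier()}
SCORING_FUNCS = {0: chi2, 1: f_classif}

def decode_estimator(code):
    # Wrap out-of-range codes around, in the same spirit as clamping
    # num_features, so any integer the GP produces is valid.
    return ESTIMATORS[code % len(ESTIMATORS)]()

def decode_scoring_func(code):
    return SCORING_FUNCS[code % len(SCORING_FUNCS)]
```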

bartleyn commented Dec 7, 2015

Yeah I'm on it. I'll work on it when I get home tonight.

rhiever commented Dec 7, 2015

You rock! 👍

{2} = {0}[['guess', 'class', 'group']]
try:
    mask = selector.get_support(True)
    mask_cols = list(training_features[mask].columns) + ['guess', 'class', 'group']
There are no guess and group columns in the export code.

try:
    selector.fit(training_features.values)
except ValueError:
    {2} = {0}[['guess', 'class', 'group']]
There are no guess and group columns in the export code.

rhiever commented Dec 8, 2015

Just noting some minor issues in the export code. About to do some final tests of the pipeline operators themselves, and if those turn out fine, I'll merge this and do some final cleanup.

bartleyn commented Dec 8, 2015

Gotcha, thanks! I won't concern myself with cleaning up the export code then. I'll get it right one day though, I promise :P.

rhiever pushed a commit that referenced this pull request Dec 8, 2015
Additional Feature Selection operators
@rhiever rhiever merged commit 36cc872 into EpistasisLab:master Dec 8, 2015
rhiever commented Dec 8, 2015

Looks good! Thanks again for your PR! :-)
