New preprocessors #132
Conversation
RTE is oddly returning a different datatype than other preprocessors from its transform method
Also fixed export code mistake (missing quotation marks in exported code) for FeatureAgglomeration
Needs unit tests, then we can merge.
Also added an operator that adds features for the count of zero and non-zero elements, as per #133.
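The zero/non-zero count idea from #133 can be sketched as a small helper. This is an illustrative sketch only, not TPOT's actual operator code; the function and column names (`add_zero_count_features`, `zero_col`, `non_zero`) are assumptions:

```Python
import numpy as np
import pandas as pd

def add_zero_count_features(df, feature_cols):
    """Append two features: the per-row count of zero and non-zero
    values among the given feature columns.

    Hypothetical sketch of the idea in #133, not TPOT's implementation.
    """
    features = df[feature_cols]
    modified = df.copy()
    # Count non-zero and zero entries across each row of the feature columns
    modified['non_zero'] = (features != 0).sum(axis=1)
    modified['zero_col'] = (features == 0).sum(axis=1)
    return modified
```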
Hrm. This branch now has conflicts because of the export_utils refactor. Should be a small fix, yes?
Sorry, I had forgotten to push that commit until now. Should be ready to merge.
```Python
Returns
-------
modified_df: pandas.DataFrame {n_samples, n_components + ['guess', 'group', 'class']}
    Returns a DataFrame containing the transformed features
```
FeatureAgglomeration needs an example export case.
Well, both of those would end up being O(n) operations. Would using the …
```Python
# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))
```
All of the docs need to be updated with the new way we implement the cross-validation split. We now use train_test_split, e.g.,

```Python
training_indices, testing_indices = train_test_split(tpot_data.index, stratify=tpot_data['class'].values, train_size=0.75, test_size=0.25)
```
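A self-contained sketch of the new split, using a small made-up DataFrame in place of the real data file (the `tpot_data` contents here are assumptions for illustration; newer sklearn versions import `train_test_split` from `sklearn.model_selection`):

```Python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the data file; 'class' must be the label column.
tpot_data = pd.DataFrame({'feat': range(8), 'class': [0, 1] * 4})

# Split the index into training and testing subsets, stratified by class,
# as in the updated docs snippet above.
training_indices, testing_indices = train_test_split(
    tpot_data.index,
    stratify=tpot_data['class'].values,
    train_size=0.75,
    test_size=0.25,
)
```

With 8 rows and a 75/25 split, this yields 6 training indices and 2 testing indices.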
Yes, I believe so. I've typically found speedups from replacing …
…opying in preprocessors
Replaced the list comprehension with calls to the pandas DataFrame's apply(). Also removed unnecessary code from _zero_count().
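The change described above can be illustrated on a toy DataFrame. This is a minimal sketch, not the actual _zero_count() code; it assumes the operator counts non-zero entries per row:

```Python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[0, 1, 2], [0, 0, 3]]))

# Row-by-row list comprehension (the style the commit replaced):
non_zero_lc = [np.count_nonzero(row) for _, row in df.iterrows()]

# DataFrame.apply() over rows, as described in the commit message:
non_zero_ap = df.apply(np.count_nonzero, axis=1)

# Both approaches produce the same per-row counts.
assert non_zero_lc == list(non_zero_ap)
```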
```Python
import numpy as np
import pandas as pd
from sklearn.cross_validation import StratifiedShuffleSplit
```
Don't forget the imports for the train_test_split docs change. :-)
There's also a reference to StratifiedShuffleSplit in …
Sure, let's remove all of them while we're at it. Thank you for going through and cleaning up the docs like this.
What does this PR do?
Adds a few new feature preprocessors to TPOT:
(RandomTreesEmbedding was found to increase the feature count too much and was not added)
Where should the reviewer start?
Line 1143 of tpot.py, where the new operators are implemented.
How should this PR be tested?
The preprocessors can easily be tested by commenting out most of the other pipeline operators and running TPOT on a dataset, forcing TPOT to use the new operators. I would also recommend exporting the pipeline to sklearn code.
Any background context you want to provide?
About the Nystroem preprocessor: I mistakenly started working on it when @rhiever had said not to, misreading one of his messages. I believe his reasoning for not wanting it was that it required too many parameters for GP to be able to optimize.
However, the sklearn docs seem to indicate that one of the (optional) parameters isn't even used by the majority of the kernel types, so I left it out of the operator code. Nystroem now requires only 3 parameters.
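For reference, a three-parameter Nystroem can be used directly from sklearn. The specific parameter choices below (kernel, gamma, n_components) are a plausible reading of the three parameters mentioned above, not TPOT's exact operator configuration:

```Python
from sklearn.datasets import load_iris
from sklearn.kernel_approximation import Nystroem

X, y = load_iris(return_X_y=True)

# Three-parameter Nystroem kernel approximation; the values here
# are illustrative, not TPOT's defaults.
nystroem = Nystroem(kernel='rbf', gamma=0.1, n_components=10)
X_transformed = nystroem.fit_transform(X)
```

fit_transform maps the 150 x 4 iris feature matrix to a 150 x 10 approximate kernel feature space.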
What are the relevant issues?
#130
Screenshots (if appropriate)
Questions:
I have already updated them
No