
New preprocessors #132

Merged: 19 commits (teaearlgraycold:new_preprocessors into master), May 11, 2016

Conversation

@danthedaniel (Contributor) commented Apr 25, 2016

What does this PR do?

Adds a few new feature preprocessors to TPOT:

  • FastICA
  • FeatureAgglomeration
  • Nystroem

(RandomTreesEmbedding was found to increase the feature count too much and was not added)

Where should the reviewer start?

Line 1143 of tpot.py, where the new operators are implemented.
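
For orientation, here is a minimal sketch of the shape of one such operator, assuming the 2016-era tpot.py conventions (operators are methods on the TPOT class, and every DataFrame carries 'guess', 'group', and 'class' bookkeeping columns). The function name and exact signature are illustrative, not verbatim from the PR:

```Python
import pandas as pd
from sklearn.decomposition import FastICA

def fast_ica_operator(input_df, n_components):
    """Illustrative stand-in for the FastICA operator added in this PR."""
    non_feature_cols = ['guess', 'group', 'class']
    features = input_df.drop(non_feature_cols, axis=1)

    # Transform the feature columns only...
    ica = FastICA(n_components=n_components, random_state=42)
    transformed = ica.fit_transform(features.values)

    # ...then carry the bookkeeping columns through unchanged.
    modified_df = pd.DataFrame(data=transformed)
    for col in non_feature_cols:
        modified_df[col] = input_df[col].values
    return modified_df
```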

How should this PR be tested?

The preprocessors should be easily tested by commenting out most other pipeline operators and running TPOT on a dataset, forcing TPOT to use these new operators. I would also recommend exporting the pipeline to sklearn code.
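
A minimal sketch of that manual test, assuming the 2016-era TPOT API (a single top-level TPOT class, before the TPOTClassifier/TPOTRegressor split):

```Python
from sklearn.cross_validation import train_test_split
from sklearn.datasets import load_digits
from tpot import TPOT

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25)

# With the other operators commented out in tpot.py, GP can only build
# pipelines out of the new preprocessors plus a classifier.
tpot = TPOT(generations=5, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('exported_pipeline.py')  # inspect the generated sklearn code
```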

Any background context you want to provide?

Regarding the Nystroem preprocessor: I mistakenly started working on it after misreading one of @rhiever's messages saying not to. I believe his reasoning for not wanting it was that it required too many parameters for GP to be able to optimize.

However, the sklearn docs seem to indicate that one of the (optional) parameters isn't even used by the majority of the kernel types, so I left it out of the operator code. Nystroem now requires only 3 parameters.
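
For reference, the resulting three-parameter surface (kernel, gamma, n_components), shown with scikit-learn's Nystroem used directly; the dropped optional parameter stays at its sklearn default:

```Python
import numpy as np
from sklearn.kernel_approximation import Nystroem

X = np.random.RandomState(42).rand(200, 10)

nystroem = Nystroem(kernel='rbf', gamma=0.1, n_components=50)
X_transformed = nystroem.fit_transform(X)
print(X_transformed.shape)  # (200, 50)
```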

What are the relevant issues?

#130

Screenshots (if appropriate)

Questions:

  • Do the docs need to be updated?

I have already updated them.

  • Does this PR add new (Python) dependencies?

No

teaearlgraycold added 5 commits April 24, 2016 01:09
RTE is oddly returning a different datatype than other preprocessors from its transform method
Also fixed export code mistake (missing quotation marks in exported code) for FeatureAgglomeration
@coveralls (Coverage Status)

Coverage decreased (-2.2%) to 26.464% when pulling 4d9549a on teaearlgraycold:new_preprocessors into 8417028 on rhiever:master.

@rhiever (Contributor) commented Apr 26, 2016

Needs unit tests, then we can merge.

@coveralls (Coverage Status)

Coverage increased (+4.2%) to 56.408% when pulling 5192f6c on teaearlgraycold:new_preprocessors into b4bd593 on rhiever:master.

@danthedaniel (Contributor, Author)

Also added an operator that adds features for the counts of zero and non-zero elements in each row, as per #133.
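
A hedged sketch of what such an operator might look like; the function and column names are illustrative (see #133 for the actual proposal), and the 'guess'/'group'/'class' bookkeeping follows the same convention as the other operators:

```Python
import numpy as np
import pandas as pd

def zero_count(input_df):
    """Append per-row counts of zero and non-zero feature values."""
    non_feature_cols = ['guess', 'group', 'class']
    feature_cols_only = input_df.drop(non_feature_cols, axis=1)

    modified_df = input_df.copy()
    modified_df['non_zero'] = feature_cols_only.apply(
        lambda row: np.count_nonzero(row), axis=1)
    modified_df['zero_count'] = feature_cols_only.apply(
        lambda row: row.size - np.count_nonzero(row), axis=1)
    return modified_df
```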

@coveralls (Coverage Status)

Coverage increased (+4.5%) to 56.718% when pulling b1c51f3 on teaearlgraycold:new_preprocessors into b4bd593 on rhiever:master.

@coveralls (Coverage Status)

Coverage increased (+5.1%) to 57.302% when pulling b1c51f3 on teaearlgraycold:new_preprocessors into b4bd593 on rhiever:master.

@rhiever (Contributor) commented May 8, 2016

Hrm. This branch has conflicts now because of the export_utils refactor. Should be a small fix, yes?

@danthedaniel (Contributor, Author)

Sorry, I had forgotten to push that commit until now. It should be ready to merge now.

@coveralls (Coverage Status)

Coverage increased (+5.2%) to 57.576% when pulling 041b21c on teaearlgraycold:new_preprocessors into 7910b93 on rhiever:master.

@rhiever (Contributor) commented on the diff, May 10, 2016:

```Python
Returns
-------
modified_df: pandas.DataFrame {n_samples, n_components + ['guess', 'group', 'class']}
    Returns a DataFrame containing the transformed features
```

FeatureAgglomeration needs an example export case.
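
A hedged guess at the kind of export case being requested: standalone sklearn code exercising a FeatureAgglomeration step, with its string-valued parameters properly quoted (the export bug fixed earlier in this branch); the parameter values here are arbitrary:

```Python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

X = np.random.RandomState(0).rand(100, 20)

# 'euclidean' and 'ward' must appear quoted in the exported code.
fa = FeatureAgglomeration(n_clusters=5, affinity='euclidean', linkage='ward')
X_reduced = fa.fit_transform(X)
print(X_reduced.shape)  # (100, 5)
```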

@danthedaniel (Contributor, Author) commented May 10, 2016

```Python
feature_cols_only.apply(lambda row: np.count_nonzero(row), axis=1)
```

Well both of those would end up being O(n) operations. Would using the apply() method be faster because of C speedups?


Review comment on the diff (Contributor):

```Python
# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))
```

All of the docs need to be updated with the new way we implement the cross-validation split. We use train_test_split now, e.g.:

```Python
training_indices, testing_indices = train_test_split(tpot_data.index, stratify=tpot_data['class'].values, train_size=0.75, test_size=0.25)
```

@rhiever (Contributor) commented May 10, 2016

> Well both of those would end up being O(n) operations. Would using the apply() method be faster because of C speedups?

Yes, I believe so. I've typically found speedups from replacing for loops with apply() calls on pandas DataFrames.
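
A quick, hedged way to check this on a given machine: time an explicit Python loop against DataFrame.apply for the row-wise non-zero count (fully vectorized NumPy, included last, typically beats both):

```Python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(5000, 50)))

def loop_count():
    return [np.count_nonzero(row) for _, row in df.iterrows()]

def apply_count():
    return df.apply(lambda row: np.count_nonzero(row), axis=1)

def vectorized_count():
    return np.count_nonzero(df.values, axis=1)

for fn in (loop_count, apply_count, vectorized_count):
    print(fn.__name__, timeit.timeit(fn, number=5))
```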

Daniel Angell added 4 commits May 10, 2016 19:25
Replaced list comprehension with calls to the pandas DataFrame's apply(). Also removed unnecessary code from _zero_count()
@coveralls (Coverage Status)

Coverage increased (+5.08%) to 57.409% when pulling 8d6d464 on teaearlgraycold:new_preprocessors into a0e84de on rhiever:master.

@coveralls (Coverage Status)

Coverage increased (+5.008%) to 57.34% when pulling 755ec70 on teaearlgraycold:new_preprocessors into a0e84de on rhiever:master.

Review comment on the diff (Contributor):

```Python
import numpy as np
import pandas as pd
from sklearn.cross_validation import StratifiedShuffleSplit
```

Don't forget the imports for the train_test_split docs change. :-)
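
Putting the two review comments together, the updated doc header would look roughly like this (a hedged consolidation, using the sklearn 0.17-era import path discussed in this thread; the data-file placeholders are kept from the original docs):

```Python
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = train_test_split(
    tpot_data.index, stratify=tpot_data['class'].values,
    train_size=0.75, test_size=0.25)
```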

@danthedaniel (Contributor, Author)
There's also a reference to StratifiedShuffleSplit in tutorials/Titanic_Kaggle.ipynb. Not sure if you want that changed.

@coveralls (Coverage Status)

Coverage increased (+5.008%) to 57.34% when pulling f68289b on teaearlgraycold:new_preprocessors into a0e84de on rhiever:master.

@rhiever (Contributor) commented May 11, 2016

> There's also a reference to StratifiedShuffleSplit in tutorials/Titanic_Kaggle.ipynb. Not sure if you want that changed.

Sure, let's remove all of them while we're at it. Thank you for going through and cleaning up the docs like this.

@coveralls (Coverage Status)

Coverage increased (+5.008%) to 57.34% when pulling 8930a9f on teaearlgraycold:new_preprocessors into a0e84de on rhiever:master.

@rhiever rhiever merged commit bda254e into EpistasisLab:master May 11, 2016
@danthedaniel danthedaniel deleted the new_preprocessors branch August 19, 2016 19:18