
New preprocessors #132

Merged: 19 commits (teaearlgraycold:new_preprocessors into master), May 11, 2016

Conversation

@danthedaniel (Contributor) commented Apr 25, 2016

What does this PR do?

Adds a few new feature preprocessors to TPOT:

  • FastICA
  • FeatureAgglomeration
  • Nystroem

(RandomTreesEmbedding was found to increase the feature count too much and was not added)

Where should the reviewer start?

Line 1143 of tpot.py, where the new operators are implemented.
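
For orientation, here is a minimal sketch of the shape of one such operator, assuming the 2016-era tpot.py conventions (operators are methods on the TPOT class, and every DataFrame carries 'guess', 'group', and 'class' bookkeeping columns). The function name and exact signature are illustrative, not verbatim from the PR:

```Python
import pandas as pd
from sklearn.decomposition import FastICA

def fast_ica_operator(input_df, n_components):
    """Illustrative stand-in for the FastICA operator added in this PR."""
    non_feature_cols = ['guess', 'group', 'class']
    features = input_df.drop(non_feature_cols, axis=1)

    # Transform the feature columns only...
    ica = FastICA(n_components=n_components, random_state=42)
    transformed = ica.fit_transform(features.values)

    # ...then carry the bookkeeping columns through unchanged.
    modified_df = pd.DataFrame(data=transformed)
    for col in non_feature_cols:
        modified_df[col] = input_df[col].values
    return modified_df
```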

How should this PR be tested?

The preprocessors should be easily tested by commenting out most other pipeline operators and running TPOT on a dataset, forcing TPOT to use these new operators. I would also recommend exporting the pipeline to sklearn code.
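
A minimal sketch of that manual test, assuming the 2016-era TPOT API (a single top-level TPOT class, before the TPOTClassifier/TPOTRegressor split):

```Python
from sklearn.cross_validation import train_test_split
from sklearn.datasets import load_digits
from tpot import TPOT

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25)

# With the other operators commented out in tpot.py, GP can only build
# pipelines out of the new preprocessors plus a classifier.
tpot = TPOT(generations=5, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('exported_pipeline.py')  # inspect the generated sklearn code
```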

Any background context you want to provide?

Regarding the Nystroem preprocessor: I mistakenly started working on it after misreading one of @rhiever's messages saying not to. I believe his reasoning for not wanting it was that it required too many parameters for GP to be able to optimize.

However, the sklearn docs seem to indicate that one of the (optional) parameters isn't even used by the majority of the kernel types, so I left it out of the operator code. Nystroem now requires only 3 parameters.
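
For reference, the resulting three-parameter surface (kernel, gamma, n_components), shown with scikit-learn's Nystroem used directly; the dropped optional parameter stays at its sklearn default:

```Python
import numpy as np
from sklearn.kernel_approximation import Nystroem

X = np.random.RandomState(42).rand(200, 10)

nystroem = Nystroem(kernel='rbf', gamma=0.1, n_components=50)
X_transformed = nystroem.fit_transform(X)
print(X_transformed.shape)  # (200, 50)
```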

What are the relevant issues?

#130

Screenshots (if appropriate)

Questions:

  • Do the docs need to be updated?

I have already updated them.

  • Does this PR add new (Python) dependencies?

No

teaearlgraycold added 5 commits April 24, 2016 01:09
RTE is oddly returning a different datatype than other preprocessors from its transform method
Also fixed export code mistake (missing quotation marks in exported code) for FeatureAgglomeration
@coveralls (Coverage Status)

Coverage decreased (-2.2%) to 26.464% when pulling 4d9549a on teaearlgraycold:new_preprocessors into 8417028 on rhiever:master.

@rhiever (Contributor) commented Apr 26, 2016

Needs unit tests, then we can merge.

@coveralls (Coverage Status)

Coverage increased (+4.2%) to 56.408% when pulling 5192f6c on teaearlgraycold:new_preprocessors into b4bd593 on rhiever:master.

@danthedaniel (Contributor, Author)

Also added an operator that adds features for the counts of zero and non-zero elements in each row, as per #133.
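
A hedged sketch of what such an operator might look like; the function and column names are illustrative (see #133 for the actual proposal), and the 'guess'/'group'/'class' bookkeeping follows the same convention as the other operators:

```Python
import numpy as np
import pandas as pd

def zero_count(input_df):
    """Append per-row counts of zero and non-zero feature values."""
    non_feature_cols = ['guess', 'group', 'class']
    feature_cols_only = input_df.drop(non_feature_cols, axis=1)

    modified_df = input_df.copy()
    modified_df['non_zero'] = feature_cols_only.apply(
        lambda row: np.count_nonzero(row), axis=1)
    modified_df['zero_count'] = feature_cols_only.apply(
        lambda row: row.size - np.count_nonzero(row), axis=1)
    return modified_df
```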

@coveralls (Coverage Status)

Coverage increased (+4.5%) to 56.718% when pulling b1c51f3 on teaearlgraycold:new_preprocessors into b4bd593 on rhiever:master.

@coveralls (Coverage Status)

Coverage increased (+5.1%) to 57.302% when pulling b1c51f3 on teaearlgraycold:new_preprocessors into b4bd593 on rhiever:master.

@rhiever (Contributor) commented May 8, 2016

Hrm. This branch has conflicts now because of the export_utils refactor. Should be a small fix, yes?

@danthedaniel (Contributor, Author)

Sorry, I had forgotten to push that commit until now. It should be ready to merge now.

@coveralls (Coverage Status)

Coverage increased (+5.2%) to 57.576% when pulling 041b21c on teaearlgraycold:new_preprocessors into 7910b93 on rhiever:master.

@rhiever (Contributor) commented on the diff, May 10, 2016:

```Python
Returns
-------
modified_df: pandas.DataFrame {n_samples, n_components + ['guess', 'group', 'class']}
    Returns a DataFrame containing the transformed features
```

FeatureAgglomeration needs an example export case.
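
A hedged guess at the kind of export case being requested: standalone sklearn code exercising a FeatureAgglomeration step, with its string-valued parameters properly quoted (the export bug fixed earlier in this branch); the parameter values here are arbitrary:

```Python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

X = np.random.RandomState(0).rand(100, 20)

# 'euclidean' and 'ward' must appear quoted in the exported code.
fa = FeatureAgglomeration(n_clusters=5, affinity='euclidean', linkage='ward')
X_reduced = fa.fit_transform(X)
print(X_reduced.shape)  # (100, 5)
```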

@danthedaniel (Contributor, Author) commented May 10, 2016

```Python
feature_cols_only.apply(lambda row: np.count_nonzero(row), axis=1)
```

Well both of those would end up being O(n) operations. Would using the apply() method be faster because of C speedups?


Review comment on the diff (Contributor):

```Python
# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))
```

All of the docs need to be updated with the new way we implement the cross-validation split. We use train_test_split now, e.g.:

```Python
training_indices, testing_indices = train_test_split(tpot_data.index, stratify=tpot_data['class'].values, train_size=0.75, test_size=0.25)
```

@rhiever (Contributor) commented May 10, 2016

> Well both of those would end up being O(n) operations. Would using the apply() method be faster because of C speedups?

Yes, I believe so. I've typically found speedups from replacing for loops with apply() calls on pandas DataFrames.
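
A quick, hedged way to check this on a given machine: time an explicit Python loop against DataFrame.apply for the row-wise non-zero count (fully vectorized NumPy, included last, typically beats both):

```Python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(5000, 50)))

def loop_count():
    return [np.count_nonzero(row) for _, row in df.iterrows()]

def apply_count():
    return df.apply(lambda row: np.count_nonzero(row), axis=1)

def vectorized_count():
    return np.count_nonzero(df.values, axis=1)

for fn in (loop_count, apply_count, vectorized_count):
    print(fn.__name__, timeit.timeit(fn, number=5))
```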

Daniel Angell added 4 commits May 10, 2016 19:25
Replaced list comprehension with calls to the pandas DataFrame's apply(). Also removed unnecessary code from _zero_count()
@coveralls (Coverage Status)

Coverage increased (+5.08%) to 57.409% when pulling 8d6d464 on teaearlgraycold:new_preprocessors into a0e84de on rhiever:master.

@coveralls (Coverage Status)

Coverage increased (+5.008%) to 57.34% when pulling 755ec70 on teaearlgraycold:new_preprocessors into a0e84de on rhiever:master.

Review comment on the diff (Contributor):

```Python
import numpy as np
import pandas as pd
from sklearn.cross_validation import StratifiedShuffleSplit
```

Don't forget the imports for the train_test_split docs change. :-)
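
Putting the two review comments together, the updated doc header would look roughly like this (a hedged consolidation, using the sklearn 0.17-era import path discussed in this thread; the data-file placeholders are kept from the original docs):

```Python
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = train_test_split(
    tpot_data.index, stratify=tpot_data['class'].values,
    train_size=0.75, test_size=0.25)
```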

@danthedaniel (Contributor, Author)
There's also a reference to StratifiedShuffleSplit in tutorials/Titanic_Kaggle.ipynb. Not sure if you want that changed.

@coveralls (Coverage Status)

Coverage increased (+5.008%) to 57.34% when pulling f68289b on teaearlgraycold:new_preprocessors into a0e84de on rhiever:master.

@rhiever (Contributor) commented May 11, 2016

> There's also a reference to StratifiedShuffleSplit in tutorials/Titanic_Kaggle.ipynb. Not sure if you want that changed.

Sure, let's remove all of them while we're at it. Thank you for going through and cleaning up the docs like this.

@coveralls (Coverage Status)

Coverage increased (+5.008%) to 57.34% when pulling 8930a9f on teaearlgraycold:new_preprocessors into a0e84de on rhiever:master.

@rhiever rhiever merged commit bda254e into EpistasisLab:master May 11, 2016
@danthedaniel danthedaniel deleted the new_preprocessors branch August 19, 2016 19:18