Allow for Flexible Preprocessing #897

8bit-pixies · 2019-07-27T22:50:09Z


What does this PR do?

Too many things - so many that I'm fairly sure there will be discussion about whether this is the right way to proceed. I'm putting it up here before I invest more time in it. This PR addresses several items (possibly too many, again), including:

#507
#836
Handling of categorical data #771 - I know it's not directly related, but it is the closest issue, and in my mind this would allow extending TPOT to use different encodings, such as those in https://github.com/scikit-learn-contrib/categorical-encoding

These are all related to how TPOT does preprocessing. Among other things, this PR re-introduces the "RandomTree" option in templates, which allows templates of the form Transformer-RandomTree and therefore things like My_Preprocessing-RandomTree.

The high-level approach is to inject additional preprocessing steps when _fit_init is called, which then alters the behaviour of TPOT.
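To make the template shape concrete, here is a minimal sketch of splitting a template string and checking where RandomTree appears. The helper name `split_template` is hypothetical and this is not TPOT's own template parser; the rule that RandomTree may only be the final step is taken from the author's comment further down the thread.

```python
def split_template(template):
    """Split a template such as "PCA-RandomTree" into its steps.

    Hypothetical helper for illustration only, not part of TPOT.
    """
    steps = template.split("-")
    if any(not step for step in steps):
        raise ValueError(f"empty step in template {template!r}")
    # Assumption from the thread: RandomTree is only valid as the last step.
    if "RandomTree" in steps[:-1]:
        raise ValueError("RandomTree may only appear as the final step")
    return steps


print(split_template("My_Preprocessing-RandomTree"))
```

Rejecting a misplaced RandomTree up front would surface the mistake before any pipelines are generated, rather than partway through a run.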

Where should the reviewer start?

tpot/base.py - I will add comments on the files in this PR with my thoughts.

Also, I can't get relative imports working for the tpot.driver.load_scoring_function part, so help with that would be appreciated.

How should this PR be tested?

Happy to add new tests later, once the design is approved.

Any background context you want to provide?

NIL see above

What are the relevant issues?

#507
#836
Handling of categorical data #771

Screenshots (if appropriate)

Example of API using modified Iris dataset

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25)

# Template ending in RandomTree (re-introduced by this PR)
tpot = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-RandomTree")

# Wrap the features in a DataFrame and add a synthetic free-text column
X_train_df = pd.DataFrame(X_train, columns=["num1", "num2", "num3", "num4"])
X_train_df['text'] = np.random.choice(["hello", "world", "foo", "bar world", "bar hello"], X_train.shape[0])

# preprocess_config_dict is the argument added by this PR.
# Run 1: a single numeric column
tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-LogisticRegression",
                      preprocess_config_dict = {
                          'numeric_columns': ["num2"]
                      })
tpot2.fit(X_train_df, y_train)

# Run 2: three numeric columns
tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-LogisticRegression",
                      preprocess_config_dict = {
                          'numeric_columns': ["num2", "num3", "num4"]
                      })
tpot2.fit(X_train_df, y_train)

# Run 3: one numeric column plus the text column
tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-LogisticRegression",
                      preprocess_config_dict = {
                          'numeric_columns': ["num2"],
                          'text_columns': ['text']
                      })
tpot2.fit(X_train_df, y_train)

Generation 1 - Current best internal CV score: 0.5908385093167702

Best pipeline: LogisticRegression(PCA(PreprocessTransformer(input_matrix, numeric_columns=['num2']), iterated_power=2, svd_solver=randomized), C=20.0, dual=True, penalty=l2)

Generation 1 - Current best internal CV score: 0.9556935817805383

Best pipeline: LogisticRegression(PCA(PreprocessTransformer(input_matrix, numeric_columns=['num2', 'num3', 'num4']), iterated_power=9, svd_solver=randomized), C=20.0, dual=False, penalty=l1)

Generation 1 - Current best internal CV score: 0.5096273291925466

Questions:

Do the docs need to be updated? Yes - will do later
Does this PR add new (Python) dependencies? No

8bit-pixies · 2019-07-27T22:50:45Z

tpot/builtins/preprocessing.py

+
+
+
+def load_scoring_function(scoring_func):


Help getting relative imports working would be appreciated here, for tpot.driver.load_scoring_function.

weixuanfu · 2019-07-29

Hmm, importing directly via from ..driver import load_scoring_function will conflict with these imports:

from .tpot import TPOTClassifier, TPOTRegressor
from ._version import __version__

A workaround is to move this function to tpot/metrics.py and then add from ..metrics import load_scoring_function to tpot/builtins/preprocessing.py.
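The thread does not spell out the conflict, but it looks like a circular import. The sketch below reproduces that kind of cycle with toy packages and shows why moving the helper into a leaf module resolves it. The package names `toy_cyclic` and `toy_fixed`, the class `Estimator`, and the assumed cycle (preprocessing imports driver, driver imports tpot, tpot imports preprocessing) are illustrative stand-ins, not TPOT's actual source.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Toy package with an assumed cycle: preprocessing -> driver -> tpot -> preprocessing.
CYCLIC = {
    "__init__.py": "from .tpot import Estimator\nfrom .driver import load_scoring_function\n",
    "tpot.py": "from .builtins.preprocessing import PreprocessTransformer\nclass Estimator: pass\n",
    "builtins/__init__.py": "",
    "builtins/preprocessing.py": "from ..driver import load_scoring_function\nclass PreprocessTransformer: pass\n",
    "driver.py": "from .tpot import Estimator\ndef load_scoring_function(func): return func\n",
}

# Same layout, but the helper lives in a leaf module that imports nothing from the package.
FIXED = dict(CYCLIC)
FIXED["metrics.py"] = "def load_scoring_function(func): return func\n"
FIXED["builtins/preprocessing.py"] = "from ..metrics import load_scoring_function\nclass PreprocessTransformer: pass\n"
FIXED["driver.py"] = "from .tpot import Estimator\nfrom .metrics import load_scoring_function\n"


def try_import(package_name, files):
    """Write the files as a package in a temp dir and report whether importing it works."""
    with tempfile.TemporaryDirectory() as root:
        for rel_path, source in files.items():
            path = Path(root) / package_name / rel_path
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(source)
        sys.path.insert(0, root)
        importlib.invalidate_caches()
        try:
            importlib.import_module(package_name)
            return "ok"
        except ImportError as exc:
            return f"ImportError: {exc}"
        finally:
            sys.path.remove(root)


cyclic_result = try_import("toy_cyclic", CYCLIC)
fixed_result = try_import("toy_fixed", FIXED)
print(cyclic_result)  # an ImportError raised while toy_cyclic.tpot is only partially initialised
print(fixed_result)   # ok
```

The fix works because the leaf module has no imports back into the package, so it can be loaded at any point during package initialisation.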

8bit-pixies · 2019-07-27T22:51:31Z

tpot/base.py

+                            column_transform_dict[k] = config_dict[k]
+                        else:
+                            column_transform_dict[k] = [config_dict[k]]
+            self._config_dict['tpot.builtins.PreprocessTransformer'] = column_transform_dict


This injection could be dangerous. Do we have opinions on how it should be handled?

weixuanfu · 2019-07-29

I think preprocess_config_dict should be an argument of PreprocessTransformer instead of TPOT, and users should be able to customize it via config_dict.

8bit-pixies · 2019-07-30

Yes, this is certainly possible (technically it is possible right now, with no changes to TPOT master) via templates. I think the question relates to #507: whether it is possible to have a "built-in" configuration that handles text or not.

Maybe the answer is that we can't.
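One way to make the injection from the diff hunk less risky is to build a new config dict instead of mutating the one TPOT already holds. The following is only a sketch under assumptions: the function name `with_preprocess_config` is hypothetical, and because the hunk does not show the condition that decides when a value gets wrapped in a list, this version always wraps, relying on the fact that TPOT config dicts map each parameter to a list of candidate values.

```python
import copy

PREPROCESS_KEY = "tpot.builtins.PreprocessTransformer"  # key taken from the diff hunk


def with_preprocess_config(config_dict, preprocess_config_dict):
    """Return a copy of config_dict with the preprocessing options added.

    Hypothetical helper; the PR itself writes into self._config_dict in place.
    """
    merged = copy.deepcopy(config_dict)
    # Assumption: wrap every user value in a one-element list so that it
    # becomes a search space with exactly one choice.
    merged[PREPROCESS_KEY] = {
        name: [value] for name, value in preprocess_config_dict.items()
    }
    return merged


base_config = {
    "sklearn.decomposition.PCA": {
        "svd_solver": ["randomized"],
        "iterated_power": range(1, 11),
    }
}
user_options = {"numeric_columns": ["num2"], "text_columns": ["text"]}
merged = with_preprocess_config(base_config, user_options)
print(merged[PREPROCESS_KEY])
```

Because the input dict is left untouched, a shared default configuration would not silently gain a preprocessing entry after one estimator has been fitted.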

8bit-pixies · 2019-07-27T22:52:35Z

tpot/base.py

+                )
+
+            # override some settings...
+            if config_dict.get('impute', None) is False:


This line deals with #836. It might be overloading this PR and could be an item for later?

weixuanfu · 2019-07-29

I think it is more related to #889. I think we need to add imputation to config_dict too. We could allow TPOT to skip imputation if the pipeline only has XGBClassifier or XGBRegressor.

8bit-pixies · 2019-07-30T02:52:04Z

OK,

@weixuanfu, given the higher-level comments, how should we proceed? In my opinion this PR is just too big. What we probably want is:

A PR that just adds back "RandomTree" as an option, so that we can do things like "PCA-RandomTree" or similar (we have to ensure that RandomTree is the very last step). Doing so enables the next item.
Another PR that proposes how the preprocessing transform should work. There are downsides to this, as it means someone using the preprocessing has to directly alter the config options for every model, e.g.

(as per comments above)

X_train, y_train = ...
my_meta_data_info = {<insert meta data information>}
config_dict = {**my_meta_data_info, **TPOT_DEFAULT_CONFIG}
tpot = TPOTClassifier(config_dict=config_dict, template='Preprocess-RandomTree')
tpot.fit(X_train, y_train)

versus

(as per approach taken in this PR)

X_train, y_train = ...
tpot = TPOTClassifier(config_dict=None, preprocess_config_dict ={<insert meta data>})
tpot.fit(X_train, y_train)
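One detail of the config_dict variant is worth checking: when two dicts are unpacked into one literal, the dict unpacked last wins on any shared key. The sketch below uses small stand-in dicts (the names `DEFAULT_CONFIG` and `user_config` and their contents are hypothetical, not TPOT's real defaults) to show the effect of the merge order.

```python
# Stand-in for TPOT's default configuration; the real dict is much larger.
DEFAULT_CONFIG = {
    "sklearn.decomposition.PCA": {"svd_solver": ["randomized"]},
    "sklearn.linear_model.LogisticRegression": {"C": [1.0, 20.0]},
}

# Hypothetical user metadata, including one key that collides with the defaults.
user_config = {
    "tpot.builtins.PreprocessTransformer": {"numeric_columns": [["num2"]]},
    "sklearn.decomposition.PCA": {"svd_solver": ["full"]},
}

# Defaults unpacked last, as in the first snippet above: defaults override the user.
defaults_win = {**user_config, **DEFAULT_CONFIG}
# User settings unpacked last: the user overrides the defaults.
user_wins = {**DEFAULT_CONFIG, **user_config}

print(defaults_win["sklearn.decomposition.PCA"])  # {'svd_solver': ['randomized']}
print(user_wins["sklearn.decomposition.PCA"])     # {'svd_solver': ['full']}
```

If the user's metadata only ever introduces new keys, the order makes no difference; it matters as soon as a user entry is meant to replace a default one.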

8bit-pixies added 4 commits July 27, 2019 20:03

bring back random tree

c941594

whole bunch of code that doesn't break anything...yet

492e158

initial attempt - note that randomtree is broken

42c92b6

patching random tree helper

b952a7f


weixuanfu changed the base branch from master to development July 29, 2019 13:12





