
Is an arbitrary pipeline structure useful? #104

Open
rhiever opened this issue Mar 6, 2016 · 18 comments
rhiever (Contributor) commented Mar 6, 2016

One of the ideas behind TPOT is that it can create an arbitrary pipeline structure: A TPOT pipeline can have as many operators as it needs, and even perform separate analyses on copies of the data set thanks to the "Combine DFs" operator.

However, one big question remains: Is having an arbitrarily-large pipeline structure useful? Or is all we need a data preprocessor, then a feature preprocessor, then a modeling step?

We should explore this question more by taking the current version of TPOT and comparing it to a version of TPOT that fixes the pipeline structure to three steps: data preprocessing (variance threshold, standard scaler, robust scaler), feature preprocessing (polynomial features, PCA, all feature selection methods), then a modeling step (all of the models).

Perhaps we can also compare it to a four-step pipeline structure: data preprocessing (variance threshold, standard scaler, robust scaler), feature preprocessing (polynomial features, PCA), feature selection (all feature selection methods), then a modeling step (all of the models). Perhaps having feature selection as a separate step just prior to the modeling step could be useful.

In either of the "fixed pipeline structure TPOT" cases, mutations would be restricted to replacing and tuning the appropriate operators in each step. Crossover would have to be prevented from creating invalid pipelines as well. This would likely entail rolling custom mutation and crossover operators.
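A sketch of what step-restricted mutation and crossover might look like (the operator names and the dict-based pipeline representation here are illustrative, not TPOT's actual internals):

```python
import random

# Hypothetical operator pools for a fixed three-step pipeline.
STEP_OPERATORS = {
    "data_preprocessing": ["VarianceThreshold", "StandardScaler", "RobustScaler"],
    "feature_preprocessing": ["PolynomialFeatures", "PCA", "SelectKBest"],
    "model": ["RandomForestClassifier", "LogisticRegression", "GradientBoostingClassifier"],
}

STEPS = list(STEP_OPERATORS)

def random_pipeline(rng=random):
    """Sample one operator per fixed step."""
    return {step: rng.choice(ops) for step, ops in STEP_OPERATORS.items()}

def mutate(pipeline, rng=random):
    """Replace the operator in one step with another operator from the
    same step, so the three-step structure can never be violated."""
    step = rng.choice(STEPS)
    new = dict(pipeline)
    new[step] = rng.choice(STEP_OPERATORS[step])
    return new

def crossover(a, b, rng=random):
    """Take each step from one parent or the other; the child always
    keeps a valid one-operator-per-step structure."""
    return {step: rng.choice([a[step], b[step]]) for step in STEPS}
```

Because mutation and crossover both operate step-wise, invalid orderings (e.g. a model feeding a preprocessor) are unrepresentable by construction.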

bartleyn (Contributor) commented Apr 2, 2016

I agree that we could probably structure the pipelines at least a little bit to reduce the complexity, as I've definitely gotten runs where the best pipeline had feature selection/preprocessing steps happening after the modeling steps, which seems unnecessary.

bartleyn (Contributor) commented May 3, 2016

Can we bypass having to roll custom mutation and crossover operators by changing how we pass the datasets around? I propose that we wrap the primary data structure (a pandas DataFrame / NumPy matrix) in different TPOT-specific subclasses and change the pipeline operators' typed contracts accordingly to restrict how operators get assembled. For example, the code where we add operators might look like the following:

# data preprocessor
self._pset.addPrimitive(self._standard_scaler, [pd.DataFrame], tpot.DataPreprocessed)

# feature preprocessor
self._pset.addPrimitive(self._binarizer, [tpot.DataPreprocessed, float], tpot.FeaturePreprocessed)

# feature selection
self._pset.addPrimitive(self._select_kbest, [tpot.FeaturePreprocessed, int], tpot.FeatureSelected)

# machine learning operator
self._pset.addPrimitive(self._random_forest, [tpot.FeatureSelected, int], tpot.MachineLearned)

Is this sensible? I'm assuming that the subclasses would pretty much just be wrappers for the original data structure, in an effort to keep things lightweight.
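For illustration, a minimal sketch of what those lightweight wrappers and typed operators could look like (class names mirror the hypothetical tpot.DataPreprocessed etc. above; the operator bodies are stubbed):

```python
# Thin wrapper types for DEAP's strongly-typed GP. Each stage type just
# carries the underlying data structure (a DataFrame or NumPy array)
# without copying it.
class StageResult:
    __slots__ = ("data",)
    def __init__(self, data):
        self.data = data

class DataPreprocessed(StageResult): pass
class FeaturePreprocessed(StageResult): pass
class FeatureSelected(StageResult): pass
class MachineLearned(StageResult): pass

def standard_scaler(df) -> DataPreprocessed:
    # ... scale df here (stubbed) ...
    return DataPreprocessed(df)

def binarizer(inp: DataPreprocessed, threshold: float) -> FeaturePreprocessed:
    # ... binarize inp.data against threshold (stubbed) ...
    return FeaturePreprocessed(inp.data)
```

Registered with DEAP's typed primitive set (e.g. addPrimitive(binarizer, [DataPreprocessed, float], FeaturePreprocessed)), the GP engine could then only ever generate trees where each stage consumes the output type of the previous one.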

rhiever (Contributor, Author) commented May 6, 2016

That's brilliant! Yes, I think that would work just fine for these purposes. The only tricky part could be allowing TPOT to perform no preprocessing, but I suppose we could create a "no preprocessing" operator for each step.
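Such a "no preprocessing" operator could be a typed identity function, e.g. (a sketch, with a hypothetical stage type):

```python
# A pass-through operator for one stage of the fixed pipeline.
class DataPreprocessed:
    def __init__(self, data):
        self.data = data

def no_data_preprocessing(df) -> DataPreprocessed:
    """Identity operator: tags the raw data as 'preprocessed' without
    changing it, so the GP can effectively skip this stage while the
    typed contract between stages stays intact."""
    return DataPreprocessed(df)
```

One such identity operator per stage would let evolution choose shorter effective pipelines without ever breaking the fixed structure.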

tonyfast commented:

I imagine that very complex graphs could exist where preprocessors could switch roles. Are pipelines understood well enough to create a rigid ontology?

The DataFrame class is very robust, and it may be dangerous to choose something else instead. It is good to know DataFrame in and DataFrame out; custom classes could get confusing.

A base class

A TPOT model base class will have to: (1) set its state, (2) fit a model, (3) predict with a model, (4) update a DataFrame, and (5) export code.

This notebook tinkers around with the idea of the TPOT model. This class would export jinja-templated code along with the docstring in a notebook cell.

A sample for a random forest may look like:

class _random_forest(Model):
    """Fits a random forest classifier

    Parameters
    ----------
    input_df: pandas.DataFrame {n_samples, n_features+['class', 'group', 'guess']}
        Input DataFrame for fitting the random forest
    max_features: int
        Number of features used to fit the decision tree; must be a positive value

    Returns
    -------
    input_df: pandas.DataFrame {n_samples, n_features+['guess', 'group', 'class', 'SyntheticFeature']}
        Returns a modified input DataFrame with the guess column updated according to the classifier's predictions.
        Also adds the classifier's predictions as a 'SyntheticFeature' column.

    """
    _package = 'sklearn.ensemble.RandomForestClassifier'
    _source = """
from {{package_path[:-1] | join('.')}} import {{package_path | last}}

rfc{{operator_num}} = RandomForestClassifier(
    n_estimators = {{n_estimators}},
    max_features = {{max_features}},
)
rfc{{operator_num}}.fit(
    {{input_df}}.loc[training_indices].drop('class', axis=1).values, 
    {{input_df}}.loc[training_indices, 'class'].values,
)
    """

    def preprocess(self, input_df: pd.DataFrame, max_features: int = 4):
        if max_features < 1:
            max_features = 'auto'
        elif max_features == 1:
            max_features = None
        elif max_features > len(input_df.columns) - 3:
            max_features = len(input_df.columns) - 3
        return {
            'max_features': max_features,
            'n_estimators': 500,
            'random_state': 42, 
            'n_jobs': -1
        }

Some of the models are very similar in execution. There may be a few base models required to cover the current set of sklearn models.

Many of these models could be combined using:

from toolz import compose
pipeline = compose(_random_forest, _select_kbest, _binarizer, _standard_scaler)

bartleyn (Contributor) commented:

> I imagine that very complex graphs could exist where preprocessors could switch roles.

Can you explain a little more about what you mean by preprocessors switching roles?

> Are pipelines understood well enough to create a rigid ontology?

I think it's less that pipelines are understood well enough, and more that we're trying to align ontologies for structured pipelines with related work in autoML (e.g., the auto-sklearn project and paper found here). So it seems mostly just convenient to me (but I could be wrong).

> The DataFrame class is very robust, and it may be dangerous to choose something else instead. It is good to know DataFrame in and DataFrame out; custom classes could get confusing.

This is just a tentative approach to structuring these pipelines; rather than gutting the DEAP evolutionary algorithm, we could start with this to see whether it's worth investing more time in it.

I think the base model class idea is super interesting, and aligns with that pending major OOP refactor -- your idea is perhaps a more long-term solution to the problem.

tonyfast commented May 31, 2016

I am not too sure what I meant in the beginning. Maybe I should have said: "We don't know how complicated these models will be in the future. Choosing types to build pipelines now may pigeonhole building complicated pipelines in the future."

Thanks for that reference @bartleyn. That clarifies the process for me. Something like multipledispatch or odo could be helpful for automating these pipelines.

I don't know much about DEAP, but from my brief research into it over the past few days I doubt it would have to be gutted. Maybe a few things could get monkey-patched, but it seems to have a nice plugin system for extending methods.

It is starting to appear that TPOT may be something like class TPOT(sklearn.base.ClassifierMixin, deap.gp.eaSimple)

bartleyn (Contributor) commented:

Are we still interested in testing some of this in the current state of TPOT? I went ahead and implemented my idea, and am happy to try and get a PR working.

rhiever (Contributor, Author) commented Jun 10, 2016

Once we get 0.4 out, I'd love to explore this idea. Several people (including @tonyfast) have expressed interest in designing a "grammar" for TPOT pipelines to help constrain the pipeline structure.

tonyfast commented:

I would really like to discuss this idea with y'all. I would like to champion bringing the API closer to native scikit-learn opinions. I started working on a refactor with the expectation of making generic model pipelines, but I think at this point it was largely exploratory, to understand the data flow. One of the main successes was being able to have tpot always end on a classifier.

After some deeper dives into scikit-learn, there are tools like Pipeline and FeatureUnion that would assist tpot through the evolution process. There is an example of a deap model built on scikit-learn.
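For reference, a minimal sketch of composing Pipeline and FeatureUnion in plain scikit-learn (the dataset and hyperparameters are illustrative); FeatureUnion runs parallel feature paths and concatenates their outputs, much like TPOT's "Combine DFs" operator:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# A small synthetic classification problem for illustration.
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # PCA and univariate selection run in parallel; their outputs are
    # concatenated column-wise before reaching the classifier.
    ("features", FeatureUnion([
        ("pca", PCA(n_components=3)),
        ("kbest", SelectKBest(f_classif, k=4)),
    ])),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X, y)
```

A fixed-structure TPOT could evolve which transformer fills each named slot while the Pipeline object itself guarantees a valid ordering.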

Is there a place we can start putting research and ideas into for the post-0.4 release development?

rhiever (Contributor, Author) commented Jun 13, 2016

> Is there a place we can start putting research and ideas into for the post-0.4 release development?

You can start up issues on the repo for us to discuss, or use an existing issue if it's related.

GinoWoz1 commented Sep 28, 2018

Hey @weixuanfu, @tonyfast, where did this end up? As a recent adopter of TPOT I am fascinated by these ideas and curious about their status. I am so-so in Python but catching on fast, and would be willing to allocate some time toward this endeavor.

GinoWoz1 commented Oct 3, 2018

@weixuanfu @rhiever, bumping this. Any ideas? I can help brainstorm a high-level grammar and help push this along.

I ran two tpot runs this week. One finished within 2 days with relatively good pipelines. Another ran for 4 days and was using heavy computing power going through 5- and 6-level pipelines (a double function transformer, a one-hot encoder for data already set up with dummies, etc.).

GinoWoz1 commented Oct 8, 2018

@weixuanfu I would like to contribute; I will open up a separate issue if this is ignored. All I am asking is what work has been done on this so I can help. I already had an in-depth call with Randy about the project and what aspects need to be worked on to move this forward. I am a master's student in data science who is possibly doing a thesis on this subject.

weixuanfu (Contributor) commented:

@GinoWoz1 We are working on and testing a new template feature for TPOT (see this branch), which may be similar to the grammar function you mentioned. I need to discuss this issue with the other members of the TPOT team and will keep you posted.

GinoWoz1 commented Oct 8, 2018

Thanks, please let me know where I can contribute. I am a full-time student and am doing a presentation next month on TPOT. I have taken a deep dive into the code for TPOT as well as DEAP to understand their inner workings.

@weixuanfu
Copy link
Contributor

@GinoWoz1 We'd like to invite you to contribute to this feature. I will send you an email soon to schedule a meeting for some discussion.

cottrell (Contributor) commented:

Curious if there has been more discussion on this. I just hit "Warning: sklearn.pipelines.FeatureUnion is not available and will not be used by TPOT." ... and came to the party.

aastha3 commented Feb 2, 2019

What does a bdbQuit indicate? Alternatively, how do you define a sparse feature?
