
Is an arbitrary pipeline structure useful? #104

Open
rhiever opened this issue Mar 6, 2016 · 18 comments
rhiever (Contributor) commented Mar 6, 2016

One of the ideas behind TPOT is that it can create an arbitrary pipeline structure: A TPOT pipeline can have as many operators as it needs, and even perform separate analyses on copies of the data set thanks to the "Combine DFs" operator.

However, one big question remains: Is having an arbitrarily-large pipeline structure useful? Or is all we need a data preprocessor, then a feature preprocessor, then a modeling step?

We should explore this question more by taking the current version of TPOT and comparing it to a version of TPOT that fixes the pipeline structure to three steps: data preprocessing (variance threshold, standard scaler, robust scaler), feature preprocessing (polynomial features, PCA, all feature selection methods), then a modeling step (all of the models).

Perhaps we can also compare it to a four-step pipeline structure: data preprocessing (variance threshold, standard scaler, robust scaler), feature preprocessing (polynomial features, PCA), feature selection (all feature selection methods), then a modeling step (all of the models). Perhaps having feature selection as a separate step just prior to the modeling step could be useful.

In either of the "fixed pipeline structure TPOT" cases, mutations would be restricted to replacing and tuning the appropriate operators in each step. Crossover would have to be prevented from creating invalid pipelines as well. This would likely entail rolling custom mutation and crossover operators.
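A sketch of what step-restricted mutation and crossover might look like (the operator names and the dict-based pipeline representation here are illustrative, not TPOT's actual internals):

```python
import random

# Hypothetical operator pools for a fixed three-step pipeline.
STEP_OPERATORS = {
    "data_preprocessing": ["VarianceThreshold", "StandardScaler", "RobustScaler"],
    "feature_preprocessing": ["PolynomialFeatures", "PCA", "SelectKBest"],
    "model": ["RandomForestClassifier", "LogisticRegression", "GradientBoostingClassifier"],
}

STEPS = list(STEP_OPERATORS)

def random_pipeline(rng=random):
    """Sample one operator per fixed step."""
    return {step: rng.choice(ops) for step, ops in STEP_OPERATORS.items()}

def mutate(pipeline, rng=random):
    """Replace the operator in one step with another operator from the
    same step, so the three-step structure can never be violated."""
    step = rng.choice(STEPS)
    new = dict(pipeline)
    new[step] = rng.choice(STEP_OPERATORS[step])
    return new

def crossover(a, b, rng=random):
    """Take each step from one parent or the other; the child always
    keeps a valid one-operator-per-step structure."""
    return {step: rng.choice([a[step], b[step]]) for step in STEPS}
```

Because mutation and crossover both operate step-wise, invalid orderings (e.g. a model feeding a preprocessor) are unrepresentable by construction.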

bartleyn (Contributor) commented Apr 2, 2016

I agree that we could probably structure the pipelines at least a little bit to reduce the complexity, as I've definitely gotten runs where the best pipeline had feature selection/preprocessing steps happening after the modeling steps, which seems unnecessary.

bartleyn (Contributor) commented May 3, 2016

Can we bypass having to roll custom mutation and crossover operators by changing how we pass the datasets around? I propose that we wrap the primary data structure (a pandas DataFrame / NumPy matrix) in different TPOT-specific subclasses and change the pipeline operators' typed contracts accordingly to restrict how operators get assembled. For example, the code where we add operators might look like the following:

# data preprocessor
self._pset.addPrimitive(self._standard_scaler, [pd.DataFrame], tpot.DataPreprocessed)

# feature preprocessor
self._pset.addPrimitive(self._binarizer, [tpot.DataPreprocessed, float], tpot.FeaturePreprocessed)

# feature selection
self._pset.addPrimitive(self._select_kbest, [tpot.FeaturePreprocessed, int], tpot.FeatureSelected)

# machine learning operator
self._pset.addPrimitive(self._random_forest, [tpot.FeatureSelected, int], tpot.MachineLearned)

Is this sensible? I'm assuming that the subclasses would pretty much just be wrappers for the original data structure, in an effort to keep things lightweight.
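For illustration, a minimal sketch of what those lightweight wrappers and typed operators could look like (class names mirror the hypothetical tpot.DataPreprocessed etc. above; the operator bodies are stubbed):

```python
# Thin wrapper types for DEAP's strongly-typed GP. Each stage type just
# carries the underlying data structure (a DataFrame or NumPy array)
# without copying it.
class StageResult:
    __slots__ = ("data",)
    def __init__(self, data):
        self.data = data

class DataPreprocessed(StageResult): pass
class FeaturePreprocessed(StageResult): pass
class FeatureSelected(StageResult): pass
class MachineLearned(StageResult): pass

def standard_scaler(df) -> DataPreprocessed:
    # ... scale df here (stubbed) ...
    return DataPreprocessed(df)

def binarizer(inp: DataPreprocessed, threshold: float) -> FeaturePreprocessed:
    # ... binarize inp.data against threshold (stubbed) ...
    return FeaturePreprocessed(inp.data)
```

Registered with DEAP's typed primitive set (e.g. addPrimitive(binarizer, [DataPreprocessed, float], FeaturePreprocessed)), the GP engine could then only ever generate trees where each stage consumes the output type of the previous one.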

rhiever (Contributor, Author) commented May 6, 2016

That's brilliant! Yes, I think that would work just fine for these purposes. The only tricky part could be allowing TPOT to perform no preprocessing, but I suppose we could create a "no preprocessing" operator for each step.
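Such a "no preprocessing" operator could be a typed identity function, e.g. (a sketch, with a hypothetical stage type):

```python
# A pass-through operator for one stage of the fixed pipeline.
class DataPreprocessed:
    def __init__(self, data):
        self.data = data

def no_data_preprocessing(df) -> DataPreprocessed:
    """Identity operator: tags the raw data as 'preprocessed' without
    changing it, so the GP can effectively skip this stage while the
    typed contract between stages stays intact."""
    return DataPreprocessed(df)
```

One such identity operator per stage would let evolution choose shorter effective pipelines without ever breaking the fixed structure.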

tonyfast commented:

I imagine that very complex graphs could exist where preprocessors could switch roles. Are pipelines understood well enough to create a rigid ontology?

The DataFrame class is very robust, and it may be dangerous to choose something else instead. It is good to know DataFrame in and DataFrame out; custom classes could get confusing.

A base class

A TPOT model base class will have to: (1) set its state, (2) fit a model, (3) predict with a model, (4) update a DataFrame, and (5) export code.

This notebook tinkers around with the idea of the TPOT model. This class would export jinja-templated code along with the docstring in a notebook cell.

A sample for a random forest may look like:

class _random_forest(Model):
    """Fits a random forest classifier

    Parameters
    ----------
    input_df: pandas.DataFrame {n_samples, n_features+['class', 'group', 'guess']}
        Input DataFrame for fitting the random forest
    max_features: int
        Number of features used to fit the decision tree; must be a positive value

    Returns
    -------
    input_df: pandas.DataFrame {n_samples, n_features+['guess', 'group', 'class', 'SyntheticFeature']}
        Returns a modified input DataFrame with the guess column updated according to the classifier's predictions.
        Also adds the classifier's predictions as a 'SyntheticFeature' column.

    """
    _package = 'sklearn.ensemble.RandomForestClassifier'
    _source = """
from {{package_path[:-1] | join('.')}} import {{package_path | last}}

rfc{{operator_num}} = RandomForestClassifier(
    n_estimators = {{n_estimators}},
    max_features = {{max_features}},
)
rfc{{operator_num}}.fit(
    {{input_df}}.loc[training_indices].drop('class', axis=1).values, 
    {{input_df}}.loc[training_indices, 'class'].values,
)
    """

    def preprocess(self, input_df: pd.DataFrame, max_features: int = 4):
        if max_features < 1:
            max_features = 'auto'
        elif max_features == 1:
            max_features = None
        elif max_features > len(input_df.columns) - 3:
            max_features = len(input_df.columns) - 3
        return {
            'max_features': max_features,
            'n_estimators': 500,
            'random_state': 42, 
            'n_jobs': -1
        }

Some of the models are very similar in execution. There may be a few base models required to cover the current set of sklearn models.

Many of these models could be combined using:

from toolz import compose
pipeline = compose(_random_forest, _select_kbest, _binarizer, _standard_scaler)

bartleyn (Contributor) commented:

> I imagine that very complex graphs could exist where preprocessors could switch roles.

Can you explain a little more about what you mean by preprocessors switching roles?

> Are pipelines understood well enough to create a rigid ontology?

I think it's less that pipelines are understood well enough, and more that we're trying to align ontologies for structured pipelines with related work in autoML (e.g., the auto-sklearn project and paper found here). So it seems mostly just convenient to me (but I could be wrong).

> The DataFrame class is very robust, and it may be dangerous to choose something else instead. It is good to know DataFrame in and DataFrame out; custom classes could get confusing.

This is just a tentative approach to structuring these pipelines; rather than gutting the DEAP evolutionary algorithm, we could start with this to see whether it's worth investing more time in it.

I think the base model class idea is super interesting, and aligns with that pending major OOP refactor -- your idea is perhaps a more long-term solution to the problem.

tonyfast commented May 31, 2016

I am not too sure what I meant in the beginning. Maybe I should have said: "We don't know how complicated these models will be in the future. Choosing types to build pipelines now may pigeonhole building complicated pipelines in the future."

Thanks for that reference @bartleyn. That clarifies the process for me. Something like multipledispatch or odo could be helpful for automating these pipelines.

I don't know much about DEAP, but from my brief research into it over the past few days I doubt it would have to be gutted. Maybe a few things could get monkey-patched, but it seems to have a nice plugin system for extending methods.

It is starting to appear that TPOT may be something like class TPOT(sklearn.base.ClassifierMixin, deap.gp.eaSimple)

bartleyn (Contributor) commented:

Are we still interested in testing some of this in the current state of TPOT? I went ahead and implemented my idea, and am happy to try and get a PR working.

rhiever (Contributor, Author) commented Jun 10, 2016

Once we get 0.4 out, I'd love to explore this idea. Several people (including @tonyfast) have expressed interest in designing a "grammar" for TPOT pipelines to help constrain the pipeline structure.

tonyfast commented:

I would really like to discuss this idea with y'all. I would like to champion bringing the API closer to native scikit-learn opinions. I started working on a refactor with the expectation of making generic model pipelines, but I think at this point it was largely exploratory, to understand the data flow. One of the main successes was being able to have tpot always end on a classifier.

After some deeper dives into scikit-learn, there are tools like Pipeline and FeatureUnion that would assist tpot through the evolution process. There is an example of a deap model built on scikit-learn.
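For reference, a minimal sketch of composing Pipeline and FeatureUnion in plain scikit-learn (the dataset and hyperparameters are illustrative); FeatureUnion runs parallel feature paths and concatenates their outputs, much like TPOT's "Combine DFs" operator:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# A small synthetic classification problem for illustration.
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # PCA and univariate selection run in parallel; their outputs are
    # concatenated column-wise before reaching the classifier.
    ("features", FeatureUnion([
        ("pca", PCA(n_components=3)),
        ("kbest", SelectKBest(f_classif, k=4)),
    ])),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X, y)
```

A fixed-structure TPOT could evolve which transformer fills each named slot while the Pipeline object itself guarantees a valid ordering.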

Is there a place we can start putting research and ideas into for the post-0.4 release development?

rhiever (Contributor, Author) commented Jun 13, 2016

> Is there a place we can start putting research and ideas into for the post-0.4 release development?

You can start up issues on the repo for us to discuss, or use an existing issue if it's related.

GinoWoz1 commented Sep 28, 2018

Hey @weixuanfu, @tonyfast, where did this end up? As a recent adopter of TPOT I am fascinated by these ideas and curious about their status. I am so-so in Python but catching on fast, and would be willing to allocate some time toward this endeavor.

GinoWoz1 commented Oct 3, 2018

@weixuanfu @rhiever, bumping this. Any ideas? I can help brainstorm a high-level grammar and help push this along.

I ran two tpot runs this week. One finished within 2 days with relatively good pipelines. Another ran for 4 days and was using heavy computing power going through 5- and 6-level pipelines (a double function transformer, a one-hot encoder for data already set up with dummies, etc.).

GinoWoz1 commented Oct 8, 2018

@weixuanfu I would like to contribute; I will open up a separate issue if this is ignored. All I am asking is what work has been done on this so I can help. I already had an in-depth call with Randy about the project and what aspects need to be worked on to move this forward. I am a master's student in data science who is possibly doing a thesis on this subject.

weixuanfu (Contributor) commented:

@GinoWoz1 We are working on and testing a new template feature for TPOT (see this branch), which may be similar to the grammar function you mentioned. I need to discuss this issue with the other members of the TPOT team and will keep you posted.

GinoWoz1 commented Oct 8, 2018

Thanks, please let me know where I can contribute. I am a full-time student and am doing a presentation next month on TPOT. I have taken a deep dive into the code for TPOT as well as DEAP to understand their inner workings.

@weixuanfu
Copy link
Contributor

@GinoWoz1 We'd like to invite you to contribute to this feature. I will send you an email soon to schedule a meeting for some discussion.

cottrell (Contributor) commented:

Curious if there has been more discussion on this. I just hit "Warning: sklearn.pipelines.FeatureUnion is not available and will not be used by TPOT." ... and came to the party.

aastha3 commented Feb 2, 2019

What does a bdbQuit indicate? Alternatively, how do you define a sparse feature?
