[WIP] Refactor tpot to many sklearn models #164

tonyfast · 2016-06-03T20:23:31Z

What does this PR do?

This PR is a major refactor of tpot (#91) built on sklearn models. It introduces two new dependencies, toolz and traitlets, and makes it easier to create new models.

I really wanted to understand the inner workings of tpot, so this is part research and part serious. I used the existing refactor that @teaearlgraycold is working on for inspiration. I think the two pull requests could be meshed together to produce the big refactor.

I still need to add quite a few models. Currently everything seems to work except for the scoring, and I need to write tests to confirm that.

High level changes

tpot is now a ClassifierMixin, which allows the different scoring functions from "Rework custom scoring functionality" (#156) to be applied. Using sklearn mixins should make it easier to control error functions. See also "Add option to limit the Classifiers/Regressors that TPOT uses" (#146).
Separated deap methods from tpot methods. Bespoke primitive functions were moved to primitives.py.
Introduced a PipelineEstimator class that can score an individual during evaluation.
Introduced an EvaluateEstimator class, which lets sklearn models be introduced with strongly typed parameters. The base class lives in the models subdirectory, and the underlying scripts are mapped to their sklearn toolbox.
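To illustrate the first point, here is a rough sketch of what inheriting from sklearn's ClassifierMixin provides: a default accuracy-based `score` method, while any other metric from `sklearn.metrics` can still be applied to the predictions. `MajorityClassifier` is a hypothetical toy estimator made up for this example. It is not part of this PR and says nothing about how tpot itself implements `fit` or `predict`.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import balanced_accuracy_score


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy estimator: always predicts the most common training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.classes_ = values
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


X = np.zeros((4, 1))
y = np.array([0, 0, 0, 1])

clf = MajorityClassifier().fit(X, y)

# score() is inherited from ClassifierMixin and reports plain accuracy.
accuracy = clf.score(X, y)

# A different scoring function can be swapped in without touching the estimator.
balanced = balanced_accuracy_score(y, clf.predict(X))
```

Three of the four labels match the majority class, so the inherited accuracy is 0.75, while balanced accuracy averages the per-class recall (1.0 and 0.0) to 0.5.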

Creating a model

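from functools import partial

from sklearn.decomposition import FastICA
from traitlets import Float, Int

# EvaluateEstimator is the base class this PR adds under the models subdirectory.
class fast_ica(EvaluateEstimator):
    model = FastICA
    # Clamp n_components to the range [1, number of columns in the DataFrame].
    n_components = Int(default_value=0).tag(
        df=True,
        apply=lambda df, nc: 1 if nc < 1 else min(nc, len(df.columns))
    )
    # Keep tol at or above 0.0001.
    tol = Float().tag(
        apply=partial(max, .0001)
    )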
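To make the two `apply` callables in the snippet above concrete, here is a small sketch of what each one does to a raw value. It assumes (my reading of the tags, not something stated in the PR) that the first callable receives the current DataFrame plus the evolved value, and the second receives only the value. `fake_frame` is a hypothetical stand-in for a DataFrame with three columns, so the sketch runs without pandas.

```python
from functools import partial
from types import SimpleNamespace


def clamp_n_components(df, nc):
    # Same logic as the n_components tag: at least 1, at most the column count.
    return 1 if nc < 1 else min(nc, len(df.columns))


# Same logic as the tol tag: never go below 0.0001.
clamp_tol = partial(max, .0001)

# Hypothetical stand-in: only the `columns` attribute is needed here.
fake_frame = SimpleNamespace(columns=["f0", "f1", "f2"])

n_low = clamp_n_components(fake_frame, 0)    # below 1, raised to 1
n_ok = clamp_n_components(fake_frame, 2)     # within range, left alone
n_high = clamp_n_components(fake_frame, 10)  # capped at the column count

tol_low = clamp_tol(0.0)   # raised to the 0.0001 floor
tol_ok = clamp_tol(0.01)   # already above the floor, left alone
```

The point of the clamping is that an evolved integer or float can take any value, yet the underlying sklearn model only ever sees a parameter it can accept.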

Creates a MultiIndex Pandas DataFrame for the source data.

This should cut down on pandas operations. The first index level is boolean and marks each row as test (True) or train (False). The next level holds the classes as integers; string names can be recovered later.

test_data = data_source.xs(True, level=0)    # rows flagged as test
train_data = data_source.xs(False, level=0)  # rows flagged as train
test_data.index.values  # array of the class identifiers
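A minimal sketch of the layout described above, on made-up toy data. The level names `is_test` and `class_id` and the feature columns are assumptions for illustration; the PR does not say what the levels are called. Selection uses `DataFrame.xs`, which drops the selected level and leaves the class identifiers as the index.

```python
import pandas as pd

# Hypothetical toy data: two feature columns, four rows.
data_source = pd.DataFrame({
    "f0": [0.1, 0.4, 0.3, 0.9],
    "f1": [1.0, 0.0, 1.0, 0.0],
})

# First level: True for test rows, False for train rows.
# Second level: the class of each row as an integer.
data_source.index = pd.MultiIndex.from_arrays(
    [[False, False, True, True], [0, 1, 0, 1]],
    names=["is_test", "class_id"],  # assumed names
)

test_data = data_source.xs(True, level="is_test")
train_data = data_source.xs(False, level="is_test")

# After the split, the remaining index holds the class identifiers.
test_labels = test_data.index.values
train_labels = train_data.index.values
```

Keeping the split flag and the labels in the index means the feature columns can be handed to an estimator without first dropping bookkeeping columns.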

#113 suggests using numpy arrays, but a well-structured DataFrame could extend to xarray and dask. It should also make it easier to discover any copying problems (#78).

Where should the reviewer start?

How should this PR be tested?

I still need to add tests and replace the documentation.

Any background context you want to provide?

I love tpot. It is the first tool I have used that truly discovers things I wouldn't have found myself.

What are the relevant issues?

I added the references above.

Screenshots (if appropriate)

Questions:

Do the docs need to be updated?
Does this PR add new (Python) dependencies? Yes: toolz and traitlets. I think these are sane choices. traitlets is a critical IPython utility and toolz only requires the standard lib.

rhiever · 2016-06-04T02:42:00Z

Hey! I'm stoked that you're so into TPOT lately. I'm currently focused on getting v0.4 out, but I promise I'll join the conversation about the major refactor soon. :-)

BTW, one thing to keep in mind: We're trying to keep TPOT lean in terms of dependencies, so adding a new dependency (especially ones not in Anaconda) will be a hard sell. It's very important to me that TPOT remains easy to install.

tonyfast · 2016-06-04T03:19:11Z

I totally respect the concern about adding dependencies. traitlets is part of Anaconda because it is used in the notebook and IPython. toolz can be installed in an Anaconda environment using pip; it extends itertools and functools from the standard lib with an underlying interest in parallelizable code. If there is a hard stop on dependencies being available in Anaconda, then conda-forge is always an option.

I am going to keep working on this. I'd be stoked to bounce ideas off of @teaearlgraycold while you are tied up with the 0.4 release.

danthedaniel · 2016-06-04T03:22:42Z

Well, I'm the main guy who will be pushing the 0.4 release forward. I'd say maybe hold off until that's out (should be soon).

Also you seem to be changing up the code style a lot, which I'd warn against.

tonyfast · 2016-06-04T03:29:12Z

I intend to bring the coding style back closer to what y'all have been working with. All of the code is pep8 compliant at the moment except for some comments. I am trying to get a hold of the model itself; it is a bit confusing. This pull request is part research and part serious.

I am offering up this code to see if I am understanding the model clearly from a total outsider perspective. I think there are some awesome UI features that can be built onto tpot using the Jupyter notebook. I hope some of these intentions can be useful to the project.

tonyfast · 2016-06-05T17:18:21Z

Below are the UML diagrams for the current refactor. The refactor is mostly working, but I need to track down some heisenbugs: it is weird when you get different errors every time you run the same function. I have been using this notebook for development.

models.base does a lot of the heavy lifting. It uses the sklearn base classes to decide whether to produce a transform, a mask, or a classification.

I made some changes to the Primitive diagram. main exports a pandas Series at the end. Only certain sklearn models can return a Series. Exporting a Series is analogous to saying, "Hey, I made a classification". Classifiers can also return a DataFrame, which allows them to be placed as an intermediate node in the graph. Basically, I added this to ensure that the algorithm evaluates a classifier.

Update: All of the models complete for the MNIST dataset

The highest score is 0.982261640798

fit errors 4 vs. score errors 2 of 275 executions

knnc(df, sub(87, 98))

rhiever · 2016-08-03T20:51:31Z

@tonyfast, check out the development branch if you'd like to see where we're heading with TPOT in the immediate future. I think, using the same kind of compile-DEAP-pipelines-to-sklearn-pipelines code, we could also have TPOT directly evolving sklearn pipelines as well.

rhiever · 2016-08-13T18:32:51Z

Going to close this PR since we have a version of it in the dev branch now.

tonyfast added 8 commits May 29, 2016 01:00

Start making jinja templates for the code ooutputs

4c7ac00

Add jinja Extension for the output pipeline

5e08dca

Add jinja Extension for the output pipeline

fcead8d

Merge branch 'master' of https://github.com/rhiever/tpot into tonyfast-jinja

5f0057e

New model classes

f1fbc44

Update tpot

6c1d661

remove other file

99194f3

more refactoring

fe8c0d7

tonyfast changed the title ~~[WIP] Refactor~~ [WIP] Refactor tpot to many sklearn models Jun 3, 2016

tonyfast added 6 commits June 4, 2016 02:11

Add models to the basic deployment

f82a349

Add preprocessing models

55a06e4

Remove pipeline model

8e47dca

Update dataframe functions

358b163

Add uml diagram

3ce27d2

Demo notebook

c9bc1c1

tonyfast added 10 commits June 5, 2016 14:28

Add a null series terminal

8fb9660

Add widget snapshot

7960092

MNIST with pop = 250 and gen = 10

3422659

experimental refactor

f36518a

Tighten up some lines

7765429

Updates

744d5c2

Start commenting

ad04ef0

Sample notebook

3287495

Sample notebook

0a263d8

sample notebook

aa08c14

tonyfast added 7 commits June 22, 2016 10:43

Delete Untitled162.ipynb

6bb18e4

sample notebook

d901201

Merge branch 'refactor' of https://github.com/tonyfast/tpot into refactor

3718d59

Remove double model pipelines

7739136

Add exports option for regressor or classifier

2219111

Add notebook with regressors

67cd3bf

Regressor model views

af1002b

rhiever closed this Aug 13, 2016

AIAdventures mentioned this pull request Jun 6, 2017

Titanic example -problem with 2nd last cell. #492

Closed
