[WIP] Refactor tpot to many sklearn models #164

Closed
wants to merge 31 commits into from

@tonyfast tonyfast commented Jun 3, 2016

What does this PR do?

This PR is a major refactor (#91) of tpot built around sklearn models. It introduces two new dependencies, toolz and traitlets, and eases the creation of new models.

I really wanted to understand the inner workings of tpot, so this is part research and part serious. I used the existing refactor that @teaearlgraycold is working on for inspiration. I think the two pull requests could be meshed to produce the big refactor.

I still need to add quite a few models. Currently everything seems to work except for the scoring, and I am going to need to write tests to confirm that.

High level changes
Creating a model
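from functools import partial

from sklearn.decomposition import FastICA
from traitlets import Float, Int

# EvaluateEstimator is the base class introduced by this PR.
class fast_ica(EvaluateEstimator):
    model = FastICA
    # Clamp n_components to the range [1, number of columns].
    n_components = Int(default_value=0).tag(
        df=True,
        apply=lambda df, nc: 1 if nc < 1 else min(nc, len(df.columns))
    )
    # Keep tol at or above 0.0001.
    tol = Float().tag(
        apply=partial(max, .0001)
    )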
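
As a rough sketch of how tagged traits like these could be consumed, the snippet below reads each trait's `apply` metadata back and runs it against the current value. `FastICAParams` and `resolve_params` are hypothetical names invented for illustration, and nothing here is confirmed to match what the PR's `EvaluateEstimator` actually does; only the traitlets calls (`HasTraits`, `.tag()`, `.traits()`, `.metadata`) are real API.

```python
from functools import partial

import pandas as pd
from traitlets import Float, HasTraits, Int


class FastICAParams(HasTraits):
    """Hypothetical stand-in for the PR's fast_ica class (no sklearn model attached)."""

    n_components = Int(default_value=0).tag(
        df=True,
        apply=lambda df, nc: 1 if nc < 1 else min(nc, len(df.columns)),
    )
    tol = Float().tag(apply=partial(max, 0.0001))


def resolve_params(params, df):
    """Return {trait name: value} after running each trait's tagged `apply` callable."""
    resolved = {}
    for name, trait in params.traits().items():
        value = getattr(params, name)
        apply = trait.metadata.get("apply")
        if apply is None:
            # No sanitizer tagged on this trait: pass the value through unchanged.
            resolved[name] = value
        elif trait.metadata.get("df"):
            # Traits tagged df=True also receive the DataFrame.
            resolved[name] = apply(df, value)
        else:
            resolved[name] = apply(value)
    return resolved


frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
defaults = resolve_params(FastICAParams(), frame)
clamped = resolve_params(FastICAParams(n_components=10, tol=0.5), frame)
```

With a three-column frame, the default `n_components` of 0 is raised to 1 and a request for 10 components is capped at 3, so arbitrary values proposed by the optimizer always end up in a range the sklearn model can accept.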
Creates a MultiIndex Pandas DataFrame for the source data.

This should cut down on pandas operations. The first index level is boolean and marks whether a row belongs to the test set (True) or the training set (False). The next level holds the classes as integers; the string names can be recovered later.

# .ix has been removed from pandas; .loc is the label-based replacement
test_data = data_source.loc[True]
train_data = data_source.loc[False]
test_data.index.values # is a list of the class identifiers
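
A minimal sketch of that layout, assuming a toy two-feature dataset: the column names, the level names `is_test` and `class`, and the sample values are all made up for illustration. It selects rows with a boolean mask on the first index level instead of a scalar label lookup, and uses `pd.factorize` to keep the string class names recoverable.

```python
import pandas as pd

labels = ["cat", "dog", "cat", "dog"]
is_test = [False, False, True, True]

# Encode string class names as integers; `names` maps each code back to its string.
codes, names = pd.factorize(labels)

data_source = pd.DataFrame({"x1": [0.1, 0.4, 0.3, 0.9], "x2": [1.0, 0.2, 0.5, 0.7]})
data_source.index = pd.MultiIndex.from_arrays(
    [is_test, codes], names=["is_test", "class"]
)

# Boolean mask over the first level, then drop that level so only the class remains.
test_mask = data_source.index.get_level_values("is_test").to_numpy(dtype=bool)
test_data = data_source[test_mask].droplevel("is_test")
train_data = data_source[~test_mask].droplevel("is_test")

# The remaining index holds the integer class identifiers.
test_classes = test_data.index.values
test_class_names = names[test_classes]
```

Keeping the split flag and the class in the index means the feature columns never need to be filtered by name before they are handed to a model.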

#113 suggests using numpy arrays, but a well-structured DataFrame could extend to xarray and dask. It should also make any copying problems (#78) easier to discover.

Where should the reviewer start?

How should this PR be tested?

I still need to add tests and replace the documentation.

Any background context you want to provide?

I love tpot. It is the first tool I have used that truly discovers things I wouldn't have found myself.

What are the relevant issues?

I added the references above.

Screenshots (if appropriate)

Questions:

  • Do the docs need to be updated?
  • Does this PR add new (Python) dependencies? toolz and traitlets. I think these are sane choices. traitlets is a critical ipython utility and toolz only requires the standard lib.

@tonyfast tonyfast changed the title from "[WIP] Refactor" to "[WIP] Refactor tpot to many sklearn models" Jun 3, 2016
@rhiever
Contributor

rhiever commented Jun 4, 2016

Hey! I'm stoked that you're so into TPOT lately. I'm currently focused on getting v0.4 out, but I promise I'll join the conversation about the major refactor soon. :-)

BTW, one thing to keep in mind: We're trying to keep TPOT lean in terms of dependencies, so adding a new dependency (especially one not in Anaconda) will be a hard sell. It's very important to me that TPOT remains easy to install.

@tonyfast
Author

tonyfast commented Jun 4, 2016

I totally respect the concern about adding dependencies. traitlets is part of Anaconda because the notebook and ipython use it. toolz can be installed into an Anaconda environment with pip; it extends itertools and functools from the standard lib, with an underlying interest in parallelizable code. If dependencies must be available in Anaconda, then conda forge is always an option.

I am going to keep working on this. I'd be stoked to bounce ideas off of @teaearlgraycold while you are tied up with the 0.4 release.

@danthedaniel
Contributor

Well, I'm the main guy who will be pushing the 0.4 release forward. I'd say maybe hold off until that's out (should be soon).

Also, you seem to be changing up the code style a lot, which I'd warn against.

@tonyfast
Author

tonyfast commented Jun 4, 2016

I intend to bring the coding style back closer to what y'all have been working with. All of the code is pep8 compliant at the moment except for some comments. I am trying to get a hold of the model itself; it is a bit confusing. This pull request is part research and part serious.

I am offering up this code to see if I am understanding the model clearly from a total outsider perspective. I think there are some awesome UI features that can be built onto tpot using the Jupyter notebook. I hope some of these intentions can be useful to the project.

@tonyfast
Author

tonyfast commented Jun 5, 2016

Below are the UML diagrams for the current refactor. The refactor is mostly working, but I need to track down some heisenbugs; it is weird to get different errors every time you run the same function. I have been using this notebook for development.

models.base does a lot of the heavy lifting. It decides whether to produce a transform, a mask, or a classification using the sklearn base classes.

I made some changes to the Primitive diagram. main exports a pandas Series at the end. Only certain sklearn models can return a Series. Exporting a Series is analogous to saying, "Hey, I made a classification". Classifiers can also return a DataFrame, which allows them to be placed as an intermediate node in the graph. Basically, I added this to ensure that the algorithm evaluates a classifier.
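
The Series-versus-DataFrame distinction could look roughly like the sketch below. The `classify` function, its `terminal` flag, and the `prediction` column name are hypothetical and not taken from the PR; the sketch also assumes that the frame's index holds the class labels, as in the MultiIndex layout described earlier in the thread.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def classify(train, test, n_neighbors=1, terminal=True):
    """Fit on `train` (class labels in the index) and predict for `test`.

    terminal=True  -> return a Series: "I made a classification".
    terminal=False -> return a DataFrame with the predictions added as a
                      feature column, so later nodes can keep working on it.
    """
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(train.to_numpy(), train.index.to_numpy())
    predictions = pd.Series(
        model.predict(test.to_numpy()), index=test.index, name="prediction"
    )
    if terminal:
        return predictions
    return test.assign(prediction=predictions.to_numpy())


train = pd.DataFrame({"x": [0.0, 0.1, 1.0, 1.1]}, index=[0, 0, 1, 1])
test = pd.DataFrame({"x": [0.05, 1.05]}, index=[0, 1])

final = classify(train, test)                          # Series of predictions
intermediate = classify(train, test, terminal=False)   # DataFrame with extra column
```

Checking that the root of a pipeline returns a Series is then a cheap way to verify that it ends in a classifier.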


Update: All of the models complete for the MNIST dataset

The highest score is 0.982261640798

fit errors 4 vs. score errors 2 of 275 executions
knnc(df, sub(87, 98))

@rhiever
Contributor

rhiever commented Aug 3, 2016

@tonyfast, check out the development branch if you'd like to see where we're heading with TPOT in the immediate future. I think, using the same kind of compile-DEAP-pipelines-to-sklearn-pipelines code, we could also have TPOT directly evolving sklearn pipelines as well.
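
For readers unfamiliar with the idea, here is a very rough sketch of compiling an evolved description into an sklearn pipeline. It is not the code in the development branch: a flat list of steps stands in for a DEAP expression tree, and the `OPERATORS` registry, the `compile_pipeline` helper, and the operator names are assumptions made for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical registry mapping operator names to sklearn estimator classes.
OPERATORS = {"pca": PCA, "knnc": KNeighborsClassifier}


def compile_pipeline(steps):
    """Turn [(operator name, params), ...] into an sklearn Pipeline."""
    return Pipeline(
        [
            (f"{name}_{position}", OPERATORS[name](**params))
            for position, (name, params) in enumerate(steps)
        ]
    )


pipeline = compile_pipeline(
    [("pca", {"n_components": 1}), ("knnc", {"n_neighbors": 1})]
)

# Tiny two-cluster dataset, just to show the compiled pipeline is usable.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y = [0, 0, 1, 1]
pipeline.fit(X, y)
predicted = pipeline.predict([[0.0, 0.5], [5.0, 5.5]])
```

Because the result is an ordinary `Pipeline`, it can be fitted, scored, and exported with the standard sklearn tooling.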

@rhiever
Contributor

rhiever commented Aug 13, 2016

Going to close this PR since we have a version of it in the dev branch now.
