
First pass at pipeline export functionality #37

Merged: 9 commits merged into master from export, Dec 4, 2015

Conversation

@rhiever (Contributor) commented Dec 2, 2015

Per #5

This PR adds the export() function, which allows the user to specify an output file for the pipeline. Once export() is called (and a pipeline has already been optimized), this function converts the pipeline into its corresponding Python code and exports it to the specified output file.

Docs have been updated along with this PR to demonstrate how the export() function works.

@rasbt, I would greatly appreciate if you could review this before we merge it.

@rhiever (Contributor, Author) commented Dec 2, 2015

The decline in coverage is expected. The new export() function introduces a fair bit of new code without a unit test for it (yet). I'll eventually get to expanding the unit tests... :-)
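A minimal unit test for a code-exporting function like this could follow the pattern below. This is only a sketch of the testing pattern: export_pipeline, the expected string, and the temp-file handling are hypothetical stand-ins, not TPOT's actual API (the real export() generates the pipeline code itself).

```python
import os
import tempfile


def export_pipeline(code, output_file):
    # Hypothetical stand-in for TPOT's export(): write generated code to a file.
    with open(output_file, 'w') as f:
        f.write(code)


def test_export_writes_expected_code():
    # Compare the exported file contents against a known-good string.
    expected = "from sklearn.ensemble import RandomForestClassifier\n"
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, 'tpot_export.py')
        export_pipeline(expected, path)
        with open(path) as f:
            assert f.read() == expected


test_export_writes_expected_code()
```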

-------
None
"""
if self.optimized_pipeline_ == None:
Inline review comment from a Contributor:

 if self.optimized_pipeline_ == None:

should be

if not self.optimized_pipeline_:

@rhiever (Contributor, Author) replied:

Isn't the equality more explicit?

I just noticed in the score() function it now says:

if self.optimized_pipeline_ is None:

Personally, I think the equality is better since it is explicitly checking for the default state (None). I'm open to being convinced otherwise though.

The Contributor replied:

Oh I see; I thought this way it may be more robust, accounting for empty objects like "", [], {}. But maybe being explicit is not a bad idea here.

However

if self.optimized_pipeline_ is None:

is definitely preferred over

if self.optimized_pipeline_ == None:

It's more efficient (because you don't have the __eq__ overhead) and also cleaner style.
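A small self-contained illustration of the trade-offs discussed above (not TPOT code): `== None` can be fooled by a custom `__eq__`, while the truthiness check `if not x:` conflates None with empty containers.

```python
class AlwaysEqual:
    # A pathological class showing why `== None` is not a reliable None check.
    def __eq__(self, other):
        return True


obj = AlwaysEqual()
print(obj == None)   # True  -- __eq__ intercepts the comparison
print(obj is None)   # False -- identity cannot be overridden

# The truthiness check `if not x:` treats empty containers like None,
# which is why `is None` is the explicit, unambiguous test:
for value in (None, [], {}, ""):
    assert not value  # all falsy, not just None
```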

@rasbt (Contributor) commented Dec 2, 2015

@rasbt, I would greatly appreciate if you could review this before we merge it.

Sure, I can take a more detailed look at it later when I get home. However, would it be possible to add a small unit test comparing the actual output with what you'd expect? I think that would be helpful as "documentation" and to follow along.

@rhiever (Contributor, Author) commented Dec 2, 2015

Sure, here's an example:

from tpot import TPOT
from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75)

tpot = TPOT(generations=1, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_train, y_train, X_test, y_test))
tpot.export('tpot_export.py')

TPOT will likely discover that a random forest alone does well on the data set, so tpot_export.py should contain something like:

from itertools import combinations

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75)))

result1 = tpot_data.copy()

# Perform classification with a random forest classifier
rfc1 = RandomForestClassifier(n_estimators=97, max_features=min(8, len(result1.columns) - 1))
rfc1.fit(result1.loc[training_indices].drop('class', axis=1).values, result1.loc[training_indices]['class'].values)
result1['rfc1-classification'] = rfc1.predict(result1.drop('class', axis=1).values)

If you want to inspect the pipelines further: print()-ing the optimized pipeline at any point will give you the nested function version, from which you can deduce what the linear code version should look like.

@rhiever (Contributor, Author) commented Dec 2, 2015

Oh, and let's make sure to close #36 when this is merged. Now when the user ends TPOT early on the command line and they provided an output file, the current best pipeline will be exported.

@rasbt (Contributor) commented Dec 2, 2015

Oh, and let's make sure to close #36 when this is merged. Now when the user ends TPOT early on the command line and they provided an output file, the current best pipeline will be exported.

Sounds cool! Maybe a nice addition would be to refactor the export method a bit so that a slimmer "core" component can be written to a log file. This also lets the user conveniently check the progress over time (e.g., via tail tpot_run_x.log) or so.

@rhiever (Contributor, Author) commented Dec 3, 2015

export() actually runs pretty quick even on large pipelines, so it wouldn't be much overhead to run it every generation.
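Running a slim export every generation could look something like the sketch below. All names here (log_generation, the scores, the nested pipeline strings) are hypothetical, not part of TPOT's API; the point is appending one compact line per generation so the log stays tail-friendly.

```python
import os
import tempfile


def log_generation(log_path, generation, score, pipeline_repr):
    # Append a one-line summary per generation so `tail -f` stays readable.
    with open(log_path, 'a') as log:
        log.write('gen {:>3}  score {:.4f}  {}\n'.format(
            generation, score, pipeline_repr))


# Demonstration with made-up scores and pipeline strings:
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'tpot_run_x.log')
    log_generation(path, 1, 0.9712, '_random_forest(ARG0, 97, 8)')
    log_generation(path, 2, 0.9754, '_random_forest(ARG0, 103, 8)')
    with open(path) as f:
        print(f.read())
```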

rhiever added commits:

Changed equality operators to "is" syntax.

Changed exceptions raised from not having an optimized pipeline to a ValueError.
@rasbt (Contributor) commented Dec 4, 2015

export() actually runs pretty quick even on large pipelines, so it wouldn't be much overhead to run it every generation.

Oh, sure, but I was more thinking of "bloated log files" here. I think that log/status files are generally helpful to keep track of errors, but also to judge the progress if you are running stuff remotely and to analyze what's going on under the hood (e.g., think back of ye goode olde times submitting jobs to HPCC ;)). So, I was thinking to refactor it into a more bare-bones "export_params" and an "export_pipeline_standalone" or so. But this is just a general suggestion, it doesn't have to be now :)

@rasbt (Contributor) commented Dec 4, 2015

Besides that, the code looks fine to me so far. But I haven't run it through a debugger yet and looked at it in detail... sorry, it's the end of the year and I am pretty busy wrapping things up before I go on a family visit, but January should be a good time for a fresh start :)

@rhiever (Contributor, Author) commented Dec 4, 2015

Alrighty, I think I'll merge this for now since I'd like to push this functionality out. Please file bug reports against it if you see anything wrong.

rhiever pushed a commit that referenced this pull request on Dec 4, 2015: "First pass at pipeline export functionality"
rhiever merged commit 1f56114 into master on Dec 4, 2015
rhiever deleted the export branch on December 4, 2015 at 14:08