First pass at pipeline export functionality #37
Conversation
Some odd bugs remain with sklearn model fields that can take both string and integer values.
```
# Conflicts:
#	.gitignore
#	tpot/tpot.py
```
The decline in coverage is expected. The new

```python
if self.optimized_pipeline_ == None:
```
```python
if self.optimized_pipeline_ == None:
```

should be

```python
if not self.optimized_pipeline_:
```
Isn't the equality more explicit?

I just noticed in the `score()` function it now says:

```python
if self.optimized_pipeline_ is None:
```

Personally, I think the equality is better since it is explicitly checking for the default state (`None`). I'm open to being convinced otherwise though.
Oh I see; I thought this way it may be more robust, accounting for empty objects like `""`, `[]`, `{}`. But maybe being explicit is not a bad idea here.

However,

```python
if self.optimized_pipeline_ is None:
```

is definitely preferred over

```python
if self.optimized_pipeline_ == None:
```

It's more efficient (because you don't have the `__eq__` overhead) and also cleaner style.
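The difference between the three checks discussed above can be shown in a few standalone lines (`AlwaysEqual` is an invented class for illustration, not anything from the TPOT code base):

```python
class AlwaysEqual:
    """A class whose __eq__ claims equality with everything, including None."""
    def __eq__(self, other):
        return True

empty = []
tricky = AlwaysEqual()

# Truthiness catches None, but also empty containers like [] and "".
print(not empty)        # True -- an empty list is falsy, though it is not None
print(empty is None)    # False -- identity only matches None itself

# Equality dispatches through __eq__, which a class can override.
print(tricky == None)   # True -- the custom __eq__ fools the comparison
print(tricky is None)   # False -- identity cannot be overridden
```

So `is None` is the only check that means exactly "this attribute is still in its default state", which is the intent in `score()`.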
Sure, I can take a more detailed look at it later when I get home. However, would it be possible to add a small unit test comparing the actual output with what you'd expect? I think that would be helpful as "documentation" and to follow along.
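A sketch of what such a comparison test could look like (everything here is hypothetical: `generate_pipeline_code` is a stand-in stub for whatever internal function produces the exported source; a real test would call TPOT's exporter and compare against the expected pipeline code instead):

```python
import unittest


def generate_pipeline_code():
    # Stand-in for the real exporter; returns the generated source as a string.
    return "import pandas as pd\n\ntpot_data = pd.read_csv('data.csv')\n"


class TestExport(unittest.TestCase):
    def test_exported_code_matches_expected(self):
        expected = (
            "import pandas as pd\n"
            "\n"
            "tpot_data = pd.read_csv('data.csv')\n"
        )
        # Comparing the full generated source doubles as documentation of
        # exactly what the export is supposed to produce.
        self.assertEqual(generate_pipeline_code(), expected)

# Run with: python -m unittest <this_module>
```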
Sure, here's an example:

```python
from tpot import TPOT
from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75)

tpot = TPOT(generations=1, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_train, y_train, X_test, y_test))
tpot.export('tpot_export.py')
```

TPOT will likely discover that a random forest alone does well on the data set, so

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75)))

result1 = tpot_data.copy()

# Perform classification with a random forest classifier
rfc1 = RandomForestClassifier(n_estimators=97, max_features=min(8, len(result1.columns) - 1))
rfc1.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces]['class'].values)
result1['rfc1-classification'] = rfc1.predict(result1.drop('class', axis=1).values)
```

If you want to perform further inspection on the pipelines, if you
Oh, and let's make sure to close #36 when this is merged.

Now when the user ends TPOT early on the command line and has provided an output file, the current best pipeline will be exported.
Sounds cool! Maybe a nice addition would be to refactor the
Changed equality operators to “is” syntax. Changed exceptions raised from not having an optimized pipeline to a ValueError.
Oh, sure, but I was more thinking of "bloated log files" here. I think that log/status files are generally helpful to keep track of errors, but also to judge progress if you are running stuff remotely and to analyze what's going on under the hood (e.g., think back to ye goode olde times submitting jobs to HPCC ;)). So, I was thinking to refactor it into a more bare-bones `export_params` and an `export_pipeline_standalone` or so. But this is just a general suggestion, doesn't have to be now :)
Besides that, the code looks fine to me so far. But I haven't run it through a debugger yet or looked at it in detail... sorry, it's the end of the year and I am pretty busy wrapping things up before I go on a family visit, but January should be a good time for a fresh start :)
Alrighty, I think I'll merge this for now since I'd like to push this functionality out. Please file bug reports against it if you see anything wrong.
First pass at pipeline export functionality

Per #5

This PR adds the `export()` function, which allows the user to specify an output file for the pipeline. Once `export()` is called (and a pipeline has already been optimized), this function converts the pipeline into its corresponding Python code and exports it to the specified output file.

Docs have been updated along with this PR to demonstrate how the `export()` function works.

@rasbt, I would greatly appreciate if you could review this before we merge it.