
Brainstorm: How can we keep the pipelines trained so they don't need to re-train on each call? #11

Closed
rhiever opened this issue Nov 12, 2015 · 10 comments


rhiever (Contributor) commented Nov 12, 2015

Currently, TPOT requires the training data to be passed along with any additional data so the pipeline can re-train the sklearn models on the training data. This is necessary because the pipeline consists of functions: each random forest, decision tree, etc. is a function, and its model is garbage collected as soon as the function terminates.

Let's brainstorm: How can we design these functions so the models remain persistent and don't need to be re-trained? This is really only important for the final pipeline, where the user would be performing score() and predict() calls against the pipeline.

datnamer commented

Check this out: http://blaze.pydata.org/blog/2015/10/19/dask-learn/

Dask can also be used to parallelize.

rhiever (Contributor, Author) commented Mar 20, 2016

We should look at using sklearn Pipeline objects to represent our pipelines. I believe that could go a long way toward solving this issue.
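
For context, a minimal sketch of why that would help (plain sklearn, nothing TPOT-specific): a fitted Pipeline keeps its fitted steps as attributes on the object, so repeated predict()/score() calls don't re-train anything.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

iris = datasets.load_iris()

# The fitted models live on the Pipeline object itself,
# so they persist across calls instead of being rebuilt.
pipe = Pipeline([('scale', StandardScaler()),
                 ('tree', DecisionTreeClassifier())])
pipe.fit(iris.data, iris.target)

pipe.predict(iris.data[:5])         # no re-training
pipe.score(iris.data, iris.target)  # no re-training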

KobaKhit commented

The sklearn docs section "3.4. Model persistence" might be helpful. Basically, it's an example of how to save a model as a pickle. Below is the code example from the linked page:

>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)  
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0

Maybe create a new instance variable, self.saved_pipe = pickle.dumps(clf), that saves the pipeline of the best model in a generation, and update it every generation.
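
A minimal sketch of that idea. The class name, the end_of_generation() hook, and its arguments are all hypothetical, not part of TPOT's actual API:

import pickle

class PipelineOptimizer:
    def __init__(self):
        self.saved_pipe = None          # pickled bytes of the best pipeline so far
        self.best_score = float('-inf')

    def end_of_generation(self, best_pipeline, best_score):
        # Hypothetical hook: called once per generation with that
        # generation's best fitted pipeline and its score.
        if best_score > self.best_score:
            self.best_score = best_score
            self.saved_pipe = pickle.dumps(best_pipeline)

    def predict(self, X):
        # Restore the persisted pipeline instead of re-training it.
        return pickle.loads(self.saved_pipe).predict(X)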

rasbt (Contributor) commented Apr 11, 2016

@KobaKhit I think pickle is generally a good idea in terms of efficiency; however, and maybe I am too paranoid :P, I always worry about compatibility issues (e.g., the different pickle protocols and Py version incompatibilities). In any case, if pickle is only used internally during model training on the particular system where TPOT is "trained", this wouldn't be a problem I guess.
However, instead of pickle, I'd maybe suggest joblib, since it's better at storing NumPy arrays; there was a discussion on the mailing list today where someone pickled a random forest (50 trees, 23 features, 20 MB dataset): 50 MB with standard pickle, ~15 MB with joblib.
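
A rough sketch of the joblib route (in scikit-learn versions of that era, joblib shipped bundled as sklearn.externals.joblib; the standalone joblib package works the same way):

from sklearn.externals import joblib  # in newer versions: import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets

iris = datasets.load_iris()
clf = RandomForestClassifier(n_estimators=50)
clf.fit(iris.data, iris.target)

# joblib is more efficient than plain pickle at serializing the
# large NumPy arrays stored inside the fitted trees.
joblib.dump(clf, 'forest.pkl')
clf2 = joblib.load('forest.pkl')
clf2.predict(iris.data[:1])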

Personally, I switched over to using JSON files for model persistence (e.g., when using scikit-learn or also other things). This way, I always have a human readable record of everything (in case pickle files get corrupted or are incompatible in a different environment); sure, this would probably slower computationally, but I think it would be the more "robust" or "reproducible" option. Someone else asked me about that recently so I put up a quick ipynb with an example specific to sklearn if you are interested: http://nbviewer.jupyter.org/github/rasbt/python-machine-learning-book/blob/master/code/bonus/scikit-model-to-json.ipynb
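
The linked notebook has the full treatment; very roughly, the idea is to split an estimator into its hyperparameters (JSON-serializable via get_params()) and the fitted NumPy attributes (written out as lists). A sketch under those assumptions, not the notebook's exact code:

import json
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
clf = LogisticRegression().fit(iris.data, iris.target)

# Hyperparameters plus the fitted attributes needed to restore predict().
state = {'params': clf.get_params(),
         'coef_': clf.coef_.tolist(),
         'intercept_': clf.intercept_.tolist(),
         'classes_': clf.classes_.tolist()}
with open('model.json', 'w') as f:
    json.dump(state, f)

# Rebuild the estimator from the human-readable record.
with open('model.json') as f:
    state = json.load(f)
clf2 = LogisticRegression(**state['params'])
clf2.coef_ = np.array(state['coef_'])
clf2.intercept_ = np.array(state['intercept_'])
clf2.classes_ = np.array(state['classes_'])
clf2.predict(iris.data[:1])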

fivejjs commented Apr 20, 2016

We can extend this idea to a "pipeline" knowledge base, or knowledge graph. Then, given a pile of data, the system can figure out a suitable "pipeline".

rhiever (Contributor, Author) commented Aug 19, 2016

This issue is now handled because we're working directly with sklearn Pipeline objects that are persistent.

woodrujm commented

I get this error when trying to pickle the tpot model:

import pickle

with open('tpot_.pkl', 'wb') as xx:
    pickle.dump(tpot, xx)

PicklingError: Can't pickle <class 'tpot.operator_utils.XGBClassifier__learning_rate'>: attribute lookup XGBClassifier__learning_rate on tpot.operator_utils failed

weixuanfu (Contributor) commented Oct 31, 2017

@woodrujm I think the issue is related to #520, about pickling the TPOT object. For now, the entire TPOT object is not pickleable due to that attribute lookup issue, but you should be able to pickle these attributes in the TPOT API. We may work on this pickleability issue later.

woodrujm commented

Okay, thanks for the reply. TPOT has been great so far!

frahlg commented Nov 2, 2017

@woodrujm, if you instead use pickle.dump(tpot.fitted_pipeline_, xx), it works as you might have intended.
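
A sketch of that workaround, assuming tpot has already been fit and X_test is your held-out data (both names come from your own session):

import pickle

# Persist only the best fitted sklearn Pipeline, not the TPOT object.
with open('tpot_pipeline.pkl', 'wb') as f:
    pickle.dump(tpot.fitted_pipeline_, f)

# Later, e.g. in a fresh session:
with open('tpot_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)
pipeline.predict(X_test)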
