
Brainstorm: How can we keep the pipelines trained so they don't need to re-train on each call? #11

Closed
rhiever opened this issue Nov 12, 2015 · 10 comments


rhiever (Contributor) commented Nov 12, 2015

Currently, TPOT requires the training data to be passed along with any additional data so the pipeline can re-train the sklearn models on the training data. This is necessary because the pipeline consists of functions: each random forest, decision tree, etc. is a function, and its model is garbage collected as soon as the function terminates.

Let's brainstorm: How can we design these functions so the models remain persistent and don't need to be re-trained? This is really only important for the final pipeline, where the user would be performing score() and predict() calls against the pipeline.

datnamer commented

Check this out: http://blaze.pydata.org/blog/2015/10/19/dask-learn/

Dask can also be used to parallelize.

rhiever (Contributor, Author) commented Mar 20, 2016

We should look at using sklearn Pipeline objects to represent our pipelines. I believe that could go a long way toward solving this issue.
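
For context, a minimal sketch of why that would help (plain sklearn, nothing TPOT-specific): a fitted Pipeline keeps its fitted steps as attributes on the object, so repeated predict()/score() calls don't re-train anything.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

iris = datasets.load_iris()

# The fitted models live on the Pipeline object itself,
# so they persist across calls instead of being rebuilt.
pipe = Pipeline([('scale', StandardScaler()),
                 ('tree', DecisionTreeClassifier())])
pipe.fit(iris.data, iris.target)

pipe.predict(iris.data[:5])         # no re-training
pipe.score(iris.data, iris.target)  # no re-training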

KobaKhit commented

The sklearn docs section "3.4. Model persistence" might be helpful. Basically, it's an example of how to save a model as a pickle. Below is the code example from the linked page:

>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)  
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0

Maybe create a new instance variable, self.saved_pipe = pickle.dumps(clf), that saves the pipeline of the best model in a generation, and update it every generation.
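
A minimal sketch of that idea. The class name, the end_of_generation() hook, and its arguments are all hypothetical, not part of TPOT's actual API:

import pickle

class PipelineOptimizer:
    def __init__(self):
        self.saved_pipe = None          # pickled bytes of the best pipeline so far
        self.best_score = float('-inf')

    def end_of_generation(self, best_pipeline, best_score):
        # Hypothetical hook: called once per generation with that
        # generation's best fitted pipeline and its score.
        if best_score > self.best_score:
            self.best_score = best_score
            self.saved_pipe = pickle.dumps(best_pipeline)

    def predict(self, X):
        # Restore the persisted pipeline instead of re-training it.
        return pickle.loads(self.saved_pipe).predict(X)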

rasbt (Contributor) commented Apr 11, 2016

@KobaKhit I think pickle is generally a good idea in terms of efficiency; however, and maybe I am too paranoid :P, I always worry about compatibility issues (e.g., the different pickle protocols and Py version incompatibilities). In any case, if pickle is only used internally during model training on the particular system where TPOT is "trained", this wouldn't be a problem I guess.
However, instead of pickle, I'd maybe suggest joblib, since it's better at storing NumPy arrays; there was a discussion on the mailing list today where someone pickled a random forest (50 trees, 23 features, 20 MB dataset): 50 MB with standard pickle, ~15 MB with joblib.
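
A rough sketch of the joblib route (in scikit-learn versions of that era, joblib shipped bundled as sklearn.externals.joblib; the standalone joblib package works the same way):

from sklearn.externals import joblib  # in newer versions: import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets

iris = datasets.load_iris()
clf = RandomForestClassifier(n_estimators=50)
clf.fit(iris.data, iris.target)

# joblib is more efficient than plain pickle at serializing the
# large NumPy arrays stored inside the fitted trees.
joblib.dump(clf, 'forest.pkl')
clf2 = joblib.load('forest.pkl')
clf2.predict(iris.data[:1])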

Personally, I switched over to using JSON files for model persistence (e.g., when using scikit-learn or also other things). This way, I always have a human readable record of everything (in case pickle files get corrupted or are incompatible in a different environment); sure, this would probably slower computationally, but I think it would be the more "robust" or "reproducible" option. Someone else asked me about that recently so I put up a quick ipynb with an example specific to sklearn if you are interested: http://nbviewer.jupyter.org/github/rasbt/python-machine-learning-book/blob/master/code/bonus/scikit-model-to-json.ipynb
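
The linked notebook has the full treatment; very roughly, the idea is to split an estimator into its hyperparameters (JSON-serializable via get_params()) and the fitted NumPy attributes (written out as lists). A sketch under those assumptions, not the notebook's exact code:

import json
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
clf = LogisticRegression().fit(iris.data, iris.target)

# Hyperparameters plus the fitted attributes needed to restore predict().
state = {'params': clf.get_params(),
         'coef_': clf.coef_.tolist(),
         'intercept_': clf.intercept_.tolist(),
         'classes_': clf.classes_.tolist()}
with open('model.json', 'w') as f:
    json.dump(state, f)

# Rebuild the estimator from the human-readable record.
with open('model.json') as f:
    state = json.load(f)
clf2 = LogisticRegression(**state['params'])
clf2.coef_ = np.array(state['coef_'])
clf2.intercept_ = np.array(state['intercept_'])
clf2.classes_ = np.array(state['classes_'])
clf2.predict(iris.data[:1])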

fivejjs commented Apr 20, 2016

We can extend this idea to a "pipeline" knowledge base, or knowledge graph. Then, given a pile of data, the system can figure out a suitable "pipeline".

rhiever (Contributor, Author) commented Aug 19, 2016

This issue is now handled because we're working directly with sklearn Pipeline objects that are persistent.

woodrujm commented

I get this error when trying to pickle the tpot model:

import pickle

with open('tpot_.pkl', 'wb') as xx:
    pickle.dump(tpot, xx)

PicklingError: Can't pickle <class 'tpot.operator_utils.XGBClassifier__learning_rate'>: attribute lookup XGBClassifier__learning_rate on tpot.operator_utils failed

weixuanfu (Contributor) commented Oct 31, 2017

@woodrujm I think the issue is related to #520, about pickling the TPOT object. For now, the entire TPOT object is not pickleable due to that attribute lookup issue, but you should be able to pickle these attributes in the TPOT API. We may work on this pickleability issue later.

woodrujm commented

Okay, thanks for the reply. TPOT has been great so far!

frahlg commented Nov 2, 2017

@woodrujm, if you instead use pickle.dump(tpot.fitted_pipeline_, xx), it works as you might have intended.
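
A sketch of that workaround, assuming tpot has already been fit and X_test is your held-out data (both names come from your own session):

import pickle

# Persist only the best fitted sklearn Pipeline, not the TPOT object.
with open('tpot_pipeline.pkl', 'wb') as f:
    pickle.dump(tpot.fitted_pipeline_, f)

# Later, e.g. in a fresh session:
with open('tpot_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)
pipeline.predict(X_test)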
