Brainstorm: How can we keep the pipelines trained so they don't need to re-train on each call? #11
Comments
Check this out: http://blaze.pydata.org/blog/2015/10/19/dask-learn/ — Dask can also be used to parallelize.
We should look at using sklearn Pipeline objects to represent our pipelines. I believe that could go a long way toward solving this issue.
Might be helpful: 3.4. Model persistence. Basically, an example of how to save a model as a pickle. Below is the code example from the linked page:

```python
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0
```

Maybe create a new self variable.
@KobaKhit I think pickle is generally a good idea in terms of efficiency; however (and maybe I am too paranoid :P), I always worry about compatibility issues, e.g., the different pickle protocols and Python version incompatibilities. In any case, if pickle is only used internally during model training on the particular system where TPOT is "trained", this wouldn't be a problem, I guess. Personally, I switched over to using JSON files for model persistence (e.g., when using scikit-learn, among other things). This way, I always have a human-readable record of everything in case pickle files get corrupted or are incompatible in a different environment. Sure, this would probably be slower computationally, but I think it is the more "robust" or "reproducible" option. Someone else asked me about that recently, so I put up a quick ipynb with an example specific to sklearn if you are interested: http://nbviewer.jupyter.org/github/rasbt/python-machine-learning-book/blob/master/code/bonus/scikit-model-to-json.ipynb
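The JSON approach can be sketched for a simple linear model: dump the learned arrays to a human-readable JSON string, then rebuild a fresh estimator from them. This is a minimal sketch, not TPOT code; the set of attributes copied (`coef_`, `intercept_`, `classes_`) is an assumption about what this particular estimator needs, and more complex models carry more fitted state:

```python
import json
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Dump the learned parameters to a human-readable JSON string
s = json.dumps({
    "coef_": clf.coef_.tolist(),
    "intercept_": clf.intercept_.tolist(),
    "classes_": clf.classes_.tolist(),
})

# Rebuild an estimator from the JSON record
restored = json.loads(s)
clf2 = LogisticRegression(max_iter=1000)
clf2.coef_ = np.array(restored["coef_"])
clf2.intercept_ = np.array(restored["intercept_"])
clf2.classes_ = np.array(restored["classes_"])

assert (clf2.predict(X) == clf.predict(X)).all()
```

The trade-off mentioned above is visible here: the JSON record survives Python and pickle-protocol changes and can be inspected by eye, at the cost of writing per-estimator save/load logic.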
We can extend this idea to a "pipeline" knowledge base, or knowledge graph. Then, given a pile of data, the system can figure out some "pipeline".
This issue is now handled because we're working directly with sklearn Pipeline objects that are persistent.
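To make that concrete, a fitted sklearn `Pipeline` is an ordinary Python object that carries its trained state, so a pickled and restored copy predicts without any retraining. A minimal sketch:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
pipe.fit(X, y)

s = pickle.dumps(pipe)      # the fitted state travels with the object
pipe2 = pickle.loads(s)     # no re-training needed on load
assert (pipe2.predict(X) == pipe.predict(X)).all()
```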
I get this error when trying to pickle the TPOT model:

```python
with open('tpot_.pkl', 'wb') as xx:
    ...
# PicklingError: Can't pickle <class 'tpot.operator_utils.XGBClassifier__learning_rate'>:
# attribute lookup XGBClassifier__learning_rate on tpot.operator_utils failed
```
Okay, thanks for the reply. TPOT has been great so far!
@woodrujm, if you instead use
Currently, TPOT requires the training data to be passed along with any additional data so the pipeline can train the sklearn models on the training data again. This is required because the pipeline consists of functions: each random forest, decision tree, etc. is a function, and the model is garbage collected as soon as the function terminates.

Let's brainstorm: How can we design these functions so the models remain persistent and don't need to be re-trained? This is really only important for the final pipeline, where the user would be performing `score()` and `predict()` calls against the pipeline.
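The design being brainstormed can be sketched as the difference between a function whose model dies when the call returns, and an object that keeps the fitted estimator alive for repeated `predict()`/`score()` calls. The class and function names below are illustrative, not TPOT's actual API:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_iris(return_X_y=True)

# Current design: the model lives only inside the function call,
# so every prediction pays the full training cost again.
def predict_functional(X_train, y_train, X_new):
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return model.predict(X_new)   # model is garbage collected on return

# Persistent design: hold the fitted estimator on the object so
# repeated predict() calls reuse it without re-training.
class PersistentPipeline:
    def __init__(self):
        self.model = None

    def fit(self, X, y):
        self.model = DecisionTreeClassifier(random_state=0).fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)

pipe = PersistentPipeline().fit(X_train, y_train)   # trains exactly once
assert (pipe.predict(X_train) ==
        predict_functional(X_train, y_train, X_train)).all()
```

This is essentially what the sklearn `Pipeline` suggestion above buys: a fitted `Pipeline` is such a persistent object out of the box.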