
Saving models? #14

Closed
xloffree opened this issue Sep 28, 2022 · 16 comments
@xloffree

Hi,

Is there a way to save models generated in PIML so that I do not need to run the program and train the model each time?
Also, what is the best way to export results? Is there any way to export results such that the widgets are still interactive? I have just been saving the notebook as an html file in order to share results.

Thank you

@ZebinYang
Collaborator

Hi @xloffree,

First of all, you may save a fitted model in PiML using the following approach.

import dill

clf = exp.get_model("GAM").estimator 

with open('name_model.pkl', 'wb') as file:
    dill.dump(clf, file)

with open('name_model.pkl', 'rb') as file:
    clf_load = dill.load(file)

train_x = exp.get_model("GAM").get_data(train=True)[0]
clf_load.predict(train_x)
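For readers without a PiML session handy, the same save/load round-trip can be sketched with a stand-in model (`FittedModel` below is a hypothetical placeholder, not PiML API; `pickle` is used here, and `dill` exposes the identical `dump`/`load` interface):

```python
import pickle

# Stand-in for a fitted estimator: any picklable object with a
# predict method follows the same save/load pattern as above.
class FittedModel:
    def __init__(self, coef):
        self.coef = coef

    def predict(self, rows):
        return [sum(c * x for c, x in zip(self.coef, row)) for row in rows]

clf = FittedModel(coef=[0.5, 2.0])

# Save the fitted model to disk.
with open("name_model.pkl", "wb") as file:
    pickle.dump(clf, file)

# Load it back in a later session -- no retraining needed.
with open("name_model.pkl", "rb") as file:
    clf_load = pickle.load(file)

print(clf_load.predict([[1.0, 1.0], [2.0, 0.0]]))  # [2.5, 1.0]
```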

You may also register the loaded model into PiML using the demo at https://colab.research.google.com/github/SelfExplainML/PiML-Toolbox/blob/main/examples/Example_ExternalModels.ipynb#scrollTo=7WGJ8PzutkLh, "Scenario 2: Register external fitted models with dataset".

Second, all the interactive panels in PiML depend on a live Python runtime. We currently have no functionality to export interactive results; the best option is to save the notebook as static HTML: a) click "Widgets -> Save Notebook Widget State"; b) export it via "File -> Download as -> HTML (.html)".

@xloffree
Author

Hi, thank you for your help with this. Has this solution worked for you with PiML? When I try this solution, this line:
with open('name_model.pkl', 'rb') as file: clf_load = dill.load(file)

results in a recursion error every time. I tried raising the recursion limit, but even at 10,000,000 I still hit the error, and raising it further just crashes the kernel.

The error is as follows:


RecursionError Traceback (most recent call last)
Cell In [39], line 2
1 with open('name_model.pkl', 'rb') as file:
----> 2 clf_load = dill.load(file)

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/site-packages/dill/_dill.py:272, in load(file, ignore, **kwds)
266 def load(file, ignore=None, **kwds):
267 """
268 Unpickle an object from a file.
269
270 See :func:loads for keyword arguments.
271 """
--> 272 return Unpickler(file, ignore=ignore, **kwds).load()

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/site-packages/dill/_dill.py:419, in Unpickler.load(self)
418 def load(self): #NOTE: if settings change, need to update attributes
--> 419 obj = StockUnpickler.load(self)
420 if type(obj).__module__ == getattr(_main_module, '__name__', '__main__'):
421 if not self._ignore:
422 # point obj class to main

File piml/models/glm.py:32, in piml.models.glm.GLMRegressor.__getattr__()

File piml/models/glm.py:32, in piml.models.glm.GLMRegressor.__getattr__()

[... skipping similar frames: piml.models.glm.GLMRegressor.__getattr__ at line 32 (9999967 times)]

File piml/models/glm.py:32, in piml.models.glm.GLMRegressor.__getattr__()

RecursionError: maximum recursion depth exceeded while calling a Python object

Any help with this would be very appreciated. When using PiML for research purposes, being able to save a trained model is essential for reproducibility. Thank you!

@ZebinYang
Collaborator

Hi @xloffree,

For GLM, you may use the following code to do model saving,

import dill

clf = exp.get_model("GLM").estimator.__model__ 

with open('name_model.pkl', 'wb') as file:
    dill.dump(clf, file)

with open('name_model.pkl', 'rb') as file:
    clf_load = dill.load(file)

train_x = exp.get_model("GLM").get_data(train=True)[0]
clf_load.predict(train_x)

@xloffree
Author

Thank you very much. This works. How can I use this trained model to predict on other datasets? Is this functionality exclusively part of PiML or is there third party documentation I can view for more background on how this code works?

Thank you

@ZebinYang
Collaborator

Thank you very much. This works. How can I use this trained model to predict on other datasets? Is this functionality exclusively part of PiML or is there third party documentation I can view for more background on how this code works?

Thank you

If you have another dataset with the same set of input features, you can use this model to get predictions. Assuming the new data's covariates "X" are on the raw scale (no preprocessing applied), you can get predictions from the fitted model in PiML as follows.

clf = exp.get_model("GLM").estimator 
xx = exp.get_data(x=X)
clf.predict(xx)
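The flow here is: raw covariates -> the experiment's preprocessing -> the estimator. A minimal stand-in sketch of that flow (`preprocess` and `FittedModel` below are hypothetical placeholders mimicking `exp.get_data` and the estimator; they are not PiML API):

```python
def preprocess(X, scale=0.5):
    # Stand-in for exp.get_data(x=X): apply the same transform
    # used at training time (here, a toy rescaling).
    return [[v * scale for v in row] for row in X]

class FittedModel:
    # Stand-in estimator: predicts the row sum of the transformed inputs.
    def predict(self, rows):
        return [sum(row) for row in rows]

X = [[1.0, 2.0], [3.0, 4.0]]   # raw covariates, one row per sample
xx = preprocess(X)             # transformed covariates
preds = FittedModel().predict(xx)
print(preds)  # [1.5, 3.5]
```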

@xloffree
Author

What datatype should X be here? Is it a dataframe that includes all of the data for all of the predictors?

Thanks

@xloffree
Author

Hi,

Is different code required to save each different type of built-in model in PiML? It seems whenever I try to save a different model, I run into a new error. Is there somewhere where I can see the code for how to save each different type of model?
Thank you

@ZebinYang
Collaborator

What datatype should X be here? Is it a dataframe that includes all of the data for all of the predictors?

X is a numpy array of the selected features. It should be in the same format as the uploaded raw data, without any preprocessing.

@ZebinYang
Collaborator

Hi,

Is different code required to save each different type of built-in model in PiML? It seems whenever I try to save a different model, I run into a new error. Is there somewhere where I can see the code for how to save each different type of model? Thank you

For the GLMRegressor model, use

import dill

clf = exp.get_model("GLM").estimator.__model__ 

with open('name_model.pkl', 'wb') as file:
    dill.dump(clf, file)

with open('name_model.pkl', 'rb') as file:
    clf_load = dill.load(file)

train_x = exp.get_model("GLM").get_data(train=True)[0]
clf_load.predict(train_x)

For all the remaining models, you can use

import dill

clf = exp.get_model("GAM").estimator 

with open('name_model.pkl', 'wb') as file:
    dill.dump(clf, file)

with open('name_model.pkl', 'rb') as file:
    clf_load = dill.load(file)

train_x = exp.get_model("GAM").get_data(train=True)[0]
clf_load.predict(train_x)

BTW, we will provide a unified API for model saving in the next release.
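Until that unified API lands, the two branches above can be folded into one small helper. This is my own sketch, not PiML API (`extract_picklable` is a hypothetical name), and it assumes, per the snippets above, that the GLM estimator keeps its picklable core under `__model__` while every other estimator pickles directly:

```python
def extract_picklable(name, estimator):
    """Return the object to hand to dill.dump for a PiML model.

    Assumption (from the snippets above): the "GLM" estimator stores
    its picklable core under ``__model__``; other estimators are
    dumped as-is.
    """
    if name == "GLM":
        return estimator.__model__
    return estimator

# Stand-in estimators to show the dispatch (hypothetical, not PiML):
class GLMEstimator:
    __model__ = "glm-core"

class GAMEstimator:
    pass

gam = GAMEstimator()
print(extract_picklable("GLM", GLMEstimator()))  # glm-core
print(extract_picklable("GAM", gam) is gam)      # True
```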

@xloffree
Author

xloffree commented Jan 6, 2023

clf = exp.get_model("GLM").estimator
xx = exp.get_data(x=X)
clf.predict(xx)

I still do not understand what this means. I have tried to pass df and df.columns as X but it does not work. Do you have an example of what X should be?

Thank you

@xloffree
Author

xloffree commented Jan 6, 2023

Would it be possible for us to discuss PiML over a zoom meeting? That might be more efficient than messages on this page.

@ZebinYang
Collaborator

ZebinYang commented Jan 6, 2023

Hi, here X is just an n*p numpy array, where n is the sample size and p is the number of predictors (excluding unselected features and the response feature).

For instance, assume the raw data is a pd.DataFrame as follows,

season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed cnt
1.0 0.0 1.0 0.0 0.0 6.0 0.0 1.0 0.24 0.2879 0.81 0.0000 16.0
1.0 0.0 1.0 1.0 0.0 6.0 0.0 1.0 0.22 0.2727 0.80 0.0000 40.0
1.0 0.0 1.0 2.0 0.0 6.0 0.0 1.0 0.22 0.2727 0.80 0.0000 32.0
1.0 0.0 1.0 3.0 0.0 6.0 0.0 1.0 0.24 0.2879 0.75 0.0000 13.0
1.0 0.0 1.0 4.0 0.0 6.0 0.0 1.0 0.24 0.2879 0.75 0.0000 1.0
... ... ... ... ... ... ... ... ... ... ... ... ...
1.0 1.0 12.0 19.0 0.0 1.0 1.0 2.0 0.26 0.2576 0.60 0.1642 119.0
1.0 1.0 12.0 20.0 0.0 1.0 1.0 2.0 0.26 0.2576 0.60 0.1642 89.0
1.0 1.0 12.0 21.0 0.0 1.0 1.0 1.0 0.26 0.2576 0.60 0.1642 90.0
1.0 1.0 12.0 22.0 0.0 1.0 1.0 1.0 0.26 0.2727 0.56 0.1343 61.0
1.0 1.0 12.0 23.0 0.0 1.0 1.0 1.0 0.26 0.2727 0.65 0.1343 49.0


Suppose you selected season, yr, mnth, and hr as the covariates (in exp.data_summary and exp.feature_select), and cnt as the response (in exp.data_prepare).

Then X should be a np.array that looks like:

season yr mnth hr
1.0 0.0 1.0 0.0
1.0 0.0 1.0 1.0
1.0 0.0 1.0 2.0
1.0 0.0 1.0 3.0
1.0 0.0 1.0 4.0
... ... ... ...
1.0 1.0 12.0 19.0
1.0 1.0 12.0 20.0
1.0 1.0 12.0 21.0
1.0 1.0 12.0 22.0
1.0 1.0 12.0 23.0

For example, X could be the selected covariates of the loaded data:

X = exp.dataset.x
clf = exp.get_model("GLM").estimator
xx = exp.get_data(x=X)
clf.predict(xx)
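To make the expected n*p shape of X concrete outside a PiML session, here is a stand-alone sketch with plain Python lists standing in for the numpy array (`np.array(X)` would give the same layout); the column names mirror the bike-sharing example above:

```python
# Raw data rows keyed by column name (mirroring the DataFrame above).
raw = [
    {"season": 1.0, "yr": 0.0, "mnth": 1.0, "hr": 0.0, "cnt": 16.0},
    {"season": 1.0, "yr": 0.0, "mnth": 1.0, "hr": 1.0, "cnt": 40.0},
    {"season": 1.0, "yr": 0.0, "mnth": 1.0, "hr": 2.0, "cnt": 32.0},
]

# Keep only the selected covariates, in order; drop the response "cnt".
selected = ["season", "yr", "mnth", "hr"]
X = [[row[c] for c in selected] for row in raw]

print(len(X), len(X[0]))  # 3 4  -> n=3 samples, p=4 selected features
```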

Hope that helps.

@xloffree
Author

Saving a model works for GLM and GAM. After that, none of the other models will save; they all raise an error:


PicklingError Traceback (most recent call last)
Cell In [18], line 4
1 clf = exp.get_model("GAMI-Net").estimator
3 with open('LVS_GAMI-Net.pkl', 'wb') as file:
----> 4 dill.dump(clf, file)

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/site-packages/dill/_dill.py:235, in dump(obj, file, protocol, byref, fmode, recurse, **kwds)
233 _kwds = kwds.copy()
234 _kwds.update(dict(byref=byref, fmode=fmode, recurse=recurse))
--> 235 Pickler(file, protocol, **_kwds).dump(obj)
236 return

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/site-packages/dill/_dill.py:394, in Pickler.dump(self, obj)
392 def dump(self, obj): #NOTE: if settings change, need to update attributes
393 logger.trace_setup(self)
--> 394 StockPickler.dump(self, obj)

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/pickle.py:487, in _Pickler.dump(self, obj)
485 if self.proto >= 4:
486 self.framer.start_framing()
--> 487 self.save(obj)
488 self.write(STOP)
489 self.framer.end_framing()

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/site-packages/dill/_dill.py:388, in Pickler.save(self, obj, save_persistent_id)
386 msg = "Can't pickle %s: attribute lookup builtins.generator failed" % GeneratorType
387 raise PicklingError(msg)
--> 388 StockPickler.save(self, obj, save_persistent_id)

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/pickle.py:603, in _Pickler.save(self, obj, save_persistent_id)
599 raise PicklingError("Tuple returned by %s must have "
600 "two to six elements" % reduce)
602 # Save the reduce() output and finally memoize the object
--> 603 self.save_reduce(obj=obj, *rv)

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/pickle.py:717, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
715 if state is not None:
716 if state_setter is None:
--> 717 save(state)
718 write(BUILD)
719 else:
720 # If a state_setter is specified, call it instead of load_build
721 # to update obj's with its previous state.
722 # First, push state_setter and its tuple of expected arguments
723 # (obj, state) onto the stack.

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/site-packages/dill/_dill.py:388, in Pickler.save(self, obj, save_persistent_id)
386 msg = "Can't pickle %s: attribute lookup builtins.generator failed" % GeneratorType
387 raise PicklingError(msg)
--> 388 StockPickler.save(self, obj, save_persistent_id)

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/pickle.py:560, in _Pickler.save(self, obj, save_persistent_id)
558 f = self.dispatch.get(t)
559 if f is not None:
--> 560 f(self, obj) # Call unbound method with explicit self
561 return
563 # Check private dispatch table if any, or else
564 # copyreg.dispatch_table

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/site-packages/dill/_dill.py:1186, in save_module_dict(pickler, obj)
1183 if is_dill(pickler, child=False) and pickler._session:
1184 # we only care about session the first pass thru
1185 pickler._first_pass = False
-> 1186 StockPickler.save_dict(pickler, obj)
1187 logger.trace(pickler, "# D2")
1188 return

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/pickle.py:971, in _Pickler.save_dict(self, obj)
968 self.write(MARK + DICT)
970 self.memoize(obj)
--> 971 self._batch_setitems(obj.items())

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/pickle.py:997, in _Pickler._batch_setitems(self, items)
995 for k, v in tmp:
996 save(k)
--> 997 save(v)
998 write(SETITEMS)
999 elif n:

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/site-packages/dill/_dill.py:388, in Pickler.save(self, obj, save_persistent_id)
386 msg = "Can't pickle %s: attribute lookup builtins.generator failed" % GeneratorType
387 raise PicklingError(msg)
--> 388 StockPickler.save(self, obj, save_persistent_id)

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/pickle.py:589, in _Pickler.save(self, obj, save_persistent_id)
587 # Check for string returned by reduce(), meaning "save as global"
588 if isinstance(rv, str):
--> 589 self.save_global(obj, rv)
590 return
592 # Assert that reduce() returned a tuple

File /n/holylfs05/LABS/liang_lab_l3/Lab/piml_py39_shared/lib/python3.9/pickle.py:1070, in _Pickler.save_global(self, obj, name)
1068 obj2, parent = _getattribute(module, name)
1069 except (ImportError, KeyError, AttributeError):
-> 1070 raise PicklingError(
1071 "Can't pickle %r: it's not found as %s.%s" %
1072 (obj, module_name, name)) from None
1073 else:
1074 if obj2 is not obj:

PicklingError: Can't pickle <cyfunction Model.register_model.<locals>.sklearn_is_fitted at 0x2ae1c932aee0>: it's not found as piml.workflow.base.Model.register_model.<locals>.sklearn_is_fitted

@ZebinYang
Collaborator

Thanks for reporting this issue.

You may use the following scripts to save and load a fitted model except for GLMRegressor.

import dill

clf = exp.get_model("GAM").estimator 
clf.__sklearn_is_fitted__ = lambda: True

with open('name_model.pkl', 'wb') as file:
    dill.dump(clf, file)

with open('name_model.pkl', 'rb') as file:
    clf_load = dill.load(file)

train_x = exp.get_model("GAM").get_data(train=True)[0]
clf_load.predict(train_x)

@xloffree
Author

Thanks. I am able to save every model as a .pkl file now. How can I easily load the model and view its interpretability metrics within PiML? For example, if I have an EBM model saved as a .pkl, and I want to view the results of exp.model_interpret(), how can I do this without retraining?

Thank you

@ZebinYang
Collaborator

@xloffree,

You can do the following to register it into the PiML workflow:

pipeline = exp.make_pipeline(model=clf_load)
exp.register(pipeline, "loaded_model")
exp.model_interpret()

Note that in this case, you need to do data loading, summary, and preparation first, so that all the data are available.
An alternative way is to specify the required data information in exp.register. You can find the details in the docs of exp.register function, and the example usage in https://colab.research.google.com/github/SelfExplainML/PiML-Toolbox/blob/main/examples/Example_ExternalModels.ipynb.
