
A KedroPipelineModel cannot be loaded from mlflow if its catalog contains non deepcopy-able DataSets #122

Closed
Galileo-Galilei opened this issue Nov 21, 2020 · 2 comments · Fixed by #129
Labels: bug (Something isn't working)

Comments

@Galileo-Galilei (Owner)

Description

I tried to load a KedroPipelineModel from mlflow, and I got a "cannot pickle context artifacts" error, which is due to the deepcopy of the catalog performed at loading time (see "Potential solution" below).

Context

I cannot load a previously saved KedroPipelineModel generated by pipeline_ml_factory.

Steps to Reproduce

Save a KedroPipelineModel with a dataset that contains an object which cannot be deepcopied (for me, a keras tokenizer), then try to load it back from mlflow

Expected Result

The model should be loaded

Actual Result

A "cannot pickle" error is raised when loading the model; a minimal sketch of the failing call follows.
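
For illustration, a sketch of the load call that triggers the error; the run id and artifact path are hypothetical placeholders:

    import mlflow

    # hypothetical URI: any run where a KedroPipelineModel was logged
    model = mlflow.pyfunc.load_model("runs:/<run_id>/model")
    # raises a "cannot pickle ..." TypeError when the catalog is deepcopied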

Your Environment

Include as many relevant details as possible about the environment in which you experienced the bug:

  • kedro and kedro-mlflow version used: 0.16.5 and 0.4.0
  • Python version used (python -V): 3.6.8
  • Windows 10 & CentOS were tested

Does the bug also happen with the latest version on develop?

Yes

Potential solution

The faulty line is:

self.loaded_catalog = deepcopy(self.initial_catalog)
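
For reference, a minimal standalone sketch of how such a line fails, using a plain dict as a stand-in for the catalog and a thread lock as a stand-in for any non deepcopy-able object (like the keras tokenizer above):

    from copy import deepcopy
    import threading

    catalog = {"tokenizer": threading.Lock()}  # a lock cannot be pickled/deepcopied
    deepcopy(catalog)  # raises TypeError: cannot pickle the lock object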

@Galileo-Galilei Galileo-Galilei added the bug Something isn't working label Nov 21, 2020
@Galileo-Galilei Galileo-Galilei added this to To do in Ongoing development via automation Nov 21, 2020
@Galileo-Galilei Galileo-Galilei added this to the Release 0.5.0 milestone Nov 21, 2020
@Galileo-Galilei Galileo-Galilei changed the title A KedroPipelineModel cannot be loaded from mlflow if its catalog contains non deepcopy-ish DataSets A KedroPipelineModel cannot be loaded from mlflow if its catalog contains non deepcopy-able DataSets Nov 21, 2020
@takikadiri (Collaborator)

Does removing the faulty line and using the initial_catalog directly make the model loadable again? If yes, we have two options:

  • We no longer deepcopy the initial_catalog
  • We copy each DataSet of the catalog with its own loader (for example, tf.keras.models.clone_model for a keras model DataSet; see the sketch below)

Knowing that the KedroPipelineModel is intended to be used in a separate process (at inference time), we can simply remove the deepcopy part (there won't be a conflict with another function using the same catalog).
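
A rough sketch of the second option, with a hypothetical copy_dataset_value helper and keras models as the only special-cased type (each non deepcopy-able type would need its own branch):

    from copy import deepcopy

    import tensorflow as tf

    def copy_dataset_value(value):
        # hypothetical dispatcher: use a type-specific copier when deepcopy would fail
        if isinstance(value, tf.keras.Model):
            clone = tf.keras.models.clone_model(value)  # copies the architecture
            clone.set_weights(value.get_weights())      # copies the weights
            return clone
        return deepcopy(value)  # generic fallback for everything else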

@Galileo-Galilei Galileo-Galilei moved this from To do to Planned for next release in Ongoing development Nov 25, 2020
@Galileo-Galilei (Owner, Author)

After some investigation, the issue comes from the MLflowAbstractModelDataSet, and particularly from its self._mlflow_model_module attribute, which is a module and therefore not deepcopy-able by nature. I suggest storing it as a string, and exposing a property attribute that loads the module on the fly, as sketched below.
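
A minimal sketch of that suggestion (class and attribute names are simplified, not the actual implementation):

    import importlib

    class ModelDataSetSketch:
        def __init__(self, flavor: str):
            # store the module path as a plain string: strings deepcopy fine
            self._flavor = flavor  # e.g. "mlflow.sklearn"

        @property
        def _mlflow_model_module(self):
            # import lazily on access instead of holding a module reference
            return importlib.import_module(self._flavor)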

Note that this problem occurs only when the DataSet itself is not deepcopy-able (and not the underlying value the DataSet can load()), so we can quite safely assume that it should not occur often. If it does, we should consider a more radical solution among the ones you suggest.

@Galileo-Galilei Galileo-Galilei moved this from Planned for next release to In progress in Ongoing development Nov 28, 2020
Ongoing development automation moved this from In progress to Done Nov 28, 2020