New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] LambdaOp may fail to deserialize if the processing function is declared in another module #1737
Comments
willb
added a commit
to willb/NVTabular
that referenced
this issue
Dec 23, 2022
This commit adds an optional parameter to Workflow.save so that users can indicate that certain modules should be serialized by value. This is necessary if a LambdaOp in a workflow depends on a module whose source file will not be available in the deployment environment. Related to NVIDIA-Merlin#1737.
willb
added a commit
to willb/NVTabular
that referenced
this issue
Dec 23, 2022
This commit adds code to automatically infer LambdaOp module dependencies in several common cases: 1. in which a function is passed to LambdaOp by name, 2. in which the argument to LambdaOp is a lambda expression that refers to a function by a fully-qualified name, and 3. in which the argument to LambdaOp is a lambda expression that refers to a function via another variable The current implementation does not inspect the bodies of any function passed to LambdaOp, and many corner cases are (necessarily) omitted. However, this support should be complete enough to be useful for many users. Automatic inference is optional (via a parameter to Workflow.save) but it could be the default in the future. Related to issue NVIDIA-Merlin#1737.
willb
added a commit
to willb/NVTabular
that referenced
this issue
Dec 23, 2022
willb
added a commit
to willb/NVTabular
that referenced
this issue
Dec 23, 2022
This commit adds code to automatically infer LambdaOp module dependencies in several common cases: 1. in which a function is passed to LambdaOp by name, 2. in which the argument to LambdaOp is a lambda expression that refers to a function by a fully-qualified name, and 3. in which the argument to LambdaOp is a lambda expression that refers to a function via another variable The current implementation does not inspect the bodies of any function passed to LambdaOp, and many corner cases are (necessarily) omitted. However, this support should be complete enough to be useful for many users. Automatic inference is optional (via a parameter to Workflow.save) but it could be the default in the future. Related to issue NVIDIA-Merlin#1737.
willb
added a commit
to willb/NVTabular
that referenced
this issue
Dec 23, 2022
karlhigley
added a commit
that referenced
this issue
Feb 15, 2023
… value (#1741) * Allow users to specify module serialization hints This commit adds an optional parameter to Workflow.save so that users can indicate that certain modules should be serialized by value. This is necessary if a LambdaOp in a workflow depends on a module whose source file will not be available in the deployment environment. Related to #1737. * Adds automatic inference of LambdaOp module dependencies This commit adds code to automatically infer LambdaOp module dependencies in several common cases: 1. in which a function is passed to LambdaOp by name, 2. in which the argument to LambdaOp is a lambda expression that refers to a function by a fully-qualified name, and 3. in which the argument to LambdaOp is a lambda expression that refers to a function via another variable The current implementation does not inspect the bodies of any function passed to LambdaOp, and many corner cases are (necessarily) omitted. However, this support should be complete enough to be useful for many users. Automatic inference is optional (via a parameter to Workflow.save) but it could be the default in the future. Related to issue #1737. * Added tests related to issue #1737 * Fix linter errors * Workflow.save: reset cloudpickle behavior changes on return * aligned formatting with black's expectations --------- Co-authored-by: Karl Higley <kmhigley@gmail.com> Co-authored-by: rnyak <ronayak@hotmail.com>
Closing because this was resolved. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Describe the bug
By default, Cloudpickle serializes functions and modules by reference. Deserializing workflows that contain user functions declared in other modules can fail if the Python source files for those modules are not available. This means that serializing a workflow and publishing it in a Docker container requires either (1) that all user source files are published in the same paths in the container as they were when the workflow was serialized or (2) that user modules are not serialized by reference.
Steps/Code to reproduce bug
Assume that
identity.py
contains the identity function, calledidentity
.Then run this code:
This serialized workflow will fail to deserialize if
identity.py
is not available in the same location, e.g., in a Docker container or on another host.Expected behavior
Serialized workflows should be largely self-contained and resilient to missing source files containing user transformation code.
Environment details (please complete the following information):
docker pull
&docker run
commands used:docker run -v /home/willb/devel/nvtabular-sandbox:/workspace/host --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8787 -p 8796:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-training:22.03 /bin/bash
Additional context
This notebook demonstrates the minimal reproducer as well as a workaround.
I expect to submit a patch shortly that will enhance
Workflow.save
in two ways:The text was updated successfully, but these errors were encountered: