Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] LambdaOp may fail to deserialize if the processing function is declared in another module #1737

Closed
willb opened this issue Dec 20, 2022 · 1 comment
Labels
bug Something isn't working P1

Comments

@willb
Copy link
Contributor

willb commented Dec 20, 2022

Describe the bug

By default, Cloudpickle serializes functions and modules by reference. Deserializing workflows that contain user functions declared in other modules can fail if the Python source files for those modules are not available. This means that serializing a workflow and publishing it in a Docker container requires either (1) that all user source files are published in the same paths in the container as they were when the workflow was serialized or (2) that user modules are not serialized by reference.

Steps/Code to reproduce bug

Assume that identity.py contains the identity function, called identity.

with open("identity.py", "w") as of:
    of.write("""
def identity(col):
    return col

""")

Then run this code:

import nvtabular as nvt
import identity

wf = nvt.Workflow(
    ["col_a"] >> nvt.ops.LambdaOp(identity.identity)
)

wf.save("identity-workflow")

This serialized workflow will fail to deserialize if identity.py is not available in the same location, e.g., in a Docker container or on another host.

import sys
import os

del sys.modules["identity"]
os.unlink("identity.py")

try:
    wf_ref = nvt.Workflow.load("identity-workflow")
except ModuleNotFoundError as mnfe:
    print("Failed to load workflow")
    print(str(mnfe))

Expected behavior

Serialized workflows should be largely self-contained and resilient to missing source files containing user transformation code.

Environment details (please complete the following information):

  • Environment location: Docker
  • Method of NVTabular install: Docker
    • If method of install is [Docker], provide docker pull & docker run commands used:

docker run -v /home/willb/devel/nvtabular-sandbox:/workspace/host --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8787 -p 8796:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-training:22.03 /bin/bash

Additional context

This notebook demonstrates the minimal reproducer as well as a workaround.

I expect to submit a patch shortly that will enhance Workflow.save in two ways:

  1. to allow clients to provide explicit direction that certain modules should be serialized by value, and
  2. to allow clients to request optional automatic identification of such modules.
@willb willb added the bug Something isn't working label Dec 20, 2022
willb added a commit to willb/NVTabular that referenced this issue Dec 23, 2022
This commit adds an optional parameter to Workflow.save so that
users can indicate that certain modules should be serialized by value.
This is necessary if a LambdaOp in a workflow depends on a module
whose source file will not be available in the deployment environment.

Related to NVIDIA-Merlin#1737.
willb added a commit to willb/NVTabular that referenced this issue Dec 23, 2022
This commit adds code to automatically infer LambdaOp module dependencies
in several common cases:

1. in which a function is passed to LambdaOp by name,
2. in which the argument to LambdaOp is a lambda expression that refers
   to a function by a fully-qualified name, and
3. in which the argument to LambdaOp is a lambda expression that refers
   to a function via another variable

The current implementation does not inspect the bodies of any function
passed to LambdaOp, and many corner cases are (necessarily) omitted.  However,
this support should be complete enough to be useful for many users.

Automatic inference is optional (via a parameter to Workflow.save) but it
could be the default in the future.

Related to issue NVIDIA-Merlin#1737.
willb added a commit to willb/NVTabular that referenced this issue Dec 23, 2022
willb added a commit to willb/NVTabular that referenced this issue Dec 23, 2022
This commit adds code to automatically infer LambdaOp module dependencies
in several common cases:

1. in which a function is passed to LambdaOp by name,
2. in which the argument to LambdaOp is a lambda expression that refers
   to a function by a fully-qualified name, and
3. in which the argument to LambdaOp is a lambda expression that refers
   to a function via another variable

The current implementation does not inspect the bodies of any function
passed to LambdaOp, and many corner cases are (necessarily) omitted.  However,
this support should be complete enough to be useful for many users.

Automatic inference is optional (via a parameter to Workflow.save) but it
could be the default in the future.

Related to issue NVIDIA-Merlin#1737.
willb added a commit to willb/NVTabular that referenced this issue Dec 23, 2022
@rnyak rnyak added the P1 label Jan 3, 2023
karlhigley added a commit that referenced this issue Feb 15, 2023
… value (#1741)

* Allow users to specify module serialization hints

This commit adds an optional parameter to Workflow.save so that
users can indicate that certain modules should be serialized by value.
This is necessary if a LambdaOp in a workflow depends on a module
whose source file will not be available in the deployment environment.

Related to #1737.

* Adds automatic inference of LambdaOp module dependencies

This commit adds code to automatically infer LambdaOp module dependencies
in several common cases:

1. in which a function is passed to LambdaOp by name,
2. in which the argument to LambdaOp is a lambda expression that refers
   to a function by a fully-qualified name, and
3. in which the argument to LambdaOp is a lambda expression that refers
   to a function via another variable

The current implementation does not inspect the bodies of any function
passed to LambdaOp, and many corner cases are (necessarily) omitted.  However,
this support should be complete enough to be useful for many users.

Automatic inference is optional (via a parameter to Workflow.save) but it
could be the default in the future.

Related to issue #1737.

* Added tests related to issue #1737

* Fix linter errors

* Workflow.save: reset cloudpickle behavior changes on return

* aligned formatting with black's expectations

---------

Co-authored-by: Karl Higley <kmhigley@gmail.com>
Co-authored-by: rnyak <ronayak@hotmail.com>
@nv-alaiacano
Copy link
Contributor

Closing because this was resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P1
Projects
None yet
Development

No branches or pull requests

3 participants