Node-level initializations for UDF #622

Closed
jaychia opened this issue Feb 24, 2023 · 1 comment

jaychia commented Feb 24, 2023

Is your feature request related to a problem? Please describe.

When a UDF performs initialization work (a common example is downloading a model and caching it on disk), multiple workers on the same node can attempt that initialization at the same time and end up thrashing each other.

We should provide a mechanism for node-level initializations in a UDF that execute once per node. Alternatively, if users had access to a unique worker ID through daft.context, they could handle this themselves, for example by using a different cache folder for each stateful UDF's initializations.
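
To illustrate the second idea, here is a minimal sketch of a per-worker cache folder. Daft does not expose a worker ID here, so the OS process ID is used as a stand-in; `worker_cache_dir` and the `"myudf"` folder name are hypothetical and only for illustration.

```python
import os
import tempfile
from pathlib import Path


def worker_cache_dir(base_dir: str, udf_name: str) -> Path:
    """Return a cache directory private to the current worker process."""
    # Assumption: the process ID stands in for a worker ID, since no
    # Daft-provided worker ID is available.
    path = Path(base_dir) / udf_name / f"worker-{os.getpid()}"
    path.mkdir(parents=True, exist_ok=True)
    return path


# Each worker process gets its own folder, so downloads never collide.
cache_dir = worker_cache_dir(tempfile.gettempdir(), "myudf")
```

The trade-off is that every worker downloads its own copy, which avoids contention at the cost of extra disk space and bandwidth.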

jaychia commented Mar 2, 2023

This is already possible without any Daft-provided functionality. Users can use a library such as https://github.com/tox-dev/py-filelock to perform node-level locking.

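from filelock import FileLock

class MyUDF:

    def __init__(self):
        # Only one process on this node can hold the lock at a time, so
        # concurrent workers take turns instead of downloading simultaneously.
        with FileLock("/tmp/.myudf.lock"):
            # User-defined; should return early if the model is already on disk.
            download_model_to_disk()
        # User-defined; loads the cached model into this worker's memory.
        self.model = load_model()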
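
The lock only serializes the workers; for the download to happen once per node, the guarded step also has to detect that it already ran. Below is a minimal, dependency-free sketch of that check, assuming a POSIX system and using the standard library's `fcntl.flock` in place of filelock. The helper `run_once_per_node` and the marker-file convention are hypothetical, not part of Daft or filelock.

```python
import fcntl
import os
import tempfile
from pathlib import Path


def run_once_per_node(lock_path: str, marker_path: str, init_fn) -> bool:
    """Run init_fn at most once per node; return True if this call ran it."""
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process on this node holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if Path(marker_path).exists():
                return False  # an earlier worker already initialized
            init_fn()
            # Record completion only after init_fn succeeds.
            Path(marker_path).touch()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


# Usage: the second call finds the marker and skips the initialization.
workdir = tempfile.mkdtemp()
lock = os.path.join(workdir, "init.lock")
marker = os.path.join(workdir, "init.done")
calls = []
first = run_once_per_node(lock, marker, lambda: calls.append("download"))
second = run_once_per_node(lock, marker, lambda: calls.append("download"))
```

Both the lock file and the marker need to live on node-local disk for this to behave as a per-node guard.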

Documentation will be added in an FAQ section for UDFs.

Closing!

jaychia closed this as completed Mar 2, 2023