udf: add dependency list in comment block? #237

Closed
jdries opened this issue Oct 10, 2022 · 15 comments

@jdries
Contributor

jdries commented Oct 10, 2022

what about:
https://datashim-io.github.io/datashim/Archive-based-Datasets/

jdries added a commit that referenced this issue Nov 28, 2022
@soxofaan
Member

soxofaan commented Aug 3, 2023

Just recently, "PEP 722 Dependency specification for single-file scripts" was initiated; see e.g. https://discuss.python.org/t/pep-722-dependency-specification-for-single-file-scripts/29905

I haven't read much into the state of that proposal, but we might want to align with it for UDF dependency declarations.

@soxofaan
Member

soxofaan commented Apr 5, 2024

Just checked on this again and apparently PEP 722 – Dependency specification for single-file scripts was rejected.
Instead there is now PEP 723 – Inline script metadata:

This PEP specifies a metadata format that can be embedded in single-file Python scripts to assist launchers, IDEs and other external tools which may need to interact with such scripts.

It has been accepted, so it is an interesting option to consider for this issue.

It would look like this (comment at top of UDF file):

# /// script
# dependencies = [
#   "numpy==1.2.3",
#   "pip @ https://github.com/pypa/pip/archive/1.3.1.zip#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686",
# ]
# ///

jdries added a commit to Open-EO/openeo-geotrellis-kubernetes that referenced this issue May 3, 2024
@jdries
Contributor Author

jdries commented May 3, 2024

pip is now available in the image, so this would work:

>>> import subprocess
>>> subprocess.run("/opt/venv/bin/python3 -m pip install --target test urllib3", shell=True, check=True)
Collecting urllib3
  Downloading https://files.pythonhosted.org/packages/a2/73/a68704750a7679d0b6d3ad7aa8d4da8e14e151ae82e6fee774e6e0d05ec8/urllib3-2.2.1-py3-none-any.whl (121kB)
     |████████████████████████████████| 122kB 1.1MB/s 
ERROR: openeo-r-udf 0.5.0 requires rpy2, which is not installed.
ERROR: openeo-geopyspark 0.31.1a1.dev20240502+1906 requires pyspark==3.4.2; python_version >= "3.8", which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires descartes, which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires flask-httpauth>=3.3.0, which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires flask-swagger-ui>=3.25.0, which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires matplotlib, which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires pyspark, which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires rtree, which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires sentinelhub, which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires tables>=3.6.1, which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires tensorflow==2.0.0, which is not installed.
ERROR: cropsar 1.5.2.dev20231123+14 requires tqdm, which is not installed.
ERROR: biopar 1.3.0 requires fire, which is not installed.
ERROR: biopar 1.3.0 requires intel-tensorflow==2.8, which is not installed.
ERROR: biopar 1.3.0 requires tqdm, which is not installed.
ERROR: elasticsearch 7.16.3 has requirement urllib3<2,>=1.21.1, but you'll have urllib3 2.2.1 which is incompatible.
ERROR: botocore 1.19.63 has requirement urllib3<1.27,>=1.25.4; python_version != "3.4", but you'll have urllib3 2.2.1 which is incompatible.
Installing collected packages: urllib3
Successfully installed urllib3-2.2.1
WARNING: You are using pip version 19.3.1; however, version 24.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
CompletedProcess(args='/opt/venv/bin/python3 -m pip install --target test urllib3', returncode=0)

>>> import sys
>>> sys.path.append("test")
>>> import urllib3
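
The same steps could be wrapped in a small helper. This is only a minimal sketch: it passes the arguments as a list instead of using `shell=True`, and it uses `sys.executable` rather than the hard-coded `/opt/venv/bin/python3`; both function names are made up for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_pip_install_command(target: Path, packages: list[str]) -> list[str]:
    """Build a `pip install --target` command for the running interpreter."""
    return [sys.executable, "-m", "pip", "install", "--target", str(target), *packages]


def install_into_target(target: Path, packages: list[str]) -> None:
    """Install packages into `target` and make them importable in this process."""
    # check=True raises CalledProcessError when pip exits with a non-zero code.
    subprocess.run(build_pip_install_command(target, packages), check=True)
    if str(target) not in sys.path:
        sys.path.append(str(target))


# Example (needs network access, so it is left commented out):
# install_into_target(Path("test"), ["urllib3"])
```

Note that in the log above pip printed dependency conflict errors but still returned exit code 0, so `check=True` alone would not catch that kind of problem.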

@jdries jdries self-assigned this May 6, 2024
@JeroenVerstraelen JeroenVerstraelen assigned jdries and unassigned jdries May 7, 2024
@jdries
Contributor Author

jdries commented May 8, 2024

Implementation proposal:

  • batch job driver collects UDFs during the dry run
  • driver uses pip to install dependencies on shared volume that we use for job output
  • executors also have access to this volume
  • optional: use zip for dependencies on volume, avoiding S3 latency?

Advantages:

  • dependencies are available as part of job results
  • dependencies are retrieved from pypi/artifactory only once, not per executor
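
The optional zip step could look roughly like this. It is a sketch under assumptions: the helper names and paths are hypothetical, and importing straight from a zip archive (through Python's built-in zip import support) only works for pure-Python packages, so compiled packages would have to be unpacked on the executor first.

```python
import shutil
import sys
from pathlib import Path


def zip_dependencies(deps_dir: Path, archive_base: Path) -> Path:
    """Pack the installed dependencies in `deps_dir` into `<archive_base>.zip`."""
    archive = shutil.make_archive(str(archive_base), "zip", root_dir=str(deps_dir))
    return Path(archive)


def add_archive_to_path(archive: Path) -> None:
    """Make pure-Python packages inside the zip archive importable."""
    if str(archive) not in sys.path:
        sys.path.append(str(archive))
```

With a single archive, each executor reads one file from the shared volume instead of many small ones, which is the latency concern raised in the proposal.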

soxofaan added a commit to Open-EO/openeo-python-client that referenced this issue May 23, 2024
soxofaan added a commit to Open-EO/openeo-python-client that referenced this issue May 23, 2024
soxofaan added a commit to Open-EO/openeo-python-client that referenced this issue May 23, 2024
soxofaan added a commit that referenced this issue May 28, 2024
@JeroenVerstraelen
Contributor

soxofaan added a commit that referenced this issue May 28, 2024
soxofaan added a commit that referenced this issue May 28, 2024
@soxofaan
Member

soxofaan commented May 29, 2024

Just ran a first job (j-2405294f8a2944a891fe3be260b562c0) on openeo.dev.warsaw.openeo.dataspace.copernicus.eu that automatically installed the declared dependencies.

The minimal UDF was:

# /// script
# dependencies = ["duviz"]
# ///

def foo(x):
    return x + 1

From the batch job logs:
[image: screenshot of the batch job logs]

@soxofaan
Member

soxofaan commented May 30, 2024

The basics are now in place.

Demo on CDSE dev:
[image: screenshot from 2024-05-30 09-46-57]

The UDF has a comment at the top which declares some expected packages like this (PEP 723):

# /// script
# dependencies = ["duviz"]
# ///

Here, duviz is just a random small package that has nothing to do with EO or GIS.

It is automatically installed at the start of the batch job in the folder /batch_jobs/{jobid}/udf-py-deps and made available for import in the UDF.
The UDF itself just appends the imported module's path to a list; see the result at the bottom.
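
How the folder is made importable is not shown in this thread. One way to do it, as a minimal sketch, is to put the job's dependency folder on `sys.path` while the UDF runs. The helper names below are hypothetical; only the `/batch_jobs/{jobid}/udf-py-deps` layout comes from the description above.

```python
import contextlib
import sys
from pathlib import Path
from typing import Iterator


def udf_deps_folder(job_id: str, root: Path = Path("/batch_jobs")) -> Path:
    """Per-job dependency folder, following the layout described above."""
    return root / job_id / "udf-py-deps"


@contextlib.contextmanager
def python_path_entry(folder: Path) -> Iterator[None]:
    """Temporarily append `folder` to sys.path, e.g. while running a UDF."""
    entry = str(folder)
    sys.path.append(entry)
    try:
        yield
    finally:
        # Drop one occurrence of the entry again once the block is done.
        sys.path.remove(entry)
```

Because this sketch appends to `sys.path`, modules already importable from earlier entries (such as the base image's site-packages) take precedence on name clashes.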

Part of the batch job logs showing the pip install step:
[image: screenshot of the batch job logs]

@soxofaan
Member

soxofaan commented May 30, 2024

next steps:

@soxofaan
Member

Confirmed that it works on the executors in a real apply() UDF:

[image: screenshot from 2024-05-30 13-58-43]

@soxofaan
Member

Some notes about "don't install deps that are already available in the default locations":

The current implementation does something like pip install --target /batch_jobs/<jobid>/udf-py-deps

With this --target option, pip installs modules directly under that target folder, which can then immediately be appended to PYTHONPATH.
The drawback of --target is that pip seems to completely ignore existing packages in the default environment. E.g. even if numpy is already available in the base image, pip install --target ... numpy will install it again.

Illustration (in a docker run --rm -it python /bin/bash container):

# Install in root env 
root@f0ca2e7d580e:~# python -m pip install duviz
...
Successfully installed duviz-3.2.0

# Install again with --target
root@f0ca2e7d580e:~# python -m pip install --target /tmp/target duviz
...
Installing collected packages: duviz
Successfully installed duviz-3.2.0

There is a variation with --prefix, which seems to be aware of packages installed in the default env and will not reinstall them unnecessarily:

root@f0ca2e7d580e:~# python -m pip install --prefix /tmp/prefix duviz
Requirement already satisfied: duviz in /usr/local/lib/python3.11/site-packages (3.2.0)

However, the --prefix approach is potentially destructive for the default env as well. E.g. when installing a different version than the one in the default env, it removes the package from the default env:

root@f0ca2e7d580e:~# python -m pip install --prefix /tmp/prefix duviz==3.0.0
...
    Found existing installation: duviz 3.2.0
    Uninstalling duviz-3.2.0:
      Successfully uninstalled duviz-3.2.0
Successfully installed duviz-3.0.0 

# duviz is gone from default env
root@f0ca2e7d580e:~# python -m duviz
/usr/local/bin/python: No module named duviz

So --target is probably a safer option than --prefix for now
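
One possible way to keep `--target` while avoiding needless re-installs would be to filter the declared dependencies beforehand. The sketch below is an assumption, not what the implementation does: it only compares distribution names through `importlib.metadata`, ignores version constraints and extras, and all helper names are hypothetical.

```python
import re
from importlib import metadata
from typing import Callable, Iterable


def distribution_name(requirement: str) -> str:
    """Extract the leading distribution name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if not match:
        raise ValueError(f"Cannot parse requirement: {requirement!r}")
    return match.group(1)


def is_installed(name: str) -> bool:
    """Check whether a distribution is visible to the current interpreter."""
    try:
        metadata.version(name)
        return True
    except metadata.PackageNotFoundError:
        return False


def missing_requirements(
    requirements: Iterable[str],
    installed: Callable[[str], bool] = is_installed,
) -> list[str]:
    """Keep only requirements whose distribution is not installed yet."""
    return [r for r in requirements if not installed(distribution_name(r))]
```

Since version pins are ignored here, a request like `duviz==3.0.0` would be skipped when any duviz is present; a real filter would need to compare versions as well.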

@soxofaan
Member

Verified that it works with compiled packages and GitHub zip archives as well, e.g. using this UDF:

udf_code = """
# /// script
# dependencies = [
#     # A GitHub zip archive based dependency:
#     "duviz @ https://github.com/soxofaan/duviz/archive/refs/tags/v3.2.0.zip",
#     # Rust-based compiled package
#     "ruff",  
# ]
# ///

import re
import xarray
import duviz
import ruff

def apply_datacube(cube: xarray.DataArray, context: dict) -> xarray.DataArray:
    # Get a step size based on the number of things in the imported deps
    step = sum("u" in x for x in dir(duviz)) + len(dir(ruff))
    # Zero out pixels every `step` along x axis
    cube[{"x": slice(None, None, step)}] = 0 
    return cube
"""

@soxofaan
Member

soxofaan commented Jun 3, 2024

More in-depth documentation has been added to https://open-eo.github.io/openeo-python-client/udf.html#udf-dependency-management

@soxofaan
Member

soxofaan commented Jun 3, 2024

status update with some remaining subtasks:

@soxofaan
Member

This morning, I verified that it now also works on the Terrascope deployment (dev).

@soxofaan
Member

Integration tests are now in place for both Terrascope and CDSE.

I'm going to close this ticket. Additional tickets were created for the remaining work.
