
Client side processing #338
Closed · 33 commits

Conversation

clausmichele (Member) commented Oct 28, 2022

@soxofaan I'm starting this PR to discuss how to properly implement the functionality required for client side processing.

soxofaan (Member) left a comment

Some quick initial review comments

Outdated, resolved review threads on openeo/rest/localconnection.py
from openeo.rest.datacube import DataCube
from openeo.internal.jupyter import VisualDict, VisualList

class LocalConnection():
soxofaan (Member):

To be discussed: should rest.Connection and this LocalConnection have a common parent?
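A possible shape for such a shared parent, sketched here with hypothetical names (nothing in this PR defines such a class):

from abc import ABC, abstractmethod

class _BaseConnection(ABC):
    """Hypothetical common parent for rest.Connection and LocalConnection."""

    @abstractmethod
    def list_collections(self) -> list:
        """List the collections available through this connection."""

    @abstractmethod
    def describe_collection(self, collection_id: str) -> dict:
        """Return the full metadata of a single collection."""

The REST connection would implement these against backend HTTP endpoints, while LocalConnection would implement them by scanning local folders.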

Further outdated, resolved review threads on openeo/rest/localconnection.py, openeo/rest/datacube.py and openeo/rest/connection.py
clausmichele (Member Author) commented Nov 14, 2022

@soxofaan after the last commits this is the current syntax:

import openeo
local_conn = openeo.local.connection.LocalConnection("./results/")
local_conn.list_collections()

I would still like to simplify it a bit similarly to the rest part, so with something like openeo.connect_local() maybe?

Edit: does it make sense to have two files with the same name (connection.py)? I am not sure if this could create some confusion.

clausmichele (Member Author)

Specific requirements added in b3067cd. It can now be installed using pip install .[processing]
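For context, optional extras like this are declared in setup.py roughly as follows; the package names below are only illustrative, the actual list is the one added in b3067cd:

from setuptools import setup, find_packages

setup(
    name="openeo",
    packages=find_packages(),
    extras_require={
        # Hypothetical example; see the PR's setup.py for the real dependency list
        "processing": [
            "rioxarray",
            "openeo-pg-parser-networkx",
            "openeo-processes-dask",
        ],
    },
)

With that in place, pip install .[processing] installs the base package plus the extra dependencies.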

Outdated, resolved review threads on setup.py

from openeo.metadata import CollectionMetadata
from openeo.internal.graph_building import PGNode
from openeo.rest.datacube import DataCube
soxofaan (Member):

Something to think about: maybe openeo.rest.DataCube is tied too tightly with the "REST" kind of cubes, and it won't play nice with "local" execution. Maybe we need a local DataCube variant. Or maybe the "rest" DataCube can be decoupled enough from REST aspects.
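One direction this could take, sketched with hypothetical names (the PR does not define such a class; it only illustrates the "local DataCube variant" idea):

from openeo.internal.graph_building import PGNode

class LocalCube:
    """Hypothetical local counterpart of openeo.rest.datacube.DataCube:
    it wraps the same kind of process graph, but execute() runs it locally
    (e.g. via openeo-processes-dask) instead of sending it to a REST backend."""

    def __init__(self, graph: PGNode, connection=None, metadata=None):
        self.graph = graph
        self.connection = connection
        self.metadata = metadata

    def execute(self):
        # Delegate the actual local execution to the connection's processing machinery
        return self.connection.execute(self.graph)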

soxofaan (Member)

I would still like to simplify it a bit similarly to the rest part, so with something like openeo.connect_local() maybe?

No, I think ultimately the local processing could also be covered from the central entry point openeo.connect(). But for now, while we are still figuring things out, I would stay away from baking those things in there too early.

If you want to make it a bit more compact you could have

from openeo.local import LocalConnection

with openeo/local/__init__.py like:

from openeo.local.connection import LocalConnection

That doesn't look too bad to me

Edit: does it make sense to have two files with the same name (connection.py)? I am not sure if this could create some confusion.

I think that's fine. It will just be an implementation detail and the end user would/should not be confronted with that.

clausmichele (Member Author)

@soxofaan I updated the sample notebook to include a sample workflow of local processing using the recently released repos https://github.com/Open-EO/openeo-processes-dask and https://github.com/Open-EO/openeo-pg-parser-networkx

After Christmas we should have a meeting to discuss the next steps for integrating it better.

clausmichele (Member Author)

@soxofaan from our last meeting:

  • Added exceptions for missing dims
  • Allow multiple folders to be scanned for local collections
  • Sample test added
  • Moved the local collections metadata parts to collections.py
  • Added an execute() method for LocalCollection (created a new processing.py to contain the local-processing-specific code)
  • Updated the notebook with examples using execute() (a minimal usage sketch follows below)
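A minimal sketch of what such a local workflow could look like; the folder, collection id and method names here are assumptions for illustration, not the notebook's actual content:

from openeo.local import LocalConnection

local_conn = LocalConnection("./sample_data/")            # folder(s) scanned for local collections
local_conn.list_collections()                             # collections discovered from local netCDF/GeoTIFF files
cube = local_conn.load_collection("sample_data/s2.nc")    # hypothetical local collection id
result = cube.execute()                                   # runs the process graph locally, returns an xarray object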

soxofaan added a commit that referenced this pull request on Feb 3, 2023:
squashed PR #338 (by clausmichele), fixed merge conflict, and did black/darker cleanups
Outdated, resolved review threads on openeo/local/collections.py
soxofaan (Member) commented Feb 6, 2023

I also see that there is (still/again) a merge conflict on setup.py. Can you resolve that?

soxofaan (Member) commented Feb 6, 2023

I also noticed this is, dependency-wise, quite a heavy feature.
A basic install of the openeo package pulls in just these 15 deps:

certifi==2022.12.7
charset-normalizer==3.0.1
Deprecated==1.2.13
idna==3.4
numpy==1.24.2
packaging==23.0
pandas==1.5.3
python-dateutil==2.8.2
pytz==2022.7.1
requests==2.28.2
shapely==2.0.1
six==1.16.0
urllib3==1.26.14
wrapt==1.14.1
xarray==2023.1.0

Install of the localprocessing extra adds 70 deps to that!

affine==2.4.0
attrs==22.2.0
cachetools==5.3.0
cftime==1.6.2
click==8.1.3
click-plugins==1.1.1
cligj==0.7.2
cloudpickle==2.2.1
contourpy==1.0.7
cycler==0.11.0
dask==2023.1.1
dask-geopandas==0.3.0
dask-image==2022.9.0
datacube==1.8.11
distributed==2023.1.1
Fiona==1.9.0
fonttools==4.38.0
fsspec==2023.1.0
GeoAlchemy2==0.13.1
geojson-pydantic==0.5.0
geopandas==0.12.2
greenlet==2.0.2
HeapDict==1.0.1
imageio==2.25.0
Jinja2==3.1.2
jsonschema==4.17.3
kiwisolver==1.4.4
lark==1.1.5
locket==1.0.0
MarkupSafe==2.1.2
matplotlib==3.6.3
msgpack==1.0.4
munch==2.5.0
netCDF4==1.6.2
networkx==2.8.8
numexpr==2.8.4
odc-algo==0.2.3
odc-geo==0.3.3
openeo-pg-parser-networkx==2023.1.2
openeo-processes-dask==2023.1.2
partd==1.3.0
pendulum==2.1.2
Pillow==9.4.0
PIMS==0.6.1
psutil==5.9.4
psycopg2==2.9.5
pydantic==1.10.4
pyparsing==3.0.9
pyproj==3.4.1
pyrsistent==0.19.3
pytzdata==2020.1
PyWavelets==1.4.1
PyYAML==6.0
rasterio==1.3.5.post1
rioxarray==0.13.3
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.7
scikit-image==0.19.3
scipy==1.10.0
slicerator==1.1.0
snuggs==1.4.7
sortedcontainers==2.4.0
SQLAlchemy==1.4.46
tblib==1.7.0
tifffile==2023.2.3
toolz==0.12.0
tornado==6.2
typing_extensions==4.4.0
xgboost==1.7.3
zict==2.2.0

That's a very heavy addition (not only in number of packages, but also in terms of volume of compiled wheels to download). It would be good if this could be trimmed down.

clausmichele (Member Author) commented Feb 15, 2023

I also see that there is (still/again) a merge conflict on setup.py. Can you resolve that?

Solved!

Edit: I don't understand why GitHub states that there is still a merge conflict on setup.py.

clausmichele (Member Author) commented Feb 15, 2023

@soxofaan how did you get the list of all the dependencies required by installing the extra localprocessing?

Edit: I found a package called pipdeptree for this.

Anyway, the large number of dependencies is not something that can be addressed easily. For instance, the resample_cube_spatial process uses odc-algo in order to be scalable with Dask, and odc-algo requires datacube and many other packages just for this feature.

I would propose to discuss this in a joint meeting with the EODC team.

soxofaan (Member)

how did you get the list of all the dependencies required by installing the extra localprocessing?

I just played in temporary virtual envs and compared the lists of installed packages.

Anyway, the large number of dependencies is not something that can be addressed easily.

Of course there are necessary packages for local processing, but I also see a lot of packages related to web dev that do not make sense just for this purpose, e.g.:

click==8.1.3
click-plugins==1.1.1
fonttools==4.38.0
greenlet==2.0.2
Jinja2==3.1.2
MarkupSafe==2.1.2
psutil==5.9.4
psycopg2==2.9.5
pyparsing==3.0.9
PyYAML==6.0
SQLAlchemy==1.4.46
tornado==6.2

(and I guess these drag in a number of smaller secondary dependencies as well)

import pandas as pd

@pytest.fixture()
def create_local_netcdf(tmp_path_factory):
soxofaan (Member):

I don't think this has to be a fixture. I guess for tests you want to produce several kinds of netCDF/GeoTIFF files with varying dimensions, label ranges, etc.,
so you just need a helper function.
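A minimal sketch of what such a helper could look like (the name, dimensions and arguments below are illustrative, not the PR's actual test code):

import numpy as np
import xarray as xr

def create_local_netcdf(path, t=2, bands=3, y=10, x=10):
    """Write a small dummy netCDF file with t/bands/y/x dimensions for tests."""
    data = xr.DataArray(
        np.random.random((t, bands, y, x)),
        dims=["t", "bands", "y", "x"],
    )
    data.to_netcdf(path)
    return path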

clausmichele (Member Author):

Done: fixture removed and a helper function created.

clausmichele (Member Author):

I would need rioxarray if I also want to test GeoTIFFs. Does it make sense to include it in tests_require or not? I wonder if these tests should only run when the localprocessing part is installed?

soxofaan (Member):

In principle, in CI we want to run all tests, but with localprocessing being quite heavy in terms of dependencies (all of which have to be installed for each run), this would make our CI a lot slower, so I'm not sure we should do that.
As long as it is an experimental feature we should skip the localprocessing tests during the standard CI run and only run them manually as a health check during development.

So in short, for now: keep all localprocessing-related dependencies (also the ones just for testing) out of the standard requirements, and just put them in the localprocessing extra.
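One common way to make such tests skip themselves when the extra is not installed (not necessarily what this PR ended up doing) is pytest.importorskip:

import pytest

# Skip the whole test module if the localprocessing dependencies are missing
rioxarray = pytest.importorskip("rioxarray")

def test_local_geotiff_metadata(tmp_path):
    # ...create a small GeoTIFF with rioxarray and check the extracted metadata...
    assert rioxarray is not None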

    return metadata

def _get_geotiff_metadata(file_path):
    data = rioxarray.open_rasterio(file_path.as_posix(), chunks={})
soxofaan (Member):

It looks like there is a lot of overlap/duplication with _get_netcdf_zarr_metadata.
Can't this be handled more generically?

clausmichele (Member Author):

Probably it can be simplified, but at the moment I would like to focus on other, more important parts. Could we proceed with the merge and tackle this later?
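As a rough illustration of the kind of refactoring being suggested (the helper names and the metadata fields below are hypothetical, not code from this PR):

import rioxarray
import xarray as xr

def _open_local_file(file_path, chunks={}):
    """Open a local GeoTIFF/netCDF/zarr file with the reader matching its extension."""
    path = file_path.as_posix()
    if path.endswith((".tif", ".tiff")):
        return rioxarray.open_rasterio(path, chunks=chunks)
    if path.endswith(".zarr"):
        return xr.open_dataset(path, chunks=chunks, engine="zarr")
    return xr.open_dataset(path, chunks=chunks)

def _get_local_metadata(file_path):
    """Build collection metadata from any supported local file in one place."""
    data = _open_local_file(file_path)
    # Derive the shared parts of the metadata (dimension names, sizes, ...) once
    return {"dims": list(data.dims), "sizes": dict(data.sizes)}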

Outdated, resolved review thread on openeo/local/collections.py
soxofaan (Member)

I just had a look again at the dependency problem now that openeo_processes_dask 2023.3.0 is available, and the situation is now much better 👍
The localprocessing extra now only adds these 18 additional deps:

affine==2.4.0
attrs==22.2.0
click==8.1.3
click-plugins==1.1.1
cligj==0.7.2
geojson-pydantic==0.5.0
networkx==2.8.8
openeo-pg-parser-networkx==2023.3.0
openeo-processes-dask==2023.3.0
pendulum==2.1.2
pydantic==1.10.6
pyparsing==3.0.9
pyproj==3.4.1
pytzdata==2020.1
rasterio==1.3.6
rioxarray==0.13.4
snuggs==1.4.7
typing_extensions==4.5.0

soxofaan (Member)

I fine-tuned the PR's branch (e.g. so that no regular, non-localprocessing code paths were touched),
and merged it in 05b9f06.

Great work, thanks!

soxofaan closed this Mar 15, 2023
clausmichele deleted the client_side_proc branch on May 6, 2024