
Contextual catalogs #3

Merged (7 commits) on Apr 3, 2024

Conversation

@rogerkuou (Contributor) commented on Mar 25, 2024

Example workflow of creating a STAC catalog for contextual data.

Example data is available in the Public view on Spider: https://public.spider.surfsara.nl/project/caroline/demo_mobyle/stac_catalog_contextual/

Location: /project/caroline/Public/demo_mobyle/stac_catalog_contextual/

Two example datasets are used:

  • BAG cadastral dataset: this is a big gpkg file with many polygons and related attributes
  • KNMI weather data: this includes 1) a csv file of station info and 2) a .txt file with temporal info of one station

The catalog is created in three steps, separated into three notebooks:

  1. Data conversion: convert the data into a format that supports chunked access (Parquet or Zarr); a minimal conversion sketch follows below
  2. Create a catalog of the converted datasets
  3. Query which datasets intersect an example STM
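
For reference, a minimal sketch of what step 1 could look like for the BAG GeoPackage. The paths and partition count are illustrative assumptions, not the actual notebook code:

```python
import dask_geopandas
import geopandas as gpd

# Illustrative paths; the actual files live under the Spider project directory.
bag_gpkg = "data/bag.gpkg"
out_dir = "data/bag_parquet"

# Read the GeoPackage eagerly (dask-geopandas has no lazy gpkg reader),
# then repartition so that each partition is written as one parquet file.
gdf = gpd.read_file(bag_gpkg)
ddf = dask_geopandas.from_geopandas(gdf, npartitions=16)

# Write the partitions as a directory of (geo)parquet files, chunk-friendly for later loading.
ddf.to_parquet(out_dir)
```

Loading the converted dataset back would then be a single `dask_geopandas.read_parquet(out_dir)` call.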

Remaining issues:

  • Dask-geopandas cannot digest the "geometry" column for now; this seems to be a work in progress.
  • There is currently very limited support for spatial operations in dask-geopandas.
  • How can we create a catalog searchable by pystac-client?

@rogerkuou (Contributor, Author):

Hi @fnattino, this is an example workflow for creating a STAC catalog for MobyLe contextual data. I uploaded the example to the Public view on Spider. Do you mind reviewing this when you have time?

@rogerkuou requested a review from fnattino on March 25, 2024, 13:45
@fnattino left a comment:

Hi @rogerkuou, nice work! I have left some comments on the notebooks below with links to things that could be interesting to check out on the topic.

Nice! One comment on the conversion notebook (we have already discussed it this morning): for handling Parquet/GeoParquet files, you could check out some of the blog posts from Chris Holmes (see here), who has converted a large Google building dataset to GeoParquet (see the repository, especially the "processing" section of the README). He is advocating a lot for DuckDB; maybe it could be worth trying it out.
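
For reference, a rough sketch of what such a DuckDB-based conversion could look like (assuming the spatial extension is available; the input and output paths are illustrative):

```python
import duckdb

con = duckdb.connect()

# The spatial extension provides ST_Read, which can read GeoPackage files.
con.execute("INSTALL spatial; LOAD spatial;")

# Read the (illustrative) BAG GeoPackage and write it out as a Parquet file.
con.execute("""
    COPY (SELECT * FROM ST_Read('data/bag.gpkg'))
    TO 'data/bag.parquet' (FORMAT PARQUET);
""")
```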

Just a note: this README reads nicely in "raw" mode, but not in the formatted version (the lines are wrapped).


  • I am thinking whether it makes more sense to link the directory containing the full BAG dataset within a single item (as it is done here) or to link all the partitions individually (i.e. one in each item). The latter approach could make sense if one manages to split the partitions on the basis of some spatial index, so that contiguous polygons are grouped together in a file. I guess this also depends on the tool(s) that will be used to load/generate the (geo)parquet files: if one manages to load the full collection using something like dask-geopandas, then maybe linking the directory in a single item is a good approach.
  • To specify the asset projection, there is a dedicated STAC extension: https://github.com/stac-extensions/projection . Also, I think it is a STAC norm to use WGS84 for the geometry/bbox of the items, whatever the projection of the linked assets; this also allows one to search a catalog that has items in different projections. Note that the projection metadata is missing for the KNMI dataset.
  • If all the data is going to be placed in the same folder ("data"), close to the catalog, you might consider using relative paths for the catalog, so that you can simply move the full directory of catalog + data without having to renormalise the files. (A sketch of these two points follows after this list.)
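
As a rough illustration of the last two points, a pystac-based sketch (the catalog path and EPSG code are assumptions for the example, not taken from the notebooks):

```python
import pystac
from pystac.extensions.projection import ProjectionExtension

# Illustrative path to the catalog root.
catalog = pystac.Catalog.from_file("stac_catalog_contextual/catalog.json")

for item in catalog.get_all_items():
    # Record the native CRS of the assets via the projection extension,
    # while keeping the item geometry/bbox themselves in WGS84.
    proj = ProjectionExtension.ext(item, add_if_missing=True)
    proj.epsg = 28992  # e.g. Dutch RD New for the BAG assets (illustrative)

# Rewrite all hrefs relative to a common root and save a self-contained catalog,
# so that catalog + data can be moved as one directory without renormalising.
catalog.normalize_hrefs("stac_catalog_contextual")
catalog.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)
```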

@rogerkuou (Contributor, Author) replied:

Thanks Francesco for the nice ideas! They are definitely worth thinking about; it's just that for some practical reasons I implemented it this way.
For now there is no spatial index implemented for BAG, so from what I see it still makes sense to treat the BAG parquet files as a whole.
Regarding the path, the TU Delft side wants to keep the option to decouple the catalog and the data. Hence I implemented the absolute paths.


  • See the comment above on projection: if the item's bbox is not in WGS84, you don't have a straightforward way to check whether a given item intersects your region of interest.
  • Unfortunately, I think you cannot use the search functionality of pystac_client on a static catalog (it only works for APIs).
    So one needs to set up a server; example implementations are stac-server and stac-fastapi. Also, this discussion on the topic of catalogs that need frequent updates might be interesting. Just for visualization of a catalog via the web browser, one can use the STAC browser, which can be pointed to a static catalog. However, this does not seem to work with the public view of Spider (it works for instance with a static catalog on GitHub - try pasting this link into the STAC browser). For filtering a static catalog without an API, see the sketch after this list.
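
As a stopgap for a static catalog (no API), one can still filter items locally with pystac and shapely. A minimal sketch, assuming the items carry WGS84 geometries/bboxes and using an illustrative catalog path and region of interest:

```python
import pystac
from shapely.geometry import box, shape

# Illustrative path to the static catalog; adjust to the actual location on Spider.
catalog = pystac.Catalog.from_file("stac_catalog_contextual/catalog.json")

# Example region of interest in WGS84 (lon/lat), matching the assumed item geometries.
roi = box(4.3, 52.0, 4.5, 52.1)

# Walk all items in the static catalog and keep the ones intersecting the ROI.
matching = [
    item
    for item in catalog.get_all_items()
    if item.geometry is not None and shape(item.geometry).intersects(roi)
]

for item in matching:
    print(item.id, item.bbox)
```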

1. In the file-conversion notebook: changed the RD units from km to meters
2. In the file-conversion notebook: updated the time coords with hours info
3. In the query notebook: select the STM by filtering None values instead of spatial subsetting

@rogerkuou (Contributor, Author):

Thanks @fnattino. I reflected on some of your comments; the others I documented in the discussion TUDelftGeodesy/stmtools#70 for further actions.
