Contextual catalogs #3
Conversation
Hi @fnattino, this is an example workflow of creating a STAC catalog for MobyLe contextual data. I uploaded the example to the Public view on Spider. Do you mind reviewing it when you have time?
Hi @rogerkuou, nice work! I have left some comments on the notebooks below with links to things that could be interesting to check out on the topic.
Nice! One comment on the conversion notebook (we have already discussed it this morning). For handling parquet/geoparquet files, you could check out some of the blog posts by Chris Holmes (see here), who has converted a large Google building dataset to geoparquet (see the repository, especially the "processing" section of the README). He is advocating a lot for DuckDB; maybe it would be worth trying it out.
Also a new geoparquet release is coming up: https://medium.com/radiant-earth-insights/geoparquet-1-1-coming-soon-9b72c900fbf2
Just note this README reads nicely in "raw" mode, but not in the formatted version (lines are wrapped)
- I am thinking whether it makes more sense to link the directory containing the full BAG dataset within a single item (as it is done here) or to link all the partitions individually (i.e. one in each item). The latter approach could make sense if one manages to split the partitions on the basis of some spatial index, so that contiguous polygons are grouped together in a file. I guess this also depends on the tool(s) that will be used to load/generate the (geo)parquet files: if one manages to load the full collection using something like dask-geopandas, then maybe linking the directory in a single item is a good approach.
- To specify the asset projection, there is a dedicated STAC extension: https://github.com/stac-extensions/projection . Also, I think it is a STAC norm to use WGS84 for the geometry/bbox of the items, whatever the projection of the linked assets. This also allows one to search a catalog that has items in different projections. Also note that the projection metadata is missing for the KNMI dataset.
- If all data is going to be placed in the same folder ("data"), close to the catalog, you might consider using relative paths for the catalog, so that you can simply move the full directory of catalog + data without having to renormalise the files.
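The projection and relative-path suggestions above can be sketched together in a single item. This is a stdlib-only sketch with hypothetical file names, coordinates, and IDs (nothing here comes from the actual catalog): the item-level bbox/geometry stay in WGS84, the asset's native CRS goes in the projection-extension field, and the asset href is relative so catalog and data can be moved as one tree.

```python
import json

# Hypothetical STAC item: WGS84 bbox/geometry at the item level,
# native projection (Dutch RD, EPSG:28992) declared via the projection
# extension, relative asset href.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        "https://stac-extensions.github.io/projection/v1.1.0/schema.json"
    ],
    "id": "bag-example",
    # bbox and geometry in WGS84 (lon/lat), regardless of the asset CRS
    "bbox": [4.3, 52.0, 4.5, 52.1],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[4.3, 52.0], [4.5, 52.0], [4.5, 52.1],
                         [4.3, 52.1], [4.3, 52.0]]],
    },
    "properties": {
        "datetime": "2024-01-01T00:00:00Z",
        "proj:epsg": 28992,  # native projection of the linked asset
    },
    "assets": {
        "data": {
            # relative href: catalog + data directory can move together
            "href": "./data/bag_example.parquet",
            "type": "application/x-parquet",
        }
    },
    "links": [],
}

print(json.dumps(item["properties"], indent=2))
```

The point is only the field layout; a real workflow would build this with pystac rather than raw dicts.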
Thanks Francesco for the nice ideas! They are definitely worth considering; I implemented it this way for some practical reasons.
For now there is no spatial index implemented for BAG, so from what I see it still makes sense to treat the BAG parquet files as a whole.
Regarding the paths, the TU Delft side currently wants to keep it possible to de-couple the catalog and the data, hence I implemented absolute paths.
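If the decoupling decision ever changes, switching from absolute to relative hrefs is a small path transformation. A stdlib-only sketch (the Spider paths below are hypothetical, modelled on the layout mentioned later in the README; pystac also offers catalog-wide href normalization):

```python
import posixpath

def make_relative(asset_href: str, item_dir: str) -> str:
    """Rewrite an absolute asset href as a path relative to the
    directory holding the item, so catalog + data move as one tree."""
    return posixpath.relpath(asset_href, start=item_dir)

# Hypothetical absolute href and item directory on Spider
href = "/project/caroline/Public/demo_mobyle/stac_catalog_contextual/data/bag.parquet"
item_dir = "/project/caroline/Public/demo_mobyle/stac_catalog_contextual/items"

print(make_relative(href, item_dir))  # ../data/bag.parquet
```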
- See comment above on projection: if the item's bbox is not in WGS84, you don't have a way to straightforwardly check whether a given item intersects your region of interest.
- Unfortunately, I think you cannot use the search functionality of pystac_client on a static catalog (it only works for APIs).
So one needs to set up a server; example implementations are stac-server and stac-fastapi. Also, this discussion on the topic of catalogs that need to be updated often might be interesting. For just visualizing a catalog via the web browser, one can use the STAC Browser, which can be pointed at a static catalog. However, this does not seem to work with the public view of Spider (it works, for instance, with a static catalog on GitHub: try pasting this link into the STAC Browser).
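Without an API server, a simple client-side alternative is to walk the static catalog's items and test their WGS84 bboxes against a region of interest yourself. A stdlib-only sketch with hypothetical item dicts (a real catalog would be loaded with pystac and walked via its links):

```python
def bbox_intersects(a, b):
    """True if two [minx, miny, maxx, maxy] boxes overlap.
    Assumes both boxes are in the same CRS (WGS84 lon/lat)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical items with WGS84 bboxes, standing in for the catalog's items
items = [
    {"id": "bag", "bbox": [4.3, 52.0, 4.5, 52.1]},
    {"id": "knmi", "bbox": [5.0, 53.0, 5.2, 53.2]},
]
roi = [4.4, 52.05, 4.6, 52.2]  # region of interest, WGS84

hits = [it["id"] for it in items if bbox_intersects(it["bbox"], roi)]
print(hits)  # ['bag']
```

This only works cleanly because the item bboxes share one CRS, which is exactly why the WGS84 convention above matters.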
1. In file conversions: changed the RD units from km to meters
2. In file conversions: updated the time coords with hours info
3. In query: select STM by filtering None values instead of space subsetting
Thanks @fnattino. I reflected on some of your comments. For the others, I documented them in the discussion TUDelftGeodesy/stmtools#70 for further actions.
Example workflow of creating a STAC catalog for contextual data.
Example with data available at Public view on Spider: https://public.spider.surfsara.nl/project/caroline/demo_mobyle/stac_catalog_contextual/
Location:
/project/caroline/Public/demo_mobyle/stac_catalog_contextual/
Two example datasets are used:
The catalog is created in three steps, separated into three notebooks:
Remaining issues:
pystac-client