🙋 Enable STAC catalog search across the network #23
I don't think STAC catalogs should be copies of each other. It is fine to refer to another data hosting location if some STAC entries seem relevant for a given node (e.g. shared use of common data for different studies), but duplicating them entirely is not useful. I think it is actually more useful to have catalogs specialize on different aspects, so search results are less noisy. In other words, if UofT's STAC catalog points to data hosted on CRIM's STAC, that is fine. The STAC Collection/Item on UofT can be a copy.

For the STAC browser aspect, I think it would be much easier to copy the STAC Collection/Item definitions from other catalogs than to try to support multi-catalog searches, even though the data itself would not be hosted on the node running that STAC browser. Integrating the STAC Items/Collections would be a simple cron job running https://github.com/stac-utils/pgstac or another utility in https://github.com/stac-utils for managing STAC databases.
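The cron-job idea above could be sketched roughly as follows. This is a minimal illustration, not the actual pgstac loader: the `store_item` callback and the remote search URL are placeholders, and a real job would hand the fetched items to pypgstac or a similar tool. It only assumes the standard STAC API pagination convention (a search page is a GeoJSON FeatureCollection whose `links` array may contain a `rel: "next"` link).

```python
import json
import urllib.request


def next_link(page):
    """Return the 'next' pagination href from a STAC API search page, if any."""
    for link in page.get("links", []):
        if link.get("rel") == "next":
            return link.get("href")
    return None


def sync_remote_catalog(search_url, store_item):
    """Page through a remote STAC API search endpoint, storing each item.

    `store_item` is a placeholder for whatever loads the item into the
    local database (e.g. pypgstac); returns the number of items copied.
    """
    count, url = 0, search_url
    while url:
        with urllib.request.urlopen(url, timeout=30) as resp:
            page = json.load(resp)
        for item in page.get("features", []):
            store_item(item)
            count += 1
        url = next_link(page)
    return count
```

Run periodically (e.g. from cron), this keeps a local copy of the remote Items while the data itself stays on the originating node.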
I think that we agree:
This is the issue that I think we need to think about. Setting up a cron job is easy; that's not the problem. The problem is that copying items and collections from other catalogs is resource intensive: lots of network IO to fetch the STAC item definitions hosted on all other nodes in the network, and then each node needs to store the STAC items from every other node in its database.
That makes sense. But we also want to allow users to choose to search the entire network if they'd like. A user may not initially know what kind of data is hosted on which node, and we want to allow them to explore. I'm also fine if an individual node's STAC browser only displays data from its own node, but we had discussed allowing users to search the entire network for data. We can do that with pystac, but I think that we should have some GUI somewhere that is a bit more friendly for non-technical users.
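The client-side alternative (query every node at search time instead of replicating) could look something like the sketch below. The node URLs are hypothetical placeholders, and a real client might use pystac-client rather than raw HTTP; it assumes only the standard STAC API `POST /search` endpoint returning a FeatureCollection.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical Marble node STAC search endpoints
NODES = [
    "https://stac.node-a.example/search",
    "https://stac.node-b.example/search",
]


def search_node(url, body):
    """POST a STAC search body to one node and return its item features."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp).get("features", [])


def merge_results(result_lists):
    """Merge per-node results, de-duplicating items referenced by several nodes."""
    seen, merged = set(), []
    for items in result_lists:
        for item in items:
            key = (item.get("collection"), item["id"])
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


def search_network(body):
    """Fan the same search out to every node concurrently, merge client-side."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda url: search_node(url, body), NODES))
    return merge_results(results)
```

The tradeoff is exactly the one discussed here: no replication or central storage, but every search generates traffic to every node.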
How about a request hook that works the other way around?
We could do that. My worries with this are:
On the contrary: if definitions were POST'd to a centralized STAC API/browser, there would be no need to search multiple ones anymore. The downside is that we must rely either on the other instances to send those requests, or on this centralized STAC crawling the other APIs periodically. Either approach has advantages and disadvantages, depending on whether we want more or less request traffic.
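The push variant of this hook could be sketched as below. The central URL is a hypothetical placeholder, and it assumes the central STAC API implements the Transaction extension, which defines `POST /collections/{collectionId}/items` for creating items.

```python
import json
import urllib.request

# Hypothetical central aggregator implementing the STAC Transaction extension
CENTRAL_STAC = "https://stac.central.example"


def item_url(base, item):
    """Build the Transaction-extension item-creation endpoint for an item."""
    return f"{base}/collections/{item['collection']}/items"


def push_item(item):
    """POST a newly created STAC Item to the central catalog; return HTTP status.

    A node would call this from its own publication workflow whenever a
    new Item is added locally, so the central API never needs to crawl.
    """
    req = urllib.request.Request(
        item_url(CENTRAL_STAC, item),
        data=json.dumps(item).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status
```

This trades the periodic-crawl traffic for a dependency on each node remembering to push, which is the reliability concern raised above.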
That is not true. A global STAC could be load-balanced with multiple replicas and instances; there are many ways to work around that. We cannot rely on the nodes in the Marble federated network (in its current state) in the sense of "providing replications" of the same data/services or "fallback endpoints", because each node fundamentally provides different services and datasets. If a node goes down, the network is not any more resilient, since that node's unique data is not accessible anyway. There are only multiple points of failure at the moment. At this time, I believe the concern is about finding ways to aggregate all available information somehow (especially the information that differs between nodes). For example, ESGF uses Metagrid to accomplish this. For Marble, nothing is defined yet. This is not to be confused with node replication, which is a whole different concern, and which each individual Marble node could resolve using its own array of subnodes for reliability/replication.
Very complicated indeed.
Yes, something like metagrid which would allow us to search across the network would be great! I mentioned before:
so implementing some search interface like Metagrid, that we could add the NLP interface into later on, would be perfect. I still think that having a centralized STAC API doesn't make sense for this project:
Sure, we can always make replicas but who is going to maintain these? We don't have any centralized architecture because we don't want a critical part of the network to deteriorate once we run out of funding. The network needs to be self-sustaining and that means that everything should work with just the nodes in place.
Yes, but there is a use-case for protecting meta-data as well, and I think it is likely that some nodes will want to be able to do so.
Yes, we could have custom strategies for NLP search. I would like a solution that doesn't rely on a global instance as well. However, this is the only option I can see as a tradeoff to dispatching search requests to all nodes each time. That doesn't mean the second approach is bad, or that the first is better for that matter; I'm just listing potential solutions.

It seems that whether we use a global STAC replicating remote Items/Collections, or some custom interface that queries all the STAC nodes, we have some kind of "central portal" no matter what. If a custom portal is planned for implementation regardless, then this could be the best choice. If not (or, as mentioned, to avoid a critical/central architecture), then the "quick" workaround is a global STAC that simply duplicates Items/Collections, since the "source" Items/Collections remain available on the respective nodes even if the central one goes down, and it doesn't involve a custom UI implementation to aggregate searches.

For either solution, I guess it would be the same organization maintaining it, whether it is a global STAC or a central Metagrid-like interface. More load-balanced instances could be added, and maintained by many organizations, but I personally don't think we are at that point yet. The important aspect I want to highlight is that we must distinguish "network nodes" from "instance replicas". For the time being, I consider Hirondelle, PAVICS and RedOak to be "network nodes" (as it should be), but by no means replicas.
I agree, but other nodes probably don't need to know about it if it is not public, unless there is already something in place to provide federated logic, or the user already has a login for that node.
Topic category
Select which category your topic relates to:
Topic summary
I have some thoughts about the issue of making STAC catalogs aware of all other catalogs in the network.
We could make all catalogs a copy of each other but there are some issues with this:
We could centralize the catalog but that goes against the mission of the project and doesn’t actually solve most of the resource/search problems.
Let’s go back to the reason that we want the catalogs to be aware of each other: we want to be able to search across the whole network.
I think that we could achieve this on the client side instead. We have two main clients that we need to deal with, pystac and the stac browser.
Pystac:
Stac browser:
By modifying the client side and leaving the stac catalogs themselves alone we can:
I don't know if there will be any interest in modifying the stac browser but we could make the case for it.
Also, if we're already planning on building a new search interface for STAC (in order to integrate the NLP search component), we could just plan to create our own stac browser that supports multi-node search anyway.
Supporting documentation
Additional information