Technical concept to simplify the FederatedCatalog

Decision

We will simplify the FederatedCatalog (FC) in the aspects outlined in this document, because experience shows that the initial complexity and versatility are in fact not needed. The intended audience is developers who write code for the FC and adopters who use and operate an FC.

Rationale

The Federated Catalog was originally designed with the greatest flexibility and versatility in mind. For instance, it has a PartitionManager that allows for partitioning a dataspace, e.g. by geographic areas.

Similarly, all components such as work queues, worker threads (Crawlers) and the persistence pipeline are decoupled using queues and parallel processing to enable the utmost independence and extensibility.

Practical applications so far have shown that all this complexity is not needed. Rather, it makes the FC architecture hard to understand and hard to maintain.

Reducing the complexity of the code base will be beneficial in several ways: code paths become easier to understand, and the code becomes easier to test and less prone to implementation errors.

Approach

We therefore propose the following changes and simplifications:

1. Remove the `PartitionManager`

If partitioning is needed, it would likely be done using a different approach, such as deployment paradigms, so the PartitionManager can be deleted.

This means the FederatedCatalogCacheExtension directly instantiates the ExecutionPlan and starts it.
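The wiring without a PartitionManager could look roughly like the following sketch. The `ExecutionPlan` interface shown here, the `CatalogCacheExtensionSketch` class and the `planFactory` and `crawlRun` parameters are simplified assumptions made for illustration, not the actual FC types.

```java
import java.util.function.Supplier;

// Hypothetical, reduced version of an execution plan: it decides when and how
// often the given task (one crawl-run) is executed.
interface ExecutionPlan {
    void run(Runnable task);
}

// Hypothetical stand-in for the FederatedCatalogCacheExtension.
final class CatalogCacheExtensionSketch {
    private final Supplier<ExecutionPlan> planFactory;
    private final Runnable crawlRun;

    CatalogCacheExtensionSketch(Supplier<ExecutionPlan> planFactory, Runnable crawlRun) {
        this.planFactory = planFactory;
        this.crawlRun = crawlRun;
    }

    void start() {
        // No PartitionManager in between: the extension creates the plan itself
        // and starts it right away.
        ExecutionPlan plan = planFactory.get();
        plan.run(crawlRun);
    }
}
```

A plan that crawls periodically would schedule `task` on a timer inside `run`; that part is left out here to keep the sketch short.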

2. Make the `WorkItemQueue` short-lived

When the ExecutionPlan is run, it fetches all FederatedCacheNode elements from the FederatedCacheNodeDirectory, instantiates a new WorkItemQueue and inserts the elements into it. The crawlers then set off and drain the queue. This means that a new queue is used for every crawl-run (assuming periodic crawling). Aside from (small) resource improvements, this makes the entire FC more stateless.
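As a minimal sketch of the "one queue per crawl-run" idea: the types below are reduced, hypothetical stand-ins for `FederatedCacheNode`, `WorkItem` and `FederatedCacheNodeDirectory`, and `CrawlRun.newQueueFor` is a helper name invented for this example.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-ins; the real FC types carry more information.
final class FederatedCacheNode {
    final String name;
    final String targetUrl;

    FederatedCacheNode(String name, String targetUrl) {
        this.name = name;
        this.targetUrl = targetUrl;
    }
}

final class WorkItem {
    final String url;

    WorkItem(String url) {
        this.url = url;
    }
}

interface FederatedCacheNodeDirectory {
    List<FederatedCacheNode> getAll();
}

final class CrawlRun {
    // Builds a fresh queue for a single crawl-run. Nothing is kept between runs,
    // so the queue can simply be discarded once the crawlers have drained it.
    static Queue<WorkItem> newQueueFor(FederatedCacheNodeDirectory directory) {
        Queue<WorkItem> queue = new ConcurrentLinkedQueue<>();
        for (FederatedCacheNode node : directory.getAll()) {
            queue.add(new WorkItem(node.targetUrl));
        }
        return queue;
    }
}
```

Because every run starts from the directory's current content, nodes added or removed between runs are picked up without any extra bookkeeping.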

3. Make the `Crawler` short-lived

Currently, crawlers are instantiated once and re-used on every crawl-run. Between runs, they needlessly poll the queue. We propose that the ExecutionPlan directly instantiates the crawlers (e.g. through a Supplier or a Function) and sets them off in its run method, passing the next WorkItem as an argument.

Consequently, the ExecutionPlan maintains a list of Crawlers, tracks their state (running vs finished), and removes them from its internal list once they are done. It then instantiates new ones until the WorkItemQueue is empty. That way, crawlers can be simplified greatly, because they do not have to implement "polling the queue" anymore, thus reducing code complexity. With this, they will also mimic the semantics of a stateless "worker" more closely.

Note: this would mean that the logic to poll the queue moves into the ExecutionPlan. At that point we could think about introducing a new collaborator object for the ExecutionPlan, an ExecutionManager of sorts.
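The loop described above could be sketched as follows. `ExecutionManager` is the hypothetical collaborator mentioned in the note, `WorkItem` is a reduced stand-in, and `maxCrawlers` and `crawlerFactory` are assumed names. Crawlers are modelled as plain `Runnable`s created through a `Function`, which is one of the two options the text suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

// Hypothetical stand-in for the real WorkItem.
final class WorkItem {
    final String url;

    WorkItem(String url) {
        this.url = url;
    }
}

// Hypothetical collaborator of the ExecutionPlan; not an existing FC class.
final class ExecutionManager {
    private final int maxCrawlers;
    private final Executor executor;
    private final Function<WorkItem, Runnable> crawlerFactory;

    ExecutionManager(int maxCrawlers, Executor executor, Function<WorkItem, Runnable> crawlerFactory) {
        if (maxCrawlers < 1) {
            throw new IllegalArgumentException("maxCrawlers must be at least 1");
        }
        this.maxCrawlers = maxCrawlers;
        this.executor = executor;
        this.crawlerFactory = crawlerFactory;
    }

    // Runs one crawl-run: blocks until the queue is empty and all crawlers are done.
    void run(Queue<WorkItem> queue) {
        List<CompletableFuture<Void>> running = new ArrayList<>();
        while (!queue.isEmpty() || !running.isEmpty()) {
            // Drop crawlers that have finished.
            running.removeIf(CompletableFuture::isDone);
            // Start one new, short-lived crawler per WorkItem, up to the limit.
            while (running.size() < maxCrawlers) {
                WorkItem next = queue.poll();
                if (next == null) {
                    break;
                }
                Runnable crawler = crawlerFactory.apply(next);
                // Failures are swallowed here for brevity; a real implementation
                // would report or retry them.
                running.add(CompletableFuture.runAsync(crawler, executor).exceptionally(failure -> null));
            }
            // Wait until at least one crawler finishes before looping again.
            if (!running.isEmpty()) {
                CompletableFuture.anyOf(running.toArray(new CompletableFuture[0])).join();
            }
        }
    }
}
```

In this sketch a crawler never sees the queue: it receives exactly one `WorkItem`, does its work, and is discarded. Only the `ExecutionManager` polls the queue.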

4. Have `Crawler`s directly write to the `FederatedCacheStore`

Currently, every Crawler puts its result (i.e. an UpdateResponse) into yet another queue, from which the Loader dequeues it and forwards it to the store. Since the persistence abstraction is handled by the store anyway, there is no need for the indirection through one (or potentially multiple) loaders anymore.
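A rough sketch of a crawler that writes straight to the store is shown below. The `save(UpdateResponse)` signature, the fields of `UpdateResponse`, the `InMemoryStore` class and the `catalogRequest` function are assumptions made for illustration and may differ from the actual FC interfaces.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical, heavily reduced stand-ins for the real FC types.
final class UpdateResponse {
    final String source;
    final String catalog;

    UpdateResponse(String source, String catalog) {
        this.source = source;
        this.catalog = catalog;
    }
}

interface FederatedCacheStore {
    void save(UpdateResponse response);
}

// Simple store used only for this example; persistence details stay behind the interface.
final class InMemoryStore implements FederatedCacheStore {
    final Map<String, String> entries = new ConcurrentHashMap<>();

    @Override
    public void save(UpdateResponse response) {
        entries.put(response.source, response.catalog);
    }
}

final class Crawler implements Runnable {
    private final String targetUrl;
    private final Function<String, UpdateResponse> catalogRequest;
    private final FederatedCacheStore store;

    Crawler(String targetUrl, Function<String, UpdateResponse> catalogRequest, FederatedCacheStore store) {
        this.targetUrl = targetUrl;
        this.catalogRequest = catalogRequest;
        this.store = store;
    }

    @Override
    public void run() {
        // Fetch the catalog and hand the result to the store directly:
        // no result queue and no Loader in between.
        UpdateResponse response = catalogRequest.apply(targetUrl);
        store.save(response);
    }
}
```

With this shape, the store must tolerate concurrent writes from several crawlers, which is why the example store uses a `ConcurrentHashMap`.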
