[card-cache-service] optimize caching #417

Merged
merged 2 commits on Mar 20, 2024

Conversation

valayDave (Contributor) commented on Feb 26, 2024

Core Changes:

  • A background service will poll all card/data updates for individual tasks.
  • One process will run per taskspec for a short amount of time specified via environment variables.
  • This service will be launched when cards are requested.
  • Cards will be read directly from the cache and the reads will be "best-effort", meaning that in a load-balanced setting one server can be serving stale updates compared to another. A workaround for this is passing the time of the update from the Metaflow task; this commit in Metaflow will ensure that stale updates can get discarded.
  • API routes to get a card / list cards will async-wait until the cache is available or there is a timeout (see the sketch after this list).
  • Requires the latest Metaflow client.
  • The new optimization will require the MF GUI to also be up to date with the new server.
  • The Metaflow UI changes will make it poll the server every 500 ms.
  • The changes come with async routines that clean up the cache and remove completed async processes.
  • Removed dead code.
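
To make the request path above concrete, here is a minimal sketch (not the actual service code) of launching a per-taskspec refresher on demand and async-waiting on the cache up to `CARD_API_HTML_WAIT_TIME`. The cache structures, refresher module name, and default timeout below are assumptions for illustration only.

```python
import asyncio
import os
import time

# Illustrative stand-ins (assumptions, not the service's real structures).
CARD_HTML_CACHE = {}      # {(taskspec, card_id): html}
RUNNING_REFRESHERS = {}   # {taskspec: asyncio.subprocess.Process}

# Placeholder default; the real default may differ.
CARD_API_HTML_WAIT_TIME = float(os.environ.get("CARD_API_HTML_WAIT_TIME", 3))


async def _ensure_refresher(taskspec: str):
    """Launch at most one background refresh process per taskspec."""
    proc = RUNNING_REFRESHERS.get(taskspec)
    if proc is not None and proc.returncode is None:
        return  # a refresher is already polling this taskspec
    RUNNING_REFRESHERS[taskspec] = await asyncio.create_subprocess_exec(
        "python", "-m", "card_cache_refresher", taskspec  # hypothetical module
    )


async def get_card_html(taskspec: str, card_id: str):
    """Best-effort read: async-wait on the cache until ready or timed out."""
    await _ensure_refresher(taskspec)
    deadline = time.monotonic() + CARD_API_HTML_WAIT_TIME
    while time.monotonic() < deadline:
        html = CARD_HTML_CACHE.get((taskspec, card_id))
        if html is not None:
            return html
        await asyncio.sleep(0.1)  # yield to the event loop between checks
    return None  # timed out -> null response
```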

Why not use the existing cache client:

  • The existing cache client loads the entire `Task` / `Card` object into memory and then returns the HTML/data from it.
  • This is inefficient because loading the `Card` object makes datastore list calls, which are time-expensive.
  • Once the path has been found, getting the object is a very fast operation.
  • For example, listing cards takes ~1-2 seconds, but getting the actual card once the path is resolved takes ~10 milliseconds.
  • The current cache actions are "stateless", meaning that once an action is done, the previous state is lost when a new action is called.
  • This stateless nature is a poor fit for cards, where the data may change much more frequently but the paths won't change.
  • The new cache service retrieves the object paths once and then keeps updating the objects until the background process finishes execution (see the sketch after this list).
  • This approach improves latency drastically.
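
The latency argument in a nutshell, as a rough sketch under assumed datastore/cache interfaces (`list_card_paths` and `get` are hypothetical names): the expensive list call is paid once per background process, and only the cheap per-path reads are repeated.

```python
import time


def refresh_loop(datastore, cache, taskspec, max_uptime=300, poll_every=2):
    # Expensive: a datastore list call (~1-2 s per the numbers above), paid once.
    paths = datastore.list_card_paths(taskspec)  # hypothetical API
    deadline = time.monotonic() + max_uptime
    while time.monotonic() < deadline:
        for path in paths:
            # Cheap: a direct get once the path is known (~10 ms per the numbers above).
            cache[path] = datastore.get(path)    # hypothetical API
        time.sleep(poll_every)
```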

Configuration Options:

  • `CARD_CACHE_PROCESS_NO_CARD_WAIT_TIME`: How long the process should wait for a card to become available before it exits.
  • `CARD_CACHE_PROCESS_MAX_UPTIME`: The maximum duration the process should run (see the sketch after this list).
  • `CARD_CACHE_CARD_LIST_POLLING_FREQUENCY`: How frequently the process should poll for new cards to list.
  • `CARD_CACHE_CARD_UPDATE_POLLING_FREQUENCY`: How frequently the process should poll for the card HTML content.
  • `CARD_CACHE_DATA_UPDATE_POLLING_FREQUENCY`: How frequently the process should poll for data updates.
  • `CARD_CACHE_DISK_CLEANUP_INTERVAL`: The interval at which the on-disk card cache is cleaned up.
  • `CARD_API_HTML_WAIT_TIME`: The maximum time the card HTML retrieval API will busy-wait for the card to be ready before timing out and returning a null response.
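
A sketch of how the two process-lifetime knobs could interact (the defaults below are placeholders, not the service's real defaults): the refresher exits early if no card appears within `CARD_CACHE_PROCESS_NO_CARD_WAIT_TIME` and always stops after `CARD_CACHE_PROCESS_MAX_UPTIME`.

```python
import os
import time

# Placeholder defaults for illustration only.
NO_CARD_WAIT_TIME = float(os.environ.get("CARD_CACHE_PROCESS_NO_CARD_WAIT_TIME", 10))
MAX_UPTIME = float(os.environ.get("CARD_CACHE_PROCESS_MAX_UPTIME", 600))
LIST_POLL = float(os.environ.get("CARD_CACHE_CARD_LIST_POLLING_FREQUENCY", 5))


def poll_cards(list_cards, taskspec):
    """Poll for a task's cards; exit early if none appear, stop at max uptime."""
    start = time.monotonic()
    while time.monotonic() - start < MAX_UPTIME:
        cards = list_cards(taskspec)  # hypothetical listing callable
        if not cards and time.monotonic() - start > NO_CARD_WAIT_TIME:
            return  # no card ever showed up -> give up early
        # ... refresh cached HTML/data for `cards` here ...
        time.sleep(LIST_POLL)
```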

TODOs


```python
async def _get_latest_return_code(process: Process):
    with contextlib.suppress(asyncio.TimeoutError):
        await asyncio.wait_for(process.wait(), 1e-6)
```
Collaborator
Just some notes for clarity: so the combination of `wait_for` and async `process.wait()` with a minimal timeout in effect acts in a similar way to `process.poll()` from the synchronous implementation?
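
For comparison, a minimal sketch of the equivalence being asked about (not code from this PR): `subprocess.Popen.poll()` returns immediately, giving `None` while the child is alive, and the async pattern gets the same non-blocking behavior by bounding `process.wait()` with a near-zero timeout and suppressing the resulting `TimeoutError`.

```python
import asyncio
import contextlib
import subprocess


def sync_check(proc: subprocess.Popen):
    # poll() never blocks: None while running, the exit code once finished.
    return proc.poll()


async def async_check(process: asyncio.subprocess.Process):
    # wait_for() with a tiny timeout cancels the wait almost immediately,
    # so this never blocks either; returncode then mirrors poll()'s contract.
    with contextlib.suppress(asyncio.TimeoutError):
        await asyncio.wait_for(process.wait(), 1e-6)
    return process.returncode
```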

valayDave added a commit to Netflix/metaflow-ui that referenced this pull request on Mar 14, 2024
- make refresh interval default smaller
- remove timeout code block in card load
- leverage the `created_on` timestamps in data updates to discard stale data (see the sketch after this list)
- handle case of idle refresh
- comments on what's changed
- compatible with new changes introduced in Netflix/metaflow-service#417
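
The stale-data handling above boils down to a last-writer-wins check on `created_on`. A sketch of that logic follows (written in Python purely for illustration; the actual UI change is not reproduced here).

```python
# Track the newest update applied per card and drop anything older.
latest_created_on = {}  # {card_id: created_on of the last applied update}


def should_apply(card_id, update):
    """Discard data updates whose created_on is older than one already applied."""
    seen = latest_created_on.get(card_id)
    if seen is not None and update["created_on"] <= seen:
        return False  # stale update, e.g. from a lagging load-balanced server
    latest_created_on[card_id] = update["created_on"]
    return True
```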
saikonen merged commit 93be3c3 into Netflix:master on Mar 20, 2024
5 checks passed