[cache-memory-leak] Fix Memory leak in cache server #416

Merged
merged 1 commit into Netflix:master from valay/cache-logging-fix on Feb 19, 2024

Conversation

valayDave
Contributor

@valayDave valayDave commented Feb 16, 2024

Key changes

  • Recreate the multiprocess pool at a regular cadence to avoid memory leaks.
  • Since the pool was never removed, memory grew without bound.
  • Added a log size constraint to the tail-based log cache setting.

@valayDave valayDave force-pushed the valay/cache-logging-fix branch 4 times, most recently from 4cbba88 to 7b489ee on February 16, 2024 at 00:59
Comment on lines 237 to 243
log_size = get_log_size(task, logtype)
if log_size > self._max_log_size:
    return [(
        None, log_size_exceeded_blurb(task, logtype, self._max_log_size)
    ), ]
Contributor Author

If the logs are larger than the configured maximum size, the server responds with a standard blurb that explains to the user how to access the logs from their local machine instead.

This option is very useful when users are generating > 100 MB of logs.
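
For illustration, a minimal standalone sketch of this size-guard pattern. The names get_log_size, log_size_exceeded_blurb, and the maximum-size setting come from the PR's diff, but the bodies below are placeholders and the functions take a local file path rather than the PR's (task, logtype) arguments; this is not the repository's implementation.

    import os

    # Hypothetical cap; in the PR the limit comes from the cache action's _max_log_size setting.
    MAX_LOG_SIZE = 20 * 1024 * 1024  # 20 MB

    def get_log_size(path):
        # Cheap check: stat the file instead of reading it into memory.
        return os.path.getsize(path)

    def log_size_exceeded_blurb(path, max_size):
        # Placeholder text; the PR's blurb explains how to fetch the logs locally instead.
        return "Log exceeds %d bytes; fetch it from your local machine instead." % max_size

    def tail_log(path, max_size=MAX_LOG_SIZE):
        if get_log_size(path) > max_size:
            # Same (line_number, text) shape as the cache action's return value.
            return [(None, log_size_exceeded_blurb(path, max_size))]
        with open(path) as f:
            return [(i, line.rstrip("\n")) for i, line in enumerate(f)]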

Comment on lines +321 to +326
if time.time() - _counter > 30:
    self.verify_stale_workers()
    _counter = time.time()

self.cleanup_if_necessary()
Contributor Author

This is the core change that fixes the memory leak.
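
For context, a self-contained sketch of the maintenance cadence around this call, assuming a synchronous main loop like the one in the diff; verify_stale_workers and cleanup_if_necessary are the PR's method names, but their bodies here are placeholders.

    import time

    class CacheServerLoop:
        STALE_CHECK_INTERVAL = 30  # seconds, matching the cadence in the diff

        def verify_stale_workers(self):
            pass  # placeholder: reap workers whose requests have already finished

        def cleanup_if_necessary(self):
            pass  # placeholder: refresh the multiprocess pool once it is due

        def run(self):
            _counter = time.time()
            while True:
                # ... accept requests and spawn workers here ...
                if time.time() - _counter > self.STALE_CHECK_INTERVAL:
                    self.verify_stale_workers()
                    _counter = time.time()
                # In the PR this runs on every iteration (a reviewer notes below
                # that it could also live inside the if-block).
                self.cleanup_if_necessary()
                time.sleep(0.1)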

@@ -56,7 +56,7 @@ async def read_message(self, line: str):

         if self.logger.isEnabledFor(logging.INFO):
             self.logger.info(
-                "Pending stream keys: {}".format(list(self.pending_requests))
+                "Pending stream keys: {}".format(len(list(self.pending_requests)))
Contributor Author

Drive-by change to reduce the size of the log output (log the number of pending stream keys instead of the full list).

Contributor
@romain-intel romain-intel left a comment

Minor nits but LGTM.

if time_to_pool_refresh > 0:
    return
# if workers are still running 30 seconds after the pool refresh timeout, then cleanup
no_workers_are_running = len(self.workers) == 0 and len(self.pending_requests) == 1
Contributor

Does this mean an idle pool has one request?

Contributor Author

Yes, that is what I observed.

# if workers are still running 30 seconds after the pool refresh timeout, then cleanup
no_workers_are_running = len(self.workers) == 0 and len(self.pending_requests) == 1
pool_needs_refresh = time_to_pool_refresh <= 0
pool_force_refresh = time_to_pool_refresh < -60
Contributor

Do we want to make that configurable?

Contributor Author

Do you mean the force-refresh setting? Yes, I can make it configurable.
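
For illustration, one way the force-refresh grace period could be made configurable is through an environment variable; the variable name and helper below are hypothetical, not something the PR introduces.

    import os

    # Hypothetical knob; the PR hard-codes a 60-second grace period before a force refresh.
    POOL_FORCE_REFRESH_GRACE = int(
        os.environ.get("CACHE_POOL_FORCE_REFRESH_GRACE_SECONDS", "60")
    )

    def refresh_flags(time_to_pool_refresh):
        # time_to_pool_refresh is the number of seconds until the next scheduled
        # refresh; it goes negative once the refresh is overdue.
        pool_needs_refresh = time_to_pool_refresh <= 0
        pool_force_refresh = time_to_pool_refresh < -POOL_FORCE_REFRESH_GRACE
        return pool_needs_refresh, pool_force_refresh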

echo("Refreshing the pool as no workers are running and no pending requests are there.")
self.cleanup_workers()
elif no_workers_are_running and pool_needs_refresh:
echo("Refreshing the pool as no workers are running and no pending requests are there.")
Contributor

Is there no race here? Can a worker not "get busy" in the meantime?

Contributor Author
@valayDave valayDave Feb 16, 2024

Given that this entire execution runs synchronous code, and this block comes after the code path that creates new workers, the only current case is that the workers have already been created. No new worker will be created during the execution of this block, so if workers exist, they are already busy.

In general, if workers are busy at the time of the pool refresh, I didn't want to kill them immediately. If the server is under heavy load, an API request may be waiting on some object whose worker would get killed, and the API response would end up being an ugly HTTP 500.

I haven't fully explained in the PR description how the memory leak came to be, so let me shed more light:

  1. When API requests come in, the UI server asks the cache server (running in the background) to extract the cached value. Each request is assigned a stream-id, and the API server waits on the cache server's response for that stream-id.
  2. The cache server spawns a new worker for this stream-id; the worker does some computation and returns the object on the stream.
  3. When many requests hit the server (especially when a run has a lot of logs), the cache server spawns many individual workers, and the API server waits on all of these streams. If the objects (e.g. logs) are large, each worker ends up loading the whole object into memory to produce its response. After the response is delivered, a callback removes the worker and commits information to the cache store. However, even after a worker has been killed, the pool still holds a reference to it (since we never close/join the pool), as sketched below.
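
A minimal, standalone sketch of the fix pattern described above: instead of keeping one multiprocessing pool alive for the server's lifetime, close (or, past a grace period, terminate) and join the old pool on a cadence and replace it with a fresh one. Class and method names, intervals, and the busy-worker signal are illustrative, not the PR's exact implementation.

    import time
    import multiprocessing as mp

    class RefreshingPool:
        """Wraps a multiprocessing.Pool and recreates it on a fixed cadence."""

        def __init__(self, processes=4, refresh_interval=300, force_grace=60):
            self._processes = processes
            self._refresh_interval = refresh_interval
            self._force_grace = force_grace
            self._pool = mp.Pool(processes)
            self._created_at = time.time()

        def apply_async(self, func, args=()):
            return self._pool.apply_async(func, args)

        def cleanup_if_necessary(self, workers_busy):
            overdue = time.time() - self._created_at - self._refresh_interval
            if overdue <= 0:
                return
            if not workers_busy:
                # Graceful refresh: close and join so worker memory is released.
                self._pool.close()
                self._pool.join()
            elif overdue > self._force_grace:
                # Force refresh after the grace period; this kills busy workers.
                self._pool.terminate()
                self._pool.join()
            else:
                return
            self._pool = mp.Pool(self._processes)
            self._created_at = time.time()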

    self.verify_stale_workers()
    _counter = time.time()

self.cleanup_if_necessary()
Contributor

We could probably put this inside the if statement. Not a huge difference though.

    self.verify_stale_workers()
    _counter = time.time()

self.cleanup_if_necessary()
time.sleep(0.1)
Contributor Author

In theory, this sleep could be smaller.

if log_size > self._max_log_size:
    return [(
        None, log_size_exceeded_blurb(task, logtype, self._max_log_size)
    ), ]
# Note this is inefficient - we will load a 1GB log even if we only want last 100 bytes.
# Doing this efficiently is a step change in complexity and effort - we can do it when justified in future.
Contributor

I guess we've arrived at that future point where it's justified :)
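
For what it's worth, reading only the tail of a plain local file is cheap with seek; the complexity the code comment alludes to presumably lies in doing the equivalent against wherever the logs actually live, so treat this only as a sketch of the local case.

    import os

    def tail_bytes(path, n=100):
        # Read only the last n bytes of a local file instead of loading the whole thing.
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            f.seek(max(size - n, 0))
            return f.read()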

@valayDave valayDave force-pushed the valay/cache-logging-fix branch 2 times, most recently from 22d7e5d to 342c437 on February 16, 2024 at 22:58
- Recreate the multiprocess pool at a regular cadence to avoid memory leaks
- Since the pool was never removed, it resulted in uncleared memory.
- Add a log size constraint to the cache server to avoid memory leaks.
- Fix test too
@savingoyal savingoyal merged commit 707c534 into Netflix:master Feb 19, 2024
6 checks passed