Fix/avoid expensive ops #338

Merged

saikonen merged 6 commits into master from fix/avoid-expensive-ops

Jul 4, 2023

Contributor

romain-intel commented Nov 18, 2022

No description provided.

romain-intel added 3 commits

October 4, 2022 10:48


          Avoid some expensive logging operations when not needed

4889dae


          Make task status less expensive

2e29d1c


          Black linting

3e96519

pjoshi30 assigned pjoshi30 and unassigned pjoshi30

romain-intel commented

Contributor Author

romain-intel left a comment

@savingoyal : commented on the logic.

services/ui_backend_service/data/cache/client/cache_async_client.py

@@ -8,8 +8,8 @@
                   from services.utils import logging
                   OP_WORKER_CREATE = 'worker_create'

Contributor Author

romain-intel Jan 20, 2023

Black modification

services/ui_backend_service/data/cache/client/cache_async_client.py

@@ -20,20 +20,16 @@ class CacheAsyncClient(CacheClient):
                   _restart_requested = False
                   async def start_server(self, cmdline, env):
-                      self.logger = logging.getLogger("CacheAsyncClient:{root}".format(root=self._root))

Contributor Author

romain-intel Jan 20, 2023

Black modification

services/ui_backend_service/data/cache/client/cache_async_client.py

-                          self.logger.info("Pending stream keys: {}".format(
-                              list(self.pending_requests)))
+                          if self.logger.isEnabledFor(logging.INFO):

Contributor Author

romain-intel Jan 20, 2023

Make logging cheaper: avoid evaluating the log arguments when the message would not be emitted anyway.
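
The guard discussed here can be sketched in isolation. This is a minimal illustration, not the code from this PR: the logger name, `expensive_summary`, and the `calls` counter are hypothetical and exist only to show that the argument is not built when INFO is disabled. `Logger.isEnabledFor` is part of the standard `logging` module.

```python
import logging

# Hypothetical logger name, used for illustration only.
logger = logging.getLogger("CacheAsyncClient:example")

# Counts how often the expensive argument is actually built.
calls = {"count": 0}


def expensive_summary(pending):
    """Stand-in for a costly argument, e.g. copying a large dict's keys."""
    calls["count"] += 1
    return list(pending)


def log_pending(pending):
    # Only build the argument if an INFO record would be created.
    if logger.isEnabledFor(logging.INFO):
        logger.info("Pending stream keys: %s", expensive_summary(pending))
```

With the logger set to WARNING, `log_pending` never calls `expensive_summary`. Passing the value through `%s` rather than `str.format` additionally defers the string formatting itself to the logging framework.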

services/ui_backend_service/data/cache/client/cache_async_client.py

-                              await asyncio.wait_for(
-                                  self._proc.stdin.drain(),
-                                  timeout=WAIT_FREQUENCY)
+                              await asyncio.wait_for(self._proc.stdin.drain(), timeout=WAIT_FREQUENCY)

Contributor Author

romain-intel Jan 20, 2023

black formatting

services/ui_backend_service/data/cache/client/cache_async_client.py Outdated

                       except asyncio.TimeoutError:
-                          self.logger.warn("StreamWriter.drain timeout, request restart: {}".format(repr(self._proc.stdin)))
+                          self.logger.warn(

Contributor Author

romain-intel Jan 20, 2023

black formatting

services/ui_backend_service/data/db/tables/run.py

@@ -115,16 +80,27 @@ def select_columns(self):
                       # NOTE: We must use a function scope in order to be able to access the table_name variable for list comprehension.
                       # User should be considered NULL when 'user:*' tag is missing
                       # This is usually the case with AWS Step Functions
-                      return ["{table_name}.{col} AS {col}".format(table_name=self.table_name, col=k) for k in self.keys] \
-                          + ["""
+                      return (

Contributor Author

romain-intel Jan 20, 2023

black format

services/ui_backend_service/data/db/tables/run.py

-                              AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)>{heartbeat_cutoff}
-                          THEN {table_name}.last_heartbeat_ts*1000
-                          ELSE NULL
+                              AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold}

Contributor Author

romain-intel Jan 20, 2023

The previous logic was:

when you have a HB AND you have a latest failed task AND you have not had a HB recently (within threshold), then finished_at = last_hb
when you have a HB AND you have no latest failed task that you know of AND you have not had a HB recently (within CUTOFF, not threshold; cutoff is larger), then finished_at = last_hb
else no finished_at

I changed that to remove the need for the latest failed task and just check the last HB. If there has not been one within the threshold time, I consider the run to have a finished_at.
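
The before/after rule described in this comment can be written out in Python to make the difference easier to compare. This is only a sketch of the prose above, not a translation of the actual SQL: the function names and parameters are hypothetical, and it assumes heartbeats are stored in seconds and `finished_at` is reported in milliseconds, as the `*1000` in the query suggests.

```python
from typing import Optional


def finished_at_before(now_s: float, last_hb_s: Optional[float],
                       has_failed_task: bool,
                       threshold_s: float, cutoff_s: float) -> Optional[float]:
    """Old rule: the allowed heartbeat age depends on a known failed task."""
    if last_hb_s is None:
        return None
    # A known failed task uses the short threshold, otherwise the larger cutoff.
    limit = threshold_s if has_failed_task else cutoff_s
    return last_hb_s * 1000 if abs(now_s - last_hb_s) > limit else None


def finished_at_after(now_s: float, last_hb_s: Optional[float],
                      threshold_s: float) -> Optional[float]:
    """New rule: only the heartbeat age against the threshold matters."""
    if last_hb_s is None:
        return None
    return last_hb_s * 1000 if abs(now_s - last_hb_s) > threshold_s else None
```

For a run with no known failed task whose last heartbeat is 100 seconds old, with a 60-second threshold and a 600-second cutoff, the old rule returns no `finished_at` while the new rule returns the heartbeat time. The new rule no longer needs the latest-failed-task lookup, but it also makes a delayed heartbeat look like a finished run sooner.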

services/ui_backend_service/data/db/tables/run.py

-                          WHEN end_attempt_ok.value IS FALSE
-                              AND end_attempt.ts_epoch > end_attempt_ok.ts_epoch
+                          WHEN end_attempt IS NOT NULL
+                              AND end_attempt_ok.ts_epoch < end_attempt.ts_epoch

Contributor Author

romain-intel Jan 20, 2023

this is the same type of logic as above just for the state instead of the finished_at time.

services/ui_backend_service/data/db/tables/run.py

                               AND {table_name}.last_heartbeat_ts IS NOT NULL
                           THEN {table_name}.last_heartbeat_ts*1000-{table_name}.ts_epoch
-                          WHEN end_attempt IS NOT NULL
+                          WHEN end_attempt_ok IS NOT NULL

Contributor Author

romain-intel Jan 20, 2023

this is a bug fix, we checked for end_attempt but then used end_attempt_ok

services/ui_backend_service/data/db/tables/run.py

    
                          WHEN end_attempt IS NOT NULL
-                              AND end_attempt.ts_epoch > end_attempt_ok.ts_epoch
+                              AND end_attempt_ok.ts_epoch < end_attempt.ts_epoch

Contributor Author

romain-intel Jan 20, 2023

no semantic change, I must have been twiddling things.

saikonen reviewed

Collaborator

saikonen left a comment

Some general thoughts on the proposed changes:

To my understanding, on Argo Workflows / Step Functions the run-level heartbeat only gets updated if there is at least one running task for the run. This might not always be the case, for example if all tasks are stuck in the scheduler. A run would then falsely be marked as Failed before any task launches. It would flip back to 'running' once a task launches and updates the run heartbeat. I still have to verify this behavior.
Judging from the integration test failures, all legacy cases where a run has no heartbeat will be treated as failures instead of best-effort 'running'. I think it is fine to sunset the support for non-heartbeat statuses at this point.
With this feature, the run heartbeat will be treated as the primary source of truth. Some tests were in place to cover the case where a task counts as running, so that the run_heartbeat would not override this for the run status. This might not actually be a relevant test case anymore after the introduction of run_hb refreshes as part of task_hb updates.

We can go forward with these changes if the tradeoffs are acceptable.
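
The first trade-off above (a run flipping to Failed and back) can be illustrated with a small sketch. It is a hypothetical model of a heartbeat-only status rule, not the query in this PR: `status_from_heartbeat` and its parameters are assumptions, and explicit end-of-run markers are deliberately left out.

```python
from typing import Optional


def status_from_heartbeat(now_s: float, last_hb_s: Optional[float],
                          threshold_s: float) -> str:
    """Hypothetical rule: the run heartbeat alone decides the status."""
    # No heartbeat at all (legacy runs) is treated as a failure.
    if last_hb_s is None:
        return "failed"
    # A heartbeat older than the threshold is treated as a failure too.
    if now_s - last_hb_s > threshold_s:
        return "failed"
    return "running"


# All tasks stuck in the scheduler: the run heartbeat is not refreshed.
stuck = status_from_heartbeat(now_s=1000, last_hb_s=900, threshold_s=60)

# A task finally launches and refreshes the run heartbeat.
recovered = status_from_heartbeat(now_s=1010, last_hb_s=1005, threshold_s=60)
```

Under these assumptions `stuck` is `"failed"` and `recovered` is `"running"`, which is the flip described in the review; a larger threshold narrows that window at the cost of slower failure detection.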


          skip failing test cases for now.

saikonen approved these changes

saikonen added 2 commits

June 26, 2023 16:42


          Merge branch 'master' into fix/avoid-expensive-ops

261ced6


          ignore W503 for pycodestyle as it conflicts with Black formatting

f73b4b2

saikonen merged commit 05703ff into master

6 checks passed

wangchy27 mentioned this pull request

Added the ability to separate out reads and writes into their own connection pools. #344

Merged

saikonen mentioned this pull request

In a previous commit, the detection of a failure became too aggressive. #386

Merged

saikonen mentioned this pull request

fix: tone down run inactive cutoff default #392

Merged

saikonen deleted the fix/avoid-expensive-ops branch

November 3, 2023 16:00
