Do not use --unit with systemd-cgls #1910

narrieta · 2020-06-17T20:38:45Z

systemd-cgls doesn't support --unit on Ubuntu 16, so this change passes the cgroup path instead.

Also improved error handling and reporting.

narrieta · 2020-06-17T20:39:42Z

azurelinuxagent/common/utils/shellutil.py

        command_name = command[0] if isinstance(command, list) and len(command) > 0 else command
-        return "'{0}' failed: {1}".format(command_name, returncode)
+        return "'{0}' failed: {1} ({2})".format(command_name, return_code, stderr.rstrip())


debugging failures with just the error code in the exception message can be hard; added stderr

Can there be a case where stderr is None? If so, stderr.rstrip() would throw.

no, it'd be an empty string

Awesome (Y)
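
To illustrate the exchange above, here is a minimal sketch of a command runner that puts stderr in the failure message. It is not the agent's actual `shellutil` code: `CommandError`, `format_failure`, and this `run_command` are hypothetical names, and only the message format is taken from the diff. With `stderr=subprocess.PIPE`, `communicate()` returns an empty value rather than None when the command writes nothing to stderr, which is why `rstrip()` is safe here.

```python
import subprocess
import sys


class CommandError(Exception):
    """Hypothetical exception type; the agent's own class may differ."""


def format_failure(command, return_code, stderr):
    # Same message shape as the diff: command name, return code, trimmed stderr.
    command_name = command[0] if isinstance(command, list) and len(command) > 0 else command
    return "'{0}' failed: {1} ({2})".format(command_name, return_code, stderr.rstrip())


def run_command(command):
    # Capture both streams so stderr is available for the error message.
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    if process.returncode != 0:
        # stderr is empty bytes, not None, when the command printed no errors.
        message = format_failure(command, process.returncode, stderr.decode("utf-8", "replace"))
        raise CommandError(message)
    return stdout.decode("utf-8", "replace")


if __name__ == "__main__":
    try:
        run_command([sys.executable, "-c", "import sys; sys.exit(2)"])
    except CommandError as error:
        print(error)
```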

narrieta · 2020-06-17T20:40:24Z

azurelinuxagent/ga/monitor.py

-                    message = "The agent's cgroup includes unexpected processes: {0}".format(error)
-                    logger.info(message)
-                    add_event(op=WALAEventOperation.CGroupsDebug, message=message)
+        processes_check_error = None


any exception in the code to check processes should not prevent us from reporting metrics
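
A minimal sketch of that intent, assuming the check and the metrics report are two separate callables. `collect_and_report`, `check_processes`, and `report_metrics` are hypothetical names, not the functions in `monitor.py`; the sketch only shows the error being captured so that the report still runs.

```python
def collect_and_report(check_processes, report_metrics):
    """Run the process check, but always report metrics afterwards."""
    processes_check_error = None
    try:
        check_processes()
    except Exception as e:
        # Keep the error for later reporting instead of letting it propagate.
        processes_check_error = str(e)
    # Reached whether or not the check above failed.
    report_metrics()
    return processes_check_error
```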

codecov · 2020-06-17T20:40:43Z

Codecov Report

Merging #1910 into develop will decrease coverage by 0.01%.
The diff coverage is 86.11%.

@@             Coverage Diff             @@
##           develop    #1910      +/-   ##
===========================================
- Coverage    69.49%   69.47%   -0.02%     
===========================================
  Files           85       85              
  Lines        11864    11870       +6     
  Branches      1666     1667       +1     
===========================================
+ Hits          8245     8247       +2     
- Misses        3249     3252       +3     
- Partials       370      371       +1

Impacted Files	Coverage Δ
azurelinuxagent/common/cgroupconfigurator.py	`73.20% <80.00%> (-0.80%)`	⬇️
azurelinuxagent/ga/monitor.py	`77.32% <83.33%> (-0.19%)`	⬇️
azurelinuxagent/common/cgroupapi.py	`79.55% <100.00%> (-0.25%)`	⬇️
azurelinuxagent/common/utils/shellutil.py	`67.44% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

larohra · 2020-06-17T21:23:02Z

azurelinuxagent/ga/monitor.py

+            processes_check_error = ustr(e)
+
+        # Report a small sample of errors
+        if processes_check_error != self._last_error and self._error_count < 5:


I missed this in the previous PR, but I noticed we're not resetting the error count ever. I think we should reset it once a day or something to also get newer errors that might occur

it was intentional; there is no need for that, i just want a sample of possible errors
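
The sampling behavior discussed here can be sketched as a small helper. `ErrorSampler` and `should_report` are hypothetical names, and the `error is not None` guard is an assumption that does not appear in the diff hunk; the condition otherwise mirrors the one in the diff. The counter is never reset, so at most five distinct consecutive errors are reported over the life of the object.

```python
class ErrorSampler(object):
    """Report only a small, fixed-size sample of errors (hypothetical helper)."""

    def __init__(self, max_reports=5):
        self._last_error = None
        self._error_count = 0
        self._max_reports = max_reports

    def should_report(self, error):
        # Assumption: a successful check (error is None) is never reported.
        if error is None:
            return False
        # Same condition as the diff: skip repeats and stop after the limit.
        if error != self._last_error and self._error_count < self._max_reports:
            self._error_count += 1
            self._last_error = error
            return True
        return False
```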

narrieta · 2020-06-17T22:58:15Z

tests/utils/test_shell_util.py

@@ -140,7 +139,8 @@ def test_run_command_should_raise_an_exception_when_the_command_fails(self):
            shellutil.run_command(command)

        exception = context_manager.exception
-        self.assertEquals(str(exception), "'ls' failed: 2")
+        self.assertIn("'ls' failed: 2", str(exception))


btw - Python 2.6 doesn't have an assert to match a regex; I need to add that to the test utilities.

I'll do that in a separate PR; in the meantime I split the check into 2 asserts.
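
One possible shape for such a test utility, as a sketch only: `assert_regex` is a hypothetical helper, not the one later added to the agent's test utilities. It relies on `re.search` and `TestCase.fail`, both of which exist in old and new Python versions.

```python
import re
import unittest


def assert_regex(test_case, text, pattern):
    """Hypothetical helper: fail test_case unless pattern is found in text."""
    if re.search(pattern, text) is None:
        test_case.fail("'{0}' does not match pattern '{1}'".format(text, pattern))


class ExampleTestCase(unittest.TestCase):
    def runTest(self):
        # The message format comes from the shellutil diff; the stderr text is made up.
        message = "'ls' failed: 2 (ls: cannot access 'nofile')"
        assert_regex(self, message, r"'ls' failed: 2 \(.+\)")
```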

larohra

LGTM

pgombar · 2020-06-18T05:25:04Z

azurelinuxagent/common/cgroupconfigurator.py

@@ -223,8 +225,9 @@ def get_processes_in_agent_cgroup(self):
            The return value can be None if cgroups are not enabled or if an error occurs during the operation.
            """
            def __impl():
-                agent_unit = self._cgroups_api.get_agent_unit_name()
-                return self._cgroups_api.get_processes_in_cgroup(agent_unit)
+                if self._agent_cpu_cgroup_path is None:


Wouldn't it be better to use the memory cgroup here since we know CPU is not mounted by default in some distros, whereas memory is?

No, it is CPU that we are interested in.

Why CPU specifically? Aren't we only using the cgroup path to get the PIDs? They are also stored in the memory cgroup.

We want to enforce CPU, so it is the CPU cgroup that we need to check.
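
For context, in cgroup v1 each controller hierarchy keeps its own membership list, exposed as a `cgroup.procs` file in the cgroup's directory, so the PIDs in the CPU cgroup are not guaranteed to match those in the memory cgroup. The sketch below reads that file directly. It is an illustration only: the PR passes the cgroup path to systemd-cgls rather than reading the file, `read_pids` is a hypothetical helper, and the path in the usage comment is an assumed example.

```python
import os


def read_pids(cgroup_path):
    """Return the PIDs listed in <cgroup_path>/cgroup.procs."""
    with open(os.path.join(cgroup_path, "cgroup.procs")) as procs_file:
        # One PID per line; skip any blank lines.
        return [int(line) for line in procs_file if line.strip()]


# Usage (assumed path, shown for illustration):
# read_pids("/sys/fs/cgroup/cpu/system.slice/walinuxagent.service")
```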

pgombar · 2020-06-18T05:29:29Z

azurelinuxagent/ga/monitor.py

+        if processes_check_error != self._last_error and self._error_count < 5:
+            self._error_count += 1
+            self._last_error = processes_check_error
+            message = "The agent's cgroup includes unexpected processes: {0}".format(processes_check_error)


The error message no longer matches the intent when processes_check_error just contains the stack trace of an exception raised while checking the processes in the agent cgroup. I know you are only using this event to gather diagnostics, so it's up to you whether to make it clearer.

thanks; fixed

larohra

LGTM

Do not use --unit with systemd-cgls

f15562d

narrieta requested review from kevinclark19a, larohra, pgombar and ZhidongPeng as code owners June 17, 2020 20:38

larohra reviewed Jun 17, 2020

larohra previously approved these changes Jun 17, 2020

pgombar reviewed Jun 18, 2020

Fix message

7640263

narrieta dismissed larohra’s stale review via 7640263 June 18, 2020 13:27

larohra approved these changes Jun 19, 2020

pgombar approved these changes Jun 19, 2020

Merge branch 'develop' into list-processes

252d70d

narrieta merged commit 90aeeb2 into Azure:develop Jun 19, 2020

narrieta deleted the list-processes branch June 19, 2020 19:02
