Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not use --unit with systemd-cgls #1910

Merged
merged 3 commits into from
Jun 19, 2020
Merged

Conversation

narrieta
Copy link
Member

systemd-cgls doesn't support --unit on ubuntu 16; using the cgroup path instead.

also, improved error handling and reporting.

command_name = command[0] if isinstance(command, list) and len(command) > 0 else command
return "'{0}' failed: {1}".format(command_name, returncode)
return "'{0}' failed: {1} ({2})".format(command_name, return_code, stderr.rstrip())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debugging failures with just the error code in the exception message can be hard; added stderr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be a case where stderr is None? If it is stderr.rstrip() would throw

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it'd be an empty string

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome (Y)

message = "The agent's cgroup includes unexpected processes: {0}".format(error)
logger.info(message)
add_event(op=WALAEventOperation.CGroupsDebug, message=message)
processes_check_error = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any exception in the code to check processes should not prevent us from reporting metrics

@codecov
Copy link

codecov bot commented Jun 17, 2020

Codecov Report

Merging #1910 into develop will decrease coverage by 0.01%.
The diff coverage is 86.11%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1910      +/-   ##
===========================================
- Coverage    69.49%   69.47%   -0.02%     
===========================================
  Files           85       85              
  Lines        11864    11870       +6     
  Branches      1666     1667       +1     
===========================================
+ Hits          8245     8247       +2     
- Misses        3249     3252       +3     
- Partials       370      371       +1     
Impacted Files Coverage Δ
azurelinuxagent/common/cgroupconfigurator.py 73.20% <80.00%> (-0.80%) ⬇️
azurelinuxagent/ga/monitor.py 77.32% <83.33%> (-0.19%) ⬇️
azurelinuxagent/common/cgroupapi.py 79.55% <100.00%> (-0.25%) ⬇️
azurelinuxagent/common/utils/shellutil.py 67.44% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aaca191...252d70d. Read the comment docs.

command_name = command[0] if isinstance(command, list) and len(command) > 0 else command
return "'{0}' failed: {1}".format(command_name, returncode)
return "'{0}' failed: {1} ({2})".format(command_name, return_code, stderr.rstrip())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be a case where stderr is None? If it is stderr.rstrip() would throw

processes_check_error = ustr(e)

# Report a small sample of errors
if processes_check_error != self._last_error and self._error_count < 5:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this in the previous PR, but I noticed we're not resetting the error count ever. I think we should reset it once a day or something to also get newer errors that might occur

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was intentional; there is no need for that, i just want a sample of possible errors

@@ -140,7 +139,8 @@ def test_run_command_should_raise_an_exception_when_the_command_fails(self):
shellutil.run_command(command)

exception = context_manager.exception
self.assertEquals(str(exception), "'ls' failed: 2")
self.assertIn("'ls' failed: 2", str(exception))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw - python 2.6 doesn't have an assert to match a regex, I need to add that to the test utilities.

i'll do that on a separate PR, in the meanwhile I split the check on 2 asserts

larohra
larohra previously approved these changes Jun 17, 2020
Copy link
Contributor

@larohra larohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -223,8 +225,9 @@ def get_processes_in_agent_cgroup(self):
The return value can be None if cgroups are not enabled or if an error occurs during the operation.
"""
def __impl():
agent_unit = self._cgroups_api.get_agent_unit_name()
return self._cgroups_api.get_processes_in_cgroup(agent_unit)
if self._agent_cpu_cgroup_path is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to use the memory cgroup here since we know CPU is not mounted by default in some distros, whereas memory is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is CPU that we are interested in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why CPU specifically? Aren't we only using the cgroup path to get the PIDs? They are also stored in the memory cgroup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to enforce CPU, so it is the CPU cgroup that we need to check.

if processes_check_error != self._last_error and self._error_count < 5:
self._error_count += 1
self._last_error = processes_check_error
message = "The agent's cgroup includes unexpected processes: {0}".format(processes_check_error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message now doesn't match the intention when process_check_error just contains the stack trace of an exception that occurred when we were trying to check processes in the agent cgroup. I know you are only using this event to gather diagnostics, so it's up to you if you want to make it clearer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks; fixed

Copy link
Contributor

@larohra larohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@narrieta narrieta merged commit 90aeeb2 into Azure:develop Jun 19, 2020
@narrieta narrieta deleted the list-processes branch June 19, 2020 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants