Fix check for systemd-run failure when invoking extensions #1943

pgombar · 2020-07-13T17:13:40Z

Description

We were erroneously classifying systemd-run failures when invoking extensions which produce a long stdout and/or stderr. The reason is because we check for the unit name in the beginning of stderr as a signal that systemd-run was invoked successfully. The change in this PR is to check for the unit name in stderr before truncating it for telemetry purposes.

Other fixes:

We had a bug in extensionprocessutils.py where the TELEMETRY_MESSAGE_MAX_LEN wasn't being patched since it was being assigned in the function signature instead of inside of the function, which produced misleading results in the unit tests.
Fixes a bug in test_start_extension_command_should_invoke_the_command_directly_if_systemd_times_out where because of the indentation the asserts weren't being executed since they were part of the context manager instead of being outside of its scope.

PR information

The title of the PR is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
Except for special cases involving multiple contributors, the PR is started from a fork of the main repository, not a branch.
If applicable, the PR references the bug/issue that it fixes in the description.
New Unit tests were added for the changes made and Travis.CI is passing.

Quality of Code and Contribution Guidelines

I have read the contribution guidelines.

pgombar · 2020-07-13T17:15:12Z

azurelinuxagent/common/utils/extensionprocessutil.py

@@ -99,7 +99,7 @@ def read_output(stdout, stderr):
        return format_stdout_stderr("", "Cannot read stdout/stderr: {0}".format(ustr(e)))


-def format_stdout_stderr(stdout, stderr, max_len=TELEMETRY_MESSAGE_MAX_LEN):
+def format_stdout_stderr(stdout, stderr):


Lesson to learn: don't assign constants in the function signature since this is loaded before the function body, so patching the constant will not work.

codecov · 2020-07-13T17:15:53Z

Codecov Report

Merging #1943 into develop will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           develop    #1943   +/-   ##
========================================
  Coverage    69.62%   69.62%           
========================================
  Files           85       85           
  Lines        11926    11929    +3     
  Branches      1667     1667           
========================================
+ Hits          8303     8306    +3     
  Misses        3254     3254           
  Partials       369      369

Impacted Files	Coverage Δ
azurelinuxagent/common/cgroupapi.py	`79.65% <100.00%> (+0.10%)`	⬆️
...urelinuxagent/common/utils/extensionprocessutil.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0edea53...182bf6d. Read the comment docs.

pgombar · 2020-07-13T17:15:54Z

tests/utils/test_extension_process_util.py

@@ -164,29 +164,26 @@ def test_handle_process_completion_should_raise_on_nonzero_exit_code(self):
                self.assertEquals(context_manager.exception.code, error_code)
                self.assertIn("Non-zero exit code:", ustr(context_manager.exception))

-    def test_read_output_it_should_return_no_content(self):
+    def test_read_output_should_return_no_content(self):


Once I fixed the patching for TELEMETRY_MESSAGE_MAX_LEN, these tests produced different (correct) results :)

pgombar · 2020-07-13T17:16:38Z

tests/common/test_cgroupapi.py

@@ -27,6 +27,8 @@
 from azurelinuxagent.common.future import ustr
 from azurelinuxagent.common.utils import shellutil, fileutil
 from nose.plugins.attrib import attr
+
+from azurelinuxagent.common.utils.extensionprocessutil import TELEMETRY_MESSAGE_MAX_LEN


At some point I think I clicked on "format file" so there's a lot of whitespace updates, I'll point out relevant updates to the tests.

so we need to review only the lines where you added comments?

For this file, that's the case, yes.

your difftool should allow you selectively undo the changes you don't want (meld does). In the future let's try to avoid global formatting changes, since it makes the review and reading the history of the code harder,

(you can even undo the changes after commiting, if you feel like doing it here)

pgombar · 2020-07-13T17:21:04Z

tests/common/test_cgroupapi.py

+                    extension_calls = [args[0] for (args, _) in popen_patch.call_args_list if
+                                       "echo 'success'" in args[0]]

                    self.assertEquals(2, len(extension_calls), "The extension should have been invoked exactly twice")
-                    self.assertIn("systemd-run --unit=Microsoft.Compute.TestExtension_1.2.3", extension_calls[0], "The first call to the extension should have used systemd")
-                    self.assertEquals("echo 'success'", extension_calls[1], "The second call to the extension should not have used systemd")
+                    self.assertIn("systemd-run --unit=Microsoft.Compute.TestExtension_1.2.3", extension_calls[0],
+                                  "The first call to the extension should have used systemd")
+                    self.assertEquals("echo 'success'", extension_calls[1],
+                                      "The second call to the extension should not have used systemd")


I spent several hours trying to figure out why the breakpoint in L755 was never hit... turns out we had a bug checked in where the indentation was such that the asserts were inside of the context manager, instead of outside it.

This is a nasty bug since it's so easy to miss. Could linters figure this one out? Thoughts, @kevinclark19a?

Ah. Interesting. I'm not sure a linter could actually help with this... I don't see how it would differentiate between the legitimate caught raise statement presiding (presumably) inside of the start_extension_command call, and the erroneously caught one inside the assertEquals call. As far as the AST is concerned, they're basically identical.

Still, there might be some way of using attributes to let the linter know what we want caught and what we want to error out... I'll have to look into it more.

can you point out the change in indentation? I'm having a hard time finding it in the diff

I thought I was blind.. turns out I commented on the wrong method. The content of the comment is correct, but the method in question is test_start_extension_command_should_not_use_fallback_option_if_extension_fails instead of this one.

Added a comment on the right method.

pgombar · 2020-07-13T17:21:30Z

tests/common/test_cgroupapi.py

+    @patch("azurelinuxagent.common.cgroupapi.add_event")
+    @patch('time.sleep', side_effect=lambda _: mock_sleep())
+    @patch("azurelinuxagent.common.utils.extensionprocessutil.TELEMETRY_MESSAGE_MAX_LEN", 5)
+    def test_start_extension_command_should_not_use_fallback_option_if_extension_fails_with_long_output(self, *args):


New test to cover long extension outputs. This would fail before the changes in this PR.

larohra

1 small typo in a comment, but everything else LGTM.

tests/common/test_cgroupapi.py

pgombar · 2020-07-14T23:20:49Z

tests/common/test_cgroupapi.py

+                            stdout=stdout,
+                            stderr=stderr)
+
+                    extension_calls = [args[0] for (args, _) in popen_patch.call_args_list if command in args[0]]


This is where the indentation was wrong. It was at the same level with the context manager, instead of outside, like now.

But this new code... you mentioned it was in existing code

ok, i see it now

the assert for root was hiding the issue

but it should've failed in travis then right?

It didn't fail because these asserts were never executed. I ran the test both from sudo shell and from sudo PyCharm and could never hit the asserts. I'm guessing it has something to do with the context manager which just exits the current scope when the object is released?

ohh dang. Good catch though! Hopefully we can catch these moving on

I'm guessing it has something to do with the context manager which just exits the current scope when the object is released?

The test case raises, so the call to start_extension_command will exit the with statement.

I seem to remember at some point I formatted a few tests in this file so that may have moved the assert to the with statement.

narrieta

consider undoing the formatting changes

larohra

LGTM

narrieta · 2020-07-16T16:59:41Z

tests/common/test_cgroupapi.py

+                            stdout=stdout,
+                            stderr=stderr)
+
+                    extension_calls = [args[0] for (args, _) in popen_patch.call_args_list if command in args[0]]


I'm guessing it has something to do with the context manager which just exits the current scope when the object is released?

The test case raises, so the call to start_extension_command will exit the with statement.

I seem to remember at some point I formatted a few tests in this file so that may have moved the assert to the with statement.

pgombar and others added 6 commits July 3, 2020 13:16

wip @ fix truncation

c7ba118

check for unit name in stderr before truncating

894d0f6

tests not failing when they should

0b2ed97

fix failing tests

d9b9c79

looser check on systemd output; varies by version

5de45bd

Merge remote-tracking branch 'upstream/develop' into fix_stderr

54c74e7

pgombar requested review from kevinclark19a, larohra, narrieta and ZhidongPeng as code owners July 13, 2020 17:13

pgombar changed the title ~~Fix stderr~~ Fix check for systemd-run failure when invoking extensions Jul 13, 2020

pgombar commented Jul 13, 2020

View reviewed changes

larohra previously approved these changes Jul 14, 2020

View reviewed changes

tests/common/test_cgroupapi.py Outdated Show resolved Hide resolved

pgombar commented Jul 14, 2020

View reviewed changes

pgombar added 2 commits July 14, 2020 17:40

Merge remote-tracking branch 'upstream/develop' into fix_stderr

d1889ce

update error message in test

b74a866

pgombar dismissed larohra’s stale review via b74a866 July 15, 2020 00:40

narrieta previously approved these changes Jul 15, 2020

View reviewed changes

larohra previously approved these changes Jul 15, 2020

View reviewed changes

pgombar dismissed stale reviews from larohra and narrieta via 5a68bdc July 15, 2020 18:29

revert formatting changes and remove unused import

182bf6d

pgombar force-pushed the fix_stderr branch from 5a68bdc to 182bf6d Compare July 15, 2020 18:30

larohra approved these changes Jul 15, 2020

View reviewed changes

narrieta approved these changes Jul 16, 2020

View reviewed changes

pgombar merged commit 7e3eac1 into Azure:develop Jul 16, 2020

pgombar deleted the fix_stderr branch July 16, 2020 17:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix check for systemd-run failure when invoking extensions #1943

Fix check for systemd-run failure when invoking extensions #1943

pgombar commented Jul 13, 2020

pgombar Jul 13, 2020

codecov bot commented Jul 13, 2020 •

edited

pgombar Jul 13, 2020

pgombar Jul 13, 2020

narrieta Jul 14, 2020

pgombar Jul 14, 2020

narrieta Jul 15, 2020

narrieta Jul 15, 2020

pgombar Jul 13, 2020

kevinclark19a Jul 13, 2020

narrieta Jul 14, 2020

pgombar Jul 14, 2020

pgombar Jul 14, 2020

pgombar Jul 13, 2020

larohra left a comment

pgombar Jul 14, 2020

narrieta Jul 14, 2020

narrieta Jul 14, 2020

narrieta Jul 14, 2020

larohra Jul 14, 2020

pgombar Jul 15, 2020

larohra Jul 15, 2020

narrieta Jul 16, 2020

narrieta left a comment

larohra left a comment

narrieta Jul 16, 2020

Fix check for systemd-run failure when invoking extensions #1943

Fix check for systemd-run failure when invoking extensions #1943

Conversation

pgombar commented Jul 13, 2020

Description

PR information

Quality of Code and Contribution Guidelines

Choose a reason for hiding this comment

codecov bot commented Jul 13, 2020 • edited

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

larohra left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

narrieta left a comment

Choose a reason for hiding this comment

larohra left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Jul 13, 2020 •

edited