Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip outdated Fabric goal states when HGAP does not support vmSettings #2542

Merged
merged 2 commits into from Apr 6, 2022

Conversation

narrieta
Copy link
Member

@narrieta narrieta commented Apr 5, 2022

If the agent is using FastTrack and the VM is migrated from a node where the HGAP supports vmSettings to a node where it doesn't we need to ensure that the agent does not execute an outdated Fabric goal state.

The HostPluginProtocol now keeps track of the timestamp of the most recent FastTrack goal state. If the fetch_vm_settings() method detects that support for vmSettings is no longer available, it raises VmSettingsSupportStopped.

GoalState.update() handles that exception by restoring the current Fabric goal state. If the timestamp of this goal state is older than the most recent FastTrack goal state, it marks the GS as "outdated".

The UpdateHandler will not execute extensions if the GS is outdated.

The timestamp for the FastTrack goal state is persisted to /var/lib/waagent/fast_track.json. This is needed to handle agent restarts. The file is deleted once the agent reports status (at that point the agent will report that it does not support FastTrack and CRP will switch to Fabric.

The PR includes cleanup of related code. I'll add comments pointing out this cleanup.

vm_settings, vm_settings_updated = GoalState._fetch_vm_settings(self._wire_client)
except VmSettingsSupportStopped as exception: # If the HGAP stopped supporting vmSettings, we need to use the goal state from the WireServer
self._restore_wire_server_goal_state(incarnation, xml_text, xml_doc, exception)
return
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This handles the exception that the HostPluginProtocol raises when support for vmSettings stops.

@@ -86,11 +85,21 @@ def __init__(self, endpoint):
self.status_error_state = ErrorState(min_timedelta=ERROR_STATE_HOST_PLUGIN_FAILURE)
self.fetch_last_timestamp = None
self.status_last_timestamp = None
self._host_plugin_version = FlexibleVersion("0.0.0.0") # Version 0 means "unknown"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed these members to remove the "host_plugin_" prefix. These used to be members of the WireClient class in a previous iteration of the code; then, the prefix made sense. Since now they are members of HostPluginProtocol, the prefix is not needed.

else:
self._supports_vm_settings = True
self._supports_vm_settings_next_check = datetime.datetime.now()
self._fast_track_timestamp = HostPluginProtocol._get_fast_track_timestamp()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed for service restart

logger.warn("Can't retrieve the timestamp for the most recent Fast Track goal state ({0}), will assume the current time. Error: {1}",
HostPluginProtocol._get_fast_track_state_file(), ustr(e))
return timeutil.create_timestamp(datetime.datetime.utcnow())

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helpers to maintain the timestamp of the most recent FastTrack goal state

ServerError = 'ServerError'
ClientError = 'ClientError'
Timeout = 'Timeout'
RequestFailed = 'RequestFailed'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alphabetized the list and added HttpError

logger.info("[VmSettings] [Informational only, the Agent will continue normal operation] {0}", error)

if self._error_count <= _VmSettingsErrorReporter._MaxTelemetryErrors:
add_event(op=WALAEventOperation.VmSettings, message=error, is_success=False, log_event=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these errors from the local log. This was part of 2.7 and was useful for debugging. Now, errors in vmSettings are reported the same way as any other error in the goal state. We are still collecting telemetry while the HGAP stabilizes, but some of this telemetry will also be removed in a future release.

self.update_handler.run()
else:
with patch('os.getppid', return_value=42):
with patch.object(self.update_handler, "_processing_new_extensions_goal_state", return_value=True):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test create their mocks instead of using the mock_wire_protocol or mockwireddata. I mocked _processing_new_extensions_goal_state since otherwise the mocks created by the test would return that the goal state is outdated and extensions would not be executed.

@@ -20,7 +23,68 @@
from tests.tools import AgentTestCase, patch, load_data


class GoalStateTestCase(AgentTestCase):
class GoalStateTestCase(AgentTestCase, HttpRequestPredicates):
def test_it_should_use_vm_settings_by_default(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests used to be in test_wire.py, as part of the tests for WireClient.update_goal_state(). The code they are testing is now in GoalState, so I moved the tests here.

They are almost identical to its original version, except that now the test the code via GoalState/GoalState.update() instead of WireClient.update_goal_state()

@@ -267,3 +331,30 @@ def test_it_should_use_the_most_recent_goal_state(self):
goal_state.update()

self._assert_goal_state(goal_state, '222', channel=GoalStateChannel.WireServer, source=GoalStateSource.Fabric)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for the new logic

protocol.client.update_goal_state()
self.assertEqual(2, get_vm_settings_call_count(), "A second call to vmSettings was expecting after the check period has elapsed.")

def test_it_should_raise_when_the_vm_settings_api_stops_being_supported(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests for the new logic

@@ -926,40 +929,23 @@ def http_get_handler(url, *_, **__):
def get_telemetry_messages():
return [kwargs["message"] for _, kwargs in add_event.call_args_list if kwargs["op"] == "VmSettings"]

def get_log_messages():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These errors are no longer written to the local log, so I removed the associated tests

@@ -1097,67 +1094,6 @@ def test_forced_update_should_update_the_goal_state_and_the_host_plugin_when_the
self.assertEqual(protocol.client.get_host_plugin().container_id, new_container_id)
self.assertEqual(protocol.client.get_host_plugin().role_config_name, new_role_config_name)

def test_it_should_retry_get_vm_settings_on_resource_gone_error(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to test_goal_state.py

@@ -1463,6 +1464,7 @@ def _test_upgrade_available(

self.update_handler.protocol_util = protocol
self.update_handler._goal_state = protocol.get_goal_state()
self.update_handler._goal_state.extensions_goal_state.is_outdated = False
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forcing the mock created by the test to not mark the goal state as outdated

@codecov
Copy link

codecov bot commented Apr 5, 2022

Codecov Report

Merging #2542 (9b46f3a) into develop (f91bf78) will increase coverage by 0.04%.
The diff coverage is 89.51%.

@@             Coverage Diff             @@
##           develop    #2542      +/-   ##
===========================================
+ Coverage    72.00%   72.04%   +0.04%     
===========================================
  Files          102      102              
  Lines        15229    15298      +69     
  Branches      2416     2422       +6     
===========================================
+ Hits         10965    11022      +57     
- Misses        3779     3787       +8     
- Partials       485      489       +4     
Impacted Files Coverage Δ
azurelinuxagent/common/protocol/hostplugin.py 87.60% <86.51%> (-1.26%) ⬇️
azurelinuxagent/common/protocol/goal_state.py 95.25% <95.45%> (-0.05%) ⬇️
...inuxagent/common/protocol/extensions_goal_state.py 74.46% <100.00%> (+1.53%) ⬆️
...ol/extensions_goal_state_from_extensions_config.py 91.19% <100.00%> (+0.03%) ⬆️
azurelinuxagent/ga/update.py 88.40% <100.00%> (-0.16%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f91bf78...9b46f3a. Read the comment docs.

@@ -166,13 +230,13 @@ def _create_protocol_ws_and_hgap_in_sync():
# that may invalidate this setup.
vm_settings, _ = protocol.client.get_host_plugin().fetch_vm_settings()
if vm_settings.etag != etag:
raise Exception("The HostGAPlugin is no in sync. Expected ETag {0}. Got {1}".format(etag, vm_settings.etag))
raise Exception("The HostGAPlugin is not in sync. Expected ETag {0}. Got {1}".format(etag, vm_settings.etag))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinclark19a fix for the typo in my previous PR. thanks for pointing it out

self._cached_vm_settings = vm_settings

if vm_settings.source == GoalStateSource.FastTrack:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point it's all about FastTrack and source always be FastTrack otherwise we would have raised not supported exception. So, I don't see point of having else and clearing the state. I don't know if I'm missing something here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HGAP continues to return both FastTrack and Fabric goal states, nothing has changed there. GoalState.update(), which calls this method is the one that will ignore the Fabric goal states.

Clearing the state is not strictly necessary, since if we receive a Fabric goal state, its timestamp will be greater than the saved state. However, deleting the state keeps things tidy and reduces the code paths the agent goes through during service start (the persisted state is used only during service start if FastTrack is disabled or not supported by the HGAP).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, Thanks for clarifying.

@@ -670,6 +671,8 @@ def _report_status(self, exthandlers_handler):
if self._goal_state is not None:
agent_status = exthandlers_handler.get_ext_handlers_status_debug_info(vm_status)
self._goal_state.save_to_history(agent_status, AGENT_STATUS_FILE)
if self._goal_state.extensions_goal_state.is_outdated:
exthandlers_handler.protocol.client.get_host_plugin().clear_fast_track_state()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cleared state but status report had failures due to network issues, so we continue to skip following goal states until CRP receives not supported feature?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exthandlers_handler.report_ext_handlers_status() should return None on error, and then we wouldn't clear the state.

But the point you make is valid: CRP will continue sending FastTrack goal states until the agent can report status successfully. That is a requirement of the current protocol.

Those goal states would time out. Something similar would happen (regardless of FastTrack) if the agent cannot report status. CRP needs the status to know the goal state completed. One difference, though, is that in this scenario the agent would not even see the FastTrack goal states.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, in that case VM might not be in good state (whole different problem) and should not worry about Agent behavior, I guess.

@narrieta narrieta merged commit a5ae1af into Azure:develop Apr 6, 2022
@narrieta narrieta deleted the outdated-gs branch April 6, 2022 21:55
nagworld9 pushed a commit to nagworld9/WALinuxAgent that referenced this pull request Apr 6, 2022
Azure#2542)

* Skip outdated Fabric goal states when HGAP does not support vmSettings

* add comment

Co-authored-by: narrieta <narrieta>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants