New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip outdated Fabric goal states when HGAP does not support vmSettings #2542
Conversation
vm_settings, vm_settings_updated = GoalState._fetch_vm_settings(self._wire_client) | ||
except VmSettingsSupportStopped as exception: # If the HGAP stopped supporting vmSettings, we need to use the goal state from the WireServer | ||
self._restore_wire_server_goal_state(incarnation, xml_text, xml_doc, exception) | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This handles the exception that the HostPluginProtocol raises when support for vmSettings stops.
@@ -86,11 +85,21 @@ def __init__(self, endpoint): | |||
self.status_error_state = ErrorState(min_timedelta=ERROR_STATE_HOST_PLUGIN_FAILURE) | |||
self.fetch_last_timestamp = None | |||
self.status_last_timestamp = None | |||
self._host_plugin_version = FlexibleVersion("0.0.0.0") # Version 0 means "unknown" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed these members to remove the "host_plugin_" prefix. These used to be members of the WireClient class in a previous iteration of the code; then, the prefix made sense. Since now they are members of HostPluginProtocol, the prefix is not needed.
else: | ||
self._supports_vm_settings = True | ||
self._supports_vm_settings_next_check = datetime.datetime.now() | ||
self._fast_track_timestamp = HostPluginProtocol._get_fast_track_timestamp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed for service restart
logger.warn("Can't retrieve the timestamp for the most recent Fast Track goal state ({0}), will assume the current time. Error: {1}", | ||
HostPluginProtocol._get_fast_track_state_file(), ustr(e)) | ||
return timeutil.create_timestamp(datetime.datetime.utcnow()) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helpers to maintain the timestamp of the most recent FastTrack goal state
ServerError = 'ServerError' | ||
ClientError = 'ClientError' | ||
Timeout = 'Timeout' | ||
RequestFailed = 'RequestFailed' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alphabetized the list and added HttpError
logger.info("[VmSettings] [Informational only, the Agent will continue normal operation] {0}", error) | ||
|
||
if self._error_count <= _VmSettingsErrorReporter._MaxTelemetryErrors: | ||
add_event(op=WALAEventOperation.VmSettings, message=error, is_success=False, log_event=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed these errors from the local log. This was part of 2.7 and was useful for debugging. Now, errors in vmSettings are reported the same way as any other error in the goal state. We are still collecting telemetry while the HGAP stabilizes, but some of this telemetry will also be removed in a future release.
self.update_handler.run() | ||
else: | ||
with patch('os.getppid', return_value=42): | ||
with patch.object(self.update_handler, "_processing_new_extensions_goal_state", return_value=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These test create their mocks instead of using the mock_wire_protocol or mockwireddata. I mocked _processing_new_extensions_goal_state since otherwise the mocks created by the test would return that the goal state is outdated and extensions would not be executed.
@@ -20,7 +23,68 @@ | |||
from tests.tools import AgentTestCase, patch, load_data | |||
|
|||
|
|||
class GoalStateTestCase(AgentTestCase): | |||
class GoalStateTestCase(AgentTestCase, HttpRequestPredicates): | |||
def test_it_should_use_vm_settings_by_default(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests used to be in test_wire.py, as part of the tests for WireClient.update_goal_state(). The code they are testing is now in GoalState, so I moved the tests here.
They are almost identical to its original version, except that now the test the code via GoalState/GoalState.update() instead of WireClient.update_goal_state()
@@ -267,3 +331,30 @@ def test_it_should_use_the_most_recent_goal_state(self): | |||
goal_state.update() | |||
|
|||
self._assert_goal_state(goal_state, '222', channel=GoalStateChannel.WireServer, source=GoalStateSource.Fabric) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test for the new logic
protocol.client.update_goal_state() | ||
self.assertEqual(2, get_vm_settings_call_count(), "A second call to vmSettings was expecting after the check period has elapsed.") | ||
|
||
def test_it_should_raise_when_the_vm_settings_api_stops_being_supported(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests for the new logic
@@ -926,40 +929,23 @@ def http_get_handler(url, *_, **__): | |||
def get_telemetry_messages(): | |||
return [kwargs["message"] for _, kwargs in add_event.call_args_list if kwargs["op"] == "VmSettings"] | |||
|
|||
def get_log_messages(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These errors are no longer written to the local log, so I removed the associated tests
@@ -1097,67 +1094,6 @@ def test_forced_update_should_update_the_goal_state_and_the_host_plugin_when_the | |||
self.assertEqual(protocol.client.get_host_plugin().container_id, new_container_id) | |||
self.assertEqual(protocol.client.get_host_plugin().role_config_name, new_role_config_name) | |||
|
|||
def test_it_should_retry_get_vm_settings_on_resource_gone_error(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to test_goal_state.py
@@ -1463,6 +1464,7 @@ def _test_upgrade_available( | |||
|
|||
self.update_handler.protocol_util = protocol | |||
self.update_handler._goal_state = protocol.get_goal_state() | |||
self.update_handler._goal_state.extensions_goal_state.is_outdated = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forcing the mock created by the test to not mark the goal state as outdated
Codecov Report
@@ Coverage Diff @@
## develop #2542 +/- ##
===========================================
+ Coverage 72.00% 72.04% +0.04%
===========================================
Files 102 102
Lines 15229 15298 +69
Branches 2416 2422 +6
===========================================
+ Hits 10965 11022 +57
- Misses 3779 3787 +8
- Partials 485 489 +4
Continue to review full report at Codecov.
|
@@ -166,13 +230,13 @@ def _create_protocol_ws_and_hgap_in_sync(): | |||
# that may invalidate this setup. | |||
vm_settings, _ = protocol.client.get_host_plugin().fetch_vm_settings() | |||
if vm_settings.etag != etag: | |||
raise Exception("The HostGAPlugin is no in sync. Expected ETag {0}. Got {1}".format(etag, vm_settings.etag)) | |||
raise Exception("The HostGAPlugin is not in sync. Expected ETag {0}. Got {1}".format(etag, vm_settings.etag)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinclark19a fix for the typo in my previous PR. thanks for pointing it out
self._cached_vm_settings = vm_settings | ||
|
||
if vm_settings.source == GoalStateSource.FastTrack: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point it's all about FastTrack and source always be FastTrack otherwise we would have raised not supported exception. So, I don't see point of having else
and clearing the state. I don't know if I'm missing something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HGAP continues to return both FastTrack and Fabric goal states, nothing has changed there. GoalState.update(), which calls this method is the one that will ignore the Fabric goal states.
Clearing the state is not strictly necessary, since if we receive a Fabric goal state, its timestamp will be greater than the saved state. However, deleting the state keeps things tidy and reduces the code paths the agent goes through during service start (the persisted state is used only during service start if FastTrack is disabled or not supported by the HGAP).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, Thanks for clarifying.
@@ -670,6 +671,8 @@ def _report_status(self, exthandlers_handler): | |||
if self._goal_state is not None: | |||
agent_status = exthandlers_handler.get_ext_handlers_status_debug_info(vm_status) | |||
self._goal_state.save_to_history(agent_status, AGENT_STATUS_FILE) | |||
if self._goal_state.extensions_goal_state.is_outdated: | |||
exthandlers_handler.protocol.client.get_host_plugin().clear_fast_track_state() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cleared state but status report had failures due to network issues, so we continue to skip following goal states until CRP receives not supported feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exthandlers_handler.report_ext_handlers_status() should return None on error, and then we wouldn't clear the state.
But the point you make is valid: CRP will continue sending FastTrack goal states until the agent can report status successfully. That is a requirement of the current protocol.
Those goal states would time out. Something similar would happen (regardless of FastTrack) if the agent cannot report status. CRP needs the status to know the goal state completed. One difference, though, is that in this scenario the agent would not even see the FastTrack goal states.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, in that case VM might not be in good state (whole different problem) and should not worry about Agent behavior, I guess.
Azure#2542) * Skip outdated Fabric goal states when HGAP does not support vmSettings * add comment Co-authored-by: narrieta <narrieta>
If the agent is using FastTrack and the VM is migrated from a node where the HGAP supports vmSettings to a node where it doesn't we need to ensure that the agent does not execute an outdated Fabric goal state.
The HostPluginProtocol now keeps track of the timestamp of the most recent FastTrack goal state. If the fetch_vm_settings() method detects that support for vmSettings is no longer available, it raises VmSettingsSupportStopped.
GoalState.update() handles that exception by restoring the current Fabric goal state. If the timestamp of this goal state is older than the most recent FastTrack goal state, it marks the GS as "outdated".
The UpdateHandler will not execute extensions if the GS is outdated.
The timestamp for the FastTrack goal state is persisted to /var/lib/waagent/fast_track.json. This is needed to handle agent restarts. The file is deleted once the agent reports status (at that point the agent will report that it does not support FastTrack and CRP will switch to Fabric.
The PR includes cleanup of related code. I'll add comments pointing out this cleanup.