Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup history directory when creating new subdirectories #2633

Merged
merged 2 commits into from Jul 22, 2022

Conversation

narrieta
Copy link
Member

Fixes #2619

When reporting status, if the status blob is None, upload_status_blob forces fetching a new goal state:

    def upload_status_blob(self):
        extensions_goal_state = self.get_goal_state().extensions_goal_state

        if extensions_goal_state.status_upload_blob is None:
            # the status upload blob is in ExtensionsConfig so force a full goal state refresh
            self.update_goal_state(force_update=True, silent=True)

As part of fetching the goal state, a new item is created under the history directory and several lines are added to the agent log. This will continue happening until there is a new goal state that provides the status blob. The history folder is cleaned up only when there is a new goal state so, since reporting status is done every 6 seconds, this condition can rapidly produce a very large log file or consume a very large number of inodes, as reported in #2619.

To fix this, now we cleanup the history directory each time a new item is created, and we turn off logging while re-fetching the goal state.

While testing the fix, I notice there were a few places where we were not honoring the Logger.silent flag. This PR fixes that as well.

if len(certs.warnings) > 0:
logger.warn(certs.warnings)
self.logger.warn(certs.warnings)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 instances were not using GoalState.logger

if extensions_goal_state.status_upload_blob is None:
raise ProtocolNotFoundError("Status upload uri is missing")

logger.info("Refreshed the goal state to get the status upload blob. New Goal State ID: {0}", extensions_goal_state.id)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turn off logging while re-fetching the goal state and then log an info when that succeeds


for state in states[_MAX_ARCHIVED_STATES:]:
state.delete()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved purge to GoalStateHistory

@@ -222,6 +211,8 @@ def __init__(self, time, tag):
timestamp = timeutil.create_history_timestamp(time)
self._root = os.path.join(conf.get_lib_dir(), ARCHIVE_DIRECTORY_NAME, "{0}__{1}".format(timestamp, tag) if tag is not None else timestamp)

GoalStateHistory._purge()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean up each time we create a new item

elif GoalStateHistory._purge_error_count == 5:
logger.warn("Failed to clean up the goal state history directory [will stop reporting these errors]: {0}".format(e))


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goal states can be created very frequently, so limit the number of errors we log.

@codecov
Copy link

codecov bot commented Jul 21, 2022

Codecov Report

Merging #2633 (75e6189) into release-2.8.0.0 (672dbf3) will decrease coverage by 0.04%.
The diff coverage is 68.42%.

@@                 Coverage Diff                 @@
##           release-2.8.0.0    #2633      +/-   ##
===================================================
- Coverage            72.13%   72.08%   -0.05%     
===================================================
  Files                  102      102              
  Lines                15412    15434      +22     
  Branches              2448     2454       +6     
===================================================
+ Hits                 11117    11126       +9     
- Misses                3805     3816      +11     
- Partials               490      492       +2     
Impacted Files Coverage Δ
azurelinuxagent/common/protocol/wire.py 78.57% <60.00%> (-0.26%) ⬇️
azurelinuxagent/common/utils/archive.py 82.28% <65.38%> (-4.65%) ⬇️
azurelinuxagent/common/protocol/goal_state.py 95.81% <80.00%> (ø)
azurelinuxagent/ga/update.py 88.30% <100.00%> (-0.02%) ⬇️
azurelinuxagent/ga/collect_telemetry_events.py 90.41% <0.00%> (-1.03%) ⬇️
azurelinuxagent/common/utils/fileutil.py 79.69% <0.00%> (+0.75%) ⬆️
azurelinuxagent/common/logger.py 90.96% <0.00%> (+1.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 672dbf3...75e6189. Read the comment docs.

This failure was caught when .purge() was called before .archive().
"""
test_subject = StateArchiver(os.path.join(self.tmp_dir, 'does-not-exist'))
test_subject.purge()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test no longer applies, since now we cleanup any item under the history directory, not only the items that matched those patterns

@nagworld9
Copy link
Contributor

@narrieta why can't we keep both cleanup on new item and every new goal state? is that overkill?

azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
else:
shutil.rmtree(current_item)

if GoalStateHistory._purge_error_count > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be >=0. Happy path scenario it will be 0 and still want to log successfully cleaned up message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is to log this message only when we are recovering from a previous error (I added a comment to clarify this).

We don't want to log all successful calls, since we could be logging every 6 seconds if we hit an scenario similar to the one reported in #2619

@narrieta
Copy link
Member Author

@nagworld9

@narrieta why can't we keep both cleanup on new item and every new goal state? is that overkill?

Cleanup is done also on neww goal state: a new goal state will create a new history item, which will trigger the cleanup.

@narrieta narrieta merged commit ac56d0e into Azure:release-2.8.0.0 Jul 22, 2022
@narrieta narrieta deleted the history branch July 22, 2022 23:19
narrieta added a commit to narrieta/WALinuxAgent that referenced this pull request Aug 1, 2022
* Cleanup history directory when creating new subdirectories

* Review feedback

Co-authored-by: narrieta <narrieta>
(cherry picked from commit ac56d0e)
narrieta added a commit that referenced this pull request Aug 2, 2022
…2639)

* Cleanup history directory when creating new subdirectories

* Review feedback

Co-authored-by: narrieta <narrieta>
(cherry picked from commit ac56d0e)
@narrieta
Copy link
Member Author

narrieta commented Oct 4, 2022

Fixes #2619

nagworld9 added a commit that referenced this pull request Mar 7, 2023
* release prepare-2.4.0.0 (#2280)

* Fix bug with dependent extensions with no settings (#2285)

* update test-requirements to pin pylint. (#2288) (#2299)

Co-authored-by: Kevin Clark <kevin.clark@microsoft.com>

* Do not create placeholder status file for AKS extensions (#2298)

* Exception for Linux Patch Extension for creating placeholder status file (#2307)

* update release version (#2308)

* Dont create default status file for Single-Config extensions (#2318)

* version update (#2319)

* Update Version (#2348)

* Fix bug with dependent extensions with no settings (#2285) (#2349)

Co-authored-by: Laveesh Rohra <larohra@microsoft.com>

* Use handler status if extension status is None when computing the Ext… (#2358)

* Use handler status if extension status is None when computing the ExtensionsSummary

* Add NotReady to convergence statuses

Co-authored-by: narrieta <narrieta>

* Release preparation 2.5.0.1 (#2360)

* Define ExtensionsSummary.__eq__ (#2371)

* define ExtensionsSummary.__eq__

* Fix test name

* Fix test name

* Fix test name

Co-authored-by: narrieta <narrieta>

* Release preparation 2.5.0.2 (#2372)

* update cgroups monitoring expiry date (#2392)

* Updated Agent Version to 2.6.0.0 (#2393)

* Do not parse status blob (#2394)

Co-authored-by: narrieta <narrieta>

* Exclude dcr from setup (#2396)

* Improve error message for vmSettings errors (#2397)

* Improve error message for vmSettings errors

* Add etag and correlation id

* pylint warning

Co-authored-by: narrieta <narrieta>

* Updated agent version to 2.6.0.1 (#2400)

* Verify that extensions goal state from vmsettings has been initialized (#2404)

Co-authored-by: narrieta <narrieta>

* Updated agent version to 2.6.0.2 (#2405)

* Set Agent version to 2.7.0.0 (#2457)

* Set Agent version to 2.7.0.0

* Remove duplicate import

Co-authored-by: narrieta <narrieta>

* Simplify the logic to update the extensions goal state (#2466)

* Simplify the logic to update the extensions goal state

* Added telemetry for NotSupported

* Added comments

* Do not support old hostgaplugin

Co-authored-by: narrieta <narrieta>

* Retry update_goal_state on GoalStateMismatchError (#2470)

* Retry update_goal_state on GoalStateMismatchError

* Add sleep before retry

* Enable test

* Enable test

* Update test

* Add unit test

* Add data file

* pylint warning

* Add comment; fix typos

* fix typo

Co-authored-by: narrieta <narrieta>

* Set agent version to 2.7.0.1 (#2473)

* Redact settings from mismatch message (#2477)

Co-authored-by: narrieta <narrieta>

* Set Agent Version to 2.7.0.2 (#2478)

Co-authored-by: narrieta <narrieta>

* Improvements in telemetry for vmSettings (#2482)

Co-authored-by: narrieta <narrieta>

* Set Agent version to 2.7.0.3 (#2483)

* Do not raise on missing status blob; reduce amount of logging for vms… (#2492)

* Do not raise on missing status blob; reduce amount of logging for vmsettings

* remove extra file; fix typo

Co-authored-by: narrieta <narrieta>

* Set agent version to 2.7.0.4 (#2494)

Co-authored-by: narrieta <narrieta>

* Save vmSettings on parse errors; improve messages in parse errors (#2503)

* Save vmSettings on parse errors; improve messages in parse errors

* pylint warnings

* pylint warnings

Co-authored-by: narrieta <narrieta>

* Set agent version to 2.7.0.5 (#2504)

Co-authored-by: narrieta <narrieta>

* Revert "Set agent version to 2.7.0.5 (#2504)" (#2505)

This reverts commit ae5a222.

Co-authored-by: narrieta <narrieta>

* ignore firewall packets reset error, check enable firewall config flag and extend cgroup extension monitoring expiry time (#2502)

* Set agent version to 2.7.0.5 (#2506)

Co-authored-by: narrieta <narrieta>

* Handle OOM errors by stopping the periodic log collector (#2510)

* Set agent version to 2.7.0.6 (#2511)

* Add keep_alive property to collect_logs (#2533)

* Set agent version to 2.7.1.0. (#2534)

* Set agent version to 2.8.0.0 (#2545)

Co-authored-by: narrieta <narrieta>

* Update HGAP minimum required version for FastTrack (#2549)

Co-authored-by: narrieta <narrieta>

* Update agent version to 2.8.0.1 (#2550)

Co-authored-by: narrieta <narrieta>

* Improvements in waagent.log (#2558)

* Improvements in waagent.log

* fix function call

* update comment

* typo

Co-authored-by: narrieta <narrieta>

* Change format of history items (#2560)

* Change format of history directory

* Update message; fix typo

* py2 compat

* py2 compat

Co-authored-by: narrieta <narrieta>

* Update agent version to 2.8.0.2 (#2561)

Co-authored-by: narrieta <narrieta>

* Refresh goal state when certificates are missing (#2562)

* Refresh goal state when certificates are missing

* Improve error reporting

* Fix assert message

Co-authored-by: narrieta <narrieta>

* Update agent version to 2.8.0.3 (#2563)

Co-authored-by: narrieta <narrieta>

* Do not mark goal state as processed when goal state fails to update (#2569)

Co-authored-by: narrieta <narrieta>

* Update agent version to 2.8.0.4 (#2570)

Co-authored-by: narrieta <narrieta>

* Bug fix for fetching a goal state with empty certificates property (#2575)

* Move error counter reset down to end of block. (#2576)

* Bug Fix: Change fast track timestamp default from None to datetime.min (#2577)

* Update agent version to 2.8.0.5. (#2580)

* Create placeholder GoalState.*.xml file (#2594)

Co-authored-by: narrieta <narrieta>

* Update version to 2.8.0.6 (#2595)

Co-authored-by: narrieta <narrieta>

* fix network interface restart in RHEL9 (#2592) (#2612)

(cherry picked from commit b8ca432)

* set agent version to 2.7.2.0 (#2613)

* Parse missing agent manifests as empty

* Set agent version to 2.8.0.7

* Retry HGAP's extensionsArtifact requests on BAD_REQUEST status (#2621)

* Retry HGAP's extensionsArtifact requests on BAD_REQUEST status

* python 2.6 compat

Co-authored-by: narrieta <narrieta>

* Retry HGAP's extensionsArtifact requests on BAD_REQUEST status (#2621) (#2622)

* Retry HGAP's extensionsArtifact requests on BAD_REQUEST status

* python 2.6 compat

Co-authored-by: narrieta <narrieta>
(cherry picked from commit dbc82d3)

* fix if command in rhel v8.6+ (#2624)

* Set agent version to 2.7.3.0 (#2625)

Co-authored-by: narrieta <narrieta>

* Set Agent version to 2.8.0.8 (#2627)

Co-authored-by: narrieta <narrieta>

* fix network interface restart in RHEL9 (#2592) (#2629)

(cherry picked from commit b8ca432)

Co-authored-by: Nageswara Nandigam <84482346+nagworld9@users.noreply.github.com>

* fix if command in rhel v8.6+ (#2624) (#2630)

(cherry picked from commit e540728)

Co-authored-by: Nageswara Nandigam <84482346+nagworld9@users.noreply.github.com>

* Set Agent version to 2.8.0.9 (#2631)

Co-authored-by: narrieta <narrieta>

* Cleanup history directory when creating new subdirectories (#2633)

* Cleanup history directory when creating new subdirectories

* Review feedback

Co-authored-by: narrieta <narrieta>

* Set agent version to 2.8.0.10

* Save sharedconfig to disk (#2649)

* Save sharedconfig to disk

* Update tests

* pylint warnings

Co-authored-by: narrieta <narrieta>

* Set Agent version to 2.8.0.11 (#2650)

Co-authored-by: narrieta <narrieta>

* test fixes

* Microsoft mandatory file (#2737)

Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>

---------

Co-authored-by: Laveesh Rohra <larohra@microsoft.com>
Co-authored-by: Kevin Clark <kevin.clark@microsoft.com>
Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>
Co-authored-by: Dhivya Ganesan <dhivyaganesan@users.noreply.github.com>
Co-authored-by: narrieta <narrieta>
Co-authored-by: Kevin Clark <keclar@microsoft.com>
Co-authored-by: Norberto Arrieta <narrieta@microsoft.com>
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants