Bugfix: Mitigate external Ubuntu Pro Client issues #308
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #308   +/-   ##
=======================================
  Coverage   93.13%   93.13%
=======================================
  Files         103      103
  Lines       17555    17566   +11
=======================================
+ Hits        16349    16360   +11
  Misses       1206     1206
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (3)
src/core/tests/Test_AptitudePackageManager.py:44
- [nitpick] The function name 'mock_linux_distribution_to_return_ubuntu_oracular' may be unclear; consider including the Ubuntu version (e.g., '26') in the name to improve clarity.
def mock_linux_distribution_to_return_ubuntu_oracular(self):
src/core/src/bootstrap/Constants.py:381
- Ensure that updating MAX_OS_MAJOR_VERSION_SUPPORTED to 24 is synchronized with related logic and tests to correctly reflect supported OS versions.
MAX_OS_MAJOR_VERSION_SUPPORTED = 24
src/core/src/package_managers/AptitudePackageManager.py:366
- Consider adding unit tests to fully cover both outcomes of the 'get_security_updates' method and ensure that each branch behaves as expected.
if not ubuntu_pro_client_security_updates_query_success:
This release includes:
- [x] Engg. hygiene: Remove TelemetryWriter related log noise [#312](#312)
- [x] Bugfix: Mitigate external Ubuntu Pro Client issue [#308](#308)
- [x] Feature: Adding support for Azure Linux 2.0 in Tdnf Package Manager [#311](#311)
- [x] Eng. sys: Upgrade CICD pipeline from Python 3.9 to Python 3.12 [#309](#309)
- [x] Coverage: Increase code coverage - TimerManager and ServiceManager [#307](#307)
- [x] Bugfix: Unit tests broken in Python 3.12 [#306](#306)
- [x] Feature: Adding Azure Linux 3.0 Base Support [#293](#293)
- [x] Bugfix: Retry Handler to Prevent Unbounded Retries while trying to Mitigate YUM Update Errors [#303](#303)
- [x] BugFix: CentOS VMs not installing patches during Auto Patching [#298](#298)
- [x] Bugfix: Auto-assessment - Restricting execution permissions to root user/ owner [#299](#299)
…ction (#310)
Fixing issues seen while testing: [Bugfix: Mitigate external Ubuntu Pro Client issues](#308)
The main issue was that my test infra for that code failed to complete patch installation with this error message: _"Reboot failed to proceed on the machine in a timely manner."_ The machine then failed to accept ssh logins for 20+ minutes. I thought the machine was dead, but it eventually came back, suggesting a much longer reboot time than expected. Tracing this further in production suggested there were 500+ operations per week hitting the same issue and getting reported as a terminal failure. This prompted a closer look at all the Reboot Manager code, and this PR addresses every issue identified with it.
The changes in this PR:
1. Differentiating between all of the following:
   (a) Reboot buffer in minutes = minimum time required to consider a reboot. This was being overloaded incorrectly to also broadcast the time delay before starting the reboot, which _could_ cause the maintenance window to be silently exceeded.
   (b) Reboot notify timeout in minutes = newly introduced to be deliberate about the duration of the notification window.
   (c) Reboot wait timeout in minutes (min & max) = time duration to wait before considering an attempt to reboot a failure.
2. The effective reboot wait timeout is dynamic now - it sits somewhere between the min and max allowed, and uses as much of the remaining time in the maintenance window as is available to give the reboot a chance to succeed.
3. Reboot manager code has been cleaned up to meet current coding standards for the code base.
4. Reduce the likelihood of a timeout at the Compute RP by refreshing the status as long as we're still waiting for a reboot and our process is still running.
The configured values may be adjusted in the future to account for what is seen at scale.
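The dynamic wait timeout in item 2 can be read as a clamp of the remaining maintenance-window time between the configured min and max. A minimal sketch of that reading follows; the function name, parameter names, and default values are hypothetical and are not taken from the Reboot Manager code.

```python
def effective_reboot_wait_timeout(remaining_window_minutes,
                                  min_wait_minutes=5,
                                  max_wait_minutes=40):
    """Clamp the remaining maintenance-window time into [min, max].

    Hypothetical helper: the names and default values are illustrative
    assumptions, not the values configured in the code base.
    """
    # Never wait less than the minimum, never more than the maximum,
    # and otherwise use whatever time is left in the window.
    return max(min_wait_minutes,
               min(max_wait_minutes, remaining_window_minutes))
```

With these assumed defaults, a window with 20 minutes left yields a 20-minute wait, a window with 100 minutes left is capped at 40, and a nearly expired window still gets the 5-minute minimum.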
Background: For VMs that are Ubuntu Pro client capable, there are 2 sets of issues that manifest as required security updates not getting installed on some machines:
1. Canonical is aware of packages normally seen as updates in the default flow not being shown as required in pro client scans. There is an explanation for this, but the tight dependency we have taken on pro client when it's functional causes these updates not to get installed.
2. There are cases where wide swathes of security updates are not getting detected by pro client. It was not clear in the past whether this was a pro client issue or an issue with our code. The additional code that went in in November helped identify that this was a pro client issue when newer reports came in: Apt debstyle882 support, bug-fixes in cache evaluation & manipulation #273
Both problems listed above are being resolved by not fully relying on pro client and using a combined overlay of the default scanning mechanism with whatever pro client reports. This is the 'best of both worlds' approach. Extensive logging additions will help further reviews with Canonical on pro client behaviors without affecting any customer while a multi-stage resolution is ironed out.
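The combined overlay can be sketched as a merge of the two scan results. This is an illustration only: the function and parameter names are hypothetical, and preferring the pro client's version when both sources report the same package is an assumption, not something this PR description states.

```python
def overlay_security_updates(default_packages, default_versions,
                             pro_packages, pro_versions,
                             pro_query_success=True):
    """Merge default-scan results with pro client results.

    Hypothetical helper; names and the conflict rule are assumptions.
    Returns (packages, versions) as two parallel lists.
    """
    # Start from what the default scanning mechanism reports.
    merged = dict(zip(default_packages, default_versions))

    # Overlay pro client results only if its query succeeded.
    if pro_query_success:
        for name, version in zip(pro_packages, pro_versions):
            # Assumption: the pro client version wins on conflict.
            merged[name] = version

    # dict preserves insertion order, so default-scan order is kept
    # and pro-only packages are appended at the end.
    return list(merged.keys()), list(merged.values())
```

Under this sketch, a package missed by pro client but found by the default scan still ends up in the install list, and a failed pro client query falls back to the default results unchanged.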