@kjohn-msft kjohn-msft commented Apr 14, 2025

Background: For VMs that are Ubuntu Pro client capable, two sets of issues manifest as required security updates not getting installed on some machines.

  1. Canonical is aware of packages that normally appear as updates in the default flow not being shown as required in pro client scans. There is an explanation for this, but because we have taken a tight dependency on pro client whenever it is functional, these updates do not get installed.

  2. There are cases where wide swathes of security updates are not getting detected by pro client. In the past it was not clear whether this was a pro client issue or an issue with our code. The additional code that went in in November helped identify this as a pro client issue when newer reports came in: Apt debstyle882 support, bug-fixes in cache evaluation & manipulation #273

Both problems listed above are being resolved by not fully relying on pro client and using a combined overlay of the default scanning mechanism with whatever pro client reports. This is the 'best of both worlds' approach. Extensive logging additions will help further reviews with Canonical on pro client behaviors without affecting any customer while a multi-stage resolution is ironed out.
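
The overlay described above could be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function name `combine_security_updates`, the input shapes (parallel lists of package names and versions), and the choice to let the pro client version win for packages both sources report are all assumptions made for the example.

```python
def combine_security_updates(default_packages, default_versions,
                             pro_packages, pro_versions, pro_query_success):
    """Hypothetical helper: overlay the default scan with pro client results.

    Returns (names, versions, missed_by_pro), where missed_by_pro lists the
    packages the default scan found that pro client did not report. That list
    is only meaningful (and only populated) when the pro client query worked.
    """
    # Start from the default scanning mechanism so nothing it finds is lost.
    combined = dict(zip(default_packages, default_versions))
    missed_by_pro = []

    if pro_query_success:
        pro_reported = dict(zip(pro_packages, pro_versions))
        # Candidates for logging and later review with the pro client owners.
        missed_by_pro = sorted(set(combined) - set(pro_reported))
        # Assumption: where both sources report a package, use the pro version.
        combined.update(pro_reported)

    names = sorted(combined)
    versions = [combined[name] for name in names]
    return names, versions, missed_by_pro
```

With this shape, a failed or partial pro client query can never shrink the set of updates below what the default scan found, and the `missed_by_pro` list gives the extra logging something concrete to record.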

@kjohn-msft kjohn-msft added the bug (Something isn't working) and draft (Work in progress or planned for later) labels Apr 14, 2025
@kjohn-msft kjohn-msft self-assigned this Apr 14, 2025
codecov bot commented Apr 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.13%. Comparing base (d28c9cc) to head (156b9fa).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #308   +/-   ##
=======================================
  Coverage   93.13%   93.13%           
=======================================
  Files         103      103           
  Lines       17555    17566   +11     
=======================================
+ Hits        16349    16360   +11     
  Misses       1206     1206           
Flag Coverage Δ
python27 93.12% <100.00%> (+<0.01%) ⬆️
python312 93.13% <100.00%> (+<0.01%) ⬆️

@kjohn-msft kjohn-msft marked this pull request as ready for review April 16, 2025 21:45
@Copilot Copilot AI review requested due to automatic review settings April 16, 2025 21:45
@Copilot Copilot AI left a comment

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (3)

src/core/tests/Test_AptitudePackageManager.py:44

  • [nitpick] The function name 'mock_linux_distribution_to_return_ubuntu_oracular' may be unclear; consider including the Ubuntu version (e.g., '26') in the name to improve clarity.
def mock_linux_distribution_to_return_ubuntu_oracular(self):

src/core/src/bootstrap/Constants.py:381

  • Ensure that updating MAX_OS_MAJOR_VERSION_SUPPORTED to 24 is synchronized with related logic and tests to correctly reflect supported OS versions.
MAX_OS_MAJOR_VERSION_SUPPORTED = 24

src/core/src/package_managers/AptitudePackageManager.py:366

  • Consider adding unit tests to fully cover both outcomes of the 'get_security_updates' method and ensure that each branch behaves as expected.
if not ubuntu_pro_client_security_updates_query_success:

@kjohn-msft kjohn-msft requested a review from feng-j678 April 16, 2025 21:46
@kjohn-msft kjohn-msft added the OE (PR is considered near complete due to OE sign-off) label and removed the draft (Work in progress or planned for later) label Apr 16, 2025
@kjohn-msft kjohn-msft merged commit 1e997cc into master Apr 20, 2025
7 checks passed
@kjohn-msft kjohn-msft deleted the kjohn-proclient branch April 20, 2025 23:22
@rane-rajasi rane-rajasi mentioned this pull request Apr 25, 2025
rane-rajasi added a commit that referenced this pull request Apr 25, 2025
This release includes:
[x] Engg. hygiene: Remove TelemetryWriter related log noise
[#312](#312)
[x] Bugfix: Mitigate external Ubuntu Pro Client issue
[#308](#308)
[x] Feature: Adding support for Azure Linux 2.0 in Tdnf Package Manager
[#311](#311)
[x] Eng. sys: Upgrade CICD pipeline from Python 3.9 to Python 3.12
[#309](#309)
[x] Coverage: Increase code coverage - TimerManager and ServiceManager
[#307](#307)
[x] Bugfix: Unit tests broken in Python 3.12
[#306](#306)
[x] Feature: Adding Azure Linux 3.0 Base Support
[#293](#293)
[x] Bugfix: Retry Handler to Prevent Unbounded Retries while trying to
Mitigate YUM Update Errors
[#303](#303)
[x] BugFix: CentOS VMs not installing patches during Auto Patching
[#298](#298)
[x] Bugfix: Auto-assessment - Restricting execution permissions to root
user/ owner
[#299](#299)
feng-j678 pushed a commit that referenced this pull request Apr 29, 2025
…ction (#310)

Fixing issues seen while testing:
[Bugfix: Mitigate external Ubuntu Pro Client
issues](#308)

The main issue was that my test infra for that code failed to complete
patch installation with this error message:
_"Reboot failed to proceed on the machine in a timely manner."_

The machine then failed to accept ssh logins for 20+ minutes. I thought
the machine was dead, but it eventually came back, suggesting a much
longer reboot time than expected. Tracing this further in production
suggested that 500+ operations per week were hitting the same issue and
getting reported as a terminal failure.

This prompted a closer look at all the Reboot Manager code, and this PR
addresses every issue identified with it.

The changes in this PR:
1. Differentiating between all of the following:
(a) Reboot buffer in minutes = minimum time required to consider a
reboot. This was being overloaded incorrectly to also broadcast the time
delay before starting the reboot, which _could_ cause the maintenance
window to be exceeded silently.
(b) Reboot notify timeout in minutes = newly introduced to be deliberate
about the duration of the notification window.
(c) Reboot wait timeout in minutes (min & max) = time to wait before
considering an attempt to reboot a failure.

2. The effective reboot wait timeout is dynamic now - it sits
somewhere between the min and max allowed and uses as much of the
remaining time in the maintenance window as possible to give the reboot
a chance to succeed.

3. Reboot manager code has been cleaned up to meet current coding
standards for the code base.

4. Reduce likelihood of timeout at the Compute RP by refreshing the
status as long as we're still waiting for a reboot and our process is
still running.

The configured values may be adjusted in the future to account for what
is seen at scale.
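
The timeout handling described in items 1, 2 and 4 above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, parameters, and callbacks (`reboot_started`, `refresh_status`) are hypothetical and are not taken from the Reboot Manager code, and the exact way the notify timeout is subtracted from the remaining window is a guess at one reasonable policy.

```python
import time


def effective_reboot_wait_timeout(remaining_minutes, reboot_buffer_minutes,
                                  notify_timeout_minutes,
                                  min_wait_minutes, max_wait_minutes):
    """Hypothetical: pick a reboot wait timeout between min and max.

    Returns None when the remaining maintenance window is smaller than the
    reboot buffer, i.e. a reboot should not be considered at all.
    """
    if remaining_minutes < reboot_buffer_minutes:
        return None
    # Assumption: the notification window is spent before the wait begins.
    available = remaining_minutes - notify_timeout_minutes
    # Use as much of the window as is available, clamped to [min, max].
    return max(min_wait_minutes, min(available, max_wait_minutes))


def wait_for_reboot(timeout_minutes, reboot_started, refresh_status,
                    poll_seconds=30, sleep=time.sleep):
    """Hypothetical: wait for a reboot while keeping the status fresh.

    reboot_started and refresh_status are caller-supplied callbacks.
    Returns True if the reboot was observed, False if the wait timed out.
    """
    waited_seconds = 0
    while waited_seconds < timeout_minutes * 60:
        if reboot_started():
            return True
        # Refresh the reported status so the caller does not time out on us
        # while this process is still running and waiting.
        refresh_status()
        sleep(poll_seconds)
        waited_seconds += poll_seconds
    return False
```

Keeping the buffer check, the notify timeout, and the wait timeout as separate parameters mirrors the distinction drawn in item 1, and injecting `sleep` makes the wait loop easy to exercise in unit tests without real delays.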