
refactor(auto-updater): track latest patch within current GRID driver major #155

Open
ganeshkumarashok wants to merge 2 commits into main from update-auto-updater-grid-v18

@ganeshkumarashok ganeshkumarashok commented May 20, 2026

Why

The previous auto-updater read the deprecated Nvidia-GPU-Linux-Resources.json, which HPC stopped updating past vGPU 17.55 (550.144.06). With main now on 570.211.01 (vGPU 18.6, merged in #154), the auto-updater was silently a no-op and would never pick up patches.

What this does

Switches the source to the live NvidiaGPU/resources.json and adds a clear, conservative selection policy:

Track the latest patch within the currently-pinned driver major.

The target major is derived from driver_config.yml's grid.version itself — no hardcoded constant.

  • grid.version: 570.211.01 ⇒ target major 570 ⇒ picks the highest 570.x.x in resources.json
  • When NVIDIA ships 570.215.xx, it's picked up automatically
  • When 595.x.x ships, it is not auto-bumped (major bumps need validation — kernel-module ABI, install-script behaviour, vGPU licensing all change)
  • To intentionally move to a new major (e.g. R580), edit driver_config.yml once; the auto-updater follows
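The selection policy above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `parse_version` and `pick_latest_in_major` are hypothetical, and the list of available versions stands in for whatever the updater extracts from resources.json.

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '570.211.01' into (570, 211, 1) so comparisons are numeric."""
    if not re.fullmatch(r"\d+(\.\d+)+", version):
        raise RuntimeError(f"malformed driver version: {version!r}")
    return tuple(int(part) for part in version.split("."))


def pick_latest_in_major(pinned: str, available: list[str]) -> str:
    """Return the highest available version sharing the pinned version's major."""
    # The target major comes from the pinned version itself, not a constant.
    target_major = parse_version(pinned)[0]
    candidates = [v for v in available if parse_version(v)[0] == target_major]
    if not candidates:
        raise RuntimeError(f"no {target_major}.x entries available")
    # Compare as integer tuples, never as strings.
    return max(candidates, key=parse_version)


# Hypothetical stand-in for the versions listed in resources.json.
available = ["570.211.01", "570.195.03", "595.58.03"]
print(pick_latest_in_major("570.211.01", available))  # 570.211.01 (no-op)
print(pick_latest_in_major("595.58.03", available))   # 595.58.03 (stays on 595)
```

Because the major is read from the pinned version, moving to a new major in this sketch only requires changing the `pinned` argument, which mirrors the "edit driver_config.yml once" workflow described above.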

Why driver-major and not vGPU-major

| | vGPU major (18.x) | Driver major (570.x) |
|---|---|---|
| Source of truth | hardcoded constant | driver_config.yml |
| ABI-stable boundary | indirectly | directly (this is NVIDIA's driver branch identifier) |
| Migrating to next | edit code | edit yml |

Same behaviour today: ✓ (both map to {570.211.01, 570.195.03})

Behaviour today (verified locally)

  • Current state: grid.version=570.211.01 ⇒ target major 570 ⇒ candidates {570.211.01, 570.195.03} ⇒ picks 570.211.01 (no-op, idempotent)
  • If grid.version is hand-set to 595.58.03 ⇒ tracks 595.x (vGPU 20.x), does not regress to 570

Other improvements

  • requests.get(..., timeout=30) to avoid hanging the daily workflow
  • DirLink → FwLink fallback (the v18.5 entry in resources.json only has FwLink)
  • Numeric sort via tuple-of-ints (so 570.99.10 > 570.99.9, which a lexicographic string sort would get wrong)
  • Clear RuntimeError if grid.version is malformed or the target major has no entries in resources.json
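As a rough sketch of the manifest handling, the snippet below walks a parsed manifest for GRID entries and applies the DirLink → FwLink fallback. The nesting follows the `OS.Linux.Version[*].Driver[*]` path and the `Type`, `DirLink`, and `FwLink` fields named in the commit message, but the exact JSON shape is an assumption here, and the helper names and sample data are hypothetical. In the real updater the dict would come from `requests.get(..., timeout=30)`; the sketch takes an already-parsed dict so it runs without network access.

```python
from typing import Any, Iterator


def iter_grid_entries(manifest: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield each driver entry with Type == 'GRID' under OS.Linux.Version[*].Driver[*]."""
    linux = manifest.get("OS", {}).get("Linux", {})
    for version_block in linux.get("Version", []):
        for driver in version_block.get("Driver", []):
            if driver.get("Type") == "GRID":
                yield driver


def download_link(entry: dict[str, Any]) -> str:
    """Prefer DirLink; fall back to FwLink when DirLink is missing or empty."""
    link = entry.get("DirLink") or entry.get("FwLink")
    if not link:
        raise RuntimeError("GRID entry has neither DirLink nor FwLink")
    return link


# Hypothetical sample data; the URLs are placeholders, not real download links.
sample = {
    "OS": {"Linux": {"Version": [{"Driver": [
        {"Type": "GRID", "DirLink": "https://example.invalid/dir"},
        {"Type": "GRID", "FwLink": "https://example.invalid/fw"},
        {"Type": "CUDA", "DirLink": "https://example.invalid/cuda"},
    ]}]}}
}
for entry in iter_grid_entries(sample):
    print(download_link(entry))
```

Treating an empty `DirLink` the same as a missing one keeps the fallback to a single expression; whether the real manifest uses empty strings or omits the key is not stated in this PR.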

Tested

10 unit-style scenarios run locally — current data, idempotency, 595-series tracking, 550-series tie-breaking, end-to-end update_driver_config against the real yml, garbage-input error path, synthetic patch-bump-within-major, numeric-vs-lex sort, and missing-major error. All pass.

The existing update_grid_driver.yaml cron workflow needs no changes — grep '^\+ version: ' on the diff still works.
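To illustrate what that grep depends on, here is a hedged Python equivalent that looks for an added `version:` line in a unified diff. The yml layout shown (an unquoted `version:` key indented under `grid:`) and the function name `added_versions` are assumptions for illustration; the exact whitespace in the workflow's pattern is not recoverable from this page, so the sketch tolerates any indentation.

```python
import difflib
import re

# Matches added lines such as "+  version: 570.215.02" in a unified diff.
# The "+++ " file header does not match: after the first "+" comes another "+".
ADDED_VERSION = re.compile(r"^\+\s*version: (\S+)")


def added_versions(old_text: str, new_text: str) -> list[str]:
    """Return the version values on lines the diff marks as added."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""
    )
    found = []
    for line in diff:
        match = ADDED_VERSION.match(line)
        if match:
            found.append(match.group(1))
    return found


# Hypothetical before/after contents of driver_config.yml.
old = "grid:\n  version: 570.195.03\n"
new = "grid:\n  version: 570.211.01\n"
print(added_versions(old, new))  # ['570.211.01']
print(added_versions(new, new))  # [] (idempotent run, nothing to report)
```

An empty result corresponds to the idempotent case, where the updater rewrites nothing and the workflow has no version bump to act on.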

Ganesh Kumar Ashokavardhanan and others added 2 commits May 19, 2026 17:02
The previous auto-updater read NvidiaGPU/Nvidia-GPU-Linux-Resources.json,
which the HPC team stopped updating at vGPU 17.55 (550.144.06). All new
GRID releases (18.5, 18.6, etc.) now land in NvidiaGPU/resources.json.

Changes:
- Switch the source URL to NvidiaGPU/resources.json
- Walk OS.Linux.Version[*].Driver[*] for Type='GRID' blocks
- Filter entries by vGPUVersion major == TARGET_VGPU_MAJOR (default '18')
- Pick the entry with the highest minor (correctly handles 18.10 > 18.6)
- Fall back from DirLink to FwLink when only the latter is populated
- Add a request timeout (no timeout previously)
- Add TARGET_VGPU_MAJOR constant so future major bumps (18 -> 19) are
  a single-line change

Tested against the live manifest:
- Latest v18 returned: 570.211.01 (vGPU 18.6)
- Idempotent when driver_config.yml is already at latest
- Bumps from v17 back to v18 when intentionally regressed
- TARGET_VGPU_MAJOR='19' (not yet released) raises a clear error

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the hardcoded TARGET_VGPU_MAJOR="18" constant with logic that
derives the target driver major directly from the currently-pinned
grid.version in driver_config.yml. This makes the auto-updater:

* Self-configuring — bumping driver_config.yml to a 595.x version
  automatically starts tracking 595.x patches, with no code change.
* Tied to the ABI-stable identifier — NVIDIA driver MAJOR (R570, R580)
  is the boundary across which kernel-module ABI, install-script
  behaviour, and vGPU licensing may change. Filtering by driver major
  (vs vGPU major) is the more semantically correct invariant.
* More conservative — within a major, only patch/minor bumps are picked
  up. Major bumps remain explicit, manual decisions.

Behaviour on current main (grid.version = 570.211.01):
  - target major = "570"
  - candidates in resources.json: 570.211.01, 570.195.03
  - picked: 570.211.01 (no-op, idempotent)

When NVIDIA ships e.g. 570.215.xx, it will be picked up automatically.

Verified with 10 unit-style scenarios (current data, idempotency,
595-series tracking, 550-series tie-breaking, end-to-end
update_driver_config, garbage-input error path, synthetic
patch-bump-within-major, numeric vs lex sort, and missing-major error).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ganeshkumarashok changed the title from "refactor(auto-updater): track vGPU 18.x GRID drivers from resources.json" to "refactor(auto-updater): track latest patch within current GRID driver major" on May 20, 2026