refactor(auto-updater): track latest patch within current GRID driver major #155
Open
ganeshkumarashok wants to merge 2 commits into
The previous auto-updater read NvidiaGPU/Nvidia-GPU-Linux-Resources.json, which the HPC team stopped updating at vGPU 17.55 (550.144.06). All new GRID releases (18.5, 18.6, etc.) now land in NvidiaGPU/resources.json.

Changes:
- Switch the source URL to NvidiaGPU/resources.json
- Walk OS.Linux.Version[*].Driver[*] for Type='GRID' blocks
- Filter entries by vGPUVersion major == TARGET_VGPU_MAJOR (default '18')
- Pick the entry with the highest minor (correctly handles 18.10 > 18.6)
- Fall back from DirLink to FwLink when only the latter is populated
- Add a request timeout (no timeout previously)
- Add TARGET_VGPU_MAJOR constant so future major bumps (18 -> 19) are a single-line change

Tested against the live manifest:
- Latest v18 returned: 570.211.01 (vGPU 18.6)
- Idempotent when driver_config.yml is already at latest
- Bumps from v17 back to v18 when intentionally regressed
- TARGET_VGPU_MAJOR='19' (not yet released) raises a clear error

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
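The walk-filter-pick steps listed in this commit could look roughly like the sketch below. It is not the PR's code: the function names, the sample manifest, and the exact nesting (a list of Version blocks, each holding a list of Driver entries) are assumptions inferred from the path OS.Linux.Version[*].Driver[*]. The 18.10 entry and the example.invalid links are synthetic and exist only to exercise the numeric comparison and the link fallback.

```python
def iter_grid_entries(manifest):
    """Yield every Driver entry with Type == 'GRID' under OS.Linux.Version[*]."""
    versions = manifest.get("OS", {}).get("Linux", {}).get("Version", [])
    for version_block in versions:
        for entry in version_block.get("Driver", []):
            if entry.get("Type") == "GRID":
                yield entry


def pick_latest_for_vgpu_major(manifest, target_major="18"):
    """Return (vGPUVersion, link) for the highest minor within target_major."""
    candidates = []
    for entry in iter_grid_entries(manifest):
        major, _, minor = str(entry.get("vGPUVersion", "")).partition(".")
        if major == target_major and minor.isdigit():
            # Compare minors as integers so that 18.10 ranks above 18.6.
            candidates.append((int(minor), entry))
    if not candidates:
        raise RuntimeError(f"no GRID entries for vGPU major {target_major}")
    _, best = max(candidates, key=lambda pair: pair[0])
    # Prefer DirLink; fall back to FwLink when DirLink is missing or empty.
    link = best.get("DirLink") or best.get("FwLink")
    return best["vGPUVersion"], link


# Synthetic manifest (assumed shape, placeholder links).
SAMPLE = {
    "OS": {"Linux": {"Version": [
        {"Driver": [
            {"Type": "CUDA", "vGPUVersion": "18.99", "DirLink": "https://example.invalid/cuda"},
            {"Type": "GRID", "vGPUVersion": "17.5", "DirLink": "https://example.invalid/17.5"},
            {"Type": "GRID", "vGPUVersion": "18.5", "FwLink": "https://example.invalid/fw-18.5"},
        ]},
        {"Driver": [
            {"Type": "GRID", "vGPUVersion": "18.6", "DirLink": "https://example.invalid/18.6"},
            {"Type": "GRID", "vGPUVersion": "18.10", "DirLink": "", "FwLink": "https://example.invalid/fw-18.10"},
        ]},
    ]}}
}

if __name__ == "__main__":
    print(pick_latest_for_vgpu_major(SAMPLE))
```

Comparing the minor as an integer is what keeps 18.10 ahead of 18.6; a plain string comparison would rank them the other way round.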
Replace the hardcoded TARGET_VGPU_MAJOR="18" constant with logic that derives the target driver major directly from the currently-pinned grid.version in driver_config.yml.

This makes the auto-updater:
* Self-configuring: bumping driver_config.yml to a 595.x version automatically starts tracking 595.x patches, with no code change.
* Tied to the ABI-stable identifier: NVIDIA driver MAJOR (R570, R580) is the boundary across which kernel-module ABI, install-script behaviour, and vGPU licensing may change. Filtering by driver major (vs vGPU major) is the more semantically correct invariant.
* More conservative: within a major, only patch/minor bumps are picked up. Major bumps remain explicit, manual decisions.

Behaviour on current main (grid.version = 570.211.01):
- target major = "570"
- candidates in resources.json: 570.211.01, 570.195.03
- picked: 570.211.01 (no-op, idempotent)

When NVIDIA ships e.g. 570.215.xx, it will be picked up automatically.

Verified with 10 unit-style scenarios (current data, idempotency, 595-series tracking, 550-series tie-breaking, end-to-end update_driver_config, garbage-input error path, synthetic patch-bump-within-major, numeric vs lex sort, and missing-major error).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
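A minimal sketch of the policy this commit describes: derive the target major from the pinned version, keep only candidates in that major, and pick the numerically highest one. The helper names are hypothetical and the candidate lists are passed in directly; the real script presumably extracts them from resources.json.

```python
def parse_driver_version(version):
    """Turn '570.211.01' into (570, 211, 1); raise RuntimeError on garbage."""
    parts = version.strip().split(".")
    try:
        numbers = tuple(int(part) for part in parts)
    except ValueError:
        raise RuntimeError(f"malformed driver version: {version!r}") from None
    if len(numbers) < 2:
        raise RuntimeError(f"malformed driver version: {version!r}")
    return numbers


def pick_latest_in_major(current_version, available_versions):
    """Return the highest available version sharing current_version's major."""
    target_major = parse_driver_version(current_version)[0]
    candidates = [
        version for version in available_versions
        if parse_driver_version(version)[0] == target_major
    ]
    if not candidates:
        raise RuntimeError(f"no entries for driver major {target_major}")
    # Tuples of ints compare numerically, so 570.10.x beats 570.9.x.
    return max(candidates, key=parse_driver_version)


if __name__ == "__main__":
    available = ["550.144.06", "570.195.03", "570.211.01", "595.58.03"]
    print(pick_latest_in_major("570.211.01", available))
```

Keying `max` on the parsed tuple rather than the raw string is the numeric-versus-lexicographic distinction the test scenarios mention: as strings, "570.9.01" sorts above "570.10.01".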
Why
The previous auto-updater read the deprecated Nvidia-GPU-Linux-Resources.json, which HPC stopped updating past vGPU 17.55 (550.144.06). With main now on 570.211.01 (vGPU 18.6, merged in #154), the auto-updater was silently a no-op and would never pick up patches.

What this does
Switches the source to the live NvidiaGPU/resources.json and adds a clear, conservative selection policy:
- The target major is derived from driver_config.yml's grid.version itself; there is no hardcoded constant.
- grid.version: 570.211.01 ⇒ target major 570 ⇒ picks the highest 570.x.x in resources.json
- When NVIDIA ships 570.215.xx, it's picked up automatically
- When 595.x.x ships, it is not auto-bumped (major bumps need validation: kernel-module ABI, install-script behaviour, vGPU licensing all change)
- To move to a new major, bump driver_config.yml once; the auto-updater follows

Why driver-major and not vGPU-major
driver_config.yml

Behaviour today (verified locally)
- grid.version=570.211.01 ⇒ target major 570 ⇒ candidates {570.211.01, 570.195.03} ⇒ picks 570.211.01 (no-op, idempotent)
- If grid.version is hand-set to 595.58.03 ⇒ tracks 595.x (vGPU 20.x), does not regress to 570

Other improvements
- requests.get(..., timeout=30) to avoid hanging the daily workflow
- DirLink → FwLink fallback (the v18.5 entry in resources.json only has FwLink)
- RuntimeError if grid.version is malformed or the target major has no entries in resources.json

Tested
10 unit-style scenarios run locally: current data, idempotency, 595-series tracking, 550-series tie-breaking, end-to-end update_driver_config against the real yml, garbage-input error path, synthetic patch-bump-within-major, numeric-vs-lex sort, and missing-major error. All pass.

The existing update_grid_driver.yaml cron workflow needs no changes: grep '^\+ version: ' on the diff still works.
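The idempotency and "diff still works" points can be illustrated with a line-based rewrite of the pinned version. This is a sketch under stated assumptions, not the PR's update_driver_config: the function name is hypothetical, and the layout of driver_config.yml (a top-level grid: key with an indented version: line) is inferred from the grid.version path and the grep pattern, not confirmed by the source.

```python
import re


def set_grid_version(config_text, new_version):
    """Return (text, changed); rewrite only the version line under 'grid:'."""
    lines = config_text.splitlines(keepends=True)
    in_grid = False
    for index, line in enumerate(lines):
        if re.match(r"^grid:\s*$", line):
            in_grid = True
            continue
        if not in_grid:
            continue
        if line.strip() and not line[0].isspace():
            break  # reached the next top-level key without finding version
        match = re.match(r"^(\s+version:\s*)(\S+)(.*)$", line, re.DOTALL)
        if match:
            if match.group(2) == new_version:
                return config_text, False  # already at latest: no diff
            lines[index] = f"{match.group(1)}{new_version}{match.group(3)}"
            return "".join(lines), True
    raise RuntimeError("no grid.version entry found")


if __name__ == "__main__":
    # Assumed file layout; the real driver_config.yml may differ.
    config = "cuda:\n  version: 550.144.03\ngrid:\n  version: 570.195.03\n"
    updated, changed = set_grid_version(config, "570.211.01")
    print(changed)
    print(updated, end="")
```

Because only the one version line is touched and nothing is written when the pin is already current, a scheduled run produces either a single changed version line in the diff or no diff at all.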