feat: add build-only and install-skip-build modes for VHD-prebuilt GPU kernel module #159

Draft
ganeshkumarashok wants to merge 1 commit into Azure:main from ganeshkumarashok:gpu-prebuild-kernel-module

@ganeshkumarashok (Collaborator)

What

Splits the host-side NVIDIA driver install into two phases so the kernel module can be DKMS-compiled into the VHD at image build time and the boot-time install can skip straight to device init. This is the aks-gpu half of an effort to cut AKS GPU node provisioning time by removing the expensive boot-time kernel-module compile.

Changes

  • install.sh — refactored into:
    • build_kernel_module() — compile + stage userspace libs, no device access (safe on a GPU-less builder).
    • device_init() — modprobe, nvidia-smi, fabric manager, containerd config, udev.
    • New modes: AKSGPU_BUILD_ONLY and AKSGPU_SKIP_KERNEL_BUILD.
    • Overlay cleanup trap.
    • Writes a dkms-marker (/opt/azure/aks-gpu/dkms-marker) recording kernel, driver_version, driver_kind, arch so the consumer (AgentBaker CSE) can validate an exact match before taking the skip-build fast path.
  • entrypoint.sh — new build-only and install-skip-build actions, passed through to the host via nsenter. The default install action is unchanged.
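
The two-phase split and the new modes described above can be pictured with a minimal sketch. This is not the actual install.sh: the function bodies are stubs, the wrapper name `run_install` is hypothetical, and it assumes the two variables are set to the string `true` to enable a mode and that setting both at once is an error. None of those details are confirmed by this PR text.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the mode dispatch; the real phases are stubbed out.
set -eu

build_kernel_module() {
  # Real script: DKMS compile + stage userspace libs, no device access.
  echo "build_kernel_module"
}

device_init() {
  # Real script: modprobe, nvidia-smi, fabric manager, containerd config, udev.
  echo "device_init"
}

run_install() {
  # Assumption: both flags default to "false" and are enabled with "true".
  local build_only="${AKSGPU_BUILD_ONLY:-false}"
  local skip_build="${AKSGPU_SKIP_KERNEL_BUILD:-false}"

  # Assumption: the two modes are mutually exclusive.
  if [ "$build_only" = "true" ] && [ "$skip_build" = "true" ]; then
    echo "AKSGPU_BUILD_ONLY and AKSGPU_SKIP_KERNEL_BUILD cannot both be set" >&2
    return 1
  fi

  # Build phase runs unless the module was prebuilt into the VHD.
  if [ "$skip_build" != "true" ]; then
    build_kernel_module
  fi

  # Device phase runs unless we are on a (possibly GPU-less) image builder.
  if [ "$build_only" != "true" ]; then
    device_init
  fi
}
```

With neither variable set, both phases run in order, which corresponds to the unchanged default install path. Setting `AKSGPU_BUILD_ONLY=true` would stop after the build phase, and `AKSGPU_SKIP_KERNEL_BUILD=true` would run only device init.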

Compatibility

The default install path is behavior-preserving. The new actions are additive.

Companion change

The consumer side lives in Azure/AgentBaker (build-time prebuild + boot-time fast path), gated behind PREBUILD_GPU_KERNEL_MODULE (default off). This image must be published to MCR and referenced in components.json before that flag is enabled.
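
For illustration, the exact-match check that the consumer is expected to run against the dkms-marker might look roughly like the sketch below. The marker's on-disk format is not shown in this PR, so the `key=value` layout is an assumption, and the helper names `marker_get` and `marker_matches` are hypothetical.

```shell
# Assumed marker layout (one key=value pair per line), not confirmed by the PR:
#   kernel=<kernel release>
#   driver_version=<driver version>
#   driver_kind=<driver kind>
#   arch=<architecture>

marker_get() {
  # $1 = marker file, $2 = key; prints the first value recorded for that key.
  sed -n "s/^$2=//p" "$1" | head -n 1
}

marker_matches() {
  # $1 = marker file; $2..$5 = expected kernel, driver_version, driver_kind, arch.
  # Succeeds only if the file is readable and all four fields match exactly.
  local marker="$1"
  [ -r "$marker" ] || return 1
  [ "$(marker_get "$marker" kernel)" = "$2" ] || return 1
  [ "$(marker_get "$marker" driver_version)" = "$3" ] || return 1
  [ "$(marker_get "$marker" driver_kind)" = "$4" ] || return 1
  [ "$(marker_get "$marker" arch)" = "$5" ] || return 1
}
```

A caller could then take the fast path only on a full match and fall back to the default install otherwise, for example `marker_matches /opt/azure/aks-gpu/dkms-marker "$(uname -r)" "$DRIVER_VERSION" "$DRIVER_KIND" "$(uname -m)"`, where `DRIVER_VERSION` and `DRIVER_KIND` are placeholder variables and the use of `uname -m` for the arch field is a guess at how the marker encodes it.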

Still required before production

  • Verify nvidia-installer --dkms compiles cleanly on a GPU-less builder with the exact driver/flags.
  • Secure Boot: the prebuilt .ko must be signed with a key the node trusts.
  • GPU e2e validation.
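
As a starting point for the Secure Boot item, a quick check of whether a built module carries any signature at all could be sketched as follows. It relies on `modinfo -F signer`, which prints the signer field of a module and prints nothing for an unsigned one. The helper names are hypothetical, and a non-empty signer only shows that the module is signed, not that the node trusts the signing key.

```shell
module_signer() {
  # $1 = module name or path to a .ko file.
  # Prints the signer recorded in the module, or nothing if unsigned or unreadable.
  modinfo -F signer "$1" 2>/dev/null || true
}

module_is_signed() {
  # Succeeds if the module reports a non-empty signer.
  [ -n "$(module_signer "$1")" ]
}
```

On a node, `module_is_signed nvidia` could be combined with `mokutil --sb-state` to see whether Secure Boot is actually enabled. Confirming that the signing key is enrolled on the node remains a separate validation step.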

Draft until the above are validated.

…uilt kernel module

Split the host-side driver install into two phases so the NVIDIA kernel module
can be DKMS-compiled into the VHD at image build time and the boot-time install
can skip straight to device init:

- install.sh: refactor into build_kernel_module() (compile + stage userspace
  libs, no device access) and device_init() (modprobe, nvidia-smi, fabric
  manager, containerd config, udev). Add AKSGPU_BUILD_ONLY and
  AKSGPU_SKIP_KERNEL_BUILD modes, an overlay cleanup trap, and a dkms-marker
  (/opt/azure/aks-gpu/dkms-marker) recording kernel, driver_version,
  driver_kind and arch so the consumer (AgentBaker CSE) can validate an exact
  match before taking the skip-build fast path.
- entrypoint.sh: add build-only and install-skip-build actions and pass the
  mode through to the host via nsenter. The default install action is
  unchanged.

This is the aks-gpu half of the AgentBaker change that prebuilds the GPU kernel
module into the VHD to reduce node provisioning time. Secure Boot module
signing and GPU e2e validation are still required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>