GPUContainerImage schema with os, arch and cache info - move GPU container images to config #5153
ganeshkumarashok wants to merge 17 commits into
…r/move_gpu_img_config
…r/gpu_img_with_details
shouldPull=0 # Default to not pull

if [[ -n "$osSelectors" ]]; then
I suggest moving this logic into a function and adding unit tests that cover most of the if conditions, so that we don't need to rely on abe2e or RP-e2e to catch issues for us.
One way to do that is to put the function into cse_helpers.sh. There is a shellspec unit test file for cse_helpers.sh that has some examples of how to author tests.
The root-level README has some instructions too.
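As a rough sketch of what such an extracted function could look like: the names `listContains` and `shouldPullGPUImage`, the comma-separated selector format, and the rule that empty selectors mean "never pull" are all assumptions made for illustration, not code from this PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper: returns 0 if the comma-separated list in $2 contains $1.
# Comparison is case-insensitive and ignores spaces around the commas.
listContains() {
    local needle="${1,,}" haystack="${2// /}" item
    local -a items
    IFS=',' read -r -a items <<< "${haystack,,}"
    for item in "${items[@]}"; do
        [[ "$item" == "$needle" ]] && return 0
    done
    return 1
}

# Hypothetical helper: prints 1 if the image should be pulled, 0 otherwise.
# Arguments: current OS, current arch, OS selectors, arch selectors.
# Assumption: an image with no selectors is never pulled (default shouldPull=0).
shouldPullGPUImage() {
    local os="$1" arch="$2" osSelectors="$3" archSelectors="$4"
    # Guard clauses keep the logic flat: any failed check means "do not pull".
    [[ -n "$osSelectors" ]] || { echo 0; return; }
    [[ -n "$archSelectors" ]] || { echo 0; return; }
    listContains "$os" "$osSelectors" || { echo 0; return; }
    listContains "$arch" "$archSelectors" || { echo 0; return; }
    echo 1
}
```

Because the function only reads its arguments and prints a result, a shellspec example can exercise each branch with `When call shouldPullGPUImage ...` and `The output should equal ...`, without needing a VHD build or an e2e run.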
mkdir -p /opt/{actions,gpu}

# Check for the "fullgpu" feature flag
if grep -q "fullgpu" <<< "$FEATURE_FLAGS"; then
Again, avoid nesting if statements more than two levels deep. Deep nesting makes it hard to track which level you are in, which hurts debugging and readability.
I agree. I'm thinking about an alternative way to do this.
But I think this approach is a lot more complex than the alternate PR, which is much smaller: #5138
Right. General versus flexible is always a trade-off. If it fits the GPU images you expect in the medium term, I am fine with it too, since this will be used by GPU container images.
echo "Installing GPU driver from image: $fullImage"
bash -c "$CTR_GPU_INSTALL_CMD $fullImage gpuinstall /entrypoint.sh install"
ret=$?
if [[ "$ret" != "0" ]]; then
Again, avoid nesting if statements more than two levels deep. Deep nesting makes it hard to track which level you are in, which hurts debugging and readability.
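One common way to flatten nesting like this is to use guard clauses inside a function, so each precondition returns early. The sketch below is illustrative only: the function name `installGPUDriverFromImage`, its parameters, and the assumption that the "fullgpu" flag gates the install are hypothetical, and `installCmd` stands in for `$CTR_GPU_INSTALL_CMD` from the diff.

```shell
#!/usr/bin/env bash

# Hypothetical refactor: every check exits early, so no block is nested
# more than one level deep.
installGPUDriverFromImage() {
    local featureFlags="$1" fullImage="$2" installCmd="$3"

    # Guard 1: nothing to do unless the "fullgpu" feature flag is set (assumed gate).
    if ! grep -q "fullgpu" <<< "$featureFlags"; then
        echo "fullgpu flag not set, skipping GPU driver install"
        return 0
    fi

    # Guard 2: refuse to run without an image reference.
    if [[ -z "$fullImage" ]]; then
        echo "no GPU image provided" >&2
        return 1
    fi

    echo "Installing GPU driver from image: $fullImage"
    # Test the command directly instead of capturing ret=$? and nesting another if.
    if ! bash -c "$installCmd $fullImage gpuinstall /entrypoint.sh install"; then
        echo "Failed to install GPU driver from image: $fullImage" >&2
        return 1
    fi
    return 0
}
```

Passing the install command as a parameter also makes the function testable: a unit test can pass `true` or `false` in place of the real command to exercise the success and failure paths.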
"renovateTag": "registry=https://mcr.microsoft.com, name=aks/aks-gpu-grid",
"latestVersion": "535.161.08-20241021235607"
}
],
…r/gpu_img_with_details
…r/gpu_img_with_details
We had a discussion earlier, and I merged the alternate PR instead: #5138
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR moves GPU versions to a config file (components.json), so that Renovate bot can auto-update it. VHD builds will now consume the cuda version from components.json. It also adds a new schema to auto-update.
There are two new requirements:
The aks-gpu-cuda container image is only downloaded for a particular combination of OS and arch (Ubuntu, amd64).
The aks-gpu-grid container image needs to be present in the config but is never downloaded into the VHD. It is only used in CSE, for certain SKUs.
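To illustrate how a script might consume such a config entry, here is a minimal sketch that turns a `renovateTag` string and a `latestVersion` value (in the format shown in this PR's diff for aks-gpu-grid) into a pullable image reference. The helper name `buildFullImage` and the `<registry>/<name>:<version>` composition are assumptions for illustration, not the PR's actual implementation.

```shell
#!/usr/bin/env bash

# Hypothetical helper: builds "<registry host>/<name>:<version>" from a
# renovateTag string of the form "registry=<url>, name=<repo>" and a version.
buildFullImage() {
    local renovateTag="$1" version="$2"
    local registry name

    # Take the value after "registry=" up to the next comma.
    registry="${renovateTag#*registry=}"
    registry="${registry%%,*}"
    # Drop the URL scheme, since image references do not carry one.
    registry="${registry#https://}"

    # Take the value after "name=" up to the next comma, if any.
    name="${renovateTag#*name=}"
    name="${name%%,*}"

    echo "${registry}/${name}:${version}"
}

buildFullImage "registry=https://mcr.microsoft.com, name=aks/aks-gpu-grid" "535.161.08-20241021235607"
# prints: mcr.microsoft.com/aks/aks-gpu-grid:535.161.08-20241021235607
```

Using only bash parameter expansion keeps the helper free of external tools, which makes it easy to cover with a unit test.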
Which issue(s) this PR fixes:
Fixes #
Requirements:
Special notes for your reviewer:
Release note: