GPUContainerImage schema with os, arch and cache info - move GPU container images to config #5153

Closed
ganeshkumarashok wants to merge 17 commits into master from aganeshkumar/gpu_img_with_details

@ganeshkumarashok (Contributor) commented Oct 24, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR moves the GPU versions to a config file (components.json) so that the Renovate bot can auto-update them. VHD builds will now consume the CUDA version from components.json. It also adds a new schema (GPUContainerImage, with os, arch and cache info) for the auto-updated entries.

There are two new requirements:

- The aks-gpu-cuda container image is only downloaded for a particular combination of OS and arch (Ubuntu, amd64).
- The aks-gpu-grid container image needs to be present in the config but is never downloaded in the VHD. It's only used in CSE, for certain SKUs.

Which issue(s) this PR fixes:

Fixes #

Requirements:

Special notes for your reviewer:

Release note:

none

Comment thread schemas/components.cue Outdated
Comment thread vhdbuilder/packer/install-dependencies.sh Outdated
Comment thread vhdbuilder/packer/install-dependencies.sh Outdated
@ganeshkumarashok (Contributor, Author)

We considered two alternatives: a much simpler PR without OS and arch (#5138), and a more complex one that adds the images to ContainerImages (#5139).


shouldPull=0 # Default to not pull

if [[ -n "$osSelectors" ]]; then
Collaborator

I suggest putting this logic into a function and adding unit tests that cover most of the if conditions, so that we don't need to rely on abe2e or RP-e2e to catch issues for us.
One way to do that is to put the function into cse_helpers.sh. There is a shellspec unit test file cse_helpers.sh which has some examples of how to author tests.
The root-level README has some instructions too.
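As a rough illustration of that suggestion, the selector check could be pulled out into a small function that can be tested without a VHD build. This is only a sketch: the function names `matches_selector` and `should_pull_gpu_image`, the comma-separated selector format, and the rule that an empty selector list means "never pull" are all assumptions, not code from this PR.

```shell
# Hypothetical helpers (names and selector format are assumptions).

# matches_selector LIST VALUE
# Succeeds when VALUE appears in the comma-separated LIST (no spaces),
# compared case-insensitively. An empty LIST or VALUE never matches,
# which keeps "do not pull" as the default.
matches_selector() {
  local list value
  list=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  value=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  [ -n "$list" ] || return 1
  [ -n "$value" ] || return 1
  case ",$list," in
    *",$value,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# should_pull_gpu_image OS_SELECTORS ARCH_SELECTORS CURRENT_OS CURRENT_ARCH
# Returns 0 only when both the OS and the arch of the current build
# are listed in the image's selectors.
should_pull_gpu_image() {
  matches_selector "$1" "$3" || return 1
  matches_selector "$2" "$4" || return 1
  return 0
}

# Example call with illustrative values.
if should_pull_gpu_image "ubuntu" "amd64" "ubuntu" "amd64"; then
  echo "pull image"
fi
```

Because the decision is a plain exit status with no side effects, each branch can be exercised directly from a unit test instead of waiting for an e2e run.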

mkdir -p /opt/{actions,gpu}

# Check for the "fullgpu" feature flag
if grep -q "fullgpu" <<< "$FEATURE_FLAGS"; then
Collaborator

Again, avoid nesting if statements more than 2 levels deep. It's hard to keep track of which level you are in, which hurts debugging and readability.

@ganeshkumarashok (Contributor, Author) commented Oct 24, 2024

I agree - thinking about the alternate way.

But I think this approach is making it a lot more complex than the alternate PR, which is much smaller: #5138

Collaborator

Right. General vs flexible is always a trade-off. If it can fit your GPU images for the medium term, I am fine with it too, as this will be used by GPU container images.

echo "Installing GPU driver from image: $fullImage"
bash -c "$CTR_GPU_INSTALL_CMD $fullImage gpuinstall /entrypoint.sh install"
ret=$?
if [[ "$ret" != "0" ]]; then
Collaborator

Again, avoid nesting if statements more than 2 levels deep. It's hard to keep track of which level you are in, which hurts debugging and readability.
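One common way to keep the nesting shallow is to wrap the steps in a function and return early from each check (guard clauses). The sketch below assumes the feature-flag check and the install step belong to the same flow. `install_gpu_driver` and `run_gpu_install` are hypothetical names, and `run_gpu_install` is only a stand-in for the real install command so the example can run anywhere.

```shell
# Stand-in for the real install command (assumption for illustration):
# it fails when given an empty image reference.
run_gpu_install() {
  [ -n "$1" ]
}

# install_gpu_driver FEATURE_FLAGS FULL_IMAGE
# Each check returns early, so no block is nested more than one level.
install_gpu_driver() {
  local feature_flags="$1" full_image="$2"

  # Guard 1: nothing to do without the "fullgpu" feature flag.
  case "$feature_flags" in
    *fullgpu*) ;;
    *)
      echo "fullgpu not set, skipping GPU driver install"
      return 0
      ;;
  esac

  # Guard 2: stop as soon as the install fails.
  echo "Installing GPU driver from image: $full_image"
  if ! run_gpu_install "$full_image"; then
    echo "Failed to install GPU driver from image: $full_image" >&2
    return 1
  fi

  echo "GPU driver installed"
}
```

Each failure path is a single `return`, so the success path reads top to bottom at one indentation level.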

"renovateTag": "registry=https://mcr.microsoft.com, name=aks/aks-gpu-grid",
"latestVersion": "535.161.08-20241021235607"
}
],
Collaborator

indent

@ganeshkumarashok (Contributor, Author)

After an earlier discussion, I merged the alternate PR instead: #5138
