This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

[WIP] Add NVIDIA drivers for k8s linux agents #989

Merged (9 commits) on Oct 24, 2017


wbuchwalter (Contributor) commented Jul 13, 2017

Automatically install NVIDIA drivers on Linux agents for Kubernetes.
Fixes #828

To Do:

  • Add scripts to install NVIDIA drivers
  • Ensure we are allowed to do this from a legal perspective
  • Add unit tests
  • Update GPU documentation

Implementation Details:

  • There are two ways to install the NVIDIA drivers on Ubuntu:
    • By using the *.run file available on NVIDIA's website. This is the faster method (it takes ~2-3 minutes to install), but all the NVIDIA libraries are installed under /usr/lib/x86_64-linux-gnu/, which means mounting the drivers into the container can be painful and can create conflicts. A workaround is to map /usr/lib/x86_64-linux-gnu/ to a different directory in the container, but this can probably still cause issues if you need both the container's own /usr/lib/x86_64-linux-gnu/ and the mounted path in your LD_LIBRARY_PATH. Here is an example of a template.

    • By using apt with the PPA repository. This is much slower to install (~12 minutes), so it is not ideal when scaling a lot, but the libraries are installed under a separate directory, /usr/lib/nvidia-*. This allows us to mount only the drivers into the container and nothing else, which should prevent dependency/versioning issues. Here is an example of a template. This is the method I used in this PR. I assume most people using GPUs will run long jobs (in ML we are typically talking about hours), so 12 minutes shouldn't be a big deal, and this should cause fewer issues.
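To make the "mount only the drivers" idea concrete, here is a minimal Python sketch that builds a Pod manifest with a single hostPath volume for the driver directory. It is not the template linked above or the code in this PR. The function name `gpu_pod_manifest`, the container mount point `/usr/local/nvidia`, the default GPU resource name, and the `nvidia-XXX` placeholder are all assumptions made for illustration; the real directory name depends on the installed driver version.

```python
import json
from typing import Any, Dict


def gpu_pod_manifest(
    name: str,
    image: str,
    host_driver_dir: str,
    container_driver_dir: str = "/usr/local/nvidia",  # hypothetical mount point
    gpu_resource: str = "alpha.kubernetes.io/nvidia-gpu",  # assumed resource name
    gpu_count: int = 1,
) -> Dict[str, Any]:
    """Build a Pod manifest that mounts only the host's driver directory."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Request GPUs through the (assumed) extended resource name.
                    "resources": {"limits": {gpu_resource: gpu_count}},
                    # Point the dynamic linker at the mounted driver libraries.
                    "env": [
                        {"name": "LD_LIBRARY_PATH", "value": container_driver_dir}
                    ],
                    "volumeMounts": [
                        {
                            "name": "nvidia-driver",
                            "mountPath": container_driver_dir,
                            "readOnly": True,
                        }
                    ],
                }
            ],
            # Only the driver directory is exposed, nothing else from /usr/lib.
            "volumes": [
                {"name": "nvidia-driver", "hostPath": {"path": host_driver_dir}}
            ],
        },
    }


if __name__ == "__main__":
    # "nvidia-XXX" is a placeholder, not a real driver version.
    manifest = gpu_pod_manifest("gpu-job", "my-registry/my-image", "/usr/lib/nvidia-XXX")
    print(json.dumps(manifest, indent=2))
```

Because the host path is a dedicated directory, the container's own /usr/lib/x86_64-linux-gnu/ is left untouched, which is the property the apt-based install is chosen for.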



seanknox (Contributor) left a comment

Thanks @wbuchwalter. We need to hold off on this until we answer some legal questions about automatically installing the drivers.

wbuchwalter (Contributor, Author) commented Oct 6, 2017

I updated this PR to allow for different driver installation scripts depending on the VMSize.
Currently, only K80 and M60 VMs will have the drivers installed automatically.
P40 and P100 VMs will need manual installation for now.
Also, at some point some customization might be needed for the NC24r and NV24r for InfiniBand.
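As a rough illustration of choosing an install path per VM size, here is a small Python sketch. It is not the code from this PR. The helper names and the size-name patterns (NC for K80, NV for M60, ND for P40, NC..s_v2 for P100) are assumptions based on Azure's GPU size naming, and they should be checked against the sizes you actually deploy.

```python
import re
from typing import Optional

# Assumed mapping from Azure VM size name patterns to GPU models.
_SIZE_PATTERNS = [
    (re.compile(r"standard_nc\d+r?s_v2"), "P100"),
    (re.compile(r"standard_nd\d+r?s"), "P40"),
    (re.compile(r"standard_nc\d+r?"), "K80"),
    (re.compile(r"standard_nv\d+r?"), "M60"),
]

# Per the comment above: only K80 and M60 VMs get drivers automatically.
AUTO_INSTALL_GPUS = {"K80", "M60"}


def gpu_model(vm_size: str) -> Optional[str]:
    """Return the GPU model for a VM size, or None for non-GPU sizes."""
    normalized = vm_size.strip().lower()
    for pattern, model in _SIZE_PATTERNS:
        # fullmatch keeps "standard_nc6s_v2" from matching the K80 pattern.
        if pattern.fullmatch(normalized):
            return model
    return None


def drivers_auto_installed(vm_size: str) -> bool:
    """True if this VM size would get the automatic driver install."""
    return gpu_model(vm_size) in AUTO_INSTALL_GPUS


if __name__ == "__main__":
    for size in ("Standard_NC6", "Standard_NV12", "Standard_ND6s", "Standard_D2_v2"):
        print(size, gpu_model(size), drivers_auto_installed(size))
```

A lookup table like this keeps the decision in one place, so supporting P40 or P100 later would mean extending `AUTO_INSTALL_GPUS` and adding the matching install script.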

jackfrancis (Member) left a comment

lgtm, thanks @wbuchwalter and @lachie83

jackfrancis merged commit 456001c into Azure:master on Oct 24, 2017
ghost removed the in progress label on Oct 24, 2017
7 participants