This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

nvidia-docker fails on EC2 after restarting instance #137

Closed
alantrrs opened this issue Jul 13, 2016 · 8 comments

Comments

@alantrrs

I'm following the instructions on how to Deploy on Amazon EC2. Right after creating the GPU instance I test:
nvidia-docker run --rm nvidia/cuda nvidia-smi
and everything works fine.

Then I stop the instance (docker-machine stop aws01), start it again (docker-machine start aws01), and test again:

ubuntu@aws04:~$ nvidia-docker run --rm nvidia/cuda nvidia-smi
nvidia-docker | 2016/07/13 23:40:32 Error: Could not load UVM kernel module

This time it fails. Is this expected behavior?
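
For reference, a minimal reproduction sketch, assuming the docker-machine name aws01 used above and that the test command is run on the instance over SSH:

# stop and start the instance, then retest inside it
docker-machine stop aws01
docker-machine start aws01
docker-machine ssh aws01 "nvidia-docker run --rm nvidia/cuda nvidia-smi"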

@3XX0
Member

3XX0 commented Jul 14, 2016

We ran into it as well. For some reason, after the VM restarts the kernel can change slightly and nouveau gets loaded by default (in the initramfs).

The best way I know of is to upgrade the machine:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade

Blacklist nouveau:

cat << EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

sudo update-initramfs -u

Reboot (just in case) and reinstall the drivers with DKMS:

sudo apt-get install dkms linux-headers-generic
sudo sh /tmp/NVIDIA-Linux-x86_64-361.42.run --dkms --silent
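
A quick sanity check after the reboot and reinstall, a sketch to confirm the blacklist and the DKMS build took hold:

lsmod | grep -i nouveau    # should print nothing once nouveau is blacklisted
dkms status                # should list the nvidia module built for the running kernel
nvidia-smi                 # should report the GPU
nvidia-docker run --rm nvidia/cuda nvidia-smi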

From now on it should be fine. I will probably update the doc when I know for sure what's happening.

@alantrrs
Author

Yeah, I created a couple of machines and blacklisting nouveau works. 👍 on adding it to the docs.

@pasky

pasky commented Sep 5, 2016

Is it correct that this issue is closed?

It is not documented in https://github.com/NVIDIA/nvidia-docker/wiki/Deploy-on-Amazon-EC2.

@christikaes

@3XX0 ping^
It doesn't look like this step is in the documentation; did you figure out what was happening?

Thanks for your help!

@3XX0
Member

3XX0 commented Feb 23, 2017

The documentation is right, but depending on the AMI used you might want to restart the instance after creating it. For example, some Ubuntu AMIs have been snapshotted with a running kernel different from the one that will be used at the next reboot.
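
One way to spot that situation before installing the driver (a sketch, not from the thread): compare the running kernel with the kernel images installed on disk; if they differ, reboot first so DKMS builds against the kernel that will actually be used.

uname -r              # kernel currently running
ls /boot/vmlinuz-*    # kernel images installed on disk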

@alantrrs
Author

@christinakayastha I elaborated a bit on the installation for a base AMI ami-40d28157 (Ubuntu server 16.04 LTS) here:
https://github.com/empiricalci/machines/tree/master/gpu-ec2#installing-the-nvidia-driver

@christikaes

Ahhh, gotcha, thanks a ton!

@flx42
Member

flx42 commented Feb 24, 2017

I added a docker-machine restart line to the tutorial. I advise you to install the driver through our method; you will get the latest driver and any updates that are released.
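
With that change, the flow from the tutorial roughly becomes the following sketch (the create flags are placeholders; use whatever the wiki specifies):

docker-machine create --driver amazonec2 <flags from the wiki> aws01
docker-machine restart aws01    # boot into the kernel the AMI will use from now on
# then install the driver and run the nvidia-smi test as before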
