This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

nvidia-docker fails on EC2 after restarting instance #137

Closed
alantrrs opened this issue Jul 13, 2016 · 8 comments

Comments

@alantrrs

I'm following the instructions on how to Deploy on Amazon EC2. Right after creating the GPU instance I test:
nvidia-docker run --rm nvidia/cuda nvidia-smi
and everything works fine.

Then I stop the instance (docker-machine stop aws01), start it again (docker-machine start aws01), and test again:

ubuntu@aws04:~$ nvidia-docker run --rm nvidia/cuda nvidia-smi
nvidia-docker | 2016/07/13 23:40:32 Error: Could not load UVM kernel module

This time it fails. Is this expected behavior?
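
For reference, a minimal reproduction sketch, assuming the docker-machine name aws01 used above and that the test command is run on the instance over SSH:

# stop and start the instance, then retest inside it
docker-machine stop aws01
docker-machine start aws01
docker-machine ssh aws01 "nvidia-docker run --rm nvidia/cuda nvidia-smi"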

@3XX0
Member

3XX0 commented Jul 14, 2016

We ran into it as well. For some reason, after the VM restarts the kernel can change slightly and nouveau gets loaded by default (in the initramfs).

The best way I know of is to upgrade the machine:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade

Blacklist nouveau:

cat << EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

sudo update-initramfs -u

Reboot (just in case) and reinstall the drivers with DKMS:

sudo apt-get install dkms linux-headers-generic
sudo sh /tmp/NVIDIA-Linux-x86_64-361.42.run --dkms --silent
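
A quick sanity check after the reboot and reinstall, a sketch to confirm the blacklist and the DKMS build took hold:

lsmod | grep -i nouveau    # should print nothing once nouveau is blacklisted
dkms status                # should list the nvidia module built for the running kernel
nvidia-smi                 # should report the GPU
nvidia-docker run --rm nvidia/cuda nvidia-smi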

From now on it should be fine. I will probably update the doc when I know for sure what's happening.

@alantrrs
Author

Yeah, I created a couple of machines and blacklisting nouveau works. 👍 on adding it to the docs.

@pasky

pasky commented Sep 5, 2016

Is it correct that this issue is closed?

It is not documented in https://github.com/NVIDIA/nvidia-docker/wiki/Deploy-on-Amazon-EC2.

@christikaes

@3XX0 ping^
It doesn't look like this step is in the documentation; did you figure out what was happening?

Thanks for your help!

@3XX0
Member

3XX0 commented Feb 23, 2017

The documentation is right, but depending on the AMI used you might want to restart the instance after creating it. For example, some Ubuntu AMIs have been snapshotted with a running kernel different from the one that will be used at the next reboot.
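
One way to spot that situation before installing the driver (a sketch, not from the thread): compare the running kernel with the kernel images installed on disk; if they differ, reboot first so DKMS builds against the kernel that will actually be used.

uname -r              # kernel currently running
ls /boot/vmlinuz-*    # kernel images installed on disk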

@alantrrs
Author

@christinakayastha I elaborated a bit on the installation for a base AMI ami-40d28157 (Ubuntu server 16.04 LTS) here:
https://github.com/empiricalci/machines/tree/master/gpu-ec2#installing-the-nvidia-driver

@christikaes

Ahhh, gotcha, thanks a ton!

@flx42
Member

flx42 commented Feb 24, 2017

I added a docker-machine restart line to the tutorial. I advise you to install the driver through our method; you will get the latest driver and any updates that are released.
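
With that change, the flow from the tutorial roughly becomes the following sketch (the create flags are placeholders; use whatever the wiki specifies):

docker-machine create --driver amazonec2 <flags from the wiki> aws01
docker-machine restart aws01    # boot into the kernel the AMI will use from now on
# then install the driver and run the nvidia-smi test as before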
