This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

Pool creation fails at starttask due to nvidia driver installation #348

Closed
sayak1711 opened this issue May 5, 2020 · 1 comment

Comments


sayak1711 commented May 5, 2020

Problem Description

This is a recent issue that I had not encountered before. When I try to create a pool with vm_size STANDARD_NC6, pool creation fails with the error "starttask failed". The error from stderr.txt is included under Additional Logs below.

Batch Shipyard Version

Installed using git clone and install script

Steps to Reproduce

Create a pool of size STANDARD_NC6 using the pool.yaml shown under Redacted Configuration below.

Expected Results

Pool gets created successfully

Actual Results

Pool creation fails at the start task. stderr.txt contains the error shown under Additional Logs below.

Redacted Configuration

pool_specification:
  id: httptestpool
  vm_configuration:
    platform_image:
      offer: UbuntuServer
      publisher: Canonical
      sku: 18.04-LTS
  vm_count:
    dedicated: 4
    low_priority: 0
  vm_size: STANDARD_NC6
  autoscale:
    evaluation_interval: 00:10:00
    scenario:
      name: pending_tasks
      maximum_vm_count:
        dedicated: 8
        low_priority: 0
      node_deallocation_option: taskcompletion
  gpu:
    nvidia_driver:
      source: http://us.download.nvidia.com/tesla/440.64.00/nvidia-driver-local-repo-ubuntu1804-440.64.00_1.0-1_amd64.deb

I had not originally included the gpu section in the pool.yaml configuration above, and I was getting the same error without it.

Additional Logs

Warning: apt-key output should not be parsed (stdout is not a terminal)
Synchronizing state of docker.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable docker
WARNING: API is accessible on http://127.0.0.1:2375 without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
WARNING: No swap limit support
rmmod: ERROR: Module nouveau is not currently loaded

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.


WARNING: Unable to find a suitable destination to install 32-bit compatibility libraries. Your system may not be set up for 32-bit compatibility. 32-bit compatibility files will not be installed; if you wish to install them, re-run the installation and set a valid directory with the --compat32-libdir option.


ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 418.87.01 -k 5.3.0-1020-azure`: 
Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
'make' -j6 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.3.0-1020-azure IGNORE_CC_MISMATCH='' modules.....(bad exit status: 2)
ERROR (dkms apport): binary package for nvidia: 418.87.01 not found
Error! Bad return status for module build on kernel: 5.3.0-1020-azure (x86_64)
Consult /var/lib/dkms/nvidia/418.87.01/build/make.log for more information.


ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for more information.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Some more important logs from startup directory:

Warning: apt-key output should not be parsed (stdout is not a terminal)
Synchronizing state of docker.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable docker
WARNING: API is accessible on http://127.0.0.1:2375 without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
WARNING: No swap limit support
rmmod: ERROR: Module nouveau is not currently loaded
./nvidia-driver_cc37.run: line 1: syntax error near unexpected token `newline'
./nvidia-driver_cc37.run: line 1: `!<arch>'

Additional Comments

Collaborator

alfpark commented Jun 5, 2020

The driver source cannot be a .deb file. It must be the executable .run installer.
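
For anyone hitting the same thing: the `!<arch>` in the second log is the global header of an ar archive, which is the container format of a .deb package. The start task saved the download as nvidia-driver_cc37.run and tried to run it as a script, so the shell choked on the archive header. Below is a minimal sketch for checking a downloaded driver file before pointing pool.yaml at its URL. The function name `classify_driver_file` is hypothetical and not part of Batch Shipyard, and the `#!` check assumes the NVIDIA .run installer is a self-extracting shell script that begins with a shebang line.

```python
# Hypothetical helper, not part of Batch Shipyard: inspect the first bytes of
# a downloaded driver file to tell a .deb package from a runnable installer.

AR_MAGIC = b"!<arch>\n"  # global header of an ar archive; .deb files are ar archives


def classify_driver_file(path):
    """Return 'deb-archive', 'script', or 'unknown' based on the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(len(AR_MAGIC))
    if head == AR_MAGIC:
        # Same bytes the shell reported as a syntax error in the log above.
        return "deb-archive"
    if head.startswith(b"#!"):
        # Assumption: the .run installer starts with a shebang line.
        return "script"
    return "unknown"
```

Calling `classify_driver_file` on the file downloaded from the `source` URL in the configuration above would be expected to report `deb-archive`, which means that URL is not usable as the driver source.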
