No GPU devices found #65

Closed

matthewchung74 opened this issue Jan 10, 2021 · 6 comments

Comments

matthewchung74 commented Jan 10, 2021

Hi, when I run the following commands on a p2.xlarge Deep Learning AMI in AWS:

docker build --tag stylegan2ada:latest .

docker run --gpus all -it --rm -v `pwd`:/scratch --user $(id -u):$(id -g) stylegan2ada:latest bash -c \
    "(cd /scratch && DNNLIB_CACHE_DIR=/scratch/.cache python3 generate.py --trunc=1 --seeds=85,265,297,849 \
    --outdir=out --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/metfaces.pkl)"

I get this error:

NVIDIA Release 20.10-tf1 (build 16775850)
TensorFlow Version 1.15.4

Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2020 The TensorFlow Authors.  All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: Detected NVIDIA Tesla K80 GPU, which is not supported by this container
ERROR: No supported GPU(s) detected to run this container

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TensorFlow.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

2021-01-10 16:18:03.840894: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2021-01-10 16:18:08.234607: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300075000 Hz
2021-01-10 16:18:08.236149: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x50ff110 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-01-10 16:18:08.236185: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-01-10 16:18:08.241208: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-10 16:18:08.398652: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1086] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-10 16:18:08.399598: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5174bd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-01-10 16:18:08.399631: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2021-01-10 16:18:08.399902: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1086] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-10 16:18:08.400718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1665] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2021-01-10 16:18:08.400787: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-10 16:18:08.437007: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-10 16:18:08.461765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-10 16:18:08.469196: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-10 16:18:08.507750: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2021-01-10 16:18:08.516625: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-10 16:18:08.516919: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-10 16:18:08.517138: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1086] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-10 16:18:08.518088: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1086] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-10 16:18:08.518890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2021-01-10 16:18:08.518934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-10 16:18:08.518951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212]      0 
2021-01-10 16:18:08.518973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0:   N 
Loading networks from "https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/metfaces.pkl"...
Downloading https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/metfaces.pkl ... done
Setting up TensorFlow plugin "fused_bias_act.cu": 2021-01-10 16:18:33.824257: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1086] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-10 16:18:33.825117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1665] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2021-01-10 16:18:33.825170: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-10 16:18:33.825214: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-10 16:18:33.825252: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-10 16:18:33.825288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-10 16:18:33.825324: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2021-01-10 16:18:33.825354: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-10 16:18:33.825392: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-10 16:18:33.825525: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1086] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-10 16:18:33.826419: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1086] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-10 16:18:33.827220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2021-01-10 16:18:33.827261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-10 16:18:33.827278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212]      0 
2021-01-10 16:18:33.827295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0:   N 
Failed!
Traceback (most recent call last):
  File "generate.py", line 121, in <module>
    main()
  File "generate.py", line 116, in main
    generate_images(**vars(args))
  File "generate.py", line 52, in generate_images
    noise_vars = [var for name, var in Gs.components.synthesis.vars.items() if name.startswith('noise')]
  File "/scratch/dnnlib/tflib/network.py", line 293, in vars
    return copy.copy(self._get_vars())
  File "/scratch/dnnlib/tflib/network.py", line 297, in _get_vars
    self._vars = OrderedDict(self._get_own_vars())
  File "/scratch/dnnlib/tflib/network.py", line 286, in _get_own_vars
    self._init_graph()
  File "/scratch/dnnlib/tflib/network.py", line 151, in _init_graph
    out_expr = self._build_func(*self._input_templates, **build_kwargs)
  File "<string>", line 431, in G_synthesis
  File "<string>", line 384, in layer
  File "<string>", line 97, in modulated_conv2d_layer
  File "<string>", line 42, in apply_bias_act
  File "/scratch/dnnlib/tflib/ops/fused_bias_act.py", line 72, in fused_bias_act
    return impl_dict[impl](x=x, b=b, axis=axis, act=act, alpha=alpha, gain=gain, clamp=clamp)
  File "/scratch/dnnlib/tflib/ops/fused_bias_act.py", line 132, in _fused_bias_act_cuda
    cuda_op = _get_plugin().fused_bias_act
  File "/scratch/dnnlib/tflib/ops/fused_bias_act.py", line 18, in _get_plugin
    return custom_ops.get_plugin(os.path.splitext(__file__)[0] + '.cu')
  File "/scratch/dnnlib/tflib/custom_ops.py", line 139, in get_plugin
    compile_opts += f' --gpu-architecture={_get_cuda_gpu_arch_string()}'
  File "/scratch/dnnlib/tflib/custom_ops.py", line 60, in _get_cuda_gpu_arch_string
    raise RuntimeError('No GPU devices found')
RuntimeError: No GPU devices found

This is unexpected, since when I run a bash shell in the Docker container

docker run --gpus all -it --rm -v `pwd`:/scratch --user $(id -u):$(id -g) stylegan2ada:latest bash

I get

I have no name!@c6fb7621777c:/workspace$ nvidia-smi
Sun Jan 10 16:22:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I have no name!@c6fb7621777c:/workspace$ 

and when I run nvcc in the Docker container:

I have no name!@c6fb7621777c:/workspace$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0

Any suggestions?
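Side note: the key line in the log above is "Ignoring visible gpu device ... with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2." nvidia-smi and nvcc only confirm the driver and CUDA toolkit versions; they don't say whether TensorFlow will accept the device. A quick check of what TensorFlow itself registers, run from the same shell inside the container and assuming the TF 1.x device_lib API shipped in this image, is:

python3 -c "from tensorflow.python.client import device_lib; print([d.device_type for d in device_lib.list_local_devices()])"

If the K80 is being ignored, no plain GPU entry should appear in that list, only CPU and possibly XLA devices.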

@johndpope

20.10 is the October release; there's a 20.12 December release base image. I made a PR to bump this. Might help.
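For reference, the bump would look something like this, assuming the repository's Dockerfile accepts a BASE_IMAGE build arg (as the README build command later in this thread suggests) and that the NGC tag follows the usual <release>-tf1-py3 naming:

docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/tensorflow:20.12-tf1-py3 --tag stylegan2ada:latest .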

@matthewchung74 (Author)

Thank you @johndpope. Unfortunately, it did not work. I'm assuming you've tried it and it works for you?

@johndpope

It solves the problem with the latest 3090 card.


matthewchung74 commented Jan 10, 2021

Got it.

I just realized that when I ran nvidia-smi, the driver version shown is 450.80.02. The README has a different build command for that:

docker build --build-arg BASE_IMAGE=tensorflow/tensorflow:1.14.0-gpu-py3 --tag stylegan2ada:latest .
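For anyone else hitting this: the deciding factor in the log above is compute capability rather than the driver. The K80 is compute capability 3.7, below the 5.2 minimum the NGC 20.10 image enforces (and presumably 20.12 as well, which would explain why the bump alone didn't help), so the tensorflow/tensorflow:1.14.0-gpu-py3 base is the one that applies here. A quick way to confirm the installed driver and GPU model, if needed, is:

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader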

@matthewchung74 (Author)

@johndpope I'm curious, do you have a feel for the performance of your 3090 vs., say, a V100, or the benchmarks they have in the README?

@johndpope

Currently I'm not doing any training - just getting around the problem of not enough VRAM when playing around with various models in the wild.
