This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Problem: Docker image uses CUDA 7.5, and host system driver 367.44 requires CUDA 8 #237

Closed
thommiano opened this issue Nov 3, 2016 · 8 comments

Comments

@thommiano

thommiano commented Nov 3, 2016

I'm trying to run an image that uses CUDA 7.5, but my host driver is 367.44 for a GTX 1070. I'm getting the error Value 'sm_61' is not defined for option 'gpu-architecture', which suggests that my host driver is incompatible with the CUDA version in the image. I was using nvidia-docker rather than plain docker to run this image, but it still returned the error.

I thought nvidia-docker was supposed to solve this problem? Am I doing something wrong (e.g., do I need a driver compatible with CUDA 7.5 in the image), or is this simply not possible, and I will need a CUDA 7.5-compatible driver on my host?

Specifically, I'm trying to run the alexjc/neural-doodle:gpu image. https://github.com/alexjc/neural-doodle/issues/96
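For reference, the failure can be reproduced with just the published image; the error line below is paraphrased from the report above, so treat this as a sketch rather than verbatim console output:

    docker pull alexjc/neural-doodle:gpu
    nvidia-docker run alexjc/neural-doodle:gpu
    # fails while Theano compiles kernels inside the container:
    # nvcc fatal : Value 'sm_61' is not defined for option 'gpu-architecture'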

@3XX0
Member

3XX0 commented Nov 3, 2016

Pascal support is only available starting with CUDA 8.0.
Using the (default) CUDA 8.0 image should fix your issue.

@thommiano
Author

Ok. So does that mean I won't be able to run an image that uses anything below CUDA 8.0?

@flx42
Member

flx42 commented Nov 3, 2016

Not necessarily, it depends on how the project was compiled and which libraries it's using.

In CUDA we have an assembly language called PTX. If your app was compiled with CUDA 7.5 but with all of its CUDA code bundled as PTX, then at runtime the code will be JIT-compiled by the NVIDIA driver for your new architecture. If your code was compiled to binary code only (e.g. sm_52), then it's not forward compatible.

cuDNN, however, doesn't ship PTX for most of its algorithms, so in that case it won't work.
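To illustrate the distinction (an aside, not part of the original comment): nvcc's -gencode flag controls whether the binary embeds SASS for a specific GPU, PTX that the driver can JIT for newer GPUs, or both. kernel.cu here is a placeholder source file.

    # SASS for sm_52 only: will not run on a Pascal (sm_61) GPU
    nvcc -gencode arch=compute_52,code=sm_52 kernel.cu -o app

    # SASS for sm_52 plus PTX for compute_52: the driver can JIT the PTX
    # for sm_61, so the binary stays forward compatible
    nvcc -gencode arch=compute_52,code=sm_52 \
         -gencode arch=compute_52,code=compute_52 kernel.cu -o app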

Anyway, I think your problem is just that you aren't using the nvidia-docker wrapper when launching the GPU app, that's all.

@flx42
Member

flx42 commented Nov 3, 2016

Sorry, I read too fast. You did try with nvidia-docker, apparently. The problem is that Theano does its own JIT compilation at runtime: it detects that you have an sm_61 GPU, so it calls nvcc with sm_61.
This problem is specific to Theano. With other ML frameworks, you can specify at build time which architectures to compile against.
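The underlying failure is easy to reproduce without Theano, since CUDA 7.5's nvcc simply does not know the sm_61 target (a sketch; t.cu is a throwaway file):

    # inside the CUDA 7.5 container
    nvcc --version                  # reports release 7.5
    echo '__global__ void k() {}' > t.cu
    nvcc -arch=sm_61 -c t.cu        # nvcc fatal : Value 'sm_61' is not defined for option 'gpu-architecture'
    # the same command succeeds with CUDA 8.0, which added Pascal (sm_60/sm_61) support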

@thommiano
Author

thommiano commented Nov 3, 2016

Ok, thanks for the feedback. Do you have any suggestions for the best way to move forward? From what I've read on similar problems (and as @3XX0 suggests) it seems like updating the image to CUDA 8 might work.

I haven't done this before... would I just edit the Docker files (update to CUDA 8 in install-cuda-drivers-ubuntu-14.04.sh and to cuDNN 5 in docker-gpu.df) and create a new image?

@flx42
Member

flx42 commented Nov 3, 2016

Yes, try to modify the first line to FROM nvidia/cuda:8.0-cudnn5-devel
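Assuming docker-gpu.df is the GPU Dockerfile mentioned above, the edit and rebuild could look roughly like this (the image tag is arbitrary):

    # point the GPU Dockerfile at the CUDA 8.0 + cuDNN 5 base image
    sed -i '1s|.*|FROM nvidia/cuda:8.0-cudnn5-devel|' docker-gpu.df

    # rebuild and run through the NVIDIA wrapper
    docker build -t neural-doodle:gpu-cuda8 -f docker-gpu.df .
    nvidia-docker run neural-doodle:gpu-cuda8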

@thommiano
Author

thommiano commented Nov 3, 2016

I cloned the source via git clone from GitHub, rather than doing it through Docker, because I couldn't quite figure out whether there was a way to do that. I made the following changes:

I then pushed it up to my remote, built the Docker image, and pushed that to Docker Hub.

Now when I run nvidia-docker run socraticdatum/neural-doodle I get the following:

Neural Doodle for semantic style transfer.
  - Using device `gpu` for processing the images.
Traceback (most recent call last):
  File "doodle.py", line 657, in <module>
    generator = NeuralGenerator()
  File "doodle.py", line 234, in __init__
    self.style_img_original, self.style_map_original = self.load_images('style', args.style)
  File "doodle.py", line 288, in load_images
    basename, _ = os.path.splitext(filename)
  File "/usr/lib/python3.4/posixpath.py", line 122, in splitext
    return genericpath._splitext(p, sep, None, extsep)
  File "/usr/lib/python3.4/genericpath.py", line 118, in _splitext
    sepIndex = p.rfind(sep)
AttributeError: 'NoneType' object has no attribute 'rfind'

I think this is at least an improvement, because it's now saying Using device `gpu` for processing the images, but I'm not quite sure how to fix the rfind problem. Any ideas on this?

Seems like a Python issue now, so perhaps more appropriate to pose this question somewhere else.
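For what it's worth (not part of the original thread): the traceback shows load_images('style', args.style) receiving None, i.e. os.path.splitext(None), which suggests the container was started without the script's image arguments. If doodle.py follows the flags implied by the traceback and the project README, an invocation might look like the sketch below; the mount point and file names are placeholders.

    # mount a host directory with the style image and pass it to doodle.py
    nvidia-docker run -v "$(pwd)/samples:/nd/samples" socraticdatum/neural-doodle \
        --style /nd/samples/style.png --output /nd/samples/output.png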

@flx42
Member

flx42 commented Nov 7, 2016

Yeah, it's clearly a Python issue now :) Closing this.

@flx42 flx42 closed this as completed Nov 7, 2016