gpu-burn compiled using nvidia-toolkit 12.3 fails, had to recompile using nvidia-cuda-toolkit 11.5 #94

Open
bladernr opened this issue Dec 4, 2023 · 7 comments

bladernr commented Dec 4, 2023

I installed gpu-burn on a machine and, as part of the setup, installed cuda-toolkit, which got me this:

$ apt-cache policy cuda-toolkit
cuda-toolkit:
  Installed: 12.3.1-1
  Candidate: 12.3.1-1
  Version table:
 *** 12.3.1-1 600
        600 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

but when I ran gpu-burn to sniff-test that everything was ready, it errored out:

$ ./gpu_burn 
Run length not specified in the command line. Using compare file: compare.ptx
Burning for 10 seconds.
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-098f2285-d2d8-d51c-5e7e-bf3724b250a3)
Initialized device 0 with 40339 MB of memory (39900 MB available, using 35910 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 138 iterations
Initialized device 1 with 40339 MB of memory (39900 MB available, using 35910 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 138 iterations
Couldn't init a GPU test: Error in load module (gpu_burn-drv.cpp:238): the provided PTX was compiled with an unsupported toolchain.
Couldn't init a GPU test: Error in load module (gpu_burn-drv.cpp:238): the provided PTX was compiled with an unsupported toolchain.
10.0%  proc'd: -1 (0 Gflop/s) - -1 (0 Gflop/s)   errors: 0  (DIED!)- 0  (DIED!)  temps: 29 C - 31 C rror 0read[1] error 0

No clients are alive!  Aborting

I then installed nvidia-cuda-toolkit (packaged in the Ubuntu repos):

$ apt-cache policy nvidia-cuda-toolkit
nvidia-cuda-toolkit:
  Installed: 11.5.1-1ubuntu1
  Candidate: 11.5.1-1ubuntu1
  Version table:
 *** 11.5.1-1ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages

and recompiled gpu-burn:

$ sudo make clean
rm -f *.ptx *.o gpu_burn
$ sudo make
g++ -O3 -Wno-unused-result -I/usr/include -std=c++11 -c gpu_burn-drv.cpp
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin::." /usr/bin/nvcc -I/usr/include -arch=compute_50 -ptx compare.cu -o compare.ptx
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
g++ -o gpu_burn gpu_burn-drv.o -O3 -lcuda -L/usr/lib64 -L/usr/lib64/stubs -L/usr/lib -L/usr/lib/stubs -Wl,-rpath=/usr/lib64 -Wl,-rpath=/usr/lib -lcublas -lcudart

and that finally succeeded:

$ ./gpu_burn 
Run length not specified in the command line. Using compare file: compare.ptx
Burning for 10 seconds.
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-098f2285-d2d8-d51c-5e7e-bf3724b250a3)
Initialized device 0 with 40339 MB of memory (39592 MB available, using 35632 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 137 iterations
Initialized device 1 with 40339 MB of memory (39592 MB available, using 35632 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 137 iterations
100.0%  proc'd: 0 (0 Gflop/s) - 137 (15933 Gflop/s)   errors: 0 - 0   temps: 44 C - 46 C 
	Summary at:   Mon Dec  4 18:12:56 UTC 2023

100.0%  proc'd: 137 (15757 Gflop/s) - 137 (15933 Gflop/s)   errors: 0 - 0   temps: 44 C - 46 C 
Killing processes with SIGTERM (soft kill)
Freed memory for dev 1
Uninitted cublas
Freed memory for dev 0
Uninitted cublas
done

@bladernr
Author

bladernr commented Dec 4, 2023

So the only difference I can see (maybe I'm missing something?) is the toolkit: the build that compiled but failed at run time used cuda-toolkit 12.3, and the build that worked used nvidia-cuda-toolkit 11.5. The version hosted in the Ubuntu repos lags behind the upstream NVIDIA repos, but all the other CUDA packages are 12.3. Any idea where to start here? I'm happy to do whatever is necessary to figure this out and resolve it.

@bladernr
Author

Have you had a chance to look at this, @wilicc? The version I used won't hang around forever, and NVIDIA keeps moving the CUDA toolkit onwards. Compilation seems to pass just fine, but execution fails with the errors mentioned above. If I'm doing something wrong, I'm happy to help by providing log info or whatever else is needed.

@wilicc
Owner

wilicc commented Dec 13, 2023

I just tried this with 12.3 and it seems to be working fine. Typically the problem I have when moving to a newer version is that older compute capabilities are deprecated, which is not the error you are getting.
To me it looks like your CUDA toolchain does not match the NVIDIA driver you have installed, or there is some other mismatch between the nvcc compiler version and the runtime.
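
One way to check for the mismatch described above is to compare the highest CUDA version the driver reports with the release of the `nvcc` that built `compare.ptx`. The following is only a minimal sketch. It assumes `nvidia-smi` prints a `CUDA Version: X.Y` field in its header and `nvcc --version` prints `release X.Y`, and it assumes the failure mode is a toolkit newer than the driver supports, which this thread does not confirm. All helper names are hypothetical and not part of gpu-burn.

```shell
# Hypothetical helpers for illustration; none of these ship with gpu-burn.

# Read nvidia-smi output on stdin and print the "X.Y" after "CUDA Version:".
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Read `nvcc --version` output on stdin and print the "X.Y" after "release".
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 <= version $2 (major compared first, then minor).
# The body runs in a subshell so its variables do not leak.
version_le() (
  a_major=${1%%.*}; a_minor=${1#*.}
  b_major=${2%%.*}; b_minor=${2#*.}
  if [ "$a_major" -ne "$b_major" ]; then
    [ "$a_major" -lt "$b_major" ]
  else
    [ "$a_minor" -le "$b_minor" ]
  fi
)

# Compare the toolkit on PATH with what the installed driver reports.
check_toolchain() {
  drv=$(nvidia-smi | driver_cuda_version)
  tk=$(nvcc --version | nvcc_release)
  if version_le "$tk" "$drv"; then
    echo "ok: nvcc $tk <= driver CUDA $drv"
  else
    echo "mismatch: nvcc $tk is newer than driver CUDA $drv"
  fi
}
```

Running `check_toolchain` on the affected machine, in the same shell used for `make`, should show whether the `nvcc` being picked up is ahead of the driver.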

@bladernr
Author

bladernr commented Dec 13, 2023 via email

@tlh24

tlh24 commented Dec 28, 2023

I had the same problem on Debian testing; upgrading nvidia-cuda-toolkit to match the driver version fixed it. (I had installed CUDA from the run files some time ago, which appears to have been the culprit.)
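
A leftover install like the one described above can be spotted by listing every `nvcc` on the machine instead of only the first one on `PATH`. This is a minimal sketch with a hypothetical helper name, `list_nvcc`. Which extra directories to search is an assumption; `/usr/local/cuda/bin` is used in the usage note because it is the usual location for run-file installs.

```shell
# Hypothetical helper: print every executable named nvcc found in a
# colon-separated directory list (defaults to $PATH).
# The body runs in a subshell so the IFS and globbing changes stay local.
list_nvcc() (
  dirs=${1:-$PATH}
  set -f        # do not expand wildcards that may appear in directory names
  IFS=:
  for d in $dirs; do
    [ -n "$d" ] || continue
    if [ -x "$d/nvcc" ]; then
      printf '%s\n' "$d/nvcc"
    fi
  done
)
```

For example, `list_nvcc "$PATH:/usr/local/cuda/bin"` prints one path per compiler found; running each with `--version` shows whether an older or newer toolkit is shadowing the packaged one.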

@alexmyczko
Contributor

And now gpu-burn is officially packaged in Debian (well, in the non-free part of it)...

@L1pp

L1pp commented Feb 23, 2024

I also encountered this problem. In my testing, simply recompiling fixed it. I guess your CUDA or driver has been updated, so you need to recompile gpu-burn.
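
To tell whether an existing `compare.ptx` is stale before rebuilding, one option is to read the `.version` and `.target` directives near the top of the file, which record the PTX ISA version and target the compiler emitted. This is a minimal sketch; the helper names are hypothetical, and it assumes the file follows the usual `nvcc -ptx` layout with one directive per line.

```shell
# Hypothetical helpers for inspecting a PTX file; not part of gpu-burn.

# Print the PTX ISA version from the first ".version X.Y" line of file $1.
ptx_isa_version() {
  sed -n 's/^[[:space:]]*\.version[[:space:]]*\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Print the target from the first ".target NAME" line of file $1.
ptx_target() {
  sed -n 's/^[[:space:]]*\.target[[:space:]]*\([A-Za-z0-9_]*\).*/\1/p' "$1" | head -n 1
}
```

For example, `ptx_isa_version compare.ptx` run in the gpu-burn build directory prints the ISA version the file was generated with. As far as I know, CUDA 12.3 emits PTX ISA 8.3 and CUDA 11.5 emits 7.5, so a value newer than the installed driver understands would be consistent with the "unsupported toolchain" error reported above.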
