gpu-burn compiled using nvidia-toolkit 12.3 fails, had to recompile using nvidia-cuda-toolkit 11.5 #94

bladernr · 2023-12-04T18:26:23Z

I installed gpu-burn on a machine and, as part of the setup, installed cuda-toolkit, which got me this:

$ apt-cache policy cuda-toolkit
cuda-toolkit:
  Installed: 12.3.1-1
  Candidate: 12.3.1-1
  Version table:
 *** 12.3.1-1 600
        600 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

but when I ran gpu-burn as a sniff test to check that everything was ready, it errored out:

$ ./gpu_burn 
Run length not specified in the command line. Using compare file: compare.ptx
Burning for 10 seconds.
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-098f2285-d2d8-d51c-5e7e-bf3724b250a3)
Initialized device 0 with 40339 MB of memory (39900 MB available, using 35910 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 138 iterations
Initialized device 1 with 40339 MB of memory (39900 MB available, using 35910 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 138 iterations
Couldn't init a GPU test: Error in load module (gpu_burn-drv.cpp:238): the provided PTX was compiled with an unsupported toolchain.
Couldn't init a GPU test: Error in load module (gpu_burn-drv.cpp:238): the provided PTX was compiled with an unsupported toolchain.
10.0%  proc'd: -1 (0 Gflop/s) - -1 (0 Gflop/s)   errors: 0  (DIED!)- 0  (DIED!)  temps: 29 C - 31 C rror 0read[1] error 0

No clients are alive!  Aborting

I then installed nvidia-cuda-toolkit (packaged in the Ubuntu repos):

$ apt-cache policy nvidia-cuda-toolkit
nvidia-cuda-toolkit:
  Installed: 11.5.1-1ubuntu1
  Candidate: 11.5.1-1ubuntu1
  Version table:
 *** 11.5.1-1ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages

and recompiled gpu-burn:

$ sudo make clean
rm -f *.ptx *.o gpu_burn
$ sudo make
g++ -O3 -Wno-unused-result -I/usr/include -std=c++11 -c gpu_burn-drv.cpp
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin::." /usr/bin/nvcc -I/usr/include -arch=compute_50 -ptx compare.cu -o compare.ptx
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
g++ -o gpu_burn gpu_burn-drv.o -O3 -lcuda -L/usr/lib64 -L/usr/lib64/stubs -L/usr/lib -L/usr/lib/stubs -Wl,-rpath=/usr/lib64 -Wl,-rpath=/usr/lib -lcublas -lcudart

and that finally succeeded:

$ ./gpu_burn 
Run length not specified in the command line. Using compare file: compare.ptx
Burning for 10 seconds.
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-098f2285-d2d8-d51c-5e7e-bf3724b250a3)
Initialized device 0 with 40339 MB of memory (39592 MB available, using 35632 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 137 iterations
Initialized device 1 with 40339 MB of memory (39592 MB available, using 35632 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 137 iterations
100.0%  proc'd: 0 (0 Gflop/s) - 137 (15933 Gflop/s)   errors: 0 - 0   temps: 44 C - 46 C 
	Summary at:   Mon Dec  4 18:12:56 UTC 2023

100.0%  proc'd: 137 (15757 Gflop/s) - 137 (15933 Gflop/s)   errors: 0 - 0   temps: 44 C - 46 C 
Killing processes with SIGTERM (soft kill)
Freed memory for dev 1
Uninitted cublas
Freed memory for dev 0
Uninitted cublas
done


bladernr · 2023-12-04T18:33:43Z

So basically the difference seems to be (for now, maybe I'm missing something?) that the first time, when it compiled but failed to run, it was built with cuda-toolkit 12.3, and the second time with nvidia-cuda-toolkit 11.5. The version hosted in the Ubuntu repos lags behind the upstream NVIDIA repos, but all the other CUDA packages are 12.3. Any idea where to start here? I'm happy to do whatever is necessary to figure this out and resolve it.

bladernr · 2023-12-12T17:16:25Z

Have you had a chance to look at this, @wilicc? The version I used won't hang around forever, and NVIDIA keeps moving the CUDA toolkit onward. The compilation seems to pass just fine, but the execution has the errors mentioned. If I'm doing something wrong, I'm happy to help by providing log info or whatever.

wilicc · 2023-12-13T08:55:25Z

I just tried this with 12.3 and it seems to be working fine. Typically the problem I have when moving to a newer version is that older compute capabilities are deprecated, which is not the error you are getting.
To me it looks like your CUDA toolchain does not match the NVIDIA driver you have installed, or there is some other mismatch between the nvcc compiler version and the runtime.
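
One way to check for the kind of mismatch suggested above is to compare the CUDA release that nvcc reports with the highest CUDA version the installed driver advertises. The following is only a rough sketch and not part of gpu-burn: the function names are made up for illustration, and it assumes that `nvcc --version` prints a `release X.Y,` line and that the `nvidia-smi` header contains `CUDA Version: X.Y`. A toolkit newer than the driver's CUDA version is one common cause of the "unsupported toolchain" PTX error, though not necessarily the cause in this report.

```shell
#!/usr/bin/env bash
# Sketch: is the installed CUDA toolkit newer than what the driver supports?
# All function names here are illustrative; none of this ships with gpu-burn.

# Succeeds when version $1 <= version $2 (relies on GNU `sort -V`).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Reads `nvcc --version` text on stdin and prints the release, e.g. "12.3".
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Reads `nvidia-smi` output on stdin and prints the driver's CUDA version.
parse_driver_cuda() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Prints a one-line verdict for toolkit version $1 and driver CUDA version $2.
check_toolchain() {
    toolkit=$1
    driver=$2
    if [ -z "$toolkit" ] || [ -z "$driver" ]; then
        echo "unknown: could not determine both versions"
    elif version_le "$toolkit" "$driver"; then
        echo "ok: toolkit $toolkit <= driver CUDA $driver"
    else
        echo "mismatch: toolkit $toolkit is newer than driver CUDA $driver"
    fi
}

# Only query the real tools when both are installed.
if command -v nvcc >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    check_toolchain "$(nvcc --version | parse_nvcc_release)" \
                    "$(nvidia-smi | parse_driver_cuda)"
fi
```

If this reports a mismatch, either updating the driver or building with an older toolkit (as was done with 11.5 above) should bring the two back in line.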

bladernr · 2023-12-13T15:16:55Z

Hrmmmm ok, I'll look at that. The toolchain should be fine, as the whole thing was installed initially from the CUDA repos. I wasn't aware that there was also an issue if the driver itself wasn't compiled with the same toolchain. I'll refer back to the person who maintains the driver (and I probably also need to retry this using the upstream driver to confirm that the problem is there). Thanks for that pointer.


tlh24 · 2023-12-28T17:21:50Z

I had the same problem on Debian testing; upgrading nvidia-cuda-toolkit to match the driver version fixed the problem. (I installed cuda from the run files some time ago, which appears to have been the culprit).
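
When a runfile install and a packaged toolkit coexist, it can help to see every nvcc that is reachable and which release each one reports. Below is a small sketch under stated assumptions: `list_on_path` is a hypothetical helper, and it only searches directories on `PATH`, so a toolkit installed somewhere that is not on `PATH` will not show up.

```shell
#!/usr/bin/env bash
# Sketch: list every executable named $1 found on PATH, in lookup order.
# `list_on_path` is an illustrative helper, not an existing tool.
list_on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # Skip empty PATH entries (e.g. from "::").
        [ -n "$dir" ] || continue
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

# Print each nvcc found together with the release it reports.
list_on_path nvcc | while IFS= read -r compiler; do
    printf '%s: ' "$compiler"
    "$compiler" --version | sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
done
```

More than one line of output means several toolkits are installed side by side, and the build may not be using the one you expect.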

alexmyczko · 2024-02-05T18:25:33Z

And now gpu-burn is officially packaged in Debian, well, in the non-free section of it...

L1pp · 2024-02-23T04:59:00Z

I also encountered this problem. In my testing, simply recompiling fixed it. I guess your CUDA toolkit or driver had been updated, so gpu-burn needs to be recompiled.
