Description
The PTX backend sometimes crashes or produces other unrecoverable errors (such as "CUDA Exception: invalid argument"), apparently for larger input programs.
In the repository https://github.com/tomsmeding/acc-gpu-crash (commit as of the time of writing: https://github.com/tomsmeding/acc-gpu-crash/tree/d3df383c685f19c5bba8d8a1959ee048c51b0361) there are two programs, in the modules N1 (smaller) and N2 (larger), that produce various kinds of crashes on different machines. N2 crashes reproducibly, while N1 runs fine on some machines and crashes on others. In the linked commit, Main.hs runs the program from N2. Both N1 and N2 run fine in the interpreter.
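For orientation, here is a minimal sketch of that setup, running the same computation through both the interpreter and the PTX backend. The program below is a hypothetical stand-in, not the actual N1/N2 code from the repository; any sufficiently large pipeline plays the same role.

-- Hedged sketch of the reproducer's shape; 'program' is a hypothetical
-- stand-in for the real N1/N2 code in the repository.
import Data.Array.Accelerate              as A
import Data.Array.Accelerate.Interpreter  as I
import Data.Array.Accelerate.LLVM.PTX     as PTX
import Prelude                            as P

program :: Acc (Vector Double)
program = A.scanl (+) 0 (A.map (* 2) (A.use (A.fromList (Z :. 1000000) [0 ..])))

main :: IO ()
main = do
  print (I.run program)    -- interpreter: runs fine
  print (PTX.run program)  -- PTX backend: crashes intermittently for N1/N2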
The repository includes a script test.sh that builds the program using stack and runs it under cuda-memcheck until it returns with a non-zero exit code. (Environment variables for Jizo have been included.)
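A rough Haskell rendering of what test.sh does (the actual script is shell; the executable path below is an assumption for illustration, as the real binary is whatever stack builds):

-- Hedged sketch: run the built binary under cuda-memcheck in a loop
-- until it exits with a non-zero code, as test.sh does.
-- The path "./acc-gpu-crash" is assumed for illustration.
import System.Exit    (ExitCode (..))
import System.Process (spawnProcess, waitForProcess)

main :: IO ()
main = loop (1 :: Int)
  where
    loop i = do
      ph   <- spawnProcess "cuda-memcheck" ["./acc-gpu-crash"]
      code <- waitForProcess ph
      case code of
        ExitSuccess   -> putStrLn ("run " ++ show i ++ ": ok") >> loop (i + 1)
        ExitFailure n -> putStrLn ("run " ++ show i ++ ": exit code " ++ show n)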
For posterity, the source files (N1.hs and N2.hs) are included here in spoilers.
Expected behaviour
No crash.
Current behaviour
Crash (non-deterministically on some machines).
Steps to reproduce (for bugs)
git clone https://github.com/tomsmeding/acc-gpu-crash
cd acc-gpu-crash
./test.sh
Your environment
If this is a bug with the GPU backend, the output of nvidia-device-query is included below.
nvidia-device-query on my Arch Linux machine
CUDA device query (Driver API, statically linked)
CUDA driver version 11.3
CUDA API version 11.3
Detected 1 CUDA capable device
Device 0: NVIDIA GeForce GTX 1050 Ti
CUDA capability: 6.1
CUDA cores: 768 cores in 6 multiprocessors (128 cores/MP)
Global memory: 4 GB
Constant memory: 64 kB
Shared memory per block: 48 kB
Registers per block: 65536
Warp size: 32
Maximum threads per multiprocessor: 2048
Maximum threads per block: 1024
Maximum grid dimensions: 2147483647 x 65535 x 65535
Maximum block dimensions: 1024 x 1024 x 64
GPU clock rate: 1.392 GHz
Memory clock rate: 3.504 GHz
Memory bus width: 128-bit
L2 cache size: 1 MB
Maximum texture dimensions
1D: 131072
2D: 131072 x 65536
3D: 16384 x 16384 x 16384
Texture alignment: 512 B
Maximum memory pitch: 2 GB
Concurrent kernel execution: Yes
Concurrent copy and execution: Yes, with 2 copy engines
Runtime limit on kernel execution: Yes
Integrated GPU sharing host memory: No
Host page-locked memory mapping: Yes
ECC memory support: No
Unified addressing (UVA): Yes
Single to double precision performance: 32 : 1
Supports compute pre-emption: Yes
Supports cooperative launch: Yes
Supports multi-device cooperative launch: Yes
PCI bus/location: 1/0
Compute mode: Default
Multiple contexts are allowed on the device simultaneously
nvidia-device-query on Jizo
CUDA device query (Driver API, statically linked)
CUDA driver version 11.3
CUDA API version 10.1
Detected 1 CUDA capable device
Device 0: NVIDIA GeForce RTX 2080 Ti
CUDA capability: 7.5
CUDA cores: 4352 cores in 68 multiprocessors (64 cores/MP)
Global memory: 11 GB
Constant memory: 64 kB
Shared memory per block: 48 kB
Registers per block: 65536
Warp size: 32
Maximum threads per multiprocessor: 1024
Maximum threads per block: 1024
Maximum grid dimensions: 2147483647 x 65535 x 65535
Maximum block dimensions: 1024 x 1024 x 64
GPU clock rate: 1.65 GHz
Memory clock rate: 7.0 GHz
Memory bus width: 352-bit
L2 cache size: 6 MB
Maximum texture dimensions
1D: 131072
2D: 131072 x 65536
3D: 16384 x 16384 x 16384
Texture alignment: 512 B
Maximum memory pitch: 2 GB
Concurrent kernel execution: Yes
Concurrent copy and execution: Yes, with 3 copy engines
Runtime limit on kernel execution: Yes
Integrated GPU sharing host memory: No
Host page-locked memory mapping: Yes
ECC memory support: No
Unified addressing (UVA): Yes
Single to double precision performance: 32 : 1
Supports compute pre-emption: Yes
Supports cooperative launch: Yes
Supports multi-device cooperative launch: Yes
PCI bus/location: 66/0
Compute mode: Default
Multiple contexts are allowed on the device simultaneously
nvidia-device-query on Robbert's Arch Linux (Manjaro, really) machine
CUDA device query (Driver API, statically linked)
CUDA driver version 11.3
CUDA API version 10.2
Detected 1 CUDA capable device
Device 0: NVIDIA GeForce RTX 2080 SUPER
CUDA capability: 7.5
CUDA cores: 3072 cores in 48 multiprocessors (64 cores/MP)
Global memory: 8 GB
Constant memory: 64 kB
Shared memory per block: 48 kB
Registers per block: 65536
Warp size: 32
Maximum threads per multiprocessor: 1024
Maximum threads per block: 1024
Maximum grid dimensions: 2147483647 x 65535 x 65535
Maximum block dimensions: 1024 x 1024 x 64
GPU clock rate: 1.845 GHz
Memory clock rate: 7.751 GHz
Memory bus width: 256-bit
L2 cache size: 4 MB
Maximum texture dimensions
1D: 131072
2D: 131072 x 65536
3D: 16384 x 16384 x 16384
Texture alignment: 512 B
Maximum memory pitch: 2 GB
Concurrent kernel execution: Yes
Concurrent copy and execution: Yes, with 3 copy engines
Runtime limit on kernel execution: Yes
Integrated GPU sharing host memory: No
Host page-locked memory mapping: Yes
ECC memory support: No
Unified addressing (UVA): Yes
Single to double precision performance: 32 : 1
Supports compute pre-emption: Yes
Supports cooperative launch: Yes
Supports multi-device cooperative launch: Yes
PCI bus/location: 5/0
Compute mode: Default
Multiple contexts are allowed on the device simultaneously
This is hard to test: on both Jizo (4090) and my desktop (1050), the occasional CUDA exceptions from the reproducer in the issue description above are completely drowned out by segfaults inside libcuda.so:
Thread 1 "acc-gpu-crash" received signal SIGSEGV, Segmentation fault.
0x00007ffff42300cd in ?? () from /usr/lib/libcuda.so.1
(cuda-gdb) bt
#0 0x00007ffff42300cd in ?? () from /usr/lib/libcuda.so.1
#1 0x00007ffff4400c9e in ?? () from /usr/lib/libcuda.so.1
#2 0x00007ffff44fcc96 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007ffff412fa26 in ?? () from /usr/lib/libcuda.so.1
#4 0x00007ffff4130230 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007ffff4132dcc in ?? () from /usr/lib/libcuda.so.1
#6 0x00007ffff4334660 in ?? () from /usr/lib/libcuda.so.1
#7 0x00000000007e272e in ?? ()
#8 0x0000000000000001 in ?? ()
#9 0x0000000000000000 in ?? ()
I can, very occasionally, reproduce a different error with accelerate-llvm master:
warning: Cuda API error detected: cuLaunchKernel returned (0x1)
acc-gpu-crash:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
CUDA Exception: invalid argument
CallStack (from HasCallStack):
internalError: Data.Array.Accelerate.LLVM.PTX.State:55:9
I haven't seen that error yet with the scan-syncthreads branch, but given how rarely it occurs, that doesn't prove much.
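To get more signal despite that low frequency, one option is to hammer the computation in-process and count Haskell-level failures. A minimal sketch, assuming the CUDA exception above surfaces as an ordinary Haskell exception, and reusing the hypothetical stand-in program from the sketch in the description; the libcuda.so segfaults bypass this entirely, since they kill the process rather than throw:

-- Hedged sketch: estimate the failure rate on a given branch by running
-- the computation repeatedly and catching Haskell-level exceptions.
-- Caveat: the libcuda.so segfaults above are not exceptions and will
-- still kill the process outright.
import Control.Exception              (SomeException, evaluate, try)
import Control.Monad                  (forM_)
import Data.Array.Accelerate          as A
import Data.Array.Accelerate.LLVM.PTX as PTX
import Prelude                        as P

-- The same hypothetical stand-in as in the sketch in the description.
program :: Acc (Vector Double)
program = A.scanl (+) 0 (A.map (* 2) (A.use (A.fromList (Z :. 1000000) [0 ..])))

main :: IO ()
main = forM_ [1 .. 1000 :: Int] $ \i -> do
  r <- try (evaluate (PTX.run program)) :: IO (Either SomeException (Vector Double))
  case r of
    Left e  -> putStrLn ("run " P.++ P.show i P.++ ": " P.++ P.show e)
    Right _ -> pure ()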