Linux Debug errors with CUDA 11.0 #569

Closed
ptheywood opened this issue Jun 23, 2021 · 1 comment · Fixed by #592

Comments

Member

ptheywood commented Jun 23, 2021

CUDA 11.0 Debug builds on Linux report errors at runtime.

This is not an issue in 11.1 or 11.3 (unsure on 11.2 right now).

This occurs for the spatial 3D boids example; I am unsure whether others are affected (I assume so).

This has been confirmed to occur on 2 separate machines (mavericks, blackmass).

/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/CUDAErrorChecking.h(37): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/gpu/CUDAAgent.cu(263): misaligned address
terminate called after throwing an instance of 'CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/CUDAErrorChecking.h(37): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/gpu/CUDAAgent.cu(263): misaligned address
ptheywood added the bug label Jun 23, 2021
Member Author

ptheywood commented Jul 13, 2021

This occurs during validateIDCollisions, in CUDA 11.0 only, for sufficiently large populations (for circles_spatial3d, the threshold is somewhere between 4000 and 4300 agents on a Titan V).

It is caught by the first CUDA error check after a call to cub::DeviceReduce::Sum, so it may be a CUB issue or an nvcc issue.

CUB in CUDA 11.0 should be 1.9.9, 11.1 is 1.9.10, while we are explicitly using 1.10.0 for the improved CMake support.

According to cuda-memcheck, it is an "Invalid __global__ write of size 16" in

cub::DeviceReduceKernel<cub::DeviceReducePolicy<unsigned int, unsigned int, int, cub::Sum>::Policy600, unsigned int*, unsigned int*, int, cub::Sum>(unsigned int, int, cub::Sum, cub::GridEvenShare<int>, cub::DeviceReducePolicy<unsigned int, unsigned int, int, cub::Sum>::Policy600)

Via cuda-gdb, this occurs at agent_reduce.cuh:259, which is during a vectorised read.

This continues to occur if I bump the Thrust/CUB version to 1.13.0 via CMake.

vec_items is the variable being written to. It is a VectorT* obtained via reinterpret_cast from an InputT[ITEMS_PER_THREAD], which is a local memory / register array whose placement is up to the compiler. So this feels like a compiler bug, especially as it is not an issue in more recent CUDA versions.

CUDA 10.0 and 10.2 don't experience this issue, so I'm calling it a CUDA 11.0-specific bug.
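
For readers unfamiliar with why the vec_items cast described above can produce a "misaligned address", here is a minimal host-side C++ sketch of the alignment requirement involved. It is not CUB's code and does not reproduce the bug. `Vec4` is a hypothetical stand-in for a 16-byte vector type, the value of `ITEMS_PER_THREAD` is chosen for illustration only, and all helper names are made up. The assumption is simply that a 16-byte access needs a 16-byte aligned address, while an array of unsigned int is only guaranteed 4-byte alignment unless the compiler over-aligns it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 16-byte vector type (assumption, not CUB's type).
struct alignas(16) Vec4 {
    unsigned int x, y, z, w;
};

// Illustrative value only; the real value depends on the tuning policy.
constexpr int ITEMS_PER_THREAD = 8;

// True if p satisfies the alignment a 16-byte vector access needs.
inline bool is_aligned_for_vec4(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(Vec4) == 0;
}

// A plain item array: only alignof(unsigned int) is guaranteed.
struct PlainItems {
    unsigned int items[ITEMS_PER_THREAD];
};

// The same array with the stricter alignment requested explicitly.
struct AlignedItems {
    alignas(Vec4) unsigned int items[ITEMS_PER_THREAD];
};

static_assert(alignof(PlainItems) == alignof(unsigned int),
              "plain array carries no 16-byte alignment guarantee");
static_assert(alignof(AlignedItems) == alignof(Vec4),
              "over-aligned array is safe to view as Vec4");

// Cast that checks the alignment precondition instead of assuming it.
// The returned pointer is not dereferenced here.
inline Vec4* checked_vec_cast(unsigned int* items) {
    assert(is_aligned_for_vec4(items) && "vector access needs 16-byte alignment");
    return reinterpret_cast<Vec4*>(items);
}

// Shows that shifting an aligned buffer by one element (4 bytes) breaks
// the 16-byte alignment a vectorised access relies on.
inline bool offset_breaks_alignment() {
    alignas(Vec4) unsigned int buffer[ITEMS_PER_THREAD + 1] = {};
    return is_aligned_for_vec4(buffer) && !is_aligned_for_vec4(buffer + 1);
}
```

In this sketch, calling `checked_vec_cast` on `AlignedItems::items` always passes the check, whereas a `PlainItems` instance passes only if it happens to land on a 16-byte boundary. That is the kind of placement decision the comment above attributes to the compiler.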
