
Segmented folds crash or give inconsistent results #423

Closed
electroCutie opened this issue Jun 5, 2018 · 8 comments
Labels
llvm-ptx accelerate-llvm-ptx

Comments

@electroCutie

I am submitting a...

  • bug report
  • feature request
  • support request => you might also like to ask your question on the mailing list or gitter chat.

Description

When performing large (but not huge) segmented folds over Floats, CUDA crashes.

Expected behaviour

Folds should work the same with Floats and Doubles, within the limits of precision.

Current behaviour

Segmented folds over Floats crash.

Steps to reproduce (for bugs)

A minimal project reproducing this bug: https://github.com/electroCutie/AccelerateHs_Bugs
The bug is exercised by the foldSegBug executable; a sketch of the kind of program involved is shown below.
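
A minimal sketch of a segmented Float fold run through the accelerate-llvm-ptx backend (this is not the actual foldSegBug program; the array length, segment count, and input values below are placeholders):

module Main where

import Data.Array.Accelerate            as A
import Data.Array.Accelerate.LLVM.PTX   ( run )
import Prelude                          as P

main :: IO ()
main = do
  let n      = 1000000 :: Int                                    -- total number of elements (placeholder)
      nsegs  = 1000    :: Int                                    -- number of segments (placeholder)
      xs     = fromList (Z :. n) [0 ..]                          :: Vector Float
      segs   = fromList (Z :. nsegs)
                        (P.replicate nsegs (n `P.div` nsegs))    :: Segments Int
      -- sum each segment on the GPU with the PTX backend
      result = run $ A.foldSeg (+) 0 (use xs) (use segs)
  print result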

Your environment

  • Accelerate version: 1.2.0.0

  • Accelerate backend(s) used: accelerate-llvm-ptx

  • GHC version: The Glorious Glasgow Haskell Compilation System, version 8.0.2

  • Operating system and version: XUbuntu 17.10

  • Link to your project/example: https://github.com/electroCutie/AccelerateHs_Bugs

  • If this is a bug with the GPU backend, include the output of nvidia-device-query:
    CUDA device query (Driver API, statically linked)
    CUDA driver version 9.0
    CUDA API version 8.0
    Detected 1 CUDA capable device

    Device 0: Quadro K620
      CUDA capability:                    5.0
      CUDA cores:                         384 cores in 3 multiprocessors (128 cores/MP)
      Global memory:                      2 GB
      Constant memory:                    64 kB
      Shared memory per block:            48 kB
      Registers per block:                65536
      Warp size:                          32
      Maximum threads per multiprocessor: 2048
      Maximum threads per block:          1024
      Maximum grid dimensions:            2147483647 x 65535 x 65535
      Maximum block dimensions:           1024 x 1024 x 64
      GPU clock rate:                     1.124 GHz
      Memory clock rate:                  900.0 MHz
      Memory bus width:                   128-bit
      L2 cache size:                      2 MB
      Maximum texture dimensions
        1D:                               65536
        2D:                               65536 x 65536
        3D:                               4096 x 4096 x 4096
      Texture alignment:                  512 B
      Maximum memory pitch:               2 GB
      Concurrent kernel execution:        Yes
      Concurrent copy and execution:      Yes, with 1 copy engine
      Runtime limit on kernel execution:  Yes
      Integrated GPU sharing host memory: No
      Host page-locked memory mapping:    Yes
      ECC memory support:                 No
      Unified addressing (UVA):           Yes
      PCI bus/location:                   3/0
      Compute mode:                       Default
        Multiple contexts are allowed on the device simultaneously

@electroCutie
Author

There seems to be more than one thing going on.
On my desktop (the one I originally submitted the bug from) the GPU not only crashes on Floats, but also returns an incorrect result for Doubles, in fact one that changes from run to run.

On my laptop it produces the correct number, but still crashes on Floats.

Here is the info from my laptop's GPU:
CUDA device query (Driver API, statically linked)
CUDA driver version 9.0
CUDA API version 8.0
Detected 1 CUDA capable device

Device 0: GeForce GTX 1050 Ti with Max-Q Design
  CUDA capability:                    6.1
  CUDA cores:                         768 cores in 6 multiprocessors (128 cores/MP)
  Global memory:                      4 GB
  Constant memory:                    64 kB
  Shared memory per block:            48 kB
  Registers per block:                65536
  Warp size:                          32
  Maximum threads per multiprocessor: 2048
  Maximum threads per block:          1024
  Maximum grid dimensions:            2147483647 x 65535 x 65535
  Maximum block dimensions:           1024 x 1024 x 64
  GPU clock rate:                     1.2905 GHz
  Memory clock rate:                  3.504 GHz
  Memory bus width:                   128-bit
  L2 cache size:                      1 MB
  Maximum texture dimensions         
    1D:                               131072
    2D:                               131072 x 65536
    3D:                               16384 x 16384 x 16384
  Texture alignment:                  512 B
  Maximum memory pitch:               2 GB
  Concurrent kernel execution:        Yes
  Concurrent copy and execution:      Yes, with 2 copy engines
  Runtime limit on kernel execution:  Yes
  Integrated GPU sharing host memory: No
  Host page-locked memory mapping:    Yes
  ECC memory support:                 No
  Unified addressing (UVA):           Yes
  PCI bus/location:                   1/0
  Compute mode:                       Default
    Multiple contexts are allowed on the device simultaneously

@tmcdonell
Member

Thanks for the minimal test case! I can reproduce this on my machine, am investigating...

@tmcdonell
Member

Can you see if this fix works for you? In your stack.yaml something like...

extra-deps:
- git:    https://github.com/tmcdonell/accelerate.git
  commit: 442dcbdb8d95407bc650d8f4ce6aa62ab593e484

- git:    https://github.com/tmcdonell/accelerate-llvm.git
  commit: 6aacde0ffae37552f5f5bacd1ae3029b14012f8c
  subdirs:
    - 'accelerate-llvm'
    - 'accelerate-llvm-native'
    - 'accelerate-llvm-ptx'

@tmcdonell
Member

The only change in the generated assembly (modulo renaming):

Old:

LBB0_25:                                // %if83.entry
                                        //   in Loop: Header=BB0_23 Depth=2
        add.s64         %rd30, %rd8, %rd62;
        setp.ge.s64     %p13, %rd30, %rd20;
        mov.f32         %f68, %f2;
        @%p13 bra       LBB0_27;
        bra.uni         LBB0_26;

New:

LBB0_25:                                // %if83.entry
                                        //   in Loop: Header=BB0_23 Depth=2
        add.s64         %rd30, %rd8, %rd62;
        setp.ge.s64     %p13, %rd30, %rd20;
                                        // implicit-def: %f68
        @%p13 bra       LBB0_27;
        bra.uni         LBB0_26;

@tmcdonell added the llvm-ptx accelerate-llvm-ptx label Jun 6, 2018
@electroCutie
Author

It did not fix the bug on my desktop (the box with the Quadro K620).
Going to test on my laptop, but the build takes a long time there.

I updated my test repository to reflect pulling these changes.

@tmcdonell
Member

Oh, I forgot to mention that you will need to delete the cache directory, $HOME/.accelerate, else it will keep using the old code. (sorry about that)

@electroCutie
Author

electroCutie commented Jun 6, 2018

Deleted the accelerate cache folder; the bug persists.

This bug goes further than it originally seemed: it manifests not only as a crash but also as incorrect results. I've tested Shorts, Ints, Longs, Floats, and Doubles, and they inconsistently give wrong answers.

Updated the test repository. The test suite now takes command line arguments and will iterate and highlight incorrect results (until it crashes); a sketch of that checking loop follows below.
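
A sketch of the kind of checking loop described above (an assumption on my part, not the actual updated foldSegBug executable; comparing each run against a first-run baseline is my own choice here):

import Control.Monad                    ( forM_ )
import Data.Array.Accelerate            as A
import Data.Array.Accelerate.LLVM.PTX   ( run )
import Prelude                          as P

-- Repeatedly run the segmented fold on the GPU and flag any iteration
-- whose result differs from the first run's result.
checkRuns :: Int -> Vector Float -> Segments Int -> IO ()
checkRuns iterations xs segs = do
  let gpuFold  = A.foldSeg (+) 0 (use xs) (use segs)
      baseline = toList (run gpuFold)
  forM_ [1 .. iterations] $ \i -> do
    let actual = toList (run gpuFold)
    if actual P.== baseline
      then P.return ()
      else putStrLn ("iteration " P.++ show i P.++ ": inconsistent result")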

@electroCutie changed the title from "Segmented folds crash with Floats but not Doubles" to "Segmented folds crash or give inconsistent results" Jun 12, 2018
@tmcdonell
Member

The tagged commit doesn't entirely fix it, but all the necessary changes were on that branch.
