
Segmented folds crash or give inconsistent results #423

Closed
electroCutie opened this issue Jun 5, 2018 · 8 comments
Labels
llvm-ptx accelerate-llvm-ptx

Comments

@electroCutie

I am submitting a...

  • bug report
  • feature request
  • support request => you might also like to ask your question on the mailing list or gitter chat.

Description

When performing large (but not huge) segmented folds over Floats, CUDA crashes.

Expected behaviour

Folds should work the same with Floats and Doubles, within the limits of precision.

Current behaviour

Segmented folds over Floats crash.

Steps to reproduce (for bugs)

A minimal project reproducing this bug: https://github.com/electroCutie/AccelerateHs_Bugs
The bug is exercised by the foldSegBug executable; a sketch of the kind of program involved is shown below.
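
A minimal sketch of a segmented Float fold run through the accelerate-llvm-ptx backend (this is not the actual foldSegBug program; the array length, segment count, and input values below are placeholders):

module Main where

import Data.Array.Accelerate            as A
import Data.Array.Accelerate.LLVM.PTX   ( run )
import Prelude                          as P

main :: IO ()
main = do
  let n      = 1000000 :: Int                                    -- total number of elements (placeholder)
      nsegs  = 1000    :: Int                                    -- number of segments (placeholder)
      xs     = fromList (Z :. n) [0 ..]                          :: Vector Float
      segs   = fromList (Z :. nsegs)
                        (P.replicate nsegs (n `P.div` nsegs))    :: Segments Int
      -- sum each segment on the GPU with the PTX backend
      result = run $ A.foldSeg (+) 0 (use xs) (use segs)
  print result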

Your environment

  • Accelerate version: 1.2.0.0

  • Accelerate backend(s) used: accelerate-llvm-ptx

  • GHC version: The Glorious Glasgow Haskell Compilation System, version 8.0.2

  • Operating system and version: XUbuntu 17.10

  • Link to your project/example: https://github.com/electroCutie/AccelerateHs_Bugs

  • If this is a bug with the GPU backend, include the output of nvidia-device-query:
    CUDA device query (Driver API, statically linked)
    CUDA driver version 9.0
    CUDA API version 8.0
    Detected 1 CUDA capable device

    Device 0: Quadro K620
      CUDA capability:                    5.0
      CUDA cores:                         384 cores in 3 multiprocessors (128 cores/MP)
      Global memory:                      2 GB
      Constant memory:                    64 kB
      Shared memory per block:            48 kB
      Registers per block:                65536
      Warp size:                          32
      Maximum threads per multiprocessor: 2048
      Maximum threads per block:          1024
      Maximum grid dimensions:            2147483647 x 65535 x 65535
      Maximum block dimensions:           1024 x 1024 x 64
      GPU clock rate:                     1.124 GHz
      Memory clock rate:                  900.0 MHz
      Memory bus width:                   128-bit
      L2 cache size:                      2 MB
      Maximum texture dimensions
        1D:                               65536
        2D:                               65536 x 65536
        3D:                               4096 x 4096 x 4096
      Texture alignment:                  512 B
      Maximum memory pitch:               2 GB
      Concurrent kernel execution:        Yes
      Concurrent copy and execution:      Yes, with 1 copy engine
      Runtime limit on kernel execution:  Yes
      Integrated GPU sharing host memory: No
      Host page-locked memory mapping:    Yes
      ECC memory support:                 No
      Unified addressing (UVA):           Yes
      PCI bus/location:                   3/0
      Compute mode:                       Default
        Multiple contexts are allowed on the device simultaneously

@electroCutie
Author

There seems to be more than one thing going on.
On my desktop (the one I originally submitted the bug from) the GPU not only crashes on Floats, but also returns an incorrect result for Doubles, in fact one that changes from run to run.

On my laptop it produces the correct number, but still crashes on Floats.

Here is the info from my laptop's GPU:
CUDA device query (Driver API, statically linked)
CUDA driver version 9.0
CUDA API version 8.0
Detected 1 CUDA capable device

Device 0: GeForce GTX 1050 Ti with Max-Q Design
  CUDA capability:                    6.1
  CUDA cores:                         768 cores in 6 multiprocessors (128 cores/MP)
  Global memory:                      4 GB
  Constant memory:                    64 kB
  Shared memory per block:            48 kB
  Registers per block:                65536
  Warp size:                          32
  Maximum threads per multiprocessor: 2048
  Maximum threads per block:          1024
  Maximum grid dimensions:            2147483647 x 65535 x 65535
  Maximum block dimensions:           1024 x 1024 x 64
  GPU clock rate:                     1.2905 GHz
  Memory clock rate:                  3.504 GHz
  Memory bus width:                   128-bit
  L2 cache size:                      1 MB
  Maximum texture dimensions         
    1D:                               131072
    2D:                               131072 x 65536
    3D:                               16384 x 16384 x 16384
  Texture alignment:                  512 B
  Maximum memory pitch:               2 GB
  Concurrent kernel execution:        Yes
  Concurrent copy and execution:      Yes, with 2 copy engines
  Runtime limit on kernel execution:  Yes
  Integrated GPU sharing host memory: No
  Host page-locked memory mapping:    Yes
  ECC memory support:                 No
  Unified addressing (UVA):           Yes
  PCI bus/location:                   1/0
  Compute mode:                       Default
    Multiple contexts are allowed on the device simultaneously

@tmcdonell
Member

Thanks for the minimal test case! I can reproduce this on my machine, am investigating...

@tmcdonell
Member

Can you see if this fix works for you? In your stack.yaml something like...

extra-deps:
- git:    https://github.com/tmcdonell/accelerate.git
  commit: 442dcbdb8d95407bc650d8f4ce6aa62ab593e484

- git:    https://github.com/tmcdonell/accelerate-llvm.git
  commit: 6aacde0ffae37552f5f5bacd1ae3029b14012f8c
  subdirs:
    - 'accelerate-llvm'
    - 'accelerate-llvm-native'
    - 'accelerate-llvm-ptx'

@tmcdonell
Member

The only change in the generated assembly (modulo renaming):

Old:

LBB0_25:                                // %if83.entry
                                        //   in Loop: Header=BB0_23 Depth=2
        add.s64         %rd30, %rd8, %rd62;
        setp.ge.s64     %p13, %rd30, %rd20;
        mov.f32         %f68, %f2;
        @%p13 bra       LBB0_27;
        bra.uni         LBB0_26;

New:

LBB0_25:                                // %if83.entry
                                        //   in Loop: Header=BB0_23 Depth=2
        add.s64         %rd30, %rd8, %rd62;
        setp.ge.s64     %p13, %rd30, %rd20;
                                        // implicit-def: %f68
        @%p13 bra       LBB0_27;
        bra.uni         LBB0_26;

@tmcdonell added the llvm-ptx accelerate-llvm-ptx label Jun 6, 2018
@electroCutie
Author

It did not fix the bug on my desktop (the box with the Quadro K620).
Going to test on my laptop, but the build takes a long time there.

I updated my test repository to reflect pulling these changes.

@tmcdonell
Member

Oh, I forgot to mention that you will need to delete the cache directory, $HOME/.accelerate, else it will keep using the old code. (sorry about that)

@electroCutie
Author

electroCutie commented Jun 6, 2018

Deleted the accelerate cache folder; the bug persists.

This bug goes further than it originally seemed: it manifests not only as a crash but also as incorrect results. I've tested Shorts, Ints, Longs, Floats, and Doubles, and they inconsistently give wrong answers.

Updated the test repository. The test suite now takes command line arguments and will iterate and highlight incorrect results (until it crashes); a sketch of that checking loop follows below.
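
A sketch of the kind of checking loop described above (an assumption on my part, not the actual updated foldSegBug executable; comparing each run against a first-run baseline is my own choice here):

import Control.Monad                    ( forM_ )
import Data.Array.Accelerate            as A
import Data.Array.Accelerate.LLVM.PTX   ( run )
import Prelude                          as P

-- Repeatedly run the segmented fold on the GPU and flag any iteration
-- whose result differs from the first run's result.
checkRuns :: Int -> Vector Float -> Segments Int -> IO ()
checkRuns iterations xs segs = do
  let gpuFold  = A.foldSeg (+) 0 (use xs) (use segs)
      baseline = toList (run gpuFold)
  forM_ [1 .. iterations] $ \i -> do
    let actual = toList (run gpuFold)
    if actual P.== baseline
      then P.return ()
      else putStrLn ("iteration " P.++ show i P.++ ": inconsistent result")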

@electroCutie changed the title from "Segmented folds crash with Floats but not Doubles" to "Segmented folds crash or give inconsistent results" Jun 12, 2018
@tmcdonell
Member

The tagged commit doesn't entirely fix it, but all the necessary changes were on that branch.
