CUDA backend does not work with Bumblebee/Optimus #92
Comments
This should be fixed by AccelerateHS/accelerate-examples@91250ca. Can you confirm this? |
Yes. Built fine. Lots of "fails" when running with the CUDA backend. I'm using CUDA 5; not sure if this breaks stuff. Do you want to see the list? |
Most of the fails were of the form...
fold-sum and fold-2D-sum also failed, but differently:
fold-sum: Failed:
fold-2d-sum: Failed:
|
Hmm... what card are you running on, and what compute capability is it? |
On 12/05/13 17:26, Trevor L. McDonell wrote:
Neil |
Yes, we do need to push and pop the CUDA context; I thought that that was enough, but my reading of the CUDA docs might be incorrect (and I had not even heard of |
Optirun is part of the bumblebee project to allow use of the Optimus On 12/05/13 17:52, Trevor L. McDonell wrote:
|
Actually, does optimus aim to allow dynamic switching between a pair of low/high power GPUs? I have a similar problem with this dynamic switching (usually) not working under Mac OS X (#67), even with the NVIDIA drivers, although it does seem to work with the NVIDIA examples. Does it work if you disable the switching and only use the fast GPU? |
Hi Trevor, Yes. The Optimus is an NVIDIA design which uses the onboard Intel graphics I have been thinking, and the errors I am getting now are of language Cheers, Neil On Mon, May 13, 2013 at 11:10 AM, Trevor L. McDonell <
|
Hi Neil, I am using CUDA 5 and it has worked for me --- this is on Mac OS X and Ubuntu. It might make a difference if you're on a different Linux distribution? What do you mean by language errors from the CUDA system? Different errors from the earlier "unspecified launch failure"? Try changing this from |
Sorry Trevor. The "language" error was a language error of my own - due to Will try your suggestion when I get home. Neil On Mon, May 13, 2013 at 9:28 PM, Trevor L. McDonell <
|
Hi Trevor, I have had another thought. Debian Wheezy (my OS) comes with gcc 4.7 as What is the default version of gcc on your OS? Cheers Neil |
On my Mac it is gcc-4.2, but this is Apple's own version so I am not sure if that is comparable. The Ubuntu 12.04 box uses gcc-4.6.3. Adding the flag |
Hi Trevor, Did make a difference. This is the output:
On 13/05/13 21:28, Trevor L. McDonell wrote:
|
Hullo Trevor, SUCCESS! I rebuilt all the accelerate packages (with the change to forkOn in Not sure how to interpret the benchmarks but am very pleased to have it May I also say that the code is beautiful. Don't understand it all yet, Neil |
OOPS! Duh! I didn't turn on --cuda, so of course they all looked ok. Sorry. No change with cuda backend. :-(( All this regarding accelerate-examples of course. Neil |
Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well. |
Will do Trevor. On 28/05/13 23:00, Trevor L. McDonell wrote:
|
I got this error in trying to compile the examples: [ 6 of 12] Compiling Test.IndexSpace ( examples/nofib/Test/IndexSpace.hs:170:71: I'll have a look and change to A.even as I assume that's what you meant. Neil. On 28/05/13 23:00, Trevor L. McDonell wrote:
|
This fixed it: On 28/05/13 23:00, Trevor L. McDonell wrote:
|
Another one: examples/tests/primitives/Gather.hs:41:11: On 28/05/13 23:00, Trevor L. McDonell wrote:
|
And: examples/tests/primitives/Scatter.hs:52:11: On 28/05/13 23:00, Trevor L. McDonell wrote:
|
Both fixed the same way, and all now compile... Let's see how they run! On 28/05/13 23:00, Trevor L. McDonell wrote:
|
This is the output... I used Ctrl-C during the 4th slices as it seemed to
neil@debian-neil:~/.cabal/bin$ optirun --no-xorg ./accelerate-examples
map-abs: Ok
fold-product: Ok
fold-2d-product: Ok
stencil-1D: Failed:
stencil-2D: Failed:
stencil-3D: Failed:
stencil-3x3-cross: Failed:
stencil-3x3-pair: Failed:
stencil2-2D: Failed:
permute-hist: Failed:
backpermute-reverse: Failed:
backpermute-transpose: Failed:
init: Failed:
tail: Failed:
take: Failed:
drop: Failed:
slit: Failed:
gather: Failed:
gather-if: Failed:
scatter: Failed:
scatter-if: Failed:
sasum: Failed:
saxpy: Failed:
dotp: Failed:
filter: Failed:
smvm: Failed:
black-scholes: Failed:
radixsort: Failed:
io: test: fromPtr Int
slices: Failed:
slices: Failed:
slices: ^C[ 3364.241184] [WARN]Received Interrupt signal.
sharing-recovery: Ok
On 28/05/13 23:00, Trevor L. McDonell wrote:
|
Oops, sorry for all the compilation failures with
Are these the same errors you had initially? This looks more like what we had after the hack to replace
For the "unspecified launch failure" errors, we might be trying to launch a kernel that requires more resources than your card provides. Since I haven't tested on an Optimus card before, there might be bugs in the occupancy calculator code. Try the following?

import Prelude as P
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA
import System.Environment

xs, ys :: Acc (Vector Float)
xs = use $ fromList (Z:.10) [0..]
ys = use $ fromList (Z:.10) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys
  = A.fold (+) 0
  $ A.zipWith (*) xs ys

main :: IO ()
main
  = withArgs ["-ddump-cc", "-ddump-gc", "-ddump-exec", "-dverbose"]
  $ print
  $ run (dotp xs ys)

You'll need to have installed |
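To judge whether the CUDA backend returns the right answer for this test program, it helps to know the expected value up front. The following is a minimal sketch using plain Haskell lists and only the Prelude; `dotpRef`, `xsRef` and `ysRef` are hypothetical names introduced here for illustration and are not part of Accelerate.

```haskell
-- List-based reference for the dot product in the test program above.
-- No GPU and no Accelerate are involved; dotpRef is a hypothetical helper.
dotpRef :: [Float] -> [Float] -> Float
dotpRef as bs = sum (zipWith (*) as bs)

-- The same ten-element inputs that fromList (Z:.10) takes from the
-- infinite lists [0..] and [2,4..].
xsRef, ysRef :: [Float]
xsRef = take 10 [0..]
ysRef = take 10 [2,4..]

main :: IO ()
main = print (dotpRef xsRef ysRef)  -- prints 660.0
```

If the Accelerate program prints a scalar other than 660.0, the kernel ran but produced a wrong result, which is a different failure from a launch error.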
Thanks Trevor, I'll try that. The fix you did for the invalid context (last email) - The "unspecified launch failure errors" were in the "forkOS" version. I did put the forkOn back in, but not sure I rebuilt the whole sequence Cheers, Neil On 30/05/13 10:54, Trevor L. McDonell wrote:
|
Hi Trevor, This is the output. Are you able to make sense of it? Certainly seems neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda -k if (ix < shapeSize) { y0 = arrIn0_a0[v3] * arrIn1_a0[v4]; x0 = arrIn0_a0[v3] * arrIn1_a0[v4]; 0.08:cc: (3.0,"\209\181\149\254\136cnX\DEL\171\b\219\160\133\133:") if (ix < shapeSize) { 0.09:cc: waiting for nvcc... Cheers, Neil On 30/05/13 10:54, Trevor L. McDonell wrote:
|
Hi Neil, hmm, it does indeed seem to have worked. Okay, a couple more things to try, if you don't mind:
Thanks! |
Oh, also, did you need to edit |
The forkOn 0 no longer makes any difference - i.e all now fail as it did I'll try the suggestion about cranking up the size of the vectors and Neil On 31/05/13 16:05, Trevor L. McDonell wrote:
|
Here's the deviceQueryDrv output: neil@debian-neil:~/Documents/Computing/CUDASamples/1_Utilities/deviceQueryDrv$ CUDA Device Query (Driver API) statically linked version Device 0: "Quadro K1000M" On 31/05/13 16:03, Trevor L. McDonell wrote:
|
Remarkably durable...
import Prelude as P
import System.Environment
xs, ys :: Acc (Vector Float)
dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
main :: IO ()
neil@debian-neil: On 31/05/13 16:03, Trevor L. McDonell wrote:
|
Changing vector sizes to this... xs = use $ fromList (Z:.1000000000) [0..] Lead to a perfectly reasonable... neil@debian-neil:~/.cabal/bin$ On 31/05/13 16:03, Trevor L. McDonell wrote:
|
Does the simple dotp example exercise the Async module? This seems to Cheers, Neil |
Oops, Sorry a mis-type there. They are with forkIO (not forkOS). I On 31/05/13 16:05, Trevor L. McDonell wrote:
|
Hi Trevor, I mentioned this before, but it may have been lost, and is more of a Cheers, neil |
Hi Trevor, I thought you might be interested in this. Running the regression test neil@debian-neil:~/Downloads/Accelerate_130529/accelerate-examples-master/accelerate-examples-master$ First the main battery of tests:running with CUDA backend map-abs: Ok
stencil-1D: Ok Next, additional application tests, beginning with mandelbrot:accelerate-mandelbrot (c) [2011..2013] The Accelerate Team Usage: accelerate-mandelbrot [OPTIONS] Available backends:
Runtime usage: Error: unrecognized option `--size=64' Run "accelerate-mandelbrot --help" for usage information |
Hi Trevor,
I hope you don't mind me sending lots, but I am on a roll at the
Actually I can be more specific... 1024 works, 1025 fails.
There are numerous (hundreds of) "fails" in the results not matching the
:-) Neil
Also fluid, mandelbrot, etc. all run fine. Haven't tried the hashcat.
smoothlife chokes on the default settings. I get a decent animation
neil@debian-neil:~/.cabal/bin$ ./accelerate-smoothlife --cuda --size=64
Pretty happy now! Neil
All this with forkIO |
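The thread does not establish why the boundary sits between 1024 and 1025. One quantity that does change exactly there, assuming a launch that uses 1024 threads per block (an assumption for illustration, not something the logs above confirm), is the number of thread blocks needed to cover the input. A minimal sketch of that arithmetic, with the hypothetical helper `blocksFor`:

```haskell
-- Number of thread blocks needed so that each of n elements gets its own
-- thread, for a given block size (ceiling division).
-- blocksFor is a hypothetical helper; the block size of 1024 used below is
-- an assumption, not a value taken from accelerate-cuda.
blocksFor :: Int -> Int -> Int
blocksFor blockSize n = (n + blockSize - 1) `div` blockSize

main :: IO ()
main = mapM_ report [1024, 1025]
  where
    report n = putStrLn (show n ++ " elements -> "
                         ++ show (blocksFor 1024 n) ++ " block(s)")
```

Under that assumption, 1024 elements fit in a single block while 1025 need a second one, so a bug that only shows up with more than one block would be one candidate explanation worth checking.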
Okay, that's great! I made some changes elsewhere that tried to do the same thing, but not fixed to CPU zero, so I am glad that that works. One problem down! |
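For readers following the forkIO / forkOS / forkOn discussion, here is a minimal sketch of how the three differ, using only `Control.Concurrent` from base. It is not the accelerate-cuda code; `runOn` and `describe` are hypothetical helpers written for this illustration.

```haskell
import Control.Concurrent

-- Run an action on a new thread created by the given fork function and
-- wait for its result. runOn is a hypothetical helper, not a library call.
runOn :: (IO () -> IO ThreadId) -> IO a -> IO a
runOn fork act = do
  box <- newEmptyMVar
  _   <- fork (act >>= putMVar box)
  takeMVar box

-- Report (capability number, pinned to that capability?, bound to an OS thread?)
-- for the calling Haskell thread.
describe :: IO (Int, Bool, Bool)
describe = do
  tid           <- myThreadId
  (cap, pinned) <- threadCapability tid
  bound         <- isCurrentThreadBound
  return (cap, pinned, bound)

main :: IO ()
main = do
  runOn forkIO     describe >>= print   -- unpinned, unbound
  runOn (forkOn 0) describe >>= print   -- pinned to capability 0, unbound
  -- forkOS needs the threaded runtime (link with -threaded).
  if rtsSupportsBoundThreads
    then runOn forkOS describe >>= print  -- bound to its own OS thread
    else putStrLn "forkOS skipped: runtime does not support bound threads"
```

A thread made with `forkOn 0` stays on one capability, and a `forkOS` thread always runs on the same OS thread. Either property can matter for libraries that keep per-OS-thread state, which is a plausible reason the choice of fork function affected the CUDA context errors discussed here.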
Yes, all |
Hi Neil,
Not at all, it is all very useful information (:
Ah, that is very helpful, thanks! I'll play around and see if I can dig up any more leads to follow.
A little worrying, but at least it runs! We'll get to that one later (:
Great! For hashcat you'll need to find a list of plain text words to feed it, and then a bunch of MD5 digests to guess. You can use a standard dictionary like /usr/share/dict/english, although for a bit of fun Google for the rockyou list and a list of unknown MD5s (:
I think it depends on whether or not accelerate-fft built against the fast CUDA FFT library implementation. I don't think there is an easy way to check whether this happened or not, aside from just running and measuring the speed. Try:
Or just install it after the accelerate-cuda package is already installed. This should probably have better documentation! -Trev |
On 03/06/13 16:04, Trevor L. McDonell wrote:
|
Addressing issues: - AccelerateHS/accelerate#93 - AccelerateHS/accelerate#95 - improvements for AccelerateHS/accelerate#92
@neiljamieso does everything work fine now? Some recent fixes to the fold kernel mean that those tests should pass now. Do you still have any problems here? |
Hi Trev, How recent a download from Github do I need? Neil On 15/11/13 16:02, Trevor L. McDonell wrote:
|
Hi Trev, I tried installing the latest accelerate stuff from GitHub. The latest accelerate-cuda depends on cuda-1.5.1.1 - the latest cuda in On 15/11/13 16:02, Trevor L. McDonell wrote:
|
@neiljamieso Trev probably forgot to push the version bump. Just change the version in |
Not working so well. I have attached the outputs (with my command line Neil On 16/11/13 23:41, Manuel M T Chakravarty wrote:
neil@debian-neil:~/.cabal/bin$ _OUTPUT_*
running with CUDA backend
map-abs: Ok
stencil-1D: Failed:
stencil-2D: Failed:
stencil-3D: Failed:
stencil-3x3-cross: Failed:
stencil-3x3-pair: Failed:
stencil2-2D: Failed:
permute-hist: Failed:
backpermute-reverse: Failed:
backpermute-transpose: Failed:
init: Failed:
tail: Failed:
take: Failed:
drop: Failed:
slit: Failed:
gather: Failed:
gather-if: Failed:
scatter: Failed:
scatter-if: Failed:
sasum: Failed:
saxpy: Failed:
dotp: Failed:
filter: Failed:
smvm: Failed:
black-scholes: Failed:
radixsort: Failed:
io: test: fromPtr Int
slices: Failed:
slices: Failed:
slices: Failed:
slices: Failed:
sharing-recovery: Ok
warming up
benchmarking map-abs
neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda --size=1024 > bare_test_131117
neil@debian-neil:~/.cabal/bin$ _OUTPUT_**
running with CUDA backend
map-abs: Ok
stencil-1D: Failed:
stencil-2D: Failed:
stencil-3D: Failed:
stencil-3x3-cross: Failed:
stencil-3x3-pair: Failed:
stencil2-2D: Failed:
permute-hist: Failed:
backpermute-reverse: Failed:
backpermute-transpose: Failed:
init: Failed:
tail: Failed:
take: Failed:
drop: Failed:
slit: Failed:
gather: Failed:
gather-if: Failed:
scatter: Failed:
scatter-if: Failed:
sasum: Failed:
saxpy: Failed:
dotp: Failed:
filter: Failed:
smvm: Failed:
black-scholes: Failed:
radixsort: Failed:
io: test: fromPtr Int
slices: Failed:
slices: Failed:
slices: Failed:
slices: Failed:
sharing-recovery: Ok
warming up
benchmarking map-abs |
Sorry for the problem with the cuda package version, fixed and will be uploaded to hackage soon. Could you run the |
Hullo Trev, Not sure what this means "|accelerate-examples| is no longer built as This is the output from nofib... EKG monitor started at: http://localhost:8000 accelerate-nofib (c) [2013] The Accelerate Team Usage: accelerate-nofib [OPTIONS] Available backends:
prelude: (used seed -1630649237856122637) (used seed -4172774753861454420) (used seed -4068642445411035362) (used seed 4504072601150252809) (used seed -1768028967034461376) (used seed -578241401213968022) (used seed 8607050148139398118) (used seed 2474179189546383018) (used seed -8403008051050665374) (used seed 6231186752828250437) accelerate-nofib: accelerate-nofib: forkOS_entry: interrupted On 19/11/13 18:10, Trevor L. McDonell wrote:
|
Ah, I mean that the program called It looks like
|
hi Trev, away for a week with just my phone. will try when I get back. Cheers Neil |
Hi Trev, As an experiment I tried running the interpreter version of this, and it ..... done! On 21/11/13 23:59, Trevor L. McDonell wrote:
accelerate-nofib (c) [2013] The Accelerate Team Usage: accelerate-nofib [OPTIONS] Available backends:
prelude: (used seed 7863247450130050956) (used seed 5228219361933020874) (used seed 3176500408165050443) (used seed -5531309095382955723)
Passed 166 33 199 |
On 21/11/13 23:59, Trevor L. McDonell wrote:
$ optirun cuda-memcheck accelerate-nofib --int64=False accelerate-nofib (c) [2013] The Accelerate Team Usage: accelerate-nofib [OPTIONS] Available backends:
prelude: (used seed -3868271924695893879)
Passed 1 1 |
On 21/11/13 23:59, Trevor L. McDonell wrote:
|
On 21/11/13 23:59, Trevor L. McDonell wrote:
This is the terminal output. I have attached a file containing the dump Cheers Neil $ cuda-memcheck accelerate-nofib --int64=False --select-tests=scanl1 -- accelerate-nofib (c) [2013] The Accelerate Team Usage: accelerate-nofib [OPTIONS] Available backends:
prelude: (used seed -1221479377516449484)
Passed 1 1 if (threadIdx.x == 0) { const int start = blockIdx.x * intervalSize; for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) { x0 = arrIn0_0[ix]; z0 = sdata0[last]; 0.08:cc: (3.0,"\206\203(\n\242G\fk\212\137\146V+\153\170\187") for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) { x0 = arrIn0_0[ix]; y0 = sdata0[last]; 0.08:cc: (3.0,"\NUL\a\CAN\FS\157\154\247$\234\215\ENQ\188g\156\DC1\246") if (threadIdx.x == 0) { const int start = blockIdx.x * intervalSize; for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) { x0 = arrBlk_0[ix]; z0 = sdata0[last]; 0.08:cc: waiting for nvcc... if (threadIdx.x == 0) { const int start = blockIdx.x * intervalSize; for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) { x0 = arrIn0_0[ix]; z0 = sdata0[last]; 0.14:cc: (3.0,"\EOTB\148\FS\188"\245\ETB\206a\136\ACK\164\174\RSr") for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) { x0 = arrIn0_0[ix]; y0 = sdata0[last]; 0.14:cc: (3.0,"t\EM)\ETB\SO\230\237U\203\160C1m\128U\132") if (threadIdx.x == 0) { const int start = blockIdx.x * intervalSize; for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) { x0 = arrBlk_0[ix]; z0 = sdata0[last]; 0.14:cc: (3.0,"\STXKX\147\ETXI(#\SOH\214\150B\153\253D\SO") for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) { 0.14:cc: (3.0,"\bQ\a\131\189l#\131\f\SIw\183\USP\157&3") for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) { if (!(sh_0 == -1)) { y0 = arrOut_0[jx0]; 0.15:cc: (3.0,"f$\243U\130\180\224[\146\251\138\181\235\161l\EM") if (threadIdx.x == 0) { const int start = blockIdx.x * intervalSize; for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) { x1 = arrIn1_0[v4]; x1 = z1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 
v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; z1 = sdata1[last]; 0.16:cc: (3.0,"\221cD\203&1\164\149+`I\192W\227\248An") for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) { x1 = arrIn1_0[v4]; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; y1 = sdata1[last]; 0.16:cc: (3.0,"$\139vCL\128YJ\146\188\US\152\181}\186d") if (threadIdx.x == 0) { const int start = blockIdx.x * intervalSize; for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) { x1 = arrBlk_1[ix]; x1 = z1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; const Word8 v0 = (Int32) 0 != x1; x1 = y1 | x1; z1 = sdata1[last]; 0.16:cc: (3.0,"`r~\159W\220\n\231\148[\252\232\SO\138L\151") for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) { arrOut_0[ix] = x0; 0.17:cc: waiting for nvcc... |
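Several of the dumped kernel fragments contain the loop `for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize)`. As an aid to reading the dump, the sketch below simulates that loop on the CPU in plain Haskell and checks that every index is visited exactly once. It assumes `gridSize` equals `blockDim * gridDim`, which the dump does not show; `threadIndices` and `allIndices` are hypothetical names.

```haskell
import Data.List (sort)

-- Indices visited by a single thread of the grid-stride loop
--   for (ix = blockDim * blockIdx + threadIdx; ix < shapeSize; ix += gridSize)
-- Assumption: gridSize = blockDim * gridDim (not visible in the dump).
threadIndices :: Int -> Int -> Int -> Int -> Int -> [Int]
threadIndices blockDim gridDim shapeSize blockIdx threadIdx =
  takeWhile (< shapeSize) (iterate (+ gridSize) start)
  where
    start    = blockDim * blockIdx + threadIdx
    gridSize = blockDim * gridDim

-- Indices visited by every thread of every block, in launch order.
allIndices :: Int -> Int -> Int -> [Int]
allIndices blockDim gridDim shapeSize =
  concat [ threadIndices blockDim gridDim shapeSize b t
         | b <- [0 .. gridDim - 1], t <- [0 .. blockDim - 1] ]

main :: IO ()
main = print (sort (allIndices 4 2 10) == [0 .. 9])  -- True
```

With 2 blocks of 4 threads and 10 elements, threads 0 and 1 each handle two elements and the rest handle one, so the loop itself covers inputs larger than the grid. A wrong result at larger sizes would therefore point at something other than this indexing pattern.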
Closing as outdated. Please open a new ticket with updated output if you have problems. |
Hi,
I tried to build the examples. This failed due to not finding a definition of "note" in Benchmark.hs. This was solved by adding import Criterion.IO.Printf to the import list.