CUDA backend does not work with Bumblebee/Optimus #92

neiljamieso · 2013-05-12T01:50:32Z

Hi,

I tried to build the examples. This failed due to not finding a definition of "note" in Benchmark.hs. This was solved by adding import Criterion.IO.Printf to the import list.

tmcdonell · 2013-05-12T03:43:54Z

This should be fixed by AccelerateHS/accelerate-examples@91250ca. Can you confirm this?

neiljamieso · 2013-05-12T05:08:48Z

Yes. Built fine. Lots of "fails" in running with the CUDA backend. I'm using Cuda 5 - not sure if this breaks stuff. Do you want to see the list?

neiljamieso · 2013-05-12T05:13:27Z

Most of the fails were of the form...
: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:153 (unhandled): CUDA Exception: unspecified launch failure

fold-sum and fold-2D-sum also failed but differently

fold-sum: Failed:

() : (-317.71545,-725.824)

fold-2d-sum: Failed:

0 : (10.181486,10.786726)
3 : (-12.445062,-5.0869923)
4 : (-22.740108,-34.520443)
5 : (7.2517667,5.283786)
6 : (-7.7953305,-19.361605)
7 : (16.353685,16.106562)
8 : (4.841938,3.6077766)
9 : (6.7518387,2.3453445)
10 : (-14.926775,-22.0668)
12 : (-8.844832,0.86133194)
13 : (31.82425,42.47469)
15 : (-12.590198,-8.076189)
16 : (2.4275239,-1.1079388)
18 : (-4.6298413,10.507795)
19 : (-5.7560434,-24.80141)
20 : (-27.520971,-58.446945)
21 : (-10.380567,-17.262444)
24 : (-5.6269426,-3.59577)
34 : (18.326572,15.697114)
36 : (-21.652311,-20.826466)
37 : (-0.69646883,-14.07115)
39 : (-1.9313966,-1.646287)
40 : (-15.114215,-4.3450966)
41 : (9.819355,4.646344)
42 : (-13.3020315,-18.24121)
43 : (-4.780798,-11.156574)
45 : (-13.909897,-19.179947)
46 : (-24.877073,-25.394434)
48 : (-10.917168,-7.612333)
49 : (8.59276,-10.744858)
50 : (-43.603035,-53.99748)
53 : (17.853306,21.356565)
55 : (-2.121977,6.3397703)
58 : (-4.2652583,-2.5864878)
59 : (-4.6431007,-3.1721497)
60 : (14.112302,15.44854)
61 : (-28.66971,-50.8897)
63 : (-14.38963,-20.192778)
67 : (-29.752752,-29.051735)
70 : (18.686342,27.751282)
76 : (-11.068267,-3.157248)
77 : (-30.1085,-35.691612)
78 : (17.865221,33.37813)
79 : (12.610696,10.771452)
80 : (13.958698,14.737689)
83 : (-51.858498,-58.283985)
85 : (12.039097,14.588022)
86 : (-14.114648,-17.417624)
89 : (23.790989,25.472948)
90 : (-18.82345,-17.08065)
91 : (2.463029,5.9130898)
92 : (4.0238266,5.5120225)
93 : (-8.8636265,-8.364969)
95 : (-16.640343,-13.33732)
96 : (10.943283,20.977047)
97 : (-2.759805,-10.179357)
99 : (7.461958,4.374811)
101 : (6.5351143,10.87258)
102 : (-8.328936,-3.353552)
103 : (-8.919393,-10.651541)
104 : (-8.599477,-32.173218)
105 : (-3.4648807,-12.457461)
107 : (-9.112293,-10.76436)
109 : (10.36928,19.196201)
111 : (-0.74972934,-8.263916)
112 : (-1.4251958,-1.3936005)
114 : (-5.7750616,-6.656393)
115 : (-4.1570673,-5.0010214)
118 : (-14.588455,-5.8673525)
122 : (-3.905911,1.3459797)
124 : (11.671464,13.249651)
128 : (24.242702,31.903507)
130 : (-5.312511,-8.756293)
131 : (-17.744507,-24.541887)
133 : (-3.0010543,-7.737555)
136 : (8.380546,11.387158)
138 : (11.308516,11.967691)
139 : (-17.7391,-29.652555)
141 : (-25.26024,-34.264626)
145 : (-11.910921,-14.598899)
147 : (18.361284,8.458666)
148 : (-2.0598116,9.742126)
151 : (-1.5615535,-6.330538)
155 : (-14.633401,-24.910007)
158 : (1.7897742,-3.3920808)
160 : (7.98956,9.146147)
161 : (-21.875072,-25.081263)
162 : (5.615722,20.186003)
163 : (9.19277,14.405633)
166 : (-4.6076007,0.6831827)
167 : (-10.567481,-4.0725036)
169 : (0.4859029,-6.1355286)
170 : (19.870667,19.815443)
172 : (6.0666904,7.6584425)
173 : (8.849107,0.12496734)
175 : (-11.274898,-16.4241)
177 : (-27.324623,-33.917286)
178 : (0.21815288,3.8251867)
179 : (-6.1652923,-4.998172)
180 : (-14.112642,-19.027935)
181 : (-2.080636,6.853819e-3)
183 : (3.6447208,-6.9173365)
187 : (-27.273254,-38.26088)
189 : (-9.826919,-14.5337925)
190 : (1.3126237,0.9761648)
191 : (-4.1650763,-1.852829)
193 : (18.610937,22.746304)
194 : (-4.691451,-0.86483383)
196 : (-4.7458477,-23.575771)
197 : (-2.7342944,-10.165984)
199 : (-11.298469,-18.151875)
200 : (5.3247147,-4.0813465)
201 : (14.916756,23.434582)
203 : (-0.1067512,4.8686438)
204 : (-14.124139,-4.513797)
206 : (-7.185062,-0.58614635)
207 : (-19.701935,-20.333096)
208 : (-11.467451,-7.518866)
210 : (31.49854,38.85581)
212 : (-16.014204,-17.766535)
216 : (-18.965578,-29.654585)
220 : (-0.17519975,-5.1846743)
225 : (16.0454,19.740955)
226 : (-0.67587143,1.3499918)
229 : (-21.621109,-23.055359)
231 : (1.533406,0.9220514)
232 : (1.5521168,-2.942934)
235 : (-26.18992,-28.304138)
237 : (-12.360111,-14.813786)
244 : (-26.788136,-26.856113)
245 : (-11.375093,-6.4627395)
249 : (-14.0135765,-18.813738)
251 : (-28.578781,-39.254063)
261 : (23.480045,28.535007)
263 : (-20.27542,-30.240715)
264 : (1.0410566,5.445823)
265 : (-12.174866,-11.87295)
270 : (-2.2434764,1.3028297)
271 : (-5.3730717,-7.069026)
272 : (-32.547344,-40.939163)
273 : (-11.036853,-14.617073)
274 : (1.5726653,7.1989527)
276 : (13.667664,-4.6318626)
277 : (-19.315035,-14.617573)
279 : (0.14692748,6.2511544)
281 : (-0.6385382,0.5433495)
282 : (0.13369226,-2.5549994)
285 : (-25.613811,-23.304722)
286 : (11.909087,6.9073195)
287 : (11.177615,14.907998)
289 : (8.337317,10.699486)
291 : (-6.394571,-2.2123995)
293 : (-12.401189,-4.961336)
294 : (20.566023,22.415432)
299 : (6.3981833,14.163654)
301 : (-15.557607,-12.6597595)
308 : (3.6762142,9.144186)
310 : (0.26484996,-7.8996334)
312 : (3.0426567,7.4979715)
314 : (14.884919,14.266132)

tmcdonell · 2013-05-12T05:26:33Z

Hmm... what card are you running on, and what compute capability is it?
The internal error especially is a bit worrying; I haven't seen that one in a while. The fold errors at least should be easier to debug.

neiljamieso · 2013-05-12T05:32:01Z

On 12/05/13 17:26, Trevor L. McDonell wrote:

Hmm... what card are you running on, and what compute capability is it?
The internal error especially is a bit worrying; I haven't seen that
one in a while. The fold errors at least should be easier to debug.

—
Reply to this email directly or view it on GitHub
#92 (comment).

K1000M and using optirun to do the switching. It works fine with all
the CUDA examples from nvidia. Ah! But maybe it is not working if you
de-attach from the primary caller (as I suspect you do in the async
functions). I will check that out - rings a bell from the bumblebee
documentation.

Neil

tmcdonell · 2013-05-12T05:52:38Z

Yes, we do need to push and pop the CUDA context; I thought that that was enough, but my reading of the CUDA docs might be incorrect (and; I had not even heard of optirun before now!)

neiljamieso · 2013-05-12T06:37:58Z

Optirun is part of the bumblebee project to allow use of the Optimus
GPUs under Linux. As it is not provided by nvidia it it possible it
brings it's own issues. As I say though all the nvidia examples seem to
run fine under it.

On 12/05/13 17:52, Trevor L. McDonell wrote:

Yes, we do need to push and pop the CUDA context; I thought that that
was enough, but my reading of the CUDA docs might be incorrect (and; I
had not even heard of |optirun| before now!)

—
Reply to this email directly or view it on GitHub
#92 (comment).

tmcdonell · 2013-05-12T23:09:58Z

Actually, does optimus aim to allow dynamic switching between a pair of low/high power GPUs? I have a similar problem with this dynamic switching (usually) not working under Mac OS X (#67), even with the NVIDIA drivers, although it does seem to work with the NVIDIA examples.

Does it work if you disable the switching and only use the fast GPU?

neiljamieso · 2013-05-13T04:30:29Z

Hi Trevor,

Yes. The Optimus is an NVIDIA design which uses the onboard Intel graphics
most of the time. The display is ALWAYS done by the Intel unit, but the
rendering is directed to the NVIDIA card on a switchable basis. Bumblebee
is an opensource module to allow this switching on Linux (as NVIDIA neglect
to provide this themselves). Choosing to run a programme using the NVIDIA
card is done by running the programme under Optirun, so I enter:
$ optirun ./accelerate-examples

I have been thinking, and the errors I am getting now are of language
errors from the CUDA system. As I say ALL the NVIDIA code runs fine
run under optirun - so I wonder if this is about changes to the CUDA
language with CUDA 5. Have you had success with CUDA 5 and accelerate in
other hardwares?

Cheers, Neil

On Mon, May 13, 2013 at 11:10 AM, Trevor L. McDonell <
notifications@github.com> wrote:

Actually, does optimus aim to allow dynamic switching between a pair of
low/high power GPUs? I have a similar problem with this dynamic switching
(usually) not working under Mac OS X (#67 #67),
even with the NVIDIA drivers, although it does seem to work with the NVIDIA
examples.

Does it work if you disable the switching and only use the fast GPU?

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/92#issuecomment-17787198
.

tmcdonell · 2013-05-13T09:27:58Z

Hi Neil,

I am using CUDA 5 and it has worked for me --- this is on Mac OS X and Ubuntu. It might make a difference if you're on a different linux distribution?

What do you mean by language errors from the CUDA system? Different errors from the earlier "unspecified launch failure" ?

Try changing this from forkOS to forkOn 0 and let me know what happens?
https://github.com/AccelerateHS/accelerate-cuda/blob/master/Data/Array/Accelerate/CUDA/Async.hs#L36

neiljamieso · 2013-05-13T23:51:35Z

Sorry Trevor. The "language" error was a language error of my own - due to
wrapping at the edge of the terminal window. :-((

Will try your suggestion when I get home.

Neil

On Mon, May 13, 2013 at 9:28 PM, Trevor L. McDonell <
notifications@github.com> wrote:

Hi Neil,

I am using CUDA 5 and it has worked for me --- this is on Mac OS X and
Ubuntu. It might make a difference if you're on a different linux
distribution?

What do you mean by language errors from the CUDA system? Different errors
from the earlier "unspecified launch failure" ?

Try changing this from forkOS to forkOn 0 and let me know what happens?

https://github.com/AccelerateHS/accelerate-cuda/blob/master/Data/Array/Accelerate/CUDA/Async.hs#L36

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/92#issuecomment-17801317
.

neiljamieso · 2013-05-14T01:19:12Z

Hi Trevor,

I have had another thought. Debian Wheezy (my OS) comes with gcc 4.7 as
standard. CUDA 5 only works with gcc 4.6 (I tried 4.7). I thought of this
last night and rebuild accelerate-cuda and accelerate-examples with gcc
pointing to 4.6 (and g++ the same). This didn't make any difference, but I
wonder if I need to rebuild the whole of haskell on gcc 4.6.

What is the default version of gcc on your OS?

Cheers Neil

tmcdonell · 2013-05-14T02:03:58Z

On my Mac it is gcc-4.2, but this is Apple's own version so I am not sure if that is comparable. The Ubuntu 12.04 box uses gcc-4.6.3.

Adding the flag -ddump-gc will give rather chatty output of whenever it tries to do memory allocations. Since this is quite fine grained, it might give a few more indications about what is going on (failed on the first attempt, worked for a while and then failed, etc)

neiljamieso · 2013-05-14T06:07:40Z

Hi Trevor,

Did make a difference. This is the output:

neil@debian-neil:~/.cabal/bin$ optirun bash
neil@debian-neil:~/.cabal/bin$ ./accelerate-examples --cuda -k
running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Failed:
 >>> () : (-284.77808,-299.1781)

fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Failed:
 >>> 4 : (3.5994253,4.1811037)
 >>> 5 : (-9.134442,-18.42069)
 >>> 6 : (-3.1958194,-7.940378)
 >>> 7 : (5.7998953,19.296043)
 >>> 9 : (14.701389,23.411243)
 >>> 12 : (26.571274,38.411602)
 >>> 14 : (23.842213,23.651949)
 >>> 15 : (-9.196165,-0.6621127)
 >>> 19 : (43.498287,45.15085)
 >>> 20 : (13.474283,14.403748)
 >>> 21 : (-11.930797,-9.856017)
 >>> 22 : (8.05154,8.746618)
 >>> 23 : (25.273453,25.40266)
 >>> 24 : (5.668702,7.682753)
 >>> 25 : (-23.540642,-24.084503)
 >>> 26 : (7.730505,3.7250352)
 >>> 28 : (-13.682002,-22.007523)
 >>> 30 : (22.153667,29.94641)
 >>> 32 : (3.9512172,4.8625793)
 >>> 34 : (-20.773705,-23.154194)
 >>> 35 : (14.610652,17.81879)
 >>> 36 : (-6.893841,-5.690979)
 >>> 38 : (3.470799,9.4239e-2)
 >>> 40 : (17.497482,27.669067)
 >>> 41 : (-3.0244708e-2,-3.8516002)
 >>> 43 : (19.843216,26.517456)
 >>> 44 : (5.050486,8.706543)
 >>> 47 : (-5.443891,-5.188139)
 >>> 49 : (-8.316223,-12.395588)
 >>> 51 : (5.367283,23.022243)
 >>> 52 : (11.321204,6.605723)
 >>> 53 : (16.014208,17.675938)
 >>> 57 : (-22.71127,-28.897242)
 >>> 60 : (-2.7958093,3.0328588)
 >>> 61 : (14.372042,10.27017)
 >>> 63 : (-13.966523,-16.551018)
 >>> 65 : (-2.3377113,-8.886295)
 >>> 66 : (0.41673332,4.9110966)
 >>> 67 : (-3.150734,1.390254)
 >>> 68 : (-9.262151,-4.612889)
 >>> 70 : (1.1192223,-0.87473106)
 >>> 71 : (-16.735855,-13.542116)
 >>> 72 : (-2.7853413,-3.259285)
 >>> 75 : (-0.42108774,12.822178)
 >>> 76 : (37.315483,58.080196)
 >>> 77 : (21.378624,24.565968)
 >>> 79 : (17.399918,11.301307)
 >>> 80 : (6.1325307,-3.117681)
 >>> 82 : (-25.688484,-23.890837)
 >>> 84 : (-29.327036,-46.779266)
 >>> 85 : (-12.640158,-17.59966)
 >>> 86 : (14.217806,22.999573)
 >>> 87 : (3.0769944,0.67498803)
 >>> 88 : (14.598545,13.440449)
 >>> 94 : (-18.738943,0.6576848)
 >>> 98 : (-1.2732513,-9.02783)
 >>> 100 : (14.017002,22.866009)
 >>> 102 : (10.585675,-0.76270866)
 >>> 103 : (-22.687687,-24.832624)
 >>> 105 : (13.726986,8.545394)
 >>> 108 : (18.212643,22.956026)
 >>> 110 : (-14.852369,-22.597391)
 >>> 111 : (2.3865306,5.926875)
 >>> 112 : (3.0377512,-1.880888)
 >>> 114 : (-10.134539,-9.8238)
 >>> 115 : (-4.3836536,3.3319654)
 >>> 116 : (-5.7152805,-14.443269)
 >>> 117 : (8.012011,7.6332164)
 >>> 118 : (-17.265642,-15.1257715)
 >>> 119 : (12.728009,14.087517)
 >>> 120 : (-18.342087,-23.154064)
 >>> 121 : (-21.715904,-17.897583)
 >>> 123 : (-13.022339,-12.231892)
 >>> 124 : (16.29696,30.115715)
 >>> 126 : (8.191839,16.790535)
 >>> 127 : (7.316367,14.373995)
 >>> 128 : (23.410019,22.88608)
 >>> 129 : (10.068765,-24.64301)
 >>> 131 : (-26.669355,-26.25417)
 >>> 132 : (2.4118686,-3.5020428)
 >>> 133 : (-13.115518,-21.87509)
 >>> 134 : (12.896856,12.63337)
 >>> 136 : (13.352133,12.780149)
 >>> 137 : (24.687658,17.437037)
 >>> 140 : (4.4784513,-8.002885)
 >>> 141 : (19.64967,21.850222)
 >>> 142 : (-17.395033,-11.799833)
 >>> 144 : (4.605325,9.768799)
 >>> 149 : (-27.127146,-31.195862)
 >>> 150 : (-20.15325,-38.91357)
 >>> 151 : (-11.284405,-7.634466)
 >>> 153 : (1.4470301,2.2499762)
 >>> 155 : (17.06059,23.061432)
 >>> 157 : (13.256235,9.830044)
 >>> 158 : (8.65885e-2,15.133558)
 >>> 161 : (19.461996,30.09988)
 >>> 162 : (8.695209e-2,1.2758055)
 >>> 164 : (0.23431987,-5.4021072)
 >>> 165 : (-8.806317,-7.660516)
 >>> 167 : (2.9375281,-1.7019806)
 >>> 168 : (4.8822374,1.7404442)
 >>> 169 : (-6.0983124,-6.616735)
 >>> 170 : (-10.859095,-24.070465)
 >>> 171 : (-30.173882,-38.876015)
 >>> 172 : (7.5324316,10.573803)
 >>> 173 : (-7.9830656,-0.61189365)
 >>> 174 : (3.8499007,2.8259583)
 >>> 175 : (9.863973,18.671043)
 >>> 176 : (1.5010693,7.730674)
 >>> 177 : (-19.172495,-15.866618)
 >>> 178 : (10.258595,11.646437)
 >>> 179 : (-36.72372,-32.991608)
 >>> 180 : (4.0878096,4.3566303)
 >>> 183 : (-16.212082,-12.850005)
 >>> 186 : (20.656956,44.957047)
 >>> 187 : (9.899384,8.580212)
 >>> 188 : (24.487984,24.992609)
 >>> 194 : (16.086586,6.133008)
 >>> 195 : (-12.79052,-14.317617)
 >>> 200 : (4.5302505,8.308535)
 >>> 201 : (-10.723634,-23.400677)
 >>> 202 : (-4.187149,-15.145685)
 >>> 203 : (-15.959601,-16.193207)
 >>> 204 : (27.673164,32.605988)
 >>> 205 : (-22.693754,-33.882385)
 >>> 206 : (-0.7072872,-1.9263825)
 >>> 208 : (-2.4695814,-0.21775436)
 >>> 209 : (-7.441179,-7.886807)
 >>> 216 : (-26.625347,-34.00032)
 >>> 217 : (-12.935532,-12.696256)
 >>> 219 : (10.233142,16.826408)
 >>> 223 : (-20.659527,-19.133957)
 >>> 225 : (4.6232724,-5.518243)
 >>> 226 : (-3.6734939e-3,-0.32396984)
 >>> 228 : (31.582458,35.58126)
 >>> 229 : (-0.7545265,-10.300518)
 >>> 231 : (12.414625,15.020456)
 >>> 234 : (10.174679,19.857052)
 >>> 235 : (-13.687687,-11.906177)
 >>> 239 : (-16.81191,-17.177837)
 >>> 241 : (5.6338625,7.43606)
 >>> 246 : (-6.5156856,-9.638809)
 >>> 247 : (-0.42078322,-4.191985)
 >>> 249 : (11.335211,10.828511)
 >>> 252 : (-0.8734268,-16.709965)
 >>> 253 : (2.7642574,5.442359)
 >>> 255 : (-15.736735,-13.98167)
 >>> 257 : (5.946913,2.0609694)
 >>> 258 : (-6.6435785,-8.290497)
 >>> 259 : (13.248286,15.020397)
 >>> 260 : (40.213238,62.449997)
 >>> 261 : (-1.8538256,-4.91119)
 >>> 266 : (10.244856,6.945044)
 >>> 268 : (-13.880142,-21.150314)
 >>> 269 : (14.314802,14.349737)
 >>> 270 : (-27.502745,-33.003326)
 >>> 271 : (10.64012,6.457108)
 >>> 272 : (-16.236614,-21.558899)
 >>> 273 : (20.561716,24.363443)
 >>> 274 : (-10.97512,-6.042589)
 >>> 280 : (-12.273643,-13.009692)
 >>> 283 : (3.3773353,8.302713)
 >>> 286 : (-1.6639676,-3.079587)
 >>> 287 : (-21.63964,-23.37448)
 >>> 290 : (-14.440636,-24.584656)
 >>> 291 : (0.17262441,-1.6445827)
 >>> 294 : (19.45585,29.862196)
 >>> 298 : (2.3329654,8.237259)
 >>> 303 : (15.277465,12.724495)
 >>> 304 : (-10.626967,-18.734402)
 >>> 309 : (-11.389035,-6.8129835)
 >>> 310 : (-7.8077154,-9.264032)
 >>> 311 : (3.3524702,-7.6005263)
 >>> 313 : (22.357534,21.090479)
 >>> 314 : (14.302358,4.895173)
 >>> 315 : (-32.722397,-41.946712)

fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

stencil-1D: Ok
stencil-2D: Ok
stencil-3D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

stencil-3x3-cross: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

stencil-3x3-pair: Ok
stencil2-2D: Ok
permute-hist: Ok
backpermute-reverse: Ok
backpermute-transpose: Ok
init: Ok
tail: Ok
take: Ok
drop: Ok
slit: Ok
gather: Ok
gather-if: Ok
scatter: Ok
scatter-if: Ok
sasum: Failed:
 >>> () : (50137.895,63516.633)

saxpy: Ok
dotp: Failed:
 >>> () : (120.643745,144.3627)

filter: Ok
smvm: Ok
black-scholes: Ok
radixsort: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Ok
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

slices: Ok
slices: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok

On 13/05/13 21:28, Trevor L. McDonell wrote:

Hi Neil,

I am using CUDA 5 and it has worked for me --- this is on Mac OS X and
Ubuntu. It might make a difference if you're on a different linux
distribution?

What do you mean by language errors from the CUDA system? Different
errors from the earlier "unspecified launch failure" ?

Try changing this from |forkOS| to |forkOn 0| and let me know what
happens?
https://github.com/AccelerateHS/accelerate-cuda/blob/master/Data/Array/Accelerate/CUDA/Async.hs#L36

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-05-15T09:36:11Z

Hullo Trevor,

SUCCESS!

I rebuilt all the accelerate packages (with the change to forkOn in
place) and the accelerate-examples all run perfectly!

Not sure how to interpret the benchmarks but am very pleased to have it
going.

May I also say that the code is beautiful. Don't understand it all yet,
but it is very aesthetically pleasing what I have read.

Neil

neiljamieso · 2013-05-15T10:16:23Z

OOPS! Duh! I didn't turn on --cuda, so of course they all looked ok.

Sorry. No change with cuda backend. :-((

All this regarding accelerate-examples of course.

Neil

tmcdonell · 2013-05-28T13:00:45Z

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

neiljamieso · 2013-05-29T10:31:58Z

Will do Trevor.

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create
a setup that threw an invalid context error, so the fix for that might
help in your situation as well.

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-05-29T10:44:18Z

I got this error in trying to compile the examples:

[ 6 of 12] Compiling Test.IndexSpace (
examples/nofib/Test/IndexSpace.hs,
dist/build/accelerate-nofib/accelerate-nofib-tmp/Test/IndexSpace.o )

examples/nofib/Test/IndexSpace.hs:170:71:
Ambiguous occurrence even' It could refer to eitherP.even',
imported from Prelude' at examples/nofib/Test/IndexSpace.hs:6:1-60 (and originally defined inGHC.Real')
or A.even', imported fromData.Array.Accelerate' at
examples/nofib/Test/IndexSpace.hs:20:1-60
(and originally defined in
`accelerate-0.14.0.0:Data.Array.Accelerate.Language')

I'll have a look and change to A.even as I assume that's what you meant.

Neil.

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create
a setup that threw an invalid context error, so the fix for that might
help in your situation as well.

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-05-29T10:59:15Z

This fixed it:
-- gatherIfAcc even' mapv maskv defaultv xs .==. gatherIfRef
even mapv maskv defaultv xs
gatherIfAcc even' mapv maskv defaultv xs .==. gatherIfRef
P.even mapv maskv defaultv xs

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create
a setup that threw an invalid context error, so the fix for that might
help in your situation as well.

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-05-29T11:00:20Z

Another one:
[18 of 36] Compiling Gather (
examples/tests/primitives/Gather.hs,
dist/build/accelerate-examples/accelerate-examples-tmp/Gather.o )

examples/tests/primitives/Gather.hs:41:11:
Ambiguous occurrence even' It could refer to eitherAcc.even',
imported from Data.Array.Accelerate' at examples/tests/primitives/Gather.hs:9:1-48 (and originally defined in accelerate-0.14.0.0:Data.Array.Accelerate.Language')
or P.even', imported fromPrelude' at
examples/tests/primitives/Gather.hs:10:1-33
(and originally defined in `GHC.Real')
Failed to install accelerate-examples-0.14.0.0
c

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create
a setup that threw an invalid context error, so the fix for that might
help in your situation as well.

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-05-29T11:02:32Z

And:
[19 of 36] Compiling Scatter (
examples/tests/primitives/Scatter.hs,
dist/build/accelerate-examples/accelerate-examples-tmp/Scatter.o )

examples/tests/primitives/Scatter.hs:52:11:
Ambiguous occurrence even' It could refer to eitherP.even',
imported from Prelude' at examples/tests/primitives/Scatter.hs:16:1-44 (and originally defined inGHC.Real')
or Acc.even', imported fromData.Array.Accelerate' at
examples/tests/primitives/Scatter.hs:17:1-59
(and originally defined in
`accelerate-0.14.0.0:Data.Array.Accelerate.Language')

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create
a setup that threw an invalid context error, so the fix for that might
help in your situation as well.

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-05-29T11:04:06Z

Both fixed same way, and all now compile...Lets see how they run!

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create
a setup that threw an invalid context error, so the fix for that might
help in your situation as well.

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-05-29T11:09:04Z

This is the output...I use Ctrl-C during the 4th slices as it seemed to
hang.

neil@debian-neil:~/.cabal/bin$ optirun --no-xorg ./accelerate-examples
--cuda -k
running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Ok
map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Failed:

() : (-21.361326,32.731934)

fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Failed:

1 : (3.9905946,12.912712)
3 : (4.3853145,6.6357403)
4 : (8.841903,3.564476)
5 : (22.403717,22.856863)
6 : (-7.058512,-0.8101158)
7 : (-13.209917,-14.428578)
8 : (3.6516001,3.9791288)
9 : (1.5006628,1.639061)
10 : (8.085807,10.32614)
11 : (11.110486,13.271563)
12 : (11.344211,24.075565)
13 : (5.494232,6.638853)
15 : (-18.813566,-28.975445)
17 : (-10.612726,-11.403031)
19 : (30.455154,48.39125)
21 : (0.6439582,-8.811903e-2)
23 : (0.44115293,1.8800209)
25 : (-1.081647,4.433939)
29 : (3.649135,0.9225111)
30 : (-3.5161483,0.26748943)
31 : (6.247751,4.1066437)
33 : (-19.144558,-21.607367)
34 : (14.241796,-0.3949709)
35 : (-6.4786077,-4.0578346)
36 : (1.6614412,9.047534)
37 : (-9.929752e-2,-8.920741)
38 : (-0.5181453,-14.03962)
41 : (17.492886,5.483637)
42 : (1.5826802,1.5337367)
43 : (-22.710932,-26.35552)
45 : (7.819425,8.852381)
47 : (3.8250275,-1.1689825)
48 : (31.711973,36.747433)
49 : (5.4925137,10.268168)
52 : (-10.457833,-12.00074)
53 : (22.555317,34.491005)
54 : (-13.917394,-17.875317)
57 : (3.446729,-6.599143)
60 : (-9.107978,3.4590158)
62 : (-24.056997,-29.912)
63 : (2.436757,3.1981812)
64 : (-1.2618066,1.2730389)
68 : (28.439875,36.614067)
70 : (0.5847907,2.280851)
74 : (-2.3531268,-4.332817)
75 : (4.663379,8.0118885)
78 : (7.195462,14.593959)
79 : (-5.2660117,-12.242489)
80 : (-15.816689,-18.658928)
81 : (12.112614,8.826111)
82 : (14.143523,18.241121)
83 : (-25.847208,-30.473446)
84 : (11.379544,4.809246)
86 : (15.708036,29.72469)
87 : (4.8327255,3.5389404)
91 : (-12.49356,-7.1337805)
92 : (-3.2796116,-1.6790586)
93 : (-8.711067,-17.377827)
94 : (-21.488873,-14.433965)
97 : (2.3867311,-1.5279217)
98 : (5.4814205,-2.1296844)
99 : (-3.566555,-5.9053173)
100 : (13.362963,10.301908)
101 : (1.7501,-3.362393)
102 : (-1.8447578,-11.294733)
103 : (3.365004,10.789146)
105 : (-34.87906,-40.80436)
106 : (-12.686344,-15.8895645)
107 : (9.183949,7.9775457)
110 : (-22.573433,-12.892656)
112 : (12.944003,17.68826)
113 : (-20.14838,-21.692518)
114 : (-0.13564283,9.673411)
117 : (-34.568615,-36.956146)
118 : (-9.420436,-4.6167736)
125 : (-3.868143,-6.226729)
126 : (-24.039621,-25.80162)
127 : (-3.2252026,-4.7092633)
128 : (-9.503313,-5.4460926)
133 : (3.8282223,-1.7425342)
134 : (14.974166,34.86072)
135 : (-19.844137,-21.048025)
137 : (23.145348,28.191246)
139 : (3.5891905,9.721224)
142 : (0.5852886,1.3669834)
144 : (-5.7431865,5.893752)
145 : (13.187965,12.4972515)
147 : (-2.4032655,-9.138004)
149 : (22.993021,24.544422)
157 : (-5.1877947,-6.014868)
159 : (-17.272867,-16.517113)
160 : (-29.876955,-40.23668)
161 : (-16.822813,-12.472164)
162 : (-0.6595129,0.25787354)
164 : (35.51503,35.609394)
165 : (-23.43607,-30.415709)
166 : (9.842515,2.944377)
167 : (24.214361,29.503002)
168 : (-23.579342,-39.842453)
170 : (11.822997,18.28223)
171 : (16.668018,21.228556)
173 : (-18.572968,-19.739588)
174 : (5.4933777,-0.5577693)
175 : (1.9450028,4.1181507)
177 : (-19.47439,-19.676298)
179 : (-12.430883,-16.573708)
182 : (-4.7336774,-9.151844)
184 : (-2.7646563,9.710753)
185 : (22.779469,20.718946)
187 : (-25.819782,-30.222664)
188 : (18.511953,21.633574)
189 : (-19.708344,-23.975298)
191 : (17.08098,24.394087)
193 : (-3.0513897,-0.6075697)
195 : (-8.187313,-5.181074)
197 : (33.65944,40.2564)
198 : (-0.64326054,-4.086837)
199 : (-10.554681,-12.706717)
200 : (18.93743,29.3177)
202 : (-5.301973,-15.005705)
208 : (-7.2508016,-14.100331)
209 : (-19.64536,-23.58665)
211 : (-3.6678975,4.9338455)
214 : (-4.1849194,-7.2833357)
215 : (-1.1494977,-7.4395123)
217 : (-2.6624355,11.72216)
218 : (-6.4984765,-9.903734)
222 : (0.2119419,-2.0705266)
226 : (-4.751293,11.307108)
231 : (13.396966,13.482294)
232 : (-10.148484,-9.455285)
233 : (-11.613926,-30.141973)
235 : (-4.1457195,-11.701864)
236 : (22.841429,27.695446)
237 : (20.703121,28.321404)
238 : (2.2251,-9.911165)
240 : (4.6583896,13.250011)
242 : (0.56912243,1.7683926)
248 : (-13.757292,-6.036418)
250 : (-2.0742264,-11.74327)
251 : (-22.361734,-21.731167)
252 : (-4.5171075,-6.9133253)
258 : (-15.887733,-15.204248)
259 : (13.085469,7.5854363)
260 : (17.63313,21.100315)
261 : (7.1418476,0.2580099)
262 : (-14.919332,-23.728527)
263 : (24.858322,28.005262)
266 : (-0.1598835,1.6914234)
267 : (-11.6540985,-19.327158)
270 : (-9.534692,-15.585428)
273 : (23.928104,34.40332)
276 : (12.787605,5.514979)
279 : (0.36071712,-6.126135)
281 : (-6.324025,-4.401108)
284 : (4.8829827,6.8221273)
285 : (-20.047634,-17.415882)
287 : (-6.266363,-7.5843716)
292 : (31.943773,28.52203)
294 : (4.4730716,17.863426)
295 : (-24.903772,-31.832272)
296 : (23.457853,27.188269)
298 : (-5.066526e-2,3.090138)
299 : (-12.440723,-12.220831)
300 : (10.800417,2.0174663)
302 : (21.627502,25.618221)
304 : (-19.292229,-21.6833)
307 : (-7.7303686,-6.4778433)
308 : (16.438334,17.45433)
309 : (18.270615,16.974281)
313 : (-18.940536,-14.294319)
315 : (1.1139888,-9.944632)

fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-1D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-3D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-3x3-cross: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-3x3-pair: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil2-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

permute-hist: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

backpermute-reverse: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

backpermute-transpose: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

init: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

tail: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

take: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

drop: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

slit: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

gather: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

gather-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

scatter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

scatter-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

sasum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

saxpy: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

dotp: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

filter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

smvm: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

black-scholes: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

radixsort: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

slices: ^C[ 3364.241184] [WARN]Received Interrupt signal.
Failed: user interrupt
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok
accelerate-examples: forkOS_entry: interrupted
neil@debian-neil:~/.cabal/bin$

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create
a setup that threw an invalid context error, so the fix for that might
help in your situation as well.

—
Reply to this email directly or view it on GitHub
#92 (comment).

tmcdonell · 2013-05-30T00:54:47Z

Oops, sorry for all the compilation failures with even. I'm currently hacking on nofib to produce this test case for the context bug, but the local changes aren't ready to push upstream.

Are these the same errors you had initially? This looks more like what we had after the hack to replace forkOS with forkOn 0? That would at least be some progress!

For the "unspecified launch failure errors", we might be trying to launch a kernel that requires more resources than your card provides. Since I haven't tested on an Optimus card before, there might be bugs in the occupancy calculator code.

Try the following?

import Prelude                          as P
import Data.Array.Accelerate            as A
import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float)
xs = use $ fromList (Z:.10) [0..]
ys = use $ fromList (Z:.10) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys
  = A.fold (+) 0
  $ A.zipWith (*) xs ys

main :: IO ()
main
  = withArgs ["-ddump-cc", "-ddump-gc", "-ddump-exec", "-dverbose"]
  $ print
  $ run (dotp xs ys)

You'll need to have installed accelerate-cuda with the -fdebug flag, or just run in ghci using the script in the utils directory (you might have to edit it a bit depending on where you have checked out the individual repositories).

neiljamieso · 2013-05-30T07:41:18Z

Thanks Trevor,

I'll try that. The fix you did for the invalid context (last email) -
was that in cuda or accelerate-cuda? I only rebuilt accelerate-cuda
(and dependencies).

The "unspecified launch failure errors" were in the "forkOS" version.
The "forkOn 0" version had the context errors.

I did put the forkOn back in, but not sure I rebuilt the whole sequence
properly afterwards.

Cheers, Neil

On 30/05/13 10:54, Trevor L. McDonell wrote:

Oops, sorry for all the compilation failures with |even|. I'm
currently hacking on |nofib| to produce this test case for the context
bug, but the local changes aren't ready to push upstream.

Are these the same errors you had initially? This looks more like what
we had after the hack to replace |forkOS| with |forkOn 0|? That would
at least be some progress!

For the "unspecified launch failure errors", we might be trying to
launch a kernel that requires more resources than your card provides.
Since I haven't tested on an Optimus card before, there might be bugs
in the occupancy calculator code.

Try the following?

import Prelude as P
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float)
xs = use $ fromList (Z:.10) [0..]
ys = use $ fromList (Z:.10) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys
= A.fold (+) 0
$ A.zipWith (*) xs ys

main :: IO ()
main
= withArgs ["-ddump-cc", "-ddump-gc", "-ddump-exec", "-dverbose"]
$ print
$ run (dotp xs ys)

You'll need to have installed |accelerate-cuda| with the |-fdebug|
flag, or just run in |ghci| using the script in the |utils| directory
(you might have to edit it a bit depending on where you have checked
out the individual repositories).

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-05-30T08:21:44Z

Hi Trevor,

This is the output. Are you able to make sense of it? Certainly seems
to have worked!

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda -k
-- -fdump-cc
0.03:gc: initialise default context
0.07:gc: initialise context #0x00007f8f1c00b4f0
Device 0: Quadro K1000M (compute capatability 3.0)
1 multiprocessors @ 850.50 MHz (192 cores), 2 GB global memory
0.07:gc: push context: #0x00007f8f1c00b4f0
0.07:gc: initialise CUDA state
0.07:gc: initialise memory table
0.07:cc: initialise kernel table
0.07:cc: persist/restore: 39 entries
0.08:gc: lookup/not found: Array #32
0.08:gc: useArray/malloc: 40 B
0.08:gc: malloc/new
0.08:gc: insert: Array #32
0.08:gc: lookup/not found: Array #31
0.08:gc: useArray/malloc: 40 B
0.08:gc: malloc/new
0.08:gc: insert: Array #31
0.08:cc: (3.0,"\178\140cp$\ACK\226\229\195l\184eF`f3")
#include <accelerate_cuda_extras.h>
extern "C" global void foldAll(const DIM1 shIn0, const float*
restrict arrIn0_a0, const DIM1 shIn1, const float* restrict
arrIn1_a0, const DIM0 shOut, float* restrict arrOut_a0)
{
extern volatile shared float sdata0[];
float x0;
float y0;
const Int64 sh0 = min((Int64) shIn0, (Int64) shIn1);
const int shapeSize = sh0;
const int gridSize = blockDim.x * gridDim.x;
int ix = blockDim.x * blockIdx.x + threadIdx.x;

if (ix < shapeSize) {
const Int64 v2 = ix;
const int v3 = toIndex(shIn0, shape(v2));
const int v4 = toIndex(shIn1, shape(v2));

y0 = arrIn0_a0[v3] * arrIn1_a0[v4];
for (ix += gridSize; ix < shapeSize; ix += gridSize) {
const Int64 v2 = ix;
const int v3 = toIndex(shIn0, shape(v2));
const int v4 = toIndex(shIn1, shape(v2));

x0 = arrIn0_a0[v3] * arrIn1_a0[v4];
y0 = x0 + y0;
}
}
sdata0[threadIdx.x] = y0;
__syncthreads();
ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x);
if (threadIdx.x + 512 < ix) {
x0 = sdata0[threadIdx.x + 512];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 256 < ix) {
x0 = sdata0[threadIdx.x + 256];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 128 < ix) {
x0 = sdata0[threadIdx.x + 128];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 64 < ix) {
x0 = sdata0[threadIdx.x + 64];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x < 32) {
if (threadIdx.x + 32 < ix) {
x0 = sdata0[threadIdx.x + 32];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 16 < ix) {
x0 = sdata0[threadIdx.x + 16];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 8 < ix) {
x0 = sdata0[threadIdx.x + 8];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 4 < ix) {
x0 = sdata0[threadIdx.x + 4];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 2 < ix) {
x0 = sdata0[threadIdx.x + 2];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 1 < ix) {
x0 = sdata0[threadIdx.x + 1];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
}
if (threadIdx.x == 0) {
if (shapeSize > 0) {
if (gridDim.x == 1) {
x0 = 0.0f;
y0 = x0 + y0;
}
arrOut_a0[blockIdx.x] = y0;
} else {
arrOut_a0[blockIdx.x] = 0.0f;
}
}
}

0.08:cc: (3.0,"\209\181\149\254\136cnX\DEL\171\b\219\160\133\133:")
#include <accelerate_cuda_extras.h>
extern "C" global void foldAll(const DIM1 shIn0, const float*
restrict arrIn0_a0, const DIM1 shIn1, const float* restrict
arrIn1_a0, const DIM0 shOut, float* restrict arrOut_a0, const DIM1
shRec, const float* restrict arrRec_a0)
{
extern volatile shared float sdata0[];
float x0;
float y0;
const Int64 sh0 = shRec;
const int shapeSize = sh0;
const int gridSize = blockDim.x * gridDim.x;
int ix = blockDim.x * blockIdx.x + threadIdx.x;

if (ix < shapeSize) {
y0 = arrRec_a0[ix];
for (ix += gridSize; ix < shapeSize; ix += gridSize) {
x0 = arrRec_a0[ix];
y0 = x0 + y0;
}
}
sdata0[threadIdx.x] = y0;
__syncthreads();
ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x);
if (threadIdx.x + 512 < ix) {
x0 = sdata0[threadIdx.x + 512];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 256 < ix) {
x0 = sdata0[threadIdx.x + 256];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 128 < ix) {
x0 = sdata0[threadIdx.x + 128];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 64 < ix) {
x0 = sdata0[threadIdx.x + 64];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x < 32) {
if (threadIdx.x + 32 < ix) {
x0 = sdata0[threadIdx.x + 32];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 16 < ix) {
x0 = sdata0[threadIdx.x + 16];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 8 < ix) {
x0 = sdata0[threadIdx.x + 8];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 4 < ix) {
x0 = sdata0[threadIdx.x + 4];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 2 < ix) {
x0 = sdata0[threadIdx.x + 2];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 1 < ix) {
x0 = sdata0[threadIdx.x + 1];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
}
if (threadIdx.x == 0) {
if (shapeSize > 0) {
if (gridDim.x == 1) {
x0 = 0.0f;
y0 = x0 + y0;
}
arrOut_a0[blockIdx.x] = y0;
} else {
arrOut_a0[blockIdx.x] = 0.0f;
}
}
}

0.09:cc: waiting for nvcc...
0.09:cc: queue: 19.000 µs, execute: 1.316 s
... /usr/bin/nvcc -I
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits -arch=sm_30
-cubin -o /tmp/accelerate-cuda-12605/dragon12606.cubin -O3 -m64
/tmp/accelerate-cuda-12605/dragon12606.cu
0.09:cc: queue: 32.000 µs, execute: 1.319 s
... /usr/bin/nvcc -I
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits -arch=sm_30
-cubin -o /tmp/accelerate-cuda-12605/dragon12605.cubin -O3 -m64
/tmp/accelerate-cuda-12605/dragon12605.cu
0.09:cc: persist/save:
/home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/z33Ufz60UFezr184lzr195zr229zr226zrACKzdpczr140zr178
0.09:cc: entry function 'foldAll' used 11 registers, 0 bytes smem, 0
bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps
in 2 blocks
0.09:gc: lookup/not found: Array #25
0.09:gc: mallocArray: 4 B
0.09:gc: malloc/new
0.09:gc: insert: Array #25
0.09:gc: lookup/found: Array #32
0.09:gc: lookup/found: Array #31
0.09:gc: lookup/found: Array #25
0.09:exec: foldAll<<< 1, 1024, 4096 >>> gpu: 48.128 µs, cpu: 0.000 s
0.09:gc: lookup/found: Array #25
0.09:gc: peekArray: 4 B
0.09:gc: pop context: #0x00007f8f1c00b4f0
Array (Z) [660.0]
neil@debian-neil:~/.cabal/bin$

Cheers, Neil

On 30/05/13 10:54, Trevor L. McDonell wrote:

import Prelude as P
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float)
xs = use $ fromList (Z:.10) [0..]
ys = use $ fromList (Z:.10) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys
= A.fold (+) 0
$ A.zipWith (*) xs ys

main :: IO ()
main
= withArgs ["-ddump-cc", "-ddump-gc", "-ddump-exec", "-dverbose"]
$ print
$ run (dotp xs ys)

tmcdonell · 2013-05-31T06:03:09Z

Hi Neil,

hmm, it does indeed seem to have worked. Okay, a couple more things to try, if you don't mind:

Could you run the deviceQueryDrv program from the CUDA SDK examples and show me the output?
I've not encountered an Optimus/Quadro device before, so my current thought is that something is wrong with the calculations that determine how many threads to launch. Try increasing the vector sizes for the test program I sent last time and find the point where it fails. Multiples of 1024 are probably a good increment. Feel free to comment out the line beginning withArgs so that it is less chatty.

Thanks!

tmcdonell · 2013-05-31T06:05:13Z

Oh, also, did you need to edit Async.hs to use forkOn 0 after I pushed the latest patches, or were the previous results with a clean checkout?

neiljamieso · 2013-06-01T10:15:12Z

The forkOn 0 no longer makes any difference - i.e all now fail as it did
with forkOS.

I'll try the suggestion about cranking up the size of the vectors and
get back.

Neil

On 31/05/13 16:05, Trevor L. McDonell wrote:

Oh, also, did you need to edit |Async.hs| to use |forkOn 0| after I
pushed the latest patches, or were the previous results with a clean
checkout?

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-06-01T10:19:43Z

Here's the deviceQueryDrv output:

neil@debian-neil:~/Documents/Computing/CUDASamples/1_Utilities/deviceQueryDrv$
optirun ./deviceQueryDrv
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "Quadro K1000M"
CUDA Driver Version: 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes
(2147287040 bytes)
( 1) Multiprocessors x (192) CUDA Cores/MP: 192 CUDA Cores
GPU Clock rate: 851 MHz (0.85 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Sizes 1D=(65536)
2D=(65536,65536) 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048,
2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device
simultaneously) >
neil@debian-neil:~/Documents/Computing/CUDASamples/1_Utilities/deviceQueryDrv$

On 31/05/13 16:03, Trevor L. McDonell wrote:

Hi Neil,

hmm, it does indeed seem to have worked. Okay, a couple more things to
try, if you don't mind:

Could you run the |deviceQueryDrv| program from the CUDA SDK
examples and show me the output?

I've not encountered an Optimus/Quadro device before, so my
current thought is that something is wrong with the calculations
that determine how many threads to launch. Try increasing the
vector sizes for the test program I sent last time and find the
point where it fails. Multiples of 1024 are probably a good
increment. Feel free to comment out the line beginning |withArgs|
so that it is less chatty.

Thanks!

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-06-01T10:30:59Z

Remarkably durable...

This is the code

import Prelude as P
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float)
xs = use $ fromList (Z:.1000000) [0..]
ys = use $ fromList (Z:.1000000) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys
= A.fold (+) 0
$ A.zipWith (*) xs ys

main :: IO ()
main
= withArgs ["-ddump-cc"{--, "-ddump-gc", "-ddump-exec", "-dverbose"--}]
$ print
$ run (dotp xs ys)

And this is the output

neil@debian-neil:/.cabal/bin$ optirun ./accelerate-examples --cuda -k
0.12:cc: initialise kernel table
0.12:cc: persist/restore: 41 entries
0.18:cc: found/persistent
0.18:cc: found/persistent
0.18:cc: entry function 'foldAll' used 11 registers, 0 bytes smem, 0
bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps
in 2 blocks
0.18:cc: entry function 'foldAll' used 8 registers, 0 bytes smem, 0
bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps
in 2 blocks
Array (Z) [6.666666e17]
neil@debian-neil:/.cabal/bin$

On 31/05/13 16:03, Trevor L. McDonell wrote:

Hi Neil,

hmm, it does indeed seem to have worked. Okay, a couple more things to
try, if you don't mind:

Could you run the |deviceQueryDrv| program from the CUDA SDK
examples and show me the output?

I've not encountered an Optimus/Quadro device before, so my
current thought is that something is wrong with the calculations
that determine how many threads to launch. Try increasing the
vector sizes for the test program I sent last time and find the
point where it fails. Multiples of 1024 are probably a good
increment. Feel free to comment out the line beginning |withArgs|
so that it is less chatty.

Thanks!

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-06-01T10:34:25Z

Changing vector sizes to this...

xs = use $ fromList (Z:.1000000000) [0..]
ys = use $ fromList (Z:.1000000000) [2,4..]

Lead to a perfectly reasonable...
neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda -k
39.85:cc: initialise kernel table
39.85:cc: persist/restore: 41 entries
accelerate-examples:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
out of memory

neil@debian-neil:~/.cabal/bin$

On 31/05/13 16:03, Trevor L. McDonell wrote:

Hi Neil,

hmm, it does indeed seem to have worked. Okay, a couple more things to
try, if you don't mind:

Could you run the |deviceQueryDrv| program from the CUDA SDK
examples and show me the output?

I've not encountered an Optimus/Quadro device before, so my
current thought is that something is wrong with the calculations
that determine how many threads to launch. Try increasing the
vector sizes for the test program I sent last time and find the
point where it fails. Multiples of 1024 are probably a good
increment. Feel free to comment out the line beginning |withArgs|
so that it is less chatty.

Thanks!

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-06-01T10:35:45Z

Does the simple dotp example exercise the Async module? This seems to
be the source of the crashes.

Cheers, Neil

neiljamieso · 2013-06-01T11:00:55Z

Oops, Sorry a mis-type there. They are with forkIO (not forkOS). I
tried with forkOn 0 and got the same results. Previously forkOn 0 gave
more successes and failed with "bad context" message rather than "launch
failed". "launch failed" has always happened with forkIO.

On 31/05/13 16:05, Trevor L. McDonell wrote:

Oh, also, did you need to edit |Async.hs| to use |forkOn 0| after I
pushed the latest patches, or were the previous results with a clean
checkout?

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-06-01T21:17:45Z

Hi Trevor,

I mentioned this before, but it may have been lost, and is more of a
worry... The fourth slice example fails to terminate (after 40sec). I
have to use Ctrl-C to kill it. I'm not sure why this has changed.

Cheers, neil

neiljamieso · 2013-06-01T22:27:07Z

Hi Trevor,

I thought you might be interested in this. Running the regression test
script seems to work - no crashing, no stalling on the slices!

neil@debian-neil:~/Downloads/Accelerate_130529/accelerate-examples-master/accelerate-examples-master$
optirun ./regression_test.sh --cuda

First the main battery of tests:

running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Ok
map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Ok
fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Ok
fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:

0 : (0.0,NaN)
1 : (-0.6929801,-4.2535293e37)
2 : (-1.2756131,NaN)
3 : (-0.9977418,NaN)
4 : (-1.1877143,NaN)
5 : (-1.4590598,NaN)
6 : (-1.465081,NaN)
7 : (-1.5335276,NaN)
8 : (-1.8964667,NaN)
9 : (-2.429172,NaN)
11 : (0.9855077,0.0)
12 : (1.2848983,0.0)
14 : (0.9586575,0.0)
15 : (0.8935447,0.0)
16 : (0.55967414,0.0)
17 : (0.7870643,0.0)
18 : (0.38397616,0.0)
19 : (0.5038597,0.0)
20 : (1.0932949,0.0)
22 : (-0.7802813,0.0)
23 : (-0.90180016,0.0)
24 : (-1.1760286,0.0)
25 : (-0.66521347,0.0)
27 : (0.8123276,0.0)
28 : (1.6648452,0.0)
29 : (1.8714409,0.0)
30 : (1.5091901,0.0)
31 : (2.096872,0.0)
32 : (2.3554232,0.0)
34 : (-0.82877505,0.0)
35 : (-1.8104537,0.0)
36 : (-1.8511171,0.0)
37 : (-1.4023463,0.0)
38 : (-2.062095,0.0)
39 : (-1.5179899,0.0)
40 : (-0.57485485,0.0)
41 : (-1.3017156,0.0)
43 : (-0.56559163,0.0)
44 : (-0.8005209,0.0)
45 : (-0.26718092,0.0)
47 : (-0.42379427,0.0)
48 : (-0.6211059,0.0)
49 : (-1.3470457,0.0)
50 : (-2.2204418,0.0)
51 : (-1.9068379,0.0)
52 : (-2.0748498,0.0)
53 : (-1.0756776,0.0)
54 : (-1.121619,0.0)
55 : (-1.9701061,0.0)
57 : (-0.3139459,0.0)
58 : (-0.46075392,0.0)
59 : (0.50402975,0.0)
60 : (-0.27072406,0.0)
61 : (-0.49237812,0.0)
62 : (-1.2419014,0.0)
63 : (-2.084043,0.0)

stencil-1D: Ok
stencil-2D: Ok
stencil-3D: Ok
stencil-3x3-cross: Ok
stencil-3x3-pair: Ok
stencil2-2D: Ok
permute-hist: Ok
backpermute-reverse: Ok
backpermute-transpose: Ok
init: Ok
tail: Ok
take: Ok
drop: Ok
slit: Ok
gather: Ok
gather-if: Ok
scatter: Ok
scatter-if: Ok
sasum: Ok
saxpy: Ok
dotp: Ok
filter: Ok
smvm: Ok
black-scholes: Ok
radixsort: Ok
io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Ok
slices: Ok
slices: Ok
slices: Ok
slices: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok

Next, additional application tests, beginning with mandelbrot:

accelerate-mandelbrot (c) [2011..2013] The Accelerate Team

Usage: accelerate-mandelbrot [OPTIONS]

Available backends:
interpreter reference implementation (sequential)

cuda implementation for NVIDIA GPUs (parallel)

Runtime usage:
arrows translate display
z ; zoom in
x q zoom out
f single precision calculations
d double precision calculations (if supported)

Error: unrecognized option `--size=64'

Run "accelerate-mandelbrot --help" for usage information
neil@debian-neil:~/Downloads/Accelerate_130529/accelerate-examples-master/accelerate-examples-master$

neiljamieso · 2013-06-01T23:00:07Z

Hi Trevor,

I hope you don't mind me sending lots, but I am on a roll at the
moment. Using the regressions script I saw the --size option and tried
it out. The accelerate-examples work with --size=1024, and fail with
--size=2048 (with the "launch failure" message). So this seems to be a
size problem rather than some basic fault in the context or launch
process. I suspect your thoughts about the calculations for memory
usage being wrong are correct.

Actually I can be more specific... 1024 works, 1025 fails.

There are numerous (hundreds of) "fails" in the results not matching the
interpreter result in scanseg-sum (but it ran!).

:-) Neil

Also fluid, mandrebrot, etc all run fine. Haven't tried the hashcat.

smoothlife chokes on the default settings. I get a decent animation
with the following, but it still slows down and misses frames as the
animation progresses. I assume this is a result of having a low end GPU.

neil@debian-neil:~/.cabal/bin$ ./accelerate-smoothlife --cuda --size=64
--sigmode=2 --sigtype=Smooth --framerate=5

Pretty happy now!

Neil

All this with the

neiljamieso · 2013-06-01T23:00:48Z

Hi Trevor,

I hope you don't mind me sending lots, but I am on a roll at the
moment. Using the regressions script I saw the --size option and tried
it out. The accelerate-examples work with --size=1024, and fail with
--size=2048 (with the "launch failure" message). So this seems to be a
size problem rather than some basic fault in the context or launch
process. I suspect your thoughts about the calculations for memory
usage being wrong are correct.

Actually I can be more specific... 1024 works, 1025 fails.

There are numerous (hundreds of) "fails" in the results not matching the
interpreter result in scanseg-sum (but it ran!).

:-) Neil

Also fluid, mandrebrot, etc all run fine. Haven't tried the hashcat.

smoothlife chokes on the default settings. I get a decent animation
with the following, but it still slows down and misses frames as the
animation progresses. I assume this is a result of having a low end GPU.

neil@debian-neil:~/.cabal/bin$ ./accelerate-smoothlife --cuda --size=64
--sigmode=2 --sigtype=Smooth --framerate=5

Pretty happy now!

Neil

All this with forkIO

tmcdonell · 2013-06-03T03:50:09Z

The forkOn 0 no longer makes any difference - i.e all now fail as it did
with forkOS.

Okay, that's great! I made some changes elsewhere tried to do the same thing but not fixed to CPU zero, so am glad that that works. One problem down!

tmcdonell · 2013-06-03T03:52:04Z

Does the simple dotp example exercise the Async module? This seems to
be the source of the crashes.

Yes, all run invocations will go via Async. I think we fixed the problem there, and the failures now are related to the kernel launches.

tmcdonell · 2013-06-03T04:04:27Z

Hi Neil,

I hope you don't mind me sending lots, but I am on a roll at the
moment.

Not at all, it is all very useful information (:

Using the regressions script I saw the --size option and tried
it out. The accelerate-examples work with --size=1024, and fail with
--size=2048 (with the "launch failure" message). So this seems to be a
size problem rather than some basic fault in the context or launch
process. I suspect your thoughts about the calculations for memory
usage being wrong are correct.

Actually I can be more specific... 1024 works, 1025 fails.

Ah, that is very helpful, thanks! I'll play around and see if I can dig up anymore leads to follow.

There are numerous (hundreds of) "fails" in the results not matching the
interpreter result in scanseg-sum (but it ran!).

A little worrying, but at least it runs! We'll get to that one later (:

Also fluid, mandrebrot, etc all run fine. Haven't tried the hashcat.

Great!

For hashcat you'll need to find a list of plain text words to feed it, and then a bunch of MD5 digests guess. You can use a standard dictionary like /usr/share/dict/english, although for a bit of fun Google for the rockyou list and a list of unknown md5's (:

smoothlife chokes on the default settings. I get a decent animation
with the following, but it still slows down and misses frames as the
animation progresses. I assume this is a result of having a low end GPU.

neil@debian-neil:~/.cabal/bin$ ./accelerate-smoothlife --cuda --size=64
--sigmode=2 --sigtype=Smooth --framerate=5

I think it depends on whether or not accelerate-fft built against the fast CUDA FFT library implementation. I don't think there is an easy way to check whether this happened or not, aside from just running and measuring the speed. Try:

cabal install accelerate-fft -fcuda

Or just install it after the accelerate-cuda package is already installed. This should probably have better documentation!

-Trev

neiljamieso · 2013-06-03T07:06:46Z

On 03/06/13 16:04, Trevor L. McDonell wrote:

cabal install accelerate-fft -fcuda
Worked! Smoothlife now works beautifully. Amazing speedup in processing.

Addressing issues: - AccelerateHS/accelerate#93 - AccelerateHS/accelerate#95 - improvements for AccelerateHS/accelerate#92

tmcdonell · 2013-11-15T03:02:17Z

@neiljamieso does everything work fine now? Some recent fixes to the fold kernel means that those tests should pass now. Do you still have any problems here?

neiljamieso · 2013-11-16T08:08:45Z

Hi Trev,

How recent a download from Github do I need?

Neil

On 15/11/13 16:02, Trevor L. McDonell wrote:

@neiljamieso https://github.com/neiljamieso does everything work
fine now? Some recent fixes to the fold kernel means that those tests
should pass now. Do you still have any problems here?

—
Reply to this email directly or view it on GitHub
#92 (comment).

neiljamieso · 2013-11-16T08:40:35Z

Hi Trev,

I tried installing the latest accelerate stuff from githib.

The latest accelerate-cuda depends on cuda-1.5.1.1 - the latest cuda in
github is 1.5.1.0

On 15/11/13 16:02, Trevor L. McDonell wrote:

@neiljamieso https://github.com/neiljamieso does everything work
fine now? Some recent fixes to the fold kernel means that those tests
should pass now. Do you still have any problems here?

—
Reply to this email directly or view it on GitHub
#92 (comment).

mchakravarty · 2013-11-16T10:41:32Z

@neiljamieso Trev probably forgot to push the version bump. Just change the version in cuda.cabal to 1.5.1.1 and it'll work.

neiljamieso · 2013-11-16T19:37:11Z

No working so well. I have attached the outputs (with my command line
at the front) for standard and verbose outputs.

Neil

On 16/11/13 23:41, Manuel M T Chakravarty wrote:

@neiljamieso https://github.com/neiljamieso Trev probably forgot to
push the version bump. Just change the version in |cuda.cabal| to
1.5.1.1 and it'll work.

—
Reply to this email directly or view it on GitHub
#92 (comment).

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda --size=1024 -v > verbose_test_131117
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [28,0,0] Assertion ix >= 0 && ix < sh failed.
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [29,0,0] Assertion ix >= 0 && ix < sh failed.
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [30,0,0] Assertion ix >= 0 && ix < sh failed.
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [31,0,0] Assertion ix >= 0 && ix < sh failed.
accelerate-examples:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

neil@debian-neil:~/.cabal/bin$

_OUTPUT_*

running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Ok
map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Ok
fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Ok
fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-1D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3x3-cross: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3x3-pair: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil2-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

permute-hist: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

backpermute-reverse: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

backpermute-transpose: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

init: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

tail: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

take: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

drop: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slit: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

gather: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

gather-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

scatter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

scatter-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

sasum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

saxpy: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

dotp: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

filter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

smvm: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

black-scholes: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

radixsort: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok

warming up
estimating clock resolution...
mean is 4.154538 us (160001 iterations)
found 1231 outliers among 159999 samples (0.8%)
1094 (0.7%) high severe
estimating cost of a clock call...
mean is 83.69922 ns (32 iterations)
found 4 outliers among 32 samples (12.5%)
3 (9.4%) low mild
1 (3.1%) high mild

benchmarking map-abs

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda --size=1024 > bare_test_131117
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [31,0,0] Assertion ix >= 0 && ix < sh failed.
accelerate-examples:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

neil@debian-neil:~/.cabal/bin$

_OUTPUT_**