
CUDA backend does not work with Bumblebee/Optimus #92

Closed
neiljamieso opened this issue May 12, 2013 · 57 comments
Labels
cuda backend [deprecated]

Comments

@neiljamieso

Hi,

I tried to build the examples. This failed due to not finding a definition of "note" in Benchmark.hs. This was solved by adding import Criterion.IO.Printf to the import list.

@tmcdonell
Member

This should be fixed by AccelerateHS/accelerate-examples@91250ca. Can you confirm this?

@neiljamieso
Author

Yes, it built fine. There are lots of failures when running with the CUDA backend. I'm using CUDA 5 - not sure if that breaks things. Do you want to see the list?

@neiljamieso
Author

Most of the failures were of the form:
: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:153 (unhandled): CUDA Exception: unspecified launch failure

fold-sum and fold-2d-sum also failed, but differently:

fold-sum: Failed:

() : (-317.71545,-725.824)

fold-2d-sum: Failed:

0 : (10.181486,10.786726)
3 : (-12.445062,-5.0869923)
4 : (-22.740108,-34.520443)
5 : (7.2517667,5.283786)
6 : (-7.7953305,-19.361605)
7 : (16.353685,16.106562)
8 : (4.841938,3.6077766)
9 : (6.7518387,2.3453445)
10 : (-14.926775,-22.0668)
12 : (-8.844832,0.86133194)
13 : (31.82425,42.47469)
15 : (-12.590198,-8.076189)
16 : (2.4275239,-1.1079388)
18 : (-4.6298413,10.507795)
19 : (-5.7560434,-24.80141)
20 : (-27.520971,-58.446945)
21 : (-10.380567,-17.262444)
24 : (-5.6269426,-3.59577)
34 : (18.326572,15.697114)
36 : (-21.652311,-20.826466)
37 : (-0.69646883,-14.07115)
39 : (-1.9313966,-1.646287)
40 : (-15.114215,-4.3450966)
41 : (9.819355,4.646344)
42 : (-13.3020315,-18.24121)
43 : (-4.780798,-11.156574)
45 : (-13.909897,-19.179947)
46 : (-24.877073,-25.394434)
48 : (-10.917168,-7.612333)
49 : (8.59276,-10.744858)
50 : (-43.603035,-53.99748)
53 : (17.853306,21.356565)
55 : (-2.121977,6.3397703)
58 : (-4.2652583,-2.5864878)
59 : (-4.6431007,-3.1721497)
60 : (14.112302,15.44854)
61 : (-28.66971,-50.8897)
63 : (-14.38963,-20.192778)
67 : (-29.752752,-29.051735)
70 : (18.686342,27.751282)
76 : (-11.068267,-3.157248)
77 : (-30.1085,-35.691612)
78 : (17.865221,33.37813)
79 : (12.610696,10.771452)
80 : (13.958698,14.737689)
83 : (-51.858498,-58.283985)
85 : (12.039097,14.588022)
86 : (-14.114648,-17.417624)
89 : (23.790989,25.472948)
90 : (-18.82345,-17.08065)
91 : (2.463029,5.9130898)
92 : (4.0238266,5.5120225)
93 : (-8.8636265,-8.364969)
95 : (-16.640343,-13.33732)
96 : (10.943283,20.977047)
97 : (-2.759805,-10.179357)
99 : (7.461958,4.374811)
101 : (6.5351143,10.87258)
102 : (-8.328936,-3.353552)
103 : (-8.919393,-10.651541)
104 : (-8.599477,-32.173218)
105 : (-3.4648807,-12.457461)
107 : (-9.112293,-10.76436)
109 : (10.36928,19.196201)
111 : (-0.74972934,-8.263916)
112 : (-1.4251958,-1.3936005)
114 : (-5.7750616,-6.656393)
115 : (-4.1570673,-5.0010214)
118 : (-14.588455,-5.8673525)
122 : (-3.905911,1.3459797)
124 : (11.671464,13.249651)
128 : (24.242702,31.903507)
130 : (-5.312511,-8.756293)
131 : (-17.744507,-24.541887)
133 : (-3.0010543,-7.737555)
136 : (8.380546,11.387158)
138 : (11.308516,11.967691)
139 : (-17.7391,-29.652555)
141 : (-25.26024,-34.264626)
145 : (-11.910921,-14.598899)
147 : (18.361284,8.458666)
148 : (-2.0598116,9.742126)
151 : (-1.5615535,-6.330538)
155 : (-14.633401,-24.910007)
158 : (1.7897742,-3.3920808)
160 : (7.98956,9.146147)
161 : (-21.875072,-25.081263)
162 : (5.615722,20.186003)
163 : (9.19277,14.405633)
166 : (-4.6076007,0.6831827)
167 : (-10.567481,-4.0725036)
169 : (0.4859029,-6.1355286)
170 : (19.870667,19.815443)
172 : (6.0666904,7.6584425)
173 : (8.849107,0.12496734)
175 : (-11.274898,-16.4241)
177 : (-27.324623,-33.917286)
178 : (0.21815288,3.8251867)
179 : (-6.1652923,-4.998172)
180 : (-14.112642,-19.027935)
181 : (-2.080636,6.853819e-3)
183 : (3.6447208,-6.9173365)
187 : (-27.273254,-38.26088)
189 : (-9.826919,-14.5337925)
190 : (1.3126237,0.9761648)
191 : (-4.1650763,-1.852829)
193 : (18.610937,22.746304)
194 : (-4.691451,-0.86483383)
196 : (-4.7458477,-23.575771)
197 : (-2.7342944,-10.165984)
199 : (-11.298469,-18.151875)
200 : (5.3247147,-4.0813465)
201 : (14.916756,23.434582)
203 : (-0.1067512,4.8686438)
204 : (-14.124139,-4.513797)
206 : (-7.185062,-0.58614635)
207 : (-19.701935,-20.333096)
208 : (-11.467451,-7.518866)
210 : (31.49854,38.85581)
212 : (-16.014204,-17.766535)
216 : (-18.965578,-29.654585)
220 : (-0.17519975,-5.1846743)
225 : (16.0454,19.740955)
226 : (-0.67587143,1.3499918)
229 : (-21.621109,-23.055359)
231 : (1.533406,0.9220514)
232 : (1.5521168,-2.942934)
235 : (-26.18992,-28.304138)
237 : (-12.360111,-14.813786)
244 : (-26.788136,-26.856113)
245 : (-11.375093,-6.4627395)
249 : (-14.0135765,-18.813738)
251 : (-28.578781,-39.254063)
261 : (23.480045,28.535007)
263 : (-20.27542,-30.240715)
264 : (1.0410566,5.445823)
265 : (-12.174866,-11.87295)
270 : (-2.2434764,1.3028297)
271 : (-5.3730717,-7.069026)
272 : (-32.547344,-40.939163)
273 : (-11.036853,-14.617073)
274 : (1.5726653,7.1989527)
276 : (13.667664,-4.6318626)
277 : (-19.315035,-14.617573)
279 : (0.14692748,6.2511544)
281 : (-0.6385382,0.5433495)
282 : (0.13369226,-2.5549994)
285 : (-25.613811,-23.304722)
286 : (11.909087,6.9073195)
287 : (11.177615,14.907998)
289 : (8.337317,10.699486)
291 : (-6.394571,-2.2123995)
293 : (-12.401189,-4.961336)
294 : (20.566023,22.415432)
299 : (6.3981833,14.163654)
301 : (-15.557607,-12.6597595)
308 : (3.6762142,9.144186)
310 : (0.26484996,-7.8996334)
312 : (3.0426567,7.4979715)
314 : (14.884919,14.266132)

@tmcdonell
Member

Hmm... what card are you running on, and what compute capability is it?
The internal error especially is a bit worrying; I haven't seen that one in a while. The fold errors at least should be easier to debug.

@neiljamieso
Author


K1000M, using optirun to do the switching. It works fine with all
the CUDA examples from NVIDIA. Ah! But maybe it does not work if you
detach from the primary calling thread (as I suspect you do in the async
functions). I will check that out - it rings a bell from the Bumblebee
documentation.

Neil

@tmcdonell
Member

Yes, we do need to push and pop the CUDA context; I thought that was enough, but my reading of the CUDA docs might be incorrect (and I had not even heard of optirun before now!)
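For reference, the push/pop discipline being described is essentially a bracket around each use of the device. A minimal sketch, with the real push/pop entry points of the Haskell CUDA bindings stubbed out as plain IO actions (the real calls need a GPU, so this is illustrative only):

```haskell
import Control.Exception (bracket_)

-- Stand-ins for the real context push/pop calls in the CUDA bindings,
-- which make a context current on (or release it from) the calling
-- OS thread. Stubbed here so the sketch runs without a device.
pushContext, popContext :: IO ()
pushContext = putStrLn "push: context becomes current on this OS thread"
popContext  = putStrLn "pop: context released from this OS thread"

-- Make the context current for the duration of an action, popping it
-- again even if the action throws. If the action migrates to a
-- different OS thread mid-flight, the driver can report errors such as
-- "invalid context handle".
withContext :: IO a -> IO a
withContext = bracket_ pushContext popContext
```

The bracket guarantees the pop happens on the same thread that pushed, which is the invariant the driver cares about.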

@neiljamieso
Author

optirun is part of the Bumblebee project, which allows use of Optimus
GPUs under Linux. As it is not provided by NVIDIA, it is possible it
brings its own issues. As I say, though, all the NVIDIA examples seem to
run fine under it.


@tmcdonell
Member

Actually, does Optimus aim to allow dynamic switching between a pair of low/high-power GPUs? I have a similar problem with dynamic switching (usually) not working under Mac OS X (#67), even with the NVIDIA drivers, although it does seem to work with the NVIDIA examples.

Does it work if you disable the switching and only use the fast GPU?

@neiljamieso
Author

Hi Trevor,

Yes. Optimus is an NVIDIA design which uses the onboard Intel graphics
most of the time. The display is ALWAYS done by the Intel unit, but
rendering is directed to the NVIDIA card on a switchable basis. Bumblebee
is an open-source module that allows this switching on Linux (as NVIDIA
neglect to provide it themselves). Choosing to run a programme on the
NVIDIA card is done by running the programme under optirun, so I enter:
$ optirun ./accelerate-examples

I have been thinking: the errors I am getting now look like language
errors from the CUDA system. As I say, ALL the NVIDIA code runs fine
under optirun, so I wonder if this is about changes to the CUDA
language with CUDA 5. Have you had success with CUDA 5 and Accelerate on
other hardware?

Cheers, Neil


@tmcdonell
Member

Hi Neil,

I am using CUDA 5 and it has worked for me, on both Mac OS X and Ubuntu. It might make a difference if you're on a different Linux distribution?

What do you mean by language errors from the CUDA system? Different errors from the earlier "unspecified launch failure"?

Try changing this from forkOS to forkOn 0 and let me know what happens?
https://github.com/AccelerateHS/accelerate-cuda/blob/master/Data/Array/Accelerate/CUDA/Async.hs#L36
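For background on why that change could matter: a CUDA context is only valid on the OS thread where it is current. forkOS creates a fresh bound OS thread for every async operation, whereas forkOn 0 pins the Haskell thread to capability 0, so repeated device calls are far more likely to land on the same OS thread. A small runnable illustration of the forkOn pattern (no CUDA involved; the IO action stands in for the device call):

```haskell
import Control.Concurrent (forkOn, newEmptyMVar, putMVar, takeMVar)

-- Run an action on capability 0 and wait for its result. Pinning to a
-- single capability keeps the work off freshly spawned OS threads,
-- which is what a thread-bound CUDA context needs; forkOS, in contrast,
-- would create a new bound OS thread on every call.
onCapability0 :: IO a -> IO a
onCapability0 act = do
  result <- newEmptyMVar
  _ <- forkOn 0 (act >>= putMVar result)
  takeMVar result
```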

@neiljamieso
Author

Sorry Trevor. The "language" error was a language error of my own, due to
wrapping at the edge of the terminal window. :-((

Will try your suggestion when I get home.

Neil


@neiljamieso
Author

Hi Trevor,

I have had another thought. Debian Wheezy (my OS) comes with gcc 4.7 as
standard, and CUDA 5 only works with gcc 4.6 (I tried 4.7). I thought of this
last night and rebuilt accelerate-cuda and accelerate-examples with gcc
pointing to 4.6 (and g++ the same). This didn't make any difference, but I
wonder if I need to rebuild the whole of Haskell with gcc 4.6.

What is the default version of gcc on your OS?

Cheers Neil

@tmcdonell
Member

On my Mac it is gcc-4.2, but this is Apple's own version so I am not sure that is comparable. The Ubuntu 12.04 box uses gcc-4.6.3.

Adding the flag -ddump-gc will give rather chatty output whenever it tries to do memory allocations. Since this is quite fine-grained, it might give a few more indications of what is going on (failed on the first attempt, worked for a while and then failed, etc.).

@neiljamieso
Author

Hi Trevor,

It did make a difference. This is the output:

neil@debian-neil:~/.cabal/bin$ optirun bash
neil@debian-neil:~/.cabal/bin$ ./accelerate-examples --cuda -k
running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Failed:
 >>> () : (-284.77808,-299.1781)

fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Failed:
 >>> 4 : (3.5994253,4.1811037)
 >>> 5 : (-9.134442,-18.42069)
 >>> 6 : (-3.1958194,-7.940378)
 >>> 7 : (5.7998953,19.296043)
 >>> 9 : (14.701389,23.411243)
 >>> 12 : (26.571274,38.411602)
 >>> 14 : (23.842213,23.651949)
 >>> 15 : (-9.196165,-0.6621127)
 >>> 19 : (43.498287,45.15085)
 >>> 20 : (13.474283,14.403748)
 >>> 21 : (-11.930797,-9.856017)
 >>> 22 : (8.05154,8.746618)
 >>> 23 : (25.273453,25.40266)
 >>> 24 : (5.668702,7.682753)
 >>> 25 : (-23.540642,-24.084503)
 >>> 26 : (7.730505,3.7250352)
 >>> 28 : (-13.682002,-22.007523)
 >>> 30 : (22.153667,29.94641)
 >>> 32 : (3.9512172,4.8625793)
 >>> 34 : (-20.773705,-23.154194)
 >>> 35 : (14.610652,17.81879)
 >>> 36 : (-6.893841,-5.690979)
 >>> 38 : (3.470799,9.4239e-2)
 >>> 40 : (17.497482,27.669067)
 >>> 41 : (-3.0244708e-2,-3.8516002)
 >>> 43 : (19.843216,26.517456)
 >>> 44 : (5.050486,8.706543)
 >>> 47 : (-5.443891,-5.188139)
 >>> 49 : (-8.316223,-12.395588)
 >>> 51 : (5.367283,23.022243)
 >>> 52 : (11.321204,6.605723)
 >>> 53 : (16.014208,17.675938)
 >>> 57 : (-22.71127,-28.897242)
 >>> 60 : (-2.7958093,3.0328588)
 >>> 61 : (14.372042,10.27017)
 >>> 63 : (-13.966523,-16.551018)
 >>> 65 : (-2.3377113,-8.886295)
 >>> 66 : (0.41673332,4.9110966)
 >>> 67 : (-3.150734,1.390254)
 >>> 68 : (-9.262151,-4.612889)
 >>> 70 : (1.1192223,-0.87473106)
 >>> 71 : (-16.735855,-13.542116)
 >>> 72 : (-2.7853413,-3.259285)
 >>> 75 : (-0.42108774,12.822178)
 >>> 76 : (37.315483,58.080196)
 >>> 77 : (21.378624,24.565968)
 >>> 79 : (17.399918,11.301307)
 >>> 80 : (6.1325307,-3.117681)
 >>> 82 : (-25.688484,-23.890837)
 >>> 84 : (-29.327036,-46.779266)
 >>> 85 : (-12.640158,-17.59966)
 >>> 86 : (14.217806,22.999573)
 >>> 87 : (3.0769944,0.67498803)
 >>> 88 : (14.598545,13.440449)
 >>> 94 : (-18.738943,0.6576848)
 >>> 98 : (-1.2732513,-9.02783)
 >>> 100 : (14.017002,22.866009)
 >>> 102 : (10.585675,-0.76270866)
 >>> 103 : (-22.687687,-24.832624)
 >>> 105 : (13.726986,8.545394)
 >>> 108 : (18.212643,22.956026)
 >>> 110 : (-14.852369,-22.597391)
 >>> 111 : (2.3865306,5.926875)
 >>> 112 : (3.0377512,-1.880888)
 >>> 114 : (-10.134539,-9.8238)
 >>> 115 : (-4.3836536,3.3319654)
 >>> 116 : (-5.7152805,-14.443269)
 >>> 117 : (8.012011,7.6332164)
 >>> 118 : (-17.265642,-15.1257715)
 >>> 119 : (12.728009,14.087517)
 >>> 120 : (-18.342087,-23.154064)
 >>> 121 : (-21.715904,-17.897583)
 >>> 123 : (-13.022339,-12.231892)
 >>> 124 : (16.29696,30.115715)
 >>> 126 : (8.191839,16.790535)
 >>> 127 : (7.316367,14.373995)
 >>> 128 : (23.410019,22.88608)
 >>> 129 : (10.068765,-24.64301)
 >>> 131 : (-26.669355,-26.25417)
 >>> 132 : (2.4118686,-3.5020428)
 >>> 133 : (-13.115518,-21.87509)
 >>> 134 : (12.896856,12.63337)
 >>> 136 : (13.352133,12.780149)
 >>> 137 : (24.687658,17.437037)
 >>> 140 : (4.4784513,-8.002885)
 >>> 141 : (19.64967,21.850222)
 >>> 142 : (-17.395033,-11.799833)
 >>> 144 : (4.605325,9.768799)
 >>> 149 : (-27.127146,-31.195862)
 >>> 150 : (-20.15325,-38.91357)
 >>> 151 : (-11.284405,-7.634466)
 >>> 153 : (1.4470301,2.2499762)
 >>> 155 : (17.06059,23.061432)
 >>> 157 : (13.256235,9.830044)
 >>> 158 : (8.65885e-2,15.133558)
 >>> 161 : (19.461996,30.09988)
 >>> 162 : (8.695209e-2,1.2758055)
 >>> 164 : (0.23431987,-5.4021072)
 >>> 165 : (-8.806317,-7.660516)
 >>> 167 : (2.9375281,-1.7019806)
 >>> 168 : (4.8822374,1.7404442)
 >>> 169 : (-6.0983124,-6.616735)
 >>> 170 : (-10.859095,-24.070465)
 >>> 171 : (-30.173882,-38.876015)
 >>> 172 : (7.5324316,10.573803)
 >>> 173 : (-7.9830656,-0.61189365)
 >>> 174 : (3.8499007,2.8259583)
 >>> 175 : (9.863973,18.671043)
 >>> 176 : (1.5010693,7.730674)
 >>> 177 : (-19.172495,-15.866618)
 >>> 178 : (10.258595,11.646437)
 >>> 179 : (-36.72372,-32.991608)
 >>> 180 : (4.0878096,4.3566303)
 >>> 183 : (-16.212082,-12.850005)
 >>> 186 : (20.656956,44.957047)
 >>> 187 : (9.899384,8.580212)
 >>> 188 : (24.487984,24.992609)
 >>> 194 : (16.086586,6.133008)
 >>> 195 : (-12.79052,-14.317617)
 >>> 200 : (4.5302505,8.308535)
 >>> 201 : (-10.723634,-23.400677)
 >>> 202 : (-4.187149,-15.145685)
 >>> 203 : (-15.959601,-16.193207)
 >>> 204 : (27.673164,32.605988)
 >>> 205 : (-22.693754,-33.882385)
 >>> 206 : (-0.7072872,-1.9263825)
 >>> 208 : (-2.4695814,-0.21775436)
 >>> 209 : (-7.441179,-7.886807)
 >>> 216 : (-26.625347,-34.00032)
 >>> 217 : (-12.935532,-12.696256)
 >>> 219 : (10.233142,16.826408)
 >>> 223 : (-20.659527,-19.133957)
 >>> 225 : (4.6232724,-5.518243)
 >>> 226 : (-3.6734939e-3,-0.32396984)
 >>> 228 : (31.582458,35.58126)
 >>> 229 : (-0.7545265,-10.300518)
 >>> 231 : (12.414625,15.020456)
 >>> 234 : (10.174679,19.857052)
 >>> 235 : (-13.687687,-11.906177)
 >>> 239 : (-16.81191,-17.177837)
 >>> 241 : (5.6338625,7.43606)
 >>> 246 : (-6.5156856,-9.638809)
 >>> 247 : (-0.42078322,-4.191985)
 >>> 249 : (11.335211,10.828511)
 >>> 252 : (-0.8734268,-16.709965)
 >>> 253 : (2.7642574,5.442359)
 >>> 255 : (-15.736735,-13.98167)
 >>> 257 : (5.946913,2.0609694)
 >>> 258 : (-6.6435785,-8.290497)
 >>> 259 : (13.248286,15.020397)
 >>> 260 : (40.213238,62.449997)
 >>> 261 : (-1.8538256,-4.91119)
 >>> 266 : (10.244856,6.945044)
 >>> 268 : (-13.880142,-21.150314)
 >>> 269 : (14.314802,14.349737)
 >>> 270 : (-27.502745,-33.003326)
 >>> 271 : (10.64012,6.457108)
 >>> 272 : (-16.236614,-21.558899)
 >>> 273 : (20.561716,24.363443)
 >>> 274 : (-10.97512,-6.042589)
 >>> 280 : (-12.273643,-13.009692)
 >>> 283 : (3.3773353,8.302713)
 >>> 286 : (-1.6639676,-3.079587)
 >>> 287 : (-21.63964,-23.37448)
 >>> 290 : (-14.440636,-24.584656)
 >>> 291 : (0.17262441,-1.6445827)
 >>> 294 : (19.45585,29.862196)
 >>> 298 : (2.3329654,8.237259)
 >>> 303 : (15.277465,12.724495)
 >>> 304 : (-10.626967,-18.734402)
 >>> 309 : (-11.389035,-6.8129835)
 >>> 310 : (-7.8077154,-9.264032)
 >>> 311 : (3.3524702,-7.6005263)
 >>> 313 : (22.357534,21.090479)
 >>> 314 : (14.302358,4.895173)
 >>> 315 : (-32.722397,-41.946712)

fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

stencil-1D: Ok
stencil-2D: Ok
stencil-3D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

stencil-3x3-cross: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

stencil-3x3-pair: Ok
stencil2-2D: Ok
permute-hist: Ok
backpermute-reverse: Ok
backpermute-transpose: Ok
init: Ok
tail: Ok
take: Ok
drop: Ok
slit: Ok
gather: Ok
gather-if: Ok
scatter: Ok
scatter-if: Ok
sasum: Failed:
 >>> () : (50137.895,63516.633)

saxpy: Ok
dotp: Failed:
 >>> () : (120.643745,144.3627)

filter: Ok
smvm: Ok
black-scholes: Ok
radixsort: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Ok
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

slices: Ok
slices: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok


@neiljamieso
Author

Hullo Trevor,

SUCCESS!

I rebuilt all the accelerate packages (with the change to forkOn in
place) and the accelerate-examples all run perfectly!

Not sure how to interpret the benchmarks, but I am very pleased to have it
going.

May I also say that the code is beautiful. I don't understand it all yet,
but what I have read is very aesthetically pleasing.

Neil

@neiljamieso
Author

OOPS! Duh! I didn't turn on --cuda, so of course they all looked OK.

Sorry. No change with the CUDA backend. :-((

All this regarding accelerate-examples of course.

Neil

@tmcdonell
Member

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

@neiljamieso
Author

Will do Trevor.


@neiljamieso
Author

I got this error when trying to compile the examples:

[ 6 of 12] Compiling Test.IndexSpace (
examples/nofib/Test/IndexSpace.hs,
dist/build/accelerate-nofib/accelerate-nofib-tmp/Test/IndexSpace.o )

examples/nofib/Test/IndexSpace.hs:170:71:
    Ambiguous occurrence `even'
    It could refer to either `P.even',
      imported from `Prelude' at examples/nofib/Test/IndexSpace.hs:6:1-60
      (and originally defined in `GHC.Real')
    or `A.even',
      imported from `Data.Array.Accelerate' at examples/nofib/Test/IndexSpace.hs:20:1-60
      (and originally defined in `accelerate-0.14.0.0:Data.Array.Accelerate.Language')

I'll have a look and change to A.even as I assume that's what you meant.

Neil.


@neiljamieso
Author

This fixed it:

-- gatherIfAcc even' mapv maskv defaultv xs .==. gatherIfRef even mapv maskv defaultv xs
gatherIfAcc even' mapv maskv defaultv xs .==. gatherIfRef P.even mapv maskv defaultv xs
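The general shape of the fix is the same in each case: qualify the clashing name so GHC knows which `even` is meant. A minimal standalone sketch of the pattern (the function here is illustrative, not the actual nofib code):

```haskell
import qualified Prelude as P

-- In the examples, Data.Array.Accelerate also exports an `even` for use
-- in embedded expressions, so a bare `even` is ambiguous when both
-- modules are imported unqualified; writing P.even (or A.even) selects
-- one explicitly.
evensOnly :: [Int] -> [Int]
evensOnly = P.filter P.even
```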


@neiljamieso
Author

Another one:
[18 of 36] Compiling Gather (
examples/tests/primitives/Gather.hs,
dist/build/accelerate-examples/accelerate-examples-tmp/Gather.o )

examples/tests/primitives/Gather.hs:41:11:
    Ambiguous occurrence `even'
    It could refer to either `Acc.even',
      imported from `Data.Array.Accelerate' at examples/tests/primitives/Gather.hs:9:1-48
      (and originally defined in `accelerate-0.14.0.0:Data.Array.Accelerate.Language')
    or `P.even',
      imported from `Prelude' at examples/tests/primitives/Gather.hs:10:1-33
      (and originally defined in `GHC.Real')
Failed to install accelerate-examples-0.14.0.0


@neiljamieso
Author

And:
[19 of 36] Compiling Scatter (
examples/tests/primitives/Scatter.hs,
dist/build/accelerate-examples/accelerate-examples-tmp/Scatter.o )

examples/tests/primitives/Scatter.hs:52:11:
    Ambiguous occurrence `even'
    It could refer to either `P.even',
      imported from `Prelude' at examples/tests/primitives/Scatter.hs:16:1-44
      (and originally defined in `GHC.Real')
    or `Acc.even',
      imported from `Data.Array.Accelerate' at examples/tests/primitives/Scatter.hs:17:1-59
      (and originally defined in `accelerate-0.14.0.0:Data.Array.Accelerate.Language')


@neiljamieso
Author

Both fixed the same way, and everything now compiles... Let's see how they run!


@neiljamieso
Author

This is the output... I used Ctrl-C during the 4th slices test as it seemed
to hang.

neil@debian-neil:~/.cabal/bin$ optirun --no-xorg ./accelerate-examples
--cuda -k
running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Ok
map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Failed:

() : (-21.361326,32.731934)

fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Failed:

1 : (3.9905946,12.912712)
3 : (4.3853145,6.6357403)
4 : (8.841903,3.564476)
5 : (22.403717,22.856863)
6 : (-7.058512,-0.8101158)
7 : (-13.209917,-14.428578)
8 : (3.6516001,3.9791288)
9 : (1.5006628,1.639061)
10 : (8.085807,10.32614)
11 : (11.110486,13.271563)
12 : (11.344211,24.075565)
13 : (5.494232,6.638853)
15 : (-18.813566,-28.975445)
17 : (-10.612726,-11.403031)
19 : (30.455154,48.39125)
21 : (0.6439582,-8.811903e-2)
23 : (0.44115293,1.8800209)
25 : (-1.081647,4.433939)
29 : (3.649135,0.9225111)
30 : (-3.5161483,0.26748943)
31 : (6.247751,4.1066437)
33 : (-19.144558,-21.607367)
34 : (14.241796,-0.3949709)
35 : (-6.4786077,-4.0578346)
36 : (1.6614412,9.047534)
37 : (-9.929752e-2,-8.920741)
38 : (-0.5181453,-14.03962)
41 : (17.492886,5.483637)
42 : (1.5826802,1.5337367)
43 : (-22.710932,-26.35552)
45 : (7.819425,8.852381)
47 : (3.8250275,-1.1689825)
48 : (31.711973,36.747433)
49 : (5.4925137,10.268168)
52 : (-10.457833,-12.00074)
53 : (22.555317,34.491005)
54 : (-13.917394,-17.875317)
57 : (3.446729,-6.599143)
60 : (-9.107978,3.4590158)
62 : (-24.056997,-29.912)
63 : (2.436757,3.1981812)
64 : (-1.2618066,1.2730389)
68 : (28.439875,36.614067)
70 : (0.5847907,2.280851)
74 : (-2.3531268,-4.332817)
75 : (4.663379,8.0118885)
78 : (7.195462,14.593959)
79 : (-5.2660117,-12.242489)
80 : (-15.816689,-18.658928)
81 : (12.112614,8.826111)
82 : (14.143523,18.241121)
83 : (-25.847208,-30.473446)
84 : (11.379544,4.809246)
86 : (15.708036,29.72469)
87 : (4.8327255,3.5389404)
91 : (-12.49356,-7.1337805)
92 : (-3.2796116,-1.6790586)
93 : (-8.711067,-17.377827)
94 : (-21.488873,-14.433965)
97 : (2.3867311,-1.5279217)
98 : (5.4814205,-2.1296844)
99 : (-3.566555,-5.9053173)
100 : (13.362963,10.301908)
101 : (1.7501,-3.362393)
102 : (-1.8447578,-11.294733)
103 : (3.365004,10.789146)
105 : (-34.87906,-40.80436)
106 : (-12.686344,-15.8895645)
107 : (9.183949,7.9775457)
110 : (-22.573433,-12.892656)
112 : (12.944003,17.68826)
113 : (-20.14838,-21.692518)
114 : (-0.13564283,9.673411)
117 : (-34.568615,-36.956146)
118 : (-9.420436,-4.6167736)
125 : (-3.868143,-6.226729)
126 : (-24.039621,-25.80162)
127 : (-3.2252026,-4.7092633)
128 : (-9.503313,-5.4460926)
133 : (3.8282223,-1.7425342)
134 : (14.974166,34.86072)
135 : (-19.844137,-21.048025)
137 : (23.145348,28.191246)
139 : (3.5891905,9.721224)
142 : (0.5852886,1.3669834)
144 : (-5.7431865,5.893752)
145 : (13.187965,12.4972515)
147 : (-2.4032655,-9.138004)
149 : (22.993021,24.544422)
157 : (-5.1877947,-6.014868)
159 : (-17.272867,-16.517113)
160 : (-29.876955,-40.23668)
161 : (-16.822813,-12.472164)
162 : (-0.6595129,0.25787354)
164 : (35.51503,35.609394)
165 : (-23.43607,-30.415709)
166 : (9.842515,2.944377)
167 : (24.214361,29.503002)
168 : (-23.579342,-39.842453)
170 : (11.822997,18.28223)
171 : (16.668018,21.228556)
173 : (-18.572968,-19.739588)
174 : (5.4933777,-0.5577693)
175 : (1.9450028,4.1181507)
177 : (-19.47439,-19.676298)
179 : (-12.430883,-16.573708)
182 : (-4.7336774,-9.151844)
184 : (-2.7646563,9.710753)
185 : (22.779469,20.718946)
187 : (-25.819782,-30.222664)
188 : (18.511953,21.633574)
189 : (-19.708344,-23.975298)
191 : (17.08098,24.394087)
193 : (-3.0513897,-0.6075697)
195 : (-8.187313,-5.181074)
197 : (33.65944,40.2564)
198 : (-0.64326054,-4.086837)
199 : (-10.554681,-12.706717)
200 : (18.93743,29.3177)
202 : (-5.301973,-15.005705)
208 : (-7.2508016,-14.100331)
209 : (-19.64536,-23.58665)
211 : (-3.6678975,4.9338455)
214 : (-4.1849194,-7.2833357)
215 : (-1.1494977,-7.4395123)
217 : (-2.6624355,11.72216)
218 : (-6.4984765,-9.903734)
222 : (0.2119419,-2.0705266)
226 : (-4.751293,11.307108)
231 : (13.396966,13.482294)
232 : (-10.148484,-9.455285)
233 : (-11.613926,-30.141973)
235 : (-4.1457195,-11.701864)
236 : (22.841429,27.695446)
237 : (20.703121,28.321404)
238 : (2.2251,-9.911165)
240 : (4.6583896,13.250011)
242 : (0.56912243,1.7683926)
248 : (-13.757292,-6.036418)
250 : (-2.0742264,-11.74327)
251 : (-22.361734,-21.731167)
252 : (-4.5171075,-6.9133253)
258 : (-15.887733,-15.204248)
259 : (13.085469,7.5854363)
260 : (17.63313,21.100315)
261 : (7.1418476,0.2580099)
262 : (-14.919332,-23.728527)
263 : (24.858322,28.005262)
266 : (-0.1598835,1.6914234)
267 : (-11.6540985,-19.327158)
270 : (-9.534692,-15.585428)
273 : (23.928104,34.40332)
276 : (12.787605,5.514979)
279 : (0.36071712,-6.126135)
281 : (-6.324025,-4.401108)
284 : (4.8829827,6.8221273)
285 : (-20.047634,-17.415882)
287 : (-6.266363,-7.5843716)
292 : (31.943773,28.52203)
294 : (4.4730716,17.863426)
295 : (-24.903772,-31.832272)
296 : (23.457853,27.188269)
298 : (-5.066526e-2,3.090138)
299 : (-12.440723,-12.220831)
300 : (10.800417,2.0174663)
302 : (21.627502,25.618221)
304 : (-19.292229,-21.6833)
307 : (-7.7303686,-6.4778433)
308 : (16.438334,17.45433)
309 : (18.270615,16.974281)
313 : (-18.940536,-14.294319)
315 : (1.1139888,-9.944632)

fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-1D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-3D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-3x3-cross: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil-3x3-pair: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

stencil2-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

permute-hist: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

backpermute-reverse: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

backpermute-transpose: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

init: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

tail: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

take: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

drop: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

slit: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

gather: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

gather-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

scatter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

scatter-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

sasum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

saxpy: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

dotp: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

filter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

smvm: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

black-scholes: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

radixsort: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

slices: ^C[ 3364.241184] [WARN]Received Interrupt signal.
Failed: user interrupt
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
unspecified launch failure

sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok
accelerate-examples: forkOS_entry: interrupted
neil@debian-neil:~/.cabal/bin$

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create
a setup that threw an invalid context error, so the fix for that might
help in your situation as well.


Reply to this email directly or view it on GitHub
#92 (comment).

@tmcdonell
Copy link
Member

Oops, sorry for all the compilation failures with `even`. I'm currently hacking on nofib to produce this test case for the context bug, but the local changes aren't ready to push upstream.

Are these the same errors you had initially? They look more like what we had after the hack to replace forkOS with forkOn 0. That would at least be some progress!

For the "unspecified launch failure" errors, we might be trying to launch a kernel that requires more resources than your card provides. Since I haven't tested on an Optimus card before, there might be bugs in the occupancy calculator code.

Try the following?

import Prelude                          as P
import Data.Array.Accelerate            as A
import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float)
xs = use $ fromList (Z:.10) [0..]
ys = use $ fromList (Z:.10) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys
  = A.fold (+) 0
  $ A.zipWith (*) xs ys

main :: IO ()
main
  = withArgs ["-ddump-cc", "-ddump-gc", "-ddump-exec", "-dverbose"]
  $ print
  $ run (dotp xs ys)

You'll need to have installed accelerate-cuda with the -fdebug flag, or just run in ghci using the script in the utils directory (you might have to edit it a bit depending on where you have checked out the individual repositories).
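As a rough illustration of the kind of resource check the occupancy code has to get right, a launch configuration must fit within the device's per-block limits. This is a hand-written sketch, not accelerate's actual occupancy calculator; the limits are those a Kepler-class card such as the Quadro K1000M reports, and the launch parameters match the foldAll launch shown later in this thread.

```python
# Hand-written sketch, NOT accelerate's occupancy code. Default limits:
# 1024 threads/block, 48 KB shared memory/block, 65536 registers/block.
def launch_fits(threads_per_block, smem_bytes, regs_per_thread,
                max_threads=1024, max_smem=49152, max_regs=65536):
    return (threads_per_block <= max_threads
            and smem_bytes <= max_smem
            and threads_per_block * regs_per_thread <= max_regs)

# The foldAll launch reported later in this thread:
# 1024 threads, 4096 bytes of shared memory, 11 registers per thread.
print(launch_fits(1024, 4096, 11))   # -> True
print(launch_fits(2048, 4096, 11))   # -> False (exceeds threads per block)
```

If a calculation like this over-estimates what the device allows, the driver rejects the kernel with exactly this kind of launch failure.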

@neiljamieso
Copy link
Author

Thanks Trevor,

I'll try that. The fix you did for the invalid context (last email) -
was that in cuda or accelerate-cuda? I only rebuilt accelerate-cuda
(and dependencies).

The "unspecified launch failure errors" were in the "forkOS" version.
The "forkOn 0" version had the context errors.

I did put the forkOn back in, but I'm not sure I rebuilt the whole sequence
properly afterwards.

Cheers, Neil

@neiljamieso
Copy link
Author

Hi Trevor,

This is the output. Are you able to make sense of it? Certainly seems
to have worked!

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda -k
-- -fdump-cc
0.03:gc: initialise default context
0.07:gc: initialise context #0x00007f8f1c00b4f0
Device 0: Quadro K1000M (compute capatability 3.0)
1 multiprocessors @ 850.50 MHz (192 cores), 2 GB global memory
0.07:gc: push context: #0x00007f8f1c00b4f0
0.07:gc: initialise CUDA state
0.07:gc: initialise memory table
0.07:cc: initialise kernel table
0.07:cc: persist/restore: 39 entries
0.08:gc: lookup/not found: Array #32
0.08:gc: useArray/malloc: 40 B
0.08:gc: malloc/new
0.08:gc: insert: Array #32
0.08:gc: lookup/not found: Array #31
0.08:gc: useArray/malloc: 40 B
0.08:gc: malloc/new
0.08:gc: insert: Array #31
0.08:cc: (3.0,"\178\140cp$\ACK\226\229\195l\184eF`f3")
#include <accelerate_cuda_extras.h>
extern "C" __global__ void foldAll(const DIM1 shIn0, const float*
__restrict__ arrIn0_a0, const DIM1 shIn1, const float* __restrict__
arrIn1_a0, const DIM0 shOut, float* __restrict__ arrOut_a0)
{
extern volatile __shared__ float sdata0[];
float x0;
float y0;
const Int64 sh0 = min((Int64) shIn0, (Int64) shIn1);
const int shapeSize = sh0;
const int gridSize = blockDim.x * gridDim.x;
int ix = blockDim.x * blockIdx.x + threadIdx.x;

if (ix < shapeSize) {
const Int64 v2 = ix;
const int v3 = toIndex(shIn0, shape(v2));
const int v4 = toIndex(shIn1, shape(v2));

y0 = arrIn0_a0[v3] * arrIn1_a0[v4];
for (ix += gridSize; ix < shapeSize; ix += gridSize) {
const Int64 v2 = ix;
const int v3 = toIndex(shIn0, shape(v2));
const int v4 = toIndex(shIn1, shape(v2));

x0 = arrIn0_a0[v3] * arrIn1_a0[v4];
y0 = x0 + y0;
}
}
sdata0[threadIdx.x] = y0;
__syncthreads();
ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x);
if (threadIdx.x + 512 < ix) {
x0 = sdata0[threadIdx.x + 512];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 256 < ix) {
x0 = sdata0[threadIdx.x + 256];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 128 < ix) {
x0 = sdata0[threadIdx.x + 128];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 64 < ix) {
x0 = sdata0[threadIdx.x + 64];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x < 32) {
if (threadIdx.x + 32 < ix) {
x0 = sdata0[threadIdx.x + 32];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 16 < ix) {
x0 = sdata0[threadIdx.x + 16];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 8 < ix) {
x0 = sdata0[threadIdx.x + 8];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 4 < ix) {
x0 = sdata0[threadIdx.x + 4];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 2 < ix) {
x0 = sdata0[threadIdx.x + 2];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 1 < ix) {
x0 = sdata0[threadIdx.x + 1];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
}
if (threadIdx.x == 0) {
if (shapeSize > 0) {
if (gridDim.x == 1) {
x0 = 0.0f;
y0 = x0 + y0;
}
arrOut_a0[blockIdx.x] = y0;
} else {
arrOut_a0[blockIdx.x] = 0.0f;
}
}
}

0.08:cc: (3.0,"\209\181\149\254\136cnX\DEL\171\b\219\160\133\133:")
#include <accelerate_cuda_extras.h>
extern "C" __global__ void foldAll(const DIM1 shIn0, const float*
__restrict__ arrIn0_a0, const DIM1 shIn1, const float* __restrict__
arrIn1_a0, const DIM0 shOut, float* __restrict__ arrOut_a0, const DIM1
shRec, const float* __restrict__ arrRec_a0)
{
extern volatile __shared__ float sdata0[];
float x0;
float y0;
const Int64 sh0 = shRec;
const int shapeSize = sh0;
const int gridSize = blockDim.x * gridDim.x;
int ix = blockDim.x * blockIdx.x + threadIdx.x;

if (ix < shapeSize) {
y0 = arrRec_a0[ix];
for (ix += gridSize; ix < shapeSize; ix += gridSize) {
x0 = arrRec_a0[ix];
y0 = x0 + y0;
}
}
sdata0[threadIdx.x] = y0;
__syncthreads();
ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x);
if (threadIdx.x + 512 < ix) {
x0 = sdata0[threadIdx.x + 512];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 256 < ix) {
x0 = sdata0[threadIdx.x + 256];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 128 < ix) {
x0 = sdata0[threadIdx.x + 128];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 64 < ix) {
x0 = sdata0[threadIdx.x + 64];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x < 32) {
if (threadIdx.x + 32 < ix) {
x0 = sdata0[threadIdx.x + 32];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 16 < ix) {
x0 = sdata0[threadIdx.x + 16];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 8 < ix) {
x0 = sdata0[threadIdx.x + 8];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 4 < ix) {
x0 = sdata0[threadIdx.x + 4];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 2 < ix) {
x0 = sdata0[threadIdx.x + 2];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 1 < ix) {
x0 = sdata0[threadIdx.x + 1];
y0 = y0 + x0;
sdata0[threadIdx.x] = y0;
}
}
if (threadIdx.x == 0) {
if (shapeSize > 0) {
if (gridDim.x == 1) {
x0 = 0.0f;
y0 = x0 + y0;
}
arrOut_a0[blockIdx.x] = y0;
} else {
arrOut_a0[blockIdx.x] = 0.0f;
}
}
}

0.09:cc: waiting for nvcc...
0.09:cc: queue: 19.000 µs, execute: 1.316 s
... /usr/bin/nvcc -I
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits -arch=sm_30
-cubin -o /tmp/accelerate-cuda-12605/dragon12606.cubin -O3 -m64
/tmp/accelerate-cuda-12605/dragon12606.cu
0.09:cc: queue: 32.000 µs, execute: 1.319 s
... /usr/bin/nvcc -I
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits -arch=sm_30
-cubin -o /tmp/accelerate-cuda-12605/dragon12605.cubin -O3 -m64
/tmp/accelerate-cuda-12605/dragon12605.cu
0.09:cc: persist/save:
/home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/z33Ufz60UFezr184lzr195zr229zr226zrACKzdpczr140zr178
0.09:cc: entry function 'foldAll' used 11 registers, 0 bytes smem, 0
bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps
in 2 blocks
0.09:gc: lookup/not found: Array #25
0.09:gc: mallocArray: 4 B
0.09:gc: malloc/new
0.09:gc: insert: Array #25
0.09:gc: lookup/found: Array #32
0.09:gc: lookup/found: Array #31
0.09:gc: lookup/found: Array #25
0.09:exec: foldAll<<< 1, 1024, 4096 >>> gpu: 48.128 µs, cpu: 0.000 s
0.09:gc: lookup/found: Array #25
0.09:gc: peekArray: 4 B
0.09:gc: pop context: #0x00007f8f1c00b4f0
Array (Z) [660.0]
neil@debian-neil:~/.cabal/bin$
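For anyone following along, the shared-memory tree reduction that the dumped foldAll kernel performs can be modeled sequentially as below. This is a simplified model of the strategy (fold the upper half of the buffer into the lower half until element 0 holds the block's sum), not a line-for-line translation of the CUDA above.

```python
# Simplified model of foldAll's block-level tree reduction. Assumes a
# power-of-two buffer, matching the 1024-thread launch in the log above.
def block_reduce(sdata):
    width = len(sdata)
    while width > 1:
        half = width // 2
        for i in range(half):          # one thread per index on the GPU
            sdata[i] += sdata[i + half]
        width = half
    return sdata[0]

print(block_reduce([float(i) for i in range(1024)]))  # -> 523776.0
```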

Cheers, Neil

@tmcdonell
Copy link
Member

Hi Neil,

hmm, it does indeed seem to have worked. Okay, a couple more things to try, if you don't mind:

  • Could you run the deviceQueryDrv program from the CUDA SDK examples and show me the output?
  • I've not encountered an Optimus/Quadro device before, so my current thought is that something is wrong with the calculations that determine how many threads to launch. Try increasing the vector sizes for the test program I sent last time and find the point where it fails. Multiples of 1024 are probably a good increment. Feel free to comment out the line beginning withArgs so that it is less chatty.

Thanks!

@tmcdonell
Copy link
Member

Oh, also, did you need to edit Async.hs to use forkOn 0 after I pushed the latest patches, or were the previous results with a clean checkout?

@neiljamieso
Copy link
Author

The forkOn 0 no longer makes any difference, i.e. all now fail as they did
with forkOS.

I'll try the suggestion about cranking up the size of the vectors and
get back.

Neil

@neiljamieso
Copy link
Author

Here's the deviceQueryDrv output:

neil@debian-neil:~/Documents/Computing/CUDASamples/1_Utilities/deviceQueryDrv$
optirun ./deviceQueryDrv
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "Quadro K1000M"
CUDA Driver Version: 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes
(2147287040 bytes)
( 1) Multiprocessors x (192) CUDA Cores/MP: 192 CUDA Cores
GPU Clock rate: 851 MHz (0.85 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Sizes 1D=(65536)
2D=(65536,65536) 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048,
2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device
simultaneously) >
neil@debian-neil:~/Documents/Computing/CUDASamples/1_Utilities/deviceQueryDrv$

@neiljamieso
Copy link
Author

Remarkably durable...

This is the code

import Prelude as P
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float)
xs = use $ fromList (Z:.1000000) [0..]
ys = use $ fromList (Z:.1000000) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys
= A.fold (+) 0
$ A.zipWith (*) xs ys

main :: IO ()
main
= withArgs ["-ddump-cc"{--, "-ddump-gc", "-ddump-exec", "-dverbose"--}]
$ print
$ run (dotp xs ys)

And this is the output

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda -k
0.12:cc: initialise kernel table
0.12:cc: persist/restore: 41 entries
0.18:cc: found/persistent
0.18:cc: found/persistent
0.18:cc: entry function 'foldAll' used 11 registers, 0 bytes smem, 0
bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps
in 2 blocks
0.18:cc: entry function 'foldAll' used 8 registers, 0 bytes smem, 0
bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps
in 2 blocks
Array (Z) [6.666666e17]
neil@debian-neil:~/.cabal/bin$
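The "multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks" line above can be reproduced with a little arithmetic. This is a back-of-the-envelope sketch using the device limits from the deviceQueryDrv output, not accelerate's code:

```python
# Back-of-the-envelope occupancy arithmetic for the foldAll launch:
# 1024 threads per block on an SM hosting up to 2048 resident threads.
threads_per_block = 1024
warp_size = 32
max_threads_per_sm = 2048            # from the deviceQueryDrv output

blocks_per_sm = max_threads_per_sm // threads_per_block
resident_threads = blocks_per_sm * threads_per_block
resident_warps = resident_threads // warp_size
occupancy = resident_threads / max_threads_per_sm

print(blocks_per_sm, resident_warps, occupancy)  # -> 2 64 1.0
```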

@neiljamieso
Copy link
Author

Changing vector sizes to this...

xs = use $ fromList (Z:.1000000000) [0..]
ys = use $ fromList (Z:.1000000000) [2,4..]

Lead to a perfectly reasonable...
neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda -k
39.85:cc: initialise kernel table
39.85:cc: persist/restore: 41 entries
accelerate-examples:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception:
out of memory

neil@debian-neil:~/.cabal/bin$

@neiljamieso
Copy link
Author

Does the simple dotp example exercise the Async module? This seems to
be the source of the crashes.

Cheers, Neil

@neiljamieso
Copy link
Author

Oops, sorry, a mistype there. They are with forkIO (not forkOS). I
tried with forkOn 0 and got the same results. Previously forkOn 0 gave
more successes and failed with a "bad context" message rather than "launch
failed"; "launch failed" has always happened with forkIO.

@neiljamieso
Copy link
Author

Hi Trevor,

I mentioned this before, but it may have been lost, and it is more of a
worry... The fourth slices example fails to terminate (even after 40 sec). I
have to use Ctrl-C to kill it. I'm not sure why this has changed.

Cheers, neil

@neiljamieso
Copy link
Author

Hi Trevor,

I thought you might be interested in this. Running the regression test
script seems to work - no crashing, no stalling on the slices!

neil@debian-neil:~/Downloads/Accelerate_130529/accelerate-examples-master/accelerate-examples-master$
optirun ./regression_test.sh --cuda

First the main battery of tests:

running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Ok
map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Ok
fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Ok
fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:

0 : (0.0,NaN)
1 : (-0.6929801,-4.2535293e37)
2 : (-1.2756131,NaN)
3 : (-0.9977418,NaN)
4 : (-1.1877143,NaN)
5 : (-1.4590598,NaN)
6 : (-1.465081,NaN)
7 : (-1.5335276,NaN)
8 : (-1.8964667,NaN)
9 : (-2.429172,NaN)
11 : (0.9855077,0.0)
12 : (1.2848983,0.0)
14 : (0.9586575,0.0)
15 : (0.8935447,0.0)
16 : (0.55967414,0.0)
17 : (0.7870643,0.0)
18 : (0.38397616,0.0)
19 : (0.5038597,0.0)
20 : (1.0932949,0.0)
22 : (-0.7802813,0.0)
23 : (-0.90180016,0.0)
24 : (-1.1760286,0.0)
25 : (-0.66521347,0.0)
27 : (0.8123276,0.0)
28 : (1.6648452,0.0)
29 : (1.8714409,0.0)
30 : (1.5091901,0.0)
31 : (2.096872,0.0)
32 : (2.3554232,0.0)
34 : (-0.82877505,0.0)
35 : (-1.8104537,0.0)
36 : (-1.8511171,0.0)
37 : (-1.4023463,0.0)
38 : (-2.062095,0.0)
39 : (-1.5179899,0.0)
40 : (-0.57485485,0.0)
41 : (-1.3017156,0.0)
43 : (-0.56559163,0.0)
44 : (-0.8005209,0.0)
45 : (-0.26718092,0.0)
47 : (-0.42379427,0.0)
48 : (-0.6211059,0.0)
49 : (-1.3470457,0.0)
50 : (-2.2204418,0.0)
51 : (-1.9068379,0.0)
52 : (-2.0748498,0.0)
53 : (-1.0756776,0.0)
54 : (-1.121619,0.0)
55 : (-1.9701061,0.0)
57 : (-0.3139459,0.0)
58 : (-0.46075392,0.0)
59 : (0.50402975,0.0)
60 : (-0.27072406,0.0)
61 : (-0.49237812,0.0)
62 : (-1.2419014,0.0)
63 : (-2.084043,0.0)

stencil-1D: Ok
stencil-2D: Ok
stencil-3D: Ok
stencil-3x3-cross: Ok
stencil-3x3-pair: Ok
stencil2-2D: Ok
permute-hist: Ok
backpermute-reverse: Ok
backpermute-transpose: Ok
init: Ok
tail: Ok
take: Ok
drop: Ok
slit: Ok
gather: Ok
gather-if: Ok
scatter: Ok
scatter-if: Ok
sasum: Ok
saxpy: Ok
dotp: Ok
filter: Ok
smvm: Ok
black-scholes: Ok
radixsort: Ok
io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Ok
slices: Ok
slices: Ok
slices: Ok
slices: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok

Next, additional application tests, beginning with mandelbrot:

accelerate-mandelbrot (c) [2011..2013] The Accelerate Team

Usage: accelerate-mandelbrot [OPTIONS]

Available backends:
interpreter reference implementation (sequential)

  • cuda implementation for NVIDIA GPUs (parallel)

Runtime usage:
arrows translate display
z ; zoom in
x q zoom out
f single precision calculations
d double precision calculations (if supported)

Error: unrecognized option `--size=64'

Run "accelerate-mandelbrot --help" for usage information
neil@debian-neil:~/Downloads/Accelerate_130529/accelerate-examples-master/accelerate-examples-master$

@neiljamieso
Copy link
Author

Hi Trevor,

I hope you don't mind me sending lots, but I am on a roll at the
moment. Using the regression script I saw the --size option and tried
it out. The accelerate-examples work with --size=1024 and fail with
--size=2048 (with the "launch failure" message). So this seems to be a
size problem rather than some basic fault in the context or launch
process. I suspect your thoughts about the memory-usage calculations
being wrong are correct.

Actually I can be more specific... 1024 works, 1025 fails.

There are numerous (hundreds of) "fails" in the results not matching the
interpreter result in scanseg-sum (but it ran!).

:-) Neil

Also fluid, mandelbrot, etc. all run fine. Haven't tried hashcat.

smoothlife chokes on the default settings. I get a decent animation
with the following, but it still slows down and misses frames as the
animation progresses. I assume this is a result of having a low-end GPU.

neil@debian-neil:~/.cabal/bin$ ./accelerate-smoothlife --cuda --size=64
--sigmode=2 --sigtype=Smooth --framerate=5

Pretty happy now!

Neil

All this with forkIO
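The jump from "1024 works" to "1025 fails" can be pinned down mechanically with a bisection. This is a hypothetical sketch: the `probe` stub stands in for actually running the dotp test at a given size, and here it simply models a device that fails above 1024 elements, as observed.

```python
# Bisect the smallest failing vector length. `probe(n)` should run the
# real test at size n and return True on success; here it is a stub.
def smallest_failure(probe, lo, hi):
    # invariant: probe(lo) succeeds, probe(hi) fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid
        else:
            hi = mid
    return hi

print(smallest_failure(lambda n: n <= 1024, 1, 1 << 20))  # -> 1025
```

A threshold of 1025, one more than the maximum threads per block, is consistent with the failure appearing as soon as the fold needs a second block and hence the recursive second foldAll pass.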

@tmcdonell
Copy link
Member

The forkOn 0 no longer makes any difference - i.e all now fail as it did
with forkOS.

Okay, that's great! I made some changes elsewhere that try to do the same thing but without being fixed to CPU zero, so I am glad that works. One problem down!

@tmcdonell
Copy link
Member

Does the simple dotp example exercise the Async module? This seems to
be the source of the crashes.

Yes, all run invocations will go via Async. I think we fixed the problem there, and the failures now are related to the kernel launches.
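Thread affinity matters here because a CUDA context is only "current" on the OS thread that bound it, which is why forkOS/forkOn behaviour changes things. A rough Python sketch of that failure mode, using `threading.local` as a stand-in for the driver's per-thread current-context state (all names here are illustrative, not from any real CUDA binding):

```python
import threading

# Stand-in for the CUDA driver's per-OS-thread "current context" state.
_current = threading.local()

class Context:
    def make_current(self):
        _current.ctx = self

def launch_kernel():
    # A real driver call would fail with an invalid-context error here;
    # we just raise to show the shape of the bug.
    ctx = getattr(_current, "ctx", None)
    if ctx is None:
        raise RuntimeError("no context current on this thread")
    return "launched"

ctx = Context()
ctx.make_current()          # context is current on the main thread only
print(launch_kernel())      # works here

result = {}
def worker():
    try:
        launch_kernel()     # other OS thread: context is not current
    except RuntimeError as e:
        result["err"] = str(e)

t = threading.Thread(target=worker)
t.start(); t.join()
print(result["err"])
```

The same launch succeeds on the thread that made the context current and fails on any other, which is the behaviour a green-threaded runtime can trip over when it migrates work between OS threads.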

@tmcdonell
Copy link
Member

Hi Neil,

I hope you don't mind me sending lots, but I am on a roll at the
moment.

Not at all, it is all very useful information (:

Using the regressions script I saw the --size option and tried
it out. The accelerate-examples work with --size=1024, and fail with
--size=2048 (with the "launch failure" message). So this seems to be a
size problem rather than some basic fault in the context or launch
process. I suspect your thoughts about the calculations for memory
usage being wrong are correct.

Actually I can be more specific... 1024 works, 1025 fails.

Ah, that is very helpful, thanks! I'll play around and see if I can dig up any more leads to follow.
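The 1024/1025 boundary is suggestive: 1024 is a common limit on threads per block, so 1025 elements is the first size that forces a second thread block and a second reduction pass. A Python sketch of that arithmetic (the block size of 1024 is an assumption for illustration, not necessarily what accelerate-cuda configures):

```python
BLOCK = 1024  # assumed threads per block; BLOCK + 1 is the first size needing two blocks

def grid_size(n, block=BLOCK):
    # ceiling division: number of thread blocks needed for n elements
    return (n + block - 1) // block

def reduce_sum(xs, block=BLOCK):
    # two-phase reduction: each "block" folds its chunk, then the
    # per-block partial sums are folded again (recursively).
    if len(xs) <= block:
        return sum(xs)
    partials = [sum(xs[i:i + block]) for i in range(0, len(xs), block)]
    return reduce_sum(partials, block)

print(grid_size(1024))                 # 1 block: the single-pass case that worked
print(grid_size(1025))                 # 2 blocks: first size needing a second pass
print(reduce_sum(list(range(1025))))   # same answer as a flat sum
```

If the single-block path works and the multi-block path fails, the bug likely lives in the second pass (or in the grid-size/shared-memory calculation that feeds it).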

There are numerous (hundreds of) "fails" in the results not matching the
interpreter result in scanseg-sum (but it ran!).

A little worrying, but at least it runs! We'll get to that one later (:

Also fluid, mandelbrot, etc. all run fine. Haven't tried the hashcat.

Great!

For hashcat you'll need to find a list of plain-text words to feed it, and then a bunch of MD5 digests to guess. You can use a standard dictionary like /usr/share/dict/english, although for a bit of fun, Google for the rockyou list and a list of unknown MD5s (:

smoothlife chokes on the default settings. I get a decent animation
with the following, but it still slows down and misses frames as the
animation progresses. I assume this is a result of having a low end GPU.

neil@debian-neil:~/.cabal/bin$ ./accelerate-smoothlife --cuda --size=64
--sigmode=2 --sigtype=Smooth --framerate=5

I think it depends on whether or not accelerate-fft was built against the fast CUDA FFT library implementation. I don't think there is an easy way to check whether this happened or not, aside from just running and measuring the speed. Try:

cabal install accelerate-fft -fcuda

Or just install it after the accelerate-cuda package is already installed. This should probably have better documentation!

-Trev

@neiljamieso
Copy link
Author

On 03/06/13 16:04, Trevor L. McDonell wrote:

cabal install accelerate-fft -fcuda
Worked! Smoothlife now works beautifully. Amazing speedup in processing.

tmcdonell added a commit to tmcdonell/accelerate-cuda that referenced this issue Jun 7, 2013
@tmcdonell
Copy link
Member

@neiljamieso does everything work fine now? Some recent fixes to the fold kernel means that those tests should pass now. Do you still have any problems here?

@neiljamieso
Copy link
Author

Hi Trev,

How recent a download from Github do I need?

Neil

On 15/11/13 16:02, Trevor L. McDonell wrote:

@neiljamieso https://github.com/neiljamieso does everything work
fine now? Some recent fixes to the fold kernel means that those tests
should pass now. Do you still have any problems here?


Reply to this email directly or view it on GitHub
#92 (comment).

@neiljamieso
Copy link
Author

Hi Trev,

I tried installing the latest accelerate stuff from github.

The latest accelerate-cuda depends on cuda-1.5.1.1, but the latest cuda on
github is 1.5.1.0.

On 15/11/13 16:02, Trevor L. McDonell wrote:

@neiljamieso https://github.com/neiljamieso does everything work
fine now? Some recent fixes to the fold kernel means that those tests
should pass now. Do you still have any problems here?



@mchakravarty
Copy link
Member

@neiljamieso Trev probably forgot to push the version bump. Just change the version in cuda.cabal to 1.5.1.1 and it'll work.

@neiljamieso
Copy link
Author

Not working so well. I have attached the standard and verbose outputs
(with my command line at the front).

Neil

On 16/11/13 23:41, Manuel M T Chakravarty wrote:

@neiljamieso https://github.com/neiljamieso Trev probably forgot to
push the version bump. Just change the version in |cuda.cabal| to
1.5.1.1 and it'll work.



neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda --size=1024 -v > verbose_test_131117
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [28,0,0] Assertion ix >= 0 && ix < sh failed.
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [29,0,0] Assertion ix >= 0 && ix < sh failed.
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [30,0,0] Assertion ix >= 0 && ix < sh failed.
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [31,0,0] Assertion ix >= 0 && ix < sh failed.
accelerate-examples:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

neil@debian-neil:~/.cabal/bin$
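The device-side assertion above comes from toIndex in accelerate_cuda_shape.h, which linearises a multidimensional index and checks ix >= 0 && ix < sh. A reference version in Python (assuming the usual row-major layout) shows what the assertion guards against:

```python
def to_index(shape, ix):
    # Linearise index `ix` into a row-major array of extent `shape`,
    # applying the same bounds check as accelerate_cuda_shape.h:
    #   assert ix >= 0 && ix < sh   (per dimension)
    linear = 0
    for sh, i in zip(shape, ix):
        assert 0 <= i < sh, "index out of bounds"
        linear = linear * sh + i
    return linear

print(to_index((3, 4), (2, 3)))  # last element of a 3x4 array -> 11
try:
    to_index((3, 4), (2, 4))     # column 4 does not exist in a width-4 row
except AssertionError as e:      # this is the check the device assert reports
    print("assert fired:", e)
```

So when the GPU reports this assertion, some kernel computed an index outside the extent of the array it was given, rather than the memory transfer itself being wrong.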

_OUTPUT_

running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Ok
map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Ok
fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Ok
fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-1D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3x3-cross: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3x3-pair: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil2-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

permute-hist: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

backpermute-reverse: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

backpermute-transpose: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

init: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

tail: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

take: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

drop: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slit: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

gather: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

gather-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

scatter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

scatter-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

sasum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

saxpy: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

dotp: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

filter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

smvm: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

black-scholes: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

radixsort: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok

warming up
estimating clock resolution...
mean is 4.154538 us (160001 iterations)
found 1231 outliers among 159999 samples (0.8%)
1094 (0.7%) high severe
estimating cost of a clock call...
mean is 83.69922 ns (32 iterations)
found 4 outliers among 32 samples (12.5%)
3 (9.4%) low mild
1 (3.1%) high mild

benchmarking map-abs

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda --size=1024 > bare_test_131117
/home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [31,0,0] Assertion ix >= 0 && ix < sh failed.
accelerate-examples:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

neil@debian-neil:~/.cabal/bin$

_OUTPUT_

running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Ok
map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Ok
fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Ok
fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-1D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3x3-cross: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3x3-pair: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil2-2D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

permute-hist: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

backpermute-reverse: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

backpermute-transpose: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

init: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

tail: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

take: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

drop: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slit: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

gather: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

gather-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

scatter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

scatter-if: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

sasum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

saxpy: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

dotp: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

filter: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

smvm: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

black-scholes: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

radixsort: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok

warming up
estimating clock resolution...
mean is 3.963331 us (160001 iterations)
found 53631 outliers among 159999 samples (33.5%)
25736 (16.1%) low severe
27895 (17.4%) high severe
estimating cost of a clock call...
mean is 88.11406 ns (29 iterations)
found 5 outliers among 29 samples (17.2%)
2 (6.9%) high mild
3 (10.3%) high severe

benchmarking map-abs

@tmcdonell
Copy link
Member

Sorry for the problem with the cuda package version; it's fixed and will be uploaded to Hackage soon.

Could you run the accelerate-nofib program and see if that works? accelerate-examples is no longer built as part of the accelerate-examples package, so you are probably executing an old version.

@neiljamieso
Copy link
Author

Hullo Trev,

Not sure what this means: "accelerate-examples is no longer built as
part of the accelerate-examples package".

This is the output from nofib...

EKG monitor started at: http://localhost:8000

accelerate-nofib (c) [2013] The Accelerate Team

Usage: accelerate-nofib [OPTIONS]

Available backends:
interpreter reference implementation (sequential)

  • cuda implementation for NVIDIA GPUs (parallel)

prelude:
map:
Int32:
DIM0:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
DIM1:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
DIM2:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
Int64:
DIM0:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
DIM1:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
DIM2:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
zipWith:
Int32:
DIM0:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
DIM1:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
DIM2:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
Int64:
DIM0:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
DIM1:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
DIM2:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
foldAll:
Int32:
DIM0:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM2:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
Int64:
DIM0:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM2:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
fold:
Int32:
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM2:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
Int64:
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM2:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
backpermute:
Int32:
reverse: [OK, passed 100 tests]
transpose: [OK, passed 100 tests]
init: [OK, passed 100 tests]
tail: [OK, passed 100 tests]
take: [OK, passed 100 tests]
drop: [OK, passed 100 tests]
slit: [OK, passed 100 tests]
gather: [OK, passed 100 tests]
gatherIf: [OK, passed 100 tests]
Int64:
reverse: [OK, passed 100 tests]
transpose: [OK, passed 100 tests]
init: [OK, passed 100 tests]
tail: [OK, passed 100 tests]
take: [OK, passed 100 tests]
drop: [OK, passed 100 tests]
slit: [OK, passed 100 tests]
gather: [OK, passed 100 tests]
gatherIf: [OK, passed 100 tests]
permute:
Int32:
fill:
DIM1: [OK, passed 100 tests]
DIM2: [OK, passed 100 tests]
scatter: [OK, passed 100 tests]
scatterIf: [OK, passed 100 tests]
histogram: [OK, passed 100 tests]
Int64:
fill:
DIM1: [OK, passed 100 tests]
DIM2: [OK, passed 100 tests]
scatter: [OK, passed 100 tests]
scatterIf: [OK, passed 100 tests]
histogram: [OK, passed 100 tests]
prefix sum:
Int32:
scanl: [OK, passed 100 tests]
scanl': [OK, passed 100 tests]
scanl1: [Failed]
*** Failed! Falsifiable (after 2 tests):
Array (Z :. 1) [1]
*** Expected:
Array (Z :. 1) [1]
*** Received:
Array (Z :. 1) [-2046376583]

(used seed -1630649237856122637)
scanr: [OK, passed 100 tests]
scanr': [OK, passed 100 tests]
scanr1: [Failed]
*** Failed! Falsifiable (after 2 tests):
Array (Z :. 1) [1]
*** Expected:
Array (Z :. 1) [1]
*** Received:
Array (Z :. 1) [1945653521]

(used seed -4172774753861454420)
scanl1Seg: [OK, passed 100 tests]
scanr1Seg: [OK, passed 100 tests]
scanlSeg: [Failed]
*** Failed! Falsifiable (after 5 tests and 1 shrink):
Array (Z :. 0) []
Array (Z :. 4) [1,4,4,1]
Array (Z :. 10) [-1,-1,0,0,-1,2,-2,-1,-2,-1]
*** Expected:
Array (Z :. 14) [0,-1,0,-1,-1,-1,-2,0,2,0,-1,-3,0,-1]
*** Received:
Array (Z :. 14) [0,0,0,0,0,0,0,0,0,0,0,0,0,0]

(used seed -4068642445411035362)
scanrSeg: [Failed]
*** Failed! Falsifiable (after 3 tests and 1 shrink):
Array (Z :. 0) []
Array (Z :. 1) [1]
Array (Z :. 1) [1]
*** Expected:
Array (Z :. 2) [1,0]
*** Received:
Array (Z :. 2) [0,0]

(used seed 4504072601150252809)
scanl'Seg: [Failed]
*** Failed! Falsifiable (after 2 tests and 1 shrink):
Array (Z :. 0) []
Array (Z :. 1) [1]
Array (Z :. 1) [1]
*** Expected:
(Array (Z :. 1) [0],Array (Z :. 1) [1])
*** Received:
(Array (Z :. 1) [1619230025],Array (Z :. 1) [0])

(used seed -1768028967034461376)
scanr'Seg: [Failed]
*** Failed! Falsifiable (after 3 tests and 1 shrink):
Array (Z :. 0) []
Array (Z :. 1) [2]
Array (Z :. 2) [1,-1]
*** Expected:
(Array (Z :. 2) [-1,0],Array (Z :. 1) [0])
*** Received:
(Array (Z :. 2) [0,0],Array (Z :. 1) [0])

(used seed -578241401213968022)
Int64:
scanl: [OK, passed 100 tests]
scanl': [OK, passed 100 tests]
scanl1: [Failed]
*** Failed! Falsifiable (after 68 tests and 6 shrinks):
Array (Z :. 1) [3338350638594]
*** Expected:
Array (Z :. 1) [3338350638594]
*** Received:
Array (Z :. 1) [8589934590]

(used seed 8607050148139398118)
scanr: [OK, passed 100 tests]
scanr': [OK, passed 100 tests]
scanr1: [Failed]
*** Failed! Falsifiable (after 5 tests and 2 shrinks):
Array (Z :. 1) [-1]
*** Expected:
Array (Z :. 1) [-1]
*** Received:
Array (Z :. 1) [0]

(used seed 2474179189546383018)
scanl1Seg: [OK, passed 100 tests]
scanr1Seg: [OK, passed 100 tests]
scanlSeg: [Failed]
*** Failed! Falsifiable (after 4 tests and 1 shrink):
Array (Z :. 0) []
Array (Z :. 1) [2]
Array (Z :. 2) [0,-1]
*** Expected:
Array (Z :. 3) [0,0,-1]
*** Received:
Array (Z :. 3) [0,0,0]

(used seed -8403008051050665374)
scanrSeg: [Failed]
*** Failed! Falsifiable (after 2 tests and 1 shrink):
Array (Z :. 0) []
Array (Z :. 1) [1]
Array (Z :. 1) [-1]
*** Expected:
Array (Z :. 2) [-1,0]
*** Received:
Array (Z :. 2) [0,0]

(used seed 6231186752828250437)
scanl'Seg: [Failed]

accelerate-nofib:
*** Internal error in package accelerate ***
*** Please submit a bug report at
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA/State.hs:86 (unhandled): CUDA Exception:
unspecified launch failure

accelerate-nofib: forkOS_entry: interrupted

On 19/11/13 18:10, Trevor L. McDonell wrote:

Sorry for the problem with the cuda package version, fixed and will be
uploaded to hackage soon.

Could you run the |accelerate-nofib| program and see if that works?
|accelerate-examples| is no longer built as part of the
|accelerate-examples| package, so you are probably executing an old
version.



@tmcdonell
Copy link
Member

Ah, I mean that the program called accelerate-examples, whose output you showed, is no longer compiled when you install the accelerate-examples package. Thus, you must have been running a binary that was installed a while ago. Anyway...

It looks like scanl1 and scanr1 don't work. The failures in the segmented scans use these, so I'll assume for now that's why those fail. Did you cabal install accelerate-cuda -fdebug? Could you please run:

$ cuda-memcheck accelerate-nofib --int64=False --select-tests=scanl1 -- -fflush-cache -ddebug-cc

@neiljamieso
Author

Hi Trev,

Away for a week with just my phone. Will try when I get back.

Cheers, Neil

@neiljamieso
Author

Hi Trev,

As an experiment I tried running the interpreter version of this, and it
failed too, at different places. It is a lot slower of course, so
I'm waiting for it to finish so I can send the output....

..... done!


EKG monitor started at: http://localhost:8000

accelerate-nofib (c) [2013] The Accelerate Team

Usage: accelerate-nofib [OPTIONS]

Available backends:

  • interpreter  reference implementation (sequential)
    cuda         implementation for NVIDIA GPUs (parallel)

prelude:
map:
Int32:
DIM0:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
DIM1:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
DIM2:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
Int64:
DIM0:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
DIM1:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
DIM2:
abs: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
square: [OK, passed 100 tests]
zipWith:
Int32:
DIM0:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
DIM1:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
DIM2:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
Int64:
DIM0:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
DIM1:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
DIM2:
zip: [OK, passed 100 tests]
plus: [OK, passed 100 tests]
min: [OK, passed 100 tests]
foldAll:
Int32:
DIM0:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM2:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
Int64:
DIM0:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM2:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
fold:
Int32:
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM2:
sum: [Failed]
*** Failed! Falsifiable (after 1 test):
Array (Z :. 0 :. 0) []
*** Expected:
Array (Z :. 0) []
*** Received:
Array (Z :. 1) [0]

(used seed 7863247450130050956)
non-neutral sum: [Failed]
*** Failed! Falsifiable (after 1 test):
Array (Z :. 0 :. 0) []
0
*** Expected:
Array (Z :. 0) []
*** Received:
Array (Z :. 1) [0]

(used seed 5228219361933020874)
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
Int64:
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
DIM2:
sum: [Failed]
*** Failed! Falsifiable (after 1 test):
Array (Z :. 0 :. 1) []
*** Expected:
Array (Z :. 0) []
*** Received:
Array (Z :. 1) [0]

(used seed 3176500408165050443)
non-neutral sum: [Failed]
*** Failed! Falsifiable (after 1 test and 2 shrinks):
Array (Z :. 0 :. 1) []
0
*** Expected:
Array (Z :. 0) []
*** Received:
Array (Z :. 1) [0]

(used seed -5531309095382955723)
minimum: [OK, passed 100 tests]
maximum: [OK, passed 100 tests]
backpermute:
Int32:
reverse: [OK, passed 100 tests]
transpose: [OK, passed 100 tests]
init: [OK, passed 100 tests]
tail: [OK, passed 100 tests]
take: [OK, passed 100 tests]
drop: [OK, passed 100 tests]
slit: [OK, passed 100 tests]
gather: [OK, passed 100 tests]
gatherIf: [OK, passed 100 tests]
Int64:
reverse: [OK, passed 100 tests]
transpose: [OK, passed 100 tests]
init: [OK, passed 100 tests]
tail: [OK, passed 100 tests]
take: [OK, passed 100 tests]
drop: [OK, passed 100 tests]
slit: [OK, passed 100 tests]
gather: [OK, passed 100 tests]
gatherIf: [OK, passed 100 tests]
permute:
Int32:
fill:
DIM1: [OK, passed 100 tests]
DIM2: [OK, passed 100 tests]
scatter: [OK, passed 100 tests]
scatterIf: [OK, passed 100 tests]
histogram: [OK, passed 100 tests]
Int64:
fill:
DIM1: [OK, passed 100 tests]
DIM2: [OK, passed 100 tests]
scatter: [OK, passed 100 tests]
scatterIf: [OK, passed 100 tests]
histogram: [OK, passed 100 tests]
prefix sum:
Int32:
scanl: [OK, passed 100 tests]
scanl': [OK, passed 100 tests]
scanl1: [OK, passed 100 tests]
scanr: [OK, passed 100 tests]
scanr': [OK, passed 100 tests]
scanr1: [OK, passed 100 tests]
scanl1Seg: [OK, passed 100 tests]
scanr1Seg: [OK, passed 100 tests]
scanlSeg: [OK, passed 100 tests]
scanrSeg: [OK, passed 100 tests]
scanl'Seg: [OK, passed 100 tests]
scanr'Seg: [OK, passed 100 tests]
Int64:
scanl: [OK, passed 100 tests]
scanl': [OK, passed 100 tests]
scanl1: [OK, passed 100 tests]
scanr: [OK, passed 100 tests]
scanr': [OK, passed 100 tests]
scanr1: [OK, passed 100 tests]
scanl1Seg: [OK, passed 100 tests]
scanr1Seg: [OK, passed 100 tests]
scanlSeg: [OK, passed 100 tests]
scanrSeg: [OK, passed 100 tests]
scanl'Seg: [OK, passed 100 tests]
scanr'Seg: [OK, passed 100 tests]
foldSeg:
Int32:
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
DIM2:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
Int64:
DIM1:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
DIM2:
sum: [OK, passed 100 tests]
non-neutral sum: [OK, passed 100 tests]
minimum: [OK, passed 100 tests]
stencil:
Int32:
1D: [OK, passed 100 tests]
2D 3x3 dense: [OK, passed 100 tests]
2D 3x3 cross: [OK, passed 100 tests]
2D non-symmetric: [OK, passed 100 tests]
Int64:
1D: [OK, passed 100 tests]
2D 3x3 dense: [OK, passed 100 tests]
2D 3x3 cross: [OK, passed 100 tests]
2D non-symmetric: [OK, passed 100 tests]
replicate:
Int32:
(Z:.2:.All:.All): OK: OK: OK: [OK]
Int64:
(Z:.2:.All:.All): OK: OK: OK: [OK]
filter:
Int32: [OK, passed 100 tests]
Int64: [OK, passed 100 tests]
sharing recovery:
simple: [OK]
order fail: [OK]
test sort: [OK]
much sharing: [OK]
bfs fail: [OK]
two lets same level: [OK]
two lets same level: [OK]
no let at top: [OK]
no let at top: [OK]
pipe: [OK]
bound variables: [OK]
big tuple: [OK]
iteration:
simple: [OK]
outside: [OK]
body and condition: [OK]
awhile: [OK]
iterate: [OK]
nested: [OK]
unused: [OK]
io:
block copy:
toPtr Int16: [OK]
toPtr Int32: [OK]
toPtr Int64: [OK]
fromPtr Int32: [OK]
fromPtr (Int32,Double): [OK]
fromArray Int32: [OK]
vector:
Int32:
DIM0: [OK, passed 100 tests]
DIM1: [OK, passed 100 tests]
DIM2: [OK, passed 100 tests]
Int64:
DIM0: [OK, passed 100 tests]
DIM1: [OK, passed 100 tests]
DIM2: [OK, passed 100 tests]
imaginary:
sasum:
Int32: [OK, passed 100 tests]
Int64: [OK, passed 100 tests]
saxpy:
Int32: [OK, passed 100 tests]
Int64: [OK, passed 100 tests]
dot-product:
Int32: [OK, passed 100 tests]
Int64: [OK, passed 100 tests]
spectral:
radix sort:
Int32:
ascending: [OK, passed 100 tests]
descending: OK, passed 100 tests: [OK, passed 100 tests]
Int64:
ascending: [OK, passed 100 tests]
descending: OK, passed 100 tests: [OK, passed 100 tests]
foreign:
expf: [OK, passed 100 tests]
fmaf: [OK, passed 100 tests]

         Properties   Test Cases   Total
Passed   166          33           199
Failed   4            0            4
Total    170          33           203

@neiljamieso
Author

On 21/11/13 23:59, Trevor L. McDonell wrote:

|cuda-memcheck accelerate-nofib --int64=False --select-tests=scanl1 -- -fflush-cache -ddebug-cc|
output is:

$ optirun cuda-memcheck accelerate-nofib --int64=False
--select-tests=scanl1 -- -fflush-cache -ddebug-cc
========= CUDA-MEMCHECK
EKG monitor started at: http://localhost:8000

accelerate-nofib (c) [2013] The Accelerate Team

Usage: accelerate-nofib [OPTIONS]

Available backends:
    interpreter  reference implementation (sequential)
  • cuda         implementation for NVIDIA GPUs (parallel)

prelude:
prefix sum:
Int32:
scanl1: [Failed]
*** Failed! Falsifiable (after 2 tests):
Array (Z :. 1) [0]
*** Expected:
Array (Z :. 1) [0]
*** Received:
Array (Z :. 1) [-1998135121]

(used seed -3868271924695893879)
scanl1Seg: [OK, passed 100 tests]

         Properties   Total
Passed   1            1
Failed   1            1
Total    2            2
========= ERROR SUMMARY: 0 errors
neil@debian-neil:~/.cabal/bin$

@neiljamieso
Author

On 21/11/13 23:59, Trevor L. McDonell wrote:

|cuda-memcheck accelerate-nofib --int64=False --select-tests=scanl1 -- -fflush-cache -ddebug-cc|
Not sure why there's no debug info! I did reinstall with -fdebug,
starting at the top with accelerate! :-(

@neiljamieso
Author

On 21/11/13 23:59, Trevor L. McDonell wrote:

|cuda-memcheck accelerate-nofib --int64=False --select-tests=scanl1 -- -fflush-cache -ddebug-cc|
Ahh... Should be -ddump-cc

This is the terminal output. I have attached a file containing the dump
output.

Cheers Neil

$ cuda-memcheck accelerate-nofib --int64=False --select-tests=scanl1 --
-fflush-cache -ddump-cc 2> ac_nofib_dump.output
========= CUDA-MEMCHECK
EKG monitor started at: http://localhost:8000

accelerate-nofib (c) [2013] The Accelerate Team

Usage: accelerate-nofib [OPTIONS]

Available backends:
    interpreter  reference implementation (sequential)
  • cuda         implementation for NVIDIA GPUs (parallel)

prelude:
prefix sum:
Int32:
scanl1: [Failed]
*** Failed! Falsifiable (after 7 tests and 3 shrinks):
Array (Z :. 1) [2]
*** Expected:
Array (Z :. 1) [2]
*** Received:
Array (Z :. 1) [-3]

(used seed -1221479377516449484)
scanl1Seg: [OK, passed 100 tests]

         Properties   Total
Passed   1            1
Failed   1            1
Total    2            2
========= ERROR SUMMARY: 0 errors
0.07:cc: initialise kernel table
0.07:cc: deleting persistent cache
0.08:cc: (3.0,"\251M\149X2\220\f\169K\224\249\r\210\130\206\163")
#include <accelerate_cuda.h>
extern "C" __global__ void scanl1(const Int64 shIn0_0, const Int32* __restrict__ arrIn0_0, const Int64 shOut_0, Int32* __restrict__ arrOut_0, const Int64 shBlk_0, Int32* __restrict__ arrBlk_0, Int32* __restrict__ arrSum_0)
{
extern volatile __shared__ Int32 sdata0[];
Int32 x0;
Int32 y0;
Int32 z0;
const Int64 sh0 = shIn0_0;
const int shapeSize = sh0;
const int intervalSize = (shapeSize + gridDim.x - 1) / gridDim.x;
int carryIn = 0;

if (threadIdx.x == 0) {
if (blockIdx.x != 0) {
z0 = arrBlk_0[blockIdx.x - 1];
carryIn = 1;
}
}

const int start = blockIdx.x * intervalSize;
const int end = min(start + intervalSize, shapeSize);
const int numElements = end - start;
int seg;

for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) {
const int ix = start + seg;

x0 = arrIn0_0[ix];
if (threadIdx.x == 0 && carryIn) {
x0 = min(z0, x0);
}
sdata0[threadIdx.x] = x0;
__syncthreads();
if (blockDim.x > 1) {
if (threadIdx.x >= 1) {
y0 = sdata0[threadIdx.x - 1];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 2) {
if (threadIdx.x >= 2) {
y0 = sdata0[threadIdx.x - 2];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 4) {
if (threadIdx.x >= 4) {
y0 = sdata0[threadIdx.x - 4];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 8) {
if (threadIdx.x >= 8) {
y0 = sdata0[threadIdx.x - 8];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 16) {
if (threadIdx.x >= 16) {
y0 = sdata0[threadIdx.x - 16];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 32) {
if (threadIdx.x >= 32) {
y0 = sdata0[threadIdx.x - 32];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 64) {
if (threadIdx.x >= 64) {
y0 = sdata0[threadIdx.x - 64];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 128) {
if (threadIdx.x >= 128) {
y0 = sdata0[threadIdx.x - 128];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 256) {
if (threadIdx.x >= 256) {
y0 = sdata0[threadIdx.x - 256];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 512) {
if (threadIdx.x >= 512) {
y0 = sdata0[threadIdx.x - 512];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 1024) {
if (threadIdx.x >= 1024) {
y0 = sdata0[threadIdx.x - 1024];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (0) {
if (threadIdx.x == 0) {
x0 = z0;
} else {
x0 = sdata0[threadIdx.x - 1];
}
}
arrOut_0[ix] = x0;
if (threadIdx.x == 0) {
const int last = min(numElements - seg, blockDim.x) - 1;

z0 = sdata0[last];
}
carryIn = 1;
}
}

0.08:cc: (3.0,"\206\203(\n\242G\fk\212\137\146V+\153\170\187")
#include <accelerate_cuda.h>
extern "C" __global__ void scanlUp(const Int64 shIn0_0, const Int32* __restrict__ arrIn0_0, const Int64 shOut_0, Int32* __restrict__ arrOut_0)
{
extern volatile __shared__ Int32 sdata0[];
Int32 x0;
Int32 y0;
const Int64 sh0 = shIn0_0;
const int shapeSize = sh0;
const int intervalSize = (shapeSize + gridDim.x - 1) / gridDim.x;
const int start = blockIdx.x * intervalSize;
const int end = min(start + intervalSize, shapeSize);
const int numElements = end - start;
int carryIn = 0;
int seg;

for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) {
const int ix = start + seg;

x0 = arrIn0_0[ix];
if (threadIdx.x == 0 && carryIn) {
x0 = min(y0, x0);
}
sdata0[threadIdx.x] = x0;
__syncthreads();
if (blockDim.x > 1) {
if (threadIdx.x >= 1) {
y0 = sdata0[threadIdx.x - 1];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 2) {
if (threadIdx.x >= 2) {
y0 = sdata0[threadIdx.x - 2];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 4) {
if (threadIdx.x >= 4) {
y0 = sdata0[threadIdx.x - 4];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 8) {
if (threadIdx.x >= 8) {
y0 = sdata0[threadIdx.x - 8];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 16) {
if (threadIdx.x >= 16) {
y0 = sdata0[threadIdx.x - 16];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 32) {
if (threadIdx.x >= 32) {
y0 = sdata0[threadIdx.x - 32];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 64) {
if (threadIdx.x >= 64) {
y0 = sdata0[threadIdx.x - 64];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 128) {
if (threadIdx.x >= 128) {
y0 = sdata0[threadIdx.x - 128];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 256) {
if (threadIdx.x >= 256) {
y0 = sdata0[threadIdx.x - 256];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 512) {
if (threadIdx.x >= 512) {
y0 = sdata0[threadIdx.x - 512];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 1024) {
if (threadIdx.x >= 1024) {
y0 = sdata0[threadIdx.x - 1024];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (threadIdx.x == 0) {
const int last = min(numElements - seg, blockDim.x) - 1;

y0 = sdata0[last];
}
carryIn = 1;
}
if (threadIdx.x == 0) {
arrOut_0[blockIdx.x] = y0;
}
}

0.08:cc: (3.0,"\NUL\a\CAN\FS\157\154\247$\234\215\ENQ\188g\156\DC1\246")
#include <accelerate_cuda.h>
extern "C" __global__ void scanl1(const Int64 shIn0_0, const Int32* __restrict__ arrIn0_0, const Int64 shOut_0, Int32* __restrict__ arrOut_0, const Int64 shBlk_0, Int32* __restrict__ arrBlk_0, Int32* __restrict__ arrSum_0)
{
extern volatile __shared__ Int32 sdata0[];
Int32 x0;
Int32 y0;
Int32 z0;
const Int64 sh0 = shBlk_0;
const int shapeSize = sh0;
const int intervalSize = (shapeSize + gridDim.x - 1) / gridDim.x;
int carryIn = 0;

if (threadIdx.x == 0) {
if (blockIdx.x != 0) {
z0 = arrBlk_0[blockIdx.x - 1];
carryIn = 1;
}
}

const int start = blockIdx.x * intervalSize;
const int end = min(start + intervalSize, shapeSize);
const int numElements = end - start;
int seg;

for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) {
const int ix = start + seg;

x0 = arrBlk_0[ix];
if (threadIdx.x == 0 && carryIn) {
x0 = min(z0, x0);
}
sdata0[threadIdx.x] = x0;
__syncthreads();
if (blockDim.x > 1) {
if (threadIdx.x >= 1) {
y0 = sdata0[threadIdx.x - 1];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 2) {
if (threadIdx.x >= 2) {
y0 = sdata0[threadIdx.x - 2];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 4) {
if (threadIdx.x >= 4) {
y0 = sdata0[threadIdx.x - 4];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 8) {
if (threadIdx.x >= 8) {
y0 = sdata0[threadIdx.x - 8];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 16) {
if (threadIdx.x >= 16) {
y0 = sdata0[threadIdx.x - 16];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 32) {
if (threadIdx.x >= 32) {
y0 = sdata0[threadIdx.x - 32];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 64) {
if (threadIdx.x >= 64) {
y0 = sdata0[threadIdx.x - 64];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 128) {
if (threadIdx.x >= 128) {
y0 = sdata0[threadIdx.x - 128];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 256) {
if (threadIdx.x >= 256) {
y0 = sdata0[threadIdx.x - 256];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 512) {
if (threadIdx.x >= 512) {
y0 = sdata0[threadIdx.x - 512];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 1024) {
if (threadIdx.x >= 1024) {
y0 = sdata0[threadIdx.x - 1024];
x0 = min(y0, x0);
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (0) {
if (threadIdx.x == 0) {
x0 = z0;
} else {
x0 = sdata0[threadIdx.x - 1];
}
}
arrOut_0[ix] = x0;
if (threadIdx.x == 0) {
const int last = min(numElements - seg, blockDim.x) - 1;

z0 = sdata0[last];
}
carryIn = 1;
}
}

0.08:cc: waiting for nvcc...
0.08:cc: queue: 937.745 ms, execute: 937.711 ms
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30709.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30709.cu
0.08:cc: queue: 944.652 ms, execute: 944.615 ms
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30707.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30707.cu
0.08:cc: queue: 943.570 ms, execute: 943.558 ms
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30708.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30708.cu
0.08:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/zr187zr170zr153zpVzr146zr137zr212kzrfGzr242zrnZLzr203zr206
0.09:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.09:cc: waiting for nvcc...
0.09:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/zr246zrDC1zr156gzr188zrENQzr215zr234zdzr247zr154zr157zrFSzrCANzrazrNUL
0.09:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.09:cc: waiting for nvcc...
0.09:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/zr163zr206zr130zr210zrrzr249zr224Kzr169zrfzr220zrza2Xzr149Mzr251
0.09:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.10:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.10:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.10:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.10:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.10:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.10:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.10:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.10:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.10:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.11:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.11:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.11:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.11:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.11:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.11:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.11:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.11:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.11:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.12:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.13:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.13:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.13:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.13:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.13:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.13:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.13:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.13:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.13:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.14:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.14:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.14:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.14:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.14:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.14:cc: entry function 'scanl1' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.14:cc: (3.0,"+,+\ETB\DC2\196\234k\146\209\a\220,\248%\196")
#include <accelerate_cuda.h>
extern "C" __global__ void scanl(const Int64 shIn0_0, const Int32* __restrict__ arrIn0_0, const Int64 shOut_0, Int32* __restrict__ arrOut_0, const Int64 shBlk_0, Int32* __restrict__ arrBlk_0, Int32* __restrict__ arrSum_0)
{
extern volatile __shared__ Int32 sdata0[];
Int32 x0;
Int32 y0;
Int32 z0;
const Int64 sh0 = shIn0_0;
const int shapeSize = sh0;
const int intervalSize = (shapeSize + gridDim.x - 1) / gridDim.x;
int carryIn = 0;

if (threadIdx.x == 0) {
if (gridDim.x > 1) {
z0 = arrBlk_0[blockIdx.x];
} else {
z0 = (Int32) 0;
}
}

const int start = blockIdx.x * intervalSize;
const int end = min(start + intervalSize, shapeSize);
const int numElements = end - start;
int seg;

for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) {
const int ix = start + seg;

x0 = arrIn0_0[ix];
if (threadIdx.x == 0) {
x0 = z0 + x0;
}
sdata0[threadIdx.x] = x0;
__syncthreads();
if (blockDim.x > 1) {
if (threadIdx.x >= 1) {
y0 = sdata0[threadIdx.x - 1];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 2) {
if (threadIdx.x >= 2) {
y0 = sdata0[threadIdx.x - 2];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 4) {
if (threadIdx.x >= 4) {
y0 = sdata0[threadIdx.x - 4];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 8) {
if (threadIdx.x >= 8) {
y0 = sdata0[threadIdx.x - 8];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 16) {
if (threadIdx.x >= 16) {
y0 = sdata0[threadIdx.x - 16];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 32) {
if (threadIdx.x >= 32) {
y0 = sdata0[threadIdx.x - 32];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 64) {
if (threadIdx.x >= 64) {
y0 = sdata0[threadIdx.x - 64];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 128) {
if (threadIdx.x >= 128) {
y0 = sdata0[threadIdx.x - 128];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 256) {
if (threadIdx.x >= 256) {
y0 = sdata0[threadIdx.x - 256];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 512) {
if (threadIdx.x >= 512) {
y0 = sdata0[threadIdx.x - 512];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 1024) {
if (threadIdx.x >= 1024) {
y0 = sdata0[threadIdx.x - 1024];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (1) {
if (threadIdx.x == 0) {
x0 = z0;
} else {
x0 = sdata0[threadIdx.x - 1];
}
}
arrOut_0[ix] = x0;
if (threadIdx.x == 0) {
const int last = min(numElements - seg, blockDim.x) - 1;

z0 = sdata0[last];
}
}
if (threadIdx.x == 0 && blockIdx.x == gridDim.x - 1) {
arrSum_0[0] = z0;
}
}

0.14:cc: (3.0,"\EOTB\148\FS\188"\245\ETB\206a\136\ACK\164\174\RSr")
#include <accelerate_cuda.h>
extern "C" __global__ void scanlUp(const Int64 shIn0_0, const Int32* __restrict__ arrIn0_0, const Int64 shOut_0, Int32* __restrict__ arrOut_0)
{
extern volatile __shared__ Int32 sdata0[];
Int32 x0;
Int32 y0;
const Int64 sh0 = shIn0_0;
const int shapeSize = sh0;
const int intervalSize = (shapeSize + gridDim.x - 1) / gridDim.x;
const int start = blockIdx.x * intervalSize;
const int end = min(start + intervalSize, shapeSize);
const int numElements = end - start;
int carryIn = 0;
int seg;

for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) {
const int ix = start + seg;

x0 = arrIn0_0[ix];
if (threadIdx.x == 0 && carryIn) {
x0 = y0 + x0;
}
sdata0[threadIdx.x] = x0;
__syncthreads();
if (blockDim.x > 1) {
if (threadIdx.x >= 1) {
y0 = sdata0[threadIdx.x - 1];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 2) {
if (threadIdx.x >= 2) {
y0 = sdata0[threadIdx.x - 2];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 4) {
if (threadIdx.x >= 4) {
y0 = sdata0[threadIdx.x - 4];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 8) {
if (threadIdx.x >= 8) {
y0 = sdata0[threadIdx.x - 8];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 16) {
if (threadIdx.x >= 16) {
y0 = sdata0[threadIdx.x - 16];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 32) {
if (threadIdx.x >= 32) {
y0 = sdata0[threadIdx.x - 32];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 64) {
if (threadIdx.x >= 64) {
y0 = sdata0[threadIdx.x - 64];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 128) {
if (threadIdx.x >= 128) {
y0 = sdata0[threadIdx.x - 128];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 256) {
if (threadIdx.x >= 256) {
y0 = sdata0[threadIdx.x - 256];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 512) {
if (threadIdx.x >= 512) {
y0 = sdata0[threadIdx.x - 512];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 1024) {
if (threadIdx.x >= 1024) {
y0 = sdata0[threadIdx.x - 1024];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (threadIdx.x == 0) {
const int last = min(numElements - seg, blockDim.x) - 1;

y0 = sdata0[last];
}
carryIn = 1;
}
if (threadIdx.x == 0) {
arrOut_0[blockIdx.x] = y0;
}
}

0.14:cc: (3.0,"t\EM)\ETB\SO\230\237U\203\160C1m\128U\132")
#include <accelerate_cuda.h>
extern "C" __global__ void scanl(const Int64 shIn0_0, const Int32* __restrict__ arrIn0_0, const Int64 shOut_0, Int32* __restrict__ arrOut_0, const Int64 shBlk_0, Int32* __restrict__ arrBlk_0, Int32* __restrict__ arrSum_0)
{
extern volatile __shared__ Int32 sdata0[];
Int32 x0;
Int32 y0;
Int32 z0;
const Int64 sh0 = shBlk_0;
const int shapeSize = sh0;
const int intervalSize = (shapeSize + gridDim.x - 1) / gridDim.x;
int carryIn = 0;

if (threadIdx.x == 0) {
if (gridDim.x > 1) {
z0 = arrBlk_0[blockIdx.x];
} else {
z0 = (Int32) 0;
}
}

const int start = blockIdx.x * intervalSize;
const int end = min(start + intervalSize, shapeSize);
const int numElements = end - start;
int seg;

for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) {
const int ix = start + seg;

x0 = arrBlk_0[ix];
if (threadIdx.x == 0) {
x0 = z0 + x0;
}
sdata0[threadIdx.x] = x0;
__syncthreads();
if (blockDim.x > 1) {
if (threadIdx.x >= 1) {
y0 = sdata0[threadIdx.x - 1];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 2) {
if (threadIdx.x >= 2) {
y0 = sdata0[threadIdx.x - 2];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 4) {
if (threadIdx.x >= 4) {
y0 = sdata0[threadIdx.x - 4];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 8) {
if (threadIdx.x >= 8) {
y0 = sdata0[threadIdx.x - 8];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 16) {
if (threadIdx.x >= 16) {
y0 = sdata0[threadIdx.x - 16];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 32) {
if (threadIdx.x >= 32) {
y0 = sdata0[threadIdx.x - 32];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 64) {
if (threadIdx.x >= 64) {
y0 = sdata0[threadIdx.x - 64];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 128) {
if (threadIdx.x >= 128) {
y0 = sdata0[threadIdx.x - 128];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 256) {
if (threadIdx.x >= 256) {
y0 = sdata0[threadIdx.x - 256];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 512) {
if (threadIdx.x >= 512) {
y0 = sdata0[threadIdx.x - 512];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 1024) {
if (threadIdx.x >= 1024) {
y0 = sdata0[threadIdx.x - 1024];
x0 = y0 + x0;
}
__syncthreads();
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (1) {
if (threadIdx.x == 0) {
x0 = z0;
} else {
x0 = sdata0[threadIdx.x - 1];
}
}
arrOut_0[ix] = x0;
if (threadIdx.x == 0) {
const int last = min(numElements - seg, blockDim.x) - 1;

z0 = sdata0[last];
}
}
if (threadIdx.x == 0 && blockIdx.x == gridDim.x - 1) {
arrSum_0[0] = z0;
}
}

0.14:cc: (3.0,"\STXKX\147\ETXI(#\SOH\214\150B\153\253D\SO")
#include <accelerate_cuda.h>
extern "C" __global__ void generate(const Int32* __restrict__ arrIn0_0, const Int64 shOut_0, Int32* __restrict__ arrOut_0)
{
const int shapeSize = shOut_0;
const int gridSize = blockDim.x * gridDim.x;
int ix;

for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) {
arrOut_0[ix] = (Int32) 0;
}
}

0.14:cc: (3.0,"\bQ\a\131\189l#\131\f\SIw\183\USP\157&3")
#include <accelerate_cuda.h>
extern "C" __global__ void permute(const Int64 shIn0_0, const Int32* __restrict__ arrIn0_0, const Int64 shOut_0, Int32* __restrict__ arrOut_0)
{
const Int64 shIn0 = shIn0_0;
const int shapeSize = shIn0;
const int gridSize = blockDim.x * gridDim.x;
int ix;

for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) {
const Int64 sh0 = ({ assert(ix >= 0 && ix < shIn0); ix; });
const Int64 sh_0 = (Int64) arrIn0_0[sh0];

if (!(sh_0 == -1)) {
Int32 y0;
Int32 _y0;
const Int64 jx0 = sh_0;
const Int64 v1 = ({ assert(ix >= 0 && ix < shIn0_0); ix; });
const Int32 x0 = (Int32) 1;

y0 = arrOut_0[jx0];
do {
_y0 = y0;
y0 = atomicCAS32(&arrOut_0[jx0], _y0, x0 + y0);
} while(y0 != _y0);
}
}
}

0.15:cc: (3.0,"f$\243U\130\180\224[\146\251\138\181\235\161l\EM")
#include <accelerate_cuda.h>
extern "C" __global__ void scanl1(const Int64* __restrict__ arrIn0_0, const Int64 shIn1_0, const Int32* __restrict__ arrIn1_0, const Int64 shIn2_0, const Int32* __restrict__ arrIn2_0, const Int64 shOut_0, Int32* __restrict__ arrOut_1, Int32* __restrict__ arrOut_0, const Int64 shBlk_0, Int32* __restrict__ arrBlk_1, Int32* __restrict__ arrBlk_0, Int32* __restrict__ arrSum_1, Int32* __restrict__ arrSum_0)
{
extern volatile __shared__ Int32 sdata1[];
volatile Int32* sdata0 = (Int32*) &sdata1[blockDim.x];
Int32 x1;
Int32 x0;
Int32 y1;
Int32 y0;
Int32 z1;
Int32 z0;
const Int64 v1 = 0;
const Int64 sh0 = min(arrIn0_0[v1], shIn2_0);
const int shapeSize = sh0;
const int intervalSize = (shapeSize + gridDim.x - 1) / gridDim.x;
int carryIn = 0;

if (threadIdx.x == 0) {
if (blockIdx.x != 0) {
z1 = arrBlk_1[blockIdx.x - 1];
z0 = arrBlk_0[blockIdx.x - 1];
carryIn = 1;
}
}

const int start = blockIdx.x * intervalSize;
const int end = min(start + intervalSize, shapeSize);
const int numElements = end - start;
int seg;

for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) {
const int ix = start + seg;
const Int64 v2 = 0;
const Int64 v4 = ({ assert(ix >= 0 && ix < min(arrIn0_0[v2], shIn2_0)); ix; });

x1 = arrIn1_0[v4];
x0 = arrIn2_0[v4];
if (threadIdx.x == 0 && carryIn) {
const Word8 v0 = (Int32) 0 != x1;

x1 = z1 | x1;
x0 = v0 ? x0 : z0 + x0;
}
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
if (blockDim.x > 1) {
if (threadIdx.x >= 1) {
y1 = sdata1[threadIdx.x - 1];
y0 = sdata0[threadIdx.x - 1];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 2) {
if (threadIdx.x >= 2) {
y1 = sdata1[threadIdx.x - 2];
y0 = sdata0[threadIdx.x - 2];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 4) {
if (threadIdx.x >= 4) {
y1 = sdata1[threadIdx.x - 4];
y0 = sdata0[threadIdx.x - 4];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 8) {
if (threadIdx.x >= 8) {
y1 = sdata1[threadIdx.x - 8];
y0 = sdata0[threadIdx.x - 8];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 16) {
if (threadIdx.x >= 16) {
y1 = sdata1[threadIdx.x - 16];
y0 = sdata0[threadIdx.x - 16];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 32) {
if (threadIdx.x >= 32) {
y1 = sdata1[threadIdx.x - 32];
y0 = sdata0[threadIdx.x - 32];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 64) {
if (threadIdx.x >= 64) {
y1 = sdata1[threadIdx.x - 64];
y0 = sdata0[threadIdx.x - 64];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 128) {
if (threadIdx.x >= 128) {
y1 = sdata1[threadIdx.x - 128];
y0 = sdata0[threadIdx.x - 128];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 256) {
if (threadIdx.x >= 256) {
y1 = sdata1[threadIdx.x - 256];
y0 = sdata0[threadIdx.x - 256];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 512) {
if (threadIdx.x >= 512) {
y1 = sdata1[threadIdx.x - 512];
y0 = sdata0[threadIdx.x - 512];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 1024) {
if (threadIdx.x >= 1024) {
y1 = sdata1[threadIdx.x - 1024];
y0 = sdata0[threadIdx.x - 1024];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (0) {
if (threadIdx.x == 0) {
x1 = z1;
x0 = z0;
} else {
x1 = sdata1[threadIdx.x - 1];
x0 = sdata0[threadIdx.x - 1];
}
}
arrOut_1[ix] = x1;
arrOut_0[ix] = x0;
if (threadIdx.x == 0) {
const int last = min(numElements - seg, blockDim.x) - 1;

z1 = sdata1[last];
z0 = sdata0[last];
}
carryIn = 1;
}
}

0.16:cc: (3.0,"\221cD\203&1\164\149+`I\192W\227\248An")
#include <accelerate_cuda.h>
extern "C" __global__ void scanlUp(const Int64* __restrict__ arrIn0_0, const Int64 shIn1_0, const Int32* __restrict__ arrIn1_0, const Int64 shIn2_0, const Int32* __restrict__ arrIn2_0, const Int64 shOut_0, Int32* __restrict__ arrOut_1, Int32* __restrict__ arrOut_0)
{
extern volatile __shared__ Int32 sdata1[];
volatile Int32* sdata0 = (Int32*) &sdata1[blockDim.x];
Int32 x1;
Int32 x0;
Int32 y1;
Int32 y0;
const Int64 v1 = 0;
const Int64 sh0 = min(arrIn0_0[v1], shIn2_0);
const int shapeSize = sh0;
const int intervalSize = (shapeSize + gridDim.x - 1) / gridDim.x;
const int start = blockIdx.x * intervalSize;
const int end = min(start + intervalSize, shapeSize);
const int numElements = end - start;
int carryIn = 0;
int seg;

for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) {
const int ix = start + seg;
const Int64 v2 = 0;
const Int64 v4 = ({ assert(ix >= 0 && ix < min(arrIn0_0[v2], shIn2_0)); ix; });

x1 = arrIn1_0[v4];
x0 = arrIn2_0[v4];
if (threadIdx.x == 0 && carryIn) {
const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
if (blockDim.x > 1) {
if (threadIdx.x >= 1) {
y1 = sdata1[threadIdx.x - 1];
y0 = sdata0[threadIdx.x - 1];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 2) {
if (threadIdx.x >= 2) {
y1 = sdata1[threadIdx.x - 2];
y0 = sdata0[threadIdx.x - 2];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 4) {
if (threadIdx.x >= 4) {
y1 = sdata1[threadIdx.x - 4];
y0 = sdata0[threadIdx.x - 4];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 8) {
if (threadIdx.x >= 8) {
y1 = sdata1[threadIdx.x - 8];
y0 = sdata0[threadIdx.x - 8];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 16) {
if (threadIdx.x >= 16) {
y1 = sdata1[threadIdx.x - 16];
y0 = sdata0[threadIdx.x - 16];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 32) {
if (threadIdx.x >= 32) {
y1 = sdata1[threadIdx.x - 32];
y0 = sdata0[threadIdx.x - 32];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 64) {
if (threadIdx.x >= 64) {
y1 = sdata1[threadIdx.x - 64];
y0 = sdata0[threadIdx.x - 64];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 128) {
if (threadIdx.x >= 128) {
y1 = sdata1[threadIdx.x - 128];
y0 = sdata0[threadIdx.x - 128];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 256) {
if (threadIdx.x >= 256) {
y1 = sdata1[threadIdx.x - 256];
y0 = sdata0[threadIdx.x - 256];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 512) {
if (threadIdx.x >= 512) {
y1 = sdata1[threadIdx.x - 512];
y0 = sdata0[threadIdx.x - 512];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 1024) {
if (threadIdx.x >= 1024) {
y1 = sdata1[threadIdx.x - 1024];
y0 = sdata0[threadIdx.x - 1024];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (threadIdx.x == 0) {
const int last = min(numElements - seg, blockDim.x) - 1;

y1 = sdata1[last];
y0 = sdata0[last];
}
carryIn = 1;
}
if (threadIdx.x == 0) {
arrOut_1[blockIdx.x] = y1;
arrOut_0[blockIdx.x] = y0;
}
}

0.16:cc: (3.0,"$\139vCL\128YJ\146\188\US\152\181}\186d")
#include <accelerate_cuda.h>
extern "C" __global__ void scanl1(const Int64* __restrict__ arrIn0_0, const Int64 shIn1_0, const Int32* __restrict__ arrIn1_0, const Int64 shIn2_0, const Int32* __restrict__ arrIn2_0, const Int64 shOut_0, Int32* __restrict__ arrOut_1, Int32* __restrict__ arrOut_0, const Int64 shBlk_0, Int32* __restrict__ arrBlk_1, Int32* __restrict__ arrBlk_0, Int32* __restrict__ arrSum_1, Int32* __restrict__ arrSum_0)
{
extern volatile __shared__ Int32 sdata1[];
volatile Int32* sdata0 = (Int32*) &sdata1[blockDim.x];
Int32 x1;
Int32 x0;
Int32 y1;
Int32 y0;
Int32 z1;
Int32 z0;
const Int64 sh0 = shBlk_0;
const int shapeSize = sh0;
const int intervalSize = (shapeSize + gridDim.x - 1) / gridDim.x;
int carryIn = 0;

if (threadIdx.x == 0) {
if (blockIdx.x != 0) {
z1 = arrBlk_1[blockIdx.x - 1];
z0 = arrBlk_0[blockIdx.x - 1];
carryIn = 1;
}
}

const int start = blockIdx.x * intervalSize;
const int end = min(start + intervalSize, shapeSize);
const int numElements = end - start;
int seg;

for (seg = threadIdx.x; seg < numElements; seg += blockDim.x) {
const int ix = start + seg;

x1 = arrBlk_1[ix];
x0 = arrBlk_0[ix];
if (threadIdx.x == 0 && carryIn) {
const Word8 v0 = (Int32) 0 != x1;

x1 = z1 | x1;
x0 = v0 ? x0 : z0 + x0;
}
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
if (blockDim.x > 1) {
if (threadIdx.x >= 1) {
y1 = sdata1[threadIdx.x - 1];
y0 = sdata0[threadIdx.x - 1];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 2) {
if (threadIdx.x >= 2) {
y1 = sdata1[threadIdx.x - 2];
y0 = sdata0[threadIdx.x - 2];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 4) {
if (threadIdx.x >= 4) {
y1 = sdata1[threadIdx.x - 4];
y0 = sdata0[threadIdx.x - 4];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 8) {
if (threadIdx.x >= 8) {
y1 = sdata1[threadIdx.x - 8];
y0 = sdata0[threadIdx.x - 8];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 16) {
if (threadIdx.x >= 16) {
y1 = sdata1[threadIdx.x - 16];
y0 = sdata0[threadIdx.x - 16];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 32) {
if (threadIdx.x >= 32) {
y1 = sdata1[threadIdx.x - 32];
y0 = sdata0[threadIdx.x - 32];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 64) {
if (threadIdx.x >= 64) {
y1 = sdata1[threadIdx.x - 64];
y0 = sdata0[threadIdx.x - 64];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 128) {
if (threadIdx.x >= 128) {
y1 = sdata1[threadIdx.x - 128];
y0 = sdata0[threadIdx.x - 128];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 256) {
if (threadIdx.x >= 256) {
y1 = sdata1[threadIdx.x - 256];
y0 = sdata0[threadIdx.x - 256];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 512) {
if (threadIdx.x >= 512) {
y1 = sdata1[threadIdx.x - 512];
y0 = sdata0[threadIdx.x - 512];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (blockDim.x > 1024) {
if (threadIdx.x >= 1024) {
y1 = sdata1[threadIdx.x - 1024];
y0 = sdata0[threadIdx.x - 1024];

const Word8 v0 = (Int32) 0 != x1;

x1 = y1 | x1;
x0 = v0 ? x0 : y0 + x0;
}
__syncthreads();
sdata1[threadIdx.x] = x1;
sdata0[threadIdx.x] = x0;
__syncthreads();
}
if (0) {
if (threadIdx.x == 0) {
x1 = z1;
x0 = z0;
} else {
x1 = sdata1[threadIdx.x - 1];
x0 = sdata0[threadIdx.x - 1];
}
}
arrOut_1[ix] = x1;
arrOut_0[ix] = x0;
if (threadIdx.x == 0) {
const int last = min(numElements - seg, blockDim.x) - 1;

z1 = sdata1[last];
z0 = sdata0[last];
}
carryIn = 1;
}
}

0.16:cc: (3.0,"`r~\159W\220\n\231\148[\252\232\SO\138L\151")
#include <accelerate_cuda.h>
extern "C" __global__ void map(const Int64 shIn0_0, const Int32* __restrict__ arrIn0_1, const Int32* __restrict__ arrIn0_0, const Int64 shOut_0, Int32* __restrict__ arrOut_0)
{
const int shapeSize = shOut_0;
const int gridSize = blockDim.x * gridDim.x;
int ix;

for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) {
const Int32 x0 = arrIn0_0[ix];

arrOut_0[ix] = x0;
}
}

0.17:cc: waiting for nvcc...
0.17:cc: queue: 1.443 s, execute: 1.443 s
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30710.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30710.cu
0.17:cc: queue: 1.467 s, execute: 1.467 s
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30711.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30711.cu
0.17:cc: queue: 1.500 s, execute: 1.500 s
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30709.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30709.cu
0.17:cc: queue: 1.509 s, execute: 1.509 s
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30707.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30707.cu
0.17:cc: queue: 1.510 s, execute: 1.510 s
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30708.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30708.cu
0.17:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/rzrRSzr174zr164zrACKzr136azr206zrETBzr245z22Uzr188zrFSzr148BzrEOT
0.17:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.17:cc: waiting for nvcc...
0.17:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/zr132Uzr128m1Czr160zr203Uzr237zr230zrSOzrETBZRzrEMt
0.17:cc: entry function 'scanl' used 27 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.17:cc: waiting for nvcc...
0.17:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/zr196zvzr248z2cUzr220zrazr209zr146kzr234zr196zrDC2zrETBzpz2cUzp
0.18:cc: entry function 'scanl' used 27 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.18:cc: waiting for nvcc...
0.18:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/zrSODzr253zr153Bzr150zr214zrSOHzhZLIzrETXzr147XKzrSTX
0.18:cc: entry function 'generate' used 7 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.18:cc: waiting for nvcc...
0.18:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/z33Uzr157PzrUSzr183wzrSIzrfzr131zhlzr189zr131zraQzrb
0.18:cc: entry function 'permute' used 9 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.18:cc: waiting for nvcc...
0.18:cc: queue: 1.529 s, execute: 1.529 s
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30712.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30712.cu
0.18:cc: queue: 1.525 s, execute: 1.525 s
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30713.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30713.cu
0.18:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/nAzr248zr227Wzr192Iz60Uzpzr149zr164zrza1zr203Dczr221
0.18:cc: entry function 'scanlUp' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.18:cc: waiting for nvcc...
0.18:cc: queue: 1.537 s, execute: 1.537 s
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30714.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30714.cu
0.18:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/dzr186z7dUzr181zr152zrUSzr188zr146JYzr128LCvzr139zd
0.18:cc: entry function 'scanl1' used 30 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.18:cc: waiting for nvcc...
0.18:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/zrEMlzr161zr235zr181zr138zr251zr146ZMzr224zr180zr130Uzr243zdf
0.18:cc: entry function 'scanl1' used 30 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.18:cc: waiting for nvcc...
0.19:cc: queue: 2.413 s, execute: 1.001 s
... /usr/bin/nvcc -I /home/neil/.cabal/share/x86_64-linux-ghc-7.6.3/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-30707/dragon30715.cubin --disable-warnings -DNDEBUG -O3 -m64 /tmp/accelerate-cuda-30707/dragon30715.cu
0.19:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/zr151Lzr138zrSOzr232zr252ZMzr148zr231zrnzr220Wzr159z7eUrz60U
0.19:cc: entry function 'map' used 10 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.20:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.21:cc: entry function 'scanl' used 27 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.21:cc: entry function 'scanl' used 27 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.21:cc: entry function 'generate' used 7 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.21:cc: entry function 'permute' used 9 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.21:cc: entry function 'scanlUp' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.21:cc: entry function 'scanl1' used 30 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.21:cc: entry function 'scanl1' used 30 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.21:cc: entry function 'map' used 10 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.22:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.22:cc: entry function 'scanl' used 27 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.22:cc: entry function 'scanl' used 27 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.22:cc: entry function 'generate' used 7 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.22:cc: entry function 'permute' used 9 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.22:cc: entry function 'scanlUp' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.22:cc: entry function 'scanl1' used 30 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.22:cc: entry function 'scanl1' used 30 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.22:cc: entry function 'map' used 10 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.22:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.22:cc: entry function 'scanl' used 27 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.22:cc: entry function 'scanl' used 27 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.23:cc: entry function 'generate' used 7 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.23:cc: entry function 'permute' used 9 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.23:cc: entry function 'scanlUp' used 25 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.23:cc: entry function 'scanl1' used 30 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.23:cc: entry function 'scanl1' used 30 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.23:cc: entry function 'map' used 10 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 16 blocks
0.23:cc: entry function 'scanlUp' used 21 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
0.23:cc: entry function 'scanl' used 27 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem
... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks
[... the same sequence of kernel compilation messages ('scanl', 'generate', 'permute', 'scanlUp', 'scanl1', 'map') repeats identically with timestamps 0.24 through 0.28; remaining output truncated ...]
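For reference, the occupancy figure reported on each `... multiprocessor occupancy` line is just the ratio of resident warps to the device's warp limit. A minimal sketch (not Accelerate's actual code), assuming a warp size of 32 and a limit of 64 resident warps per multiprocessor as the trace above implies:

```haskell
-- Hypothetical reconstruction of the occupancy arithmetic in the trace.
-- Assumptions: warp size 32, at most 64 resident warps per multiprocessor.
warpSize, maxWarpsPerSM :: Int
warpSize      = 32
maxWarpsPerSM = 64

-- Occupancy (in percent) for a given number of resident threads.
occupancy :: Int -> Double
occupancy threads = 100 * fromIntegral warps / fromIntegral maxWarpsPerSM
  where warps = (threads + warpSize - 1) `div` warpSize

main :: IO ()
main = putStrLn (show (occupancy 2048) ++ "%")
-- 2048 threads = 64 warps = 100.0%, matching the trace
```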

@tmcdonell
Member

Closing as outdated. Please open a new ticket with updated output if you have problems.

Labels
cuda backend [deprecated]

3 participants