
Compilation error for the dot product example #331

Closed
mrakgr opened this issue Oct 14, 2016 · 10 comments

mrakgr commented Oct 14, 2016

This is on Windows 10 with the accelerate-cuda backend.
accelerate.hs

import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 (A.zipWith (*) xs ys)

a = use (fromList (Z:.10) [1..] :: Vector Float)
b = use (fromList (Z:.15) [1..] :: Vector Float)

main = print $ run $ dotp a b
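(Incidentally, the mismatched lengths, 10 vs 15, are not the cause of the error: Accelerate's `zipWith` intersects the shapes of its arguments, which is the `min(shIn1_0, shIn0_0)` visible in the generated kernel. As a sanity check of what `dotp` should compute, here is a plain-Haskell analogue; `zipWith` on lists likewise truncates to the shorter input.)

```haskell
-- Plain-Haskell analogue of the Accelerate dotp above (no GPU needed).
-- Prelude's zipWith truncates to the shorter list, mirroring the shape
-- intersection Accelerate performs on the 10- vs 15-element vectors.
main :: IO ()
main = print (dotp [1 .. 10] [1 .. 15])
  where
    dotp :: [Float] -> [Float] -> Float
    dotp xs ys = sum (zipWith (*) xs ys)
```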

Output:

C:\!Various Exercises\Haskell Exercises>accelerate
dragon633426500.cu
dragon4118467.cu
C:/Users/Marko/AppData/Local/Temp/accelerate-cuda-43690/dragon4118467.cu(25): error: expected an expression

C:/Users/Marko/AppData/Local/Temp/accelerate-cuda-43690/dragon4118467.cu(29): error: expected an expression

2 errors detected in the compilation of "C:/Users/Marko/AppData/Local/Temp/tmpxft_00001f78_00000000-8_dragon4118467.cpp1.ii".
accelerate: nvcc terminated abnormally (2)

This happens using the REPL as well. The only GHC compiler flag is -threaded.

Here are the files in the temp directory:
dragon4118467.cu (I've added the comments to indicate where lines 25 and 29 are)

#include <accelerate_cuda.h>
extern "C" __global__ void foldAll(const Int64 shIn0_0, const float *__restrict__ arrIn0_0, const Int64 shIn1_0, const float *__restrict__ arrIn1_0, const Int64 shOut_0, float *__restrict__ arrOut_0)
{
extern volatile __shared__ float sdata0[];
float x0;
float y0;
float z0;
const Int64 sh0 = min(shIn1_0, shIn0_0);
const int shapeSize = sh0;
const int gridSize = blockDim.x * gridDim.x;
int ix = blockDim.x * blockIdx.x + threadIdx.x;

/*
         * Reduce multiple elements per thread. The number is determined by the
         * number of active thread blocks (via gridDim). More blocks will result in
         * a larger `gridSize', and hence fewer elements per thread
         *
         * The loop stride of `gridSize' is used to maintain coalescing.
         *
         * Note that we can't simply kill threads that won't participate in the
         * reduction, as exclusive reductions of empty arrays then won't be
         * initialised with their seed element.
         */
if (ix < shapeSize) {
const Int64 v1 = ({ assert(ix >= 0 && ix < min(shIn1_0, shIn0_0)); ix; }); // line 25

y0 = arrIn1_0[v1] * arrIn0_0[v1];
for (ix += gridSize; ix < shapeSize; ix += gridSize) {
const Int64 v1 = ({ assert(ix >= 0 && ix < min(shIn1_0, shIn0_0)); ix; }); // line 29

x0 = arrIn1_0[v1] * arrIn0_0[v1];
z0 = y0 + x0;
y0 = z0;
}
}
sdata0[threadIdx.x] = y0;
ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x);
__syncthreads();
if (threadIdx.x + 512 < ix) {
x0 = sdata0[threadIdx.x + 512];
z0 = y0 + x0;
y0 = z0;
}
__syncthreads();
sdata0[threadIdx.x] = y0;
__syncthreads();
if (threadIdx.x + 256 < ix) {
x0 = sdata0[threadIdx.x + 256];
z0 = y0 + x0;
y0 = z0;
}
__syncthreads();
sdata0[threadIdx.x] = y0;
__syncthreads();
if (threadIdx.x + 128 < ix) {
x0 = sdata0[threadIdx.x + 128];
z0 = y0 + x0;
y0 = z0;
}
__syncthreads();
sdata0[threadIdx.x] = y0;
__syncthreads();
if (threadIdx.x + 64 < ix) {
x0 = sdata0[threadIdx.x + 64];
z0 = y0 + x0;
y0 = z0;
}
__syncthreads();
sdata0[threadIdx.x] = y0;
__syncthreads();
if (threadIdx.x < 32) {
if (threadIdx.x + 32 < ix) {
x0 = sdata0[threadIdx.x + 32];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 16 < ix) {
x0 = sdata0[threadIdx.x + 16];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 8 < ix) {
x0 = sdata0[threadIdx.x + 8];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 4 < ix) {
x0 = sdata0[threadIdx.x + 4];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 2 < ix) {
x0 = sdata0[threadIdx.x + 2];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 1 < ix) {
x0 = sdata0[threadIdx.x + 1];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
}
if (threadIdx.x == 0) {
if (shapeSize > 0) {
if (gridDim.x == 1) {
x0 = 0.0f;
z0 = y0 + x0;
y0 = z0;
}
arrOut_0[blockIdx.x] = y0;
} else {
arrOut_0[blockIdx.x] = 0.0f;
}
}
}

dragon633426500.cu

#include <accelerate_cuda.h>
extern "C" __global__ void foldAll(const Int64 shIn0_0, const float *__restrict__ arrIn0_0, const Int64 shIn1_0, const float *__restrict__ arrIn1_0, const Int64 shOut_0, float *__restrict__ arrOut_0, const Int64 shRec_0, const float *__restrict__ arrRec_0)
{
extern volatile __shared__ float sdata0[];
float x0;
float y0;
float z0;
const Int64 sh0 = shRec_0;
const int shapeSize = sh0;
const int gridSize = blockDim.x * gridDim.x;
int ix = blockDim.x * blockIdx.x + threadIdx.x;

/*
         * Reduce multiple elements per thread. The number is determined by the
         * number of active thread blocks (via gridDim). More blocks will result in
         * a larger `gridSize', and hence fewer elements per thread
         *
         * The loop stride of `gridSize' is used to maintain coalescing.
         *
         * Note that we can't simply kill threads that won't participate in the
         * reduction, as exclusive reductions of empty arrays then won't be
         * initialised with their seed element.
         */
if (ix < shapeSize) {
y0 = arrRec_0[ix];
for (ix += gridSize; ix < shapeSize; ix += gridSize) {
x0 = arrRec_0[ix];
z0 = y0 + x0;
y0 = z0;
}
}
sdata0[threadIdx.x] = y0;
ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x);
__syncthreads();
if (threadIdx.x + 512 < ix) {
x0 = sdata0[threadIdx.x + 512];
z0 = y0 + x0;
y0 = z0;
}
__syncthreads();
sdata0[threadIdx.x] = y0;
__syncthreads();
if (threadIdx.x + 256 < ix) {
x0 = sdata0[threadIdx.x + 256];
z0 = y0 + x0;
y0 = z0;
}
__syncthreads();
sdata0[threadIdx.x] = y0;
__syncthreads();
if (threadIdx.x + 128 < ix) {
x0 = sdata0[threadIdx.x + 128];
z0 = y0 + x0;
y0 = z0;
}
__syncthreads();
sdata0[threadIdx.x] = y0;
__syncthreads();
if (threadIdx.x + 64 < ix) {
x0 = sdata0[threadIdx.x + 64];
z0 = y0 + x0;
y0 = z0;
}
__syncthreads();
sdata0[threadIdx.x] = y0;
__syncthreads();
if (threadIdx.x < 32) {
if (threadIdx.x + 32 < ix) {
x0 = sdata0[threadIdx.x + 32];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 16 < ix) {
x0 = sdata0[threadIdx.x + 16];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 8 < ix) {
x0 = sdata0[threadIdx.x + 8];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 4 < ix) {
x0 = sdata0[threadIdx.x + 4];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 2 < ix) {
x0 = sdata0[threadIdx.x + 2];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
if (threadIdx.x + 1 < ix) {
x0 = sdata0[threadIdx.x + 1];
z0 = y0 + x0;
y0 = z0;
sdata0[threadIdx.x] = y0;
}
}
if (threadIdx.x == 0) {
if (shapeSize > 0) {
if (gridDim.x == 1) {
x0 = 0.0f;
z0 = y0 + x0;
y0 = z0;
}
arrOut_0[blockIdx.x] = y0;
} else {
arrOut_0[blockIdx.x] = 0.0f;
}
}
}

I'll skip the cubin file. I haven't done extensive testing to determine why this is happening; I am trying out Accelerate for the first time here. It might be because I am using nvcc from the CUDA 7.5 SDK. I'll give the one from 7.0 a shot.

Edit: It makes no difference whether I am using nvcc from 7.0 or 7.5.

@robeverest
Member

I don't think the problem you're experiencing is due to the version of nvcc you're using; rather, it is a consequence of using Windows and MSVC. See #234.


mrakgr commented Oct 14, 2016

Ok. Would the LLVM or OpenCL backends work, or is this error specific to CUDA?

Edit: I see that the LLVM backend is still experimental, so never mind that. The issue is probably specific to the CUDA backend as well. Tracing the issue, I see that there was a flurry of activity on lambda expressions last year to fix this for NVRTC and Windows.

I see that there was a fix by @mwu-tow. What is the status of that?

@tmcdonell
Member

The LLVM work is progressing and the CPU backend is complete. If you have some spare cycles and wouldn't mind trying to install it, I would appreciate any feedback on that (I don't know of anybody who has tried to install it on windows yet).


mrakgr commented Oct 16, 2016

Ok, I gave it a shot. When I tried to install the llvm-general Haskell package, it complained that it was missing llvm-config. It was not in the prebuilt Windows binary, so I built the 3.9.0.0 source with CMake + VS2015. That did not work, as the llvm-config version needed to be 3.5.x, so I got 3.5.2.0, built that, and added it to PATH.

Now it is complaining that it is missing:

setup.exe: Missing dependency on a foreign library:
* Missing C library: LLVM-3.5.2svn

To be honest, I've already been trying to install it for two hours, so I will just stop here. I could maybe try building it with GCC, as some files failed to build with VS... actually, let me do that, and I will stop there if that fails.

Edit: No, I get the same error, unfortunately. There seem to be some build issues in 3.5.2.0 in general, unlike in 3.9.0.0, as I saw some warnings about unsupported instructions. It seems the llvm-general package needs an update.

@tmcdonell
Member

Thanks for the feedback @mrakgr!

On Mac/Linux we need to have LLVM built with shared library support (which I think is the default) so that we can use it with ghci/Template Haskell, which I guess is the error you are getting here. I don't know if the Windows config has a similar option, though; I'll set up a Windows VM or something one day and try it out.
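For what it's worth, a sketch of how a shared-library LLVM configure might look for a 3.5-era CMake build. BUILD_SHARED_LIBS is the standard CMake switch; the MinGW generator and source path are illustrative assumptions about the Windows setup, not a verified recipe:

```shell
# Sketch: configure LLVM 3.5.x with shared libraries enabled.
# BUILD_SHARED_LIBS is the standard CMake option; generator and
# source path here are assumptions, not a tested invocation.
cmake -G "MinGW Makefiles" \
      -DBUILD_SHARED_LIBS=ON \
      -DCMAKE_BUILD_TYPE=Release \
      ../llvm-3.5.2.src
```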


mrakgr commented Oct 17, 2016

I see. Actually, I did notice there was an option for shared libraries, but there were like 100 other options so I just opted for the defaults.


mwu-tow commented Oct 17, 2016

I once tried using Haskell for cross-platform development. It was painful, painful, painful. I don't particularly want to remember… but I guess I can give you some advice, since I got the accelerate-cuda backend working on Windows without any visible platform-specific issues.

The issue you encountered is #234: nvcc on Windows doesn't support the GNU-specific extensions that accelerate-cuda uses. The solution is to replace them with standard C++11 lambda expressions. I have it fixed in my fork on this branch: https://github.com/mwu-tow/accelerate-cuda/tree/lambda-expressions
I haven't submitted a pull request, since it was blocked by lambda support in language-c-quote. My pull request there eventually got through after several months, but by that time we had already decided to change the technology and I haven't pursued it any further.

I believe the patch is fairly simple and regular, and should work against recent accelerate-cuda (unless it has added some new uses of statement expressions, but those should be easy to fix anyway). I believe it is by far the quickest way to have a working accelerate-cuda on Windows.

As for the LLVM thing… you can try it if you are brave. Some suggestions:

  1. You can't use MSVC-built LLVM with Haskell. Actually, you can't use any library that was built with MSVC and uses C++ in its interface. Pure C libraries, and C++ libraries that expose a pure C API, should work fine on GHC >= 7.10.3. And don't ever try using any older GHC on Windows.

  2. GHC is based on MinGW, so you need to build LLVM with MinGW. That can be troublesome but should generally be doable. Note that you can't use just any MinGW: there are various distributions, often incompatible with each other. It would be best if you built LLVM with the MinGW that is bundled with GHC. I believe it is based on the Msys2 distribution.

  3. Also be careful about the various GNU tools that are often required for builds (make, pkg-config, sed, awk, and so on); they also come in various flavors that are often not compatible. There are cases where you need one set of tools for building your C/C++ dependencies and a different set for Haskell packages.

  4. Nowadays stack provides an Msys2 distribution with its package manager and other goodies. It can speed things up, provided the tools from the Msys2 repository are the variant that actually works for your needs. (I actually never used it, as I had to disable it because it broke my builds in another subtle way. Given a little luck, though, it may work for you. With GHC 8, even most of its C libraries should work properly.)

  5. To debug a Cabal error about a missing C library, you need to run Cabal with maximum verbosity to see the actual g++ invocation that caused the error. Typically, Cabal checks for library presence much like autoconf does: it creates trivial programs and tries to link them with a -lsomething flag. If g++ gives an error, Cabal says the library is missing. You need to get that link step working. Usually that means setting up a proper environment after the library is built (the CPATH, LIBRARY_PATH, and PATH environment variables, or proper entries in your stack configuration).

Note that I'm referring to the Cabal library that is used by both cabal-install and stack; you are using it, although not directly.

Some Haskell packages also ask Cabal to rely on pkg-config to detect libraries.

  6. I would be very surprised if llvm-general worked on Windows without additional fixes. When I checked it once, it seemed to have some platform-specific code and no Windows support. Though it was a brief check, and I may be wrong here.

Good luck!

@tmcdonell
Member

Wow, thanks @mwu-tow for the excellent comments!
Sorry you had so much trouble in this area, hope things are going well for you at flowbox.


mwu-tow commented Oct 19, 2016

@mrakgr
One more thing that may be worth mentioning:
7) Package configuration typically comes either from the *.cabal file or from Setup.hs (Setup.hs defines the program that Cabal/stack actually uses to configure and build the package). Setup.hs can be a place to dynamically adjust configuration parameters (like the name of the library being linked, the include/library paths, and so on). It can also be used to perform more advanced platform-specific trickery, like we once did with the cuda package. llvm-general also seems to have a non-trivial Setup.hs; I guess it'll need to be adjusted to properly detect and consume a Windows-style LLVM distribution.

Also, I noticed that there's a MinGW build of LLVM 3.5.2 in the MSYS2 repository, so building it from scratch might not be necessary.

@tmcdonell
Thanks, we're still alive and kicking. :-)

While I do like ranting, I have no regrets. Learning Haskell and porting our packages to Windows has been a very valuable experience (even though we eventually decided to slowly move towards C++ anyway). It's a great language, and I'd really like for it to become more available for cross-platform development.

I'd say there's a vicious circle: Haskell doesn't work properly on Windows <-> Haskell developers don't use Windows. It needs to be steadily improved, package after package. If I can help by sharing my experience, I'm happy to. Always feel free to ping/mail me (sometimes I do miss GitHub pings) when there's an issue you think I could help with.

GHC is also improving; at the current pace, things could become quite solid after a few more major releases. Handling of external libraries and the linker received many fixes in GHC 8, proper DLL support is slated for 8.2, and there's still that IO manager thing remaining; that's pretty much all the big blockers I'm aware of.

@tmcdonell tmcdonell mentioned this issue Jan 28, 2017
@tmcdonell tmcdonell added the cuda backend [deprecated] label Mar 27, 2017
@tmcdonell tmcdonell modified the milestone: _|_ Apr 14, 2017
@tmcdonell
Member

Closing, as accelerate-cuda is deprecated and accelerate-llvm-ptx works on Windows now.
