Towards getting simpleFitExample working on Mac #4
- Consistently add -m64 flag
- Change some further compiler flags to make the example build succeed
- Rename wrkdir to build

Note: example crashes at runtime at the moment!?
I've added your |
Some more things that should be done automatically or put in the install notes:
At the moment I had to manually go to the |
Just for reference ... I don't know if it helps ... here's the backtrace for the crash:
Yes, this is the "Mac crash" we discussed. I added some Makefile changes to auto-check for MacOS; it looks for the output of uname to equal Darwin. Hope that's portable between Macs... |
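For reference, the uname check described above can be sketched in plain shell. This is only an illustration, not the actual Makefile change: the helper name `is_macos` is made up here, and it takes an optional kernel name so the logic can be tried without a Mac. `uname -s` prints the kernel name, which is `Darwin` on Mac OS X.

```shell
#!/bin/sh
# Hypothetical helper (not part of GooFit): succeeds if the kernel name is Darwin.
# With no argument it falls back to the output of `uname -s`.
is_macos() {
    kernel="${1:-$(uname -s)}"
    [ "$kernel" = "Darwin" ]
}

# Report which branch the build would take on this machine.
if is_macos; then
    echo "Detected Mac OS X (Darwin)"
else
    echo "Detected $(uname -s)"
fi
```

The comparison is against the kernel name only, so it does not depend on the OS X release or the hardware model.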
Ok, now I'm really confused: My test program doesn't crash on our local MacBook anymore. I have no idea what I was doing in February, then. But at any rate our Macs are now consistent! |
Ok, so. There's an assert on line 172 of ThrustPdfFunctor.cu. After some testing I think that assert should not be there; it appears to me that the Mac legitimately puts the first device function it encounters at address zero, which seems to differ from the behaviour of my Linux boxen, hence my initial identification of this "null pointer" as the underlying problem. However. If you remove it, you get back the original problem that made me start looking at function addresses and static allocations and wondering why I got a zero on the Mac in the first place: The cudaMemcpyToSymbol on line 725 fails with error code 63, "OS call failed or operation not supported on this OS". I haven't been able to reproduce this in simple code. |
Just for reference, after removing the assert in
Line 725 is
which you can see in context here |
Er, yes. There's an off-by-one error in the reported line numbers. I'm fairly confident the actual problem is with the cudaMemcpyToSymbol on the next line. :) So, the next step seems to be to make a small program that reproduces the error. |
So I've been writing simple test cases. I found that nvcc on the Mac apparently is quite aggressive in optimising away functions that don't explicitly get called somewhere, even if you statically-assign a variable to point to them; so I get quite unexpected null pointers unless I work around this by writing a global function which does something with my test functions. However this does not seem to be the cause of the GooFit issue, it is separate. I haven't been able to reproduce the problem in a simple test program; when linking against the GooFit objects, the program no longer sees the GPU. That is, even a simple cudaMalloc call will suddenly start returning error code 63, and "cudaMemGetInfo" claims that there is zero total memory. At the same time cudaGetDeviceProperties correctly returns "There's a 650M with such-and-such total memory". It looks to me like the procedure must be to strip down the ThrustPdfFunctor and FunctorBase classes until it is possible to link just against them, without having to include the other GooFit objects, and then try basically a binary commenting-out strategy until we find the offending code. |
Could you try the latest and see if it works on your Mac? |
One minor remaining thing that doesn't work is the
Curious - maybe we have different g++ versions? I'm on 4.2.1. At any rate, the rdynamic option is not critical, so I made it conditional on not being on a Mac. |
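A rough sketch of what "conditional on not being on a Mac" could look like, written as shell rather than Makefile syntax. The function name `extra_ldflags` is an assumption for illustration; the real change lives in the Makefile and may look different.

```shell
#!/bin/sh
# Hypothetical helper: print extra link flags for a given kernel name.
# With no argument it uses the output of `uname -s`.
extra_ldflags() {
    kernel="${1:-$(uname -s)}"
    case "$kernel" in
        Darwin) ;;                        # leave out -rdynamic on the Mac
        *) printf '%s\n' "-rdynamic" ;;   # keep it everywhere else
    esac
}

echo "Extra link flags: $(extra_ldflags)"
```

Since the option is described as not critical, dropping it on one platform should not change how the example behaves.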
Maybe because I use the Macports gcc 4.7 compiler. One more thing you might want to do is add
to a top-level
(Sorry for mixing stupid whitespace changes with actual changes ... we can just use this for discussion and never merge it.)
I had to make a few changes to make simpleFitExample build on Mac OS X 10.8 with CUDA 5.0, V0.2.1221.

- -m64 flag to avoid link errors
- LIBS += -L$(CUDALOCATION)/lib instead of LIBS += -L$(CUDALOCATION)/lib64 ... this maybe has to be set platform-dependent? (see also "scons sets incorrect library path", NVIDIA/thrust#356).
- rootstuff Makefile discussed in issue "make root_stuff doesn't work on Mac" #3.

The simpleFitExample now builds fine, but it crashes after a few seconds like this:

@RolfAndreassen Is this the crash we discussed in Saas Fee? Do you have a solution / workaround? If this is a CUDA or Thrust bug maybe we can report it?
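On the lib versus lib64 question, one possible alternative to a platform switch is to check which directory actually exists under the CUDA install. The sketch below assumes that a 64-bit Linux install has a lib64 directory and the Mac install only has lib; the helper name `cuda_lib_dir` and the `/usr/local/cuda` fallback are assumptions made for illustration, not something taken from the GooFit Makefile.

```shell
#!/bin/sh
# Hypothetical helper: print the CUDA library directory under the given root.
# Prefers lib64 when that directory exists, otherwise falls back to lib.
cuda_lib_dir() {
    root="$1"
    if [ -d "$root/lib64" ]; then
        printf '%s\n' "$root/lib64"
    else
        printf '%s\n' "$root/lib"
    fi
}

# /usr/local/cuda is only an assumed default for when CUDALOCATION is unset.
echo "-L$(cuda_lib_dir "${CUDALOCATION:-/usr/local/cuda}")"
```

The printed `-L...` argument is what would be appended to LIBS in place of the hard-coded path.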