Towards getting simpleFitExample working on Mac #4

Closed
wants to merge 3 commits into from

Contributor

@cdeil cdeil commented Mar 27, 2013

(Sorry for mixing stupid whitespace changes with actual changes ... we can just use this for discussion and never merge it.)

I had to make a few changes to make simpleFitExample build on Mac OS X 10.8 with CUDA 5.0, V0.2.1221.

The simpleFitExample now builds fine, but it crashes after a few seconds like this:

$ ./simpleFitExample 
Assertion failed: (host_fcn_ptr), function getMetricPointer, file /Users/deil/code/GooFit/FPOINTER/ThrustPdfFunctor.cu, line 171.
Abort trap: 6

@RolfAndreassen Is this the crash we discussed in Saas Fee? Do you have a solution / workaround? If this is a CUDA or Thrust bug maybe we can report it?

- Consistently add -m64 flag
- Change some further compiler flags to make the example build succeed
- Rename wrkdir to build
Note: example crashes at runtime at the moment!?
@cdeil
Contributor Author

cdeil commented Mar 27, 2013

I've added your testStatic program ... on my Mac it runs without error.

@cdeil
Contributor Author

cdeil commented Mar 27, 2013

Some more things that should be done automatically or put in the install notes:

export DYLD_LIBRARY_PATH=/Users/deil/code/GooFit/rootstuff/:$DYLD_LIBRARY_PATH
source $ROOTSYS/bin/thisroot.sh # maybe the basic example Makefile should be ROOT independent?

At the moment I have to manually go to the rootstuff and examples folders and run make.
Can't this be done automatically from the top level?
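
One way the manual steps could be wrapped, as a minimal sketch only: the function name `build_all` is hypothetical, and it assumes each listed subdirectory carries its own Makefile, as rootstuff and examples appear to here.

```shell
#!/bin/sh
# Hypothetical helper (not part of GooFit): run make in each directory
# given as an argument, stopping at the first failing build.
build_all() {
    for dir in "$@"; do
        if [ ! -f "$dir/Makefile" ]; then
            echo "skipping $dir: no Makefile" >&2
            continue
        fi
        # make -C changes into the directory before reading its Makefile.
        make -C "$dir" || return 1
    done
}

# Usage from the top level of the checkout:
#   build_all rootstuff examples
```

A top-level Makefile with recursive `$(MAKE) -C` calls would do the same job; the shell form is used here only to keep the sketch short.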

@cdeil
Contributor Author

cdeil commented Mar 27, 2013

Just for reference ... I don't know if it helps ... here's the backtrace for the crash:

Assertion failed: (host_fcn_ptr), function getMetricPointer, file /Users/deil/code/GooFit/FPOINTER/ThrustPdfFunctor.cu, line 171.

Program received signal SIGABRT, Aborted.
0x00007fff8ba1a212 in __pthread_kill ()
(gdb) bt
#0  0x00007fff8ba1a212 in __pthread_kill ()
#1  0x00007fff8f04ab54 in pthread_kill ()
#2  0x00007fff8f08edce in abort ()
#3  0x00007fff8f08fe2a in __assert_rtn ()
#4  0x0000000100003df4 in getMetricPointer ()
#5  0x000000010425cfe8 in ?? ()
Previous frame inner to this frame (gdb could not unwind past this frame)
(gdb)

@ghost ghost assigned RolfAndreassen Mar 28, 2013
@RolfAndreassen
Member

Yes, this is the "Mac crash" we discussed.

I added some Makefile changes to auto-check for MacOS; it looks for the output of uname to equal Darwin. Hope that's portable between Macs...
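
For reference, the same check written as a shell sketch rather than in Makefile syntax. The helper name `is_macos` is made up for illustration; the actual Makefile change is not shown in this thread. `uname -s` prints the kernel name, which is `Darwin` on Mac OS X.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when uname reports the Darwin kernel.
is_macos() {
    [ "$(uname -s)" = "Darwin" ]
}

if is_macos; then
    echo "building on Mac OS X"
else
    echo "building on $(uname -s)"
fi
```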

@RolfAndreassen
Member

Ok, now I'm really confused: My test program doesn't crash on our local MacBook anymore. I have no idea what I was doing in February, then. But at any rate our Macs are now consistent!

@RolfAndreassen
Member

Ok, so. There's an assert on line 172 of ThrustPdfFunctor.cu. After some testing I think that assert should not be there; it appears to me that the Mac legitimately puts the first device function it encounters at address zero, which seems to differ from the behaviour of my Linux boxen, hence my initial identification of this "null pointer" as the underlying problem.

However. If you remove it, you get back the original problem that made me start looking at function addresses and static allocations and wondering why I got a zero on the Mac in the first place: The cudaMemcpyToSymbol on line 725 fails with error code 63, "OS call failed or operation not supported on this OS". I haven't been able to reproduce this in simple code.

@cdeil
Contributor Author

cdeil commented Mar 29, 2013

testStatic always worked on my Mac.

Just for reference, after removing the assert in ThrustPdfFunctor.cu, I see the same error:

$ ./simpleFitExample 
Error code 63 (OS call failed or operation not supported on this OS) at /Users/deil/code/GooFit/FPOINTER/ThrustPdfFunctor.cu, 725

Line 725 is

    num_device_functions++; 

which you can see in context here

@RolfAndreassen
Member

Er, yes. There's an off-by-one error in the reported line numbers. I'm fairly confident the actual problem is with the cudaMemcpyToSymbol on the next line. :)

So, the next step seems to be to make a small program that reproduces the error.

@RolfAndreassen
Member

So I've been writing simple test cases. I found that nvcc on the Mac apparently is quite aggressive in optimising away functions that don't explicitly get called somewhere, even if you statically assign a variable to point to them; so I get quite unexpected null pointers unless I work around this by writing a global function which does something with my test functions. However, this does not seem to be the cause of the GooFit issue; it is a separate problem.

I haven't been able to reproduce the problem in a simple test program; when linking against the GooFit objects, the program no longer sees the GPU. That is, even a simple cudaMalloc call will suddenly start returning error code 63, and "cudaMemGetInfo" claims that there is zero total memory. At the same time cudaGetDeviceProperties correctly returns "There's a 650M with such-and-such total memory". It looks to me like the procedure must be to strip down the ThrustPdfFunctor and FunctorBase classes until it is possible to link just against them, without having to include the other GooFit objects, and then try basically a binary commenting-out strategy until we find the offending code.

@RolfAndreassen
Member

Could you try the latest and see if it works on your Mac?

@cdeil
Contributor Author

cdeil commented Apr 5, 2013

simpleFitExample now works for me on Mac. Thanks!

One minor remaining thing that doesn't work is the -rdynamic option for the linker in the example Makefile.
If I remove it the example compiles and runs without problem.

$ make
nvcc -I/usr/local/cuda//include/ -I/Users/deil/code/GooFit/examples/../ -I/Users/deil/code/GooFit/examples/..//rootstuff -I/Users/deil/code/GooFit/examples/..//FPOINTER/  -I/Users/deil/software/root/v5-34-05_cocoa/include/  -O3 -arch=sm_20 -g  -m64 -c -o simpleFitExample.o simpleFitExample.cu
g++    simpleFitExample.o /Users/deil/code/GooFit/examples/..//wrkdir//Variable.o /Users/deil/code/GooFit/examples/..//wrkdir//PdfBuilder.o /Users/deil/code/GooFit/examples/..//wrkdir//ThrustPdfFunctorCUDA.o /Users/deil/code/GooFit/examples/..//wrkdir//Faddeeva.o /Users/deil/code/GooFit/examples/..//wrkdir//FitControl.o /Users/deil/code/GooFit/examples/..//wrkdir//FunctorBase.o /Users/deil/code/GooFit/examples/..//wrkdir//DataSet.o /Users/deil/code/GooFit/examples/..//wrkdir//BinnedDataSet.o /Users/deil/code/GooFit/examples/..//wrkdir//UnbinnedDataSet.o /Users/deil/code/GooFit/examples/..//wrkdir//FunctorWriter.o  -L/usr/local/cuda//lib -lcudart -L/Users/deil/code/GooFit/examples/..//rootstuff -lRootUtils  -L/Users/deil/software/root/v5-34-05_cocoa/lib/ -lCore -lCint -lRIO -lNet -lHist -lGraf -lGraf3d -lGpad -lTree -lRint -lMatrix -lPhysics -lMathCore -pthread -lThread -lMinuit2 -lMinuit -rdynamic -lFoam  -o simpleFitExample
g++: error: unrecognized command line option '-rdynamic'

@RolfAndreassen
Member

Curious - maybe we have different g++ versions? I'm on 4.2.1. At any rate, the rdynamic option is not critical, so I made it conditional on not being on a Mac.
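
A sketch of what "conditional on not being on a Mac" could look like, written in shell for illustration. The function name `extra_ldflags` and the variable `LDFLAGS_EXTRA` are assumptions and do not come from the GooFit Makefile.

```shell
#!/bin/sh
# Hypothetical helper: print the optional linker flags for a kernel name.
extra_ldflags() {
    case "$1" in
        Darwin) : ;;                          # Mac: print nothing, so -rdynamic is left out
        *)      printf '%s\n' "-rdynamic" ;;  # elsewhere: keep the flag
    esac
}

# Empty on a Mac, "-rdynamic" on other systems.
LDFLAGS_EXTRA=$(extra_ldflags "$(uname -s)")
echo "extra linker flags: ${LDFLAGS_EXTRA:-none}"
```

Keying on the platform is the simplest option; probing whether the compiler actually accepts the flag would be more precise, since the rejection here seems to depend on which g++ is installed.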

@cdeil
Contributor Author

cdeil commented Apr 5, 2013

Maybe because I use the Macports gcc 4.7 compiler.
It does work now.

One more thing you might want to do is add

*.eps
*.o
*.so
wrkdir
examples/simpleFitExample

to a top-level .gitignore file, so that git status shows a clean workspace, even after building and running the example. At the moment I get:

$ git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#   bifur.eps
#   landau.eps
#   novo.eps
#   simpleFitExample
#   simpleFitExample.o
#   ../rootstuff/TMinuit.o
#   ../rootstuff/TRandom.o
#   ../rootstuff/TRandom3.o
#   ../rootstuff/libRootUtils.so
#   ../wrkdir/
nothing added to commit but untracked files present (use "git add" to track)
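
A minimal sketch of adding those patterns from the command line. The helper name `add_ignore_patterns` is hypothetical, and the last pattern uses the `examples` directory name that appears in the build output above.

```shell
#!/bin/sh
# Hypothetical helper: append the suggested ignore patterns to the
# .gitignore file in the directory given as the first argument.
add_ignore_patterns() {
    cat >> "$1/.gitignore" <<'EOF'
*.eps
*.o
*.so
wrkdir
examples/simpleFitExample
EOF
}

# Usage from the top level of the checkout:
#   add_ignore_patterns .
#   git status
```

The helper appends rather than overwrites, so any patterns already in `.gitignore` are kept.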
