Segfaults with Intel NEO driver #276

Closed
kodonnell opened this issue Apr 28, 2018 · 10 comments

@kodonnell

I'm hoping this doesn't need as much work to debug as #66, but I figured I'd point it out in case. I'm on the latest commit on master for CLBlast, and I built Intel NEO a few days ago. This is an i5-5250u running Ubuntu 16.04. I'm a relative newbie to debugging this kind of thing, so I might need some hand-holding.

Specifically, I seem to be getting segfaults in some cases, at the end of certain scripts, e.g.:

$ ./clblast_test_xgemm
...
Segmentation fault (core dumped)

(Note: I do also get a lot of failed tests.) Also

$ ./clblast_sample_sgemm
Completed SGEMM in 558.762 ms with status 0
Segmentation fault (core dumped)

Both seem to 'finish', and then segfault. Interestingly, tuning runs fine though.

I've messed around with the sgemm sample, and it doesn't segfault if I comment out the Gemm call, which seems to imply (??) that it's something specific to CLBlast that's causing the problem (though it still might be the driver's fault). This is further supported by my having run other OpenCL examples fine.

In addition, it does 'finish' before segfaulting (i.e. a print at the last line of code is still executed before the segfault), so maybe there is some clean-up happening that's causing the problem? This seems to be supported by the freeGraphicsMemory stuff appearing in valgrind (which I've never used before, nor know how to interpret - just copied from here). Also note the mention of clblast::Cache, which sounds relevant.

$ valgrind --tool=memcheck --show-leak-kinds=definite --error-limit=no ./clblast_sample_sgemm
==11739== Memcheck, a memory error detector
==11739== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==11739== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==11739== Command: ./clblast_sample_sgemm
==11739== Parent PID: 9083
==11739==
==11739== Warning: set address range perms: large range [0x395d8000, 0x1395d7000) (defined)
==11739== Syscall param ioctl(generic) points to uninitialised byte(s)
==11739==    at 0x5C59F47: ioctl (syscall-template.S:84)
==11739==    by 0x6AC85B3: OCLRT::Drm::ioctl(unsigned long, void*) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ACA5A6: OCLRT::OSTimeLinux::timestampTypeDetect() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ACA693: OCLRT::OSTimeLinux::OSTimeLinux(OCLRT::OSInterface*) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ACA6D1: OCLRT::OSTime::create(OCLRT::OSInterface*) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A9CD1E: OCLRT::Device::createDeviceImpl(OCLRT::HardwareInfo const*, bool, OCLRT::Device&) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ACC730: OCLRT::Platform::initialize(unsigned long, OCLRT::HardwareInfo const**) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A726B7: clGetPlatformIDs (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x53BF7EA: ??? (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x53C16CE: ??? (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x623EA98: __pthread_once_slow (pthread_once.c:116)
==11739==    by 0x53BFD20: clGetPlatformIDs (in /opt/intel/opencl/libOpenCL.so.1)
==11739==  Address 0xffeffff08 is on thread 1's stack
==11739==  in frame #2, created by OCLRT::OSTimeLinux::timestampTypeDetect() (???:)
==11739==
==11739== Warning: set address range perms: large range [0x395d8000, 0x1395d7000) (noaccess)
==11739== Invalid read of size 8
==11739==    at 0x6A8B112: OCLRT::CommandQueue::~CommandQueue() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6AE4186: OCLRT::CommandQueueHw<OCLRT::BDWFamily>::~CommandQueueHw() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A9BA2B: OCLRT::Context::~Context() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A9BB38: OCLRT::Context::~Context() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ADACEB: OCLRT::Program::~Program() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ADAD28: OCLRT::Program::~Program() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A7437A: clReleaseProgram (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x4EE245B: std::_Sp_counted_deleter<_cl_program**, clblast::Program::Program(clblast::Context const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(_cl_program**)#1}, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x4EFE149: clblast::Cache<std::tuple<_cl_context*, _cl_device_id*, clblast::Precision, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, clblast::Program>::~Cache() (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x5B97369: __cxa_finalize (cxa_finalize.c:56)
==11739==    by 0x4ED0222: ??? (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x4010DE6: _dl_fini (dl-fini.c:235)
==11739==  Address 0x699f058 is 152 bytes inside a block of size 1,072 free'd
==11739==    at 0x4C2F24B: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11739==    by 0x6ACC096: OCLRT::Platform::shutdown() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ACC336: OCLRT::Platform::~Platform() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x5B96FF7: __run_exit_handlers (exit.c:82)
==11739==    by 0x5B97044: exit (exit.c:104)
==11739==    by 0x5B7D836: (below main) (libc-start.c:325)
==11739==  Block was alloc'd at
==11739==    at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11739==    by 0x6ACC70D: OCLRT::Platform::initialize(unsigned long, OCLRT::HardwareInfo const**) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A726B7: clGetPlatformIDs (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x53BF7EA: ??? (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x53C16CE: ??? (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x623EA98: __pthread_once_slow (pthread_once.c:116)
==11739==    by 0x53BFD20: clGetPlatformIDs (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x402979: cl::Platform::get(std::vector<cl::Platform, std::allocator<cl::Platform> >*) (in /path/to/CLBlast/build/clblast_sample_sgemm)
==11739==    by 0x401BB9: main (in /path/to/CLBlast/build/clblast_sample_sgemm)
==11739==
==11739== Invalid read of size 1
==11739==    at 0x6ABCB70: OCLRT::MemoryManager::isAsyncDeleterEnabled() const (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A9BA6D: OCLRT::Context::~Context() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A9BB38: OCLRT::Context::~Context() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ADACEB: OCLRT::Program::~Program() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ADAD28: OCLRT::Program::~Program() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A7437A: clReleaseProgram (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x4EE245B: std::_Sp_counted_deleter<_cl_program**, clblast::Program::Program(clblast::Context const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(_cl_program**)#1}, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x4EFE149: clblast::Cache<std::tuple<_cl_context*, _cl_device_id*, clblast::Precision, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, clblast::Program>::~Cache() (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x5B97369: __cxa_finalize (cxa_finalize.c:56)
==11739==    by 0x4ED0222: ??? (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x4010DE6: _dl_fini (dl-fini.c:235)
==11739==    by 0x5B96FF7: __run_exit_handlers (exit.c:82)
==11739==  Address 0x69a7dd8 is 328 bytes inside a block of size 480 free'd
==11739==    at 0x4C2F24B: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11739==    by 0x6A9C26A: OCLRT::Device::~Device() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A9C3B8: OCLRT::Device::~Device() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ACC096: OCLRT::Platform::shutdown() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ACC336: OCLRT::Platform::~Platform() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x5B96FF7: __run_exit_handlers (exit.c:82)
==11739==    by 0x5B97044: exit (exit.c:104)
==11739==    by 0x5B7D836: (below main) (libc-start.c:325)
==11739==  Block was alloc'd at
==11739==    at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11739==    by 0x6B5E97F: OCLRT::DrmCommandStreamReceiver<OCLRT::BDWFamily>::createMemoryManager(bool) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A9CD6B: OCLRT::Device::createDeviceImpl(OCLRT::HardwareInfo const*, bool, OCLRT::Device&) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ACC730: OCLRT::Platform::initialize(unsigned long, OCLRT::HardwareInfo const**) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A726B7: clGetPlatformIDs (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x53BF7EA: ??? (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x53C16CE: ??? (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x623EA98: __pthread_once_slow (pthread_once.c:116)
==11739==    by 0x53BFD20: clGetPlatformIDs (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x402979: cl::Platform::get(std::vector<cl::Platform, std::allocator<cl::Platform> >*) (in /path/to/CLBlast/build/clblast_sample_sgemm)
==11739==    by 0x401BB9: main (in /path/to/CLBlast/build/clblast_sample_sgemm)
==11739==
==11739== Invalid read of size 8
==11739==    at 0x6AD9CA6: OCLRT::Program::cleanCurrentKernelInfo() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ADAB7D: OCLRT::Program::~Program() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ADAD28: OCLRT::Program::~Program() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A7437A: clReleaseProgram (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x4EE245B: std::_Sp_counted_deleter<_cl_program**, clblast::Program::Program(clblast::Context const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(_cl_program**)#1}, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x4EFE149: clblast::Cache<std::tuple<_cl_context*, _cl_device_id*, clblast::Precision, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, clblast::Program>::~Cache() (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x5B97369: __cxa_finalize (cxa_finalize.c:56)
==11739==    by 0x4ED0222: ??? (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x4010DE6: _dl_fini (dl-fini.c:235)
==11739==    by 0x5B96FF7: __run_exit_handlers (exit.c:82)
==11739==    by 0x5B97044: exit (exit.c:104)
==11739==    by 0x5B7D836: (below main) (libc-start.c:325)
==11739==  Address 0x699f058 is 152 bytes inside a block of size 1,072 free'd
==11739==    at 0x4C2F24B: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11739==    by 0x6ACC096: OCLRT::Platform::shutdown() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ACC336: OCLRT::Platform::~Platform() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x5B96FF7: __run_exit_handlers (exit.c:82)
==11739==    by 0x5B97044: exit (exit.c:104)
==11739==    by 0x5B7D836: (below main) (libc-start.c:325)
==11739==  Block was alloc'd at
==11739==    at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11739==    by 0x6ACC70D: OCLRT::Platform::initialize(unsigned long, OCLRT::HardwareInfo const**) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A726B7: clGetPlatformIDs (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x53BF7EA: ??? (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x53C16CE: ??? (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x623EA98: __pthread_once_slow (pthread_once.c:116)
==11739==    by 0x53BFD20: clGetPlatformIDs (in /opt/intel/opencl/libOpenCL.so.1)
==11739==    by 0x402979: cl::Platform::get(std::vector<cl::Platform, std::allocator<cl::Platform> >*) (in /path/to/CLBlast/build/clblast_sample_sgemm)
==11739==    by 0x401BB9: main (in /path/to/CLBlast/build/clblast_sample_sgemm)
==11739==
==11739==
==11739== Process terminating with default action of signal 11 (SIGSEGV)
==11739==  Access not within mapped region at address 0x0
==11739==    at 0x6ABC270: OCLRT::MemoryManager::freeGraphicsMemory(OCLRT::GraphicsAllocation*) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ABCAEF: OCLRT::MemoryManager::checkGpuUsageAndDestroyGraphicsAllocations(OCLRT::GraphicsAllocation*) (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6AD9CB1: OCLRT::Program::cleanCurrentKernelInfo() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ADAB7D: OCLRT::Program::~Program() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6ADAD28: OCLRT::Program::~Program() (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x6A7437A: clReleaseProgram (in /opt/intel/opencl/libigdrcl.so)
==11739==    by 0x4EE245B: std::_Sp_counted_deleter<_cl_program**, clblast::Program::Program(clblast::Context const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(_cl_program**)#1}, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x4EFE149: clblast::Cache<std::tuple<_cl_context*, _cl_device_id*, clblast::Precision, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, clblast::Program>::~Cache() (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x5B97369: __cxa_finalize (cxa_finalize.c:56)
==11739==    by 0x4ED0222: ??? (in /path/to/CLBlast/build/libclblast.so.1.3.0)
==11739==    by 0x4010DE6: _dl_fini (dl-fini.c:235)
==11739==    by 0x5B96FF7: __run_exit_handlers (exit.c:82)
==11739==  If you believe this happened as a result of a stack
==11739==  overflow in your program's main thread (unlikely but
==11739==  possible), you can try to increase the size of the
==11739==  main thread stack using the --main-stacksize= flag.
==11739==  The main thread stack size used in this run was 8388608.
==11739==
==11739== HEAP SUMMARY:
==11739==     in use at exit: 1,209,051 bytes in 2,564 blocks
==11739==   total heap usage: 855,129 allocs, 852,565 frees, 447,919,478 bytes allocated
==11739==
==11739== LEAK SUMMARY:
==11739==    definitely lost: 8 bytes in 1 blocks
==11739==    indirectly lost: 0 bytes in 0 blocks
==11739==      possibly lost: 0 bytes in 0 blocks
==11739==    still reachable: 1,209,043 bytes in 2,563 blocks
==11739==         suppressed: 0 bytes in 0 blocks
==11739== Rerun with --leak-check=full to see details of leaked memory
==11739==
==11739== For counts of detected and suppressed errors, rerun with: -v
==11739== Use --track-origins=yes to see where uninitialised values come from
==11739== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)
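Read together, the traces above suggest an ordering problem at process exit: `OCLRT::Platform::~Platform()` runs from the exit handlers first, and only afterwards does `clblast::Cache::~Cache()` call `clReleaseProgram`, which then reads memory the platform has already freed. The following is a minimal C++ sketch of that ordering, not the driver's or CLBlast's actual code. All names in it (`FakePlatform`, `FakeProgram`, `ReleaseProgram`, `SimulateExit`) are hypothetical, and a boolean flag stands in for the freed memory so the example stays well-defined.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the driver's platform object.
struct FakePlatform {
  bool alive = true;
};

// Hypothetical stand-in for a cached program that refers back to the platform.
struct FakeProgram {
  const FakePlatform* platform;
};

// Mimics a release call that needs the platform. In the real trace the
// equivalent access is an invalid read of already-freed memory.
std::string ReleaseProgram(const FakeProgram& program) {
  assert(program.platform != nullptr);
  return program.platform->alive ? "release ok"
                                 : "release touches dead platform";
}

// Replays the two possible shutdown orders and records what happens.
std::vector<std::string> SimulateExit(bool release_cache_first) {
  FakePlatform platform;
  std::vector<FakeProgram> cache = {FakeProgram{&platform}};
  std::vector<std::string> log;

  auto drain_cache = [&] {
    for (const auto& program : cache) log.push_back(ReleaseProgram(program));
    cache.clear();
  };
  auto shutdown_platform = [&] {
    platform.alive = false;  // stands in for Platform::shutdown()
    log.push_back("platform shutdown");
  };

  if (release_cache_first) {
    drain_cache();
    shutdown_platform();
  } else {
    shutdown_platform();  // the order seen in the valgrind output
    drain_cache();
  }
  return log;
}
```

Calling `SimulateExit(false)` replays the order seen in the log, with the platform torn down before the cache is drained, while `SimulateExit(true)` shows the order in which the release would be safe.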

It should be noted that I was getting weird segfaults with other Intel drivers before, though never dug too far into them. Then again, Intel NEO is based off them, so that's not surprising.

I'm not sure of next steps, but either:

  • the above makes sense to someone, and it's an easy fix, or
  • someone can help me debug further, or
  • someone can try to reproduce

(NB: someone is probably @CNugteren)

@CNugteren
Owner

Thanks for the elaborate details. I think the first thing I'll do is install NEO myself and try to reproduce the issue.

The crash in clReleaseProgram is perhaps related to #203 and #199, which I thought was a Visual Studio-only issue, so I disabled that line when compiling with Visual Studio. Perhaps this is an actual bug in the code, and I'll need to investigate further.

@kodonnell
Author

Great - let me know if I can help in any way.

@CNugteren
Owner

CNugteren commented Apr 30, 2018

I managed to get NEO installed, and reproduced the issue. I suspect it is not NEO related since it also happens on Visual Studio. I tried to debug, but without luck. Then I created a relatively small stand-alone example of ~100 lines of code (still to be reduced perhaps?) which reproduces the issue.

The file is on gist and can be compiled and run e.g. as g++ --std=c++11 clReleaseProgram_segfault.cpp -o test -lOpenCL && ./test. Could you perhaps verify if you also encounter this issue with this stand-alone example? Then I'll investigate further.

@kodonnell
Author

Good stuff - I can reproduce that error. While debugging I saw a reference to beignet, which was worrying, and that reminded me of something I'd run into while installing caffe, which makes me think NEO is not playing nicely with other drivers. (It does, after all, overwrite the Intel OpenCL ones I had.) I also found this. I tried to uninstall beignet and install only NEO to test this theory, but I've kind of messed up my system, so I think I'm going to try a reinstall. This might take me a day or two - if you can test it more easily, great, otherwise I'll report back when done.

@CNugteren
Owner

OK, thanks for reproducing it! I've made the example a bit simpler and nicer to test, and I've reported this as an issue on the NEO project. Hopefully it is not me doing something stupid here ;-)

@kodonnell
Author

Good find on the cause. How hard do you think it is to fix (i.e. is it something I could try)? Or will you wait to see what the response to the bug is?

@CNugteren
Owner

I would just remove the clean-up of the program, i.e. remove the call to clReleaseProgram here.

@kodonnell
Author

Awesome. I'll test that out tonight - provided it works, I'm happy to leave the issue to you to close etc. as you see fit.

@kodonnell
Author

> I would just remove the clean-up of the program, i.e. remove the call to clReleaseProgram here.

Works for me!

@CNugteren
Owner

I've spoken to the Intel NEO developers, and they will address the issue. They know what to do and they will work on it. In the meantime, removing the call to clReleaseProgram as you did solves it. So I'm closing this issue now.
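The valgrind trace earlier in the thread shows the program handle being released from a `std::shared_ptr` custom deleter (a lambda taking `_cl_program**`). Assuming that structure, the workaround of skipping the release can be sketched roughly as below. This is an illustration only, not the actual CLBlast change: `FakeProgramHandle`, `FakeReleaseProgram`, `WrapProgram`, `ReleaseCalls`, and the `release_on_destroy` switch are all hypothetical names, and a counter stands in for the real `clReleaseProgram` call so the example builds without OpenCL.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for cl_program; a plain integer id.
using FakeProgramHandle = int;

// Counts how often the stand-in release function was called.
static int g_release_calls = 0;

// Hypothetical stand-in for clReleaseProgram.
void FakeReleaseProgram(FakeProgramHandle handle) {
  assert(handle > 0);
  ++g_release_calls;
}

int ReleaseCalls() { return g_release_calls; }

// Wraps a handle in a shared_ptr whose deleter can skip the release call.
// Skipping it leaves the handle unreleased, which is the trade-off of the
// workaround: the process is exiting anyway when the cache is destroyed.
std::shared_ptr<FakeProgramHandle> WrapProgram(FakeProgramHandle handle,
                                               bool release_on_destroy) {
  return std::shared_ptr<FakeProgramHandle>(
      new FakeProgramHandle(handle),
      [release_on_destroy](FakeProgramHandle* stored) {
        if (release_on_destroy) {
          FakeReleaseProgram(*stored);
        }
        delete stored;  // the wrapper's own storage is always freed
      });
}
```

With `release_on_destroy` set to `false`, destroying the wrapper frees only its own storage and never calls into the driver, which is the behavior the workaround relies on.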
