Segmentation fault in multi-threaded code #304

Closed
sfantao opened this issue Sep 6, 2023 · 10 comments · Fixed by #309

Comments

@sfantao

sfantao commented Sep 6, 2023

I have an application that uses up to 7 threads, and I randomly get segmentation faults in omnitrace version 1.10.2 with the following backtrace:

(gdb) #0  0x000015553b20bcae in std::_Hashtable<unsigned long, std::pair<unsigned long const, long>, std::allocator<std::pair<unsigned long const, long> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_before_node () at /usr/include/c++/7/bits/hashtable.h:1551
#1  std::_Hashtable<unsigned long, std::pair<unsigned long const, long>, std::allocator<std::pair<unsigned long const, long> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_node ()
    at /usr/include/c++/7/bits/hashtable.h:642
#2  0x000015553b9765dd in std::_Hashtable<unsigned long, std::pair<unsigned long const, long>, std::allocator<std::pair<unsigned long const, long> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::find () at /usr/include/c++/7/bits/hashtable.h:1425
#3  std::unordered_map<unsigned long, long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, long> > >::find () at /usr/include/c++/7/bits/unordered_map.h:920
#4  omnitrace::hip_activity_callback ()
    at /home/omnitrace/source/lib/omnitrace/library/roctracer.cpp:927
#5  0x000015553aff8236 in ?? ()
   from /pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.4.3/lib/libroctracer64.so.4
#6  0x000015553aff944c in ?? ()
   from /pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.4.3/lib/libroctracer64.so.4
#7  0x0000155554fe2a33 in ?? ()
   from /pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.4.3/deps/libstdc++.so.6
#8  0x000015553bfd7d54 in omnitrace::component::pthread_create_gotcha::wrapper::operator() ()
    at /home/omnitrace/source/lib/omnitrace/library/components/pthread_create_gotcha.cpp:276
#9  0x000015553bfd9402 in omnitrace::component::pthread_create_gotcha::wrapper::wrap ()
    at /home/omnitrace/source/lib/omnitrace/library/components/pthread_create_gotcha.cpp:305
#10 0x000015554bed06ea in start_thread () from /lib64/libpthread.so.0
#11 0x000015554bbe8a6f in clone () from /lib64/libc.so.6

This app was executed as:

omnitrace-sample --trace -- neko.exe hemi.case

I believe there might be a race on accesses to this unordered map. The app is not trivial to build, so let me know if you'd like me to provide more information about the segfault or the app itself.

@jrmadsen
Collaborator

jrmadsen commented Sep 6, 2023

Ah, yeah, I see why/how the data race is happening... the map accesses are locked, but on different mutexes. I can get it patched easily and I’ll generate a new release.
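
For readers following along, here is a minimal sketch of the pattern described above. It is not the omnitrace source or the patch in #309. `correlation_table`, `record`, `lookup`, and `exercise` are hypothetical names, and the map type is only borrowed from the backtrace. The point it shows is that every reader and writer of a `std::unordered_map` has to lock the same mutex. If `find()` runs under one mutex while an insert runs under another, the two do not exclude each other, and the lookup can walk the bucket list while it is being rehashed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the map seen in the backtrace
// (std::unordered_map<unsigned long, long>); not omnitrace's actual code.
class correlation_table
{
public:
    // Writer path: takes the single mutex that guards m_data.
    void record(std::uint64_t key, long value)
    {
        std::lock_guard<std::mutex> lk{ m_mutex };
        m_data[key] = value;
    }

    // Reader path: takes the SAME mutex as record(). Locking a different
    // mutex here would not exclude concurrent inserts.
    bool lookup(std::uint64_t key, long& value) const
    {
        std::lock_guard<std::mutex> lk{ m_mutex };
        auto itr = m_data.find(key);
        if(itr == m_data.end()) return false;
        value = itr->second;
        return true;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lk{ m_mutex };
        return m_data.size();
    }

private:
    mutable std::mutex                      m_mutex;
    std::unordered_map<std::uint64_t, long> m_data;
};

// Spawns nthreads workers; each inserts per_thread unique keys and reads
// each one back. Returns the number of entries once all workers have joined.
std::size_t exercise(correlation_table& table, unsigned nthreads, unsigned per_thread)
{
    std::vector<std::thread> workers;
    for(unsigned t = 0; t < nthreads; ++t)
    {
        workers.emplace_back([&table, t, per_thread] {
            for(unsigned i = 0; i < per_thread; ++i)
            {
                auto key = static_cast<std::uint64_t>(t) * per_thread + i;
                table.record(key, static_cast<long>(key));
                long out   = 0;
                bool found = table.lookup(key, out);
                assert(found && out == static_cast<long>(key));
                (void) found;  // silence unused warning when NDEBUG is set
            }
        });
    }
    for(auto& w : workers) w.join();
    return table.size();
}
```

Calling `exercise(table, 7, 1000)` mirrors the 7-thread case from the report. Keeping the mutex private, next to the map it guards, makes it harder for a new call site to lock the wrong one.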

@sfantao
Author

sfantao commented Sep 7, 2023

Good stuff! Let me know how to get the patched version.

@gmarkomanolis

Jonathan, this is the Neko code; I was discussing the same issue with Niclas today. Please let us know when there is a change.

@jrmadsen
Collaborator

It is being held up by whatever happened to the build system on RedHat in HIP 5.5 and 5.6, as seen in #300. It looks like some amdgpu libraries got moved and aren’t being found. I’m trying to get it solved soon.

@jrmadsen
Collaborator

Ok, I finally found the time to sort out #300. I’ll get that merged shortly, and then addressing this will be quick and easy. There should be a release available tomorrow.

@jrmadsen
Collaborator

It’s going to be a little while longer, until I figure out how to stop the packaging and code coverage jobs from routinely running out of disk space.

@jinhongyii

Hi @jrmadsen, I also encounter almost the same bug when profiling my multi-threaded program (this one is in an unordered map; mine is in an ordered map's RB-tree). Is there any estimate of when this bug fix will make it into a release? Thanks!

@jrmadsen
Collaborator

jrmadsen commented Sep 21, 2023

I’m tied up with the rocprofiler v2 rewrite right now. @benrichard-amd is looking into fixing #300 so that the testing can pass. Right now it’s just the code coverage job that is failing. If he cannot find a fix soon, I’ll just disable that job so that I can merge it, fix the bug, and generate a release.

@jinhongyii

Thanks @jrmadsen! Looking forward to good news on the fix.

@jrmadsen
Collaborator

jrmadsen commented Oct 3, 2023

Just generated the new release. Installers should be available shortly; however, I haven't updated the installer generation to provide installers for ROCm 5.7 yet, just FYI.
