Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional map misbehavior, inserted element not found later on #55

Closed
egmontkob opened this issue Jan 7, 2022 · 31 comments
Closed

Occasional map misbehavior, inserted element not found later on #55

egmontkob opened this issue Jan 7, 2022 · 31 comments

Comments

@egmontkob
Copy link

Hello,

We are experiencing a weird misbehavior with tsl/hopscotch_map, bisected to commit 5c5770a being the culprit.


According to its changelog entry, this looks like an innocent commit purely renaming variables, namely m_buckets to m_buckets_data and m_first_or_empty_bucket to m_buckets. However, danger lurks around in one of the old names being the same as the other new name.

At several places (e.g. hopscotch_hash.h line 1246 being the first one) m_buckets remains m_buckets. Based on the changelog's wording, these look like accidental mistakes rather than intentional "changes".


Into our tsl::hopscotch_map<int, void *> we insert about 1000 entries (using map[key] = val) and erase about 800 (using it = map.find(key); blahblah; map.erase(it) and then no longer using it afterwards). These operations interleave each other in kinda random order, and we sometimes re-add a key that we erased earlier.

Then, after one insertion, the following situation arises:

If I iterate through all the items using for (auto it : map) {...} then I get all the about 200 items, as expected.

However, for one of these entries map.count(key) gives 0 instead of the expected 1. Trying to access this item using .at(key) raises an exception, as usual for missing entries. Or, trying to access this key using [key] inserts this item again, with the default value of nullptr. After this operation, if I loop through the items again using for (auto it : map) {...} then this key gets listed twice, once with the expected value and once with nullptr. And at this point it becomes another key that is "missing" in this weird sense that looping over all the entries finds it, but looking it up directly does not.


Unfortunately I couldn't come up with a self-contained test demonstrating the failure, and I cannot give you access to our in-house software that every now and then triggers this bug. Our software has a very non-deterministic behavior by its nature, but as of your aforementioned commit we see the misbehavior about 1 out of 20-50-ish cases. We haven't seen misbehavior with your hopscotch_map prior to the mentioned commit during thousands of runs, nor with the standard unordered_map.

Let me know if you need any help in debugging this issue (without me digging deeply into your code :)). For example, if you create a version that dumps into a global fd the memory region of the map after every insertion/deletion, I'd be happy to test that. But maybe you just want to revert the functionality and redo the variable naming properly :)

Thanks in advance!

@Tessil
Copy link
Owner

Tessil commented Jan 8, 2022

Hello,

Thanks you very much for taking the time to submit an issue as it is a quite worrisome bug.

The rename in 5c5770a should effectively be of little consequence even for the ones that stayed as m_buckets as using m_buckets[x] is equivalent to m_buckets_data[x] if m_buckets_data is not empty. The m_buckets pointer should always point to m_buckets_data.data() as long as !m_buckets_data.empty(), otherwise it will point to static_empty_bucket_ptr(). If m_buckets_data is empty then m_buckets_data[x] will be invalid for any x (but m_buckets[0]will always be valid). There may be some bug with m_buckets not pointing correctly to non-empty m_buckets_data.data() but from a first glance the code seems sound. I'll double-check more.

Just to be sure the fb3bb98 parent commit works well? As there is some non-deterministic behaviour in the software I suppose enough iterations were run?

To help me debug would it be possible to define the TSL_DEBUG macro (and be sure that NDEBUG is not defined so that assertions are enabled) and report any assertion failure?

Would it also be possible to call the hopscotch_map/set::overflow_size() method to check if the overflow list is empty or not?

Thank you very much,
Thibaut

@egmontkob
Copy link
Author

Thanks a lot for your quick response!

With TSL_DEBUG I can get a consistent failure -- our app fails every time when inserting the 15th item to the map (checked in about ten different runs, it's always exactly the 15th). No deletions take place.

So apparently some internal consistency is broken way earlier than it hits the surface when not debugging.

overflow_size() always returns 0.

The assertion failure with 5c5770a is:

/usr/include/tsl/hopscotch_hash.h:1506: blahblahblah Assertion `!m_buckets[ibucket_for_hash].empty()' failed.

The parent fb3bb98 (a.k.a. v2.1.0) works fine.


Let me mention that we do have multiple hopscotch_maps and work on other ones too (including deleting items). I guess that should be irrelevant -- or do you perhaps share some data structures across them?

@egmontkob
Copy link
Author

I was a bit worried that it might be a memory corruption in our application (that would be embarrassing), somehow always affecting the map's memory area in the same way.

In order to check whether this is the case, I coded up a simple checksumming for the map, and print the value right before and right after every insertion. The value doesn't change unexpectedly: the one printed right after an insertion and the one printed right before the next insertion -- belonging to the same map object -- are always the same.

That is, it seems it's not a memory corruption in our app. (Also note that we are single threaded.)

Could you please make sure that this checksumming is correct and complete? Let me know if I missed a place I need to recurse to.

hop-checksum.patch.txt

@Tessil
Copy link
Owner

Tessil commented Jan 8, 2022

Thank you very much for your reply.

I'll check more in depth, knowing that it fails after 15 insertions without any erase and overflow (or move/copy/manual rehash of the map?) narrow the portions that can cause problem.

There is no global shared state so using multiple hopscotch_mapshould not be a problem.

In the meantime I ran a stress test script that randomly inserts and erases a random number of values with checks but no error so far.

As the bug seems quite consistent it's more likely to be the fault of the library than a memory corruption on your side.

It'd be useful if you eventually have the hashes of the 15 inserted values but I should probably be able to do without it if it's a private information.

@egmontkob
Copy link
Author

Something really mysterious is going on, and I'm still torn whether it'll be a bug on your side or ours :)

I've tried to copy the steps from a concrete run (inserting those exact values) to a standalone app and could not repeat the crash.

Also, I can only reproduce the crash with our app's production build (-O3, NDEBUG and who knows what else). I've only undef'ed NDEBUG for including your header file and then re-defined it. I cannot reproduce the crash with our app in debug mode (-O0 -g, no NDEBUG, our own ad-hoc debugging stuff everywhere, different memory layout due to debug variables, different timings of events...). Let me do some bisecting between the two kinds of builds, maybe something comes out of it.

Here's one debug output from our app. There's nothing special or secret in the numbers, the keys are small integers (somewhat changing every time, but roughly according to the same pattern), and the values are memory addresses. CON is the constructor of the object that has this map as a member. This value is repeated after ADD, DEL, CLEAR and fnd (find) operations. The "about to add" .. "added" lines are printed right around an insertion. "xor" is the checksum. The last ADD is where it crashes, obviously.

I don't think there's anything special here, and I couldn't cause a crash by manually replaying these steps in a standalone app, stress-test running that app thousands of times (I did this before filing the original report).

debug-output.txt


gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Linux foxy 5.4.0-92-generic #103-Ubuntu SMP Fri Nov 26 16:13:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

@egmontkob
Copy link
Author

egmontkob commented Jan 8, 2022

We have a totally unrelated class of 24 bytes, with several objects instantiated. If I squeeze one of the data types within that object so that the object shrinks to 16 bytes then the bug is gone.

If I disable link time optimization (no -flto) then the bug is gone.

If I compile with gcc/g++-10 (the Ubuntu 20.04 package: version 10.3.0-1ubuntu1~20.04) then the bug is gone.

Could it be that a bug in your code requires so special circumstances to arise??

At this point I much more suspect that we hit an obscure gcc bug that's hopefully been fixed. What do you think?

@egmontkob
Copy link
Author

If I add -fsanitize=undefined then the bug is gone.

Idea taken from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97456 which is the closest-looking in the list of fixed bugs for gcc-10.[123] after a quick glimpse. Although the linked bug is said to have appeared in 10.1, so it's not an exact match.

@Tessil
Copy link
Owner

Tessil commented Jan 8, 2022

Thank you very much for your detailed investigation.

The bug is quite strange and we could eventually have hit a GCC bug.

You could test with valgrind or ASan to check for potential invalid memory accesses. The library is tested with ASan and UBSan but some conditions arising here may not be tested.

Tomorrow I'll try to read a bit back the insertion portion code of the library and check that everything looks sane.

@egmontkob
Copy link
Author

Neither -fsanitize=address nor running under valgrind gives anything new, just abort on your assertion.

Your renaming commit is pretty small, especially if we separate the pure renaming from the functional changes. Do you think it helps to bisect the latter?

@egmontkob
Copy link
Author

This patch, when reverse-applied on "buggy" 5c5770a, partially reverts that commit.

hop-partial-revert.patchR.txt

You'll still have the new variable names, and the new use of m_buckets[] rather than m_bucket_data[] at as many place as possible, without crash. This version works correctly for me.

It also introduces two new assertions, to make sure that it really shouldn't matter if you use m_buckets[] or m_bucket_data[].


The new use of m_buckets rather than m_bucket_data all over the place seemed to be a bit more risky to me, for two reasons:

  • If overlooked, the m_bucket_data vector could grow and reallocate, without m_buckets being adjusted. Seems to me this is not the case, m_buckets_data only grows by the two .resize() calls and m_buckets is updated there.

  • If overlooked, m_buckets might be pointing to the global shared empty bucket, and perform operations on that (making that no longer an empty one). I had a few assertions, seems like this does not happen either.


So it is getting more and more suspicious that we're indeed facing a GCC issue.

@Tessil
Copy link
Owner

Tessil commented Jan 9, 2022

Effectively m_bucket_data should never grow and is only initialised/resized in the constructors and assignments operators, it never changes its size during the lifetime of the object (the rehash uses a new map and swap). If m_buckets was pointing to the static empty bucket and out-of-sync with m_buckets_data and we tried to access m_buckets[x > 0] it would show up in ASan or Valgrind quite quickly.

I stress-tested the map a bit more with ASan and TSL_DEBUG and can't reproduce the bug. The insertion code seems sound too.

I really think it could be a GCC issue or something external to the map. It could be interesting to check with Clang too.

I'll check a bit more the code as I want to be sure there is no such critical bug in the library and will let you know if I find anything. I'll close the ticket for now but If you can reproduce the bug with Clang or other versions of GCC don't hesitate to re-open it.

@Tessil Tessil closed this as completed Jan 9, 2022
@egmontkob
Copy link
Author

Thanks so much for the help and the investigation! And thanks so much for this great library, too! :)

As much as I got to understand your code in this short time, I agree with you that it really doesn't look like a bug there.


I tried clang versions 7..12 (out of which for version 9 only cmake complains for us that LTO is not supported... anyway) they all work properly.

I tried gcc 9.4.0 from focal-proposed, it's buggy exactly as the current 9.3.0.

I also tried gcc 7.5.0 and 8.4.0 as shipped by Ubuntu Focal, and they trigger the bug much more prominently. With gcc-9, one of our ~20 integration tests fail, with gcc-7/8 almost all of them fail at the said hopscotch assertion -- and again, only at 5c5770a and not at its parent commit.

Maybe if you feel like running another round of stress-test, you should try with gcc-7/8. But I still coudn't get the standalone app fail. There's more special circumstances needed to trigger the bug.

Fingers crossed that it's properly fixed in gcc-10.

@albi-k
Copy link

albi-k commented May 20, 2022

Hello,
I encountered this same behavior over a year ago, where a just inserted element would right after not be found.
I was using GCC 9 with LTO. At the time there was no open ticket I could find on it and I was unsure if it was truly a library bug.

However I was able to workaround the issue and wanted to document that here, in case it helps to understand the underlying issue.
I found that compiling with -fno-strict-aliasing made the problem go away, while keeping LTO and the rest of the optimizations.
I eventually ended up wrapping hopscotch_hash.h with

#pragma GCC optimize ("-fno-strict-aliasing") 
#pragma GCC reset_options

in order to avoid disabling strict aliasing for the entire codebase.

So this could mean the library violates the strict aliasing rule or that there is a compiler bug. I can't prove either way but disabling strict aliasing worked for me.

@Tessil
Copy link
Owner

Tessil commented May 21, 2022

Hello,

Thank you for the bug report.

P0137R1 introduced a change in the object model in C++17 which now requires to std::launder the reinterpreted pointer from std::aligned_storage. From my understanding and the discussion I have read, the change seems to be backward incompatible and could potentially make previously well-defined code become undefined. See the following discussion for some interesting details: https://stackoverflow.com/questions/47735657/does-reinterpret-casting-stdaligned-storage-to-t-without-stdlaunder-violat and https://groups.google.com/a/isocpp.org/g/std-discussion/c/xaAwFR6qUmY

Could you try with the latest 4442316 commit that fixes this undefined behaviour and see if it solves your problem?

@egmontkob-work
Copy link

Hi guys,

Thanks for coming back to this issue!

Unfortunately, 4442316 does not fix it for us.


I have since upgraded my computer to Ubuntu 22.04 with gcc-11 by default, with which we couldn't reproduce the bug yet. Even though gcc-9 is still available, I decided to fire up a brand new lxc container with 20.04 instead. I placed in it our source code as of the time we were hit by this issue, plus the tsl headers from 4442316 copied to /usr/include/tsl, then did a clean build. Our unittest still fails once out of every 10-20-ish occasions.

Just to be on the safe side, I repeated the same with fb3bb98 a.k.a. v2.1.0, the last version of hopscotch-map that didn't trigger the issue for us. The unittest has already passed 300 runs. Then back to 4442316, it's frequently failing again. This was just to verify my workflow that I'm indeed compiling against the hopscotch I'm thinking I'm compiling against.


Our project indeed uses -std=c++17, so you guys might very easily be on the right track, maybe you just need to launder at a few more places??

@egmontkob-work
Copy link

I eventually ended up wrapping hopscotch_hash.h with

#pragma GCC optimize ("-fno-strict-aliasing") 
#pragma GCC reset_options

I tried this too, added the first of these two lines to the very top of that file, and the second to the very bottom. On top of 4442316.

Seems to work correctly, the unittest passed another 200+ times.

@egmontkob-work
Copy link

Thought I'd do a bit of bisecting, to see where that -fno-strict-aliasing needs to go. So I wrapped hopscotch_hash.h lines 54..395, and in another attempt I restored that and wrapped lines 397..1888 instead. In both cases the bug was gone.

I'm not sure if this observation is useful at all.

@Tessil
Copy link
Owner

Tessil commented May 22, 2022

Thank you very much for the info and help.

reinterpret_cast is only used in the library with std::aligned_storage and so was the main potential culprit, the used static_casts also seem sound. I'll take take a deeper look as there could be another problem in the library.

Thought I'd do a bit of bisecting, to see where that -fno-strict-aliasing needs to go. So I wrapped hopscotch_hash.h lines 54..395, and in another attempt I restored that and wrapped lines 397..1888 instead. In both cases the bug was gone.

I'm not sure if this observation is useful at all.

So the bug is gone when only 54..359 is wrapped in -fno-strict-aliasing and it is also gone when only 397..1888 is wrapped (and not 54..359)? That's strange, will look into that.

@egmontkob-work
Copy link

So the bug is gone when only 54..359 is wrapped in -fno-strict-aliasing and it is also gone when only 397..1888 is wrapped (and not 54..359)?

Exactly (apart from the 359 vs 395 typo). That's why I couldn't continue bisecting :)

After the oddities like ...

We have a totally unrelated class of 24 bytes, with several objects instantiated. If I squeeze one of the data types within that object so that the object shrinks to 16 bytes then the bug is gone.

... I'm not surprised at anything.


We still haven't excluded the possibility of a gcc bug, have we? Although on the list of closed bugs for gcc-10.[123] there's no relevant title match for "aliasing" or "launder".

@Tessil
Copy link
Owner

Tessil commented May 22, 2022

Yes, it may still be a compiler error but @albi-k having encountered a similar problem mean it is more probable that the library has a problem.

I will try to do a thorough review of the code when I have some time to check if I eventually find something.

@markpapadakis
Copy link

We exclusively use clang, so the problem is definitely not specific to GCC.

rbalint added a commit to firebuild/firebuild that referenced this issue Jun 16, 2022
Using gcc-10 was a workaround for
Tessil/hopscotch-map#55 , but now it seems
that the issue is in hopscotch-map, since clang-built binaries are affected,
too.

This reverts commit 0afbea5.
@rbalint
Copy link

rbalint commented Jun 16, 2022

@Tessil Thank you for looking into the code. Maybe the issue could be reopened for the meantime to let others discover the workaround for GCC.

rbalint added a commit to firebuild/firebuild that referenced this issue Jun 16, 2022
Using gcc-10 was a workaround for
Tessil/hopscotch-map#55 , but now it seems
that the issue is in hopscotch-map, since clang-built binaries are affected,
too.

This reverts commit 0afbea5.
rbalint added a commit to firebuild/firebuild that referenced this issue Jun 17, 2022
Using gcc-10 was a workaround for
Tessil/hopscotch-map#55 , but now it seems
that the issue is in hopscotch-map, since clang-built binaries are affected,
too.

This reverts commit 0afbea5.
egmontkob-work pushed a commit to firebuild/firebuild that referenced this issue Jun 17, 2022
Using gcc-10 was a workaround for
Tessil/hopscotch-map#55 , but now it seems
that the issue is in hopscotch-map, since clang-built binaries are affected,
too.

This reverts commit 0afbea5.
@Tessil
Copy link
Owner

Tessil commented Jun 17, 2022

Thanks, I have re-opened it. I'll try to take a deeper look into it.

If anyone has a reproducible example, don't hesitate to share.

@Tessil Tessil reopened this Jun 17, 2022
@egmontkob-work
Copy link

egmontkob-work commented Jul 8, 2022

We've just run into the following crash under ASAN (-fsanitize=address -fsanitize=undefined). (Confidential bits removed from the log.)

  • Ubuntu Jammy 22.04
  • gcc-11.2
  • -std=c++20
  • anywhere where we #include a hopscotch lib, it's wrapped inside a #pragma GCC optimize ("-fno-strict-aliasing") / #pragma GCC reset_options pair
  • LTO enabled (we suspect that it might have an influence on strict aliasing, we're not sure)
  • Ubuntu's libtsl-hopscotch-map-dev package, v2.3.0, i.e. without the recent std::launder fix

I don't know if it's helpful for you or not. It's also unclear to me if it's supposed to work correctly without the std::launder fix, given all the other circumstances (e.g. strict aliasing being disabled).

==9129==ERROR: AddressSanitizer: heap-use-after-free on address 0x603000026260 at pc 0x55f15862fb6c bp 0x7ffeae108260 sp 0x7ffeae108250
READ of size 8 at 0x603000026260 thread T0
    #0 0x55f15862fb6b in std::_List_iterator<std::pair<KEYTYPE*, VALUETYPE*> >::operator++() (/path/to/executable+0x929b6b)
    #1 0x55f158628474 in tsl::detail_hopscotch_hash::hopscotch_hash<std::pair<KEYTYPE*, VALUETYPE*>, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > > >::hopscotch_iterator<false>::operator++() (/path/to/executable+0x922474)
    #2 0x55f15862321b in [our code]
    [...]

0x603000026260 is located 0 bytes inside of 32-byte region [0x603000026260,0x603000026280)
freed by thread T0 here:
    #0 0x7ff0565d122f in operator delete(void*, unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x55f158654cad in __gnu_cxx::new_allocator<std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> > >::deallocate(std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> >*, unsigned long) (/path/to/executable+0x94ecad)
    #2 0x55f15863a764 in std::allocator_traits<std::allocator<std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> > > >::deallocate(std::allocator<std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> > >&, std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> >*, unsigned long) (/path/to/executable+0x934764)
    #3 0x55f1586311d3 in std::__cxx11::_List_base<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > >::_M_put_node(std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> >*) (/path/to/executable+0x92b1d3)
    #4 0x55f15865bdb1 in std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > >::_M_erase(std::_List_iterator<std::pair<KEYTYPE*, VALUETYPE*> >) (/path/to/executable+0x955db1)
    #5 0x55f158650264 in std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > >::erase(std::_List_const_iterator<std::pair<KEYTYPE*, VALUETYPE*> >) (/path/to/executable+0x94a264)
    #6 0x55f158641e4c in tsl::detail_hopscotch_hash::hopscotch_hash<std::pair<KEYTYPE*, VALUETYPE*>, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > > >::erase_from_overflow(std::_List_const_iterator<std::pair<KEYTYPE*, VALUETYPE*> >, unsigned long) (/path/to/executable+0x93be4c)
    #7 0x55f1586358c6 in unsigned long tsl::detail_hopscotch_hash::hopscotch_hash<std::pair<KEYTYPE*, VALUETYPE*>, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > > >::erase<KEYTYPE*>(KEYTYPE* const&, unsigned long) (/path/to/executable+0x92f8c6)
    #8 0x55f15862e276 in unsigned long tsl::detail_hopscotch_hash::hopscotch_hash<std::pair<KEYTYPE*, VALUETYPE*>, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > > >::erase<KEYTYPE*>(KEYTYPE* const&) (/path/to/executable+0x928276)
    #9 0x55f15862742e in tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::erase(KEYTYPE* const&) (/path/to/executable+0x92142e)
    #10 0x55f15861c2d7 in [our code]
    [...]

previously allocated by thread T0 here:
    #0 0x7ff0565d01c7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x55f15866c3a0 in __gnu_cxx::new_allocator<std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> > >::allocate(unsigned long, void const*) (/path/to/executable+0x9663a0)
    #2 0x55f158667f1b in std::allocator_traits<std::allocator<std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> > > >::allocate(std::allocator<std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> > >&, unsigned long) (/path/to/executable+0x961f1b)
    #3 0x55f158662f76 in std::__cxx11::_List_base<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > >::_M_get_node() (/path/to/executable+0x95cf76)
    #4 0x55f15865ad8d in std::_List_node<std::pair<KEYTYPE*, VALUETYPE*> >* std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > >::_M_create_node<std::piecewise_construct_t const&, std::tuple<KEYTYPE* const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple<KEYTYPE* const&>&&, std::tuple<>&&) (/path/to/executable+0x954d8d)
    #5 0x55f15864eca3 in std::_List_iterator<std::pair<KEYTYPE*, VALUETYPE*> > std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > >::emplace<std::piecewise_construct_t const&, std::tuple<KEYTYPE* const&>, std::tuple<> >(std::_List_const_iterator<std::pair<KEYTYPE*, VALUETYPE*> >, std::piecewise_construct_t const&, std::tuple<KEYTYPE* const&>&&, std::tuple<>&&) (/path/to/executable+0x948ca3)
    #6 0x55f15863ff57 in std::_List_iterator<std::pair<KEYTYPE*, VALUETYPE*> > tsl::detail_hopscotch_hash::hopscotch_hash<std::pair<KEYTYPE*, VALUETYPE*>, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > > >::insert_in_overflow<std::piecewise_construct_t const&, std::tuple<KEYTYPE* const&>, std::tuple<>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > >, (void*)0>(unsigned long, std::piecewise_construct_t const&, std::tuple<KEYTYPE* const&>&&, std::tuple<>&&) (/path/to/executable+0x939f57)
    #7 0x55f1586337d7 in std::pair<tsl::detail_hopscotch_hash::hopscotch_hash<std::pair<KEYTYPE*, VALUETYPE*>, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > > >::hopscotch_iterator<false>, bool> tsl::detail_hopscotch_hash::hopscotch_hash<std::pair<KEYTYPE*, VALUETYPE*>, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > > >::insert_value<std::piecewise_construct_t const&, std::tuple<KEYTYPE* const&>, std::tuple<> >(unsigned long, unsigned long, std::piecewise_construct_t const&, std::tuple<KEYTYPE* const&>&&, std::tuple<>&&) (/path/to/executable+0x92d7d7)
    #8 0x55f15862bf4e in tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect::value_type& tsl::detail_hopscotch_hash::hopscotch_hash<std::pair<KEYTYPE*, VALUETYPE*>, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > > >::operator[]<KEYTYPE* const&, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, (void*)0>(KEYTYPE* const&) (/path/to/executable+0x925f4e)
    #9 0x55f158624dd0 in tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::operator[](KEYTYPE* const&) (/path/to/executable+0x91edd0)
    #10 0x55f15861b3e0 in [our code]
    [...]

@Tessil
Copy link
Owner

Tessil commented Jul 8, 2022

Hi, thank you very much for the report.

So an element was inserted, then erased, and an iterator increment is failing. I will look into it but just to be sure, can you double-check you are not incrementing an iterator which points to an erased element (iterators to erased elements are invalidated and can't be used)?

Code like the following is invalid and will create the same kind of error, I want to be sure the problem is not due to an incorrect usage:

https://godbolt.org/z/PdozW7GfM

#include <list>
#include <string>

int main() {
    std::list<std::string> l;
    auto it = l.insert(l.end(), "test");
    l.remove("test");
    ++it;
}
==1==ERROR: AddressSanitizer: heap-use-after-free on address 0x604000000010 at pc 0x000000402b5d bp 0x7ffd2b7f9510 sp 0x7ffd2b7f9508
READ of size 8 at 0x604000000010 thread T0
    #0 0x402b5c in std::_List_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::operator++() /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/stl_list.h:219
    #1 0x402b5c in main /app/example.cpp:8
    #2 0x7fb8a21490b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x240b2)
    #3 0x402dbd in _start (/app/output.s+0x402dbd)

0x604000000010 is located 0 bytes inside of 48-byte region [0x604000000010,0x604000000040)
freed by thread T0 here:
    #0 0x7fb8a30b9dd7 in operator delete(void*, unsigned long) (/opt/compiler-explorer/gcc-11.2.0/lib64/libasan.so.6+0xb3dd7)
    #1 0x403b79 in __gnu_cxx::new_allocator<std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::deallocate(std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, unsigned long) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/ext/new_allocator.h:145
    #2 0x403b79 in std::allocator<std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::deallocate(std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, unsigned long) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/allocator.h:211
    #3 0x403b79 in std::allocator_traits<std::allocator<std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::deallocate(std::allocator<std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, unsigned long) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/alloc_traits.h:492
    #4 0x403b79 in std::__cxx11::_List_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_put_node(std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/stl_list.h:446
    #5 0x403b79 in std::__cxx11::_List_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_clear() /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/list.tcc:81

previously allocated by thread T0 here:
    #0 0x7fb8a30b8f57 in operator new(unsigned long) (/opt/compiler-explorer/gcc-11.2.0/lib64/libasan.so.6+0xb2f57)
    #1 0x4015ec in __gnu_cxx::new_allocator<std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocate(unsigned long, void const*) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/ext/new_allocator.h:127
    #2 0x4015ec in std::allocator<std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocate(unsigned long) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/allocator.h:197
    #3 0x4015ec in std::allocator_traits<std::allocator<std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::allocate(std::allocator<std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, unsigned long) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/alloc_traits.h:460
    #4 0x4015ec in std::__cxx11::_List_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_get_node() /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/stl_list.h:442
    #5 0x4015ec in std::_List_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >* std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_create_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/stl_list.h:634
    #6 0x4015ec in std::_List_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::emplace<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::_List_const_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/list.tcc:92
    #7 0x4015ec in std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::insert(std::_List_const_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/stl_list.h:1309
    #8 0x4015ec in main /app/example.cpp:6

@egmontkob-work
Copy link

Hi,

Thanks for your quick and kind response -- as always :)

just to be sure, can you double-check you are not incrementing an iterator which points to an erased element

I am fairly certain that we don't have code like that (and it'd presumably crash with std::unordered_map as well as hopscotch <= 2.1.0, right?). I've checked all occurrences once again, and only found this code which was a bit worrying at first sight, but I think it is correct.

typedef struct _KEYTYPE {
  [...]
} KEYTYPE;

typedef struct _VALUETYPE {
  tsl::hopscotch_set<KEYTYPE*> keys;
  [...]
} VALUETYPE;

tsl::hopscotch_map<int, VALUETYPE *> ccccc;
tsl::hopscotch_map<KEYTYPE*, VALUETYPE *> ddddd;

void deleter_method(int i) {
  auto it = ccccc.find(i);
  if (it == ccccc.end()) {
    return;
  }
  auto boo = it->second;
  for (auto key : boo->keys) {
    ddddd.erase(key);
  }
  ccccc.erase(it);

  [... the rest of the method doesn't use "it"]
}

There are two parallel hashmaps, ccccc keyed by an int and ddddd keyed by KEYTYPE. The value is the same in the two cases, and that value lists all the KEYTYPE keys through which this value can be found.

We locate the element in ccccc, and in its value iterate through all the keys of KEYTYPE, using that to remove all those keys from ddddd. Therefore, we iterate in one hopscotch map, and via that iterator we erase elements from another hopscotch map. And finally we remove just one element from the hopscotch map that this iterator belongs to. This should be safe, correct? Does it raise any suspicion to you?

(Hopscotch doesn't "own" the objects, doesn't call delete/free() on its pointer values when they're erased from the map, correct? So no KEYTYPE or VALUETYPE destructor is run during the erase()s, no chance of them in turn doing ugly things. There's no shared_ptr nearby in this game. Apologies, after years and years of coding in C++ I still often get uncertain about its memory management :))

The middle stack trace "freed by thread T0 here" jumps from this #10 0x55f15861c2d7 in deleter_method(int) to the hopscotch code.

@Tessil
Copy link
Owner

Tessil commented Jul 9, 2022

Thank you very much for the extra information and help on debugging.

From a first glance the code seems to be well-defined though the ASan trace failure is on a hopscotch_map<KEYTYPE*, VALUETYPE*>::iterator increment but I don't see any such code in deleter_method. The only iterator increment is on the range for-loop but it's on a hopscotch_set<KEYTYPE*>. Is the #2 0x55f15862321b in [our code] also in deleter_methodas it's the problematic one?

READ of size 8 at 0x603000026260 thread T0
    #0 0x55f15862fb6b in std::_List_iterator<std::pair<KEYTYPE*, VALUETYPE*> >::operator++() (/path/to/executable+0x929b6b)
    #1 0x55f158628474 in tsl::detail_hopscotch_hash::hopscotch_hash<std::pair<KEYTYPE*, VALUETYPE*>, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::hopscotch_map<KEYTYPE*, VALUETYPE*, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<KEYTYPE*>, std::equal_to<KEYTYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> >, 62u, false, tsl::hh::power_of_two_growth_policy<2ul>, std::__cxx11::list<std::pair<KEYTYPE*, VALUETYPE*>, std::allocator<std::pair<KEYTYPE*, VALUETYPE*> > > >::hopscotch_iterator<false>::operator++() (/path/to/executable+0x922474)
    #2 0x55f15862321b in [our code]
    [...]

I am fairly certain that we don't have code like that (and it'd presumably crash with std::unordered_map as well as hopscotch <= 2.1.0, right?)

It should crash with 2.1.0 too but may not with std::unordered_map depending on the implementation. In my previous example I used std::list but using std::unordered_set doesn't cause the stacktrace (though it would still be an undefined usage).

@egmontkob-work
Copy link

Geez...

Is the #2 0x55f15862321b in [our code] also in deleter_methodas it's the problematic one?

No, it's elsewhere. Taking yet another really close look at that place... it's indeed a bug in our code. Bummer.

It's not trivial to spot because inside a for (auto pair : ddddd) { ... } it calls out to other functions, which in turn sometimes erase items from ddddd.

Shame on us, really! We owe you a lot of beer! :) Seriously, thank you so much for supporting us all along the way in something where we suspected hopscotch and even gcc to be the culprit, and turned out to be our bug!

It was really interesting though how the bug surfacing depended on the constellation of so many circumstances... something I haven't quite seen before.


Just wondering, the two other guys who commented that they faced this same bug, do they probably also have a similarly broken code?

And by the way, was the launder fix really necessary for c++17? Is it a useful outcome at least? :)


Code that erase()s an element (or clear()s all) under an iterator is buggy, but only fails rarely, works most of the time. Or seemingly works, and might only trigger a visible bug much later on. For software developers, such rare failure is a nightmare to track down, ideally it should fail always, or at least quite often.

Do you think it would be feasible to add some sort of protection in debug mode, in the style of gcc's stack-protector? Under the hood prepend or append a fixed magic constant to every key and/or value upon insertion, verify it when accessing the data, and wipe it out when erased from the set/map... or something along these lines?


Thanks again, your kind help is highly appreciated!

@Tessil
Copy link
Owner

Tessil commented Jul 9, 2022

No worry, thank you very much for your report and glad I could help :)

I'll leave the issue open for a bit, don't hesitate to let me know if you still encounter some weird behaviour (there may still be a bug in the library, we never know)

And by the way, was the launder fix really necessary for c++17? Is it a useful outcome at least? :)

Theoretically it's useful as it wasn't compliant to the C++17 standard before, in practice I wonder though for such corner case, see comment on isocpp discussion group: https://groups.google.com/a/isocpp.org/g/std-discussion/c/xaAwFR6qUmY/m/knkGv2AcBQAJ

Just wondering, the two other guys who commented that they faced this same bug, do they probably also have a similarly broken code?

Probably something different as it's related to strict aliasing but I haven't been able to find anything suspect in the library, the bug may come from somewhere else but there could still be something problematic in the lib. I need to review carefully the whole code of the library when I have time to be sure I don't violate the strict aliasing rule (the only reinterpret_cast I use seems sound, the rest are all static_cast).

For software developers, such rare failure is a nightmare to track down, ideally it should fail always, or at least quite often.

Yes, such subtle undefined behaviours are really difficult to track as there is no really guarantee on how it fails. The code was working with 2.1 and std::unordered_map but it may have been because the undefined behaviour was manifesting itself differently or just luck it worked. Sanitizers are really useful to try to track these.

Also one of the reason of the randomness may be that the problem is visible only when the neighbourhood is full and the items overflow into the overflow list and more subtly fails otherwise (see https://tessil.github.io/2016/08/29/hopscotch-hashing.html for details if you're interested). This is caused by a bad distribution of the hashes which is easy to occur when pointer addresses are used as keys with the identity hash function and power-of-two buckets size (see the second paragraph of the README).

Do you think it would be feasible to add some sort of protection in debug mode, in the style of gcc's stack-protector? Under the hood prepend or append a fixed magic constant to every key and/or value upon insertion, verify it when accessing the data, and wipe it out when erased from the set/map... or something along these lines?

There is the TSL_DEBUG to at least enable all the assertions I put in but adding such magic constant would add a lot of bloat and difficult to implement.

@egmontkob-work
Copy link

Turns out I'm not completely stupid... :)

The weird situation we saw in this thread is probably the result of two or more bugs.

The bug in our codebase that I just caught with your help (and my colleague @rbalint just fixed), namely erasing items of ddddd inside a for (auto pair : ddddd) { ... } loop, was introduced a few weeks after we experienced the first hopscotch-related crashes.

Going back to the version where we first noticed the crash, in an Ubuntu Focal container, I can obviously still reproduce the crash. It's also crashing with your launder fix applied. And I don't see an invalid iterator use in that version of our code.

We don't yet understand what causes that crash. Could, of course, be yet another bug in our codebase (we should take another really thorough look). Or a bug in gcc-9. Or a bug in hopscotch... (My uneducated guess would be gcc-9.)

Maybe we'll never get to know. Probably we won't spend more time on this issue, unless the bug surfaces again for us on Ubuntu Jammy.

@Tessil
Copy link
Owner

Tessil commented Jul 15, 2022

Thank you for the information. Honestly I am not sure what is happening, the commit 5c5770a seems innocent to me and I can't see why it would cause problems in a well-defined program.

I also checked for any strict-aliasing violation and compiled the code with --Wstrict-aliasing=1 (see https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wstrict-aliasing) but to no avail (the documentation says it should have very few false negatives).

The reinterpret_cast I use are used in the same fashion as in the example in https://en.cppreference.com/w/cpp/types/aligned_storage

I will close the issue for now as I haven't been able to reproduce or detect any of the mentioned bug. If anyone has more information (example, stacktrace, ...) feel free to comment or open a new issue.

@Tessil Tessil closed this as completed Jul 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants