Occasional map misbehavior, inserted element not found later on #55
Comments
Hello, Thank you very much for taking the time to submit an issue, as it is quite a worrisome bug. The rename in 5c5770a should effectively be of little consequence even for the ones that stayed as Just to be sure the fb3bb98 parent commit works well? As there is some non-deterministic behaviour in the software I suppose enough iterations were run? To help me debug would it be possible to define the Would it also be possible to call the Thank you very much, |
Thanks a lot for your quick response! With So apparently some internal consistency is broken way earlier than it hits the surface when not debugging.
The assertion failure with 5c5770a is:
The parent fb3bb98 (a.k.a. v2.1.0) works fine. Let me mention that we do have multiple hopscotch_maps and work on other ones too (including deleting items). I guess that should be irrelevant -- or do you perhaps share some data structures across them? |
I was a bit worried that it might be a memory corruption in our application (that would be embarrassing), somehow always affecting the map's memory area in the same way. In order to check whether this is the case, I coded up a simple checksumming for the map, and print the value right before and right after every insertion. The value doesn't change unexpectedly: the one printed right after an insertion and the one printed right before the next insertion -- belonging to the same map object -- are always the same. That is, it seems it's not a memory corruption in our app. (Also note that we are single threaded.) Could you please make sure that this checksumming is correct and complete? Let me know if I missed a place I need to recurse to. |
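The checksum code from this comment is not preserved in the thread, and it apparently covered the map's memory area. As a rough illustration of the before/after idea only, here is a sketch that checksums the logical contents instead. It assumes `std::unordered_map` as a stand-in for `tsl::hopscotch_map`; `xor_checksum` and the mixing constant are hypothetical choices, not the commenter's code.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical helper: folds every (key, value) pair into one 64-bit value.
// XOR is order-independent, so the iteration order of the map does not matter.
// Note: this only covers what iteration can reach, not the map's raw memory.
template <class Map>
std::uint64_t xor_checksum(const Map& m) {
    std::uint64_t sum = 0;
    for (const auto& kv : m) {
        const std::uint64_t k = std::hash<typename Map::key_type>{}(kv.first);
        const std::uint64_t v = std::hash<typename Map::mapped_type>{}(kv.second);
        // Combine key and value before folding, so the pairing is part of the sum.
        sum ^= k * 0x9E3779B97F4A7C15ull + v;
    }
    return sum;
}
```

Printing this value right after one operation and right before the next one on the same map object should give the same number if nothing touched the contents in between.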
Thank you very much for your reply. I'll check more in depth; knowing that it fails after 15 insertions without any erase and overflow (or move/copy/manual rehash of the map?) narrows down the portions that can cause the problem. There is no global shared state so using multiple In the meantime I ran a stress test script that randomly inserts and erases a random number of values with checks, but no error so far. As the bug seems quite consistent it's more likely to be the fault of the library than a memory corruption on your side. It'd be useful to have the hashes of the 15 inserted values if you can share them, but I should probably be able to do without them if it's private information. |
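A randomized stress test of the kind mentioned above could look roughly like this. It is only a sketch under assumptions: `stress_test` is a hypothetical helper, the key range and operation mix are arbitrary, and the map under test is a template parameter so that `std::unordered_map` can stand in where the tsl headers are not available.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <unordered_map>

// Hypothetical helper: applies random inserts and erases to the map under
// test and to a std::map reference, comparing the two after every step.
// Returns true if no divergence was observed.
template <class Map>
bool stress_test(unsigned seed, std::size_t steps) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> key_dist(0, 255);
    std::uniform_int_distribution<int> op_dist(0, 2);

    Map tested;
    std::map<int, int> reference;

    for (std::size_t i = 0; i < steps; ++i) {
        const int key = key_dist(rng);
        if (op_dist(rng) == 0) {
            // Erase roughly one third of the time.
            tested.erase(key);
            reference.erase(key);
        } else {
            tested[key] = static_cast<int>(i);
            reference[key] = static_cast<int>(i);
        }

        if (tested.size() != reference.size()) return false;
        // Every key in the reference must be found with the same value.
        for (const auto& kv : reference) {
            auto it = tested.find(kv.first);
            if (it == tested.end() || it->second != kv.second) return false;
        }
    }
    return true;
}
```

Running it with several seeds, and with the map type swapped for the one being investigated, gives a cheap way to look for divergence from a known-good container.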
Something really mysterious is going on, and I'm still torn whether it'll be a bug on your side or ours :) I've tried to copy the steps from a concrete run (inserting those exact values) to a standalone app and could not repeat the crash. Also, I can only reproduce the crash with our app's production build ( Here's one debug output from our app. There's nothing special or secret in the numbers, the keys are small integers (somewhat changing every time, but roughly according to the same pattern), and the values are memory addresses. CON is the constructor of the object that has this map as a member. This value is repeated after ADD, DEL, CLEAR and fnd (find) operations. The "about to add" .. "added" lines are printed right around an insertion. "xor" is the checksum. The last ADD is where it crashes, obviously. I don't think there's anything special here, and I couldn't cause a crash by manually replaying these steps in a standalone app, stress-test running that app thousands of times (I did this before filing the original report). gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 |
We have a totally unrelated class of 24 bytes, with several objects instantiated. If I squeeze one of the data types within that object so that the object shrinks to 16 bytes then the bug is gone. If I disable link time optimization (no If I compile with gcc/g++-10 (the Ubuntu 20.04 package: version 10.3.0-1ubuntu1~20.04) then the bug is gone. Could it be that a bug in your code requires so special circumstances to arise?? At this point I much more suspect that we hit an obscure gcc bug that's hopefully been fixed. What do you think? |
If I add Idea taken from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97456 which is the closest-looking in the list of fixed bugs for gcc-10.[123] after a quick glimpse. Although the linked bug is said to have appeared in 10.1, so it's not an exact match. |
Thank you very much for your detailed investigation. The bug is quite strange and we may indeed have hit a GCC bug. You could test with valgrind or ASan to check for potential invalid memory accesses. The library is tested with ASan and UBSan but some conditions arising here may not be tested. Tomorrow I'll try to re-read the insertion code of the library and check that everything looks sane. |
Neither Your renaming commit is pretty small, especially if we separate the pure renaming from the functional changes. Do you think it helps to bisect the latter? |
This patch, when reverse-applied on "buggy" 5c5770a, partially reverts that commit. You'll still have the new variable names, and the new use of It also introduces two new assertions, to make sure that it really shouldn't matter if you use The new use of
So it is getting more and more suspicious that we're indeed facing a GCC issue. |
Effectively I stress-tested the map a bit more with ASan and TSL_DEBUG and can't reproduce the bug. The insertion code seems sound too. I really think it could be a GCC issue or something external to the map. It could be interesting to check with Clang too. I'll check the code a bit more as I want to be sure there is no such critical bug in the library and will let you know if I find anything. I'll close the ticket for now but if you can reproduce the bug with Clang or other versions of GCC don't hesitate to re-open it. |
Thanks so much for the help and the investigation! And thanks so much for this great library, too! :) As much as I got to understand your code in this short time, I agree with you that it really doesn't look like a bug there. I tried clang versions 7..12 (out of which for version 9 only I tried gcc 9.4.0 from focal-proposed, it's buggy exactly as the current 9.3.0. I also tried gcc 7.5.0 and 8.4.0 as shipped by Ubuntu Focal, and they trigger the bug much more prominently. With gcc-9, one of our ~20 integration tests fails, with gcc-7/8 almost all of them fail at the said hopscotch assertion -- and again, only at 5c5770a and not at its parent commit. Maybe if you feel like running another round of stress-test, you should try with gcc-7/8. But I still couldn't get the standalone app to fail. There are more special circumstances needed to trigger the bug. Fingers crossed that it's properly fixed in gcc-10. |
Hello, However I was able to work around the issue and wanted to document that here, in case it helps to understand the underlying issue.
in order to avoid disabling strict aliasing for the entire codebase. So this could mean the library violates the strict aliasing rule or that there is a compiler bug. I can't prove either way but disabling strict aliasing worked for me. |
Hello, Thank you for the bug report. P0137R1 introduced a change in the object model in C++17 which now requires to Could you try with the latest 4442316 commit that fixes this undefined behaviour and see if it solves your problem? |
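For readers unfamiliar with the C++17 rule referred to above, here is a minimal sketch of the general pattern: an object is created with placement new inside raw aligned storage and is later reached through the storage's address via `std::launder`. `Slot` is a hypothetical type used for illustration only, not the library's bucket class, and the sketch needs a C++17 compiler.

```cpp
#include <cassert>
#include <new>      // placement new, std::launder (C++17)
#include <utility>

// Hypothetical slot type for illustration; not the library's bucket class.
template <class T>
class Slot {
public:
    Slot() = default;
    Slot(const Slot&) = delete;
    Slot& operator=(const Slot&) = delete;
    ~Slot() { clear(); }

    // Destroys any current value, then constructs a new T in the raw storage.
    template <class... Args>
    void emplace(Args&&... args) {
        clear();
        ::new (static_cast<void*>(storage_)) T(std::forward<Args>(args)...);
        engaged_ = true;
    }

    void clear() {
        if (engaged_) {
            value().~T();
            engaged_ = false;
        }
    }

    bool engaged() const { return engaged_; }

    // The address of the byte array is not automatically a pointer to the T
    // living inside it; std::launder yields a pointer to that object.
    T& value() {
        assert(engaged_);
        return *std::launder(reinterpret_cast<T*>(storage_));
    }

private:
    alignas(T) unsigned char storage_[sizeof(T)];
    bool engaged_ = false;
};
```

Whether the absence of `std::launder` ever changes generated code in practice is a separate question; the sketch only shows what the standard-conforming access looks like.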
Hi guys, Thanks for coming back to this issue! Unfortunately, 4442316 does not fix it for us. I have since upgraded my computer to Ubuntu 22.04 with gcc-11 by default, with which we couldn't reproduce the bug yet. Even though gcc-9 is still available, I decided to fire up a brand new lxc container with 20.04 instead. I placed in it our source code as of the time we were hit by this issue, plus the tsl headers from 4442316 copied to /usr/include/tsl, then did a clean build. Our unittest still fails once out of every 10-20-ish occasions. Just to be on the safe side, I repeated the same with fb3bb98 a.k.a. Our project indeed uses |
I tried this too, added the first of these two lines to the very top of that file, and the second to the very bottom. On top of 4442316. Seems to work correctly, the unittest passed another 200+ times. |
Thought I'd do a bit of bisecting, to see where that I'm not sure if this observation is useful at all. |
Thank you very much for the info and help.
So the bug is gone when only 54..359 is wrapped in |
Exactly (apart from the 359 vs 395 typo). That's why I couldn't continue bisecting :) After the oddities like ...
... I'm not surprised at anything. We still haven't excluded the possibility of a gcc bug, have we? Although on the list of closed bugs for gcc-10.[123] there's no relevant title match for "aliasing" or "launder". |
Yes, it may still be a compiler error but @albi-k having encountered a similar problem means it is more likely that the library has a problem. I will try to do a thorough review of the code when I have some time to check whether I can find something. |
We exclusively use clang, so the problem is definitely not specific to GCC. |
Using gcc-10 was a workaround for Tessil/hopscotch-map#55 , but now it seems that the issue is in hopscotch-map, since clang-built binaries are affected, too. This reverts commit 0afbea5.
@Tessil Thank you for looking into the code. Maybe the issue could be reopened for the meantime to let others discover the workaround for GCC. |
Thanks, I have re-opened it. I'll try to take a deeper look into it. If anyone has a reproducible example, don't hesitate to share. |
We've just run into the following crash under ASAN (
I don't know if it's helpful for you or not. It's also unclear to me if it's supposed to work correctly without the
|
Hi, thank you very much for the report. So an element was inserted, then erased, and an iterator increment is failing. I will look into it but just to be sure, can you double-check you are not incrementing an iterator which points to an erased element (iterators to erased elements are invalidated and can't be used)? Code like the following is invalid and will create the same kind of error, I want to be sure the problem is not due to an incorrect usage: https://godbolt.org/z/PdozW7GfM
#include <list>
#include <string>
int main() {
std::list<std::string> l;
auto it = l.insert(l.end(), "test");
l.remove("test");
++it;
}
|
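By contrast, a well-defined way to erase while iterating is to continue from the iterator that `erase` returns instead of touching the erased one. A minimal sketch with `std::list`, matching the container used in the example above; `remove_and_count` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical helper: removes every element equal to `target` and returns
// how many were removed, never using an iterator to an erased element.
inline std::size_t remove_and_count(std::list<std::string>& l,
                                    const std::string& target) {
    std::size_t removed = 0;
    for (auto it = l.begin(); it != l.end(); ) {
        if (*it == target) {
            it = l.erase(it);  // erase returns the iterator after the erased one
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

The key point is that the loop header has no unconditional `++it`: the iterator is either replaced by the result of `erase` or advanced, never both.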
Hi, Thanks for your quick and kind response -- as always :)
I am fairly certain that we don't have code like that (and it'd presumably crash with
There are two parallel hashmaps, We locate the element in (Hopscotch doesn't "own" the objects, doesn't call The middle stack trace "freed by thread T0 here" jumps from this |
Thank you very much for the extra information and help on debugging. From a first glance the code seems to be well-defined though the ASan trace failure is on a
It should crash with 2.1.0 too but may not with |
Geez...
No, it's elsewhere. Taking yet another really close look at that place... it's indeed a bug in our code. Bummer. It's not trivial to spot because inside a Shame on us, really! We owe you a lot of beer! :) Seriously, thank you so much for supporting us all along the way in something where we suspected hopscotch and even gcc to be the culprit, and turned out to be our bug! It was really interesting though how the bug surfacing depended on the constellation of so many circumstances... something I haven't quite seen before. Just wondering, the two other guys who commented that they faced this same bug, do they probably also have a similarly broken code? And by the way, was the Code that Do you think it would be feasible to add some sort of protection in debug mode, in the style of gcc's stack-protector? Under the hood prepend or append a fixed magic constant to every key and/or value upon insertion, verify it when accessing the data, and wipe it out when erased from the set/map... or something along these lines? Thanks again, your kind help is highly appreciated! |
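The magic-constant idea suggested above could be sketched roughly as follows. This is not something the library provides: `Guarded`, `kMagic`, and `wipe` are hypothetical names, and the sketch assumes the container would call `wipe()` when it erases an element while the storage is still valid.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical canary value; any fixed, unlikely bit pattern would do.
constexpr std::uint32_t kMagic = 0xC0FFEE42u;

// Hypothetical wrapper that stores a canary next to the value.
template <class T>
class Guarded {
public:
    explicit Guarded(T value) : magic_(kMagic), value_(std::move(value)) {}

    // Would be called by the (hypothetical) container when erasing the element.
    void wipe() { magic_ = 0; }

    bool intact() const { return magic_ == kMagic; }

    // Access checks the canary first, so a wiped element trips the assertion.
    const T& get() const {
        assert(intact() && "access to an erased or corrupted element");
        return value_;
    }

private:
    std::uint32_t magic_;
    T value_;
};
```

Such a check costs extra memory per element and only catches accesses that go through the wrapper, so it would be a best-effort diagnostic at most.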
No worry, thank you very much for your report and glad I could help :) I'll leave the issue open for a bit, don't hesitate to let me know if you still encounter some weird behaviour (there may still be a bug in the library, we never know)
Theoretically it's useful as it wasn't compliant to the C++17 standard before, in practice I wonder though for such corner case, see comment on isocpp discussion group: https://groups.google.com/a/isocpp.org/g/std-discussion/c/xaAwFR6qUmY/m/knkGv2AcBQAJ
Probably something different as it's related to strict aliasing but I haven't been able to find anything suspect in the library, the bug may come from somewhere else but there could still be something problematic in the lib. I need to review carefully the whole code of the library when I have time to be sure I don't violate the strict aliasing rule (the only
Yes, such subtle undefined behaviours are really difficult to track as there is no real guarantee on how they fail. The code was working with 2.1 and Also one of the reasons for the randomness may be that the problem is visible only when the neighbourhood is full and the items overflow into the overflow list, and fails more subtly otherwise (see https://tessil.github.io/2016/08/29/hopscotch-hashing.html for details if you're interested). This is caused by a bad distribution of the hashes, which easily occurs when pointer addresses are used as keys with the identity hash function and a power-of-two bucket count (see the second paragraph of the README).
There is the TSL_DEBUG to at least enable all the assertions I put in, but adding such a magic constant would add a lot of bloat and be difficult to implement. |
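The effect of aligned addresses on a power-of-two bucket count can be illustrated with a small calculation. This is a sketch with synthetic addresses, not a measurement of the library: `distinct_buckets` is a hypothetical helper and it assumes the bucket index is simply the address masked with `bucket_count - 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Counts how many distinct buckets are hit when `count` synthetic addresses,
// spaced `alignment` bytes apart, are mapped with an identity hash and a
// power-of-two bucket count (bucket = address & (bucket_count - 1)).
inline std::size_t distinct_buckets(std::uintptr_t base, std::size_t alignment,
                                    std::size_t count, std::size_t bucket_count) {
    // The masking trick is only valid for a non-zero power of two.
    assert(bucket_count != 0 && (bucket_count & (bucket_count - 1)) == 0);
    std::set<std::size_t> used;
    for (std::size_t i = 0; i < count; ++i) {
        const std::uintptr_t address = base + i * alignment;
        used.insert(static_cast<std::size_t>(address & (bucket_count - 1)));
    }
    return used.size();
}
```

For example, 64 addresses spaced 16 bytes apart starting at 0x1000 land in only 4 of 64 buckets, because the low four bits of every address are zero, whereas 64 consecutive integers would use all 64.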
Turns out I'm not completely stupid... :) The weird situation we saw in this thread is probably the result of two or more bugs. The bug in our codebase that I just caught with your help (and my colleague @rbalint just fixed), namely erasing items of Going back to the version where we first noticed the crash, in an Ubuntu Focal container, I can obviously still reproduce the crash. It's also crashing with your We don't yet understand what causes that crash. Could, of course, be yet another bug in our codebase (we should take another really thorough look). Or a bug in gcc-9. Or a bug in hopscotch... (My uneducated guess would be gcc-9.) Maybe we'll never get to know. Probably we won't spend more time on this issue, unless the bug surfaces again for us on Ubuntu Jammy. |
Thank you for the information. Honestly I am not sure what is happening, the commit 5c5770a seems innocent to me and I can't see why it would cause problems in a well-defined program. I also checked for any strict-aliasing violation and compiled the code with The I will close the issue for now as I haven't been able to reproduce or detect any of the mentioned bugs. If anyone has more information (example, stacktrace, ...) feel free to comment or open a new issue. |
Hello,
We are experiencing a weird misbehavior with tsl/hopscotch_map, bisected to commit 5c5770a being the culprit.
According to its changelog entry, this looks like an innocent commit purely renaming variables, namely m_buckets to m_buckets_data and m_first_or_empty_bucket to m_buckets. However, danger lurks around in one of the old names being the same as the other new name.
At several places (e.g. hopscotch_hash.h line 1246 being the first one) m_buckets remains m_buckets. Based on the changelog's wording, these look like accidental mistakes rather than intentional "changes".
Into our tsl::hopscotch_map<int, void *> we insert about 1000 entries (using map[key] = val) and erase about 800 (using it = map.find(key); blahblah; map.erase(it) and then no longer using it afterwards). These operations interleave each other in kinda random order, and we sometimes re-add a key that we erased earlier.
Then, after one insertion, the following situation arises:
If I iterate through all the items using for (auto it : map) {...} then I get all the about 200 items, as expected.
However, for one of these entries map.count(key) gives 0 instead of the expected 1. Trying to access this item using .at(key) raises an exception, as usual for missing entries. Or, trying to access this key using [key] inserts this item again, with the default value of nullptr. After this operation, if I loop through the items again using for (auto it : map) {...} then this key gets listed twice, once with the expected value and once with nullptr. And at this point it becomes another key that is "missing" in this weird sense that looping over all the entries finds it, but looking it up directly does not.
Unfortunately I couldn't come up with a self-contained test demonstrating the failure, and I cannot give you access to our in-house software that every now and then triggers this bug. Our software has a very non-deterministic behavior by its nature, but as of your aforementioned commit we see the misbehavior about 1 out of 20-50-ish cases. We haven't seen misbehavior with your hopscotch_map prior to the mentioned commit during thousands of runs, nor with the standard unordered_map.
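The symptom described above (iteration finds a key that a direct lookup misses) can be checked mechanically. The following is only a sketch: `unreachable_keys` is a hypothetical helper, and it is written as a template so it can be tried with `std::unordered_map` as well as with the map type under suspicion.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns the keys that iteration yields but that a
// direct lookup with count() does not report exactly once.
template <class Map>
std::vector<typename Map::key_type> unreachable_keys(const Map& m) {
    std::vector<typename Map::key_type> bad;
    for (const auto& kv : m) {
        if (m.count(kv.first) != 1) bad.push_back(kv.first);
    }
    return bad;
}
```

Calling it after every insertion and erase would pinpoint the first operation after which the map becomes inconsistent; for a healthy map the returned vector is always empty.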
Let me know if you need any help in debugging this issue (without me digging deeply into your code :)). For example, if you create a version that dumps into a global fd the memory region of the map after every insertion/deletion, I'd be happy to test that. But maybe you just want to revert the functionality and redo the variable naming properly :)
Thanks in advance!