SIGSEGV in strPut due to shared heap #6118
Date: 2016-11-10 14:22:52 +0100
Last updated: 2016-12-21 13:07:02 +0100
Date: 2016-11-10 14:22:52 +0100
Depressing disclaimer: I got this core dump from our production systems. It has happened once, and it happened on a Jul2015 build (b9cb28d6243b).
Program terminated with signal SIGSEGV, Segmentation fault.
...so we've got an obvious NULL pointer dereference. h->base was NULL because:
(gdb) thread 20
Notice that these two threads are both doing strPut on vheap 0x7fa4ce3b3c90 concurrently. For completeness, here's the parent of that vheap (although I don't think it's important to the story):
(gdb) p *BBP.cache->T
My idea for a fix (against the default branch as of today) is:
diff -r 05a0bf0401ff gdk/gdk_batop.c
What do you think? Does this seem like a plausible theory? If it is, can you think of what the reproduction recipe might be?
[My apologies for yet again starting with a conclusion and trying to work backwards from there]
Date: 2016-11-14 17:08:43 +0100
Here's a recipe for going down the vulnerable code path. I ran it with gdk_nr_threads=2 and the default_pipe:
create table foo as select cast(value as varchar(64)) as v from generate_series(cast(0 as int),250000) with data;
That gets me into strPut() when the destination BAT doesn't own its vheap, i.e. b->T->vheap->parentid != b->batCacheid
I haven't managed to reproduce the crash from this test yet, presumably either because I don't have enough concurrency or because the heap is already adequately-sized.
I've also been staring at the implementation of insert_string_bat() and I have a suspicion that the final call to bunfastins() in that function also requires a call to unshare_string_heap(). It's complicated, though, so I'm not 100% sure.
Date: 2016-11-30 19:22:37 +0100
(In reply to Richard Hughes from comment 1)
This case looks pretty benign, but I will make a couple of small changes to really make it benign. What I see happening here is that b and n both use the same string heap from a third bat. Since they both use the same string heap, their offsets are completely compatible and we don't have to mess with the string heap (it necessarily contains all strings of n). I will just make sure that the offsets are copied in this case by setting toff to 0.
Maybe that is actually a slightly different case.
This should be taken care of by the only call to unshare_string_heap already in insert_string_bat. Only a transient bat can use another bat's string heap (the test for role == TRANSIENT), and in that case, if we can't share the string heaps of b and n (toff is not set to a new value, so remains ~0), the string heap is unshared. And only in this case should we get to the final bunfastins.
Date: 2016-11-30 19:23:16 +0100
For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=6230882d2425
Date: 2016-12-01 13:04:05 +0100
I'm not sure. I pulled your changes and put a breakpoint on strPut(), using the same query as before. Here's the mclient session:
at this point my breakpoint got hit:
Breakpoint 4, strPut (h=0x63ec70, dst=0x7fffe1191da0,
This seems to be telling me that strPut() is writing to the PERSISTENT vheap owned by the column foo.v. If I dump bat/06/666.theap as hex then I see it contains negative values at the end, which I think ought to be impossible. Weirdly, it also grows by a full set of -1000..-950 values every time I run the SELECT, so somehow double elimination isn't working.
Date: 2016-12-01 17:54:19 +0100
For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=a4160f607bef
Date: 2016-12-01 18:45:46 +0100
Thanks. That now works in all the situations I can think of testing.
While you were looking at that, I got distracted by the lack of double-elimination. It turns out that that happens because BATload_intern() calls strCleanHash() which wipes out the complete hash table for any vheap >= 64KB. The BBP policy changed recently(ish) to unload BATs much more aggressively, meaning that the hash table is cleared between most transactions. I had to go all the way back to r17805 for the explanation for the wiping: "heap may have been mmaped-ed, appended-by-force, and then corrupted by crash"
Is this unnecessary disk consumption of interest to you? If yes, it shouldn't be tracked in this bug. If no, then ignore it and I'll wait until there's actual evidence that it's a problem.
Date: 2016-12-01 21:13:18 +0100
(In reply to Richard Hughes from comment 6)
Good to hear.
I think it's worthwhile to open a bug report so that we don't forget. It's probably not very difficult to fix this. Just recreate the hash table instead of merely clearing it.
Date: 2016-12-02 12:22:01 +0100
(In reply to Sjoerd Mullender from comment 7)
Date: 2016-12-08 10:05:36 +0100
This bug seems finally fixed.
The text was updated successfully, but these errors were encountered: