mserver segfault during bulk loading/updating #3848
Last updated: 2016-01-15 11:37:55 +0100
Date: 2015-11-04 18:10:40 +0100
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36
I've been seeing a lot of crashes with the following backtrace since upgrading to Jul2015 (and Jul2015-SP1):
#0 BATproject (l=l@entry=0x0, r=r@entry=0x7fb7e8a3a820) at gdk_join.c:3299
Steps to Reproduce:
My workload includes a lot of high-frequency bulk inserts and updates that run in parallel with analytical queries. This particular trace seems to happen during periods of heavy loading (many COPY INTO statements and bulk updates back to back). I'm not sure whether it's correlated with the concurrent analytical queries or not.
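To make the statement mix concrete, here is a minimal sketch of what one burst looks like; the table, column, and file names are hypothetical, not from the actual workload:

-- bulk load, many of these back to back:
COPY INTO events FROM '/tmp/batch_0001.csv' USING DELIMITERS ',', '\n';
-- bulk update in the same stream:
UPDATE events SET status = 'processed' WHERE batch_id = 1;
-- analytical query running concurrently on another connection:
SELECT event_type, COUNT(*) FROM events GROUP BY event_type;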
Date: 2015-11-04 18:47:30 +0100
Note: I was previously running Feb2013-SP6, so I'm not sure if this bug was introduced in the latest build or somewhere in between.
Date: 2015-11-06 16:34:29 +0100
Here are a couple more errors I've been seeing that feel like they might be related to the same root cause.
Segfault with the following backtrace:
Error thrown during a query (select distinct col1, col2, col3, col4 from foo where [lots of predicates]):
Date: 2015-11-06 21:08:14 +0100
At this point, there are so many crashes happening that this version is basically unusable. I've been running it in parallel with our production system to evaluate it under realistic load, and it crashed 12 times in less than an hour!
Here are some more of the common backtraces:
#0 mvc_bind_wrap (cntxt=<optimized out>, mb=<optimized out>, stk=0x7fb5c4f658a0, pci=<optimized out>) at sql.c:1836
#0 hash_new (sa=0x3567e30, size=59, key=0x7fcf9a5ac320 <base_key>) at sql_hash.c:35
#0 strlen () at ../sysdeps/x86_64/strlen.S:106
#0 _list_find_name (l=0x4233e10, name=0x7f21c8137ee0 "event_type") at sql_catalog.c:38
Date: 2015-11-11 10:44:40 +0100
This is a hard one to assess. If you produce a GDB backtrace, consider looking at all threads, e.g. with 'thr apply all where'.
What exactly is the parallel behavior being enforced?
[LQ]*, i.e., an arbitrary interleaving of loads (L) and concurrent queries (Q)?
Please be as precise as possible about the workload so we can get a glimpse of what is going on.
Date: 2015-11-12 23:27:29 +0100
After a lot of investigation, I believe the problems may be caused by some corrupt BATs on disk. I've fixed the source of the corruption on our end, and we're going to try a fresh round of corruption-free testing soon. I know that at least one of the common stack traces listed here was caused by the corruption issue, and I suspect the others may be as well. I'll know more in a few days, and we may be able to close this ticket entirely.
Date: 2015-11-20 18:27:55 +0100
Created attachment 368
Date: 2015-11-20 18:28:35 +0100
Created attachment 369
Date: 2015-11-20 18:28:52 +0100
Created attachment 370
Date: 2015-11-20 18:40:57 +0100
I've finally been able to narrow this down to an easily repeatable test case. Run the attached setup script, then concurrently run the attached write-thread and read-thread scripts.
These run simple UPDATE ... WHERE and SELECT ... WHERE statements, respectively, 1000 times in a loop with a 100ms sleep between each query. On a Linux test server, this resulted in 52 crashes with backtraces that all look like:
#0 DELTAproject (result=0x7fa3100cb050, sub=<optimized out>, col=<optimized out>, uid=<optimized out>, uval=0x7fa3100cb010, ins=<optimized out>) at sql.c:2418
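For anyone reading along without the attachments, each iteration boils down to statements of roughly this shape (the schema here is hypothetical; the actual scripts are in the attachments above):

-- setup script, run once:
CREATE TABLE foo (id INT, val INT);
INSERT INTO foo VALUES (1, 0), (2, 0), (3, 0);
-- write thread, per iteration (1000 times, 100ms apart):
UPDATE foo SET val = val + 1 WHERE id = 1;
-- read thread, per iteration (1000 times, 100ms apart):
SELECT val FROM foo WHERE id = 1;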
On my Mac laptop, the failure mode is different. Instead of crashing, the query thread eventually hangs (usually after 10-20 queries; it never made it past 40). Once that happens, mserver starts using an enormous amount of CPU. The update thread continues to run, however, and the SQL write-ahead log just keeps growing and growing, never getting flushed. Meanwhile, I'm still able to execute queries, as long as they don't touch the column that's being updated. If I do a
This only happens with updates; running the same process with inserts causes no trouble. I was also only able to make this fail on Jul2015 and Jul2015-SP1; all prior versions that I tested (Feb2013-SP6, Jan2014, and Oct2014-SP4) functioned correctly.
Date: 2015-11-20 21:56:03 +0100
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=61ccf71c082f
Date: 2015-11-20 21:56:07 +0100
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=93ef82367cca
Date: 2015-11-20 22:00:35 +0100
Thanks for the test case. It made finding this bug so much easier.
Date: 2015-11-20 22:04:33 +0100
Welcome. And thanks for the fix. I'll have to try things out with your patches to see if there are any other lingering issues that this test didn't expose.
Date: 2015-11-21 16:28:43 +0100
Running with the latest code from the Jul2015 branch, the test case is still crashing very consistently (every 5-10 iterations of the loop). On Linux, it's getting SIGABRT with the following stack trace:
#0 0x00007fc3ad662cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
On Mac, it's just hanging the entire database: I can't connect, and there's no CPU activity.
Date: 2015-11-23 15:35:53 +0100
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=1cad4a5b5921
Date: 2015-11-23 17:18:34 +0100
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=759924c5300e
Date: 2015-11-23 17:22:35 +0100
With the latest version, the test case now works fine for me on both Linux and Mac. I'm going to try this out on my mirrored production machine and see how things go. Thanks!
Date: 2015-11-25 01:22:45 +0100
Created attachment 371
Date: 2015-11-25 01:23:10 +0100
Created attachment 372
Date: 2015-11-25 01:23:30 +0100
Created attachment 373
Date: 2015-11-25 01:31:49 +0100
I ran with the latest from the Jul2015 branch (as of http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=1cad4a5b5921) on my production mirror environment for a while, and I'm still seeing lots of crashes. They're different from the ones you've fixed, and a smaller set than I was seeing before, so it seems like we're making progress here.
I expanded my test case a bit to include a couple more columns (one with NULLs), some inserts per iteration, and updates that modify multiple columns using different criteria in the same transaction. This more closely resembles the behavior of our actual write thread, just in a much more basic and repeatable way. The new scripts are attached; a rough sketch of one iteration follows.
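Roughly, each iteration of the expanded write thread does something along these lines (again, the schema is hypothetical and only illustrates the shape of the real attached scripts):

START TRANSACTION;
-- an insert per iteration, including a NULL in the nullable column:
INSERT INTO foo (id, a, b, c) VALUES (1001, 5, NULL, 'x');
-- updates touching multiple columns, with different criteria:
UPDATE foo SET a = a + 1, c = 'y' WHERE id < 500;
UPDATE foo SET b = 0 WHERE c = 'x';
COMMIT;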
I just ran it on a Linux test box and got 194 core dumps over the duration of the test. I didn't have time to inspect every stack trace, but from a random sampling, they all seemed to look like the following (which I was also seeing in my production-mirror testing):
#0 0x00007f6562faccc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#0 0x00007f65a013920c in SQLtid (cntxt=0x7f65a076f328, mb=0x7f65941d4940, stk=0x7f65941ad8d0, pci=0x7f65941e3040) at sql.c:2547
#0 0x00007f43e237dcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
Date: 2015-12-24 22:13:07 +0100
The test now runs. Do you have more crashing examples?