Crash at exit (overrun THRerrorcount?) #3688
Closed
Date: 2015-03-19 13:58:17 +0100
From: Richard Hughes <<richard.monetdb>>
To: GDK devs <>
Version: 11.19.9 (Oct2014-SP2)
Last updated: 2015-05-07 12:37:45 +0200
Comment 20722
Date: 2015-03-19 13:58:17 +0100
From: Richard Hughes <<richard.monetdb>>
Build is Oct2014 e58372859532
I got this crash at shutdown:
(gdb) bt
#0 0x00007f4dd2eff44a in pthread_join (threadid=4294967296,
    thread_return=thread_return@entry=0x0) at pthread_join.c:47
#1 0x00007f4dd4084c36 in MT_join_thread (t=<optimized out>)
    at gdk_system.c:656
#2 0x00007f4dd3ff9014 in GDKexit (status=status@entry=0) at gdk_utils.c:1230
#3 0x00007f4dd45122dc in mal_exit () at mal.c:215
#4 <signal handler called>
#5 0x00007f4dd2c2bf23 in select () at ../sysdeps/unix/syscall-template.S:81
#6 0x00007f4dd4085c09 in MT_sleep_ms (ms=<optimized out>) at gdk_posix.c:1096
#7 0x00000000004026da in main (argc=24, av=0x0) at mserver5.c:656
Note the absurd value for threadid.
Dumping out the memory around GDKvmtrim_id (which is the global variable that absurd value was derived from):
(gdb) x/10a &GDKvmtrim_id-4
0x7f4dd44b1c28 <THRerrorcount+40>: 0x0 0x0
0x7f4dd44b1c38 <THRerrorcount+56>: 0x1 0x100000000
0x7f4dd44b1c48 <GDKvmtrim_id>: 0x100000001 0xffffff8a2e450000
0x7f4dd44b1c58 <GDK_mallocedbytes_estimate>: 0x11546c6f9 0xe31863
0x7f4dd44b1c68 : 0x100000001 0x1
The first thing I looked at was the purpose of THRerrorcount and whether it could be overrun and, sure enough, it looks very suspicious to me:
gdk_utils.c:
doGDKaddbuf(const char *prefix, const char *message, size_t messagelen, const char *suffix)
...
THRerrorcount[THRgettid()]++;
Can I ask somebody to eyeball this and see if you agree with my analysis? I'm running a 24-core machine, so overrunning a 16-element array is very doable. I think this may also be why mserver5 has sometimes completely stopped doing anything and, upon attaching a debugger, I found that most of the threads had mysteriously disappeared: overwriting GDKstopped would cause this to happen.
FYI, nobody ever reads THRerrorcount so the obvious fix is to remove it entirely.
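To make the hazard concrete, here is a minimal sketch of a bounds-checked per-thread counter. It is not MonetDB code: `errorcount`, `count_error`, `errors_for`, and `dropped_count` are hypothetical names, and only the 16-slot size comes from this report. It illustrates the kind of guard the unchecked `THRerrorcount[THRgettid()]++` lacks; removing the unused counter, as suggested above, avoids the problem altogether.

```c
#include <stddef.h>

#define ERRCOUNT_SLOTS 16              /* array size mentioned in the report */

static int errorcount[ERRCOUNT_SLOTS]; /* hypothetical stand-in for THRerrorcount */
static int dropped;                    /* increments refused as out of range */

/* Count one error for thread id `tid`.
 * Returns 1 if the counter was updated, 0 if `tid` fell outside the array
 * and the update was refused instead of writing past the end. */
int count_error(int tid)
{
    if (tid < 0 || tid >= ERRCOUNT_SLOTS) {
        dropped++;
        return 0;
    }
    errorcount[tid]++;
    return 1;
}

/* Errors recorded for `tid`, or -1 if `tid` is out of range. */
int errors_for(int tid)
{
    if (tid < 0 || tid >= ERRCOUNT_SLOTS)
        return -1;
    return errorcount[tid];
}

/* Number of increments that were refused. */
int dropped_count(void)
{
    return dropped;
}
```

With this guard, a call such as `count_error(26)` is refused and tallied in `dropped` rather than modifying whatever global happens to follow the array in memory.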
Comment 20732
Date: 2015-03-20 16:26:38 +0100
From: @sjoerdmullender
It does indeed look very suspicious. I'll remove this THRerrorcount completely.
Comment 20733
Date: 2015-03-20 16:27:54 +0100
From: MonetDB Mercurial Repository <>
Changeset 346077203679 made by Sjoerd Mullender <sjoerd@acm.org> in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=346077203679
Changeset description:
Comment 20735
Date: 2015-03-20 16:54:37 +0100
From: Richard Hughes <<richard.monetdb>>
Thanks.
BTW, while you were fixing it, I was confirming my theory by adding a printf to doGDKaddbuf which displays the value of THRgettid(). It is indeed easy to provoke a simple overrun of that array by causing a division by zero error:
create table bar as select cast(0 as int) as value with data;
select 1/0 from bar;
The highest value I've seen for gdk_nr_threads=24 is 26. This bug will definitely explain a load of other random glitches I've seen but never told you about (because I've been unable to reproduce).