Crash at exit (overrun THRerrorcount?) #3688

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments

@monetdb-team monetdb-team commented Nov 30, 2020

Date: 2015-03-19 13:58:17 +0100
From: Richard Hughes <<richard.monetdb>>
To: GDK devs <>
Version: 11.19.9 (Oct2014-SP2)

Last updated: 2015-05-07 12:37:45 +0200

Comment 20722

Date: 2015-03-19 13:58:17 +0100
From: Richard Hughes <<richard.monetdb>>

Build is Oct2014 e58372859532

I got this crash at shutdown:

(gdb) bt
#0  0x00007f4dd2eff44a in pthread_join (threadid=4294967296,
    thread_return=thread_return@entry=0x0) at pthread_join.c:47
#1  0x00007f4dd4084c36 in MT_join_thread (t=<optimized out>)
    at gdk_system.c:656
#2  0x00007f4dd3ff9014 in GDKexit (status=status@entry=0) at gdk_utils.c:1230
#3  0x00007f4dd45122dc in mal_exit () at mal.c:215
#4  <signal handler called>
#5  0x00007f4dd2c2bf23 in select () at ../sysdeps/unix/syscall-template.S:81
#6  0x00007f4dd4085c09 in MT_sleep_ms (ms=<optimized out>) at gdk_posix.c:1096
#7  0x00000000004026da in main (argc=24, av=0x0) at mserver5.c:656

Note the absurd value for threadid.

Dumping out the memory around GDKvmtrim_id (which is the global variable that absurd value was derived from):

(gdb) x/10a &GDKvmtrim_id-4
0x7f4dd44b1c28 <THRerrorcount+40>: 0x0 0x0
0x7f4dd44b1c38 <THRerrorcount+56>: 0x1 0x100000000
0x7f4dd44b1c48 <GDKvmtrim_id>: 0x100000001 0xffffff8a2e450000
0x7f4dd44b1c58 <GDK_mallocedbytes_estimate>: 0x11546c6f9 0xe31863
0x7f4dd44b1c68 : 0x100000001 0x1
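
For readers following the arithmetic, here is a small sketch of how the addresses in the dump relate to array slots. It assumes the elements of THRerrorcount are 4 bytes wide and that the machine is little-endian x86-64; neither is stated in the report. The helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: which slot an address would correspond to if the
 * array were (wrongly) indexed past its end. elem_size is an assumption. */
static size_t slot_index(uint64_t array_base, uint64_t addr, size_t elem_size)
{
    assert(elem_size != 0 && addr >= array_base);
    return (size_t)((addr - array_base) / elem_size);
}

/* Split an 8-byte word into its two 4-byte halves. */
static uint32_t low_half(uint64_t v)  { return (uint32_t)(v & 0xffffffffu); }
static uint32_t high_half(uint64_t v) { return (uint32_t)(v >> 32); }
```

The dump labels 0x7f4dd44b1c28 as THRerrorcount+40, which puts the start of the array at 0x7f4dd44b1c00, and GDKvmtrim_id sits at 0x7f4dd44b1c48, 72 bytes further on. Under the 4-byte assumption that is slot 18, with slot 19 covering the upper half of the same word. The threadid passed to pthread_join, 4294967296, is 0x100000000: an upper half of 1 and a lower half of 0.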

The first thing I looked at was the purpose of THRerrorcount and whether it could be overrun and, sure enough, it looks very suspicious to me:

gdk_utils.c:
doGDKaddbuf(const char *prefix, const char *message, size_t messagelen, const char *suffix)
...
THRerrorcount[THRgettid()]++;

Can I ask somebody to eyeball this and see if you agree with my analysis? I'm running a 24-core machine, so an overrun of a 16-element array is very doable. I think this may also be why mserver5 has sometimes completely stopped doing anything and, upon attaching a debugger, I found that most of the threads had mysteriously disappeared: overwriting GDKstopped would cause this to happen.
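
The suspected mechanism can be imitated in a few lines. This is only a sketch under assumptions: the struct below is a hypothetical stand-in for the layout the dump suggests (a 16-slot array of 4-byte counters, one unidentified 8-byte word, then the thread id at byte offset 72), and it assumes a little-endian machine. In the real server the layout is chosen by the linker, and the real out-of-bounds write is undefined behaviour; the sketch copies bytes with memcpy inside one object so that the demonstration itself stays well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NSLOTS 16 /* array size quoted in the report */

/* Hypothetical layout, inferred from the gdb dump, not taken from the source. */
struct globals {
    int32_t errorcount[NSLOTS]; /* bytes 0..63 */
    uint64_t unknown;           /* bytes 64..71, unidentified in the dump */
    uint64_t vmtrim_id;         /* bytes 72..79 */
};

/* Increment the 4-byte slot number tid, counted from the start of
 * errorcount, without checking tid against NSLOTS. */
static void unchecked_increment(struct globals *g, size_t tid)
{
    unsigned char *base = (unsigned char *)g;
    int32_t v;

    assert((tid + 1) * sizeof v <= sizeof *g); /* stay inside the demo object */
    memcpy(&v, base + tid * sizeof v, sizeof v);
    v++;
    memcpy(base + tid * sizeof v, &v, sizeof v);
}

/* Start from all zeroes, record one error for thread tid, and return
 * what the neighbouring thread id looks like afterwards. */
static uint64_t simulate(size_t tid)
{
    struct globals g;

    memset(&g, 0, sizeof g);
    unchecked_increment(&g, tid);
    return g.vmtrim_id;
}
```

With this layout, simulate(19) turns a zero thread id into 0x100000000, the same value that pthread_join received in the backtrace, while any tid below 16 leaves it untouched.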

FYI, nobody ever reads THRerrorcount so the obvious fix is to remove it entirely.

Comment 20732

Date: 2015-03-20 16:26:38 +0100
From: @sjoerdmullender

It does indeed look very suspicious. I'll remove this THRerrorcount completely.

Comment 20733

Date: 2015-03-20 16:27:54 +0100
From: MonetDB Mercurial Repository <>

Changeset 346077203679 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=346077203679

Changeset description:

Remove THRerrorcount.  It serves no useful purpose that I can see.
And this may well fix bug #3688.

Comment 20735

Date: 2015-03-20 16:54:37 +0100
From: Richard Hughes <<richard.monetdb>>

Thanks.

BTW, while you were fixing it, I was confirming my theory by adding a printf to doGDKaddbuf which displays the value of THRgettid(). It is indeed easy to provoke a simple overrun of that array by causing a division by zero error:

create table bar as select cast(0 as int) as value with data;
select 1/0 from bar;

The highest value I've seen for gdk_nr_threads=24 is 26. This bug will definitely explain a load of other random glitches I've seen but never told you about (because I've been unable to reproduce).
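
For anyone who wants to watch for this kind of overrun on an unpatched build, a guarded counter along the following lines would report out-of-range thread ids instead of writing past the array. This is purely a hypothetical diagnostic sketch, not MonetDB code and not the fix that was committed (the fix removed THRerrorcount). All names are made up for illustration, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

#define ERRCOUNT_SLOTS 16 /* array size quoted in the report */

static int errcount[ERRCOUNT_SLOTS]; /* hypothetical stand-in for THRerrorcount */
static int max_tid_seen = -1;        /* highest thread id observed so far */
static int rejected = 0;             /* increments that would have overrun */

/* Count an error for thread tid. Returns 1 if the slot was incremented,
 * 0 if tid was out of range and the write was skipped. */
static int count_error(int tid)
{
    if (tid > max_tid_seen)
        max_tid_seen = tid;
    if (tid < 0 || tid >= ERRCOUNT_SLOTS) {
        rejected++;
        return 0;
    }
    errcount[tid]++;
    return 1;
}
```

With the ids observed above, count_error(26) would return 0 and bump rejected rather than touching whatever follows the array.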
