Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mserver5 --set gdk_nr_threads=2 --forcemito: deadlock during first SQL client connect on virgin (empty) DB #2865

monetdb-team opened this issue Nov 30, 2020 · 0 comments


Copy link

@monetdb-team monetdb-team commented Nov 30, 2020

Date: 2011-08-22 21:22:14 +0200
From: @drstmane
To: SQL devs <>
Version: 11.5.1 (Aug2011) [obsolete]
CC: @mlkersten

Last updated: 2011-09-30 10:58:41 +0200

Comment 16139

Date: 2011-08-22 21:22:14 +0200
From: @drstmane

When staring a 2-threaded mserver5 (e.g., on a 2-core machine or with --set gdk_nr_threads=2) with --forcemito (as does by default) on a virgin (empty, not yet existing) database,
the server appears to deadlock on the first SQL client (mclient [-lsql]) connect,
e.g., possibly during SQL catalog creation or alike.

It works fine without --forcemito, or single-threaded (--set gdk_nr_threads=1), or with more than 2 threads (also on a 2-core machine), or with MAL (mclient -lmal).

This might be the cause of the first test per directory timing out on our only two 2-core machines: "machoke" & "macbeth".

Comment 16140

Date: 2011-08-22 21:25:03 +0200
From: @drstmane

The same happens on my dual-core x86_64 laptop running 64-bit Fedora 15;
macbeth runs Gentoo on dual-core PowerPC;
machoke runs FreeBSD on dual-core x86_64.

Comment 16141

Date: 2011-08-22 21:31:51 +0200
From: @drstmane

This problem did not occur in the Apr2011 branch.

Comment 16142

Date: 2011-08-22 21:58:04 +0200
From: @drstmane

Created attachment 66
GDB log

The attached GDB log shows that 3 threads, one "DFLOWscheduler", two "runDFLOWworker" appear to be hanging on the same MT_sema_down() (-> sem_wait()) in

1 0x00007ffff6ceca38 in q_dequeue (q=0x7fffdc1b43d8) at /net/
853 MT_sema_down(&q->s, "q_dequeue");
(gdb) li
848 static void *
849 q_dequeue(queue *q)
850 {
851 void r = NULL;
853 MT_sema_down(&q->s, "q_dequeue");
854 MT_lock_set(&q->l, "q_dequeue");
855 assert(q->last > 0);
856 /
LIFO favors garbage collection */
857 r = q->data[--q->last];

Attached file: gdb.log (text/plain, 16359 bytes)
Description: GDB log

Comment 16143

Date: 2011-08-23 11:30:45 +0200
From: @drstmane

The very same problem also occurs with the latest default branch
(1f711d88786c 41529 default tip).

Comment 16144

Date: 2011-08-23 11:39:02 +0200
From: @drstmane

(At least) since this obscures our nightly testing, I consider this a "major" problem with "high" priority.

Comment 16145

Date: 2011-08-23 16:57:53 +0200
From: @drstmane

Please note that the two worker thread semwait in a q_dequeue from the same empty "todo" queue, and the scheduler thread semwaits in a q_dequeue from the empty "done" queue, while no (other) thread is doing any work, let alone inserting anything in either of these empty queues.

Comment 16146

Date: 2011-08-23 21:16:11 +0200
From: @grobian

it seems that exactly these two machines show also a large amount of orphan proceses (mserver5's) left behind after testing

Comment 16147

Date: 2011-08-23 21:29:40 +0200
From: @mlkersten

Most likely a double execution of the same instruction (sql.bind), because the two workers do not use a lock around the list of pending instructions.
This potentially means that a worker and the scheduler, or two workers are picking the same instruction for execution.

The code fragment where it happens is localized using gdb, break in DFLOW scheduler and using MDBlist(cntxt,mb,0,0) to localize where execution is.

 X_88:bat[:oid,:str]  := sql.bind(X_1,"sys","_tables","name",0,0,2);
 X_104 := algebra.uselect(X_88,"callhistory":str);
 X_86:bat[:oid,:str]  := sql.bind(X_1,"sys","_tables","name",2,0,2);

Comment 16148

Date: 2011-08-23 22:35:38 +0200
From: @mlkersten

Aside from the small chance of concurrency conflict, it turns out that sql.init() was called twice. Both use the same MAL plan and the first execution correctly terminates. The second call hangs immediately upon entering the first dataflow block.

The second call of sql.init() was triggered by --dbinit="sql.start();", after removing it from the command line, all worked well (with some additional locks
around the shared structures)

Comment 16149

Date: 2011-08-23 22:44:22 +0200
From: @mlkersten

For completeness, the dataflow scheduler thinks the last bind is the next instruction to be executed.

(gdb) call MDBlist(fs[0].cntxt, flow->status[0].mb,0,0)
function user.main():void;
X_11 := nil:bat[:oid,:oid];
X_43 := nil:int;
X_197 := nil:wrd;
X_191 := nil:bat[:oid,:bit];
barrier X_204 := language.dataflow();
X_1 := sql.mvc();
X_88:bat[:oid,:str] := sql.bind(X_1,"sys","_tables","name",0,0,2);
X_104 := algebra.uselect(X_88,"queryhistory":str);
X_86:bat[:oid,:str] := sql.bind(X_1,"sys","_tables","name",2,0,2);

the structures have been protected now with the global mal_contextlock, which did not resolve the issue, pointing into the direction of an error during the erroneous double sql.init.

Comment 16150

Date: 2011-08-24 00:16:30 +0200
From: @drstmane

At the risk of causing more confusion rather than contributing useful information, but maybe this might be relevant and/or helpful:

  • neither nor I use --dbinit="sql.start();", i.e., the problem also occurs without

  • in my case, SQLinitEnvironment() (the implementation of sql.init()) is not executed at all before the server deadlocks

  • my MonetDB does not show any MAL code:

Breakpoint 1, DFLOWscheduler (flow=0x7fffdc1abd68) at /net/
1254 {
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-1.fc14.1.x86_64 cyrus-sasl-lib-2.1.23-12.fc14.x86_64 geos-3.2.1-3.fc14.x86_64 glibc-2.12.90-21.x86_64 keyutils-libs-1.2-6.fc12.x86_64 krb5-libs-1.8.4-2.fc14.x86_64 libcom_err-1.41.12-6.fc14.x86_64 libcurl-7.21.0-8.fc14.x86_64 libgcc-4.5.1-4.fc14.x86_64 libidn-1.18-1.fc14.x86_64 libselinux-2.0.96-6.fc14.1.x86_64 libssh2-1.2.4-1.fc14.x86_64 libstdc++-4.5.1-4.fc14.x86_64 libxml2-2.7.7-3.fc14.x86_64 ncurses-libs-5.7-9.20100703.fc14.x86_64 nspr-4.8.8-1.fc14.x86_64 nss-3.12.10-1.fc14.x86_64 nss-softokn-freebl-3.12.10-1.fc14.x86_64 nss-util-3.12.10-1.fc14.x86_64 openldap-2.4.23-4.fc14.x86_64 openssl-1.0.0d-1.fc14.x86_64 pcre-8.10-1.fc14.x86_64 readline-6.1-2.fc14.x86_64 zlib-1.2.5-2.fc14.x86_64
(gdb) n
1255 int queued = 0, oldq = 0, last;
1256 int i,pc = 0;
1261 int todo = flow->stop - flow->start;
1262 str ret = MAL_SUCCEED;
1263 FlowStatus fs, f = 0;
1265 if (todo == 0)
1268 fs = flow->status;
1275 if (fs[0].cntxt->flags & timerFlag)
(gdb) call MDBlist(fs[0].cntxt, flow->status[0].mb,0,0)
$1 = 0x0

Comment 16172

Date: 2011-08-27 00:53:26 +0200
From: @drstmane

For what it's worth:
On both a 4 core machine (my desktop) and a 12 core machine (scilens12),
the deadlocks repeatedly occur when mserver5 is started with --forcemito and
2, 5, 6, 7, 9, 12, 13, 14, 15, 16, 17, 21, 22, 23 threads;

while everything seems to work fine when mserver5 is started with --forcemito and
1, 3, 4, 8, 10, 11, 18, 19, 20, 24 threads.

Comment 16173

Date: 2011-08-27 01:01:07 +0200
From: @drstmane

(the 12-core machine is scilens08, not scilens12)

Comment 16174

Date: 2011-08-27 08:09:34 +0200
From: @mlkersten

It consistently hangs on reading the file 15_history.sql

Comment 16175

Date: 2011-08-27 08:38:48 +0200
From: @mlkersten

The system hangs on any of the update statements in 15_history.sql during database creation stage. Taking the update statements out and immediately applying them after the scripts have been loaded does not give rise to the
reported problem.

It all points towards a race condition in the SQL catalog initalization.

Comment 16176

Date: 2011-08-27 09:41:25 +0200
From: @drstmane

Nice catch. Thanks!

This patch seems to work as a temporary and local work-around:

diff -r 816d0923209e sql/sql/15_history.sql
--- a/sql/sql/15_history.sql Sat Aug 27 08:15:25 2011 +0200
+++ b/sql/sql/15_history.sql Sat Aug 27 09:37:55 2011 +0200
@@ -23,6 +23,13 @@
-- Whenever a query is compiled and added to the cache, it is also entered
-- into the 'queryHistory' table using a hardwired call to 'keepQuery'.

+-- Temporary and locally disable mitosis to prevent yet undiscoved deadlock;
+-- cf., bug #2865 (
+declare dft_opt string;
+set dft_opt = ( select optimizer );
+set optimizer = 'no_mitosis_pipe';
+-- cf., set optimizer = dft_opt at the end of this script
create table queryHistory(
id wrd primary key,
defined timestamp, -- when entered into the cache
@@ -110,4 +117,7 @@
delete from queryHistory;

+-- cf., top of this script
+set optimizer = dft_opt;
-- set history=true;

Comment 16177

Date: 2011-08-27 21:27:29 +0200
From: @drstmane

Changeset 0f8fa1998fee made by Stefan Manegold in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=0f8fa1998fee

Changeset description:

disable mitosis in 15_history.sql to prevent yet undiscoved deadlock;

this is merely a temporary local work-around for nightly testing
until bug #2865 is eventually fixed

Comment 16313

Date: 2011-09-20 21:22:50 +0200
From: @sjoerdmullender

Changeset 9d839b4fd018 made by Sjoerd Mullender in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=9d839b4fd018

Changeset description:

When clearing variables also clear eolife field.
Not clearing the field could cause a bogus dependency in the dataflow
scheduler which in turn could cause a circular dependecy among
This fixes bug #2865 and the server timeouts of bug #2888.
(The problem that has with all the hanging servers is not
solved but may not be a problem anymore.)

Comment 16315

Date: 2011-09-21 11:08:39 +0200
From: @sjoerdmullender

Changeset 02c039857104 made by Sjoerd Mullender in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=02c039857104

Changeset description:

Add assertions about inter-instruction dependencies.
Instructions should only depend on earlier instructions.
These asserts would probably have made finding the problem with bugs
2865 and 2888 a lot easier...

Comment 16347

Date: 2011-09-30 10:58:41 +0200
From: @grobian

Released in Aug2011-SP1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant