mserver5 --set gdk_nr_threads=2 --forcemito: deadlock during first SQL client connect on virgin (empty) DB #2865
Last updated: 2011-09-30 10:58:41 +0200
Date: 2011-08-22 21:22:14 +0200
When starting a 2-threaded mserver5 (e.g., on a 2-core machine or with --set gdk_nr_threads=2) with --forcemito (as Mtest.py does by default) on a virgin (empty, not yet existing) database,
It works fine without --forcemito, or single-threaded (--set gdk_nr_threads=1), or with more than 2 threads (also on a 2-core machine), or with MAL (mclient -lmal).
This might be the cause of the first test per directory timing out on our only two 2-core machines: "machoke" & "macbeth".
Date: 2011-08-22 21:25:03 +0200
The same happens on my dual-core x86_64 laptop running 64-bit Fedora 15;
Date: 2011-08-22 21:31:51 +0200
This problem did not occur in the Apr2011 branch.
Date: 2011-08-22 21:58:04 +0200
Created attachment 66
The attached GDB log shows that 3 threads, one "DFLOWscheduler", two "runDFLOWworker" appear to be hanging on the same MT_sema_down() (-> sem_wait()) in
1 0x00007ffff6ceca38 in q_dequeue (q=0x7fffdc1b43d8) at /net/rig.ins.cwi.nl/export/scratch0/manegold/Monet/HG/Aug2011/source/MonetDB/monetdb5/mal/mal_interpreter.mx:853
Date: 2011-08-23 11:30:45 +0200
The very same problem also occurs with the latest default branch
Date: 2011-08-23 11:39:02 +0200
(At least) since this obscures our nightly testing, I consider this a "major" problem with "high" priority.
Date: 2011-08-23 16:57:53 +0200
Please note that the two worker threads each wait on a semaphore in a q_dequeue from the same empty "todo" queue, and the scheduler thread waits on a semaphore in a q_dequeue from the empty "done" queue, while no (other) thread is doing any work, let alone inserting anything into either of these empty queues.
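The blocked state described above can be modelled in a few lines. This is a minimal Python sketch, not MonetDB code: `SemQueue` and `simulate_hang` are hypothetical names standing in for the semaphore-guarded queues behind q_dequeue, and the timeout is an assumption added only so that the demonstration reports the hang instead of hanging itself.

```python
import threading
from collections import deque


class SemQueue:
    """Hypothetical stand-in for a semaphore-guarded queue (cf. q_dequeue)."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        self._sema = threading.Semaphore(0)  # counts the available items

    def enqueue(self, item):
        with self._lock:
            self._items.append(item)
        self._sema.release()

    def dequeue(self, timeout):
        # A plain sem_wait() would block forever on an empty queue; the
        # timeout is an assumption of this sketch so that it terminates.
        if not self._sema.acquire(timeout=timeout):
            return None
        with self._lock:
            return self._items.popleft()


def simulate_hang(timeout=0.2):
    """Return the names of all threads that found nothing to dequeue."""
    todo, done = SemQueue(), SemQueue()
    blocked = []
    blocked_lock = threading.Lock()

    def wait_on(name, queue):
        if queue.dequeue(timeout) is None:
            with blocked_lock:
                blocked.append(name)

    threads = [
        threading.Thread(target=wait_on, args=("worker-1", todo)),
        threading.Thread(target=wait_on, args=("worker-2", todo)),
        threading.Thread(target=wait_on, args=("scheduler", done)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(blocked)
```

Because nothing is ever placed in either queue, all three threads time out. With an untimed wait, as in the GDB log, none of them would ever return.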
Date: 2011-08-23 21:16:11 +0200
It seems that exactly these two machines also show a large number of orphaned processes (mserver5's) left behind after testing.
Date: 2011-08-23 21:29:40 +0200
Most likely a double execution of the same instruction (sql.bind), because the two workers do not use a lock around the list of pending instructions.
The code fragment where it happens was located using gdb: break in DFLOWscheduler and call MDBlist(cntxt,mb,0,0) to see where execution is.
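To illustrate the suspected failure mode, here is a minimal Python sketch of workers that claim entries from a shared list of pending instructions under a lock, so that no instruction is picked up twice. It is an assumption-laden illustration, not the MonetDB dataflow code: `run_workers` and the instruction names are hypothetical.

```python
import threading


def run_workers(instructions, n_workers=2):
    """Execute each pending instruction exactly once across n_workers threads."""
    lock = threading.Lock()
    next_index = 0  # index of the next unclaimed instruction
    executed = []

    def claim():
        # Claiming under the lock is what prevents two workers from
        # picking the same instruction.
        nonlocal next_index
        with lock:
            if next_index >= len(instructions):
                return None
            idx = next_index
            next_index += 1
            return idx

    def worker():
        while True:
            idx = claim()
            if idx is None:
                return
            with lock:
                executed.append(instructions[idx])  # stands in for real work

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed
```

Without the lock in `claim`, two workers could read the same index before either advances it, which is the kind of double execution suspected here.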
Date: 2011-08-23 22:35:38 +0200
Aside from the small chance of concurrency conflict, it turns out that sql.init() was called twice. Both use the same MAL plan and the first execution correctly terminates. The second call hangs immediately upon entering the first dataflow block.
The second call of sql.init() was triggered by --dbinit="sql.start();". After removing it from the command line, all worked well (with some additional locks
Date: 2011-08-23 22:44:22 +0200
For completeness, the dataflow scheduler thinks the last bind is the next instruction to be executed.
(gdb) call MDBlist(fs.cntxt, flow->status.mb,0,0)
The structures have now been protected with the global mal_contextlock, which did not resolve the issue; this points towards an error during the erroneous double sql.init.
Date: 2011-08-24 00:16:30 +0200
At the risk of causing more confusion rather than contributing useful information, but maybe this might be relevant and/or helpful:
Breakpoint 1, DFLOWscheduler (flow=0x7fffdc1abd68) at /net/rig.ins.cwi.nl/export/scratch0/manegold/Monet/HG/Aug2011/source/MonetDB/monetdb5/mal/mal_interpreter.mx:1254
Date: 2011-08-27 00:53:26 +0200
For what it's worth:
while everything seems to work fine when mserver5 is started with --forcemito and
Date: 2011-08-27 01:01:07 +0200
(the 12-core machine is scilens08, not scilens12)
Date: 2011-08-27 08:09:34 +0200
It consistently hangs while reading the file 15_history.sql.
Date: 2011-08-27 08:38:48 +0200
The system hangs on any of the update statements in 15_history.sql during the database creation stage. Taking the update statements out and applying them immediately after the scripts have been loaded does not give rise to the
It all points towards a race condition in the SQL catalog initialization.
Date: 2011-08-27 09:41:25 +0200
Nice catch. Thanks!
This patch seems to work as a temporary and local work-around:
diff -r 816d0923209e sql/sql/15_history.sql
+-- Temporary and locally disable mitosis to prevent yet undiscoved deadlock;
+-- cf., top of this script
Date: 2011-08-27 21:27:29 +0200
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=0f8fa1998fee
Date: 2011-09-20 21:22:50 +0200
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=9d839b4fd018
Date: 2011-09-21 11:08:39 +0200
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=02c039857104
Date: 2011-09-30 10:58:41 +0200
Released in Aug2011-SP1