Deadlock on insertion #3898

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments

Comments

@monetdb-team monetdb-team commented Nov 30, 2020

Date: 2015-12-29 16:16:25 +0100
From: anthonin.bonnefoy
To: SQL devs <>
Version: 11.21.11 (Jul2015-SP1)
CC: frederic.jolliton+monetdb, @mlkersten, @njnes, @yzchang

Last updated: 2017-12-14 14:45:59 +0100

Comment 21698

Date: 2015-12-29 16:16:25 +0100
From: anthonin.bonnefoy

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0 Iceweasel/43.0
Build Identifier:

A deadlock appeared on one of our test machines.
We are based on commit eba30f4d7406e70c2918458a1d5a1c4da05a81fb from branch rel-Jul2015.

Here is the backtrace:

Thread 9 (Thread 0x7fc8598d4700 (LWP 23443)):
0 0x00007fc85b7862b3 in select () at ../sysdeps/unix/syscall-template.S:81
1 0x00007fc85cbb60b9 in MT_sleep_ms () from libbat.so.12
2 0x00007fc85d0e31cf in profilerHeartbeat () from libmonetdb5.so.19
3 0x00007fc85ba580a4 in start_thread (arg=0x7fc8598d4700) at pthread_create.c:309
4 0x00007fc85b78d04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 (Thread 0x7fc8588b8700 (LWP 23445)):
0 0x00007fc85b7862b3 in select () at ../sysdeps/unix/syscall-template.S:81
1 0x00007fc85d19b263 in SERVERlistenThread () from libmonetdb5.so.19
2 0x00007fc85cbb4823 in thread_starter () from libbat.so.12
3 0x00007fc85ba580a4 in start_thread (arg=0x7fc8588b8700) at pthread_create.c:309
4 0x00007fc85b78d04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7fc851127700 (LWP 23699)):
0 0x00007fc85b7862b3 in select () at ../sysdeps/unix/syscall-template.S:81
1 0x00007fc85cbb60b9 in MT_sleep_ms () from libbat.so.12
2 0x00007fc8594137a5 in store_manager () from monetdb5/lib_sql.so
3 0x00007fc8593aa079 in mvc_logmanager () from monetdb5/lib_sql.so
4 0x00007fc85cbb4823 in thread_starter () from libbat.so.12
5 0x00007fc85ba580a4 in start_thread (arg=0x7fc851127700) at pthread_create.c:309
6 0x00007fc85b78d04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7fc81e788700 (LWP 21432)):
0 0x00007fc85ba594db in pthread_join (threadid=140497371657984, thread_return=0x0) at pthread_join.c:92
1 0x00007fc85cbb4a2b in MT_create_thread () from libbat.so.12
2 0x00007fc85cb9bdb8 in BATmsync () from libbat.so.12
3 0x00007fc8594275b8 in tr_update_delta.isra () from monetdb5/lib_sql.so
4 0x00007fc859427f14 in update_table () from monetdb5/lib_sql.so
5 0x00007fc859412452 in rollforward_update_table () from monetdb5/lib_sql.so
6 0x00007fc85941ce6b in rollforward_update_schema () from monetdb5/lib_sql.so
7 0x00007fc85941d821 in sql_trans_commit () from monetdb5/lib_sql.so
8 0x00007fc8593aa670 in mvc_commit () from monetdb5/lib_sql.so
9 0x00007fc8593069c1 in SQLtransaction () from monetdb5/lib_sql.so
10 0x00007fc85d0cc2c0 in runMALsequence () from libmonetdb5.so.19
11 0x00007fc85d0cdb3d in runMAL () from libmonetdb5.so.19
12 0x00007fc85931e5e3 in SQLengineIntern () from monetdb5/lib_sql.so
13 0x00007fc85d0e9817 in runScenarioBody () from libmonetdb5.so.19
14 0x00007fc85d0ea55d in runScenario () from libmonetdb5.so.19
15 0x00007fc85d0eaa88 in MSserveClient () from libmonetdb5.so.19
16 0x00007fc85d0eb0ef in MSscheduleClient () from libmonetdb5.so.19
17 0x00007fc85d19bda6 in doChallenge () from libmonetdb5.so.19
18 0x00007fc85cbb4823 in thread_starter () from libbat.so.12
19 0x00007fc85ba580a4 in start_thread (arg=0x7fc81e788700) at pthread_create.c:309
20 0x00007fc85b78d04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7fc817edf700 (LWP 21623)):
0 0x00007fc85b7862b3 in select () at ../sysdeps/unix/syscall-template.S:81
1 0x00007fc85cbb60b9 in MT_sleep_ms () from libbat.so.12
2 0x00007fc859413acd in store_lock () from monetdb5/lib_sql.so
3 0x00007fc8593aaae2 in mvc_create () from monetdb5/lib_sql.so
4 0x00007fc85931b47c in SQLinitClient () from monetdb5/lib_sql.so
5 0x00007fc85d0e9706 in runScenarioBody () from libmonetdb5.so.19
6 0x00007fc85d0ea55d in runScenario () from libmonetdb5.so.19
7 0x00007fc85d0eaa88 in MSserveClient () from libmonetdb5.so.19
8 0x00007fc85d0eb0ef in MSscheduleClient () from libmonetdb5.so.19
9 0x00007fc85d19bda6 in doChallenge () from libmonetdb5.so.19
10 0x00007fc85cbb4823 in thread_starter () from libbat.so.12
11 0x00007fc85ba580a4 in start_thread (arg=0x7fc817edf700) at pthread_create.c:309
12 0x00007fc85b78d04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fc81f91c700 (LWP 5271)):
0 0x00007fc85b7862b3 in select () at ../sysdeps/unix/syscall-template.S:81
1 0x00007fc85cbb60b9 in MT_sleep_ms () from libbat.so.12
2 0x00007fc85931afc5 in SQLinitClient () from monetdb5/lib_sql.so
3 0x00007fc85d0e9706 in runScenarioBody () from libmonetdb5.so.19
4 0x00007fc85d0ea55d in runScenario () from libmonetdb5.so.19
5 0x00007fc85d0eaa88 in MSserveClient () from libmonetdb5.so.19
6 0x00007fc85d0eb0ef in MSscheduleClient () from libmonetdb5.so.19
7 0x00007fc85d19bda6 in doChallenge () from libmonetdb5.so.19
8 0x00007fc85cbb4823 in thread_starter () from libbat.so.12
9 0x00007fc85ba580a4 in start_thread (arg=0x7fc81f91c700) at pthread_create.c:309
10 0x00007fc85b78d04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7fc81eb15700 (LWP 5308)):
0 0x00007fc85b7862b3 in select () at ../sysdeps/unix/syscall-template.S:81
1 0x00007fc85cbb60b9 in MT_sleep_ms () from libbat.so.12
2 0x00007fc85931afc5 in SQLinitClient () from monetdb5/lib_sql.so
3 0x00007fc85d0e9706 in runScenarioBody () from libmonetdb5.so.19
4 0x00007fc85d0ea55d in runScenario () from libmonetdb5.so.19
5 0x00007fc85d0eaa88 in MSserveClient () from libmonetdb5.so.19
6 0x00007fc85d0eb0ef in MSscheduleClient () from libmonetdb5.so.19
7 0x00007fc85d19bda6 in doChallenge () from libmonetdb5.so.19
8 0x00007fc85cbb4823 in thread_starter () from libbat.so.12
9 0x00007fc85ba580a4 in start_thread (arg=0x7fc81eb15700) at pthread_create.c:309
10 0x00007fc85b78d04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7fc8210ea700 (LWP 18128)):
0 0x00007fc85b7862b3 in select () at ../sysdeps/unix/syscall-template.S:81
1 0x00007fc85cbb60b9 in MT_sleep_ms () from libbat.so.12
2 0x00007fc85931afc5 in SQLinitClient () from monetdb5/lib_sql.so
3 0x00007fc85d0e9706 in runScenarioBody () from libmonetdb5.so.19
4 0x00007fc85d0ea55d in runScenario () from libmonetdb5.so.19
5 0x00007fc85d0eaa88 in MSserveClient () from libmonetdb5.so.19
6 0x00007fc85d0eb0ef in MSscheduleClient () from libmonetdb5.so.19
7 0x00007fc85d19bda6 in doChallenge () from libmonetdb5.so.19
8 0x00007fc85cbb4823 in thread_starter () from libbat.so.12
9 0x00007fc85ba580a4 in start_thread (arg=0x7fc8210ea700) at pthread_create.c:309
10 0x00007fc85b78d04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7fc85d719780 (LWP 23405)):
0 0x00007fc85b7862b3 in select () at ../sysdeps/unix/syscall-template.S:81
1 0x00007fc85cbb60b9 in MT_sleep_ms () from libbat.so.12
2 0x00000000004029ea in main ()

Thread 6 tries to join thread 5 while holding the store lock, while thread 5 is trying to acquire the store lock.
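The hazard can be illustrated with a minimal sketch (Python here purely for brevity; the real code is C inside MonetDB, and `store_lock`/`new_client` are stand-in names, not MonetDB's API). The parent holds the lock across a join on a child that needs that same lock; a timeout stands in for the unbounded wait that produces the real deadlock:

```python
import threading

store_lock = threading.Lock()   # stand-in for bs_lock / the store lock
progress = []

def new_client():
    # Mimics the new client (thread 5): it needs the store lock to proceed.
    # A timeout stands in for the unbounded spin in the real deadlock.
    if store_lock.acquire(timeout=0.2):
        progress.append("client ran")
        store_lock.release()
    else:
        progress.append("client blocked on store_lock")

# Mimics the committing thread (thread 6): it holds the lock across a join.
store_lock.acquire()
t = threading.Thread(target=new_client)
t.start()
t.join()              # waits for new_client, which waits for our lock
store_lock.release()  # too late: without the timeout, we never get here

print(progress)       # ['client blocked on store_lock']
```

With a real blocking acquire instead of the timeout, `t.join()` never returns and both threads hang, which is exactly the cycle described above.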

Reproducible: Couldn't Reproduce

Comment 21699

Date: 2015-12-29 20:11:33 +0100
From: @mlkersten

This looks rather suspicious.
To nail down a possible issue, can you answer the following questions:

  • did you use a privately compiled version of MonetDB with
    -- assertions enabled
    -- define MSYNC_BACKGROUND

  • do you use some of your own BAT manipulation functions?
    -- BATmsync may only be called for persistent BATs
    -- do you use string bats for this addition?

Comment 21703

Date: 2015-12-30 15:32:19 +0100
From: anthonin.bonnefoy

Okay, more details:

We are on changeset 58103:71a94ea8492e (I forgot we use a git mirror internally, sorry…)

MonetDB has been compiled with vertoo.data removed and the following flags:
./configure --enable-int128 --enable-strict --disable-microhttpd --without-rubygem --enable-optimize
So there are no assertions, and MSYNC_BACKGROUND seems to be hardcoded in gdk_storage.c:648

We do use some custom functions (an aggregation function for concatenating strings with ','), but they should not have been used in this case.

The database was only doing insertions from CSV, with table creations inside the transaction.
Each table creation is followed by a commit as a (not yet removed) workaround for bug #3802.

I managed to load debug symbols in gdb, so here is a better backtrace of the deadlocked threads:

Thread 8 (Thread 0x7f311dbfa700 (LWP 18293)):
0 0x00007f31291714db in pthread_join (threadid=threadid@entry=139848246736640, thread_return=thread_return@entry=0x0) at pthread_join.c:92
1 0x00007f312a2cca2b in join_threads () at gdk_system.c:484
2 MT_create_thread (t=t@entry=0x7f311dbf7630, f=f@entry=0x7f312a2b21e0 , arg=arg@entry=0x7f30f227fd70, d=d@entry=MT_THR_DETACHED) at gdk_system.c:505
3 0x00007f312a2b3db8 in BATmsync (b=b@entry=0x7f30f072d380) at gdk_storage.c:692
4 0x00007f3126b3f5b8 in tr_update_delta (tr=tr@entry=0x7f30f80016d0, obat=0x7f30f078b050, cbat=0x7f30f0369220, unique=) at bat_storage.c:1904
5 0x00007f3126b3ff14 in update_table (tr=0x7f30f80016d0, ft=0x7f30f07e4b40, tt=0x7f30f22b5300) at bat_storage.c:2154
6 0x00007f3126b2a452 in rollforward_update_table (tr=0x7f30f80016d0, ft=0x7f30f07e4b40, tt=0x7f30f22b5300, mode=3) at store.c:2861
7 0x00007f3126b34e6b in rollforward_changeset_updates (mode=3, fd=, rollforward_deletes=0x7f3126b26a90 <rollforward_drop_table>, rollforward_creates=0x7f3126b26580 <rollforward_create_table>,
rollforward_updates=0x7f3126b29ca0 <rollforward_update_table>, b=0x54caa70, ts=0x54caaa0, fs=0x7f30f805a9e0, tr=0x7f30f80016d0) at store.c:2474
8 rollforward_update_schema (tr=0x7f30f80016d0, fs=0x7f30f805a9b0, ts=0x54caa70, mode=3) at store.c:2937
9 0x00007f3126b35821 in rollforward_changeset_updates (mode=, fd=, rollforward_deletes=, rollforward_creates=, rollforward_updates=, b=,
ts=, fs=, tr=) at store.c:2474
10 rollforward_trans (mode=, tr=) at store.c:2959
11 sql_trans_commit (tr=0x7f30f80016d0) at store.c:3372
12 0x00007f3126ac2670 in mvc_commit (m=0x7f30f2ee2330, chain=chain@entry=0, name=name@entry=0x0) at sql_mvc.c:258
13 0x00007f3126a1e9c1 in SQLtransaction (cntxt=0x7f3126ff3330, mb=, stk=, pci=) at sql.c:268
14 0x00007f312a7e42c0 in runMALsequence (cntxt=0x7f30f511b9d0, mb=0x7f30f09ea540, startpc=18396, stoppc=-1, stk=0x7f30f0a88a60, env=0x7f3100000a50, pcicaller=0x0) at mal_interpreter.c:631
15 0x00007f312a7e5b3d in runMAL (cntxt=0x7f3126ff3330, mb=0x7f30f09ea540, mbcaller=0x47dc, mbcaller@entry=0x0, env=0x0) at mal_interpreter.c:365
16 0x00007f3126a365e3 in SQLengineIntern (c=0x7f3126ff3330, be=0x7f30f0e87dc0) at sql_execute.c:408
17 0x00007f312a801817 in runPhase (phase=4, c=0x7f3126ff3330) at mal_scenario.c:515
18 runScenarioBody (c=c@entry=0x7f3126ff3330) at mal_scenario.c:559
19 0x00007f312a80255d in runScenario (c=c@entry=0x7f3126ff3330) at mal_scenario.c:579
20 0x00007f312a802a88 in MSserveClient (dummy=dummy@entry=0x7f3126ff3330) at mal_session.c:439
21 0x00007f312a8030ef in MSscheduleClient (command=command@entry=0x7f30f01cdf30 "", challenge=challenge@entry=0x7f311dbf9e60 "HC3C1LuG2", fin=0x7f30f0568e70, fout=fout@entry=0x7f30f0c51e20) at mal_session.c:319
22 0x00007f312a8b3da6 in doChallenge (data=) at mal_mapi.c:184
23 0x00007f312a2cc823 in thread_starter (arg=) at gdk_system.c:458
24 0x00007f31291700a4 in start_thread (arg=0x7f311dbfa700) at pthread_create.c:309
25 0x00007f3128ea504d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7f30f511b700 (LWP 18396)):
0 0x00007f3128e9e2b3 in select () at ../sysdeps/unix/syscall-template.S:81
1 0x00007f312a2ce0b9 in MT_sleep_ms (ms=ms@entry=1) at gdk_posix.c:1085
2 0x00007f3126b2bacd in store_lock () at store.c:1659
3 0x00007f3126ac2ae2 in mvc_create (clientid=2, stk=stk@entry=0, debug=0, rs=0x7f30fd1476a0, ws=0x7f30fdade3e0) at sql_mvc.c:423
4 0x00007f3126a3347c in SQLinitClient (c=0x7f3126ff9640) at sql_scenario.c:444
5 0x00007f312a801706 in runPhase (phase=5, c=0x7f3126ff9640) at mal_scenario.c:515
6 runScenarioBody (c=c@entry=0x7f3126ff9640) at mal_scenario.c:542
7 0x00007f312a80255d in runScenario (c=c@entry=0x7f3126ff9640) at mal_scenario.c:579
8 0x00007f312a802a88 in MSserveClient (dummy=dummy@entry=0x7f3126ff9640) at mal_session.c:439
9 0x00007f312a8030ef in MSscheduleClient (command=command@entry=0x7f30fd177b40 "P\321r\376\060\177", challenge=challenge@entry=0x7f30f511ae60 "JPz9UODGP", fin=0x7f30fd1476a0, fout=fout@entry=0x7f30fdade3e0) at mal_session.c:319
10 0x00007f312a8b3da6 in doChallenge (data=) at mal_mapi.c:184
11 0x00007f312a2cc823 in thread_starter (arg=) at gdk_system.c:458
12 0x00007f31291700a4 in start_thread (arg=0x7f30f511b700) at pthread_create.c:309
13 0x00007f3128ea504d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 was trying to append some rows to a uuid BAT, from what I understood from obat and cbat in tr_update_delta.

I can provide the core dump along with the binaries with debug symbols if you want.

Comment 21705

Date: 2015-12-31 16:15:56 +0100
From: @njnes

Do you have an SQL script to reproduce this situation? It's not clear why we get into this problem, i.e. a new client hangs on a thread_join done in an older
(long-running) thread. Threads should never try to join an active client thread.
So to find out why, we could use an SQL script, or a more complex one, to test this.

Comment 21776

Date: 2016-02-11 10:20:02 +0100
From: Frédéric Jolliton <<frederic.jolliton+monetdb>>

We hit this exact problem twice in the past 3 days on one of our servers.

It occurred after 48 hours of high activity, so it's hard to figure out how to reproduce it.

We're keeping the server in this state for now to try to further investigate the situation.

Comment 21901

Date: 2016-03-14 15:06:20 +0100
From: @yzchang

Hi Anthonin, Frédéric,

Would you please test if the recent fixes for concurrent transactions have any (hopefully positive) effect on this bug? Thanks!

Comment 22300

Date: 2016-08-17 14:00:29 +0200
From: Frédéric Jolliton <<frederic.jolliton+monetdb>>

FYI, we hit the exact same situation again. We are going to investigate once more how we got there.

In the meantime, we made a tool to get a better view of the thread states and the pending (SQL) queries. The var:desc syntax indicates an attempt to take the lock pointed to by 'var', with the description 'desc' (as seen in the source code). The syntax [sem_wait:14<124790>] indicates that thread 14 (as numbered by GDB) is currently waiting in sem_wait and that its LWP ID is 124790.

We see that:

  • thread 32 is waiting on thread 13 to terminate (its LWP ID, 124863, is shown a few lines later).
  • threads 2 to 12 (partially elided, all related to SQLinitClient), but also 13, 42 and 43, are all waiting for the lock bs_lock to be released.

So something has not released bs_lock.

Threads:
(23267)
|-clone
| -start_thread
| |-DFLOWworker
| | -q_dequeue-+-[sem_wait:14<124790>]
| | [..]
| | -[sem_wait:41<124261>]
| |-profilerHeartbeat
| | -MT_sleep_ms---[select:45<23524>]
| -thread_starter
| |-SERVERlistenThread
| | -[select:44<23527>]
| |-doChallenge
| | -MSscheduleClient
| | -MSserveClient
| | -runScenario
| | -runScenarioBody
| | |-SQLexitClient
| | | -mvc_destroy
| | | -store_lock<bs_lock:trans_lock>
| | | -MT_sleep_ms
| | | -[select:42<15669>]
| | -runPhase
| | |-SQLengineIntern
| | | -runMAL
| | | -runMALsequence
| | | -SQLtransaction
| | | -mvc_commit
| | | -sql_trans_commit
| | | -rollforward_trans
| | | -rollforward_changeset_updates
| | | -rollforward_update_schema
| | | -rollforward_changeset_updates
| | | -rollforward_update_table
| | | -update_table
| | | -tr_update_delta
| | | -BATmsync
| | | -MT_create_thread
| | | -join_threads
| | | -[pthread_join[tid=124863]:32<124753>]
| | |-SQLinitClient
| | | -mvc_create
| | | -store_lock<bs_lock:trans_lock>
| | | -MT_sleep_ms
| | | -[select:13<124863>]
| | -SQLinitClient<sql_contextLock:SQLinitClient>
| | -MT_sleep_ms
| | |-[select:10<32677>]
| | [..]
| | -[select:9<93552>]
| -mvc_logmanager
| -store_manager<bs_lock:store_manager>
| -MT_sleep_ms
| -[select:43<23582>]
`-main---MT_sleep_ms---[select:1<23267>]

To further help diagnosis, the tool also outputs the queries as seen from GDB by examining the process memory:

Queries:
Thread 2 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-16T15:42:41] [Command: 2016-08-16T15:42:41] [Actions: 0] [Total: 0.0s]
Thread 3 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-16T14:41:18] [Command: 2016-08-16T14:41:18] [Actions: 0] [Total: 0.0s]
Thread 4 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-16T10:12:15] [Command: 2016-08-16T10:12:15] [Actions: 0] [Total: 0.0s]
Thread 5 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-16T09:43:28] [Command: 2016-08-16T09:43:28] [Actions: 0] [Total: 0.0s]
Thread 6 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-16T09:41:39] [Command: 2016-08-16T09:41:39] [Actions: 0] [Total: 0.0s]
Thread 7 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-16T09:31:47] [Command: 2016-08-16T09:31:47] [Actions: 0] [Total: 0.0s]
Thread 8 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-16T09:31:02] [Command: 2016-08-16T09:31:02] [Actions: 0] [Total: 0.0s]
Thread 9 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-16T09:30:24] [Command: 2016-08-16T09:30:24] [Actions: 0] [Total: 0.0s]
Thread 10 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-15T10:07:16] [Command: 2016-08-15T10:07:16] [Actions: 0] [Total: 0.0s]
Thread 11 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-14T16:27:36] [Command: 2016-08-14T16:27:36] [Actions: 0] [Total: 0.0s]
Thread 12 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-14T16:27:35] [Command: 2016-08-14T16:27:35] [Actions: 0] [Total: 0.0s]
Thread 13 [sql] [Active: 0] <initclient (5)> [Mode: runclient (2)] [Login: 2016-08-14T16:27:35] [Command: 2016-08-14T16:27:35] [Actions: 0] [Total: 0.0s]
Thread 32 [sql] [Active: 1] <engine (4)> [Mode: runclient (2)] [Login: 2016-08-14T16:27:34] [Command: 2016-08-14T16:27:35] [Actions: 31] [Total: 0.523975s]
'scommit\n;'
Thread 42 [sql] [Active: 0] <reader (0)> [Mode: finishclient (1)] [Login: 2016-08-14T10:05:13] [Command: 2016-08-15T10:07:16] [Actions: 1173] [Total: 120.62709s]
"sselect count(*) from some_table where the_column in ('foo', 'bar', 'baz')"

Thread 42 shows that we use a long-running session (the last command ran 24 hours after the first connection), which is normal.

(Everything is extracted from GDB, since MonetDB is totally stuck at this point.)

Comment 22301

Date: 2016-08-17 15:21:05 +0200
From: Frédéric Jolliton <<frederic.jolliton+monetdb>>

It seems that mvc_commit() takes the lock bs_lock through store_lock(), and that it is still held when reaching pthread_join (but I might be wrong).

Thus, thread 32 is waiting on (joining) a thread (13) that is itself waiting for bs_lock to be freed. Hence the deadlock.

Comment 22358

Date: 2016-09-07 14:42:47 +0200
From: @sjoerdmullender

The problem is not so much that locks are being held when the thread join happens; the problem is that a thread join is done for the wrong thread.

The joins done in join_threads should only be on threads that have died or are very, very close to dying. The thread that is being joined is neither. That is the problem.
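The invariant described above — a reaper should only ever join threads that have announced they are done — can be sketched as follows (a hypothetical illustration in Python, not MonetDB's actual join_threads; the `exited` flag mirrors the field visible in the posthreads dumps in this thread):

```python
import threading
import time

lock = threading.Lock()
posthreads = []   # records of started threads, loosely like gdk_system.c's list

def worker(rec):
    # ... real work would go here ...
    rec["exited"] = True   # announce "joinable" only as the very last step

def spawn():
    rec = {"exited": False}
    t = threading.Thread(target=worker, args=(rec,))
    rec["thread"] = t
    with lock:
        posthreads.append(rec)
    t.start()

def join_threads():
    # Join only threads that flagged themselves as exited; never a live one.
    with lock:
        done = [r for r in posthreads if r["exited"]]
    for r in done:
        r["thread"].join()   # guaranteed to return promptly
        with lock:
            posthreads.remove(r)

for _ in range(4):
    spawn()
# For the demo, wait until every worker has flagged itself.
while not all(r["exited"] for r in list(posthreads)):
    time.sleep(0.01)
join_threads()
print(len(posthreads))  # 0
```

The bug report is precisely that the real list contained records with exited = 0 that were joined anyway, so the join blocked on an active client thread.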

What I would be interested in at this point is the contents of the posthreads linked list (gdk_system.c) and the value of *p inside the function join_threads.

Comment 24646

Date: 2016-11-02 16:57:31 +0100
From: Frédéric Jolliton <<frederic.jolliton+monetdb>>

We got the same issue today on Jul2015-SP4.

We were able to gather some information about posthreads before restarting the database. Unfortunately, the p variable was optimized out, so there is no info about it.

This linked list had the following entries (head first). Indeed, we had 9 mclient processes stuck (all waiting for a lock).

{next = 0x7f62900016a0, tid = 140047619213056, func = 0x7f62b2de2190 , arg = 0x7f6290000d20, exited = 0}
{next = 0x7f6290001510, tid = 140048365336320, func = 0x7f62b2de2190 , arg = 0x7f6290000d20, exited = 0}
{next = 0x7f62900024b0, tid = 140048915719936, func = 0x7f62b2de2190 , arg = 0x7f6290000d20, exited = 0}
{next = 0x7f62900019c0, tid = 140047785166592, func = 0x7f62b2de2190 , arg = 0x7f6290000d20, exited = 0}
{next = 0x7f6290001380, tid = 140047070258944, func = 0x7f62b2de2190 , arg = 0x7f6290000d20, exited = 0}
{next = 0x7f6290001830, tid = 140047432980224, func = 0x7f62b2de2190 , arg = 0x7f6290000d20, exited = 0}
{next = 0x7f6290003e30, tid = 140047411357440, func = 0x7f62b2de2190 , arg = 0x7f6290000d20, exited = 0}
{next = 0x7f6290001060, tid = 140061527021312, func = 0x7f62b2de2190 , arg = 0x7f6290000d20, exited = 0}
{next = 0x7f624da07970, tid = 140047908521728, func = 0x7f62b2de2190 , arg = 0x7f6290000d20, exited = 0}
{next = 0x7f5e61850bf0, tid = 140048917821184, func = 0x7f62b27cc5d0 , arg = 0x7f5e6a5368d0, exited = 0}
{next = 0x7f5ec884d520, tid = 140047370168064, func = 0x7f62b27cc5d0 , arg = 0x7f5e616d05b0, exited = 0}
{next = 0x7f5f7e75f580, tid = 140046909499136, func = 0x7f62b27cc5d0 , arg = 0x7f5f7cf67a30, exited = 0}
{next = 0x7f5f4bfd3110, tid = 140047669192448, func = 0x7f62b27cc5d0 , arg = 0x7f5f7c1c0810, exited = 0}
{next = 0x7f5ec959ecd0, tid = 140047430878976, func = 0x7f62b27cc5d0 , arg = 0x7f5e281cdaa0, exited = 0}
{next = 0x7f5e41b93f70, tid = 140046890411776, func = 0x7f62b27cc5d0 , arg = 0x7f5ec954e470, exited = 0}
{next = 0x7f5e413d0080, tid = 140047641655040, func = 0x7f62b27cc5d0 , arg = 0x7f5e41b52a10, exited = 0}
{next = 0x7f5ec9564220, tid = 140047671293696, func = 0x7f62b27cc5d0 , arg = 0x7f5e4165f2b0, exited = 0}
{next = 0x7f622b865a80, tid = 140048417998592, func = 0x7f62b27cc5d0 , arg = 0x7f5ec886dae0, exited = 0}
{next = 0x7f5e60400510, tid = 140048206907136, func = 0x7f62b27cc5d0 , arg = 0x7f5e60400590, exited = 0}
{next = 0x7f607ec8cb30, tid = 140047179360000, func = 0x7f62b27cc5d0 , arg = 0x7f5e6063c9c0, exited = 0}
{next = 0x7f607e6ebff0, tid = 140047631148800, func = 0x7f62b27cc5d0 , arg = 0x7f607fe0f730, exited = 0}
{next = 0x7f5f4977fe00, tid = 140047910622976, func = 0x7f62b27cc5d0 , arg = 0x7f607c79dd20, exited = 0}
{next = 0x7f5f43361a40, tid = 140047361455872, func = 0x7f62b27cc5d0 , arg = 0x7f5f4bf554f0, exited = 0}
{next = 0x7f5f400c2990, tid = 140047161587456, func = 0x7f62b27cc5d0 , arg = 0x7f5f41158130, exited = 0}
{next = 0x7f5e20147270, tid = 140047066056448, func = 0x7f62b27cc5d0 , arg = 0x7f5f416aaca0, exited = 0}
{next = 0x7f5ebb644180, tid = 140047664666368, func = 0x7f62b27cc5d0 , arg = 0x7f5e201ef620, exited = 0}
{next = 0x7f5e2050b220, tid = 140047163688704, func = 0x7f62b27cc5d0 , arg = 0x7f5ebb283a00, exited = 0}
{next = 0x7f6248368800, tid = 140047042451200, func = 0x7f62b27cc5d0 , arg = 0x7f5e20126390, exited = 0}
{next = 0x7f5f4b5338c0, tid = 140047349458688, func = 0x7f62b27cc5d0 , arg = 0x7f5e307c6410, exited = 0}
{next = 0x7f5e4148a770, tid = 140047026575104, func = 0x7f62b27cc5d0 , arg = 0x7f5f496205e0, exited = 0}
{next = 0x7f5f49fbf020, tid = 140046927636224, func = 0x7f62b27cc5d0 , arg = 0x7f5fee2705c0, exited = 0}
{next = 0x7f5f4bf09550, tid = 140048497760000, func = 0x7f62b27cc5d0 , arg = 0x7f5f4a247ac0, exited = 0}
{next = 0x7f62488d9540, tid = 140047394707200, func = 0x7f62b27cc5d0 , arg = 0x7f5f4b9e1be0, exited = 0}
{next = 0x7f624a9d3810, tid = 140047629047552, func = 0x7f62b27cc5d0 , arg = 0x7f5e310b3210, exited = 0}
{next = 0x7f5f41b72350, tid = 140047054780160, func = 0x7f62b27cc5d0 , arg = 0x7f624b00be00, exited = 0}
{next = 0x7f5f434c13c0, tid = 140048200091392, func = 0x7f62b27cc5d0 , arg = 0x7f5f42d68210, exited = 0}
{next = 0x7f5e7e0f05e0, tid = 140047635351296, func = 0x7f62b27cc5d0 , arg = 0x7f5f4022f1b0, exited = 0}
{next = 0x7f5e9246da10, tid = 140055718897408, func = 0x7f62b27cc5d0 , arg = 0x7f5e41443ce0, exited = 0}
{next = 0x7f624b59c470, tid = 140047542122240, func = 0x7f62b27cc5d0 , arg = 0x7f5e920a1250, exited = 0}
{next = 0x7f5e31d21310, tid = 140047647958784, func = 0x7f62b27cc5d0 , arg = 0x7f624b4c7b20, exited = 0}
{next = 0x7f624a9edef0, tid = 140048020698880, func = 0x7f62b27cc5d0 , arg = 0x7f5e3082ced0, exited = 0}
{next = 0x7f62291ffe40, tid = 140048367437568, func = 0x7f62b27cc5d0 , arg = 0x7f5e310fb360, exited = 0}
{next = 0x7f5ee84bb2c0, tid = 140047398971136, func = 0x7f62b27cc5d0 , arg = 0x7f5e60a16b60, exited = 0}
{next = 0x7f5ebbb73dc0, tid = 140061727397632, func = 0x7f62b27cc5d0 , arg = 0x7f5ee81a3c70, exited = 0}
{next = 0x7f5f3b4d1910, tid = 140048209008384, func = 0x7f62b27cc5d0 , arg = 0x7f5e207c97b0, exited = 0}
{next = 0x7f5f423f81d0, tid = 140046605080320, func = 0x7f62b27cc5d0 , arg = 0x7f5f3b4c87c0, exited = 0}
{next = 0x7f5e31131f10, tid = 140048016369408, func = 0x7f62b27cc5d0 , arg = 0x7f5f432bb130, exited = 0}
{next = 0x7f5f423f6100, tid = 140047797950208, func = 0x7f62b27cc5d0 , arg = 0x7f624bc8a900, exited = 0}
{next = 0x7f5e308faf40, tid = 140048388953856, func = 0x7f62b27cc5d0 , arg = 0x7f5f43ef8050, exited = 0}
{next = 0x7f5e412fe500, tid = 140047535363840, func = 0x7f62b27cc5d0 , arg = 0x7f624b4d8730, exited = 0}
...

We looked further (maybe ~70 more items down the list) and it followed the same pattern ( and exited=0).

The state of the process at the time of this deadlock was:

mdbtools proc -L -p -t
Not using lock map. You can generate lockmap.py with "mdbtool lockmap".
Process mserver5 (PID 25118)
Settings:
db_path = /srv/xxx/mdb/xxx
exe = /path/to/monetdb/bin/mserver5
gdk_nr_threads = 6
mapi_open = false
mapi_port = 0
mapi_usock = /srv/xxx/mdb/xxx/.mapi.sock
max_clients = 64
merovingian_uri = mapi:monetdb://DOMAIN:50000/nova
monet_daemon = yes
monet_vault_key = /srv/xxx/mdb/xxx/.vaultkey
sql_optimizer = default_pipe
Threads:
(25118)
|-clone{clone.S:111}
| -start_thread{pthread_create.c:309}
| |-DFLOWworker{mal_dataflow.c:333}---q_dequeue{mal_dataflow.c:213}
| | |-1*[sem_wait{sem_wait.S:85}:11<28632>]
| | |-1*[sem_wait{sem_wait.S:85}:12<28631>]
| | |-1*[sem_wait{sem_wait.S:85}:13<28628>]
| | |-1*[sem_wait{sem_wait.S:85}:14<28627>]
| | -1*[sem_wait{sem_wait.S:85}:16<28517>]
| |-profilerHeartbeat{mal_profiler.c:976}---MT_sleep_ms{gdk_posix.c:1089}
| | -1*[select{syscall-template.S:81}:19<25119>]
| -thread_starter{gdk_system.c:459}
| |-SERVERlistenThread{mal_mapi.c:234}---1*[select{syscall-template.S:81}:18<25120>]
| |-doChallenge{mal_mapi.c:184}
| | -MSscheduleClient{mal_session.c:319}
| | -MSserveClient{mal_session.c:439}
| | -runScenario{mal_scenario.c:579}
| | -runScenarioBody{mal_scenario.c:559}
| | -runPhase{mal_scenario.c:515}
| | -SQLengineIntern{sql_execute.c:390}
| | -SQLexecutePrepared{sql_execute.c:328}
| | -callMAL{mal_interpreter.c:447}
| | -runMALsequence{mal_interpreter.c:631}
| | |-runMALdataflow{mal_dataflow.c:784}
| | | -MT_sleep_ms{gdk_posix.c:1089}
| | | |-1*[select{syscall-template.S:81}:10<28634>]
| | | |-1*[select{syscall-template.S:81}:2<5042>]
| | | |-1*[select{syscall-template.S:81}:3<5039>]
| | | |-1*[select{syscall-template.S:81}:4<4979>]
| | | |-1*[select{syscall-template.S:81}:5<4110>]
| | | |-1*[select{syscall-template.S:81}:6<2170>]
| | | |-1*[select{syscall-template.S:81}:7<19210>]
| | | |-1*[select{syscall-template.S:81}:8<18598>]
| | | -1*[select{syscall-template.S:81}:9<28650>]
| | -runMALdataflow{mal_dataflow.c:825}
| | -MT_create_thread{gdk_system.c:505}
| | -join_threads{gdk_system.c:484}
| | -1*[pthread_join{pthread_join.c:92}[tid=28634]:15<28624>]
| -mvc_logmanager{sql_mvc.c:133}---store_manager{store.c:1630}
| -MT_sleep_ms{gdk_posix.c:1089}
| -1*[select{syscall-template.S:81}:17<26083>]
`-main{mserver5.c:694}---MT_sleep_ms{gdk_posix.c:1089}---1*[select{syscall-template.S:81}:1<25118>]

The value between < > indicates the LWP ID, following the thread ID (1 to 19). The 1* part indicates how many threads were at the given backtrace (since we expanded everything, all the values are actually 1*).

This shows that only 19 threads were running. (Not sure if posthreads is supposed to track that).

Comment 24885

Date: 2017-01-12 17:17:28 +0100
From: Frédéric Jolliton <<frederic.jolliton+monetdb>>

We got another occurrence of this bug (on Jul2015SP4).

Again, the `p' local variable was optimized away.

The posthreads list had the following entries:

{next = 0x7f1cb4009fa0, tid = 139759232767744, func = 0x7f1ccb396190 , arg = 0x7f1cb4001040, exited = 1}
{next = 0x7f1cb4009ce0, tid = 139759248590592, func = 0x7f1ccb396190 , arg = 0x7f1cb4001040, exited = 0}
{next = 0x7f1cb4009a20, tid = 139759250691840, func = 0x7f1ccb396190 , arg = 0x7f1cb4001040, exited = 0}
{next = 0x7f1cb4009760, tid = 139759252793088, func = 0x7f1ccb396190 , arg = 0x7f1cb4001040, exited = 0}
..[64 doChallenge in total]..
{next = 0x7f1bd8456f40, tid = 139759566538496, func = 0x7f1ccb396190 , arg = 0x7f1cb4001040, exited = 0}
{next = 0x7f1cb4000a50, tid = 139759522965248, func = 0x7f1ccad805d0 , arg = 0x7f1bd82c2d50, exited = 0}
{next = 0xaa5a1b0, tid = 139759634413312, func = 0x7f1ccb396190 , arg = 0x7f1cb40008d0, exited = 0}
{next = 0xa8d1e90, tid = 139761509881600, func = 0x7f1cc5f7a2a0 <mvc_logmanager>, arg = 0x0, exited = 0}
{next = 0x0, tid = 139761545668352, func = 0x7f1ccb3956a0 , arg = 0xa8d1e30, exited = 0}

Examining the process memory, we see that only thread 76 was performing a query (a "commit"). All the others were in state initclient=5, runclient=2, active=0.

Also, examining the journal, we see that the last entry is a commit following many insertions (~10 tables of 50 columns on average).

The state of the threads in the mserver5 process was:

Threads:
(29289)
|-clone{clone.S:111}
| -start_thread{pthread_create.c:309}
| |-DFLOWworker{mal_dataflow.c:333}
| | -q_dequeue{mal_dataflow.c:213}-+-1*[sem_wait{sem_wait.S:85}:65<5592>]
| | |-1*[sem_wait{sem_wait.S:85}:66<5591>]
| | |-1*[sem_wait{sem_wait.S:85}:67<5590>]
| | |-1*[sem_wait{sem_wait.S:85}:68<5589>]
| | |-1*[sem_wait{sem_wait.S:85}:69<5588>]
| | |-1*[sem_wait{sem_wait.S:85}:70<5582>]
| | |-1*[sem_wait{sem_wait.S:85}:71<5579>]
| | |-1*[sem_wait{sem_wait.S:85}:72<5574>]
| | |-1*[sem_wait{sem_wait.S:85}:73<5571>]
| | |-1*[sem_wait{sem_wait.S:85}:74<5566>]
| | |-1*[sem_wait{sem_wait.S:85}:75<5558>]
| | |-1*[sem_wait{sem_wait.S:85}:77<5477>]
| | |-1*[sem_wait{sem_wait.S:85}:78<5461>]
| | |-1*[sem_wait{sem_wait.S:85}:79<5458>]
| | -1*[sem_wait{sem_wait.S:85}:80<5454>]
| |-profilerHeartbeat{mal_profiler.c:976}
| | -MT_sleep_ms{gdk_posix.c:1089}
| | -1*[select{syscall-template.S:81}:83<29300>]
| -thread_starter{gdk_system.c:459}
| |-SERVERlistenThread{mal_mapi.c:234}
| | -1*[select{syscall-template.S:81}:82<29301>]
| |-doChallenge{mal_mapi.c:184}
| | -MSscheduleClient{mal_session.c:319}
| | -MSserveClient{mal_session.c:439}
| | -runScenario{mal_scenario.c:579}
| | |-runScenarioBody{mal_scenario.c:542}
| | | -runPhase{mal_scenario.c:515}
| | | |-SQLinitClient{sql_scenario.c:427}<sql_contextLock:SQLinitClient>
| | | | -MT_sleep_ms{gdk_posix.c:1089}
| | | | |-1*[select{syscall-template.S:81}:10<10020>]
| | | | |-1*[select{syscall-template.S:81}:11<9733>]
| | | | |-1*[select{syscall-template.S:81}:12<9731>]
| | | | |-1*[select{syscall-template.S:81}:13<9729>]
| | | | |-1*[select{syscall-template.S:81}:14<9727>]
| | | | |-1*[select{syscall-template.S:81}:15<9429>]
| | | | |-1*[select{syscall-template.S:81}:16<9425>]
| | | | |-1*[select{syscall-template.S:81}:17<9369>]
| | | | |-1*[select{syscall-template.S:81}:18<9367>]
| | | | |-1*[select{syscall-template.S:81}:19<9108>]
| | | | |-1*[select{syscall-template.S:81}:2<10662>]
| | | | |-1*[select{syscall-template.S:81}:20<9106>]
| | | | |-1*[select{syscall-template.S:81}:21<9100>]
| | | | |-1*[select{syscall-template.S:81}:22<9098>]
| | | | |-1*[select{syscall-template.S:81}:23<8794>]
| | | | |-1*[select{syscall-template.S:81}:24<8792>]
| | | | |-1*[select{syscall-template.S:81}:25<8790>]
| | | | |-1*[select{syscall-template.S:81}:26<8788>]
| | | | |-1*[select{syscall-template.S:81}:27<8495>]
| | | | |-1*[select{syscall-template.S:81}:28<8493>]
| | | | |-1*[select{syscall-template.S:81}:29<8473>]
| | | | |-1*[select{syscall-template.S:81}:3<10386>]
| | | | |-1*[select{syscall-template.S:81}:30<8465>]
| | | | |-1*[select{syscall-template.S:81}:31<8150>]
| | | | |-1*[select{syscall-template.S:81}:32<8148>]
| | | | |-1*[select{syscall-template.S:81}:33<8146>]
| | | | |-1*[select{syscall-template.S:81}:34<8144>]
| | | | |-1*[select{syscall-template.S:81}:35<7850>]
| | | | |-1*[select{syscall-template.S:81}:36<7848>]
| | | | |-1*[select{syscall-template.S:81}:37<7786>]
| | | | |-1*[select{syscall-template.S:81}:38<7784>]
| | | | |-1*[select{syscall-template.S:81}:39<7528>]
| | | | |-1*[select{syscall-template.S:81}:4<10384>]
| | | | |-1*[select{syscall-template.S:81}:40<7526>]
| | | | |-1*[select{syscall-template.S:81}:41<7524>]
| | | | |-1*[select{syscall-template.S:81}:42<7522>]
| | | | |-1*[select{syscall-template.S:81}:43<7207>]
| | | | |-1*[select{syscall-template.S:81}:44<7205>]
| | | | |-1*[select{syscall-template.S:81}:45<7145>]
| | | | |-1*[select{syscall-template.S:81}:46<7143>]
| | | | |-1*[select{syscall-template.S:81}:47<6885>]
| | | | |-1*[select{syscall-template.S:81}:48<6883>]
| | | | |-1*[select{syscall-template.S:81}:49<6881>]
| | | | |-1*[select{syscall-template.S:81}:5<10382>]
| | | | |-1*[select{syscall-template.S:81}:50<6879>]
| | | | |-1*[select{syscall-template.S:81}:51<6544>]
| | | | |-1*[select{syscall-template.S:81}:52<6542>]
| | | | |-1*[select{syscall-template.S:81}:53<6488>]
| | | | |-1*[select{syscall-template.S:81}:54<6486>]
| | | | |-1*[select{syscall-template.S:81}:55<6229>]
| | | | |-1*[select{syscall-template.S:81}:56<6227>]
| | | | |-1*[select{syscall-template.S:81}:57<6219>]
| | | | |-1*[select{syscall-template.S:81}:58<6217>]
| | | | |-1*[select{syscall-template.S:81}:59<5912>]
| | | | |-1*[select{syscall-template.S:81}:6<10380>]
| | | | |-1*[select{syscall-template.S:81}:60<5910>]
| | | | |-1*[select{syscall-template.S:81}:61<5908>]
| | | | |-1*[select{syscall-template.S:81}:62<5906>]
| | | | |-1*[select{syscall-template.S:81}:63<5648>]
| | | | |-1*[select{syscall-template.S:81}:7<10088>]
| | | | |-1*[select{syscall-template.S:81}:8<10084>]
| | | | `-1*[select{syscall-template.S:81}:9<10022>]
| | | `-SQLinitClient{sql_scenario.c:444}
| | | `-mvc_create{sql_mvc.c:461}
| | | `-store_lock{store.c:1693}<bs_lock:trans_lock>
| | | `-MT_sleep_ms{gdk_posix.c:1089}
| | | `-1*[select{syscall-template.S:81}:64<5646>]
| | `-runScenarioBody{mal_scenario.c:559}
| | `-runPhase{mal_scenario.c:515}
| | `-SQLengineIntern{sql_execute.c:408}
| | `-runMAL{mal_interpreter.c:365}
| | `-runMALsequence{mal_interpreter.c:631}
| | `-SQLtransaction{sql.c:268}
| | `-mvc_commit{sql_mvc.c:296}
| | `-sql_trans_commit{store.c:3428}
| | `-rollforward_trans{store.c:3015}
| | `-rollforward_changeset_updates{store.c:2530}
| | `-rollforward_update_schema{store.c:2993}
| | `-rollforward_changeset_updates{store.c:2530}
| | `-rollforward_update_table{store.c:2917}
| | `-update_table{bat_storage.c:2222}
| | `-tr_update_delta{bat_storage.c:1936}
| | `-BATmsync{gdk_storage.c:692}
| | `-MT_create_thread{gdk_system.c:505}
| | `-join_threads{gdk_system.c:484}
| | `-1*[pthread_join{pthread_join.c:92}[tid=5646]:76<5544>]
| `-mvc_logmanager{sql_mvc.c:133}
| `-store_manager{store.c:1634}<bs_lock:store_manager>
| `-MT_sleep_ms{gdk_posix.c:1089}
| `-1*[select{syscall-template.S:81}:81<29324>]
`-main{mserver5.c:694}---MT_sleep_ms{gdk_posix.c:1089}
`-1*[select{syscall-template.S:81}:1<29289>]

We see that thread 5544 is waiting for thread 5646.

Both threads 5646 and 29289 are waiting on bs_lock. We don't know which thread holds bs_lock, since its internal representation is an atomic value.
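The opacity comes from the lock being a bare atomic flag: a debugger inspecting process memory can see the lock is held, but not by whom. A minimal sketch (hypothetical names, not MonetDB's actual MT_lock implementation) of such a lock, and of the common remedy of recording an owner id next to the flag:

```c
#include <stdatomic.h>

/* Hypothetical debug-friendly lock, NOT MonetDB's actual code.
 * The bare atomic flag alone only reveals held/free; the extra
 * owner_id field makes the holder visible in a core dump. */
typedef struct {
    atomic_int taken;     /* 0 = free, 1 = held            */
    atomic_int owner_id;  /* 0 = unknown, else holder's id */
} dbg_lock;

int dbg_lock_try(dbg_lock *l, int my_id) {
    int expected = 0;
    if (atomic_compare_exchange_strong(&l->taken, &expected, 1)) {
        atomic_store(&l->owner_id, my_id); /* readable post-mortem */
        return 1;                          /* acquired */
    }
    return 0;                              /* held; see l->owner_id */
}

void dbg_lock_release(dbg_lock *l) {
    atomic_store(&l->owner_id, 0);
    atomic_store(&l->taken, 0);
}
```

Had bs_lock carried such an owner field, the dump above would have shown directly whether 5646, 29289, or a third thread was holding it.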

Comment 24893

Date: 2017-01-16 09:24:01 +0100
From: Frédéric Jolliton <<frederic.jolliton+monetdb>>

As discussed with Sjoerd, we are going to upgrade to Jun2016 since it seems to include a fix for this exact issue.

Comment 24910

Date: 2017-01-24 17:14:08 +0100
From: @yzchang

Just for the record: this problem might have been fixed by http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=63f23a3d619b, which was released in Jun2016-SP2.

The problem addressed by changeset 63f23a3d619b is that, when many threads are created and then exit shortly afterwards, they can be joined on the wrong thread, due to incorrect thread administration.
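A simplified model of that hazard (illustrative only, not MonetDB's actual bookkeeping): if a slot in the thread table is recycled before its old occupant is joined, a joiner that remembers the slot rather than the tid ends up waiting on whichever thread occupies the slot now, possibly one that never exits.

```c
/* Toy thread table, NOT MonetDB's actual code. */
#define SLOTS 4
long slot_tid[SLOTS];  /* 0 = free, else registered thread id */

int register_thread(long tid) {
    for (int i = 0; i < SLOTS; i++)
        if (slot_tid[i] == 0) { slot_tid[i] = tid; return i; }
    return -1;  /* table full */
}

void unregister_thread(int slot) { slot_tid[slot] = 0; }

/* Buggy lookup: returns the slot's CURRENT occupant, which may no
 * longer be the thread the caller originally created there. */
long tid_to_join(int slot) { return slot_tid[slot]; }
```

In this model, registering a helper thread, releasing its slot too early, and then registering a new client thread makes a later join-by-slot target the new thread; keying the join on the tid itself (or releasing the slot only after the join) removes the hazard.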

Comment 25726

Date: 2017-10-15 20:48:18 +0200
From: @mlkersten

Can this bug be closed?
