Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault after large table insert #3422

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Closed

Segmentation fault after large table insert #3422

monetdb-team opened this issue Nov 30, 2020 · 0 comments

Comments

@monetdb-team
Copy link

@monetdb-team monetdb-team commented Nov 30, 2020

Date: 2014-01-20 12:44:10 +0100
From: Alex Shestakov <<shura.shestakov>>
To: SQL devs <>
Version: 11.17.9 (Jan2014)
CC: @njnes, richard.monetdb, shura.shestakov, @yzchang

Last updated: 2014-10-31 14:13:42 +0100

Comment 19462

Date: 2014-01-20 12:44:10 +0100
From: Alex Shestakov <<shura.shestakov>>

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36
Build Identifier:

mserver5 receives sigsegv after a big insert into a table.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f511e44c700 (LWP 14773)]
0x00007f620cfef1eb in gtr_update_table (tr=0x2d671e0, t=0x30f6920, tchanges=0x7f511e44bd5c) at bat_storage.c:1454
1454 for (n = t->columns.set->h; ok == LOG_OK && n; n = n->next) {
(gdb) where
0 0x00007f620cfef1eb in gtr_update_table (tr=0x2d671e0, t=0x30f6920, tchanges=0x7f511e44bd5c) at bat_storage.c:1454
1 0x00007f620cfeee0e in _gtr_update (tr=0x2d671e0, gtr_update_table_f=0x7f620cfef1d0 <gtr_update_table>) at bat_storage.c:1504
2 0x00007f620cfe91cc in store_manager () at store.c:1552
3 0x00007f620cf9b4c5 in mvc_logmanager () at sql_mvc.c:149
4 0x00007f6214a1178b in thread_starter (arg=0x2edb510) at gdk_system.c:505
5 0x0000003ee9a07851 in start_thread () from /lib64/libpthread.so.0
6 0x0000003ee92e890d in clone () from /lib64/libc.so.6
(gdb) print t->columns
$1 = {sa = 0x2da3b30, destroy = 0x7f620cfe01d0 <column_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}

(gdb) print *t
$3 = {base = {wtime = 1, rtime = 0, allocated = 1, flag = 0, id = 8421, name = 0x30f6a50 "wave"}, type = 0, system = 0 '\000', persistence = SQL_PERSIST, commit_action = CA_COMMIT,
readonly = 0 '\000', query = 0x0, sz = 1024, pkey = 0x0, columns = {sa = 0x2da3b30, destroy = 0x7f620cfe01d0 <column_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}, idxs = {
sa = 0x2da3b30, destroy = 0x7f620cfe14f0 <idx_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}, keys = {sa = 0x2da3b30, destroy = 0x7f620cfe1e50 <key_destroy>, set = 0x0, dset = 0x0,
nelm = 0x0}, triggers = {sa = 0x2da3b30, destroy = 0x7f620cfe1250 <trigger_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}, tables = {sa = 0x2da3b30,
destroy = 0x7f620cfeba70 <table_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}, drop_action = 0, cleared = 0, data = 0x30e49a0, s = 0x2e80450, p = 0x0}

The problem seems to be caused not by the table being updated, but by a different table 'wave'.

Table wave was created as follows.
create table wave (
wave_id bigint auto_increment,
businessdate date not null,
name varchar(256),
unique (businessdate, name)
);

I've dropped table wave and re-created it again. The problem has gone.
Looks like the table had been in some broken state.

Reproducible: Sometimes

Comment 19476

Date: 2014-01-21 11:22:21 +0100
From: @sjoerdmullender

What did you do to this table "wave" before you got the crash?

Also, if you can reproduce this, please try to do so while the server is running with the extra argument -d10. I would expect this to cause an assertion failure (i.e. abort) or warning message (depending on how the server was built). I'd like to see the messages produced.

Comment 19477

Date: 2014-01-21 12:17:43 +0100
From: Alex Shestakov <<shura.shestakov>>

I've just created table wave as shown above and inserted data into it using
"insert into wave select from ..." statement.

After I've dropped and re-created the table the problem has gone. I can't reproduce it.

Comment 19499

Date: 2014-01-24 10:28:29 +0100
From: Alex Shestakov <<shura.shestakov>>

It started happening again for no obvious reason. Now not only on inserts. I was doing a delete from a different table and it crashed.

[Switching to Thread 0x7f6ad8129700 (LWP 27642)]
0x00007f7bd414c1eb in gtr_update_table (tr=0x2d0b3b0, t=0x309b490, tchanges=0x7f6ad8128d5c) at bat_storage.c:1454
1454 for (n = t->columns.set->h; ok == LOG_OK && n; n = n->next) {
(gdb) print *t
$1 = {base = {wtime = 1, rtime = 0, allocated = 1, flag = 0, id = 8421, name = 0x309b5c0 "wave"}, type = 0, system = 0 '\000', persistence = SQL_PERSIST, commit_action = CA_COMMIT,
readonly = 0 '\000', query = 0x0, sz = 1024, pkey = 0x0, columns = {sa = 0x300acd0, destroy = 0x7f7bd413d1d0 <column_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}, idxs = {
sa = 0x300acd0, destroy = 0x7f7bd413e4f0 <idx_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}, keys = {sa = 0x300acd0, destroy = 0x7f7bd413ee50 <key_destroy>, set = 0x0, dset = 0x0,
nelm = 0x0}, triggers = {sa = 0x300acd0, destroy = 0x7f7bd413e250 <trigger_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}, tables = {sa = 0x300acd0,
destroy = 0x7f7bd4148a70 <table_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}, drop_action = 0, cleared = 0, data = 0x30c2b40, s = 0x2d970b0, p = 0x0}
(gdb) print t->columns
$2 = {sa = 0x300acd0, destroy = 0x7f7bd413d1d0 <column_destroy>, set = 0x0, dset = 0x0, nelm = 0x0}
(gdb) where
0 0x00007f7bd414c1eb in gtr_update_table (tr=0x2d0b3b0, t=0x309b490, tchanges=0x7f6ad8128d5c) at bat_storage.c:1454
1 0x00007f7bd414be0e in _gtr_update (tr=0x2d0b3b0, gtr_update_table_f=0x7f7bd414c1d0 <gtr_update_table>) at bat_storage.c:1504
2 0x00007f7bd41461cc in store_manager () at store.c:1552
3 0x00007f7bd40f84c5 in mvc_logmanager () at sql_mvc.c:149
4 0x00007f7bdbb6e78b in thread_starter (arg=0x2e18590) at gdk_system.c:505
5 0x0000003ee9a07851 in start_thread () from /lib64/libpthread.so.0
6 0x0000003ee92e890d in clone () from /lib64/libc.so.6

I was running mserver5 with -d10 option. It didn't complain on any settings.

The only thing it printed after the initial startup message was the following when I connected to it with SQuirrel SQL (jdbc).

Bulk operator required for str.stringleft
Bulk operator required for str.stringleft

Comment 19929

Date: 2014-07-25 12:55:39 +0200
From: Richard Hughes <<richard.monetdb>>

I've had a SIGSEGV that looks sufficiently similar to this that I'm going to put it here rather than creating a new bug:

v11.17.17 "Jan2014-SP2"

(gdb) bt
0 gtr_update_delta (tr=tr@entry=0x3333e10, cbat=0x7fb538058840,
changes=changes@entry=0x7fd52ff7fe5c) at bat_storage.c:1429
1 0x00007fd5308a3025 in gtr_update_table (tchanges=,
t=0x31bdbf0, tr=0x3333e10) at bat_storage.c:1468
2 _gtr_update (gtr_update_table_f=, tr=0x3333e10)
at bat_storage.c:1512
3 gtr_update (tr=0x3333e10) at bat_storage.c:1529
4 0x00007fd530893672 in store_manager () at store.c:1591
5 0x00007fd530839605 in mvc_logmanager () at sql_mvc.c:149
6 0x00007fd53714497b in thread_starter (arg=0x3110fd0) at gdk_system.c:505
7 0x00007fd534c770a4 in start_thread (arg=0x7fd52ff80700)
at pthread_create.c:309
8 0x00007fd5349ac04d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) p cur
$1 = (BAT *) 0x0
(gdb) p *cbat
$2 = {name = 0x7fb538090420 "sys_staging_data_id", bid = 0, ibase = 0,
ibid = 28806, ubid = 2252, cnt = 10273, ucnt = 0, cached = 0x0,
wtime = 6398, next = 0x7fb6645e6070}

This happened once only, with a load pattern that hasn't changed before or since. Our system runs with exactly one process (having one connection) doing all writes (mostly bulk inserts, with some updates sprinkled in), and a few processes (2 or 3) doing SELECTs. My speculation (based on pretty much zero evidence) is that one of the SELECTs required an index rebuild and raced with a concurrent COPY or UPDATE. Maybe. Good luck!

I'll keep the core file for a few weeks, if anybody wants me to ask it any more questions.

Comment 19937

Date: 2014-07-30 15:07:14 +0200
From: Richard Hughes <<richard.monetdb>>

Another SIGSEGV that looks similar, v11.17.17 "Jan2014-SP2" (same binary as my last comment)

(gdb) bt
0 BATcleanProps (b=0x0) at bat_storage.c:1407
1 tr_update_delta (tr=tr@entry=0x7f0558007f30, obat=0x7f0559844bc0,
cbat=0x7f0559a229d0, unique=0) at bat_storage.c:1635
2 0x00007f0566ea9b05 in update_table (tr=0x7f0558007f30, ft=0x7f055803fa60,
tt=0x1df86b0) at bat_storage.c:1778
3 0x00007f0566e98558 in rollforward_update_table (tr=0x7f0558007f30,
ft=0x7f055803fa60, tt=0x1df86b0, mode=3) at store.c:2744
4 0x00007f0566ea082a in rollforward_changeset_updates (mode=3,
fd=,
rollforward_deletes=0x7f0566e94d30 <rollforward_drop_table>,
rollforward_creates=0x7f0566e94850 <rollforward_create_table>,
rollforward_updates=0x7f0566e97e50 <rollforward_update_table>,
b=0x1b6cbe0, ts=0x1b6cc08, fs=0x7f0558008258, tr=0x7f0558007f30)
at store.c:2357
5 rollforward_update_schema (tr=0x7f0558007f30, fs=0x7f0558008230,
ts=0x1b6cbe0, mode=3) at store.c:2820
6 0x00007f0566ea1145 in rollforward_changeset_updates (
mode=, fd=,
rollforward_deletes=, rollforward_creates=,
rollforward_updates=, b=,
ts=, fs=, tr=)
at store.c:2357
7 rollforward_trans (mode=, tr=)
at store.c:2842
8 sql_trans_commit (tr=0x7f0558007f30) at store.c:3258
9 0x00007f0566e3fb28 in mvc_commit (m=0x7f05580053f0, chain=chain@entry=0,
name=name@entry=0x0) at sql_mvc.c:274
10 0x00007f0566db5e6d in SQLtransaction (cntxt=0x815968, mb=,
stk=, pci=0x7f0559ca0050) at sql.c:290
11 0x00007f056f673158 in runMALsequence (cntxt=0x0, mb=0x7f0560107590,
startpc=1, stoppc=15, stk=0x7f0559c59e50, env=0x68ec, pcicaller=0x0)
at mal_interpreter.c:648
12 0x00007f056f6749bf in runMAL (cntxt=0x0, cntxt@entry=0x815968,
mb=0x7f0559a6e1f0, mbcaller=0x1, mbcaller@entry=0x0, env=0xf,
env@entry=0x0) at mal_interpreter.c:374
13 0x00007f0566dcb95b in SQLengineIntern (c=0x815968, be=0x7f05580c96b0)
at sql_scenario.c:2301
14 0x00007f056f68f4b7 in runPhase (phase=4, c=0x815968) at mal_scenario.c:528
15 runScenarioBody (c=c@entry=0x815968) at mal_scenario.c:572
16 0x00007f056f69001d in runScenario (c=c@entry=0x815968)
at mal_scenario.c:592
17 0x00007f056f6900e8 in MSserveClient (dummy=0x815968) at mal_session.c:456
18 0x00007f056f1dc97b in thread_starter (arg=0x7f0560000a40)
at gdk_system.c:505
19 0x00007f056cd0f0a4 in start_thread (arg=0x7f0566385700)
at pthread_create.c:309
20 0x00007f056ca4404d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 1
1 tr_update_delta (tr=tr@entry=0x7f0558007f30, obat=0x7f0559844bc0,
cbat=0x7f0559a229d0, unique=0) at bat_storage.c:1635
1635 BATcleanProps(cur);
(gdb) p *obat
$1 = {name = 0x7f0559a28260 "sys_staging_data_id", bid = 0, ibase = 0,
ibid = 52498, ubid = 2252, cnt = 2044, ucnt = 0, cached = 0x0,
wtime = 26859, next = 0x7f0558d04760}

The statement being run at the time was
copy into staging_data from stdin delimiters '\t' null as '\N';

Can somebody please advise me what debugging code I can add in order to help track this down?

Comment 19940

Date: 2014-08-01 11:26:14 +0200
From: Richard Hughes <<richard.monetdb>>

Freshly upgraded to Jan2014SP3 (11.17.21), same backtrace as comment 4.

(gdb) bt
0 gtr_update_delta (tr=tr@entry=0x1e9a760, cbat=0x7f001c1011e0,
changes=changes@entry=0x7f0032f4ce5c) at bat_storage.c:1429
1 0x00007f00356c07b5 in gtr_update_table (tchanges=,
t=0x2516790, tr=0x1e9a760) at bat_storage.c:1468
2 _gtr_update (gtr_update_table_f=, tr=0x1e9a760)
at bat_storage.c:1512
3 gtr_update (tr=0x1e9a760) at bat_storage.c:1529
4 0x00007f00356b0e02 in store_manager () at store.c:1591
5 0x00007f0035656c35 in mvc_logmanager () at sql_mvc.c:149
6 0x00007f003d9f4bfb in thread_starter (arg=0x2230940) at gdk_system.c:505
7 0x00007f003b5270a4 in start_thread (arg=0x7f0032f4d700)
at pthread_create.c:309
8 0x00007f003b25c04d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) p *cur
Cannot access memory at address 0x0
(gdb) p *cbat
$1 = {name = 0x7f001c101100 "sys_timerangecache_t", bid = 0, ibase = 0,
ibid = 19882, ubid = 1166, cnt = 9730, ucnt = 0, cached = 0x0,
wtime = 9788, next = 0x251cea0}

The statements that had recently executed related to sys.timerangecache were:

delete from timerangecache;
insert into timerangecache (t,n) select t,count(*) from (select (start-(select timestamp '1970-1
-1'))/1800000 from fulldata) as t(t,n) group by t;
commit;

Comment 20163

Date: 2014-09-17 16:45:21 +0200
From: @yzchang

Possibly fixed by Sjoerd's changeset: fe0b3b84c297
Based on a patch of Richard Huges, thanks!
Please test

Comment 20334

Date: 2014-10-31 14:13:42 +0100
From: @sjoerdmullender

Oct2014 has been released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant