SIGSEGV in MANIFOLDjob #3592

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments

Date: 2014-10-02 12:57:30 +0200
From: Richard Hughes <<richard.monetdb>>
To: MonetDB5 devs <>
Version: -- development
CC: @mlkersten, shawpolakcrax12

Last updated: 2019-06-07 09:22:48 +0200

Comment 20226

Date: 2014-10-02 12:57:30 +0200
From: Richard Hughes <<richard.monetdb>>

Build is Oct2014 307281054d25 plus a couple of patches (which I hope aren't relevant here)

Traffic pattern is described in bug #3577.

mserver5 crashed a few minutes after we started performing read queries. I'm still working on finding out which exact query might have caused this (I'm not the only one using this server) [BTW, I've never found a way of extracting either the current SQL or MAL from a core dump - do you have a good trick for that?]:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f3bd66b66cc in MANIFOLDjob (mut=<optimized out>)
at manifold.c:164
164 case 6: Manifoldbody(args[3],args[4],args[5]); break;
(gdb) bt
#0 0x00007f3bd66b66cc in MANIFOLDjob (mut=<optimized out>)
at manifold.c:164
#1 MANIFOLDevaluate (cntxt=<optimized out>, mb=<optimized out>,
stk=0x7f3b79a11950, pci=0x7f3b797daa60) at manifold.c:319
#2 0x00007f3bd65ddbe8 in runMALsequence (cntxt=0x7f3b880968e8,
mb=0x7f3b78970660, startpc=2040620112, stoppc=2040620128,
stk=0x7f3b79a11950, env=0x3, pcicaller=0x0) at mal_interpreter.c:651
#3 0x00007f3bd65dfd65 in DFLOWworker (T=0x7f3b880968e8) at mal_dataflow.c:362
#4 0x00007f3bd50060a4 in start_thread (arg=0x7f3bbf9d6700)
at pthread_create.c:309
#5 0x00007f3bd4d3bc2d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) i loc
v = 0x7f3b880968e8
i = <optimized out>
q = 0x754 <error: Cannot access memory at address 0x754>
y = 0x0
msg = 0x0
p = 0x0
args = 0x7f3b880a6ea0
(gdb) disassemble $rip-20,$rip+20
Dump of assembler code from 0x7f3bd66b66b8 to 0x7f3bd66b66e0:
0x00007f3bd66b66b8 <MANIFOLDevaluate+6248>: push %rsp
0x00007f3bd66b66b9 <MANIFOLDevaluate+6249>: and $0x20,%al
0x00007f3bd66b66bb <MANIFOLDevaluate+6251>: xor %eax,%eax
0x00007f3bd66b66bd <MANIFOLDevaluate+6253>: mov 0x28(%r15),%rcx
0x00007f3bd66b66c1 <MANIFOLDevaluate+6257>: mov 0x20(%r15),%rdx
0x00007f3bd66b66c5 <MANIFOLDevaluate+6261>: mov 0x18(%r15),%rsi
0x00007f3bd66b66c9 <MANIFOLDevaluate+6265>: mov %r13,%rdi
=> 0x00007f3bd66b66cc <MANIFOLDevaluate+6268>: callq *0x10(%r10)
0x00007f3bd66b66d0 <MANIFOLDevaluate+6272>: test %rax,%rax
0x00007f3bd66b66d3 <MANIFOLDevaluate+6275>: jne 0x7f3bd66ba568 <MANIFOLDevaluate+22296>
0x00007f3bd66b66d9 <MANIFOLDevaluate+6281>: mov 0x2c(%rsp),%edi
0x00007f3bd66b66dd <MANIFOLDevaluate+6285>: cmp %edi,0x28(%rsp)
End of assembler dump.
(gdb) p/x $r10
$1 = 0x7f3b880968d0
(gdb) p *(InstrPtr)$r10
$2 = {token = 0 '\000', barrier = 0 '\000', typechk = 0 '\000',
gc = 0 '\000', polymorphic = 0 '\000', varargs = 0 '\000', recycle = 0,
jump = -2147483648, fcn = 0x8000000000000000, blk = 0x25b63f0,
trace = 0 '\000', calls = 0, ticks = 0, rbytes = 0, wbytes = 0,
modname = 0x25b5280 "mal", fcnname = 0x257e240 "manifold", argc = 6,
retc = 1, maxarg = 8, argv = 0x7f3b8809692c}
(gdb) up
#1 MANIFOLDevaluate (cntxt=<optimized out>, mb=<optimized out>,
stk=0x7f3b79a11950, pci=0x7f3b797daa60) at manifold.c:319
319 msg = MANIFOLDjob(&mut);
(gdb) p *pci
$3 = {token = 54 '6', barrier = 0 '\000', typechk = 2 '\002', gc = 3 '\003',
polymorphic = 0 '\000', varargs = 0 '\000', recycle = 0, jump = 0,
fcn = 0x7f3bd66b4e50 , blk = 0x25b63f0, trace = 0 '\000',
calls = 0, ticks = 0, rbytes = 0, wbytes = 0, modname = 0x25b5280 "mal",
fcnname = 0x257e240 "manifold", argc = 6, retc = 1, maxarg = 8,
argv = 0x7f3b797daabc}
(gdb) p *mat@6
$4 = {{b = 0x7f3b8802ab90, first = 0x7f3b88081830, last = 0x7f3b88081830,
size = 0, type = 0, bi = {b = 0x7f3b8802ab90, hvid = 0, tvid = 0}, o = 0,
q = 0, s = 0x0}, {b = 0x0, first = 0x0, last = 0x0, size = 0, type = 0,
bi = {b = 0x0, hvid = 0, tvid = 0}, o = 0, q = 0, s = 0x0}, {b = 0x0,
first = 0x0, last = 0x0, size = 0, type = 0, bi = {b = 0x0, hvid = 0,
tvid = 0}, o = 0, q = 0, s = 0x0}, {b = 0x7f3b880501d0, first = 0x0,
last = 0x754, size = 0, type = 0, bi = {b = 0x7f3b880501d0, hvid = 0,
tvid = 0}, o = 0, q = 1876, s = 0x0}, {b = 0x0, first = 0x7f3b79a16450,
last = 0x7f3b79a16450, size = 0, type = 0, bi = {b = 0x0, hvid = 0,
tvid = 0}, o = 0, q = 0, s = 0x0}, {b = 0x0, first = 0x7f3b79a16460,
last = 0x7f3b79a16460, size = 0, type = 0, bi = {b = 0x0, hvid = 0,
tvid = 0}, o = 0, q = 0, s = 0x0}}
(gdb) i loc
mut = <optimized out>
mat = 0x7f3b88009ed0
i = <optimized out>
tpe = <optimized out>
cnt = 1876
msg = 0x0
fcn = <optimized out>

Obviously from the disassembly, the instruction pointer is actually at manifold.c:71.

The immediate cause of the crash is that v has run into mut.pci, so that memory has been overwritten. Because of this unbounded heap overrun, it's difficult to say which variables are correct and which are nonsense. The runaway loop appears to be caused by mut->args[mut->fvar].size==0 (as in the printout of mat above), but I can't figure out how that happened. It doesn't help that gcc has managed to optimize out all of mut.
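
The runaway-loop hypothesis can be illustrated with a minimal sketch. It assumes (this is not copied from manifold.c) a loop that writes one result per iteration and advances the input cursor by the element size. The names `column_cursor` and `count_output_writes` are hypothetical, and the iteration budget exists only so that the demonstration terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one manifold argument column. */
typedef struct {
    const char *first;  /* start of the input values */
    const char *last;   /* one past the end of the input values */
    size_t size;        /* width of one element in bytes */
} column_cursor;

/*
 * Count how many result slots a pointer-stepping loop would write,
 * giving up after max_steps iterations. A return value equal to
 * max_steps means the loop did not finish within the budget.
 */
size_t count_output_writes(const column_cursor *in, size_t max_steps)
{
    assert(in != NULL);
    const char *p = in->first;
    size_t written = 0;

    while (p < in->last && written < max_steps) {
        written++;       /* stands in for writing one result through v++ */
        p += in->size;   /* with size == 0 the cursor never moves */
    }
    return written;
}
```

For a 16-byte input with size 8 the function returns 2. With size 0 the input cursor stays put and the function returns the whole budget; in an unbounded loop the result pointer would keep advancing past the end of the result heap.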

This was the second crash on that database, so in theory could have been a consequence of corruption incurred by the previous crash. I'm now going to change my core_pattern not to overwrite the file every time...

There was another crash on another database, which was the first one ever on that database (again, shortly after the read queries started):

#0 BBPdestroy (b=0x7fb5540505d0) at gdk_bbp.c:2497
#1 decref (i=13266, logical=<optimized out>, releaseShare=<optimized out>,
lock=1) at gdk_bbp.c:2258
#2 0x00007fb5ad49da76 in runMALsequence (cntxt=0x8000000000000000,
mb=0x7fb58cb4e280, startpc=0, stoppc=1398, stk=0x7fb58e0a9d70, env=0x0,
pcicaller=0x0) at mal_interpreter.c:815
#3 0x00007fb5ad49fd65 in DFLOWworker (T=0x8000000000000000)
at mal_dataflow.c:362
#4 0x00007fb5abec60a4 in start_thread (arg=0x7fb576df6700)
at pthread_create.c:309
#5 0x00007fb5abbfbc2d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) p *b
$1 = {batCacheid = 0, H = 0x8000000000000000, T = 0x8000000000000000,
S = 0x8000000000000000}
(gdb) thread 15
[Switching to thread 15 (Thread 0x7fb5765f2700 (LWP 9296))]
#0 0x00007fb5ad57672a in MANIFOLDjob (mut=<optimized out>)
at manifold.c:164
164 case 6: Manifoldbody(args[3],args[4],args[5]); break;
(gdb) bt
#0 0x00007fb5ad57672a in MANIFOLDjob (mut=<optimized out>)
at manifold.c:164
#1 MANIFOLDevaluate (cntxt=<optimized out>, mb=<optimized out>,
stk=0x7fb58e0a9d70, pci=0x7fb58dcede30) at manifold.c:319
#2 0x00007fb5ad49dbe8 in runMALsequence (cntxt=0x8000000000000000,
mb=0x7fb58cb4e280, startpc=-1388980992, stoppc=-1911890928,
stk=0x7fb58e0a9d70, env=0x3, pcicaller=0x0) at mal_interpreter.c:651
#3 0x00007fb5ad49fd65 in DFLOWworker (T=0x8000000000000000)
at mal_dataflow.c:362
#4 0x00007fb5abec60a4 in start_thread (arg=0x7fb5765f2700)
at pthread_create.c:309
#5 0x00007fb5abbfbc2d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

This would appear to be the same heap overrun, just that it managed to take out a different thread before it overwrote itself.

Comment 20227

Date: 2014-10-02 13:40:24 +0200
From: Richard Hughes <<richard.monetdb>>

Got it! Here's a repro:

create table foo (t timestamp,v decimal(18,9));
insert into foo values (now(),42);
insert into foo values (now(),43);
select (t-(select timestamp '1970-1-1')),v from foo union all select (t-(select timestamp '1970-1-1')),null from foo;

If I rewrite that null as cast(null as decimal(18,9)) then it doesn't crash.

Comment 20228

Date: 2014-10-02 14:53:53 +0200
From: Richard Hughes <<richard.monetdb>>

How's this?

diff -r 6676c2c07c3d monetdb5/modules/mal/manifold.c
--- a/monetdb5/modules/mal/manifold.c Wed Oct 01 12:48:06 2014 +0100
+++ b/monetdb5/modules/mal/manifold.c Thu Oct 02 13:52:57 2014 +0100
@@ -314,10 +314,12 @@
 	else
 		BATseqbase(mat[0].b, 0);
 
-	mut.pci = copyInstruction(pci);
-	mut.pci->fcn = fcn;
-	msg = MANIFOLDjob(&mut);
-	freeInstruction(mut.pci);
+	if ( mat[mut.fvar].b->ttype != TYPE_void){
+		mut.pci = copyInstruction(pci);
+		mut.pci->fcn = fcn;
+		msg = MANIFOLDjob(&mut);
+		freeInstruction(mut.pci);
+	}
 
 	// consolidate the properties
 	if (ATOMstorage(mat[0].b->ttype) < TYPE_str)
    

Comment 20229

Date: 2014-10-02 14:58:10 +0200
From: Richard Hughes <<richard.monetdb>>

No, forget that, it gives the wrong answer. Back to the drawing board.

Anybody who wants to help me out here, feel free...

Comment 20230

Date: 2014-10-02 15:16:34 +0200
From: Richard Hughes <<richard.monetdb>>

Better:

diff -r 6676c2c07c3d monetdb5/modules/mal/manifold.c
--- a/monetdb5/modules/mal/manifold.c Wed Oct 01 12:48:06 2014 +0100
+++ b/monetdb5/modules/mal/manifold.c Thu Oct 02 14:13:30 2014 +0100
@@ -275,7 +275,9 @@
 				goto wrapup;
 			}
 			mut.lvar = i;
-			if (ATOMstorage(tpe) == TYPE_str)
+			if (ATOMstorage(tpe) == TYPE_void)
+				mat[i].size = 1;
+			else if (ATOMstorage(tpe) == TYPE_str)
 				mat[i].size = Tsize(mat[i].b);
 			else
 				mat[i].size = BATatoms[ATOMstorage(tpe)].size;
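
A small sketch of why a step of 1 would be enough for a void column, under the assumption that the cursor for such a column is just a row counter. In the first dump, one entry of mat shows first = 0x0, last = 0x754 and size = 0, and 0x754 equals cnt (1876). The type tags and function names below are hypothetical; the real TYPE_* constants and widths come from the MonetDB headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type tags, standing in for the real TYPE_* constants. */
typedef enum { SKETCH_VOID, SKETCH_FIXED, SKETCH_STR } sketch_type;

/* Cursor step per row, mirroring the shape of the patched if/else chain. */
size_t cursor_step(sketch_type t, size_t fixed_width, size_t str_offset_width)
{
    if (t == SKETCH_VOID)
        return 1;                 /* virtual column: cursor counts rows */
    else if (t == SKETCH_STR)
        return str_offset_width;  /* width of a string offset entry */
    else
        return fixed_width;       /* width of a fixed-size atom */
}

/* Iterations of "for (p = first; p < last; p += step)" on integer addresses. */
size_t iterations(uintptr_t first, uintptr_t last, size_t step)
{
    assert(step > 0);             /* a zero step would never terminate */
    if (last <= first)
        return 0;
    return (size_t)((last - first + step - 1) / step);
}
```

With first = 0, last = 0x754 and the void step of 1, `iterations` gives 1876, which matches cnt in the dump.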
    

Comment 20231

Date: 2014-10-03 11:17:22 +0200
From: Richard Hughes <<richard.monetdb>>

Thinking about it, there's a missed optimization opportunity here: as long as the function is classified as immutable, you could call it only once and duplicate the result as needed, rather than calling it once per NULL. I'll leave somebody else to do that - I'm just happy that it doesn't crash any more.
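
A rough sketch of that optimization, assuming the function has no side effects and every row receives the same constant argument. All names here (`scalar_fn`, `add_one`, `fill_per_row`, `fill_call_once`) are hypothetical and unrelated to the MonetDB API; each fill function returns the number of calls it made so that the two strategies can be compared.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*scalar_fn)(long);

/* Example side-effect-free function, used only for illustration. */
long add_one(long x) { return x + 1; }

/* Current behaviour as described: one call per row. */
size_t fill_per_row(long *out, size_t n, scalar_fn f, long arg)
{
    assert(out != NULL || n == 0);
    size_t calls = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = f(arg);
        calls++;
    }
    return calls;
}

/* Suggested behaviour: call once, then duplicate the result. */
size_t fill_call_once(long *out, size_t n, scalar_fn f, long arg)
{
    assert(out != NULL || n == 0);
    if (n == 0)
        return 0;
    long result = f(arg);
    for (size_t i = 0; i < n; i++)
        out[i] = result;
    return 1;
}
```

Both functions fill the output with the same values; only the number of calls differs (n versus 1).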

Comment 20243

Date: 2014-10-04 12:14:37 +0200
From: @mlkersten

The short test indeed generates a SEGFAULT in the Oct2014 pre-release.
It crashes in sql_round.c line 285

279 nil_2dec(TYPE *res, void *val, int *d, int *sc)
280 {
281 (void) val;
282 (void) d;
283 (void) sc;
284
285 *res = NIL(TYPE);
286 return MAL_SUCCEED;
287 }

Probably caused by calling with an unsupported type.

Comment 20244

Date: 2014-10-04 12:31:55 +0200
From: @mlkersten

Probably caused by the inability of nil_2dec_lng to deal with a void value.

#0 0x00007fffef7e31f4 in nil_2dec_lng (res=0x7fffe4024000, val=0x0, d=0x1db74b0, sc=0x1db74d0) at /export/scratch1/mk/mosaic//package/sql/backends/monet5/sql_round_impl.h:285
#1 0x00007ffff7b66aa6 in MANIFOLDjob (mut=0x7fffed049880) at /export/scratch1/mk/mosaic//package/monetdb5/modules/mal/manifold.c:171

Comment 20245

Date: 2014-10-04 12:37:47 +0200
From: MonetDB Mercurial Repository <>

Changeset f67870d9120b made by Martin Kersten mk@cwi.nl in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=f67870d9120b

Changeset description:

Add test for bug nil_2dec_lng.Bug-3592

Comment 20252

Date: 2014-10-05 18:41:27 +0200
From: @mlkersten

The result BAT is identified with a TYPE_void tail. This means we have to create a result with TYPE_oid and have to materialize the oids during the loop.

For the time being, ignore void-void columns in the manifold type checker.
The resulting MAL loop at least provides correct results.

The actual manifold code should be redone to run properly over an oid range
instead of the current mixture of pointer calculations.
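
A minimal sketch of what materializing the oids over a range could look like, assuming a dense (void) column is fully described by a starting oid and a row count. `sketch_oid` and `materialize_oid_range` are hypothetical names, not MonetDB functions; the point is only that an index-based loop is bounded by the row count rather than by pointer arithmetic on element sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MonetDB's oid type. */
typedef uint64_t sketch_oid;

/*
 * Write the oids seqbase, seqbase+1, ... into out.
 * The loop runs over row indexes, so it always stops after count rows.
 */
void materialize_oid_range(sketch_oid *out, sketch_oid seqbase, size_t count)
{
    assert(out != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        out[i] = seqbase + (sketch_oid)i;
}
```

The caller is assumed to have allocated room for count entries in out.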

Comment 20253

Date: 2014-10-05 21:01:53 +0200
From: @mlkersten

Upgraded the code base. Closing until we find further issues.
More mini-examples in the test suite would be great.

Comment 20322

Date: 2014-10-28 23:46:21 +0100
From: MonetDB Mercurial Repository <>

Changeset da9653156935 made by Stefan Manegold Stefan.Manegold@cwi.nl in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=da9653156935

Changeset description:

nil_2dec_lng.Bug-3592: approve single-threaded output

Comment 20372

Date: 2014-10-31 14:14:27 +0100
From: @sjoerdmullender

Oct2014 has been released.
