Date: 2013-02-26 17:52:49 +0100
From: Percy Wegmann <>
To: GDK devs <>
Version: 11.15.15 (Feb2013-SP4)
CC: ashishk, @njnes
Last updated: 2013-12-03 13:59:37 +0100
Comment 18572
Date: 2013-02-26 17:52:49 +0100
From: Percy Wegmann <>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.99 Safari/537.22
Build Identifier:
We periodically shut down our Monet database. Sometimes, after performing a shutdown, the database gets into a state where it fails to start up. This appears to be related to the data files, because if I take the same data files to a different machine, I can reproduce the issue on startup.
The shutdown that seems to cause the issue shows up in the log as follows:
2013-02-25 12:56:22 MSG merovingian[16884]: sending process 17528 (database 'click') the TERM signal
2013-02-25 12:56:22 MSG merovingian[16884]: database 'click' (17528) has exited with exit status 0
2013-02-25 12:56:22 MSG merovingian[16884]: database 'click' has shut down
2013-02-25 12:56:22 MSG control[16884]: (local): stopped database 'click'
This seems to indicate a clean shutdown.
At startup, we get a segmentation fault. We took a core dump and I've attached the backtrace from it. After the dataset gets corrupted, the failure happens in the exact same place every time.
We reran our scenario and did find a Valgrind error that may be related - that report is also attached.
Reproducible: Always
Steps to Reproduce:
We have a specific dataset with which we're testing. Unfortunately, the data is confidential, so we can't share it here. The general characteristics of what we're doing are:
Tables are loaded via a COPY INTO that is reading from files on /dev/shm (shared memory)
Multiple tables are loaded concurrently in separate transactions
Periodically, we automatically restart MonetDB by using "monetdb stop click" to shut it down and then reconnecting to let monetdbd start it again. We do this in order to bring down Monet's memory usage to within a configured limit. Our app specifically halts all other database activity during the shutdown/restart operation.
What we've found is that if we run with the same data set but don't do the restart, we can run well past the point of failure. If we let the system quiesce and then restart MonetDB manually, it comes back up fine, which seems to suggest it's not something specific to a particular datum that we're writing.
Actual Results:
Data is corrupted and mserver5 enters a restart loop
Expected Results:
Data is not corrupted and mserver5 starts successfully
Comment 18573
Date: 2013-02-26 17:54:25 +0100
From: Percy Wegmann <>
Information about Core Dump
The error is happening on line 1142 of gdk_atoms.c:
if (GDK_STRCMP(v, (str) (next + 1) + extralen) == 0) {
Examining the core dump revealed that (next + 1) + extralen is referring to an out of bounds address. Here's the backtrace:
0 0x00007faf58414829 in strPut (h=0x1e2d180, dst=0x7fff592cf8f8, v=0x314dac0 "SAD014H1") at gdk_atoms.c:1142
1 0x00007faf582dc935 in BATappend (b=0x1e2cf90, n=0x32dfdb0, force=1 '\001') at gdk_batop.c:578
2 0x00007faf584c301e in la_bat_updates (lg=0x2d9b030, la=0x2c3ef48) at gdk_logger.c:429
3 0x00007faf584c3cf9 in la_apply (lg=0x2d9b030, c=0x2c3ef48) at gdk_logger.c:645
4 0x00007faf584c3f26 in tr_commit (lg=0x2d9b030, tr=0x2e247d0) at gdk_logger.c:705
5 0x00007faf584c4533 in logger_readlog (lg=0x2d9b030,
filename=0x7fff592d1e80 "/opt/clicksecurity/data/_monetdb/click/sql_logs/sql/log.56") at gdk_logger.c:823
6 0x00007faf584c482a in logger_readlogs (lg=0x2d9b030, fp=0x2d9b160,
filename=0x7fff592d3f90 "/opt/clicksecurity/data/_monetdb/click/sql_logs/sql/log") at gdk_logger.c:896
7 0x00007faf584c6f3e in logger_new (debug=0, fn=0x7faf500adfa8 "sql", logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click",
version=52001, prefuncp=0x7faf500746a1 <bl_preversion>, postfuncp=0x7faf500747ed <bl_postversion>) at gdk_logger.c:1420
8 0x00007faf584c704e in logger_create (debug=0, fn=0x7faf500adfa8 "sql", logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click",
version=52001, prefuncp=0x7faf500746a1 <bl_preversion>, postfuncp=0x7faf500747ed <bl_postversion>) at gdk_logger.c:1446
9 0x00007faf50075b19 in bl_create (logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click", cat_version=52001) at bat_logger.c:249
10 0x00007faf50060ce4 in store_init (debug=0, store=store_bat, logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click", stk=0)
at store.c:1287
11 0x00007faf4ffe3d3c in mvc_init (dbname=0x1fa3da0 "click", debug=0, store=store_bat, stk=0) at sql_mvc.c:51
12 0x00007faf4ff66874 in SQLinit () at sql_scenario.c:230
13 0x00007faf4ff6651f in SQLprelude () at sql_scenario.c:159
14 0x00007faf58b3085d in malCommandCall (stk=0x2d36e80, pci=0x2ea5520) at mal_interpreter.c:137
15 0x00007faf58b331b5 in runMALsequence (cntxt=0x7faf5988c020, mb=0x1e04310, startpc=1, stoppc=0, stk=0x2d36e80, env=0x0, pcicaller=0x0)
at mal_interpreter.c:710
16 0x00007faf58b323c1 in runMAL (cntxt=0x7faf5988c020, mb=0x1e04310, startpc=1, mbcaller=0x0, env=0x0, pcicaller=0x0)
at mal_interpreter.c:454
17 0x00007faf58b60a08 in MALengine (c=0x7faf5988c020) at mal_session.c:619
18 0x00007faf58b5f21f in malBootstrap () at mal_session.c:64
19 0x00007faf58b1313b in mal_init () at mal.c:244
20 0x000000000040340e in main (argc=22, av=0x7fff592db568) at mserver5.c:582
Comment 18574
Date: 2013-02-26 17:56:50 +0100
From: Percy Wegmann <>
Created attachment 185
Results of running mserver5 in Valgrind while feeding in test data
This shows the results of running mserver5 in Valgrind. Note the error below in a call to MT_msync:
Address 0x1ee3334a is not stack'd, malloc'd or (recently) free'd
Comment 18575
Date: 2013-02-26 17:57:30 +0100
From: Percy Wegmann <>
We tried installing version 11.13.9 and running with that, but we got the same problem.
Comment 18576
Date: 2013-02-26 18:33:50 +0100
From: @njnes
Could you test with the Feb2013 branch, or with the soon-to-be-released Feb2013-SP1?
There have been related fixes recently.
Comment 18577
Date: 2013-02-26 18:37:41 +0100
From: Percy Wegmann <>
Will do. Stay tuned.
Comment 18578
Date: 2013-02-26 21:10:03 +0100
From: Percy Wegmann <>
I just cloned the Feb2013 branch from mercurial and still have the same problem. I have been able to replicate it using some non-confidential data. The monet data files are 126 MB zipped. I'm happy to share these if that'll help.
Comment 18579
Date: 2013-02-26 21:15:53 +0100
From: @njnes
Did you see the problem again after reloading, or did you upgrade the db from the old test? If the former, mail me the download details so that I can continue to debug.
Comment 18580
Date: 2013-02-26 21:24:49 +0100
From: Percy Wegmann <>
I installed the newer version of Monet, deleted the old database, recreated it, and ran my test data through. After a few restarts, I ended up with the crashing issue again.
I'll email you a link to the data.
Thanks
Comment 18581
Date: 2013-02-27 08:19:24 +0100
From: @njnes
The reason for the crash does indeed come from the loading phase. The data on disk already seems to be corrupt. Could we somehow test with the loading scripts?
Comment 18996
Date: 2013-08-13 17:47:26 +0200
From: Ashish Kumar Singh <>
I also ran into a similar issue using the latest released version of MonetDB.
Comment 19346
Date: 2013-11-19 20:56:54 +0100
From: @sjoerdmullender
Although we haven't been able to reproduce this, we feel that changesets aa2e3065be7e, 486f2ab17d12, and 054b82fd68c2 may well have fixed these issues.
Our analysis was that the hash table that is used to do duplicate elimination in the string heap (only partial elimination once the heap grows large) was corrupted when strings were added to the heap but the transaction in which this happened was rolled back.
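To make the rollback scenario above concrete, here is a minimal sketch in C. It is not MonetDB's code: the `strheap` type, the fixed-size buffer, the single-slot hash buckets, and every function name are hypothetical simplifications chosen only to show how a hash table that outlives a rollback can hand out offsets to bytes that are no longer part of the heap. In this toy version the stale bytes are still readable, so the effect is a silently changed value; in a real heap that is truncated or remapped, the same stale offset could plausibly point out of bounds, as in the backtrace.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HEAP_CAP 256              /* fixed capacity keeps the sketch free of realloc */
#define NBUCKETS 16               /* power of two, so the hash can be masked */
#define NO_ENTRY ((size_t)-1)

/* Hypothetical, simplified string heap: strings are stored back to back and
 * each hash bucket remembers the offset of one stored string. */
typedef struct {
    char   base[HEAP_CAP];
    size_t free;                  /* first unused byte in base */
    size_t bucket[NBUCKETS];      /* offset of a candidate duplicate, or NO_ENTRY */
} strheap;

static size_t str_hash(const char *s) {
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h & (NBUCKETS - 1);
}

void heap_init(strheap *h) {
    memset(h->base, 0, sizeof h->base);
    h->free = 0;
    for (size_t i = 0; i < NBUCKETS; i++)
        h->bucket[i] = NO_ENTRY;
}

/* Store v, reusing an existing copy if the bucket points at an equal string.
 * The bucket is trusted blindly, which is what makes a stale entry dangerous.
 * Returns the offset of the string, or NO_ENTRY if the heap is full. */
size_t heap_put(strheap *h, const char *v) {
    size_t b = str_hash(v);
    size_t cand = h->bucket[b];
    if (cand != NO_ENTRY && strcmp(v, h->base + cand) == 0)
        return cand;
    size_t len = strlen(v) + 1;
    if (h->free + len > HEAP_CAP)
        return NO_ENTRY;
    memcpy(h->base + h->free, v, len);
    h->bucket[b] = h->free;
    h->free += len;
    return h->bucket[b];
}

/* Roll the heap back to an earlier size. With purge == 0 the hash table is
 * left untouched (the suspected failure mode); with purge != 0 every bucket
 * that refers to discarded bytes is cleared. */
void heap_rollback(strheap *h, size_t saved_free, int purge) {
    h->free = saved_free;
    if (!purge)
        return;
    for (size_t i = 0; i < NBUCKETS; i++)
        if (h->bucket[i] != NO_ENTRY && h->bucket[i] >= saved_free)
            h->bucket[i] = NO_ENTRY;
}

/* Count buckets that point at or past the end of the live heap. */
size_t heap_stale_buckets(const strheap *h) {
    size_t n = 0;
    for (size_t i = 0; i < NBUCKETS; i++)
        if (h->bucket[i] != NO_ENTRY && h->bucket[i] >= h->free)
            n++;
    return n;
}
```

Calling `heap_rollback` with `purge == 0` after an aborted insert leaves one stale bucket behind; the next `heap_put` of the same string then returns an offset beyond `free`, and a later insert overwrites those bytes. Purging on rollback avoids this at the cost of losing some duplicate-elimination candidates.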
A related issue has to do with string offsets that grow, causing a widening of the offset column. If the transaction in which this happens is rolled back, similar problems could occur.
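The widening problem can be illustrated in the same spirit. The sketch below is an assumption about one way such a mismatch could arise, not a description of the actual bug or of MonetDB's data structures: a hypothetical `offcol` stores offsets in 1-byte slots and rewrites itself to 2-byte slots when a larger offset arrives. If a rollback restores the old slot width without undoing the rewrite, the surviving entries are read with the wrong stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFFCOL_CAP 64                 /* fixed slot capacity for the sketch */

/* Hypothetical offset column with a variable slot width. */
typedef struct {
    unsigned width;                   /* bytes per slot: 1 or 2 */
    size_t   count;                   /* number of offsets stored */
    uint8_t  data[2 * OFFCOL_CAP];    /* 2-byte slots are stored little-endian */
} offcol;

typedef struct { unsigned width; size_t count; } offcol_mark;

void offcol_init(offcol *c) { c->width = 1; c->count = 0; }

unsigned offcol_get(const offcol *c, size_t i) {
    if (c->width == 1)
        return c->data[i];
    return (unsigned)c->data[2 * i] | ((unsigned)c->data[2 * i + 1] << 8);
}

/* Rewrite all 1-byte slots as 2-byte slots, in place. */
static void offcol_widen(offcol *c) {
    /* walk backwards so no slot is overwritten before it has been read */
    for (size_t i = c->count; i-- > 0; ) {
        uint8_t v = c->data[i];
        c->data[2 * i] = v;
        c->data[2 * i + 1] = 0;
    }
    c->width = 2;
}

/* Append an offset, widening the column first if it no longer fits.
 * Returns 0 on success, -1 if the column is full or the offset is too large. */
int offcol_append(offcol *c, unsigned off) {
    if (c->count == OFFCOL_CAP || off > 0xFFFF)
        return -1;
    if (off > 0xFF && c->width == 1)
        offcol_widen(c);
    if (c->width == 1) {
        c->data[c->count] = (uint8_t)off;
    } else {
        c->data[2 * c->count] = (uint8_t)(off & 0xFF);
        c->data[2 * c->count + 1] = (uint8_t)(off >> 8);
    }
    c->count++;
    return 0;
}

offcol_mark offcol_save(const offcol *c) {
    offcol_mark m = { c->width, c->count };
    return m;
}

/* Naive rollback: restores the saved width although the bytes were rewritten. */
void offcol_rollback_naive(offcol *c, offcol_mark m) {
    c->width = m.width;
    c->count = m.count;
}

/* Consistent rollback: drops the new rows but keeps the width that matches
 * the bytes actually stored. */
void offcol_rollback_keep_width(offcol *c, offcol_mark m) {
    c->count = m.count;
}
```

After appending 10 and 20, saving a mark, and appending 300, the naive rollback makes the second entry read back as 0 instead of 20, while the width-preserving rollback still returns 10 and 20. Narrowing the data back to the old width would be another consistent option.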
Hopefully the aforementioned changesets fix these issues, so I'm closing this bug. Feel free to reopen if the issue is not resolved.
Comment 19384
Date: 2013-12-03 13:59:37 +0100
From: @sjoerdmullender
Feb2013-SP6 has been released.