Occasional dbfarm corruption upon database restart #7152

swingbit · 2021-07-09T12:02:27Z

Describe the bug
For the third time, on different databases, a properly shut down database would not restart, with the following errors in merovingian.log:

2021-06-28 11:33:10 MSG merovingian[7]: database 'equip-vc_default01' (-1) has exited with exit status 0
2021-06-28 11:33:20 MSG merovingian[7]: starting database 'equip-vc_default01', up min/avg/max: 1s/1d/6d, crash average: 0.00 0.40 0.13 (8-4=4)
2021-06-28 11:33:21 MSG equip-vc_default01[51470]: arguments: /opt/monetdb/bin/mserver5 --dbpath=/var/lib/monetdb/dbfarm/equip-vc_default01 --set merovingian_uri=mapi:monetdb://f4c5ad81e6df:50000/equip-vc_default01 --set mapi_listenaddr=none --set mapi_usock=/var/lib/monetdb/dbfarm/equip-vc_default01/.mapi.sock --set monet_vault_key=/var/lib/monetdb/dbfarm/equip-vc_default01/.vaultkey --set gdk_nr_threads=8 --set max_clients=64 --set sql_optimizer=sequential_pipe --set embedded_py=3 --set mal_for_all=yes
2021-06-28 11:33:21 ERR equip-vc_default01[51470]: #main thread: BBPcheckbats: !ERROR: BBPcheckbats: cannot stat file /var/lib/monetdb/dbfarm/equip-vc_default01/bat/05/513.tail (expected size 18536): No such file or directory
2021-06-28 11:33:23 MSG merovingian[7]: database 'equip-vc_default01' (-1) has exited with exit status 1

Storage is a local SSD, so I tend to rule out storage-related issues.

To Reproduce
Unfortunately, I am not able to reproduce it reliably. I can only say that it never happened before Oct2020 and has now happened 3 times, so I guess there is a bug in the storage layer triggered by some corner case.
I know it's hard to find the cause without a test; I just hope this rings a bell.

Software versions

11.39.18
CentOS 7
compiled from sources


yzchang · 2021-07-12T13:31:16Z

This is a known problem that occasionally happens. Unfortunately, we've never had sufficient information to find its cause. If you can give us any more information, we're more than happy to investigate.

A workaround (don't forget to create a backup of the current dbfarm first!) is to create the missing file and fill it with dummy data until it has reached the expected size. The BBP.dir file should tell you the data type. This way, you can at least get the database restarted to save the remaining data.
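The padding step of this workaround can be sketched as follows. This is only an illustration, not a MonetDB tool: the helper name `pad_missing_heap` is hypothetical, and filling with zero bytes is an assumption about what counts as acceptable dummy data for the type listed in BBP.dir.

```python
import os

CHUNK = 1 << 20  # write in 1 MiB pieces so very large heaps do not need much memory


def pad_missing_heap(path, expected_size):
    """Create `path` if it is missing and append zero bytes until it is
    `expected_size` bytes long. Existing content is never truncated.
    Hypothetical helper: zero bytes are an assumed choice of dummy data."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Append mode creates the file when absent and keeps any existing bytes.
    with open(path, "ab") as f:
        remaining = expected_size - f.seek(0, os.SEEK_END)
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(b"\0" * n)
            remaining -= n
    return os.path.getsize(path)
```

For the error in the original report, the call would take the path and size straight from the log message: `pad_missing_heap("/var/lib/monetdb/dbfarm/equip-vc_default01/bat/05/513.tail", 18536)`. Run it only on a stopped database and only after the backup.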

swingbit · 2021-07-13T07:33:24Z

Thanks Jennie.
Good to know you are aware of it.

Just a thought, wouldn't it be useful to make the workaround automatic, and then inform the user that tables x,y,w are corrupt?

swingbit · 2021-07-22T16:43:14Z

Maybe it's not much, but something else I noticed:

The problem occurs quite frequently on Oct2020.
It seems to happen regularly after the database has filled up the disk: stop, start, error.
Today this happened twice, on two different databases.

njnes · 2021-08-27T09:16:25Z

Checked in a possible fix (rolled forward changes from the jun branch).

PedroTadim · 2021-09-29T18:32:41Z

We recently made some fixes on the Jul2021 branch. Once Jul2021-SP1 comes out, please check whether this still happens.

swingbit · 2022-02-08T16:15:33Z

Unfortunately this still happens (Jan2022, git head).

A seemingly fine database was running on a system that somehow was leaking 11G of disk space.
df was reporting a partition usage 11G higher than what du reported.
As soon as I stopped the db, the missing 11G were freed and suddenly df and du agreed.

Then, when I tried to restart the db, it refused with:

#main thread: BBPcheckbats: !ERROR: cannot stat file /var/lib/monetdb/dbfarm/default01/bat/02/11/21122.theap: No such file or directory

So this is most likely the file that held those 11G. It had already been deleted from disk, but it was still open in MonetDB.
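A gap between df and du like this is typical of files that were unlinked while a process still holds them open. A minimal sketch of how one might look for them on Linux, assuming the server's pid is known: the function names are made up for illustration, and the ` (deleted)` suffix is what the Linux kernel appends to link targets under /proc/<pid>/fd. Files that are only memory-mapped would not show up here; they appear in /proc/<pid>/maps instead.

```python
import os

DELETED_SUFFIX = " (deleted)"


def read_fd_links(pid):
    """Yield (fd, target) for each open file descriptor of `pid`.
    Linux only: relies on the /proc/<pid>/fd symlinks."""
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            yield int(name), os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning


def deleted_targets(links):
    """Keep only the (fd, path) pairs whose target the kernel marks as deleted,
    with the marker stripped from the path."""
    return [
        (fd, target[: -len(DELETED_SUFFIX)])
        for fd, target in links
        if target.endswith(DELETED_SUFFIX)
    ]
```

Calling `deleted_targets(read_fd_links(pid))` with the pid of the running mserver5 process, before stopping the database, would list candidates such as the missing .theap file.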

I'm not sure what I can do to help debug this, but it is quite serious.

sjoerdmullender · 2022-02-08T16:57:38Z

When you notice this again on a still running database, could you attach a debugger and call BBPdump() from the debugger? This function writes information about all known BATs to stderr, so hopefully the server's stderr goes somewhere. To be safe with respect to other threads running during this, you could do this sequence:

set scheduler-locking on
call BBPdump()
set scheduler-locking off

It would then be interesting to correlate the output with the files present in the database, so if you could also list all files inside the database at the same time (i.e. when the server is stopped in the debugger), and upload those two results, that would (hopefully) be helpful.
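For the file listing half of this request, something along these lines could be used while the server is stopped in the debugger. It is a sketch only: `list_db_files` is a hypothetical helper, and it makes no assumption about the format of the BBPdump() output, which would still have to be compared by hand.

```python
import os


def list_db_files(dbpath):
    """Return a sorted list of (relative_path, size_in_bytes) for every
    regular file found under `dbpath`."""
    entries = []
    for root, _dirs, files in os.walk(dbpath):
        for name in files:
            full = os.path.join(root, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # file vanished between listing and stat
            entries.append((os.path.relpath(full, dbpath), size))
    return sorted(entries)


def format_listing(entries):
    """Render the listing as one 'size<TAB>path' line per file."""
    return "\n".join(f"{size}\t{path}" for path, size in entries)
```

Writing `format_listing(list_db_files(dbpath))` to a file, with `dbpath` set to the database directory under the dbfarm, gives a snapshot that can be uploaded next to the BBPdump() output.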

njnes added the bug label Aug 27, 2021

njnes closed this as completed Feb 17, 2024

sjoerdmullender added this to the NEXTRELEASE milestone Mar 1, 2024

sjoerdmullender modified the milestones: NEXTRELEASE, Dec2023-SP1 Mar 18, 2024
