Occasional dbfarm corruption upon database restart #7152

Closed
swingbit opened this issue Jul 9, 2021 · 7 comments
Labels
bug Something isn't working
@swingbit

swingbit commented Jul 9, 2021

Describe the bug
For the third time, on different databases, a properly shut-down database has failed to restart, with the following errors in merovingian.log:

2021-06-28 11:33:10 MSG merovingian[7]: database 'equip-vc_default01' (-1) has exited with exit status 0
2021-06-28 11:33:20 MSG merovingian[7]: starting database 'equip-vc_default01', up min/avg/max: 1s/1d/6d, crash average: 0.00 0.40 0.13 (8-4=4)
2021-06-28 11:33:21 MSG equip-vc_default01[51470]: arguments: /opt/monetdb/bin/mserver5 --dbpath=/var/lib/monetdb/dbfarm/equip-vc_default01 --set merovingian_uri=mapi:monetdb://f4c5ad81e6df:50000/equip-vc_default01 --set mapi_listenaddr=none --set mapi_usock=/var/lib/monetdb/dbfarm/equip-vc_default01/.mapi.sock --set monet_vault_key=/var/lib/monetdb/dbfarm/equip-vc_default01/.vaultkey --set gdk_nr_threads=8 --set max_clients=64 --set sql_optimizer=sequential_pipe --set embedded_py=3 --set mal_for_all=yes
2021-06-28 11:33:21 ERR equip-vc_default01[51470]: #main thread: BBPcheckbats: !ERROR: BBPcheckbats: cannot stat file /var/lib/monetdb/dbfarm/equip-vc_default01/bat/05/513.tail (expected size 18536): No such file or directory
2021-06-28 11:33:23 MSG merovingian[7]: database 'equip-vc_default01' (-1) has exited with exit status 1

Storage is a local SSD, so I am inclined to rule out storage-related issues.
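
For anyone triaging a similar log, here is a minimal Python sketch that pulls the missing file path and, when present, the expected size out of "cannot stat file" errors. The pattern is modelled only on the error line quoted above; the exact wording in other MonetDB versions is an assumption, and the function name `find_missing_heaps` is hypothetical.

```python
import re

# Pattern modelled on the error line quoted in this report. Other MonetDB
# versions may word the message differently (assumption), so the
# "(expected size N)" part is treated as optional.
CANNOT_STAT = re.compile(
    r"cannot stat file (?P<path>\S+?)"
    r"(?: \(expected size (?P<size>\d+)\))?"
    r": No such file or directory"
)


def find_missing_heaps(log_lines):
    """Return (path, expected_size_or_None) for each 'cannot stat file' error."""
    missing = []
    for line in log_lines:
        match = CANNOT_STAT.search(line)
        if match:
            size = match.group("size")
            missing.append((match.group("path"), int(size) if size else None))
    return missing
```

It can be fed the lines of merovingian.log directly, for example `find_missing_heaps(open("merovingian.log"))`.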

To Reproduce
Unfortunately, I am not able to reproduce it reliably. I can only say that it never happened before Oct2020 and has now happened three times, so I suspect a bug in the storage layer triggered by some corner case.
I know it is hard to find the cause without a test; I just hope this rings a bell.

Software versions

  • 11.39.18
  • CentOS 7
  • compiled from sources
@yzchang
Member

yzchang commented Jul 12, 2021

This is a known problem that occasionally happens. Unfortunately, we have never had sufficient information to find its cause. If you can give us any more information, we are more than happy to investigate.

A workaround (don't forget to create a backup of the current dbfarm first!) is to create the missing file and fill it with dummy data until it reaches the expected size. The BBP.dir file should tell you the data type. This way, you can at least get the database restarted to save the remaining data.
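
As a rough illustration of that workaround, the Python sketch below creates the missing file and pads it with zero bytes up to the expected size. Using zero bytes as the "dummy data" is an assumption; whether zeros are a sensible filler for the column's data type has to be judged from BBP.dir. The function name `create_placeholder_heap` is hypothetical, and the dbfarm should be backed up before running anything like this.

```python
import os


def create_placeholder_heap(path, expected_size):
    """Create a missing heap file filled with zero bytes up to expected_size.

    Refuses to touch an existing file so real data is never overwritten.
    Returns the resulting file size in bytes.
    """
    # Mode "xb" raises FileExistsError if the file is already there.
    with open(path, "xb") as handle:
        chunk = b"\0" * 65536
        remaining = expected_size
        while remaining > 0:
            piece = chunk[:remaining]
            handle.write(piece)
            remaining -= len(piece)
    return os.path.getsize(path)
```

For the error reported above, the call would take the path from the log line and 18536 as the expected size. The data in the affected column is still lost; the placeholder only lets the server start.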

@swingbit
Author

Thanks Jennie.
Good to know you are aware of it.

Just a thought: wouldn't it be useful to make the workaround automatic and then inform the user that tables x, y, w are corrupt?

@swingbit
Author

swingbit commented Jul 22, 2021

Maybe it's not much, but something else I noticed:

  • the problem occurs quite frequently on Oct2020
  • it seems to happen regularly after the database has filled up the disk. Stop, start, error.
    Today this happened twice, on two different databases.

@njnes
Contributor

njnes commented Aug 27, 2021

Checked in a possible fix (rolled forward changes from the jun branch).

njnes added the bug label Aug 27, 2021
@PedroTadim
Contributor

We made some fixes recently on the Jul2021 branch. Please check whether this still happens once Jul2021-SP1 comes out.

@swingbit
Author

swingbit commented Feb 8, 2022

Unfortunately this still happens (Jan2022, git head).

A seemingly fine database was running on a system that was somehow leaking 11G of disk space.
df reported a partition usage 11G higher than what du reported.
As soon as I stopped the db, the missing 11G was freed and df and du agreed again.

Then, when I tried to restart the db, it refused with:

#main thread: BBPcheckbats: !ERROR: cannot stat file /var/lib/monetdb/dbfarm/default01/bat/02/11/21122.theap: No such file or directory

So this is most likely the file that held those 11G. It had already been deleted from disk, but it was still open in MonetDB.
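
To catch this state while the server is still running, something like the following Python sketch could list files that a process has open although they are already unlinked. It is Linux-only and relies on the kernel appending " (deleted)" to the link targets under /proc/<pid>/fd; the function names are hypothetical. It does not cover files that are only memory-mapped without an open descriptor (those would show up in /proc/<pid>/maps instead), and whether MonetDB keeps a descriptor open for the affected heap is an assumption.

```python
import os

DELETED_SUFFIX = " (deleted)"


def deleted_target(link_target):
    """Return the original path if a /proc fd link marks it deleted, else None."""
    if link_target.endswith(DELETED_SUFFIX):
        return link_target[: -len(DELETED_SUFFIX)]
    return None


def deleted_open_files(pid):
    """List (path, size_in_bytes) for files a process holds open after unlinking.

    Linux-only: reads /proc/<pid>/fd and needs permission to do so.
    """
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        fd_path = os.path.join(fd_dir, name)
        try:
            path = deleted_target(os.readlink(fd_path))
            if path is not None:
                # stat() follows the link to the still-open inode.
                found.append((path, os.stat(fd_path).st_size))
        except OSError:
            continue  # descriptor was closed while we were scanning
    return found
```

Run against the mserver5 pid, a large entry ending in .theap or .tail would point at the heap that disappeared from disk while still in use.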

I'm not sure what I can do to help debug this, but it is quite serious.

@sjoerdmullender
Member

When you notice this again on a still running database, could you attach a debugger and call BBPdump() from the debugger? This function writes information about all known BATs to stderr, so hopefully the server's stderr goes somewhere. To be safe with respect to other threads running during this, you could do this sequence:

set scheduler-locking on
call BBPdump()
set scheduler-locking off

It would then be interesting to correlate the output with the files present in the database, so if you could also list all files inside the database at the same time (i.e. when the server is stopped in the debugger), and upload those two results, that would (hopefully) be helpful.
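
For the file listing, a plain `find` or `ls -lR` would do; as an alternative, here is a small Python sketch that records every file under the database directory with its size, so the result can be compared with the BBPdump() output. The function names and the tab-separated output format are hypothetical choices, not anything MonetDB prescribes.

```python
import os


def snapshot_files(dbpath):
    """Return sorted (relative_path, size_in_bytes) pairs for files under dbpath."""
    entries = []
    for root, _dirs, files in os.walk(dbpath):
        for name in files:
            full = os.path.join(root, name)
            try:
                size = os.lstat(full).st_size
            except OSError:
                continue  # file vanished between listing and stat
            entries.append((os.path.relpath(full, dbpath), size))
    return sorted(entries)


def write_snapshot(dbpath, out_path):
    """Write one 'size<TAB>path' line per file under dbpath to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for rel, size in snapshot_files(dbpath):
            out.write(f"{size}\t{rel}\n")
```

Taking the snapshot while the server is stopped in the debugger keeps the listing consistent with the dump.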

njnes closed this as completed Feb 17, 2024
sjoerdmullender added this to the NEXTRELEASE milestone Mar 1, 2024