Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: database file appears corrupted after restore from backup (FB5, RC2) #7942

Closed
gsbelarus opened this issue Jan 2, 2024 · 65 comments
Closed

Comments

@gsbelarus
Copy link

There are multiple Firebird instances on the server: FB25, FB3, and FB5. Each instance is assigned a dedicated port and an appropriate service name. When using a third-party application to perform a restore, it connects to the server via a TCP connection using the connection string localhost/3056:some_path_to_database.

At the end of the restoration process, the following error message appears:

Unable to complete network request to host "localhost". Error reading data from the connection.

Additionally, there is a record in the firebird.log file:

XNET error: XNET server initialization failed. Probably, another instance of the server is already running.

The resulting database file appears to be corrupted, and subsequent gfix -v -full ... shows:

Number of record level errors : 18722

@gsbelarus gsbelarus changed the title FB5, RC2: Unable to complete network request to host "localhost". Error reading data from the connection. Error: Unable to complete network request to host "localhost". Error reading data from the connection (FB5, RC2) Jan 2, 2024
@sim1984
Copy link

sim1984 commented Jan 2, 2024

XNET error: XNET server initialization failed. Probably, another instance of the server is already running.

To simultaneously run several servers using XNET, you need to configure the IpcName (firebird.conf) in the same way as you configure ports for different FB instances to work via INET.

@gsbelarus
Copy link
Author

gsbelarus commented Jan 2, 2024

I used the connection string localhost/3056:path_to_database. Through all my life, I believed this is network connection. Am I wrong?

@gsbelarus
Copy link
Author

Moreover, I checked the firebird.conf. There is no section dedicated to XNET settings exists. Where and what should I assign?

@AlexPeshkoff
Copy link
Member

It's IPCNAME parameter that makes it possible to run multiple servers, listening xnet, in parallel. But it's related only to message in the log (XNET error: XNET server initialization failed. Probably, another instance of the server is already running.). What you mention first looks like server crash at the first look. Does it happen for all versions or only one particular?

@gsbelarus
Copy link
Author

It's IPCNAME parameter that makes it possible to run multiple servers, listening xnet, in parallel. But it's related only to message in the log (XNET error: XNET server initialization failed. Probably, another instance of the server is already running.). What you mention first looks like server crash at the first look. Does it happen for all versions or only one particular?

I got another backup and now doing restoration. Will come with the result in a few hours.

As for XNET, is it enough if I set IPCNAME to FIREBIRD5 for Firebird 5?

@AlexPeshkoff
Copy link
Member

AlexPeshkoff commented Jan 2, 2024 via email

@gsbelarus
Copy link
Author

On 1/2/24 19:57, Andrej Kirejeŭ wrote: As for XNET, is it enough if I set IPCNAME to FIREBIRD5 for Firebird 5?
Enough to avoid mentioned message in the log.

as it is a common scenario when multiple versions of Firebird co-exists on the same server, it would be useful to set the default IPCNAME to FIREBIRD5 for Firebird v.5, FIREBIRD6 for the next version etc

@gsbelarus
Copy link
Author

sorry, bug confirmed. another backup -- the same result:

Unable to complete network request to host "localhost". Error reading data from the connection.

the message during the restore process and then:

Number of record level errors : 18722

during the gfix check.

@hvlad
Copy link
Member

hvlad commented Jan 2, 2024

There are multiple Firebird instances on the server: FB25, FB3, and FB5. Each instance is assigned a dedicated port and an appropriate service name. When using a third-party application to perform a restore, it connects to the server via a TCP connection using the connection string localhost/3056:some_path_to_database.

What server version is supposed to perform restore ?
Did that application run restore using Services API ?
If, yes, how it is attached to the services manager and why it is used "localhost... " in target database name ?
What happens if perform restore with gbak (or fbsvcmgr) using the same set of params ?

@gsbelarus
Copy link
Author

gsbelarus commented Jan 3, 2024

What server version is supposed to perform restore ?

FB5 5.0.0.1304

Did that application run restore using Services API ?

yes

If, yes, how it is attached to the services manager

through IBX component:

https://github.com/GoldenSoftwareLtd/gedemin/blob/master/Gedemin/IBX/IBServices.pas

it has been doing this way for last 25 years ))

and why it is used "localhost... " in target database name ?

The Presence of localhost in the address string forced IBX to use TCP protocol flag in its components.
We definitely needed it in the times of Yaffil, Firebird 0.9 or even earlier.

BTW, localhost/3056 allows us to call exactly FB5 server. As we have FB3, FB25 also running on this machine
we could distinguish between them only by port number.

What happens if perform restore with gbak (or fbsvcmgr) using the same set of params ?

Need to check.

@hvlad
Copy link
Member

hvlad commented Jan 3, 2024

When run restore with Services API, one should use server name in connection string, not in the target database name.

When speak about IBX, it means that server (remote host) name should be specified in ServerName property, not in DatabaseName property.

In your case localhost/3056:service_mgr is connection string and some_path_to_database (without server host name!) is target database name. If one use plain 'service_mgr' (empty string at ServerName property) it means connection uses XNET or, if it was not successful, fbclient will try 'localhost' without port number. As we have no idea what server version was started first and run XNET listener, we could assume it was not FB5 (message in firebird.log confirms it), I.e. another server instance run restore service and connects to the FB5 due to remote part in database name argument (DatabaseName property). Of course, it doesn't make whole restore process as fast as it should be.

In short: you should check and correct values of ServerName and DatabaseName properties.

BTW, what is last message before error ?

@gsbelarus
Copy link
Author

This is how our UI organized. we specify the target server and path to the database in one string, then divide it programmatically and set the corresponding IBX component's properties. The connection definitely goes to the FB5 server because only FB5 process shows activity during the restore process.

Before the error message, there are a couple of warnings of unknown UDF functions. These are our functions like BIN_AND and BIN_OR. They duplicate built-in functions. There is no need for them right now, just leftovers in the database we didn't clean up yet. Nevertheless, these warnings were present for years and never resulted into database corruption.

@hvlad
Copy link
Member

hvlad commented Jan 3, 2024

Could you be more specific and show service properties used ? There is some confusion that better to be cleared up.
Also, did you investigated if server was crashed (as Alex suppose) ?

@gsbelarus
Copy link
Author

gsbelarus commented Jan 3, 2024

      IBConfigService.ServerName := edServer.Text;

      if IBConfigService.ServerName > '' then
        IBConfigService.Protocol := TCP
      else
        IBConfigService.Protocol := Local;

in our case edServer.text contains localhost/3056

It would be easier to check whether the server crashed or not if appropriate records of server starting and server properly shutting down were put into the firebird.log file.

@mrotteveel
Copy link
Member

mrotteveel commented Jan 3, 2024 via email

@gsbelarus
Copy link
Author

No, it wouldn't be, because that would mean an older Firebird fbclient.dll wouldn't be able to connect using XNET to a newer Firebird server (or newer client to older server). Mark -- Mark Rotteveel

Oh, no. I just changed names to FIREBIRD25, FIREBIRD3, FIREBIRD5... Let us see where it leads now.

But, it won't affect network connections, right?

@gsbelarus
Copy link
Author

gsbelarus commented Jan 3, 2024

And regarding failed gfix. The firebird.log now contains thousands of records like:

XXXX Tue Jan 2 13:16:27 2024
Database: XXXX
Error: Record 3 is wrong length in table GD_EMPLOYEE (1394)

All errors related to one table in the database.

@mrotteveel
Copy link
Member

Oh, no. I just changed names to FIREBIRD25, FIREBIRD3, FIREBIRD5... Let us see where it leads now.

But, it won't affect network connections, right?

No, only XNET. Keep in mind that for a client to now be able to connect to the server with XNET, the firebird.conf of the client library (i.e. located in the same directory as fbclient.dll used by your application) must also contain the right IpcName setting.

@hvlad
Copy link
Member

hvlad commented Jan 3, 2024

      IBConfigService.ServerName := edServer.Text;

      if IBConfigService.ServerName > '' then
        IBConfigService.Protocol := TCP
      else
        IBConfigService.Protocol := Local;

in our case edServer.text contains localhost/3056

Thanks, but...how IBConfigService is related with restore task ? Please, don't add more confusion than we already have.
Also, we need to know value of DatabaseName property (at least).

It would be easier to check whether the server crashed or not if appropriate records of server starting and server properly shutting down were put into the firebird.log file.

If Firebird run as a service, look into Windows Event Log.
Also, it is good idea to always have WER turned ON to collect crash dumps.

@sim1984
Copy link

sim1984 commented Jan 3, 2024

And regarding failed gfix. The firebird.log now contains thousands of records like:

XXXX Tue Jan 2 13:16:27 2024
Database: XXXX
Error: Record 3 is wrong length in table GD_EMPLOYEE (1394)

All errors related to one table in the database.

If restore was not completed like yours, then it is not surprising that gfix shows errors. Here you should rather look at the errors in firebird.log and the log of restore itself (gbak -v -y <log_file>).

You are restoring the database, but where are you restoring it from? What ODS was the backup made from?

@gsbelarus
Copy link
Author

The database was backed up on FB 3 server and restored on FB 5.

@aafemt
Copy link
Contributor

aafemt commented Jan 3, 2024

Is restore with gbak ok? Can you share core dumps of crashed server?

@gsbelarus
Copy link
Author

There are no suspicious records in the Windows's logs. So, I assume that FB5 service didn't crash.

@aafemt
Copy link
Contributor

aafemt commented Jan 3, 2024

Do error happen when all servers but v5 are stopped?

@gsbelarus
Copy link
Author

Do error happen when all servers but v5 are stopped?

cann't check right now. those servers being used.

@gsbelarus
Copy link
Author

gsbelarus commented Jan 3, 2024

well, I'm trying to run from the command prompt and the command just hangs. no CPU activity, no records in the firebird.log file.

C:\Program Files\FB5>gbak -r "K:\Bases\xxx.bk" "g:\Bases\xxx.fdb" -user sysdba -pas xxx 
-v -y "G:\Bases\Broiler\restore.log" -z

@aafemt
Copy link
Contributor

aafemt commented Jan 3, 2024

At this point you can attach debugger to it to see what's happening.

@hvlad
Copy link
Member

hvlad commented Jan 3, 2024

well, I'm trying to run from the command prompt and the command just hangs. no CPU activity, no records in the firebird.log file.

C:\Program Files\FB5>gbak -r "K:\Bases\xxx.bk" "g:\Bases\xxx.fdb" -user sysdba -pas xxx -v -y "G:\Bases\Broiler\restore.log" -z

This command line not uses services and thus not equal to the app case.
It should run in embedded mode, did you check CPU usage of whole system or of Firebird process only ? In second case, it is expected to not use CPU.

Could you run quick check with metadata-only restore using services, instead ? Like:

gbak -se localhost/3056:service_mgr -r -m -v "K:\Bases\xxx.bk" "g:\Bases\xxx.fdb"

@gsbelarus
Copy link
Author

restoring of metadata goes without any problems:

gbak:gbak version WI-V5.0.0.1304 Firebird 5.0 RC 2
gbak:use up to 8 parallel workers
gbak:transportable backup -- data in XDR format
gbak:		backup file is compressed
gbak:backup version is 10
gbak:created database g:\Bases\xxx.fdb, page_size 8192 bytes
gbak:started transaction
...
gbak:adjusting views dbkey length
gbak:updating ownership of packages, procedures and tables
gbak:adding missing privileges
gbak:adjusting system generators
gbak: WARNING:function ABS is not defined
gbak: WARNING:    module name or entrypoint could not be found
gbak: WARNING:    function BIN_AND is not defined
gbak: WARNING:    module name or entrypoint could not be found
gbak:finishing, closing, and going home
gbak:adjusting the ONLINE and FORCED WRITES flags

@hvlad
Copy link
Member

hvlad commented Mar 18, 2024

I downloaded the latest build 1305 with debug symbols, and, run it twice. The first time with procdump attached to the service and the second one -- without procdump. In both cases, the process of restoring ended without any error messages and the server didn't crash. But anyway, the subsequent run of gfix utility shows that the newly restored database file is corrupted:

Summary of validation errors

        Number of record level errors   : 18722

If you run procdump against Windows service process - how do you sure procdump have enough privileges to watch for the process and create dump file ? Also, it is easy to check Windows Event log if serivce was crashed and restarted.

@hvlad
Copy link
Member

hvlad commented Mar 18, 2024

Not necessarily embedded, locahost attachment is fine. The key is not to use services: I guess if this problem may be related to Firebird server crash on service disconnect which I've once seen.

How it will help us to catch the crash ?

@aafemt
Copy link
Contributor

aafemt commented Mar 18, 2024

How it will help us to catch the crash ?

It would narrow borders for testcase creation. If problem is related to TCP server, it technically cannot be reproduced in embedded mode.

@gsbelarus
Copy link
Author

relevant records from firebird.log:

LEPEL	Mon Mar 18 04:30:58 2024
	INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056


LEPEL	Mon Mar 18 16:32:12 2024
	Database: G:\XXXXX.FDB
	Validation started


LEPEL	Mon Mar 18 17:18:06 2024
	Database: G:\XXXX.FDB
	Validation finished: 0 errors, 0 warnings, 0 fixed

@gsbelarus
Copy link
Author

gsbelarus commented Mar 18, 2024

@hvlad there is a warning in the OS system log:

Fault bucket 1440612613330618035, type 5
Event Name: RADAR_PRE_LEAK_64
Response: Not available
Cab Id: 0

Problem signature:
P1: firebird.exe
P2: 5.0.1.1369
P3: 10.0.17763.2.0.0
P4: 
P5: 
P6: 
P7: 
P8: 
P9: 
P10: 

Attached files:
\\?\g:\temp\RDRB702.tmp\empty.txt
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WERB703.tmp.WERInternalMetadata.xml
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WERB713.tmp.xml
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WERB731.tmp.csv
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WERB741.tmp.txt

These files may be available here:


Analysis symbol: 
Rechecking for solution: 0
Report Id: a8edd257-1b38-4aee-a1d3-fc1b6db57ca8
Report Status: 268435456
Hashed bucket: 632b11fed92f1964d3fe158a0484c6b3
Cab Guid: 0

@gsbelarus
Copy link
Author

gsbelarus commented Mar 18, 2024

the process started at 1:53, then at 2:22 the warning (see above) had been added to the OSs log, at 4:30 process finished.

in contrary to what we had a few months before, now, even though gbak been finishing with the error message, the subsequent gfix call didn't find any corruptions of the database structure.

@hvlad
Copy link
Member

hvlad commented Mar 18, 2024

@hvlad there is a warning in the OS system log:

Fault bucket 1440612613330618035, type 5
Event Name: RADAR_PRE_LEAK_64

You may ignore it. See, for example

https://answers.microsoft.com/en-us/windows/forum/all/radarpreleak/372855fb-1285-4f99-a352-4fd517e9b680 :

Radar_Pre_Leak is technical nomenclature that means the OS has detected (Radar) a process that is not handling it routines in a manner that it thinks is efficient (Pre) and as a result it may begin to exhaust memory resources (Leak), i.e. it is a resource hog. It is entirely informational event, though, and does not mean it is necessarily running outside its bounds or has exhausted the limits of Windows memory management.

@hvlad
Copy link
Member

hvlad commented Mar 18, 2024

in contrary to what we had a few months before, now, even though gbak been finishing with the error message, the subsequent gfix call didn't find any corruptions of the database structure.

And empty validation log confirms it.

So, what is current state of things ?

I see that restore still have some problem on detach or when reconnecting after successful restore, as there is no message 'adjusting the ONLINE and FORCED WRITES flags' and database remains in multi-user maintenance mode.

@gsbelarus
Copy link
Author

Well, I ran restoring using only gbak without -se switch and got this error:

21:32:35,79 C:\Program Files\FB5>gbak -r g:\XXXXXXXXXXX.bk g:\XXXXXXXXXXXXXX.fdb -user sysdba -pas XXXXXXXXX -z -v > g:XXXXXXXXXXXXX.txt
gbak: ERROR:action cancelled by trigger (3) to preserve data integrity
gbak: ERROR:    Cannot deactivate index used by a PRIMARY/UNIQUE constraint
gbak:Exiting before completion due to errors

last records from the log file:

gbak:    activating and creating deferred index RDB$PRIMARY1477
gbak:    activating and creating deferred index RDB$PRIMARY1495
gbak:    activating and creating deferred index RDB$PRIMARY1474
gbak:    activating and creating deferred index RDB$PRIMARY1466
gbak:    activating and creating deferred index RDB$PRIMARY1501
gbak:    activating and creating deferred index RDB$PRIMARY1498
gbak:    activating and creating deferred index RDB$PRIMARY1494
gbak:cannot commit index RDB$PRIMARY1494
gbak: ERROR:invalid database handle (no active connection)

new entries in the firebird.log

LEPEL	Tue Mar 19 01:59:32 2024
	Failed to create worker attachment

	database G:\XXXXXXXXXXXX.FDB shutdown


LEPEL	Tue Mar 19 01:59:33 2024
	Failed to create worker attachment

	database G:\XXXXXXXXXXXX.FDB shutdown

@hvlad
Copy link
Member

hvlad commented Mar 19, 2024

Could you provide me with this backup to avoid endless guessing and fix the issue ?

@hvlad
Copy link
Member

hvlad commented Mar 28, 2024

Anything ?

@gsbelarus
Copy link
Author

Anything ?

Is there an email or messenger through which we can discuss the matter, as it requires arranging access to the database somehow?

@hvlad
Copy link
Member

hvlad commented Mar 28, 2024

Sure, email me to the hvlad at users sf com

@gsbelarus
Copy link
Author

ok. I will try another restore attempt on the different machine using a backup file from a different day, If an error occurs I will get in touch.

@gsbelarus
Copy link
Author

Spent half of the week conducting tests on three different machines and discovered something interesting. The restoration of the database crashes when the server mode is set to Classic. It crashes regardless of whether we use the -se switch in the gbak's command line. However, when the server mode is set to Super, the database restores without any problems.

I will attempt to obscure the database and provide a copy to @hvlad for further investigation.

@hvlad
Copy link
Member

hvlad commented Apr 2, 2024

Good to know, thanks

@hvlad
Copy link
Member

hvlad commented Apr 2, 2024

With Classic mode the issue is reproduced, fix will follow soon.

hvlad added a commit that referenced this issue Apr 2, 2024
…e from backup (FB5, RC2)

Don't use common SortOwner for sorts with different dbb's.
After #7671 it could contain cached sort buffers from different pools (dbb_permanent) and then release it to the wrong dbb_sort_buffers making possible usage of deallocated memory later.
@hvlad hvlad self-assigned this Apr 2, 2024
@hvlad
Copy link
Member

hvlad commented Apr 2, 2024

The fix is committed into v5, please check next snapshot build.
The bug really doesn't affects SuperServer and related with parallel index creation.

hvlad added a commit that referenced this issue Apr 3, 2024
…e from backup (FB5, RC2)

Don't use common SortOwner for sorts with different dbb's.
After #7671 it could contain cached sort buffers from different pools (dbb_permanent) and then release it to the wrong dbb_sort_buffers making possible usage of deallocated memory later.
@gsbelarus
Copy link
Author

Confirmed. Everything is ok now. Many thanks!

BTW, I have set MaxParallelWorkers = 64 and ParallelWorkers = 16 and the speed of restoring is awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants