Firebird server stops accepting new connections after some time #7480

Closed
agx4ever opened this issue Feb 21, 2023 · 45 comments
Comments

@agx4ever

I have a server that runs FB3 and I want to migrate to FB4. I created a new test server and installed the latest FB4. It works fine for a while, from a few days up to 2 weeks at most, and then the Firebird server suddenly stops accepting new connections. On the server I can still see firebird in the process list, but it simply doesn't accept new connections. When I stop and then start Firebird, it works fine again. The error log does not show anything unusual.

I tried the same installation and the same configuration on a different server to rule out hardware problems or software misconfiguration, and the result is the same: the FB process stops accepting new connections after some time.

OS: Linux, CentOS Stream release 8
Firebird 4.0.2 - Firebird-4.0.2.2816-0.amd64

--- firebird.conf ---
TempDirectories = /mnt/data0/fb4/tmp/
DefaultDbCachePages = 2048
UseFileSystemCache = true
TempBlockSize = 8M
TempCacheLimit = 64M
InlineSortThreshold = 2048
AuthServer = Srp256
AuthClient = Srp256, Srp
UserManager = Srp
ReadConsistency = 0
RemoteServicePort = 3050
LockMemSize = 1M
LockHashSlots = 8191
ServerMode = SuperClassic

--- databases.conf ---
dev_main = /mnt/data0/fb4/dev_main.fdb
{
DatabaseGrowthIncrement = 128M
DeadlockTimeout = 10
DefaultDbCachePages = 32768
FileSystemCacheThreshold = 1048576
GCPolicy = combined
LockHashSlots = 49999
LockMemSize = 40M
}
--- no replication configuration ---

The last time the problem occurred I made dumps of the fbguard and firebird processes with the "gcore" command. I can send those dumps by email (or another convenient way, just tell me how).
If there is anything else I can do to provide more information, please tell me.

@sim1984

sim1984 commented Feb 21, 2023

The current database level configuration is more suitable for SuperServer mode than SuperClassic. The following values are too large:

DefaultDbCachePages = 32768
GCPolicy = combined

@agx4ever
Author

This is the configuration for my test server. The database uses page size 32768, and together with "DefaultDbCachePages = 32768" each connection takes ~1GB of RAM. I have 40GB of RAM installed and there are usually only a few simultaneous connections (on the test server). If you think this setting could be the reason for the problem, I can reduce it.

As for GCPolicy, I will change it to "cooperative". I have turned the automatic sweep off and sweeping is done manually with gfix, so I thought it didn't affect anything important.
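
The ~1GB per connection figure quoted above is simply page size multiplied by the number of cache pages. A minimal sketch of that arithmetic follows; the helper name `cache_mib` is made up for illustration and is not part of Firebird.

```shell
#!/bin/sh
# Per-connection page cache in MiB = page size (bytes) * cache pages / 2^20.
# cache_mib is a hypothetical helper, not part of Firebird.
cache_mib() {
    page_size=$1
    cache_pages=$2
    echo $(( page_size * cache_pages / 1024 / 1024 ))
}

# Values from this thread: 32768-byte pages, DefaultDbCachePages = 32768.
per_conn=$(cache_mib 32768 32768)
echo "cache per connection: ${per_conn} MiB"

# Rough upper bound on connections before the caches alone fill 40 GiB of RAM.
echo "connections that fit in 40 GiB: $(( 40 * 1024 / per_conn ))"
```

This only bounds the page cache; other per-connection memory is not counted.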

@hvlad
Member

hvlad commented Feb 21, 2023

On server I can see firebird in process list, but it simply doesn't accept new connections.

What kind of connection string do you use?
What error is returned by the application?
Did you try to connect using isql at such a moment?
Could you try an embedded (hostless) connection?

GCPolicy = combined

Doesn't matter for non-SS architectures

@agx4ever
Author

What kind of connection string do you use?

From IBExpert:
connect 'my_dev_dns/3050:dev_main' user "SYSDBA" password 'VerySecurePass';

What error is returned by the application?

Mostly I am using Java and Jaybird driver (4.0.8.java11).
When Firebird hangs - I get errors:

java.sql.SQLNonTransientConnectionException: Unable to complete network request to host "xx.xx.xx.xx". [SQLState:08006, ISC error code:335544721]
        at org.firebirdsql.gds.ng.FbExceptionBuilder$Type$5.createSQLException(FbExceptionBuilder.java:598)
        at org.firebirdsql.gds.ng.FbExceptionBuilder$ExceptionInformation.toSQLException(FbExceptionBuilder.java:492)
        at org.firebirdsql.gds.ng.FbExceptionBuilder.toSQLException(FbExceptionBuilder.java:223)
        at org.firebirdsql.gds.ng.wire.WireConnection.socketConnect(WireConnection.java:236)
        at org.firebirdsql.gds.ng.wire.FbWireDatabaseFactory.performConnect(FbWireDatabaseFactory.java:50)
        at org.firebirdsql.gds.ng.wire.FbWireDatabaseFactory.connect(FbWireDatabaseFactory.java:39)
        at org.firebirdsql.gds.ng.wire.FbWireDatabaseFactory.connect(FbWireDatabaseFactory.java:32)
        at org.firebirdsql.jca.FBManagedConnection.<init>(FBManagedConnection.java:145)
        at org.firebirdsql.jca.FBManagedConnectionFactory.createManagedConnection(FBManagedConnectionFactory.java:599)
        at org.firebirdsql.jca.FBStandAloneConnectionManager.allocateConnection(FBStandAloneConnectionManager.java:65)
        at org.firebirdsql.jdbc.FBDataSource.getConnection(FBDataSource.java:109)
        at org.firebirdsql.jdbc.FBDriver.connect(FBDriver.java:114)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:228)

Also when connecting from IBExpert - it waits around one minute and then says connection failed.

Did you try to connect using isql at such a moment?
Could you try an embedded (hostless) connection?

Not yet. I will try this the next time the server freezes.
Sadly, I can't reproduce this problem on demand.

@agx4ever
Author

From Java I'm using this connection string:

jdbc:firebirdsql://%s:%s/%s?charSet=%s&encoding=%s&roleName=%s&TRANSACTION_READ_COMMITTED=isc_tpb_read_committed,isc_tpb_nowait,isc_tpb_rec_version

Today the FB4 server froze again, so I tried to connect with the isql tool from the same server.

[root@dev1 fb402]# ./isql
Use CONNECT or CREATE DATABASE to specify a database
SQL> connect 127.0.0.1:dev_main user SYSDBA password 'VerySecurePass';
Statement failed, SQLSTATE = 08004
connection rejected by remote interface
SQL>

It took around 1 minute before the error message appeared. After that I tried an embedded connection without specifying user/pass.

[root@dev1 fb402]# ./isql
Use CONNECT or CREATE DATABASE to specify a database
SQL> connect /data/fb4/dev_main.fdb ;

It never connected, and there were no error messages or anything else.
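
The two attempts above fail in different ways: the remote connect returns an error after about a minute, while the embedded connect never returns. A small wrapper around coreutils `timeout` can make that distinction explicit. This is only a sketch; `probe` is a hypothetical helper and the demonstration uses stand-in commands instead of isql.

```shell
#!/bin/sh
# Classify a probe command as ok / error / hang using coreutils `timeout`.
# probe is a hypothetical helper; `timeout` exits with 124 when the limit is hit.
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    case $? in
        0)   echo ok ;;
        124) echo hang ;;
        *)   echo error ;;
    esac
}

# Self-contained demonstration with stand-in commands:
probe 5 true        # a command that succeeds
probe 5 false       # a command that fails quickly
probe 1 sleep 3     # a command that outlives the limit
```

In the situation described here, the wrapped command would be the isql call reading `connect ...;` from standard input. How isql reports a failed connect through its exit status is an assumption that should be verified locally.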

@hvlad
Member

hvlad commented Feb 24, 2023

Could you provide a full memory dump of the firebird process and another one of the hung isql (with the embedded connection)?

@agx4ever
Author

agx4ever commented Mar 6, 2023

Could you provide a full memory dump of the firebird process and another one of the hung isql (with the embedded connection)?

Today I encountered the same problem and made the requested memory dumps.
The isql dump was quite small, but the firebird process dump took a while and is ~140GB. The compressed file is ~230MB. I can send a Dropbox share link, but only privately (as it may contain sensitive data).
My guess would be that Firebird freezes due to memory allocation problems, but then it should write a corresponding error message or something like that. Also, server memory and swap space weren't fully used.
Where can I send the share link?

@AlexPeshkoff
Member

This core file should go to me, but how was it compressed? Please make sure you've used tar with the --sparse switch to process the core dump, otherwise I may have problems decompressing it. xz may also be used: it automatically detects sparse files.
Please send the link to peshkoff@mail.ru.

@agx4ever
Author

agx4ever commented Mar 7, 2023

This core file should go to me, but how was it compressed? Please make sure you've used tar with the --sparse switch to process the core dump, otherwise I may have problems decompressing it. xz may also be used: it automatically detects sparse files. Please send the link to peshkoff@mail.ru.

I just sent the link to the dump files.
The dump was compressed with the "gzip -9" command.

@AlexPeshkoff
Member

Please use xz next time instead of gzip for core dumps (the file took more than an hour to decompress, due to disk load). It's not an issue with the compressed size, it's about the core dump being sparse.
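
To illustrate the point about sparse dumps, here is a minimal sketch that packs a sparse stand-in file with `tar --sparse` and checks that the holes survive extraction. It assumes GNU tar; `pack_sparse` and the file names are hypothetical, and the stand-in file is created with `truncate` rather than taken from a real core dump.

```shell
#!/bin/sh
# Pack a sparse file so that its holes survive the round trip.
# pack_sparse is a hypothetical helper and needs GNU tar.
pack_sparse() {
    src=$1
    archive=$2
    # Prefer xz as requested above; fall back to gzip if xz is not installed.
    if command -v xz >/dev/null 2>&1; then comp=-J; else comp=-z; fi
    # --sparse records holes as metadata instead of long runs of zero bytes.
    tar --sparse "$comp" -cf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}

work=$(mktemp -d)
# Stand-in for a core dump: 64 MiB apparent size, no data blocks allocated.
truncate -s 64M "$work/core.demo"
pack_sparse "$work/core.demo" "$work/core-demo.archive"

mkdir "$work/out"
# GNU tar detects the compression on extraction and recreates the holes.
tar -xf "$work/core-demo.archive" -C "$work/out"
ls -l "$work/out/core.demo"    # apparent size
du -k "$work/out/core.demo"    # space actually allocated
```

Plain `gzip` on the raw file, by contrast, has to write every zero byte back to disk on decompression, which matches the hour-long wait described above.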

@AlexPeshkoff
Member

AlexPeshkoff commented Mar 7, 2023

I also need the following libraries from your box:
ld-linux-x86-64.so.2
libc.so.6
libdl.so.2
libgcc_s.so.1
libgpm.so.2
libm.so.6
libpthread.so.0
librt.so.1
libstdc++.so.6
libthread_db.so.1
libtommath.so.0
libz.so.1
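
One way to gather such a set is to start from what `ldd` reports for the server binary. The sketch below is an assumption-laden illustration: `ldd_paths` and `pack_libs` are hypothetical helpers, the binary path in the example is a guess at the install location, and `ldd` only shows link-time dependencies, so libraries loaded later (plugins, libthread_db) would still have to be added by hand.

```shell
#!/bin/sh
# Turn `ldd` output into a list of library files to copy.
# ldd_paths is a hypothetical helper that reads ldd output on stdin.
ldd_paths() {
    awk '
        # lines like: libc.so.6 => /lib64/libc.so.6 (0x...)
        $2 == "=>" && $3 ~ /^\// { print $3; next }
        # lines like: /lib64/ld-linux-x86-64.so.2 (0x...)
        $1 ~ /^\// { print $1 }
    '
}

# Pack the resolved libraries; -h stores the files that symlinks such as
# libc.so.6 point at, and -T - reads the file list from stdin (GNU tar).
pack_libs() {
    binary=$1
    archive=$2
    ldd "$binary" | ldd_paths | tar -chJf "$archive" -T -
}

# Example (not run here; the path is an assumption):
#   pack_libs /opt/firebird/bin/firebird fb-libs.tar.xz
```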

@agx4ever
Author

agx4ever commented Mar 7, 2023

Sorry for using the wrong compression method.
I just added the required libs from the Linux server to the same shared folder. I used tar with xz compression this time.
If you need anything else, just ask.

@AlexPeshkoff
Member

AlexPeshkoff commented Mar 7, 2023

Definitely wrong libraries:
Error while mapping shared library sections:
`/opt/lib64/libdl.so.2': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
`/opt/lib64/libm.so.6': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
`/opt/lib64/libgcc_s.so.1': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
`/opt/lib64/libpthread.so.0': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
`/opt/lib64/libc.so.6': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
`/opt/lib64/libstdc++.so.6': Shared library architecture i386 is not compatible with target architecture i386:x86-64.

@agx4ever
Author

agx4ever commented Mar 7, 2023

Sorry. It looks like I copied some wrong files from the 32-bit folder.
I just uploaded the correct libraries from the x64 directory. Please check the shared folder once more.
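
A mix-up like this can be caught before uploading by looking at the ELF header: byte 4 (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit ones. The helper below is a hypothetical sketch that uses only `od`.

```shell
#!/bin/sh
# Report whether a file is a 32-bit or 64-bit ELF object by reading its header.
# elf_class is a hypothetical helper. Bytes 0-3 are the ELF magic, and
# byte 4 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects.
elf_class() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || { echo "not-elf"; return 1; }
    case $(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n') in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown ;;
    esac
}

# Example (not run here): check every library in a folder before packing it.
#   for f in ./libs/*.so*; do echo "$f: $(elf_class "$f")"; done
```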

@AlexPeshkoff
Member

Most likely your hang is already fixed in the current codebase. Please try the current snapshot. In any case it should provide more informative core dumps.

PS. If the snapshot hangs anyway (with the current dump it's hard to diagnose the exact reason), please do not try to attach to the server 16000 times: almost all of the core dump (>90% of its size) consists of stacks of attach threads waiting in the same place.

@agx4ever
Author

I installed the 4.0.3 snapshot build and it worked for almost 2 weeks without problems, but today Firebird stalled. I made another core dump and uploaded it to the same file share, in a folder named "2023-mar-27".
If more information is needed or you have any recommendations, just tell me.

@AlexPeshkoff
Member

I also need the snapshot binaries + debug info.

@agx4ever
Author

I uploaded the Firebird 4.0.3 binaries I'm using. What kind of debug info do you need?

@AlexPeshkoff
Member

The one which came with that file: Firebird-debuginfo-4.0.3.2906-0.amd64.tar.gz

@AlexPeshkoff AlexPeshkoff self-assigned this Mar 28, 2023
@agx4ever
Author

I don't have the debuginfo archive from that snapshot build :( and that snapshot is no longer available on the Firebird download page. I didn't know that I had to save the debuginfo archive when downloading a snapshot.
I will install today's snapshot, save the debuginfo archive and try to reproduce the problem.

@agx4ever
Author

After the last problem I installed the newest 4.0.3 snapshot build, and it worked for around 3 weeks without problems. But today Firebird stalled again, so I made another core dump and uploaded it to the same file share, in a folder named "2023-apr-24". I also included the Firebird snapshot binaries and the debuginfo package. If anything else is needed, just ask.

@AlexPeshkoff
Member

Once again this is a new case, never seen before in your dumps. The symptoms may look similar, but the reason is definitely different.

Sorry, the only thing I could do this time is to enhance the collection of debugging information (3019afa).

@agx4ever
Author

agx4ever commented Jun 8, 2023

Sorry for the long silence on this issue. I was experimenting with different configurations to look for clues. Firebird stalled a few more times. I even restored the database from a backup to rule out metadata corruption. This time, after a fresh restart, Firebird worked for around 5 days and then stalled again today (to be precise, last night).
I made a core dump and included the Firebird version and the debuginfo package. I uploaded everything to the same share as before, in the folder "2023-jun-08". If you need something else or can't access it, just let me know.
Suggestions or ideas are also welcome. It is strange that no one else sees the same problem.

@agx4ever
Author

Today I made another 2 dumps that I believe were taken right before Firebird hangs.
I executed a simple update query (from IBExpert) to update one field by primary key, and it just stalled and never finished.
After that I connected to the database from another IBExpert instance and wanted to kill my previous connection. So I opened the "Database Monitoring" tool and tried to list all active statements. It executes this code:
select
    st.mon$statement_id as Statement_ID,
    st.mon$attachment_id as Attachment_ID,
    st.mon$explained_plan as Explained_Plan,
    st.mon$transaction_id Transaction_ID,
    a.mon$user as User_Name,
    a.mon$remote_address as Remote_Address,
    a.mon$remote_pid as Remote_PID,
    a.mon$remote_process as Remote_Process,
    a.mon$client_version as Client_version,
    a.mon$remote_version as Remote_Protocol_Version,
    a.mon$remote_host as Remote_Host_Name,
    a.mon$remote_os_user as Remote_User_Name,
    a.mon$auth_method as Authentication_Method,
    case when a.mon$system_flag = 0 then 'Normal' when a.mon$system_flag = 1 then 'System' end as Connection_Type,
    a.mon$idle_timeout as Idle_Timeout,
    a.mon$idle_timer as Idle_Timer,
    a.mon$statement_timeout as Statement_Timeout,
    a.mon$wire_compressed as Wire_Compressed,
    a.mon$wire_encrypted as Wire_Encrypted,
    a.mon$wire_crypt_plugin as Wire_Crypt_Plugin,
    case when st.mon$state = 0 then 'IDLE' when st.mon$state = 1 then 'ACTIVE' end as State,
    st.mon$timestamp Started_At,
    st.mon$sql_text Statement_Text,
    st.mon$statement_timeout as Statement_Timeout,
    st.mon$statement_timer as Statement_Timer,
    r.mon$record_seq_reads as Non_indexed_Reads,
    r.mon$record_idx_reads as Indexed_Reads,
    r.mon$record_inserts as Records_Inserted,
    r.mon$record_updates as Records_Updated,
    r.mon$record_deletes as Records_Deleted,
    r.mon$record_backouts as Records_Backed_Out,
    r.mon$record_purges as Records_Purged,
    r.mon$record_expunges as Records_Expunged,
    r.mon$record_locks as Record_Locks,
    r.mon$record_waits as Record_Waits,
    r.mon$record_conflicts as Record_Conflicts,
    r.mon$backversion_reads as Backversion_Reads,
    r.mon$fragment_reads as Fragment_Reads,
    r.mon$record_rpt_reads as Record_Rpt_Reads,
    r.mon$record_imgc as Records_IMGC,
    io.mon$page_reads as Page_Reads,
    io.mon$page_writes as Page_Writes,
    io.mon$page_fetches as Page_Fetches,
    io.mon$page_marks as Page_Marks
from mon$statements st
join mon$attachments a on a.mon$attachment_id = st.mon$attachment_id
join mon$record_stats r on (st.mon$stat_id = r.mon$stat_id)
join mon$io_stats io on (st.mon$stat_id = io.mon$stat_id)
order by st.mon$timestamp
It never finished, it just stalled. Then I tried to close my stalled connections from a few more computers, but those connections got stuck in the same manner. I believe this is the beginning of Firebird hanging.
In order to continue working, I restarted the Firebird service, and after that my update query and the monitoring queries worked just fine.
Between those steps I made two core dumps. I uploaded them to the "2023-jun-13" folder in the same share.
Firebird and the libs are the same as in the "2023-jun-08" folder.

@EPluribusUnum

From time to time we also have the same issue: Firebird stops accepting new connections and selects from MON$ tables freeze in active clients. Unfortunately we could not produce a dump. We hope this issue will be resolved with the help of the new dumps.

@AlexPeshkoff
Member

I do not remember where to download the core dumps from. Also, please put the binaries and debug info there.

@agx4ever
Author

@AlexPeshkoff I just re-sent the access information for the core dumps to your email.

@AlexPeshkoff
Member

It looks like you have embedded connections to your database, and those embedded connections sometimes hang. I see no other reason for the current behavior. To better understand what happens, the next time you have that problem please do the following in addition to the core dump:
fb_lock_print -d /srv/fb4/dev_main.fdb -c -a >somefile.txt
and add somefile.txt together with the core dump.
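
Since both artifacts have to be captured while the server is in the bad state, it may help to script the two steps. The following is a sketch only: `collect_diag` is a hypothetical helper, the `fb_lock_print` options are the ones quoted above, `gcore` is the tool already used earlier in this thread, and the `FB_LOCK_PRINT` and `GCORE` variables exist just so the tools can be pointed at a specific install or stubbed out.

```shell
#!/bin/sh
# Collect a lock table print and a core dump into one timestamped folder.
# collect_diag is a hypothetical helper. gcore ships with gdb and writes
# its output to <prefix>.<pid>.
collect_diag() {
    db=$1
    pid=$2
    out=${3:-fb-diag-$(date +%Y%m%d-%H%M%S)}
    mkdir -p "$out" || return 1
    # Design choice: take the quick lock print before the slow core dump.
    "${FB_LOCK_PRINT:-fb_lock_print}" -d "$db" -c -a > "$out/lock_print.txt"
    "${GCORE:-gcore}" -o "$out/firebird" "$pid" >&2
    echo "$out"
}

# Example (not run here; the process name is an assumption, and pidof may
# return several PIDs if more than one process matches):
#   collect_diag /srv/fb4/dev_main.fdb "$(pidof firebird)"
```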

@agx4ever
Author

Today FB again started to show hanging symptoms, so I made a core dump and also ran fb_lock_print as suggested, writing to somefile.txt ;)
All requested files are uploaded to the same share under the folder: 2023-jun-19

@AlexPeshkoff
Member

AlexPeshkoff commented Jun 20, 2023 via email

@AlexPeshkoff
Member

AlexPeshkoff commented Jun 20, 2023 via email

@agx4ever
Author

Yes, of course I'm ready to run a special build. Just give it to me and I will definitely give it a try.

@agx4ever
Author

Update on the issue.
I have installed the special build from @AlexPeshkoff, which has built-in debugging and produces a core dump when suspicious conditions are met.
Firebird has already crashed a few times and should have produced core dumps, but because of my server misconfiguration all those core dumps were truncated and are useless. I have now reconfigured the server (a few times, actually) to save full core dumps, and I hope I will soon have the necessary debug info.
My bet is that those debug/suspicious conditions mark the right place in the code for this problem, because now, when Firebird crashes, it produces a core dump and systemd restarts Firebird. It no longer stays in a halted/hung state.
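
The thread does not say what exactly was misconfigured. One common cause of truncated cores is a finite core-file size limit on the server process, which on Linux can be read from `/proc/<pid>/limits`. The helper below is a hypothetical sketch that extracts the soft limit from such a listing.

```shell
#!/bin/sh
# Read the soft core-file size limit from a /proc/<pid>/limits listing.
# core_soft_limit is a hypothetical helper that reads the listing on stdin.
# Columns are: limit name, soft limit, hard limit, units.
core_soft_limit() {
    awk '/^Max core file size/ { print $5 }'
}

# Current shell, as a stand-in for the server process (Linux only):
if [ -r /proc/self/limits ]; then
    echo "core soft limit: $(core_soft_limit < /proc/self/limits)"
fi

# For the running server (not run here; the process name is an assumption):
#   core_soft_limit < "/proc/$(pidof firebird)/limits"
```

Any value other than `unlimited` can truncate a large dump. For a service managed by systemd, this limit normally comes from the unit's `LimitCORE=` setting.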

@agx4ever
Author

I have successfully acquired 4 core dumps with the provided special FB build. All files are uploaded to the same file share as before, under the folder "2023-jul-29". The debuginfo, Firebird binaries and libs used are there as well.
If anything else is needed, just ask.
Thank you for your support!

@AlexPeshkoff
Member

Good news: all 4 dumps show exactly the state that I expected, all are reasonably similar and rather informative.
I also need your firebird.log and the exact times when the dumps were created, since Dropbox loses file creation time info.
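
Because the upload does not preserve timestamps, they can be written into a small manifest next to the dumps before uploading. This is a sketch under assumptions: `write_manifest` and `MANIFEST.txt` are made-up names, `stat -c` is the GNU coreutils form, and the recorded time is the modification time (when writing the dump finished), not a true creation time.

```shell
#!/bin/sh
# Write a manifest of dump files with modification time and size, so the
# timestamps survive an upload that does not preserve them.
# write_manifest is a hypothetical helper; `stat -c` is GNU coreutils syntax.
write_manifest() {
    dir=$1
    ( cd "$dir" && for f in *; do
          # Skip directories and the manifest itself (it already exists,
          # because the redirection below creates it before the loop runs).
          [ -f "$f" ] && [ "$f" != MANIFEST.txt ] || continue
          stat -c '%y  %s  %n' "$f"
      done ) > "$dir/MANIFEST.txt"
}

# Example (not run here; the folder name is taken from this thread):
#   write_manifest ./2023-jul-29
```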

@agx4ever
Author

agx4ever commented Aug 1, 2023

Very good news!
I just uploaded the firebird.log file to the same folder. You can find the exact times of these 4 uploaded core dumps by reading from the end of the log file. There are older abnormal termination entries as well.

@AlexPeshkoff
Member

I see you've sent a very truncated log. What is in the log AFTER the abort is not interesting, though; I want to see whether something happened right BEFORE the abort.
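
Pulling out the lines that precede each abort can be done with `grep -B`. The sketch below uses a hypothetical helper, `log_before`; its default search pattern is an assumption about how the abnormal termination entries are worded and may need adjusting to match the actual log.

```shell
#!/bin/sh
# Print the N log lines leading up to each line that matches a pattern.
# log_before is a hypothetical helper; the default pattern is an assumption
# about the wording of abnormal termination entries.
log_before() {
    log=$1
    lines=${2:-20}
    pattern=${3:-terminated abnormally}
    # -n adds line numbers; -B prints that many lines of leading context.
    grep -n -B "$lines" -- "$pattern" "$log"
}

# Example (not run here; the log path is an assumption):
#   log_before /opt/firebird/firebird.log 40
```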

@agx4ever
Author

agx4ever commented Aug 1, 2023

It's the full log as it is on the server. I haven't removed any entries. There is nothing interesting there.
Maybe there are options to output more detailed info? If so, I need instructions on how to set up such logging.

@AlexPeshkoff
Member

Sorry, it looked truncated. And no, there are no such options. OK, a negative result is also a result.

@AlexPeshkoff
Member

Please install the new special build from
https://drive.google.com/drive/folders/14JaiJoRBNhgHBkfolnBHZZDP6pu9Owg0?usp=sharing
As soon as you get the first core, please report it.

@agx4ever
Author

agx4ever commented Aug 8, 2023

Thank you for your fast response!
As you asked, I installed the special build, and today I got a new core dump.
As always, I uploaded it to the same file share as before, under the folder "2023-aug-08", and I attached the log file as well.

@AlexPeshkoff
Member

FB3 is almost unaffected: the AST on encryption state change should not happen too often (unlike the TPC one since FB4). Anyway, I backported the required part of the fix to it.

@AlexPeshkoff
Member

@agx4ever You can upgrade to tomorrow's snapshot (just make sure it's OK on http://firebirdtest.com/); it will contain the fix for your bug. But if you can provide me with 2 or 3 more dumps, it will help us make sure we fixed all possible causes of the bug.

@agx4ever
Author

After I installed the snapshot build with this fix, everything works fine and the Firebird server hasn't crashed for two months.
It seems that this issue is fixed. Thank you for your fast support and problem debugging!
When will this fix be published in a regular release build?

@AlexPeshkoff
Member

AlexPeshkoff commented Oct 13, 2023 via email
