Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Many connections results in poor performance [CORE6296] #6538

Closed
firebird-issue-importer opened this issue May 2, 2020 · 20 comments
Closed

Many connections results in poor performance [CORE6296] #6538

firebird-issue-importer opened this issue May 2, 2020 · 20 comments

Comments

@firebird-issue-importer
Copy link

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Submitted by: Thomas Kragh (realic)

Relate to CORE6348

Attachments:
2020_05_02---fb_test--1.png
2020_05_02---fb_test--2.png
prod-db4.PNG
uvd-db2.PNG

------------UPDATE------------
1) The ping statement have been updated to select from RDB$database instead of mon$database.
2) The parameters of the console applications have been ajusted for the new ping statement
3) This bug report stated that Classic architecture was also affected, this looks not to be the case, after updating the ping statement.

Hey
Where I work, we run a fairly large education platform, at peek hours the database servers execute approximately 1200 to 1400 query's/second. Over the last 4-5 weeks I have been working with Alexey from IBSurgeon to resolve a performance issue that we have been having.

The problem occurs when the system is under high load, and over a period of 5-10 seconds the number of attachments rises from eg. 125 to 800 attachments. 800 is the max limit allowed by the connection pools, eg. 4 web servers with a 200 max pool size. Under high load with 125 attachments the web application is spending 13 milliseconds on average waiting for the database to execute query's, when we reach 800 attachments this number i 10-20 times greater.

I believe that the quick rise in attachments can be correlated with garbage collection in .Net, where the application is suspended for eg. 100ms, but the webserver still receives requests, when GC is done, the queued requests are executed in parallel, causing perhaps 100 new attachments to the database. This again causes the database server to respond slower that the average 13ms, when this happens the 3 other webservers will need to create more attachments to the database, causing a chain reaction.
After hitting 800 attachments the average query time is 10-20 times slower, this can last for hours if the web application is not restarted, after witch everything returns to normal.

To recreate what I think is the problem, I have created a sample application(see download link in bottom), that shows that overall query speed slows down proportional to the number of attachments.

Reproduce steps:
1) Ajust the connectionstring found in App.config
2) Start one instance of the console application with the command "FbPerformance.exe --testcase ping --sleep 500", --testcase describes the "mode", and --sleep is the time between ping. Leave the application running. Ping query is a simple "select current_timestamp from rdb$database".
3) Start another instance of the application with the command "FbPerformance.exe --testcase ambient --count 1200" testcase is "ambient" and creates 1200 connections/transactions to the database. No statements will be executed on these connections!
4) While the second application creates attachments please observe the "execute statement" time that the ping application prints. This will go from 2-4ms to 20-40ms.

Step 1-4 is the steps to reprocuce. 5-6 is bonus information

5) Stop the ping application and change --sleep to 20, and observe the CPU usage of the Firebird process, witch i find quite high compared to the small amount of work.
6) Stop the ambient instance from step 3, and observe that start transaction/execute statement in the ping application can take quite some time - I have observed cases of 600-1400ms, after witch executing statement is back to about 2-4ms.

This behavior has been observed on Firebird 3.0.5 running Centos7, in SuperServer mode, where the test application was runnning on Windows Server 2012R2.
And on Firebird 3.0.5 running Windows 10 in SuperClassic and SuperServer mode, where the test application was run on the same machine.

Also note that if I Attache a trace session(FBTraceManager3) it will report that the ping statement takes less that 1ms to execute, even though the ping application reports 40ms.

sample application
https://mublobstorageprod.blob.core.windows.net/mudrift/fbperformance/sampleapplication.zip?st=2020-05-01T21%3A04%3A08Z&se=2025-05-02T21%3A04%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=SiwOnTZ15E4HQsY3aNtZnkFmTlseiwN8xt5tG%2BmJ3qY%3D

sample application source
https://mublobstorageprod.blob.core.windows.net/mudrift/fbperformance/sampleapplication_src.zip?st=2020-05-01T21%3A04%3A51Z&se=2025-05-02T21%3A04%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=3qt59y1TcbFkt12%2FUWm1dTPHooo1ELvE77X5tXpHLB0%3D

I hope this is enough information for you guys to track down this issue, and thank for a great database server!
Best regards Thomas

Commits: 04452e3 b5986d5

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Commented by: @livius2

>> Ping query is a simple "select current_timestamp from mon$database"

Sorry byt this is not a simple select. It is very costly select which take snapshot of database state.
Simple select will be from RDB$database but not from MON$

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Commented by: Thomas Kragh (realic)

Hey Karol
Thank you for the response, the query was chosen so that the program could be run against all databases.
Of coures using RDB$database is better.

Anyway, this can also be recreated whit a primery key query in the employee database, in the .config file you can replace the pingstatement config line with this:
<add key="pingstatement" value="select e.hire_date as "current_timestamp" from employee e where e.emp_no = 20" />

It might be necessary to increase the --count parameter to perhaps 1200 to see big changes using this query. When i ran the test using Classic, SuperClassic and SuperServet, it looks like Classic is less affected when using this new ping query.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Commented by: @ibprovider

I executed your test on my workstation [Win10, FB3.0.6 SS x64, DB on RAM Disk] :)

select current_timestamp from rdb$database

FbPerformance.exe --testcase ping --sleep 2
FbPerformance.exe --testcase ambient --count 1800

Two different providers:
- FirebirdSql.Data.FirebirdClient
- lcpi.data.oledb + LPCI.IBProvider.5 + LCPI OLE DB Services (I little adapted your code)

Not any visible problems. Execution time increased, but not critical.
- FirebirdSql.Data.FirebirdClient - to 9-11 ms
- lcpi.data.oledb - to 6-8ms

See attached screenshot - 2020_05_02---fb_test--1.png.

(About slow time of connection through lcpi.data.oledb. I think, it is result of "right" verification of connection before reusing. Provider make ping of server).

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Modified by: @ibprovider

Attachment: 2020_05_02---fb_test--1.png [ 13430 ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Modified by: Thomas Kragh (realic)

description: Hey
Where I work, we run a fairly large education platform, at peek hours the database servers execute approximately 1200 to 1400 query's/second. Over the last 4-5 weeks I have been working with Alexey from IBSurgeon to resolve a performance issue that we have been having.

The problem occurs when the system is under high load, and over a period of 5-10 seconds the number of attachments rises from eg. 125 to 800 attachments. 800 is the max limit allowed by the connection pools, eg. 4 web servers with a 200 max pool size. Under high load with 125 attachments the web application is spending 13 milliseconds on average waiting for the database to execute query's, when we reach 800 attachments this number i 10-20 times greater.

I believe that the quick rise in attachments can be correlated with garbage collection in .Net, where the application is suspended for eg. 100ms, but the webserver still receives requests, when GC is done, the queued requests are executed in parallel, causing perhaps 100 new attachments to the database. This again causes the database server to respond slower that the average 13ms, when this happens the 3 other webservers will need to create more attachments to the database, causing a chain reaction.
After hitting 800 attachments the average query time is 10-20 times slower, this can last for hours if the web application is not restarted, after witch everything returns to normal.

To recreate what I think is the problem, I have created a sample application(see download link in bottom), that shows that overall query speed slows down proportional to the number of attachments.

Reproduce steps:
1) Ajust the connectionstring found in App.config
2) Start one instance of the console application with the command "FbPerformance.exe --testcase ping --sleep 500", --testcase describes the "mode", and --sleep is the time between ping. Leave the application running. Ping query is a simple "select current_timestamp from mon$database".
3) Start another instance of the application with the command "FbPerformance.exe --testcase ambient --count 800" testcase is "ambient" and creates 800 connections/transactions to the database. No statements will be executed on these connections!
4) While the second application creates attachments please observe the "execute statement" time that the ping application prints. This will go from 2-4ms to 20-40ms.

Step 1-4 is the steps to reprocuce. 5-6 is bonus information

5) Stop the ping application and change --sleep to 20, and observe the CPU usage of the Firebird process, witch i find quite high compared to the small amount of work.
6) Stop the ambient instance from step 3, and observe that start transaction/execute statement in the ping application can take quite some time - I have observed cases of 600-1400ms, after witch executing statement is back to about 2-4ms.

This behavior has been observed on Firebird 3.0.5 running Centos7, in SuperServer mode, where the test application was runnning on Windows Server 2012R2.
And on Firebird 3.0.5 running Windows 10 in Classic, SuperClassic and SuperServer mode, where the test application was run on the same machine.

Also note that if I Attache a trace session(FBTraceManager3) it will report that the ping statement takes less that 1ms to execute, even though the ping application reports 40ms.

sample application
https://mublobstorageprod.blob.core.windows.net/mudrift/fbperformance/sampleapplication.zip?st=2020-05-01T21%3A04%3A08Z&se=2025-05-02T21%3A04%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=SiwOnTZ15E4HQsY3aNtZnkFmTlseiwN8xt5tG%2BmJ3qY%3D

sample application source
https://mublobstorageprod.blob.core.windows.net/mudrift/fbperformance/sampleapplication_src.zip?st=2020-05-01T21%3A04%3A51Z&se=2025-05-02T21%3A04%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=3qt59y1TcbFkt12%2FUWm1dTPHooo1ELvE77X5tXpHLB0%3D

I hope this is enough information for you guys to track down this issue, and thank for a great database server!
Best regards Thomas

=>

------------UPDATE------------
1) The ping statement have been updated to select from RDB$database instead of mon$database.
2) The parameters of the console applications have been ajusted for the new ping statement
3) This bug report stated that Classic architecture was also affected, this looks not to be the case, after updating the ping statement.

Hey
Where I work, we run a fairly large education platform, at peek hours the database servers execute approximately 1200 to 1400 query's/second. Over the last 4-5 weeks I have been working with Alexey from IBSurgeon to resolve a performance issue that we have been having.

The problem occurs when the system is under high load, and over a period of 5-10 seconds the number of attachments rises from eg. 125 to 800 attachments. 800 is the max limit allowed by the connection pools, eg. 4 web servers with a 200 max pool size. Under high load with 125 attachments the web application is spending 13 milliseconds on average waiting for the database to execute query's, when we reach 800 attachments this number i 10-20 times greater.

I believe that the quick rise in attachments can be correlated with garbage collection in .Net, where the application is suspended for eg. 100ms, but the webserver still receives requests, when GC is done, the queued requests are executed in parallel, causing perhaps 100 new attachments to the database. This again causes the database server to respond slower that the average 13ms, when this happens the 3 other webservers will need to create more attachments to the database, causing a chain reaction.
After hitting 800 attachments the average query time is 10-20 times slower, this can last for hours if the web application is not restarted, after witch everything returns to normal.

To recreate what I think is the problem, I have created a sample application(see download link in bottom), that shows that overall query speed slows down proportional to the number of attachments.

Reproduce steps:
1) Ajust the connectionstring found in App.config
2) Start one instance of the console application with the command "FbPerformance.exe --testcase ping --sleep 500", --testcase describes the "mode", and --sleep is the time between ping. Leave the application running. Ping query is a simple "select current_timestamp from rdb$database".
3) Start another instance of the application with the command "FbPerformance.exe --testcase ambient --count 1200" testcase is "ambient" and creates 1200 connections/transactions to the database. No statements will be executed on these connections!
4) While the second application creates attachments please observe the "execute statement" time that the ping application prints. This will go from 2-4ms to 20-40ms.

Step 1-4 is the steps to reprocuce. 5-6 is bonus information

5) Stop the ping application and change --sleep to 20, and observe the CPU usage of the Firebird process, witch i find quite high compared to the small amount of work.
6) Stop the ambient instance from step 3, and observe that start transaction/execute statement in the ping application can take quite some time - I have observed cases of 600-1400ms, after witch executing statement is back to about 2-4ms.

This behavior has been observed on Firebird 3.0.5 running Centos7, in SuperServer mode, where the test application was runnning on Windows Server 2012R2.
And on Firebird 3.0.5 running Windows 10 in SuperClassic and SuperServer mode, where the test application was run on the same machine.

Also note that if I Attache a trace session(FBTraceManager3) it will report that the ping statement takes less that 1ms to execute, even though the ping application reports 40ms.

sample application
https://mublobstorageprod.blob.core.windows.net/mudrift/fbperformance/sampleapplication.zip?st=2020-05-01T21%3A04%3A08Z&se=2025-05-02T21%3A04%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=SiwOnTZ15E4HQsY3aNtZnkFmTlseiwN8xt5tG%2BmJ3qY%3D

sample application source
https://mublobstorageprod.blob.core.windows.net/mudrift/fbperformance/sampleapplication_src.zip?st=2020-05-01T21%3A04%3A51Z&se=2025-05-02T21%3A04%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=3qt59y1TcbFkt12%2FUWm1dTPHooo1ELvE77X5tXpHLB0%3D

I hope this is enough information for you guys to track down this issue, and thank for a great database server!
Best regards Thomas

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Commented by: @ibprovider

2020_05_02---fb_test--2.png

Work after stoping of "ambient 1800" process.

Time of execute statement: 1-2ms for both providers.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Modified by: @ibprovider

Attachment: 2020_05_02---fb_test--2.png [ 13431 ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Modified by: Thomas Kragh (realic)

Attachment: prod-db4.PNG [ 13432 ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Modified by: Thomas Kragh (realic)

Attachment: uvd-db2.PNG [ 13433 ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 2, 2020

Commented by: Thomas Kragh (realic)

Hey All
So I ran the application against two servers instead of my laptop - see attached pictures

prod-db4 is Firebird 3 Superserver running on Windows Server, here the overall increase is about 120%
and
uvd-db2 is Firebird 3 Superserver running on CentOs, here the overall increase is about 1000%

As updated in the discription, it looks line classic doesn't have the problem as originally described.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 19, 2020

Modified by: @hvlad

assignee: Vlad Khorsun [ hvlad ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jun 11, 2020

Modified by: @hvlad

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 3.0.6 [ 10889 ]

Fix Version: 4.0 RC 1 [ 10930 ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jun 11, 2020

Commented by: Sean Leyne (seanleyne)

Vlad,

Please describe the nature of the "fix"

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jun 11, 2020

Commented by: @reevespaul

I ran the performance test with the new code and the initial results look good.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jun 11, 2020

Commented by: @hvlad

It is all about our TCP listener.

The fix uses sorted array of pollfd structures to avoid few places in our code with O(N^2) complexity.
Note, poll() is not used on Windows, so this problem is not present there.

Also when old code searched for the signalled port (after call of select\poll), it started from the head
of the ports list every time, now such search starts from last checked port in list.

--
PS I restored my comment as it is and was much more correct.
Please, avoid such "correcting" in the future.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jun 11, 2020

Commented by: Sean Leyne (seanleyne)

The "Environment" refers to Linux *and Windows*, and the details make same references. Is there a problem with Windows or not?

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jun 11, 2020

Commented by: @hvlad

Sean,

the Windows is affected too but in a much less degree.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jun 15, 2020

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: No test => Cannot be tested

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jun 15, 2020

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jul 2, 2020

Modified by: @AlexPeshkoff

Link: This issue relate to CORE6348 [ CORE6348 ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants