Excessive DELETE usage in v1.6.2 - See #628 (#643)
@barclayadam, can you show me the whole actual execution plan of the following query on your machine? It looks like the query optimizer on your machine chooses an unexpected plan for it:

DELETE TOP (100) FROM [HangFire].[Job] WITH (READPAST) WHERE [ExpireAt] < GETUTCDATE();
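In SSMS, one way to capture the actual (rather than estimated) plan for a standalone run is SET STATISTICS XML; a minimal sketch wrapping the same query:

-- Emit the actual execution plan as XML alongside the query results.
-- Note: this really deletes expired rows; run against a copy if needed.
SET STATISTICS XML ON;
DELETE TOP (100) FROM [HangFire].[Job] WITH (READPAST)
WHERE [ExpireAt] < GETUTCDATE();
SET STATISTICS XML OFF;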
@odinserj I have tested this on two separate databases (both of them S1-level SQL Azure v12 databases). They have a fairly similar usage pattern (around 900-3000 jobs per hour), yet one does not exhibit the slowdown. The slow query's plan does not have an Index Scan (interestingly, the quick one does). The quick database has 97,553 state records and 32,622 jobs; the slow DB has 5,630,739 state records and 1,855,029 jobs (which is ALL of the jobs ever processed).

DB without the issue: [execution plan screenshot]
DB with the issue: [execution plan screenshot]
Hm, do you know what causes such a heavy operation on the Clustered Index Delete? A lot of State records may cause this, but your DB has a 3:1 state-to-job ratio, and that's fine (however, for other cases it's worth adding ExpireAt to the State table in 1.7.0). Fragmentation? Slow I/O? What does the tooltip show?
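To rule fragmentation in or out, a minimal sketch using the sys.dm_db_index_physical_stats DMV (table names assumed to be the standard Hangfire schema):

-- Average fragmentation for every index on the two tables involved in expiration.
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       ips.index_id,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
WHERE ips.object_id IN (OBJECT_ID('HangFire.Job'), OBJECT_ID('HangFire.State'));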
Image below of the Clustered Index Delete tooltip: [tooltip screenshot] (also, see the full query plan at https://www.dropbox.com/s/1wprql0grywprjz/Hangfire%20deletion%20issue.sqlplan?dl=0)

I do not think this is an endemic issue with slow I/O; the same schema / database sizing works absolutely fine elsewhere. I'm not sure why there has not been a single job deleted since we started using Hangfire: the number of processed jobs (from the aggregate table) is almost the same as the number of jobs still in the table.

This may be a bit off-base, but have you considered not separating the parts of a job out into different tables, perhaps adopting more use of JSON for storing child states / parameters etc. that do not need separate indexing / querying, to avoid JOINs?
The estimated number of rows does not relate to the actual row count at all. The statistics are terrible; have you tried updating them for the State table?
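For example (FULLSCAN is optional, but gives the most accurate histogram):

-- Refresh statistics on both tables touched by the expiration query.
UPDATE STATISTICS [HangFire].[State] WITH FULLSCAN;
UPDATE STATISTICS [HangFire].[Job] WITH FULLSCAN;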
This is very strange! Is this database used by the latest Hangfire version? A long time ago, a naive query…
I have an idea on how to make it possible with the JobParameter table, but it involves breaking changes, and will be postponed to 2.0 anyway. The State table can't be replaced with JSON, because it should be fast enough for a lot of inserts.
Today I was investigating how we can tune the job deletion without changing the schema. The only solution I've found is to remove records from the State table first:

delete from [HangFire].[State]
where [Id] in (
select top (100) s.[Id]
from [HangFire].[State] s with (readpast, xlock, forceseek)
inner join [HangFire].[Job] j with (readcommittedlock, forceseek)
on j.[Id] = s.[JobId]
where j.[ExpireAt] < GETUTCDATE()
)
But the query above does not prevent SCANs when deleting rows from the Job table.
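As a rough sketch of the direction (hypothetical, and FORCESEEK assumes a usable index on [ExpireAt]), the same kind of hints could be applied to the Job delete itself:

-- Hypothetical rewrite: hint the optimizer toward a seek instead of a scan.
delete top (100) j
from [HangFire].[Job] j with (readpast, forceseek)
where j.[ExpireAt] < GETUTCDATE();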
Today I've found a correct way to get rid of INDEX SCAN operators when removing records from the Job table. The only weird thing in your query execution plan is the estimated number of rows, and that 69% of the time is spent on record removal from the clustered index. However, this is an estimated number and may be wrong: another estimated number is already wrong. @barclayadam, what is the maximum number of state and parameter records per job record that you have? You may use the following scripts to obtain the numbers. You've already sent the total numbers, but although they suggest 3 states per job, the actual maximums may be different.

select max(t.Count) as MaxStateCount
from (select count(id) as Count from HangFire.State with (nolock) group by JobId) t
select max(t.Count) as MaxParameterCount
from (select count(id) as Count from HangFire.JobParameter with (nolock) group by JobId) t
It seems as if the excessive deletion query times were caused by poor statistics; after updating all statistics I get a better estimated row count, which leads to much better deletion times (I am currently deleting all jobs to get back to 'normal' execution times with all expired jobs removed). I executed the given queries, which gave a MaxStateCount of 43 and a MaxParameterCount of 3 (I also computed the average state count, which was 3). This database has never had anything below v1.5 (not sure what the patch level was on first execution) and is kept up to date.
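For reference, the average can be computed with a query mirroring the MaxStateCount script, e.g.:

-- Average number of state records per job; integer average,
-- cast Count to float for a fractional result.
select avg(t.Count) as AvgStateCount
from (select count(id) as Count from HangFire.State with (nolock) group by JobId) t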
TL;DR: Just read the headings

CommandTimeout is the root of the problem
Prior to Hangfire.SqlServer 1.6.3 (released 3 days after your first message here), the default command timeout value (30 seconds) was used. SQL Azure databases in the Basic and even Standard plans have very poor performance when compared to an on-premise installation. Together with poor statistics, this may cause longer durations even for simpler queries, even for…

Fix is in 1.6.3, but 1.6.5 adds further optimizations
Since Hangfire.SqlServer 1.6.3, the timeout is set to an unlimited value. Removal may take minutes, but the records will be removed. Since version 1.6.5, there will be…

There is a dedicated harness app on Azure now
All of these optimizations were tested against the worst and cheapest possible environment: a Web Application on the D1 Shared pricing tier, with a Basic tier SQL Azure database. The testing harness is now running in that application on a daily basis, processing 500K+ jobs/day, 24/7. This application helped me to reveal some other bottlenecks; all the optimizations will be released with the upcoming 1.6.5, but you can get them now from the CI feed: https://ci.appveyor.com/nuget/hangfire.

I'll close this issue as all the changes are already available on the CI feed, but feel free to re-open it at any time! Thanks for reporting such important information!
This issue is still happening in 1.6.5.
Oh dear. I'm planning to remove all the foreign keys with cascade deletions in version 1.7.0, but the first pre-release version will be released only in the middle of March 2017. That release will include changes for Hangfire.SqlServer, including this one. I've already implemented all of them, but haven't pushed them to the public yet, so I'm afraid the only thing you can do currently is a private modification that decreases the number of records being deleted per pass; please see this line.
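For illustration only, the modification has roughly this shape (the literal 10 is just an example value, not the shipped default):

-- Hypothetical: smaller passes hold locks for less time on slow Azure tiers.
delete top (10) from [HangFire].[Job] with (readpast)
where [ExpireAt] < GETUTCDATE();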
We are currently running into this issue in a production Azure SQL DB with 100 DTUs. The latest NuGet version of Hangfire is 1.6.21. How do we get 1.7.0, or is there a temporary workaround until it becomes available on the NuGet feed? Thanks!
We stopped using Azure SQL and started using Azure Redis for Hangfire. We definitely no longer have a problem with high SQL DTU usage.
We have been experiencing the same excessive SQL Server usage as found in issue #628, even after upgrading to v1.6.2.
Is there some further work that can be done here? Is the heartbeat of the checks too frequent for the number of jobs being executed (default options, and only between 100 and 4000 jobs an hour, averaging around 1000)?
This causes issues with the dashboard, which periodically cannot connect, and sometimes means jobs are not created because timeouts occur.
The last 24 hours of usage show 226 executions, with an execution time of 03:49:22, for the query:
(@now datetime,@count int)delete top (@count) from [HangFire].[Job] with (readpast) where ExpireAt < @now