Excessive DELETE usage in v1.6.2 - See #628 #643

Closed
barclayadam opened this issue Aug 16, 2016 · 14 comments

@barclayadam

We have been experiencing the same excessive SQL server usage as found in issue #628 even after upgrading to v1.6.2.

Is there some further work that can be done here? Is the frequency of the checks too high for the number of jobs being executed (default options and only between 100-4,000 jobs an hour, averaging around 1,000)?

This causes issues with the dashboard, as periodically it cannot connect, and also sometimes means jobs are not created because timeouts occur.

The last 24 hours of usage shows 226 executions, with a total execution time of 03:49:22 for the query (@now datetime,@count int)delete top (@count) from [HangFire].[Job] with (readpast) where ExpireAt < @now

@odinserj
Member

@barclayadam, can you show me the whole actual execution plan of the following query on your machine? It looks like the query optimizer on your machine decides to use an INDEX SCAN instead of an INDEX SEEK when performing the cascade deletion, but we need to verify this.

DELETE TOP (100) FROM [HangFire].[Job] WITH (READPAST) Where [ExpireAt] < GETUTCDATE();
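
For reference, one way to capture the actual plan outside of the SSMS "Include Actual Execution Plan" button is SET STATISTICS XML, which returns the plan as an extra XML result set. This is only a sketch; wrapping the statement in a transaction that is rolled back avoids actually removing records while collecting the plan.

BEGIN TRANSACTION;

SET STATISTICS XML ON;
-- Run the statement under investigation; the plan is returned alongside the results.
DELETE TOP (100) FROM [HangFire].[Job] WITH (READPAST) WHERE [ExpireAt] < GETUTCDATE();
SET STATISTICS XML OFF;

-- Discard the deletion so the data is left untouched.
ROLLBACK TRANSACTION;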

@barclayadam
Author

@odinserj I have tested this on two separate databases (both of them are S1-level SQL Azure v12 databases). They have a fairly similar usage pattern (around 900-3,000 jobs per hour). One does not exhibit the same slowdown.

The slow query does not have an Index Scan (interestingly, the quick one does, on IX_HangFire_JobParameter). The majority of the time is spent in a Clustered Index Delete on the State table.

The quick database has 97,553 state records and 32,622 jobs; the slow DB has 5,630,739 state records and 1,855,029 jobs (which is ALL of the jobs ever processed).

DB without the issue:

[screenshot: execution plan]

DB with the issue:

[screenshot: execution plan]

@odinserj
Member

Hm, do you know what causes such a heavy operation on the Clustered Index Delete? A lot of State records may cause this, but your DB has a 3:1 state-to-job ratio, and that's fine (however, for other cases it's worth adding ExpireAt to the State table in 1.7.0). Fragmentation? Slow I/O? What does the tooltip show?

@barclayadam
Author

Image below of the Clustered Index Delete (also, see full query plan at https://www.dropbox.com/s/1wprql0grywprjz/Hangfire%20deletion%20issue.sqlplan?dl=0)

[screenshot: Clustered Index Delete operator details]

I do not think this is an endemic issue with slow I/O; the same schema / database sizing works absolutely fine elsewhere.

I'm not sure why there has not been a single job deleted since we started using Hangfire. The number of processed jobs (from the aggregate table) is (almost) the same as the number of jobs still in the table.
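
As a quick sanity check (just a sketch, not something from the thread), counting the rows that are already past their expiration shows how far behind the cleanup is:

select count(*) as ExpiredJobs
from [HangFire].[Job] with (nolock)
where [ExpireAt] < getutcdate();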

This may be a bit off-base, but have you considered not separating out the parts of a job into different tables, perhaps making more use of JSON for storing child states / parameters etc. that do not need separate indexing / querying, to avoid JOINs?

@odinserj
Member

Image below of the Clustered Index Delete

The estimated number of rows does not correspond to the actual row count. The statistics are in terrible shape; have you tried updating them for the State table?
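
A refresh can be done per table or database-wide; a minimal sketch (sampling choices are up to you):

-- Rebuild statistics for the State table with a full scan (slower, but most accurate).
UPDATE STATISTICS [HangFire].[State] WITH FULLSCAN;

-- Or update every out-of-date statistic in the database in one go.
-- EXEC sp_updatestats;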

I'm not sure why there has not been a single job deleted since the start of using Hangfire. The number of processed jobs (from aggregate table) is the same as the jobs still in the table (almost).

This is very strange! Is this database used by the latest Hangfire version? A long time ago, a naive DELETE FROM query was used without specifying a batch size. Perhaps the logging messages written by ExpirationManager will help us determine the issue. For example, they may show connection pool contention issues.

This may be a bit off-base, but have you considered not separating out the parts of a job in to different tables, adopting more use of JSON perhaps for storing child states / parameters etc. that do not need separate indexing / querying to avoid JOINS etc.

I have an idea of how to make this possible for the JobParameter table, but it involves breaking changes and will be postponed to 2.0 anyway. The State table can't be replaced with JSON, because it needs to be fast enough for a lot of inserts.

@odinserj
Member

odinserj commented Aug 31, 2016

Today I was investigating how we can tune the job deletion without changing the schema. The only solution I've found is to remove records from the State and JobParameter tables first, and once they are empty, switch to the Job table.

delete from [HangFire].[State] 
where [Id] in (
    select top (100) s.[Id] 
    from [HangFire].[State] s with (readpast, xlock, forceseek)
    inner join [HangFire].[Job] j with (readcommittedlock, forceseek) 
        on j.[Id] = s.[JobId]
    where j.[ExpireAt] < GETUTCDATE()
)
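
The JobParameter table would be drained in the same way before touching the Job table; the query below simply applies the same pattern to that table as a sketch, it is not taken from the actual change.

delete from [HangFire].[JobParameter]
where [Id] in (
    select top (100) p.[Id]
    from [HangFire].[JobParameter] p with (readpast, xlock, forceseek)
    inner join [HangFire].[Job] j with (readcommittedlock, forceseek)
        on j.[Id] = p.[JobId]
    where j.[ExpireAt] < GETUTCDATE()
)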

@odinserj odinserj added this to the 1.6.5 milestone Aug 31, 2016
@odinserj
Member

But the query above does not prevent SCANs when deleting rows from the Job table, especially when the statistics are wrong.

@odinserj
Member

odinserj commented Sep 1, 2016

Today I've found a correct way to get rid of INDEX SCAN operators when removing records from the State and JobParameter tables during cascading deletion. However, this isn't your case, because your slow server already uses SEEK operations.

The only weird things in your query execution plan are the estimated number of rows and the fact that 69% of the time is spent removing records from the clustered index. However, the latter is also an estimate, and may be wrong: another estimated number is already wrong.

@barclayadam, what is the maximum number of state and parameter records per job record in your database? You can use the following scripts to obtain the numbers. You've already sent the totals, and although they suggest roughly 3 states per job, the actual maximums may be different.

select max(t.Count) as MaxStateCount 
from (select count(id) as Count from HangFire.State with (nolock) group by JobId) t

select max(t.Count) as MaxParameterCount 
from (select count(id) as Count from HangFire.JobParameter with (nolock) group by JobId) t
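
If an average per job is also useful, the same shape of query works with a different aggregate (a sketch, not from the thread):

select avg(cast(t.Count as decimal(10, 2))) as AvgStateCount
from (select count(id) as Count from HangFire.State with (nolock) group by JobId) t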

@barclayadam
Author

It seems the excessive deletion query times were caused by poor statistics; after updating all statistics I get a better estimated row count, which leads to much better deletion times (I am currently deleting all expired jobs to get back to 'normal' execution times).

I executed the given queries, which gave a MaxStateCount of 43 and a MaxParameterCount of 3 (I also computed the average state count, which was 3).

This database has never run anything below v1.5 (not sure what the patch level was on first execution) and is kept up to date.

@odinserj odinserj added the ready label Sep 2, 2016
@odinserj odinserj self-assigned this Sep 2, 2016
@odinserj odinserj added in progress and removed ready labels Sep 2, 2016
@odinserj
Member

TL;DR: Just read the headings

CommandTimeout is the root of the problem

Prior to Hangfire.SqlServer 1.6.3 (released 3 days after your first message here), the default command timeout value (30 seconds) was used for ExpirationManager's query. This looks like the root of the problem, because it explains why expired records weren't removed from your tables: the query was simply interrupted by an exceeded timeout on every attempt.

SQL Azure databases in the Basic and even Standard plans have very poor performance compared to an on-premises installation. Together with poor statistics, this may cause long durations even for simple queries, even SELECTs. But I understand that neither poor performance nor poor statistics should cause processing interruptions or lead to problems in the future.

Fix is in 1.6.3, but 1.6.5 adds further optimizations

Since Hangfire.SqlServer 1.6.3, the timeout is set to an unlimited value. The query may take minutes, but records will be removed. Since version 1.6.5, there is a LOOP JOIN hint that prevents other join types requiring more logical reads from being chosen, giving a more predictable query duration. I've also added an OPTIMIZE FOR hint that reduces the query duration even more.
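
Roughly, both hints attach at the query level; the sketch below only illustrates the idea and is not necessarily the exact statement shipped in 1.6.5 (@now and @count are the command parameters, and the @count value in OPTIMIZE FOR is an assumption).

delete top (@count) from [HangFire].[Job] with (readpast)
where [ExpireAt] < @now
-- Force nested loops for any joins in the plan (including the cascading deletes)
-- and compile the plan for a fixed, representative batch size.
option (loop join, optimize for (@count = 20000));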

There is a dedicated harness app on Azure now

All of these optimizations were tested against the worst and cheapest possible environment: a Web Application on the D1 Shared pricing tier and a SQL Azure database on the Basic tier. The testing harness is now running in that application on a daily basis, processing 500K+ jobs/day 24/7.

This application helped me reveal some other bottlenecks; all the optimizations will be released with the upcoming 1.6.5. But you can get them now from the CI feed: https://ci.appveyor.com/nuget/hangfire.

I'll close this issue as all the changes are already available on the CI feed, but feel free to re-open it at any time! Thanks for reporting such important information!

@michelejohlbs

This issue is still happening in 1.6.5.
It is so bad that the delete query "(@now datetime,@count int)delete top (@count) from [HangFire].[Job] with (readpast) where ExpireAt < @now" is causing an elastic pool in SQL Azure to fill up the tempdb quota, which in turn causes all SQL queries to throw an exception of "Your tempdb is probably out of space." or "The database 'tempdb' has reached its size quota."

@odinserj
Member

Oh dear. I'm planning to remove all the foreign keys with cascade deletions in version 1.7.0, but the first pre-release version will only be released in the middle of March 2017. That release will include changes for Hangfire.SqlServer, including this one.

I've already implemented all of them, but haven't pushed them publicly yet, so I'm afraid the only thing you can do currently is a private modification that decreases the number of records being deleted per pass; please see this line.
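
For context, dropping a cascading foreign key is an ordinary ALTER TABLE operation; the sketch below only illustrates the shape of such a change and is not the 1.7.0 migration itself (the constraint name is an assumption; check sys.foreign_keys for the real one).

-- List the cascading foreign keys that reference [HangFire].[Job].
select name, delete_referential_action_desc
from sys.foreign_keys
where referenced_object_id = object_id('[HangFire].[Job]');

-- Drop one of them (replace the constraint name with the actual one from the query above).
alter table [HangFire].[State] drop constraint [FK_HangFire_State_Job];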

@cjeffers

We are currently running into this issue in a production Azure SQL DB with 100 DTUs. The latest NuGet version of Hangfire is 1.6.21. How do we get 1.7.0, or is there a temporary workaround until it becomes available on the NuGet feeds?

Thanks!

@michelejohlbs

We stopped using Azure SQL and started using Azure Redis for Hangfire. We definitely no longer have a problem with high SQL DTU usage.
