Excessive DELETE usage in v1.6.2 - See #628 #643

Closed
barclayadam opened this issue Aug 16, 2016 · 14 comments

@barclayadam

We have been experiencing the same excessive SQL server usage as found in issue #628 even after upgrading to v1.6.2.

Is there some further work that can be done here? Is the frequency of the checks too high for the number of jobs being executed (default options and only between 100-4,000 jobs an hour, averaging around 1,000)?

This causes issues with the dashboard, as periodically it cannot connect, and also sometimes means jobs are not created because timeouts occur.

The last 24 hours of usage shows 226 executions, with a total execution time of 03:49:22 for the query (@now datetime,@count int)delete top (@count) from [HangFire].[Job] with (readpast) where ExpireAt < @now

@odinserj
Member

@barclayadam, can you show me the whole actual execution plan of the following query on your machine? It looks like the query optimizer on your machine decides to use an INDEX SCAN instead of an INDEX SEEK when performing the cascade deletion, but we need to verify this.

DELETE TOP (100) FROM [HangFire].[Job] WITH (READPAST) Where [ExpireAt] < GETUTCDATE();
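
For reference, one way to capture the actual plan outside of the SSMS "Include Actual Execution Plan" button is SET STATISTICS XML, which returns the plan as an extra XML result set. This is only a sketch; wrapping the statement in a transaction that is rolled back avoids actually removing records while collecting the plan.

BEGIN TRANSACTION;

SET STATISTICS XML ON;
-- Run the statement under investigation; the plan is returned alongside the results.
DELETE TOP (100) FROM [HangFire].[Job] WITH (READPAST) WHERE [ExpireAt] < GETUTCDATE();
SET STATISTICS XML OFF;

-- Discard the deletion so the data is left untouched.
ROLLBACK TRANSACTION;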

@barclayadam
Author

@odinserj I have tested this on two separate databases (both of them are S1-level SQL Azure v12 databases). They have a fairly similar usage pattern (around 900-3,000 jobs per hour). One does not exhibit the same slowdown.

The slow query does not have an Index Scan (interestingly, the quick one does, on IX_HangFire_JobParameter). The majority of the time is spent in a Clustered Index Delete on the State table.

The quick database has 97,553 state records and 32,622 jobs; the slow DB has 5,630,739 state records and 1,855,029 jobs (which is ALL of the jobs ever processed).

DB without the issue:

[screenshot: execution plan]

DB with the issue:

[screenshot: execution plan]

@odinserj
Member

Hm, do you know what causes such a heavy operation on the Clustered Index Delete? A lot of State records may cause this, but your DB has a 3:1 state-to-job ratio, and that's fine (however, for other cases it's worth adding ExpireAt to the State table in 1.7.0). Fragmentation? Slow I/O? What does the tooltip show?

@barclayadam
Author

Image below of the Clustered Index Delete (also, see full query plan at https://www.dropbox.com/s/1wprql0grywprjz/Hangfire%20deletion%20issue.sqlplan?dl=0)

[screenshot: Clustered Index Delete operator details]

I do not think this is an endemic issue with slow I/O; the same schema / database sizing works absolutely fine elsewhere.

I'm not sure why there has not been a single job deleted since we started using Hangfire. The number of processed jobs (from the aggregate table) is (almost) the same as the number of jobs still in the table.
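
As a quick sanity check (just a sketch, not something from the thread), counting the rows that are already past their expiration shows how far behind the cleanup is:

select count(*) as ExpiredJobs
from [HangFire].[Job] with (nolock)
where [ExpireAt] < getutcdate();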

This may be a bit off-base, but have you considered not separating out the parts of a job into different tables, perhaps making more use of JSON for storing child states / parameters etc. that do not need separate indexing / querying, to avoid JOINs?

@odinserj
Member

Image below of the Clustered Index Delete

The estimated number of rows does not correspond to the actual row count. The statistics are in terrible shape; have you tried updating them for the State table?
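
A refresh can be done per table or database-wide; a minimal sketch (sampling choices are up to you):

-- Rebuild statistics for the State table with a full scan (slower, but most accurate).
UPDATE STATISTICS [HangFire].[State] WITH FULLSCAN;

-- Or update every out-of-date statistic in the database in one go.
-- EXEC sp_updatestats;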

I'm not sure why there has not been a single job deleted since the start of using Hangfire. The number of processed jobs (from aggregate table) is the same as the jobs still in the table (almost).

This is very strange! Is this database used by the latest Hangfire version? A long time ago, a naive DELETE FROM query was used without specifying a batch size. Perhaps the logging messages written by ExpirationManager will help us determine the issue. For example, they may show connection pool contention issues.

This may be a bit off-base, but have you considered not separating out the parts of a job in to different tables, adopting more use of JSON perhaps for storing child states / parameters etc. that do not need separate indexing / querying to avoid JOINS etc.

I have an idea of how to make this possible for the JobParameter table, but it involves breaking changes and will be postponed to 2.0 anyway. The State table can't be replaced with JSON, because it needs to be fast enough for a lot of inserts.

@odinserj
Member

odinserj commented Aug 31, 2016

Today I was investigating how we can tune the job deletion without changing the schema. The only solution I've found is to remove records from the State and JobParameter tables first, and once they are empty, switch to the Job table.

delete from [HangFire].[State] 
where [Id] in (
    select top (100) s.[Id] 
    from [HangFire].[State] s with (readpast, xlock, forceseek)
    inner join [HangFire].[Job] j with (readcommittedlock, forceseek) 
        on j.[Id] = s.[JobId]
    where j.[ExpireAt] < GETUTCDATE()
)
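
The JobParameter table would be drained in the same way before touching the Job table; the query below simply applies the same pattern to that table as a sketch, it is not taken from the actual change.

delete from [HangFire].[JobParameter]
where [Id] in (
    select top (100) p.[Id]
    from [HangFire].[JobParameter] p with (readpast, xlock, forceseek)
    inner join [HangFire].[Job] j with (readcommittedlock, forceseek)
        on j.[Id] = p.[JobId]
    where j.[ExpireAt] < GETUTCDATE()
)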

@odinserj odinserj added this to the 1.6.5 milestone Aug 31, 2016
@odinserj
Member

But the query above does not prevent SCANs when deleting rows from the Job table, especially when the statistics are wrong.

@odinserj
Member

odinserj commented Sep 1, 2016

Today I've found a correct way to get rid of INDEX SCAN operators when removing records from the State and JobParameter tables during cascading deletion. However, this isn't your case, because your slow server already uses SEEK operations.

The only weird things in your query execution plan are the estimated number of rows and the fact that 69% of the time is spent removing records from the clustered index. However, the latter is also an estimate, and may be wrong: another estimated number is already wrong.

@barclayadam, what is the maximum number of state and parameter records per job record in your database? You can use the following scripts to obtain the numbers. You've already sent the totals, and although they suggest roughly 3 states per job, the actual maximums may be different.

select max(t.Count) as MaxStateCount 
from (select count(id) as Count from HangFire.State with (nolock) group by JobId) t

select max(t.Count) as MaxParameterCount 
from (select count(id) as Count from HangFire.JobParameter with (nolock) group by JobId) t
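
If an average per job is also useful, the same shape of query works with a different aggregate (a sketch, not from the thread):

select avg(cast(t.Count as decimal(10, 2))) as AvgStateCount
from (select count(id) as Count from HangFire.State with (nolock) group by JobId) t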

@barclayadam
Author

It seems the excessive deletion query times were caused by poor statistics; after updating all statistics I get a better estimated row count, which leads to much better deletion times (I am currently deleting all expired jobs to get back to 'normal' execution times).

I executed the given queries, which gave a MaxStateCount of 43 and a MaxParameterCount of 3 (I also computed the average state count, which was 3).

This database has never run anything below v1.5 (not sure what the patch level was on first execution) and is kept up to date.

@odinserj odinserj added the ready label Sep 2, 2016
@odinserj odinserj self-assigned this Sep 2, 2016
@odinserj odinserj added in progress and removed ready labels Sep 2, 2016
@odinserj
Member

TL;DR: Just read the headings

CommandTimeout is the root of the problem

Prior to Hangfire.SqlServer 1.6.3 (released 3 days after your first message here), the default command timeout value (30 seconds) was used for ExpirationManager's query. This looks like the root of the problem, because it explains why expired records weren't removed from your tables: the query was simply interrupted by an exceeded timeout on every attempt.

SQL Azure databases in the Basic and even Standard plans have very poor performance compared to an on-premises installation. Together with poor statistics, this may cause long durations even for simple queries, even SELECTs. But I understand that neither poor performance nor poor statistics should cause processing interruptions or lead to problems in the future.

Fix is in 1.6.3, but 1.6.5 adds further optimizations

Since Hangfire.SqlServer 1.6.3, the timeout is set to an unlimited value. The query may take minutes, but records will be removed. Since version 1.6.5, there is a LOOP JOIN hint that prevents other join types requiring more logical reads from being chosen, giving a more predictable query duration. I've also added an OPTIMIZE FOR hint that reduces the query duration even more.
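
Roughly, both hints attach at the query level; the sketch below only illustrates the idea and is not necessarily the exact statement shipped in 1.6.5 (@now and @count are the command parameters, and the @count value in OPTIMIZE FOR is an assumption).

delete top (@count) from [HangFire].[Job] with (readpast)
where [ExpireAt] < @now
-- Force nested loops for any joins in the plan (including the cascading deletes)
-- and compile the plan for a fixed, representative batch size.
option (loop join, optimize for (@count = 20000));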

There is a dedicated harness app on Azure now

All of these optimizations were tested against the worst and cheapest possible environment: a Web Application on the D1 Shared pricing tier and a SQL Azure database on the Basic tier. The testing harness is now running in that application on a daily basis, processing 500K+ jobs/day 24/7.

This application helped me reveal some other bottlenecks; all the optimizations will be released with the upcoming 1.6.5. But you can get them now from the CI feed: https://ci.appveyor.com/nuget/hangfire.

I'll close this issue as all the changes are already available on the CI feed, but feel free to re-open it at any time! Thanks for reporting such important information!

@michelejohlbs

This issue is still happening in 1.6.5.
It is so bad that the delete query "(@now datetime,@count int)delete top (@count) from [HangFire].[Job] with (readpast) where ExpireAt < @now" is causing an elastic pool in SQL Azure to fill up the tempdb quota, which in turn causes all SQL queries to throw an exception of "Your tempdb is probably out of space." or "The database 'tempdb' has reached its size quota."

@odinserj
Member

Oh dear. I'm planning to remove all the foreign keys with cascade deletions in version 1.7.0, but the first pre-release version will only be released in the middle of March 2017. That release will include changes for Hangfire.SqlServer, including this one.

I've already implemented all of them, but haven't pushed them publicly yet, so I'm afraid the only thing you can do currently is a private modification that decreases the number of records being deleted per pass; please see this line.
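
For context, dropping a cascading foreign key is an ordinary ALTER TABLE operation; the sketch below only illustrates the shape of such a change and is not the 1.7.0 migration itself (the constraint name is an assumption; check sys.foreign_keys for the real one).

-- List the cascading foreign keys that reference [HangFire].[Job].
select name, delete_referential_action_desc
from sys.foreign_keys
where referenced_object_id = object_id('[HangFire].[Job]');

-- Drop one of them (replace the constraint name with the actual one from the query above).
alter table [HangFire].[State] drop constraint [FK_HangFire_State_Job];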

@cjeffers

We are currently running into this issue in a production Azure SQL DB with 100 DTUs. The latest NuGet version of Hangfire is 1.6.21. How do we get 1.7.0, or is there a temporary workaround until it becomes available on the NuGet feeds?

Thanks!

@michelejohlbs

We stopped using Azure SQL and started using Azure Redis for Hangfire. We definitely no longer have a problem with high SQL DTU usage.
