Tasks stuck in cancelling if connection is lost to the server #3388

Closed
droyad opened this issue Apr 6, 2017 · 4 comments

droyad commented Apr 6, 2017

If the server cannot connect to the database when updating a task's status to Finished, the following message is logged: `Unable to mark task ServerTasks-206 as complete`

When the server regains access to the database, the task appears to be still running, but the underlying task thread has finished.

When the Cancel button is clicked, the task goes into the `Cancelling` state, which prompts each server node to cancel the task thread. Since no node can find a thread with that task ID, the task never gets updated to `Cancelled`.

The trick is that each server node does not know whether the thread was running on itself or on another node (in an HA config).

We should detect whether the node owned the task, and if so, move it to cancelled.

Also we should see if we can try a bit harder to set the final status.
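
A minimal sketch of the ownership check proposed above, written in Python rather than the server's own language. It assumes each task record stores the name of the node that started it; `TaskRecord`, its `owner_node` field, and `handle_cancelling` are hypothetical names for illustration, not part of the actual server code.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    state: str        # e.g. "Executing", "Cancelling", "Cancelled"
    owner_node: str   # hypothetical: the node that started the task thread


def handle_cancelling(task, this_node, running_threads):
    """Called on each node when it sees a task in the Cancelling state.

    running_threads maps task ids to a cancel callback for task threads
    that are still alive on this node.
    """
    if task.state != "Cancelling":
        return task.state
    cancel = running_threads.get(task.task_id)
    if cancel is not None:
        # The thread is still alive here: ask it to stop.
        cancel()
    elif task.owner_node == this_node:
        # This node owned the task but the thread is gone,
        # so there is nothing left to cancel.
        task.state = "Cancelled"
    # Otherwise another node owns the task; leave it alone.
    return task.state
```

With this shape, only the owning node ever moves an orphaned task to `Cancelled`, so other nodes in an HA setup cannot cancel a task that is still running elsewhere.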

droyad commented Apr 6, 2017

Should error 976 (`SELECT * FROM SYS.MESSAGES WHERE message_id = 976`):

The target database, '%.*ls', is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other databases in the availability group, enable read access to one or more secondary availability replicas in the group. For more information, see the ALTER AVAILABILITY GROUP statement in SQL Server Books Online.

be added to the `SqlDatabaseTransientErrorDetectionStrategy`? https://github.com/OctopusDeploy/Nevermore/blob/8bc8ff97a13d4f7081df7478a3082b40ebca6b6c/source/Nevermore/Transient/SqlDatabaseTransientErrorDetectionStrategy.cs#L14
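
To illustrate the question, here is a sketch in Python of a transient-error check keyed on SQL Server error numbers. It is not the Nevermore implementation; `SqlError` stands in for a driver exception, `TRANSIENT_ERROR_NUMBERS`, `is_transient`, and `run_with_retry` are hypothetical names, and the numbers other than 976 are illustrative entries only.

```python
import time


class SqlError(Exception):
    """Stand-in for a driver exception carrying a SQL Server error number."""

    def __init__(self, number, message=""):
        super().__init__(message)
        self.number = number


# 976 is the error discussed above; 40501 and 40613 are illustrative entries.
TRANSIENT_ERROR_NUMBERS = {976, 40501, 40613}


def is_transient(error):
    """Return True if the error is a SqlError with a known transient number."""
    return isinstance(error, SqlError) and error.number in TRANSIENT_ERROR_NUMBERS


def run_with_retry(operation, attempts=3, delay_seconds=0.0):
    """Run operation, retrying only when the failure is classified as transient."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            # Give up on the last attempt or on any non-transient failure.
            if attempt == attempts or not is_transient(error):
                raise
            time.sleep(delay_seconds)
```

Under this assumption, adding 976 to the set would mean a query that hits a temporarily inaccessible availability-group database is retried instead of failing immediately.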

droyad commented Apr 7, 2017

Previously, if the task completion failed, the task would be left in the running state and removed from the running tasks dictionary.

Now the task is not removed from the dictionary until its state has been successfully updated to a complete status. If there is an error during completion, the task is marked as such and completion is retried during the task cancellation/cleanup process.

Also, the error now shows up in the logs if SQL is unavailable.
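
The behaviour described above could be sketched roughly as follows, in Python for brevity. This is an illustration under stated assumptions, not the actual server code: `TaskRegistry`, `save_state`, and `cleanup` are hypothetical names, and `save_state` stands in for the database update that can fail.

```python
class TaskRegistry:
    def __init__(self, save_state):
        # Hypothetical callback: persists (task_id, state) and may raise.
        self._save_state = save_state
        # task_id -> pending final state, or None while still executing.
        self._running = {}

    def start(self, task_id):
        self._running[task_id] = None

    def complete(self, task_id, final_state):
        """Try to persist the final state; keep the entry if that fails."""
        self._running[task_id] = final_state
        try:
            self._save_state(task_id, final_state)
        except Exception as error:
            # Log the failure and leave the task in the dictionary for a retry.
            print(f"Unable to mark task {task_id} as complete: {error}")
            return False
        del self._running[task_id]
        return True

    def cleanup(self):
        """Retry completion for tasks whose final state was never saved."""
        for task_id, final_state in list(self._running.items()):
            if final_state is not None:
                self.complete(task_id, final_state)

    def is_tracked(self, task_id):
        return task_id in self._running
```

The key design choice in this sketch is that the dictionary entry is only deleted after the save succeeds, so a later cancellation or cleanup pass can still find the task and finish the update.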

Example of a task that failed to complete, but then managed to complete later:
[Two screenshots]

droyad closed this as completed Apr 10, 2017
octoreleasebot added this to the 3.12.2 milestone Apr 10, 2017
@octoreleasebot

Release Note: Tasks no longer get stuck in the running or cancelling state if there is an intermittent database connection problem

lock bot commented Nov 24, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. If you think you've found a related issue, please contact our support team so we can triage your issue, and make sure it's handled appropriately.

lock bot locked as resolved and limited conversation to collaborators Nov 24, 2018