
Delete failure "the number of retries has been exceeded: StatusCode=404" #596

Closed
mikhailshilkov opened this issue Dec 14, 2020 · 7 comments


@mikhailshilkov

We have a test in CI/CD which, as part of the cleanup, calls the Event Hub namespace DELETE operation. The operation is marked with x-ms-long-running-operation so we call WaitForCompletion on the initial response.
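
For reference, the call pattern looks roughly like this (a minimal sketch assuming the 2017-04-01 eventhub management package and environment-based auth; the helper name and setup are illustrative, not our exact test code):

package cleanup

import (
	"context"

	"github.com/Azure/azure-sdk-for-go/services/eventhub/mgmt/2017-04-01/eventhub"
	"github.com/Azure/go-autorest/autorest/azure/auth"
)

// deleteNamespace issues the DELETE and blocks until the long-running
// operation completes. Illustrative sketch only.
func deleteNamespace(ctx context.Context, subscriptionID, resourceGroup, name string) error {
	authorizer, err := auth.NewAuthorizerFromEnvironment()
	if err != nil {
		return err
	}

	client := eventhub.NewNamespacesClient(subscriptionID)
	client.Authorizer = authorizer

	// Delete is marked x-ms-long-running-operation, so it returns a future.
	future, err := client.Delete(ctx, resourceGroup, name)
	if err != nil {
		return err
	}

	// Block until the LRO reaches a terminal state.
	return future.WaitForCompletionRef(ctx, client.Client)
}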

Sometimes, but not always, WaitForCompletion fails (returns an error) with

Future#WaitForCompletion: the number of retries has been exceeded: StatusCode=404 -- Original Error: Code="ResourceNotFound" Message="The Resource 'Microsoft.EventHub/namespaces/foo' under resource group 'bar' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"

It would seem that a 404 on a DELETE operation is actually exactly what we need: this means the resource is successfully deleted. However, I'm not sure how exactly the awaiting goes sideways. I logged the initial response and it's a

Status: 202
Headers: map[Cache-Control:[no-cache] Content-Length:[0] Date:[Fri, 11 Dec 2020 12:16:05 GMT] Expires:[-1] Location:[https://management.azure.com/subscriptions/***/resourceGroups/cozegowygsrz/providers/Microsoft.EventHub/namespaces/cozegowygsrz/operationresults/cozegowygsrz?api-version=2017-04-01] Pragma:[no-cache] Server:[Service-Bus-Resource-Provider/SN1 Microsoft-HTTPAPI/2.0] Server-Sb:[Service-Bus-Resource-Provider/SN1] Strict-Transport-Security:[max-age=31536000; includeSubDomains] X-Content-Type-Options:[nosniff] X-Ms-Correlation-Request-Id:[bd6a86d0-abbd-4c6a-9b51-cb7268603884] X-Ms-Ratelimit-Remaining-Subscription-Deletes:[14999] X-Ms-Request-Id:[9212ffac-9bec-47f2-a26c-2ebbc1f1e3ab_M9SN1_M9SN1] X-Ms-Routing-Request-Id:[EASTUS2:20201211T121605Z:bd6a86d0-abbd-4c6a-9b51-cb7268603884]]:

Any idea what goes wrong here or how I can work around this behavior?

@jhendrixMSFT
Member

Sorry for the delay.

I looked into this a bit. If you attempt to delete a namespace that doesn't exist, the service returns a 204 initial response, so unfortunately it's nothing that simple.

In the normal, successful case the initial response is a 202. Polling on the endpoint in the Location header returns a 200 with a provisioning state of Removing (this is in the JSON response body), and once the provisioning state reaches a terminal state (Succeeded) the polling loop ends.
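
For illustration, that flow boils down to roughly the loop below (a hand-rolled sketch with net/http, not the actual go-autorest code; the provisioningState field name and the auth-free http.Client are assumptions made for the example):

package cleanup

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// pollLocation sketches the polling loop described above: GET the URL from
// the Location header until the body reports a terminal provisioning state.
// Authorization headers are omitted for brevity.
func pollLocation(ctx context.Context, client *http.Client, locationURL string) error {
	type nsResponse struct {
		Properties struct {
			ProvisioningState string `json:"provisioningState"`
		} `json:"properties"`
	}

	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, locationURL, nil)
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		if resp.StatusCode != http.StatusOK {
			resp.Body.Close()
			// This is where the spurious 404 from this issue would surface.
			return fmt.Errorf("polling returned status %d", resp.StatusCode)
		}

		var body nsResponse
		decodeErr := json.NewDecoder(resp.Body).Decode(&body)
		resp.Body.Close()
		if decodeErr != nil {
			return decodeErr
		}

		switch body.Properties.ProvisioningState {
		case "Succeeded":
			return nil // terminal state: the delete finished
		case "Failed", "Canceled":
			return fmt.Errorf("delete ended in provisioning state %s", body.Properties.ProvisioningState)
		}

		// Still "Removing"; wait before polling again.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Second):
		}
	}
}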

I thought it might be a race condition, i.e. two goroutines attempting to delete the same namespace; however, in my repro one of the calls to Delete() fails with a 409 (Conflict). It could just be that the timing in my repro is off, though. Does your CI clean-up attempt to delete the same namespace concurrently?

The other possibility is that it's a bug in the endpoint itself.

@mikhailshilkov
Author

Do I understand correctly that you can't repro this? It takes a while for the error to show up for us, but we hit it periodically. I ran the test 30 times in parallel (for different namespaces in different resource groups) and got one error, for example.

the service returns a 204 initial response

In the problematic case, the initial response is a 202 for us, and the 404 happens later in the polling loop.

Does your CI clean-up attempt to delete the same namespace concurrently?

No, it doesn't: it issues one delete call, awaits it, and then gives up (we delete all stuck resources nightly, but that's irrelevant here).

Other possibility is it's a bug in the endpoint itself.

I found this, which suggests that might be the case. Do you have a backchannel to the service team? That line has been in the code for 3 years...

Still, it would be nice to handle this case on the go-autorest side.

Thank you for looking into this!

@jhendrixMSFT
Member

Correct, I wasn't able to repro this. I've updated my test app to create/delete the namespace in succession 50 times; let's see what happens.

In what region are you seeing this happen?

@jhendrixMSFT
Member

Running in a loop, I was able to repro the issue. I will follow up with the service team to find out more.

@jhendrixMSFT
Member

The service team investigated the issue; it's a bug on their side. They will work on deploying a fix, but in the meantime you will need to work around the behavior. While you could write your own LRO polling loop, it might be simpler to check the status code along with a non-nil error:

// d is the future returned by Delete(); err is the error from WaitForCompletion.
if err != nil && d.Response().StatusCode != http.StatusNotFound {
	// real error, not due to the spurious 404; handle appropriately
}
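
Put together with the delete sketch from earlier, the workaround could look like this (illustrative only; `future` is the NamespacesDeleteFuture returned by Delete(), and the 404 check mirrors the snippet above):

package cleanup

import (
	"context"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/services/eventhub/mgmt/2017-04-01/eventhub"
)

// deleteNamespaceIgnoring404 treats a 404 seen while polling the delete as
// success, since on a DELETE it means the resource is already gone.
func deleteNamespaceIgnoring404(ctx context.Context, client eventhub.NamespacesClient, resourceGroup, name string) error {
	future, err := client.Delete(ctx, resourceGroup, name)
	if err != nil {
		return err
	}

	err = future.WaitForCompletionRef(ctx, client.Client)
	if err != nil && future.Response() != nil && future.Response().StatusCode == http.StatusNotFound {
		// Spurious 404 from the buggy polling endpoint: the namespace is gone.
		return nil
	}
	return err
}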

@mikhailshilkov
Author

Thank you @jhendrixMSFT! We did exactly this as a workaround. Hoping to see the fix on the service side.

@jhendrixMSFT
Member

Glad that's working. Given that this isn't a bug in the SDK, I'm going to close this issue.
