Make Cloud Storage client retry on backend error #3586

Closed
ywelsch opened this issue Aug 21, 2018 · 7 comments
Comments


ywelsch commented Aug 21, 2018

We're operating at scale on GCS and regularly experience transient HTTP 410 status codes when accessing Cloud Storage. Those 410 status codes returned by Cloud Storage are bogus, though: they effectively just hide an internal backend error on GCS, which is reflected in the error details:

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 
410 Gone { "code" : 503, "errors" : [ { "domain" : "global", "message" : "Backend Error", "reason" : "backendError" } ], "message" : "Backend Error" }

The google-cloud-storage client does not treat the 410 status code as retryable, understandably so. It should retry on backend errors, though, which are typically exposed with status code 500 or 503. I suggest treating backend errors in the client the same way it treats internal errors, namely matching on reason == backendError independently of the HTTP status code.

Note that we're not the first ones to experience this, and the client should be resilient against these transient GCS errors.
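
A minimal sketch of the suggested behavior, assuming the retry is wrapped around calls at the application level rather than built into the library. The helper `callWithBackendErrorRetry`, the bucket/object names, and the backoff policy are hypothetical; the google-cloud-storage client does not do this today.

```java
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;

import java.util.concurrent.Callable;

public class BackendErrorRetry {

  // Hypothetical helper: maxAttempts and the backoff policy are arbitrary choices.
  static <T> T callWithBackendErrorRetry(Callable<T> call, int maxAttempts) throws Exception {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (StorageException e) {
        // The issue's suggestion: match on reason == "backendError" even when the
        // status code is 410 rather than 500/503.
        boolean backendError = "backendError".equals(e.getReason());
        if ((!backendError && !e.isRetryable()) || attempt == maxAttempts) {
          throw e;
        }
        Thread.sleep(1000L * attempt); // crude linear backoff, for illustration only
      }
    }
    throw new IllegalStateException("unreachable");
  }

  public static void main(String[] args) throws Exception {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    // Bucket and object names are placeholders.
    Blob blob = callWithBackendErrorRetry(
        () -> storage.get(BlobId.of("my-bucket", "my-object")), 3);
    System.out.println(blob == null ? "not found" : blob.getName());
  }
}
```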

@andreamlin added the api: storage and type: bug labels Aug 21, 2018
@yihanzhen self-assigned this Aug 21, 2018
@JustinBeckwith added the triage me label Aug 22, 2018
@yihanzhen added the priority: p2 label Aug 22, 2018
@JustinBeckwith removed the triage me label Aug 22, 2018
yihanzhen (Contributor) commented

I've contacted the storage backend team; if they aren't against it, I'll add the retry logic.

yihanzhen (Contributor) commented

Based on the discussion with the storage backend team, the 410 happens during a JSON API resumable upload session. The error likely indicates that the upload session has already been terminated, so retrying the individual HTTP request would not work (the entire upload session has to be restarted). An internal bug has been filed and the storage team is actively working on it.
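
A minimal sketch of the application-level workaround this implies: when the resumable session dies with a 410/backendError, open a fresh writer (hence a fresh session) and re-upload from byte 0, rather than retrying the failed request inside the dead session. The helper `uploadWithRestart`, the attempt count, and the bucket/object/file names are hypothetical.

```java
import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;

import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RestartableUpload {

  static void uploadWithRestart(Storage storage, BlobInfo blobInfo, Path file, int maxAttempts)
      throws Exception {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // storage.writer(...) starts a fresh resumable upload session on every attempt.
      try (SeekableByteChannel src = Files.newByteChannel(file);
          WriteChannel dst = storage.writer(blobInfo)) {
        ByteBuffer buf = ByteBuffer.allocate(1024 * 1024);
        while (src.read(buf) >= 0 || buf.position() > 0) {
          buf.flip();
          dst.write(buf);
          buf.compact();
        }
        return; // closing the writer finalizes the upload
      } catch (StorageException e) {
        boolean deadSession = e.getCode() == 410 || "backendError".equals(e.getReason());
        if (!deadSession || attempt == maxAttempts) {
          throw e;
        }
        // Otherwise fall through and restart the entire upload from the beginning.
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobInfo info = BlobInfo.newBuilder("my-bucket", "my-object").build();
    uploadWithRestart(storage, info, Paths.get("/tmp/data.bin"), 3);
  }
}
```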

@JustinBeckwith added the status: blocked label Sep 13, 2018
andrey-qlogic commented

@hzyi-google and @JustinBeckwith
Is this bug still blocked by the internal error?

yihanzhen (Contributor) commented

For Googlers: b/116709007 is the internal bug.
According to the bug, it seems they decided it's not possible to fix this in the client libraries.

Dataflow already effectively retries 410s by retrying every failed shard 4 times regardless of why it failed, so it won't be a problem for Dataflow users.

b/115694839 tracks the implementation of resumable uploads. People who are hitting these 410s directly in their projects might need to wait for that feature.

sduskis (Contributor) commented Dec 11, 2018

This issue is important and unfortunately not solvable by clients. I'm going to close this issue, since there's nothing we can do here.


romange commented Nov 6, 2019

@hzyi-google Could you please update the status of the internal bug? It's been almost a year...

yihanzhen (Contributor) commented

@romange Sorry, I don't work in this repo anymore. cc @kolea2
