GCS Connector giving 429s when used from multiple clusters. #185

Closed
jaketf opened this issue Jun 19, 2019 · 4 comments
jaketf commented Jun 19, 2019

We have 4 Dataproc clusters and a GKE cluster, all of which use the GCS connector to perform GCS operations on the same bucket, which houses the raw data on which many reports/roll-ups are run on an hourly basis. We are encountering many 429 (Too Many Requests) errors:

Caused by: com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 429 unknown
<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:401)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1097)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:499)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1939)
	... 29 more

Cluster Details:
1x 1500-node (n1-standard-8) Dataproc cluster
2x 1000-node (n1-standard-8) Dataproc clusters
1x 800-node (n1-standard-8) Dataproc cluster
1x 6-node (n1-standard-2) GKE cluster

Job Details:
The GKE cluster hosts clients that list GCS objects to find work to submit as Dataproc jobs.
Dataproc job drivers also list GCS files.

These 429s were blocking nearly all jobs on v1.9.16, so we bumped to v1.9.17.

Is there a reason that the GCS connector does not implement exponential back-off when receiving these 429s?

We are concerned that fs.gs.glob.flatlist.enable will cause client-side OOMs and/or prohibitively slow performance on these list operations. We are testing this on some jobs today.
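
A minimal sketch of the kind of test being run, assuming the flag is applied through the job's Hadoop configuration (the bucket name and glob pattern below are placeholders, not from this thread):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class FlatGlobTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Flag under test: resolve globs via flat listing instead of
        // recursive directory listing.
        conf.set("fs.gs.glob.flatlist.enable", "true");
        FileSystem fs = FileSystem.get(new URI("gs://example-bucket"), conf); // placeholder bucket
        FileStatus[] matches = fs.globStatus(new Path("gs://example-bucket/raw/*/part-*")); // placeholder glob
        if (matches != null) {
            for (FileStatus status : matches) {
                System.out.println(status.getPath());
            }
        }
    }
}
```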

This is similar to #151, which is marked as fixed as of v1.9.15.

medb (Contributor) commented Jun 19, 2019

fs.gs.glob.flatlist.enable should not cause OOMs, because it supports pagination and on-the-fly filtering, unless the job processes such a large number of files that they cannot fit in the memory of one node.

I think the first thing you can try is setting the fs.gs.glob.concurrent.enable property to false; this will reduce the number of list requests and memory consumption, but globbing will be slower in some cases.

Regarding exponential back-off: the GCS connector supports it, but it doesn't log any information about retries, only the final failure.

To mitigate the issue with too many requests, you can bump up the retry count by setting fs.gs.http.max.retry to 50 (the default is 10).
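
For example, both settings could be applied together before the FileSystem is first resolved (a minimal sketch; the bucket name is a placeholder, and only the two property names come from this thread):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import java.net.URI;

public class GcsRequestTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Serialize glob evaluation: fewer concurrent list requests and lower
        // memory use, at the cost of slower globbing in some cases.
        conf.set("fs.gs.glob.concurrent.enable", "false");
        // Raise the HTTP retry budget from the default of 10.
        conf.set("fs.gs.http.max.retry", "50");
        FileSystem fs = FileSystem.get(new URI("gs://example-bucket"), conf); // placeholder bucket
        System.out.println("Connected to " + fs.getUri());
    }
}
```

The same properties could also be set cluster-wide in core-site.xml (or via Dataproc cluster properties) rather than in code.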

jaketf commented Jun 20, 2019

Thanks for the info on exponential back-off. Is there any way to get information on how many retries are happening?

There is some direct use of the GCS client library (from the Dataproc clusters and GKE clients) to perform certain operations faster, which is confounding the root cause of this issue.

However, there does seem to be a large number of extra API calls, because the GCS connector makes an extra API call to check bucket existence for every operation. This appears to be addressed by 654b66b, but that commit hasn't been rolled into a release yet.

What is the typical release cadence, or the ETA for rolling this specific commit into a release?

medb (Contributor) commented Jun 20, 2019

Unfortunately, there is no easy way to get the number of retries, because retries are handled by the GCS API client library, which doesn't log or expose this information.

Current versions of the GCS connector check the existence of the system bucket only when a new instance of GoogleHadoopFileSystem is created; if your code creates a new instance before each GCS operation, it should instead cache the instance and reuse it across GCS operations to improve performance.

To disable this system bucket check, you can set fs.gs.system.bucket to an empty string.
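
A minimal sketch of both suggestions together, reusing one cached FileSystem instance and blanking out the system bucket (the bucket name and helper method are illustrative placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class ReusedGcsClient {
    private static FileSystem cachedFs;

    // Illustrative helper: create the FileSystem once and reuse it for every
    // GCS operation. (Hadoop's FileSystem.get also caches internally unless
    // fs.gs.impl.disable.cache is set to true.)
    static synchronized FileSystem gcsFs() throws Exception {
        if (cachedFs == null) {
            Configuration conf = new Configuration();
            conf.set("fs.gs.system.bucket", ""); // skip the system-bucket existence check
            cachedFs = FileSystem.get(new URI("gs://example-bucket"), conf); // placeholder bucket
        }
        return cachedFs;
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = gcsFs();
        for (FileStatus status : fs.listStatus(new Path("gs://example-bucket/raw/"))) {
            System.out.println(status.getPath());
        }
    }
}
```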

We plan to release a new GCS connector version with the latest changes in 3-4 weeks.

jaketf commented Jun 27, 2019

Resolved.

jaketf closed this as completed Jun 27, 2019