GCS Connector giving 429s when used from multiple clusters. #185
Comments
I think the first thing that you can try is to increase the retry count. Regarding exponential back-off: the GCS connector supports it, but it doesn't log any information about retries, only the final failure. To mitigate the issue with too many requests, you can bump up the retry count in the connector configuration.
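A minimal spark-shell sketch of raising the retry count. The property name is cut off in the comment above, so `fs.gs.http.max.retry` and the bucket path are assumptions, not something this thread confirms:

```scala
// Run in spark-shell on the cluster, where `spark` is predefined.
// ASSUMPTION: fs.gs.http.max.retry is the connector's retry-count property;
// the actual property name was truncated in the reply above.
val conf = spark.sparkContext.hadoopConfiguration
conf.setInt("fs.gs.http.max.retry", 25)

// Subsequent gs:// reads through this configuration get the larger retry budget.
spark.read.text("gs://my-bucket/some/path").count() // hypothetical path
```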
Thanks for the info on exponential backoff. Is there any way to get information on how many retries are happening? We also use the GCS client library directly (from the Dataproc cluster and GKE clients) to perform certain operations faster, which is confounding the root cause of this issue. However, there does seem to be a ton of extra API traffic from the extra call the GCS connector makes to check bucket existence for every operation. This seems to be addressed by 654b66b, but that hasn't been rolled into a release yet. What is the typical release cadence, or the ETA on rolling this specific commit into a release?
Unfortunately, there is no easy way to get the number of retries, because retries are handled by the GCS API client library, which doesn't log or expose this information. Current versions of the GCS connector check the existence of the system bucket only when a new filesystem instance is initialized, and this per-instance check can be disabled through a connector configuration property. We plan to release a new GCS connector version with the latest changes in 3-4 weeks.
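The property for disabling the check is also truncated above; as a sketch, assuming the check is tied to the `fs.gs.system.bucket` setting (an assumption based on the connector's documented settings, not confirmed by this thread):

```scala
// Run in spark-shell, where `spark` is predefined.
// ASSUMPTION: clearing fs.gs.system.bucket skips the per-instance
// system-bucket existence check; the actual property name was lost
// in the truncated reply above.
val conf = spark.sparkContext.hadoopConfiguration
conf.set("fs.gs.system.bucket", "")
```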
Resolved.
Original issue

We have four Dataproc clusters and a GKE cluster, all of which use the GCS connector to perform operations on the same bucket, which houses the raw data for the many reports and roll-ups that run on an hourly basis. We are encountering many 429 (Too Many Requests) errors.
Cluster Details:
1x 1500-node (n1-standard-8) Dataproc cluster
2x 1000-node (n1-standard-8) Dataproc clusters
1x 800-node (n1-standard-8) Dataproc cluster
1x 6-node (n1-standard-2) GKE cluster
Job Details:
The GKE cluster hosts clients that list objects to find work to submit as Dataproc jobs.
Dataproc drivers also list GCS files.
These 429s were blocking nearly all jobs on v1.9.16, so we bumped to v1.9.17.
Is there a reason that the GCS connector does not implement exponential back-off when receiving these 429s?
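For reference, the general shape of the back-off strategy being asked about, as an illustrative spark-shell sketch only; per the maintainer's reply above, the actual retries live inside the GCS API client library, and the jitter here is a common refinement rather than a detail this thread confirms:

```scala
import scala.util.Random

// Illustrative only: retry an action with exponentially growing, jittered
// sleeps; not the connector's actual code.
def withBackoff[T](maxRetries: Int)(action: => T): T = {
  var attempt = 0
  while (true) {
    try {
      return action
    } catch {
      case _: Exception if attempt < maxRetries =>
        // Sleep roughly 2^attempt seconds, scaled by a 0.5-1.5x jitter factor.
        val delayMs = (math.pow(2, attempt) * 1000 * (0.5 + Random.nextDouble())).toLong
        Thread.sleep(delayMs)
        attempt += 1
    }
  }
  throw new IllegalStateException("unreachable")
}

// Hypothetical usage: wrap a listing call that may hit 429s.
// withBackoff(10) { fs.listStatus(new Path("gs://my-bucket/prefix")) }
```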
We are concerned that fs.gs.glob.flatlist.enable will cause client-side OOMs and/or prohibitively slow performance on these list operations; we are testing this on some jobs today. This is similar to #151, which is marked as fixed as of v1.9.15.
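For that test, a spark-shell sketch of toggling the flat-glob listing; the property name comes from the issue text, while the value and glob path are illustrative assumptions:

```scala
// Run in spark-shell, where `spark` is predefined.
// fs.gs.glob.flatlist.enable makes globbing issue one flat list of the
// prefix and filter matches client-side, instead of many recursive list
// calls; the client-side filtering is where the OOM concern comes from.
val conf = spark.sparkContext.hadoopConfiguration
conf.setBoolean("fs.gs.glob.flatlist.enable", true)

// Hypothetical glob over the shared bucket to observe memory and latency.
spark.read.text("gs://raw-data-bucket/events/*/*.json").count()
```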