
Add GAX client_config to service factory #848

Merged: 1 commit into googleapis:master on Sep 29, 2016

Conversation

@blowmage (Contributor) commented Aug 24, 2016

Language is the first service using GAX. GAX implements retries and incremental backoff, and our original plan was to make it conform to the retries value used by other services. However, GAX is different enough that it probably isn't feasible to share the same configuration approach as Google API Client. Therefore, here is an option for allowing users to configure the GAX client while continuing to use Google Cloud. See also #849 for overriding the GAX configuration on API calls. Either approach is independent of the other.

This change replaces retries with client_config; it effectively leaks the GAX implementation into the Google Cloud Language public API. This means that if GAX ever changes its implementation, Google Cloud Language may also need to make a breaking change to support the new GAX API. So far we have resisted leaking these kinds of details, but there is no way to make the GAX client fully configurable without doing so.

What do you think of this type of change? Is this a direction you want to see the other GRPC/GAX services move towards?

The client_config hash can contain any part of the config JSON that you intend to override. Here is what this may look like:

my_config = {
  "interfaces" => {
    "google.cloud.language.v1beta1.LanguageService" => {
      "retry_codes" => {
        "retry_codes_def" => {
          "idempotent" => ["DEADLINE_EXCEEDED", "UNAVAILABLE"],
          "non_idempotent" => []
        }
      },
      "retry_params" => {
        "default" => {
          "initial_retry_delay_millis" => 100,
          "retry_delay_multiplier" => 1.3,
          "max_retry_delay_millis" => 60000,
          "initial_rpc_timeout_millis" => 60000,
          "rpc_timeout_multiplier" => 1.0,
          "max_rpc_timeout_millis" => 60000,
          "total_timeout_millis" => 600000
        }
      }
    }
  }
}

require "google/cloud"

gcloud = Google::Cloud.new
lang = gcloud.language client_config: my_config
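To illustrate the "any part of the config" behavior, here is a sketch of how a partial override could combine with the defaults. The deep_merge helper below is our own hypothetical illustration, not part of the GAX API; the actual merge semantics are determined by GAX.

```ruby
# Hypothetical deep-merge of a partial client_config into the defaults.
# deep_merge is a sketch for illustration only, not a GAX method.
def deep_merge base, override
  base.merge(override) do |_key, old_val, new_val|
    if old_val.is_a?(Hash) && new_val.is_a?(Hash)
      deep_merge old_val, new_val
    else
      new_val
    end
  end
end

# A trimmed-down stand-in for the bundled default config.
default_config = {
  "interfaces" => {
    "google.cloud.language.v1beta1.LanguageService" => {
      "retry_codes" => {
        "retry_codes_def" => {
          "idempotent" => ["DEADLINE_EXCEEDED", "UNAVAILABLE"],
          "non_idempotent" => []
        }
      },
      "retry_params" => {
        "default" => { "initial_retry_delay_millis" => 100 }
      }
    }
  }
}

# Override only the non_idempotent retry codes; everything else is kept.
override = {
  "interfaces" => {
    "google.cloud.language.v1beta1.LanguageService" => {
      "retry_codes" => {
        "retry_codes_def" => { "non_idempotent" => ["UNAVAILABLE"] }
      }
    }
  }
}

merged = deep_merge default_config, override
```

Under these assumed semantics, only the overridden leaf changes; sibling keys such as the idempotent codes and retry_params survive untouched.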

@blowmage added the "feedback wanted" and "api: language" (Issues related to the Cloud Natural Language API) labels Aug 24, 2016
@googlebot added the "cla: yes" (This human has signed the Contributor License Agreement) label Aug 24, 2016
@blowmage (Contributor, Author)

@timanovsky This isn't Datastore, but this is the approach we are considering for configuring GRPC/GAX clients. Can you look at this and let us know what you think? Is the ability to configure the client at this level worth the risk of future breaking changes due to how leaky this approach is?

@blowmage (Contributor, Author)

@jmuk Can you weigh in on this? How comfortable are you with gcloud-ruby leaking the GAX configuration this way? How likely are we to see breaking changes to the configuration?

@blowmage (Contributor, Author)

@omaray @jgeewax Thoughts on this?

@blowmage blowmage mentioned this pull request Aug 24, 2016
@jmuk (Contributor) commented Aug 24, 2016

My feeling is that the format is almost stable. It's unlikely to change, and I'd say it's safe to expose to users.
I chatted with @jgeiger and @bjwatson (who are responsible for the Python clients and share the same JSON config data), and with @garrettjonesgoogle (similarly for PHP), and they had similar thoughts.

@blowmage (Contributor, Author)

@jmuk Are the code examples in the PR description accurate?

@jmuk (Contributor) commented Aug 24, 2016

Yes, the format in the commit message is correct.

@blowmage (Contributor, Author)

@bmclean Any input on this?

@coveralls commented Aug 25, 2016

Coverage increased (+0.001%) to 95.656% when pulling 378c09e00c694589025ff1aebc32e481b23158d0 on blowmage:language-client_config into 956fbba on GoogleCloudPlatform:master.

@bmclean (Contributor) commented Aug 25, 2016

@blowmage I like the approach of defining which error codes get retried. We currently automatically retry at the application level for UNAVAILABLE and INTERNAL but not for DEADLINE_EXCEEDED. With this approach we could move that retry down into the gcloud datastore config.

I don't know if we would make use of the fine-grained control proposed in #849. For datastore calls critical to our business I think we would want to keep the retry at the application level.

@blowmage (Contributor, Author)

@bmclean: Thanks for the input!

@blowmage (Contributor, Author)

@timanovsky Can you look this over and give us your thoughts on exposing this level of configuration? This would replace the retries option.

@timanovsky

I don't see how to specify the number of retries. I see a time deadline, but not the number of attempts desired.

Also, can you provide a sensible default list of exceptions to retry on? My understanding is that it is:

[Google::Cloud::AbortedError,
Google::Cloud::DeadlineExceededError,
Google::Cloud::InternalError,
Google::Cloud::ResourceExhaustedError,
Google::Cloud::UnavailableError,
Google::Cloud::UnknownError]

According to https://cloud.google.com/datastore/docs/concepts/errors#error_codes

(not sure about the last one)

@blowmage (Contributor, Author)

Also, can you provide a sensible default list of exceptions to retry on?

Perhaps @jmuk can give us some more explanation of the configuration and what the error strings are that can be set for retry.

I don't see how to specify the number of retries.

Right, that's the change with this new configuration. There is no single value for the number of retries. The number of retries would have to be derived from the following parameters:

"retry_params" => {
  "default" => {
    "initial_retry_delay_millis" => 100,
    "retry_delay_multiplier" => 1.3,
    "max_retry_delay_millis" => 60000,
    "initial_rpc_timeout_millis" => 60000,
    "rpc_timeout_multiplier" => 1.0,
    "max_rpc_timeout_millis" => 60000,
    "total_timeout_millis" => 600000
  }
}
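As a rough illustration, the worst-case attempt count implied by these parameters can be estimated by simulating the loop. This is our simplified model of the backoff (every attempt is assumed to burn its full RPC timeout, and the RPC timeout stays constant since rpc_timeout_multiplier is 1.0), not actual GAX code.

```ruby
# Simplified worst-case model: each attempt consumes its full RPC timeout,
# then the client waits an exponentially growing delay (capped at the max)
# until total_timeout_millis would be exceeded. Not GAX code.
params = {
  "initial_retry_delay_millis" => 100,
  "retry_delay_multiplier"     => 1.3,
  "max_retry_delay_millis"     => 60_000,
  "initial_rpc_timeout_millis" => 60_000,
  "total_timeout_millis"       => 600_000
}

def estimated_attempts params
  elapsed  = 0
  delay    = params["initial_retry_delay_millis"]
  attempts = 0
  loop do
    elapsed += params["initial_rpc_timeout_millis"] # attempt runs to its timeout
    attempts += 1
    break if elapsed + delay > params["total_timeout_millis"]
    elapsed += delay                                # wait before retrying
    delay = [delay * params["retry_delay_multiplier"],
             params["max_retry_delay_millis"]].min  # backoff, capped
  end
  attempts
end
```

With the default parameters above, this model yields about ten attempts before total_timeout_millis is exhausted; faster failures (e.g. an immediate UNAVAILABLE) would allow more.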

@blowmage (Contributor, Author)

@timanovsky For reference, here is the entire GAX JSON configuration for Datastore, since you are likely most familiar with it:

{
  "interfaces": {
    "google.datastore.v1.Datastore": {
      "retry_codes": {
        "retry_codes_def": {
          "idempotent": [
            "DEADLINE_EXCEEDED",
            "UNAVAILABLE"
          ],
          "non_idempotent": []
        }
      },
      "retry_params": {
        "default": {
          "initial_retry_delay_millis": 100,
          "retry_delay_multiplier": 1.3,
          "max_retry_delay_millis": 60000,
          "initial_rpc_timeout_millis": 60000,
          "rpc_timeout_multiplier": 1.0,
          "max_rpc_timeout_millis": 60000,
          "total_timeout_millis": 600000
        }
      },
      "methods": {
        "Lookup": {
          "timeout_millis": 60000,
          "retry_codes_name": "idempotent",
          "retry_params_name": "default"
        },
        "RunQuery": {
          "timeout_millis": 60000,
          "retry_codes_name": "idempotent",
          "retry_params_name": "default"
        },
        "BeginTransaction": {
          "timeout_millis": 60000,
          "retry_codes_name": "non_idempotent",
          "retry_params_name": "default"
        },
        "Commit": {
          "timeout_millis": 60000,
          "retry_codes_name": "non_idempotent",
          "retry_params_name": "default"
        },
        "Rollback": {
          "timeout_millis": 60000,
          "retry_codes_name": "non_idempotent",
          "retry_params_name": "default"
        },
        "AllocateIds": {
          "timeout_millis": 60000,
          "retry_codes_name": "non_idempotent",
          "retry_params_name": "default"
        }
      }
    }
  }
}

@blowmage (Contributor, Author)

@timanovsky The way I read the configuration, the time between retries increases until the max is reached, and then it keeps retrying at that max interval. But there doesn't seem to be any mechanism for capping the total number of retries. @jmuk can you confirm?

@quartzmo (Member)

I'm not certain about this, but max_retry_delay_millis seems like it could be the total limit for all retries, and after it is exceeded, an error will occur?

@timanovsky

This is quite an uncommon approach. If it remains this way, I guess I will need to keep using application-level retry logic.

@quartzmo (Member)

@jmuk Regarding @timanovsky's comment above, is there any flexibility in the design of the GAX retry behavior?

@jmuk (Contributor) commented Sep 12, 2016

The interval between retries is increased exponentially (by retry_delay_multiplier), but capped at max_retry_delay_millis. That's the "exponential backoff", and only DEADLINE_EXCEEDED and UNAVAILABLE are specified as "Retry using exponential backoff" on that page.

Other errors are normally not retried automatically; developers will have to look into the details to fix the problem (which is why they are documented as "Do not retry without fixing the problem").

ABORTED can be added for some methods too, although I'm not sure it's actually used for non-transactional APIs.

@jmuk (Contributor) commented Sep 12, 2016

The config has a 'retry_codes_def' section which specifies the retryable errors, but the retry behavior is the same across all failures of a method; per-failure retry logic is not implemented (and that would be complicated when combined with exponential backoff).

@blowmage (Contributor, Author)

Thanks @jmuk. Is there a better list of all the GRPC Error Codes than this?

https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

@bmclean (Contributor) commented Sep 12, 2016

How does initial_rpc_timeout_millis work? The description is "the initial timeout parameter to the request". If I set it to 5000 and my app makes a request, what happens after the 5 seconds if no response has been received? Does it time out and move on to the retry logic? I am also assuming that if an error is thrown before the timeout expires (such as UNAVAILABLE), it would also move on to the retry logic?

@blowmage (Contributor, Author)

@jmuk Any input?

@@ -110,7 +110,7 @@ def self.language project = nil, keyfile = nil, scope: nil, retries: nil,
       end
       Google::Cloud::Language::Project.new(
         Google::Cloud::Language::Service.new(
-          project, credentials, retries: retries, timeout: timeout))
+          project, credentials, timeout: timeout, client_config: client_config))


@jmuk (Contributor) commented Sep 13, 2016

Sorry for the delay. Yes, that link is the reference for gRPC errors, but in the context of Google API clients, https://cloud.google.com/datastore/docs/concepts/errors#error_codes is the better reference.

initial_rpc_timeout is used as the timeout parameter when invoking a method call, and it also backs off exponentially, increasing by rpc_timeout_multiplier and capped at max_rpc_timeout. See: http://www.rubydoc.info/gems/google-gax/Google/Gax/BackoffSettings

If it's set to 5 seconds and the server does not respond in that time, DEADLINE_EXCEEDED should be raised and the invocation will be retried. The retried call's timeout is increased by rpc_timeout_multiplier after waiting for the retry delay. If it still times out, the timeout is increased further, after an increased retry delay, and so on.

@bmclean (Contributor) commented Sep 13, 2016

Thanks @jmuk, that is perfect. I still think this approach has more value than using a fixed retry parameter. It gives you exponential back-off on the desired error codes (I am thinking of UNAVAILABLE) with a max overall timeout (so it won't retry forever), and it also allows forcing a retry after a specific time.

"initial_rpc_timeout_millis": 5000
(the initial timeout parameter to the request)
Times out if a response is not initially received within 5 sec.

"initial_retry_delay_millis": 500
(the initial delay time, in milliseconds, between the completion of the first failed request and the initiation of the first retrying request)
Waits 0.5s and automatically retries the request.

"rpc_timeout_multiplier": 1.0
(the multiplier by which to increase the timeout parameter between failed requests)
Each retried request would again wait 5 seconds before timing out (the timeout stays constant with a multiplier of 1.0).

"retry_delay_multiplier": 2.0
(the multiplier by which to increase the delay time between the completion of failed requests, and the initiation of the subsequent retrying request)
The request will keep being retried with an exponential back-off (0.5s, 1s, 2s between attempts), waiting 5 seconds for each request, until it succeeds or finally:

"total_timeout_millis": 17000
(the total time, in milliseconds, starting from when the initial request is sent, after which an error will be returned, regardless of the retrying attempts made meanwhile)

@blowmage I am now thinking that we would also have use cases for the per API request approach outlined in #849.
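The walkthrough above can be sketched as a small simulation. This is our simplified model of the retry loop (each attempt runs to its full RPC timeout, then waits the retry delay before the next attempt, until total_timeout_millis would be exceeded), not GAX itself.

```ruby
# Simplified simulation of the scenario described above. Not GAX code.
params = {
  "initial_rpc_timeout_millis" => 5000,
  "rpc_timeout_multiplier"     => 1.0,
  "initial_retry_delay_millis" => 500,
  "retry_delay_multiplier"     => 2.0,
  "total_timeout_millis"       => 17_000
}

def attempt_timeline params
  timeline = []                                   # [start_ms, timeout_ms] per attempt
  now      = 0
  timeout  = params["initial_rpc_timeout_millis"]
  delay    = params["initial_retry_delay_millis"]
  while now < params["total_timeout_millis"]
    timeline << [now, timeout]
    now += timeout + delay                        # attempt times out, then wait
    timeout *= params["rpc_timeout_multiplier"]   # constant here (1.0)
    delay   *= params["retry_delay_multiplier"]   # 500 -> 1000 -> 2000 ...
  end
  timeline
end
```

Under this model the attempts start at 0, 5.5s, and 11.5s; a fourth attempt would not begin until 18.5s, past the 17s total timeout, matching the 0.5/1/2 second back-off described above.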

Replace retries value with the GAX client configuration.
This leaks the GAX implementation to the google-cloud public API.
If GAX changes its implementation this may result in a change to
the google-cloud public API.
@quartzmo (Member)

After considering the options, I agree that this is probably the best way forward. I am going to merge this PR, and I think we should use it as the basis for configuring retries in the other services we convert to GAX.

@quartzmo quartzmo merged commit 848b608 into googleapis:master Sep 29, 2016
quartzmo added a commit to quartzmo/google-cloud-ruby that referenced this pull request Oct 7, 2016
GAX does not directly support  integer.

[refs googleapis#848]