Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manage Timeout in Http connection and have Hystrix report timeout correctly #920

Closed
billyy opened this issue Oct 4, 2015 · 17 comments
Closed

Comments

@billyy
Copy link

billyy commented Oct 4, 2015

Matt,

Last time we discussed the merit of using Http connection timeout vs letting Hystrix to interrupt the thread the timeout. You did point out the disadvantage in this case is that Hystrix will report these issues as Failures instead of timeout. Is there a way to manually tell Hystrix it is a timeout, not failure in the case of client code managing Http Connection and Socket timeout? I went thru the code and it looks like the HystrixTimeoutException is wrapped in a private class. So it is NOT accessible from the client code.

@mattrjacobs
Copy link
Contributor

At the moment, the metrics around Timeouts refer specifically to timeouts induced by Hystrix. I'm not sure that adding some HTTP errors to this metric make the system easier to understand or operate.

If Hystrix did as you suggest, how would this help you?

@billyy
Copy link
Author

billyy commented Oct 5, 2015

For example, I have an external partner with socket timeout of 30 seconds. We like to configure 5 seconds connection timeout so we can "fail faster" in case there is no connection available. In the current setup, Hystrix will count it as failure instead timeout. Logically I like to still see it as timeout so production support can tell the difference and act accordingly.

@mattrjacobs
Copy link
Contributor

Right, but consider what would happen in practice if your change was in place. Production support would see a rise in "Hystrix timeout"s, and then have to figure out if they were network timeouts or timeouts induced by Hystrix. To do that, they'd then have to look at whatever monitoring you have set up at your network layer. So I don't believe that you're making anything easier by combining the two timeout sources into a single metric.

@billyy
Copy link
Author

billyy commented Oct 6, 2015

Ah. Because Hystrix timeout does not necessary equal to socket or connection timeout. Is the definition of Hystrix timeout just simply about Hystrix waited long enough and return control then? If that's the definition, I would agree it does not make sense.

@mattrjacobs
Copy link
Contributor

What Hystrix counts as a timeout in metrics today is all command invocations that got executed and then executed for longer than the specified Hystrix timeout value. Hystrix should then attempt the fallback, and then use that as the result of the command invocation.

@billyy
Copy link
Author

billyy commented Oct 6, 2015

With that definition, I understand why we don't want to lump the client timeout with Hystrix timeout. The reason I brought this up was our previous conversation you made me realized that client connection timeout will go into the Failure bucket, not the timeout bucket. Is there any recommendation around this topic? Because we have ops and dev separation, we like to be able to tell the difference between timeout (Hystrix or client code timeout) and exception. It almost sounds like I need to introduce a fourth metrics. Anyway, any suggestion will be appreciated.

PS: I remembered when I was in the edge team, we never set the client socket timeout.

@mattrjacobs
Copy link
Contributor

The way I think about it is to consider Hystrix as application-layer code. The run()/construct() method may execute code that is purely local, to a cache, over the network, but the metrics for any of those cases are the same. Those are latencies and counts of outcomes of the command execution. Then, separately, there are metrics for the other subsystems. Cache clients should have a set of metrics, as should each network connection.

As an example, Ribbon (the Netflix HTTP client library) produces metrics which include client-observed latency, NumCompleted/NumErrors nad then breaks down errors by either status code or by ConnectException/ReadException/etc.

In general, a rise in Hystrix error% does not give me enough information to immediately solve the production issue. It does, however, point me to other metrics which should allow that. But building all of these directly into Hystrix is not a great model.

@billyy
Copy link
Author

billyy commented Oct 7, 2015

It sounds like the recommendation is to leave it in the Exception bucket and has the next level drill down on the error. It will be good to be able to see all that in one dashboard, and that's why I was wondering if it should be in Hystrix metrics. Anyway, thanks for the explanation. I am good!

@billyy
Copy link
Author

billyy commented Oct 7, 2015

There is one subtle difference in my situation. Unlike Netflix, We do have separation between ops and Dev. Ops will take care of network issue and dependency issues. Dev is involved with only exceptional case. If there is a way to extend hystrix to group and report this type of connection error separately, that will work for us better. I am thinking more like extension and less of changing the built in behavior to solve for our use case where Dev and ops are separate role.

@billyy
Copy link
Author

billyy commented Oct 7, 2015

Ops is the first one that get call in our case, not dev. That is really my motivation! I can fork and add the feature I want but then I will lose all the bug fixes and enhancement. If there is way to extend hystrix without forking, that will be the preferred route.

@billyy
Copy link
Author

billyy commented Oct 7, 2015

If you can make the following class "HystrixTimeoutException" accessible from client code, that is really all I am looking for. Right now it is wrapped in a private class. Anyway, this is really solving for my use case and not necessary applicable for Netflix (Given dev and ops is the same person). Thanks.

private static class HystrixObservableTimeoutOperator<R> implements Operator<R, R> {

    final AbstractCommand<R> originalCommand;

    public HystrixObservableTimeoutOperator(final AbstractCommand<R> originalCommand) {
        this.originalCommand = originalCommand;
    }

    public static class HystrixTimeoutException extends Exception {

        private static final long serialVersionUID = 7460860948388895401L;

    }

@billyy
Copy link
Author

billyy commented Oct 9, 2015

One more data point. Our external partner has a 30 sec timeout so we want to "fail faster" by setting a five seconds connection timeout. (In Netflix, almost all calls use the default one second so there is lesser reason to have connection vs socket timeout. The action for both hystrix timeout and connection timeout will be the same. Our ops team will reach out to the partner. But if the connection timeout is throwing exception instead, Dev will need to investigate and the end result is also reaching out to the partner.

@mattrjacobs
Copy link
Contributor

I think that making the TimeoutException public would allow each class that extends HystrixCommand to use it if they wish, or ignore it. Then it's up to each Hystrix user on what type of exceptions they want to be reported as TIMEOUTs.

I'll work on that and get it in the next release. Thanks for the detailed explanation.

@billyy
Copy link
Author

billyy commented Oct 10, 2015

Big thanks!!!

@mattrjacobs
Copy link
Contributor

Fixed in #931

@billyy
Copy link
Author

billyy commented Oct 15, 2015

When is the next official maven release?

@mattrjacobs
Copy link
Contributor

This is released in 1.4.18: https://github.com/Netflix/Hystrix/releases/tag/v1.4.18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants