API backend timeouts can lead to multiple request retries #18

GUI · 2014-01-02T23:33:26Z

There are request timeouts set up at the nginx and Varnish reverse proxy layers (defaulting to 60 seconds, I think). So if a request doesn't start responding within 60 seconds, the request to the client is aborted. When an API backend is very slow to respond, I believe nginx retries the request after it has timed out. This leads to duplicate requests to the API backend, which is probably not what we want when a timeout occurs.

I haven't fully debugged this, so it needs a bit more investigation. But since I'm seeing mysterious duplicate requests for long-running failed requests, my theory is that nginx is triggering them based on the proxy_next_upstream setting. That setting should probably be changed to omit timeout.

In the case where I've seen this, there's only one API backend server, but there are multiple gatekeeper servers defined for load balancing, and I believe that's what's triggering the retries. So it's probably important to check how the retries and proxy error handling are affected by each proxy layer.
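To make the suspected mechanism concrete, here is a toy model of it in Python. It is only a sketch of the theory above, not nginx's actual retry algorithm: the Backend and Gatekeeper classes, proxy_request, and the retry_on set are all hypothetical names invented for illustration, and the 90 second response time is an arbitrary value above the 60 second timeout.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A single API backend that records every request it receives."""
    response_seconds: float
    requests_seen: int = 0

    def handle(self) -> float:
        self.requests_seen += 1
        return self.response_seconds


@dataclass
class Gatekeeper:
    """One of several gatekeeper processes, all forwarding to the same backend."""
    backend: Backend

    def handle(self) -> float:
        return self.backend.handle()


def proxy_request(gatekeepers, timeout_seconds, retry_on):
    """Try each upstream in turn; move to the next one only if a timeout
    is listed in retry_on. Returns what the client would see."""
    for gatekeeper in gatekeepers:
        elapsed = gatekeeper.handle()
        if elapsed <= timeout_seconds:
            return "ok"
        if "timeout" not in retry_on:
            break  # Timeouts are not retried, so give up immediately.
    return "timeout"


def count_backend_requests(retry_on, gatekeeper_count=2):
    """Send one client request through the model and count backend hits."""
    backend = Backend(response_seconds=90)
    gatekeepers = [Gatekeeper(backend) for _ in range(gatekeeper_count)]
    proxy_request(gatekeepers, timeout_seconds=60, retry_on=retry_on)
    return backend.requests_seen


print(count_backend_requests({"error", "timeout"}))  # retry on timeouts
print(count_backend_requests({"error"}))             # timeouts not retried
```

In this model, one client request reaches the single backend once per gatekeeper when timeouts are retried, and only once when they are not, which matches the duplicate requests described above.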

So to reproduce this, I think all that should be necessary is to introduce an API backend that takes longer than 60 seconds to respond. Then check to see that a single user request via API Umbrella leads to multiple API backend requests after it times out.
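A slow backend for this reproduction could look something like the following sketch, which uses only the Python standard library. It sleeps before answering and counts the requests it receives, so any duplicates show up in the counter. The function names, port 8099, and the 90 second delay are assumptions chosen for illustration; none of them come from API Umbrella.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_slow_backend(delay_seconds, port=0):
    """Build an HTTP server that sleeps before replying and counts requests."""
    counter = {"requests": 0}
    lock = threading.Lock()

    class SlowHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            with lock:
                counter["requests"] += 1
                seen = counter["requests"]
            time.sleep(delay_seconds)  # Simulate the slow API backend.
            body = f"request {seen}\n".encode()
            try:
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            except (BrokenPipeError, ConnectionResetError):
                pass  # The proxy gave up waiting and closed the connection.

        def log_message(self, format, *args):
            pass  # Stay quiet; the counter is what matters here.

    server = ThreadingHTTPServer(("127.0.0.1", port), SlowHandler)
    return server, counter


def serve_until_interrupted(delay_seconds=90, port=8099):
    """Run the slow backend until Ctrl-C, then report how many requests arrived."""
    server, counter = make_slow_backend(delay_seconds, port)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
    print(f"backend saw {counter['requests']} request(s)")
```

Calling serve_until_interrupted(), registering the server as an API backend, and making a single request through the proxy should be enough: if the final count is greater than one, the retries are happening.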

Having nginx consider a backend down after a timeout might be okay for some backends, but it should probably not be the default. If the slow requests are resource intensive, duplicate requests start before the first one has even finished, and the API backend can get overwhelmed. And it definitely should not be enabled for the proxy that load balances against the gatekeeper processes, since we don't want to consider a single gatekeeper unavailable just because it happens to be serving a slow API backend request.

GUI · 2014-10-27T06:17:49Z

This was fixed in the recent revamp of the router. We also now have an integration test to verify this behavior.

GUI closed this as completed Oct 27, 2014

GUI added a commit that referenced this issue Sep 27, 2015

Merge pull request #18 from NREL/error-data-non-objects

35efc2c

Better error handling for if error data is unexpectedly not an object.

GUI added a commit that referenced this issue Sep 27, 2015

Merge pull request #18 from NREL/error-data-non-objects

8af00fd

Ensure that the error data yaml entered is the expected type (hash)
