
Very small percentage of requests returning 504 Gateway Timeout errors from ELB without ever hitting our servers #251

Closed
GUI opened this issue Jun 26, 2015 · 4 comments

@GUI
Member

GUI commented Jun 26, 2015

I've stumbled upon the possibility that we have a very small percentage of failing requests against api.data.gov that we weren't previously aware of. The problem seems to be that the ELB that sits in front of our servers sometimes returns a 504 Gateway Timeout error, yet I can find no indication of these requests ever actually hitting our servers. When this problem occurs, plenty of other requests seem to be succeeding at the same time, so I'm unsure why the ELB times out in these cases without ever connecting to our servers. Here are some reports of similar-sounding issues with ELBs:

https://forums.aws.amazon.com/thread.jspa?messageID=620410
https://forums.aws.amazon.com/thread.jspa?messageID=638885

I only discovered this after transitioning developer.nrel.gov over to api.data.gov and noticing an uptick in 504 errors generated by one of our apps that consumes our APIs heavily. At first I thought it might be NREL's outbound network, but after enabling ELB logging, I was able to verify that in at least one case the request made it to the ELB, which returned the 504. I cannot find any record of the request hitting our servers, or any errors generated by our servers during this time. Note the missing backend address and the -1 processing times in the ELB log entry below, which suggest the ELB never established a connection to a backend instance:

9394:2015-06-26T17:45:12.242435Z api-nrel REAL_IP_HERE:48179 - -1 -1 -1 504 0 0 0 "GET http://developer.nrel.gov:80/api/alt-fuel-stations/v1/nearest.json?fuel_type=all&access=public&status=E&owner_type=all&cards_accepted=all&return_filter_sql=true&location=40.829703%2C-73.014264+%281099+Horseblock+Rd%2C+Farmingville%2C+Farmingville+11738%29&limit=10&offset=260&radius=infinite&api_key=REAL_KEY_HERE HTTP/1.1" "Ruby" - -
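Entries like this can be pulled out of the downloaded ELB access logs with something along these lines (a rough sketch; the log path is hypothetical):

```
# Rough sketch (hypothetical log path): list ELB-generated 504s where the
# backend address is "-" and all three processing times are -1, i.e. the
# ELB never established a connection to a backend instance.
grep -n ' - -1 -1 -1 504 ' /var/log/elb-access-logs/*.log
```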

Since it sounds like this problem may be specific to certain ELBs, it may not be affecting all of our domains. Based on some rough estimates, this may be affecting around 0.005% of NREL requests, so it's very infrequent, but obviously not ideal. This probably needs a bit more investigation, or reaching out to AWS support.

@GUI GUI added the bug label Jun 27, 2015
@ziggythehamster

Hi!

We're having the same issues as you. We're in us-west-1, and our ELB is in a VPC. We're using a Route 53 ALIAS record to point DNS at the ELB, and we're not using SSL right now. Our backend is fine: I just used ab to hit the endpoint about 10,000 times, and it worked 100% of the time.
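For reference, the check was roughly this (a sketch; the backend hostname, path, and concurrency are placeholders):

```
# Roughly what we ran (backend hostname/path and concurrency are
# placeholders): 10,000 requests against the backend directly, bypassing
# the ELB, all of which returned successfully.
ab -n 10000 -c 10 http://backend-1.internal.example.com/healthcheck
```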

Did you ever figure out what was causing this?

@ziggythehamster

Also worth noting: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/ts-elb-error-message.html#ts-elb-errorcodes-http504

That document says that the backend must have given up, but our logs and CloudWatch do not corroborate that.

@ajmath

ajmath commented Dec 2, 2015

Having similar issues. No evidence of connection on backend servers. The 504 is returned within 750ms.

@GUI
Member Author

GUI commented Apr 13, 2016

I'm very late in updating this ticket, but the other ELB issue that cropped up (#330) reminded me of this.

The short version is that I believe this is solved.

I think the Lua architecture changes we rolled out last fall contributed the most to fixing this, since we're no longer reloading nginx so frequently. I believe the frequent nginx reloads were disrupting keepalive connections from the ELB, which would explain the behavior.

After rolling those changes out, the frequency of these errors went down, but they still occasionally cropped up. On this [ELB idle connection page], Amazon now points out:

To ensure that the load balancer is responsible for closing the connections to your back-end instance, make sure that the value you set for the keep-alive time is greater than the idle timeout setting on your load balancer.

Since both were set to 60 seconds, I think this was the other culprit. I then tuned things so that nginx's keepalive timeout is longer than the ELB's idle timeout.
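The change amounts to something like this (a rough sketch; 75s is just an example value, assuming the ELB idle timeout is left at 60 seconds):

```
# Keep idle keepalive connections on the nginx side open longer than the
# ELB's 60-second idle timeout, so the ELB is always the side that closes
# the connection first.
http {
    keepalive_timeout 75s;
}
```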

Since making that change several months ago, these random 504 errors have been practically eliminated. I set up CloudWatch alerts for these, and I still get alerts about maybe 2-3 failed requests a month. Those 2-3 failed requests a month represent about 0.00000004% of our requests, so while it would be nice to understand them, at this point I'm inclined to chalk them up to issues that may be outside our control rather than spend a lot more time digging. So I'm going to go ahead and close this.
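For reference, the alerting amounts to roughly this (a sketch; the load balancer name and SNS topic ARN are placeholders):

```
# Rough sketch of the CloudWatch alarm (load balancer name and SNS topic
# ARN are placeholders): alert whenever the ELB itself generates any 5xx
# responses in a 5-minute window.
aws cloudwatch put-metric-alarm \
  --alarm-name api-nrel-elb-5xx \
  --namespace AWS/ELB \
  --metric-name HTTPCode_ELB_5XX \
  --dimensions Name=LoadBalancerName,Value=api-nrel \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:elb-alerts
```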
