
Very small percentage of requests returning 504 Gateway Timeout errors from ELB without ever hitting our servers #251

Closed
GUI opened this issue Jun 26, 2015 · 4 comments

@GUI
Member

GUI commented Jun 26, 2015

I've stumbled upon the possibility that we have a very small percentage of failing requests against api.data.gov that we weren't previously aware of. The problem seems to be that the ELB that sits in front of our servers sometimes returns a 504 Gateway Timeout error, yet I can find no indication of these requests ever actually hitting our servers. When this problem occurs, plenty of other requests seem to be succeeding at the same time, so I'm unsure why the ELB times out in these cases without ever connecting to our servers. Here are some reports of similar-sounding issues with ELBs:

https://forums.aws.amazon.com/thread.jspa?messageID=620410
https://forums.aws.amazon.com/thread.jspa?messageID=638885

I only discovered this after transitioning developer.nrel.gov over to api.data.gov and noticing an uptick in 504 errors generated by one of our apps that consumes our APIs heavily. At first I thought it might be NREL's outbound network, but after enabling ELB logging, I was able to verify that in at least one case the request made it to the ELB, which returned the 504. I cannot find any record of the request hitting our servers, or any errors generated by our servers during this time. Note the missing backend address and the -1 processing times in the ELB log entry below, which suggest the ELB never established a connection to a backend instance:

9394:2015-06-26T17:45:12.242435Z api-nrel REAL_IP_HERE:48179 - -1 -1 -1 504 0 0 0 "GET http://developer.nrel.gov:80/api/alt-fuel-stations/v1/nearest.json?fuel_type=all&access=public&status=E&owner_type=all&cards_accepted=all&return_filter_sql=true&location=40.829703%2C-73.014264+%281099+Horseblock+Rd%2C+Farmingville%2C+Farmingville+11738%29&limit=10&offset=260&radius=infinite&api_key=REAL_KEY_HERE HTTP/1.1" "Ruby" - -
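Entries like this can be pulled out of the downloaded ELB access logs with something along these lines (a rough sketch; the log path is hypothetical):

```
# Rough sketch (hypothetical log path): list ELB-generated 504s where the
# backend address is "-" and all three processing times are -1, i.e. the
# ELB never established a connection to a backend instance.
grep -n ' - -1 -1 -1 504 ' /var/log/elb-access-logs/*.log
```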

Since it sounds like this problem may be specific to certain ELBs, it may not be affecting all of our domains. Based on some rough estimates, this may be affecting around 0.005% of NREL requests, so it's very infrequent, but obviously not ideal. This probably needs a bit more investigation, or reaching out to AWS support.

@GUI GUI added the bug label Jun 27, 2015
@ziggythehamster

Hi!

We're having the same issues as you. We're in us-west-1, and our ELB is in a VPC. We're using a Route 53 ALIAS record to point DNS at the ELB, and we're not using SSL right now. Our backend is fine: I just used ab to hit the endpoint about 10,000 times, and it worked 100% of the time.
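For reference, the check was roughly this (a sketch; the backend hostname, path, and concurrency are placeholders):

```
# Roughly what we ran (backend hostname/path and concurrency are
# placeholders): 10,000 requests against the backend directly, bypassing
# the ELB, all of which returned successfully.
ab -n 10000 -c 10 http://backend-1.internal.example.com/healthcheck
```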

Did you ever figure out what was causing this?

@ziggythehamster

Also worth noting: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/ts-elb-error-message.html#ts-elb-errorcodes-http504

That document says that the backend must have given up, but our logs and CloudWatch do not corroborate that.

@ajmath

ajmath commented Dec 2, 2015

Having similar issues. No evidence of connection on backend servers. The 504 is returned within 750ms.

@GUI
Member Author

GUI commented Apr 13, 2016

I'm very late in updating this ticket, but the other ELB issue that cropped up (#330) reminded me of this.

The short version is that I believe this is solved.

I think the Lua architecture changes we rolled out last fall contributed the most to fixing this, since we're no longer reloading nginx so frequently. I believe the frequent nginx reloads were disrupting keepalive connections from the ELB, which would explain the behavior.

After rolling those changes out, the frequency of these errors went down, but they still occasionally cropped up. On this [ELB idle connection page], Amazon now points out:

To ensure that the load balancer is responsible for closing the connections to your back-end instance, make sure that the value you set for the keep-alive time is greater than the idle timeout setting on your load balancer.

Since both were set to 60 seconds, I think this was the other culprit. I then tuned things so that nginx's keepalive timeout is longer than the ELB's idle timeout.
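The change amounts to something like this (a rough sketch; 75s is just an example value, assuming the ELB idle timeout is left at 60 seconds):

```
# Keep idle keepalive connections on the nginx side open longer than the
# ELB's 60-second idle timeout, so the ELB is always the side that closes
# the connection first.
http {
    keepalive_timeout 75s;
}
```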

Since making that change several months ago, these random 504 errors have been practically eliminated. I set up CloudWatch alerts for these, and I still get alerts about maybe 2-3 failed requests a month. Those 2-3 failed requests a month represent about 0.00000004% of our requests, so while it would be nice to understand them, at this point I'm inclined to chalk them up to issues that may be outside our control rather than spend a lot more time digging. So I'm going to go ahead and close this.
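For reference, the alerting amounts to roughly this (a sketch; the load balancer name and SNS topic ARN are placeholders):

```
# Rough sketch of the CloudWatch alarm (load balancer name and SNS topic
# ARN are placeholders): alert whenever the ELB itself generates any 5xx
# responses in a 5-minute window.
aws cloudwatch put-metric-alarm \
  --alarm-name api-nrel-elb-5xx \
  --namespace AWS/ELB \
  --metric-name HTTPCode_ELB_5XX \
  --dimensions Name=LoadBalancerName,Value=api-nrel \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:elb-alerts
```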
