Misreported queue time #1

Closed
jaggederest opened this Issue Feb 18, 2013 · 14 comments


If the front-end machine and the machine running the dyno do not have clocks synchronized to within ~10ms or so, this header alteration will misreport queue time, possibly to a massive extent. This is extremely common on AWS, as the functional resolution of network time synchronization is poor (e.g. NTP accuracy over the internet is in the tens of milliseconds).

As an example, if a front-end machine is a second fast, you'll see an extra second of queue time; if it is a second slow, you'll see negative queue time: https://github.com/newrelic/rpm/blob/master/lib/new_relic/agent/instrumentation/queue_time.rb#L175

Owner

a-warner commented Feb 18, 2013

@jaggederest yes, you make a great point. This is a concern that we are aware of, but, unfortunately, neither New Relic nor Heroku is able to provide us with a better measure of queue time, so we're stuck with this.

Clock drift is a huge issue, and, especially on AWS, I would expect time sync to be somewhat unreliable. We attempt to address the issue of negative queue times with the following (from queue_time_logger.rb):

queue_time_microseconds = [microseconds - request_start_microseconds, 0].max

With a substantial amount of data (e.g., many thousands of requests per second), I would hope that the clock drift would more or less average out in the aggregate (some clocks are behind the router, some ahead, etc.). One encouraging point is that the queue time reported by New Relic with this gem installed roughly matches a simulation that we did in R. See http://rapgenius.com/1506509 for more details.

As you point out, this data isn't perfect, so we're looking to Heroku and New Relic for a more accurate way of measuring dyno-level queue time. In the meantime, we wanted to get something in place, and this gem is the next best thing. I'm happy to incorporate any suggestions!

Unfortunately, because of the way you're clamping values to positive, clock skew will only ever inflate your reported queue time, usually by a fairly large amount. I'd guess that if you're seeing a second of queue time on average, you're probably mostly seeing skew. At least that's what it looked like when I was testing it - some machines would have a skew of 30 seconds, leading to a large average. If you had a histogram of response times, I'd be willing to bet you wouldn't see a continuous distribution, but discrete peaks from different machines having different skew. This patch will almost certainly deceive you more than simply paying attention to the other numbers would.
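
To make the bias concrete, here is a rough simulation (not code from either project; the true queue time and skew range are made-up numbers) of what clamping does when front-end clocks are skewed in both directions:

    # Hypothetical simulation: true queue time is 50ms, front-end clock skew
    # is uniform between -1s and +1s, and negative deltas are clamped to zero
    # the same way queue_time_logger.rb does.
    TRUE_QUEUE_MS = 50.0

    samples = 10_000.times.map do
      skew_ms  = (rand * 2000.0) - 1000.0   # skew in [-1000ms, +1000ms)
      observed = TRUE_QUEUE_MS + skew_ms    # what the header delta works out to
      [observed, 0.0].max                   # the clamp
    end

    puts "true queue time:      #{TRUE_QUEUE_MS} ms"
    puts "mean clamped reading: #{(samples.inject(:+) / samples.size).round(1)} ms"
    # The clamped mean lands around 275ms, not 50ms: negative skew is thrown
    # away while positive skew is counted in full, so the average only inflates.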

The ideal way to do it is to have the request accumulate an HTTP_X_QUEUE_START header for each machine it passes through, at both the beginning and end of any queueing that happens. This is the change I'd hope to see from Heroku. I've never personally liked the idea of them dictating 'this many milliseconds is how long it spent in queue'; from a user perspective that's not really a useful number. What you're interested in is 'how long did it take before the request got to my code in Thin/Unicorn?'
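
As a rough sketch of what a per-hop stamp could look like in the Ruby stack (the middleware name and the comma-joined multi-value format are assumptions here, not anything Heroku or New Relic ship), a Rack middleware could append its own arrival time to the header:

    # Hypothetical Rack middleware that appends a "t=<microseconds>" stamp to
    # X-Queue-Start for this hop; mount it as early as possible in config.ru.
    class StampQueueStart
      def initialize(app)
        @app = app
      end

      def call(env)
        stamp    = "t=#{(Time.now.to_f * 1_000_000).to_i}"
        existing = env['HTTP_X_QUEUE_START']
        env['HTTP_X_QUEUE_START'] = existing ? "#{existing}, #{stamp}" : stamp
        @app.call(env)
      end
    end

The hops that actually matter are Heroku's own, though, which is why this would need to come from their side rather than from inside the dyno alone.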

Simulations in R give you an idea of expected values, but you'd be better off looking at the end-user data in New Relic for actual user-seen values - queue time is irrelevant at that point. Subtracting out the time displayed in the New Relic stats for server time, as they do on the graph, and optimizing for the remaining time is the best approach you currently have available, absent more headers and some work from the Heroku side.

Owner

a-warner commented Feb 18, 2013

Interesting; how would the additional HTTP_X_QUEUE_START header be applied? Could it be applied by Thin at the edge of the dyno, e.g., before it passes the request on to Rails? Is this something Heroku needs to do, or is there some way to instrument Thin ourselves?

I'm not sure where the current queueing is happening.

I would hope that @hone could help you out - the header should be added as soon as possible on each machine that touches the request. Then I would get in contact with the New Relic folks, perhaps @gnarg or @samg, who can get you a look at some raw metric data from those headers, to make sure you're seeing data that is a) valid and b) useful.

The code that I'm referring to when I recommend these things is all right here: https://github.com/newrelic/rpm/blob/master/lib/new_relic/agent/instrumentation/queue_time.rb

You'll note that it supports multiple instances of each type of header - with some custom work you could probably get a dashboard built within New Relic that shows you better rainbow-style charts for each of these things.

Again, I'm just an interested observer, somewhat tickled that there's a controversy involving code I'm the worldwide expert on, having been part of the process of putting together the queue time instrumentation. I felt at the time that we were probably neglecting internal latency from network transfer and the like (sometimes an appreciable amount across availability zones and regions), but what we had were simply the best defaults available. Since you're having issues, I'd recommend that you really dive in with the New Relic folks and get some truly custom instrumentation going, preferably with the help of Heroku, since I don't know that all of the code involved is actually Ruby or user-accessible.

That would give you some data that you can take to the bank, as it were. I am also inclined to believe that Heroku will make it right for you, but I have no evidence besides positive previous experiences with Adam and the team.

Had another thought: you could run an app with a fixed response time in server code (wait 1 second before returning) and see what the performance looks like under various scenarios with something like apache bench. That would be a reasonable way to check the validity of the stats.
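
A minimal sketch of such a test app (the names and numbers here are illustrative only):

    # config.ru for a hypothetical constant-latency test app: every request
    # sleeps 1 second, so anything beyond ~1s of reported time is queueing,
    # routing, or clock skew rather than application work.
    run lambda { |env|
      sleep 1
      [200, { 'Content-Type' => 'text/plain' }, ["slept 1s\n"]]
    }

    # Drive load from outside with something like:
    #   ab -n 500 -c 50 https://your-test-app.example.com/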

samg commented Feb 19, 2013

@a-warner, @jaggederest - We've just released version 3.5.7.57.beta (https://rubygems.org/gems/newrelic_rpm/versions/3.5.7.57.beta) of the New Relic agent. It should resolve the underreporting of queue time on heroku, and make configuration of queue time reporting more robust and less prone to misconfiguration.

You can find the official beta release notes at https://github.com/newrelic/rpm/blob/release/CHANGELOG#L6-L11, but here's a more technical description of what's changed:

  • Ignore HTTP_X_HEROKU_QUEUE_WAIT_TIME http header. We were incorrectly interpreting this value to be the request's queue time (on heroku apps), when in fact it was only a (small) portion of the queue time. We now disregard this header.
  • Allow simplified format for queue time headers. Now the agent will interpret queue time headers (e.g. X-Queue-Start, X-Request-Start) whose value is a plain timestamp (e.g. 1361260322000000). Previously it was required to submit the header with a preceding "t=" (e.g. t=1361260322000000). The previously supported format is still correctly interpreted.
  • Detect timestamp units. We no longer require that the timestamps be submitted in microseconds. We allow times to be submitted in microseconds, milliseconds, or seconds (with fractional unit), and will automatically determine the unit used, based on a reliable heuristic (a rough sketch of the idea follows below).
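
The agent's actual heuristic lives in the newrelic_rpm source; as a rough sketch of the magnitude-based idea (the cutoff values below are an assumption, not the agent's exact rules):

    # Sketch of a magnitude-based unit guess: a Unix timestamp around early
    # 2013 is ~1.4e9 in seconds, ~1.4e12 in milliseconds, ~1.4e15 in
    # microseconds, so the order of magnitude reveals the unit.
    def timestamp_to_seconds(raw)
      value = raw.to_s.sub(/^t=/, '').to_f  # accept "t=1361260322000000" or a bare number
      if value > 1e14       # looks like microseconds since the epoch
        value / 1_000_000.0
      elsif value > 1e11    # looks like milliseconds
        value / 1_000.0
      else                  # assume (possibly fractional) seconds
        value
      end
    end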

Upgrading to this version should result in accurate reporting of queue times on Heroku (though the clock drift issues you mention could still occur).

We're very interested in your feedback on the new version. I'm also happy to answer any questions you or others have about how it works, and our general approach to queue time in the Ruby agent. We're planning to release it as GA this week.

Thanks!

Owner

a-warner commented Feb 20, 2013

Thanks @samg, we'll definitely check this out. I think the right solution is still (as @jaggederest points out) to get Heroku to add these headers at the edge of the dyno, or to instrument Thin itself. Is that something you looked into?

samg commented Feb 20, 2013

Sure thing.
Heroku provides the headers which the agent interprets to generate the Queue Time metric. The new version of the New Relic agent ignores the HTTP_X_HEROKU_QUEUE_WAIT_TIME header, which led to the underreported queue times. Heroku also sets an X-Request-Start header, which contains an integer number of milliseconds since the Unix epoch. The Heroku engineer I've been in contact with has told me the X-Request-Start header is added as soon as possible after the request hits Heroku's infrastructure. The independent testing we've done seems to confirm this, and also that (at least on our test app) the Queue Time reported by the new version accurately reflects the time spent in Heroku's infrastructure before your application receives the request. (I can provide more details if you're interested.) There's also some more (non-Heroku-specific) detail about how Queue Time reporting works at https://newrelic.com/docs/features/tracking-front-end-time.

Instrumenting Thin won't actually help in this case, since what we're measuring is the time before the Ruby application receives the request (routing, load balancing, nginx waiting for a Ruby process to become available, etc.). We do have some projects coming later this year that should make instrumenting time outside the Ruby application easier. We also have plans to automatically instrument and graph Rack middlewares, which will give more insight into where time is spent before a request hits your main app.

Heroku has been working with us to ensure that the headers they set are accurately interpreted by the Ruby agent.

Let me know if you have any questions or I can clarify anything further. Also feel free to email me at sam at newrelic.com.

Thanks!

@samg thanks for the straight dope! @a-warner he's right on the money; I'm going to buzz off now.

Owner

a-warner commented Feb 20, 2013

@samg @jaggederest sorry for dragging this on, but both this gem and the New Relic beta release are addressing the same issue - attempting to measure dyno-level queue time. My understanding from Adam Wiggins is that nginx does not run on each dyno, but, rather, runs at the routing mesh layer. Therefore, the queueing must be taking place at the Ruby web server layer, e.g., in Thin, right? Am I missing some other layer? Wouldn't the best way to calculate total queue time, while accounting for clock drift, be to ADD router queue time and dyno queue time (as calculated by starting a timer EXACTLY when Thin receives the request, and ending it when Rails gets to it)?
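
As a rough sketch of that split (hypothetical names; not part of this gem or the New Relic agent), the in-dyno timer could be a Rack middleware mounted first, with the two deltas taken just before the app runs:

    # Hypothetical split of total queue time into router vs in-dyno parts.
    # Mounted as the very first middleware inside the dyno:
    class MarkDynoArrival
      def initialize(app)
        @app = app
      end

      def call(env)
        env['example.dyno_received_at'] = Time.now.to_f  # as early as the Ruby process sees it
        @app.call(env)
      end
    end

    # Later, just before the Rails app runs (assuming a microsecond "t=" header):
    #   router_queue = env['example.dyno_received_at'] -
    #                  env['HTTP_X_REQUEST_START'].sub(/^t=/, '').to_f / 1_000_000.0
    #   dyno_queue   = Time.now.to_f - env['example.dyno_received_at']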

Ignoring HTTP_X_HEROKU_QUEUE_WAIT_TIME entirely doesn't exactly make sense either, since that number IS relevant in the case where no dynos are available, e.g., during a restart or a deploy.

samg commented Feb 20, 2013

@a-warner, no problem, I'm happy to discuss this at length and will be updating our docs to hopefully be clearer.

The approach we encourage with queue time is to set the X-Request-Start or X-Queue-Start header (the two are treated identically) at the earliest point after the request enters your (or, in this case, Heroku's) infrastructure. That is typically when the request hits your front-end web server (e.g. nginx), but it could be when it hits your load balancers, etc. When the request reaches the Ruby application, we calculate the delta between the timestamp set in the header and the current time, and report this duration as queue time. Setting it as early as possible means you are less likely to miss performance problems in your infrastructure that occur before the header is set.
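
In Ruby terms, the application-side calculation is roughly this (a simplified sketch, not the agent's actual code, and assuming a microsecond "t=" header):

    # Simplified sketch of the queue-time delta, assuming the front end set
    # X-Request-Start / X-Queue-Start to "t=<microseconds since the epoch>".
    def queue_time_seconds(env)
      raw = env['HTTP_X_REQUEST_START'] || env['HTTP_X_QUEUE_START']
      return 0.0 unless raw

      started_at = raw.sub(/^t=/, '').to_f / 1_000_000.0  # header timestamp, in seconds
      [Time.now.to_f - started_at, 0.0].max                # clamp negatives from clock drift
    end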

Currently we don't provide a robust way (that doesn't involve creating your own custom views in New Relic) to break this "queue time" down into smaller chunks (e.g. load balancing, front-end server, queueing for a Rails process, middleware, etc.), though there's been talk of doing this for quite some time, and we're planning to take some steps toward it in the next few months (e.g. instrumenting and visualizing time spent in middleware).

In general, though, the approach I encourage above (and which Heroku's X-Request-Start header seems in line with) should highlight the type of performance problems RapGenius has experienced. I've personally found this metric very useful when optimizing my own (non-Heroku-hosted) Rails apps. For example, it helps expose when there are too few Rails worker processes and nginx is having to wait for one to become available. It can also show the effects of a deploy, where frequently you're waiting for Rails processes to respawn and nginx is holding the requests until they're ready.

Hopefully that makes sense. Let me know if you have any follow up questions or suggestions. I'm happy to keep discussing it and hopefully get better at explaining how it works and what its limitations are.

BTW, we just released version 3.5.7.59 of the Ruby agent (https://rubygems.org/gems/newrelic_rpm), which contains the changes discussed above. We've independently verified that queue time is reported correctly on our Heroku apps when running this version of the agent, though the nature of this feature does leave it susceptible to clock drift or misconfiguration of the passed headers. In general, though, it can be a very useful indicator of scaling or deployment problems. I'd love to hear how version 3.5.7 of the agent works for you and whether you find the metrics it reports to be accurate.

GitHub needs a ++ button for comments and threads. Thanks @samg for the hard work on these changes, brother

Owner

a-warner commented Feb 20, 2013

Great, thanks for the writeup @samg - we'll try this out on our app and see what happens. At some point I'd like to discuss it more - visualizing in-dyno queueing vs. infrastructure/lb/dispatching differences would be a GREAT feature

samg commented Feb 22, 2013

Just a couple of quick updates:

We just pushed a blog post describing our Request Queueing feature and the recent changes we've released. It covers similar material to the discussion on this thread.
http://blog.newrelic.com/2013/02/21/using-new-relic-on-heroku-read-how-our-new-ruby-agent-measures-queue-time/

We've also updated our docs on the topic to be clearer and more up to date:
https://newrelic.com/docs/features/request-queueing-and-tracking-front-end-time

WRT splitting in-dyno queueing from dispatching, you could encourage heroku to add additional upstream HTTP timestamp headers (e.g. X-Dispatch-Start, X-Dyno-Queue-Start) to incoming requests. Then we could potentially update the Ruby agent to interpret these, and update the New Relic site to visualize them.
