Description
Describe your environment
OS: Ubuntu
Python version: Python 3.10
SDK version: 1.31.1
API version: 0.52b1 (same code in 0.53b0 though)
What happened?
The tornado instrumenter and how it tracks a request's elapsed time is not async-safe and does not appear to be able to handle multiple concurrent requests without overwriting and using an incorrect value. This makes the HTTP_SERVER_DURATION metric inaccurate.
Steps to Reproduce
- Start one long-running request to a tornado server.
- Wait a moment, and start a second long-running request to a tornado server.
- Let the first request finish
- Let the second request finish.
- Review the calculated elapsed time of that request.
This effect is much more visible with streaming requests/handlers like an SSE handler or WebSocketHandler (if I ignore #2761) that may stay connected for hours/days.
The _START_TIME
is stored in server_histograms which is a global dict created by the instrumentor. It stores this in the _prepare
hook, then reads this back in _record_on_finish_metrics. You can get interleaved start times.
The start time really needs tracked per handler instance (connection).
Expected Result
Request1 Prepare -> StartTime = 10:00:00am
Request2 Prepare -> StartTime = 10:00:05am
Request1 OnFinish -> Reads original StartTime of 10:00:00am, so at 10:00:10am, elapsed time should be 10 seconds.
Request1 OnFinish -> Reads correct original StartTime as 10:00:05am, so at 10:00:10am, elapsed time should be 5 seconds.
Actual Result
Request1 Prepare -> StartTime = 10:00:00am
Request2 Prepare -> StartTime = 10:00:05am
Request1 OnFinish -> Reads incorrect StartTime of 10:00:05am, so at 10:00:10am, elapsed time reported incorrectly as 5 seconds.
Request1 OnFinish -> Reads correct original StartTime as 10:00:05am, so at 10:00:10am, elapsed reported as 5 seconds.
Additional context
No response
Would you like to implement a fix?
Yes