Automatic buffering and preemptive flushing #146
Conversation
Force-pushed from bad52e4 to 0e83f3d
High quality addition to the library! Thanks for your work on this.
I've left a few comments, the most notable one being about the telemetry stuff that I think should probably be done before merging this PR.
lib/datadog/statsd.rb
Outdated
OK = 0
WARNING = 1
CRITICAL = 2
UNKNOWN = 3

DEFAULT_BUFFER_SIZE = 8 * 1_024
UDP_DEFAULT_BUFFER_SIZE = 8_192
This value (8 KB) won't be optimal on real networks, where UDP is transported in Ethernet frames that most of the time leave only 1432 bytes available (the MTU is 1500 almost everywhere). Setting this to 8 KB means there will be IP fragmentation, and a chance of losing 8 KB of metrics at once.
You've detailed the reasons for this bump; maybe we should consider doing the telemetry improvements in a PR targeting the current branch (kbogtob/automatic-buffering), merge it into this one, and then merge this one into master?
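A quick back-of-the-envelope check of the fragmentation concern (this only accounts for plain IPv4 + UDP headers; the 1432-byte figure quoted above is a more conservative allowance than this arithmetic yields):

```ruby
# Why an 8 KiB datagram fragments on a standard 1500-byte MTU link.
ETHERNET_MTU = 1500 # bytes available for the IP packet
IPV4_HEADER  = 20   # minimum IPv4 header, no options
UDP_HEADER   = 8

max_udp_payload = ETHERNET_MTU - IPV4_HEADER - UDP_HEADER
puts max_udp_payload # => 1472

# An 8 KiB buffer spans several IP fragments; losing any single
# fragment drops the whole datagram, i.e. all 8 KiB of metrics.
fragments = (8 * 1_024).fdiv(max_udp_payload).ceil
puts fragments # => 6
```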
module Datadog
  class Statsd
    class MessageBuffer
      PAYLOAD_SIZE_TOLERANCE = 0.05
I get the idea, but are you sure it helps (versus just using 100% of the buffer size)? I'm not against it; I just want to know whether there is a benefit to adding this extra math in the buffer code (mostly because, whenever we debug the client's output, we'll have to keep it in mind).
With a buffer size of 1472, it represents 73 bytes, which could be enough for one more metric.
My idea was to flush preemptively when we think a new message won't fit. You're right that with a size of 1432 it's quite a lot. Do you think I should remove it?
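As a rough sketch of the tolerance idea under discussion (the class and method names below are hypothetical, not the actual MessageBuffer API): with a tolerance of 0.05 the buffer flushes once it is 95% full, leaving up to `max_payload_size * tolerance` bytes of headroom unused (~73 bytes at 1472).

```ruby
# Hypothetical sketch of the preemptive-flush tolerance, not the gem's code.
class ToleranceBuffer
  def initialize(max_payload_size:, tolerance: 0.05)
    # Soft limit: stop filling once we are within `tolerance` of capacity.
    @limit  = max_payload_size - (max_payload_size * tolerance).round
    @buffer = +''
  end

  # Returns the flushed payload when appending would cross the soft limit,
  # nil otherwise.
  def add(message)
    flushed = nil
    if @buffer.bytesize + message.bytesize + 1 > @limit
      flushed = @buffer
      @buffer = +''
    end
    @buffer << message << "\n"
    flushed
  end
end
```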
Latest changes LGTM.
Very minor, but I think batch would be a better name than pool for what we are doing. Feel free to ignore this if you think otherwise. Other than that, LGTM.
It is now handled by the MessageBuffer
* Isolate forwarding logic into a class. We are also renaming parameters so they are prefixed by their relevant usage.
* Rewrite telemetry to snapshot messages in an array. Also reduce allocations using sprintf to compensate.
* Use telemetry as a message emitter in the forwarder.
Force-pushed from 8acd4c3 to 5fecfbc
sample_rate: nil,
disable_telemetry: false,
It looks like this was a breaking API change that didn't trigger a changelog entry or a major version bump. I suggest at least retroactively adding a changelog entry; I had to dig around in the source to find this out.
It'll go in with: #288
What does this PR do?
Introducing new breaking changes:
* It refactors the batch class into the MessageBuffer class.
* It removes the #batch method on the Statsd class, as buffering is now automatic and always on.
* It flushes preemptively when reaching a given message count (max_buffer_pool_size) or 95% of the maximum buffer size (max_buffer_payload_size).
* If users want to flush data, they can call the new #flush method on Statsd (that method is used a lot in the specs).
It also adds new buffering-specific integration tests to cover edge cases.
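The flushing rules above can be sketched as follows. This is a hypothetical, self-contained illustration based on the description, not the gem's actual MessageBuffer; the sink is anything responding to `<<` (in the real client, a connection):

```ruby
# Sketch: flush when either the message count reaches max_buffer_pool_size
# or the payload would exceed ~95% of max_buffer_payload_size.
class SketchMessageBuffer
  FLUSH_THRESHOLD = 0.95

  def initialize(sink, max_buffer_pool_size:, max_buffer_payload_size:)
    @sink      = sink
    @max_count = max_buffer_pool_size
    @max_bytes = (max_buffer_payload_size * FLUSH_THRESHOLD).floor
    reset
  end

  def add(message)
    # Preemptive flush: send what we have if the new message won't fit.
    flush if @bytes + message.bytesize + 1 > @max_bytes
    @messages << message
    @bytes += message.bytesize + 1 # +1 for the "\n" separator
    flush if @messages.size >= @max_count
  end

  def flush
    return if @messages.empty?
    @sink << @messages.join("\n")
    reset
  end

  private

  def reset
    @messages = []
    @bytes = 0
  end
end
```

With an array as the sink, adding two messages under a pool size of 2 produces one joined payload; an oversized third message triggers a preemptive flush before it is buffered.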
Motivation
Preparation for the version 5 roadmap: automatic buffering, refactored telemetry (buffered like other messages), and async IO in a separate thread.
Notes
We cannot use the optimized buffer size for UDP (1432 bytes) for now because of how telemetry is implemented: it reserves part of the buffer using a worst-case estimate of the size it could take. With a 1432-byte UDP buffer, only a few global tags would make that worst-case estimate bigger than the buffer itself!
I worked around this for now by increasing the default buffer size. I will open another PR after this one is merged to refactor telemetry as a "client" of the buffer, instead of it reserving buffer space and using the low-level Connection class directly.
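Rough arithmetic illustrating why the reservation can outgrow a 1432-byte buffer (the line count and per-line overhead below are assumptions for illustration, not the actual telemetry format; the key point is that the global tag string is repeated on every telemetry line):

```ruby
# Assumed shape: one telemetry line per counter, each carrying the
# client's global tags, plus some fixed name/value overhead per line.
telemetry_lines = 9   # assumed number of telemetry counters
fixed_overhead  = 60  # assumed per-line name + value bytes

global_tags = 'env:production,service:checkout,team:payments,' \
              'region:us-east-1,availability-zone:us-east-1a,pod:web-1234'

worst_case = telemetry_lines * (fixed_overhead + global_tags.bytesize)
puts worst_case # with these assumed tags, already above 1432 bytes
```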