
Ensure CRubyBuffer respects max_size #1715

Merged
merged 4 commits from max-size-cruby into master on Oct 18, 2021
Conversation

@marcotc (Member) commented Oct 7, 2021

Fixes #1704

This PR ensures that CRubyBuffer continuously tries to keep the internal buffer to no more than @max_size items.

This is done by "clamping" the internal array to @max_size whenever we detect that the buffer is full. This eventually catches any racing modifications that would cause the buffer to exceed its maximum size.
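
A minimal sketch of the clamping idea, with illustrative names (the real lib/ddtrace/buffer.rb differs in detail):

class ClampedBuffer
  def initialize(max_size)
    @max_size = max_size
    @items = []
  end

  def add!(item)
    # Check-then-act: another thread can append between these two steps.
    full? ? replace!(item) : (@items << item)
  end

  private

  def full?
    @items.length >= @max_size
  end

  # The initial approach in this PR: evict a random item, append the new
  # one, then clamp away anything that raced in past @max_size.
  def replace!(item)
    @items.delete_at(rand(@items.length))
    @items << item
    @items.slice!(@max_size..-1)
  end
end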

@marcotc marcotc added the core Involves Datadog core libraries label Oct 7, 2021
@marcotc marcotc self-assigned this Oct 7, 2021
@marcotc marcotc requested a review from a team October 7, 2021 23:23
@marcotc (Member Author) left a comment

@ivoanjo I can think of a few ways to insert strategic Thread.pass around this code to make it exceed the internal buffer capacity, but surprisingly I wasn't able to make it exceed the limit.
Due to this, I'm thinking this implementation is now empirically thread-safe 😬, unless we can find a hole in it.

Do you see any possible issues with the implementation as of right now @ivoanjo? If you can find an artificial wait point that breaks it I would appreciate it, because I can't find one anymore.
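
As an illustration of such a wait point (hypothetical placement, not the actual buffer code), Thread.pass just yields to the scheduler, widening the check-then-act window:

def add!(item)
  if full?
    replace!(item)
  else
    Thread.pass # hypothetical wait point: other threads may fill the buffer here
    @items << item
  end
end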

@codecov-commenter commented Oct 7, 2021

Codecov Report

Merging #1715 (73c4630) into master (8f697cc) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #1715   +/-   ##
=======================================
  Coverage   98.18%   98.18%           
=======================================
  Files         934      934           
  Lines       45122    45121    -1     
=======================================
+ Hits        44301    44302    +1     
+ Misses        821      819    -2     
Impacted Files | Coverage Δ
lib/ddtrace/buffer.rb | 95.34% <100.00%> (ø)
spec/ddtrace/buffer_spec.rb | 98.22% <100.00%> (-0.02%) ⬇️
lib/ddtrace/diagnostics/environment_logger.rb | 100.00% <0.00%> (+1.94%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f697cc...73c4630.

@delner (Contributor) left a comment

I like that there's no "leniency" and that max size is max size. Looks good!

@ivoanjo (Member) left a comment

@ivoanjo I can think of a few ways to insert strategic Thread.pass around this code to make it exceed the internal buffer capacity, but surprisingly I wasn't able to make it exceed the limit.
Due to this, I'm thinking this implementation is now empirically thread-safe 😬, unless we can find a hole in it.

Hmm, initially I was going to say I was happy, but after staring at it for a while I'm not quite so sure anymore lol.

There are a few weird effects I can think of (and simulate with a few sleeps):

  1. If only adds are made, we can still get into trouble. E.g. if I run the test with thread_count = 200, max_size = 100, but each thread only adds a single item AND we add the sleep in #full?, then we're back to a similar issue as in #1704 (CRubyTraceBuffer can exceed its max size and that leads to flaky tests): all 200 threads decide to add, and all add and go on their merry way.

  2. Because replace is implemented as delete + add, during the execution of replace, between the delete and the add, if you additionally add a sleep there, we can again get into the issue in 1) (sketched after this list): the first thread observes a full buffer, decides to replace, removes an item, sleeps. All other threads observe a non-full buffer and decide to add.

  3. If all threads decide to replace at the same time, a third weird effect can happen: the buffer can actually be observed as empty, although it shouldn't be. E.g. imagine that thread_count = 100, max_size = 100, and the buffer starts full. Let's say that all 100 threads try to add a single item, decide that they need to replace, and then run the delete and then sleep. This means that we can catch them "in the act" and actually find an empty buffer, because all 100 threads deleted an item and none has yet added its replacement.
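
To make scenario 2 concrete, a sketch with a hypothetical sleep standing in for an unlucky preemption inside a delete + add replace:

def replace!(item)
  @items.delete_at(rand(@items.length)) # buffer momentarily holds max_size - 1 items
  sleep(0.01) # hypothetical: every other thread now sees full? == false and appends
  @items << item # the buffer can end up well past max_size again
end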

So the issues I still identify are:

a. Add can still go over -- may or may not be problematic; e.g. I think 1) above isn't very problematic, but 2) is
b. Replace using delete+add is not atomic, so weird things can happen in between (and between it and add).

My suggestion for a. is also adding slice! to add!; I'm still a bit unconvinced that a slice! that effectively does nothing is expensive enough to matter (but if we fix b. I think it's less of a problem).

My suggestion for b. is to not use delete+add, but instead just replace a given position. This would mean that the array size would become an invariant after all threads agree that the array is full. E.g. once no thread is inside an add codepath, if we always replace, the array size never changes and thus 2. and 3. cannot happen anymore.

The one downside here is that there's no array API to "replace an item in an array, and give me back the old item". But from my inspection, it doesn't seem that relevant for #replace! to return the old item (it's just a "nice to have"), so we can just change the method contract to not do that. (Because otherwise we could have a race where two threads decide to replace the same location and both claim to have replaced the same item).
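
Taken together, the suggestion amounts to something like this sketch (illustrative only, not the final ddtrace code): index assignment keeps the array length invariant once the buffer is full, and the displaced item is deliberately not returned:

def replace!(item)
  # Overwrite a random slot in place; there is no delete + add window,
  # so the array length never changes while all threads are replacing.
  @items[rand(@items.length)] = item
  # Contract change: don't return the old item, since two racing threads
  # could otherwise both claim to have displaced the same one.
  nil
end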

@marcotc (Member Author) commented Oct 15, 2021

My suggestion for a. is also adding slice! to add!; I'm still a bit unconvinced that a slice! that effectively does nothing is expensive enough to matter (but if we fix b. I think it's less of a problem).

I don't think we can have a "thread-safe" slice! operation that we can perform in add! that also does not mess with the fair trace sampling when the buffer is full (the options are sketched after this list):

  1. If we slice! before @items << item we are no better than checking for full? again.
  2. If we slice! after @items << item, we are possibly dropping the item we just inserted at the end of the array. If we slice! from the beginning of @items instead, the operation requires the whole array to be moved in memory every single time.
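
For reference, the two placements from the list above, with hypothetical helper names:

# Option 1: slicing before the append is just full? by another name.
def slice_then_add!(item)
  @items.slice!(@max_size..-1) # no-op unless the buffer already overflowed
  @items << item               # the race window between the two steps remains
end

# Option 2: slicing after the append may evict the item we just added,
# and slicing from the front instead forces CRuby to shift every
# remaining element left in memory on each call.
def add_then_slice!(item)
  @items << item
  @items.slice!(@max_size..-1)
end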

My suggestion for b. is to not use delete+add, but instead just replace a given position. This would mean that the array size would become an invariant after all threads agree that the array is full. E.g. once no thread is inside an add codepath, if we always replace, the array size never changes and thus 2. and 3. cannot happen anymore.

I've made this change as suggested.

The one downside here is that there's no array API to "replace an item in an array, and give me back the old item". But from my inspection, it doesn't seem that relevant for #replace! to return the old item (it's just a "nice to have"), so we can just change the method contract to not do that. (Because otherwise we could have a race where two threads decide to replace the same location and both claim to have replaced the same item).

I don't think this is a problematic downside. I've made changes to replace! so that it no longer returns the old item.

@marcotc (Member Author) commented Oct 15, 2021

I can get the internal @items to spike, like you said @ivoanjo. It does get corrected on the next item added to the buffer, but during a very short, intense burst of insertions the buffer can temporarily grow past the limit.

I don't think this is a large issue: the objects that the application wants to insert into the buffer so quickly are already taking up their place in the heap, and we'll reap them from the buffer at the very next opportunity, so they won't accumulate any more memory than they already occupy in the heap.

@marcotc marcotc requested a review from ivoanjo October 15, 2021 22:15
@ivoanjo (Member) left a comment

I don't think we can have a "thread-safe" slice! operation that we can perform in add! that also does not mess with the fair trace sampling when the buffer is full:
[...]
I don't think this is a large issue: the objects that the application wants to insert into the buffer so quickly are already taking up their place in the heap, and we'll reap them from the buffer at the very next opportunity, so they won't accumulate any more memory than they already occupy in the heap.

I don't think it's a big issue either -- I was about to click approve on the previous revision before I spotted the weird interaction during the replace.

I will say a few things though, just to mess with you :p

  1. I don't think this race would be common enough for us to care this much about fairness under the race conditions. E.g. since losing the fairness would be the corner case, I could live with that trade-off too.

  2. I don't think we're keeping the fairness as you intended. Yes, we don't slice at the end of add, but we're still appending to the end. Thus, once the first replace kicks in, we'll slice! all of it anyway. So the fairness only really gets preserved in that tiny window of opportunity between the last add running, and the first replace kicking in. 😉

  3. It would be nice if we were to document some of these hidden expectations of the class; I was searching a bit and I don't think we ever wrote down our intentions, which I think would be relevant once neither of us is around ;)

lib/ddtrace/buffer.rb (inline review comment, outdated and resolved)
@marcotc (Member Author) commented Oct 18, 2021

I documented the desire to fairly sample traces.

Thus, once the first replace kicks in, we'll slice! all of it anyway.

I agree. It doesn't seem like we have good performant options to strictly ensure correctness and fairness in Ruby today.

@marcotc marcotc merged commit 72892df into master Oct 18, 2021
@marcotc marcotc deleted the max-size-cruby branch October 18, 2021 19:29
@github-actions github-actions bot added this to the 0.54.0 milestone Oct 18, 2021
Labels
core Involves Datadog core libraries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CRubyTraceBuffer can exceed its max size and that leads to flaky tests (#1704)
4 participants