ddtrace/tracer: Don't drop trace if some spans must be kept #963

piochelepiotr · 2021-07-13T20:27:16Z

The current behavior is:
Drop the trace except if all spans should be kept.

This doesn't work if:

a downstream span has no error, is closed (so should be dropped) but then the rest of the trace ends up having an error so should be kept.
a span should be kept for app analytics (event sampling), but the trace is a p0 so the whole trace is dropped.

The PR inverses the behavior:
Keep the trace except if all spans should be dropped.

ddtrace/tracer/spancontext.go

ddtrace/tracer/tracer_test.go

ddtrace/tracer/span.go

ddtrace/tracer/sampler.go

ddtrace/tracer/spancontext.go

…isions

gbbr · 2021-07-16T18:09:55Z

ddtrace/tracer/spancontext.go

+	// we have a tracer that can receive completed traces.
+	atomic.AddInt64(&tr.spansFinished, int64(len(t.spans)))
+	sd := samplingDecision(atomic.LoadInt64((*int64)(&t.samplingDecision)))
+	if sd == decisionNone {


This ties decisionNone to p0 which is unfair. What if we just did this instead?

Suggested change

if sd == decisionNone {

if p, ok := t.samplingPriority(); ok && p == ext.PriorityAutoReject {

You should move this condition below, inside if sd != decisionKeep

The reason is: let's not make decisionNone be related to priority sampling. Let's just have it mean "undecided".

knusbaum

Should this pass? I think it probably should.

func TestSamplingDecision(t *testing.T) {
...
	t.Run("ratesampler", func(t *testing.T) {
		tracer, _, _, stop := startTestTracer(t)
		defer stop()
		tracer.config.serviceName = "test_service"
		tracer.config.sampler = NewRateSampler(0)
		span := tracer.StartSpan("name_1").(*span)
		child := tracer.StartSpan("name_2", ChildOf(span.context))
		child.Finish()
		span.Finish()
		assert.Equal(t, float64(ext.PriorityAutoReject), span.Metrics[keySamplingPriority])
		assert.Equal(t, decisionKeep, span.context.trace.samplingDecision)
	})
...

Is there a reason we need 3 states (None, Drop, Keep)? It seems more complicated than necessary. I'd prefer to just have 2 (Drop and Keep), with the default being Drop. Then the Sampler or any span can call keep() and the whole trace will be kept. It should be possible for any trace to transition from Drop -> Keep, but not from Keep -> Drop.

piochelepiotr · 2021-07-19T08:21:55Z

Is there a reason we need 3 states (None, Drop, Keep)? It seems more complicated than necessary. I'd prefer to just have 2 (Drop and Keep), with the default being Drop. Then the Sampler or any span can call keep() and the whole trace will be kept. It should be possible for any trace to transition from Drop -> Keep, but not from Keep -> Drop.

So the goal is not to change the current logic (except fixing the bug where we drop P0s with a span containing events or errors).

But the current logic is that if the root span is not sampled by the rate sampler, the trace is dropped (no matter what else happens).

Client sampling is for P0 and P1 traces, the only moment you would enable it when the agent / tracer can't cope with the volume.
So now that we can compute stats in the tracer, we will never enable the client sampling.

ddtrace/tracer/spancontext.go

piochelepiotr · 2021-07-19T08:57:13Z

@gbbr , so we need 3 states because:
trace is sampled if : root span sampled by rate sampler AND at least one span sampled by the other samplers.
We need one state to store the results of root span sampled by rate sampler, and 1 state to store if at least one span was sampled by other samplers.

In other words, the root span being sampled by the rate sampler doesn't imply that the trace is sent to the agent. Just that the trace is not dropped for sure.

gbbr · 2021-07-20T09:33:03Z

ddtrace/tracer/spancontext.go

-	t.mu.RLock()
-	defer t.mu.RUnlock()
+// returns the sampling priority of the trace. Not thread safe.
+func (t *trace) samplingPriorityUnsafe() (p int, ok bool) {


Suggested change

func (t *trace) samplingPriorityUnsafe() (p int, ok bool) {

func (t *trace) samplingPriorityLocked() (p int, ok bool) {

We use this pattern in multiple places in the repository, and generally the suffix is *Locked. Can you please rename it? (it's even used right here as setSamplingPriorityLocked)

gbbr · 2021-07-20T10:01:42Z

I agree with @knusbaum that this is unfortunately still confusing with the 3 states.

decisionNone - means that the local rate sampler chose to keep this trace. In this state, a trace may still be dropped if the feature flags are on and it's a P0.
decisionDrop - the rate sampler chose to drop the trace. It will be dropped no matter what.
decisionKeep - keep it.

Both decisionNone and decisionKeep mean to keep the trace. The only difference is that this decision can be overridden if the trace is a P0 only for decisionNone. A trace with decisionKeep must be kept regardless of factors, even if it's a P0 (in cases where it has a span with an error for example).

Even writing this, I have a hard time wrapping my head around the logic. Even though this may be frustrating for such a small change, I think it's worthwhile we explore the easiest to understand solution. Please share ideas. Here is one (going a bit backwards to 2 booleans):

1. Use two boolean fields (again)

drop - reports whether the trace was dropped by the local rate sampler
important - reports whether this trace is important to the user (e.g. it contains an error or an event)

Then:

If a trace is drop == true it gets dropped no matter what (important or not).
If a trace is drop == false it may get dropped if the feature flag is enabled and it's a P0
If a trace is drop == false and important == true it will not get dropped regardless of feature flag and priority.

2. Use a state

We modify the current type to be more descriptive. We add a field called state of the type traceState. Then we can have 3 states:

stateLocallySampled - sampled by the rate sampler (default state for 99% of use-cases). Includes P0 and P1.
stateLocallyDropped - dropped by the rate sampler.
stateImportant - this is a P0 trace that contains an event or an error

Note that you may think that this is the same as today with different names but notice that stateImportant is not valid for P1's by definition (whereas today stateKeep is).

3. Use bitmask for all booleans

This is an improvement over the first proposal. It implies merging all booleans in the trace structure that we currently have: locked, full, drop and important into one bitmask. It could be called flags as the field name.

type traceFlag uint8

const (
    // flagDropped reports whether the trace was dropped by the local sampler.
    flagDropped traceFlag = 1 << iota
    // flagImportant reports whether the trace is important (e.g. has an error or an event).
    flagImportant
    // flagPriorityLocked reports whether the sampling priority can no longer be changed.
    flagPriorityLocked
    // flagFullCapacity reports whether this trace is at full capacity and can buffer no more spans.
    flagFullCapacity
)

We can then use bit operations or add helper methods.

4. We keep what we have today.

We avoid the above and keep the current solution, since none of them aren't necessarily less confusing in practice.

What do you guys think?

P.S. I do realise that this is a small change in theory, and that it's a lot of philosophy for a bug fix, but I think we should carefully consider future maintenance of this code because if this is hard to understand now while it is fresh in our minds, it will be much harder later on once it fades.

ddtrace/tracer/spancontext.go

…ropping-p0

knusbaum

+1
I'm OK with this, since it seems to be necessary to manage the sampling decision state.

@gbbr Thanks for the alternatives. I think they're all about the same complexity level as this. Unless you think we should pick one over this one, I think we can leave it as-is.

ddtrace/tracer: Don't drop trace if some spans must be kept

1244148

gbbr reviewed Jul 14, 2021

View reviewed changes

ddtrace/tracer/spancontext.go Outdated Show resolved Hide resolved

ddtrace/tracer/spancontext.go Outdated Show resolved Hide resolved

piochelepiotr added 2 commits July 15, 2021 10:57

ddtrace/tracer: Fix bug in sampling

4601f4a

ddtrace/tracer: Fix bug in shouldKeep

d3ed2d9

piochelepiotr force-pushed the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch from 0ccd340 to 7336829 Compare July 15, 2021 10:45

ddtrace/tracer: add tests of kept traces

8ccbae4

piochelepiotr force-pushed the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch from 7336829 to 8ccbae4 Compare July 15, 2021 10:46

piochelepiotr added this to the 1.32.0 milestone Jul 15, 2021

piochelepiotr marked this pull request as ready for review July 15, 2021 10:50

gbbr reviewed Jul 15, 2021

View reviewed changes

ddtrace/tracer/tracer_test.go Outdated Show resolved Hide resolved

ddtrace/tracer/span.go Outdated Show resolved Hide resolved

ddtrace/tracer/span.go Show resolved Hide resolved

gbbr reviewed Jul 15, 2021

View reviewed changes

ddtrace/tracer/sampler.go Outdated Show resolved Hide resolved

piochelepiotr added 2 commits July 15, 2021 15:11

ddtrace/tracer: address comments

a104381

ddtrace/tracer: Fix metric

12f9ed5

piochelepiotr force-pushed the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch from 699790e to 49851a7 Compare July 16, 2021 13:56

ddtrace/tracer: Add sampling decision in trace context

834b0c0

piochelepiotr force-pushed the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch from 49851a7 to 834b0c0 Compare July 16, 2021 13:58

gbbr reviewed Jul 16, 2021

View reviewed changes

ddtrace/tracer/spancontext.go Outdated Show resolved Hide resolved

ddtrace/tracer: Avoid locking trace context when setting sampling dec…

5fc34e7

…isions

gbbr reviewed Jul 16, 2021

View reviewed changes

ddtrace/tracer: Update P0 dropped metric

9768485

knusbaum requested changes Jul 16, 2021

View reviewed changes

gbbr reviewed Jul 19, 2021

View reviewed changes

ddtrace/tracer/spancontext.go Outdated Show resolved Hide resolved

ddtrace/tracer: Add unsafe sampling priority getter

5778962

gbbr reviewed Jul 20, 2021

View reviewed changes

ddtrace/tracer: Update method name

4856aea

gbbr reviewed Jul 20, 2021

View reviewed changes

ddtrace/tracer/spancontext.go Show resolved Hide resolved

Merge branch 'v1' into piotr-wolski/fix-app-analytics-sampling-when-d…

5134f7e

…ropping-p0

knusbaum approved these changes Jul 20, 2021

View reviewed changes

gbbr approved these changes Jul 21, 2021

View reviewed changes

knusbaum merged commit 9a5c7ca into v1 Jul 22, 2021

knusbaum deleted the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch July 22, 2021 21:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ddtrace/tracer: Don't drop trace if some spans must be kept #963

ddtrace/tracer: Don't drop trace if some spans must be kept #963

piochelepiotr commented Jul 13, 2021

gbbr Jul 16, 2021

knusbaum left a comment

piochelepiotr commented Jul 19, 2021

piochelepiotr commented Jul 19, 2021 •

edited

gbbr Jul 20, 2021

gbbr commented Jul 20, 2021 •

edited

knusbaum left a comment

	if sd == decisionNone {
	if p, ok := t.samplingPriority(); ok && p == ext.PriorityAutoReject {

	func (t *trace) samplingPriorityUnsafe() (p int, ok bool) {
	func (t *trace) samplingPriorityLocked() (p int, ok bool) {

ddtrace/tracer: Don't drop trace if some spans must be kept #963

ddtrace/tracer: Don't drop trace if some spans must be kept #963

Conversation

piochelepiotr commented Jul 13, 2021

gbbr Jul 16, 2021

Choose a reason for hiding this comment

knusbaum left a comment

Choose a reason for hiding this comment

piochelepiotr commented Jul 19, 2021

piochelepiotr commented Jul 19, 2021 • edited

gbbr Jul 20, 2021

Choose a reason for hiding this comment

gbbr commented Jul 20, 2021 • edited

1. Use two boolean fields (again)

2. Use a state

3. Use bitmask for all booleans

4. We keep what we have today.

knusbaum left a comment

Choose a reason for hiding this comment

piochelepiotr commented Jul 19, 2021 •

edited

gbbr commented Jul 20, 2021 •

edited