Add bundler to exporter #39

Merged
merged 16 commits on Jun 29, 2020

Conversation

stevencl1013
Contributor

Bundles spans together and uploads them in separate worker threads

@ymotongpoo
Member

@stevencl1013 Please run make precommit

Contributor
@nilebox nilebox left a comment

Overall looks good, but a few bugs and nits. See review comments.

if o.BundleDelayThreshold > 0 {
	b.DelayThreshold = o.BundleDelayThreshold
} else {
	b.DelayThreshold = 2 * time.Second
Contributor

Could you extract constants to the top of the file, and give them descriptive names, please?
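
A minimal sketch of the extraction being asked for; the constant names below are illustrative, not necessarily what the PR ended up using:

const (
	// defaultBundleDelayThreshold is how long the bundler waits before
	// uploading a bundle when no other threshold has been hit.
	defaultBundleDelayThreshold = 2 * time.Second
	// defaultBundleCountThreshold is how many spans are buffered before a
	// bundle is handed off for upload.
	defaultBundleCountThreshold = 50
)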

	b.BundleCountThreshold = 50
}
if o.NumberOfWorkers > 0 {
	b.HandlerLimit = o.NumberOfWorkers
Contributor

Based on the description at https://github.com/googleapis/google-api-go-client/blob/cfc62336a21b9af45e2830a9898569d8aac1c5cc/support/bundler/bundler.go#L80-L82, the config parameter should be called MaxNumberOfWorkers rather than just NumberOfWorkers, or just BundleHandlerLimit to refer directly to the bundler parameter.

	b.HandlerLimit = o.NumberOfWorkers
}
// The measured "bytes" are not really bytes, see exportReceiver.
b.BundleByteThreshold = b.BundleCountThreshold * 200
Contributor

This should be done right after setting BundleCountThreshold, to make sure they are set together.
Also, there is no exportReceiver in this repo?

}
// The measured "bytes" are not really bytes, see exportReceiver.
b.BundleByteThreshold = b.BundleCountThreshold * 200
b.BundleByteLimit = b.BundleCountThreshold * 1000
Contributor

Move next to setting BundleCountThreshold
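
A sketch of the reordering suggested here, reusing the illustrative default constant from the earlier sketch; the option handling around BundleCountThreshold is assumed rather than taken verbatim from the diff:

if o.BundleCountThreshold > 0 {
	b.BundleCountThreshold = o.BundleCountThreshold
} else {
	b.BundleCountThreshold = defaultBundleCountThreshold
}
// Derived byte thresholds stay next to the count they are derived from.
b.BundleByteThreshold = b.BundleCountThreshold * 200
b.BundleByteLimit = b.BundleCountThreshold * 1000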

@nilebox
Contributor

nilebox commented Jun 10, 2020

Looks like some changes aren't committed, please run make precommit and push missing changes.

@@ -29,11 +34,19 @@ import (
type traceExporter struct {
	o         *options
	projectID string
	bundler   *bundler.Bundler
Member

@rghetia As far as I remember, there were some issues with using the bundler, which is why I avoided it and used BatchWriteSpans in the first implementation. Could you share the background for the record?

Contributor

@ymotongpoo I kind of remember some issue with bundler but I don't remember the details.

Member

For the record: at least, the report in the OpenCensus exporter pointed to the possibility of a memory leak.
census-ecosystem/opencensus-go-exporter-ocagent#71

Contributor

From that issue, it looks like the memory leak only appears when the backend is not reachable. It should be relatively easy to reproduce, then?

Contributor Author

@ymotongpoo I don't really understand that issue in the ocagent repo. From the issue:

as long as there is a valid connection to the collector, the “handler” function, i.e. uploadTraces, will be called on the bundle to offload the bundled traces to the collector.

Isn't the handler function called regardless of whether there is a connection? That connection check in the issue is already inside the uploadTraces function. And in this exporter, there is no such check for a connection; the upload function just calls BatchWriteSpans.

Also, since the bundler's BufferedByteLimit is set here, the bundler will not keep more than that many bytes in memory.
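
For context, roughly how that cap would be wired up (a sketch based on the TraceSpansBufferMaxBytes option shown in the next excerpt and the 8 MB default from its doc comment; the exact code in the PR may differ). Once BufferedByteLimit is exceeded, bundler.Add rejects new items instead of buffering them:

if o.TraceSpansBufferMaxBytes > 0 {
	b.BufferedByteLimit = o.TraceSpansBufferMaxBytes
} else {
	b.BufferedByteLimit = 8 * 1024 * 1024 // the 8 MB default from the option's doc comment
}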

// TraceSpansBufferMaxBytes is the maximum size (in bytes) of spans that
// will be buffered in memory before being dropped.
//
// If unset, a default of 8MB will be used.
// TraceSpansBufferMaxBytes int
TraceSpansBufferMaxBytes int
Contributor

This should be just BufferMaxBytes, similar to the other option names (which don't have the TraceSpans prefix).
This could potentially be refactored and shared with metrics.

// BundleDelayThreshold determines the max amount of time
// the exporter can wait before uploading view data or trace spans to
// the backend.
// Optional.
Contributor

Please specify the default value.
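
For example, the doc comment could be amended like this (the 2-second default comes from the bundler setup earlier in the diff; the exact wording is up to the author):

// BundleDelayThreshold determines the max amount of time
// the exporter can wait before uploading trace spans to
// the backend.
// Optional. Defaults to 2 seconds if unset.
BundleDelayThreshold time.Duration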


// BundleCountThreshold determines how many view data events or trace spans
// can be buffered before batch uploading them to the backend.
// Optional.
Contributor

same here.

// for now. The minimum number of workers is 1.
NumberOfWorkers int
// MaxNumberOfWorkers sets the maximum number of goroutines that send requests
// to Stackdriver Trace. The minimum number of workers is 1.
Contributor

Replace Stackdriver with 'Cloud Trace'. Here and on line 91.

// UploadFunction is the function used to upload spans to Cloud Trace. Used for
// testing. Defaults to uploadSpans in trace.go.
// Optional.
UploadFunction func(ctx context.Context, spans []*tracepb.Span)
Contributor

You could use a mock server instead of exporting a test-only interface.
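
For reference, a rough sketch of the mock-server alternative, assuming the cloudtrace v2 genproto bindings; mockTraceServer and startMockTraceServer are hypothetical names, and pointing the exporter at lis.Addr() would go through whatever gRPC client options this exporter exposes:

import (
	"context"
	"net"
	"sync"
	"testing"

	tracepb "google.golang.org/genproto/googleapis/devtools/cloudtrace/v2"
	"google.golang.org/grpc"
	"google.golang.org/protobuf/types/known/emptypb"
)

// mockTraceServer records every span sent through BatchWriteSpans.
type mockTraceServer struct {
	tracepb.TraceServiceServer // embedded so the unimplemented CreateSpan still satisfies the interface
	mu            sync.Mutex
	spansUploaded []*tracepb.Span
}

func (s *mockTraceServer) BatchWriteSpans(ctx context.Context, req *tracepb.BatchWriteSpansRequest) (*emptypb.Empty, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.spansUploaded = append(s.spansUploaded, req.Spans...)
	return &emptypb.Empty{}, nil
}

// startMockTraceServer serves the mock on an ephemeral local port and returns
// the address a test can hand to the exporter's client options.
func startMockTraceServer(t *testing.T) (*mockTraceServer, string) {
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		t.Fatalf("failed to listen: %v", err)
	}
	mock := &mockTraceServer{}
	srv := grpc.NewServer()
	tracepb.RegisterTraceServiceServer(srv, mock)
	go srv.Serve(lis)
	t.Cleanup(srv.Stop)
	return mock, lis.Addr().String()
}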

case o.accum == 0:
	o.pause = false
case o.accum == 1:
	log.Println("OpenCensus Stackdriver exporter: failed to upload span: buffer full")
Contributor

s/OpenCensus Stackdriver/OpenTelemetry Cloud Trace/

	o.accum = 0
	o.delay()
default:
	log.Printf("OpenCensus Stackdriver exporter: failed to upload %d spans: buffer full", o.accum)
Contributor

same as above.

o.mu.Lock()
defer o.mu.Unlock()
if !o.pause {
log.Println("OpenCensus Stackdriver exporter: failed to upload span: buffer full")
Contributor

same.

Contributor
@nilebox nilebox left a comment

LGTM, but leaving up to @ymotongpoo whether a potential memory leak is a blocking issue for merging this.

@ymotongpoo
Member

Thank you @nilebox for asking for my comments. From what I have heard of OC exporter users' experiences, I have rarely heard of issues originating from the memory leak. However, the issue is still known and hasn't been solved yet. So my request here is for this PR to include at least one (preferably both) of the following:

  1. Add a TODO or note comment on the bundler field in traceExporter that mentions the potential issue and links to the similar existing issue from my previous comment.
  2. Add a test that reproduces the issue.

I think we need at least 1.

@stevencl1013
Contributor Author

@ymotongpoo I don't understand how that issue relates to this exporter. I've looked into it, and I mentioned what I found in this comment: #39 (comment)

@james-bebbington
Contributor

Sorry to come to this one very late. Yesterday I noticed that there are two different versions of the TraceExporter in the OpenTelemetry SDK: https://github.com/open-telemetry/opentelemetry-go/blob/master/sdk/export/trace/trace.go

  • SpanSyncer which implements ExportSpan
  • SpanBatcher which implements ExportSpans

When setting up your tracing pipeline, you can configure either of these by choosing WithSyncer or WithBatcher (see https://github.com/open-telemetry/opentelemetry-go/blob/master/sdk/trace/provider.go#L186). The batching work is done in https://github.com/open-telemetry/opentelemetry-go/blob/master/sdk/trace/batch_span_processor.go. In the operations exporter, we could just have a helper function that sets up the SpanBatcher pipeline.

So I wanted to make sure you'd taken a look at that and considered what we gain by doing our own bundling in the exporter, as opposed to just making use of the work that's already done when using the batch span processor.
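
Roughly, the two pipelines being contrasted look like this (a sketch against the v0.6-era SDK API linked above; constructor names and signatures have moved around between OTel Go releases, so treat it as illustrative only):

// Syncer: the SDK hands each finished span to ExportSpan immediately, so any
// batching has to happen inside the exporter (which is what the bundler in
// this PR does).
tp, err := sdktrace.NewProvider(sdktrace.WithSyncer(exporter))

// Batcher: the SDK's batch span processor buffers spans and calls ExportSpans,
// which makes an exporter-side bundler largely redundant.
tp, err = sdktrace.NewProvider(sdktrace.WithBatcher(exporter))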

@nilebox
Contributor

nilebox commented Jun 23, 2020

Follow-up to @james-bebbington's comment: currently we have two essentially identical implementations, ExportSpan and ExportSpans:

// ExportSpan exports a SpanData to Stackdriver Trace.
func (e *traceExporter) ExportSpan(ctx context.Context, sd *export.SpanData) {
	protoSpan := protoFromSpanData(sd, e.projectID)
	e.uploadFn(ctx, []*tracepb.Span{protoSpan})
}

// ExportSpans exports a slice of SpanData to Stackdriver Trace in batch
func (e *traceExporter) ExportSpans(ctx context.Context, sds []*export.SpanData) {
	pbSpans := make([]*tracepb.Span, len(sds))
	for i, sd := range sds {
		pbSpans[i] = protoFromSpanData(sd, e.projectID)
	}
	e.uploadFn(ctx, pbSpans)
}
These implement both of the interfaces he mentioned above.

What if we remove the ExportSpan function (implementation of SpanSyncer), and only support SpanBatcher with ExportSpans?
That way we will force users to always register our exporter via WithBatcher, and the bundler won't be necessary.

FWIW there is already a precedent for this approach:

@stevencl1013
Contributor Author

Thanks for the info @james-bebbington and @nilebox, I wasn't aware of the difference between WithSyncer and WithBatcher before. I've been doing my experimentation on a simple application similar to the example in this repo, which uses the WithSyncer option.

I read in internal documentation that OpenTelemetry exporters should use both batching and async writes, so I still believe this PR implements those functionalities for this exporter when using the WithSyncer option. When I created this PR, I had two main goals:

  1. Add the ability to export spans in parallel, and let users configure the number of worker threads (Add workers to export spans in parallel. #19).
  2. Reduce the high latency that I had experienced in my testing, and that was also reported later by a user in Latency issues of HTTP requests with Stackdriver tracing v0.2.0 and otel v0.6.0 #47.

If we were to only support SpanBatcher and remove the bundler, then I think the 2nd goal would be achieved, but the 1st would not. Would it make sense to keep supporting SpanSyncer, either by keeping implementations of both interfaces, or by taking a similar approach to the Jaeger exporter (only implementing SpanSyncer and using a bundler)?
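
For comparison, the Jaeger-style shape would look roughly like this (a sketch against the support/bundler API already used in this PR; the constructor and field names are illustrative, while uploadFn and protoFromSpanData are the existing helpers from trace.go):

// newBundler wires the upload function into a bundler; the handler runs on up
// to HandlerLimit worker goroutines. (Illustrative constructor, not the PR's exact code.)
func newBundler(e *traceExporter, o *options) *bundler.Bundler {
	b := bundler.NewBundler((*tracepb.Span)(nil), func(bundle interface{}) {
		e.uploadFn(context.Background(), bundle.([]*tracepb.Span))
	})
	b.DelayThreshold = 2 * time.Second
	b.BundleCountThreshold = 50
	b.HandlerLimit = o.NumberOfWorkers
	return b
}

// ExportSpan (the SpanSyncer interface) then just feeds the bundler instead of
// uploading synchronously; Add returns an error when the buffer limit would be
// exceeded, and that span is dropped.
func (e *traceExporter) ExportSpan(ctx context.Context, sd *export.SpanData) {
	protoSpan := protoFromSpanData(sd, e.projectID)
	if err := e.bundler.Add(protoSpan, 1); err != nil {
		log.Printf("OpenTelemetry Cloud Trace exporter: failed to buffer span: %v", err)
	}
}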

@james-bebbington
Contributor

james-bebbington commented Jun 24, 2020

Thanks for the info @james-bebbington and @nilebox, I wasn't aware of the difference between WithSyncer and WithBatcher before. I've been doing my experimentation on a simple application similar to the example in this repo, which uses the WithSyncer option.

No worries, we also weren't aware of it until recently. There is probably some general OpenTelemetry documentation missing with recommendations around how to go about implementing exporters.

If we were to only support SpanBatcher and remove the bundler, then I think the 2nd goal would be achieved, but the 1st would not. Would it make sense to keep supporting SpanSyncer, either by keeping implementations of both interfaces, or by taking a similar approach to the Jaeger exporter (only implementing SpanSyncer and using a bundler)?

Yea we definitely want to support batched & parallel exporting. At the same time, we want to make sure we don't make this confusing for users:

  • If they use WithSyncer but then we batch the data in the exporter anyway, that's somewhat confusing, although maybe that's not a big deal
  • If they use WithBatcher, then it will be confusing that the batch size can be configured in both the batch span processor and in the exporter via BundleCountThreshold

Note we could enforce which setup the user has to use by only implementing one of those interfaces (and optionally also creating a wrapper initialization function to set up the pipeline for them).

I'm not really sure what the right thing is to do here though. Any thoughts @rghetia @ymotongpoo @nilebox?

(overall this PR is probably okay as is but I just want to make sure we've thought this through)


Also note even the Go SDK is not consistent:

  • The Zipkin exporter only implements the SpanBatcher interface
  • The Jaeger exporter only implements the SpanSyncer interface, and then implements batching internally

As another point of reference, the Python SDK only has a single Export function that takes a list of spans. Both the Simple (Syncer) & Batch span processors use this function (the simple processor just sends a single span at a time).

@@ -127,14 +134,19 @@ func TestExporter_Timeout(t *testing.T) {
mockTrace.spansUploaded = nil
mockTrace.delay = 20 * time.Millisecond
var exportErrors []error
var wg sync.WaitGroup
wg.Add(1)
Contributor

Would be more idiomatic to use a channel for this I think
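
A small sketch of the channel-based alternative; the error-handler hook is whatever this test already uses to populate exportErrors, so treat the wiring as illustrative:

errCh := make(chan error, 1)
// ...have the exporter's error handler send to errCh instead of appending to
// exportErrors and calling wg.Done()...
select {
case err := <-errCh:
	t.Logf("received expected export error: %v", err)
case <-time.After(time.Second):
	t.Fatal("timed out waiting for the export error")
}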

@nilebox
Contributor

nilebox commented Jun 24, 2020

If we were to only support SpanBatcher and remove the bundler, then I think the 2nd goal would be achieved, but the 1st would not.

@stevencl1013 Can we try addressing this issue in the SDK batcher instead, then? That would benefit all users of the Go SDK rather than just Stackdriver.


I'm not sure what to suggest that would allow using the bundler without making it confusing, tbh:

  • It seems that Syncer is designed for the case where the user wants to send ("sync") spans straight to the backend without buffering, so a bundler would go against that design.
  • Similarly, cascading batchers (SDK batcher + bundler) might be confusing as well.

@ymotongpoo
Member

@stevencl1013 Sorry for delayed response.

Here's why I mentioned that issue in the OC exporter:

  • Adding a connectivity check should be a TODO, and the implementation would then end up similar to the OC exporter's.
  • Though BufferedByteLimit is set, that code is copied from the OC exporter, where the issue exists.

Anyway, as @nilebox and @james-bebbington suggested, we now have SpanBatcher in the SDK, and it looks better to use it for batching to avoid additional complexity.

@ymotongpoo
Member

@stevencl1013 I had an offline discussion with @rghetia about the use of the bundler. In short, as you confirmed earlier, we don't check connectivity and won't in this exporter, so it's safe to use the bundler here.

Background:
Originally I thought we would need to add a connectivity check later on, like the OC exporter, which could pile up unsent spans in memory. However, the upstream spec for error handling doesn't mention such behavior, and how to handle unsent spans is up to exporter implementations. We should just drop the unbuffered spans in this exporter and let the collector handle such complex cases.

@reazahmed

Apologies for jumping into the middle of the discussion.

As per the OTel specification (https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/sdk.md#built-in-span-processors):

The OTel standard SDK MUST implement the simple processor (i.e., the sync case). With the current simple processor, a user will experience very high delay for each RPC. This change handles the simple processor case, bringing latency down from ~10 seconds to ~200 milliseconds.

@james-bebbington
Contributor

For reference, given the way the SDK is set up, we decided there isn't any ideal solution, but for now we will:

  • Merge this PR as is
  • Create a follow-up PR, which includes a breaking change, to:
    • Remove the ExportSpans function, and only keep the ExportSpan function
    • Add a NewExportPipeline function that we expect consumers to use rather than setting up a Syncer themselves

i.e. we will implement this in the same way as the Jaeger exporter (see the Otel Go SDK).

FYI @nilebox I know you had some reservations around the explicitness of the word Syncer, but we decided this is the least bad of two not-great solutions for now: duplicate batching options are likely to add more confusion and create more difficulty with making backwards-compatible changes in the future.

@stevencl1013 - in the next PR it'd also be good to add/update the example code in this repo
