[tracer] reduce memory usage on high-cardinality traces #247

ufoot · 2017-11-09T21:12:07Z

For big traces (typically, long-running traces with one enclosing span and many sub-spans, possibly several thousands) the library could keep everything in memory waiting for an hypothetical flush.

This patch partially flushes consistent parts of traces, so that they don't fill up the RAM.

ryplo · 2017-11-10T20:05:36Z

lib/ddtrace/context.rb

+      roots, marked_ids = partial_roots()
+      return nil unless roots
+
+      return unless roots


Is this a duplicate of line 144?

Thanks for spotting this, indeed, it is.

p-lambert · 2017-11-13T18:41:08Z

lib/ddtrace/context.rb

+    def partial_roots
+      return nil unless @current_span
+
+      marked_ids = Hash[([@current_span.span_id] + @current_span.parent_ids).map { |id| [id, true] }]


Maybe a Set DS would be more idiomatic here: http://ruby-doc.org/stdlib-2.4.2/libdoc/set/rdoc/Set.html

p-lambert · 2017-11-14T16:28:22Z

lib/ddtrace/context.rb

+        ids.reject! { |id| marked_ids.key? id }
+        ids.each do |id|
+          if roots_spans.key?(id)
+            unfinished[id] = true unless span.finished?


I think we can simplify this (and avoid unnecessary lookups) a little bit by doing:

root_spans.delete(id) unless span.finished?

That way we can get rid of theunfinished hash and roots_spans.reject! { |id| unfinished.key? id } altogether. We will also skip some work for spans that belong to that subtree.

Will try this for sure, sounds good.

p-lambert · 2017-11-14T16:44:17Z

lib/ddtrace/context.rb

+
+    # Return a hash containting all sub traces which are candidates for
+    # a partial flush.
+    def partial_roots_spans


I'll do some whiteboarding later, but I suspect we can simplify and improve the complexity of this algorithm a little bit if we build a graph and then search for all reachable nodes from the partial_roots instead of the other way around (traversing the subtrees of all spans in search for the partial_roots). I might be wrong, but I think its worth trying it before we merge this.

Yes, possibly, this was just a 1st naive impl., it's fine to iterate on that obviously.

p-lambert · 2017-11-15T22:07:26Z

lib/ddtrace/context.rb

+    end
+
+    # Iterate on each span within the trace. This is thread safe.
+    def each_span


p-lambert · 2017-11-15T22:41:14Z

lib/ddtrace/tracer.rb

@@ -289,8 +292,15 @@ def record(context)
      context = context.context if context.is_a?(Datadog::Span)
      return if context.nil?
      trace, sampled = context.get
-      ready = !trace.nil? && !trace.empty? && sampled
-      write(trace) if ready
+      if sampled


I'm probably missing something, but it seems to me that if the current trace has an unfinished Span, Context#check_finished_spans will return false and Context#get will return [nil, nil]. Doesn't that prevent the partial flush?

p-lambert · 2017-11-15T22:52:10Z

test/context_flush_test.rb

+
+    public :partial_roots
+    public :partial_roots_spans
+    public :partial_flush


I tend to think of private methods as implementation details. If we rely on them for testing, then we're testing more the implementation than the "behavior" provided, making this code almost impossible to refactor in the future. I think it would be better if we could use something like a FauxWriter and inspect if the buffer contains something like [partial_trace1, partial_trace2, partial_trace3] (that is, testing the "side-effect" of the ContextFlush). I completely understand that we might be short on time now, but I think that's a topic worth discussing at some point :)

mmm, will give it a try anyway, I think now that we have this each_partial_trace entry, which is fine to keep public I think, then I could rewrite part of the tests to rely on that without changing the whole world.

p-lambert · 2017-11-15T22:56:45Z

lib/ddtrace/context_flush.rb

+      return nil unless roots
+
+      roots_spans = Hash[roots.map { |id| [id, []] }]
+      unfinished = Set.new


Just curious: that suggestion I gave about the unfinished variable didn't work?

I'm reading this again... and... no, but the code here does not work either. Some code paths are obviously not triggered by any tests yet. Fixing.

ufoot · 2017-11-17T15:11:27Z

test/context_flush_test.rb

+
+    context = tracer.call_context
+    tracer.configure(min_spans_before_partial_flush: context.max_length,
+                     max_spans_before_partial_flush: context.max_length,


As of today, context max_length and context_flush boundaries are the same, but it's not necessarily that way. In the future, hard limit could be higher, and this test has no reason to fail test, it just tests that when hard and soft limit are the same, hard limit prevails.

ufoot · 2018-01-04T08:23:39Z

lib/ddtrace/context_flush.rb

+    # by default, soft and hard limits are the same
+    DEFAULT_MAX_SPANS_BEFORE_PARTIAL_FLUSH = Datadog::Context::DEFAULT_MAX_LENGTH
+    # by default, never do a partial flush
+    DEFAULT_MIN_SPANS_BEFORE_PARTIAL_FLUSH = Datadog::Context::DEFAULT_MAX_LENGTH


Set this to 10 to avoid anything smaller that 10 spans to be flushed, even partially.

ufoot · 2018-01-04T08:24:09Z

lib/ddtrace/context_flush.rb

+  # It performs memory flushes when required.
+  class ContextFlush
+    # by default, soft and hard limits are the same
+    DEFAULT_MAX_SPANS_BEFORE_PARTIAL_FLUSH = Datadog::Context::DEFAULT_MAX_LENGTH


Set this to 1000 to have partial flush be triggered for anything bigger that 1000 spans.

delner · 2018-03-12T19:23:49Z

lib/ddtrace/context_flush.rb

+      # We need to reject by span ID and not by value, because a span
+      # value may be altered (typical example: it's finished by some other thread)
+      # since we lock only the context, not all the spans which belong to it.
+      context.delete_span_if { |span| flushed_ids.include? span.span_id }


This seems dangerous in the case that writing fails. If I'm understanding it correctly, first we get all partial traces, deleting them permanently from the context, then iterate on each span via each_partial_trace. But when each_partial_trace calls write(trace), if that fails (say there's a blip in connectivity), then an exception will be thrown, and the deleted spans won't actually be flushed (with no means of recovering them.)

Could we do this after the yield on line 124? Or within the tracer as a part of the write operation?

Although the point technically stands, the write operation is only writing to a buffer right now, and in the case of the SyncWriter, partial flushing doesn't apply. The odds of it failing are thus really low. So we probably don't have to change this.

For big traces (typically, long-running traces with one enclosing span and many sub-spans, possibly several thousands) the library could keep everything in memory waiting for an hypothetical flush. This patch partially flushes consistent parts of traces, so that they don't fill up the RAM.

…_flush or configuration options to enable.

palazzem

I would only change Context methods and I think we're fine. With this change we're:

adding an upper bound to the Context size (based on count and not on the size, but it's good enough for now)
adding an experimental feature that if not activated, the tracer behaves like before

palazzem · 2018-04-05T21:52:41Z

docs/GettingStarted.md

@@ -798,6 +798,7 @@ Available options are:
 - ``env``: set the environment. Rails users may set it to ``Rails.env`` to use their application settings.
 - ``tags``: set global tags that should be applied to all spans. Defaults to an empty hash
 - ``log``: defines a custom logger.
+ - ``partial_flush``: set to ``true`` to enable partial trace flushing (for long running traces.) Disabled by default. *Experimental.*


Let's keep this experimental for some releases. Then we may improve it, make it fully supported, or replace with a new behavior.

palazzem · 2018-04-05T22:05:41Z

lib/ddtrace/context.rb

@@ -161,6 +182,50 @@ def attach_sampling_priority
      )
    end

+    # Return the start time of the root span, or nil if there are no spans or this is undefined.
+    def start_time


can we make these methods "private" or anyway use a convention to avoid people using them? since it's internal machinery, I don't want that developers use an API that could be removed later. If we use the __start_time, let's also add in the comment that this is an internal API.

palazzem

All good to me! Thank you very much!

ufoot added bug Involves a bug core Involves Datadog core libraries labels Nov 9, 2017

ufoot added this to the 0.9.2 milestone Nov 9, 2017

ufoot requested a review from palazzem November 9, 2017 21:12

ufoot force-pushed the christian/limittracesize branch from ffad65a to a4af593 Compare November 10, 2017 02:04

palazzem changed the title ~~[tracer] fix memory leak on high-cardinality traces~~ [tracer] fix memory usage on high-cardinality traces Nov 10, 2017

ufoot changed the title ~~[tracer] fix memory usage on high-cardinality traces~~ [tracer] reduce memory usage on high-cardinality traces Nov 10, 2017

ufoot force-pushed the christian/limittracesize branch 3 times, most recently from a9552bc to 43cb1ac Compare November 10, 2017 15:09

ryplo reviewed Nov 10, 2017

View reviewed changes

ufoot modified the milestones: 0.9.2, 0.10.0 Nov 14, 2017

ufoot requested a review from p-lambert November 14, 2017 13:50

p-lambert reviewed Nov 14, 2017

View reviewed changes

ufoot force-pushed the christian/limittracesize branch 3 times, most recently from 02cf38f to 1400031 Compare November 15, 2017 04:44

palazzem mentioned this pull request Nov 15, 2017

Memory explodes and logs show Request Entity Too Large / Connection reset by peer #220

Closed

ufoot force-pushed the christian/limittracesize branch from 624769a to a5ecc11 Compare November 15, 2017 14:47

p-lambert reviewed Nov 15, 2017

View reviewed changes

ufoot force-pushed the christian/limittracesize branch 4 times, most recently from 8bf7b1d to 53cad3f Compare November 17, 2017 14:43

ufoot commented Nov 17, 2017

View reviewed changes

p-lambert mentioned this pull request Nov 20, 2017

Improve shutdown process #253

Merged

ufoot force-pushed the christian/limittracesize branch from b1a8c32 to 53cad3f Compare November 22, 2017 15:42

palazzem removed this from the 0.10.0 milestone Nov 23, 2017

ufoot force-pushed the christian/limittracesize branch from 53cad3f to 004caf3 Compare November 24, 2017 08:34

ufoot force-pushed the christian/limittracesize branch from 004caf3 to 13484c9 Compare November 29, 2017 09:24

palazzem force-pushed the christian/limittracesize branch from 13484c9 to e63e190 Compare November 30, 2017 10:31

ufoot commented Jan 4, 2018

View reviewed changes

palazzem added the do-not-merge/WIP Not ready for merge label Feb 7, 2018

palazzem assigned delner Mar 9, 2018

palazzem requested a review from delner March 9, 2018 14:34

palazzem added this to the 0.12.0 milestone Mar 9, 2018

delner force-pushed the christian/limittracesize branch from e63e190 to 2c743d6 Compare March 12, 2018 19:09

delner changed the base branch from master to 0.12-dev March 12, 2018 19:10

delner reviewed Mar 12, 2018

View reviewed changes

delner force-pushed the christian/limittracesize branch from 2c743d6 to 9ffe0b5 Compare March 30, 2018 15:21

palazzem removed the do-not-merge/WIP Not ready for merge label Apr 5, 2018

ufoot and others added 2 commits April 5, 2018 17:35

Changed: Context flushing to be inactive by default. Provide :partial…

d58cba9

…_flush or configuration options to enable.

delner force-pushed the christian/limittracesize branch from f30d5af to 9c75df2 Compare April 5, 2018 21:36

Added: partial_flush tracer option to documentation.

d1c61e1

delner force-pushed the christian/limittracesize branch from 9c75df2 to d1c61e1 Compare April 5, 2018 21:52

palazzem suggested changes Apr 5, 2018

View reviewed changes

Changed: Make context flush methods on context private.

603c99d

palazzem approved these changes Apr 6, 2018

View reviewed changes

delner merged commit b3a85cd into 0.12-dev Apr 6, 2018

delner deleted the christian/limittracesize branch April 6, 2018 21:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[tracer] reduce memory usage on high-cardinality traces #247

[tracer] reduce memory usage on high-cardinality traces #247

ufoot commented Nov 9, 2017 •

edited

ryplo Nov 10, 2017

ufoot Nov 14, 2017

p-lambert Nov 13, 2017

ufoot Nov 14, 2017

p-lambert Nov 14, 2017 •

edited

ufoot Nov 14, 2017

p-lambert Nov 14, 2017 •

edited

ufoot Nov 14, 2017

p-lambert Nov 15, 2017

p-lambert Nov 15, 2017

p-lambert Nov 15, 2017

ufoot Nov 16, 2017

p-lambert Nov 15, 2017

ufoot Nov 16, 2017

ufoot Nov 17, 2017

ufoot Jan 4, 2018

ufoot Jan 4, 2018

delner Mar 12, 2018 •

edited

delner Mar 30, 2018 •

edited

palazzem left a comment

palazzem Apr 5, 2018

palazzem Apr 5, 2018

palazzem left a comment

[tracer] reduce memory usage on high-cardinality traces #247

[tracer] reduce memory usage on high-cardinality traces #247

Conversation

ufoot commented Nov 9, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

p-lambert Nov 14, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

p-lambert Nov 14, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

delner Mar 12, 2018 • edited

Choose a reason for hiding this comment

delner Mar 30, 2018 • edited

Choose a reason for hiding this comment

palazzem left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

palazzem left a comment

Choose a reason for hiding this comment

ufoot commented Nov 9, 2017 •

edited

p-lambert Nov 14, 2017 •

edited

p-lambert Nov 14, 2017 •

edited

delner Mar 12, 2018 •

edited

delner Mar 30, 2018 •

edited