Add cause to error stack trace #1472

marcotc · 2021-04-20T22:31:52Z

This PR changes the format of Ruby exceptions reported in ddtrace spans to match the latest default language format.

ddtrace will now include all exception causes and format the output to exactly match the Ruby native format, introduced in Ruby 2.6.

def root
  raise 'root cause'
end

def wrapper
  root
rescue
  raise 'wrapper layer'
end

wrapper

# /tmp/test.rb:8:in `rescue in wrapper': wrapper layer (RuntimeError)
# 	from /tmp/test.rb:5:in `wrapper'
# 	from /tmp/test.rb:11:in `<main>'
# /tmp/test.rb:2:in `root': root cause (RuntimeError)
# 	from /tmp/test.rb:6:in `wrapper'
# 	from /tmp/test.rb:11:in `<main>'

The stack trace order was reversed between Ruby 2.5 and 2.7, but ddtrace will follow the current Ruby 3.0 format, using natural ordering on all Ruby versions supported.

For recent Ruby versions (>= 2.6), the output is retrieved from a single VM call (Exception#full_message), thus incurring a much lower cost compared to the backported versions for Ruby < 2.6.

Ruby 2.0 does not support Exception#cause, thus this change does not affect the output for that version.

Benchmarks

A benchmark is committed as part of this PR (RSpec test named Datadog::Error.build_from with an exception with a cause benchmark) and here are the results for different versions of Ruby, each taking a different code path in this PR:

Ruby version	Top-level error only (before)	Full error stack (after)	Implementation
2.0	~~39.933k i/s~~	39.933k i/s*	Incomplete: only top-level error
2.5	~~92.231k i/s~~	~~0.33k i/s~~	Our implementation to generate a full error stack report
2.6	~~95.634k i/s~~	~~8.034k i/s~~	VM method call to retrieve a full error stack report

New amended numbers, after performance improvement work:

Ruby version	Top-level error only (before)	Full error stack (after)	Implementation
2.0	61.005k i/s	51.075k i/s*	Incomplete: only top-level error
2.5	93.955k i/s	22.729k i/s	Our implementation to generate a full error stack report
2.6	97.763k i/s	21.709k i/s	Our implementation to generate a full error stack report
2.6	97.763k i/s	10.360k i/s	`Exception#full_message`

Ruby 2.0 stays relatively unchanged from before: it performs the same work in both implementations.

Ruby 2.6 to 3.0 (and beyond) invokes Exception#full_message, which generates the full stack trace natively. This method is roughly 11 times slower than before, but provides all the information required for stack trace analysis. Ruby 2.6 to 3.0 have moved to our Ruby-based implementation, due to better performance, explained below.

Ruby 2.1 to 2.5 uses a Ruby-based implementation to generate the same output as Exception#full_message. This allows for a consistent format between all versions of Ruby (except 2.0), at the expense of performance. This method is roughly 4-4.5 times slower than before.

Given this method should only happen on exceptional code paths, having a richer output would be preferable to a more performance, but less informative dataset.

ivoanjo

The code change itself looks reasonable, but the non-trivial performance change is a bit worrying. Exceptions should be exceptional, but it wouldn't be the first time I saw services that were chugging along with many, many exceptions, and thus a slowdown of 3 orders of magnitude can have a non-trivial impact. I think it's worth waiting for a second option on this, /cc @delner ?

For what it's worth, on my machine I get 10.513k (± 8.2%) i/s on Ruby 2.5 for our handcrafted implementation. And I took a stab at making it a bit faster (diff below) and got it to 31.659k (± 4.0%) i/s. So those 0.33k i/s you reported seem to be quite slow, let me know what you get with:

diff --git a/lib/ddtrace/error.rb b/lib/ddtrace/error.rb
index 896a7ce3..9dbefb91 100644
--- a/lib/ddtrace/error.rb
+++ b/lib/ddtrace/error.rb
@@ -30,23 +30,26 @@ module Datadog
           backtrace.join("\n") if backtrace
         end
       elsif Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.6.0')
+        puts "USING FULL IMPL"
+
         # Backports Ruby >= 2.6 output to older versions.
         # This exposes the 'cause' chain in the stack trace,
         # allowing for complete visibility of the error stack.
         def full_backtrace(ex)
-          backtrace = []
+          backtrace = String.new
           backtrace_for(ex, backtrace)

           # Avoid circular causes
-          causes = Set.new
-          causes.add(ex)
+          causes = Hash.new
+          causes[ex] = true

-          while (cause = ex.cause) && !causes.include?(cause)
+          while (cause = ex.cause) && !causes.key?(cause)
             backtrace_for(cause, backtrace)
-            causes.add(cause)
+            causes[cause] = true
           end

-          backtrace.join("\n")
+
+          backtrace
         end

         # Outputs the following format for exceptions:
@@ -61,23 +64,21 @@ module Datadog
           trace = ex.backtrace
           return unless trace

-          error_line, *caller_lines = trace
-
-          if error_line
+          if trace[0]
             # Add Exception information to error line
-            error_line = "#{error_line}: #{ex.message} (#{ex.class})"
-            backtrace << error_line
+            backtrace << "#{trace[0]}: #{ex.message} (#{ex.class})"
           end

-          if caller_lines
+          if trace[1]
             # Ident stack trace for caller lines, to separate
             # them from the main error lines.
-            caller_lines = caller_lines.map do |line|
-              "        from #{line}"
+            trace[1..-1].each do |line|
+              backtrace << "\n from "
+              backtrace << line
             end
-
-            backtrace.concat(caller_lines)
           end
+
+          backtrace << "\n"
         end

ivoanjo · 2021-05-05T09:49:18Z

lib/ddtrace/error.rb

+        def backtrace_for(ex, backtrace)
+          trace = ex.backtrace
+          return unless trace
+
+          error_line, *caller_lines = trace


Minor: The names backtrace and trace here are slightly confusing at first glance. Perhaps accumulated_backtrace instead of just backtrace to make it clearer what it is?

lloeki · 2021-05-05T16:11:33Z

We talked briefly with @marcotc about the performance, and we thought the stack trace with causes is really adding value, but the cost from the microbenchmark is worrying indeed, so we thought that could be a configurable.

TBH, I 'm not fond of inflating the inventory of configurables too much. But that's a contingency, and even an option needs a default, so we're back to square one and have to decide which behaviour to have on the older rubies.
The microbenchmark highlights a performance difference but doesn't quite help to really decide.

For now I probably lack some vision of how to translate this benchmark result into a real impact within an app. Maybe we could measure that and gather the impact relative to a whole app?

Without that hard-ish data I'd go with preserving the old behaviour and perf profile, and giving the user an opportunity to opt in the new, richer behaviour by a) upgrading Ruby or b) activating the configurable (at a then known cost/benefit for him). In terms of overall experience this also has the benefit of not having a opt-out configurable that would stay if the user forgets to remove it upon them ultimately upgrading Ruby.

marcotc · 2021-05-07T21:53:48Z

Thank you guys for pressing the issue!
I follow @ivoanjo's suggestions, and added a few more minor optimizations. I believe that we now have an implementation that has reasonable performance impact, and we (hopefully 🤞) don't have to make hard decisions on how to safely roll this out anymore.

Here are the new numbers:

Ruby version	Top-level error only (before)	Full error stack (after)	Implementation
2.0	61.005k i/s	51.075k i/s*	Incomplete: only top-level error
2.5	93.955k i/s	22.729k i/s	Our implementation to generate a full error stack report
2.6	97.763k i/s	21.709k i/s	Our implementation to generate a full error stack report
2.6	97.763k i/s	10.360k i/s	`Exception#full_message`

The new implementation runs are around 24% of the original speed, but provides full error context.

Interestingly, the native implementation is 2x slower than our Ruby-based one 😮, so I extended the reach of our custom implementation to all Ruby versions we support.

I'll take a look into the CRuby implementation some time in the future and see if there are any gains to be had upstream.

IMO, I'm happy with these numbers, given the large visibility increase on error for the performance impact.

ivoanjo

Agreed that this version is good enough™️ for on always. Also pushing our implementation of full_message upstream seems a nice contribution: faster, and easier to maintain :) :)

ivoanjo · 2021-05-10T09:38:57Z

spec/ddtrace/error_spec.rb

+            wrapper_error_message = /error_spec.rb:\d+:in.*wrapper': wrapper layer \(RuntimeError\)/
+            caller_stack = /from.*error_spec.rb:\d+:in `call'/
+            root_error_message = /error_spec.rb:\d+:in.*root': root cause \(RuntimeError\)/
+            wrapper_stack = /from.*error_spec.rb:\d+:in `wrapper'/
+
+            expect(error.backtrace)
+              .to match(/
+                         #{wrapper_error_message}.*
+                         #{caller_stack}.*
+                         #{root_error_message}.*
+                         #{wrapper_stack}.*
+                         #{caller_stack}.*
+                         /mx)


MInor: There's nothing wrong with this current version, but TBH I'd favor being a bit more strict/hardcoded here. If we were to move the example code to a separate file, we could couple on the exact lines and filename, and expect here on a much more exact string.

I never like too fancy assertions ahaha :)

I'm taking the free pass this time: I think the assertion is not the clearest, but neither it is unreadable. Assertion errors also were legible from my experience developing this feature.

I agree that a separate file would allow for fixed line numbers, but it creates context switch between test files. Here we don't have that context switch, but we do have less legible test assertions. ⚖️

marcotc requested a review from a team April 20, 2021 22:31

marcotc self-assigned this Apr 20, 2021

marcotc force-pushed the error-stack branch 2 times, most recently from a6a1279 to 360db9b Compare April 20, 2021 22:38

marcotc added this to In progress in Active work Apr 20, 2021

marcotc marked this pull request as draft April 20, 2021 23:09

marcotc removed the request for review from a team April 20, 2021 23:09

delner added this to Long Term Priority in Planning Apr 21, 2021

marcotc force-pushed the error-stack branch 2 times, most recently from b2e14a4 to 5a3a3e6 Compare April 21, 2021 19:35

marcotc added 3 commits April 30, 2021 10:38

Add cause to error stack trace

8735aa4

Remove unnecessary constructor

c783d11

Fix empty backtrace case for 2.0

c69ed16

marcotc force-pushed the error-stack branch from 5a3a3e6 to c69ed16 Compare April 30, 2021 17:42

marcotc added the feature-request A request for a new feature or change to an existing one label Apr 30, 2021

marcotc marked this pull request as ready for review April 30, 2021 18:08

marcotc moved this from In progress to In review in Active work Apr 30, 2021

marcotc requested review from lloeki and a team May 3, 2021 16:48

delner removed this from Long Term Priority in Planning May 3, 2021

ivoanjo reviewed May 5, 2021

View reviewed changes

ivoanjo requested a review from delner May 5, 2021 10:19

marcotc added 3 commits May 7, 2021 10:29

Performance changes, thanks Ivo!

f7775c2

More optimizations

098cf2e

Lint

ce0f7f2

ivoanjo approved these changes May 10, 2021

View reviewed changes

marcotc merged commit bb36490 into master May 10, 2021

Active work automation moved this from In review to Merged & awaiting release May 10, 2021

marcotc deleted the error-stack branch May 10, 2021 18:13

github-actions bot added this to the 0.49.0 milestone May 10, 2021

marcotc mentioned this pull request May 31, 2021

Nested error reporting not walking clause chain #1535

Merged

ivoanjo moved this from Merged & awaiting release to Released in Active work Aug 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add cause to error stack trace #1472

Add cause to error stack trace #1472

marcotc commented Apr 20, 2021 •

edited

ivoanjo left a comment

ivoanjo May 5, 2021

lloeki commented May 5, 2021

marcotc commented May 7, 2021

ivoanjo left a comment •

edited

ivoanjo May 10, 2021

marcotc May 10, 2021

Add cause to error stack trace #1472

Add cause to error stack trace #1472

Conversation

marcotc commented Apr 20, 2021 • edited

Benchmarks

ivoanjo left a comment

Choose a reason for hiding this comment

ivoanjo May 5, 2021

Choose a reason for hiding this comment

lloeki commented May 5, 2021

marcotc commented May 7, 2021

ivoanjo left a comment • edited

Choose a reason for hiding this comment

ivoanjo May 10, 2021

Choose a reason for hiding this comment

marcotc May 10, 2021

Choose a reason for hiding this comment

marcotc commented Apr 20, 2021 •

edited

ivoanjo left a comment •

edited