Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error rescued with rescue_from still marked as error #1155

Closed
iain opened this issue Aug 25, 2020 · 8 comments
Closed

Error rescued with rescue_from still marked as error #1155

iain opened this issue Aug 25, 2020 · 8 comments
Milestone

Comments

@iain
Copy link

iain commented Aug 25, 2020

We recently tried to upgrade from ddtrace 0.37.0 to 0.39.0 and we found that errors in a Rails controller are still counted as errors in APM.

For example:

class WidgetsController < ApplicationController

  rescue_from SomeError do
    render json: { message: "nah" }, status: 404
  end

  def index
    raise SomeError
  end

end

APM would show that requests made to this endpoint as errors, while they are responding with 404 status.

We are using Rails 5.0.7.2, which admittedly is pretty old.

@iain
Copy link
Author

iain commented Aug 26, 2020

It is mentioned in #1124:

One tradeoff here is that, this might mean that if there's custom error handling middleware inserted in the Action Dispatcher middleware stack, we might not be honoring it by pre-emptively setting the rack span to be the same error as the actioncontroller.

If I remember correctly, rescue_from will compile down to a middleware too, so that would explain why. The question is if that is intended behavior or not, because what was thought to be a relatively rare edge case, could be way more common.

@ericmustin
Copy link
Contributor

Hey @iain , thanks for reaching out about this.

I think the goal of #1124 was to provide more visibility for user's top level spans, not significantly change current behavior around what is / is not marked an error. You bring up a valuable and reasonable use case. Looking into rescue_from a bit, I think there are two things at work here.

First, at present our instrumentation of rail's controllers simply is not accounting for rescue_from. Take for example a small modification of your snippet above:

class WidgetsController < ApplicationController
  rescue_from SomeError, with: :some_other_error # self defined exception

  def index
    raise SomeError
  end
end

In this scenario, our instrumentation would record the error as being SomeError as opposed to SomeOtherError. This is due, I believe, to where our metaprogramming does monkeypatching (on the ActionController::Metal.proccess_action method). I think we're simply monkeypatching before the rescue handlers get called.

Second is that the most recent work in #1124 makes the first point more obvious. IIRC, previously rack spans were naively marked errors/not-error based on the response being a 500 or not, but now they directly inherit the rails controller errors. Since rescue_from can decide to swallow an error (or replace a 500 with a 404), and we don't account for rescue_from, the rack span doesn't get adjusted.

So this is a long and winding road of saying that we want to address this. I'll discuss with the team and keep this issue updated.

In the meantime, if you need a workaround here I can suggest adding a Processor to the flushing pipeline that will adjust these 404 rack spans to not be reflected as errors

Datadog::Pipeline.before_flush(
  # https://github.com/DataDog/dd-trace-rb/blob/master/docs/GettingStarted.md#processing
  # Ensure Rack Span 404 responses are not set as errors on rack span
  Datadog::Pipeline::SpanProcessor.new { |span| 
    begin

      if span.name && span.name == 'rack.request' && span.status == 1 && span.get_tag('http.status_code') == '404'
        span.status = 0  
        span.set_tag('error.stack', nil)
        span.set_tag('error.type', nil)
        span.set_tag('error.msg', nil)
      end
    rescue StandardError => error
      # no-op
    end
  }
)

@iain
Copy link
Author

iain commented Sep 1, 2020

I see the Rack fixees in #1162, but I realized there is also a rescue_from in ActiveJob (and maybe more places like ActionMailer, because rescue_from is part of a mixin). How does this impact other systems like ActiveJob? I don't see any easy attribute on the span to see if it turned out okay, like a status code.

@ericmustin
Copy link
Contributor

@iain So, for ActionMailer, we actually don't support it at the moment afaik, but if we did I think it would still exist as a child span of Rack, so the same fix would apply. And for ActiveJob, I believe our instrumentation exists only in the context of other libraries, like Sidekiq. So i'd have to see where the sidekiq middleware hooks are here, if they're a level above active job than it would probably be handled appropriately.

We're still discussing internally long term whether there's a robust way we can instrument these hooks, in this case rescue_from but more broadly things like before/after hooks. We'd like to be able to offer that visibility. In the meantime though I think the critical thing is that we just don't propagate stale/handled errors to the rack level, as that would being to interfere with users generated service level statistics, and potentially alerting/monitoring behavior. So I wanted to make sure a patch got in for that, ideally before our next release.

@iain
Copy link
Author

iain commented Sep 1, 2020

Makes sense. Thanks!

@marcotc marcotc added this to the 0.40.0 milestone Sep 8, 2020
@marcotc
Copy link
Member

marcotc commented Sep 8, 2020

👋 @iain, we've shipped 0.40.0 with the changes we talked about earlier, which should fix the issue we were seeing here (#1162).

Let us know if it addresses the issues in your application!

@iain
Copy link
Author

iain commented Sep 9, 2020

Just upgraded to 0.40.0, and we were able to remove the workarounds mentioned above. Looking good, thanks! 👍

@iain iain closed this as completed Sep 9, 2020
@ericmustin
Copy link
Contributor

@iain very much appreciate the patience and confirmation here. We'll try to update here as we have more info regarding longer term work measuring rescue_from and before/after hooks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants