Sidekiq integration with custom error handler #958

ryanhouston · 2020-02-24T21:50:22Z

I'm wondering if there's currently a way to provide a custom error handler for the Sidekiq server integration. It doesn't appear to be possible though I think I see how it could be done unobtrusively.

I'm happy to provide a PR if this seems reasonable and there isn't already a way to accomplish this.

Goal
Allow sidekiq to retry jobs gracefully for known transient errors without reporting errors to Datadog.

Description
By providing an error handler proc to the sidekiq configuration options, the application code could include an ignore list for certain error types.

An application configuration provide like the following

# config/initializers/datadog.rb
IGNORED_ERRORS = [CustomTransientError].freeze
sidekiq_error_handler = proc do |span, error| 
  span.set_tag('error', true) unless IGNORED_ERRORS.include?(error.class)
end

Datadog.configure do |c|
  c.use :sidekiq, :on_error => sidekiq_error_handler
end

And updating this line:

dd-trace-rb/lib/ddtrace/contrib/sidekiq/server_tracer.rb

Line 22 in 39083f3

    
           @tracer.trace(Ext::SPAN_JOB, service: service, span_type: Datadog::Ext::AppTypes::WORKER) do |span|

to provide the on_error option to the @tracer.trace call:

@tracer.trace(Ext::SPAN_JOB, service: service, span_type: Datadog::Ext::AppTypes::WORKER, on_error: configuration[:on_error]) do |span|
  # same code here
end

This could also use the worker_config to provide this custom error handler on a per worker basis.

The text was updated successfully, but these errors were encountered:

delner · 2020-02-25T18:39:48Z

@ryanhouston This might make some sense, I think there are a couple of similarities in other parts of the code:

HTTP requests that return 4XX codes are specified as errors/non-errors based on user configuration (I think this is a strong analogy to what you'd like to do here.)
We have some "hooks"/"callbacks" for certain instrumentation (Net::HTTP) that allows users to specify custom instrumentation.

I'm not opposed to adding this as a one-off feature for Sidekiq, but I'd like to consider if there's a more general solution/implementation in order to maintain consistency and expand the benefits to other instrumentation as well.

This could take a few different forms:

A configurable on_error handler for all job frameworks e.g.

Datadog.configure do |c|
  c.use :sidekiq do |sidekiq|
    c.on_error { |span, error, job| span.set_error(error) unless IGNORED_ERRORS[job.class].include?(error.class) }
  end
  # Identical for Resque, DelayedJob, Shoryuken, etc...
end

A configurable on_finish handler for any integration that implements it. Similar to what's above except it's invoked whenever a span finishes (either success or error.) More general purpose for other integrations as well.

I've been thinking a lot about adding these kinds of hooks into the trace lifecycle to allow for such customization for some time now... maybe this is a good time to do this?

I guess some questions I have before a decision is made:

Would this hook pattern be generally useful/solve most customization issues? (Is this the right thing to build?)
How much more effort does it take to make a general fit than a one off? Much more or just a little more? (What's the right size of scope?)
- I'd like to minimize how often we change the UX for configuring integrations; users don't like breaking changes and inconsistency between both integrations and versions on the trace library. This tends to discourage highly iterative evolution of features like this.
To what extent are you interested in contributing and how much effort do you have to spare? (What budget do we have?)
- This is something I'd be very interested in having, but could use a hand in building. I can definitely spare time to consult/review but likely can't develop in the short term.

Let me know your thoughts!

ryanhouston · 2020-03-02T23:26:51Z

@delner The "configurable on_error handler for all job frameworks" solution makes sense to me based on my usage of this gem and the code I've seen so far. Especially considering the hook and option is already integrated into Datadog::Tracer#trace so it is reusing the existing mechanisms.

The on_finish handler solution is not as clear to me and seems like it would require deeper changes to the tracer or each integration in order to capture the errors and relevant result object in a way that the integration configuration can make use of.

Regarding your questions:

The on_error handler for all job frameworks seems highly useful in the context of job worker frameworks and follows current conventions and concepts of the Tracer#trace calls. I don't have the expertise in this codebase and all the various integrations to know at this time how much more difficult a on_finish handler would be to implement and if it would be the right solution for all the various integration types which may require hooks at more specific points or integration specific parameter types in order to be effective.
The on_error handler change could include all job worker framework integrations without much extra effort.
- Seemingly since the Tracer#trace call already includes the on_error option, it could make sense as a general fit solution for all integrations to react to exceptions?
- Other types of integrations may require more specific hooks at different points to effectively accomplish their needs (which are currently unknown)
- The general fit on_finish handler seems like it might require updates to each integration middleware to properly hook this in with the necessary parameters being passed. This seems like significantly more work to decide on implementations and verify (should an HTTP request handler get the [request, response, error]? I'm not sure if each integration's trace context requires special considerations.
I'd be happy to contribute a PR in the short-term for something like the on_error handler for the job framework integrations. Something that is more cross-cutting that required updating all integrations for a general fit solution might be out of my available time budget at the moment. I could potentially chip away at it over a longer period with well-defined project scope and deliverables, but can't make any guarantees.

Thanks for your help!

delner · 2020-03-03T18:50:08Z

@ryanhouston That all makes sense to me. In this case, I think we can start with this on_error mechanism for all job frameworks, as there's already a clearly demonstrated use case, and an on_error hook wouldn't exclude the possibility of an on_finish hook in the future.

I'd welcome a PR for on_error if that's something you'd like to take a crack at! To that end, I can provide any reviews or consultation as needed in turn.

And thanks for the suggestion and thoughtful response!

Stashchenko · 2021-11-24T13:16:20Z

We tried to use like this:

IGNORE_ERRORS_LIST = %w[Sidekiq::Limiter::OverLimit RetryNotAnError]

Datadog::Pipeline.before_flush do |trace|
    trace.each do |span|
      if span.service == 'sidekiq' && span.status == 1 && IGNORE_ERRORS_LIST.include?(span.instance_variable_get('@meta')&.dig('error.type'))
        span.status = 0
      end
    end
  end

but is more like a hack instead of a solution. Would be nice to see any feature to ignore some kind of errors by the DD

marcotc · 2021-11-24T20:31:16Z

This feature has been added by #1212 in version https://github.com/DataDog/dd-trace-rb/releases/tag/v0.43.0.

Could you give it a try @Stashchenko, @ryanhouston and let us know if this works for your use case?

Stashchenko · 2021-11-24T21:03:42Z

This feature has been added by #1212 in version https://github.com/DataDog/dd-trace-rb/releases/tag/v0.43.0.

Could you give it a try @Stashchenko, @ryanhouston and let us know if this works for your use case?

we'll let you know asap.
Thanks a lot @marcotc !

Stashchenko · 2021-11-25T11:43:24Z

@marcotc, @ryanhouston It works very well! thanks a lot for help!

marcotc · 2021-11-26T20:40:09Z

🎉 awesome! I'll close this issue as resolved.

delner self-assigned this Feb 25, 2020

delner added community Was opened by a community member feature-request A request for a new feature or change to an existing one integrations Involves tracing integrations labels Feb 25, 2020

marcotc closed this as completed Nov 26, 2021

feliperaul mentioned this issue Oct 5, 2024

Improve Retry documentation for better handling of transient errors (or new API for retrying without raising) sidekiq/sidekiq#6450

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sidekiq integration with custom error handler #958

Sidekiq integration with custom error handler #958

ryanhouston commented Feb 24, 2020

delner commented Feb 25, 2020

ryanhouston commented Mar 2, 2020

delner commented Mar 3, 2020

Stashchenko commented Nov 24, 2021 •

edited

Loading

marcotc commented Nov 24, 2021

Stashchenko commented Nov 24, 2021 •

edited

Loading

Stashchenko commented Nov 25, 2021

marcotc commented Nov 26, 2021

Sidekiq integration with custom error handler #958

Sidekiq integration with custom error handler #958

Comments

ryanhouston commented Feb 24, 2020

delner commented Feb 25, 2020

ryanhouston commented Mar 2, 2020

delner commented Mar 3, 2020

Stashchenko commented Nov 24, 2021 • edited Loading

marcotc commented Nov 24, 2021

Stashchenko commented Nov 24, 2021 • edited Loading

Stashchenko commented Nov 25, 2021

marcotc commented Nov 26, 2021

Stashchenko commented Nov 24, 2021 •

edited

Loading

Stashchenko commented Nov 24, 2021 •

edited

Loading