
[celery] use Celery signals instead of patch #530

Merged: 17 commits from palazzem/celery-signals into palazzem/celery-integration on Aug 16, 2018

Conversation

@palazzem commented Aug 6, 2018

Overview

Closes #495

The previous approach used a monkey-patch mechanism to instrument the base Celery() app, including any registered tasks. Unfortunately, the patch mechanism altered the way Celery registers old-style Task classes (abstract tasks), which complicated signature matching for the task decorator. That approach caused many problems with synchronous Celery calls (#495).

This PR removes the old monkey-patch mechanism in favor of Celery signals. The new approach uses the internal tracing system to detect when tasks are executed via the Celery API.

Caveats

Signals are not emitted when the task is called as a plain function. This means that if we have a task defined as:

@app.task
def my_task():
    pass

calling my_task() directly will not generate any traces. This may be considered a breaking change in the integration behavior, but it cannot be avoided without reintroducing the monkey-patch (which caused other problems with old-style tasks and signature matching).

Backward Compatibility

This PR introduces the following changes:

  • if a task function is called directly (my_task() or via the .run() method), no traces are generated. Synchronous calls made via .apply() work as expected. Using the Celery API (.delay(), .apply_async(), .apply()) is the more common pattern, but tracing could stop working on systems that rely on direct calls (see the sketch after this list)
  • this PR doesn't introduce any filtering to decide whether a task is instrumented. With out-of-the-box instrumentation, all tasks are instrumented and it's no longer possible to instrument only a subset of tasks
  • the methods available in the previous API (patch_task() and unpatch_task()) are preserved, so there are no breaking changes in the API surface. The former now adds instrumentation to all tasks, while the latter is a no-op.
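
As a quick illustration of the points above (the task name and broker URL here are placeholders, not taken from this PR), only calls that go through the Celery API emit the prerun/postrun signals and therefore get traced:

from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379')

@app.task
def add(x, y):
    return x + y

# Traced: these go through the Celery API, so the task signals fire
add.apply(args=(1, 2))        # synchronous, runs in the current process
add.delay(1, 2)               # asynchronous, runs on a worker
add.apply_async(args=(1, 2))  # asynchronous, runs on a worker

# Not traced: plain function calls bypass the Celery machinery entirely
add(1, 2)
add.run(1, 2)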

@@ -0,0 +1,64 @@
"""
Author:
The documentation is updated here: #531

pin = pin or Pin(service=WORKER_SERVICE, app=APP, app_type=AppTypes.worker)
pin.onto(app)

signals.task_prerun.connect(trace_prerun)
Author:
The same signal handlers are not registered multiple times, so they are invoked globally as long as at least one app is instrumented. In general we have only one Celery app active per process, even though we support multiple apps running at the same time.


# propagate the `Span` in the current task Context
span = pin.tracer.trace(c.WORKER_ROOT_SPAN, service=c.WORKER_SERVICE, resource=task.name)
propagate_span(task, task_id, span)
Author:
This part is required to move the span from one signal handler to another. You can check the weak dictionary implementation for more details; there is also a test that checks for a memory leak. It would be great to have your opinion on that.

Side note: we can't store our objects inside the signal arguments. Any change to the dictionary passed there would alter Celery's behavior.
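
A minimal sketch of how such a weak-value registry could work (the helper name propagate_span mirrors the one used in this PR, but the module-level dictionary and retrieve_span below are simplifications for illustration, not the exact implementation):

import weakref

# task_id -> Span; values are held weakly, so an abandoned Span that the
# tracer no longer references cannot accumulate in this registry
_TASK_SPANS = weakref.WeakValueDictionary()

def propagate_span(task, task_id, span):
    # attach the Span so the task_postrun handler can retrieve and finish it
    _TASK_SPANS[task_id] = span

def retrieve_span(task, task_id):
    # returns None if the Span was never stored or has been garbage collected
    return _TASK_SPANS.get(task_id)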

Member:
To my understanding (which could be faulty, I'm not well-versed in weak reference usage) it seems as though the only reference to the in-transit span is through the weakref dictionary. If this is the case then aren't the spans subject to being garbage collected at any time? Could this lead to spans being garbage collected before they finish?

Am I missing something?

@SeanOC:
I think this makes sense.

If I'm following things correctly, the tracer object has a strong reference to the span so it won't get garbage collected until the tracer finishes the trace and drops its reference to the span.

In theory having the link to the task as a normal link should be fine, but in the case where there's a memory leak for tasks (e.g. finished tasks being kept around in memory) we'd end up amplifying that problem.

This approach should buy us a bit of safety for fairly minimal risk and complexity.

Author:
Correct, we hold a strong reference in the tracer (the Context, to be accurate). My goal is to avoid amplifying a memory leak from our side, even though a leak could still originate from another part of the instrumentation. This change propagates tracing data from one signal to the other while minimizing the risk of a leak in case something goes wrong.

I'm keeping this implementation for now.

)

# Enable instrumentation everywhere
patch_app(task.app)
Author:
I'm not sure this behavior is correct. If developers want to instrument only two or three tasks, it's probably better to make this a no-op too; otherwise they will see everything instrumented. I think it's a matter of tradeoffs here.

Member:
In order to allow developers to instrument a select set of tasks, could we provide a configuration setting like patched_tasks=frozenset([task1, task2, ...]) and then check in the trace methods whether the task should be instrumented?
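
A rough sketch of the idea being floated here (patched_tasks is hypothetical; ddtrace had no configuration system at this point, so the module-level constant below is purely illustrative):

# hypothetical opt-in filter; an empty set means "trace everything"
PATCHED_TASKS = frozenset(['tasks.add', 'tasks.send_email'])

def trace_prerun(*args, **kwargs):
    task = kwargs.get('sender')
    if task is None:
        return
    # when a filter is configured, skip tasks that are not explicitly listed
    if PATCHED_TASKS and task.name not in PATCHED_TASKS:
        return
    # ... start the worker span as usual ...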

@SeanOC:
Yeah, this one feels pretty rough.

My initial reaction is to say that this should be a no-op as normally that's safer than having a more expansive execution than the original semantics.

That being said, after chewing on it a bit, I think this (patch everything) is the right call. The reason being, for us too much data > lost data. If somebody were to apply an update without noticing the deprecation and we made this a no-op, they'd irretrievably lose data. On the other hand, if their account gets spammed with more than they wanted, it would be obvious where the problem is and they'd still have access to the traces they cared about (among the noise).

Author:
OK, I will keep it that way. My reasoning for not providing a patched_tasks config is that I'd first like to understand whether it's a real use case. Also, we don't have a configuration system in place right now, so it's hard to provide this setting for Celery. I may work on it in another PR.

return body.get('id')


def require_pin(decorated):
Author:
I think this is not needed anymore. Probably we should simply remove it.

Contributor:
I don't see it anywhere else in the PR. So yeah it looks safe to drop.

Contributor:
It was only used for testing purposes for old-style tasks, so it should be safe.

Author:
Removing it then.

@@ -35,6 +35,7 @@ class Span(object):
'_context',
'_finished',
'_parent',
'__weakref__',
Author:
This is required to allow storing weak references to these instances. In a sense it's a big change to the API, but it doesn't change anything about how the Span works. If we have other ideas, we may change how the Span is propagated.
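
For context, this is a generic Python detail rather than code from this PR: a class that defines __slots__ (and doesn't inherit weak-reference support from a base class) cannot be weakly referenced unless '__weakref__' is listed among the slots:

import weakref

class WithoutWeakref:
    __slots__ = ['name']

class WithWeakref:
    __slots__ = ['name', '__weakref__']

weakref.ref(WithWeakref())     # works: the instance reserves space for weak references
weakref.ref(WithoutWeakref())  # raises TypeError: cannot create weak reference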

eq_(t, 42)

traces = self.tracer.writer.pop_traces()
eq_(0, len(traces))
Author:
Same here.


def test_fn_task_parameters_bind(self):
# it should execute a traced task that has parameters
eq_(1, len(traces[0]))
Author:
Previously, the .run() method was instrumented via the monkey-patch, so calling .apply() generated 2 spans: one for apply() and one for run(). From the developer's perspective I think it's better to have only one span, because apply() (a synchronous call) always executes the body of the function.

t.run()
spans = self.tracer.writer.pop()
self.assertEqual(len(spans), 2)
res = t.apply()
Author:
Test changed to use apply() instead of run() because of the considerations above.

@@ -1,439 +0,0 @@
import celery
Author:
This is removed entirely. All these cases are tested under test_integration.py.

weak_dict.get(task_id).finish()
self.tracer.writer.pop()
self.tracer.writer.pop_traces()
gc.collect()
Author:
Because we use weak references, the entry is eligible for garbage collection at any time after the Span is finished; it is not collected while other references to the Span still exist. This is needed because the signal mechanism may not always be invoked properly by Celery (unexpected behaviors / unhandled executions), so even an unfinished Span can be garbage collected once the Tracer releases its reference to it.

We may need to check whether we have to connect to other signals to be sure Spans are always finished.
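
As a self-contained illustration of what that test verifies (the Span class below is a trivial stand-in, not the real one; the real Span only becomes weakly referenceable thanks to the '__weakref__' slot added in this PR):

import gc
import weakref

class Span(object):
    __slots__ = ['name', '__weakref__']

registry = weakref.WeakValueDictionary()
span = Span()
registry['task-id'] = span

del span        # drop the only strong reference, as the tracer would after flushing
gc.collect()
assert registry.get('task-id') is None  # the entry is gone, so nothing leaks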

return
else:
span.finish()
remove_span(task, task_id)
Contributor:
Nit: remove_span would make me think that we remove the span itself. Maybe go with something like detach? (In which case propagate becomes attach?)

Author:
I agree, that verb is much more accurate; changing it.

@LotharSee (Contributor) left a comment:

Overall it looks good to me.

My main meta question: are we certain that the pre and post signals are always caught? Is there any setup in which one could be missing, which would then break a whole trace?

@mgu (Contributor) commented Aug 8, 2018:

From what I've seen in the Celery code, task_postrun is always called when task_prerun has been called.

@Kyle-Verhoog (Member) left a comment:

I really like the signal patching. So much cleaner!

Mostly just nits, small improvements and me answering my own questions.

"""
pin = Pin.get_from(app)
if pin is not None:
delattr(app, _DD_PIN_NAME)
Member:
[Not in the scope of this PR] should we consider adding a Pin.remove_from(app) method? It doesn't seem ideal that we have to import an implementation detail from the pin module to delete a pin.

Author:
Yeah, that was the plan. Because of the nature of the changes introduced in 0.13.0, I didn't want to touch anything so critical in our core. Consider also that we would need to change the Pin internal API a bit, so it's not a quick change with a small impact.

Scheduling some work for the future.


from .constants import CTX_KEY

# Service info
Member:
These are duplicates from the values defined in constants.py:

# Service info
APP = 'celery'
PRODUCER_SERVICE = os.environ.get('DATADOG_SERVICE_NAME') or 'celery-producer'
WORKER_SERVICE = os.environ.get('DATADOG_SERVICE_NAME') or 'celery-worker'

# Service info
APP = 'celery'
PRODUCER_SERVICE = getenv('DATADOG_SERVICE_NAME') or 'celery-producer'
WORKER_SERVICE = getenv('DATADOG_SERVICE_NAME') or 'celery-worker'

Author:
Nice catch; it's probably a stray copy I made while trying to make the PR readable. Removing them and keeping only the values in constants.py.

)


def patch_app(app, pin=None):
Member:
patch_app does not appear to be idempotent. If it were called repeatedly, would we register the same signal receiver multiple times on the same signal?

It looks like in the implementation of Signal.connect we would add another identical receiver: https://github.com/celery/celery/blob/b24425ea6320c2b95fe1873f1f00966c3b38952b/celery/utils/dispatch/signal.py#L221

but I'm not 100% sure about this.

We could do the same as our other integrations and set a patched attribute on the app, then check for it when patching. Like we do here:

if getattr(redis, '_datadog_patch', False):
    return
setattr(redis, '_datadog_patch', True)

Edit: I do see that there's an integration test covering idempotent patching, but I still think it'd be good practice to keep the patching attribute on the task to be explicit and not depend on Celery.

Author:
Correct; worth changing, especially because it relies on a Celery internal that may change in the future and lead to unexpected behavior.
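
A minimal sketch of the guard being discussed (the sentinel attribute name is borrowed from the redis example above; trace_postrun is assumed by analogy with trace_prerun, and the exact shape of patch_app may differ in the final code):

from celery import signals

def trace_prerun(*args, **kwargs):
    ...  # start the worker span (the handler discussed in this PR)

def trace_postrun(*args, **kwargs):
    ...  # finish the span

def patch_app(app, pin=None):
    # bail out early if this app was already instrumented, keeping the call idempotent
    if getattr(app, '_datadog_patch', False):
        return app
    setattr(app, '_datadog_patch', True)

    # connect the handlers once; Celery signals are process-global
    signals.task_prerun.connect(trace_prerun)
    signals.task_postrun.connect(trace_postrun)
    return app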

# changes in Celery
task = kwargs.get('sender')
task_id = kwargs.get('task_id')
if task is None or task_id is None:
Member:
Do we maybe want to log something in these cases where we shortcut out without taking any action?

This could be useful for users trying to get our tracing working with an unsupported Celery version.

Author:
Yeah let's log something in debug mode.
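
A small sketch of what that debug log could look like (the logger name and message text are illustrative, not the final wording):

import logging

log = logging.getLogger(__name__)

def trace_prerun(*args, **kwargs):
    # signal arguments may differ across Celery versions; bail out (and say so) if missing
    task = kwargs.get('sender')
    task_id = kwargs.get('task_id')
    if task is None or task_id is None:
        log.debug('unable to extract the Task and the task_id; this Celery version may not be supported')
        return
    # ... start the worker span as usual ...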


def test_unpatch_app(self):
# When unpatch_app is called on a patched app we unpatch the `task()` method
# When celery.App is patched it must not include a `Pin` instance
Member:
typo nit:

# When celery.App is patched it must not include a `Pin` instance

patched -> unpatched

Author:
Good catch!

@@ -0,0 +1,115 @@
# stdlib
Member:
nit: should this file be utils.py?

Author:
Changing it to utils.py. It's not exported from our __all__ list, so it should not be considered a breaking change.


eq_(t, 42)

traces = self.tracer.writer.pop_traces()
eq_(0, len(traces))
Member:
We should be sure to document this so that users are aware that .run() invocations will not generate spans.

@SeanOC left a comment:
Looks good.

Provided feedback where requested. Otherwise it looks like other reviewers already commented on anything I'd look to change.


Kyle-Verhoog and others added 3 commits August 13, 2018 16:14
* [core] revert argv patch

* [core] patch celery via post-import hooks

* [core] set patched to true for celery patching

* [core] remove patched module count side effect

* [celery] add tests verifying the import hook patching
@SeanOC left a comment:
👍

@palazzem merged commit 0ea3774 into palazzem/celery-integration on Aug 16, 2018
@palazzem deleted the palazzem/celery-signals branch on August 16, 2018