Memory leak while handling large trace from long-running celery tasks #1632

@pjwerneck

Description

TL;DR

A long-running celery task keeps adding spans to a trace that's already larger than the API payload limit, using a lot of memory and causing a memory spike during serialization, all for a trace that will be dropped anyway.

Details

We have a celery task that occasionally handles a large payload, taking up to an hour to complete and performing tens of thousands of queries. Since the ddtrace.contrib.celery patch creates a single parent span for the whole task execution, all child spans are attached to it and flushed at the end, when the whole trace is serialized and either sent to the Datadog API or dropped because it exceeds the API payload limit.

This works as expected for small tasks, but for these large, long-running tasks it always results in a trace larger than the API payload limit. We get the "Trace is larger than the max payload size, dropping it" or "Trace is too big to fit in a payload, dropping it" warnings. If it's the API limit, fair enough, but ddtrace keeps pointlessly adding to the trace even after it's already too large for the API. Also, during serialization it holds both the raw and serialized payload in memory, causing a short memory spike that in our application sometimes hits Heroku's memory quota and causes it to start swapping.
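The serialization spike is easy to illustrate in plain Python. This is a sketch, not ddtrace internals: json stands in for the msgpack encoding ddtrace actually uses, and the span fields are illustrative.

```python
import json

# Build a large fake trace: many small span-like dicts (fields are illustrative).
spans = [
    {"trace_id": 1, "span_id": i, "name": "postgres.query", "duration": 42}
    for i in range(200_000)
]

# Encoding the whole trace in one shot keeps BOTH the span objects and the
# encoded payload alive at the same time, roughly doubling peak memory for
# the duration of the flush.
payload = json.dumps(spans)

print(f"encoded payload: {len(payload) / 1024 / 1024:.1f} MiB")
```

A streaming or chunked encoder would avoid holding both copies, but as the issue notes, the simpler fix is to stop buffering a trace that can no longer fit in a payload at all.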

Our temporary workaround was to disable the celery patch with DD_TRACE_CELERY_ENABLED=False, so we still get traces for queries and other inner calls in the tasks, but they aren't grouped under a single parent span.
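For reference, the workaround is just the environment toggle named above (shown here as a shell export; how you set it depends on your deployment, e.g. a Heroku config var):

```shell
# Disable only the celery integration; other integrations (django, database
# clients, etc.) keep producing their own, smaller traces.
export DD_TRACE_CELERY_ENABLED=false
```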

Ideally, ddtrace should be able to get a rough estimate of when a parent trace is already too large for the API and stop adding to it.
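The suggested behavior could look roughly like the sketch below: a span buffer that keeps a running size estimate and silently drops spans once the trace can no longer fit in a payload. All names, the limit, and the json-based size estimate are hypothetical, not ddtrace internals.

```python
import json

MAX_PAYLOAD_BYTES = 8 * 1024 * 1024  # hypothetical API payload limit


class TraceBuffer:
    """Accumulates spans, but stops once the estimated encoded size
    exceeds the payload limit, instead of growing a doomed trace."""

    def __init__(self, max_bytes=MAX_PAYLOAD_BYTES):
        self.max_bytes = max_bytes
        self.spans = []
        self.estimated_size = 2  # brackets of the outer JSON array
        self.dropped = 0

    def add(self, span):
        # Cheap per-span estimate; real code would use the msgpack size.
        size = len(json.dumps(span)) + 2  # + separator
        if self.estimated_size + size > self.max_bytes:
            self.dropped += 1  # trace is already too big: stop buffering
            return False
        self.spans.append(span)
        self.estimated_size += size
        return True


# Tiny limit to demonstrate the cutoff behavior.
buf = TraceBuffer(max_bytes=1024)
for i in range(100):
    buf.add({"span_id": i, "name": "postgres.query"})
print(len(buf.spans), "kept,", buf.dropped, "dropped")
```

Even a rough estimate like this would cap the memory held by a trace that is guaranteed to be dropped, and the flush could emit a single warning with the drop count instead of buffering to the end.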

Which version of dd-trace-py are you using?

0.41.0

Which version of the libraries are you using?

django 3.0.5
celery 4.4.7

How can we reproduce your problem?

Create a long-running celery task that generates a lot of spans and activate the ddtrace.contrib.celery patch. It will produce a large trace that's dropped at the end.

What is the result that you get?

ddtrace keeps adding to a doomed trace, wasting memory, and causing a huge memory spike during serialization at the end.

What is the result that you expected?

ddtrace shouldn't keep adding to a trace that's already past the API payload limit.
