fix(profiling): avoid change-after-check pattern for lock #12792

taegyunkim · 2025-03-19T11:31:32Z

        start = None
        if hasattr(self, "_self_acquired_at"):
            # _self_acquired_at is only set when the acquire was captured
            # if it's not set, we're not capturing the release
            start = self._self_acquired_at
            del self._self_acquired_at

Above code can raise an AttributeError, if there are multiple threads calling on release(). Though in such scenarios, except for the thread which actually held and released the lock, such threads would result in a RuntimeError, the current behavior makes customers blame our code instead of theirs.

Another thing is that both acquire() and release() uses try/except/finally to wrap the call to inner function. However, even when the inner function call failed, the profiler emitted a sample under some circumstances. This PR fixes this.

Checklist

PR author has checked that all the criteria below are met
The PR description includes an overview of the change
The PR description articulates the motivation for the change
The change includes tests OR the PR description describes a testing strategy
The PR description notes risks associated with the change, if any
Newly-added code is easy to change
The change follows the library release note guidelines
The change includes or references documentation updates if necessary
Backport labels are set (if applicable)

Reviewer Checklist

Reviewer has checked that all the criteria below are met
Title is accurate
All changes are related to the pull request's stated goal
Avoids breaking API changes
Testing strategy adequately addresses listed risks
Newly-added code is easy to change
Release note makes sense to a user of the library
If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
Backport labels are set in a manner that is consistent with the release branch maintenance policy

github-actions · 2025-03-19T11:32:12Z

CODEOWNERS have been resolved as:

releasenotes/notes/profiling-lock-error-eb98c3b9dff3f07c.yaml           @DataDog/apm-python
ddtrace/profiling/collector/_lock.py                                    @DataDog/profiling-python
tests/profiling_v2/collector/test_threading.py                          @DataDog/profiling-python

github-actions · 2025-03-19T11:50:34Z

Bootstrap import analysis

Comparison of import times between this PR and main.

Summary

The average import time in this PR is: 239 ± 3 ms.

The average import time in main is: 241 ± 4 ms.

The import time difference between this PR and main is: -1.8 ± 0.2 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.946 ms (0.81%)

ddtrace.bootstrap.sitecustomize 1.284 ms (0.54%)

ddtrace.bootstrap.preload 1.284 ms (0.54%)

ddtrace.internal.products 1.284 ms (0.54%)

ddtrace.internal.remoteconfig.client 0.643 ms (0.27%)

ddtrace 0.662 ms (0.28%)

pr-commenter · 2025-03-19T12:11:41Z

Benchmarks

Benchmark execution time: 2025-03-20 15:51:21

Comparing candidate commit 014a300 in PR branch taegyunkim/scp-639-acquired-at with baseline commit 39e5baf in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 497 metrics, 2 unstable metrics.

scenario:iast_aspects-ospathjoin_aspect

🟩 execution_time [-547.556ns; -480.668ns] or [-9.796%; -8.600%]

nsrip-dd

What's the motivation for this change? Is it just an optimization?

releasenotes/notes/profiling-lock-error-eb98c3b9dff3f07c.yaml

Co-authored-by: Emmett Butler <723615+emmettbutler@users.noreply.github.com>

taegyunkim · 2025-03-19T18:22:20Z

@nsrip-dd
Initially to address customer escalation SCP-639.

Do you mind taking another look at this?

nsrip-dd

Thanks. The fix generally makes sense to me. I was able to reproduce the attribute error with this program, run with ddtrace-run and profiling enabled:

import threading
import sys

sys.setswitchinterval(0.0000001)

def unlock(l):
    try:
        l.release()
    except RuntimeError:
        pass
    except Exception as e:
        raise e

while True:
    l = threading.Lock()
    l.acquire()
    threads = [threading.Thread(target=unlock, args=[l,]) for _ in range(64)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Output:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/main.py", line 12, in unlock
    raise e
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/main.py", line 8, in unlock
    l.release()
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/_lock.py", line 242, in release
    return self._release(self.__wrapped__.release, *args, **kwargs)
  1 from threading import Semaphore, Thread
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/_lock.py", line 187, in _release
    start = self._self_acquired_at
            ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_thread.lock' object has no attribute '_self_acquired_at'

With this fix, doesn't crash after a pretty long run. The test you have seems reasonable, since my reproducer doesn't crash in a deterministic amount of time.

Just left a few questions.

tests/profiling_v2/collector/test_threading.py

nsrip-dd · 2025-03-19T19:26:43Z

ddtrace/profiling/collector/_lock.py

+                    try:
+                        del self._self_acquired_at
+                    except AttributeError:
+                        pass


I'm having a little trouble picturing how we can reach this case. If del fails due to the attribute error, that means another thread was racing on _release and reached this line first, right? But if that's the case, wouldn't the inner_func call for the thread that hits this exception have already failed since the lock was unlocked?

I was imagining the following scenario. Suppose there are two threads A and B.
Suppose the lock was locked and thread A calls release() on it. Right after A has completed running inner_func and before running the finally block, another thread B could acquire and release the lock running the finally block till the end, which results in deleting self._self_acquired_at. Then, when thread A tries to delete self._self_acquired_at, it would result in an AttributeError. But in this case, we still want to capture the event.

I think I understand, but in that case I don't think this code is doing the right thing. Consider this scenario:

Thread A acquires a wrapped lock, and then calls release on it.

Thread A reaches inner_func in our wrapper and successfully releases the lock, but doesn't yet advance into the finally block.

Thread B comes along and acquires the lock, setting self._self_acquired_at, so we want to record it

Thread A now advances into the finally block in _release and deletes self._self_acquired_at

Thread B releases the lock, and sees that self._self_acquired_at is not set, and thus doesn't record it

In that scenario, I think this is the wrong behavior. Basically thread A could have deleted the timestamp that thread B was using to record the locking/unlocking, preventing thread B from recording the lock like it intended to. Does that make sense?

I'm still thinking through what the right behavior is, considering the bug that motivated this PR. I think the lock we're wrapping kind of gives implicit protection for the extra state we're tracking? Like, an earlier version of this PR popped the acquired timestamp from the attributes before calling the wrapped unlock function. That seems closer to what we'd need to do...

I initially thought that the state is implicitly protected too, but turns out any Python threading.Thread can call release() on the lock. So if we pop the timestamp before calling the wrapped unlock function, then even if a thread succeeds to actually release the lock, it might not get the timestamp as it could have been popped by another call to release()

import threading lock = threading.Lock() lock.acquire() def thread_func(): try: lock.release() except RuntimeError as e: print(f"Error: {e}") t = threading.Thread(target=thread_func) t.start() t.join()

I initially thought that the thread would result in raising a RuntimeError, but on both Mac OS and Linux, it doesn't.

Oh maybe this is ok?

Yeah it's okay if a different thread releases the lock than the thread that acquired it, but it's not okay if the lock is released when it isn't locked. And I think the problem there (that motivated this PR) is that if two threads race to release the lock (a user error) then they also race in our wrapper code which is modifying our wrapper state during release. And that turns their error (a RuntimeError) into "our" error.

ddtrace/profiling/collector/_lock.py

taegyunkim · 2025-03-25T15:26:44Z

Closing in favor of #12833

fix(profiling): avoid change-after-check pattern for lock

abc6307

P403n1x87 approved these changes Mar 19, 2025

View reviewed changes

taegyunkim marked this pull request as ready for review March 19, 2025 13:53

taegyunkim requested a review from a team as a code owner March 19, 2025 13:53

nsrip-dd reviewed Mar 19, 2025

View reviewed changes

reno

686bae3

taegyunkim requested a review from a team as a code owner March 19, 2025 14:24

taegyunkim requested review from emmettbutler and sanchda March 19, 2025 14:24

taegyunkim added the Profiling Continous Profling label Mar 19, 2025

emmettbutler approved these changes Mar 19, 2025

View reviewed changes

releasenotes/notes/profiling-lock-error-eb98c3b9dff3f07c.yaml Outdated Show resolved Hide resolved

taegyunkim and others added 11 commits March 19, 2025 16:21

use getattr and del

ba7d045

simplify

3ec09be

add comments and reorg

86a304b

check result from acquire()

3f12939

Update releasenotes/notes/profiling-lock-error-eb98c3b9dff3f07c.yaml

b2e4378

Co-authored-by: Emmett Butler <723615+emmettbutler@users.noreply.github.com>

tests

45420e8

fix asyncio test

df14fb9

update comment

4faa2fd

add more comment

437de4d

be more specific

2113efa

update relnote

f6cde2c

update test

cf72046

nsrip-dd reviewed Mar 19, 2025

View reviewed changes

datadog-datadog-prod-us1 bot reviewed Mar 19, 2025

View reviewed changes

ddtrace/profiling/collector/_lock.py Outdated Show resolved Hide resolved

datadog-datadog-prod-us1 bot reviewed Mar 19, 2025

View reviewed changes

ddtrace/profiling/collector/_lock.py Outdated Show resolved Hide resolved

datadog-datadog-prod-us1 bot reviewed Mar 19, 2025

View reviewed changes

ddtrace/profiling/collector/_lock.py Outdated Show resolved Hide resolved

taegyunkim added 2 commits March 19, 2025 20:06

add assert

49f8df4

accept nicks suggestion

529c969

datadog-datadog-prod-us1 bot reviewed Mar 19, 2025

View reviewed changes

ddtrace/profiling/collector/_lock.py Outdated Show resolved Hide resolved

taegyunkim added 2 commits March 19, 2025 17:06

Merge branch 'main' into taegyunkim/scp-639-acquired-at

1d233b9

move try/except

014a300

taegyunkim closed this Mar 25, 2025

taegyunkim deleted the taegyunkim/scp-639-acquired-at branch March 25, 2025 15:26

fix(profiling): avoid change-after-check pattern for lock #12792

fix(profiling): avoid change-after-check pattern for lock #12792

Uh oh!

Conversation

taegyunkim commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Reviewer Checklist

Uh oh!

github-actions bot commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bootstrap import analysis

Summary

Import time breakdown

Uh oh!

pr-commenter bot commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

scenario:iast_aspects-ospathjoin_aspect

Uh oh!

nsrip-dd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

taegyunkim commented Mar 19, 2025 • edited by atlassian bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nsrip-dd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

nsrip-dd Mar 19, 2025

Choose a reason for hiding this comment

Uh oh!

taegyunkim Mar 19, 2025

Choose a reason for hiding this comment

Uh oh!

nsrip-dd Mar 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taegyunkim Mar 20, 2025

Choose a reason for hiding this comment

Uh oh!

taegyunkim Mar 20, 2025

Choose a reason for hiding this comment

Uh oh!

nsrip-dd Mar 20, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

taegyunkim commented Mar 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

taegyunkim commented Mar 19, 2025 •

edited

Loading

github-actions bot commented Mar 19, 2025 •

edited

Loading

github-actions bot commented Mar 19, 2025 •

edited

Loading

pr-commenter bot commented Mar 19, 2025 •

edited

Loading

taegyunkim commented Mar 19, 2025 •

edited by atlassian bot

Loading

nsrip-dd Mar 20, 2025 •

edited

Loading