Skip to content

Conversation

@taegyunkim
Copy link
Contributor

@taegyunkim taegyunkim commented Mar 19, 2025

        start = None
        if hasattr(self, "_self_acquired_at"):
            # _self_acquired_at is only set when the acquire was captured
            # if it's not set, we're not capturing the release
            start = self._self_acquired_at
            del self._self_acquired_at

Above code can raise an AttributeError, if there are multiple threads calling on release(). Though in such scenarios, except for the thread which actually held and released the lock, such threads would result in a RuntimeError, the current behavior makes customers blame our code instead of theirs.

Another thing is that both acquire() and release() uses try/except/finally to wrap the call to inner function. However, even when the inner function call failed, the profiler emitted a sample under some circumstances. This PR fixes this.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@github-actions
Copy link
Contributor

github-actions bot commented Mar 19, 2025

CODEOWNERS have been resolved as:

releasenotes/notes/profiling-lock-error-eb98c3b9dff3f07c.yaml           @DataDog/apm-python
ddtrace/profiling/collector/_lock.py                                    @DataDog/profiling-python
tests/profiling_v2/collector/test_threading.py                          @DataDog/profiling-python

@github-actions
Copy link
Contributor

github-actions bot commented Mar 19, 2025

Bootstrap import analysis

Comparison of import times between this PR and main.

Summary

The average import time in this PR is: 239 ± 3 ms.

The average import time in main is: 241 ± 4 ms.

The import time difference between this PR and main is: -1.8 ± 0.2 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.946 ms (0.81%)
ddtrace.bootstrap.sitecustomize 1.284 ms (0.54%)
ddtrace.bootstrap.preload 1.284 ms (0.54%)
ddtrace.internal.products 1.284 ms (0.54%)
ddtrace.internal.remoteconfig.client 0.643 ms (0.27%)
ddtrace 0.662 ms (0.28%)

@pr-commenter
Copy link

pr-commenter bot commented Mar 19, 2025

Benchmarks

Benchmark execution time: 2025-03-20 15:51:21

Comparing candidate commit 014a300 in PR branch taegyunkim/scp-639-acquired-at with baseline commit 39e5baf in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 497 metrics, 2 unstable metrics.

scenario:iast_aspects-ospathjoin_aspect

  • 🟩 execution_time [-547.556ns; -480.668ns] or [-9.796%; -8.600%]

@taegyunkim taegyunkim marked this pull request as ready for review March 19, 2025 13:53
@taegyunkim taegyunkim requested a review from a team as a code owner March 19, 2025 13:53
Copy link
Contributor

@nsrip-dd nsrip-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for this change? Is it just an optimization?

@taegyunkim taegyunkim requested a review from a team as a code owner March 19, 2025 14:24
@taegyunkim taegyunkim added the Profiling Continous Profling label Mar 19, 2025
@taegyunkim
Copy link
Contributor Author

taegyunkim commented Mar 19, 2025

@nsrip-dd
Initially to address customer escalation SCP-639.

Do you mind taking another look at this?

Copy link
Contributor

@nsrip-dd nsrip-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The fix generally makes sense to me. I was able to reproduce the attribute error with this program, run with ddtrace-run and profiling enabled:

import threading
import sys

sys.setswitchinterval(0.0000001)

def unlock(l):
    try:
        l.release()
    except RuntimeError:
        pass
    except Exception as e:
        raise e

while True:
    l = threading.Lock()
    l.acquire()
    threads = [threading.Thread(target=unlock, args=[l,]) for _ in range(64)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Output:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/main.py", line 12, in unlock
    raise e
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/main.py", line 8, in unlock
    l.release()
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/_lock.py", line 242, in release
    return self._release(self.__wrapped__.release, *args, **kwargs)
  1 from threading import Semaphore, Thread
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/_lock.py", line 187, in _release
    start = self._self_acquired_at
            ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_thread.lock' object has no attribute '_self_acquired_at'

With this fix, doesn't crash after a pretty long run. The test you have seems reasonable, since my reproducer doesn't crash in a deterministic amount of time.

Just left a few questions.

try:
del self._self_acquired_at
except AttributeError:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a little trouble picturing how we can reach this case. If del fails due to the attribute error, that means another thread was racing on _release and reached this line first, right? But if that's the case, wouldn't the inner_func call for the thread that hits this exception have already failed since the lock was unlocked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was imagining the following scenario. Suppose there are two threads A and B.
Suppose the lock was locked and thread A calls release() on it. Right after A has completed running inner_func and before running the finally block, another thread B could acquire and release the lock running the finally block till the end, which results in deleting self._self_acquired_at. Then, when thread A tries to delete self._self_acquired_at, it would result in an AttributeError. But in this case, we still want to capture the event.

Copy link
Contributor

@nsrip-dd nsrip-dd Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand, but in that case I don't think this code is doing the right thing. Consider this scenario:

  • Thread A acquires a wrapped lock, and then calls release on it.
  • Thread A reaches inner_func in our wrapper and successfully releases the lock, but doesn't yet advance into the finally block.
  • Thread B comes along and acquires the lock, setting self._self_acquired_at, so we want to record it
  • Thread A now advances into the finally block in _release and deletes self._self_acquired_at
  • Thread B releases the lock, and sees that self._self_acquired_at is not set, and thus doesn't record it

In that scenario, I think this is the wrong behavior. Basically thread A could have deleted the timestamp that thread B was using to record the locking/unlocking, preventing thread B from recording the lock like it intended to. Does that make sense?

I'm still thinking through what the right behavior is, considering the bug that motivated this PR. I think the lock we're wrapping kind of gives implicit protection for the extra state we're tracking? Like, an earlier version of this PR popped the acquired timestamp from the attributes before calling the wrapped unlock function. That seems closer to what we'd need to do...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially thought that the state is implicitly protected too, but turns out any Python threading.Thread can call release() on the lock. So if we pop the timestamp before calling the wrapped unlock function, then even if a thread succeeds to actually release the lock, it might not get the timestamp as it could have been popped by another call to release()

import threading

lock = threading.Lock()
lock.acquire()

def thread_func():
    try:
        lock.release() 
    except RuntimeError as e:
        print(f"Error: {e}")

t = threading.Thread(target=thread_func)
t.start()
t.join()

I initially thought that the thread would result in raising a RuntimeError, but on both Mac OS and Linux, it doesn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh maybe this is ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's okay if a different thread releases the lock than the thread that acquired it, but it's not okay if the lock is released when it isn't locked. And I think the problem there (that motivated this PR) is that if two threads race to release the lock (a user error) then they also race in our wrapper code which is modifying our wrapper state during release. And that turns their error (a RuntimeError) into "our" error.

@taegyunkim
Copy link
Contributor Author

Closing in favor of #12833

@taegyunkim taegyunkim closed this Mar 25, 2025
@taegyunkim taegyunkim deleted the taegyunkim/scp-639-acquired-at branch March 25, 2025 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Profiling Continous Profling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants