fix(profiling): avoid change-after-check pattern for lock #12792
Conversation
Bootstrap import analysis

Comparison of import times between this PR and main.

Summary: The average import time in this PR is 239 ± 3 ms. The average import time in main is 241 ± 4 ms. The import time difference between this PR and main is -1.8 ± 0.2 ms.

Import time breakdown: the following import paths have shrunk:
Benchmarks

Benchmark execution time: 2025-03-20 15:51:21. Comparing candidate commit 014a300 in PR branch. Found 1 performance improvement and 0 performance regressions! Performance is the same for 497 metrics, 2 unstable metrics. Scenario: iast_aspects-ospathjoin_aspect
nsrip-dd left a comment:
What's the motivation for this change? Is it just an optimization?
Co-authored-by: Emmett Butler <723615+emmettbutler@users.noreply.github.com>
nsrip-dd left a comment:
Thanks. The fix generally makes sense to me. I was able to reproduce the attribute error with this program, run with ddtrace-run and profiling enabled:
```python
import threading
import sys

sys.setswitchinterval(0.0000001)

def unlock(l):
    try:
        l.release()
    except RuntimeError:
        pass
    except Exception as e:
        raise e

while True:
    l = threading.Lock()
    l.acquire()
    threads = [threading.Thread(target=unlock, args=[l]) for _ in range(64)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Output:
```
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/main.py", line 12, in unlock
    raise e
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/main.py", line 8, in unlock
    l.release()
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/_lock.py", line 242, in release
    return self._release(self.__wrapped__.release, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nick.ripley/sandbox/python/lockcollectorcrash/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/_lock.py", line 187, in _release
    start = self._self_acquired_at
            ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_thread.lock' object has no attribute '_self_acquired_at'
```
With this fix, it doesn't crash after a pretty long run. The test you have seems reasonable, since my reproducer doesn't crash in a deterministic amount of time.
Just left a few questions.
ddtrace/profiling/collector/_lock.py
Outdated
```python
try:
    del self._self_acquired_at
except AttributeError:
    pass
```
I'm having a little trouble picturing how we can reach this case. If del fails due to the attribute error, that means another thread was racing on _release and reached this line first, right? But if that's the case, wouldn't the inner_func call for the thread that hits this exception have already failed since the lock was unlocked?
I was imagining the following scenario. Suppose there are two threads A and B.
Suppose the lock was locked and thread A calls release() on it. Right after A has finished running inner_func, but before it runs the finally block, another thread B could acquire and release the lock, running the finally block to completion, which deletes self._self_acquired_at. Then, when thread A tries to delete self._self_acquired_at, it gets an AttributeError. But in this case, we still want to capture the event.
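The window being described can be sketched as follows. This is a simplified, hypothetical version of the wrapper, not the actual `_lock.py` code: only the attribute name `_self_acquired_at` and the try/finally shape mirror the real collector, everything else (the class name, `record`) is illustrative.

```python
import threading
import time

class ProfiledLock:
    """Hypothetical sketch of a profiling lock wrapper with the race window."""

    def __init__(self, wrapped):
        self.__wrapped__ = wrapped

    def acquire(self):
        self.__wrapped__.acquire()
        # Shared mutable state: every thread that later releases races on this.
        self._self_acquired_at = time.monotonic()

    def release(self):
        try:
            self.__wrapped__.release()  # "inner_func" in the discussion above
        finally:
            # Window: between the release above and the del below, another
            # thread can acquire and release the lock, deleting the attribute
            # first, so the del here raises AttributeError.
            try:
                start = self._self_acquired_at
                del self._self_acquired_at
                self.record(time.monotonic() - start)
            except AttributeError:
                pass

    def record(self, duration):
        pass  # the real collector would emit a profiling sample here
```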
I think I understand, but in that case I don't think this code is doing the right thing. Consider this scenario:
- Thread A acquires a wrapped lock, and then calls `release` on it.
- Thread A reaches `inner_func` in our wrapper and successfully releases the lock, but doesn't yet advance into the `finally` block.
- Thread B comes along and acquires the lock, setting `self._self_acquired_at`, so we want to record it.
- Thread A now advances into the `finally` block in `_release` and deletes `self._self_acquired_at`.
- Thread B releases the lock, sees that `self._self_acquired_at` is not set, and thus doesn't record it.
In that scenario, I think this is the wrong behavior. Basically thread A could have deleted the timestamp that thread B was using to record the locking/unlocking, preventing thread B from recording the lock like it intended to. Does that make sense?
I'm still thinking through what the right behavior is, considering the bug that motivated this PR. I think the lock we're wrapping kind of gives implicit protection for the extra state we're tracking? Like, an earlier version of this PR popped the acquired timestamp from the attributes before calling the wrapped unlock function. That seems closer to what we'd need to do...
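The "pop the timestamp before calling the wrapped unlock" idea could look roughly like this. This is a hypothetical sketch, not the earlier PR revision being referenced; the class name and the `recorded` list are made up for illustration. It relies on `__dict__.pop` being a single dict operation, so under the GIL at most one racing `release()` call obtains the timestamp.

```python
import threading
import time

class PopFirstLock:
    """Hypothetical "pop before release" variant of the wrapper."""

    def __init__(self, wrapped):
        self.__wrapped__ = wrapped
        self.recorded = []  # stand-in for emitting profiling samples

    def acquire(self):
        self.__wrapped__.acquire()
        self._self_acquired_at = time.monotonic()

    def release(self):
        # Take ownership of the timestamp before touching the real lock, so
        # the wrapper never reads state another thread may have deleted.
        start = self.__dict__.pop("_self_acquired_at", None)
        # A genuine double release still surfaces as the user's RuntimeError.
        self.__wrapped__.release()
        if start is not None:
            self.recorded.append(time.monotonic() - start)
```

The tradeoff matches the concern raised below: the thread that actually releases the lock may lose the timestamp to another thread's failed `release()` call, so that event goes unrecorded.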
I initially thought the state was implicitly protected too, but it turns out any Python thread can call release() on the lock. So if we pop the timestamp before calling the wrapped unlock function, then even if a thread succeeds in actually releasing the lock, it might not get the timestamp, as it could have been popped by another call to release():
```python
import threading

lock = threading.Lock()
lock.acquire()

def thread_func():
    try:
        lock.release()
    except RuntimeError as e:
        print(f"Error: {e}")

t = threading.Thread(target=thread_func)
t.start()
t.join()
```
I initially thought the thread would raise a RuntimeError, but on both macOS and Linux, it doesn't.
Oh maybe this is ok?
Yeah, it's okay if a different thread releases the lock than the thread that acquired it, but it's not okay to release the lock when it isn't locked. And I think the problem there (that motivated this PR) is that if two threads race to release the lock (a user error), they also race in our wrapper code, which modifies our wrapper state during release. That turns their error (a RuntimeError) into "our" error.
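For reference, the user error in question is a plain double release, which an unwrapped CPython lock already reports cleanly as a RuntimeError (minimal demo, no ddtrace involved):

```python
import threading

lock = threading.Lock()
lock.acquire()
lock.release()  # fine: the lock is currently held

try:
    lock.release()  # user error: releasing an unlocked lock
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```

The point of the fix is that the wrapper should surface exactly this RuntimeError to the user, instead of tripping over its own bookkeeping and raising an AttributeError first.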
Closing in favor of #12833
The above code can raise an `AttributeError` if there are multiple threads calling `release()`. Though in such scenarios every thread except the one which actually held and released the lock would get a `RuntimeError`, the current behavior makes customers blame our code instead of theirs.

Another issue is that both `acquire()` and `release()` use `try/except/finally` to wrap the call to the inner function. However, even when the inner function call failed, the profiler emitted a sample under some circumstances. This PR fixes this.