New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(profiling): lock Recorder on reset #1560
fix(profiling): lock Recorder on reset #1560
Conversation
if is_module_patched("threading"): | ||
|
||
@attr.s | ||
class DoubleLock(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I was skeptical of the need to lock across both coroutine and thread at first but I think I was able to work through it in my head (with my limited knowledge of gevent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's really not that obvious. Basically, if you have a thread with coroutines, you have concurrency in that thread between coroutines so you need a gevent-lock to make sure they are excluded and scheduled correctly.
The coroutine also needs to exclude threads, so you need a thread-lock to make sure you have, in the end, one thread and one coroutine acting.
I think what I was scared of is that you can use a gevent-lock from 2 threads, but it seems to work fine if you use a thread-lock first to synchronize. Scary!
import threading
import sys
import gevent.monkey
import time
gevent.monkey.patch_all()
gl = threading.Lock()
sleep = gevent.monkey.get_original("time", "sleep")
thread = gevent.monkey.get_original("_thread", "start_new_thread")
tl = gevent.monkey.get_original("threading", "Lock")()
def lock_unlock():
while True:
tl.acquire()
gl.acquire()
print("thread got it")
sleep(0.01)
tl.release()
gl.release()
sleep(0.01)
th = thread(lock_unlock, ())
while True:
tl.acquire()
gl.acquire()
print("gevent got it")
time.sleep(0.01)
tl.release()
gl.release()
sleep(0.01)
If you remove the thread-lock, it explodes.
I should write a blog post about that. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was curious about the ordering! Thanks for adding the comment here and in the source below 😄. Blog post would be great!
@@ -0,0 +1,65 @@ | |||
# -*- encoding: utf-8 -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol "_nogevent.py" gotta love that we have to bend over backwards to deal with gevent 🙈 🤦
I can't really think of a better name... maybe "compat" or "truth" lol.. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
truth
haha.
I had this idea of module for a few weeks, but I was not able to come with a name.
Finally, it hit me. 😆
This should make it easier to access non-gevent function under gevent.
Added a few comments and one more test, just for the sake of safetyness. |
5f5f65e
to
b4336bd
Compare
The lock-free approach does not work as the following scenario might happen: - Recorder.push_events() is called from a Collector - The collector gets a reference to the event queue and is interrupted - The Scheduler calls Recorder.reset() and passes the old events queue to the Exporter - The Exporter starts to export and iterate over the list of events - The Exporter gets interrupted and the Collector resumes - The Collectors finally pushes its events and go back to sleep - The Exporter resumes and the iteration breaks because the event queue changed This can be seen in the wild with the following backtrace: Traceback (most recent call last): File "/lib/python2.7/site-packages/ddtrace/profiling/scheduler.py", line 44, in flush exp.export(events, start, self._last_export) File "/lib/python2.7/site-packages/ddtrace/profiling/exporter/http.py", line 132, in export profile = super(PprofHTTPExporter, self).export(events, start_time_ns, end_time_ns) File "/lib/python2.7/site-packages/ddtrace/profiling/exporter/pprof.py", line 327, in export for event in events.get(stack.StackSampleEvent, []): RuntimeError: deque mutated during iteration In order to fix that, a Lock must be used. For the gevent case, we actually need two locks: - one for locking out the coroutines - one for locking out the OS thread that might be used (e.g. stack collector) We introduce a DoubleLock class that does exactly that.
b4336bd
to
6ced7f2
Compare
if is_module_patched("threading"): | ||
|
||
@attr.s | ||
class DoubleLock(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was curious about the ordering! Thanks for adding the comment here and in the source below 😄. Blog post would be great!
|
||
def acquire(self): | ||
# You cannot acquire a gevent-lock from another thread if it has been acquired already: | ||
# make sure we exclude the gevent-lock from being acquire by another thread by using a thread-lock first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
refactor(profiling): move gevent/no-gevent code in a _nogevent module
This should make it easier to access non-gevent function under gevent.
refactor(profiling): wait for service to join when using context manager
fix(profiling): lock Recorder on reset
The lock-free approach does not work as the following scenario might happen:
Exporter
This can be seen in the wild with the following backtrace:
Traceback (most recent call last):
File "/lib/python2.7/site-packages/ddtrace/profiling/scheduler.py", line 44, in flush
exp.export(events, start, self._last_export)
File "/lib/python2.7/site-packages/ddtrace/profiling/exporter/http.py", line 132, in export
profile = super(PprofHTTPExporter, self).export(events, start_time_ns, end_time_ns)
File "/lib/python2.7/site-packages/ddtrace/profiling/exporter/pprof.py", line 327, in export
for event in events.get(stack.StackSampleEvent, []):
RuntimeError: deque mutated during iteration
In order to fix that, a Lock must be used.
For the gevent case, we actually need two locks:
We introduce a DoubleLock class that does exactly that.