Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(profiling): lock Recorder on reset #1560

Merged
merged 4 commits into from Jul 10, 2020

Conversation

jd
Copy link
Contributor

@jd jd commented Jul 8, 2020

refactor(profiling): move gevent/no-gevent code in a _nogevent module

This should make it easier to access non-gevent function under gevent.

refactor(profiling): wait for service to join when using context manager

fix(profiling): lock Recorder on reset

The lock-free approach does not work as the following scenario might happen:

  • Recorder.push_events() is called from a Collector
  • The collector gets a reference to the event queue and is interrupted
  • The Scheduler calls Recorder.reset() and passes the old events queue to the
    Exporter
  • The Exporter starts to export and iterate over the list of events
  • The Exporter gets interrupted and the Collector resumes
  • The Collectors finally pushes its events and go back to sleep
  • The Exporter resumes and the iteration breaks because the event queue changed

This can be seen in the wild with the following backtrace:

Traceback (most recent call last):
File "/lib/python2.7/site-packages/ddtrace/profiling/scheduler.py", line 44, in flush
exp.export(events, start, self._last_export)
File "/lib/python2.7/site-packages/ddtrace/profiling/exporter/http.py", line 132, in export
profile = super(PprofHTTPExporter, self).export(events, start_time_ns, end_time_ns)
File "/lib/python2.7/site-packages/ddtrace/profiling/exporter/pprof.py", line 327, in export
for event in events.get(stack.StackSampleEvent, []):
RuntimeError: deque mutated during iteration

In order to fix that, a Lock must be used.

For the gevent case, we actually need two locks:

  • one for locking out the coroutines
  • one for locking out the OS thread that might be used (e.g. stack collector)

We introduce a DoubleLock class that does exactly that.

@jd jd requested a review from a team as a code owner July 8, 2020 12:27
@jd jd added this to the 0.41.0 milestone Jul 9, 2020
Kyle-Verhoog
Kyle-Verhoog previously approved these changes Jul 10, 2020
if is_module_patched("threading"):

@attr.s
class DoubleLock(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I was skeptical of the need to lock across both coroutine and thread at first but I think I was able to work through it in my head (with my limited knowledge of gevent).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's really not that obvious. Basically, if you have a thread with coroutines, you have concurrency in that thread between coroutines so you need a gevent-lock to make sure they are excluded and scheduled correctly.
The coroutine also needs to exclude threads, so you need a thread-lock to make sure you have, in the end, one thread and one coroutine acting.

I think what I was scared of is that you can use a gevent-lock from 2 threads, but it seems to work fine if you use a thread-lock first to synchronize. Scary!

import threading
import sys
import gevent.monkey
import time

gevent.monkey.patch_all()

gl = threading.Lock()

sleep = gevent.monkey.get_original("time", "sleep")
thread = gevent.monkey.get_original("_thread", "start_new_thread")
tl = gevent.monkey.get_original("threading", "Lock")()


def lock_unlock():
    while True:
        tl.acquire()
        gl.acquire()
        print("thread got it")
        sleep(0.01)
        tl.release()
        gl.release()
        sleep(0.01)


th = thread(lock_unlock, ())

while True:
    tl.acquire()
    gl.acquire()
    print("gevent got it")
    time.sleep(0.01)
    tl.release()
    gl.release()
    sleep(0.01)

If you remove the thread-lock, it explodes.

I should write a blog post about that. 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was curious about the ordering! Thanks for adding the comment here and in the source below 😄. Blog post would be great!

@@ -0,0 +1,65 @@
# -*- encoding: utf-8 -*-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol "_nogevent.py" gotta love that we have to bend over backwards to deal with gevent 🙈 🤦

I can't really think of a better name... maybe "compat" or "truth" lol.. 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truth haha.

I had this idea of module for a few weeks, but I was not able to come with a name.

Finally, it hit me. 😆

@jd
Copy link
Contributor Author

jd commented Jul 10, 2020

Added a few comments and one more test, just for the sake of safetyness.

@jd jd force-pushed the profiling/fix-sync-lock-exporter branch 2 times, most recently from 5f5f65e to b4336bd Compare July 10, 2020 10:09
The lock-free approach does not work as the following scenario might happen:
- Recorder.push_events() is called from a Collector
- The collector gets a reference to the event queue and is interrupted
- The Scheduler calls Recorder.reset() and passes the old events queue to the
  Exporter
- The Exporter starts to export and iterate over the list of events
- The Exporter gets interrupted and the Collector resumes
- The Collectors finally pushes its events and go back to sleep
- The Exporter resumes and the iteration breaks because the event queue changed

This can be seen in the wild with the following backtrace:

  Traceback (most recent call last):
    File "/lib/python2.7/site-packages/ddtrace/profiling/scheduler.py", line 44, in flush
      exp.export(events, start, self._last_export)
    File "/lib/python2.7/site-packages/ddtrace/profiling/exporter/http.py", line 132, in export
      profile = super(PprofHTTPExporter, self).export(events, start_time_ns, end_time_ns)
    File "/lib/python2.7/site-packages/ddtrace/profiling/exporter/pprof.py", line 327, in export
      for event in events.get(stack.StackSampleEvent, []):
  RuntimeError: deque mutated during iteration

In order to fix that, a Lock must be used.

For the gevent case, we actually need two locks:
- one for locking out the coroutines
- one for locking out the OS thread that might be used (e.g. stack collector)

We introduce a DoubleLock class that does exactly that.
@jd jd force-pushed the profiling/fix-sync-lock-exporter branch from b4336bd to 6ced7f2 Compare July 10, 2020 10:59
if is_module_patched("threading"):

@attr.s
class DoubleLock(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was curious about the ordering! Thanks for adding the comment here and in the source below 😄. Blog post would be great!


def acquire(self):
# You cannot acquire a gevent-lock from another thread if it has been acquired already:
# make sure we exclude the gevent-lock from being acquire by another thread by using a thread-lock first.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Kyle-Verhoog Kyle-Verhoog merged commit 030325b into DataDog:master Jul 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants