Profile spawning processes #281
Conversation
Completely agree with the roadmap and end goals. About the currently proposed implementation, I don't see a reason to create a separate class; the name even still fits.
The reason I had in mind is - I wanted some way to disable this feature with a flag; that could be implemented by changing the base class... kinda ugly tho 😅 I also liked that their logic is separate. But I suppose I could merge them.
The idea is to use a richer output format for profilers - not just PID -> stackcollapse, but PID -> object with numerous fields, one of them being stackcollapse. Currently, the 2 additional fields are the appid and the app metadata. Logically, it makes sense that the runtime profilers retrieve them, and not the generic area in gProfiler - these are per-runtime anyway, and the code is structured more "sensibly" when runtime profilers return these values. It also serves a technical purpose: if profilers collect & attach the appid & app metadata directly to the profile data, we can be sure that:

* App id & metadata are always collected consistently - if a certain app has "appid: X", then all stacks of it will have that appid. Previously this wasn't the case, because we could profile a process that'd go down before we had the chance to get its app id (it was retrieved post profiling).
* There's no chance of a PID-reuse collision - the PID being reused by another app, causing us to attach the wrong data.

That's not true yet for PyPerf, because PyPerf still attaches this data post profiling. This will change in a PR after #281 - where we will move the process selection from PyPerf to gProfiler, and then gProfiler will take the metadata & id of apps before PyPerf begins to profile them.
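The "PID -> object with numerous fields" shape could look roughly like this (a minimal sketch; `ProfileData` and its field names are illustrative, not gProfiler's actual types):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ProfileData:
    """Hypothetical per-PID profile result: stacks plus app identity."""

    stacks: str                    # collapsed stacks ("stackcollapse")
    appid: Optional[str] = None    # app id, collected by the runtime profiler
    app_metadata: Dict[str, str] = field(default_factory=dict)


# profilers would return {pid: ProfileData(...)} instead of {pid: stacks},
# so the identity data travels with the stacks it was collected alongside
snapshot: Dict[int, ProfileData] = {
    1234: ProfileData(
        stacks="main;foo 17",
        appid="myapp",
        app_metadata={"python_version": "3.10"},
    ),
}
```

Because the appid rides inside the same object as the stacks, a process dying mid-session or a PID being reused can't mismatch the two.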
I wanna be able to disable the `proc_events` logic. I'll see if I can do it with a flag in the class, or if I prefer to keep them separate.
Wouldn't we still need it for the processes that already exist at the beginning of the session?
I went over all the changes @ty93. No major comments. Let's fix these and I'm good to merge this as an experimental feature, disabled by default.
```python
# delete it before blocking on the exit of all threads (to ensure no new work
# is added)
self._threads = None
threads.shutdown()  # waits (although - all are done by now)
```
No need to wait here, as we're waiting for completion just after calling `_stop_profiling_spawning`.
```diff
-threads.shutdown()  # waits (although - all are done by now)
+threads.shutdown(wait=False)
```
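For context, `ThreadPoolExecutor.shutdown(wait=False)` only skips joining the worker threads; work that has already completed is unaffected. A small stdlib-only illustration (the squaring tasks are just a placeholder workload):

```python
from concurrent.futures import ThreadPoolExecutor

tpe = ThreadPoolExecutor(max_workers=2)
futures = [tpe.submit(lambda x: x * x, n) for n in (2, 3)]

# .result() blocks until each task completes, so by this point all work
# submitted to the pool is done
results = sorted(f.result() for f in futures)

# the futures are already resolved - wait=False merely skips joining the
# now-idle worker threads, exactly the situation in the review thread
tpe.shutdown(wait=False)
```

Which is why the reviewers conclude it doesn't really matter here: with every future resolved, `wait=True` and `wait=False` do the same amount of useful work.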
Why does it matter, tho? For safety, let's add it.
Doesn't really matter.
We're wasting time waiting rather than reading profiling output :-P
Not really wasting time though, since they're already finished
```python
self._preexisting_pids = None

def _proc_exec_callback(self, tid: int, pid: int) -> None:
    with contextlib.suppress(NoSuchProcess):
```
Should check here if we're not stopping, otherwise we might still block forever in `self._sched.run()`.
```diff
-    with contextlib.suppress(NoSuchProcess):
+    if not self._is_profiling_spawning:
+        return
+    with contextlib.suppress(NoSuchProcess):
```
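The suggested guard can be exercised in isolation. A minimal stdlib-only sketch (`SpawnWatcher` and everything in it are hypothetical stand-ins for the profiler's callback logic, not gProfiler code):

```python
import sched
import time


class SpawnWatcher:
    """Illustrative stand-in for the exec-callback logic under review."""

    def __init__(self) -> None:
        self._sched = sched.scheduler(time.monotonic, time.sleep)
        self._is_profiling_spawning = True

    def _proc_exec_callback(self, tid: int, pid: int) -> None:
        # bail out first thing while stopping - otherwise every late exec
        # event queues another check and self._sched.run() never drains
        if not self._is_profiling_spawning:
            return
        # schedule the first "should we profile it?" check for this pid
        self._sched.enter(0.1, 1, lambda: None)


watcher = SpawnWatcher()
watcher._proc_exec_callback(tid=1, pid=1000)  # running: check gets queued
watcher._is_profiling_spawning = False
watcher._proc_exec_callback(tid=2, pid=1001)  # stopping: dropped early
```

After the flag flips, no new events reach the scheduler, so `run()` can finish draining what's already queued and return.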
```python
self._sched.run()
self._stop_event.wait(0.1)

def _check_process(self, process: Process, interval: float) -> None:
```
Clear `self._sched` if we're stopping, so that `self._sched.run()` exits faster.
```diff
+def _clear_sched(self) -> None:
+    for event in self._sched.queue:
+        self._sched.cancel(event)
+
 def _check_process(self, process: Process, interval: float) -> None:
     with contextlib.suppress(NoSuchProcess):
+        if not self._is_profiling_spawning:
+            self._clear_sched()
+            return
         if not is_process_running(process) or process.ppid() == os.getpid():
```
The `clear_sched` call needn't be in `suppress`.
I agree
```diff
+def _clear_sched(self) -> None:
+    for event in self._sched.queue:
+        self._sched.cancel(event)
+
 def _check_process(self, process: Process, interval: float) -> None:
+    if not self._is_profiling_spawning:
+        self._clear_sched()
+        return
     with contextlib.suppress(NoSuchProcess):
         if not is_process_running(process) or process.ppid() == os.getpid():
```
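For what it's worth, the stdlib `sched` module makes this clearing pattern safe: `scheduler.queue` returns a snapshot of the pending events, so cancelling while iterating over it is fine, and `run()` on an emptied scheduler returns immediately. A minimal stdlib-only sketch (the delays and no-op actions are illustrative):

```python
import sched
import time

s = sched.scheduler(time.monotonic, time.sleep)
for delay in (5.0, 10.0):
    s.enter(delay, 1, lambda: None)  # pending no-op checks, far in the future

# scheduler.queue is a snapshot list, so it is safe to cancel while
# iterating over it - this is the _clear_sched idea from the suggestion
for event in s.queue:
    s.cancel(event)

start = time.monotonic()
s.run()  # nothing left to sleep for, so this returns right away
elapsed = time.monotonic() - start
```

Without the cancellation, `run()` would sleep until the last event's deadline (10 seconds here) before returning.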
Premature optimization if I've ever seen one
Something is messed up with the tests :/
Description
This PR aims to solve the "short processes" issue: gProfiler runs profiling sessions of 60 seconds (that's the default). Upon each session, the different profilers (`JavaProfiler`, `PySpyProfiler` etc.) scan through all currently running processes and invoke their respective profilers for each relevant process.

This works nicely for processes running for long durations (and still, it has some disadvantages which I'll describe later). However, it fails miserably in environments where processes are spawning and exiting quickly. For example, for an app with an average lifetime of 10 seconds, we'll only see profiling data from the respective profiler for less than 1/6 of the time! (Because on average, we'll catch only one out of 6 processes, and that process might already be halfway through its 10 seconds...) The situation was even worse for Java before #265, because we didn't read the output file if the process exited before the profiling session ended. So by design, processes shorter than 60 seconds were not profiled... (py-spy & rbspy do not suffer from this problem; PyPerf & phpspy, by design, do not suffer from it.)
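To spell out the arithmetic behind the "1/6" claim above (an illustrative back-of-the-envelope model, not a measurement):

```python
# assumptions from the description: 60s sessions, 10s-lived processes,
# and each session only scans the processes alive at its start
session_length = 60.0  # seconds, gProfiler's default session
app_lifetime = 10.0    # seconds, the hypothetical short-lived app

# a scan catches only the process that happens to be alive at scan time:
# on average, 1 out of every (session_length / app_lifetime) spawns
catch_ratio = app_lifetime / session_length  # 1/6 of the processes

# and a caught process is, on average, already halfway through its life,
# so the profiled fraction of total app runtime is even smaller
expected_profiled_fraction = catch_ratio * 0.5  # ~1/12
```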
This PR is the first step towards the "final" design of gProfiler, as I see it. I'll explain what I did here, then explain the next steps.
The `SpawningProcessProfilerBase` class was added (name T.B.D). It extends `ProcessProfilerBase`, which is the base class for process-based profilers - profilers invoked per process (AP, py-spy, rbspy). This class extends the existing behavior of `ProcessProfilerBase` using `proc_events`. Here's how it works:

1. `SpawningProcessProfilerBase.snapshot()` is called.
2. `ProcessProfilerBase.snapshot()` is called. Upon selecting the currently running processes to profile, it notifies its base classes with the selected processes. At this point, `SpawningProcessProfilerBase` starts listening for new execs via `proc_events` (had it been listening earlier, we could "double catch" processes). This section is racy; fixing it is not too hard, but it's irrelevant in the final solution, and the race is not very likely anyway, so I don't intend to do it now.
3. Upon a new exec, the `psutil.Process` object of the new process is put into a queue. It will now be checked for "should we profile it" multiple times over 1.5 seconds (with some basic exponential backoff logic). The reason we need to check multiple times is that our "should profile" checks are based on parsing `/proc/pid/maps`. Once the exec completes, libraries are not yet loaded (that's `ld.so`'s job). Furthermore, in the case of Java, `libjvm.so` is not even an (in)direct dependency of `java` - it's loaded later by whatever. This may take a short while, also depending on the load on the machine... In the case of Python, we sometimes select processes to profile based on `site-packages` in their maps - that definitely doesn't show up in maps right after the exec.
4. Processes that pass the check are profiled via `_profile_process` in a TPE, with the duration decreased by how much time has passed since this session started.
5. When `ProcessProfilerBase.snapshot()` returns, we also stop listening for new execs, and terminate the TPE.
6. The results of `ProcessProfilerBase.snapshot()` are merged with the results from our TPE.

Issues with the current impl:

* `SpawningProcessProfilerBase` contains lots of logic that should be split.

Roadmap & end goal
The final idea of gProfiler is this:

* gProfiler listens for new PIDs via `proc_events`, and decides for each PID if any of the profilers should get it (as an optimization - phpspy & PyPerf should be modified to accept PIDs this way, instead of scanning themselves - scanning has a cost on the system, and thus if we have multiple unrelated "scanners", they run at lower frequency compared to the effectiveness of `proc_events`).
* Alongside `proc_events`, gProfiler will continue scanning for processes, but at a much higher frequency (e.g. each 0.5 seconds).

Roadmap:
* `proc_events` for low-priv.
* `_select_processes_to_profile` - that is, don't "select processes" upon each session start; just select processes based on `proc_events`, and profile until the current session's end.
* `dump` action instead of `stop` (not sure about the others).

How Has This Been Tested?
T.B.A
Checklist: