
Profile spawning processes #281

Merged 40 commits into master on Jul 12, 2022

Conversation

@Jongy (Contributor) commented on Jan 30, 2022

Based on:

Description

This PR aims to solve the "short processes" issue: gProfiler runs profiling sessions of 60 seconds (the default). Upon each session, the different profilers (JavaProfiler, PySpyProfiler, etc.) scan through all currently running processes and invoke their respective profilers for each relevant process.

This works nicely for long-running processes (and still, it has some disadvantages which I'll describe later). However, it fails miserably in environments where processes spawn and exit quickly. For example, for an app with an average lifetime of 10 seconds, we'll only see profiling data from the respective profiler for less than 1/6 of the time! (Because on average, we'll catch only one out of 6 processes, and that process might already be halfway through its 10 seconds...) The situation was even worse for Java before #265, because we didn't read the output file if the process exited before the profiling session had ended - so, by design, processes shorter than 60 seconds were not profiled. (py-spy & rbspy do not suffer from this problem; PyPerf & phpspy, by design, do not suffer from it.)
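A back-of-envelope calculation of that claim (assuming uniformly distributed process start times and the default 60-second session):

session_length = 60.0  # seconds, the default profiling session
avg_lifetime = 10.0    # average lifetime of the short-lived processes

# A session-start scan only sees processes that are alive at that instant,
# i.e. roughly those that started within the last `avg_lifetime` seconds.
caught_fraction = avg_lifetime / session_length    # ~1/6 of the processes

# A caught process is, on average, already halfway through its life, so only
# about half of its runtime can still be profiled.
profiled_runtime_fraction = caught_fraction * 0.5  # ~1/12 of total runtime

print(f"processes caught: ~{caught_fraction:.0%}, runtime profiled: ~{profiled_runtime_fraction:.0%}")
# -> processes caught: ~17%, runtime profiled: ~8%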

This PR is the first step towards the "final" design of gProfiler, as I see it. I'll explain what I did here, then explain the next steps.

The SpawningProcessProfilerBase class was added (name T.B.D.). It extends ProcessProfilerBase, which is the base class for process-based profilers - profilers invoked per process (AP, py-spy, rbspy).
This class extends the existing behavior of ProcessProfilerBase using proc_events. Here's how it works (a rough sketch follows the list):

  1. SpawningProcessProfilerBase.snapshot() is called.
  2. ProcessProfilerBase.snapshot() is called. Upon selecting the currently running processes to profile, it notifies the derived SpawningProcessProfilerBase with the selected processes. At this point, SpawningProcessProfilerBase starts listening for new execs via proc_events (had it been listening earlier, we could "double catch" processes). This section is racy; fixing it is not too hard, but it's irrelevant in the final solution, and the race is not very likely anyway, so I don't intend to fix it now.
  3. Upon each new exec, we place the psutil.Process object of the new process into a queue. It is then checked for "should we profile it" multiple times over 1.5 seconds (with some basic exponential backoff logic). The reason we need to check multiple times is that our "should profile" checks are based on parsing /proc/pid/maps. Right after the exec completes, libraries are not yet loaded (that's ld.so's job). Furthermore, in the case of Java, libjvm.so is not even an (in)direct dependency of java - it's loaded later at runtime. This may take a short while, also depending on the load on the machine. In the case of Python, we sometimes select processes to profile based on site-packages entries in their maps - those definitely don't show up in maps right after the exec.
  4. Once a spawned process is deemed relevant for profiling, we submit it to _profile_process in a ThreadPoolExecutor (TPE), with the duration decreased by how much time has passed since the session started.
  5. When ProcessProfilerBase.snapshot() returns, we stop listening for new execs and terminate the TPE.
  6. Results from ProcessProfilerBase.snapshot() are merged with the results from our TPE.
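To make the flow above concrete, here's a rough, compressed sketch of the idea (illustrative only, not gProfiler's actual code: the exec_watcher start/stop interface, the libjvm.so check and the placeholder _profile_process are all assumptions):

import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from queue import Empty, Queue
from typing import Dict

class SpawningProfilerSketch:
    _POLL_WINDOW = 1.5  # how long to keep re-checking a freshly exec'd process

    def __init__(self, session_duration: float, exec_watcher) -> None:
        self._duration = session_duration
        self._watcher = exec_watcher  # assumed interface: start(callback) / stop(), backed by proc_events
        self._candidates: "Queue[int]" = Queue()
        self._tpe = ThreadPoolExecutor(max_workers=4)

    # Step 3: the proc_events listener calls this for every new exec.
    def _on_exec(self, pid: int) -> None:
        self._candidates.put(pid)

    # Step 3 (cont.): /proc/<pid>/maps fills up only after ld.so (and, for Java,
    # the JVM launcher) loads the relevant libraries, so re-check with backoff.
    def _is_relevant(self, pid: int) -> bool:
        delay, waited = 0.1, 0.0
        while waited <= self._POLL_WINDOW:
            try:
                maps = Path(f"/proc/{pid}/maps").read_text()
            except (FileNotFoundError, ProcessLookupError):
                return False  # the process already exited
            if "libjvm.so" in maps:  # stand-in for the real per-runtime checks
                return True
            time.sleep(delay)
            waited += delay
            delay *= 2  # basic exponential backoff
        return False

    # Step 4: profile the spawned process for whatever is left of the session.
    def _profile_process(self, pid: int, duration: float) -> str:
        return f"stackcollapse-for-{pid} ({duration:.1f}s)"  # placeholder

    # Steps 1-6, compressed into one loop for illustration.
    def snapshot(self) -> Dict[int, str]:
        start = time.monotonic()
        self._watcher.start(self._on_exec)  # step 2: start listening for execs
        futures = {}
        while time.monotonic() - start < self._duration:
            try:
                pid = self._candidates.get(timeout=0.1)
            except Empty:
                continue
            if self._is_relevant(pid):
                remaining = self._duration - (time.monotonic() - start)
                futures[pid] = self._tpe.submit(self._profile_process, pid, remaining)
        self._watcher.stop()  # step 5: stop listening once the session is over
        return {pid: fut.result() for pid, fut in futures.items()}  # step 6: merge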

Issues with the current impl:

  • If used for rbspy, py-spy and AP: we're cloning the scheduler thread 3 times; not sure that's a big deal.
  • SpawningProcessProfilerBase contains lots of logic that should be split.

Roadmap & end goal

The final idea of gProfiler is this:

  • all profilers run continuously (like perf, PyPerf & phpspy are now).
  • Process-based profilers (AP, py-spy, rbspy) can run indefinitely, and dump output upon signal / request / process exit. So upon each session, we just "ask" profilers to give their current state.
  • gProfiler runs a single thread using proc_events, and decides for each PID whether any of the profilers should get it (as an optimization, phpspy & PyPerf should be modified to accept PIDs this way instead of scanning themselves - scanning has a cost on the system, so if we have multiple unrelated "scanners", they run at a lower frequency than what proc_events can achieve).
    • for low-priv executions, where we can't use proc_events, gProfiler will continue scanning for processes, but at a much higher frequency (e.g. every 0.5 seconds); see the sketch after this list.
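For the low-priv fallback, a minimal sketch of what the periodic scanner could look like (function names and the 0.5-second interval are illustrative, not an existing gProfiler API):

import os
import time
from typing import Callable, Set

def _scan_new_pids(known: Set[int]) -> Set[int]:
    """Return PIDs that appeared since the previous scan, by listing /proc."""
    current = {int(entry) for entry in os.listdir("/proc") if entry.isdigit()}
    new = current - known
    known.clear()
    known.update(current)
    return new

def poll_for_new_processes(on_new_pid: Callable[[int], None],
                           stop: Callable[[], bool],
                           interval: float = 0.5) -> None:
    """Poll /proc every `interval` seconds and report PIDs that weren't there before."""
    known: Set[int] = set()
    _scan_new_pids(known)  # prime with the preexisting processes
    while not stop():
        for pid in _scan_new_pids(known):
            on_new_pid(pid)
        time.sleep(interval)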

Roadmap:

  1. Implement an alternative to proc_events for low-priv.
  2. Get rid of _select_processes_to_profile - that is, don't "select processes" upon each session start; just select processes based on proc_events and profile them until the current session ends.
  3. Add support for continuous profiling in AP, rbspy, py-spy. I think AP might already have it (just use the dump action instead of stop); not sure about the others.
  4. Run all profilers independently of the profiling sessions, and just call a "dump data from all currently running py-spys" method.

How Has This Been Tested?

T.B.A

Checklist:

  • Decide on the final design & implement it
  • Make sure it is enabled for rbspy & py-spy as well.
  • Add at least one automatic test, so that it runs in CI.
  • Document theory of operation (I plan to add that in README)
  • Document my future plans (as noted shortly in this PR)

@Jongy added the enhancement (New feature or request) label on Jan 30, 2022
@Jongy requested a review from adi-benz on January 30, 2022 at 00:16
@adi-benz (Contributor) commented:

Completely agree with the roadmap and end goals.

About the currently proposed implementation: I don't see a reason to create SpawningProcessProfilerBase - why not change ProcessProfilerBase's implementation to the new one proposed? We won't need the old implementation, and it would be easier not to separate them using inheritance.
The separation seems "unnatural"; having both just makes ProcessProfilerBase a class whose only purpose in life is to be a part of SpawningProcessProfilerBase, yet it still needs to keep the ProfilerInterface API without any good reason (no one will inherit from it directly).

The name even still fits: ProcessProfilerBase handles profilers that run per process, and the class makes them appear to the API like a system-wide profiler.

@Jongy (Contributor, Author) commented on Jan 30, 2022


The reason I had in mind: I wanted some way to disable this feature with a flag, and that could be implemented by changing the base class... kinda ugly, though 😅 I also liked that their logic is separate. But I suppose I could merge them.

@adi-benz (Contributor) left a comment.

@Jongy mentioned this pull request on Mar 10, 2022
@Jongy mentioned this pull request on Mar 29, 2022
Jongy added a commit that referenced this pull request Apr 17, 2022
The idea is to use a richer output format for profilers - not just PID -> stackcollapse, but PID -> object with numerous fields, one of them being the stackcollapse.

Currently, the 2 additional fields are the appid and the app metadata. Logically, it makes sense for the runtime profilers to retrieve them, and not the generic area in gprofiler - because these are per-runtime anyway. The code is structured more "sensibly" when runtime profilers return these values.

It also serves a technical purpose: if profilers collect & attach the appid & app metadata directly to the profile data, we can be sure that:
* App id & metadata are always collected consistently - if a certain app has "appid: X", then all of its stacks will have that appid. Previously that wasn't the case, because we could profile a process that went down before we had the chance to get its app id (it was retrieved post-profiling).
* There is no chance of a PID-reuse collision - a PID being reused by another app, causing us to attach the wrong data.

That's ^^ not true yet for pyperf, because pyperf still attaches this data post-profiling. This will change in a PR after #281, where we will move the process selection from pyperf to gprofiler, and then gprofiler will take the metadata & id of apps before pyperf begins to profile them.
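For illustration only, the richer per-PID result could be modeled roughly like this (class and field names are hypothetical, not gProfiler's actual types):

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ProcessProfileResult:
    stackcollapse: str                       # folded stacks collected for this PID
    appid: Optional[str] = None              # application identifier, resolved by the runtime profiler
    app_metadata: Dict[str, str] = field(default_factory=dict)  # e.g. runtime version info

# A profiling session then yields PID -> ProcessProfileResult instead of PID -> str,
# so the appid/metadata are attached at collection time and can't race with process exit.
SessionResult = Dict[int, ProcessProfileResult]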
@Jongy (Contributor, Author) commented on May 31, 2022

I want to be able to disable the proc_events logic. I'll see if I can do it with a flag in the class, or if I'd rather keep them separate.

@TomerYogev (Contributor) commented:

Get rid of _select_processes_to_profile - that is, don't "select processes" upon each session start, just select processes based on proc_events, and start profiling until the current session end.

Wouldn't we still need it for the processes that already exist at the beginning of the session?

@Jongy (Contributor, Author) commented on Jun 30, 2022

  • Track only execs
  • Track forks for Python processes

@TomerYogev (Contributor) commented:

Track only execs
Track forks for Python processes

elif event["what"] == self._PROC_EVENT_EXEC:
Currently we're only filtering EXECs (and EXITs) - do you want me to add support for FORKs for Python in this PR?
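For illustration, a rough sketch of what adding FORK handling could look like (the event-dict layout and the callback are assumptions modeled on the check quoted above; the constant values mirror linux/cn_proc.h):

PROC_EVENT_FORK = 0x00000001  # values as defined in linux/cn_proc.h
PROC_EVENT_EXEC = 0x00000002

def handle_proc_event(event: dict, on_new_process) -> None:
    # Current behavior: only execs are considered as new profiling candidates.
    if event["what"] == PROC_EVENT_EXEC:
        on_new_process(event["pid"])
    # Python servers often fork workers without exec'ing, so an exec-only filter
    # misses them; a FORK hook would enqueue the child for the same relevance checks.
    elif event["what"] == PROC_EVENT_FORK:
        on_new_process(event["pid"])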

gprofiler/utils/process.py (outdated review thread, resolved)
tests/test_sanity.py (outdated review thread, resolved)
@Jongy (Contributor, Author) commented on Jul 11, 2022

I went over all changes @ty93. No major comments. Let's fix these and I'm good to merge this as an experimental feature, disabled by default.

TomerYogev previously approved these changes on Jul 11, 2022
@Jongy requested a review from TomerYogev on July 11, 2022 at 11:52
@Jongy requested a review from TomerYogev on July 11, 2022 at 12:54
@glipner (Contributor) left a comment.

# delete it before blocking on the exit of all threads (to ensure no new work
# is added)
self._threads = None
threads.shutdown() # waits (although - all are done by now)
Contributor:

No need to wait here, as we're waiting for completion just after calling _stop_profiling_spawning

Suggested change
threads.shutdown(wait=False)

Contributor (Author):

Why does it matter, though? For safety, let's add it.

Contributor:

Doesn't really matter.
We're wasting time waiting rather than reading profiling output :-P

Contributor:

Not really wasting time though, since they're already finished
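For reference, the standard-library behavior being discussed (plain concurrent.futures, not gProfiler code): shutdown() blocks until already-submitted work finishes, while shutdown(wait=False) returns immediately but still rejects new submissions.

import time
from concurrent.futures import ThreadPoolExecutor

tpe = ThreadPoolExecutor(max_workers=1)
future = tpe.submit(time.sleep, 2)  # some in-flight work

tpe.shutdown(wait=False)  # returns immediately; the running task keeps going
print(future.done())      # likely False right after shutdown
future.result()           # completion can still be awaited explicitly later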

self._preexisting_pids = None

def _proc_exec_callback(self, tid: int, pid: int) -> None:
    with contextlib.suppress(NoSuchProcess):
Contributor:

Should check here if we're not stopping, otherwise we might still block forever in self._sched.run()

Suggested change
    if not self._is_profiling_spawning:
        return
    with contextlib.suppress(NoSuchProcess):

self._sched.run()
self._stop_event.wait(0.1)

def _check_process(self, process: Process, interval: float) -> None:
Contributor:

Clear self._sched if we're stopping, so that self._sched.run() exits faster.

Suggested change
def _clear_sched(self):
    for event in self._sched.queue:
        self._sched.cancel(event)

def _check_process(self, process: Process, interval: float) -> None:
    with contextlib.suppress(NoSuchProcess):
        if not self._is_profiling_spawning:
            self._clear_sched()
            return
        if not is_process_running(process) or process.ppid() == os.getpid():

Contributor (Author):

The _clear_sched call needn't be inside the suppress.

Contributor:

I agree

Contributor:

Suggested change
def _clear_sched(self):
    for event in self._sched.queue:
        self._sched.cancel(event)

def _check_process(self, process: Process, interval: float) -> None:
    if not self._is_profiling_spawning:
        self._clear_sched()
        return
    with contextlib.suppress(NoSuchProcess):
        if not is_process_running(process) or process.ppid() == os.getpid():

Contributor:

Premature optimization if I've ever seen one

@Jongy (Contributor, Author) commented on Jul 12, 2022

Something is messed up with the tests :/

@Jongy merged commit c83c446 into master on Jul 12, 2022
@Jongy deleted the short-processes branch on July 12, 2022 at 14:10