
Profile spawning processes #281

Merged 40 commits into master on Jul 12, 2022

Conversation

@Jongy (Contributor) commented on Jan 30, 2022

Based on:

Description

This PR aims to solve the "short processes" issue: gProfiler runs profiling sessions of 60 seconds (the default). Upon each session, the different profilers (JavaProfiler, PySpyProfiler, etc.) scan through all currently running processes and invoke their respective profilers for each relevant process.

This works nicely for long-running processes (and still, it has some disadvantages which I'll describe later). However, it fails miserably in environments where processes spawn and exit quickly. For example, for an app with an average lifetime of 10 seconds, we'll only see profiling data from the respective profiler for less than 1/6 of the time! (Because on average, we'll catch only one out of 6 processes, and that process might already be halfway through its 10 seconds...) The situation was even worse for Java before #265, because we didn't read the output file if the process exited before the profiling session had ended - so, by design, processes shorter than 60 seconds were not profiled. (py-spy & rbspy do not suffer from this problem; PyPerf & phpspy, by design, do not suffer from it.)
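A back-of-envelope calculation of that claim (assuming uniformly distributed process start times and the default 60-second session):

session_length = 60.0  # seconds, the default profiling session
avg_lifetime = 10.0    # average lifetime of the short-lived processes

# A session-start scan only sees processes that are alive at that instant,
# i.e. roughly those that started within the last `avg_lifetime` seconds.
caught_fraction = avg_lifetime / session_length    # ~1/6 of the processes

# A caught process is, on average, already halfway through its life, so only
# about half of its runtime can still be profiled.
profiled_runtime_fraction = caught_fraction * 0.5  # ~1/12 of total runtime

print(f"processes caught: ~{caught_fraction:.0%}, runtime profiled: ~{profiled_runtime_fraction:.0%}")
# -> processes caught: ~17%, runtime profiled: ~8%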

This PR is the first step towards the "final" design of gProfiler, as I see it. I'll explain what I did here, then explain the next steps.

The SpawningProcessProfilerBase class was added (name T.B.D.). It extends ProcessProfilerBase, which is the base class for process-based profilers - profilers invoked per process (AP, py-spy, rbspy).
This class extends the existing behavior of ProcessProfilerBase using proc_events. Here's how it works (a rough sketch follows the list):

  1. SpawningProcessProfilerBase.snapshot() is called.
  2. ProcessProfilerBase.snapshot() is called. Upon selecting the currently running processes to profile, it notifies the derived SpawningProcessProfilerBase with the selected processes. At this point, SpawningProcessProfilerBase starts listening for new execs via proc_events (had it been listening earlier, we could "double catch" processes). This section is racy; fixing it is not too hard, but it's irrelevant in the final solution, and the race is not very likely anyway, so I don't intend to fix it now.
  3. Upon each new exec, we place the psutil.Process object of the new process into a queue. It is then checked for "should we profile it" multiple times over 1.5 seconds (with some basic exponential backoff logic). The reason we need to check multiple times is that our "should profile" checks are based on parsing /proc/pid/maps. Right after the exec completes, libraries are not yet loaded (that's ld.so's job). Furthermore, in the case of Java, libjvm.so is not even an (in)direct dependency of java - it's loaded later at runtime. This may take a short while, also depending on the load on the machine. In the case of Python, we sometimes select processes to profile based on site-packages entries in their maps - those definitely don't show up in maps right after the exec.
  4. Once a spawned process is deemed relevant for profiling, we submit it to _profile_process in a ThreadPoolExecutor (TPE), with the duration decreased by how much time has passed since the session started.
  5. When ProcessProfilerBase.snapshot() returns, we stop listening for new execs and terminate the TPE.
  6. Results from ProcessProfilerBase.snapshot() are merged with the results from our TPE.
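To make the flow above concrete, here's a rough, compressed sketch of the idea (illustrative only, not gProfiler's actual code: the exec_watcher start/stop interface, the libjvm.so check and the placeholder _profile_process are all assumptions):

import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from queue import Empty, Queue
from typing import Dict

class SpawningProfilerSketch:
    _POLL_WINDOW = 1.5  # how long to keep re-checking a freshly exec'd process

    def __init__(self, session_duration: float, exec_watcher) -> None:
        self._duration = session_duration
        self._watcher = exec_watcher  # assumed interface: start(callback) / stop(), backed by proc_events
        self._candidates: "Queue[int]" = Queue()
        self._tpe = ThreadPoolExecutor(max_workers=4)

    # Step 3: the proc_events listener calls this for every new exec.
    def _on_exec(self, pid: int) -> None:
        self._candidates.put(pid)

    # Step 3 (cont.): /proc/<pid>/maps fills up only after ld.so (and, for Java,
    # the JVM launcher) loads the relevant libraries, so re-check with backoff.
    def _is_relevant(self, pid: int) -> bool:
        delay, waited = 0.1, 0.0
        while waited <= self._POLL_WINDOW:
            try:
                maps = Path(f"/proc/{pid}/maps").read_text()
            except (FileNotFoundError, ProcessLookupError):
                return False  # the process already exited
            if "libjvm.so" in maps:  # stand-in for the real per-runtime checks
                return True
            time.sleep(delay)
            waited += delay
            delay *= 2  # basic exponential backoff
        return False

    # Step 4: profile the spawned process for whatever is left of the session.
    def _profile_process(self, pid: int, duration: float) -> str:
        return f"stackcollapse-for-{pid} ({duration:.1f}s)"  # placeholder

    # Steps 1-6, compressed into one loop for illustration.
    def snapshot(self) -> Dict[int, str]:
        start = time.monotonic()
        self._watcher.start(self._on_exec)  # step 2: start listening for execs
        futures = {}
        while time.monotonic() - start < self._duration:
            try:
                pid = self._candidates.get(timeout=0.1)
            except Empty:
                continue
            if self._is_relevant(pid):
                remaining = self._duration - (time.monotonic() - start)
                futures[pid] = self._tpe.submit(self._profile_process, pid, remaining)
        self._watcher.stop()  # step 5: stop listening once the session is over
        return {pid: fut.result() for pid, fut in futures.items()}  # step 6: merge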

Issues with the current impl:

  • If used for rbspy, py-spy and AP: we're cloning the scheduler thread 3 times; not sure that's a big deal.
  • SpawningProcessProfilerBase contains lots of logic that should be split.

Roadmap & end goal

The final idea of gProfiler is this:

  • all profilers run continuously (like perf, PyPerf & phpspy are now).
  • Process-based profilers (AP, py-spy, rbspy) can run indefinitely, and dump output upon signal / request / process exit. So upon each session, we just "ask" profilers to give their current state.
  • gProfiler runs a single thread using proc_events, and decides for each PID whether any of the profilers should get it (as an optimization, phpspy & PyPerf should be modified to accept PIDs this way instead of scanning themselves - scanning has a cost on the system, so if we have multiple unrelated "scanners", they run at a lower frequency than what proc_events can achieve).
    • for low-priv executions, where we can't use proc_events, gProfiler will continue scanning for processes, but at a much higher frequency (e.g. every 0.5 seconds); see the sketch after this list.
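For the low-priv fallback, a minimal sketch of what the periodic scanner could look like (function names and the 0.5-second interval are illustrative, not an existing gProfiler API):

import os
import time
from typing import Callable, Set

def _scan_new_pids(known: Set[int]) -> Set[int]:
    """Return PIDs that appeared since the previous scan, by listing /proc."""
    current = {int(entry) for entry in os.listdir("/proc") if entry.isdigit()}
    new = current - known
    known.clear()
    known.update(current)
    return new

def poll_for_new_processes(on_new_pid: Callable[[int], None],
                           stop: Callable[[], bool],
                           interval: float = 0.5) -> None:
    """Poll /proc every `interval` seconds and report PIDs that weren't there before."""
    known: Set[int] = set()
    _scan_new_pids(known)  # prime with the preexisting processes
    while not stop():
        for pid in _scan_new_pids(known):
            on_new_pid(pid)
        time.sleep(interval)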

Roadmap:

  1. Implement an alternative to proc_events for low-priv.
  2. Get rid of _select_processes_to_profile - that is, don't "select processes" upon each session start; just select processes based on proc_events and profile them until the current session ends.
  3. Add support for continuous profiling in AP, rbspy, py-spy. I think AP might already have it (just use the dump action instead of stop); not sure about the others.
  4. Run all profilers independently of the profiling sessions, and just call a "dump data from all currently running py-spys" method.

How Has This Been Tested?

T.B.A

Checklist:

  • Decide on the final design & implement it
  • Make sure it is enabled for rbspy & py-spy as well.
  • Add at least one automatic test, so that it runs in CI.
  • Document theory of operation (I plan to add that in README)
  • Document my future plans (as noted shortly in this PR)

@Jongy added the enhancement (New feature or request) label on Jan 30, 2022
@Jongy requested a review from adi-benz on January 30, 2022 at 00:16
@adi-benz (Contributor) commented:

Completely agree with the roadmap and end goals.

About the currently proposed implementation: I don't see a reason to create SpawningProcessProfilerBase - why not change ProcessProfilerBase's implementation to the new one proposed? We won't need the old implementation, and it would be easier not to separate them using inheritance.
The separation seems "unnatural"; having both just makes ProcessProfilerBase a class whose only purpose in life is to be a part of SpawningProcessProfilerBase, yet it still needs to keep the ProfilerInterface API without any good reason (no one will inherit from it directly).

The name even still fits: ProcessProfilerBase handles profilers that run per process, and the class makes them appear to the API like a system-wide profiler.

@Jongy (Contributor, Author) commented on Jan 30, 2022


The reason I had in mind: I wanted some way to disable this feature with a flag, and that could be implemented by changing the base class... kinda ugly, though 😅 I also liked that their logic is separate. But I suppose I could merge them.

@adi-benz (Contributor) left a comment.

@Jongy mentioned this pull request on Mar 10, 2022
@Jongy mentioned this pull request on Mar 29, 2022
Jongy added a commit that referenced this pull request Apr 17, 2022
The idea is to use a richer output format for profilers - not just PID -> stackcollapse, but PID -> object with numerous fields, one of them being the stackcollapse.

Currently, the 2 additional fields are the appid and the app metadata. Logically, it makes sense for the runtime profilers to retrieve them, and not the generic area in gprofiler - because these are per-runtime anyway. The code is structured more "sensibly" when runtime profilers return these values.

It also serves a technical purpose: if profilers collect & attach the appid & app metadata directly to the profile data, we can be sure that:
* App id & metadata are always collected consistently - if a certain app has "appid: X", then all of its stacks will have that appid. Previously that wasn't the case, because we could profile a process that went down before we had the chance to get its app id (it was retrieved post-profiling).
* There is no chance of a PID-reuse collision - a PID being reused by another app, causing us to attach the wrong data.

That's ^^ not true yet for pyperf, because pyperf still attaches this data post-profiling. This will change in a PR after #281, where we will move the process selection from pyperf to gprofiler, and then gprofiler will take the metadata & id of apps before pyperf begins to profile them.
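For illustration only, the richer per-PID result could be modeled roughly like this (class and field names are hypothetical, not gProfiler's actual types):

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ProcessProfileResult:
    stackcollapse: str                       # folded stacks collected for this PID
    appid: Optional[str] = None              # application identifier, resolved by the runtime profiler
    app_metadata: Dict[str, str] = field(default_factory=dict)  # e.g. runtime version info

# A profiling session then yields PID -> ProcessProfileResult instead of PID -> str,
# so the appid/metadata are attached at collection time and can't race with process exit.
SessionResult = Dict[int, ProcessProfileResult]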
@Jongy (Contributor, Author) commented on May 31, 2022

I want to be able to disable the proc_events logic. I'll see if I can do it with a flag in the class, or if I'd rather keep them separate.

@TomerYogev (Contributor) commented:

Get rid of _select_processes_to_profile - that is, don't "select processes" upon each session start, just select processes based on proc_events, and start profiling until the current session end.

Wouldn't we still need it for the processes that already exist at the beginning of the session?

@Jongy (Contributor, Author) commented on Jun 30, 2022

  • Track only execs
  • Track forks for Python processes

@TomerYogev (Contributor) commented:

Track only execs
Track forks for Python processes

elif event["what"] == self._PROC_EVENT_EXEC:
Currently we're only filtering EXECs (and EXITs) - do you want me to add support for FORKs for Python in this PR?
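For illustration, a rough sketch of what adding FORK handling could look like (the event-dict layout and the callback are assumptions modeled on the check quoted above; the constant values mirror linux/cn_proc.h):

PROC_EVENT_FORK = 0x00000001  # values as defined in linux/cn_proc.h
PROC_EVENT_EXEC = 0x00000002

def handle_proc_event(event: dict, on_new_process) -> None:
    # Current behavior: only execs are considered as new profiling candidates.
    if event["what"] == PROC_EVENT_EXEC:
        on_new_process(event["pid"])
    # Python servers often fork workers without exec'ing, so an exec-only filter
    # misses them; a FORK hook would enqueue the child for the same relevance checks.
    elif event["what"] == PROC_EVENT_FORK:
        on_new_process(event["pid"])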

gprofiler/utils/process.py (outdated review thread, resolved)
tests/test_sanity.py (outdated review thread, resolved)
@Jongy (Contributor, Author) commented on Jul 11, 2022

I went over all changes @ty93. No major comments. Let's fix these and I'm good to merge this as an experimental feature, disabled by default.

TomerYogev previously approved these changes on Jul 11, 2022
@Jongy requested a review from TomerYogev on July 11, 2022 at 11:52
@Jongy requested a review from TomerYogev on July 11, 2022 at 12:54
@glipner (Contributor) left a comment.

# delete it before blocking on the exit of all threads (to ensure no new work
# is added)
self._threads = None
threads.shutdown() # waits (although - all are done by now)
Contributor:

No need to wait here, as we're waiting for completion just after calling _stop_profiling_spawning

Suggested change
threads.shutdown(wait=False)

Contributor (Author):

Why does it matter, though? For safety, let's add it.

Contributor:

Doesn't really matter.
We're wasting time waiting rather than reading profiling output :-P

Contributor:

Not really wasting time though, since they're already finished
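For reference, the standard-library behavior being discussed (plain concurrent.futures, not gProfiler code): shutdown() blocks until already-submitted work finishes, while shutdown(wait=False) returns immediately but still rejects new submissions.

import time
from concurrent.futures import ThreadPoolExecutor

tpe = ThreadPoolExecutor(max_workers=1)
future = tpe.submit(time.sleep, 2)  # some in-flight work

tpe.shutdown(wait=False)  # returns immediately; the running task keeps going
print(future.done())      # likely False right after shutdown
future.result()           # completion can still be awaited explicitly later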

self._preexisting_pids = None

def _proc_exec_callback(self, tid: int, pid: int) -> None:
    with contextlib.suppress(NoSuchProcess):
Contributor:

Should check here if we're not stopping, otherwise we might still block forever in self._sched.run()

Suggested change
    if not self._is_profiling_spawning:
        return
    with contextlib.suppress(NoSuchProcess):

self._sched.run()
self._stop_event.wait(0.1)

def _check_process(self, process: Process, interval: float) -> None:
Contributor:

Clear self._sched if we're stopping, so that self._sched.run() exits faster.

Suggested change
def _clear_sched(self):
    for event in self._sched.queue:
        self._sched.cancel(event)

def _check_process(self, process: Process, interval: float) -> None:
    with contextlib.suppress(NoSuchProcess):
        if not self._is_profiling_spawning:
            self._clear_sched()
            return
        if not is_process_running(process) or process.ppid() == os.getpid():

Contributor (Author):

The _clear_sched call needn't be inside the suppress.

Contributor:

I agree

Contributor:

Suggested change
def _clear_sched(self):
    for event in self._sched.queue:
        self._sched.cancel(event)

def _check_process(self, process: Process, interval: float) -> None:
    if not self._is_profiling_spawning:
        self._clear_sched()
        return
    with contextlib.suppress(NoSuchProcess):
        if not is_process_running(process) or process.ppid() == os.getpid():

Contributor:

Premature optimization if I've ever seen one

@Jongy (Contributor, Author) commented on Jul 12, 2022

Something is messed up with the tests :/

@Jongy merged commit c83c446 into master on Jul 12, 2022
@Jongy deleted the short-processes branch on July 12, 2022 at 14:10