New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmarker does not properly terminate child processes #927

Closed
hamiltont opened this Issue Jul 15, 2014 · 10 comments

Comments

Projects
None yet
4 participants
@hamiltont
Contributor

hamiltont commented Jul 15, 2014

If i understand correctly, Benchmarker calls __run_test in a thread specifically to avoid long-running tests. It a test takes too long, Benchmarker can terminate that thread and result operation.

Any thread termination should also ensure any child processes created are stopped. This seems to be tricky to do correctly, and I've seen a number of times where a test is not properly terminated and the underlying servers are still active and bound to ports. I suppose an ideal solution would be to politely call test.stop(), but also to create a process group and SIGTERM everything in the group, so that any child processes are definitely stopped

@msmith-techempower

This comment has been minimized.

Show comment
Hide comment
@msmith-techempower

msmith-techempower Jul 15, 2014

Member

Agreed... in fact we have had this trouble with a few frameworks and done some EXTREMELY silly things in stop to try and ensure that their processes (and child processes) die.

I am up for suggestions on this one.

Member

msmith-techempower commented Jul 15, 2014

Agreed... in fact we have had this trouble with a few frameworks and done some EXTREMELY silly things in stop to try and ensure that their processes (and child processes) die.

I am up for suggestions on this one.

@hamiltont

This comment has been minimized.

Show comment
Hide comment
@hamiltont

hamiltont Jul 16, 2014

Contributor

@methane Was thinking I could attempt to include this solution into #913. However, AFAIK it won't work on windows, so it's only a partial fix if it works

Also, it's a PITA to fix this in such a way that it would immediately work for all the setup.py files. We would likely need to change __run_tests to create a subprocess instead of a thread, and then we could do something like this use a group to automatically clean up any process the setup.py files created. The other method is just to add this to #913 and modify the setup.py files one by one to use whatever the new method is that can automatically clean up the processes

Contributor

hamiltont commented Jul 16, 2014

@methane Was thinking I could attempt to include this solution into #913. However, AFAIK it won't work on windows, so it's only a partial fix if it works

Also, it's a PITA to fix this in such a way that it would immediately work for all the setup.py files. We would likely need to change __run_tests to create a subprocess instead of a thread, and then we could do something like this use a group to automatically clean up any process the setup.py files created. The other method is just to add this to #913 and modify the setup.py files one by one to use whatever the new method is that can automatically clean up the processes

@methane

This comment has been minimized.

Show comment
Hide comment
@methane

methane Jul 16, 2014

Contributor

We would likely need to change __run_tests to create a subprocess instead of a thread,

FYI, You already use process instead of thread.

https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/toolset/benchmark/benchmarker.py#L492

Contributor

methane commented Jul 16, 2014

We would likely need to change __run_tests to create a subprocess instead of a thread,

FYI, You already use process instead of thread.

https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/toolset/benchmark/benchmarker.py#L492

@hamiltont

This comment has been minimized.

Show comment
Hide comment
@hamiltont

hamiltont Jul 16, 2014

Contributor

Good point, for some reason I though multiprocessing was thread-based. There may be hope for a generic fix

Contributor

hamiltont commented Jul 16, 2014

Good point, for some reason I though multiprocessing was thread-based. There may be hope for a generic fix

@methane

This comment has been minimized.

Show comment
Hide comment
@methane

methane Jul 16, 2014

Contributor

On Linux, setsid() in top of __run_tests may reduce problem.
On Windows, Using TASKKILL /T may resolve problem.

Contributor

methane commented Jul 16, 2014

On Linux, setsid() in top of __run_tests may reduce problem.
On Windows, Using TASKKILL /T may resolve problem.

@methane

This comment has been minimized.

Show comment
Hide comment
@methane

methane Jul 16, 2014

Contributor

But I don't have confidence because I'm not an expert of process management.

Contributor

methane commented Jul 16, 2014

But I don't have confidence because I'm not an expert of process management.

hamiltont added a commit to hamiltont/FrameworkBenchmarks that referenced this issue Jul 19, 2014

Fixes #927 (terminate child processes) for linux systems
Fix this by:
- Using linux process groups and session identifier to track created processes
- Automatically send SIGTERM to processes when needed
- Use SIGTERM handler in subprocess to call setup_module.stop()
- Use boolean flag to ensure stop() is only called once
- Removes checks for KeyboardInterrupt from subprocess (this has known bugs with multiprocessing)
- Removes check for system exit from subprocess, not required
@hamiltont

This comment has been minimized.

Show comment
Hide comment
@hamiltont

hamiltont Jul 19, 2014

Contributor

I've fixed this (as well as possible) for linux with the above commit. It's not based on your master, i'll have to do that later.

Naturally, this code is just building a guarantee of termination above and beyond what is provided by a good stop function in each setup.py, so a number of the changes are just ensuring that stop is called correctly regardless of how the exit happens, and ensuring that stop is only called one time.

The guarantee provided beyond stop is pretty good - with one exception, any processes created as a side-effect of __run_test are guaranteed to be destroyed before __run_test is called again. The exception is processes that intentionally daemonize themselves by changing their session identifier. This is fairly odd for TFB, it normally means something was installed into the local system, via a mechanism such as apt-get, and it is being started from a system directory (/usr, /etc/init.d, etc), with sudo permissions. There is just no non-hackish way I can find to keep track of those processes and clean them up, so for this scenario we have to rely on people writing a good stop() function. My main test case here has been aspnet-mono which internally runs nginx (an exception to the above guarantee) and mono (not an exception) with like 20 open ports and 50ish processes - this commit appears to be stopping all the mono and nginx processes every single time.

I don't have a windows box, can't help there. There was a comment saying "These features don't work on windows" that was confusing - multiprocessing works just fine on windows IIRC.

Contributor

hamiltont commented Jul 19, 2014

I've fixed this (as well as possible) for linux with the above commit. It's not based on your master, i'll have to do that later.

Naturally, this code is just building a guarantee of termination above and beyond what is provided by a good stop function in each setup.py, so a number of the changes are just ensuring that stop is called correctly regardless of how the exit happens, and ensuring that stop is only called one time.

The guarantee provided beyond stop is pretty good - with one exception, any processes created as a side-effect of __run_test are guaranteed to be destroyed before __run_test is called again. The exception is processes that intentionally daemonize themselves by changing their session identifier. This is fairly odd for TFB, it normally means something was installed into the local system, via a mechanism such as apt-get, and it is being started from a system directory (/usr, /etc/init.d, etc), with sudo permissions. There is just no non-hackish way I can find to keep track of those processes and clean them up, so for this scenario we have to rely on people writing a good stop() function. My main test case here has been aspnet-mono which internally runs nginx (an exception to the above guarantee) and mono (not an exception) with like 20 open ports and 50ish processes - this commit appears to be stopping all the mono and nginx processes every single time.

I don't have a windows box, can't help there. There was a comment saying "These features don't work on windows" that was confusing - multiprocessing works just fine on windows IIRC.

hamiltont added a commit to hamiltont/FrameworkBenchmarks that referenced this issue Jul 19, 2014

Fixes #927 (terminate child processes) for linux systems
Fix this by:
- Using linux process groups and session identifier to track created processes
- Automatically send SIGTERM to processes when needed
- Use SIGTERM handler in subprocess to call setup_module.stop()
- Use boolean flag to ensure stop() is only called once
- Removes checks for KeyboardInterrupt from subprocess (this has known bugs with multiprocessing)
- Removes check for system exit from subprocess, not required
@msmith-techempower

This comment has been minimized.

Show comment
Hide comment
@msmith-techempower

msmith-techempower Jul 21, 2014

Member

The real test, from my experience, is whether Plack will be terminated correctly on an EC2 MLarge instance (not enough machine for Plack, haha). We regularly had the situation you were mentioning where it would run nginx from some root-perm folder, spawn 8 threads, and nothing called from stop() seemed a bad enough dude to kill it (ultimately, sudo killall plackup was the 'nuke from orbit' fix just for EC2).

Member

msmith-techempower commented Jul 21, 2014

The real test, from my experience, is whether Plack will be terminated correctly on an EC2 MLarge instance (not enough machine for Plack, haha). We regularly had the situation you were mentioning where it would run nginx from some root-perm folder, spawn 8 threads, and nothing called from stop() seemed a bad enough dude to kill it (ultimately, sudo killall plackup was the 'nuke from orbit' fix just for EC2).

@hamiltont

This comment has been minimized.

Show comment
Hide comment
@hamiltont

hamiltont Dec 19, 2014

Contributor

resolved in master!

Contributor

hamiltont commented Dec 19, 2014

resolved in master!

@hamiltont hamiltont closed this Dec 19, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment