Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exiting under Windows is strange #11027

Open
julianoes opened this issue Dec 12, 2018 · 12 comments
Open

Exiting under Windows is strange #11027

julianoes opened this issue Dec 12, 2018 · 12 comments
Assignees
Labels
bug Development Environment: windows Microsoft Windows OS environment stale

Comments

@julianoes
Copy link
Contributor

julianoes commented Dec 12, 2018

I find it annoying that I can't easily quit SITL under windows. When I press Ctrl+C the signal handler SIGINT is called correctly but then it doesn't shutdown cleanly but hands and you need to press either Enter or Ctrl+D. Presumably, the pxh shell is hanging at getchar.

FYI: @MaEtUgR

@julianoes julianoes added bug Development Environment: windows Microsoft Windows OS environment labels Dec 12, 2018
@MaEtUgR
Copy link
Member

MaEtUgR commented Dec 12, 2018

Related: #10098 (comment)
My issue with less explanation for the same issue: PX4/PX4-windows-toolchain#3
I have to deal with this problem every day... so I would be greatful to have a solution given it cannot be that complicated. The shutdown command works just fine now. Can we not do the same thing on Ctrl+C?

@julianoes
Copy link
Contributor Author

I believe the shutdown command works because you won't call getchar anymore afterwards. However, when you catch the SIGINT signal and try to exit it still hangs at getchar.

@MaEtUgR
Copy link
Member

MaEtUgR commented Dec 19, 2018

Thanks for sharing your findings. So the expected behavior would be that it kills all threads but since the pxh shell thread is still waiting for input it also waits for the thread. Maybe that's similar to the waiting for the poll of the named pipe issue that is solved now.

@MaEtUgR MaEtUgR self-assigned this Dec 19, 2018
@MaEtUgR
Copy link
Member

MaEtUgR commented Jan 16, 2019

Since #11177 simulation runs a lot more stable on windows but the shutdown command also hangs so killing the process is the only way left.

I started debugging and found that:

  • When I put if (false) (to not get unused function error) before this line:
    https://github.com/PX4/Firmware/blob/d8ab059ff3dcd50e0b8953da4de4150dffc2b75e/platforms/posix/src/main.cpp#L276
    the px4 binary stays exitable as expected with Ctrl + C and shutdown.

  • With this line left in but a basically empty startup script the shutdown command still works but Ctrl + C requires an additional enter like described in OP despite no obvious problems:

    INFO  [px4] Calling startup script: /bin/sh etc/init.d-posix/rcS 0
    INFO  [px4] Startup script returned successfully
    

@MaEtUgR
Copy link
Member

MaEtUgR commented Jan 16, 2019

Aha
https://github.com/PX4/Firmware/blob/d8ab059ff3dcd50e0b8953da4de4150dffc2b75e/platforms/posix/src/main.cpp#L528
http://man7.org/linux/man-pages/man3/system.3.html

During execution of the command, SIGCHLD will be blocked, and SIGINT
and SIGQUIT will be ignored, in the process that calls system()
(these signals will be handled according to their defaults inside the
child process that executes command).

As mentioned, system() ignores SIGINT and SIGQUIT. This may make
programs that call it from a loop uninterruptible, unless they take
care themselves to check the exit status of the child.

I'm currently assuming newlib which Cygwin uses somehow screws up the SIG's in the process of implementing this specification.

https://github.com/mirror/newlib-cygwin/blob/17918cc6a6e6471162177a1125c6208ecce8a72e/newlib/libc/stdlib/system.c

Maybe I can find a simple workaround.

@MaEtUgR
Copy link
Member

MaEtUgR commented Jan 24, 2019

So the system() call is one thing that confuses the signal handling and makes the application not exit anymore and sadly I found out that something in the lockstep scheduler (#10648) is the other thing.

So in short if I replace

  1. https://github.com/PX4/Firmware/blob/2ffb49b734fad6b6dfd6fcaee79bf31c6cf595a7/platforms/posix/src/main.cpp#L529 with popen(shell_command.c_str(), "r") and comment out this line
  2. https://github.com/PX4/Firmware/blob/2ffb49b734fad6b6dfd6fcaee79bf31c6cf595a7/boards/px4/sitl/default.cmake#L101

it all runs smooth in my takeoff Ctrl+C test.

Now this is only a possible quick hotfix to reestablish functionality for currently blocked users and not something desirable since lockstep works (other than exiting) super well on Windows. I'll follow up with a more detailed explanation of what I tried and I'd really appreciate if I could ask @julianoes some futher questions (he already helped me when we last met). 😇

@MaEtUgR
Copy link
Member

MaEtUgR commented Jan 25, 2019

So here's the patch I last described master...MaEtUgR:exiting-experiments with the biggest disadvantage of disabling lockstep which is no solution.

I tracked the problem down to this call:
https://github.com/PX4/Firmware/blob/master/src/modules/simulator/simulator_mavlink.cpp#L295

In my tests it always returned 0 and according to what I've seen does what it should setting a clock that starts counting when the simulator starts. The frustrating part is: I tried to only process the first 1 or 100 MAVLink messages from the simulator and then exiting still works fine. If I process >300 exiting the application is broken. Now my debugging on why this could happen is inconclusive. What's suspicious is that it breaks after processing around one second of simulator data but not exactly to the milisecond.

@julianoes Could it be that something is triggered after the monotonic time in the simulator runs for one second that then starts other threads that lock everything up?

@MaEtUgR
Copy link
Member

MaEtUgR commented Feb 27, 2019

I just encountered a case on Linux where the first Ctrl+C didn't exit but also got stuck and the second Ctrl+C then then killed it:

pxh> INFO  [ecl/EKF] EKF aligned, (pressure height, IMU buf: 22, OBS buf: 14)
INFO  [ekf2] Mag sensor ID changed to 196616
INFO  [ecl/EKF] EKF GPS checks passed (WGS-84 origin set)
INFO  [ecl/EKF] EKF commencing GPS fusion
pxh> INFO  [mavlink] partner IP: 127.0.0.1
INFO  [commander] Takeoff detected

Exiting...
^C
Exiting...
pxh> Shutting down
ninja: build stopped: interrupted by user.
Makefile:190: recipe for target 'px4_sitl_default' failed
make: *** [px4_sitl_default] Interrupt

maetugr@ubuntuvm:~/Firmware$

It's not one more SIGINT for the simulator the px4 binary processes both as can be seen from the two Exiting.... This might give a hint.

@MaEtUgR
Copy link
Member

MaEtUgR commented Jun 25, 2019

I'm still relying on the hotfix #11305 and Ctrl+C works as expected. Not sure how to track down why the normal way it gets stuck.

@hamishwillee
Copy link
Contributor

Still a problem

@stale stale bot removed the Admin: Wont fix label Sep 24, 2019
@PX4 PX4 deleted a comment from stale bot Sep 24, 2019
@PX4 PX4 deleted a comment from stale bot Sep 24, 2019
@MaEtUgR
Copy link
Member

MaEtUgR commented Oct 21, 2019

True, this is related: #11654 It fixed the shutdown command in my tests.

@stale
Copy link

stale bot commented Jan 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@stale stale bot added the stale label Jan 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Development Environment: windows Microsoft Windows OS environment stale
Projects
None yet
Development

No branches or pull requests

3 participants