
RunHostTest: op-test gets killed frequently when running tests #508

Open
harish-24 opened this issue Jun 19, 2019 · 9 comments
harish-24 commented Jun 19, 2019

The following traceback is observed, and the op-test process gets killed.

2019-06-19 06:35:24,328:op-test.common.OpTestUtil:dump_versions:INFO:Log Location: /home/harish/upstream/op-test-framework/test-reports/test-run-20190619040239/*debug*
2019-06-19 06:35:24,328:op-test.common.OpTestUtil:dump_versions:INFO:
----------------------------------------------------------
OpTestSystem Firmware Versions Tested
(if flashed things like skiboot.lid, may not be accurate)
----------------------------------------------------------
Firmware Versions Unavailable
----------------------------------------------------------
----------------------------------------------------------

  File "./op-test", line 830, in <module>
    res = run_tests(t, failfast=OpTestConfiguration.conf.args.failfast)
  File "./op-test", line 805, in run_tests
    return runner(**kwargs).run(t)
  File "/usr/lib64/python2.7/unittest/runner.py", line 151, in run
    test(result)
  File "/usr/lib64/python2.7/unittest/suite.py", line 70, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib64/python2.7/unittest/suite.py", line 108, in run
    test(result)
  File "/usr/lib64/python2.7/unittest/suite.py", line 70, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib64/python2.7/unittest/suite.py", line 108, in run
    test(result)
  File "/usr/lib64/python2.7/unittest/suite.py", line 70, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib64/python2.7/unittest/suite.py", line 108, in run
    test(result)
  File "/usr/lib64/python2.7/unittest/suite.py", line 70, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib64/python2.7/unittest/suite.py", line 108, in run
    test(result)
  File "/usr/lib64/python2.7/unittest/case.py", line 433, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
    testMethod()
  File "/home/harish/upstream/op-test-framework/testcases/RunHostTest.py", line 84, in runTest
    con.run_command(line, timeout=self.host_cmd_timeout)
  File "/home/harish/upstream/op-test-framework/common/OpTestSSH.py", line 214, in run_command
    output = self.try_command(term_obj, command, timeout)
  File "/home/harish/upstream/op-test-framework/common/OpTestUtil.py", line 1454, in try_command
    rc = pty.expect([expect_prompt, r"[Pp]assword for", pexpect.TIMEOUT, pexpect.EOF], timeout=timeout)
  File "/home/harish/upstream/op-test-framework/common/OPexpect.py", line 87, in expect
    searchwindowsize=searchwindowsize)
  File "/usr/lib/python2.7/site-packages/pexpect/spawnbase.py", line 341, in expect
    timeout, searchwindowsize, async_)
  File "/usr/lib/python2.7/site-packages/pexpect/spawnbase.py", line 369, in expect_list
    return exp.expect_loop(timeout)
  File "/usr/lib/python2.7/site-packages/pexpect/expect.py", line 111, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/lib/python2.7/site-packages/pexpect/pty_spawn.py", line 499, in read_nonblocking
    if (timeout != 0) and select(timeout):
  File "/usr/lib/python2.7/site-packages/pexpect/pty_spawn.py", line 449, in select
    return select_ignore_interrupts([self.child_fd], [], [], timeout)[0]
  File "/usr/lib/python2.7/site-packages/pexpect/utils.py", line 143, in select_ignore_interrupts
    return select.select(iwtd, owtd, ewtd, timeout)
  File "./op-test", line 54, in optest_handler
    traceback.print_stack()
2019-06-19 06:35:24,331:op-test:optest_handler:ERROR:OpTestSystem exiting from signal 'SIGHUP' signum=1
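Note that the trace above is not an exception: it ends inside `traceback.print_stack()` because op-test's signal handler dumps the current stack before exiting. A minimal, simplified sketch of a handler with that shape (not op-test's actual code) looks like:

```python
import signal
import sys
import traceback

# Hypothetical simplified handler in the style of op-test's optest_handler:
# on SIGHUP it prints the Python stack at the point of interruption and
# exits, which is why the log shows a stack trace with no exception raised.
def optest_handler(signum, frame):
    traceback.print_stack(frame)
    print("exiting from signal signum=%d" % signum)
    sys.exit(1)

signal.signal(signal.SIGHUP, optest_handler)
```

Because the handler is installed unconditionally, any SIGHUP delivered while the process is blocked (here, in `select.select` under pexpect) runs this path and terminates the run.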

Config used

[op-test]
bmc_type=OpenBMC
bmc_username=root
bmc_password=...
bmc_ip=...
bmc_usernameipmi=root
bmc_passwordipmi=...
host_ip=...
host_user=root
host_password=...
host_cmd_file=command.txt
host_cmd_timeout=720000
host_cmd_resultpath='/home/tests/results'

The following commands are run (command.txt):

cd tests
python avocado-setup.py --run-suite host_test

# ./op-test --run testcases.RunHostTest -c wsp12.conf --host-cmd-file command.txt

The issue is observed very frequently, irrespective of bmc_type.

oohal commented Aug 6, 2019

 File "/home/harish/upstream/op-test-framework/testcases/RunHostTest.py", line 84, in runTest
    con.run_command(line, timeout=self.host_cmd_timeout)
  File "/home/harish/upstream/op-test-framework/common/OpTestSSH.py", line 214, in run_command
    output = self.try_command(term_obj, command, timeout)
  File "/home/harish/upstream/op-test-framework/common/OpTestUtil.py", line 1454, in try_command
    rc = pty.expect([expect_prompt, r"[Pp]assword for", pexpect.TIMEOUT, pexpect.EOF], timeout=timeout)
  File "/home/harish/upstream/op-test-framework/common/OPexpect.py", line 87, in expect
    searchwindowsize=searchwindowsize)

Looks like it's hitting a timeout waiting for "[Pp]assword for" to appear. Please debug further.

@harish-24

The issue seems to occur when the ssh connection to the server that triggered the run is disconnected.
The server runs the op-test instance in the background through nohup + &.

[root@host op-test-2]# date
Fri Jun 26 03:22:08 EDT 2020
[root@host op-test-2]# Connection to lep8a.aus.stglabs.ibm.com closed.

After logging in again and checking, the background op-test process was killed at the exact point of disconnection:

2020-06-26 03:22:09,144:op-test.common.OpTestUtil:cleanup:INFO:OpTestSystem Starting to Gather ESEL's
...
...

leading to the above trace.

@sathnaga

@harish-24 this could help solve the issue you are facing; it seems the framework catches SIGHUP by default and terminates.

diff --git a/common/OPexpect.py b/common/OPexpect.py
index d2d1832..820bcc5 100644
--- a/common/OPexpect.py
+++ b/common/OPexpect.py
@@ -42,7 +42,7 @@ class spawn(pexpect.spawn):
     def __init__(self, command, args=[], maxread=8000,
                  searchwindowsize=None, logfile=None, cwd=None, env=None,
-                 ignore_sighup=False, echo=True, preexec_fn=None,
+                 ignore_sighup=True, echo=True, preexec_fn=None,
                  encoding='utf-8', codec_errors='ignore', dimensions=None,
                  failure_callback=None, failure_callback_data=None):
         self.command = command
diff --git a/op-test b/op-test
index 45f5a8c..4fa42a4 100755
--- a/op-test
+++ b/op-test
@@ -138,7 +138,7 @@ def optest_handler(signum, frame):
     # try/except block, control will be given back to that block
-signal.signal(signal.SIGHUP, optest_handler)
+#signal.signal(signal.SIGHUP, optest_handler)
 signal.signal(signal.SIGINT, optest_handler)
 signal.signal(signal.SIGQUIT, optest_handler)
 signal.signal(signal.SIGILL, optest_handler)

I will have to check for any other issues and will work on a cleaner patch.
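The effect of the patch can be sketched in isolation with a stdlib-only demo (this is not op-test code, just an illustration assuming a POSIX platform): once SIGHUP's disposition is SIG_IGN, the hangup the terminal sends on disconnect is simply discarded instead of killing the process.

```python
import os
import signal

# Ignore SIGHUP, as the patched OPexpect.py / op-test would effectively do.
signal.signal(signal.SIGHUP, signal.SIG_IGN)

# Simulate the hangup a closing terminal delivers to the process group.
os.kill(os.getpid(), signal.SIGHUP)

# With SIG_IGN installed, execution continues past the signal.
print("survived SIGHUP")
```

Without the `SIG_IGN` line, the default disposition for SIGHUP terminates the process, which is the behaviour the trace in this issue shows.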

@harish-24

@sathnaga It does help; I don't see the issue anymore with the above hack.

@klauskiwi

Care to expand a bit more?

If someone is connected (ssh or whatever) to a machine where the op-test.py instance is running (i.e., driving the test, not a target test system), and that connection is dropped, it is only natural that a SIGHUP (hangup) signal is sent to op-test, which needs to be killed, IMHO.

If we're running with nohup ./op-test.py we should not be seeing hangups in the case of disconnection. So I'm confused about what the problem is (after all, getting traces if the connection is dropped is expected) and what the proposed solution is here.

@harish-24

> Care to expand a bit more?
>
> If someone is connected (ssh or whatever) to a machine where the op-test.py instance is running (i.e., driving the test, not a target test system), and that connection is dropped, it is only natural that a SIGHUP (hangup) signal is sent to op-test, which needs to be killed, IMHO.
>
> If we're running with nohup ./op-test.py we should not be seeing hangups in the case of disconnection. So I'm confused about what the problem is (after all, getting traces if the connection is dropped is expected) and what the proposed solution is here.

Yes, we are running with nohup ./op-test.py ... & and the problem is that we are still seeing hangups in the case of disconnection. The hack provided by @sathnaga in #508 (comment) avoids that problem.


sathnaga commented Jun 29, 2020

> Care to expand a bit more?
>
> If someone is connected (ssh or whatever) to a machine where the op-test.py instance is running (i.e., driving the test, not a target test system), and that connection is dropped, it is only natural that a SIGHUP (hangup) signal is sent to op-test, which needs to be killed, IMHO.
>
> If we're running with nohup ./op-test.py we should not be seeing hangups in the case of disconnection. So I'm confused about what the problem is (after all, getting traces if the connection is dropped is expected) and what the proposed solution is here.

@klauskiwi More details: the issue does not show up if you (1) exit the terminal cleanly (e.g., by typing exit) while nohup ./op-test.py ... & is running, but it does happen when (2) the terminal is closed abruptly. In some cases, if you are running/driving a test from a remote terminal and the ssh connection drops due to the network, the same behaviour is induced. This comes down to terminal/bash behaviour.

Demo:
(1) Clean exit:

Terminal 1:
$ echo $$
150465
--------------------> do start strace here
$ exit

Terminal 2:
$ strace -e signal -p 150465
strace: Process 150465 attached
rt_sigaction(SIGINT, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a4b70, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=0x55cb0434b6d0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a4b70, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGHUP, {sa_handler=0x55cb0434bfb0, sa_mask=[HUP INT ILL TRAP ABRT BUS FPE USR1 SEGV USR2 PIPE ALRM TERM XCPU XFSZ VTALRM SYS], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a4b70, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGALRM, {sa_handler=0x55cb0434bfb0, sa_mask=[HUP INT ILL TRAP ABRT BUS FPE USR1 SEGV USR2 PIPE ALRM TERM XCPU XFSZ VTALRM SYS], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a4b70, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGWINCH, {sa_handler=0x55cb0434b6c0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a37b0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
+++ exited with 0 +++

========== No SIGHUP is delivered

(2) Abrupt close:

Terminal 1:
$ echo $$
150512

just close this terminal

Terminal 2
$ strace -e signal -p 150512
strace: Process 150512 attached
...
...
 SEGV USR2 PIPE ALRM TERM XCPU XFSZ VTALRM SYS], sa_flags=SA_RESTORER, sa_restorer=0x7f1fe3f9cab0}, 8) = 0
kill(150512, SIGHUP)                    = 0
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=150512, si_uid=1000} ---
+++ killed by SIGHUP +++

My quick hack just disables the SIGHUP signal handler; I doubt that's a clean solution. I'm still thinking of ways to distinguish the above behaviour from all the other SIGHUP possibilities, so that case alone can be handled better.
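One conventional way to make the handler cooperate with nohup (a sketch, not a tested op-test patch): per POSIX, nohup runs the program with SIGHUP ignored (SIG_IGN), and that disposition survives exec. An unconditional `signal.signal(SIGHUP, handler)` in the program then clobbers nohup's protection, which would explain why the process still dies under nohup. Checking the inherited disposition first keeps the handler for interactive runs while respecting nohup:

```python
import signal

def optest_handler(signum, frame):
    raise SystemExit(1)  # stand-in for op-test's real handler

# Simulate being started under nohup, which sets SIGHUP to SIG_IGN:
signal.signal(signal.SIGHUP, signal.SIG_IGN)

# Only install our handler if SIGHUP is not already ignored, so an
# inherited "ignore" from nohup is preserved instead of overridden.
if signal.getsignal(signal.SIGHUP) != signal.SIG_IGN:
    signal.signal(signal.SIGHUP, optest_handler)

print(signal.getsignal(signal.SIGHUP) is signal.SIG_IGN)  # → True
```

Run interactively (without the simulated SIG_IGN), the disposition is the default handler, so the conditional would install `optest_handler` as before.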


harihare commented Feb 4, 2021

@harish-24 Any updates here?

@harish-24

@harihare The issue remains the same. Not sure if @sathnaga has any updates beyond the hack that was posted.
