I'm investigating a weird issue, the test_io tests started failing deterministicaly within our CI environment. A previously-working combination of libyang+libnetconf2 started failing. The only change which failed is an update of the underlying OS (Fedora 29).
9/11 Test #1: test_io ..........................***Failed 0.06 sec
[==========] Running 4 test(s).
[ RUN ] test_write_rpc_10
[ERR]: Session 0: communication socket unexpectedly closed.
[ ERROR ] --- 0 != 0x5
[ LINE ] --- /home/ci/src/cesnet-gerrit-czechlight/CzechLight/dependencies/libnetconf2/tests/test_io.c:107: error: Failure!
[ FAILED ] test_write_rpc_10
[ RUN ] test_write_rpc_10_bad
[ERR]: Session 0: invalid session to send RPCs.
[ OK ] test_write_rpc_10_bad
[ RUN ] test_write_rpc_11
[ERR]: Session 0: communication socket unexpectedly closed.
[ ERROR ] --- 0 != 0x5
[ LINE ] --- /home/ci/src/cesnet-gerrit-czechlight/CzechLight/dependencies/libnetconf2/tests/test_io.c:107: error: Failure!
[ FAILED ] test_write_rpc_11
[ RUN ] test_write_rpc_11_bad
[ERR]: Session 0: invalid session to send RPCs.
[ OK ] test_write_rpc_11_bad
[==========] 4 test(s) run.
[ PASSED ] 2 test(s).
[ FAILED ] 2 test(s), listed below:
[ FAILED ] test_write_rpc_10
[ FAILED ] test_write_rpc_11
2 FAILED TEST(S)
That didn't make much sense to me, so strace to the rescue (see the full output, available over IPv6 only). Here's the relevant part:
poll([{fd=0, events=POLLIN}], 1, 0) = 1 ([{fd=0, revents=POLLHUP}])
write(2, "[ERR]: Session 0: communication "..., 60[ERR]: Session 0: communication socket unexpectedly closed.
) = 60
I.e, poll() returns with POLLHUP for stdin.
I tried to reproduce this locally (outside of the CI system, running the same code within a Fedora 29 VM, but over SSH). An easy way of getting the same behavior as our failure is this:
$ true | ./tests/test_io
[==========] Running 4 test(s).
[ RUN ] test_write_rpc_10
[ERR]: Session 0: communication socket unexpectedly closed.
[ ERROR ] --- 0 != 0x5
[ LINE ] --- /home/ci/dependencies/libnetconf2/tests/test_io.c:107: error: Failure!
[ FAILED ] test_write_rpc_10
[ RUN ] test_write_rpc_10_bad
[ERR]: Session 0: invalid session to send RPCs.
And as shown under strace:
$ true | strace -e trace=poll ./tests/test_io
[==========] Running 4 test(s).
[ RUN ] test_write_rpc_10
poll([{fd=0, events=POLLIN}], 1, 0) = 1 ([{fd=0, revents=POLLHUP}])
[ERR]: Session 0: communication socket unexpectedly closed.
[ ERROR ] --- 0 != 0x5
[ LINE ] --- /home/ci/dependencies/libnetconf2/tests/test_io.c:107: error: Failure!
[ FAILED ] test_write_rpc_10
Now, there has been no change to the way how our CI launches its tests, which is something that puzzles me. It's done via Zuul, and the actual system that executes stuff on the build node is Ansible. I have no idea how it deals with stdin of the launched processes. I also cannot answer why these tests passed with the same version libnetconf2. The only change that I am aware of is a reinstallation of the build VM (+container), but it is once again done automatically and driven by the same command...
- Would it make sense to operate
test_io on a pipe created for that purpose? That would make the test more robust to random changes in its environment.
I'm investigating a weird issue, the
test_iotests started failing deterministicaly within our CI environment. A previously-working combination of libyang+libnetconf2 started failing. The only change which failed is an update of the underlying OS (Fedora 29).That didn't make much sense to me, so
straceto the rescue (see the full output, available over IPv6 only). Here's the relevant part:I.e,
poll()returns withPOLLHUPforstdin.I tried to reproduce this locally (outside of the CI system, running the same code within a Fedora 29 VM, but over SSH). An easy way of getting the same behavior as our failure is this:
And as shown under
strace:Now, there has been no change to the way how our CI launches its tests, which is something that puzzles me. It's done via Zuul, and the actual system that executes stuff on the build node is Ansible. I have no idea how it deals with
stdinof the launched processes. I also cannot answer why these tests passed with the same version libnetconf2. The only change that I am aware of is a reinstallation of the build VM (+container), but it is once again done automatically and driven by the same command...test_ioon a pipe created for that purpose? That would make the test more robust to random changes in its environment.