Skip to content

test_io fails in our CI (poll revents=POLLHUP on stdin) #153

@jktjkt

Description

@jktjkt

I'm investigating a weird issue, the test_io tests started failing deterministicaly within our CI environment. A previously-working combination of libyang+libnetconf2 started failing. The only change which failed is an update of the underlying OS (Fedora 29).

 9/11 Test  #1: test_io ..........................***Failed    0.06 sec
[==========] Running 4 test(s).
[ RUN      ] test_write_rpc_10
[ERR]: Session 0: communication socket unexpectedly closed.
[  ERROR   ] --- 0 != 0x5
[   LINE   ] --- /home/ci/src/cesnet-gerrit-czechlight/CzechLight/dependencies/libnetconf2/tests/test_io.c:107: error: Failure!
[  FAILED  ] test_write_rpc_10
[ RUN      ] test_write_rpc_10_bad
[ERR]: Session 0: invalid session to send RPCs.
[       OK ] test_write_rpc_10_bad
[ RUN      ] test_write_rpc_11
[ERR]: Session 0: communication socket unexpectedly closed.
[  ERROR   ] --- 0 != 0x5
[   LINE   ] --- /home/ci/src/cesnet-gerrit-czechlight/CzechLight/dependencies/libnetconf2/tests/test_io.c:107: error: Failure!
[  FAILED  ] test_write_rpc_11
[ RUN      ] test_write_rpc_11_bad
[ERR]: Session 0: invalid session to send RPCs.
[       OK ] test_write_rpc_11_bad
[==========] 4 test(s) run.
[  PASSED  ] 2 test(s).
[  FAILED  ] 2 test(s), listed below:
[  FAILED  ] test_write_rpc_10
[  FAILED  ] test_write_rpc_11

 2 FAILED TEST(S)

That didn't make much sense to me, so strace to the rescue (see the full output, available over IPv6 only). Here's the relevant part:

poll([{fd=0, events=POLLIN}], 1, 0)     = 1 ([{fd=0, revents=POLLHUP}])
write(2, "[ERR]: Session 0: communication "..., 60[ERR]: Session 0: communication socket unexpectedly closed.
) = 60

I.e, poll() returns with POLLHUP for stdin.

I tried to reproduce this locally (outside of the CI system, running the same code within a Fedora 29 VM, but over SSH). An easy way of getting the same behavior as our failure is this:

$ true | ./tests/test_io
[==========] Running 4 test(s).
[ RUN      ] test_write_rpc_10
[ERR]: Session 0: communication socket unexpectedly closed.
[  ERROR   ] --- 0 != 0x5
[   LINE   ] --- /home/ci/dependencies/libnetconf2/tests/test_io.c:107: error: Failure!
[  FAILED  ] test_write_rpc_10
[ RUN      ] test_write_rpc_10_bad
[ERR]: Session 0: invalid session to send RPCs.

And as shown under strace:

$ true | strace -e trace=poll ./tests/test_io
[==========] Running 4 test(s).
[ RUN      ] test_write_rpc_10
poll([{fd=0, events=POLLIN}], 1, 0)     = 1 ([{fd=0, revents=POLLHUP}])
[ERR]: Session 0: communication socket unexpectedly closed.
[  ERROR   ] --- 0 != 0x5
[   LINE   ] --- /home/ci/dependencies/libnetconf2/tests/test_io.c:107: error: Failure!
[  FAILED  ] test_write_rpc_10

Now, there has been no change to the way how our CI launches its tests, which is something that puzzles me. It's done via Zuul, and the actual system that executes stuff on the build node is Ansible. I have no idea how it deals with stdin of the launched processes. I also cannot answer why these tests passed with the same version libnetconf2. The only change that I am aware of is a reinstallation of the build VM (+container), but it is once again done automatically and driven by the same command...

  • Would it make sense to operate test_io on a pipe created for that purpose? That would make the test more robust to random changes in its environment.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions