
azure-lb: Don't redirect nc listener output to pidfile #1528

Merged 1 commit into ClusterLabs:master on Jul 10, 2020

Conversation

nrwahl2 (Contributor) commented Jul 7, 2020

The `lb_start()` function spawns an `nc` listener background process
and echoes the resulting pid to `$pidfile`. Due to a bug in the
redirection, all future data received by the `nc` process is also
appended to `$pidfile`.

If binary data is received later and appended to `$pidfile`, the
monitor operation fails when `grep` searches the now-binary file.

```
line 97: kill: Binary: arguments must be process or job IDs ]
line 97: kill: file: arguments must be process or job IDs ]
line 97: kill: /var/run/nc_PF2_02.pid: arguments must be process or job IDs ]
line 97: kill: matches: arguments must be process or job IDs ]
```

Then the start operation fails during recovery. `lb_start()` spawns a
new `nc` process, but the old process is still running and using the
configured port.

```
nc_PF2_02_start_0:777:stderr [ Ncat: bind to :::62502: Address already in use. QUITTING. ]
```

This patch fixes the issue by removing the `nc &` command from the
section whose output gets redirected to `$pidfile`. Now, only the `nc`
PID is echoed to `$pidfile`.

Resolves: RHBZ#1850778
Resolves: RHBZ#1850779
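The shape of the bug and of the fix can be sketched in plain shell. This is a hypothetical stand-in, not the agent's actual code: `fake_listener` simulates `nc` writing "received" data to stdout some time after it starts, and the pidfile is a temp file.

```shell
pidfile=$(mktemp)

# Stand-in for the nc listener: like nc, it writes data it
# "receives" to stdout after it has been running for a while.
fake_listener() { sleep 0.2; echo "binary data"; }

# Buggy form: the redirection applies to the whole compound command,
# so the backgrounded listener inherits the pidfile as its stdout and
# its later output lands in the pidfile alongside the PID.
{ fake_listener & echo $!; } > "$pidfile"
sleep 0.5   # pidfile now holds two lines: the PID and "binary data"

# Fixed form (the shape of this patch): the redirection binds only
# to echo, so the pidfile holds nothing but the PID; the listener's
# output goes to the shell's stdout instead.
fake_listener &
echo $! > "$pidfile"
sleep 0.5
```

In the buggy form the background job shares the redirected file descriptor with `echo`, which is why its output is appended after the PID rather than overwriting it; any later `grep`/`kill` on the pidfile then chokes on the extra data.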
oalbrigt (Contributor) commented Jul 7, 2020

LGTM.

oalbrigt merged commit cbc5c8e into ClusterLabs:master on Jul 10, 2020
nrwahl2 deleted the nrwahl2-fix_azure-lb branch on July 10, 2020 at 09:23
sjohnsonsf commented

@nrwahl2, apologies in advance if this is not the correct forum, but I'm seeing exactly this issue on my clusters, and it persists after updating to resource-agents.x86_64 4.1.1-61.el7 from the CentOS Base repo. My servers run RHEL 7.9, but we pull Pacemaker/resource-agents from CentOS Base. I've confirmed that the updated azure-lb.sh in /usr/lib/ocf/resource.d/heartbeat does not reflect the changes in this pull request in resource-agents.x86_64 4.1.1-61.el7. Thanks in advance!

nrwahl2 (Contributor, Author) commented Mar 3, 2021

@sjohnsonsf On RHEL, the fix was introduced in resource-agents-4.1.1-61.el7_9.4. It looks like it's available here for CentOS (see the Changelog near the bottom):

https://centos.pkgs.org/7/centos-updates-x86_64/resource-agents-4.1.1-61.el7_9.4.x86_64.rpm.html

Can you give that a try?

sjohnsonsf commented

@nrwahl2 pulling from Updates Repo seems to have done it. Can't thank you enough for the prompt reply!

nrwahl2 (Contributor, Author) commented Mar 3, 2021

@sjohnsonsf No problem, glad it's sorted out!

sjohnsonsf commented

Hey @nrwahl2, since patching, our clusters have stabilized, but the azure-lb resource still fails for us, though it no longer causes a cluster-down scenario. What's particularly interesting is that it has occurred every Tuesday, at exactly the same time. As we're a large environment, I'm reviewing the relevant logs via Splunk to correlate, but I wanted to ask whether you have any experience with what could trigger this bug (what's sending data to the listener port?). Any tips would be helpful. Thanks again!

nrwahl2 (Contributor, Author) commented Mar 10, 2021

@sjohnsonsf Try changing:

```
$cmd &
```

to

```
$cmd >/dev/null 2>&1 &
```

I'm investigating an apparent bug right now where the nc listener process dies upon receiving input. It's not clear to me yet why input from the Azure health probes doesn't kill it, but if I just connect and send it the text payload "test", it dies with SIGPIPE. We first saw a user encounter this when running a Tenable security scan.

So far, it seems that redirecting output prevents the nc listener from dying.
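The SIGPIPE mechanism can be reproduced with a generic sketch (an illustration, not the agent's code): a writer whose stdout is a pipe with no reader is killed by SIGPIPE the moment it writes, and redirecting its stdout to /dev/null sidesteps that entirely.

```shell
status_file=$(mktemp)

# The reader (:) exits immediately, so by the time the writer's echo
# runs the pipe has no reader. The write raises SIGPIPE and kills the
# inner subshell; its exit status is 141 (128 + SIGPIPE).
{ ( sleep 0.2; echo "test" ); echo $? > "$status_file"; } | :
unredirected_status=$(cat "$status_file")

# With stdout sent to /dev/null (the suggested workaround), the write
# never touches the dead pipe, so the writer exits normally.
{ ( sleep 0.2; echo "test" ) > /dev/null; echo $? > "$status_file"; } | :
redirected_status=$(cat "$status_file")

echo "without redirect: $unredirected_status, with redirect: $redirected_status"
```

This matches the observed behavior: the listener survives as long as nothing makes it write to a dead stdout, which is why redirecting its output keeps it alive.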

nrwahl2 (Contributor, Author) commented Mar 10, 2021

@sjohnsonsf #1620

sjohnsonsf commented Mar 11, 2021

@nrwahl2 thanks. The culprit in our environment is a BeyondTrust appliance scanning VMs for root password change/management. We're in the process of testing to confirm, but glad to know the resource agent will be able to handle this sort of thing in the future. Thanks again!
