
azure-lb: Don't redirect nc listener output to pidfile #1528

Merged 1 commit into ClusterLabs:master on Jul 10, 2020

Conversation

nrwahl2 (Contributor) commented Jul 7, 2020

The `lb_start()` function spawns an `nc` listener background process
and echoes the resulting pid to `$pidfile`. Due to a bug in the
redirection, all future data received by the `nc` process is also
appended to `$pidfile`.

If binary data is received later and appended to `$pidfile`, the
monitor operation fails when `grep` searches the now-binary file.

```
line 97: kill: Binary: arguments must be process or job IDs ]
line 97: kill: file: arguments must be process or job IDs ]
line 97: kill: /var/run/nc_PF2_02.pid: arguments must be process or job IDs ]
line 97: kill: matches: arguments must be process or job IDs ]
```

Then the start operation fails during recovery. `lb_start()` spawns a
new `nc` process, but the old process is still running and using the
configured port.

```
nc_PF2_02_start_0:777:stderr [ Ncat: bind to :::62502: Address already in use. QUITTING. ]
```

This patch fixes the issue by removing the `nc &` command from the
section whose output gets redirected to `$pidfile`. Now, only the `nc`
PID is echoed to `$pidfile`.

Resolves: RHBZ#1850778
Resolves: RHBZ#1850779
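The shape of the bug and of the fix can be sketched in plain shell. This is a hypothetical stand-in, not the agent's actual code: `fake_listener` simulates `nc` writing "received" data to stdout some time after it starts, and the pidfile is a temp file.

```shell
pidfile=$(mktemp)

# Stand-in for the nc listener: like nc, it writes data it
# "receives" to stdout after it has been running for a while.
fake_listener() { sleep 0.2; echo "binary data"; }

# Buggy form: the redirection applies to the whole compound command,
# so the backgrounded listener inherits the pidfile as its stdout and
# its later output lands in the pidfile alongside the PID.
{ fake_listener & echo $!; } > "$pidfile"
sleep 0.5   # pidfile now holds two lines: the PID and "binary data"

# Fixed form (the shape of this patch): the redirection binds only
# to echo, so the pidfile holds nothing but the PID; the listener's
# output goes to the shell's stdout instead.
fake_listener &
echo $! > "$pidfile"
sleep 0.5
```

In the buggy form the background job shares the redirected file descriptor with `echo`, which is why its output is appended after the PID rather than overwriting it; any later `grep`/`kill` on the pidfile then chokes on the extra data.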
oalbrigt (Contributor) commented Jul 7, 2020

LGTM.

oalbrigt merged commit cbc5c8e into ClusterLabs:master on Jul 10, 2020
nrwahl2 deleted the nrwahl2-fix_azure-lb branch on July 10, 2020 at 09:23
sjohnsonsf commented

@nrwahl2, apologies in advance if this is not the correct forum, but I'm seeing exactly this issue on my clusters, and it persists after updating to resource-agents.x86_64 4.1.1-61.el7 from the CentOS Base repo. My servers run RHEL 7.9, but we pull Pacemaker/resource-agents from CentOS Base. I've confirmed that the updated azure-lb.sh in /usr/lib/ocf/resource.d/heartbeat does not reflect the changes in this pull request in resource-agents.x86_64 4.1.1-61.el7. Thanks in advance!

nrwahl2 (Contributor, Author) commented Mar 3, 2021

@sjohnsonsf On RHEL, the fix was introduced in resource-agents-4.1.1-61.el7_9.4. It looks like it's available here for CentOS (see the Changelog near the bottom):

https://centos.pkgs.org/7/centos-updates-x86_64/resource-agents-4.1.1-61.el7_9.4.x86_64.rpm.html

Can you give that a try?

sjohnsonsf commented

@nrwahl2 pulling from Updates Repo seems to have done it. Can't thank you enough for the prompt reply!

nrwahl2 (Contributor, Author) commented Mar 3, 2021

@sjohnsonsf No problem, glad it's sorted out!

sjohnsonsf commented

Hey @nrwahl2, since patching, our clusters have stabilized, but the azure-lb resource still fails for us, though it no longer causes a cluster-down scenario. What's particularly interesting is that it has occurred every Tuesday, at exactly the same time. As we're a large environment, I'm reviewing the relevant logs via Splunk to correlate, but I wanted to ask whether you have any experience with what could trigger this bug (what's sending data to the listener port?). Any tips would be helpful. Thanks again!

nrwahl2 (Contributor, Author) commented Mar 10, 2021

@sjohnsonsf Try changing:

```
$cmd &
```

to

```
$cmd >/dev/null 2>&1 &
```

I'm investigating an apparent bug right now where the nc listener process dies upon receiving input. It's not clear to me yet why input from the Azure health probes doesn't kill it, but if I just connect and send it the text payload "test", it dies with SIGPIPE. We first saw a user encounter this when running a Tenable security scan.

So far, it seems that redirecting output prevents the nc listener from dying.
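The SIGPIPE mechanism can be reproduced with a generic sketch (an illustration, not the agent's code): a writer whose stdout is a pipe with no reader is killed by SIGPIPE the moment it writes, and redirecting its stdout to /dev/null sidesteps that entirely.

```shell
status_file=$(mktemp)

# The reader (:) exits immediately, so by the time the writer's echo
# runs the pipe has no reader. The write raises SIGPIPE and kills the
# inner subshell; its exit status is 141 (128 + SIGPIPE).
{ ( sleep 0.2; echo "test" ); echo $? > "$status_file"; } | :
unredirected_status=$(cat "$status_file")

# With stdout sent to /dev/null (the suggested workaround), the write
# never touches the dead pipe, so the writer exits normally.
{ ( sleep 0.2; echo "test" ) > /dev/null; echo $? > "$status_file"; } | :
redirected_status=$(cat "$status_file")

echo "without redirect: $unredirected_status, with redirect: $redirected_status"
```

This matches the observed behavior: the listener survives as long as nothing makes it write to a dead stdout, which is why redirecting its output keeps it alive.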

nrwahl2 (Contributor, Author) commented Mar 10, 2021

@sjohnsonsf #1620

sjohnsonsf commented Mar 11, 2021

@nrwahl2 thanks. The culprit in our environment is a BeyondTrust appliance scanning VMs for root password change/management. We're in the process of testing to confirm, but glad to know the resource agent will be able to handle this sort of thing in the future. Thanks again!
