Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tlog-rec-session breaks Ansible pipelining #227

Closed
mikedep333 opened this issue Dec 10, 2018 · 23 comments · Fixed by #276
Closed

tlog-rec-session breaks Ansible pipelining #227

mikedep333 opened this issue Dec 10, 2018 · 23 comments · Fixed by #276

Comments

@mikedep333
Copy link

Test setup the Ansible-manged system, ldt-4009653:

  • RHEL 7.6 w/ tlog v5 (almost fresh install)
  • local account "admin-renamed" with sudo privileges
  • admin-renamed has its shell set to /usr/bin/tlog-rec-session
  • default tlog-rec & tlog-rec-session settings
  • "requiretty" disabled in /etc/sudoers (per Ansible pipelining instructions)

Test setup on my Ansible management system, ldt-4323517:

  • RHEL 7.6 & Ansible 2.7 or 2.8 (development branch)
  • In ansible.cfg:
[ssh_connection]
pipelining = True

Running this ansible test command (and entering the password when prompted) fails:

[Michael.DePaulo@ldt-4323517 Ansible]$ ansible -vvv -m setup -i ldt-4009653, ldt-4009653 -u admin-renamed -k --become-method sudo --become
ansible 2.8.0.dev0
  config file = /home/Michael.DePaulo/linux-desktop/Ansible/ansible.cfg
  configured module search path = [u'/home/Michael.DePaulo/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Sep 12 2018, 05:31:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]
Using /home/Michael.DePaulo/linux-desktop/Ansible/ansible.cfg as config file
SSH password: 
Parsed ldt-4009653, inventory source with host_list plugin
META: ran handlers
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/setup.py
<ldt-4009653> ESTABLISH SSH CONNECTION FOR USER: admin-renamed
<ldt-4009653> SSH: EXEC sshpass -d11 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o User=admin-renamed -o ConnectTimeout=10 -o ControlPath=/home/Michael.DePaulo/.ansible/cp/fe6b6f0560 ldt-4009653 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-bxmeronhupopkdzckqypdvyqpetglojk; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<ldt-4009653> (1, '', 'X11 forwarding request failed\r\n\nATTENTION! Your session is being recorded!\n\n')
ldt-4009653 | FAILED! => {
    "changed": false, 
    "module_stderr": "X11 forwarding request failed\r\n\nATTENTION! Your session is being recorded!\n\n", 
    "module_stdout": "", 
    "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", 
    "rc": 1
}

Normally it would print out tons of facts about the managed system (what the "setup" module does.)

The stderr is not the problem. I removed the message, and the X11 forwarding message is harmless (we normally disable X11 forwarding for Ansible.)

The rc 1 seems to be the problem though.

We can workaround this by disabling pipelining and accepting the reduced Ansible performance.

@spbnick
Copy link
Member

spbnick commented Dec 11, 2018

Thank you for the report, @mikedep333. This might be a real issue. I'm not sure what Ansible pipelining exactly is, as there's not much documentation besides mostly marketing speak, but this needs to be investigated.

Yes, you need to disable any port forwarding and X11 forwarding on the sshd server if you want tighter control over what goes in and out of the host without recording.

@spbnick
Copy link
Member

spbnick commented Dec 11, 2018

Here might be a clue for what's going on: ansible/ansible@f488de8

@trevor-vaughan
Copy link
Contributor

We're also seeing issues in the native Vagrant commands and are trying to debug.

It may be the method that tlog is using to hook onto the input/output handles and what happens if they do not exist at some point.

@trevor-vaughan
Copy link
Contributor

I have some additional information related to this:

If you set the shell for root to /bin/tlog-rec-session, and then run vagrant reload <whatever>, the command will not succeed and the following will be output:

==> hostname: Attempting graceful shutdown of VM...
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!

shutdown -h now

Stdout from the command:



Stderr from the command:


My guess is that this is the same issue showing in two different ways.

@AndreasDavour
Copy link

I have now encountered this as well. It's a real issue, and there's something not playing nice when ansible uses the multiplexed ssh connections and it makes tlog confused. I'm trying to see if I can turn off all the ansible magic, but there's more magic that I initially comprehend.

@justin-stephenson
Copy link
Collaborator

Thank you for reporting this issue, I am able to reproduce the problem. If I understand https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/connection/ssh.py#L760 correctly, when pipelining is enabled the ansible initiated ssh command(executed without -tt argument) is spawned inside a python subprocess.Popen execution context using pipes instead of pseudo-terminal.

However, per 3d8dbd6 tlog should be able to handle forking a child connected via pipes. More investigation required on my end here.

@merlyel
Copy link

merlyel commented Dec 23, 2019

Hi
As far as I understand, Ansible when run with pipelining enabled runs python and pushes code into it via stdin. It may be that ansible does not expect notice message popping out at that time.
Can we try disabling notice message somehow and test if pipelining mode is still broken or not?

@merlyel
Copy link

merlyel commented Dec 23, 2019

I see that tlog outputs the notice to stderr. I think this may be the reason pipelining Ansible thinks there's some problem happening

@spbnick
Copy link
Member

spbnick commented Dec 23, 2019

@merlyel,I think you can disable the notice completely if you just set it to empty string ("").

@AndreasDavour
Copy link

I can not reproduce my problem, probably because I need ansible to do its connectivity persistence magic, and it does it when it feels like it. So, even though I see no change with the empty string I am not sure it proves anything.

Also, why is that message on stderr instead of stdout?

@spbnick
Copy link
Member

spbnick commented Jan 7, 2020

@AndreasDavour, because the message is not the primary output of the program. It's not something that the program produces as its function, but rather a note to the user.

@merlyel
Copy link

merlyel commented Jan 7, 2020

Two playbooks:

  • ping.yml - without pipelining (and without problems)
  • ping-pipe.yml - with pipelining

I'm reproducing the problem on RHEL8 also with ansible 2.9.1

ping.txt
ping-pipe.txt

Btw, I see the difference in debug output of ansible even with disabled pipelining:

<192.168.11.189> ESTABLISH SSH CONNECTION FOR USER: None
<192.168.11.189> SSH: EXEC ssh -q -o ControlMaster=no -o ControlPath=none -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 192.168.11.189 '/bin/sh -c '"'"'/usr/bin/python3.6 && sleep 0'"'"''
<192.168.11.189> (120, b'', b'')
<192.168.11.189> Failed to connect to the host via ssh:
[WARNING]: Unhandled error in Python interpreter discovery for host rhel8: Expecting value: line 1 column 1 (char 0)

But playbook keeps running after this 'failed to connect' and warning

While without session recording:

<192.168.11.189> ESTABLISH SSH CONNECTION FOR USER: None
<192.168.11.189> SSH: EXEC ssh -q -o ControlMaster=no -o ControlPath=none -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 192.168.11.189 '/bin/sh -c '"'"'/usr/bin/python3.6 && sleep 0'"'"''
<192.168.11.189> (0, b'{"platform_dist_result": ["redhat", ....

So even without pipelining tlog causes some warning for ansible

@AndreasDavour
Copy link

I see. Thanks @spbnick

@AndreasDavour
Copy link

@merlyel I take it we then can conclude (if I understood your test case) that there is some issue with ansible even with an empty message?

@merlyel
Copy link

merlyel commented Jan 7, 2020

No, empty notice message did not help

@spbnick
Copy link
Member

spbnick commented Jan 7, 2020

Just a wild stab at it: what if you completely remove the notice property from the configuration file?

@merlyel
Copy link

merlyel commented Jan 7, 2020

@spbnick then it falls to default one

@spbnick
Copy link
Member

spbnick commented Jan 7, 2020

@merlyel, could you try removing it from /usr/share/tlog/tlog-rec-session.conf as well?

@merlyel
Copy link

merlyel commented Jan 7, 2020

No, it did not help.
Discovered another interesting thing. Tried playing with the ssh command from the ansible verbose output.
Without recording:

$ echo 'test()'| ssh -q ... 192.168.11.189 '/bin/sh -c '"'"'/usr/bin/python3.6 && sleep 0'"'"
Traceback (most recent call last):
File "", line 1, in
NameError: name 'test' is not defined
$ echo $?
1

Return code is 1, which is correct

With recording enabled:

$ echo 'test()'| ssh -q ... 192.168.11.189 '/bin/sh -c '"'"'/usr/bin/python3.6 && sleep 0'"'"
$ echo $?
120

I see two differences:

  • return code is 120
  • no error output (caught by tlog?)

I tried searching for error code 120 and found this: Python 3.6 changelog:

Py_Exit() (and the main interpreter) now override the exit status with 120 if flushing buffered data failed. See bpo-5319.

So I tried changing to python2.7 (which RHEL7.6 has by default just like topicstarter). The result is slightly different. 120 is gone, return code are the same. But the error output is still missing:

Without recording:

$ echo 'test()'| ssh -q ... 192.168.11.189 '/bin/sh -c '"'"'/usr/bin/python2.7 && sleep 0'"'" ; echo $?
Traceback (most recent call last):
File "", line 1, in
NameError: name 'test' is not defined
1

With recording:

$ echo 'test()'| ssh -q ... 192.168.11.189 '/bin/sh -c '"'"'/usr/bin/python2.7 && sleep 0'"'" ; echo $?
1

WHere's the stderr?

Than switching to simple command cat:
No recording:

$ echo 'test'|ssh -q ... 192.168.11.189 '/bin/sh -c '"'"'cat '"'" ; echo $?
test
0

Recording:

$ echo 'test'| ssh -q ... 192.168.11.189 '/bin/sh -c '"'"'cat'"'" ; echo $?
141

141 is pipe fail

As ansible in pipeline mode uses pipes, then this pipe fail error may be the reason for the problem.

@justin-stephenson
Copy link
Collaborator

Hi, I have been looking into this over the past couple days - I believe the problem can be reproduced with the following command, where 'recuser' is a user configured with tlog-rec-session as their shell.

echo -e "print('hello world')\n" | ssh recuser@remotehost "python3.7"

More simply, and removing 'ssh' as a factor we can see the following does not output the expected command:

echo test | tlog-rec-session -c whoami

In my debugging thus far, the problem appears to be when tlog-rec-session reads poll()'d data from stdin and the child process(whoami output). The stdin file descriptor gets read first in tlog_tty_source_read()

(gdb) p pkt.data.io.buf
$12 = (uint8_t *) 0x56338754cb80 "root\r\necho test\r\n"

I am still looking into how best to fix this.

@justin-stephenson
Copy link
Collaborator

Hello all,

It would be great if someone is willing to test and verify this fixes the ansible pipelining issue using the COPR build below:

https://copr.fedorainfracloud.org/coprs/jstephen/tlog-copr/build/1338641/
https://copr.fedorainfracloud.org/coprs/jstephen/tlog-copr/

@trevor-vaughan
Copy link
Contributor

@justin-stephenson I can confirm that this appears to fix output pipelines.

@justin-stephenson
Copy link
Collaborator

thank you @trevor-vaughan !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants