postscript remoteshell loops for a long time #7430

Open
OliverTUBAF opened this issue Mar 25, 2024 · 3 comments

@OliverTUBAF

This fault is probably on my side, but I cannot figure out what the cause is. Hopefully you can point me in the right direction. When a node boots, it eventually reaches the state "xcat.deployment.postbootscript: postbootscript start..: remoteshell". While running this script it falls into a loop because it cannot retrieve the ssh keys from the xCAT server. I enabled xcatdebug to shed some light on what is going on. This is what I see in /var/log/xcat/xcat.log on the compute node (the client):

Mon Mar 25 08:30:47 CET 2024 [info]: xcat.deployment.postbootscript: postbootscript start..: remoteshell
+ '[' -n xcat.deployment.postbootscript ']'
+ log_label=xcat.deployment.postbootscript
+ umask 0077
+ '[' -f /etc/os-release ']'
+ cat /etc/os-release
+ grep -i -e '^NAME=[ "'\'']*Cumulus Linux[ "'\'']*$'
++ uname -s
++ tr A-Z a-z
+ '[' linux = linux ']'
++ dirname ./remoteshell
+ str_dir_name=.
+ . ./xcatlib.sh
++ declare -a array_nic_params
++ declare -a array_extra_param_names
++ declare -a array_extra_param_values
+ '[' -e /etc/xCATMN ']'
+ '[' -n '' ']'
++ uname -s
+ '[' Linux = AIX ']'
+ master=10.10.0.5
+ useflowcontrol=0
+ '[' '' = YES ']'
+ '[' '' = yes ']'
+ '[' '' = 1 ']'
+ '[' -r /etc/ssh/sshd_config ']'
+ logger -t xcat.deployment.postbootscript -p local4.info 'remoteshell:  setup /etc/ssh/sshd_config and ssh_config'
+ cp /etc/ssh/sshd_config /etc/ssh/sshd_config.ORIG
+ sed -i '/X11Forwarding /d' /etc/ssh/sshd_config
+ echo 'X11Forwarding yes'
+ sed -i '/MaxStartups /d' /etc/ssh/sshd_config
+ echo 'MaxStartups 1024'
+ '[' '' = 1 ']'
+ '[' -r /etc/ssh/ssh_config ']'
+ sed -i '/StrictHostKeyChecking /d' /etc/ssh/ssh_config
+ echo 'StrictHostKeyChecking no'
+ xcatpost=xcatpost
+ '[' -d /xcatpost/_ssh ']'
+ logger -p local4.info -t xcat.deployment.postbootscript 'Install: setup root .ssh'
+ cd /xcatpost/_ssh
+ mkdir -p /root/.ssh
+ cp -f authorized_keys copy.sh /root/.ssh
+ cd -
+ chmod 700 /root/.ssh
+ chmod 600 /root/.ssh/authorized_keys /root/.ssh/copy.sh
+ '[' '!' -x /usr/bin/openssl ']'
+ CREDPID=3021
+ sleep 1
+ allowcred.awk
+ '[' 0 = 1 ']'
+ getcredentials.awk ssh_dsa_hostkey
+ grep -E -v '</{0,1}xcatresponse>|</{0,1}serverdone>'
+ sed -e 's/&lt;/</' -e 's/&gt;/>/' -e 's/&amp;/&/' -e 's/&quot/"/' -e 's/&apos;/'\''/'
+ grep -E '<error>' /tmp/ssh_dsa_hostkey
+ '[' 1 -ne 0 ']'
+ cat /tmp/ssh_dsa_hostkey
+ grep -E -v '</{0,1}errorcode>|/{0,1}data>|</{0,1}content>|</{0,1}desc>'
+ logger -t xcat.deployment.postbootscript -p local4.info 'remoteshell: getting ssh_host_dsa_key'
+ MAX_RETRIES=10
+ RETRY=0
++ cat /etc/ssh/ssh_host_dsa_key
+ MYCONT=
+ '[' -z '' ']'
+ '[' 0 = 0 ']'
+ let SLI=31275%10
+ let SLI=SLI+10
+ sleep 15
+ RETRY=1
+ '[' 1 -eq 10 ']'
+ '[' 0 = 1 ']'
+ getcredentials.awk ssh_dsa_hostkey
+ grep -v '<'
+ sed -e 's/&lt;/</' -e 's/&gt;/>/' -e 's/&amp;/&/' -e 's/&quot/"/' -e 's/&apos;/'\''/'
++ cat /etc/ssh/ssh_host_dsa_key
+ MYCONT=
+ '[' -z '' ']'
+ '[' 0 = 0 ']'
+ let SLI=13752%10
+ let SLI=SLI+10
+ sleep 12
+ RETRY=2
+ '[' 2 -eq 10 ']'
+ '[' 0 = 1 ']'
+ getcredentials.awk ssh_dsa_hostkey
+ grep -v '<'
+ sed -e 's/&lt;/</' -e 's/&gt;/>/' -e 's/&amp;/&/' -e 's/&quot/"/' -e 's/&apos;/'\''/'
++ cat /etc/ssh/ssh_host_dsa_key
+ MYCONT=
+ '[' -z '' ']'
+ '[' 0 = 0 ']'
+ let SLI=341%10
+ let SLI=SLI+10
+ sleep 11
+ RETRY=3
+ '[' 3 -eq 10 ']'
+ '[' 0 = 1 ']'
+ getcredentials.awk ssh_dsa_hostkey
+ grep -v '<'
+ sed -e 's/&lt;/</' -e 's/&gt;/>/' -e 's/&amp;/&/' -e 's/&quot/"/' -e 's/&apos;/'\''/'
++ cat /etc/ssh/ssh_host_dsa_key
+ MYCONT=
+ '[' -z '' ']'
+ '[' 0 = 0 ']'
+ let SLI=29739%10
+ let SLI=SLI+10
+ sleep 19
+ RETRY=4
+ '[' 4 -eq 10 ']'
...
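
Read from the trace above, the retry logic appears to be: fetch the key, and while /etc/ssh/ssh_host_dsa_key is still empty, sleep a random 10 to 19 seconds and fetch again, giving up after MAX_RETRIES rounds. A minimal sketch of that loop follows. It is reconstructed from the xtrace only and is not the actual remoteshell source; `backoff_seconds`, `fetch_key`, `retry_fetch`, and `KEYFILE` are hypothetical names.

```shell
#!/bin/bash
# Sketch reconstructed from the xtrace; NOT the real remoteshell code.
MAX_RETRIES=${MAX_RETRIES:-10}
KEYFILE=${KEYFILE:-/etc/ssh/ssh_host_dsa_key}   # file the trace reads with cat

# Hypothetical helper: maps a random number to a 10..19 second delay,
# matching "let SLI=31275%10; let SLI=SLI+10; sleep 15" in the trace.
backoff_seconds() {
    echo $(( $1 % 10 + 10 ))
}

# Hypothetical stand-in for the getcredentials.awk | grep | sed pipeline.
fetch_key() {
    :   # the real script would write the key material here on success
}

# Retries until the key file is non-empty or MAX_RETRIES is reached.
retry_fetch() {
    local retry=0
    fetch_key
    while [ ! -s "$KEYFILE" ]; do
        sleep "$(backoff_seconds "$RANDOM")"
        retry=$((retry + 1))
        [ "$retry" -eq "$MAX_RETRIES" ] && return 1
        fetch_key
    done
    return 0
}
```

If the loop really has this shape, MAX_RETRIES=10 means ten sleeps of 10 to 19 seconds, so roughly 100 to 190 seconds of waiting for each key that the server refuses to hand out.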

Meanwhile the server logs:

Mar 25 08:40:04 mgmtnode xcat[263816]: DEBUG xcatd: connection from node035
Mar 25 08:40:04 mgmtnode xcat[263816]: DEBUG xcatd: open new process : xcatd SSL: getcredentials for node035
Mar 25 08:40:04 mgmtnode xcat[263816]: INFO xCAT: Allowing getcredentials ssh_host_dsa_key from node035
Mar 25 08:40:04 mgmtnode xcat[263817]: DEBUG xcatd: dispatch request 'getcredentials ssh_host_dsa_key' to plugin 'credentials'
Mar 25 08:40:04 mgmtnode xcat[263817]: DEBUG xcatd: handle request 'getcredentials' by plugin 'credentials''s process_request
Mar 25 08:40:04 mgmtnode xcat[263817]: ERR The node (node035) is not ready, ignore it.
Mar 25 08:40:05 mgmtnode xcat[263816]: DEBUG xcatd: close connection with node035
Mar 25 08:40:15 mgmtnode xcat[263824]: DEBUG xcatd: connection from node035
Mar 25 08:40:15 mgmtnode xcat[263824]: DEBUG xcatd: open new process : xcatd SSL: getcredentials for node035
Mar 25 08:40:15 mgmtnode xcat[263824]: INFO xCAT: Allowing getcredentials ssh_host_dsa_key from node035
Mar 25 08:40:15 mgmtnode xcat[263825]: DEBUG xcatd: dispatch request 'getcredentials ssh_host_dsa_key' to plugin 'credentials'
Mar 25 08:40:15 mgmtnode xcat[263825]: DEBUG xcatd: handle request 'getcredentials' by plugin 'credentials''s process_request
Mar 25 08:40:15 mgmtnode xcat[263825]: ERR The node (node035) is not ready, ignore it.
Mar 25 08:40:15 mgmtnode xcat[263824]: DEBUG xcatd: close connection with node035

Searching the web, I found that the following command is what the remoteshell script runs. Unfortunately, running it manually gives an empty result:

USEOPENSSLFORXCAT=yes XCATSERVER=10.10.0.5:3001 /xcatpost/getcredentials.awk ssh_dsa_hostkey
<xcatresponse>
  <serverdone></serverdone>
</xcatresponse>
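
The response above contains neither an `<error>` element nor any payload. To tell these cases apart when repeating the manual call, a small helper along these lines could be used. This is only a sketch: `classify_response` is a hypothetical name, the `<error>` check mirrors the grep seen in the trace, and treating a `<content>` element as the sign of a successful reply is an assumption based on the tags the trace filters out.

```shell
#!/bin/bash
# Hypothetical helper: prints "error", "ok", or "empty" for a saved response.
# Assumption: a successful reply carries a <content> element.
classify_response() {
    if grep -q '<error>' "$1"; then
        echo error
    elif grep -q '<content>' "$1"; then
        echo ok
    else
        echo empty
    fi
}
```

Saving the output of the manual getcredentials.awk call to a file and passing that file to `classify_response` would print `empty` for the response shown above.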

While the loop runs, I can check the /tmp directory: the key file is there, but empty (probably because empty data was redirected to that file):

[root@node035 ~]# ll /tmp/
-rwxr-xr-x 1 root root  39609 Mar 25 09:30 jjFPgeyQIO.dsh
drwxr-xr-x 2 root root     40 Mar 25 08:30 postage
-rw------- 1 root root      0 Mar 25 09:30 ssh_dsa_hostkey
drwx------ 3 root root     60 Mar 25 09:30 systemd-private-4eea0db9428746cc9942ea2c6e404a84-chronyd.service-XKbOBQ
-rw-r--r-- 1 root root 101021 Mar 25 09:30 wget.log

So I tried to predeploy the keys into the image via syncfiles. This worked, in that I can ssh into the node while it boots, but the loop still persists, so I guess the problem is not the key itself. The correct keys are in fact still there when the node has finally finished booting; I guess this is because a final "syncfiles" run at boot time overwrites the keys that were freshly generated as a result of the failing remoteshell script.

What additional information could I provide to help fix this issue?

Thank you in advance!

@rlcto

rlcto commented Apr 15, 2024

I'm seeing the same or a similar issue and am trying to debug it now. For me it looks like an issue with getcredentials.awk, but I'm not getting much debug info. For now, I have edited the remoteshell script and set MAX_RETRIES from 10 to 1, which cuts the wait down to a more reasonable amount. Not sure if you've come up with a different workaround or figured out what's going on.
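
The MAX_RETRIES workaround described above can be sketched as a one-line edit. `set_max_retries` is a hypothetical helper name, the pattern assumes the script contains an assignment of the form `MAX_RETRIES=10` as seen in the trace, and the location of the remoteshell script on the management node (commonly under /install/postscripts) is an assumption to verify locally.

```shell
#!/bin/bash
# Hypothetical helper: set MAX_RETRIES in a script, keeping a .bak copy.
# $1 = path to the script, $2 = new retry count
set_max_retries() {
    sed -i.bak -E "s/^([[:space:]]*MAX_RETRIES=)[0-9]+/\1$2/" "$1"
}
```

This only shortens the wait. It does not address why the server answers "The node (node035) is not ready, ignore it."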

@samveen
Member

samveen commented Apr 16, 2024

Would you add the output of the following to the initial post, for additional info:

  • lsdef node035
  • nodestat node035

@OliverTUBAF
Author

Thank you for getting back to us. Here is the requested output:

lsdef node035

Object name: node035
    appstatus=xend=down,sshd=up,rdp=down,https=down,pbs=up,msrpc=down
    appstatustime=07-09-2020 06:33:56
    arch=x86_64
    bmc=node035.ipmi
    bmcport=3
    chain=runcmd=bmcsetup,shell
    chassis=MyChassisID-123
    consoleenabled=1
    currchain=shell
    currstate=netboot rocky8.8-x86_64-compute
    groups=compute_192,compute,intel,rack02,all,compute_40c
    height=1
    ip=10.10.1.35
    mac=a4:bf:01:47:8b:c1!node035.cluster|a4:bf:01:47:8b:c5!node035.ipmi
    mgt=ipmi
    netboot=xnba
    nicaliases.ib0=node035
    nicaliases.ipmi=node035
    nicips.eno1=10.10.1.35
    nicips.ipmi=10.11.1.35
    nicips.ib0=10.12.1.35
    nicnetworks.eno1=10_10_0_0-255_255_0_0
    nicnetworks.ipmi=10_11_0_0-255_255_0_0
    nicnetworks.ib0=10_12_0_0-255_255_0_0
    nictypes.eno1=Ethernet
    nictypes.ipmi=bmc
    nictypes.ib0=Infiniband
    os=rocky8.8
    postbootscripts=otherpkgs,setroute
    postscripts=setupntp,syslog,remoteshell,syncfiles,org_final,confignetwork,setroute
    profile=compute
    provmethod=rocky8.8-x86_64-netboot-compute
    rack=rack02
    routenames=defgw10_10_0_5
    serial=253089-1
    serialport=0
    serialspeed=115200
    slot=1
    status=booted
    statustime=03-26-2024 16:03:39
    unit=35
    updatestatus=failed
    updatestatustime=03-26-2024 15:38:18

nodestat node035

node035: sshd
