Unable to get kernel image in netboot #7391

Open
arthur-miguel opened this issue Jun 23, 2023 · 3 comments
Hi,

I've been trying to set up a stateless cluster following the OpenHPC recipes on CentOS 8, but have run into an issue. When booting over PXE I'm able to get the boot script for the nodes, but requesting the kernel image returns an HTTP error, as shown in the image below:

[Image: 20230623_140341]

Running xcatprobe on the master node reports that everything is OK. The only warning arises because nameserver 150.xxx.x.x is on our public network, which has no access to the cluster's internal interface.

[root@mn nodes]# xcatprobe xcatmn -i enp8s0
[mn]: Checking all xCAT daemons are running...                                                                    [ OK ]
[mn]: Checking xcatd can receive command request...                                                               [ OK ]
[mn]: Checking 'site' table is configured...                                                                      [ OK ]
[mn]: Checking provision network is configured...                                                                 [ OK ]
[mn]: Checking 'passwd' table is configured...                                                                    [ OK ]
[mn]: Checking important directories(installdir,tftpdir) are configured...                                        [ OK ]
[mn]: Checking SELinux is disabled...                                                                             [ OK ]
[mn]: Checking HTTP service is configured...                                                                      [ OK ]
[mn]: Checking TFTP service is configured...                                                                      [ OK ]
[mn]: Checking DNS service is configured...                                                                       [WARN]
[mn]: DNS nameserver 150.xxx.x.x can not resolve 192.168.1.101
[mn]: Checking DHCP service is configured...                                                                      [ OK ]
[mn]: Checking NTP service is configured...                                                                       [ OK ]
[mn]: Checking rsyslog service is configured...                                                                   [ OK ]
[mn]: Checking firewall is disabled...                                                                            [ OK ]
[mn]: Checking minimum disk space for xCAT ['/tmp' needs 1GB;'/install' needs 10GB;'/var' needs 1GB]...           [ OK ]
[mn]: Checking Linux ulimits configuration...                                                                     [ OK ]
[mn]: Checking network kernel parameter configuration...                                                          [ OK ]
[mn]: Checking xCAT daemon attributes configuration...                                                            [ OK ]
[mn]: Checking xCAT log is stored in /var/log/xcat/cluster.log...                                                 [ OK ]
[mn]: Checking xCAT management node IP: <192.168.1.101> is configured to static...                                [ OK ]
[mn]: Checking dhcpd.leases file is less than 100M...                                                             [ OK ]
[mn]: Checking DB packages installation...                                                                        [ OK ]
=================================== SUMMARY ====================================
[MN]: Checking on MN...                                                                                           [ OK ]
    Checking DNS service is configured...                                                                         [WARN]
        DNS nameserver 150.162.1.1 can not resolve 192.168.1.101

The DHCP server also seems to be working fine:

[root@mn nodes]# xcatprobe  detect_dhcpd -i enp8s0 -m 18:66:da:1d:63:0e
Start to detect DHCP, please wait 10 seconds                                                                      [INFO]
++++++++++++++++++++++++++++++++++                                                                                [INFO]
There are 1 servers replied to dhcp discover.                                                                     [INFO]
    Server:192.168.1.101 assign IP [192.168.1.102]. The next server is [192.168.1.101]!                           [INFO]
++++++++++++++++++++++++++++++++++                                                                                [INFO]

I'm also able to access all the files that are required by the PXE boot script:

#!gpxe
#netboot centos-stream8-x86_64-compute
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel
imgload kernel
imgargs kernel imgurl=http://${next-server}:80//install/netboot/centos-stream8/x86_64/compute/rootimg.cpio.gz XCAT=${next-server}:3001 NODE=c1 FC=0 XCATHTTPPORT=80  console=tty0 console=ttyS0,115200 BOOTIF=01-${netX/machyp}
imgfetch -n initrd http://${next-server}:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/initrd-stateless.gz
imgexec kernel

Wireshark shows a TFTP error that aborts the transfer:

Capturing on 'enp8s0'
    1 0.000000000      0.0.0.0 → 255.255.255.255 DHCP 590 DHCP Discover - Transaction ID 0xda1d630e
    2 0.000200027 192.168.1.101 → 255.255.255.255 DHCP 342 DHCP Offer    - Transaction ID 0xda1d630e
    3 4.008654276      0.0.0.0 → 255.255.255.255 DHCP 590 DHCP Request  - Transaction ID 0xda1d630e
    4 4.008908360 192.168.1.101 → 255.255.255.255 DHCP 342 DHCP ACK      - Transaction ID 0xda1d630e
    5 4.016202763 Dell_1d:63:0e → Broadcast    ARP 60 Who has 192.168.1.101? Tell 192.168.1.102
    6 4.016221132 9c:53:22:48:50:bb → Dell_1d:63:0e ARP 42 192.168.1.101 is at 9c:53:22:48:50:bb
    7 4.016258287 192.168.1.102 → 192.168.1.101 TFTP 73 Read Request, File: xcat/xnba.kpxe, Transfer type: octet, tsize=0
    8 4.019157973 192.168.1.101 → 192.168.1.102 TFTP 56 Option Acknowledgement, tsize=67650
    9 4.019200647 192.168.1.102 → 192.168.1.101 TFTP 60 Error Code, Code: Not defined, Message: TFTP Aborted
   10 4.021014854 192.168.1.102 → 192.168.1.101 TFTP 78 Read Request, File: xcat/xnba.kpxe, Transfer type: octet, blksize=1456
   11 4.023521541 192.168.1.101 → 192.168.1.102 TFTP 57 Option Acknowledgement, blksize=1456

If you have any idea what the cause might be, I'd be very glad to hear it. Thanks.

arthur-miguel changed the title from "Unable to get kernel image of netboot image" to "Unable to get kernel image in netboot" on Jun 23, 2023
samveen (Member) commented Jun 26, 2023

From any other machine on the same network segment as the failing node, would you try running wget or curl for http://192.168.1.101:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel ? (and the rootimg and initrd too)
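
The check suggested above can also be scripted. The following is a minimal sketch, not an xCAT tool: it assumes Python 3 is available on a machine in the same network segment, takes the three file paths from the boot script quoted earlier in this issue (with the doubled slash before `install` collapsed to one), and uses the hypothetical helper names `boot_urls`, `probe` and `report`. It sends a HEAD request per file and reports the HTTP status, so it only approximates what xNBA does with its GET requests.

```python
import urllib.error
import urllib.request

# Paths copied from the boot script quoted in this issue.
BOOT_PATHS = [
    "/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel",
    "/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/initrd-stateless.gz",
    "/install/netboot/centos-stream8/x86_64/compute/rootimg.cpio.gz",
]


def boot_urls(next_server, port=80):
    """Build the URLs the boot script requests, with ${next-server} filled in."""
    return [f"http://{next_server}:{port}{path}" for path in BOOT_PATHS]


def probe(url, timeout=10):
    """Send a HEAD request; return (status, detail).

    status is the HTTP status code, or None if no HTTP response was received.
    detail is the Content-Length header on success, otherwise the error text.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Length")
    except urllib.error.HTTPError as err:  # server answered with 4xx/5xx
        return err.code, str(err.reason)
    except OSError as err:  # connection refused, timeout, DNS failure, ...
        return None, str(err)


def report(next_server):
    """Print one line per boot file: URL, status, detail."""
    for url in boot_urls(next_server):
        status, detail = probe(url)
        print(url, status, detail)
```

Calling `report("192.168.1.101")` from a client on the provisioning network would print one line per file. A 200 status for all three would suggest that the web server itself serves the files correctly to ordinary HTTP clients.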

arthur-miguel (Author) commented

Thanks for the reply @samveen.

> From any other machine on the same network segment as the failing node, would you try running wget or curl for http://192.168.1.101:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel ? (and the rootimg and initrd too)

Yes, I'm able to get the files from a client on the same network, both via wget and tftp. It seems that only xNBA isn't able to get the files, which seems very odd to me.

[root@client ~]# wget http://192.168.1.101:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel
--2023-06-26 11:37:07--  http://192.168.1.101/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel
Connecting to 192.168.1.101:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10876640 (10M)
Saving to: ‘kernel’

kernel                                  100%[==============================================================================>]  10.37M  --.-KB/s    in 0.09s   

2023-06-26 11:37:07 (112 MB/s) - ‘kernel’ saved [10876640/10876640]
[root@client ~]# tftp 192.168.1.101 -v
Connected to 192.168.1.101 (192.168.1.101), port 69
tftp> status
Connected to 192.168.1.101.
Mode: netascii Verbose: on Tracing: off Literal: off
Rexmt-interval: 5 seconds, Max-timeout: 25 seconds
tftp> get xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel
getting from 192.168.1.101:xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel to kernel [netascii]
Received 10953854 bytes in 3.2 seconds [27783387 bit/s]

The same occurs for rootimg and initrd.

samveen (Member) commented Jun 27, 2023

Looking at the error code as listed by xnba (iPXE), there seems to be something going on with httpd on the master when xnba requests the URL for the kernel, which causes the failure in the xnba HTTP core (in net/tcp/httpcore.c). Could you check the web server logs on the master to see what might be causing the requests to fail?
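
One way to narrow down the log check is to filter the access log for failed requests coming from the booting node. The sketch below is illustrative only. It assumes the log uses Apache's common or combined log format, and that it lives at the default httpd path on CentOS (`/var/log/httpd/access_log`); both may differ on a given master. `failed_requests` is a hypothetical helper name, and 192.168.1.102 is the address the node was assigned in the DHCP probe earlier in this issue.

```python
import re

# Leading fields of Apache's common/combined log format:
# client identd user [time] "request" status size ...
LOG_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def failed_requests(lines, client_ip):
    """Yield (time, request, status) for requests from client_ip with status >= 400."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # line is not in the expected format
        if match["client"] != client_ip:
            continue
        status = int(match["status"])
        if status >= 400:
            yield match["time"], match["request"], status


# Example usage on the master (path is an assumption, see above):
#
# with open("/var/log/httpd/access_log") as log:
#     for entry in failed_requests(log, "192.168.1.102"):
#         print(entry)
```

If nothing from the node shows up in the access log at all, the request may not be reaching httpd, and the error log in the same directory would be the next place to look.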
