
[nvmf_rpc] nvme connect timed out #3357

Open
ksztyber opened this issue Apr 26, 2024 · 2 comments

Comments

@ksztyber
Contributor

00:12:59.475  [2024-04-26 08:46:41.335058] rdma.c:3018:nvmf_rdma_listen: *NOTICE*: *** NVMe/RDMA Target Listening on 192.168.100.8 port 4420 ***
00:12:59.475   08:46:41	-- common/autotest_common.sh@577 -- # [[ 0 == 0 ]]
00:12:59.475   08:46:41	-- target/rpc.sh@84 -- # rpc_cmd nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 Malloc1 -n 5
00:12:59.475   08:46:41	-- common/autotest_common.sh@549 -- # xtrace_disable
00:12:59.475   08:46:41	-- common/autotest_common.sh@10 -- # set +x
00:12:59.475   08:46:41	-- common/autotest_common.sh@577 -- # [[ 0 == 0 ]]
00:12:59.475   08:46:41	-- target/rpc.sh@85 -- # rpc_cmd nvmf_subsystem_allow_any_host nqn.2016-06.io.spdk:cnode1
00:12:59.475   08:46:41	-- common/autotest_common.sh@549 -- # xtrace_disable
00:12:59.475   08:46:41	-- common/autotest_common.sh@10 -- # set +x
00:12:59.475   08:46:41	-- common/autotest_common.sh@577 -- # [[ 0 == 0 ]]
00:12:59.475   08:46:41	-- target/rpc.sh@86 -- # nvme connect -i 15 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:80bdebd3-4c74-ea11-906e-0017a4403562 --hostid=80bdebd3-4c74-ea11-906e-0017a4403562 -t rdma -n nqn.2016-06.io.spdk:cnode1 -a 192.168.100.8 -s 4420
00:13:00.411  [2024-04-26 08:46:42.272341] rdma.c:2871:nvmf_rdma_destroy: *ERROR*: transport wr pool count is 4095 but should be 2048
00:44:46.772  Cancelling nested steps due to timeout
00:44:46.775  Sending interrupt signal to process
00:44:53.163  Terminated

There are also some hardware errors reported in dmesg; not sure if they're related:

[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: pciehp: Slot(257-1): Link Down
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]: event severity: recoverable
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:  Error 0, type: recoverable
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   section_type: PCIe error
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   port_type: 4, root port
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   version: 3.0
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   command: 0x0547, status: 0x4010
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   device_id: 0000:5d:02.0
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   slot: 96
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   secondary_bus: 0x5e
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2032
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   class_code: 060400
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0003
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: AER: aer_status: 0x00004000, aer_mask: 0x00100000
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: pciehp: Slot(257-1): Card not present
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0:    [14] CmpltTO                (First)
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[Fri Apr 26 09:04:09 2024] vfio-pci 0000:5e:00.0: Relaying device request to user (#0)
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: AER: aer_uncor_severity: 0x000e2030
[Fri Apr 26 09:05:51 2024] vfio-pci 0000:5e:00.0: Relaying device request to user (#10)
[Fri Apr 26 09:06:56 2024] vfio-pci 0000:5e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[Fri Apr 26 09:06:56 2024] pcieport 0000:5d:02.0: AER: device recovery failed
[Fri Apr 26 09:07:09 2024] pcieport 0000:5d:02.0: pciehp: Slot(257-1): Card present
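To check whether the dmesg errors implicate a device the test actually uses, one way is to grep the PCI addresses out of the error records and compare them against the test's devices. A minimal sketch; the sample file and regex are illustrative, with a couple of lines excerpted from the report above:

```shell
# Illustrative input: a few dmesg lines from the failure above.
cat > dmesg.txt <<'EOF'
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   device_id: 0000:5d:02.0
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: AER: aer_status: 0x00004000, aer_mask: 0x00100000
[Fri Apr 26 09:04:09 2024] vfio-pci 0000:5e:00.0: Relaying device request to user (#0)
EOF

# Keep only hardware-error/AER records, then pull out the unique
# domain:bus:device.function addresses they mention.
grep -E 'Hardware Error|AER' dmesg.txt \
  | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' \
  | sort -u
```

Here only the root port `0000:5d:02.0` shows up in the error records themselves; the addresses can then be cross-checked with `lspci` on the node to see what sits behind that slot.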

Links to the failed CI build:

https://ci.spdk.io/results/autotest-per-patch/builds/121208/archive/nvmf-phy-autotest_70321/build.log
https://ci.spdk.io/public_build/autotest-per-patch_121208.html

@spdkci

spdkci commented Apr 26, 2024

Another instance of this failure. Reported by @ksztyber. log: https://ci.spdk.io/public_build/autotest-per-patch_121208.html

@mikeBashStuff
Contributor

These errors point at the slot where an NVMe drive is located (the one hooked up to the beetle instance). Unless this test was relying on that NVMe (I don't think that's the case, since it uses malloc bdevs through and through), this shouldn't be related. That said, it definitely shows overall instability of this node, so I'm going to offline it.
