
[nvmf_rpc] nvme connect timed out #3357

Open
ksztyber opened this issue Apr 26, 2024 · 2 comments

Comments

@ksztyber
Contributor

00:12:59.475  [2024-04-26 08:46:41.335058] rdma.c:3018:nvmf_rdma_listen: *NOTICE*: *** NVMe/RDMA Target Listening on 192.168.100.8 port 4420 ***
00:12:59.475   08:46:41	-- common/autotest_common.sh@577 -- # [[ 0 == 0 ]]
00:12:59.475   08:46:41	-- target/rpc.sh@84 -- # rpc_cmd nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 Malloc1 -n 5
00:12:59.475   08:46:41	-- common/autotest_common.sh@549 -- # xtrace_disable
00:12:59.475   08:46:41	-- common/autotest_common.sh@10 -- # set +x
00:12:59.475   08:46:41	-- common/autotest_common.sh@577 -- # [[ 0 == 0 ]]
00:12:59.475   08:46:41	-- target/rpc.sh@85 -- # rpc_cmd nvmf_subsystem_allow_any_host nqn.2016-06.io.spdk:cnode1
00:12:59.475   08:46:41	-- common/autotest_common.sh@549 -- # xtrace_disable
00:12:59.475   08:46:41	-- common/autotest_common.sh@10 -- # set +x
00:12:59.475   08:46:41	-- common/autotest_common.sh@577 -- # [[ 0 == 0 ]]
00:12:59.475   08:46:41	-- target/rpc.sh@86 -- # nvme connect -i 15 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:80bdebd3-4c74-ea11-906e-0017a4403562 --hostid=80bdebd3-4c74-ea11-906e-0017a4403562 -t rdma -n nqn.2016-06.io.spdk:cnode1 -a 192.168.100.8 -s 4420
00:13:00.411  [2024-04-26 08:46:42.272341] rdma.c:2871:nvmf_rdma_destroy: *ERROR*: transport wr pool count is 4095 but should be 2048
00:44:46.772  Cancelling nested steps due to timeout
00:44:46.775  Sending interrupt signal to process
00:44:53.163  Terminated

There are also some hardware errors reported in dmesg; not sure if they're related:

[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: pciehp: Slot(257-1): Link Down
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]: event severity: recoverable
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:  Error 0, type: recoverable
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   section_type: PCIe error
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   port_type: 4, root port
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   version: 3.0
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   command: 0x0547, status: 0x4010
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   device_id: 0000:5d:02.0
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   slot: 96
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   secondary_bus: 0x5e
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2032
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   class_code: 060400
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0003
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: AER: aer_status: 0x00004000, aer_mask: 0x00100000
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: pciehp: Slot(257-1): Card not present
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0:    [14] CmpltTO                (First)
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[Fri Apr 26 09:04:09 2024] vfio-pci 0000:5e:00.0: Relaying device request to user (#0)
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: AER: aer_uncor_severity: 0x000e2030
[Fri Apr 26 09:05:51 2024] vfio-pci 0000:5e:00.0: Relaying device request to user (#10)
[Fri Apr 26 09:06:56 2024] vfio-pci 0000:5e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[Fri Apr 26 09:06:56 2024] pcieport 0000:5d:02.0: AER: device recovery failed
[Fri Apr 26 09:07:09 2024] pcieport 0000:5d:02.0: pciehp: Slot(257-1): Card present
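To check whether the dmesg errors implicate a device the test actually uses, one way is to grep the PCI addresses out of the error records and compare them against the test's devices. A minimal sketch; the sample file and regex are illustrative, with a couple of lines excerpted from the report above:

```shell
# Illustrative input: a few dmesg lines from the failure above.
cat > dmesg.txt <<'EOF'
[Fri Apr 26 09:04:09 2024] {3}[Hardware Error]:   device_id: 0000:5d:02.0
[Fri Apr 26 09:04:09 2024] pcieport 0000:5d:02.0: AER: aer_status: 0x00004000, aer_mask: 0x00100000
[Fri Apr 26 09:04:09 2024] vfio-pci 0000:5e:00.0: Relaying device request to user (#0)
EOF

# Keep only hardware-error/AER records, then pull out the unique
# domain:bus:device.function addresses they mention.
grep -E 'Hardware Error|AER' dmesg.txt \
  | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' \
  | sort -u
```

Here only the root port `0000:5d:02.0` shows up in the error records themselves; the addresses can then be cross-checked with `lspci` on the node to see what sits behind that slot.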

Links to the failed CI build:

https://ci.spdk.io/results/autotest-per-patch/builds/121208/archive/nvmf-phy-autotest_70321/build.log
https://ci.spdk.io/public_build/autotest-per-patch_121208.html

@spdkci

spdkci commented Apr 26, 2024

Another instance of this failure. Reported by @ksztyber. log: https://ci.spdk.io/public_build/autotest-per-patch_121208.html

@mikeBashStuff
Contributor

These errors point at the slot where an NVMe drive is located (the one hooked up to the beetle instance). Unless this test was relying on that NVMe (I don't think that's the case, since it uses malloc bdevs through and through), this shouldn't be related. That said, it definitely shows overall instability of this node, so I'm going to offline it.
