
[vhost_blockdev] "bdev_nvme_detach_controller" Nvme0 failed #3369

Open
m-spiewak opened this issue May 7, 2024 · 3 comments
m-spiewak (Contributor)

CI Intermittent Failure

00:10:08.097 ************************************
00:10:08.097 START TEST vhost_blockdev
00:10:08.097 ************************************

Link to the failed CI build

https://ci.spdk.io/public_build/autotest-per-patch_121662.html

Execution failed at

00:10:47.185   15:40:45	-- common/autotest_common.sh@10 -- # set +x
00:10:47.185  [2024-05-06 15:40:45.451661] lun.c: 398:bdev_event_cb: *NOTICE*: bdev name (Nvme0n1p0) received event(SPDK_BDEV_EVENT_REMOVE)
00:10:47.185  [2024-05-06 15:40:45.451766] lun.c: 398:bdev_event_cb: *NOTICE*: bdev name (Nvme0n1p1) received event(SPDK_BDEV_EVENT_REMOVE)
00:10:47.185  [2024-05-06 15:40:45.451791] lun.c: 398:bdev_event_cb: *NOTICE*: bdev name (Nvme0n1p2) received event(SPDK_BDEV_EVENT_REMOVE)
00:10:47.185  [2024-05-06 15:40:45.451815] lun.c: 398:bdev_event_cb: *NOTICE*: bdev name (Nvme0n1p3) received event(SPDK_BDEV_EVENT_REMOVE)
00:10:47.185  [2024-05-06 15:40:45.451848] vhost_blk.c:1218:vhost_user_bdev_remove_cb: *WARNING*: naa.Nvme0n1_blk0.0: hot-removing bdev - all further requests will fail.
00:10:47.185  [2024-05-06 15:40:45.451864] vhost_blk.c:1218:vhost_user_bdev_remove_cb: *WARNING*: naa.Nvme0n1_blk1.0: hot-removing bdev - all further requests will fail.
00:11:02.124   15:41:00	-- common/autotest_common.sh@587 -- # [[ '' == 0 ]]
00:11:02.124    15:41:00	-- common/autotest_common.sh@587 -- # trap - ERR
00:11:02.124    15:41:00	-- common/autotest_common.sh@587 -- # print_backtrace
00:11:02.124    15:41:00	-- common/autotest_common.sh@1149 -- # [[ ehxBET =~ e ]]
00:11:02.124    15:41:00	-- common/autotest_common.sh@1151 -- # args=('Nvme0' 'bdev_nvme_detach_controller')
00:11:02.124    15:41:00	-- common/autotest_common.sh@1151 -- # local args
00:11:02.124    15:41:00	-- common/autotest_common.sh@1153 -- # xtrace_disable
00:11:02.124    15:41:00	-- common/autotest_common.sh@10 -- # set +x
00:11:02.124  ========== Backtrace start: ==========
00:11:02.124  
00:11:02.124  in /home/vagrant/spdk_repo/spdk/test/common/autotest_common.sh:587 -> rpc_cmd(["bdev_nvme_detach_controller"],["Nvme0"])
00:11:02.124       ...
00:11:02.124     582 			echo "$rsp"
00:11:02.124     583 		done
00:11:02.124     584 	
00:11:02.124     585 		rc=${!status[*]}
00:11:02.124     586 		xtrace_restore
00:11:02.124  => 587 		[[ $rc == 0 ]]
00:11:02.124     588 	}
00:11:02.124     589 	
00:11:02.124     590 	function rpc_cmd_simple_data_json() {
00:11:02.124     591 	
00:11:02.124     592 		local elems="$1[@]" elem
00:11:02.124       ...
00:11:02.124  in /home/vagrant/spdk_repo/spdk/test/vhost/initiator/blockdev.sh:86 -> main([])
00:11:02.124       ...
00:11:02.124     81  	
00:11:02.124     82  	timing_enter run_spdk_fio_unmap
00:11:02.124     83  	run_spdk_fio $testdir/bdev.fio --filename="VirtioScsi1t0:VirtioScsi2t0" --spdk_json_conf=$testdir/bdev.json
00:11:02.124     84  	timing_exit run_spdk_fio_unmap
00:11:02.124     85  	
00:11:02.124  => 86  	rpc_cmd bdev_nvme_detach_controller Nvme0
00:11:02.124     87  	
00:11:02.124     88  	trap - SIGINT SIGTERM EXIT
00:11:02.124     89  	rm -f $testdir/bdev.json
00:11:02.124     90  	
00:11:02.124     91  	killprocess $vhost_pid
00:11:02.124       ...
00:11:02.124  
00:11:02.124  ========== Backtrace end ==========
00:11:02.124    15:41:00	-- common/autotest_common.sh@1190 -- # return 0
00:11:02.124   15:41:00	-- common/autotest_common.sh@1 -- # err_cleanup
00:11:02.124   15:41:00	-- initiator/blockdev.sh@16 -- # rm -f /home/vagrant/spdk_repo/spdk/test/vhost/initiator/bdev.json
00:11:02.124   15:41:00	-- initiator/blockdev.sh@17 -- # killprocess 62382
00:11:02.124   15:41:00	-- common/autotest_common.sh@946 -- # '[' -z 62382 ']'
00:11:02.124   15:41:00	-- common/autotest_common.sh@950 -- # kill -0 62382
00:11:02.124    15:41:00	-- common/autotest_common.sh@951 -- # uname
00:11:02.124   15:41:00	-- common/autotest_common.sh@951 -- # '[' Linux = Linux ']'
00:11:02.124    15:41:00	-- common/autotest_common.sh@952 -- # ps --no-headers -o comm= 62382
00:11:02.124   15:41:00	-- common/autotest_common.sh@952 -- # process_name=reactor_1
00:11:02.124   15:41:00	-- common/autotest_common.sh@956 -- # '[' reactor_1 = sudo ']'
00:11:02.124  killing process with pid 62382
00:11:02.124   15:41:00	-- common/autotest_common.sh@964 -- # echo 'killing process with pid 62382'
00:11:02.124   15:41:00	-- common/autotest_common.sh@965 -- # kill 62382
00:11:02.124   15:41:00	-- common/autotest_common.sh@970 -- # wait 62382
00:11:03.057  [2024-05-06 15:41:02.800469] bdev_nvme.c:6330:bdev_nvme_delete_complete_poll: *ERROR*: NVMe path 'Nvme0' still exists after delete
00:11:03.624   15:41:03	-- initiator/blockdev.sh@18 -- # [[ -n 62452 ]]
00:11:03.624   15:41:03	-- initiator/blockdev.sh@18 -- # kill -0 62452
00:11:03.624   15:41:03	-- common/autotest_common.sh@1 -- # exit 1
00:11:03.624    15:41:03	-- common/autotest_common.sh@1121 -- # trap - ERR
00:11:03.624    15:41:03	-- common/autotest_common.sh@1121 -- # print_backtrace
00:11:03.624    15:41:03	-- common/autotest_common.sh@1149 -- # [[ ehxBET =~ e ]]
00:11:03.624    15:41:03	-- common/autotest_common.sh@1151 -- # args=('/home/vagrant/spdk_repo/spdk/test/vhost/initiator/blockdev.sh' 'vhost_blockdev' '/home/vagrant/spdk_repo/autorun-spdk.conf')
00:11:03.624    15:41:03	-- common/autotest_common.sh@1151 -- # local args
00:11:03.624    15:41:03	-- common/autotest_common.sh@1153 -- # xtrace_disable
00:11:03.624    15:41:03	-- common/autotest_common.sh@10 -- # set +x
00:11:03.624  ========== Backtrace start: ==========
00:11:03.624  
00:11:03.624  in /home/vagrant/spdk_repo/spdk/test/common/autotest_common.sh:1121 -> run_test(["vhost_blockdev"],["/home/vagrant/spdk_repo/spdk/test/vhost/initiator/blockdev.sh"])
00:11:03.624       ...
00:11:03.624     1116		timing_enter $test_name
00:11:03.624     1117		echo "************************************"
00:11:03.624     1118		echo "START TEST $test_name"
00:11:03.624     1119		echo "************************************"
00:11:03.624     1120		xtrace_restore
00:11:03.624     1121		time "$@"
00:11:03.624     1122		xtrace_disable
00:11:03.624     1123		echo "************************************"
00:11:03.624     1124		echo "END TEST $test_name"
00:11:03.624     1125		echo "************************************"
00:11:03.624     1126		timing_exit $test_name
00:11:03.624       ...
00:11:03.624  in /home/vagrant/spdk_repo/spdk/autotest.sh:321 -> main(["/home/vagrant/spdk_repo/autorun-spdk.conf"])
00:11:03.624       ...
00:11:03.624     316 			run_test "blob_io_wait" $rootdir/test/blobstore/blob_io_wait/blob_io_wait.sh
00:11:03.624     317 		fi
00:11:03.624     318 	
00:11:03.624     319 		if [ $SPDK_TEST_VHOST_INIT -eq 1 ]; then
00:11:03.624     320 			timing_enter vhost_initiator
00:11:03.624  => 321 			run_test "vhost_blockdev" $rootdir/test/vhost/initiator/blockdev.sh
00:11:03.624     322 			run_test "spdkcli_virtio" $rootdir/test/spdkcli/virtio.sh
00:11:03.624     323 			run_test "vhost_shared" $rootdir/test/vhost/shared/shared.sh
00:11:03.624     324 			run_test "vhost_fuzz" $rootdir/test/vhost/fuzz/fuzz.sh
00:11:03.624     325 			timing_exit vhost_initiator
00:11:03.624     326 		fi
00:11:03.624       ...
00:11:03.624  
00:11:03.624  ========== Backtrace end ==========
00:11:03.624    15:41:03	-- common/autotest_common.sh@1190 -- # return 0
00:11:03.624  

spdkci commented May 7, 2024

Another instance of this failure. Reported by @m-spiewak. log: https://ci.spdk.io/public_build/autotest-per-patch_121662.html

mikeBashStuff (Contributor)

Similar to #3295.

tomzawadzki (Contributor) commented May 28, 2024

Similar to #3295.

How come? A timeout from taking longer than the default can occur for any RPC, yet it is a valid occurrence only for a few of them, such as those that actually issue I/O.

In this instance it looks like, after removal of the NVMe controller (implying removal of the six split bdevs on it), four vhost scsi controllers and two vhost blk controllers do get a notification. This is just cleanup at the end of the test, since any prior fio runs should have finished already. Yet bdev_nvme_detach_controller() does not complete in time.

00:10:47.185  VHOST_CONFIG: (/home/vagrant/spdk_repo/spdk/naa.Malloc1.0) vhost peer closed
00:10:47.185   15:40:45	-- initiator/blockdev.sh@84 -- # timing_exit run_spdk_fio_unmap
00:10:47.185   15:40:45	-- common/autotest_common.sh@726 -- # xtrace_disable
00:10:47.185   15:40:45	-- common/autotest_common.sh@10 -- # set +x
00:10:47.185   15:40:45	-- initiator/blockdev.sh@86 -- # rpc_cmd bdev_nvme_detach_controller Nvme0
00:10:47.185   15:40:45	-- common/autotest_common.sh@559 -- # xtrace_disable
00:10:47.185   15:40:45	-- common/autotest_common.sh@10 -- # set +x
00:10:47.185  [2024-05-06 15:40:45.451661] lun.c: 398:bdev_event_cb: *NOTICE*: bdev name (Nvme0n1p0) received event(SPDK_BDEV_EVENT_REMOVE)
00:10:47.185  [2024-05-06 15:40:45.451766] lun.c: 398:bdev_event_cb: *NOTICE*: bdev name (Nvme0n1p1) received event(SPDK_BDEV_EVENT_REMOVE)
00:10:47.185  [2024-05-06 15:40:45.451791] lun.c: 398:bdev_event_cb: *NOTICE*: bdev name (Nvme0n1p2) received event(SPDK_BDEV_EVENT_REMOVE)
00:10:47.185  [2024-05-06 15:40:45.451815] lun.c: 398:bdev_event_cb: *NOTICE*: bdev name (Nvme0n1p3) received event(SPDK_BDEV_EVENT_REMOVE)
00:10:47.185  [2024-05-06 15:40:45.451848] vhost_blk.c:1218:vhost_user_bdev_remove_cb: *WARNING*: naa.Nvme0n1_blk0.0: hot-removing bdev - all further requests will fail.
00:10:47.185  [2024-05-06 15:40:45.451864] vhost_blk.c:1218:vhost_user_bdev_remove_cb: *WARNING*: naa.Nvme0n1_blk1.0: hot-removing bdev - all further requests will fail.
00:11:02.124   15:41:00	-- common/autotest_common.sh@587 -- # [[ '' == 0 ]]

In a passing log, the result is immediate.
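Given the "NVMe path 'Nvme0' still exists after delete" message, the single rpc_cmd call appears to be racing the asynchronous teardown. A minimal sketch (not SPDK code) of a polling alternative for the cleanup step, where `list_controllers` is a hypothetical placeholder for something like `scripts/rpc.py bdev_nvme_get_controllers`:

```shell
# Hypothetical helper: after issuing the detach, poll until the controller
# name no longer appears in the listing, instead of failing the whole test
# when a single detach RPC response is late.
wait_for_detach() {
	local name=$1
	local deadline=$(( SECONDS + ${2:-30} ))  # overall wait budget in seconds

	# Keep polling while the controller name still shows up in the listing.
	while list_controllers | grep -q "\"$name\""; do
		(( SECONDS < deadline )) || return 1  # give up once the budget is spent
		sleep 1
	done
}
```

With something along these lines, blockdev.sh line 86 could detach and then wait for the path to actually disappear, which would tolerate the slow teardown seen in this failure.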
