Fix/ub 1060 unmount should continue on faulty device #246

olgashtivelman · 2018-09-04T12:38:43Z

This PR is solving the following idempotent issue:
in case the umount process fails after the volume is already detached in the storage but the multipath device still exists, the next umount we will try to run will fail and get stuck on the "umount" command forever.
the solution is to run the dmsetup command before the umount command (dmsetup puts 0 to queue_if_no_path). this seems to solve the issue for this scenario and we are able to finish the umount without further issues.

This change is

coveralls · 2018-09-04T12:45:23Z

Coverage decreased (-0.02%) to 51.96% when pulling 69c7022 on fix/UB-1060_unmount_should_continue_on_faulty_device into 76dfee6 on dev.

shay-berman

Reviewable status: 0 of 9 files reviewed, 6 unresolved discussions (waiting on @olgashtivelman, @beckmani, and @shay-berman)

remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 119 at r1 (raw file):

	defer b.logger.Trace(logs.INFO, logs.Args{{"devicePath", devicePath}})

	err := b.blockDeviceUtils.SetDmsetup(devicePath)

just wonder if we want to do it by default on every device for unmount or only if its faulty device?
want to raise this question here so we will take a conscious decision about this behaviour. (consult with isaac as well, he may have input about it from HAK perspective)

remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 121 at r1 (raw file):

	err := b.blockDeviceUtils.SetDmsetup(devicePath)
	if err != nil {
		return b.logger.ErrorRet(err, "Dmsetup failed")

so if dmsetup message fail it will stop here and not continue to unmount. are we sure its ok?
just raising the concern here, maybe its better to do that only its faulty device.

In addition you should consider about idempotent aspect here.
What happened if you run dmsetup message on deviceX, and then if u ran it again is the command will succeed or fail?
if it will fail in the second time, then it means you have here a new idempotent issue. if not dmsetup message command is idempotent by nature then we are ok.

please consider and reply

remote/mounter/block_device_mounter_utils/block_device_utils_mounter_test.go, line 284 at r1 (raw file):

			Expect(err).To(MatchError(callErr))
		})
		FIt("should fail if dmsetup failed", func() {

Catch you :-)
you forgot to remove the F from the FIt.
please fix

remote/mounter/block_device_utils/fs.go, line 99 at r1 (raw file):

	args := []string{mpoint}
	if _, err := b.exec.ExecuteWithTimeout(TimeoutMilisecondUmountCmdUmountFs, umountCmd, args); err != nil {
		if err.Error() == context.DeadlineExceeded.Error(){

is it relevant for the PR? if so please update the PR description

remote/mounter/block_device_utils/mpath.go, line 105 at r1 (raw file):

s(stringOutput, "faulty"){

I am not sure if this check if cover enough.
and what is sg_inq fail but the multipath -ll doesn't say its faulty device?
also i think there are additional output that could present faulty even if the word faulty is not presented (try to map a volume, discover it and then unmap it and try to disciver it again, first you will see faulty but after 1-2 addition rescans you may not see the device detail at all but it is a faulty device.)
So I think you should identify additional faulty cases and add UT for them and then check if this new flow works for these faulty cases.

maybe every time that sg_inq fail it means that its faulty? not sure.

Another case that you need to consider - if the device is faulty, then are you sure that the discover will go to this flow?

please add at least additional faulty device cases as UTs and make sure you cover them all. (BTW check the current UT, there are some outputs of faulty device cases.)

remote/mounter/block_device_utils/mpath.go, line 243 at r1 (raw file):

}

func (b *blockDeviceUtils) SetDmsetup(mpath string) error{

LGTM

BTW if there are other places we used this command, then please just use this new function.

olgashtivelman

Reviewable status: 0 of 10 files reviewed, 6 unresolved discussions (waiting on @olgashtivelman, @shay-berman, and @beckmani)

remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 119 at r1 (raw file):

Previously, shay-berman wrote…

just wonder if we want to do it by default on every device for unmount or only if its faulty device?
want to raise this question here so we will take a conscious decision about this behaviour. (consult with isaac as well, he may have input about it from HAK perspective)

I ran the staging on this build to make sure we are not runing anything and it passed so I think this means this is safe no?

remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 121 at r1 (raw file):

Previously, shay-berman wrote…

so if dmsetup message fail it will stop here and not continue to unmount. are we sure its ok?
just raising the concern here, maybe its better to do that only its faulty device.

In addition you should consider about idempotent aspect here.
What happened if you run dmsetup message on deviceX, and then if u ran it again is the command will succeed or fail?
if it will fail in the second time, then it means you have here a new idempotent issue. if not dmsetup message command is idempotent by nature then we are ok.

please consider and reply

its the same logic that it had it is just moved before the umount. so if there is an idempotent issue here it existed before and should be addressed separately . you can check out the diff with the cleanup function where it used to be, so the logic of it did not change.

remote/mounter/block_device_mounter_utils/block_device_utils_mounter_test.go, line 284 at r1 (raw file):

Previously, shay-berman wrote…

Catch you :-)
you forgot to remove the F from the FIt.
please fix

Done.

remote/mounter/block_device_utils/fs.go, line 99 at r1 (raw file):

Previously, shay-berman wrote…

is it relevant for the PR? if so please update the PR description

not really, I just merged from dev and noticed this was missing so I added this. I can add this small fix to the description as well (I dont think i need to open a new PR for that its a small fix)

remote/mounter/block_device_utils/mpath.go, line 105 at r1 (raw file):

Previously, shay-berman wrote…

s(stringOutput, "faulty"){
I am not sure if this check if cover enough.
and what is sg_inq fail but the multipath -ll doesn't say its faulty device?
also i think there are additional output that could present faulty even if the word faulty is not presented (try to map a volume, discover it and then unmap it and try to disciver it again, first you will see faulty but after 1-2 addition rescans you may not see the device detail at all but it is a faulty device.)
So I think you should identify additional faulty cases and add UT for them and then check if this new flow works for these faulty cases.

maybe every time that sg_inq fail it means that its faulty? not sure.

Another case that you need to consider - if the device is faulty, then are you sure that the discover will go to this flow?

please add at least additional faulty device cases as UTs and make sure you cover them all. (BTW check the current UT, there are some outputs of faulty device cases.)

I did not want to do idempotent on the sg_inq here. the only thing that is relevant for the current PR is the case where the sg_inq fzails on faulty device. anyway this is only for a specific case . in the specific case we are addressing this is the issue and this helps to fix it. i dont know what will happen in other scenarios where we have a faulty device (what are they?) and I am not sure the current fix will fix them but it does fix the issue we are talking about in this ticket.

about the UT - I added some test.

remote/mounter/block_device_utils/mpath.go, line 243 at r1 (raw file):

Previously, shay-berman wrote…

LGTM

BTW if there are other places we used this command, then please just use this new function.

no other places.

shay-berman

Reviewable status: 0 of 10 files reviewed, 3 unresolved discussions (waiting on @shay-berman, @olgashtivelman, and @beckmani)

remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 119 at r1 (raw file):

Previously, olgashtivelman wrote…

I ran the staging on this build to make sure we are not runing anything and it passed so I think this means this is safe no?

i would still consult with Isaac about it.

remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 121 at r1 (raw file):

Previously, olgashtivelman wrote…

its the same logic that it had it is just moved before the umount. so if there is an idempotent issue here it existed before and should be addressed separately . you can check out the diff with the cleanup function where it used to be, so the logic of it did not change.

just wonder if there is any difference that now we move it before the unmount, maybe it could lead to a new idempotent issue. just something to consider. if you think no new idempotent then its ok.

remote/mounter/block_device_utils/block_device_utils_test.go, line 232 at r2 (raw file):

			Expect(err).To(Equal(&block_device_utils.FaultyDeviceError{device}))	
		})
		It("should return command execution error if device is not faulty.", func() {

i think that there is another state when device become faulty after a while there is no vendor in the output.
Maybe we should also address that as well.
So if device has the WWN we want but no vendor exist, then its a faulty device and we should consider it as a faulty device and continue to clear it. I think there is example in the UT, if not then u can simulate it by map device to host, rescan and discover it, then unmap it and run multiple rescans to see the new multipath output without the vendor mentioned (IBM)

update - so this UT will be handled in UB-1522

remote/mounter/block_device_utils/fs.go, line 99 at r1 (raw file):

Previously, olgashtivelman wrote…

not really, I just merged from dev and noticed this was missing so I added this. I can add this small fix to the description as well (I dont think i need to open a new PR for that its a small fix)

yes please, just add it to the PR. so it wil lbe clear

remote/mounter/block_device_utils/mpath.go, line 105 at r1 (raw file):

Previously, olgashtivelman wrote…

I did not want to do idempotent on the sg_inq here. the only thing that is relevant for the current PR is the case where the sg_inq fzails on faulty device. anyway this is only for a specific case . in the specific case we are addressing this is the issue and this helps to fix it. i dont know what will happen in other scenarios where we have a faulty device (what are they?) and I am not sure the current fix will fix them but it does fix the issue we are talking about in this ticket.

about the UT - I added some test.

ok, I agree its different senaio that we need to handle. and it should not be part of this PR.
So i opened a ticket UB-1522 for that. as a follow up to this ticket.

olgashtivelman

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @shay-berman and @beckmani)

remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 119 at r1 (raw file):

Previously, shay-berman wrote…

i would still consult with Isaac about it.

Done.

remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 121 at r1 (raw file):

Previously, shay-berman wrote…

just wonder if there is any difference that now we move it before the unmount, maybe it could lead to a new idempotent issue. just something to consider. if you think no new idempotent then its ok.

Done.

Olga Shtivelman added 12 commits August 9, 2018 15:05

Ub-1060: trying to conritnu even if device is faulty

62c4ff7

UB-1060: adde some logs

ecc5588

UB-1060: trying umount with timeout

64cbee2

UB-1060: fix

1eddf17

UB-1060: added execute with timeout

3302ce8

UB-1060- trying to use umount -f

68727cb

UB-1060- trying to use umount -f

29089f1

UB-1060: trying to run dmsetup before the umount

2a4f72b

UB-1060: trying to run dmsetup before the umount

112c230

Ub-1060: added function

48b31b5

UB-1060: merge from dev

f6d6bae

Ub-1060: fix tests

55e88ae

olgashtivelman requested review from shay-berman and beckmani September 4, 2018 12:58

shay-berman suggested changes Sep 5, 2018

View reviewed changes

UB-1060: code review fixes + some other fixes

69c7022

olgashtivelman commented Sep 5, 2018

View reviewed changes

shay-berman suggested changes Sep 6, 2018

View reviewed changes

olgashtivelman commented Sep 18, 2018

View reviewed changes

olgashtivelman merged commit 4894056 into dev Sep 18, 2018

olgashtivelman deleted the fix/UB-1060_unmount_should_continue_on_faulty_device branch September 18, 2018 06:58

shay-berman added v2.0.0 idempotency Improve Flex API idempotency aspects labels Nov 21, 2018

shay-berman added this to the v2.0.0 - Spectrum Scale support milestone Nov 22, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/ub 1060 unmount should continue on faulty device #246

Fix/ub 1060 unmount should continue on faulty device #246

olgashtivelman commented Sep 4, 2018 •

edited

coveralls commented Sep 4, 2018 •

edited

shay-berman left a comment

olgashtivelman left a comment

shay-berman left a comment

olgashtivelman left a comment

Fix/ub 1060 unmount should continue on faulty device #246

Fix/ub 1060 unmount should continue on faulty device #246

Conversation

olgashtivelman commented Sep 4, 2018 • edited

coveralls commented Sep 4, 2018 • edited

shay-berman left a comment

Choose a reason for hiding this comment

olgashtivelman left a comment

Choose a reason for hiding this comment

shay-berman left a comment

Choose a reason for hiding this comment

olgashtivelman left a comment

Choose a reason for hiding this comment

olgashtivelman commented Sep 4, 2018 •

edited

coveralls commented Sep 4, 2018 •

edited