Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/ub 1060 unmount should continue on faulty device #246

Merged

Conversation

olgashtivelman
Copy link
Collaborator

@olgashtivelman olgashtivelman commented Sep 4, 2018

This PR is solving the following idempotent issue:
in case the umount process fails after the volume is already detached in the storage but the multipath device still exists, the next umount we will try to run will fail and get stuck on the "umount" command forever.
the solution is to run the dmsetup command before the umount command (dmsetup puts 0 to queue_if_no_path). this seems to solve the issue for this scenario and we are able to finish the umount without further issues.

This change is Reviewable

@coveralls
Copy link

coveralls commented Sep 4, 2018

Coverage Status

Coverage decreased (-0.02%) to 51.96% when pulling 69c7022 on fix/UB-1060_unmount_should_continue_on_faulty_device into 76dfee6 on dev.

Copy link
Contributor

@shay-berman shay-berman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 9 files reviewed, 6 unresolved discussions (waiting on @olgashtivelman, @beckmani, and @shay-berman)


remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 119 at r1 (raw file):

	defer b.logger.Trace(logs.INFO, logs.Args{{"devicePath", devicePath}})

	err := b.blockDeviceUtils.SetDmsetup(devicePath)

just wonder if we want to do it by default on every device for unmount or only if its faulty device?
want to raise this question here so we will take a conscious decision about this behaviour. (consult with isaac as well, he may have input about it from HAK perspective)


remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 121 at r1 (raw file):

	err := b.blockDeviceUtils.SetDmsetup(devicePath)
	if err != nil {
		return b.logger.ErrorRet(err, "Dmsetup failed")

so if dmsetup message fail it will stop here and not continue to unmount. are we sure its ok?
just raising the concern here, maybe its better to do that only its faulty device.

In addition you should consider about idempotent aspect here.
What happened if you run dmsetup message on deviceX, and then if u ran it again is the command will succeed or fail?
if it will fail in the second time, then it means you have here a new idempotent issue. if not dmsetup message command is idempotent by nature then we are ok.

please consider and reply


remote/mounter/block_device_mounter_utils/block_device_utils_mounter_test.go, line 284 at r1 (raw file):

			Expect(err).To(MatchError(callErr))
		})
		FIt("should fail if dmsetup failed", func() {

Catch you :-)
you forgot to remove the F from the FIt.
please fix


remote/mounter/block_device_utils/fs.go, line 99 at r1 (raw file):

	args := []string{mpoint}
	if _, err := b.exec.ExecuteWithTimeout(TimeoutMilisecondUmountCmdUmountFs, umountCmd, args); err != nil {
		if err.Error() == context.DeadlineExceeded.Error(){

is it relevant for the PR? if so please update the PR description


remote/mounter/block_device_utils/mpath.go, line 105 at r1 (raw file):

s(stringOutput, "faulty"){

I am not sure if this check if cover enough.
and what is sg_inq fail but the multipath -ll doesn't say its faulty device?
also i think there are additional output that could present faulty even if the word faulty is not presented (try to map a volume, discover it and then unmap it and try to disciver it again, first you will see faulty but after 1-2 addition rescans you may not see the device detail at all but it is a faulty device.)
So I think you should identify additional faulty cases and add UT for them and then check if this new flow works for these faulty cases.

maybe every time that sg_inq fail it means that its faulty? not sure.

Another case that you need to consider - if the device is faulty, then are you sure that the discover will go to this flow?

please add at least additional faulty device cases as UTs and make sure you cover them all. (BTW check the current UT, there are some outputs of faulty device cases.)


remote/mounter/block_device_utils/mpath.go, line 243 at r1 (raw file):

}

func (b *blockDeviceUtils) SetDmsetup(mpath string) error{

LGTM

BTW if there are other places we used this command, then please just use this new function.

Copy link
Collaborator Author

@olgashtivelman olgashtivelman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 6 unresolved discussions (waiting on @olgashtivelman, @shay-berman, and @beckmani)


remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 119 at r1 (raw file):

Previously, shay-berman wrote…

just wonder if we want to do it by default on every device for unmount or only if its faulty device?
want to raise this question here so we will take a conscious decision about this behaviour. (consult with isaac as well, he may have input about it from HAK perspective)

I ran the staging on this build to make sure we are not runing anything and it passed so I think this means this is safe no?


remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 121 at r1 (raw file):

Previously, shay-berman wrote…

so if dmsetup message fail it will stop here and not continue to unmount. are we sure its ok?
just raising the concern here, maybe its better to do that only its faulty device.

In addition you should consider about idempotent aspect here.
What happened if you run dmsetup message on deviceX, and then if u ran it again is the command will succeed or fail?
if it will fail in the second time, then it means you have here a new idempotent issue. if not dmsetup message command is idempotent by nature then we are ok.

please consider and reply

its the same logic that it had it is just moved before the umount. so if there is an idempotent issue here it existed before and should be addressed separately . you can check out the diff with the cleanup function where it used to be, so the logic of it did not change.


remote/mounter/block_device_mounter_utils/block_device_utils_mounter_test.go, line 284 at r1 (raw file):

Previously, shay-berman wrote…

Catch you :-)
you forgot to remove the F from the FIt.
please fix

Done.


remote/mounter/block_device_utils/fs.go, line 99 at r1 (raw file):

Previously, shay-berman wrote…

is it relevant for the PR? if so please update the PR description

not really, I just merged from dev and noticed this was missing so I added this. I can add this small fix to the description as well (I dont think i need to open a new PR for that its a small fix)


remote/mounter/block_device_utils/mpath.go, line 105 at r1 (raw file):

Previously, shay-berman wrote…
s(stringOutput, "faulty"){

I am not sure if this check if cover enough.
and what is sg_inq fail but the multipath -ll doesn't say its faulty device?
also i think there are additional output that could present faulty even if the word faulty is not presented (try to map a volume, discover it and then unmap it and try to disciver it again, first you will see faulty but after 1-2 addition rescans you may not see the device detail at all but it is a faulty device.)
So I think you should identify additional faulty cases and add UT for them and then check if this new flow works for these faulty cases.

maybe every time that sg_inq fail it means that its faulty? not sure.

Another case that you need to consider - if the device is faulty, then are you sure that the discover will go to this flow?

please add at least additional faulty device cases as UTs and make sure you cover them all. (BTW check the current UT, there are some outputs of faulty device cases.)

I did not want to do idempotent on the sg_inq here. the only thing that is relevant for the current PR is the case where the sg_inq fzails on faulty device. anyway this is only for a specific case . in the specific case we are addressing this is the issue and this helps to fix it. i dont know what will happen in other scenarios where we have a faulty device (what are they?) and I am not sure the current fix will fix them but it does fix the issue we are talking about in this ticket.

about the UT - I added some test.


remote/mounter/block_device_utils/mpath.go, line 243 at r1 (raw file):

Previously, shay-berman wrote…

LGTM

BTW if there are other places we used this command, then please just use this new function.

no other places.

Copy link
Contributor

@shay-berman shay-berman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 3 unresolved discussions (waiting on @shay-berman, @olgashtivelman, and @beckmani)


remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 119 at r1 (raw file):

Previously, olgashtivelman wrote…

I ran the staging on this build to make sure we are not runing anything and it passed so I think this means this is safe no?

i would still consult with Isaac about it.


remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 121 at r1 (raw file):

Previously, olgashtivelman wrote…

its the same logic that it had it is just moved before the umount. so if there is an idempotent issue here it existed before and should be addressed separately . you can check out the diff with the cleanup function where it used to be, so the logic of it did not change.

just wonder if there is any difference that now we move it before the unmount, maybe it could lead to a new idempotent issue. just something to consider. if you think no new idempotent then its ok.


remote/mounter/block_device_utils/block_device_utils_test.go, line 232 at r2 (raw file):

			Expect(err).To(Equal(&block_device_utils.FaultyDeviceError{device}))	
		})
		It("should return command execution error if device is not faulty.", func() {

i think that there is another state when device become faulty after a while there is no vendor in the output.
Maybe we should also address that as well.
So if device has the WWN we want but no vendor exist, then its a faulty device and we should consider it as a faulty device and continue to clear it. I think there is example in the UT, if not then u can simulate it by map device to host, rescan and discover it, then unmap it and run multiple rescans to see the new multipath output without the vendor mentioned (IBM)

update - so this UT will be handled in UB-1522


remote/mounter/block_device_utils/fs.go, line 99 at r1 (raw file):

Previously, olgashtivelman wrote…

not really, I just merged from dev and noticed this was missing so I added this. I can add this small fix to the description as well (I dont think i need to open a new PR for that its a small fix)

yes please, just add it to the PR. so it wil lbe clear


remote/mounter/block_device_utils/mpath.go, line 105 at r1 (raw file):

Previously, olgashtivelman wrote…

I did not want to do idempotent on the sg_inq here. the only thing that is relevant for the current PR is the case where the sg_inq fzails on faulty device. anyway this is only for a specific case . in the specific case we are addressing this is the issue and this helps to fix it. i dont know what will happen in other scenarios where we have a faulty device (what are they?) and I am not sure the current fix will fix them but it does fix the issue we are talking about in this ticket.

about the UT - I added some test.

ok, I agree its different senaio that we need to handle. and it should not be part of this PR.
So i opened a ticket UB-1522 for that. as a follow up to this ticket.

Copy link
Collaborator Author

@olgashtivelman olgashtivelman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @shay-berman and @beckmani)


remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 119 at r1 (raw file):

Previously, shay-berman wrote…

i would still consult with Isaac about it.

Done.


remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 121 at r1 (raw file):

Previously, shay-berman wrote…

just wonder if there is any difference that now we move it before the unmount, maybe it could lead to a new idempotent issue. just something to consider. if you think no new idempotent then its ok.

Done.

@olgashtivelman olgashtivelman merged commit 4894056 into dev Sep 18, 2018
@olgashtivelman olgashtivelman deleted the fix/UB-1060_unmount_should_continue_on_faulty_device branch September 18, 2018 06:58
@shay-berman shay-berman added v2.0.0 idempotency Improve Flex API idempotency aspects labels Nov 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
idempotency Improve Flex API idempotency aspects v2.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants