Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/ub-600 : sg_inq is skipped if device is faulty #250

Merged
merged 8 commits into from
Oct 25, 2018

Conversation

olgashtivelman
Copy link
Collaborator

@olgashtivelman olgashtivelman commented Oct 15, 2018

in case a device is faulty we identify the situation and skip sg_inq + sontinue to to do the rest of the umount flow as needed.
this is done to prevent a situation where if a device has been faulty for a while and we try to umount it (~6 minutes) the sg_inq command usually gets stuck.
this should work for both ubuntu + red hat (the multipath output is a bit differente between them)


This change is Reviewable

@olgashtivelman olgashtivelman changed the title Fix/ub 600 sg_inq is skipped if device is faulty Fix/ub-600 : sg_inq is skipped if device is faulty Oct 15, 2018
@coveralls
Copy link

coveralls commented Oct 16, 2018

Coverage Status

Coverage increased (+1.3%) to 57.742% when pulling 9b2f077 on fix/UB-600_remove_faulty_device_with_no_sginq into 1c0104b on dev.

Copy link
Contributor

@shay-berman shay-berman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 12 unresolved discussions (waiting on @olgashtivelman and @shay-berman)


remote/mounter/block_device_utils/block_device_utils_test.go, line 436 at r2 (raw file):

		It("should return error if device is faulty", func() {
			dev := "mpathhe"
			mpath := `mpathhe (36001738cfc9035eb0000000000cea5f6) dm-3 IBM     ,2810XIV

please add more UTs for additional output options like:

  1. with ubuntu output as well
  2. with different output for "inactive faulty offline"
  3. with all physical devices like this "33:0:0:1 sdb 8:16 ## ## ##"
    Note: i am not sure you have the right indintation of multipathing output (e.g: the physical devices line should goes one indintation to the policy line I think. I am not saying its a problem but just validate that unit testing using the current output)

remote/mounter/block_device_utils/block_device_utils_test.go, line 446 at r2 (raw file):

			Expect(fakeExec.ExecuteWithTimeoutCallCount()).To(Equal(0))
		})
		It("should not return an error if check for faulty device failed", func() {

i am not sure how the test description related to the test it self.
it looks like the test check that Getwwn function return error.
so please explain or fix the test description


remote/mounter/block_device_utils/block_device_utils_test.go, line 454 at r2 (raw file):

							- 34:0:0:1 sdc 8:32 fault faulty running`
			_, err := bdUtils.GetWwnByScsiInq(mpath, dev)
			Expect(err).To(HaveOccurred())

what about adding more unit testing that the multipath output contain few faulty but at least 1 in running state - lets make sure it works ok after your fix.


remote/mounter/block_device_utils/errors.go, line 89 at r2 (raw file):

}

func (e *DeviceNotFoundError) Error() string {

i think there is VolumeNotFoundError as well in this file. i wonder if its different and how?
if u really need additional error that talk about device not found in multipath, maybe ass the word Multipath tot eh error struct.


remote/mounter/block_device_utils/mpath.go, line 105 at r2 (raw file):

				return mpath, b.logger.ErrorRet(err, "failed")
			default:
				b.logger.Error("Failed to run multipath command while executing sg_inq.", logs.Args{{"err", err}})

maybe if u can please fix this text. "Failed to run sg_inq on multipath device " + mpath


remote/mounter/block_device_utils/mpath.go, line 158 at r2 (raw file):

}

func (b *blockDeviceUtils) GetWwnByScsiInq(mpathOutput string, dev string) (string, error) {

Olga, a suggestion
its not really mandatory to pass the mpathoputput to this function, instead you can had a dedicated function that check if device is faulty(which u already have) and then just call it before calling GetWwnByScsiInq.
up to you if you want to change it


remote/mounter/block_device_utils/mpath.go, line 201 at r2 (raw file):

		// we should not get here since we get the device from the multipath output so there is not reason for it to be missing
		// but in case something weird occurs we need to continue to not hurt the current flow.
		b.logger.Warning("an error occured while trying to check if device is faulty.", logs.Args{{"err", err}, {"device", dev}})

so what does it mean?
that if its happened, then isFaulty == true? does it make sense? can we just return error here - and since we idempotent then it will retry it again. what do you think?


remote/mounter/block_device_utils/mpath.go, line 204 at r2 (raw file):

	}
	if isFaulty {
		return "", b.logger.ErrorRet(&FaultyDeviceError{dev}, "Device is faulty. sg_inq will not run on a faulty device.")

text fix : just change the text a bit - ....("The device is faulty. Therefore skipping sg_inq check on this device " + dev)


remote/mounter/block_device_utils/utils.go, line 16 at r2 (raw file):

 * limitations under the License.
 */
package block_device_utils

we should carefull about adding utils.go file, because it just a generic name.
Instead call it multipath_utils.go
you can also consider to add it inside the mpath.go (i know its not part of the interface, but its related to multipath. anyway if not inside mpath.go then rename it to multipath_utils.go)


remote/mounter/block_device_utils/utils.go, line 26 at r2 (raw file):

)

func checkIsFaulty(mpath string, logger logs.Logger) bool {

if its the multipath output then please rename the param to multipathOutput so it will be self explanatory

about the regex "(\d)+:(\d)+:(\d)+:(\d)+.[failed|active]+."
ok i guess the \d\d\d\d is to catch HCTL (which is the [host|connectivity|target|lun] of the scsi device). But why double \? is it ok?

please if u can add a comment with the multiapth output so it will be easy to understand what it looking for.
and add a command that your regex catch all the physical lines of the multipath device by using HCTL as ancore to catch the device. so it will be clearer.


remote/mounter/block_device_utils/utils.go, line 28 at r2 (raw file):

func checkIsFaulty(mpath string, logger logs.Logger) bool {
	re := regexp.MustCompile("(\\d)+:(\\d)+:(\\d)+:(\\d)+.*[failed|active]+.*")
	paths := re.FindAllString(mpath, -1)

BTW is this works only on a specific multipath device output? just double checking.


remote/mounter/block_device_utils/utils.go, line 35 at r2 (raw file):

	logger.Debug(fmt.Sprintf("Device paths are : [%s]", paths))
	isFaulty := true
	activeRe := regexp.MustCompile("active.*ready.*running.*")

you can change .* to .+ because for sure there is a char between them.
also add a comment that say - that this regex is to find a real and working physical devices of a multipath device. and "active ready running" indicate about it


remote/mounter/block_device_utils/utils.go, line 37 at r2 (raw file):

	activeRe := regexp.MustCompile("active.*ready.*running.*")
	for _, path := range paths {
		logger.Debug(fmt.Sprintf("path : [%s] contains faulty? %t", path, strings.Contains(path, "faulty")))

maybe you can change it to warning - since its more then debug level.

Copy link
Contributor

@shay-berman shay-berman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 17 unresolved discussions (waiting on @olgashtivelman)


remote/mounter/block_device_utils/utils_test.go, line 50 at r2 (raw file):

			Expect(isFaulty).To(Equal(true))
		})
		It("returns true on a device with no paths", func() {

please add another UT with stange output like
|- 33:0:0:1 sdb 8:16 ## ## ##
- 34:0:0:1 sdc 8:32 ## ## ##`


remote/mounter/block_device_utils/utils_test.go, line 78 at r2 (raw file):

			})
			It("return error for one device which does not exists", func() {
				mapth := `mpathb (36001738cfc9035eb0000000000d0ec0e) dm-3 IBM     ,2810XIV

please add to the output another device (so u will have 2 devices in the output) like the above one.
so it will cover more.


remote/mounter/block_device_utils/utils_test.go, line 126 at r2 (raw file):

					Expect(result).To(Equal(output))
				})
				It("return right mpath for second device in multiple device scenario", func() {

what do you mean second device? clarify the test case here please.


remote/mounter/block_device_utils/utils_test.go, line 134 at r2 (raw file):

					device := "mpathc"
					result, err := findDeviceMpathOutput(mpathOutput, device, logs.GetLogger())
					fmt.Println(output)

is it correct to write directly to stdout? not sure


remote/mounter/block_device_utils/utils_test.go, line 156 at r2 (raw file):

					Expect(result).To(Equal(output))
				})
				It("return error fif device does not exist", func() {

fif


remote/mounter/block_device_utils/utils_test.go, line 310 at r2 (raw file):

})

func TestBlockDeviceUtilsUtils(t *testing.T) {

VERY well done
great unit testing file!

Copy link
Collaborator Author

@olgashtivelman olgashtivelman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 17 unresolved discussions (waiting on @shay-berman and @olgashtivelman)


remote/mounter/block_device_utils/block_device_utils_test.go, line 436 at r2 (raw file):

Previously, shay-berman wrote…

please add more UTs for additional output options like:

  1. with ubuntu output as well
  2. with different output for "inactive faulty offline"
  3. with all physical devices like this "33:0:0:1 sdb 8:16 ## ## ##"
    Note: i am not sure you have the right indintation of multipathing output (e.g: the physical devices line should goes one indintation to the policy line I think. I am not saying its a problem but just validate that unit testing using the current output)
  1. you can find it in utils_test
  2. added more to utils_test
  3. added to utils_test
  4. well i checked it on a live env and it seemed to catch the indentation correctly also i am not relaying on the white spaces just the "\n" so i am not sure this is an issue.

remote/mounter/block_device_utils/block_device_utils_test.go, line 446 at r2 (raw file):

Previously, shay-berman wrote…

i am not sure how the test description related to the test it self.
it looks like the test check that Getwwn function return error.
so please explain or fix the test description

Done.


remote/mounter/block_device_utils/block_device_utils_test.go, line 454 at r2 (raw file):

Previously, shay-berman wrote…

what about adding more unit testing that the multipath output contain few faulty but at least 1 in running state - lets make sure it works ok after your fix.

these test are at utils_test. I think its better to test the functions that check the faulty firectly. regarding the bahaviour of the GetWWnSCSI utils there are plenty of tests that check the flow in case the device is not faulty.


remote/mounter/block_device_utils/errors.go, line 89 at r2 (raw file):

Previously, shay-berman wrote…

i think there is VolumeNotFoundError as well in this file. i wonder if its different and how?
if u really need additional error that talk about device not found in multipath, maybe ass the word Multipath tot eh error struct.

Done.


remote/mounter/block_device_utils/mpath.go, line 105 at r2 (raw file):

Previously, shay-berman wrote…

maybe if u can please fix this text. "Failed to run sg_inq on multipath device " + mpath

Done.


remote/mounter/block_device_utils/mpath.go, line 158 at r2 (raw file):

Previously, shay-berman wrote…

Olga, a suggestion
its not really mandatory to pass the mpathoputput to this function, instead you can had a dedicated function that check if device is faulty(which u already have) and then just call it before calling GetWwnByScsiInq.
up to you if you want to change it

as we discussed since this is used in 2 places i felt more comfortable adding this here instead of in 2 other functions.


remote/mounter/block_device_utils/mpath.go, line 201 at r2 (raw file):

Previously, shay-berman wrote…

so what does it mean?
that if its happened, then isFaulty == true? does it make sense? can we just return error here - and since we idempotent then it will retry it again. what do you think?

this will mean isFaulty = false (you can see this here:

{code}
deviceOutput, err := findDeviceMpathOutput(deviceMultipathOutput, baseDevice, logger)
if err != nil {
return false, err
}
{code}

errors are not likely to be raised, BUT if they do i dont want to hurt or change the current flow where we do not check if the device is faulty at all. i think it is better to leave the existing logic as is and just add to it as well as we can . in case i cannot check if the device is faulty i just want to continue with the flow as it used to be.


remote/mounter/block_device_utils/mpath.go, line 204 at r2 (raw file):

Previously, shay-berman wrote…

text fix : just change the text a bit - ....("The device is faulty. Therefore skipping sg_inq check on this device " + dev)

Done.


remote/mounter/block_device_utils/utils.go, line 16 at r2 (raw file):

Previously, shay-berman wrote…

we should carefull about adding utils.go file, because it just a generic name.
Instead call it multipath_utils.go
you can also consider to add it inside the mpath.go (i know its not part of the interface, but its related to multipath. anyway if not inside mpath.go then rename it to multipath_utils.go)

renaming this to multipath_utils.go (BTW I dont think its that bad of an issue having a generic utils file - since we can always understand what is in used and what is not but still as per your request :) )


remote/mounter/block_device_utils/utils.go, line 26 at r2 (raw file):

Previously, shay-berman wrote…

if its the multipath output then please rename the param to multipathOutput so it will be self explanatory

about the regex "(\d)+:(\d)+:(\d)+:(\d)+.[failed|active]+."
ok i guess the \d\d\d\d is to catch HCTL (which is the [host|connectivity|target|lun] of the scsi device). But why double \? is it ok?

please if u can add a comment with the multiapth output so it will be easy to understand what it looking for.
and add a command that your regex catch all the physical lines of the multipath device by using HCTL as ancore to catch the device. so it will be clearer.

Done.


remote/mounter/block_device_utils/utils.go, line 35 at r2 (raw file):

Previously, shay-berman wrote…

you can change .* to .+ because for sure there is a char between them.
also add a comment that say - that this regex is to find a real and working physical devices of a multipath device. and "active ready running" indicate about it

Done.


remote/mounter/block_device_utils/utils.go, line 37 at r2 (raw file):

Previously, shay-berman wrote…

maybe you can change it to warning - since its more then debug level.

remove this log message - not relevant


remote/mounter/block_device_utils/utils_test.go, line 50 at r2 (raw file):

Previously, shay-berman wrote…

please add another UT with stange output like
|- 33:0:0:1 sdb 8:16 ## ## ##
- 34:0:0:1 sdc 8:32 ## ## ##`

Done.


remote/mounter/block_device_utils/utils_test.go, line 78 at r2 (raw file):

Previously, shay-berman wrote…

please add to the output another device (so u will have 2 devices in the output) like the above one.
so it will cover more.

there are tests in the multipath device area bellow (3 devices) is this not enough?


remote/mounter/block_device_utils/utils_test.go, line 126 at r2 (raw file):

Previously, shay-berman wrote…

what do you mean second device? clarify the test case here please.

second one in the output. added clarification


remote/mounter/block_device_utils/utils_test.go, line 134 at r2 (raw file):

Previously, shay-berman wrote…

is it correct to write directly to stdout? not sure

oops. removed.


remote/mounter/block_device_utils/utils_test.go, line 156 at r2 (raw file):

Previously, shay-berman wrote…

fif

Done.

Copy link
Contributor

@shay-berman shay-berman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Great @olga. if staging green on this PR then go ahead and merge to dev branch.

Reviewable status: 0 of 7 files reviewed, all discussions resolved

@olgashtivelman olgashtivelman merged commit 1c9ea85 into dev Oct 25, 2018
deeghuge pushed a commit that referenced this pull request Oct 31, 2018
in case a device is faulty we identify the situation and skip sg_inq + sontinue to to do the rest of the umount flow as needed.
this is done to prevent a situation where if a device has been faulty for a while and we try to umount it (~6 minutes) the sg_inq command usually gets stuck.
this should work for both ubuntu + red hat (the multipath output is a bit differente between them)
@shay-berman shay-berman added v2.0.0 idempotency Improve Flex API idempotency aspects labels Nov 21, 2018
@olgashtivelman olgashtivelman deleted the fix/UB-600_remove_faulty_device_with_no_sginq branch November 22, 2018 06:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
idempotency Improve Flex API idempotency aspects v2.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants