Feature/ub 1941 unmount fix #298

27149chen · 2019-03-14T09:57:54Z

Description of the issue
There is an issue with the new rescan method introduced in PR #294 that the unmount didn't clean up the physical devices.
More detail : In the new cleanup proceduce, we get the multipath devices belonging to the same volume from multipath -ll output, but it happens in ActionAfterDetach, before that, a multipath -f is called, which will remove the volume record from the multipath -ll output. So that we can't get what we want, and the command "echo 1 > /sys/block/sdx/device/delete" won't be called. It will lead to stale info (stale paths in /dev/disk/by-path/) on host.

Description of the optimal solution - will not done in this PR
Change the order of the flex unmount as follow:

multipath -f
echo 1 > /sys/block/sdx/device/delete
unmap

rescan or cleanup in ActionAfterDetach doesn't help much.

Description of the solution
But reordering the steps is a big change, will do it in next release. For the current release, I'll use a local cache to store the informations required in the cleanup phase(vol wwn\lun\multipath device and list of physical devices), so that we can cleanup the paths even if it is not in the multipath -ll output.

Detail about the cache
The cache is a "map[volumeWWn]*VolumeMountProperties" (in fact it is a sync.Map, which is like a Go map[interface{}]interface{} but is safe for concurrent use by multiple goroutines without additional locking or coordination).Each key-value stands for a volume which is mounted to the current host by flex. The value VolumeMountProperties is a struct which has 4 attributes: WWN, LunNUbmer, DeviceMapper("mpthx"), Devices (["sda", "sdb", ...]) and stands for a multipath device and all its paths.

This change is

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

# Conflicts: # resources/resources.go

coveralls · 2019-03-14T10:02:04Z

Coverage remained the same at 66.484% when pulling 82e54ee on feature/UB-1941_unmount_fix into 91b55dc on dev.

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

shay-berman

Reviewable status: 0 of 11 files reviewed, 18 unresolved discussions (waiting on @27149chen and @shay-berman)

remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 164 at r2 (raw file):

		b.blockDeviceUtils.StoreVolumeToCache(volumeMountProperties)
	} else {
		b.logger.Warning("Failed to store volume info", logs.Args{{"volumeWWN", volumeWwn}, {"error", err}})

what does this warning mean? how the warning help to the user?
if it happened how it impact?

remote/mounter/block_device_utils/block_device_utils_test.go, line 55 at r2 (raw file):

	})

	Context("volume cache", func() {

try to give more text here.

remote/mounter/block_device_utils/rescan.go, line 46 at r2 (raw file):

		return volume.(*resources.VolumeMountProperties)
	}
	b.logger.Warning("Volume not found in cache.", logs.Args{{"wwn", volumeMountProperties.WWN}})

why its a warning? does it mean that every first call will raise warning in the log?

remote/mounter/block_device_utils/rescan.go, line 60 at r2 (raw file):

	volume := new(resources.VolumeMountProperties)
	*volume = *volumeMountProperties

I guess you want to copy by value. are you sure it really copy the physical device list inside?
if u have map or struct inside volumeMountProperties it will not deep copy. please check

remote/mounter/block_device_utils/rescan.go, line 138 at r2 (raw file):

	defer b.logger.Trace(logs.DEBUG)()

	volume := b.GetVolumeFromCache(volumeMountProperties)

if u get the detail from cache here, so why you need to get it again from cache inside the DisconnectVolume method? sounds like double get. not sure i understand the reason. please explain.

remote/mounter/block_device_utils/rescan.go, line 147 at r2 (raw file):

		return err
	}
	b.RemoveVolumeFromCache(volumeMountProperties)

but what if for some reason the flex unmount

remote/mounter/block_device_utils/resources.go, line 46 at r2 (raw file):

	IsDirAMountPoint(dirPath string) (bool, []string, error)
	SetDmsetup(mpath string) error
	GetVolumeFromCache(volumeMountProperties *resources.VolumeMountProperties) *resources.VolumeMountProperties

Please move these Cache methods out of BlockDeviceUtils, since its not blockdevice util at all, its just wrapper for a caching.
So please move these APIs to a new interface and just use it inside the struct.

again, becaue blockdeviceutil is interface for actual block operations and not for cache wrapping.

make sense?
So u can write dedicated UT for it.

remote/mounter/initiator/linuxfc_test.go, line 245 at r2 (raw file):

		})

		It("should flush 3 times if fails", func() {

why u change from 2 to 3?

remote/mounter/initiator/linuxscsi.go, line 23 at r2 (raw file):

}

// FlushMultipath flushes the device, retry 3 times if it is failed.

its better to have retry const and just reuse it in the code and in the test - to avoid mistakes.

how do u device 3 times?

remote/mounter/initiator/linuxscsi.go, line 37 at r2 (raw file):

		}
	}
}

why you are not handling error for the 3 time?

remote/mounter/initiator/connectors/fibre_channel.go, line 65 at r2 (raw file):

	devMapper := volumeMountProperties.DeviceMapper
	devNames := volumeMountProperties.Devices
	if devMapper == "" {

when it will be empty?

remote/mounter/initiator/connectors/fibre_channel.go, line 68 at r2 (raw file):

		_, devMapper, devNames, err = utils.GetMultipathOutputAndDeviceMapperAndDevice(volumeMountProperties.WWN, c.exec)
		if err != nil || devMapper == "" {
			return c.logger.ErrorRet(err, "Failed to get multipath output before disconnecting volume")

the word disconnecting is very confusing
i guess u mean to clean up the device,right?

remote/mounter/initiator/connectors/fibre_channel.go, line 73 at r2 (raw file):

	// flush multipath device
	c.logger.Info("Flush multipath device", logs.Args{{"name", devMapper}})

Please add to the text "Flush(remove) multipath device"

remote/mounter/initiator/connectors/fibre_channel.go, line 74 at r2 (raw file):

	// flush multipath device
	c.logger.Info("Flush multipath device", logs.Args{{"name", devMapper}})
	c.linuxfc.FlushMultipath(devMapper)

what about errors in flushing?

remote/mounter/initiator/connectors/fibre_channel.go, line 76 at r2 (raw file):

	c.linuxfc.FlushMultipath(devMapper)

	for _, devName := range devNames {

idempotancy aspects

what if now the flex binary fail for some reason, then what? is it idempotent? the cache is not exist and then another retry comes from k8s and run again the flex but now u cannot see the multipath device so u cannot really delete the physicall devices. means u will have stale devices in such case.

Please make sure that the caching layer is not adding bugs related to idempotency.
what is the risk here?

BTW its really confusing, you are store to the cache in the block_device_mounter_utils.UnmountDeviceFlow but then u uses it inside different layer block_device_utils.rescan.

remote/mounter/initiator/connectors/fibre_channel.go, line 88 at r2 (raw file):

	// If flushing the multipath failed before, try now after we have removed the devices.
	c.logger.Info("Flush multipath device again after removing the devices", logs.Args{{"name", devMapper}})

why not to delete the mpath again only if it failed in the first place? and why its Info?

remote/mounter/initiator/connectors/fibre_channel.go, line 89 at r2 (raw file):

	// If flushing the multipath failed before, try now after we have removed the devices.
	c.logger.Info("Flush multipath device again after removing the devices", logs.Args{{"name", devMapper}})
	c.linuxfc.FlushMultipath(devMapper)

and i think you again don't handling the error from the second flushing. any reason?

resources/resources.go, line 362 at r2 (raw file):

	WWN          string
	LunNumber    int
	DeviceMapper string

please add comment on these 2 new fields, that you need it only for the caching thing.

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

shay-berman · 2019-03-24T13:17:33Z

@27149chen opened up PR #299 with improved solution based on the comments. So this PR was closed and a new one opened #299.

27149chen added 3 commits March 14, 2019 14:42

store the volume mount properties in cache

90c6fa7

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

add ut

df56c03

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

Merge branch 'dev' into feature/UB-1941_unmount_fix

12862b5

# Conflicts: # resources/resources.go

27149chen added the v2.1.0 label Mar 14, 2019

27149chen added this to the Helm chart support milestone Mar 14, 2019

27149chen self-assigned this Mar 14, 2019

27149chen requested a review from shay-berman March 14, 2019 09:57

27149chen added 6 commits March 14, 2019 20:57

add trace

0e31777

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

use thread safe map

c4f9106

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

store volume info before multipath flush

59003f2

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

remove MpathDevFullPath

572d1b4

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

skip cleanup if volume not found in cache

026f5c8

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

update ut

8a6600c

shay-berman suggested changes Mar 18, 2019

View reviewed changes

27149chen added 3 commits March 19, 2019 14:39

iscsi cleanup

440a631

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

remove cache

495f92e

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

remove cache from interface & fixed ut

82e54ee

Signed-off-by: 27149chen <7991675+27149chen@users.noreply.github.com>

27149chen closed this Mar 19, 2019

shay-berman added the bug label Mar 26, 2019

shay-berman modified the milestone: Helm chart support Mar 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/ub 1941 unmount fix #298

Feature/ub 1941 unmount fix #298

27149chen commented Mar 14, 2019 •

edited by shay-berman

Loading

coveralls commented Mar 14, 2019 •

edited

Loading

shay-berman left a comment

shay-berman commented Mar 24, 2019

Feature/ub 1941 unmount fix #298

Feature/ub 1941 unmount fix #298

Conversation

27149chen commented Mar 14, 2019 • edited by shay-berman Loading

coveralls commented Mar 14, 2019 • edited Loading

shay-berman left a comment

Choose a reason for hiding this comment

shay-berman commented Mar 24, 2019

27149chen commented Mar 14, 2019 •

edited by shay-berman

Loading

coveralls commented Mar 14, 2019 •

edited

Loading