This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Feature/ub 1941 unmount fix #298

Closed
wants to merge 12 commits from feature/UB-1941_unmount_fix into dev

Conversation

@27149chen (Contributor) commented Mar 14, 2019

Description of the issue
There is an issue with the new rescan method introduced in PR #294: the unmount flow does not clean up the physical devices.
More detail: in the new cleanup procedure we get the multipath devices belonging to the volume from the multipath -ll output, but that happens in ActionAfterDetach. Before that, multipath -f is called, which removes the volume's record from the multipath -ll output. As a result we can't find the devices we need, and the command "echo 1 > /sys/block/sdx/device/delete" is never run. This leaves stale information (stale paths in /dev/disk/by-path/) on the host.

Description of the optimal solution - will not be done in this PR
Change the order of the flex unmount as follows (a sketch of this ordering is included below):

  1. multipath -f
  2. echo 1 > /sys/block/sdx/device/delete
  3. unmap

rescan or cleanup in ActionAfterDetach doesn't help much.
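
A minimal sketch of this ordering, assuming the device-mapper name and the physical device names are already known when the flex unmount starts; the command invocations and the unmapVolume placeholder are illustrative, not the actual flex code:

```go
// Sketch of the proposed flex unmount ordering. Assumptions: devMapper ("mpathx")
// and devices ("sda", "sdb", ...) are already known; unmapVolume is a placeholder.
package cleanup

import (
	"fmt"
	"os"
	"os/exec"
)

func cleanupVolume(devMapper string, devices []string) error {
	// 1. multipath -f: flush the multipath device first, while it is still visible.
	if err := exec.Command("multipath", "-f", devMapper).Run(); err != nil {
		return fmt.Errorf("multipath -f %s failed: %v", devMapper, err)
	}

	// 2. echo 1 > /sys/block/sdX/device/delete for each physical path.
	for _, dev := range devices {
		sysPath := fmt.Sprintf("/sys/block/%s/device/delete", dev)
		if err := os.WriteFile(sysPath, []byte("1"), 0200); err != nil {
			return fmt.Errorf("deleting %s failed: %v", dev, err)
		}
	}

	// 3. unmap the volume on the storage backend (left abstract here).
	return unmapVolume(devMapper)
}

func unmapVolume(devMapper string) error { return nil } // placeholder for the real unmap call
```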

Description of the solution
Reordering the steps is a big change, so it will be done in the next release. For the current release, I'll use a local cache to store the information required in the cleanup phase (volume WWN, LUN, multipath device, and the list of physical devices), so that we can clean up the paths even when the volume is no longer in the multipath -ll output.

Detail about the cache
The cache is a "map[volumeWWN]*VolumeMountProperties" (in fact it is a sync.Map, which is like a Go map[interface{}]interface{} but is safe for concurrent use by multiple goroutines without additional locking or coordination). Each key-value pair stands for a volume that is mounted to the current host by flex. The value, VolumeMountProperties, is a struct with 4 attributes: WWN, LunNumber, DeviceMapper ("mpathx"), and Devices (["sda", "sdb", ...]); it represents a multipath device and all of its paths.
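
A minimal sketch of what such a cache could look like, based on the description above; the package layout and function names are illustrative rather than the exact ubiquity code:

```go
// Illustration of the described cache, keyed by volume WWN (not the exact ubiquity code).
package cache

import "sync"

// VolumeMountProperties stands for a multipath device and all of its paths,
// with the four attributes listed in the description.
type VolumeMountProperties struct {
	WWN          string
	LunNumber    int
	DeviceMapper string   // e.g. "mpathx"
	Devices      []string // e.g. ["sda", "sdb"]
}

// volumeCache behaves like map[volumeWWN]*VolumeMountProperties; sync.Map is
// safe for concurrent use by multiple goroutines without extra locking.
var volumeCache sync.Map

// StoreVolumeToCache records a volume that flex mounted to the current host.
func StoreVolumeToCache(v *VolumeMountProperties) {
	volumeCache.Store(v.WWN, v)
}

// GetVolumeFromCache returns the cached properties for a WWN, if present.
func GetVolumeFromCache(wwn string) (*VolumeMountProperties, bool) {
	val, ok := volumeCache.Load(wwn)
	if !ok {
		return nil, false
	}
	return val.(*VolumeMountProperties), true
}

// RemoveVolumeFromCache drops the entry once cleanup has finished.
func RemoveVolumeFromCache(wwn string) {
	volumeCache.Delete(wwn)
}
```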



@27149chen 27149chen added this to the Helm chart support milestone Mar 14, 2019
@27149chen 27149chen self-assigned this Mar 14, 2019
@coveralls commented Mar 14, 2019

Coverage Status

Coverage remained the same at 66.484% when pulling 82e54ee on feature/UB-1941_unmount_fix into 91b55dc on dev.

@shay-berman (Contributor) left a comment


Reviewable status: 0 of 11 files reviewed, 18 unresolved discussions (waiting on @27149chen and @shay-berman)


remote/mounter/block_device_mounter_utils/block_device_mounter_utils.go, line 164 at r2 (raw file):

		b.blockDeviceUtils.StoreVolumeToCache(volumeMountProperties)
	} else {
		b.logger.Warning("Failed to store volume info", logs.Args{{"volumeWWN", volumeWwn}, {"error", err}})

What does this warning mean? How does it help the user?
If it happens, what is the impact?


remote/mounter/block_device_utils/block_device_utils_test.go, line 55 at r2 (raw file):

	})

	Context("volume cache", func() {

Try to give more descriptive text here.


remote/mounter/block_device_utils/rescan.go, line 46 at r2 (raw file):

		return volume.(*resources.VolumeMountProperties)
	}
	b.logger.Warning("Volume not found in cache.", logs.Args{{"wwn", volumeMountProperties.WWN}})

Why is this a warning? Does it mean that every first call will raise a warning in the log?


remote/mounter/block_device_utils/rescan.go, line 60 at r2 (raw file):

	volume := new(resources.VolumeMountProperties)
	*volume = *volumeMountProperties

I guess you want to copy by value. Are you sure it really copies the physical device list inside?
If volumeMountProperties contains a map or a slice, it will not be deep-copied. Please check.
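
A small self-contained illustration of this point, assuming a Devices slice field as described in the PR: assigning the struct copies the slice header, so both structs keep pointing at the same backing array.

```go
// Shows that `*copied = *original` is a shallow copy: the Devices slice header
// is copied, but both structs still share the same backing array.
package main

import "fmt"

type VolumeMountProperties struct {
	WWN     string
	Devices []string
}

func main() {
	original := &VolumeMountProperties{WWN: "wwn-1", Devices: []string{"sda", "sdb"}}

	copied := new(VolumeMountProperties)
	*copied = *original // copy by value, as in the reviewed code

	copied.Devices[0] = "sdx"        // mutates the shared backing array
	fmt.Println(original.Devices[0]) // prints "sdx", not "sda"
}
```

So any field holding a slice or map needs an explicit copy if true isolation between the cached value and the returned value is required.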


remote/mounter/block_device_utils/rescan.go, line 138 at r2 (raw file):

	defer b.logger.Trace(logs.DEBUG)()

	volume := b.GetVolumeFromCache(volumeMountProperties)

If you get the details from the cache here, why do you need to get them again from the cache inside the DisconnectVolume method? It sounds like a double get. I am not sure I understand the reason; please explain.


remote/mounter/block_device_utils/rescan.go, line 147 at r2 (raw file):

		return err
	}
	b.RemoveVolumeFromCache(volumeMountProperties)

But what if, for some reason, the flex unmount fails after the cache entry has already been removed? (See also the idempotency comment below.)


remote/mounter/block_device_utils/resources.go, line 46 at r2 (raw file):

	IsDirAMountPoint(dirPath string) (bool, []string, error)
	SetDmsetup(mpath string) error
	GetVolumeFromCache(volumeMountProperties *resources.VolumeMountProperties) *resources.VolumeMountProperties

Please move these cache methods out of BlockDeviceUtils, since this is not block-device utility functionality at all; it is just a wrapper around a cache.
So please move these APIs to a new interface and just use that interface inside the struct.

Again, because BlockDeviceUtils is the interface for actual block operations, not for cache wrapping.

Makes sense?
That way you can also write dedicated UTs for it.
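
A hedged sketch of the suggested split; the interface name and its methods are illustrative, not the actual ubiquity API:

```go
// Illustrative only: a dedicated cache interface kept out of BlockDeviceUtils,
// so BlockDeviceUtils remains an interface for actual block operations.
package mounter

// VolumeMountProperties stands in for resources.VolumeMountProperties here.
type VolumeMountProperties struct {
	WWN          string
	LunNumber    int
	DeviceMapper string
	Devices      []string
}

// VolumeMountCache wraps the caching concern and can get its own dedicated UTs.
type VolumeMountCache interface {
	Store(volume *VolumeMountProperties)
	Get(wwn string) (*VolumeMountProperties, bool)
	Remove(wwn string)
}

// blockDeviceUtils would then just hold a VolumeMountCache and delegate to it.
type blockDeviceUtils struct {
	cache VolumeMountCache
}
```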


remote/mounter/initiator/linuxfc_test.go, line 245 at r2 (raw file):

		})

		It("should flush 3 times if fails", func() {

Why did you change it from 2 to 3?


remote/mounter/initiator/linuxscsi.go, line 23 at r2 (raw file):

}

// FlushMultipath flushes the device, retry 3 times if it is failed.

It is better to have a retry constant and reuse it in both the code and the test, to avoid mistakes (see the sketch after the next comment).

How did you decide on 3 times?


remote/mounter/initiator/linuxscsi.go, line 37 at r2 (raw file):

		}
	}
}

Why are you not handling the error from the 3rd (final) attempt?
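
A hedged sketch addressing both comments above (a shared retry constant, and surfacing the error from the final attempt); the type, method signature, and command invocation are illustrative, not the actual ubiquity code:

```go
// Illustrative: the retry count lives in one constant that both the code and the
// test reference, and the error from the last attempt is returned to the caller.
package initiator

import "os/exec"

// MultipathFlushRetries is reused by FlushMultipath and by its unit test,
// so the two cannot silently drift apart.
const MultipathFlushRetries = 3

type linuxSCSI struct{}

func (l *linuxSCSI) FlushMultipath(deviceMapName string) error {
	var err error
	for i := 0; i < MultipathFlushRetries; i++ {
		if err = exec.Command("multipath", "-f", deviceMapName).Run(); err == nil {
			return nil
		}
	}
	return err // surface the failure of the final attempt instead of swallowing it
}
```

The unit test could then expect MultipathFlushRetries attempts instead of a hard-coded 3.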


remote/mounter/initiator/connectors/fibre_channel.go, line 65 at r2 (raw file):

	devMapper := volumeMountProperties.DeviceMapper
	devNames := volumeMountProperties.Devices
	if devMapper == "" {

When will it be empty?


remote/mounter/initiator/connectors/fibre_channel.go, line 68 at r2 (raw file):

		_, devMapper, devNames, err = utils.GetMultipathOutputAndDeviceMapperAndDevice(volumeMountProperties.WWN, c.exec)
		if err != nil || devMapper == "" {
			return c.logger.ErrorRet(err, "Failed to get multipath output before disconnecting volume")

The word "disconnecting" is very confusing.
I guess you mean cleaning up the device, right?


remote/mounter/initiator/connectors/fibre_channel.go, line 73 at r2 (raw file):

	// flush multipath device
	c.logger.Info("Flush multipath device", logs.Args{{"name", devMapper}})

Please change the log text to "Flush (remove) multipath device".


remote/mounter/initiator/connectors/fibre_channel.go, line 74 at r2 (raw file):

	// flush multipath device
	c.logger.Info("Flush multipath device", logs.Args{{"name", devMapper}})
	c.linuxfc.FlushMultipath(devMapper)

What about errors from the flush?


remote/mounter/initiator/connectors/fibre_channel.go, line 76 at r2 (raw file):

	c.linuxfc.FlushMultipath(devMapper)

	for _, devName := range devNames {

Idempotency aspects:

What if the flex binary now fails for some reason, then what? Is it idempotent? The cache entry no longer exists, and then another retry comes from k8s and runs flex again, but now you cannot see the multipath device, so you cannot really delete the physical devices. That means you will have stale devices in such a case.

Please make sure that the caching layer is not adding idempotency bugs.
What is the risk here?

BTW, it is really confusing that you store to the cache in block_device_mounter_utils.UnmountDeviceFlow but then use it inside a different layer, block_device_utils.rescan.


remote/mounter/initiator/connectors/fibre_channel.go, line 88 at r2 (raw file):

	// If flushing the multipath failed before, try now after we have removed the devices.
	c.logger.Info("Flush multipath device again after removing the devices", logs.Args{{"name", devMapper}})

Why not flush the mpath again only if it failed the first time? And why is this logged at Info level?


remote/mounter/initiator/connectors/fibre_channel.go, line 89 at r2 (raw file):

	// If flushing the multipath failed before, try now after we have removed the devices.
	c.logger.Info("Flush multipath device again after removing the devices", logs.Args{{"name", devMapper}})
	c.linuxfc.FlushMultipath(devMapper)

And I think you are again not handling the error from the second flush. Any reason?


resources/resources.go, line 362 at r2 (raw file):

	WWN          string
	LunNumber    int
	DeviceMapper string

Please add a comment on these 2 new fields explaining that they are needed only for the caching.

@27149chen 27149chen closed this Mar 19, 2019
@shay-berman (Contributor) commented:

@27149chen opened PR #299 with an improved solution based on the comments, so this PR was closed in favor of the new one, #299.
