
feature request: RPC call to trigger a re-discovery of a deleted raid1 bdev #3341

arsiesys opened this issue Apr 15, 2024 · 6 comments

@arsiesys
Suggestion

If we have a raid1 bdev composed of the following base_bdevs on host 1:

  • one remote lvol attached (from host 0)
  • one local lvol (host 1)

If I remove the remote base_bdev from the raid (bdev_raid_remove_base_bdev) and then delete the raid on host 1, the raid can't be re-discovered without restarting SPDK and running bdev_examine. We seem to be missing an RPC call to force a rediscovery when the raid1 has only one disk left.
The current "workaround", when the raid1 has only one remaining device, is to attach that device to another SPDK host (over NVMe-oF) and expose the raid from there to trigger the discovery and use it, without restarting the SPDK node it is currently hosted on.

The reason we are doing this is in the context of a Kubernetes CSI: when a volume/raid is unpublished (not used by any pod anymore), we delete the raid and detach all remote lvols that are part of the replication. We may have situations where the raid1 has only one base_bdev, temporarily or even on purpose. For example:

  • A temporarily degraded raid1 with only one base_bdev left, the other bdevs having been removed
  • A raid1 created with only one bdev but that could grow new base_bdevs over time (there is currently no RPC call that allows this, but it may be coming in the future; in that situation, I create a raid1 with two base_bdevs and remove one to keep a spare slot for this use case)

In both use cases I could make my Kubernetes CSI smarter and not remove the raid1 if only one base_bdev is currently active. But I still think that having an RPC call to trigger the raid discovery would be handy if the raid1 got deleted by mistake: it would avoid restarting SPDK and impacting other volumes/usage/traffic in such a situation.

Current Behavior

Once a raid1 bdev has been deleted, there is no existing RPC call that allows rediscovering the raid1 without restarting spdk_tgt.

Possible Solution

An RPC call that re-triggers discovery of the raid and makes the raid bdev available again.

Context (Environment including OS version, SPDK version, etc.)

Reference discussion: #3306 (comment)
SPDK version: https://review.spdk.io/gerrit/c/spdk/spdk/+/22641

Thank you!

@apaszkie apaszkie self-assigned this Apr 16, 2024
@apaszkie
Contributor

I've been wondering how to add this functionality and I think a good solution would be to introduce two new RPCs: raid_bdev_start, which would allow starting (bringing ONLINE) a raid bdev from the CONFIGURING state, and raid_bdev_stop, which would revert it back to the CONFIGURING state. Then, instead of raid_bdev_delete, you could use raid_bdev_stop to deactivate the raid and later raid_bdev_start to activate it again. Would this be sufficient for your use cases?

With this, we could easily enable two other functionalities:

  1. the ability to force activation of a raid_bdev when not all base bdevs are discovered,
  2. allow raid_bdev_delete to actually delete the raid by removing the superblock from the base bdevs.
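
As a rough sketch of how this would be used from a client (nothing below exists yet; the raid_bdev_stop / raid_bdev_start method names and their single "name" parameter only reflect the proposal above, and the raid name is illustrative):

```python
import json

RAID_NAME = "raid1_vol"  # illustrative name

# Proposed flow: deactivate instead of delete, then reactivate later.
# Neither RPC exists today; the method names and the single "name" parameter
# only reflect the proposal in this comment.
proposed_calls = [
    # ONLINE -> CONFIGURING; base bdevs stay claimed
    {"jsonrpc": "2.0", "id": 1, "method": "raid_bdev_stop",
     "params": {"name": RAID_NAME}},
    # CONFIGURING -> ONLINE again, once enough base bdevs are present
    {"jsonrpc": "2.0", "id": 2, "method": "raid_bdev_start",
     "params": {"name": RAID_NAME}},
]

for call in proposed_calls:
    print(json.dumps(call, indent=2))
```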

@arsiesys
Author

arsiesys commented Apr 23, 2024

Would raid_bdev_stop keep a volume in a "claimed" state?
In a situation where I decide to fail over which SPDK storage node "exposes" the raid with NVMe-oF, I would need to be able to create a subsystem and then a listener to expose the local volume to another node. If there is a previously discovered raid_bdev in CONFIGURING state, would that prevent me from doing so because the resource may be locked?

Also, if I detach a remote NVMe that is part of this stopped raid, is there a risk that it would remove the base_bdev at the same time? (I remember this happened when I removed the subsystem; maybe it won't do that if it's just an NVMe detach.) Until now, I have always made sure the raid was removed before safely removing the attached base_bdevs.

@apaszkie
Contributor

bdev_raid_stop would keep the base bdevs claimed but removing a base bdev (directly or with bdev_raid_remove_base_bdev) will unclaim it without affecting the state in the superblock. When in CONFIGURING state, base bdevs can be freely removed and added back (since https://review.spdk.io/gerrit/c/spdk/spdk/+/22497).

@arsiesys
Author

If this allows disassembling all the base_bdevs of a raid down to the last one, so that one of them can eventually be exposed with NVMe-oF and the raid assembled on another node, then it should work for my use cases. However, let's imagine the raid1 got assembled on another node (over NVMe-oF) and a new base_bdev got added there; then, for some reason, we move the raid assembly back to the previous node, which still has knowledge of the raid1 as it existed before. Could that create an issue where it doesn't know about/discover the new base_bdev?

Regarding the other functionalities it may allow:

  1. the ability to force activation of a raid_bdev when not all base bdevs are discovered
    => Is this already possible by removing the non-discovered/missing base_bdev with bdev_raid_remove_base_bdev while in CONFIGURING mode? If not, then yes, this would be a great bonus!
  2. allow raid_bdev_delete to actually delete the raid by removing the superblock from the base bdevs
    => This may help users whose raid1 uses a physical local/remote NVMe, as it would avoid having to manually access the disk or use dd to wipe the existing superblock, and make it easier to reuse the disk. Was that the reason you thought about it?

@apaszkie
Contributor

let's imagine the raid1 got assembled on another node (over NVMe-oF) and a new base_bdev got added there; then, for some reason, we move the raid assembly back to the previous node, which still has knowledge of the raid1 as it existed before. Could that create an issue where it doesn't know about/discover the new base_bdev?

I'm not sure I understand, but I think the situation you described is comparable to this:

  1. remove base bdev A
  2. add a new base bdev B (rebuild)
  3. delete the raid bdev, detach all base bdevs
  4. attach the base bdev A again - the raid will start (configuring) from the "old" version of the superblock, without bdev B, waiting for the remaining base bdevs
  5. when any of the other base bdevs is attached, the newer version of the superblock will be read and used instead - with bdev B, not A.

That's because the superblock has a sequence number, incremented on every update, and the superblock version with the highest sequence number is considered the correct one. There are tests in test/bdev/bdev_raid.sh that check such cases (raid_superblock_test, raid_rebuild_test).
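
Conceptually, the selection boils down to something like this minimal sketch (the field names are illustrative, not the actual on-disk superblock layout):

```python
# Conceptual sketch only: every superblock update bumps a sequence number and
# the copy with the highest sequence number wins. Field names are illustrative,
# not the real on-disk layout.

def pick_superblock(candidates):
    """Return the superblock copy with the highest sequence number."""
    return max(candidates, key=lambda sb: sb["seq_number"])

# Matching the steps above: bdev A holds an old copy (still listing A),
# the other base bdevs hold a newer copy (listing B instead of A).
old_copy = {"seq_number": 5, "base_bdevs": ["A", "C"]}
new_copy = {"seq_number": 7, "base_bdevs": ["B", "C"]}

assert pick_superblock([old_copy, new_copy]) is new_copy
```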

Is this already possible by removing the non-discovered/missing base_bdev with bdev_raid_remove_base_bdev while in CONFIGURING mode? If not, then yes, this would be a great bonus!

It's not possible now, there is no direct way to force a raid to start without all the expected base bdevs. We definitely need to be able to do this somehow.

This may help users whose raid1 uses a physical local/remote NVMe, as it would avoid having to manually access the disk or use dd to wipe the existing superblock, and make it easier to reuse the disk. Was that the reason you thought about it?

Yes, exactly. I'm not sure yet if this should be the default behavior or an option for bdev_raid_delete. Another RPC to wipe the superblock from individual base bdevs could also be useful.
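
Purely as a sketch of the two alternatives (nothing here is implemented; the wipe_superblock option and the bdev_raid_wipe_superblock method name are made up for illustration, as are the bdev names):

```python
import json

# Hypothetical request shapes only -- neither the "wipe_superblock" option nor
# a "bdev_raid_wipe_superblock" RPC exists; both names are invented here to
# illustrate the two alternatives discussed above.
delete_and_wipe = {
    "jsonrpc": "2.0", "id": 1, "method": "bdev_raid_delete",
    "params": {"name": "raid1_vol", "wipe_superblock": True},  # opt-in, False by default
}
wipe_one_base_bdev = {
    "jsonrpc": "2.0", "id": 2, "method": "bdev_raid_wipe_superblock",
    "params": {"name": "local_lvol"},  # wipe a single base bdev's superblock
}

for call in (delete_and_wipe, wipe_one_base_bdev):
    print(json.dumps(call, indent=2))
```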

@arsiesys
Author

arsiesys commented Apr 25, 2024

If the superblock always uses the latest sequence number, that should be fine then! Thanks


bdev_raid_delete
I would indeed feel safer if the superblock wipe were an option of bdev_raid_delete (false by default).


It's not possible now, there is no direct way to force a raid to start without all the expected base bdevs. We definitely need to be able to do this somehow.

I don't handle this situation in my CSI yet, but I will probably be happy to have it. In case we lose a node for some reason, I could start the raid and replace the lost replicas. That means I would need to be able to remove the lost replicas with bdev_raid_remove_base_bdev, so I don't waste a slot in the raid1.


Something magical would be, once the superblock is removed, being able to create a new raid1 and keep the existing data. That would be amazing for cloning purposes. I don't have an immediate application for this, but I like the idea that it could be possible if the need arises. Maybe I should keep that for a future feature request! The previous point of being able to start a raid with missing replicas seems more important :p
