
Merge redhat lvm.sh feature set into heartbeat LVM agent #222

Closed
wants to merge 4 commits

Conversation

davidvossel
Contributor

No description provided.

@davidvossel
Contributor Author

Hopefully this branch will help the review process go more smoothly.

The first patch sets up the current LVM heartbeat agent for the feature set merger that happens in the second patch. Look at the commit message for the first patch for details on where functionality got moved. The refactoring is necessary to keep this agent manageable with the inclusion of the redhat lvm.sh features. I structured the agent the way I did so I could both re-use the existing logic with the new feature set and isolate the original behavior with as little impact as possible.

@dmuhamedagic
Contributor

There are no significant differences between the original pull request and this one. Do we need to reiterate the discussion, and the plea to keep the changes minimal and to explain each change? ("This is needed for feature xy" is not only vague but also hard to verify.) This is a complete rewrite of the existing RA and as such difficult to validate. The topic of using lvm tools in the monitor operation was discussed at length (on the linux-ha-dev ML, IIRC), and the resulting patch was also not preserved here.

@davidvossel
Contributor Author

Good, this is something we can discuss.

The original LVM agent only did an 'ls' on the /dev/volgrpname directory to see if something existed in it. This new logic actually figures out what logical volumes are present in the volume group and verifies they are up individually. This is proven logic we've supported in the redhat agent for years. I'd be happy to address any technical concerns you have with this approach. If you could link me to the ML discussion I'd like to understand why something similar to this was disagreed on.

-- Vossel
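
The contrast between the two checks described above can be sketched roughly as follows. This is only an illustration: the function names, the `devroot` parameter (a hook so the logic can be exercised outside /dev), and the exact `lvs` options are assumptions, not the agent's actual code.

```shell
#!/bin/sh
# Old heartbeat check: is there anything at all under /dev/<vg>?
old_style_status() {
    vg="$1"; devroot="${2:-/dev}"
    [ -n "$(ls -A "$devroot/$vg" 2>/dev/null)" ]
}

# Merged-style check: enumerate the LVs that the VG actually contains
# and verify each one has an activated device node, instead of
# trusting the directory listing alone.
new_style_status() {
    vg="$1"; devroot="${2:-/dev}"
    lvlist=$(lvs --noheadings -o lv_name "$vg" 2>/dev/null) || return 1
    [ -n "$lvlist" ] || return 1
    for lv in $lvlist; do
        [ -b "$devroot/$vg/$lv" ] || [ -L "$devroot/$vg/$lv" ] || return 1
    done
}
```

The old check passes as soon as the directory is non-empty; the new one fails if even a single logical volume in the group is missing its device node.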

@fabbione
Member

On 4/19/2013 4:09 PM, Dejan Muhamedagic wrote:

There are no significant differences between the original pull request
and this one. Do we need to reiterate the discussion and a plea to keep
the changes minimal and to give explanation for those changes ("this is
needed for the feature xy" is not only vague, but hard to verify too).

So ok.. just so that we agree upfront: what level of detail would make
you happy?

Example (not even sure if it's touched by the patch.. just to get on
the same page here):

replace LVM_status() with merged_LVM_status()

old LVM_status supports both lvm1 and lvm2 and gathers status via
vgstatus x..y..z

new merged_LVM_status supports both lvm1 and lvm2, lays down the path
for lvm2 support, and replaces the use of per-version vgdisplay with
god_of_vg_display -o format_output, which is compatible with both.

god_of_vg_display then automatically detects... blablabla...

---- code/patch -----


Now.. the redhat lvm provides tons of pre-made functions; that's
probably what makes it a bit more complex to read on your side, because
they seem to kill all the code in LVM, but effectively they only
organize it logically.

Would an expanded code path make it easier to read?

For example:

god_of_vg_display calls into file_x:function_y:line to gather info Z.

Does that help? It's a lot of work to do it this way, but doable if it
helps you. Otherwise, can you be more explicit about your expectations?

Fabio

@dmuhamedagic
Contributor

The relevant discussion ... I don't have that readily available. At any rate, the resulting changes are these two: a405e78 and b4b8775

The crux of the matter is that just about any lvm command goes through all VGs on the system. If one of them is borked in such a way that it is inaccessible, the command will hang and the operation will time out. Thus, unrelated storage can affect the storage actually managed by this resource instance. This is what was observed in the field and reported several times, both by customers and on the public ML.
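
The failure mode described here can be made concrete with a small sketch. This is not the agent's code: `run_bounded` is an illustrative wrapper, and the idea is simply that any lvm invocation made during monitor should be bounded by a deadline shorter than the operation timeout, since a single hung device can stall the whole scan.

```shell
#!/bin/sh
# Bound a command with a deadline using coreutils timeout, which
# returns 124 when it has to kill the command.
run_bounded() {
    deadline="$1"; shift
    timeout "$deadline" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "command '$*' hung for ${deadline}s; treating as failure" >&2
    fi
    return "$rc"
}
```

Usage would look like `run_bounded 10 vgdisplay myvg`; note this only converts a hang into a clean failure, it does not remove the underlying risk of touching unrelated PVs.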

@davidvossel
Contributor Author

Interesting. I'll rework the status function so it attempts to avoid using the lvm tools. There are situations where this will be unavoidable, but I believe I can restore this behavior for the pre-merger LVM use cases. I'll give it a go and see what I come up with.

@davidvossel
Contributor Author

Alright, take a look at the new vg_status_normal function in the first commit.

davidvossel@5ec9a8a#L3R27

It is very similar to the old status function. The only difference I made is that if the volume group looks empty, we fall back to using the lvm tools to read the metadata, just to verify the vg is in fact empty. In normal use cases, though, the lvm tools will not be executed. Sound reasonable?
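
The fallback ordering described here can be sketched as below. The helper name and the `devroot` parameter are illustrative assumptions (the real logic is `vg_status_normal` in the linked commit); the point is that the normal path touches only device nodes, and an lvm command runs only in the unusual empty-looking case.

```shell
#!/bin/sh
vg_status_sketch() {
    vg="$1"; devroot="${2:-/dev}"
    for node in "$devroot/$vg"/*; do
        # Any activated LV leaves a block device or symlink here, so
        # the normal path never executes an lvm command.
        if [ -b "$node" ] || [ -L "$node" ]; then
            return 0
        fi
    done
    # Directory missing or empty: only now fall back to the lvm tools.
    # In this simplification, a VG that really has no LVs is still
    # considered active, while LVs present in the metadata but missing
    # their device nodes are a failure.
    [ -z "$(lvs --noheadings -o lv_name "$vg" 2>/dev/null)" ]
}
```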

@l-mb
Contributor

l-mb commented Apr 23, 2013

I really appreciate that code merging is happening, kudos for that. And the new exclusive activation feature also rocks. Given that it is a merge of two mature code bases and not an incremental improvement, slightly less-than-perfectly incremental patches are to be expected.

Moving code from the RA to a common shell library probably makes sense if those are shared between multiple implementations, but I agree that this makes it harder to review than in-place updates. Oh well. ;-)

Calling out to "lvs" in the monitor function though is dangerous, unfortunately. I'm surprised you've not hit that on RHEL. If IO hangs (which is particularly common with multipath), that'll hang and time out - and, like Dejan commented, this can be triggered by any path within the LVM filter spec, since lvs will open/read them all.

I've never seen a VG directory (/dev/cvg/) exist if no LVs inside it were activated; i.e., I've never seen the directory exist but be empty. Can this actually happen? In short, does this test (apart from checking whether device nodes exist) actually make sense? It seems harmless enough for now, especially if that case can't happen.

As far as the exclusive activation of individual LVs is concerned, I admit I have second thoughts on lumping this in with the general LVM RA that manages whole volume groups. I was thinking that perhaps a dedicated LV RA would make more sense. But it does share a lot of code.

I agree with Dejan that such a merge will want plenty of testing, but we seem to gain some good functionality here.

@davidvossel
Contributor Author

I've updated the status function to remove all remaining references to 'lvs'. I tested vg activation without any logical volumes, and it appears the /dev/vg directory doesn't exist until a logical volume is created. For some reason I thought I had seen it before.

@l-mb
Contributor

l-mb commented Apr 23, 2013

Thank you!

Looks good to me.

@l-mb
Contributor

l-mb commented Apr 30, 2013

To clarify, will the code in resource-agents/rgmanager/src/resources/ also switch to using these shared libraries?

@fabbione
Member

fabbione commented May 1, 2013

On 04/30/2013 01:38 PM, Lars Marowsky-Brée wrote:

To clarify, will the code in resource-agents/rgmanager/src/resources/
also switch to using these shared libraries?

Those libraries come from there. As the agents are merged,
rgmanager/src* will disappear completely.

Fabio

@davidvossel
Contributor Author

Anything I can do to move this along?

@dmuhamedagic
Contributor

As already mentioned in the first comment, it is difficult to validate the first changeset (davidvossel@0ef1449) because of the number of changes within it. If you can split it into smaller chunks which are self-sufficient and well explained, that would certainly help. There are further observations in the first changeset.

@davidvossel
Contributor Author

The exclusive activation using tags is only questionable if the pacemaker deployment does not have fencing enabled.

@davidvossel
Contributor Author

  • Please copy in your ocft test results. Mine fails during setup, before it even gets to the part that invokes the agent.
  • I'm confused about how I should be testing portability. My understanding was that #!/bin/sh would run bash in a way that disables bash's extended functionality. I had to change some things to get this to work. How else should I be testing for compatibility?
  • Logging messages moved and changed. New checks during validation were added, replacing previous behavior. Is there a specific log output I'm missing that indicates a condition we need to know about?
  • Status is a deprecated alias for monitor. Are you saying monitor and status are broken? Can you tell me what you encountered? I have tested this. Please be more descriptive.
  • restore_transient_failed_pvs just attempts to restore missing physical volumes before starting. This has the potential to turn a partial activation into a complete activation, but it shouldn't interfere with the partial activation if the pv cannot be restored.

@davidvossel
Contributor Author

I looked into portability more. I'll see what I can do to port the changes to work with dash.

@davidvossel
Contributor Author

I've ported the changes to work with both bash and dash. checkbashisms passes for all the changed files as well.

Edit: I've also gone through and validated all use cases (normal, tags, exclusive with clvm) in my test environment again.
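
For context, these are generic examples of the kinds of constructs checkbashisms flags and their POSIX replacements. They are illustrative patterns, not lines taken from the agent; the helper names are invented for this sketch.

```shell
#!/bin/sh
# bash-only: ${var//search/replace}.  POSIX parameter expansion only
# has the # ## % %% prefix/suffix operators:
strip_dev_prefix() {
    printf '%s\n' "${1#/dev/}"
}

# bash-only: let n+=1 or ((n++)).  POSIX arithmetic expansion:
next_index() {
    printf '%s\n' "$(($1 + 1))"
}

# bash-only: "function foo { ...; }".  POSIX function definitions use
# the name() { ...; } form, as above.
```

Running `checkbashisms script.sh` over a file reports each bash-only construct with its line number, which is what makes it useful as a quick portability gate.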

Functionality from the original heartbeat agent has moved into new
function names and files. This is a map to understanding where the
original functionality lives within the new merged agent.

LVM_monitor and LVM_status --> vg_status_normal() in lvm_vg.sh
The new monitor function does additional checking to verify that
both the volume group and each logical volume are correctly activated.

LVM_start() ---> vg_start_normal() in lvm_vg.sh

LVM_stop() ---> vg_stop_normal() in lvm_vg.sh

LVM_validate_all() ---> verify_setup() in lvm_functions.sh
@dmuhamedagic
Contributor

On Wed, May 08, 2013 at 08:10:10AM -0700, David Vossel wrote:

The exclusive activation using tags is only questionable if the pacemaker deployment does not have fencing enabled.

If you want to rely on the cluster resource manager and fencing
functioning, why do you then need to verify the membership
information? If it's just to prevent somebody outside the cluster
from using the volume, then testing the membership from the resource
agent is of no benefit.

@davidvossel
Contributor Author

It doesn't appear to have anything to do with trying to prevent an outside source from activating the volume. If that happens, the agent will just strip the tag from the outside source and claim the volume for the cluster.

It looks like this functionality is present to prevent a node from potentially stripping another cluster node's ownership of the volume. For the stop action in particular, this seems important.
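
The tag-ownership scheme being discussed can be sketched as follows. The `vgchange --addtag`/`--deltag` and `vgs -o tags` forms are real lvm syntax, but the policy shown is a simplification, and `first_tag`/`claim_vg` are illustrative names, not the agent's functions.

```shell
#!/bin/sh
# Parse the owning tag out of `vgs --noheadings -o tags <vg>` output
# (tags are comma-separated; this scheme uses one owner tag per VG).
first_tag() {
    printf '%s\n' "$1" | tr -d ' ' | cut -d, -f1
}

claim_vg() {
    vg="$1"; me="$2"
    owner=$(first_tag "$(vgs --noheadings -o tags "$vg" 2>/dev/null)")
    if [ -n "$owner" ] && [ "$owner" != "$me" ]; then
        # Stripping another node's tag is only safe when that node is
        # known to be gone, which is why fencing matters here.
        vgchange --deltag "$owner" "$vg" || return 1
    fi
    vgchange --addtag "$me" "$vg"
}
```

This also shows why the stop path matters: a node must remove only its own tag on stop, never another node's, or it can steal ownership from a live peer.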

@lhh
Contributor

lhh commented May 13, 2013

I believe the tagging/stripping and the way it's implemented is designed to
prevent a few things:

  1. obvious administrative errors within a running cluster:
     • a clone resource that really must NEVER be cloned
     • executing the agent directly while it's active
       (and other things)
  2. bugs in the resource manager:
     • betting your entire volume group on "no bugs"?

"Don't do that" only goes so far, and is little comfort to an administrator
who has corrupt LVM metadata.

There's probably some other bits and pieces I've forgotten; Jon Brassow
would know.

On Mon, May 13, 2013 at 1:59 PM, David Vossel notifications@github.comwrote:

It doesn't appear to have anything to do with trying to prevent an outside
source from activating the volume. If that happens, the agent will just
strip the tag from the outside source and claim the volume for the cluster.

It looks like this functionality is present to prevent a node from
potentially stripping another cluster node's ownership of the volume. It
seems like for the stop action this would be important.


Lon Hohberger
DoD#4269
KB1YGC

@l-mb
Contributor

l-mb commented May 14, 2013

It also appears to protect against manual invocation of the LVM commands.

I'm not perfectly sure the added validation against the host list is mandatory, but it also doesn't seem to hurt.

@davidvossel
Contributor Author

I have rewritten the portability patch to avoid using a subshell while parsing lvm tool output.
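
The subshell pitfall in question is a classic shell issue, illustrated generically below (this is not the agent's parsing code): in a pipeline, the `while read` loop runs in a subshell in both bash and dash, so variables set inside it are lost.

```shell
#!/bin/sh
# Broken: the increment happens in the pipeline's subshell.
count_lines_broken() {
    n=0
    printf 'a\nb\n' | while read -r line; do
        n=$((n + 1))
    done
    echo "$n"          # still 0
}

# Portable fix: feed the loop from a temporary file (a here-doc works
# too) so it runs in the current shell and keeps its variables.
count_lines_fixed() {
    n=0
    tmp=$(mktemp)
    printf 'a\nb\n' > "$tmp"
    while read -r line; do
        n=$((n + 1))
    done < "$tmp"
    rm -f "$tmp"
    echo "$n"          # 2
}
```

bash users sometimes reach for process substitution (`< <(cmd)`) instead, but that is itself a bashism, so the redirection form above is the one that survives dash.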

@davidvossel
Contributor Author

I've made two new updates to these patches.

  1. Fixed an issue with detecting when clustered exclusive activation is taking place
  2. Changed the portability patch to use pattern matching with case statements instead of grep regexes
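
The grep-to-case change in point 2 can be sketched like this. The attribute string follows lvm's lv_attr format (the fifth character is 'a' when the LV is active), but the helper name is illustrative, not taken from the patch.

```shell
#!/bin/sh
# Before (forks two processes per check, and regex flavors vary):
#   echo "$attr" | grep -q '^....a'
# After (no subprocess, identical behavior in bash and dash):
lv_attr_active() {
    case "$1" in
        ????a*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

Besides portability, the case form avoids a fork per logical volume, which adds up when a monitor operation checks every LV in a group.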

davidvossel and others added 3 commits May 15, 2013 20:55
…heartbeat LVM agent

This patch leaves the original LVM heartbeat functionality
intact while adding these additional features from the redhat agent.

1. Exclusive activation using volume group tags. This feature
allows a volume group to live on shared storage within the cluster
without requiring the use of cLVM for metadata locking.

2. Individual logical volume activation for local and cluster
volume groups, using the new 'lvname' option.

3. Better setup validation when the 'exclusive' option is enabled.
This patch validates that, when exclusive activation is enabled, either
a cluster volume group is in use with cLVM or the tags variant is
configured correctly. These new checks also make it impossible to
enable exclusive activation for cloned resources.
Thanks to Lars Ellenberg for his review and help on this patch.
@davidvossel
Contributor Author

I'm closing this pull request and opening a new one with a new set of patches.

@davidvossel davidvossel closed this Jun 4, 2013