
Merge redhat lvm.sh feature set into heartbeat LVM agent #222

Closed
wants to merge 4 commits

Conversation

davidvossel
Contributor

No description provided.

@davidvossel
Contributor Author

Hopefully this branch will help the review process go more smoothly.

The first patch sets up the current LVM heartbeat agent for the feature set merger that happens in the second patch. Look at the commit message for the first patch for details on where functionality got moved. The refactoring is necessary to keep this agent manageable with the inclusion of the redhat lvm.sh features. I structured the agent the way I did so I could both re-use the existing logic with the new feature set and isolate the original behavior with as little impact as possible.

@dmuhamedagic
Contributor

There are no significant differences between the original pull request and this one. Do we need to reiterate the discussion, and the plea to keep the changes minimal and to explain each change? ("This is needed for feature xy" is not only vague but also hard to verify.) This is a complete rewrite of the existing RA and as such difficult to validate. The topic of using lvm tools in the monitor operation was discussed at length (on the linux-ha-dev ML, IIRC), and the resulting patch was also not preserved here.

@davidvossel
Contributor Author

Good, this is something we can discuss.

The original LVM agent only did an 'ls' on the /dev/volgrpname directory to see if something existed in it. This new logic actually figures out what logical volumes are present in the volume group and verifies they are up individually. This is proven logic we've supported in the redhat agent for years. I'd be happy to address any technical concerns you have with this approach. If you could link me to the ML discussion I'd like to understand why something similar to this was disagreed on.

-- Vossel
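
The contrast between the two checks described above can be sketched roughly as follows. This is only an illustration: the function names, the `devroot` parameter (a hook so the logic can be exercised outside /dev), and the exact `lvs` options are assumptions, not the agent's actual code.

```shell
#!/bin/sh
# Old heartbeat check: is there anything at all under /dev/<vg>?
old_style_status() {
    vg="$1"; devroot="${2:-/dev}"
    [ -n "$(ls -A "$devroot/$vg" 2>/dev/null)" ]
}

# Merged-style check: enumerate the LVs that the VG actually contains
# and verify each one has an activated device node, instead of
# trusting the directory listing alone.
new_style_status() {
    vg="$1"; devroot="${2:-/dev}"
    lvlist=$(lvs --noheadings -o lv_name "$vg" 2>/dev/null) || return 1
    [ -n "$lvlist" ] || return 1
    for lv in $lvlist; do
        [ -b "$devroot/$vg/$lv" ] || [ -L "$devroot/$vg/$lv" ] || return 1
    done
}
```

The old check passes as soon as the directory is non-empty; the new one fails if even a single logical volume in the group is missing its device node.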

@fabbione
Member

On 4/19/2013 4:09 PM, Dejan Muhamedagic wrote:

There are no significant differences between the original pull request
and this one. Do we need to reiterate the discussion and a plea to keep
the changes minimal and to give explanation for those changes ("this is
needed for the feature xy" is not only vague, but hard to verify too).

So ok.. just so that we agree upfront: what level of detail would make
you happy?

Example (not even sure if it's touched by the patch.. just to get on
the same page here):

replace LVM_status() with merged_LVM_status()

old LVM_status supports both lvm1 and lvm2 and gathers status via
vgstatus x..y..z

new merged_LVM_status supports both lvm1 and lvm2, lays down the path
for lvm2 support, and replaces the use of per-version vgdisplay with
god_of_vg_display -o format_output, which is compatible with both.

god_of_vg_display then automatically detects... blablabla...

---- code/patch -----


Now.. the redhat lvm provides tons of pre-made functions; that's
probably what makes it a bit more complex to read on your side, because
they seem to kill all the code in LVM, but effectively they only
organize it logically.

Would an expanded code path make it easier to read?

For example:

god_of_vg_display calls into file_x:function_y:line to gather info Z.

Does that help? It's a lot of work to do it this way, but doable if it
helps you. Otherwise, can you be more explicit about your expectations?

Fabio

@dmuhamedagic
Contributor

The relevant discussion ... I don't have that readily available. At any rate, the resulting changes are these two: a405e78 and b4b8775

The crux of the matter is that just about any lvm command goes through all VGs on the system. If one of them is borked in such a way that it is inaccessible, the command will hang and the operation will time out. Thus, unrelated storage can affect the storage actually managed by this resource instance. This is what was observed in the field and reported several times, both by customers and on the public ML.
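
The failure mode described here can be made concrete with a small sketch. This is not the agent's code: `run_bounded` is an illustrative wrapper, and the idea is simply that any lvm invocation made during monitor should be bounded by a deadline shorter than the operation timeout, since a single hung device can stall the whole scan.

```shell
#!/bin/sh
# Bound a command with a deadline using coreutils timeout, which
# returns 124 when it has to kill the command.
run_bounded() {
    deadline="$1"; shift
    timeout "$deadline" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "command '$*' hung for ${deadline}s; treating as failure" >&2
    fi
    return "$rc"
}
```

Usage would look like `run_bounded 10 vgdisplay myvg`; note this only converts a hang into a clean failure, it does not remove the underlying risk of touching unrelated PVs.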

@davidvossel
Contributor Author

Interesting. I'll rework the status function so it attempts to avoid using the lvm tools. There are situations where this will be unavoidable, but I believe I can restore this behavior for the pre-merger LVM use cases. I'll give it a go and see what I come up with.

@davidvossel
Contributor Author

Alright, take a look at the new vg_status_normal function in the first commit.

davidvossel@5ec9a8a#L3R27

It is very similar to the old status function. The only difference I made is that if the volume group looks empty, we fall back to using the lvm tools to read the metadata, just to verify the vg is in fact empty. In normal use cases, though, the lvm tools will not be executed. Sound reasonable?
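
The fallback ordering described here can be sketched as below. The helper name and the `devroot` parameter are illustrative assumptions (the real logic is `vg_status_normal` in the linked commit); the point is that the normal path touches only device nodes, and an lvm command runs only in the unusual empty-looking case.

```shell
#!/bin/sh
vg_status_sketch() {
    vg="$1"; devroot="${2:-/dev}"
    for node in "$devroot/$vg"/*; do
        # Any activated LV leaves a block device or symlink here, so
        # the normal path never executes an lvm command.
        if [ -b "$node" ] || [ -L "$node" ]; then
            return 0
        fi
    done
    # Directory missing or empty: only now fall back to the lvm tools.
    # In this simplification, a VG that really has no LVs is still
    # considered active, while LVs present in the metadata but missing
    # their device nodes are a failure.
    [ -z "$(lvs --noheadings -o lv_name "$vg" 2>/dev/null)" ]
}
```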

@l-mb
Contributor

l-mb commented Apr 23, 2013

I really appreciate that code merging is happening, kudos for that. And the new exclusive activation feature also rocks. Given that it is a merge of two mature code bases and not an incremental improvement, slightly less-than-perfectly incremental patches are to be expected.

Moving code from the RA to a common shell library probably makes sense if those are shared between multiple implementations, but I agree that this makes it harder to review than in-place updates. Oh well. ;-)

Calling out to "lvs" in the monitor function though is dangerous, unfortunately. I'm surprised you've not hit that on RHEL. If IO hangs (which is particularly common with multipath), that'll hang and time out - and, like Dejan commented, this can be triggered by any path within the LVM filter spec, since lvs will open/read them all.

I've never seen a VG directory (/dev/cvg/) exist if no LVs inside it were activated; i.e., I've never seen the directory exist but be empty. Can this actually happen? In short, does this test (apart from checking whether device nodes exist) actually make sense? It seems harmless enough for now, especially if that case can't happen.

As far as the exclusive activation of individual LVs is concerned, I admit I have second thoughts on lumping this in with the general LVM RA that manages whole volume groups. I was thinking that perhaps a dedicated LV RA would make more sense. But it does share a lot of code.

I agree with Dejan that such a merge will want plenty of testing, but we seem to gain some good functionality here.

@davidvossel
Contributor Author

I've updated the status function to remove all remaining references to 'lvs'. I tested vg activation without any logical volumes, and it appears the /dev/vg directory doesn't exist until a logical volume is created. For some reason I thought I had seen it before.

@l-mb
Contributor

l-mb commented Apr 23, 2013

Thank you!

Looks good to me.

@l-mb
Contributor

l-mb commented Apr 30, 2013

To clarify, will the code in resource-agents/rgmanager/src/resources/ also switch to using these shared libraries?

@fabbione
Member

fabbione commented May 1, 2013

On 04/30/2013 01:38 PM, Lars Marowsky-Brée wrote:

To clarify, will the code in resource-agents/rgmanager/src/resources/
also switch to using these shared libraries?

Those libraries come from there. As the agents are merged,
rgmanager/src* will disappear completely.

Fabio

@davidvossel
Contributor Author

Anything I can do to move this along?

@dmuhamedagic
Contributor

As already mentioned in the first comment, it is difficult to validate the first changeset (davidvossel@0ef1449) because of the number of changes within it. If you can split it into smaller chunks which are self-sufficient and well explained, that would certainly help. There are further observations in the first changeset.

@davidvossel
Contributor Author

The exclusive activation using tags is only questionable if the pacemaker deployment does not have fencing enabled.

@davidvossel
Contributor Author

  • Please copy in your ocft test results. Mine fails during setup, before it even gets to the part that invokes the agent.
  • I'm confused about how I should be testing portability. My understanding was that #!/bin/sh would run bash in a way that disables bash's extended functionality. I had to change some things to get this to work. How else should I be testing for compatibility?
  • Logging messages moved and changed. New checks during validation were added, replacing previous behavior. Is there a specific log output I'm missing that indicates a condition we need to know about?
  • Status is a deprecated alias for monitor. Are you saying monitor and status are broken? Can you tell me what you encountered? I have tested this. Please be more descriptive.
  • restore_transient_failed_pvs just attempts to restore missing physical volumes before starting. This has the potential to turn a partial activation into a complete activation, but it shouldn't interfere with the partial activation if the pv cannot be restored.

@davidvossel
Contributor Author

I looked into portability more. I'll see what I can do to port the changes to work with dash.

@davidvossel
Contributor Author

I've ported the changes to work with both bash and dash. checkbashisms passes for all the changed files as well.

Edit: I've also gone through and validated all use cases (normal, tags, exclusive with clvm) in my test environment again.
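
For context, these are generic examples of the kinds of constructs checkbashisms flags and their POSIX replacements. They are illustrative patterns, not lines taken from the agent; the helper names are invented for this sketch.

```shell
#!/bin/sh
# bash-only: ${var//search/replace}.  POSIX parameter expansion only
# has the # ## % %% prefix/suffix operators:
strip_dev_prefix() {
    printf '%s\n' "${1#/dev/}"
}

# bash-only: let n+=1 or ((n++)).  POSIX arithmetic expansion:
next_index() {
    printf '%s\n' "$(($1 + 1))"
}

# bash-only: "function foo { ...; }".  POSIX function definitions use
# the name() { ...; } form, as above.
```

Running `checkbashisms script.sh` over a file reports each bash-only construct with its line number, which is what makes it useful as a quick portability gate.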

Functionality from the original heartbeat agent has moved into new
function names and files. This is a map to understanding where the
original functionality lives within the new merged agent.

LVM_monitor and LVM_status --> vg_status_normal() in lvm_vg.sh
The new monitor function does additional checking to verify that
both the volume group and each logical volume are correctly activated.

LVM_start() ---> vg_start_normal() in lvm_vg.sh

LVM_stop() ---> vg_stop_normal() in lvm_vg.sh

LVM_validate_all() ---> verify_setup() in lvm_functions.sh
@dmuhamedagic
Contributor

On Wed, May 08, 2013 at 08:10:10AM -0700, David Vossel wrote:

The exclusive activation using tags is only questionable if the pacemaker deployment does not have fencing enabled.

If you want to rely on the cluster resource manager and fencing
functioning, why do you then need to verify the membership
information? If it's just to prevent somebody outside the cluster
from using the volume, then testing the membership from the resource
agent is of no benefit.

@davidvossel
Contributor Author

It doesn't appear to have anything to do with trying to prevent an outside source from activating the volume. If that happens, the agent will just strip the tag from the outside source and claim the volume for the cluster.

It looks like this functionality is present to prevent a node from potentially stripping another cluster node's ownership of the volume. For the stop action in particular, this seems important.
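
The tag-ownership scheme being discussed can be sketched as follows. The `vgchange --addtag`/`--deltag` and `vgs -o tags` forms are real lvm syntax, but the policy shown is a simplification, and `first_tag`/`claim_vg` are illustrative names, not the agent's functions.

```shell
#!/bin/sh
# Parse the owning tag out of `vgs --noheadings -o tags <vg>` output
# (tags are comma-separated; this scheme uses one owner tag per VG).
first_tag() {
    printf '%s\n' "$1" | tr -d ' ' | cut -d, -f1
}

claim_vg() {
    vg="$1"; me="$2"
    owner=$(first_tag "$(vgs --noheadings -o tags "$vg" 2>/dev/null)")
    if [ -n "$owner" ] && [ "$owner" != "$me" ]; then
        # Stripping another node's tag is only safe when that node is
        # known to be gone, which is why fencing matters here.
        vgchange --deltag "$owner" "$vg" || return 1
    fi
    vgchange --addtag "$me" "$vg"
}
```

This also shows why the stop path matters: a node must remove only its own tag on stop, never another node's, or it can steal ownership from a live peer.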

@lhh
Contributor

lhh commented May 13, 2013

I believe the tagging/stripping and the way it's implemented is designed to
prevent a few things:

  1. obvious administrative errors within a running cluster:
     • a clone resource that really must NEVER be cloned
     • executing the agent directly while it's active
       (and other things)
  2. bugs in the resource manager:
     • betting your entire volume group on "no bugs"?

"Don't do that" only goes so far, and is little comfort to an administrator
who has corrupt LVM metadata.

There's probably some other bits and pieces I've forgotten; Jon Brassow
would know.

On Mon, May 13, 2013 at 1:59 PM, David Vossel notifications@github.comwrote:

It doesn't appear to have anything to do with trying to prevent an outside
source from activating the volume. If that happens, the agent will just
strip the tag from the outside source and claim the volume for the cluster.

It looks like this functionality is present to prevent a node from
potentially stripping another cluster node's ownership of the volume. It
seems like for the stop action this would be important.


Lon Hohberger
DoD#4269
KB1YGC

@l-mb
Contributor

l-mb commented May 14, 2013

It also appears to protect against manual invocation of the LVM commands.

I'm not perfectly sure the added validation against the host list is mandatory, but it also doesn't seem to hurt.

@davidvossel
Contributor Author

I have rewritten the portability patch to avoid using a subshell while parsing lvm tool output.
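
The subshell pitfall in question is a classic shell issue, illustrated generically below (this is not the agent's parsing code): in a pipeline, the `while read` loop runs in a subshell in both bash and dash, so variables set inside it are lost.

```shell
#!/bin/sh
# Broken: the increment happens in the pipeline's subshell.
count_lines_broken() {
    n=0
    printf 'a\nb\n' | while read -r line; do
        n=$((n + 1))
    done
    echo "$n"          # still 0
}

# Portable fix: feed the loop from a temporary file (a here-doc works
# too) so it runs in the current shell and keeps its variables.
count_lines_fixed() {
    n=0
    tmp=$(mktemp)
    printf 'a\nb\n' > "$tmp"
    while read -r line; do
        n=$((n + 1))
    done < "$tmp"
    rm -f "$tmp"
    echo "$n"          # 2
}
```

bash users sometimes reach for process substitution (`< <(cmd)`) instead, but that is itself a bashism, so the redirection form above is the one that survives dash.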

@davidvossel
Contributor Author

I've made two new updates to these patches.

  1. Fixed an issue with detecting when clustered exclusive activation is taking place
  2. Changed the portability patch to use pattern matching with case statements instead of grep regexes
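
The grep-to-case change in point 2 can be sketched like this. The attribute string follows lvm's lv_attr format (the fifth character is 'a' when the LV is active), but the helper name is illustrative, not taken from the patch.

```shell
#!/bin/sh
# Before (forks two processes per check, and regex flavors vary):
#   echo "$attr" | grep -q '^....a'
# After (no subprocess, identical behavior in bash and dash):
lv_attr_active() {
    case "$1" in
        ????a*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

Besides portability, the case form avoids a fork per logical volume, which adds up when a monitor operation checks every LV in a group.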

davidvossel and others added 3 commits May 15, 2013 20:55
…heartbeat LVM agent

This patch leaves the original LVM heartbeat functionality
intact while adding these additional features from the redhat agent.

1. Exclusive activation using volume group tags. This feature
allows a volume group to live on shared storage within the cluster
without requiring the use of cLVM for metadata locking.

2. Individual logical volume activation for local and cluster
volume groups, using the new 'lvname' option.

3. Better setup validation when the 'exclusive' option is enabled.
This patch validates that, when exclusive activation is enabled, either
a cluster volume group is in use with cLVM or the tags variant is
configured correctly. These new checks also make it impossible to
enable exclusive activation for cloned resources.
Thanks to Lars Ellenberg for his review and help on this patch.
@davidvossel
Contributor Author

I'm closing this pull request and opening a new one with a new set of patches.

@davidvossel davidvossel closed this Jun 4, 2013