many neutron-ha-tool fixes #1

aspiers · 2016-03-08T20:48:17Z

In a Crowbar environment, the retry commit requires crowbar/crowbar-openstack#344 and also https://review.openstack.org/#/c/297664/ in order to be effective.

Whilst looking into bsc#965886 (L3: neutron-ha-tool not migrating dhcp networks during failover), I discovered many instances of braindead code in neutron-ha-tool. In particular, previous authors seemed to think that LOG.exception() aborts the code flow, which it doesn't, so several corner cases just completely fail.

Rather than submitting all these changes as individual rpm patches like it's 1995, I've forked the original upstream git repo to this fork, resurrected the Icehouse branch (which was tagged as eol-icehouse) and created a new neutron-ha-tool-maintenance branch from it which can be subject to the normal code review process. We can then easily repoint https://build.suse.de/package/show/Devel:Cloud:6:Staging/openstack-neutron at it. Similarly for Cloud 5.

jsuchome · 2016-03-09T08:54:42Z

files/default/neutron-ha-tool.py

-                except:
-                    LOG.exception("There was an error migrating a router")
-                    continue
+    migration_count = 0


This is already initialized at https://github.com/SUSE-Cloud/cookbook-openstack-network/pull/1/files#diff-937ed0b9cba9efd2521c8fd1c9e8416fR268

Good spot - fixed.

jsuchome · 2016-03-09T08:56:55Z

I think those commits replacing LOG.exception could be squashed into one, the reasoning is always the same. But in general LGTM

jsuchome · 2016-03-09T09:10:11Z

While you were adding also some older patches to this PR (like the one for password file), could you also add a commit with https://build.suse.de/package/view_file/Devel:Cloud:6:Staging/openstack-neutron/0001-L3-agent-migration-should-only-choose-enabled-agent.patch?expand=1 ?

aspiers · 2016-03-09T09:26:04Z

I think those commits replacing LOG.exception could be squashed into one, the reasoning is always the same. But in general LGTM

The reasoning is the same, but the damage is different and also some cases are handled in different ways (e.g. raise vs. LOG.error and sys.exit(1)), as explained in the commit messages. So I think it's better to keep separate.

While you were adding also some older patches to this PR (like the one for password file), could you also add a commit with https://build.suse.de/package/view_file/Devel:Cloud:6:Staging/openstack-neutron/0001-L3-agent-migration-should-only-choose-enabled-agent.patch?expand=1 ?

Yeah, good catch - thanks I will!

aspiers · 2016-03-09T09:45:39Z

Yeah, good catch - thanks I will!

Ah, I had already moved this into a separate branch which I was going to submit as a separate PR, because I didn't realise we already had a patch for it. Actually I'll just push it and the password patch into the target branch (SUSE-Cloud) since we already know we need them - no point including them in this review.

jsuchome · 2016-03-10T07:49:29Z

I've tested with this patch, and although I did got failures of neutron-ha-tool, they are not related to this changes - and yes, I do not see any extra errors coming from incorrect exceptions handling.

+1

vuntz · 2016-03-17T10:49:02Z

files/default/neutron-ha-tool.py

+    if os.getenv('OS_PASSWORD'):
+        os_password = os.environ['OS_PASSWORD']
+    else:
+        with open('/etc/neutron/os_password') as f:


Should we check for the file existence?

You mean so that we can abort with a more friendly error message, rather than an exception and a stacktrace? Sounds reasonable to me. Or do you mean something else?

That's what I meant, yes.

Fixed in a new commit.

At this point, we haven't even parsed the CLI arguments in order to determine logging level, so we haven't set up the logger properly yet. Even if we had, there is no exception to log.

LOG.exception() should only be used when an exception was caught. In this internal helper function, when something goes wrong, it makes sense to raise exceptions which can optionally be caught by callers. Previously it was not bailing correctly if the router failed to be removed.

aspiers · 2016-03-17T19:23:29Z

Rebased against latest "upstream" in the SUSE-Cloud fork, since I already pushed the commits there which mirror the patches we already had in the packages: 26a7df4 and 29e9759.

vuntz · 2016-03-23T09:32:18Z

files/default/neutron-ha-tool.py

@@ -22,6 +22,7 @@
 import argparse
 import random
 import time
+import retrying


Wrong import order.

LGTM, tried it and works fine. Why is it wrong?

List of imports is supposed to be sorted.

That being said, retrying is maybe not in the standard library? If not, then it needs a blank like before the import.

jsuchome · 2016-03-23T16:43:29Z

I've also tested it several times now and it seems to work correctly. Do we still need WIP and Do not merge tags?

vuntz · 2016-03-24T09:07:33Z

@jsuchome I think the point about noop not doing what it was doing before is still open.

aspiers · 2016-03-24T10:10:13Z

@jsuchome and also two other items:

some sys.exit(1) calls need to be changed into return or exceptions which get caught higher up
the retry behaviour should be optional with parameter(s), not hardcoded

SUSE-Cloud/cookbook-openstack-network#1 adds retry logic to neutron-ha-tool so that it will not fail immediately if there are problems talking to the neutron-server API. Therefore we need to give neutron-ha-tool enough time to have a decent chance of succeeding if the neutron-server still needs a while to recover after some kind of failover. https://bugzilla.suse.com/show_bug.cgi?id=965886

SUSE-Cloud/cookbook-openstack-network#1 adds retry logic to neutron-ha-tool so that it will not fail immediately if there are problems talking to the neutron-server API. Therefore we need to give neutron-ha-tool enough time to have a decent chance of succeeding if the neutron-server still needs a while to recover after some kind of failover. https://bugzilla.suse.com/show_bug.cgi?id=965886 (cherry picked from commit 55504e2e8ccfb23abb8a6b5b31e4ff7ef99e911b)

To have a log message at the end with an accurate count of how many migrations were need, we should count all needed migrations, regardless of whether we're in dry-run mode or whether some migrations failed. This also makes future refactoring easier.

Remove code duplication by logging inside migrate_router(). Also, if another caller of migrate_router() was introduced, it would accidentally omit to log the migration. Moving the log message inside migrate_router() prevents this.

This removes some code duplication and helps make l3_agent_migrate() and l3_agent_evacuate (which are already overgrown) more concise and readable. Also remove some spurious 'continue' keywords.

This removes duplicated code, and helps make l3_agent_migrate() and l3_agent_evacuate() (which are already overgrown) more concise and readable.

In the evacuation case, migrate_router() can be called on alive agents, not just dead ones. Also add a debug message.

If the password isn't provided via OS_PASSWORD environment variable or in /etc/neutron/os_password, then die gracefully.

Output an error if no action is specified. Also output an error if more than one action is specified, because it's too complicated to support meaningful semantics for exit codes if multiple actions are run. For example, if --l3-agent-migrate and --replicate-dhcp were both specified to the same invocation, each action could independently have one of three outcomes: - no action was required (everything already healthy) - some action was required (router migration or dhcp replication) - some error was encountered during the action That would require choosing script exit codes to meaningfully represent 9 possible different outcomes, just for this combination of two actions. Instead, we just limit invocation to one action at a time, and then each action can have easily understandable exit codes associated with its outcomes. This should have no impact since nothing currently requires multiple actions within a single run, and even if it did, it could simply invoke the tool multiple times, once for each action.

If neutron-ha-tool's monitor action fails during a window in which the database or message bus are down, or still starting up, its consequent start action will fail. There's no clean way to make Pacemaker do a sensible number of retries at staggered intervals, so inside we embed the retry logic inside the tool itself. Ideally it would have gone in the OCF RA, but it's uglier doing it in bash. https://bugzilla.suse.com/show_bug.cgi?id=965886

Make the usage text include a more helpful argument name: [--l3-agent-evacuate AGENT_ID]

If the "now" parameter to l3_agent_migrate() is false, and an agent comes online during the period of waiting, migration will be abandoned. So document this in the docstring.

Clarify when variables are referring to hostnames or agent ids.

aspiers · 2016-03-24T23:50:01Z

I think I have now addressed all remaining concerns, and some extra ones:

Retry is now optional via --retry, with --retry-max-interval customizable
At considerable effort which I'm not convinced was worth it, got rid of the sys.exit calls. At least it should keep @vuntz happy.
Better exception handling: don't trap arbitrary exceptions, just NeutronException
Many code cleanups

vuntz · 2016-03-25T13:04:20Z

files/default/neutron-ha-tool.py

+        return True
+    except:
+        LOG.exception("Failed to migrate router=%s from agent=%s "
+                      "to agent=%s", router_id, agent_id, target_id)


Would it make sense to also log the message from the exception?

Doh, I guess that's already done by LOG.exception.

aspiers · 2016-03-25T13:09:05Z

Tested on SOC6 and it works, although for some reason I have not managed to reproduce a situation where a retry was actually required. But manual retries from CLI work, so I can't imagine why they wouldn't from the OCF.

aspiers · 2016-03-25T13:09:31Z

(BTW this requires adding --retry to the two invocations of the tool in the OCF start action)

vuntz · 2016-03-25T13:15:48Z

+1

aspiers · 2016-03-25T14:41:54Z

Submitted gerrit#I41c37500 (neutron-ha-tool: make start action retry)

AbelNavarro · 2016-03-29T08:28:18Z

+1

SUSE-Cloud/cookbook-openstack-network#1 adds retry logic to neutron-ha-tool so that it will not fail immediately if there are problems talking to the neutron-server API. Therefore we need to give neutron-ha-tool enough time to have a decent chance of succeeding if the neutron-server still needs a while to recover after some kind of failover. https://bugzilla.suse.com/show_bug.cgi?id=965886 (cherry picked from commit 55504e2e8ccfb23abb8a6b5b31e4ff7ef99e911b)

SUSE-Cloud/cookbook-openstack-network#1 adds (amongst many other things) support for neutron-ha-tool to retry its connections to neutron-server. By taking advantage of this in this OCF RA, we can make failover more robust. Signed-off-by: Adam Spiers <aspiers@suse.com> Change-Id: I41c37500f691e2e0ecfd6c31f1720f483513e447

SUSE-Cloud/cookbook-openstack-network#1 adds retry logic to neutron-ha-tool so that it will not fail immediately if there are problems talking to the neutron-server API. Therefore we need to give neutron-ha-tool enough time to have a decent chance of succeeding if the neutron-server still needs a while to recover after some kind of failover. https://bugzilla.suse.com/show_bug.cgi?id=965886 (cherry picked from commit 55504e2e8ccfb23abb8a6b5b31e4ff7ef99e911b)

SUSE-Cloud/cookbook-openstack-network#1 adds (amongst many other things) support for neutron-ha-tool to retry its connections to neutron-server. By taking advantage of this in this OCF RA, we can make failover more robust. Signed-off-by: Adam Spiers <aspiers@suse.com> Change-Id: I41c37500f691e2e0ecfd6c31f1720f483513e447

aspiers added wip do not merge yet backport from SOC8 labels Mar 8, 2016

jsuchome reviewed Mar 9, 2016
View reviewed changes

aspiers force-pushed the neutron-ha-tool-maintenance branch from 8c601fa to 6aa4b9a Compare March 9, 2016 09:21

vuntz reviewed Mar 17, 2016
View reviewed changes

Adam Spiers added 3 commits March 17, 2016 19:20

flake8 fixes for 58f12c5

f90c3b2

Don't try to log an exception before logger is setup

e061f0c

At this point, we haven't even parsed the CLI arguments in order to determine logging level, so we haven't set up the logger properly yet. Even if we had, there is no exception to log.

aspiers force-pushed the neutron-ha-tool-maintenance branch from d651f83 to ee91d1d Compare March 17, 2016 19:20

vuntz reviewed Mar 23, 2016
View reviewed changes

aspiers force-pushed the neutron-ha-tool-maintenance branch from 5d02f61 to 184898b Compare March 23, 2016 16:43

aspiers mentioned this pull request Mar 24, 2016

increase neutron-ha-tool start timeout to 120s (bsc#965886) crowbar/crowbar-openstack#344

Merged

jsuchome mentioned this pull request Mar 24, 2016

[3.0] increase neutron-ha-tool start timeout to 120s (bsc#965886) crowbar/crowbar-openstack#345

Merged

jsuchome mentioned this pull request Mar 24, 2016

increase neutron-ha-tool start timeout to 120s (bsc#965886) crowbar/barclamp-neutron#271

Closed

aspiers force-pushed the neutron-ha-tool-maintenance branch from 184898b to b99ae36 Compare March 24, 2016 17:53

aspiers force-pushed the neutron-ha-tool-maintenance branch from 89c2c5e to 0d487eb Compare March 24, 2016 22:30

Adam Spiers added 14 commits March 24, 2016 23:00

Remove superfluous blank lines

d9e07ef

Log migrations inside migrate_router()

f2af68a

Remove code duplication by logging inside migrate_router(). Also, if another caller of migrate_router() was introduced, it would accidentally omit to log the migration. Moving the log message inside migrate_router() prevents this.

Extract migrate_router_safely() function

41a048c

This removes some code duplication and helps make l3_agent_migrate() and l3_agent_evacuate (which are already overgrown) more concise and readable. Also remove some spurious 'continue' keywords.

Extract migrate_l3_routers_from_agent() function

df8b1c5

This removes duplicated code, and helps make l3_agent_migrate() and l3_agent_evacuate() (which are already overgrown) more concise and readable.

Improve comment and logging in migrate_router()

13c48e2

In the evacuation case, migrate_router() can be called on alive agents, not just dead ones. Also add a debug message.

Die more gracefully if password can't be obtained

fbf6a28

If the password isn't provided via OS_PASSWORD environment variable or in /etc/neutron/os_password, then die gracefully.

Add metavar for --l3-agent-evacuate

0ac8c0f

Make the usage text include a more helpful argument name: [--l3-agent-evacuate AGENT_ID]

Explain what happens if agent comes online during wait period

d330342

If the "now" parameter to l3_agent_migrate() is false, and an agent comes online during the period of waiting, migration will be abandoned. So document this in the docstring.

Use %d not %s for integer output

5905627

Make return values and exit codes consistent

079a010

improve variable names

5be4099

Clarify when variables are referring to hostnames or agent ids.

aspiers force-pushed the neutron-ha-tool-maintenance branch from 0d487eb to 5be4099 Compare March 24, 2016 23:41

aspiers removed the wip label Mar 24, 2016

vuntz reviewed Mar 25, 2016
View reviewed changes

aspiers changed the title ~~[NEEDS TESTING] many neutron-ha-tool fixes~~ many neutron-ha-tool fixes Mar 25, 2016

aspiers merged commit 551dadd into SUSE-Cloud:neutron-ha-tool-maintenance Mar 31, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

many neutron-ha-tool fixes #1

many neutron-ha-tool fixes #1

aspiers commented Mar 8, 2016

jsuchome Mar 9, 2016

aspiers Mar 9, 2016

jsuchome commented Mar 9, 2016

jsuchome commented Mar 9, 2016

aspiers commented Mar 9, 2016

aspiers commented Mar 9, 2016

jsuchome commented Mar 10, 2016

vuntz Mar 17, 2016

aspiers Mar 17, 2016

vuntz Mar 17, 2016

aspiers Mar 17, 2016

aspiers commented Mar 17, 2016

vuntz Mar 23, 2016

AbelNavarro Mar 23, 2016

vuntz Mar 23, 2016

jsuchome commented Mar 23, 2016

vuntz commented Mar 24, 2016

aspiers commented Mar 24, 2016

aspiers commented Mar 24, 2016

vuntz Mar 25, 2016

vuntz Mar 25, 2016

aspiers Mar 25, 2016

aspiers commented Mar 25, 2016

aspiers commented Mar 25, 2016

vuntz commented Mar 25, 2016

aspiers commented Mar 25, 2016

AbelNavarro commented Mar 29, 2016

many neutron-ha-tool fixes #1

many neutron-ha-tool fixes #1

Conversation

aspiers commented Mar 8, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jsuchome commented Mar 9, 2016

jsuchome commented Mar 9, 2016

aspiers commented Mar 9, 2016

aspiers commented Mar 9, 2016

jsuchome commented Mar 10, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aspiers commented Mar 17, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jsuchome commented Mar 23, 2016

vuntz commented Mar 24, 2016

aspiers commented Mar 24, 2016

aspiers commented Mar 24, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aspiers commented Mar 25, 2016

aspiers commented Mar 25, 2016

vuntz commented Mar 25, 2016

aspiers commented Mar 25, 2016

AbelNavarro commented Mar 29, 2016