Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

many neutron-ha-tool fixes #1

Conversation

aspiers
Copy link

@aspiers aspiers commented Mar 8, 2016

In a Crowbar environment, the retry commit requires crowbar/crowbar-openstack#344 and also https://review.openstack.org/#/c/297664/ in order to be effective.

Whilst looking into bsc#965886 (L3: neutron-ha-tool not migrating dhcp networks during failover), I discovered many instances of braindead code in neutron-ha-tool. In particular, previous authors seemed to think that LOG.exception() aborts the code flow, which it doesn't, so several corner cases just completely fail.

Rather than submitting all these changes as individual rpm patches like it's 1995, I've forked the original upstream git repo to this fork, resurrected the Icehouse branch (which was tagged as eol-icehouse) and created a new neutron-ha-tool-maintenance branch from it which can be subject to the normal code review process. We can then easily repoint https://build.suse.de/package/show/Devel:Cloud:6:Staging/openstack-neutron at it. Similarly for Cloud 5.

except:
LOG.exception("There was an error migrating a router")
continue
migration_count = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot - fixed.

@jsuchome
Copy link
Member

jsuchome commented Mar 9, 2016

I think those commits replacing LOG.exception could be squashed into one, the reasoning is always the same. But in general LGTM

@jsuchome
Copy link
Member

jsuchome commented Mar 9, 2016

While you were adding also some older patches to this PR (like the one for password file), could you also add a commit with https://build.suse.de/package/view_file/Devel:Cloud:6:Staging/openstack-neutron/0001-L3-agent-migration-should-only-choose-enabled-agent.patch?expand=1 ?

@aspiers aspiers force-pushed the neutron-ha-tool-maintenance branch from 8c601fa to 6aa4b9a Compare March 9, 2016 09:21
@aspiers
Copy link
Author

aspiers commented Mar 9, 2016

I think those commits replacing LOG.exception could be squashed into one, the reasoning is always the same. But in general LGTM

The reasoning is the same, but the damage is different and also some cases are handled in different ways (e.g. raise vs. LOG.error and sys.exit(1)), as explained in the commit messages. So I think it's better to keep separate.

While you were adding also some older patches to this PR (like the one for password file), could you also add a commit with https://build.suse.de/package/view_file/Devel:Cloud:6:Staging/openstack-neutron/0001-L3-agent-migration-should-only-choose-enabled-agent.patch?expand=1 ?

Yeah, good catch - thanks I will!

@aspiers
Copy link
Author

aspiers commented Mar 9, 2016

Yeah, good catch - thanks I will!

Ah, I had already moved this into a separate branch which I was going to submit as a separate PR, because I didn't realise we already had a patch for it. Actually I'll just push it and the password patch into the target branch (SUSE-Cloud) since we already know we need them - no point including them in this review.

@jsuchome
Copy link
Member

I've tested with this patch, and although I did got failures of neutron-ha-tool, they are not related to this changes - and yes, I do not see any extra errors coming from incorrect exceptions handling.

+1

if os.getenv('OS_PASSWORD'):
os_password = os.environ['OS_PASSWORD']
else:
with open('/etc/neutron/os_password') as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check for the file existence?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean so that we can abort with a more friendly error message, rather than an exception and a stacktrace? Sounds reasonable to me. Or do you mean something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I meant, yes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a new commit.

Adam Spiers added 3 commits March 17, 2016 19:20
At this point, we haven't even parsed the CLI arguments in order to
determine logging level, so we haven't set up the logger properly yet.
Even if we had, there is no exception to log.
LOG.exception() should only be used when an exception was caught.  In
this internal helper function, when something goes wrong, it makes sense
to raise exceptions which can optionally be caught by callers.
Previously it was not bailing correctly if the router failed to be
removed.
@aspiers aspiers force-pushed the neutron-ha-tool-maintenance branch from d651f83 to ee91d1d Compare March 17, 2016 19:20
@aspiers
Copy link
Author

aspiers commented Mar 17, 2016

Rebased against latest "upstream" in the SUSE-Cloud fork, since I already pushed the commits there which mirror the patches we already had in the packages: 26a7df4 and 29e9759.

@@ -22,6 +22,7 @@
import argparse
import random
import time
import retrying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong import order.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tried it and works fine. Why is it wrong?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List of imports is supposed to be sorted.

That being said, retrying is maybe not in the standard library? If not, then it needs a blank like before the import.

@jsuchome
Copy link
Member

I've also tested it several times now and it seems to work correctly. Do we still need WIP and Do not merge tags?

@aspiers aspiers force-pushed the neutron-ha-tool-maintenance branch from 5d02f61 to 184898b Compare March 23, 2016 16:43
@vuntz
Copy link
Member

vuntz commented Mar 24, 2016

@jsuchome I think the point about noop not doing what it was doing before is still open.

@aspiers
Copy link
Author

aspiers commented Mar 24, 2016

@jsuchome and also two other items:

  1. some sys.exit(1) calls need to be changed into return or exceptions which get caught higher up
  2. the retry behaviour should be optional with parameter(s), not hardcoded

aspiers pushed a commit to aspiers/crowbar-openstack that referenced this pull request Mar 24, 2016
SUSE-Cloud/cookbook-openstack-network#1
adds retry logic to neutron-ha-tool so that it will not fail
immediately if there are problems talking to the neutron-server
API.  Therefore we need to give neutron-ha-tool enough time
to have a decent chance of succeeding if the neutron-server
still needs a while to recover after some kind of failover.

https://bugzilla.suse.com/show_bug.cgi?id=965886
jsuchome pushed a commit to jsuchome/barclamp-neutron that referenced this pull request Mar 24, 2016
SUSE-Cloud/cookbook-openstack-network#1
adds retry logic to neutron-ha-tool so that it will not fail
immediately if there are problems talking to the neutron-server
API.  Therefore we need to give neutron-ha-tool enough time
to have a decent chance of succeeding if the neutron-server
still needs a while to recover after some kind of failover.

https://bugzilla.suse.com/show_bug.cgi?id=965886
(cherry picked from commit 55504e2e8ccfb23abb8a6b5b31e4ff7ef99e911b)
@aspiers aspiers force-pushed the neutron-ha-tool-maintenance branch from 184898b to b99ae36 Compare March 24, 2016 17:53
@aspiers aspiers force-pushed the neutron-ha-tool-maintenance branch from 89c2c5e to 0d487eb Compare March 24, 2016 22:30
Adam Spiers added 14 commits March 24, 2016 23:00
To have a log message at the end with an accurate count of how many
migrations were need, we should count all needed migrations, regardless
of whether we're in dry-run mode or whether some migrations failed.

This also makes future refactoring easier.
Remove code duplication by logging inside migrate_router().  Also, if
another caller of migrate_router() was introduced, it would accidentally
omit to log the migration.  Moving the log message inside
migrate_router() prevents this.
This removes some code duplication and helps make l3_agent_migrate() and
l3_agent_evacuate (which are already overgrown) more concise and
readable.

Also remove some spurious 'continue' keywords.
This removes duplicated code, and helps make l3_agent_migrate() and
l3_agent_evacuate() (which are already overgrown) more concise and
readable.
In the evacuation case, migrate_router() can be called on alive agents,
not just dead ones.  Also add a debug message.
If the password isn't provided via OS_PASSWORD environment variable or
in /etc/neutron/os_password, then die gracefully.
Output an error if no action is specified.

Also output an error if more than one action is specified, because it's
too complicated to support meaningful semantics for exit codes if
multiple actions are run.  For example, if --l3-agent-migrate and
--replicate-dhcp were both specified to the same invocation, each action
could independently have one of three outcomes:

  - no action was required (everything already healthy)
  - some action was required (router migration or dhcp replication)
  - some error was encountered during the action

That would require choosing script exit codes to meaningfully represent
9 possible different outcomes, just for this combination of two actions.

Instead, we just limit invocation to one action at a time, and then
each action can have easily understandable exit codes associated with
its outcomes.

This should have no impact since nothing currently requires multiple
actions within a single run, and even if it did, it could simply invoke
the tool multiple times, once for each action.
If neutron-ha-tool's monitor action fails during a window in which the
database or message bus are down, or still starting up, its consequent
start action will fail.  There's no clean way to make Pacemaker do a
sensible number of retries at staggered intervals, so inside we embed
the retry logic inside the tool itself.  Ideally it would have gone in
the OCF RA, but it's uglier doing it in bash.

https://bugzilla.suse.com/show_bug.cgi?id=965886
Make the usage text include a more helpful argument name:

    [--l3-agent-evacuate AGENT_ID]
If the "now" parameter to l3_agent_migrate() is false, and an agent
comes online during the period of waiting, migration will be abandoned.
So document this in the docstring.
Clarify when variables are referring to hostnames or agent ids.
@aspiers aspiers force-pushed the neutron-ha-tool-maintenance branch from 0d487eb to 5be4099 Compare March 24, 2016 23:41
@aspiers aspiers removed the wip label Mar 24, 2016
@aspiers
Copy link
Author

aspiers commented Mar 24, 2016

I think I have now addressed all remaining concerns, and some extra ones:

  • Retry is now optional via --retry, with --retry-max-interval customizable
  • At considerable effort which I'm not convinced was worth it, got rid of the sys.exit calls. At least it should keep @vuntz happy.
  • Better exception handling: don't trap arbitrary exceptions, just NeutronException
  • Many code cleanups

return True
except:
LOG.exception("Failed to migrate router=%s from agent=%s "
"to agent=%s", router_id, agent_id, target_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to also log the message from the exception?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh, I guess that's already done by LOG.exception.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep :-)

@aspiers
Copy link
Author

aspiers commented Mar 25, 2016

Tested on SOC6 and it works, although for some reason I have not managed to reproduce a situation where a retry was actually required. But manual retries from CLI work, so I can't imagine why they wouldn't from the OCF.

@aspiers
Copy link
Author

aspiers commented Mar 25, 2016

(BTW this requires adding --retry to the two invocations of the tool in the OCF start action)

@vuntz
Copy link
Member

vuntz commented Mar 25, 2016

+1

@aspiers aspiers changed the title [NEEDS TESTING] many neutron-ha-tool fixes many neutron-ha-tool fixes Mar 25, 2016
@aspiers
Copy link
Author

aspiers commented Mar 25, 2016

@AbelNavarro
Copy link

+1

@aspiers aspiers merged commit 551dadd into SUSE-Cloud:neutron-ha-tool-maintenance Mar 31, 2016
jsuchome pushed a commit to jsuchome/barclamp-neutron that referenced this pull request Apr 1, 2016
SUSE-Cloud/cookbook-openstack-network#1
adds retry logic to neutron-ha-tool so that it will not fail
immediately if there are problems talking to the neutron-server
API.  Therefore we need to give neutron-ha-tool enough time
to have a decent chance of succeeding if the neutron-server
still needs a while to recover after some kind of failover.

https://bugzilla.suse.com/show_bug.cgi?id=965886
(cherry picked from commit 55504e2e8ccfb23abb8a6b5b31e4ff7ef99e911b)
openstack-gerrit pushed a commit to openstack-archive/openstack-resource-agents that referenced this pull request Apr 1, 2016
SUSE-Cloud/cookbook-openstack-network#1
adds (amongst many other things) support for neutron-ha-tool to retry
its connections to neutron-server.  By taking advantage of this in
this OCF RA, we can make failover more robust.

Signed-off-by: Adam Spiers <aspiers@suse.com>
Change-Id: I41c37500f691e2e0ecfd6c31f1720f483513e447
jsuchome pushed a commit to jsuchome/barclamp-neutron that referenced this pull request Apr 4, 2016
SUSE-Cloud/cookbook-openstack-network#1
adds retry logic to neutron-ha-tool so that it will not fail
immediately if there are problems talking to the neutron-server
API.  Therefore we need to give neutron-ha-tool enough time
to have a decent chance of succeeding if the neutron-server
still needs a while to recover after some kind of failover.

https://bugzilla.suse.com/show_bug.cgi?id=965886
(cherry picked from commit 55504e2e8ccfb23abb8a6b5b31e4ff7ef99e911b)
bhdn pushed a commit to bhdn/openstack-resource-agents that referenced this pull request Apr 5, 2016
SUSE-Cloud/cookbook-openstack-network#1
adds (amongst many other things) support for neutron-ha-tool to retry
its connections to neutron-server.  By taking advantage of this in
this OCF RA, we can make failover more robust.

Signed-off-by: Adam Spiers <aspiers@suse.com>
Change-Id: I41c37500f691e2e0ecfd6c31f1720f483513e447
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants