-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
many neutron-ha-tool fixes #1
many neutron-ha-tool fixes #1
Conversation
except: | ||
LOG.exception("There was an error migrating a router") | ||
continue | ||
migration_count = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already initialized at https://github.com/SUSE-Cloud/cookbook-openstack-network/pull/1/files#diff-937ed0b9cba9efd2521c8fd1c9e8416fR268
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot - fixed.
I think those commits replacing LOG.exception could be squashed into one, the reasoning is always the same. But in general LGTM |
While you were adding also some older patches to this PR (like the one for password file), could you also add a commit with https://build.suse.de/package/view_file/Devel:Cloud:6:Staging/openstack-neutron/0001-L3-agent-migration-should-only-choose-enabled-agent.patch?expand=1 ? |
8c601fa
to
6aa4b9a
Compare
The reasoning is the same, but the damage is different and also some cases are handled in different ways (e.g.
Yeah, good catch - thanks I will! |
Ah, I had already moved this into a separate branch which I was going to submit as a separate PR, because I didn't realise we already had a patch for it. Actually I'll just push it and the password patch into the target branch ( |
I've tested with this patch, and although I did got failures of neutron-ha-tool, they are not related to this changes - and yes, I do not see any extra errors coming from incorrect exceptions handling. +1 |
if os.getenv('OS_PASSWORD'): | ||
os_password = os.environ['OS_PASSWORD'] | ||
else: | ||
with open('/etc/neutron/os_password') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check for the file existence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean so that we can abort with a more friendly error message, rather than an exception and a stacktrace? Sounds reasonable to me. Or do you mean something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I meant, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in a new commit.
At this point, we haven't even parsed the CLI arguments in order to determine logging level, so we haven't set up the logger properly yet. Even if we had, there is no exception to log.
LOG.exception() should only be used when an exception was caught. In this internal helper function, when something goes wrong, it makes sense to raise exceptions which can optionally be caught by callers. Previously it was not bailing correctly if the router failed to be removed.
d651f83
to
ee91d1d
Compare
@@ -22,6 +22,7 @@ | |||
import argparse | |||
import random | |||
import time | |||
import retrying |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong import order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, tried it and works fine. Why is it wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List of imports is supposed to be sorted.
That being said, retrying
is maybe not in the standard library? If not, then it needs a blank like before the import
.
I've also tested it several times now and it seems to work correctly. Do we still need WIP and Do not merge tags? |
5d02f61
to
184898b
Compare
@jsuchome I think the point about |
@jsuchome and also two other items:
|
SUSE-Cloud/cookbook-openstack-network#1 adds retry logic to neutron-ha-tool so that it will not fail immediately if there are problems talking to the neutron-server API. Therefore we need to give neutron-ha-tool enough time to have a decent chance of succeeding if the neutron-server still needs a while to recover after some kind of failover. https://bugzilla.suse.com/show_bug.cgi?id=965886
SUSE-Cloud/cookbook-openstack-network#1 adds retry logic to neutron-ha-tool so that it will not fail immediately if there are problems talking to the neutron-server API. Therefore we need to give neutron-ha-tool enough time to have a decent chance of succeeding if the neutron-server still needs a while to recover after some kind of failover. https://bugzilla.suse.com/show_bug.cgi?id=965886 (cherry picked from commit 55504e2e8ccfb23abb8a6b5b31e4ff7ef99e911b)
184898b
to
b99ae36
Compare
89c2c5e
to
0d487eb
Compare
To have a log message at the end with an accurate count of how many migrations were need, we should count all needed migrations, regardless of whether we're in dry-run mode or whether some migrations failed. This also makes future refactoring easier.
Remove code duplication by logging inside migrate_router(). Also, if another caller of migrate_router() was introduced, it would accidentally omit to log the migration. Moving the log message inside migrate_router() prevents this.
This removes some code duplication and helps make l3_agent_migrate() and l3_agent_evacuate (which are already overgrown) more concise and readable. Also remove some spurious 'continue' keywords.
This removes duplicated code, and helps make l3_agent_migrate() and l3_agent_evacuate() (which are already overgrown) more concise and readable.
In the evacuation case, migrate_router() can be called on alive agents, not just dead ones. Also add a debug message.
If the password isn't provided via OS_PASSWORD environment variable or in /etc/neutron/os_password, then die gracefully.
Output an error if no action is specified. Also output an error if more than one action is specified, because it's too complicated to support meaningful semantics for exit codes if multiple actions are run. For example, if --l3-agent-migrate and --replicate-dhcp were both specified to the same invocation, each action could independently have one of three outcomes: - no action was required (everything already healthy) - some action was required (router migration or dhcp replication) - some error was encountered during the action That would require choosing script exit codes to meaningfully represent 9 possible different outcomes, just for this combination of two actions. Instead, we just limit invocation to one action at a time, and then each action can have easily understandable exit codes associated with its outcomes. This should have no impact since nothing currently requires multiple actions within a single run, and even if it did, it could simply invoke the tool multiple times, once for each action.
If neutron-ha-tool's monitor action fails during a window in which the database or message bus are down, or still starting up, its consequent start action will fail. There's no clean way to make Pacemaker do a sensible number of retries at staggered intervals, so inside we embed the retry logic inside the tool itself. Ideally it would have gone in the OCF RA, but it's uglier doing it in bash. https://bugzilla.suse.com/show_bug.cgi?id=965886
Make the usage text include a more helpful argument name: [--l3-agent-evacuate AGENT_ID]
If the "now" parameter to l3_agent_migrate() is false, and an agent comes online during the period of waiting, migration will be abandoned. So document this in the docstring.
Clarify when variables are referring to hostnames or agent ids.
0d487eb
to
5be4099
Compare
I think I have now addressed all remaining concerns, and some extra ones:
|
return True | ||
except: | ||
LOG.exception("Failed to migrate router=%s from agent=%s " | ||
"to agent=%s", router_id, agent_id, target_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to also log the message from the exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh, I guess that's already done by LOG.exception
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep :-)
Tested on SOC6 and it works, although for some reason I have not managed to reproduce a situation where a retry was actually required. But manual retries from CLI work, so I can't imagine why they wouldn't from the OCF. |
(BTW this requires adding |
+1 |
+1 |
SUSE-Cloud/cookbook-openstack-network#1 adds retry logic to neutron-ha-tool so that it will not fail immediately if there are problems talking to the neutron-server API. Therefore we need to give neutron-ha-tool enough time to have a decent chance of succeeding if the neutron-server still needs a while to recover after some kind of failover. https://bugzilla.suse.com/show_bug.cgi?id=965886 (cherry picked from commit 55504e2e8ccfb23abb8a6b5b31e4ff7ef99e911b)
SUSE-Cloud/cookbook-openstack-network#1 adds (amongst many other things) support for neutron-ha-tool to retry its connections to neutron-server. By taking advantage of this in this OCF RA, we can make failover more robust. Signed-off-by: Adam Spiers <aspiers@suse.com> Change-Id: I41c37500f691e2e0ecfd6c31f1720f483513e447
SUSE-Cloud/cookbook-openstack-network#1 adds retry logic to neutron-ha-tool so that it will not fail immediately if there are problems talking to the neutron-server API. Therefore we need to give neutron-ha-tool enough time to have a decent chance of succeeding if the neutron-server still needs a while to recover after some kind of failover. https://bugzilla.suse.com/show_bug.cgi?id=965886 (cherry picked from commit 55504e2e8ccfb23abb8a6b5b31e4ff7ef99e911b)
SUSE-Cloud/cookbook-openstack-network#1 adds (amongst many other things) support for neutron-ha-tool to retry its connections to neutron-server. By taking advantage of this in this OCF RA, we can make failover more robust. Signed-off-by: Adam Spiers <aspiers@suse.com> Change-Id: I41c37500f691e2e0ecfd6c31f1720f483513e447
In a Crowbar environment, the retry commit requires crowbar/crowbar-openstack#344 and also https://review.openstack.org/#/c/297664/ in order to be effective.
Whilst looking into bsc#965886 (L3: neutron-ha-tool not migrating dhcp networks during failover), I discovered many instances of braindead code in
neutron-ha-tool
. In particular, previous authors seemed to think thatLOG.exception()
aborts the code flow, which it doesn't, so several corner cases just completely fail.Rather than submitting all these changes as individual rpm patches like it's 1995, I've forked the original upstream git repo to this fork, resurrected the Icehouse branch (which was tagged as
eol-icehouse
) and created a newneutron-ha-tool-maintenance
branch from it which can be subject to the normal code review process. We can then easily repoint https://build.suse.de/package/show/Devel:Cloud:6:Staging/openstack-neutron at it. Similarly for Cloud 5.