
Rolling update & upgrade #43

Closed
wants to merge 61 commits

Conversation

jschmid1
Contributor

@jschmid1 jschmid1 commented Nov 2, 2016

This serves the purpose of solving the problem of updating & rebooting; see [0] for a detailed explanation.

With this decoupled approach we update the kernel & reboot per role, sequentially and in the correct order (mon -> osd, etc.), and hard-fail if something goes wrong.

You can then safely run the second step (for now called maintenance - the naming still needs to be straightened out), which installs everything BUT the kernel (not implemented) and does a graceful restart, which was previously implemented with ceph.restart.

[0]:
If you currently run stage0 it will first install all updates and subsequently check whether a kernel update was applied. If so, DeepSea initiates a reboot. The issue with this is that you don't know whether a new ceph-* binary was added to your system. In a larger-scale cluster you want to run on different versions of ceph for as short a time as possible. Reboots might take up to 15 minutes per node, which means you end up with different versions of ceph-* for a rather long time.

  • don't merge this yet. - WIP

Signed-off-by: Joshua Schmid jschmid@suse.de

@jschmid1
Contributor Author

jschmid1 commented Nov 2, 2016

So we will most likely end up with something like this:

Phase 1)
Take a host from the pool of roles:monitor

  • In case the cluster is healthy:
  • apply only kernel updates
  • reboot the node, if there was a kernel update
  • move to the next node

Repeat for all other roles -> osd -> rgw -> mds -> igw

You have now applied the kernel updates without downtime.

Phase 2)
Take a host from the pool of roles:monitor

  • In case the cluster is healthy:
  • apply all but kernel updates
  • restart the service, if its binary (ceph-mon) was updated
  • move to the next node

Repeat for all other roles -> osd -> rgw -> mds -> igw

You are now running the latest versions of ceph & the kernel without downtime.

In the grand scheme of things admins want to run this periodically and automatically as part of stages 0-5. It might be suitable to add this to the existing stages, although the reboot behavior differs depending on whether a cluster is running (parallel vs. serial).

Due to the need for these two phases I separated ceph.updates into ceph.updates.kernel and ceph.updates.regular. If there is a need to call them combined (e.g. in the initial deployment), ceph.updates includes both kernel and regular.

The said sls files are not adapted yet and don't do what they are supposed to do. WIP

Installing everything BUT the kernel might be handled inside jinja with something like:

zypper lu -a | awk '{print $5}' | grep -v 'kernel-default' | xargs sudo zypper up

but this might actually be a good candidate for a module, as I can already see it failing. Zypper parsing might also be very helpful for the service restart decision.
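
A rough sketch of how such a module function could look, assuming the default table layout of zypper lu (package name in the third pipe-separated column); the function name and the blacklist default are illustrative, not DeepSea's:

import subprocess


def up_except(blacklist=('kernel-default',)):
    """Install all pending updates whose package name is not blacklisted."""
    out = subprocess.check_output(['zypper', '--quiet', 'lu'],
                                  universal_newlines=True)
    packages = []
    for line in out.splitlines():
        fields = [field.strip() for field in line.split('|')]
        # data rows look like: S | Repository | Name | Current | Available | Arch
        if len(fields) >= 6 and fields[2] and fields[2] != 'Name':
            if fields[2] not in blacklist:
                packages.append(fields[2])
    if not packages:
        return 'nothing to update'
    return subprocess.call(['zypper', '--non-interactive', 'up'] + packages)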

@jschmid1 jschmid1 changed the title First draft for update & reboot & restart sequentially and in order glued together draft for update & reboot & restart sequentially and in order glued together Nov 2, 2016
@jschmid1 jschmid1 changed the title draft for update & reboot & restart sequentially and in order glued together draft for update & reboot & restart sequentially, in order glued together Nov 2, 2016
@jschmid1
Contributor Author

jschmid1 commented Nov 3, 2016

The restart condition for each service might look like this:

{% set installed_version = salt['cmd.run']("rpm -q --last ceph-mon | head -1 | cut -f1 -d+ | grep -Po '(\d+\.)?(\d+\.)?(\*|\d+)'") %}
{% set used_version = salt['cmd.run']("ceph-mon --version | head -1 | cut -f1 -d- | grep -Po '(\d+\.)?(\d+\.)?(\*|\d+)'") %}

with the corresponding unless block:

 - unless: "systemctl is-failed ceph-mon@{{ grains['host'] }}.service || [ {{ installed_version }} = {{ used_version }} ]"
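
A rough sketch of the same installed-vs-running check as a Python helper (e.g. in a custom execution module) instead of inline jinja; the function names are illustrative, not part of DeepSea:

import re
import subprocess


def _version(text):
    """Extract the first x.y or x.y.z version number found in text."""
    match = re.search(r'\d+\.\d+(\.\d+)?', text)
    return match.group(0) if match else None


def restart_needed(daemon='ceph-mon'):
    """True if the most recently installed package version differs from the running binary."""
    # 'rpm -q --last ceph-mon' lists installed versions, newest first
    installed = subprocess.check_output(
        ['rpm', '-q', '--last', daemon], universal_newlines=True).splitlines()[0]
    running = subprocess.check_output([daemon, '--version'],
                                      universal_newlines=True)
    return _version(installed) != _version(running)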

@jschmid1
Contributor Author

any comments on that?

@l-mb
Member

l-mb commented Nov 18, 2016

I'm not sure I agree with this particular approach, precisely because you want to reduce the number of reboots. Also, the updates might depend on each other - e.g., a new ceph-* package depends on a glibc update, iSCSI requires a new kernel version, etc.

I believe iterating over the nodes in the right sequence and upgrading them entirely in one step (stop, dup, restart, reboot if necessary) is the saner choice.

15 minutes really is not much. Two reboots are worse. (We'll have users upgrading the nodes one by one over days, I guarantee it.)

@jschmid1
Contributor Author

Thanks for the feedback!

I think Eric's initial plan was to run Stages 0-5 in a regular fashion (0..10 times a day). If you follow that path my proposed solution makes sense, because minor updates won't affect ceph's ability to function properly, hence no action is needed. That path actually reduces the number of reboots.

And due to that assumption we decided to go for a differentiated approach of only applying a restart or a reboot if it's actually needed. Your point on dependencies is correct, though. We have to agree on packages that will cause a reboot - currently it's only a kernel update.

If you, for example, start the update process at the monitors, the following scenario might eventuate:

If you followed the solution of just zypper dup, you would end up rebooting the nodes anyway. So nothing is gained here. The crucial point is also that dup might have installed a newer version of ceph-*. The reboot 'activates' this version - and as the reboot might take up to 15 minutes, the versions will be out of sync and perhaps cause some problems.

If you only install - if there is one - the latest kernel & reboot the machine, nothing serious happens and the ceph versions stay in sync. Now you can safely apply the remaining updates, which won't cause any reboots, and sequentially restart the ceph services.

@l-mb
Member

l-mb commented Nov 21, 2016

Services might fail at any time anyway and then be restarted on the newer version. That's something that must work, otherwise we've got a potential problem. Or they could reboot for whatever reason.

Restarting individual services on a node is pretty much a question of luck (unless containers were to be used). I don't think we need to over-optimize for this case.

As for the reboot detection, zypper actually knows after the update has been applied:
102 - ZYPPER_EXIT_INF_REBOOT_NEEDED
Returned after a successful installation of a patch which requires reboot of computer.

On other distros, this causes /var/run/reboot-required, not sure if this also happens on SUSE.
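
A minimal sketch of acting on that exit code after a patch run; the surrounding handling is illustrative rather than DeepSea's actual implementation:

import subprocess

# Exit code documented by zypper for "patch installed, reboot required"
ZYPPER_EXIT_INF_REBOOT_NEEDED = 102

ret = subprocess.call(['zypper', '--non-interactive', 'patch'])
if ret == ZYPPER_EXIT_INF_REBOOT_NEEDED:
    print('zypper reports that a reboot is required')
elif ret != 0:
    raise SystemExit('zypper patch failed with exit code {}'.format(ret))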

@jschmid1
Contributor Author

jschmid1 commented Nov 23, 2016

Services might fail at any time anyway and then be restarted on the newer version. That's something that must work, otherwise we've got a potential problem. Or they could reboot for whatever reason.
Restarting individual services on a node is pretty much a question of luck (unless containers were to be used). I don't think we need to over-optimize for this case.

If that's the case I'd support your approach as well.

As for the reboot detection, zypper actually knows after the update has been applied:
102 - ZYPPER_EXIT_INF_REBOOT_NEEDED
Returned after a successful installation of a patch which requires reboot of computer.

On other distros, this causes /var/run/reboot-required, not sure if this also happens on SUSE.

That is indeed pretty helpful, thanks for that valuable information.

So what you propose is splitting the 'restart' and the 'update' part, right?

If you as a user decide to update, we iterate over the roles/nodes and let zypper decide whether a reboot is necessary.

The ceph.restart procedure will mostly be used for manual intervention or for an intentional restart after a PTF or similar.

@l-mb
Member

l-mb commented Dec 14, 2016

So what you propose is splitting the 'restart' and the 'update' part, right?

Perhaps. What I'd do would be the following (a rough sketch in code follows this list):

  • Build a list of all nodes, ordered { MON; MDS; OSD; iSCSI gateways; radosgw }

    • If a node is already in the list, because it co-hosted a higher priority service, don't add it again
    • Co-hosted services will restart some services earlier than optimal, but without dedicated nodes or containers, that is effectively impossible to prevent in all cases; and if it can happen, we need to test it and make sure that it works. The best way to do that is to make it the regular case ;-)
    • I just realize that based on the previous comment we might as well randomize the order in which we upgrade the nodes. Must work, right?
  • Don't start unless HEALTH_OK

  • Determine if there are any updates at all that'd be applied.

    • Skip the node if not (already up to date)
  • Stop all services on a node

  • Apply all updates

  • Determine whether this requires a reboot (based on zypper feedback)

    • If zypper feedback is not sufficient, indeed we can look for newer kernels, modules, glibc etc. I'd suggest to have a whitelist of package patterns to trigger reboots then.
    • Reboot if so
  • Restart Ceph services

    • which the reboot ought to do automatically?
  • Wait until HEALTH_OK

  • Rinse & repeat
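
A hedged sketch of the per-node loop described above; every helper passed in (ordered_nodes, cluster_healthy, pending_updates, and so on) is hypothetical and stands in for the corresponding Salt/DeepSea call:

ROLE_ORDER = ['mon', 'mds', 'osd', 'igw', 'rgw']


def rolling_upgrade(ordered_nodes, cluster_healthy, pending_updates,
                    stop_services, apply_updates, reboot_needed, reboot,
                    restart_services, wait_for_health_ok):
    """Upgrade nodes one at a time in role order, rebooting only when needed."""
    seen = set()
    for node in ordered_nodes(ROLE_ORDER):
        if node in seen:               # co-hosted roles: handle each node only once
            continue
        seen.add(node)
        if not cluster_healthy():      # don't start unless HEALTH_OK
            raise RuntimeError('cluster is not HEALTH_OK, aborting')
        if not pending_updates(node):  # skip nodes that are already up to date
            continue
        stop_services(node)
        apply_updates(node)            # e.g. zypper dup / zypper up
        if reboot_needed(node):        # e.g. zypper exit code 102
            reboot(node)               # systemd brings the Ceph services back up
        else:
            restart_services(node)
        wait_for_health_ok()           # wait for recovery, then rinse & repeat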

@jschmid1
Contributor Author

jschmid1 commented Dec 14, 2016

One thing I simply don't know is what could possibly happen if ceph runs on different versions for a longer period of time. Going with the all-in-one update and reboot strategy could theoretically cause a cluster to be out of sync (version-wise) for some hours.
My best guess is that it depends on whether that is an actual problem or not (major vs. minor version update). But making that decision is hard again.

Secondly, I'd like to know what the advantage of:

  • Stop all services on a node

  • Apply all updates
    [..]

  • Restart Ceph services

is over:

  • Apply all updates

  • Check if a restart/boot is actually needed.

    • Restart/Reboot

That would potentially save one round of restarts + a bit of cluster shakiness..

But if you know of any potential problems that might occur when updating while the process is running, this might be a tradeoff we have to accept.

I just realize that based on the previous comment we might as well randomize the order in which we upgrade the nodes. Must work, right?

Well, true. Going with the update-reboot solution we lose the ability to control the order in which we restart services to some extent. We might need some practical feedback/testing/evaluation to see whether that causes any problems.

@jschmid1 jschmid1 self-assigned this Dec 14, 2016
@smithfarm
Contributor

smithfarm commented Dec 14, 2016

One thing I simply don't know is what could possibly happen if ceph runs on different versions for a longer period of time. Going with the all-in-one update and reboot strategy could theoretically cause a cluster to be out of sync (version-wise) for some hours.

That should not be any problem at all. Ceph undergoes extensive upgrade testing upstream.

Another question is whether the users will be prepared (psychologically speaking) for the update to take several hours.

@jschmid1
Contributor Author

jschmid1 commented Dec 15, 2016

That should not be any problem at all. Ceph undergoes extensive upgrade testing upstream.

That's good to hear.

Another question is whether the users will be prepared (psychologically speaking) for the update to take several hours.

I guess we need to document the process very precisely to make sure they understand why it might take so long.

@smithfarm Does upstream also test a random order? That is, not starting as recommended with MON -> OSD ..?

@smithfarm
Contributor

No, it's not random. But they do test different orders. See e.g. https://github.com/ceph/ceph/blob/master/qa/suites/upgrade/jewel-x/parallel/3-upgrade-sequence/upgrade-mon-osd-mds.yaml

@jschmid1
Contributor Author

great, thanks for the hint.

So the last question is whether to:

  • stop services -> update -> start
    or
  • update -> maybe restart

@swiftgist
Contributor

@l-mb @smithfarm Just for clarification, when either of you says "random" order, are both of you okay with a standard approach of monitors, then storage, then the remaining roles in general? For a Ceph cluster with dedicated nodes, there's no real issue even with reboots.

My concern is when things go bad: how does that leave the cluster, and what is the admin left holding?
I think leaving an admin with "Yeah, the monitors upgraded fine, as did most of the storage nodes, but we're in a catch-22 with some of the gateways on shared storage nodes" is better than "Well, some monitors upgraded, some storage nodes upgraded, one igw is working, another isn't, some other services partially upgraded too, but the upgrade stopped". The number of conversations asking which nodes were upgraded, and the scorecard the admin needs to create for management and support, aren't really a selling point.

I guess six partial upgrades leave me edgy, whereas two complete upgrades followed by a failure are easy to describe.

@l-mb
Member

l-mb commented Feb 2, 2017

@swiftgist Trying to order the nodes is fine, but ordering them is probably not a hard mandatory aspect.

If the update fails for any node, flag that node, abort the update sequence? And once the admin has manually resolved whatever problem occurred, they can just restart the update process? We don't even need to start in the middle, since upgrading an already upgraded node is effectively a no-op and would just be skipped anyway.

@jschmid1 jschmid1 force-pushed the wip-update-and-restart branch 2 times, most recently from 7c2c089 to fbfc57b Compare February 15, 2017 15:21
@jschmid1
Contributor Author

This pull request tries to achieve the following:

  • abstract the package manager calls into a module to make DeepSea portable

  • handle the reaction to the package manager's return codes in that module

    • srv/salt/_modules/packagemanager.py
  • granularize updates (packages vs kernel)

    • srv/salt/ceph/updates/regular
    • srv/salt/ceph/updates/kernel
    • call them combined with ceph.updates
  • add states for maintenance mode

    • updates
      • get nodes ordered and unique
      • ask the pm module to do reboots
      • abort if the cluster can't recover from HEALTH_ERR within $timeout (a possible health-wait helper is sketched below)
    • upgrades
      • prevalidation checks (not yet)
      • get nodes ordered and unique
      • zypper dup
  • refactor stage0

    • when prep detects a running cluster, execute 'maintenance mode update'
    • if not, call ceph.updates in parallel
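
A possible shape of the health-wait step mentioned in the list above; the ceph health invocation and JSON field names are assumptions (they vary between Ceph releases), not the exact DeepSea code:

import json
import subprocess
import time


def wait_for_health_ok(timeout=600, interval=10):
    """Poll 'ceph health' until HEALTH_OK or give up after the timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(['ceph', 'health', '--format=json'],
                                      universal_newlines=True)
        data = json.loads(out)
        # the key is 'overall_status' on older releases, 'status' on newer ones
        if data.get('status', data.get('overall_status')) == 'HEALTH_OK':
            return True
        time.sleep(interval)
    raise RuntimeError('cluster did not reach HEALTH_OK within {}s'.format(timeout))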

log.info("returncode: {}".format(proc.returncode))
if proc.returncode == 0:
    if os.path.isfile('/var/run/reboot-required'):
        self._reboot()
Contributor

I don't have a total grasp of how all these pieces fit together, so I want to make sure I'm understanding it. From my understanding, I'm seeing a possibility of issuing multiple reboots between this code (the PackageManager classes) and the sls file that does upgrades.

On this line, I see that when up or dup is called, the package manager (Apt in this case, but a similar line exists for Zypper patch) issues a reboot automatically.

In the SLS file, there is also a reboot line. When PackageManager issues a reboot, does it set auto_reboot in the pillar to False so there won't be a double reboot?

I know @jschmid1 also mentioned needing to check old vs new kernel version since zypper has some bugs/lack of reboot reporting to work around. Would it make sense to instead of issuing a reboot on this line to instead set the auto_reboot to True and let that be handled in the SLS?

Contributor Author

I added autoreboot: False|True to globally set the default behaviour on whether deepsea has the right to 'automagically reboot' in case the system requests it. So see it as a 'user variable' rather than an internal one.

My intention was to do it the other way around. I mentioned in another comment that changing to 'zypper patch' will get rid of the ceph/update/reboot sls file.

And so technically double reboots won't be possible with zypper right now, as the 107 will never be received (due to zypper up's missing return-code implementation). They may be possible with Apt, but as PackageManager will cause a reboot, the next step in the orchestrate file (after the reboot) is ceph.upgrade.reboot, which compares the 'installed' vs 'used' kernel, and these are the same after the reboot.

Contributor

OH! Okay. I am following now. I didn't realize the auto_reboot was a gate to allow that to happen at all. #samepage
Sounds all good. Zypper makes it non-ideal it seems, but I think I have a good idea of what's happening now. Thanks!

@jschmid1 jschmid1 changed the title draft for update & reboot & restart sequentially, in order glued together Rolling update & upgrade Apr 12, 2017
@swiftgist
Contributor

merged into wip-updates-and-restart

@swiftgist swiftgist mentioned this pull request Apr 22, 2017
@jschmid1
Contributor Author

merged with #222, closing

@jschmid1 jschmid1 closed this Apr 24, 2017