
Rolling update & upgrade #43

Closed
wants to merge 61 commits

Conversation

jschmid1
Contributor

@jschmid1 jschmid1 commented Nov 2, 2016

This serves the purpose of solving the problem of updating & rebooting; see [0] for a detailed explanation.

With this decoupled approach we update the kernel & reboot per role, sequentially and in the correct order (mon -> osd, etc.), and hard-fail if something goes wrong.

You can then safely run the second step (for now called maintenance - the naming still needs to be straightened out), which installs everything BUT the kernel (not implemented) and does a graceful restart, which was previously implemented with ceph.restart.

[0]:
If you currently run stage0 it will first install all updates and subsequently check whether a kernel update was applied. If so, DeepSea initiates a reboot. The issue with this is that you don't know whether a new ceph-* binary was added to your system. In a larger-scale cluster you want to run on different versions of ceph for as short a time as possible. Reboots might take up to 15 minutes per node, which means you end up with different versions of ceph-* for a rather long time.

  • don't merge this yet. - WIP

Signed-off-by: Joshua Schmid jschmid@suse.de

@jschmid1
Contributor Author

jschmid1 commented Nov 2, 2016

So we will most likely end up with something like this:

Phase 1)
Take a host from the pool of roles:monitor

  • In case the cluster is healthy:
  • apply only kernel updates
  • reboot the node, if there was a kernel update
  • move to the next node

Repeat for all other roles -> osd -> rgw -> mds -> igw

You have now applied the kernel updates without downtime.

Phase 2)
Take a host from the pool of roles:monitor

  • In case the cluster is healthy:
  • apply all but kernel updates
  • restart the service, if its binary (ceph-mon) was updated
  • move to the next node

Repeat for all other roles -> osd -> rgw -> mds -> igw

You are now running the latest versions of ceph & the kernel without downtime.

In the grand scheme of things admins want to run this periodically and automatically as part of stages 0-5. It might be suitable to add this to the existing stages, although the reboot behavior differs depending on whether a cluster is running (parallel vs. serial).

Due to the need for these two phases I separated ceph.updates into ceph.updates.kernel and ceph.updates.regular. If there is a need to call them combined (e.g. in the initial deployment), ceph.updates includes both kernel and regular.

The said sls files are not adapted yet and don't do what they are supposed to do. WIP

Installing everything BUT the kernel might be handled inside jinja with something like:

zypper lu -a | awk '{print $5}' | grep -v 'kernel-default' | xargs sudo zypper up

but this might actually be a good candidate for a module, as I can already see it failing. Zypper parsing might also be very helpful for the service restart decision.
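
A rough sketch of how such a module function could look, assuming the default table layout of zypper lu (package name in the third pipe-separated column); the function name and the blacklist default are illustrative, not DeepSea's:

import subprocess


def up_except(blacklist=('kernel-default',)):
    """Install all pending updates whose package name is not blacklisted."""
    out = subprocess.check_output(['zypper', '--quiet', 'lu'],
                                  universal_newlines=True)
    packages = []
    for line in out.splitlines():
        fields = [field.strip() for field in line.split('|')]
        # data rows look like: S | Repository | Name | Current | Available | Arch
        if len(fields) >= 6 and fields[2] and fields[2] != 'Name':
            if fields[2] not in blacklist:
                packages.append(fields[2])
    if not packages:
        return 'nothing to update'
    return subprocess.call(['zypper', '--non-interactive', 'up'] + packages)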

@jschmid1 jschmid1 changed the title First draft for update & reboot & restart sequentially and in order glued together draft for update & reboot & restart sequentially and in order glued together Nov 2, 2016
@jschmid1 jschmid1 changed the title draft for update & reboot & restart sequentially and in order glued together draft for update & reboot & restart sequentially, in order glued together Nov 2, 2016
@jschmid1
Contributor Author

jschmid1 commented Nov 3, 2016

The restart condition for each service might look like this:

{% set installed_version = salt['cmd.run']("rpm -q --last ceph-mon | head -1 | cut -f1 -d+ | grep -Po '(\d+\.)?(\d+\.)?(\*|\d+)'") %}
{% set used_version = salt['cmd.run']("ceph-mon --version | head -1 | cut -f1 -d- | grep -Po '(\d+\.)?(\d+\.)?(\*|\d+)'") %}

with the corresponding unless block:

 - unless: "systemctl is-failed ceph-mon@{{ grains['host'] }}.service || [ {{ installed_version }} = {{ used_version }} ]"
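
A rough sketch of the same installed-vs-running check as a Python helper (e.g. in a custom execution module) instead of inline jinja; the function names are illustrative, not part of DeepSea:

import re
import subprocess


def _version(text):
    """Extract the first x.y or x.y.z version number found in text."""
    match = re.search(r'\d+\.\d+(\.\d+)?', text)
    return match.group(0) if match else None


def restart_needed(daemon='ceph-mon'):
    """True if the most recently installed package version differs from the running binary."""
    # 'rpm -q --last ceph-mon' lists installed versions, newest first
    installed = subprocess.check_output(
        ['rpm', '-q', '--last', daemon], universal_newlines=True).splitlines()[0]
    running = subprocess.check_output([daemon, '--version'],
                                      universal_newlines=True)
    return _version(installed) != _version(running)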

@jschmid1
Contributor Author

any comments on that?

@l-mb
Member

l-mb commented Nov 18, 2016

I'm not sure I agree with this particular approach, precisely because you want to reduce the number of reboots. Also, the updates might depend on each other - e.g., a new ceph-* package depends on a glibc update, iSCSI requires a new kernel version, etc.

I believe iterating over the nodes in the right sequence and upgrading them entirely in one step (stop, dup, restart, reboot if necessary) is the saner choice.

15 minutes really is not much. Two reboots are worse. (We'll have users upgrading the nodes one by one over days, I guarantee it.)

@jschmid1
Contributor Author

Thanks for the feedback!

I think Eric's initial plan was to run Stages 0-5 in a regular fashion (0..10 times a day). If you follow that path my proposed solution makes sense, because minor updates won't affect ceph's ability to function properly, hence no action is needed. That path actually reduces the number of reboots.

And due to that assumption we decided to go for a differentiated approach of only applying a restart or a reboot if it's actually needed. Your point on dependencies is correct, though. We have to agree on packages that will cause a reboot - currently it's only a kernel update.

If you, for example, start the update process at the monitors, the following scenario might eventuate:

If you followed the solution of just zypper dup, you would end up rebooting the nodes anyway. So nothing is gained here. The crucial point is also that dup might have installed a newer version of ceph-*. The reboot 'activates' this version - and as the reboot might take up to 15 minutes, the versions will be out of sync and perhaps cause some problems.

If you only install - if there is one - the latest kernel & reboot the machine, nothing serious happens and the ceph versions stay in sync. Now you can safely apply the remaining updates, which won't cause any reboots, and sequentially restart the ceph services.

@l-mb
Member

l-mb commented Nov 21, 2016

Services might fail at any time anyway and then be restarted on the newer version. That's something that must work, otherwise we've got a potential problem. Or they could reboot for whatever reason.

Restarting individual services on a node is pretty much a question of luck (unless containers were to be used). I don't think we need to over-optimize for this case.

As for the reboot detection, zypper actually knows after the update has been applied:
102 - ZYPPER_EXIT_INF_REBOOT_NEEDED
Returned after a successful installation of a patch which requires reboot of computer.

On other distros, this causes /var/run/reboot-required, not sure if this also happens on SUSE.
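
A minimal sketch of acting on that exit code after a patch run; the surrounding handling is illustrative rather than DeepSea's actual implementation:

import subprocess

# Exit code documented by zypper for "patch installed, reboot required"
ZYPPER_EXIT_INF_REBOOT_NEEDED = 102

ret = subprocess.call(['zypper', '--non-interactive', 'patch'])
if ret == ZYPPER_EXIT_INF_REBOOT_NEEDED:
    print('zypper reports that a reboot is required')
elif ret != 0:
    raise SystemExit('zypper patch failed with exit code {}'.format(ret))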

@jschmid1
Contributor Author

jschmid1 commented Nov 23, 2016

Services might fail at any time anyway and then be restarted on the newer version. That's something that must work, otherwise we've got a potential problem. Or they could reboot for whatever reason.
Restarting individual services on a node is pretty much a question of luck (unless containers were to be used). I don't think we need to over-optimize for this case.

If that's the case I'd support your approach as well.

As for the reboot detection, zypper actually knows after the update has been applied:
102 - ZYPPER_EXIT_INF_REBOOT_NEEDED
Returned after a successful installation of a patch which requires reboot of computer.

On other distros, this causes /var/run/reboot-required, not sure if this also happens on SUSE.

That is indeed pretty helpful, thanks for that valuable information.

So what you propose is splitting the 'restart' and the 'update' part, right?

If you as a user decide to update, we iterate over the roles/nodes and let zypper decide whether a reboot is necessary.

The ceph.restart procedure will mostly be used for manual intervention or for an intentional restart after a PTF or similar.

@l-mb
Member

l-mb commented Dec 14, 2016

So what you propose is splitting the 'restart' and the 'update' part, right?

Perhaps. What I'd do would be the following (a rough sketch in code follows this list):

  • Build a list of all nodes, ordered { MON; MDS; OSD; iSCSI gateways; radosgw }

    • If a node is already in the list, because it co-hosted a higher priority service, don't add it again
    • Co-hosted services will restart some services earlier than optimal, but without dedicated nodes or containers, that is effectively impossible to prevent in all cases; and if it can happen, we need to test it and make sure that it works. The best way to do that is to make it the regular case ;-)
    • I just realize that based on the previous comment we might as well randomize the order in which we upgrade the nodes. Must work, right?
  • Don't start unless HEALTH_OK

  • Determine if there are any updates at all that'd be applied.

    • Skip the node if not (already up to date)
  • Stop all services on a node

  • Apply all updates

  • Determine whether this requires a reboot (based on zypper feedback)

    • If zypper feedback is not sufficient, indeed we can look for newer kernels, modules, glibc etc. I'd suggest to have a whitelist of package patterns to trigger reboots then.
    • Reboot if so
  • Restart Ceph services

    • which the reboot ought to do automatically?
  • Wait until HEALTH_OK

  • Rinse & repeat
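
A hedged sketch of the per-node loop described above; every helper passed in (ordered_nodes, cluster_healthy, pending_updates, and so on) is hypothetical and stands in for the corresponding Salt/DeepSea call:

ROLE_ORDER = ['mon', 'mds', 'osd', 'igw', 'rgw']


def rolling_upgrade(ordered_nodes, cluster_healthy, pending_updates,
                    stop_services, apply_updates, reboot_needed, reboot,
                    restart_services, wait_for_health_ok):
    """Upgrade nodes one at a time in role order, rebooting only when needed."""
    seen = set()
    for node in ordered_nodes(ROLE_ORDER):
        if node in seen:               # co-hosted roles: handle each node only once
            continue
        seen.add(node)
        if not cluster_healthy():      # don't start unless HEALTH_OK
            raise RuntimeError('cluster is not HEALTH_OK, aborting')
        if not pending_updates(node):  # skip nodes that are already up to date
            continue
        stop_services(node)
        apply_updates(node)            # e.g. zypper dup / zypper up
        if reboot_needed(node):        # e.g. zypper exit code 102
            reboot(node)               # systemd brings the Ceph services back up
        else:
            restart_services(node)
        wait_for_health_ok()           # wait for recovery, then rinse & repeat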

@jschmid1
Contributor Author

jschmid1 commented Dec 14, 2016

One thing I simply don't know is what could possibly happen if ceph runs on different versions for a longer period of time. Going with the all-in-one update and reboot strategy could theoretically cause a cluster to be out of sync (version-wise) for some hours.
My best guess is that it depends on whether that is an actual problem or not (major vs. minor version update). But making that decision is hard again.

Secondly, I'd like to know what the advantage of:

  • Stop all services on a node

  • Apply all updates
    [..]

  • Restart Ceph services

is over:

  • Apply all updates

  • Check if a restart/boot is actually needed.

    • Restart/Reboot

That would potentially save one round of restarts + a bit of cluster shakiness..

But if you know of any potential problems that might occur when updating while the process is running, this might be a tradeoff we have to accept.

I just realize that based on the previous comment we might as well randomize the order in which we upgrade the nodes. Must work, right?

Well, true. Going with the update-reboot solution we lose the ability to control the order in which we restart services to some extent. We might need some practical feedback/testing/evaluation to see whether that causes any problems.

@jschmid1 jschmid1 self-assigned this Dec 14, 2016
@smithfarm
Contributor

smithfarm commented Dec 14, 2016

One thing I simply don't know is what could possibly happen if ceph runs on different versions for a longer period of time. Going with the all-in-one update and reboot strategy could theoretically cause a cluster to be out of sync (version-wise) for some hours.

That should not be any problem at all. Ceph undergoes extensive upgrade testing upstream.

Another question is whether the users will be prepared (psychologically speaking) for the update to take several hours.

@jschmid1
Contributor Author

jschmid1 commented Dec 15, 2016

That should not be any problem at all. Ceph undergoes extensive upgrade testing upstream.

That's good to hear.

Another question is whether the users will be prepared (psychologically speaking) for the update to take several hours.

I guess we need to document the process very precisely to make sure they understand why it might take so long.

@smithfarm Does upstream also test a random order? That is, not starting as recommended with MON -> OSD ..?

@smithfarm
Contributor

No, it's not random. But they do test different orders. See e.g. https://github.com/ceph/ceph/blob/master/qa/suites/upgrade/jewel-x/parallel/3-upgrade-sequence/upgrade-mon-osd-mds.yaml

@jschmid1
Contributor Author

great, thanks for the hint.

So the last question is whether to:

  • stop services -> update -> start
    or
  • update -> maybe restart

@swiftgist
Contributor

@l-mb @smithfarm Just for clarification, when either of you says "random" order, are both of you okay with a standard approach of monitors, then storage, then the remaining roles in general? For a Ceph cluster with dedicated nodes, there's no real issue even with reboots.

My concern is when things go bad: how does that leave the cluster, and what is the admin left holding?
I think leaving an admin with "Yeah, the monitors upgraded fine, as did most of the storage nodes, but we're in a catch-22 with some of the gateways on shared storage nodes" is better than "Well, some monitors upgraded, some storage nodes upgraded, one igw is working, another isn't, some other services partially upgraded too, but the upgrade stopped". The number of conversations asking which nodes were upgraded, and the scorecard the admin needs to create for management and support, aren't really a selling point.

I guess six partial upgrades leave me edgy, whereas two complete upgrades followed by a failure are easy to describe.

@l-mb
Member

l-mb commented Feb 2, 2017

@swiftgist Trying to order the nodes is fine, but ordering them is probably not a hard mandatory aspect.

If the update fails for any node, flag that node, abort the update sequence? And once the admin has manually resolved whatever problem occurred, they can just restart the update process? We don't even need to start in the middle, since upgrading an already upgraded node is effectively a no-op and would just be skipped anyway.

@jschmid1 jschmid1 force-pushed the wip-update-and-restart branch 2 times, most recently from 7c2c089 to fbfc57b Compare February 15, 2017 15:21
@jschmid1
Contributor Author

This pull request tries to achieve the following:

  • abstract the package manager calls into a module to make DeepSea portable

  • handle the reaction to the package manager's return codes in that module

    • srv/salt/_modules/packagemanager.py
  • granularize updates (packages vs kernel)

    • srv/salt/ceph/updates/regular
    • srv/salt/ceph/updates/kernel
    • call them combined with ceph.updates
  • add states for maintenance mode

    • updates
      • get nodes ordered and unique
      • ask the pm module to do reboots
      • abort if the cluster can't recover from HEALTH_ERR within $timeout (a possible health-wait helper is sketched below)
    • upgrades
      • prevalidation checks (not yet)
      • get nodes ordered and unique
      • zypper dup
  • refactor stage0

    • when prep detects a running cluster, execute 'maintenance mode update'
    • if not, call ceph.updates in parallel
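
A possible shape of the health-wait step mentioned in the list above; the ceph health invocation and JSON field names are assumptions (they vary between Ceph releases), not the exact DeepSea code:

import json
import subprocess
import time


def wait_for_health_ok(timeout=600, interval=10):
    """Poll 'ceph health' until HEALTH_OK or give up after the timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(['ceph', 'health', '--format=json'],
                                      universal_newlines=True)
        data = json.loads(out)
        # the key is 'overall_status' on older releases, 'status' on newer ones
        if data.get('status', data.get('overall_status')) == 'HEALTH_OK':
            return True
        time.sleep(interval)
    raise RuntimeError('cluster did not reach HEALTH_OK within {}s'.format(timeout))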

log.info("returncode: {}".format(proc.returncode))
if proc.returncode == 0:
    if os.path.isfile('/var/run/reboot-required'):
        self._reboot()
Contributor

I don't have a total grasp of how all these pieces fit together, so I want to make sure I'm understanding it. From my understanding, I'm seeing a possibility of issuing multiple reboots between this code (the PackageManager classes) and the sls file that does upgrades.

On this line, I see that when up or dup is called, the package manager (Apt in this case, but a similar line exists for Zypper patch) issues a reboot automatically.

In the SLS file, there is also a reboot line. When PackageManager issues a reboot, does it set auto_reboot in the pillar to False so there won't be a double reboot?

I know @jschmid1 also mentioned needing to check old vs new kernel version since zypper has some bugs/lack of reboot reporting to work around. Would it make sense to instead of issuing a reboot on this line to instead set the auto_reboot to True and let that be handled in the SLS?

Contributor Author

I added autoreboot: False|True to globally set the default behaviour on whether deepsea has the right to 'automagically reboot' in case the system requests it. So see it as a 'user variable' rather than an internal one.

My intention was to do it the other way around. I mentioned in another comment that changing to 'zypper patch' will get rid of the ceph/update/reboot sls file.

And so technically double reboots won't be possible with zypper right now, as the 107 will never be received (due to zypper up's missing return-code implementation). They may be possible with Apt, but as PackageManager will cause a reboot, the next step in the orchestrate file (after the reboot) is ceph.upgrade.reboot, which compares the 'installed' vs 'used' kernel, and these are the same after the reboot.

Contributor

OH! Okay. I am following now. I didn't realize the auto_reboot was a gate to allow that to happen at all. #samepage
Sounds all good. Zypper makes it non-ideal it seems, but I think I have a good idea of what's happening now. Thanks!

@jschmid1 jschmid1 changed the title draft for update & reboot & restart sequentially, in order glued together Rolling update & upgrade Apr 12, 2017
@swiftgist
Contributor

merged into wip-updates-and-restart

@swiftgist swiftgist mentioned this pull request Apr 22, 2017
@jschmid1
Contributor Author

merged with #222, closing

@jschmid1 jschmid1 closed this Apr 24, 2017