
Salt updates #1126

Closed
swiftgist opened this issue May 7, 2018 · 31 comments · Fixed by #1431

@swiftgist
Contributor

Description of Issue/Question

DeepSea cannot gracefully upgrade Salt. This is simply a function of how orchestrations work. When the Salt master restarts, the orchestration is terminated. Shoehorning this into the reactor would likely cause more issues than it solves.

Salt updates are not terribly frequent and DeepSea is completely dependent on a working Salt cluster. I believe Stage 0 needs to fail with instructions on installing and upgrading Salt. Some users may be using Salt for other purposes and bringing these changes to their attention should prevent any confusion. Also, updating and not restarting Salt daemons will likely end in heartache and confusion when a minion is upgraded prior to the master restarting.
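
A minimal sketch of what such a Stage 0 guard could look like as a Salt state (file path, state IDs and message text are illustrative assumptions, not DeepSea's actual implementation):

# /srv/salt/ceph/updates/check.sls -- hypothetical path
{% if salt['pkg.upgrade_available']('salt-master') or salt['pkg.upgrade_available']('salt-minion') %}
abort on pending salt update:
  test.fail_without_changes:
    - name: >
        A Salt update is pending. Update Salt and restart the daemons
        manually before re-running Stage 0.
{% else %}
salt packages are current:
  test.nop: []
{% endif %}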

@jschmid1 jschmid1 self-assigned this May 7, 2018
@jschmid1
Contributor

jschmid1 commented May 7, 2018

working on that

@jschmid1
Contributor

jschmid1 commented May 7, 2018

I'm also adding a warning for stage0/3 when updates are pending that would cause the daemons to restart.

@jschmid1
Contributor

jschmid1 commented May 7, 2018

wip-branch

@Martin-Weiss

Could we add logging and tracking of states in the orchestration so that we can "continue" at the point that failed or where the orchestration was stopped?

The problem is not only updating a salt-master (which IMO should not restart the salt-master automatically); the problem is also a loss of the SSH session, or a reboot that might be initiated during a long-running orchestration job, e.g. while doing a migration of filestore to bluestore.


@Martin-Weiss

Either we suppress the salt-master/salt-minion restart during the upgrade until we can reboot the servers (similar to the kernel update process), or we might need to find a way to detect the hanging/broken orchestration and exit on that.

I also like the idea of detecting an update for the salt-master/salt-minion, but in case we do not update these services as part of the orchestration, updating all of them manually is another big "manual" task.

It would be great to have a "salt-master/salt-minion" self-update as part of the orchestration, but I am not sure whether that is possible.

@Martin-Weiss

Maybe setting DISABLE_RESTART_ON_UPDATE = 1 in /etc/sysconfig/services might be a solution.

Set this, run zypper up salt-master/salt-minion, restart the services manually "async" right after exiting the orchestration with the message to the admin "please restart stage 0".
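
For reference, a sketch of the relevant sysconfig knob on SUSE (the "yes"/"no" value syntax is an assumption here, based on the later comment that the default ships as "no"):

# /etc/sysconfig/services -- suppress automatic daemon restarts on package update
DISABLE_RESTART_ON_UPDATE="yes"

With that set, zypper up salt-master salt-minion should leave the running daemons untouched, and the restart can be triggered explicitly afterwards.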

@jschmid1
Contributor

jschmid1 commented Oct 8, 2018

Either we suppress the salt-master/salt-minion restart during the upgrade until we can reboot the servers (similar to the kernel update process)

probably possible with the DISABLE_RESTART_ON_UPDATE = 1 directive.

or we might need to find a way to detect the hanging / broken orchestration and exit on that.

Not possible, because you lose connection to the bus.

I also like the idea of detecting an update for the salt-master/salt-minion but in case we do not update these services as part of the orchestration -

That's what the posted patch does. It's not ready though.

updating all of them manually is an other big "manual" task.

What do you mean by that?

Maybe setting DISABLE_RESTART_ON_UPDATE = 1 in /etc/sysconfig/services might be a solution.
Set this, run zypper up salt-master/salt-minion, restart the services manually "async" right after exiting the orchestration with the message to the admin "please restart stage 0".

We also gave that some thought. It's definitely doable with the tools we have.

@Martin-Weiss

updating all of them manually is another big "manual" task.
What do you mean by that?

Even if the orchestration cannot update salt-master/minions automatically and orchestrated, we need to find a way to do that with a single command, in the right way, for all the masters and minions.

Not sure if that helps - but would https://docs.saltstack.com/en/2017.7/topics/orchestrate/orchestrate_runner.html#masterless-orchestration help to update the master and something like https://www.shellhacks.com/upgrade-salt-minions/ for the minions?

Maybe we can just update the salt stack (master) always before running stage.0 tasks?
(in case the master is on the latest version in the attached channel - nothing would happen)

@jschmid1
Contributor

jschmid1 commented Oct 9, 2018

In our meeting we agreed on the least aggressive solution, which would be notifying the user that DeepSea detected a salt version which needs to be updated before stage.0 should be run.

For the CI this needs to be handled differently.

  1. Just fail when there is a salt update and ask the maintainer to update the base-image

  2. ??

@Martin-Weiss

In case we follow this road, we have to make sure to tell the admin how to update salt-master and all salt-minions in a proper way. In a large cluster we also have to provide a tool that does this, as manually connecting to all cluster nodes and updating the salt-master and salt-minion packages without using Salt might not be acceptable for DeepSea users. Maybe this whole topic needs to be escalated to the Salt maintainers, as this "salt self-update problem" seems to exist for all Salt deployments.

@jschmid1
Contributor

in a proper way. In a large cluster we also have to provide a tool that does this, as manually connecting to all cluster nodes and updating the salt-master and salt-minion packages without using Salt might not be acceptable for DeepSea users

I don't think that we have to go that route. It's sufficient to kick off a master & minion update via a separate (salt) state/orchestration. You might not get a result back, but you will end up with the new versions. From there on you can just re-run the stage.

@Martin-Weiss

Martin-Weiss commented Oct 17, 2018 via email

@jschmid1
Contributor

The current advice is the following:

You have a salt update pending. In order to provide a smooth experience, please update these packages manually before proceeding.
  salt -I 'roles:master' state.apply ceph.updates.master
and
  salt -G 'deepsea:*' state.apply ceph.updates.salt
in case you have deepsea_minions defined; otherwise run:
  salt -I 'cluster:ceph' state.apply ceph.updates.salt

Then we will be at the same place where we are already! When we update the salt-master via orchestration, the orchestration process hangs "forever". Or how do you want to "kick off" the update and show progress and result to the admin?

This would be a state which is expected to hang. That's something we can't change and don't need to change, imo. It's just a single command that is being executed, and it can be verified by running a zypper info right after.
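
A sketch of how that could be kicked off without blocking the shell, using the commands from the message above (the --async flag and the follow-up version check are suggestions, not part of DeepSea):

# fire-and-forget, since the return may never arrive once the daemons restart
salt --async -I 'roles:master' state.apply ceph.updates.master
salt --async -G 'deepsea:*' state.apply ceph.updates.salt

# verify afterwards, once the daemons are back
salt -G 'deepsea:*' pkg.version salt-minion
zypper info salt-master salt-minion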

@Martin-Weiss

Just trying to understand this ;-).
First of all we do NOT use orchestration for updating salt and we will stop any orchestration (stage.0 up to stage.5) in case there are any updates available for salt-master or salt-minion on any server?
Do we also verify during orchestration that salt-master and all salt-minions are on a compatible version level?
Second - we use "salt states" and "apply them manually" where we update the salt-master first, and then all minions in parallel.

To rephrase a bit and to ensure the actions are understood - please correct me if this is not correct (and maybe this can be used as the text for exiting the orchestration in case salt updates are detected):

Variant 1) In case you do have deepsea_minions defined using minion targeting i.e. via deepsea_minions: 'ses-5-admin* or ses-5-[1,2,3,4]*' in /srv/pillar/ceph/deepsea_minions.sls

execute salt -I 'roles:master' state.apply ceph.updates.master
execute salt -G 'deepsea:*' state.apply ceph.updates.salt

Variant 2) In case you have deepsea_minions defined using the deepsea grain via deepsea_minions: 'G@deepsea:*' in /srv/pillar/ceph/deepsea_minions.sls

execute salt -I 'cluster:ceph' state.apply ceph.updates.salt

Basically I see some challenges - not sure if they really exist ;-):
a) what happens if one or the other minion or the master already have different versions? Do we detect this properly?
b) does the salt-master state complete or hang in case the salt-master gets restarted (same is true for the minion update)?
c) why do we have a difference based on deepsea targeting at all? Why not just do a
salt '<admin host minion>' ceph.updates.master and
salt '*' ceph.updates.salt
-> I assume the ceph.updates.master JUST updates the salt-master package and ceph.updates.salt updates the salt-minions.
d) in variant 2) we update the master and the minions all at the same point in time instead of "master first" - not sure if that is the right way.
e) how do we do the sync of modules after these updates? Again by stage.0?
f) do the states verify that there are currently no jobs running on the master or minion before the update is applied to ensure we do not break anything?

@jschmid1
Contributor

First of all we do NOT use orchestration for updating salt and we will stop any orchestration (stage.0 up to stage.5) in case there are any updates available for salt-master or salt-minion on any server?

stage.0 will be stopped as this is the only stage that does a zypper up. We do not use an orchestration but a state for updating salt packages.

Do we also verify during orchestration that salt-master and all salt-minions are on a compatible version level?

No, if the users didn't mess with their cluster, the updates are applied the same way. We do check if a salt version is incompatible with the ceph release you are running. Secondly, it's not critical to run different minor versions of salt as long as this is fixed early (it will be, by applying the states).

Second - we use "salt states" and "apply them manually" where we update the salt-master first, and then all minions in parallel.

Manual, in the sense that we call the states ourselves as opposed to the orchestrations. Yes, master first, then the minions in parallel.

Variant 1) In case you do have deepsea_minions defined using minion targeting i.e. via deepsea_minions: 'ses-5-admin* or ses-5-[1,2,3,4]*' in /srv/pillar/ceph/deepsea_minions.sls

execute salt -I 'roles:master' state.apply ceph.updates.master
execute salt -G 'deepsea:*' state.apply ceph.updates.salt

Variant 2) In case you have deepsea_minions defined using the deepsea grain via deepsea_minions: 'G@deepsea:*' in /srv/pillar/ceph/deepsea_minions.sls

execute salt -I 'cluster:ceph' state.apply ceph.updates.salt

This

execute salt -I 'roles:master' state.apply ceph.updates.master

is independent of the 'variant'. It's only the targeting for the minions that changes.

a) what happens if one or the other minion or the master already have different versions? Do we detect this properly?

If this were a problem, you couldn't even call the state. If it is not, you'd apply the states and end up with the same patch level across all servers.

b) does the salt-master state complete or hang in case the salt-master gets restarted (same is true for the minion update)?

Yes.

c) why do we have a difference based on deepsea targeting at all? Why not just do a
salt '<admin host minion>' ceph.updates.master and
salt '*' ceph.updates.salt
-> I assume the ceph.updates.master JUST updates the salt-master package and ceph.updates.salt updates the salt-minions.

No.
ceph.updates.master updates the master, minion and the deepsea package whilst the ceph.updates.salt state only updates the minion package.

d) in variant 2) we update the master and the minions all at the same point in time instead of "master first" - not sure if that is the right way.

No, see comment for variants. The message doesn't seem clear enough. How would you rephrase it to make it more clear?

e) how do we do the sync of modules after these updates? Again by stage.0?

Yes

f) do the states verify that there are currently no jobs running on the master or minion before the update is applied to ensure we do not break anything?

At least we check that there are no jobs running that would interfere - in this case, no 'pkg.installed'.
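
To make the split between the two states concrete, a hedged sketch of what they could look like (these are not DeepSea's actual ceph/updates SLS files, only an illustration of "master, minion and deepsea on the master node" versus "minion package only" elsewhere):

# ceph/updates/master.sls (illustrative)
update master packages:
  pkg.latest:
    - pkgs:
      - salt-master
      - salt-minion
      - deepsea
    - refresh: True

# ceph/updates/salt.sls (illustrative)
update salt-minion:
  pkg.latest:
    - name: salt-minion
    - refresh: True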

@Martin-Weiss

This

 execute salt -I 'roles:master' state.apply ceph.updates.master

is independent of the 'variant'. It's only the targeting for the minions that changes.

Not sure if I understand this reply to my question correctly - why are you using ceph.updates.master sometimes and ceph.updates.salt at other times? ...OK, somewhat further down the thread it becomes clear with the following:

ceph.updates.master updates the master, minion and the deepsea package whilst the ceph.updates.salt state only updates the minion package.

Now that the states are clear to me this is the next open point:

b) does the salt-master state complete or hang in case the salt-master gets restarted (same is true for the minion update)?

Yes.

"Yes" on an "either or" question does not seem to be the right answer ;-)

We have to make sure that the salt command we use will work and give a proper status back on the update (done, successful, etc.). In the past I have seen that, when restarting the salt-minion or salt-master during the upgrade, the minion or master does NOT reply anymore. Maybe this problem is gone, and then I am fine with the above as "command finishes with complete and $?=0".

So basically could we simplify the whole salt update by this process:

salt '<admin host minion>' ceph.updates.master
salt '*' saltutil.sync_all
salt '*' ceph.updates.salt

(in case we update deepsea as part of ceph.updates.master we have to sync the modules, too - just in case the ceph.updates.salt would rely on one of the modules we deliver)

I am also not sure how this deals with someone working in openATTIC using the salt-api to run actions at the same point in time.. maybe we have to stop openATTIC before updating the salt-master, too?

@jschmid1
Contributor

b) does the salt-master state complete or hang in case the salt-master gets restarted (same is true for the minion update)?

Yes, the state hangs for a salt-master update, and yes, it also hangs on a minion update (I'm not 100% sure about that anymore; in the meantime they might reconnect to the master and report back).

@jschmid1
Contributor

We have to make sure that the salt command we use will work and give a proper status back on the update (done, successful, etc.). In the past I have seen that, when restarting the salt-minion or salt-master during the upgrade, the minion or master does NOT reply anymore. Maybe this problem is gone, and then I am fine with the above as "command finishes with complete and $?=0".

It can't. It's just technically not possible (at least as far as I know). The salt-master is connected to a bus from which it reads messages (what jobs got triggered, what the status of those jobs is, etc). If you restart this service, it doesn't automatically reconnect. The technicality behind it is probably connected to session keys expiring or similar.

@jschmid1
Contributor

So basically could we simplify the whole salt update by this process:
salt '<admin host minion>' ceph.updates.master
salt '*' saltutil.sync_all
salt '*' ceph.updates.salt
(in case we update deepsea as part of ceph.updates.master we have to sync the modules, too - just in case the ceph.updates.salt would rely on one of the modules we deliver)

ceph.updates.salt just relies on salt (built-in) modules. Although it obviously doesn't hurt to sync_all, it's generally not needed.

@jschmid1
Contributor

I am also not sure how this deals with someone working in openATTIC using the salt-api to run actions at the same point in time.. maybe we have to stop openATTIC before updating the salt-master, too?

Wouldn't that be up to the operator?

@Martin-Weiss

We have to make sure that the salt command we use will work and give a proper status back on the update (done, successful, etc.). In the past I have seen that, when restarting the salt-minion or salt-master during the upgrade, the minion or master does NOT reply anymore. Maybe this problem is gone, and then I am fine with the above as "command finishes with complete and $?=0".

It can't. It's just technically not possible (at least as far as I know). The salt-master is connected to a bus from which it reads messages (what jobs got triggered, what the status of those jobs is, etc). If you restart this service, it doesn't automatically reconnect. The technicality behind it is probably connected to session keys expiring or similar.

That means we have the same problem with the state execution for the salt-update as with running the orchestration of stage.0 for updating salt.

So basically "updating salt with using salt" is not possible and we need a second way to work around this or run the upgrade "async" and check the status later.

When ever we run an upgrade of anything - we must have a return for the admin "finished successful" or "error" - and just "hanging" is not acceptable.. this is how the whole discussion started.

@Martin-Weiss

So basically could we simplify the whole salt update by this process:
salt '<admin host minion>' ceph.updates.master
salt '*' saltutil.sync_all
salt '*' ceph.updates.salt
(in case we update deepsea as part of ceph.updates.master we have to sync the modules, too - just in case the ceph.updates.salt would rely on one of the modules we deliver)

ceph.updates.salt just relies on salt (built-in) modules. Although it obviously doesn't hurt to sync_all, it's generally not needed.

Today this is true - for the future - we never know ;-)

@Martin-Weiss

I am also not sure how this deals with someone working in openATTIC using the salt-api to run actions at the same point in time.. maybe we have to stop openATTIC before updating the salt-master, too?

Wouldn't that be up to the operator?

If there are multiple people - one managing stuff in openATTIC and another one applying updates - it might not be one single person. Think about an admin configuring a new iSCSI LUN via openATTIC while another one is applying a salt update. It is not obvious to the two that this will break the iSCSI administration, and while the iSCSI change is in progress, it would not be the best thing to update the salt-master and salt-minion at the same point in time.

@jschmid1
Contributor

If there are multiple people - one managing stuff in openATTIC and another one applying updates - it might not be one single person. Think about an admin configuring a new iSCSI LUN via openATTIC while another one is applying a salt update. It is not obvious to the two that this will break the iSCSI administration, and while the iSCSI change is in progress, it would not be the best thing to update the salt-master and salt-minion at the same point in time.

That's another question and a more fundamental issue. We tracked that somewhere in the issues: #219

@jschmid1
Contributor

So basically "updating salt with using salt" is not possible and we need a second way to work around this or run the upgrade "async" and check the status later.

When ever we run an upgrade of anything - we must have a return for the admin "finished successful" or "error" - and just "hanging" is not acceptable.. this is how the whole discussion started.

Right, but as stated, it is not clear to me how to do this.
We raise more awareness of the fact that 'no return' will happen. This is all we can offer right now (unless someone can provide a better solution).

@Martin-Weiss

Check this out: https://github.com/saltstack/salt/pull/39952/files (from saltstack/salt#39952). I did not try that myself and it looks complicated - basically it is a hack, because I believe a systems management toolkit fundamentally needs a second daemon that is used to manage the systems management agents during updates.

@jschmid1
Contributor

This is basically what we discussed earlier wrt. DISABLE_RESTART_ON_UPDATE = 1 in /etc/sysconfig/services, just adapted to the Debian world. At first glance this example does:

  1. Set the /usr/sbin/policy-rc.d (equivalent to our way, mentioned above)

  2. upgrade the salt minion

  3. enable it (not clear if enabling also restarts it, but I assume so, as it wouldn't make sense otherwise)

  4. remove the file from 1)

But since we just call a state that does one thing, namely updating the salt package, it's more or less the same. Am I missing something?

@Martin-Weiss

Basically I believe this is the trick:

"one way of handling this (on Linux and UNIX-based operating systems) is to use
at (a job scheduler which predates cron) to schedule a restart of the
service. at is not installed by default on most distros, and requires a
service to be running (usually called atd) in order to schedule jobs.

--> update the rpm with a job that runs AFTER the salt state is finished. This way the salt state can complete and the real action happens later.

We "just" need to run this state - then wait for some time (sleep 10 ;-)) and then check for the salt-minion version.

@jschmid1
Contributor

--> update the rpm with a job that runs AFTER the salt state is finished. This way the salt state can complete and the real action happens later.

And this is what we're doing now: run one single command, the salt update. It's the last one.

We "just" need to run this state - then wait for some time (sleep 10 ;-)) and then check for the salt-minion version.

The next problem with this is that we can't enforce ordering in orchestrations, and we structure our stages as orchestrations. So that's not an option.

I'll play around with the mentioned concept and come back to this later. However, this should not block #1431

@Martin-Weiss

One thing to keep in mind is that the default for the salt-master and salt-minion rpm update is that the rpm itself restarts the corresponding daemon, as long as DISABLE_RESTART_ON_UPDATE="no" is set in /etc/sysconfig/services.
So when you are working on this you also might have to change this setting to "yes", update the rpm, change it back, and then restart the salt services some time later (async), when no job, orchestration or salt-api call is active anymore on the master or minion side of things.
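
A sketch of that sequence (assuming the "yes"/"no" syntax for the sysconfig variable, and using salt-run jobs.active on the master to check for running jobs):

# 1) suppress the automatic restart triggered by the rpm scriptlets
sed -i 's/^DISABLE_RESTART_ON_UPDATE=.*/DISABLE_RESTART_ON_UPDATE="yes"/' /etc/sysconfig/services

# 2) update the packages; the daemons keep running on the old binaries
zypper --non-interactive up salt-master salt-minion

# 3) restore the default
sed -i 's/^DISABLE_RESTART_ON_UPDATE=.*/DISABLE_RESTART_ON_UPDATE="no"/' /etc/sysconfig/services

# 4) later, once salt-run jobs.active shows nothing relevant, restart the daemons
systemctl restart salt-master salt-minion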
