
Salt updates #1126

Closed
swiftgist opened this issue May 7, 2018 · 31 comments · Fixed by #1431

@swiftgist
Contributor

Description of Issue/Question

DeepSea cannot gracefully upgrade Salt. This is simply a function of how orchestrations work. When the Salt master restarts, the orchestration is terminated. Shoehorning this into the reactor would likely cause more issues than it solves.

Salt updates are not terribly frequent and DeepSea is completely dependent on a working Salt cluster. I believe Stage 0 needs to fail with instructions on installing and upgrading Salt. Some users may be using Salt for other purposes and bringing these changes to their attention should prevent any confusion. Also, updating and not restarting Salt daemons will likely end in heartache and confusion when a minion is upgraded prior to the master restarting.
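
A minimal sketch of what such a Stage 0 guard could look like as a Salt state (file path, state IDs and message text are illustrative assumptions, not DeepSea's actual implementation):

# /srv/salt/ceph/updates/check.sls -- hypothetical path
{% if salt['pkg.upgrade_available']('salt-master') or salt['pkg.upgrade_available']('salt-minion') %}
abort on pending salt update:
  test.fail_without_changes:
    - name: >
        A Salt update is pending. Update Salt and restart the daemons
        manually before re-running Stage 0.
{% else %}
salt packages are current:
  test.nop: []
{% endif %}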

@jschmid1 jschmid1 self-assigned this May 7, 2018
@jschmid1
Contributor

jschmid1 commented May 7, 2018

working on that

@jschmid1
Contributor

jschmid1 commented May 7, 2018

I'm also adding a warning for stage0/3 when updates are pending that would cause the daemons to restart.

@jschmid1
Contributor

jschmid1 commented May 7, 2018

wip-branch

@Martin-Weiss

Could we add logging and tracking of states in the orchestration so that we can "continue" at the point that failed or where the orchestration was stopped?

The problem is not only updating a salt-master (which IMO should not restart the salt-master automatically); the problem is also a loss of the SSH session, or a reboot that might be initiated during a long-running orchestration job, e.g. while doing a migration of filestore to bluestore.


@Martin-Weiss

Either we suppress the salt-master/salt-minion restart during the upgrade until we can reboot the servers (similar to the kernel update process), or we might need to find a way to detect the hanging/broken orchestration and exit on that.

I also like the idea of detecting an update for the salt-master/salt-minion, but in case we do not update these services as part of the orchestration, updating all of them manually is another big "manual" task.

It would be great to have a "salt-master/salt-minion" self-update as part of the orchestration, but I am not sure whether that is possible.

@Martin-Weiss

Maybe setting DISABLE_RESTART_ON_UPDATE = 1 in /etc/sysconfig/services might be a solution.

Set this, run zypper up salt-master/salt-minion, restart the services manually "async" right after exiting the orchestration with the message to the admin "please restart stage 0".
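
For reference, a sketch of the relevant sysconfig knob on SUSE (the "yes"/"no" value syntax is an assumption here, based on the later comment that the default ships as "no"):

# /etc/sysconfig/services -- suppress automatic daemon restarts on package update
DISABLE_RESTART_ON_UPDATE="yes"

With that set, zypper up salt-master salt-minion should leave the running daemons untouched, and the restart can be triggered explicitly afterwards.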

@jschmid1
Contributor

jschmid1 commented Oct 8, 2018

Either we suppress the salt-master/salt-minion restart during the upgrade until we can reboot the servers (similar to the kernel update process)

probably possible with the DISABLE_RESTART_ON_UPDATE = 1 directive.

or we might need to find a way to detect the hanging / broken orchestration and exit on that.

Not possible, because you lose connection to the bus.

I also like the idea of detecting an update for the salt-master/salt-minion but in case we do not update these services as part of the orchestration -

That's what the posted patch does. It's not ready though.

updating all of them manually is an other big "manual" task.

What do you mean by that?

Maybe setting DISABLE_RESTART_ON_UPDATE = 1 in /etc/sysconfig/services might be a solution.
Set this, run zypper up salt-master/salt-minion, restart the services manually "async" right after exiting the orchestration with the message to the admin "please restart stage 0".

We also gave that some thought. It's definitely doable with the tools we have.

@Martin-Weiss

updating all of them manually is another big "manual" task.
What do you mean by that?

Even if the orchestration cannot update salt-master/minions automatically and orchestrated, we need to find a way to do that with a single command, in the right way, for all the masters and minions.

Not sure if that helps - but would https://docs.saltstack.com/en/2017.7/topics/orchestrate/orchestrate_runner.html#masterless-orchestration help to update the master and something like https://www.shellhacks.com/upgrade-salt-minions/ for the minions?

Maybe we can just update the salt stack (master) always before running stage.0 tasks?
(in case the master is on the latest version in the attached channel - nothing would happen)

@jschmid1
Contributor

jschmid1 commented Oct 9, 2018

In our meeting we agreed on the least aggressive solution, which would be notifying the user that DeepSea detected a salt version which needs to be updated before stage.0 should be run.

For the CI this needs to be handled differently.

  1. Just fail when there is a salt update and ask the maintainer to update the base-image

  2. ??

@Martin-Weiss

In case we follow this road, we have to make sure to tell the admin how to update salt-master and all salt-minions in a proper way. In a large cluster we also have to provide a tool that does this, as manually connecting to all cluster nodes and updating the salt-master and salt-minion packages without using Salt might not be acceptable for DeepSea users. Maybe this whole topic needs to be escalated to the Salt maintainers, as this "salt self-update problem" seems to exist for all Salt deployments.

@jschmid1
Contributor

in a proper way. In a large cluster we also have to provide a tool that does this, as manually connecting to all cluster nodes and updating the salt-master and salt-minion packages without using Salt might not be acceptable for DeepSea users

I don't think that we have to go that route. It's sufficient to kick off a master & minion update via a separate (salt) state/orchestration. You might not get a result back, but you will end up with the new versions. From there on you can just re-run the stage.

@Martin-Weiss

Martin-Weiss commented Oct 17, 2018 via email

@jschmid1
Contributor

The current advice is the following:

You have a salt update pending. In order to provide a smooth experience, please update these packages manually before proceeding.
  salt -I 'roles:master' state.apply ceph.updates.master
and
  salt -G 'deepsea:*' state.apply ceph.updates.salt
in case you have deepsea_minions defined; otherwise run:
  salt -I 'cluster:ceph' state.apply ceph.updates.salt

Then we will be at the same place where we are already! When we update the salt-master via orchestration, the orchestration process hangs "forever". Or how do you want to "kick off" the update and show progress and result to the admin?

This would be a state which is expected to hang. That's something we can't change and don't need to change, imo. It's just a single command that is being executed, and it can be verified by running a zypper info right after.
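
A sketch of how that could be kicked off without blocking the shell, using the commands from the message above (the --async flag and the follow-up version check are suggestions, not part of DeepSea):

# fire-and-forget, since the return may never arrive once the daemons restart
salt --async -I 'roles:master' state.apply ceph.updates.master
salt --async -G 'deepsea:*' state.apply ceph.updates.salt

# verify afterwards, once the daemons are back
salt -G 'deepsea:*' pkg.version salt-minion
zypper info salt-master salt-minion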

@Martin-Weiss

Just trying to understand this ;-).
First of all we do NOT use orchestration for updating salt and we will stop any orchestration (stage.0 up to stage.5) in case there are any updates available for salt-master or salt-minion on any server?
Do we also verify during orchestration that salt-master and all salt-minions are on a compatible version level?
Second - we use "salt states" and "apply them manually" where we update the salt-master first, and then all minions in parallel.

To rephrase a bit and to ensure the actions are understood - please correct me if this is not correct (and maybe this can be used as the text for exiting the orchestration in case salt updates are detected):

Variant 1) In case you do have deepsea_minions defined using minion targeting i.e. via deepsea_minions: 'ses-5-admin* or ses-5-[1,2,3,4]*' in /srv/pillar/ceph/deepsea_minions.sls

execute salt -I 'roles:master' state.apply ceph.updates.master
execute salt -G 'deepsea:*' state.apply ceph.updates.salt

Variant 2) In case you have deepsea_minions defined using the deepsea grain via deepsea_minions: 'G@deepsea:*' in /srv/pillar/ceph/deepsea_minions.sls

execute salt -I 'cluster:ceph' state.apply ceph.updates.salt

Basically I see some challenges - not sure if they really exist ;-):
a) what happens if one or the other minion or the master already have different versions? Do we detect this properly?
b) does the salt-master state complete or hang in case the salt-master gets restarted (same is true for the minion update)?
c) why do we have a difference based on deepsea targeting at all? Why not just do a
salt '<admin host minion>' ceph.updates.master and
salt '*' ceph.updates.salt
-> I assume the ceph.updates.master JUST updates the salt-master package and ceph.updates.salt updates the salt-minions.
d) in variant 2) we update the master and the minions all at the same point in time instead of "master first" - not sure if that is the right way.
e) how do we do the sync of modules after these updates? Again by stage.0?
f) do the states verify that there are currently no jobs running on the master or minion before the update is applied to ensure we do not break anything?

@jschmid1
Contributor

First of all we do NOT use orchestration for updating salt and we will stop any orchestration (stage.0 up to stage.5) in case there are any updates available for salt-master or salt-minion on any server?

stage.0 will be stopped as this is the only stage that does a zypper up. We do not use an orchestration but a state for updating salt packages.

Do we also verify during orchestration that salt-master and all salt-minions are on a compatible version level?

No, if the users didn't mess with their cluster, the updates are applied the same way. We do check if a salt version is incompatible with the ceph release you are running. Secondly, it's not critical to run different minor versions of salt as long as this is fixed early (it will be, by applying the states).

Second - we use "salt states" and "apply them manually" where we update the salt-master first, and then all minions in parallel.

Manual, in the sense that we call the states ourselves as opposed to the orchestrations. Yes, master first, then the minions in parallel.

Variant 1) In case you do have deepsea_minions defined using minion targeting i.e. via deepsea_minions: 'ses-5-admin* or ses-5-[1,2,3,4]*' in /srv/pillar/ceph/deepsea_minions.sls

execute salt -I 'roles:master' state.apply ceph.updates.master
execute salt -G 'deepsea:*' state.apply ceph.updates.salt

Variant 2) In case you have deepsea_minions defined using the deepsea grain via deepsea_minions: 'G@deepsea:*' in /srv/pillar/ceph/deepsea_minions.sls

execute salt -I 'cluster:ceph' state.apply ceph.updates.salt

This

execute salt -I 'roles:master' state.apply ceph.updates.master

is independent of the 'variant'. It's only the targeting for the minions that changes.

a) what happens if one or the other minion or the master already have different versions? Do we detect this properly?

If this were a problem, you couldn't even call the state. If it is not, you'd apply the states and end up with the same patch level across all servers.

b) does the salt-master state complete or hang in case the salt-master gets restarted (same is true for the minion update)?

Yes.

c) why do we have a difference based on deepsea targeting at all? Why not just do a
salt '<admin host minion>' ceph.updates.master and
salt '*' ceph.updates.salt
-> I assume the ceph.updates.master JUST updates the salt-master package and ceph.updates.salt updates the salt-minions.

No.
ceph.updates.master updates the master, minion and the deepsea package whilst the ceph.updates.salt state only updates the minion package.

d) in variant 2) we update the master and the minions all at the same point in time instead of "master first" - not sure if that is the right way.

No, see comment for variants. The message doesn't seem clear enough. How would you rephrase it to make it more clear?

e) how do we do the sync of modules after these updates? Again by stage.0?

Yes

f) do the states verify that there are currently no jobs running on the master or minion before the update is applied to ensure we do not break anything?

At least we check that there are no jobs running that would interfere - in this case, no 'pkg.installed'.
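
To make the split between the two states concrete, a hedged sketch of what they could look like (these are not DeepSea's actual ceph/updates SLS files, only an illustration of "master, minion and deepsea on the master node" versus "minion package only" elsewhere):

# ceph/updates/master.sls (illustrative)
update master packages:
  pkg.latest:
    - pkgs:
      - salt-master
      - salt-minion
      - deepsea
    - refresh: True

# ceph/updates/salt.sls (illustrative)
update salt-minion:
  pkg.latest:
    - name: salt-minion
    - refresh: True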

@Martin-Weiss

This

 execute salt -I 'roles:master' state.apply ceph.updates.master

is independent of the 'variant'. It's only the targeting for the minions that changes.

Not sure if I understand this reply to my question correctly - why are you using ceph.updates.master sometimes and ceph.updates.salt at other times? ...OK, somewhat further down the thread it becomes clear with the following:

ceph.updates.master updates the master, minion and the deepsea package whilst the ceph.updates.salt state only updates the minion package.

Now that the states are clear to me this is the next open point:

b) does the salt-master state complete or hang in case the salt-master gets restarted (same is true for the minion update)?

Yes.

"Yes" on an "either or" question does not seem to be the right answer ;-)

We have to make sure that the salt command we use will work and give a proper status back on the update (done, successful, etc.). In the past I have seen that, when restarting the salt-minion or salt-master during the upgrade, the minion or master does NOT reply anymore. Maybe this problem is gone, and then I am fine with the above as "command finishes with complete and $?=0".

So basically could we simplify the whole salt update by this process:

salt '<admin host minion>' ceph.updates.master
salt '*' saltutil.sync_all
salt '*' ceph.updates.salt

(in case we update deepsea as part of ceph.updates.master we have to sync the modules, too - just in case the ceph.updates.salt would rely on one of the modules we deliver)

I am also not sure how this deals with someone working in openATTIC using the salt-api to run actions at the same point in time.. maybe we have to stop openATTIC before updating the salt-master, too?

@jschmid1
Contributor

b) does the salt-master state complete or hang in case the salt-master gets restarted (same is true for the minion update)?

Yes, the state hangs for a salt-master update, and yes, it also hangs on a minion update (I'm not 100% sure about that anymore; in the meantime they might reconnect to the master and report back).

@jschmid1
Contributor

We have to make sure that the salt command we use will work and give a proper status back on the update (done, successful, etc.). In the past I have seen that, when restarting the salt-minion or salt-master during the upgrade, the minion or master does NOT reply anymore. Maybe this problem is gone, and then I am fine with the above as "command finishes with complete and $?=0".

It can't. It's just technically not possible (at least as far as I know). The salt-master is connected to a bus from which it reads messages (what jobs got triggered, what the status of those jobs is, etc). If you restart this service, it doesn't automatically reconnect. The technicality behind it is probably connected to session keys expiring or similar.

@jschmid1
Contributor

So basically could we simplify the whole salt update by this process:
salt '<admin host minion>' ceph.updates.master
salt '*' saltutil.sync_all
salt '*' ceph.updates.salt
(in case we update deepsea as part of ceph.updates.master we have to sync the modules, too - just in case the ceph.updates.salt would rely on one of the modules we deliver)

ceph.updates.salt just relies on salt (built-in) modules. Although it obviously doesn't hurt to sync_all, it's generally not needed.

@jschmid1
Contributor

I am also not sure how this deals with someone working in openATTIC using the salt-api to run actions at the same point in time.. maybe we have to stop openATTIC before updating the salt-master, too?

Wouldn't that be up to the operator?

@Martin-Weiss

We have to make sure that the salt command we use will work and give a proper status back on the update (done, successful, etc.). In the past I have seen that, when restarting the salt-minion or salt-master during the upgrade, the minion or master does NOT reply anymore. Maybe this problem is gone, and then I am fine with the above as "command finishes with complete and $?=0".

It can't. It's just technically not possible (at least as far as I know). The salt-master is connected to a bus from which it reads messages (what jobs got triggered, what the status of those jobs is, etc). If you restart this service, it doesn't automatically reconnect. The technicality behind it is probably connected to session keys expiring or similar.

That means we have the same problem with the state execution for the salt-update as with running the orchestration of stage.0 for updating salt.

So basically "updating salt with using salt" is not possible and we need a second way to work around this or run the upgrade "async" and check the status later.

When ever we run an upgrade of anything - we must have a return for the admin "finished successful" or "error" - and just "hanging" is not acceptable.. this is how the whole discussion started.

@Martin-Weiss

So basically could we simplify the whole salt update by this process:
salt '<admin host minion>' ceph.updates.master
salt '*' saltutil.sync_all
salt '*' ceph.updates.salt
(in case we update deepsea as part of ceph.updates.master we have to sync the modules, too - just in case the ceph.updates.salt would rely on one of the modules we deliver)

ceph.updates.salt just relies on salt (built-in) modules. Although it obviously doesn't hurt to sync_all, it's generally not needed.

Today this is true - for the future - we never know ;-)

@Martin-Weiss

I am also not sure how this deals with someone working in openATTIC using the salt-api to run actions at the same point in time.. maybe we have to stop openATTIC before updating the salt-master, too?

Wouldn't that be up to the operator?

If there are multiple people - one managing stuff in openATTIC and another one applying updates - it might not be one single person. Think about an admin configuring a new iSCSI LUN via openATTIC while another one is applying a salt update. It is not obvious to the two that this will break the iSCSI administration, and while the iSCSI change is in progress, it would not be the best thing to update the salt-master and salt-minion at the same point in time.

@jschmid1
Contributor

If there are multiple people - one managing stuff in openATTIC and another one applying updates - it might not be one single person. Think about an admin configuring a new iSCSI LUN via openATTIC while another one is applying a salt update. It is not obvious to the two that this will break the iSCSI administration, and while the iSCSI change is in progress, it would not be the best thing to update the salt-master and salt-minion at the same point in time.

That's another question and a more fundamental issue. We tracked that somewhere in the issues: #219

@jschmid1
Contributor

So basically "updating salt with using salt" is not possible and we need a second way to work around this or run the upgrade "async" and check the status later.

When ever we run an upgrade of anything - we must have a return for the admin "finished successful" or "error" - and just "hanging" is not acceptable.. this is how the whole discussion started.

Right, but as stated, it is not clear to me how to do this.
We raise more awareness of the fact that 'no return' will happen. This is all we can offer right now (unless someone can provide a better solution).

@Martin-Weiss

Check this out: https://github.com/saltstack/salt/pull/39952/files (from saltstack/salt#39952). I did not try that myself and it looks complicated - basically it is a hack, because I believe a systems management toolkit fundamentally needs a second daemon that is used to manage the systems management agents during updates.

@jschmid1
Contributor

This is basically what we discussed earlier wrt. DISABLE_RESTART_ON_UPDATE = 1 in /etc/sysconfig/services, just adapted to the Debian world. At first glance this example does:

  1. Set the /usr/sbin/policy-rc.d (equivalent to our way, mentioned above)

  2. upgrade the salt minion

  3. enable it (not clear if enabling also restarts it, but I assume so, as it wouldn't make sense otherwise)

  4. remove the file from 1)

But since we just call a state that does one thing, namely updating the salt package, it's more or less the same. Am I missing something?

@Martin-Weiss

Basically I believe this is the trick:

"one way of handling this (on Linux and UNIX-based operating systems) is to use
at (a job scheduler which predates cron) to schedule a restart of the
service. at is not installed by default on most distros, and requires a
service to be running (usually called atd) in order to schedule jobs.

--> update the rpm with a job that runs AFTER the salt state is finished. This way the salt state can complete and the real action happens later.

We "just" need to run this state - then wait for some time (sleep 10 ;-)) and then check for the salt-minion version.

@jschmid1
Contributor

--> update the rpm with a job that runs AFTER the salt state is finished. This way the salt state can complete and the real action happens later.

And this is what we're doing now: run one single command, the salt update. It's the last one.

We "just" need to run this state - then wait for some time (sleep 10 ;-)) and then check for the salt-minion version.

The next problem with this is that we can't enforce ordering in orchestrations, and we structure our stages as orchestrations. So that's not an option.

I'll play around with the mentioned concept and come back to this later. However, this should not block #1431

@Martin-Weiss

One thing to keep in mind is that the default for the salt-master and salt-minion rpm update is that the rpm itself restarts the corresponding daemon, as long as DISABLE_RESTART_ON_UPDATE="no" is set in /etc/sysconfig/services.
So when you are working on this you also might have to change this setting to "yes", update the rpm, change it back, and then restart the salt services some time later (async), when no job, orchestration or salt-api call is active anymore on the master or minion side of things.
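
A sketch of that sequence (assuming the "yes"/"no" syntax for the sysconfig variable, and using salt-run jobs.active on the master to check for running jobs):

# 1) suppress the automatic restart triggered by the rpm scriptlets
sed -i 's/^DISABLE_RESTART_ON_UPDATE=.*/DISABLE_RESTART_ON_UPDATE="yes"/' /etc/sysconfig/services

# 2) update the packages; the daemons keep running on the old binaries
zypper --non-interactive up salt-master salt-minion

# 3) restore the default
sed -i 's/^DISABLE_RESTART_ON_UPDATE=.*/DISABLE_RESTART_ON_UPDATE="no"/' /etc/sysconfig/services

# 4) later, once salt-run jobs.active shows nothing relevant, restart the daemons
systemctl restart salt-master salt-minion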
