This repository has been archived by the owner on Jan 16, 2021. It is now read-only.

PreUpgradeSafetyCheck EnsureAvailabilitySafetyCheck on 1 node cluster #377

Closed
aL3891 opened this issue Jul 31, 2017 · 9 comments


aL3891 commented Jul 31, 2017

Hello,
I'm having some problems upgrading applications on a one-node cluster. Specifically, they get stuck on the PreUpgradeSafetyCheck EnsureAvailabilitySafetyCheck, which I suppose makes sense, since the service will have downtime.

However, I only get this for some applications; others work fine. I've also tried setting the DefaultServiceTypeHealthPolicy parameter to "100,100,100" in an effort to get the cluster to accept downtime, but that didn't work either.

So why does this work for some apps and not others? How can I disable that check? I'm already doing the upgrade in UnmonitoredAuto mode, and I have not found any other setting that seems to disable the checks.

All the apps I've tried also upgrade fine on a 5-node cluster, so I imagine there is some setting I'm missing.

Any tips are appreciated :)


oanapl commented Jul 31, 2017

The services are NOT supposed to have downtime, and this is what the safety checks ensure.
The safety check waits until there are enough replicas available so the partition doesn't get into quorum loss or availability loss. Some services hit this because they don't have enough replicas.
If you don't care about availability, you can change the UpgradeReplicaSetCheckTimeout upgrade parameter. Read more about the upgrade parameters in the Service Fabric application upgrade documentation.
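For example, something along these lines should let the upgrade take the replica down instead of waiting on the safety check (the application name and version here are only placeholders for your app):

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion 2.0.0 -UnmonitoredAuto -UpgradeReplicaSetCheckTimeoutSec 1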


aL3891 commented Jul 31, 2017

Right, but this is just for testing (our internal data migration logic, basically), so uptime is not important in this case. I just want to kick off an upgrade and see what happens to the data.

I have not actually looked at that parameter; I'll check it out.

masnider commented Jul 31, 2017

What are the differences between the projects that work and those that don't? Some things to check:

  • The exact commands used to kick off the upgrades
  • The types of services in the application
  • Whether you're deploying through VS or not (and what parameters it is using)

Yeah, as Oana suggested, check that parameter. You could also progressively try more monitored upgrades and see where they fail, but for your scenario I'd just start with Start-ServiceFabricApplicationUpgrade -UnmonitoredAuto -ForceRestart -ReplicaQuorumTimeoutSec 1


aL3891 commented Jul 31, 2017

The difference is what's puzzling: I have a test project where it works and a production project where it doesn't. One possible difference is that the test project doesn't have any actual data in it, just services of every type. Perhaps it ignores the quorum if there is no data in the service? Seems odd, though.

I currently kick off upgrades like this:

Publish-UpgradedServiceFabricApplication -ApplicationPackagePath $PkgPath -ApplicationParameterFilePath $ApplicationParameterFilePath

The apps (both working and non-working) contain stateful and stateless services as well as actors.

I've been doing PowerShell deploys, mostly with default values. I did try changing the stable-delay/health-check parameters, but that did not seem to do it.

I'm resetting my environment now and adding that parameter to our scripts, but it really sounds like that's the one, especially since it's set to infinite by default.
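In case it helps anyone else: assuming the SDK's Publish-UpgradedServiceFabricApplication script accepts an UpgradeParameters hashtable that it passes through to Start-ServiceFabricApplicationUpgrade (the copies I've seen do, but check yours), the change amounts to something like

Publish-UpgradedServiceFabricApplication -ApplicationPackagePath $PkgPath -ApplicationParameterFilePath $ApplicationParameterFilePath -UpgradeParameters @{ UnmonitoredAuto = $true; ForceRestart = $true; UpgradeReplicaSetCheckTimeoutSec = 1 }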


oanapl commented Jul 31, 2017

Are your test and production applications the same? Same configurations, same min and target replica counts / instance counts? Are the test and production clusters the same?

The health parameters affect the health checks performed after an upgrade domain is upgraded. The safety checks are performed before starting an upgrade domain; they are prerequisites for starting the upgrade. The health parameters do not impact the safety checks in any way.

Let us know if the setting fixes your issue. Make sure you don't set it in production without really understanding the possible effects (unavailability). And like Matt suggested, try the monitored upgrades, and make sure to test in your test environment as much as possible of what you will use in production.
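For reference, a monitored upgrade from PowerShell looks roughly like this; the application name, version, and health-check timings below are only placeholders to tune for your app:

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion 2.0.0 -Monitored -FailureAction Rollback -HealthCheckStableDurationSec 60 -UpgradeDomainTimeoutSec 1200 -UpgradeTimeoutSec 3000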


aL3891 commented Jul 31, 2017

No, they are different; the test application is just an empty app I created to try to replicate the issue. These tests were done on a local dev cluster (though that was the same across the tests).
In both cases the target and min replica/instance counts were the same (1).

We're only doing this in our CI test environment as well as locally on dev boxes; we're not doing it in production. We also have other, more complete multi-node environments that we test on before reaching production. If I can run monitored upgrades with only one node I'd prefer that, but my initial impression was that it was not possible, since there is only one node.

My initial testing does indicate that ReplicaQuorumTimeoutSec (or UpgradeReplicaSetCheckTimeoutSec, as it now seems to be called) was indeed the reason the upgrades were not starting. I'm modifying our real CI scripts now; I'll test with the full solution and then close the issue.


oanapl commented Jul 31, 2017

Sounds good, thanks for the update!


aL3891 commented Jul 31, 2017

I've now verified on our main app that UpgradeReplicaSetCheckTimeoutSec was indeed the key. Thanks again!

aL3891 closed this as completed Jul 31, 2017

aL3891 commented Jul 31, 2017

By the way, the monitored upgrade also worked; the problem was only the safety check preventing the upgrade from starting (correctly so, since there is only one node).
