This repository has been archived by the owner on Jan 16, 2021. It is now read-only.

PreUpgradeSafetyCheck EnsureAvailabilitySafetyCheck on 1 node cluster #377

Closed
aL3891 opened this issue Jul 31, 2017 · 9 comments


aL3891 commented Jul 31, 2017

Hello,
I'm having some problems upgrading applications on a one-node cluster. Specifically, they get stuck on the PreUpgradeSafetyCheck EnsureAvailabilitySafetyCheck, which I suppose makes sense, since the service will have downtime.

However, I only get this for some applications; others work fine. I've also tried setting the DefaultServiceTypeHealthPolicy parameter to "100,100,100" in an effort to get the cluster to accept downtime, but that didn't work either.

So why does this work for some apps and not others? How can I disable that check? I'm already doing the upgrade in UnmonitoredAuto mode, and I have not found any other setting that seems to disable the checks.

All the apps I've tried also upgrade fine on a 5-node cluster, so I imagine there is some setting I'm missing.

Any tips are appreciated :)


oanapl commented Jul 31, 2017

The services are NOT supposed to have downtime, and this is what the safety checks ensure.
The safety check waits until there are enough replicas available so the partition doesn't get into quorum loss or availability loss. Some services hit this because they don't have enough replicas.
If you don't care about availability, you can change the UpgradeReplicaSetCheckTimeout upgrade parameter. Read more about the upgrade parameters in the Service Fabric application upgrade documentation.
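For example, something along these lines should let the upgrade take the replica down instead of waiting on the safety check (the application name and version here are only placeholders for your app):

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion 2.0.0 -UnmonitoredAuto -UpgradeReplicaSetCheckTimeoutSec 1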


aL3891 commented Jul 31, 2017

Right, but this is just for testing (our internal data migration logic, basically), so uptime is not important in this case. I just want to kick off an upgrade and see what happens to the data.

I have not actually looked at that parameter; I'll check it out.

masnider commented Jul 31, 2017

What are the differences between the projects that work and those that don't? Some things to check:

  • The exact commands used to kick off the upgrades
  • The types of services in the application
  • Whether you're deploying through VS or not (and what parameters it is using)

Yeah, as Oana suggested, check that parameter. You could also progressively try more monitored upgrades and see where they fail, but for your scenario I'd just start with Start-ServiceFabricApplicationUpgrade -UnmonitoredAuto -ForceRestart -ReplicaQuorumTimeoutSec 1


aL3891 commented Jul 31, 2017

The difference is what's puzzling: I have a test project where it works and a production project where it doesn't. One possible difference is that the test project doesn't have any actual data in it, just services of every type. Perhaps it ignores the quorum if there is no data in the service? Seems odd, though.

I currently kick off upgrades like this:

Publish-UpgradedServiceFabricApplication -ApplicationPackagePath $PkgPath -ApplicationParameterFilePath $ApplicationParameterFilePath

The apps (both working and non-working) contain stateful and stateless services as well as actors.

I've been doing PowerShell deploys, mostly with default values. I did try changing the stable-delay/health-check parameters, but that did not seem to do it.

I'm resetting my environment now and adding that parameter to our scripts, but it really sounds like that's the one, especially since it's set to infinite by default.
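In case it helps anyone else: assuming the SDK's Publish-UpgradedServiceFabricApplication script accepts an UpgradeParameters hashtable that it passes through to Start-ServiceFabricApplicationUpgrade (the copies I've seen do, but check yours), the change amounts to something like

Publish-UpgradedServiceFabricApplication -ApplicationPackagePath $PkgPath -ApplicationParameterFilePath $ApplicationParameterFilePath -UpgradeParameters @{ UnmonitoredAuto = $true; ForceRestart = $true; UpgradeReplicaSetCheckTimeoutSec = 1 }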


oanapl commented Jul 31, 2017

Are your test and production applications the same? Same configurations, same min and target replica counts / instance counts? Are the test and production clusters the same?

The health parameters affect the health checks performed after an upgrade domain is upgraded. The safety checks are performed before starting an upgrade domain; they are prerequisites for starting the upgrade. The health parameters do not impact the safety checks in any way.

Let us know if the setting fixes your issue. Make sure you don't set it in production without really understanding the possible effects (unavailability). And like Matt suggested, try the monitored upgrades, and make sure to test in your test environment as much as possible of what you will use in production.
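For reference, a monitored upgrade from PowerShell looks roughly like this; the application name, version, and health-check timings below are only placeholders to tune for your app:

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion 2.0.0 -Monitored -FailureAction Rollback -HealthCheckStableDurationSec 60 -UpgradeDomainTimeoutSec 1200 -UpgradeTimeoutSec 3000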


aL3891 commented Jul 31, 2017

No, they are different; the test application is just an empty app I created to try to replicate the issue. These tests were done on a local dev cluster (though that was the same across the tests).
In both cases the target and min replica/instance counts were the same (1).

We're only doing this in our CI test environment as well as locally on dev boxes; we're not doing it in production. We also have other, more complete multi-node environments that we test on before reaching production. If I can run monitored upgrades with only one node I'd prefer that, but my initial impression was that it was not possible, since there is only one node.

My initial testing does indicate that ReplicaQuorumTimeoutSec (or UpgradeReplicaSetCheckTimeoutSec, as it now seems to be called) was indeed the reason the upgrades were not starting. I'm modifying our real CI scripts now; I'll test with the full solution and then close the issue.


oanapl commented Jul 31, 2017

Sounds good, thanks for the update!


aL3891 commented Jul 31, 2017

I've now verified on our main app that UpgradeReplicaSetCheckTimeoutSec was indeed the key. Thanks again!

aL3891 closed this as completed Jul 31, 2017

aL3891 commented Jul 31, 2017

By the way, the monitored upgrade also worked; the problem was only the safety check preventing the upgrade from starting (correctly so, since there is only one node).
