PreUpgradeSafetyCheck EnsureAvailabilitySafetyCheck on 1 node cluster #377
Comments
The services are NOT supposed to have downtime, and this is what the safety checks ensure.
Right, but this is just for testing (our internal data migration logic, basically), so uptime is not important in this case. I just want to kick off an upgrade and see what happens to the data. I have not looked at that parameter actually, I'll check it out.
What are the differences between the projects that work and those that don't? Some things to check:
Yeah, as Oana suggested, check that parameter. You could also start progressively trying more monitored upgrades and see where it fails, but for your scenario I'd just start with Start-ServiceFabricApplicationUpgrade -UnmonitoredAuto -ForceRestart -ReplicaQuorumTimeoutSec 1
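For anyone landing here later, a fuller sketch of that invocation against a local dev cluster might look like the following. The application name and target version are hypothetical placeholders, and it assumes the Service Fabric SDK PowerShell module is installed and the new application type version is already registered. Treat it as a starting point for test environments only, not a recommended production setting.

```powershell
# Sketch only: 'fabric:/MyApp' and '2.0.0' are hypothetical placeholders.
# With no arguments, Connect-ServiceFabricCluster targets the local cluster.
Connect-ServiceFabricCluster

$upgradeArgs = @{
    ApplicationName         = 'fabric:/MyApp'  # hypothetical application name
    ApplicationTypeVersion  = '2.0.0'          # hypothetical target version
    UnmonitoredAuto         = $true            # no health monitoring between upgrade domains
    ForceRestart            = $true            # restart the service host even for config-only changes
    ReplicaQuorumTimeoutSec = 1                # wait at most 1 second on the quorum check, then proceed
}
Start-ServiceFabricApplicationUpgrade @upgradeArgs

# Check progress, including any pending safety checks.
Get-ServiceFabricApplicationUpgrade -ApplicationName 'fabric:/MyApp'
```

Setting the timeout this low accepts unavailability during the upgrade, which is the point on a one-node test cluster.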
The difference is what's puzzling. I have a test project where it works and a production project where it doesn't. One possible difference is that the test project doesn't have any actual data in it, just services of every type. Perhaps it ignores the quorum if there is no data in the service? Seems odd though. I kick off upgrades like this currently
The app (both working and non-working) contains stateful and stateless services as well as actors. I've been doing PowerShell deploys, mostly with default values. I did try changing the stable delay/health check parameters, but that did not seem to do it. I'm resetting my environment now and adding that parameter to our scripts, but it really sounds like that's the one, especially since it's set to infinite by default.
Are your test and production applications the same? Same configurations, same min and target replica counts / instance counts? Are the test and production clusters the same? The health parameters affect the health checks performed after an upgrade domain is upgraded. The safety checks are performed before starting an upgrade domain: they are prerequisites for starting the upgrade. The health parameters do not impact the safety checks in any way. Let us know if the setting fixes your issue. Make sure you don't set it in production without really understanding the possible effects (unavailability). And like Matt suggested, try the monitored upgrades and make sure to test as much as possible in a test environment what you will use in production.
No, they are different. The test application is just an empty app I created to try to replicate the issue. These tests were done on a local dev cluster (though that was the same across the tests). We're only doing this in our CI test environment as well as locally on dev boxes, we're not doing it in production. We also have other more complete multi-node environments that we test on before reaching production. If I am able to run monitored upgrades with only one node I would prefer that, but my initial impression was that it was not possible, since there was only one node. My initial testing does indicate that
Sounds good, thanks for the update!
I've now verified on our main app and |
Monitored upgrade did work also btw, the problem was only the health check preventing the upgrade from starting (correctly so, since there is only one node).
Hello,
I'm having some problems upgrading applications on a one-node cluster. Specifically, they get stuck on PreUpgradeSafetyCheck EnsureAvailabilitySafetyCheck, which I suppose makes sense, since the service will have downtime.
However, I only get this for some applications; others work fine. I've also tried setting the DefaultServiceTypeHealthPolicy parameter to "100,100,100" in an effort to get the cluster to accept downtime, but that didn't work either.
So why does this work for some apps and not others? How can I disable that check? I'm already doing the upgrade in unmonitored auto and I have not found any other setting that seems to disable the checks.
All the apps I've tried have also upgraded fine on a 5-node cluster, so I imagine there is some setting I'm missing.
Any tips are appreciated :)