This repository has been archived by the owner on Jan 16, 2021. It is now read-only.

Application stuck in rolling upgrade state #1279

Closed
meanin opened this issue Sep 14, 2018 · 3 comments

meanin commented Sep 14, 2018

Hi all,

After the deployment of a new application version failed, the application got stuck in a rolling upgrade state. To be able to keep working on it, I decided to remove it from the cluster through Service Fabric Explorer, but Service Fabric is unable to delete the application for some reason.

30 minutes later, NamingService shows unhealthy evaluations:
[image: screenshot of the NamingService unhealthy evaluations]
It seems that AODeleteService is not able to start.

I also tried to investigate this from PowerShell.
Application upgrade state:

StartTimestampUtc             : 13.09.2018 15:42:26
UpgradeState                  : Failed
UpgradeDuration               : 08:07:28
CurrentUpgradeDomainDuration  : 08:04:27
CurrentUpgradeDomainProgress  : 2
                                
                                NodeName            : xxxxx
                                UpgradePhase        : PreUpgradeSafetyCheck
                                PendingSafetyChecks :
                                	EnsureAvailability - PartitionId: a57e06eb-7e7a-4578-93bc-4315bedd1e50
NextUpgradeDomain             : 2
UpgradeDomainsStatus          : { "0" = "Completed";
                                "1" = "Completed";
                                "2" = "Pending" }
UpgradeKind                   : Rolling
RollingUpgradeMode            : UnmonitoredAuto
ForceRestart                  : False
UpgradeReplicaSetCheckTimeout : 49710.06:28:15

Update-ServiceFabricApplicationUpgrade times out as well.

Basically, I am not able to remove the broken application without a node restart.

@mikkelhegn

PreUpgradeSafetyCheck - https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade-troubleshooting

An UpgradePhase of PreUpgradeSafetyCheck means there were issues preparing the upgrade domain before it was performed. The most common issues in this case are service errors in the close or demotion from primary code paths.

EnsureAvailability: https://docs.microsoft.com/en-us/rest/api/servicefabric/sfclient-model-ensureavailabilitysafetycheck

Safety check that waits to ensure the availability of the partition. It waits until there are replicas available such that bringing down this replica will not cause availability loss for the partition.

More on that safety check kind: https://docs.microsoft.com/en-us/previous-versions/azure/reference/mt280061(v=azure.100)

EnsureAvailability Indicates that there is either a stateless service partition on the node having exactly one instance, or there is a primary replica on the node for which the partition is quorum loss. In both cases, bringing down the replica will result in loss of availability.
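
To make the two quoted conditions easier to reason about, here is a toy model in Python. It only restates the documentation wording above as a predicate; it is not how Service Fabric implements the check, and the `ReplicaOnNode` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReplicaOnNode:
    """Hypothetical description of one replica/instance hosted on the node being upgraded."""
    stateful: bool
    instance_count: int = 1                 # only meaningful for stateless partitions
    is_primary: bool = False                # only meaningful for stateful partitions
    partition_in_quorum_loss: bool = False  # only meaningful for stateful partitions


def ensure_availability_pending(replica: ReplicaOnNode) -> bool:
    """Return True when, per the quoted wording, taking the replica down loses availability."""
    if not replica.stateful:
        # Stateless partition with exactly one instance: nothing else can serve requests.
        return replica.instance_count == 1
    # Stateful partition: a primary on this node while the partition is in quorum loss.
    return replica.is_primary and replica.partition_in_quorum_loss


single_instance = ReplicaOnNode(stateful=False, instance_count=1)
scaled_out = ReplicaOnNode(stateful=False, instance_count=3)
```

Under this model `single_instance` keeps the safety check pending, while `scaled_out` does not.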

I'm not aware of a way to get past this, other than manually bringing down the process that holds that replica. You could try this API if you are OK with losing availability: https://docs.microsoft.com/en-us/rest/api/servicefabric/sfclient-api-restartdeployedcodepackage (in other words, kill that process).

This aborts the code package process, which will restart all the user service replicas hosted in that process.
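
A rough sketch of what a call to that REST endpoint could look like from Python, using only the standard library. The request is built but not sent. The URL shape and body fields follow my reading of the linked API page and should be checked against it; the gateway address, node name, application id, and package names are placeholders, and passing "0" as the instance id to match whatever instance is currently running is an assumption taken from that same page.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_restart_request(cluster, node_name, application_id,
                          service_manifest_name, code_package_name,
                          code_package_instance_id="0", api_version="6.0"):
    """Build (but do not send) a POST that restarts a deployed code package."""
    url = (
        f"{cluster}/Nodes/{quote(node_name, safe='')}"
        f"/$/GetApplications/{quote(application_id, safe='~')}"
        f"/$/GetCodePackages/$/Restart?api-version={api_version}"
    )
    body = {
        "ServiceManifestName": service_manifest_name,
        "CodePackageName": code_package_name,
        # Assumption: "0" targets the currently running instance of the code package.
        "CodePackageInstanceId": code_package_instance_id,
    }
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for an unsecured local development cluster.
request = build_restart_request(
    "http://localhost:19080", "_Node_0", "MyApp", "MyServicePkg", "Code"
)
```

Sending it would be `urllib.request.urlopen(request)`; a secured cluster additionally needs a client certificate, which this sketch leaves out.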

Let me know if this helps.

oanapl commented Sep 15, 2018

@meanin, regarding the unhealthy evaluations on the naming service: they say that the service deletion is taking more than 30 minutes. There is nothing wrong with the naming service; it's just the entity this issue surfaces on.

So you started an application upgrade, it got stuck on safety checks, and while the upgrade was pending you issued a delete of the app, which is now stuck as well (and needs a node reboot). @motanv, can you look more into this?

FYI, deleting the app during an upgrade is probably not the best mitigation. You can try a rollback, or you can change UpgradeReplicaSetCheckTimeout. Since you deleted the app, I understand that you are OK with losing state.
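
For context on why that timeout matters, the UpgradeReplicaSetCheckTimeout value in the status output above, 49710.06:28:15, can be converted to seconds with a small Python sketch. `parse_timespan` is a hypothetical helper that assumes the `[d.]hh:mm:ss` layout shown in that output.

```python
from datetime import timedelta


def parse_timespan(text: str) -> timedelta:
    """Parse a [d.]hh:mm:ss duration string (fractional seconds are not handled)."""
    days = 0
    clock = text
    head, dot, rest = text.partition(".")
    if dot and ":" not in head:
        # A leading "d." component holds the number of whole days.
        days = int(head)
        clock = rest
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


timeout = parse_timespan("49710.06:28:15")
total_seconds = int(timeout.total_seconds())
print(total_seconds)              # 4294967295
print(total_seconds == 2**32 - 1)  # True
```

The value is exactly the largest unsigned 32-bit number of seconds, roughly 136 years. Reading that as "wait for the safety check indefinitely" is an interpretation, but it would be consistent with the upgrade sitting in PreUpgradeSafetyCheck for hours.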

@oanapl oanapl assigned oanapl and motanv and unassigned oanapl Sep 15, 2018

meanin commented Sep 15, 2018

@mikkelhegn I will try it this way next time (I hope there won't be a next time :) ). I was also digging into the deployment parameters, but we were in a hurry.

@oanapl It happened on a development cluster, so it is not a big deal (node reboot). Furthermore, we do not have stateful services yet.

Again, we were in a hurry, just before a business demo, so I took any opportunity to get the cluster back into a valid state. Next time I will do this properly: roll back first and then redeploy the application.
