Application stuck in rolling upgrade state #1279

meanin · 2018-09-14T09:24:24Z

Hi all,

After the deployment of a new application version failed, the application got stuck in a rolling upgrade state. To be able to work on the application again, I decided to remove it from the cluster through Service Fabric Explorer, but Service Fabric is unable to delete it for some reason.

30 minutes later, NamingService shows unhealthy evaluations. It seems that AODeleteService is not able to start.

I also tried to investigate this from PowerShell.
Application upgrade state:

StartTimestampUtc             : 13.09.2018 15:42:26
UpgradeState                  : Failed
UpgradeDuration               : 08:07:28
CurrentUpgradeDomainDuration  : 08:04:27
CurrentUpgradeDomainProgress  : 2
                                
                                NodeName            : xxxxx
                                UpgradePhase        : PreUpgradeSafetyCheck
                                PendingSafetyChecks :
                                	EnsureAvailability - PartitionId: a57e06eb-7e7a-4578-93bc-4315bedd1e50
NextUpgradeDomain             : 2
UpgradeDomainsStatus          : { "0" = "Completed";
                                "1" = "Completed";
                                "2" = "Pending" }
UpgradeKind                   : Rolling
RollingUpgradeMode            : UnmonitoredAuto
ForceRestart                  : False
UpgradeReplicaSetCheckTimeout : 49710.06:28:15

Update-ServiceFabricApplicationUpgrade times out as well.

Basically, I am not able to remove the broken application without restarting the node.

mikkelhegn · 2018-09-14T10:34:04Z

PreUpgradeSafetyCheck - https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade-troubleshooting

An UpgradePhase of PreUpgradeSafetyCheck means there were issues preparing the upgrade domain before it was performed. The most common issues in this case are service errors in the close or demotion from primary code paths.

EnsureAvailability: https://docs.microsoft.com/en-us/rest/api/servicefabric/sfclient-model-ensureavailabilitysafetycheck

Safety check that waits to ensure the availability of the partition. It waits until there are replicas available such that bringing down this replica will not cause availability loss for the partition.

More on that safetycheck kind: https://docs.microsoft.com/en-us/previous-versions/azure/reference/mt280061(v=azure.100)

EnsureAvailability: Indicates that there is either a stateless service partition on the node having exactly one instance, or there is a primary replica on the node for which the partition is quorum loss. In both cases, bringing down the replica will result in loss of availability.

I'm not aware of a way to get past this other than manually bringing down the process that holds that replica. You could try this API if you are OK with losing availability: https://docs.microsoft.com/en-us/rest/api/servicefabric/sfclient-api-restartdeployedcodepackage (in other words, kill that process).

This aborts the code package process, which will restart all the user service replicas hosted in that process.
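
For anyone landing here later, this is a rough sketch of what a call to that REST API could look like. It only builds the URL and JSON body and does not send anything. The endpoint, node name, application name, service manifest name and code package name are all hypothetical placeholders, and the path and body fields follow my reading of the linked RestartDeployedCodePackage page (api-version 6.0), so check them against the docs for your runtime version before relying on them.

```python
from urllib.parse import quote


def app_id_from_name(app_name: str) -> str:
    """Convert 'fabric:/myapp/app1' into the 6.0+ REST identity 'myapp~app1'."""
    prefix = "fabric:/"
    if not app_name.startswith(prefix):
        raise ValueError(f"expected a name starting with {prefix!r}: {app_name!r}")
    return app_name[len(prefix):].replace("/", "~")


def restart_code_package_request(endpoint, node_name, app_name,
                                 service_manifest_name, code_package_name,
                                 code_package_instance_id="0"):
    """Build (url, body) for a RestartDeployedCodePackage POST.

    Assumption: an instance id of "0" asks the runtime to restart whatever
    instance is currently running, as described in the linked docs.
    """
    url = (
        f"{endpoint.rstrip('/')}"
        f"/Nodes/{quote(node_name, safe='')}"
        f"/$/GetApplications/{quote(app_id_from_name(app_name), safe='~')}"
        "/$/GetCodePackages/$/Restart?api-version=6.0"
    )
    body = {
        "ServiceManifestName": service_manifest_name,
        "CodePackageName": code_package_name,
        "CodePackageInstanceId": code_package_instance_id,
    }
    return url, body


# Hypothetical values; substitute the node and package that hold the stuck replica.
url, body = restart_code_package_request(
    "http://localhost:19080", "Node1", "fabric:/MyApp", "MyServicePkg", "Code"
)
print(url)
print(body)
```

The pair can then be POSTed with any HTTP client (with client certificates on a secured cluster). From PowerShell, Restart-ServiceFabricDeployedCodePackage should be the equivalent cmdlet.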

Let me know if this helps.

oanapl · 2018-09-15T00:30:02Z

@meanin, regarding the unhealthy evaluations on the naming service: they say that the service deletion is taking more than 30 minutes. There is nothing wrong with the naming service; it's just the entity this issue surfaces on.

So you started an application upgrade, it got stuck on safety checks, and while the upgrade was pending you issued a delete for the app, which also got stuck (and needed a node reboot). @motanv, can you look more into this?

FYI, deleting the app during an upgrade is probably not the best mitigation. You can try a rollback, or you can change UpgradeReplicaSetCheckTimeout; since you deleted the app, I understand that you are OK with losing state.
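
As a side note on why changing UpgradeReplicaSetCheckTimeout matters here: the value in the original post (49710.06:28:15) looks arbitrary, but converting it to seconds gives the largest unsigned 32-bit integer. The small sketch below does that arithmetic. The helper name is made up for illustration and it only handles whole-second values in the d.hh:mm:ss form shown in the PowerShell output.

```python
def timespan_to_seconds(value: str) -> int:
    """Parse a '[d.]hh:mm:ss' time span (whole seconds only) into seconds."""
    days = 0
    # A dot in the first colon-separated field separates days from hours.
    if "." in value.split(":")[0]:
        day_part, value = value.split(".", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


timeout = timespan_to_seconds("49710.06:28:15")
print(timeout)               # 4294967295
print(timeout == 2**32 - 1)  # True
```

So the upgrade was in practice configured to wait without limit on the EnsureAvailability check, which fits the 8 hour CurrentUpgradeDomainDuration above. Reading the value as "no timeout" is an inference from the arithmetic and not something confirmed in this thread.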

meanin · 2018-09-15T09:41:17Z

@mikkelhegn I will try this way next time (hope there won't be a next time :) ). I was also digging into the deployment parameters, but we were in a hurry.

@oanapl It happened on a development cluster, so it is not a big deal (node reboot). Furthermore, we do not have stateful services yet.

Again, we were in a hurry, just before a business demo, so I took any opportunity to get the cluster into a valid state. Next time I will do this properly: roll back first and then redeploy the application.

oanapl assigned oanapl and motanv and unassigned oanapl Sep 15, 2018

mikkelhegn closed this as completed Sep 17, 2018

mikkelhegn mentioned this issue Sep 19, 2018

Application Upgrade is Stuck #14

Closed
