This repository has been archived by the owner on Jan 16, 2021. It is now read-only.

Service Fabric service upgrade not working #595

Closed
ghost opened this issue Nov 2, 2017 · 22 comments

@ghost

ghost commented Nov 2, 2017

So, let me start at the beginning... We recently upgraded our SF project from VS2015 to VS2017 and noticed that the SF application gets deployed only about one time in three. This happens on multiple machines, not just one. Not being sure whether this was a VS thing or an SF thing, I thought it was time to focus on differential packaging, so that I could do smaller updates instead of full deployments. That would result in faster deployments, and hence my VS2017 wouldn't time out.

Now switching my focus to differential packaging, I made a very simple SF application with only 4 ServiceTypes. Then I did the following:

  • Deployed v2.0.0 of my newly created application to local cluster through PowerShell.
  • Updated Service3 and the Manifest file to v2.0.1.
  • Repackaged the application, manually removed all service folders except Service3 and the AppManifest xml file.
  • Performed an update through PowerShell and it failed.
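
The manual packaging step in the list above (keep the application manifest, drop the unchanged service folders) could be scripted roughly as follows. This is only a sketch: `New-DiffPackage` and its parameters are hypothetical names, not part of the Service Fabric SDK, and it assumes each service package sits in its own top-level folder of the application package.

```powershell
# Hypothetical helper (not an SDK cmdlet): copies a full application package
# and keeps only the service package folders that changed.
function New-DiffPackage {
    param(
        [Parameter(Mandatory)] [string]   $FullPackagePath,
        [Parameter(Mandatory)] [string]   $DiffPackagePath,
        [Parameter(Mandatory)] [string[]] $ChangedServicePackages
    )

    # Start from a clean copy of the full package.
    if (Test-Path $DiffPackagePath) { Remove-Item $DiffPackagePath -Recurse -Force }
    Copy-Item -Path $FullPackagePath -Destination $DiffPackagePath -Recurse

    # Remove every top-level service folder that did not change.
    # Files such as ApplicationManifest.xml are left untouched.
    Get-ChildItem -Path $DiffPackagePath -Directory |
        Where-Object { $ChangedServicePackages -notcontains $_.Name } |
        Remove-Item -Recurse -Force
}
```

For the scenario above it might be called as `New-DiffPackage -FullPackagePath .\pkg\Debug -DiffPackagePath .\pkg\DebugDiff -ChangedServicePackages 'Service3Pkg'`, where the paths and the `Service3Pkg` folder name are assumptions.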

I've tried multiple variations of this, and I've also tried resetting my local cluster, but have had no luck. Here is the result of my "Get-ServiceFabricApplicationUpgrade" command.

ApplicationName                : fabric:/DifferentialPackaging
ApplicationTypeName            : DifferentialPackagingType
TargetApplicationTypeVersion   : 2.0.1
ApplicationParameters          : {}
StartTimestampUtc              : 11/2/2017 12:36:06 PM
FailureTimestampUtc            : 11/2/2017 12:41:06 PM
FailureReason                  : UpgradeDomainTimeout
UpgradeDomainProgressAtFailure : 0

                                 NodeName            : _Node_0
                                 UpgradePhase        : PreUpgradeSafetyCheck
                                 PendingSafetyChecks :
                                        EnsureAvailability - PartitionId: 7c7bd322-0588-4ae8-a545-1050459990c6
UpgradeState                   : RollingBackInProgress
UpgradeDuration                : 00:12:01
CurrentUpgradeDomainDuration   : 00:07:01
CurrentUpgradeDomainProgress   : 0

                                 NodeName            : _Node_0
                                 UpgradePhase        : PreUpgradeSafetyCheck
                                 PendingSafetyChecks :
                                        EnsureAvailability - PartitionId: 1fd9be7b-747c-4c1e-a337-a0781e6a74f3
NextUpgradeDomain              :
UpgradeDomainsStatus           : { "0" = "InProgress" }
UpgradeKind                    : Rolling
RollingUpgradeMode             : UnmonitoredAuto
ForceRestart                   : False
UpgradeReplicaSetCheckTimeout  : 00:20:00

I'm not sure what I might be doing wrong here so please help me out, Thanks!

@masnider
Member

masnider commented Nov 2, 2017

Before you actually try to deploy the package: does it pass Test-ServiceFabricApplicationPackage if you point the cmdlet at the existing package?

vaishnavk assigned oanapl and unassigned vaishnavk Nov 3, 2017

@oanapl

oanapl commented Nov 3, 2017

How are your services configured - how many replicas (min/target)?

In the upgrade status you pasted, the upgrade fails because the UD timeout is exhausted. Inside the UD, the upgrade is stuck at PreUpgradeSafetyCheck. This is a check we perform to ensure availability. We don't proceed with the upgrade until we are sure the application has enough replicas to function properly. There are 2 partitions that are mentioned above that are stuck.

The upgrade sets UpgradeReplicaSetCheckTimeout to 20 minutes. This is the time-out period for checking whether the replica set has quorum. After the time-out period, the upgrade proceeds. If you set the UD timeout to a value less than the replica set check timeout (for example, 10 minutes) and moving the replica out of the node could cause quorum loss, the upgrade will fail (which is the correct behavior, since the main purpose of the monitored upgrade is to maintain availability).

As a side note, UpgradeReplicaSetCheckTimeout is deprecated; you should use the UpgradeReplicaSetCheckTimeoutSec parameter instead.

These articles tell you more about upgrade parameters and troubleshooting app upgrades.

@ghost
Author

ghost commented Nov 3, 2017

@masnider Nope. It gives me the following error, but I don't think it should expect to see the ServiceManifest.xml file if it's a differential packaging update.

λ  Test-ServiceFabricApplicationPackage -ApplicationPackagePath "C:\Users\Haseeb\Documents\Visual Studio 2017\Projects\DifferentialPackaging\DifferentialPackaging\pkg\Debug"
False
Test-ServiceFabricApplicationPackage : The BuildLayout of the application in C:\Users\Haseeb\AppData\Local\Temp\TestApplicationPackage_2936631255996\cc4awbux.rch\Debug is invalid. ServiceManifest.xml is
missing for service Service1Pkg.
At line:1 char:1
+ Test-ServiceFabricApplicationPackage -ApplicationPackagePath "C:\User ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Test-ServiceFabricApplicationPackage], FabricImageBuilderValidationException
    + FullyQualifiedErrorId : TestApplicationPackageErrorId,Microsoft.ServiceFabric.Powershell.TestApplicationPackage

@ghost
Author

ghost commented Nov 3, 2017

@oanapl It is a very very simple example... No partitioning, no stateful services.... Just 4 stateless services. Also, the InstanceCount for every service is -1. If I am missing something else, please do let me know :-)

@oanapl

oanapl commented Nov 3, 2017

Pass the ImageStoreConnectionString to Test-ServiceFabricApplicationPackage to use the previously deployed package for validation.

Are you using a one-node cluster? If you have one stateless instance and nowhere to move it, the upgrade waits until UpgradeReplicaSetCheckTimeout passes before it continues.

@masnider
Member

masnider commented Nov 3, 2017

@haseeb-ahmed-tkxel @oanapl Yes. For the most part this is probably a physical (package) layout problem, so let's make sure that works first. The Test package command needs to succeed before we can expect the actual deployment to work. Passing the image store connection string tells SF to check the image store for a package to delta from if this one is differential.

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade-advanced#upgrade-with-a-diff-package may help a little.

Upgrade is a second thing. Let's get there after we are sure that the differential package is correct.
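
As a rough illustration of the validation step discussed above, the call might look like this. The image store value shown is the usual default for a local development cluster and is an assumption here, as is the package path; substitute the values for your own cluster.

```powershell
# Assumption: default image store of a local dev cluster; yours may differ
# (check the ImageStoreConnectionString setting in the cluster manifest).
$imageStore  = 'file:C:\SfDevCluster\Data\ImageStoreShare'

# Hypothetical path to the differential package.
$packagePath = 'C:\MyApp\pkg\DebugDiff'

# With -ImageStoreConnectionString, service packages that are missing from
# the local package can be validated against what is already in the image store.
$isValid = Test-ServiceFabricApplicationPackage `
    -ApplicationPackagePath $packagePath `
    -ImageStoreConnectionString $imageStore

if (-not $isValid) {
    throw 'Package validation failed; fix the package layout before upgrading.'
}
```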

@makar-sasha

hello

I'm also experiencing
Test-ServiceFabricApplicationPackage : The BuildLayout of the application in C:\Users\makar\AppData\Local\Temp\TestApplicationPackage_240509860615\53ruzecf.u5e\Release is invalid. ServiceManifest.xml is missing for service Stateless1Pkg.

I created the default app with 2 default stateless services, then created a package and deployed it with:
c:\code\temp\difftest\Scripts\Deploy-FabricApplication.ps1 -ApplicationPackagePath c:\code\temp\difftest\pkg\Release\ -PublishProfileFile c:\code\temp\difftest\PublishProfiles\Cloud.xml -UseExistingClusterConnection:$true -DeployOnly:$false -UnregisterUnusedApplicationVersionsAfterUpgrade $false -OverrideUpgradeBehavior 'None' -OverwriteBehavior 'SameAppTypeAndVersion' -SkipPackageValidation:$false -ErrorAction Stop
After this:

  • I have updated cluster version and one particular service version
  • Created a package and manually removed unchanged service from it.
  • Executed the deployment command again hoping it will deploy the diff package.

Instead, I see the error quoted above. Could you help with this?

@oanapl

oanapl commented Nov 9, 2017

@makar-sasha , when you say "updated cluster version" I assume app manifest version?

These are the steps that should work:
For version 1, the package contains:

  • App manifest version 1, references <SP1, version1> and <SP2, version1>
    • SP1 version 1
    • SP2 version 1

Then version 2:

  • App manifest version 2, references <SP1, version1> (non-included) and <SP2, version2>
    • SP2 version 2

You provision version 1.
Test-ServiceFabricApplicationPackage with correct -ImageStoreConnectionString finds SP1 in the cluster and validates modified package.
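
The sequence above could be sketched with the PowerShell cmdlets as follows. Everything specific in it is an assumption for illustration: the local dev cluster connection, the default image store string, the package paths, and the `MyApp_v1` / `MyApp_v2` image store folder names.

```powershell
# Assumptions: local dev cluster, default image store, hypothetical paths and names.
$imageStore = 'file:C:\SfDevCluster\Data\ImageStoreShare'
Connect-ServiceFabricCluster   # no arguments: connects to the local cluster

# Version 1: full package (app manifest v1, SP1 v1, SP2 v1).
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'C:\MyApp\pkg\v1' `
    -ImageStoreConnectionString $imageStore `
    -ApplicationPackagePathInImageStore 'MyApp_v1'
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'MyApp_v1'

# Version 2: diff package (app manifest v2 and SP2 v2 only; SP1 v1 is not included).
# Validation finds SP1 v1 among the packages already provisioned in the cluster.
Test-ServiceFabricApplicationPackage -ApplicationPackagePath 'C:\MyApp\pkg\v2diff' `
    -ImageStoreConnectionString $imageStore

Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'C:\MyApp\pkg\v2diff' `
    -ImageStoreConnectionString $imageStore `
    -ApplicationPackagePathInImageStore 'MyApp_v2'
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'MyApp_v2'
```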

Can you validate that this is what you did?

@makar-sasha

@oanapl

Yes, I have tried this flow. I found that it works if I add
<UpgradeDeployment Mode="Monitored" Enabled="true"> <Parameters FailureAction="Rollback" Force="True" /> </UpgradeDeployment>
to the publish profile. Otherwise it results in the error "ServiceManifest.xml is missing for service Stateless1Pkg".

thanks for your comments!

@oanapl

oanapl commented Nov 10, 2017

@dbreshears , can you take a look at the publish profile issue?

@dbreshears
Member

I guess I am not seeing the issue with the publish profile. Deploy-FabricApplication.ps1 is just a wrapper around the scripts installed as part of the SDK. When an upgrade is specified in the publish profile, it calls the Publish-UpgradedServiceFabricApplication script; otherwise it calls Publish-NewServiceFabricApplication.

The issue, if I understand correctly, seems to be that Test-ServiceFabricApplicationPackage expects a path to a full package when the Publish-NewServiceFabricApplication script invokes it, but the Publish-UpgradedServiceFabricApplication script does not.

@oanapl

oanapl commented Nov 10, 2017

@makar-sasha , looks like our SDK script does not pass the ImageStoreConnectionString when calling Test-ServiceFabricApplicationPackage. I opened a tracking issue to improve our scripts.

As a mitigation, can you change your local scripts to pass the parameter?

Or you can call the powershell cmdlets directly if that's more convenient.

@anamkhalid

@oanapl @masnider
Test-ServiceFabricApplicationPackage -ApplicationPackagePath '...' -ImageStoreConnectionString '...' returns True but still the deployment/upgrade fails.

@oanapl

oanapl commented Nov 20, 2017

@anamkhalid , have you changed your local scripts to pass the image store connection string as I mentioned in my previous reply? Without this, the deployment will fail.

You can also run all the deployment steps through Powershell cmdlets.

@anamkhalid

@oanapl Yes, I used Powershell cmdlets to pass image store connection string.

@oanapl oanapl added this to the Backlog milestone Nov 27, 2017
@oanapl

oanapl commented Nov 27, 2017

Thank you for the update, I am glad you are unblocked. We will change our scripts to pass the ImageStoreConnectionString in our next major release (6.2).

@anamkhalid

@oanapl Actually not :) I was just confirming that I did pass the image store connection string, but the results are still the same: the Upgrade Domain Timeout issue that Haseeb mentioned in his top comment.

Are you able to get rid of this issue by passing image store connection string at your end?

@oanapl

oanapl commented Nov 28, 2017

My bad, you said in previous post that you used Powershell to pass image store connection string, and I assumed this worked. Did you mean you called Powershell directly and that worked?
To double check, you changed the scripts and they still don't work? Which step fails this time and what error do you see?

@amanbha is fixing this at our end.

@anamkhalid

@oanapl Here is the current status:
1- I'm using Powershell cmdlets to Connect, Copy, Register, Test and Upgrade app.
2- I'm passing ImageStoreConnectionString with Test command but it always fails with Upgrade Domain Timeout issue as shown in the image below:

[image not preserved]

@oanapl

oanapl commented Nov 29, 2017

The upgrade fails because of safety checks. See my first reply above for more upgrade related resources. Basically, it can't safely move replicas out of the node to proceed with upgrade, moving them out can affect availability.

How many nodes are in the cluster? If you have 1 node, there's nowhere to move the app and the upgrade will fail. Since a 1-node cluster is for testing purposes, you can pass a small UpgradeReplicaSetCheckTimeoutSec value to the upgrade command to tell the cluster it's OK to move the replicas immediately.
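
A minimal sketch of such an upgrade command for a one-node test cluster is shown here. The application name and version are taken from the upgrade status at the top of the thread; the timeout value of 1 second is an assumption chosen only to illustrate "small", and it is not a setting to carry over to a production cluster.

```powershell
Connect-ServiceFabricCluster   # no arguments: connects to the local cluster

# A short replica set check timeout tells the cluster not to wait long for
# availability safety checks, which cannot be satisfied on a single node.
Start-ServiceFabricApplicationUpgrade `
    -ApplicationName fabric:/DifferentialPackaging `
    -ApplicationTypeVersion 2.0.1 `
    -UnmonitoredAuto `
    -UpgradeReplicaSetCheckTimeoutSec 1   # assumed value, for test clusters only

# Check progress; UpgradeState and UpgradeDomainsStatus show how far it got.
Get-ServiceFabricApplicationUpgrade -ApplicationName fabric:/DifferentialPackaging
```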

If your cluster has > 1 node, how many services do you have in the app and how are they configured (number of partitions and number of replicas)?

Can you paste the powershell command you used to start the upgrade?

@anamkhalid

@oanapl It seems like adding UpgradeReplicaSetCheckTimeoutSec to the upgrade command resolves the issue. I copied the command below from Upgrade Using PowerShell:

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/VisualObjects -ApplicationTypeVersion 2.0.0.0 -HealthCheckStableDurationSec 60 -UpgradeDomainTimeoutSec 1200 -UpgradeTimeout 3000 -FailureAction Rollback -Monitored

Upgrade fails without explicitly specifying this parameter.

@oanapl

oanapl commented Dec 4, 2017

Ok, this is expected behavior for a one node cluster. Closing the issue based on this.
