
System.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor's ReceiveReminderAsync #379

Closed
jfloodnet opened this issue Aug 1, 2017 · 6 comments

jfloodnet commented Aug 1, 2017

I've asked this question on SO here: https://stackoverflow.com/questions/45429751/system-fabric-fabricnotprimaryexception-on-getstateasync-inside-actor

I have (n) actors that are executing on a continuous reminder every second.

These actors have been running fine for the last 4 days when, out of nowhere, every instance received the exception below when calling StateManager.GetStateAsync. Subsequently, I see that all of the actors have been deactivated.

I cannot find any information relating to this exception being encountered by reliable actors.

Once this exception occurs and the actors are deactivated, they do not get re-activated.

What are the conditions for this error to occur, and how can I further diagnose the problem?
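For context, the pattern looks roughly like the sketch below (names are illustrative, not my actual code): a persisted actor registers a one-second continuous reminder on activation and reads its state inside ReceiveReminderAsync, which is where the exception is thrown.

```csharp
// Illustrative sketch only, with placeholder names: a persisted actor on a
// one-second continuous reminder that reads state in ReceiveReminderAsync.
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Runtime;

public interface ITvChannelMonitor : IActor { }

[StatePersistence(StatePersistence.Persisted)]
internal class TvChannelMonitorActor : Actor, ITvChannelMonitor, IRemindable
{
    public TvChannelMonitorActor(ActorService actorService, ActorId actorId)
        : base(actorService, actorId)
    {
    }

    protected override async Task OnActivateAsync()
    {
        // Seed the state that the reminder reads.
        await StateManager.TryAddStateAsync("lastChecked", DateTime.UtcNow);

        // Continuous reminder: fires every second until unregistered.
        await RegisterReminderAsync(
            "Monitor",
            state: null,
            dueTime: TimeSpan.FromSeconds(1),
            period: TimeSpan.FromSeconds(1));
    }

    public async Task ReceiveReminderAsync(string reminderName, byte[] state, TimeSpan dueTime, TimeSpan period)
    {
        // This is the call that throws FabricNotPrimaryException once the
        // replica hosting the actor is no longer the primary.
        var lastChecked = await StateManager.GetStateAsync<DateTime>("lastChecked");

        // ... per-second work goes here ...

        await StateManager.SetStateAsync("lastChecked", DateTime.UtcNow);
    }
}
```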

"System.Fabric.FabricNotPrimaryException: Exception of type 'System.Fabric.FabricNotPrimaryException' was thrown. at Microsoft.ServiceFabric.Actors.Runtime.ActorStateProviderHelper.d__81.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__181.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__7`1.MoveNext()

Having a look at the cluster explorer, I can now see the following warnings on one of the partitions for that actor service:

Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', ConsiderWarningAsError=false.
Partition reconfiguration is taking longer than expected.
fabric:/Ism.TvcRecognition.App/TvChannelMonitor 3 3 4dcca5ee-2297-44f9-b63e-76a60df3bc3d
S/S IB _Node1_4 Up 131456742276273986
S/P RD _Node1_2 Up 131456742361691499
P/S RD _Node1_0 Down 131457861497316547
(Showing 3 out of 4 replicas. Total available replicas: 1.)

With a warning in the primary replica of that partition:

Unhealthy event: SourceId='System.RAP', Property='IReplicator.CatchupReplicaSetDuration', HealthState='Warning', ConsiderWarningAsError=false.

And a warning in the ActiveSecondary:

Unhealthy event: SourceId='System.RAP', Property='IStatefulServiceReplica.CloseDuration', HealthState='Warning', ConsiderWarningAsError=false. Start Time (UTC): 2017-08-01 04:51:39.740 _Node1_0

3 out of 5 nodes are showing the following warning:

Unhealthy event: SourceId='FabricDCA', Property='DataCollectionAgent.DiskSpaceAvailable', HealthState='Warning', ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.

More Information:

My cluster setup consists of 5 nodes of D1 virtual machines.

Event Viewer errors in the Microsoft-Service Fabric application log:

I see quite a lot of

Failed to read some or all of the events from ETL file D:\SvcFab\Log\QueryTraces\query_traces_5.6.231.9494_131460372168133038_1.etl.
System.ComponentModel.Win32Exception (0x80004005): The handle is invalid
at Tools.EtlReader.TraceFileEventReader.ReadEvents(DateTime startTime, DateTime endTime)
at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs)
at FabricDCA.EtlProcessor.ProcessActiveEtlFile(FileInfo etlFile, DateTime lastEndTime, DateTime& newEndTime, CancellationToken cancellationToken)

and a heap of warnings like:

Api IStatefulServiceReplica.Close() slow on partition {4dcca5ee-2297-44f9-b63e-76a60df3bc3d} replica 131457861497316547, StartTimeUTC = 2017-08-01T04:51:39.789083900Z

And finally, I think I might have found the root of all this. The Event Viewer Application log has a whole ream of errors like:

Ism.TvcRecognition.TvChannelMonitor (3688) (4dcca5ee-2297-44f9-b63e-76a60df3bc3d:131457861497316547): An attempt to write to the file "D:\SvcFab_App\Ism.TvcRecognition.AppType_App1\work\P_4dcca5ee-2297-44f9-b63e-76a60df3bc3d\R_131457861497316547\edbres00002.jrs" at offset 5242880 (0x0000000000500000) for 0 (0x00000000) bytes failed after 0.000 seconds with system error 112 (0x00000070): "There is not enough space on the disk. ". The write operation will fail with error -1808 (0xfffff8f0). If this error persists then the file may be damaged and may need to be restored from a previous backup.

OK, so that error is pointing to the D drive, which is temporary storage. It has 549 MB free of 50 GB.
Should Service Fabric really be persisting to temporary storage?

Digging into the SvcFab folder, it looks like I have a very overweight partition: the partition referenced above is 40 GB, while all the other partitions are about 2000 KB.
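A rough way to spot the overweight partition from the node itself is just to total up each partition/replica folder under the application's work directory (sketch below; the path is the one from the error message above and will differ per cluster and application instance):

```csharp
// Rough sketch: print the on-disk size of each partition/replica folder under
// the application's work directory. The path is taken from the error above;
// adjust it for your own cluster and application instance.
using System;
using System.IO;
using System.Linq;

class PartitionFolderSizes
{
    static void Main()
    {
        var workDir = @"D:\SvcFab_App\Ism.TvcRecognition.AppType_App1\work";

        foreach (var partitionDir in Directory.EnumerateDirectories(workDir))
        {
            long bytes = Directory
                .EnumerateFiles(partitionDir, "*", SearchOption.AllDirectories)
                .Sum(f => new FileInfo(f).Length);

            Console.WriteLine($"{Path.GetFileName(partitionDir)}: {bytes / (1024.0 * 1024.0):F1} MB");
        }
    }
}
```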

@jfloodnet jfloodnet changed the title ystem.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor's ReceiveReminderAsync System.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor's ReceiveReminderAsync Aug 1, 2017
@jfloodnet
Author

Is there a way to get the size of a partition as a metric from within Service Fabric, for monitoring purposes?
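Something along the lines of the sketch below is what I have in mind: enumerating the service's partitions with FabricClient and reading their reported load. I assume disk size would only show up here if the service reported it as a custom load metric; out of the box the query presumably returns just the default metrics.

```csharp
// Sketch: list each partition of the actor service and print its reported load
// metrics via FabricClient. Disk usage would only appear if the service reports
// it as a custom load metric.
using System;
using System.Fabric;
using System.Threading.Tasks;

class PartitionLoadDump
{
    static async Task Main()
    {
        var serviceUri = new Uri("fabric:/Ism.TvcRecognition.App/TvChannelMonitor");

        using (var fabricClient = new FabricClient())
        {
            var partitions = await fabricClient.QueryManager.GetPartitionListAsync(serviceUri);

            foreach (var partition in partitions)
            {
                var load = await fabricClient.QueryManager
                    .GetPartitionLoadInformationAsync(partition.PartitionInformation.Id);

                foreach (var metric in load.PrimaryLoadMetricReports)
                {
                    Console.WriteLine(
                        $"{partition.PartitionInformation.Id} {metric.Name} = {metric.Value}");
                }
            }
        }
    }
}
```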

harahma commented Aug 9, 2017

@alexwun

@jfloodnet Can you share the path of the files that are occupying the most space on your disk?

@jfloodnet
Author

@harahma D:\SvcFab_App\Ism.TvcRecognition.AppType_App16\work

It turned out that my actors were not being distributed evenly across partitions. This came down to a misunderstanding of how the actor service partitioning scheme works. I was using a long as the actor ID, and all my actor IDs were within a small range. I changed this to a string with a prefix, and I can now see the actors being distributed evenly. I haven't had that exception occur since making this change. However, I'm not sure whether the combined size of all the partitions now will account for the size of that one partition before.
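The change was essentially the following (illustrative sketch; ITvChannelMonitor is a placeholder interface). As far as I understand, a long actor ID is used directly as the partition key, so IDs clustered in a small numeric range all map to the same partition, whereas string IDs get hashed across the whole partition key range:

```csharp
// Illustrative only; ITvChannelMonitor is a placeholder actor interface.
using System;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Client;

public interface ITvChannelMonitor : IActor { }

public static class ActorIdExample
{
    private static readonly Uri ServiceUri =
        new Uri("fabric:/Ism.TvcRecognition.App/TvChannelMonitor");

    // Before: long IDs in a small range -> one overweight partition.
    public static ITvChannelMonitor GetMonitorBefore(long channelId)
    {
        return ActorProxy.Create<ITvChannelMonitor>(new ActorId(channelId), ServiceUri);
    }

    // After: prefixed string IDs -> hashed, spread evenly across partitions.
    public static ITvChannelMonitor GetMonitorAfter(long channelId)
    {
        return ActorProxy.Create<ITvChannelMonitor>(new ActorId("channel-" + channelId), ServiceUri);
    }
}
```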

Also, I'm still interested to know why all the files are on temporary storage. Considering that drive has a big "DATALOSS_WARNING_README" file in it, it seems concerning.

@masnider
Member

@jfloodnet TL;DR: it's fine because you're using SF; if you weren't, it'd be a really bad idea.

Those drives are "temporary" from Azure's perspective in the sense that they're the local drives on the machine. Azure doesn't know what you're doing with the drives, and it doesn't want any single-machine app to think that data written there is safe. In SF we replicate the data to multiple machines, so using the local disks is fine/safe. SF also integrates with Azure so that many of the management operations that would destroy that data are coordinated within the cluster to prevent exactly that from happening. When Azure announces that it's going to do an update that will destroy the data on a node, we move your service somewhere else before allowing that to happen, and [try to] stall the update in the meantime. Some more info on that is here.

So - we're good with this issue? Think we all understand what's going on now?

@masnider masnider added this to the General Question milestone Aug 25, 2017
@masnider
Member

I also replied with basically the same on SO, just so that the info is present in both locations. If you could manage that Q&A as well, that'd be great. In the future, maybe just pick one location. But I'll take the karma ;)

@jfloodnet
Author

Thanks, makes sense when you put it that way. Cheers
