
System.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor's ReceiveReminderAsync #379

Closed
jfloodnet opened this issue Aug 1, 2017 · 6 comments

jfloodnet commented Aug 1, 2017

I've asked this question on SO here: https://stackoverflow.com/questions/45429751/system-fabric-fabricnotprimaryexception-on-getstateasync-inside-actor

I have (n) actors that are executing on a continuous reminder every second.

These actors have been running fine for the last 4 days when, out of nowhere, every instance received the exception below when calling StateManager.GetStateAsync. Subsequently, I see that all of the actors have been deactivated.

I cannot find any information relating to this exception being encountered by reliable actors.

Once this exception occurs and the actors are deactivated, they do not get re-activated.

What are the conditions for this error to occur, and how can I further diagnose the problem?
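For context, the pattern looks roughly like the sketch below (names are illustrative, not my actual code): a persisted actor registers a one-second continuous reminder on activation and reads its state inside ReceiveReminderAsync, which is where the exception is thrown.

```csharp
// Illustrative sketch only, with placeholder names: a persisted actor on a
// one-second continuous reminder that reads state in ReceiveReminderAsync.
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Runtime;

public interface ITvChannelMonitor : IActor { }

[StatePersistence(StatePersistence.Persisted)]
internal class TvChannelMonitorActor : Actor, ITvChannelMonitor, IRemindable
{
    public TvChannelMonitorActor(ActorService actorService, ActorId actorId)
        : base(actorService, actorId)
    {
    }

    protected override async Task OnActivateAsync()
    {
        // Seed the state that the reminder reads.
        await StateManager.TryAddStateAsync("lastChecked", DateTime.UtcNow);

        // Continuous reminder: fires every second until unregistered.
        await RegisterReminderAsync(
            "Monitor",
            state: null,
            dueTime: TimeSpan.FromSeconds(1),
            period: TimeSpan.FromSeconds(1));
    }

    public async Task ReceiveReminderAsync(string reminderName, byte[] state, TimeSpan dueTime, TimeSpan period)
    {
        // This is the call that throws FabricNotPrimaryException once the
        // replica hosting the actor is no longer the primary.
        var lastChecked = await StateManager.GetStateAsync<DateTime>("lastChecked");

        // ... per-second work goes here ...

        await StateManager.SetStateAsync("lastChecked", DateTime.UtcNow);
    }
}
```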

"System.Fabric.FabricNotPrimaryException: Exception of type 'System.Fabric.FabricNotPrimaryException' was thrown. at Microsoft.ServiceFabric.Actors.Runtime.ActorStateProviderHelper.d__81.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__181.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__7`1.MoveNext()

Having a look at the cluster explorer, I can now see the following warnings on one of the partitions for that actor service:

Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', ConsiderWarningAsError=false.
Partition reconfiguration is taking longer than expected.
fabric:/Ism.TvcRecognition.App/TvChannelMonitor 3 3 4dcca5ee-2297-44f9-b63e-76a60df3bc3d
S/S IB _Node1_4 Up 131456742276273986
S/P RD _Node1_2 Up 131456742361691499
P/S RD _Node1_0 Down 131457861497316547
(Showing 3 out of 4 replicas. Total available replicas: 1.)

With a warning in the primary replica of that partition:

Unhealthy event: SourceId='System.RAP', Property='IReplicator.CatchupReplicaSetDuration', HealthState='Warning', ConsiderWarningAsError=false.

And a warning in the ActiveSecondary:

Unhealthy event: SourceId='System.RAP', Property='IStatefulServiceReplica.CloseDuration', HealthState='Warning', ConsiderWarningAsError=false. Start Time (UTC): 2017-08-01 04:51:39.740 _Node1_0

3 out of 5 nodes are showing the following warning:

Unhealthy event: SourceId='FabricDCA', Property='DataCollectionAgent.DiskSpaceAvailable', HealthState='Warning', ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.

More Information:

My cluster setup consists of 5 nodes of D1 virtual machines.

Event Viewer errors in the Microsoft-Service Fabric application log:

I see quite a lot of

Failed to read some or all of the events from ETL file D:\SvcFab\Log\QueryTraces\query_traces_5.6.231.9494_131460372168133038_1.etl.
System.ComponentModel.Win32Exception (0x80004005): The handle is invalid
at Tools.EtlReader.TraceFileEventReader.ReadEvents(DateTime startTime, DateTime endTime)
at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs)
at FabricDCA.EtlProcessor.ProcessActiveEtlFile(FileInfo etlFile, DateTime lastEndTime, DateTime& newEndTime, CancellationToken cancellationToken)

and a heap of warnings like:

Api IStatefulServiceReplica.Close() slow on partition {4dcca5ee-2297-44f9-b63e-76a60df3bc3d} replica 131457861497316547, StartTimeUTC = 2017-08-01T04:51:39.789083900Z

And finally, I think I might have found the root of all this. The Event Viewer Application log has a whole ream of errors like:

Ism.TvcRecognition.TvChannelMonitor (3688) (4dcca5ee-2297-44f9-b63e-76a60df3bc3d:131457861497316547): An attempt to write to the file "D:\SvcFab_App\Ism.TvcRecognition.AppType_App1\work\P_4dcca5ee-2297-44f9-b63e-76a60df3bc3d\R_131457861497316547\edbres00002.jrs" at offset 5242880 (0x0000000000500000) for 0 (0x00000000) bytes failed after 0.000 seconds with system error 112 (0x00000070): "There is not enough space on the disk. ". The write operation will fail with error -1808 (0xfffff8f0). If this error persists then the file may be damaged and may need to be restored from a previous backup.

OK, so that error is pointing to the D drive, which is temporary storage. It has 549 MB free of 50 GB.
Should Service Fabric really be persisting to temporary storage?

Digging into the SvcFab folder, it looks like I have a very overweight partition: the partition referenced above is 40 GB, while all the other partitions are about 2000 KB.
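A rough way to spot the overweight partition from the node itself is just to total up each partition/replica folder under the application's work directory (sketch below; the path is the one from the error message above and will differ per cluster and application instance):

```csharp
// Rough sketch: print the on-disk size of each partition/replica folder under
// the application's work directory. The path is taken from the error above;
// adjust it for your own cluster and application instance.
using System;
using System.IO;
using System.Linq;

class PartitionFolderSizes
{
    static void Main()
    {
        var workDir = @"D:\SvcFab_App\Ism.TvcRecognition.AppType_App1\work";

        foreach (var partitionDir in Directory.EnumerateDirectories(workDir))
        {
            long bytes = Directory
                .EnumerateFiles(partitionDir, "*", SearchOption.AllDirectories)
                .Sum(f => new FileInfo(f).Length);

            Console.WriteLine($"{Path.GetFileName(partitionDir)}: {bytes / (1024.0 * 1024.0):F1} MB");
        }
    }
}
```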

@jfloodnet jfloodnet changed the title ystem.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor's ReceiveReminderAsync System.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor's ReceiveReminderAsync Aug 1, 2017
@jfloodnet
Author

Is there a way to get the size of a partition as a metric from within Service Fabric, for monitoring purposes?
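Something along the lines of the sketch below is what I have in mind: enumerating the service's partitions with FabricClient and reading their reported load. I assume disk size would only show up here if the service reported it as a custom load metric; out of the box the query presumably returns just the default metrics.

```csharp
// Sketch: list each partition of the actor service and print its reported load
// metrics via FabricClient. Disk usage would only appear if the service reports
// it as a custom load metric.
using System;
using System.Fabric;
using System.Threading.Tasks;

class PartitionLoadDump
{
    static async Task Main()
    {
        var serviceUri = new Uri("fabric:/Ism.TvcRecognition.App/TvChannelMonitor");

        using (var fabricClient = new FabricClient())
        {
            var partitions = await fabricClient.QueryManager.GetPartitionListAsync(serviceUri);

            foreach (var partition in partitions)
            {
                var load = await fabricClient.QueryManager
                    .GetPartitionLoadInformationAsync(partition.PartitionInformation.Id);

                foreach (var metric in load.PrimaryLoadMetricReports)
                {
                    Console.WriteLine(
                        $"{partition.PartitionInformation.Id} {metric.Name} = {metric.Value}");
                }
            }
        }
    }
}
```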

harahma commented Aug 9, 2017

@alexwun

@jfloodnet Can you share the path of the files that are occupying the most space on your disk?

@jfloodnet
Author

@harahma D:\SvcFab_App\Ism.TvcRecognition.AppType_App16\work

It turned out that my actors were not being distributed evenly across partitions. This came down to a misunderstanding of how the actor service partitioning scheme works. I was using a long as the actor ID, and all my actor IDs were within a small range. I changed this to a string with a prefix, and I can now see the actors being distributed evenly. I haven't had that exception occur since making this change. However, I'm not sure whether the combined size of all the partitions now will account for the size of that one partition before.
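The change was essentially the following (illustrative sketch; ITvChannelMonitor is a placeholder interface). As far as I understand, a long actor ID is used directly as the partition key, so IDs clustered in a small numeric range all map to the same partition, whereas string IDs get hashed across the whole partition key range:

```csharp
// Illustrative only; ITvChannelMonitor is a placeholder actor interface.
using System;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Client;

public interface ITvChannelMonitor : IActor { }

public static class ActorIdExample
{
    private static readonly Uri ServiceUri =
        new Uri("fabric:/Ism.TvcRecognition.App/TvChannelMonitor");

    // Before: long IDs in a small range -> one overweight partition.
    public static ITvChannelMonitor GetMonitorBefore(long channelId)
    {
        return ActorProxy.Create<ITvChannelMonitor>(new ActorId(channelId), ServiceUri);
    }

    // After: prefixed string IDs -> hashed, spread evenly across partitions.
    public static ITvChannelMonitor GetMonitorAfter(long channelId)
    {
        return ActorProxy.Create<ITvChannelMonitor>(new ActorId("channel-" + channelId), ServiceUri);
    }
}
```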

Also, I'm still interested to know why all the files are on temporary storage. Considering that drive has a big "DATALOSS_WARNING_README" file in it, it seems concerning.

@masnider
Member

@jfloodnet TL;DR: it's fine because you're using SF; if you weren't, it'd be a really bad idea.

Those drives are "temporary" from Azure's perspective in the sense that they're the local drives on the machine. Azure doesn't know what you're doing with the drives, and it doesn't want any single-machine app to think that data written there is safe. In SF we replicate the data to multiple machines, so using the local disks is fine/safe. SF also integrates with Azure so that many of the management operations that would destroy that data are coordinated within the cluster to prevent exactly that from happening. When Azure announces that it's going to do an update that will destroy the data on a node, we move your service somewhere else before allowing that to happen, and [try to] stall the update in the meantime. Some more info on that is here.

So - we're good with this issue? Think we all understand what's going on now?

@masnider masnider added this to the General Question milestone Aug 25, 2017
@masnider
Member

I also replied with basically the same on SO, just so that the info is present in both locations. If you could manage that Q&A as well, that'd be great. In the future, maybe just pick one location. But I'll take the karma ;)

@jfloodnet
Author

Thanks, makes sense when you put it that way. Cheers
