You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Jan 16, 2021. It is now read-only.
I have (n) actors that are executing on a continuous reminder every second.
These actor's have been running fine for the last 4 days when out of no where every instance receives the below exception on calling StateManager.GetStateAsync. Subsequently, I see all the actors are deactivated.
I cannot find any information relating to this exception being encountered by reliable actors.
Once this exception occurs and the actors are deactivated, they do not get re-activated.
What are the conditions for this error to occur and how can I further diagnose the problem?
"System.Fabric.FabricNotPrimaryException: Exception of type 'System.Fabric.FabricNotPrimaryException' was thrown. at Microsoft.ServiceFabric.Actors.Runtime.ActorStateProviderHelper.d__81.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__181.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__7`1.MoveNext()
Having a look at the cluster explorer, I can now see the following warnings on one of the partitions for that actor service:
Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', ConsiderWarningAsError=false.
Partition reconfiguration is taking longer than expected.
fabric:/Ism.TvcRecognition.App/TvChannelMonitor 3 3 4dcca5ee-2297-44f9-b63e-76a60df3bc3d
S/S IB _Node1_4 Up 131456742276273986
S/P RD _Node1_2 Up 131456742361691499
P/S RD _Node1_0 Down 131457861497316547
(Showing 3 out of 4 replicas. Total available replicas: 1.)
With a warning in the primary replica of that partition:
Unhealthy event: SourceId='FabricDCA', Property='DataCollectionAgent.DiskSpaceAvailable', HealthState='Warning', ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.
More Information:
My cluster setup consists of 5 nodes of D1 virtual machines.
Event viewer errors in Microsoft-Service Fabric application:
I see quite a lot of
Failed to read some or all of the events from ETL file D:\SvcFab\Log\QueryTraces\query_traces_5.6.231.9494_131460372168133038_1.etl.
System.ComponentModel.Win32Exception (0x80004005): The handle is invalid
at Tools.EtlReader.TraceFileEventReader.ReadEvents(DateTime startTime, DateTime endTime)
at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs)
at FabricDCA.EtlProcessor.ProcessActiveEtlFile(FileInfo etlFile, DateTime lastEndTime, DateTime& newEndTime, CancellationToken cancellationToken)
and a heap of warnings like:
Api IStatefulServiceReplica.Close() slow on partition {4dcca5ee-2297-44f9-b63e-76a60df3bc3d} replica 131457861497316547, StartTimeUTC = 2017-08-01T04:51:39.789083900Z
And finally I think I might be at the root of all this. Event Viewer Application Logs has a whole ream of errors like:
Ism.TvcRecognition.TvChannelMonitor (3688) (4dcca5ee-2297-44f9-b63e-76a60df3bc3d:131457861497316547): An attempt to write to the file "D:\SvcFab_App\Ism.TvcRecognition.AppType_App1\work\P_4dcca5ee-2297-44f9-b63e-76a60df3bc3d\R_131457861497316547\edbres00002.jrs" at offset 5242880 (0x0000000000500000) for 0 (0x00000000) bytes failed after 0.000 seconds with system error 112 (0x00000070): "There is not enough space on the disk. ". The write operation will fail with error -1808 (0xfffff8f0). If this error persists then the file may be damaged and may need to be restored from a previous backup.
Ok so, that error is pointing to the D drive, which is Temporary Storage. It has 549 MB free of 50 GB.
Should Service fabric really be persisting to Temporary Storage ?
Digging into to the SvcFab folder it looks like I have a very overweight partition, i.e. the partition referenced above is 40GB, while all the other partitions are about 2000 KB.
The text was updated successfully, but these errors were encountered:
jfloodnet
changed the title
ystem.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor's ReceiveReminderAsync
System.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor's ReceiveReminderAsync
Aug 1, 2017
It turned out that my actors were not being distributed evenly across partitions. This came down to a misunderstanding of how the the actor service partitioning scheme worked. I was using a long as the actor Id and all my actors Id's were within a small range. I changed this to a string with prefix and I can now see the actors being evenly distributed. I haven't had that exception occur since making this change. However, I'm not sure the size of all partitions now, will account for the size of that one partition before.
Also, I'm still interested to know why all the files are on temporary storage? Considering that drive as a big "DATALOSS_WARNING_README" file in it, it seems concerning?
@jfloodnet TLDR: it's fine because you're using SF, if you weren't it'd be a really bad idea.
Those drives are "temporary" from Azure's perspective in the sense that they're the local drives on the machine. Azure doesn't know what you're doing with the drives, and it doesn't want any single machine app to think that data written there is safe. In SF we replicate the data to multiple machines, so using the local disks is fine/safe. SF also integrates with Azure so that a lot of the management operations that would destroy that data are managed in the cluster to prevent exactly that from happening. When Azure announces that it's going to do an update that will destroy the data on that node, we move your service somewhere else before allowing that to happen, and [try to] stall the update in the meantime. Some more info on that is here.
So - we're good with this issue? Think we all understand what's going on now?
I also replied with basically the same in SO, just so that the info is present in both locations. If you could manage that Q & A as well that'd be great. In the future, maybe just pick one location. But I'll take the karma ;)
I've asked this question on SO here -https://stackoverflow.com/questions/45429751/system-fabric-fabricnotprimaryexception-on-getstateasync-inside-actor
I have (n) actors that are executing on a continuous reminder every second.
These actor's have been running fine for the last 4 days when out of no where every instance receives the below exception on calling StateManager.GetStateAsync. Subsequently, I see all the actors are deactivated.
I cannot find any information relating to this exception being encountered by reliable actors.
Once this exception occurs and the actors are deactivated, they do not get re-activated.
What are the conditions for this error to occur and how can I further diagnose the problem?
Having a look at the cluster explorer, I can now see the following warnings on one of the partitions for that actor service:
With a warning in the primary replica of that partition:
And a warning in the ActiveSecondary:
3 out of 5 Nodes are showing the following error:
More Information:
My cluster setup consists of 5 nodes of D1 virtual machines.
Event viewer errors in Microsoft-Service Fabric application:
I see quite a lot of
and a heap of warnings like:
And finally I think I might be at the root of all this. Event Viewer Application Logs has a whole ream of errors like:
Ok so, that error is pointing to the D drive, which is Temporary Storage. It has 549 MB free of 50 GB.
Should Service fabric really be persisting to Temporary Storage ?
Digging into to the SvcFab folder it looks like I have a very overweight partition, i.e. the partition referenced above is 40GB, while all the other partitions are about 2000 KB.
The text was updated successfully, but these errors were encountered: