Add file referenced from recent blog (temp)

Evolveum · Apr 15, 2021 · 3a3b8ab · 3a3b8ab
1 parent 7da55fd
commit 3a3b8ab
Showing 1 changed file with 380 additions and 0 deletions.
diff --git a/docs/synchronization/live-sync-error-handling-strategy.adoc b/docs/synchronization/live-sync-error-handling-strategy.adoc
@@ -0,0 +1,380 @@
+= Task Error Handling
+:page-wiki-name: Live sync error handling strategy HOWTO
+:page-experimental: true
+:page-since: "4.3"
+:page-toc: top
+:page-keywords: task, error, failure, error handling, error recovery
+
+== Overview
+
+During task execution, failures can occur.footnote:[Although the words "error" and "failure"
+have precise meanings in the engineering context, we will use them interchangeably and somewhat
+freely. In this document they denote any midPoint-detected problem in processing, represented
+by the appropriate operation status: either `FATAL_ERROR` or `PARTIAL_ERROR`.] Some of them affect
+the whole task. A typical example is a misconfigured or unreachable resource. These failures usually
+cause the task either to suspend, or to repeat its execution at a defined time.
+
+However, there can
+also be failures that are limited to individual objects. They are often caused by issues
+during provisioning of changes to a target resource. There can also be failures related
+to mapping evaluation, for example because of a programming error. Yet another class of problems
+is related directly to the source objects, e.g. major data quality issues that
+prevent these objects from being processed in any reasonable way, or even technical issues preventing
+them from being fetched from the resource in the first place.
+
+This document deals with the latter kind, i.e. failures related to processing of individual objects.
+So we will consider a situation where, during task execution, a subset of the objects fails to
+be processed, while the rest is processed successfully. The ratio of failed objects can be negligible
+(like less than one in a thousand), moderate, or they may even represent the majority of all objects.
+
+By "error handling" we mean here the mechanisms provided by midPoint to help the administrator handle
+this kind of failure (error).
+
+== The State Before
+
+In midPoint 4.2 and before, the error handling in tasks was quite rudimentary.
+
+For common tasks like reconciliation, recomputation, bulk actions execution, or import from
+resource, the only way to handle failures that had occurred was to re-run the task.
+That way, all the objects were re-processed: those that had been processed successfully were
+(most often) processed successfully again, and those that had failed were (hopefully) processed
+correctly this time. The overhead of this solution depended on the ratio of failed objects. For
+very large sets of objects (say millions) it could be considered wasteful to repeat the whole
+processing for a relatively small number of failed ones.footnote:[Administrators often had to resort
+to clever hacks, like trying to identify patterns of failures, and then formulating those patterns
+as object filters that were used in repeated task runs. However, this was generally tedious and
+applicable only in some situations.]
+
+A special category of tasks is _live synchronization_. Here midPoint provided two ways
+of handling errors:
+
+. stop processing when an error is encountered, until the error is fixed;
+
+. ignore any errors and just continue processing.
+
+The former option is safe, but can result in unnecessary delays in processing, especially if errors
+occur relatively often. The latter eliminates delays, but results in missing updates and therefore
+in inconsistency between the resource state and the midPoint state.
+
+== Error Handling in 4.3
+
+As part of the link:/midpoint/projects/midscale/[midScale project], we have experimentally implemented
+two distinct task error handling mechanisms.
+
+=== Operation Execution Records
+
+When an object is processed by a task, an _operation execution record_ is attached to the object.footnote:[Actually,
+there are two kinds of operation execution records: operation-level records (sometimes
+called "complex") and modification execution records (sometimes called "simple"). We now talk
+about the former ones. In midPoint 4.2 and before, we did not explicitly differentiate between
+these two, and the support for operation-level records was incomplete.] That record carries
+information on whether the processing was successful or not. What is crucial for error handling is
+that these records allow us to easily select failed objects for re-processing, without the need
+to go through all the objects.
+
+The practical use of this feature looks like this:
+
+1. A main task `M` is run, processing a set of objects. Some of these objects encountered errors.
+Respective operation execution records are created for them.
+2. Then (when the system administrator decides) another task is run, aimed at these erroneous
+objects. Let's call this task the recoverer (`R`). It has the following characteristics:
+
+* It usually has the same type as `M`. For example, if `M` is an import task, then `R` is usually
+an import task as well. Other significant parameters, like the specific bulk action to execute, should
+also match.footnote:[This is not a strict rule. There can be situations when, for example, the main task is a bulk action
+task, and the recoverer is a recomputation task. Or the recoverer can use a different bulk action
+than was used in the main task, if needed.]
+
+* It operates on the same set of objects (specified e.g. by resource reference, object class,
+kind, intent, and/or a query) but with a so-called _failed objects selector_ added. This selector
+specifies e.g. result states that should be matched (e.g. fatal error, partial error, warning),
+reference to the main task(s)footnote:[A single recoverer can treat multiple main tasks.
+Also, a recoverer can be the same task as the main one, with just the selector added.], or
+the time interval when the error occurred.
+
+3. The recoverer then goes through failed objects, according to the original set specification
+combined with the failed objects selector, and tries to process them. The errors occurring in
+this task can be later handled again.
+
+=== Triggers
+
+Another option is to automatically schedule any failed object for re-processing using _triggers_.
+This mechanism is currently limited to synchronization tasks (import, reconciliation,
+live synchronization) and works like this:
+
+1. An error is encountered during processing of a resource object shadow in a task.
+
+2. If the appropriate configuration is set, a trigger is created on the respective resource object
+shadow, reminding midPoint that the shadow should be synchronized again. The time interval for the
+trigger is configurable.
+
+3. After the specified time arrives, the `Trigger scanner` task retrieves the shadow and ensures that
+it is re-synchronized.
+
+4. If the repeated processing is successful, the process ends here.
+If not, another trigger (with an interval that may be the same or different) is set up,
+and the process repeats.
+
+5. If the process is not successful even after the specified number of repetitions, the process ends.
+
+=== Which Approach to Use
+
+Each of the options described has its own strengths and limitations. These are summarized
+in the table below.
+
+[%autowidth]
+[%header]
+|===
+| Feature | Operation Execution Records | Triggers | Comment
+
+| Applicability
+| Any kind of object processed by (almost) any task.
+| Shadows, processed by synchronization tasks.
+|
+
+| Extra configuration required
+| Yes. A recoverer task should (usually) be set up, including careful specification of the failed objects selector.
+| No. Trigger scanner takes care of everything. Only the retry strategy has to be set up
+in the main task.
+|
+
+|===
+
+TODO any other differences?
+
+== Configuration Samples and Reference
+
+=== Operation Execution Records
+
+An example of a recoverer task:
+
+[source,xml]
+----
+<task oid="e06f3f5c-4acc-4c6a-baa3-5c7a954ce4e9"
+    xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
+    xmlns:ext="http://midpoint.evolveum.com/xml/ns/public/model/extension-3"
+    xmlns:ri="http://midpoint.evolveum.com/xml/ns/public/resource/instance-3">
+
+    <name>Import: retry errors</name>
+
+    <extension>
+        <ext:kind>account</ext:kind>
+        <ext:intent>default</ext:intent>
+        <ext:objectclass>ri:AccountObjectClass</ext:objectclass>
+        <ext:failedObjectsSelector>
+            <taskRef oid="e06f3f5c-4acc-4c6a-baa3-5c7a954ce4e9" />
+            <timeFrom>2021-02-18T15:00:00.342+01:00</timeFrom>
+        </ext:failedObjectsSelector>
+    </extension>
+
+    <ownerRef oid="00000000-0000-0000-0000-000000000002"/>
+    <executionStatus>runnable</executionStatus>
+
+    <handlerUri>http://midpoint.evolveum.com/xml/ns/public/model/synchronization/task/import/handler-3</handlerUri>
+    <objectRef oid="a1c7dcb8-07f8-4626-bea7-f10d9df7ec9f" type="ResourceType"/>
+    <recurrence>single</recurrence>
+</task>
+----
+
+The `failedObjectsSelector` can have the following items:
+
+[%autowidth]
+[%header]
+|===
+| Item | Description | Default
+
+| `status`
+| What operation result statuses to select.
+| `FATAL_ERROR` and `PARTIAL_ERROR`
+
+| `taskRef`
+| What task(s) to look for when checking operation execution records?
+| The current task.
+
+| `timeFrom`
+| What is the earliest time of the record to be considered? This is important because
+old execution records are not deleted automatically when an object is re-processed, unless one of the following occurs:
+either the recoverer task is the same as the main task (then the result
+is replaced by the new one), or a defined limit for operation execution records is reached (then
+the oldest ones are purged).
+
+Therefore, one has to set up this information carefully to avoid repeated processing
+of already processed objects.
+| No limit.
+
+| `timeTo`
+| What is the latest time of the record to be considered?
+| If no explicit task is specified, then it is the last start timestamp of the current
+task's root. If the task is different, then there is no limit by default.
+
+| `selectionMethod`
+| How failed objects are selected. This is to overcome some technological obstacles in
+object searching in the provisioning module. Normally, there is no need to override the default
+value.
+| `default`
+|===
+
+The selection method has the following values:
+
+[%autowidth]
+[%header]
+|===
+| Item | Description
+| `default` | When searching for shadows via provisioning, `fetchFailedObjects`; otherwise `narrowQuery`.
+| `narrowQuery` | Simply narrow the original query by adding failed objects filter.
+It works with repository but usually not with provisioning.
+| `fetchFailedObjects` | Failed objects are selected using the repository. Only after that, they are fetched
+one-by-one via provisioning and processed. This is preferable when there is only
+a small percentage of failed records.
+| `filterAfterRetrieval` | Uses the original query to retrieve objects from a resource. Filtering is
+done afterwards, i.e. before results are passed to the processing. This is preferable when there is
+a large percentage of failed records.
+|===
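+
+As a hedged sketch, a selector combining the items above might look like the fragment below. It belongs
+inside the task `extension`, as in the recoverer example. The item names come from the tables above. The
+lowercase spelling of the `status` values (mirroring the `fatal_error` value used in the error handling
+strategy sample in this document) and the use of repeated `status` elements are assumptions, and the
+timestamps are placeholders.
+
+[source,xml]
+----
+<ext:failedObjectsSelector>
+    <!-- Assumed value spelling; these two statuses are also the documented default. -->
+    <status>fatal_error</status>
+    <status>partial_error</status>
+    <!-- Placeholder time window limiting which operation execution records are considered. -->
+    <timeFrom>2021-02-18T15:00:00.000+01:00</timeFrom>
+    <timeTo>2021-02-19T15:00:00.000+01:00</timeTo>
+    <!-- Usually unnecessary; shown only to illustrate overriding the default method. -->
+    <selectionMethod>fetchFailedObjects</selectionMethod>
+</ext:failedObjectsSelector>
+----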
+
+=== Triggers
+
+An example of configuration of error handling strategy using triggers:
+
+[source,xml]
+----
+<task oid="2d7f0709-3e9b-4b92-891f-c5e1428b6458"
+    xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
+    xmlns:ext="http://midpoint.evolveum.com/xml/ns/public/model/extension-3"
+    xmlns:ri="http://midpoint.evolveum.com/xml/ns/public/resource/instance-3">
+
+    <name>Live Sync</name>
+
+    <extension>
+        <ext:objectclass>ri:AccountObjectClass</ext:objectclass>
+    </extension>
+
+    <ownerRef oid="00000000-0000-0000-0000-000000000002"/>
+    <executionStatus>runnable</executionStatus>
+
+    <handlerUri>http://midpoint.evolveum.com/xml/ns/public/model/synchronization/task/live-sync/handler-3</handlerUri>
+    <objectRef oid="a20bb7b7-c5e9-4bbb-94e0-79e7866362e6" type="ResourceType"/>
+    <recurrence>single</recurrence>
+
+    <errorHandlingStrategy>
+        <entry>
+            <situation>
+                <errorCategory>generic</errorCategory>
+            </situation>
+            <reaction>
+                <retryLater>
+                    <initialInterval>PT30M</initialInterval>
+                    <nextInterval>PT1H</nextInterval>
+                    <retryLimit>3</retryLimit>
+                </retryLater>
+            </reaction>
+        </entry>
+        <entry>
+            <situation>
+                <errorCategory>configuration</errorCategory>
+                <status>fatal_error</status>
+            </situation>
+            <reaction>
+                <retryLater>
+                    <initialInterval>P1D</initialInterval>
+                    <nextInterval>P3D</nextInterval>
+                    <!-- no retry limit -->
+                </retryLater>
+            </reaction>
+        </entry>
+    </errorHandlingStrategy>
+</task>
+----
+
+In this sample, after a generic error is encountered, the retry is attempted after 30 minutes. The next retries
+are done after 1 hour. The process stops after 4 attempts. However, if the error was configuration-related
+(with the status of `FATAL_ERROR`), then the initial interval is 1 day, with further retries after 3 days,
+and without any retry limit.
+
+Generally, the `errorHandlingStrategy` contains a list of entries. Each entry has:
+
+[%autowidth]
+[%header]
+|===
+| Item | Description | Default
+| `order` | Order in which this entry is to be evaluated. (Relative to other entries.) Smaller numbers
+go first. Entries with no order go last. | No order.
+| `situation` | A situation that can occur. | Any error.
+| `reaction` | What should a task do when a given situation is encountered? | `ignore` or `stop` (see below)
+|===
+
+A `situation` contains the following:
+
+[%autowidth]
+[%header]
+|===
+| Item | Description | Default
+| `status` | Operation result status to match. Can be either `PARTIAL_ERROR` or `FATAL_ERROR`.
+| If not present, we decide solely on error category. If error categories are not specified,
+any error matches.
+| `errorCategory` | Error category (network, security, policy, ...) to match. Note that some errors are not propagated
+to the level where they can be recognized by this selector. So be careful and consider this feature
+to be highly experimental.
+| If not present, we decide solely on the status. If status is not present, any error matches.
+|===
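+
+To illustrate the matching rules above, here is a sketch with two entries: the first matches on
+error category alone, the second on status alone. The element names and the `retryLater` properties are
+taken from the sample in this section, and `network` is one of the categories named in the table. The
+`order` values and the intervals are arbitrary, chosen only for illustration.
+
+[source,xml]
+----
+<errorHandlingStrategy>
+    <entry>
+        <!-- Smaller order goes first, so network errors are matched here. -->
+        <order>1</order>
+        <situation>
+            <!-- No status given: any error in the network category matches. -->
+            <errorCategory>network</errorCategory>
+        </situation>
+        <reaction>
+            <retryLater>
+                <initialInterval>PT5M</initialInterval>
+                <nextInterval>PT30M</nextInterval>
+            </retryLater>
+        </reaction>
+    </entry>
+    <entry>
+        <order>2</order>
+        <situation>
+            <!-- No category given: any fatal error matches. -->
+            <status>fatal_error</status>
+        </situation>
+        <reaction>
+            <retryLater>
+                <initialInterval>PT1H</initialInterval>
+                <nextInterval>PT6H</nextInterval>
+            </retryLater>
+        </reaction>
+    </entry>
+</errorHandlingStrategy>
+----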
+
+The `reaction` is either:
+
+[%autowidth]
+[%header]
+|===
+| Reaction | Description | Note
+
+| `ignore`
+| The processing should continue, ignoring the error. E.g. for live sync tasks, this means that
+the sync token is advanced to the next item, effectively marking the record as processed.
+| This is the default strategy for the majority of tasks.
+
+| `stop`
+| The processing is stopped.
+| This is the default strategy for live sync and async update tasks.
+
+| `retryLater`
+| Processing of the specified account should be retried later using a trigger, as described above.
+| This strategy has more parameters, see below.
+|===
+
+Notes:
+
+1. Names for these options may be changed in the future, to make them more compatible with
+error handling based on operation execution records. (They were created before, and
+not revised afterwards.)
+
+2. Operation execution recording is *not* influenced by these settings. So each error
+is recorded regardless of the value of `reaction`. This is why operation execution records based
+error handling works well with the default setting of `ignore` reaction (although
+by "ignoring" one can imagine that the error is not even recorded).
+
+3. Besides these options, you can also specify the `stopAfter` property (applicable to the `ignore`
+and `retryLater` reactions), which causes the task to be stopped after seeing the specified number
+of error situations.
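+
+A hedged sketch of an entry that ignores partial errors but stops the task once a given number of them
+has been seen. The placement of `stopAfter` inside `reaction`, the empty `ignore` element, and the
+values `1` and `100` are assumptions made for illustration; please check them against the schema of your
+midPoint version.
+
+[source,xml]
+----
+<errorHandlingStrategy>
+    <entry>
+        <!-- Evaluated before entries with a larger or missing order. -->
+        <order>1</order>
+        <situation>
+            <status>partial_error</status>
+        </situation>
+        <reaction>
+            <!-- Assumed syntax: continue processing; the error is still recorded. -->
+            <ignore/>
+            <!-- Assumed placement: stop the task after 100 such errors. -->
+            <stopAfter>100</stopAfter>
+        </reaction>
+    </entry>
+</errorHandlingStrategy>
+----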
+
+The `retryLater` reaction has itself the following properties:
+
+[%autowidth]
+[%header]
+|===
+| Property | Meaning | Default
+
+| `initialInterval`
+| Initial retry interval.
+| 30 minutes
+
+| `nextInterval`
+| Interval for subsequent retries, after the initial one.
+| 6 hours
+
+| `retryLimit`
+| Maximal number of retries to attempt.
+| unlimited
+|===
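+
+A minimal sketch of a `retryLater` reaction that relies on the defaults listed above. Only `retryLimit`
+is set (the value `5` is arbitrary). The comments restate the documented defaults; the ISO 8601
+duration spellings in parentheses are our own rendering of those defaults.
+
+[source,xml]
+----
+<reaction>
+    <retryLater>
+        <!-- initialInterval omitted: defaults to 30 minutes (PT30M) -->
+        <!-- nextInterval omitted: defaults to 6 hours (PT6H) -->
+        <retryLimit>5</retryLimit>
+    </retryLater>
+</reaction>
+----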
+
+[NOTE]
+====
+To conclude, the mechanisms described here are all *experimental*. They will be fine-tuned based on users' experiences
+and feedback.
+====