Add file referenced from recent blog (temp)
mederly committed Apr 15, 2021
1 parent 7da55fd commit 3a3b8ab
Showing 1 changed file with 380 additions and 0 deletions.
380 changes: 380 additions & 0 deletions docs/synchronization/live-sync-error-handling-strategy.adoc
@@ -0,0 +1,380 @@
= Task Error Handling
:page-wiki-name: Live sync error handling strategy HOWTO
:page-experimental: true
:page-since: "4.3"
:page-toc: top
:page-keywords: task, error, failure, error handling, error recovery

== Overview

During a task execution, failures can occur.footnote:[Although the words "error" and "failure"
have their precise meaning in the engineering context, we will use them interchangeably and somewhat
freely. In this document they denote any midPoint-detected problem in processing, represented
by the appropriate operation status: either `FATAL_ERROR` or `PARTIAL_ERROR`.] Some of them affect
the whole task. A typical example is a misconfigured or unreachable resource. These failures usually
cause the task either to suspend, or to repeat its execution at a defined time.

However, there can
also be failures that are limited to individual objects. They are often caused by issues
during provisioning of changes to a target resource. There can also be failures related
to mapping evaluation, for example because of a programming error. Yet another class of problems
are those related directly to the source objects, e.g. if there are major data quality issues that
prevent these objects from being processed in any reasonable way, or even technical issues preventing
them from being fetched from the resource in the first place.

This document deals with the latter ones, i.e. failures related to processing of individual objects.
So we will consider a situation in which, during task execution, a subset of the objects fails to
be processed, while the rest proceeds successfully. The ratio of failed objects can be negligible
(like less than one in a thousand), moderate, or they may even represent the majority of all objects.

By "error handling" we mean here the mechanisms provided by midPoint to help the administrator handle
these kinds of failures (errors).

== The State Before

In midPoint 4.2 and before, the error handling in tasks was quite rudimentary.

For common tasks like reconciliation, recomputation, bulk actions execution, or import from
resource, the only way to handle failures that had occurred was to re-run the task.
That way, all the objects were re-processed: those that had been processed successfully were
(most often) processed successfully again, and those that had failed were (hopefully) processed
correctly this time. The overhead of this solution depended on the ratio of failed objects. For
very large sets of objects (say, millions) it could be considered wasteful to repeat the whole
processing for a relatively small number of failed ones.footnote:[Administrators often had to resort
to clever hacks, like trying to identify patterns of failures, and then formulating those patterns
as object filters that were used in repeated task runs. However, this was generally tedious and
applicable only in some situations.]

A special category of tasks is _live synchronization_. Here midPoint provided two ways
of handling errors:

. stop processing when an error is encountered, until the error is fixed;

. ignore any errors and just continue processing.

The former option is safe, but can result in unnecessary delays in processing, mainly if errors
occur relatively often. The latter eliminates delays, but results in missing updates and therefore
resource vs. midPoint state inconsistency.

== Error Handling in 4.3

As part of the link:/midpoint/projects/midscale/[midScale project], we have experimentally implemented
two distinct task error handling mechanisms.

=== Operation Execution Records

When an object is processed by a task, an _operation execution record_ is attached to the object.footnote:[Actually,
there are two kinds of operation execution records: operation-level records (sometimes
called "complex") and modification execution records (sometimes called "simple"). We now talk
about the former ones. In midPoint 4.2 and before, we did not explicitly differentiate between
these two, and the support for operation-level records was incomplete.] That record carries
information about whether the processing was successful or not. What is crucial for error handling is
that these records allow us to easily select failed objects for re-processing, without the need
to go through all the objects.

The practical use of this feature looks like this:

1. A main task `M` is run, processing a set of objects. Some of these objects encountered errors.
Respective operation execution records are created for them.
2. Then (when the system administrator decides) another task is run, aimed at these erroneous
objects. Let's call this task the recoverer (`R`). It has the following characteristics:

* It usually has the same type as `M`. For example, if `M` is an import task, then `R` is usually
an import task as well. Other significant parameters, like the specific bulk action to execute, should
also match.footnote:[This is not a strict rule. There can be situations when, for example, the main task is a bulk action
task, and the recoverer is a recomputation task. Or the recoverer can use a different bulk action
than was used in the main task, if needed.]

* It operates on the same set of objects (specified e.g. by resource reference, object class,
kind, intent, and/or a query) but with so-called _failed objects selector_ added. This selector
specifies e.g. result states that should be matched (e.g. fatal error, partial error, warning),
reference to the main task(s)footnote:[A single recoverer can treat multiple main tasks.
Also, a recoverer can be the same task as the main one, with just the selector added.], or
the time interval when the error occurred.

3. The recoverer then goes through the failed objects, according to the original set specification
combined with the failed objects selector, and tries to process them. The errors occurring in
this task can be later handled again.

=== Triggers

Another option is to automatically schedule any failed object for re-processing using _triggers_.
This mechanism is currently limited to synchronization tasks (import, reconciliation,
live synchronization) and works like this:

1. An error is encountered during processing of a resource object shadow in a task.

2. If the appropriate configuration is set, a trigger is created on the respective resource object
shadow, reminding midPoint that the shadow should be synchronized again. The time interval for the
trigger is configurable.

3. When the specified time arrives, the `Trigger scanner` task retrieves the shadow and ensures that
it is re-synchronized.

4. If the repeated processing is successful, the process ends here.
If not, another trigger (with an interval that may be the same or different) is set up,
and the process repeats.

5. If the process is not successful even after the specified number of repetitions, the process ends.

=== Which Approach to Use

Each of the options described has its own strengths and limitations. These are summarized
in the table below.

[%autowidth]
[%header]
|===
| Feature | Operation Execution Records | Triggers | Comment

| Applicability
| Any kind of object processed by (almost) any task.
| Shadows, processed by synchronization tasks.
|

| Extra configuration required
| Yes. A recoverer task should (usually) be set up, including careful specification of the failed objects selector.
| No. Trigger scanner takes care of everything. Only the retry strategy has to be set up
in the main task.
|

|===

TODO any other differences?

== Configuration Samples and Reference

=== Operation Execution Records

An example of a recoverer task:

[source,xml]
----
<task oid="e06f3f5c-4acc-4c6a-baa3-5c7a954ce4e9"
      xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
      xmlns:ext="http://midpoint.evolveum.com/xml/ns/public/model/extension-3"
      xmlns:ri="http://midpoint.evolveum.com/xml/ns/public/resource/instance-3">
    <name>Import: retry errors</name>
    <extension>
        <ext:kind>account</ext:kind>
        <ext:intent>default</ext:intent>
        <ext:objectclass>ri:AccountObjectClass</ext:objectclass>
        <ext:failedObjectsSelector>
            <taskRef oid="e06f3f5c-4acc-4c6a-baa3-5c7a954ce4e9" />
            <timeFrom>2021-02-18T15:00:00.342+01:00</timeFrom>
        </ext:failedObjectsSelector>
    </extension>
    <ownerRef oid="00000000-0000-0000-0000-000000000002"/>
    <executionStatus>runnable</executionStatus>
    <handlerUri>http://midpoint.evolveum.com/xml/ns/public/model/synchronization/task/import/handler-3</handlerUri>
    <objectRef oid="a1c7dcb8-07f8-4626-bea7-f10d9df7ec9f" type="ResourceType"/>
    <recurrence>single</recurrence>
</task>
----

The `failedObjectsSelector` can have the following items:

[%autowidth]
[%header]
|===
| Item | Description | Default

| `status`
| What operation result statuses to select.
| `FATAL_ERROR` and `PARTIAL_ERROR`

| `taskRef`
| What task(s) to look for when checking operation execution records?
| The current task.

| `timeFrom`
| What is the earliest time of the record to be considered? This is important because
the old execution records are not deleted automatically when an object is re-processed, unless one of the following occurs:
either the recoverer task is the same as the main task (then the result
is replaced by the new one), or a defined limit for operation execution records is reached. Then
the oldest ones are purged.

Therefore, one has to set up this information carefully to avoid repeated processing
of already processed objects.
| No limit.

| `timeTo`
| What is the latest time of the record to be considered?
| If no explicit task is specified, then it is the last start timestamp of the current
task's root. If the task is different, then there is no limit by default.

| `selectionMethod`
| How failed objects are selected. This is to overcome some technological obstacles in
object searching in the provisioning module. Normally, there is no need to override the default
value.
| `default`
|===
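
As an illustration of the items above, the following fragment sketches a selector that retries only fatal errors recorded by a separate main task within a fixed time window. It is a sketch under assumptions, not a verified configuration: the task OID is a made-up placeholder, the timestamps are arbitrary, and the lowercase `fatal_error` spelling is assumed by analogy with the `status` value used in the error handling strategy sample in this document.

[source,xml]
----
<ext:failedObjectsSelector>
    <!-- Assumed value format: only fatal errors are retried, partial errors are skipped -->
    <status>fatal_error</status>
    <!-- Hypothetical OID of the main task whose records should be considered -->
    <taskRef oid="11111111-2222-3333-4444-555555555555"/>
    <!-- Only records created within this (arbitrary) window are taken into account -->
    <timeFrom>2021-02-18T15:00:00.000+01:00</timeFrom>
    <timeTo>2021-02-19T15:00:00.000+01:00</timeTo>
</ext:failedObjectsSelector>
----

Setting both `timeFrom` and `timeTo` explicitly is useful here because, for a task other than the current one, there is no upper time limit by default.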

The selection method has the following values:

[%autowidth]
[%header]
|===
| Item | Description
| `default` | When searching for shadows via provisioning, `fetchFailedObjects`; otherwise `narrowQuery`.
| `narrowQuery` | Simply narrow the original query by adding failed objects filter.
It works with repository but usually not with provisioning.
| `fetchFailedObjects` | Failed objects are selected using the repository. Only after that, they are fetched
one-by-one via provisioning and processed. This is preferable when there is only
a small percentage of failed records.
| `filterAfterRetrieval` | Uses the original query to retrieve objects from a resource. Filtering is
done afterwards, i.e. before results are passed to the processing. This is preferable when there is
a large percentage of failed records.
|===
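
If the default does not fit, for example because most of the objects have failed, the selection method could be overridden roughly as follows. This is a sketch only: the task OID is a made-up placeholder, and placing `selectionMethod` next to `taskRef` inside the selector is assumed from the item table above.

[source,xml]
----
<ext:failedObjectsSelector>
    <!-- Hypothetical OID of the main task -->
    <taskRef oid="11111111-2222-3333-4444-555555555555"/>
    <!-- Retrieve everything with the original query, then keep only the failed objects -->
    <selectionMethod>filterAfterRetrieval</selectionMethod>
</ext:failedObjectsSelector>
----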

=== Triggers

An example of configuration of error handling strategy using triggers:

[source,xml]
----
<task oid="2d7f0709-3e9b-4b92-891f-c5e1428b6458"
      xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
      xmlns:ext="http://midpoint.evolveum.com/xml/ns/public/model/extension-3"
      xmlns:ri="http://midpoint.evolveum.com/xml/ns/public/resource/instance-3">
    <name>Live Sync</name>
    <extension>
        <ext:objectclass>ri:AccountObjectClass</ext:objectclass>
    </extension>
    <ownerRef oid="00000000-0000-0000-0000-000000000002"/>
    <executionStatus>runnable</executionStatus>
    <handlerUri>http://midpoint.evolveum.com/xml/ns/public/model/synchronization/task/live-sync/handler-3</handlerUri>
    <objectRef oid="a20bb7b7-c5e9-4bbb-94e0-79e7866362e6" type="ResourceType"/>
    <recurrence>single</recurrence>
    <errorHandlingStrategy>
        <entry>
            <situation>
                <errorCategory>generic</errorCategory>
            </situation>
            <reaction>
                <retryLater>
                    <initialInterval>PT30M</initialInterval>
                    <nextInterval>PT1H</nextInterval>
                    <retryLimit>3</retryLimit>
                </retryLater>
            </reaction>
        </entry>
        <entry>
            <situation>
                <errorCategory>configuration</errorCategory>
                <status>fatal_error</status>
            </situation>
            <reaction>
                <retryLater>
                    <initialInterval>P1D</initialInterval>
                    <nextInterval>P3D</nextInterval>
                    <!-- no retry limit -->
                </retryLater>
            </reaction>
        </entry>
    </errorHandlingStrategy>
</task>
----

In this sample, after a generic error is encountered, the retry is attempted after 30 minutes. The next retries
are done after 1 hour. The process stops after 4 attempts. However, if the error was configuration-related
(with the status of `FATAL_ERROR`), then the initial interval is 1 day, with further retries after 3 days,
and without an attempt limit.

Generally, the `errorHandlingStrategy` contains a list of entries. Each entry has:

[%autowidth]
[%header]
|===
| Item | Description | Default
| `order` | Order in which this entry is to be evaluated. (Related to other entries.) Smaller numbers
go first. Entries with no order go last. | No order.
| `situation` | A situation that can occur. | Any error.
| `reaction` | What should a task do when a given situation is encountered? | `ignore` or `stop` (see below)
|===
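
The evaluation order can be illustrated with a hedged sketch: a specific entry with an explicit `order` is evaluated first, and an entry without `order` and without `situation` acts as a catch-all that goes last. The `network` category is one of the categories mentioned in this document; the interval is arbitrary, and the empty-element form `<stop/>` is an assumption about the syntax.

[source,xml]
----
<errorHandlingStrategy>
    <entry>
        <!-- Smaller numbers go first, so this entry is evaluated before the catch-all -->
        <order>10</order>
        <situation>
            <errorCategory>network</errorCategory>
        </situation>
        <reaction>
            <retryLater>
                <initialInterval>PT15M</initialInterval>
            </retryLater>
        </reaction>
    </entry>
    <entry>
        <!-- No order: evaluated last. No situation: matches any error. -->
        <reaction>
            <stop/> <!-- assumed syntax for the stop reaction -->
        </reaction>
    </entry>
</errorHandlingStrategy>
----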

A `situation` contains the following:

[%autowidth]
[%header]
|===
| Item | Description | Default
| `status` | Operation result status to match. Can be either `PARTIAL_ERROR` or `FATAL_ERROR`.
| If not present, we decide solely on error category. If error categories are not specified,
any error matches.
| `errorCategory` | Error category (network, security, policy, ...) to match. Note that some errors are not propagated
to the level where they can be recognized by this selector. So be careful and consider this feature
to be highly experimental.
| If not present, we decide solely on the status. If status is not present, any error matches.
|===

The `reaction` is either:

[%autowidth]
[%header]
|===
| Reaction | Description | Note

| `ignore`
| The processing should continue, ignoring the error. E.g. for live sync tasks, this means that
the sync token is advanced to the next item, effectively marking the record as processed.
| This is the default strategy for the majority of tasks.

| `stop`
| The processing is stopped.
| This is the default strategy for live sync and async update tasks.

| `retryLater`
| Processing of the specified account should be retried later using a trigger, as was described.
| This strategy has more parameters, see below.
|===

Notes:

1. Names for these options may be changed in the future, to make them more compatible with
error handling based on operation execution records. (They were created before, and
not revised afterwards.)

2. Operation execution recording is *not* influenced by these settings. So each error
is recorded regardless of the value of `reaction`. This is why error handling based on operation
execution records works well with the default setting of the `ignore` reaction (although
"ignoring" might suggest that the error is not even recorded).

3. Besides these options, you can also specify the `stopAfter` property (applicable to `ignore`
and `retryLater` reactions) that causes the task to be stopped after seeing the specified number
of error situations.
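
A possible use of `stopAfter` is sketched below: partial errors are ignored, but the task is stopped once 100 such situations have been seen. The exact placement is an assumption, as this document does not show it; here `stopAfter` is placed inside `reaction`, next to an empty `ignore` element, and the threshold of 100 is arbitrary.

[source,xml]
----
<entry>
    <situation>
        <status>partial_error</status>
    </situation>
    <reaction>
        <ignore/> <!-- assumed syntax for the ignore reaction -->
        <!-- Assumed placement: stop the task after 100 matching error situations -->
        <stopAfter>100</stopAfter>
    </reaction>
</entry>
----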

The `retryLater` reaction has itself the following properties:

[%autowidth]
[%header]
|===
| Property | Meaning | Default

| `initialInterval`
| Initial retry interval.
| 30 minutes

| `nextInterval`
| Next retry interval, after initial attempt.
| 6 hours

| `retryLimit`
| Maximal number of retries to attempt.
| unlimited
|===

[NOTE]
====
To conclude, the mechanisms described here are all *experimental*. They will be fine-tuned based on users' experiences
and feedback.
====
