Update rule-composition.adoc
mederly committed Sep 13, 2022
1 parent c385799 commit fe8e388
Showing 1 changed file with 95 additions and 15 deletions.
110 changes: 95 additions & 15 deletions docs/correlation/rule-composition.adoc
@@ -15,44 +15,55 @@ Correlators are hierarchical, with a specified default algorithm for combining t
In midPoint 4.6 we support only the flat hierarchy: a composite correlator defined on top, with component correlators right beneath it.
Please see <<Limitations>> section at the end of this document.

== Confidence Values

The result of the correlator(s) evaluation is a set of _candidates_, each with a _confidence value_.

A confidence value indicates how strongly the correlator(s) believe that this candidate is the one we are looking for.
It is a decimal number from the interval of (0, 1].
This means that it should be greater than zero (otherwise, the candidate would not be listed at all), and can be as large as 1.
By default, the value of 1 means that the correlator is sure that this is the candidate it has been looking for.

== Composition Algorithm Outline

How is the result determined?

Individual correlators are evaluated in a defined order (see <<Tiers and Rule Ordering>>).

Each correlator produces a set of _candidates_ having zero, one, or more objects.
Each candidate has a (local) _confidence_ value from the interval of (0, 1].
Each correlator has its own _weight_ that is used as a multiplication factor for the local confidence values produced by the correlator.
(For convenience, a global _scale_ can be defined. It can be used to re-scale the confidence values to the interval of (0, 1].)

After the evaluation, a union of all candidate sets is created, and the overall confidence for each candidate is computed:
After the evaluation, a union of all candidate sets is created, and the total confidence for each candidate is computed:

image::confidence-formula.png[Confidence formula,width=600,pdfwidth=50%,scaledwidth=50%]

Where

- _total confidence~cand~_ is the total confidence for candidate _cand_ (being computed),
- _confidence~cand,cor~_ is the confidence of candidate _cand_ provided by child correlator _cor_,
- _weight~cor~_ is the weight of child correlator _cor_ in the composition (a parameter of the composition; default is 1),
- _scale_ is the scale of the given composite correlator (a parameter of the composition; default is 1).
- _confidence~cand,cor~_ is the confidence of candidate _cand_ provided by the correlator _cor_,
- _weight~cor~_ is the weight of the correlator _cor_ in the composition (it is a parameter of the composition; default is 1),
- _scale_ is the scale of the given composition (it is a parameter of the composition; default is 1).
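
The formula itself is only available as an image, so the following is a minimal Python sketch of the computation as the surrounding text describes it: local confidences are multiplied by the correlator weights, summed per candidate, re-scaled, and cropped to 1. Dividing the weighted sum by `scale` is an assumption, as are the function and variable names; this is not midPoint code.

```python
def total_confidence(local_confidences, weights, scale=1.0):
    """Combine per-correlator confidences for one candidate.

    local_confidences: dict mapping correlator name -> confidence in (0, 1]
                       (correlators that did not match the candidate are absent)
    weights:           dict mapping correlator name -> weight (default 1)
    scale:             assumed here to divide the weighted sum (default 1)
    """
    weighted_sum = sum(
        confidence * weights.get(name, 1.0)
        for name, confidence in local_confidences.items()
    )
    # Crop the result so that it never exceeds 1.
    return min(1.0, weighted_sum / scale)


# Weights of the sample rules used throughout this document.
WEIGHTS = {"name-date-id": 1.0, "names-date": 0.4, "id": 0.4}

# A candidate matched by `name-date-id` and `id`, each with local confidence 1.
both = total_confidence({"name-date-id": 1.0, "id": 1.0}, WEIGHTS)

# A candidate matched by `id` only.
id_only = total_confidence({"id": 1.0}, WEIGHTS)
```

In this sketch, a candidate matched by both `name-date-id` and `id` ends up with a weighted sum above 1, which is then cropped.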

== A Naive Example

=== Rules

Let us have rules like these:
Let us have the following rules:

.Sample set of correlation rules
[%header]
[%autowidth]
|===
| Rule name | Rule content | Weight
| name-date-id
| `name-date-id`
| Family name, date of birth, and national ID exactly match.
| 1.0
| names-date
| `names-date`
| Given name, family name, and date of birth exactly match.
| 0.4
| id
| `id`
| The national ID exactly matches.
| 0.4
|===
@@ -130,16 +141,16 @@ The total confidence is [purple]*1.4*, cropped down to [purple]*1.0*.

== "Ignore if Matched by" Flag

We see that the match of the rule `name-date-id` implies the match of the rule `id`.
After a quick look, we see that the match of the rule `name-date-id` implies the match of the rule `id`.
Hence, each candidate matching `name-date-id` gets a confidence increment of *1.4*.
This is, most probably, not the behavior that we expect.
(While not necessarily incorrect, it is quite counter-intuitive.)

Therefore, we have introduced a mechanism to mark rule `id` as being ignored for those candidates that are matched by rule `name-date-id` before.
Therefore, midPoint has a mechanism to mark rule `id` as _ignored_ for those candidates that were already matched by rule `name-date-id`.
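
The effect of ignoring a rule can be sketched in Python as follows. This is an illustration under stated assumptions, not midPoint's implementation: the function `weighted_sum_with_ignore` and the `IGNORE_IF_MATCHED_BY` mapping are hypothetical names, and the sum is deliberately left uncropped so that the difference is visible.

```python
WEIGHTS = {"name-date-id": 1.0, "names-date": 0.4, "id": 0.4}

# Hypothetical mapping: rule -> rules whose match causes it to be ignored.
IGNORE_IF_MATCHED_BY = {"id": {"name-date-id"}}


def weighted_sum_with_ignore(matched, weights, ignore_if_matched_by):
    """Sum weighted confidences for one candidate, skipping ignored rules.

    matched: dict mapping rule name -> local confidence for this candidate.
    Returns the uncropped sum.
    """
    total = 0.0
    for rule, confidence in matched.items():
        blockers = ignore_if_matched_by.get(rule, set())
        if any(blocker in matched for blocker in blockers):
            continue  # the rule is ignored for this candidate
        total += confidence * weights.get(rule, 1.0)
    return total


matched = {"name-date-id": 1.0, "id": 1.0}

# Without the flag, both rules contribute.
naive = weighted_sum_with_ignore(matched, WEIGHTS, {})

# With the flag, `id` is skipped because `name-date-id` matched as well.
adjusted = weighted_sum_with_ignore(matched, WEIGHTS, IGNORE_IF_MATCHED_BY)
```

A candidate matched by `id` alone is unaffected, because the rule that would block it did not match.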

=== Configuration

It is done by setting `ignoreIfMatchedBy` like this:
This is done by setting `ignoreIfMatchedBy` as follows:

.Listing 2. Ignoring `id` rule for candidates matching `name-date-id`
[source,xml]
@@ -201,6 +212,73 @@ If we do, we finish the computation.
If there is no certain candidate, we continue.
We continue also in case there are multiple certain candidates, although this situation indicates there is something wrong with the correlation rules.

=== Configuration

.Listing 3. Dividing the computation into tiers
[source,xml]
----
<correlators>
    <items>
        <name>name-date-id</name>
        <documentation>
            If key attributes (family name, date of birth, national ID) exactly match,
            we are immediately done. We ignore given name here.
        </documentation>
        <item>
            <ref>familyName</ref>
        </item>
        <item>
            <ref>extension/dateOfBirth</ref>
        </item>
        <item>
            <ref>extension/nationalId</ref>
        </item>
        <composition>
            <tier>1</tier>
        </composition>
    </items>
    <items>
        <name>names-date</name>
        <documentation>If given and family name and the date of birth match, we present an option to the operator.</documentation>
        <item>
            <ref>givenName</ref>
        </item>
        <item>
            <ref>familyName</ref>
        </item>
        <item>
            <ref>extension/dateOfBirth</ref>
        </item>
        <composition>
            <tier>2</tier> <!--1-->
            <order>10</order> <!--2-->
            <weight>0.4</weight>
        </composition>
    </items>
    <items>
        <name>id</name>
        <documentation>If national ID matches, we present an option to the operator.</documentation>
        <item>
            <ref>extension/nationalId</ref>
        </item>
        <composition>
            <tier>2</tier> <!--1-->
            <order>20</order> <!--2-->
            <weight>0.4</weight>
        </composition>
    </items>
</correlators>
----
<1> Tier number for the last tier can be omitted.
<2> The ordering is not important here.

Note that it is not necessary to specify the last tier, which is number 2 in this case.
This is because an unnumbered tier always goes last.

Also, ordering within a single tier is usually not needed.
This case is no exception.
We provide the ordering information only to illustrate how it can be done.
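
The tier-by-tier evaluation can be sketched in Python as follows. This is a hedged illustration, not midPoint's implementation: the function names are hypothetical, a "certain" candidate is assumed to be one whose total confidence has reached 1, the scale is assumed to be 1, and rules are represented as plain dictionaries.

```python
import math
from itertools import groupby


def tier_key(rule):
    # An unnumbered tier always goes last.
    tier = rule.get("tier")
    return math.inf if tier is None else tier


def order_key(rule):
    # Rules without an explicit order are placed after the ordered ones (assumption).
    order = rule.get("order")
    return math.inf if order is None else order


def evaluate_in_tiers(rules, evaluate_rule):
    """rules: list of dicts with 'name', optional 'tier', 'order', 'weight'.
    evaluate_rule: callable(rule) -> dict of candidate -> local confidence.
    Returns a dict of candidate -> total confidence.
    """
    totals = {}
    ordered = sorted(rules, key=lambda r: (tier_key(r), order_key(r)))
    for _, tier_rules in groupby(ordered, key=tier_key):
        for rule in tier_rules:
            for candidate, confidence in evaluate_rule(rule).items():
                increment = confidence * rule.get("weight", 1.0)
                totals[candidate] = min(1.0, totals.get(candidate, 0.0) + increment)
        certain = [c for c, total in totals.items() if total >= 1.0]
        if len(certain) == 1:
            break  # exactly one certain candidate: finish the computation
    return totals


RULES = [
    {"name": "name-date-id", "tier": 1, "weight": 1.0},
    {"name": "names-date", "tier": 2, "order": 10, "weight": 0.4},
    {"name": "id", "tier": 2, "order": 20, "weight": 0.4},
]

# Hypothetical match results for a single candidate called "john".
MATCHES = {"name-date-id": {"john": 1.0}, "names-date": {}, "id": {"john": 1.0}}

evaluated = []


def evaluate_rule(rule):
    evaluated.append(rule["name"])
    return MATCHES[rule["name"]]


totals = evaluate_in_tiers(RULES, evaluate_rule)
```

In this run, the first tier already yields one certain candidate, so the rules in the second tier are never evaluated.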

=== Example Computation

Now, when correlating `Ian Smith, 2004-02-06, 040206/1328` with the candidate being `John Smith, 2004-02-06, 040206/1328`,
@@ -226,10 +304,10 @@ In midPoint 4.6, the resulting aggregated confidence values for individual candi
[%autowidth]
|===
| Value | Description
| _Definite match threshold_ (`DM`)
| Definite match threshold (`DM`)
| If a confidence value is equal to or greater than this one, the candidate is considered to definitely match the identity data.
(If, for some reason, multiple candidates reach this threshold, a human decision is requested.)
| _Candidate match threshold_ (`CM`)
| Candidate match threshold (`CM`)
| If a confidence value is below this one, the candidate is not considered to be matching at all - not even for a human decision.
|===
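
A minimal Python sketch of how the two thresholds partition the candidates is shown below. The function `classify` and the threshold values used in the example are hypothetical (they are not claimed to be midPoint's defaults), and the handling of weaker candidates when exactly one definite match exists is an assumption.

```python
def classify(totals, definite_threshold, candidate_threshold):
    """Decide the outcome for a dict of candidate -> total confidence.

    Returns a tuple (outcome, candidates).
    """
    definite = [c for c, v in totals.items() if v >= definite_threshold]
    possible = [
        c for c, v in totals.items()
        if candidate_threshold <= v < definite_threshold
    ]
    if len(definite) == 1:
        # Exactly one candidate reaches the definite match threshold.
        return ("definite match", definite)
    if definite or possible:
        # Several definite candidates, or only weaker ones: ask a human.
        return ("human decision", definite + possible)
    # Everything is below the candidate match threshold.
    return ("no match", [])


# Hypothetical thresholds chosen for illustration only.
DM, CM = 1.0, 0.3

one_definite = classify({"a": 1.0, "b": 0.4}, DM, CM)
uncertain = classify({"a": 0.4, "b": 0.1}, DM, CM)
nothing = classify({"b": 0.1}, DM, CM)
```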

@@ -241,7 +319,7 @@ Said in other words:

=== Default values

.Default values for the threshold
.Default values for the thresholds
[%header]
[%autowidth]
|===
@@ -252,7 +330,7 @@ Said in other words:

=== Configuration

.Listing 3. Setting the thresholds
.Listing 4. Setting the thresholds
[source,xml]
----
<correlation>
@@ -274,6 +352,8 @@ Although it is possible to configure arbitrary combination of the correlators, a
. Filter-based correlators cannot be combined with the other ones.
. Expression-based correlators are experimental altogether.
. A composite correlator can be provided at the top level only.
(It is the implicit instance of the composite correlator that is not visible in the correlation definition.
It is represented by the root `correlators` configuration item.)

In other words, only the `items` correlators can be combined.
The use of other ones in the composition is considered experimental.
