Add page on fuzzy search

Evolveum · Sep 9, 2022 · 17c2415 · 17c2415
1 parent 1d2fa45
commit 17c2415

Showing 2 changed files with 241 additions and 3 deletions.
diff --git a/docs/correlation/fuzzy-logic.adoc b/docs/correlation/fuzzy-logic.adoc
@@ -1,3 +1,239 @@
 = Fuzzy Logic
+:page-toc: top
+:page-since: "4.6"
 
-#TODO#
+IMPORTANT: This feature is available only when using the xref:/midpoint/reference/repository/native-postgresql/[native repository implementation].
+
+For an introduction, please see the xref:/midpoint/reference/correlation/#fuzzy-matching[Fuzzy Matching] section in the overview document.
+
+== Fuzzy Search Methods
+
+Currently, there are two methods available:
+
+.Fuzzy string matching methods
+[%header]
+[%autowidth]
+|===
+| Method | Description
+| Levenshtein edit distance
+| Matches according to the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.
+(From link:https://en.wikipedia.org/wiki/Levenshtein_distance[Wikipedia].)
+| Trigram similarity
+| Matches using the ratio of common trigrams to all trigrams in the compared strings.
+(See link:https://www.postgresql.org/docs/current/pgtrgm.html[PostgreSQL documentation on `pg_trgm` module].)
+|===
+
+== Using in Correlation
+
+=== Specification of the Filters
+
+The fuzzy search filters are specified in the `search/fuzzy` configuration item.
+Let us have a look at an example that searches for users whose family name is within a Levenshtein distance of at most 3 from the provided one.
+
+.Listing 1. Correlation looking for family name "close enough" to the provided one
+[source,xml]
+----
+<objectTemplate>
+    ...
+    <correlation>
+        <correlators>
+            <items>
+                <item>
+                    <ref>familyName</ref>
+                    <search>
+                        <fuzzy>
+                            <levenshtein>
+                                <threshold>3</threshold>
+                            </levenshtein>
+                        </fuzzy>
+                    </search>
+                </item>
+            </items>
+        </correlators>
+    </correlation>
+</objectTemplate>
+----
+
+There are the following options available:
+
+.Configuration properties for Levenshtein edit distance search
+[%header]
+[%autowidth]
+|===
+| Property | Description | Default
+| `threshold` | Upper limit on the edit distance to be matched. | Must be specified.
+| `inclusive` | Is the value of "threshold" meant to be inclusive? | `true`
+|===
+
+.Configuration properties for trigram similarity search
+[%header]
+[%autowidth]
+|===
+| Property | Description | Default
+| `threshold` | Lower limit on the similarity to be matched. | Must be specified.
+| `inclusive` | Is the value of "threshold" meant to be inclusive? | `true`
+|===
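+
+As a minimal sketch of how these properties combine, the following fragment matches given names whose trigram similarity is strictly greater than 0.5.
+It assumes that `inclusive` is written as a sibling of `threshold` inside the method element; the listings on this page show only `threshold`, so please treat that placement as an assumption.
+
+.Trigram similarity search with an exclusive threshold (illustrative sketch)
+[source,xml]
+----
+<item>
+    <ref>givenName</ref>
+    <search>
+        <fuzzy>
+            <similarity>
+                <threshold>0.5</threshold>
+                <!-- Assumed placement: "inclusive" next to "threshold" -->
+                <inclusive>false</inclusive>
+            </similarity>
+        </fuzzy>
+    </search>
+</item>
+----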
+
+=== Confidence Values
+
+When using fuzzy search, not all search results are equally relevant.
+Typically, the higher the Levenshtein edit distance, the lower the confidence we have in the particular match.
+On the other hand, the higher the trigram similarity value, the higher the confidence.
+
+Therefore, midPoint allows you to specify a transformation from the fuzzy string metric (edit distance or similarity value) to a confidence value in the interval (0, 1].
+
+An example:
+
+.Listing 2. Deriving confidence value from Levenshtein edit distance
+[source,xml]
+----
+<objectTemplate>
+    ...
+    <correlation>
+        <correlators>
+            <items>
+                <item>
+                    <ref>familyName</ref>
+                    <search>
+                        <fuzzy>
+                            <levenshtein>
+                                <threshold>3</threshold>
+                            </levenshtein>
+                        </fuzzy>
+                        <confidence>
+                            <expression>
+                                <script>
+                                    <code>1 / (input+1)</code>
+                                </script>
+                            </expression>
+                        </confidence>
+                    </search>
+                </item>
+            </items>
+        </correlators>
+    </correlation>
+</objectTemplate>
+----
+
+The confidence expression provides a custom confidence value for a match based on the Levenshtein edit distance, as follows:
+
+.Transformation of edit distance to a confidence value by the expression in Listing 2
+[%header]
+[%autowidth]
+|===
+| Edit distance | Resulting confidence
+| 0 (exact match) | 1.0
+| 1 | 0.5
+| 2 | 0.333
+| 3 | 0.25
+|===
+
+This computation may or may not fit your needs.
+You may provide any reasonable expression to do the computation.
+
+If the confidence expression is not specified, the confidence is set to 1.0.
+(It may be later weighted when rules are composed together, as described in the xref:/midpoint/reference/correlation/rule-composition/[rule composition] document.)
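+
+If the reciprocal curve used in Listing 2 drops too quickly for your data, a linear mapping is one possible alternative.
+The sketch below assumes the same threshold of 3 and that `input` holds the edit distance, just like in Listing 2.
+The divisor 4 (the threshold plus one) is an illustrative choice that keeps the result within (0, 1].
+
+.Linear confidence expression for Levenshtein edit distance (illustrative sketch)
+[source,xml]
+----
+<confidence>
+    <expression>
+        <script>
+            <!-- Hypothetical alternative to Listing 2; assumes threshold = 3 -->
+            <code>1 - input / 4</code>
+        </script>
+    </expression>
+</confidence>
+----
+
+Assuming decimal division, as the results for Listing 2 imply, edit distances 0, 1, 2, and 3 would map to confidences 1.0, 0.75, 0.5, and 0.25, respectively.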
+
+==== Multiple Correlation Items
+
+If there are multiple correlation items in a given correlation rule, their confidences are multiplied.
+
+An example:
+
+.Listing 3. Two items to be matched by a fuzzy search
+[source,xml]
+----
+<objectTemplate>
+    ...
+    <correlation>
+        <correlators>
+            <items>
+                <item>
+                    <ref>givenName</ref>
+                    <search>
+                        <fuzzy>
+                            <similarity>
+                                <threshold>0.5</threshold>
+                            </similarity>
+                        </fuzzy>
+                        <confidence>
+                            <expression>
+                                <script>
+                                    <code>input</code>
+                                </script>
+                            </expression>
+                        </confidence>
+                    </search>
+                </item>
+                <item>
+                    <ref>familyName</ref>
+                    <search>
+                        <fuzzy>
+                            <levenshtein>
+                                <threshold>3</threshold>
+                            </levenshtein>
+                        </fuzzy>
+                        <confidence>
+                            <expression>
+                                <script>
+                                    <code>1 / (input+1)</code>
+                                </script>
+                            </expression>
+                        </confidence>
+                    </search>
+                </item>
+            </items>
+        </correlators>
+    </correlation>
+</objectTemplate>
+----
+
+The confidence factor for `givenName` is defined to be equal to the trigram similarity value.
+The confidence factor for `familyName` is defined just like in the example above.
+
+For example, if a correlation candidate has a given name with the similarity of 0.8 and a family name with the edit distance of 1, its confidence is computed as:
+
+.Example of the confidence computation
+[%header]
+[%autowidth]
+|===
+| Property | Fuzzy search metric value | Confidence factor
+| `givenName` | 0.8 | 0.8
+| `familyName` | 1 | 0.5
+2+| *Overall confidence* | *0.4* (= 0.8 x 0.5)
+|===
+
+== Using in Filters
+
+[WARNING]
+====
+The use of fuzzy matching outside correlation is highly experimental.
+In particular, matching of `PolyString` values does not work as expected.
+Also, the serialization format may change in the future.
+
+Here we describe it only for educational purposes - to emphasize the fact that correlation is ultimately implemented using regular queries.
+====
+
+.Listing 4. Sample Levenshtein distance query in XML
+[source,xml]
+----
+<q:query xmlns:q="http://prism.evolveum.com/xml/ns/public/query-3">
+    <q:filter>
+        <q:fuzzyStringMatch>
+            <q:path>familyName</q:path>
+            <q:value>gren</q:value>
+            <q:method>
+                <q:levenshtein>
+                    <q:threshold>3</q:threshold>
+                </q:levenshtein>
+            </q:method>
+        </q:fuzzyStringMatch>
+    </q:filter>
+</q:query>
+----
+
+.Listing 5. Sample trigram similarity filter in Axiom
+[source,axiom]
+----
+familyName similarity ('gren', 0.5, true)
+----
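+
+For comparison, the trigram similarity filter from Listing 5 might be written in XML by analogy with Listing 4.
+This is only a sketch: the `q:similarity` and `q:inclusive` element names are assumptions inferred from the correlation configuration above, and the serialization format may change, as noted in the warning.
+
+.Trigram similarity query in XML (illustrative sketch, element names assumed)
+[source,xml]
+----
+<q:query xmlns:q="http://prism.evolveum.com/xml/ns/public/query-3">
+    <q:filter>
+        <q:fuzzyStringMatch>
+            <q:path>familyName</q:path>
+            <q:value>gren</q:value>
+            <q:method>
+                <!-- Assumed counterpart of q:levenshtein from Listing 4 -->
+                <q:similarity>
+                    <q:threshold>0.5</q:threshold>
+                    <q:inclusive>true</q:inclusive>
+                </q:similarity>
+            </q:method>
+        </q:fuzzyStringMatch>
+    </q:filter>
+</q:query>
+----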
diff --git a/docs/correlation/index.adoc b/docs/correlation/index.adoc
@@ -243,6 +243,8 @@ Please see xref:/midpoint/reference/correlation/rule-composition/[rule compositi
 
 === Custom Indexing
 
+IMPORTANT: This feature is available only when using the xref:/midpoint/reference/repository/native-postgresql/[native repository implementation].
+
 Sometimes, we need to base the search on specially-indexed data.
 For example, we could need to match only first five normalized characters of the surname.
 Or, we could want to take only digits into account when searching for the national ID.
@@ -324,7 +326,7 @@ If there are multiple normalizations defined for a given focus item (and none is
 
 Please see xref:/midpoint/reference/correlation/custom-indexing/[custom indexing] and xref:/midpoint/reference/correlation/items-correlator/[`items` correlator] for more information.
 
-=== Fuzzy Logic
+=== Fuzzy Matching
 
 By default, the searching is done using "exact match" criteria, either on original values or on the ones that underwent the standard or custom normalization.
 Sometimes, however, we want to search for objects that have a property value somewhat similar to the value we have at hand.
@@ -345,7 +347,7 @@ To do this, a fuzzy search logic was implemented. There are two methods availabl
 (See link:https://www.postgresql.org/docs/current/pgtrgm.html[PostgreSQL documentation on `pg_trgm` module].)
 |===
 
-NOTE: The fuzzy search is implemented for the native PostgreSQL-based repository only.
+IMPORTANT: The fuzzy search is available only when using the xref:/midpoint/reference/repository/native-postgresql/[native repository implementation].
 
 An example that searches for users having given name and family name close to the provided ones.
 The given name has to have Levenshtein edit distance (to the provided one) at most 3.