Add page on fuzzy search
mederly committed Sep 9, 2022
1 parent 1d2fa45 commit 17c2415
Showing 2 changed files with 241 additions and 3 deletions.
238 changes: 237 additions & 1 deletion docs/correlation/fuzzy-logic.adoc
Original file line number Diff line number Diff line change
@@ -1,3 +1,239 @@
= Fuzzy Logic
:page-toc: top
:page-since: "4.6"

#TODO#
IMPORTANT: This feature is available only when using the xref:/midpoint/reference/repository/native-postgresql/[native repository implementation].

For an introduction, please see the xref:/midpoint/reference/correlation/#fuzzy-matching[Fuzzy Matching] section in the overview document.

== Fuzzy Search Methods

Currently, there are two methods available:

.Fuzzy string matching methods
[%header]
[%autowidth]
|===
| Method | Description
| Levenshtein edit distance
| Matches according to the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.
(From link:https://en.wikipedia.org/wiki/Levenshtein_distance[Wikipedia].)
| Trigram similarity
| Matches using the ratio of common trigrams to all trigrams in the compared strings.
(See link:https://www.postgresql.org/docs/current/pgtrgm.html[PostgreSQL documentation on `pg_trgm` module].)
|===
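
To make the two metrics concrete, here is a minimal Python sketch of both.
It is an illustration only, not the code midPoint or PostgreSQL actually runs.
The function names are hypothetical, and the trigram part only approximates `pg_trgm` under stated assumptions: lowercasing, splitting on non-alphanumeric characters, padding each word with two leading spaces and one trailing space, and dividing the number of shared distinct trigrams by the number of all distinct trigrams.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def trigrams(text: str) -> set[str]:
    """Distinct trigrams of all words; padding rule is an assumption modelled on pg_trgm."""
    result = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def trigram_similarity(a: str, b: str) -> float:
    """Ratio of shared distinct trigrams to all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

With these definitions, `levenshtein("gren", "green")` evaluates to 1, and `trigram_similarity("gren", "green")` evaluates to 4/7 (four shared trigrams out of seven distinct ones).
Values computed by the database may differ in detail, so treat the numbers as a guide to the concept rather than a prediction of actual search results.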

== Using in Correlation

=== Specification of the Filters

The fuzzy search filters are specified in the `search/fuzzy` configuration item.
Let us have a look at an example that searches for users whose family name is within a Levenshtein edit distance of at most 3 from the provided one.

.Listing 1. Correlation looking for family name "close enough" to the provided one
[source,xml]
----
<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>3</threshold>
                            </levenshtein>
                        </fuzzy>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
----

The following options are available:

.Configuration properties for Levenshtein edit distance search
[%header]
[%autowidth]
|===
| Property | Description | Default
| `threshold` | Upper limit on the edit distance to be matched. | Must be specified.
| `inclusive` | Is the value of "threshold" meant to be inclusive? | `true`
|===

.Configuration properties for trigram similarity search
[%header]
[%autowidth]
|===
| Property | Description | Default
| `threshold` | Lower limit on the similarity to be matched. | Must be specified.
| `inclusive` | Is the value of "threshold" meant to be inclusive? | `true`
|===
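
The interplay of `threshold` and `inclusive` can be sketched in Python as follows.
This is an assumption-based reading of the two tables above ("upper limit" for the edit distance, "lower limit" for the similarity), not a copy of the actual implementation, and the function names are hypothetical.

```python
def levenshtein_matches(distance: int, threshold: int, inclusive: bool = True) -> bool:
    """Edit distance: the threshold is an upper limit."""
    return distance <= threshold if inclusive else distance < threshold


def similarity_matches(similarity: float, threshold: float, inclusive: bool = True) -> bool:
    """Trigram similarity: the threshold is a lower limit."""
    return similarity >= threshold if inclusive else similarity > threshold
```

Under this reading, a candidate with edit distance exactly 3 matches a `threshold` of 3 with the default `inclusive` = `true`, but not with `inclusive` = `false`.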

=== Confidence Values

When using fuzzy search, not all search results are equally relevant.
Typically, the higher the Levenshtein edit distance, the lower our confidence in the particular match.
On the other hand, the higher the trigram similarity value, the higher the confidence.

Therefore, midPoint allows you to specify a transformation from the fuzzy string metric (edit distance or similarity value) to a confidence value in the interval (0, 1].

An example:

.Listing 2. Deriving confidence value from Levenshtein edit distance
[source,xml]
----
<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>3</threshold>
                            </levenshtein>
                        </fuzzy>
                        <confidence>
                            <expression>
                                <script>
                                    <code>1 / (input+1)</code>
                                </script>
                            </expression>
                        </confidence>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
----

The confidence expression provides a custom confidence value for a match based on the Levenshtein edit distance, as follows:

.Transformation of edit distance to a confidence value by the expression in Listing 2
[%header]
[%autowidth]
|===
| Edit distance | Resulting confidence
| 0 (exact match) | 1.0
| 1 | 0.5
| 2 | 0.333
| 3 | 0.25
|===

This computation may or may not fit your needs.
You may provide any reasonable expression to do the computation.

If the confidence expression is not specified, the confidence is set to 1.0.
(It may later be weighted when rules are composed together, as described in the xref:/midpoint/reference/correlation/rule-composition/[rule composition] document.)
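
The values in the table can be reproduced with a few lines of Python.
This is only a sketch that mirrors the arithmetic of the expression `1 / (input+1)`; the function name is hypothetical, and the real expression is evaluated by midPoint's own expression engine, not by Python.

```python
def confidence_from_distance(distance: int) -> float:
    """Mirror of the sample expression: 1 / (input + 1)."""
    return 1 / (distance + 1)


# Recompute the table for edit distances 0 to 3 (the threshold in the listing).
for distance in range(4):
    print(distance, round(confidence_from_distance(distance), 3))
```

The expression yields 1.0 only for an exact match and falls off quickly afterwards, which is one reason it may or may not suit a particular deployment.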

==== Multiple Correlation Items

If there are multiple correlation items in a given correlation rule, their confidences are multiplied.

An example:

.Listing 3. Two items to be matched by a fuzzy search
[source,xml]
----
<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>givenName</ref>
                    <search>
                        <fuzzy>
                            <similarity>
                                <threshold>0.5</threshold>
                            </similarity>
                        </fuzzy>
                        <confidence>
                            <expression>
                                <script>
                                    <code>input</code>
                                </script>
                            </expression>
                        </confidence>
                    </search>
                </item>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>3</threshold>
                            </levenshtein>
                        </fuzzy>
                        <confidence>
                            <expression>
                                <script>
                                    <code>1 / (input+1)</code>
                                </script>
                            </expression>
                        </confidence>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
----

The confidence factor for `givenName` is defined to be equal to the trigram similarity value.
The confidence factor for `familyName` is defined just like in the example above.

For example, if a correlation candidate has a given name with the trigram similarity of 0.8 and a family name with the edit distance of 1, its confidence is computed as:

.Example of the confidence computation
[%header]
[%autowidth]
|===
| Property | Fuzzy search metric value | Confidence factor
| `givenName` | 0.8 | 0.8
| `familyName` | 1 | 0.5
2+| *Overall confidence* | *0.4* (= 0.8 x 0.5)
|===
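
The same computation, written as a short Python sketch.
The helper name is hypothetical, and the code only illustrates the multiplication rule stated above, using the two confidence expressions from Listing 3.

```python
from math import prod


def overall_confidence(factors: list[float]) -> float:
    """Multiply the per-item confidence factors of one correlation rule."""
    return prod(factors)


given_name_factor = 0.8           # expression `input` applied to similarity 0.8
family_name_factor = 1 / (1 + 1)  # expression `1 / (input+1)` applied to distance 1

overall = overall_confidence([given_name_factor, family_name_factor])
```

Because every factor lies in (0, 1], adding more fuzzy-matched items to a rule can only keep the overall confidence the same or lower it.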

== Using in Filters

[WARNING]
====
The use of fuzzy matching outside correlation is highly experimental.
In particular, matching of `PolyString` values does not work as expected.
Also, the serialization format may change in the future.
Here we describe it only for educational purposes - to emphasize the fact that correlation is ultimately implemented using regular queries.
====

.Listing 4. Sample Levenshtein distance query in XML
[source,xml]
----
<q:query xmlns:q="http://prism.evolveum.com/xml/ns/public/query-3">
    <q:filter>
        <q:fuzzyStringMatch>
            <q:path>familyName</q:path>
            <q:value>gren</q:value>
            <q:method>
                <q:levenshtein>
                    <q:threshold>3</q:threshold>
                </q:levenshtein>
            </q:method>
        </q:fuzzyStringMatch>
    </q:filter>
</q:query>
----

.Listing 5. Sample trigram similarity filter in Axiom
[source,axiom]
----
familyName similarity ('gren', 0.5, true)
----
6 changes: 4 additions & 2 deletions docs/correlation/index.adoc
Original file line number Diff line number Diff line change
@@ -243,6 +243,8 @@ Please see xref:/midpoint/reference/correlation/rule-composition/[rule compositi

=== Custom Indexing

IMPORTANT: This feature is available only when using the xref:/midpoint/reference/repository/native-postgresql/[native repository implementation].

Sometimes, we need to base the search on specially-indexed data.
For example, we may need to match only the first five normalized characters of the surname.
Or, we may want to take only digits into account when searching for the national ID.
@@ -324,7 +326,7 @@ If there are multiple normalizations defined for a given focus item (and none is

Please see xref:/midpoint/reference/correlation/custom-indexing/[custom indexing] and xref:/midpoint/reference/correlation/items-correlator/[`items` correlator] for more information.

=== Fuzzy Logic
=== Fuzzy Matching

By default, the searching is done using "exact match" criteria, either on original values or on the ones that underwent the standard or custom normalization.
Sometimes, however, we want to search for objects that have a property value somewhat similar to the value we have at hand.
@@ -345,7 +347,7 @@ To do this, a fuzzy search logic was implemented. There are two methods availabl
(See link:https://www.postgresql.org/docs/current/pgtrgm.html[PostgreSQL documentation on `pg_trgm` module].)
|===

NOTE: The fuzzy search is implemented for the native PostgreSQL-based repository only.
IMPORTANT: The fuzzy search is available only when using the xref:/midpoint/reference/repository/native-postgresql/[native repository implementation].

An example that searches for users having given name and family name close to the provided ones.
The given name has to have Levenshtein edit distance (to the provided one) at most 3.
