Rename edit_distance/min_similarity to fuzziness

Many APIs currently use different names for the same logical parameter. Since Lucene moved away from the notion of a `similarity` and now uses a `fuzziness`, we should generalize this and encapsulate the generation, parsing and creation of these settings across all queries. This commit adds a new `Fuzziness` class that handles the renaming and generalization in a backwards-compatible manner.

This commit also adds a `ParseField` class to better support deprecated Query DSL parameters. The `ParseField` class allows specifying parameters that have been deprecated, so that those parameters can be more easily tracked and removed in future versions. It also allows queries to be run in `strict` mode per index, throwing an exception if a query is executed with deprecated keys.

Closes elastic#4082
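
The deprecated-key handling described in the commit message can be pictured roughly as follows. This is a minimal sketch and not the `ParseField` class added by this commit: the class name `DeprecatableField`, its constructor, and the `matches` method are hypothetical names chosen for illustration, and the `strict` flag is modelled as a plain boolean rather than a per-index setting.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; the real ParseField API may differ.
final class DeprecatableField {
    private final String preferredName;
    private final List<String> deprecatedNames;

    DeprecatableField(String preferredName, String... deprecatedNames) {
        this.preferredName = preferredName;
        this.deprecatedNames = Arrays.asList(deprecatedNames);
    }

    // Returns true if the key names this field, under its preferred or a deprecated name.
    // In strict mode a deprecated name is rejected with an exception instead of accepted.
    boolean matches(String key, boolean strict) {
        if (preferredName.equals(key)) {
            return true;
        }
        if (deprecatedNames.contains(key)) {
            if (strict) {
                throw new IllegalArgumentException(
                        "Deprecated key [" + key + "], use [" + preferredName + "] instead");
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        DeprecatableField fuzziness =
                new DeprecatableField("fuzziness", "min_similarity", "edit_distance");
        System.out.println(fuzziness.matches("min_similarity", false)); // lenient: accepted
        try {
            fuzziness.matches("min_similarity", true); // strict: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the deprecated names next to the preferred name in one object is what would make them easy to find and remove later, which appears to be the motivation stated above.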
brusic · Jan 19, 2014 · 9a10b13
1 parent 4b6c60f
Showing 46 changed files with 917 additions and 196 deletions.
diff --git a/docs/reference/api-conventions.asciidoc b/docs/reference/api-conventions.asciidoc
@@ -122,6 +122,21 @@ fields within a document indexed treated as boolean fields.
 All REST APIs support providing numbered parameters as `string` on top
 of supporting the native JSON number types.
 
+[[time-units]]
+[float]
+=== Time units
+
+Whenever durations need to be specified, eg for a `timeout` parameter, the duration
+can be specified as a whole number representing time in milliseconds, or as a time value like `2d` for 2 days.  The supported units are:
+
+[horizontal]
+`y`::   Year
+`M`::   Month
+`w`::   Week
+`h`::   Hour
+`m`::   Minute
+`s`::   Second
+
 [[distance-units]]
 [float]
 === Distance Units
@@ -144,6 +159,63 @@ Centimeter::    `cm` or `centimeters`
 Millimeter::    `mm` or `millimeters`
 
 
+[[fuzziness]]
+[float]
+=== Fuzziness
+
+Some queries and APIs support parameters to allow inexact _fuzzy_ matching,
+using the `fuzziness` parameter. The `fuzziness` parameter is context
+sensitive which means that it depends on the type of the field being queried:
+
+[float]
+==== Numeric, date and IPv4 fields
+
+When querying numeric, date and IPv4 fields, `fuzziness` is interpreted as a
+`+/-` margin. It behaves like a <<query-dsl-range-query>> where:
+
+    value - fuzziness <= field value <= value + fuzziness
+
+The `fuzziness` parameter should be set to a numeric value, eg `2` or `2.0`. A
+`date` field interprets a long as milliseconds, but also accepts a string
+containing a time value -- `"1h"` -- as explained in <<time-units>>. An `ip`
+field accepts a long or another IPv4 address (which will be converted into a
+long).
+
+[float]
+==== String fields
+
+When querying `string` fields, `fuzziness` is interpreted as a
+http://en.wikipedia.org/wiki/Levenshtein_distance[Levenshtein Edit Distance]
+-- the number of one character changes that need to be made to one string to
+make it the same as another string.
+
+The `fuzziness` parameter can be specified as:
+
+`0`, `1`, `2`::
+
+the maximum allowed Levenshtein Edit Distance (or number of edits)
+
+`AUTO`::
++
+--
+generates an edit distance based on the length of the term. For lengths:
+
+`0..1`:: must match exactly
+`1..4`:: one edit allowed
+`>4`:: two edits allowed
+
+`AUTO` should generally be the preferred value for `fuzziness`.
+--
+
+`0.0..1.0`::
+
+converted into an edit distance using the formula: `length(term) * (1.0 -
+fuzziness)`, eg a `fuzziness` of `0.6` with a term of length 10 would result
+in an edit distance of `4`. Note: in all APIs except for the
+<<query-dsl-flt-query>>, the maximum allowed edit distance is `2`.
+
+
+
 [float]
 === Result Casing
 

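The string-field rules documented in the hunk above can be sketched as plain arithmetic. This is an illustration of the documentation text, not the `Fuzziness` class from this commit: the class and method names are hypothetical. Two points are assumptions: the documented `AUTO` ranges `0..1` and `1..4` overlap at length 1, which is treated here as "must match exactly", and the similarity formula is truncated to an integer using `double` arithmetic.

```java
// Hypothetical illustration of the documented rules; not the committed Fuzziness class.
final class FuzzinessSketch {
    // Documented cap for all APIs except fuzzy_like_this.
    static final int MAX_EDITS = 2;

    // AUTO: edit distance derived from term length (length 1 treated as exact, an assumption).
    static int autoEdits(int termLength) {
        if (termLength <= 1) {
            return 0;
        }
        if (termLength <= 4) {
            return 1;
        }
        return 2;
    }

    // 0.0..1.0: length(term) * (1.0 - fuzziness), truncated to an int (an assumption).
    static int similarityToEdits(int termLength, double fuzziness) {
        return (int) (termLength * (1.0 - fuzziness));
    }

    // The same conversion with the documented maximum of two edits applied.
    static int cappedEdits(int termLength, double fuzziness) {
        return Math.min(MAX_EDITS, similarityToEdits(termLength, fuzziness));
    }

    public static void main(String[] args) {
        System.out.println(autoEdits(3));
        System.out.println(similarityToEdits(10, 0.6));
        System.out.println(cappedEdits(10, 0.6));
    }
}
```

With the documented example (a `fuzziness` of `0.6` and a term of length 10) the uncapped conversion yields `4`, which the cap then reduces to `2` everywhere except the fuzzy-like-this query.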
diff --git a/docs/reference/query-dsl/queries/flt-field-query.asciidoc b/docs/reference/query-dsl/queries/flt-field-query.asciidoc
@@ -33,8 +33,8 @@ The `fuzzy_like_this_field` top level parameters include:
 |`max_query_terms` |The maximum number of query terms that will be
 included in any generated query. Defaults to `25`.
 
-|`min_similarity` |The minimum similarity of the term variants. Defaults
-to `0.5`.
+|`fuzziness` |The fuzziness of the term variants. Defaults
+to `0.5`. See  <<fuzziness>>.
 
 |`prefix_length` |Length of required common prefix on variant terms.
 Defaults to `0`.

diff --git a/docs/reference/query-dsl/queries/flt-query.asciidoc b/docs/reference/query-dsl/queries/flt-query.asciidoc
@@ -32,8 +32,8 @@ Defaults to the `_all` field.
 |`max_query_terms` |The maximum number of query terms that will be
 included in any generated query. Defaults to `25`.
 
-|`min_similarity` |The minimum similarity of the term variants. Defaults
-to `0.5`.
+|`fuzziness` |The fuzziness of the term variants. Defaults
+to `0.5`. See  <<fuzziness>>.
 
 |`prefix_length` |Length of required common prefix on variant terms.
 Defaults to `0`.

diff --git a/docs/reference/query-dsl/queries/fuzzy-query.asciidoc b/docs/reference/query-dsl/queries/fuzzy-query.asciidoc
@@ -1,12 +1,15 @@
 [[query-dsl-fuzzy-query]]
 === Fuzzy Query
 
-A fuzzy query that uses similarity based on Levenshtein (edit
-distance) algorithm. This maps to Lucene's `FuzzyQuery`.
+The fuzzy query uses similarity based on Levenshtein edit distance for
+`string` fields, and a `+/-` margin on numeric and date fields.
 
-Warning: this query is not very scalable with its default prefix length
-of 0 - in this case, *every* term will be enumerated and cause an edit
-score calculation or `max_expansions` is not set.
+==== String fields
+
+The `fuzzy` query generates all possible matching terms that are within  the
+maximum edit distance specified in `fuzziness` and then checks the term
+dictionary to find out which of those generated terms actually exist in the
+index.
 
 Here is a simple example:
 
@@ -17,63 +20,83 @@ Here is a simple example:
 }
 --------------------------------------------------
 
-More complex settings can be set (the values here are the default
-values):
+Or with more advanced settings:
 
 [source,js]
 --------------------------------------------------
-    {
-        "fuzzy" : { 
-            "user" : {
-                "value" : "ki",
-                "boost" : 1.0,
-                "min_similarity" : 0.5,
-                "prefix_length" : 0
-            }
+{
+    "fuzzy" : {
+        "user" : {
+            "value" :         "ki",
+            "boost" :         1.0,
+            "fuzziness" :     2,
+            "prefix_length" : 0,
+            "max_expansions": 100
         }
     }
+}
 --------------------------------------------------
 
-The `max_expansions` parameter (unbounded by default) controls the
-number of terms the fuzzy query will expand to.
+[float]
+===== Parameters
+
+[horizontal]
+`fuzziness`::
+
+    The maximum edit distance. Defaults to `AUTO`. See <<fuzziness>>.
+
+`prefix_length`::
+
+    The number of initial characters which will not be ``fuzzified''. This
+    helps to reduce the number of terms which must be examined. Defaults
+    to `0`.
+
+`max_expansions`::
+
+    The maximum number of terms that the `fuzzy` query will expand to.
+    Defaults to `0`.
+
+
+WARNING: this query can be very heavy if `prefix_length` and `max_expansions`
+are both set to their defaults of `0`. This could cause every term in the
+index to be examined!
+
 
 [float]
-==== Numeric / Date Fuzzy
+==== Numeric and date fields
+
+Performs a <<query-dsl-range-query>> ``around'' the value using the
+`fuzziness` value as a `+/-` range, where:
+
+    value - fuzziness <= field value <= value + fuzziness
 
-`fuzzy` query on a numeric field will result in a range query "around"
-the value using the `min_similarity` value. For example:
+For example:
 
 [source,js]
 --------------------------------------------------
 {
     "fuzzy" : {
         "price" : {
             "value" : 12,
-            "min_similarity" : 2
+            "fuzziness" : 2
         }
     }
 }
 --------------------------------------------------
 
-Will result in a range query between 10 and 14. Same applies to dates,
-with support for time format for the `min_similarity` field:
+Will result in a range query between 10 and 14. Date fields support
+<<time-units,time values>>, eg:
 
 [source,js]
 --------------------------------------------------
 {
     "fuzzy" : {
         "created" : {
             "value" : "2010-02-05T12:05:07",
-            "min_similarity" : "1d"
+            "fuzziness" : "1d"
         }
     }
 }
 --------------------------------------------------
 
-In the mapping, numeric and date types now allow to configure a
-`fuzzy_factor` mapping value (defaults to 1), which will be used to
-multiply the fuzzy value by it when used in a `query_string` type query.
-For example, for dates, a fuzzy factor of "1d" will result in
-multiplying whatever fuzzy value provided in the min_similarity by it.
-Note, this is explicitly supported since query_string query only allowed
-for similarity valued between 0.0 and 1.0.
+See <<fuzziness>> for more details about accepted values.
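
For numeric and date fields, the `+/-` behaviour documented above amounts to computing a lower and an upper bound around the queried value. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, dates are assumed to be represented as epoch milliseconds, and only the fixed-length suffixes `s`, `m`, `h`, `d` and `w` are handled, since the length of a month or year is not fixed.

```java
// Hypothetical illustration; not the code path used by the fuzzy query itself.
final class FuzzyRangeSketch {
    // Converts a time value such as "1d" to milliseconds; a bare number is taken as milliseconds.
    static long toMillis(String timeValue) {
        char unit = timeValue.charAt(timeValue.length() - 1);
        if (Character.isDigit(unit)) {
            return Long.parseLong(timeValue);
        }
        long amount = Long.parseLong(timeValue.substring(0, timeValue.length() - 1));
        switch (unit) {
            case 's': return amount * 1000L;
            case 'm': return amount * 60_000L;
            case 'h': return amount * 3_600_000L;
            case 'd': return amount * 86_400_000L;
            case 'w': return amount * 604_800_000L;
            default:
                throw new IllegalArgumentException("Unsupported time unit in [" + timeValue + "]");
        }
    }

    // Returns {lower, upper} for the range value - fuzziness .. value + fuzziness.
    static long[] bounds(long value, long fuzziness) {
        return new long[] {value - fuzziness, value + fuzziness};
    }

    public static void main(String[] args) {
        long[] price = bounds(12, 2);
        System.out.println(price[0] + " .. " + price[1]);
        System.out.println(toMillis("1d"));
    }
}
```

For the `price` example above, a value of `12` with a `fuzziness` of `2` gives the bounds 10 and 14, matching the range stated in the documentation.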
diff --git a/docs/reference/query-dsl/queries/match-query.asciidoc b/docs/reference/query-dsl/queries/match-query.asciidoc
@@ -34,9 +34,10 @@ The `analyzer` can be set to control which analyzer will perform the
 analysis process on the text. It default to the field explicit mapping
 definition, or the default search analyzer.
 
-`fuzziness` can be set to a value (depending on the relevant type, for
-string types it should be a value between `0.0` and `1.0`) to constructs
-fuzzy queries for each term analyzed. The `prefix_length` and
+`fuzziness` allows _fuzzy matching_ based on the type of field being queried.
+See <<fuzziness>> for allowed settings.
+
+The `prefix_length` and
 `max_expansions` can be set in this case to control the fuzzy process.
 If the fuzzy option is set the query will use `constant_score_rewrite`
 as its <<query-dsl-multi-term-rewrite,rewrite
@@ -80,9 +81,9 @@ change that the `zero_terms_query` option can be used, which accepts
 .cutoff_frequency
 The match query supports a `cutoff_frequency` that allows
 specifying an absolute or relative document frequency where high
-frequent terms are moved into an optional subquery and are only scored 
-if one of the low frequent (below the cutoff) terms in the case of an 
-`or` operator or all of the low frequent terms in the case of an `and` 
+frequent terms are moved into an optional subquery and are only scored
+if one of the low frequent (below the cutoff) terms in the case of an
+`or` operator or all of the low frequent terms in the case of an `and`
 operator match.
 
 This query allows handling `stopwords` dynamically at runtime, is domain
@@ -101,8 +102,8 @@ Note: If the `cutoff_frequency` is used and the operator is `and`
 _stacked tokens_ (tokens that are on the same position like `synonym` filter emits)
 are not handled gracefully as they are in a pure `and` query. For instance the query
 `fast fox` is analyzed into 3 terms `[fast, quick, fox]` where `quick` is a synonym
-for `fast` on the same token positions the query might require `fast` and `quick` to 
-match if the operator is `and`. 
+for `fast` on the same token positions the query might require `fast` and `quick` to
+match if the operator is `and`.
 
 Here is an example showing a query composed of stopwords exclusivly:
 

diff --git a/docs/reference/query-dsl/queries/query-string-query.asciidoc b/docs/reference/query-dsl/queries/query-string-query.asciidoc
@@ -46,8 +46,8 @@ increments in result queries. Defaults to `true`.
 |`fuzzy_max_expansions` |Controls the number of terms fuzzy queries will
 expand to. Defaults to `50`
 
-|`fuzzy_min_sim` |Set the minimum similarity for fuzzy queries. Defaults
-to `0.5`
+|`fuzziness` |Set the fuzziness for fuzzy queries. Defaults
+to `AUTO`. See  <<fuzziness>> for allowed settings.
 
 |`fuzzy_prefix_length` |Set the prefix length for fuzzy queries. Default
 is `0`.
@@ -70,7 +70,7 @@ in the resulting boolean query should match. It can be an absolute value
 both>>.
 
 |`lenient` |If set to `true` will cause format based failures (like
-providing text to a numeric field) to be ignored. 
+providing text to a numeric field) to be ignored.
 |=======================================================================
 
 When a multi term query is being generated, one can control how it gets
@@ -128,7 +128,7 @@ search on all "city" fields:
 
 Another option is to provide the wildcard fields search in the query
 string itself (properly escaping the `*` sign), for example:
-`city.\*:something`. 
+`city.\*:something`.
 
 When running the `query_string` query against multiple fields, the
 following additional parameters are allowed:

diff --git a/docs/reference/search/suggesters/completion-suggest.asciidoc b/docs/reference/search/suggesters/completion-suggest.asciidoc
@@ -199,7 +199,7 @@ curl -X POST 'localhost:9200/music/_suggest?pretty' -d '{
         "completion" : {
             "field" : "suggest",
             "fuzzy" : {
-                "edit_distance" : 2
+                "fuzziness" : 2
             }
         }
     }
@@ -210,8 +210,9 @@ The fuzzy query can take specific fuzzy parameters.
 The following parameters are supported:
 
 [horizontal]
-`edit_distance`::
-    Maximum edit distance, defaults to `1`
+`fuzziness`::
+    The fuzziness factor, defaults to `AUTO`.
+    See  <<fuzziness>> for allowed settings.
 
 `transpositions`::
     Sets if transpositions should be counted

diff --git a/src/main/java/org/apache/lucene/queryparser/classic/MapperQueryParser.java b/src/main/java/org/apache/lucene/queryparser/classic/MapperQueryParser.java
@@ -30,6 +30,7 @@
 import org.elasticsearch.common.lucene.Lucene;
 import org.elasticsearch.common.lucene.search.Queries;
 import org.elasticsearch.common.lucene.search.XFilteredQuery;
+import org.elasticsearch.common.unit.Fuzziness;
 import org.elasticsearch.index.mapper.FieldMapper;
 import org.elasticsearch.index.mapper.MapperService;
 import org.elasticsearch.index.query.QueryParseContext;
@@ -435,7 +436,7 @@ private Query getFuzzyQuerySingle(String field, String termStr, String minSimila
             if (currentMapper != null) {
                 try {
                     //LUCENE 4 UPGRADE I disabled transpositions here by default - maybe this needs to be changed
-                    Query fuzzyQuery = currentMapper.fuzzyQuery(termStr, minSimilarity, fuzzyPrefixLength, settings.fuzzyMaxExpansions(), false);
+                    Query fuzzyQuery = currentMapper.fuzzyQuery(termStr, Fuzziness.build(minSimilarity), fuzzyPrefixLength, settings.fuzzyMaxExpansions(), false);
                     return wrapSmartNameQuery(fuzzyQuery, fieldMappers, parseContext);
                 } catch (RuntimeException e) {
                     if (settings.lenient()) {