Rename edit_distance/min_similarity to fuzziness
A lot of different APIs currently use different names for the
same logical parameter. Since Lucene moved away from the notion
of a `similarity` and now uses a `fuzziness`, we should generalize
this and encapsulate the generation, parsing and creation of these
settings across all queries.

This commit adds a new `Fuzziness` class that handles the renaming
and generalization in a backwards compatible manner.

This commit also adds a `ParseField` class to better support deprecated
Query DSL parameters.

The `ParseField` class allows specifying parameters that have been deprecated.
Those parameters can be more easily tracked and removed in future versions.
This also allows running queries in `strict` mode per index, which throws
exceptions if a query is executed with deprecated keys.

Closes elastic#4082
s1monw authored and brusic committed Jan 19, 2014
1 parent 4b6c60f commit 9a10b13
Showing 46 changed files with 917 additions and 196 deletions.
72 changes: 72 additions & 0 deletions docs/reference/api-conventions.asciidoc
@@ -122,6 +122,21 @@ fields within a document indexed treated as boolean fields.
All REST APIs support providing numbered parameters as `string` on top
of supporting the native JSON number types.

[[time-units]]
[float]
=== Time units

Whenever durations need to be specified, eg for a `timeout` parameter, the duration
can be specified as a whole number representing time in milliseconds, or as a time value like `2d` for 2 days. The supported units are:

[horizontal]
`y`:: Year
`M`:: Month
`w`:: Week
`h`:: Hour
`m`:: Minute
`s`:: Second
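
The conversion from a time value to milliseconds can be sketched as follows. This is only an illustration, not the parser Elasticsearch uses: the class and method names are hypothetical, `d` is included because the text above uses `2d` as an example, and `y` and `M` are left out because their length in milliseconds is not stated here.

```java
import java.util.Map;

// Hypothetical helper: turns "5000" or "2d" style values into milliseconds.
final class TimeValueSketch {
    // Milliseconds per unit. Assumption: only fixed-length units are covered.
    private static final Map<Character, Long> UNIT_MILLIS = Map.of(
        's', 1_000L,
        'm', 60_000L,
        'h', 3_600_000L,
        'd', 86_400_000L,
        'w', 604_800_000L);

    static long toMillis(String value) {
        String v = value.trim();
        if (v.isEmpty()) {
            throw new IllegalArgumentException("empty time value");
        }
        char last = v.charAt(v.length() - 1);
        if (Character.isDigit(last)) {
            // A whole number is taken as milliseconds.
            return Long.parseLong(v);
        }
        Long unit = UNIT_MILLIS.get(last);
        if (unit == null) {
            throw new IllegalArgumentException("unsupported unit: " + last);
        }
        // Everything before the unit suffix is the numeric amount.
        return Long.parseLong(v.substring(0, v.length() - 1)) * unit;
    }

    public static void main(String[] args) {
        System.out.println(toMillis("2d"));
        System.out.println(toMillis("1h"));
        System.out.println(toMillis("5000"));
    }
}
```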

[[distance-units]]
[float]
=== Distance Units
@@ -144,6 +159,63 @@ Centimeter:: `cm` or `centimeters`
Millimeter:: `mm` or `millimeters`


[[fuzziness]]
[float]
=== Fuzziness

Some queries and APIs allow inexact _fuzzy_ matching through the `fuzziness`
parameter. The parameter is context sensitive: how it is interpreted depends
on the type of the field being queried.

[float]
==== Numeric, date and IPv4 fields

When querying numeric, date and IPv4 fields, `fuzziness` is interpreted as a
`+/-` margin. It behaves like a <<query-dsl-range-query>> where:

value - fuzziness <= field value <= value + fuzziness

The `fuzziness` parameter should be set to a numeric value, eg `2` or `2.0`. A
`date` field interprets a long as milliseconds, but also accepts a string
containing a time value -- `"1h"` -- as explained in <<time-units>>. An `ip`
field accepts a long or another IPv4 address (which will be converted into a
long).
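
As a rough sketch of the margin described above: the bounds are the query value minus and plus the fuzziness, and an IPv4 address can be mapped to a long with the conventional big-endian octet layout. The class and method names are hypothetical, and the octet layout is an assumption about how the conversion is done, not something this text specifies.

```java
// Hypothetical helper illustrating the +/- margin on numeric and IPv4 values.
final class MarginSketch {
    /** Returns {lower, upper} for a fuzzy match around value. */
    static long[] bounds(long value, long fuzziness) {
        return new long[] { value - fuzziness, value + fuzziness };
    }

    /** Converts a dotted-quad IPv4 address to a long (assumed big-endian octets). */
    static long ipv4ToLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long result = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            // Shift the previous octets up and append this one.
            result = (result << 8) | octet;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] range = bounds(12, 2);
        System.out.println(range[0] + " .. " + range[1]);
        System.out.println(ipv4ToLong("192.168.0.1"));
    }
}
```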

[float]
==== String fields

When querying `string` fields, `fuzziness` is interpreted as a
http://en.wikipedia.org/wiki/Levenshtein_distance[Levenshtein Edit Distance]
-- the number of one character changes that need to be made to one string to
make it the same as another string.
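
For readers unfamiliar with the measure, a minimal sketch of the classic Levenshtein distance follows (insertions, deletions and substitutions, each costing one edit). It is a textbook dynamic-programming version with a hypothetical class name, not the implementation Lucene uses, and it compares UTF-16 `char` values rather than full code points.

```java
// Textbook two-row dynamic-programming Levenshtein distance.
final class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
    }
}
```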

The `fuzziness` parameter can be specified as:

`0`, `1`, `2`::

the maximum allowed Levenshtein Edit Distance (or number of edits)

`AUTO`::
+
--
generates an edit distance based on the length of the term. For lengths:

`0..1`:: must match exactly
`1..4`:: one edit allowed
`>4`:: two edits allowed

`AUTO` should generally be the preferred value for `fuzziness`.
--

`0.0..1.0`::

converted into an edit distance using the formula: `length(term) * (1.0 -
fuzziness)`, eg a `fuzziness` of `0.6` with a term of length 10 would result
in an edit distance of `4`. Note: in all APIs except for the
<<query-dsl-flt-query>>, the maximum allowed edit distance is `2`.
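
The two conversions above can be sketched as follows. The class and method names are hypothetical, and two details are assumptions rather than facts from this text: the table lists length 1 under both `0..1` and `1..4`, so the sketch lets the exact-match rule win, and the similarity formula truncates the product to an integer.

```java
// Hypothetical helper mapping fuzziness settings to a maximum edit distance.
final class FuzzinessSketch {
    // Upper limit mentioned above for all APIs except the flt query.
    static final int MAX_EDITS = 2;

    /** AUTO: derive the allowed edits from the term length. */
    static int autoEdits(int termLength) {
        if (termLength <= 1) {
            return 0; // assumption: exact match wins for length 1
        }
        if (termLength <= 4) {
            return 1;
        }
        return 2;
    }

    /** Similarity in 0.0..1.0: length(term) * (1.0 - fuzziness), uncapped. */
    static int similarityToEdits(double fuzziness, int termLength) {
        if (fuzziness < 0.0 || fuzziness > 1.0) {
            throw new IllegalArgumentException("fuzziness must be in 0.0..1.0");
        }
        // Assumption: the fractional part is dropped.
        return (int) (termLength * (1.0 - fuzziness));
    }

    /** Applies the maximum of two edits used outside the flt query. */
    static int capped(int edits) {
        return Math.min(edits, MAX_EDITS);
    }

    public static void main(String[] args) {
        int raw = similarityToEdits(0.6, 10);
        System.out.println(raw + " uncapped, " + capped(raw) + " capped");
        System.out.println(autoEdits(3));
    }
}
```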



[float]
=== Result Casing

4 changes: 2 additions & 2 deletions docs/reference/query-dsl/queries/flt-field-query.asciidoc
@@ -33,8 +33,8 @@ The `fuzzy_like_this_field` top level parameters include:
|`max_query_terms` |The maximum number of query terms that will be
included in any generated query. Defaults to `25`.

|`min_similarity` |The minimum similarity of the term variants. Defaults
to `0.5`.
|`fuzziness` |The fuzziness of the term variants. Defaults
to `0.5`. See <<fuzziness>>.

|`prefix_length` |Length of required common prefix on variant terms.
Defaults to `0`.
4 changes: 2 additions & 2 deletions docs/reference/query-dsl/queries/flt-query.asciidoc
@@ -32,8 +32,8 @@ Defaults to the `_all` field.
|`max_query_terms` |The maximum number of query terms that will be
included in any generated query. Defaults to `25`.

|`min_similarity` |The minimum similarity of the term variants. Defaults
to `0.5`.
|`fuzziness` |The fuzziness of the term variants. Defaults
to `0.5`. See <<fuzziness>>.

|`prefix_length` |Length of required common prefix on variant terms.
Defaults to `0`.
85 changes: 54 additions & 31 deletions docs/reference/query-dsl/queries/fuzzy-query.asciidoc
@@ -1,12 +1,15 @@
[[query-dsl-fuzzy-query]]
=== Fuzzy Query

A fuzzy query that uses similarity based on Levenshtein (edit
distance) algorithm. This maps to Lucene's `FuzzyQuery`.
The fuzzy query uses similarity based on Levenshtein edit distance for
`string` fields, and a `+/-` margin on numeric and date fields.

Warning: this query is not very scalable with its default prefix length
of 0 - in this case, *every* term will be enumerated and cause an edit
score calculation or `max_expansions` is not set.
==== String fields

The `fuzzy` query generates all possible matching terms that are within the
maximum edit distance specified in `fuzziness` and then checks the term
dictionary to find out which of those generated terms actually exist in the
index.

Here is a simple example:

@@ -17,63 +20,83 @@ Here is a simple example:
}
--------------------------------------------------

More complex settings can be set (the values here are the default
values):
Or with more advanced settings:

[source,js]
--------------------------------------------------
{
"fuzzy" : {
"user" : {
"value" : "ki",
"boost" : 1.0,
"min_similarity" : 0.5,
"prefix_length" : 0
}
{
"fuzzy" : {
"user" : {
"value" : "ki",
"boost" : 1.0,
"fuzziness" : 2,
"prefix_length" : 0,
"max_expansions": 100
}
}
}
--------------------------------------------------

The `max_expansions` parameter (unbounded by default) controls the
number of terms the fuzzy query will expand to.
[float]
===== Parameters

[horizontal]
`fuzziness`::

The maximum edit distance. Defaults to `AUTO`. See <<fuzziness>>.

`prefix_length`::

The number of initial characters which will not be ``fuzzified''. This
helps to reduce the number of terms which must be examined. Defaults
to `0`.

`max_expansions`::

The maximum number of terms that the `fuzzy` query will expand to.
Defaults to `0`.


WARNING: this query can be very heavy if `prefix_length` and `max_expansions`
are both set to their defaults of `0`. This could cause every term in the
index to be examined!
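
To see why `prefix_length` and `max_expansions` matter, here is a deliberately naive sketch that scans a term dictionary. It is not how Lucene expands a fuzzy query. The class and method names are hypothetical, a `maxExpansions` of `0` is assumed to mean "no limit", and the sketch keeps the first matches it finds rather than ranking them.

```java
import java.util.ArrayList;
import java.util.List;

// Brute-force illustration of fuzzy term expansion over a term dictionary.
final class FuzzyExpansionSketch {
    static List<String> expand(List<String> dictionary, String term,
                               int fuzziness, int prefixLength, int maxExpansions) {
        // The first prefixLength characters must match exactly.
        String prefix = term.substring(0, Math.min(prefixLength, term.length()));
        List<String> matches = new ArrayList<>();
        for (String candidate : dictionary) {
            if (maxExpansions > 0 && matches.size() >= maxExpansions) {
                break; // expansion limit reached
            }
            if (!candidate.startsWith(prefix)) {
                continue; // cheap check that avoids the distance computation
            }
            if (distance(term, candidate) <= fuzziness) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    /** Classic Levenshtein distance, two-row dynamic programming. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        List<String> dict = List.of("kim", "kimchy", "ki", "ok", "kit", "bi");
        System.out.println(expand(dict, "ki", 1, 0, 0));
        System.out.println(expand(dict, "ki", 1, 1, 0));
        System.out.println(expand(dict, "ki", 1, 0, 2));
    }
}
```

With both settings at `0`, every term in the dictionary goes through the distance computation. A non-zero prefix length skips most terms with a cheap string comparison, and an expansion limit stops the scan early.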


[float]
==== Numeric / Date Fuzzy
==== Numeric and date fields

Performs a <<query-dsl-range-query>> ``around'' the value using the
`fuzziness` value as a `+/-` range, where:

value - fuzziness <= field value <= value + fuzziness

`fuzzy` query on a numeric field will result in a range query "around"
the value using the `min_similarity` value. For example:
For example:

[source,js]
--------------------------------------------------
{
"fuzzy" : {
"price" : {
"value" : 12,
"min_similarity" : 2
"fuzziness" : 2
}
}
}
--------------------------------------------------

Will result in a range query between 10 and 14. Same applies to dates,
with support for time format for the `min_similarity` field:
Will result in a range query between 10 and 14. Date fields support
<<time-units,time values>>, eg:

[source,js]
--------------------------------------------------
{
"fuzzy" : {
"created" : {
"value" : "2010-02-05T12:05:07",
"min_similarity" : "1d"
"fuzziness" : "1d"
}
}
}
--------------------------------------------------

In the mapping, numeric and date types now allow to configure a
`fuzzy_factor` mapping value (defaults to 1), which will be used to
multiply the fuzzy value by it when used in a `query_string` type query.
For example, for dates, a fuzzy factor of "1d" will result in
multiplying whatever fuzzy value provided in the min_similarity by it.
Note, this is explicitly supported since query_string query only allowed
for similarity valued between 0.0 and 1.0.
See <<fuzziness>> for more details about accepted values.
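
For the date example, the resulting range can be sketched with `java.time`. This only illustrates the arithmetic: the class and method names are hypothetical, and the value is treated as a local date-time with no time zone handling, which is an assumption.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical helper: +/- margin around an ISO-8601 local date-time.
final class DateMarginSketch {
    /** Returns {lower, upper} for a fuzzy match around the given date. */
    static LocalDateTime[] bounds(String isoValue, Duration fuzziness) {
        LocalDateTime value = LocalDateTime.parse(isoValue);
        return new LocalDateTime[] { value.minus(fuzziness), value.plus(fuzziness) };
    }

    public static void main(String[] args) {
        // A fuzziness of "1d" corresponds to a margin of one day on each side.
        LocalDateTime[] range = bounds("2010-02-05T12:05:07", Duration.ofDays(1));
        System.out.println(range[0] + " .. " + range[1]);
    }
}
```
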
17 changes: 9 additions & 8 deletions docs/reference/query-dsl/queries/match-query.asciidoc
@@ -34,9 +34,10 @@ The `analyzer` can be set to control which analyzer will perform the
analysis process on the text. It defaults to the field's explicit mapping
definition, or to the default search analyzer.

`fuzziness` can be set to a value (depending on the relevant type, for
string types it should be a value between `0.0` and `1.0`) to constructs
fuzzy queries for each term analyzed. The `prefix_length` and
`fuzziness` allows _fuzzy matching_ based on the type of field being queried.
See <<fuzziness>> for allowed settings.

The `prefix_length` and
`max_expansions` can be set in this case to control the fuzzy process.
If the fuzzy option is set the query will use `constant_score_rewrite`
as its <<query-dsl-multi-term-rewrite,rewrite
@@ -80,9 +81,9 @@ change that the `zero_terms_query` option can be used, which accepts
.cutoff_frequency
The match query supports a `cutoff_frequency` that allows
specifying an absolute or relative document frequency where high
frequent terms are moved into an optional subquery and are only scored
if one of the low frequent (below the cutoff) terms in the case of an
`or` operator or all of the low frequent terms in the case of an `and`
operator match.

This query allows handling `stopwords` dynamically at runtime, is domain
@@ -101,8 +102,8 @@ Note: If the `cutoff_frequency` is used and the operator is `and`
_stacked tokens_ (tokens that are on the same position like `synonym` filter emits)
are not handled gracefully as they are in a pure `and` query. For instance, the query
`fast fox` is analyzed into 3 terms `[fast, quick, fox]`, where `quick` is a synonym
for `fast` at the same token position; the query might then require both `fast` and
`quick` to match if the operator is `and`.

Here is an example showing a query composed of stopwords exclusively:

8 changes: 4 additions & 4 deletions docs/reference/query-dsl/queries/query-string-query.asciidoc
@@ -46,8 +46,8 @@ increments in result queries. Defaults to `true`.
|`fuzzy_max_expansions` |Controls the number of terms fuzzy queries will
expand to. Defaults to `50`

|`fuzzy_min_sim` |Set the minimum similarity for fuzzy queries. Defaults
to `0.5`
|`fuzziness` |Set the fuzziness for fuzzy queries. Defaults
to `AUTO`. See <<fuzziness>> for allowed settings.

|`fuzzy_prefix_length` |Set the prefix length for fuzzy queries. Default
is `0`.
@@ -70,7 +70,7 @@ in the resulting boolean query should match. It can be an absolute value
both>>.

|`lenient` |If set to `true` will cause format based failures (like
providing text to a numeric field) to be ignored.
|=======================================================================

When a multi term query is being generated, one can control how it gets
@@ -128,7 +128,7 @@ search on all "city" fields:

Another option is to provide the wildcard fields search in the query
string itself (properly escaping the `*` sign), for example:
`city.\*:something`.

When running the `query_string` query against multiple fields, the
following additional parameters are allowed:
7 changes: 4 additions & 3 deletions docs/reference/search/suggesters/completion-suggest.asciidoc
@@ -199,7 +199,7 @@ curl -X POST 'localhost:9200/music/_suggest?pretty' -d '{
"completion" : {
"field" : "suggest",
"fuzzy" : {
"edit_distance" : 2
"fuzziness" : 2
}
}
}
@@ -210,8 +210,9 @@ The fuzzy query can take specific fuzzy parameters.
The following parameters are supported:

[horizontal]
`edit_distance`::
Maximum edit distance, defaults to `1`
`fuzziness`::
The fuzziness factor, defaults to `AUTO`.
See <<fuzziness>> for allowed settings.

`transpositions`::
Sets if transpositions should be counted
@@ -30,6 +30,7 @@
import org.elasticsearch.common.lucene.Lucene;
import org.elasticsearch.common.lucene.search.Queries;
import org.elasticsearch.common.lucene.search.XFilteredQuery;
import org.elasticsearch.common.unit.Fuzziness;
import org.elasticsearch.index.mapper.FieldMapper;
import org.elasticsearch.index.mapper.MapperService;
import org.elasticsearch.index.query.QueryParseContext;
@@ -435,7 +436,7 @@ private Query getFuzzyQuerySingle(String field, String termStr, String minSimila
if (currentMapper != null) {
try {
//LUCENE 4 UPGRADE I disabled transpositions here by default - maybe this needs to be changed
Query fuzzyQuery = currentMapper.fuzzyQuery(termStr, minSimilarity, fuzzyPrefixLength, settings.fuzzyMaxExpansions(), false);
Query fuzzyQuery = currentMapper.fuzzyQuery(termStr, Fuzziness.build(minSimilarity), fuzzyPrefixLength, settings.fuzzyMaxExpansions(), false);
return wrapSmartNameQuery(fuzzyQuery, fieldMappers, parseContext);
} catch (RuntimeException e) {
if (settings.lenient()) {