ContextSuggester #4044

Closed

Contributor

@chilling chilling commented Nov 1, 2013

This commit extends the CompletionSuggester with context
information. For example, such context information can
be a simple string representing a category, which restricts
the suggestions to that category.

Three base implementations of this context information
have been set up in this commit.

  • a Category Context
  • a Geo Context
  • a Field Context

All mappings for this context information are
specified within a context field in the completion
field that should use this kind of information.

Mapping Example

The following example shows the mapping for a GeoContext.

{
    "testType":{
        "properties":{
            "testField":{
                "type":"completion",
                "index_analyzer":"simple",
                "search_analyzer":"simple",
                "payloads":true,
                "preserve_separators":false,
                "preserve_position_increments":true,
                "context":{
                    "geo":{
                        "separator":"|",
                        "precision":8,
                        "neighbors":true
                    }
                }
            }
        }
    }
}

Indexing

When a document is indexed, the context subfield of the
completion field contains the data used to provide
suggestions.

{
    "testField":{
        "input":[["pizza - berlin","pizza","food"]],
        "context":"u33dc1v0xupz"
    }
}

Suggestion Example

The suggestion request is extended by a context value. A
suggest request for a geolocation looks like this:

{
    "suggest":{
        "text":"pizza",
        "completion":{
            "field":"testField",
            "size":10,
            "context":{
                "geo":"u33dc0cpke4q"
            }
        }
    }
}

The context object contains a field with the same name as
defined in the mapping. Depending on the type of the context,
this field contains the data associated with the suggestion
request; in this example, the geohash of a location.

Category Context

The simplest way to use this feature is a category context. It
supports an arbitrary category name for use with the completion
suggest API.

To enable the category context, the category option must be
set to true:

"testField":{
    "type":"completion",
    "context":{
        "category": true
    }
}

The name of the context category then needs to be set within the
suggestion context during indexing:

{
    "testField":{
        "input":[["pizza - berlin","pizza","food"]],
        "context":"delivery"
    }
}

and can be used by setting the category value:

{
    "suggest":{
        "text":"pizza",
        "completion":{
            "field":"testField",
            "size":10,
            "context":{
                "category":"delivery"
            }
        }
    }
}
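
To illustrate the idea, here is a minimal Python sketch of how a category context narrows suggestions. It is a toy in-memory model and not the code in this commit; `ToySuggester`, `index`, and `suggest` are hypothetical names chosen for the example.

```python
from collections import defaultdict


class ToySuggester:
    """Toy model: inputs are stored per context and looked up by prefix."""

    def __init__(self):
        # Maps a context value (e.g. "delivery") to the inputs indexed under it.
        self._inputs_by_context = defaultdict(list)

    def index(self, inputs, context):
        self._inputs_by_context[context].extend(inputs)

    def suggest(self, text, context, size=10):
        # Only inputs indexed under the requested context are candidates.
        candidates = self._inputs_by_context.get(context, [])
        matches = sorted(i for i in candidates if i.startswith(text))
        return matches[:size]


suggester = ToySuggester()
suggester.index(["pizza - berlin", "pizza", "food"], context="delivery")
suggester.index(["pizza oven"], context="hardware")

print(suggester.suggest("pizza", context="delivery"))
```

Inputs indexed under a different context ("hardware" here) never show up for a "delivery" request, even though they share the prefix.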

Field Context

The Field Context works like the category context, but its value
is not set explicitly. Instead, it refers to another field in
the document, for example a category field.

{
    "category":{
        "type": "string"
    },
    "testField":{
        "type":"completion",
        "context":{
            "field": "category"
        }
    }
}

For indexing, the field context must be set to true:

{
    "category":"delivery",
    "testField":{
        "input":[["pizza - berlin","pizza","food"]],
        "context":true
    }
}

and suggestions use the context.field value:

{
    "suggest":{
        "text":"pizza",
        "completion":{
            "field":"testField",
            "size":10,
            "context": {
                "field": "delivery"
            }
        }
    }
}

Geo Context

The last context feature is the GeoContext. It takes a location into account.
For example, if one searches for delivery services, it might be useful to find
results around the location from which the query was sent. This context works
on geohashes internally, but the REST API allows any form defined for geo_points.

In the mapping this kind of context is configured by two parameters:

  • precision
  • neighbors

The precision option configures the range of the results. If the
neighbors option is enabled, not only the given geohash cell is used
but also all its neighbors.

"context":{
    "geo":{
        "separator":"|",
        "precision":8,
        "neighbors":true
    }
}

The context during indexing is set to the location of the input:

{
    "testField":{
        "input":[["pizza - berlin","pizza","food"]],
        "context": "u33dc1v0xupz"
    }
}

To get a list of suggestions around a specific area, the context.geo field
must contain the position of this area:

{
    "suggest":{
        "text":"pizza",
        "completion":{
            "field":"testField",
            "size":10,
            "context": {
                "geo": "u33dc0cpke4q"
            }
        }
    }
}
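
A rough Python sketch of how the precision setting could decide whether an indexed location and a query location fall into the same cell. It assumes the context is simply the geohash truncated to `precision` characters; the helper `cell` is hypothetical, and neighbor handling is left out.

```python
def cell(geohash: str, precision: int) -> str:
    """Return the geohash cell of the given precision (a prefix of the hash)."""
    if not 1 <= precision <= len(geohash):
        raise ValueError("precision must be between 1 and the geohash length")
    return geohash[:precision]


indexed = "u33dc1v0xupz"  # geohash stored with the suggestion
query = "u33dc0cpke4q"    # geohash sent with the suggest request

for precision in (5, 8):
    same = cell(indexed, precision) == cell(query, precision)
    print(precision, cell(indexed, precision), cell(query, precision), same)
```

At a precision of 5 both hashes share the cell `u33dc`. At a precision of 8 they name different cells, so a match would depend on the neighbors option, that is, on whether the two cells are adjacent (not checked here).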

Closes #3959

@kimchy
Member

kimchy commented Nov 1, 2013

haven't reviewed the code, API wise it looks good to me, but can we also have as part of this pull request the relevant changes to our docs?

@chilling
Contributor Author

chilling commented Nov 1, 2013

@kimchy of course

@ghost ghost assigned chilling Nov 1, 2013
@s1monw
Contributor

s1monw commented Nov 7, 2013

A couple of comments based on the documentation:

For mapping:

 "geo":{
        "separator":"|",
        "precision":8,
        "neighbors":true
    }

I don't think we should expose the separator. What is the reason for it? This is an implementation detail and should not be exposed to the user.

On the request side of things I don't think we should require folks to specify stuff like this:

"context": {
 "field": "delivery"
}

it should rather look like this:

"context":  "delivery"

instead, since we know how to parse the values, i.e. we know from the mapping whether it is a field or a geo context.

For the mapping of Category and Field I think we should fold the two into one and make the Field the default. Something like this:

"testField":{
    "type":"completion",
    "context":{
        "category": {
           "default_field" : "cat_type",
           "default_value" : "delivery"
         }
    }
}

Then you can decide if you want to specify it manually like this:

{
    "testField":{
        "input":[["pizza - berlin","pizza","food"]],
        "context" : "delivery"
    }
}

or via a field:

{
    "cat_type" : "delivery",
    "testField":{
        "input":[["pizza - berlin","pizza","food"]],
        "context":"delivery"
    }
}

and since we defined the default value in the mapping we can still have a context if the cat_type field is not there and we didn't specify anything as the context. Makes sense?

@s1monw
Contributor

s1monw commented Nov 7, 2013

I left some comments on the code. It looks great, but I think we can simplify the API as I described above! If you push changes, please don't rebase your branch on master; just add the relevant changes, commit, and push the branch to your GitHub repo so we can see them here.

thanks !!

@clintongormley

Hiya

I've got some comments about how to improve the API.

There is no mention of what happens if a context field contains multiple values. That said, geohashes already have multiple contexts, as edge ngrams are generated. I'd also like to be able to use a multi-value field as a context, e.g.:

colors: [red,green]

I probably would also like to combine multiple fields to generate a context, eg:

  • country
  • region
  • city

Specifying this via the API could make things messy and overly complicated. The simple answer to this would be to concatenate these values into one field. I'd also like it if we could accept scripts which can generate a context value (or array of values) based on the _source.

So we have:

  • value: passed-in || default_value
  • field: passed-in || field_value || default_value
  • script: passed-in || generated by the script || default_value
  • geo: could be any of the above, but requires special handling

The mappings should look like the following:

Mapping for value

No default value set:

"testField":{
    "type":"completion"
}

Default value set:

"testField":{
    "type":"completion",
    "context":{
        "default":  "foo"
    }
}

Should default actually be called null_value?

Mapping for field

"testField":{
    "type":"completion",
    "context":{
        "field":    "tags",
        "default":  "foo"    # optional
    }
}

Mapping for script

"testField":{
    "type":"completion",
    "context":{
        "script": {
            "lang": "mvel",
            "script": "ctx._source.foo"
        },
        "default":  "foo"    # optional
    }
}

Mapping for geo

By specifying geohash:{} it triggers geohash handling of the context value. The geohash itself can come from anywhere.

(btw, neighbours should default to true, I think)

Geo as value:

"testField":{
    "type":"completion",
    "context":{
        "geohash": {                
            "precision": 8,         # optional
            "neighbours": true      # optional
        },
        "default":  "u33dc1v0xupz"  # optional
    }
}

Geo as a field:

"testField":{
    "type":"completion",
    "context":{
        "field":    "location",
        "default":  "u33dc1v0xupz", # optional
        "geohash": {                
            "precision": 8,         # optional
            "neighbours": true      # optional
        }
    }
}

Although we could derive the geohash stuff from the referenced field, for consistency's sake, I think I prefer actually specifying it. All it needs is "geohash:{}", it doesn't need to set anything else.

Geo as a script:

"testField":{
    "type":"completion",
    "context":{
        "default":  "u33dc1v0xupz", # optional
        "geohash": {                
            "precision": 8,         # optional
            "neighbours": true      # optional
        },
        "script": {
            "lang":   "mvel",
            "script": "foo.bar.value"
        }
    }
}

Indexing a document

Specifying the context directly:

"testField": {
    "input":    ["pizza - berlin", "pizza","food"],
    "context":  "foo" 
}

context not specified

"testField": {
    "input":    ["pizza - berlin", "pizza","food"]
}

Derives value from :

 -> script  (if specified)
 -> field   (if specified)
 -> default (if specified)
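
The fallback order above can be sketched in Python as follows. This is only an illustration of the proposal, not existing code: `resolve_context` and its parameters are hypothetical, and a script is modelled as a plain callable that receives the document source.

```python
from typing import Any, Callable, Mapping, Optional


def resolve_context(
    source: Mapping[str, Any],
    explicit: Optional[Any] = None,
    script: Optional[Callable[[Mapping[str, Any]], Any]] = None,
    field: Optional[str] = None,
    default: Optional[Any] = None,
) -> Optional[Any]:
    """Pick the context for a document: explicit, then script, field, default."""
    if explicit is not None:
        return explicit
    if script is not None:
        value = script(source)
        if value is not None:
            return value
    if field is not None and source.get(field) is not None:
        return source[field]
    return default


doc = {"tags": "italian"}
print(resolve_context(doc, explicit="foo", field="tags", default="none"))
print(resolve_context(doc, field="tags", default="none"))
print(resolve_context({}, field="tags", default="none"))
```

The three calls print `foo`, `italian`, and `none`: a passed-in context wins, then the referenced field, and the default is used only when nothing else yields a value.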

Can have multiple contexts, eg:

"testField": {
    "input":    ["pizza - berlin", "pizza","food"],
    "context":  ["italian","fastfood"] 
}

This should generate two suggestion paths

Searching:

There is no need to specify geo etc, as we can figure that out from the mapping, so searching just looks like:

{
    "suggest":{
        "text":"pizza",
        "completion":{
            "field":"testField",
            "size":10,
            "context": "u33dc0cpke4q"
        }
    }
}

@chilling
Contributor Author

Thanks a lot for your suggestions. I had already started implementing the new style of the API and the multivalued contexts.

@chilling
Contributor Author

Hey @simon,

I think we should keep the separator option. For contexts with arbitrary text values we need to distinguish between different context/values to guarantee unique prefixes. Think of a context a and a suggestion value of |b in the first place and a context a| and a value of b. Both will create the same path (a||b) in the underlying FST.
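
A small Python sketch of the collision described here, assuming the stored key is just context, separator, and suggestion concatenated. The helper `path` is hypothetical; the real key construction in the FST may differ.

```python
def path(context: str, suggestion: str, separator: str = "|") -> str:
    """Join a context and a suggestion into the key that would be stored."""
    return context + separator + suggestion


first = path("a", "|b")   # context "a",  suggestion "|b"
second = path("a|", "b")  # context "a|", suggestion "b"
print(first, second, first == second)
```

Both calls produce `a||b`, so the two different context/suggestion pairs cannot be told apart once stored.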

@chilling
Contributor Author

@clintongormley I like the idea of using multivalued fields as context. But I think we have two options here:

  • combine the values and-wise
  • combine the values or-wise

The first option requires an order to create a unique context. The second option will simply generate alternative contexts. I think the latter option should be used. But I'd like to hear your ideas about those contexts.
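
The difference between the two options can be sketched in Python like this. Both helpers are hypothetical, and the `+` joiner is an arbitrary choice for readability.

```python
from typing import List


def and_wise(values: List[str], joiner: str = "+") -> List[str]:
    """Combine all values into one context; sorting fixes the order."""
    return [joiner.join(sorted(values))]


def or_wise(values: List[str]) -> List[str]:
    """Treat every value as an alternative context of its own."""
    return list(values)


colors = ["red", "green"]
print(and_wise(colors))  # one combined context
print(or_wise(colors))   # two alternative contexts
```

The and-wise variant needs the sort so that `["red", "green"]` and `["green", "red"]` map to the same context, whereas the or-wise variant simply yields one context per value.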

@s1monw
Contributor

s1monw commented Nov 20, 2013

@chilling I think we should use a less prominent character as the separator to begin with, like \u001F, and reject the context if it contains it. It's very unlikely that a context contains it.
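
As a sketch of this suggestion in Python (the names `SEPARATOR` and `build_path` are hypothetical and not taken from the PR):

```python
SEPARATOR = "\u001f"  # ASCII unit separator, the character proposed above


def build_path(context: str, suggestion: str) -> str:
    """Prefix a suggestion with its context, rejecting ambiguous contexts."""
    if SEPARATOR in context:
        raise ValueError("context must not contain the reserved separator")
    return context + SEPARATOR + suggestion


# The pairs that collide with "|" as separator now produce different keys.
print(build_path("a", "|b") == build_path("a|", "b"))
```

Because a context can never contain the separator, the first occurrence of it in a key always marks the end of the context, so the key stays unambiguous even if the suggestion text happens to contain the character.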

Can you reply to all the code comments etc. as well please.

@chilling
Contributor Author

@s1monw, you're right. I think this will be an acceptable solution.

@chilling
Contributor Author

I just managed to push a clean update. So for now we support multiple values and multiple contexts in the context suggester. I'm currently working on the documentation, but feel free to have a first look at the current implementation.

public class PrefixAnalyzer extends Analyzer {

public static final char DEFAULT_SEPARATOR = '|';
public static final char DEFAULT_DELIMITER = '-';

DEFAULT_DELIMITER is unused?

@s1monw
Contributor

s1monw commented Dec 12, 2013

hey florian, I took a look at the PR and I think it looks good. I wonder a bit what the (REST) API looks like at this point. Can you add documentation to the pull request so we can also see the API from a REST perspective? It's kind of hard to tell from the tests etc. :)

@chilling
Contributor Author

Thanks @s1monw. I already started working on docs. So documentation will be there tomorrow.

@chilling
Contributor Author

@s1monw for now I worked on your code review and changed most of the things. What keeps me busy right now is the fuzzy logic. I think I'm not able to adjust the fuzzyprefix, because the ContextSuggester supports multiple values as alternatives. So there could be different paths with different length. So I'm trying to build two separate automatons: one for the prefixes and the other for the actual suggestion. But I haven't found the right place for this yet. Today I also tested the REST API and I still have some minor issues with parsing. The API should look like this:

Mapping

        "properties" : {
            "name" : { "type" : "string" },
            "TypeOfService" : { "type" : "string", "index": "not_analyzed" },
            "suggest" : {
                "type" : "completion",
                "index_analyzer" : "simple",
                "search_analyzer" : "simple",
                "payloads" : true,
                "context": [
                    { "geo": { "precision": "50m", "neighbors": true } },
                    { "field": { "fieldname": "TypeOfService", "default": "unknown" } },
                    { "category": { "default": "none" } }
                ]
            }
        }

Index

{
    "name" : "Hotel Berlin, Tokyo",
    "TypeOfService" : "hotel",
    "suggest" : {
        "input": [ "Hotel", "Berlin", "Hotel Berlin"],
        "output": "Hotel Berlin, Tokyo",
        "context": [ {"lat": 35.689506, "lon": 139.6917}, on, ["hotel", "rooms"] ],
        "payload" : { "id" : 1 }
    }
}

Query

{
    "suggest" : {
        "text" : "b",
        "completion" : {
            "field" : "suggest",
            "context": [{"lat": 35.689506, "lon": 139.6917}, "hotel", "rooms"]
        }
    }
}

@s1monw
Contributor

s1monw commented Dec 13, 2013

I think I'm not able to adjust the fuzzyprefix, because the ContextSuggester supports multiple values as alternatives. So there could be different paths with different length.

Wait, you are only using exactly one prefix to look up the suggestion, since our interface doesn't allow multiple. Index time is a different story. But at search time you should be able to tell how long the prefix is?

--------------------------------------------------

=== Geo location Context
A geo location as context information works slightly different from the other queries.

A geo context allows you to limit results to those that lie within a certain distance of a specified geolocation. At index time, a lat/long geopoint is converted into a geohash of a certain precision, which provides the context.

It'd be nice if we could specify multiple precisions, eg:

precision: [ "5m", "1km", "10km"]

Then we'd need to be able to specify the precision at search time as well.

@clintongormley

Hi @chilling

I've finished reviewing the docs. I think there are improvements we can make to the API:

  • Change missing to default
  • Merge category and field, so you can specify default categories and, optionally, a path containing the name of a field which contains the value. At index time you can either specify a context manually, retrieve it from the field (if specified), or use the default.
  • Geo contexts should allow multiple precisions, in which case you would need to specify the required precision at search time. (If only one precision is specified then it would use that by default).
  • Should neighbours default to true?

@chilling
Contributor Author

chilling commented Mar 5, 2014

hi @clintongormley,

for now I worked in most of your comments. Currently I'm working on the multiple precisions. It makes complete sense. My basic idea in the initial commit was to index the whole geohash path, namely every prefix. Since this obviously caused trouble, I removed it. But the idea of having at least a precision at query time will solve the problem. Do you think it makes sense to index all prefixes of a geohash and then just query by a certain precision?

@clintongormley

Hi @chilling

Do you think it makes sense to index all prefixes of a geohash and then just query by a certain precision?

That makes sense to me, but I have a feeling that it may generate very large FSTs given that you essentially add all the data 12 times. If it were possible to compress the FST that would be awesome. Failing that, I think the list of precisions is the best compromise.
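
To make the trade-off concrete, here is a small Python sketch. Both helpers are hypothetical, and precisions are given as prefix lengths in characters rather than as distances such as "5m", which is an assumption made for this example.

```python
from typing import Iterable, List


def all_prefixes(geohash: str) -> List[str]:
    """Every prefix of the geohash, from the coarsest cell to the full hash."""
    return [geohash[:n] for n in range(1, len(geohash) + 1)]


def selected_prefixes(geohash: str, lengths: Iterable[int]) -> List[str]:
    """Only the prefixes for a configured list of lengths (the compromise)."""
    return [geohash[:n] for n in lengths if 1 <= n <= len(geohash)]


geohash = "u33dc1v0xupz"
print(len(all_prefixes(geohash)))          # 12 entries for a 12-character hash
print(selected_prefixes(geohash, [5, 8]))  # ['u33dc', 'u33dc1v0']
```

Indexing every prefix stores each suggestion once per character of the hash, while a configured list of precisions stores it only once per listed precision.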

@chilling
Contributor Author

Hi @s1monw, @clintongormley and @spinscale,

just finished the next iteration. Maybe you like to have a look, I hope we're getting closer.

@s1monw
Contributor

s1monw commented Mar 12, 2014

code changes look good to me - @clintongormley can you check if your comments / concerns were addressed?

@@ -171,7 +171,10 @@ Kilometer:: `km` or `kilometers`
Meter:: `m` or `meters`
Centimeter:: `cm` or `centimeters`
Millimeter:: `mm` or `millimeters`
<<<<<<< HEAD

You've committed the conflict markers here

@clintongormley

Hiya @chilling

I would combine the docs for category and field, as I mentioned in the comment above. That, plus fixing the bad merge with the conflict markers <<<<< and I'm +1

@chilling
Contributor Author

thanks @clintongormley! Can you have a last short look at this?

@clintongormley

LGTM

================

This commit extends the `CompletionSuggester` with context
information. For example, such context information can
be a simple string representing a category, which restricts
the suggestions to that category.

Two base implementations of this context information
have been set up in this commit.

- a Category Context
- a Geo Context

All mappings for this context information are
specified within a context field in the completion
field that should use this kind of information.
@chilling
Contributor Author

merged

@s1monw
Contributor

s1monw commented Mar 25, 2014

moving to 1.2 due to #5525

@chilling chilling removed their assignment Mar 10, 2015
@clintongormley clintongormley added the :Search/Suggesters "Did you mean" and suggestions as you type label Jun 6, 2015
Labels
>feature :Search/Suggesters "Did you mean" and suggestions as you type v1.2.0 v2.0.0-beta1
Development

Successfully merging this pull request may close these issues.

Context extension of the Suggester
5 participants