New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Highlight suggestions #3442
Comments
Hey @nik9000 thanks for opening this issue. I'm not sure if this is the right approach to go here since pure tokenization not always what you want. I think we should rather provide an API that provides a highlighted version of the corrected string or it should return plain offsets. Yet, why I think we should not plain tokenize is that if you have synonyms etc. in pre or post filters tokens will not correspond to actual corrected substrings. The way I think we should do that is to provide pre and post tokens similar to the highlight API and return a "highlighted" version of the string. I think the big advantage of this is that you can display it directly to the user and I don't wanna overcomplicate the response format. We can also return actual offsets which will require some processing in the client which I would want to prevent. Yet, if you wanna take a look at implementing this, we are maintaining the splitted tokens until we are done and join them together. we can add a flag to |
I agree with your point about the API. Elasticsearch should do the highlighting. |
internally we use shingles or ngrams to lookup the frequencies but those are only in the index and the bigrams / trigrams that are build from the query are build at runtime and not from the shingle filter so that each |
Hey @nik9000 I think we can add this in a backwards compatible way by simply using a custom join function on
so if
does this make sense? I think what we need here is a notion of a |
@s1monw I think I'd prefer to return the unlighted version next to the highlighted version. That way I don't have to re-strip the highlight components when building the suggestion query. Here is an updated curl testing script: curl -XDELETE "http://localhost:9200/test?pretty"
curl -XPOST "http://localhost:9200/test?pretty" -d '{
"settings": {
"index": {
"number_of_replicas": 0
"analysis":{
"analyzer":{
"suggest":{
"type": "custom",
"tokenizer": "standard",
"filter": [ "standard", "lowercase", "suggest_shingle" ]
}
},
"filter":{
"suggest_shingle":{
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 5,
"output_unigrams": true
}
}
}
}
}
}'
curl -XPOST "http://localhost:9200/test/test/_mapping?pretty" -d '{
"test":{
"foo":{
"analyzer":"suggest"
}
}
}'
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true&wait_for_status=yellow'
curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
"foo": "I love shingles for suggestions"
}'
curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
"foo": "but not for finding differences"
}'
sleep 1
curl -XGET "http://localhost:9200/test/test/_search?pretty" -d '{
"query": {
"match_all": {}
},
"suggest": {
"text": "findning differrences",
"phrase":{
"phrase":{
"field": "foo",
"max_errors": 5
}
}
}
}'
curl -XGET "http://localhost:9200/test/test/_search?pretty" -d '{
"query": {
"match_all": {}
},
"suggest": {
"text": "findning differrences",
"phrase":{
"phrase":{
"field": "foo",
"max_errors": 5,
"highlight": {
"pre_tag": "<em>",
"post_tag": "</em>"
}
}
}
}
}' The first doesn't have highlighting enabled so the suggest section should look exactly like a search without this feature: "suggest" : {
"phrase" : [ {
"text" : "findning differrences",
"offset" : 0,
"length" : 21,
"options" : [ {
"text" : "finding differences",
"score" : 0.49144784
}, {
"text" : "findning differences",
"score" : 0.3803136
}, {
"text" : "finding differrences",
"score" : 0.37071815
} ]
} ] and the second search should have highlighting like this: "suggest" : {
"phrase" : [ {
"text" : "findning differrences",
"offset" : 0,
"length" : 21,
"options" : [ {
"text" : "finding differences",
"highlighted" : "<em>finding</em> <em>differences</em>",
"score" : 0.49144784
}, {
"text" : "findning differences",
"highlighted" : "findning <em>differences</em>",
"score" : 0.3803136
}, {
"text" : "finding differrences",
"highlighted" : "<em>finding</em> differrences",
"score" : 0.37071815
} ]
} ]
} |
oh that is an very good reason for having it side by side to the original suggestion....! agreed! |
@nik9000 I changed your commit message a bit and running tests now. I will push this very shortly! Unfortunately we can't add this to 0.90 at this point since it breaks the wire format. It's a pity though. But anyway thanks so much for adding this! |
This commit adds general highlighting support to the suggest feature. The only implementation that implements this functionality at this point is the phrase suggester. The API supports a 'pre_tag' and a 'post_tag' that are used to wrap suggested parts of the given user input changed by the suggester. Closes #3442
Backported to 0.90 with version compatibility on the transport protocol. I kept the version checks for master (1.0) for now to have the same code in master and 0.90 but that doesn't mean a major version upgrade from 0.90 to 1.0 doesn't require a full cluster restart |
This commit adds general highlighting support to the suggest feature. The only implementation that implements this functionality at this point is the phrase suggester. The API supports a 'pre_tag' and a 'post_tag' that are used to wrap suggested parts of the given user input changed by the suggester. Closes elastic#3442
I think it'd be cool if the suggestion API returned the suggestions in a tokenized form so consumers could easily highlight the changed tokens. This particular tokenization would have to ignore any shingle filters to in the analyzer because we'd end up highlighting whole changed phrases and that really isn't how users think.
I'm thinking it should work like this:
The final request would return something like this:
The text was updated successfully, but these errors were encountered: