make term statistics and term vectors accessible in scripts #4161
Was talking to @s1monw and we now think we have to optimize more. For example: use posting list info instead of term vectors unless term vectors are explicitly requested, and do not initialize all positions, payloads, ... unless the user really wants them. Also, we should make an iterator for the positions available. Will be back with a new version soon.
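To illustrate the "do not initialize positions unless requested" idea, here is a minimal sketch in plain Java. It is not the code from this pull request: the class name `LazyTermSketch`, the `positionLoader` supplier and the `loads()` counter are all hypothetical, and no Lucene API is used. The cheap statistic (`freq`) is available immediately, while positions are only loaded the first time someone iterates over them.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/** Hypothetical sketch: term data that loads positions only on first use. */
final class LazyTermSketch implements Iterable<Integer> {
    private final int freq;                       // cheap: could come from the posting list
    private final Supplier<int[]> positionLoader; // expensive: only run if positions are requested
    private int[] positions;                      // stays null until the first iteration
    private int loads = 0;                        // counts how often the loader actually ran

    LazyTermSketch(int freq, Supplier<int[]> positionLoader) {
        this.freq = freq;
        this.positionLoader = positionLoader;
    }

    int freq() {
        return freq;
    }

    int loads() {
        return loads;
    }

    /** Iterator over positions; triggers the (assumed expensive) load once. */
    @Override
    public Iterator<Integer> iterator() {
        if (positions == null) {
            positions = positionLoader.get();
            loads++;
        }
        final int[] snapshot = positions;
        return new Iterator<Integer>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < snapshot.length;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return snapshot[i++];
            }
        };
    }
}
```

A caller that only needs `freq()` never pays for the positions; a caller that writes `for (int pos : term) { ... }` triggers the load exactly once.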
@@ -0,0 +1,473 @@
h5 {
this file is obsolete
yes
@clint Thanks for cleaning up the docs! Here is what I made of your comments:
[float] The implementation so far does not give you the term vector but rather statistics for single terms. If you want to gather information on all terms in a field, you must store the term vectors (set Is that better?
Yes, you can just add another include statement.
Ah that was probably me trying
I don't feel strongly about it, it was just a suggestion. I'm open to what others think here.
Yes it makes sense. I don't know if idf is useful or not - was just a suggestion in case it had been overlooked
yes, much more understandable :)
Thank you all so much for the comments! I added another commit that implements the changes. The two things I did not do yet:
@@ -168,8 +171,9 @@ public int nextDoc() throws IOException {

    @Override
    public float score() throws IOException {
        return scoreCombiner.combine(subQueryBoost, scorer.score(),
                function.score(scorer.docID(), scorer.score()), maxBoost);
        float score = scorer.score();
I like!
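The visible part of the hunk suggests that the sub-scorer's score is now read once into a local variable instead of being requested twice. The rest of the new method body is not shown here, so the following is only a sketch of that pattern under assumptions: `SubScorer`, `combine` and `function` are hypothetical stand-ins, not the real classes touched by this change.

```java
/** Hypothetical sketch: call the wrapped scorer once and reuse the value. */
final class ScoreOnceSketch {

    /** Stand-in for the wrapped scorer. */
    interface SubScorer {
        float score();

        int docID();
    }

    /** Stand-in combiner: multiplies the scores and caps the function score at maxBoost. */
    static float combine(float boost, float queryScore, float funcScore, float maxBoost) {
        return boost * queryScore * Math.min(funcScore, maxBoost);
    }

    /** Stand-in score function that depends on the doc id and the sub-query score. */
    static float function(int docId, float subQueryScore) {
        return subQueryScore + docId;
    }

    /** Shape of the older code: score() is evaluated twice per document. */
    static float scoreTwice(SubScorer s, float boost, float maxBoost) {
        return combine(boost, s.score(), function(s.docID(), s.score()), maxBoost);
    }

    /** Shape suggested by the new line: evaluate score() once, then reuse it. */
    static float scoreOnce(SubScorer s, float boost, float maxBoost) {
        float score = s.score();
        return combine(boost, score, function(s.docID(), score), maxBoost);
    }
}
```

Both variants return the same value as long as `score()` is side-effect free; the second simply avoids the repeated call, which matters when the wrapped scorer is expensive.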
Term statistics can be accessed via the _shard variable. Below is a minimal example. See the documentation for details.

```
DELETE paytest

PUT paytest
{
  "mappings": {
    "test": {
      "_all": { "auto_boost": true, "enabled": true },
      "properties": {
        "text": {
          "index_analyzer": "fulltext_analyzer",
          "store": "yes",
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "filter": [ "my_delimited_payload_filter" ],
          "tokenizer": "whitespace",
          "type": "custom"
        }
      },
      "filter": {
        "my_delimited_payload_filter": {
          "delimiter": "+",
          "encoding": "float",
          "type": "delimited_payload_filter"
        }
      }
    },
    "index": { "number_of_replicas": 0, "number_of_shards": 1 }
  }
}

POST paytest/test/1
{ "text": "the+1 quick+2 brown+3 fox+4 is quick+10" }

POST paytest/test/2
{ "text": "the+1 quick+2 red+3 fox+4" }

POST paytest/_refresh

POST paytest/_search
{
  "script_fields": {
    "ttf": { "script": "_shard[\"text\"][\"quick\"].ttf()" }
  }
}

POST paytest/_search
{
  "script_fields": {
    "freq": { "script": "_shard[\"text\"][\"quick\"].freq()" }
  }
}

POST paytest/test/2/_termvector

POST paytest/_search
{
  "script_fields": {
    "payloads": {
      "script": "term = _shard[\"text\"].get(\"red\",_PAYLOADS);payloads = []; for(pos : term){payloads.add(pos.payloadAsFloat(-1));} return payloads;"
    }
  }
}

POST paytest/_search
{
  "script_fields": {
    "tv": { "script": "_shard[\"text\"][\"quick\"].freq()" }
  },
  "query": {
    "function_score": {
      "functions": [
        { "script_score": { "script": "_shard[\"text\"][\"quick\"].freq()" } }
      ]
    }
  }
}
```

closes elastic#3772
Implemented all changes as discussed
Implementation relies on elastic/elasticsearch#4161

This adds scripts for
- cosine similarity
- tfidf
- language model scoring
- simple phrase scorer
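As a rough illustration of what a tf-idf score built from term statistics can look like, here is a small sketch in plain Java. It is not taken from the scripts mentioned above; the class name and the method signatures are hypothetical, and the weighting `sqrt(tf) * (1 + ln(numDocs / (df + 1)))` is just one common variant chosen for the example.

```java
/** Hypothetical sketch of a tf-idf score computed from per-term statistics. */
final class TfIdfSketch {

    /**
     * Weight of one term in one document.
     *
     * @param tf      term frequency in the document
     * @param df      number of documents containing the term
     * @param numDocs total number of documents
     */
    static double tfIdf(long tf, long df, long numDocs) {
        if (tf <= 0 || numDocs <= 0) {
            return 0.0; // term absent or empty index: no contribution
        }
        double idf = 1.0 + Math.log((double) numDocs / (double) (df + 1));
        return Math.sqrt((double) tf) * idf;
    }

    /** Sums the weights of all query terms; arrays are parallel (one entry per term). */
    static double score(long[] tfs, long[] dfs, long numDocs) {
        if (tfs.length != dfs.length) {
            throw new IllegalArgumentException("tfs and dfs must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < tfs.length; i++) {
            sum += tfIdf(tfs[i], dfs[i], numDocs);
        }
        return sum;
    }
}
```

In a script, the inputs would come from the term statistics exposed per field and term (for example the frequency returned by `freq()`); how document frequency and document count are obtained is left open here.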
// exist. Can be DocsEnum or DocsAndPositionsEnum.
DocsEnum docsEnum;

// Stores if positions, offsets and payloads are requested.
can you use `private final` instead of `final private` here across the board?
I left a couple of minor comments. I guess you can just fix them and push to master!
Below is a minimal example which you can run to see how it works in principle. Documentation lists all options.
At this point I would first like to make sure that the general concept is OK. After that I will add complete javadoc etc.
closes #3772
Minimal example:
And so on. See the documentation for all options.
Note that `_index` was called `_shard` before. See #4584