make term statistics and term vectors accessible in scripts #4161

brwe · 2013-11-13T13:27:29Z

Below is a minimal example that you can run to see how it works in principle. The documentation lists all options.

At this point I would first like to make sure that the general concept is OK. After that I will add complete javadoc etc.

closes #3772

Minimal example:

DELETE paytest

PUT paytest
{
    "mappings": {
        "test": {
            "properties": {
                "text": {
                    "index_analyzer": "fulltext_analyzer",
                    "type": "string"
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "filter": [
                        "my_delimited_payload_filter"
                    ],
                    "tokenizer": "whitespace",
                    "type": "custom"
                }
            },
            "filter": {
                "my_delimited_payload_filter": {
                    "delimiter": "+",
                    "encoding": "float",
                    "type": "delimited_payload_filter"
                }
            }
        },
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 1
        }
    }
}


POST paytest/test/1
{
    "text": "the+1 quick+2 brown+3 fox+4 is quick+10"
}

POST paytest/test/2
{
    "text": "the+1 quick+2 red+3 fox+4"
}

POST paytest/_refresh

# get the total term frequency for "quick"

POST paytest/_search
{
    "script_fields": {
       "ttf": {
          "script": "_index[\"text\"][\"quick\"].ttf()"
       }
    }
}

# get the term frequencies
POST paytest/_search
{
    "script_fields": {
       "freq": {
          "script": "_index[\"text\"][\"quick\"].freq()"
       }
    }
}

POST paytest/test/2/_termvector

# get the payloads, which are floats in this case
POST paytest/_search
{
    "script_fields": {
       "payloads": {
          "script": "term = _index[\"text\"].get(\"red\",_PAYLOADS);payloads = []; for(pos : term){payloads.add(pos.payloadAsFloat(-1));} return payloads;"
       }
    }
}

# compute a very simple score: just use the term frequency
POST paytest/_search
{
   "script_fields": {
      "tv": {
         "script": "_index[\"text\"][\"quick\"].freq()"
      }
   },
   "query": {
      "function_score": {
         "functions": [
            {
               "script_score": {
                  "script": "_index[\"text\"][\"quick\"].freq()"
               }
            }
         ]
      }
   }
}

And so on. See the documentation for all options.

Note that _index was called _shard before. See #4584
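
To make the expected values concrete, here is a minimal pure-Python sketch that emulates what the example scripts would read for the two sample documents. It does not talk to Elasticsearch; the helper names (`analyze`, `freq`, `ttf`, `payloads`) are hypothetical, and it assumes the whitespace tokenizer plus the `+`-delimited float payload filter behave as configured in the mapping above.

```python
# Emulates the whitespace tokenizer + delimited_payload_filter ("+" delimiter,
# float encoding) on the two sample documents. Illustration only.
DOCS = {
    1: "the+1 quick+2 brown+3 fox+4 is quick+10",
    2: "the+1 quick+2 red+3 fox+4",
}

def analyze(text):
    """Return (term, payload) pairs; payload is None when no '+' is present."""
    tokens = []
    for raw in text.split():
        term, sep, payload = raw.partition("+")
        tokens.append((term, float(payload) if sep else None))
    return tokens

def freq(doc_id, term):
    """Term frequency of `term` in one document."""
    return sum(1 for t, _ in analyze(DOCS[doc_id]) if t == term)

def ttf(term):
    """Total term frequency: occurrences of `term` across all documents."""
    return sum(freq(doc_id, term) for doc_id in DOCS)

def payloads(doc_id, term, default=-1.0):
    """Payload of each occurrence of `term`, or `default` if it has none."""
    return [p if p is not None else default
            for t, p in analyze(DOCS[doc_id]) if t == term]

print(freq(1, "quick"), freq(2, "quick"), ttf("quick"))
print(payloads(2, "red"), payloads(1, "quick"))
```

In this emulation "quick" occurs twice in document 1 and once in document 2, so the per-document frequency differs between hits while the total term frequency is the same for every hit.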

brwe · 2013-11-14T12:25:20Z

Was talking to @s1monw and we now think we have to optimize more. For example, use posting list info instead of term vectors unless term vectors are explicitly requested, and do not initialize all positions, payloads, ... unless the user really wants them. Also, we should make an iterator for the positions available. Will be back with a new version soon.
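
The idea of not materializing positions and payloads unless they are requested can be sketched with a generator. This is a simplified Python illustration of the concept, not the actual implementation; `Posting`, `iter_positions`, and the `with_payloads` flag are hypothetical names.

```python
from typing import Iterator, List, NamedTuple, Optional

class Posting(NamedTuple):
    position: int
    payload: Optional[float]

def iter_positions(tokens: List[str], term: str,
                   with_payloads: bool = False) -> Iterator[Posting]:
    """Lazily yield the positions of `term`; decode payloads only on request."""
    for position, raw in enumerate(tokens):
        text, sep, payload = raw.partition("+")
        if text != term:
            continue
        value = float(payload) if (with_payloads and sep) else None
        yield Posting(position, value)

tokens = "the+1 quick+2 brown+3 fox+4 is quick+10".split()

# Nothing is computed until the caller starts iterating.
positions = iter_positions(tokens, "quick")
first = next(positions)

# Payloads are decoded only when explicitly asked for.
decoded = list(iter_positions(tokens, "quick", with_payloads=True))
print(first, decoded)
```

A generator like this can only be consumed once, which is why iterating twice would require some explicit form of caching.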

s1monw · 2013-12-06T10:14:55Z

docs/reference/styles.css

@@ -0,0 +1,473 @@
+h5 {


This file is obsolete.

brwe · 2013-12-10T16:07:57Z

@clint Thanks for cleaning up the docs! Here is what I made of your comments:

Move doc to different file: Can I just put that at a separate file that is called "advancedscripting" and list it right under "scripting" and also link it in the scripting guide?
Currently only available in function_score: No, it is actually available for native scripts that inherit from AbstractSearchScript and for mvel scripts regardless of the context. I use the scripts as scriptfield and in function_score in the tests.
Rename _DO_NOT_RECORD -> _NO_CACHE or _CACHE depending on default: Names sound much better indeed. Will change the default also, so that users must set _CACHE if they want to iterate twice
numDocs, maxDoc, deletedDocs: I added the maxDoc method. As far as I understand, in Lucene maxDoc is the number of documents without subtracting the deleted ones, and numDocs is maxDoc - numDeleted (my doc comment was wrong). I changed the code and documentation accordingly.
Can docCount be computed for live docs only?: No, unfortunately not possible. I added a comment to the docs.
Rename sumttf -> total_tf and sumdf -> total_df: In lucene it is called that way which is why I thought for consistency it might make sense to call it so. Also, I think sum tells you a little more about the nature of the value than total. But I am not too passionate, if you insist I'll change it.
Inconsistent naming freq() vs ...tf(): Right, will rename freq() -> tf(). That then differs from Lucene, but the naming makes more sense.
Expose idf: inverse document frequency is a value computed from docCount and df. Do you mean we should expose the value that is computed in the Lucene tfidf scoring? The only reason to do so that I could think of is saving time, that is: instead of computing the value once per doc, use the precomputed value that is already stored. However, if a user wishes more flexibility, I think it makes more sense to provide a mechanism to evaluate scripts once before the search actually starts, and then we would not have to expose precomputed values. Does that make sense?
lucene Field: I added a hopefully better explanation:

[float]
=== Term vectors:

The implementation so far does not give you the term vector but rather statistics for single terms. If you want to gather information on all terms in a field, you must store the term vectors (set term_vector in the mapping as described in the <<mapping-core-types,mapping documentation>>). To access them, call
_shard.getTermVectors() to get a
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[Fields]
instance. This object can then be used as described in https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[lucene doc] to iterate over fields and then for each field iterate over each term in the field.
The method will return null if the term vectors were not stored.

Is that better?
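
The relationship between the document counters described in the numDocs/maxDoc item above (maxDoc includes deleted documents, numDocs does not) can be written down as a small Python sketch. The class `ShardStats` and its field names are hypothetical; only the arithmetic comes from the comment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardStats:
    max_doc: int       # all documents, including deleted ones
    num_deleted: int   # documents marked as deleted

    @property
    def num_docs(self) -> int:
        """Live documents: maxDoc minus the deleted ones."""
        return self.max_doc - self.num_deleted

stats = ShardStats(max_doc=10, num_deleted=3)
print(stats.num_docs)
```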

clintongormley · 2013-12-10T16:29:29Z

Move doc to different file: Can I just put that at a separate file that is called "advancedscripting" and list it right under "scripting" and also link it in the scripting guide?

Yes, you can just add another include statement.

Currently only available in function_score: No, it is actually available for native scripts that inherit from AbstractSearchScript and for mvel scripts regardless of the context. I use the scripts as scriptfield and in function_score in the tests.

Ah that was probably me trying getDocCount() which resulted in an error.

Rename sumttf -> total_tf and sumdf -> total_df: In lucene it is called that way which is why I thought for consistency it might make sense to call it so. Also, I think sum tells you a little more about the nature of the value than total. But I am not too passionate, if you insist I'll change it.

I don't feel strongly about it, it was just a suggestion. I'm open to what others think here.

Expose idf: inverse document frequency is a value computed from docCount and df. Do you mean we should expose the value that is computed in the Lucene tfidf scoring? The only reason to do so that I could think of is saving time, that is: instead of computing the value once per doc, use the precomputed value that is already stored. However, if a user wishes more flexibility, I think it makes more sense to provide a mechanism to evaluate scripts once before the search actually starts, and then we would not have to expose precomputed values. Does that make sense?

Yes, it makes sense. I don't know if idf is useful or not; it was just a suggestion in case it had been overlooked.

lucene Field: I added a hopefully better explanation:

yes, much more understandable :)
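
As a point of reference for the idf discussion, a script could derive an idf-like value itself from the document count and the document frequency. The sketch below uses the formula from Lucene's classic TF-IDF similarity, 1 + ln(docCount / (df + 1)); the function name and the sample numbers are made up for illustration, and whether this matches the value a particular similarity stores is an assumption.

```python
import math

def idf(doc_count: int, df: int) -> float:
    """Inverse document frequency: 1 + ln(doc_count / (df + 1))."""
    return 1.0 + math.log(doc_count / (df + 1))

# Rarer terms get a higher weight than common ones.
rare = idf(doc_count=1000, df=9)
common = idf(doc_count=1000, df=999)
print(rare, common)
```

Computing this per document is cheap but repetitive, which is the cost the "evaluate once before search starts" mechanism mentioned above would avoid.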

brwe · 2013-12-10T16:54:46Z

Thank you all so much for the comments! I added another commit that implements the changes. The two things I did not do yet:

setDocId() and setNextReader() being called twice needs to be looked into and potentially deserves its own issue.
Renaming sumttf, sumdf -> total_ttf, total_df is still undecided. If I get no more opinions on that, I'll leave it as is.

s1monw · 2013-12-10T22:58:24Z

src/main/java/org/elasticsearch/common/lucene/search/function/FunctionScoreQuery.java

@@ -168,8 +171,9 @@ public int nextDoc() throws IOException {

        @Override
        public float score() throws IOException {
-            return scoreCombiner.combine(subQueryBoost, scorer.score(),
-                    function.score(scorer.docID(), scorer.score()), maxBoost);
+            float score = scorer.score();
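
The change in this hunk evaluates the sub-query score once and reuses it instead of calling score() twice. A minimal Python sketch of that pattern follows; `CountingScorer`, `combine`, and `function_score` are hypothetical stand-ins for the Java classes, and the multiplication used to combine the scores is just an assumed example.

```python
class CountingScorer:
    """Stand-in scorer that records how often score() is evaluated."""
    def __init__(self, value: float):
        self.value = value
        self.calls = 0

    def score(self) -> float:
        self.calls += 1
        return self.value

def function_score(doc_id: int, sub_query_score: float) -> float:
    # Hypothetical function: here simply twice the sub-query score.
    return 2.0 * sub_query_score

def combine(boost: float, query_score: float, func_score: float) -> float:
    # Assumed combination: multiply boost, query score and function score.
    return boost * query_score * func_score

def score_once(scorer: CountingScorer, doc_id: int, boost: float) -> float:
    score = scorer.score()  # evaluate once, reuse below
    return combine(boost, score, function_score(doc_id, score))

scorer = CountingScorer(1.5)
result = score_once(scorer, doc_id=0, boost=1.0)
print(result, scorer.calls)
```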


term statistics can be accessed via the _shard variable. Below is a minimal example. See documentation on details.

```
DELETE paytest

PUT paytest
{
    "mappings": {
        "test": {
            "_all": {
                "auto_boost": true,
                "enabled": true
            },
            "properties": {
                "text": {
                    "index_analyzer": "fulltext_analyzer",
                    "store": "yes",
                    "type": "string"
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "filter": [
                        "my_delimited_payload_filter"
                    ],
                    "tokenizer": "whitespace",
                    "type": "custom"
                }
            },
            "filter": {
                "my_delimited_payload_filter": {
                    "delimiter": "+",
                    "encoding": "float",
                    "type": "delimited_payload_filter"
                }
            }
        },
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 1
        }
    }
}

POST paytest/test/1
{
    "text": "the+1 quick+2 brown+3 fox+4 is quick+10"
}

POST paytest/test/2
{
    "text": "the+1 quick+2 red+3 fox+4"
}

POST paytest/_refresh

POST paytest/_search
{
    "script_fields": {
       "ttf": {
          "script": "_shard[\"text\"][\"quick\"].ttf()"
       }
    }
}

POST paytest/_search
{
    "script_fields": {
       "freq": {
          "script": "_shard[\"text\"][\"quick\"].freq()"
       }
    }
}

POST paytest/test/2/_termvector

POST paytest/_search
{
    "script_fields": {
       "payloads": {
          "script": "term = _shard[\"text\"].get(\"red\",_PAYLOADS);payloads = []; for(pos : term){payloads.add(pos.payloadAsFloat(-1));} return payloads;"
       }
    }
}

POST paytest/_search
{
   "script_fields": {
      "tv": {
         "script": "_shard[\"text\"][\"quick\"].freq()"
      }
   },
   "query": {
      "function_score": {
         "functions": [
            {
               "script_score": {
                  "script": "_shard[\"text\"][\"quick\"].freq()"
               }
            }
         ]
      }
   }
}
```

closes elastic#3772

brwe · 2013-12-19T09:57:51Z

Implemented all changes as discussed

Implementation relies on elastic/elasticsearch#4161. This adds scripts for:
- cosine similarity
- tfidf
- language model scoring
- simple phrase scorer

s1monw · 2013-12-20T16:13:52Z

src/main/java/org/elasticsearch/search/lookup/ScriptTerm.java

+    // exist. Can be DocsEnum or DocsAndPositionsEnum.
+    DocsEnum docsEnum;
+
+    // Stores if positions, offsets and payloads are requested.


Can you use private final instead of final private here across the board?

s1monw · 2013-12-20T16:19:20Z

I left a couple of minor comments. I guess you can just fix them and push to master!

brwe · 2014-01-02T11:29:27Z

Pushed to master (1ede9a5) and 0.90 (d1f753e)

This was referenced Nov 13, 2013

Add support for using payloads to boost terms #3772

Closed

Feature Request: pre-select terms in TermVector request #3924

Closed

brwe added 7 commits December 17, 2013 14:52

do not call score() twice

4489f9c

implemented comments from @s1monw, @clintongormley, @martijnvg

7718b8d

implemented @s1monw comments

0cd0028

save term frequency to avoid null checks

b132824

randomize number of shards in some tests and remove unneeded benchmark classes

de7a768

remove todo in doc and some minor changes

4a0fd70

ghost assigned brwe Dec 18, 2013

brwe closed this Jan 2, 2014

kimchy mentioned this pull request Jan 2, 2014

Revisit _shard / class names in exposing terms stats for scripts #4584

Closed
