
make term statistics and term vectors accessible in scripts #4161

Closed
wants to merge 7 commits into from


@brwe brwe commented Nov 13, 2013

Below is a minimal example that you can run to see how it works in principle. The documentation lists all options.

At this point I would first like to make sure that the general concept is OK. After that I will add complete javadoc etc.

closes #3772

Minimal example:

DELETE paytest

PUT paytest
{
    "mappings": {
        "test": {
            "properties": {
                "text": {
                    "index_analyzer": "fulltext_analyzer",
                    "type": "string"
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "filter": [
                        "my_delimited_payload_filter"
                    ],
                    "tokenizer": "whitespace",
                    "type": "custom"
                }
            },
            "filter": {
                "my_delimited_payload_filter": {
                    "delimiter": "+",
                    "encoding": "float",
                    "type": "delimited_payload_filter"
                }
            }
        },
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 1
        }
    }
}


POST paytest/test/1
{
    "text": "the+1 quick+2 brown+3 fox+4 is quick+10"
}

POST paytest/test/2
{
    "text": "the+1 quick+2 red+3 fox+4"
}

POST paytest/_refresh

# get the total term frequency for "quick"

POST paytest/_search
{
    "script_fields": {
       "ttf": {
          "script": "_index[\"text\"][\"quick\"].ttf()"
       }
    }
}

# get the term frequencies
POST paytest/_search
{
    "script_fields": {
       "freq": {
          "script": "_index[\"text\"][\"quick\"].freq()"
       }
    }
}

# inspect the term vector of document 2
POST paytest/test/2/_termvector

# get the payloads, which are floats in this case
POST paytest/_search
{
    "script_fields": {
       "payloads": {
          "script": "term = _index[\"text\"].get(\"red\",_PAYLOADS);payloads = []; for(pos : term){payloads.add(pos.payloadAsFloat(-1));} return payloads;"
       }
    }
}

# compute a very simple score: just use the term frequency
POST paytest/_search
{
   "script_fields": {
      "tv": {
         "script": "_index[\"text\"][\"quick\"].freq()"
      }
   },
   "query": {
      "function_score": {
         "functions": [
            {
               "script_score": {
                  "script": "_index[\"text\"][\"quick\"].freq()"
               }
            }
         ]
      }
   }
}

And so on. See the documentation for all options.

Note that _index was called _shard before. See #4584
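
To make the expected values in the example above easier to check, here is a small Python sketch that mimics the analysis chain (whitespace tokenizer plus a `+`-delimited float payload) on the two example documents. It is not the Elasticsearch or Lucene implementation; the helper names (`analyze`, `tf`, `ttf`, `df`, `payloads`) are made up for illustration, and it assumes a single shard with no deleted documents.

```python
# Illustrative only: a plain-Python stand-in for the per-term statistics
# that the scripts above read from the index. All names are hypothetical.

# The two example documents, exactly as indexed above.
docs = {
    "1": "the+1 quick+2 brown+3 fox+4 is quick+10",
    "2": "the+1 quick+2 red+3 fox+4",
}


def analyze(text, delimiter="+"):
    """Split on whitespace, then split each token into (term, payload).

    The payload is a float if the delimiter is present, otherwise None.
    """
    tokens = []
    for raw in text.split():
        term, sep, payload = raw.partition(delimiter)
        tokens.append((term, float(payload) if sep else None))
    return tokens


analyzed = {doc_id: analyze(text) for doc_id, text in docs.items()}


def tf(doc_id, term):
    """Term frequency: occurrences of `term` in one document."""
    return sum(1 for t, _ in analyzed[doc_id] if t == term)


def ttf(term):
    """Total term frequency: occurrences of `term` over all documents."""
    return sum(tf(doc_id, term) for doc_id in analyzed)


def df(term):
    """Document frequency: number of documents containing `term`."""
    return sum(1 for doc_id in analyzed if tf(doc_id, term) > 0)


def payloads(doc_id, term, default=-1.0):
    """Payloads of every occurrence of `term`, with a default if missing."""
    return [p if p is not None else default
            for t, p in analyzed[doc_id] if t == term]


print(tf("1", "quick"), tf("2", "quick"))  # 2 1
print(ttf("quick"), df("quick"))           # 3 2
print(payloads("2", "red"))                # [3.0]
print(payloads("1", "red"))                # []
```

Under these assumptions, the `freq()` script field would be 2 for document 1 and 1 for document 2, and the payload script would return a single float for document 2 and an empty list for document 1.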

brwe commented Nov 14, 2013

I talked to @s1monw, and we now think we have to optimize more. For example, use posting list information instead of term vectors unless term vectors are explicitly requested, and do not initialize positions, payloads, etc. unless the user really wants them. We should also make an iterator over the positions available. I will be back with a new version soon.

@@ -0,0 +1,473 @@
h5 {

This file is obsolete.


yes

brwe commented Dec 10, 2013

@clint Thanks for cleaning up the docs! Here is what I made of your comments:

  1. Move doc to a different file: Can I just put it in a separate file called "advancedscripting", list it right under "scripting", and also link it in the scripting guide?
  2. Currently only available in function_score: No, it is actually available for native scripts that inherit from AbstractSearchScript and for mvel scripts regardless of the context. In the tests I use the scripts both as script fields and in function_score.
  3. Rename _DO_NOT_RECORD -> _NO_CACHE or _CACHE depending on the default: These names sound much better indeed. I will also change the default, so that users must set _CACHE if they want to iterate twice.
  4. numDocs, maxDoc, deletedDocs: I added the maxDoc method. As far as I understand, in Lucene maxDoc is the number of documents without subtracting the deleted ones, and numDocs is maxDoc - numDeleted (my doc comment was wrong). I changed the code and documentation accordingly.
  5. Can docCount be computed for live docs only?: No, unfortunately that is not possible. I added a comment to the docs.
  6. Rename sumttf -> total_tf and sumdf -> total_df: That is what it is called in Lucene, which is why I thought it might make sense to keep the name for consistency. Also, I think "sum" tells you a little more about the nature of the value than "total". But I am not too passionate; if you insist, I'll change it.
  7. Inconsistent naming freq() vs ...tf(): Right, I will rename freq() -> tf(). That then differs from Lucene, but the naming makes more sense.
  8. Expose idf: Inverse document frequency is a value computed from docCount and df. Do you mean we should expose the value that is computed in the Lucene tf-idf scoring? The only reason to do so that I can think of is saving time, that is: instead of computing the value once per doc, use the precomputed value that is already stored. However, if a user wishes for more flexibility, I think it makes more sense to provide a mechanism to evaluate scripts once before the search actually starts, and then we would not have to expose precomputed values. Does that make sense?
  9. lucene Field: I added a hopefully better explanation:

[float]
=== Term vectors:

The implementation so far does not give you the term vector but rather statistics for single terms. If you want to gather information on all terms in a field, you must store the term vectors (set term_vector in the mapping as described in the <<mapping-core-types,mapping documentation>>). To access them, call
_shard.getTermVectors() to get a
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[Fields]
instance. This object can then be used as described in https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[lucene doc] to iterate over fields and then for each field iterate over each term in the field.
The method will return null if the term vectors were not stored.

Is that better?
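
As a rough illustration of points 4 and 8 in the list above, the following Python sketch shows how the document counts relate and how an idf value could be derived inside a script from docCount and df. The function names are hypothetical, and the formula `1 + ln(docCount / (df + 1))` is only assumed here to match the classic Lucene TF-IDF similarity; the PR itself does not prescribe any particular idf formula.

```python
import math


def num_docs(max_doc, num_deleted):
    """Live documents: maxDoc counts deleted documents, numDocs does not."""
    return max_doc - num_deleted


def classic_idf(df, doc_count):
    """One common idf variant (assumed, not taken from this PR).

    df        -- number of documents containing the term
    doc_count -- number of documents that have the field
    """
    return 1.0 + math.log(doc_count / (df + 1.0))


# Values for the two-document example index, assuming no deletions.
print(num_docs(max_doc=2, num_deleted=0))   # 2
print(classic_idf(df=1, doc_count=2))       # 1.0  ("red" occurs in one doc)
print(classic_idf(df=2, doc_count=2) < 1.0) # True ("quick" occurs in both)
```

Because docCount cannot be restricted to live documents (point 5), a value computed this way may be slightly off on an index with many deletions.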

@clintongormley

> Move doc to a different file: Can I just put it in a separate file called "advancedscripting", list it right under "scripting", and also link it in the scripting guide?

Yes, you can just add another include statement.

> Currently only available in function_score: No, it is actually available for native scripts that inherit from AbstractSearchScript and for mvel scripts regardless of the context. In the tests I use the scripts both as script fields and in function_score.

Ah that was probably me trying getDocCount() which resulted in an error.

> Rename sumttf -> total_tf and sumdf -> total_df: That is what it is called in Lucene, which is why I thought it might make sense to keep the name for consistency. Also, I think "sum" tells you a little more about the nature of the value than "total". But I am not too passionate; if you insist, I'll change it.

I don't feel strongly about it, it was just a suggestion. I'm open to what others think here.

> Expose idf: Inverse document frequency is a value computed from docCount and df. Do you mean we should expose the value that is computed in the Lucene tf-idf scoring? The only reason to do so that I can think of is saving time, that is: instead of computing the value once per doc, use the precomputed value that is already stored. However, if a user wishes for more flexibility, I think it makes more sense to provide a mechanism to evaluate scripts once before the search actually starts, and then we would not have to expose precomputed values. Does that make sense?

Yes, it makes sense. I don't know whether idf is useful or not; it was just a suggestion in case it had been overlooked.

> lucene Field: I added a hopefully better explanation:

yes, much more understandable :)

brwe commented Dec 10, 2013

Thank you all so much for the comments! I added another commit that implements the changes. The two things I did not do yet:

  1. setDocId() and setNextReader() being called twice needs to be looked into and potentially deserves its own issue.
  2. Renaming sumttf, sumdf -> total_ttf, total_df is still undecided. If I get no more opinions on that, I'll leave it as is.

@@ -168,8 +171,9 @@ public int nextDoc() throws IOException {

@Override
public float score() throws IOException {
return scoreCombiner.combine(subQueryBoost, scorer.score(),
function.score(scorer.docID(), scorer.score()), maxBoost);
float score = scorer.score();

I like!

Term statistics can be accessed via the _shard variable.

Below is a minimal example. See the documentation for details.

```

DELETE paytest

PUT paytest
{
    "mappings": {
        "test": {
            "_all": {
                "auto_boost": true,
                "enabled": true
            },
            "properties": {
                "text": {
                    "index_analyzer": "fulltext_analyzer",
                    "store": "yes",
                    "type": "string"
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "filter": [
                        "my_delimited_payload_filter"
                    ],
                    "tokenizer": "whitespace",
                    "type": "custom"
                }
            },
            "filter": {
                "my_delimited_payload_filter": {
                    "delimiter": "+",
                    "encoding": "float",
                    "type": "delimited_payload_filter"
                }
            }
        },
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 1
        }
    }
}

POST paytest/test/1
{
    "text": "the+1 quick+2 brown+3 fox+4 is quick+10"
}

POST paytest/test/2
{
    "text": "the+1 quick+2 red+3 fox+4"
}

POST paytest/_refresh

POST paytest/_search
{
    "script_fields": {
       "ttf": {
          "script": "_shard[\"text\"][\"quick\"].ttf()"
       }
    }
}

POST paytest/_search
{
    "script_fields": {
       "freq": {
          "script": "_shard[\"text\"][\"quick\"].freq()"
       }
    }
}

POST paytest/test/2/_termvector

POST paytest/_search
{
    "script_fields": {
       "payloads": {
          "script": "term = _shard[\"text\"].get(\"red\",_PAYLOADS);payloads = []; for(pos : term){payloads.add(pos.payloadAsFloat(-1));} return payloads;"
       }
    }
}

POST paytest/_search
{
   "script_fields": {
      "tv": {
         "script": "_shard[\"text\"][\"quick\"].freq()"
      }
   },
   "query": {
      "function_score": {
         "functions": [
            {
               "script_score": {
                  "script": "_shard[\"text\"][\"quick\"].freq()"
               }
            }
         ]
      }
   }
}

```

closes elastic#3772
@ghost ghost assigned brwe Dec 18, 2013
brwe commented Dec 19, 2013

Implemented all changes as discussed

brwe added a commit to brwe/elasticsearch-native-script-example that referenced this pull request Dec 20, 2013
Implementation relies on
elastic/elasticsearch#4161

This adds scripts for

- cosine similarity
- tfidf
- language model scoring
- simple phrase scorer
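
For readers unfamiliar with the first item in the list above, here is a minimal Python sketch of cosine similarity computed from per-document term frequencies, which is the kind of statistic this pull request exposes to scripts. It is not the code from the referenced repository; the function name and the unweighted query vector are assumptions made for illustration.

```python
import math


def cosine_similarity(query_weights, doc_tf):
    """Cosine of the angle between a query vector and a document tf vector.

    query_weights -- dict mapping term -> weight in the query
    doc_tf        -- dict mapping term -> term frequency in the document
    """
    dot = sum(w * doc_tf.get(term, 0) for term, w in query_weights.items())
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    doc_norm = math.sqrt(sum(f * f for f in doc_tf.values()))
    if query_norm == 0 or doc_norm == 0:
        return 0.0
    return dot / (query_norm * doc_norm)


# Term frequencies of the two example documents from this pull request.
doc1 = {"the": 1, "quick": 2, "brown": 1, "fox": 1, "is": 1}
doc2 = {"the": 1, "quick": 1, "red": 1, "fox": 1}
query = {"quick": 1, "fox": 1}

# Document 1 scores higher because "quick" occurs twice in it.
print(cosine_similarity(query, doc1) > cosine_similarity(query, doc2))  # True
```

Note that the document norm needs the frequencies of all terms in the field, which in a real script would require stored term vectors rather than statistics for single terms.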
// exist. Can be DocsEnum or DocsAndPositionsEnum.
DocsEnum docsEnum;

// Stores if positions, offsets and payloads are requested.

can you use private final instead of final private here across the board?

s1monw commented Dec 20, 2013

I left a couple of minor comments. I guess you can just fix them and push to master!

brwe commented Jan 2, 2014

Pushed to master (1ede9a5) and 0.90 (d1f753e)

Successfully merging this pull request may close these issues.

Add support for using payloads to boost terms
4 participants