Infrastructure for changing easily the significance terms heuristic #6561

Closed
wants to merge 22 commits

Conversation

@brwe (Contributor) commented Jun 19, 2014

...euristic

This commit adds the infrastructure to allow plugging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding

  • SignificanceHeuristic
  • SignificanceHeuristicBuilder
  • SignificanceHeuristicParser

and registering the Parser and Heuristic with the SignificantTermsHeuristicModule.

As a proof of concept, this commit also adds a second heuristic to the
already existing one (MutualInformation).

The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
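To make the combination concrete, here is a minimal sketch of that calculation (illustrative only, not the actual JLHScore implementation; variable names are assumptions):

    // Sketch of the score described above: the absolute change in popularity
    // is multiplied by the relative change to balance common vs. rare terms.
    static double score(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
        double foregroundPercent = (double) subsetFreq / subsetSize;
        double backgroundPercent = (double) supersetFreq / supersetSize;
        double absoluteChange = foregroundPercent - backgroundPercent; // favors common terms
        double relativeChange = foregroundPercent / backgroundPercent; // favors rare terms
        return absoluteChange * relativeChange;                        // sweet spot between the two
    }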

===== mutual information
added[1.2.3]

Contributor:
I think this should not go into a bugfix release (i.e. s/1.2.3/1.3.0/)

@brwe removed the v1.2.2 label Jun 19, 2014

protected static final String[] NAMES = {"mutual_information", Strings.toCamelCase("mutual_information")};

protected static final ParseField NAMES_FIELD = new ParseField(NAMES[0], NAMES[1]);

Contributor:
Note that ParseField does the camel casing for you and the 2nd/3rd... args to its constructor are actually deprecated names so that in future we can run in "strict" mode and flag any client uses of deprecated APIs
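For illustration, a minimal sketch of what the reviewer describes (assuming the ParseField constructor of that era, where the trailing arguments are the deprecated names):

    // Sketch only: the first argument is the preferred name, the remaining
    // arguments are deprecated spellings that "strict" mode can later reject.
    ParseField namesField = new ParseField("mutual_information", "mutualInformation");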

@markharwood (Contributor) commented:

Looks great @brwe ! I have the Log Likelihood Ratio code from Mahout if you want to bundle that too?
I made a couple of tweaks with Ted's guidance as part of our tests.

@brwe (Author) commented Jun 20, 2014

@markharwood I thought I should make a second pull request that adds that, and also Chi square and so on? It is a lot of code already.

@jpountz (Contributor) commented Jun 21, 2014

+1 to split into several pull requests

@brwe (Author) commented Jun 23, 2014

I added two commits to add the deprecated names checking, but only for the significant terms heuristics here. It seems to me that deprecated names are never checked in aggregations anywhere, unless I am missing something. I am now wondering if it would make more sense to add that to aggregations in a separate commit.

@areek assigned rmuir and unassigned rmuir Jun 23, 2014

@brwe (Author) commented Jun 23, 2014

I am now wondering if it would make more sense to add that to aggregations in a separate commit.

Removed the strict parsing flag check again, seems to make more sense to do that consistently in a different pull request.

@s1monw (Contributor) commented Jun 26, 2014

I only looked briefly at this, but can we add extensive unit tests for the individual heuristics? I think we should add those!

@@ -120,6 +122,7 @@ public void readFrom(StreamInput in) throws IOException {
this.minDocCount = in.readVLong();
this.subsetSize = in.readVLong();
this.supersetSize = in.readVLong();
significanceHeuristic = SignificanceHeuristicStreams.read(in);

Contributor:

Should you only read it if the version is >= 1.3.0 and otherwise fall back to the default impl?
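Something along these lines (a sketch only; the exact version constant and default instance are assumptions):

    // Sketch: only read the heuristic if the sending node is new enough to
    // have written it, otherwise fall back to the default implementation.
    if (in.getVersion().onOrAfter(Version.V_1_3_0)) {
        significanceHeuristic = SignificanceHeuristicStreams.read(in);
    } else {
        significanceHeuristic = JLHScore.INSTANCE; // assumed default impl
    }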

Contributor Author:
Yes, I added a commit "check if version supports...".
I added a bwc test for this check, but on latest master it fails because of the combination of 6093 and 5659. Currently the branch is based on e2da211; I'll rebase on latest master once this is resolved.

Contributor Author:

tests run now, I rebased on master

@brwe (Author) commented Jul 1, 2014

Updated with new commits. @s1monw : I added unit tests in SignificantTermsUnitTests, is that extensive enough?
@markharwood: I added assertions to the score computation and had to change one of the tests (see commits "test score assertions and score" and "check for shard failures...") - is the new behavior OK?

/**
*
*/
@ElasticsearchIntegrationTest.SuiteScopeTest

Contributor:

If it is an ElasticsearchLuceneTestCase you don't need @ElasticsearchIntegrationTest.SuiteScopeTest. I also think it should extend ElasticsearchTestCase rather than ElasticsearchLuceneTestCase, or do you use any Lucene-specific parts? Can we also call this test not ..UnitTest but maybe SignificanceHeuristicTest?

Contributor Author:

done, commit is "make ElasticsearchTestCase and rename..."

@brwe (Author) commented Jul 2, 2014

@markharwood About the assertions in the scoring function: I agree, we might not always want to rely on the strict superset property. However, for mutual information we sort of rely on the fact that it is strict, else the computations do not make sense.

Mutual information compares two sets and not so much foreground against background. I assumed that the two sets are the subset and the background without the subset. It therefore relies on knowing the frequency in the subset but also the frequency in the background set without the subset. Because currently I only get the background frequency, I have to do a subtraction of background frequency and foreground frequency to figure out how many are in the other set.

Now an example:
Background contains 3 documents, but foreground contains 2 because the strict superset property was violated or because the two sets are completely independent. Now, if the function gets passed foreground freq = 3 and background freq = 2, I know that one set contains 3, but I have no means to determine how many documents are actually in the other set, as I do not know the overlap of the two sets. Subtracting foreground frequency from background frequency is clearly wrong - I get a negative number and the computed value has no meaning. Hence all the strict checks.

I will remove the assertions from the default score and only leave them in mutual information. Actually I am thinking I should replace the asserts with exceptions to make sure users are aware that whatever is computed is wrong...
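A small sketch of the failure mode (hypothetical values and names), showing why the subtraction only makes sense when the background really is a superset:

    // Frequencies as passed to the heuristic (hypothetical values).
    long subsetFreq = 3;    // term frequency in the foreground set
    long supersetFreq = 2;  // term frequency in the background set
    // Mutual information needs the frequency in "background without foreground":
    long otherSetFreq = supersetFreq - subsetFreq; // = -1 here, a meaningless count
    // Hence the strict checks (or exceptions) mentioned above.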

@markharwood (Contributor) commented:

I find practical uses of these significance algos on free text are vastly improved if the foreground sample is devoid of the sorts of duplicate text introduced by retweets, email replies, copyright notices etc. that we find in typical content. This is the area I am working on at the moment - efficiently stripping out repetitions - and it will only add to the fuzziness of the numbers presented (e.g. I count only half of the text in documents in a result set). This will mean the foreground sample under-reports word frequencies, and any significance algos shouldn't be too thrown off by that.
I'm close to issuing a PR for this, so it may be useful to try some of these alternative significance algos in this context.

@brwe (Author) commented Jul 2, 2014

@markharwood the problem does not arise from under-reporting of word frequencies, but from the inability to clearly determine the frequencies in the two sets being compared.
The current heuristic compares one set vs a background, and I agree the counts can be fuzzy. But mutual information compares two distinct sets, and the significance cannot be determined if the frequencies in each of the sets cannot be computed.

This will mean the foreground sample under-reports word frequencies and any significance algos shouldn't be too thrown off by that.

Maybe I am missing something, but I do not see how I should compare two sets if I cannot determine the frequencies within them?

@brwe (Author) commented Jul 2, 2014

Maybe we could let the user give a hint for mutual information whether or not the background is actually a strict superset or defines a completely different set. Something like

 "significant_terms": {
    "field": ...,
    ...
    "mutual_information": {
      "background_is_superset": true/false
    }
  }

and then derive the different frequencies depending on that? This way, the user would have all the flexibility.
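Deriving the frequency of the second set could then be a simple branch on that flag (sketch with assumed names):

    // Sketch: frequency of the term in the set the foreground is compared against.
    long otherSetFreq = backgroundIsSuperset
            ? supersetFreq - subsetFreq // background includes the foreground docs
            : supersetFreq;             // background is a completely separate set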

@markharwood (Contributor) commented:

Maybe I am missing something, but I do not see how I should compare two sets if I cannot determine the frequencies within them?

We potentially have a number of useful tools at our disposal in producing sample sets (background_filters, de-duping and doc "slicing") and they can all introduce some oddities into the numbers presented. Maybe it is a mistake to use words in the code like "subset" and "superset" to describe the numbers if certain algos expect that strict behaviour? Maybe foreground/background-sample are less charged words and your flag "is superset" helps clarify the position.

@brwe (Author) commented Jul 3, 2014

I implemented all review comments I got so far, ready for the next review round.

@markharwood (Contributor) commented:

I just wanted to register the concern that this may well become a scoring function with custom params in future. Shouldn't be too hard to refactor if we choose to add this later.


private JLHScore() {};

@Override

Contributor:

You must drop this equals method unless you have a corresponding hashCode impl. Yet since this is a singleton you can just drop it.

Contributor Author:

I removed it.

@s1monw (Contributor) commented Jul 8, 2014

Left a tiny comment - if you fix this you can push, i.e. LGTM.


protected static final ParseField IS_BACKGROUND = new ParseField("is_background");

protected static final String SCORE_ERROR_MESSAGE = ", does you background filter not include all documents in the bucket? If so and it is intentional, set \"" + IS_BACKGROUND.getPreferredName() + "\": false";

Contributor:

typo: "does you". Curious why you moved away from is_superset as the parameter name? The new "Is_background" says nothing about the required characteristics of the background. How about "background_is_superset" ?

Contributor Author:

"background_is_superset" is best. I'll change it to that.

@Override
public SignificanceHeuristic parse(XContentParser parser) throws IOException, QueryParsingException {
// move to the closing bracket
parser.nextToken();

Contributor:

Check that nextToken is Token.END_OBJECT and throw an appropriate error if not. Without this additional check the parser errors are somewhat confusing if the JSON contains a parameter.
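For example, something like the following (a sketch; the exact exception type and message are assumptions):

    // Sketch: fail fast with a clear error if the heuristic object is not empty.
    if (parser.nextToken() != XContentParser.Token.END_OBJECT) {
        throw new ElasticsearchParseException(
                "expected an empty object for the heuristic, but found [" + parser.currentToken() + "]");
    }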

@markharwood (Contributor) commented:

Left a couple of small comments but otherwise looks great.

@s1monw removed the review label Jul 9, 2014
@brwe (Author) commented Jul 11, 2014

Implemented the latest review comments.

@s1monw (Contributor) commented Jul 14, 2014

LGTM +1 to push

brwe added a commit that referenced this pull request Jul 14, 2014
…e heuristic

This commit adds the infrastructure to allow plugging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding

- SignificanceHeuristic
- SignificanceHeuristicBuilder
- SignificanceHeuristicParser

closes #6561
@brwe closed this in 74927ad Jul 14, 2014
@clintongormley changed the title from "significant terms: infrastructure for changing easily the significance h..." to "Aggregations: Infrastructure for changing easily the significance terms heuristic" Jul 16, 2014
@clintongormley changed the title from "Aggregations: Infrastructure for changing easily the significance terms heuristic" to "Infrastructure for changing easily the significance terms heuristic" Jun 6, 2015