
Speed up phrase suggestion scoring #5396

Closed
nik9000 wants to merge 2 commits

Conversation

@nik9000 (Member) commented Mar 12, 2014

Two changes:

  1. In the StupidBackoffScorer only look for the trigram if there is a bigram.
  2. Cache the frequencies in WordScorer so we don't look them up
    again and again and again.

This provides a pretty substantial speedup when there are many candidates.

Closes #5395
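To make change 1 concrete, here is a minimal, self-contained sketch of Stupid Backoff scoring with the short-circuit. The class, method names, and Map-based lookup are invented for illustration; the real StupidBackoffScorer reads n-gram counts from a TermsEnum over a shingle field.

```java
import java.util.HashMap;
import java.util.Map;

// Toy Stupid Backoff scorer. A trigram "w1 w2 w3" cannot occur without the
// bigram "w1 w2" occurring, so when the bigram count is zero the trigram
// lookup can be skipped entirely -- that's the speedup from change 1.
public class StupidBackoffSketch {
    static final double DISCOUNT = 0.4;
    private final Map<String, Long> counts = new HashMap<>();
    private final long totalTerms;

    public StupidBackoffSketch(Map<String, Long> counts, long totalTerms) {
        this.counts.putAll(counts);
        this.totalTerms = totalTerms;
    }

    private long freq(String ngram) {
        return counts.getOrDefault(ngram, 0L);
    }

    public double scoreTrigram(String w1, String w2, String w3) {
        long bigram = freq(w1 + " " + w2);
        if (bigram > 0) {
            // Only look for the trigram if there is a bigram.
            long trigram = freq(w1 + " " + w2 + " " + w3);
            if (trigram > 0) {
                return (double) trigram / bigram;
            }
        }
        return DISCOUNT * scoreBigram(w2, w3);
    }

    public double scoreBigram(String w2, String w3) {
        long bigram = freq(w2 + " " + w3);
        if (bigram > 0) {
            return (double) bigram / freq(w2);
        }
        return DISCOUNT * freq(w3) / totalTerms;
    }
}
```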

@nik9000 (Member Author) commented Mar 12, 2014

I'd really appreciate someone having a look at this. I think it could save me some time. I'll run it on a more real dataset in the morning and have more to report.

I'd also like to be able to use the candidate frequency rather than look up the term frequency again given that it is already there. I think that would work in most cases but it doesn't seem right when you use one of the transformers. I don't have a good grasp of those at the moment so I'm going to leave it as is.

@nik9000 (Member Author) commented Mar 12, 2014

@s1monw, is the phrase suggester still your baby?

@@ -92,6 +92,7 @@ public void findCandidates(CandidateSet[] candidates, Candidate[] path, int ord,
private void updateTop(CandidateSet[] candidates, Candidate[] path, PriorityQueue<Correction> corrections, double cutoffScore, double score)
throws IOException {
score = Math.exp(score);
// This assertion totally throws off performance numbers.... It makes caching look better than it is. It is still good though.
Contributor

don't run your benchmarks with -ea ;)

Member Author

I was more trying to make a unit test that made the problem I see in production more obvious. In production I see 300ms per phrase suggester and I'm hoping to cut that by an order of magnitude. Here I had to make it bad enough to go from 5-10 seconds to 300ms so the test can actually catch stuff.... And this assertion caught me off guard there.

@s1monw (Contributor) commented Mar 12, 2014

@nik9000 that's a great patch - let's get this in soon, it can speed things up dramatically I guess - I left some comments impl-wise

@s1monw s1monw self-assigned this Mar 12, 2014
@nik9000 (Member Author) commented Mar 12, 2014

Added a commit to improve indexRandom for this patch. I think, though, that I should send it in another pull request and make this one dependent on it.


import java.io.IOException;

public class FrequencyCachingTermsEnum extends FilterTermsEnum {
Contributor

I mean this is trivial, but I don't want to add this to ES as-is. It's pretty much broken unless you know exactly how it works internally. What I mean is that you can't call #next() or docs() since it doesn't seek. I think we can work towards this but it should have all the functions supported. I think we should cache both ttf and df, keep a boolean for whether a real seek is needed, and then do the seek if one of the other methods is called. Makes sense? Other than that I think this is the way to go.
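A minimal sketch of the structure this comment proposes, assuming a simplified `TermSource` interface invented here in place of Lucene's TermsEnum: cache df and ttf per term, and track whether the last seek was served from the cache so a real seek can be performed lazily when a positioning method is called.

```java
import java.util.HashMap;
import java.util.Map;

// Caches docFreq and totalTermFreq per term. When a lookup is served from the
// cache the real seek is skipped but remembered, so methods that need the
// underlying enum positioned can perform it later.
public class CachingTermSource {
    public interface TermSource {
        boolean seekExact(String term);
        int docFreq();
        long totalTermFreq();
    }

    private static final class Freqs {
        final int df;
        final long ttf;
        Freqs(int df, long ttf) { this.df = df; this.ttf = ttf; }
    }

    private final TermSource in;
    private final Map<String, Freqs> cache = new HashMap<>();
    private Freqs current;
    private String pendingTerm; // non-null when the real seek was skipped

    public CachingTermSource(TermSource in) { this.in = in; }

    public boolean seekExact(String term) {
        Freqs cached = cache.get(term);
        if (cached != null) {
            current = cached;
            pendingTerm = term; // defer the real seek until something needs it
            return true;
        }
        pendingTerm = null;
        if (!in.seekExact(term)) {
            current = null;
            return false;
        }
        current = new Freqs(in.docFreq(), in.totalTermFreq());
        cache.put(term, current);
        return true;
    }

    public int docFreq() { return current.df; }
    public long totalTermFreq() { return current.ttf; }

    // A method that needs the underlying enum positioned (next(), docs(), ...)
    // would call this first to perform the deferred seek.
    protected void materialize() {
        if (pendingTerm != null) {
            in.seekExact(pendingTerm);
            pendingTerm = null;
        }
    }
}
```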

Member Author

Let me have a look at that.

@s1monw (Contributor) commented Mar 12, 2014

> Added a commit to improve indexRandom for this patch. I think, though, that I should send it in another pull request and make this one dependent on it.

I like the change and I agree we should make it a sep. commit, but no need to open a different PR. We can just keep this a sep commit, makes sense? I can port that commit then to 1.0, since this is an improvement so it will only go into 1.x and master

* To serve requests that can't come from the cache just perform the seek if you skipped the last one.
* We could do fancier things if all enums supported ord.
*/
public class FrequencyCachingTermsEnumWrapper extends FilterTermsEnum {
Member Author

Now that I think I have a working TermsEnum I'd love to have a unit test to prove it. @s1monw, do you think the best way to go about this is to grab something from Lucene?

Contributor

Hard to tell, to be honest... I'd just write a test that randomly iterates over a terms enum, calls the same things on a wrapped and an unwrapped instance, and checks that everything is the same?

@nik9000 (Member Author) commented Mar 13, 2014

Caching helps real suggest calls with real data in the index. Improvements get better as the phrase gets longer. 1 word phrases break even. 9 word phrases are ~75% faster.

How I figured that out:
I wanted to give this a shot with more "real world" data. I made a script to load titles from English Wikipedia. I didn't use the river because I only want the titles. Anyway, then I hit it with some searches with another script. I turned off the caching and hit it again with the same script. That spat out some data, which I summarized above.

* Index one at a time only rarely if doing more than 300.
* When launching async actions, take some care to make sure you don't already
have more than 150 other async actions in flight.
* When indexing in bulk split into chunks of 1000 documents.
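The last rule above can be sketched as a small chunking helper. This is illustrative only and not the real indexRandom code; the class name and generic element type are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Split a large list of documents into bulk requests of at most chunkSize
// entries each, e.g. chunkSize = 1000 as in the guideline above.
public class BulkChunker {
    public static <T> List<List<T>> chunks(List<T> docs, int chunkSize) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += chunkSize) {
            result.add(docs.subList(i, Math.min(i + chunkSize, docs.size())));
        }
        return result;
    }
}
```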
@nik9000 (Member Author) commented Mar 13, 2014

I've cleaned up the pull request, squashing it into two commits: one for the scoring improvements and one for the indexRandom changes.

@s1monw (Contributor) commented Mar 14, 2014

hey nik, I took a look at the TermsEnum you added and I think it goes in the right direction, but it became more complex than I thought it would. I'd like to get this improvement in for 1.1, which is close, but on the other hand I think the general TermsEnum is useful. What about splitting this up: go back to the previous version and put the enum as an inner class into WordScorer, making all the methods that would require an actual seek throw UnsupportedOperationException. If it turns out that we really really need to do all the seeks etc. we can still open a sep issue? I think this would be more practical and then we can think about how to generally test it? I mean we can open a dedicated issue right away? Sorry for taking that step back now after you implemented all that stuff.

@nik9000 (Member Author) commented Mar 16, 2014

I have to admit I'm having trouble working up the willpower to back it out and work it back in, in another issue. What about slapping more warnings on it? Just moving it to a subclass as is?

@s1monw (Contributor) commented Mar 16, 2014

@nik9000 I can do it, though, no worries. Yet the reasoning here is that we don't need all the DocsEnum etc. methods, so no real seek is needed. I could have made up my mind earlier, hindsight is 20/20.... I'm worried about the use case for this general class since it's very limited, and second, I don't have a good answer to the testing question, so I suggest making it simpler and throwing UnsupportedOperationException where appropriate. No need for a sep issue..

@nik9000 (Member Author) commented Mar 16, 2014

The test I added isn't too bad.... Anyway, I'll switch it back to the old implementation tonight.


@nik9000 (Member Author) commented Mar 17, 2014

Done. Let me know what else I need to do. I reran the (not scientific) performance tests I was running and it is marginally faster, I assume because it is maintaining less stuff in the cache.

@s1monw (Contributor) commented Mar 17, 2014

don't get me wrong, I think your test was ok! I didn't want to make you upset or anything; I really really appreciate your work here, that's for sure!

* that is requested and will always return that one. Also only works with seekExact(BytesRef), not next, not seekCeil. Because of this
* it really only makes sense in this context.
*/
private static class FrequencyCachingTermsEnumWrapper extends FilterTermsEnum {
Contributor

this looks great. I think we should throw UnsupportedOperationException on the methods that are not correct if they are called?
I also think we should rather use a struct to hold both the ttf and the docFreq - it seems hard to understand what is returned at this point. Other than that I think we are ready to go, i.e. you can squash and we push it in!
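One way to read the struct suggestion above is a tiny value class, so the cache stores one object per term instead of a single ambiguous long. The name is illustrative, not the actual class in the patch.

```java
// Holds both frequencies for a cached term so callers can't confuse which
// value a bare long would represent.
final class TermFrequencies {
    final long docFreq;
    final long totalTermFreq;

    TermFrequencies(long docFreq, long totalTermFreq) {
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }
}
```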

Member Author

Yes! Sorry, forgot about that while I was doing this.

@kimchy (Member) commented Mar 17, 2014

Few questions on the caching:

@nik9000 (Member Author) commented Mar 17, 2014

> Few questions on the caching:

> is it correct to have the last freq work as it does today, conflating the doc freq with totalTermFreq? It might be very tricky down the road... See the point below, but would 2 long arrays, one for each freq, work well?

It works in this context because the phrase suggester is configured to use one of the two and uses that one the whole time. @s1monw also asked that I switch back to holding both so I'll have a look at that.

> Can we move the caching data structure to be BytesRefHash based (hopefully using paged data structures)? Something similar to what we ended up doing in significant terms: https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregatorFactory.java#L164.

I was using that a few days ago but it required modifying a dozen classes to pass BigArrays around and handle Releasable. I think that'd be OK but many of the classes are valid plugin extension points so maybe that isn't OK so late in the 1.1 release cycle.
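For illustration, the layout kimchy asks about can be sketched with a term-to-id map plus two parallel growable arrays, which is the shape BytesRefHash + BigArrays would give. Here a HashMap stands in for BytesRefHash and ArrayLists stand in for paged long arrays; all names are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Terms get a dense id on first insertion; the two frequencies live in
// parallel arrays indexed by that id.
public class ParallelFreqCache {
    private final Map<String, Integer> ids = new HashMap<>();
    private final ArrayList<Long> docFreqs = new ArrayList<>();
    private final ArrayList<Long> totalTermFreqs = new ArrayList<>();

    public int put(String term, long df, long ttf) {
        Integer id = ids.get(term);
        if (id == null) {
            id = ids.size();
            ids.put(term, id);
            docFreqs.add(df);
            totalTermFreqs.add(ttf);
        }
        return id;
    }

    // Returns -1 when the term is not cached, mirroring BytesRefHash#find.
    public int find(String term) {
        return ids.getOrDefault(term, -1);
    }

    public long docFreq(int id) { return docFreqs.get(id); }
    public long totalTermFreq(int id) { return totalTermFreqs.get(id); }
}
```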

@nik9000 (Member Author) commented Mar 17, 2014

Another note while I think about it, just in case folks need to further speed this process up:

Using the default configuration with suggest_mode = always, performance is now dominated by drawing candidates. This is because the default only scores the top five candidates per term. Even on queries with many terms, scoring is now pretty fast. So if you wanted to make this even faster:

  1. You could get a quick win on the worst case by skipping edit distance one. Even though generating the automata for edit distance two is expensive, on my dataset edit distance one never seems to cut it. The improvement is 10% or so in my tests, but I can't be sure they are truly representative.
  2. Beyond that I don't see getting any faster without abandoning DirectSpellChecker for something that pregenerates all the candidates. It'd be a pain to deal with, I think.

@nik9000 (Member Author) commented Mar 17, 2014

Added a commit to cache both ttf and df AND to throw UnsupportedOperationException. Less hacky now.

@s1monw (Contributor) commented Mar 17, 2014

LGTM. Can you maybe squash the commits together such that we have two commits, one for "Make indexRandom handle many documents better" and one for the speedups? Thanks Nik

Two changes:
1.  In the StupidBackoffScorer only look for the trigram if there is a bigram.
2.  Cache the frequencies in WordScorer so we don't look them up again and
again and again.  This is implemented by wrapping the TermsEnum in a special
purpose wrapper that really only works in context of the WordScorer.

This provides a pretty substantial speedup when there are many candidates.

Closes elastic#5395
@kimchy (Member) commented Mar 17, 2014

++, we can work on a common frequency-caching terms enum and reuse it between the spell checker and significant terms /cc @markharwood, @jpountz

@nik9000 (Member Author) commented Mar 17, 2014

Squashed.

@s1monw (Contributor) commented Mar 18, 2014

pushed to 1.x and master

@s1monw s1monw closed this Mar 18, 2014
@clintongormley clintongormley added the :Search/Suggesters "Did you mean" and suggestions as you type label Jun 7, 2015
Labels
>enhancement :Search/Suggesters "Did you mean" and suggestions as you type v1.1.0 v2.0.0-beta1