
Speed up phrase suggestion scoring #5396

Closed
nik9000 wants to merge 2 commits

Conversation

@nik9000 (Member) commented Mar 12, 2014

Two changes:

  1. In the StupidBackoffScorer only look for the trigram if there is a bigram.
  2. Cache the frequencies in WordScorer so we don't look them up
    again and again and again.

This provides a pretty substantial speedup when there are many candidates.

Closes #5395
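To make change 1 concrete, here is a minimal, self-contained sketch of Stupid Backoff scoring with the short-circuit. The class, method names, and Map-based lookup are invented for illustration; the real StupidBackoffScorer reads n-gram counts from a TermsEnum over a shingle field.

```java
import java.util.HashMap;
import java.util.Map;

// Toy Stupid Backoff scorer. A trigram "w1 w2 w3" cannot occur without the
// bigram "w1 w2" occurring, so when the bigram count is zero the trigram
// lookup can be skipped entirely -- that's the speedup from change 1.
public class StupidBackoffSketch {
    static final double DISCOUNT = 0.4;
    private final Map<String, Long> counts = new HashMap<>();
    private final long totalTerms;

    public StupidBackoffSketch(Map<String, Long> counts, long totalTerms) {
        this.counts.putAll(counts);
        this.totalTerms = totalTerms;
    }

    private long freq(String ngram) {
        return counts.getOrDefault(ngram, 0L);
    }

    public double scoreTrigram(String w1, String w2, String w3) {
        long bigram = freq(w1 + " " + w2);
        if (bigram > 0) {
            // Only look for the trigram if there is a bigram.
            long trigram = freq(w1 + " " + w2 + " " + w3);
            if (trigram > 0) {
                return (double) trigram / bigram;
            }
        }
        return DISCOUNT * scoreBigram(w2, w3);
    }

    public double scoreBigram(String w2, String w3) {
        long bigram = freq(w2 + " " + w3);
        if (bigram > 0) {
            return (double) bigram / freq(w2);
        }
        return DISCOUNT * freq(w3) / totalTerms;
    }
}
```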

@nik9000 (Member Author) commented Mar 12, 2014

I'd really appreciate someone having a look at this. I think it could save me some time. I'll run it on a more real dataset in the morning and have more to report.

I'd also like to be able to use the candidate frequency rather than look up the term frequency again given that it is already there. I think that would work in most cases but it doesn't seem right when you use one of the transformers. I don't have a good grasp of those at the moment so I'm going to leave it as is.

@nik9000 (Member Author) commented Mar 12, 2014

@s1monw, is the phrase suggester still your baby?

@@ -92,6 +92,7 @@ public void findCandidates(CandidateSet[] candidates, Candidate[] path, int ord,
private void updateTop(CandidateSet[] candidates, Candidate[] path, PriorityQueue<Correction> corrections, double cutoffScore, double score)
throws IOException {
score = Math.exp(score);
// This assertion totally throws off performance numbers.... It makes caching look better than it is. It is still good though.
Contributor

don't run your benchmarks with -ea ;)

Member Author

I was more trying to make a unit test that made the problem I see in production more obvious. In production I see 300ms per phrase suggester and I'm hoping to cut that by an order of magnitude. Here I had to make it bad enough to go from 5-10 seconds to 300ms so the test can actually catch stuff.... And this assertion caught me off guard there.

@s1monw (Contributor) commented Mar 12, 2014

@nik9000 that's a great patch - let's get this in soon, it can speed things up dramatically I guess - I left some comments impl-wise

@s1monw s1monw self-assigned this Mar 12, 2014
@nik9000 (Member Author) commented Mar 12, 2014

Added a commit to improve indexRandom for this patch. I think, though, that I should send it in another pull request and make this one dependent on it.


import java.io.IOException;

public class FrequencyCachingTermsEnum extends FilterTermsEnum {
Contributor

I mean this is trivial, but I don't want to add this to ES as-is. It's pretty much broken unless you know exactly how it works internally. What I mean is that you can't call #next() or docs() since it doesn't seek. I think we can work towards this but it should have all the functions supported. I think we should cache both ttf and df, keep a boolean for whether a real seek is needed, and then do the seek if one of the other methods is called. Makes sense? Other than that I think this is the way to go.
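A minimal sketch of the structure this comment proposes, assuming a simplified `TermSource` interface invented here in place of Lucene's TermsEnum: cache df and ttf per term, and track whether the last seek was served from the cache so a real seek can be performed lazily when a positioning method is called.

```java
import java.util.HashMap;
import java.util.Map;

// Caches docFreq and totalTermFreq per term. When a lookup is served from the
// cache the real seek is skipped but remembered, so methods that need the
// underlying enum positioned can perform it later.
public class CachingTermSource {
    public interface TermSource {
        boolean seekExact(String term);
        int docFreq();
        long totalTermFreq();
    }

    private static final class Freqs {
        final int df;
        final long ttf;
        Freqs(int df, long ttf) { this.df = df; this.ttf = ttf; }
    }

    private final TermSource in;
    private final Map<String, Freqs> cache = new HashMap<>();
    private Freqs current;
    private String pendingTerm; // non-null when the real seek was skipped

    public CachingTermSource(TermSource in) { this.in = in; }

    public boolean seekExact(String term) {
        Freqs cached = cache.get(term);
        if (cached != null) {
            current = cached;
            pendingTerm = term; // defer the real seek until something needs it
            return true;
        }
        pendingTerm = null;
        if (!in.seekExact(term)) {
            current = null;
            return false;
        }
        current = new Freqs(in.docFreq(), in.totalTermFreq());
        cache.put(term, current);
        return true;
    }

    public int docFreq() { return current.df; }
    public long totalTermFreq() { return current.ttf; }

    // A method that needs the underlying enum positioned (next(), docs(), ...)
    // would call this first to perform the deferred seek.
    protected void materialize() {
        if (pendingTerm != null) {
            in.seekExact(pendingTerm);
            pendingTerm = null;
        }
    }
}
```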

Member Author

Let me have a look at that.

@s1monw (Contributor) commented Mar 12, 2014

> Added a commit to improve indexRandom for this patch. I think, though, that I should send it in another pull request and make this one dependent on it.

I like the change and I agree we should make it a sep. commit, but no need to open a different PR. We can just keep this a sep commit, makes sense? I can port that commit then to 1.0, since this is an improvement so it will only go into 1.x and master

* To serve requests that can't come from the cache just perform the seek if you skipped the last one.
* We could do fancier things if all enums supported ord.
*/
public class FrequencyCachingTermsEnumWrapper extends FilterTermsEnum {
Member Author

Now that I think I have a working TermsEnum I'd love to have a unit test to prove it. @s1monw, do you think the best way to go about this is to grab something from Lucene?

Contributor

Hard to tell, to be honest... I'd just write a test that randomly iterates over a terms enum, calls the same things on a wrapped and an unwrapped instance, and checks that everything is the same?

@nik9000 (Member Author) commented Mar 13, 2014

Caching helps real suggest calls with real data in the index. Improvements get better as the phrase gets longer. 1 word phrases break even. 9 word phrases are ~75% faster.

How I figured that out:
I wanted to give this a shot with more "real world" data. I made a script to load titles from English Wikipedia. I didn't use the river because I only want the titles. Anyway, then I hit it with some searches with another script. I turned off the caching and hit it again with the same script. That spat out some data, which I summarized above.

* Index one at a time only rarely if doing more than 300.
* When launching async actions, take some care to make sure you don't already
have more than 150 other async actions in flight.
* When indexing in bulk split into chunks of 1000 documents.
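The last rule above can be sketched as a small chunking helper. This is illustrative only and not the real indexRandom code; the class name and generic element type are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Split a large list of documents into bulk requests of at most chunkSize
// entries each, e.g. chunkSize = 1000 as in the guideline above.
public class BulkChunker {
    public static <T> List<List<T>> chunks(List<T> docs, int chunkSize) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += chunkSize) {
            result.add(docs.subList(i, Math.min(i + chunkSize, docs.size())));
        }
        return result;
    }
}
```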
@nik9000 (Member Author) commented Mar 13, 2014

I've cleaned up the pull request, squashing it into two commits: one for the scoring improvements and one for the indexRandom changes.

@s1monw (Contributor) commented Mar 14, 2014

hey nik, I took a look at the TermsEnum you added and I think it goes in the right direction, but it became more complex than I thought it would. I'd like to get this improvement in for 1.1, which is close, but on the other hand I think the general TermsEnum is useful. What about splitting this up: go back to the previous version and put the enum as an inner class into WordScorer, making all the methods that would require an actual seek throw UnsupportedOperationException. If it turns out that we really really need to do all the seeks etc. we can still open a sep issue? I think this would be more practical and then we can think about how to generally test it? I mean we can open a dedicated issue right away? Sorry for taking that step back now after you implemented all that stuff.

@nik9000 (Member Author) commented Mar 16, 2014

I have to admit I'm having trouble working up the willpower to back it out and work it back in, in another issue. What about slapping more warnings on it? Just moving it to a subclass as is?

@s1monw (Contributor) commented Mar 16, 2014

@nik9000 I can do it, though, no worries. Yet the reasoning here is that we don't need all the DocsEnum etc. methods, so no real seek is needed. I could have made up my mind earlier, hindsight is 20/20.... I'm worried about the use case for this general class since it's very limited, and second, I don't have a good answer to the testing question, so I suggest making it simpler and throwing UnsupportedOperationException where appropriate. No need for a sep issue..

@nik9000 (Member Author) commented Mar 16, 2014

The test I added isn't too bad.... Anyway, I'll switch it back to the old implementation tonight.


@nik9000 (Member Author) commented Mar 17, 2014

Done. Let me know what else I need to do. I reran the (not scientific) performance tests I was running and it is marginally faster, I assume because it is maintaining less stuff in the cache.

@s1monw (Contributor) commented Mar 17, 2014

don't get me wrong, I think your test was ok! I didn't want to make you upset or anything; I really really appreciate your work here, that's for sure!

* that is requested and will always return that one. Also only works with seekExact(BytesRef), not next, not seekCeil. Because of this
* it really only makes sense in this context.
*/
private static class FrequencyCachingTermsEnumWrapper extends FilterTermsEnum {
Contributor

this looks great. I think we should throw UnsupportedOperationException on the methods that are not correct if they are called?
I also think we should rather use a struct to hold both the ttf and the docFreq - it seems hard to understand what is returned at this point. Other than that I think we are ready to go, i.e. you can squash and we push it in!
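One way to read the struct suggestion above is a tiny value class, so the cache stores one object per term instead of a single ambiguous long. The name is illustrative, not the actual class in the patch.

```java
// Holds both frequencies for a cached term so callers can't confuse which
// value a bare long would represent.
final class TermFrequencies {
    final long docFreq;
    final long totalTermFreq;

    TermFrequencies(long docFreq, long totalTermFreq) {
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }
}
```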

Member Author

Yes! Sorry, forgot about that while I was doing this.

@kimchy (Member) commented Mar 17, 2014

Few questions on the caching:

@nik9000 (Member Author) commented Mar 17, 2014

> Few questions on the caching:

> is it correct to have the last freq work as it does today, conflating the doc freq with totalTermFreq? It might be very tricky down the road... See the point below, but would 2 long arrays, one for each freq, work well?

It works in this context because the phrase suggester is configured to use one of the two and uses that one the whole time. @s1monw also asked that I switch back to holding both so I'll have a look at that.

> Can we move the caching data structure to be BytesRefHash based (hopefully using paged data structures)? Something similar to what we ended up doing in significant terms: https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregatorFactory.java#L164.

I was using that a few days ago but it required modifying a dozen classes to pass BigArrays around and handle Releasable. I think that'd be OK but many of the classes are valid plugin extension points so maybe that isn't OK so late in the 1.1 release cycle.
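For illustration, the layout kimchy asks about can be sketched with a term-to-id map plus two parallel growable arrays, which is the shape BytesRefHash + BigArrays would give. Here a HashMap stands in for BytesRefHash and ArrayLists stand in for paged long arrays; all names are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Terms get a dense id on first insertion; the two frequencies live in
// parallel arrays indexed by that id.
public class ParallelFreqCache {
    private final Map<String, Integer> ids = new HashMap<>();
    private final ArrayList<Long> docFreqs = new ArrayList<>();
    private final ArrayList<Long> totalTermFreqs = new ArrayList<>();

    public int put(String term, long df, long ttf) {
        Integer id = ids.get(term);
        if (id == null) {
            id = ids.size();
            ids.put(term, id);
            docFreqs.add(df);
            totalTermFreqs.add(ttf);
        }
        return id;
    }

    // Returns -1 when the term is not cached, mirroring BytesRefHash#find.
    public int find(String term) {
        return ids.getOrDefault(term, -1);
    }

    public long docFreq(int id) { return docFreqs.get(id); }
    public long totalTermFreq(int id) { return totalTermFreqs.get(id); }
}
```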

@nik9000 (Member Author) commented Mar 17, 2014

Another note while I think about it, just in case folks need to further speed this process up:

Using the default configuration with suggest_mode = always, performance is now dominated by drawing candidates. This is because the default only scores the top five candidates per term. Even on queries with many terms, scoring is now pretty fast. So if you wanted to make this even faster:

  1. You could get a quick win on the worst case by skipping edit distance one. Even though generating the automata for edit distance two is expensive, on my dataset edit distance one never seems to cut it. The improvement is 10% or so in my tests, but I can't be sure they are truly representative.
  2. Beyond that I don't see getting any faster without abandoning DirectSpellChecker for something that pregenerates all the candidates. It'd be a pain to deal with, I think.

@nik9000 (Member Author) commented Mar 17, 2014

Added a commit to cache both ttf and df AND to throw UnsupportedOperationException. Less hacky now.

@s1monw (Contributor) commented Mar 17, 2014

LGTM. Can you maybe squash the commits together such that we have two commits, one for "Make indexRandom handle many documents better" and one for the speedups? Thanks Nik

Two changes:
1.  In the StupidBackoffScorer only look for the trigram if there is a bigram.
2.  Cache the frequencies in WordScorer so we don't look them up again and
again and again.  This is implemented by wrapping the TermsEnum in a special
purpose wrapper that really only works in context of the WordScorer.

This provides a pretty substantial speedup when there are many candidates.

Closes elastic#5395
@kimchy (Member) commented Mar 17, 2014

++, we can work on a common frequency-caching terms enum and reuse it between the spell checker and significant terms /cc @markharwood, @jpountz

@nik9000 (Member Author) commented Mar 17, 2014

Squashed.

@s1monw (Contributor) commented Mar 18, 2014

pushed to 1.x and master

@s1monw s1monw closed this Mar 18, 2014
@clintongormley clintongormley added the :Search/Suggesters "Did you mean" and suggestions as you type label Jun 7, 2015
Labels
>enhancement :Search/Suggesters "Did you mean" and suggestions as you type v1.1.0 v2.0.0-beta1