Speed up phrase suggestion scoring #5396
Conversation
I'd really appreciate someone having a look at this. I think it could save me some time. I'll run it on a more real dataset in the morning and have more to report. I'd also like to be able to use the candidate frequency rather than look up the term frequency again given that it is already there. I think that would work in most cases but it doesn't seem right when you use one of the transformers. I don't have a good grasp of those at the moment so I'm going to leave it as is.
@s1monw, is the phrase suggester still your baby?
@@ -92,6 +92,7 @@ public void findCandidates(CandidateSet[] candidates, Candidate[] path, int ord,
    private void updateTop(CandidateSet[] candidates, Candidate[] path, PriorityQueue<Correction> corrections, double cutoffScore, double score)
            throws IOException {
        score = Math.exp(score);
// This assertion totally throws off performance numbers.... It makes caching look better than it is. It is still good though.
don't run your benchmarks with -ea ;)
I was more trying to make a unit test that made the problem I see in production more obvious. In production I see 300ms per phrase suggester and I'm hoping to cut that by an order of magnitude. Here I had to make it bad enough to go from 5-10 seconds to 300ms so the test can actually catch stuff.... And this assertion caught me off guard there.
@nik9000 that's a great patch - let's get this in soon, it can speed things up dramatically I guess - I left some comments impl wise
Added a commit to improve indexRandom for this patch. I think, though, that I should send it in another pull request and make this one dependent on it.
import java.io.IOException;
public class FrequencyCachingTermsEnum extends FilterTermsEnum {
I mean this is trivial but I don't want to add this to ES. It's pretty much broken unless you know exactly how it works internally. What I mean is that you can't call #next() or docs() since it doesn't seek. I think we can work towards this but it should have all the functions supported. I think we should cache both ttf and df, put a boolean flag for whether a real seek is needed, and then do the seek if one of the other methods is called. Makes sense? Other than that I think this is the way to go
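A minimal sketch of that proposal, using a hypothetical pared-down SimpleTermsEnum interface rather than Lucene's real TermsEnum/FilterTermsEnum (all names here are illustrative, not the actual patch): cache both frequencies per term, and remember when a real seek was skipped so it could be replayed before any method that needs the enum's position.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the handful of TermsEnum calls the suggester uses.
interface SimpleTermsEnum {
    boolean seekExact(String term);
    long totalTermFreq();
    int docFreq();
}

// Cache both ttf and df per term; record a pending seek so it can be replayed
// lazily if a positional method (like docs()) is ever called.
class CachingTermsEnum implements SimpleTermsEnum {
    private static final class Freqs {
        final long ttf;
        final int df;
        Freqs(long ttf, int df) { this.ttf = ttf; this.df = df; }
    }

    private final SimpleTermsEnum delegate;
    private final Map<String, Freqs> cache = new HashMap<>();
    private Freqs current;
    private String pendingSeek; // non-null => delegate is not really positioned

    CachingTermsEnum(SimpleTermsEnum delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean seekExact(String term) {
        Freqs cached = cache.get(term);
        if (cached != null) {
            current = cached;
            pendingSeek = term; // skip the real seek until something needs it
            return true;
        }
        pendingSeek = null;
        if (!delegate.seekExact(term)) {
            current = null;
            return false;
        }
        current = new Freqs(delegate.totalTermFreq(), delegate.docFreq());
        cache.put(term, current);
        return true;
    }

    @Override
    public long totalTermFreq() { return current.ttf; }

    @Override
    public int docFreq() { return current.df; }

    // A method that needs the real position would call this first.
    void ensurePositioned() {
        if (pendingSeek != null) {
            delegate.seekExact(pendingSeek);
            pendingSeek = null;
        }
    }
}
```

The point of `pendingSeek` is that a pure frequency lookup served from the cache never touches the underlying enum at all, but the wrapper still knows how to become a "real" positioned enum on demand.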
Let me have a look at that.
I like the change and I agree we should make it a separate commit, but there's no need to open a different PR. We can just keep this as a separate commit, makes sense? I can port that commit then to
 * To serve requests that can't come from the cache just perform the seek if you skipped the last one.
 * We could do fancier things if all enums supported ord.
 */
public class FrequencyCachingTermsEnumWrapper extends FilterTermsEnum { |
Now that I think I have a working TermsEnum I'd love to have a unit test to prove it. @s1monw, do you think the best way to go about this is to grab something from Lucene?
hard to tell to be honest.. I'd just write a test that randomly iterates over a terms enum and calls the same things on a wrapped and unwrapped instance and checks if everything is the same?
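That randomized check could be sketched without pulling in Lucene test infrastructure; here the Function/Map pair is a hypothetical stand-in for the unwrapped and wrapped enums, and the class name is made up:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Issue the same random lookups against an uncached frequency source and a
// memoizing wrapper around it, and fail on the first disagreement.
class WrapperEquivalenceCheck {
    static void check(long seed) {
        Random random = new Random(seed);
        Map<String, Long> freqs = new HashMap<>();
        for (int i = 0; i < 50; i++) {
            freqs.put("term" + random.nextInt(20), (long) random.nextInt(1000));
        }
        Function<String, Long> plain = t -> freqs.getOrDefault(t, -1L);
        Map<String, Long> cache = new HashMap<>();
        Function<String, Long> cached = t -> cache.computeIfAbsent(t, plain);

        for (int i = 0; i < 1000; i++) {
            // nextInt(25) deliberately produces some terms the source lacks,
            // so cache misses and negative lookups are exercised too
            String term = "term" + random.nextInt(25);
            if (!plain.apply(term).equals(cached.apply(term))) {
                throw new AssertionError("wrapped and unwrapped disagree on " + term);
            }
        }
    }
}
```

The real test would additionally count delegate calls to show the cache actually short-circuits repeated lookups, as the production wrapper is meant to.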
Caching helps real suggest calls with real data in the index. Improvements get better as the phrase gets longer. 1 word phrases break even. 9 word phrases are ~75% faster. How I figured that out:
* Index one at a time only rarely if doing more than 300.
* When launching async actions, take some care to make sure you don't already have more than 150 other async actions in flight.
* When indexing in bulk, split into chunks of 1000 documents.
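A rough sketch of those rules under stated assumptions: `ChunkedIndexer`, `CHUNK_SIZE`, `MAX_IN_FLIGHT`, and `sendBulkAsync` are illustrative names, not the actual indexRandom code, and a real version would release the permit from the async action's callback rather than inline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.function.Consumer;

class ChunkedIndexer {
    static final int CHUNK_SIZE = 1000;   // bulk chunk size from the rules above
    static final int MAX_IN_FLIGHT = 150; // cap on concurrent async actions

    private final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

    // Split the documents into bulk chunks of at most CHUNK_SIZE.
    <T> List<List<T>> chunk(List<T> docs) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += CHUNK_SIZE) {
            chunks.add(docs.subList(i, Math.min(i + CHUNK_SIZE, docs.size())));
        }
        return chunks;
    }

    // Send each chunk, blocking whenever MAX_IN_FLIGHT actions are pending.
    <T> void indexAll(List<T> docs, Consumer<List<T>> sendBulkAsync) throws InterruptedException {
        for (List<T> chunk : chunk(docs)) {
            inFlight.acquire(); // waits if too many actions are already in flight
            try {
                sendBulkAsync.accept(chunk);
            } finally {
                inFlight.release(); // real code would release in the action's callback
            }
        }
    }
}
```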
I've cleaned up the pull request, squashing it into two commits: one for the scoring improvements and one for the indexRandom changes.
hey nik, I took a look at the TermsEnum you added and I think it goes in the right direction, but it became more complex than I thought it would. I'd like to get this improvement in for 1.1, which is close, but on the other hand I think the general TermsEnum is useful. What about splitting this up and going back to the
I have to admit I'm having trouble working up the willpower to back it out and work it back in in another issue. What about slapping more warnings on it? Just moving it to a subclass as is?
@nik9000 I can do it though, no worries. Yet the reasoning here is that we don't need all the DocsEnum etc methods so no real seek is needed. I could have made up my mind earlier,
The test I added isn't too bad.... Anyway, I'll switch it back to the old implementation tonight. Sent from my iPhone
Done. Let me know what else I need to do. I reran the (not scientific) performance tests I was running and it is marginally faster. I assume because it is maintaining less stuff in the cache.
don't get me wrong, I think your test was ok! I didn't want to make you upset or anything, I really really appreciate your work here, that's for sure!
 * that is requested and will always return that one. Also only works with seekExact(BytesRef), not next, not seekCeil. Because of this
 * it really only makes sense in this context.
 */
private static class FrequencyCachingTermsEnumWrapper extends FilterTermsEnum { |
this looks great. I think we should throw UnsupportedOperationException on the methods that are not correct if they are called? I also think we should rather use a struct to hold both the ttf and the docFreq - it seems hard to understand what is returned at this point. Other than that I think we are ready to go, ie. you can squash and we push it in!
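The suggested "struct" could be as small as a two-field value class; the class and field names below are illustrative, not what the commit actually used:

```java
// Holds both frequencies so a cache lookup is unambiguous about what it returns.
final class TermFrequencies {
    final long totalTermFreq; // ttf: total occurrences of the term in the index
    final int docFreq;        // df: number of documents containing the term

    TermFrequencies(long totalTermFreq, int docFreq) {
        this.totalTermFreq = totalTermFreq;
        this.docFreq = docFreq;
    }
}
```

Compared with caching a single `long` whose meaning depends on suggester configuration, this makes the cache contents self-describing at the cost of one small allocation per distinct term.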
Yes! Sorry, forgot about that while I was doing this.
Few questions on the caching:
It works in this context because the phrase suggester is configured to use one of the two and uses that one the whole time. @s1monw also asked that I switch back to holding both so I'll have a look at that.
I was using that a few days ago but it required modifying a dozen classes to pass BigArrays around and handle Releasable. I think that'd be OK but many of the classes are valid plugin extension points so maybe that isn't OK so late in the 1.1 release cycle.
Another note while I think about it, just in case folks need to further speed this process up: Using the default configuration with
Added a commit to cache both ttf and df AND to throw UnsupportedOperationExceptions. Less hacky now.
LGTM can you maybe squash the commits together such that we have two commits, one for
Two changes:
1. In the StupidBackoffScorer only look for the trigram if there is a bigram.
2. Cache the frequencies in WordScorer so we don't look them up again and again and again. This is implemented by wrapping the TermsEnum in a special purpose wrapper that really only works in the context of the WordScorer.
This provides a pretty substantial speedup when there are many candidates.
Closes elastic#5395
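The first change can be sketched with a map-backed stupid backoff scorer (the map representation, the 0.4 DISCOUNT, and the class name are illustrative, not the ES implementation): since a trigram cannot occur without its contained bigram, a zero bigram count lets the scorer skip the trigram lookup entirely.

```java
import java.util.HashMap;
import java.util.Map;

class StupidBackoffSketch {
    static final double DISCOUNT = 0.4; // standard stupid-backoff discount
    final Map<String, Long> counts = new HashMap<>(); // n-grams keyed by space-joined words

    long freq(String ngram) {
        return counts.getOrDefault(ngram, 0L);
    }

    // Score candidate w3 given the context (w1, w2) with stupid backoff.
    double score(String w1, String w2, String w3) {
        long bigram = freq(w2 + " " + w3);
        if (bigram > 0) {
            // Only worth asking for the trigram when the bigram exists.
            long trigram = freq(w1 + " " + w2 + " " + w3);
            if (trigram > 0) {
                return (double) trigram / freq(w1 + " " + w2);
            }
            return DISCOUNT * bigram / freq(w2);
        }
        // Back off all the way to the unigram.
        return DISCOUNT * DISCOUNT * freq(w3) / totalUnigrams();
    }

    // Total unigram count, i.e. keys with no space in them.
    long totalUnigrams() {
        long n = 0;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (!e.getKey().contains(" ")) {
                n += e.getValue();
            }
        }
        return Math.max(n, 1);
    }
}
```

The saved trigram lookup matters because in the real scorer each frequency lookup is a TermsEnum seek, which is exactly what change 2 then caches.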
++, we can work on a common frequency caching terms enum and reuse between both the spell checker and significant terms /cc @markharwood, @jpountz
Squashed.
pushed to