Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added normalization option to Query Rescorer #9535

Closed
wants to merge 4 commits into from

Conversation

RossLieberman
Copy link

My rescored results were so varied from my original searches that I needed a way to normalize the data so that the rescoring wouldn't overpower the tfid scores.

  • The Good
    • I removed 2 array loops
    • It appears to work
    • All tests and new one pass
  • The Bad
    • Issue Switch to Lucene QueryRescorer #7707 was to implement Lucene's Rescorer. I'm still using the majority of the TopDocs functionality and the other supporting classes, however Lucene's Rescorer performs the combine steps logic before they are resorted. In order to normalize the results we need to calculate the score first and then get the Min/Max values to create the score range.
    • From what I can tell the QueryRescorer calls out to the leaf nodes to perform the query by the Score.Combine function happens on the primary node anyway so it doesn't look like any performance is lost by moving where the Score.Combine is calculated.
  • The Ugly
    • I am not a Java Developer, I last touched Java 15 years ago, and that's only if you count Hello World applications.
    • I am a C# developer and used the unit tests & debugger to modify the code.
    • I used a HashMap to build the rescored values. I don't know the performance comparison between that and an array, but since array size wasn't determined and I need to later access Min/Max values along with finding the DocID I went this route. If performance is that different than I'd move to a custom class where Min/Max is added while scores are initially calculated, and an array where size is equal to topN, then advance through the doc similar to how the rescoreElastic() is implemented.
    • In the rescoreElastic() method I had to keep Lucene's original Sort method to organize results by DocID then use scoring.advance() to find the next document. Again, I don't understand the performance aspect enough to know if that is better or if there was a way to load by id.

I also removed a comment from the previous commit: // TODO: shouldn't this be up to the ScoreMode? I.e., we should just invoke ScoreMode.combine, passing 0.0f for the secondary score
This should probably be up to the developer, since performing certain actions like Average or Multiple will treat a non-match as 0 and could behave differently than expected.

If you are interested, let me know any feedback or changes I'd be more than happy to make them.

@clintongormley
Copy link

Hi @RossLieberman

Thanks for the PR. We'll take a look and get back to you.

@s1monw could you review this please?

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories >enhancement review labels Feb 5, 2015
… has a size of 0).

Added test for this scenario
@clintongormley
Copy link

Hi @RossLieberman

Apologies that it has taken so long to look at this. Got lost in the interwebs. Would you be interested in updating the PR against master? If so, please could I also ask you to sign the CLA?

http://www.elasticsearch.org/contributor-agreement/

thanks

@RossLieberman
Copy link
Author

Not a problem it's a pretty minor thing when compared against 2.0. I can work on updating the PR with what's current on Master.

As far as the CLA, I'm not sure why that didn't get updated. I resigned the Adobe Document on Feb 15th 2016: "Individual Contributor License Agreement between elasticsearch BV and Ross Lieberman is Signed and Filed!"

@clintongormley
Copy link

As far as the CLA, I'm not sure why that didn't get updated. I

Found it. You used a different email address, but I've added your github address now. thanks

@dakrone
Copy link
Member

dakrone commented Apr 6, 2016

@s1monw I think this one still needs review now that the CLA has been signed

@RossLieberman
Copy link
Author

I'll see if I'm able to knock it out today. I haven't worked with
ElasticSearch in the past 9 months, and was asked to update the pull
request to 2.x I'd been slammed the last few weeks, and haven't had an
opportunity to handle the conflicts in the merge, but I would like to see
this in the main release, it will make it easier for my clients to do
future upgrades.

On Wed, Apr 6, 2016 at 5:14 PM, Lee Hinman notifications@github.com wrote:

@s1monw https://github.com/s1monw I think this one still needs review
now that the CLA has been signed


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#9535 (comment)

# Conflicts:
#	src/main/java/org/elasticsearch/search/rescore/QueryRescorer.java
#	src/main/java/org/elasticsearch/search/rescore/RescoreBuilder.java
#	src/test/java/org/elasticsearch/search/rescore/QueryRescorerTests.java
@RossLieberman
Copy link
Author

There has been some massive restructuring to the Rescorer (for the better). I'm looking to apply my changes, however before I did, I wanted to get the new master branch to compile, which seems impossible right now given the incompatibilities of Gradle & IntelliJ. I just spent the last 4 hours trying to get IntelliJ to build on my Mac and found that it was because JAVA_HOME wasn't being picked up from my profile. Once that was resolved I'm getting thousands of Import References not found throughout the project even though they are clearly part of the solution I can go to the folders and find them manually. I'm going to try Eclipse tomorrow to see if that resolves this.

@jasontedor
Copy link
Member

I just spent the last 4 hours trying to get IntelliJ to build on my Mac and found that it was because JAVA_HOME wasn't being picked up from my profile.

Hmm...I'm not sure what this is about? I've never had issues with this but I really want to help you! What happened here?

Once that was resolved I'm getting thousands of Import References not found throughout the project even though they are clearly part of the solution I can go to the folders and find them manually.

Are you using IntelliJ 2016.1? Sadly, it's broken. If you're a die-hard IntelliJ user, your best bet is to continue to use IntelliJ 15 if you want to work with Elasticsearch.

@RossLieberman
Copy link
Author

That solves my problem! Thanks Jaden, I just downloaded everything today
and was on 2016.1! I'm a die-hard Visual Studio (.NET programmer). I
haven't used IntelliJ in over a year and figured I'd try making these
updates in OSX and just grabbed everything fresh.

On Fri, Apr 8, 2016 at 7:46 PM, Jason Tedor notifications@github.com
wrote:

I just spent the last 4 hours trying to get IntelliJ to build on my Mac
and found that it was because JAVA_HOME wasn't being picked up from my
profile.

Hmm...I'm not sure what this is about? I've never had issues with this but
I really want to help you! What happened here?

Once that was resolved I'm getting thousands of Import References not
found throughout the project even though they are clearly part of the
solution I can go to the folders and find them manually.

Are you using IntelliJ 2016.1? Sadly, it's broken
https://youtrack.jetbrains.com/issue/IDEA-152254. If you're a die-hard
IntelliJ user, your best bet is to continue to use IntelliJ 15 if you want
to work with Elasticsearch.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#9535 (comment)

Replaced default Lucene rescore logic with a version that normalizes the returned calculated scores by a new boolean option normalize. By removing the default logic we cut down on the number of loops that the logic does. Added new ScoreMode: Replace which takes the rescored calculation and replaces original score with the new value.
@RossLieberman
Copy link
Author

Sorry about mistyping your name @jasontedor it was a long frustrating 12 hour work day on Friday and I was heading out the door when your reply came in and that did help me. It turns out my problem was I had to open the Gradle Window and then hit the refresh button, even though I had checked the auto-import when I imported the project after running the "gradle import" command. After that I ran into the JarHell issue with ant-javafx.jar, however once that was removed everything worked! (in case anyone else has those issues, this was how I resolved them).

@clintongormley @s1monw I updated the Normalization code to the current Master branch. It was mostly the same as before although I cleaned up some of the code since I had the opportunity (i.e. instead of doing Collection.Min() & Collection.Max(), I grab the values during the initial loop to reduce extra lookups). You may also want @cbuescher to take a look, since he was the last to touch the section when he abstracted some of the QueryRescorer logic back in January.

A few notes:

  • This method replaces the core Lucene rescore() logic, however this method should be improved because we've removed an extra loop and 2 additional array copies that were being performed. The new rescore method was copied from the original but there was no other way around it if we wanted to calculate the Min/Max scores
  • This code also includes another change that I made some time ago which is a new ScoreMode: Replace that takes the value returned from the Rescorer and uses that value instead of figuring min/max, multiple, addition, etc.
  • All tests pass related to the changes I made, however when I ran the full build at the command line I got an error with org.elasticsearch.backwards.MultiNodeBackwardsIT.test / :qa:backwards-5.0:integTest I was unable to track down the test to determine if that was an issue with my changes or if that was failing before.
  • My original comments from 02/2015 cover the changes that were made
  • My new commit is still saying that the CLA failed again which is bizarre, my UserName is still the same RossLieberman but my email address did change when I switched companies. I'll try to submit under the original email address as well since I still have access to that account.

Let me know if there's anything else I can do or any changes you'd like to see.

@karmi
Copy link
Contributor

karmi commented May 18, 2016

Hi @RossLieberman, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in yout Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@RossLieberman
Copy link
Author

@karmi I re-added those email addresses to my GitHub Account. Let me know if that resolves the issue.

@karmi
Copy link
Contributor

karmi commented May 19, 2016

Hi @RossLieberman, it looks like some commits were commited with a different address then the one you have signed with, have a look: https://github.com/RossLieberman/elasticsearch/commit/da5d6437d34138023ae7050cf5edd69846a53ecf.patch

It is generally enough to add those e-mails to your Github profile (they can be hidden, it's just a way for Github to match your profile with commits), and the CLA checker is happy.

Alternatively, you can commit with git commit --author=..., amend the commit and force push. Note, that as a part of your pull request, there's a merge commit which shouldn't be there, you should git rebase master in general.

@RossLieberman
Copy link
Author

Thanks @karmi, I didn't realize the email address being referenced was not just related to the Pull Request on my GitHub account and the CLA. I added several aliases yesterday +elasticsearch +github from my former company (hence the reason behind the email address change), but I also signed a new CLA with the email address used in the commit along with adding that address to GitHub. Hopefully this resolves things. If that still doesn't resolve things. I'll try going down the road of git commit --author=

@karmi
Copy link
Contributor

karmi commented May 21, 2016

@RossLieberman, the check is now passing, so your approach has worked -- thanks!!

@dakrone
Copy link
Member

dakrone commented Sep 12, 2016

@s1monw is this still applicable? If so, I think it still needs a review

@rjernst
Copy link
Member

rjernst commented Jun 9, 2017

@RossLieberman Are you still interested in getting this PR in? It has merge conflicts that need to be addressed.
@s1monw Can you review? This PR has been open for over 2 years, we should make a decision on it. :)

@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@brusic
Copy link
Contributor

brusic commented Jun 9, 2017

I subscribed to this issue a long time ago because I think it is useful. Not currently using the rescorer, but I might be soon. If @RossLieberman does not get back and the community thinks it is useful, I can help with the conflicts (and perhaps addressed other issues originally mentioned).

@dakrone
Copy link
Member

dakrone commented Aug 15, 2017

@brusic given that this hasn't seen updates in quite a while, it sounds like @RossLieberman isn't going to get back to it. I'm going to close this issue, if you want to pick it up I think opened a new PR would be the best thing to do.

@dakrone dakrone closed this Aug 15, 2017
@brusic
Copy link
Contributor

brusic commented Aug 16, 2017

Still have no immediate need for reacting, but I would not mind seeing what the merge conflicts are. It has been a long time since I have contributed anything, but I do have something lined up (now that an upstream fix has been made).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants