This repository has been archived by the owner on Sep 12, 2020. It is now read-only.

Add QueryScorer.js and a test for it. #1

Merged
merged 1 commit into from Nov 5, 2019

Conversation

Owner

@0c0w3 0c0w3 commented Nov 5, 2019

QueryScorer implements a simple scoring mechanism for search
strings against documents. See the comments in QueryScorer.js.
See the instructions in README.md for running the test.

Collaborator

@mak77 mak77 left a comment

Any idea of the cutoff we may use in the add-on?
If I read this correctly, for a two-word query like "clear history": if both words match a doc, the score is 0; if one matches and the other has 1 typo, the score is 0.5; if both have 1 typo, the score is 1. Is this correct?
I'm mostly thinking about the risk of returning non-relevant results, so the threshold should be quite high.
I suspect one possible improvement may involve removing common articles and conjunctions: "clear the history" should just search for "clear" and "history".

Owner Author

0c0w3 commented Nov 5, 2019

Thanks!

> Any idea of the cutoff we may use in the add-on?

In the Invision spec, Verdi specifies a distance of 1, so that's what I've been using so far in the background.js I've been developing as I worked on this patch. I'm not sure how strongly he feels about that or how he arrived at it, and "a distance of 1" isn't a complete enough specification by itself. This patch actually computes mean distance, so as you point out, if a query is two words and they both have one typo, the mean distance is 2 / 2 = 1, not 2, and so my background.js would count that as a match. We can certainly tweak as we go along.

> If I read this correctly, for a two-word query like "clear history": if both words match a doc, the score is 0; if one matches and the other has 1 typo, the score is 0.5; if both have 1 typo, the score is 1. Is this correct?

Yes, exactly.
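
To make the arithmetic concrete, here is a minimal sketch of mean-distance scoring. It assumes Levenshtein edit distance and whitespace tokenization; the function names `editDistance` and `meanDistance` are hypothetical and are not taken from QueryScorer.js, which may differ in its details.

```javascript
// Hypothetical sketch; not the actual QueryScorer.js implementation.

// Levenshtein edit distance between two strings (assumed distance metric).
function editDistance(a, b) {
  // prev[j] is the distance between the previous prefix of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Mean, over the query words, of each word's minimum distance to any document word.
function meanDistance(query, docWords) {
  const queryWords = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (queryWords.length === 0 || docWords.length === 0) return Infinity;
  let total = 0;
  for (const qw of queryWords) {
    total += Math.min(...docWords.map(dw => editDistance(qw, dw)));
  }
  return total / queryWords.length;
}

const doc = ["clear", "history"];
console.log(meanDistance("clear history", doc)); // 0
console.log(meanDistance("clear histry", doc));  // 0.5
console.log(meanDistance("cleer histry", doc));  // 1
```

With a threshold of 1 on the mean, all three of these queries would count as matches.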

As another example, if a query is two words and one word matches exactly but the other has a distance of 2, the mean distance is also 2 / 2 = 1. An earlier version of this patch also had a cutoff before saving the min distance for each word-document pair: if the distance was > 1, then it wouldn't be considered as a minimum distance, or in other words it would be boosted to Infinity. In that case, this example would be Infinity / 2 = Infinity, so the query would not match.
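
A rough sketch of that per-word cutoff variant, again assuming Levenshtein distance. The names `levenshtein`, `scoreWithCutoff`, and the `maxWordDistance` parameter are hypothetical, chosen only for illustration, and do not come from the patch.

```javascript
// Hypothetical sketch of the earlier per-word cutoff; not the patch's actual code.

// Levenshtein edit distance between two strings (assumed distance metric).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Mean of per-word minimum distances, where any word-to-word distance above
// maxWordDistance is treated as Infinity before the minimum is taken.
function scoreWithCutoff(queryWords, docWords, maxWordDistance = Infinity) {
  if (queryWords.length === 0) return Infinity;
  let total = 0;
  for (const qw of queryWords) {
    let min = Infinity;
    for (const dw of docWords) {
      const d = levenshtein(qw, dw);
      if (d <= maxWordDistance && d < min) min = d;
    }
    total += min; // stays Infinity if no document word is close enough
  }
  return total / queryWords.length;
}

const docWords = ["clear", "history"];
// "hstry" is distance 2 from "history" (two missing letters).
console.log(scoreWithCutoff(["clear", "hstry"], docWords));    // 1
console.log(scoreWithCutoff(["clear", "hstry"], docWords, 1)); // Infinity
```

Without the cutoff the query passes a mean threshold of 1; with a per-word cutoff of 1 a single badly mistyped word is enough to reject it.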

> I'm mostly thinking about the risk of returning non-relevant results, so the threshold should be quite high.

Yes, we also don't want to make it too hard to match, though. It would be good to have at least Verdi play with it to see how it feels.

> I suspect one possible improvement may involve removing common articles and conjunctions: "clear the history" should just search for "clear" and "history".

An earlier version of this patch actually had stop words. It removed stop words from the "documents" and also from the query. I left that out of the final patch because we could just ensure that the documents we use don't have stop words in the first place, and also because some of Verdi's suggested queries actually use stop words, like "how to" and "keeps". I'm not 100% sure whether we should use them or not, and I'm open to adding them back.
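
For reference, stop-word removal along those lines could look roughly like the sketch below. The `STOP_WORDS` list and the `removeStopWords` helper are made up for illustration; they are not the list or code from the earlier version of the patch.

```javascript
// Hypothetical sketch; the stop-word list here is illustrative only.
const STOP_WORDS = new Set(["a", "an", "the", "and", "or", "to"]);

// Split text into lowercase words and drop any that are stop words.
function removeStopWords(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(word => word && !STOP_WORDS.has(word));
}

console.log(removeStopWords("clear the history")); // [ 'clear', 'history' ]
console.log(removeStopWords("how to"));            // [ 'how' ]
```

The second call shows the trade-off: with "to" in the list, a query like "how to" loses half of its words before scoring.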

@0c0w3 0c0w3 merged commit 774cf8e into master Nov 5, 2019
@0c0w3 0c0w3 deleted the QueryScorer-pr branch November 5, 2019 17:38